CONFIDENCE INTERVALS FOR FUNCTIONS OF QUANTILES USING LINEAR COMBINATIONS OF ORDER STATISTICS

by

Seth Michael Steinberg
Department of Biostatistics
University of North Carolina at Chapel Hill

Institute of Statistics Mimeo Series No. 1433

March 1983

CONFIDENCE INTERVALS FOR FUNCTIONS OF QUANTILES USING LINEAR COMBINATIONS OF ORDER STATISTICS

by

Seth Michael Steinberg

A Dissertation submitted to the faculty of The University of North Carolina at Chapel Hill in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Biostatistics, School of Public Health.

Chapel Hill

1983

Approved by:

ABSTRACT

SETH MICHAEL STEINBERG. Confidence Intervals for Functions of Quantiles Using Linear Combinations of Order Statistics. (Under the direction of C.E. DAVIS)

Estimators for quantiles based on linear combinations of order statistics have been proposed by Harrell and Davis (1982) and Kaigh and Lachenbruch (1982). Both have been demonstrated to be at least as efficient for small-sample point estimation as an ordinary sample quantile based on one or two order statistics. Distribution-free confidence intervals for quantiles can be constructed using either of the two approaches. By means of a simulation study, these confidence intervals have been compared with several other methods of constructing confidence intervals for quantiles in small samples. For the median, the Kaigh and Lachenbruch method performed the best overall. For other quantiles, no method performed better than the method which uses pairs of order statistics.

The interquantile difference is often useful as a measure of dispersion. Both the Harrell-Davis and Kaigh-Lachenbruch estimators are modified to estimate interquantile differences. Theoretical developments needed to establish large-sample use of the normal distribution for these estimators are presented. Both of these methods are used to form pivotal quantities with asymptotic normal distributions, and thus are readily used for construction of confidence intervals. The point estimators of interquantile difference are compared through simulations on the basis of relative mean squared errors. The estimator based on the Harrell-Davis method generally performed best in this regard. Confidence intervals are constructed and compared with a method based on pairs of order statistics. This order statistic method produced very conservative intervals. The performance of the other estimators varied, and was better for symmetric distributions. Neither method could consistently produce intervals of the desired confidence.
Finally, an example using data from the Lipid Research Clinics Program is presented to illustrate use of the new estimators for point and interval estimation of quantiles and interquantile differences.

ACKNOWLEDGEMENTS

My committee, chaired by Dr. C.E. Davis, was extraordinary in the amount of time and effort put forth to assist with this project. I sincerely thank and am grateful to Dr. Davis for his availability to discuss my work, his suggestions, and his guidance throughout my writing of the dissertation. I want to thank the other members of the committee, Drs. Shrikant Bangdiwala, Frank Harrell, Abdel Omran, and Dana Quade, for their comments and suggestions, and for maintaining a strong interest in the project.

The past few years in Chapel Hill have been very enjoyable. This is due in large part to the many wonderful friends I have made here. I thank them all for their support along the way. My parents deserve special thanks for encouraging me to obtain a worthwhile education and for supporting me throughout the whole process.

I would like to thank Dr. P.K. Sen for initially suggesting my investigation of this area of research, and for providing helpful information when it was needed. Dr. William Kaigh of The University of Texas at El Paso and Dr. Bruce Schmeiser of Purdue made available some results of their own research, which is gratefully acknowledged. Data for the example in Chapter V are used with permission of the National Heart, Lung, and Blood Institute.

Finally, I would like to thank Ernestine Bland for providing superb, speedy typing services for this manuscript, and the entire faculty and staff of the Department of Biostatistics for making my experience pleasant and rewarding. Funding was provided by NICHD training grant #5-T32-HD07102-05, and by Survey Design, Inc.

TABLE OF CONTENTS

Page

ACKNOWLEDGEMENTS  iv

LIST OF TABLES  ix

CHAPTER

I  INTRODUCTION AND REVIEW OF THE LITERATURE  1
   1.1 Introduction  1
   1.2 Review of the Literature  2
       1.2.1 Simple Point and Interval Estimators  2
       1.2.2 Various Median Estimation Methods  8
       1.2.3 Estimators for the p-th Quantile  16
       1.2.4 Quantile Estimators for Specific Distributions  23
             1.2.4.1 Normal Distribution Quantiles  23
             1.2.4.2 Exponential Distribution Quantiles  25
             1.2.4.3 Quantile Estimation for Other Distributions  28
       1.2.5 Estimation of Quantile Intervals  28
       1.2.6 Estimation of Quantile Differences  31
   1.3 Outline of the Research Proposal  32

II  A COMPARISON OF CONFIDENCE INTERVALS FOR QUANTILES  34
    2.1 Introduction  34
    2.2 Selection of Interval Estimators for Comparison  34
    2.3 Note on the Use of the Kaigh and Lachenbruch Estimator  36
    2.4 Evaluation of Confidence Intervals  38
        2.4.1 Exact Confidence Intervals  38
              2.4.1.1 Determination of Confidence  38
              2.4.1.2 Expected Length of Confidence Intervals  39
        2.4.2 Simulated Confidence Intervals  43
              2.4.2.1 Determination of Confidence  44
              2.4.2.2 Expected Lengths of Intervals  45

        2.4.3 Selection of Distribution for Pivotal Quantity  47
    2.5 Details of the Simulation Process  47
    2.6 Results from Simulated or Theoretical Construction of Intervals  49
    2.7 Conclusions  51

III  THEORY FOR ESTIMATION OF AN INTERQUANTILE DIFFERENCE  69
     3.1 Introduction  69
     3.2 Theory for the L-COST Estimator of Interquantile Difference  70
         3.2.1 The L-COST Interquantile Difference Estimator  70
         3.2.2 Theoretical Framework for Convergence to Normality  71
               3.2.2.1 L-estimators and the L-COST Estimator  71
               3.2.2.2 Establishing Conditions for Convergence  72
         3.2.3 Convergence Theorems for the L-COST Estimator of Interquantile Difference  77
         3.2.4 Confidence Interval Estimator Based on L-COST Interquantile Difference Estimator  80
     3.3 Theory for the Kaigh and Lachenbruch (1982) Estimator of an Interquantile Difference  81
         3.3.1 The K-L Interquantile Difference Estimator  81
         3.3.2 Convergence Theorems for the K-L Estimator of Interquantile Difference  82
         3.3.3 Confidence Interval Estimator for the K-L Interquantile Difference Estimator  84

IV  A COMPARISON OF POINT AND CONFIDENCE INTERVAL ESTIMATORS OF INTERQUANTILE DIFFERENCES  86
    4.1 Introduction  86
    4.2 Point Estimators for the Interquantile Difference  87
    4.3 Evaluation of Point Estimators  87
        4.3.1 Methodology for Comparisons  87
        4.3.2 Results of Comparisons  89

    4.4 Evaluation of Confidence Intervals  91
    4.5 Results from Simulated Confidence Intervals  93
    4.6 Conclusion and Summary  95

V  EXAMPLE OF QUANTILE ESTIMATION METHODS  105

   5.1 Introduction  105
   5.2 Comparison of Results for the Example  106
   5.3 Conclusion  108

VI  SUMMARY AND SUGGESTIONS FOR FURTHER RESEARCH  119

   6.1 Summary  119
   6.2 Suggestions for Further Research  121

BIBLIOGRAPHY  123

APPENDIX  129

LIST OF TABLES

Page

TABLE

2.1  Order Statistics X(j), X(k) Comprising a Confidence Interval (with Theoretical Confidence) for Various Quantiles and Sample Sizes  54
2.2  Expected Lengths of 95% Confidence Intervals (and Theoretical or Observed Confidence) Computed for Various Quantiles of the Uniform Distribution, with Three Sample Sizes  55
2.3  Expected Lengths of 99% Confidence Intervals (and Theoretical or Observed Confidence) Computed for Various Quantiles of the Uniform Distribution, with Three Sample Sizes  56
2.4  Expected Lengths of 95% Confidence Intervals (and Theoretical or Observed Confidence) Computed for Various Quantiles of the Normal Distribution, with Three Sample Sizes  57
2.5  Expected Lengths of 99% Confidence Intervals (and Theoretical or Observed Confidence) Computed for Various Quantiles of the Normal Distribution, with Three Sample Sizes  58
2.6  Expected Lengths of 95% Confidence Intervals (and Theoretical or Observed Confidence) Computed for Various Quantiles of the Cauchy Distribution, with Three Sample Sizes  59
2.7  Expected Lengths of 99% Confidence Intervals (and Theoretical or Observed Confidence) Computed for Various Quantiles of the Cauchy Distribution, with Three Sample Sizes  60
2.8  Expected Lengths of 95% Confidence Intervals (and Theoretical or Observed Confidence) Computed for Various Quantiles of the Exponential Distribution, with Three Sample Sizes  61
2.9  Expected Lengths of 99% Confidence Intervals (and Theoretical or Observed Confidence) Computed for Various Quantiles of the Exponential Distribution, with Three Sample Sizes  63

2.10  Expected Lengths of 95% Confidence Intervals (and Theoretical or Observed Confidence) Computed for Various Quantiles of the Lognormal Distribution, with Three Sample Sizes  65
2.11  Expected Lengths of 99% Confidence Intervals (and Theoretical or Observed Confidence) Computed for Various Quantiles of the Lognormal Distribution, with Three Sample Sizes  67

4.1  Relative Bias of Proposed Estimators  97
4.2  Relative Mean Squared Error of Proposed Estimators vs. Sample Quantiles Method  98
4.3  Indices for Order Statistics Selected for Formation of Confidence Intervals Described by Chu (1957)  99
4.4  Expected Lengths of Confidence Intervals (and Observed Confidence) Computed for Interquantile Distances from the Uniform Distribution  100
4.5  Expected Lengths of Confidence Intervals (and Observed Confidence) Computed for Interquantile Distances from the Normal Distribution  101
4.6  Expected Lengths of Confidence Intervals (and Observed Confidence) Computed for Interquantile Distances from the Cauchy Distribution  102
4.7  Expected Lengths of Confidence Intervals (and Observed Confidence) Computed for Interquantile Distances from the Exponential Distribution  103
4.8  Expected Lengths of Confidence Intervals (and Observed Confidence) Computed for Interquantile Distances from the Lognormal Distribution  104

5.1  Estimates of Median, Lipid Data, Sample Size 51, Users and Nonusers of Oral Contraceptives  110
5.2  Limits for 95% Confidence Intervals for Median of Lipid Data, Sample Size 51, Users and Nonusers of Oral Contraceptives  111
5.3  Limits for 99% Confidence Intervals for Median of Lipid Data, Sample Size 51, Users and Nonusers of Oral Contraceptives  112

5.4  Estimates of Interdecile Difference, Lipid Data, Sample Size 51, Users and Nonusers of Oral Contraceptives  113
5.5  Limits for 95% Confidence Intervals on Interdecile Range, Lipid Data, Sample Size 51, Users and Nonusers of Oral Contraceptives  114
5.6  Limits for 99% Confidence Intervals on Interdecile Range, Lipid Data, Sample Size 51, Users and Nonusers of Oral Contraceptives  115
5.7  Estimates of Interquartile Difference, Lipid Data, Sample Size 51, Users and Nonusers of Oral Contraceptives  116
5.8  Limits for 95% Confidence Intervals on Interquartile Range, Lipid Data, Sample Size 51, Users and Nonusers of Oral Contraceptives  117
5.9  Limits for 99% Confidence Intervals on Interquartile Range, Lipid Data, Sample Size 51, Users and Nonusers of Oral Contraceptives  118

CHAPTER I

INTRODUCTION AND REVIEW OF THE LITERATURE

1.1 Introduction

Suppose there is interest in the probability distribution of some random variable, X, having cumulative distribution function F(x) and probability density (or mass) function f. It may be desired to estimate various characteristics of this distribution by means of a random sample of size n. Denote this random sample by X_1, ..., X_n and its observed realization by x_1, ..., x_n. Often a mean and variance are estimated from this sample, but there are many instances where additional measures of location and dispersion are of more value. For example, to determine the number of months of marriage after which half of the mothers in a study gave birth to their first child, estimate the median time until first birth. Or it may be necessary to know the cholesterol level which is exceeded by only 5% of the studied population. In each of these examples, the quantity of interest is called a population quantile. Formally, the p-th quantile, denoted F^{-1}(p) or ξ_p, of a probability distribution F(x) is defined by

    ∫_{-∞}^{ξ_p} f(x) dx = p.

Thus, ξ_.5 is the median of the distribution, ξ_.95 is the 95th "percentile," and so on.

If X is discretely distributed, P(X < ξ_p) < p ≤ P(X ≤ ξ_p), and ξ_p = F^{-1}(p), where F^{-1}(p) = inf{x : F(x) ≥ p}. In addition to the quantiles themselves, important functions of quantiles exist. For example, ξ_.75 − ξ_.25 is called the interquartile range and ξ_.90 − ξ_.10 is the interdecile range. Each of these quantities provides a useful measure of dispersion in a population, especially if there is uncertainty about the shape of the distribution from which the data arise.
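As a concrete numerical illustration (not part of the original discussion; the function name is ours), the exponential distribution has a closed-form quantile function, so these dispersion measures can be computed directly:

```python
import math

def exp_quantile(p, lam=1.0):
    # p-th quantile of F(x) = 1 - exp(-lam * x): solve F(xi_p) = p
    return -math.log(1.0 - p) / lam

iqr = exp_quantile(0.75) - exp_quantile(0.25)   # interquartile range
idr = exp_quantile(0.90) - exp_quantile(0.10)   # interdecile range
print(iqr, idr)
```

For the unit exponential these reduce to ln 3 ≈ 1.10 and ln 9 ≈ 2.20, since ξ_p = −ln(1−p).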

1.2 Review of the Literature

1.2.1 Simple Point and Interval Estimators

Many methods have been proposed for estimating a quantile, and the very simplest one is based on a single ordered value from the random sample. If the observed values of the sample, x_1, ..., x_n, are arranged in ascending order of magnitude and denoted by x_(i), then x_(1) ≤ ... ≤ x_(n) constitute the order statistics corresponding to the random sample. One such definition of the p-th sample quantile is

    X̃_p = x_(np)       if [np] = np,
    X̃_p = x_([np]+1)   if [np] < np,

where [y] denotes the greatest integer less than or equal to y. X̃_p may be used as a point estimator of ξ_p; for example, see Ogawa (1962). This estimator has the following important property: if f(x) is differentiable in a neighborhood of x = ξ_p and f(ξ_p) ≠ 0, the distribution of the random variable

    [√n / √(p(1−p))] f(ξ_p) (X̃_p − ξ_p)

tends to that of a N(0,1) random variable (a normal random variable with mean 0 and variance 1) as n → ∞, as explained by Ogawa (1962).
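A minimal sketch of this single-order-statistic estimator (the helper name is ours; the integer/non-integer cases of np follow the definition above):

```python
def sample_quantile(xs, p):
    # X~_p: X_(np) when np is an integer, X_([np]+1) otherwise (1-based indices)
    x = sorted(xs)
    n = len(x)
    k = n * p
    idx = int(k) if k == int(k) else int(k) + 1
    return x[idx - 1]

print(sample_quantile([3, 1, 4, 1, 5], 0.5))
```

Here n = 5 and np = 2.5, so the estimator returns the third order statistic of the sorted sample.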

Another estimator, defined as X_([p(n+1)]), is biased for ξ_p, as shown by a simple example in Schmeiser (1975). Hogg and Craig (1978, pp. 308-311) demonstrate, however, why X_((n+1)p) is the 100p-th percentile of the sample only for p such that (n+1)p is an integer. Let Z = F(X). Then Z is uniformly distributed on the unit interval.

If the sample is ordered X_(1) < X_(2) < ... < X_(n), then with Z_i = F(X_(i)), the density function for Z_k is

    h_k(z_k) = [n!/((k−1)!(n−k)!)] z_k^{k−1} (1−z_k)^{n−k},   0 < z_k < 1,

so

    h_1(z_1) = n (1−z_1)^{n−1},   0 < z_1 < 1.

Define the random variables

    W_1 = F(X_(1)),
    W_2 = F(X_(2)) − F(X_(1)),
    ...

W_i is called a coverage of the random interval {I : X_(i−1) < I < X_(i)}, and each W_i has the same pdf as Z_1 = F(X_(1)). Thus, the expected value of each coverage can be shown to be

    E(W_i) = ∫_0^1 w_i · n(1−w_i)^{n−1} dw_i = 1/(n+1).
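The value 1/(n+1) is easy to check numerically; a quick Monte Carlo sketch (illustrative only, taking F uniform on (0,1) so that F(X) = X):

```python
import random

random.seed(1)
n, reps = 5, 20000
totals = [0.0] * (n + 1)
for _ in range(reps):
    u = sorted(random.random() for _ in range(n))
    pts = [0.0] + u + [1.0]
    for i in range(n + 1):
        # i-th coverage: F(X_(i+1)) - F(X_(i)), with F(X_(0)) = 0, F(X_(n+1)) = 1
        totals[i] += pts[i + 1] - pts[i]
means = [t / reps for t in totals]
print(means)   # each entry should be close to 1/(n+1) = 1/6
```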

(The order statistics lead to a partition of the probability distribution into n+1 sections, with common expected value 1/(n+1).) Thus, because F(X_(j)) − F(X_(i)) is the sum of j−i coverages, E(F(X_(j)) − F(X_(i))) = (j−i)/(n+1), so if (n+1)p = k, E(F(X_(k))) = k/(n+1) = (n+1)p/(n+1) = p. This implies that X_(k) would be a reasonable estimator for F^{-1}(p) = ξ_p.

Another quantile estimator is based on two adjacent order statistics. As reported in Harrell and Davis (1982) as well as in Schmeiser (1975),

    F̂^{-1}(p) = (1−a) X_(r) + a X_(r+1),

where r = [p(n+1)], a = p(n+1) − [p(n+1)], and p ∈ [1/(n+1), n/(n+1)]. Schmeiser (1975) demonstrates that this is unbiased for data from a uniform (0,1) distribution.

Only point estimators have been discussed so far. If X is a continuous random variable, a confidence interval (X_(j), X_(k)) can be derived for ξ_p with an approximate confidence coefficient of γ = 1−α. As shown in Mood, Graybill, and Boes (1974),

    P(X_(j) < ξ_p < X_(k)) = 1 − P[F(X_(j)) > p] − P[F(X_(k)) < p].

It is easily shown that the probability density function of

    f(x_(j)) = [n!/((j−1)!(n−j)!)] [F(x_(j))]^{j−1} [1 − F(x_(j))]^{n−j} f(x_(j)).

Thus, if we let Z = F(X_(j)), then dz/dx = f(x_(j)). So,

    f_Z(z) = |dx/dz| f(x_(j)) = [n!/((j−1)!(n−j)!)] z^{j−1} (1−z)^{n−j},   0 < z < 1,

where the beta function is

    B(a,b) = ∫_0^1 t^{a−1} (1−t)^{b−1} dt = Γ(a)Γ(b)/Γ(a+b),   (a > 0, b > 0),

and the incomplete beta function is defined by

    I_p(a,b) = ∫_0^p t^{a−1} (1−t)^{b−1} dt / B(a,b).

In practice, the appropriate interval is defined by (X_(j), X_(k)), with j and k satisfying

    I_p(j, n−j+1) − I_p(k, n−k+1) = γ = 1−α.

An alternative derivation in David (1981) leads to

    P(X_(j) < ξ_p < X_(k)) = Σ_{i=j}^{n} C(n,i) p^i (1−p)^{n−i} − Σ_{i=k}^{n} C(n,i) p^i (1−p)^{n−i}
                           = Σ_{i=j}^{k−1} C(n,i) p^i (1−p)^{n−i} ≡ π(n,j,k,p).
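The binomial form π(n, j, k, p) can be evaluated directly to search for a suitable pair (j, k); a brute-force sketch (function names are ours):

```python
from math import comb

def pi_coverage(n, j, k, p):
    # pi(n,j,k,p) = sum_{i=j}^{k-1} C(n,i) p^i (1-p)^(n-i)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(j, k))

def shortest_pair(n, p, gamma):
    # pair (j, k) with coverage >= gamma and smallest index distance k - j
    best = None
    for j in range(1, n + 1):
        for k in range(j + 1, n + 1):
            cov = pi_coverage(n, j, k, p)
            if cov >= gamma and (best is None or k - j < best[1] - best[0]):
                best = (j, k, cov)
    return best

print(shortest_pair(10, 0.5, 0.95))
```

For n = 10 and the median, this search yields (X_(2), X_(9)) with coverage 1002/1024 ≈ 0.9785.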

In the discrete case, P(X < ξ_p) ≤ p and P(X ≤ ξ_p) ≥ p together imply that

    P(X_(j) < ξ_p ≤ X_(k)) ≥ π(n,j,k,p),

    P(X_(j) < ξ_p < X_(k)) ≤ π(n,j,k,p).

Many years ago, Thompson (1936), and later Scheffe (1943), presented this result in terms of intervals for the unknown median, M, of a population: P(X_(k) < M < X_(n−k+1)) = 1 − 2 I_.5(n−k+1, k) for 2k < n+1. Nair (1940) compares this result to a very similar one obtained independently by Savur (1937). The latter method only differs because of the assumption that there is a finite probability for an individual observation to equal the population median, which implies F(x) is noncontinuous. Scheffe and Tukey (1945) discuss the problem of interval estimation of quantiles, paying particular attention to discrete distributions. This work is limited to showing how to employ the probability integral transformation to define the interval regardless of the form the cdf assumes. Noether (1948) proceeds by means of step functions defined to be parallel to F_n(x), the empirical cdf. He then demonstrates that his method leads to the same kind of interval as that of Thompson.

Wilks (1948) begins with a slight rewriting of the interval definition,

    γ = P(U < p < U + V),

but arrives at Thompson's result as well. This methodology is still very much in use today. Lever (1969) provides a simple example of how to apply this type of interval to cover the p-th quantile from a mortality distribution for data observed in a laboratory setting.

Because it is based on the binomial distribution, the one-sample sign test (for example, Mood, Graybill, and Boes (1974), p. 514) can easily be converted into a confidence interval for any desired quantile. This is done by including in the interval any gaps between order statistics whose binomial probabilities of occurrence (under the null hypothesis that the p-th quantile equals a hypothesized value) can be added appropriately to bracket the hypothesized value with a stated total confidence. For example, in order to estimate the lower quartile (ξ_.25), first form the binomial array based on C(n,i) (.25)^i (.75)^{n−i}, i = 0, 1, ..., n. Then, beginning with the gap between order statistics having the largest probability, determine the interval by adding together probabilities around this value until at least the required confidence has been achieved. The interval should be closed by including the values for the order statistics defining the ends of the intervals.
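The greedy accumulation just described can be sketched as follows (illustrative; names ours). Term i of the binomial array is the probability that exactly i observations fall below the p-th quantile, i.e. the probability attached to the gap (X_(i), X_(i+1)):

```python
from math import comb

def sign_test_interval(n, p, gamma):
    probs = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    lo = hi = max(range(n + 1), key=probs.__getitem__)  # most probable gap
    conf = probs[lo]
    while conf < gamma:
        left = probs[lo - 1] if lo > 0 else -1.0
        right = probs[hi + 1] if hi < n else -1.0
        if left >= right:
            lo -= 1
            conf += probs[lo]
        else:
            hi += 1
            conf += probs[hi]
    # close the interval with the bounding order statistics:
    # lower endpoint X_(lo), upper endpoint X_(hi+1)
    return lo, hi + 1, conf

print(sign_test_interval(20, 0.25, 0.95))
```

With n = 20 and p = .25, the accumulation settles on the interval (X_(2), X_(10)) with total confidence about 0.96.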

As well, two methods have been discussed which are based on the large-sample normal approximation to the binomial distribution. The first, a simple rule of thumb, is presented in David (1981). For a sample size of at least 10, an approximate 1−α confidence interval for the median is obtained by counting off (√n/2) Φ^{-1}(1−α/2) observations on either side of X_([½(n+1)]), the sample median, where Φ^{-1}(1−α/2) is the upper α/2 point of the standard normal distribution.

The second method, described by Wilks (1962), defines n_1 as the number of components of the sample less than ξ_p. By the De Moivre-Laplace theorem, n_1 ~ Bin(n,p) is distributed asymptotically as N(np, np(1−p)). Thus,

    lim_{n→∞} P(−y_γ < (n_1 − np)/√(np(1−p)) < y_γ) = γ,

where

    (1/√(2π)) ∫_{−y_γ}^{y_γ} e^{−x²/2} dx = γ.

Solving (n_1 − np)² = y_γ² np(1−p) for p provides an approximate γ confidence interval (p_γ, p̄_γ) for p. This leads to using (x_([n p_γ]), x_([n p̄_γ])) as a γ-level confidence interval for ξ_p.

1.2.2 Various Median Estimation Methods

In recent years, many competing estimators have been proposed. Some have application only to the population median, while others can be applied to virtually any quantile. Much attention has been devoted to point and interval estimation of the median. Some relatively simple interval estimators for the median were proposed by Walsh (1958). One of his two-sided confidence intervals for ξ_.5 can be written as

    ( min[½(x_(1) + x_(1+i)), x_(2)],  max[½(x_(n) + x_(n−j)), x_(n−1)] ),

where for small samples, 1 ≤ i, 5 ≤ n ≤ 12, and i ≤ n−4. Another two-sided interval is of a similar form. The confidence coefficient for this interval has the value

    P[ min[½(x_(1) + x_(1+i)), x_(2)] < ξ_.5 ]
    + P[ max[½(x_(n) + x_(n−j)), x_(n−1)] > ξ_.5 ] − 1.

The lower and upper bounds are determined by setting each of these probability expressions equal to its upper and lower bound value. These probabilities are tabled according to sample size and parameters A and B which reflect the degree of population symmetry.

Two simple median estimation methods are proposed by Ekblom (1973). Define λ as a constant whose value can be between 1 and +∞, and assume there is a sample of size n. Let the sample median be defined by

    M = x_(k)                 if n = 2k−1,
    M = ½(x_(k) + x_(k+1))    if n = 2k.

Then, the two estimators are

    P(λ) = x_(k)                 if x_(k) − x_(1) ≥ λ(x_(n) − x_(k+1)),
           x_(k+1)               if x_(n) − x_(k+1) ≥ λ(x_(k) − x_(1)),
           (x_(k) + x_(k+1))/2   otherwise,

and

    N(λ) = x_(k)                 if x_(n) − x_(k+1) ≥ λ(x_(k) − x_(1)),
           x_(k+1)               if x_(k) − x_(1) ≥ λ(x_(n) − x_(k+1)),
           (x_(k) + x_(k+1))/2   otherwise.

If λ = ∞, the sample median, M, would be used. Monte Carlo tests on various values of λ indicate that P(2) had higher relative efficiency than M for the normal, triangle, and uniform distributions. N(λ), for λ = 1, 1.5, or 2, performed better for these distributions as well as the Cauchy. Since each estimator is a simple function of order statistics, asymptotic normal theory results lead to constructing a confidence interval of the form P(λ) ± Φ^{-1}(1−α/2) S(P(λ)) or N(λ) ± Φ^{-1}(1−α/2) S(N(λ)), where S(P(λ)) and S(N(λ)) are estimated standard deviations of the estimators.

Maritz and Jarrett (1978) provide formulas for obtaining estimates of the variance of the sample median based on both even and odd sample sizes. For odd (n = 2m+1) sample sizes it is implied that the variance, E(X̃_n²) − [E(X̃_n)]², where

    E(X̃_n^r) = [(2m+1)!/(m!)²] ∫_{−∞}^{∞} x^r [F(x)(1−F(x))]^m f(x) dx,

can be obtained by using F_n(x) to estimate y = F(x). Then, the estimate of the variance can be written in terms of the weighted sums Σ_j w_j x_(j) and Σ_j w_j x_(j)², where

    w_j = [(2m+1)!/(m!)²] ∫_{(j−1)/n}^{j/n} y^m (1−y)^m dy.

A more complicated expression is derived for the estimate of the variance when the sample size is even. Applying the results of these methods can lead to approximate confidence intervals for the median.

Another method of arriving at confidence intervals for the median is mentioned in Hartigan (1969). It requires forming all N = 2^n − 1 possible nonempty subsets of the index set {1, ..., n}. For each subset s, form the "subsample mean" Σ_{i∈s} x_i / Σ_{i∈s} 1. Letting Z_(1), ..., Z_(N) denote these subsample means, arrange them in ascending order of magnitude, and then use (Z_(i), Z_(j)) as the confidence interval, where i and j are selected to be consistent with the interval's confidence of (j−i)/N.

A simple method, which is Lanke's (1974) main focus, requires a unimodal and symmetric distribution around ξ_.5. Define R = X_(n) − X_(1); then for every λ ≥ 0, it is shown that

    P(X_(1) − λR < ξ_.5 < X_(n) + λR) ≥ 1 − (2+2λ)^{−(n−1)},

and the lower bound is the best possible. Specifically, if λ = ½ α^{−1/(n−1)} − 1 is selected, then the interval (X_(1) − λR, X_(n) + λR) has a confidence coefficient of at least 1−α for the median. If the underlying population distribution is normal, numerical integration can be used to obtain λ values for which the true confidence level is 1−α. Compared to the usual parametric confidence interval, this one has been shown to be no more than 11% longer.
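For small samples the Lanke interval is immediate to compute; a sketch (function name ours), using λ = ½ α^{−1/(n−1)} − 1 (note that λ ≥ 0 requires n small enough for the chosen α):

```python
def lanke_interval(xs, alpha=0.05):
    x = sorted(xs)
    n = len(x)
    lam = 0.5 * alpha ** (-1.0 / (n - 1)) - 1.0   # lambda achieving level 1 - alpha
    r = x[-1] - x[0]                              # sample range R
    return x[0] - lam * r, x[-1] + lam * r

print(lanke_interval([1.2, 0.4, 2.0, 1.6, 0.8]))
```

The interval extends the sample range symmetrically by the same amount λR on each side.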

A second modification is also discussed which performs very well for a uniform distribution.

Guilbaud (1979) provides an interval estimate for the median which extends Thompson's results. Let

    C_n(r) = 1 − 2 Σ_{v=0}^{r−1} C(n,v) 2^{−n},

and if I_n(r) = [X_(r), X_(s)], 1 ≤ r ≤ s = n−r+1, it has been shown that P(ξ_.5 ∈ I_n(r)) ≥ C_n(r). Define a second interval,

    I_n(r;t) = [½(X_(r) + X_(r+t)), ½(X_(s−t) + X_(s))],

where 1 ≤ r ≤ s = n−r+1, and 0 ≤ t ≤ s−r. If ξ_.5 is such that P(X < ξ_.5) ≤ ½ ≤ P(X ≤ ξ_.5), and r, t, n satisfy 1 ≤ r ≤ n−r+1 and 0 ≤ t ≤ n−2r+1, then P(ξ_.5 ∈ I_n(r;t)) ≥ L_n(r;t), where L_n(r;t) = ½ C_n(r) + ½ C_n(r+t). The lower bound in this inequality is stated to be the best possible. If F(x) is continuous and strictly increasing, and F^{-1}(u) is uniquely defined and continuous on (0,1), then the probability of lying in the interval I_n(r;t) is increased by a function of S_F(u), a rather complex "symmetry function."

A slight modification to the confidence interval derived from the sign test is presented by Noether (1973). The technique only requires looking at the m largest differences among |X_i − ξ_0|, 2 ≤ m ≤ n, and forming the statistics

    T_− = Σ_{j=1}^{m} t_j,   T_+ = Σ_{j=1}^{m} (1 − t_j).

The test is: reject H_0, that ξ_.5 = ξ_.5^0, if min(T_+, T_−) ≤ C. Because a symmetric population with a continuous cdf is assumed, the significance level is

    α = 2 Σ_{s=0}^{C} b(s; m; ½),

where s is the number of successes and m is the number of trials. It is explained that the confidence interval associated with this modification can be written in terms of two integers g and h, which can be chosen to minimize the expected length of the interval. The confidence coefficient associated with this is

    γ_gh ≥ 1 − α = 1 − (½)^{g+h−2} Σ_{s=0}^{g−1} C(g+h−1, s).

Confidences corresponding to various values of 9 and h are tabled. A graphical method for obtaining a confidence interval for the median, based on Wilcoxon's signed rank test, is discussed briefly in Moses (1965). Assuming Wilcoxon's signed rank statistic, W, is formed, where n W= L Z.R. , i =1 1 1

x. > 0 Z. ={ +1 1 1 _1 x. < 0 1 and Ri ;: Rank of IXii among IX11, ... ,IXnl, the critical value for a test of the median=O is called S*. The procedure described essen­ tially requires forming all possible averages from two sample observations. Moses proposes that this can be done by a graphical technique involving the intersection of lines from each pair of sample observations. The smallest and largest S* values among the averages, and the data points themselves, are excluded. The (S*-l)-th lowest and highest remaining averages or observed values 14 constitute the endpoints of the interval. Efron (1979) describes a technique for estimating the expected squared error of estimation for the sample median. The method, entitl ed the "Bootstrap ", is descri bed as fo 11 ows: Assume it is de- sired to estimate the sampling distribution of R =R(X,F) = t(~) - 8(F). This random variable is the difference between a parameter of inter­ est, 8(F), and its estimator, t(~). A bootstrap sample is obtained by drawing a random sample of size n from the population. Then, the

sample probability distribution, F̂, is constructed by equally weighting each data point by 1/n. From this F̂, a subsample of size n is drawn, with replacement, which will be called the bootstrap sample and be denoted X* = (X*_1, ..., X*_n), with observed values x* = (x*_1, ..., x*_n). Thus, this is not a permutation of the original sample unless by chance each element is selected exactly one time. The sampling distribution of R(X,F) can be approximated by the distribution of R* = R(X*, F̂), which is the bootstrap distribution induced by selecting a bootstrap sample from a fixed F̂. The distribution of R* would equal the distribution of R if F̂ were exactly equal to F, and, as Efron explains, must be "close" since F̂ is "close" to F. Exactly how well R*'s distribution approximates that of R depends upon the form of R.

For estimation of the median, use θ(F) = median of F, and t(X) = X_(m), the sample median from a sample of size n = 2m−1. Let N* = [N*_1, ..., N*_n], where N*_i denotes the number of times x_i is selected with the bootstrap sampling procedure. Within this bootstrap sample, denote the ordered values x_(1) ≤ ... ≤ x_(n) and the corresponding N* values N_(1), ..., N_(n). Then, the bootstrap value of R is R* = R(X*, F̂) = X*_(m) − x_(m), the sample median of the bootstrap distribution minus the median from the empirical distribution, F̂. To obtain the estimate of the variance, for any integer l, 1 ≤ l ≤ n, calculate

    Prob*{N_(1) + ... + N_(l) ≤ m−1} = Prob{Bin(n, l/n) ≤ m−1}
                                     = Σ_{j=0}^{m−1} C(n,j) (l/n)^j (1 − l/n)^{n−j}.   (1.1)

Thus,

    Prob*{R* = x_(l) − x_(m)} = P{Bin(n, (l−1)/n) ≤ m−1} − P{Bin(n, l/n) ≤ m−1}.

Finally, for a random sample of size n, calculate the expectation of (R*)² with respect to this distribution as an estimate of the expected squared error of estimation for the sample median. If E(t(X)) ≈ θ(F) can be assumed, a confidence interval for the median can be based on this estimate, at least approximately.

In a later work, Efron (1981) presents an estimator, σ̂_Boot, which is based on repeatedly drawing bootstrap samples from the empirical distribution function, F̂. Denoting these bootstrap estimates of the median by θ̂*_1, ..., θ̂*_N,

    σ̂_Boot = [ Σ_{i=1}^{N} (θ̂*_i − θ̄*)² / (N−1) ]^{1/2},

where θ̄* is the mean of the bootstrap estimates. The resulting confidence interval is then t(X) ± Φ^{-1}(1−α/2) σ̂_Boot.

Other median estimators deserve mention. Bauer (1972) demonstrates one way of forming the confidence interval for the median of a symmetric distribution based on the Wilcoxon signed rank test. This method is an algebraic expression of that discussed by Moses (1965) but appears more difficult to implement. Desu and Rodine

(1969) present an estimator of the form T(a,r) = aX_(r) + (1−a)X_(n−r+1), where 0 < a < 1 and 1 ≤ r ≤ [n/2]. They derive its density and distribution assuming symmetric underlying densities, but the density function is extremely complex and confidence intervals are difficult to obtain. Finally, three separate articles, written by Reid (1981), Emerson (1982), and Brookmeyer and Crowley (1982), discuss median estimation from survival data both with and without censoring. The sign test is modified to handle these situations.
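Because the bootstrap distribution of R* in (1.1) is available in closed form, Efron's estimate of the expected squared error of the sample median needs no actual resampling; a sketch (function names ours, n odd):

```python
from math import comb

def binom_cdf(m, n, p):
    # P{Bin(n, p) <= m}
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(m + 1))

def median_bootstrap_mse(xs):
    x = sorted(xs)
    n = len(x)
    m = (n + 1) // 2          # n = 2m - 1
    med = x[m - 1]
    mse = 0.0
    for l in range(1, n + 1):
        # Prob*{R* = x_(l) - x_(m)} from equation (1.1)
        p_l = binom_cdf(m - 1, n, (l - 1) / n) - binom_cdf(m - 1, n, l / n)
        mse += p_l * (x[l - 1] - med) ** 2
    return mse

print(median_bootstrap_mse([1, 2, 3, 4, 5]))
```

The probabilities p_l telescope to a total of one, so the sum is exactly the bootstrap expectation of (R*)².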

1.2.3 Estimators for the p-th Quantile

The estimators and intervals discussed above are intended specifically for the population median. There are several other estimators and intervals which can be used to estimate arbitrary p-th quantiles, although perhaps with restrictions. Kubat and Epstein (1980) develop two point estimators of ξ_p from any distribution in the location-scale family:

    F_X(ξ_p) = F((ξ_p − λ)/δ) = p.

Their simpler estimator is based on two order statistics. It can be expressed as X̂_ξ(a,b) = C_1 X_(L) + C_2 X_(M), subject to C_1 + C_2 = 1 and C_1 Z_a + C_2 Z_b = Z_ξ, where ξ_p = δ Z_ξ + λ.

The variance of the estimator is derived, a* and b* are chosen to maximize the asymptotic relative efficiency, and the optimal X_(L) and X_(M) are determined. The final estimator then becomes

    X̂_ξ(a*,b*) = C*_1 X_(L*) + C*_2 X_(M*),

where L* = [na*]+1, M* = [nb*]+1, and C*_1 = (Z_ξ − Z_b*)/(Z_a* − Z_b*). Since this estimate is a simple function of order statistics, asymptotic normal results apply, and the resulting confidence interval is of the form

    X̂_ξ(a*,b*) ± Φ^{-1}(1−α/2) √v(X̂_ξ(a,b)),

where v(X̂_ξ(a,b)) is presented in the paper. A similar estimator is proposed for three order statistics. For both estimators, the full sample need not be observed, and knowledge of F(·) is only required

in an interval covering ξ_p.

Another type of interval estimator is discussed in Schmeiser (1975). Based on the above-mentioned estimator,

a normal theory estimator is formed. Let there be m independent estimates of the quantile; a confidence interval for the p-th quantile can then be constructed as

    Ḡ(F_n^{-1}(p)) ± t_{α/2; m−1} S/√m,

where Ḡ and S are the sample mean and standard deviation of the m estimates. Schmeiser explains the validity of the interval in terms of a result that, as n → ∞ with p = r/n fixed,

    X_(r) → N( F^{-1}(p), p(1−p) / (n [f(F^{-1}(p))]²) ),

with f corresponding to F_n (from Gibbons (1971), p. 40).

Obtaining a confidence interval of pre-assigned length for any p-th quantile is the subject of Weiss's (1960) work. His method lets us select a pre-assigned length δ for the confidence interval for the p-th quantile with desired confidence coefficient β. First, two definitions need to be given:

1) Define the p-th sample quantile from a random sample of size n as the sample value with exactly [np] observations below it. (A variation of X̃_p.)

2) Let 0 ≤ a, γ ≤ 1 be any two values, and define N(a,γ) as the smallest positive integer n which satisfies

∫_{max(0, q−γ)}^{min(1, q+γ)} c_n y^{np} (1−y)^{n−np−1} dy ≥ a,

where c_n is the appropriate beta-function normalizing constant.

To construct the interval, choose quantities a, w, r, all between 0 and 1, so that aw = β and r > max(p, 1−p). Then, select a sample of m observations, where m is the smallest positive integer that satisfies 1 − [m r^{m−1} − (m−1) r^m] ≥ w. Next, defining L and U to be the smallest and largest among these m observations, let

γ = min[ r − p, r − (1−p), pδ/(2(U−L)), (1−p)δ/(2(U−L)) ].

Then take a second sample of n observations, where n is the smallest integer greater than N(a,γ) such that np is not an integer. Denote by Z the p-th sample quantile of the second sample. The confidence interval for ξ_p is then (Z − δ/2, Z + δ/2).

Azzalini (1981) presents a method based on the so-called kernel estimate of the density f(·). The estimator is

f̂(x) = (1/(nb)) Σ_{i=1}^{n} w((x − x_i)/b),

where w(·) is some bounded density function with w(t) = w(−t) for all real t, and b > 0. The distribution function estimate then is simply

F̂(x) = (1/n) Σ_{i=1}^{n} W((x − x_i)/b),

where W(t) = ∫_{−∞}^{t} w(u) du. To estimate ξ_p = F^{-1}(p), use x̂_p defined by p = F̂(x̂_p). Under regularity conditions, it is shown that x̂_p has the same asymptotic distribution as its corresponding sample quantile. Thus, it is possible to form a confidence interval of the form

x̂_p ± Φ^{-1}(1−α/2) S(x̂_p),

where S(x̂_p) is the estimated standard deviation of x̂_p.

Another novel approach is called Nomination Sampling by Willemain

(1980). The method is demonstrated for estimating the median, but works for any quantile by making slight modifications. Instead of drawing a random sample of size n, draw n independent random subsamples of size N. We then "nominate" the largest value x_(N)i from each of the i = 1, ..., n random subsamples. The proposed estimator is a weighted combination of adjacent order statistics of the nominees, with index i and weight a as defined in the paper. Note that x_(i) provides an estimate of the i/(n+1) fractile of the distribution of nominees, so it provides an estimate of the (i/(n+1))^{1/N} fractile of the distribution of the general population. An approximate confidence interval based on this estimator can be formed by using θ̂_p ± Φ^{-1}(1−α/2) S(θ̂_p). The standard deviation, S(θ̂_p), can be estimated numerically by using the empirical cdf in the formula presented for the probability density, and then calculating appropriate moments.

If a finite population, Π_N, with N elements is assumed, and a simple random sample of size n is selected without replacement, an interval estimation method described by Wilks (1962) and extended in Sedransk and Meyer (1978) may be used. Let t be a fixed integer,

1 ≤ t ≤ N, so X_(t) is the (t/N)-th quantile of Π_N. If the confidence interval is defined to be (x_(i), x_(j)), where 1 ≤ i < j ≤ n, we have that P(x_(i) ≤ X_(t) ≤ x_(j)) = P(x_(i) ≤ X_(t)) − P(x_(j) ≤ X_(t−1)), which turns out to be an expression computable from hypergeometric probabilities.

The estimator which will be the major focus of this dissertation has been developed by Harrell and Davis (1982). Their paper states that since lim_{n→∞} E[X_((n+1)p)] = F^{-1}(p) for p ∈ (0,1), it would be desirable to estimate E[X_((n+1)p)] and hence the p-th quantile of the population, whether or not (n+1)p is an integer. They suggest using

Q_p = [B((n+1)p, (n+1)(1−p))]^{-1} ∫_0^1 F_n^{-1}(y) y^{(n+1)p−1} (1−y)^{(n+1)(1−p)−1} dy,

where y = F(x) and F_n(x) = n^{-1} #(X_i ≤ x). Following Maritz and Jarrett (1978), Q_p can be expressed as

Q_p = Σ_{i=1}^{n} W_{n,i} X_(i),   (1.2)

where

W_{n,i} = [B((n+1)p, (n+1)(1−p))]^{-1} ∫_{(i−1)/n}^{i/n} y^{(n+1)p−1} (1−y)^{(n+1)(1−p)−1} dy

= I_{i/n}{p(n+1), (1−p)(n+1)} − I_{(i−1)/n}{p(n+1), (1−p)(n+1)}.

Thus, the estimator for the p-th quantile is a linear function of all the order statistics. A jackknifed variance estimator is presented in the form

V(Q_p) = [(n−1)/n] Σ_{j=1}^{n} (S_j − S̄)²,

where S_j is the quantile estimate with the j-th order statistic removed,

S_j = Σ_{i≠j} W_{n−1, i−1[i>j]} X_(i),  and  S̄ = n^{-1} Σ_{j=1}^{n} S_j.   (1.3)

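Equation (1.2) can be computed directly. The following is an illustrative Python sketch (mine, not from the dissertation): the regularized incomplete beta function I_x(a,b) is approximated with Simpson's rule so that the snippet needs only the standard library, and the weights W_{n,i} are the incremental beta probabilities defined above.

```python
import math

def _reg_inc_beta(x, a, b, steps=2000):
    # Regularized incomplete beta function I_x(a, b), approximated by
    # applying Simpson's rule to the Beta(a, b) density.
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_c = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    def pdf(t):
        if t <= 0.0 or t >= 1.0:
            return 0.0
        return math.exp(log_c + (a - 1.0) * math.log(t) + (b - 1.0) * math.log1p(-t))
    h = x / steps
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * pdf(i * h)
    return total * h / 3.0

def harrell_davis(sample, p):
    # Q_p = sum_i W_{n,i} X_(i), with
    # W_{n,i} = I_{i/n}(p(n+1), (1-p)(n+1)) - I_{(i-1)/n}(p(n+1), (1-p)(n+1)).
    x = sorted(sample)
    n = len(x)
    a, b = p * (n + 1), (1.0 - p) * (n + 1)
    return sum(
        (_reg_inc_beta(i / n, a, b) - _reg_inc_beta((i - 1) / n, a, b)) * x[i - 1]
        for i in range(1, n + 1)
    )
```

For n = 9 and p = .5 the weights are symmetric about the sample median, so the estimate for the sample 1, ..., 9 is 5.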
By appealing to asymptotic normality results, a confidence interval for this method is seen to be Q_p ± Φ^{-1}(1−α/2) S(Q_p), where S(Q_p) is the estimated standard deviation for the estimator, obtained by the jackknife procedure. The estimator's performance is tested on a variety of shapes of distributions, and it is generally shown to be more efficient than traditional estimators based on one or two order statistics.

Kaigh and Lachenbruch (1982) present another method using all of the order statistics. This technique requires drawing all possible subsamples of size k from the n elements selected in a random sample. Then, the average of the p-th sample quantiles from the subsamples is used to form the estimator for the p-th quantile. This estimator is a U-statistic which can be expressed as a linear combination of order statistics:

ξ̂_p(K-L) = Σ_{j=r}^{r+n−k} [ C(j−1, r−1) C(n−j, k−r) / C(n, k) ] X_(j),  where r = [(k+1)p].   (1.4)

A confidence interval for this estimator may be written as

ξ̂_p(K-L) ± t_{α/2; n−k} S(ξ̂_p(K-L)),

since the statistic is shown to be asymptotically normally distributed. A jackknife estimator for S(ξ̂_p(K-L)) is presented in Kaigh (1982). Monte Carlo studies have generally shown this estimator to be more efficient than the sample quantile.
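The L-statistic form (1.4) is easy to evaluate; this is an illustrative Python sketch (function name is my own), with the negative hypergeometric weights written as ratios of binomial coefficients:

```python
from math import comb

def kaigh_lachenbruch(sample, p, k):
    # Average of p-th sample quantiles over all size-k subsamples,
    # collapsed into a single weighted sum over order statistics (1.4):
    #   w_j = C(j-1, r-1) * C(n-j, k-r) / C(n, k),  j = r, ..., r + n - k,
    # with r = [(k+1)p].
    x = sorted(sample)
    n = len(x)
    r = int((k + 1) * p)
    if not 1 <= r <= k:
        raise ValueError("p too extreme for this subsample size k")
    return sum(
        comb(j - 1, r - 1) * comb(n - j, k - r) / comb(n, k) * x[j - 1]
        for j in range(r, r + n - k + 1)
    )
```

With n = 9, k = 5, p = .5 (so r = 3), the weights are 15/126, 30/126, 36/126, 30/126, 15/126 on X_(3), ..., X_(7), and the estimate for the sample 1, ..., 9 is 5.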

1.2.4 Quantile Estimators for Specific Distributions

Many articles have been written which present estimators for specific distributions. They use a variety of techniques for developing the estimator and a variety of criteria for evaluating the estimator's utility. Often, the principle used for obtaining these estimators is maximum likelihood estimation. As Green (1969) states, for any vector θ of unknown parameters of the distribution, "If the p-th quantile, ξ_p, is equal to g(θ) and θ̂ is the maximum likelihood estimator of θ, then ξ̂_p = g(θ̂)."

1.2.4.1 Normal Distribution Quantiles

For example, Green notes, for a normal distribution, ξ̂_p = X̄ + E_p σ̂, where Φ(E_p) = p, and X̄ and σ̂² = ((n−1)/n)S² are the usual maximum likelihood estimators (MLEs) of μ and σ².

Several other authors have addressed the problem of estimating the p-th quantile from the normal distribution. Owen (1968) defined a confidence interval for μ + E_p σ, the p-th quantile, where E_p is the standardized normal deviate corresponding to probability p. This interval is defined by:

P{ X̄ + E_{(1−γ)/2} S ≤ μ + E_p σ ≤ X̄ + E_{(1+γ)/2} S } = γ,

where γ = 1−α, S is the sample standard deviation, and E_{(1−γ)/2}, E_{(1+γ)/2} refer to deviates from a non-central t-distribution, values which are tabled in the article.

Zidek (1971) presents the minimax estimator for ξ_p,

θ̂_p = X̄ + η S/C_n,

where X̄ = Σ_{i=1}^{n} X_i/n, η is a given standardized deviate, and

C_n = Γ(n/2) [√2 Γ((n+1)/2)]^{-1},  n = 2, 3, ....

He shows this estimator to be "inadmissible" because there is another estimator, θ̂_1, involving the function

H(t) = Φ_n(ηt) − n C_n(ηt) + 1,  −∞ < t < ∞,

for which

E_{μ,σ}(θ̂_1 − ξ_p)² < E_{μ,σ}(θ̂_p − ξ_p)².

Confidence intervals appear difficult to form, however. Dyer, Keating, and Hensley (1977) present two other estimators for normal quantiles. These are:

1) A minimum variance unbiased estimator:

X̂_p(MVUE) = X̄ + E_p [Γ((n−1)/2)/Γ(n/2)] ((n−1)/2)^{1/2} S.

2) A "best-invariant" estimator, X̂_p(BIE) = X̄ + c S (with the multiplier c given in the paper), which is the Pitman-closest estimator, defined as the θ̂_1 for which P(|θ̂_1 − θ| < |θ̂_2 − θ|) ≥ .5 for any other estimator θ̂_2 and unknown parameter θ.

These two estimators are also compared with each other as well as with the maximum likelihood estimator. Different conclusions regarding relative quality are found depending on the judgment criterion chosen. Dyer and Keating (1979) also present an estimator which meets a fairly complex criterion of achieving minimum mean absolute error. The estimator is of the form

X̂_p = X̄ + t_{.5}(f, δ) S/√n,

where t_{.5}(f, δ) is the solution to T(t; f, δ) = .5 for the noncentral t-distribution with noncentrality parameter δ on f degrees of freedom. Confidence intervals for any of these estimators can be based on the noncentral t-distribution as described in Owen (1968).

1.2.4.2 Exponential Distribution Quantiles

In addition, estimation of quantiles from the exponential distribution has been discussed. Robertson (1977) proposed three formulas for linear estimators of the quantile k_p θ of the single-parameter exponential distribution with pdf (1/θ) exp(−x/θ), θ > 0, x > 0. Essentially, the first method sets out to find a constant K so that 1 − exp(−KX̄) will be of minimal MSE. This K is shown to be

K = K_0 = n[exp{k_p/(n+1)} − 1] / [2 − exp{k_p/(n+1)}].

Without regard to the minimal MSE criterion, another choice is to let K = K_1 = k_p. The third choice, K_2 = n[exp(k_p/n) − 1], makes exp(−K_2 X̄) unbiased for exp(−k_p), suppressing θ. Obtaining exact confidence intervals based on these estimators is not discussed, but mean squared errors for the predicted distribution function are presented to indicate performance.

A recent work by Rukhin and Strawderman (1982) deals with estimating the quantile θ_p = λ + bσ from the two-parameter exponential distribution with pdf (1/σ) exp{−(x−λ)/σ}, where λ and σ are both unknown, and b is a given constant ≥ 0. They demonstrate that an estimator often considered to be the best equivariant estimator,

δ_0 = x_(1) + (b − n^{-1})(x̄ − x_(1)),

is not as good an estimator as

δ_1 = δ_0 − 2(n+1)^{-1} [(b − 1 − n^{-1})(x̄ − x_(1)) − (b n^{-1}) x_(1)]

when b > 1 + n^{-1}. Confidence intervals based on this estimator appear difficult to obtain. The performance of this estimator is measured through risk functions.

Greenberg and Sarhan (1962) propose a nonparametric estimator of the same distribution's quantile. If Z_p = (X_p − λ)/σ, then their estimator takes the form:

The variance of this estimator is given in their paper. An approximate 1−α confidence interval can be formed based upon this estimator:

ξ̂_p(S-G) ± Φ^{-1}(1−α/2) S(ξ̂_p(S-G)).

S(ξ̂_p(S-G)) may be easily obtained since Z_p = −ln(1−p), and λ and σ can be estimated from the data.

Ali, Umbach, and Hassanein (1981) use much of the Kubat and Epstein (1980) technique to estimate a quantile from the two-parameter exponential distribution. After algebra designed to minimize the variance of the estimator, the result for this distribution is presented as:

x̂_p = (1 − Z_p/1.59362) x_(1) + (Z_p/1.59362) x_([.7967n]+1),  for 0 ≤ p ≤ .3339 or .9296 < p;

x̂_p = .745 x_([(1.50137p − .50137)n]+1) + .255 x_([(.30506p + .69494)n]+1),  for .3339 ≤ p ≤ .9296.

A confidence interval based on this estimator assumes the form

x̂_p ± Φ^{-1}(1−α/2) S(x̂_p),

since this method follows the work of Kubat and Epstein. This method has been shown to have much higher asymptotic relative efficiency than the sample quantile.
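As an illustration of the large-sample intervals used for exponential quantiles, here is a minimal Python sketch (mine, not the authors' code). It assumes the one-parameter exponential case, with the sample mean standing in for the estimator θ* of the Greenberg-Sarhan approach, so that ξ̂_p = −θ̂ ln(1−p) and S(ξ̂_p) = θ̂|ln(1−p)|/√n:

```python
import math

def exp_quantile_ci(sample, p, z=1.96):
    # Large-sample interval for xi_p = -theta * ln(1-p) of a one-parameter
    # exponential distribution; theta is estimated by the sample mean here
    # (a stand-in for theta*), giving  est +/- z * theta_hat*|ln(1-p)|/sqrt(n).
    n = len(sample)
    theta_hat = sum(sample) / n
    est = -theta_hat * math.log(1.0 - p)
    sd = theta_hat * abs(math.log(1.0 - p)) / math.sqrt(n)
    return est - z * sd, est + z * sd
```

For the median (p = .5) the point estimate is θ̂ ln 2, and the interval half-width shrinks at the usual 1/√n rate.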

1.2.4.3 Quantile Estimation for Other Distributions

Much has also been written about estimating quantiles from other distributions. For the Weibull, Lawless (1975), Mann and Fertig (1975, 1977), Schafer and Angus (1979), as well as others, have proposed relevant estimators. Many other distributions have had their quantiles estimated, generally by methods not easily leading to confidence interval construction. Angus and Schafer (1979) discuss estimation of logistic quantiles. Ali, Umbach, and Hassanein (1981) estimate double exponential quantiles. Umbach, Ali, and Hassanein (1981) estimate Pareto quantiles. Lawless (1975), and Mann and Fertig (1977), estimate quantiles from the extreme-value distribution. Sarhan and Greenberg (1962) estimate the p-th quantile from the Uniform (0,1) distribution. Finally, Weissman (1978) demonstrates estimation of large quantiles based on the k largest observations in the sample if the cdf's have the general forms:

I. Λ(x) = exp(−e^{−x}),  −∞ < x < ∞,

II. Φ_α(x) = exp(−x^{−α}),  x > 0, α > 0, or

III. Ψ_α(x) = exp(−(−x)^α),  x < 0, α > 0.

1.2.5 Estimation of Quantile Intervals

Previously, we have only discussed estimation of individual quantiles of a distribution. There is often interest in other related quantities which are functions of quantiles. Two such quantities are quantile intervals (ξ_{p1}, ξ_{p2}) and quantile differences, ξ_{p2} − ξ_{p1}. An important example of the latter is ξ_{.75} − ξ_{.25}, the interquartile range. This is shown in Chu (1957) to be a reasonable measure of dispersion for many distributions since it is only a constant multiplier of the standard deviation.

Wilks (1962) defines an outer confidence interval for the quantile interval (ξ_{p1}, ξ_{p2}) based on x_(i), x_(j) (1 ≤ i < j ≤ n):

P{ x_(i) < ξ_{p1} < ξ_{p2} < x_(j) }

= [n!/(i−1)!] Σ_{k=0}^{j−i−1} (−1)^k p_1^{i+k} I_{1−p_2}(n−j+1, j−i−k) [k! (n−i−k)! (i+k)]^{-1},

where I_t(a,b) is the incomplete beta function.

David (1981) provides lower bounds for these intervals.

Krewski (1976) produces tighter bounds on the confidence coefficient for outer confidence intervals:

P(x_(i) < ξ_{p1} < ξ_{p2} < x_(j)) ≥ I_{p1}(i, n−i+1) − I_{p1/α2}(i, j−1) I_{p2}(j, n−j+1),

where

α_2 = max[ I_{p2}(j+1, n−j+1) / I_{p2}(j, n−j+1),  j/(n+1) ].

Shortly thereafter, Reiss and Rüschendorf (1976) proposed a method which sometimes led to sharper confidence bounds than Krewski could produce. Their bounds involve partitioning the interval between p_1 and p_2 into smaller sections and labeling each partition point by a_i, so that p_1 = a_0 < a_1 < ... < a_k = p_2.

Sathe and Lingras (1981) improve upon both of the previous papers' ideas by introducing the notion of convex functions. This method is extremely complicated and will not be discussed. Essentially, they demonstrate that they can obtain even sharper bounds than Reiss and Rüschendorf's, and that the bounds can be made even sharper by subdividing the interval between the two probabilities.

1.2.6 Estimation of Quantile Differences

Only one paper has been published specifically discussing intervals for interquartile ranges or other quantile differences. The main theorem of Chu (1957) provides bounds on the confidence coefficients for estimates of differences of the form ξ_q − ξ_p. Let the confidence intervals for ξ_q − ξ_p be of the form (x_(v) − x_(u), x_(s) − x_(r)). Then, if

B_n(r,p) ≡ Σ_{i=0}^{r} C(n,i) p^i (1−p)^{n−i},

it can be shown that the confidence coefficient of the interval is bounded above and below by expressions in B_n(·,·).

David (1981) provides a simpler proof than does Chu. The article also explains that if k = [np]+1 and m = [nq]+1, then as n → ∞, x_(m) − x_(k) is asymptotically N(ξ_q − ξ_p, O(1/n)). Estimation of a symmetric quasi-range, such as the interquartile range or interdecile range, is also discussed.
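For reference, the order-1/n variance in this asymptotic normal law can be written out from standard large-sample theory for sample quantiles (this expression is standard order-statistics theory rather than a formula quoted from Chu; f denotes the underlying density and p < q):

```latex
\[
  x_{(m)} - x_{(k)} \;\approx\;
  N\!\left(\xi_q - \xi_p,\;
  \frac{1}{n}\left[
    \frac{p(1-p)}{f(\xi_p)^2}
    \;-\; \frac{2\,p(1-q)}{f(\xi_p)\,f(\xi_q)}
    \;+\; \frac{q(1-q)}{f(\xi_q)^2}
  \right]\right).
\]
```

The middle term is twice the asymptotic covariance of the two sample quantiles, which is why the variance of a difference of quantiles is smaller than the sum of the individual variances.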

1.3 Outline of the Research Proposal

In the present work, the papers of Harrell and Davis (1982) and Kaigh and Lachenbruch (1982) are extended. Chapter II will consider the problem of relative performance of the proposed estimators for construction of confidence intervals. The confidence interval for the linear combination of order statistics estimator from Harrell and Davis will be of the form ξ̂ ± C S(ξ̂), where S(ξ̂) is the estimated standard deviation of the estimator, and C is a standard normal deviate. The confidence interval for the other estimator will be as proposed in Kaigh (1982). Under the assumption that the data follow the uniform, normal, exponential, lognormal, or Cauchy distributions, confidence intervals based on the proposed estimators will be constructed. The confidence intervals so obtained will be judged relative to other confidence interval estimators with regard to expected length and ability to preserve the desired confidence level. This comparison will consider parametric and nonparametric estimators under the distributions mentioned. In Chapter III, estimators for the difference of two quantiles,

ξ_q − ξ_p, such as the interquartile range, will be constructed based on the linear combination of order statistics estimators.

Important distributional properties for these estimators will be derived: for example, the conditions under which asymptotic normality holds, and how their variances may be estimated. Confidence intervals will be constructed based on distributional results in large samples. Chapter IV will present results and methodology of simulations used to demonstrate the performance of the estimators of interquantile differences. Their performance will be discussed relative to the confidence interval bounds presented by Chu (1957). In Chapter V, an example will be presented to demonstrate the estimators' application to health data. This will consist of a demonstration on quantiles of lipid distributions from the Lipid Research Clinics Project. Finally, Chapter VI will summarize results obtained and will offer a few suggestions for further research.

CHAPTER II

A COMPARISON OF CONFIDENCE INTERVALS FOR QUANTILES

2.1 Introduction

In order to evaluate the performance of a confidence interval for quantile estimation, two criteria will be used. Ability to preserve the desired confidence level is foremost. If a constructed interval of intended confidence (1−α) is demonstrated to be of confidence clearly below 1−α, then this interval is of little use. Among those intervals which can be shown to hold the desired (1−α) confidence, the preferred interval is the shortest. Throughout this chapter, the notation (n,p)L(1−α) will refer to the length of a (1−α) confidence interval for the p-th quantile based on a sample of size n. The expected length of this confidence interval is denoted E((n,p)L(1−α)) and is the difference between the expected values of the random variables comprising the ends of the interval. This chapter will systematically evaluate various confidence intervals constructed for estimating quantiles and offer recommendations regarding their use.

2.2 Selection of Interval Estimators for Comparison

As is evident from the previous chapter, there are numerous methods available to construct confidence intervals for quantiles. The decision to include or exclude estimators for the comparisons in this chapter is based on several criteria. For nonparametric estimators, the first criterion is that the confidence interval can be constructed from a single random sample of arbitrary size n. This excludes, for example, Schmeiser's (1975) normal-theory estimator, which employs m such samples, as well as nomination sampling proposed by Willemain (1980), which also requires multiple samples. Walsh's (1958) method was excluded because it was only readily computable for samples up to size 12. Secondly, no knowledge of the shape or type of the underlying distribution should be required. Several estimators require symmetric underlying distributions to estimate the median, so these were not considered. Kubat and Epstein's (1980) method was not considered because knowledge of the distribution is required in a small interval around the quantile. Finally, the method must be straightforward to calculate for the quantile(s) of interest. The methods of Ekblom (1973), Guilbaud (1979), Weiss (1960), and Azzalini (1981) were all eliminated for this reason.

The remaining candidates for comparison were few in number. The linear combination of order statistics (L-COST) estimator proposed by Harrell and Davis (1982), the method of Kaigh and Lachenbruch (1982), the bootstrap for the median, discussed by Efron (1979), and the method using the order statistics X(j) and X(k) for endpoints, as discussed in David (1981, p. 15) and elsewhere, were all considered reasonable to compare. Finally, when one or more parametric methods for construction of the interval for a particular distribution appear in the literature, the simplest such method is included for comparison purposes.
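The (X(j), X(k)) method admits a compact implementation. This Python sketch (mine) computes the exact binomial coverage of an order-statistic interval and widens a symmetric pair around the median until the nominal level is reached:

```python
from math import comb

def coverage(n, p, j, k):
    # Exact P(X_(j) <= xi_p <= X_(k)): the number of observations below
    # xi_p is Binomial(n, p), and the interval covers xi_p exactly when
    # that count is in {j, ..., k-1} (continuous case).
    return sum(comb(n, m) * p ** m * (1 - p) ** (n - m) for m in range(j, k))

def symmetric_os_interval(n, alpha=0.05):
    # Widen a symmetric pair (j, n+1-j) around the median until the exact
    # binomial coverage reaches 1 - alpha; fall back to the full range.
    for j in range(n // 2, 0, -1):
        if coverage(n, 0.5, j, n + 1 - j) >= 1 - alpha:
            return j, n + 1 - j
    return 1, n
```

For n = 11 this reproduces the familiar (X(2), X(10)) interval for the median, with exact confidence about .988 (conservative, as the discreteness of the binomial prevents hitting .95 exactly).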

2.3 Note on Use of the Kaigh and Lachenbruch Estimator

The purpose of this chapter is to compare the above-mentioned confidence interval estimators for a quantile. In order to use the Kaigh and Lachenbruch (1982) estimator, either as a point estimator or in a confidence interval, it is necessary to select a value for k, the subsample size, as explained in Chapter I. Kaigh and Lachenbruch suggest choosing k so as to minimize E(ξ̂_p(K-L) − ξ_p)², where ξ̂_p(K-L) denotes their point estimator for the p-th quantile. However, making the choice of k in this way implies knowledge concerning the underlying distribution. Empirical results have indicated that lengths of confidence intervals and levels of observed confidence are not very sensitive to moderate variation in k (see Kaigh (1982)). Thus, choosing a reasonable, but not necessarily optimal, k will be sufficient. Secondly, since the Kaigh and Lachenbruch confidence interval estimator is constructed using a t-distribution with n−k degrees of freedom, when the sample is small it is preferable to select a smaller value of k, all else being equal. This will prevent the critical t value from becoming excessively large. Finally, regardless of quantile, a value of k should not be selected which is small enough to permit the extreme, or nearly-extreme, order statistics to be included. This is of most help when dealing potentially with long-tailed distributions.

Kaigh and Lachenbruch (1982) suggest choosing moderate values of k for median estimation, and somewhat larger values for the estimation of other quantiles. This might suggest, for example, choosing a value of k about n/3 for estimation of the median, and about 3n/4 for other quantiles, assuming a sample of size n. The method used in this chapter is more complicated, but may lead to an interval centered around a point estimator which has lower bias.
The first step in this process, regardless of the quantile to be estimated, is to examine the negative hypergeometric probability distribution (weights) for the particular quantile and sample size of interest, for various values of k. A sufficiently large number of values of k was selected so as to reasonably cover a range from small subsamples up to the entire sample size. Choosing about n/5 values of k was considered adequate. One probability distribution was constructed for each value of k (and n and p) under consideration. To estimate the median, the order statistics which combine to form the sample median were determined. All weight distributions computed were roughly symmetric about this sample median. A value of k was chosen for which about one-third of the order statistics on either side of the sample median, nearest the ends, have zero weight. This resulted in a probability distribution with a well-defined, but smooth, "peak" at the sample median. That is, one for which the probabilities decrease greatly in magnitude when more than four or five order statistics away from the sample median. This is intended to reduce variability.

To estimate a quantile other than the median, the order statistic(s) which form(s) the sample quantile were determined. Among the probability distributions computed, the one which assigns the greatest probability very close to, or at, the sample quantile was identified. When more than one value of k had a distribution with high probability near the sample quantile, the value of k for which the probabilities surrounding the sample quantile appeared to be in the most well-defined, but smooth, cluster was chosen. These steps were taken to help reduce the bias of the point estimator, and hence lead to an interval centered near the quantile

of interest. It must be realized, however, that varying k leads to variation in the standard deviation of the estimator as well as in the critical t-value. Hence, choosing k according to these suggestions may or may not lead to realization of the interval with the most confidence.

2.4 Evaluation of Confidence Intervals

The intervals considered are either determined exactly or require simulation. These need to be considered separately with regard to their evaluation.

2.4.1 Exact Confidence Intervals

2.4.1.1 Determination of Confidence

Although it is intended to construct intervals of (1−α)×100% confidence, this may not always be the case. Two of the methods do permit nearly exact determination of the confidence of an interval obtained. In the cases of the uniform, exponential (λ=1), and N(0,1) distributions, parametric estimators exist in the literature. In each case, the expected value of the estimator depends on p, n, and the parameters of the distribution.

2.4.1.2 Expected Length of Confidence Intervals

If an underlying N(0,1) distribution is assumed, then Owen (1968) provides

P( X̄ + E_{(α/2)} S ≤ μ + E_p σ ≤ X̄ + E_{(1−α/2)} S ) = 1−α,

where S is the sample standard deviation, and √n E_{(α/2)}, √n E_{(1−α/2)} are, respectively, lower and upper critical points of a noncentral t-distribution with noncentrality parameter √n E_p.

Then,

E((n,p)L(1−α)) = E(X̄ + E_{(1−α/2)} S) − E(X̄ + E_{(α/2)} S)

= (E_{1−α/2} − E_{α/2}) E(S).

When X is distributed as a N(0,1) random variable,

E(S) = (2/(n−1))^{1/2} Γ(n/2) [Γ((n−1)/2)]^{-1}.

Thus,

E((n,p)L(1−α)) = (E_{1−α/2} − E_{α/2}) Γ(n/2) (2/(n−1))^{1/2} [Γ((n−1)/2)]^{-1}.

The last column of Tables 2.4 and 2.5 presents these expected lengths. Greenberg and Sarhan (1962) present an appropriate parametric estimator for a quantile of the one-parameter exponential distribution. Following the notation established in the previous chapter,

ξ̂_p(S-G) = −θ* ln(1−p),

where θ* is their estimator of θ. This estimator has variance

Var(ξ̂_p(S-G)) = θ² ln²(1−p)/n,

with θ² estimated by θ*². This would then lead to an estimator of the standard deviation:

S(ξ̂_p(S-G)) = −θ* ln(1−p)/√n.

Following a large-sample approach, the confidence interval can be constructed as

ξ̂_p(S-G) ± Φ^{-1}(1−α/2) S(ξ̂_p(S-G)).

Thus,

E((n,p)L(1−α)) = 2 Φ^{-1}(1−α/2) E(−θ* ln(1−p))/√n

= 2 Φ^{-1}(1−α/2) [−ln(1−p)] E(θ*)/√n.

It is easy to show that when X ~ exponential (λ = 1/θ), E(θ*) = θ. Therefore,

E((n,p)L(1−α)) = 2 Φ^{-1}(1−α/2) θ [−ln(1−p)]/√n.

These expected lengths are presented in the last column of Tables 2.8 and 2.9.

If the data have an underlying uniform distribution on the interval (0,θ), then ξ̂_p(S-G) = p θ*, where

θ* = ((n+1)/n) X_(n).

This estimator has variance

Var(ξ̂_p(S-G)) = p² Var[((n+1)/n) X_(n)]

= p² θ² / (n(n+2)).

These formulae permit an asymptotic (1−α)×100% confidence interval to be constructed of the form

ξ̂_p(S-G) ± Φ^{-1}(1−α/2) S(ξ̂_p(S-G)),

where S(ξ̂_p(S-G)) = p θ* / √(n(n+2)).

As θ* is unbiased for θ, the expected length is

E((n,p)L(1−α)) = 2 Φ^{-1}(1−α/2) p θ / √(n(n+2)).

These tabled values are presented in the last column of Tables 2.2 and 2.3.

For several distributions, the exact expected value of the order statistics comprising the endpoints of this type of interval can be computed readily. The interval of the form (X_(j), X_(k)) is constructed such that

P(X_(j) ≤ ξ_p ≤ X_(k)) ≥ 1−α,

with confidence as close to 1−α as possible. For the N(μ, σ²) distribution with cdf Φ, the expected value of the k-th order statistic is

E(X_(k)) = k C(n,k) ∫_{−∞}^{∞} x [Φ((x−μ)/σ)]^{k−1} [1 − Φ((x−μ)/σ)]^{n−k} dΦ((x−μ)/σ),

and thus the expected length of the confidence interval is difficult to obtain in a closed form. Harter (1961) has performed the required numerical integration to obtain the expected value of each order statistic from a N(0,1) distribution. Using his tables, the expected value of the interval is easily obtained as the difference of the expected values of the order statistics at the endpoints. The next to last column of Tables 2.4 and 2.5 presents the expected lengths. Since the expected value of the k-th order statistic from an exponential (λ = 1/θ) distribution is

E(X_(k)) = θ Σ_{i=n−k+1}^{n} (1/i),

the expected lengths are presented in the next to last column of Tables 2.8 and 2.9. For a uniform distribution on (0,θ), the expected value of the j-th order statistic is

E(X_(j)) = θ j/(n+1).

Thus, under a uniform (0,θ) distribution,

E((n,p)L(1−α)) = θ (k−j)/(n+1).

These lengths are in the next to last column of Tables 2.2 and 2.3.
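The closed-form expected values above can be collected in a few lines. This is an illustrative sketch (function names are mine; θ defaults to 1):

```python
def expected_os_uniform(n, j, theta=1.0):
    # E[X_(j)] = theta * j / (n+1) for a Uniform(0, theta) sample of size n.
    return theta * j / (n + 1)

def expected_os_exponential(n, k, theta=1.0):
    # E[X_(k)] = theta * sum_{i=n-k+1}^{n} 1/i for an exponential sample
    # with mean theta (from the spacings of exponential order statistics).
    return theta * sum(1.0 / i for i in range(n - k + 1, n + 1))

def expected_os_length_uniform(n, j, k, theta=1.0):
    # Expected length of the interval (X_(j), X_(k)) under Uniform(0, theta).
    return theta * (k - j) / (n + 1)
```

For example, the expected length of the (X_(2), X_(10)) median interval from a uniform (0,1) sample of size 11 is 8/12.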

2.4.2 Simulated Confidence Intervals

In the case of the L-COST, Kaigh and Lachenbruch, and Bootstrap methods, the intervals were simulated. As a check, the order statistic method was also simulated.

2.4.2.1 Determination of Confidence

The observed confidence for the methods requiring simulation was computed as

γ̂ = (1/S_0) Σ_{i=1}^{S_0} I_i(A_i < ξ_p < B_i),   (2.1)

where

I_i(A_i < ξ_p < B_i) = 1 if A_i < ξ_p < B_i, and 0 otherwise;

S_0 = number of simulations performed;

(A_i, B_i) = confidence interval constructed.

The number of simulations performed for each combination of sample size, quantile, method, underlying distribution, and desired confidence of the interval is S_0. A confidence interval is constructed from each simulated sample. Based on the known underlying distribution of the sample, the p-th population quantile is calculated. Formula (2.1) is used to find an estimate of the true confidence. To decide whether the observed confidence is acceptable, the confidence interval may be evaluated on the assumption that inclusion/exclusion of ξ_p from the interval is distributed as a binomial random variable with probability parameter γ = 1−α and sample size S_0. An appropriate probability statement takes the form:

P( |γ̂ − γ| ≤ 1.96 [γ(1−γ)/S_0]^{1/2} ) ≈ .95.   (2.2)

Since γ̂ is the observed confidence, a 95% interval for γ is readily obtained from (2.2).
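The confidence check just described is short to code. This sketch (helper names are mine) computes the observed confidence of (2.1) and the binomial acceptance band implied by (2.2):

```python
import math

def observed_confidence(flags):
    # flags[i] = 1 if the i-th simulated interval contained xi_p, else 0,
    # so this is the proportion of covering intervals, as in (2.1).
    return sum(flags) / len(flags)

def acceptance_band(gamma, s0, z=1.96):
    # Binomial sampling band around the nominal confidence gamma: an
    # observed confidence outside this band suggests the method is not
    # achieving its target level, in the spirit of (2.2).
    half = z * math.sqrt(gamma * (1.0 - gamma) / s0)
    return gamma - half, gamma + half
```

With S_0 = 500 and γ = .95 the half-width is about .019, consistent with the "within about .02 of .95" acceptance criterion used later in this chapter.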

2.4.2.2 Expected Lengths of Intervals

The expected lengths of simulated intervals can be easily estimated. For each underlying distribution and quantile to be considered, confidence intervals have been simulated based on X(j) and X(k). The bootstrap method of Efron (1979) can be used to form a confidence interval for the median. From Chapter I, an interval can be formed as

X̃_{.5} ± Φ^{-1}(1−α/2) [E*(R*)²]^{1/2}.

The expected squared error of estimation for the sample median is estimated by E*(R*)² and can be simulated based on formulae presented as (1.1) of Chapter I. The expected length of the interval is

2 Φ^{-1}(1−α/2) E{[E*(R*)²]^{1/2}}.

The L-COST interval is of the form

Q_p ± Φ^{-1}(1−α/2) S(Q_p),

as shown in (1.2) and (1.3) of Chapter I. The expected length of the confidence interval is then

2 Φ^{-1}(1−α/2) E[S(Q_p)],

with S(Q_p) being simulated. Finally, the Kaigh and Lachenbruch (1982) estimator can be used in a confidence interval

ξ̂_p(K-L) ± t_{α/2; n−k} S(ξ̂_p(K-L)).

This leads to an expected length of 2 t_{α/2; n−k} E[S(ξ̂_p(K-L))].

A variance estimator, S²(ξ̂_p(K-L)) = V̂(ξ̂_p(K-L)), can be based on the jackknife as follows. Write ξ̂_p(K-L) as θ̂_n^0, and let θ̂_{n−1}^i be the point estimate for the sample with the i-th observation removed and the weights readjusted accordingly. The i-th jackknife pseudovalue is

θ̂_i* = n θ̂_n^0 − (n−1) θ̂_{n−1}^i.   (2.3)

Using (2.3), an estimator for the variance can be constructed as

S² = Σ_{i=1}^{n} (θ̂_i* − θ̄*)² / [n(n−1)],   (2.4)

where θ̄* = Σ_{i=1}^{n} θ̂_i*/n. Since θ̂_i* − θ̄* = −(n−1)(θ̂_{n−1}^i − θ̄_{n−1}), substitution into (2.4) gives

S² = [(n−1)/n] Σ_{i=1}^{n} (θ̂_{n−1}^i − θ̄_{n−1})²,   (2.5)

where θ̄_{n−1} = Σ_{i=1}^{n} θ̂_{n−1}^i / n.
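The delete-one form of (2.5) suggests a generic helper. This sketch (my own) applies the standard ((n−1)/n) jackknife normalization to any statistic passed in as a function:

```python
import math

def jackknife_sd(sample, stat):
    # Delete-one jackknife standard deviation of stat(sample):
    # S^2 = ((n-1)/n) * sum_i (theta_i - theta_bar)^2, where theta_i is
    # the statistic recomputed with the i-th observation removed.
    n = len(sample)
    loo = [stat(sample[:i] + sample[i + 1:]) for i in range(n)]
    mean = sum(loo) / n
    return math.sqrt((n - 1) / n * sum((t - mean) ** 2 for t in loo))
```

As a sanity check, applying it to the sample mean reproduces the usual standard error s/√n.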

2.4.3 Selection of Distribution for Pivotal Quantity

The statistic (θ̂ − θ)/S(θ̂) is frequently known as a pivotal quantity. When θ = ξ_p and θ̂ is based on the Bootstrap or L-COST method for estimation, the normal distribution is used to approximate the distribution of the pivotal quantity. Tukey (1958) would argue in favor of using a t-statistic with n−1 degrees of freedom when both the point estimator and its standard error are based on a jackknife. Neither the L-COST nor the Bootstrap point estimator is a jackknife estimator, so his argument does not necessarily apply to pivotal quantities based on these estimators. Since the pivotal quantities have been shown to be distributed normally, at least asymptotically (see, e.g., Kaigh (1982)), the normal distribution was chosen to approximate L-COST and Bootstrap critical values. Since Kaigh (1982) suggests using t_{n−k} for his estimator, this was used for the comparisons in this chapter.

2.5 Details of the Simulation Process

In order to represent a range of shapes of distributions, the uniform (0,1), N(0,1), exponential (λ=1), standard Cauchy, and standard lognormal distributions were selected for the simulation study. Because of its ease of use, PROC MATRIX in SAS (1979) was employed to code and execute the simulations performed. To generate uniformly distributed random variables, the UNIFORM function was used. For N(0,1) random variables, the NORMAL function was used. To generate exponential random variables, the probability integral transform was employed: if X has a continuous cumulative distribution function F(x), then Y = F(X) is uniformly distributed on the interval (0,1). Let U be a uniform (0,1) random variable as generated by the SAS function UNIFORM. Then E = −ln(1−U) is a standard exponential random variable and C = tan(π(U−½)) is a standard Cauchy random variable. To simulate a sample from the lognormal distribution, the relation that X is distributed as a lognormal random variable if Y = log X is distributed as a N(0,1) random variable was employed. Thus, to generate X from a lognormal distribution, Y was generated from a N(0,1) distribution and the transformation X = EXP(Y) was used.

Five hundred simulations were computed. This allows observed confidences within about .02 of .95 to be acceptable for the 95% confidence interval, and within about .01 of .99 to be acceptable for the 99% interval. Three odd sample sizes, 11, 31, and 51, were chosen to permit direct use of the Bootstrap method, and to permit some generalization of results for small-to-medium sized samples. The median, quartiles, and deciles were reasonable to estimate from the largest sample size, but the deciles were not estimated from the sample of size 31, and only the median was estimated from the smallest sample size.
This was done so that the only quantiles used would be those whose desired confidence can be attained by the order statistic method without extending beyond the first or last order statistic. As the uniform and normal distributions are symmetric, only the quantiles at or below the median were estimated. The Cauchy is also symmetric, but due to its very wide tails, it appeared useful to

estimate the upper and lower quartiles as the two estimates may differ. Deciles from the Cauchy were not estimated as trial results indicated very poor stability of estimates over repeated simulations.
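The transformations just described can be sketched outside SAS as well; the following Python stand-in (function names are my own, not from the original PROC MATRIX code) mirrors the probability integral transform steps:

```python
import math
import random

def uniform_to_exponential(u):
    # probability integral transform: E = -ln(1-U) is standard exponential
    return -math.log(1.0 - u)

def uniform_to_cauchy(u):
    # C = tan(pi * (U - 1/2)) is standard Cauchy
    return math.tan(math.pi * (u - 0.5))

def normal_to_lognormal(y):
    # X = exp(Y) is standard lognormal when Y ~ N(0, 1)
    return math.exp(y)

def simulate_sample(dist, n, rng):
    # one sample of size n from the named distribution
    if dist == "uniform":
        return [rng.random() for _ in range(n)]
    if dist == "normal":
        return [rng.gauss(0.0, 1.0) for _ in range(n)]
    if dist == "exponential":
        return [uniform_to_exponential(rng.random()) for _ in range(n)]
    if dist == "cauchy":
        return [uniform_to_cauchy(rng.random()) for _ in range(n)]
    if dist == "lognormal":
        return [normal_to_lognormal(rng.gauss(0.0, 1.0)) for _ in range(n)]
    raise ValueError(dist)
```

Each simulated sample would then be sorted and passed to the interval estimators under study.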

2.6 Results from Simulated or Theoretical Construction of Intervals

This section contains the results of the simulated or exact confidence intervals constructed for the cases described above. The first table, 2.1, contains the order statistics which would form the endpoints of the order statistic estimator, as well as the theoretical confidences obtainable. Then, tables 2.2 through 2.11 report each underlying distribution's results in separate tables for 95% confidence intervals and 99% confidence intervals. Each expected length estimated or calculated is presented with its observed or theoretical confidence in parentheses. The first comparison of results is between the simulated order statistic interval and the exact order statistic interval. For the distributions which permit exact calculation of the order statistic interval, it is clear that the simulated interval agrees to a large degree with the exact one, both in terms of expected length and ability to preserve the confidence level. This agreement testifies to the quality of the samples which were generated. Thus, the estimates which result from simulation of other methods are likely to be reasonable. Considering median estimation, it seems that the L-COST interval estimator, constructed using the normal distribution, is generally unable to preserve the desired confidence. The exception is for a small sample generated to be from a Cauchy distribution. It was not a poor estimator, but it did violate this important criterion. On the other hand, the Kaigh and Lachenbruch method almost always exceeded or equaled 93% confidence for a 95% confidence interval and 98% for a 99% confidence interval. The bootstrap estimator tended to perform about as well as the

Kaigh and Lachenbruch estimator with respect to preservation of confidence. In every instance, except for estimation of a Cauchy median by samples of size 11, the length of the confidence interval from a bootstrap was at least as great as that from the Kaigh and Lachenbruch method. The simulated order statistics interval is quite good with regard to preserving confidence, but is always longer than the equivalent Kaigh and Lachenbruch estimator. Evaluating the estimators' performance at other quantiles provides different results. For distributions other than the Cauchy, neither the Kaigh and Lachenbruch nor the L-COST interval, as presently formulated, is able to provide the specified confidence. It is evident that selection of k by the more complicated method described in section 2.3 is not adequate to ensure proper confidence for these quantiles. Kaigh's (1982) article describing construction of confidence intervals from the Kaigh and Lachenbruch estimator also presents results of simulations. The article compares the expected length of confidence intervals and observed confidence from the Kaigh and Lachenbruch estimator, the order statistics estimator, and an estimator which is a generalization of the L-COST estimator. This generalized estimator assumes the form

L* = Σ_{j=1}^{n} [ ∫ from (j-1)/n to j/n of x^{r-1} (1-x)^{k-r} / B(r, k-r+1) dx ] X_{(j)},   0 < x < 1,

where r = [(k+1)p]. Letting k = n produces the L-COST estimator. His results show that L* is better able, under the symmetric distributions presented, to preserve the desired confidence than is L-COST. The results are presented only for sample sizes 19 and 99, and for the uniform, normal, double exponential, and Cauchy distributions, so ability to perform under a wider variety of distributions is not considered. The problem of selecting k would remain under this modification, so it could be considered equivalent in many respects to the Kaigh and Lachenbruch estimator.
To the L-COST's credit, it should be noted that when the confidence obtained by the L-COST estimator is at least as great as that of the Kaigh and Lachenbruch estimator, the expected length of the former interval is never greater than that of the latter. The order statistics confidence interval, however, clearly preserves the desired confidence, and except for the 99% confidence interval for Cauchy quantiles, its expected lengths are quite close to those of methods which cannot preserve the desired confidence.

2.7 Conclusions

From the results presented in the previous section, it is apparent that the L-COST interval estimator cannot be depended upon to provide a confidence interval of stated level (1-α) in small to moderate samples so long as the normal distribution is used to construct the interval. Only under certain distributions is an assumption of following an approximate normal distribution in small samples valid. In view of the merits of the L-COST method for point estimation, it may be worthwhile to explore its performance when constructed with other than the normal distribution. For example, if a t_{n-1} statistic were used in construction of the interval, the length would increase about 13% for a sample of size 11, but only 4% for a sample of size 31, and 2% for 51. Based upon the results obtained, this may, in some cases, be sufficient extra length to provide adequate coverage. For median estimation, the Kaigh and Lachenbruch estimator performs satisfactorily with regard to preservation of confidence when considering each interval separately. However, it appears to be biased towards being below the desired confidence level when results are examined overall. This is reflected in its expected length almost always being less than that of other interval estimates. For estimation of quantiles other than the median, the estimator of choice is the order statistic method, given the other intervals as they are presently constructed. This method consistently produced intervals of the desired confidence, or better, and its interval length was generally quite comparable to interval lengths from estimators which could not attain the desired confidence with enough regularity. Of course, all the estimators considered above were nonparametric. If there is definite knowledge regarding the underlying distribution,

it is obvious that the parametric estimator is the best choice, regardless of quantile. With regard to the last point, it should be noted that in the only instances for which the parametric estimator was greater in expected interval length than the L-COST interval, the L-COST interval was not providing an interval with nearly the specified confidence. The confidence interval appears to be biased upwards, and its length is biased towards being too short. An adjustment in length resulting from employing a t-distribution would change this, and should be considered. Or perhaps the question of whether to jackknife the linear combination of order statistics should be reconsidered, as discussed in Efron (1979) and Parr and Schucany (1982). In any event, when constructing confidence intervals from small to medium sized samples for the median, the Kaigh and Lachenbruch method is on the borderline of acceptability, but the order statistic method assures the desired confidence. For other quantiles, the order statistic method is also preferable when compared with other methods as they are presently constructed. The L-COST interval might also perform satisfactorily when constructed using other than the normal distribution.

TABLE 2.1

ORDER STATISTICS X(j);X(k) COMPRISING A CONFIDENCE INTERVAL (WITH THEORETICAL CONFIDENCE) FOR VARIOUS QUANTILES AND SAMPLE SIZES

Desired                                       Quantile
Confidence   n    .10            .25            .50            .75            .90

.99          11   Start*;X(5)    Start;X(7)     X(1);X(10)     X(5);end*      X(7);end
                  (.997)         (.992)         (.994)         (.992)         (.997)
             31   Start;X(9)     X(3);X(16)     X(8);X(23)     X(16);X(29)    X(23);end
                  (.997)         (.990)         (.993)         (.990)         (.997)
             51   X(1);X(12)     X(6);X(22)     X(16);X(35)    X(31);X(47)    X(40);X(51)
                  (.992)         (.991)         (.992)         (.991)         (.992)

.95          11   Start;X(4)     Start;X(6)     X(2);X(9)      X(6);end       X(8);end
                  (.981)         (.966)         (.961)         (.966)         (.981)
             31   Start;X(7)     X(4);X(14)     X(11);X(23)    X(18);X(28)    X(25);end
                  (.969)         (.958)         (.959)         (.958)         (.969)
             51   X(2);X(11)     X(8);X(21)     X(19);X(33)    X(31);X(44)    X(41);X(50)
                  (.958)         (.953)         (.951)         (.953)         (.958)

*Start = x such that F(x) = 0; End = x such that F(x) = 1.
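The theoretical confidences in Table 2.1 come from the binomial distribution: the interval (X(j), X(k)) covers ξ_p exactly when between j and k-1 of the n observations fall below ξ_p. A quick check of two entries (this sketch and its function name are mine):

```python
from math import comb

def os_interval_confidence(n, p, j, k):
    # coverage of (X(j), X(k)) as a confidence interval for the p-th quantile:
    # the interval covers xi_p iff the count of observations below xi_p,
    # a Binomial(n, p) variable, lands in {j, ..., k-1}
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(j, k))

# Table 2.1: for n = 11 and the median, X(2);X(9) has confidence .961
# and X(1);X(10) has confidence .994
```

Reproducing these two table entries is a useful sanity check on any reimplementation of the order statistic method.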

TABLE 2.2
EXPECTED LENGTHS OF 95% CONFIDENCE INTERVALS (AND THEORETICAL OR OBSERVED CONFIDENCE) COMPUTED FOR VARIOUS QUANTILES OF THE UNIFORM DISTRIBUTION, WITH THREE SAMPLE SIZES

                           Kaigh and                  Simulated    Exact
P    N   K*   L-COST       Lachenbruch  Bootstrap     Order Stat.  Order Stat.  Parametric
.1   51  29   .15 (.92)    .15 (.90)                  .17 (.98)    .17 (.96)    .01 (.95)
.25  31  33   .26 (.90)    .31 (.90)                  .31 (.96)    .31 (.96)    .03 (.95)
.25  51  29   .21 (.93)    .22 (.91)                  .25 (.97)    .25 (.95)    .02 (.95)
.5   11   3   .45 (.89)    .48 (.95)    .57 (.95)     .59 (.98)    .58 (.96)    .16 (.95)
.5   31   9   .30 (.89)    .30 (.93)    .36 (.92)     .37 (.96)    .38 (.96)    .06 (.95)
.5   51  19   .25 (.92)    .25 (.94)    .28 (.93)     .27 (.94)    .27 (.95)    .04 (.95)

*Kaigh and Lachenbruch method only.

TABLE 2.3
EXPECTED LENGTHS OF 99% CONFIDENCE INTERVALS (AND THEORETICAL OR OBSERVED CONFIDENCE) COMPUTED FOR VARIOUS QUANTILES OF THE UNIFORM DISTRIBUTION, WITH THREE SAMPLE SIZES

                           Kaigh and                  Simulated    Exact
P    N   K*   L-COST       Lachenbruch  Bootstrap     Order Stat.  Order Stat.  Parametric
.1   51  29   .19 (.96)    .21 (.95)                  .21 (.99)    .21 (.99)    .01 (.99)
.25  31  23   .34 (.95)    .46 (.95)                  .40 (.99)    .41 (.99)    .04 (.99)
.25  51  29   .28 (.97)    .30 (.97)                  .31 (1.0)    .31 (.99)    .02 (.99)
.5   11   3   .58 (.96)    .70 (.99)    .75 (.98)     .74 (.99)    .75 (.99)    .22 (.99)
.5   31   9   .40 (.95)    .41 (.97)    .47 (.97)     .47 (.99)    .47 (.99)    .08 (.99)
.5   51  19   .33 (.96)    .34 (.98)    .37 (.98)     .37 (.99)    .37 (.99)    .05 (.99)

*Kaigh and Lachenbruch method only.

TABLE 2.4
EXPECTED LENGTHS OF 95% CONFIDENCE INTERVALS (AND THEORETICAL OR OBSERVED CONFIDENCE) COMPUTED FOR VARIOUS QUANTILES OF THE NORMAL DISTRIBUTION, WITH THREE SAMPLE SIZES

                           Kaigh and                  Simulated    Exact
P    N   K*   L-COST       Lachenbruch  Bootstrap     Order Stat.  Order Stat.  Parametric
.1   51  29   .84 (.89)    .92 (.89)                  1.06 (.95)   1.03 (.96)   .76 (.95)
.25  31  23   .86 (.90)    1.08 (.92)                 1.04 (.94)   1.04 (.96)   .81 (.95)
.25  51  29   .69 (.91)    .75 (.91)                  .80 (.94)    .80 (.95)    .62 (.95)
.5   11   3   1.26 (.89)   1.43 (.94)   1.62 (.95)    1.80 (.97)   1.79 (.96)   1.31 (.95)
.5   31   9   .81 (.92)    .82 (.94)    .95 (.96)     1.01 (.96)   1.01 (.96)   .73 (.95)
.5   51  19   .64 (.93)    .65 (.95)    .72 (.95)     .70 (.95)    .70 (.95)    .56 (.95)

*Kaigh and Lachenbruch method only.

TABLE 2.5
EXPECTED LENGTHS OF 99% CONFIDENCE INTERVALS (AND THEORETICAL OR OBSERVED CONFIDENCE) COMPUTED FOR VARIOUS QUANTILES OF THE NORMAL DISTRIBUTION, WITH THREE SAMPLE SIZES

                           Kaigh and                  Simulated    Exact
P    N   K*   L-COST       Lachenbruch  Bootstrap     Order Stat.  Order Stat.  Parametric
.1   51  29   1.10 (.95)   1.25 (.96)                 1.50 (.99)   1.51 (.99)   1.02 (.99)
.25  31  23   1.14 (.95)   1.57 (.97)                 1.39 (.99)   1.38 (.99)   1.09 (.99)
.25  51  29   .91 (.95)    1.01 (.97)                 1.04 (.99)   1.02 (.99)   .83 (.99)
.5   11   3   1.65 (.95)   2.08 (.99)   2.13 (.98)    2.70 (.99)   2.64 (.99)   1.86 (.99)
.5   31   9   1.06 (.97)   1.12 (.99)   1.25 (.99)    1.29 (.99)   1.29 (.99)   .98 (.99)
.5   51  19   .83 (.97)    .87 (.98)    .95 (.98)     .98 (.99)    .97 (.99)    .75 (.99)

*Kaigh and Lachenbruch method only.

TABLE 2.6
EXPECTED LENGTHS OF 95% CONFIDENCE INTERVALS (AND THEORETICAL OR OBSERVED CONFIDENCE) COMPUTED FOR VARIOUS QUANTILES OF THE CAUCHY DISTRIBUTION, WITH THREE SAMPLE SIZES

                           Kaigh and                  Simulated
P    N   K*   L-COST       Lachenbruch  Bootstrap     Order Stat.
.25  51  29   1.59 (.94)   1.88 (.96)                 1.84 (.97)
.5   11   3   2.45 (.97)   3.41 (.99)   3.39 (.99)    4.74 (.98)
.5   31   9   1.12 (.92)   1.23 (.95)   1.34 (.95)    1.50 (.96)
.5   51  19   .86 (.93)    .89 (.95)    .98 (.96)     .96 (.94)
.75  51  39   1.59 (.95)   1.75 (.94)                 1.82 (.96)

*Kaigh and Lachenbruch method only.

TABLE 2.7
EXPECTED LENGTHS OF 99% CONFIDENCE INTERVALS (AND THEORETICAL OR OBSERVED CONFIDENCE) COMPUTED FOR VARIOUS QUANTILES OF THE CAUCHY DISTRIBUTION, WITH THREE SAMPLE SIZES

                           Kaigh and                  Simulated
P    N   K*   L-COST       Lachenbruch  Bootstrap     Order Stat.
.25  51  29   2.09 (.98)   2.55 (.98)                 2.93 (1.0)
.5   11   3   3.22 (.99)   4.96 (.99)   4.46 (.99)    31.87 (.99)
.5   31   9   1.47 (.97)   1.68 (.98)   1.76 (.98)    2.05 (.99)
.5   51  19   1.13 (.98)   1.20 (.99)   1.29 (.99)    1.38 (.99)
.75  51  39   2.09 (.98)   2.45 (.99)                 3.56 (.99)

*Kaigh and Lachenbruch method only.

TABLE 2.8
EXPECTED LENGTHS OF 95% CONFIDENCE INTERVALS (AND THEORETICAL OR OBSERVED CONFIDENCE) COMPUTED FOR VARIOUS QUANTILES OF THE EXPONENTIAL DISTRIBUTION, WITH THREE SAMPLE SIZES

                           Kaigh and                  Simulated    Exact
P    N   K*   L-COST       Lachenbruch  Bootstrap     Order Stat.  Order Stat.  Parametric
.1   51  29   .17 (.93)    .17 (.93)                  .20 (.96)    .20 (.96)    .06 (.95)
.25  31  23   .37 (.90)    .45 (.89)                  .46 (.95)    .45 (.96)    .16 (.95)
.25  51  29   .30 (.94)    .30 (.91)                  .36 (.96)    .36 (.95)    .12 (.95)
.5   11   3   1.03 (.91)   1.20 (.96)   1.36 (.96)    1.31 (.98)   1.32 (.96)   .82 (.95)
.5   31   9   .65 (.90)    .67 (.93)    .78 (.93)     .88 (.95)    .88 (.96)    .49 (.95)
.5   51  19   .52 (.93)    .53 (.94)    .60 (.95)     .57 (.94)    .56 (.95)    .38 (.95)
.75  31  23   1.07 (.91)   1.31 (.90)                 1.31 (.96)   1.35 (.96)   .98 (.95)
.75  51  39   .88 (.93)    1.01 (.92)                 1.01 (.97)   1.05 (.95)   .76 (.95)
.9   51  39   1.53 (.92)   1.81 (.89)                 1.95 (.98)   1.91 (.96)   1.26 (.95)

*Kaigh and Lachenbruch method only.

TABLE 2.9
EXPECTED LENGTHS OF 99% CONFIDENCE INTERVALS (AND THEORETICAL OR OBSERVED CONFIDENCE) COMPUTED FOR VARIOUS QUANTILES OF THE EXPONENTIAL DISTRIBUTION, WITH THREE SAMPLE SIZES

                           Kaigh and                  Simulated    Exact
P    N   K*   L-COST       Lachenbruch  Bootstrap     Order Stat.  Order Stat.  Parametric
.1   51  29   .22 (.97)    .23 (.96)                  .25 (.99)    .25 (.99)    .08 (.99)
.25  31  23   .49 (.95)    .65 (.96)                  .62 (.99)    .61 (.99)    .21 (.99)
.25  51  29   .39 (.98)    .41 (.96)                  .44 (.99)    .43 (.99)    .16 (.99)
.5   11   3   1.35 (.96)   1.75 (.99)   1.79 (.99)    1.89 (.99)   1.92 (.99)   1.08 (.99)
.5   31   9   .85 (.95)    .91 (.98)    1.02 (.97)    1.02 (.99)   1.02 (.99)   .64 (.99)
.5   51  19   .68 (.97)    .71 (.98)    .78 (.99)     .77 (.99)    .77 (.99)    .50 (.99)
.75  31  23   1.41 (.95)   1.90 (.96)                 1.76 (.99)   1.82 (.99)   1.28 (.99)
.75  51  39   1.16 (.97)   1.42 (.97)                 1.53 (.99)   1.51 (.99)   1.00 (.99)
.9   51  39   2.01 (.97)   2.54 (.94)                 3.04 (.99)   3.02 (.99)   1.66 (.99)

*Kaigh and Lachenbruch method only.

TABLE 2.10
EXPECTED LENGTHS OF 95% CONFIDENCE INTERVALS (AND THEORETICAL OR OBSERVED CONFIDENCE) COMPUTED FOR VARIOUS QUANTILES OF THE LOGNORMAL DISTRIBUTION, WITH THREE SAMPLE SIZES

                           Kaigh and                  Simulated
P    N   K*   L-COST       Lachenbruch  Bootstrap     Order Stat.
.1   51  29   .24 (.89)    .25 (.89)                  .29 (.95)
.25  31  23   .46 (.90)    .55 (.90)                  .55 (.94)
.25  51  29   .36 (.91)    .36 (.89)                  .43 (.94)
.5   11   3   1.46 (.91)   1.80 (.95)   1.98 (.97)    1.88 (.97)
.5   31   9   .84 (.91)    .89 (.95)    1.02 (.95)    1.19 (.96)
.5   51  19   .66 (.93)    .68 (.95)    .76 (.95)     .73 (.95)
.75  31  23   1.83 (.90)   2.21 (.90)                 2.28 (.96)
.75  51  39   1.42 (.91)   1.61 (.91)                 1.62 (.94)
.9   51  39   3.29 (.92)   3.86 (.87)                 4.47 (.97)

*Kaigh and Lachenbruch method only.

TABLE 2.11
EXPECTED LENGTHS OF 99% CONFIDENCE INTERVALS (AND THEORETICAL OR OBSERVED CONFIDENCE) COMPUTED FOR VARIOUS QUANTILES OF THE LOGNORMAL DISTRIBUTION, WITH THREE SAMPLE SIZES

                           Kaigh and                  Simulated
P    N   K*   L-COST       Lachenbruch  Bootstrap     Order Stat.
.1   51  29   .31 (.96)    .34 (.96)                  .36 (.99)
.25  31  23   .60 (.95)    .81 (.97)                  .76 (.99)
.25  51  29   .47 (.97)    .49 (.95)                  .54 (.99)
.5   11   3   1.92 (.96)   2.61 (.99)   2.61 (.99)    3.01 (.99)
.5   31   9   1.11 (.97)   1.21 (.99)   1.34 (.99)    1.35 (.99)
.5   51  19   .87 (.97)    .92 (.98)    1.00 (.98)    1.01 (.99)
.75  31  23   2.40 (.94)   3.21 (.95)                 3.20 (.99)
.75  51  39   1.86 (.96)   2.25 (.96)                 2.64 (.99)
.9   51  39   4.33 (.95)   5.41 (.94)                 8.33 (.99)

*Kaigh and Lachenbruch method only.

CHAPTER III

THEORY FOR ESTIMATION OF AN INTERQUANTILE DIFFERENCE

3.1 Introduction

Until the present chapter, most of this work's discussion regarding quantile estimation has centered on estimation of a single quantile. The difference between quantiles, however, is also useful. It can serve as a nonparametric measure of dispersion. Chu (1957) discusses how the standard deviation is in fact a constant multiple of an interquantile difference in many cases. Thus, it is of practical interest to determine useful methods of estimating differences between quantiles and to provide a comparison of them. This difference between two quantiles is called an interquantile difference.

Definition 3.1:

Let ξ_p and ξ_q be the p-th and q-th quantiles from the cdf F, respectively, with q > p. The interquantile difference is then ξ_q - ξ_p. If F is the cdf of a continuous random variable, then the interquantile difference is t - s when F(t) = q and F(s) = p. || Both the L-COST method of Harrell and Davis (1982) and the "K-L" method of Kaigh and Lachenbruch (1982) will be modified in order to estimate the interquantile difference. This chapter will present the theory needed to extend the use of both of these estimators. The appropriate confidence intervals will also be presented.

3.2 Theory for the L-COST Estimator of Interquantile Difference

3.2.1 The L-COST Interquantile Difference Estimator

The L-COST estimator for the p-th quantile can be written as:

Q_p = Σ_{i=1}^{n} pw_{n,i} X_{(i)}

where

pw_{n,i} = [1 / B((n+1)p, (n+1)(1-p))] ∫ from (i-1)/n to i/n of y^{(n+1)p-1} (1-y)^{(n+1)(1-p)-1} dy

        = I_{i/n}{p(n+1), (1-p)(n+1)} - I_{(i-1)/n}{p(n+1), (1-p)(n+1)},

and I_x(a,b) is the incomplete beta function.

An estimator of the q-th quantile can be defined similarly as

Q_q = Σ_{i=1}^{n} qw_{n,i} X_{(i)}

This readily allows construction of the L-COST interquantile difference estimator as follows.

Definition 3.2:

The L-COST interquantile difference estimator is defined to be

ξ̂_(q-p)(L-COST) = Σ_{i=1}^{n} dw_{n,i} X_{(i)}    (3.1)

where

dw_{n,i} = qw_{n,i} - pw_{n,i}.    (3.2)

Thus, the estimator for an interquantile difference is again a linear combination of order statistics, with coefficients consisting of a function of incomplete beta functions as indicated in (3.2). ||
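A numerical sketch of (3.1) and (3.2) follows (the code and function names are mine; the incomplete beta increments are evaluated by Simpson's rule rather than a library routine):

```python
from math import gamma

def beta_density(y, a, b):
    # density of a Beta(a, b) random variable; B(a, b) via the gamma function
    return y ** (a - 1) * (1 - y) ** (b - 1) * gamma(a + b) / (gamma(a) * gamma(b))

def lcost_weights(n, p, m=200):
    # pw_{n,i} = I_{i/n}(a, b) - I_{(i-1)/n}(a, b) with a = (n+1)p, b = (n+1)(1-p);
    # each increment is the beta density integrated over ((i-1)/n, i/n)
    # by Simpson's rule with m panels (m even)
    a, b, eps = (n + 1) * p, (n + 1) * (1 - p), 1e-12
    w = []
    for i in range(1, n + 1):
        lo, hi = max((i - 1) / n, eps), min(i / n, 1 - eps)  # stay off 0 and 1
        h = (hi - lo) / m
        s = beta_density(lo, a, b) + beta_density(hi, a, b)
        for j in range(1, m):
            s += (4 if j % 2 else 2) * beta_density(lo + j * h, a, b)
        w.append(s * h / 3)
    return w

def lcost_interquantile(x_sorted, p, q):
    # (3.1)-(3.2): dw_{n,i} = qw_{n,i} - pw_{n,i} applied to the order statistics
    n = len(x_sorted)
    pw, qw = lcost_weights(n, p), lcost_weights(n, q)
    return sum((qw[i] - pw[i]) * x_sorted[i] for i in range(n))
```

For a symmetric weight pattern such as p = .5 with n = 11, the weights sum to one and the estimate of the median of 1, ..., 11 is 6, which makes a convenient check.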

3.2.2 Theoretical Framework for Convergence to Normality

In order to establish the use of the normal distribution for a pivotal quantity, it is important to demonstrate convergence to normality of the estimator presented in (3.1). The framework in which the results will be demonstrated is that of L-estimators as discussed in Parr and Schucany (1982), Cheng (1982), and Sen (1982).

3.2.2.1 L-estimators and the L-COST Estimator

Consider

L_n = n^{-1} Σ_{i=1}^{n} J(t_{n,i}) g(X_{(i)}),    (3.3)

a general form for an L-estimator, which is a linear combination of order statistics. Let J(t_{n,i}) be a score or weight function whose argument depends on both n and i, and let g be some suitably bounded function of the order statistics. In this case, g(X_{(i)}) will be X_{(i)}, the order statistic itself. As discussed in Parr and Schucany (1982) and Sen (1982), (3.3) is asymptotically equivalent to a form

U_n = Σ_{i=1}^{n} c_{i,n} X_{(i)}    (3.4)

where

c_{i,n} = ∫ from (i-1)/n to i/n of J(u) du,

and thus U_n can be considered equally well in asymptotic contexts.

In fact, Sen (1982) specifically makes the equivalence

J(t_{n,i}) = J(i/n) = J(u) for (i-1)/n < u ≤ i/n, due to this asymptotic equivalence. The L-COST estimator for interquantile differences can be expressed in the form of (3.4). To do so, consider

J(u) = J_1(u) - J_2(u)    (3.5)

where

J_1(u) = u^{(n+1)q-1} (1-u)^{(n+1)(1-q)-1} / B((n+1)q, (n+1)(1-q))

and

J_2(u) = u^{(n+1)p-1} (1-u)^{(n+1)(1-p)-1} / B((n+1)p, (n+1)(1-p)).

To simplify the subsequent expressions, (3.5) will be written as

J(u) = C_1 u^{a_1-1} (1-u)^{a_2-1} - C_2 u^{b_1-1} (1-u)^{b_2-1}    (3.6)

where

C_1 = [B((n+1)q, (n+1)(1-q))]^{-1}

C_2 = [B((n+1)p, (n+1)(1-p))]^{-1}

a_1 = (n+1)q

a_2 = (n+1)(1-q)

b_1 = (n+1)p

b_2 = (n+1)(1-p)

3.2.2.2 Establishing Conditions for Convergence

Sen (1982) and Serfling (1980) provide several necessary conditions in order to establish convergence to normality of a pivotal

quantity based on the estimator. Also, the estimator of variance proposed in section 3.2.3 will be shown to converge to the true asymptotic variance. Each of the conditions needed will be established for the L-COST interquantile difference situation prior to presenting the main theorems. The first five conditions which follow are required by Sen (1982) and the remainder by Serfling (1980).

Condition (i):

If b(u) = g(F^{-1}(u)), 0 < u < 1, then for ε ∈ (0, ½), b(u) is of bounded variation on (ε, 1-ε).

Condition (ii):

If b(u) = g(F^{-1}(u)), 0 < u < 1, then |b(u)| ≤ K{u(1-u)}^{-a} for some positive, finite K, real a, and all u ∈ (0,1).

Condition (iii):

J(u) has continuous first-order derivatives, {J'(u); 0 < u < 1}, almost everywhere.

Condition (iv):

|J(u)| ≤ K{u(1-u)}^{-b} and

|J'(u)| ≤ K{u(1-u)}^{-b-1} for all 0 < u < 1, real b, and some finite, positive K.

Condition (v):

a + b = ½ - δ, for some δ > 0, where a and b are defined in condition (ii) and condition (iv).

Condition (vi):

If T_n = Σ_{i=1}^{n} J(t_{n,i}) X_{(i)}, then

n max_{1≤i≤n} |t_{n,i} - i/n| = O(1).

Condition (vii):

For some d > 0, |t_{n,i} - u| ≤ d/n for (i-1)/n < u ≤ i/n and all 1 ≤ i ≤ n.

Condition (viii):

If E|X|^r < ∞ for some r, then for δ > 0, |J(u)| ≤ M{u(1-u)}^{-½ + 1/r + δ}, 0 < u < 1.

Conditions (i) through (viii) will now be shown to be satisfied by the L-COST interquantile difference estimator.

Condition (i): As g(X_{(i)}) = X_{(i)}, g(F^{-1}(u)) = F^{-1}(u). On (0,1), -∞ < F^{-1}(u) < ∞. By Rudin (1976, p. 128),

∫ from δ to 1-δ of F^{-1}(u) du < ∞.

Thus, b(u) is of bounded variation on (δ, 1-δ).

Condition (ii): Since b(u) = F^{-1}(u),

|b(u)| = K_1 {u(1-u)}^{-a}

where K_1 = |F^{-1}(u)| u^a (1-u)^a > 0 on u ∈ (0,1), since |u|^a < 1 and |1-u|^a < 1 there. Then |b(u)| ≤ K{u(1-u)}^{-a} for some K_1 ≤ K < ∞.

Condition (iii): From (3.6),

J(u) = C_1 u^{a_1-1} (1-u)^{a_2-1} - C_2 u^{b_1-1} (1-u)^{b_2-1}.

Thus

J'(u) = C_1 [(a_1-1) u^{a_1-2} (1-u)^{a_2-1} - (a_2-1) u^{a_1-1} (1-u)^{a_2-2}]
      - C_2 [(b_1-1) u^{b_1-2} (1-u)^{b_2-1} - (b_2-1) u^{b_1-1} (1-u)^{b_2-2}].

It is clear that J'(u) exists everywhere for u ∈ (0,1). By Rudin (1976, p. 104), J'(u) is continuous on (0,1).

Condition (iv): From (3.6), J(u) can be written as:

J(u) = {C_1 u^{a_1-1+b} (1-u)^{a_2-1+b} - C_2 u^{b_1-1+b} (1-u)^{b_2-1+b}} {u(1-u)}^{-b}
     = K_2 {u(1-u)}^{-b}.

Also, J'(u) can be written as:

J'(u) = {C_1 [(a_1-1) u^{a_1+b-1} (1-u)^{a_2+b} - (a_2-1) u^{a_1+b} (1-u)^{a_2+b-1}]
       - C_2 [(b_1-1) u^{b_1+b-1} (1-u)^{b_2+b} - (b_2-1) u^{b_1+b} (1-u)^{b_2+b-1}]} {u(1-u)}^{-b-1}
      = K_3 {u(1-u)}^{-b-1}.

For each of these expressions, every term within the sum or product of terms is easily shown to be finite. Thus, it follows that K_2 and K_3 are each finite. Taking K = max{K_1, K_2, K_3}, conditions (ii) and (iv) are simultaneously satisfied by one K.

Condition (v): Proper choice of a and b will still preserve the required finiteness of K and meet this condition.

Condition (vi): As T_n is asymptotically equivalent to U_n, it suffices to show that if

U_n = Σ_{i=1}^{n} [∫ from (i-1)/n to i/n of J(u) du] X_{(i)},

then

n max_{1≤i≤n} |u - i/n| = O(1).

Since

max_{1≤i≤n} |u - i/n| ≤ 1/n for (i-1)/n < u ≤ i/n,

it follows that

n max_{1≤i≤n} |u - i/n| = O(1).

Condition (vii): Consider |u - t_{n,i}| for (i-1)/n < u < i/n. It is clear that the required bound will hold for arbitrary n and i ≤ n by suitably selecting d.

Condition (viii): This condition is required by Serfling (1980) and is more specific than condition (iv) required by Sen (1982), but it is shown by the same techniques.

3.2.3 Convergence Theorems for the L-COST Estimator of Interquantile Difference

Having established all the needed conditions, the following theorems regarding convergence for the L-COST estimator of interquantile difference may be stated.

THEOREM 3.1.

Consider

Z_1 = n^{1/2} (ξ̂_(q-p)(L-COST) - μ(J)) / σ_L

as a pivotal quantity for the L-COST interquantile difference estimator, where

μ(J) = ∫ from 0 to 1 of F^{-1}(u) J(u) du,

which converges to ξ_q - ξ_p, and J(u) = J_1(u) - J_2(u) as defined in (3.5). Let

σ_L² = ∫ from 0 to 1 ∫ from 0 to 1 of [min(s,t) - st] J(s) J(t) dF^{-1}(s) dF^{-1}(t)    (3.7)

be the asymptotic variance. Assume {X_i} are independent and identically distributed from any cdf F. Assume E|X|^r < ∞ for some r > 0. Then, since conditions (iii), (vi), (vii), and (viii) are satisfied,

Z_1 →_d N(0,1) as n → ∞.

Proof: From a result in Serfling (1980, p. 277), since the required conditions are satisfied, the result has been established for the interquantile difference estimator. ||

To define an appropriate estimator for the variance σ_L², consider the jackknife framework.

Definition 3.3:

Let D_j be the value of ξ̂_(q-p)(L-COST) computed from the n-1 order statistics remaining when the j-th order statistic is removed, using weights

dw_{n-1, i-I(i>j)} = qw_{n-1, i-I(i>j)} - pw_{n-1, i-I(i>j)},

where

qw_{n-1, i-I(i>j)} = I_{i/(n-1)}{qn, (1-q)n} - I_{(i-1)/(n-1)}{qn, (1-q)n}        if i < j,
                   = I_{(i-1)/(n-1)}{qn, (1-q)n} - I_{(i-2)/(n-1)}{qn, (1-q)n}   if i > j, i ≤ n,    (3.8)

and pw_{n-1, i-I(i>j)} is similarly defined, replacing q by p throughout in (3.8). Then the jackknife variance estimator of ξ̂_(q-p)(L-COST) is

S_n*²(L-COST) = [(n-1)/n] Σ_{j=1}^{n} (D_j - D̄)²    (3.9)

where D̄ = Σ_{j=1}^{n} D_j / n. This can be shown by applying (4) of Harrell and Davis (1982). ||

To prove the next theorem, consider the following jackknife notation based on the L-statistic U_n. Let

U_n = Σ_{i=1}^{n} c_{i,n} X_{(i)}    (3.10)

where c_{i,n} = ∫ from (i-1)/n to i/n of J(u) du (as in (3.4) and (3.5)). Then

U_{n,i} = n U_n - (n-1) U^{(i)}_{n-1}    (3.11)

where U^{(i)}_{n-1} is U_n based on the n-1 sample observations remaining when the i-th order statistic is removed, and

U*_n = (1/n) Σ_{i=1}^{n} U_{n,i}.    (3.12)

THEOREM 3.2.

Let S_n*²(L-COST) be the jackknife variance estimator of ξ̂_(q-p)(L-COST) as defined in (3.9), and σ_L² be the asymptotic variance as shown in (3.7). Then, under satisfaction of conditions (i) through (v), S_n*²(L-COST) → σ_L² almost surely.

Proof: Let

S_n² = (n-1)^{-1} Σ_{i=1}^{n} (U_{n,i} - U_n)²    (3.13)
     = S_n*²(L-COST) + n(n-1)^{-1} (U*_n - U_n)².    (3.14)

By (3.10) through (3.14) and satisfaction of conditions (i) through (v), Sen (1982) shows that U*_n - U_n → 0 almost surely as n → ∞. Thus,

S_n² - S_n*²(L-COST) → 0 almost surely as n → ∞.

As shown in Sen (1982), S_n² - σ_L² → 0 almost surely as n → ∞. Thus, under conditions (i) through (v), with the asymptotic equivalence of T_n and U_n,

S_n*²(L-COST) - σ_L² → 0 almost surely as n → ∞. ||

With these two theorems, the use of the L-COST difference estimator and its jackknife variance estimator is justified by their convergence to the desired parameters.

3.2.4 Confidence Interval Estimator Based on the L-COST Interquantile Difference Estimator

Establishment of Theorems 3.1 and 3.2 allows the use of the L-COST estimator and the jackknife variance estimator in a pivotal quantity for the construction of a (1-α)×100% confidence interval for the interquantile difference.

Again, the small sample distribution of the pivotal quantity is not clearly determined, so the asymptotic normality results will be applied. This allows the confidence interval to be defined in the following way.

Definition 3.4:

A (1-α)×100% confidence interval for the interquantile difference based on the L-COST interquantile difference estimator is

ξ̂_(q-p)(L-COST) ± Φ^{-1}(1-α/2) S_n*(L-COST)

where S_n*²(L-COST) is the jackknife variance estimator defined in (3.9). ||
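A numerical sketch of this interval follows (my own code: the weights are computed by Simpson's rule rather than a library incomplete-beta routine, and each D_j is obtained by recomputing the estimator on the leave-one-out sample, which is what the dw_{n-1,·} weights of Definition 3.3 amount to):

```python
from math import gamma, sqrt

Z = {0.95: 1.960, 0.99: 2.576}  # standard normal two-sided critical values

def beta_density(y, a, b):
    # density of a Beta(a, b) variable
    return y ** (a - 1) * (1 - y) ** (b - 1) * gamma(a + b) / (gamma(a) * gamma(b))

def lcost_weights(n, p, m=200):
    # increments of the incomplete beta function with parameters
    # (n+1)p and (n+1)(1-p), integrated by Simpson's rule (m even)
    a, b, eps = (n + 1) * p, (n + 1) * (1 - p), 1e-12
    w = []
    for i in range(1, n + 1):
        lo, hi = max((i - 1) / n, eps), min(i / n, 1 - eps)
        h = (hi - lo) / m
        s = beta_density(lo, a, b) + beta_density(hi, a, b)
        for j in range(1, m):
            s += (4 if j % 2 else 2) * beta_density(lo + j * h, a, b)
        w.append(s * h / 3)
    return w

def lcost_diff(x_sorted, p, q):
    # the point estimator (3.1)
    n = len(x_sorted)
    pw, qw = lcost_weights(n, p), lcost_weights(n, q)
    return sum((qw[i] - pw[i]) * x_sorted[i] for i in range(n))

def lcost_diff_ci(x_sorted, p, q, level=0.95):
    # Definition 3.4: estimate +/- z_{1-alpha/2} S*, where S*^2 is the jackknife
    # variance (3.9) built from the leave-one-out estimates D_j
    n = len(x_sorted)
    d = [lcost_diff(x_sorted[:j] + x_sorted[j + 1:], p, q) for j in range(n)]
    dbar = sum(d) / n
    half = Z[level] * sqrt((n - 1) / n * sum((dj - dbar) ** 2 for dj in d))
    est = lcost_diff(x_sorted, p, q)
    return est - half, est + half
```

The interval is symmetric about the point estimate by construction, consistent with the pivotal form of Theorem 3.1.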

3.3 Theory for the Kaigh and Lachenbruch (1982) Estimator of an Interquantile Difference

3.3.1 The K-L Interquantile Difference Estimator

Recall that

ξ̂_p(K-L) = Σ_{j=r}^{n-k+r} [C(j-1, r-1) C(n-j, k-r) / C(n, k)] X_{(j)},

where C(·,·) denotes a binomial coefficient, k is the subsample size, and r = [(k+1)p]. The estimator of the q-th quantile is defined in a similar manner. This allows construction of an interquantile difference estimator as follows:

Definition 3.5:

The K-L Interquantile Difference Estimator is defined to be:

ξ̂_(q-p)(K-L) = Σ_{j=r_1}^{n-k_1+r_1} [C(j-1, r_1-1) C(n-j, k_1-r_1) / C(n, k_1)] X_{(j)}
             - Σ_{j=r_2}^{n-k_2+r_2} [C(j-1, r_2-1) C(n-j, k_2-r_2) / C(n, k_2)] X_{(j)}

where

k_1 = subsample size for the q-th quantile estimator

k_2 = subsample size for the p-th quantile estimator

r_1 = [(k_1+1)q]

r_2 = [(k_2+1)p]. ||

Thus, this estimator is also a linear combination of order statistics.

It should be noted that k_1 and k_2 are equal only when it is found that the two subsample sizes can equally well be used in the estimation of their intended quantiles. As discussed in Chapter II, selection of k may be difficult and is somewhat arbitrary, requiring much care.
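Since r = [(k+1)p] and the combinatorial weights are explicit, the K-L construction can be transcribed directly (sketch with my own function names; the weight on X(j) vanishes automatically outside r ≤ j ≤ n-k+r because the binomial coefficients are zero there):

```python
from math import comb

def kl_weights(n, k, p):
    # K-L weights for subsample size k and r = [(k+1)p] (assumes 1 <= r <= k):
    # weight on X(j) is C(j-1, r-1) * C(n-j, k-r) / C(n, k)
    r = int((k + 1) * p)
    denom = comb(n, k)
    return [comb(j - 1, r - 1) * comb(n - j, k - r) / denom
            for j in range(1, n + 1)]

def kl_interquantile(x_sorted, k1, q, k2, p):
    # Definition 3.5: difference of the two K-L single-quantile estimators
    n = len(x_sorted)
    wq, wp = kl_weights(n, k1, q), kl_weights(n, k2, p)
    return sum((wq[j] - wp[j]) * x_sorted[j] for j in range(n))
```

By the Vandermonde identity the weights sum to one; for n = 11, k = 3, and the median, the weights are symmetric, so the estimate from the sample 1, ..., 11 is exactly 6.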

3.3.2 Convergence Theorems for the K-L Estimator of Interquantile Difference

The asymptotic normality of a pivotal quantity for the K-L interquantile difference estimator is established in the following theorem.

THEOREM 3.3.

Let

m_{r_i:k_i}(u) = u^{r_i - 1} (1-u)^{k_i - r_i} / B(r_i, k_i - r_i + 1),   0 < u < 1,

md_{r:k}(u) = m_{r_1:k_1}(u) - m_{r_2:k_2}(u),

and

μ_{r_i:k_i}(F) = ∫ from 0 to 1 of F^{-1}(u) m_{r_i:k_i}(u) du

for i = 1, 2, corresponding to q and p. Suppose ∫ x² dF(x) < ∞. Then

Z_2 = (n/k)^{1/2} (ξ̂_(q-p)(K-L) - (μ_{r_1:k_1}(F) - μ_{r_2:k_2}(F))) / σ_d(F)

is a pivotal quantity, where

σ_d²(F) = ∫ from 0 to 1 ∫ from 0 to 1 of [min(s,t) - st] md_{r:k}(s) md_{r:k}(t) dF^{-1}(s) dF^{-1}(t)    (3.15)

and Z_2 converges to the standard normal distribution as n → ∞.

Proof: Extending directly from Theorem 2.1 of Kaigh (1982), and replacing m_{r:k}(u) by md_{r:k}(u), yields the result. ||

To define the variance estimator for the Kaigh and Lachenbruch estimator, an extension of the jackknife variance estimator discussed in Chapter II will be considered.

Definition 3.6:

Let θ̂_n^0 = ξ̂_(q-p)(K-L), and let θ̂_{n-1}^i be the same estimator computed with the i-th observation removed from the sample, where r_1, r_2, k_1, k_2 are defined in Definition 3.5. Let

θ̂_n^i = n θ̂_n^0 - (n-1) θ̂_{n-1}^i,   i = 1, ..., n,

and

θ̂* = Σ_{i=1}^{n} θ̂_n^i / n.

Then

S_n*²(K-L) = variance estimator of ξ̂_(q-p)(K-L)
           = Σ_{i=1}^{n} (θ̂_n^i - θ̂*)² / (n(n-1)).    (3.16)

From (2.8),

S_n*²(K-L) = [(n-1)/n] Σ_{i=1}^{n} (θ̂_{n-1}^i - θ̄_{n-1})²    (3.17)

where θ̄_{n-1} = Σ_{i=1}^{n} θ̂_{n-1}^i / n. ||

The next theorem establishes the almost sure convergence of the jackknife variance estimator for the K-L interquantile difference estimator to the asymptotic variance.

THEOREM 3.4.

Let ∫ |x|^{2+ε} dF(x) < ∞ for some ε > 0. Then, for fixed p ∈ (0,1) and k, as n → ∞, S_n*²(K-L) - σ_d²(F) → 0 almost surely, where σ_d²(F) is defined by (3.15).

Proof: Since the estimator for the difference between two quantiles is the difference of two U-statistics, it is also a U-statistic. From Theorem 3.1 of Kaigh (1982), established by Theorem 6 of Arvesen (1969), the result follows. ||

Remark 1 of Theorem 3.1 in Kaigh (1982) indicates that the existence of the variance of the r-th order statistic in a random sample of size k from F is sufficient for this theorem to hold. This applies easily for most distributions. For a Cauchy distribution, the smallest and largest order statistics do not have moments (David, 1981, p. 34). Choosing k so as to eliminate the extreme order statistics will permit the needed conditions to be met for that distribution.

3.3.3 Confidence Interval Estimator for the K-L Interquantile Difference Estimator

With Theorems 3.3 and 3.4 established, the K-L estimator and its jackknife variance estimator can be used in the pivotal quantity for construction of a (1-α)×100% confidence interval for the interquantile difference. This confidence interval is defined as follows.

Definition 3.7:

A (1-α)×100% confidence interval for the interquantile difference based on the K-L estimator for interquantile difference is defined as

ξ̂_(q-p)(K-L) ± t_{n - max(k_1, k_2), 1-α/2} S_n*(K-L).

The number of degrees of freedom for the t-statistic is chosen to be the smaller of the degrees of freedom for the t-statistics that would be used for each individual quantile's confidence interval. ||

CHAPTER IV

A COMPARISON OF POINT AND CONFIDENCE INTERVAL ESTIMATORS FOR INTERQUANTILE DIFFERENCES

4.1 Introduction

In the previous chapter, point and interval estimators of the interquantile difference were defined and their distributional properties developed. It is of interest to investigate numerically the relative performance of these estimators.

Two estimators, based on the L-COST and the Kaigh and Lachenbruch (K-L) methods, will be evaluated as point estimators on the basis of Mean Squared Error (MSE) and relative bias. Their relative efficiency (RE), as compared with an interquantile difference estimator based on the difference of sample quantiles, will be computed. Only interdecile and interquartile difference estimators will be considered, because of their easy interpretation and application.

Confidence intervals for interquantile differences will be constructed as described in Chapter III, and their ability to preserve confidence and their expected length will be evaluated using methods described in Chapter II. These intervals will also be discussed relative to bounds developed by Chu (1957).

4.2 Point Estimators for the Interquantile Difference

The L-COST estimator of Harrell and Davis (1982) and the K-L estimator of Kaigh and Lachenbruch (1982) have been developed into estimators for the interquantile difference. Definition 3.2 in Chapter III specifies the form of the L-COST interquantile difference estimator, denoted ξ̂_(q−p)(L-COST). The Kaigh and Lachenbruch (K-L) interquantile difference estimator, denoted ξ̂_(q−p)(K-L), was presented in Definition 3.5, also in the previous chapter. The interquantile difference estimator based on sample quantiles (the SQ method) is defined as follows.

Definition 4.1:

The sample quantile interquantile difference estimator is defined to be

    ξ̂_(q−p)(SQ) = [(1−a₁)X_(r₁) + a₁X_(r₁+1)] − [(1−a₂)X_(r₂) + a₂X_(r₂+1)],

where

    r₁ = [q(n+1)],    a₁ = q(n+1) − r₁,
    r₂ = [p(n+1)],    a₂ = p(n+1) − r₂.    ∎
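The interpolation in Definition 4.1 can be transcribed directly. The sketch below is illustrative only, and it assumes 1 ≤ [p(n+1)] < n so that both order statistics used in the interpolation exist.

```python
import numpy as np

def sq_quantile(x, p):
    """Interpolated sample quantile: (1-a) X_(r) + a X_(r+1),
    with r = [p(n+1)] and a = p(n+1) - r (Definition 4.1 convention).
    Assumes 1 <= r < n."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    h = p * (n + 1)
    r = int(np.floor(h))
    a = h - r
    return (1 - a) * x[r - 1] + a * x[r]

def sq_iq_difference(x, p, q):
    """SQ interquantile difference estimator."""
    return sq_quantile(x, q) - sq_quantile(x, p)

x = np.arange(1.0, 52.0)                  # the values 1, 2, ..., 51
print(sq_iq_difference(x, 0.25, 0.75))    # [.25*52] = 13, [.75*52] = 39 -> 39 - 13 = 26.0
```

For n = 51 and the interquartile case, p(n+1) and q(n+1) are integers, so no interpolation occurs and the estimator reduces to X_(39) − X_(13).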

All three of these estimators will be compared as point estimators.
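For reference, the Harrell-Davis (L-COST) quantile estimator weights the order statistics by increments of a Beta((n+1)p, (n+1)(1−p)) distribution function, and the interquantile difference estimator is the difference of two such estimates. A minimal sketch (illustrative, not the authors' code):

```python
import numpy as np
from scipy.stats import beta

def hd_quantile(x, p):
    """Harrell-Davis estimator: sum_i w_i X_(i) with
    w_i = I_{i/n}(a, b) - I_{(i-1)/n}(a, b), a = (n+1)p, b = (n+1)(1-p),
    where I is the incomplete beta (Beta CDF)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    a, b = (n + 1) * p, (n + 1) * (1 - p)
    cdf = beta.cdf(np.arange(n + 1) / n, a, b)
    return np.dot(np.diff(cdf), x)

def hd_iq_difference(x, p, q):
    """L-COST interquantile difference: difference of two HD estimates."""
    return hd_quantile(x, q) - hd_quantile(x, p)

x = np.arange(1.0, 10.0)       # 1, 2, ..., 9, symmetric about 5
print(hd_quantile(x, 0.5))     # -> 5.0, by symmetry of the Beta weights
```

Because every order statistic receives positive weight, the estimator is smooth in the data, which is the source of its small-sample efficiency gains over the one- or two-order-statistic SQ method.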

4.3 Evaluation of Point Estimators

4.3.1 Methodology for Comparisons

The L-COST, K-L, and SQ interquantile difference estimators were all evaluated as to relative bias and MSE. In order to evaluate the estimators, 500 simulations were performed for different combinations of sample size and underlying distribution. The interquartile differences were estimated for samples of size 31 and 51 for the uniform, normal, exponential, and lognormal distributions. The interquartile difference for the Cauchy distribution was estimated only from a sample of size 51, because far more than 500 simulations would be required to obtain an adequate estimate in this case. The interdecile difference for the uniform, normal, exponential, and lognormal distributions was also evaluated for a sample size of 51.

Values of K₁ and K₂ for the K-L estimator as defined above were chosen to be the same as those selected for the analyses in Chapter II. Depending on the K₁ and K₂ values selected, the results will vary.

As each underlying distribution has a known value for ξ_q − ξ_p, the interquantile difference of interest, the relative bias of each estimator was approximated as follows:

    Relative bias ≈ bias / (ξ_q − ξ_p),

where

    bias = (1/S₀) Σ_{s=1}^{S₀} [ξ̂_(q−p)(Method)_s − (ξ_q − ξ_p)],

S₀ is the number of simulations (500) performed, and Method is either SQ, L-COST, or K-L.

The MSE of each estimator was also estimated from the simulations. Within each simulation, a value of (ξ̂_(q−p)(Method) − (ξ_q − ξ_p))² was obtained. These squared errors were averaged over the number of simulations performed to obtain an estimate of MSE. To obtain relative efficiencies, the MSE of the sample quantile (SQ) method was divided by the MSE of the L-COST and K-L methods. This gives a measure of each estimator's MSE relative to a standard, i.e., the SQ method.
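The evaluation scheme above can be sketched as a small Monte Carlo. Everything below is illustrative: `naive_iqr` stands in for the SQ, L-COST, or K-L estimator, and the true normal interquartile difference 2(0.6745) ≈ 1.349 plays the role of ξ_q − ξ_p.

```python
import numpy as np

def evaluate(estimator, true_diff, sampler, n, n_sim=500, seed=1):
    """Approximate the relative bias and MSE of an interquantile
    difference estimator over n_sim simulated samples of size n."""
    rng = np.random.default_rng(seed)
    est = np.array([estimator(sampler(rng, n)) for _ in range(n_sim)])
    bias = est.mean() - true_diff
    mse = np.mean((est - true_diff) ** 2)
    return bias / true_diff, mse

def naive_iqr(x):
    # Stand-in for the SQ estimator (numpy's default interpolation,
    # not the [p(n+1)] convention of Definition 4.1).
    return np.quantile(x, 0.75) - np.quantile(x, 0.25)

rb, mse = evaluate(naive_iqr, 1.349, lambda rng, n: rng.normal(size=n), n=51)
print(rb, mse)
# Relative efficiency of a competitor would then be RE = mse_SQ / mse_method.
```

Plugging two estimators into `evaluate` with a common seed and dividing the MSEs reproduces the relative-efficiency comparison reported in Table 4.2.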

4.3.2 Results of Comparisons

Estimates of relative bias are presented in Table 4.1. The amount of relative bias present in the estimators varied noticeably depending on which distribution the data were sampled from, but it also varied somewhat between estimators and with sample size.

For the uniform distribution, relative bias was small compared to that for the other distributions. The sample quantile method was most nearly unbiased. Negative bias was present when using the L-COST method, as well as when any method was used to estimate the interquartile range from samples of size 31.

Relative bias patterns were similar for the normal and exponential distributions. For both of these distributions, the L-COST method always produced estimates with the lowest relative bias, followed by the sample quantile method, with the K-L method showing consistently higher relative bias. Increasing the sample size reduced the relative bias under the SQ and L-COST methods, but increased it under the K-L method.

For the Cauchy distribution, the SQ method produced the lowest relative bias, followed by the L-COST and K-L methods. The same almost holds true under the lognormal distribution, except that the

K-L and L-COST methods had approximately the same relative bias when estimating the interquartile difference from a sample of size 31 and the interdecile difference from a sample of size 51. For both of these distributions, however, all three methods resulted in moderately large bias. Finally, there is some tendency for the interdecile difference estimates to have slightly higher relative bias than the interquartile difference estimates.

The MSEs for the same cases were computed. As a summary measure of performance, the relative efficiencies of both the L-COST and K-L methods for estimating interquantile differences were computed using the SQ method as a reference. These values are found in Table 4.2.

Neither the L-COST nor the K-L estimator was as efficient as the SQ method when the Cauchy distribution was sampled. In every other case, no matter what the distribution or which interquantile difference, the L-COST estimator was more efficient than the SQ method. Except for the interdecile range computed for the uniform distribution, the L-COST estimator had higher relative efficiency than the particular K-L estimator under consideration. While it is possible that changing the values of K₁ and K₂ could improve estimation by the K-L method, preliminary work indicates that different choices of these parameters vary the results, but not in a consistent fashion across distributions. Overall, as formulated for this chapter, the K-L estimator is approximately as efficient as the sample quantile method, whereas the L-COST method is generally more efficient.

4.4 Evaluation of Confidence Intervals

Three types of confidence intervals for the interquantile difference will be considered. One is based on the modified version of the L-COST estimator and is presented in Definition 3.4. Another, shown as Definition 3.7, is based on the K-L estimator as modified in Chapter III. The third is based on the bounds of Chu (1957) as presented in section 1.2.6 of Chapter I.

For purposes of this chapter, only symmetric interquantile ranges, for which q = 1−p, are being considered. In this case, Chu describes a simplification of his general bounds as follows. If B_n(j, p) denotes the binomial distribution function (the probability of at most j successes in n trials with success probability p), then

    P(X_(s) − X_(r) ≥ ξ_q − ξ_p) ≥ 1 − α/2

if r is chosen such that B_n(r−1, p) ≤ α/4 and s = n−r+1. Also,

    P(X_(v) − X_(u) ≤ ξ_q − ξ_p) ≥ 1 − α/2

if u is chosen such that B_n(u−1, p) ≥ 1 − α/4, where v = n−u+1. Selecting r and u in this manner will assure that

    P(X_(v) − X_(u) ≤ ξ_q − ξ_p ≤ X_(s) − X_(r)) ≥ 1 − α.

This probability is very conservative because of the method by which Chu (1957) determined these bounds. He used the relation

    P(X_(s) − X_(r) ≥ ξ_q − ξ_p) ≥ P(X_(r) ≤ ξ_p, X_(s) ≥ ξ_q)

to obtain his result. The probability on the left-hand side actually equals the probability on the right-hand side added to

    P(X_(r) > ξ_p, X_(s) − X_(r) ≥ ξ_q − ξ_p),

in addition to other combinations. Similar considerations apply to the upper bound with X_(v) − X_(u). Thus, the confidence interval can be of much greater confidence than is nominally stated, and hence be much wider than an order statistic bound of this type need be.
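Operationally, Chu's prescription amounts to scanning the binomial distribution function for the largest r with B_n(r−1, p) ≤ α/4 and the smallest u with B_n(u−1, p) ≥ 1 − α/4. A sketch (illustrative; it assumes such indices exist for the given n, p, and α):

```python
from scipy.stats import binom

def chu_indices(n, p, alpha):
    """Indices for Chu-type bounds on a symmetric interquantile
    difference (q = 1-p): largest r with B_n(r-1, p) <= alpha/4 and
    smallest u with B_n(u-1, p) >= 1 - alpha/4; s = n-r+1, v = n-u+1."""
    r = max(j for j in range(1, n + 1) if binom.cdf(j - 1, n, p) <= alpha / 4)
    u = min(j for j in range(1, n + 1) if binom.cdf(j - 1, n, p) >= 1 - alpha / 4)
    return r, n - r + 1, u, n - u + 1

print(chu_indices(51, 0.25, 0.05))   # interquartile case, 95% intended confidence
```

For n = 51, p = .25, α = .05 this yields r = 6, s = 46, u = 21, v = 31, matching the corresponding interquartile row of Table 4.3.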

Following closely Chu's (1957) recommendations, the values of r, s, u, v selected for bounds that meet at least (1−α)×100% confidence in the two-sided interval are found by examining the binomial distribution and are presented in Table 4.3. In the case of the interdecile range, the actual X_(r) and X_(s) chosen lead to slightly shorter intervals than Chu would require. They are, however, the extreme order statistics, and are the best possible approximation.

Just as in Chapter II, the confidences were evaluated using the simulated interval endpoints. Equation (2.1) was used to compute the proportion of times that the true interquantile difference fell within the observed bounds. This applied for all three methods considered. The average lengths of the intervals were also obtained in a manner similar to that in Chapter II.

For the L-COST interquantile difference estimator, the expected length of the interval is

    2 Φ⁻¹(1−α/2) E(S_n(L-COST)),

where S_n(L-COST) is defined in (3.9). The K-L estimator has an

interval with average length

    2 t_{n−max(k₁,k₂), 1−α/2} E(S_n*(K-L)),

where S_n*²(K-L) is defined in (3.16). Finally, for the interval bounds presented by Chu (1957), the expected length is

    E[(X_(s) − X_(r)) − (X_(v) − X_(u))] = E[X_(s) + X_(u) − X_(r) − X_(v)].

For all three intervals, the expectations are estimated through the simulations, which are conducted as discussed in section 2.5 of Chapter II.
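Given the simulated endpoints, observed confidence and expected length reduce to simple averages; a minimal sketch (the toy endpoint arrays are hypothetical):

```python
import numpy as np

def interval_performance(lower, upper, true_value):
    """Observed confidence = fraction of simulated intervals covering
    the true interquantile difference; expected length = mean width."""
    lower, upper = np.asarray(lower, dtype=float), np.asarray(upper, dtype=float)
    coverage = np.mean((lower <= true_value) & (true_value <= upper))
    avg_length = np.mean(upper - lower)
    return coverage, avg_length

# Toy check: intervals [0, 2] and [3, 4] against a true value of 1.
cov, length = interval_performance([0.0, 3.0], [2.0, 4.0], 1.0)
print(cov, length)   # -> 0.5 1.5
```

Applied to the 500 simulated intervals per cell, these two averages are exactly the quantities tabulated in Tables 4.4 through 4.8.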

4.5 Results from Simulated Confidence Intervals

The expected lengths and observed confidences for all three methods appear in Tables 4.4 through 4.8. The most obvious feature of the tables is that Chu's (1957) bounds are indeed overly wide and exceed their intended confidence. This was expected, as described previously. Thus, these bounds can be considered generous upper bounds on interval length for the true interquantile difference, and should be improved upon by other methods. In fact, even 99% intervals from the K-L or L-COST methods would always preserve at least 95% confidence and yet be shorter than Chu's 95% intervals. Regardless of interquantile difference or underlying distribution, both the K-L and L-COST methods achieve interval lengths much lower than those of Chu, but they do so in nearly all cases by reducing the actual level of confidence below that intended.

For the uniform distribution, both the L-COST and K-L methods come close to performing acceptably for the 95% confidence intervals, but both estimators have somewhat lower confidence than acceptable for the 99% intervals. The K-L method always equals or exceeds the confidence of the L-COST method, but K-L intervals are often longer as well.

Results are somewhat different for the normal distribution. In this instance, the K-L method always provides an interval of acceptable confidence, whereas the L-COST method is only within acceptable tolerances for the larger sample size. For equal confidence, the K-L method provides a longer interval, but it provides an acceptable, if longer, interval when L-COST cannot attain approximately the desired confidence.

For the Cauchy distribution, either estimator provides an interval of acceptable confidence. Again, the K-L intervals are longer than those from the L-COST method.

Entirely different results are obtained when forming intervals for quantile differences from the exponential and lognormal distributions. Neither the L-COST nor the K-L method attained the desired confidence, regardless of sample size or interquantile difference of interest. A minor exception is that the L-COST method did achieve an observed .94 confidence for the interquartile range from an exponential sample of size 51. Otherwise, when sampling from these asymmetric distributions with the moderate to large sample sizes used for this work, neither method was of acceptable confidence for estimation of interquantile ranges.

4.6 Conclusion and Summary

Based upon the results presented in section 4.3, the L-COST estimator for interquantile differences is of great value as a point estimator. Its bias is generally within a reasonable amount of that of the SQ method, and it is actually less biased for two distributions. It also had lower MSE than either the SQ or the K-L method, with the exception of estimation from a Cauchy-distributed sample. Thus, assuming adequate sample sizes and reasonable interquantile ranges, the L-COST estimator for interquantile differences is, in general, the preferable point estimator. This is consistent with the finding of Harrell and Davis (1982) that their L-COST estimator for a single quantile had generally higher efficiency than the sample quantile method for sample sizes between 20 and 60.

For construction of confidence intervals, none of the three methods in their present form performed satisfactorily overall, and thus none can be recommended for general use. The Chu (1957) method overstated the bounds required, and thus led to overly wide intervals which were too conservative. The L-COST method provided shorter bounds, but too often its confidence was below the acceptable thresholds as defined in section 2.5 of Chapter II. For symmetric distributions, the K-L method achieved acceptable confidences in virtually every case simulated. Thus, it is acceptable to use when the underlying distribution is considered to have some type of symmetric shape, assuming, of course, adequate sample sizes as well. It should be kept in mind that results looked less promising for interdecile ranges than for interquartile ranges. Further work is needed to find an acceptable confidence interval estimator for general, unspecified distributions.

TABLE 4.1 RELATIVE BIAS OF PROPOSED ESTIMATORS

Distribution   N    K1*  K2*  ξ_q−ξ_p**   Sample Quantiles   L-COST   K-L

Uniform        31   23   23   Q           -.0058             -.0366   -.0054
               51   39   29   Q           -.0082             -.0138   -.0392
               51   39   29   D            .0019             -.0173    .0018

Normal         31   23   23   Q            .0359              .0272    .0460
               51   39   29   Q            .0267              .0213    .0790
               51   39   29   D            .0377              .0234    .0456

Cauchy         51   39   29   Q            .0641              .1157    .1744

Exponential    31   23   23   Q            .0359              .0330    .0482
               51   39   29   Q            .0353              .0320    .0590
               51   39   29   D            .0508              .0502    .0628

Lognormal      31   23   23   Q            .0740              .0950    .0943
               51   39   29   Q            .0637              .0688    .0884
               51   39   29   D            .0800              .1034    .1041

*Kaigh and Lachenbruch method only.
**Q = ξ.75 − ξ.25; D = ξ.90 − ξ.10.

TABLE 4.2 RELATIVE EFFICIENCY† OF PROPOSED ESTIMATORS VS SAMPLE QUANTILE METHOD

Distribution   N    K1*  K2*  ξ_q−ξ_p**   L-COST   Kaigh and Lachenbruch

Uniform        31   23   23   Q           1.375    1.209
               51   39   29   Q           1.293    1.084
               51   39   29   D           1.186    1.216

Normal         31   23   23   Q           1.329    1.128
               51   39   29   Q           1.367     .958
               51   39   29   D           1.188     .916

Cauchy         51   39   29   Q            .944     .685

Exponential    31   23   23   Q           1.339    1.158
               51   39   29   Q           1.218    1.047
               51   39   29   D           1.225    1.044

Lognormal      31   23   23   Q           1.147    1.085
               51   39   29   Q           1.197    1.051
               51   39   29   D           1.063     .976

*Kaigh and Lachenbruch method only.
†Rel. Eff. = MSE(SQ)/MSE(Method); the SQ method has relative efficiency 1.0.
**Q = ξ.75 − ξ.25; D = ξ.90 − ξ.10.

TABLE 4.3 INDICES FOR ORDER STATISTICS SELECTED FOR FORMATION OF CONFIDENCE INTERVALS DESCRIBED BY CHU (1957)

                                      Indices of Order Statistics
Intended Confidence   N    ξ_q−ξ_p*   r**   s    u    v

95%                   31   Q          3     29   14   18
                      51   Q          6     46   21   31
                      51   D          1     51   11   41

99%                   31   Q          2     30   16   16
                      51   Q          5     47   23   29
                      51   D          1     51   13   39

*Q = ξ.75 − ξ.25; D = ξ.90 − ξ.10.
**r as in X_(r).

TABLE 4.4 EXPECTED LENGTHS OF CONFIDENCE INTERVALS (AND OBSERVED CONFIDENCE) COMPUTED FOR INTERQUANTILE DIFFERENCE FROM UNIFORM DISTRIBUTION

Intended Confidence   N    K1*  K2*  ξ_q−ξ_p**   Chu        L-COST      K-L

95%                   31   23   23   Q           .69 (1)    .29 (.92)   .39 (.94)
                      51   39   29   Q           .58 (1)    .24 (.94)   .29 (.94)
                      51   39   29   D           .38 (1)    .19 (.93)   .22 (.94)

99%                   31   23   23   Q           .84 (1)    .39 (.97)   .57 (.98)
                      51   39   29   Q           .69 (1)    .32 (.98)   .41 (.99)
                      51   39   29   D           .46 (1)    .26 (.97)   .31 (.97)

*Kaigh and Lachenbruch method only.
**Q = ξ.75 − ξ.25; D = ξ.90 − ξ.10.

TABLE 4.5 EXPECTED LENGTHS OF CONFIDENCE INTERVALS (AND OBSERVED CONFIDENCE) COMPUTED FOR INTERQUANTILE DIFFERENCE FROM NORMAL DISTRIBUTION

Intended Confidence   N    K1*  K2*  ξ_q−ξ_p**   Chu         L-COST       K-L

95%                   31   23   23   Q           2.45 (1)     .98 (.92)   1.29 (.94)
                      51   39   29   Q           1.96 (1)     .79 (.95)    .94 (.95)
                      51   39   29   D           2.85 (1)    1.13 (.92)   1.37 (.93)

99%                   31   23   23   Q           3.01 (1)    1.28 (.97)   1.88 (.99)
                      51   39   29   Q           2.39 (1)    1.03 (.98)   1.32 (.98)
                      51   39   29   D           3.11 (1)    1.49 (.96)   1.92 (.98)

*Kaigh and Lachenbruch method only.
**Q = ξ.75 − ξ.25; D = ξ.90 − ξ.10.

TABLE 4.6 EXPECTED LENGTHS OF CONFIDENCE INTERVALS (AND OBSERVED CONFIDENCE) COMPUTED FOR INTERQUANTILE DIFFERENCE FROM CAUCHY DISTRIBUTION

Intended Confidence   N    K1*  K2*  ξ_q−ξ_p**   Chu         L-COST       K-L

95%                   51   39   29   Q           5.47 (1)    1.89 (.94)   2.30 (.95)

99%                   51   39   29   Q           7.45 (1)    2.49 (.99)   3.22 (.99)

*Kaigh and Lachenbruch method only.
**Q = ξ.75 − ξ.25.

TABLE 4.7 EXPECTED LENGTHS OF CONFIDENCE INTERVALS (AND OBSERVED CONFIDENCE) COMPUTED FOR INTERQUANTILE DIFFERENCE FROM EXPONENTIAL DISTRIBUTION

Intended Confidence   N    K1*  K2*  ξ_q−ξ_p**   Chu          L-COST       K-L

95%                   31   23   23   Q           2.12 (.99)    .99 (.90)   1.26 (.91)
                      51   39   29   Q           1.73 (1)      .82 (.94)    .96 (.92)
                      51   39   29   D           3.17 (.99)   1.52 (.92)   1.80 (.90)

99%                   31   23   23   Q           2.88 (1)     1.31 (.95)   1.83 (.96)
                      51   39   29   Q           2.11 (1)     1.08 (.97)   1.35 (.97)
                      51   39   29   D           3.39 (.99)   2.00 (.97)   2.52 (.95)

*Kaigh and Lachenbruch method only.
**Q = ξ.75 − ξ.25; D = ξ.90 − ξ.10.

TABLE 4.8 EXPECTED LENGTHS OF CONFIDENCE INTERVALS (AND OBSERVED CONFIDENCE) COMPUTED FOR INTERQUANTILE DIFFERENCE FROM LOGNORMAL DISTRIBUTION

Intended Confidence   N    K1*  K2*  ξ_q−ξ_p**   Chu          L-COST       K-L

95%                   31   23   23   Q           3.62 (1)     1.71 (.88)   2.11 (.90)
                      51   39   29   Q           2.70 (1)     1.32 (.92)   1.53 (.92)
                      51   39   29   D           8.53 (.99)   3.27 (.91)   3.84 (.88)

99%                   31   23   23   Q           5.13 (1)     2.25 (.95)   3.07 (.96)
                      51   39   29   Q           3.37 (1)     1.74 (.96)   2.14 (.96)
                      51   39   29   D           8.86 (.99)   4.30 (.95)   5.38 (.94)

*Kaigh and Lachenbruch method only.
**Q = ξ.75 − ξ.25; D = ξ.90 − ξ.10.

CHAPTER V

EXAMPLE OF QUANTILE ESTIMATION METHODS

5.1 Introduction

In the preceding chapters, comparisons of various estimation methods have been performed using data simulated from known underlying distributions. These simulations provided a basis for assessing a constructed confidence interval's expected length and ability to achieve the desired confidence for a particular combination of sample size and function of quantiles. While the simulated intervals are useful for evaluating the merits of the various methods, an example is provided in this chapter to illustrate the variation among the point and confidence interval estimates obtained by applying the different methods to real data.

The data for this chapter are taken from the Lipid Research Clinics Program Prevalence Study (Davis, 1980), whose goal is improved understanding of heart disease. The study is cross-sectional, including data from subjects with widely differing socioeconomic and cultural backgrounds. For this example, separate random subsamples of 51 users and 51 nonusers of oral contraceptives were selected from among the female study participants aged 20 to 29. The variables included in this illustration are Total Cholesterol

(CHOL), Triglycerides (TRIG), and High-Density Lipoprotein Cholesterol (HDL-C). The example is not intended to provide a definitive analysis of the data in the sample, but rather to illustrate the methods for quantile estimation on a real data set.

5.2 Comparison of Results for the Example

Point estimates and confidence intervals were constructed for the median as well as for the interdecile and interquartile differences. Several different methods were compared for each class of oral contraceptive use and type of lipid.

Considering the median, Table 5.1 provides evidence that all four order statistic-based point estimation methods discussed in Chapter II provide estimates which are quite close to one another. The greatest variability occurred among estimates of median TRIG levels for women who do not use oral contraceptives; even here, the estimates were within 10% of each other. All methods indicate that nonusers of oral contraceptives have noticeably lower median CHOL and TRIG levels than users, but about the same levels of HDL-C.

Both the 95% and 99% confidence intervals for the median had similar values among the various methods. That is, the upper and lower endpoints for intervals constructed from all four methods agreed to a large extent. Once again, the exception was the TRIG levels for nonusers, specifically the lower endpoints of the intervals. These lower limits varied from 42.9 for the bootstrap up to 63 for the order statistic method for the 95% intervals, and from 35.7 up to 58 for the 99% intervals. The long-tailed distribution

of triglycerides among nonusers is probably a major factor in this discrepancy among methods. Interval lengths for median CHOL levels were somewhat greater for nonusers than for users of oral contraceptives regardless of estimator. The same held true for TRIG levels, but with a wider variation in interval lengths among methods, whereas interval lengths for median HDL-C were virtually independent of method and class of contraceptive use.

Estimates of the interdecile difference were constructed using the sample quantile, L-COST, and K-L methods as described in Chapter IV. The results are presented in Table 5.4. Essentially, the results are of the same nature as for the median in terms of variability among methods. The methods all yielded virtually identical values across all variables and contraceptive use classes, with the exception of TRIG level interdecile differences for nonusers. The range of variation in this case was smaller than for the median, however.

Confidence intervals for the interdecile difference were constructed using Chu's method, the L-COST method, and the K-L method for interquantile differences, which were also discussed in Chapter IV. The most striking feature of both Tables 5.5 and 5.6 is that the interval endpoints obtained using Chu's method differ widely from those of the L-COST and K-L methods, and result in a much longer interval than either of these more recent competitors. For instance, the length of the 95% confidence interval for the interdecile range of CHOL level among users of oral contraceptives is 115 when obtained by the Chu method, 41.7 by L-COST, and 38.7 by the K-L method. Mainly, this difference is due to Chu's method constructing very conservative intervals which exceed the desired confidence with high probability.
By contrast, the K-L and L-COST methods yield intervals that are generally in fairly close agreement (the exception again being a slightly wider disparity for TRIG measurements among nonusers of contraceptives) and much shorter than the intervals produced by Chu's method.

Finally, interquartile differences were computed and are presented in Table 5.7, and confidence intervals for interquartile differences are presented in Tables 5.8 and 5.9. As is readily apparent from Tables 5.8 and 5.9, the results for the interquartile difference have similar characteristics to those for the interdecile difference. A minor difference is the increased discrepancy between estimates obtained from the L-COST method and those from the other two point estimation methods considered for this difference. As the discrepancy is most pronounced only for nonusers' levels of CHOL, it might be attributed to the particular pattern in the data.

5.3 Conclusion

The example in this chapter suggests that, for the particular data used, different point estimators for medians or for interquartile and interdecile differences lead to only slightly varying results with real data, with only a somewhat more noticeable disagreement for the TRIG levels of nonusers of oral contraceptives. This may be an unusual situation, so the generally slight differences between results may lead one to choose the most precise method from which to actually report results. The L-COST method was demonstrated in Harrell and Davis (1982) to be of greater small-sample efficiency than the order statistic method for point estimation. The method for estimating interquantile differences which is based on the L-COST method was also shown (in Chapter IV) to be more efficient in small to moderate samples than either the K-L or the sample quantile method for interquantile differences. Thus, since precision of estimators is an important criterion for selecting a method, the L-COST method should probably be preferred.

For construction of confidence intervals for the median, the example shows that the methods were in fairly close agreement, with the exception of TRIG levels among nonusers, as was noted. Thus, from the results of this example, the methods are likely to give close results. The intervals for interquantile differences constructed using Chu's method were much broader than those constructed from either the K-L or the L-COST method. If one needed to be very conservative, one could use Chu's bounds; otherwise, the example illustrates how small the differences really are between the other methods. Thus, while there is no clear choice, the K-L and L-COST methods appear useful in examples such as the ones presented in this chapter.

TABLE 5.1 ESTIMATES OF MEDIANS, LIPID DATA, SAMPLE SIZE 51, BY USERS AND NONUSERS OF ORAL CONTRACEPTIVES

Lipid   Oral Contraceptive Use   Sample Quantile   L-COST   K-L (K=19)   Bootstrap

CHOL    USER                     205               204.8    204.8        205
        NONUSER                  178               177.4    177.5        178

TRIG    USER                     106               106.9    107.1        106
        NONUSER                   66                71.0     72.4         66

HDL-C   USER                      51                51.2     51.4         51
        NONUSER                   52                51.8     51.7         52

TABLE 5.2 LIMITS FOR 95% CONFIDENCE INTERVALS FOR MEDIAN OF LIPID DATA, SAMPLE SIZE 51, USERS AND NONUSERS OF ORAL CONTRACEPTIVES

Lipid   Oral Contraceptive Use   Limit*   Order Statistic**   L-COST   K-L (K=19)   Bootstrap

CHOL    USER                     L        193                 193.9    193.6        193.1
                                 U        217                 215.7    215.9        216.9
        NONUSER                  L        160                 166.1    164.9        165.1
                                 U        193                 188.7    190.2        190.9

TRIG    USER                     L        100                  98.0     98.0         96.0
                                 U        117                 115.9    116.2        115.9
        NONUSER                  L         63                  54.5     56.3         42.9
                                 U         95                  87.5     88.5         89.1

HDL-C   USER                     L         47                  47.1     46.8         46.1
                                 U         59                  55.3     56.0         55.9
        NONUSER                  L         47                  47.3     47.5         47.1
                                 U         55                  56.2     55.9         56.9

*L = Lower limit; U = Upper limit.
**L = X_(19); U = X_(33).

TABLE 5.3 LIMITS FOR 99% CONFIDENCE INTERVALS FOR MEDIAN OF LIPID DATA, SAMPLE SIZE 51, USERS AND NONUSERS OF ORAL CONTRACEPTIVES

Lipid   Oral Contraceptive Use   Limit*   Order Statistic**   L-COST   K-L (K=19)   Bootstrap

CHOL    USER                     L        184                 190.6    189.7        189.3
                                 U        220                 219.1    219.8        220.7
        NONUSER                  L        158                 162.6    160.5        161.1
                                 U        200                 192.3    194.6        194.9

TRIG    USER                     L         94                  95.2     94.8         92.9
                                 U        118                 118.7    119.3        119.1
        NONUSER                  L         58                  49.4     50.8         35.7
                                 U         96                  92.7     94.1         96.3

HDL-C   USER                     L         44                  45.8     45.2         44.5
                                 U         60                  56.6     57.6         57.5
        NONUSER                  L         46                  45.9     46.0         45.5
                                 U         57                  57.6     57.4         58.5

*L = Lower limit; U = Upper limit.
**L = X_(16); U = X_(35).

TABLE 5.4 ESTIMATES OF INTERDECILE DIFFERENCE, LIPID DATA, SAMPLE SIZE 51, USERS AND NONUSERS OF ORAL CONTRACEPTIVES

Lipid   Oral Contraceptive Use   Sample Quantile   L-COST   K-L (K1=39; K2=29)

CHOL    USER                      89.4              85.0     86.1
        NONUSER                   95.4              96.5     96.5

TRIG    USER                      94.0              91.5     91.4
        NONUSER                  142.8             133.7    141.1

HDL-C   USER                      42.6              41.0     41.9
        NONUSER                   30.6              30.5     31.5

TABLE 5.5 LIMITS FOR 95% CONFIDENCE INTERVALS ON INTERDECILE RANGE, LIPID DATA, SAMPLE SIZE 51, USERS AND NONUSERS OF ORAL CONTRACEPTIVES

Lipid   Oral Contraceptive Use   Limit*   Chu**   L-COST   K-L (K1=39; K2=29)

CHOL    USER                     L         51      64.2     66.7
                                 U        166     105.9    105.4
        NONUSER                  L         75      82.8     81.1
                                 U        155     110.1    111.9

TRIG    USER                     L         59      72.3     73.4
                                 U        283     110.7    109.5
        NONUSER                  L         66      81.9     77.5
                                 U        219     185.4    204.7

HDL-C   USER                     L         24      33.2     34.6
                                 U         75      48.7     49.2
        NONUSER                  L         19      23.8     23.6
                                 U         51      37.3     39.4

*L = Lower limit; U = Upper limit.
**For Chu's method, L = X_(v) − X_(u) and U = X_(s) − X_(r); see Table 4.3 for r, s, u, v.

TABLE 5.6 LIMITS FOR 99% CONFIDENCE INTERVALS ON INTERDECILE RANGE, LIPID DATA, SAMPLE SIZE 51, USERS AND NONUSERS OF ORAL CONTRACEPTIVES

Lipid   Oral Contraceptive Use   Limit*   Chu**   L-COST   K-L (K1=39; K2=29)

CHOL    USER                     L         45      57.7     58.9
                                 U        166     112.4    113.2
        NONUSER                  L         68      78.5     75.0
                                 U        155     114.4    118.1

TRIG    USER                     L         49      66.2     66.2
                                 U        283     116.8    116.7
        NONUSER                  L         61      65.7     52.0
                                 U        219     201.7    230.3

HDL-C   USER                     L         20      30.8     31.6
                                 U         70      51.1     52.2
        NONUSER                  L         16      21.7     20.4
                                 U         51      39.4     42.6

*L = Lower limit; U = Upper limit.
**For Chu's method, L = X_(v) − X_(u) and U = X_(s) − X_(r); see Table 4.3 for r, s, u, v.

TABLE 5.7 ESTIMATES OF INTERQUARTILE DIFFERENCE, LIPID DATA, SAMPLE SIZE 51, USERS AND NONUSERS OF ORAL CONTRACEPTIVES

Lipid   Oral Contraceptive Use   Sample Quantile   L-COST   K-L (K1=39; K2=29)

CHOL    USER                     45                44.1     46.1
        NONUSER                  68                63.7     68.6


TABLE 5.8 LIMITS FOR 95% CONFIDENCE INTERVALS ON INTERQUARTILE RANGE, LIPID DATA, SAMPLE SIZE 51, USERS AND NONUSERS OF ORAL CONTRACEPTIVES

Lipid   Oral Contraceptive Use   Limit*   Chu**   L-COST   K-L (K1=39; K2=29)

CHOL    USER                     L         19      34.3     35.7
                                 U         83      53.9     56.5
        NONUSER                  L         13      46.3     48.4
                                 U         89      81.1     88.8

TRIG    USER                     L         15      24.9     20.0
                                 U         86      68.3     76.9
        NONUSER                  L         30      39.1     36.3
                                 U        134      74.4     81.5

HDL-C   USER                     L          7      14.7     14.5
                                 U         41      26.7     28.9
        NONUSER                  L          7      11.2     11.2
                                 U         29      19.7     21.2

*L = Lower limit; U = Upper limit.
**For Chu's method, L = X_(v) − X_(u) and U = X_(s) − X_(r); see Table 4.3 for r, s, u, v.

TABLE 5.9 LIMITS FOR 99% CONFIDENCE INTERVALS ON INTERQUARTILE RANGE, LIPID DATA, SAMPLE SIZE 51, USERS AND NONUSERS OF ORAL CONTRACEPTIVES

Lipid   Oral Contraceptive Use   Limit*   Chu**   L-COST   K-L (K1=39; K2=29)

CHOL    USER                     L          7      31.3     31.5
                                 U         88      56.9     60.7
        NONUSER                  L          9      40.9     40.3
                                 U         97      86.6     96.9

TRIG    USER                     L          8      18.1      8.6
                                 U         96      75.1     88.3
        NONUSER                  L         18      33.5     27.2
                                 U        145      79.9     90.6

HDL-C   USER                     L          2      12.9     11.6
                                 U         44      28.5     31.8
        NONUSER                  L          4       9.8      9.2
                                 U         31      21.0     23.2

*L = Lower limit; U = Upper limit.
**For Chu's method, L = X_(v) − X_(u) and U = X_(s) − X_(r); see Table 4.3 for r, s, u, v.

CHAPTER VI

SUMMARY AND SUGGESTIONS FOR FURTHER RESEARCH

6.1 Summary

The research in this dissertation was motivated by the need for further understanding of the properties of proposed estimators for quantiles. Until this work, the L-COST and K-L estimators had been evaluated only in a limited way for use in confidence intervals. Their use had not been explored for estimating important functions of quantiles, such as the interquantile range, a useful measure of dispersion.

In Chapter I, the existing literature is reviewed, with an emphasis on the formation of confidence intervals for quantiles. Both parametric and nonparametric methods are considered, and estimators for quantile intervals and quantile differences are discussed.

Chapter II discusses the formation of confidence intervals based upon six potentially useful estimators for single quantiles. The methods used to determine average interval length and ability to preserve confidence are detailed for both simulated and exact intervals. Simulations were performed to compare the bootstrap, L-COST, K-L, and order statistic methods under five distributions for the data. Results showed that the L-COST interval would need an ordinate

other than one from the normal distribution to perform consistently well in small samples. The K-L method, as described, performs reasonably well for median estimation. Finally, the method based on order statistics produced longer intervals, both when simulated and (when possible) when calculated explicitly. It did, however, maintain the desired confidence quite well.

Theoretical developments needed to establish large-sample use of the normal distribution for estimators of the interquantile difference were presented in Chapter III. It was shown that both the K-L and L-COST methods could be used to form pivotal quantities with asymptotic normal distributions, and thus readily lend themselves to use in confidence intervals.

Chapter IV first compared the interquantile difference estimators based on the L-COST and K-L methods. As a point estimator, the L-COST method performed very well under most distributions. Confidence intervals based on these methods were also constructed and compared with the order statistic method of Chu (1957). Chu's method was found to yield very conservative, long intervals, and neither the K-L nor the L-COST method consistently provided intervals meeting the desired confidence with the particular ordinates selected. The K-L method was useful for intervals based on symmetric underlying distributions.

Finally, in Chapter V, an example using data from the Lipid Research Clinics Program was constructed to provide a simple illustration of how the methods discussed in Chapters II and IV compare when applied to a real data set. Generally, differences were slight; Chu's method for intervals of interquantile differences led to vastly different intervals, however. In some cases, characteristics of the different variables analyzed led to more noticeable variations among estimates.

6.2 Suggestions for Further Research

The research in the preceding chapters attempted to address the question of how well the L-COST and K-L estimators would perform when used for confidence interval construction, estimation of simple functions of quantiles, and estimation of confidence intervals for functions of quantiles. Further work in the following areas would provide additional information for evaluating the methods described.

i) Determine the distribution of the pivotal quantity based on the L-COST estimator in small samples. Determine the accuracy of an approximation by the normal distribution. If the appropriate distribution can be shown to be a t-distribution, determine a method for finding the degrees of freedom. Otherwise, empirically find those factors which closely yield the desired confidence for the intervals formed.

ii) Develop a randomized procedure such that either the observed confidence level or the interval length could be held constant, so as to permit uniform evaluation of the different methods.

iii) Develop an alternative variance estimator to one based on the Jackknife, and compare results obtained with those in this dissertation.

iv) Further investigate methods for selecting k for the K-L method.

v) Develop tighter bounds, based on Chu's (1957) method, which would better account for the probabilities previously not considered.

BIBLIOGRAPHY

ALI, MIR MASOOM, UMBACH, DALE, and HASSANEIN, KHATAB M. (1981), "Estimation of Quantiles of Exponential and Double Exponential Distributions Based on Two Order Statistics", Communications in Statistics, A10, 1921-1932.

ANGUS, J.E. and SCHAFER, R.E. (1979), "Estimation of Logistic Quantiles with Minimum Error in the Predicted Distribution Function", Communications in Statistics, A8, 1271-1284.

ARVESEN, JAMES N. (1969), "Jackknifing U-Statistics", Annals of Mathematical Statistics, 40, 2076-2100.

AZZALINI, A. (1981), "A Note on the Estimation of a Distribution Function and Quantiles by a Kernel Method", Biometrika, 68, 326-328.

BAUER, DAVID F. (1972), "Constructing Confidence Sets Using Rank Statistics", Journal of the American Statistical Association, 67, 687-690.

BROOKMEYER, RON and CROWLEY, JOHN (1982), "A Confidence Interval for the Median Survival Time", Biometrics, 38, 29-41.

CHENG, KUANG-FU (1982), "Jackknifing L-estimates", Canadian Journal of Statistics, 10, 49-58.

CHU, J.T. (1957), "Some Uses of Quasi-Ranges", Annals of Mathematical Statistics, 28, 173-180.

DAVID, HERBERT A. (1981), Order Statistics, Second Edition, New York: John Wiley.

DAVIS, C.E., et al. (1980), "Correlations of Plasma High-Density Lipoprotein Cholesterol Levels with Other Plasma Lipid and Lipoprotein Concentrations", Circulation, Part II, 62, IV-24 - IV-30.

DESU, MAHAMUNULU M. and RODINE, R.H. (1969), "Estimation of the Population Median", Skandinavisk Aktuarietidskrift, 67-70.

DYER, DANNY D. and KEATING, JEROME P. (1979), "A Further Look at the Comparison of Normal Percentile Estimation", Communications in Statistics, A8, 1-16.

DYER, D.D., KEATING, J.P., and HENSLEY, O.L. (1977), "Comparison of Point Estimators of Normal Percentiles", Communications in Statistics, B6, 269-283.

EFRON, B. (1979), "Bootstrap Methods: Another Look at the Jackknife", Annals of Statistics, 7, 1-26.

EFRON, BRADLEY (1981), "Censored Data and the Bootstrap", Journal of the American Statistical Association, 76, 312-319.

EKBLOM, H. (1973), "A Note on Nonlinear Median Estimators", Journal of the American Statistical Association, 68, 431-432.

EMERSON, JOHN D. (1982), "Nonparametric Confidence Intervals for the Median in the Presence of Right Censoring", Biometrics, 38, 17-27.

GIBBONS, J.D. (1971), Nonparametric Statistical Inference, New York: McGraw-Hill.

GREEN, J.R. (1969), "Inference Concerning Probabilities and Quantiles", Journal of the Royal Statistical Society, Ser. B, 31, 310-316.

GREENBERG, B.G. and SARHAN, A.E. (1962), "Exponential Distribution: Best Linear Unbiased Estimates", in Contributions to Order Statistics, eds. A.E. Sarhan and B.G. Greenberg, New York: John Wiley, 352-360.

GUILBAUD, OLIVIER (1979), "Interval Estimation of the Median of a General Distribution", Scandinavian Journal of Statistics, 6, 29-36.

HARRELL, FRANK E., JR. and DAVIS, C.E. (1982), "A New Distribution-Free Quantile Estimator", Biometrika, 69, 635-640.

HARTER, LEON (1961), "Expected Values of Normal Order Statistics", Biometrika, 48, 151-165.

HARTIGAN, J.A. (1969), "Using Subsample Values as Typical Values", Journal of the American Statistical Association, 64, 1303-1317.

HARVARD UNIVERSITY COMPUTATION LABORATORY (1955), Tables of the Cumulative Binomial Probability Distribution, Cambridge: Harvard University Press.

HOGG, ROBERT V. and CRAIG, ALLEN T. (1978), Introduction to Mathematical Statistics, Fourth Edition, New York: Macmillan.

JENNETT, W.J. and WELCH, B.L. (1939), "The Control of Proportion Defective as Judged by a Single Quality Characteristic Varying on a Continuous Scale", Journal of the Royal Statistical Society, Supplement, 6, 80-88.

KAIGH, W.D. (1982), "Quantile Interval Estimation", unpublished manuscript.

KAIGH, W.D. and LACHENBRUCH, PETER A. (1982), "A Generalized Quantile Estimator", Communications in Statistics, A11, 2217-2238.

KREWSKI, DANIEL (1976), "Distribution-Free Confidence Intervals for Quantile Intervals", Journal of the American Statistical Association, 71, 420-422.

KUBAT, PETER and EPSTEIN, BENJAMIN (1980), "Estimation of Quantiles of Location-Scale Distributions Based on Two or Three Order Statistics", Technometrics, 22, 575-581.

LANKE, JAN (1974), "Interval Estimation of a Median", Scandinavian Journal of Statistics, 1, 28-32.

LAWLESS, J.F. (1975), "Construction of Tolerance Bounds for the Extreme-Value and Weibull Distributions", Technometrics, 17, 255-261.

LEVER, W.E. (1969), "Note: Confidence Limits for Quantiles of Mortality Distributions", Biometrics, 25, 176-178.

MANN, NANCY R. and FERTIG, KENNETH W. (1975), "Simplified Efficient Point and Interval Estimators for Weibull Parameters", Technometrics, 17, 361-368.

MANN, NANCY R. and FERTIG, KENNETH W. (1977), "Efficient Unbiased Quantile Estimators for Moderate-Size Complete Samples from Extreme-Value and Weibull Distributions; Confidence Bounds and Tolerance and Prediction Intervals", Technometrics, 19, 87-93.

MARITZ, J.S. and JARRETT, R.G. (1978), "A Note on Estimating the Variance of the Sample Median", Journal of the American Statistical Association, 73, 194-196.

MOOD, ALEXANDER M., GRAYBILL, FRANKLIN A., and BOES, DUANE C. (1974), Introduction to the Theory of Statistics, Third Edition, New York: McGraw-Hill.

MOSES, LINCOLN E. (1965), "Queries: Confidence Limits from Rank Tests", Technometrics, 7, 257-260.

NAIR, K.R. (1940), "Tables of Confidence Intervals for the Median in Samples from any Continuous Population", Sankhya, 4, 551-558.

NOETHER, GOTTFRIED E. (1948), "On Confidence Limits for Quantiles", Annals of Mathematical Statistics, 19, 416-419.

NOETHER, GOTTFRIED E. (1973), "Some Simple Distribution-Free Confidence Intervals for the Center of a Symmetric Distribution", Journal of the American Statistical Association, 68, 716-719.

OGAWA, JUNJIRO (1962), "Distribution and Moments of Order Statistics", in Contributions to Order Statistics, eds. A.E. Sarhan and B.G. Greenberg, New York: John Wiley, 11-19.

OWEN, DON B. (1968), "A Survey of Properties and Applications of the Noncentral t-Distribution", Technometrics, 10, 445-478.

PARR, WILLIAM C. and SCHUCANY, WILLIAM R. (1982), "Jackknifing L-Statistics with Smooth Weight Functions", Journal of the American Statistical Association, 77, 629-638.

REID, NANCY (1981), "Estimating the Median Survival Time", Biometrika, 68, 601-608.

REISS, ROLF D. and RUSCHENDORF, LUDGER (1976), "On Wilks' Distribution-Free Confidence Intervals for Quantile Intervals", Journal of the American Statistical Association, 71, 940-944.

ROBERTSON, C.A. (1977), "Estimation of Quantiles of Exponential Distributions with Minimum Error in Predicted Distribution Functions", Journal of the American Statistical Association, 72, 162-164.

RUDIN, WALTER (1976), Principles of Mathematical Analysis, New York: McGraw-Hill.

RUKHIN, ANDREW L. and STRAWDERMAN, WILLIAM E. (1982), "Estimating a Quantile of an Exponential Distribution", Journal of the American Statistical Association, 77, 159-162.

SARHAN, A.E. (1954), "Estimation of the Mean and Standard Deviation by Order Statistics", Annals of Mathematical Statistics, 25, 317-328.

SARHAN, A.E. and GREENBERG, B.G. (1962), "Other Distributions: Rectangular Distribution", in Contributions to Order Statistics, eds. A.E. Sarhan and B.G. Greenberg, New York: John Wiley, 383-390.

SAS INSTITUTE (1979), SAS User's Guide, 1979 Edition, Raleigh: SAS Institute.

SATHE, Y.S. and LINGRAS, S.R. (1981), "Bounds for the Confidence Coefficients of Outer and Inner Confidence Intervals for Quantile Intervals", Journal of the American Statistical Association, 76, 473-475.

SAVUR, S.R. (1937), "The Use of the Median in Tests of Significance", Proceedings of the Indian Academy of Science, Section A, 5, 564-576.

SCHAFER, R.E. and ANGUS, J.E. (1979), "Estimation of Weibull Quantiles with Minimum Error in the Distribution Function", Technometrics, 21, 367-370.

SCHEFFE, HENRY (1943), "Statistical Inference in the Nonparametric Case", Annals of Mathematical Statistics, 14, 305-332.

SCHEFFE, H. and TUKEY, J.W. (1945), "Nonparametric Estimation. I. Validation of Order Statistics", Annals of Mathematical Statistics, 16, 187-192.

SCHMEISER, BRUCE W. (1975), "On Monte Carlo Distribution Sampling, with Application to the Component Randomization Test", Ph.D. Dissertation, Georgia Institute of Technology, Atlanta.

SEDRANSK, J. and MEYER, J. (1978), "Confidence Intervals for the Quantiles of a Finite Population: Simple Random and Stratified Simple Random Sampling", Journal of the Royal Statistical Society, Ser. B, 40, 239-252.

SEN, P.K. (1982), "Jackknifing L-Estimators: Affine Structure and Asymptotics", Institute of Statistics Mimeo Series No. 1415, The University of North Carolina, Chapel Hill, North Carolina.

SERFLING, ROBERT J. (1980), Approximation Theorems of Mathematical Statistics, New York: John Wiley.

STIGLER, S.M. (1969), "Linear Functions of Order Statistics", Annals of Mathematical Statistics, 40, 770-788.

THOMPSON, WILLIAM R. (1936), "On Confidence Ranges and Other Expectation Distributions for Populations of Unknown Distribution Form", Annals of Mathematical Statistics, 7, 122-128.

TUKEY, JOHN W. (1958), "Bias and Confidence in Not-Quite Large Samples (Abstract)", Annals of Mathematical Statistics, 29, 614.

UMBACH, DALE, ALI, MIR MASOOM, and HASSANEIN, KHATAB M. (1981), "Estimating Pareto Quantiles Using Two Order Statistics", Communications in Statistics, A10, 1933-1941.

WALSH, JOHN E. (1958), "Efficient Small Sample Nonparametric Median Tests with Bounded Significance Levels", Annals of the Institute of Statistical Mathematics, Tokyo, 9, 185-199.

WEISS, LIONEL (1960), "Confidence Intervals of Preassigned Length for Quantiles of Unimodal Populations", Naval Research Logistics Quarterly, 7, 251-256.

WEISSMAN, ISHAY (1978), "Estimation of Parameters and Large Quantiles Based on the k Largest Observations", Journal of the American Statistical Association, 73, 812-815.

WILKS, S.S. (1948), "Order Statistics", Bulletin of the American Mathematical Society, Series 2, 54, 6-50.

WILKS, SAMUEL S. (1962), Mathematical Statistics, New York: John Wiley.

WILLEMAIN, THOMAS R. (1980), "Estimating the Population Median by Nomination Sampling", Journal of the American Statistical Association, 75, 908-911.

ZIDEK, JAMES V. (1971), "Inadmissibility of a Class of Estimators of a Normal Quantile", Annals of Mathematical Statistics, 42, 1444-1447.

L-COST Quantile Estimation Program

PROC MATRIX;
* L-COST QUANTILE ESTIMATION PROGRAM;
* N IS THE SAMPLE SIZE, P REFERS TO P-TH QUANTILE;
N=11;  * THESE LINES;
P=.5;  * WILL VARY;
*** CONSTANTS FOR THE INCOMPLETE BETA;
A1=P*(N+1);
A2=P*N;
B1=(1-P)*(N+1);
B2=(1-P)*N;
*** INITIALIZE W'S, LAMBDA, AND THE U VECTOR;
W1=J(N,1,0);
W2=J(N,1,0);
W3=J(N,1,0);
U1=J(N,1,0);
LAMBDA=J(N,N,0);
*** FORM W'S AND THE U VECTOR;
DO I=1 TO N;
  W1(I,)=PROBBETA(I#/N,A1,B1)-PROBBETA((I-1)#/N,A1,B1);
  IF (I > 1) THEN
    W2(I,)=PROBBETA((I-1)#/(N-1),A2,B2)-PROBBETA((I-2)#/(N-1),A2,B2);
  ELSE W2(I,)=0;
  IF (I < N) THEN
    W3(I,)=PROBBETA(I#/(N-1),A2,B2)-PROBBETA((I-1)#/(N-1),A2,B2);
  ELSE W3(I,)=0;
  *** FORM THE U VECTOR;
  D=W2(I,);
  E=W3(I,);
  U1(I,)=(I-1)*D + (N-I)*E;
END;  * OF I LOOP;
*** CONSTRUCT LAMBDA FROM VARIOUS W'S;
DO L=1 TO N;
  DO M=1 TO N;
    *** SEPARATE W'S FOR INDEX L AND M;
    FL=W2(L,);
    GL=W3(L,);
    FM=W2(M,);
    GM=W3(M,);

    *** FILL LAMBDA WITH CORRECT COMBINATION OF W'S;
    IF L=M THEN LAMBDA(L,M)=(L-1)*(FL**2)+(N-L)*(GL**2);
    IF M > L THEN LAMBDA(L,M)=(L-1)*FL*FM + (M-L-1)*GL*FM + (N-M)*GL*GM;
    IF M < L THEN LAMBDA(L,M)=LAMBDA(M,L);
  END;  * OF M LOOP;
END;  * OF L LOOP;
* INDATA IS A SAS DATASET CONTAINING;
* ONLY THE VARIABLE OF INTEREST;
FETCH X DATA=INDATA;
A=X;
* OBTAIN THE ORDER STATISTICS;
A4=A;
B4=A4;
Y=RANK(A4);
A4(Y,)=B4;
* QP IS THE ESTIMATOR OF THE P-TH QUANTILE;
QP=W1'*A4;
* FORM THE MEAN OF THE S'S;
S=(1#/N)*U1'*A4;
* FORM THE SUM OF S-SUB-J SQUARED;
V1=A4'*LAMBDA*A4;
* SBAR2=SQUARE(MEAN OF THE S'S);
SBAR2=S##2;
* FORM THE VARIANCE AND STD DEVIATION;
VARQP=(N-1)*((1#/N)*V1 - SBAR2);
SDEVQP=VARQP##.5;
* FORM INTERVAL LENGTHS;
LEN95=2*1.96*SDEVQP;
LEN99=2*2.575*SDEVQP;
* UPPERXX AND LOWERXX ARE;
* UPPER AND LOWER .XX CONFIDENCE LIMITS;
UPPER95=QP + 1.96*SDEVQP;
LOWER95=QP - 1.96*SDEVQP;
UPPER99=QP + 2.575*SDEVQP;
LOWER99=QP - 2.575*SDEVQP;
PRINT N P QP LEN95 LEN99 UPPER95 LOWER95 UPPER99 LOWER99;
TITLE1 HARRELL AND DAVIS (1982) L-COST ESTIMATOR;
TITLE2 SAMPLE SIZE=11;
TITLE3 QUANTILE: P=.5;
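For readers without access to PROC MATRIX, the point estimate computed above can be sketched in modern Python. This is an illustrative translation, not part of the original appendix; the function names are hypothetical, and `reg_inc_beta` is a stand-in for the PROBBETA values used above, obtained here by Simpson-rule integration of the Beta density.

```python
import math

def reg_inc_beta(x, a, b, steps=2000):
    """Regularized incomplete beta I_x(a, b), approximated by
    Simpson's rule on the Beta(a, b) density (stands in for
    the SAS PROBBETA function)."""
    if x <= 0.0:
        return 0.0
    lognc = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)

    def pdf(t):
        if t <= 0.0 or t >= 1.0:
            return 0.0
        return math.exp(lognc + (a - 1) * math.log(t) + (b - 1) * math.log(1 - t))

    h = x / steps
    s = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * pdf(i * h)
    return s * h / 3.0

def harrell_davis(data, p):
    """Harrell-Davis (L-COST) quantile point estimate: the order
    statistics weighted by increments of the Beta(p(n+1), (1-p)(n+1))
    CDF over ((i-1)/n, i/n], as in the W1 vector above."""
    x = sorted(data)
    n = len(x)
    a, b = p * (n + 1), (1 - p) * (n + 1)
    est = 0.0
    for i in range(1, n + 1):
        w = reg_inc_beta(i / n, a, b) - reg_inc_beta((i - 1) / n, a, b)
        est += w * x[i - 1]
    return est
```

With the program's settings N=11 and P=.5, the weights are symmetric, so for the data 1, ..., 11 the estimate is the middle value 6.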

K-L Quantile Estimation Program

PROC MATRIX;
* K-L ESTIMATION PROGRAM;
* READ IN N, K, P;
N=51;
P=.5;
K=19;
DIFF1=N-K;
R=INT((K+1)*P);
F1=SQRT(1#/N);
* INITIALIZE THE WEIGHT VECTOR;
WEIGHT=J(N,1,0);
* FILL THE WEIGHT VECTOR WITH PROPER WEIGHTS;
NFACT=GAMMA(N+1);
KFACT=GAMMA(K+1);
NKFACT=GAMMA(N-K+1);
NCHOOSEK=NFACT#/(KFACT*NKFACT);
DO J=R TO N+R-K;
  J1FACT=GAMMA(J);
  R1FACT=GAMMA(R);
  JRFACT=GAMMA(J-R+1);
  J1CHR1=J1FACT#/(R1FACT*JRFACT);
  NJFACT=GAMMA(N-J+1);
  KRFACT=GAMMA(K-R+1);
  NJKRFACT=GAMMA(N-J-K+R+1);
  NJCHKR=NJFACT#/(KRFACT*NJKRFACT);
  WEIGHT(J,)=(J1CHR1*NJCHKR)#/NCHOOSEK;
END;  * OF J LOOP (CONSTRUCTING WEIGHTS);
* INITIALIZE JACKKNIFE WEIGHT VECTOR;
WEIGHT2=J(N-1,1,0);
* FILL THE WEIGHT VECTOR WITH PROPER WEIGHTS;
NFACT2=GAMMA(N);
KFACT2=GAMMA(K+1);
NKFACT2=GAMMA(N-K);
NCHUZK2=NFACT2#/(KFACT2*NKFACT2);
DO J1=R TO N+R-K-1;
  J2FACT=GAMMA(J1);
  R2FACT=GAMMA(R);
  JRFACT2=GAMMA(J1-R+1);
  J2CHR1=J2FACT#/(R2FACT*JRFACT2);
  NJFACT2=GAMMA(N-J1);
  KRFACT2=GAMMA(K-R+1);
  NJKRFAC2=GAMMA(N-J1-K+R);
  NJCHKR2=NJFACT2#/(KRFACT2*NJKRFAC2);
  WEIGHT2(J1,)=(J2CHR1*NJCHKR2)#/NCHUZK2;
END;  * OF J1 LOOP (CONSTRUCTING WEIGHTS WITH ONE DELETION);

* INDATA IS A SAS DATASET CONTAINING ONLY;
* THE VARIABLE OF INTEREST;
FETCH X DATA=INDATA;
A=X;
B=A;
Y=RANK(A);
A(Y,)=B;
* A IS NOW A VECTOR OF ORDER STATISTICS;
* FORM THE K-L ESTIMATOR;
EST1=WEIGHT'*A;
* FORM THE JACKKNIFED ESTIMATES;
JACKEST=J(N,1,0);
DO I=1 TO N;
  IF I=1 THEN SHORTER=A(2:N,);
  IF ((I > 1) AND (I < N)) THEN SHORTER=A(1:(I-1),)//A((I+1):N,);
  IF I=N THEN SHORTER=A(1:(N-1),);
  JACKEST(I,)=WEIGHT2'*SHORTER;
END;  * OF I LOOP FOR JACKKNIFE ESTIMATES;
MEAN1=SUM(JACKEST)#/N;
VECMEAN1=J(N,1,MEAN1);
DIFF=JACKEST-VECMEAN1;
SDEV1=((DIFF'*DIFF)*(N-1)#/N)##.5;
* T-VALUES FOR N-K=32 DEGREES OF FREEDOM;
T1=2.0369;
T2=2.7385;
* FORM LENGTHS OF INTERVALS;
LEN95=2*T1*SDEV1;
LEN99=2*T2*SDEV1;
* UPPERXX AND LOWERXX ARE UPPER AND;
* LOWER .XX CONFIDENCE LIMITS;
UPPER95=EST1+T1*SDEV1;
LOWER95=EST1-T1*SDEV1;
UPPER99=EST1+T2*SDEV1;
LOWER99=EST1-T2*SDEV1;

PRINT P N K EST1 UPPER95 LOWER95 UPPER99 LOWER99 LEN95 LEN99;
TITLE1 KAIGH AND LACHENBRUCH METHOD;
TITLE2 SAMPLE SIZE 51;
TITLE3 QUANTILE: P=.5;
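The weight construction above can be written more directly with exact binomial coefficients in place of the GAMMA-function ratios. The following Python sketch of the K-L point estimate is illustrative only (the function name is hypothetical, and the jackknife variance step is omitted):

```python
from math import comb

def kl_quantile(data, p, k):
    """Kaigh-Lachenbruch generalized quantile point estimate: the
    average of the r-th order statistic over all size-k subsamples,
    computed via hypergeometric weights C(j-1,r-1)C(n-j,k-r)/C(n,k)."""
    x = sorted(data)
    n = len(x)
    r = int((k + 1) * p)          # same rule as R=INT((K+1)*P) above
    est = 0.0
    for j in range(r, n - k + r + 1):
        w = comb(j - 1, r - 1) * comb(n - j, k - r) / comb(n, k)
        est += w * x[j - 1]
    return est
```

Two sanity checks: with k = n the full sample is the only subsample, so the estimate is exactly the r-th order statistic; and since the weights sum to one (a Vandermonde identity), constant data return that constant.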

H-D Interquantile Difference Program

PROC MATRIX;
* L-COST INTERQUANTILE DIFFERENCE PROGRAM;
* INSERT N, P, Q VALUES;
N=51;
P=.10;
Q=.90;
* DEFINE CONSTANTS FOR THE INCOMPLETE BETA;
A1=Q*(N+1);
B1=(1-Q)*(N+1);
A2=P*(N+1);
B2=(1-P)*(N+1);
A3=Q*N;
B3=(1-Q)*N;
A4=P*N;
B4=(1-P)*N;
* INITIALIZE WEIGHTS;
W1=J(N,1,0);
W2=J(N,1,0);
W3=J(N,1,0);
D=J(N,N,0);
* CONSTRUCT WEIGHTS FOR THE POINT ESTIMATOR;
DO I=1 TO N;
  W1(I,)=PROBBETA(I#/N,A1,B1)-PROBBETA((I-1)#/N,A1,B1)
        -PROBBETA(I#/N,A2,B2)+PROBBETA((I-1)#/N,A2,B2);
END;  * OF LOOP TO CALCULATE WEIGHTS FOR THE POINT ESTIMATOR;
* CALCULATE WEIGHTS FOR THE JACKKNIFE VARIANCE ESTIMATOR;
DO J=1 TO N;
  IF J > 1 THEN DO;
    DO I=1 TO J-1;
      D(I,J)=PROBBETA(I#/(N-1),A3,B3)
            -PROBBETA((I-1)#/(N-1),A3,B3)
            -PROBBETA(I#/(N-1),A4,B4)
            +PROBBETA((I-1)#/(N-1),A4,B4);
    END;  * OF I LOOP;
  END;  * OF IF-THEN GROUP;
  IF J < N THEN DO;
    DO I=J+1 TO N;
      D(I,J)=PROBBETA((I-1)#/(N-1),A3,B3)
            -PROBBETA((I-2)#/(N-1),A3,B3)
            -PROBBETA((I-1)#/(N-1),A4,B4)
            +PROBBETA((I-2)#/(N-1),A4,B4);

    END;  * OF I LOOP;
  END;  * OF IF-THEN GROUP;
END;  * OF J LOOP;
* SPECIFY T1 AND T2;
T1=1.96;
T2=2.575;
* INITIALIZE VECTORS;
DJ=J(N,1,0);
DBAR=J(N,1,0);
* INDATA IS A SAS DATASET CONTAINING;
* ONLY THE VARIABLE OF INTEREST;
FETCH X DATA=INDATA;
A=X;
BVEC=A;
Y=RANK(A);
A(Y,)=BVEC;
* NOW A CONTAINS THE ORDER STATISTICS;
* FORM THE ESTIMATOR FOR THE SAMPLE;
QP=W1'*A;
* COMPUTE THE N D-SUB-J VALUES;
DO J=1 TO N;
  DJ(J,)=D(,J)'*A;
END;  * OF J LOOP;
* OBTAIN DBAR;
SUMDJ=SUM(DJ);
DBARELEM=SUMDJ#/N;
DBAR=J(N,1,DBARELEM);
* OBTAIN THE VARIANCE AND STD DEV OF THE ESTIMATOR;
VARQP=((N-1)#/N)*((DJ-DBAR)'*(DJ-DBAR));
SDEVQP=VARQP##.5;
* FORM LENGTHS OF INTERVALS;
LEN95=2*T1*SDEVQP;
LEN99=2*T2*SDEVQP;
* UPPERXX AND LOWERXX ARE ENDPOINTS;
* OF THE .XX CONFIDENCE INTERVAL;
UPPER95=QP+T1*SDEVQP;
LOWER95=QP-T1*SDEVQP;
UPPER99=QP+T2*SDEVQP;
LOWER99=QP-T2*SDEVQP;
PRINT N P Q QP LEN95 LEN99 UPPER95 LOWER95 UPPER99 LOWER99;
TITLE1 L-COST INTERQUANTILE DISTANCE ESTIMATOR;
TITLE2 SAMPLE SIZE=51;
TITLE3 INTERDECILE DIFFERENCE;
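As a cross-check on the W1 construction above, the point estimator is simply the order statistics weighted by the difference of two Harrell-Davis weight vectors, one for q and one for p, so the combined weights sum to zero. A Python sketch (illustrative only; the Beta CDF increments that PROBBETA supplies are approximated here by midpoint-rule integration, and the function names are hypothetical):

```python
import math

def beta_weights(n, p, grid=20000):
    """Harrell-Davis weight vector for the p-th quantile: increments
    of the Beta(p(n+1), (1-p)(n+1)) CDF over ((i-1)/n, i/n], with the
    CDF built by midpoint-rule integration of the density."""
    a, b = p * (n + 1), (1 - p) * (n + 1)
    lognc = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    h = 1.0 / grid
    cdf = [0.0]
    total = 0.0
    for m in range(grid):
        t = (m + 0.5) * h
        total += math.exp(lognc + (a - 1) * math.log(t)
                          + (b - 1) * math.log(1 - t)) * h
        cdf.append(total)
    # read the approximate CDF at the points i/n, i = 0..n
    pts = [cdf[round(i * grid / n)] for i in range(n + 1)]
    return [pts[i] - pts[i - 1] for i in range(1, n + 1)]

def interquantile(data, p, q):
    """L-COST interquantile-difference point estimate: order
    statistics weighted by the difference of the q- and p-weights."""
    x = sorted(data)
    n = len(x)
    wq, wp = beta_weights(n, q), beta_weights(n, p)
    return sum((wq[i] - wp[i]) * x[i] for i in range(n))
```

With the program's settings N=51, P=.10, Q=.90, each weight vector sums to (approximately) one, and the interdecile estimate for the data 1, ..., 51 lands near the true interdecile spread of about 41.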

K-L Interquantile Difference Program

PROC MATRIX;
* K-L INTERQUANTILE DIFFERENCE PROGRAM;
* READ IN NEEDED PARAMETER VALUES;
* K1, R1 CORRESPOND TO THE Q-TH QUANTILE;
* K2, R2 CORRESPOND TO THE P-TH QUANTILE;
P=.25;
Q=.75;
N=51;  * PARAMETER VALUES WILL VARY;
IF N=31 THEN DO;
  K1=23;
  K2=23;
END;
IF N=51 THEN DO;
  K1=39;
  K2=29;
END;
DIFF1=N-K1;
DIFF2=N-K2;
IF (DIFF1 > DIFF2) THEN DIFFM=DIFF2;
ELSE DIFFM=DIFF1;
R1=INT((K1+1)*Q);
R2=INT((K2+1)*P);
* T1 AND T2 ARE T-VALUES FOR DIFFM DEGREES OF FREEDOM;
IF DIFFM=8 THEN DO; T1=2.3060; T2=3.3554; END;
IF DIFFM=12 THEN DO; T1=2.1788; T2=3.0545; END;
IF DIFFM=17 THEN DO; T1=2.1098; T2=2.8982; END;
IF DIFFM=22 THEN DO; T1=2.0739; T2=2.8188; END;
IF DIFFM=28 THEN DO; T1=2.0484; T2=2.7633; END;
IF DIFFM=32 THEN DO; T1=2.0369; T2=2.7385; END;
* INITIALIZE THE WEIGHT VECTORS;
WEIGHT1Q=J(N,1,0);
WEIGHT1P=J(N,1,0);
WEIGHT2Q=J((N-1),1,0);
WEIGHT2P=J((N-1),1,0);

* COMPUTE THE WEIGHTS FOR Q-TH QUANTILE, FULL SAMPLE;
NFACT1=GAMMA(N+1);
K1FACT1=GAMMA(K1+1);
NK1FACT1=GAMMA(N-K1+1);
NCHUZK11=NFACT1#/(K1FACT1*NK1FACT1);
DO J1=R1 TO N+R1-K1;
  J1FACT1=GAMMA(J1);
  R1FACT1=GAMMA(R1);
  J1R1FAC1=GAMMA(J1-R1+1);
  J1CHR11=J1FACT1#/(R1FACT1*J1R1FAC1);
  NJ1FACT1=GAMMA(N-J1+1);
  K1R1FAC1=GAMMA(K1-R1+1);
  NJKRFAQ1=GAMMA(N-J1-K1+R1+1);
  NJQ1K1R1=NJ1FACT1#/(K1R1FAC1*NJKRFAQ1);
  WEIGHT1Q(J1,)=(J1CHR11*NJQ1K1R1)#/NCHUZK11;
END;  * OF J1 LOOP FOR FULL SAMPLE Q-TH QUANTILE WEIGHTS;
* COMPUTE WEIGHTS FOR THE P-TH QUANTILE, FULL SAMPLE;
NFACT1=GAMMA(N+1);
K2FACT1=GAMMA(K2+1);
NK2FACT1=GAMMA(N-K2+1);
NCHUZK21=NFACT1#/(K2FACT1*NK2FACT1);
DO J2=R2 TO N+R2-K2;
  J2FACT1=GAMMA(J2);
  R2FACT1=GAMMA(R2);
  J2R2FAC1=GAMMA(J2-R2+1);
  J2CHR21=J2FACT1#/(R2FACT1*J2R2FAC1);
  NJ2FACT1=GAMMA(N-J2+1);
  K2R2FAC1=GAMMA(K2-R2+1);
  NJKRFAP1=GAMMA(N-J2-K2+R2+1);
  NJP1K2R2=NJ2FACT1#/(K2R2FAC1*NJKRFAP1);
  WEIGHT1P(J2,)=(J2CHR21*NJP1K2R2)#/NCHUZK21;
END;  * OF J2 LOOP OVER FULL SAMPLE P-TH QUANTILE WEIGHTS;
* COMPUTE WEIGHTS FOR Q-TH QUANTILE, REDUCED SAMPLE;
NFACT2=GAMMA(N);
K1FACT2=GAMMA(K1+1);
NK1FACT2=GAMMA(N-K1);
NCHUZK12=NFACT2#/(K1FACT2*NK1FACT2);
DO J3=R1 TO N+R1-K1-1;
  J3FACT2=GAMMA(J3);
  R1FACT2=GAMMA(R1);
  J3R1FAC2=GAMMA(J3-R1+1);
  J3CHR12=J3FACT2#/(R1FACT2*J3R1FAC2);
  NJ3FACT2=GAMMA(N-J3);
  K1R1FAC2=GAMMA(K1-R1+1);
  NJKRFAQ2=GAMMA(N-J3-K1+R1);
  NJQ2K1R1=NJ3FACT2#/(K1R1FAC2*NJKRFAQ2);
  WEIGHT2Q(J3,)=(J3CHR12*NJQ2K1R1)#/NCHUZK12;

END;  * OF J3 LOOP FOR REDUCED SAMPLE Q-TH QUANTILE WEIGHTS;
* COMPUTE WEIGHTS FOR P-TH QUANTILE, REDUCED SAMPLE;
NFACT2=GAMMA(N);
K2FACT2=GAMMA(K2+1);
NK2FACT2=GAMMA(N-K2);
NCHUZK22=NFACT2#/(K2FACT2*NK2FACT2);
DO J4=R2 TO N+R2-K2-1;
  J4FACT2=GAMMA(J4);
  R2FACT2=GAMMA(R2);
  J4R2FAC2=GAMMA(J4-R2+1);
  J4CHR22=J4FACT2#/(R2FACT2*J4R2FAC2);
  NJ4FACT2=GAMMA(N-J4);
  K2R2FAC2=GAMMA(K2-R2+1);
  NJKRFAP2=GAMMA(N-J4-K2+R2);
  NJP2K2R2=NJ4FACT2#/(K2R2FAC2*NJKRFAP2);
  WEIGHT2P(J4,)=(J4CHR22*NJP2K2R2)#/NCHUZK22;
END;  * OF J4 LOOP FOR REDUCED SAMPLE P-TH QUANTILE WEIGHTS;
JACKEST=J(N,1,0);
* INDATA IS A SAS DATASET CONTAINING ONLY THE;
* VARIABLE OF INTEREST;
FETCH X DATA=INDATA;
A=X;
B=A;
Y=RANK(A);
A(Y,)=B;
* A IS NOW A VECTOR OF ORDER STATISTICS;
* FORM THE K-L ESTIMATOR OF INTERQUANTILE DIFFERENCE;
EST1=(WEIGHT1Q'*A)-(WEIGHT1P'*A);
* FORM THE JACKKNIFE ESTIMATES;
DO I=1 TO N;
  IF I=1 THEN SHORTER=A(2:N,);
  IF ((I > 1) AND (I < N)) THEN SHORTER=A(1:(I-1),)//A((I+1):N,);
  IF I=N THEN SHORTER=A(1:(N-1),);
  JACKEST(I,)=(WEIGHT2Q'*SHORTER)-(WEIGHT2P'*SHORTER);
END;  * OF I LOOP FOR JACKKNIFE ESTIMATES;
MEAN1=SUM(JACKEST)#/N;
VECMEAN1=J(N,1,MEAN1);
DIFF=JACKEST-VECMEAN1;
SDEV1=((DIFF'*DIFF)*(N-1)#/N)##.5;
* FORM THE LENGTHS OF THE INTERVALS;
LEN95=2*T1*SDEV1;
LEN99=2*T2*SDEV1;
* UPPERXX AND LOWERXX ARE UPPER AND;
* LOWER .XX CONFIDENCE LIMITS;
UPPER95=EST1+T1*SDEV1;
LOWER95=EST1-T1*SDEV1;
UPPER99=EST1+T2*SDEV1;
LOWER99=EST1-T2*SDEV1;

PRINT N Q P K1 K2 EST1 LEN95 LEN99 UPPER95 LOWER95 UPPER99 LOWER99;
TITLE1 K-L INTERQUANTILE DISTANCE ESTIMATOR;
TITLE2 SAMPLE SIZE=51;
TITLE3 INTERQUANTILE DIFFERENCE;
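The variance computation shared by these programs is the ordinary delete-one jackknife, SDEV1 = sqrt(((N-1)/N) * sum of (theta(i) - theta-bar) squared). A generic Python sketch follows (illustrative only; `jackknife_se` is a hypothetical name, and `estimator` may be any function of a sample):

```python
import math

def jackknife_se(data, estimator):
    """Delete-one jackknife standard error, as in the SDEV1
    computation above: recompute the estimator on each leave-one-out
    sample, then take sqrt(((n-1)/n) * sum((theta_i - theta_bar)^2))."""
    n = len(data)
    thetas = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    tbar = sum(thetas) / n
    return math.sqrt((n - 1) / n * sum((t - tbar) ** 2 for t in thetas))
```

For the sample mean this reduces exactly to s / sqrt(n), which makes a convenient correctness check; a confidence interval is then formed as EST1 plus or minus T1 times the standard error, as in the programs above.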