
The Standard Error of Percentiles. Author(s): W. Duane Evans. Source: Journal of the American Statistical Association, Vol. 37, No. 219 (Sep., 1942), pp. 367-376. Published by: American Statistical Association. Stable URL: http://www.jstor.org/stable/2279005


THE STANDARD ERROR OF PERCENTILES

BY W. DUANE EVANS
U. S. Bureau of Labor Statistics

IN RECENT STUDIES, percentiles of various kinds have been much used to demonstrate economic differences between segments of populations. This use has been specially marked in surveys of the distribution of persons or families according to their incomes.

Percentiles have much to recommend them. They have a specific meaning in terms of population values, and they may be defined in terms which are readily understood by a person without technical training. Moreover, they are in general somewhat less sensitive to errors resulting from certain types of sampling bias than, for example, the mean.¹

One of the drawbacks to the use of percentiles has been the lack of some general method by which the probable error due to chance deviations in sampling might be investigated. With respect to the median, the rule that the standard error is about one and one-quarter times the easily estimated standard error of the mean is well known. However, this rule is applicable only when the population distribution approximates the normal form. It is not valid for the highly skewed and flattened distributions in which incomes usually fall. In fact, in such cases the standard error of the median is generally substantially less than that of the mean.

The following paragraphs present a method by which the standard error of any percentile may be estimated. In the case of large samples this requires in general less effort and computation than an estimate of the standard error of the mean. For large samples, the method has been worked out for the general case of any percentile of a sub-group of a population estimated by means of a stratified sample. The results may, of course, be readily simplified to apply to less complex systems of sampling. The method does not require the assumption that the parent population is distributed in any special way.

The method presented does not result in exact values for the standard error of percentiles, except in the limit, but rather in usable and in many cases very close approximations to these values. The principal aim is to provide the practical statistical worker with a simple means of testing the significance of percentiles and the equipment necessary to design surveys with preassigned levels of reliability.

¹ It has been noted that high income families tend to be somewhat less cooperative in income and expenditure studies than families in the middle ranges. Even though high income families are infrequent in the population, a slight bias against them, resulting in a disproportionately low number of such families in a sample, may markedly change the position of the estimated mean while hardly altering the estimate of the median.

Large Samples. The population is defined as consisting of $N$ sampling units or individuals. Among these are $A$ sampling units possessing some complex of characters which sets them apart from the rest. The $A$ type individuals vary with respect to some additional factor $X$. This factor is of such type and the number $A$ is sufficiently large so that the variation is effectively continuous. Corresponding to a preassigned value of the character $X$, say $X_B$, there are among the $A$ type individuals a certain number $B$, for which the value of $X$ is less than $X_B$.

To assist in visualizing the foregoing, we may, for example, let $N$ represent the number of individuals in a city. The symbol $A$ may then represent the number of persons within the city that are of some specific race and sex. These vary according to their incomes ($X$). There are then a total of $B$ individuals of the specified race and sex which have incomes of less than, say, \$1,000 ($X_B$).

The population is divided into $r$ independent strata (to carry out the analogy of the city, into $r$ distinct geographical districts). We then let

$N_i$ = the number of sampling units in the $i$th stratum,
$A_i$ = the number of $A$ type sampling units in the $i$th stratum,
$B_i$ = the number of $B$ type sampling units in the $i$th stratum.

A sample is selected at random and without replacement in each of the strata.
With respect to this sample, let

$n_i$ = the number of sampling units constituting the sample in the $i$th stratum,
$a_i$ = the number of $A$ type sampling units in the sample $n_i$,
$b_i$ = the number of $B$ type sampling units in the sample $n_i$.

Let the reciprocal of the sampling ratio employed in the $i$th stratum be represented by $S_i$. Then

$$S_i = N_i / n_i. \tag{1}$$

Estimates of quantities are denoted by primes. Estimates of $A$ and $B$ are then defined by the relations

$$A' = \sum_{i=1}^{r} S_i a_i \tag{2a}$$

$$B' = \sum_{i=1}^{r} S_i b_i. \tag{2b}$$

It is evident that $E(A') = A$ and $E(B') = B$.
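The expansion estimates in equations (1), (2a), and (2b) can be sketched in a few lines. This is only an illustration: the function name `expansion_estimates` and the two-stratum figures are hypothetical and do not come from the paper.

```python
def expansion_estimates(strata):
    """Return (A', B') as defined by equations (2a) and (2b).

    Each stratum is a tuple (N_i, n_i, a_i, b_i): stratum size, sample size,
    and the sample counts of A type and B type units.
    """
    A_est = 0.0
    B_est = 0.0
    for N_i, n_i, a_i, b_i in strata:
        S_i = N_i / n_i      # reciprocal of the sampling ratio, equation (1)
        A_est += S_i * a_i   # equation (2a)
        B_est += S_i * b_i   # equation (2b)
    return A_est, B_est


# Hypothetical two-stratum sample (made-up numbers, for illustration only).
strata = [(1000, 100, 40, 10), (3000, 150, 90, 30)]
A_est, B_est = expansion_estimates(strata)
print(A_est, B_est)
```

Each sample count is simply weighted by its stratum's expansion factor $S_i$, which is why the estimates are unbiased under random sampling within strata.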

A percentile, $X_K$, is to be determined for the group of $A$ type individuals in the population. Let the percentile be defined by $K$, which represents the proportion of the $A$ type individuals to which the percentile is to refer. Then, for example, $K = 0.25$ for the first quartile, or $K = 0.50$ for the median. The quantity $\Delta$ is defined by the equation

$$\Delta = KA - B. \tag{3}$$

It then represents the difference between the number of individuals of type $A$ in the population with a value of the character $X$ less than $X_K$ and the number with a value of the character $X$ less than $X_B$. An estimate of $\Delta$ is obtained as follows:

$$\Delta' = \sum_{i=1}^{r} S_i (K a_i - b_i). \tag{4}$$

It may be shown that

$$E(\Delta') = \sum_{i=1}^{r} S_i n_i (K p_{ai} - p_{bi}) = \Delta \tag{5}$$

where $p_{ai}$ is the proportion or probability of occurrence in the $i$th stratum of individuals of the $A$ type. The definition of $p_{bi}$ is similar, and both may be represented as follows:

$$p_{ai} = A_i / N_i, \qquad p_{bi} = B_i / N_i.$$

We may now note a very important fact. On the basis of any particular sample, we may estimate the percentile $X_K$ to be either above or below the value $X_B$. However, the latter can occur only if the difference $\Delta'$ for this particular sample is negative in sign. Then, if we can determine the relative frequency of occurrence or probability of negative values of $\Delta'$ under any particular set of conditions, we at the same time have specified the relative proportion of all times when on repeated sampling we will estimate the percentile $X_K$ to lie below $X_B$. This may be expressed by the relationship

$$P(\Delta' < 0) = P(X'_K < X_B) = \int_{-\infty}^{0} f(\Delta')\, d\Delta' \tag{6}$$

where $f(\Delta')$ represents the probability distribution of the estimated frequency differences. This relationship is true irrespective of either population or sample size.

We may now proceed to examine the form of the distribution of $\Delta'$ in samples of large absolute size. Since the quantity $\Delta'$ is a linear function of a number of hypergeometrically distributed variates, one may at once infer that as the size of the sample on which it is based increases while the ratio of sample to population size remains small, the distribution of $\Delta'$ will approach the normal probability distribution as a limit. The latter form will give a fair approximation to the probability distribution of $\Delta'$ in the case of samples of even moderate size drawn from a large population. Under the given conditions, equation (6) may then be rewritten as follows:

$$P(X'_K < X_B) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{-t} e^{-z^2/2}\, dz \tag{7}$$

where

$$t = \Delta / \sigma_{\Delta'}. \tag{8}$$

In evaluating (7), it is necessary to have the variance of the distribution of $\Delta'$ in terms of the population parameters. This is easily obtained in the following manner. Since the individual strata are independent, it will be sufficient to determine the variance of $\Delta'_i$, where

$$\Delta'_i = S_i (K a_i - b_i) \tag{9}$$

because, from (4),

$$\sigma^2_{\Delta'} = \sum_{i=1}^{r} \sigma^2_{\Delta'_i}. \tag{10}$$

From (9),

$$E(\Delta_i'^2) = S_i^2\, E(K^2 a_i^2 + b_i^2 - 2K a_i b_i). \tag{11}$$

But it may be shown that

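$$E(a_i^2) = n_i^2 p_{ai}^2 + n_i p_{ai} q_{ai} (N_i - n_i)/(N_i - 1) \tag{12a}$$

$$E(b_i^2) = n_i^2 p_{bi}^2 + n_i p_{bi} q_{bi} (N_i - n_i)/(N_i - 1) \tag{12b}$$

$$E(a_i b_i) = n_i^2 p_{ai} p_{bi} + n_i p_{bi} q_{ai} (N_i - n_i)/(N_i - 1), \tag{12c}$$

where $q_{ai} = 1 - p_{ai}$ and $q_{bi} = 1 - p_{bi}$.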
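The moments in (12a) to (12c) can be checked exactly for a small stratum by enumerating every possible sample outcome. The sketch below assumes the $B$ type units are a subset of the $A$ type units, as in the definitions above, and takes the leading term of (12c) to be $n_i^2 p_{ai} p_{bi}$, which is what the enumeration supports. The function names and the stratum figures are hypothetical.

```python
from fractions import Fraction
from math import comb


def exact_moments(N, A, B, n):
    """E(a^2), E(b^2), E(ab) by direct enumeration.

    The stratum holds B units of type B, A - B units that are type A only,
    and N - A other units; n units are drawn without replacement.
    """
    total = comb(N, n)
    Ea2 = Eb2 = Eab = Fraction(0)
    for b in range(min(B, n) + 1):
        for a in range(b, min(A, n) + 1):
            # comb() returns 0 for impossible draws, so no extra guards are needed.
            ways = comb(B, b) * comb(A - B, a - b) * comb(N - A, n - a)
            p = Fraction(ways, total)
            Ea2 += p * a * a
            Eb2 += p * b * b
            Eab += p * a * b
    return Ea2, Eb2, Eab


def formula_moments(N, A, B, n):
    """The same three moments from equations (12a), (12b), (12c)."""
    pa, pb = Fraction(A, N), Fraction(B, N)
    qa, qb = 1 - pa, 1 - pb
    fpc = Fraction(N - n, N - 1)  # finite sample factor
    Ea2 = n * n * pa * pa + n * pa * qa * fpc
    Eb2 = n * n * pb * pb + n * pb * qb * fpc
    Eab = n * n * pa * pb + n * pb * qa * fpc
    return Ea2, Eb2, Eab


# Hypothetical stratum: 20 units, 8 of type A, 3 of those also type B, sample of 5.
print(exact_moments(20, 8, 3, 5) == formula_moments(20, 8, 3, 5))
```

Exact rational arithmetic is used so that agreement between the enumeration and the formulas is an equality rather than a floating-point coincidence.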

Since the variance of any variate is equal to the expected value of its square minus the square of its expected value, we may combine equations (5), (10), (11), and (12) as follows:

$$\sigma^2_{\Delta'} = \sum_{i=1}^{r} S_i^2 n_i \left[ K(1-K) p_{ai} - (1-2K)(K p_{ai} - p_{bi}) - (K p_{ai} - p_{bi})^2 \right] \left[ (N_i - n_i)/(N_i - 1) \right]. \tag{13}$$

If the appropriate population parameters are known or can be estimated, equation (7) can now be evaluated to determine the probability that a percentile may be estimated to lie below some specific value $X_B$. It is only necessary to evaluate $t$ and refer to suitable tables for a normal probability distribution of unit standard deviation and zero mean. The desired probability will be represented by the area falling below minus $t$. By repeating the procedure for two different values of $X$, say $X_{B1}$ and $X_{B2}$, we may determine the mathematical probability that $X_K$ will be estimated to lie between $X_{B1}$ and $X_{B2}$. Choosing appropriate values of $X$, we may then reconstruct the probability distribution of $X'_K$ over any desired range.

To obviate this awkward procedure, we must learn something of the form of distribution of $X'_K$. In the first place, interest centers only in the shape of the central portion of the distribution of $X'_K$; the extreme tails of the distribution are of little importance. This is equivalent to saying that equation (7) will be evaluated only for values of $t$ which lead to probabilities of significant size, say for values between $-3$ and $+3$. Since the variance of $\Delta'$ decreases with an increase in the sample size, it is apparent from (8) that as the sample becomes larger the values which may be assigned to $X$ which lead to values of $t$ lying within any preassigned interval will cover a shorter and shorter range. In a large sample the entire range of such values will be small. Two important results follow from this fact.

First, the population distribution within this narrow range may be assumed to be approximately linear, since it was originally assumed that the distribution of the $A$ type individuals with respect to $X$ was continuous, though of unspecified form. As the size of sample increases and the range of relevant values of $X$ becomes narrower, it follows that the relationship between $\Delta$ and these values of $X$ approaches linearity.
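The evaluation of equations (13), (8), and (7) for a given value $X_B$ can be sketched as follows. This is a minimal illustration assuming the population counts $N_i$, $A_i$, $B_i$ are known or have been estimated; the function names and the stratum figures are hypothetical, and the normal table lookup is replaced by Python's `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist


def delta_and_variance(strata, K):
    """Return (Delta, variance of Delta') from equations (5) and (13).

    Each stratum is a tuple (N_i, n_i, A_i, B_i).
    """
    delta = 0.0
    variance = 0.0
    for N_i, n_i, A_i, B_i in strata:
        S_i = N_i / n_i
        p_a, p_b = A_i / N_i, B_i / N_i
        d = K * p_a - p_b
        fpc = (N_i - n_i) / (N_i - 1)  # finite sample factor
        delta += S_i * n_i * d
        variance += S_i**2 * n_i * (K * (1 - K) * p_a - (1 - 2 * K) * d - d * d) * fpc
    return delta, variance


def prob_percentile_below(strata, K):
    """P(X'_K < X_B) under the normal approximation of equation (7)."""
    delta, variance = delta_and_variance(strata, K)
    t = delta / sqrt(variance)       # equation (8)
    return NormalDist().cdf(-t)      # area falling below minus t


# Hypothetical strata; in the first set X_B coincides with the median itself.
at_median = [(1000, 100, 400, 200), (3000, 150, 900, 450)]
below_median = [(1000, 100, 400, 150), (3000, 150, 900, 400)]
print(prob_percentile_below(at_median, 0.5))
print(prob_percentile_below(below_median, 0.5))
```

The probability that the estimate falls between two values $X_{B1}$ and $X_{B2}$ is then the difference of two such calls, one for each set of $B_i$ counts.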
Second, examine the form of equation (13). The same reasoning may be applied here. In this expression only the values of $p_{bi}$ in the various strata depend on the value of $X$, and over a narrow range of values of $X$, the values of the $p_{bi}$ will change but slightly. Over this short range, then, the variance of $\Delta'$ is effectively constant.

With $\sigma_{\Delta'}$ approaching constancy and $\Delta$ becoming a linear function of $X$, it is apparent from equation (8) that the relationship between $t$ and $X$ also approaches linearity in the range over which we are interested in determining the form of distribution of $X'_K$. But as this relationship becomes linear, it follows from (7) that the distribution of $X'_K$ necessarily approaches the normal form. In a sufficiently large sample, then, the distribution of estimates of any percentile is very closely approximated by a normal probability curve.

If we accept the distribution of $X'_K$ as approximately normal, $t$ is obviously equivalent to the number of standard deviations of $X'_K$ represented by the absolute difference between $X_K$ and $X_B$. We may therefore set down

$$\sigma_{X'_K} = \left( |X_K - X_B|\, \sigma_{\Delta'} \right) / \Delta. \tag{14}$$

This expression may be somewhat simplified. In the first place, equation (13) reduces directly to the following form:

$$\sigma^2_{\Delta'} = \sum_{i=1}^{r} S_i \left[ K(1-K) A_i - (1-2K) \Delta_i - \Delta_i^2 / N_i \right] \left[ (N_i - n_i)/(N_i - 1) \right], \tag{15}$$

where $\Delta_i = K A_i - B_i$. In line with our previous assumptions, the finite sample factor in the second pair of brackets may be taken as approximately equal to unity. Now $A_i$ represents all individuals of the $A$ type in the $i$th stratum, while $\Delta_i$ represents only those of the $A$ type in the $i$th stratum who fall in the relatively narrow interval between $X_K$ and $X_B$. It is apparent that in most cases $\Delta_i$ will be quite small relative to $A_i$, and the following expression will be a satisfactory approximation to the variance of $\Delta'$.

$$\sigma^2_{\Delta'} \approx K(1-K) \sum_{i=1}^{r} S_i A_i. \tag{16}$$

Because of the modifying factors $K(1-K)$ and $(1-2K)$, the agreement between equations (15) and (16) will be least satisfactory in the case of the extreme percentiles. However, here again sample size is a factor. As the size of sample increases, the range of values of $X$ for which an evaluation of (15) may be required will become narrower, and accordingly the maximum values of $\Delta_i$ which may appear will become smaller. The second and third terms within the brackets thus decrease in importance relative to the first.

Finally, we may assume that $X$ is measured in relatively fine intervals, and choose $X_B$ so that $X_K - X_B$ is equal to one-half $C$, where $C$ is the width of the class interval by which the character $X$ is measured and within which the frequencies of the $A$ type individuals are tabulated. Then $\Delta$ will be very nearly equal to one-half $F_c$, where $F_c$ is the total number of $A$ type individuals in the population in the interval $C$ which includes the percentile $X_K$. Making the indicated changes, and incorporating equation (16), equation (14) reduces to

$$\sigma_{X'_K} = C \left[ K(1-K) \sum_{i=1}^{r} S_i A_i \right]^{1/2} / F_c. \tag{17}$$
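Equation (17) translates directly into a short function. The sketch below is illustrative only: the function name and all numerical inputs are hypothetical, and in practice $A_i$ and $F_c$ would usually be replaced by their sample estimates.

```python
from math import sqrt


def percentile_standard_error(C, F_c, K, strata):
    """Approximate standard error of a percentile, equation (17).

    C      : width of the class interval containing the percentile
    F_c    : population count of A type units in that interval
    K      : proportion defining the percentile (0.5 for the median)
    strata : list of (N_i, n_i, A_i) tuples
    """
    weighted_total = sum((N_i / n_i) * A_i for N_i, n_i, A_i in strata)
    return C * sqrt(K * (1 - K) * weighted_total) / F_c


# Hypothetical case: one stratum, whole population of interest (A = N),
# a one-in-ten sample, and 1,260 units in a class interval of width 250.
se = percentile_standard_error(C=250, F_c=1260, K=0.5, strata=[(12000, 1200, 12000)])
print(se)
```

Because only the sum of $S_i A_i$ enters, strata with heavier expansion factors contribute proportionally more to the standard error.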

This expression is an approximation to the standard error of a percentile. In applying it, then, due care must be exercised that the assumptions involved in its derivation are not violated. Perhaps most important, it should not be applied to extreme percentiles unless the sample is large enough to justify the assumption of normality in the text following equation (6), and the use of equation (16) as an approximation to equation (15).

In evaluating equation (17) in any particular instance it will probably be necessary to use sample values as approximations to the population parameters specified. It may well be noted, then, that the ratio $C/F_c$ is in reality an approximation to the value of the reciprocal of the ordinate of the distribution of the $A$ type individuals according to the character $X$ at the point $X_K$. It follows that a considerable range of $X$ may be used to estimate the most probable value of the ratio in any case where the sample results are somewhat irregular.

Equation (17) is quite general in form. It may be readily simplified to refer to less complex sampling situations. For example, let us assume that a percentile referring to the entire population is to be estimated on the basis of a non-stratified sample of size $n$. Equation (17) then reduces to

$$\sigma_{X'_K} = C \left[ K(1-K)\, n \right]^{1/2} / f_c \tag{18}$$

where $f_c$ represents the expected frequency of $A$ type individuals in a sample of size $n$ in the interval which includes the percentile. If the percentile to be studied is the median, equation (18) becomes simply

$$\sigma_{X'_{0.5}} = C \sqrt{n} / 2 f_c. \tag{19}$$

This will be recognized as the same formula as that given by Yule and Kendall, but derived by them by a somewhat different method.²

² G. U. Yule and M. G. Kendall, An Introduction to the Theory of Statistics, London, 1937, p. 384.

Small Samples. In the foregoing, the argument has been limited to large samples. In small samples the various distributions which have been studied may not be represented with sufficient accuracy by normal error functions, and consequently equation (7) and its simplifications become invalid. However, a line of attack on the problem of determining the probable error of an estimate of a percentile based on a small sample is indicated.

In the first place, since a small sample is seldom a stratified sample, we may limit our examination to non-stratified samples. Similarly to the terminology previously used, let $n$, $a$, and $b$ represent the sample size and the number of $A$ and $B$ type sample units in the sample. Now, provided that the proportions of the $A$ and $B$ type individuals in the population are known or can be estimated, we may estimate the probability of occurrence of combinations of particular values of $a$ and $b$ in repeated samples of size $n$. From these probabilities we may determine the probability of appearance of negative frequency differences and so the probability that the percentile $X_K$ will be estimated on the basis of a sample of this size to lie below $X_B$. Repeating this for various assigned values of $X_B$, we may construct any part of the distribution of $X'_K$ desired. Since $n$ is small, the number of combinations of $a$ and $b$ to be investigated is limited. The labor involved in this procedure, while not small, will still usually not be prohibitive. This is especially true if, as will many times be the case, all that is desired is a specific test of the significance of the difference between the percentile and some other single value of $X$.

The setting up of a single form summarizing the above procedure is difficult and unwieldy in the general case for any percentile. This has been done in the special case of the median, but the resulting expression is not given in the text because of its complexity and limited usefulness. The small sample case must be regarded as bordering on the trivial, because in extremely few instances are percentiles used in conjunction with samples of very small absolute size.

Samples of Intermediate Size. In some cases, a sample may be too large to permit convenient evaluation by the special procedures suggested in the text immediately above, but too small to permit the application of equation (17), especially in view of the simplifications incorporated in the latter which are based on the assumption of very large sample size.
Under these conditions, the procedure suggested in the text following equation (13) will usually suffice. This method is most useful if all that is desired is a test of the significance of the difference between an estimated percentile and some preassigned quantity.

The validity of the use of this procedure depends upon how closely a linear function of one or more hypergeometric variates (depending on the type of sampling employed) may be represented by the normal probability distribution. In the case of the more central percentiles a fair agreement is obtained even with relatively small sample sizes.

Sample Allocations in Stratified Samples. Neyman and others have studied the question of the proper allocation of a limited number of schedules between strata to produce the most efficient estimate of the mean. It may be of interest to apply this same technique to the problem of estimating a percentile. First, it is assumed that the cost of obtaining each schedule is the same throughout the population. The total number of schedules to be obtained (total sample size) is represented by $n_0$. We then define a function $V$ as follows:

$$V = \left[ C^2 K(1-K) / F_c^2 \right] \sum_{i=1}^{r} N_i^2 p_{ai} / n_i + L \left( \sum_{i=1}^{r} n_i - n_0 \right) \tag{20}$$

where $L$ is an arbitrary Lagrange multiplier. The function $V$ is differentiated with respect to $n_i$ and $n_j$ and the results equated to zero. Between the resulting two equations $L$ is eliminated. From the result, it may be shown that to minimize the variance of our estimate of a percentile, the size of sample to be taken in any stratum should be allocated in accordance with the following relationship,

$$n_i = n_0 \sqrt{N_i A_i} \Big/ \sum_{j=1}^{r} \sqrt{N_j A_j}. \tag{21}$$
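The allocation rule in equation (21) can be sketched as follows. The function name and the stratum figures are hypothetical, and the returned sample sizes are left unrounded; how to round them to whole schedules is not addressed in the text.

```python
from math import sqrt


def allocate_sample(n_total, strata):
    """Allocate n_total schedules among strata according to equation (21).

    strata : list of (N_i, A_i) tuples
    Returns the (unrounded) sample size for each stratum.
    """
    weights = [sqrt(N_i * A_i) for N_i, A_i in strata]
    total_weight = sum(weights)
    return [n_total * w / total_weight for w in weights]


# Hypothetical strata with equal numbers of A type units but unequal sizes.
sub_group = allocate_sample(300, [(10000, 2500), (40000, 2500)])

# When the percentile refers to the whole population, A_i = N_i and the
# allocation is proportional to stratum size.
whole_population = allocate_sample(400, [(1000, 1000), (3000, 3000)])
print(sub_group, whole_population)
```

Note that $K$ does not appear anywhere in the function, which mirrors the observation that the allocation does not depend on which percentile is being estimated.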

It is interesting to observe that the allocation of the sample is independent of the particular percentile which is to be estimated. Moreover, if the percentile is to be determined for the whole population, $A_i$ becomes equal to $N_i$, and the allocations are simply made proportionate to the total number of individuals in each stratum.

Illustration. It was mentioned in the introduction that the standard error of the median may be substantially less than that of the mean in certain distributions, especially those of an economic character. A concrete illustration of this may be of interest. The accompanying tabulation presents a distribution of 1,200 families according to their annual incomes. This distribution is patterned after a sample obtained by the Study of Consumer Purchases in a restricted section of New York City. It thus represents what may be found in practice.

Annual income           Number of families
Under $250                     10
$250 to $499                   12
$500 to $749                   26
$750 to $999                   49
$1,000 to $1,249               89
$1,250 to $1,499              104
$1,500 to $1,749              125
$1,750 to $1,999              132
$2,000 to $2,249              126
$2,250 to $2,499               90
$2,500 to $2,999              157
$3,000 to $3,499               88
$3,500 to $3,999               52
$4,000 to $4,499               33
$4,500 to $4,999               20
$5,000 to $7,499               54
$7,500 to $9,999               15
$10,000 and over               18
All incomes                 1,200

Applying equation (19), the standard error of the median of this distribution is estimated to be about $34. The usual methods lead to an estimate of the standard error of the mean of $73. The range of uncertainty of an estimate of the mean based on this sample would then be about double the range for the median.

The difference in reliability may be exhibited in more striking fashion in terms of the greater size of sample required to get an estimate of the mean having a standard error of only $34. It is readily found that to provide such an estimate a sample of approximately 5,500 families would be required, or a sample more than 4½ times as large as that required to estimate the median with the same standard error.
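The arithmetic of the illustration can be retraced with a short script. It applies equation (19) to the tabulated frequencies and then scales the sample size by the squared ratio of the two standard errors, on the usual assumption that the standard error of the mean varies inversely with the square root of the sample size. The lower limit of 0 for the lowest class and the variable names are assumptions made for the sketch; the figure of 73 for the mean is taken from the text rather than recomputed.

```python
from math import sqrt

# (lower limit, class width, number of families); the open-ended top class
# has no width. A lower limit of 0 for the lowest class is assumed.
classes = [
    (0, 250, 10), (250, 250, 12), (500, 250, 26), (750, 250, 49),
    (1000, 250, 89), (1250, 250, 104), (1500, 250, 125), (1750, 250, 132),
    (2000, 250, 126), (2250, 250, 90), (2500, 500, 157), (3000, 500, 88),
    (3500, 500, 52), (4000, 500, 33), (4500, 500, 20), (5000, 2500, 54),
    (7500, 2500, 15), (10000, None, 18),
]


def median_standard_error(classes):
    """Equation (19): C * sqrt(n) / (2 * f_c) for the class holding the median."""
    n = sum(f for _, _, f in classes)
    cumulative = 0
    for lower, width, f in classes:
        cumulative += f
        if cumulative >= n / 2:      # first class reaching half the sample
            return width * sqrt(n) / (2 * f)
    raise ValueError("median class not found")


n = sum(f for _, _, f in classes)
se_median = median_standard_error(classes)

# Sample size at which the mean's standard error would fall from 73 to 34.
se_mean = 73.0
required_n = n * (se_mean / 34.0) ** 2
print(n, round(se_median), round(required_n), required_n / n)
```

The median falls in the $2,000 to $2,249 class, so $C = 250$ and $f_c = 126$ enter equation (19).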