<<

Statistics of Range of a Set of Normally Distributed Numbers

Charles R. Schwarz1

Abstract: We consider a set of numbers independently drawn from a . We investigate the statistical properties of the maximum, minimum, and range of this set. We find that the range is closely related to the of the original population. In particular, we investigate the use of the online positioning user service ͑OPUS͒ global positioning system ͑GPS͒ precise position utility, which produces three estimates of each coordinate and reports the range of these three estimates. We find that the range divided by 1.6926 is an unbiased estimate of the standard deviation of a single coordinate estimate, and that the of this estimate is 0.2755 ␴2.We compare this to the more conventional method of estimating the standard deviation of a single observation from the sum of squares of residuals, which is shown to have a variance 0.2275 ␴2. DOI: 10.1061/͑ASCE͒0733-9453͑2006͒132:4͑155͒ CE Database subject headings: ; surveys.

Introduction timates are statistically independent. If x1, x2, and x3 are three ␴2 ␴2 independent estimates of coordinate X, with 1, 2, ␴2 This investigation was motivated by users of the online posi- and 3, then the best estimate of the coordinate X is the tioning user service ͑OPUS͒ utility for computing precise coordi- weighted nates for global positioning system ͑GPS͒ geodetic receivers ͑see http://www.ngs.noaa.gov/OPUS͒. This utility, provided by the 3 U.S. National Geodetic Survey, computes highly accurate GPS ͚ wixi positions from a single set provided by the user. It performs i=1 ¯x = 3 a differential GPS solution by searching for suitable reference station data in the archive of Continuously Operating Reference ͚ wi Station ͑CORS͒ data. OPUS selects three CORS stations and i=1 performs three single baseline solutions. This produces three ␴2 where the weights are wi =1/ i . The variance of the mean is estimates for the coordinates of the user’s receiver in the U.S. then National Spatial Reference System. The OPUS utility reports the mean of these three estimates, together with their range, to the 1 ␴2 user. Fig. 1 shows a typical solution report obtained from OPUS. ¯x = 3 The numbers following the coordinates show the range of the ͚ wi three estimates of each coordinate. i=1 Each single baseline solution is highly accurate by itself. The reason to use three such baselines appears to be to provide some According to the OPUS user documentation, the designers protection against blunders, not because they provide significant of OPUS declined to use this , saying that statistical or geometric redundancy. The range of the three esti- the variances from the single baseline solutions are mates tells the user if blunders are present. If the range is signifi- notoriously overoptimistic ͑http://www.ngs.noaa.gov/OPUS/ cantly larger than normally obtained with similar data sets, then UsingគOPUS.html#accuracy͒. one or more of the reference station’s coordinates or tracking data 2. Linear error propagation, using external estimates of the vari- may have contained a blunder. ances of single baseline solutions. A common method of ob- There are several ways to estimate the uncertainty of the mean taining such external estimates is to perform a study using a of the three single baseline estimates reported to the user. Among large number of single baseline solutions. Such studies may these are produce rules for estimating variances from some property of 1. Linear error propagation, using the variances of the indi- the data, such as the time span of the data or the length of the vidual single baseline solutions and assuming that these es- baseline. 3. Linear error propagation, assuming that all three single base- line solutions have the same variance ␴2. Then the best esti- 1Consultant, Geodesy, 5320 Wehawken Rd., Bethesda, MD 20816. mate of the coordinate X is the simple mean E-mail: [email protected] Note. Discussion open until April 1, 2007. Separate discussions must be submitted for individual papers. To extend the closing date by one 3 month, a written request must be filed with the ASCE Managing Editor. ͚ xi The manuscript for this paper was submitted for review and possible i=1 ¯x = publication on May 10, 2005; approved on July 8, 2005. This paper is 3 part of the Journal of Surveying Engineering, Vol. 132, No. 4, November 1, 2006. ©ASCE, ISSN 0733-9453/2006/4-155–159/$25.00. and its variance is

JOURNAL OF SURVEYING ENGINEERING © ASCE / NOVEMBER 2006 / 155 ϱ 3 ͓ ͔ ͑ ͒ E u = ͵ ufu u du = ͱ = 0.8463 −ϱ 2 ␲ and its variance ͓Weisstein, undated, Eq. ͑39͔͒ is

ϱ 4␲ −9+2ͱ3 E͓͑u − E͓u͔͒2͔ = ͵ u2f ͑u͒du − E2͓u͔ = = 0.5593 u ␲ −ϱ 4 ͑ ͒ Similarly, let v=min x1 ,x2 ,x3 . It is easy to show that E͓v͔ =−E͓u͔ Finally, let w be the range of the standardized random variables. Then

E͓w͔ = E͓u − v͔ = E͓u͔ − E͓v͔ =2E͓u͔

Application to the OPUS Utility

Returning to the original set of numbers X1 ,X2 ,X3 and comput- ing their maximum, minimum, and range, we have Fig. 1. Portion of an OPUS data sheet 3 E͓max͔ = ␮ + ␴ 2ͱ␲

␴2 ␴2 3 ¯x = E͓min͔ = ␮ − ␴ and 3 2ͱ␲

3 While linear error propagation is formally correct, it is easily E͓range͔ = ␴ = 1.6926␴ ͱ␲ contaminated by blunders in the data. Detecting such blun- ders appears to have been the main concern of the designers This that of OPUS. Their methodology is to estimate the coordinate X with the simple mean of the three single baseline solutions, s1 = range/1.6929 but report the range of the three estimates rather than a stan- is an unbiased estimate of ␴, the standard deviation of the popu- dard deviation. The range ͑also called the peak-to-peak error͒ lation from which the three numbers were drawn. Furthermore, will show the presence of a blunder more clearly than a propagated variance. The explanation on the OPUS web site s /ͱ3=range/2.9317 suggests that the peak-to-peak error is intended to be a rough 1 guide to the accuracy of the reported coordinates. may be used as an unbiased estimate of the standard deviation of A number of OPUS users have asked for a formal standard the mean of the three individual estimates. deviation or variance for the reported coordinates. This has prompted interest in the question of whether such a standard de- viation can be computed from the reported range of the three How Good Is This Estimate? single baseline solutions. Intuitively, it seems that such a relation- ship must exist; if the uncertainties of the single baseline solu- If we want to use range/1.6926 as an estimate of ␴, it makes tions are large, then one would expect the range of a sample of sense to ask how good is this estimate. three of them to be large. We have already noted that Var͑u͒ = E͓͑u − E͓u͔͒2͔ = E͓u2͔ − ͑E͓u͔͒2 = 0.5593 Statistics of the Maximum, Minimum, and Range We can also easily show that Var͑v͒=Var͑u͒. With w=u−v as the range of the standardized random vari- ables we have Let X1, X2, and X3 be three numbers independently drawn from a ␮ population with a normal distribution with mean and stan- ͓ ͔ ͓͑ ͓ ͔͒2͔ ͓ ͔ ͓ ͔ ͓ ͔ ␴ ϳ ͑␮ ␴͒ Var w = E w − E w = Var u + Var v − 2Cov u,v dard deviation ; i.e., X1 ,X2 ,X3 n , . Standardize these ͑ ␮͒ ␴ ͑ ␮͒ ␴ ͑ ␮͒ ␴ by x1 = X1 − / , x2 = X2 − / , and x3 = X3 − / so that We are tempted to assume that the between the maxi- ϳ ͑ ͒ x1 ,x2 ,x3 n 0,1 . mum u and minimum v should be zero. After all, the maximum is ͑ ͒ Let u=max x1 ,x2 ,x3 . Then u is also a . Its one of the random variables xi and the minimum is another one. distribution is the extreme value distribution, a topic treated in the xj, and xi and xj are statistically independent. However, the analy- subject of order statistics ͑see Wilks 1962͒. Its sis in the Appendix shows that this is not the case. Instead, the ͓see Weisstein, undated, Eq. ͑34͔͒ is covariance is

156 / JOURNAL OF SURVEYING ENGINEERING © ASCE / NOVEMBER 2006 Cov͓u,v͔ = E͓͑u − E͓u͔͒͑v − E͓v͔͔͒ = E͓uv͔ − E͓u͔E͓v͔ Var͓w͔ = Var͓u͔ + Var͓v͔ − 2Cov͓u,v͔ where 4␲ −9+2ͱ3 9−4ͱ3 =2 −2 = 0.7892 ␲ ␲ ϱ ϱ 4 4 ͓ ͔ ͵ ͵ ͑ ͒ E uv = uvfuv u,v dudv The variance of the range of the original random variables is then −ϱ −ϱ Var͓range͔ = 0.7892␴2 ͑ ͒ Using the expression for fuv u,v from the Appendix and setting n=3 and the standard deviation of the range is 0.8884␴. ␴ Recalling that we use s1 =range/1.6926 as an estimate of , ϱ ϱ the error in this estimate is E͓uv͔ =6͵ ͵ ͓F ͑u͒ − F ͑v͔͒f ͑u͒f ͑v͒dudv x x x x ␴ −ϱ v s1 − For the normal distribution, the probability density is the and the expected squared value of this error is Gaussian function ͓͑ ␴͒2͔ ͓͑ ␴͒2͔ E s1 − = E range/1.6926 −

1 2 1 ͑ ͒ ͑ ͒ −u /2 2 fx u =G u = e = E͓͑range − E͓range͔͒ ͔ ͱ2␲ ͑1.6926͒2 and the distribution function is 1 = Var͓range͔ ͑1.6926͒2 1 1 F ͑u͒ = + Erf͑u/ͱ2͒ 0.7892 x 2 2 = ␴2 = 0.2755␴2 ͑1.6926͒2 ͑ ͒ ͱ␲͐z −t2 where Erf z =2/ 0e dt is the described in many texts ͑see, e.g., Abramowitz and Stegun 1977͒. Thus,

ϱ ϱ How Good Are Other Estimates? E͓uv͔ =3͵ ͵ uv͓Erf͑u/ͱ2͒ − Erf͑v/ͱ2͔͒G͑u͒G͑v͒dudv −ϱ v The more conventional method of computing the variance of a ϱ ϱ single observation from a sample of three numbers is to compute =3͵ vG͑v͒ͫ͵ uErf͑u/ͱ2͒G͑u͒duͬdv the mean of the three numbers −ϱ v 3 ϱ ϱ ͚ X ͱ i −3͵ vG͑v͒Erf͑v/ 2͒ͫ͵ uG͑u͒duͬdv i=1 X¯ = −ϱ v 3 ϱ u and the residuals from the mean =3͵ uG͑u͒Erf͑u/ͱ2͒ͫ͵ vG͑v͒dvͬdu −ϱ −ϱ v = X − X¯ ϱ ϱ i i −3͵ vG͑v͒Erf͑v/ͱ2͒ͫ͵ uG͑u͒duͬdv Then −ϱ v 3 ϱ ͚ v2 =3͵ uG͑u͒Erf͑u/ͱ2͓͒−G͑u͔͒du i 2 i=1 ϱ ␴ˆ = − 2 ϱ 2 −3͵ vG͑v͒Erf͑v/ͱ2͓͒G͑v͔͒dv is an unbiased estimate of ␴ . −ϱ It is shown in many texts on least-squares adjustments ͑e.g., ϱ Leick 1995, Sec. 4.9.2͒ that the sum of squares of residuals, di- =−6͵ zG2͑z͒Erf͑z/ͱ2͒dz vided by the true value of the variance of a single observation, is −ϱ distributed as chi squared with n−1 degrees of freedom. Thus ϱ ͱ ␴2 3 2 3 1 3 =− ͵ ze−z Erf͑z/ͱ2͒dz =− =− = − 0.5513 ␴ˆ 2 ϳ ␹2 ␲ ␲ ͱ ␲ 2 2 −ϱ 3 Then From this ␴2 Cov͓uv͔ = E͓uv͔ − E͓u͔E͓v͔ E͓␴ˆ 2͔ = E͓␹2͔ = ␴2, since E͑␹2͒ = 2, and with Var͑␹2͒ =4 2 2 2 2 ͱ3 3 2 9−4ͱ3 =− + ͩ ͪ = = 0.1649 ␲ ͱ␲ 4␲ ␴2 2 ␴2 2 2 Var͓␴ˆ 2͔ = ͩ ͪ Var͓␹2͔ = ͩ ͪ 2 ϫ 2=␴4 2 2 2 and the variance of the range of the standardized random vari- ables is Thus, the expected squared error of ␴ˆ 2 as a estimate of ␴2 is ␴4.

JOURNAL OF SURVEYING ENGINEERING © ASCE / NOVEMBER 2006 / 157 Fig. 3. Another portion of OPUS extended output

Fig. 2. Portion of OPUS extended output

3 ͚ v2 ␴ˆ ͱ i It is tempting to conclude from this that is an unbiased i=1 ␴ ␴2 ␴ˆ = estimate of and that its variance is . However, we find that 2 this is not the case. ͚3 2 ␴2 Let y= i=1vi / . Then y is distributed as chi square with two This yields degrees of freedom, and its probability density function is ␴ lat = 0.005 m 1 f ͑y͒ = e−y/2, where the ⌫͑1͒ =0!=1 y ⌫͑ ͒ 2 1 ␴ long = 0.003 m The estimated standard deviation is ␴ˆ =ͱ␴2 /2y and its expecta- tion is ␴ height = 0.014 m ϱ ␴ ␴ ͱ␲ 3. Using the covariance matrices of the individual baseline so- ͓␴ˆ ͔ ͵ 1/2 −y/2 ͱ ␲ ␴ E = ͱ y e dy = ͱ 2 = = 0.8662 lutions. The single baseline processing engine used in OPUS 2 2 −ϱ 2 2 2 produces a for each baseline. These are This says that even though ␴ˆ 2 is an unbiased estimate of ␴2, ␴ˆ combined to produce the covariance matrix of the mean of underestimates ␴. However, we can still find the expected squared the three individual coordinate determinations, rotated into error in this estimate as east, north, and up coordinates, and reported in the OPUS extended output ͑Fig. 3͒. The square roots of the diagonal E͓͑␴ˆ − ␴͒2͔ = E͓␴ˆ ͔ −2E͓␴ˆ ␴͔ + E͓␴2͔ = E͓␴ˆ 2͔ −2E͓␴ˆ ͔␴ + ␴2 elements of this matrix also give the standard deviations of = ␴2 −2ϫ 0.8862␴ϫ␴+ ␴2 the reported coordinates. By this method we compute So, finally, ␴ lat = 0.001 m E͓͑␴ˆ − ␴͒2͔ = 0.2275␴2 ␴ = 0.003 m This is slightly better than the expected squared error of 0.2755␴2 long obtained when the range is used to estimate the standard deviation ␴ of a single observation. This seems intuitively correct, since it is height = 0.003 m based on all three observations rather than just two.

Example Using the OPUS Data Sheet Concluding Remarks The use of the range, or peak to peak errors, given on the data The OPUS solution report with extended output provides three sheet may be advantageously used to estimate the standard devia- different ways of estimating the uncertainty of the computed co- tions of the computed coordinates. This method of computing the ordinates: standard deviations is almost as good as the more conventional 1. Using the range of the three solutions in latitude, longitude, method based on the sum of squares of residuals, and is also and height. From Fig. 1, these are 0.011, 0.005, and 0.026 m. much more robust against the effects of a blunder in the data. We find the standard deviation of the reported latitude, lon- gitude, and height by dividing by 2.9317, yielding ␴ lat = 0.004 m Notation

␴ The following symbols are used in this paper: long = 0.002 m Cov͑͒ ϭ covariance of two random variables; E͓͔ ϭ expected value; ␴ = 0.009 m height Erf͑x͒ ϭ error function; ͑ ͒ ϭ 2. Using the residuals from the mean. Fig. 2 shows the three Fx x cumulative probability function associated with the individual position determinations. The mean of these in random variable x, evaluated at the number x; ͑ ͒ ϭ each coordinate is the published International Terrestrial fx x probability density function associated with the Reference Frame ͑ITRF͒ position, and the maximum minus random variable x, evaluated at the number x; minimum in each coordinate ͑converted to meters͒ is the G͑x͒ ϭ Gaussian probability function; reported range. From these, we can compute the residuals Var͑͒ ϭ variance of a random variable; and and estimate the standard deviation in each coordinate by ⌫͑x͒ ϭ Gamma function.

158 / JOURNAL OF SURVEYING ENGINEERING © ASCE / NOVEMBER 2006 ͑ϱ ͒ ͓ ͑ ͔͒n ͑ ͒ Appendix. Joint Probability Function Fuv ,v =1− 1−Fx v = Fv v of the Minimum and Maximum of a Set of N Random Variables and the marginal probability density functions ͕ ͖ Let x1 ,x2 ,...,xn be a set of n independent random variables ͕ ͖ ͕ ͖ and let u=max x1 ,x2 ,...,xn and v=min x1 ,x2 ,...,xn . The joint function of u and v is found ϱ u ͵ ͑ ͒ ͑ ͒͵ ͓ ͑ ͒ ͑ ͔͒n−2 ͑ ͒ ͑ ͒ by determining the region of n-dimensional space such that fuv u,v dv = n n −1 Fx u − Fx v fx u fx v dv ͕uϽu&vϽv͖ and then finding the probability density contained −ϱ −ϱ in this region ͑Papoulis 1965͒. Thus n−1͑ ͒ ͑ ͒ ͑ ͒ = nFx u fx u = fu u ͑ ͒ ͕ Ͻ Ͻ ͖ Fuv u,v = Pr u u & v v ͕ Ͻ Ͻ Ͻ Ͻ ͖ = Pr x1 u & x2 u &, ..., & xn u ϱ ϱ ͕ Ͻ Ͻ Ͻ Ͻ Ͻ Ͻ ͖ ͵ ͑ ͒ ͑ ͒͵ ͓ ͑ ͒ ͑ ͔͒n−2 ͑ ͒ ͑ ͒ − Pr v x1 u & v x2 u &, ..., &v xn u fuv u,v du = n n −1 Fx u − Fx v fx u fx v du −ϱ v The random variables x are all independent, so = n͓1−F ͑v͔͒n−1f ͑v͒ = f ͑v͒ ͑ ͒ ͓ ͕ Ͻ ͖͔n ͓ ͕ Ͻ Ͻ ͖͔n x x v Fuv u,v = Pr x u − Pr v x u ͑ ͒ ͑ ͒ Ͼ The probability in the second term is Fx u −Fx v if u v; otherwise it is zero. Thus References n͑ ͒ ͓ ͑ ͒ ͑ ͔͒n Ͼ F u − Fx u − Fx v , u v F ͑u,v͒ = ͭ x ͮ uv n͑ ͒ Ͻ Abramowitz, M., and Stegun, I., eds. ͑1977͒. Handbook of mathematical Fx u , u v functions with formulas, graphs, and mathematical tables, Dover, and the joint probability density function is New York. .Leick, A. ͑1995͒. GPS satellite surveying, Wiley Interscience, New York ͒ ͑ 2ץ Fuv u,v f ͑u, ͒ = Papoulis, A. ͑1965͒. Probability, random variables, and stochastic pro- ץ ץ uv v u v cesses, McGraw-Hill, New York. ͑ ͒ n͑n −1͓͒F ͑u͒ − F ͑v͔͒n−2f ͑u͒f ͑v͒, u Ͼ v Weisstein, E. W. undated . “Extreme value distribution,” ͭ x x x x ͮ ͗ = from Mathworld—A Wolfram web resource, http:// 0, u Ͻ v mathworld.wolfram.com/ExtremeValueDistribution.htm͘ It is straightforward to verify the marginal distribution functions Wilks, S. S. ͑1962͒. , Wiley, New York. ͗http://www.ngs.noaa.gov/OPUS͘ ͑ ϱ͒ n͑ ͒ ͑ ͒ Fuv u, = Fx u = Fu u ͗http://www.ngs.noaa.gov/OPUS/Using_OPUS.html#accuracy͘

JOURNAL OF SURVEYING ENGINEERING © ASCE / NOVEMBER 2006 / 159