
Estimation of Moments and Quantiles Using Censored Data


WATER RESOURCES RESEARCH, VOL. 32, NO. 4, PAGES 1005-1012, APRIL 1996

Charles N. Kroll and Jery R. Stedinger
School of Civil and Environmental Engineering, Cornell University, Ithaca, New York

Abstract. Censored data sets are often encountered in water quality investigations and streamflow analyses. A Monte Carlo analysis examined the performance of three techniques for estimating the moments and quantiles of a distribution using censored data sets. These techniques include a lognormal maximum likelihood estimator (MLE), a log-probability plot regression estimator, and a new log-partial probability-weighted moment estimator. Data sets were generated from a number of distributions commonly used to describe water quality and water quantity variables. A "robust" fill-in method, which circumvents transformation bias in the real space moments, was implemented with all three estimation techniques to obtain a complete sample for computation of the sample mean and standard deviation. Regardless of the underlying distribution, the MLE generally performed as well as or better than the other estimators, though the moment and quantile estimators using all three techniques had comparable log-space root mean square errors (rmse) for censoring at or below the 20th percentile for sample sizes of n = 10, the 40th percentile for n = 25, and the 60th percentile for n = 50. Comparison of the log-space rmse and real-space rmse indicated that a log-space rmse was a better overall metric of estimator precision.

Copyright 1996 by the American Geophysical Union.
Paper number 95WR03294.
0043-1397/96/95WR-03294$05.00

Introduction

When a data set contains some observations within a restricted range of values but otherwise not measured, it is called a censored data set [Cohen, 1991]. Censored data sets are commonly found in the fields of water quality, where laboratory measurements of contaminant concentrations are often reported as "less than the detection limit." Censored data sets are also found in water quantity analyses when river discharges less than a measurement threshold level are reported as zero. In some regions, historical river discharge records report over half the annual minimum flows as zero [Hammett, 1984]. These discharges may have been zero, or they may have been between zero and the measurement threshold and thus reported as zero. Of concern is how to efficiently estimate moments, quantiles, and other descriptive statistics of the underlying continuous distribution using such censored data sets.

The situation where all data below a fixed value are censored is referred to as type I censoring. With type I censoring, the number of values censored is a random variable. With type II censoring, a fixed number of data points are always censored and the censoring threshold is a random variable [David, 1981]. Censored water quality and water quantity data should resemble type I censoring because the censoring threshold is fixed by the measurement technology and the physical setting.

A number of studies have suggested the use of simple "replacement" techniques for estimating the mean and standard deviation of type I censored data sets [Cohen and Ryan, 1989; Newman et al., 1990]. These techniques replace all the censored observations with some value between zero and the detection limit. Gilliom and Helsel [1986] and Helsel and Gilliom [1986] examined the performance of a variety of techniques to estimate the mean, standard deviation, median, and interquartile range using type I censored water quality data. They showed that more sophisticated statistical techniques performed better than these simple "replacement" methods. In particular, the log-probability plot regression method provided the best estimators of the mean and standard deviation, while the lognormal maximum likelihood method provided the best estimators of the median and interquartile range. Estimation of quantiles other than the median was not considered by Gilliom and Helsel. Helsel and Cohn [1988] extended Gilliom and Helsel's work to data sets with several censoring thresholds.

This study extends the work of Gilliom and Helsel to the estimation of several quantiles and considers new estimators. The log-probability plot regression method (LPPR) and the lognormal maximum likelihood method (MLE) are evaluated along with a new method based on partial probability-weighted moments (PPWM). As with the MLE and LPPR estimators, our PPWM estimator assumes that the data are described by a lognormal distribution. It applies to the logarithms of the flow data the censored-sample probability-weighted moment (PWM) estimators derived by Wang [1990] to obtain the parameters of a lognormal distribution. Wang employed his estimators in real space to fit a generalized extreme value (GEV) distribution. The performance of probability-weighted moment estimators with complete samples has been examined for a number of distributions, and in many cases, PWM estimators of the higher moments and quantiles of a distribution have performed favorably with product-moment and maximum likelihood estimators [Landwehr et al., 1979; Hosking et al., 1985; Hosking and Wallis, 1987]. PWM estimators are linear combinations of the observations and thus are less sensitive to the largest observations in a sample than product-moment estimators that square and cube the observations. The merit of probability-weighted moment estimators with censored samples has yet to be analyzed.

This study focuses on estimation of the mean, standard deviation, and interquartile range of a distribution, as well as quantiles with nonexceedance probabilities of 10% and 90%. A "robust" fill-in method is implemented with each estimation technique to obtain a complete sample for computation of the sample mean and variance.

Gilliom and Helsel [1986] used this "robust" fill-in method only with a log-probability plot regression estimator. In this study this method is also used with the lognormal maximum likelihood and partial probability-weighted moment estimators. Two different metrics are used to compare estimators. Data are generated from distributions commonly observed in the water quality and water quantity fields, including three distributions not considered by Gilliom and Helsel; their extreme case for the gamma distribution (CV = 2.0) was omitted.

Estimation Techniques

All three estimation techniques make the assumption that the underlying distribution of the data is lognormal. Helsel and Hirsch [1992, p. 360] observe that the lognormal distribution has a flexible shape and provides a reasonable description of many positive random variables with positively skewed distributions. The lognormal distribution has been shown to be a good descriptor of low river flows [Vogel and Kroll, 1989] and water quality data [Gilliom and Helsel, 1986].

Lognormal Maximum Likelihood Estimator

Consider an ordered censored data set $X_1 \le X_2 \le \cdots \le X_c \le X_{c+1} \le \cdots \le X_n$, where the first c observations are censored and reported only as below some fixed measurement threshold. Let $Y_i = \ln(X_i)$ and let T be the log of the measurement threshold. Assuming that the X are lognormally distributed and independent, the likelihood function for the data is

$$ L = \frac{n!}{c!\,(n-c)!}\left[\Phi\!\left(\frac{T-\mu_Y}{\sigma_Y}\right)\right]^{c}\prod_{i=c+1}^{n}\frac{1}{\sigma_Y}\,\phi\!\left(\frac{Y_i-\mu_Y}{\sigma_Y}\right) \qquad (1) $$

where $\Phi$ and $\phi$ are the distribution and density function of a standard normal variate, $\mu_Y$ is the mean of the log-transformed data, and $\sigma_Y$ is the standard deviation of the log-transformed data. By taking the logarithm of (1) and setting the partial derivatives with respect to $\mu_Y$ and $\sigma_Y$ to zero, one can solve for the maximum likelihood estimators (MLE) $\hat{\mu}_Y$ and $\hat{\sigma}_Y$ [Cohen, 1991].

Log-Probability Plot Regression Method

Again consider a log-transformed censored data set where the first c of the n data values are censored. Plotting positions $p_i$ for the uncensored observations are

[Equation (2): the Hirsch-Stedinger Blom-based censored data plotting position; the formula is not recoverable from this copy.]

Estimators with the Hirsch-Stedinger Blom-based censored data plotting position (equation (2)) had a smaller root mean square error than estimators with a standard complete sample Weibull plotting position [i/(n + 1)] when censored data were present. Gilliom and Helsel [1986] used the standard complete sample Weibull plotting position in their LPPR estimator, while Helsel and Cohn [1988] used a Weibull-based plotting position with the Hirsch-Stedinger censored data plotting position.

For the data above the threshold, the logarithms of the ordered values, $Y_i$, are regressed against the corresponding "normal scores" according to the model

$$ Y_i = \mu_Y + \sigma_Y\,\Phi^{-1}(p_i), \qquad i = c+1, \ldots, n \qquad (3) $$

where $\Phi^{-1}(p_i)$ is the inverse cumulative normal distribution function evaluated at $p_i$, and $\hat{\mu}_Y$ and $\hat{\sigma}_Y$ are the resulting estimators of the mean and standard deviation of the log-transformed data obtained using ordinary least squares regression. These LPPR estimators are similar to those derived by Gupta [1952] and have been implemented in a number of studies of estimation with censored data sets [Gilliom and Helsel, 1986; Helsel and Gilliom, 1986; Helsel and Cohn, 1988; Helsel, 1990].

Partial Probability Weighted Moments

For a variable Y, probability-weighted moments are defined as

$$ \beta_r = E\{Y[F(Y)]^r\} \qquad (4) $$

where F(Y) is the cumulative distribution function (CDF) for Y. For a continuous random variable, PWMs can be written

$$ \beta_r = \int_0^1 Y(F)\,F^r\,dF \qquad (5) $$

where F = F(Y) and Y(F) is the inverse of the CDF evaluated at the probability F. For a censored sample, Wang [1990] defined a PPWM as

$$ \beta_r(P_T) = \int_{P_T}^{1} Y(F)\,F^r\,dF \qquad (6) $$

where $P_T = F(T)$, the probability of censoring, and T is the censoring threshold.

Assuming the data X are lognormally distributed and Y = log(X), then Y is normally distributed. T is the log of the censoring threshold. For the normal distribution the inverse CDF for a random variable Y is

$$ Y(F) = \mu_Y + \sigma_Y\,\Phi^{-1}(F) \qquad (7) $$

An approximation to $\Phi^{-1}(F)$ is

$$ \Phi^{-1}(F) \approx 5.05\left[F^{0.135} - (1-F)^{0.135}\right] \qquad (8) $$

This is a good approximation of the normal inverse CDF for 0.005 < F < 0.995. An approximation of $\beta_r(P_T)$ based on (8) is

$$ \beta_r(P_T) = \mu_Y\,[g_r(P_T)] + \sigma_Y\,[h_r(P_T)] \qquad (9) $$

where for r = 0,

$$ g_0(P_T) = 1 - P_T, \qquad h_0(P_T) = 5.05\left[\frac{1 - P_T^{1.135}}{1.135} - \frac{(1-P_T)^{1.135}}{1.135}\right] \qquad (10) $$

and for r = 1,

$$ g_1(P_T) = \frac{1 - P_T^{2}}{2} \qquad (11) $$

$$ h_1(P_T) = 5.05\left[\frac{1 - P_T^{2.135}}{2.135} - \frac{P_T\,(1-P_T)^{1.135}}{1.135} - \frac{(1-P_T)^{2.135}}{(1.135)(2.135)}\right] \qquad (12) $$
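As a numerical check on approximation (8) and on the coefficient functions in (9) through (12), the following minimal Python sketch compares (8) with the exact standard normal inverse CDF and compares the closed forms for h_0 and h_1 with a direct numerical integration of (6). The function names are illustrative and do not come from the paper, and the closed forms assume that (8) is integrated term by term from P_T to 1.

```python
from statistics import NormalDist


def phi_inv_approx(F):
    """Approximation (8) to the standard normal inverse CDF."""
    return 5.05 * (F ** 0.135 - (1.0 - F) ** 0.135)


def g(r, PT):
    """g_r(P_T): integral of F**r from P_T to 1."""
    return (1.0 - PT ** (r + 1)) / (r + 1)


def h(r, PT):
    """h_r(P_T): integral of F**r times approximation (8) from P_T to 1."""
    q = 1.0 - PT
    if r == 0:
        return 5.05 * ((1.0 - PT ** 1.135) - q ** 1.135) / 1.135
    if r == 1:
        return 5.05 * ((1.0 - PT ** 2.135) / 2.135
                       - PT * q ** 1.135 / 1.135
                       - q ** 2.135 / (1.135 * 2.135))
    raise ValueError("only r = 0 and r = 1 are sketched here")


def h_numeric(r, PT, steps=20000):
    """Midpoint-rule integration of F**r times approximation (8)."""
    width = (1.0 - PT) / steps
    total = 0.0
    for k in range(steps):
        F = PT + (k + 0.5) * width
        total += F ** r * phi_inv_approx(F)
    return total * width


def max_abs_error(points=199):
    """Largest |approximation - exact| on the grid F = 0.005, 0.010, ..., 0.995."""
    exact = NormalDist()
    worst = 0.0
    for k in range(1, points + 1):
        F = 0.005 * k
        worst = max(worst, abs(phi_inv_approx(F) - exact.inv_cdf(F)))
    return worst


if __name__ == "__main__":
    print("max abs error of (8):", max_abs_error())
    for r in (0, 1):
        print(r, h(r, 0.4), h_numeric(r, 0.4))
```

With P_T = 0 the functions reduce to the complete-sample case, so h(1, 0.0) can be compared with the exact first PWM of a standard normal variate, 1/(2*sqrt(pi)), as a further sanity check.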

For $X_1 \le X_2 \le \cdots \le X_n$, an unbiased estimator of $\beta_r(P_T)$ [Wang, 1990] is

$$ b_r(P_T) = \frac{1}{n}\sum_{i=1}^{n}\frac{(i-1)(i-2)\cdots(i-r)}{(n-1)(n-2)\cdots(n-r)}\,Y_i^{*} \qquad (13) $$

where

$$ Y_i^{*} = \begin{cases} 0 & \text{if } X_i \le \exp(T) \\ Y_i & \text{if } X_i > \exp(T) \end{cases} $$

Wang [1990] recommended estimating $P_T$ as

$$ \hat{P}_T = c/n \qquad (14) $$

where c is the number of observations not exceeding T. Setting the equations for $\beta_0(P_T)$ and $\beta_1(P_T)$ equal to the estimators $b_0(P_T)$ and $b_1(P_T)$, PPWM estimators of $\mu_Y$ and $\sigma_Y$ are obtained:

$$ \hat{\mu}_Y = \frac{b_0(P_T)\,h_1(P_T) - b_1(P_T)\,h_0(P_T)}{g_0(P_T)\,h_1(P_T) - g_1(P_T)\,h_0(P_T)} \qquad (15) $$

$$ \hat{\sigma}_Y = \frac{b_1(P_T)\,g_0(P_T) - b_0(P_T)\,g_1(P_T)}{g_0(P_T)\,h_1(P_T) - g_1(P_T)\,h_0(P_T)} \qquad (16) $$

Unlike other applications of PWMs, here PPWM estimators are applied to the log-transformed data. Applying a log transformation to the data before calculating sample moments is another approach to reducing the influence of the largest observations [Stedinger et al., 1993, p. 18-5]. Hosking [1989] developed a real-space PWM estimator for the parameters of a lognormal distribution. The simplifications that Hosking employed in his derivation cannot be adapted to real-space PPWM estimators with the lognormal distribution because the integration of (6) is over a limited domain and not from 0 to 1 as in Hosking's formulation. A real-space PPWM estimator with the lognormal distribution would require complicated integration procedures, which were not explored in this study.

Estimation of Statistics

The focus of this study is the estimation of the mean ($\mu$), standard deviation ($\sigma$), interquartile range (IQR), and quantiles with nonexceedance probabilities of 10% ($X_{10}$) and 90% ($X_{90}$). The MLE, LPPR, and PPWM estimators describe the mean and variance of the log-transformed data. The corresponding estimator of the various quantiles is

$$ \hat{X}_p = \exp(\hat{\mu}_Y + z_p\,\hat{\sigma}_Y) \qquad (17) $$

where $\hat{X}_p$ is an estimate of $X_p$, a quantile with a nonexceedance probability of p percent, and $z_p$ is the inverse of the standard normal cumulative distribution function evaluated at the pth percentile.

To obtain estimators of the mean and variance in real space, one could transform the log-space mean and variance into the real-space moments [Aitchison and Brown, 1957]. The real-space estimators would be biased due to this transformation, even if the log-space estimators are unbiased [Finney, 1941]. The real-space sample estimates of the mean and standard deviation are most sensitive to the largest observations, and lack of fit to these observations can produce substantial error in these estimators. Several studies have examined techniques which try to correct for this bias [Helsel and Cohn, 1988; Cohn et al., 1989; Newman et al., 1990]. Helsel and Hirsch [1992, pp. 360-361] note that compensating for this bias requires an assumption about distributional shape, which is impossible when the underlying distribution of the data is unknown.

A solution to this problem is to combine the observed data above the censoring threshold with estimators of the smallest observations which were censored because they fell below the measurement threshold. Estimates of the mean and standard deviation are then obtained as the sample mean and sample standard deviation of the completed data set. Helsel and Hirsch [1992] indicated that such estimation techniques are "robust" because they perform well even when the data are not lognormally distributed.

Gilliom and Helsel [1986] only employed this "robust" method using the LPPR estimator, though this technique could be applied with any estimator. With this technique the regression relationship is used to extrapolate the below-threshold observations. The plotting positions for the censored observations are

[Equation (18): the Hirsch-Stedinger plotting position for the censored observations; the formula is not recoverable from this copy.]

[Hirsch and Stedinger, 1987]. The censored observations are then estimated by

$$ \hat{X}_i = \exp[\hat{\mu}_Y + \hat{\sigma}_Y\,\Phi^{-1}(p_i)], \qquad i = 1, \ldots, c \qquad (19) $$

using $p_i$ from (18). The mean and standard deviation of the data are estimated as the sample mean and sample standard deviation of the completed data set. This technique will be used to obtain real-space estimators of the mean and standard deviation associated with all three estimators. For each estimator, estimates of the mean and standard deviation in logarithmic space will be used in (19) to obtain estimates of the censored observations. These estimates will be combined with the uncensored observations to obtain the sample mean and sample standard deviation of the completed data set.

Data Generation

Data for this experiment were generated from a number of distributions which are commonly used to describe water quality and water quantity data. Gilliom and Helsel [1986] generated data from four distributions: lognormal, contaminated lognormal, gamma, and delta. The contaminated lognormal distribution they used is a combination of two lognormal distributions, $X_1$, which describes 80% of the distribution, and $X_2$, which describes 20% of the distribution. The moments of the two distributions are related by $\mu_{X_2} = 1.5\,\mu_{X_1}$ and $\sigma_{X_2}/\mu_{X_2} = 2.0\,\sigma_{X_1}/\mu_{X_1}$. Gilliom and Helsel [1986] derive the moments of this distribution. The delta distribution employed in Gilliom and Helsel was a lognormal distribution plus a point mass of 5% at zero. Aitchison [1955] provides a general description of such a delta distribution.
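The two mixture distributions just described can be sampled with a few lines of Python. This is only a sketch under stated assumptions: the function names are hypothetical, the components are parameterized directly by a real-space mean and CV, and the 80/20 weights, the factor 1.5 on the mean, the factor 2.0 on the CV, and the 5% point mass are the values quoted above. How the paper tuned the components to reach a target overall mean and CV is not shown in this section and is not reproduced here.

```python
import math
import random


def lognormal_params(mean, cv):
    """Log-space (mu, sigma) of a lognormal with the given real-space mean and CV."""
    sigma2 = math.log(1.0 + cv * cv)
    return math.log(mean) - 0.5 * sigma2, math.sqrt(sigma2)


def contaminated_lognormal(rng, size, mean1, cv1):
    """80% from lognormal X1; 20% from X2 with 1.5 times the mean and 2.0 times the CV."""
    mu1, s1 = lognormal_params(mean1, cv1)
    mu2, s2 = lognormal_params(1.5 * mean1, 2.0 * cv1)
    sample = []
    for _ in range(size):
        if rng.random() < 0.20:
            sample.append(rng.lognormvariate(mu2, s2))
        else:
            sample.append(rng.lognormvariate(mu1, s1))
    return sample


def delta_lognormal(rng, size, mean, cv, p_zero=0.05):
    """Lognormal values with a point mass p_zero at zero (mean and CV refer to the lognormal part)."""
    mu, s = lognormal_params(mean, cv)
    return [0.0 if rng.random() < p_zero else rng.lognormvariate(mu, s)
            for _ in range(size)]


if __name__ == "__main__":
    rng = random.Random(1)
    x = contaminated_lognormal(rng, 25, mean1=1.0, cv1=0.5)
    d = delta_lognormal(rng, 25, mean=1.0, cv=0.5)
    print(sorted(x)[:3], d.count(0.0))
```

With this parameterization the mixture mean is 0.8 + 0.2 x 1.5 = 1.1 times the mean of the first component, which gives a simple check on a large simulated sample.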

Based on uncensored water quality records, Gilliom and Helsel suggest that these four distributions adequately describe the characteristics of most water quality data. They considered each distribution with a coefficient of variation (CV) of 0.25, 0.5, 1.0, or 2.0.

While censoring could occur when recording daily or even annual maximum river flows, it is most likely to occur when recording annual minimum river flows. The true distribution of river flows is not known, but several distributions have been found to describe annual minimum flows. Tasker [1987] recommended the log-Pearson III and Weibull distributions for estimating at-site frequency curves in Virginia. Condie and Nix [1975] found a Weibull distribution provided a good fit to Canadian rivers. Vogel and Kroll [1989] showed that a lognormal distribution is reasonable for annual minimum flows in the northeastern United States.

Based on these studies, data were generated from seven distributions: lognormal, contaminated lognormal, gamma, delta, Weibull, log-Pearson III with log skew equal to 0.25, and log-Pearson III with log skew equal to 1.0. For the log-Pearson III distribution, the log skews are representative of small and large values observed with annual minimum streamflow data [Tasker, 1989]. The log-Pearson III is similar in shape to the contaminated lognormal distribution. Both of these distributions have a thicker upper tail than the lognormal distribution. The contaminated lognormal and delta distributions used here were the same as those employed by Gilliom and Helsel [1986]. For each distribution, four variants, based on a CV = 0.25, 0.5, 1.0, and 2.0, were included. The mean of the flows was set to 1.0. A gamma distribution with a CV = 2.0 produces quantiles less than the median which are very close to zero (the 50th percentile of this distribution equals 0.17 and the 40th percentile of this distribution equals 0.06). Owing to the extreme character of a gamma distribution with CV = 2.0, data were not generated from this distribution. Interestingly, Gilliom and Helsel [1986] used results for a gamma distribution with a CV = 2.0 to show that the MLE could be a poor estimator of the mean and standard deviation for nonlognormal data.

Combining all combinations of distribution and CV (except gamma with CV = 2) yields 27 different parent distributions. Five thousand samples of 10, 25, and 50 observations (n = 10, 25, and 50) were generated for each of these 27 combinations. Complete samples were generated as well as samples with censoring levels set at the 10th, 20th, 40th, 60th, and 80th percentile of the parent distribution. Data sets with less than three uncensored observations were discarded. In practice, these estimation procedures would not be performed when only one or two observations are uncensored. Data sets were generated until 5000 acceptable data sets were available.

The delta distribution used in this experiment is a lognormal distribution with a point mass at zero having a probability of 5%. Since this distribution produces data equal to zero, the estimation methods cannot be performed for the case with 0% censoring, and thus that case was excluded.

Performance Measures

Estimation methods were compared using two different performance measures: the relative root mean square error in real space (R-rmse), and the root mean square error in log space (L-rmse). The bias of the estimators was also calculated, though those results are not reported. The relative root mean square error (R-rmse) of an estimator in real space was calculated as

$$ \text{R-rmse} = \left\{\frac{1}{N}\sum_{i=1}^{N}\left[\frac{\hat{\theta}_i - \theta}{\theta}\right]^{2}\right\}^{1/2} \qquad (20) $$

where $\hat{\theta}$ is an estimate of the statistic $\theta$, and N is the number of replicates of the experiment (5000 for each parent distribution). This metric is commonly used to evaluate the performance of estimation methods and was employed by Gilliom and Helsel [1986] and Helsel and Cohn [1988].

The second criterion is the log-space rmse, defined as

$$ \text{L-rmse} = \left\{\frac{1}{N}\sum_{i=1}^{N}\left[\ln(\hat{\theta}_i) - \ln(\theta)\right]^{2}\right\}^{1/2} \qquad (21) $$

With the log-space rmse, underestimation errors receive more weight than overestimation errors. This criterion was employed by Stedinger and Cohn [1986] and Fill [1994]. The real-space metric given by (20) assigns symmetric losses to over- and underestimation errors of equal magnitude. The log-space metric given by (21) assigns symmetric losses to equal percentages of over- and underestimation errors. It is easily shown that the two metrics are equivalent to first order for small errors.

The choice between competing estimators often requires a trade-off between bias and mean square error. In some cases, unbiased estimators may be scaled so that the resulting negatively biased estimator has a smaller mean square error than the original estimator. Consider the estimator of the variance with normal data. The traditional unbiased estimator, $s^2$, can be scaled by a factor $\gamma = (n - 1)/(n + 1)$, where n is the sample size, to produce a biased estimator, $\gamma s^2$, with the smallest mean square error among all estimators of the form $\gamma s^2$. However, if the root mean square error was divided by the expected value of the estimator, the resulting coefficient of variation would be the same for both estimators. Thus the scaled estimator has the same relative precision but is biased and hence does not represent any real improvement over the traditional estimator. The R-rmse performance criterion favors the scaled estimator over the traditional estimator, because the reduction in variance is greater than the squared increase in bias. The L-rmse criterion is not fooled by such a scaling because the variance of the logarithm of an estimator is unaffected by multiplying an estimator by a fixed scalar, and any increase in the log-space bias due to scaling increases the L-rmse of an estimator.

Results

This experiment includes results for data generated from the three groups of parent distributions. The first group includes results for samples drawn from lognormal distributions (LN). Since the MLE, LPPR, and PPWM estimators all assume a lognormal distribution, of interest is the performance of these estimators when the data are generated from that distribution. The second group (water quality (WQ)) comprises data generated from distributions commonly used to describe water quality data: lognormal (LN), contaminated lognormal (CLN), gamma (g), and delta (D). These were the distributions that Gilliom and Helsel [1986] considered in their analysis of censored water quality data. The third group (low flow, LF) includes data generated from distributions commonly used to describe annual minimum streamflows in the United States: lognormal (LN), log-Pearson III (LPIII) with log skew of 0.25 and 1.0, and Weibull (W).

Table 1. R-rmse and L-rmse of Estimators as a Percentage of the True Value for Lognormal Data

Each row lists Method followed by R-rmse and L-rmse pairs for X10, X90, IQR, MEAN, and s.d.

Censoring at the 0th Percentile
MLE    24 3 2 22 4 2 3 2 2 23 21 42 45
LPPR   23 22 25 22 23 24 23 21 42 45
PPWM   23 4 2 22 3 2 5 2 2 23 21 42 45

Censoring at the 20th Percentile
MLE    27 4 2 26 4 2 4 3 2 23 21 43 46
LPPR   30 5 2 28 5 2 5 3 2 23 21 43 46
PPWM   28 27 25 23 25 25 23 21 43 46

Censoring at the 60th Percentile
MLE    47 6 2 46 9 2 4 4 2 23 21 45 49
LPPR   63 57 26 24 30 28 23 21 46 51
PPWM   57 60 8 2 26 9 2 24 24 22 45 49

Censoring at the 80th Percentile
MLE    106 72 23 20 34 29 24 20 48 53
LPPR   168 3 103 0 0 24 6 2 2 29 24 50 58
PPWM   72 164 29 29 45 35 32 37 49 45

Data set size is 25. Standard errors of all estimates are <3% for all cases reported above.

For n = 10 and censoring at the 80th percentile, the probability of two or fewer uncensored observations is 68%. Results for censoring at the 80th percentile with n = 10 were therefore not considered meaningful. With n = 10 and censoring at the 60th percentile, the probability of two or fewer uncensored observations is 17%, while with n = 25 and censoring at the 80th percentile the probability is 10%. These cases were also not considered meaningful in this analysis. Numerical results for all cases are not reported here but are given by Kroll [1996].

Group I: Lognormal Data (LN)

Table 1 contains the relative real-space root mean square error (R-rmse) and log-space root mean square error (L-rmse) of the estimators averaged over all CV values when the underlying distribution is lognormal and n = 25, for censoring at the 0th, 20th, 60th, and 80th percentiles. In general, the real- and log-space metrics produce the same ranking of the estimators. The exception is at high censoring, where the bias of some estimators is large. For censoring at the 80th percentile the PPWM estimator of X10 has a smaller real-space rmse than the MLE and LPPR estimators, but a larger log-space rmse. Note that the R-rmse and L-rmse for some estimators are smaller with censoring at the 80th percentile than at the 60th percentile. This is probably due to rejecting a large number of data sets with two or fewer uncensored observations when censoring is at the 80th percentile.

Table 2 is presented to compare the two metrics for X10 estimators when CV = 1.0, n = 50, and the censoring threshold is at the highest level: the 80th percentile. Table 2 reports the R-rmse, L-rmse, mean, median, and bias of the estimators. The PPWM estimator has the smallest R-rmse but the largest L-rmse. The MLE estimator has the smallest L-rmse. Figure 1 illustrates the distribution of the three estimators based on the 5000 replicates. The MLE and LPPR estimators are less biased than the PPWM estimator but can also yield large overestimates. In general, a probability density function (pdf) symmetric about the true value is favorable. The distribution of the PPWM estimator resembles a scaled version of the distribution of the MLE or LPPR estimators. If one had used R-rmse as a comparison metric, the PPWM estimator would be better than the MLE and LPPR estimators for this case. Based on the pdf, mean, and median of the estimators, the MLE and LPPR estimators appear preferable to the PPWM estimator. Using L-rmse as a metric, a similar conclusion would be drawn. L-rmse gives greater weight to underestimation and less to overestimation and is not misled by scaled estimators which may produce a reduction in R-rmse.
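The effect of scaling on the two metrics can be illustrated with the normal-variance example from the Performance Measures section. The sketch below is a small Monte Carlo experiment, not the authors' code: it assumes the relative form of R-rmse and the log-difference form of L-rmse, uses a true variance of 1, and the function names and the choice n = 10 are illustrative.

```python
import math
import random
from statistics import variance


def r_rmse(estimates, true_value):
    """Relative real-space root mean square error."""
    n = len(estimates)
    return math.sqrt(sum(((e - true_value) / true_value) ** 2 for e in estimates) / n)


def l_rmse(estimates, true_value):
    """Log-space root mean square error (estimates must be strictly positive)."""
    n = len(estimates)
    return math.sqrt(sum((math.log(e) - math.log(true_value)) ** 2 for e in estimates) / n)


def variance_experiment(n=10, replicates=5000, seed=0):
    """Compare s^2 with gamma * s^2, gamma = (n - 1)/(n + 1), for standard normal data."""
    rng = random.Random(seed)
    gamma = (n - 1) / (n + 1)
    unscaled, scaled = [], []
    for _ in range(replicates):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        s2 = variance(sample)  # unbiased sample variance (divisor n - 1)
        unscaled.append(s2)
        scaled.append(gamma * s2)
    return {
        "R-rmse": (r_rmse(unscaled, 1.0), r_rmse(scaled, 1.0)),
        "L-rmse": (l_rmse(unscaled, 1.0), l_rmse(scaled, 1.0)),
    }


if __name__ == "__main__":
    for metric, (plain, shrunk) in variance_experiment().items():
        print(metric, "unscaled:", round(plain, 3), "scaled:", round(shrunk, 3))
```

In this setup the scaled estimator should show the lower R-rmse and the higher L-rmse, which is the behavior the text attributes to the PPWM estimator of X10 at the 80th percentile.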

Table 2. Performance of Estimators of X10 When CV = 1.0, Sample Size n = 50, and Censoring is at the 80th Percentile

Method   R-rmse   L-rmse   Mean   Median   Bias
MLE      0.35     0.36     0.28   0.27      0.04
LPPR     0.62     0.59     0.30   0.27      0.06
PPWM     0.32     1.41     0.17   0.15     -0.07

True X10 value = 0.243.

Figure 1. Probability density function of X10 estimators with CV = 1.0, n = 50, and censoring at the 80th percentile. True value of X10 = 0.243.
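The rejection probabilities quoted in the Results (68%, 17%, and 10%) follow from a binomial count of uncensored observations, assuming each observation independently falls above the censoring percentile of the parent distribution. The short sketch below reproduces them; the function name is illustrative.

```python
from math import comb


def prob_too_few_uncensored(n, censor_fraction, min_uncensored=3):
    """Probability that fewer than min_uncensored of n observations exceed the threshold."""
    p_uncensored = 1.0 - censor_fraction
    return sum(
        comb(n, k) * p_uncensored ** k * censor_fraction ** (n - k)
        for k in range(min_uncensored)
    )


if __name__ == "__main__":
    for n, frac in [(10, 0.8), (10, 0.6), (25, 0.8)]:
        print(n, frac, round(prob_too_few_uncensored(n, frac), 2))
```

The three cases printed are the ones the paper excludes as not meaningful.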

[Figure 2, six panels, each plotted against Censoring Percentile (0 to 80) with symbols for X10, X90, IQR, MEAN, and ST DEV: (a) LPPR to MLE: All Distributions; (b) PPWM to MLE: All Distributions; (c) LPPR to MLE: Water Quality Distributions; (d) PPWM to MLE: Water Quality Distributions; (e) LPPR to MLE: Low Flow Distributions; (f) PPWM to MLE: Low Flow Distributions.]

Figure 2. Performance ratios of LPPR to MLE and PPWM to MLE for lognormal, water quality, and low flow data with n = 10, 25, and 50.

L-rmse appears to be a better performance criterion than R-rmse for strictly positive estimators.

Based on L-rmse when the underlying distribution is lognormal, the MLE is the best estimator of X10, X90, IQR, $\mu$, and $\sigma$ over the range of cases in Table 1, though at extreme censoring (80th percentile) the PPWM is the best estimator of $\sigma$. All estimators had a comparable L-rmse for censoring at or below the 40th percentile for n = 25. Although the results are not reported here, when n = 10, all estimators had comparable L-rmse for censoring at or below the 20th percentile, and when n = 50, all estimators had comparable L-rmse for censoring at or below the 60th percentile. In general, the MLE and LPPR estimators of the standard deviation have a negative bias at extreme censoring (80th percentile), while the PPWM has a positive bias. This result may be due to rejecting data sets with two or fewer uncensored observations. Since on average it overestimates as opposed to underestimates the standard deviation, the PPWM estimator of the standard deviation produces a smaller L-rmse than the L-rmse of the MLE and LPPR estimators, while the R-rmse of all estimators are almost identical.

To illustrate how well the LPPR estimator performs relative to the MLE, a performance ratio (PR) for the estimators was calculated as:

$$ \text{PR} = \frac{\text{L-rmse(MLE)}}{\text{L-rmse(LPPR)}} \qquad (22) $$

Figure 2a is a plot of PR versus censoring percentile when the underlying distribution is lognormal. Note that the results for water quality and low flow distribution groups are also included in Figure 2. For each censoring percentile, the 15 values plotted for each group refer to the PR for the five statistics for n = 10, 25, and 50. To avoid bias due to rejecting a large number of samples, the cases with n = 10 for censoring at the 60th percentile and with n = 10 and 25 for censoring at the 80th percentile are omitted. As the censoring level increases,
g level increases, KROLL AND STEDINGER: ESTIMATION OF MOMENTS AND QUANTILES 1011 lese ar s thaR P , indicatin n1 e mor th f eo ghighera L-rmsr efo L-rms PPWe th f eo M estimato rmeae oth f slightls ni y lower LPPe th R estimato estimatorE r thaML e nth . Thi especialls si y than the L-rmse of the MLE estimator. At high censoring, the tru estimatorr efo e th sr ofo f XR wP .e Figurploth a f s i o t b e2 PPW bettea Ms i r estimato e standarth f o r d deviatione Th . PPWM compared to the MLE. As censoring increases, the MLE estimator of X90 performs poorly at low censoring when PPWM estimators hav higheea r L-rms estiE e- ML tha e nth e underlyinth g distribution gammas i . Comparing Figurec s2 mator, except for the standard deviation. The PPWM does LPPe th an, Rd2d estimators generall goobetter s a o s e da y rar especially poorly when estimating Xw. Except for estimation than the PPWM estimators. The exception is for estimators of of the standard deviation at high censoring percentiles, the PR the standard deviation at high censoring levels. LPPe oth f R estimator usualle sar f o y R close P tha 1 e o nth rt PPWe th M estimators, especiall t higheya r censoring thresh- Group III: Low Flow Data (LF) olds. This indicates that the LPPR is generally a better esti- LPPe Figurth f R o contain a estimatore2 R P e e th sth o t s mator than PPWM, thoug e performanchth e differencee ar s MLE estimators for results averaged over all low flow distri- modest except for Xlo estimators. butions. Figure 2e compares the PR of LPPR estimators to the Estimation of X10 is interesting since the methods are gen -E estimator e ML differenth r flow fo swlo t distributionsn I . erally forced to extrapolate below the censoring threshold. At general, the MLE estimators have a smaller L-rmse than the low censoring, most of the information about the distribution LPPR estimators. 
The exception is the estimator of X90 when lowes it d r an tai containes i l uncensoree th n di d observations, the underlying distribution is Weibull, a distribution whose rather than numbe f uncensoreo r d observations l estial -d an , shape differs significantly fro e e lognormashapth mth f o e l mation methods produce similar results rate censorf th e o s .A - distribution. ing increases, less overall information about the distribution is Figure 2b contains the PR of the PPWM estimators to the contained in the above threshold observations and the relativ eestimatorE ML r resultsfo s flodw wlo ove l distrial r - amount of information provided by the uncensored observa- butions. Figure 2f compares the PR of PPWM estimators to tions compare informatioe th o dt n provide censorine th y db g the MLE estimators for the different low flow distributions. rate decreases. These trends coul illustratee db d quantitatively When censorin highs gi L-rmse PPWe th , th f eo M estimatof ro usin e Fishegth r information matrix [Judge t al.,e 1985]e Th . the standard deviation is less than the L-rmse of the MLE use E informatioe sth ML n provide censoree th y db d observa- estimator. The MLE also performs poorly when estimating tions more efficiently tha othee nth r estimator thud san s pro- X90 for the Weibull distribution. In general, the MLE estima- duces an estimator of X10 with a smaller L-rmse than the other tors have a smaller L-rmse than the PPWM estimators. Com- estimators as the censoring rate increases. paring Figures 2e and 2f, the LPPR estimators generally are as Wit lognormae hth l distribution date th log-spac,n ai e ear good as or better the PPWM estimators, except when estimat- described by a . Use of L-rmse is therefore ing the standard deviation at high censoring percentiles. equivalen comparino t estimator e rmse th g th f eo s of d X10an X90 for the normal distribution. 
From this perspective, the PPW real-spaca Ms i e estimato normae th r rfo l distributions a , Conclusions opposed to a log-space estimator for the lognormal distribu- The following conclusions can be drawn from these experi- e real-spactione rmsth Th f . o e e estimator e normath r fo sl ments: distribution are the same as the L-rmse of the X10 and X90 1. The log-space rmse (L-rmse) and the relative real-space estimators for the lognormal distribution given in Table 1. rmse (R-rmse) produced similar rankin e estimatorsth f o g , normae Thuth r sfo l distributio thre e rmse th n th f eeo estima- except for some estimators with large negative biases. The tors of Xw are generally equivalent when censoring is at or L-rmse places a larger penalty on underestimation errors and belo 10te wth h percentil 10= ,n belor efo 20te wth h percen- smallea r penalt oven yo r estimation errors tha R-rmsee nth . belod 25= an 40te ,n w th tilr h efo percentil 50= .n r efo e L-rmsTh e a betteappear e b r o t sestimato r performance metric tha R-rmsee nth becaust misleano s i estimator y t edi b s Group II: Water Quality Data (WQ) which may represent a scaling that produces a negative bias LPPe Figurth f Ro containa estimatore2 R P e e th sth o t s an dsmallea r R-rms increaso n t ebu information n ei . estimatorE ML r resultfo s s averaged ove l wateal r r quality 2. Whe estimatore nth s were tested with data drawn from (WQ) distributions. These results are similar to those for the a range of distributions representative of both water quality lognormal distribution discussed above. Figur comparec 2 e s and water quantity , the of the estima- the PR of LPPR estimators to the MLE estimators for each tor generalls si samee yth . Regardles underlyine th f so g distri- water quality distribution. Even when the underlying distribu- generallE butionML e yth , performe better wels o d a s a l r than t lognormal estimatorE no s tioML nwa e th ,s generally per- the other estimators. 
formed better tha LPPe nth R estimators, especiall t higheya r 3. Acros l thresal e data three groupsTh e) estimator(1 : s censoring levels. However LPPe th , R estimatoa d Xf o r90ha (MLE, LPPR, and PPWM) generally produced comparable smaller L-rmse than the MLE estimator when the underlying L-rmse values when censorinr beloo t 20te a w th s h gperwa - distribution was gamma and censoring was at or below the 40th centile whe 40t10e = th ,nhn percentile whe nn = d 25an , percentile. The shape of a gamma distribution with a high CV the 60th percentil e exceptioe Th th . s e50 nwa whe = n value differs considerably from the shape of a lognormal dis- estimators of Xw, whose performance differe evet da lowena r tribution. censoring percentile. (2) At higher censoring, the MLE usually Figure 2b contains the PR of the PPWM estimators to the provided the best estimator of quantiles with a nonexceedence MLE estimators for results averaged over all water quality percent0 interquartile 9 probabilit th d d an an , 0 1 f yeo range. distributions. PPWe Figurth f compared o Me2 R estiP e - sth The exception is estimators of X90 when the shape of the estimatorE differene ML mator th e r th sfo o st t water quality underlying distribution was very different than that of a log- distributions censorinr .Fo belor o t 40te ga w th h percentilee ,th normal distribution, and censoring was at or below the 40th 1012 KROL STEDINGERD LAN : ESTIMATIO MOMENTF NO QUANTILED SAN S

percentile. (3) "Robust" fill-in methods produced efficient estimators of the mean and standard deviation when used with all three estimators. The MLE generally provided the best estimator of the mean and standard deviation. (4) In general, the LPPR estimators are as good as or better than the PPWM estimators. The LPPR estimators are easier to understand and implement than the PPWM and MLE estimators and thus are recommended for use in practice with medium to large sample sizes and low to moderate censoring.

4. Unlike most other applications of probability-weighted moments (PWMs), the PPWM estimator in this experiment is applied to the logarithms of the data. A log transformation reduces the influence of exceptionally large observations on the estimators of higher moments and quantiles. When the underlying distribution is lognormal, the PPWM, MLE, and LPPR estimators of X10 and X90 are equivalent to real-space estimators for the normal distribution, and the L-rmse is equivalent to a real-space rmse. For normal data the performance of the estimators was very similar for moderate censoring. At higher censoring the MLE performs better than the other estimators.

Acknowledgments. The authors appreciate comments provided by Ed Gilroy, Steve Millard, and Dennis Helsel, who served as reviewers of the manuscript. The authors also express their gratitude toward Richard Vogel, Timothy Cohn, and Jorge Damazio, who also reviewed the manuscript.

References

Aitchison, J., On the distribution of a positive random variable having a discrete probability mass at the origin, J. Am. Stat. Assoc., 50, 901-908, 1955.
Aitchison, J., and J. A. C. Brown, The Lognormal Distribution, Cambridge Univ. Press, New York, 1957.
Cohen, A. C., Truncated and Censored Samples, Marcel Dekker, New York, 1991.
Cohen, M. A., and P. B. Ryan, Observations less than the analytical limit of detection: A new approach, JAPCA, 39(3), 328-329, 1989.
Cohn, T. A., L. L. DeLong, E. J. Gilroy, R. M. Hirsch, and D. K. Wells, Estimating constituent loads, Water Resour. Res., 25(5), 937-942, 1989.
Condie, R., and G. A. Nix, Modeling of low flow frequency distributions and parameter estimation, paper presented at International Water Resources Symposium: Water For Arid Lands, Teheran, Dec. 8-9, 1975.
David, H. A., Order Statistics, John Wiley, New York, 1981.
Fill, H., Improving flood quantile estimates using regional information, Ph.D. thesis, Sch. of Civ. and Environ. Eng., Cornell Univ., Ithaca, N. Y., 1994.
Finney, D. J., On the distribution of a variate whose logarithm is normally distributed, J. R. Stat. Soc., Ser. B, 7, 155-161, 1941.
Gilliom, R. J., and D. R. Helsel, Estimation of distributional parameters for censored trace level water quality data, 1, Estimation techniques, Water Resour. Res., 22(2), 135-146, 1986.
Gupta, A. K., Estimation of the mean and standard deviation of a normal population from a censored sample, Biometrika, 39, 260-273, 1952.
Hammett, K. M., Low-flow frequency analysis for streams in west central Florida, U.S. Geol. Surv. Water Resour. Invest. Rep., 84-4299, 1984.
Helsel, D. R., Less than obvious, Environ. Sci. Technol., 24(12), 1766-1774, 1990.
Helsel, D. R., and T. A. Cohn, Estimation of descriptive statistics for multiply censored water quality data, Water Resour. Res., 24(12), 1997-2004, 1988.
Helsel, D. R., and R. J. Gilliom, Estimation of distributional parameters for censored trace level water quality data, 2, Validation techniques, Water Resour. Res., 22(2), 147-155, 1986.
Helsel, D. R., and R. M. Hirsch, Statistical Methods in Water Resources, Elsevier Sci., New York, 1992.
Hirsch, R. M., and J. R. Stedinger, Plotting positions for historical floods and their precision, Water Resour. Res., 23(4), 715-727, 1987.
Hosking, J. R. M., The theory of probability weighted moments, Res. Rep. RC12210, IBM Res. Div., T. J. Watson Res. Cent., Yorktown Heights, N.Y., April 3, 1989.
Hosking, J. R. M., and J. R. Wallis, Parameter and quantile estimation for the generalized Pareto distribution, Technometrics, 29(3), 339-349, 1987.
Hosking, J. R. M., J. R. Wallis, and E. F. Wood, Estimation of the generalized extreme-value distribution by the method of probability weighted moments, Technometrics, 27(3), 251-261, 1985.
Joiner, B. L., and J. R. Rosenblatt, Some properties of the range in samples from Tukey's symmetric lambda distributions, J. Am. Stat. Assoc., 66, 394-399, 1971.
Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lutkepohl, and T. C. Lee, The Theory and Practice of Econometrics, chap. 19, pp. 797-821, John Wiley, New York, 1985.
Kroll, C. N., Censored data analyses in water resources, Ph.D. thesis, Sch. of Civ. and Environ. Eng., Cornell Univ., Ithaca, N. Y., 1996.
Landwehr, J. M., N. C. Matalas, and J. R. Wallis, Probability weighted moments compared with some traditional techniques in estimating Gumbel parameters and quantiles, Water Resour. Res., 15(5), 1055-1064, 1979.
Liu, S., and J. R. Stedinger, Low stream flow frequency analysis with ordinary and Tobit regression, paper presented at 18th Annual Conference and Symposium of the ASCE: Water Resources Planning and Management and Urban Water Resources, Am. Soc. of Civ. Eng., New Orleans, La., May 20-22, 1991.
Newman, M. C., P. M. Dixon, B. B. Looney, and J. E. Pinder, Estimating mean and variance for environmental samples with below detection limit observations, Water Resour. Bull., 25(4), 905-916, 1989.
Stedinger, J. R., and T. A. Cohn, Flood frequency analysis with historical and paleoflood information, Water Resour. Res., 22(5), 785-793, 1986.
Stedinger, J. R., R. M. Vogel, and E. Foufoula-Georgiou, Frequency analysis of extreme events, in Handbook of Hydrology, chap. 18, pp. 18.1-18.66, McGraw-Hill, New York, 1993.
Tasker, G. D., A comparison of methods for estimating low flow characteristics of streams, Water Resour. Bull., 23(6), 1077-1083, 1987.
Tasker, G. D., Regionalization of low flow characteristics using logistic and GLS regression, in Proceedings of the Baltimore Symposium, New Directions for Surface Water Modeling, pp. 323-331, Int. Assoc. of Hydrol. Sci., Gentbrugge, Belgium, 1989.
Vogel, R. M., and C. N. Kroll, Low-flow frequency analysis using probability-plot correlation coefficients, J. Water Resour. Plann. Manage., 115(3), 338-357, 1989.
Wang, Q. J., Estimation of the GEV distribution from censored samples by method of partial probability weighted moments, J. Hydrol., 120, 103-114, 1990.

C. N. Kroll and J. R. Stedinger, School of Civil and Environmental Engineering, Hollister Hall, Cornell University, Ithaca, NY, 14853-3501. (e-mail: [email protected]; [email protected])

(Received January 23, 1995; revised October 19, 1995; accepted October 27, 1995.)
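
Illustrative note on the first conclusion. The contrast drawn in the Conclusions between the L-rmse and the R-rmse can be checked with a small numerical sketch. The definitions used here are assumptions made for illustration and should be compared with the definitions given earlier in the paper: the L-rmse is taken as the root mean square of ln(estimate) minus ln(true value), and the R-rmse as the root mean square of (estimate minus true value) divided by the true value. The function names, the true quantile of 10, and the four "estimates" are hypothetical and are not results from the Monte Carlo experiments.

```python
import math

def l_rmse(estimates, true_value):
    """Log-space rmse (assumed definition): rms of ln(estimate) - ln(true)."""
    sq = [(math.log(e) - math.log(true_value)) ** 2 for e in estimates]
    return math.sqrt(sum(sq) / len(sq))

def r_rmse(estimates, true_value):
    """Relative real-space rmse (assumed definition): rms error / true value."""
    sq = [(e - true_value) ** 2 for e in estimates]
    return math.sqrt(sum(sq) / len(sq)) / true_value

true_q = 10.0      # hypothetical true quantile
over = [20.0]      # overestimate by a factor of 2
under = [5.0]      # underestimate by a factor of 2

# In log space both misses cost ln(2); in real space the overestimate
# costs twice as much as the underestimate.
print(l_rmse(over, true_q), l_rmse(under, true_q))
print(r_rmse(over, true_q), r_rmse(under, true_q))

# Hypothetical estimates, and the same estimates shrunk by 20 percent.
# Shrinking adds no information, only a negative bias.
estimates = [6.0, 9.0, 12.0, 18.0]
shrunk = [0.8 * e for e in estimates]

print(r_rmse(estimates, true_q), r_rmse(shrunk, true_q))  # R-rmse falls
print(l_rmse(estimates, true_q), l_rmse(shrunk, true_q))  # L-rmse rises
```

Under these assumed definitions, a factor-of-2 miss receives the same log-space penalty in either direction, whereas the relative real-space penalty is larger for the overestimate. For this particular set of numbers, rescaling the estimates downward lowers the R-rmse while raising the L-rmse, which is the kind of scaling effect the first conclusion warns about.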