A Comparison of Techniques for Assessing Central Tendency in Left

Chemosphere 80 (2010) 7–12 Contents lists available at ScienceDirect Chemosphere journal homepage: www.elsevier.com/locate/chemosphere A comparison of techniques for assessing central tendency in left-censored data using PCB and p,p0DDE contaminant concentrations from Michigan’s Bald Eagle Biosentinel Program Katherine F. Leith a,*, William W. Bowerman a, Michael R. Wierda a, Dave A. Best b, Teryl G. Grubb c, James G. Sikarske d a Department of Forestry and Natural Resources, Clemson University, Clemson, SC 29634, USA b United States Fish and Wildlife Service, East Lansing Field Office, East Lansing, MI 48823, USA c United States Forest Service, Rocky Mountain Research Station, Flagstaff, AZ 86001, USA d College of Veterinary Medicine, Small Animal Clinical Sciences, Michigan State University, East Lansing, MI 48825, USA article info abstract Article history: Monitoring of contaminants in the environment is an important part of understanding the fate of ecosys- Received 23 March 2010 tems after a chemical insult. Frequently, such monitoring efforts result in datasets with observations Accepted 30 March 2010 below the detection limit (DL) that are reported as ‘non-detect’ or ‘<DL’ and no value is provided. This Available online 22 April 2010 study explored the effects of non-detect data and their treatment on summary statistics. The data analyzed in this paper are real-world data. They consist of both large (N = 234) and moderate (n = 12–64) Keywords: sample sizes with both good and marginal fit to an assumed distribution with a log-transformation. Sum- Kaplan–Meier mary statistics were calculated using (1) the ‘0.0001’ near-zero method of substitution, (2) substitution Substitution with ‘1/2 DL’, (3) multiple imputation, and (4) Kaplan–Meier estimation. Median was used for compar- Outliers Ã Multiple imputation ison. Several analytical options for datasets with non-detect observations are available. The general con- Non-detect sensus is that substitution methods ((1) and (2)) can produce biased summary statistics, especially as levels of substitution increase. Substitution methods continue to be used in research, likely because they are easy to implement. The objectives were to (1) assess the fit of lognormal distribution to the data, (2) compare and contrast the performance of four analytical treatments of left-censored data in terms of estimated geometric mean and standard error, and (3) make recommendations based on those results. Ó 2010 Elsevier Ltd. All rights reserved. 1. Introduction elli et al., 2005; Helsel, 2005a,b, 2006; Eastoe et al., 2006; Needham et al., 2007; Antweiler and Taylor, 2008). Specifically, it has been Monitoring contaminant levels in the environment is an impor- shown that the bias caused by substitution increases dramatically tant part of understanding the fate of ecosystems after a chemical as the percent of observations censored increases (Eastoe et al., insult. As levels decrease over time, limitations in analytical equip- 2006). In spite of this, various substitution methods continue to ment create a lower bound below which contaminant levels cannot be used in research, frequently with little regard for the proportion be accurately reported. This results in datasets with observations of observations censored. below the detection limit (DL) that are reported only as ‘non-de- In addition to censored observations, another complication tect’ or ‘<DL’ and no value is provided. This type of distribution is when analyzing environmental contaminant data is the possibility called ‘left-censored’, as the low-end observations that are un- that datasets display a right skew. This occurs when a few samples known generally occur near the origin of the x-axis in figures. Sev- show very high concentrations while the general tendency is for eral options for the analysis of these datasets have been concentrations to be lower. This distribution is common in envi- investigated leading to the conclusion that methods replacing all ronmental data and can frequently be accommodated by log-trans- non-detects with a single value (substitution methods) are fre- formation. The median has traditionally been an accepted measure quently inferior (Liu et al., 1997; Singh and Nocerino, 2002; Baccar- of central tendency for data which do not fit a normal distribution well, and this approach has been used in almost every field of sci- entific inquiry. Focus has shifted to newer approaches as more * Corresponding author. Address: Department of Forestry and Natural Resources, complex methods have been developed and computing power 261 Lehotsky Hall, Clemson University, Clemson, SC 29634, USA. Tel.: +1 864 656 has grown to make them feasible for the average researcher. While 5332 (O), +1 864 903 2728 (C); fax: +1 864 656 3304. E-mail addresses: [email protected], [email protected] (K.F. Leith). the median is still useful in that it is not based on indefensible 0045-6535/$ - see front matter Ó 2010 Elsevier Ltd. All rights reserved. doi:10.1016/j.chemosphere.2010.03.056 8 K.F. Leith et al. / Chemosphere 80 (2010) 7–12 assumptions about the shape of the distribution, it does not make plored the effects of non-detect data and their treatment on sum- use of all the information contained in a dataset. A final complexity mary statistics. The data analyzed in this paper represent both is added by the fact that lognormal data are frequently summa- large (N = 234) and moderate (n = 12–64) sample sizes with both rized using the geometric mean, which is particularly sensitive to good and marginal fit with a log-transformation. Summary statis- the choice of substitution value. tics were calculated using the current method of substitution with In 1997, the Michigan Department of Environmental Quality ‘0.0001’, the common method of substitution with ‘1/2 Ã DL’, MI, (MDEQ) implemented a Bald Eagle Biosentinel Program (BEBP) to and KM estimation. The median was also calculated for compari- monitor trends of a suite of organic pollutants under the Clean son with all four methods. The objectives were to (1) assess the Michigan Initiative (MDEQ, 1997). The data analyzed here are fit of lognormal distribution to the data, (2) compare and contrast 234 observations of total polychlorinated biphenol (PCB), which the performance of the four methods of non-detect handling in is a sum of all observed PCB congeners, and p,p0-dic- terms of estimated geometric mean, comparison to the median, hlorodiphenyldichloro-ethylene (p,p0DDE) concentrations found and standard error, and (3) make recommendations based on those in nestling bald eagle plasma samples from throughout the State results. of Michigan. Current statistical methods include tests of significant differences between regions at several geographic scales, and cal- 2. Methods culating descriptive statistics in the form of geometric means. Sub- stitution is currently used in cases of non-detect, or left-censored, Concentrations of total PCBs and p,p0DDE (lg/kg wet weight) in observations. plasma collected from nestling bald eagles across Michigan from This choice of methods for addressing the non-detects does not 1999 to 2003 were used in these analyses. In addition to analysis affect testing of significant differences between regions. The mon- as a single dataset representing the whole State of Michigan, data itoring program uses nonparametric Kruskal–Wallis and Wilcoxon were also classified geographically by subpopulation, based on the tests which are rank-based and make no assumptions about distri- classifications used in the BEBP. Subpopulations were defined by bution and so, are not sensitive to the problems of substitution first subdividing the state spatially into the categories of Great (Helsel, 2005b). However, summary statistics are reported as geo- Lakes and Inland breeding areas. Great Lakes breeding areas are de- metric means, which are affected by the choice of substituted va- fined as being within 8.0 km of Great Lakes shorelines and/or along lue. The BEBP currently substitutes the near-zero value of tributaries open to Great Lakes fish runs and inland breeding areas ‘0.0001’ for concentrations at non-detectable levels when analyz- are defined as being greater than 8.0 km from the Great Lakes ing contaminants that are summed, such as total PCBs. This near- shorelines and not along tributaries open to Great Lakes fish runs. zero value might appear to have little influence on the resulting These categories are then further subdivided into four Great Lakes calculations to those accustomed to arithmetic mean calculation and two Inland groups. The Great Lakes subpopulations consisted because the arithmetic mean is a function of addition, for which of Lake Superior (LS), Lake Michigan (LM), Lake Huron (LH), and ‘0’ is the identity value. Geometric means on the other hand are Lake Erie (LE). The Inland subpopulations consisted of Upper Pen- a function of multiplication, for which the identity value is ‘1’, insula (UP), and Lower Peninsula (LP) (Wierda, 2009). The data while near-zero values (like ‘0.0001’) have a drastic impact on analysis for this paper was generated using SASÒ software, Version the product. For contaminants that do not involve summation, a 9.1.2 of the SAS system for Windows. Copyright 2000–2004 SAS value of one-half of the quantification level is substituted for Institute Inc. SAS and all other SAS Institute Inc. product or service non-detects. names are registered trademarks or trademarks of SAS Institute The desire to summarize data more accurately has fueled recent Inc., Cary, NC, USA. comparisons of proposed analytical alternatives to substitution or simple median reporting (Helsel, 2005b). One alternative, maxi- mum likelihood estimation (MLE), forces the researcher to assume 2.1. Assessment of fit the shape of the underlying distribution, but is powerful if this assumption is correct. These MLE methods have been explored in Data were analyzed for significant departure from the lognor- a variety of environmental applications (Singh and Nocerino, mal distribution using the UNIVARIATE PROCEDURE with the op- 2002; Helsel, 2005b, 2006; Antweiler and Taylor, 2008; Jain et tions ‘normal’ and ‘plot’ activated (SAS Institute Inc., 2000–2004).

A Comparison of Techniques for Assessing Central Tendency in Left

Comparison of Harmonic, Geometric and Arithmetic Means for Change Detection in SAR Time Series Guillaume Quin, Béatrice Pinel-Puysségur, Jean-Marie Nicolas

University of Cincinnati

Chapter 4: Central Tendency and Dispersion

Central Tendency, Dispersion, Correlation and Regression Analysis Case Study for MBA Program

Incorporating a Geometric Mean Formula Into The

Wavelet Operators and Multiplicative Observation Models

Measures of Central Tendency for Censored Wage Data

Redalyc.Robust Analysis of the Central Tendency, Simple and Multiple

“Mean”? a Review of Interpreting and Calculating Different Types of Means and Standard Deviations

Expectation and Functions of Random Variables

Notes on Calculating Computer Performance

Robust Statistical Procedures for Testing the Equality of Central Tendency Parameters Under Skewed Distributions