August 2006 FRESHWATER BIOASSESSMENT 1277

Ecological Applications, 16(4), 2006, pp. 1277–1294 Ó 2006 by the the Ecological Society of America

QUANTIFYING BIOLOGICAL INTEGRITY BY TAXONOMIC COMPLETENESS: ITS UTILITY IN REGIONAL AND GLOBAL ASSESSMENTS

1 CHARLES P. HAWKINS Western Center for Monitoring and Assessment of Freshwater , Department of Aquatic, Watershed, & Earth Resources, Utah State University, Logan, Utah 84322-5210 USA

Abstract. Water resources managers and conservation biologists need reliable, quantita- tive, and directly comparable methods for assessing the biological integrity of the world’s aquatic ecosystems. Large-scale assessments are constrained by the lack of consistency in the indicators used to assess biological integrity and our current inability to translate between indicators. In theory, assessments based on estimates of taxonomic completeness, i.e., the proportion of expected taxa that were observed (observed/expected, O/E) are directly comparable to one another and should therefore allow regionally and globally consistent summaries of the biological integrity of freshwater ecosystems. However, we know little about the true comparability of O/E assessments derived from different data sets or how well O/E assessments perform relative to other indicators in use. I compared the performance (precision, bias, and sensitivity to stressors) of O/E assessments based on five different data sets with the performance of the indicators previously applied to these data (three multimetric indices, a biotic index, and a method used by the state of Maine). Analyses were based on data collected from U.S. stream ecosystems in North Carolina, the Mid-Atlantic Highlands, Maine, and Ohio. O/E assessments resulted in very similar estimates of mean regional conditions compared with most other indicators once these indicators’ values were standardized relative to reference-site means. However, other indicators tended to be biased estimators of O/E,a consequence of differences in their response to natural environmental gradients and sensitivity to stressors. These results imply that, in some cases, it may be possible to compare assessments derived from different indicators by standardizing their values (a statistical approach to data harmonization). In situations where it is difficult to standardize or otherwise harmonize two or more indicators, O/E values can easily be derived from existing raw sample data. With some caveats, O/E should provide more directly comparable assessments of biological integrity across regions than is possible by harmonizing values of a mix of indicators. Key words: biological assessment of freshwater ecosystems; biological indices; Clean Water Act; conservation; harmonization; indicators of biological integrity; modeling; monitoring; multimetrics; pollution; RIVPACS; .

INTRODUCTION integrated, adaptive of having a There is a critical need to assess the biological status composition, diversity, and functional organiza- of the world’s freshwater ecosystems and determine tion comparable to that of the natural of the whether conditions are improving or declining (Revenga region.’’ This is the definition used by the U.S. Environ- and Kura 2003). This need was anticipated in the United mental Protection Agency (USEPA) when providing States almost 30 years ago when the modern Clean guidance to states and tribes regarding bioassessment Water Act was created (1972, amended in 1977), which programs (available online).2 There is not consensus, requires that states and tribal nations monitor and assess however, on how it should or can be measured. the biological integrity of their waters. Biological Over the last two decades there has been considerable integrity was defined by Frey (1977:128) as ‘‘the work devoted to the development of biological indica- capability of supporting and maintaining a balanced, tors for use in assessing the biological integrity of freshwater ecosystems (USEPA 2002b), and many states Manuscript received 25 October 2004; revised 19 August in the United States and several countries have active 2005; accepted 23 August 2005. Corresponding Editor: E. H. biological monitoring and assessment programs. How- Stanley. For reprints of this Invited Feature, see footnote 1, p. 1249. 1 E-mail: [email protected] 2 hhttp://www.epa.gov/bioindicators/html/biointeg.htmli 1277 Ecological Applications 1278 INVITED FEATURE Vol. 16, No. 4

PLATE 1. Indicators of taxonomic completeness require site-specific estimates of the taxa expected under specific natural environmental settings. The models used to derive these estimates are calibrated with data collected at a series of reference sites that represent the range of natural conditions within a region of interest. Sampling methods that adequately characterize the biota at a site are required to obtain accurate and precise models. The photo shows Scott Rollins, a Ph.D. student from Michigan State University, completing sampling on the Verde River, Arizona, USA, as part of an effort to derive reference conditions for streams in the western United States. Photo credit C. P. Hawkins. ever, the independent development of assessment what biological condition is, and (2) they differ in how methods by different political jurisdictions has resulted expected values are derived. in the use of a large mix of indicators about which we Biotic indices (BI) measure the average pollution have little knowledge regarding their comparability. tolerance of taxa found at a site and are typically This issue is particularly problematic given the emerging calculated as RTVI 3 ni / N, where TVi ¼ the tolerance need in the United States, Europe, and elsewhere to value of taxon i, ni ¼ abundance of taxon i, and N ¼ the integrate multiple assessments conducted at small scales total number of individuals in the sample. Biotic indices into regional- or national-level assessments. For exam- are based on the idea that unpolluted water bodies ple, assessments made by the states in the United States contain many pollution-sensitive taxa (low tolerance are supposed to be summarized by the U.S. Environ- values), whereas polluted water bodies contain mostly mental Protection Agency in bi-annual reports that pollution-tolerant taxa (e.g., Chutter 1972, Hilsenhoff describe the status and trends of the Nation’s water 1987). A low BI value implies high biological integrity. quality (e.g., USEPA 2002a). However, meaningful Until recently, most tolerance values used to estimate BI summaries have been impossible because of insufficient values were derived by comparing how abundances of or incompatible data (U.S. General Accounting Office different taxa vary across gradients of known or 2000, Heinz Center 2002, USEPA 2003). presumed stress or water quality (e.g., Lenat 1993). Incompatibility between assessments can occur for Biotic indices are used as the main indicator of two reasons: (1) biota are sampled in different ways and biological quality in several countries (see overview by (2) we use different indicators to measure biological Johnson et al. [1993]) and in at least one U.S. state. In condition (e.g., Houston et al. 2002, Davies and Jackson the United States, biotic indices are used most often as 2006). In the United States three main types of one of the component metrics in a multimetric index. indicators are commonly used to measure biological Multimetric indices (MMI) were conceived as a way condition: biotic indices, multimetric indices, and to quantify Frey’s (1977) concept of biological integrity, measures of taxonomic completeness. There are at least e.g., Karr’s (1981) index of biological integrity (IBI). two reasons why these different types of indicators may Index values are calculated by summing the stan- yield different inferences regarding the biological status dardized values of several different types of individual of a water body: (1) they are based on different ideas of metrics (e.g., richness, tolerance, composition, guild August 2006 FRESHWATER BIOASSESSMENT 1279 structure) derived from a sample of organisms. The might differ in their accuracy and precision because of selection of metrics used in the index can be based on differences in the predictor variables used, the methods either a conceptual understanding of what attributes are used to select predictor variables, differences among biologically important (e.g., Karr’s original IBI) or models in the taxonomic resolution applied to the biota identification of that subset of the many possible metrics being modeled, methods used to sample biota, decisions that best discriminates between reference and degraded regarding the PC threshold to use when calculating E water bodies (e.g., Barbour et al. 1999). Assessments are and O, or decisions regarding what subset of the biota made by comparing observed MMI values to expected inhabiting a water body are used in assessments (e.g., values that are derived from an appropriate set of Hawkins et al. 2000, Ostermiller and Hawkins 2004). reference sites (sensu Stoddard et al. 2006). MMI values A hybrid assessment method has also been developed that fall within the range of expected values imply high by the Maine Department of Environmental Protection biological integrity, whereas values lower than that (DEP) that uses discriminant-function models to predict observed at reference sites imply biological degradation. the a priori, legally defined water-quality classes to Although the original MMI approach most closely which samples belong from the values of 1– 9 biological parallels Frey’s concept of biological integrity, it is not metrics and indices measured in samples (Davies et al. clear that all MMIs will lead to directly comparable 1995). Although this method shares some of the inferences. Regional differences in the fauna or flora, predictive machinery used by O/E models (discrimi- which will affect the set of metrics used in MMIs (e.g., nant-function models), it is similar to BI and MMI Fore et al. 1996, Klemm et al. 2002), and differences in methods in that the assessments are calibrated by the intensity of stress at non-reference sites used to training models to discriminate between samples col- calibrate MMIs might affect their comparability. lected from reference-quality and a-priori-defined de- Measures of taxonomic completeness are based on graded sites. estimates of the difference between observed (O) and Given the marked differences between many pro- expected (E) taxonomic composition. In the most grams in assessment methods and indicators, there are widespread implementation of this idea (e.g., Moss et two possible approaches to the synthesis of existing data al. 1987, Hawkins et al. 2000, Simpson and Norris 2000, for the purpose of creating larger, regional assessments. Wright 2000), the ratio, O/E, represents the proportion In one approach, a system might be developed for of predicted taxa that were observed in a sample. O/E translating among different types of indicators. Davies values near 1 imply high biological integrity and values and Jackson (2006) provide a conceptual framework, the ,1 imply biological degradation. O/E quantifies a biological condition gradient, that provides qualitative fundamental component of ecological capital, one of guidance regarding the biological attributes that should the three general indicators that the National Research be considered when making such translations. An Council identified as critical to monitor (NRC 2000). alternative approach is to use a single indicator that is Given that O is always a subset of E (the predicted taxa), it is a measure of the integrity of the native biota. Values general enough to measure what the other indicators near 1 are consistent with descriptions of those bio- measure, can be easily applied to all data sets, and thus logical attributes characteristic of biological integrity in avoid the need to develop translation functions. Because the highest quality tier of the biological condition O/E is based on the raw compositional data from which gradient described by Davies and Jackson (2006). A other indicators are derived, it might serve as such a unique property of O/E assessments is that, unlike biotic universal indicator if project specific effects on estimates indices and multimetric indices, values are not derived of O/E do not compromise its inter-project compara- from or calibrated against any stressor gradient. Instead, bility. empirical models that relate taxonomic composition to In this paper I examine the potential use of O/E as a naturally occurring environmental gradients are devel- universal indicator of biological integrity. To do so, I oped from data collected at a set of reference-quality compare the performance of O/E assessments with that sites that differ in their natural environmental setting of three other types of indicators: MMI, BI, and the (see Plate 1). These models are then used to predict what Maine DEP methods of assessment. I examine perfor- the probability of capturing (PC) each taxon in the mance of both O/E and the other indicators in terms of regional taxon pool would be at specific sites if those indicator bias and precision and sensitivity to stressors. sites were in reference condition. The expected number To further evaluate the robustness and comparability of of taxa (E) at a site is then estimated as RPCi for given O/E-based assessments, I also examine how variable PC threshold values (e.g., 0, 0.1, 0.5, etc.), where i ¼ each reference-site O/E values were across years, how taxon in the region of interest. O is that set of taxa with taxonomic resolution used in models affected values, PC greater than the specified threshold value that were and if the type of sampling method used to collect collected in a sample. The performance of O/E-based samples of biota affected O/E assessments. I conclude by assessments is therefore largely dependent on how well discussing both the potential advantages of O/E as a models predict the PC for different taxa under different means of providing standardized assessments across environmental settings (Clarke et al. 2003). Models regions as well as the pitfalls associated with its use. Ecological Applications 1280 INVITED FEATURE Vol. 16, No. 4

TABLE 1. Comparison, by taxon, of different assessment measures with O/E, the ratio of observed to expected taxonomic composition.

Data-set samples (mean 6 SD) Taxon, U.S. Assessment source, and habitat measureà Calibration (C) Validation (V) Test C–T 10th% C§ %T , 10th% C§ Invertebrates North Carolina, MH (208) (202) (984) raw NCBI 3.86 6 0.97 4.14 6 0.90 6.03 5.16 77 SNCBI 1.00 6 0.16 0.96 6 0.15 0.65 0.35 0.79 77 ASNCBI 1.00 6 0.09 1.01 6 0.07 0.72 0.28 0.90 80 O/Esp,0 0.99 6 0.16 1.03 6 0.14 0.70 0.29 0.78 61 O/Esp,0.5 1.01 6 0.14 0.98 6 0.13 0.62 0.39 0.83 78 O/Eg,0 0.99 6 0.15 1.03 6 0.14 0.72 0.27 0.82 64 O/Eg,0.5 1.01 6 0.13 0.98 6 0.11 0.65 0.36 0.83 77 O/Ef,0 1.00 6 0.13 1.03 6 0.13 0.77 0.23 0.85 64 O/Ef,0.5 1.01 6 0.10 1.00 6 0.08 0.73 0.28 0.87 72 Mid-Atlantic Highlands, FW (72) (14) (456) raw MIBI 77.3 6 14.2 75.3 6 13.0 52.7 24.6 55.1 50 SMIBI 1.00 6 0.18 0.98 6 0.17 0.68 0.32 0.71 50 O/E0 1.00 6 0.19 0.96 6 0.20 0.78 0.22 0.74 38 O/E0.5 1.01 6 0.17 0.98 6 0.16 0.64 0.37 0.77 67 Maine, AS (64) (20) (452) O/E0 1.01 6 0.26 0.98 6 0.24 0.78 0.23 0.64 33 O/E0.5 1.00 6 0.30 1.08 6 0.23 0.72 0.28 0.60 38 Ohio, AS and MH (58) (34) (322) raw ICI 42.8 6 8.57 42.1 6 7.69 33.3 9.5 30.0 35 SICI 1.00 6 0.20 0.98 6 0.18 0.78 0.22 0.70 35 O/E0 1.03 6 0.25 1.04 6 0.20 0.90 0.13 0.71 25 O/E0.5 1.04 6 0.16 1.01 6 0.16 0.80 0.24 0.79 44 Fish Ohio, MH (114) (0) (1438) raw IBI 46.6 37.0 33.5 40 SIBI 1.00 0.82 0.18 0.75 40 O/E0 0.99 0.82 0.17 0.75 39 O/E0.5 1.02 0.80 0.22 0.77 44 Notes: The numbers in parentheses indicate sample sizes. The proximity of the mean value for calibration (C) and validation (V) data sets to 1 is a measure of global accuracy. Precision is reported as the standard deviation of values obtained from reference sites (calibration and validation data sets). Sensitivity is reported as the difference between mean values obtained from test (T) and calibration samples. sampled: MH, multiple habitats; FW, fast-water habitats; AS, artificial substrates. à Where possible, the different measures were standardized (SNCBI, standardized North Carolina biotic index; SMIBI, standardized macroinvertebrate index of biotic integrity; SIBI, standardized [Ohio] index of biotic integrity; SICI, standardized [Ohio] invertebrate-community index) so the mean of calibration samples ¼ 1 to allow direct comparison with O/E values. In the case of the NCBI, values were inverted prior to standardization so that decreasing values implied increasing biological degradation. SNCBI values were further adjusted (ASNCBI) for the effects of four factors (latitude, longitude, distance from source, and calendar day). ASNCBI values are the residuals from the regression of SNCBI values on latitude, longitude, log distance from source, and calendar day. Original residual values had a mean of zero but were incremented by 1 to allow direct comparison with SNCBI and O/E values. The O/E models based on zero and 0.5 probabilities of capture are denoted as O/E0 and O/E0.5. Models based on species, genus, and family levels of taxonomic resolution are identified with sp, g, and f subscripts. § The percentage of test sites whose assessment values were below the 10th percentile of calibration sample values (%T , 10th% C) was used to show how model precision and sensitivity jointly influence the power of detection of biological impairment.

MATERIALS AND METHODS set contained samples collected at reference sites that Data sets were used to derive expected conditions at other sites (Stoddard et al. 2006) and samples from a series of test I based analyses on five data sets (Table 1). These data sites that varied in the degree to which they were included samples of stream benthic invertebrates and exposed to stressors and thus in their potential amount associated habitat information from four regions: North of biological impairment. I provide only a brief Carolina (NCDENR 2003), Maine (Davies and Tso- description of these data sets here. Full descriptions mides 2002), the Mid-Atlantic Highlands, a region that are available in the original reports cited above. spans several states (Klemm et al. 2002), and Ohio (Ohio For each data set, I built RIVPACS-type predictive EPA 1989). I also included one fish data set from Ohio models (Moss et al. 1987, Wright 2000) following the (Ohio EPA 1989). Invertebrates were identified to the procedures described in Hawkins et al. (2000), Hawkins lowest taxon possible in all data sets, including and Carlisle (2001), and Van Sickle et al. (2005). O/E chironomid midges, which were identified to genus or values were calculated based on two probability-of- species level. Fish were identified to species. Each data capture thresholds: PC . 0 and PC . 0.5. These two August 2006 FRESHWATER BIOASSESSMENT 1281 thresholds have been used elsewhere and essentially scale from 0 to 100, where 100 ¼ the best biological represent assessments based on either all taxa including condition. those that are expected to be extremely rare (i.e., PC . I constructed a predictive model from the same set of 0) or only those taxa that are expected to be moderately data used to construct the MIBI. Data from 86 reference common at a site (PC . 0.5). Ostermiller and Hawkins sites (72 calibration, 14 validation) were used to build (2004) discuss the statistical and biological reasons why the model, and it was applied to 456 test sites. Six use of an intermediate PC threshold such as 0.5 may variables were used in the predictive model: North have advantages over the inclusion of all taxa. I then Central Appalachian Mountains ecoregion (1 or 0 compared the performance of these O/E assessments [present or not]), Central Appalachian Ridge and Valley with those based on the indicators originally used for ecoregion (1 or 0), calendar day, elevation, carbonate each data set. concentration, and catchment area. Van Sickle et al. North Carolina.—These data were collected by the (2005) describe general statistical aspects of the model. North Carolina Department of Environment and Stressor data were also available for many of these sites, Natural Resources and consisted of 208 samples used which allowed me to compare sensitivities of both O/E to calibrate models, 202 validation samples, and 984 and the MIBI to variation in those factors likely causing ‘‘test’’ samples from potentially impaired sites. Samples biological impairment. were based on multi-habitat, qualitative collections of Maine.—Maine recognizes four aquatic-life use cate- invertebrates. North Carolina bases biological assess- gories (AA, A, B, and C), of which classes AA and A are ments on the North Carolina biotic index (NCBI), the highest quality waters defined as having ‘‘aquatic life which is calculated from tolerance values assigned to as naturally occurs’’ (Davies et al. 1995: Table 1), class B each of the taxa as described above. Tolerance values includes waters that receive discharges but experience no range from 0 to 10 with low values implying less ‘‘detrimental’’ biological change, and class C includes tolerance to stress. waters in which discharges may alter assemblage composition but assemblage structure and function are Because the taxonomic resolution in this data set was maintained. Waters that do not meet the minimal exceptionally good, I used this data set to examine if standards for Class C are grouped in a non-attainment taxonomic resolution affected the performance of O/E (NA) class. Predictions are derived from a set of assessments by constructing predictive models based on hierarchical discriminant-function models in which bio- species-, genus-, and family-levels of taxonomic reso- logical metrics are the predictors of class membership. lution. Six to eight predictive variables were used in The Maine Department of Environmental Protection these three models, which included elevation, stream biological data are based on samples collected from width, stream depth, percentage boulder substrate, artificial substrates (rock-filled baskets, bags, or cones), percentage rubble substrate, calendar day, latitude, which are allowed to colonize for about 28 days before longitude, and catchment area. Van Sickle et al. (2005) they are collected. describe general aspects of the species-level model. I analyzed data from 84 reference-quality samples (64 Because of the long period of record covered by this calibration, 20 validation) and 452 test sites that were data set, I also used this data set to determine if collected between 1974 and 1997. Model building estimates of reference condition were affected by the resulted in selection of five predictor variables: elevation, year in which data were collected. Such an effect could distance from stream source (DFS), latitude, the number bias assessments if O/E values, or other indicators, were of freeze-free days, and calendar day. Because Maine developed from data collected over a restricted period uses artificial substrates to collect invertebrate samples, I and then applied to data collected in other years. I also was also able to compare model performance derived examined if O/E values and the NCBI were differentially from this type of sampling with the performance of sensitive to year effects. models based on samples collected from natural Mid-Atlantic Highlands (MAH).—These data were habitats. collected in conjunction with the U.S. Environmental Ohio.—The Ohio Environmental Protection Agency Protection Agency (EPA)’s EMAP program (Herlihy et assesses their rivers and streams with both an inverte- al. 2000). For this study, I used invertebrate data brate-community index (ICI) and an index of biological collected from 542 fast-water (riffle) habitats. The U.S. integrity (IBI) based on fish samples (Ohio EPA 1989). EPA has constructed a multimetric index (the macro- For this paper I used data from only those samples for invertebrate index of biological integrity, MIBI) for which I could build and apply predictive models. The both riffle and pool habitats (Klemm et al. 2002). The number of samples that I could use in model building MIBI, which I consider here, included seven individual was also restricted by the number of sites for which metrics: mayfly, stonefly, caddis fly, and collector-filterer predictor variables were available. For comparisons richness; a biotic index; percentage non-insect individ- based on invertebrates, I used data from 58 reference uals; and percentage individuals in the top five dominant calibration sites, 34 reference validation samples, and taxa. Richness values were adjusted for catchment area, 322 test-site samples. Ohio uses two sampling methods and the overall index was standardized by the authors to for collecting invertebrates: Hester-Dendy, multiplate Ecological Applications 1282 INVITED FEATURE Vol. 16, No. 4 artificial substrates and qualitative kick-net samples For the NCBI, I also adjusted the standardized NCBI from multiple natural habitats. Because Ohio combines values (SNCBI) for environmental setting by calculating these data in their macroinvertebrate ICI, I also used the the residual values obtained after applying a multiple- combined data. However, I also developed preliminary regression equation to all sample data. This equation O/E models based only on data derived from Hester- described the effect of naturally occurring environ- Dendy samplers to assess the performance of models mental variables on SNCBI values and was derived from derived solely from artificial substrates. For compar- the calibration samples. Because the residuals for the isons based on fish, I used data from 114 reference sites calibration samples had a mean of zero, I added 1 to all and 1438 test-site samples. No validation samples were residuals to make these adjusted SNCBI values (ASNC- used when this model was built. Stressor data were BI) directly comparable with O/E values. available for a subset of these samples. Six predictor Theeffectofprecisiononinferences regarding variables were used in the invertebrate model: river basin biological impairment was assessed by determining (Maumee River Basin, 0/1), calendar day (calendar day), how many test-site samples fell outside the distribution ecoregion (Western Allegheny Plateau, 0/1), average of reference-site indicator values. For these tests, I used relative humidity, log slope of the sampled reach, and the lower 10th percentile of reference sample indicator log drainage area above the sampling location. Four values as a standard threshold below which values variables were used in the fish predictive model: latitude, would be considered biologically degraded. The 10th- longitude, log catchment area, and log slope of the percentile threshold value was used solely to standardize sampled reach (see also de Zwart et al. 2006). comparisons among methods and data sets and should not necessarily be considered a standard for regulatory Measures of performance purposes. Although use of the 10th percentile here might The performance of any bioassessment method can be represent an arbitrary choice for regulatory purposes, it characterized in three ways: precision, bias, and should represent a reasonable threshold for statistical sensitivity to stressors. Evaluation of such criteria is a comparisons among methods in that indicator values straightforward process when known standards can be less than this threshold have only a 10% probability of applied under controlled conditions. However, evalua- occurring by chance. Hence values this low should tion of the performance of biological indicators is usually represent a biologically real response to stress. complicated by the fact that the real degree of biological Use of a more stringent threshold such as the 1st degradation (changes in community structure and percentile, although leading to greater confidence that a function) at a site can never be fully known, i.e., we sample is degraded, could confound comparisons of cannot know how impaired a site is and in all the ways it detection frequencies among indicators because such is impaired prior to sampling it (Cao and Hawkins small percentile values can be easily influenced by 2005). We therefore have to compare methods against outliers in the different distributions of reference-site surrogate measures of biological impairment (e.g., values. Because the percentage of sites declared as presence of stressors) or against one another and then degraded by different methods will not necessarily use indirect means of judging the performance of change in parallel with differences in the percentile different methods relative to one another. threshold used, the comparisons based on the 10th I quantified precision as the standard deviation (SD)or percentile cannot be simply extrapolated to other coefficient of variation (CV) of indicator values derived thresholds. from the population of reference sites used to establish Unrecognized or uncontrolled variation associated expected conditions at assessed sites (see Stoddard et al. with naturally occurring factors can affect the accuracy 2006). Ideally, the only variation in reference-site values of assessments in addition to inflating estimates of error would be associated with sampling error, which, if above that associated with sampling error. An impor- minimized by adequate sampling, would allow detection tant area of current bioassessment research focuses on at test sites of small deviations from expected condition. how to best classify reference sites so as to minimize Because comparisons of precision can be confounded by such errors. I examined accuracy of assessments by use of different units of measurement, I standardized all determining the extent to which indicator values varied reference site NCBI (North Carolina biotic index) and with naturally occurring environmental gradients. This IBI values to have a mean of 1, the expected mean analysis can show if methods are locally biased even reference site O/E value derived from predictive models. though they may be globally accurate, i.e., accurate on Raw index values for test sites were then divided by the average across all sites that are assessed. For this mean of reference- site values to put NCBI and IBI analysis, I regressed indicator values derived from assessments in the same units of measure as O/E. calibration samples against the suite of available Standard deviations based on such standardized values variables describing the natural setting for each sample are equivalent to the CV calculated from raw values. I location. Those variables typically included measures of could not conduct a similar standardization with the stream size, geographic location, elevation, climate, Maine assessments because their assessment endpoints calendar day, and channel habitat condition. I also are categorical. conducted complementary analyses based on ANOVA August 2006 FRESHWATER BIOASSESSMENT 1283

TABLE 2. Regression statistics derived from calibration data sets describing bias in different assessment measures associated with site-specific differences in environmental setting.

Assessment Std. Data set measure R2 Factorà Coefficient coefficient§ Tolerance|| tP} North Carolina SNCBI 0.71 Constant 5.332 0.000 9.888 0.000 longitude 0.068 0.713 0.875 17.910 0.000 log DFS 0.096 0.268 0.876 6.727 0.000 latitude 0.027 0.089 0.988 2.371 0.019 date 0.0002 0.090 0.991 2.392 0.018 O/Esp0 0.05 constant 0.933 0.000 23.205 0.000 log DFS 0.083 0.226 0.995 3.326 0.001 O/Esp0.5 0.01 constant 0.958 0.000 35.186 0.000 log DFS 0.042 0.139 1.000 2.011 0.046 O/Ef0 0.02 constant 0.951 0.000 32.349 0.000 log DFS 0.040 0.140 1.000 2.030 0.044 Mid-Atlantic Highlands SMIBI 0.08 constant 1.200 0.000 10.776 0.000 log carbonate 0.118 0.286 0.947 2.451 0.017 log WSA 0.063 0.248 0.947 2.127 0.037 Ohio: invertebrates SICI 0.07 constant 0.685 0.000 4.860 0.000 habitat index 0.0043 0.296 1.000 2.243 0.029 Ohio: fish SIBI 0.46 constant 0.310 0.000 1.454 0.149 annual precipitation 0.00091 0.289 0.853 3.873 0.000 habitat index 0.0060 0.524 0.853 7.027 0.000 O/E0 0.10 constant 0.67 0.000 7.531 0.000 habitat index 0.0045 0.327 1.000 3.660 0.000 O/E0.5 0.11 constant 0.72 0.000 8.899 0.000 habitat index 0.0043 0.341 1.000 3.844 0.000 Assessment measure: SNCBI, standardized North Carolina biotic index; O/E, observed taxonomic composition as fraction of expected taxonomic composition (sp ¼ species, f ¼ family; 0 and 0.5 ¼ mean probability of capturing a taxon); SMIBI, standardized macroinvertebrate index of biotic interity; SICI, standardized (Ohio) invertebrate-community index; SIBI, standardized (Ohio) index of biotic integrity. à DFS ¼ distance to stream source, WSA ¼ watershed area, date ¼ day of year (i.e., 1–365). § Standardized (Std.) coefficients measure the relative strength of associations between indicator values and the different independent variables. || Tolerance is a measure of independence between predictor variables. High values (near 1) imply little colinearity with other predictors. } Two-tailed. to test for effects of overall regional setting as indicated habitat evaluation index’’ (QHEI) to measure habitat by the ecoregion from which samples were collected. I condition, an index that is based on similar metrics as also compared the standardized indicator values derived used by EPA: substrate, in-stream cover, channel from different methods with one another to determine to morphology, riparian and bank cover, and stream what extent one method was a biased estimator of the gradient (Rankin 1989). other. This latter analysis cannot evaluate accuracy per For my analyses, I first used multiple regression to test se, but can provide insight regarding the degree to which the hypothesis that indicators were sensitive to all two methods lead to similar inferences. measured stressors. Following that test, I conducted Different indicators may be differentially sensitive or another regression analysis on just those variables that responsive to stressors in general or to individual were statistically significant to determine how much of stressors. I measured sensitivity in two ways: (1) as the the observed variability in indicator values was asso- magnitude of difference between the mean standardized ciated with measured stressors. indicator values for the populations of reference and test sites examined, and (2) as the magnitude of standardized RESULTS regression coefficients derived from regressions of Comparisons of O/E with other indicators indicator values on different measures of stress. Data on stressors were available only for the Mid-Atlantic North Carolina stream invertebrates.—The standard- Highlands and Ohio data sets. Most stressor data were ized North Carolina biotic index (SNCBI) was both measured as concentrations of specific chemical constit- slightly less precise (reference-sample SD)andless uents (e.g., pH, sulfate, nitrogen), but habitat condition sensitive in detecting departure from reference con- was reported as aggregate indices of habitat-quality ditions than was the observed taxonomic composition measures. EPA measured habitat condition in terms of as a fraction of the expected taxonomic composition, the ‘‘Index of in-stream habitat,’’ which includes aspects O/E, based on species-level data, the level of resolution of channel sinuosity, amount of various types of usedintheNCBI(Table1).O/E based on the substrates, water depth, and velocity characteristics probability of capturing a taxon, PC, at PC . 0.5 (Kaufman et al. 1999). Ohio used the ‘‘qualitative (O/Esp,0.5) was slightly more precise than O/E based on Ecological Applications 1284 INVITED FEATURE Vol. 16, No. 4

2 TABLE 3. Associations (r ) between indicator values and ecoregion setting, together with mean indicator values for each ecoregion.

Data source, ecoregion N O/E05 NCBI SNCBI ASNCBI MIBI SMIBI ICI SICI IBI SIBI North Carolina Blue Ridge Mountains 141 1.02 3.35 1.09 1.01 Piedmont 38 0.99 4.83 0.84 0.96 MACP/SPà 29 1.00 5.09 0.91 1.02 r2 0.01NS 0.60 0.60 0.06 Mid-Atlantic Highlands North-central Appalachians 18 1.02 77.7 1.01 Blue Ridge Mountains 6 1.05 78.2 1.01 Central App. Ridge and Valleys 39 1.02 79.4 1.03 Central App. Mountains 9 0.93 66.9 0.87 r2 0.03NS 0.08NS 0.08NS Ohio: invertebrates Eastern Corn Belt Plains 26 1.04 43.9 1.03 Erie Drift Plain 13 1.07 44.2 1.03 Huron/Erie Lake Plain 3 1.04 41.3 0.97 Interior Plateau 3 0.92 39.0 0.91 Western Allegheny Plateau 13 1.03 40.5 0.95 r2 0.04NS 0.04NS 0.04NS Ohio: fish Eastern Corn Belt Plains 55 1.03 46.6 1.04 Erie Drift Plain 16 0.97 42.5 0.95 Huron/Erie Lake Plain 11 0.97 33.1 0.74 Interior Plateau 6 1.02 45.9 1.02 Western Allegheny Plateau 26 1.05 47.3 1.05 r2 0.03NS 0.30 0.30

2 Notes: All r values are statistically significant (P , 0.05) unless noted as nonsignificant (NS). For assessment-measure codes, see Table 2. For North Carolina, O/E05 is O/Esp,05. à Middle Atlantic Coastal Plain and Southeastern Plateau.

PC . 0(O/Esp,0). Both the lower precision and markedly after adjusting for natural setting (Table 1), a sensitivity of the SNCBI were associated with the consequence of removing apparent differences between strong dependency (71% of variation) of SNCBI values observed and expected values that were caused by on naturally occurring environmental conditions (Table systematic variation among test sites in natural 2). Ecoregion accounted for less (60%) variation in features. Adjusting SNCBI values for longitude, dis- SNCBI values than the regression did (Table 3). The tance from source, latitude, and calendar day also adjustment of SNCBI values(ASNCBI) for variation in resulted in the removal of most, although not all, of the longitude, distance from source, latitude, and calendar association of SNCBI values with ecoregion (Table 3). day, resulted in ASNCBI being more precise than that In contrast to the SNCBI, none of the O/E models of O/Esp,0.5, and, consequently, the number of test sites exhibited substantial site-specific bias with respect to that were detected as being different from reference geographic location (latitude, longitude), stream size increased (Tables 1 and 4). However, the average (log DFS [distance from stream source]), and calendar difference between test and reference sites decreased day, although most models slightly underpredicted

TABLE 4. Concurrence between O/E0.5 assessments and the four other biotic indices (ASNCBI, SMIBI, SICI, and SIBI [fish]) in inferring if test sites are in reference or nonreference condition.

Percentage of samples Case ASNCBI SMIBI SICI SIBI Both O/E and assessment method concur in reference condition 9 30 46 45 Both O/E and assessment method concur in not reference condition 67 46 27 29 Only O/E implies reference condition 13 4 8 11 Only O/E implies not reference condition 11 20 19 15 Notes: Key to abbreviations: ASNCBI, adjusted standardized North Carolina biotic index; SMIBI, standardized macro- invertebrate index of biological integrity; SICI, standardized Ohio invertebrate-community index; SIBI, standardized Ohio index of biotic integrity. O/E for the ASNCBI and SIBI are based on species-level data; other O/E models are based on variable taxonomic resolution. Values are the percentages of samples that were in each of four possible categories of agreement. August 2006 FRESHWATER BIOASSESSMENT 1285

FIG. 1. Relationships between the adjusted standardized NCBI (ASNCBI), the standardized MIBI (SMIBI), the standardized ICI (SICI), and the standardized IBI (SIBI) with O/E0.5 values derived from test site samples from each data set. Models for the North Carolina invertebrates and Ohio fish are based on species or near-species taxonomic resolution. The other models are based on highest possible resolution, generally genus. Differences between the solid regression lines and the dashed 1:1 lines show the extent to which the other indicators and O/E are biased predictors of one another. The histograms show the distribution of sample values as measured by each method. Vertical and horizontal dashed lines indicate the lower 10th percentile values derived from the reference calibration sample values for each indicator and are equivalent to the 10th% C values in Table 1, which were used to estimate the percentage of samples in nonreference condition (%T , 10th% C). richness with increasing stream size (Table 2). None of declined as sites approached reference condition. The the variation in O/E values was associated with overall outcome of these differences was that use of the ecoregion setting (Table 3) showing that the models two assessment methods led to dissimilar inferences accounted for the effects of those stream-habitat factors regarding the biological status (reference or not) of a that vary with ecoregion in North Carolina. test-site sample in 24% of the cases examined based on Although both O/E and the ASNCBI led to generally use of a 10th–percentile-of-reference-values threshold similar inferences regarding the percentage of sites that (Table 4). were not in reference condition (Table 1), O/E and the Mid-Atlantic Highlands stream invertebrates.—The ASNCBI often resulted in markedly different site- standardized macroinvertebrate index of biological specific assessments (Fig. 1). These differences occurred integrity for the Mid-Atlantic Highlands (MAH), because the ASNCBI was a biased predictor of O/E as SMIBI, was marginally less precise than O/E0.5 assess- revealed by the .0 intercept and ,1 slope of this ments, and O/E0 was less precise than either O/E0.5 or relationship (P , 0.05). This relationship showed that the SMIBI (Table 1). In all cases, precision was less than the relative degree of impairment estimated by these two observed for the North Carolina data set. Regressions of indicators was a function of the overall degree of reference-sample indicator values on naturally occurring impairment at a site. In particular, the ASNCBI implied factors showed that the MIBI produced slightly (R2 ¼ that biological condition was better than that implied by 0.08) biased assessments depending on setting (Table 2). O/E at the most degraded sites, and this difference Reference-sample SMIBI values decreased with increas- Ecological Applications 1286 INVITED FEATURE Vol. 16, No. 4

TABLE 5. Regression statistics describing the response of O/E and standardized multimetric indices to variation among test sites in potential stressors measured at sites in the Mid-Atlantic Highlands (MAH) and Ohio, USA.

Assessment Standardized Data set measureà R2 Factor Coefficient coefficient Tolerance§ tPjj MAH SMIBI 0.35 Constant 0.763 0.000 9.667 0.000 SO4 0.220 0.361 0.987 10.303 0.000 TSS} 0.084 0.145 0.917 4.001 0.000 habitat index 0.039 0.376 0.918 10.350 0.000 O/E0 0.25 constant 0.449 0.000 3.465 0.001 pH 0.081 0.245 0.949 6.387 0.000 SO4 0.219 0.387 0.989 10.308 0.000 habitat index 0.021 0.226 0.942 5.861 0.020 O/E0.5 0.35 constant 0.061 0.000 0.471 0.000 pH 0.085 0.238 0.949 6.673 0.000 SO4 0.234 0.384 0.989 10.973 0.000 habitat index 0.041 0.401 0.942 11.177 0.000 Ohio invertebrates SICI 0.19 constant 0.159 0.000 0.067 0.067 habitat index 0.0093 0.390 0.996 7.100 0.000 log NH3 0.475 0.191 0.996 3.472 0.001 O/E0 0.06 constant 0.884 0.000 43.911 0.000 Pb 0.027 0.257 1.000 4.306 0.000 O/E0.5 0.14 constant 0.509 0.000 6.833 0.000 Pb 0.029 0.296 0.998 3.456 0.000 habitat index 0.0045 0.227 0.998 5.158 0.000 Ohio fish SIBI 0.25 constant 0.341 0.000 9.700 0.000 habitat index 0.0069 0.444 0.990 16.284 0.000 NH3 0.064 0.128 0.988 4.685 0.000 Zn 0.0011 0.088 0.889 3.047 0.002 Pb 0.0100 0.091 0.882 3.145 0.002 hardness 0.00017 0.069 0.986 2.538 0.011 O/E0 0.09 constant 0.542 0.000 14.682 0.000 habitat index 0.0044 0.244 0.977 8.138 0.000 NH3 0.0485 0.081 0.996 2.724 0.007 Zn 0.00064 0.071 0.911 2.295 0.022 Pb 0.0081 0.065 0.913 2.082 0.038 Cd 0.061 0.067 0.983 2.232 0.026 O/E0.5 0.15 constant 0.390 0.000 8.590 0.000 habitat index 0.0059 0.317 0.990 10.858 0.000 NH3 0.049 0.081 0.988 2.788 0.005 Zn 0.0019 0.126 0.889 4.098 0.000 Pb 0.0100 0.075 0.882 2.431 0.015 hardness 0.00018 0.061 0.986 2.081 0.038 Notes: No stressor data were available for North Carolina or Maine. Habitat condition was measured with indices in which increasing values imply better quality habitat. Number of samples for which stressor values were available: MAH, 456; Ohio invertebrates, 264; Ohio fish, 1013. à Key to abbreviations: SMIBI, standardized macroinvertebrate index of biological integrity; O/E, observed taxonomic composition as a fraction of expected taxonomic composition (subscripts 0 and 0.5 denote mean probability of capturing a taxon); SICI, standardized Ohio invertebrate-community index; SIBI, standardized Ohio index of biotic integrity. § A measure of independence between predictor variables; high values near 1 imply little collinearity with other predictors. jj Two-tailed. } Total suspended solids. ing concentrations of carbonate in stream water and sites than the MIBI as being in nonreference condition. increased with watershed area. None of the variation in As observed in the comparison between the ASNCBI and reference-site SMIBI values was significantly associated O/E, the intercept was .0 and the slope ,1(P , 0.05), with ecoregion setting (Table 3). O/E0 and O/E0.5 values which caused degraded sites to appear more degraded by were not related to any naturally occurring individual O/E assessments than by SMIBI assessments. Although factor that I was able to examine, nor were these the association between O/E and SMIBI values for test measures associated with ecoregion (Table 3). 0.5 sites was stronger than for any other comparison (Fig. 1), Both indicators showed that test sites were substan- the combination of differences in precision and bias tially degraded relative to reference conditions (Table 1). resulted in the MIBI and O/E0.5 leading to different O/E0.5 and the SMIBI had nearly identical mean values for test sites. Although mean test-site values were similar, conclusions regarding impairment in 24% of samples (Table 4). These disagreements were not symmetric with the slightly greater precision of the O/E0.5 model resulted in ;30% more test sites being inferred as in nonreference respect to the percentage of samples inferred to be in condition than the MIBI. In contrast, the lower precision reference or nonreference condition. Of the 109 samples of the O/E0 model resulted in it assessing ;35% fewer for which the two assessments disagreed, O/E0.5 was five August 2006 FRESHWATER BIOASSESSMENT 1287 times more likely than the MIBI to imply a sample was degraded. Both the MIBI and O/E varied in response to stressor gradients, but the two measures differed in their responsiveness to the suite of stressors present in the MAH region (Table 5). Variation in both O/E indicators was most strongly associated with the same three stressors (pH, SO4, and habitat modification), but more of the variation in O/E0.5 was associated with these stressors than that for O/E0, primarily because O/E0.5 was more strongly associated with the measure of habitat quality than was O/E0. The SMIBI was similar to O/E0.5 in how it declined with decreasing measures of habitat quality and increasing values of SO4, but in contrast to O/E0.5, it was not sensitive to pH but was sensitive to total suspended solids (TSS). Variation among sites in levels of stressors accounted for similar amounts of variation in the two types of indicators. Classifying sites by their dominant types of stressors provided somewhat different insights regarding the relative sensitivities of the different indicators (acid mine drainage ¼ metals and pH, pH ¼ acid deposition, nutrients ¼ phosphorus and nitrogen, mixed ¼ general habitat degradation plus other stressors). Although values of both SMIBI and O/E0.5 varied similarly among stressor categories (Fig. 2: top panel), ANOVA based on the sample-wise differences between O/E0.5 and the S-MIBI showed that the two measures were differentially sensitive to these broad categories of stress (Fig. 2: bottom panel, F ¼ 5.96, df ¼ 4537, P , 0.0005). In general, O/E0.5 appeared to be more sensitive to pH and acid mine drainage than was the MIBI, but there was little difference in sensitivity between the two measures at sites dominated by nutrients or mixed stressors. Maine stream invertebrates.—The Maine O/E indica- tors were among the least precise of the models examined FIG. 2. Box plots of the O/E0.5 (left member of pair) and SMIBI (right member of pair) values (top panel) and sample- (Table 1). Although the models showed no systematic wise differences between the two indicators showing the bias with respect to any natural environmental gradient differential sensitivity of the two indicators with respect to four examined (ecoregion, Maine biophysical region, latitude, classes of dominant stress occurring at each site: acid deposition basin size, elevation, channel gradient, temperature, (pH), acid mine drainage (AMD), nutrients (N), and mixed calendar day), the models were less precise than the stressors (M). Reference sites (R) are included for comparison. Box plots show the medians, first and third quartiles (top and other O/E indicators examined. The large reference-site bottom of boxes), and lower and upper inner fence values (61.5 O/E standard deviation for these models implies that 3 inner quartile range). Outliers are shown by stars. little of the variation in assemblage composition across reference sites was associated with variation in the (Davies et al. 1995, Davies and Jackson 2006). However, predictor variables used (elevation, distance from stream there was substantial variation in O/E values among source, latitude, number of freeze-free days, and calendar samples assigned to any of the water quality classes (Fig. day). This low precision resulted in a small percentage of 3). Only 31% (O/E0) and 38% (O/E0.5) of the variation in test-site samples falling below the 10th percentile of O/E values for test-site samples was associated with the reference-sample values even though mean O/E values water-quality class to which samples were assigned by estimated for test-site samples were not substantially the Maine method. different from that observed in other data sets. Ohio stream invertebrates.—Assessments based on the Mean O/E values declined with decreasing water- invertebrate-community index (ICI) and O/E0.5 model quality class as generally expected given how classes resulted in similar estimates regarding the average were defined by the Maine Department of Environ- biological condition of test site samples, but the O/E0 mental Protection. Classes AA and A were combined model resulted in substantially higher estimates of mean because they both imply excellent biological integrity condition for test site samples than either the ICI or Ecological Applications 1288 INVITED FEATURE Vol. 16, No. 4

For the 264 samples for which stressor values were available, SICI values were most strongly associated with the index of habitat quality (increased as habitat scores increased) and were less strongly associated with

NH3 concentrations (decreased with increasing concen- trations). O/E0 values were associated with only Pb concentrations (decreased with increasing concentra-

tions), but O/E0.5 values were associated with both habitat quality and Pb. The association of O/E0.5 with Pb was stronger than that with habitat. Because Pb and

log NH3 concentrations were correlated (r ¼ 0.51, P , 0.001), both stressor variables may be indicators of the same overall suite of stressors affecting biota at these sites. Ohio stream fish.—The general performance of the index of biological integrity (IBI) and both O/E measures were very similar (Table 1). The precision of the standardized IBI (SIBI) was slightly better than that

for O/E0.5, which was more precise than O/E0 assess- ments. The mean condition of test-site samples was also FIG. 3. Variation in test-site O/E0.5 values within and very similar among indicators (Table 1) as was the among the four different water-quality classes to which samples percentage of test-site samples that were assessed as were assigned by the Maine water-quality class predictive model. Classes AA and A have been combined (since both being in nonreference condition (Table 4). Even though imply excellent biological integrity). NA represents non-attain- the O/E0.5 model was slightly less precise than the SIBI, ment according to Maine water-quality criteria. The horizontal it assessed a slightly higher percentage of test site dashed line represents the 10th percentile of reference-site samples as being in nonreference condition than the O/E0.5 values. Box plots show the medians, first and third quartiles (top and bottom of boxes), and lower and upper inner SIBI did because O/E0.5 assessed test sites as slightly fence values (61.5 3 inner quartile range). Outliers are shown more degraded on average than did the SIBI. As in other by stars. data sets, values of the SIBI and O/E0.5 were correlated, and the SIBI was a biased predictor of O/E (intercept .

O/E0.5 (Table 1). The O/E0.5 model produced the most 0, slope , 1, P , 0.05, Fig. 1). This bias, together with precise assessments followed by the standardized ICI differences in precision, resulted in 26% of test-site

(SICI) and then the O/E0 model. These differences in samples being assessed differently in terms of whether precision resulted in corresponding differences in the they were in reference condition or not. O/E0.5 had a number of test-site samples that would be inferred as slightly higher tendency to imply samples were in being in nonreference condition (Table 1). The perfor- nonreference condition than the IBI did (Table 4). mance of O/E assessments based on Ohio invertebrates Both methods were subject to bias associated with was similar to that of the MAH predictive models in differences between reference sites in environmental terms of bias and precision, but the average condition of setting, the IBI substantially so (Table 2). Thirty percent test samples was higher and the number of test samples of the variation in SIBI values was associated with that would be considered to be in non-reference ecoregion (Table 3). Regression analysis showed that condition was lower for the Ohio data than the MAH even more of the variation among reference sites (46%) data (Tables 1 and 4). in the SIBI was associated with differences among sites Neither type of assessment method was strongly in habitat quality and annual precipitation (Table 2). biased by environmental setting. None of the variation This result implies that an ecoregion classification was in reference-sample indicator values was associated with only partly successful in accounting for natural variation ecoregion (Table 3), but ICI values did vary slightly with in fish assemblages among reference, sites and that differences among reference sites in qualitative habitat- variation in aspects of climate and channel features evaluation index (QHEI) values (Table 5). Neither of the affect assemblage structure within ecoregions. In con- O/E measures derived from reference-site samples varied trast, relatively little of the variation in reference-site with either ecoregion (Table 3) or QHEI (Table 5). ICI O/E values was associated with environmental setting. and O/E0.5 assessment values for individual test site About 10% of the variation in both O/E0 and O/E0.5 was samples were only weakly associated with one another related to variation among reference sites in habitat- (Fig. 1), and as in the other comparisons, the intercept quality scores (Table 2), but no variation in either O/E was .0 and the slope ,1(P , 0.05). measure was associated with any other channel or For these data the ICI and O/E values were only regional (e.g., climate, ecoregion) variable (Tables 2 and weakly associated with estimates of stressors (Table 5). 3). The O/E models were therefore successful in August 2006 FRESHWATER BIOASSESSMENT 1289 accounting for natural variation in biotic structure that occurred both among and within ecoregions. Both types of indicators were generally similar in their response to stressors (Table 5). The statistical tests of response to seven stressors showed that the IBI and both O/E measures varied with differences among test-site samples in measures of habitat condition, NH3, Zn, and Pb. The IBI and O/E0.5 were sensitive to the same five stressors, including hardness, to which O/E0 was not sensitive. O/E0 showed sensitivity to one stressor (Cd) to which the IBI and O/E0.5 did not. Variation in stressors accounted for between 9% and 25% of the variation in indicator values. More of the variation in the IBI was associated with stressors than were the O/E measures, although most of the variability in the IBI was associated with habitat condition (r2 ¼ 0.20), a factor that also varied substantially among reference sites. Both O/E measures were also more strongly associated with variation in habitat condition than variation in other potential stressors. Summary of comparisons of O/E with other indica- tors.—For each comparison, Fig. 1 shows graphically the lower 10th-percentile value for each indicator. Points that fall within the upper right and lower left quadrants as defined by these lines would be assessed similarly as either in reference condition or not in reference condition by the two methods as summarized in Table 4. Points that fall within the other two quadrants would be assessed FIG. 4. Variation in adjusted standardized North Carolina biotic index (ASNCBI) and O/E values for the 208 North differently by the two methods, which is also summarized Carolina reference sites used for model calibration over a 15-yr in Table 4. Relationships between the different indicators period of record. No two of these samples were collected from 2 and O/E were: ASNCBI ¼ 0.395 þ 0.527 3 O/Esp,0.5, r ¼ the same site. The linear best-fit line is shown for each indicator. 2 0.46; S-MIBI¼0.145þ0.839O/E0.5, r ¼0.66; SICI¼0.226 2 þ0.6863O/E0.5, r ¼0.34; SIBI¼0.368þ0.5693O/E0.5, reference-sample values. In general, the species-level 2 r ¼ 0.46. In all cases, intercepts and slopes were model based on PC . 0.5 most strongly discriminated statistically different (P , 0.05) from 0 and 1, respectively. between reference and test sites. Factors potentially affecting comparability Effect of sampling method on O/E assessments.—The of different O/E assessments two O/E models that were developed with invertebrate data collected from artificial substrates were distinctly Effects of taxonomic resolution on O/E assessments.— less precise than the O/E models developed from Taxonomic resolution influenced both the precision and samples collected from natural substrates. The precision sensitivity of O/E indicators (Table 1). In general, model of the Ohio model that I developed from invertebrates precision improved with decreasing taxonomic resolu- collected from Hester-Dendy samples was the lowest tion (i.e., species to family), and as a consequence the observed for any model (O/E : SD ¼ 0.35, not reported magnitude of biological change that could be detected 0.5 O E decreased. Furthermore, models based on PC thresholds in Table 1). The Maine / model based on inverte- of .0.5 were more precise than those based on PC . 0. brates collected from artificial rock baskets was also However, differences in taxonomic resolution affected imprecise (SD ¼ 0.26) relative to the models derived from sensitivity as well as precision, both of which in invertebrate samples collected from natural habitats combination affected assessments. For example, the (Table 1). The precision of both of these models is difference in mean O/E values between test and reference similar to that observed for null models (Van Sickle et sites increased with increasing taxonomic resolution, al. 2005), which implies taxonomic composition in these and as a consequence so did the percentage of test-site samples varied in a nearly random way between sites. samples with O/E values below the 10th percentile of Variation in reference-site indicator values with year of reference site values even though precision decreased. sampling.—In general, indicator values for both the Exclusion of locally rare taxa (PC . 0.5 models) also ASNCBI and O/E showed little obvious systematic resulted in both increasing differences between mean variation across the ;5500-day period of record in the reference- and test-site samples and the number of test North Carolina data (Fig. 4) even though this period of sites with O/E values below the 10th percentile of time included both droughts as well as regional flooding Ecological Applications 1290 INVITED FEATURE Vol. 16, No. 4 associated with hurricanes. None of the variation in individual sample was in reference condition or not must O/Esp,0.5 was associated with time since sampling arise from differences between indicators in either their started, and only 6% of the variation in ASNCBI values biological or statistical properties. Both types of effects was associated with time. These results imply that, on a are likely contributing to imperfect agreement between regional scale, invertebrate assemblages at reference sites indicators in their sample-specific assessments. Indica- were generally stable from year to year. tors that combine raw information on taxonomic composition into aggregate metrics, such as stonefly DISCUSSION richness or percentage tolerant individuals, could con- In this paper, I compared the performance of O/E, the ceivably be either more or less responsive to stressors observed taxonomic condition of a site as a fraction of affecting a site than changes in raw taxonomic the site’s expected (i.e., natural) composition, with the composition. Whether such an aggregate metric is more other main types of biological indicators used in the or less sensitive to stress than O/E could easily be depen- United States and elsewhere. My general aims were to dent on the philosophy used when selecting metrics, i.e., determine (1) how O/E performed relative to other a priori selection based on ecological principles or a indicators and (2) if O/E could serve as a standard posteriori selection based on empirical discrimination means of quantifying biological condition and thus between reference and stressed sites. Many of these comparing biological conditions across either politically differences in biological properties among indicators are or ecologically defined regions. This evaluation required not transparent to typical users, and to my knowledge that I examine performance measures for both O/E and we have seldom delved very deeply into how we should other indicators as well as factors that potentially affect interpret these metrics. In the future, we may want to their performance. scrutinize how well the indicators we use both measure overall biological integrity and are reflective of the Comparability among indicators values society actually places on freshwater ecosystems. For different biological indicators to be comparable, The consequences of differences between indicators in they must, in general, measure the same properties and their statistical properties are more easily understood. In respond to stress in parallel. Despite the fact that O/E this analysis, O/E0.5 assessments were more likely to and the other biological indicators are based on some- detect departures from reference condition than MMI what different biological attributes, on average, O/E assessments because of their greater precision as well as assessments were generally similar to assessments based their slightly greater sensitivity to whatever stressors on other indicators. This result might imply that all of existed at test sites (Table 1, Figs. 1 and 2). In many these indicators measure to a large extent the same cases differences in precision appeared small (e.g., 0.01– fundamental underlying property of biological assem- 0.02 SD units) and therefore potentially not meaningful. blages from which different measures of assemblage Although such differences may not be significant in structure are derived. Because tolerance values, biotic some specific instances, the general tendency for O/E0.5 indices, and other types of metrics are derived from the assessments to consistently have lower SD than other same raw information on composition that O/E indicators is evidence that such differences are likely measures, these indicators should be correlated with real. In general, the precision of O/E0.5 models vary O/E. from ;0.10 to .0.20 SD, thus differences of 0.01 SD units Differences in indicator performance that were may therefore represent a 10% difference in precision. apparent were associated with (1) differences between Models with SD , 0.15 are relatively good, account for a indicators in the precision with which we estimate substantial portion of the variation in assemblage expected values (Table 1) and hence the effect size structure between sites, and can approach pure sampling (departure from reference condition) that could be error among replicate samples within a site (e.g., detected; (2) differences in sensitivity among indicators Ostermiller and Hawkins 2004, Van Sickle et al. 2005). to natural environmental variability among sites (Table Models with SD . 0.20 are relatively imprecise, are 2) that affected bias of assessments; and (3) differences similar in precision to null models, and often account for in sensitivity among indicators to different stressors little of the biotic variation among sites. (Table 5, Fig. 2). These factors in combination resulted In practice, the greater precision gained from model- in both the North Carolina biotic ineex (NCBI) and all ing may at least partly disappear if expected ranges of three multimetric indices (MMIs) being biased predic- indicator values are adjusted by geographic or other tors of O/E in such a way that the agreement with O/E strata, as is done for both the NCBI and the Ohio ICI for a given sample decreased with increasing biological (invertebrate-community index) and IBI (index of bio- degradation. This bias in turn resulted in disagreement logical integrity) (Lenat 1993, Ohio EPA 1989). I between indicators in the estimated proportion of sites demonstrated such an effect when the precision of the that we would infer to be in nonreference condition NCBI was greatly improved by adjusting for environ- (Table 4). mental setting by modeling (71% of variation in SNCBI) The tendency for different indicators to frequently and to a lessor extent by adjusting for ecoregion (60%). (;25% of comparisons, Table 4) disagree in whether an In this case, however, the adjustments had mixed effects August 2006 FRESHWATER BIOASSESSMENT 1291 on assessment outcomes. The increase in precision only regions. In theory, O/E avoids these problems by basing had a marginal effect on the number of samples that assessments on the degree to which the observed taxa list were detected as being in nonreference condition, mainly matches the expected one. Unfortunately, even compar- because adjusting for local environmental setting re- isons between different O/E assessments are not without sulted in estimates of the average condition of reference- problems. and test-site samples becoming more similar to one another (Table 1). Three factors affecting direct comparability Although apparent at the site level, the effects of between O/E indicators biological and statistical differences between indicators O/E assessments are potentially sensitive to at least appeared to largely disappear when individual site three of the factors that can also influence other assessments were aggregated across samples, a process assessment methods: equivalence of reference sites, the that would be applied when conducting regional-scale taxonomic resolution used, and the sampling method comparisons of biological condition. Mean standardized used to collect biota. Stoddard et al. (2006) treat the first IBI values for test-site samples from Ohio were issue in detail, but the data examined here also illustrate remarkably similar to mean O/E0.5 values (within 0.04 the problem well. The magnitude of an O/E value is units, Table 1). These differences were similar to dependent not only on what biota are observed but on previously observed differences between O/E assess- the estimate of what biota should occur. If reference ments derived from models that were based on samples sites are of high quality, as many were in the Maine data taken either from different habitat types (fixed-area riffle set, estimates of E may represent something close to the vs. timed multi-habitat) or with different historical potential of a water body. On the other hand, counts (50–450) (Ostermiller and Hawkins 2004). if a region has experienced severe landscape alteration, Variation in mean O/E0.5 values across the Maine water the least-disturbed sites will likely represent something quality classes (Fig. 3) also points to some basic considerably below historical condition. Such a situation consistency between how O/E and the Maine method is certainly the case for Ohio and probably North assess biological condition. From a statistical perspec- Carolina and the Mid-Atlantic Highlands as well. For tive, these results imply that regional and national example, direct comparison of the mean O/E0.5 value for syntheses may be achievable by either harmonizing test-site samples from Maine (0.72) with that for Ohio indicators via standardizing indicator values to common (0.80) implies that biological conditions at Ohio test sites nondimensional units or by reanalysis of raw data to are less impaired than those from Maine. It is unlikely estimate O/E values. The only substantial inconsistency that Ohio streams and rivers are less impaired than those between assessments of mean condition were for the in Maine given the history of landscape alteration in NCBI and O/E. The mean of the adjusted, standardized each state. Direct comparison of O/E requires either NCBI, which is more comparable to how North ‘‘equivalent’’ reference-site quality, something that may Carolina applies the NCBI in practice by adjusting for be difficult to determine, or societal acceptance that E ecoregion, differed from O/E0.5 by 0.10 units for the represents not the historical biological potential of species model and 0.07 units for the genus model. In this aquatic ecosystems but regionally specific desired or type of situation, it will be more difficult to harmonize best attainable condition. Although the latter case may indicators by a simple standardization or re-scaling of greatly complicate comparisons of biodiversity loss or indicator values. the degree to which systems meet the biointegrity In general, because all O/E-based assessments are objective of the Clean Water Act (see Stoddard et al. designed to measure the same biological property, O/E 2006), use of least-impaired reference sites can at least assessments conducted in different regions should be establish a fixed benchmark to which future assessments more comparable to one another than comparisons within a region can be compared. To some extent then, based on either another type of indicator or on a mix of ‘‘equivalence’’ is in the eye of the beholder and depen- standardized indicators. Even if we can show empirically dent on the criteria used to define ‘‘expected.’’ Defining that assessments based on different indicators are expected condition will often have both scientific and statistically equivalent, biological inferences could re- social components. main problematic. For example, comparing across Differences between data sets in taxonomic resolution MMIs will require either that we assume their compo- and sampling method can also confound comparisons nent metrics are ecologically equivalent or that we between O/E assessments, but can potentially be develop ways to map different MMIs to a common controlled by standardization of methods (e.g., Oster- biological condition scale, i.e., the harmonization miller and Hawkins 2004). The trends observed among approach described by Davies and Jackson (2006). This the North Carolina models based on different levels of is an especially problematic issue when comparing taxonomic resolution (Table 1) are largely consistent assessments across landscapes the size of the entire with patterns emerging from the literature (Lenat and United States, where assemblage composition and Resh 2001, Waite et al. 2004). In general, there appears structure, and hence the ecological relevance of any to be a trade-off between precision and sensitivity to individual metric, will vary markedly across sites and stress that varies with the taxonomic resolution used. Ecological Applications 1292 INVITED FEATURE Vol. 16, No. 4

Use of coarse taxonomy results in more precise expec- condition of test-site samples (Table 1; also Ostermiller tations, probably because there are fewer rare taxa that and Hawkins 2004). In one sense, it is unlikely that are difficult to model and hence add error to predictions. artificial substrates adequately characterize the natural However, lumping of more highly resolved taxa biotic structure at a site and hence their use would likely obscures the signal shown by sensitive taxa within a be unsuitable for assessments of conservation status or given group. This trade-off was clear in the North potential. On the other hand, the biota that colonize Carolina data, where the mean test site O/E0.5 value artificial substrates may provide sufficient signal to changed from 0.62 for species to 0.65 for genera and detect biological changes relevant to Clean Water Act 0.73 for families (Table 1). In contrast, the SD of mandates. reference site O/E values changed from 0.14 (species) to 0.13 (genera) to 0.10 (families). In this case, detecting A fundamental assumption affecting departure from reference condition was more sensitive interpretation of all indicators to responsiveness to stress than precision given that the Most bioassessment methods in use today compare percentage of sites that were assessed as impaired the observed biota at a site to that estimated from relative to the 10th reference-sample percentile de- samples collected at several appropriate reference sites creased as taxonomic resolution decreased (Table 1). (Bailey et al. 1998, Reynoldson and Wright 2000, In practice, decisions regarding the level of taxonomic Stoddard et al. 2006). Most of the information for these resolution to use are made by balancing the sensitivity reference sites is usually collected over a short time needed against the costs of identifications. These issues period, and application of these data to future time are not as problematic for assessments based on fish, for periods requires an assumption that the distribution of which species-level identifications can often be con- conditions across reference sites does not change ducted in the field and are the norm. significantly over time (Stoddard et al. 2006). In general, Finally, this analysis provided evidence that the we often assume that the spatial variance in conditions performance of O/E models is affected by use of observed across environmentally similar reference sites artificial substrates to sample biota (Table 1). The most sampled over a short period of time is equivalent to the precise models were those based on the North Carolina temporal (year-to-year) variance we would observe at a data in which all natural habitats at a site were single site. Such a space-for-time substitution can greatly exhaustively sampled (SD for genus-based O/E0 and reduce the cost of assessments by alleviating the need to O/E0.5 models ¼ 0.15 and 0.13, respectively). The least constantly monitor individual reference sites and adjust precise models were those based on the Maine (SD ¼ 0.26 yearly assessments accordingly. However, this assump- and 0.30) and Ohio (SD ¼ 0.30) invertebrate data that tion has not been well tested. were collected from artificial substrates and which had The stability of the distributions of both O/E and been allowed to colonize for 28 to 42 days. Models for ASNCBI values observed across a 15-year period (Fig. the Mid-Atlantic Highlands (MAH) and for Ohio 4) implies that the space-for-time assumption may often invertebrates were intermediate in precision and were be reasonable and that conditions estimated at one time either based on less exhaustive sampling (MAH) or were apply to other times. However, other studies have noted based on a combination of data collected from artificial significant variation in assemblage structure over time and natural substrates (Ohio). These results do not periods as long as 20 years, some of which were appear to be consistent with analyses that show variance associated with climatic conditions (Bradley and Ormer- among replicate artificial substrates to be lower than od 2001, Metzeling et al. 2002, Daufresne et al. 2004). that among samples taken from natural habitats Although the results reported here were encouraging, (Rosenberg and Resh 1982, Morin 1985). However they more documentation is clearly needed describing long- are interpretable in terms of how well (or poorly) the term variation in reference-site conditions, especially fauna that initially colonize a new, standard habitat within different climatic settings, and how well the patch characterizes the fauna occurring either in the overall distribution of reference-site indicator values variety of natural habitats that occur within reaches or mimics the long-term variation at individual sites. among reaches that can vary substantially in the types of habitats present. Furthermore, it is not clear how Concluding comments between-sample variance within an individual site would In recent years there has been vigorous, and some- be related to predictive-model precision because the times passionate, debate among researchers and practi- error in these models is based on entire sites as sampling tioners regarding how to measure the biological integrity units, not individual subsamples within a site. Although of freshwater ecosystems (Gerritsen 1995, Norris 1995, this issue needs to be addressed more rigorously, it seems Karr and Chu 1999, 2000, Downes 2000, Norris and likely that, at a minimum, detectability of impairment Hawkins 2000). The debate has both conceptual and will be affected by sampling technique, and, as a empirical origins (NRC 1994, Boulton 1999, Karr and consequence, so will the percentage of test samples that Chu 2000, Norris and Hawkins 2000). Some view the are inferred as impaired (Table 1). It is less clear that general concept of biological integrity as heuristically sampling method will affect estimates of the average useful but essentially unmeasureable (e.g., NRC 1994). August 2006 FRESHWATER BIOASSESSMENT 1293

Some have expressed different views regarding the Clarke, R. T., J. F. Wright, and M. T. Furse. 2003. RIVPACS specific biological attributes that should be measured models for predicting the expected macroinvertebrate fauna (Karr and Chu 2000, Norris and Hawkins 2000). Others and assessing the ecological quality of rivers. Ecological Modeling 160:219–233. have questioned the adequacy of different indicators or Daufresne, M., M. C. Roger, H. Capra, and N. Lamouroux. analytical methods when measuring biological condition 2004. Long-term changes within the invertebrate and fish (Gerritsen 1995, Norris 1995, Fore et al. 1996). These communities of the Upper Rhoˆne River: effects of climatic differences in opinions have usually been fueled by both factors. Global Change Biology 10:124–140. Davies, S. P., and S. K. Jackson. 2006. The biological condition a scarcity of data and a failure by participants to clearly gradient: a descriptive model for interpreting change in distinguish between the technical merits of different aquatic ecosystems. Ecological Applications 16:1251–1266. indicators and the ecological and social values that users Davies, S. P., and L. Tsomides. 2002. Methods for sampling attach to those indicators. The analyses presented here and analysis of Maine rivers and streams. Maine Department show that choice of an indicator, or indicators, for use in of Environmental Protection, Bureau of Land and Water Quality, Division of Environmental Assessment, Augusta, regional-scale assessments should depend, in part, on Maine, USA. both the biological properties society wishes to measure Davies, S. P., L. Tsomides, D. Courtemanch, and F. and the statistical properties of each indicator. There are Drummond. 1995. Maine biological monitoring and biocri- clearly no perfect indicators that will satisfy all users or teria development program. DEP-LW108. Maine Depart- uses; however, the numerical simplicity of O/E, its ease ment of Environmental Protection, Augusta, Maine, USA. de Zwart, D., S. D. Dyer, L. Posthuma, and C. P. Hawkins. of biological interpretation, and its inherent standard- 2006. Predictive models attribute effects on fish assemblages ization to site-specific conditions make it an excellent to toxicity and habitat alteration. Ecological Applications 16: candidate as a general measure of biological integrity for 1295–1310. both local and regional/global assessments. Downes, B. 2000. Book review (Restoring life in running waters: better biological monitoring). Freshwater Biology 43: ACKNOWLEDGMENTS 663–665. The research on which this paper was based was supported Fore, L. S., J. R. Karr, and R. W. Wisseman. 1996. Assessing by a cooperative agreement with the Office of Science and invertebrate responses to human activities: evaluating alter- Technology of the United States Environmental Protection native approaches. Journal of the North American Bentho- Agency (CX-826814-01). I thank Susan Davies and David logical Society 15:212–231. Courtemanch of the Maine Department of Environmental Frey, D. G. 1977. Biological integrity of waters: an historical Protection, Trish McPherson and David Lenat of the North approach. Pages 127–140 in R. K. Ballentine and L. J. Carolina Department of Environment and Natural Resources, Guarraia, editors. The integrity of water: a symposium. U.S. and Chris Yoder and Ed Rankin formerly of the Ohio Environmental Protection Agency, Washington, D.C., USA. Environmental Protection Agency for access to their state’s Gerritsen, J. 1995. Additive biological indices for resource databases and for providing comments that improved the management. Journal of the North American Benthological manuscript. I also thank EPA EMAP for access to the Mid- Society 14:451–457. Atlantic Highlands data and Alan Herlihy for his help in both Hawkins, C. P., and D. M. Carlisle. 2001. Use of predictive defining reference sites and providing stressor data for the Mid- models for assessing the biological integrity of wetlands and Atlantic Highlands data set. Comments by Mark Vinson, Yong other aquatic habitats. Pages 59–83 in R. Rader, D. Batzer, Cao, Daren Carlisle, Lester Yuan, Susan Davies, and two and S. Wissinger, editors. Bioassessment and management of anonymous reviewers substantially improved earlier drafts of North American freshwater wetlands. John Wiley & Sons, the manuscript. New York, New York, USA. Hawkins, C. P., R. H. Norris, J. N. Hogue, and J. W. LITERATURE CITED Feminella. 2000. Development and evaluation of predictive Bailey, R. C., M. G. Kennedy, M. Z. Dervish, and R. M. models for measuring the biological integrity of streams. Taylor. 1998. Biological assessment of freshwater ecosystems Ecological Applications 10:1456–1477. using a reference condition approach: comparing predicted Heinz Center [The H. John Heinz III Center for Science and the and actual benthic invertebrate communities in Yukon Environment]. 2002. The state of the nation’s ecosystems: streams. Freshwater Biology 39:765–774. measuring the lands, waters, and living resources of the Barbour, M. T., J. Gerritsen, B. D. Snyder, and J. B. Stribling. United States. Cambridge University Press, New York, New 1999. Rapid bioassessment protocols for use in wadeable York, USA. streams and rivers: periphyton, benthic macroinvertebrates, Herlihy, A. T., D. P. Larsen, S. G. Paulsen, N. S. Urquhart, and fish. Second edition. EPA 841-B-99-002. U.S. Environ- and B. J. Rosenbaum. 2000. Designing a spatially balanced, mental Protection Agency, Office of Water, Washington, randomized site selection process for regional stream surveys: D.C., USA. the EMAP mid-Atlantic pilot study. Environmental Mon- Boulton, A. J. 1999. An overview of river health assessment: itoring and Assessment 63:95–113. philosophies, practice, problems and prognosis. Freshwater Hilsenhoff, W. L. 1987. An improved biotic index of organic Biology 41:469–479. stream pollution. The Great Lakes Entomologist 20:31–39. Bradley, D. C., and S. J. Ormerod. 2001. Community Houston, L., M. T. Barbour, D. Lenat, and D. Penrose. 2002. persistence among upland stream invertebrates tracks the A multi-agency comparison of aquatic macroinvertebrate- North Atlantic Oscillation. Journal of Animal Ecology 70: based stream bioassessment methodologies. Ecological In- 987–996. dicators 1:279–292. Cao, Y., and C. P. Hawkins. 2005. Simulating biological Johnson, R. K., T. Wiederholm, and D. M. Rosenberg. 1993. impairment for evaluating ecological indicators. Journal of Freshwater biomonitoring using individual organisms, pop- Applied Ecology 42:954–965. ulations, and species assemblages of benthic macroinverte- Chutter, F. M. 1972. An empirical biotic index of the quality of brates. Pages 40–158 in D. M. Rosenberg and V. H. Resh, water in South African streams and rivers. Water Resources editors. Freshwater biomonitoring and benthic macroinver- 6:19–30. tebrates. Chapman & Hall, New York, New York, USA. Ecological Applications 1294 INVITED FEATURE Vol. 16, No. 4

Karr, J. R. 1981. Assessment of biotic integrity using fish RIVPACS-type models. Journal of the North American communities. Fisheries 6 (6):21–27. Benthological Society 23:363–382. Karr, J. R., and E. W. Chu. 1999. Restoring life in running Rankin, E. T. 1989. . The qualitative habitat evaluation index waters: better biological monitoring. Island Press, Washing- (QHEI): rationale, methods, and application. State of Ohio ton, D.C., USA. Environmental Protection Agency, Ecological Assessment Karr, J. R., and E. W. Chu. 2000. Sustaining living rivers. Section, Division of Water Quality, Columbus, Ohio, USA. Hydrobiologia 422–423:1–14. Revenga, C., and Y. Kura. 2003. Status and trends of Karr, J. R., and D. R. Dudley. 1981. Ecological perspective on biodiversity of inland water ecosystems. Technical series water quality goals. Environmental Management 5:55–68. number 11. Secretariat of the Convention on Biological Kaufmann, P. R., P. Levine, E. G. Robison, C. Seeliger, and D. Diversity, Montreal, Quebec, Canada. D. Peck. 1999. Quantifying physical habitat in wadeable Reynoldson, T. B., and J. F. Wright. 2000. The reference streams. EPA/620/R-99/003. U.S. Environmental Protection condition: problems and solutions. Pages 293–304 in J. F. Agency, Washington, D.C., USA. Wright, D. W. Sutcliffe, and M. T. Furse, editors. Assessing Klemm, D. J., K. A. Blocksom, W. T. Thoeny, F. A. Fulk, A. the biological quality of fresh waters: RIVPACS and other T. Herlihy, P. R. Kaufmann, and S. M. Cormier. 2002. techniques. Freshwater Biological Association, Ambleside, Methods development and use of macroinvertebrates as Cumbria, UK. indicators of ecological conditions for streams in the Mid- Rosenberg, D. M., and V. H. Resh. 1982. The use of artificial Atlantic Highlands region. Environmental Monitoring and substrates in the study of freshwater benthic macroinverte- Assessment 78:169–212. Lenat, D. R. 1993. A biotic index for the southeastern United brates. Pages 175–235 in J. Cairns, Jr., editor. Artificial States: derivation and list of tolerance values, with criteria for substrates. Ann Arbor Science, Ann Arbor, Michigan, USA. assigning water-quality ratings. Journal of the North Simpson, J. C., and R. H. Norris. 2000. Biological assessment American Benthological Society 12:279–290. of river quality: development of AusRivAS models and Lenat, D. R., and V. H. Resh. 2001. Taxonomy and stream outputs. Pages 125–142 in J. F. Wright, D. W. Sutcliffe, and ecology- the benefits of genus and species level identifications. M. T. Furse, editors. Assessing the Biological Quality of Journal North American Benthological Society 20:287–298. Freshwaters: RIVPACS and other techniques. Freshwater Metzeling, L., D. Robinson, S. Perriss, and R. Marchant. 2002. Biological Association, Ambleside, Cumbria, UK. Temporal persistence of benthic invertebrate communities in Stoddard, J. L., D. P. Larsen, C. P. Hawkins, R. K. Johnson, south-eastern Australian streams: taxonomic resolution and and R. H. Norris. 2006. Setting expectations for the implications for the use of predictive models. Marine and ecological condition of streams: the concept of reference Freshwater Research 53:1223–1234. condition. Ecological Applications 16:1267–1276. Morin, A. 1985. Variability of density estimates and the USEPA [U.S. Environmental Protection Agency]. 2002a. Na- optimization of sampling programs for stream benthos. tional water quality inventory—2000 report. EPA-841-R-02- Canadian Journal of Fisheries and Aquatic Sciences 42: 00. U.S. Environmental Protection Agency, Office of Water, 1530–1534. Washington, D.C., USA. Moss, D., M. T. Furse, J. F. Wright, and P. D. Armitage. 1987. USEPA [U.S. Environmental Protection Agency]. 2002b. The prediction of the macro-invertebrate fauna of unpolluted Summary of biological assessment programs and biocriteria running-water sites in Great Britain using environmental development for states, tribes, territories, and interstate data. Freshwater Biology 17:41–52. commissions: streams and wadeable rivers. EPA-822-R-02- National Research Council. 1994. Review of EPA’s environ- 048. U.S. Environmental Protection Agency, Office of Water, mental monitoring and assessment program. Committee to Washington, D.C., USA. Evaluate Indicators for Monitoring Aquatic and Terrestrial USEPA [U.S. Environmental Protection Agency]. 2003. Draft Environments, Board on Environmental Studies and Tox- report on the environment.Technical document. EPA-600-R- icology, and the Water Science and Technology Board, 03-050. United States Environmental Protection Agency, Commission on Geosciences, Environment and Resources. Washington, D.C., USA. National Academies Press, Washington, D.C., USA. U.S. General Accounting Office. 2000. Water quality: key EPA National Research Council. 2000. Ecological indicators for the and state decisions limited by inconsistent and incomplete nation. National Academy of Sciences Press, Washington data. Report to the Chairman, Subcommittee on Water D.C., USA. Resources and the Environment, Committee on Trans- Norris, R. H. 1995. Biological monitoring: the dilemma of data portation and Infrastructure, House of Representatives. analysis. Journal of the North American Benthological GAO/RCED-00-54. U.S. General Accounting Office, Wash- Society 14:440–450. Norris, R. H., and C. P. Hawkins. 2000. Monitoring river ington, D.C., USA. health. Hydrobiologia 435:5–17. Van Sickle, J., C. P. Hawkins, D. P. Larsen, and A. T. Herlihy. North Carolina Department of Environment, Health and 2005. A null model for the expected macroinvertebrate Natural Resources. 2003. Standard operating procedures: assemblage in streams. Journal of the North American biological monitoring. Division of Water Quality, Raleigh, Benthological Society 24:178–191. North Carolina, USA. Waite, I. R., A. T. Herlihy, D. P. Larsen, N. S. Urquhart, and Ohio Environmental Protection Agency. 1989. Biological D. M. Klemm. 2004. The effects of macroinvertebrate criteria for the protection of aquatic life. Volume III. taxonomic resolution in large landscape bioassessments: an Standardized biological field sampling and laboratory example from the Mid-Atlantic Highlands, U.S.A. Fresh- methods for assessing fish and macroinvertebrate commun- water Biology 49:474–489. ities. State of Ohio Environmental Protection Agency, Wright, J. F. 2000. An introduction to RIVPACS. Pages 1–24 Ecological Assessment Section, Division of Water Quality, in J. F. Wright, D. W. Sutcliffe, and M. T. Furse, editors. Columbus, Ohio, USA. Assessing the biological quality of fresh waters: RIVPACS Ostermiller, J. D., and C. P. Hawkins. 2004. Effects of sampling and other techniques. Freshwater Biological Association, error on bioassessments of stream ecosystems: application to Ambleside, Cumbria, UK.