Guidance for experimental design and statistical analysis of ecotoxicological community effect studies (field studies)

Report to UK Chemicals Regulation Directorate (CRD)

CRD/DEFRA project code PS2363

Report Authors:

Alan Lawrence (Cambridge Environmental Assessments, ADAS UK Ltd)
Kevin Brown (Independent Consultant)
Geoff Frampton (University of Southampton)
Paul J. Van den Brink (Alterra & Wageningen University)

CEA report no. 1088

CONTENTS
Executive summary
1 Introduction
1.1. Aims, objectives and limitations
1.2. Project team
1.3. Report structure
1.4. Definitions
1.5. Funding
2 Summary of Guidance Documents
2.1. Section summary
3 The context of community studies
3.1. Introduction
3.2. Types of community studies and place in tiered testing
3.3. Advantages and limitations of replicated manipulative field studies
3.4. Population versus community; protection goals
3.5. Representation of communities by test systems
3.6. Section summary
4 Experimental design
4.1. Study layouts and randomisation of plots
4.2. Pseudoreplication
4.3. Grouping data for analysis
4.4. Section summary
5 General introduction to statistical testing
5.1. Introduction to statistics in community ecotoxicology studies
5.2. Hypothesis testing
5.3. Probability within statistical hypothesis testing
5.4. Repeated two sample testing
5.5. Importance of assumptions – normality, independence, randomisation
5.6. Replication
5.7. Section summary
6 Initial data interpretation and univariate approaches
6.1. Introduction
6.2. Initial investigation
6.3. Outliers
6.4. Screening prior to analysis
6.5. Data pre-requisites for analysis
6.6. Univariate tests and statistical power
6.7. Consecutive significant events
6.8. Percent reduction measures
6.9. Section summary
7 Multivariate approaches
7.1. Introduction to multivariate analyses
7.2. Background to ordination techniques
7.3. Introduction to Principal Response Curves analysis
7.4. PRC mechanics and terminology
7.5. Interpretation of first PRC axis
7.6. Construction and interpretation of the second PRC axis
7.7. Other PRC outputs, significance testing, explained variance
7.8. Summary output statistics from PRC analyses conducted in CANOCO
7.9. Data transformations
7.10. Analysing a reference item with PRC
7.11. Other uses for multivariate approaches in ecotoxicological field studies
7.12. Section summary
8 Application of PRC analysis to example scenarios
9 Summary of data examination and analysis
10 Monitoring studies for effects of plant protection products on terrestrial invertebrates
10.1. Background
10.2. Setting the research question
10.3. Spatial scale
10.4. Size and siting of study plots
10.5. Sampling locations within study plots
10.6. Temporal scale
10.7. Sampling considerations
10.8. Statistical analysis
11 References
Appendix I
Appendix II

Executive summary

Introduction

Ecotoxicological field studies conducted under EC Regulation 1107/2009 may be designed by industry notifiers in association with external consultants and contract research organisations (CROs). The results may then be interpreted and summarised before inclusion in a regulatory submission in support of a product or active substance registration. The submission will be reviewed by representatives from Member State (MS) Competent Authorities, in terms of both study quality and meaning. Not all parties involved may be familiar with the statistical design and interpretation requirements of such studies, which are complex. This project was conceived in order to provide background information and guidance to specialists and non-specialists alike on the statistical design and interpretation of ecotoxicological community effect studies.

The specific objectives of this project were to:

a. Develop a guide to best practice for experimental design of ecotoxicological community studies (aquatic mesocosms, non-target arthropod field studies, soil mesofauna field studies).
b. Develop a guide to best practice for statistical analysis of ecotoxicological community data (aquatic mesocosms, non-target arthropod field studies, soil mesofauna field studies).

Note this project was intended to provide statistical guidance on regulatory ecotoxicological field studies (a, above) for agrochemicals under EC Regulation 1107/2009, with a particular reference to principal response curves analysis, and was not intended to represent general statistical guidance for studies conducted for other purposes.

Method

The project was conducted by a team of four specialists with experience covering development of statistical techniques for analysing ecotoxicological community data, conduct of replicated field and large scale monitoring programmes, interpretation of community effects data and conduct of regulatory risk assessments.

The approach of the report is to introduce key aspects of the statistical design and interpretation of ecotoxicological community studies in a logical order. Each area is discussed and, where possible, case studies or examples are presented. Some are based on real data and others are based on artificial data, generated to provide clear examples. A summary of key points is provided at the end of each chapter. The report focuses on replicated manipulative field studies, but includes a chapter on field monitoring.

Key conclusions

The following is a summary of the key conclusions of the report, presented by chapter.

Study conception and context

• The objective of the study should be clearly defined, both so that sampling effort may be focused and so that interpretation is restricted to those questions for which the study design is appropriate (avoid opportunistic interpretation).
• The spatial scale of the study (including individual plots in terrestrial systems), in relation to the community studied, should be justified since scale may interact with detection of effect and time to recovery.
• The researcher should understand any other effects of the study design on the results, including capacity of the system to detect effect and recovery.

Study design

• The report focuses on replicated manipulative field studies.

• Study design should address the regulatory question.
• Underlying environmental gradients may affect the community present in the test system, including effect and recovery responses, and should be controlled by appropriate study design (randomisation and/or blocking).
• Random allocation of replicates to treatment groups is required to preserve the statistical assumption that samples are taken at random from the population. Individual samples should be taken without bias, for the same reason.
• Replication enables the comparison of different treatments. Studies without replication are limited to the comparison of plots only. Higher replication leads to greater ability to detect real effects and lower likelihood of erroneously concluding an effect when there is none.
• Study data may be analysed on a sampling method basis to avoid confusion when data from different sampling methods are combined. Likewise, study data may be analysed on a functional group basis to describe the various sub-community responses that may occur.

General statistical testing

• The ability of a statistical test to detect an effect is limited by study design (e.g. replication, meaningful grouping of taxa, independence, randomisation, accounting for underlying gradients) and appropriate application of that test. In addition, variability between samples affects the ability to detect trends.
• The researcher should understand the requirements and underlying assumptions of a statistical test. The use of any test should be justified in terms of suitability of the data and relevance of the test to the question posed.
• Two sample univariate tests (such as the t test) are inappropriate for conducting multiple comparisons between, for example, a control group and more than one treatment group for a given time point (e.g. when a taxon level NOEC/NOEL is required). This is because such tests are designed for the comparison of only two samples from a statistical population; repeated testing leads to an increased chance of concluding an effect where none exists. Alternatives (multiple hypothesis tests followed by multiple comparison tests) are appropriate.

Data interpretation, univariate analysis

• A visual inspection of the data, plotted as abundance over time for key taxa, will allow identification of outliers and an assessment of overall abundance and life history effects, and will aid in assessing the suitability of data for inclusion in univariate assessments.
• A property of community data is that many of the taxa present are scarce. Therefore a decision is required, preferably prior to study start, as to the threshold of abundance required before univariate analyses may be conducted. When data for scarce, patchily distributed taxa in controls are compared with treatment groups, the comparison may hold little information.
• Outliers should be identified by the above process; these may be attributable to experimental anomaly or error. Outliers should be treated consistently and any removal from data for analysis should be justified.
• Percentage reduction measures may aid in effects assessment, but do not account for taxon abundance, nor for anomalies between treatment group replicates. When abundance is low or data are very variable, such measures should be used with caution.

• Univariate tests may assume certain properties of the data, such as normality and equal variance. Transformations may be applied where required. Parametric tests assume an underlying distribution; non-parametric tests do not make this assumption, but still require that samples are taken at random.
• Appropriate multiple hypothesis tests (such as analysis of variance) will allow for a comparison between all treatment groups and will inform if the test item had an effect on the taxon or population as a whole. This may be followed by multiple comparison tests which will allow for the assessment of specifically which test item treatment (e.g. which concentration) had an effect on the taxon or population in question.
• It is recommended, due to the high number of univariate analyses that may be conducted as part of a community field study, that two consecutive significant effects at the taxon level are required before both a statistically and biologically significant effect is concluded. This is because with a high number of tests (each sufficiently abundant taxon at each time point), a number of ‘single significant effects’ would be expected to occur due to chance alone. There are two important caveats, however. First, the frequency of sampling events must be considered: if widely spaced, then severe effects may occur between sampling events and remain undetected, or be detected by only one sampling event and thus not considered a significant effect. Therefore, sample frequency must be adequate and matched to the biology of the test system. Second, a visual inspection of the data should also be conducted to detect, for example, a statistically non-significant reduction in abundance which occurs between two statistically significant reductions, which would otherwise lead to a conclusion of no significant effect, but which may actually represent a prolonged and significant effect. Therefore, both biological and statistical significance should be considered.
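The arithmetic behind the points on repeated testing and consecutive significant effects can be sketched as follows. This is a minimal illustration and not part of the guidance itself: the function names and the example study size (20 taxa, 10 sampling dates) are hypothetical, and the calculation assumes that all tests are independent and that no true effect exists, which is unlikely to hold strictly for repeated samples taken from the same plots.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive among n independent tests,
    each run at significance level alpha, when no real effect exists."""
    return 1.0 - (1.0 - alpha) ** n_tests


def expected_single_hits(alpha: float, n_taxa: int, n_dates: int) -> float:
    """Expected number of 'single significant effects' due to chance alone
    when every taxon is tested once at every sampling date."""
    return alpha * n_taxa * n_dates


def expected_consecutive_hits(alpha: float, n_taxa: int, n_dates: int) -> float:
    """Expected number of chance results that are significant on two
    consecutive dates (assumes independence between dates)."""
    n_adjacent_pairs = n_dates - 1  # date pairs (1,2), (2,3), ...
    return alpha ** 2 * n_taxa * n_adjacent_pairs


# Hypothetical example: three treatment-versus-control t tests at one date,
# then a study with 20 testable taxa sampled on 10 dates, all at alpha = 0.05.
print(round(familywise_error_rate(0.05, 3), 4))
print(round(expected_single_hits(0.05, 20, 10), 2))
print(round(expected_consecutive_hits(0.05, 20, 10), 2))
```

Under these assumptions, single chance results are expected to be far more numerous than chance results on two consecutive dates, which is the rationale for the recommendation above.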
Multivariate analysis

• Multivariate methods, derived from ordination, identify overall trends from multi-dimensional data.
• Data from all taxa, including those too scarce for use in univariate analyses, may be included in the analysis and as such multivariate methods may detect subtle, but consistent effects which univariate methods are unable to detect.
• The commonly used ‘principal response curves (PRC) analysis’ method is designed to identify global trends in community data which are confined to the effects of treatment. The analysis is able to identify more than one community response and provides a measure of affinity of each taxon to the community response.
• PRC provides output which is intuitive and clear yet preserves considerable detail, especially when second axes are considered.
• The PRC analysis may be accompanied by statistical testing to determine the significance of effects on the community as a whole at each time point and, where replication is adequate, allows the assessment of the significance of each treatment at each time point. Where replication is lower, other methods are available (discussed).
• The overall variance explained by treatment may be around 40%, for example, but this is usual for complex ecological studies.
• Where studies include a reference item, this should be analysed separately, otherwise the influence of the reference item on the community will be included with the response of the test item.

Application of statistical methods to study data

• Certain patterns in abundance data may lead to statistical results which are unexpected. For example, if a test item effect is followed by a decline in background abundance due to seasonal changes, then a statistical test may indicate no significant effect of test item in comparison to control. This may be due to very low numbers in the treatment group (as a result of test item effects) and also in control groups (due to seasonal decline) but the abundance data do not support a conclusion of no effect of the test item.
• Contrasting responses in a community (due to indirect effects, for example) may occur and, in PRC, may be described by the second (or further) axes.
• Different community responses may result in similar statistical outcomes, e.g. PRC diagrams. In principal response curves analysis, this is partly due to the setting of control group trajectory to zero.
• It is important to conduct multivariate analyses alongside visual inspection of abundance data and univariate analyses.

Field monitoring studies

• ‘Monitoring’ studies differ from replicated manipulative studies (so called ‘field studies’) in that the aim is to observe the trends in communities or taxa in relation to, for example, normal pesticide use, rather than to conduct a controlled and replicated experimental study to derive estimates of effect and recovery at various levels of exposure.
• As for replicated field studies, a focused regulatory question is necessary.
• Appropriate experimental scale, study duration, sampling intensity and taxonomic resolution should be applied.

Conclusion

Field studies represent a powerful tool in the environmental risk assessment process. The results may be usefully applied to risk assessment problems to provide added realism and reduced uncertainty. For this to be the case, however, proper study design and appropriate statistical analysis are required. Multivariate analyses, where appropriate, should be accompanied by univariate analyses, visual inspection of the data and ecological interpretation. This report provides discussion around many key issues of relevance to ecotoxicological community effect field studies.

1 Introduction Ecotoxicological field studies may be designed by industry notifiers in association with external consultants and contract research organisations (CROs). The results may then be interpreted and summarised before inclusion in a regulatory submission in support of a product registration. The submission will be reviewed by representatives from Member State (MS) Competent Authorities, in terms of both study quality and meaning. Not all parties involved may be familiar with the statistical design and interpretation requirements of such studies. This project was conceived in order to provide background information and guidance to specialists and non-specialists alike on the statistical design and interpretation of ecotoxicological community effect studies.

1.1. Aims, objectives and limitations The specific objectives of this project were to: a. Develop a guide to best practice for experimental design of ecotoxicological community studies (aquatic mesocosms, non-target arthropod field studies, soil mesofauna field studies). b. Develop a guide to best practice for statistical analysis of ecotoxicological community data (aquatic mesocosms, non-target arthropod field studies, soil mesofauna field studies). This report is intended to provide background and guidance on those statistical aspects of specific relevance to the design and analysis of ecotoxicological community studies so that non-specialists may be better informed on this complex area. References to background literature and further reading are provided throughout. Note this project was intended to provide statistical guidance on regulatory ecotoxicological field studies (a, above) for agrochemicals under EC Regulation 1107/2009, with a particular reference to principal response curves analysis, and was not intended to represent general statistical guidance for studies conducted for other purposes.

1.2. Project team The project was conducted by a team of four specialists with experience covering development of statistical techniques for analysing ecotoxicological community data, conduct of replicated field and large scale monitoring programmes, interpretation of community effects data and conduct of regulatory risk assessments.

1.3. Report structure The report introduces key aspects of the statistical design and interpretation of ecotoxicological community studies in a logical order, with illustrative case studies or examples. Some are based on real data and others are based on artificial data, generated to describe specific issues or principles. The report opens with a summary of the existing field study guidance (Chapter 2), in particular any reference to statistical aspects of study design and to statistical analysis of the data. We then consider the role of community studies in relation to the relevant legislation and perceived protection goal, and highlight the need for a defined regulatory question so that effort may be focused where it is required (Chapter 3). In addition, the ability of field studies to represent the real world scenario, and the potential effects of study design on this ability, are highlighted.

Principles of, and some issues associated with, experimental design for ecotoxicological community studies are presented in Chapter 4. An important consideration is the need to account for underlying environmental gradients which may introduce bias into the data. This chapter also considers ways to analyse taxa whose abundance has been estimated using more than one sampling technique. A background to the key issues around statistical testing in ecotoxicological community studies is provided in Chapter 5. This chapter considers hypothesis testing, the assumptions underlying statistical tests, limitations of statistical testing, the concepts of type I and type II errors, repeated two sample testing and replication. Univariate approaches for statistical testing are introduced in Chapter 6. This chapter covers initial interpretation of the data at the taxon level, data screening prior to analysis and percentage reduction measures, and provides an outline of appropriate univariate statistical tests. Multivariate approaches are introduced in Chapter 7. The focus is on principal response curves (PRC) and how such analyses may be interpreted. Due to the complexity of the approach and the complexity of the ecological communities which it analyses, the conduct and interpretation of PRC analyses form a relatively large component of this report. Reference is made to examples from the literature and, in Chapter 8, to simplified artificial scenarios designed to illustrate some strengths and limitations of the PRC approach when interpreting community responses to experimental treatments such as different application rates of a plant protection product. A summary of the important issues around data examination and analysis for replicated manipulative field studies is provided in Chapter 9. Although the main focus of this report is on replicated manipulative field studies, field monitoring studies are also included in Chapter 10. 
While used less often, field monitoring studies nonetheless share some commonalities with manipulative field studies.

1.4. Definitions A glossary of terms used in this report is presented in Appendix II. The report makes a distinction between the various types of field study which may be conducted. The term ‘replicated manipulative field studies’ is used for those studies typically conducted in outdoor aquatic microcosm or mesocosm facilities (model ponds and ditches, for example), and terrestrial field plots. Such designs typically include multiple test item concentrations, or field rate, drift rate and reference item, in addition to control treatments. The studies are normally used to determine the potential for effects and recovery following exposure of the test system to a test item at pre-determined concentrations or application rates. The term ‘field monitoring studies’ is used for those studies which are typically conducted on a larger scale, for example whole fields or farms, or in natural water bodies, and which seek to characterise the effects on an ecological community following routine application of plant protection products, for instance. The studies tend not to be dose response designs. Examples include the SCARAB project conducted in the UK (e.g. Young et al, 2005). The term ‘semi field studies’ is used to describe those studies where test organisms may be confined in cages under field conditions (e.g. Candolfi et al, 2001). Such studies typically do not assess community level effects; rather they are often designed to examine potential for recovery in accompaniment to a replicated manipulative field study, or as a focused single test conducted under field conditions. The focus of this report is on replicated manipulative field studies, with a chapter on field monitoring studies.

1.5. Funding This work was kindly funded by the UK Chemicals Regulation Directorate (CRD), project code PS2363, awarded to ADAS UK Ltd.

2 Summary of Guidance Documents Eleven relevant guidance documents relate to the design and analysis of replicated manipulative field studies, and are summarised in Table 2-1. The guidance documents published for higher tier testing in ecotoxicology have shown a gradual shift over time from population-based endpoints towards the inclusion of community endpoints and the concepts of population and community NOEC values. Early field study guidance for non-target arthropods (e.g. PSD/HSE, 1997) specified randomised block designs and univariate analyses. The relevant regulatory concern was population level effects and recovery. Later, the higher tier guidance of Candolfi et al (2000) on non-target arthropods still treated population endpoints as the main output, but described overall ecosystem function and community structure as the ultimate object of regulatory concern. Guidance on interpreting earthworm field studies (de Jong et al 2006) is structured around population level effects, and is concerned with the statistical power of each study and the possible need to increase replication to achieve sufficient power. Community level analysis was initially introduced for aquatic mesocosm studies, with de Jong et al (2008) defining 7 effect classes for different endpoints. Both univariate and multivariate analyses were considered a requirement (e.g. species richness, species diversity, multivariate analysis of the community and individual taxon abundance). By 2010, a similar approach had been applied to terrestrial non-target arthropod guidance (de Jong et al, 2010), with community level analysis using principal response curves (PRC), together with population level analyses, considered to be a requirement. 
The current guidance for aquatic and terrestrial ecotoxicity higher tier testing requires univariate analysis of population data (counts for individual taxa) as well as multivariate analysis for the entire data set, to determine community response and recovery. Although the earthworm guidance of de Jong et al (2006) does not mention multivariate analysis, the use of PRC in the interpretation of such studies is now commonplace. The available guidance is summarised in Table 2-1, with respect to guidance provided on experimental design and statistical analysis.

Table 2-1. Summary of existing guidance on experimental design and statistical analysis of ecotoxicological community studies. Guidance presented in chronological order.

Reference: SETAC (1991). Guidance document on testing procedures for pesticides in freshwater mesocosms. Ed Arnold et al.
Experimental design guidance: Dose response (regression) and pairwise comparison (analysis of variance) designs are mentioned. Experimental units must be randomly assigned to treatments.
Statistical analysis guidance: Monotonic responses may be analysed using regression type approach, to provide an ED50. Data should be appropriately transformed prior to analysis. ANOVA or pairwise comparison designs should be tested with ANOVA first then a suitable pairwise comparison test such as Dunnett’s. Guidance states that replicates are not necessary for regression approach, but are required for ANOVA designs. Guidance suggests that it may be desirable to conduct pre-test sampling to establish coefficient of variation in data, in order to inform required sample size.

Reference: PSD/HSE (1997) Guideline to study the within season effects of insecticides on non-target terrestrial arthropods in cereals in summer. In “The Registration Handbook”, Vol. II, Part Three, A3/ Appendix 2.
Experimental design guidance: Randomised Block Design recommended with at least four replicates. Small barriered plots or large >1 ha open plots. Data from two or more site-years are required.
Statistical analysis guidance: Two way analysis of variance (blocks and treatments) recommended for each sampling occasion. Species not abundant at the time of spraying must be assumed to be at risk unless demonstrated to be unaffected in further, more detailed experiments.

Reference: Candolfi et al (2000) Principles for regulatory testing and interpretation of semi-field and field studies with non-target arthropods. J. Pest. Science, 73, 141-147
Experimental design guidance: Fully replicated design specified. Boundaries of fields should be comparable. Studies should be intended to study polyphagous, detritivorous and zoophagous arthropods. Randomized block design with population endpoints specified. Four replicates are preferable, three acceptable, fewer can be used for reference item treatments if plot size is a limiting constraint.
Statistical analysis guidance: Pre-application sampling may give information on the power of the study to detect effects. No statistical procedures are specified. Although population endpoints are described as being the main output the overall ecosystem function and community structure are described as being the ultimate object of regulatory concern.

Reference: ESCORT II. Candolfi et al (2001) Guidance Document on Regulatory Testing and Risk Assessment Procedures for Plant Protection Products with Non-target Arthropods. SETAC, (2001)
Experimental design guidance: No guidance given. Refers to Candolfi et al (2000) for semi-field and field study design and interpretation.
Statistical analysis guidance: No guidance given. Refers to Candolfi et al (2000) for semi-field and field study design and interpretation.

Reference: Giddings et al (2002) Community-Level Aquatic Systems Studies - Interpretation Studies: CLASSIC. SETAC, Pensacola, FL, USA.
Experimental design guidance: Makes recommendations with regards to overall study design (need for replication, ‘toxicological’ versus ‘simulation’ approach, timing of application, level of ). Reports should include an assessment of the suitability of the study for assessing effects. Statistical tests should be considered during design of the study.
Statistical analysis guidance: Univariate techniques are recommended for population-level effect analyses, and multivariate techniques for assemblage/community-level effect analyses. Reports should include a justification of statistical methods used. ANOVA suggested as an example of a method by which population-level NOECs could be derived; initial ECx values via regression also mentioned. Abundance data may be aggregated into higher taxonomic or functional groups to provide sufficient data for univariate testing. Principal Response Curves analysis recommended for community-level interpretation. It is noted that the high number of endpoints and results from mesocosm studies may result in some statistically significant effects, due to chance.

Reference: De Jong et al (2006) Guidance for summarising earthworm studies. RIVM Report No. 601506006/2006
Experimental design guidance: Specified design of >10 m x 10 m plots, randomised plot design, 1 year study duration.
Statistical analysis guidance: One-sided Dunnett test is proposed to determine the power of the test if data are normally distributed and the variance of all treatments is the same. A power of 80% is considered acceptable, implying that missing a relevant effect in 20% of experiments is acceptable. Counts data to be log transformed to convert the Poisson distribution to a normal one. Positive reference has to have at least 50% effect for at least one time point. Guidance focuses on statistical differences rather than biologically relevant effects.

Reference: OECD (2006a) Guidance Document on Simulated Lentic Field tests (Outdoor Microcosms and Mesocosms). No. 53. ENV/JM/MONO(2006)17
Experimental design guidance: No detailed statistical guidance and no mention of community analysis.
Statistical analysis guidance: The methods of statistical analysis should be determined as part of the setting of the objectives. Structural and functional endpoints. Structural endpoints are seen as populations of invertebrate groups seen to be of concern in lower tier risk assessments.

Reference: OECD (2006b) Current approaches in the statistical analysis of ecotoxicity data: A guidance to application. OECD series on testing and assessment No. 54. ENV/JM/MONO(2006)18
Experimental design guidance: Not specifically stated but this document is primarily aimed at aquatic testing. Demands for a study aimed at estimating a NOEC or an ECx are different in terms of replication and number of concentrations. Decision as to which is the primary endpoint should be made before the start of the experiment. In a study to determine the NOEC the goal is to bracket the NOEC with concentrations that are as closely spaced as possible. Where effects are expected to increase in proportion to the log of the concentration then concentrations should be equally spaced on a log scale. Three to seven concentrations plus controls are suggested. At least two, but ideally 3 or 4 sub-groups per concentration are recommended (Ch. 157). Consideration should be given to having more subjects in the control group than the treatment groups to increase power.
Statistical analysis guidance: Detailed description of statistical tests appropriate for analysis of standard ecotoxicity tests to derive endpoints such as ECx and NOEC. Visual inspection of data recommended and identification of outliers (i.e. data points far away from intuitive expectation). Heterogeneity of variances is discussed and appropriate transformations are described. Parametric and non-parametric methods described in detail as are approaches for checking model assumptions. Flow diagrams present appropriate statistical tests for quantal methods for determining a NOEC. Where biologically sensible it is better to test a one-sided hypothesis because random variation in one direction can be ignored, as a result such tests are more powerful than 2 sided.

Reference: De Jong et al (2008) Guidance for summarising and evaluating aquatic micro and mesocosm studies. RIVM Report No. 601506009/2008
Experimental design guidance: None given as document deals with summarising and evaluating studies.
Statistical analysis guidance: 7 effect classes described, applied to different endpoints (e.g. species richness, species diversity, multivariate community analysis, individual taxon abundance). A nominal NOEAEC (No observed ecological adverse effects concentration) is used to add relevance to any statistical significance.

Reference: De Jong et al (2010) Guidance for summarising and evaluating field studies with non-target arthropods. RIVM report No. 601712006/2010
Experimental design guidance: None given as document deals with summarising and evaluating studies.
Statistical analysis guidance: Univariate techniques recommended for single populations and multivariate, (primarily PRC), for community level effects. 8 effect classes described. To present relevant effects the regulatory authorities could consider a certain effect class to represent the NOEAER (No observed ecological adverse effects rate). Example of single taxon analysis uses Abbott’s formula extensively.

Reference: EFSA (2011) Scientific Opinion. Statistical Significance and Biological Relevance. EFSA Journal 2011, 9(9): 2372
Statistical analysis guidance: 1. Terminology Guidance: use significance/significant for statistical concepts, use relevance/relevant for biological considerations. Relevance is based on expert judgement and is therefore arbitrary and subjective. 2. Retrospective power analyses are not considered an acceptable methodology and not recommended by the EFSA Scientific Committee.
Experimental design guidance: 1. Nature and size of relevant biological differences should be defined before studies are initiated. 2. Power analysis conducted at start to determine sample sizes needed to have a high probability of detecting the relevant effect as a

14

Reference Experimental Statistical Analysis Design Guidance Guidance statistically significant effect. 3. Absence of Evidence is not evidence of absence. Non- significant p values should not be used to draw conclusions about equivalence. Confidence intervals should be compared to pre-defined equivalence limits.

4. Use of confidence intervals more informative than just significant or not significant result as they reflect the uncertainty in the result.

5. Where possible assumptions should be tested to study the robustness of any statistical result (e.g. statistical theories which hold true for large sample sizes may not be applicable with smaller sample sizes). EFSA PPR Panel (EFSA Panel Refers to existing Suggests appropriate on Plant Protection Products guidance (CLASSIC univariate statistical tests for and their Residues), 2013. [Giddings et al, calculation of population-level Guidance on tiered risk 2002], OECD 2006, NOECs. States that the assessment for plant protection eLink [Brock et al, minimum detectable difference products for aquatic organisms 2010], AMRAP for taxon level statistics should in edge-of-field surface waters. [Maltby et al, 2010]). be reported. Suggests that EFSA Journal 2013;11(7):3290, Discusses set-up of ECx may be calculated for 186 pp. Available online: aquatic model those taxa which are www.efsa.europa.eu/efsajournal ecosystem studies, sufficiently abundant, provided in terms of inclusion confidence intervals are also of taxa for ecological reported. Taxa may be threshold versus aggregated into meaningful ecological recovery groups to allow univariate option; exposure analysis where abundance is profile; a dose low. Stresses importance of response design ecotoxicological knowledge in with at least 2 application of statistical replicates per methods (e.g. monotonicity of treatment is responses where indirect recommended, effects may also occur). measurement Consecutive significant events endpoints should be should be given special related to specific weight, while possibility for protection goals, false positive and false sampling procedure negative results should be should be related to borne in mind. Principal treatment regime. Response Curves analysis

15

Reference Experimental Statistical Analysis Design Guidance Guidance maybe used for community- level effects assessment.

2.1. Section summary
• The available guidance has shown a shift in recent years from population level assessments to analysis of data at the community level.
• Detailed guidance is provided on the analysis of laboratory ecotoxicity data.
• Certain documents provide guidance on aspects of study design (replication, blocking; e.g. Candolfi et al 2000), but stated approaches to statistical analysis are generally lacking. SETAC (1991) does mention use of appropriate multiple hypothesis tests (such as analysis of variance) and subsequent use of multiple comparison tests. The current authors agree with this approach in general, but other methods are also now commonly used (Chapters 6 & 7).
• Later guidance includes reference to both univariate and multivariate techniques, such as principal response curves (PRC) analysis. This method is discussed in Chapter 7.
• The updated aquatic guidance of EFSA (2013) provides a useful summary of considerations, including how taxa may be included in analyses, for model ecosystem studies.
• There is a lack of formalised guidance on the interpretation of statistical analyses (beyond the open literature). For PRC, examples are provided in Chapter 8.


3 The context of community studies

3.1. Introduction
This chapter introduces the concept of community and provides a discussion on protection goals and context to the use and interpretation of community studies. A key aspect is the need for a focused regulatory question for a field study to answer.

3.2. Types of community studies and place in tiered testing
Regulatory ecotoxicological field community studies may be conducted as part of a tiered testing approach. Typically, such studies are conducted in response to concern raised by a risk assessment conducted with lower tier effects data. Upon completion of extreme worst-case ‘Tier 1’ laboratory studies (characterised by exposure of specified test organisms via inert test substrate such as glass plates or standardised laboratory media), testing may progress to ‘Tier 2’ laboratory studies. Such designs expose test organisms via natural test substrates (e.g. fresh or aged residues on crop leaves or standardised soils for terrestrial arthropods; aquatic studies may be conducted in presence of sediment). Further information on tiered testing approaches may be found in the relevant guidance documents and regulations (e.g. aquatic guidance document SANCO/3268/2001 and terrestrial guidance document SANCO/10329/2002).

It is not necessary, however, to proceed through all available testing tiers before field community studies are conducted. Based on knowledge of the properties of the test item, it may be preferable to move straight to field studies. Field community studies offer the highest level of realism within the confines of a replicated test system. Direct and indirect effects and interactions between the taxa present may be captured.

Definitions and terminology of field studies used in this report are provided in Chapter 1 (1.4). These are expanded upon here. Two main types of study may be identified. Those referred to here as ‘replicated manipulative field studies’ consist of application of one or more concentrations of a test item to randomly selected independent replicates (terrestrial field plots, aquatic ponds or ditches, for example) and comparison of them to control replicates. The second type is termed here ‘field monitoring studies’.
These may include application of a plant protection product under normal use conditions to part of a catchment and subsequent monitoring of aquatic communities in relevant water courses, or monitoring of in-field and off-field communities following application to a representative crop. Scale may vary (e.g. from single field to wider landscape) and will depend upon the properties and use pattern of the test item and the focus of the study. Field monitoring studies, whilst potentially providing useful information on environmental responses under the most realistic conditions, may be hampered by a lack of replication, lack of control and influence of other factors or stressors, and specificity of the results to the site/context studied. Field monitoring studies are explored further in Chapter 10. Replicated manipulative field studies, therefore, represent the most amenable approach to gathering information on the response of an ecological community to a chemical stressor. Test systems typically consist of outdoor replicated units containing naturalised communities of non-target organisms. Replicates may be defined by ponds or ditches (aquatic microcosm and mesocosm systems) or plots within a larger field (non-target arthropod and soil mesofauna studies). Issues surrounding replication and study design are addressed in Chapters 4 and 5. Indoor aquatic microcosm systems with naturalised communities may also be conducted, and this report would also be relevant to such designs.


3.3. Advantages and limitations of replicated manipulative field studies
The advantages of field studies over laboratory studies revolve around added levels of realism in the scenario and subsequent risk assessment. Laboratory studies are typically conducted with selected single species which have been previously proven to be amenable to laboratory testing. It may be possible to combine suitable single species data into a species sensitivity distribution (SSD, a cumulative probability distribution of toxicity response for several taxa) to derive an estimate of sensitivity of an assemblage (the ‘assemblage’ likely consisting of the species available for testing), but this will not capture indirect effects and interactions between taxa and the wider habitat, nor is it likely to account for life history effects (timing of reproduction in relation to exposure window and effects on recovery, for instance).

Replicated manipulative field studies allow for the characterisation of nearly all of the processes which would occur at the full field scale. Those processes which may not be captured are typically defined by the limitations of the test system. An example of this is the absence of fish, and hence of top predators, in aquatic replicated manipulative field studies (Giddings et al, 2002). Another is that, due to the isolated nature of replicated experimental ponds, they may not fully capture the potential for recovery of certain impacted biota which may, under field conditions (the ‘real world’), have survived the exposure event in refugia (for instance, in larger water bodies) (Brock et al, 2009) or recolonised from other areas in connected habitats (e.g. drift of unaffected organisms from unexposed areas of a flowing water body). Carefully targeted bioassays (for instance, introduction of caged, previously unexposed organisms into the test system at selected time points post test item application) may, however, address such weaknesses.
Similarly, terrestrial field studies are typically conducted at a scale which is smaller than the full field scale which they are designed to represent, which may influence effect and recovery. This can, however, be accounted for in the study design through focus on taxa for which study scale adequately represents dispersal. Nonetheless, well designed field studies represent the best available option for gathering representative data on the direct and indirect effects of a chemical stressor on a non-target community. A recent report from the EU (EU, 2012) highlights the limitations of the current environmental risk assessment paradigm and provides a discussion of potential ways forward. Population and community modelling is gathering pace and may overcome the limitations of community field studies mentioned above, but is currently not routinely used in regulatory submissions and widely accepted models are yet to be developed. The benefit of such modelling approaches is that, once parameterised, various scenarios may be run with little additional effort (compared to conducting a field study), but the models are limited by the data used to parameterise them. Adequate complexity to represent communities is another limitation. In the future, it is likely that a combination of field studies and modelling will be employed to characterise risk of chemical stressors to ecosystems. Clearly, the usefulness of the results of a field study in the regulatory process depends upon the design and execution of the study. A key aspect is the focus of the study, and this is helped by the definition of a regulatory question which the study is designed to address. In addition, adequate controls, replication, and appropriate data analysis and interpretation are required to ensure the study is fit for purpose. 
A poorly designed study (for example, inadequate replication or poor control of underlying environmental gradients) coupled with inappropriate statistical analysis, may nonetheless appear to yield useful results, despite inherent limitations which, to a non-specialist, may go undetected.

3.4. Population versus community; protection goals
This document focuses on ecotoxicological community field effect studies, but it is useful to briefly discuss the concepts of population and community in this context. First, it is necessary to refer to the legal requirements regarding the placing of plant protection products on the market. Regulation (EC) 1107/2009 (article 4(3)) states, under approval criteria:

“(e) it shall have no unacceptable effects on the environment, having particular regard to the following considerations where the scientific methods accepted by the Authority to assess such effects are available: (i) its fate and distribution in the environment, particularly contamination of surface waters, including estuarine and coastal waters, groundwater, air and soil taking into account locations distant from its use following long-range environmental transportation; (ii) its impact on non-target species, including on the ongoing behaviour of those species; (iii) its impact on biodiversity and the ecosystem.”

Certain definitions are provided. Article 3(13) defines ‘environment’ as:

“‘environment’ means waters (including ground, surface, transitional, coastal and marine), sediment, soil, air, land, wild species of fauna and flora, and any interrelationship between them, and any relationship with other living organisms;”

And ‘biodiversity’ as:

“‘biodiversity’ means variability among living organisms from all sources, including terrestrial, marine and other aquatic ecosystems and the ecological complexes of which they are part; this variability may include diversity within species, between species and of ecosystems;”

Therefore, reference to species and their interrelationships, the diversity of species, and ecological complexes is made. No statement regarding acceptability or otherwise of any impacts is made (see Uniform Principles).

The EFSA draft opinion on protection goals (EFSA, 2010) uses ecosystem services as an overall approach to the derivation of specific protection goals (the so-called what to protect, where and over what time period). According to EFSA (2010), specific protection goals require the identification of the following 6 key dimensions or aspects: “[..]
the ecological entity that is to be protected (individuals, (meta) populations, functional groups or ecosystems), the attribute(s) or characteristic(s) of that entity that must be protected (behaviour, survival/growth, abundance/biomass, processes, biodiversity), the magnitude of effect that can be tolerated for the attributes to be measured (biological scale), the temporal scale of effect (e.g. the maximum time on an annual basis over which single or repeated exposure/effect events are expected to exceed the critical level that can be tolerated), the spatial scale of the effect (e.g. the distance from the sites of application where the exposures and critical effect level to be tolerated are expected to occur), and the degree of certainty that the specified level of effect will not be exceeded.” EFSA (2010) outlined proposed entities (units for protection) for which key drivers (in this case, groups of non-target organisms) may be considered. Mostly, entities were set as populations or metapopulations for non-target organisms, suggesting that the persistence of populations is the protection goal. For vertebrates, however, the entity to be protected was the individual - any effect at the individual level (e.g. lethality, behavioural change) would be considered unacceptable. Therefore, the proposed entities for protection are largely structural (individuals, meta populations and populations within a species) rather than functional, the potential exceptions being certain ecosystem services provided by microbes and algae (primary production, nutrient cycling etc) for which drivers of protection goals were functional groups (Table 3.4-1). This would suggest that the function of groups such as algae and microbes is intended for protection, whereas for other groups (invertebrates and vertebrates) populations and individuals are intended as the unit to be protected.


Table 3.4-1: Proposed entities for protection for key drivers in relation to the environmental risk assessment for plant protection products (EFSA, 2010).

EFSA (2010) also recognised that different protection goals for in-field and off-field areas would be required, at least for non-target arthropods and other invertebrates. This translates into a reduced temporal scale of acceptable effect with regards to potential impacts on biodiversity in off-field areas, for example. Taken together, therefore, Regulation (EC) 1107/2009 and the proposed approach to protection goals (EFSA, 2010) identify diversity of species and ecological complexes, interrelationships between species, in addition to services provided by non-target organisms which should be protected by ensuring no persistent effects at the (meta)population or functional group level. With regard to the focus of regulatory ecotoxicological field studies, therefore, both population (e.g. persistence of species at the field or landscape scale) and community (e.g. interrelationships between populations of species) effects are relevant. This is of relevance as later chapters discuss univariate statistics which are largely concerned with assessment of effects at the taxon (population) level (Chapter 6) and multivariate statistical methods which are able to assess effects at the community level (Chapter 7).

3.5. Representation of communities by test systems
This section discusses the extent to which the test system and sampling regime affect our understanding of the community, rather than the extent to which replicated manipulative field studies represent the sensitivity of the real world per se (which is beyond the scope of this report). It is important to remember that the community of a replicated manipulative field study is represented by the samples taken from a model test system (experimental pond or ditch; field plot) which is an approximation of the real world. There are, therefore, two issues here which affect our understanding of the effects of the test item on the subject community:
1. We use a model test system which may not be a complete representation of the real world, and
2. We use a sampling regime which cannot provide a complete inventory of the biota present in the system at each sampling time point.
These two points are important because the results of a replicated manipulative field study are used in a risk assessment which is designed to be protective of the real world scenario. The results of that study should be interpreted and applied to the risk assessment in the context of the above points. Each point is discussed, in turn, below.

Field studies and the real world; effects of scale
It is necessary to consider the extent to which the test system may influence community dynamics and, therefore, derived conclusions. Relevant issues may include experimental scale and connectivity of the test system. The effects of such issues may vary between taxa. For instance, for aquatic taxa without an aerial life stage, replicated manipulative field studies may be considered more sensitive than the real world scenario which they are designed to represent, due to the isolated nature of the test system replicates (e.g. individual ponds or ditches) leading to lower potential for recovery via refugia. In the field, uncontaminated refuges or connectivity to uncontaminated water bodies may aid recovery for such taxa, for example (Brock et al, 2009). For those with an aerial life stage, however, rates of recovery may be increased by the presence of nearby control replicates or terrestrial vegetation features which may support aerial adults (e.g. Jenkins et al, 2012). Therefore, interpretation of study data and application to the risk assessment should include the impact of study design on the responses of the various components of the community.

In contrast to aquatic systems, in terrestrial systems plots may be marked out on an in-field or off-field area. The plots typically remain open (barriered plots may artificially hinder recovery, e.g. Pullen et al, 1992; Carter, 1993). Depending upon plot size, therefore, taxa may be free to move between plots and the wider untreated field. In many senses, such studies are as realistic as possible in that the actual field community that would be exposed under field conditions is used as the test system, rather than a model system. Nonetheless, the imposition of plots onto the field is entirely artificial and it is, therefore, the researcher’s responsibility to apply plots of appropriate size in order to adequately represent the scale of the target community. Relevant questions include the following. Is plot scale appropriate for the community of interest in terms of mobility? Will plot size affect the detection of effect and time to recovery (e.g. effects of experimental scale on rates of recovery in terrestrial systems, Duffield & Aebischer, 1994; Duffield et al, 1996)? Will sufficient numbers of key taxa be present given the plot size (and replication)?
Do the taxa in the samples originate from the test system or may they be considered ‘tourists’? The considerations of scale and connectivity are more important in terrestrial systems than in aquatic systems, and relate directly to the extent to which the test system may make a faithful representation of the subject community.

Influence of sampling methods on our understanding of community
Samples taken from a field scale test system do not constitute an inventory of biota present; rather, they are a representation of what is present. Using sampling methods to measure the properties of a model test system does not necessarily equate to a complete understanding of the real world scenario which is being modelled or represented, even when various methods are used, as the methods will be selective in terms of the biota captured. Due to this selectivity, normally various sampling methods are used, and sometimes more than one method may be used to sample similar groups or organisms. Sampling should focus on the relevant components of the test system and be of sufficient intensity to provide adequate numbers for analysis. Overall, aspects of scale, relevance to the real world and sampling effects should be considered during study design so that the study may be appropriately focused.

3.6. Section summary
• Replicated manipulative field studies are a useful tool for generating data from naturalised systems.
• Replicated manipulative field studies enable the researcher to explore direct and indirect effects of chemical stressors on populations and communities.
• A replicated manipulative field study is nonetheless just a representation of the real world.
• The samples taken are themselves a representation of the community present, and rarely represent a full inventory of taxa present, due to selectivity of the methods in terms of the taxa captured.
• Due to the above points, the researcher should understand any effects of the study design on the results, including capacity of the system to detect effect and recovery.
• Aquatic systems are likely to represent a discrete community, but due to lack of connectivity and refugia, recovery rates may be artificially lowered for taxa without aerial life stages. For those with aerial life stages, recovery rates may be enhanced through proximity to control replicates, for instance. Such aspects should be considered on a case-by-case basis.
• Terrestrial systems may be more prone to scale effects, where plots are unbounded, but the community present is often the actual community of concern in the regulatory process.
• Overall the study design should be tailored to the community present in the test system and the regulatory question it is designed to address, so that any issues which may affect proper interpretation are understood prior to study conduct.
• The study should be interpreted within the context of any limitations imposed by the study design.


4 Experimental design

In this chapter the focus is on statistical aspects of field study design. Useful background reading includes Leps & Šmilauer (2003) for field study design and analysis, Underwood (1997) for general background, Hurlbert (1984) for discussion of pseudoreplication and general experimental approaches and Pitard (1989) for a summary of the discussion on sampling bias.

The aim of replicated manipulative ecotoxicological field studies is to investigate the effect of a chemical stressor on a naturalised biological community present in a suitable test system. There is potential, especially where studies are conducted in the field (community studies may also be conducted in indoor microcosm facilities), for underlying gradients to exist which may affect the distribution of organisms in the test system and/or the responses of communities to the test item. It is important that such underlying gradients are understood, or controlled for in study design, so that they do not influence the results of one treatment group more than another.

Gradients may be biotic or abiotic and may have direct or indirect effects on the biota present in the test system. For example, shade may affect growth rate of algae in aquatic systems, soil moisture may affect densities of terrestrial arthropods, and proximity to a hedgerow or to other water bodies may affect recolonisation in terrestrial and aquatic systems, respectively. In fact, Jenkins et al (2012) showed that recolonisation by taxa with aerial adult life stages in aquatic mesocosm studies could be promoted by terrestrial foliage provided in enclosed systems over individual aquatic mesocosm replicates, indicating that terrestrial vegetation could also influence recolonisation in aquatic, as well as terrestrial, systems.

4.1. Study layouts and randomisation of plots
It is essential that treatment is allocated to individual test system replicates (e.g. ponds, ditches or field plots) on a completely random basis to avoid the introduction of systematic error. Every replicate must have an equal chance of being allocated to any treatment group (notwithstanding considerations of blocked designs, discussed below). Failure to allocate treatments randomly (e.g. conscious selection of a plot with a known abundance of specific taxa) would introduce bias. The samples would not have been taken from the statistical population at random, and therefore the underlying assumptions of the statistical analysis which is applied to the study data would be undermined. For more on statistical assumptions see Chapter 5.

The location or grouping of test system replicates is a key consideration in the design of manipulative field experiments. The specific way this is done may vary between aquatic and terrestrial test systems.

Where environmental conditions are relatively uniform, a completely randomised design may be used. This is appropriate in terms of randomisation, but the design does not control for environmental heterogeneity. Two schematic examples are shown in Figures 4.1-1 and 4.1-2. The example in Figure 4.1-1 is in a grid structure, which may be used due to practical limitations of the arrangement of the test facility or test site.
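As an illustration only, random allocation under a completely randomised design might be sketched in Python as follows. The function name, the treatment labels, the 3 x 3 grid and the seed value are assumptions made for this example and are not taken from any guidance document. Recording the seed is one possible way of making the allocation reproducible for reporting purposes.

```python
import random


def completely_randomised(treatments, n_replicates, seed=None):
    """Return a random allocation of treatments to replicates.

    Each treatment appears exactly n_replicates times, and every
    replicate position is equally likely to receive any treatment.
    """
    rng = random.Random(seed)  # a recorded seed makes the allocation repeatable
    allocation = list(treatments) * n_replicates
    rng.shuffle(allocation)  # random permutation of the treatment labels
    return allocation


# Hypothetical study: 3 treatments (A-C), 3 replicates each, plots in a 3 x 3 grid
layout = completely_randomised(["A", "B", "C"], n_replicates=3, seed=2024)
for row in range(3):
    print(" ".join(layout[row * 3:(row + 1) * 3]))
```

Because the whole list of labels is shuffled at once, replicates of one treatment may by chance fall next to each other, which is the situation that blocked designs are intended to limit.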


A  B  A
C  A  B
C  C  B

[Arrow alongside the grid: environmental gradient]

Figure 4.1-1. First example of a completely randomised design, within a grid structure. Each plot is randomly assigned to a treatment group. There are 3 treatments (A-C) and 3 replicates of each. A known, unidirectional environmental gradient is indicated by the arrow.

[Figure 4.1-2: nine plots (three replicates each of treatments A, B and C) at randomly assigned spatial locations, with an arrow indicating the environmental gradient.]

Figure 4.1-2. Second example of a completely randomised design. Each plot is randomly assigned to a treatment group and spatial location. There are 3 treatments (A-C) and 3 replicates of each. A known, unidirectional environmental gradient is indicated by the arrow.

The completely randomised design is useful where environmental homogeneity is found, but other designs cope better with gradients.

The randomised complete blocks design may be used when an environmental gradient is identified or anticipated in the test site. Here, blocks are arranged along the gradient. Each block contains an equal complement of treatments (e.g. plots), but the location of these treatments is randomised within the block. Each block includes the same number of replicates of each treatment. The blocks are arranged so that the treatments within the block are at the same position on the gradient. This is useful where there is a potential source of influence at one end of the study site - this may be a source of recolonisation in an ecotoxicological field study, for example. A schematic example is shown in Figure 4.1-3.
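The within-block randomisation described above might be sketched as follows. This is a minimal illustration under stated assumptions: the function name, treatment labels, number of blocks and seed are hypothetical, and each block is assumed to hold exactly one replicate of each treatment.

```python
import random


def randomised_complete_blocks(treatments, n_blocks, seed=None):
    """Return one randomised ordering of all treatments per block."""
    rng = random.Random(seed)
    blocks = []
    for _ in range(n_blocks):
        order = list(treatments)
        rng.shuffle(order)  # independent randomisation within each block
        blocks.append(order)
    return blocks


# Hypothetical study: 3 treatments (A-C) in 3 blocks placed along the gradient
design = randomised_complete_blocks(["A", "B", "C"], n_blocks=3, seed=7)
for number, block in enumerate(design, start=1):
    print(f"Block {number}: {' '.join(block)}")
```

Randomisation is restricted so that every treatment occurs once at each position along the gradient, while the order within a block is still left to chance.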


A  B  C
B  C  A
C  A  B

[Arrow alongside the layout: environmental gradient]

Figure 4.1-3. Example of a randomised complete blocks design. Each block contains a full complement of treatments (A-C), which are randomised to replicates within the block. The blocks are arranged so that each replicate within the block is at the same position along the gradient. The arrow indicates a known, unidirectional environmental gradient.

The randomised complete blocks design is useful where a directional gradient exists. This design is commonly used in terrestrial ecotoxicological field studies, but is of course limited by plot and field size. If gradients are assumed to exist but are unknown, the completely randomised design should be used.

Where gradients exist in two directions, a Latin square design may be used. This consists of one replicate of each treatment group in each row and each column of a square grid. The number of replicates is equal to the number of treatments. A schematic example is shown in Figure 4.1-4. The design is useful where known gradients exist in two directions (such as indicated in Figure 4.1-4), but less powerful than a completely randomised design if gradients are unknown.
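A Latin square layout of the kind described above might be generated as in the following sketch. The function name, labels and seed are assumptions for illustration. The approach shown (a cyclic square whose rows, columns and labels are then permuted at random) always yields a valid Latin square, but it is a simplification and does not sample uniformly from all possible Latin squares.

```python
import random


def latin_square(treatments, seed=None):
    """Return a randomised Latin square as a list of rows."""
    rng = random.Random(seed)
    n = len(treatments)
    # Cyclic starting square: each index occurs once per row and per column
    square = [[(row + col) % n for col in range(n)] for row in range(n)]
    rng.shuffle(square)  # permute rows
    col_order = list(range(n))
    rng.shuffle(col_order)  # permute columns
    square = [[row[col] for col in col_order] for row in square]
    labels = list(treatments)
    rng.shuffle(labels)  # randomly assign treatment labels to indices
    return [[labels[index] for index in row] for row in square]


# Hypothetical study: 4 treatments (A-D), hence a 4 x 4 grid
grid = latin_square(["A", "B", "C", "D"], seed=11)
for row in grid:
    print(" ".join(row))
```

Permuting rows, columns and labels preserves the defining property that each treatment occurs exactly once in every row and every column.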


A  B  C  D
B  A  D  C
C  D  B  A
D  C  A  B

[Arrows along two sides of the grid: environmental gradients]

Figure 4.1-4. Example of a Latin square design with four treatments (A-D). Each row and column contains one replicate of each treatment. The number of treatments equals the number of replicates. The arrows indicate known, unidirectional environmental gradients.

Various hierarchical or nested designs are also possible, usually for testing more than one treatment variable. One example of a hierarchical design is where several sub-samples are taken from each plot. These samples are not treated as separate replicates, but do serve to decrease variability between treatment levels. Also available are split plot designs. Here, main plots may contain sub plots which represent different treatment applications, for example.

In terms of replicated manipulative field studies, randomised block designs are typically used for terrestrial systems. Unknown gradients may, however, mean that a completely randomised design would be more suitable. Hurlbert (1984) argued that blocking was under used and that completely randomised designs may lead to situations where the replicates of a treatment group are clumped together, due to chance allocation of replicate to treatment. It is nonetheless suggested here that where known and relevant gradients exist, a randomised block design be used and where gradients are unknown, a completely randomised design be used. In all cases, the choice of study design should be fully justified.

In aquatic mesocosm studies, completely randomised designs are also commonly used, but the permanent nature of the test facility may present problems. Researchers may decide to use blocking to account for known or unknown gradients where replicate ponds are arranged in rows, for example. Clearly, this must be addressed on a case-by-case basis due to the variety of test system arrangements.

4.2. Pseudoreplication

In ecotoxicological field studies, each field plot (terrestrial systems) or pond or ditch (aquatic systems) may be considered a replicate, so long as it is independent from the other experimental units in the study (ponds, ditches or plots). This is valid where ponds, ditches or plots have been randomly assigned to treatment groups (including test item and control) without bias (either completely randomly or within a randomised block design) and each replicate receives a unique application event. Regardless of the number of samples taken from each experimental unit, it remains a single replicate.

Pseudoreplication occurs when multiple samples taken from a single plot or pond at one time point are treated as individual replicates, or when samples taken from a plot or pond over more than one time point are treated as replicates. Multiple samples may only provide information on the variability between samples from a single replicate. If replicates of a treatment group are segregated in space from those of another group (aggregated by treatment), due to convenience or another non-random factor, this would also be pseudoreplication. This is because the replicates are not independent and may share localised conditions within, but not between, treatment groups. This could be addressed through appropriate randomisation or blocking.

In unreplicated studies (i.e. those with just one replicate of treatment and control) it is not possible to conclude on the effects of treatments, only on the differences between plots or sites, including the treatment. This is because within treatment variation is unknown and, therefore, cannot be compared with between treatment variation. When replication is not possible, for instance in large scale field monitoring programmes, any analysis conducted on the data should be interpreted with this in mind (i.e. the inherent limitations of the study to infer treatment related versus site-only differences). It is important to recognise that differences between sites may occur prior to treatment, and may or may not be detected, and that differences may arise during the study duration, which may or may not be recognised as independent of treatment but may be a result of factors collinear to the treatment.
Therefore, any similarity between sites prior to study commencement does not guarantee that each site would develop along the same trajectory, nor that post-treatment differences are attributable to treatment effects only. In fact, the absence of a significant difference between sites prior to study start is likely to reflect low sample size rather than real similarity: ‘absence of evidence is not evidence of absence’. The analysis is limited to detecting differences between sites only, and not treatment effects, since any difference may be inherent and not related to treatment; replication would address this. That is not to say that such studies are not useful, but they must be carefully interpreted. Hurlbert (1984) summarised: Replication reduces the effects of noise or random error and increases precision; randomisation reduces bias and increases accuracy.

4.3. Grouping data for analysis

Analysis of sub communities by sampling method

It is beyond the scope of this report to provide a critique of the various sampling techniques employed in aquatic and terrestrial ecotoxicological field studies. Rather, the focus is on how data from different sampling techniques may be handled and interpreted. In an ecotoxicological community field effects study, there will typically be more than one sampling technique employed. Some taxa may appear in the samples from more than one sampling technique. Examples include Collembola appearing in suction and pitfall trap samples, and zooplankton appearing in depth integrated water samples and draw net samples. The researcher must then decide how to handle the data. If samples are pooled from different sampling methods does this compromise the statistical integrity of those samples? Also, given that each sampling technique has a different level of efficiency (which may vary over time), does pooling samples represent good biological sense or good statistical practice?


It is certainly inappropriate to treat the different samples taken from the same replicate (for example a mesocosm pond sampled using a depth integrated water sampler and a draw net for zooplankton) as replicates themselves – that is, replication is limited to the number of discrete test systems which are independent of one another. As sampling techniques do not represent a complete inventory of the test unit, but a snap shot of abundance within the limitations of the method, various methods are used to provide as complete a picture of community response as possible. Typically, in community ecotoxicology studies, sampling techniques are varied and designed to represent different groups of biota – a depth integrated water sample represents different biota (plankton) to a pebble basket (macro zoobenthos), for instance. Therefore, each sampling technique represents a different interpretation of the community present in the test system. Many researchers analyse results from different sampling methods separately, especially where the biota differ markedly between sampling methods. This avoids the complexities of combining data for one taxon from different methods with different sampling efficiencies. Examples of this approach include a study summarised in de Jong et al (2010), in which a non-target arthropod community was sampled using photo eclectors, pitfall traps and aspirator samples where abundance data were reported both by taxon and by sampling technique. Van den Brink et al (1995) analysed the effects of pesticides on aquatic communities for zooplankton, phytoplankton and macroinvertebrates separately, revealing the differing responses of the groups. In fact, different sampling methodologies are used to sample different communities and since they are expected to show a different response, data should be analysed separately. 
When more than one sample at each time point is taken using the same method, Hurlbert (1984) recommended taking the mean of sub samples and advised not to treat samples from the same methods separately, as this may lead to calculation error or misinterpretation. Clearly, such sub samples do not represent additional replicates (see 4.2 for more on pseudoreplication). If abundances are low, it may be tempting to pool taxon data from different sampling techniques in order to allow statistical analysis. This should be avoided; it would be preferable to take more samples using each technique and analyse the data separately.
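The recommendation to average sub samples, so that each plot or pond contributes a single value, can be illustrated with a short sketch. The records below are hypothetical (one taxon, one sampling method, one time point) and the function name is an assumption made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (treatment, replicate plot, sub sample count).
records = [
    ("control", "plot1", 12), ("control", "plot1", 15), ("control", "plot1", 9),
    ("control", "plot2", 20), ("control", "plot2", 18), ("control", "plot2", 22),
    ("treated", "plot3", 4), ("treated", "plot3", 6), ("treated", "plot3", 5),
    ("treated", "plot4", 8), ("treated", "plot4", 7), ("treated", "plot4", 9),
]


def replicate_means(rows):
    """Collapse sub samples to one mean per replicate (the experimental unit)."""
    grouped = defaultdict(list)
    for treatment, replicate, count in rows:
        grouped[(treatment, replicate)].append(count)
    return {key: mean(values) for key, values in grouped.items()}


per_replicate = replicate_means(records)

# Count the true replicates per treatment: plots, not sub samples.
n_per_treatment = defaultdict(int)
for treatment, _replicate in per_replicate:
    n_per_treatment[treatment] += 1

print(per_replicate)
print(dict(n_per_treatment))
```

In this sketch each treatment has two replicates, even though six sub samples were taken per treatment; treating the six values as replicates would be pseudoreplication.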

Analysis by functional group

In ecotoxicological community studies, many of the taxa present may have similar characteristics in terms of mode of feeding, physical strata, trophic level and reproductive parameters. This may lead to functional redundancy in such systems (e.g. Clarke, 1999) whereby taxa share a common response pattern to a chemical stressor with others. Therefore, taxa may be grouped into functional or traits-based groups. Depending upon the regulatory question which the field study seeks to answer, researchers may choose to analyse the data on a functional group basis. An example of this approach is Frampton et al (2000), where terrestrial arthropods were classed by trophic group as herbivores, predators, mycophages or omnivores prior to analysis, in a study of drought and irrigation effects. In this case, the results of the analysis conducted with functional groups were very similar to the conventional analysis. The assignation to functional groups was fully explained, which is recommended when data are analysed according to particular ecological groupings. In a second example, Frampton & Van den Brink (2007) reported a different community response of Collembola compared to macroarthropods (excluding aphids) when exposed to three active substances (although both sub community responses to an organophosphate insecticide were similar). Holland et al (2002) provided a discussion of taxonomic diversity and use of taxa as indicators of community level effect. The authors concluded that, for carabid communities for example, analyses conducted with and without dominant species would allow for the determination of the importance of such species in driving the overall community response, especially where that taxon may respond differently to other taxa. The effects of varying community responses on multivariate analyses are explored further in Chapters 7 and 8.

Often, functional groups will be associated with certain sampling techniques, but not always. For instance, there may be different trophic groups within Collembola and Carabidae which are not detectable by sampling method. Analysis of study data on a functional group basis does, therefore, require sufficient ecological knowledge in order to make meaningful distinctions. The division of study data into functional groups should always be justified; arbitrary assignment of taxa to groups should be avoided.

Entire study data should never be analysed as one unit, in, for example, a PRC analysis (see Chapters 7 and 8 for more on PRC). Such an analysis may detect the key community responses, but, given that there may be several different responses in, for example, phytoplankton, zooplankton and macroinvertebrates, interpretation of the results may be confused or may require reference to more than two canonical axes. It is more biologically meaningful to analyse taxa by functional group, especially where taxa in a group share a common sampling technique.

4.4. Section summary

• Randomisation of treatments among plots (within blocks if required) is necessary to preserve the statistical assumption that samples are taken from populations at random.
• Environmental gradients should be controlled for in study design to prevent influence on particular treatment groups.
• Randomised, blocked or other study designs may be used to control for environmental gradients.
• Study design should always be fully justified and include replication, plot/study replicate layout and scale.
• Replication allows the between treatment variation to be tested against the within treatment variation, and hence the statistical significance of treatment effects to be evaluated. The ability of the study to detect effects (statistical power) increases with increased replication.
• Researchers should be aware of pseudoreplication.
• Study data should be analysed on a sampling method basis, since different sampling methods represent different communities which may be expected to respond differently.
• Study data may be analysed on a functional group basis to describe the various responses that may occur amongst taxa collected from the same sampling method. Such groupings should be fully explained and justified.


5 General introduction to statistical testing

The aim of this chapter is to summarise and draw into one document some key aspects of statistics with regards to design and interpretation of ecotoxicological field community studies. We refer extensively to Zar (1999), which is an example of the many text books available. Ecotoxicological community studies which are the focus of this report are regulatory studies and normally subject to the requirements of Good Laboratory Practice, meaning that the study records made at the time should allow the conduct and interpretation of the study to be reconstructed after the event. Therefore, the statistics used to derive results used in interpretation should be thoroughly explained and justified, so that a non-specialist may follow the logic behind the analyses. This would ensure the traceability of the study output. Examples would include justification for: transformations of data, statistical methods used and explanation of the conclusions drawn. To simply state what was done (without justification) may not necessarily aid in reconstruction or reinterpretation, if required, at a later date. A general procedure for initial data investigation and statistical analysis is outlined in Chapter 9.

5.1. Introduction to statistics in community ecotoxicology studies

Following completion of the field phase, the study data may be investigated. Summaries of the (geometric) means of each taxon in the respective treatments and the overall abundance are examples of descriptive statistics. These may allow us to inspect the overall characteristics of the gathered data. Other examples include standard deviation, standard error and measures of skew or kurtosis. In order to answer questions regarding the impact on the community of exposure to a chemical stressor, however, inferential statistics, such as hypothesis testing, may be employed.

5.2. Hypothesis testing

Ecotoxicological field studies are conducted in order to determine if exposure to a given level of chemical stressor causes an effect on the community present within the test system when compared to a control treatment. It may be desirable to know whether the result is ‘statistically significant’. Prior to study start, hypotheses may be set and subsequently tested in order to determine the significance of the result. The aim of the procedure is, typically, to determine if two statistical ‘populations’ (not necessarily represented by populations of organisms, but often so in ecotoxicology) are the same. That is, do the samples come from the same, or identical, statistical distribution(s)? Samples are taken from the real populations, and inferences made on the basis of those samples. The symbol µ is used to represent the true population mean.

Firstly, a null hypothesis is stated (H0). This forms the basis of the test and expresses the concept of no difference. In ecotoxicological field testing, the null hypothesis may be that the treatment mean abundance equals the control mean abundance for a given taxon. This may be expressed as

H0: µT = µC

Where

µT = true population mean of the treatment group

µC = true population mean of the control group

Note that the symbol for true population means, µ, is used, and therefore we can summarise the null hypothesis as stating that the population mean in both control and treatment groups is the same (i.e. they are drawn from the same, or an identical, statistical distribution). Since the aim of the test is to determine, within the limits of the study design and the data gathered, whether or not the population means are the same, by definition an alternative hypothesis, HA, is required. This may be expressed as

HA: µT ≠ µC

Here, the hypothesis is that the population mean in the control and treatment groups is not the same (i.e. the two samples are not drawn from the same statistical distribution). If the null hypothesis is rejected, then the alternative hypothesis is accepted. Alternatively, the null hypothesis may not be rejected and is therefore accepted. Hypotheses should be stated prior to study conduct and not in response to data examination. Hypotheses are accepted or rejected at a given level of probability, which is discussed below. It must be stressed that such hypothesis testing is conducted within the limits of study design (any bias due to sample site location, inherent environmental gradients, random allocation of replicate test units to treatment groups, replication, sampling efficiency and variation of the data collected). Such aspects of study design are discussed in Chapter 4. It is intuitive that the more measurements are taken, and the greater the precision of those measurements, the higher the chance of accurately concluding on the relationship of the two population means.

5.3. Probability within statistical hypothesis testing

Once hypotheses have been identified, the data can be collected. We often do not know the true population mean µ, nor the true size of the population N, and instead we take at random a number (n = sample size) of measurements Xi, from which we may calculate the sample mean, X̄. We may then seek to test the probability of the observed result occurring due to chance alone, or occurring if the null hypothesis were true (i.e. what is the probability of obtaining the result if there is no difference between control and treatment populations?). More formally, we are testing the probability of a sample occurring, in a normal distribution, as far from the population mean µ as the observed X̄. If the probability of reaching this result is shown to be low (lower than a previously set threshold), then it may be concluded that the result is probably not due to chance, and there is a real effect (reject H0, accept HA). Likewise, if the probability of reaching that result due to chance alone is above the set threshold, then H0 may be accepted (i.e. there is no significant difference as the probability of that result occurring due to chance alone is higher than we wish to accept). The threshold of probability for rejecting the null hypothesis is typically set to 5%, sometimes less, and is termed α. This means that the likelihood of rejecting the null hypothesis when, in fact, there is no real difference (i.e. drawing the wrong conclusion) is 5%. In other words, there is a 5% chance of concluding a significant difference even if the null hypothesis is true. Rejecting the null hypothesis under these circumstances is termed a Type I error, or false positive. Sometimes, the null hypothesis will not be rejected when in fact there is a real difference, and this is termed a Type II error, or false negative (i.e. it is concluded that there is no difference between the samples or populations when, in fact, there is).
Whereas α, the probability of committing a Type I error, is pre-determined, the probability β of committing a Type II error is not generally specified or known. The values of α and β are inversely related, however, such that as the probability of erroneously rejecting the null hypothesis (false positive or Type I error) decreases (e.g. reducing α to 1%, perhaps to prevent erroneous conclusions due to chance), the risk of erroneously accepting the null hypothesis increases (i.e. real effects may be missed).


The levels of α should be set before the analysis. Only increased sample size n can reduce both α and β and therefore the chance of drawing a wrong conclusion. The statistical power of the test may be expressed as 1-β (i.e. the probability of correctly rejecting the null hypothesis). High variability can lead to higher Type II error rate, or increased risk of a false negative, influenced by the number of replicates and inter and intra treatment variability (e.g. Kennedy et al 1999). A statistical test will return a calculated test statistic, which is compared to a distribution for that test statistic. The user then compares the calculated value to the distribution for a statistical test of a given sample size (this is often done automatically in computer programmes). The place of the calculated statistic in the distribution indicates the level of significance of the result. This is reported in terms of probability p, where the calculated value of p is required to be equal to or below the threshold set prior to the test in order to conclude a real effect. Where α is 5%, the probability of rejecting the null hypothesis when there is in fact no real effect is 0.05, and would be reported as p <0.05. The p value is often reported as the key output from a statistical test, but is usefully accompanied by the calculated test statistic and degrees of freedom. The lower the value of p, the more significant the result since the probability of rejecting the null hypothesis simply due to chance is reduced. Historically, test results have been reported as ‘significant’ (p <0.05 or *), highly significant (p <0.01 or **) or very highly significant (p <0.001 or ***), although it is now generally recommended that the exact p value is reported. 
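The relationship between α, β, statistical power and sample size can be explored by simulation. The sketch below is illustrative only and rests on simplifying assumptions that are not taken from this report: normally distributed data, a known common standard deviation (so that a simple z test can be used), a true difference of one standard deviation, and arbitrary function names and seed.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def z_test_p(sample_a, sample_b, sigma):
    """Two-sided p value for a difference in means, assuming a known common SD."""
    se = sigma * sqrt(1 / len(sample_a) + 1 / len(sample_b))
    z = (mean(sample_a) - mean(sample_b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def rejection_rate(true_difference, n, alpha=0.05, sigma=1.0, n_sim=4000, seed=1):
    """Fraction of simulated studies in which H0 is rejected at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        control = [rng.gauss(0.0, sigma) for _ in range(n)]
        treated = [rng.gauss(true_difference, sigma) for _ in range(n)]
        if z_test_p(control, treated, sigma) <= alpha:
            rejections += 1
    return rejections / n_sim


# H0 true: the rejection rate estimates the Type I error rate (close to alpha).
type_i_rate = rejection_rate(true_difference=0.0, n=3)
# H0 false: the rejection rate estimates power (1 - beta).
power_n3 = rejection_rate(true_difference=1.0, n=3)
power_n12 = rejection_rate(true_difference=1.0, n=12)
# Lowering alpha with the same data lowers power (raises beta).
power_n3_alpha01 = rejection_rate(true_difference=1.0, n=3, alpha=0.01)

print(f"Type I error rate: {type_i_rate:.3f}")
print(f"Power, n = 3:  {power_n3:.3f}")
print(f"Power, n = 12: {power_n12:.3f}")
print(f"Power, n = 3, alpha = 0.01: {power_n3_alpha01:.3f}")
```

Under these assumptions the simulated Type I error rate stays near α regardless of n, whereas power rises with replication and falls when α is reduced.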
Hinds (1984) proposed that in ecological field investigations, due to the experimental conditions encountered, error rates of 10% may be considered instead of the usual 5%, which originated from laboratory experimentation under controlled conditions. This is due to background variability and reduces the likelihood of a Type II error (false negative). An example is de Jong et al (2010), where the univariate statistical significance of non-target arthropod study data is reported at both 5% and 10% levels (p <0.05, p <0.10). It should be stressed, however, that α of 5% is routinely used and reliance on 10% alone during study interpretation would require justification. The potential significance of a calculated test statistic is limited by sample size, which is discussed further below. The level of probability p which may be achieved with a study design relates to the average α for all possible plot layouts, from a completely randomised design. If, however, a blocked design is used to control gradients, α may in fact be lower, but this cannot be known for specific plot layouts (i.e. α for a specific layout may be less than the α as indicated by all possible permutations of replicates and treatments, Hurlbert, 1984). On the other hand, if bias is present, α may be higher than that indicated by all permutations of replicates and treatments. Hence, the importance of minimising bias by randomisation and appropriate blocking (see Hurlbert, 1984 for more detail). Levels of probability p which may be achieved with certain study designs are discussed below (Chapter 5.6, Replication). While p values are routinely used in ecotoxicity field studies and many other statistical contexts, their use has been criticised by researchers since the p value may be considered to confound both the size of the observed effect and the precision of the measure (a function of the size of the study).
A critique of p values and discussion surrounding their use is provided in Lang et al (1999), Goodman (2001) and Weinberg (2001), in an epidemiological context. It should be remembered that a p value does not represent certainty. Even when data are abundant and study design is well controlled and bias minimised, there is a probability of committing a Type I error, equivalent to α. When studies are poorly designed, however, for instance with inadequate randomisation, inadequate replication or inherent bias of sampling, then the likelihood of such an error increases. Studies should also be interpreted in the light of other (reliable) data, wherever possible (such as other field studies), and data visually inspected, thus involving an element of expert judgement during interpretation. Supporting statistics, such as confidence intervals and variance explained, may help interpretation. In Chapter 7, this is discussed with regards to principal response curves.

5.4. Repeated two sample testing

In community ecotoxicology studies, typically there will be more than one treatment group, the response of which we wish to compare to a control group. These may be different concentrations of the test item in aquatic studies, or full field and drift rates, for example, in terrestrial studies. We need to know which treatment group had a significant effect on the community through comparison to the control group; ideally at which time point any effect was first recorded and when recovery may have occurred. In addition to biological interpretation, statistical testing is clearly an important part of data analysis and interpretation. There are various statistical tests which allow the user to compare the mean (e.g. t test) or the variance (e.g. F test) of two samples to determine if the samples may have been drawn from the same statistical population (i.e. to test for a difference between the two samples). There are also non-parametric methods which are appropriate for comparing two samples when the data do not conform to the requirements of the parametric tests (e.g. Mann Whitney); see Chapter 6 for more on requirements of univariate tests. Variations of such tests accommodate paired and unpaired data also. Each of these tests, however, is appropriate for comparing two samples only. As described in Zar (1999), it may be tempting for the user who has gathered data from more than two treatment groups to conduct two sample tests on all combinations sampled. For instance, the null hypothesis for a terrestrial field study may be H0: µC = µF = µD, (i.e. that there is no difference between the true population means (represented by our samples) of the Control (µC), Field rate (µF) and Drift rate (µD) treatment groups, respectively, for a given time point). This could then be tested with a two sample test thus: H0: µC = µF, H0: µC = µD, H0: µF = µD.
(The latter comparison, of field rate and drift rate, would not usually be considered very informative in an ecotoxicological context, but the repeated two sample comparison of control/field rate and control/drift rate is sufficient for this example). A problem with employing a two sample test such as the t test in a repeated fashion to compare more than two samples is related to the Type I error rate. When two samples are taken and compared, the chance that the user will conclude that the samples represent two different populations when in fact there is no difference is no greater than α. This is the rate of Type I error, often set to 5%. If, however, the user takes three samples from this population and compares them pairwise, the Type I error rate rises to 14%, or 0.14. When we conduct such a test with two samples, the probability that we will correctly accept the null hypothesis when the two population means are equal is 95% (where α is 5%). This falls, however, when more than two samples are considered, to 0.95³ = 0.86 for three samples (three pairwise comparisons). That is, there is a 1 - 0.86 = 0.14 or 14% chance that we will erroneously reject the null hypothesis when in fact there is no real difference. The more samples are tested in this manner, the higher the probability of a Type I error until it becomes inevitable that this error will be made. For two means, there is only one pairwise comparison that can be made. For three means, there are three pairwise comparisons, rising to 10 for five means. Table 5.4-1 summarises the probability of committing a Type I error with increasing number of samples when two sample tests are used (after Zar, 1999). The number of pairwise comparisons can be found according to the following equation:

$$K = \frac{k(k - 1)}{2}$$

Equation 1

Where

K = number of pairwise comparisons

k = number of means or samples
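Equation 1 and the error rates summarised in Table 5.4-1 can be reproduced with a few lines of code. This is a sketch under the simplifying assumption that the K pairwise comparisons behave as independent tests, so that the chance of at least one false positive is 1 - (1 - α)^K; the function names are illustrative.

```python
def pairwise_comparisons(k):
    """Equation 1: number of pairwise comparisons among k means."""
    return k * (k - 1) // 2


def familywise_type_i(k, alpha):
    """Chance of at least one Type I error across all K comparisons,
    treating the comparisons as independent (an approximation)."""
    return 1 - (1 - alpha) ** pairwise_comparisons(k)


print("samples  comparisons  alpha=0.05  alpha=0.01")
for k in range(2, 6):
    print(k, pairwise_comparisons(k),
          round(familywise_type_i(k, 0.05), 3),
          round(familywise_type_i(k, 0.01), 3))
```

The values obtained in this way agree closely with those tabulated after Zar (1999); small differences in the second decimal place reflect rounding.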

Table 5.4-1. Probability of committing a Type I error by using a two sample test to perform multiple comparisons between more than two sample means. For two means, the probability of a Type I error is equal to that used in the test. For more than two samples, the probability of committing a Type I error is increased. After Zar (1999).

| Number of samples | Number of pairwise comparisons | Level of significance (α) used in the two sample test: 0.05 | Level of significance (α) used in the two sample test: 0.01 | Note |
| 2 | 1 | 0.05 | 0.01 | Proper use of two sample test |
| 3 | 3 | 0.14 | 0.03 | Use violates test assumptions; higher risk of Type I error than suggested by α |
| 4 | 6 | 0.26 | 0.06 | Use violates test assumptions; higher risk of Type I error than suggested by α |
| 5 | 10 | 0.40 | 0.10 | Use violates test assumptions; higher risk of Type I error than suggested by α |

Instead of using multiple pairwise tests, it is appropriate to conduct a multiple sample test such as an analysis of variance to start with. Such tests are able to include multiple treatments of a single experimental factor (e.g. levels of test item concentration; one way or single factor tests) or two factors (such as test item concentration and nutrients; two way or two factor tests). More than two factors may be accommodated by multiway analyses. These tests may be termed global tests, and are able to detect a difference between the treatment groups, but will not indicate between which treatment groups the difference lies. Two way tests will also inform of any interactions between the two factors (for example between the factors test item and nutrients). It is then appropriate to use multiple comparison tests (such as Dunnett's test for comparisons against a control) to test specifically which treatment group differs from the control group at any one time. This is addressed in Chapter 6.

5.5. Importance of assumptions – normality, independence, randomisation

Hypothesis testing techniques can be broadly divided into parametric and non-parametric. Generally speaking, parametric tests make some assumptions about the underlying data, often with regards to the distribution of the data (i.e. distribution parameters are known). Non-parametric tests do not make such assumptions. Where the data do conform to the requirements of parametric tests, such tests are often considered to be more powerful. If, however, the data do not conform to the requirements for the test (which vary between tests) then the results of a parametric test may be misleading. The assumptions of parametric tests may include that the data are normally distributed, that the distributions have equal variance and that the observations are independent. Non-normal data may be transformed prior to analysis in order to make them fit the assumptions; this is perfectly acceptable. Statistical tests may then be performed on the transformed data. Summary or descriptive statistics, however, should be reported in the original units of measurement. Log transformations are typically used in ecotoxicology for abundance values. This is discussed in more detail in Chapter 7. For more on general transformation, see for example, Zar (1999) and Underwood (1997, p. 187-194).
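As an illustration of transforming abundance data for analysis while reporting summaries in original units, the sketch below applies a natural log transformation and back-transforms the mean. The ln(x + 1) form, the example counts and the function names are assumptions made for illustration; the transformation appropriate to a given study should be chosen and justified as discussed in Chapter 7.

```python
from math import exp, log
from statistics import mean


def log_transform(counts):
    """ln(x + 1) transform; the +1 offset (an assumption here) allows zero counts."""
    return [log(x + 1) for x in counts]


def back_transformed_mean(counts):
    """Mean calculated on the log scale, returned in original units."""
    return exp(mean(log_transform(counts))) - 1


abundances = [0, 3, 7, 15, 120]            # hypothetical counts for one taxon
transformed = log_transform(abundances)     # values passed to the statistical test

print([round(v, 2) for v in transformed])
print("arithmetic mean:", mean(abundances))
print("back-transformed mean:", round(back_transformed_mean(abundances), 1))
```

The back-transformed mean is less influenced by the single large count than the arithmetic mean, which is one reason geometric-type means are often reported for skewed abundance data.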


When a statistical hypothesis test is conducted, a test statistic is calculated. This is then compared to a table of test statistics for given degrees of freedom (related to sample size) which provides the probability of this result occurring due to chance, in the form of the p value. The table of test statistics itself represents a distribution, which could theoretically be calculated from the assumptions of the test (e.g. Ter Braak & Šmilauer, 2002). If the data on which the statistical test was based do not conform to the requirements of the test, then the calculated test statistic is not drawn from the same distribution as the reference distribution and any conclusion based on it may be misleading. That is, the test assumes that the calculated test statistic lies somewhere in the reference distribution; the significance of the result depends upon where in the distribution the calculated value lies, but if the underlying data do not conform to the reference distribution, then the test result cannot meaningfully be compared to the reference distribution. Therefore, it is important to ensure that the underlying data do conform to the requirements of the test, and that this is justified. In fact, methods such as analysis of variance (ANOVA; see Chapter 6) are relatively robust and able to handle certain departures from normality, and the statistical methods employed in multivariate approaches such as Monte Carlo permutation tests (see Chapter 7) are non-parametric. Nonetheless, this serves to highlight the need to ensure that the appropriate method is selected for data analysis. That samples are taken from the test population at random is an important assumption. If samples were not taken at random (i.e. were consciously chosen), then significant bias could be introduced which would affect the underlying distribution of the gathered data and thus the conclusions drawn from it.
This applies to both the selection of test plots or mesocosms, for instance, and the sampling of those plots or mesocosms. For instance, the conscious location of pitfall traps in a particularly damp part of a field, or attempting to take a water column sample in the middle of a concentration of zooplankton, is not sampling in a random fashion. Replicates should be independent from one another. That is, one replicate should not exert an influence over another (in this case, for example, the trajectory of the community in one replicate should not influence the trajectory of the community in another replicate). Replicates may not be independent where they are physically connected (e.g. interconnecting pipes used for pre study mixing in a pond system are left open) or where replicates share an application event. If a replicate is not independent from others, then the samples taken from it are not taken from the statistical distribution at random, as they are influenced by other samples.

5.6. Replication

Almost all experiments, especially field studies, are prone to chance events, for example changes in weather or unexpected mortality of test organisms unrelated to treatment. Chance events may affect all or only some experimental units. Replication provides some resilience against chance events, since more than one measurement is taken.

Replication is closely associated with independence. Replicates may be considered true replicates when they are independent of one another; for instance, samples taken from two different field plots which received separate applications of the test item, allocated at random, are two replicates. Two samples taken from the same plot or pond do not, however, constitute replicates since the samples are not independent: the application event was shared and the contents of each are influenced by the other. Independence of data points is an important assumption of many statistical tests.

Increased replication can lead to greater statistical power and a greater likelihood of avoiding Type I and II errors. Typically, higher replication in ecotoxicological field studies and monitoring studies equates to increased expense and physical space. Therefore, replication is typically limited due to financial and practical constraints. The number of replicates possible in aquatic replicated manipulative field studies may be limited by the number of test ponds available.

It is possible to determine the lowest probability level of a Type I error which may be achieved with a particular level of replication. Kedwards et al (1999b) provided an example with regard to Monte Carlo permutation testing (a technique explained further in Chapter 7). Three replicates in both a control and a treatment group are required to achieve a minimal p value of 0.05, for example. The number of possible permutations is 6!/(3! × 3!) = 20, and the lowest p value is 1/(number of permutations), i.e. 1/20 = 0.05. For two treatments and a control, each replicated three times, the number of permutations is 9!/(3! × 3! × 3!) = 1680 and the lowest p value possible is 1/1680 ≈ 0.0006.
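The permutation arithmetic above can be reproduced with a short sketch. This is an illustration only, not part of the cited method; the function names `n_permutations` and `min_p_value` are hypothetical and chosen for this example.

```python
from math import factorial


def n_permutations(group_sizes):
    """Number of distinct ways to allocate the experimental units
    to groups of the given sizes: N! / (n1! x n2! x ...)."""
    total = factorial(sum(group_sizes))
    for size in group_sizes:
        total //= factorial(size)
    return total


def min_p_value(group_sizes):
    """Lowest p value a permutation test can return: 1 / permutations."""
    return 1 / n_permutations(group_sizes)


# Control and one treatment, three replicates each: 6!/(3! x 3!)
print(n_permutations([3, 3]), min_p_value([3, 3]))

# Control and two treatments, three replicates each: 9!/(3! x 3! x 3!)
print(n_permutations([3, 3, 3]), round(min_p_value([3, 3, 3]), 4))
```

Changing the group sizes shows how quickly the minimum achievable p value falls as replicates or treatment groups are added.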

5.7. Section summary

The following are not stipulations, but points for researchers and risk assessors to consider.

• Statistical testing may only provide an approximation of the ‘real world’ via the limitations of study design, sample size, precision and probability of chance events.
• Statistical hypotheses provide the framework around which the probability of the observed result occurring due to chance may be calculated.
• Study design in terms of replication (sample size) and the variance of the data influence the ability of the study to detect differences between populations.
• Statistical tests designed to compare two samples should not be used to compare three or more samples.
• The assumptions of any statistical tests used should be considered and the outcomes reported (i.e. the tests used should be justified).
• Higher replication increases the statistical power of the study to detect real effects and reduces likelihood of erroneous conclusions due to chance.
• In community ecotoxicology field studies natural variation will be present leading to variation in the results.

6 Initial data interpretation and univariate approaches

6.1. Introduction

Univariate analyses allow the researcher to examine the response of populations to the stressor (i.e. comparison of abundance of one species or taxon in control and treatment groups). This may lead to a taxon level NOEL or an indication of the time to recovery for that taxon, for instance. The effective null hypothesis at each time point for each taxon is that the abundance levels are equal in control and treatment groups. In regulatory higher tier testing, the widespread use of complex statistical packages means that it is possible to routinely conduct a great many analyses with relative ease. It is necessary, however, to examine the data beforehand to determine if such tests are justified and if the requisite assumptions have been met. Examining data on a taxon level basis can reveal life history trends which may be important in the interpretation of the study.

Univariate analyses were used in the interpretation of regulatory field studies before multivariate methods became available and remain an important tool for analysis of population and community level data. This is because recovery of individual taxa may be considered to occur at the population, rather than community, level. Clearly this is open to debate, yet univariate analysis allows for a taxon level assessment of effect and recovery. This is especially important where specific taxa or groups of taxa of concern have been identified in the study design phase. The multivariate assessment technique most commonly used at present in community ecotoxicological field studies, principal response curves (PRC) analysis, is not designed to represent all responses at the population level, since it looks for wider trends in the data. Therefore, univariate analyses are complementary to multivariate techniques and should not be seen as redundant in the light of them.

6.2. Initial investigation

Faced with a body of raw data it is usually a worthwhile first step to plot abundance over time for all taxa recorded in the study (on a log scale), with different lines for treatments, and to examine these graphs with respect to the biology of the organisms concerned. At this stage some data will show no trend at all (due to low affinity to the test system); examples are flying insects caught in pitfall traps, or taxa with intermittent periods of presence or absence throughout the study period. There will always be some taxa, present in the samples, for which the study is not able to make a meaningful conclusion regarding the effects of treatment on them. This does not necessarily mean such taxa are not sensitive to the test item, but that the design and sampling used may not have been appropriate to detect these effects. If those taxa are considered important for the regulatory process then this should be considered during study and sampling design.

The life history of specific taxa may be revealed in the abundance plots, including emergence events or development between life stages as the season progresses. Awareness of such aspects may aid in the interpretation of taxon level NOEL values and detection of recovery. For example, if numbers naturally decline towards the end of the season, it may not be possible to demonstrate recovery. Alternatively, if emergence from a protected life stage (e.g. egg or pupa) occurs part way through the study, that taxon may demonstrate a different effect and recovery pattern to those taxa present and exposed at test item application.

6.3. Outliers

Outliers are observations that stand out from the rest and regularly occur in ecotoxicology field data. Outliers can dramatically affect the outcome of some statistical analyses and are generally best detected by using graphical tools such as the Cleveland dot plot (Cleveland, 1993). The data set should be examined for outliers and a decision to exclude them made on a case by case basis. Very often the outliers turn out to be actual data errors; with such a large body of data there are likely to be data entry errors, and if these are detected then they can be corrected. In other cases the reasons may be present in the field notes: a flooded pitfall trap or one containing a dead small mammal would result in very low catches and a prevalence of necrophagous organisms, respectively. Both of these may occur as outliers, and such data would best be excluded from the analysis. The circumstances under which exclusions from analysis will be made should be specified in advance in the study protocol, to minimise bias (which might occur if researchers choose a posteriori which data to analyse).

6.4. Screening prior to analysis Prior to analysis, it is desirable to screen the data to focus efforts on areas of most relevance. The commonest type of screening is based on abundance itself. It is unlikely that anything worthwhile can be said about taxa occurring in very low numbers (e.g. one or two individuals), on a given sampling date. A threshold of abundance is decided upon below which univariate analysis is not deemed to be worthwhile. With relatively homogeneous data (such as visual counts of Phytoseiid mites on leaves) this abundance threshold may be lower than with very heterogeneous data (e.g. water trap counts of flies in a wheat field). Of course, this is a subjective decision and open to criticism. However, without recognition of the effects of low abundance on results, such data can be misinterpreted. One individual observed in one pitfall trap (of 40 in each treatment) on one pre-treatment sampling date and no further catches in the study could be interpreted as a 100% impact due to treatment. In fact, it is not possible to say anything conclusive about that taxon and its response to the treatment. It is also worthwhile considering at this stage what the taxa are and whether their presence in the samples makes sense. For example, water can sometimes be found in quite high numbers in pitfall trap samples from arable field studies. Their presence is nothing to do with the fields or the treatments applied to them. It seems probable that they are attracted to the reflective plastic covers used to prevent the traps from flooding during heavy rainfall. It doesn’t make sense to analyse their abundance data within the context of a terrestrial field study. This aspect is closely related to the context and design of ecotoxicological community studies. 
If the study design is focused on certain taxa (due to concerns raised over effects on these taxa, or study design is appropriate only for certain taxa), then opportunistic analysis of other, non relevant taxa should be avoided even if their abundance is high. This is because the hypotheses should be set prior to study commencement, not afterwards. Data exploration should not be used to define the questions that a study sets out to test (unless historical data is being analysed, in which case the data may be divided intelligently into two – the first portion used to construct hypotheses and the second to test them – see e.g. URL 1 in References). Related to this are issues of data snooping/dredging, that is conducting many analyses on various combinations of data to preferentially seek significant results while not accounting for the likelihood of Type I error rate (e.g. 5%). That is, repeated tests of the same statistical population may, by chance, throw up results which appear significant, but which are not real effects. This is related to repeated two sample testing, Chapter 5. Such issues are particularly relevant for univariate testing. White (2000) provided a general discussion of data snooping and mathematical background.

6.5. Data pre-requisites for analysis

Some general discussion of assumptions of statistical tests is provided in Chapter 5 (normality, independence, randomisation). Here, we focus on test specific assumptions.

Parametric tests require that data are normally distributed. An initial visual inspection via a simple plot is recommended. Descriptive statistics for skewness (i.e. whether a distribution is skewed left or right) and kurtosis (platykurtic: a flattened and broadened distribution curve; leptokurtic: a narrowed and heightened distribution curve) may be easily calculated. These may suggest any gross trends. It is also possible to test whether a sample came from a normally distributed population. Recommended methods include the Shapiro-Wilk (1965) test and the procedure proposed by D’Agostino and Pearson (1973). These methods are described by Zar (1999). Various transformations may also be used (aspects of transformations are discussed further in Chapter 7).

Most statistical techniques require the observations of a variable to be independent of one another. For this reason it is not appropriate to conduct analysis of variance (ANOVA) with time as a variable, since the observations at one time point are not independent of the observations at a preceding time point.

Formally, ANOVA requires normality, independence of data points and homogeneity of variances. ANOVA is generally considered to be robust against departures from certain assumptions, however, such as homogeneity of variances (for which it is more robust where sample sizes n, i.e. the number of measurements from each treatment, are similar; Zar, 1999). If there are three or more samples then the homogeneity of variances can be tested. The most common procedure is Bartlett’s test (Bartlett, 1937a, 1937b), but this is not generally considered efficient or necessary when using ANOVA (e.g. Zar, 1999). The term homoscedasticity is used to mean homogeneity of variances. The solution to heterogeneity of variances is either to apply a transformation of the response variable so as to stabilise the variance, or to apply non-parametric statistical techniques that do not require homogeneity (e.g. Underwood, 1997).

Further assumptions of ANOVA are:
1. that the effects are additive, i.e. that an individual value of x is made up of the grand mean plus treatment effect plus uncontrolled error;
2. that uncontrolled error is normally distributed and has an equal variance for all the treatments.

For non-parametric tests, the requirements are less stringent than for parametric tests. Non-parametric tests equivalent to ANOVA (e.g. Kruskal-Wallis; see e.g. Zar, 1999) may be used on fully randomised designs. They are appropriate where samples are taken from non-normal populations, or when the variances of the samples are heterogeneous. The Friedman test may be used for randomised blocked designs. Despite non-parametric tests being distribution free, samples must still be taken from populations at random.
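As a rough illustration of the descriptive statistics for skewness and kurtosis mentioned in this section, the sketch below uses the simple moment-based definitions (population moments, with kurtosis expressed as excess over the normal value of 3). Statistical packages may apply small-sample corrections, so their values can differ; the function names and the example counts are hypothetical.

```python
from statistics import fmean


def central_moment(values, order):
    """Mean of (x - mean)^order over the sample."""
    mean = fmean(values)
    return fmean([(v - mean) ** order for v in values])


def skewness(values):
    """Moment-based skewness: positive = long right tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3: negative = platykurtic,
    positive = leptokurtic."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


# Hypothetical counts: one symmetric set and one with a single very high value.
symmetric_counts = [1, 2, 3, 4, 5]
skewed_counts = [23, 12, 3, 648]

print(skewness(symmetric_counts), excess_kurtosis(symmetric_counts))
print(skewness(skewed_counts), excess_kurtosis(skewed_counts))
```

A strongly positive skewness, as in the second set, is typical of count data and is one reason a transformation is often considered before a parametric test.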

6.6. Univariate tests and statistical power

Univariate tests may be divided into two groups: global or omnibus tests, and multiple comparison tests. Global tests include analysis of variance (ANOVA, of which there are several models), which will indicate if a significant difference exists amongst samples taken from a control group and a number of treatment groups (i.e. that the treatment regime as a whole had a significant effect). Global or omnibus tests may also be termed multi-sample hypothesis tests, as they are able to analyse data from more than two groups at a time (i.e. H0: µ1 = µ2 = µ3 = … = µi). It may then be desirable to test which treatment had a significant effect by comparing each group to the control in a separate multiple comparison test.

Field studies with more than two treatments are not suitable for analysis with a test for two samples (such as the t test) and require a test to address a multi-sample hypothesis in the first instance, possibly followed by a multiple comparison test. See Chapter 5 for further details on issues surrounding repeated two sample tests. Commonly used multiple comparison tests include the Tukey test, Newman-Keuls (or SNK) test, Bonferroni test, Williams test and the Duncan test. In general, the multiple comparison tests for means have the same underlying assumptions as ANOVA. If pre-treatment sampling indicates differences between blocks, an analysis of covariance (ANCOVA) could be used to account (i.e. test) for differences between blocks. Similarly, a two way ANOVA could be used to test for an interaction between treatment and block.

In terrestrial studies, the multiple comparison tests mentioned above are typically used to generate taxon level NOEL values. In aquatic studies, however, the Williams test (Williams, 1971, 1972) is often used to derive taxon level NOEC values. This test combines ANOVA with a dose-response design and assumes a monotonic increase of effect with increasing dose. It was developed in the medical sciences, where replication is low, and it is powerful when a dose-response experimental set-up is chosen. It is commonly implemented via the program Community Analysis (Hommen et al, 1994). The test allows comparisons of multiple treatment groups to a control. To be able to calculate a NOEC, a replicated set of treatment levels representing different rates and a replicated control (zero rate) treatment must be present. The Williams test can be used where only two replicates per treatment group are present.
If the data are not normally distributed, or if the population variances appear to be heterogeneous, and the data were collected according to a completely randomised design, it is possible to test non-parametrically for differences between treatments using the Kruskal-Wallis test (Kruskal and Wallis, 1952). The Kruskal-Wallis test is the non-parametric equivalent of ANOVA (i.e. it is a multi-sample hypothesis test) and, where the assumptions of ANOVA are met, is about 95% as powerful. Equivalent non-parametric multiple comparison tests, to test specifically which treatments differ from each other, include the Nemenyi test, which is similar to the Tukey test but uses rank sums rather than means (Zar, 1999).

The power of a statistical test is the probability that it will reject the null hypothesis when it is false. Before performing a study it is desirable, but not particularly practical, to investigate the power of the proposed test. Using taxonomic counts from pre-treatment data it is possible to determine the smallest difference it is possible to detect between two means and the probability of committing a Type II error (failure to reject the null hypothesis when it is untrue, i.e. a false negative). It is also possible to determine the required sample size for analysis to achieve a desired level of power. Indeed, doing so may show that the power is so low that the experiment needs to be run with many more replicates or should not be run at all. This may lead to a greater number of sub-samples (from one replicate, to reduce variability, not to increase replication) or the rejection of the test site as being too heterogeneous.
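To give a feel for how variability and the size of the difference to be detected drive the replication required, the sketch below uses a standard textbook normal approximation for comparing two means. It is not a method prescribed in this report, and a real power analysis would normally use the planned test (for example a t-based or count-data method). The function name, the default α of 0.05 and power of 0.80, and the standard deviations are all assumptions made for illustration.

```python
from math import ceil
from statistics import NormalDist


def replicates_per_group(sd, difference, alpha=0.05, power=0.80):
    """Approximate replicates per group needed to detect `difference`
    between two means with a two-sided test (normal approximation):
    n = 2 * ((z_(1-alpha/2) + z_power) * sd / difference)^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / difference) ** 2)


# Hypothetical pre-treatment standard deviations on a log-abundance scale,
# for a difference of 1.0 log unit between control and treatment means.
for sd in (0.5, 1.0, 2.0):
    print(sd, replicates_per_group(sd, difference=1.0))
```

Under these assumptions, doubling the standard deviation roughly quadruples the replication required, which is why a heterogeneous test site can make an adequately powered study impractical.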

6.7. Consecutive significant events

It is straightforward with modern software packages to run univariate statistics for all taxa present in a study. However, if there are 350 taxa present in the study and 350 ANOVAs are conducted for every time point (e.g. 10 time points in a study), giving a total of 3500 ANOVAs, then at a significance level of 0.05 around 175 test results (5% of 3500) may show differences between groups due to chance alone. This is a Type I error and essentially means statistically significant differences may be recorded due to chance alone and not due to treatment. Mindful that such errors occur, it is also common when reviewing reported field studies to treat individual statistically significant differences with some degree of scepticism. Consecutive significant effects may be considered a real effect and are less likely to be due to chance, whereas single significant effects are more likely to be due to chance. The timing of sampling events may affect detection of effects, however, in that infrequent sampling may allow effects to occur undetected, or to be detected at only one sampling event and hence not be considered significant (indicating that an improved study design is required). Also, the occurrence of a statistically non-significant effect between two statistically significant effects would require a visual assessment of the abundance data in order to conclude on the overall significance. Biological and statistical significance of effects should always be considered.
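The arithmetic behind this reasoning can be sketched as follows. The taxon and time-point counts are those of the example above; the last two quantities assume that tests at different time points are independent, which is a simplification made for illustration only (repeated samples from the same plots are not strictly independent).

```python
taxa = 350
time_points = 10
alpha = 0.05

# Total number of univariate tests and the number expected to be
# significant by chance if no taxon is truly affected.
n_tests = taxa * time_points
expected_false_positives = n_tests * alpha

# ASSUMPTION: independence between time points.
# Chance that an unaffected taxon is flagged on at least one date.
p_any_false_positive = 1 - (1 - alpha) ** time_points

# Chance that an unaffected taxon is flagged on two given consecutive dates.
p_two_consecutive = alpha ** 2

print(n_tests, expected_false_positives)
print(round(p_any_false_positive, 3), p_two_consecutive)
```

Under these assumptions a single significant result is quite likely to arise by chance over a ten-date study, whereas a chance result on two particular consecutive dates is much less likely, which is consistent with the weight given to consecutive effects.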

6.8. Percent reduction measures

Laboratory and field study reports sometimes include a measure of percent reduction compared to the control, calculated according to the following equation (Abbott, 1925):

$$R = \left(1 - \frac{T}{C}\right) \times 100 \qquad \text{(Equation 2)}$$

Where:
R = percent reduction in treatment group compared to control group
T = abundance in treatment group
C = abundance in control group

For instance, if there are 20 individuals in the control group, and 15 in the treatment group, the calculation would return 25% reduction relative to the control. This approach may be used in field study reports to summarise the magnitude of effect of the test item treatment, alongside statistical significance testing. In extended laboratory non-target arthropod studies, control mortality may be corrected using the following equation (Abbott, 1925):

$$M = \frac{M_t - M_c}{100 - M_c} \times 100 \qquad \text{(Equation 3)}$$

Where:
Mt = % mortality of treatment group
Mc = % mortality of control group
M = corrected mortality (%)

Percent reduction and the related corrections for control mortality are, therefore, routinely used in certain areas of laboratory ecotoxicology. When applied to field studies, percent reduction measures may provide a useful aid to interpretation, but variability may make their use misleading. The effect of variability is illustrated in Table 6.8-1. Here, one replicate in the control group has a very high abundance, leading to an implied significant reduction in treatment group abundance when calculated according to Equation 2. Nonetheless, a treatment effect is not apparent when the raw data are examined; rather, there is a control abundance anomaly. Note that the percent reduction is lower when expressed on a log scale.

Table 6.8-1. Example abundance data and use of percent reduction when control group contains anomalous value, resulting in apparent reduction in treatment group but little real effect.

Treatment group   Replicate 1   Replicate 2   Replicate 3   Replicate 4   Sum abundance   Percent reduction
Control                    23            12             3           648             686   N/A
Test item                  22            34             3             8              67   90%
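The calculation behind Table 6.8-1 can be sketched as below, using Equation 2 and Equation 3 as defined above. The function names are hypothetical, and the comparison based on medians is an addition for illustration only (it is not a measure used in this report); it simply shows how strongly the result depends on the single anomalous control replicate.

```python
from statistics import median


def percent_reduction(treatment, control):
    """Equation 2: R = (1 - T/C) x 100."""
    return (1 - treatment / control) * 100


def corrected_mortality(m_treatment, m_control):
    """Equation 3: M = (Mt - Mc) / (100 - Mc) x 100."""
    return (m_treatment - m_control) / (100 - m_control) * 100


# Replicate abundances from Table 6.8-1.
control = [23, 12, 3, 648]
test_item = [22, 34, 3, 8]

# Percent reduction on summed abundance, as reported in the table.
r_sum = percent_reduction(sum(test_item), sum(control))

# ASSUMPTION: medians used only to illustrate the influence of the
# anomalous control replicate (648).
r_median = percent_reduction(median(test_item), median(control))

print(round(r_sum), round(r_median, 1))
```

The summed data give a reduction of about 90%, whereas the median-based figure is far smaller, which supports examining the raw replicate data before interpreting a percent reduction.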

Rarity

In field community studies, many taxa may be present at numbers which are too low to allow meaningful univariate analysis. An advantage of multivariate techniques is that all data may be included in an analysis, including that for scarce taxa. Such data may make a meaningful contribution to the interpretation of the data at the community level, for example if scarce taxa respond similarly the analysis may detect an overall trend which is not apparent at the individual level. Care must be taken, however, when interpreting such data at the species or taxon level. Essentially, when taxa are rare, the data contain little information. Basing study endpoints on such taxon level data may lead to inappropriate conclusions (i.e. it is always possible to calculate a percent reduction, but background abundance must be sufficient to make this meaningful). In Table 6.8-2, a percent reduction of 50% is calculated, but the abundance in control and treatment across all replicates is 6 and 3 individuals, respectively. There is not enough data to conclude that this taxon was impacted in the treatment group at this time-point, relative to the control.

Table 6.8-2. Example data and use of percent reduction when abundance is very low. The low abundance hinders interpretation of the percent reduction value. It is not clear if there is a real effect, since numbers in either group are very low.

Treatment group   Replicate 1   Replicate 2   Replicate 3   Replicate 4   Sum abundance   Percent reduction
Control                     3             0             1             2               6   N/A
Test item                   1             0             2             0               3   50%

Different researchers may specify different thresholds of abundance before univariate statistics and other measures are applied and this should be done prior to study commencement, for example >1 individual per plot in a replicated terrestrial field study design (de Jong et al, 2010). The example study summarised in de Jong et al (2010) includes Abbott values and level of significance of effect for a wide range of taxa and sampling methods. It is notable that many reductions of <50% compared to control were recorded as statistically significant at the 5% level, but that there were also reductions >70% (i.e. higher magnitude of percent reduction) which were not significant at this level (some were significant at the 10% level). This would indicate that percent reduction does not necessarily relate to statistical significance, which is influenced by sample size and distribution of the data in the sample (i.e. low abundance and high variability data may not be conducive to a robust statistical analysis). It is always possible to calculate a percent reduction on data which is not adequate for a statistical analysis.

6.9. Section summary

• Study design should stipulate conditions for analysing taxa and acknowledge Type I error rate (i.e. some significant results may be due to chance). This can be reduced by requiring two consecutive statistically significant events to conclude an effect (depending upon the α level set).
• Graph individual taxa abundance data over time. This can help with interpretation of univariate and multivariate analyses.
• Conduct univariate analyses only on those taxa which:
  a. Are present in sufficient numbers
  b. The researcher believes were relevant to the test system (i.e. part of the resident community)
  c. Are relevant to the regulatory question (ideally identified prior to study start)
• Data exploration should not be used to retrospectively alter the questions that a study sets out to test. Every step of the exploration should be reported, and any outlier removed should be justified and mentioned. Reasons for data transformations need to be justified based on the exploratory analysis.
• Consecutive significant effects may be considered a real effect and are less likely to be due to chance, whereas single significant effects are more likely to be due to chance. Both biological and statistical significance should be considered, however, along with the limitations of sampling design (e.g. frequency of sampling events).
• Percent reduction measures should be used with caution, especially where abundance is low and data are variable.

7 Multivariate approaches

7.1. Introduction to multivariate analyses As with the general statistical background and study design chapters, there is a large body of literature regarding multivariate analyses. Here, we refer to Leps & Šmilauer (2003), Legendre & Legendre (1998) and online resources (see URL 2 in References), as well as references specified in the text. Legendre & Legendre (1998) provide an in-depth account of many background aspects and Ter Braak (1994), Van Wijngaarden et al (1995) and Van den Brink et al (2003) provide useful background, with the latter two focussing on ecotoxicological studies. The aim of ecotoxicological field studies is to assess the direct and indirect effects of a chemical stressor on a biological community, including magnitude of effects and potential for recovery. It is possible to draw conclusions regarding population level effects by using univariate approaches, as outlined above. These allow for comparison of organism abundance in treatment groups to those in control groups in order to detect effect and recovery at the population level (or level of taxonomic resolution employed). Univariate analyses for key taxa may yield important trends in the data. Often, however, data on many taxa will be collected at abundance thresholds below those suitable for univariate analysis (Van den Brink & Ter Braak, 1998). That is, too few data are available from which to draw reliable conclusions (this issue is also discussed in Chapter 6). In addition, univariate approaches may not detect subtle but consistent responses across the community. Many authors have highlighted the need for community level interpretation of ecotoxicological field data (e.g. Clements & Kiffney, 1994; Maund, et al, 1999; Clarke, 1999; Sparks et al, 1999). Community data are complex. Many taxa with differing response patterns may be present, meaning that trends may be obscured. Simultaneous inspection of data for all taxa is often not meaningful. 
On the whole, community data are often sparse; that is, many taxa may be present at low densities - this is especially so for terrestrial systems. Each taxon, therefore, may contribute little to the overall community composition. Historically, despite investment of time in taxonomic identification and enumeration, such taxa may not have been included in the (univariate) analysis (PSD, 1997). Also, many factors may contribute to the composition of taxa present, but few may be key factors. In ecotoxicological field studies, the study design (see above – blocking, randomisation and replication) is used to enhance the detection of treatment and time related effects, and minimise others. Finally, another characteristic of community data, and less of a difficulty, is that levels of redundancy may be high – that is, several taxa may be closely associated and abundance of one may predict the abundance of another. Impacts may therefore be revealed by more than one taxon, and absence of one taxon may not prevent detection of the effect. On the other hand, responses between taxa at, for example, the Family level may vary (Brown & Miles, 2006), meaning that not all related taxa will respond similarly. The relevance of this to the appropriate protection goal should be stated in any associated risk assessment. In a dose-response laboratory study, a regression-type analysis approach is typically followed. The independent or predictor variable (dose or concentration) is manipulated and the response or dependent variable measured (e.g. survivorship or abundance of test organisms). Such an approach may also be applied in single species field assays, such as with potted macrophytes or caged invertebrates studied in outdoor mesocosm systems. More than one predictor variable may be introduced – levels of nutrients, timing of application, food or light may be manipulated in addition to the levels of the test item, for example (Van Wijngaarden et al, 2006). 
Assuming a linear response model, a multiple regression analysis could be employed to investigate the influence of the various environmental variables on the response variable (e.g. organism survivorship, abundance).

This study design includes only one response variable, however – the survivorship or abundance of the test organism. In community field studies, however, many taxa may be present, such that the data represent multiple environmental variables and multiple response variables. In ecotoxicological field studies, it is typically the number of response variables (species or other taxa) and the interactions between them which present the challenge. Community data may be considered multidimensional, with the number of dimensions equal to the number of response variables. Therefore, in a study for which abundance of 100 taxa are recorded, the response data would exist in 100 dimensions, since there are 100 responding variables. Community data are typically recorded in a matrix consisting of samples by species; that is, a measure of abundance of the various taxa for a given sampling technique in a given treatment group at a given time point. Abundance data for each individual taxon, of which there may be tens or hundreds, could be presented against time in a single plot, with a different series or line for abundance in each treatment group, but this would result in a confused output which would not aid interpretation at the community level. An example is shown in Figure 7.1-1. The figure shows that following application of the test item at time 0, there was a reduction in abundance of some (but not all) taxa. The figure also shows that many taxa are present in low numbers and that zero counts are common. The overall trend of the community sampled is not clear. Univariate analyses would not be possible for the taxa with low abundance, meaning analysis would be limited to those taxa with sufficient abundance. The data are from Van den Brink et al (1996) and Van den Brink & Ter Braak, (1999).

[Figure 7.1-1 plot. y-axis: ln-transformed abundance of organisms (0 to 14); x-axis: day after treatment (-4 to 24).]

Figure 7.1-1. Example of aquatic mesocosm ln-transformed combined macroinvertebrate and zooplankton abundance data for 180 taxa from a single replicate (44 µg/L chlorpyrifos). Application of test item at time 0. Series legend not shown due to size. Van den Brink et al (1996) and Van den Brink & Ter Braak, (1999).

Ordination techniques, however, through dimension reduction, allow for community level interpretation of ecotoxicological field data (Van Wijngaarden et al, 1995). Dimension reduction involves identification of those factors which explain the most variation in the data, typically through generation of axes along which the samples/taxa are arranged. Therefore, overall trends from an initially confusing community response can be detected. In addition, all taxa may contribute to the analyses, and not only those which reach a certain threshold of abundance.


7.2. Background to ordination techniques

In a naturalised test system such as an aquatic mesocosm facility or an arable field, various environmental gradients will exert an influence on the assemblage of organisms present. In regulatory studies, the interest is in the effects exerted by exposure to the test item and the interaction with time (i.e. magnitude of effect, time to effect and potential for recovery). As mentioned above, study design may be optimised to reduce the effects of other gradients, such as changes in soil type, shading of pond systems, or proximity to sources of recolonisation. In general, ordination techniques will order the data along gradients (axes) that explain the most variation in the dataset. Responses to gradients may be unimodal or linear. If unimodal, there will be an optimum point along the gradient at which the response, such as taxon abundance, is greatest. Either side of the optimum, the magnitude of response will be less. An example would be plant abundance along a soil moisture gradient: plants may be less abundant where soil moisture is very low or very high, and most abundant somewhere in between. If linear, however, then the response will increase or decrease with increasing gradient. This may be considered more appropriate to ecotoxicological data, where taxa are less likely to show an optimum response to a certain concentration of chemical stressor; more likely, the magnitude of response will increase with increased exposure to the stressor (until a total response is recorded). Ordination techniques may be divided into those which use a linear species response model and those which use a unimodal one. In general, the linear response model will be a better fit over short gradients for homogeneous data.
See, for example, Leps and Šmilauer (2002) for further discussion of selection of the appropriate model based on gradient length (the methods used routinely in community ecotoxicology are linear models) and van Wijngaarden et al (1995) for discussion in relation to ecotoxicology. In short, the length of the gradient of the environmental variables may be calculated using detrended correspondence analysis (DCA) – shorter (≤3 SD) gradients are better suited to linear models, and longer gradients (≥4 SD) to unimodal models. Dose responses may, in general, be better described by linear approaches, hence these may be used in ecotoxicology and when gradient lengths fall between the categories described above. Ordination techniques may be divided into two other distinct groups – constrained and unconstrained, also termed direct and indirect. Unconstrained/indirect methods will generate axes which explain the most variation in the study data, without reference to environmental variables, and the end user may identify certain ecological characteristics associated with the gradients. For instance, plant composition may be studied at different sites and the ordination reveal associations between samples and taxa which suggest that certain taxa are associated with certain samples and, based on ecological knowledge of the environmental requirements of those taxa, inferences may be made as to the gradient along which the samples are arranged (soil type or moisture, for instance). Unconstrained methods produce axes through the data based on the association of certain taxa and samples, arranging the axes in a way which explains the most variation in the dataset. In contrast, constrained ordination methods use pre-defined and measured environmental variables or predictor variables (in ecotoxicology, test item concentration or application rate and its interaction with time). 
The analysis is constrained to describing only the variation in the data that can be ascribed to the predictor variables. Outputs of either of the above ordination analyses may be presented in a biplot, with each of two axes representing the main axes of variation formed by the analysis, and the position of the species and sample markers representing affinity to each other and to the community response as indicated by the axes. It should be noted that ordination techniques do not necessarily involve statistical hypothesis testing and may be used as exploratory tools to inform further analyses. Indeed, some authors recommend this as a key use of ordination techniques (see e.g. Sparks et al, 1999). Permutation testing and other methods can, however, be employed to determine the significance of the overall community response in an axis (in comparison to a control) or at an individual time point (again, in comparison to a control treatment; discussed below in Chapter 7.7).

Van Wijngaarden et al (1995) and Sparks et al (1999) provided useful introductions to the use of multivariate analyses in ecotoxicological studies, including an introduction to the many methods available and the constraints and benefits associated with each. The potential benefits and drawbacks of multivariate analyses with regard to field ecotoxicology studies are well known. Clements & Kiffney (1994) suggested that community level data should be analysed at the community and ecosystem level. This was echoed by Maund et al (1999), who also suggested that univariate techniques may be limited by power and experimental variability and that, since the number of variables that can be analysed at any one time is limited, so is the potential for these methods to determine effects at the community level. Clarke (1999) suggested that population variability may be such that univariate approaches can be uninformative or misleading, hence the motivation for multivariate methods. Further, Clarke (1999) highlighted the ability of multivariate approaches to identify subtle but consistent differences. Sparks et al (1999) cautioned, however, that as multivariate methods become more common, misuse may also increase. The authors made three recommendations: 1, as with any data, the first step should be graphical exploration; 2, ordination should be seen as a method to focus further investigations, rather than an end in itself; 3, greater co-operation between those developing the techniques and the end user is required. These sentiments were repeated by Van den Brink et al (2003), who also called for greater co-operation between statisticians and ecotoxicologists, and for guidance on the use of multivariate methods in community ecotoxicology. Generally, univariate and multivariate methods are considered to be complementary.
Clarke (1999) proposed that species level data on meaningful space and time scales, combined with use of community data, would be desirable. Therefore, multivariate analysis methods are considered as part of a toolbox for analysis of community data, and not a replacement for objective visual inspection of data and univariate techniques.

7.3. Introduction to Principal Response Curves analysis

Principal Response Curves (PRC) analysis (Van den Brink & Ter Braak, 1999) has become a standard ordination technique for the analysis of regulatory community effects data in aquatic and, increasingly, terrestrial systems. PRC was developed from Redundancy Analysis, or RDA. In turn, RDA is the constrained form of Principal Components Analysis (PCA). Both are linear methods using similar models to those which underlie regression analysis. PCA and RDA are therefore briefly discussed as background before PRC is introduced. We refer to PRC analysis as conducted in the CANOCO for Windows computer program, version 5.02 (Ter Braak & Šmilauer, 2012), but PRC may be conducted in other packages such as SAS (SAS Institute Inc., 100 SAS Campus Drive, Cary, NC 27513-2414, USA) and R (http://www.r-project.org/). Interpretation is unaffected. In PCA, as the method is unconstrained, measured explanatory variables are not included in the analysis. Rather, the analysis uses only the species data and environmental variables are inferred indirectly (hence this method is also termed an indirect analysis). The multidimensional data are rotated about the centroid and reflected onto new axes. The axes are created from a combination of latent (unmeasured) variables. The first axis describes as much of the variance in the data as possible as it passes through the greatest dimension of the data and is termed the first component. The second component is orthogonal to the first and describes as much of the remaining variance in the data as possible. The sum of all axes accounts for the sum of the variance in the data, and often most of the variability can be described in only a few components. Dimensionality is, therefore, effectively reduced. In addition, sample scores and species weights are calculated. The sample scores are the
values of the latent variables for each of the samples (i.e. the co-ordinate along the derived axis of each sample location) and the species weights are the regression coefficients of the linear model for each species. An example of the use of PCA might be the exploration of botanical species composition in samples from a meadow. The analysis would show the association between species and samples and would require the user to apply their own ecological knowledge in the interpretation of the distribution – for example, properties of sample sites which may affect species composition. No measurement of properties of the samples other than taxonomic composition would be included in the analysis. Ter Braak (1994) provides further background on scaling for species or sites in PCA and interpretation of variance at the species level. In contrast to PCA, RDA is constrained. Therefore, measured explanatory variables are used as inputs and variation in the community data is directly related by the analysis to the measured variables (hence this analysis type is also termed direct). The components or axes are linear combinations of the measured environmental variables. In community ecotoxicology, these are principally the various test item treatments and their interaction with time. Variance is partitioned between explained variance (variance explained by the explanatory variables, such as test item treatment and its interaction with time) and unexplained variance (residual variance, i.e. due to variation between replicates). RDA is concerned with explained variance only. In formal terms, RDA may be considered to be a PCA in which the sample scores are constrained to be linear combinations of the explanatory variables (e.g. treatment groups and their interaction with time). The resulting ordination diagram output is, therefore, limited to the variation in the data that is due to the explanatory variables. 
RDA focuses on the variance that can be attributed to time, treatment and the interaction between these two variables. This fits well with the aims of community ecotoxicology field studies, where other sources of variation are controlled as far as possible. The benefit of RDA over PCA, therefore, is that the variance attributable to treatment only may be determined. This is important in community ecotoxicology studies, where significant natural variation (i.e. non-treatment related, between replicate, variation) may be present. The graphical biplot output of RDA, however, may be difficult to interpret. An example RDA output is shown in Figure 7.3-1 (after Van den Brink et al, 1996).


Figure 7.3-1. Example RDA biplot output. Samples and taxa are shown, along with community trajectories (labelled). After Van den Brink et al (1996).
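The distinction drawn above between PCA (unconstrained) and RDA (constrained) can be illustrated with a minimal numerical sketch. It is not the CANOCO implementation: the function names `pca` and `rda`, the simulated data and the single treatment indicator are all assumptions made for illustration, and no particular biplot scaling is applied.

```python
import numpy as np

def pca(Y):
    """Unconstrained ordination: axes are derived from the taxa data alone."""
    Yc = Y - Y.mean(axis=0)                      # centre each taxon
    U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
    sample_scores = U * s                        # co-ordinates of samples on each axis
    species_weights = Vt.T                       # loading of each taxon on each axis
    variance_fraction = s**2 / np.sum(s**2)      # share of total variance per axis
    return sample_scores, species_weights, variance_fraction

def rda(Y, X):
    """Constrained ordination: a PCA of the part of Y fitted by X."""
    Yc = Y - Y.mean(axis=0)
    Xc = X - X.mean(axis=0)
    coef, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)   # multivariate linear regression
    fitted = Xc @ coef                               # explained part of the data
    U, s, Vt = np.linalg.svd(fitted, full_matrices=False)
    explained = np.sum(fitted**2) / np.sum(Yc**2)    # explained / total variance
    return U * s, Vt.T, explained

# Simulated example (assumed data): 6 control and 6 treated samples, 5 taxa.
rng = np.random.default_rng(1)
treated = np.repeat([0.0, 1.0], 6)[:, None]
effect = np.array([-2.0, -1.0, 0.0, 0.0, 0.5])       # treatment effect per taxon
Y = rng.normal(0.0, 0.5, size=(12, 5)) + treated * effect

pca_scores, pca_weights, pca_var = pca(Y)
rda_scores, rda_weights, rda_explained = rda(Y, treated)
print("PCA variance fraction per axis:", np.round(pca_var, 3))
print("Fraction of variance explained by treatment (RDA):", round(rda_explained, 3))
```

In this sketch the PCA axes describe all variation, including between-replicate noise, whereas the RDA output is restricted to the variation that the treatment indicator can explain.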

Changes in community composition, or treatment related effects, over time are often, however, unclear to non-specialists. PRC overcomes these problems by producing an intuitive graphical output with time along the x axis and the coefficient of the community response pattern on the y axis. A line displaying the response pattern for the community in each treatment group is produced and the control treatment community trajectory is set to 0 at each time point. Therefore, the response of each treatment group for each time point relative to the control is clear. The plotted lines for each treatment group are termed the response curves. The curves are often plotted in an inverse manner, such that a response of higher magnitude (i.e. reduction in abundance) results in a line which is further below the control line. This is intuitive, since the likely effect of a chemical stressor is a reduction in abundance relative to the control. Accompanying the curves of the treatment group coefficients are species scores which show the affinity of each taxon included in the analysis to the response shown. The graphical output may then be divided into two components – the curves of the community response in each treatment group, and the affinity of each taxon present in the study to the treatment group responses. An example from Van den Brink et al (1996) and Van den Brink & Ter Braak, (1999) is shown below (Figure 7.3-2). This shows the community response for a multiple rate aquatic mesocosm study with one test item.


[Figure 7.3-2 plot: PRC diagram overview. Left panel: coefficient (Cdt) of the community response in treatment group d at time t on the y axis against week post application on the x axis, with one Principal Response Curve (overall community response/trajectory) per treatment group (Control, 0.1 µg/L, 0.9 µg/L, 6 µg/L, 44 µg/L) and the control community trajectory set to zero. Right panel: species scores (bk) on a separate axis, showing the affinity of each taxon to the community response. The response of each taxon in a treatment group at a given time point is Tdtk = Cdt × bk.]

Figure 7.3-2. Example Principal Response Curve plot overview. The diagram is split into two main components – the PRC and species scores. The PRC shows the trajectory of the community response (expressed as coefficient of community response, Cdt) in each treatment group on the y axis, with control group response set to 0. Time is expressed on the x axis. The species scores (bk) are the affinity of the response of each taxon to the overall community response to the test item treatment (Van den Brink et al, 1996 and Van den Brink & Ter Braak, 1999).

Chapter 7.4 discusses in more detail the workings of the PRC method and guidance for interpretation.

7.4. PRC mechanics and terminology

Since community responses to chemical stress can be very complex, it is worth exploring exactly what the PRC method does and how the response curves and species scores are calculated. We will then look at interpretation of the response curves and species scores in more detail. PRC focuses on the differences in sample compositions between treatment groups and a control group at each time point (see e.g. Van den Brink & Ter Braak, 1998 for comparison of abundance plots on a logarithmic scale and relative to control levels). This is achieved through modelling the abundance of each species as a sum of three components: its mean abundance in the control, a time specific treatment effect and an error.

Hence, for species k, Tdtk is the effect of treatment d at time t. This overall response pattern for each taxon is modelled as a multiple (bk) of a basic response pattern for that treatment group (Cdt) i.e. Tdtk = bk × Cdt. The statistical model for the first PRC (first axis of the ordination or first component) therefore becomes:

$$y_{d(j)tk} = \bar{y}_{0tk} + b_k C_{dt} + \varepsilon_{d(j)tk}$$

Equation 4

Where:

$y_{d(j)tk}$ = abundance of taxon k in replicate (mesocosm, plot) j of treatment d at time t

$\bar{y}_{0tk}$ = mean control abundance of taxon k at time t (control d = 0, by definition)

$b_k$ = species score for taxon k (multiplier of community response)

$C_{dt}$ = basic response pattern for treatment d at time t (coefficient of community response, slope of regression)

$\varepsilon_{d(j)tk}$ = error term for treatment d in replicate j at time t for taxon k

The canonical coefficients are found via partial redundancy analysis (RDA). Hence the abundance of a taxon in a treatment group replicate at a certain time point is the equivalent mean control abundance plus the taxon multiple of the basic response for that time and treatment, plus error.

Note Cdt (the basic response pattern) is different for each treatment group and time point (see Figure 7.3-2). The plotted Cdt values against time give the principal response curves for each treatment. The indicated response of the individual taxa can be plotted as a multiple (bk) of the response (Cdt) since Tdtk = bk × Cdt.

It is important to note that the species score, bk, is not a measure of sensitivity to the stressor, but is a measure of the affinity of that taxon to the fitted community response displayed in the PRC diagram.
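The model above can be sketched numerically as a partial RDA in which sampling time enters as covariables and the treatment by time indicators (control omitted) enter as explanatory variables. The following is a minimal illustration only, not the CANOCO procedure: the helper `prc_first_axis`, the simulated noise-free data and the unit-length scaling of the species scores are all assumptions, and CANOCO applies its own scaling to Cdt and bk.

```python
import numpy as np

def prc_first_axis(Y, treatment, time, control=0):
    """Sketch of the first PRC axis via partial RDA (hypothetical helper).

    Y is samples x taxa (ln-transformed abundance). Returns the (d, t) labels,
    the coefficients C_dt and the species scores b_k, up to an arbitrary scaling.
    """
    treatment, time = np.asarray(treatment), np.asarray(time)
    times = np.unique(time)
    doses = [d for d in np.unique(treatment) if d != control]
    # Covariables: one indicator column per sampling time.
    Z = (time[:, None] == times[None, :]).astype(float)
    # Explanatory variables: one indicator per treatment x time, control omitted,
    # so every coefficient is expressed relative to the control at that time.
    pairs = [(d, t) for d in doses for t in times]
    X = np.column_stack(
        [((treatment == d) & (time == t)).astype(float) for d, t in pairs]
    )

    def partial_out(M):
        # Remove the part of M explained by the time indicators.
        return M - Z @ np.linalg.lstsq(Z, M, rcond=None)[0]

    Xr, Yr = partial_out(X), partial_out(Y)
    B = np.linalg.lstsq(Xr, Yr, rcond=None)[0]   # estimated T_dtk, one row per (d, t)
    _, _, Vt = np.linalg.svd(Xr @ B, full_matrices=False)
    b = Vt[0]          # species scores b_k (unit length in this sketch)
    C = B @ b          # canonical coefficients C_dt
    return pairs, C, b

# Simulated, noise-free data (assumed): 3 groups x 4 times x 3 replicates, 4 taxa.
rng = np.random.default_rng(0)
b_true = np.array([2.0, 1.0, 0.0, -0.5])
C_true = np.array([[0.0, -0.2, -0.1, 0.0],      # lower rate
                   [0.0, -1.5, -1.0, -0.3]])    # higher rate
control_mean = rng.normal(3.0, 1.0, size=(4, 4))  # time x taxon
rows, trt, tim = [], [], []
for d in (0, 1, 2):
    for t in range(4):
        for _ in range(3):
            effect = C_true[d - 1, t] * b_true if d > 0 else 0.0
            rows.append(control_mean[t] + effect)
            trt.append(d)
            tim.append(t)
Y = np.array(rows)

pairs, C, b = prc_first_axis(Y, trt, tim)
fitted_T = np.outer(C, b)   # b_k x C_dt for every (d, t) and taxon
```

The sign and scale of `C` and `b` taken individually are arbitrary in this sketch; only their product, the fitted Tdtk, is determined by the data. This is consistent with the point made above that bk expresses affinity to the fitted community response and is not in itself a measure of sensitivity.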

7.5. Interpretation of first PRC axis

Figure 7.5-1 shows the same PRC diagram as Figure 7.3-2 with further annotation regarding interpretation. As the control community trajectory is set to zero at all time points, and the analysis focuses on treatment related effects, departure of a treatment group community response from the control may be considered treatment related. Of course, in naturalised biological systems such as aquatic mesocosms or terrestrial field plots, considerable variation may be expected and careful interpretation of the response curves is required. The interpretation must be conducted within the limitations imposed by the replication of the treatments and the ability of the test system to detect effects.


[Figure 7.5-1 plot: PRC diagram interpretation. Same axes, curves (Control, 0.1 µg/L, 0.9 µg/L, 6 µg/L, 44 µg/L) and species scores as Figure 7.3-2, with the following annotations. All treatment group responses are read relative to control. There is one Principal Response Curve (overall community response/trajectory) per treatment group. Higher magnitude of community response = higher negative coefficient (Cdt). Higher rate treatment groups show effect followed by recovery over time. Response (Tdtk) of taxon Caenis horaria (k) in treatment group 44 µg/L (d) at time point 4 weeks (t): 5.5 (bk) × -1.53 (Cdt) = -8.4. The species score of 5.5 for Caenis horaria suggests it is responding very strongly.]

Figure 7.5-1. Example Principal Response Curve plot, further interpretation (Van den Brink et al, 1996 and Van den Brink & Ter Braak, 1999).

The magnitude of the community response is shown on the y axis and, in general, increases with exposure level. The trajectory of the treatment group community response may indicate an effect (reduction relative to control) followed by, in this example, recovery (return to similar levels as control). The pattern of response (Cdt) may differ markedly between treatment groups. Lower application rate treatment groups may display no overall effect, intermediate application rate groups may show effect followed by recovery and highest application rate treatment groups may show effect and no recovery (within the duration of the study). The effects of differing responses by taxa on the PRC are explored in Chapter 8.

The species score (bk) is an indication of the affinity of each taxon to the overall community response Cdt. A high species score indicates a high magnitude of response. Taxa with a high species score are responding in accordance with the overall response pattern. The fitted change, on a log scale, for a taxon k in treatment d at time t is shown by bk × Cdt (e.g. Van den Brink et al, 2003), equivalent to the response pattern Tdtk (since Tdtk = bk × Cdt) for that taxon. In terms of abundance, exp(bk × Cdt) gives the relative abundance compared to the controls. For instance, for taxon Caenis horaria (Figure 7.5-1) with a bk score of 5.5, the response pattern and fitted change relative to control (on a log scale) at time 4 weeks post application in treatment group 44 µg/L becomes 5.5 × -1.53 = -8.4. In terms of abundance, we can calculate exp(5.5 × -1.53) = 0.0002, meaning this taxon would be at 0.02% of control abundance at that time point (i.e. severely impacted).
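The arithmetic for Caenis horaria can be reproduced with a few lines of code. This is a minimal sketch: the function names are hypothetical, and the values 5.5 (bk) and -1.53 (Cdt) are those quoted above from Figure 7.5-1.

```python
import math

def fitted_log_change(b_k, c_dt):
    """Fitted change relative to control on the ln scale: T_dtk = b_k * C_dt."""
    return b_k * c_dt

def relative_abundance(b_k, c_dt):
    """Fitted abundance as a fraction of control abundance: exp(b_k * C_dt)."""
    return math.exp(fitted_log_change(b_k, c_dt))

# Caenis horaria, 44 µg/L, 4 weeks post application (values from the text).
log_change = fitted_log_change(5.5, -1.53)
rel = relative_abundance(5.5, -1.53)
print(f"fitted change on ln scale: {log_change:.1f}")
print(f"abundance relative to control: {100 * rel:.2f}%")
```

The same two functions apply to any taxon and any treatment by time combination once bk and Cdt have been read from the PRC output.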

It can, therefore, be seen how the magnitude of the overall community response Cdt and the affinity of a taxon for that response reveal the response for a taxon at each time point. Of course, taxa may also display a species score which is either near zero or negative. Those taxa with a high negative species score may be considered to be responding in the opposite fashion to the overall response pattern. This may be an indirect effect such as competitive or predatory release. Reduced zooplankton abundance may lead to an increase in phytoplankton, for example. Taxa with a low (close to zero) species score, however, may not be responding at all or may be responding in a way that does not fit the community response (or a close inverse of the community response). Often, taxa with low species scores are not reported in PRC analyses. Low species scores may occur because taxa are unaffected by the treatment and, since PRC is designed to capture treatment related variation, these taxa have little weight in the analysis. There are other reasons, however, why low species scores may be recorded. Given that the species score is a metric of the affinity of the response of that taxon to the overall community response, a taxon may respond to the treatment, but in a quite different fashion (shape or timing of response). This may lead to a low agreement between the response of the taxon in question and the wider community. An example is discussed in Van den Brink & Ter Braak (1998), where the aquatic crustacean Gammarus pulex showed a relatively low species score despite being known as a taxon susceptible to the particular test item. The taxon specific response pattern led to a low species score. Examination of species abundance plots and univariate analyses for taxa present in sufficient numbers would reveal such effects. Alternatively, if abundance of a taxon is low and accompanied by considerable variability, the analysis may not be able to detect a pattern to the response, even though the taxon may be sensitive. This, however, is true of any analysis. Study design may also exert an influence on species score and relates to the need for a focused study aim. For example, in an aquatic mesocosm system, investing sampling effort in taxa which are present in high numbers yet known to be unaffected by the test item would result in a low species score for that taxon.
In terrestrial systems, taxa which are not faithful to the study plots (perhaps where plots are small and/or taxa are relatively very mobile) may yield a low species score. Care must be taken when dose response studies are conducted in very small plots and mobile taxa are present, since individuals from control plots may easily invade treatment plots and indicate a higher rate of recovery than would be expected under full field conditions.

7.6. Construction and interpretation of the second PRC axis

The first PRC (canonical axis) accounts for the most treatment related variation possible and reflects the overall community response pattern, but there may be other responses in the data which are not adequately expressed on the first axis (Van den Brink & Ter Braak, 1998). Of course, in community ecotoxicology studies where many taxa are sampled, there may be a high number of individual response patterns. The PRC analysis aims to identify overall trends, meaning that some response patterns may not be captured in the first PRC axis. The second PRC axis may display these effects. If the second PRC also shows a significant part of the treatment variance, this may indicate that within the community there is more than one response pattern. The second PRC represents the next highest amount of variance after the first PRC; it is the second axis of the ordination. Through examining the second and additional PRC axes (if significant; see Chapter 7.8), the researcher can assess in more detail the responses of all taxa to the different treatments. The additional response curves may capture aspects of the community response that are not fully accounted for by the first axis. The first PRC axis and its associated species scores are adequate when there is only one significant community response. Additional axes may play a role in interpretation, however, when taxa respond differently to the treatments (e.g. some recover and some do not), which leads to more than one significant community response following exposure to a chemical stressor.


For instance, there may be some taxa which respond to all treatment application rates and some that respond to only the highest of several treatment application rates. Examination of the first and second PRC axes together may reveal to which treatments the taxa are responding. The species score shown for the first PRC axis will reflect the overall response of that taxon across all treatment groups, whereas greater resolution is obtained through examining the second PRC axis. It may, therefore, be possible to determine which treatments affect which taxa. The canonical (regression) coefficients of the second, and further, axes are provided in the analysis output (when using CANOCO), allowing easy construction of the second PRC axis. Significance testing of the second PRC axis is outlined below (Chapter 7.8). The statistical model for two PRC axes is:

$$y_{d(j)tk} = \bar{y}_{0tk} + b_{k1} C_{dt1} + b_{k2} C_{dt2} + \varepsilon_{d(j)tk}$$

Equation 5

Where:

$b_{k1}$ = species score for taxon k (multiplier of community response) in PRC axis 1

$C_{dt1}$ = basic response pattern for treatment d at time t (coefficient of community response, slope of regression) in PRC axis 1

$b_{k2}$ = species score for taxon k (multiplier of community response) in PRC axis 2

$C_{dt2}$ = basic response pattern for treatment d at time t (coefficient of community response, slope of regression) in PRC axis 2

$\varepsilon_{d(j)tk}$ = error term for treatment d in replicate j at time t for taxon k

It can be seen, therefore, that during interpretation of responses at the taxon level, the responses of both PRC axis 1 and axis 2 should be added (where PRC axis 2 is significant). Detailed examples of examining the second PRC are given in Van den Brink & Ter Braak (1998), Van den Brink et al (2003) and Maccherini et al (2007). A hypothetical example analysis is also provided here (Figures 7.6-1 and 7.6-2). The example is designed to illustrate how to interpret the responses of two taxa which respond differently to two different concentrations of a test item. In this hypothetical example, the data set has many species, but our interpretation of the results is restricted to two taxa and a control and two treatment groups with ten sampling events. Both taxa show identical abundance in the control group. Taxon A is unaffected in treatment group T1 but reduced in abundance in treatment group T2. Taxon B is equally affected by both treatments. The difference, therefore, is that the second taxon is affected by both treatments, whereas the first is affected by one treatment only. Therefore, in the community response there are taxa which are affected only at the higher application rate as well as taxa which are sensitive to all application rates. This is, of course, a very simplified situation, but serves for illustrative purposes. The resulting PRC axis 1 and PRC axis 2 are shown in Figures 7.6-1 and 7.6-2, respectively, with accompanying species scores.


[Figure 7.6-1 plot: coefficient of the community response (Cdt) on the y axis against sample point (0 to 10) on the x axis for Con, T1 and T2, with species scores (bk) for taxa A and B on a separate axis.]

Figure 7.6-1. PRC axis 1 and species scores for example data. Con = control; T1 = treatment group 1, lower test item application rate; T2 = treatment group 2, higher test item application rate. Taxon A is unaffected in lower test item application rate group T1 but reduced in abundance in higher test item application rate group T2. Taxon B is affected by both treatments. Both taxa have a high affinity to the overall community response indicating a larger effect in the higher treatment rate.

Taken together, the first and second PRC diagrams and species scores allow the user to determine the nature of the different responses within the community to the different treatment groups. Both taxa show an affinity to the overall community response of elevated effect in the higher treatment rate, as shown in PRC axis 1 (Figure 7.6-1). The second PRC diagram reveals that taxon B is equally affected by both treatments: it also has an affinity to the community response in the second PRC, which is an abundance reduction in the lower treatment rate. Taxon A shows a negative affinity to PRC axis 2, suggesting that it has an opposite response pattern to the community trajectory shown in PRC axis 2 for the lower treatment rate, and in fact it is unaffected by this treatment. Examination of the first PRC axis only would not have revealed this pattern, although the individual taxa abundance data and univariate approaches would have shown this. We may now further investigate the fitted responses of each taxon in each of the PRC diagrams, with reference to the abundance data. In PRC axis 1, the fitted response of taxon A in treatment group 2 at time point 2 (bk × Cdt) becomes 1.04 × -1.9 = -1.98; for taxon B this is 0.96 × -1.9 = -1.82. The abundance in this treatment group and time point relative to the control group is exp(bk × Cdt), or 14% and 16% of control abundance in taxon A and B, respectively.
The raw data for each taxon are 15 individuals compared to 100 in the control group, since both taxa are affected by this treatment. There is good agreement, therefore, between the actual abundance data, and the calculated reduction relative to control, for the responses of both taxa shown in the first PRC axis for treatment group 2, in which abundance of both taxa was affected.


[Figure 7.6-2 plot: coefficient of the community response (Cdt) on the y axis against sample point (0 to 11) on the x axis for Con, T1 and T2, with species scores (bk) for taxa A and B on a separate axis.]

Figure 7.6-2. PRC axis 2 and species scores for example data. Con = control; T1 = treatment group 1, lower test item application rate; T2 = treatment group 2, higher test item application rate. Taxon A is unaffected in lower treatment rate T1 but reduced in abundance in higher treatment rate T2. Taxon B is affected by both treatment rates. Taxon B has a high affinity to the community response shown in the second PRC axis, suggesting an effect on abundance in the lower treatment rate (taxon B is affected by both treatment rates). Taxon A has a negative affinity to the response in PRC axis 2, as it was not affected in the lower treatment rate. This PRC therefore discriminates between taxa which are affected by both treatment rates and those which are only affected by the higher rate.

In PRC axis 2, we can investigate the responses of each taxon to treatment group T1, which is the treatment group which discriminates between the two taxa (taxon A is only affected by treatment group T2, and taxon B is affected by both treatment groups T1 and T2). Starting with taxon B, we can calculate the fitted response, which is a sum of the response for the significant axes. Therefore, for taxon B at time point 2 in treatment group T1, the fitted response is 0.96 × -0.89 (bk × Cdt for PRC 1) + 1.04 × -0.97 (bk × Cdt for PRC 2) = -1.86. The relative abundance is then exp(-1.86) = 15.5% of control abundance, which again reflects the raw data (15 individuals in the treatment group T1 at time point 2 in taxon B compared to 100 in control group), demonstrating that taxon B was numerically affected by both treatment groups 1 and 2. For taxon A, the fitted response is 1.04 × -0.89 (bk × Cdt for PRC 1) + -0.96 × -0.97 (bk × Cdt for PRC 2) = 0.0056.
The relative abundance is then exp(0.0056) = 100.6% of control abundance, which reflects the raw data (100 individuals in treatment group T1 at time point 2 in taxon A, the same as in the control group at that time point), demonstrating that taxon A was numerically unaffected in treatment group 1. This procedure therefore allows the user to determine to which test item treatment rate or concentration certain taxa are responding, revealing the more sensitive taxa and those responding only to higher exposure levels.

Van den Brink & Ter Braak (1998) provided an example of interpretation of both PRC axis 1 and PRC axis 2, and of combining the curves to derive fitted abundance for the rotifer Keratella. The study showed the effect of two treatments (the first PRC consisting of nutrients only, and the second PRC of pesticide plus nutrients) on aquatic communities in comparison to a control group. Taken together, the two PRC diagrams and their species scores revealed the response of this species to the differing treatments and its affinity to the wider community response. Van den Brink et al (2003) provided an example from a fungicide indoor microcosm study, in which the interpretation of both PRC axis 1 and PRC axis 2 was shown. Finally, Maccherini et al (2007) provided an example of PRC analysis for an ecological management study, in which the interpretation of the first two PRC axes is described in detail.

A further technique to visualise the overall response of a taxon to multiple treatments is to plot the species scores for the first PRC axis against those for the second PRC axis. This shows how taxa may respond differently to different treatments. An example for the simple hypothetical data shown previously in Figures 7.6-1 and 7.6-2 is shown below in Figure 7.6-3. The main biplot on the left shows the species scores for the two taxa, A and B, plotted against axes which represent PRC 1 (horizontal) and PRC 2 (vertical). Taxon B, affected by both treatment groups, shows a positive species score for both axes (as per the bk values in Figures 7.6-1 and 7.6-2, above). Taxon A, affected by treatment group T2 only, shows a positive score for the first PRC axis and a negative score for the second axis (again, as per the bk values in Figures 7.6-1 and 7.6-2, above). The position on the biplot, therefore, indicates whether a taxon responds to all treatment groups (e.g. taxon B) or only to the higher treatment groups (e.g. taxon A), and in so doing also illustrates the differing community responses to the chemical stressor. In addition to the biplot, fitted response curves for taxa A (lower right) and B (upper right) are also shown (Figure 7.6-3).
These represent the response pattern, Tdtk, for each taxon summed from both PRC 1 and PRC 2, as described above. The fitted response curves demonstrate the differing responses of the taxa to the two treatment groups. Taxon B (upper right) was affected in both treatment groups, and the fitted response curves show similar reductions in abundance of this taxon, relative to the control, for both treatment groups 1 and 2. The fitted response curves for taxon A (lower right) show a reduction in abundance for treatment group 2 only, with abundance in treatment group 1 following that of the control group. This displays, in a format which is easy to compare, the fitted abundances calculated above and the differing responses of taxa which are sensitive to differing levels of exposure to the chemical stressor.
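The fitted abundance arithmetic worked through above can be reproduced with a short sketch. It assumes, as in the worked example, that the fitted log response of a taxon is the sum of bk × Cdt over the PRC axes considered, and that abundances were ln-transformed so that exp() of the sum gives abundance relative to the control. The function name and data layout are illustrative only and do not correspond to any CANOCO output format.

```python
import math

def fitted_relative_abundance(species_scores, response_coefficients):
    """Return (fitted log response, % of control abundance) for one taxon.

    species_scores: bk for the taxon on each PRC axis.
    response_coefficients: Cdt for one treatment and time point on the
    same PRC axes, in the same order.
    """
    log_response = sum(b * c for b, c in zip(species_scores, response_coefficients))
    return log_response, 100.0 * math.exp(log_response)

# Values from the worked example: treatment group T1, time point 2.
cdt_t1 = (-0.89, -0.97)   # Cdt on PRC 1 and PRC 2
taxon_b = (0.96, 1.04)    # bk for taxon B on PRC 1 and PRC 2
taxon_a = (1.04, -0.96)   # bk for taxon A on PRC 1 and PRC 2

for name, scores in (("B", taxon_b), ("A", taxon_a)):
    log_response, percent = fitted_relative_abundance(scores, cdt_t1)
    print(f"Taxon {name}: fitted response {log_response:.4f}, {percent:.1f}% of control")
```

With the values from the text, this gives approximately 15.5% of control abundance for taxon B and 100.6% for taxon A, in line with the raw data of 15 and 100 individuals against 100 in the control.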

[Figure 7.6-3 plot area: biplot of species scores bk for taxon A and taxon B on PRC 1 (horizontal axis) and PRC 2 (vertical axis), with two fitted response curve panels plotted against sample point (0 to 10) for Con, Treatment 1 and Treatment 2.]

Figure 7.6-3. Plot of species scores bk for taxon A and taxon B from hypothetical example data expressed in two PRC axes (Figures 7.6-1 and 7.6-2). Taxon B is affected by both treatments and displays a positive affinity to the community responses in PRC axis 1 and PRC axis 2, which show the effects of the two treatment rates. Taxon A is affected only by the higher treatment rate, which is the dominant community response in PRC axis 1, and therefore has a positive affinity for PRC axis 1 only. Also shown are fitted response curves for taxon A (lower right) and taxon B (upper right). The fitted response curves illustrate the differing responses of the two taxa to treatment groups 1 and 2. Taxon B, affected by both treatment groups, shows a similar reduction in abundance relative to control in the fitted curves of both treatment groups. Taxon A shows a reduction in abundance relative to control in the fitted curve of the higher rate treatment group 2 only; the response to treatment group 1 matches that in the control group.

This approach allows the affinity of all key taxa to two community responses to be visualised in one diagram. An example with real microcosm data is shown in Figure 7.6-4 from Van den Brink et al (2003). A number of taxa are plotted. The four fitted response curves correspond to the taxa that have equal weights on each axis, regardless of the sign (+/-). In this example there are four fitted response curves as there are many more taxa and a greater diversity of responses: there are taxa which respond negatively to both the community responses shown in PRC 1 and PRC 2 (bottom left) and there are others which respond negatively in PRC 1 and positively in PRC 2 (upper left).


Figure 7.6-4. Plot of species scores bk for various taxa from fungicide microcosm experiments (from Van den Brink et al 2003). The central axes represent the first and second PRC axes. Each taxon is plotted here by its species score for both PRC axes. This allows a rapid interpretation of the affinity of each taxon to the different community responses depicted in PRC axes 1 and 2. The four response curves in each corner of the plot correspond to the taxa that have equal weights on each axis, regardless of the sign (+/-).


7.7. Other PRC outputs, significance testing, explained variance

As discussed above, ordination techniques differ from hypothesis testing techniques in that the aim of the method is not necessarily to determine if two or more samples are significantly different. It is, however, possible to test the significance of the treatment effects on the overall community and for each treatment at each time point, and to determine the proportion of variance attributable to treatment, time and between-replicate differences. The amount of treatment related variance explained by each axis (each PRC axis) may also be determined. Therefore, a comprehensive assessment of treatment related effects at the community level is possible. These aspects are discussed in this chapter.

To determine the significance of the overall treatment regime on the community present in the test system, PRC analysis may be accompanied by non-parametric permutation testing. Essentially, permutation testing involves randomly reassigning the sample data amongst the treatment groups many times over to generate a distribution of possible results to which the original data are compared. The distribution is determined by the observed data. This is in contrast to a parametric statistical hypothesis test, where the distribution of critical or reference values is pre-determined (e.g. as in a statistical table).

In PRC analysis, permutation testing may be achieved via Monte Carlo permutation analysis. For this analysis, the null hypothesis would be that Tdtk = 0 for all time points. Under this null hypothesis, the study data are interchangeable between the treatment groups, since the null hypothesis states that there is no difference between control and treatment groups. Therefore, the data can be randomly reassigned and resampled to provide a reference distribution, against which the test statistic for the original observed data is ranked.
If the test statistic from the actual study data does not conform to the re-sampled test statistics, then a treatment related effect is concluded (e.g. van Wijngaarden et al, 1995). Specifically, if the observed test statistic ranks among the highest 5% of the resampled distribution, the null hypothesis is rejected at α = 5% (or another pre-specified level); that is, a statistic this extreme would arise in fewer than 5% of permutations if treatment had no effect (e.g. Verdonschot & Ter Braak, 1994). Not all possible permutations are necessarily calculated, but typically 999 or 9999 are requested. Statistical power increases with increasing numbers of permutations (Ter Braak & Šmilauer 2002). One advantage of the Monte Carlo method is that the analysis is distribution free, that is, not constrained by the requirements of parametric methods (such as normality of response in observed data) (Manly, 1990; Verdonschot & Ter Braak, 1994).

In PRC analyses using CANOCO, we specify a split plot study design which accounts for the dependency of the samples obtained in the same replicate at the various sampling occasions of the field study. In the permutation test, we analyse whole plots (treatment groups) without permutation at the split plot (sampling event) level. This means that whole time-series of, for example, mesocosms, rather than a combination of mesocosms and sampling dates, are re-sampled. This results in fewer potential permutations than when the data are permuted completely at random (i.e. where data are randomly assigned at the sampling date and treatment group level) (Ter Braak & Šmilauer 2002).

Initially, the entire study data are used for this procedure. This initial investigation will test the effects of the overall treatment regime on the community response as recorded from all treatment groups (i.e. does the treatment, as represented by all treatment groups, have an effect on the community as a whole?).
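The whole-plot permutation idea described above can be illustrated with a minimal sketch. It is not the CANOCO procedure: the test statistic used here (the sum over time points of squared differences between treatment-group and control means, for a single response variable) is an assumption chosen for simplicity, and the function names and data layout are hypothetical. What the sketch does show is that treatment labels are reassigned to whole time series, so the samples from one replicate always stay together.

```python
import random

def community_statistic(plot_series, labels, control="Con"):
    """Illustrative statistic: sum over time points of squared differences
    between each treatment-group mean and the control mean."""
    n_times = len(plot_series[0])
    total = 0.0
    for t in range(n_times):
        by_group = {}
        for series, label in zip(plot_series, labels):
            by_group.setdefault(label, []).append(series[t])
        control_mean = sum(by_group[control]) / len(by_group[control])
        for label, values in by_group.items():
            if label != control:
                total += (sum(values) / len(values) - control_mean) ** 2
    return total

def whole_plot_permutation_test(plot_series, labels, n_perm=999, seed=1):
    """Permute treatment labels among whole plots (each plot keeps its full
    time series) and rank the observed statistic in the permuted results."""
    rng = random.Random(seed)
    observed = community_statistic(plot_series, labels)
    shuffled = list(labels)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if community_statistic(plot_series, shuffled) >= observed:
            at_least_as_extreme += 1
    # The observed arrangement is counted as one of the permutations.
    return observed, (at_least_as_extreme + 1) / (n_perm + 1)

# Hypothetical ln-abundance time series: 4 control and 4 treated plots.
plots = [[4.6, 4.7, 4.6], [4.5, 4.6, 4.7], [4.7, 4.6, 4.5], [4.6, 4.5, 4.6],
         [4.6, 2.7, 2.8], [4.5, 2.8, 2.6], [4.7, 2.6, 2.7], [4.6, 2.7, 2.9]]
groups = ["Con"] * 4 + ["T1"] * 4
statistic, p_value = whole_plot_permutation_test(plots, groups)
print(f"statistic = {statistic:.2f}, p = {p_value:.3f}")
```

Adding one to both the numerator and denominator of the p value reflects the convention of 999 or 9999 requested permutations plus the observed arrangement.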
The significance of the first canonical axis (does the first PRC show a significant proportion of treatment variance) and of all canonical axes (does treatment as a whole have a significant effect on the overall community) are routinely calculated when PRC is conducted in CANOCO.

In addition, it is possible to conduct a permutation test for each treatment rate at each sampling time point in comparison to the control group. When conducting permutation tests for significance of treatment effects at individual time points, there are some differences from the approach used for the overall study data. In CANOCO, no covariables are included in the analysis and, instead of entering all study data into the analysis, each permutation test is limited to the control and treatment group and time point of interest. This approach is possible where there are more than 4 replicates per treatment. Typically, 4 or more replicates are available for most terrestrial studies (non-target arthropods and soil mesofauna). In aquatic mesocosm studies, replication per treatment group may be lower, depending upon design of the study.

When replication is low, it is possible to test the treatment effects on the overall community at each time point (but not the effects of each individual treatment rate at each time point) through permutation testing. This involves analysing all data for a time point together, rather than isolating data by time point and treatment rate. Log-transformed nominal dose is entered into the analysis as environmental data (Van den Brink et al, 1996). A brief discussion of the replication that would be required to achieve certain levels of statistical significance is provided in Chapter 6.

In cases where replication is limited, it is also possible to determine which treatment had a statistically significant effect at each time point. This may be achieved by conducting a univariate Williams test (Williams, 1972) on the sample scores of the first principal component (from a separate principal components analysis (PCA) conducted on the study data) of each sampling date in turn (e.g. Van den Brink et al, 1996; Van den Brink et al, 2000). This analysis would be limited to an extent since the first axis of the PCA only takes into account a proportion of the total variance.

The significance of the second PRC axis can also be tested.
This can be achieved by adding the environment-derived sample scores of the first PRC axis to the covariables. If the different treatment groups are differing rates or concentrations of the same test item and a zero rate control is included, the tests described above can provide a community level NOEC (NOECcommunity) for the study data in each treatment group and time point. In order to derive taxon level NOEC values, ANOVA-type analyses may be conducted. These are discussed in Chapter 6.
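The reliance of the per-time-point permutation tests on replication, noted earlier in this section, can be made concrete by counting how many distinct ways the test units of one control group and one treatment group can be reassigned. The sketch below assumes a simple two-group comparison at a single time point in which every reassignment is equally likely; it illustrates a general property of permutation tests and is not a description of how CANOCO enumerates permutations. The function names are hypothetical.

```python
from math import comb

def distinct_arrangements(n_control, n_treatment):
    """Number of ways to choose which test units are labelled as control."""
    return comb(n_control + n_treatment, n_control)

def smallest_p_value(n_control, n_treatment):
    """Lower bound on the p value: the observed arrangement is itself one
    of the possible arrangements, so p cannot fall below 1 / count."""
    return 1 / distinct_arrangements(n_control, n_treatment)

for n in (2, 3, 4, 5):
    print(f"{n} replicates per group: {distinct_arrangements(n, n)} arrangements, "
          f"smallest p = {smallest_p_value(n, n):.4f}")
```

With 3 replicates per group there are only 20 arrangements, so a p value below 0.05 cannot be obtained; with 4 per group there are 70. A statistic that treats the two groups symmetrically raises the bound further.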

7.8. Summary output statistics from PRC analyses conducted in CANOCO

The outputs which accompany the PRC analysis in CANOCO show the partitioning of variance between sampling date, differences between replicates and treatment effects. The other packages in which PRC may be performed are briefly introduced in 7.3. The outputs can be summarised thus:

• The amount of the total variance in the whole study data that is explained by the explanatory variables (e.g. treatment groups) is termed the sum of all canonical eigenvalues. This is expressed as a proportion 0-1. Higher values indicate that more variation is explained by treatment effects.
• Of this treatment related variance, the cumulative percentage variance of species-environment relation is shown for axes 1-4 (in CANOCO). This is the cumulative amount of the treatment related variance shown in each successive axis.
• The amount (proportion) of variance attributable to time is derived as 1 - (the sum of all unconstrained eigenvalues).
• The amount (proportion) of variance accounted for by between replicate differences is (the sum of all unconstrained eigenvalues) - (the sum of all canonical eigenvalues).
• Accompanying the output of the Monte Carlo analysis are two p values: one for the first canonical axis and the second for all axes. These represent the statistical significance of the amount of treatment related variance shown in the first axis (PRC 1) and whether the treatment regime as a whole had a significant effect on the community of the test system, respectively.
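The variance partitioning rules listed above amount to simple arithmetic on two summary values. The sketch below applies them; the function name, the dictionary keys and the example eigenvalue sums (0.60 and 0.25) are hypothetical and are not taken from any study or from CANOCO output.

```python
def partition_variance(sum_unconstrained, sum_canonical):
    """Split total variance (scaled to 1) into time, between-replicate and
    treatment components from the two eigenvalue sums described above."""
    if not 0 <= sum_canonical <= sum_unconstrained <= 1:
        raise ValueError("expected 0 <= canonical <= unconstrained <= 1")
    return {
        "time": 1 - sum_unconstrained,
        "between_replicates": sum_unconstrained - sum_canonical,
        "treatment": sum_canonical,
    }

# Hypothetical eigenvalue sums, for illustration only.
parts = partition_variance(sum_unconstrained=0.60, sum_canonical=0.25)
for source, proportion in parts.items():
    print(f"{source}: {proportion:.0%}")
```

The three components always sum to 1, which provides a quick check that the two values have been read from the output correctly.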


In a complex community level study with many taxa and several treatment application rates tested over several sampling time points, the level of variance in the data which is explained by treatment effects may be lower than anticipated. Values of 30 – 40% may be expected in, for example, ecological studies (Van den Brink et al 1996).

7.9. Data transformations

As PRC is typically used to investigate multi-rate field studies, abundance data are log transformed in order to fit the inherently sigmoidal dose response curve to the linear response models employed (e.g. Van den Brink et al, 1996 and references therein). Van den Brink et al (2000) provided a rationale for data transformation where abundance was variable between groups of organisms. For example, abundance was 0.2 individuals per litre for zooplankton and 1000 individuals per litre for phytoplankton. The authors proposed choosing the factor A in the transformation ln(Ax+1) so that the product Ax equals 2 when x is the lowest abundance higher than zero, which enables discrimination between low and zero values. Therefore, for zooplankton at 0.2 individuals per litre, A was 10.
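The choice of A described above can be sketched as follows. The sketch assumes that A is derived from the lowest non-zero abundance in the data set being transformed, so that Ax = 2 for that value; the function names and the example abundances other than 0.2 individuals per litre are hypothetical.

```python
import math

def scaling_factor(abundances, target_product=2.0):
    """Return A such that A * x equals target_product for the lowest
    abundance x that is greater than zero."""
    lowest_positive = min(x for x in abundances if x > 0)
    return target_product / lowest_positive

def transform(abundances, factor):
    """Apply ln(A * x + 1) to each abundance value."""
    return [math.log(factor * x + 1) for x in abundances]

# Hypothetical zooplankton abundances (individuals per litre).
zooplankton = [0.0, 0.2, 1.4, 6.0]
a = scaling_factor(zooplankton)
print(f"A = {a:g}")
print([round(value, 3) for value in transform(zooplankton, a)])
```

Zero counts remain at zero after transformation, while the lowest non-zero count maps to ln(3), so the two are clearly separated.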

7.10. Analysing a reference item with PRC

In terrestrial studies, it is a requirement of current testing guidelines to include a reference item (toxic or positive control) treatment in the study design. The aim of the reference item is to confirm the ability of the test system to detect effects (i.e. to prove the sensitivity of the test system). Validation of a study in this way is particularly relevant if the test item treatments (e.g. field rate or drift rate in traditional studies, or the higher rates used in dose response/NOER study designs) fail to produce any significant effects, as it provides confirmation that the lack of observed test item effects is unlikely to be an artefact of the study design. The reference item data must be analysed to demonstrate there was a significant effect in this treatment group.

As the reference item data are gathered and processed alongside those for the test item, it may be considered prudent to include the data for the reference item in the main PRC analysis. The reference item, however, may be a substance with a different mode of action or may be a higher rate of the test item (a rate which is known to elicit effects). When the reference item is a different chemical to the test item, including the reference item in the main PRC analysis may reveal a community response which is not related to the test item, but to the reference item, especially since the latter is included in order to demonstrate sensitivity of the test system. For instance, severe effects in the reference item treatment group would be incorporated in the overall community response recorded in the study and mask the effects of the test item. Permutation of the overall study data which includes the reference item data may, therefore, result in a more severe and/or statistically significant response.
Inclusion of the reference item data in the analysis of the main study data may also result in a further alternative community response (for example, as shown by PRC 2) which obscures a secondary response by the community to the test item. It is recommended, therefore, that the reference item be analysed separately in comparison to the control group only in order to demonstrate validity of the test system.

The influence of a positive control as a reference item in a PRC analysis can be demonstrated by reference to Figures 7.6-1 and 7.6-2, which were initially introduced to demonstrate interpretation of PRC axis 2. First, let treatment group T2 be the reference item, by which both taxa are affected, and treatment group T1 be the test item, by which only taxon B is affected. It can be seen in this example that the reference item elicits a different community response to the test item. Permutation testing on PRC axis 1 would include the effects of the reference item together with effects of the main test item, which may result in misleading conclusions. In this very simple example, we are able to easily determine which taxa are affected by the test item and which are affected by both test item and reference item.

An example analysis with more than two modes of action on one PRC is shown in Figure 7.10-1 (Frampton & Van den Brink, 2007). This shows the response of terrestrial macroarthropods to pirimicarb, cypermethrin and chlorpyrifos at rates of 40 g a.s./ha, 25 g a.s./ha and 480 g a.s./ha, respectively. It is clear that different chemical stressors may elicit different community responses, as seen by the response curves to the different pesticides. It is questionable, however, whether all compounds will have affected the same species and whether the PRC curves for the three compounds would be the same if the compounds were analysed separately. In this case, it would have been preferable to analyse each compound separately, allowing the display of effects of each to be maximised.

[Figure 7.10-1 plot area: community response coefficient Cdt against days post application (-40 to 50) for Control, Chlorpyrifos, Cypermethrin and Pirimicarb, with species scores bk for macroarthropod taxa (legible labels include DELPH, CECID, PARA, ALEO, LINY, THYS, ATOM and CORT).]

Figure 7.10-1. PRC axis 1 for macroarthropods from Frampton & Van den Brink (2007). The different chemical stressors (which were each applied at standard label application rates) elicited different community responses. This serves to illustrate that in regulatory studies, the test item should be analysed separately from any toxic reference item, as inclusion of both in one PRC analysis may obscure interpretation (species scores relate to all treatments; permutation testing at the whole study level considers data from all treatment groups).

7.11. Other uses for multivariate approaches in ecotoxicological field studies

Apart from assessing the responses of communities to the test item, there are a few other uses of multivariate methods in ecotoxicological community field studies. One is the description of the test system (De Jong et al, 2010). Aldershof & Bakker (2011) provided an example of the use of multivariate approaches to characterise the study site used in a replicated off-field non-target arthropod study. Classification methods were used to construct a dendrogram of similarity indices of the study plots and indicate which were the defining species of each group. Ordination methods (correspondence analysis and constrained correspondence analysis) were used to visualise differences in floristic composition between study plots. An understanding of the underlying botanical structure of test sites may be particularly important in the interpretation of ecotoxicological studies.

7.12. Section summary

• Multivariate approaches allow for community level interpretation of ecotoxicological data.
• Multivariate methods should be accompanied by abundance data inspection and univariate analyses.
• PRC can be used to detect global trends in community data and summarise them in an easy to interpret format.
• The coefficient of community response can be combined with species scores to enable taxon level interpretation.
• Indicated abundance relative to control groups can be calculated from PRC diagrams.
• When using multiple PRCs, more than one community response can be detected and interpreted, allowing differing responses amongst taxa to be identified.
• Reference items should be analysed in a separate PRC, regardless of chemical class.


8 Application of PRC analysis to example scenarios

The following chapter uses simplified hypothetical data in order to demonstrate how potential community responses to a chemical stressor may be depicted in PRC analysis, results of which are provided. The scenarios are simplified for ease of application, but represent aspects of community responses that may be seen in real data. The accompanying PRC analyses are provided with a discussion in each case. The PRC method is used to illustrate the examples as it is widely used in community ecotoxicology, but reference to other means of interpreting the data is made where applicable.

The raw data for each hypothetical scenario consist of:

• 1 control plus 3 treatment groups representing different treatment rates/concentrations of the same test item (Control, T1-3)
• 4 sampling time points
• 5 taxa (A-E)
• 4 replicates of each treatment group at each time point

The taxon level abundance data supporting the example scenarios are shown in Appendix I. There are 6 example responses (see Appendix I) which are consolidated into 3 example analyses.


Example 1: Effect and recovery; Decline in background abundance; Redistribution; One very abundant, unaffected taxon present in test system

This example illustrates how quite different community response patterns may result in PRC diagrams which appear very similar. Only one PRC diagram is shown (Figure 8-1), but there are various patterns of taxon abundance in control and treatment groups which could result in very similar PRC diagrams (see Appendix I). Each scenario is explained in turn.

Effect and recovery

In this example, relative abundance in the control and the lowest treatment group (T1) increases slightly during the study, then declines to pre-treatment levels by time 4. Taxa in treatment groups T2 and T3 are impacted, but recover to control levels at time 4 (no significant difference between treatment group and control). All taxa behave similarly within treatment groups (Appendix I).

The PRC diagram for this response is shown in Figure 8-1. The diagram shows that there is no difference between the control and lowest treatment group T1, and successively greater magnitude of response in the second and third treatment groups T2 and T3. The species scores, bk, show that all taxa have a similar affinity to the community response. This example demonstrates a straightforward scenario of effect followed by recovery. Note that the actual increase in abundance of control and lower treatment group organisms is not shown on the PRC, since control levels are used as a reference and are thereby set to zero.

[Figure 8-1 plot area: community response coefficient Cdt against sample point (0 to 5) for Con, T1, T2 and T3, with species scores bk for taxa A to E.]

Figure 8-1. PRC axis 1 diagram for Example 1 data. Successively greater magnitude of response in the second and third treatment groups T2 and T3. The lowest treatment group is unaffected. Recovery may be considered to have occurred at time point 4, where the trajectory of community response in treatment groups T2 and T3 returns to control levels.

Decline in background abundance

This example illustrates what may happen when, for example, a seasonal decline in abundance across the test system occurs (see data in Appendix I). The trend in abundance is for a test item driven decline in the highest two treatment groups followed by a decline in the control group and lowest test item treatment group, caused by seasonal/life history effects. The affected treatment groups do not recover and the abundance in all treatment groups is comparable at the end of the study. In effect, there is no statistical difference between the control and higher treatment rates at the end of the study.

The resulting PRC diagram for the abundance data in Appendix I is very similar to that shown in Figure 8-1. The trajectories of the impacted test item treatment groups and the non-impacted groups have coincided by time point 4, as there is no significant difference between the groups in terms of abundance. In this case, it is not appropriate to conclude recovery since recovery has not yet been demonstrated. Practical solutions to enable an assessment of recovery potential may be to sample the following season, or (in aquatic systems) to conduct laboratory bioassays with key taxa using water taken from the affected treatment groups, to determine if the test item residue was below harmful levels (such approaches are examples only and not necessarily endorsed by regulatory authorities; each case should be considered individually). Examination of the abundance data would greatly aid interpretation in this situation. This example highlights the need to consider PRC analysis in conjunction with univariate analyses and plots of abundance over time.

Redistribution

In most terrestrial (e.g. NTA) studies, individuals are free to move between study plots. To combat this, large plots are normally used and sampling may be focused towards the centre of these.
Nonetheless, some more mobile taxa may be able to redistribute from one plot to another, and this effect may be exaggerated if small plots are used. Plotting abundance data for all treatment groups against time for key mobile taxa would help detect this effect (see Appendix I for the abundance data). Redistribution after an initial test item mediated effect may appear in taxon level abundance data as a decline in control group numbers and a concurrent increase in treatment group numbers, until numbers are comparable between treatment groups.

In the PRC analyses this effect may result in a similar overall community response to the previous examples (Figure 8-1). The community response trajectory would suggest recovery since treatment group and control group curves converge at the end of the study. This is because abundance in the two groups is similar; however, whether or not this constitutes recovery would require some interpretation with regard to the aim of the study and the relevant protection goal. It is worth noting that separate PRC analyses may be conducted for mobile and less mobile taxa.

One very abundant, unaffected taxon present in test system

In this example, there is one unaffected taxon (taxon E) present in the test system at higher abundance than the other taxa (Appendix I). Such a taxon may be considered to have the potential to mask the community response if included in an analysis (it may or may not be considered a relevant part of the test system). It is useful to explore how such taxa may influence the outcome of an analysis using PRC.

The PRC analysis focuses only on treatment related effects and, therefore, the abundant taxon does not influence the community response: the response pattern is very similar to that shown in Figure 8-1. A zero species score for the very abundant taxon E would be calculated.


Example 2: Recovery in 1 abundant taxon

In this example, only the highest application rate of the test item has a marked effect on the artificial community. Initially, all taxa are similarly affected, but one taxon subsequently increases strongly at the last sampling occasion. At the final sampling occasion, abundance in the highest treatment rate is therefore lower than in the control for all but one taxon, and much higher than in the control for that one taxon (Appendix I). This example demonstrates how PRC handles a shift in community composition as well as abundance effects over time.

The first PRC axis (Figure 8-2) shows the overall community response, which is an effect of the test item followed by non-recovery. All taxa apart from one show an affinity to this response, suggesting this response pattern is a close description of the response of each taxon. Taxon A, however, shows a low negative species score, suggesting it increases in the highest treatment, but its real response is not represented well by the PRC diagram.

[Figure 8-2 plot area: community response coefficient Cdt against sample point (0 to 5) for Con, T1, T2 and T3, with species scores bk for taxa A to E.]

Figure 8-2. PRC axis 1 for Example 2. The overall community response in treatment group T3 is effect followed by non-recovery. Most taxa have a positive affinity for the overall response. One taxon (taxon A) displays a low species score, suggesting it does not closely follow this response pattern.

The second PRC axis for Example 2 shows the response as displayed by taxon A. Taxa B-E show a low species score for this response since their real response is already represented by the first PRC, while taxon A shows a strong affinity to this alternative response. The community trajectory in PRC axis 2 suggests effect followed by strong recovery. Note that the community response coefficient Cdt on the y axis is lower on PRC axis 2 than on PRC axis 1, suggesting that this axis explains less of the variation in the overall data.

This example demonstrates that even though abundance in treatment group T3 is equivalent to the control at the last sampling occasion, the analysis is able to detect the difference in community composition. Through examination of PRC axis 1 and PRC axis 2, the user is able to interpret the main and alternative species responses.

[Figure 8-3 plot area: community response coefficient Cdt against sample point (0 to 5) for Con, T1, T2 and T3, with species scores bk for taxa A to E.]

Figure 8-3. PRC axis 2 for Example 2. Taxon A follows an alternative community response with a correspondingly high species (bk) score. Other taxa (B-E) do not show affinity to this response.


Example 3: Increase in abundance due to life history

Taxa may increase in abundance throughout the duration of a study, for example through reproduction. This example shows such a situation, with abundance increasing in control and lower application rate treatment groups from study initiation. In the two higher application rate treatment groups (T2 and T3) there is an initial effect of the test item, followed by an increase in abundance as the remaining individuals reproduce. The overall abundance in these treatment groups does not, however, reach control levels at the end of the study. The PRC diagram shows the trend towards recovery in the treatment groups. Whether the difference is considered to be relevant would be case dependent. The increase in abundance in the control group is not shown in the PRC diagram as the control response is set to zero. Rather, the PRC captures the relative effects of the test item compared to the control group.
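A minimal sketch can show why a rise in control abundance does not appear in a control-referenced diagram. It assumes the response is summarised simply as the difference in ln(x+1) abundance between a treatment group mean and the control mean at each time point. This is a univariate simplification for illustration only and is not the PRC calculation itself; the function name is hypothetical and the abundance values are invented rather than taken from Appendix I.

```python
import math

def control_referenced_response(control_means, treatment_means):
    """Difference in ln(x + 1) abundance between treatment and control
    at each time point; the control is zero by construction."""
    return [math.log(t + 1) - math.log(c + 1)
            for c, t in zip(control_means, treatment_means)]

# Hypothetical mean abundances at four sampling time points.
control = [100, 150, 200, 250]      # increasing through reproduction
low_rate = [100, 150, 200, 250]     # unaffected, tracks the control
high_rate = [100, 30, 80, 150]      # initial effect, then partial increase

print([round(v, 2) for v in control_referenced_response(control, low_rate)])
print([round(v, 2) for v in control_referenced_response(control, high_rate)])
```

The unaffected group stays at zero throughout even though its abundance more than doubles, while the affected group moves back towards zero without reaching it, which corresponds to a trend towards recovery.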

[Figure: PRC diagram plotting the community response coefficient Cdt (y axis) against sample point (x axis, 0 to 5) for treatment groups Con, T1, T2 and T3, with species scores (bk) for taxa A to E.]

Figure 8-4. PRC axis 1 for Example 3. Abundances in the higher application rate treatment groups T2 and T3 are initially reduced by the test item, but then numbers begin to increase, indicating a trend towards recovery. The PRC diagram shows the responses for the control and lower application rate treatment groups as constant, even though abundance is actually increasing in these groups also (Appendix I).
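The point that the control line stays at zero while control abundance rises can be illustrated with a minimal sketch. This is not the PRC calculation itself (PRC is derived from a constrained ordination); it only shows the underlying idea of expressing each treatment as a difference from the control on a ln(x + 1) scale. The abundance values and the function name `relative_to_control` are hypothetical and chosen purely for illustration.

```python
import math

# Hypothetical mean abundances at four sampling points (illustrative only).
abundance = {
    "Con": [10, 20, 40, 80],   # control increases through reproduction
    "T1": [10, 20, 40, 80],    # low rate: tracks the control
    "T3": [10, 2, 10, 40],     # high rate: initial effect, then partial recovery
}


def relative_to_control(abundance, control="Con"):
    """Return ln(x + 1) differences from the control at each sampling point."""
    reference = [math.log(x + 1) for x in abundance[control]]
    return {
        group: [math.log(x + 1) - ref for x, ref in zip(values, reference)]
        for group, values in abundance.items()
    }


responses = relative_to_control(abundance)
for group, values in responses.items():
    # The control (and any group matching it) prints as zeros throughout.
    print(group, [round(v, 2) for v in values])
```

In this sketch the control and T1 both sit on the zero line despite an eightfold increase in abundance, while T3 moves back towards zero without reaching it, which mirrors the pattern described for Example 3.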


9 Summary of data examination and analysis

Visual inspection, univariate and multivariate methods all play a role in the interpretation of ecotoxicological field study data. The following summary may be a useful guide to researchers and risk assessors:

Initial assessment of data
1. Initial inspection of data on a taxon by taxon basis
a. Outliers (i.e. abnormally high or low abundance counts, perhaps associated with a single plot or test system replicate): can these be traced to a study error? Are they associated with the test system? Decide on inclusion in analyses; full justification is required if they are not included in the analysis.
b. Abundance: for which taxa is there sufficient abundance to conduct univariate analyses? The abundance criteria should be set during study design.
c. Which taxa are relevant to the regulatory question? Avoid opportunistic analysis of data. This should be decided during study design.

Univariate analyses
2. Do data conform to parametric test requirements? Transform data as required, and/or use non-parametric analyses.
3. Select and conduct an appropriate parametric or non-parametric multisample method
a. Derive significance of overall treatment on each taxon at each time point (multisample method) (if required).
4. Select and conduct an appropriate multiple comparison method
a. Derive significance of each specific treatment on each taxon at each time point (multiple comparison method) (if required). May lead to taxon level NOEC/NOEL.
5. Two consecutive significant events may be required to demonstrate a consistent effect of treatment. Interpret this in relation to the chosen level of α: if many tests are conducted and α = 0.05, about 5% of them may return significant results due to chance alone. Note caveats on sampling intervals and biological vs. statistical significance.

Multivariate analyses
6. Select appropriate multivariate method.
7. Transformation of abundance data as required.
8. Conduct initial analysis including whole time series permutation test, if required
a. Derive ordination diagram and indication of significance of test item treatment on community as a whole (from permutation testing).
b. Derive significance of first axis of PRC, for example.
c. Collate accompanying statistics such as explained variance.
9. Perform permutation testing on each time point and treatment group combination if replication is adequate; if replication is lower, consider, for example, the Williams test on the first component of a PCA
a. Derive community level NOEC values.
10. Interpret first axis alongside univariate statistics, species scores and visual data patterns. Test significance of second axis and plot if significant. Return to abundance plots and univariate analyses for biological interpretation.

General
11. Interpret whole study data to derive whole-study recovery endpoints (such as community NOEAEL) and time to recovery. Relate endpoint to protection goal, e.g. taxon level or community level effects and recovery? Maintain awareness of limitations of study design and statistical testing during interpretation.
12. Awareness of biological versus statistical significance throughout.
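Why step 9 makes permutation testing conditional on adequate replication can be shown with a small sketch. It is a simplified univariate illustration, not the multivariate permutation test used with PRC: it enumerates every way of allocating plots to two groups and compares the difference in group means. The function name `exact_permutation_p` and the ln-abundance values are hypothetical.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_p(control, treated):
    """Two-sided exact permutation p-value for a difference in group means."""
    pooled = list(control) + list(treated)
    observed = abs(mean(treated) - mean(control))
    hits = 0
    total = 0
    # Enumerate every possible allocation of plots to the 'control' group.
    for idx in combinations(range(len(pooled)), len(control)):
        chosen = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(mean(group_b) - mean(group_a)) >= observed - 1e-12:
            hits += 1
    return hits / total


# Hypothetical ln-transformed abundances with complete separation of groups.
p_four_reps = exact_permutation_p([3.2, 3.5, 3.1, 3.4], [1.1, 1.4, 0.9, 1.3])
p_three_reps = exact_permutation_p([3.2, 3.5, 3.1], [1.1, 1.4, 0.9])
print(p_four_reps, p_three_reps)
```

Even with complete separation of the two groups, three replicates per group allow only 20 allocations, so the smallest attainable two-sided p-value is 2/20 = 0.1. Four replicates per group give 70 allocations and a minimum of 2/70, which is below 0.05.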


10 Monitoring studies for effects of plant protection products on terrestrial invertebrates

10.1. Background

Monitoring impacts of PPPs on terrestrial invertebrates is challenging, due to the wide taxonomic diversity and variable abundance of invertebrates typically encountered in agro-ecosystems, the diversity of life histories represented, the complexity of trophic interactions, the diversity of anthropogenic activities (simultaneous management of soil fertility, crop growth and crop pests and diseases, typically involving multiple agents), and the heterogeneity of abiotic conditions (including spatial variation in geology and soils, and temporal variability in weather conditions). In addition to this variability there are considerable uncertainties, including a lack of knowledge about species’ life histories and their susceptibility to PPPs, and a lack of information on relevant exposure concentrations under field conditions. Key problems that need to be considered are how to detect impacts of PPPs against background spatial and temporal heterogeneity of endpoints, and how to ensure that the results of a monitoring study are applicable to the agricultural settings in which a PPP will be used.

Approaches for conducting field monitoring studies for evaluating effects of PPPs on terrestrial invertebrates were discussed in detail most recently at the EPIF Workshop in October 2003, which included reviews of the state of the art (Liess et al, 2005). The EPIF Workshop distinguished between two types of field studies:

(1) Higher-tier studies for informing regulatory risk assessments, conducted to assess effects of a PPP under controlled conditions. These are typically small-scale to minimise confounding factors (e.g. single-field, single-season studies) and are conducted routinely in cases where it is necessary to clarify whether risks to invertebrates identified in lower-tier tests would also apply under field conditions.

(2) Field studies to determine effects of one or more PPPs under realistic agricultural conditions, taking into account the complex biotic and abiotic variability mentioned above. These latter studies have to be sufficiently large in their spatial and temporal scales to adequately capture the variability of agro-ecosystems (e.g. long-term farm-scale studies), but are demanding in terms of the resources they require.

The EPIF Workshop identified very few large-scale studies that have monitored effects of PPPs on terrestrial invertebrates (Liess et al, 2005). Field monitoring studies that have been conducted at both large spatial and large temporal scales are particularly rare. For example, a systematic review of 30 studies that assessed the effects on invertebrates of reducing pesticide inputs in arable crop edges found that the longest-duration studies (4 years monitoring) had the smallest spatial scales (involving only one field) whilst the largest spatial scale studies (involving up to 33 fields) were typically only one season in duration (Frampton & Dorne 2007).

Although field monitoring studies may be conducted at a range of spatial and temporal scales, there are some principles of study design, analysis and interpretation that apply generally. The EPIF Workshop made several recommendations that are relevant when planning field studies for monitoring effects of PPPs on terrestrial invertebrates (Liess et al, 2005), which included (among others):
• Studies should be capable of identifying cause and effect for the specified PPP(s)
• Study designs should have sufficient statistical power to enable effects of PPPs to be detected
• Effects of PPPs on terrestrial invertebrates should be considered in both crop and non-crop habitats
• Reference or control sites should be selected to ensure that studies are suitably representative of the agro-ecosystems in which the study PPP(s) are used
• Confounding variables and uncertainty should be accounted for where possible

In accordance with these recommendations, guidance is provided below on good practice for the design, analysis and interpretation of field studies for monitoring effects of PPPs on terrestrial invertebrates. This guidance is based on information from published studies as well as on the experiences of the members of the project team who have been directly involved in designing, analysing and interpreting field studies.

10.2. Setting the research question

Key to the success of field monitoring studies is that the research questions are clearly specified and studies are designed in such a way that the questions can be answered, including demonstration of causality where appropriate. A hypothesis-based approach is preferable for addressing research questions, with clear criteria set for rejection of the hypothesis (e.g. as employed in the Farm-Scale Evaluations; Perry et al, 2003). Where multiple endpoints are to be measured, a distinction should be made between those for which a study is powered statistically to detect effects (i.e. differences or changes) and those which are not statistically powered, but may provide supporting contextual information. These are referred to as primary and secondary endpoints, respectively. It is not feasible to monitor all invertebrate taxa with adequate statistical precision in a single study, so monitoring studies typically focus on one or more ecological indicators (i.e. primary endpoints).

Typical endpoints in studies of the effects of PPPs on terrestrial invertebrates are invertebrate abundance, diversity, and behaviour (e.g. foraging activity). Endpoints may be structural (concerning taxonomic composition) or functional (concerning ecological, biological or chemical processes, e.g. decomposition or pollination). In addition to the monitoring of biotic endpoints for terrestrial invertebrates, assessments of PPP residues may also be conducted (as recommended in the EPIF Workshop; Liess et al, 2005), as well as assessments of crop performance. To optimise the resources available, some monitoring studies may include taxonomic groups in addition to terrestrial invertebrates. For example, the Boxworth project simultaneously monitored soil fauna, non-target arthropods, birds, mammals and plants (Greig-Smith et al, 1992) and the UK farm-scale evaluation of GM crops monitored several groups of invertebrates and arable plants (Firbank et al, 2003).

When planning a field monitoring study, it may be helpful to bear in mind that field studies monitoring effects of PPPs are increasingly being scrutinised in terms of their methodological rigour, as a result of wider use of systematic reviews of evidence for policy decision making (e.g. Frampton & Dorne 2007; EFSA 2010; Collaboration for Environmental Evidence). It is important that studies are designed to provide clear, defensible answers and that sources of variability and uncertainty are acknowledged.

10.3. Spatial scale

An appropriate unit to consider for defining spatial scale is an individual field, since this is a natural unit at which both agricultural management and many ecological processes operate. Within the field, areas adjacent to the boundaries, referred to as field margins, may receive specific agricultural management (e.g. reduced PPP applications such as Conservation Headlands). Outside the field, beyond the field boundaries, there may be adjacent cropped or uncropped land or (semi-)natural vegetation, referred to as ‘off-crop’ habitat. In terms of agricultural management, cropping, tillage, fertilisation and PPP applications typically occur at the level of individual fields, although groups of adjacent fields may receive similar management. Specific to each field unit are the size of the field, the crop sown and its management, the types of field boundaries that are present (e.g. hedgerows, ditches, fences), the types of margin management, and the types of adjacent non-crop habitat (e.g. other fields, woodland, grassland, water, urban). These factors have an important bearing on ecological processes in agricultural fields and contribute to considerable variability among fields in both agricultural practices and ecological structure and processes.

For instance, recolonisation by terrestrial invertebrates following PPP applications to a crop may occur from hedgerows and adjacent non-crop habitats. Species-specific patterns of dispersal into arable fields are well known; for example, among carabid beetles the species Agonum dorsale exhibits a seasonal dispersal from hedgerows into cropped areas (Burn 1992). For this to take place, however, hedgerows or shrubby adjacent non-crop habitats are required as overwintering sites for this and other invertebrate species. A study to monitor effects of PPPs on Agonum dorsale would, therefore, need to include fields that provide suitable overwintering habitat for this species.

Although the conceptual definition of an agricultural field (an agricultural management unit with specific boundaries, margins and adjacent off-crop habitats) is relatively simple and easy to visualise, fields also vary spatially in terms of their shape and size. To complicate matters further, an individual field may have more than one type of field boundary and more than one type of adjacent non-crop habitat. Abundance and diversity of terrestrial invertebrates typically decrease with distance into a field from the field boundary (although there may be exceptions, e.g. if the boundary is only a post-and-wire fence). The choice of field sites for conducting a monitoring study should ensure where possible that the combination of the above variables best addresses the research question. This may not be straightforward, given the variability in field characteristics.

Another issue to consider is whether the monitoring site(s) is/are representative (i.e. whether their findings would be relevant to the scenario of concern). It may help to consider the representativeness of an agricultural site in terms of:
• agricultural relevance - are the cropping and other management practices (including use of PPPs, fertilisation, margin management, etc) both relevant to the research question and typical of normal practice?
• ecological relevance - does the field support the indicator species implied in the research question (for which primary endpoints will be monitored), and is the surrounding non-crop habitat typical?
• geological relevance - do the soil type, drainage and altitude limit the extent to which the results can be extrapolated to other sites?

The spatial representativeness of field monitoring may be improved by extending the number of monitoring sites so that a wider variety of agricultural, ecological and geological conditions are included. Such an approach was used in the SCARAB project, by monitoring the impact of pesticides on invertebrates on three different farms which were located in different parts of England, and had different soil types and local crop rotations (Young et al, 2001). However, resources are typically limited, which results in a trade-off between statistical precision (requiring replication of sites) and applicability of effects that can be achieved (requiring a representative diversity of sites). The SCARAB project provided useful evidence of the spatial applicability of effects of PPPs, but with low statistical precision. The extent of spatial replication that would be required in order to achieve adequate statistical precision to detect effects on terrestrial invertebrates of herbicide-tolerant crops in an applicable way is illustrated by the Farm Scale Evaluations, which employed approximately 60 fields per year (Perry et al, 2003). Multivariate statistical approaches can be useful for visualising key variables that influence the abundance and diversity of terrestrial invertebrates in monitoring studies involving multiple farms, although the resolution of analysis may not be sufficient for the detection of effects of individual PPPs, and may be better suited to identifying the impacts of overall regimes of PPP use (Frampton & Van den Brink 2002).


10.4. Size and siting of study plots

A study plot refers to the spatial area in which the experimental treatment (e.g. PPP application) is applied and from which samples will be taken for the assessment of the pre-defined endpoints which are needed to answer the research question. There has been considerable debate about the appropriate size and location of study plots to use in monitoring effects of agricultural practices on terrestrial invertebrates. In the first large-scale experimental monitoring study, the Boxworth project, whole fields were employed as the study plots (Greig-Smith et al, 1992). More recently, in the SCARAB project (Young et al, 2001) and in the Farm-Scale Evaluations of herbicide-tolerant crops (Firbank et al, 2003), half-field plots have been employed for monitoring effects of agricultural practices on terrestrial invertebrates. The aim of using half-field plots is to ensure that pairs of experimental treatments (e.g. a PPP application area and untreated control area in the same field) share similar site characteristics such as soil type, tillage and PPP applications, field boundary type and type of adjacent off-crop habitat, to reduce the possibility that effects of PPPs on invertebrates may be confounded with effects of other variables.

Evidence from monitoring studies of Collembola and other terrestrial invertebrates suggests that neither whole fields nor half-field units may be the ‘perfect’ type of study plot, and that the choice of study plot may need to be justified on a case-by-case basis. For example, the pre-treatment abundance and diversity of Collembola can differ between adjacent arable fields that share the same soil type and agricultural management history (Frampton 1999) and also differ between half-field study plots sited within the same field (Frampton 2001). When selecting the appropriate spatial scale of monitoring it is important to ensure, through pre-treatment monitoring (see below), that the relevant indicators/endpoints are present at the proposed study sites.

In addition to study plots that are at a scale of whole fields or half-fields, many ecotoxicological studies have employed study plots at various other sub-field scales (including ‘semi-field’ studies). These can be problematic because the scale and location of plots are unlikely to match those of the ecological dynamics. For example, if small (e.g. up to 100 m x 100 m) plots are employed, invertebrate redistribution through dispersal may lead to exaggerated rates of population recovery, and/or redistribution of invertebrates across the study plots, possibly leading to incorrect exposure (e.g. control invertebrates may receive PPP spray or vice versa). If small study plots are employed, the scale of PPP application within an agricultural field may be unrealistically small, leading to atypically large unsprayed areas within fields that could act as refugia for invertebrate population recovery. Small plots may be separated by barriers to prevent invertebrate redistribution, but this reduces agricultural and ecological realism.

It may appear at first sight that small plots could be appropriate for monitoring invertebrates with low powers of dispersal (e.g. Collembola). However, poorly dispersive invertebrates exhibit intimate ecological interactions with more highly mobile invertebrates (e.g. Collembola are predated by linyphiid spiders and carabid beetles), meaning that small plots are unlikely to realistically capture ecological dynamics even if the main species of interest is poorly dispersive. Small plots also have the limitation that they can only cover a small section of an agricultural field, that is, the field margin or field centre, but not both.
Due to the inherent limitations and uncertainties of working with small study plots this guidance recommends that monitoring studies (as opposed to replicated manipulative field studies) for terrestrial invertebrates should always use study plots that are based on whole agricultural fields or on half-field units (as employed in the SCARAB project and Farm Scale Evaluations).

10.5. Sampling locations within study plots

The locations at which invertebrates are sampled within study plots will be informed by the research question. If the question specifically concerns effects of a PPP within a field margin or the adjacent off-crop area then sampling may be restricted to those areas. However, it is likely that the research question would seek to ascertain effects of a PPP on the full area of an agricultural field (i.e. including the margins and field centre), since invertebrate dynamics including dispersal and pest-antagonist interactions are not limited to any one specific area. Typically, invertebrate diversity and abundance are higher at field margins than the centre of fields, and this difference is likely to be more pronounced as field size increases. In order to capture this within-field heterogeneity of abundance and diversity, a transect approach to monitoring invertebrates is recommended, such that invertebrate samples are obtained in all relevant parts of a field, including uncropped field margins or Conservation Headlands if present, as well as areas of the crop itself at different distances from the field boundary.

There is no consensus on the number of transects that should be employed within a field, the number of sampling locations per transect or the distance between sampling points within a transect. These details would depend on the availability of resources. However, for the specified primary endpoints the intensity of sampling should be planned a priori to ensure that statistical power is adequate to permit the primary research hypothesis to be accepted or rejected. An example of the use of transects for monitoring invertebrates is the Farm-Scale Evaluations (Firbank et al, 2003).
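As a rough illustration of planning sampling intensity a priori, the sketch below estimates how many replicate units per group would be needed to detect a given difference in a primary endpoint. It is a simplified sketch under stated assumptions rather than a recommended method: it assumes a comparison of two independent groups, a two-sided test, a normal approximation, and an endpoint analysed on a ln-transformed scale. The function name `replicates_per_group` and all input values are hypothetical. Paired half-field designs, multiple treatment levels or count-based models would need their own power calculations.

```python
from math import ceil, log
from statistics import NormalDist


def replicates_per_group(delta, sd, alpha=0.05, power=0.8):
    """Approximate replicates per group for a two-sided, two-group comparison.

    delta: smallest difference in means worth detecting (analysis scale)
    sd:    assumed between-replicate standard deviation (same scale)
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_power = standard_normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / delta) ** 2)


# Difference equal to one standard deviation.
n_one_sd = replicates_per_group(delta=1.0, sd=1.0)

# Hypothetical case: detect a halving of abundance (ln 2 on the log scale)
# when the assumed standard deviation of ln-abundance is 0.5.
n_halving = replicates_per_group(delta=log(2), sd=0.5)
print(n_one_sd, n_halving)
```

The required replication rises with the square of the ratio of variability to the difference of interest, which is why variable endpoints in heterogeneous fields can demand many more sampling units than resources easily allow.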

10.6. Temporal scale

Ideally, field monitoring studies should be conducted over as long a time period as possible, to capture seasonal variations in weather, invertebrate population dynamics, and agricultural management practices. However, the temporal scale of a monitoring study is likely to be constrained by the availability of resources and the need to obtain answers within a short time frame. As mentioned above, longer-term monitoring studies have tended to have limited spatial replication whilst spatially large studies have tended to have short duration (Frampton & Dorne 2007), and examples of studies that are large in both spatial and temporal scales are rare (Liess et al, 2005).

A key problem when planning a monitoring study is that the seasonal variation in weather cannot be predicted. Studies should be planned to run for at least two crop years in order to estimate seasonal variability in endpoints, and the second season can serve as insurance if atypical conditions materialise (or even cause study failure) during the first season. Studies that extend beyond one crop season are important for capturing residual over-winter effects of PPP applications made in the preceding spring and summer.

There is no clear consensus on how frequently invertebrates should be monitored within a study. Sampling should be undertaken at meaningful times in relation to the life histories of the taxa for which primary endpoints will be assessed. For example, during the Boxworth Project, weekly monitoring during the spring enabled the migration of carabid beetles from field boundaries into cereal fields to be determined in relation to the timing of spring insecticide applications (Burn 1992). Based on weekly sampling in six consecutive years it was possible to detect year-to-year differences in the timing of migration of the carabid Agonum dorsale (Burn 1992), but such patterns would not have been detectable if sampling had only been conducted at monthly intervals.

One issue that has been the subject of extensive debate is how to account for pre-treatment spatial and temporal heterogeneity of invertebrate abundance and diversity in field monitoring studies. A key factor to be considered is that invertebrate taxa differ in their over-winter distributions, life histories and dispersal abilities, meaning that not all taxa of interest may be present in the study plots at the beginning of the monitoring period. Among the carabid beetles, for example, A. dorsale migrates from the field boundary to the cereal crop in the spring whilst Bembidion obtusum remains during the season only in the cereal crop (Burn 1992). Pre-treatment samples taken from the cereal crop would provide a useful baseline against which to assess changes in the abundance of B. obtusum, but would provide no useful information against which to assess changes in the abundance of A. dorsale. Given the inter-taxon variability of temporal and spatial dynamics, both pre-treatment sampling and spatial replication of study plots are recommended for ascertaining the spatiotemporal dynamics of study taxa at the start of the monitoring period.

Although pre-treatment heterogeneity in abundance or diversity is sometimes taken into account in monitoring studies by using pre-treatment abundance to adjust or ‘correct’ post-treatment abundance, this is not recommended in studies involving multiple post-treatment sampling times since the long-term relevance of the initial abundance is unclear. Percentage changes in abundance through time may be misleading and should not be reported unless supported by information on absolute abundance. Overall, it is preferable that absolute abundance or diversity is reported both for pre- and post-treatment sampling periods and these absolute data can be incorporated in statistical analyses.
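The caution about percentage changes can be made concrete with a short sketch. The helper `describe_change` and the abundance values are hypothetical; the point is only that the same percentage can describe very different absolute changes, so the two should be reported together.

```python
def describe_change(pre, post):
    """Return absolute and percentage change between two mean abundances."""
    absolute = post - pre
    # Percentage change is undefined when the pre-treatment mean is zero.
    percent = 100.0 * absolute / pre if pre else float("nan")
    return {"pre": pre, "post": post, "absolute": absolute, "percent": percent}


# Hypothetical mean counts per sample for a sparse and a common taxon.
sparse_taxon = describe_change(pre=2, post=1)
common_taxon = describe_change(pre=200, post=100)

for name, row in (("sparse", sparse_taxon), ("common", common_taxon)):
    print(name, row)
```

Both taxa show a 50% reduction, yet for the sparse taxon this is a difference of one individual per sample, which could easily arise from sampling variation alone.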

10.7. Sampling considerations

Appropriate sampling method(s) should be employed that provide an estimate of abundance, diversity, activity or trophic interactions (e.g. predation or (hyper-)parasitism). The endpoints should be specified a priori and should be readily interpreted in terms of ecological effects (e.g. if activity-density is to be assessed as an endpoint using pitfall trapping, an explanation of how the results will be interpreted should be provided a priori).

10.8. Statistical analysis

Two key points are: (1) primary endpoints should be powered statistically to detect effects in relation to the pre-specified hypothesis; (2) the limitations of interpreting secondary endpoints should be acknowledged. In particular, it is important to formally discuss the results of monitoring with regard to the role that chance might have played in obtaining the findings. For the multivariate analysis of field studies using PRC, the reader is referred to Van den Brink et al (2009).
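The role of chance when many endpoints are tested can be approximated with simple arithmetic, sketched below. The calculation assumes that all tests are independent and that no true effects exist, which will rarely hold exactly for correlated taxa and repeated sampling dates, so the figures are indicative only. The function names and the example of 20 taxa on 5 dates are hypothetical.

```python
def chance_findings(n_tests, alpha=0.05):
    """Expected false positives, and probability of at least one, under the null."""
    expected = n_tests * alpha
    at_least_one = 1 - (1 - alpha) ** n_tests
    return expected, at_least_one


def consecutive_pair_by_chance(alpha=0.05):
    """Chance that two specified consecutive tests are both significant under the null."""
    return alpha ** 2


# Hypothetical study: 20 taxa each tested on 5 sampling dates.
expected, at_least_one = chance_findings(n_tests=20 * 5)
pair = consecutive_pair_by_chance()
print(expected, round(at_least_one, 3), pair)
```

With 100 tests at α = 0.05, around five significant results would be expected by chance alone and at least one is almost certain, whereas a specified pair of consecutive significant results is far less likely to arise by chance. This is one reason why isolated significant results for secondary endpoints should be interpreted cautiously.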


11 References

Abbott, W.S. 1925. A method of computing the effectiveness of an insecticide. Journal of Economic Entomology, 18, 265-267.
Aldershof, S. & Bakker, F. 2011. Characterising off-field vegetation for NTA terrestrial mesocosm studies. SETAC Europe 2011 Abstracts.
Brock, T.C.M., Roessink, I., Dick, J., Belgers, M., Bransen, F., Maund, S.J. 2009. Impact of a benzoyl urea insecticide on aquatic macroinvertebrates in ditch mesocosms with and without non-sprayed sections. Environmental Toxicology and Chemistry, 28, 2191-2205.
Brock, T.C.M., Alix, A., Brown, C.D., Capri, E., Gottesbüren, B.F.F., Heimbach, F., Lythgo, C.M., Schulz, R. & Streloke, M. (Eds) 2010a. Linking aquatic exposure and effects: risk assessment of pesticides. SETAC Press & CRC Press, Taylor & Francis Group, Boca Raton, FL, USA, 398 pp.
Brown, K. & Miles, M. 2006. How much precision does a regulatory field study need. Pesticides and Beneficial Organisms, IOBC/wprs Bulletin, 29, 43-52.
Burn, A.J. 1992. Interactions between cereal pests and their predators and parasites. In: Greig-Smith, P.W., Frampton, G.K., Hardy, A.R. (Eds) Pesticides, cereal farming and the environment, 288 pp. HMSO, London.
Candolfi, M., Bigler, F., Campbell, P., Heimbach, U., Schmuck, R., Angeli, G., Bakker, F., Brown, K., Carli, G., Dinter, A., Forti, D., Forster, R., Gathmann, A., Hassan, S., Mead-Briggs, M., Melandri, M., Neumann, P., Pasqualini, E., Powell, W., Reboulet, J.-N., Romijn, K., Sechser, B., Thieme, Th., Ufer, A., Vergnet, Ch., Vogt, H. 2000. Principles for regulatory testing and interpretation of semi-field and field studies with non-target arthropods. Journal of Pesticide Science, 73, 141-147.
Candolfi, M.P., Barrett, K.L., Campbell, P.J., Forster, R., Grandy, N., Huet, M.-C., Lewis, G., Oomen, P.A., Schmuck, R., Vogt, H. 2001. ESCORT II. Guidance Document on Regulatory Testing and Risk Assessment Procedures for Plant Protection Products with Non-target Arthropods. SETAC.
Carter, N. 1993. Field trials to study the within season effects of pesticides on beneficial arthropods in cereals in summer. Bulletin OEPP/EPPO Bulletin, 23, 709-712.
Clarke, K.R. 1999. Nonmetric multivariate analysis in community-level ecotoxicology. Environmental Toxicology & Chemistry, 18, 118-127.
Clements, W.H. & Kiffney, P.M. 1994. Assessing contaminant effects at higher levels of biological organisation. Environmental Toxicology & Chemistry, 13, 357-359.
Cleveland, W.S. 1993. Visualizing Data. Hobart Press, Summit, NJ.
De Jong, F.M.W., van Beelen, P., Smit, C.E., Montforts, M.H.M.M. 2006. Guidance for summarising earthworm studies. RIVM Report No. 601506006/2006.
De Jong, F.M.W., Brock, T.C.M., Foekema, E.M., Leeuwangh, P. 2008. Guidance for summarising and evaluating aquatic micro and mesocosm studies. RIVM Report No. 601506009/2008.
De Jong, F.M.W., Bakker, F.M., Brown, K., Jilesen, C.J.T.J., Posthuma-Doodeman, C.J.A.M., Smit, C.E., van der Steen, J.J.M., van Eekelen, G.M.A. 2010. Guidance for summarising and evaluating field studies with non-target arthropods. RIVM Report No. 601712006/2010.


Duffield, S. & Aebischer, N.J. 1994. The effect of spatial scale of treatment with dimethoate on invertebrate population recovery in winter wheat. Journal of Applied Ecology, 31, 263-281.
Duffield, S.J., Jepson, E.C., Wratten, S.D. & Sotherton, N.W. 1996. Spatial changes in invertebrate predation rate in winter wheat following treatment with dimethoate. Entomologia Experimentalis et Applicata, 78, 9-17.
EFSA. 2010. EFSA Panel on Plant Protection Products and their Residues (PPR); Scientific Opinion on the development of specific protection goal options for environmental risk assessment of pesticides, in particular in relation to the revision of the Guidance Documents on Aquatic and Terrestrial Ecotoxicology (SANCO/3268/2001 and SANCO/10329/2002). EFSA Journal 2010;8(10):1821. [55 pp.] doi:10.2903/j.efsa.2010.1821. Available online: www.efsa.europa.eu/efsajournal.htm
EFSA. 2011. EFSA Scientific Committee; Statistical Significance and Biological Relevance. EFSA Journal 2011;9(9):2372. [17 pp.] doi:10.2903/j.efsa.2011.2372. Available online: www.efsa.europa.eu/efsajournal
EU. 2012. SCENIHR (Scientific Committee on Emerging and Newly Identified Health Risks), SCHER (Scientific Committee on Health and Environmental Risks), SCCS (Scientific Committee on Consumer Safety), Preliminary report on Addressing the New Challenges for Risk Assessment, 8 October 2012. http://ec.europa.eu/health/scientific_committees/emerging/docs/scenihr_o_037.pdf. Accessed 13th November 2012.
Firbank, L.G., Heard, M.S., Woiwod, I.P., Hawes, C., Haughton, A.J., Champion, G.T., Scott, R.J., Hill, M.O., Dewar, A.M., Squire, G.R., May, M.J., Brooks, D.R., Bohan, D.A., Daniels, R.E., Osborne, J.L., Roy, D.B., Black, H.I.J., Rothery, P., Perry, J.N. 2003. An introduction to the farm-scale evaluations of genetically modified herbicide-tolerant crops. Journal of Applied Ecology, 40, 2-16.
Frampton, G.K. 1999. Spatial variation in non-target effects of the insecticides chlorpyrifos, cypermethrin and pirimicarb on Collembola in winter wheat. Pesticide Science, 55, 875-886.
Frampton, G.K. 2000. Large-scale monitoring of non-target pesticide effects on farmland arthropods in England: the compromise between replication and realism of scale. In: Johnston, J.J. (Ed.) Pesticides and Wildlife. ACS Symposium Series, American Chemical Society, Washington DC.
Frampton, G. 2001. The effects of pesticide regimes on arthropods. In: Young, J.E.B., Griffin, M.J., Alford, D.V. & Ogilvy, S.E. (Eds) Reducing agrochemical use on the arable farm. DEFRA. pp. 219-253.
Frampton, G.K. 2002. Long-term impacts of an organophosphate-based regime of pesticides on field and field edge Collembola communities. Pest Management Science, 58, 991-1001.
Frampton, G.K., Van den Brink, P.J. & Gould, P.J. 2000. Effects of spring drought and irrigation on farmland arthropods in southern Britain. Journal of Applied Ecology, 37, 865-883.
Frampton, G.K., Van den Brink, P.J. & Wratten, S.D. 2001. Diel activity patterns in an arable collembolan community. Applied Soil Ecology, 17, 63-80.
Frampton, G.K., Van den Brink, P.J. & Gould, P.J. 2000. Effects of spring precipitation on a temperate arable collembolan community analysed using Principal Response Curves. Applied Soil Ecology, 14, 231-248.
Frampton, G.K. & Van den Brink, P.J. 2002. Influence of cropping on the species composition of epigeic Collembola in arable fields. Pedobiologia, 46, 328-337.

Frampton, G. K. & Dorne, J.-L. C. M. 2007. The effects on terrestrial invertebrates of reducing pesticide inputs in arable crop edges: A meta-analysis. Journal of Applied Ecology, 44, 362-373.
Frampton, G. K. & Van den Brink, P. J. 2007. Collembola and macroarthropod community responses to carbamate, organophosphate and synthetic pyrethroid insecticides: Direct and indirect effects. Environmental Pollution, 147, 14-25.
Greig-Smith, P. W., Frampton, G. K. & Hardy, A. R. (Eds) 1992. Pesticides, cereal farming and the environment, 288 pp. HMSO, London.
Goodman, S. N. 2001. Of P-values and Bayes: a modest proposal. Epidemiology, 12, 295-297.
Hinds, W. T. 1984. Towards monitoring of long-term trends in terrestrial ecosystems. Environmental Conservation, 11, 11-18.
Holland, J. M., Frampton, G. K. & Van den Brink, P. J. 2002. Carabids as indicators within temperate arable farming systems: implications from SCARAB and LINK Integrated Farming Systems projects. In: The Agroecology of Carabid Beetles, Ed. John M. Holland. Intercept, Andover, UK, 356 pp.
Hommen, U., Veith, D. & Dülmer, U. 1994. A computer program to evaluate plankton data from freshwater field tests. In: Hill, I. R., Heimbach, F., Leeuwangh, P. & Matthiessen, P. (eds) Freshwater Field Tests for Hazard Assessment of Chemicals, pp. 503-513. Boca Raton, FL: Lewis Publishers.
Jenkins, R., Hodgkin, J., Jenkins, C., Pickering, F., Samuel, A., Stolz, V., Wyllie, J., Podd, G. & Norman, S. 2012. New methods for assessing the effects of insecticides on larvae and adult emergence in freshwater outdoor microcosms. SETAC Europe 2012 Abstracts.
Kedwards, T. J., Maund, S. J. & Chapman, P. F. 1999. Community level analysis of ecotoxicological field studies: II replicated-design studies. Environmental Toxicology & Chemistry, 18, 158-166.
Kennedy, J. H., Ammann, L. P., Waller, W. T., Warren, J. E., Hosmer, A. J., Cairns, S. H., Johnson, P. C. & Graney, R. L. 1999. Using statistical power to optimize sensitivity of analysis of variance designs for microcosms and mesocosms. Environmental Toxicology & Chemistry, 18, 113-117.
Lang, J. M., Rothman, K. J. & Cann, C. I. 1998. That confounded P-value. Epidemiology, 9, 7-8.
Liess, M., Brown, C., Dohmen, P., Duquesne, S., Hart, A., Heimbach, F., Kreuger, J., Lagadic, L., Maund, S., Reinert, W., Streloke, M. & Tarazona, J. V. (eds) 2005. Effects of Pesticides in the Field. Society for Environmental Toxicology and Chemistry (SETAC). Berlin. 136 pp.
Maltby, L., Arnold, D., Arts, G., Davies, J., Heimbach, F., Pickl, C. & Poulsen, V. (Eds) 2010. Aquatic macrophyte risk assessment for pesticides. SETAC Press & CRC Press, Taylor & Francis Group, Boca Raton, FL, USA, 140 pp.
Manly, B. F. J. 1990. Randomization and Monte Carlo methods in biology. Chapman & Hall, London, 281 pp.
Maund, S., Chapman, P., Kedwards, T., Tattersfield, L., Matthiessen, P., Warwick, R. & Smith, E. 1999. Application of multivariate statistics to ecotoxicological field studies. Environmental Toxicology & Chemistry, 18, 111-112.
OECD. 2006a. Guidance Document on Simulated Lentic Field tests (Outdoor Microcosms and Mesocosms). OECD series on testing and assessment No. 53. ENV/JM/MONO(2006)17.

OECD. 2006b. Current approaches in the statistical analysis of ecotoxicity data: A guidance to application. OECD series on testing and assessment No. 54. ENV/JM/MONO(2006)18.
Perry, J. N., Rothery, P., Clark, S. J., Heard, M. S. & Hawes, C. 2003. Design, analysis and statistical power of the Farm-Scale Evaluations of genetically modified herbicide-tolerant crops. Journal of Applied Ecology, 40, 17-31.
Pitard, F. F. 1989. Pierre Gy’s Sampling Theory and Sampling Practice. 2 Volumes, CRC Press, Inc., Boca Raton, Florida.
Pullen, A. J., Jepson, P. C. & Sotherton, N. W. 1992. Terrestrial non-target invertebrates and the autumn application of synthetic pyrethroids: experimental methodology and the trade-off between replication and plot size. Archives of Environmental Contamination & Toxicology, 23, 246-258.
PSD. 1997. Guideline to study the within season effects of insecticides on non-target terrestrial arthropods in cereals in summer. In “The Registration Handbook”, Vol. II, Part Three, A3/ Appendix 2.
Sparks, T. H., Scott, W. A. & Clarke, R. T. 1999. Traditional multivariate techniques: potential for use in ecotoxicology. Environmental Toxicology & Chemistry, 18, 128-137.
Ter Braak, C. J. F. & Šmilauer, P. 2002. CANOCO Reference manual and Canodraw for Windows user’s guide: software for canonical community ordination (version 4.5). Ithaca, NY, USA (www.canoco.com): Microcomputer Power.
Ter Braak, C. J. F. & Šmilauer, P. 2012. Canoco reference manual and user's guide: software for ordination, version 5.0. Microcomputer Power, Ithaca, USA, 496 pp.
Ter Braak, C. J. F. 1994. Canonical community ordination. Part 1: Basic theory and linear methods. Écoscience, 2, 127-140.
Underwood, A. J. 1997. Experiments in Ecology: Their Logical Design and Interpretation Using Analysis of Variance. Cambridge University Press, 504 pp.
Van den Brink, P. J., Van Donk, E. & Gylstra, R. 1995. Effects of chronic low concentrations of the pesticides chlorpyrifos and atrazine in indoor freshwater microcosms. Chemosphere, 31, 3181-3200.
Van den Brink, P. J. & Ter Braak, C. 1998. Multivariate analysis of stress in experimental ecosystems by Principal Response Curves and similarity analysis. Aquatic Ecology, 32, 163-178.
Van den Brink, P. J. & Ter Braak, C. J. F. 1999. Principal response curves: analysis of time-dependent multivariate responses of biological community to stress. Environmental Toxicology & Chemistry, 18, 138-148.
Van den Brink, P. J., Van den Brink, N. W. & Ter Braak, C. J. F. 2003. Multivariate analysis of ecotoxicological data using ordination: demonstrations of utility on the basis of various examples. Australasian Journal of Ecotoxicology, 9, 141-156.
Van den Brink, P. J., Den Besten, P. J., Bij de Vaate, A. & Ter Braak, C. J. F. 2009. The use of the Principal Response Curves technique for the analysis of multivariate time series from biomonitoring studies. Environmental Monitoring and Assessment, 152, 271-281.
Van Wijngaarden, R. P. A., Van den Brink, P. J., Oude Voshaar, J. H. & Leeuwangh, P. 1995. Ordination techniques for analysing response of biological communities to toxic stress in experimental systems. Ecotoxicology, 4, 61-77.
Van Wijngaarden, R. P. A., Brock, T. C. M., Van den Brink, P. J., Gylstra, R. & Maund, S. J. 2006. Ecological effects of spring and late summer applications of Lambda-Cyhalothrin in freshwater microcosms. Archives of Environmental Contamination and Toxicology, 50, 220-239.

Verdonschot, P. F. M. & Ter Braak, C. J. F. 1994. An experimental manipulation of oligochaete communities in mesocosms treated with chlorpyrifos or nutrient additions: multivariate analyses with Monte Carlo permutation tests. Hydrobiologia, 278, 251-266.
Weinberg, C. R. 2001. It’s time to rehabilitate the P-value. Epidemiology, 12, 288-290.
White, H. W. 2000. A reality check for data snooping. Econometrica, 68, 1097-1126.
Williams, D. A. 1971. A test for differences between treatment means when several dose levels are compared with a zero dose control. Biometrics, 27, 103-117.
Williams, D. A. 1972. The comparison of several dose levels with zero dose control. Biometrics, 28, 519-531.
Young, J. E. B., Griffin, M. J., Alford, D. V. & Ogilvy, S. E. (eds). 2005. Reducing agrochemical use on the arable farm. DEFRA. pp. 219-253.
Zar, J. H. 1999. Biostatistical analysis. Fourth Edition. Prentice-Hall, USA.

Web resources (URL)
URL 1, Data snooping: http://www.ma.utexas.edu/users/mks/statmistakes/datasnooping.html. Accessed 15.11.12
URL 2, Ordination: http://ordination.okstate.edu/. Accessed 15.11.12

Appendix I Hypothetical taxon-level data for PRC examples (Chapter 8)

Example 1. Effect and recovery

[Figure: five panels (Taxon A to Taxon E), each plotting Abundance (y-axis, 0 to 120) against Time point (x-axis, 1 to 4) for series C, T1, T2 and T3.]

Decline in background abundance

[Figure: five panels (Taxon A to Taxon E), each plotting Abundance (y-axis, 0 to 120 for Taxa A and C; 0 to 100 for Taxa B, D and E) against Time point (x-axis, 1 to 4) for series C, T1, T2 and T3.]

Redistribution

[Figure: five panels (Taxon A to Taxon E), each plotting Abundance (y-axis, 0 to 120) against Time point (x-axis, 1 to 4) for series C, T1, T2 and T3.]

One very abundant, unaffected taxon present in test system

[Figure: five panels (Taxon A to Taxon E), each plotting Abundance (y-axis, 0 to 120 for Taxa A to D; 0 to 1000 for Taxon E) against Time point (x-axis, 1 to 4) for series C, T1, T2 and T3.]

Example 2. Recovery in one abundant taxon

[Figure: five panels (Taxon A to Taxon E), each plotting Abundance (y-axis, 0 to 500 for Taxon A; 0 to 100 for Taxa B to E) against Time point (x-axis, 1 to 4) for series C, T1, T2 and T3.]

Example 3. Increase in abundance due to life history

[Figure: five panels (Taxon A to Taxon E), each plotting Abundance (y-axis, 0 to 60) against Time point (x-axis, 1 to 4) for series C, T1, T2 and T3.]

Appendix II Glossary of terms

Abbott’s formula: Used to calculate percentage reductions.
ANOCOVA: Analysis of covariance.
ANOVA: Analysis of variance.
CANOCO: Computer programme which is used to perform multivariate analyses.
Degrees of freedom: Used during interpretation of a calculated test statistic. Degrees of freedom is typically n-1 (but may vary) to reduce bias in estimates of variance.
DIWS: Depth integrated water sample.
ECX: Effective Concentration that affects x% of the study population. The effect may be, for example, inhibition of motility or reproduction.
Field monitoring studies: Studies designed to assess effects on a community under full field conditions following application of a test item under normal agronomic practice. Cf. replicated manipulative field studies.
QQ plot: Quantile-quantile plot for comparing distributions.
NOEAEC: No observed ecologically adverse effect concentration.
NOEAEL: No observed ecologically adverse effect level.
NOEAER: No observed ecologically adverse effect rate.
NOEC: No observed effect concentration.
NOEC COMMUNITY: No observed effect concentration for the community.
NOEL: No observed effect level.
p: Probability of obtaining a result at least as extreme as that observed if the null hypothesis is true. The significance level against which p is compared is the probability of rejecting the null hypothesis when it is true (type I error).
PCA: Principal components analysis. Unconstrained linear ordination method.
PRC: Principal response curves analysis.
PRC1: First axis of PRC; the axis which explains most treatment-related variance in the data.
PRC2: Second axis of PRC; the axis which explains most treatment-related variance in the data after the first axis.
Replicated manipulative field studies: Studies conducted in a test system consisting of replicates (for example individual and independent aquatic ponds, ditches or field plots) to which various levels of test item are applied (dose response design).
RDA: Redundancy analysis. Constrained linear ordination method.

Test item: Chemical stressor to which the community of the test system is exposed in order to assess community response.
Test system: Experimental system for field studies containing a naturalised biological community, e.g. replicated aquatic mesocosms or terrestrial field plots.
Treatment group: Group of test system replicates which receive the same level of test item.
Type I error: Rejection of the null hypothesis when it is true (false positive).
Type II error: Failure to reject the null hypothesis when it is false (false negative).
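
As an illustration of the glossary entry for Abbott's formula, the percentage reduction in a treated group relative to the control can be sketched as below. This is a minimal sketch and not part of the original guidance: the function name and the example abundances are hypothetical, and the commonly quoted form of the formula, (1 - treated/control) x 100, is assumed.

```python
def abbott_percent_reduction(control_abundance: float, treated_abundance: float) -> float:
    """Percentage reduction of a treated group relative to the control.

    Assumes the commonly quoted form of Abbott's formula:
        reduction (%) = (1 - treated / control) * 100
    A negative result means abundance was higher in the treated group.
    """
    if control_abundance <= 0:
        # The formula is undefined when nothing is present in the control.
        raise ValueError("control abundance must be positive")
    return (1.0 - treated_abundance / control_abundance) * 100.0


# Hypothetical mean abundances per replicate (illustrative values only).
control_mean = 80.0
treated_mean = 20.0
print(abbott_percent_reduction(control_mean, treated_mean))  # 75.0
```

The check on the control abundance is a design choice of this sketch: where a taxon is absent from the control, a percentage reduction cannot be calculated and the case is better flagged than silently returned.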
