
Zoo Research Guidelines: Statistics for typical zoo datasets

© British and Irish Association of Zoos and Aquariums 2006
All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording or any information storage and retrieval system, without permission in writing from the publisher.
Plowman, A.B. (ed.) (2006) Zoo Research Guidelines: Statistics for typical zoo datasets. BIAZA, London.
First published 2006
Published and printed by: BIAZA, Zoological Gardens, Regent’s Park, London NW1 4RY, United Kingdom
ISSN 1479-5647

Zoo Research Guidelines: Statistics for typical zoo datasets
Edited by Dr Amy Plowman

Paignton Zoo Environmental Park, Totnes Road, Paignton, Devon TQ4 7EU, U.K.

Contributing authors:

Prof Graeme Ruxton Institute of Biomedical and Life Sciences, Graham Kerr Building, University of Glasgow, Glasgow G12 8QQ

Dr Nick Colegrave Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, King's Buildings, West Mains Road, Edinburgh EH9 3JT

Dr Juergen Engel Zoolution, Olchinger Str. 60, 82178 Puchheim, Germany.

Dr Nicola Marples Department of Zoology, Trinity College, Dublin 2, Ireland.

Dr Vicky Melfi Paignton Zoo Environmental Park, Totnes Road, Paignton, Devon TQ4 7EU, U.K.

Dr Stephanie Wehnelt, Zoo Schmiding, Schmidingerstr. 5, A-4631 Krenglbach, Austria.

Dr Sue Dow Bristol Zoo Gardens, Clifton, Bristol BS8 3HA, U.K.

Dr Christine Caldwell Department of Psychology, University of Stirling, Stirling FK9 4LA, Scotland.

Dr Sheila Pankhurst Department of Life Sciences, Anglia Ruskin University, Cambridge CB1 1PT, U.K.

Dr Hannah Buchanan-Smith Department of Psychology, University of Stirling, Stirling FK9 4LA, Scotland.

Heidi Mitchell Marwell Zoological Park, Colden Common, Winchester, Hampshire SO21 1JH, U.K.

Acknowledgements
These guidelines are the result of a workshop organized by the BIAZA Research Group in July 2004. All the authors were participants at the workshop; additional participants were Rob Thomas, Charlie Nevison and Colleen Schaffner, and we acknowledge their valuable contributions to these guidelines. Particular thanks also go to Rob Thomas for organizing the workshop logistics and to Minitab for sponsorship of the day.

Contents

1. Introduction (A.B. Plowman)
   1.1 What are these guidelines for?
   1.2 Why are these guidelines needed?
   1.3 How to use these guidelines and flowchart guide to sections

2. Randomisation tests (N. Colegrave, J. Engel and A.B. Plowman)
   2.1 The problem
   2.2 The solution
   2.3 Use of randomisation tests for single case and small sample sizes in a zoo setting
   2.4 Limitations of randomisation tests
   2.5 Presentation of results
   2.6 Software for randomisation tests

3. Multivariate tests (V.A. Melfi, N. Marples and G.D. Ruxton)
   3.1 The problem
   3.2 Common mistakes
   3.3 Solutions
   3.4 How is it done?
   3.5 How to interpret and present results

4. Analysing activity budgets using G-tests (N. Marples, G.D. Ruxton and N. Colegrave)
   4.1 The problem
   4.2 Common mistakes
   4.3 Solutions
   4.4 Limitations

5. General issues
   5.1 Autocorrelation, temporal independence and sampling regime (S. Dow, J. Engel and H. Mitchell)
   5.2 Social independence (S. Wehnelt, H. Buchanan-Smith, G.D. Ruxton and N. Colegrave)
   5.3 Multiple test corrections (C.A. Caldwell, G.D. Ruxton and N. Colegrave)
   5.4 Parametric versus non-parametric tests (C.A. Caldwell)

6. References (S. Pankhurst)

1. Introduction
A.B. Plowman

1.1 What are these guidelines for?
This volume aims to give zoo researchers, particularly students, clear guidelines to enable them to choose the most appropriate statistical tests for the types of datasets typically collected in zoo settings. If the guidelines in this volume are followed then researchers should be confident that they have chosen correct, valid and robust statistical analyses. The guidelines highlight typical challenges in zoo research, offer solutions and give advice on how to present the results of the tests and how to interpret these results in terms of what conclusions may or may not be drawn. With these guidelines we hope to increase not only the quality of zoo research but also the acceptance rate of zoo-based research papers in peer reviewed scientific journals.

1.2 Why are these guidelines needed?
Despite a long history of fascinating, innovative and robust research carried out in zoos around the world (e.g. de Waal and van Roosmalen, 1979) many researchers in other fields do not consider zoo research a scientifically worthwhile activity. The most common reasons given for this are that animals in zoo environments are not ‘natural’ and that robust statistical analyses are not possible. The first of these objections is something all researchers should be aware of. However, with recent developments in husbandry methods and naturalistic housing and social groupings, most modern zoos now provide an extremely useful research setting, bridging the gap between highly controlled, but often extremely unnatural, laboratory conditions and the totally natural, but very difficult working conditions of the field. The second objection will hopefully be dispelled by these guidelines, since they demonstrate that valid and robust statistical tests are possible for typical zoo datasets, even studies on a single animal. However, even robust statistics cannot make up for the low biological validity of a study on a small number of individuals (see section 2.4), but this is a problem in common with many field studies (see Bart et al., 1998) and these guidelines also provide ways to deal with this challenge.

In the past the zoo research community has not helped itself to dispel its image of poor statistical procedures and low validity. The typical statistical difficulties encountered (e.g. small samples, lack of independence of data points, non-normal distributions) have been dealt with in many different, more or less appropriate, ways by different researchers. In the published literature featuring zoo research one can find almost as many different statistical procedures applied to very similar datasets as there are papers. Thus, it is not surprising that many researchers find it hard to know which, if any, are correct. In addition to demonstrating which analyses are most appropriate we hope these guidelines will promote greater consistency in the way typical zoo research datasets are analysed and presented. Consensus on, and standardisation of, the methods we use can only be of benefit to all zoo researchers, increasing our own confidence and competence, improving the quality of our research and enhancing the value of our subject among the wider scientific community.

1.3 How to use these guidelines

It is vital that the relevant sections of these guidelines are read BEFORE starting research, as the tests to be used will strongly influence the way data are collected.

Sections 2, 3, and 4 of these guidelines provide information on the types of tests recommended in various situations that commonly occur during zoo research. The flowchart below provides a simple way for researchers to find the appropriate section for their experimental situation. Section 5 will be useful for all researchers as it provides general guidance on sampling procedures and how to avoid common statistical pitfalls, which are relevant irrespective of the tests being performed.

START

Do you have a large sample size (>15) and normally distributed data (see section 5.4)?
• Yes – use standard parametric tests: see readily available text books.
• No – continue below.

Are you focusing on one dependent variable, e.g. cortisol level or time spent pacing?
• Yes – Are you comparing the same animal(s) in two or more conditions, e.g. evaluating enrichment, before and after an enclosure move or modification, or with high and low visitor numbers? Or are you comparing two or more animals (or groups) in one condition, e.g. males vs females, adults vs juveniles? If so, use randomisation tests – Section 2.
• No – continue below.

Are you investigating changes in several related dependent variables, e.g. the whole activity budget?
• Yes – use G-tests and derivatives – Section 4.
• No – continue below.

Are you investigating the relationships between many dependent and independent variables, e.g. multi-zoo studies using existing differences in husbandry to evaluate their effects on animals, or MBA studies?
• Yes – use multivariate tests – Section 3.

2. Randomisation Tests
N. Colegrave, J. Engel and A.B. Plowman

2.1 The problem
A frequent problem of studies carried out in a zoo setting is that, due to practical or ethical limitations, they are often based around a limited number of replicates. For example, zoos may be limited in the number of animals that are available to test a hypothesis, or the number of independent enclosures in which animals can be kept while being studied. In multi-zoo studies, individual zoos will often be used as the independent data points, creating obvious difficulties in generating large data sets.

Small sample size studies present three specific problems:
• First, with few data points it is difficult to decide with any confidence whether the data meet the assumptions required for a particular test. For example, most parametric statistical tests assume that the data are drawn from populations with an underlying normal distribution. Determining whether this is the case in a study with only eight replicates is not realistic.
• Second, small studies will generally have extremely low statistical power, and since the power of parametric tests declines rapidly as assumptions are violated, they may be extremely inefficient tools for extracting the maximum information from our data.
• Third, despite best intentions, it will often be difficult or impossible to design zoo studies with the idealised sampling regimes envisaged in statistical text books. Instead data will often be collected opportunistically, leading to obvious problems.
The most frequently proposed solution to these problems is to use non-parametric tests. However, despite popular belief, such tests are not assumption free, and they frequently have low statistical power as well as other limitations (e.g. more complex designs including multiple factors or covariates may not be possible, or at least are very difficult). Thus, alternative tests that are robust in the face of these problems are needed.

2.2 The solution
Randomisation tests provide a powerful alternative to standard statistical procedures that we believe will prove useful in dealing with the problems of zoo studies described above (Edgington, 1995; Mundry, 1999; Todman and Dugard, 2001).

Hypothesis testing generally relies on the production of a P value, the probability of obtaining a result equal to or more extreme than the one actually observed in the study assuming that the null hypothesis is true. Whilst most well known parametric and non-parametric tests determine the P value using an assumed theoretical probability distribution for the test statistic (like the standard normal distribution or the χ² distribution), randomisation tests generate this sampling distribution directly. Practically, this is achieved by resampling or reshuffling the data obtained in the study to determine directly the probability of an experimental outcome as large as or more extreme than that observed. In very small studies (such as the example of the lady drinking tea, originally used by Fisher in 1935 to outline this procedure), it is possible to compute all possible permutations of the data, and provide what is known as an exact P value. For larger data sets, this is not feasible, but the recent increase in computing power has allowed these tests to be practically extended by generating a large number of random permutations of the observed data allowing an estimated P value to be generated. Since these tests do not rely on an underlying distribution, specified in advance, they are much more robust than parametric and non-parametric tests when the underlying distribution is unknown.

In theory, a randomisation test equivalent to any standard statistical test can be designed. Although more complex designs may need bespoke programming, appropriate randomisation tests for most types of design commonly used in zoo research are available in a number of software packages (see section 2.6). Furthermore, a randomisation procedure can be designed to specifically examine the study as it was actually carried out (e.g. including any peculiarities in sampling) rather than assuming a perfect sampling scheme. Thus, the difficulties of analysing data from non-standard designs will be reduced significantly. Other advantages of randomisation tests, particularly over other non-parametric tests, are that they use the original data values rather than ranks, so are more powerful, and they have no difficulty handling tied data.

2.2.1 Basic principles of randomisation tests
The basic principle common to all randomisation and exact tests will be illustrated here with a simple example of testing for a significant difference between two means. This is the equivalent of a Student’s t-test or a Mann-Whitney U-test for two independent samples A and B. Thus, imagine that we wish to test the hypothesis that a novel nutritional regime increases the growth rate of penguin chicks in our penguin enclosure compared to the standard diet. We feed eight randomly chosen chicks the new diet, and another eight the control diet, and we determine their change in weight over that period. The first step in our analysis is to determine the average weight change for the eight experimental penguins, and the average weight change for the controls. To determine the observed experimental difference we then subtract the average control change from the average experimental change. We then begin our randomisation procedure. Taking our 16 data points, we randomly assign eight to the experimental group and eight to our control group, giving us a random data set (but based on the data actually obtained in the experiment). We then calculate the difference in weight gain in the same way as for the actual experimental data, and write it down. We then repeat this procedure to generate a second random data set, and a second difference. We continue this procedure (preferably using a computer) a large number (1000s) of times (in all, there are 12870 different possible rearrangements of this data set). We can then estimate the probability of obtaining a difference between groups at least as great as the one observed in the experiment (the one-tailed P value) as the proportion of all the random data sets in which this was the case. Example 2.1 presents a detailed example with an even smaller sample size.

Example 2.1

Data sampled (observations):
A: 3, 4, 5
B: 10, 12
Difference between the means: 4 – 11 = -7

All possible permutations:

Ax:        3,4,5   3,4,10  3,4,12  3,5,10  3,5,12  4,5,10  4,5,12  5,10,12  4,10,12  3,10,12
Bx:        10,12   5,12    5,10    4,12    4,10    3,12    3,10    3,4      3,5      4,5
Mean Ax:   4       5.67    6.33    6       6.67    6.33    7       9        8.67     8.33
Mean Bx:   11      8.5     7.5     8       7       7.5     6.5     3.5      4        4.5
Ax – Bx:   -7      -2.83   -1.17   -2      -0.33   -1.17   0.5     5.5      4.67     3.83

The difference between the means of the two samples is equal to or more extreme (in this case smaller) than the one calculated from the observations (i.e. -7) in only one of the ten permutations. The one-tailed P value is 1/10 = 0.1.
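The enumeration in Example 2.1 is easy to reproduce in a short program. The sketch below (in Python; the function name is our own) lists every possible division of the pooled data into groups of the original sizes and counts how often the difference between means is at least as extreme as the observed one:

```python
from itertools import combinations

def exact_two_sample_test(a, b):
    """Exact permutation test for a difference between two means.

    Enumerates every way of splitting the pooled data into groups of
    the original sizes and counts how often the difference between
    group means is at least as extreme (here: as small) as observed.
    """
    pooled = a + b
    observed = sum(a) / len(a) - sum(b) / len(b)
    n_extreme = 0
    n_total = 0
    # Each choice of indices for group A defines one permutation.
    for idx in combinations(range(len(pooled)), len(a)):
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in idx]
        diff = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
        if diff <= observed:  # one-tailed: differences as small or smaller
            n_extreme += 1
        n_total += 1
    return n_extreme / n_total

# The data of Example 2.1: 5 observations, C(5, 3) = 10 permutations.
p = exact_two_sample_test([3, 4, 5], [10, 12])
print(p)  # 0.1, as in Example 2.1
```

For larger data sets the loop over all combinations would be replaced by a large number of random reassignments, as described in section 2.2.1.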

The same general procedure can be used to address more complex designs, although careful thought must always be given to the way in which the randomisation is done, and the interpretation that can be drawn from the P value. In the section that follows we will outline some of the most common designs that occur in zoo research, with guidelines on how to carry out an appropriate test. However, for more detailed discussion we would urge the reader to consult a specialised text such as the ones listed in section 6 (e.g. Todman and Dugard, 2001).

2.3 Use of randomisation tests for single case and small sample sizes in a zoo setting

2.3.1 AB designs for single cases
This section covers studies with a two-phase design (baseline and treatment) in which one treatment, which cannot be easily repeated or withdrawn, is applied once to one individual (or group, if the group is the sample unit). A realistic zoo example might be investigating the effects of moving to a new enclosure on the behaviour of an animal. In studies of this type the date on which the treatment is applied (the intervention date) should be determined randomly. Ideally this should be done by a truly random method of selecting the day from the range of those available, e.g. by drawing one of the possible dates from a hat. More likely, the zoo will set the date according to its own practical agenda. However, if the decision is based on practical zoo issues (e.g. when all the appropriate staff members are available) rather than animal issues (e.g. a particular point in an oestrous cycle/breeding season) then it is effectively random with respect to the animal and does not invalidate the results. Once the intervention date is decided the appropriate data are collected over a number of days before and after the intervention.

The difference between the daily mean value before the intervention and after the intervention of any variable measured can then be tested as in the example above by randomising the data and calculating the difference between the means.

Based on prior knowledge of the animal, and the day-to-day variation in the variable being measured, it is usually desirable to determine a minimum number of days of data collection before and after the intervention. For example, if there is a total of 60 days available for the study it may be determined that at least seven days' data should be collected before and after the intervention. The intervention date would therefore be randomly allocated to any day between day 8 and day 53 inclusive. The re-randomisation procedure should follow the experimental procedure; therefore, the re-randomised permutations should only include those in which the intervention date fell between days 8-53, and data collected on days 1-7 and 54-60 would be kept in the same position. Example 2.2 demonstrates how this is done with a very short study.

As can be seen in Examples 2.1 and 2.2 in such short studies it is not possible to obtain a P value <0.1 because only 10 permutations of the data are possible. In order to be able to obtain a result of P = 0.05 at least 20 permutations must be possible. Thus in Example 2.2 the study should be designed so there are at least 20 possible intervention dates (see also section 2.3.9 below).

2.3.2 ABAB designs for single cases
This section covers studies in which one treatment is applied repeatedly to one individual (or group, if the group is the sample unit). A typical zoo example might be the repeated use of a single environmental enrichment device that can be provided or removed on a daily basis. In these studies the treatment A (baseline) and treatment B (enrichment) days should be randomly assigned throughout the study period. Again, as above, if the schedule cannot be allocated truly randomly, it is acceptable as long as it is random with respect to the animal.

The difference between the treatment A and treatment B means of any variable measured is analysed exactly as in Example 2.1 above. Using such a design the researcher has to be aware that learning effects may be confounded with differences between the two conditions A and B.

Example 2.2
Thirteen days were available for the study, and it was decided to guarantee at least 2 days of data collection before and after the intervention. The intervention date was therefore randomly assigned to a day between day 3 and day 12 inclusive – in this case day 6. In the re-randomisations the data for days 1 and 2 and days 12 and 13 are always left in the same samples.

Data sampled (observations), with the intervention on day 6:
A (days 1-5): 4, 3, 3, 4, 5
B (days 6-13): 9, 11, 10, 8, 9, 12, 10, 12
Difference between the means: 3.8 – 10.125 = -6.325

All possible permutations, randomising the intervention date between days 3 and 12 only:

Intervention day   Mean A   Mean B    Difference
3                  3.5      8.45      -4.95
4                  3.33     9         -5.67
5                  3.5      9.55      -6.05
6 (observed)       3.8      10.125    -6.325
7                  4.67     10.28     -5.61
8                  5.57     10.17     -4.6
9                  6.125    10.2      -4.075
10                 6.3      10.75     -4.45
11                 6.6      11.3      -4.7
12                 7.09     11        -3.91

The difference between the means of the two samples is equal to or more extreme than the one calculated from the observations (i.e. -6.325) in only one of the ten permutations. The one-tailed P value is 1/10 = 0.1.
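The restricted randomisation of Example 2.2 can be sketched in the same way. The code below (the function name is our own) assumes the admissible intervention dates run from day 3 to day 12, which yields the ten permutations of the example:

```python
def ab_intervention_test(data, earliest, latest, observed_day):
    """Exact randomisation test for an AB (baseline/treatment) design.

    `data` holds one value per day; the intervention could have fallen
    on any day from `earliest` to `latest` (inclusive, 1-indexed).
    Only those permutations are enumerated, as in Example 2.2.
    """
    def mean_diff(day):
        a, b = data[:day - 1], data[day - 1:]
        return sum(a) / len(a) - sum(b) / len(b)

    observed = mean_diff(observed_day)
    diffs = [mean_diff(day) for day in range(earliest, latest + 1)]
    n_extreme = sum(1 for d in diffs if d <= observed)  # one-tailed
    return n_extreme / len(diffs)

# Daily values from Example 2.2; the intervention fell on day 6.
daily = [4, 3, 3, 4, 5, 9, 11, 10, 8, 9, 12, 10, 12]
p = ab_intervention_test(daily, earliest=3, latest=12, observed_day=6)
print(p)  # 0.1: one of the ten admissible intervention dates
```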

2.3.3 ABCDABCD designs for single cases
This section covers studies in which multiple treatments are applied repeatedly to one individual (or group, if the group is the sample unit). For instance, this would apply if multiple enrichment devices were provided separately on a number of occasions. As above, the days on which each treatment is provided should ideally be allocated randomly throughout the study period.

The largest difference between any two treatment means is calculated, followed by re-randomisation of all the data points across all treatments. If a significant result is obtained, post-hoc tests on pairwise comparisons of treatments can be performed (see section 2.3.6).
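As a sketch of this procedure, the following code (with hypothetical data and our own function name) uses the largest difference between any two treatment means as the test statistic and re-randomises all data points across all treatments, exhaustively here, which is only feasible for tiny data sets:

```python
from itertools import permutations

def max_diff_test(groups):
    """Randomisation test for several treatments (ABCD-type designs).

    Test statistic: the largest difference between any two treatment
    means. All data points are re-randomised across all treatments.
    """
    sizes = [len(g) for g in groups]
    pooled = [x for g in groups for x in g]

    def statistic(values):
        # Slice the pooled values back into treatments of the original sizes.
        means, start = [], 0
        for size in sizes:
            means.append(sum(values[start:start + size]) / size)
            start += size
        return max(means) - min(means)

    observed = statistic(pooled)
    perms = list(permutations(pooled))
    n_extreme = sum(1 for perm in perms if statistic(list(perm)) >= observed)
    return n_extreme / len(perms)

# Hypothetical data for three treatments, two observations each.
p = max_diff_test([[1, 2], [10, 11], [20, 21]])
print(p)  # 1/15: only splits isolating {1,2} and {20,21} are as extreme
```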

2.3.4 Any of the above designs for small numbers of replicates
If there is no need to test the effects of the experimental manipulation on individuals separately then any of the above experimental designs can be applied to a small number (<15) of individuals (or groups, if groups are the replicates) and treatments should be applied in exactly the same way. However, the analysis is slightly different since now we need a test analogous to a repeated measures test (paired t-test, repeated measures ANOVA, or Friedman test) rather than a t-test, one-way ANOVA, or Kruskal-Wallis test.

Instead of calculating the difference between the treatment means, the residual sum of squares (RSS) is calculated. The data are re-randomised across treatments but the same data points are kept within replicates (i.e. individuals or groups). The RSS is recalculated each time and the P value is based on the number of times the recalculated RSS is equal to or smaller than that in the actual observed data (a strong treatment effect produces a small RSS).
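A minimal sketch of this procedure, with hypothetical data and our own function name: treatment labels are permuted within each replicate and the RSS around the treatment means is recalculated for every such permutation. Note that a strong treatment effect makes the RSS small, so the count is of permutations whose RSS is no larger than the observed one.

```python
from itertools import permutations, product

def rss_repeated_measures_test(rows):
    """Randomisation test analogous to a repeated measures design.

    `rows` holds one list per replicate (individual or group), giving
    that replicate's value under each treatment. Data are re-randomised
    across treatments but kept within replicates. The statistic is the
    residual sum of squares (RSS) around the treatment means.
    """
    n_treat = len(rows[0])

    def rss(table):
        means = [sum(row[j] for row in table) / len(table)
                 for j in range(n_treat)]
        return sum((row[j] - means[j]) ** 2
                   for row in table for j in range(n_treat))

    observed = rss(rows)
    # Every combination of within-replicate orderings is one permutation.
    orderings = [list(permutations(row)) for row in rows]
    tables = list(product(*orderings))
    n_extreme = sum(1 for t in tables if rss(t) <= observed)
    return n_extreme / len(tables)

# Hypothetical data: three replicates measured under two treatments.
p = rss_repeated_measures_test([[2, 5], [3, 6], [1, 4]])
print(p)  # 0.25: 2 of the 8 within-replicate permutations have RSS <= 4
```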

See section 2.3.7 for information on how to analyse individuals separately and how to get an overall P value for the group.

2.3.5 Opportunistic designs
This section deals with studies that investigate the effects of uncontrollable events, so the application of treatments cannot be truly randomised. A typical zoo example might be studies of the effects of large numbers of zoo visitors. Treatments in this case might be ‘low visitors’ and ‘high visitors’ (analogous to ABAB designs) or ‘low visitors’, ‘medium visitors’ and ‘high visitors’ (analogous to ABCABC designs). In these cases the analysis is carried out exactly as the analogous test above, but justification is needed in the discussion for the lack of planned random assignment of treatment applications. Similarly, interpretation will be limited as in any non-experimental study, with the potential that effects are being driven by uncontrolled variables. In many zoo cases this is not a major problem because the assignment of treatments will be effectively random with respect to the animal and the researcher, so should not produce confounding error. However, care must obviously be taken to ensure that other variables, such as time of day, are equalised or randomised across the treatment being studied, e.g. that not all low visitor observations occur in the morning and all high visitor observations occur in the afternoon.

2.3.6 Post-hoc tests
If a significant difference is found in cases with more than two treatments, post-hoc tests may be performed to find out which treatments really differ from one another. To be logically consistent, these pairwise comparisons should again be done with appropriate randomisation tests. Care must be taken not to inflate the chosen α (significance) level by performing a number of tests on the same data (the multiple testing problem – see section 5.3 for possible solutions).

2.3.7 Agglutination tests
In many cases it may be desirable to investigate whether a treatment had an effect on each individual group member separately, for instance if it is expected that different age/sex classes may respond differently, but then to also generate an overall result for the whole group (or, when groups are the replicates, the effects on each group separately and an overall result for all groups). If this is the case, then separate randomisation tests for each individual can be performed as described above. Afterwards an agglutination test may be performed on those P values to generate an overall P value for the whole group.

Several agglutination tests are described in the literature. One of the best known is Fisher’s procedure, which uses the χ² distribution to combine several probabilities from independent statistical tests of a general hypothesis into a single more powerful test of this hypothesis (see Sokal and Rohlf, 1994, p. 794ff).

If Pi is the P value associated with the individual test on the ith individual, then the first step is to evaluate

-2 × Σ ln(Pi), where the sum runs from i = 1 to n

The value of this summation can then be looked up in χ² tables with 2n degrees of freedom (where n is the total number of tests being combined), to provide a test of the overall hypothesis.
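Fisher’s procedure can also be carried out without tables. The sketch below (function name our own) exploits the fact that 2n degrees of freedom is always even, for which the χ² tail probability has a simple closed form; applying it to the six individual P values of Example 2.3 is purely illustrative:

```python
import math

def fishers_method(p_values):
    """Combine independent P values with Fisher's procedure.

    Returns the statistic -2 * sum(ln Pi) and the overall P value on
    2n degrees of freedom. For an even number of degrees of freedom 2n
    the chi-squared tail probability has the closed form
        P(X > x) = exp(-x/2) * sum_{k=0}^{n-1} (x/2)^k / k!
    so no tables (and no external libraries) are needed.
    """
    n = len(p_values)
    statistic = -2 * sum(math.log(p) for p in p_values)
    half = statistic / 2
    tail = math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(n))
    return statistic, tail

# The six individual P values from Example 2.3 (the zebra study).
stat, p_overall = fishers_method([0.013, 0.042, 0.088, 0.153, 0.197, 0.460])
print(round(stat, 2), round(p_overall, 4))  # 28.44 0.0048
```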

Another procedure, which requires neither equal sample sizes nor exact P values, is described in Example 2.3.

Example 2.3
This procedure uses the binomial distribution to combine several P values at a given level of significance α:

P(X ≥ k) = Σ from x = k to n of (n choose x) · α^x · (1 − α)^(n − x)

where k out of n P values have been significant. If the resulting P is not greater than α, the whole result is considered significant at the chosen α level.

Let us, for example, consider the following experiment. To find out whether an artificial termite hill influenced the amount of comfort behaviour in a group of six zebras, each individual was observed several times with and without the hill. The two samples for each individual zebra were compared using a randomisation test. As a result six different P values were obtained:
P = 0.013 for individual A
P = 0.042 for individual B
P = 0.088 for individual C
P = 0.153 for individual D
P = 0.197 for individual E
P = 0.460 for individual F.

Only the first two zebras A and B showed a significant increase in their amount of comfort behaviour (applying a level of significance of α = 0.05). To determine whether there is a difference in the behaviour of the whole group an overall P value is calculated using the formula given above.

Two out of six P values were found significant in the first part of the study. This leads to k = 2, n = 6 and α = 0.05.

P = Σ from x = 2 to 6 of (6 choose x) · 0.05^x · 0.95^(6 − x)

 6! 2 6−2   6! 3 6−3   6! 6 6−6  P =  ⋅0.05 ⋅0.95  +  ⋅0.05 ⋅0.95  + K +  ⋅0.05 ⋅0.95  2!()6 − 2 !  3!()6 − 3 !  6!()6 − 6 !  P = 0.03054398+ 0.00214344+ 0.00008461+ 0.00000178+ 0.00000002 P = 0.03277383

This evaluation shows that the comfort behaviour of the whole group is significantly influenced by the artificial termite hill, because 0.0328 ≤ 0.05.
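The whole calculation of Example 2.3 reduces to one line once a binomial coefficient function is available (the function name below is our own):

```python
from math import comb

def binomial_agglutination(k, n, alpha=0.05):
    """Overall P value when k out of n independent tests were significant
    at level alpha: P(X >= k) under a Binomial(n, alpha) distribution."""
    return sum(comb(n, x) * alpha ** x * (1 - alpha) ** (n - x)
               for x in range(k, n + 1))

# Example 2.3: two of the six zebras gave a significant result.
p = binomial_agglutination(k=2, n=6, alpha=0.05)
print(round(p, 8))  # 0.03277383
```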

2.3.8 Correlation / Regression
Sometimes the research question will not aim at differences between samples but at associations and dependencies between variables. For example one may be interested in the influence of a certain vitamin on the level of activity, or the relationship between the number of group members and the number of social interactions. In this case special randomisation tests can be used to determine the error probability for a correlation coefficient or a regression analysis (described for example in Manly, 1997). In fact the published significance tables for Spearman’s ρ are based on data permutation.
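As an illustration of such a test (the function name is our own), the sketch below permutes one variable over all orderings, which is feasible only for very small samples, and counts how often the correlation is at least as strong as the one observed:

```python
from itertools import permutations

def exact_correlation_test(x, y):
    """Exact permutation test for a Pearson correlation coefficient.

    Permutes y over all orderings and counts how often |r| is at least
    as large as the observed |r| (a two-tailed test). For larger n a
    random subset of permutations would be used instead.
    """
    def r(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
        den = (sum((a - mx) ** 2 for a in xs)
               * sum((b - my) ** 2 for b in ys)) ** 0.5
        return num / den

    observed = abs(r(x, y))
    perms = list(permutations(y))
    n_extreme = sum(1 for perm in perms if abs(r(x, list(perm))) >= observed)
    return n_extreme / len(perms)

# A perfectly monotone toy data set: only the identity ordering and the
# full reversal of y reach |r| = 1, so P = 2/120.
p = exact_correlation_test([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
print(p)  # 2/120 ≈ 0.017
```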

2.3.9 Choosing the right number of randomisations
The difference between exact (permutation) tests and randomisation tests is that exact tests use all possible rearrangements of the data whereas randomisation tests only use a subset. For more complicated designs or larger datasets randomisation has the advantage of saving computer time without much loss of precision compared with exact tests.

The formula for the calculation of the maximum number of randomisations depends on the test used. For example, analysing two related samples with a sample size of N will result in 2^N permutations for the exact test. This means more than a million different permutations are possible with a sample size of N = 20. Two unrelated samples with sample sizes of M and N will need (M + N choose N) = (M + N)!/(N! · M!) permutations for the exact test. This means more than a million different permutations are possible with sample sizes of M = 12 and N = 11, for example.

If it is not possible to calculate all permutations, a subset of them is used to estimate the true P value. The bigger the subset, the better. Several authors recommend use of 5000 to 10000 randomised pseudosamples, especially if the desired α (significance) level is smaller than 0.05 (see e.g. Onghena and May, 1995).
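These permutation counts are easily checked, for example in Python:

```python
from math import comb

# Number of distinct rearrangements available to an exact test.
related = 2 ** 20         # two related samples, N = 20
unrelated = comb(23, 11)  # two unrelated samples, M = 12, N = 11

print(related)    # 1048576
print(unrelated)  # 1352078
```

Both counts exceed one million, which is why exhaustive enumeration quickly gives way to sampling a random subset of permutations.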

2.4 Limitations of randomisation tests
Having read this far, the reader may consider randomisation tests the ultimate procedure for statistical tests. Despite our belief that randomisation procedures provide powerful and useful tools that can be used in analysing typical zoo studies, these tests are not entirely assumption free (see Adams and Anthony, 1996; Todman and Dugard, 2001; Ewen et al., 2003). As with all statistical tests, researchers should be aware of their limitations, and use and interpret the tests accordingly. Here we outline some of the theoretical and practical limitations that we are aware of.

2.4.1 Limitations specific to randomisation tests

i) Choice of test statistic
Most statistical textbooks dealing with randomisation tests use the same test statistic for the same type of test, e.g. the difference between the means of samples to test for differences of ‘typical’ values. Some test statistics have been found to be “equivalent test statistics”. This means they will yield identical P values when using the same data; for example, the residual sum of squares (RSS) and the sum of observations in one sample have been found to be such equivalent test statistics.

Unfortunately not every test statistic gives the same result. There may be a significant difference between two samples when using one test statistic (e.g. difference between the mean values) and no difference at all when using another test statistic (e.g. difference between the median values). Considering a hypothesis with two independent samples a lot of different test statistics are conceivable, all of which could be used in a randomisation test: difference of means, difference of medians, difference of mean residuals, difference of sums, RSS, sum of observations in one sample, Mann-Whitney U, Wilcoxon W etc. One of the great advantages of randomisation testing is it forces the researcher to think explicitly about the test statistic that is most appropriate in a given situation.

Although the mean may not be the optimum measure to describe the location of a skewed sample with outliers, it is often used as the test statistic, probably due to the fact that the values of the statistic are distributed around zero. This makes the evaluation of a two-sided P value easy, compared to other test statistics where it is almost impossible to calculate a valid two-sided P value.

There is no simple answer on the important issue of which test statistic to use and it is recommended that researchers consult a statistician. However, the selection is somewhat restricted because many reasonable test statistics (like the difference between medians) are not available in any software packages at the moment.

ii) Availability of randomisation test software
Although software is available for a large number of standard tests (see section 2.6), for many, usually more complicated, designs researchers will probably need to programme their own tests, which may be a daunting prospect. As stated previously, existing computer programmes only use very few test statistics.

2.4.2 Limitations in common with other tests
i) Differences in variances
The difficulty of comparing the means of two samples that may come from populations with unequal variances has a long history under the name of the "Behrens-Fisher problem". When the hypothesis of interest concerns differences in means, most parametric and non-parametric statistical tests (e.g. the t-test or Mann-Whitney U test) assume equal variances in the populations from which the samples are taken (Hayes, 2000; Kasuya, 2001). Randomisation tests also make this assumption, because the null hypothesis is that the samples come from exactly the same source, which cannot be true if the variances differ.

There is no simple, ideal solution to this problem. The first difficulty is that it is very hard to determine whether variances are equal in typical zoo studies with small samples. Randomisation tests can be designed to test for unequal variances (Manly, 1997) but with small samples their power will be very low. In many cases it may be reasonable to assume that variances are likely to be similar, based on sound biological reasoning and inspection of the data. If this is not the case, there are several possible ways to deal with it, none of which are ideal:
• Manly (1995) examined six different randomisation tests for comparing means with unequal variances. He found one test superior to all the others under a range of conditions. Unfortunately, this test is not implemented in any of the common statistical software packages currently available, and even this test does not remove the problem entirely.
• Researchers may use any available transformation to minimise the problem and then use randomisation tests in the knowledge that fewer of their assumptions are being violated. Of course, all P values obtained should be treated with caution and the results may be difficult to interpret.
• It may be possible to formulate the research hypothesis so that any difference between the samples is of interest (regardless of whether it is in the shape, variability and/or location of the population distributions), in which case the problem no longer exists. For example, a possible two-sided experimental/alternative hypothesis would not be: H1: Female and male elephant seals differ in the mean amount of time they spend with novel food objects, but rather: H1: Female and male elephant seals differ in the amount of time they spend with novel food objects. However, it is then not possible to specify the type of difference (location, variability, skewness, kurtosis), and statistical software packages that will do such tests are not readily available at the moment.
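A randomisation test for unequal variances of the kind mentioned above might be sketched as follows (our illustration only, not Manly's recommended procedure): each sample is first centred on its own mean, so that a difference in location cannot masquerade as a difference in variance, and the absolute difference in sample variances is used as the test statistic.

```python
import random
import statistics

def variance_randomisation_test(a, b, n_perm=5000, seed=1):
    """Illustrative randomisation test for unequal variances.

    Within-sample residuals are pooled and re-allocated at random; the
    P value is the proportion of permutations whose variance difference
    is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(statistics.variance(a) - statistics.variance(b))
    # Centre each sample on its own mean before pooling.
    pooled = ([x - statistics.fmean(a) for x in a] +
              [x - statistics.fmean(b) for x in b])
    n_a, extreme = len(a), 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(statistics.variance(pooled[:n_a]) -
                   statistics.variance(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_perm
```

With the small samples typical of zoo studies the power of such a test is very low, as noted above.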
ii) Autocorrelation of data
When repeated observations are made on the same subject(s), the data may be autocorrelated through time; this is increasingly likely the shorter the interval between consecutive observations. See section 5.1 for advice on how to avoid autocorrelated data.
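A quick screen for this problem is to compute the lag-1 autocorrelation of the observation series (an illustrative check we add here; it is not a substitute for the design advice in section 5.1):

```python
import statistics

def lag1_autocorrelation(series):
    """Correlation between consecutive observations in a series.

    Values near 0 suggest successive observations may be treated as
    independent; values near 1 indicate strong carry-over from one
    observation to the next (too-short sampling intervals).
    """
    mean = statistics.fmean(series)
    num = sum((x - mean) * (y - mean) for x, y in zip(series, series[1:]))
    den = sum((x - mean) ** 2 for x in series)
    return num / den

# A slowly changing behavioural series is strongly autocorrelated:
smooth = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
# A rapidly alternating series is negatively autocorrelated:
alternating = [1, 5, 1, 5, 1, 5, 1, 5, 1, 5]
```

For the smooth series above the statistic is 0.7; for the alternating one it is negative.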

iii) Biological (external) validity
Although randomisation tests provide a robust and valid means of statistically testing data from single-case and small-sample studies, this does not make up for poor external validity of the results. For example, in an enrichment study on one lion there was a statistically significant increase in feeding time on days when enrichment was provided compared with days when it was not. If the experiment was designed and analysed correctly, as described here, it is perfectly reasonable to conclude that the enrichment caused an increase in time spent feeding by that single lion. It is not reasonable to conclude that the enrichment causes an increase in feeding time in captive lions generally. However, it is reasonable to argue that, assuming this lion and its living conditions are not exceptional in any way, the enrichment could well be expected to have a similar effect on other captive lions and would be worthy of further testing.

2.5 Presentation of results
Randomisation tests are still relatively uncommon in published papers in biological fields. It is, therefore, very important to provide adequate details of the test performed and suitable references. As a general rule, as for all other methods, you should provide enough information to allow repetition of the test by anyone reading the paper. For randomisation tests this should include what test statistic was calculated (e.g. difference between means, RSS), what data were re-sampled across what conditions and how many permutations were done, especially whether all possible re-arrangements were performed (exact test) or only a subset (randomisation test).
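The distinction between an exact test and a sampled randomisation test can be seen in a short sketch (our illustration): for small samples every re-arrangement can be enumerated, giving an exact P value with no sampling error.

```python
from itertools import combinations
from statistics import mean

def exact_two_sample_test(a, b):
    """Exact permutation test on the difference between means.

    Enumerates ALL re-arrangements of the pooled data (feasible only
    for small samples) rather than a random subset of them.
    """
    pooled = list(a) + list(b)
    observed = abs(mean(a) - mean(b))
    n_extreme = total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        grp_a = [pooled[i] for i in idx]
        grp_b = [pooled[i] for i in range(len(pooled)) if i not in idx]
        total += 1
        if abs(mean(grp_a) - mean(grp_b)) >= observed:
            n_extreme += 1
    return n_extreme / total
```

For example, with samples [1, 2] and [3, 4] there are only 6 possible splits of the pooled data into two pairs, of which 2 give a mean difference as extreme as the observed one, so the exact two-sided P value is 1/3.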

Example 2.4
The case of the lion enrichment study above might be reported thus:

A randomisation test was performed on the difference between the daily mean time spent feeding on feeding enrichment and non-enrichment days. 5000 re-randomised pseudosamples were generated by randomising all the daily feeding times across both conditions. The difference between the daily mean feeding time on feeding enrichment and non-enrichment days was equal to or greater than the observed value (12.5) in 15 of the 5000 permutations (proportion = 0.003). Therefore, the observed difference in feeding time between enrichment and non-enrichment days is statistically significant (P<0.01; two-tailed).

2.6 Software for randomisation tests
We cannot give a comprehensive list of all software packages currently available for running randomisation tests; instead we introduce some of the most prominent examples.

• SPSS includes an "Exact Test" option with all its non-parametric tests. Selecting this option makes SPSS perform an exact permutation test instead of the analogous non-parametric test based on an asymptotic distribution function. Another option, labelled "Monte Carlo", enables the user to perform a randomisation test with a given number of permutations. However, SPSS reports neither the maximum number of permutations for the case at hand nor the formula for the test statistic used, so it is difficult to report the statistical procedure comprehensively.
• StatXact is seemingly the most comprehensive package dedicated to randomisation tests at the moment. It carries out a variety of tests on one, two or k samples, contingency tables, and other situations, using P values from either sampling or full permutation distributions.
• RT carries out one- and two-sample tests, ANOVA, regression, and several other tests, all by sampling randomisation distributions.
• PopTools is a free add-on for Microsoft Excel which enables a large number of randomisation tests to be performed. It can be downloaded from http://www.cse.csiro.au/poptools/

• Todman and Dugard provide a CD with their book (Todman and Dugard, 2001; see references) containing macros for Microsoft Excel, Minitab and SPSS to perform a range of randomisation tests. Unfortunately, exact tests are not possible even with small samples. Although we have not tested the macros comprehensively, we know of one Excel macro on the CD that yields an incorrect result under certain conditions.
• SsS is a statistical software package available from Zoolution. Among other tests, it carries out various permutation and randomisation tests on two related or two independent samples. However, at the moment this package is only available in German.

3. Multivariate Tests
V.A. Melfi, N. Marples and G.D. Ruxton

3.1 The problem
Some types of zoo research, especially multi-zoo studies (reviewed in Mellen, 1994), aim to test the effects of a large number of independent variables on animal biology (e.g. mortality, Carlstead et al., 1999a; personality, Carlstead et al., 1999b; play behaviour, Spijkerman et al., 1996; activity level, Perkins et al., 1992). Analysing the simultaneous effects of many variables requires multivariate statistics and can be very complex.

3.1.1 Overview of multi-zoo studies
Comparing animals across different zoos can provide an invaluable source of information about how different independent variables (IVs), e.g. environmental factors or individual characteristics, influence animal behaviour and biology (dependent variables, DVs). A major advantage of this type of study is that the differences in independent variables already present between zoos (or animal groups) can be used to address both pure and applied questions. In addition, multi-zoo studies can help increase the sample size of the dataset, e.g. when evaluating an enrichment method, although in this case existing differences between zoos can be a limitation rather than an advantage.

A variety of methods can be used to collect data in multi-zoo studies: i) behavioural observations, either of the animals' normal behaviour (Mellen, 1993; Melfi, 2001) or in response to an experimental manipulation, e.g. a standardised enrichment or novel object presented; ii) historical records, e.g. using data from ARKS or SPARKS (widely used zoo record systems) (Pickering et al., 1992); iii) surveys of zoo employees or visitors (Carlstead et al., 1999a); iv) collection of biological samples as a source of information, e.g. cortisol in faeces (Shepherdson et al., 2004; see Smith, 2004); and v) dietary intake studies.

3.1.2 The main problems arising when applying multivariate statistics to your data
• Many variables cannot be controlled (or even identified): how does one choose which variables to measure?
• Many variables will interact in their influence: how does one interpret these interactions?
• Lack of social independence between individuals in the same group.
• The only statistical tests available for multivariate analysis are parametric tests, but much of the data collected do not meet the assumptions of these tests (see section 5.4).
• Multiple testing using the same data may increase the probability of Type I errors (see section 5.3).

3.2 Common mistakes
The most common mistake is to ignore the above problems and therefore have only a poor understanding of the validity of the study. At the other extreme, however, it is also a mistake to become over-concerned with these problems and either not perform the study at all because they seem insurmountable, or use only a subset of the data collected, losing valuable information from the dataset and providing only a limited interpretation of the results.

3.3 Solutions
3.3.1 Choosing variables and controlling variables
• Clearly outline the questions which the study intends to address, the hypotheses and the dataset required.
• Use published data to determine which IVs have previously been associated with the DV you are interested in (see below, example 3.1).

• Consider which IVs vary considerably between your study groups, as these will be easier to measure and might be more likely to lead to differences in your DV. For example, animals' activity budgets are affected by many factors, but since enclosure size and social composition may vary greatly between zoos these are IVs that definitely need to be included.
• A pilot study may help to hone your research question so that you need to consider fewer IVs, or even a single one. In addition, a pilot study and preliminary analysis of the dataset may be necessary to reveal unexpected confounding variables which should be controlled, beyond obvious ones such as time of day and season (see below, example 3.2).

Example 3.1 Choosing variables
Mellen (1991) investigated the influence of a number of IVs, e.g. environment, social composition and individual characteristics, previously associated with breeding success (the DV, measured as the number of litters produced per year of reproductive opportunity) in small zoo-housed exotic felid species. Data were collected for 20 species (134 individuals) in eight zoos. Eleven operationally defined IVs, including age, sex, enclosure size and medical treatments, were measured for each animal, and breeding success was calculated. The DV was assessed in terms of each male-female pairing: each male was assessed in terms of each female he was paired with, and similarly each female was assessed in terms of her mates. In some situations an individual had more than one mate, which led to problems with data independence (see section 5.2). There were also more data for females (N=78) than males (N=76), owing to the death of one male and insufficient records for another.

Multiple IVs and a single DV meant that the data in this study were suitable for multiple regression analysis. There was some violation of data independence, acknowledged in the paper, but the multiple regression analysis successfully established that IVs such as the number of medical treatments performed and husbandry style (level of keeper-animal interaction) were significant predictors of breeding success in small exotic felid species.

Example 3.2 Controlling variables
Melfi and Marples (2004) studied the influence of captive environmental variables (IVs) on the time spent feeding (DV) by Sulawesi crested black macaques (Macaca nigra). Data were collected on eight different troops. Before multiple regression analyses were applied to the dataset, descriptive and preliminary analyses were performed.

Previous data had highlighted that the age/sex class of Sulawesi macaques significantly affected the time they spent feeding (Melfi and Feistner, 2002). This was supported by a two-way ANOVA performed on the multi-zoo data, with age/sex class and zoo as fixed factors (both factors significantly affected feeding). As the authors were only interested in which environmental variables affected feeding, they had to control for the different social compositions of the groups sampled, i.e. a group with more adult males would appear to rest more than a group with more juveniles. This was done by manipulating the data incorporated into the multiple regression analysis: the data were weighted by age/sex class within SPSS (see SPSS Help or user guides).

3.3.2 Interaction of independent variables
IVs may sometimes influence each other. For instance, it is likely that larger enclosures are also more complex, so complexity and size cannot be considered entirely independent. Be aware of the potential for interactions between the IVs you choose and allow for it in the interpretation of your data (see below).

3.3.3 Social independence
If animals are housed together, they may influence each other's behaviour, either positively or negatively. For instance, if a dominant male is feeding, a subordinate may not be able to feed. On the other hand, many animals show social facilitation, so if one member of the group is resting the others may be more likely to rest too. See section 5.2 for information on minimising the effects of lack of independence between individuals in the same group.

3.3.4 Non-normal data distributions
Behavioural data often fail to meet the requirement of parametric statistics that the residuals of the data are normally distributed (see section 5.4 for details on how to deal with non-normality). For multivariate tests with no non-parametric alternative it is better to proceed with the test than to perform no analysis at all. However, it is crucial that you understand how your data violate the assumptions of the test and how robust the test is to those violations, and that you exercise extreme caution when interpreting the results.
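As a rough first check on this assumption, one can pool the within-group residuals and look at their skewness (an illustrative screen we add here; formal tests and remedies are covered in section 5.4):

```python
import statistics

def residual_skewness(groups):
    """Pool within-group residuals and return their sample skewness.

    Values near 0 are consistent with symmetric (normal-ish) residuals;
    a large positive or negative value warns that the normal-residuals
    assumption behind parametric tests may be violated.
    """
    residuals = []
    for g in groups:
        centre = statistics.fmean(g)
        residuals.extend(x - centre for x in g)
    m = statistics.fmean(residuals)
    s = statistics.pstdev(residuals)
    n = len(residuals)
    return sum(((x - m) / s) ** 3 for x in residuals) / n

symmetric_groups = [[4, 5, 6], [9, 10, 11]]     # residuals: -1, 0, 1 twice
skewed_groups = [[1, 1, 2, 10], [2, 2, 3, 20]]  # a few extreme high values
```

The symmetric groups give a skewness of exactly 0, whereas the groups containing occasional very large values give a clearly positive skewness.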

3.3.5 Multiple testing
Multivariate analysis can involve multiple testing of your results just as easily as any other form of analysis. See section 5.3 for information on how to deal with multiple testing on the same data.
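The simplest (and most conservative) correction for multiple testing is the Bonferroni procedure, sketched below as an illustration; section 5.3 should be consulted for fuller guidance:

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction: judge each of k tests against alpha / k,
    which keeps the family-wise Type I error rate at or below alpha."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values], threshold

# Four tests on the same dataset: only the first survives correction,
# even though three of the raw P values fall below 0.05.
flags, threshold = bonferroni([0.001, 0.02, 0.04, 0.30])
```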

3.4 How is it done?
3.4.1 What multivariate statistics are available and which should you choose?
It is important to understand what type of data you have collected (e.g. continuous, interval, discrete, dichotomous; see standard textbooks for definitions) and how many IVs and DVs you have. Using this information, and the questions you wish to ask (section 3.3.1 above), Table 3.1 indicates which multivariate statistical analyses could be applied to your data. All the tests outlined can be performed by most major statistical software packages.

Table 3.1: Choosing among multivariate statistical techniques (adapted from Tabachnick and Fidell, 1996, with kind permission of HarperCollins College Publishing).

Major research question | Number (kind) of DVs | Number (kind) of IVs | Covariates | Analytic strategy | Goal of analysis

Degree of relationship
  One, continuous | One, continuous | None | Bivariate r | Create a linear combination of IVs to optimally predict the DV
  One, continuous | Multiple, continuous | None | Multiple R | (as above)
  One, continuous | Multiple, continuous | Some | Sequential multiple R | (as above)
  Multiple, continuous | Multiple, continuous | - | Canonical R | Maximally correlate a linear combination of DVs with a linear combination of IVs
  None (category frequencies) | Multiple, discrete | - | Multiway frequency analysis | Create a log-linear combination of IVs to optimally predict category frequencies

Significance of group differences
  One, continuous | One, discrete | None | One-way ANOVA or t-test | Determine reliability of mean group differences
  One, continuous | One, discrete | Some | One-way ANCOVA | (as above)
  One, continuous | Multiple, discrete | None | Factorial ANOVA | (as above)
  One, continuous | Multiple, discrete | Some | Factorial ANCOVA | (as above)
  Multiple, continuous | One, discrete | None | One-way MANOVA or Hotelling's T2 | Create a linear combination of DVs to maximise mean group differences
  Multiple, continuous | One, discrete | Some | One-way MANCOVA | (as above)
  Multiple, continuous | Multiple, discrete | None | Factorial MANOVA | (as above)
  Multiple, continuous | Multiple, discrete | Some | Factorial MANCOVA | (as above)
  Multiple, continuous | One, discrete within-Ss | - | Profile analysis of repeated measures | Create linear combinations of DVs to maximise group differences and differences between levels of within-subjects IVs
  Multiple, continuous/commensurate | Multiple, discrete | - | Profile analysis | (as above)
  Multiple, continuous | One, discrete within-Ss | - | Doubly multivariate profile analysis | (as above)

Prediction of group membership
  One, discrete | Multiple, continuous | None | One-way discriminant function | Create a linear combination of IVs to maximise group differences
  One, discrete | Multiple, continuous | Some | Sequential one-way discriminant function | (as above)
  One, discrete | Multiple, discrete | - | Multiway frequency analysis (logit) | Create a log-linear combination of IVs to optimally predict the DV
  One, discrete | Multiple, continuous and/or discrete | None | Logistic regression | Create a linear combination of the log of the odds of being in one group
  One, discrete | Multiple, continuous and/or discrete | Some | Sequential logistic regression | (as above)
  Multiple, discrete | Multiple, continuous | None | Factorial discriminant function | Create a linear combination of IVs to maximise group (DV) differences
  Multiple, discrete | Multiple, continuous | Some | Sequential factorial discriminant function | (as above)

Structure
  Multiple, continuous observed | Multiple, latent | - | Factor analysis (theoretical) | Create linear combinations of observed variables to represent latent variables
  Multiple, continuous observed | Multiple, latent | - | Principal components (empirical) | (as above)
  Multiple, continuous observed and/or latent | Multiple, continuous observed and/or latent | - | Structural equation modelling | Create linear combinations of observed and latent IVs to predict linear combinations of observed and latent DVs

3.5 How to interpret and present results
3.5.1 Interpreting results
i) Correlation limitations: Any test based on the significance of correlation coefficients has inherent limitations which need to be borne in mind when interpreting results. A correlation between an IV and a DV is not necessarily causal, as a second IV not included in the analysis may be responsible for changes in both the DV and the first IV. In other words, just because feeding increases whenever straw is put into the enclosure does not necessarily mean the straw makes the animal feed more. It could be that on those days a different food is also given, or some other aspect of husbandry is also changed.

Another problem is that the strength of the correlation is not necessarily reflected by its significance, as the probability of finding a significant relationship increases with sample size. Sprinthall (1987) proposed an arbitrary set of definitions to interpret the strength of the correlation coefficient, which recognises that some significant correlations are so weak that they may be biologically meaningless.
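The dependence of significance on sample size is easy to demonstrate: the t statistic for testing whether a correlation differs from zero is t = r·sqrt((n-2)/(1-r²)), so a fixed, weak correlation becomes 'significant' purely by increasing n. A quick illustration (our own numbers, not from any study cited here):

```python
import math

def correlation_t(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))

weak_r = 0.2                           # explains only 4% of the variance
t_small = correlation_t(weak_r, 20)    # ~0.87: nowhere near significant
t_large = correlation_t(weak_r, 400)   # ~4.07: highly 'significant'
```

The correlation itself, and hence its biological meaning, is identical in both cases; only the sample size has changed.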

Example 3.3 Correlation limitations
In a study by Wilson (1982), enclosure complexity was identified as a greater determinant of gorilla (Gorilla gorilla) and orang utan (Pongo pygmaeus) behaviour than enclosure space. However, when Perkins (1992) replicated this study ten years later, enclosure space and complexity were found to be highly interrelated and were therefore considered to affect behaviour together. The discrepancy between the two studies may be explained by dramatic changes in the way captive primates are exhibited in zoos: in newer enclosures complexity and size appear to increase in parallel, whereas in older enclosures they did not. This disparity highlights the limitations of using results based on correlations (e.g. multiple regression). As a causal relationship between DV and IV cannot be substantiated using correlations alone, it is suggested that they be used as the basis for identifying and planning studies in which the relationship between the DV and a specific IV is tested using manipulative experiments.

ii) Interactions between variables: It is important to understand how the variables in your analysis interact with one another (intercorrelation) and also how they may act together to affect the DV.

Example 3.4 Interactions between variables
Melfi (2001) investigated the influence of a large number of environmental factors (IVs) on feeding behaviour (DV) in zoo-housed Sulawesi macaques; this study illustrates some of the limitations which arise when variables interact.

Using simple correlations, many of the IVs appeared to have significant relationships with the DV (Table 3.2). However, it was difficult to determine which IVs were important in terms of a causal relationship, since many of them were also significantly correlated with each other (Table 3.2).

Table 3.2 Correlation coefficients between several environmental factors (IVs) and feeding behaviour (DV) of Sulawesi crested macaques (** P < 0.01; *** P < 0.001).

IV                    | Feeding (DV) | Bark    | Troop size | Area/individual | Total floor area | No. substrates inside
Bark                  | 0.53***      | -       | -          | -               | -                | -
Troop size            | 0.25**       | -0.05   | -          | -               | -                | -
Area/individual       | 0.26**       | 0.57*** | 0.52***    | -               | -                | -
Total floor area      | 0.30**       | 0.48*** | 0.7***     | -               | -                | -
No. substrates inside | 0.40***      | 0.93*** | -0.05      | 0.60***         | -                | -

Various combinations of the IVs were included in multiple regression models to determine the best model to predict feeding behaviour. The predictive value of a model is indicated by R², the coefficient of determination, which describes how much of the observed variation in feeding can be ascribed to the model (to avoid overestimating R² when there are many IVs, an adjusted R² may be recommended; see Tabachnick and Fidell, 1996).

In the first model (Table 3.3) three IVs (troop size, area/individual and bark) significantly predicted 45% of the variation observed in feeding behaviour (R² = 0.45, F3,85 = 23.22, P < 0.001). The β coefficients (which represent the contribution made to the model by each IV) for all three IVs were significant, indicating that they were all useful independent predictors of feeding behaviour, and suggest that feeding behaviour increases with troop size and in the presence of bark, but decreases as the area/individual rises. Both the β coefficient and the partial correlation coefficient (partial r) for area/individual indicate that it has a negative relationship with feeding behaviour. This is in contrast to the simple correlation coefficient, which indicated a positive relationship between area/individual and feeding behaviour.

Table 3.3 First predictive model of feeding behaviour (*** P < 0.001).

IV              | β        | Partial r | sr² (sequential) | Tolerance
Bark            | 0.85***  | 0.63      | 0.28             | 0.51
Troop size      | 0.56***  | 0.49      | 0.08             | 0.55
Area/individual | -0.52*** | -0.39     | 0.10             | 0.37

(Intercept = 11.38)  R² = 0.45  Adjusted R² = 0.43  R = 0.67***
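The adjusted R² reported in Table 3.3 can be reproduced from the published figures as a quick check (assuming n = 89, which follows from the 85 error degrees of freedom of an F test with three IVs):

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1): penalises adding
    IVs, guarding against R^2 creeping upwards with every extra variable."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Reported model: R^2 = 0.45, three IVs, 85 error df => n = 85 + 3 + 1 = 89
adj = adjusted_r_squared(0.45, 89, 3)   # ~0.43, matching Table 3.3
```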

Partial correlation coefficients provide a 'true' picture of the relationship between two variables, after removing the effects of other interrelated variables (Martin and Bateson, 1993). The positive relationship between area/individual and feeding indicated by simple correlation is therefore probably a reflection of intercorrelation between the IVs. Area/individual was highly correlated with both bark and troop size, which were positively related to feeding, thus resulting in the false positive relationship between area/individual and feeding behaviour.

The partial correlation coefficients indicate that all three IVs in the model were related to feeding behaviour, the strongest relationship being for bark (r = 0.63), then troop size (r = 0.49) and lastly area/individual (r = -0.39). In a stepwise multiple regression model the squared semipartial correlation coefficient (sr²) indicates the unique contribution made to the model by an IV at the point that it enters the equation, and the sum of sr² for all IVs will equal R² (this is not true for other types of multiple regression models, see Tabachnick and Fidell, 1996). Any overlap with the contribution made by previously entered IVs (e.g. bark) is deducted from the contribution made by a newly entered IV (e.g. troop size). Therefore the contribution made to the final R² value by each IV depends on the order of IV entry into the equation.
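The way the unique contribution of a later-entered IV shrinks when it overlaps an earlier one can be shown with a two-IV example (hypothetical correlations chosen by us, loosely echoing the feeding/bark/troop-size values above): the sequential sr² of the second IV is simply the increase in R² when it enters.

```python
def r2_two_ivs(r_y1, r_y2, r_12):
    """R^2 of a two-IV regression, from the three pairwise correlations:
    (r_y1^2 + r_y2^2 - 2*r_y1*r_y2*r_12) / (1 - r_12^2)."""
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)

# Hypothetical values: DV-IV1 = 0.53, DV-IV2 = 0.25, IV1-IV2 = 0.40
r_y1, r_y2, r_12 = 0.53, 0.25, 0.40
r2_iv1_alone = r_y1 ** 2                  # 0.2809: IV1 entered first
r2_full = r2_two_ivs(r_y1, r_y2, r_12)    # ~0.283: both IVs in the model
sr2_iv2 = r2_full - r2_iv1_alone          # sequential sr^2 of IV2
```

Although IV2's simple r² is 0.0625, its sequential sr² here is under 0.002: almost all of its predictive power is already carried by the correlated, earlier-entered IV1.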

In this model bark accounted for the majority of the predictive power of the model (sr² = 0.28). Although troop size had a greater partial correlation with feeding behaviour than area/individual, it actually contributed less to the final predictive power of the model (sr² = 0.08, compared with 0.10). This indicates that troop size and bark overlapped in their abilities to predict feeding behaviour. As bark was the first IV to enter the model equation all of its predictive power is included in the model, whereas some of the power of troop size was discounted because of the overlap it shared with bark. Thus, the real strength of the relationship between feeding and troop size was not reflected clearly in the final model.

The second predictive model for feeding behaviour (Table 3.4) attempted to reduce the level of intercorrelation between the IVs by not including area/individual, as it was clearly related to troop size. Instead, total floor area was included as a representation of enclosure size. In this model troop size, total floor area and bark significantly predicted 44% of the variation observed in feeding behaviour (R² = 0.44, F3,85 = 22.64, P < 0.001).

Table 3.4 Second predictive model of feeding behaviour (*** P < 0.001).

IV                    | β        | Partial r | sr² (sequential) | Tolerance
Bark                  | 0.85***  | 0.62      | 0.28             | 0.49
Troop size            | 0.71***  | 0.48      | 0.08             | 0.33
Total floor area      | -0.61*** | -0.38     | 0.09             | 0.25
No. substrates inside | excluded |           |                  |

(Intercept = 11.38)  R² = 0.44  Adjusted R² = 0.42  R = 0.67***

An additional IV, number of substrates inside, was also tested but was highly correlated with troop size (r = 0.93). This value exceeded the default intercorrelation limit imposed by the statistical package, and therefore one of the two IVs had to be excluded (Norusis, 1990; SPSS Help). As troop size entered the multiple regression equation first, it was included and number of substrates was excluded. The β coefficients for all three IVs included in the model were significant, indicating that they were all useful in accounting for variation in feeding behaviour. All of the IVs in the feeding model

were positively correlated with feeding, indicating that the use of bark as an inside flooring substrate promotes feeding behaviour, as does an increase in troop size and total floor area. Bark accounted for the majority of the predictive power of the model (sr² = 0.28), indicating that it had the largest impact on feeding behaviour. The tolerance values of the IVs in this model were closer to 0 than those calculated for the first predictive model. Thus the attempt to reduce intercorrelation was not successful: as can be seen in Table 3.2, total floor area was highly correlated with bark and troop size, producing a higher level of intercorrelation than in the previous model. The degree of multicollinearity (intercorrelation between more than two variables) within a model affects the accuracy of interpreting how individual IVs affect the DV; the direct effect of each IV on the expression of the DV can only be inferred.

iii) Biological relevance: A significant result may not always have 'biological meaning' for the animal. That is, even if an IV is shown to explain a significant amount of the variation in the DV, modification of the IV may not significantly influence the DV. Therefore, if a significant relationship between an IV and a DV is detected, it is useful to investigate it directly by experimental manipulation of the IV.

Example 3.5 Biological relevance
Wielebnowski et al. (2002) conducted a series of studies to investigate the causes of behavioural problems observed in captive clouded leopards (Neofelis nebulosa). They used multiple regression analyses to establish whether faecal glucocorticoid levels (of 74 clouded leopards in 12 zoos) could be predicted by housing and husbandry variables (IVs). Three IVs (enclosure height, number of keepers per facility and the average hours a primary keeper spent with each animal per week) were found to account for a substantial amount of the observed variability in faecal glucocorticoid levels (R² = 0.5, P < 0.001).

Shepherdson et al. (2004) set out to investigate directly the relationship between one of these IVs (enclosure height) and the same DV. Due to practical constraints it was not possible to vary the exact IV: they could not increase the height of the existing enclosures, but with additional perching they were able to increase the height to which the clouded leopards could climb within them. Following the addition of higher perching, faecal glucocorticoid levels declined, suggesting that the IV-DV relationship found in the previous study did have biological relevance for the clouded leopards.

3.5.2 Presenting results - limitations and justifications
In most situations the limitations of a test will stem from the dataset not entirely satisfying the assumptions of the statistical analysis used. If this is the case you must explicitly detail how and why you still feel that the study benefits from such analysis (e.g. the analysis is robust to the violations of its assumptions in this case), supported by references wherever possible.

Example 3.6 Explaining limitations
In a study investigating the effects of environmental factors (IVs) on the activity (DV) of zoo-housed orang utans, Perkins (1992) studied 29 subjects housed in nine zoos, using seven IVs in a step-wise multiple regression analysis. To ensure that the analysis was reliable, a series of checks were made and the results justified with supporting literature. Initially a power test indicated that 28 subjects would yield a power level of 0.6; effectively there was a 60% probability of detecting an effect if one existed (Cohen and Cohen, 1983). Secondly, the distribution of the data was assessed using Pearson's coefficient of skewness (= 0.49), suggesting that the dataset showed a 'slight bias towards activity level values less than the mean'; Perkins (1992) argued that multiple regression is robust to such deviations (Cohen and Cohen, 1983). Finally, an outlier was identified and omitted from the analysis: 'four separate measures of residual analyses indicated that this single data point was a significant outlier (Studentized residuals, hat diagonals, difference fitting and Cook's distance; Afifi and Clark, 1984; Norusis, 1990)'.

4. Analysing Activity Budgets with G-tests
N. Marples, G.D. Ruxton and N. Colegrave

4.1 The problem
When an animal's activity budget is recorded, either by scan sampling or by focal animal sampling, the behaviours it can be performing at any one time are dependent upon each other, because it often cannot do one action if it is already doing another, e.g. it cannot run and be inactive simultaneously. Statistical tests which assume the activities to be independent are therefore not appropriate for this type of data if the research question involves several behaviours.

4.2 Common mistakes
The common approach to such activity budget research is to ask questions such as:
1. How much time does the animal spend on different activities?
2. Is this the same as for another individual?
3. Does this change when I add an enrichment?
4. Is this the same as in the wild?
5. Given a longer day, which activity increases to use up the extra time?

These are all legitimate questions to ask of such a data set, but the problem is that the statistical test applied is usually a t-test/ANOVA or a Mann Whitney U-test/Kruskal-Wallis test between the amounts of each behaviour (Q1) or between individuals for a given behaviour (Q2) or situation (Qs 3, 4 and 5). This is not ideal because the behaviours you are comparing are to some extent dependent upon each other.

The problem is worst for questions like Q1 because you are directly comparing the values of two or more behaviours which are affecting each other’s readings, so they really aren’t independent at all. If your question is not “how does the suite of behaviours change” but instead “how does this specific behaviour change”, then you are on firmer ground because your question is now not about the relative proportions of behaviours (which affect each other) but about how one single behaviour changes in two independent situations.

4.3 Solutions
4.3.1 Studies focusing on one behaviour
For focused studies involving just one behaviour category (e.g. stereotypy) compared between independent situations (such as between individuals, or with and without enrichment), we recommend using randomisation tests (see section 2). However, if your data meet their assumptions (see section 5.4), t-tests or ANOVAs may also be appropriate.

4.3.2 Studies involving a suite of behaviours
For research questions involving suites of behaviours and how their relative proportions differ across individuals, or if there is no specific hypothesis about any particular behaviour but instead a general hypothesis that behaviour will change under different conditions, use a categorical test such as the chi-squared test or, better, the related but more versatile G-test (Fowler et al., 1999). These tests assume that the readings are divided amongst a number of categories and cannot fall into more than one category at once. This is clearly only appropriate for behaviours that the animals cannot perform simultaneously. We therefore strongly advise researchers to define an ethogram for the study such that it is not possible for an individual to perform more than one behaviour at the same time. This may seem difficult to do, but instead of having two categories of behaviour, such as "eat" and "run", which could be carried out simultaneously, you would have a single category "eating while running" and distinguish that from "eating while sitting". If this makes your number of categories too large, then consider carefully whether your question actually requires separation of the two types of eating at all.

As these tests are based around asking about the probability of an animal (or animals) performing a particular behaviour, or being in a particular state, they cope well with different sample sizes in the experiment (for example if fewer observations have been made on one individual than on another). A major advantage of such tests over simple chi-square contingency tests is the possibility of extending them to multiple factors. So imagine that we are interested in whether visitor numbers affect the way in which a chimp divides its time between sleeping, feeding and other activities. If we observe the chimp during periods with high, medium and low visitor numbers, and note which activity it is engaged in, a simple contingency test would allow us to examine whether the probability of engaging in the three different activities is independent of visitor number. However, such a study on a single individual might have limited generality. Now suppose we repeat the study on several chimps. Multi-way tests allow us to include chimp identity as a second factor in our analysis, which allows us not only to test whether visitors have effects on behaviour, as in the previous study, but also whether any effects are consistent across animals.
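As a sketch of how such a test works on a simple two-way table, the G statistic can be computed directly from observed and expected counts. The behaviour categories, visitor levels and counts below are invented for illustration; many packages offer the test built in (for example, scipy.stats.chi2_contingency with lambda_="log-likelihood" performs a G-test).

```python
import math

# Hypothetical counts of scans in which one chimp was observed in each
# activity, split by visitor level (columns: low, medium, high visitor
# numbers). These numbers are invented for illustration.
observed = {
    "sleeping": [20, 15, 10],
    "feeding":  [30, 35, 25],
    "other":    [50, 50, 65],
}

def g_test(table):
    """Return (G, degrees of freedom) for a two-way contingency table.
    G = 2 * sum(O * ln(O / E)), with E the usual expected counts
    computed from the row and column totals."""
    rows = list(table.values())
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    n = sum(row_totals)
    g = 0.0
    for row, r_tot in zip(rows, row_totals):
        for obs, c_tot in zip(row, col_totals):
            expected = r_tot * c_tot / n
            if obs > 0:  # a zero cell contributes nothing to G
                g += obs * math.log(obs / expected)
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return 2 * g, dof

g, dof = g_test(observed)
print(f"G = {g:.2f} with {dof} d.f.")
```

The resulting G is compared with the chi-squared distribution; with 4 degrees of freedom the 5% critical value is 9.49, so a smaller G gives no evidence that activity depends on visitor level.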

Multi-way categorical tests are available in many statistics packages, such as SPSS (where they are called log-linear models) and Minitab (where they are called nominal logistic regressions). This approach allows several different factors to be included in the analysis and, in the case of logistic regression, also allows continuous variables to be included as predictors. Whilst the details of these tests may differ slightly, they take the same basic approach and make the same general assumptions. Essentially, these models examine whether the probability of observing any particular outcome, such as a particular behaviour, depends on the factors in the model and any interactions between them. Thus, in the above study, suppose we have data on four chimps recording, for each of a number of observations, the behaviour that the chimp was performing and the visitor density outside the enclosure (measured categorically as low, medium or high). We wish to ask whether visitor number has any effect on the behaviour of the chimps.

Since we have multiple observations on several chimps, the first step in our analysis is to examine whether the behaviours of the individuals in the study change consistently in response to the visitor level. Statistically, this amounts to asking whether there is a significant interaction between individual and the number of visitors, or whether the effect of the number of visitors on the probability of observing an individual performing a particular behaviour is independent of the individual concerned. If no interaction is found (i.e. all chimps respond the same way to visitor numbers), it is then possible to ask general questions about how the behaviour of these animals changes in response to visitor number. On the other hand, if an interaction is found (i.e. visitor level is affecting the behaviour of the chimps, but in different ways for different chimps), then no general statements can be made about the effects of visitor level, but you have discovered that individuals differ in their response, an important finding in itself. To analyse this further you will have to analyse each chimp separately, and any more detailed conclusions will apply only to that individual.

4.4 Limitations
4.4.1 Independence of observations
Often neglected requirements of the G-test (and the χ²-test) are that the samples must be random and the objects counted must be independent. The latter requirement is fulfilled if every data point is sampled from a different individual. In studying time budgets this is obviously not possible, so the best we can do is make sure that the data points are not autocorrelated (see section 5.1). The requirement of randomness implies that your data should be representative, and hence data collection should not (for example) be triggered or prolonged by the expectation of getting particularly interesting results. That means you should not prolong a data collection period because the animals are being particularly interesting, or cancel or curtail data collection because they seem to be behaving in a less interesting way at that time.

4.4.2 Minimum number of observations
A well-known requirement of the G-test (and the χ²-test) is that the total sample size must be big enough that all expected frequencies exceed a certain value. This value differs from one publication to another; a very conservative stipulation would be 5.0. Unfortunately, some behaviours may be observed very rarely, either across all individuals or by a particular individual. These empty cells in the design make statistical inference difficult (how can you say much about a behaviour you rarely observe?). Furthermore, there is a real practical issue that the models will be hard to fit if cells have very low numbers of observations. The best solution is to design your study more carefully, based on initial observations, only including behaviours that are frequently seen. However, some rare behaviours may be important (such as mating) and you may need to include them. A solution of last resort may be to pool observations of behaviours that are rarely observed into a single category of rare behaviours, ideally based on some a priori rule. However, we would caution against arbitrary pooling of categories if possible. Also, if you are interested in the rare behaviours themselves, pooling may not be an option, and this part of your data set will not be suitable for analysis by means of categorical tests.
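A quick way to check this requirement before running the test is to compute the expected counts from the row and column totals and flag any cell below the chosen threshold. The table and the threshold of 5.0 below are illustrative:

```python
# Flag cells of a contingency table whose expected frequency falls below
# a threshold (5.0 is the conservative value mentioned above).
def low_expected_cells(table, threshold=5.0):
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    n = sum(row_totals)
    return [(i, j)
            for i, r_tot in enumerate(row_totals)
            for j, c_tot in enumerate(col_totals)
            if r_tot * c_tot / n < threshold]

# Invented counts: a rarely seen behaviour (first row) produces
# expected frequencies below 5 in both of its cells.
table = [[2, 3],
         [10, 40]]
print(low_expected_cells(table))  # → [(0, 0), (0, 1)]
```

If any cells are flagged, consider pooling categories (on a priori biological grounds) or collecting more observations before testing.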

4.4.3 Problems of fitting the models
Possibly the major practical problem facing the use of such tests is that the model fitting procedures used to carry out the tests are computationally intensive, and will often not converge for a particular data set. That is, the statistical software package will either fail to calculate your required statistics (coming back with an error message) or produce results with VERY wide confidence intervals. This will be particularly problematic with small data sets, data sets with several categories with low numbers of observations, and data sets with many factors. Whilst this may be viewed simply as an annoying computational problem, it is worth remembering that the situations where algorithms have problems converging are exactly those where robust statistical inference from the data is likely to be limited anyway. In these situations, collapsing categories of behaviour (based on sensible biological criteria) may help. Alternatively, redesigning your study to answer more focused questions with fewer categories may be the only way forward.

5. General Issues

5.1 Autocorrelation, temporal independence, and sampling regime
S. Dow, J. Engel and H. Mitchell

5.1.1 The problem
Most statistical tests require that data points are independent. If data are collected over time on the same animal (or group) then it is likely that the measurement at one point in time has been affected by the previous one. For example, the level of a reproductive hormone circulating in a female chimpanzee will not change instantaneously, and if two measurements of hormone level are made in quick succession, the value at the second measurement cannot be independent of that at the first. This is equally true of behaviour. It is important that the chosen sampling regime minimises the likelihood of autocorrelation, thus ensuring temporal independence of the data points.
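One simple diagnostic (a sketch, not part of the original guidelines) is the lag-1 autocorrelation coefficient of a series of repeated measurements: values near zero are consistent with temporal independence, while values near 1 indicate that successive samples largely repeat each other. The 0/1 series below are invented.

```python
def lag1_autocorrelation(x):
    """Correlation between each value in the series and the next one:
    sum of cross-products at lag 1 over the sum of squared deviations."""
    mean = sum(x) / len(x)
    num = sum((a - mean) * (b - mean) for a, b in zip(x, x[1:]))
    den = sum((a - mean) ** 2 for a in x)
    return num / den

# 0/1 record of whether a behaviour was seen at each sample point.
clumped     = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
alternating = [0, 1] * 8

print(lag1_autocorrelation(clumped))      # strongly positive: samples too close together
print(lag1_autocorrelation(alternating))  # strongly negative: systematic alternation
```

A strongly positive value for the clumped series suggests the sample interval is much shorter than the typical bout length, i.e. the points are not independent.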

5.1.2 Sampling regimes for animal behaviour
To compare behaviour between individuals or under different circumstances in a meaningful fashion the behaviour has to be quantified; how this is done should be determined by the question being addressed. Producing an exact record of behaviour requires continuous recording, measuring the duration and frequency of each behaviour under consideration. In practice continuous recording relies on the observer being able to see the animal throughout the recording period, and fewer categories of behaviour can be recorded. Time sampling describes data collection schemes in which data are recorded at predetermined points in time during the observation period. These techniques have been used successfully in behavioural studies for many years (e.g. see Martin and Bateson, 1993). By sampling at different points in time rather than recording continuously it is possible to record the activities of more than one animal in a group. Time sampling can also be combined with other behaviour recording methods such as focal animal sampling. However, time sampling techniques have been criticized on the grounds of non-independence of samples.

Choice of sampling method will depend on the behaviour patterns of interest: for example, some form of time sampling may be appropriate for behavioural states (i.e. long duration activities such as resting), whilst a form of continuous sampling is usually better for behavioural events (i.e. short duration or discrete activities, such as a vocalisation or scent marking).

Various time sampling techniques will be discussed with their limitations so that methods can be selected to avoid many of the pitfalls, but other sampling methods are outlined first:
• Ad libitum sampling – sampling decisions are made by the observer and notes are made of behaviour without reference to a timeframe. These data may be in the form of field notes.
• Continuous sampling – sampling is based on changes in behaviour; behaviour and time are recorded each time the behaviour changes. Three subtypes can be described:
  o All occurrence sampling – every occurrence of one or a few behaviours is recorded. This is often the best method for short duration events. It can involve simply counting the occurrences, or counting and timing them. If timing is required it is often best done using a computer and, for instance, THE OBSERVER software, to get a real time record.
  o Focal (animal) sampling – all behaviours of one individual (the focal animal) are continuously recorded.
  o Sequence sampling – the focus is a sequence of behaviour patterns, which may involve one animal or a group of animals.

Time sampling methods
In time sampling data are recorded at particular intervals, but what behaviour is recorded may vary with the type of time sampling. Four major types of time sampling are:
• Predominant activity sampling – the activity that occupied most of the preceding time interval, or that lasted for more than 50% of it, is recorded, depending on the definition chosen.
• Whole interval sampling – the behaviour is recorded only if it occurs at the start of the time interval and continues to the end of it.
• One-zero sampling – if a particular predetermined behaviour or selected behaviour pattern occurs during the interval then it is recorded, regardless of repetitions during the interval.
• Instantaneous sampling – only the behaviour occurring at the end of the time interval is recorded. If applied to more than one individual the group can be scanned and data recorded from several animals, often referred to as scan sampling.
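The biases these methods introduce can be seen by applying two of them to the same synthetic second-by-second record. The behaviour sequence and the 5 s sample interval below are invented for illustration:

```python
# One second-by-second record of a single animal's behaviour (invented):
# 'A' = resting, 'B' = a short-duration behaviour of interest.
record = "A" * 10 + "B" * 3 + "A" * 12 + "B" * 2 + "A" * 20 + "B" * 3 + "A" * 10

def instantaneous(rec, interval, target):
    """Proportion of sample points (taken at the end of each interval)
    at which the target behaviour was occurring."""
    points = [rec[i] for i in range(interval - 1, len(rec), interval)]
    return points.count(target) / len(points)

def one_zero(rec, interval, target):
    """Proportion of intervals during which the target behaviour
    occurred at least once."""
    chunks = [rec[i:i + interval] for i in range(0, len(rec), interval)]
    return sum(target in chunk for chunk in chunks) / len(chunks)

true_prop = record.count("B") / len(record)
print(f"true proportion of B: {true_prop:.3f}")
print(f"instantaneous, 5 s:   {instantaneous(record, 5, 'B'):.3f}")  # underestimates here
print(f"one-zero, 5 s:        {one_zero(record, 5, 'B'):.3f}")       # overestimates
```

With these invented data the true proportion of B is 0.133, the instantaneous estimate is 0.083 and the one-zero estimate is 0.250, illustrating the tendency of one-zero sampling to overestimate time spent in short-duration behaviours.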

Examples of data collected using each method are illustrated in Box 1. Each of the above time sampling techniques has its limitations which are discussed below.

Box 1: How to apply different time sampling techniques for animal behaviour
The figure below depicts the activities of a single animal. The behaviour is subdivided into six bouts of three different behaviour patterns A, B and C. The behaviour is sampled using four different time sampling techniques, resulting in four different recordings. The vertical lines represent the predetermined sample points which terminate the respective sample intervals. In the case of whole interval sampling and predominant activity sampling, situations may occur where it is impossible to score an interval. These are marked with question marks in the example. It should be noted that actual sheets for data recording should be designed differently from the illustration.
[Figure: continuous record of six bouts of behaviours A, B and C, with the four resulting time-sampling recordings beneath; not reproduced here.]

i) Predominant activity sampling
It may be difficult to determine the predominant activity. If long sample intervals are used this method tends to overestimate mean bout length. Short bout durations and long sample intervals will result in inaccurate estimates of frequency (Tyler, 1979). Results from predominant activity sampling are similar to those obtained from instantaneous sampling but less accurate. Predominant activity sampling may be of use during pilot studies but is not recommended otherwise.

ii) Whole interval sampling
Whole interval sampling underestimates activities of short duration, as no data are recorded when activity changes during the inter-sample interval. If bout lengths and inter-bout intervals have a normal, exponential or Weibull distribution it is possible to calculate estimates of duration and frequency of behaviour from the data (Ary and Suen, 1986; Quera, 1990). Whole interval sampling may be of use during pilot studies but is not recommended otherwise.

iii) One-zero sampling
Scores from this method are a composite and can be strongly correlated with both frequency and duration of behaviours. However, the original data do not reflect true frequencies of behaviour but frequencies of intervals. If bout lengths and inter-bout intervals have a normal, exponential or Weibull distribution it is possible to calculate estimates of true duration and frequency of behaviour from the data (Chow and Rosenblum, 1977; Ary and Suen, 1986; Suen, 1986; Quera, 1990). One-zero sampling is easy to learn and apply and has high inter- and intra-observer reliability (Powell et al., 1977; Rhine and Flanigon, 1978). It is a very useful method where the presence or absence of a particular behaviour is important. One-zero sampling is also useful where it is difficult to define the start and end points of a particular behaviour bout, or for studies focusing on intermittent behaviour that may start and stop rapidly (e.g. play). One-zero sampling should be used with caution, as the method overestimates the duration of a behaviour pattern and underestimates the number of bouts performed (Tyler, 1979). If long sample intervals are used one-zero sampling tends to overestimate the mean bout length. Long sample intervals combined with short bout durations will overestimate the percentage of time spent in a particular behaviour. However, some correction factors are available (e.g. Simpson and Simpson, 1977). One-zero sampling may be useful for specific studies within the limits noted above, or for comparison with other data collected using this method.

iv) Instantaneous (scan) sampling
Instantaneous sampling is usually applied to one individual; scan sampling is an extension of the method to cover several individuals. These methods are easy to learn and apply so long as the behaviour patterns can be distinguished quickly and easily. Inter- and intra-observer reliability is usually high (Altmann, 1974). Martin and Bateson (1993) argue that the inter-sample interval should be as short as possible to approximate continuous recording and estimate durations of activities. Accuracy will also be affected by the average bout length and average inter-bout interval (the longer the better). With long inter-sample intervals instantaneous and scan sampling overestimate mean bout length. Long sample intervals with short bout durations give poor estimates of frequency (Tyler, 1979), but short sample intervals are more accurate (Powell et al., 1977). Generally the estimates of bout length are more accurate than those from one-zero sampling (Altmann, 1974; Dunbar, 1976; Powell et al., 1977; Tyler, 1979). If bout lengths and inter-bout intervals have a normal, exponential or Weibull distribution it is possible to calculate estimates of true duration and frequency of behaviour from the data (Chow and Rosenblum, 1977; Ary and Suen, 1986; Suen, 1986; Quera, 1990). Instantaneous and scan sampling are recommended for studies of time budgets, activity cycles, behavioural synchrony, or spatial patterns (e.g. nearest neighbour). However, neither of these methods is suitable for investigating short duration events.

5.1.3 Determining optimum sample interval
All time sampling methods depend on sampling at predetermined points in time to avoid biased results, so determining the optimum sampling interval is crucial. Ideally, to avoid autocorrelation the sampling interval should exceed the maximum bout length; however, this would not give a good approximation to a continuous record of behaviour, and many short to medium duration behaviour categories would be undersampled. The optimum interval will depend on the mean bout length, mean inter-bout interval, size of group under study and the statistical analysis planned. It is very easy to generate large numbers of data points that are not independent of each other, e.g. for a slow-moving animal that may have bouts of a particular behaviour lasting several minutes. Shorter intervals result in more accurate measurements as they approach continuous sampling, but sampling behaviour too frequently will result in runs of a particular behaviour that are not independent of each other because they are part of the same activity bout, thus inflating the sample size (Bernstein, 1991; Wirtz and Oldekop, 1991). Longer inter-sample intervals require more observation time to collect appropriate volumes of data, during which time other confounding variables may change.

At least four procedures have been developed to determine optimal sample intervals.
• Bernstein (1991) proposed a calculation based on mean bout length and its standard deviation to avoid temporal autocorrelation. In practice measuring mean bout length can be difficult and time consuming. This method will overcome autocorrelation but does not take account of discrepancies with a continuous record.
• Martin and Bateson (1993) propose two solutions. The first involves a pilot study in which a large sample of behaviour is recorded using continuous sampling. From this, scores are then calculated for each behaviour as if they had been sampled using the proposed time sampling method at different intervals. The percentage difference between the true score and each different interval score is calculated and compared. They suggest accepting an interval length that captures most behaviour patterns (e.g. 90%) but yields small percentage differences (e.g. 10%) from the continuous sampling results. This method minimises discrepancies between continuous and time sampling but does not address possible autocorrelation.
• Martin and Bateson's (1993) second solution is that individual sample points collected over short intervals within an observation session are not treated as statistically independent measures; instead they are used to give a single score for the whole session. For example, if in a 40 minute period divided into 2 minute sample intervals (so 20 sample points) an animal was recorded as grooming at 6 of the sample points, the score would be 6/20 = 0.3. It is the session scores, rather than the individual data points, that are then used for statistical analysis. In this case the interval between consecutive observation sessions should be at least the maximum bout length (determined in a pilot study) to ensure that session scores are independent. This method addresses both accuracy and autocorrelation and is very common in zoo studies. However, it cannot be used for categorical analyses such as G-tests (see section 4) since the session scores are not counts. You can still perform categorical analyses if you use counts of the number of sessions for which the scores (or counts, if all sessions are the same length) fall into certain categories, e.g. in the above example the number of sessions in which grooming was recorded 0-5 times, 6-10 times, 11-15 times and 16-20 times.
• Engel (1996) has produced a method that attempts to find an interval at which the data both reflect the accuracy of continuous sampling and avoid autocorrelation. Details are given in Box 2 below.
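Martin and Bateson's second solution can be sketched in a few lines. The 0/1 sample points below are invented; 1 means grooming was recorded at that sample point:

```python
# Each inner list is one observation session of 20 sample points
# (2 min intervals over 40 min); 1 = grooming recorded at that point.
# These data are invented for illustration.
sessions = [
    [0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0],
]

# One score per session: the proportion of sample points at which the
# behaviour was seen. These three scores, not the 60 raw sample points,
# are what enter the statistical analysis.
scores = [sum(s) / len(s) for s in sessions]
print(scores)  # → [0.3, 0.45, 0.2]
```

The first session reproduces the worked example in the text: grooming at 6 of 20 sample points gives a session score of 6/20 = 0.3.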

Box 2: How to determine an optimum sample interval (Engel 1996)
The whole procedure comprises the following consecutive steps.
1. Several pilot protocols (ten, in this example) are recorded using continuous sampling.
2. From each pilot protocol many pseudoprotocols with increasing sample interval are generated for each behaviour pattern under investigation. This means that from each continuous protocol new pseudoprotocols are calculated as though they had been recorded using instantaneous sampling with various different sample intervals. The first pseudoprotocol may reflect the behaviour shown every other second, the next one the behaviour shown every third second, and so on. The longer the interval of a pseudoprotocol the fewer data it contains compared with the original pilot protocol.
3. The correlation between the behaviour frequency calculated from the continuous protocols and that calculated from each pseudoprotocol is determined for every interval length considered (2 seconds, 3 seconds, … in this example) using the Spearman rank correlation coefficient (with N = 10 in this example).
4. A special one-sided runs test checks whether successive data points in a pseudoprotocol are statistically independent. If temporal autocorrelation exists, there are fewer runs (successions of scans at which the behaviour under investigation was noted down) than expected by chance. According to Stevens (1939) the probability P of getting at most x runs is

P = \sum_{r=1}^{x} \frac{f! \, (f-1)! \, (n-f)! \, (n+1-f)!}{n! \, r! \, (r-1)! \, (f-r)! \, (n+1-f-r)!}

where f represents the absolute frequency of the behaviour pattern, r the number of runs in the pseudoprotocol, and n the number of scans in the pseudoprotocol. If the probability P is greater than 0.1, we may conclude that the events occur in a random sequence. For example, if we look at a sequence of three behaviour patterns recorded at 10 sample points, B B C A A A A A B B, the probability of getting at most two runs of B is P = 0.0333 + 0.3 = 0.3333. This indicates that behaviour B occurs in a random sequence, whereas behaviour A does not (P = 0.024).
5. The Spearman correlation coefficients (from step 3) and the P values (from step 4) are plotted as a function of sample interval length. The optimum sample interval is one where the correlation coefficient is close to 1.0 and the probability for independent data P is higher than 0.1.
6. Steps 3 to 5 are performed independently for each behaviour pattern that will be studied afterwards.
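Step 4 can be checked with a direct implementation of Stevens' formula (a sketch; f, r and n are as defined above):

```python
from math import factorial as fact

def runs_probability(x, f, n):
    """Probability of observing at most x runs of a behaviour recorded
    f times in n scans, under the hypothesis of a random sequence
    (Stevens, 1939). Sums the formula above over r = 1..x."""
    total = 0.0
    for r in range(1, x + 1):
        num = fact(f) * fact(f - 1) * fact(n - f) * fact(n + 1 - f)
        den = (fact(n) * fact(r) * fact(r - 1)
               * fact(f - r) * fact(n + 1 - f - r))
        total += num / den
    return total

# The worked example above: B B C A A A A A B B (n = 10 scans).
print(round(runs_probability(2, 4, 10), 4))  # behaviour B: 0.3333 (random sequence)
print(round(runs_probability(1, 5, 10), 3))  # behaviour A: 0.024 (not random)
```

Both values reproduce the numbers quoted in the worked example, so the reconstructed formula and the text agree.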

The figures below (from Engel 1997) show the final graphs for two behaviour patterns of scimitar-horned oryx (Oryx dammah). Possible optimum sample intervals for each behaviour are marked by a shaded area. The resulting common optimum sample interval is two minutes.

[Figure: two panels, 'Standing' and 'Locomotion', each plotting the Spearman rank correlation coefficient and the probability for independent data against sample interval (0–60 min).]

5.1.4 Other sampling issues
i) How long should the study be?
This depends on your research question. For many studies the experimental manipulation is only expected to have a very short effect, so observation times can be very short. For other studies the intuitive answer is probably the longer the better. However, a more precise answer can be obtained by plotting the cumulative number of different behaviours observed against the number or total time of observations. Such a plot (open symbols, fig. 5.1) will result in an asymptotic curve, and once the asymptote is reached it is unlikely that new behaviours will be recorded. Alternatively the mean % time spent performing the key behaviour can be plotted against the number or total time of observations (closed symbols, fig. 5.1). Again this should reach a stable value as observation time increases. If no new behaviours are being recorded, or the particular behaviours selected for study are not changing in frequency, then further data collection may not provide any more accuracy.

[Figure: number of behaviour types seen (open symbols) and mean % time performing behaviour X (closed symbols) plotted against hours of observation (1–21).]
Fig. 5.1 The total number of different behaviours seen and the frequency of key behaviour X are both stable after about 18 hours of observation, suggesting that this is the observation time required for each phase of this study.

ii) Dealing with out-of-sight
There is perceived to be a problem if your animal can get to a part of the enclosure where you cannot see what it is doing. If it is carrying out some activity there, such as grooming, you are underestimating the total grooming because you cannot see some of it happening. This is simply a fact of life which cannot be avoided by any clever statistical method, so if you really cannot arrange to see the animal (e.g. by video) in its hidden area, then you have to alter the question you are answering from, e.g., "What does the animal do between 2pm and 3pm?" to "What does the animal do while visible between 2pm and 3pm?". However, it is worth recording the category "out of sight" anyway, and using it like any other behaviour, because if the proportion of time the animal spends in that area changes suddenly after, for instance, an enrichment is added, that may tell you something about the animal's attitude to the enrichment.

5.1.5 Time series analysis
Time series approaches to analysing phase designs (e.g. ABA), such as ARIMA (Box and Jenkins, 1976), have the advantage that they can deal effectively with autocorrelated data. However, they are extremely complex and require a very high degree of statistical experience and mathematical ability. They are therefore beyond the scope of these guidelines, and we do not recommend these methods to most zoo researchers as a solution to the problems of autocorrelation.

5.1.6 Summary
Non-independence of data across time (autocorrelation) is a serious problem for most statistical tests. The best way to avoid it is to design the sampling method so that autocorrelation does not arise. However, this has to be balanced against the other limitations of recording methods, in particular, for common zoo studies, obtaining an accurate record of an animal's behaviour pattern. There are a number of methods of recording animals' activities and of avoiding autocorrelation and biases in the data; the optimum combination will depend on the hypothesis being addressed.

5.2 Social Independence
S. Wehnelt, H. Buchanan-Smith, G.D. Ruxton and N. Colegrave

5.2.1 The problem
The majority of zoo animals live in social associations and are usually kept in pairs or mixed-sex and/or mixed-age groups. Most collections keep only one group of a given species due to limited resources. Strong links between zoos on breeding programmes also mean that animals of an endangered species are often scattered in zoos all around the world. Although group sizes at a zoo can be relatively high (e.g. flocks, groups of small primates), the data sets taken from the individuals of a group cannot be regarded as strictly independent. This is mainly because one individual has the potential to influence others in the group. For example, when a chimpanzee spots a popular food item in the enclosure, this can cause others to forage too, or an aggressive encounter between two individuals can induce other group members to join in.

Independence between data points is a requirement of most statistical tests (parametric, non-parametric and randomisation tests alike). Strictly speaking, many non-zoo studies also violate this assumption when, for example, fish are sampled from the same lake or primates of the same group are studied in a forest patch. Data independence can be regarded as a continuum between totally independent (e.g. individuals of the same species sampled in geographically separated areas) and very dependent (e.g. the same individual sampled repeatedly over a short time period). Rather than dismissing statistics altogether, you should identify on biological grounds where each case stands on this continuum, whether this matters for the research question, and what the best available approach is. Statistics are a powerful summary of trends in data and are less subjective than 'eye-balling' raw data or graphs. However, the constraints on the validity of statistics need to be critically evaluated, and the implications for the generality of any tendencies found need to be discussed on a biological basis (Ruxton and Colegrave, 2003). For example, if the sample size is small and the P value of the chosen test is slightly greater than 0.05, the result might still be interesting to discuss on a biological basis. Thus, it could be the case that all study animals but one showed the same tendency (e.g. an increase in stress hormone level after an environmental change). The discussion could highlight that all individuals but one showed a stress reaction, and the possible reasons why the one animal did not (e.g. it was the only one already familiar with the change, or it did not utilise the area of the enclosure where the alteration was made). As the experiment might be time- and resource-consuming to repeat, the results might still be important to other researchers even if they did not reach statistical significance.

5.2.2 Possible solutions – maximize group number
One possibility to avoid dependence of data is to increase the number of independent groups studied. In the case of zoo studies, this means visiting a sufficient number of zoos to meet the statistical requirements of the test chosen for the particular scientific question. In each zoo just one member of the group might be sampled (or an average taken over several individuals). For example, if the responses of male-female pairs of tamarins to predators are studied, and the question is “How long does it take for males versus females to spot a stuffed predator hidden in the enclosure?”, social independence may be critical. If the male always finds the ‘predator’ first and alerts the female, then little difference may be found between the sexes in the latency to spot the predator. Here, the male and female cannot be treated as independent, and it may be best to record only the latency of the ‘first to respond’ and perform the study on multiple groups. For other studies, for example on activity budgets, more than one member of each group should be observed. If only one individual per group is sampled, it is questionable whether this animal represents the rest of its group (low external validity; Bart et al., 1998). The group may consist of individuals of different age and sex classes, and all individuals, or a selection of individuals, may need to be recorded (using a stratified sampling framework if necessary – e.g. to include equal numbers of males and females, juveniles and adults etc.). A mean can then be taken to get a better representation of the true behaviour. The mean will also be less affected by random variation than a single individual, so it is generally better to collect data on more individuals within the group. If the data are not symmetrically distributed and/or there are outliers, the median may be a better choice than the mean.
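The advice above on averaging within a group can be sketched in a few lines; the activity scores below are invented for illustration, with one outlying individual.

```python
import statistics

# Hypothetical percentage-of-time-active scores for six group members;
# one individual is an outlier, spending far more time active than the rest.
activity = [12.0, 14.5, 13.2, 11.8, 15.1, 48.0]

mean_activity = statistics.mean(activity)      # pulled upwards by the outlier
median_activity = statistics.median(activity)  # robust to the outlier

print(round(mean_activity, 2))    # 19.1
print(round(median_activity, 2))  # 13.85
```

Here the median (13.85) represents the typical group member far better than the mean (19.1), illustrating why the median is preferable when the data are skewed or contain outliers.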

Another possibility to avoid social dependence problems is to select multiple species of the same taxonomic group. For example, if the visitor effect on small primates is of interest, all the different groups of small primates represented in one zoo can be studied and each treated as an independent data point. However, in most cases it is impractical, and sometimes impossible, to study multiple species or more than one group of a species in multiple zoos, especially when funds are limited or the species is rare or secretive. Further, different zoo settings may present too many confounding variables that cannot be controlled for, making a direct comparison of the data impossible.

5.2.3 Possible solutions – only one group available
There can be benefits to studying just one group, in one location. If the research question refers to only one particular group of zoo animals (e.g. access of group members to food; the effect of enrichment), general assumptions about a greater population are not actually necessary and groups at other zoos do not need to be sampled. In these cases, high external validity of the data is not required. For example, if one is interested in finding out whether individuals in a specific group are positively affected by a new enrichment, then it is not important to question whether the animals are independent or not. The fact that animal A affects animal B’s use of the enrichment in such a way that both benefit will not affect the interpretation of the results; one is simply interested in whether all individuals benefit. The observation may need to be repeated to ensure the enrichment is still effective if one of the animals is removed, as he/she may have been the facilitator, but the immediate question, whether the enrichment benefits this specific group with its current composition, has been answered; the aim was not to extrapolate the findings to the wider population. Statistical tests can be performed (violating the assumption of independence), but it must be explained why the lack of social independence is unimportant, for the reasons given above. Further examples of studies where interpretations are less affected by social dependence are studies of other environmental factors, such as temperature, visitor density or food distribution, on a specific group of animals. In some cases, follow-up studies on other groups can then determine whether the same trends hold for the wider population.

In cases where we are limited to just one group and extrapolation to the general population is important, we should endeavour to collect data in such a way as to maximise independence, allowing us to perform statistics and interpret the findings more generally beyond the group. However, it is extremely difficult to determine whether the behaviour of individuals in the same group is independent or not. It should also be noted that non-independence may apply to adjacent groups, e.g. vocalisations from one group may impact on others within hearing range. Significant positive correlations between the behaviour (e.g. foraging) of individuals over different sessions will indicate whether the individuals are behaving similarly. However, determining social dependence is not that simple, as animals may synchronise different behaviours simultaneously, e.g. a male always stands guard while his female feeds. Therefore, it is best to design the study in such a way as to avoid dependence in the first place.
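One way to examine behavioural synchrony, as suggested above, is to correlate the session-by-session behaviour of two individuals. The sketch below uses invented foraging proportions and a hand-rolled Pearson coefficient so it is self-contained; in practice a rank correlation from a statistics package may be preferable.

```python
import math

# Hypothetical proportion of scans spent foraging by two group members
# across eight observation sessions.
animal_a = [0.42, 0.55, 0.31, 0.60, 0.48, 0.25, 0.52, 0.38]
animal_b = [0.40, 0.58, 0.35, 0.63, 0.45, 0.28, 0.50, 0.36]

def pearson(x, y):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r = pearson(animal_a, animal_b)
# A correlation near +1 suggests the two animals forage in synchrony,
# i.e. their data points should not be treated as independent.
print(round(r, 3))
```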

The best way to avoid strong dependence in the data is to study animals at different times. Scanning the behaviour of the entire group at the same time should therefore be avoided. Scanning can be replaced by focal animal sampling (Martin and Bateson, 1993), with breaks placed between samples. To identify the best sampling interval, a pilot study should be carried out to produce a histogram of bout lengths for the behaviour categories of interest (Martin and Bateson, 1993). The sampling interval should well exceed the mean bout length of the behaviours studied (see section above on autocorrelation and temporal independence). Therefore, if individuals are used as replicates, focal sampling with randomly selected individuals should be used to increase the social independence of the data.
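As a sketch of this advice, a pilot study's bout lengths can be used to choose the sampling interval; the bout lengths below and the doubling rule are assumptions for illustration, not prescriptions.

```python
import random
import statistics

# Hypothetical bout lengths (in seconds) of the behaviour categories
# of interest, recorded during a pilot study.
bout_lengths = [35, 50, 20, 45, 60, 40, 30, 55]
mean_bout = statistics.mean(bout_lengths)

# One conservative choice (an assumption here): double the mean bout
# length, so successive samples rarely fall within the same bout.
sampling_interval = 2 * mean_bout
print(mean_bout, sampling_interval)  # 41.875 83.75

# Randomly ordered focal individuals, so each sampling session does not
# always start with the same (possibly dominant) animal.
individuals = ["adult male", "adult female", "juvenile 1", "juvenile 2"]
focal_order = random.sample(individuals, k=len(individuals))
```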

5.2.4 Summary
To sum up, the importance of social independence really depends upon the research question. If the interpretation is to be extended to the population at large, a sample should be taken of several groups from the population (but other potential confounds need to be considered). If only one group is accessible, but the data need to be treated as independent, then a sampling design that minimises non-independence should be applied, and the extent of dependence between data points measured, or discussed based on previous studies and biological reasoning. Alternatively, if only one specific group is of interest, then independence is less important. However, one should be aware that assumptions of the statistical test are being violated and an explanation should be provided of why this is not of importance for the interpretation in terms of the aims of the study and/or the likely extent of dependence.

5.3 Multiple test corrections C.A. Caldwell, G.D. Ruxton and N. Colegrave

5.3.1 The problem
In statistical testing we take samples of wider populations and use these samples to make inferences about the populations of interest. Sometimes, by chance, our samples mislead us. This chance can be reduced, for example by increasing sample sizes, but can never be wholly eliminated. There are two potential types of error: Type I, where we reject the null hypothesis (of no effect) when there is in fact no real effect, and Type II, where there is a real effect but we do not reject the null hypothesis. By convention, we control the rate at which Type I errors occur by setting a significance level (commonly denoted α and set at 0.05). Adopting this convention means that if we consider P values less than 0.05 to be indicative of a real effect (violating the null hypothesis) then we will be correct at least 95% of the time.

However, if we are right 95% of the time, then we are still wrong one time in twenty. Now consider a situation where a researcher wishes to carry out multiple tests; Chandler (1995) suggested that the mean number of statistical tests in an Animal Behaviour paper was around 35. Not all of these tests will report P values below 0.05, but the potential for spuriously rejecting the null hypothesis at least once in a paper that reports a large number of “significant” results is certainly there. The more tests that are carried out, the more likely it is that, simply as a result of chance, a significant result is found. For example, if a researcher carried out a total of twenty tests, and the null hypothesis was in fact true in all cases, then on average one of those tests would nonetheless return a significant result. The likelihood of making Type I errors rapidly becomes unacceptably high the more tests that are carried out. This is true irrespective of which tests are used (parametric, non-parametric, randomisation etc.).
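The accumulation of Type I errors can be made concrete with a short calculation. Assuming the tests are independent, the probability of at least one spurious rejection across n tests (the familywise error rate) is 1 - (1 - alpha)^n:

```python
def familywise_error(n_tests, alpha=0.05):
    """Probability of at least one Type I error across n independent tests."""
    return 1 - (1 - alpha) ** n_tests

print(round(familywise_error(1), 3))   # 0.05  -> a single test
print(round(familywise_error(20), 3))  # 0.642 -> twenty tests
print(round(familywise_error(35), 3))  # 0.834 -> Chandler's (1995) typical paper
```

With twenty tests the chance of at least one spurious "significant" result is roughly 64%, which is why uncorrected multiple testing is so dangerous.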

García (2004) likens the process to buying lottery tickets, and states that, “In such purely random and independent events as the lottery, the probability of having a winning number depends directly on the number of tickets you have purchased. When one evaluates the outcome of a scientific work, attention must be given not only to the potential interest of the ‘significant’ outcomes, but also to the number of ‘lottery tickets’ the authors have bought. Those having many have a much higher chance of ‘winning a lottery prize’ than of getting a meaningful scientific result.” (p. 662)

5.3.2 Common mistakes
The most common way to deal with this problem when carrying out multiple inferential statistical tests is simply to bury one’s head in the sand and ignore it. Obviously, this increases the chance of making a Type I error, and therefore renders any conclusions less credible. The next most common approach is to use a Bonferroni or sequential Bonferroni correction (described below). Essentially, these corrections reduce the α values used in determining the statistical significance of P values (e.g. from P < 0.05 to P < 0.01) in individual tests, making it harder to reject the null hypothesis in each case, and so reducing the risk of spuriously rejecting it. The stricter α value is often set to maintain the overall Type I error rate at less than or equal to five percent. The problem with this approach is that it can increase the chance of a Type II error to an unacceptably high level.

5.3.3 Possible solutions

i) The simple Bonferroni method
The most straightforward approach to dealing with multiple testing is the simple Bonferroni method. An α level is selected such that the sum of the α levels used for all of the tests carried out is equal to the desired overall error rate (generally 5%). So, for example, if a researcher was to carry out five tests, and wanted an overall α level of five percent, then rather than looking for P < 0.05, she/he would need to find results for which P < 0.01 (because 0.05/5 = 0.01) before rejecting the null hypothesis in any of these individual cases. To calculate the required level for a given set of tests, the new criterion is given by:

α1 = α/n

where α1 is the new individual alpha level, α is the desired overall alpha level and n is the number of tests carried out. This method has the benefit of simplicity, but brings with it the problem of placing severe restrictions on the α level used. As mentioned above, this can lead to researchers failing to reject the null hypothesis when it is in fact false. In other words, we fail to spot an interesting result. Obviously we want to avoid this if at all possible. There are other methods which partially overcome this disadvantage of the standard Bonferroni correction while still controlling the overall error rate (e.g. see Shaffer, 1995). The simplest of these is probably the sequentially-rejective Bonferroni method.

ii) Sequentially-rejective Bonferroni (or Holm) method
This method is applied in stages (Rice, 1989). The first stage is to check whether one of the individual tests fulfils the same criterion as would be applied in the simple Bonferroni method, i.e. P ≤ α/n. If this is the case, the null hypothesis for that prediction is rejected, and the next stage begins. The next stage is to check whether one of the remaining tests has returned P ≤ α/(n-1).
If this is fulfilled, the null hypothesis for that prediction is rejected, and the researcher then looks for a result with P ≤ α/(n-2), and so on. If, however, at any point none of the remaining results reaches the new criterion, then the researcher stops and does not reject any of the remaining null hypotheses.

Example 5.1
Imagine a researcher carrying out a behavioural study who wants to test the effect of some variable (a new enrichment, perhaps) on various different behaviours. They may end up with several P values, each relating to a different behaviour. For the scores in Table 1, which concern eight different behaviours, the researcher would begin by ranking the P values, starting with the lowest. The next step would be to calculate α/n, in this case 0.05/8 = 0.00625. This is the criterion by which the lowest P value is judged; in this case, for scratching, P is lower than this value, so the null hypothesis is rejected. The next prediction is tested using P < 0.05/7 = 0.00714, and again the null hypothesis is rejected (yawning). Null hypotheses are also rejected at P < 0.05/6 (0.0083) and P < 0.05/5 (0.01), for aggression and resting respectively. However, when P < 0.05/4 (0.0125) is reached, the lowest remaining P value is higher than this criterion (locomotion, P = 0.023). This means that the null hypothesis cannot be rejected for this prediction, and the process stops: all remaining P values (foraging, pacing and grooming) are treated as non-significant.

Table 1. Example P values obtained from a behavioural study, e.g. of the effect of an enrichment device

Behaviour     P        Rank order   Conclusion
Grooming      0.4350   8            do not reject H0
Foraging      0.0494   6            do not reject H0
Resting       0.0098   4            P < 0.01, reject H0
Locomotion    0.0230   5            P > 0.0125, do not reject H0
Aggression    0.0082   3            P < 0.0083, reject H0
Scratching    0.0002   1            P < 0.00625, reject H0
Yawning       0.0063   2            P < 0.00714, reject H0
Pacing        0.0818   7            do not reject H0
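The sequential procedure of Example 5.1 can be expressed in a few lines of Python; the behaviour names and P values below are those of Table 1.

```python
# P values from Table 1, keyed by behaviour.
p_values = {
    "grooming": 0.4350, "foraging": 0.0494, "resting": 0.0098,
    "locomotion": 0.0230, "aggression": 0.0082, "scratching": 0.0002,
    "yawning": 0.0063, "pacing": 0.0818,
}

def holm(p_values, alpha=0.05):
    """Sequentially-rejective (Holm) Bonferroni: return the behaviours
    whose null hypotheses are rejected."""
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    n = len(ranked)
    rejected = []
    for i, (behaviour, p) in enumerate(ranked):
        if p <= alpha / (n - i):    # thresholds: alpha/n, alpha/(n-1), ...
            rejected.append(behaviour)
        else:
            break                   # stop at the first non-significant test
    return rejected

print(holm(p_values))
# ['scratching', 'yawning', 'aggression', 'resting']
```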

iii) Sound scientific reasoning
Traditionally, the view has been that, given the inevitable trade-off between Type I and Type II errors, it is worse to make Type I errors than Type II errors. However, there is no logical reason for this, and much will depend on the consequences of each type of error in a particular study (if the hypothesis under test is that a certain foodstuff will cause a painful disease in later years, which type of error would you prefer?).

Moran (2003) provides a short but ruthless attack on the use of the sequential Bonferroni and other multiple test corrections in ecological studies. Neither Moran nor we suggest that you should simply sweep the danger of Type I errors under the carpet. Rather, in your results section, you should present your absolute P values and accept or reject individual null hypotheses without consideration of other tests. It is in your discussion, where you interpret the biological conclusions that stem from your results, that you should address the danger of Type I errors. You should aim to convince the reader that your conclusions are unlikely to be based on Type I errors. This should be done on the basis of logic, scientific understanding of the likely mechanisms underlying results, effect sizes, experimental design and the pooling of information from a number of lines of evidence. Specifically, if you reject a null hypothesis, this is less likely to result from a Type I error if:

1. the results can be reproduced in another study;
2. the significant results arose from a planned comparison;
3. the result can be interpreted logically on the basis of accepted understanding of the processes involved.

To give an example, imagine a study where you are seeking to explain between-day variation in activity patterns in the ten big cats held in a zoo. Each day, for three months, you record the percentage of time that each cat is active, together with thirty variables covering weather, husbandry, behaviour of animals in the same enclosure, vocalisations of nearby animals, and visitor effects. Imagine further that you then carry out a number of statistical tests, and find that the only variable that appears to affect activity patterns is the percentage of visitors stopping at that animal’s enclosure who are wearing hats: the larger the fraction of hat wearers, the less active the cats. We must now consider whether this statistically significant result is likely to reveal a real effect on captive cats’ behaviour or whether it is due to a Type I error. Your intuition is probably telling you that this example is most likely a Type I error, because it seems relatively unlikely either that cats are influenced directly by hat-wearing visitors or that the same factors that influence hat wearing in zoo visitors also affect activity patterns of cats. This is certainly what you should expect editors and reviewers to conclude unless you can mount a very persuasive argument otherwise.

Perhaps you do actually believe this correlation indicates a real effect. It is possible to suggest that the combination of weather variables that influences hat wearing in humans on a particular day also affects activity in the cats (rain?). So you have the beginnings of a defence under point (3) above. In order to explore this further, you ought to go back and reconsider your results with respect to weather variables in the light of this hypothesized mechanism. In order to convince readers that you have found a real effect, you must explain how your hypothesized mechanism (of weather as a confounding variable affecting both cats and hat wearing in humans) can be reconciled with your failure to detect any effects of weather variables on cats’ activity.

Another way to convince the reader that your effect is a real one is to appeal to the repeatability argument, point (1) above. If you go out and repeat your monitoring (in the same or a different zoo) and get the same effect, that only hat wearing by visitors (and none of the other 29 variables) is linked to cat behaviour, then this is compelling evidence that there is a real underlying effect. The chance of the same spurious result arising in both cases is very low (less than 0.05 × 0.05 = 0.0025, if the original α level was 0.05) and the most likely explanation is that a real effect is at work. Repeating the study also has the advantage that we now have a single and very specific planned comparison (only hat wearing is important), rather than the vague fishing expedition of the initial study. This allows a further defence under point (2).

Even more compelling would be to mount a defence based on point (2) above but, rather than repeating your correlational study, to do an experimental study in which you manipulate hat wearing by visitors (and so reduce the chance of confounding-variable interpretations based on, say, weather). If you still find an effect of hat wearing in this manipulative study, then this is compelling evidence that there really is a real effect at work.

iv) Better experimental design
In the above example the scientific reasoning solution has put you to a lot of extra work, but the underlying problem is that you started off with a very loose experimental design in which you were testing a very large number of null hypotheses simultaneously, in a study that was not tailored to critically addressing any one of them. This is not how we would recommend that you carry out a three-month study. Rather, we recommend that you invest time in preparation: reading the literature, watching the animals, talking to the keepers, collecting some pilot data and thinking. At the end of this, you will have decided on the most likely factors influencing activity patterns. That is, you will have narrowed your scatter-gun list of thirty variables down to fewer than five. Now spend some time thinking of the best way (perhaps by manipulative experiments) to collect really definitive data to confirm or refute a small number of null hypotheses, and carry out the appropriate study. It may be that you went for the “wrong” variables and at the end of your three months you cannot explain what drives daily variation in activity patterns. This is not a disaster, because your main data collection should have had very high statistical power to detect real effects, so you can conclude with considerable certainty that several factors (that seemed a priori to be highly plausible candidates) do not appear to influence your cats’ activity. This is more valuable than the original experimental design, which is likely to leave you uncertain of the effect of thirty variables because you do not have strong statistical power to interrogate any one of them. As a side benefit of an improved experimental design, you find yourself doing far fewer statistical tests, and so Type I errors are less likely. Further, your improved design gives you a much better ability to discuss the plausibility of your conclusions in terms of the three criteria above.

5.3.4 Recommended best practice
There is no substitute for designing research in a clear and directed manner. Ideally, the researcher should be completely clear about the hypotheses they are testing, and there should be no need for ‘fishing expeditions’. Our recommendation is that you design a well-focused experiment in which multiple testing is not needed.

However, sometimes a large number of related statistical tests will simply be unavoidable, and the traditional recommendation would be to use the sequential Bonferroni correction or one of a number of similar alternatives (see Chandler, 1995 and Neuhäuser, 2004). None of these corrections, though, can resolve the inevitable trade-off between Type I and Type II errors, so our recommendation is that you do not make multiple test corrections. This is not to suggest that you can opt out of defending your interpretation of your results against accusations of Type I error. Rather, we are suggesting that your defence lies in careful experimental design, knowledge of biology and logical reasoning, rather than a statistical fix.

5.4 Parametric versus non-parametric tests C.A. Caldwell

5.4.1 The problem
Parametric statistical tests often have considerable advantages over non-parametric equivalents and sometimes over randomisation tests (see section 2). Depending on our research question, we may even find that there is no available alternative to parametric testing (e.g. for many multivariate situations, see section 3). We may, for example, wish to look for interactions between variables, as well as the independent effects of each. Or we may want to study the relative influence of multiple factors, for example in a multi-zoo study. Or we may actually want to extract parameters from our data, for example the rate at which aggression increases with day length. However, parametric statistics make certain assumptions about the data. We therefore need to be aware of whether or not the data we are going to analyse using these techniques meet these assumptions, and if they do not, we need to know what we can do to address this (see Hawkins, 2005 for further information on choosing the right test).

5.4.2 Common mistakes
A common mistake is for the researcher to go ahead and do the parametric test anyway, even though they may not be sure the data meet the assumptions of the test. The problem here is that we are then not in a position to judge whether the results obtained from the test are valid. A common alternative is simply to assume that the data do not meet the assumptions and so omit the analyses altogether, or use less powerful tests, potentially missing out on some interesting results. This is a terrible waste of a data set, and should be avoided at all costs. Another mistake is to proceed with non-parametric tests without being aware of the assumptions involved in these. Use of randomisation tests can eliminate many of these mistakes (see section 2).

5.4.3 Assumptions of parametric statistics
The major assumption of parametric tests (which does not apply to other types of test) is that variables are normally distributed, i.e. there are small numbers of very low and very high scores, the majority of scores are clustered around the middle, the mean is at the very centre of the distribution and the distribution is perfectly symmetrical. However, not all variables are normally distributed: some have significant skewness, significant kurtosis, or both (fig. 5.2). Skewness concerns the symmetry of the distribution; the mean is not at the centre of a skewed distribution. When the right tail is longer, this is referred to as positive skew, and when the left tail is longer, negative skew. Kurtosis concerns the peakedness of a distribution. A distribution with non-normal kurtosis will be either too peaked (positive kurtosis) or too flat (negative kurtosis). Other assumptions of parametric tests are that the samples have equal variances and that the data at least approximate an interval scale.

Figure 5.2. Deviations from the normal distribution (panels: Normal; Positive Skew; Negative Skew; Positive Kurtosis; Negative Kurtosis)
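The skewness and kurtosis described in section 5.4.3 can be computed directly. The sketch below uses simple moment-based estimators (one of several conventions; packages such as SPSS apply small-sample corrections, so their values will differ slightly), with invented data.

```python
import math

def skewness(data):
    """Moment-based sample skewness: zero for a symmetrical distribution."""
    n = len(data)
    m = sum(data) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in data) / n)
    return sum((x - m) ** 3 for x in data) / (n * sd ** 3)

def kurtosis(data):
    """Moment-based sample excess kurtosis: zero for a normal distribution."""
    n = len(data)
    m = sum(data) / n
    var = sum((x - m) ** 2 for x in data) / n
    return sum((x - m) ** 4 for x in data) / (n * var ** 2) - 3

symmetric = [1, 2, 3, 4, 5, 6, 7]
right_skewed = [1, 1, 2, 2, 3, 5, 20]

print(skewness(symmetric))         # 0.0  (perfectly symmetrical)
print(skewness(right_skewed) > 0)  # True (long right tail: positive skew)
print(kurtosis(symmetric))         # -1.25 (flat: negative kurtosis)
```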

5.4.4 Testing data for normality
Most statistical packages will carry out tests that tell you whether your data violate the assumptions of the test you are using (some will even do so automatically when you use the test itself). SPSS will give values for skewness and kurtosis, and their standard errors, under the Explore function (Descriptive Statistics). Histograms are also available so that you can visually inspect your data (under Explore or Frequencies). SPSS and other packages also provide the Shapiro-Wilk and Lilliefors tests for deviations from normality. See D’Agostino (1986) for comparisons of several techniques for inspecting data for non-normality.

A normal curve has a skewness of zero, and kurtosis of zero. We therefore want to test whether the values for a given data set are significantly different from zero. This is tested using Z-scores. The obtained skewness value is compared with zero using the Z distribution, where:

Z = (S - 0)/seS

where S = skewness and seS = the standard error of the skewness.

The obtained kurtosis value is compared with zero using the Z distribution, where:

Z = (K - 0)/seK

where K = kurtosis and seK = the standard error of the kurtosis.

The significance of the obtained values of skewness and kurtosis can therefore be determined using Z-score tables. For small samples, Tabachnick and Fidell (2001) recommend the use of conventional but conservative α levels (0.01 or 0.001) when judging the significance of the skewness and kurtosis (|Z| ≥ 2.33 would indicate P < 0.01). However, these tests are often assumed not to be valid for small samples, so non-parametric tests are frequently used automatically.
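These Z-score calculations can be sketched as follows. The standard-error formulas se_S ≈ √(6/n) and se_K ≈ √(24/n) are large-sample approximations (an assumption here; SPSS reports computed standard errors, which should be preferred when available).

```python
import math

def z_skewness(skew, n):
    # Large-sample approximation: se_S ~ sqrt(6 / n)
    return skew / math.sqrt(6 / n)

def z_kurtosis(kurt, n):
    # Large-sample approximation: se_K ~ sqrt(24 / n)
    return kurt / math.sqrt(24 / n)

# E.g. an observed skewness of 0.8 in a sample of 150 observations:
z = z_skewness(0.8, 150)
print(round(z, 2))  # 4.0 -> well beyond 2.33, so P < 0.01
```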

In the case of large samples, however, it may be more important to visually inspect the distribution, rather than judging on the basis of Z-scores. In a large sample minor deviations from normality may result in a skewness or kurtosis value that is significantly different from zero. As mentioned above, frequency histograms of the data are available in most statistical packages, including SPSS. Expected normal probability plots and detrended expected normal probability plots are possibly even more useful when visually inspecting your data set for normality and are available in many common statistical packages. In these plots, the scores are ranked and sorted, and the expected normal value is computed and compared with the actual normal value for each case. If the distribution is normal, the points will fall along the diagonal (bottom left to top right). Any deviations from normality shift the points away from the diagonal (see Q-Q plots in SPSS).

5.4.5 What to do if the data violate the assumptions of parametric tests
While planning your experiment it may be possible to predict that the data you are going to collect are unlikely to meet the assumptions of parametric tests, or that the sample size will be too small to test those assumptions. In this case it is better to plan data collection so that your data are suitable for analysis by randomisation tests (see section 2) or G-tests (see section 4).

If this is not possible and your data distribution turns out to be non-normal there are methods that can be applied which may make the data approximate a normal distribution more closely. Many of these will also simultaneously make sample variances more equal. Common transformations include taking the square root of each score, inverse scores (1/x), and logarithms. Tabachnick and Fidell (2001) include tables which indicate appropriate transforms, depending on the distribution of the data set. Sokal and Rohlf (1994) also provide useful information on when and how to make use of transformations. Depending on whether the data are negatively or positively skewed, and whether this is severe or only moderate, different transformations will be more effective in producing a distribution that is close to normal. Obviously once transformation has been done, it is important to check the distribution again, to find the new values for skewness and kurtosis, and to inspect the resulting frequency histograms. Selecting the best transformation for any given data set may often involve trying out several different transformations and checking which produces the closest approximation to a normal distribution.
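The transform-and-recheck cycle described above can be sketched as follows; the latency values are invented and the moment-based skewness estimator is one of several conventions.

```python
import math

def skewness(data):
    """Moment-based sample skewness."""
    n = len(data)
    m = sum(data) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in data) / n)
    return sum((x - m) ** 3 for x in data) / (n * sd ** 3)

# Hypothetical strongly positively skewed data (e.g. latencies in seconds).
raw = [2, 3, 3, 4, 5, 7, 9, 14, 30, 120]
logged = [math.log10(x) for x in raw]  # log transform for strong positive skew

# After transforming, always re-check the distribution.
print(round(skewness(raw), 2), round(skewness(logged), 2))
# The log-transformed scores are much closer to symmetrical.
```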

If you have transformed your data you need to decide whether to use the raw or transformed data to represent your results graphically. It is usually better to use the transformed data, since this is what has actually been tested statistically, and we are usually interested in the relative values of two or more groups rather than the absolute values (e.g. of the means in each sample). If the absolute values of means are important then the untransformed data can be plotted, and this is more meaningful to readers. However, standard deviations or standard errors calculated from the untransformed data will be incorrect with respect to the test performed, and it is often extremely difficult to calculate the correct values by back-transformation. A compromise is to plot graphs using the transformed data but then to re-label the tick marks on the axes with the appropriate raw values rather than the transformed values. This demonstrates the relative differences between group means and the correct magnitude of the error bars, and also gives readers the absolute values of the group means. Whichever way you choose to represent the results graphically, it is crucial to make absolutely clear what data are represented and how.

5.4.6 Best practice
Always check whether your data meet the assumptions before using parametric statistics. If they do not, and it is possible to transform the data such that they become normal, then do so before proceeding with the test. If it is not possible to transform the data to meet the assumptions and there is no suitable alternative test available (e.g. for multivariate tests, see section 3), it is better to go ahead with your testing than to bin your dataset! Many parametric tests are said to be relatively robust, i.e. they will cause you to reject the null hypothesis the right number of times even if the distributions do not meet the assumptions of the analysis. However, you should always be aware of which assumptions have not been met, and what effects this is likely to have on your results. Finally, if you revert to non-parametric statistics, don’t forget that, whilst these have fewer assumptions, they are not assumption free, and you must check that their assumptions are met too (Siegel and Castellan, 1988).

6. References S. Pankhurst

Adams, D. C. & Anthony, C.D. (1996). Using randomization techniques to analyse behavioural data. Animal Behaviour 51: 733-738.

Afifi, A. A. & Clark, V. (1984). Computer-aided Multivariate Analysis. Van Nostrand Reinhold Co: New York, USA.

Altmann, J. (1974). Observational study of behavior: sampling methods. Behaviour 49: 227-267.

Ary, D. & Suen, H. K. (1986). Interval length required for unbiased frequency and duration estimates with partial, whole and momentary time sampling. Midwestern Educational Researcher 8: 17-24.

Bart, J., Fligner, M.A. & Notz, W.I. (1998). Sampling and Statistical Methods for Behavioural Ecologists. Cambridge University Press.

Bernstein, I. S. (1991). An empirical comparison of focal and ad libitum scoring with commentary on instantaneous scans, all occurrence and one-zero techniques. Animal Behaviour 42: 721-728.

Box, G. & Jenkins, G. (1976). Time Series Analysis, Forecasting and Control. Holden-Day, Oakland, California, USA.

Carlstead, K.J., Mellen, J. & Kleiman, D.G. (1999a). Black rhinoceros in US zoos: I Individual behaviour profiles and their relationships to breeding success. Zoo Biology 18: 17-34.

Carlstead, K.J., Fraser, J., Bennett, C. & Kleiman, D.G. (1999b). Black rhinoceros (Diceros bicornis) in US zoos: II Behaviour, breeding success and mortality in relation to housing facilities. Zoo Biology 18: 35-52.

Chandler C.R. (1995). Practical considerations in the use of simultaneous inferences for multiple tests. Animal Behaviour 49: 524-527.

Chow, I. A. & Rosenblum, L. A. (1977). A statistical investigation of the time-sampling methods in studying primate behaviour. Primates 18: 555-563.

Cohen, J. & Cohen, P. (1983). Applied Multiple Regression/Correlation Analysis for the Behavioural Sciences (2nd edition). Lawrence Erlbaum Associates: Hillsdale, New Jersey, USA.

D’Agostino, R.B. (1986). Tests for the normal distribution. In: Goodness-of-fit Techniques. R.B. D’Agostino & M.A. Stephens (eds.). Marcel Dekker, New York, pp 367-419.

de Waal, F. B. M. & van Roosmalen, A. (1979). Reconciliation and consolation among chimpanzees. Behavioral Ecology and Sociobiology 5: 55-66.

Dunbar, R. I. M. (1976). Some aspects of research design and their implications in the observational study of behaviour. Behaviour 58: 78-98.

Edgington, E.S. (1995). Randomization Tests (3rd edition). Marcel Dekker, New York and Basel.


Engel, J. (1996). Choosing an appropriate sample interval for instantaneous sampling. Behavioural Processes 38: 11-17.

Engel, J. (1997). Die Bedeutung von Junggesellengruppen für die Haltung von Säbelantilopen (Oryx dammah) in Zoologischen Gärten [The importance of bachelor groups for keeping scimitar-horned oryx (Oryx dammah) in zoos]. Doktorarbeit, Universität Erlangen-Nürnberg.

Ewen, J.G., Cassey, P. & King, R.A.R. (2003). Assessment of the randomization test for binomial sex-ratio distributions in birds. Auk 120 (1): 62-68.

Fisher, R.A. (1935). The Design of Experiments. Oliver & Boyd, Edinburgh.

Fowler, J., Cohen, L. & Jarvis, P. (1999). Practical Statistics for Field Biology (2nd edition). John Wiley & Sons. This book has a particularly clear and useful section on different types of G-test.

García, L. (2004). Escaping the Bonferroni iron claw in ecological studies. Oikos 105: 657-663.

Hawkins, D. (2005). Biomeasurement: Understanding, Analysing and Communicating Data in the Biosciences. Oxford University Press (ISBN 0199265151). See also www.biomeasurement.net, a website in progress set up to accompany this textbook.

Hayes, A.F. (2000). Randomization tests and the equality of variance assumption when comparing group means. Animal Behaviour 59: 653-656.

Kasuya, E. (2001). Mann-Whitney U test when variances are unequal. Animal Behaviour 61: 1247-1249.

Manly, B.F.J. (1995). Randomization tests to compare means with unequal variation. Sankhyā: The Indian Journal of Statistics 57: 200-222.

Manly, B.F.J. (1997). Randomization, Bootstrap and Monte Carlo Methods in Biology. (2nd edition). Chapman & Hall, London and Weinheim.

Martin, P. & Bateson, P. (1993). Measuring Behaviour: An Introductory Guide. (2nd edition). Cambridge University Press (ISBN 0521446147)

Melfi, V.A. (2001). Identification and evaluation of the captive environmental factors that affect the behaviour of Sulawesi crested black macaques (Macaca nigra). PhD thesis. University of Dublin, Trinity College, Dublin 2.

Melfi, V.A. & Marples, N. (2004). Identification of the environmental variables that affect captive Sulawesi crested black macaque (Macaca nigra) feeding behaviour. In: Proceedings of the XVIIIth Congress of the International Primatological Society, 2001, Adelaide, Australia.

Melfi, V.A. & Feistner, A.T.C. (2002). A comparison of the activity budgets of wild and captive Sulawesi crested black macaques (Macaca nigra). Animal Welfare 11: 213-222.

Mellen, J.D. (1991). Factors influencing reproductive success in small captive exotic felids (Felis spp.): A multiple regression analysis. Zoo Biology 10: 95-110.

Mellen, J.D. (1993). A comparative analysis of scent-marking, social and reproductive behaviours in 20 species of small cats (Felis). American Zoologist 33: 151-166.

Mellen, J.D. (1994). Survey and inter-zoo studies used to address husbandry problems in some zoo vertebrates. Zoo Biology 13: 459-470.

Moran, M.D. (2003). Arguments for rejecting the sequential Bonferroni in ecological studies. Oikos 100: 403-405.

Mundry, R. (1999). Testing related samples with missing values: a permutation approach. Animal Behaviour 58: 1143-1153.

Neuhäuser, M. (2004). Testing whether any of the significant tests within a table are indeed significant. Oikos 106: 409-410.

Norusis, M. J. (1990). SPSS/PC+ Statistics 4.0. SPSS Inc.: Chicago, USA.

Onghena, P. & May, R.B. (1995). Pitfalls in computing and interpreting randomisation test p values: A commentary on Chen and Dunlap. Behavior Research Methods, Instruments & Computers 27: 408-411.

Peres-Neto, P.R. & Olden, J.D. (2001). Assessing the robustness of randomization tests: examples from behavioural studies. Animal Behaviour 61: 79-86.

Perkins, L. (1992). Variables that influence the activity of captive orang-utans. Zoo Biology 11: 177-186.

Pickering, S., Creighton, E. & Stevens-Wood, B. (1992). Flock size and breeding success in flamingos. Zoo Biology 11: 229-234.

Powell, J., Martindale, B., Kulp, S., Martindale, A. & Bauman, R. (1977). Taking a closer look: time sampling and measurement of error. Journal of Applied Behavior Analysis 10: 325-332.

Quera, V. (1990). A generalized technique to estimate frequency and duration in time sampling. Behavioral Assessment 12: 409-424.

Rhine, R. J. & Flanigon, M. (1978). An empirical comparison of one-zero, focal-animal and instantaneous methods of sampling spontaneous primate social behaviour. Primates 19: 353-361.

Rice W.R. (1989). Analysing tables of statistical tests. Evolution 43: 223-225.

Ruxton, G.D. & Colegrave, N. (2003). Experimental Design for the Life Sciences. Oxford University Press (ISBN 0199252327).

Shaffer, J. P. (1995). Multiple hypothesis testing. Annual Review of Psychology 46: 561-584.

Shepherdson, D. J., Carlstead, K. C. & Wielebnowski, N. (2004). Cross-institutional assessment of stress responses in zoo animals using longitudinal monitoring of faecal corticoids and behaviour. Animal Welfare 13: S105-S113.

Siegel, S. & Castellan, N.J. Jr. (1988). Nonparametric Statistics for The Behavioral Sciences (2nd edition). McGraw-Hill (ISBN 0070573573).

Simpson, M. J. A. & Simpson, A. E. (1977). One-zero and scan methods for sampling animal behaviour. Animal Behaviour 25: 726-731.

Smith, T. E. (2004). Zoo Research Guidelines: Measuring stress. London: BIAZA.

Sokal, R.R. & Rohlf, F.J. (1994). Biometry: The Principles and Practice of Statistics in Biological Research (3rd edition). W. H. Freeman & Company (ISBN 0716724111).

Spijkerman, R. P., Dienske, H., van Hoof, J. & Jens, W. (1996). Differences in variability, interactivity and skills in social play of young chimpanzees living in peer groups and in a large family zoo group. Behaviour 133: 717-739.

Sprinthall, R. (1987). Basic Statistical Analysis. Addison-Wesley, Reading, MA, USA.

Stevens, W.L. (1939). Distribution of groups in a sequence of alternatives. Annals of Eugenics 9: 10-17.

Suen, H. K. (1986). On the utility of a post hoc correction procedure for one-zero sampling duration estimates. Primates 24: 237-244.

Tabachnick, B. G. & Fidell, L. S. (1996). Using Multivariate Statistics (3rd edition). New York: HarperCollins College Publishers.

Tabachnick, B. G. & Fidell, L. S. (2001). Using Multivariate Statistics (4th edition). Boston: Allyn & Bacon.

Todman, J.B. & Dugard, P. (2001). Single-Case and Small-n Experimental Designs. A Practical Guide to Randomisation Tests. Lawrence Erlbaum Associates, Mahwah and London. This text includes a useful CD-ROM with worked examples of how to set up and carry out randomisation tests in Excel, SPSS etc.

Tyler, S. (1979). Time-sampling: a matter of convention. Animal Behaviour 25: 801-810.

Wielebnowski, N. C., Fletchall, N., Carlstead, K., Busso, J. M. & Brown, J. L. (2002). Noninvasive assessment of adrenal activity associated with husbandry and behavioral factors in the North American clouded leopard population. Zoo Biology 21 (1): 77-98.

Wilson, S. (1982). Environmental influences on the activity of captive apes. Zoo Biology 1: 201-209.

Wirtz, P. & Oldekop, G. (1991). Time budgets of waterbuck (Kobus ellipsiprymnus) of different age, sex and social status. Zeitschrift für Säugetierkunde 56: 48-58.
