
Table of Contents

1 Probability
1.1 Simple probabilities
1.2 Conditional Probability
1.3 Bayes' theorem
2 Basic approaches to science and statistics
2.1 Introduction
2.2 Schools of statistical inference
2.3 Neyman-Pearson and Bayesian statistics compared
2.4 Parameter estimation
3 Fundamentals
3.1 Computing
3.2 Populations and samples
3.3 Precision, Accuracy, Bias
3.4 Significant units and rounding errors
4 Data Collection
4.1 Introduction
4.2 Types of Measurements
4.3 Coding data
4.4 The data sheet
4.5 Sampling
4.6 Data management
5 Common Distributions
5.1 Histograms
5.2 Uniform
5.3 Binomial
5.4 Poisson
5.5 Gaussian or Normal distribution
5.6 Exponential
5.7 Cauchy
5.8 Gamma
5.9 Beta
6 Descriptive Statistics and Exploratory Data Analysis
6.1 Introduction
6.2 Law of large numbers
6.3 Descriptive Statistics
6.4 Relationships between variables
6.5 Missing values
6.6 Looking for curvilinear relationships
7 Transformations
7.1 Introduction
7.2 Arcsine transformation
7.3 Square-root arcsine transformation
7.4 Exponential transformation
7.5 Square-root transformation
7.6 Logarithmic transformation
7.7 Rank transformations
7.8 Box-Cox or power transformations
7.9 Data standardization
8 Introduction to Modeling
8.1 Introduction
8.2 Model Complexity and Error
8.3 Types of models
8.4 Fitting data to models
8.5 Issues in Modeling
9 Variable and Model Selection
9.1 Introduction
9.2 Stepwise
9.3 Likelihood ratio test
9.4 Information Theoretic Methods
10 Generalized Linear Models (GLM)
10.1 Introduction
11 Summary
12 Responses and Predictors are Continuous (Regression)
1.2 Introduction
12.1 Simple linear regression
12.2 Hypothesis testing and regression
12.3 Diagnostics
12.4 Review of y
12.5
13 Multiple regression
13.1 Introduction
13.2 Polynomials
13.3 Multivariate regression
13.4 Issues
13.5 Nature of the relationship between the response and predictors
14 Discrete response variables
14.1
14.2 Multinomial (logistic) regression
14.3 Ordinal (logistic) regression
15 Responses are counts/Predictors are continuous (Poisson regression)
15.1 Poisson Regression
15.2 Over- and under-dispersion
15.3 Negative Binomial
16 Responses are continuous/Predictors are categorical (ANOVA)
16.1 Introduction
16.2 Assumptions of the ANOVA
16.3 Two Independent Samples
16.4 Mann-Whitney U test for two independent samples
16.5 Nonparametric Wilcoxon Rank Sum
16.6 Parametric or nonparametric
16.7 Two related groups (before-after)
16.8 More than two samples
16.9 Basic Assumptions of ANOVA
16.10 One-way ANOVA
16.11 Two-way ANOVA
16.12 Randomized block design (RBD)
16.13 Interpreting Results from ANOVA
16.14 Nested
17 Multiple Comparisons
17.1 A priori comparisons: Orthogonal Contrasts
17.2 Post priori comparison
17.3 Interpreting and Visualization of Multiple Comparisons
18 Repeated Measures
19 Response is continuous and predictors are categorical and continuous (ANCOVA)
20 Mixed Models
20.1 Introduction
21 Responses are ordinal
21.1 Introduction
21.2 Proportional odds model
21.3 code
22 Alternative Regression Techniques: Generalized Additive Models and related techniques
22.1 Introduction
22.2 Generalized additive models (GAMs)
23 with a preponderance of zeros
24 Responses are proportions or counts and predictors are categorical
24.1 Tests
24.2 1 x m or n x 1 data
24.3 Contingency tables
24.4 Data Visualization
24.5 Null Matrix
1.3 Fisher's exact test
24.6 G Test or Log-likelihood Test
24.7 Binomial Test
25
25.1 Introduction
25.2 Fitting nonlinear models
25.3 Splines
26 Neural networks and Path analysis
27 Dealing with Interactions
28 Mantel Analysis
29 Generalized Linear Mixed Models (GLMM)
30 Generalized Estimating Equations (GEE)
31 Proportional Odds Model
32 DATA REDUCTION
1.4 Introduction
1.5 Relating the species' abundance to environmental gradients
1.6 Methods utilizing eigenanalysis
32.1 Non-metric Multidimensional scaling
33 PREDICTING GROUP MEMBERSHIP: DISCRIMINANT FUNCTION ANALYSIS
34 REPEATED MEASURES
35 THE NUMBER OF SPECIES
36 POPULATION ESTIMATION
37 Spatial Statistics
1.7 Introduction
37.1 Spatial correlation violates the assumption of independent observations
37.2 Software
38 Experimental Design
38.1 Completely Randomized Design
39 Glossary
40 Literature Cited

PART ONE: THE BASICS

1 Probability

1.1 Simple probabilities

Most people have an intuitive understanding of simple probability. In one sense, the probability of an event, the outcome, is the expected relative frequency of the event or outcome occurring in a trial or over repeated trials. A simple estimation of probability is the relative frequency of given outcomes over the sample space, where the sample space is the set of possible outcomes. For example, the sample space of a penny toss is {heads, tails} and that of rolling a die is S = {1, 2, 3, 4, 5, 6}. Therefore, the probability of rolling any particular outcome on a die is 1/6 per roll. Note that a set of outcomes is denoted with braces, "{}", not parentheses. Probability can be estimated empirically by dividing the number of outcomes by the number of trials. Events are mutually exclusive if the possible outcomes cannot happen together. For example, with a flip of a coin you cannot get a head and a tail. If events occur without affecting and being affected by other events, then those events are said to occur independently. If the probability of an event is influenced by the outcome of another event, then the probabilities are said to be conditional. A simple way to alter probabilities is to sample without replacement: if we were to remove names from a hat, that would alter the probability of selection of the remaining names. The probability of any one outcome falls between 0 and 1, inclusive, and the probabilities of the outcomes in a sample space must sum to 1.0. This is true for simple probabilities such as the rolling of dice or coin flipping. The complement is the probability of not getting that particular outcome, so the complement of heads is tails. In statistical notation, the probability of getting tails is P(T). The complement of P(T), the probability of getting heads, is denoted P(T̄), P(∼T), or P(T)′ and is the probability that a tail will not occur. In this case we should note that P(T) + P(T̄) = 1. Calculating the probability of two independent events occurring in the same trial is called the joint probability and is accomplished through multiplication, or P(A ∩ B) = P(A) · P(B); the symbol ∩ is read "intersection" and logically is a simple "AND" statement. A simple example is the probability of getting a particular rank and suit, such as the king of diamonds. What is the probability of selecting the seven of clubs?

P(seven of clubs) = P(seven) × P(clubs) = (4/52) × (13/52) = 1/52 ≈ 0.019

If events are mutually exclusive then A ∩ B = ∅, where ∅ is an empty or null set, and so P(A ∩ B) = 0. The probability of getting a head and a tail on a single coin toss is therefore zero. For calculating the probability that at least one of several outcomes occurs we use the symbol ∪, read "union," which logically is a simple "OR." Symbolically,

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

where the last term is zero for mutually exclusive events. So the probability of getting a head or a tail on a penny toss is 1, since

P(heads ∪ tails) = P(heads) + P(tails) − P(heads ∩ tails) = 0.5 + 0.5 − 0 = 1

For three events, say A, B, C, the probability P(A ∪ B ∪ C) is

P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C)

The probability of getting a 3, 4, or 5 on a roll of a die is

P({3, 4, 5}) = 1/6 + 1/6 + 1/6 − 0 − 0 − 0 + 0 = 1/2 = 0.5

If the events are not mutually exclusive then the calculations will have fewer zeros. For example, if we want to calculate the probability of getting a seven or a club, then

P(seven ∪ club) = P(seven) + P(club) − P(seven ∩ club) = 4/52 + 13/52 − 1/52 = 16/52 ≈ 0.308

Example at Nongeneral Hospital. There are 1000 patients aged 70-80:

700 nonsmokers, healthy (NH)
150 nonsmokers with cancer (NC)
10 smokers, healthy (SH)
140 smokers with cancer (SC)

The sample space is represented as {NH, NC, SH, SC}.

We can assume all the patients are wandering about with equal chance of being sampled.

What is the probability of selecting a smoker, or P(S)? We can add the two types of smokers since they are mutually exclusive groups (cancer and no cancer):

P(S) = P(SH) + P(SC) = 10/1000 + 140/1000 = 150/1000 = 0.15

What is the probability of selecting a nonsmoker or someone with cancer?

P(N ∪ C) = P(N) + P(C) − P(N ∩ C) = 0.85 + 0.29 − 0.15 = 0.99
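These calculations can be laid out as a short R sketch; the counts are the hospital numbers given above, and the object names are only illustrative.

# Hospital example: counts of 70-80 year old patients by smoking and cancer status
counts <- c(NH = 700, NC = 150, SH = 10, SC = 140)
total  <- sum(counts)                                   # 1000 patients

p_smoker         <- (counts["SH"] + counts["SC"]) / total   # P(S) = 0.15
p_nonsmoker      <- (counts["NH"] + counts["NC"]) / total   # P(N) = 0.85
p_cancer         <- (counts["NC"] + counts["SC"]) / total   # P(C) = 0.29
p_non_and_cancer <- counts["NC"] / total                    # P(N and C) = 0.15

# Union: P(N or C) = P(N) + P(C) - P(N and C)
p_nonsmoker + p_cancer - p_non_and_cancer                   # 0.99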

1.2 Conditional Probability
Conditional probabilities exist when one outcome is influenced by a previous event and are estimated with

 ∩   ∣ = P A B P A B   P B where P A∣B is read probability of A given B. In the unconditional case P A∩B reduces to P(A)*P(B) and P A∣B becomes P(A) since P(B) cancels out of the equation.

1.3 Bayes' theorem
So far, probabilities have been derived from relative frequencies. In the mid-18th century, Rev. Thomas Bayes (1702-1761) developed a method to incorporate previous beliefs and to use new observations to update those beliefs. This is an extraordinary result, but Bayes did not seek to publish it. After Bayes' death, a colleague rescued the manuscript and had it published in 1764 and 1765 as An Essay towards solving a Problem in the Doctrine of Chances in the Philosophical Transactions of the Royal Society {Chatterjee, 2003 #6577}. Bayes' theorem is an expansion of conditional probability such that

P(A | B) = [P(B | A) × P(A)] / [P(B | A) × P(A) + P(B | Ā) × P(Ā)]

where P(A | B) is the posterior probability, the probability of A after taking into account the data. In many cases A can represent a hypothesis and B can represent the data. The source of the prior probability, P(A), can be subjective, representing a previous belief, or based on previous observations. The term P(B | A) is the likelihood. The inclusion of previous beliefs is a subjective decision. The subjective aspect of Bayesian statistics turns off some scientists; however, with the inclusion of many samples (evidence), the influence of the priors is lessened. This last equation gives the probability of a single hypothesis. If you have multiple hypotheses competing to explain the data we have

 ∣ ×    ∣ = P A Bi P Bi P Bi A n ∑  ∣ ×   P A B j P B j j =1 where B j are alternate hypotheses. We can rewrite this equation as [likelihood × prior ] [total of all likelihoods× priors of alternates] Example 1.

A population of birds has 60 males and 40 females. About 50% of the males will bring food to the nest and all females will bring food to the nest. If a bird is seen carrying food to the nest, what is the probability of the bird being a male?

P B∣A Is the probability of a bird carrying food given that the bird is male or 0.5 P  A is the probability of observing a male bird or 0.6 P  B is the probability of observing a bird carrying food independent of sex or (0.4)(1) + (0.6)(.5) =0.7

0.5⋅0.6 so P A∣B= =0.43 0.7

Example 2.

The presence of antibodies is tested to infer the presence of a virus. The test is correct for infected people at a rate of 0.9985, which means that a false negative occurs at a rate of 0.0015. The false positive rate is 0.0060, and a person that does not have the virus will test negative when they do not have the virus at a rate of 0.9940. Let's assume that 1% of the population has the virus. So

P(positive | A) = 0.9985
P(negative | A) = 0.0015
P(positive | no A) = 0.0060
P(negative | no A) = 0.9940

What is the probability that a person that tests positive actually has the virus?

P(A | positive) = P(positive | A) · P(A) / P(positive)

P(A | positive) = P(positive | A) · P(A) / [P(positive | A) · P(A) + P(positive | no A) · P(no A)]

P(A | positive) = (0.9985 · 0.01) / (0.9985 · 0.01 + 0.0060 · 0.99) = 0.627
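The same calculation can be scripted in R; this is a minimal sketch using the rates assumed above.

# Bayes' theorem for the antibody test example
prior     <- 0.01     # P(A): assumed prevalence of the virus
sens      <- 0.9985   # P(positive | A)
false_pos <- 0.0060   # P(positive | no A)

p_positive <- sens * prior + false_pos * (1 - prior)  # total probability of a positive test
posterior  <- sens * prior / p_positive               # P(A | positive)
posterior                                             # ~0.627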

2 Basic approaches to science and statistics

2.1 Introduction
Epistemology is the study of knowledge – a central objective of statistics. New knowledge, at least interesting knowledge, comes from inferences. Inferences are logical conclusions drawn from evidence. One type of inference is induction, when facts and observations are put together to make a generalization about those facts and observations. Consider these to be the novel explanations of existing facts. Watson and Crick were able to get to the structure of DNA by examining different pieces of evidence and literally putting the facts together in a novel model. Induction dominates when a particular field of study is young. Indeed, there were several models to explain the genetic material and what the structure of DNA might be. How do we know that these explanations are worth anything, and how do we sort through multiple models? There were several competing models of DNA but only one was correct. Different models need to be tested. We go about testing models using deduction. Deduction is the process of going from the generalizable explanation to specific cases. If the basic mental exercise of induction is "what if..." then the basic mental exercise of deduction is "if... then...". As a particular field of study progresses it could be seen as entering a deductive, confirmatory phase. Science, as a social enterprise, should not impose a dichotomy where one does not exist. Progress in science is a cycle of induction and deduction where we come up with new ideas and modify them based on new evidence.

Biology is unique among the natural sciences because of its inherent variation. Variation makes biology interesting, but also difficult. Statistics attempts to sort this variation into meaningful patterns. Hypotheses are our attempts to explain variation. Why do some birds lay two eggs and other species lay a dozen? Hypotheses are the purported invisible mechanisms that create observable variation. The classic case is that scientists start with observation. This is true in some cases, but usually we start with a body of knowledge with gaps where knowledge is lacking. Hypotheses attempt to bridge those gaps. Where do hypotheses come from? Imagination plays a big part in constructing hypotheses. Generalized explanations are considered theories. There is no metric for the amount of phenomena that an explanation must incorporate to become a theory. The word "theory" does not imply the amount of evidence for or against it. Theories do not become facts even if very well supported. One might say that a theory that performs well has been well validated or well supported. What does it mean for a theory to perform well? Well-supported theories have been evaluated by testing the hypotheses they produce. We evaluate hypotheses by testing predictions. Predictions are statements about the outcomes of observations and experiments. Hypotheses and predictions can be linked together through if-then statements. For example, if dispersal affects colonization on islands then there should be a positive relationship between dispersal ability and incidence on islands. A good scientist should consider all the hypotheses that might generate a phenomenon and make predictions that will separate these hypotheses. Theories that have become accepted by the larger body of scientists may become a paradigm. There are two general approaches to entertaining several competing ideas. We can consider several potential explanations knowing that only one is true. Until the true hypothesis is borne out, we have several alternative hypotheses.
This is the method espoused by Francis Bacon (1620) in the 17th century. Our job as scientists, then, is to go through each hypothesis and eliminate those that no longer merit our consideration. The other approach is to consider several hypotheses that may all be potentially "true" yet differ in their power to explain. This approach considers multiple working hypotheses, as proposed by geologist T.C. Chamberlin in 1890. What is meant by an explanation or hypothesis being true? We should never consider an explanation to be absolutely true or, more simply, True. Subscribing to Truths makes for bad science. We should always be skeptical of our explanations and always leave room for the fact that we might be wrong – about everything. We should always remember that science is a social enterprise that strives for objectivity.

2.2 Schools of statistical inference

Statistics is a branch of science that wrangles with how we collect data, make inferences from data, and then how or if we make decisions. Since science is a sociological enterprise complete with personalities and debate, we might expect major epistemological issues to arise. This is indeed the case; there are two schools of statistical inference: the Bayesian school, where the weight of evidence for a hypothesis is estimated, and the Neyman-Pearson school of falsification, where hypotheses are either supported or discarded as they are tested. These schools can be referred to as the subjective and objective schools of statistical inference {Chatterjee, 2003 #6577}, but I believe this understates the objective nature of Bayesian statistics and the subjective nature of the frequentist school.

2.2.1 Frequentist Approach or Neyman-Pearson Statistics

Most students are already familiar with the frequentist approach, which includes the t- and chi-square tests. Measurements are taken, a test statistic is estimated, and then the statistic is compared to a critical value. A statistic less than the critical value is considered to be statistically insignificant. A statistic larger than the critical value indicates statistical significance. Take, for example, a penny-flipping experiment where one asks whether a penny is fair. The penny is flipped a number of times and then the number of observed heads is compared to the number of heads expected. Since we expect an equal number of heads and tails, what is the difference between expected and observed that will result in us concluding that we do not have a fair coin? The frequentist approach gives us that ability, with a particular rate at which we would make the wrong conclusion. The point at which we make a decision is called the critical value. The critical value is based on a theoretical frequency distribution of outcomes (thus the frequentist designation) for given degrees of freedom, error rate, and expected frequency of particular outcomes. Degrees of freedom are the number of independent values in the data minus the number of quantities estimated from the data. A simple example of degrees of freedom is the determination of the fairness of coin flips. The number of degrees of freedom is 1 because the proportion of heads can be determined from flipping the coin and the proportion of tails can be derived by simply subtracting the proportion of heads from 1. The error rate (α), or significance level, is the proportion of times that one would conclude a significant difference if, in reality, there was no significant difference. Typically, an error rate of 0.05 is used. This means that one would wrongly conclude unfair dice or an unfair coin five percent of the time. If one wants to make that mistake less frequently, then a smaller α is used. One will often use an α of 0.01 or less in medicine or engineering and an α larger than 0.05 in ecology.
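As a concrete sketch, the coin question can be run in R; the 60 heads out of 100 flips below are hypothetical numbers chosen only for illustration. Both tests return a p-value that is compared to the chosen α.

# Hypothetical experiment: 60 heads in 100 flips of a penny
binom.test(x = 60, n = 100, p = 0.5)             # exact two-sided test of P(heads) = 0.5

# The same question as a chi-square goodness-of-fit test
chisq.test(c(heads = 60, tails = 40), p = c(0.5, 0.5))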

2.2.2 Types of hypothesis in the frequentist approach

The null hypothesis, H0, is a statement that describes the situation in which the effect of the supposed mechanism is absent. What the heck does that mean? Let's give a clarifying example: if you posit that a fertilizer increases the growth of a particular plant, then the null hypothesis is that applying fertilizer will have no effect. The alternative hypothesis describes what is expected if the supposed mechanism is at work and is represented as HA. In this case, the alternative hypothesis can be stated in one of two ways: one, there is a difference between the reference and the experimental group, and two, the experimental group will grow faster, thus taller, than the reference group. The former is referred to as a non-directional hypothesis and the latter a directional hypothesis. The difference is not insignificant. A non-directional hypothesis suggests we have less knowledge about the system under study – applying chemical X will have some effect. If we have an understanding of the workings of our system we can make a better argument and give the hypothesis direction. There are circumstances when a non-directional hypothesis is appropriate, in particular in emerging fields of science. I doubt very much that Newton had directional hypotheses when exploring the nature of light. Applying a directional or non-directional hypothesis should always be justified. We often think of the classic experiment as one with a reference or control group and an experimental group. Going back to the fertilizer example, we might want to know if the average height of plants is greater when they get fertilizer than in the control group, which does not. In this case the null hypothesis is that there is no effect on average plant growth, so that the mean heights of the two groups are the same. The null and alternative hypotheses can be represented as

H0: μ(treatment) = μ(control)
H1: μ(treatment) ≠ μ(control)
H2: μ(treatment) > μ(control)

We aren't bound by such simple situations. When gathering data from the field we are often presented with a gradient of data where removing the treatment would be meaningless. For example, we might be interested in the effect of temperature on the number of bird species singing. A lab experiment would be untenable. So we record temperature and relate it to the number of birds singing. There is not a control or experimental group – they are not necessary. Instead, we are interested in the slope of the relationship, β. Symbolically, such statistical hypotheses can be expressed as

β = 0 and β ≠ 0 for the null and alternative hypotheses, respectively. Graphically, the null hypothesis corresponds to a flat (zero-slope) line and the alternative to a line with a nonzero slope.

2.2.3 A tale of tails
In the frequentist paradigm, a directional hypothesis is said to be one-tailed and a non-directional hypothesis is said to be two-tailed. Since there are two ways you can falsify the null, the alpha level is usually split in half between the two tails.

2.2.4 Power
Power is the ability to reject the null hypothesis when it is false. Power increases with sample size and decreases with variability. Effect size is the influence of the treatment on slopes and means. Prospective power analysis is used to estimate sample size when the desired power, the effect size, and the variability are given. Retrospective power analysis is done after the analysis and is used to estimate power when the sample size is given. It is usually used when you fail to reject the null and want to know why.
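A sketch of both kinds of power analysis using R's built-in power.t.test(); the effect size (delta), standard deviation, and sample size below are hypothetical values chosen for illustration.

# Prospective: sample size per group for a two-sample t-test,
# given an expected difference, standard deviation, alpha, and desired power
power.t.test(delta = 2, sd = 3, sig.level = 0.05, power = 0.8)

# Retrospective: power achieved for a given sample size
power.t.test(n = 15, delta = 2, sd = 3, sig.level = 0.05)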

2.2.5 Potential issues with frequentist statistics

2.2.5.1 Test gives statistical significance not biological significance

2.2.5.2 Type I and Type II errors
There are two ways to make errors in statistical inference. A type I error is when one incorrectly infers that there is a treatment effect (H0 is rejected) when there is not (H0 is actually true). The rate at which you make this mistake is the alpha that is chosen. Typically, an alpha of 0.05 is chosen. This means that if the experiment is repeated the correct conclusion will be made 95% of the time and the experimenter will conclude an effect when there is none 5% of the time. An alpha of 0.05 is arbitrary and used mostly out of tradition. Other error rates are used depending on the field. In exploratory analysis, especially in ecology, a larger error rate is sometimes used (alpha = 0.1). In medicine, error rates tend to be more stringent (alpha = 0.01 or less). In any case, error rates should always be reported. A type II error is when you conclude that the null hypothesis is true when the null hypothesis is actually false. Type II errors occur at rate β. Power, 1 − β, refers to the ability of a test to reject the null when it is actually false. A test that has low power will be over-conservative. There is a certain amount of balance we should aim for: a test that is too conservative will suffer from Type II errors and a test that is too powerful will suffer from increased Type I errors.

                        Conclude H0 true          Conclude H0 false
Reality: H0 true        Correct (1 − α)           Type I error (α)
Reality: H0 false       Type II error (β)         Correct: power (1 − β)

2.2.6 What to report under the frequentist paradigm
One should always report the test used, the effect size, the sample size, the test statistic, and the actual p-value.

2.3 Neyman-Pearson and Bayesian statistics compared

In the Bayesian approach we get the probability of the hypothesis given the data; in the frequentist approach we get the probability of the data given a hypothesis.

                      Frequentist                        Bayesian
Parameter estimates   95% confidence intervals           95% probability of estimate
Hypotheses            Dualistic: significant or          Likelihood of hypothesis
                      insignificant
Prior information     Not used                           Used
Inference             P-value, or P(data | H0)           P(H0 | data)
Computation           Easy                               Difficult

2.4 Parameter estimation
Parameter estimation is not a school of statistical thought but it is an approach to science. An effect may be known but the exact relationship may not be precisely defined. Parameter estimation goes about finding the relationship between predictor and response variables. This can be accomplished in two ways: bootstrapping and model averaging. Bootstrapping involves repeatedly resampling the data to estimate the mean effect, which overcomes the problem of sampling error. One can also model average, and this is more appropriate when the "true model" is not known. Model averaging takes a Bayesian perspective in that each model is given a likelihood. The parameter estimate is weighted by the likelihood of each model.

3 Fundamentals

3.1 Computing
I have not sampled all the statistical programs out there – and I do not plan on it. I have, however, used a gradient of "boxed" programs from Systat, to SAS, to R. The more sophisticated programmers enjoy the hands-on approach of R and SAS. Most students will probably prefer Systat or SPSS. I have settled on R. Many will find the price of R ideal: it is free. Many statisticians prefer R because they can manipulate or make adjustments to statistical procedures. I like R for several reasons, the primary attraction being the many procedures (packages) that are tailored for ecologists and are free to download off the internet. A warning though: R is a bear to learn.

3.2 Populations and samples
A sample is a subset of the population, which contains all the individuals. A statistic describes a sample, whereas a parameter is a property of the population. Parameters are typically represented with Greek letters and sample statistics are represented with Latin letters. Since sample statistics are estimates of parameters, they can also be represented by the Greek alphabet but usually with a "hat." For example, the population mean is represented by μ, and the sample mean is represented by x̄ or μ̂.

3.2.1 Variables vs Parameters
A variable is a quantity whose value potentially differs from one individual to another. Parameters are values associated with a variable. The combination in a model is usually represented as bx, where x are the values of the variable and b is the parameter. So, for instance, the number of hot dogs sold at a baseball game is the number of attendees (x) times some parameter, say 0.001 (or 1/1000 of the attendance). Usually variables are represented with Arabic letters at the end of the alphabet (e.g., X, Y, Z). Observations (realizations of those variables) are set in lowercase (e.g., x, y, z). A constant has the same value for all observations and is represented with letters at the beginning of the Arabic alphabet (e.g., a, b, c). A parameter is a summary number for a population and is a constant since it was generated from all the observations. Parameters are traditionally represented with Greek letters (e.g., α, β, μ). Estimated parameters are given in lowercase (e.g., b) or with a hat (e.g., β̂, read "beta hat"). A statistic, generally represented with Arabic letters (e.g., s, t, F), is a variable generated from a sample and is an estimate of a parameter. Subscripts denote position in an array, and letters in the middle of the alphabet are used (e.g., i, j, m, n).

3.3 Precision, Accuracy, Bias
Precision is the repeatability of a measurement. For example, an imprecise scale would give different measurements every time a person steps on and off. Precision can be improved with better instruments or with larger sample sizes. A measure of precision is also useful and requires repeated measurements. Accuracy is the proximity of a measured value to the true value. So if the scale reads 193 pounds for a 200-pound person, the scale is less accurate than a scale that reads 199 pounds. Inaccurate measurements have two sources: coarseness of measurement and bias. Coarseness of measurement results when the increments of measurement are larger than the variability in the objects being measured. A ruler with one-meter increments would give inaccurate measurements of objects much smaller than a meter, say acorns. Better instrumentation reduces the effect of coarse measurements. Bias is systematic error.

3.4 Significant units and rounding errors

One should avoid presenting measurements with more accuracy than what was actually taken. For example, a measure of 0.200 implies a measure to the thousandth place, but this is impossible for an instrument that measures to the nearest tenth. Derived variables, however, should have a significant digit added. This applies to presentation. When using numbers for calculations one should be very aware of rounding errors. Rounding errors occur when a number is rounded up or down and influences the outcome of later estimates. It is always prudent not to round off estimates until the very end. This is particularly true when working with logs and exponential forms.

4 Data Collection

4.1 Introduction
Ecology, by its very nature, is complex. The simplest questions, such as asking why a species of plant grows where it does, have nearly an infinite number of explanations. What you are attempting to do is to answer questions about a complex system by finding a compromise between funding, time, needs of sample size, and what you want to explain, so the more upfront thinking you can do, the better. If you are working on a funded research grant you're in luck because much of the work has been done by your advisor – pat her on the back and buy her lunch. Then hope she buys you several dinners because you'll soon not be able to afford going out. The larger goal should be to test multiple hypotheses by doing the least amount of work – this is the art of being a successful scientist. Testing multiple hypotheses implies collecting different types of explanatory data. The question is then what explanatory variables should you collect. This is the part where you hurt your brain. Go to the library and read up on your response variable. Use it as a key word in bibliographic search engines. The more hits you return the more you will have to think about it and the slicker you will need to be to come up with good observations.

Create a working model on paper and start with explanatory -> response. Note that each path is a hypothesis or a sub-hypothesis. Start adding explanatory variables. Then think of what affects the explanatory variables. Soon you should come up with a messy sheet of paper with arrows going everywhere. Good! Now photocopy this and stow it away. Bring it back out when you are writing up your manuscripts and reflect on the entire thing. Now you have some idea of the alternatives that reviewers are going to point out when your paper comes back, so it might be good to justify in your methods why you chose certain parameters and ignored others. In your discussion, you can add what parameters you wish you had included. With the original copy get out the red pen and start lining through pathways you just can't do. Rank your hypotheses from most likely or most influential to least likely or least influential. This is the point where you should consult fellow graduate students and other experts. Once you think the list is a good approximation, then think of different ways to measure the explanatory variables and the response variable. You should be able to biologically justify each decision. Take on the attitude of a four-year old and keep asking "why?". Head spinning and hurting? Good, go rent a good movie. I suggest El Mariachi.

4.2 Types of Measurements

4.2.1 Continuous and discrete data
Continuous data are measurements that can take on any value within a range, including values between integers. Examples of continuous data include height and weight. Continuous data are usually quantitative. Discrete data are measurements that are integers, such as the number of individuals, or qualitative, such as an individual's sex.

4.2.2 Ratio scale
Data on a ratio scale are continuous and have a zero that is physically meaningful. There is usually a unit of measurement with constant intervals between units. Examples include height, weight, and amount of time (not time of day).

4.2.3 Interval scale
Interval data are continuous and are like ratio-scale data; however, zero is not physically meaningful and is arbitrary, so ratios of interval-scale values are not meaningful. Degrees Fahrenheit and Celsius are on an interval scale, whereas degrees Kelvin are on a ratio scale. Other examples of interval data are directional measurements and time of day. These latter examples require special treatment since they are circular data.

4.2.4 Nominal data
Nominal data are qualitative, discrete, and occur in, or are designated as, classes (e.g., male/female).

4.2.5 Ordinal or ranked data: data are recorded in relation to each other (e.g., light, medium, dark). Variables that are qualitative (e.g., dead/alive, black/white) but are treated as quantitative are called dummy variables. Krebs (1989) suggests coding ranked data as letters to reinforce the idea that these data are not absolute.

4.2.6 Count data
Count data are the number of objects or occurrences per observation. Count data are non-negative integers.

4.3 Coding data
Qualitative data are often coded for ease of input and analysis. Binary data, such as dead/alive or male/female, are often coded with 0 and 1. Ordered data (e.g., high, medium, low) are often coded as 1, 2, 3, but many statistical programs can handle text. Indeed, it is probably better practice that text be used for qualitative data so dummy variables (0, 1, 2 representing, for example, high, medium, and low) are not mistaken for quantitative data. Dates and times also require special treatment if they are going to be included in the analysis.

4.4 The data sheet
With the availability of handheld computers, collection of field data has never been easier. Indeed, remote recording devices are the norm for collecting environmental data. The most efficient way to add data is to have each variable as a column and each row as an observation.

site             date      time  species
Broad Lake       01022007  1346  NOCA
Broad Lake       01022007  1456  BAEA
Pohatcong Creek  01232007  600   LOWA
Pohatcong Creek  01232007  714   CAWA
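A sketch of how such a data sheet might be brought into R; the file name is hypothetical, and the data.frame() call simply re-creates the example records above.

# Reading a data sheet laid out with one observation per row
birds <- read.csv("bird_surveys.csv", stringsAsFactors = FALSE)

# Or entering the example records directly
birds <- data.frame(
  site    = c("Broad Lake", "Broad Lake", "Pohatcong Creek", "Pohatcong Creek"),
  date    = c("01022007", "01022007", "01232007", "01232007"),
  time    = c(1346, 1456, 600, 714),
  species = c("NOCA", "BAEA", "LOWA", "CAWA")
)
str(birds)   # check that each column was read as the intended type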

4.5 Sampling

4.5.1 Random Sampling
The importance of random sampling cannot be overemphasized. The underlying assumption in many of the tests presented here is that the subjects (e.g., your plots, individuals sampled, etc.) are randomly sampled. Most tests assume that the explanatory variable has effects on the individual that are independent from subject to subject. Formally, we say that the data are independent and identically distributed or, simply, i.i.d. All samples have an equal chance of being sampled and do not influence each other. One is not doomed if data are not independent, but the dependence must be taken into account. In general, dependent or correlated data come from samples being clustered in time or clustered in space. The former is often referred to as repeated measures and the latter as spatial autocorrelation. A simple example of a repeated measure is paired sampling where the same individuals are measured twice, say before and after a drug is given. Nonindependence is a common problem for large-scale or behavioral studies. For example, a restoration study may compare one site to another site or make a before-after comparison. Workers may go out and collect hundreds or thousands of samples and make some inference about the effect of the restoration treatment. But such inferences are not warranted. This problem is often referred to as pseudoreplication. If the treatment was the effect of flood control on plant growth, the flooding may have nothing to do with the response. Other effects may be more important than flooding, such as soil quality, herbivory, or plant genetics. To make inferences about a flooding treatment, several restoration sites will have to be sampled. Workers should become familiar with hierarchy theory to become aware of how large-scale phenomena (working in time or space) may affect inferences. Randomization is important because it allocates systematic error into the error term. In some circumstances, nonindependence can be estimated and may actually be of interest. In the last example, a worker might be interested in the amount of variation within a site, and this is certainly a valid or even necessary question. In these cases, samples within a site or related individuals are considered subsamples and create a nested design. The worker must find the compromise between the number of sites sampled and the number of samples taken within a site.

Great biometrician: Sir Ronald A. Fisher. The contributions of Ron Fisher are peppered throughout any statistics textbook, and his contributions extend into evolutionary biology (e.g., Fisher's runaway selection) and genetics. Fisher had extremely poor eyesight and did much of his mathematics in his mind. Instead of numerical methods of solving statistical problems, Fisher used geometry. His daughter wrote an excellent biography.

4.5.2 Stratified Random Sampling

4.5.3 Paired Sampling
Paired sampling is common with before/after studies where subjects are sampled before and after a treatment. Paired sampling can increase precision if the measurements within subjects are positively correlated.

4.5.4 Repeated Measures
Repeated measures are those in which the same subjects are measured repeatedly over time. Repeated measures are common in long-term studies where the treatment effect is expected to change over time. In terms of random and fixed effects, time is typically treated as a fixed effect and the subject as a random effect.

4.6 Data management

4.6.1 Keeping a diary

4.6.2 Software for data management

4.6.3 Discovering errors


5 Common Distributions

A probability density function (pdf) is a mathematical function (remember "f of x"?) that describes the frequency, and thus the probabilities, of different outcomes. If a statistical function or test assumes a particular distribution, then the function or test is said to be parametric. Most of the standard statistical tests are parametric and assume one of the distributions described in this chapter.

5.1 Histograms Histograms are a graphical representation of probability density functions.

5.2 Uniform
The uniform distribution results when all the outcomes have an equal chance of occurring. Examples include the outcomes of a roulette wheel or the rolling of dice. There are only two parameters to describe the uniform distribution and these are the lower and upper bounds. This distribution is also called a rectangular distribution. In general, an interval is expressed as P(a ≤ X ≤ b); if probabilities are expressed as the area under a curve, then

P(a ≤ X ≤ b) = ∫_a^b f(x) dx

5.3 Binomial

A variable that is binomially distributed is expressed as X ~ Bin(n, p), where n is the number of trials and p is the probability of success per trial.

Binomial random variables have two outcomes such as live/dead, present/absent, and male/female. The probability of obtaining x successes for a binomial variable is:

P(x) = (n choose x) p^x (1 − p)^(n−x), where (n choose x) = n! / [x!(n − x)!] is read "n choose x"

The expected outcome E(X) is np, and the variance, σ², associated with a binomial variable X is σ²(X) = np(1 − p).
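A short R sketch of these binomial quantities, using a hypothetical experiment of 10 fair coin flips.

# Binomial probabilities: X ~ Bin(n = 10, p = 0.5), e.g., heads in 10 fair coin flips
dbinom(7, size = 10, prob = 0.5)   # P(X = 7)
pbinom(7, size = 10, prob = 0.5)   # P(X <= 7)
choose(10, 7)                      # "n choose x"

n <- 10; p <- 0.5
n * p                              # expected value, np
n * p * (1 - p)                    # variance, np(1 - p)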

5.4 Poisson

The Poisson distribution describes counts in a sample of time or space. The Poisson distribution can be described by a single parameter, λ (lambda), which is also called the rate parameter and is the expected count per sample.

X ~Poisson Where lambda is the

x  − The probability of observing x counts in a sample is P  x= e with E(x) = lambda and the x! variance, 2 , associated with a Poisson variable X is λ. As lambda increases, the shape of the converges on a normal distribution.

5.4.1 Using the Poisson to measure dispersion
The ratio of the variance to the mean, var(X)/E(X), is a very useful measure for looking at the dispersion of objects or how organisms are distributed across space. If a species is distributed randomly then the variance should be close to the mean. Species that are clumpy in space, such as herds of deer, will have a variance that exceeds the mean: most samples will have few individuals or none, and a few samples will have very many counts. If individuals repel each other then the counts will be very uniform; the counts will depend on the sampling design, but the variance will be less than the mean.

Dispersion   Variance/mean   Spatial example
Random       ~1
Clumped      >1              herds of deer
Repelled     <1
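A minimal R sketch of Poisson probabilities and the variance-to-mean ratio, using simulated counts with an assumed λ of 3.

set.seed(1)
x <- rpois(1000, lambda = 3)   # simulated random (Poisson) counts
dpois(0:5, lambda = 3)         # P(X = 0), ..., P(X = 5)

var(x) / mean(x)   # variance-to-mean ratio; ~1 for random dispersion,
                   # >1 for clumped counts, <1 for repelled (uniform) counts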

5.5 Gaussian or Normal distribution
A normal distribution is the classic symmetric hump-shaped distribution. A Gaussian random variable is expressed as X ~ N(μ, σ).

The probability density function of a normal distribution is

f(x_i) = [1 / (σ √(2π))] e^(−(x_i − μ)² / (2σ²)), where the mean is μ and σ² is the variance

With the hump-shaped curve in mind, the peak will occur at the mean, μ, and the spread of the curve is defined by σ, the standard deviation. A normal curve is symmetric about μ, so that an equal number of observations will occur below μ as above it. There are several methods of testing whether a variable (vector) is distributed normally. For ordered discrete and continuous data there are the Kolmogorov-Smirnov* goodness-of-fit tests (K-S tests). However, a more powerful, and thus preferred, test is the Shapiro-Wilk test of normality:

W = (Σ w_i X'_i)² / Σ (X_i − X̄)²

where the X'_i are the ordered values and the w_i are weights.

*Named after two Russian mathematicians, Andrei Nikolaevich Kolmogorov (1903-1987) and Nikolai Ivanovich Smirnov (1887-1974). They were not academically related (i.e., not graduate student and advisor).
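Both normality checks are available in base R; a sketch with simulated data (note that the K-S test is only approximate when the mean and standard deviation are estimated from the same data).

set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)        # hypothetical measurements

shapiro.test(x)                          # Shapiro-Wilk test of normality
ks.test(x, "pnorm", mean(x), sd(x))      # K-S test against a fitted normal
qqnorm(x); qqline(x)                     # visual check with a normal quantile-quantile plot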

5.6 Exponential
f(x) = B e^(−Bx)

5.7 Cauchy

5.8 Gamma

5.9 Beta

6 Descriptive Statistics and Exploratory Data Analysis

6.1 Introduction
Descriptive statistics is how you get intimate with your data. Some textbooks refer to this step as Exploratory Data Analysis (EDA). Despite being intimate it is very unsexy stuff (as opposed to sexy statistics such as splines and ordination – there you get to twist and contort your data) and very little of it will be included in a manuscript. The most important tool in understanding your data is data visualization using graphs. You should examine each variable by itself and then examine the relationships between explanatory variables.

6.2 Law of large numbers
Where, given a random sample,

E(X̄) = lim_{n→∞} (Σ_{i=1}^{n} y_i) / n = E(X)

This is the general approach of the frequentist paradigm.
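A quick R simulation illustrates the idea: the running mean of random draws settles on the expected value. The true mean of 5 below is an arbitrary choice for illustration.

set.seed(1)
y <- rnorm(10000, mean = 5, sd = 2)        # random draws with true mean 5
running_mean <- cumsum(y) / seq_along(y)   # sample mean after each additional draw
plot(running_mean, type = "l")
abline(h = 5, lty = 2)                     # the running mean settles on E(X) = 5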

6.3 Descriptive Statistics

6.3.1 Location Statistics

6.3.1.1 Mean

The mean of a sample can be estimated using

x̄ = (Σ_{i=1}^{n} x_i) / n

Is x an unbiased estimate of  ? Yes if samples were taken randomly, samples are independent of each other and the population has a random distribution.

6.3.1.2 Weighted mean If there are multiple subgroups and uneven numbers in each subgroup then a weighted mean might be Stratford & Wilkes University DRAFT 20 appropriate: n ∑ ⋅ wi xi = x= i 1 ∑n w i i=1

1.1.1.1 Median
The median is the midpoint of the ordered data. For an odd number of observations, this is the middle number. For an even number of observations, the median is the average of the two center observations.

1.1.1.2 Mode
The mode is the most frequently occurring number.

1.1.1.3 Quantiles
Ordered (sorted) data can be broken up into quantiles. A quartile is when the data are broken up into four equal parts. A percentile breaks the data into one hundred equal parts.
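The location statistics above are one-liners in R; the data vector and weights below are hypothetical.

x <- c(2, 4, 4, 5, 7, 9, 12)

mean(x)                                        # arithmetic mean
weighted.mean(x, w = c(1, 1, 1, 2, 2, 3, 3))   # weighted mean with hypothetical weights
median(x)                                      # midpoint of the ordered data
quantile(x, probs = c(0.25, 0.5, 0.75))        # quartiles
table(x)                                       # frequencies; the mode is the most frequent value (4)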

1.1.2 Statistics of Variation

1.1.2.1 Range
The range is the difference between the largest and smallest values.

6.3.1.3 Deviation
A deviate is the measure of the distance between an individual observation and the mean: x_i − x̄

6.3.1.4 Sum of Squares (SS)
Although the sum of squares (SS) is not a useful statistic by itself, SS is used in many other statistics and statistical tests.

SS = Σ (x_i − x̄)²

6.3.1.5 Variance
The population variance (σ²) of a random variable can be estimated using

σ̂² = Σ (x_i − x̄)² / n

and the sample variance can be estimated using

s² = Σ (x_i − x̄)² / (n − 1)

The denominator is reduced by 1 because, without it, the variance would be underestimated.

6.3.2 Standard deviation

s = √(s²)

In a normal distribution:

One standard deviation unit contains 68.27% of the points
Two standard deviation units contain 95.45% of the points
Three standard deviation units contain 99.73% of the points

50% of the points fall within 0.674 standard deviations
95% of the points fall within 1.960 standard deviations
99% of the points fall within 2.576 standard deviations

6.3.3 Confidence intervals

95% CI = x̄ − 1.96·SEM to x̄ + 1.96·SEM, where SEM is the standard error of the mean (next section)

6.3.4 Standard Error of the Mean
The standard error of the mean (or of any statistic) indicates the precision of a particular estimate. This differs from the standard deviation, which indicates the spread of the data. It is equivalent to the standard deviation divided by the square root of the sample size. Therefore, as the sample size increases, the standard error decreases.

SEM = σ / √N and, for a sample, SEM = s / √n
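A short R sketch of the standard error of the mean and the corresponding 95% confidence interval, using a hypothetical sample.

set.seed(1)
x   <- rnorm(30, mean = 20, sd = 4)   # hypothetical sample
n   <- length(x)
sem <- sd(x) / sqrt(n)                # standard error of the mean

mean(x) + c(-1.96, 1.96) * sem        # approximate 95% confidence interval for the mean
# For small samples the t quantile is more appropriate:
mean(x) + qt(c(0.025, 0.975), df = n - 1) * sem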

6.3.5 Skewness

skew = g1 = [1 / (n s³)] Σ_{i=1}^{n} (x_i − x̄)³

If g1 is negative then the distribution is skewed to the left (a long tail of smaller values than expected); if g1 is positive then it is skewed to the right (a long tail of larger values than expected).

Mean, mode and median with skewed data.

6.3.6 Kurtosis
Kurtosis is a measure of the shape of the hump. A platykurtic distribution is relatively flat and a leptokurtic distribution is "peaky". It is the fourth moment.

kurtosis = g2 = [1 / (n s⁴)] Σ_{i=1}^{n} (x_i − x̄)⁴ − 3

6.3.7 Coefficient of variation (CV)
The coefficient of variation is a useful statistic when comparing observations from different populations or dissimilar groups. For example, CV is useful when comparing lengths of different body parts or the same body part on different species.

CV = (s / x̄) × 100

6.4 Relationships between variables

6.4.1 Graphing
Graphing is the most important step in detecting relationships between variables. A simple linear relationship (correlation) is detectable with a correlation analysis. A problem may arise if a more complex relationship exists. Relationships between predictor variables are important to discern since correlated predictors will complicate analysis. Relationships between a predictor variable and a response variable can take on a variety of shapes, and it is important to understand the nature of that relationship before a more formal analysis. One type of situation that complicates analysis is when the direction of the relationship changes over different values.

6.4.2 Types of relationships and what to graph

Predictor variable       Response variable   Type of graph
Continuous               Continuous          Scatterplot
Categorical or nominal   Continuous          Bar

6.4.2.1 Correlation
Correlation is a linear relationship between two variables (say X and Y). If the two variables are each distributed normally then a bivariate normal distribution may exist; this is a two-dimensional normal distribution. The two variables are correlated if values of one variable are related to values of the other variable. A positive correlation exists if low values of X are related to low values of Y and high values of X are related to high values of Y. The parameter ρ (rho) is the population correlation coefficient and is estimated with r, the sample correlation coefficient. The Pearson product-moment correlation coefficient is a parametric measure of correlation introduced by Karl Pearson in 1895 and is estimated by the formula

r = cov(x, y) / (s_x s_y), where cov(x, y) = Σ (x_i − x̄)(y_i − ȳ) / (n − 1)

An equivalent computational formula is

r = [n Σ x_i y_i − (Σ x_i)(Σ y_i)] / √{[n Σ x_i² − (Σ x_i)²][n Σ y_i² − (Σ y_i)²]}

If either variable is not normally distributed, a relationship can still be estimated using the Spearman rank correlation coefficient. The formula is

r_s = 1 − [6 Σ d_i²] / [n(n² − 1)]

If X and Y are sorted and ranked, then d_i is the difference between the rank of Y_i and the rank of X_i. Both the Spearman and Pearson correlation coefficients range between −1 and 1. Perfect relationships have values of |1| and no relationship gives a coefficient of 0.
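Both correlation coefficients are available through R's cor() and cor.test(); the variables below are simulated only for illustration.

set.seed(1)
x <- rnorm(30)
y <- 0.5 * x + rnorm(30)         # hypothetical correlated variables

cor(x, y)                        # Pearson r
cor(x, y, method = "spearman")   # Spearman rank correlation
cor.test(x, y)                   # r with a test of rho = 0 and a confidence interval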

6.5 Missing values

Do not put zeros in for missing data. Consult your software and make sure when moving between programs that missing data are handled accordingly. In R, na.action = na.omit drops observations with missing values (so the number of fitted values will not equal the number of observations); use na.action = na.exclude to pad the fitted values with NA.

6.6 Looking for curvilinear relationships

7 Transformations

7.1 Introduction
Why transform? Many tests make the assumption that the data are normal, and ignoring this assumption can potentially lead to misleading inferences. Many ecological data sets are not normally distributed, yet these "parametric" tests can still be applied if the data can be transformed to become normal. Transforming data is useful for analysis; however, the more complex the transformation, the harder it is to intuitively translate the results. It is easy to imagine the relationship between the percent urban cover in a landscape and the number of tree species, but it is much harder to interpret the arcsine-square-root transformed data. Always report if a transformation is used; means should be reported with confidence intervals on the transformed scale, and standard errors should not be reported {Sokal, 1995 #6268}. Positively skewed data can be transformed with powers < 1, such as taking the square root of the variable. Negatively skewed data can be transformed with powers > 1, such as taking the square. Other transformations should be applied a priori when the dimensionality or power of the original data is not one. For example, surface areas should be square-rooted and volumes should be cube-rooted.

7.2 Arcsine transformation The arcsine transformation is used for percentages and proportions when the sum across groups can add to one. This transformation tends to flatten out a leptokurtic distribution {Sokal, 1995 #6268}.

7.3 Square-root arcsine transformation: Used for rates and percentages. If given a percentage then convert to a proportion first (divide by 100). To transform data via square-root arcsine, first take the square root then take the arcsine of that number.

t = sin⁻¹(√x)

Then to transform your data back, you take the square of the sine of the transformed value: x = (sin t)².

Back or reverse transformations: when presenting descriptive data be sure to back-transform the data. Presenting the mean of square-root arcsine transformed data is not very meaningful.
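A minimal R sketch of the square-root arcsine transformation and its back-transformation, using hypothetical proportions.

p <- c(0.02, 0.10, 0.45, 0.80, 0.98)   # proportions (divide percentages by 100 first)

tr   <- asin(sqrt(p))   # square-root arcsine transformation
back <- sin(tr)^2       # back-transformation returns the original proportions
all.equal(p, back)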

7.4 Exponential transformation Exponents less than 1 affect the larger values more than the smaller values. Exponents greater than one affect the smaller values more than the larger.

7.5 Square-root transformation
"When the data are counts… such distributions are likely to be Poisson rather than normally distributed and that in a Poisson dist the variance is the same as the mean. Therefore the mean and variance cannot be independent but will vary identically." Add 0.5 to each count before taking the square root if the counts include 0.

7.6 Logarithmic transformation
"The mean is positively correlated with the variance." "Frequency distributions skewed to the right." Since the natural logarithm of 0 is undefined, it is necessary to add 1 to all observations (i.e., use log(x + 1)).

7.7 Rank transformations
Nonparametric tests often use rank transformations. The data are sorted and then replaced by their ranks. For some tests it is important to realize that only one variable is used to rank the data.

7.8 Box-Cox or power transformations
The Box-Cox or power transformation estimates the optimal power that transforms data to normality. It does not guarantee that a normal distribution will result, and it should be used when a transformation is not known a priori. The Box-Cox criterion used to find the optimal power for transformation is

L = −(v/2) ln s²_T + (λ − 1)(v/n) Σ ln Y

where s²_T is the variance of the transformed data, v is the degrees of freedom, λ is the candidate power, and n is the sample size.
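In practice the optimal power is usually found numerically; one common option is the boxcox() function in the MASS package, which profiles the log-likelihood over a range of λ values. The data below are simulated only to illustrate the workflow.

library(MASS)                             # boxcox() lives in the MASS package
set.seed(1)
x <- runif(50, 1, 10)
y <- exp(0.3 * x + rnorm(50, sd = 0.3))   # hypothetical right-skewed response

bc <- boxcox(lm(y ~ x))                   # profile log-likelihood across candidate lambdas
lambda <- bc$x[which.max(bc$y)]           # lambda with the highest likelihood
y_transformed <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda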

7.9 Data standardization
It is often difficult to compare the effects of different variables measured on different scales. One method to overcome this difficulty is data standardization. Typically data are standardized to a mean of 0 and a standard deviation of 1. This transformation should not be used when the data are not normal.

x_s = (x − x̄) / s

Standardization is a function in the R package "arm."
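A sketch of standardization in base R; the arm package's standardize() function rescales the inputs of a fitted model in a similar spirit, but base scale() works directly on a raw vector.

x <- rnorm(20, mean = 50, sd = 10)   # hypothetical measurements

z  <- (x - mean(x)) / sd(x)   # standardize to mean 0 and standard deviation 1
z2 <- scale(x)                # scale() does the same and returns a matrix with attributes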

8 Introduction to Modeling

8.1 Introduction

A model is a mathematical expression of a hypothesis. Models approximate hypotheses, and some authors equate hypotheses to models (e.g., {Woodworth, 2004 #6247}). In the frequentist paradigm, a model allows us to make empirical predictions. From the Bayesian viewpoint, a model provides a framework by which support for a particular hypothesis can be generated by "confronting" the model with data {Hilborn, 1997 #6271}. It is possible for a biological hypothesis to have several statistical models to accompany it. For example, one may hypothesize that temperature affects the growth rate of developing amphibians. There are several ways to model the relationship between temperature and growth rates, including a linear, quadratic, or step function. The more explicit the hypothesis, the more it converges on a particular model. Statistical models are simply approximations of reality. There is no perfect model; a "perfect" model would account for all the heterogeneity that exists among individuals and the environment (Burnham and Anderson 2002). The model would be rendered imperfect the moment something novel, even slight, occurred. Moreover, the utility of a model, in some cases, may be diminished if it is too good. For example, one might want to understand how weather regulates populations, and the researcher is able to very accurately model weather parameters and populations. However, the model may fail to predict populations of other organisms; therefore we only understand how weather affects the population of a specific species but not organisms in general. It should be apparent to the reader, though, that the relative utility of a model is dependent on the goal. The goal is increased knowledge through inference. For example, if the species in question were endangered then having a very accurate population model would be very important even if it isn't applicable to other organisms. We may want to make models that account for other species that are ecologically similar or we may want a model to account for all species. Explaining variation in one phenomenon by another phenomenon or other phenomena is the goal of all ecological investigations. It is the reason for this book; it is the reason we go out into the field. In general, the association between the response and the explanatory variable is called covariation. Our goal as ecologists is to understand the behavior of a response variable (typically y) in response to one or more predictor variables (typically x). A simple representation of this relationship is E(y | x), read "the expected value of y given the value of x." Another perspective is that the value of Y is conditional on x. Another way to look at a model is P(Y = y_i) = f(x). We will use the following notation: y_i = f(x_i).

8.2 Model Complexity and Error

Albert Einstein probably had this in mind when he wrote in 1933 that "The supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience," often paraphrased as "Theories should be as simple as possible, but no simpler." By adding variables you increase model fit – the congruence between the model and the observations – but you are adding model complexity and making the model less general. By losing generality, the model becomes less able to predict new observations. Therefore we should aim to strike a balance between model complexity and model applicability. A term often applied to modeling is parsimony: we should use models that account for the greatest amount of variation with the fewest variables. Frequentists will want to eliminate variables that are not statistically significant. A Bayesian will compare the model with and without the parameter to see how much information is lost or gained. Variability not accounted for by a model is called error. Error has many sources, including measurement error (many statistics assume measurement without error) and error from leaving out variables. The goal is to minimize error and maximize the variation that is accounted for.

8.2.1 Fixed Effects vs. Random Effects

Fixed effects are those in which the levels of treatment are deliberately selected, or manipulated, by the experimenter. Random effects are those in which the levels of treatment occur without being manipulated by the researcher. For example, a researcher might be interested in the effects of temperature on larval anuran development. She might go to the field and install temperature loggers at several egg masses where she expects different temperatures, or she might take a number of egg masses into a controlled lab where she sets the temperatures. The former would be a random effect and the latter a fixed effect. The difference puts different bounds on the inferences one can make. Random effects assume that one is randomly sampling a population so that inferences can be made about that population. Fixed effects limit inference to the range of treatments used. The advantage of fixed-effects models is that they are generally more precise and accurate because uncontrolled variance is usually greater in random-effects models.

8.3 Types of models

8.3.1 Deterministic and Stochastic models Deterministic models have parameters that are not associated with any uncertainty, so the outcomes are predictable and do not change. Stochastic models incorporate uncertainty, so putting in the same values may yield different results. Sensitivity can be measured to determine the amount of uncertainty in a model and the response of the overall model to changes in the explanatory variables. Elasticity is the proportional sensitivity of the model to a proportional change in a particular variable.

8.3.2 Saturated model Each observation has its own parameter so that the model has perfect fit. There are zero degrees of freedom.

8.3.3 Maximal model Has all the variables of interest, including their interactions, with n − p − 1 degrees of freedom. This model will fit well; however, it will contain variables that are redundant with other variables or that do not meaningfully influence the response variable.

8.3.4 Intercept only or null model This model fits the overall mean of the response variable so has n-1 degrees of freedom. This model has no explanatory power. It may seem silly and pointless to create a model with no explanatory power. These models are, however, extremely useful when they serve as standards.

8.3.5 Linear Models Linear refers to the form of the relationship between the response and the parameters, not to the shape of the line when graphed. One popular way to represent a linear model is

y = β₀ + β₁x₁ + … + βₙxₙ + ε

β₀ is the intercept, where the line crosses the y-axis. The other betas are coefficients and the x's are the values of the explanatory variables. Epsilon (ε) represents the error of the model. Each explanatory variable has its own coefficient. Betas can also be thought of as slopes for each of the predictors. In this form, the explanatory variables can be continuous or categorical. The response variable can be continuous, binary, or categorical.
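A sketch of fitting such a linear model in R (the data are simulated and the variable names are hypothetical):
set.seed(1)
x1 <- rnorm(50); x2 <- rnorm(50)
y  <- 2 + 1.5*x1 - 0.8*x2 + rnorm(50, sd = 0.5)   # known betas plus error
fit <- lm(y ~ x1 + x2)
coef(fit)        # estimated intercept and coefficients
summary(fit)     # coefficients with their standard errors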

8.3.6 Properties of B Beta is a parameter of a model representing the y-intercept (beta-naught) or a variable's coefficient (beta-n). Each beta has a mean, a standard deviation, and a standard error. The direction of a predictor's effect is given by the sign of its beta, and the magnitude of the effect is given by the magnitude of the coefficient. Generally, if the interval beta ± SE includes zero, the effect is considered negligible.

8.3.7 With interactions An interaction is when the effect of one explanatory variable is influenced by the level of another. Note that the interaction term has its own coefficient although the two explanatory variables are multiplied together.

y = β₀ + β₁x₁ + β₂x₂ + β₃x₁x₂ + ε

8.3.8 Polynomials

y = β₀ + β₁x + β₂x² + ε

Note that the squared term has its own beta, which is not equal to the square of β₁.

If we examine the βxⁿ term, the sign determines the direction of influence, n determines the order, and the magnitude of beta determines the amount of bend. When n = 1 the regression line is straight. Generally, polynomial models include all the lower-order terms.

8.3.9 Nonlinear models

1.1.2.2 Defined y = f(x, β)

1.1.2.3 Examples

y = (β₀ + β₁x₁) / (1 + β₂x₂)

y = e^(β₀ + β₁x₁); however, this model can be made linear by taking the logarithm of both sides:

ln y = β₀ + β₁x₁

8.3.10 Nested models and nested variables Variables are nested if the levels of one variable occur as subsets of the levels of another variable. Models are nested when the terms of one model are a subset of the terms of another, more complex model.
Example of nested variables: ((((student) class) school) district)
Example of nested models: M1: time; M2: time + time²

8.4 Fitting data to models There are two general methods of fitting models to data. Least squares estimation (LSE) works by minimizing the errors associated with the model:

SSE = Σ (y_i − ŷ_i)²

Maximum likelihood estimation (MLE) works by finding the parameter estimates that are most likely to have produced the observed data. Maximum likelihood was originally proposed by Sir Ronald Fisher. The link between the model and the data is the likelihood function:

L(θ | y) = f(y | θ) = Π_{i=1}^{n} f(y_i | θ)

For example, for a Poisson model,

f(y | μ) = Π_{i=1}^{n} e^(−μ) μ^(y_i) / y_i!

Typically, log-likelihoods are computed, i.e. ln L(θ | y). If the log-likelihood is differentiable, then at its maximum it must satisfy

∂ ln L(θ | y) / ∂θ_i = 0

Rarely can the partial derivatives be solved analytically. Instead the parameters are usually fitted iteratively, by trial and error, on a computer. The relationship between the log-likelihood and a parameter is often a simple single-peaked curve, such that the maximum likelihood is the global maximum of the curve. However, more complex curves are possible and local maxima can be selected if searches are too limited. LSE and MLE converge on the same estimates when observations of the response variable are independent and normally distributed with constant variance across the range of the predictor variable {Myung, 2003 #6723}.
Classic Paper Neyman, J. 1937. Outline of a theory of statistical estimation based on the classical theory of probability. Philosophical Transactions of the Royal Society of London Series A 231:333-380.
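A minimal sketch of maximum likelihood estimation by numerical search, here for a Poisson mean, where the analytical answer (the sample mean) is known for comparison; the data are simulated:
set.seed(2)
y <- rpois(100, lambda = 4)
negloglik <- function(mu) -sum(dpois(y, lambda = mu, log = TRUE))
fit <- optimize(negloglik, interval = c(0.01, 50))   # one-dimensional numerical search
c(mle = fit$minimum, sample.mean = mean(y))          # the two agree closely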

8.5 Issues in Modeling

8.5.1 Multicollinearity Having correlated variables in the same model can cause problems. Accounting for variation with one variable will “steal” it away from the other.

8.5.2 Heteroscedasticity The residual variance is not constant but becomes larger or smaller in relation to the fitted values. There are two common ways to check for heteroscedasticity and non-normality of the residuals: plot the residuals against the fitted values, and plot the residuals against the quantiles of the normal distribution.
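A sketch of the two checks just described, for a hypothetical fitted model object fit from lm() or glm():
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)                   # residuals should scatter evenly around zero
qqnorm(resid(fit)); qqline(resid(fit))   # residuals against normal quantiles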

8.5.3 Leverage

f(y_i; θ_i, φ) = exp{ A_i [y_i θ_i − γ(θ_i)] / φ + τ(y_i, φ / A_i) }


9 Variable and Model Selection

9.1 Introduction There are two approaches to dealing with uncertainty about which variables to include in a model. One strategy is variable selection and the other is model selection. Generally, variable selection is used by frequentists and model selection is used by Bayesians. A model selection approach treats each different combination of parameters as a different and separate model. A variable selection approach does not have a working model until the computer finishes the analysis, but it does allow one to set criteria for keeping or removing a variable.

9.2 Stepwise Stepwise regression either adds (forward step) or removes (backward step) parameters based on the influence of each parameter. Typically a variable is kept if its beta is statistically significant. Many computer programs can do stepwise regression automatically and require one to set the criteria for keeping a variable in, or removing it from, the finished model. Stepwise selection has largely fallen out of favor, and it is problematic when variables are correlated.

9.3 Likelihood ratio test The likelihood ratio test is used when comparing two models. Typically used when comparing a null model (not a null hypothesis per se) and a model with the variable of interest or two nested models.

The ratio is λ = L(simpler model) / L(more complex model) and the test statistic is χ² = −2 ln λ with k degrees of freedom, where k is the difference in the number of estimated parameters.
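A sketch of the likelihood ratio test in R, comparing an intercept-only (null) model with a model containing the variable of interest; the data are simulated:
set.seed(3)
x <- rnorm(100)
y <- rpois(100, lambda = exp(0.5 + 0.3*x))
m0 <- glm(y ~ 1, family = poisson)     # simpler (null) model
m1 <- glm(y ~ x, family = poisson)     # model with the variable of interest
anova(m0, m1, test = "Chisq")          # chi-square test on the change in deviance
pchisq(as.numeric(2*(logLik(m1) - logLik(m0))), df = 1, lower.tail = FALSE)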

9.4 Information Theoretic Methods

9.4.1 Akaike's information criterion (AIC) • Named after Hirotugu Akaike, born 1927 in Japan.

9.4.1.1 Estimating AIC for an individual model AIC finds the model that explains the most variation with the fewest parameters. Models are assigned an AIC value and then ranked, with the “best” model having the smallest AIC value.

AIC = −2 ln(L) + 2K

where K is the number of parameters and L is the model likelihood. For small (< 40) sample sizes use:

AICc = −2 ln(L) + 2K + 2K(K + 1) / (n − K − 1)

For least-squares methods, such as regression and analysis of variance, AIC is calculated using

AIC = n log(σ̂²) + 2K

where σ̂² is the residual sum of squares divided by n. Note that AIC and its derivations all rely on y being constant; that is, the response variable must be unchanged and untransformed across all the candidate models.

9.4.1.2 Using AIC to sort models An AIC value is calculated for each model; the absolute values are not as important as the relative values. Once models are ranked, the AIC of the model with the lowest AIC is subtracted from that of every candidate model:

Δ_i = AIC_i − AIC_min

As Δ_i grows larger, the model becomes less plausible. Models with Δ_i values between 0 and 2 have strong support, those with values of 4–7 have less support, and those with values > 10 have essentially no support (Burnham and Anderson 2002). A modified or corrected version of AIC (Sugiura 1978) should be used when sample sizes are small compared to the number of parameters (< 40, Burnham and Anderson 2002).
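A sketch of ranking candidate models by AIC and computing the Δ values in R; the data and models are simulated for illustration:
set.seed(4)
x1 <- runif(30); x2 <- runif(30)
y  <- 1 + 2*x1 + rnorm(30)
m0 <- lm(y ~ 1); m1 <- lm(y ~ x1); m2 <- lm(y ~ x1 + x2)
aic <- AIC(m0, m1, m2)
aic$delta <- aic$AIC - min(aic$AIC)    # delta_i = AIC_i - AIC_min
aic[order(aic$delta), ]                # ranked model table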

9.4.1.3 Model averaging The first step is determining the model weights: the relative support for each model across all candidate models,

w_i = exp(−Δ_i / 2) / Σ_{r=1}^{R} exp(−Δ_r / 2)

A model-averaged parameter estimate is then the weighted sum across the R models:

θ̂̄ = Σ_{i=1}^{R} w_i θ̂_i

9.4.1.4 Pitfalls when using AIC If the response variable is transformed differently among models, those models cannot be part of the same AIC analysis. Failure of a model to converge on the maximum likelihood value will also trip up the analysis; often a local maximum is found rather than the global maximum. Programs like SAS will still report the log-likelihood value in the output but will put the warning in the log file, which is often overlooked.

9.4.2 BIC

BIC = −2 ln(L) + K ln(n)

9.4.3 BIC or AIC? As sample size becomes larger, BIC tends to select the true model but AIC will select the more complex model. With smaller sample sizes, BIC tends to select simpler models.

9.4.4 Multimodel inference

10 Generalized Linear Models (GLM)

10.1 Introduction There are three components of a generalized linear model: the random component, the linear predictor, and the link function. The random component is the response variable, which can take on various distributions including the Gaussian, Poisson, binomial, and gamma. The linear predictor is the sum of the explanatory variables (x) multiplied by their coefficients (betas); in general the form is

η_i = Σ_{j=1}^{p} x_ij β_j

The linear predictor is often referred to as η (eta). The link function is how the response variable is transformed to match the linear predictor, and is represented as g(μ_i) = η_i.

The random component is the response variable y, whose probability density function takes on one of the distributions described in the previous paragraph. The linear predictor, η_i, is the same as that described in the linear model section. The link function transforms the random component into the linear predictor and is represented as g(μ_i) = η_i.
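A sketch of the three components in R's glm(): here the random component is a Poisson response, the link is the log, and the linear predictor is β₀ + β₁x; the data are simulated:
set.seed(5)
x <- rnorm(100)
y <- rpois(100, lambda = exp(0.3 + 0.6*x))
fit <- glm(y ~ x, family = poisson(link = "log"))
summary(fit)$coefficients   # estimates on the scale of the linear predictor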

11 Summary

Read
Find a problem whose solution is fundable, interesting and publishable
Think of the most likely hypotheses
Think of the predictions that these hypotheses generate
Convert your biological predictions into statistical models
Select the predictions that best test the hypotheses
Collect the necessary data
Get to know your data: histograms and scatterplot matrices (splomming); look for non-linear relationships
Re-express your statistical models and determine the nature of the variables
Select the right statistical model
Sort models and decide on results

12 Responses and Predictors are Continuous (Regression)

1.2 Introduction When the response variable and the predictor variable or variables are continuous, a regression is an appropriate analysis. Realize, however, that regression is a generic term that encompasses various statistical techniques of varying complexity. Regression differs from correlation in that in regression the relationship between the two variables is treated as causative. Typically, there is a response variable (y) and at least one predictor variable (x). The nature of that relationship can be ortholinear (e.g., a straight line) or curvilinear, and can be monotonic (y values only increase or only decrease along x) or non-monotonic, taking on more complex curves. Inferences from regression need to be limited to the range of x included in the analysis. Making inferences about the response variable between sampled x values is interpolation and is strengthened by increasing sample size. Extrapolation is making inferences about the response variable outside the range of the predictors in the analysis and is riskier because the relationship between the two variables may not remain linear. Indeed, many biological phenomena may be linear but only over a limited range. Ideally, the broadest range of x over which we want to understand the response variable is sampled.

12.1 Simple Linear Regression The formula for a line that one learns in high school geometry, y = mx + b, has the same structure as a simple linear regression with a single predictor, with the addition of an error term, ε:

Y_i = β₀ + β₁X_i + ε_i

where β₀ is the intercept, β₁ is the coefficient of X (the slope), and ε_i is the error term. β₁ transforms the values of X such that a change of one unit in X is associated with a change of β₁ units in Y.

12.1.1 Partitioning variation in regression Unless all the points fall exactly on a straight line, some observations will fall off the line. The vertical distance between a point and the line is the error or residual. We expect there to be as many points below the line (negative residuals) as above it (positive residuals), such that the mean of the residuals is 0. The residuals will be normally distributed with variance σ². Formally we express this as

ε ~ N(0, σ²)

A regression model with a perfect fit would have all the points on the regression line; as points deviate from the line the fit becomes poorer. The residual variation for any point is the vertical distance between the point and the best-fitting line:

d = y_i − ŷ_i

Draw a line with fitted y and obs y

When d > 0 then the observation is above the line and when d < 0 the observation is below the line.

Once a line is fitted to the data, the residual sum of squares is

RSS = Σ_{i=1}^{n} (Y_i − Ŷ_i)²

The best-fit regression line minimizes the RSS.

The sample sum of cross products is

SS_XY = Σ_{i=1}^{n} (Y_i − Ȳ)(X_i − X̄)

and the sample covariance is

s_XY = [1 / (n − 1)] Σ_{i=1}^{n} (Y_i − Ȳ)(X_i − X̄)

Note that the sample covariance can be less than 0: a decreasing relationship between X and Y has a negative covariance, and a covariance near 0 indicates little or no linear relationship. The slope parameter β₁ can now be estimated using

β̂₁ = s_XY / s_X² = SS_XY / SS_X

The intercept can now be estimated using algebra:

β̂₀ = Ȳ − β̂₁ X̄

The sum of squares of the regression is

SS_reg = SS_Y − RSS
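A sketch of computing the slope and intercept from these sums of squares and checking the result against lm(); X and Y are simulated:
set.seed(6)
X <- runif(25, 0, 10); Y <- 3 + 1.2*X + rnorm(25)
SSxy <- sum((X - mean(X)) * (Y - mean(Y)))
SSx  <- sum((X - mean(X))^2)
b1 <- SSxy / SSx                        # slope
b0 <- mean(Y) - b1 * mean(X)            # intercept
c(b0 = b0, b1 = b1); coef(lm(Y ~ X))    # the two should match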

12.1.2 Coefficient of determination, R² The proportion of the variation explained by the regression is

R² = SS_reg / SS_Y = SS_reg / (SS_reg + RSS)

This is commonly called “r squared” or the coefficient of determination. The value of R² ranges from 1 for a perfect fit down to 0. Examining the formula shows that the value is 1 when SS_reg = SS_Y and 0 when RSS = SS_Y.

12.1.3 Residuals The variance of the residuals is

MSE = s² = RSS / (n − k − 1) = Σ (y_i − ŷ_i)² / (n − k − 1)

where k is the number of predictors, and the standard error of the residuals is √s².

12.2 Hypothesis testing and regression In general, the hypothesis one tests with a regression model is whether the response variable is related to the predictor variable. This would be evidenced by a slope that is significantly different from zero. One may also want to know whether the y-intercept is significantly different from zero.

The test comes down to signal over noise. The analysis of variance (ANOVA) table for a linear regression model with a single predictor variable is:

Source      df      SS       MS               F-ratio                p-value
Predictor   1       SS_reg   SS_reg / 1       MS_reg / MS_residual   Probability of an F-ratio exceeding the critical value if the null hypothesis were true
Residual    n − 2   RSS      RSS / (n − 2)
Total       n − 1   SS_Y     SS_Y / (n − 1)

The expected mean square for the model is σ² + β₁² Σ_{i=1}^{n} X_i².

12.2.1 Uncertainty and regression The 95% confidence interval for the intercept is

β̂₀ − t_{α, n−2} σ_{β̂₀} ≤ β₀ ≤ β̂₀ + t_{α, n−2} σ_{β̂₀}

where α is the error rate, n is the sample size, and σ_{β̂₀} is the square root of the variance associated with the estimate of the intercept. The 95% confidence interval for the slope estimate is

β̂₁ − t_{α, n−2} σ_{β̂₁} ≤ β₁ ≤ β̂₁ + t_{α, n−2} σ_{β̂₁}

Now that we have an intercept, a slope, and the estimates of uncertainty for each of these, we can estimate the 95% confidence interval for a fitted value:

Ŷ − t_{α, n−2} σ_{Ŷ|X} ≤ E(Y|X) ≤ Ŷ + t_{α, n−2} σ_{Ŷ|X}, where σ_{Ŷ|X} = sqrt( s² [ 1/n + (X_i − X̄)² / SS_X ] )

12.2.2 Regression and Prediction A very useful function of regression is the prediction of the response variable, Y, given the estimated parameters of the regression model (intercept, slope, and their respective confidence intervals). The interval for a new predicted value Ŷ can be estimated by

Ŷ − t_{α, n−2} σ_{Ŷ|X} ≤ Y_new ≤ Ŷ + t_{α, n−2} σ_{Ŷ|X}

Prediction intervals for new observations are wider than confidence intervals for fitted means.
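A sketch of these intervals in R, using the X and Y from the earlier sketch (or any simple regression): confint() gives intervals for the coefficients, and predict() gives confidence intervals for fitted means and the wider prediction intervals for new observations.
fit <- lm(Y ~ X)
confint(fit, level = 0.95)                   # intervals for intercept and slope
new <- data.frame(X = c(2, 5, 8))
predict(fit, new, interval = "confidence")   # interval for the fitted mean
predict(fit, new, interval = "prediction")   # wider interval for a new observation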

12.2.3 Types of Regression In most cases Y, the response variable, is measured with some error. The X variable, however, can be set by the investigator (a Type I model {Sokal, 1995 #4675}), or variation in X may be imposed by nature and measured with some error (a Type II model). For example, an investigator might be interested in the amount of fertilizer and the height of plants after a given time; if the investigator sets the amount of fertilizer each plant gets, this is a Type I model. A Type II model might relate the measured level of a nutrient to the plant height associated with each value.

12.2.4 Assumptions of simple linear regression ● The correct function is chosen ● The predictor variable is measured without error ● The measurements of the predictor and response variables at each observation are independent ● The errors or residuals have a mean of zero ● The error variances are constant along the regression line: homoscedasticity

12.3 Diagnostics The previous steps included exploratory data analysis and the estimation of the regression parameters. There might be some temptation to stop there, but it is essential to check the regression to make sure it is appropriate.

12.3.1 Fitted versus observed Residuals are the parts of the observations that are not predicted by the regression equation. Of the five assumptions listed above, four deal with the errors. An important tool for determining the adequacy of a model is a plot of the residuals (d) against the fitted values, ŷ_i. This scatterplot should show no relationship between the variables, so the points should neither ● expand nor collapse along the horizontal line that represents a perfect fit (d = 0) – a sign that the residuals are heteroscedastic – nor ● show an increasing or decreasing trend along the perfect-fit line.

12.3.2 Plot against possible biases The variables we include in our analysis are those we expect to influence our response variable. We should take care, however, of effects that arise from the collection methods. Plots can be made of residuals versus collection time, and residuals can be coded by observer to account for observer effects. Should such effects be suspected, they should be incorporated into a new model.

12.3.3 Quantile-quantile plots (Q-Q plots) Q-Q plots are useful for examining a given distribution against a theoretical distribution. Q-Q plots are regularly used to check for normality. In regression, Q-Q plots can be used to check the normality of residuals.

12.3.4 Outliers Outliers are extreme observations. What is extreme? Consider deviations from the mean in units of standard deviations: for normally distributed data, one standard deviation captures about 68% of the data and two standard deviations capture about 95%. Extreme observations are those that exceed 3 sd from the mean. Extreme observations can come from • mismeasurement or copying errors • poor sampling, in the sense that the tails were not sampled adequately • model misspecification: the outliers are an artifact of the analysis

Errors due to mismeasurement should be discarded. To identify a pattern, use symbols to code observations by observer (in cases of multiple observers) or by equipment (in cases where equipment was changed); this is where metadata and a detailed field notebook come in handy. In cases where outliers are due to poor sampling or model misspecification, the outliers should be kept.

12.3.5 Influence and leverage Influence is the effect that a single data point has on the parameter estimates. A leave-one-out (jackknife) analysis can be used: a point is removed, the parameters are computed, the point is replaced, another point is removed, and the regression parameters are recomputed; this is repeated until every point has been left out once. Influential points are revealed when the samples are plotted against the parameter estimates. Leverage occurs when points do not influence the parameter estimates but affect other statistics such as R². Typically, influential points occur away from the fitted line, whereas leverage points fall along the fitted line but away from the pack.
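A sketch of standard influence diagnostics in R for a fitted simple regression (X and Y as in the earlier sketches): hat values measure leverage, Cook's distance measures influence, and a leave-one-out loop shows how single points move the slope.
fit <- lm(Y ~ X)
hatvalues(fit)         # leverage of each observation
cooks.distance(fit)    # influence of each observation on the fitted coefficients
loo.slope <- sapply(seq_along(Y), function(i) coef(lm(Y[-i] ~ X[-i]))[2])
plot(loo.slope, ylab = "Slope with observation i removed")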

12.3.6 Predicting X from Y

12.3.7 Difference between regressions

12.3.7.1 Difference between slopes

t = (b₁ − b₂) / s_{b₁−b₂}

tested with (n₁ − 2) + (n₂ − 2) degrees of freedom. The standard error of the difference between the two slopes is

s_{b₁−b₂} = sqrt( s²_{YX} / Σx₁² + s²_{YX} / Σx₂² )

the pooled residual mean square is

s²_{YX} = (SSR₁ + SSR₂) / (df_residual1 + df_residual2)

and the 1 − α confidence interval is

(b₁ − b₂) ± t_{α(2), ν} s_{b₁−b₂}

12.3.8 Intersection between lines (X_i, Y_i)

X_i = (a₂ − a₁) / (b₁ − b₂),   Y_i = a₁ + b₁ X_i

12.4 Review of y
Y – the hypothetical vector of the response variable
y – observed instances of the response variable
ŷ – values of y fitted to the regression line derived from the observed y
ỹ – predicted values of y
Overview of Regression

12.5 Nonparametric Regression

Kendall's robust line-fit method (Kendall's τ)

13 Multiple regression

13.1 Introduction Partial regression parameters.

13.2 Polynomials

13.3 Multivariate regression As predictors are added, the parameter estimates change and the R² increases.

13.4 Issues

13.4.1 Correlations among variables If variables are correlated, the standard errors of the explanatory variables will be inflated. This is the multicollinearity problem. The variance inflation factor (VIF) quantifies the inflation and indicates the influence on the residual variance. Ridge regression and using principal components are common ways to handle correlation.

13.4.2 Model selection

13.5 Nature of the relationship between the response and predictors Start with a GAM.
Diagnostics (studres() and stepAIC() are in the MASS package)
> plot(fitted(mymodel), studres(mymodel)); abline(h=0, lty=2)
> qqnorm(studres(mymodel)); qqline(studres(mymodel))
Tolerance Tolerance is a measure of multicollinearity.
Influential points: leverage, Cook's D, and Studentized residuals. Points with leverage > 0.5 are marginal; any points with a leverage > 0.6 should be dropped. Consider dropping cases with Cook's D > 1.0: find the largest value in the data set, drop it if necessary, and rerun the analysis. Studentized residuals > 3 should be dropped. Homogeneity of variance can be examined by plotting the predicted values on the x-axis and the Studentized residuals on the y-axis.
Automated model selection
> mymodels <- stepAIC(myobject, scope=list(upper= ~var1*var2*var3, lower= ~1), trace=FALSE)

# upper is the most complex model and “lower” is the simplest model, in this case an intercept only model

13.5.1 Hierarchical partitioning

Y = Σ_{j=1}^{p} f_j(X_j)

Interpretation of Results
 # splomming (scatterplot matrix)
 pairs(mydata, panel=panel.smooth)
 # Understand the nature of relationships with generalized additive models
 library(mgcv)
 my.gam <- gam(y ~ s(x1) + s(x2) + s(x3), data=mydata); plot(my.gam)
 # Looking for complex relationships
 library(tree)
 my.tree <- tree(y ~ ., data=mydata); plot(my.tree); text(my.tree)

14 Discrete response variables

14.1 Logistic Regression Logistic or binary models are used when the response is binary or dichotomous, such as “live/dead” or “present/absent.” The relationship between the predictor variable and the response is a sigmoid or logistic curve. The logistic model is given as

E(Y | x) = π(x) = e^(β₀ + β₁x) / (1 + e^(β₀ + β₁x))

β₀ defines the probability of success (the outcome coded with the dummy variable “1”) when x is 0; if the intercept is 0 then the probability of success per trial is 0.5. The variable coefficient, β₁, determines the slope of the transition from one state of the response variable to the other. The error term is binomially distributed. The predictor variable is not necessarily continuous and can be categorical or even binary.

In terms of the conditional mean, π(x), the logit transformation is defined as

g(x) = ln[ p / (1 − p) ] = logit(p) = log odds = β₀ + β₁x

Note there is no error term. The parameter estimates (b coefficients) associated with the explanatory variables are estimates of the change in the logit caused by a unit change in the independent variable.
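A sketch of fitting a logistic regression in R with glm(); the data are simulated, and exponentiating the coefficient of x gives the odds ratio per unit change in x:
set.seed(7)
x <- rnorm(200)
p <- 1 / (1 + exp(-(-0.5 + 1.2*x)))     # true logistic relationship
y <- rbinom(200, size = 1, prob = p)    # binary response
fit <- glm(y ~ x, family = binomial(link = "logit"))
coef(fit)                # estimates on the logit (log-odds) scale
exp(coef(fit)["x"])      # odds ratio per unit change in x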

14.2 Multinomial (logistic) regression

14.3 Ordinal (logistic) regression

15 Responses are counts/ Predictors are continuous (Poisson regression) Count or rate data can be modeled by a number of methods – the two most common being the Poisson and the negative binomial models. If the mean of the rates or counts is well above zero, the distribution will approximate a normal distribution and there will be some temptation to use OLS regression. This technique, however, is inappropriate for two reasons: (1) there are no negative rates or counts, yet OLS will produce predicted values that are < 0, and (2) the variance increases with the mean. The decision to use either the Poisson or the negative binomial should be made a priori. The primary assumption that separates the two concerns how variance is accounted for: the Poisson distribution assumes that the predictor variables account for all the variance, such that the mean and the variance are equivalent. This might be the case for your research, but as one moves up in the ecological hierarchy, the greater the probability that mechanisms are complex and many underlying mechanisms are unaccounted for.

15.1 Poisson Regression

15.1.1 The Poisson distribution The Poisson distribution is named after the French mathematician Siméon-Denis Poisson, who formulated it in 1837.

P(y) = e^(−μ) μ^y / y!

Covariates can be continuous or categorical, and the log of this equation is usually taken (a log link) to produce the linear model

μ_y = e^(β₀ + β₁x)

ln(μ_y) = β₀ + β₁x

The above is often called a log-linear model.

The effects of parameters are tested using the Wald statistic or likelihood ratio tests. Note: the nonlinear nature of the Poisson model means that the coefficients of covariates cannot be interpreted as “a change of one unit in x results in a change of β in y.” The chi-square test uses the sum of the residuals derived from the expected and observed values. Events must be rare – the mean occurrence near zero. S&R suggest “p < 0.1 and the product of sample size and probability kp < 5”.

One use of the Poisson distribution is to determine whether events occur independently of each other, usually in space or time. The distribution is read as: the probability of the number y_i occurring is the mean raised to that number, divided by e^μ times that number factorial, where μ is the mean. Covariates are included through the log-linear model above. The number of parameters is equal to the intercept plus the number of estimated coefficients; an interaction term counts as a single parameter.
Interpreting results Every coefficient can be exponentiated (e^β) to make the inference more direct: an increase of one unit in x multiplies the expected response by e^β. Negative betas give multipliers < 1.0. If variables are measured on the same scale then these numbers are comparable across parameters. Since e^x cannot be < 0, the lower bound of a predicted count is > 0; this should make sense for count data, which cannot have negative counts. Poisson model elasticities can be computed for the ith observation and the kth continuous independent variable; the values are the percentage change in the response for every percent change in that variable. If the explanatory variable is discrete then a pseudo-elasticity is calculated instead.
Model fit A measure comparable to R² can be estimated from the observed counts y_i, the estimated counts λ̂_i, and the mean count ȳ. Fit can also be compared to the null model by examining the reduction in deviance. In R, the deviance of the null model is given as the null deviance and the model deviance is indicated by the “residual deviance”.

 In predicting, use type="response" to back-transform the fitted values to the count scale.
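A sketch of a Poisson regression in R; exponentiating a coefficient gives the multiplicative change in the expected count per unit of x, and type = "response" back-transforms the fitted values:
set.seed(8)
x <- runif(150, 0, 3)
y <- rpois(150, lambda = exp(0.2 + 0.5*x))
fit <- glm(y ~ x, family = poisson)
exp(coef(fit))                          # multiplicative effects on the count scale
head(predict(fit, type = "response"))   # back-transformed fitted counts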

15.2 Over- and under-dispersion Over- and under-dispersion occur when the residual variance is larger or smaller, respectively, than the mean. Dispersion is calculated as the deviance or Pearson chi-square divided by the model degrees of freedom; most software programs will report it. Sources of over-dispersion include multicollinearity among explanatory variables, spatial autocorrelation, or a missing salient explanatory variable. There are two approaches to incorporating deviations from a Poisson distribution. The simplest solution is to scale the distribution so that the dispersion is one; in SAS this simply involves adding dscale or pscale to the model statement. The alternative is to use a negative binomial. If over-dispersion is ignored then the predicted responses will cluster around the mean and there will be too few zeros and too few larger counts. Scaling the distribution adds a parameter called the scale or dispersion parameter (φ), which is estimated from the chi-square statistic and the degrees of freedom. Including a scaling parameter makes the model a quasi-likelihood regression model. Quasi-likelihood models do not make any assumptions about the distribution of the dispersion parameter. If φ is > 1 then over-dispersion is present; however, if the estimate is > 6 then the wrong distributional model is probably being used (Burnham and Anderson 2002, p. 68). Parameter estimates, including the intercept, are unaffected by the addition of the dispersion parameter, but the confidence intervals and errors associated with the parameters are increased; consequently, tests of significance are more conservative. Conversely, if over-dispersion is unaccounted for, the errors will be underestimated. Remember to add another parameter when using model selection techniques that incorporate the number of parameters (e.g., AIC).
A note about e. The actual origin of e remains unknown, but John Napier, a Scot, is generally credited with popularizing its use to ease the calculation of large numbers.
Assumptions
Interpretation of Results The deviance and Pearson chi-square should be approximately chi-square distributed; this is examined by dividing each by the degrees of freedom, n − p, where n is the number of observations and p is the number of fitted parameters. The resultant values should be approximately 1. If a dispersion parameter is used then the scaled values will be 1.0. The predicted observations can be calculated using

 my.pois <- glm(y ~ x, family=quasipoisson)

 # AIC is not given for quasi-likelihood models; family=quasibinomial is also available

15.3 Negative Binomial

1.2.1 Introduction

1.2.1.1 Background (Lawless 1987) The negative binomial is a discrete distribution that incorporates variance beyond that expected under the Poisson distribution by including another parameter. The negative binomial does not assume the variance and the mean are equal and can be used in cases of over-dispersion. The variance of the negative binomial can only be equal to or larger than the mean {Hilborn, 1997 #4685}; do not use the negative binomial in cases of under-dispersion. In cases where the variance and mean are equivalent, the model parameters will be the same as those estimated with the Poisson distribution. The parameter k is used to incorporate the increased variance. Unlike the Poisson distribution, the mean is not constant but is itself distributed, following a gamma distribution.

p(x) = [Γ(k + x) / (Γ(k) x!)] (1 + μ/k)^(−k) [μ / (μ + k)]^x

The dispersion parameter can be estimated from the sample mean and variance as

k ≈ x̄² / (s² − x̄)

k > 0, and the smaller k is, the more clumped the data. Data that are over-dispersed but modeled with a Poisson will have a residual deviance to degrees of freedom ratio > 1, and in some cases much greater than one; however, this is only reliable for large sample sizes {Venables, 2003 #5310}. The Poisson model can be used to estimate the dispersion parameter, or it can be estimated from the data.

1.2.2 R code
 # when the dispersion parameter k is known (negative.binomial() is in the MASS package)
 library(MASS)
 my.nb <- glm(y ~ x, family = negative.binomial(k), data = mydata)
 # when the dispersion parameter is estimated from the data
 my.nb.model <- glm.nb(y ~ x, data = mydata)
 my.nb.model$anova
 c(theta = my.nb.model$theta, SE = my.nb.model$SE.theta)
 rs <- resid(my.nb.model, type = "deviance")
 plot(predict(my.nb.model), rs, xlab = "My predictors", ylab = "Deviance Residuals"); abline(h = 0, lty = 2)
 qqnorm(rs, ylab = "Deviance Residuals"); qqline(rs)

Sir Ronald Aylmer Fisher (1890-1962)

A talented mathematician from an early age, he solved problems using geometry. “In 1919 Fisher started work at Rothamsted Experimental Station.” There he developed
● Analysis of Variance
● Experimental Design
● Maximum likelihood (pre-computer – that's impressive)

R.A. Fisher was involved in eugenics and proposed that intellect was genetically linked and differed among races. He was also important for linking Mendelian genetics (seen as discrete) to Darwinian selection (seen as continuous).

16 Responses are continuous/Predictors are categorical (ANOVA)

16.1 Introduction The classic question in the sciences is “does the experimental group differ from a control group?” There are few ecological circumstances where one can set up this type of scenario. Regardless, questions about differences between groups are extremely common; however, these questions can also require complex analyses – especially when looking at the most interesting ecological phenomena. So it goes. The explanatory variables are called factors and the different applications of each factor are called levels. For example, densities of ungulates may be related to levels of predators: high, medium and low. Each combination of factor and level is called a treatment. The essential nature of an ANOVA-type analysis is to partition the variation into variation that is unexplained and variation brought on by the treatments. If enough variation is brought about by treatments then those factors will emerge as being important. Put another way, SSY = SST + SSE, where SSY is the overall variation, SST is the variation due to treatment, and SSE is the residual variation or error after accounting for treatments.

Steps to getting the sums of squares

Overall average: Ȳ = Σ Y_ij / n

The total sum of squares is

SS_total = Σ (Y_ij − Ȳ)²

The sum of squares among the k groups is

SS_among.treatments = Σ_{i=1}^{k} Σ_{j=1}^{n} (Ȳ_i − Ȳ)²

where Ȳ_i is the mean of treatment i. The greater the treatment effect, the greater the among-group SS. The variation that is left over is the variation within groups – the deviations from the mean within each group. This is estimated with

SS_within.groups = Σ_{i=1}^{a} Σ_{j=1}^{n_i} (Y_ij − Ȳ_i)²

and is also called the residual sum of squares or error. This variation is not accounted for by the treatments in the model. The model for the individual response is:

Y_ij = Ȳ + (Ȳ_i − Ȳ) + ε_ij

The null hypothesis would be Y_ij = Ȳ + ε_ij, that is, no effect of the treatment.
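A sketch of this partition computed by hand in R and checked against aov(); the treatment factor and responses are simulated:
set.seed(9)
trt <- factor(rep(c("A", "B", "C"), each = 10))
y   <- rnorm(30, mean = c(5, 6, 8)[as.integer(trt)])
grand <- mean(y); grp <- tapply(y, trt, mean)
SS.total  <- sum((y - grand)^2)
SS.among  <- sum(table(trt) * (grp - grand)^2)
SS.within <- sum((y - grp[as.integer(trt)])^2)
c(among = SS.among, within = SS.within, check = SS.among + SS.within - SS.total)
summary(aov(y ~ trt))    # the same partition in an ANOVA table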

16.2 Assumptions of the ANOVA The model discussed above will work under certain assumptions. Should you be depressed if your data do not meet these assumptions? Certainly not – at least not yet. There are ways to handle certain circumstances where these assumptions are not met.

16.2.1 Individual samples are independent and identically distributed In the statistical lexicon this is simply stated as i.i.d. The model assumes that individual samples are taken at random and do not influence the outcome of other individual samples. This assumption is violated when samples are nested (a literal example: eggs in a nest are nested) or when samples are able to interact.

16.2.2 The within group variances are similar among groups

16.2.3 Residuals are normally distributed and are independent and identically distributed This assumption is the same as the assumption for ordinary least squares regression, that is, ε ~ N(0, σ²), i.i.d.

16.2.4 The main effects are additive and linear

16.2.5 The correct treatment is identified for each sample

16.3 Two- Independent Samples

16.3.1 Parametric Test: Student's t-Test For two dependent groups, the appropriate test is a paired t-test; an example of dependent groups would be a before/after test using the same subjects. For two independent groups the appropriate test is the t-test. For more than two groups, the appropriate test is an analysis of variance (ANOVA); the type of ANOVA is highly dependent on the experimental design.

t = (X̄_A − X̄_B) / s_{X̄_A − X̄_B}

where the pooled variance is

s² = [ (n_A − 1) s_A² + (n_B − 1) s_B² ] / (n_A + n_B − 2)

and

s_{X̄_A − X̄_B} = sqrt( s² (1/n_A + 1/n_B) )

16.3.1.1 Assumptions The assumption is that the two groups have equal variance. Homogeneity of variance is tested with the F-test, which is named after Ronald Fisher. If the variances are not equal then a Welch test is appropriate.
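A sketch in R: test equality of variances with var.test() (the F-test), then use the pooled t-test or the Welch test accordingly; the two samples are simulated:
set.seed(10)
A <- rnorm(15, mean = 10, sd = 2); B <- rnorm(18, mean = 12, sd = 2)
var.test(A, B)                   # F-test for homogeneity of variances
t.test(A, B, var.equal = TRUE)   # pooled (Student's) t-test
t.test(A, B)                     # Welch test, the default when variances may differ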

16.3.1.2 What to report For two groups, one should report

● sample size of each group ● variance of each group and if they are different ● test used ● test statistic and exact probability (if possible) ● difference between the means ● biological implications

16.4 Mann-Whitney U test for two independent samples The Mann-Whitney test is a nonparametric alternative to Student's t-test. There is still the assumption that samples are random and independent. This test uses ranks. The null model is that the two samples come from the same population. The directional alternative hypothesis posits that the observations of one group tend to be larger than those of the other group. The nondirectional alternative hypothesis posits that the distributions of the two groups differ. To calculate U, rank the data as if they came from a single population.

16.5 Nonparametric Wilcoxon Rank Sum For two dependent groups, the appropriate nonparametric test is the Wilcoxon signed-rank test (statistic W) – for example, a before-and-after treatment effect where the same site/organism is sampled. For two independent groups, the appropriate test is the Mann-Whitney test (statistic U), equivalently the Wilcoxon rank-sum test (statistic W). For more than two groups, the appropriate test is the Kruskal-Wallis test (statistic H). The Wilcoxon rank tests are used when the errors are not normally distributed.
 wilcox.test(var1, var2)

16.6 Parametric or nonparametric For two independent groups, if variances are unequal the t-test is still more powerful but the t-test is inappropriate if data are not normal.

16.7 Two related groups (before-after)

16.7.1 Parametric: Paired t-Test
 t.test(var1, var2, paired=T)
Nonparametric: Wilcoxon signed-rank test
R code
 wilcox.test(var1, var2, paired = T)

16.8 More than two samples

16.8.1 Analysis of variance There are different types of ANOVA based on the types of explanatory variables. In Type I (fixed-effects) ANOVA models, the explanatory variables are fixed; in Type II (random-effects) ANOVA models the explanatory variables are random; and in mixed models there is a mixture of random and fixed effects. In the frequentist paradigm, the null hypothesis in an ANOVA framework is that the means of the different groups come from the same population. The alternative hypothesis is that one or more of the means differ. ANOVA does not identify which group differs – that part is covered under multiple comparisons.

16.9 Basic Assumptions of ANOVA

16.10 One-way ANOVA

16.11 Two-way ANOVA Main effects and interactions

16.12 Randomized block design (RBD) Blocks are physical groupings of samples, such as greenhouses in a single experiment. The block effect is a nuisance parameter – an effect that must be accounted for but isn't of interest. In a randomized block design, individuals are randomly assigned to treatments, and treatments are randomly assigned to blocks. Blocks are treated as main effects that do not interact with other main effects. The basic RBD model is

y_ij = β₀ + A_i + BLOCK_j + ε_ij

16.13 Interpreting Results from ANOVA Interactions The number of parameters estimated for the interaction is (levels of var1 − 1)(levels of var2 − 1). Great Biometrician: Residuals
 my.anova <- aov(y ~ x); summary(my.anova)
 # diagnostic plots
 plot(my.anova)
Interactions

 my.aov <- aov(y ~ var1*var2); # will give main effects plus the interaction

 interaction.plot(x1, x2, y)

16.14 Nested

Y_ijk = β₀ + A_i + B_j(i) + ε_ijk

In R, nesting is represented with “/”; for an example see the package mlirt.

16.14.1.1 Conditional y ~ x|z; # y as a function of x given z

16.14.1.2 Split plot ANOVA my.split <- aov(y ~ x1*x2 + Error(District/School/Class))


 # remove the intercept

 lm(y ~ x1 -1)

17 Multiple Comparisons Tests that look for group differences only demonstrate an overall effect, while the investigator may also be interested in differences between particular groups. For example, there might be a control and four levels of herbicide, and one may ask whether all of the experimental groups, as a group, are different from the control group. Multiple comparisons are divided into a priori comparisons and post priori (post hoc) comparisons depending on whether the questions are asked before the data are analyzed or afterward in an ad hoc fashion. A priori comparisons are more powerful than post hoc tests.

The potential number of comparisons is very large. For three levels (say A, B, C) the six possible comparisons include A vs B, A vs C, B vs C, (A + B) vs C, A vs (B + C), and (A + C) vs B. The number of comparisons quickly increases with the number of levels.

17.1 A priori comparisons: Orthogonal Contrasts

There are only k − 1 orthogonal contrasts, where k is the number of levels. Setting up orthogonal contrasts can be a bit tricky if all the possible contrasts are sought. The groups of interest get contrast coefficients that sum to 0. The groups that are not of interest have 0 for their coefficients. One group gets positive coefficients and the group contrasting it gets negative coefficients. For example, if only two groups are contrasted (say A vs B) then the coefficients would be 1 and −1 for A and B and 0 for C. If there are two members in one group and one member in the contrasted group then the coefficients would be (1, 1, −2).

Also, the sum of the level-wise cross products of any two comparisons also sums to 0.

Contrast   A    B    C    D    E    SUM
1          1    1   -1   -1   -1    0
2         -1    1    0    0    0    0
3          0    0   -1    1    0    0
4          0    0    0   -1    1    0
5          0    0   -1    0    1    0

17.2 Post priori comparison

17.2.1 Tukey's Honest Significant Difference (HSD)

Tukey's HSD = q sqrt( (MS_residual / 2) (1/n_i + 1/n_j) )
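A sketch of Tukey's HSD in R after a one-way ANOVA, using the y and trt objects from the earlier ANOVA sketch (or any response and factor):
fit <- aov(y ~ trt)
TukeyHSD(fit)          # pairwise differences with adjusted confidence intervals
plot(TukeyHSD(fit))    # visual display of the pairwise comparisons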

Classic Paper Tukey, J.W. 1991. The philosophy of multiple comparisons. Statistical Science 6:100- 116.

17.2.2 Alpha corrections for multiple comparisons

17.2.2.1 Bonferroni

α_BONF = α / k

17.2.2.2 Dunn-Sidak method

α_DS = 1 − (1 − α)^(1/k)

17.2.2.3 Alternative View: Fisher's Combined Probabilities Some authors suggest no correction to alpha levels {Gotelli, 2004 #6303}.

χ²_CP = −2 Σ_{i=1}^{k} ln(p_i)

17.3 Interpreting and Visualization of Multiple Comparisons

Probability
Effect Size

18 Repeated Measures There are two different types of repeated measures. In the first type, treatments are randomly assigned to individuals at one time and then randomly assigned again at later times. The general model for this type is

Y_ij = β₀ + A_i + B_j + ε_ij

In the second type of repeated measures, treatments are applied once and the responses are followed through time. The general model is

Y_ijk = β₀ + A_i + B_j(i) + C_k + (AC)_ik + (CB)_kj(i) + ε_ijk

19 Response is continuous and predictors are categorical and continuous (ANCOVA)

An analysis of covariance (ANCOVA) performs a simultaneous t-test (for two groups) or ANOVA (for more than two groups) and a regression. The general model is

Y_ij = β₀ + A_i + β_i (X_ij − X̄_i) + ε_ij

where β₀ is the overall mean, the A_i are the treatment effects, and β_i is the regression slope for each group. X_ij is an individual observation and X̄_i are the group means. If there is no effect of the continuous variable then the slopes are zero and the model reduces to the ANOVA model:

Y_ij = β₀ + A_i + ε_ij

 # to work out each slope separately

 lm(y[group=="group1"] ~ x[group=="group1"])

 my.split <- split(y, group)   # split(variable to be split, splitting variable)

20 Mixed Models

20.1 Introduction Fixed effects assume that the errors are independent from observation to observation. Fixed effects influence the mean response, and their levels are of interest. Random effects are those whose errors are correlated because of grouping or clustering; grouping can be in space (such as a classroom within a school) or in time (repeated measurements). Random effects influence the variance of y, and their levels are not of interest in themselves. Mixed models can account for nestedness. If the relationship is linear, use lme or lmer; if nonlinear, use nlme.
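A sketch of a linear mixed model with lmer() (assuming the lme4 package); the grouping factor, here called site, gets a random intercept, and the data are simulated:
library(lme4)
set.seed(11)
site <- factor(rep(1:10, each = 8))
x <- rnorm(80)
y <- 2 + 0.5*x + rnorm(10, sd = 1)[as.integer(site)] + rnorm(80, sd = 0.5)
fit <- lmer(y ~ x + (1 | site))      # fixed effect of x, random intercept per site
summary(fit)
# nested grouping (e.g., plots within sites) would be written (1 | site/plot)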

21 Responses are ordinal

21.1 Introduction

21.2 Proportional odds model Ordinal responses can be modeled with a multinomial model, but the proportional odds model is a more direct method and its coefficients are more easily interpreted {Venables, 2003 #5310}. As Venables and Ripley (2002) point out, however, AIC may still select the more complex multinomial model over the more direct or parsimonious one.

21.3 R code

my.model <- polr(y ~ x, data = mydata, weights = w)   # polr() is in the MASS package

21.3.1 Predictions
myprediction <- predict(my.model, expand.grid(hnames[-1]), type = "probs")
cbind(expand.grid(hnames[-1]), round(myprediction, 2))

22 Alternative Regression Techniques: Generalized Additive Models and related techniques

22.1 Introduction

22.2 Generalized additive models (GAMs) GAMs are an approach to describing relationships that is almost ad hoc. As the name suggests, they are a generalized form of linear models in which the relationships need not be linear, but the response variable is still the result of adding the (smoothed) explanatory variables. Hastie and Tibshirani (1999) describe GAMs as “let[ting] the data show us the appropriate form.” Since the data are not confined to any particular functional form, GAMs are considered a nonparametric or semiparametric version of regression modeling. GAMs are another tool in the exploratory toolbox. It is easiest to see how GAMs work by imagining a scatterplot of data. GAMs fit curves to scatterplots by iteratively fitting smaller curves to sections of the data and then smoothing the entire curve. The locally-weighted running mean is one means of estimating these curves and is based on a local mean and the distance of local points (y values) from that mean. Each parameter is assumed to have its own function, so GAMs fit a parameter using the residuals from the other parameters, a procedure called backfitting. The function used to estimate the averages is the smoother. There are two decisions to be made by the investigator. The first is how large a moving window or neighborhood is used to estimate the averages; a smaller window will use fewer points but will tend to overfit the model and make it a poor predictor. The second decision is the choice of smoothing function. There are a number of smoothing functions and one should investigate to select the proper one; in general, bin, running-mean, running-line, and parametric-regression smoothers are advised against, while kernel and spline smoothers are suggested. Since there tend to be fewer points representing extreme values of the explanatory variable, fewer points are used in the smoothing functions there, which represents a potential source of bias. The output statement creates the table estimate and adds the prefix p_ to partially predicted parameters. To get the entire partial prediction effect for each variable, multiply the parameter estimate for the variable by the original variable and add this to the partial estimate.
 library(mgcv)
 my.gam <- gam(y ~ s(x))
 plot(my.gam)
 summary(my.gam)

23 Count data with a preponderance of zeros It is common in ecology to count the number of individuals or species in a plot or transect and obtain a large number of zero counts. This may occur when sampling rare species or sampling a large number of inappropriate habitats. Zero-inflated models take the large zero class into account. These models estimate ψ, the probability of being in the group where only zero counts are expected. Zero-inflated Poisson regression (ZIP) models are commonly used in these cases.
What to report Simple relationships should be represented with a scatterplot. It is not necessary to report the distribution of the raw data, but examining it should be the first step in the analysis. Report the overall mean frequency and the frequency per treatment, the regression coefficients with their Wald standard errors, and the dispersion factor.
SAS Code
Standard Poisson and Negative Binomial
DATA POISS;
PROC GENMOD DATA=POISS;
  CLASS CLASSVAR;
  MODEL response = predictor / DIST=P LINK=LOG TYPE3 WALD;
RUN;
Zero-inflated Poisson
PROC NLMIXED DATA=yourdata;
  /* initial values for the parameters */
  PARMS Z0=0 ZA=0 P0=0 PA=0;
  /* linear predictor for the zero-inflation */
  LINPRED = Z0 + ZA*VAR1;
  /* inflation probability */
  INFLATPROB = 1/(1+EXP(LINPRED));
  /* linear predictor of the Poisson mean */
  LAMBDA = EXP(P0 + PA*VAR1);
  IF Y = 0 THEN LL = LOG(INFLATPROB + (1-INFLATPROB)*EXP(-LAMBDA));
  ELSE LL = LOG(1-INFLATPROB) + Y*LOG(LAMBDA) - LGAMMA(Y+1) - LAMBDA;
  MODEL Y ~ GENERAL(LL);
  PREDICT INFLATPROB;
RUN;
Zero-inflated Negative Binomial
PROC NLMIXED DATA=yourdata;
  /* initial parameter values, including V, the extra-variation parameter of the negative binomial (V must be > 0) */
  PARMS Z0=0 ZA=0 P0=0 PA=0 V=1;
  /* linear predictor of the zero-inflation */
  LINPRED = Z0 + ZA*VAR1;

  /* inflation probability */
  INFLATPROB = 1/(1+EXP(LINPRED));
  MEAN = EXP(P0 + PA*VAR1);
  PROB_ZERO = INFLATPROB + (1-INFLATPROB)*EXP(-(Y+V)*LOG(1+(1/V)*MEAN));
  PROB_ELSE = (1-INFLATPROB)*EXP(LGAMMA(Y+V) - LGAMMA(Y+1) - LGAMMA(V) + Y*LOG((1/V)*MEAN) - (Y+V)*LOG(1+(1/V)*MEAN));
  IF Y=0 THEN LOGLIKE=LOG(PROB_ZERO);
  ELSE LOGLIKE=LOG(PROB_ELSE);
  MODEL Y ~ GENERAL(LOGLIKE);
RUN;
Proportions
Binomial test
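For fitting the same kinds of zero-inflated models in R, a sketch assuming the pscl package is installed; the data are simulated with a structural-zero group, and the names are hypothetical:
library(pscl)
set.seed(12)
x <- rnorm(200)
z <- rbinom(200, 1, 0.3)                                # 1 = structural zero
y <- ifelse(z == 1, 0, rpois(200, exp(0.5 + 0.7*x)))    # zero-inflated counts
my.zip <- zeroinfl(y ~ x | 1, dist = "poisson")         # count part: x; zero part: intercept only
summary(my.zip)
# dist = "negbin" gives the zero-inflated negative binomial instead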

Notes from an R-help exchange on zero-inflated mixed models (Dave Atkins): I am fairly certain that you cannot fit a zero-inflated model using glmer(). As for whether you need a zero-inflated or negative binomial model, there is evidence for over-dispersion. One option is the MCMCglmm package, which fits a broad class of generalized linear mixed models from a Bayesian perspective using Markov chain Monte Carlo (MCMC) methods; its family argument includes a zero-inflated Poisson. It helps to have some familiarity with Bayesian statistics and the basics of MCMC fitting methods; see the tutorial vignette in the MCMCglmm package for an introduction. Jarrod Hadfield, the package author, adds the following on specifying such models:

The zero-inflated Poisson is specified as a bivariate model, so the residual and random-effect models usually follow a bivariate structure such as us(trait):site or idh(trait):units. These specify 2 x 2 (co)variance matrices; in the first a covariance is estimated and in the second it is set to zero. For ZIP models use the idh function, because the covariance between the Poisson process and the zero-inflation process cannot be estimated from the data. The prior term would have something like:
prior = list(R = list(V = diag(2), n = 2, fix = 2),
             G = list(G1 = list(V = diag(2), n = 2),
                      G2 = list(V = diag(2), n = 2),
                      G3 = list(V = diag(2), n = 2)))
or, if you don't want a random effect for the zero-inflation process, something like:
prior = list(R = list(V = diag(2), n = 2, fix = 2),
             G = list(G1 = list(V = diag(c(1, 0.000001)), n = 2, fix = 2),
                      G2 = list(V = diag(c(1, 0.000001)), n = 2, fix = 2),
                      G3 = list(V = diag(c(1, 0.000001)), n = 2, fix = 2)))

Don't fix the residual variance to 0.000001 because the chain will not then mix.

As long as n is greater than or equal to the dimension of V, the prior is proper and the covariance matrix should not become ill-conditioned.

Bear in mind that I just made these priors up – you need to pick something sensible, although the residual variance for the zero-inflation process (trait 2) is immaterial because there is no information in the data regarding this variance, as with standard binary variables.



24 Responses are proportions or counts and predictors are categorical

24.1 Tests χ² test, G test, and Fisher's exact test

24.2 1 x m or n x 1 data

24.2.1 Visualization

24.2.2 Null model Constructing null models for frequency data is a matter of estimating the counts that should go into the cells if the null model were true. This may not always be an easy task.

24.2.3 Hypothesis testing The degrees of freedom depend on whether the null is derived independently or is generated from the data. If the null is based on previous data then it is an extrinsic hypothesis; if the null is based on the data themselves then the hypothesis is intrinsic. Classical Mendelian experiments have extrinsic hypotheses; for example, a dihybrid cross is expected to give a 9:3:3:1 ratio of phenotypes.

24.2.3.1 χ2 test

$$\chi^2 = \sum_{i=1}^{a} \frac{(f_i - \hat{f}_i)^2}{\hat{f}_i}$$

where $f_i$ is the observed count in cell $i$, $\hat{f}_i$ is the expected count, and $a$ is the number of cells.

24.2.3.2 G test

$$G = 2 \sum_{i=1}^{a} f_i \ln\left(\frac{f_i}{\hat{f}_i}\right)$$

24.2.3.3 Fisher's exact test

$$P = \frac{(Y_{1,1}+Y_{1,2})!\,(Y_{2,1}+Y_{2,2})!\,(Y_{1,1}+Y_{2,1})!\,(Y_{1,2}+Y_{2,2})!}{N!\,Y_{1,1}!\,Y_{1,2}!\,Y_{2,1}!\,Y_{2,2}!}$$

where $Y_{i,j}$ are the cell counts and $N$ is the grand total.

24.3 Contingency tables Contingency, in this context, refers to all the possible combinations of events. Contingency tables are appropriate for categorical data, and each cell holds the count of one combination.

            A                    B
Row 1       Combination A1       Combination B1       Row 1 total = f11 + f12
            frequency = f11      frequency = f12
Row 2       Combination A2       Combination B2       Row 2 total = f21 + f22
            frequency = f21      frequency = f22
            Column 1 total =     Column 2 total =     Grand total N = f11 + f12 + f21 + f22
            f11 + f21            f12 + f22

Table 1: 2 x 2 layout

24.4 Data Visualization A mosaic plot is an appropriate visualization for a 2 x 2 contingency table.
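A minimal sketch with made-up counts (mosaicplot() is in base R graphics):

mytable <- matrix(c(10, 20, 30, 40), nrow = 2,
                  dimnames = list(Rows = c("1", "2"), Columns = c("A", "B")))  # hypothetical counts
mosaicplot(mytable, main = "2 x 2 contingency table")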

24.5 Null Matrix

Under the null hypothesis of independence, the expected frequency of each cell is the grand total multiplied by the product of the corresponding row and column probabilities:

$$\hat{f}_{ij} = N \times P(\text{row } i) \times P(\text{column } j) = \frac{(\text{row } i\ \text{total})\,(\text{column } j\ \text{total})}{N}$$

For example, $\hat{f}_{11} = \frac{(f_{11}+f_{12})(f_{11}+f_{21})}{N}$ and $\hat{f}_{22} = \frac{(f_{21}+f_{22})(f_{12}+f_{22})}{N}$.

Table 2: Matrix of expected frequencies

A chi-square statistic, Χ2, is used to estimate the differences between the observed and expected values and is given by:

$$X^2 = \sum \frac{(\text{observed cell value} - \text{expected cell value})^2}{\text{expected cell value}} \quad\text{or}\quad X^2 = \sum_i \sum_j \frac{(f_{ij} - \hat{f}_{ij})^2}{\hat{f}_{ij}}$$

Test the statistic against the chi-square distribution. The degrees of freedom are (r − 1)(c − 1), where r is the number of rows and c is the number of columns. For a 2 x 2 table, such as the example above, there is 1 degree of freedom.

Use Yates' continuity correction or Fisher's exact test when expected cell frequencies are small (fij < 5). It is also possible to pool sparse cells and test the pooled data.

1.2.4 Assumptions All cells should have at least one observation, and no more than 20% of the cells should have expected values less than 5 {Zar, 1984 #5773}. Data sets that violate these conditions should be analyzed with Fisher's Exact Test (below) or the G test.

1.2.5 R code for contingency tables
> mytable <- matrix(c(n1, n2, n3, n4), nrow=2)  # equivalently matrix(c(n1, n2, n3, n4), 2, 2)
> mytable
> chisq.test(mytable)
> fisher.test(mytable)

1.3 Fisher's Exact Test Use this test when any of the expected counts are < 5: fisher.test(var1, var2). Note: when two factors are supplied, the data should be formatted with one case per row rather than as counts. Example:

Case  Species  Diet
1     A        FROGS
2     B        FISH
3     B        FROGS
4     A        FROGS
5     A        FISH
6     B        FISH
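A minimal sketch using the six cases above (fisher.test() accepts either a contingency table or two factors):

diet <- data.frame(Species = c("A", "B", "B", "A", "A", "FROGS" == "x" | TRUE, "B")[1:6],
                   Diet    = c("FROGS", "FISH", "FROGS", "FROGS", "FISH", "FISH"))
diet$Species <- c("A", "B", "B", "A", "A", "B")       # species for each case, as in the table above
fisher.test(table(diet$Species, diet$Diet))           # from the tabulated counts
fisher.test(diet$Species, diet$Diet)                  # equivalent call from the raw case-per-row factors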

24.6 G Test or Log-likelihood Test

$$G = 2 \sum f \ln\left(\frac{f}{\hat{f}}\right)$$

where $f$ is the observed count and $\hat{f}$ is the expected count; G is distributed approximately as the χ2 distribution.
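Base R has no built-in G test, but the statistic is simple to compute by hand. A minimal sketch using the 9:3:3:1 dihybrid ratio mentioned earlier, with made-up observed counts:

obs <- c(90, 30, 28, 12)                  # hypothetical observed phenotype counts
exp <- sum(obs) * c(9, 3, 3, 1) / 16      # expected counts under the extrinsic 9:3:3:1 hypothesis
G <- 2 * sum(obs * log(obs / exp))        # G statistic
pchisq(G, df = length(obs) - 1, lower.tail = FALSE)   # compare to the chi-square distribution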

24.7 Binomial Test
binom.test(x, n)   # x = number of successes (or failures), n = sample size
binom.test(50, 150)   # 50/150 were failures
To compare two proportions: prop.test(c(successes group 1, successes group 2), c(total group 1, total group 2))

Goodness-of-fit test: the Kolmogorov-Smirnov test (ks.test() in R).

25 Nonlinear Regression

25.1 Introduction
R code (the gam.control(bf.maxit=) argument is from the gam package; mgcv's gam() uses a different control list):
library(gam)
gam.temp <- gam(y ~ x, control = gam.control(maxit = 50, bf.maxit = 50), data = yourdata)
summary(gam.temp)
plot(gam.temp, se = TRUE)

25.2 Fitting nonlinear models

25.3 Splines The regression line is divided at a set of points called knots. Splines are cubic polynomials that fit the points between the knots. There are several types of splines, but B-splines are generally recommended; see Green and Silverman (1994). Use nonlinear models (nls) when the form of the relationship is known a priori; otherwise, use generalized additive models (GAM).

MYSTART <- c(b0 = 0, b1 = 0, th = 120)   # starting values for the named parameters
# the nls formula must name the parameters; the exponential form below is only an illustration
nls(y ~ b0 + b1 * exp(-x / th), data = MYDATA, start = MYSTART, trace = TRUE)
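The B-splines recommended above can be fit within an ordinary linear model using the splines package. A minimal sketch, with placeholder variable names and an arbitrary choice of 5 degrees of freedom:

library(splines)
spline.fit <- lm(y ~ bs(x, df = 5), data = MYDATA)   # B-spline basis for x
summary(spline.fit)
plot(MYDATA$x, MYDATA$y)
lines(sort(MYDATA$x), fitted(spline.fit)[order(MYDATA$x)])   # fitted spline curve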

26 Neural networks and Path analysis

See {Gotelli, 2004 #6303}, page 279.

R code:
library(nnet)
nnet(formula, data, weights, size, Wts, linout = FALSE, entropy = FALSE, softmax = FALSE, skip = FALSE, rang = 0.7, decay = 0, maxit = 200, trace = TRUE)
If over-dispersion is suspected in a count model, one can test the appropriateness of the negative binomial against the Poisson using the log-likelihood ratio test (see the negative binomial section).
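A runnable sketch of nnet() on the built-in iris data (not an example from the text); the size, decay, and maxit values are arbitrary:

library(nnet)
set.seed(1)
nn.fit <- nnet(Species ~ ., data = iris, size = 3, decay = 0.01, maxit = 200, trace = FALSE)
table(observed = iris$Species, predicted = predict(nn.fit, type = "class"))   # training-set confusion matrix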

27 Dealing with Interactions

Note: the interaction has a single beta. A strong interaction suggests that we should not interpret the effect of one variable (either X or Z) in isolation, because its effect depends on the level of the other variable. Centered-score regression adjusts the frame of reference by centering each predictor so that its mean is zero; this is easily accomplished by subtracting the mean from each of the raw values.

Further reading:
Aiken, L. S. and S. G. West. 1991. Multiple Regression: Testing and Interpreting Interactions. Sage Publications, Newbury Park, California.
Katrichis, J. 1992. The conceptual implications of data centering in interactive regression models. Journal of Market Research Society 35, 183-192.
Kromrey, J. D. and L. Foster-Johnson. 1988. Mean centering in moderated multiple regression: much ado about nothing. Educational and Psychological Measurement 58, 42-68.
Pedhazur, E. J. 1997. Multiple Regression in Behavioral Research (3rd ed.). Harcourt Brace, Fort Worth, Texas.
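A minimal sketch of mean-centering before fitting an interaction (x, z, y, and MYDATA are placeholder names):

MYDATA$xc <- MYDATA$x - mean(MYDATA$x)   # center each predictor on its mean
MYDATA$zc <- MYDATA$z - mean(MYDATA$z)
centered.fit <- lm(y ~ xc * zc, data = MYDATA)   # the xc:zc term carries the single interaction beta
summary(centered.fit)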

28 Mantel Analysis

29 Generalized Linear Mixed Models (GLMM)

30 Generalized Estimating Equations (GEE)

31 Proportional Odds Model Used when the response is ordinal.
# ordered factors
var1 <- ordered(var1, labels = levels(var1))
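A hedged sketch of a proportional odds model fit with polr() from the MASS package (variable names are placeholders):

library(MASS)
MYDATA$var1 <- ordered(MYDATA$var1)                     # the response must be an ordered factor
po.fit <- polr(var1 ~ x1 + x2, data = MYDATA, Hess = TRUE)
summary(po.fit)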

32 DATA REDUCTION

1.4 Introduction No other area of statistical analysis will come across as more subjective than ordination, but ordination can provide insight into ecological problems not amenable to other types of analyses. Ordination can relate the abundances of species to each other in environmental space, relate species to the environment in “ordination space”, or combine these approaches. It is very important to realize which of these you are doing, since many methods, though applicable, have drawbacks.

1.5 Relating the species’ abundance to environmental gradients

1.5.1 and Principal Component Analysis Sometimes these terms are used synonymously but PCA should be considered a form of factor analysis. The goals are the same: data reduction and summarization (Hair et al. 1998). PCA is based on a covariance matrix. When the number of individuals is measures in a sample along with a measure of the environment then the PCA will represent samples in a species space. PCA assumes a monotonic relationship between abundance and an environmental parameter. It might be more common however, that species have a unimodal relationship with an environmental gradient. A species having a unimodal relationship will distort the species scores at the ends of the gradient forming an arch and this is called the arch or horseshoe effect. If a two dimensional graph is produced, T\the axes will represent environmental gradients with species scores as points.

> my.pca <- princomp(data, cor=T)
> summary(my.pca)
> plot(my.pca)
> loadings(my.pca)
> mypca.prediction <- predict(my.pca)
> eqscplot(my.pca$scores[, 1:2], type="n", xlab="First PCA Axis", ylab="Second PCA axis")  # eqscplot() is in MASS

Using prcomp() instead:
my.pca <- prcomp(~ var1 + var2 + var3, data=my.data)
summary(my.pca)
my.pca$rotation
screeplot(my.pca, npcs=3, type="lines")
biplot(my.pca)

In the R console...
?read.table  -- getting the data into a data.frame
?prcomp OR ?princomp  -- for the PCA (see 'center' and 'scale' options)
?biplot  -- for the biplot
?screeplot  -- for a scree plot

1.5.2 Reciprocal averaging (RA) or correspondence analysis (CA) “In RA, sample scores are calculated as a weighted average of species scores, and species scores are calculated as a weighted average of sample scores, and iterations continue until there is no change. RA simultaneously ordinates species and samples. There are as many axes as there are species or samples, whichever is less.” The interpretation of the axes is more vague than in PCA, but the eigenvalues are still meaningful: each eigenvalue equals the correlation coefficient between species scores and sample scores, which the method maximizes. Because correspondence analysis assumes a unimodal response, the arch effect is reduced (although it may not be eliminated altogether, depending on the variance in species abundance across the gradient). Another disadvantage of RA is compression at the ends of the gradients. The relationship between a species and the environment is less vague because the species is located along the gradient where it is most abundant. However, one cannot interpret the eigenvalues as the variance explained, as in PCA; these numbers represent the correlation coefficient between species scores and sample scores. A two-dimensional graph in correspondence analysis will have CA/RA axes with both sample and species scores as points on the graph.

1.5.3 Detrended Correspondence Analysis (DCA) DCA shares many features of RA, such as the production of sample and species scores and of eigenvalues. DCA is used more extensively than RA because the arch effect is removed. This is accomplished by dividing the primary axis (the one with the highest eigenvalue) into segments and re-centering the scores within each segment on the zero of the second axis – this is the “detrending” in detrended correspondence analysis. Because each segment is centered on zero, some information is lost.
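CA/RA and DCA are available in the vegan package. A hedged sketch, where spp is a placeholder site-by-species abundance matrix:

library(vegan)
ca.fit  <- cca(spp)       # correspondence analysis (reciprocal averaging) when no constraints are given
dca.fit <- decorana(spp)  # detrended correspondence analysis
plot(dca.fit)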

1.6 Methods utilizing eigenanalysis Principal component analysis, correspondence analysis (reciprocal averaging), and detrended correspondence analysis are commonly used methods of ordination based on eigenanalysis.

1.6.1 CRDA and CCA (Makarenkov and Legendre 2002). Ordination assumes there is an underlying structure to the data such that many ecological variables are correlated; there are often no truly independent variables. A common use of ordination is the examination of habitat use by numerous species using several, if not many, habitat variables. The subject is treated more thoroughly in a later chapter.

32.1 Non-metric Multidimensional scaling

Non-metric multidimensional scaling (NMDS) converts raw data to ranks and should be used when variables are non-normally distributed, the data are ordinal, or there are gaps in gradients. The drawback is that the results are not as interpretable as those from principal components. One goal of NMDS is to reduce the dimensionality of the data. “Stress” is a measure of the departure from a monotonic relationship between the original distances and the distances in the reduced matrix. A measure of raw stress is

$$\text{raw stress} = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \left(d_{ij} - \hat{d}_{ij}\right)^2$$

where $d_{ij}$ is the distance between elements i and j in the original matrix and $\hat{d}_{ij}$ is the corresponding distance in the reduced matrix. Distance measures can be estimated via several algorithms, but for community data the Sørensen (Bray-Curtis) measure is recommended.

Steps for running NMDS:
> temp <- cmdscale(dist(mydata), k=2, eig=TRUE)   # cmdscale() is metric MDS; isoMDS() in MASS is the non-metric version
> eqscplot(temp$points, type="n")                 # eqscplot() is in MASS
> # calculate stress using Euclidean distances
> stress <- dist(mydata); stress2 <- dist(temp$points)
> sum((stress - stress2)^2) / sum(stress^2)
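An alternative, hedged sketch uses metaMDS() from the vegan package, which performs NMDS and uses Bray-Curtis (Sørensen) distances by default; spp is a placeholder site-by-species matrix:

library(vegan)
nmds <- metaMDS(spp, distance = "bray", k = 2)
nmds$stress              # stress of the two-dimensional solution
plot(nmds, type = "t")   # sites and species in ordination space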

33 PREDICTING GROUP MEMBERSHIP: DISCRIMINANT FUNCTION ANALYSIS

Taxonomists and field biologists often want to predict group membership. For example, a monochromatic species may have sexes that differ in subtle ways, and body measurements may be used to separate males and females. One might be tempted to run a number of individual tests. Unfortunately, this would be misleading if any of the predictor variables overlapped. There may also be a combination of variables that acts better as a group to discriminate than any individual variable alone.

Discriminant function analysis (DFA) is similar to regression in that a linear combination of variables and coefficients is constructed. Statistically, DFA is related to MANOVA, but instead of known groups predicting differences in a set of variables (as in MANOVA), DFA uses those variables to classify observations into particular groups such as species, treatments, etc. Multiple functions are possible, and those produced by DFA are orthogonal to each other. For example, in a football-shaped cloud of data, the first function, the one that accounts for the largest amount of variation, would ideally go through the central axis of the football. If the data were spun around this axis, the football would spin like a perfect Phil Simms (NY Giants) pass. The second axis is perpendicular to the central axis, splitting the football in half; if the data were spun around this axis it would look like I was throwing the football with my right hand (I'm left handed) – a ball going end over end. Thinking about this example, note that there does not appear to be good separation among groups: if there were two groups, the data are smoothed into each other to give a smooth football-shaped cloud, and in reality we may not be able to discriminate very well. Groups that are going to have well-performing functions will have clouds of data with little or no overlap. For this reason, it is important to have a holdout sample to test the predictive power of the discriminant function or functions.

There are multiple methods of variable selection; the most commonly used is to give the software cutoff points of significance for a variable to be included or excluded in a stepwise fashion. DFA creates the functions and then gives the loadings of each variable on each function. The coordinate system can be rotated so that high loadings are maximized and low loadings are minimized while preserving the overall data structure. A multitude of rotations are possible, but varimax rotation appears to be the most commonly used; this is the same sort of rotation used in principal component analysis, and it improves the interpretability of the variables. The number of functions produced by DFA can be up to the number of variables in the analysis; in that case each and every variable acts alone to discriminate groups. Such a result is unlikely and would be difficult to interpret. Software should provide the number of functions and summaries of each function, including the amount of variance each function accounts for. There is an important distinction here between regression and DFA: in regression the R2 can be thought of as the explained variance, but in DFA what is usually reported is the relative percent of variance explained, not the amount of variance explained per se. So the cumulative relative amount of explained variance in DFA will be 100 percent, but this does not mean all of the variance in the data structure has been accounted for.

Before interpreting results, several aspects of the analysis should be examined, including the natural log of the determinant of the covariance matrix – it should be approximately zero. Ideally, the within-group covariance matrices should be homogeneous, and a test should be done. SAS does a chi-square test; if the result is significant, the matrices differ from each other. If the matrices are not homogeneous, the within-group covariance matrices are used instead of the overall covariance matrix. Another test is Box's M, but this is usually considered too sensitive.

If the above criteria indicate that the analysis was appropriate, then one should judge the overall discriminating ability of the analysis. This can be accomplished by examining the classification matrix for both the heuristic (model-building) data set and the holdout data set. In a balanced design, where the number of observations in each group is the same, this is straightforward. However, if one group contains a majority of observations then, just by chance, that group should also receive most of the classifications, which can produce an overestimate of the success of the DFA. For unbalanced designs, the classification should be compared to the maximum chance criterion; this means comparing the relative percent of observations in the largest group to the number of successful classifications in the largest group. Another method of assessment is Press's Q statistic:

$$Q = \frac{\left[n - (n_c \times k)\right]^2}{n(k-1)}$$

where n is the total number of observations, n_c is the number of correct classifications, and k is the number of groups. If more than one function is estimated, then a potency index should be derived; the potency index gives the relative contribution of each variable to the overall discrimination.

What to report:
Whether the analysis is exploratory or confirmatory
The method of variable selection
The variable loadings and the rotation method
The sorted functions with eigenvalues, the Wilks' lambda value, the chi-square, the degrees of freedom, and the significance of each function
The loading matrix – the degree to which a parameter loads on a particular function
Press's Q statistic or the maximum chance criterion
The pair-wise (squared) distances between groups, or the Mahalanobis distances
The potency index of each variable
The functions as equations. This is an absolute must if the ecologist wants to put the function out to be used by others, and is particularly true for sex or species determination in the field.

The discussion would be an appropriate place to put forward new hypotheses to explain the misclassifications. Was there a pattern? What variables should have been measured, or measured differently? Though this is speculative, it may give future ecologists a heads up. I also suggest graphing individual cases along the axis or axes that are the functions; if the graph is cluttered, then plotting a group centroid is useful. Exploratory analyses should include a hull containing all the points, and confirmatory analyses should include 95% confidence intervals. If there was good group separation, then individual cases should cluster amongst themselves. I also suggest making a table with the actual group, the predicted group, and the number in each.

SAS code:
PROC STEPDISC data=dataset forward pool=test slstay=0.05 simple tcorr wcorr;
  CLASS classvariable;
  VAR predictor1 predictor2;
RUN;
PROC DISCRIM data=dataset POOL=no WCOV WCORR PCOV LIST TESTDATA=testdatafile;
  CLASS classvariable;
  VAR predictor1 predictor2;
RUN;
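In R, a comparable analysis can be run with lda() from the MASS package. A minimal sketch on the built-in iris data (not an example from the text), with odd rows as the model-building set and even rows as the holdout set:

library(MASS)
train <- iris[seq(1, nrow(iris), by = 2), ]   # heuristic (model-building) sample
hold  <- iris[seq(2, nrow(iris), by = 2), ]   # holdout sample
dfa  <- lda(Species ~ ., data = train)
pred <- predict(dfa, newdata = hold)
table(actual = hold$Species, predicted = pred$class)   # classification matrix for the holdout sample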

34 REPEATED MEASURES GEE

SECTION THREE: Ecology-specific Questions

35 THE NUMBER OF SPECIES In community ecology, species richness is the simplest and most intuitive measure of species composition. Two issues arise when studying species richness: detectability, and estimating the species richness of an area from samples repeated spatially and temporally.

36 POPULATION ESTIMATION Capture-mark-recapture (CMR) studies are costly, time-consuming, and data-hungry, but they usually provide more insight into population dynamics. The choice of estimator depends on assumptions about the population. The primary assumption is whether the population is closed (no emigration or immigration) or open.

37 Spatial Statistics

1.7 Introduction

1.7.1 Spatial autocorrelation versus nonstationarity Some processes are constant across space, and the relationship between two variables does not change depending on location. Physical laws (e.g., the gas laws) are examples of such stationarity. Nonstationarity is when the relationship between two variables changes across space. Local effects result from nonstationarity, and global effects are those that are stationary. Spatial autocorrelation results when observations are more similar (or more dissimilar) the closer those observations are to one another. Measures of spatial autocorrelation can be global (the average autocorrelation for the entire area) or local (how autocorrelation changes over space). A regression model in which the coefficients are allowed to vary with location can be represented as

= ∑      yi B 0 B ui , v i xi i

37.1 Spatial correlation violates the assumption of independent observations.

37.1.1 Spatial lag Spatial lag occurs when observations of the response variable are related across space, so that the response at one location depends on the responses at neighboring locations. Spatial regression software typically also reports the Breusch-Pagan test, which measures heteroskedasticity; the Lagrange Multiplier (lag) test is used to test for spatial lag itself.

The spatial lag model takes the form

$$y = \rho W y + X\beta + \varepsilon$$

where ρ is the spatial autoregressive coefficient and Wy is the spatially lagged dependent variable. The Lagrange Multiplier (lag) test indicates whether the spatial lag term is needed.

Spatial error Spatial error occurs if the errors are spatially correlated. The spatial error model takes the form

$$y = X\beta + \lambda W\varepsilon + \varepsilon$$

where λ is the autoregressive coefficient, Wε is the spatially weighted error, and the error term ε is N(0, σ²I). Moran's I is a measure of spatial autocorrelation in the residuals. The R² reported for these models is not the same as the R² from ordinary regression. Moran's I and the Lagrange Multiplier (LM) tests both require normality of the errors (though this requirement is debatable); however, the robust LM test does not require normality of the error terms.
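Moran's I on regression residuals and the LM diagnostics can be computed with the spdep package. A hedged sketch, where dat is a placeholder data frame with coordinates x and y, response y1, and predictor x1, and the 8-nearest-neighbour weights are an arbitrary choice:

library(spdep)
coords <- cbind(dat$x, dat$y)
lw <- nb2listw(knn2nb(knearneigh(coords, k = 8)), style = "W")   # row-standardized spatial weights
ols <- lm(y1 ~ x1, data = dat)
lm.morantest(ols, lw)                                            # Moran's I of the OLS residuals
lm.LMtests(ols, lw, test = c("LMlag", "LMerr", "RLMlag", "RLMerr"))  # lag, error, and robust LM tests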

37.1.2 Geographically weighted regression Geographically weighted regression (GWR) uses a focal approach to obtain estimates. A focal approach uses a window or kernel centered on a point. A weighting function can be applied that gives greater weight to observations closer to the center point than to those near the edge of the kernel. The kernel can have a fixed width, or its width can be related to the density of observations: in areas with a high density of observations the extent of the kernel can shrink, and conversely, in areas where observations are sparse the extent of the kernel can increase to include more samples.
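A hedged sketch of geographically weighted regression using the spgwr package (dat, x, y, y1, and x1 are placeholders; the bandwidth is chosen by cross-validation):

library(spgwr)
coords <- cbind(dat$x, dat$y)
bw <- gwr.sel(y1 ~ x1, data = dat, coords = coords)           # cross-validated kernel bandwidth
gwr.fit <- gwr(y1 ~ x1, data = dat, coords = coords, bandwidth = bw)
gwr.fit                                                       # summary of the local coefficient estimates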

Model selection Spatial lag and spatial error models may perform better than a nonspatial model.

37.1.3 Transformations among projections

More generally, you need to take a look at the sp package for the spatial classes in R. Once your data are in these classes, you can use spTransform() from the rgdal package to perform the projection. spTransform() uses the proj4 library; you need to find the so-called proj4 strings describing both the source and target projections.
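A minimal sketch of such a projection (the coordinates, proj4 strings, and UTM zone are assumptions for illustration):

library(sp)
library(rgdal)
dat <- data.frame(lon = c(-75.9, -75.8), lat = c(41.2, 41.3), y1 = c(3, 5))   # made-up points
coordinates(dat) <- ~ lon + lat                               # promote to a SpatialPointsDataFrame
proj4string(dat) <- CRS("+proj=longlat +datum=WGS84")         # source projection
dat.utm <- spTransform(dat, CRS("+proj=utm +zone=18 +datum=WGS84"))   # target projection
coordinates(dat.utm)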

37.2 Software

1.7.2 GeoDa

1.7.2.1 The Jarque-Bera test measures the normality of the error distribution using a goodness-of-fit test.

1.7.2.2 When the multicollinearity condition number is > 20, this is cause for concern.

38 Experimental Design

38.1 Completely Randomized Design

38.1.1 Latin Square

38.1.2 Factorial Treatment Arrangement Individuals are randomly assigned one treatment level from each of the several factors. For example, if there are four pesticide treatment levels and three herbicide treatment levels, then there will be twelve different treatments, each having several (the more the better) individuals. This allows for the estimation of the interaction between the two treatments (pesticide and herbicide). Separate experiments can be done, but then the interaction cannot be estimated.
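A minimal sketch of the 4 x 3 factorial arrangement described above, using simulated (made-up) data:

set.seed(1)
fact <- expand.grid(pesticide = factor(1:4), herbicide = factor(1:3), rep = 1:5)
fact$yield <- rnorm(nrow(fact))                               # made-up response
fact.aov <- aov(yield ~ pesticide * herbicide, data = fact)   # '*' fits main effects plus the interaction
summary(fact.aov)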

39 Glossary nuisance parameter: a factor in a model that influences the inferences but whose effect is not of general interest

40 Literature Cited