The Scientific Method and Basic Statistics

Objectives:
• Understand the steps in the Scientific Method
• Be able to describe basic statistical parameters and how they relate to the Normal (Gaussian) Distribution Model
• Be able to explain how hypotheses are tested: supported or rejected

What Do Scientists Do?

•Scientists collect data and develop theories, models, and laws about how nature works.
•Science searches for natural causes to explain natural phenomena.

1. Purpose of science
   a. to determine cause and effect
   b. to gain insight into natural events
2. Science does not include “absolutes”
3. Science provides tentative explanations of natural phenomena
4. Fundamental basis of science: The Principle of Uncertainty

“Science cannot prove anything, nor is it a search for the ‘truth’.”

1. Science develops tentative answers, or guesses (hypotheses), based on evidence
2. Theory - when supporting evidence is very strong!

Science Is a Search for Order in Nature

• Identify a problem
• Find out what is known about the problem
• Ask a question to be investigated
• Gather data through experiments
• Propose a scientific hypothesis
• Make testable predictions
• Keep testing and making observations
• Accept or reject the hypothesis

Scientific theory: well-tested and widely accepted hypothesis

Characteristics of Science…and Scientists

• Curiosity
• Skepticism
• Reproducibility
• Peer review
• Openness to new ideas
• Critical thinking
• Creativity

Observation: Nothing happens when I try to turn on my flashlight.

Question: Why didn’t the light come on? Are the batteries dead?

Hypothesis: Maybe the batteries are dead.

Test hypothesis with an experiment: Put in new batteries and try to turn on the flashlight.

Result: Flashlight still does not work.

New hypothesis: Maybe the bulb is burned out.

Experiment: Put in a new bulb.

Result: Flashlight works.

Conclusion: New hypothesis is verified.

Fig. 2-3, p. 33

Concept 1.1 Connections in Nature

Observation of Pacific tree frogs suggested that a parasite can cause deformities. Small glass beads implanted in tadpoles to mimic the effect of cysts of Ribeiroia, a trematode, also produced deformities.

Further studies:
• Deformities of Pacific tree frogs occurred only in ponds that also had an aquatic snail, Helisoma tenuis, an intermediate host of the parasite.
• All frogs with deformed limbs had Ribeiroia cysts.

Figure 1.3 The Life Cycle of Ribeiroia

1. Observation

• The awareness of a natural event or natural phenomenon directly or indirectly by means of our senses.

Observation: North-facing slopes have heavier tree growth than south-facing slopes.

Possible Questions:
• What causes trees to grow more abundantly on north-facing slopes? Question both relevant and testable, but very general.
• What causes the slope to be north-facing? Probably not relevant.
• Did Martians plant these trees 10,000 years ago? Probably not testable.
• Is evaporation of water less on north-facing slopes than south-facing slopes? More relevant and to the point.

Question: Is evaporation of water less on north-facing slopes than south-facing slopes?

3. Hypothesis:

• A guess postulating an answer to the question
• Must be relevant and testable

Bias

My idea is so logical, so reasonable, and it sounds so right, it must be correct

Where is the supporting evidence?

Hypothesis: Evaporation is greater on south-facing slopes than north-facing slopes.

4. Experiment

•Additional observations gathered to test the hypothesis.

Experiment: Test evaporation using a sling psychrometer.

Experimental Difficulties

• Bias
• Experimental Errors
• Sample Size
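Sample size matters because short runs of identical outcomes can happen by chance alone. As a minimal sketch, assuming a fair coin and independent flips, the probability of n heads in a row is $2^{-n}$. The function name `prob_all_heads` is hypothetical and used only for illustration.

```python
from fractions import Fraction


def prob_all_heads(n_flips: int) -> Fraction:
    """Probability that n_flips independent fair-coin flips all land heads."""
    if n_flips < 0:
        raise ValueError("n_flips must be non-negative")
    return Fraction(1, 2) ** n_flips


for n in (5, 10, 100):
    p = prob_all_heads(n)
    # 1 / p is the "1 in ..." form of the same odds
    print(f"{n} heads in a row: 1 in {float(1 / p):.3g}")
```

Exact fractions are used here so that very small probabilities are not rounded away before they are printed.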

What are the odds of flipping:
• 5 heads in a row? $2^{-5} = 1/32$
• 10 heads in a row? $2^{-10} = 1/1024$
• 100 heads in a row? $2^{-100} \approx 1/(1.27 \times 10^{30})$, or 1 in 1,270,000,000,000,000,000,000,000,000,000

Charlie’s Sick

Diagnosis – Ick

Fish Ick Medicine

Controlled Experiment

•Run two side-by-side experiments
1. No change
2. Change one experimental variable only

Controlled Study

                              Experimental Group       Control Group
Conditions identical except   Fish ick medicine        no medicine
How many of each?             ~50 experimental fish    ~50 control fish

5. Evaluation – Conclusions
• Analyze the results of the experiment

How many of each lived?

        50 Experimental Fish    50 Control Fish
Live    40 / 50                 10 / 50

Conclusion – Medication helps

        50 Experimental Fish    50 Control Fish
Live    40 / 50                 32 / 50

Conclusion – Not clear if medication helps

5. Evaluation
• When results are close, the sample size is critical.

How many fish should be used?

Inconclusive result if 100 fish are used (difference = 1/256 chance)

        Experimental Fish    Control Fish
Live    40 / 50              32 / 50

More conclusive result if 1000 fish are used (difference = $1/(1.21 \times 10^{30})$ chance)

        Experimental Fish    Control Fish
Live    400 / 500            320 / 500

Statistical Approach to Science
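As one statistical way to look at the fish results, the sketch below applies a pooled two-proportion z-test to the 40/50 vs 32/50 and 400/500 vs 320/500 survival counts. The choice of test is an assumption made for illustration, not the calculation behind the chance figures quoted with the tables, so its p-values will not match those figures. The function name `two_proportion_p_value` is hypothetical.

```python
import math


def two_proportion_p_value(live_a: int, n_a: int, live_b: int, n_b: int) -> float:
    """Two-sided p-value for H0: both groups share one survival rate (pooled z-test)."""
    p_a, p_b = live_a / n_a, live_b / n_b
    pooled = (live_a + live_b) / (n_a + n_b)
    # Standard error under H0; this is zero if every fish lived or every fish died
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("pooled survival rate of 0 or 1 gives no variability to test")
    z = (p_a - p_b) / se
    # Two-sided tail area of the standard normal distribution
    return math.erfc(abs(z) / math.sqrt(2))


p_100 = two_proportion_p_value(40, 50, 32, 50)
p_1000 = two_proportion_p_value(400, 500, 320, 500)
print(f"100 fish:  p = {p_100:.3g}")
print(f"1000 fish: p = {p_1000:.3g}")
```

The survival proportions are identical in both cases (80% vs 64%); only the number of fish changes, which is what drives the p-value down in the larger experiment.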

• How does science develop theories?
• A theory is a hypothesis which is solidly supported by evidence. Support for hypotheses comes from statistics.
• Using a sample, the mean of an experimental population can be determined along with other statistical parameters.
• The absolute “true mean” (denoted as $\mu$) cannot be determined. Instead, we estimate a mean ($\bar{x}$) for our sample population.
• We can estimate a confidence interval in which the true mean of the population lies at a given level of probability.
• This honors the Uncertainty Principle in Science.

Statistical Method

• There is a high degree of variability in living things: cells, organisms, populations

• Sample – a portion of a population; it must be sufficiently large and obtained randomly
• Random selection reduces bias
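A minimal sketch of random selection, assuming a simulated population (the values below are generated for illustration and are not data from the lecture). Each individual has the same chance of being chosen, so the sample mean should land near the population mean.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1,000 simulated trait values
population = [rng.gauss(50.0, 10.0) for _ in range(1000)]

# Simple random sample without replacement: every individual is equally likely to be chosen
sample = rng.sample(population, k=100)

print(f"population mean: {statistics.mean(population):.2f}")
print(f"sample mean:     {statistics.mean(sample):.2f}")
```

A non-random sample, for example the 100 largest values, would give a biased estimate of the mean no matter how large it was.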

“Normal” Distribution

The line of a bell-shaped curve reveals continuous variation in the population.

[Figure: bell-shaped curve. Vertical axis: number of individuals with some value of the trait. Horizontal axis: range of values for the trait.] Fig. 8-14a and Fig. 8-14b, p.120

Statistics

Summation Notation and Symbols
• $i$ is the index variable, or counter. The index variable is used to identify each observed value.
• $N$ is the number of observations.
• $x_i$ is the variable of interest for observation number $i$.
• $\sum$ is sigma (Greek capital S). This means to add, or sum, all observations of variable $x$.

• Mean
$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$$

• Variance
$$s_x^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}$$

• Standard deviation
$$s_x = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}} = \sqrt{\frac{\sum_{i=1}^{N} x_i^2 - N\bar{x}^2}{N - 1}}$$

Arithmetic Mean

• Mean is the average value of observations.
• Determined by adding up all values, then dividing by the number of observations.
• The mean represents an estimate of the absolute “true mean”, denoted with a Greek lowercase mu ($\mu$).

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$$

Variance

Variance is an estimate of the spread of values in our observations.
• Obtained by summing the squares of the differences between individual values and the mean, then dividing by the number of observations minus one.
• Again, this is an estimate of the “true variance” ($\sigma^2$).

$$s_x^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}$$

Standard deviation

Standard deviation is another estimate of the spread of values in relation to the mean. Again, this is an estimate of the “true deviation” ($\sigma$), represented by a lowercase Greek sigma.

Simply calculated as the square root of the variance
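The three parameters can be computed directly from their definitions. This is a minimal sketch on hypothetical observations, chosen only to keep the arithmetic easy to follow; the last lines cross-check the results against Python's standard `statistics` module.

```python
import math
import statistics

# Hypothetical observations (not data from the lecture)
x = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(x)

mean = sum(x) / n
# Definitional form: squared deviations from the mean, divided by N - 1
variance = sum((xi - mean) ** 2 for xi in x) / (n - 1)
# Computational shortcut: (sum of squares - N * mean^2) / (N - 1)
variance_shortcut = (sum(xi ** 2 for xi in x) - n * mean ** 2) / (n - 1)
# Standard deviation is the square root of the variance
std_dev = math.sqrt(variance)

print(f"mean = {mean}, variance = {variance:.4f}, std dev = {std_dev:.4f}")
# Cross-check against the standard library's sample (N - 1) estimators
print(statistics.variance(x), statistics.stdev(x))
```

The shortcut form gives the same variance as the definitional form; in floating point it can lose precision when the mean is large relative to the spread, so the definitional form is the safer default.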

$$s_x = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}} = \sqrt{\frac{\sum_{i=1}^{N} x_i^2 - N\bar{x}^2}{N - 1}}$$

Confidence Interval

A confidence interval (CI) gives the range of values within which the true population value is expected to lie at a given level of probability, with our sample estimate at the center of the range. It also provides our level of confidence for rejecting or failing to reject a null hypothesis.

$$CI = (\bar{X}_1 - \bar{X}_2) \pm t \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

Confidence Level

• In biology the level of confidence used is usually 95%.
• This means there is a 5% chance that our conclusion is in error!

95% Confidence interval: 95% of data will be contained within non-shaded area of curve (Fig. 8-15, p.121)

T-test determines probability that two data sets are from a single population

Hypotheses

Ho: µ1 = µ2

H1: µ1 ≠ µ2

In this example we can visually see a significant difference between two means. After conducting a t-test, we would reject the null hypothesis; the two means are not equal.

[Figure: back-to-back histograms for two TAXON groups (Pelv, Porph). Horizontal axis: Count. Vertical axis: N.]

Null vs Alternate Hypotheses

• Null Hypothesis Ho: µ1 = µ2

• By default, the null hypothesis is that there is no significant difference between our two sample means.

• Alternate Hypothesis H1: µ1 ≠ µ2

• t Test Decision Rule
If the p-value is less than alpha, reject the null hypothesis (the two means are not equal).
If the p-value is greater than or equal to alpha, fail to reject the null hypothesis.
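The decision rule can be sketched with the separate-variance (Welch) form of the two-sample t-test, using the group summaries (N, mean, SD) from the TEMP by TREATMENT output reported in this section. Two parts of the sketch are assumptions rather than anything stated in the lecture: the p-value uses a normal approximation to the t distribution (reasonable only because the degrees of freedom are large), and the critical value 1.974 is an approximate two-sided 95% t quantile near 174 degrees of freedom. The function name `welch_from_summary` is hypothetical.

```python
import math
from statistics import NormalDist


def welch_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Return (difference, standard error, t, df) for a separate-variance t-test."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    diff = mean1 - mean2
    se = math.sqrt(v1 + v2)
    t = diff / se
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return diff, se, t, df


# Group summaries for TEMP by TREATMENT (None vs Shade)
diff, se, t, df = welch_from_summary(16.55697, 2.60453, 116, 14.57568, 2.03032, 287)

alpha = 0.05
# ASSUMPTION: normal approximation to the t distribution, reasonable only for large df
p_approx = 2 * (1 - NormalDist().cdf(abs(t)))
# ASSUMPTION: 1.974 is roughly the two-sided 95% t critical value near df = 174
t_crit = 1.974
ci = (diff - t_crit * se, diff + t_crit * se)

decision = "reject Ho" if p_approx < alpha else "fail to reject Ho"
print(f"t = {t:.3f}, df = {df:.1f}, 95% CI = {ci[0]:.3f} to {ci[1]:.3f}, {decision}")
```

For small samples the normal approximation is too optimistic, and a proper t distribution from a statistics package would be the better choice.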

Two-sample t-test on TEMP grouped by TREATMENT$ against Alternative = 'not equal'

Group    N      Mean        SD
None     116    16.55697    2.60453
Shade    287    14.57568    2.03032

Separate variance:
  Difference in means = 1.98130
  95.00% CI = 1.44862 to 2.51398
  t = 7.34105
  df = 174.2
  p-value = 0.00000

Pooled variance:
  Difference in means = 1.98130
  95.00% CI = 1.50322 to 2.45937
  t = 8.14733
  df = 401
  p-value = 0.00000

[Figure: back-to-back histograms of TEMP by TREATMENT (None, Shade). Horizontal axis: Count. Vertical axis: TEMP.]

Comparing more than two means

•T-tests work when we want to determine the equality of two means.
•What if we have 3 or more sample populations to compare?
•There are additional statistical analyses performed on more than two populations, but they depend on the type of data and on the question we’re asking
•Typically results in models

Types of Data

• Categorical: qualitative data that fall into distinct categories. Further divided into two types:
   • Nominal: descriptive (color, gender)
   • Ordinal: where order is important (mature, immature)
• Numerical: quantitative, measured numerical observations, also subdivided into two types:
   • Discrete: only certain values are possible (number of seeds, offspring, etc.)
   • Continuous: any value within an interval is possible and limited only by the resolution of the measuring device (height, weight, concentration, temperature)

The General Linear Model

• Used for comparing multiple populations or data sets
• Analysis of variance: like a t-test on 3 or more groups
• Correlation: tests whether two variables are correlated (display a linear relationship)
• Regression analysis: once correlation is established, determines how well an independent variable (x-axis) predicts the value of a dependent variable (y-axis)

Analysis of Variance (ANOVA)
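A minimal sketch of the idea behind a one-way analysis of variance: compare the variation between group means with the variation within groups, and report their ratio as the F statistic. The site readings below are hypothetical values for illustration only, and the function name `one_way_anova_f` is an assumption. Turning F into a p-value requires the F distribution, which is left to a statistics package.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Between-group sum of squares: how far each group mean sits from the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: scatter of observations around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical temperature readings at three sites (illustrative values only)
site_a = [1.0, 2.0, 3.0]
site_b = [2.0, 3.0, 4.0]
site_c = [5.0, 6.0, 7.0]
f_stat, df_b, df_w = one_way_anova_f([site_a, site_b, site_c])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F means the group means differ by more than the scatter within groups would suggest, which is the same logic as the t-test extended to 3 or more groups.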

[Figure: Least Squares Means of TEMP by SITE (HCN, HCS, LP, MP), two panels.]

General Linear Model: Regression on continuous variables

[Figure: NDVI vs Leaf Chlorophyll. Horizontal axis: Chlorophyll (mg/cm2). Vertical axis: NDVI. R2 = 0.8114]

ANOVA

• Sometimes data must be reclassified.
• Here, we measured actual concentrations (continuous data), but had to run an ANOVA as if the data were categorical.
• This was decided by peers reviewing our manuscript for publication.

General Linear Model: Linear Regression

• A data set has values $y_i$, each of which has an associated modeled value $f_i$ (also sometimes referred to as $\hat{y}_i$). Here, the values $y_i$ are called the observed values and the modeled values $f_i$ are sometimes called the predicted values.

• The "variability" of the data set is measured through different sums of squares:

• the total sum of squares (proportional to the sample variance): $SS_{tot} = \sum_{i=1}^{N} (y_i - \bar{y})^2$

• the regression sum of squares, also called the explained sum of squares: $SS_{reg} = \sum_{i=1}^{N} (f_i - \bar{y})^2$

• the sum of squares of residuals, also called the residual sum of squares: $SS_{res} = \sum_{i=1}^{N} (y_i - f_i)^2$

In the above, $\bar{y}$ is the mean of the observed data: $\bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i$

• The most general definition of the coefficient of determination is $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$
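As a sketch of how these sums of squares fit together, the snippet below fits a straight line by ordinary least squares and computes the coefficient of determination as 1 minus the residual sum of squares divided by the total sum of squares. The data are hypothetical (not the NDVI measurements from the lecture), and the function name `linear_fit_r_squared` is an assumption made for illustration.

```python
def linear_fit_r_squared(x, y):
    """Fit y = a + b*x by least squares and return (a, b, R^2)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx              # slope
    a = y_bar - b * x_bar      # intercept
    f = [a + b * xi for xi in x]  # modeled (predicted) values
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)           # total sum of squares
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, f))  # residual sum of squares
    return a, b, 1 - ss_res / ss_tot


# Hypothetical data for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b, r2 = linear_fit_r_squared(x, y)
print(f"y = {a:.2f} + {b:.2f}x, R^2 = {r2:.2f}")
```

An R^2 near 1 means the fitted line accounts for most of the variability in the observed values; an R^2 near 0 means it accounts for very little.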