Parametric Statistics: Exploring Assumptions
http://www.pelagicos.net/classes_biometry_fa20.htm

PSA – Quiz #3

Probability that a given snail would be a given size if, in fact, it belongs to a given population.

Find a snail that is 7 mm long. What species is it?

                    “Large” species           “Small” species
    Mean            12 mm                     4 mm
    STD             2                         1
    Size range      (12 – 4) to (12 + 4)      (4 – 2) to (4 + 2)
    (mean ± 2 STD)  about 95% from 8 to 16    about 95% from 2 to 6
    Z for 7 mm      Z = (7 – 12) / 2          Z = (7 – 4) / 1
                    Z = -5 / 2                Z = 3 / 1
                    Z = -2.5                  Z = +3

The snail is 2.5 STD below the “Large” mean and 3 STD above the “Small” mean, so it is unusual for both species but less extreme for the “Large” species.

R Assignment #2 - PSA

Question 1: What happens if you type this command in the console? Explain why.

    > Arbuthnot
    Error: object 'Arbuthnot' not found

HINT: Arbuthnot is different from arbuthnot. Why? R is case sensitive.

Question 18: Describe the streak length distributions:
What is the most common streak length in the kobe dataset?
What is the maximum streak length in the kobe dataset?
Paste the plot of the kobe dataset.

    > table(kobe_streak)
    kobe_streak
     0  1  2  3  4
    39 24  6  6  1

The most common streak length is “0”, which means Kobe missed his first shot.
The largest streak is “4”, which means Kobe made up to 4 shots in a row.

Question 18: Describe the streak length distributions:
What is the most common streak length in the cold dataset?
What is the maximum streak length in the cold dataset?
Paste the plot of the cold dataset.

    > table(cold_streak)
    cold_streak
     0  1  2  3  4  5
    37 12  9  5  4  1

The most common streak length is “0”, which means a cold-hand player would miss the first shot.
The largest streak is “5”, which means the cold-hand player would make up to 5 shots in a row.

Question 20: Do you think Kobe has a “hot hand”? Explain your reasoning of why / why not. For full credit, use the evidence you obtained from the data and the model prediction.

It is hard to tell how different the model results (expected results) are from the observations (observed results): some statistics agree and some disagree.

What summary statistics can we use to compare these two data distributions?
mean, median, S.D., IQR

    > mean(kobe_streak)        > mean(cold_streak)
    [1] 0.7631579              [1] 0.9705882
    > sd(kobe_streak)          > sd(cold_streak)
    [1] 0.9915432              [1] 1.326769
    > median(kobe_streak)      > median(cold_streak)
    [1] 0                      [1] 0
    > IQR(kobe_streak)         > IQR(cold_streak)
    [1] 1                      [1] 2

Comparing Two Sets of Categorical Variables

Kobe vs. Model

    Rcmdr> fisher.test(.Table)
    Fisher's Exact p-value = 0.1983

R Packages Used in This Chapter

For this chapter, you will use the following packages:

Start Rcmdr

    install.packages("car");
    install.packages("ggplot2");
    install.packages("pastecs");
    install.packages("psych");
    library(car);
    library(ggplot2);
    library(pastecs);
    library(psych);

NOTE: red font indicates Rcmdr dependencies

Definition – (Non)Parametric

Parametric statistics assume that data come from a specific probability distribution (typically a normal distribution) and make inferences about the parameters of that distribution.

Non-parametric statistics involve:
- Distribution-free techniques, which do not rely on data belonging to a particular distribution. For instance, randomization tests, whereby observations are shuffled.
- Non-parametric statistics, whose interpretation does not depend on fitting any parameterized distribution. For instance, statistics based on ranks of observations are at the core of many non-parametric approaches.

Parametric Statistics

Benefits and Costs:
- Because parametric statistics assume a specific probability distribution (here, the normal), they are not distribution-free.
- Parametric methods make more assumptions than non-parametric methods. If the extra assumptions are correct, parametric methods have more statistical power (produce more accurate and precise estimates).
- However, if those assumptions are incorrect, parametric methods can be very misleading. They can cause false positives (Type I errors). Thus, they are often not considered robust.

Parametric Statistics

Suggested Approach:
- Use parametric tests whenever possible.
- Take care to examine diagnostic statistics and to determine whether the extra assumptions are met.
- Perform the matching non-parametric test and compare results. What causes disagreements?

Exploring Assumptions

• Parametric tests based on the normal distribution assume:
  – Independent observations
  – Interval or ratio data (not binomial / nominal)
  – Normally distributed
    • Sampling distribution
    • Residuals of tests
  – Many tests: homogeneity of variances

Exploring Assumptions

• Assumptions of parametric tests based on the normal distribution
• Aim of this chapter:
  • Quantify the assumption of normality
    o Graphical displays
    o Skew
    o Kurtosis
    o Normality tests
  • Quantify the homogeneity of variances (when dealing with 2 or more samples): Levene’s test

Assessing Normality

• We cannot sample the entire biological population, so we test the observed data
• 1) Central Limit Theorem
  – If N < 25, sampling distribution rarely normal
• 2) Graphical Displays
  – Histogram
  – Q-Q plot
• 3) Skewness / Kurtosis (point estimate +/- SE)
  – Do they overlap with 0? (normal distribution)

Assessing Normality

• 4) Performing Statistical Tests
  o Shapiro-Wilk test: tests if data differ from a normal distribution
      Significant = non-normal data
      Non-significant = consistent with normal data
  o Equal variances (for 2 or more samples): tests if the data distributions have equal variances
      Significant = different variances
      Non-significant = consistent with equal variances

Assessing Normality - Graphically

Characteristics of normal distributions: unimodal, symmetrical, bell-shaped

Assessing Normality - Graphically

Comparing observations against a cumulative normal distribution (same mean and S.D.)

Assessing Normality - Graphically

[Figure: two normal Q-Q plots; x-axis: theoretical, y-axis: sample]

The percentiles denote the proportion of cases (observations) that fall below a certain value.
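The percentile comparison behind a Q-Q plot can be sketched in base R. This is a minimal illustration on simulated data, not the Festival data: the vector `x`, the seed, and the sample size are assumptions made for the example. Each sorted observation is paired with the normal quantile expected at the same percentile.

```r
# Minimal sketch on simulated data (assumed values, not course data)
set.seed(1)
x <- rnorm(100, mean = 12, sd = 2)

# Percentile assigned to each sorted observation
p <- ppoints(length(x))

# Quantiles expected from a normal distribution with the same mean and SD
theoretical <- qnorm(p, mean = mean(x), sd = sd(x))
observed <- sort(x)

# Points close to the 1:1 line suggest approximately normal data
plot(theoretical, observed, xlab = "theoretical", ylab = "sample")
abline(0, 1)

# Base R shortcut: the same idea, plotted against standard normal quantiles
qqnorm(x)
qqline(x)
```

With a skewed sample (for example `rexp(100)`), the points would be expected to bend away from the line at one or both ends.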
Compare the observed percentiles to the percentiles we would expect from a normal distribution.

Example: Festival Data Set

A biologist was worried about the potential health effects of music festivals and measured the hygiene of 810 concert-goers over the three days of a music festival.

Hygiene was measured using a standardized index (from 0 to 4):
    0 = you smell terrible
    4 = you smell beautiful

Download and import the Festival data (MusicFestival.xlsx). For ease of use, rename the data set “Festival”:

    > Festival <- DownloadFestival

Explore Data Graphically: Rcmdr

[Figure: panels for day1, day2 and day3, labeled histogram, density, histogram]

Graphs in Rcmdr – Quantiles

Graphically compares an observed (empirical) distribution (points) with a chosen theoretical expectation (line).
- Normal distribution is the default
- Identifies max / min as default
- Identify points: automatic or interactive

Graphs in Rcmdr – Quantiles

[Figure: quantile-comparison plot for day1]

The solid red line is the expected pattern for a normal distribution with the same mean and SD as the sampled data. Points outside the dashed-line envelope suggest significant deviations.

Graphs in Rcmdr – Quantiles

[Figure: quantile-comparison plots for day 2 and day 3]

Note: The straight line represents the expected pattern for a normal distribution.

Explore Festival Data Set

We can also explore the summary statistics describing the three datasets (day1, day2, day3) using Rcmdr:

    > numSummary(Festival[,c("day1", "day2", "day3"), drop=FALSE],
        statistics=c("mean", "sd", "IQR", "quantiles", "skewness", "kurtosis"),
        quantiles=c(0,.25,.5,.75,1), type="2")

NOTE: multiple datasets can be analyzed at once.

What statistics would you use to assess data normality?
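One possible answer uses the skewness and kurtosis statistics requested above, each compared with its standard error. The sketch below is a hedged base-R illustration on simulated data, not the Festival data: the vector `x` and the seed are assumptions, and the formulas are the common bias-adjusted ("type 2") estimators with their large-sample standard errors under normality. Whether Rcmdr reports exactly these quantities is assumed, not confirmed here.

```r
# Minimal sketch on simulated data (assumed values, not the Festival data)
set.seed(42)
x <- rexp(200)   # right-skewed sample, chosen only for illustration
n <- length(x)

# Central moments
d  <- x - mean(x)
m2 <- mean(d^2)
m3 <- mean(d^3)
m4 <- mean(d^4)

# Type 2 (bias-adjusted) skewness and excess kurtosis
skew <- (m3 / m2^1.5) * sqrt(n * (n - 1)) / (n - 2)
kurt <- ((n + 1) * (m4 / m2^2 - 3) + 6) * (n - 1) / ((n - 2) * (n - 3))

# Standard errors of skewness and kurtosis under normality
se_skew <- sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
se_kurt <- 2 * se_skew * sqrt((n^2 - 1) / ((n - 3) * (n + 5)))

# z-scores: (observed value - 0) / SE, since a normal distribution has
# skewness 0 and excess kurtosis 0
z_skew <- skew / se_skew
z_kurt <- kurt / se_kurt
print(c(skew = skew, z_skew = z_skew, kurt = kurt, z_kurt = z_kurt))
```

An absolute z-score above roughly 1.96 indicates that the estimate differs from 0 by more than about 2 SE, which is the same criterion as "point estimate +/- SE does not overlap 0" applied at the 5% level.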
Explore Festival Data Set

Exploring the summary statistics describing the three datasets (day1, day2, day3) using Rcmdr:

    > numSummary(Festival[,c("day1", "day2", "day3"), drop=FALSE],
        statistics=c("mean", "quantiles", "skewness", "kurtosis"),
        quantiles=c(.5), type="2")
              mean skewness    kurtosis  50%   n  NA
    day1 1.7933580 8.865312 170.4502658 1.79 810   0
    day2 0.9609091 1.095226   0.8222057 0.79 264 546
    day3 0.9765041 1.032868   0.7315003 0.76 123 687

Further Explore Festival Data Set

Exploring the data further using other functions: describe() function in psych package

    > describe(Festival$day1)
      vars   n mean   sd median skew kurtosis
         1 810 1.79 0.94   1.79 8.83   168.97
      trimmed mad  min   max range   se
         1.77 0.7 0.02 20.02    20 0.03

Further Explore Festival Data Set

Exploring the data further using other functions: stat.desc() function in pastecs package

    > stat.desc(Festival$day1, basic = FALSE, norm = TRUE)

basic argument: basic statistics included if TRUE (Note: TRUE is the default)
norm argument: statistics relating to the normal distribution included if TRUE (Note: FALSE is the default)

Further Explore Festival Data Set

    > stat.desc(Festival$day1, basic = FALSE, norm = TRUE)
          median          mean       SE.mean  C.I.mean.0.95
    1.790000e+00  1.793358e+00  3.318617e-02   6.514115e-02
             var       std.dev      coef.var
    8.920705e-01  9.444949e-01  5.266627e-01

Further Explore Festival Data Set

    > stat.desc(Festival$day1, basic = FALSE, norm = TRUE)
        skewness      skew.2SE
    8.832504e+00  5.140707e+01
        kurtosis      kurt.2SE
    1.689671e+02  4.923139e+02

skew.2SE: skew divided by 2 SE
kurt.2SE: kurtosis divided by 2 SE

• How can we interpret these results?
Z = (observed value – theoretical value) / (SE of value)

Further Explore Festival Data Set

        skewness      skew.2SE
    8.832504e+00  5.140707e+01

skew.2SE: Skew