Laboratory in Oceanography: Data and Methods
Intro to the Statistics Toolbox
MAR550, Spring 2020
Miles A. Sundermeyer

Sundermeyer MAR 550 Spring 2020

Statistics Toolbox/Descriptive Statistics
Measures of Central Tendency

Function Name   Description
geomean         Geometric mean
harmmean        Harmonic mean
mean            Arithmetic mean
median          50th percentile
mode            Most frequent value
trimmean        Trimmed mean (specify percentile)

• Geometric Mean:
$$\left(\prod_{i=1}^{n} a_i\right)^{1/n} = \sqrt[n]{a_1 a_2 \cdots a_n} = \exp\left(\frac{1}{n}\sum_{i=1}^{n} \ln a_i\right), \quad a_i > 0 \text{ for all } i$$
• Harmonic Mean:
$$\frac{n}{\sum_{i=1}^{n} \frac{1}{a_i}} = \frac{n}{\frac{1}{a_1} + \frac{1}{a_2} + \cdots + \frac{1}{a_n}}$$

Statistics Toolbox/Descriptive Statistics
Measures of Dispersion (or Spread)

Function Name   Description
iqr             Interquartile range
mad             Mean absolute deviation
moment          Central moment of all orders
range           Range
std             Standard deviation
var             Variance

• Interquartile range: difference between the 75th and 25th percentiles
• Mean absolute deviation: mean(abs(x-mean(x)))
• Moment: mean((x-mean(x)).^order) (e.g., order=2 gives the variance normalized by n, i.e., var(x,1))
• skewness: third central moment of x, divided by the cube of its standard deviation (positive/negative skewness implies a longer right/left tail)
• kurtosis: fourth central moment of x, divided by the 4th power of its standard deviation (high kurtosis means a sharper peak and longer/fatter tails; the normal distribution has kurtosis 3)

Statistics Toolbox/Descriptive Statistics
[Figure: Examples of Skewness & Kurtosis: Gaussian (normal) Distribution]

Statistics Toolbox/Descriptive Statistics
Bootstrap Method
• Involves choosing random samples with replacement from a data set and analyzing each sample data set the same way as the original data set. The number of elements in each bootstrap sample set equals the number of elements in the original data set. The range of sample estimates obtained provides a means of estimating the uncertainty of the quantity being estimated.
• In general, the bootstrap method can be used to compute uncertainty for any functional calculation, provided the sample data set is 'representative' of the true distribution.

Jackknife Method
• The jackknife is similar to the bootstrap, but resamples by leaving out one observation at a time; it is used to estimate the bias and variance of sample statistics.

Statistics Toolbox/Descriptive Statistics
Example: Bootstrap Method for estimating uncertainty on Lagrangian Integral Time Scale (from Sundermeyer and Price, 1998)

"Integrating the LACFs using 100 days as the upper limit of the integral of Rii(t) in (12) gives the integral timescales I(11,22) = (10.6 ± 4.8, 5.4 ± 2.8) days for the (zonal, meridional) components, where uncertainties represent 95% confidence limits estimated using a bootstrap method [e.g., Press et al., 1986]."

Statistics Toolbox/Statistical Visualization
Probability Distribution Plots
• Normal Probability Plots:

>> x = normrnd(10,1,25,1);     % 25 samples from a normal distribution
>> normplot(x)

>> x = exprnd(10,100,1);       % 100 samples from an exponential distribution
>> normplot(x)

Statistics Toolbox/Statistical Visualization
Probability Distribution Plots
• Quantile-Quantile Plots:

>> x = poissrnd(10,50,1);
>> y = poissrnd(5,100,1);
>> qqplot(x,y);

>> x = normrnd(5,1,100,1);
>> y = wblrnd(2,0.5,100,1);
>> qqplot(x,y);

Statistics Toolbox/Statistical Visualization
Probability Distribution Plots
• Cumulative Distribution Plots:

>> y = evrnd(0,3,100,1);       % samples from an extreme value distribution
>> cdfplot(y)                  % empirical CDF
>> hold on
>> x = -20:0.1:10;
>> f = evcdf(x,0,3);           % theoretical CDF
>> plot(x,f,'m')
>> legend('Empirical', ...
          'Theoretical', ...
          'Location','NW')

Statistics Toolbox/Probability Distributions/Supported Distributions
Supported distributions include a wide range of:
• Continuous distributions (data)
• Continuous distributions (statistics)
• Discrete distributions
• Multivariate distributions

Function Name   Description
pdf             Probability density functions
cdf             Cumulative distribution functions
inv             Inverse cumulative distribution functions
stat            Distribution statistics functions
fit             Distribution fitting functions
like            Negative log-likelihood functions
rnd             Random number generators

https://www.mathworks.com/help/stats/supported-distributions.html

Statistics Toolbox/Probability Distributions/Supported Distributions
Supported distributions (cont'd)

Name               pdf           cdf           inv            stat      fit                     like      rnd
...
Normal (Gaussian)  normpdf, pdf  normcdf, cdf  norminv, icdf  normstat  normfit, mle, dfittool  normlike  normrnd, randn, random, randtool
Pearson system     -             -             -              -         pearsrnd                -         pearsrnd
Piecewise          pdf           cdf           icdf           -         paretotails             -         random
Rayleigh           raylpdf, pdf  raylcdf, cdf  raylinv, icdf  raylstat  raylfit, mle, dfittool  -         raylrnd, random, randtool
...
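The function-name convention in the tables above (a distribution prefix plus pdf, cdf, inv, stat, fit, like, or rnd) can be illustrated with the normal distribution. The following is a minimal sketch, assuming the Statistics Toolbox is installed; the sample size, parameters, and random seed are arbitrary choices made for illustration and are not taken from the slides.

```matlab
% Sketch of the naming convention for the normal distribution.
% The data are synthetic; the seed and parameters are arbitrary assumptions.
rng(0);                                        % fix the seed for repeatability
x = normrnd(10, 2, 1000, 1);                   % rnd:  1000 samples from N(10, 2^2)

[muHat, sigmaHat] = normfit(x);                % fit:  estimate mean and std. dev.

xi = linspace(0, 20, 201);                     % evaluation grid
f  = normpdf(xi, muHat, sigmaHat);             % pdf:  fitted density on the grid
F  = normcdf(xi, muHat, sigmaHat);             % cdf:  fitted cumulative distribution
q  = norminv([0.025 0.975], muHat, sigmaHat);  % inv:  2.5% and 97.5% quantiles
[m, v] = normstat(muHat, sigmaHat);            % stat: mean and variance of the fit
nll = normlike([muHat sigmaHat], x);           % like: negative log-likelihood

% The generic functions take the distribution name as the first argument
fGeneric = pdf('Normal', xi, muHat, sigmaHat);

fprintf('fitted mean = %.3f, fitted std = %.3f\n', muHat, sigmaHat);
fprintf('95%% of the fitted distribution lies in [%.3f, %.3f]\n', q(1), q(2));
```

The generic pdf, cdf, icdf, and random functions give the same results as the distribution-specific ones, which is convenient when the distribution is chosen at run time.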
Statistics Toolbox/Probability Distributions/Supported Distributions
Supported statistics

Name                   pdf           cdf           inv            stat      fit       like  rnd
Chi-square             chi2pdf, pdf  chi2cdf, cdf  chi2inv, icdf  chi2stat  -         -     chi2rnd, random, randtool
F                      fpdf, pdf     fcdf, cdf     finv, icdf     fstat     -         -     frnd, random, randtool
Noncentral chi-square  ncx2pdf, pdf  ncx2cdf, cdf  ncx2inv, icdf  ncx2stat  -         -     ncx2rnd, random, randtool
Noncentral F           ncfpdf, pdf   ncfcdf, cdf   ncfinv, icdf   ncfstat   -         -     ncfrnd, random, randtool
Noncentral t           nctpdf, pdf   nctcdf, cdf   nctinv, icdf   nctstat   -         -     nctrnd, random, randtool
Student's t            tpdf, pdf     tcdf, cdf     tinv, icdf     tstat     -         -     trnd, random, randtool
t location-scale       -             -             -              -         dfittool  -     -

Statistics Toolbox/Hypothesis Tests
Hypothesis Testing
• A hypothesis test can only disprove (reject) a hypothesis; it cannot prove one.
• null hypothesis: an assertion about a population. It is "null" in that it represents a status quo belief, such as the absence of a characteristic or the lack of an effect.
• alternative hypothesis: a contrasting assertion about the population that can be tested against the null hypothesis:
    H1: µ ≠ null hypothesis value (two-tailed test)
    H1: µ > null hypothesis value (right-tail test)
    H1: µ < null hypothesis value (left-tail test)
• test statistic: a random sample is collected from the population, and a test statistic is computed to characterize the sample. The statistic varies with the type of test, but its distribution under the null hypothesis must be known (or assumed).
• p-value: probability, under the null hypothesis, of obtaining a value of the test statistic as extreme or more extreme than the value computed from the sample.
• significance level: threshold of probability α; a typical value of α is 0.05. If p-value < α, the test rejects the null hypothesis; if p-value > α, there is insufficient evidence to reject the null hypothesis.
• confidence interval: estimated range of values with a specified probability of containing the true population value of a parameter.

Statistics Toolbox/Hypothesis Tests
Hypothesis Testing
• Hypothesis tests make assumptions about the distribution of the random variable being sampled in the data. These must be considered when choosing a test and when interpreting the results.
• The z-test (ztest) and the t-test (ttest) test whether a sample mean is significantly different from a given value. Both tests assume the data are independently sampled from a normal distribution.
• Both the z-test and the t-test are relatively robust with respect to departures from this assumption, so long as the sample size n is large enough.
• The difference between the z-test and the t-test is in what is assumed about the standard deviation σ of the underlying normal distribution. A z-test assumes that σ is known; a t-test does not. Thus the t-test must estimate the standard deviation, s, from the sample.

Statistics Toolbox/Hypothesis Tests
ztest
• The test requires σ (the standard deviation of the population) to be known.
• The formula for calculating the z score for the z-test is:
$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$
where:
    $\bar{x}$ is the sample mean
    $\mu$ is the mean of the population
    $\sigma$ is the standard deviation of the population
    $n$ is the sample size
http://www.stats4students.com/Essentials/Standard-Score/Overview.php
• The z-score is compared to a z-table, which contains the percent of area under the normal curve between the mean and the z-score. This table indicates whether the calculated z-score is within the realm of chance, or whether it is so different from the mean that the sample mean is unlikely to have happened by chance.
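The z-score calculation described above can be sketched in a few lines of MATLAB. This is an illustrative sketch only: the data are synthetic, and the null-hypothesis mean, the "known" σ, and the random seed are assumptions chosen for the example. The standard normal CDF (normcdf) takes the place of the z-table, and the result is compared against the toolbox function ztest.

```matlab
% Hypothetical example: 25 synthetic observations, sigma assumed known.
rng(1);                          % fix the seed so the run is repeatable
mu0   = 10;                      % mean under the null hypothesis (assumed)
sigma = 1;                       % known population standard deviation (assumed)
x     = normrnd(10.5, sigma, 25, 1);

% z statistic computed directly from the formula z = (xbar - mu)/(sigma/sqrt(n))
n = numel(x);
z = (mean(x) - mu0) / (sigma / sqrt(n));

% Two-tailed p-value from the standard normal CDF (replaces the z-table lookup)
p = 2 * (1 - normcdf(abs(z)));

% Same test using the toolbox function; h = 1 means H0 is rejected at alpha = 0.05
[h, pTool, ci, zTool] = ztest(x, mu0, sigma);

fprintf('z = %.4f (ztest: %.4f)\n', z, zTool);
fprintf('p = %.4g (ztest: %.4g), reject H0: %d\n', p, pTool, h);
fprintf('95%% confidence interval for the mean: [%.3f, %.3f]\n', ci(1), ci(2));
```

By default ztest runs a two-tailed test at the 5% significance level; the hand calculation matches that default.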
Statistics Toolbox/Hypothesis Tests
ttest
• Like the z-test, except that the t-test does not require σ to be known.
• The formula for calculating the t score for the t-test is:
$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$
where:
    $\bar{x}$ is the sample mean
    $\mu$ is the mean of the population
    $s$ is the sample standard deviation
    $n$ is the sample size
http://www.stats4students.com/Essentials/Standard-Score/Overview.php
• Under the null hypothesis that the population is distributed with mean μ, the z-statistic has a standard normal distribution, N(0,1). Under the same null hypothesis, the t-statistic has Student's t distribution with n – 1 degrees of freedom.

Statistics Toolbox/Hypothesis Tests
ttest2
• Performs a t-test of the null hypothesis that the data in the vectors x and y are independent random samples from normal distributions with equal means and unknown variances. By default the unknown variances are assumed equal; the test can also be run without assuming equal variances.
• The formula for calculating the t score for ttest2, in the form that does not assume equal variances, is:
$$t = \frac{\bar{x} - \bar{y}}{\sqrt{\dfrac{s_x^2}{n} + \dfrac{s_y^2}{m}}}$$
where:
    $\bar{x}$, $\bar{y}$ are the sample means
    $s_x$, $s_y$ are the sample standard deviations (so $s_x^2$, $s_y^2$ are the sample variances)
    $n$, $m$ are the sample sizes
http://www.socialresearchmethods.net/kb/stat_t.php
• The null hypothesis is that the two samples are distributed with the same mean.

Statistics Toolbox/Hypothesis Tests

Function        Description
ansaribradley   Ansari-Bradley test. Tests if two independent samples come from the same distribution, against the alternative that they come from distributions that have the same median and shape but different variances.
chi2gof         Chi-square goodness-of-fit test. Tests if a sample comes from a specified distribution, against the alternative that it does not come from that distribution.
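The one-sample and two-sample t statistics defined on the ttest and ttest2 slides can be checked against the toolbox functions. The following is a rough sketch under stated assumptions: the two samples are synthetic, the sample sizes, means, standard deviations, and seed are arbitrary, and the two-sample case uses the 'Vartype','unequal' option of ttest2 so that it corresponds to the unequal-variance form of the statistic.

```matlab
% Hypothetical synthetic samples; all parameters here are illustrative assumptions.
rng(2);                                   % fix the seed so the run is repeatable
x = normrnd(5.0, 1.0, 30, 1);             % sample 1: n = 30
y = normrnd(5.5, 2.0, 40, 1);             % sample 2: m = 40, larger spread

% One-sample t-test of H0: mean of x equals mu0
mu0 = 5;
n   = numel(x);
t1  = (mean(x) - mu0) / (std(x) / sqrt(n));   % s is the sample standard deviation
p1  = 2 * tcdf(-abs(t1), n - 1);              % two-tailed, n-1 degrees of freedom
[h1, p1Tool, ~, stats1] = ttest(x, mu0);

% Two-sample t-test of H0: equal means, variances not assumed equal
m  = numel(y);
t2 = (mean(x) - mean(y)) / sqrt(var(x)/n + var(y)/m);
[h2, p2Tool, ~, stats2] = ttest2(x, y, 'Vartype', 'unequal');

fprintf('one-sample: t = %.4f (ttest: %.4f), p = %.4g, reject H0: %d\n', ...
        t1, stats1.tstat, p1Tool, h1);
fprintf('two-sample: t = %.4f (ttest2: %.4f), p = %.4g, reject H0: %d\n', ...
        t2, stats2.tstat, p2Tool, h2);
```

Leaving out the 'Vartype' option makes ttest2 pool the two sample variances, which generally gives a different statistic and a different number of degrees of freedom.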
