Laboratory in Oceanography: Data and Methods

Intro to the Statistics Toolbox

MAR550, Spring 2020 Miles A. Sundermeyer

Sundermeyer MAR 550 Spring 2020 Intro to Statistics Toolbox

Statistics Toolbox/Descriptive Statistics
Measures of Central Tendency

Function Name   Description
geomean         Geometric mean
harmmean        Harmonic mean
mean            Arithmetic average
median          50th percentile
mode            Most frequent value
trimmean        Trimmed mean (specify percentile)

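• Geometric Mean:

$$ \left[ \prod_{i=1}^{n} a_i \right]^{1/n} = \sqrt[n]{a_1 a_2 \cdots a_n} = \exp\left[ \frac{1}{n} \sum_{i=1}^{n} \ln a_i \right] $$

• Harmonic Mean:

$$ \frac{n}{\sum_{i=1}^{n} \frac{1}{a_i}} = \frac{n}{\frac{1}{a_1} + \frac{1}{a_2} + \cdots + \frac{1}{a_n}}, \quad a_i > 0 \text{ for all } i $$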
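The two formulas above can be checked numerically with a few lines of base MATLAB. This is a minimal sketch; the data vector `a` is made up for illustration. The toolbox functions geomean(a) and harmmean(a) should return the same values.

```matlab
% Geometric and harmonic means computed directly from their definitions.
% The data vector a is made up for illustration; all values must be > 0.
a = [1 2 4 8];
n = numel(a);

gm_prod = prod(a)^(1/n);          % n-th root of the product
gm_log  = exp(mean(log(a)));      % equivalent log form, safer against overflow
hm      = n / sum(1 ./ a);        % n divided by the sum of reciprocals

fprintf('geometric mean = %.4f (product form), %.4f (log form)\n', gm_prod, gm_log);
fprintf('harmonic mean  = %.4f\n', hm);
```

The log form is usually preferable for long records, because the running product can overflow or underflow.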

Statistics Toolbox/Descriptive Statistics
Measures of Dispersion (or Spread)

Function Name   Description
iqr             Interquartile range
mad             Mean absolute deviation
moment          Central moment of all orders
range           Range
std             Standard deviation
var             Variance

• Interquartile range: difference between the 75th and 25th percentiles
• Mean absolute deviation: mean(abs(x-mean(x)))
• Moment: mean((x-mean(x)).^order) (e.g., order=2 gives the variance normalized by n)
• Skewness: third central moment of x, divided by the cube of its standard deviation (positive/negative skewness implies a longer right/left tail)
• Kurtosis: fourth central moment of x, divided by the 4th power of its standard deviation (high kurtosis implies a sharper peak and longer/fatter tails)
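As a rough illustration of these definitions, the sketch below computes the central moments, skewness, and kurtosis by hand in base MATLAB. The sample `x` is made up, with one large value so that the right tail is longer. With their default settings, the toolbox functions mad(x), moment(x,order), skewness(x), and kurtosis(x) should give the same numbers.

```matlab
% Dispersion and shape measures computed from their definitions (base MATLAB).
% x is a made-up sample with one large value, so the right tail is longer.
x = [2 3 3 4 4 4 5 5 6 14];

mu   = mean(x);
madv = mean(abs(x - mu));            % mean absolute deviation
m2   = mean((x - mu).^2);            % 2nd central moment (variance normalized by n)
m3   = mean((x - mu).^3);            % 3rd central moment
m4   = mean((x - mu).^4);            % 4th central moment

skew = m3 / m2^(3/2);                % positive here: longer right tail
kurt = m4 / m2^2;                    % equals 3 for a normal distribution

fprintf('mad = %.2f, variance (1/n) = %.2f, skewness = %.2f, kurtosis = %.2f\n', ...
        madv, m2, skew, kurt);
```

Note that var(x) normalizes by n-1 by default, so it differs slightly from the second central moment; var(x,1) matches it.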

Statistics Toolbox/Descriptive Statistics

Examples of Skewness & Kurtosis:

Gaussian (normal) Distribution

Statistics Toolbox/Descriptive Statistics

Bootstrap Method
• Involves choosing random samples with replacement from a data set and analyzing each sample data set the same way as the original data set. The number of elements in each bootstrap sample set equals the number of elements in the original data set. The range of sample estimates obtained provides a means of estimating the uncertainty of the quantity being estimated.
• In general, the bootstrap method can be used to compute uncertainty for any functional calculation, provided the sample data set is ‘representative’ of the true distribution.
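The resampling procedure described above can be sketched in a few lines of base MATLAB. The data, the statistic (the median), the number of bootstrap samples, and the percentile form of the confidence limits are all illustrative assumptions, not the method of any particular paper. The toolbox functions bootstrp and bootci wrap the same idea.

```matlab
% Bootstrap sketch: uncertainty of the sample median (base MATLAB only).
% The data, the statistic, and nboot are illustrative assumptions.
rng(1);                                  % fix the seed so the run is repeatable
x     = 10 + 2*randn(50,1);              % stand-in "observed" data set
n     = numel(x);
nboot = 2000;                            % number of bootstrap samples
est   = zeros(nboot,1);

for k = 1:nboot
    idx    = randi(n, n, 1);             % n indices drawn with replacement
    est(k) = median(x(idx));             % same analysis as for the original data
end

est = sort(est);
ci  = est([round(0.025*nboot), round(0.975*nboot)]);   % 95% percentile limits
fprintf('median = %.2f, 95%% bootstrap limits = [%.2f, %.2f]\n', median(x), ci(1), ci(2));
```

Each bootstrap sample has the same number of elements as the original data set, as the slide requires; only the statistic inside the loop changes from one application to another.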

Jackknife Method
• The jackknife is similar to the bootstrap, but uses resampling that leaves out one observation at a time to estimate the bias and variance of sample statistics.
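A minimal leave-one-out sketch in base MATLAB is given below. The data and the choice of statistic (the variance normalized by n, which is biased) are assumptions made for illustration. For this particular statistic the jackknife bias correction recovers the usual n-1 normalized variance, which gives a convenient check. The toolbox function jackknife returns the leave-one-out estimates directly.

```matlab
% Jackknife sketch: leave one observation out at a time (base MATLAB only).
% The data and the choice of statistic are illustrative assumptions.
x     = [4.1 5.3 3.8 6.0 5.1 4.7 5.9 4.4];
n     = numel(x);
stat  = @(v) var(v, 1);                  % variance normalized by n (biased)
theta = stat(x);                         % estimate from the full sample

theta_i = zeros(n,1);
for i = 1:n
    theta_i(i) = stat(x([1:i-1, i+1:n]));    % estimate with x(i) removed
end

bias_jack  = (n-1) * (mean(theta_i) - theta);              % jackknife bias estimate
var_jack   = (n-1)/n * sum((theta_i - mean(theta_i)).^2);  % jackknife variance of the statistic
theta_corr = theta - bias_jack;                            % bias-corrected estimate

fprintf('biased = %.4f, corrected = %.4f, var(x) = %.4f\n', theta, theta_corr, var(x));
```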

Statistics Toolbox/Descriptive Statistics
Example: Bootstrap Method for estimating uncertainty on Lagrangian Integral Time Scale (from Sundermeyer and Price, 1998)

“Integrating the LACFs using 100 days as the upper limit of the integral of Rii(t) in (12) gives the integral timescales I(11,22) = (10.6 ± 4.8, 5.4 ± 2.8) days for the (zonal, meridional) components, where uncertainties represent 95% confidence limits estimated using a bootstrap method [e.g., Press et al., 1986].”

Statistics Toolbox/Statistical Visualization

Probability Distribution Plots
• Normal Probability Plots:

>> x = normrnd(10,1,25,1);
>> normplot(x)

>> x = exprnd(10,100,1);
>> normplot(x)

Statistics Toolbox/Statistical Visualization

Probability Distribution Plots
• Quantile-Quantile Plots:

>> x = poissrnd(10,50,1);
>> y = poissrnd(5,100,1);
>> qqplot(x,y);

>> x = normrnd(5,1,100,1);
>> y = wblrnd(2,0.5,100,1);
>> qqplot(x,y);

Statistics Toolbox/Statistical Visualization

Probability Distribution Plots
• Cumulative Distribution Plots:

>> y = evrnd(0,3,100,1);
>> cdfplot(y)
>> hold on
>> x = -20:0.1:10;
>> f = evcdf(x,0,3);
>> plot(x,f,'m')
>> legend('Empirical', ...
          'Theoretical', ...
          'Location','NW')

Statistics Toolbox/Probability Distributions/Supported Distributions

Supported distributions include a wide range of:
• Continuous distributions (data)
• Continuous distributions (statistics)
• Discrete distributions
• Multivariate distributions

Function Name   Description
pdf             Probability density functions
cdf             Cumulative distribution functions
inv             Inverse cumulative distribution functions
stat            Distribution statistics functions
fit             Distribution fitting functions
like            Negative log-likelihood functions
rnd             Random number generators

https://www.mathworks.com/help/stats/supported-distributions.html
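The suffix convention in the table can be illustrated with the normal distribution, whose functions carry the prefix norm. This is a short sketch that requires the Statistics Toolbox; the values of `mu` and `sigma` are arbitrary.

```matlab
% Function-name suffixes illustrated with the normal distribution.
% Requires the Statistics Toolbox; mu and sigma are arbitrary example values.
mu = 10;  sigma = 2;

f      = normpdf(mu, mu, sigma);         % pdf:  density at the mean, 1/(sigma*sqrt(2*pi))
P      = normcdf(mu + sigma, mu, sigma); % cdf:  P(X <= mu + sigma), about 0.8413
xq     = norminv(P, mu, sigma);          % inv:  inverse cdf, recovers mu + sigma
[m, v] = normstat(mu, sigma);            % stat: mean and variance, here 10 and 4
r      = normrnd(mu, sigma, 1000, 1);    % rnd:  1000 random draws
[muhat, sigmahat] = normfit(r);          % fit:  parameter estimates from the draws

fprintf('f = %.4f, P = %.4f, xq = %.2f, muhat = %.2f, sigmahat = %.2f\n', ...
        f, P, xq, muhat, sigmahat);
```

Other distributions follow the same pattern with a different prefix, for example raylpdf, raylcdf, and raylrnd for the Rayleigh distribution.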

Statistics Toolbox/Probability Distributions/Supported Distributions
Supported distributions (cont’d)

Name               pdf            cdf            inv             stat       fit                      like       rnd
...
Normal (Gaussian)  normpdf, pdf   normcdf, cdf   norminv, icdf   normstat   normfit, mle, dfittool   normlike   normrnd, randn, random, randtool
Pearson system     -              -              -               -          pearsrnd                 -          pearsrnd
Piecewise          pdf            cdf            icdf            -          paretotails              -          random
Rayleigh           raylpdf, pdf   raylcdf, cdf   raylinv, icdf   raylstat   raylfit, mle, dfittool   -          raylrnd, random, randtool
...

Statistics Toolbox/Probability Distributions/Supported Distributions
Supported statistics

Name                    pdf            cdf            inv             stat       fit        like   rnd
Chi-square              chi2pdf, pdf   chi2cdf, cdf   chi2inv, icdf   chi2stat   -          -      chi2rnd, random, randtool
F                       fpdf, pdf      fcdf, cdf      finv, icdf      fstat      -          -      frnd, random, randtool
Noncentral chi-square   ncx2pdf, pdf   ncx2cdf, cdf   ncx2inv, icdf   ncx2stat   -          -      ncx2rnd, random, randtool
Noncentral F            ncfpdf, pdf    ncfcdf, cdf    ncfinv, icdf    ncfstat    -          -      ncfrnd, random, randtool
Noncentral t            nctpdf, pdf    nctcdf, cdf    nctinv, icdf    nctstat    -          -      nctrnd, random, randtool
Student's t             tpdf, pdf      tcdf, cdf      tinv, icdf      tstat      -          -      trnd, random, randtool
t location-scale        -              -              -               -          dfittool   -      -

Statistics Toolbox/Hypothesis Tests
Hypothesis Testing
• Can only disprove a hypothesis
• null hypothesis – an assertion about a population. It is "null" in that it represents a status quo belief, such as the absence of a characteristic or the lack of an effect.
• alternative hypothesis – a contrasting assertion about the population that can be tested against the null hypothesis
    H1: µ ≠ null hypothesis value (two-tailed test)
    H1: µ > null hypothesis value (right-tail test)
    H1: µ < null hypothesis value (left-tail test)
• test – a random sample of the population is collected, and a test statistic is computed to characterize the sample. The statistic varies with the type of test, but its distribution under the null hypothesis must be known (or assumed).
• p-value – probability, under the null hypothesis, of obtaining a value of the test statistic as extreme or more extreme than the value computed from the sample.
• significance level – threshold of probability α; a typical value of α is 0.05. If p-value < α, the test rejects the null hypothesis; if p-value > α, there is insufficient evidence to reject the null hypothesis.
• confidence interval – estimated range of values with a specified probability of containing the true population value of a parameter.

Statistics Toolbox/Hypothesis Tests

Hypothesis Testing
• Hypothesis tests make assumptions about the distribution of the random variable being sampled in the data. These must be considered when choosing a test and when interpreting the results.
• The z-test (ztest) and the t-test (ttest) test whether a sample mean is significantly different from a given value. Both tests assume the data are independently sampled from a normal distribution.
• Both the z-test and the t-test are relatively robust with respect to departures from this assumption, so long as the sample size n is large enough.
• The difference between the z-test and the t-test is in the assumption about the standard deviation σ of the underlying normal distribution. A z-test assumes that σ is known; a t-test does not. Thus the t-test must estimate the standard deviation s from the sample.

Statistics Toolbox/Hypothesis Tests
ztest
• The test requires σ (the standard deviation of the population) to be known
• The formula for calculating the z score for the z-test is:

$$ z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} $$

http://www.stats4students.com/Essentials/Standard-Score/Overview.php

where:
x̄ is the sample mean
μ is the mean of the population
σ is the known standard deviation of the population
n is the sample size

• The z-score is compared to a z-table, which contains the percent of area under the normal curve between the mean and the z-score. This table will indicate whether the calculated z-score is within the realm of chance, or if it is so different from the mean that the sample mean is unlikely to have happened by chance.
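Instead of a printed z-table, the area under the standard normal curve can be computed directly. The sketch below does the z-test by hand in base MATLAB; the sample, the null mean `mu0`, and the "known" `sigma` are made-up values. With the toolbox, [h,p] = ztest(x,mu0,sigma) should give the same two-tailed p-value.

```matlab
% z-test by hand (base MATLAB only). The sample, mu0, and sigma are made-up values;
% sigma is treated as the known population standard deviation.
x     = [10.2 9.7 10.9 10.4 11.1 9.9 10.6 10.8 10.3];
mu0   = 10;                              % mean under the null hypothesis
sigma = 0.5;                             % assumed known
n     = numel(x);

z = (mean(x) - mu0) / (sigma / sqrt(n));
p = erfc(abs(z) / sqrt(2));              % two-tailed p-value from the standard normal
fprintf('z = %.3f, two-tailed p = %.4f\n', z, p);
```

Here z is 2.6, so the two-tailed p-value falls below 0.05 and the null hypothesis would be rejected at that significance level.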

Statistics Toolbox/Hypothesis Tests
ttest
• Like the z-test, except the t-test does not require σ to be known
• The formula for calculating the t score for the t-test is:

$$ t = \frac{\bar{x} - \mu}{s / \sqrt{n}} $$

http://www.stats4students.com/Essentials/Standard-Score/Overview.php

where:
x̄ is the sample mean
μ is the mean of the population
s is the sample standard deviation
n is the sample size

• Under the null hypothesis that the population is normally distributed with mean μ, the z-statistic has a standard normal distribution, N(0,1). Under the same null hypothesis, the t-statistic has Student's t distribution with n – 1 degrees of freedom.
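The t score can be computed by hand and compared with the toolbox result, as in the sketch below. The sample and the null mean `mu0` are made up; ttest requires the Statistics Toolbox.

```matlab
% One-sample t-test: hand computation compared with ttest (Statistics Toolbox).
% The sample and mu0 are made-up values.
x   = [10.2 9.7 10.9 10.4 11.1 9.9 10.6 10.8 10.3];
mu0 = 10;
n   = numel(x);

s = std(x);                              % sample standard deviation (n-1 normalization)
t = (mean(x) - mu0) / (s / sqrt(n));     % t statistic with n-1 degrees of freedom

[h, p, ci, stats] = ttest(x, mu0);       % two-tailed test at the default alpha = 0.05
fprintf('t (by hand) = %.4f, t (ttest) = %.4f, df = %d, p = %.4f\n', ...
        t, stats.tstat, stats.df, p);
```

The output h is 1 when the test rejects the null hypothesis at the chosen significance level, and ci is the confidence interval for the population mean.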

Statistics Toolbox/Hypothesis Tests
ttest2
• Performs a t-test of the null hypothesis that data in the vectors x and y are independent random samples from normal distributions with equal means and unknown variances; the unknown variances may be assumed either equal or unequal.
• The formula for calculating the t score for ttest2, written with separate sample variances, is:

$$ t = \frac{\bar{x} - \bar{y}}{\sqrt{\dfrac{s_x^2}{n} + \dfrac{s_y^2}{m}}} $$

where:
x̄, ȳ are the sample means
sx, sy are the sample standard deviations (sx², sy² are the sample variances)
n, m are the sample sizes

http://www.socialresearchmethods.net/kb/stat_t.php

• The null hypothesis is that the two samples are distributed with the same mean.
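The two-sample statistic with separate sample variances can be checked against ttest2 as sketched below. Both samples are made up. By default ttest2 pools the variances (the equal-variance case), so the 'Vartype','unequal' option is passed here to match the formula with separate variances.

```matlab
% Two-sample t statistic with separate variances, compared with ttest2.
% Both samples are made up; 'Vartype','unequal' selects the unequal-variance form.
x = [12.1 11.4 13.0 12.7 11.9 12.5];
y = [10.8 11.6 10.1 11.9 10.5 11.2 10.9 11.4];
n = numel(x);  m = numel(y);

t = (mean(x) - mean(y)) / sqrt(var(x)/n + var(y)/m);

[h, p, ci, stats] = ttest2(x, y, 'Vartype', 'unequal');
fprintf('t (by hand) = %.4f, t (ttest2) = %.4f, p = %.4f\n', t, stats.tstat, p);
```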

Statistics Toolbox/Hypothesis Tests

Function        Description
ansaribradley   Ansari-Bradley test. Tests if two independent samples come from the same distribution, against the alternative that they come from distributions that have the same median and shape but different variances.
chi2gof         Chi-square goodness-of-fit test. Tests if a sample comes from a specified distribution, against the alternative that it does not come from that distribution.
dwtest          Durbin-Watson test. Tests if the residuals from a linear regression are independent, against the alternative that there is autocorrelation among them.
jbtest          Jarque-Bera test. Tests if a sample comes from a normal distribution with unknown mean and variance, against the alternative that it does not come from a normal distribution.
linhyptest      Linear hypothesis test. Tests if H*b = c for parameter estimates b with estimated covariance, given specified H and c, against the alternative that H*b ≠ c.
kstest          One-sample Kolmogorov-Smirnov test. Tests if a sample comes from a continuous distribution with specified parameters, against the alternative that it does not come from that distribution.
kstest2         Two-sample Kolmogorov-Smirnov test. Tests if two samples come from the same continuous distribution, against the alternative that they do not come from the same distribution.
lillietest      Lilliefors test. Tests if a sample comes from a distribution in the normal family, against the alternative that it does not come from a normal distribution.
ranksum         Wilcoxon rank sum test. Tests if two independent samples come from identical continuous distributions with equal medians, against the alternative that they do not have equal medians.
runstest        Runs test. Tests if a sequence of values comes in random order, against the alternative that the ordering is not random.
signrank        One-sample or paired-sample Wilcoxon signed rank test. Tests if a sample comes from a continuous distribution symmetric about a specified median, against the alternative that it does not have that median.
signtest        One-sample or paired-sample sign test. Tests if a sample comes from an arbitrary continuous distribution with a specified median, against the alternative that it does not have that median.
ttest           One-sample or paired-sample t-test. Tests if a sample comes from a normal distribution with unknown variance and a specified mean, against the alternative that it does not have that mean.
ttest2          Two-sample t-test. Tests if two independent samples come from normal distributions with unknown but equal (or, optionally, unequal) variances and the same mean, against the alternative that the means are unequal.
vartest         One-sample chi-square variance test. Tests if a sample comes from a normal distribution with specified variance, against the alternative that it comes from a normal distribution with a different variance.
vartest2        Two-sample F-test for equal variances. Tests if two independent samples come from normal distributions with the same variance, against the alternative that they come from normal distributions with different variances.
vartestn        Bartlett multiple-sample test for equal variances. Tests if multiple samples come from normal distributions with the same variance, against the alternative that they come from normal distributions with different variances.
ztest           One-sample z-test. Tests if a sample comes from a normal distribution with known variance and specified mean, against the alternative that it does not have that mean.

Statistics Toolbox/Analysis of Variance
ANOVA (ANalysis Of VAriance)
• ANOVA is like a t-test among multiple (typically >2) data sets simultaneously
• t-tests can be done between two data sets, or between one set and a “true” value
• ANOVA uses the F-distribution instead of the t-distribution
• ANOVA assumes that all of the data sets have equal variances

One-way ANOVA is a simple special case of the linear model. The one-way ANOVA form of the model is

$$ y_{ij} = \alpha_{.j} + \varepsilon_{ij} $$

where:

• yij is a matrix of observations; each column represents a different group.

• α.j is a matrix whose columns are the group means. (The "dot j" notation means that α applies to all rows of column j. That is, the value αij is the same for all i.)

• εij is a matrix of random disturbances.

The model assumes that the columns of y are a constant (i.e., a mean) plus a random disturbance. ANOVA tests if the constants are all the same.

Statistics Toolbox/Analysis of Variance
One-way ANOVA
Example: Hogg and Ledolter bacteria counts in milk. Columns represent different shipments, rows are bacteria counts from cartons chosen randomly from each shipment. Do some shipments have higher counts than others?

>> load hogg
>> hogg
hogg =
    24    14    11     7    19
    15     7     9     7    24
    21    12     7     4    19
    27    17    13     7    15
    33    14    12    12    10
    23    16    18    18    20

>> [p,tbl,stats] = anova1(hogg);
>> p
p =
   1.1971e-04

• The standard ANOVA table has columns for the sums of squares (SS), degrees of freedom (df), mean squares (SS/df), F statistic, and p-value.

• The p-value is from the F statistic of the hypothesis test of whether the bacteria counts are the same across shipments.

Statistics Toolbox/Analysis of Variance
One-way ANOVA (cont’d)
• In this case the p-value is about 0.0001, a very small value. This is a strong indication that the bacteria counts from the different shipments are not the same. An F statistic as extreme as this would occur by chance only about once in 10,000 times if the counts were truly equal.
• The p-value returned by anova1 depends on assumptions about the random disturbances εij in the model equation. For the p-value to be correct, these disturbances need to be independent, normally distributed, and have constant variance.
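To see where the entries of the ANOVA table come from, the sketch below recomputes the sums of squares, the F statistic, and the p-value by hand for the hogg data from the example. The variable names are illustrative; fcdf requires the Statistics Toolbox. The p-value should agree with the one returned by anova1.

```matlab
% One-way ANOVA F statistic computed by hand for the hogg data from the example.
% fcdf requires the Statistics Toolbox.
hogg = [24 14 11  7 19
        15  7  9  7 24
        21 12  7  4 19
        27 17 13  7 15
        33 14 12 12 10
        23 16 18 18 20];
[n, k] = size(hogg);                     % n cartons per shipment, k shipments

colmean = mean(hogg, 1);                 % mean count of each shipment
grand   = mean(hogg(:));                 % overall mean
SSB = n * sum((colmean - grand).^2);                  % between-shipment sum of squares
SSW = sum(sum((hogg - repmat(colmean, n, 1)).^2));    % within-shipment sum of squares

F = (SSB/(k-1)) / (SSW/(k*(n-1)));       % ratio of mean squares
p = 1 - fcdf(F, k-1, k*(n-1));           % upper-tail probability of F
fprintf('SSB = %.2f, SSW = %.2f, F = %.2f, p = %.4e\n', SSB, SSW, F, p);
```

The degrees of freedom are k-1 = 4 between shipments and k(n-1) = 25 within shipments.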

Statistics Toolbox/Analysis of Variance
Multiple Comparisons
• Sometimes we need to determine not just whether there are differences among means, but which pairs of means are significantly different.

• In a t-test, we compute a t-statistic and compare it to a critical value. When testing multiple pairs, however, the chance of a false rejection grows: if the probability of the t-statistic exceeding the critical value by chance is 5% for one pair, then for 10 pairs it is much more likely that at least one of them will exceed it by chance.

• Can perform a multiple comparison test using the multcompare function by supplying it with the stats output from anova1.

Example:
>> load hogg
>> [p,tbl,stats] = anova1(hogg);
>> [c,m] = multcompare(stats)

Example: see Light_DO.m
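The reason a multiple comparison procedure is needed can be made concrete with a little arithmetic. The sketch below assumes, for simplicity, that the pairwise tests are independent, which is not exactly true for pairs drawn from the same groups; it is meant only to show the size of the effect. Ten pairs corresponds to all pairings of five groups.

```matlab
% Chance of at least one false rejection across several tests (base MATLAB).
% Assumes the tests are independent, which pairwise comparisons are not exactly.
alpha  = 0.05;                           % per-test significance level
npairs = nchoosek(5, 2);                 % 10 pairwise comparisons among 5 groups
p_any  = 1 - (1 - alpha)^npairs;         % probability of one or more false rejections
fprintf('P(at least one false rejection in %d tests) = %.3f\n', npairs, p_any);
```

Under that assumption the chance of at least one false rejection is about 40%, rather than the nominal 5%.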

Statistics Toolbox/Analysis of Variance
Two-way ANOVA
Determine whether data from several groups have a common mean. Differs from one-way ANOVA in that the groups in two-way ANOVA have two categories of defining characteristics instead of one (e.g., think of two independent variables/dimensions).

Two-way ANOVA is again a special case of the linear model. The two-way ANOVA form of the model is

$$ y_{ijk} = \mu + \alpha_{.j} + \beta_{i.} + (\alpha\beta)_{ij} + \varepsilon_{ijk} $$

where:

• yijk is a matrix of observations (with rows i, columns j, and repetition k).

• μ is a constant matrix of the overall mean of the observations.

• α.j is a matrix whose columns are the deviations of each observation attributable to the first independent variable. All values in a given column of α.j are identical, and values in each row of α.j sum to 0.

• βi. is a matrix whose rows are the deviations of each observation attributable to the second independent variable. All values in a given row of βi. are identical, and values in each column of βi. sum to 0.

• (αβ)ij is a matrix of interactions. Values in each row sum to 0, and values in each column sum to 0.

• εijk is a matrix of random disturbances.

The model assumes that the columns of y are a series of constants plus a random disturbance. You want to know if the constants are all the same.

Statistics Toolbox/Analysis of Variance
Two-way ANOVA
Example: Determine the effect of car model and factory on the mileage rating of cars. There are three models (columns) and two factories (rows). Data from the first factory are in the first three rows, data from the second factory are in the last three rows. Do some cars have different mileage than others?

>> load mileage
>> mileage
mileage =
   33.3000   34.5000   37.4000
   33.4000   34.8000   36.8000
   32.9000   33.8000   37.6000
   32.6000   33.4000   36.6000
   32.5000   33.7000   37.0000
   33.0000   33.9000   36.7000

>> cars = 3;   % number of cars (replicates) for each model/factory combination
>> [p,tbl,stats] = anova2(mileage,cars);

Statistics Toolbox/Analysis of Variance
Two-way ANOVA (cont’d)
• In this case the p-value for the first effect is zero to four decimal places. This indicates that the effect of the first predictor varies from one sample to another. An F-statistic as extreme as this would occur by chance less than once in 10,000 times if the samples were truly equal.

• The p-value for the second effect is 0.0039, which is also highly significant. This indicates that the effect of the second predictor varies from one sample to another.

• There does not appear to be any interaction between the two predictors. The p-value, 0.8411, means that the observed result is quite likely (84 out of 100 times) given that there is no interaction.

• The p-values returned by anova2 depend on assumptions about the random disturbances εijk in the model equation. For the p-values to be correct, these disturbances need to be independent, normally distributed, and have constant variance.

• In addition, anova2 requires that data be balanced, which means there must be the same number of samples for each combination of control variables. Other ANOVA methods support unbalanced data with any number of predictors.

Statistics Toolbox/Regression Analysis

Linear Regression Models
• In statistics, linear regression models take the form of a summation of terms, each a coefficient multiplied by an independent variable or a combination of independent variables.

For example:

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2 + \varepsilon $$

• In this example, the response variable y is modeled as a combination of constant, linear, interaction, and quadratic terms formed from two predictor variables x1 and x2.

• Uncontrolled factors and experimental errors are modeled by ε. Given data on x1, x2, and y, regression estimates the model parameters βj (j = 0, ..., 5).

• More general linear regression models represent the relationship between a continuous response y and a continuous or categorical predictor x in the form:

$$ y = \beta_1 f_1(x) + \cdots + \beta_p f_p(x) + \varepsilon $$
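The model is linear in the coefficients even though it contains products and squares of the predictors, so it can be fitted by ordinary least squares once the terms are arranged as columns of a design matrix. The sketch below uses base MATLAB with synthetic data; the "true" coefficients and the noise level are made up. With the toolbox, regress(y,X) should return the same estimates.

```matlab
% Least-squares fit of the two-predictor quadratic model (base MATLAB only).
% The predictors, the "true" coefficients, and the noise level are made up.
rng(2);
n  = 200;
x1 = rand(n,1);  x2 = rand(n,1);
beta_true = [1; 2; -3; 0.5; 4; -1];      % coefficients for j = 0, ..., 5

% Design matrix: constant, linear, interaction, and quadratic terms
X = [ones(n,1), x1, x2, x1.*x2, x1.^2, x2.^2];
y = X*beta_true + 0.05*randn(n,1);       % response with random error

beta_hat = X \ y;                        % least-squares estimate of the coefficients
disp([beta_true, beta_hat]);             % true and estimated values side by side
```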

Statistics Toolbox/Regression Analysis

Example (system of equations):

Suppose we have a series of measurements of stream discharge and stage, measured at n different times.

time (day)       = [0 14 28 42 56 70]
stage (m)        = [0.612 0.647 0.580 0.629 0.688 0.583]
discharge (m3/s) = [0.330 0.395 0.241 0.338 0.531 0.279]

Suppose we now wish to fit a rating curve to these measurements. Let x = stage, y = discharge, then we can write this series of measurements as:

$$ y_i = m x_i + b, \quad i = 1, \ldots, n $$

This in turn can be written as: y = X b, or:

$$ \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ x_3 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{bmatrix} \begin{bmatrix} m \\ b \end{bmatrix} $$

with dimensions [n×1] = [n×2] [2×1].
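A minimal sketch of this fit in base MATLAB, using the stage and discharge values from the example, is given below. The backslash operator solves the overdetermined system y = X b in the least-squares sense; the variable names are illustrative.

```matlab
% Straight-line rating curve fitted to the stage/discharge values from the example.
stage     = [0.612 0.647 0.580 0.629 0.688 0.583]';   % x, in m
discharge = [0.330 0.395 0.241 0.338 0.531 0.279]';   % y, in m^3/s

X = [stage, ones(size(stage))];          % [n x 2] design matrix
b = X \ discharge;                       % [2 x 1] least-squares solution [m; b]
m  = b(1);                               % slope
b0 = b(2);                               % intercept

resid = discharge - X*b;                 % misfit of each measurement
fprintf('slope m = %.3f m^2/s, intercept b = %.3f m^3/s\n', m, b0);
```

At the least-squares solution the residuals are orthogonal to the columns of X, which is a quick way to check the fit.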

Statistics Toolbox/Regression Analysis
Example: Harmonic Analysis:
• sin(θ+φ) = sin(θ)cos(φ) + sin(φ)cos(θ)

• Let: A = C cos(φ), B = C sin(φ) => C sin(ωt+φ) = A sin(ωt) + B cos(ωt)

• Assume: u = ū + A1 sin(ωM2 t) + B1 cos(ωM2 t) + …, where ū is a constant mean term

• Linear regression: y = Xb

$$ \underbrace{\begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix}}_{y} = \underbrace{\begin{bmatrix} \sin(\omega_{M2} t_1) & \cos(\omega_{M2} t_1) & \cdots & 1 \\ \sin(\omega_{M2} t_2) & \cos(\omega_{M2} t_2) & \cdots & 1 \\ \vdots & \vdots & & \vdots \\ \sin(\omega_{M2} t_n) & \cos(\omega_{M2} t_n) & \cdots & 1 \end{bmatrix}}_{X} \underbrace{\begin{bmatrix} A_1 \\ B_1 \\ \vdots \\ \bar{u} \end{bmatrix}}_{b} $$
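The harmonic fit can be tried on a synthetic record, as sketched below in base MATLAB. The amplitude, phase, mean, noise level, and sampling are made up; only a single M2 constituent (period 12.42 hours) is fitted, and the variable names are illustrative.

```matlab
% Harmonic fit of a single M2 constituent to a synthetic record (base MATLAB).
% Amplitude, phase, mean, and noise level are made up; the M2 period is 12.42 h.
rng(3);
t     = (0:0.5:72)';                     % time in hours
omega = 2*pi/12.42;                      % M2 angular frequency, rad/hour
C = 40;  phi = 0.6;  umean = 5;          % "true" amplitude, phase, and mean
u = umean + C*sin(omega*t + phi) + 2*randn(size(t));

X = [sin(omega*t), cos(omega*t), ones(size(t))];   % columns for A, B, and the mean
b = X \ u;                                          % least-squares solution [A; B; mean]

C_hat   = hypot(b(1), b(2));             % amplitude, since A = C*cos(phi), B = C*sin(phi)
phi_hat = atan2(b(2), b(1));             % phase
fprintf('C = %.2f, phi = %.3f, mean = %.2f\n', C_hat, phi_hat, b(3));
```

Additional constituents such as M4 or M6 would each add one sine column and one cosine column to X at the corresponding frequency.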

Statistics Toolbox/Regression Analysis
Example: Harmonic analysis (cont’d)
Southampton Surface Currents: Harmonic analysis for M2, M4=2xM2, M6=3xM2 ...

[Figure: speed (cm s-1) versus time (hours), and PSD versus frequency (cycles day-1)]

Note: Tidal harmonics can cause the tidal cycle to appear asymmetric.

www.soes.soton.ac.uk/teaching/courses/oa311/tides_3.ppt

Statistics Toolbox/Regression Analysis

Generalized linear models (GLM; glmfit) are a flexible generalization of ordinary regression. They relate the random distribution of the measured variable of the experiment (the distribution function) to the systematic (non-random) portion of the experiment (the linear predictor) through a function called the link function.

Generalized additive models (GAMs; user add-ons only in Matlab) are another extension to GLMs in which the linear predictor η is not restricted to be linear in the covariates X but is an additive function of the xi:

$$ \eta = \beta_0 + f_1(x_1) + f_2(x_2) + \cdots $$

The smooth functions fi are estimated from the data. In general this requires a large number of data points and is computationally intensive.
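As a small illustration of a generalized linear model, the sketch below fits a Poisson regression with glmfit (Statistics Toolbox). All data are synthetic and the "true" coefficients are made up. For the 'poisson' distribution glmfit uses the log link by default, so the linear predictor is related to the mean by log(μ) = b0 + b1 x.

```matlab
% Poisson regression with glmfit (Statistics Toolbox); all data are synthetic.
% For 'poisson', glmfit uses the log link by default: log(mu) = b0 + b1*x.
rng(4);
x  = linspace(0, 2, 200)';
mu = exp(0.5 + 1.2*x);                   % "true" mean response
y  = poissrnd(mu);                       % counts drawn from a Poisson distribution

b    = glmfit(x, y, 'poisson');          % returns [b0; b1]; a constant term is added automatically
yhat = glmval(b, x, 'log');              % fitted means through the inverse of the log link
disp(b');
```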

Data Handling
Matlab Useful Tidbits …

Useful Tidbits
• regress - performs multiple linear regression using least squares.
• nlinfit - performs nonlinear least-squares regression.
• glmfit - fits a generalized linear model.