
Basic Statistical Concepts

Statistical Population

• The entire underlying set of observations from which samples are drawn.
  – Philosophical meaning: all observations that could ever be taken
    • e.g. all barnacle populations that have ever existed, that exist or that will exist
  – Practical meaning: all observations within a reasonable range of inference
    • e.g. barnacle populations on that stretch of coast

Statistical Sample

• A representative subset of a population.
  – What counts as being representative?
    • Unbiased and hopefully precise

Strategies

• Define survey objectives: what is the goal of the survey? What are your hypotheses?

• Define the population parameters to estimate (e.g. number of individuals, growth, color etc).

• Implement strategy
  – measure every individual (think of the implications in terms of cost, time and practicality, especially if measurement is destructive)
  – measure a representative portion of the population (a sample)

Sampling

• Goal:
  – Every unit and combination of units in the population (of interest) has an equal chance of selection.
    • This is a fundamental assumption in all estimation procedures
• How:
  – Many ways if underlying distribution is not uniform
    » In the absence of information about the underlying distribution, the only safe strategy is random sampling
    » Costs: sometimes difficult, and may introduce problems of its own if sample size is low. Much more about this later
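The random sampling goal above can be sketched in a few lines. This is a minimal illustration, not a field protocol: the sampling frame of 100 numbered quadrats, the function name `simple_random_sample` and the seed are all assumptions made for the example.

```python
import random


def simple_random_sample(units, n, seed=None):
    """Draw n units without replacement.

    Every unit, and every combination of n units, has the same
    chance of being selected.
    """
    rng = random.Random(seed)
    return rng.sample(list(units), n)


# Hypothetical sampling frame: 100 numbered quadrats on a stretch of coast.
quadrats = range(1, 101)
chosen = simple_random_sample(quadrats, n=10, seed=1)
print(sorted(chosen))
```

Fixing the seed only makes the draw repeatable for checking; in practice the seed would be left unset.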

Sampling Objectives

• To obtain an unbiased estimate of a population parameter (e.g. the mean)

• To assess the precision of the estimate (i.e. calculate the standard error of the mean)

• To obtain as precise an estimate of the parameters as possible for time, effort and money spent

Measures of location

• Population mean ($\mu$) - the average value
• Sample mean ($\bar{y}$) estimates $\mu$

• Population median - the middle value
• Sample median estimates population median

• In a normal distribution the mean = median (also the mode); this is not ensured in other distributions

[Figure: two distributions of Y. In the first the mean and median coincide; in the second the median and mean differ.]

Measures of dispersion

• Population (2) - average sum of squared deviations from mean • Measured sample variance (s2) estimates population

variance 2 (xi - x) n -1

(s) – square root of variance – same units as original variable

Measures (statistics) of Dispersion

• Population Sum of Squares: $\sum (x_i - \mu)^2$

• Sample Sum of Squares: $SS = \sum (x_i - \bar{x})^2$

• Population variance: $\sigma^2 = \dfrac{\sum (x_i - \mu)^2}{n}$
  – Note, units are squared
  – Denominator is (n)

• Sample variance: $s^2 = \dfrac{\sum (x_i - \bar{x})^2}{n - 1}$
  – Note, units are squared
  – Denominator is (n-1)

• Sample standard deviation: $s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$
  – Note, units are not squared

More Statistics of Dispersion

• Standard error of the mean: $s_{\bar{x}} = \sqrt{\dfrac{s^2}{n}} = \dfrac{s}{\sqrt{n}}$
  – This is also the standard deviation of the distribution of sample means

• Coefficient of variation: $CV = \dfrac{s}{\bar{x}}$
  – Measurement of variation independent of units
  – Expressed as a percentage of the mean

• Covariance: $s_{xy} = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
  – Measure of how two variables covary
  – Range is between $-\infty$ and $+\infty$
  – Value depends in part on the range of the data: bigger numbers yield bigger values of covariance
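The three statistics above can be sketched as follows. The paired data are invented for the example, and the last lines illustrate the point that covariance depends on the scale of the numbers: multiplying both variables by 10 multiplies the covariance by 100.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def standard_error(values):
    """Standard error of the mean: s / sqrt(n)."""
    return sample_sd(values) / math.sqrt(len(values))


def coefficient_of_variation(values):
    """CV expressed as a percentage of the mean."""
    return 100.0 * sample_sd(values) / mean(values)


def covariance(xs, ys):
    """Sample covariance with an (n - 1) denominator."""
    mx, my = mean(xs), mean(ys)
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (len(xs) - 1)


# Made-up paired observations (not from the slides).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]

print("SEM of x:", standard_error(x))
print("CV of x (%):", coefficient_of_variation(x))
print("cov(x, y):", covariance(x, y))

# Same data in units ten times smaller: the covariance grows 100-fold.
x10 = [10 * v for v in x]
y10 = [10 * v for v in y]
print("cov(10x, 10y):", covariance(x10, y10))
```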

Types of estimates

• Point estimate
  – Single value estimate of the parameter, e.g. $\bar{y}$ is a point estimate of $\mu$, $s$ is a point estimate of $\sigma$
• Interval estimate
  – Range within which the parameter lies, known with some degree of confidence, e.g. a 95% confidence interval is an interval estimate of $\mu$

Sampling distribution

The frequency (or probability) distribution of a statistic (e.g. sample mean):
• Take many samples (size n) from the population
• Calculate all the sample means
• Plot the distribution of sample means (the sampling distribution)

[Figure: population distribution P(y); multiple samples drawn from it yield multiple sample means.]

Sampling distribution of sample means

[Figure: repeated samples drawn from a population with true mean = 25. Individual samples give different means (e.g. 21.5 and 25.8); the sample means shown range from 21.5 to 29.9 and are plotted as a histogram of number of cases against estimate of mean.]

Sampling distribution of mean

• The sampling distribution of the sample mean approaches a normal distribution as n gets larger - Central Limit Theorem.
• The mean of this sampling distribution is $\mu$, the mean of the original population.

[Figure: with a large number of samples, the histogram of sample means (number of cases and proportion per bar against estimate of mean, 15 to 35) approaches a smooth probability distribution of the estimate of the mean.]

Sampling distribution of mean

• The sampling distribution of the sample means approaches a normal distribution as n gets larger - Central Limit Theorem.
• The mean of this sampling distribution is $\mu$, the mean of the original population.
• The standard deviation of this sampling distribution is approximated by $s/\sqrt{n}$, the standard deviation of any given sample divided by the square root of the sample size - the standard error of the mean.

Standard deviation can be calculated for any distribution. The standard deviation of the distribution of sample means can be calculated the same way as for a given sample:

$$ s_{\bar{x}} = \sqrt{\frac{\sum (\bar{x}_i - \bar{\bar{x}})^2}{N - 1}} $$

Where:
1. $\bar{\bar{x}}$ = mean of the means
2. $N$ = number of means used in the distribution

[Figure: probability distribution of the estimate of the mean, with 2.5% of the distribution beyond ~2 $s_{\bar{x}}$ in each tail.]

However, calculating $s_{\bar{x}}$ directly from many sample means would require an immense sampling effort, hence an approximation is used:

$$ s_{\bar{x}} \approx SEM = \frac{s}{\sqrt{n}} $$

Where:
1. $s$ = sample standard deviation
2. $n$ = number of replicates in the sample

[Figure: probability distribution of the estimate of the mean, with 2.5% of the distribution beyond ~2 SEM in each tail.]
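A small simulation can make the approximation concrete. This sketch assumes a normally distributed population with mean 25 (the true mean used in the earlier figure) and a standard deviation of 5, which is an assumption made only for the example; the sample size and number of samples are likewise arbitrary.

```python
import math
import random
import statistics

rng = random.Random(42)

# Assumed population for illustration: normal, mean 25, SD 5.
POP_MEAN, POP_SD = 25.0, 5.0
n = 10            # replicates per sample
n_samples = 5000  # number of repeated samples

samples = [[rng.gauss(POP_MEAN, POP_SD) for _ in range(n)]
           for _ in range(n_samples)]
sample_means = [statistics.mean(s) for s in samples]

# "Immense effort" route: SD of the many sample means.
sd_of_means = statistics.stdev(sample_means)

# Practical route: SEM estimated from one sample only.
sem_from_first_sample = statistics.stdev(samples[0]) / math.sqrt(n)

print("mean of sample means:", statistics.mean(sample_means))
print("SD of sample means:  ", sd_of_means)
print("sigma / sqrt(n):     ", POP_SD / math.sqrt(n))
print("SEM from one sample: ", sem_from_first_sample)
```

The SD of the sample means should sit close to sigma/sqrt(n), while the SEM from any single sample is a noisier estimate of the same quantity.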

Standard error of mean

• the standard deviation of the sampling distribution of the mean, estimated from a single sample as $s/\sqrt{n}$
• measures precision of sample mean
• how close the sample mean is likely to be to the true population mean

• If SE is low:
  – repeated samples would produce similar sample means
  – therefore, any single sample mean is likely to be close to the population mean
• If SE is high:
  – repeated samples would produce very different sample means
  – therefore, any single sample mean may not be close to the population mean

Effect of standard error on estimate of $\mu$ (assume df is large)

[Figure: two panels, 1 SEM = 2 and 1 SEM = 5, each showing probability (0.00 to 0.30) against estimate of mean (0 to 40), with 2.5% of the distribution beyond ~2 SEM in each tail.]

Worked example

Lovett et al. (2000) measured the concentration of SO$_4^{2-}$ (sulfate) in 39 North American forested streams (Quinn & Keough 2002, Box 2.2).

Statistic          Value
Sample mean        61.92
Sample median      62.10
Sample variance    27.46
Sample SD          5.24
SE of mean         0.84

Stream             SO$_4^{2-}$ (mmol L$^{-1}$)
Santa Cruz         50.6
Colgate            55.4
Halsey             56.5
Batavia Hill       57.5

Interval estimate

• How confident are we in a single sample estimate of $\mu$, i.e. how close do we think our sample mean is to the unknown population mean?

• Remember $\mu$ is a fixed, but unknown, value.

• Interval (range of values) within which we are 95% (for example) sure $\mu$ occurs - a confidence interval

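Distribution of sample means

[Figure: distribution of sample means, P($\bar{y}$) against $\bar{y}$, with the central 95% and 99% of the distribution marked.]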

Calculate the proportion of sample means within a range of values.

Transform distribution of means to a distribution with mean = 0 and standard deviation = 1

t statistic

$$ t = \frac{\bar{y} - \mu}{s/\sqrt{n}} $$

[Figure: null distribution of $t = (\bar{y} - \mu)/(s/\sqrt{n})$; probability (0.0 to 0.4) against t (-5 to +5).]

t statistic – interpretation and units
• The deviation between the sample mean and the population mean is expressed in terms of standard error (i.e. standard deviations of the sampling distribution of $\bar{y}$): $t = (\bar{y} - \mu)/(s/\sqrt{n})$
• Hence values of t are in standard errors
• For example, t = 2 indicates that the deviation ($\bar{y} - \mu$) is equal to 2 × the standard error
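The units of t can be seen in a short calculation. The numbers here are hypothetical, chosen so that the standard error is exactly 1 and the deviation of 2 therefore gives t = 2.

```python
import math


def t_statistic(sample_mean, mu, s, n):
    """Deviation of the sample mean from mu, in standard-error units."""
    standard_error = s / math.sqrt(n)
    return (sample_mean - mu) / standard_error


# Hypothetical values: sample mean 27, population mean 25, s = 5, n = 25.
# Standard error = 5 / sqrt(25) = 1, so the deviation of 2 is 2 standard errors.
t = t_statistic(sample_mean=27.0, mu=25.0, s=5.0, n=25)
print("t =", t)
```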

The t statistic

• This t statistic follows a t-distribution, which has a mathematical formula.
• Approximately the same as the normal distribution for n > 30; otherwise flatter and more spread out than the normal distribution.
• Different t distributions for different sample sizes < 30 (actually df, which is n-1).

[Figure: null distributions of $t = (\bar{y} - \mu)/(s/\sqrt{n})$ for N = 30 and N = 3; probability (0.0 to 0.4) against t (-5 to +5).]

Two tailed t-values

Probabilities of $t = (\bar{y} - \mu)/(s/\sqrt{n})$ occurring outside the range $-t_{df}$ to $+t_{df}$

Degrees of     Probability
Freedom        .01      .02      .05      .10      .20
1              63.66    31.82    12.71    6.314    3.078
2              9.925    6.965    4.303    2.920    1.886
3              5.841    4.541    3.182    2.353    1.638
4              4.604    3.747    2.776    2.132    1.533
5              4.032    3.365    2.571    2.015    1.476
10             3.169    2.764    2.228    1.812    1.372
15             2.947    2.602    2.132    1.753    1.341
20             2.845    2.528    2.086    1.725    1.325
25             2.787    2.485    2.060    1.708    1.316
z              2.575    2.326    1.960    1.645    1.282

[Figure: t distribution for 4 df, with 95% of t values between -2.78 and +2.78.]

One and two tailed t-values (df 4)

Degrees of     Probability (1 tailed / 2 tailed)
Freedom        .005/.01  .01/.02  .025/.05  .05/.10  .10/.20
1              63.66     31.82    12.71     6.314    3.078
2              9.925     6.965    4.303     2.920    1.886
3              5.841     4.541    3.182     2.353    1.638
4              4.604     3.747    2.776     2.132    1.533
5              4.032     3.365    2.571     2.015    1.476
10             3.169     2.764    2.228     1.812    1.372
15             2.947     2.602    2.132     1.753    1.341
20             2.845     2.528    2.086     1.725    1.325
25             2.787     2.485    2.060     1.708    1.316
z              2.575     2.326    1.960     1.645    1.282

[Figure: three t distributions for 4 df. 2 tailed: 95% of t values between -2.78 and +2.78. 1 tailed: 95% of t values above -2.132. 1 tailed: 95% of t values below +2.132.]
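Tabled values like these can be reproduced numerically. The sketch below is one possible approach using only the Python standard library: it integrates the t density with Simpson's rule and finds the critical value by bisection. The function names, step count and search bound of 200 are choices made for this example, not part of the slides.

```python
import math


def t_pdf(t, df):
    """Probability density of Student's t distribution."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def central_area(t_crit, df, steps=2000):
    """Area under the t density between -t_crit and +t_crit.

    Uses Simpson's rule on [0, t_crit] (steps must be even) and
    doubles the result, since the density is symmetric about 0.
    """
    if t_crit == 0:
        return 0.0
    h = t_crit / steps
    total = t_pdf(0.0, df) + t_pdf(t_crit, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 2 * total * h / 3


def two_tailed_critical_t(alpha, df):
    """t value leaving a total probability alpha in the two tails."""
    lo, hi = 0.0, 200.0  # assumed search bound, ample for the table above
    for _ in range(60):
        mid = (lo + hi) / 2
        if central_area(mid, df) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


for df in (1, 4, 10, 25):
    print(df, round(two_tailed_critical_t(0.05, df), 3))
```

A one-tailed critical value at probability p is the two-tailed value at 2p, which is why the .05 one-tailed column coincides with the .10 two-tailed column.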

The t statistic

• This t statistic follows a t-distribution, which has a mathematical formula.
• Approximately the same as the normal distribution for n > 30; otherwise flatter and more spread out than the normal distribution.
• Different t distributions for different sample sizes < 30 (actually df, which is n-1).
• The proportion of t values lying between particular t values yields a confidence estimate (the likelihood that the true mean is in the range)

For n = 5 (df = 4), 95% of all t values occur between t = -2.78 and t = +2.78

[Figure: t distribution for df = 4, Pr(t) against t (-5 to +5), with 95% of t values between -2.78 and +2.78.]

• Probability is 95% that t is between -2.78 and +2.78
• Probability is 95% that $\dfrac{\bar{y} - \mu}{s/\sqrt{n}}$ is between -2.78 and +2.78
• Rearrange equation to solve for $\mu$

Rearrange to solve for $\mu$

1. $t = \dfrac{\bar{y} - \mu}{s/\sqrt{n}}$

2. $t\,(s/\sqrt{n}) = \bar{y} - \mu$

3. $\mu = \bar{y} - t\,(s/\sqrt{n})$

Solve for $\mu$ (using df) from:
1. Calculated t values
2. Desired confidence level

For a two tailed test (to determine the range in values that are likely to contain $\mu$), t takes both signs, so $\mu = \bar{y} \pm t\,(s/\sqrt{n})$ and

$$ \Pr[\,\bar{y} - t\,(s/\sqrt{n}) \le \mu \le \bar{y} + t\,(s/\sqrt{n})\,] $$

For 95% CI, use the t value between which 95% of all t values occur, for specific df (n-1):

$$ P[\,\bar{y} - t\,(s/\sqrt{n}) \le \mu \le \bar{y} + t\,(s/\sqrt{n})\,] = 0.95 $$

This is a confidence interval.

• If CIs were calculated from repeated samples of size n, 95% of the CIs would contain $\mu$ and 5% would not.

• In that sense, there is a 95% probability that an interval constructed this way includes the true population mean.

Worked example (Lovett et al. 2000)

Sample mean    61.92
Sample SD      5.24
SE             0.84 (= $5.24/\sqrt{39}$, where 39 = number of streams sampled)

• The t value (95%, 38 df) = 2.02 (from a t-table)
• 2.5% of t values are greater than 2.02
• 2.5% of t values are less than -2.02
• 95% of t values are between -2.02 and +2.02

P {61.92 - 2.02 ($5.24/\sqrt{39}$) < $\mu$ < 61.92 + 2.02 ($5.24/\sqrt{39}$)} = 0.95
P {61.92 - 2.02 (0.84) < $\mu$ < 61.92 + 2.02 (0.84)} = 0.95

P {60.22 < $\mu$ < 63.62} = 0.95
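The arithmetic of the worked example can be checked with a short function. The function name is this sketch's own; the inputs are the values given above. Note that the slide rounds the SE to 0.84 before multiplying, so carrying the unrounded SE through shifts each limit by roughly 0.01.

```python
import math


def confidence_interval(sample_mean, s, n, t_crit):
    """Two-tailed confidence interval: mean -/+ t * (s / sqrt(n))."""
    sem = s / math.sqrt(n)
    half_width = t_crit * sem
    return sample_mean - half_width, sample_mean + half_width


# Values from the worked example: mean 61.92, SD 5.24, n = 39, t(95%, 38 df) = 2.02.
lower, upper = confidence_interval(61.92, 5.24, 39, 2.02)
print(f"SEM = {5.24 / math.sqrt(39):.3f}")
print(f"{lower:.3f} < mu < {upper:.3f}")
```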

Confidence Interval (2 tailed), assuming a 95% CI is desired

$$ \Pr[\,\bar{y} - t\,(s/\sqrt{n}) \le \mu \le \bar{y} + t\,(s/\sqrt{n})\,] $$

Lovett et al. (2000), 38 df

Degrees of     Probability
Freedom        .01      .02      .05      .10      .20
1              63.66    31.82    12.71    6.314    3.078
2              9.925    6.965    4.303    2.920    1.886
3              5.841    4.541    3.182    2.353    1.638
4              4.604    3.747    2.776    2.132    1.533
5              4.032    3.365    2.571    2.015    1.476
10             3.169    2.764    2.228    1.812    1.372
15             2.947    2.602    2.132    1.753    1.341
20             2.845    2.528    2.086    1.725    1.325
25             2.787    2.485    2.060    1.708    1.316
38             2.705    2.426    2.020    1.685    1.302

Sample mean    61.92
SEM            0.84
DF             38

Lower limit: $\bar{y} - t\,(s/\sqrt{n})$ = 61.92 – 2.02(0.84)
Upper limit: $\bar{y} + t\,(s/\sqrt{n})$ = 61.92 + 2.02(0.84)

60.22 < $\mu$ < 63.62

[Figure: t distribution centred on the sample mean 61.92, with 95% of the distribution between the lower and upper limits.]

• Intervals calculated this way will contain $\mu$ 95% of the time; for this sample the interval is 60.22 – 63.62.

• We are 95% confident that the interval 60.22 – 63.62 contains $\mu$.

Effect on Confidence Interval

Case                    Mean    Sample      Standard        Standard   Probability   Lower Confidence   Upper Confidence
                                size (SS)   deviation (SD)  Error      (%)           Limit              Limit
Reference               61.92   39          5.24            0.839      95%           60.22              63.62
Double SD               61.92   39          10.48           1.68       95%           58.53              65.31
Reduce Sample Size      61.92   20          5.24            1.17       95%           59.47              64.37
Increase Probability %  61.92   39          5.24            0.839      99%           59.65              64.20
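The pattern in the table can be reproduced approximately with the same confidence-interval formula. The t values for 38 df (2.02 and 2.705) are those tabled earlier; the value 2.093 for 19 df is taken from a standard t-table and is an assumption of this sketch, since the slides do not list it. Small differences in the last digit from the tabled limits are expected because of rounding.

```python
import math


def confidence_interval(sample_mean, s, n, t_crit):
    """Two-tailed confidence interval: mean -/+ t * (s / sqrt(n))."""
    half_width = t_crit * s / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width


# (case, mean, n, SD, t_crit); t for 19 df is an assumed standard-table value.
cases = [
    ("Reference",            61.92, 39, 5.24,  2.02),
    ("Double SD",            61.92, 39, 10.48, 2.02),
    ("Reduce sample size",   61.92, 20, 5.24,  2.093),
    ("Increase probability", 61.92, 39, 5.24,  2.705),
]

results = {}
for name, mean, n, sd, t_crit in cases:
    lower, upper = confidence_interval(mean, sd, n, t_crit)
    results[name] = (lower, upper)
    print(f"{name:22s} SE = {sd / math.sqrt(n):.3f}  "
          f"CI = {lower:.2f} to {upper:.2f}  width = {upper - lower:.2f}")
```

In each of the three altered cases the interval is wider than the reference: more variable data, fewer observations, or a higher required confidence level all reduce precision.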

Estimating other parameters

• Logic of interval estimation of the population mean using the t-distribution can be extended to other parameters
  – For example: confidence interval of the mean

Confidence Interval – using resampling vs t-test
• CI from t distribution is based on creation of a distribution from the mean and standard deviation calculated from sample data
• CI from resampling is based on the sample data themselves
• For example, assume we have the following observations and want to determine if the mean is different from 10
  – 9, 8, 9, 10, 9, 8, 9, 7, 11, 11

Confidence Interval – using t distribution

[Figure: left panel, Prob(y) against y (5 to 15) for the sample, with $\bar{y}$ = 9.14, $s$ = 1.30 and H$_o$: µ = 10; right panel, the t distribution used, Prob(t) against t (-5 to +5).]

Sample mean    9.139
Sample SD      1.30
SE             0.412

• The t value (95%, 9 df) = 2.26 (from a t-table)
• 2.5% of t values are greater than 2.26
• 2.5% of t values are less than -2.26
• 95% of t values are between -2.26 and +2.26

P {9.14 – 2.26 ($1.30/\sqrt{10}$) < $\mu$ < 9.14 + 2.26 ($1.30/\sqrt{10}$)} = 0.95

P {8.22 < $\mu$ < 10.07} = 0.95
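The same interval can be computed from the ten listed observations. One caveat, stated as an observation rather than a correction: the values as listed (whole numbers) give a mean of 9.1 and an SD of about 1.29, slightly different from the 9.139 and 1.30 reported, presumably because the listed values are rounded. The conclusion about H0 is the same either way. Variable names are this sketch's own.

```python
import math
import statistics

# Observations as listed on the slide (apparently rounded).
data = [9, 8, 9, 10, 9, 8, 9, 7, 11, 11]
n = len(data)

mean = statistics.mean(data)
sd = statistics.stdev(data)          # (n - 1) denominator
sem = sd / math.sqrt(n)

t_crit = 2.26                        # t value for 95%, 9 df, from the t-table
lower = mean - t_crit * sem
upper = mean + t_crit * sem

h0_mean = 10
h0_inside_ci = lower <= h0_mean <= upper

print(f"mean = {mean:.3f}, SD = {sd:.3f}, SE = {sem:.3f}")
print(f"95% CI: {lower:.2f} to {upper:.2f}")
print("H0 value inside CI:", h0_inside_ci)
```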

Confidence Interval – using resampling

• Resample many times, with replacement, each with 10 observations
• Calculate means of all samples
• Generate distribution of means and determine empirical confidence interval

[Figure: histogram of the estimates of the mean (count and proportion per bar against mean value, 8 to 11).]

95.0% Confidence Interval for Mean

Variable    Mean     Lower    Upper
y           9.142    8.453    9.912
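The resampling steps can be sketched as a percentile bootstrap. This is one common way to form an empirical interval and is an assumption of the example; the slides do not say which method or software produced the limits above. The number of resamples and the seed are arbitrary, and because the listed data appear to be rounded, the limits will not match the reported 8.453 and 9.912 exactly.

```python
import random
import statistics


def bootstrap_ci_mean(data, n_resamples=10000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement, each resample the same size as the data.
    means = sorted(
        statistics.mean(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    lower = means[int(tail * n_resamples)]
    upper = means[int((1 - tail) * n_resamples) - 1]
    return lower, upper


data = [9, 8, 9, 10, 9, 8, 9, 7, 11, 11]
lower, upper = bootstrap_ci_mean(data)
print(f"bootstrap 95% CI for the mean: {lower} to {upper}")
print("H0 value 10 inside CI:", lower <= 10 <= upper)
```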

Compare approaches

Statistic                          Using t-distribution    Using resampling
Mean                               9.139                   9.142
Upper Confidence Limit             10.07                   9.91
Lower Confidence Limit             8.22                    8.45
Accept Ho: µ = 10                  YES                     NO
(is 10 within 95% CI?)

Confidence Intervals using resampling
• The same technique may be used to set confidence limits to any statistic, e.g. the median, the average (absolute) deviation, the standard deviation (s) or the coefficient of variation.
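To illustrate that the technique is not tied to the mean, the sketch below passes the statistic in as a function. It again assumes a percentile bootstrap, with an arbitrary seed and number of resamples, and reuses the ten observations listed earlier; the helper names are hypothetical.

```python
import random
import statistics


def bootstrap_ci(data, statistic, n_resamples=5000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for any statistic."""
    rng = random.Random(seed)
    n = len(data)
    estimates = sorted(
        statistic(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    lower = estimates[int(tail * n_resamples)]
    upper = estimates[int((1 - tail) * n_resamples) - 1]
    return lower, upper


def coefficient_of_variation(values):
    """CV as a percentage of the mean."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


data = [9, 8, 9, 10, 9, 8, 9, 7, 11, 11]

median_ci = bootstrap_ci(data, statistics.median)
sd_ci = bootstrap_ci(data, statistics.stdev)
cv_ci = bootstrap_ci(data, coefficient_of_variation)

print("median:", statistics.median(data), "CI:", median_ci)
print("SD:    ", round(statistics.stdev(data), 3), "CI:", sd_ci)
print("CV (%):", round(coefficient_of_variation(data), 2), "CI:", cv_ci)
```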
