
Chapter 3. Univariate statistics

Empirical distribution: a histogram shows the number of points in a given data bin

Syntax
[n,xout] = hist(data)   % n: row vector of the number of data in each bin
                        % xout: bin locations
hist(data)
hist(data, number of bins)
hist(data, vector of bin centers)

Updated functions: hist → histogram
[n, edges] = histcounts(data)
center = edges(1:end-1) + diff(edges)/2

Empirical distribution: Histogram
x = randn(1000, 1);
histogram(x)
hist(x, 22)        % gives similar results
histogram(x, 50)   % 50 bins
y = -2:0.1:2;
hist(x, y)         % not pretty
histogram(x, y)    % much better

Empirical distributions
How do we describe a dataset?
Discrete parameters: min, max, median, quartiles, mean
Mean: why different definitions?

Arithmetic mean: x̄ = (1/N) Σ x_i,  i = 1, ..., N

Geometric mean: x̄ = (x_1 · x_2 ··· x_N)^(1/N)

Harmonic mean: x̄ = N / Σ (1/x_i)

Median: write a median function

function m = mymedian(x)
a = sort(x);
b = length(x);
b2 = floor(b/2);
if (b/2 > b2)                 % true when b is odd (same as mod(b,2))
    m = a(b2+1);
else
    m = 0.5*(a(b2) + a(b2+1));
end
end
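A quick sanity check of mymedian against MATLAB's built-in median (a minimal sketch):

x = [3 1 4 1 5 9 2];        % odd number of points
mymedian(x), median(x)      % both return 3
y = [3 1 4 1 5 9 2 6];      % even number of points
mymedian(y), median(y)      % both return 3.5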

Quantiles
Divide ordered data into (approximately) equal-sized subsets of data.
- 4-quantiles: quartiles
- 100-quantiles: percentiles
- 1st quartile: 25th percentile
- 2nd quartile (median): 50th percentile

Quartiles
x = 1:15, what is the 3rd quartile?

1. Use the median to divide the data into 2 subsets (do not include the median value).
2. The lower quartile is the median of the lower half; the upper quartile is the median of the upper half.

The 3rd quartile is 12.

Matlab uses linear interpolation: prctile(x, [25 50 75])
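A short sketch contrasting the manual rule above with prctile's linear interpolation (reusing the mymedian function defined earlier):

x = 1:15;
q3_manual = mymedian(x(x > mymedian(x)))   % median of the upper half = 12
q3_matlab = prctile(x, 75)                 % interpolated value, close to but not exactly 12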

Dispersion of the data: Central moments
nth moment: µ'_n = (1/N) Σ x_i^n

nth central moment: µ_n = (1/N) Σ (x_i − x̄)^n
- 1st: µ₁ = 0
- 2nd: µ₂ → variance
- 3rd: µ₃ → skewness
- 4th: µ₄ → kurtosis

Moment statistics

Variance and standard deviation: µ₂ = s² = (1/(N−1)) Σ (x_i − x̄)²

Skewness: [(1/N) Σ (x_i − x̄)³] / s³

Kurtosis: [(1/N) Σ (x_i − x̄)⁴] / s⁴

Moment statistics

Skewness:
- > 0: distribution shifts to the right of the mean
- = 0: distribution symmetric around the mean
- < 0: distribution shifts to the left of the mean
Kurtosis (some define it as the excess kurtosis, [(1/N) Σ (x_i − x̄)⁴]/s⁴ − 3):
- > 3: "wide" distribution
- = 3: normal distribution
- < 3: "narrow" distribution
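A minimal sketch that computes these statistics by hand and compares them with the built-in functions (note that skewness and kurtosis normalize by the population, 1/N, standard deviation):

x = randn(1000,1);
xbar = mean(x);
s2 = sum((x-xbar).^2)/(length(x)-1);   % sample variance, same as var(x)
sp = std(x,1);                         % population (1/N) standard deviation
sk = mean((x-xbar).^3)/sp^3;           % same as skewness(x)
ku = mean((x-xbar).^4)/sp^4;           % same as kurtosis(x)
[s2 var(x); sk skewness(x); ku kurtosis(x)]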

Which variable is needed to compare the mean with the median?

Moment statistics
- Variance: var(x)
- Standard deviation: std(x)
[Figure: histograms of the "red" and "blue" datasets]
How do variance, skewness and kurtosis of the "red" data compare to the "blue" data?

Moment statistics
How do variance, skewness and kurtosis of the "red" data compare to the "blue" data?

[Figure: histograms of the "red" and "blue" datasets]

Dealing with NaN
x = [1:120, NaN];
mean(x), var(x)
nanmean(x), nanvar(x)
skewness(x)
kurtosis(x)

How do we remove the NaN values?
x(isnan(x)) = []
x = x(~isnan(x))
NaN == NaN always returns 0 (false); must use isnan.

Organic matter data
org = load('organicmatter_one.txt');
% check out the data
plot(org,'o-'), ylabel('wt %')
% histogram
% sqrt of the number of data is often a good first guess for the number of bins
hist(org, 8)

Statistics:
mean(org)                  12.3
median(org)                12.5
std(org)                   1.17
var(org)                   1.36
skewness(org)             -0.25
kurtosis(org)              2.47
prctile(org,[25,50,75])   [11.4 12.5 13.3]

Histogram: customized
org = load('organicmatter_one.txt');
[n,xout] = hist(org,8);   % n: row with the number of data in each bin
                          % xout: bin locations
bar(xout, n, 'r')         % red bars

% 3-D bars
bar3(xout, n, 'b')

Sensitivity to outliers
sodium = load('sodiumcontent.txt');
whos sodium
hist(sodium,11)

% add an outlier
sodium2 = sodium;
sodium2(121,1) = 0.1;    % or: sodium2 = [sodium; 0.1];

Which variable is most sensitive?

Sensitivity to outliers

             original   outlier
Mean           5.7        5.6
Median         6.0        6.0
Std            1.1        1.2
Skewness      -1.1       -1.5
Kurtosis       3.7        6.1
[Figure: histograms of the original and outlier-added sodium data]

boxplot
boxplot(org)
- The box shows the lower quartile, median, and upper quartile values.
- Whiskers show the most extreme data within 1.5 times the interquartile range (25th-75th percentile) from the ends of the box (25th, 75th percentiles).
- Red + signs: outliers.

load carsmall
boxplot(MPG,Origin)
% MPG is a vector of numbers, Origin a vector of strings that defines the "group"

Group assignment with {}:
data = [sodium; sodium2];
name(1:length(sodium)) = {'original'};
ed = length(sodium);
name(ed+1:ed+length(sodium2)) = {'outlier'};
boxplot(data, name)

Statistical distribution

- Discrete
- Continuous probability distribution
f(t): PDF, probability density function
F(x): CDF, cumulative distribution function

Discrete distribution: Poisson

k is a non-negative integer (the number of occurrences); λ is the expected number of occurrences.
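A small illustrative sketch of the Poisson probability mass function (λ = 3 is an arbitrary choice):

lambda = 3;
k = 0:15;
stem(k, poisspdf(k, lambda))
xlabel('k'), ylabel('P(X = k)')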

Continuous PDF: Boltzmann

Gaussian (normal) distributions
Parameters:
- mean µ
- standard deviation σ
PDF, CDF

Syntax
Y = pdf(name, p1, ...)
Y = cdf(name, p1, ...)
  name: distribution name
  pi: parameters for the distribution

Gaussian:
Y = pdf('norm', data vector, mean, std)
Y = cdf('norm', data vector, mean, std)
or
Y = normpdf(data vector, mean, std)
Y = normcdf(data vector, mean, std)

Distributions

Beta, Binomial, Birnbaum-Saunders, Burr Type XII, Chi-Square, Exponential, Extreme Value, F, Gamma, Generalized Extreme Value, Generalized Pareto, Geometric, Hypergeometric, Inverse Gaussian, Logistic, Loglogistic, Lognormal, Nakagami, Negative Binomial, Noncentral F, Noncentral t, Noncentral Chi-Square, Normal, Poisson, Rayleigh, Rician, Student's t, t Location-Scale, Uniform (Continuous), Uniform (Discrete), Weibull

Gaussian distribution

µ1 = 0,  σ1 = 0.2
µ2 = 2,  σ2 = 1
µ3 = -2, σ3 = 0.5
µ4 = 0,  σ4 = 3

mu = [0, 2, -2, 0]; sig = [0.2, 1, 0.5, 3];
x = linspace(-5,5,100);
for i = 1:4
    xpdf(:,i) = pdf('norm',x,mu(i),sig(i));
    xcdf(:,i) = cdf('norm',x,mu(i),sig(i));
end
subplot(2,1,1), plot(x,xpdf)
subplot(2,1,2), plot(x,xcdf)

Gaussian distribution

[Figure: the four Gaussian PDFs (top) and CDFs (bottom)]

The sum of a large number of independent and identically distributed random variables, each with finite mean and variance, is approximately normally distributed (the 2nd fundamental theorem of probability).

for i = 1:2000
    x = rand(1000,1) < 0.5;
    heads = sum(x);
    tails = 1000 - heads;
    y(i) = heads - tails;
end
histogram(y)
[Figures: histograms of y when the winning odds are 50% and when they are 45%]

Can you afford going to Vegas?

What are the probabilities of losing $50 and $100 if the winning odds are 50%?
ymean = mean(y)
ystd = std(y)
cdf('norm',-50,ymean,ystd)
cdf('norm',-100,ymean,ystd)

Gaussian distribution
Vegas: n draws of 1 or -1 with winning odds p
Poll: n samples of yes (1) or no (0); p: probability of yes

          Vegas (total earn/loss)    Poll (total vote)
mean:     n*(p-(1-p))                n*p
std:      2*sqrt(n*p*(1-p))          sqrt(n*p*(1-p))

The poll result: a Gaussian with a mean of p and a standard deviation of sqrt(p*(1-p))/sqrt(n) Polling uncertainty

A Gaussian distribution with a mean of p and a standard deviation of sqrt(p*(1-p))/sqrt(n)

(1) If p = 50% and 1,000 people are sampled, what is the 95th percentile of the polling result?
(2) If p = 30% and 1,000 people are sampled, what is the 95% confidence interval of the polling result?
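One way to work these out, assuming the Gaussian approximation above (a minimal sketch):

n = 1000;
p = 0.5;  sigma = sqrt(p*(1-p)/n);
norminv(0.95, p, sigma)            % (1) 95th percentile of the poll result
p = 0.3;  sigma = sqrt(p*(1-p)/n);
norminv([0.025 0.975], p, sigma)   % (2) central 95% interval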

[Figure: standard normal PDF and CDF with shaded probability regions (68% within ±1σ; 2.5% in each tail, 95% in between)]

Central limit theorem

- Let X1, X2, X3, ... be a set of n independent and identically distributed (not necessarily normal) random variables having finite values of mean µ and variance σ². As the sample size n increases, the distribution of the sample mean approaches the normal distribution with a mean µ and variance σ²/n, irrespective of the shape of the original distribution.
- The PDF of the sum of two or more independent variables is the convolution of their densities (if these densities exist). The convolution of a number of density functions tends to the normal density as the number of density functions increases without bound, under the conditions stated above.
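A minimal simulation sketch of the theorem, using a uniform (clearly non-normal) parent distribution:

n = 100;
y = zeros(1,5000);
for i = 1:5000
    y(i) = mean(rand(n,1));                 % sample mean of n uniform(0,1) draws
end
histogram(y,'Normalization','pdf'); hold on
x = linspace(0.4,0.6,200);
plot(x, pdf('norm',x,0.5,sqrt(1/12/n)))     % predicted N(mu, sigma^2/n): mu = 0.5, sigma^2 = 1/12
hold off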

Gaussian distribution
If X ~ N(µ_X, σ_X²) and Y ~ N(µ_Y, σ_Y²), then:
(1) aX ~ N(a·µ_X, (a·σ_X)²)
(2) X + Y ~ N(µ_X + µ_Y, σ_X² + σ_Y²)
(3) (X₁ + X₂ + ... + X_n)/n ~ N(µ_X, σ_X²/n)

Estimate of the errors
- Constant error: Y = (X₁ + X₂ + ... + X_n)/n with each X_i ~ N(µ_X, σ²), so
  σ_Y² = Σ σ²/n² = σ²/n
- Weighted error: Y = Σ w_i X_i / Σ w_i, so
  σ_Y² = Σ w_i² σ_i² / (Σ w_i)²

Propagation of error (normal distribution)
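A quick Monte Carlo sketch checking the weighted-error formula (the weights, means, and sigmas below are arbitrary illustrative values):

w   = [1 2 3];
sig = [0.5 1.0 2.0];
mu  = [1 2 3];
N = 1e5;
X = repmat(mu,N,1) + repmat(sig,N,1).*randn(N,3);   % X(:,i) ~ N(mu(i), sig(i)^2)
Y = (X*w')/sum(w);                                  % weighted mean of each row
std(Y)                                              % compare with the formula:
sqrt(sum(w.^2.*sig.^2))/sum(w)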

Central limit theorem
The log of a product of random variables that take only positive values tends to have a normal distribution, which makes the product itself have a log-normal distribution.

Log-normal distribution

If Y is a random variable with a normal distribution, then X = exp(Y) has a log-normal distribution.
[Figure: log-normal PDF and CDF]

Log-normal distribution

- If the X_j are independent log-normally distributed variables, then their product Y = Π X_j is a log-normally distributed variable as well.
- If the X_j are log-normally distributed variables, then their sum Y = Σ X_j can be reasonably approximated by another log-normal distribution.

Log-normal distribution
mu = [0, 0, 1, 1]; sig = [1/4, 1/2, 1, 2];
x = linspace(0,3,100);
for i = 1:4
    xpdf(:,i) = pdf('logn',x,mu(i),sig(i));
    xcdf(:,i) = cdf('logn',x,mu(i),sig(i));
end
subplot(2,1,1), semilogx(x,xpdf)
subplot(2,1,2), semilogx(x,xcdf)

Atmospheric aerosol size distribution

Chi-squared distribution
One of the most widely used distributions in statistical tests.

If X_i are k independent, normally distributed random variables with mean 0 and variance 1, then the sum of their squares, Q = Σ X_i², is distributed according to the chi-squared distribution with k degrees of freedom.
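A short simulation sketch of this definition (k = 3 is an arbitrary choice):

k = 3; N = 10000;
Q = sum(randn(N,k).^2, 2);                  % sum of k squared standard normals
histogram(Q,'Normalization','pdf'); hold on
x = linspace(0,15,200);
plot(x, pdf('chi2',x,k))                    % chi-squared PDF with k DOF
hold off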

Chi-squared distribution
- PDF (k is the DOF): f(x; k) = x^(k/2−1) e^(−x/2) / (2^(k/2) Γ(k/2)), for x ≥ 0
- If X_1, ..., X_n are N(µ, σ²), then Σ (X_i − X̄)²/σ² follows a chi-squared distribution with n−1 DOF, where X̄ is the sample mean.

F distribution

- If U1 and U2 have chi-squared distributions with d1 and d2 degrees of freedom respectively, and
- U1 and U2 are independent,
- then the F-distribution is the distribution of F = (U1/d1) / (U2/d2).
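A minimal sketch constructing F from two chi-squared variables and comparing with the built-in PDF (d1 and d2 are arbitrary choices):

d1 = 5; d2 = 10; N = 1e5;
U1 = sum(randn(N,d1).^2, 2);                % chi-squared with d1 DOF
U2 = sum(randn(N,d2).^2, 2);                % chi-squared with d2 DOF
F = (U1/d1)./(U2/d2);
histogram(F,'Normalization','pdf','BinLimits',[0 6]); hold on
x = linspace(0.01,6,200);
plot(x, fpdf(x,d1,d2))
hold off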

- The F-distribution arises frequently as the null distribution of a test statistic, for example in the analysis of variance.

Student's t-distribution
- The t-distribution is often used as an alternative to the normal distribution as a model for data. It is frequently the case that real data have heavier tails than the normal distribution allows for. The classical approach was to identify outliers and exclude or downweight them in some way.

Student's t-distribution

- Arises in the problem of estimating the mean of a normally distributed population when the sample size is small (hence the standard deviation is not known well).
- Suppose X_1, ..., X_n are independent random variables that are normally distributed with mean µ and variance σ². DOF = n−1.
  Let X̄ = (1/n) Σ X_i and s² = (1/(n−1)) Σ (X_i − X̄)².
  Then T = (X̄ − µ) / (s/√n) follows a t-distribution with n−1 degrees of freedom.

Using the distribution

Scaled distribution with a mean of 0 and a std of 1.

If T is a Gaussian distribution, it has a mean of 0 and a standard deviation of 1.
[Figure: standard normal PDF and CDF with shaded probability regions (68% within ±1σ; 2.5% in each tail, 95% in between)]

t-distribution

A random sample of screws gives weights 30.02, 29.99, 30.11, 29.97, 30.01, 29.99. Calculate a 95% confidence interval for the population's mean weight.

t-distribution

The probability that x is beyond the confidence bound z_{α/2} is α (5%).
Define z = (µ − x̄) / (s/√n).
Since this is normally distributed, and thus symmetric, the interval −z_{α/2} < z < z_{α/2} is where z will have its value with probability 1 − α. Plugging in for z:
  −z_{α/2} < (µ − x̄)/(s/√n) < z_{α/2}, or
  x̄ − z_{α/2}·s/√n < µ < x̄ + z_{α/2}·s/√n
This shows the confidence interval on µ at the 1 − α confidence level. This is a commonly used statistic for estimation of the population mean. Often the level used is 95%, or about 2σ.

t-distribution

Find the data value corresponding to a probability:
  α/2 = cdf('t', z_{α/2}, DOF)      (forward)
  z_{α/2} = tinv(α/2, DOF)          (inverse)
For a Gaussian distribution:
  z_{α/2} = norminv(α/2, mean, std)

t-distribution
  −z_{α/2} < (µ − x̄)/(s/√n) < z_{α/2}, with z_{α/2} = tinv(α/2, DOF), or
  x̄ − z_{α/2}·s/√n < µ < x̄ + z_{α/2}·s/√n

x = [30.02, 29.99, 30.11, 29.97, 30.01, 29.99]
xmean = mean(x)
xstd = std(x)
n = length(x)
% t-value at 2.5% (5%/2), DOF = n-1
% the t-distribution cdf is symmetrical
tvalue = abs(tinv(0.025,n-1))
% low/high bounds
low = xmean - tvalue*xstd/sqrt(n)
high = xmean + tvalue*xstd/sqrt(n)

Comparison to the normal distribution
% if the sample size is large
xmean = mean(x);
xsig = xstd/sqrt(n);
tvalue = abs(norminv(.025,0,1))
xmean + tvalue*xsig
xmean - tvalue*xsig
% ... or ...
norminv(0.975,xmean,xsig)
norminv(0.025,xmean,xsig)

t-distribution: [29.963, 30.067] (tvalue = 2.57)
Normal distribution: [29.975, 30.055] (tvalue = 2)

Student's t and Gaussian distributions

As the number of samples increases, the Student's t-distribution value (tvalue) for the 95% confidence interval approaches that of the Gaussian.
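A short sketch reproducing this convergence (two-sided 95%, i.e. the 97.5% quantile):

n = 2:100;
tval = tinv(0.975, n-1);                          % t-value for a 95% confidence interval
plot(n, tval, n, norminv(0.975)*ones(size(n)),'--')
xlabel('Number of samples'), ylabel('t-value')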

[Figure: t-value vs. number of samples]

Exam schedule

- Exam #1: February 25
- Q/A session: in class (February 23)

Statistical testing

- The null hypothesis is used to test differences in treatment and control groups, and the assumption at the outset of the experiment is that no difference exists between the two groups for the variable being compared.
- "A statistically significant difference" simply means there is statistical evidence that there is a difference; it does not mean the difference is necessarily large, important, or significant in the common meaning of the word.
- Confidence level:

Statistical testing

- The null hypothesis must be stated in mathematical/statistical terms that make it possible to calculate the probability of possible samples assuming the hypothesis is correct.
- A test statistic must be chosen that will summarize the information in the sample that is relevant to the hypothesis. In the example given above, it might be the numerical difference between the two sample means, m1 − m2.
- The distribution of the test statistic is used to calculate the probability of sets of possible values (usually an interval or union of intervals).
- Among all the sets of possible values, we must choose one that we think represents the most extreme evidence against the hypothesis. That is called the critical region of the test statistic. The probability of the test statistic falling in the critical region when the null hypothesis is correct is called the p-value ("surprise" value) of the test.

Probability

- Frequency probability (frequentists) is the interpretation of probability that defines an event's probability as the limit of its relative frequency in a large number of trials. The problems and paradoxes of the classical interpretation motivated the development of the relative frequency concept of probability.
- Bayesian probability is an interpretation of the probability calculus which holds that the concept of probability can be defined as the degree to which a person (or community) believes that a proposition is true. A posteriori probability is a function of a priori probability and observations.
- The two groups have agreed that Bayesian and frequentist analyses answer genuinely different questions, but disagreed about which class of question it is more important to answer in scientific and engineering contexts.

Pearson's Chi-squared test

- The null hypothesis: the relative frequencies of occurrence of observed events follow a specified frequency distribution.
- Pearson's chi-squared test is the original and most widely used chi-squared test.
- The test compares the difference between each observed and theoretical frequency for each possible outcome.

Pearson's Chi-squared test
- DOF = n − 1

χ² = Σ (O_i − E_i)² / E_i
where
O_i = an observed frequency
E_i = an expected (theoretical) frequency
n = the number of possible outcomes of each event

Pearson's Chi-squared test

- A chi-squared probability of 0.05 or less (alpha value, α = 0.05) is commonly used for rejecting the null hypothesis.
- Critics of α-values point out that the criterion used to decide "statistical significance" is based on the somewhat arbitrary choice of level (often set at 0.05).

Pearson's Chi-squared test
A random sample of 100 people has been drawn from a population in which men and women are equal in frequency. There were 45 men and 55 women in the sample; what is the chi-squared value?

Probability in the chi-squared distribution: cdf('chi2', value, DOF)

Is this chi-squared value within 95% of the distribution?
cdf('chi2',1,1)
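Worked out for the example above (expected counts of 50 men and 50 women):

chi2 = (45-50)^2/50 + (55-50)^2/50   % = 1
cdf('chi2', chi2, 1)                 % ≈ 0.68, well within the 95% region
chi2inv(0.95,1)                      % ≈ 3.84 > 1, so the null hypothesis cannot be rejected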

Pearson's Chi-squared test
What value should we use?
x = linspace(0,8,100);
v = cdf('chi2',x,ones(1,100));
plot(x,v)
% cdf('chi2',1,1) = 0.68
[Figure: chi-squared CDF for DOF = 1]

chi2inv
% chi2inv(probability, DOF)
% the value that exceeds 95% of the data with DOF = 1
chi2inv(0.95,1)
If the chi2 value < chi2inv(0.95,1), the hypothesis cannot be rejected.

2-component Bernoulli Chi-square

Pearson's Chi-squared test

(1) Load the organic matter data you downloaded previously (http://apollo.eas.gatech.edu/eas4480/data/organicmatter_one.txt).
(2) Compute the histogram for 8 bins. Save both the histogram values and the bin locations.
(3) Generate a synthetic dataset following a Gaussian distribution with the same mean and standard deviation as the organic matter dataset.
(4) Scale the distribution such that the sum of histogram values in the synthetic dataset is the same as in the organic matter dataset.
(5) Plot the two distributions.
(6) Compute the chi2 value using χ² = Σ (O_i − E_i)²/E_i.
(7) Compute the degrees of freedom in the comparison.
(8) Compute the chi2 value for 95% probability.
(9) Can the hypothesis that the dataset follows a Gaussian be rejected?

Pearson's Chi-squared test
corg = load('organicmatter_one.txt');
% 60 data points, define 8 bins
[n_exp,v] = hist(corg,8);

% generate a synthetic dataset
n_syn = pdf('norm',v,mean(corg),std(corg));

% redistribute to the n_exp sum
n_syn = n_syn.*sum(n_exp)/sum(n_syn);
subplot(1,2,1), bar(v,n_syn,'r')
subplot(1,2,2), bar(v,n_exp,'b')

Pearson's Chi-squared test

% test
chi2 = sum((n_exp - n_syn).^2 ./n_syn)

% dof = (# of bins − # of parameters − 1)
% for a Gaussian, # of parameters is 2 (mean & std)
dof = 8-3;

% 0.05 value
chi2inv(0.95,dof)

F-test

- The hypothesis: the standard deviations of two normally distributed populations are equal.
- DOF: Fa = na − 1, Fb = nb − 1
- Test statistic: F = sa²/sb², where sa² > sb²
- F_crit = finv(0.95, DOF1, DOF2)

F-test
load('organicmatter_four.mat');

% compare std
s1 = std(corg1)
s2 = std(corg2)
[Figure: histograms of corg1 and corg2]

% DOF
df1 = length(corg1) - 1;
df2 = length(corg2) - 1;

F-test
if s1>s2
    Freal=(s1/s2)^2
else
    Freal=(s2/s1)^2
end

% find the table value for the 5% extreme, inverse F cdf
Ftable = finv(0.95,df1,df2)

F-test
If Freal > Ftable, the hypothesis that the standard deviations are equal can be rejected.

Student's t-Test

The null hypothesis: The means of two distributions are equal.

Assumptions
- Normal distribution of data (what test to use?)
- Equality of variances (what test to use?)

Student's t-Test
Matlab syntax: [h,p,ci] = ttest2(x,y,alpha)
  h: 1 rejects the null hypothesis; 0 cannot reject
  p: significance for the difference of the means between x and y
  ci: confidence interval
Function: t = (X̄1 − X̄2) / s_{X̄1−X̄2}, where s_{X̄1−X̄2} is the pooled standard error of the difference of the means.
DOF of x1 and x2 is n1 + n2 − 2.

Student's t-Test
load('organicmatter_two.mat');

[n1,x1] = hist(corg1);
[n2,x2] = hist(corg2);
h1 = bar(x1,n1);
hold on
h2 = bar(x2,n2);
set(h1,'FaceColor','none','EdgeColor','r')
set(h2,'FaceColor','none','EdgeColor','b')
hold off
[Figure: overlaid histograms of corg1 and corg2]

Student's t-Test

% difference of means
mean(corg1)-mean(corg2)

% h=1 rejects the null hypothesis
% significance: probability for the null hypothesis to be true
% ci: 95% confidence interval on the difference of the means (if there is no statistically significant difference)
[h, significance, ci] = ttest2(corg1,corg2,0.05)

Jarque-Bera test
- The Jarque–Bera test is a goodness-of-fit test of whether sample data have the skewness and kurtosis of a normal distribution.

JB = (n/6) · (S² + (K − 3)²/4)
where n is the number of observations (or degrees of freedom in general), S is the sample skewness, and K is the sample kurtosis.

Kolmogorov-Smirnov test
- The Kolmogorov–Smirnov test (K–S test) is a nonparametric test for the equality of continuous, one-dimensional probability distributions that can be used to compare a sample with a reference probability distribution (one-sample K–S test), or to compare two samples (two-sample K–S test).
- "Nonparametric" means that we do not assume an a priori distribution of the data or an a priori structure of the model.

Extreme value distribution (Black Swan Event)
Gaussian distribution? Precipitation distribution

Extremal types theorem

- The maximum of a large number of independent identically distributed random variables is distributed like the Gumbel, Fréchet, or Weibull distributions, independently of the parent distribution.

Extremal types distributions

- Weibull: a distribution with a bounded upper tail.
- Gumbel: a distribution with a light upper tail, positively skewed.
- Frechet: a distribution with a heavy upper tail and infinite higher-order moments.

Calculation: Generalized Extreme Value (GEV) distribution
- Build blocks

  - Divide the full dataset into equal-sized chunks of data, e.g. yearly blocks of 365/366 daily precipitation measurements.
- Extract block maxima
  - Determine the max for each block.
- Fit a GEV to the maxima and estimate X(T)
  - Estimate the parameters of a GEV fitted to the block maxima.
  - Calculate the return value function X(T) and its uncertainty.

Precipitation distribution

Return period: Risk communication

- Suppose that the cumulative probability of a given extreme event is p per year.
- The return period is T = 1/p.
- The average waiting time until the next occurrence of the event is T years.

If the cumulative probability of a precipitation rate of 4 inch day⁻¹ is 0.05, what is the return period?
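For example: with p = 0.05 per year, the return period is T = 1/p = 1/0.05 = 20 years.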

Extreme threshold distribution
- Beta: a bounded distribution.
- Exponential: a light-tailed distribution with a "memoryless" property.
- Pareto: a heavy-tailed distribution (sometimes called a "power law").

Weibull distribution

β is the shape parameter, also known as the Weibull slope; η is the scale parameter; γ is the location parameter, often set to 0.

Weibull scale parameter η
- If η is increased (decreased), while β and γ are kept the same, the distribution gets stretched out to the right (left) and its height decreases.

Weibull distribution
Weibull Shape Parameter β
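An illustrative sketch of the shape-parameter effect, using MATLAB's two-parameter wblpdf (scale η, shape β; γ = 0); the parameter values below are arbitrary:

x = linspace(0.01, 2.5, 200);
plot(x, wblpdf(x,1,0.8), x, wblpdf(x,1,1), x, wblpdf(x,1,2), x, wblpdf(x,1,4))
legend('\beta = 0.8','\beta = 1','\beta = 2','\beta = 4')
xlabel('x'), ylabel('PDF')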