Chapter 3. Univariate Statistics

Empirical distribution: Histogram

A histogram shows the number of data points in a given data bin.

Syntax
    [n,xout] = hist(data)     % n: row vector of the number of data in each bin
                              % xout: bin locations
    hist(data)
    hist(data, nbins)         % nbins: number of bins
    hist(data, bins)          % bins: vector of data bins

Updated functions: hist → histogram
    [n,edges] = histcounts(data)
    center = edges(1:end-1) + diff(edges)/2    % bin centers from bin edges

Empirical distribution: Histogram
    x = randn(1000, 1);
    histogram(x)
    hist(x, 22)          % gives similar results
    histogram(x, 50)     % 50 bins
    y = -2:0.1:2;
    hist(x, y)           % not pretty (y is used as bin centers)
    histogram(x, y)      % much better (y is used as bin edges)

Empirical distributions

How do we describe a dataset? Discrete parameters:
- min, max, mean
- median, quartile
- standard deviation
- variance
- skewness
- kurtosis

Mean: why different definitions?
- Arithmetic mean: $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$
- Geometric mean: $\bar{x} = (x_1 \cdot x_2 \cdots x_N)^{1/N}$
- Harmonic mean: $\bar{x} = N \Big/ \sum_{i=1}^{N} \frac{1}{x_i}$

Median: write a median function
    function m = mymedian(x)
        a = sort(x);                      % order the data
        b = length(x);
        b2 = floor(b/2);
        if (b/2 > b2)                     % odd length; same as: if mod(b,2)
            m = a(b2+1);                  % middle value
        else
            m = 0.5*(a(b2) + a(b2+1));    % average of the two middle values
        end
    end

Quantiles

Quantiles divide ordered data into (approximately) equal-sized subsets of data.
- 4-quantiles: quartiles
- 100-quantiles: percentiles
- 1st quartile: 25th percentile
- 2nd quartile: median: 50th percentile

Quartiles

x=1:15, what is the 3rd quartile?
1. Use the median to divide the data into 2 subsets (do not include the median value).
2. The lower quartile is the median of the lower half; the upper (3rd) quartile is the median of the upper half.
The 3rd quartile is 12.

MATLAB uses linear interpolation:
    prctile(x, [25 50 75])

Dispersion of the data: Central moments
- n-th moment: $\mu'_n = \frac{1}{N}\sum_{i=1}^{N} x_i^n$
- n-th central moment: $\mu_n = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^n$
    - 1st: $\mu_1 = 0$
    - 2nd: $\mu_2$ → variance
    - 3rd: $\mu_3$ → skewness
    - 4th: $\mu_4$ → kurtosis

Moment statistics
- Variance and standard deviation: $\mu_2 = \sigma^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2$
- Skewness: $\frac{\sum_{i=1}^{N} (x_i - \bar{x})^3}{N} \Big/ \sigma^3$
- Kurtosis: $\frac{\sum_{i=1}^{N} (x_i - \bar{x})^4}{N} \Big/ \sigma^4$ (some define it as this value minus 3)

Moment statistics

Skewness
- > 0: the distribution has a longer tail to the right of the mean
- = 0: the distribution is symmetric around the mean
- < 0: the distribution has a longer tail to the left of the mean

Kurtosis
- > 3: "wide" distribution (heavier tails than a normal distribution)
- = 3: normal distribution
- < 3: "narrow" distribution (lighter tails than a normal distribution)

Which variable is needed to compare the mean with the median?

Moment statistics
- Variance: var(x)
- Standard deviation: std(x)
- How do variance, skewness and kurtosis of the "red" data compare to the "blue" data?

[Figure: histograms of the "red" and "blue" data, x-axis from -4 to 4]

Moment statistics

How do variance, skewness and kurtosis of the "red" data compare to the "blue" data?

[Figure: histograms of the "red" and "blue" data, x-axis from -5 to 20]

Dealing with NaN
    x = [1:120, NaN];
    mean(x), var(x)          % both return NaN
    nanmean(x), nanvar(x)    % ignore the NaN value
    skewness(x)
    kurtosis(x)

How do we remove the NaN values?
    x(isnan(x)) = []         % delete the NaN entries in place
    x = x(~isnan(x))         % or keep only the non-NaN entries

NaN==NaN always returns 0; you must use isnan.

Organic matter data
    org = load('organicmatter_one.txt');
    % check out the data
    plot(org, 'o-'), ylabel('wt %')
    % histogram
    % sqrt of the number of data is often a good first guess of intervals to use
    hist(org, 8)

Statistics:
    mean(org)                     12.3
    median(org)                   12.5
    std(org)                      1.17
    var(org)                      1.36
    skewness(org)                 -0.25
    kurtosis(org)                 2.47
    prctile(org, [25,50,75])      [11.4 12.5 13.3]

Histogram: customized
    org = load('organicmatter_one.txt');
    [n,xout] = hist(org, 8);
    % n: row with the number of data of each bin
    % xout: bin locations
    bar(xout, n, 'r')     % red bar
    % 3d bar
    bar3(xout, n, 'b')

Sensitivity to outliers
    sodium = load('sodiumcontent.txt');
    whos sodium
    hist(sodium, 11)
    % add an outlier
    sodium2 = sodium;
    sodium2(121,1) = 0.1;     % same as: sodium2 = [sodium; 0.1];

Which variable is most sensitive?

Sensitivity to outliers

                original    outlier
    Mean        5.7         5.6
    Median      6.0         6.0
    Std         1.1         1.2
    Skewness    -1.1        -1.5
    Kurtosis    3.7         6.1

[Figure: histogram of the sodium data with the added outlier, x-axis from 0 to 8]

boxplot
    boxplot(org)

- Box shows the lower quartile, median, and upper quartile values.
- Whiskers show the most extreme data within 1.5 times the interquartile range (25th to 75th percentile) from the ends of the box (25th and 75th percentiles).
- Red + signs: outliers

    load carsmall
    boxplot(MPG, Origin)
    % MPG is a vector of numbers; Origin is a vector of strings that defines the "group"

Box plot: group assignment with a cell array {}
    data = [sodium; sodium2];
    name(1:length(sodium)) = {'original'};
    ed = length(sodium);
    name(ed+1:ed+length(sodium2)) = {'outlier'};
    boxplot(data, name)

Statistical distribution
- Discrete probability distribution
- Continuous probability distribution
    - f(t): PDF, probability density function
    - F(x): CDF, cumulative distribution function

Discrete distribution: Poisson

$P(k) = \frac{\lambda^k e^{-\lambda}}{k!}$, where k is a non-negative integer (the number of events) and λ > 0 is the expected number of events (λ does not have to be an integer).

Continuous PDF: Boltzmann

Gaussian (normal) distributions
- Parameters
    - Mean µ
    - Standard deviation σ
- PDF: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
- CDF: $F(x) = \frac{1}{2}\left[1 + \mathrm{erf}\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right]$

Syntax
    Y = pdf(name, data, p1, ...)
    Y = cdf(name, data, p1, ...)
    % name: distribution name
    % pi: parameters for the distribution

Gaussian
    Y = pdf('norm', data vector, mean, std)
    Y = cdf('norm', data vector, mean, std)
Or
    Y = normpdf(data vector, mean, std)
    Y = normcdf(data vector, mean, std)

Distributions

Beta, Binomial, Birnbaum-Saunders, Burr Type XII, Chi-Square, Exponential, Extreme Value, F, Gamma, Generalized Extreme Value, Generalized Pareto, Geometric, Hypergeometric, Inverse Gaussian, Logistic, Loglogistic, Lognormal, Nakagami, Negative Binomial, Noncentral F, Noncentral t, Noncentral Chi-Square, Normal, Poisson, Rayleigh, Rician, Student's t, t Location-Scale, Uniform (Continuous), Uniform (Discrete), Weibull

Gaussian distribution

µ1=0, σ1=0.2; µ2=2, σ2=1; µ3=-2, σ3=0.5; µ4=0, σ4=3

    mu = [0, 2, -2, 0]; sig = [0.2, 1, 0.5, 3];
    x = linspace(-5, 5, 100);
    for i = 1:4
        xpdf(:,i) = pdf('norm', x, mu(i), sig(i));    % one column per parameter set
        xcdf(:,i) = cdf('norm', x, mu(i), sig(i));
    end
    subplot(2,1,1), plot(x, xpdf)
    subplot(2,1,2), plot(x, xcdf)

[Figure: PDF (top) and CDF (bottom) of the four Gaussian distributions]

Central limit theorem

The sum of a large number of independent and identically distributed random variables, each with finite mean and variance, is approximately normally distributed.
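
One way to check this statement numerically is to sum uniform random numbers and compare the moment statistics of the sums with the Gaussian values (skewness 0, kurtosis 3). The following is a minimal sketch that uses only base MATLAB; the names nTerms and nTrials and the seed are illustrative choices, not part of the course material, and the moments are computed with the 1/N forms.

```matlab
% Sketch: sums of uniform random numbers look Gaussian (base MATLAB only).
rng(1);                              % illustrative seed so the run is repeatable
nTerms  = 50;                        % illustrative number of terms in each sum
nTrials = 20000;                     % illustrative number of repeated sums
s = sum(rand(nTerms, nTrials), 1);   % 1-by-nTrials row vector of sums

d     = s - mean(s);                 % deviations from the sample mean
sigma = sqrt(mean(d.^2));            % standard deviation (1/N form)
skew  = mean(d.^3) / sigma^3;        % third central moment over sigma^3
kurt  = mean(d.^4) / sigma^4;        % fourth central moment over sigma^4

% A uniform number on [0,1] has mean 1/2 and variance 1/12, so the sum
% should have a mean near nTerms/2 and a std near sqrt(nTerms/12).
fprintf('mean %.2f (theory %.2f)\n', mean(s), nTerms/2);
fprintf('std  %.2f (theory %.2f)\n', sigma, sqrt(nTerms/12));
fprintf('skewness %.2f, kurtosis %.2f\n', skew, kurt);
```

The individual terms are far from Gaussian (a flat distribution), yet the skewness of the sums should come out close to 0 and the kurtosis close to 3.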
The central limit theorem is sometimes called the second fundamental theorem of probability.

    for i = 1:2000
        x = rand(1000,1) < 0.5;     % 1000 bets, each won with probability 0.5
        heads = sum(x);
        tails = 1000 - heads;
        y(i) = heads - tails;       % net earn/loss after 1000 bets of $1
    end
    histogram(y)

[Figures: histograms of y if the winning odds are 50% and if the winning odds are 45%]

Can you afford going to Vegas?

What are the probabilities of losing at least $50 and at least $100 if the winning odds are 50%?
    ymean = mean(y)
    ystd = std(y)
    cdf('norm', -50, ymean, ystd)
    cdf('norm', -100, ymean, ystd)

Gaussian distribution

                Vegas                          Poll
    Setup       n draws of 1 or -1 with a      n samples of yes (1) or no (0);
                winning odds of p              p: probability of yes
    Quantity    total earn/loss                total vote
    Mean        n*(p-(1-p))                    n*p
    Std         2*sqrt(n*p*(1-p))              sqrt(n*p*(1-p))

The poll result: a Gaussian with a mean of p and a standard deviation of sqrt(p*(1-p))/sqrt(n).

Polling uncertainty

A Gaussian distribution with a mean of p and a standard deviation of sqrt(p*(1-p))/sqrt(n).
(1) If p=50% and 1,000 people are sampled, what is the 95% confidence interval of the polling result?
(2) If p=30% and 1,000 people are sampled, what is the 95% confidence interval of the polling result?

[Figure: Gaussian curves with the central 68% (about 16% in each tail) and the central 95% (2.5% in each tail) marked, x-axis from -5 to 5]

Central limit theorem
- Let X1, X2, X3, ... be a set of n independent and identically distributed (not necessarily normal) random variables having finite values of mean µ and variance σ². As the sample size n increases, the distribution of the sample average approaches the normal distribution with a mean µ and variance σ²/n, irrespective of the shape of the original distribution.
- The PDF of the sum of two or more independent variables is the convolution of their densities (if these densities exist). The convolution of a number of density functions tends to the normal density as the number of density functions increases without bound, under the conditions stated above.

Gaussian distribution

If $X \sim N(\mu_X, \sigma_X^2)$ and $Y \sim N(\mu_Y, \sigma_Y^2)$ are independent, then
- $aX \sim N(a\mu_X, (a\sigma_X)^2)$
- $X + Y \sim N(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2)$

Estimate of the errors: propagation of error (normal distribution)
- Constant error: $Y = \frac{X_1 + X_2 + \dots + X_n}{n}$, $\sigma_Y^2 = \frac{\sum \sigma^2}{n^2} = \frac{\sigma^2}{n}$
- Weighted error: $Y = \frac{\sum w_i X_i}{\sum w_i}$, $\sigma_Y^2 = \frac{\sum w_i^2 \sigma_i^2}{\left(\sum w_i\right)^2}$

Central limit theorem

The log of a product of random variables that take only positive values tends to have a normal distribution, which makes the product itself have a log-normal distribution.

Log-normal distribution

If Y is a random variable with a normal distribution, then X = exp(Y) has a log-normal distribution.
- PDF: $f(x) = \frac{1}{x\sigma\sqrt{2\pi}} \exp\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right)$, x > 0
- CDF: $F(x) = \frac{1}{2}\left[1 + \mathrm{erf}\left(\frac{\ln x - \mu}{\sigma\sqrt{2}}\right)\right]$

Log-normal distribution
- If $X_1, \dots, X_n$ are independent log-normally distributed variables and $Y = \prod_i X_i$, then Y is a log-normally distributed variable as well.
- If $Y = \sum_i X_i$, then Y can be reasonably approximated by another log-normal distribution.

Log-normal distribution
    mu = [0, 0, 1, 1]; sig = [1/4, 1/2, 1, 2];
    x = linspace(0, 3, 100);
    for i = 1:4
        xpdf(:,i) = pdf('logn', x, mu(i), sig(i));    % one column per parameter set
        xcdf(:,i) = cdf('logn', x, mu(i), sig(i));
    end
    subplot(2,1,1), semilogx(x, xpdf)
    subplot(2,1,2), semilogx(x, xcdf)

Atmospheric aerosol size distribution

Chi-squared distribution

One of the most widely used distributions in statistical significance tests.
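
The text above does not define the chi-squared distribution. As a hedged illustration that relies on the standard definition (an assumption added here, not taken from the slides): the sum of the squares of k independent standard normal variables follows a chi-squared distribution with k degrees of freedom, with mean k and variance 2k. The sketch below checks this with base MATLAB; k, N and the seed are illustrative values.

```matlab
% Sketch: build chi-squared samples from squared standard normal numbers.
rng(2);                        % illustrative seed for repeatability
k = 4;                         % degrees of freedom (illustrative value)
N = 100000;                    % number of samples (illustrative value)
q = sum(randn(N, k).^2, 2);    % N-by-1: each row sums k squared normals

% Compare the sample statistics with the theoretical mean k and variance 2k.
fprintf('sample mean     %.2f (theory %d)\n', mean(q), k);
fprintf('sample variance %.2f (theory %d)\n', var(q), 2*k);
```

Because every term is a square, all samples are non-negative and the distribution has a long tail to the right, so its skewness is positive.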