Previous lecture
previous lecture: overview of statistical techniques and concepts
data descriptors - median, mode, mean, stdev, IQR
Data analysis and Geostatistics - lecture III data visualization - histograms, box plots
data analysis - separation (DFA) and clustering Univariate statistics and the use of data descriptors - process recognition (PCA & factor) - curve fitting (regression)
also: impartiality of the observer -> influences your confidence intervals
lack of appropriate control group in the Earth Sciences
start with core techniques and then move on to more complicated stats
Data and nomenclature Populations and samples
what are data and why do we gather them ? given a lavaflow;
a datum is a measurement of a property on a sample...
where property can be density, length, ppm Ca, thermal conductivity sample can be a rock, soil, water, plant
...intended to give us a value for the material where this sample came from The complete lava flow is the population - if you want to know the exact we are actually not interested in the composition of the sample, but composition of the population, you have to analyze it in its entirety rather in the composition of the source of this sample obviously impossible:
brings us to the concept of a statistical population and sample instead: analyze a representative sample of this population - estimate the properties of the host population Estimating the properties of a population Populations and samples
From a set of samples, we can estimate the properties of the population, such A representative sample has to cover all data characteristics of the host as its characteristic value (mean or median) and the spread in values. population: its central value (mean, median) spread ≠ error ! the spread in the data (stdev, IQR) the data distribution (lognormal, modality) the relations with other variables A jeans shop will have a mean size, but it will also stock a spread of sizes invariably it requires more than one geological sample to obtain a representative statistical sample
This mean + spread is the shop’s estimate of the the number of samples depends on the characteristics of the jeans sizes for its clientele population host population, but also on the sampling technique employed, the sample treatment and the analytical technique
e.g. granite vs. basalt, spot samples vs. mixtures, soil vs. stream sediments, mixing of crushed or milled rocks, field variance vs. lab variance
Populations and samples Representative samples
The statistics on national health suggest that 1 out of every 4 Americans, If you know something about your material you can estimate the number of or 1 out of every 5 Canadians will suffer from a certain type of illness in samples you will need to get a representative sample of the population their lifetime 1000 This means 3 in the current Geostats cohort…. 100
10 ppb % 1 ppb 0 ppm ppm 100 1 0 wt Is this reasoning correct ? 1000 1 1
0.1 Sample size in kg
0.01
0.001 μg mg g Grainsize in weight Required sample size for a representative sample Populations and samples
1 Gg In geology we generally no longer have the population at our disposal e.g. due to erosion, weathering and alteration
1 Mg ) -1 all the more important to make sure that your sample is representative 1 ppb mg kg 100 ppb 10 ppm ( 1000 ppm 1 kg 10 wt% n conc stdev sample size in mass
n 1 g 80 μm 0.8 mm 8 mm increasing number of samples -> when data characteristics no longer change -> 10 cm -3 garnet diameter (for ρ = 4 g cm ) representative sample
we would essentially need a sample bigger than the complete outcrop can estimate this if you know something of your samples: pilot sampling
Infill drilling: vein-type deposit with halo Infill drilling: vein-type deposit with halo
30! 4000!
25! 3000! 20!
15! 2000!
10! resource 1000! 5! % of area that is ore % of area 0! 0! 1 out of 7 2 out of 7 3 out of 7 4 out of 7 7 out of 7 0.0! 0.2! 0.4! 0.6! 0.8! 1.0! 0.0! 0.2! 0.4! 0.6! 0.8! 1.0! lines sampled lines sampled lines sampled lines sampled lines sampled fraction of lines sampled fraction of lines sampled Infill drilling: disseminated deposit Infill drilling: disseminated deposit
10010! ! 1000!
8! 900! 75! 6! 800! 50! 4! 700! resource 25! 2! 600! % of area that is ore % of area 00! ! 500! 1 out of 7 2 out of 7 3 out of 7 4 out of 7 7 out of 7 0.00.0! ! 0.20.2! ! 0.40.4!! 0.60.6!! 0.8!! 1.0! 0.0! 0.2! 0.4! 0.6! 0.8! 1.0! lines sampled lines sampled lines sampled lines sampled lines sampled fraction of lines sampled fraction of lines sampled
Types of data Ways to analyze your data
Not all data are equal in “quality” and this requires specific stats for some $ univariate: each variable is analyzed separately: data distribution, $ ratio scale data: the most versatile of all. They have a natural zero point. central value and data spread/uncertainty e.g. charge, weight, length, concentration $ bivariate: two variables are analyzed together to look for correlation or $ interval scale data: the intervals between the values are constant, but they do separation of data - regression not have a natural zero point. e.g. ºF, ºC $ multivariate: more than 2 variables are analyzed together. Generally $ closed data: the data sum to a specified value. e.g. wt%, % of a core. di%cult to visualize data and results Note the closure problem in these. $ spatial statistics: variation of variables in space, either 1D (well logs), 2D and $ ordinal scale data: the intervals between the values are not constant. e.g. Moh’s 3D (topography) or >3D, but some have to be spatial ! hardness scale of minerals $ time series: variation of variables along a time progression $ discrete data: only certain values are allowed, mostly the integers. e.g. number of grains in a sample. Not ppm ! We will start with univariate techniques - the distribution of data $ categorical data non-numerical observations. e.g. colour, presence/absence of a feature in a fossil. Univariate statistics Data visualization
To understand your data: plot their distribution! xx x xxxxx xxxxx x xxxxxxxx xxxxxxx xxxxx x x x x nn = = 1025 152 12 16
14 weight % SiO2 10 12 8 10 repeated analyses of the same sample, or a variety of samples from the 6 8 same host population, will not return an identical value due to: 6 4 sample heterogeneity: % olivine in each sample 4 2 lava heterogeneity: layering or phase separation 2 analytical uncertainty: not every ion makes it to the detector 0 0 45 46 47 48 49 50 51 52 78 84 90 96 102 108 114 120
weight % SiO2 Pb (ppm) analytical uncertainty ~ error, but heterogeneity is a property of the host population and is not error -> both result in uncertainty on your estimate of For these distribution you can now define a central value and spread: the central population value S (x ) S (x -m)2 S (x ) S (x -x)2 e.g. average Pb content of all Canadian rocks = i 2 = i x = i s2 = i m n s n n n - 1
The normal or Gaussian distribution Normality in data sets
If your data describe a phenomenon with one central value and variance normal distribution only when looking at one phenomenon, when all around it due to many di#erent disturbances: will trend to normal at high n variation is averaged out, or when one phenomenon is dominant
mean = median So: in most cases in geology -> deviations from normality 1 2 f(x) = e - 1 ( x- m ) total surface area = 1 s 2p ( 2 s ) skewness is especially common and leads to the lognormal distribution now median is not equal to mean: negative skew: mean-median < 0, tail to the left (low values) 1 stdev = 2/3 of data positive skew: mean-median > 0, tail to the right (high values)
2 stdev = 95% of data
-5.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 5.0 Log-normal distribution Standardized data descriptors
If your data describe a phenomenon with one central value and random disturbances around this value: will trend to a normal distribution mode mean = median
median
1 stdev = 67% of data mean
2 stdev = 95% of data
min range max
Standardized data descriptors Robust descriptors: median and percentiles
Unfortunately, many data sets are not normally distributed The median - middle value - is a robust indicator that is not influenced by the range in the data is identical, but the data distribution has changed outliers. Now need an estimator of the spread: the interquartile range IQR
interquartile range - IQR 50 percentile P 25 median P 75 median
mean 0 1 2 3 4 5 6 7 8 9 10
1 stdev 10 percentile 40 percentile 80 percentile
min range max Robust descriptors: median and percentiles Robust descriptors: median and percentiles
The median - middle value - is a robust indicator that is not influenced by The median - middle value - is a robust indicator that is not influenced by outliers. Now need an estimator of the spread: the interquartile range IQR outliers. Now need an estimator of the spread: the interquartile range IQR
50% of the distribution P 16 median ± IQR P 84 P 16 median ± IQR P 84 P 25 median P 75 P 25 median P 75
0 1 2 3 4 5 6 7 8 9 10 0 1 3 9 10 15 16 17 18 22 23
-1 stdev mean +1 stdev -1 stdev mean +1 stdev mean ± 1 stdev mean ± 1 stdev 67% of the distribution
Robust descriptors: median and percentiles Hours of Netflix watched per week
The median - middle value - is a robust indicator that is not influenced by Median + IQR (or P84-P16) is a robust indicator of characteristic value + spread, outliers. Now need an estimator of the spread: the interquartile range IQR whereas mean ± stdev is not-robust and sensitive to outliers:
Hours of Netflix watched per week for a group of students: P 16 P 84 median ± IQR 2,4,6,8,10 mean = 6, median = 6 P 25 median P 75 2,4,6,8,60 mean = 16, median = 6
Including the stdev and P84-P16 indicators of spread: 0 1 3 9 10 15 16 17 18 22 62 2,4,6,8,10 mean = 6 ± 3, median = 6 -3,+3 2,4,6,8,60 mean = 16 ± 25, median = 6 -3,+21
-1 stdev mean +1 stdev This says that 2/3 of the data fall between -9 and +41 in the case of the mean. mean ± 1 stdev Although true, this does not describe the data well at all !
mean - 1 stdev < minimum value ! What is an outlier ? Summarizing your data
Outliers are extreme values in a dataset, but these are not necessarily caused by By reporting a dataset’s characteristic value and its spread as mean ± stdev, things like measurement errors: outliers are NOT faulty data, although they can be the reader has an expectation of the data distribution, which is only correct for a normal distribution. Median + IQR is generally more appropriate and correct. Better definition (from Banett and Lewis): a set of data that are inconsistent with the remainder of the dataset. In exploration geochemistry one of the tasks is in fact the mean median identification of outliers.
1 stdev = P84 - P16 = 68% of data 68% of data Outliers are always defined relative to a data distribution, because they are values that are not expected for that given data distribution. P97.5 - P2.5 = 2 stdev = 95% of data A data distribution essentially tells you the probability of finding a given value. This 95% of data is useful, because it allows us to designate a value as an outlier:
for example, if the probability of that value occurring in my distribution is less than 1%, I will classify the value as an outlier For a non-normal distribution, spread is generally asymmetric when using median + However, this depends on the distribution of the data ! percentiles. This immediately gives information on the distribution of the data !
Ratio and logarithmic data in spider-diagrams Log-normal transformation
log (flux) most statistical techniques cannot deal with a lognormal distribution -> !(((((("#$%&' transform it to a normal distribution U flux of volcano in kg/yr
16 median and P 16 84 median + P mean ± 1s 84 1200 !((("#$%&' 1000 700 450 log transform
4 !"#$%&' 16 median and P 84
/ = A -. -< ;. B, 2' 7. 3' 48 -: D -* ?C )* 4* 01 4, 46 2, -> 56 0$ Cu 3@ 3= ?$ Au 9: 9* 76 +, += -300 ! mean and stdev Not only are median + IQR robust, these properties also represent real values Benefits of a robust indicator - example Median absolute deviation - MAD
The name “robust” refers to these parameters not being sensitive to outliers Sometimes it is impractical to have a lower and upper uncertainty on the or to addition of small sets of data: the value stays the same. This is in sharp median, and one characteristic value for robust spread is needed: MAD contrast to the mean, for example, which changes with every value added The median absolute deviation is the robust equivalent of the stdev. Ni (ppm) MAD = the median of the absolute deviations from the data median 34 55 Pb content |deviation| sorted mean = 40 91 167 157 23 (ppm) from median |deviation| stdev = ±13 ±169 ±308 ±291 25 10 10 0 31 median = 40 41 42 41 10 10 0 65 P25 = -10.5 -10 -10 -9.5 20 0 10 39 P75 = +7.5 +14 +20.5 +20 20 0 10 MAD 45 40 20 20 41 60 40 40 43 90 70 70 600 1000 Standard deviation and MAD di#er by a scaling factor. For the normal distribution, 40 this scaling is stdev = 1.4826 $ MAD
Confidence level on your data descriptors Bootstrapping - example with PAST
It is very useful to know what the confidence is on your central value and its spread: How much is my mean likely to shift if I collect more data, assuming that my pilot study is representative?
If you know your data distribution, this can be calculated exactly. However, in geochemistry, we generally estimate the distribution from the data we have.
Ni Bootstrapping: Ni Ni Ni Ni (ppm 34 (ppm34 (ppm23 (ppm31 (ppm39 subsampling etc 55 55 25 65 45 23 your dataset 23 31 39 60 25 calculating the 31 parameters on Ni Ni Ni Ni 65 these subsets (ppm34 (ppm23 (ppm31 (ppm39 39 23 31 39 60 etc 45 resulting spread: 31 39 60 55 60 confidence level 2.50 3.00 3.00
2.00
2.00 2.00 1.50
Other deviations from normality 1.00 Multi-modal datasets: need to split them up 1.00 1.00
0.50 Multi-modal datasets: datasets that represent multiple samples or processes Many other deviations from normality: 0.00 0.00 0.00 1.5 1.7 1.9 2.1 2.3 2.5 2.7 2.9 0to 0.5 interpret 1 such 1.5 datasets 2 you 2.5 will need 3 to0 split them50 up, otherwise100 you 150look at 200 -0.50 outliers - need robust estimators a mixed signal. But how to split up a dataset; where to put the boundary ? -1.00 -1.00 -1.00
bimodality or multimodality - data set will have-1.50 to be split Individual distributions in a multi-modal dataset are likely to overlap -2.00 -2.00 kurtosis - steepness of the distribution -2.00 Probability plots allow you to determine where to split a dataset -2.50 -3.00 -3.00 One way to check for normality is the cumulative frequency plot
How to deal with multi-modal data sets How to deal with multi-modal data sets
Probability plot allows for identifying deviations from normality and multi-modality Have to split up the data set into groups: probability plots
2.50 3.00 3.00 90 2.00 normal bi-modal tri-modal
2.00 2.00 80 1.50
70 1.00 1.00 1.00 60 0.50
50 0.00 0.00 0.00 1.5 1.7 1.9 2.1 2.3 2.5 2.7 2.9 0 0.5 1 1.5 2 2.5 3 0 50 100 150 200 40 -0.50 -1.00 -1.00 -1.00 30 -1.50 20 -2.00 -2.00 central value of the class -2.00 10 -2.50 -3.00 -3.00 0 3 6 5 1 2 1 5 20 30 40 50 60 70 80 9995 / / / / /
17 17 17 17 17 cumulative frequency (%) 0-20 20-40 40-60 60-80 80-100 11 16 17 5 2 / / / / / 17 17 17 17 17 -2 -1 0 1 2
0-20 0-40 0-60 0-80 0-100 -2s -1s mean +1s +2s