Data Analysis and Geostatistics - Lecture III Data Visualization - Histograms, Box Plots


Previous lecture

previous lecture: overview of statistical techniques and concepts

data descriptors - median, mode, mean, stdev, IQR


data analysis - separation (DFA) and clustering - process recognition (PCA & factor) - curve fitting (regression)

also: impartiality of the observer -> influences your confidence intervals

lack of appropriate control group in the Earth Sciences

start with core techniques and then move on to more complicated stats

Univariate statistics and the use of data descriptors

Data and nomenclature

what are data and why do we gather them?

a datum is a measurement of a property on a sample...

where property can be density, length, ppm Ca, thermal conductivity
sample can be a rock, soil, water, plant

...intended to give us a value for the material where this sample came from

we are actually not interested in the composition of the sample, but rather in the composition of the source of this sample

brings us to the concept of a statistical population and sample

Populations and samples

given a lava flow:

The complete lava flow is the population - if you want to know the exact composition of the population, you have to analyze it in its entirety

obviously impossible:

instead: analyze a representative sample of this population - estimate the properties of the host population

Estimating the properties of a population

From a set of samples, we can estimate the properties of the population, such as its characteristic value (mean or median) and the spread in values.

spread ≠ error!

A jeans shop will have a mean size, but it will also stock a spread of sizes

This mean + spread is the shop's estimate of the jeans sizes for its clientele population

Populations and samples

A representative sample has to cover all data characteristics of the host population:

- its central value (mean, median)
- the spread in the data (stdev, IQR)
- the data distribution (lognormal, modality)
- the relations with other variables

invariably it requires more than one geological sample to obtain a representative statistical sample

the number of samples depends on the characteristics of the host population, but also on the sampling technique employed, the sample treatment and the analytical technique

e.g. granite vs. basalt, spot samples vs. mixtures, soil vs. stream sediments, mixing of crushed or milled rocks, field variance vs. lab variance

Populations and samples

The statistics on national health suggest that 1 out of every 4 Americans, or 1 out of every 5 Canadians, will suffer from a certain type of illness in their lifetime

This means 3 in the current Geostats cohort...

Is this reasoning correct?

Representative samples

If you know something about your material you can estimate the number of samples you will need to get a representative sample of the population

[Figure: sample size in kg (0.001 to 1000) against grain size in weight (μg, mg, g), with curves for concentrations ranging from ppb to wt%]

Required sample size for a representative sample

[Figure: sample size in mass (1 g to 1 Gg) against garnet diameter (80 μm, 0.8 mm, 8 mm, 10 cm; for ρ = 4 g cm^-3), with curves for concentrations (mg kg^-1) of 1 ppb, 100 ppb, 10 ppm, 1000 ppm and 10 wt%]

we would essentially need a sample bigger than the complete outcrop

Populations and samples

In geology we generally no longer have the population at our disposal, e.g. due to erosion, weathering and alteration

all the more important to make sure that your sample is representative

increasing number of samples -> when data characteristics no longer change -> representative sample

[Figure: concentration and stdev against number of samples n]

can estimate this if you know something of your samples: pilot sampling
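The stopping rule above (keep adding samples until the data characteristics no longer change) can be sketched as a running calculation. This is only an illustration: the "pilot" concentrations are synthetic values drawn from a seeded random generator, and the function name `running_descriptors` is made up for this example.

```python
import random
import statistics

def running_descriptors(values):
    """Mean and stdev of the first n values, for n = 2 .. len(values)."""
    rows = []
    for n in range(2, len(values) + 1):
        subset = values[:n]
        rows.append((n, statistics.mean(subset), statistics.stdev(subset)))
    return rows

# Synthetic pilot concentrations (ppm): invented numbers, not real data.
rng = random.Random(42)
pilot = [rng.lognormvariate(3.0, 0.5) for _ in range(200)]

rows = running_descriptors(pilot)
for n, mean, stdev in rows:
    # Print a few sample counts to see the descriptors settle down.
    if n in (5, 10, 25, 50, 100, 200):
        print(f"n = {n:3d}  mean = {mean:6.2f}  stdev = {stdev:6.2f}")
```

Once the printed mean and stdev stop shifting appreciably from one row to the next, adding more samples of this kind changes little, which is the sense in which the sample has become representative.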

Infill drilling: vein-type deposit with halo

[Figure: maps for 1, 2, 3, 4 and 7 out of 7 lines sampled; % of area that is ore and resource plotted against fraction of lines sampled]

Infill drilling: disseminated deposit

[Figure: maps for 1, 2, 3, 4 and 7 out of 7 lines sampled; % of area that is ore and resource plotted against fraction of lines sampled]

Types of data

Not all data are equal in "quality" and this requires specific stats for some

- ratio scale data: the most versatile of all. They have a natural zero point. e.g. charge, weight, length, concentration
- interval scale data: the intervals between the values are constant, but they do not have a natural zero point. e.g. °F, °C
- closed data: the data sum to a specified value. e.g. wt%, % of a core. Note the closure problem in these.
- ordinal scale data: the intervals between the values are not constant. e.g. Mohs hardness scale of minerals
- discrete data: only certain values are allowed, mostly the integers. e.g. number of grains in a sample. Not ppm!
- categorical data: non-numerical observations. e.g. colour, presence/absence of a feature in a fossil.

Ways to analyze your data

- univariate: each variable is analyzed separately: data distribution, central value and data spread/uncertainty
- bivariate: two variables are analyzed together to look for correlation or separation of data - regression
- multivariate: more than 2 variables are analyzed together. Generally difficult to visualize data and results
- spatial statistics: variation of variables in space, either 1D (well logs), 2D and 3D (topography) or >3D, but some have to be spatial!
- time series: variation of variables along a time progression

We will start with univariate techniques - the distribution of data

Univariate statistics

repeated analyses of the same sample, or a variety of samples from the same host population, will not return an identical value due to:

- sample heterogeneity: % olivine in each sample
- lava heterogeneity: layering or phase separation
- analytical uncertainty: not every ion makes it to the detector

analytical uncertainty ~ error, but heterogeneity is a property of the host population and is not error -> both result in uncertainty on your estimate of the central population value

e.g. average Pb content of all Canadian rocks

Data visualization

To understand your data: plot their distribution!

[Figure: dot plots and histograms of weight % SiO2 (45 to 52) and Pb (ppm) (78 to 120)]

For these distributions you can now define a central value and spread:

population: $\mu = \frac{\sum x_i}{n}$, $\sigma^2 = \frac{\sum (x_i - \mu)^2}{n}$

sample: $\bar{x} = \frac{\sum x_i}{n}$, $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
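As a small worked check of the definitions of central value and spread, the sketch below computes the mean together with the population variance (divide by n) and the sample variance (divide by n - 1) by hand and compares them with Python's `statistics` module. The five values are illustrative only.

```python
import math
import statistics

values = [2, 4, 6, 8, 10]  # illustrative measurements
n = len(values)

mean = sum(values) / n
sum_sq = sum((x - mean) ** 2 for x in values)

population_variance = sum_sq / n       # sigma^2: divide by n
sample_variance = sum_sq / (n - 1)     # s^2: divide by n - 1

# The standard library uses the same two definitions.
assert math.isclose(population_variance, statistics.pvariance(values))
assert math.isclose(sample_variance, statistics.variance(values))

print(f"mean = {mean}, sigma^2 = {population_variance}, s^2 = {sample_variance}")
print(f"s = {math.sqrt(sample_variance):.2f}")
```

The n - 1 form is the one to use when the values are a sample from which the population spread is estimated, which is the usual situation in this course.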

The normal or Gaussian distribution

If your data describe a phenomenon with one central value and variance around it due to many different disturbances: will trend to normal at high n

$$ f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2} $$

mean = median
total surface area = 1
mean ± 1 stdev = about 68% (roughly 2/3) of data
mean ± 2 stdev = 95% of data

[Figure: normal curve, x axis from -5.0 to 5.0]

Normality in data sets

normal distribution only when looking at one phenomenon, when all variation is averaged out, or when one phenomenon is dominant

So: in most cases in geology -> deviations from normality

skewness is especially common and leads to the lognormal distribution

now median is not equal to mean:
negative skew: mean-median < 0, tail to the left (low values)
positive skew: mean-median > 0, tail to the right (high values)

Log-normal distribution

[Figure: skewed distribution with mode, median and mean marked]

Standardized data descriptors

If your data describe a phenomenon with one central value and random disturbances around this value: will trend to a normal distribution

[Figure: normal distribution with mean = median, min, max and range marked; mean ± 1 stdev = about 68% of data, mean ± 2 stdev = 95% of data]
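The coverage figures quoted for the normal distribution can be checked numerically. The sketch below evaluates the Gaussian density, integrates it with a simple midpoint rule to confirm a total area of 1, and uses the error function for the fraction of data within mean ± k stdev. Function names are chosen for this example only.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian density f(x) for mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def coverage(k):
    """Fraction of a normal distribution within mean +/- k stdev."""
    return math.erf(k / math.sqrt(2.0))

# Midpoint-rule integration from -8 to +8 stdev in steps of 0.01.
step = 0.01
area = sum(normal_pdf(-8.0 + (i + 0.5) * step) * step for i in range(1600))

print(f"total area      = {area:.6f}")
print(f"within 1 stdev  = {coverage(1):.4f}")
print(f"within 2 stdev  = {coverage(2):.4f}")
```

Strictly, ± 2 stdev covers slightly more than 95% of a normal distribution; the exact 95% interval is ± 1.96 stdev, so the rounded figure on the slides is a convenient approximation.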

Standardized data descriptors

Unfortunately, many data sets are not normally distributed

the range in the data is identical, but the data distribution has changed

[Figure: skewed distribution with min, max, range, mean, median and 1 stdev marked]

Robust descriptors: median and percentiles

The median - middle value - is a robust indicator that is not influenced by outliers. Now need an estimator of the spread: the interquartile range IQR

[Figure: data 0 1 2 3 4 5 6 7 8 9 10 with the median (50th percentile), P25 and P75 marked; the interquartile range IQR spans 50% of the distribution; the 10th, 40th and 80th percentiles are shown as examples]

[Figure: the same data with P16, P25, median, P75 and P84 (median ± IQR) compared to mean ± 1 stdev, which spans about 2/3 of the distribution]

[Figure: data 0 1 3 9 10 15 16 17 18 22 23 with P16, P25, median, P75 and P84 compared to mean ± 1 stdev]

Robust descriptors: median and percentiles

[Figure: data 0 1 3 9 10 15 16 17 18 22 62 with P16, P25, median, P75 and P84 compared to mean ± 1 stdev]

Hours of Netflix watched per week

Median + IQR (or P84-P16) is a robust indicator of characteristic value + spread, whereas mean ± stdev is not robust and is sensitive to outliers:

Hours of Netflix watched per week for a group of students:
2,4,6,8,10 mean = 6, median = 6
2,4,6,8,60 mean = 16, median = 6

Including the stdev and P84-P16 indicators of spread:
2,4,6,8,10 mean = 6 ± 3, median = 6 -3,+3
2,4,6,8,60 mean = 16 ± 25, median = 6 -3,+21

This says that 2/3 of the data fall between -9 and +41 in the case of the mean. Although true, this does not describe the data well at all!
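The Netflix numbers can be reproduced with a few lines of Python. The slides do not say which percentile convention was used; the sketch assumes linear interpolation between sorted values (the default in many packages), which happens to reproduce the quoted figures. The helper names `percentile` and `summarize` are made up for this example.

```python
import statistics

def percentile(values, p):
    """Percentile p (0-100) by linear interpolation between sorted values (assumed convention)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100.0
    low = int(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (rank - low) * (ordered[high] - ordered[low])

def summarize(hours):
    """Return mean, stdev, median and the distances to P16 and P84."""
    median = statistics.median(hours)
    return {
        "mean": statistics.mean(hours),
        "stdev": statistics.stdev(hours),
        "median": median,
        "minus": median - percentile(hours, 16),
        "plus": percentile(hours, 84) - median,
    }

for hours in ([2, 4, 6, 8, 10], [2, 4, 6, 8, 60]):
    s = summarize(hours)
    print(f"{hours}: mean = {s['mean']:.0f} ± {s['stdev']:.0f}, "
          f"median = {s['median']} -{s['minus']:.0f}, +{s['plus']:.0f}")
```

Replacing a single value (10 by 60) moves the mean from 6 to 16 and inflates the stdev, while the median stays at 6 and only the upper percentile distance grows, which is the asymmetry that the median-based summary reports.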

mean - 1 stdev < minimum value!

What is an outlier?

Outliers are extreme values in a dataset, but these are not necessarily caused by things like measurement errors: outliers are NOT faulty data, although they can be

Better definition (from Barnett and Lewis): a set of data that are inconsistent with the remainder of the dataset. In exploration geochemistry one of the tasks is in fact the identification of outliers.

Outliers are always defined relative to a data distribution, because they are values that are not expected for that given data distribution.

A data distribution essentially tells you the probability of finding a given value. This is useful, because it allows us to designate a value as an outlier:

for example, if the probability of that value occurring in my distribution is less than 1%, I will classify the value as an outlier

However, this depends on the distribution of the data!

Summarizing your data

By reporting a dataset's characteristic value and its spread as mean ± stdev, the reader has an expectation of the data distribution, which is only correct for a normal distribution. Median + IQR is generally more appropriate and correct.

[Figure: normal distribution with mean and median marked; mean ± 1 stdev = P84 - P16 = 68% of data; mean ± 2 stdev = P97.5 - P2.5 = 95% of data]

For a non-normal distribution, spread is generally asymmetric when using median + percentiles. This immediately gives information on the distribution of the data!
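Returning to the 1% outlier criterion, the sketch below shows how the verdict depends on the assumed distribution. It interprets "probability of that value occurring" as a two-sided tail probability, which is an assumption of this example, and it uses ten invented concentrations (ppm) plus one invented candidate value. The same candidate is tested against a normal distribution fitted to the raw values and against one fitted to their logarithms (a lognormal model).

```python
import math
import statistics

def tail_probability(value, data):
    """Two-sided probability of a deviation at least this large under a normal fit to data."""
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    z = abs(value - mu) / sigma
    return 2.0 * (1.0 - statistics.NormalDist().cdf(z))

# Invented concentrations (ppm) and an invented candidate value.
data = [2, 4, 5, 8, 10, 12, 20, 25, 40, 80]
candidate = 120

p_normal = tail_probability(candidate, data)
p_lognormal = tail_probability(math.log10(candidate),
                               [math.log10(x) for x in data])

for name, p in (("normal", p_normal), ("lognormal", p_lognormal)):
    verdict = "outlier" if p < 0.01 else "not an outlier"
    print(f"{name:9s}: p = {p:.5f} -> {verdict}")
```

Under the normal model the candidate is flagged, whereas under the lognormal model it is an unremarkable member of the right-hand tail, so the choice of distribution has to be justified before any value is labelled an outlier.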

Ratio and logarithmic data in spider-diagrams

[Figure: spider diagram of element ratios on a logarithmic axis, comparing median with P16 and P84 against mean and stdev; a value of -300 is marked for mean and stdev]

Not only are median + IQR robust, these properties also represent real values

Log-normal transformation

most statistical techniques cannot deal with a lognormal distribution -> transform it to a normal distribution

[Figure: U flux of volcano in kg/yr before and after log transform, with log (flux) axis; median with P16 and P84 compared to mean ± 1s]

Benefits of a robust indicator - example

The name "robust" refers to these parameters not being sensitive to outliers or to addition of small sets of data: the value stays the same. This is in sharp contrast to the mean, for example, which changes with every value added

Ni (ppm): 34 55 23 25 31 65 39 45 41 43 600 1000 40

Descriptors recalculated as values are added (P25 and P75 given relative to the median):

mean   =   40      91     167     157
stdev  =  ±13    ±169    ±308    ±291
median =   40      41      42      41
P25    = -10.5    -10     -10    -9.5
P75    =  +7.5    +14   +20.5     +20

Median absolute deviation - MAD

Sometimes it is impractical to have a lower and upper uncertainty on the median, and one characteristic value for robust spread is needed: MAD

The median absolute deviation is the robust equivalent of the stdev.

MAD = the median of the absolute deviations from the data median

Pb content (ppm)   |deviation| from median   sorted |deviation|
10                 10                        0
10                 10                        0
20                 0                         10
20                 0                         10   <- MAD
40                 20                        20
60                 40                        40
90                 70                        70

Standard deviation and MAD differ by a scaling factor. For the normal distribution, this scaling is stdev = 1.4826 × MAD
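The MAD recipe can be written out directly from its definition. The sketch below uses the Pb values from the table (10, 10, 20, 20, 40, 60, 90 ppm) and, as a second illustration, the first ten Ni values with the 600 ppm value added. The function name `mad` is chosen for this example.

```python
import statistics

def mad(values):
    """Median of the absolute deviations from the data median."""
    med = statistics.median(values)
    return statistics.median([abs(x - med) for x in values])

pb = [10, 10, 20, 20, 40, 60, 90]   # Pb content (ppm) from the MAD table
print(f"median = {statistics.median(pb)}, MAD = {mad(pb)}")
print(f"1.4826 x MAD = {1.4826 * mad(pb):.1f}, stdev = {statistics.stdev(pb):.1f}")

# Robustness: one extreme value shifts the mean strongly, the median barely.
ni = [34, 55, 23, 25, 31, 65, 39, 45, 41, 43]   # Ni (ppm)
for label, data in (("original", ni), ("with 600 added", ni + [600])):
    print(f"{label:15s}: mean = {statistics.mean(data):.0f}, "
          f"median = {statistics.median(data):.0f}, MAD = {mad(data):.1f}")
```

The factor 1.4826 only makes the scaled MAD comparable to the stdev for normally distributed data; for a skewed dataset such as the Pb values the two numbers are not expected to agree.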

Confidence level on your data descriptors

It is very useful to know what the confidence is on your central value and its spread: How much is my mean likely to shift if I collect more data, assuming that my pilot study is representative?

If you know your data distribution, this can be calculated exactly. However, in geochemistry, we generally estimate the distribution from the data we have.

Bootstrapping - example with PAST

Bootstrapping: subsampling your dataset, calculating the parameters on these subsets, etc. The resulting spread is the confidence level.

[Figure: a Ni (ppm) dataset repeatedly subsampled into smaller Ni subsets, with the parameters calculated on each subset]
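A minimal bootstrap sketch is given below. It assumes the usual form of the technique, resampling with replacement to the original sample size, which may differ in detail from what PAST does; the number of resamples, the seed and the function name `bootstrap_means` are choices made for this example. The data are the ten Ni values from the robust-indicator example.

```python
import random
import statistics

def bootstrap_means(values, n_resamples=2000, seed=0):
    """Means of resamples drawn with replacement, each the size of the original data."""
    rng = random.Random(seed)
    n = len(values)
    return [statistics.mean(rng.choices(values, k=n)) for _ in range(n_resamples)]

ni = [34, 55, 23, 25, 31, 65, 39, 45, 41, 43]   # Ni (ppm)

means = sorted(bootstrap_means(ni))
lower = means[int(0.025 * len(means))]          # 2.5th percentile of the resampled means
upper = means[int(0.975 * len(means)) - 1]      # 97.5th percentile

print(f"mean = {statistics.mean(ni):.1f} ppm")
print(f"95% bootstrap interval on the mean: {lower:.1f} to {upper:.1f} ppm")
```

The same loop works for any descriptor: swapping `statistics.mean` for `statistics.median` gives a confidence interval on the median instead.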

Other deviations from normality

Many other deviations from normality:

- outliers - need robust estimators
- bimodality or multimodality - data set will have to be split
- kurtosis - steepness of the distribution

One way to check for normality is the cumulative frequency plot

Multi-modal datasets: need to split them up

Multi-modal datasets: datasets that represent multiple samples or processes

to interpret such datasets you will need to split them up, otherwise you look at a mixed signal. But how to split up a dataset; where to put the boundary?

Individual distributions in a multi-modal dataset are likely to overlap

Probability plots allow you to determine where to split a dataset

[Figure: probability plots of three example datasets]

How to deal with multi-modal data sets

Probability plot allows for identifying deviations from normality and multi-modality

[Figure: probability plots of a normal, a bi-modal and a tri-modal dataset]

Have to split up the data set into groups: probability plots

[Figure: construction of a probability plot for 17 values: frequencies (out of 17) in the classes 0-20, 20-40, 40-60, 60-80 and 80-100 are accumulated into the classes 0-20, 0-40, 0-60, 0-80 and 0-100, and the cumulative frequency (%) is plotted on a probability axis (1 to 99%, equivalent to -2s, -1s, mean, +1s, +2s) against the central value of the class]
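The coordinates of a probability plot can be computed without any plotting library. The sketch below works on individual sorted values rather than on binned classes as in the figure, and it assumes the plotting position (i - 0.5)/n, which is one common convention among several. The twelve values are invented and deliberately bimodal; the function name is made up for this example.

```python
import statistics

def probability_plot_points(values):
    """Return (value, cumulative %, normal score) for each sorted value."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = statistics.NormalDist()
    points = []
    for i, x in enumerate(ordered, start=1):
        p = (i - 0.5) / n                       # assumed plotting position
        points.append((x, 100.0 * p, std_normal.inv_cdf(p)))
    return points

# Invented bimodal data: one group near 1.8 and one near 2.8.
data = [1.6, 1.7, 1.8, 1.8, 1.9, 2.0, 2.1, 2.6, 2.7, 2.7, 2.8, 2.9]

points = probability_plot_points(data)
for x, percent, z in points:
    print(f"value = {x:4.1f}  cumulative = {percent:5.1f}%  normal score = {z:+.2f}")
```

Plotting value against normal score gives a straight line for normally distributed data; a kink or step in the line, here between 2.1 and 2.6, marks a candidate boundary at which to split the dataset.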