Exercise 3 Analysis of Occurrence Data

TWS on Risk Management of Contaminants (MAFF, Japan) 2019/1/24

Training Workshop for Building Capacities
“Risk management of contaminants in foods”
January 24, 2019, Tokyo, Japan

Exercise 3 Analysis of Occurrence data
Takanori UKENA, Ph.D.
[email protected]

Exercise 3
3.1 Data aggregation, calculation of basic statistics (maximum, minimum, mean, median)
3.2 Creating a frequency table, histogram
3.3 Calculation of high percentile

Ministry of Agriculture, Forestry and Fisheries (MAFF)
Food Safety and Consumer Affairs Bureau

Exe 3.1 Data aggregation

Data analysis using occurrence data

Purpose:
• To estimate the population (e.g. nationwide situation) from sample data
• For further consideration
  – setting maximum level
  – evaluating effectiveness of risk management measures
  – time-course analysis

Analysis of surveillance results
1. Laboratory conditions
• Sampling plan
• Internal quality control
  – LOD, LOQ and their definitions
  – Calibration curve

Analysis of surveillance results
2. Dataset
• Basic statistics: mean, median, maximum and minimum value
• Results below LOD or LOQ (Replace)


Analysis of surveillance results
5. Statistical analysis
• distribution model
• estimation of high percentiles
• exposure assessment

Basic statistics
Estimation of population from sample data.
Population: size N, mean μ, standard deviation σ
  (sampling: population → sample; inference: sample → population)
Sample: size n, mean m, standard deviation s

• Maximum value
• Minimum value
• Range
• Average
• Mean (arithmetic mean)
• Median
• Variance
• Sample standard deviation

Median, Mode

• Median
  – the middle value in a set of values arranged in order of size
  – the average of the two middle values if there is no single middle value
  – a robust measure of central tendency: compared with the mean, the median is robust to outlier values
• Mode
  – the value (or set of values) in a dataset that appears most often (most frequently occurring value)

Percentile (%ile)

• The values are ranked in ascending order, i.e. from smallest to largest.
• A percentile is a number below which a certain percentage of observations fall. For example, the 20th percentile is the value below which 20% of the observations may be found.
• 0th percentile (0%ile): minimum value
• 25th percentile (25%ile): first quartile (Q1)
• 50th percentile (50%ile): median or second quartile (Q2)
• 75th percentile (75%ile): third quartile (Q3)
• 100th percentile (100%ile): maximum value
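The ranking and percentile definitions above can be sketched in a few lines of Python. This is only an illustration: the data values are hypothetical, the helper name `percentile` is made up for this example, and linear interpolation between ranked values is one of several percentile conventions in use, so other software may return slightly different results.

```python
import statistics

# Hypothetical occurrence data (mg/kg), for illustration only
data = [0.21, 0.35, 0.50, 0.64, 0.79, 0.90, 1.00, 1.01, 1.12]

def percentile(values, p):
    """Return the p-th percentile (0 <= p <= 100) by linear interpolation."""
    ranked = sorted(values)               # rank in ascending order
    pos = (len(ranked) - 1) * p / 100     # fractional position in the ranking
    lower = int(pos)
    upper = min(lower + 1, len(ranked) - 1)
    return ranked[lower] + (ranked[upper] - ranked[lower]) * (pos - lower)

summary = {
    "0%ile (minimum)": percentile(data, 0),
    "25%ile (Q1)": percentile(data, 25),
    "50%ile (median, Q2)": percentile(data, 50),
    "75%ile (Q3)": percentile(data, 75),
    "100%ile (maximum)": percentile(data, 100),
}
for name, value in summary.items():
    print(f"{name}: {value}")

# The mode is the most frequently occurring value; multimode returns all ties.
modes = statistics.multimode([0.1, 0.2, 0.2, 0.3])
print("mode(s):", modes)
```

With this convention the 50th percentile coincides with the median, and the 0th and 100th percentiles coincide with the minimum and maximum.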

Exercise

• Let’s calculate basic statistics of Data1
  – Maximum value
  – Minimum value
  – Range
  – Average
  – Mean (arithmetic mean)
  – Median
  – Variance
  – Sample standard deviation

Parameter

Mean
・population mean: $\mu$
・sample mean: $\bar{x}$
Variance
・population: $\sigma^2$
・sample: $s^2$
Standard deviation
・population: $\sigma$
・sample: $s$


Mean

population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Deviation

• Deviation: difference between the observed value of a variable and the mean, $x_i - \bar{x}$
• Squared deviation from the mean, needed to calculate the sample variance: $\sum_{i=1}^{n} (x_i - \bar{x})^2$

Unbiased estimation of variance

• use of n − 1 in the sample variance formula instead of the sample size n

Population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$
Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$

$N$: number of items in the population
$n$: number of observations in the sample
$x_i$: observed values of the items
$\mu$: population mean
$\bar{x}$: sample mean

Standard deviation (SD)

A measure to quantify the amount of variation or dispersion of a set of data values.

Population standard deviation: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}$
Sample standard deviation: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$
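As a minimal sketch of the basic statistics in this exercise, the following Python computes them directly from the definitions and cross-checks the result against the standard library. The data values are hypothetical stand-ins for Data1, which is not reproduced in this handout.

```python
import math
import statistics

# Hypothetical stand-in for Data1 (mg/kg), for illustration only
data = [0.21, 0.35, 0.50, 0.64, 0.79, 0.90, 1.00, 1.01, 1.12]

n = len(data)
maximum, minimum = max(data), min(data)
data_range = maximum - minimum
mean = sum(data) / n                       # arithmetic mean
median = statistics.median(data)

# Sum of squared deviations from the mean
squared_deviations = sum((x - mean) ** 2 for x in data)

population_variance = squared_deviations / n       # divisor n
sample_variance = squared_deviations / (n - 1)     # divisor n - 1 (unbiased)
sample_sd = math.sqrt(sample_variance)

print(f"max={maximum}, min={minimum}, range={data_range:.2f}")
print(f"mean={mean:.4f}, median={median}")
print(f"sample variance={sample_variance:.4f}, sample SD={sample_sd:.4f}")
```

The sample variance is always larger than the population-style variance computed from the same values, because it divides the same sum of squares by n − 1 rather than n.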

Skewness

Degree of distortion from a symmetrical curve.
Sk < 0: left-skewed, Sk = 0: no skew, Sk > 0: right-skewed

[Figure: histograms of -1*rnorm(100)^2 (left-skewed), rnorm(1000) (no skew) and rnorm(100)^2 (right-skewed); y-axis: Frequency]

$Sk = \frac{1}{n}\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^3$

$n$: sample size
$x_i$: observed values
$\bar{x}$: sample mean
$s$: sample SD

Kurtosis

High kurtosis is an indication of an outlier (or outliers).

[Figure: histogram of rnorm(1000); y-axis: Frequency]

$\text{Kurtosis} = \frac{1}{n}\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^4$

$n$: sample size
$x_i$: observed values
$\bar{x}$: sample mean
$s$: sample SD

Depending on the convention, kurtosis is defined so that the normal distribution gives 3, or 0 when 3 is subtracted (excess kurtosis).
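A small sketch of skewness and kurtosis as standardized third and fourth moments follows. The normalization is an assumption: here the mean of the standardized powers is taken with the sample SD, while spreadsheet and statistics packages often apply additional small-sample corrections, so their values can differ. The function names and data are illustrative only.

```python
import math

def standardized_moment(values, power):
    """Mean of ((x - mean) / s) ** power, with s the sample SD (n - 1 divisor)."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return sum(((x - mean) / s) ** power for x in values) / n

def skewness(values):
    return standardized_moment(values, 3)

def kurtosis(values):
    # Not excess kurtosis: 3 is not subtracted here.
    return standardized_moment(values, 4)

symmetric = [1, 2, 3, 4, 5]          # hypothetical, symmetric around 3
with_outlier = [1, 2, 3, 4, 50]      # hypothetical, one large outlier

print("symmetric:    Sk =", skewness(symmetric), " kurtosis =", kurtosis(symmetric))
print("with outlier: Sk =", skewness(with_outlier), " kurtosis =", kurtosis(with_outlier))
```

The outlier makes the skewness positive (right-skewed) and raises the kurtosis, in line with the statements above.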


Analytical results below LOD, LOQ

How to deal with results below LOD, LOQ?
Assume LOD = 0.03 mg/kg and LOQ = 0.05 mg/kg.
Some results are below LOD or LOQ.

Calculation of LB, MB and UB

Examples
• Lower bound (LB): replacing all the results reported as below the LOD/LOQ by 0
• Medium bound (MB):
  i. replacing all the results reported as below the LOD/LOQ by half their respective LOD/LOQ
  ii. replacing all the results reported as below the LOD by half their respective LOD, and retaining all the results reported between LOD and LOQ
• Upper bound (UB): replacing all the results reported as below the LOD/LOQ by their respective LOD/LOQ
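The substitution rules above can be sketched as follows. The LOD and LOQ are the values assumed on the slide; the result list, the `"<LOD"` / `"<LOQ"` labels and the `substitute` helper are hypothetical choices made for this example, and the medium bound shown is option i (half the respective limit).

```python
LOD = 0.03  # mg/kg, as assumed on the slide
LOQ = 0.05  # mg/kg, as assumed on the slide

# Hypothetical results: numbers are quantified values, strings are censored results
results = [0.12, "<LOD", 0.07, "<LOQ", "<LOD", 0.20]

LIMITS = {"<LOD": LOD, "<LOQ": LOQ}

def substitute(values, factor):
    """Replace censored results by factor * (their respective LOD or LOQ)."""
    return [LIMITS[v] * factor if v in LIMITS else v for v in values]

def mean(values):
    return sum(values) / len(values)

lower_bound = substitute(results, 0.0)    # LB: censored results -> 0
medium_bound = substitute(results, 0.5)   # MB (option i): -> half of LOD/LOQ
upper_bound = substitute(results, 1.0)    # UB: -> LOD/LOQ

print(f"LB mean = {mean(lower_bound):.4f} mg/kg")
print(f"MB mean = {mean(medium_bound):.4f} mg/kg")
print(f"UB mean = {mean(upper_bound):.4f} mg/kg")
```

The gap between the LB and UB means gives a quick indication of how much the censored results influence the estimate.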

Aggregation of two datasets

Can we combine two datasets (data1 and data2) for further analysis?

For example, two datasets obtained by
a. completely different sampling plans for different purposes
b. slightly different sampling plans for the same purpose
c. same sampling plan but different target
d. multi-year surveillance
e. same sampling plan but different basic statistics

Statistical test to compare two datasets

Parametric (normal distribution) or non-parametric distribution?

a. Statistical normality test
   For contaminants, datasets from surveillance usually have a non-parametric distribution.
b. Statistical test to compare medians
c. Statistical test to compare two distributions

Exercise: two major non-parametric statistical tests

a. Mann-Whitney U Test
• sometimes called the Mann-Whitney-Wilcoxon Test or the Wilcoxon Rank Sum Test
• tests whether two independent datasets come from the same population, by comparing their medians
  (The Kruskal–Wallis H test is used to compare medians for three or more independent datasets.)
b. Two-sample Kolmogorov–Smirnov test
• tests whether the two independent datasets come from the same distribution

Mann-Whitney U Test (1)

Assumptions of Mann-Whitney U test
1. All the observations from both datasets are independent of each other.
2. The responses are ordinal (i.e., one can at least say, of any two observations, which is the greater).
3. Under the null hypothesis H0, the distributions of both populations are equal.
4. The alternative hypothesis H1 is that the distributions are not equal.


Mann-Whitney U Test (2)

1. Assign numeric ranks to all the observations (putting the observations from both datasets into one set), beginning with 1 for the smallest value. Where there are groups of tied values, assign a rank equal to the midpoint of the unadjusted rankings.
2. Add up the ranks for the observations which came from dataset 1.
3. Add up the ranks for the observations which came from dataset 2.
4. Statistic U is then given by:

$U_1 = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_1$

where $n_1$, $n_2$ are the sample sizes for dataset 1 and dataset 2 respectively, and $R_1$ is the sum of the ranks in dataset 1.

$U_2 = n_1 n_2 + \frac{n_2 (n_2 + 1)}{2} - R_2$

where $n_1$, $n_2$ are the sample sizes for dataset 1 and dataset 2 respectively, and $R_2$ is the sum of the ranks in dataset 2.

5. The smaller value of $U_1$ and $U_2$ is the one used when consulting significance tables.
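The ranking steps above can be sketched in Python as follows. This is an illustration under stated assumptions: the function names and data are hypothetical, the U statistic is written in one common textbook form (U1 = n1·n2 + n1(n1+1)/2 − R1, and likewise for U2), and the sketch stops at the statistic itself without computing a p-value.

```python
def midranks(values):
    """Rank values from 1 (smallest); tied values share the midpoint of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the group of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mid = (i + j) / 2 + 1            # midpoint of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mid
        i = j + 1
    return ranks

def mann_whitney_u(dataset1, dataset2):
    """Return (U1, U2, U) where U is the smaller of U1 and U2."""
    n1, n2 = len(dataset1), len(dataset2)
    ranks = midranks(list(dataset1) + list(dataset2))   # rank the pooled data
    r1 = sum(ranks[:n1])                                # rank sum of dataset 1
    r2 = sum(ranks[n1:])                                # rank sum of dataset 2
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    return u1, u2, min(u1, u2)

# Hypothetical datasets, for illustration only
data1 = [0.21, 0.35, 0.50, 0.64, 0.79]
data2 = [0.50, 0.90, 1.00, 1.01, 1.12]
u1, u2, u = mann_whitney_u(data1, data2)
print(f"U1 = {u1}, U2 = {u2}, U = {u}")
```

A useful check is that U1 + U2 always equals n1 × n2, whatever the data.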

Two-sample Kolmogorov–Smirnov test (1)

Assumptions of two-sample Kolmogorov–Smirnov test
1. All the observations from both datasets are independent of each other.
2. Under the null hypothesis H0, both samples come from a population with the same distribution.
3. The alternative hypothesis H1 is that both samples do not come from a population with the same distribution.

Two-sample Kolmogorov–Smirnov test (2)

1. The first dataset has size $m$ with an observed cumulative distribution function $F(x)$, and the second dataset has size $n$ with an observed cumulative distribution function $G(x)$.
2. Calculate

$D_{m,n} = \max_x |F(x) - G(x)|$

3. The null hypothesis is rejected at level $\alpha$ if

$D_{m,n} > c(\alpha)\sqrt{\frac{m + n}{m n}}$

Two-sample Kolmogorov–Smirnov test (3)

The value of $c(\alpha)$ is given in the table below for the most common levels of $\alpha$ (two-sided test).

α      0.10   0.05   0.025  0.01   0.005  0.001
c(α)   1.224  1.358  1.480  1.628  1.731  1.949

In general, $c(\alpha) = \sqrt{-\frac{1}{2}\ln\left(\frac{\alpha}{2}\right)}$

Data aggregation

Exercise: Let’s test, using data1 and data2, whether they can be combined for further analysis.
• Mann-Whitney U test
• Two-sample Kolmogorov–Smirnov test
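A minimal sketch of the two-sample Kolmogorov–Smirnov statistic is given below. It assumes the standard large-sample two-sided critical value, c(α) = sqrt(−½ ln(α/2)), and uses hypothetical data and function names; for small samples exact tables or dedicated statistical software are preferable.

```python
import math
from bisect import bisect_right

def ks_statistic(dataset1, dataset2):
    """Largest absolute difference between the two empirical CDFs."""
    s1, s2 = sorted(dataset1), sorted(dataset2)
    m, n = len(s1), len(s2)
    # Both empirical CDFs are step functions that change only at data points,
    # so it is enough to evaluate the difference at every observed value.
    return max(
        abs(bisect_right(s1, x) / m - bisect_right(s2, x) / n)
        for x in s1 + s2
    )

def ks_critical_value(alpha, m, n):
    """Large-sample two-sided critical value for D at significance level alpha."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return c_alpha * math.sqrt((m + n) / (m * n))

# Hypothetical datasets, for illustration only
data1 = [0.21, 0.35, 0.50, 0.64, 0.79, 0.90]
data2 = [0.50, 0.90, 1.00, 1.01, 1.12, 1.30]

d = ks_statistic(data1, data2)
limit = ks_critical_value(0.05, len(data1), len(data2))
print(f"D = {d:.3f}, critical value at alpha 0.05 = {limit:.3f}")
print("reject H0" if d > limit else "do not reject H0")
```

If H0 is not rejected by either test, combining the two datasets is easier to justify, although the sampling plans should still be compared.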


Exe 3.2 Creating a frequency table, histogram

Graphical expression

• Histogram
  – drawing histograms with various bin widths
  – kernel density estimation
• P-P plot (Probability-Probability Plot)
• Q-Q plot (Quantile-Quantile Plot)
• Box plot (box and whisker plot)

Creating a frequency table, and histogram

Exercise: Let’s try to make a frequency table and histogram using a new dataset, combining dataset 1 and dataset 2.
• Frequency table

Making histogram

1. Purpose
   To graphically summarize the distribution of a data set.
2. Steps to make a histogram
• Making a frequency table
  – decide the class interval (bin width); usually ten or more classes are used
  – decide whether a value on a class border is included in the lower or the upper class
• Calculating relative frequency and cumulative frequency
• Making a bar plot with no gap between the bars
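The steps above can be sketched as a small frequency-table function. The data, the class limits and the function name are hypothetical; in this sketch a value on a class border is counted in the upper class (each class includes its lower limit and excludes its upper limit), which is only one possible convention. The frequency density column (relative frequency divided by class width) is included because it is what a density histogram plots.

```python
from bisect import bisect_right

def frequency_table(values, edges):
    """Build a frequency table for classes [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        i = bisect_right(edges, v) - 1     # index of the class containing v
        if not 0 <= i < len(counts):
            raise ValueError(f"{v} is outside the class limits")
        counts[i] += 1

    n = len(values)
    rows, cumulative = [], 0
    for i, frequency in enumerate(counts):
        cumulative += frequency
        width = edges[i + 1] - edges[i]
        rows.append({
            "class": (edges[i], edges[i + 1]),
            "frequency": frequency,
            "relative": frequency / n,
            "cumulative_relative": cumulative / n,
            "density": frequency / n / width,
        })
    return rows

# Hypothetical combined dataset (mg/kg) and class limits, for illustration only
data = [0.21, 0.35, 0.50, 0.64, 0.79, 0.90, 1.00, 1.01, 1.12]
edges = [0.0, 0.25, 0.5, 0.75, 1.0, 1.25]

table = frequency_table(data, edges)
for row in table:
    low, high = row["class"]
    print(f"{low:.2f} to {high:.2f}: n={row['frequency']}, "
          f"rel={row['relative']:.2f}, cum={row['cumulative_relative']:.2f}, "
          f"density={row['density']:.2f}")
```

Note that 0.50 and 1.00 sit exactly on class borders, so the border convention changes which class they are counted in.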

Effect of bin size using same dataset

Try to make histograms with various bin sizes.

[Figure: three histograms of data1 (x-axis: data1, y-axis: Frequency) with too large bin width, good bin width and too small bin width]

Examples of selecting bin size

Number of bins $k$ from bin width $h$: $k = \left\lceil \frac{\max x - \min x}{h} \right\rceil$

Square-root choice: $k = \lceil \sqrt{n} \rceil$

Sturges' formula: $k = \lceil \log_2 n \rceil + 1$

Scott's choice: $h = \frac{3.5\,\hat{\sigma}}{n^{1/3}}$

Freedman–Diaconis' choice: $h = \frac{2\,\mathrm{IQR}}{n^{1/3}}$
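The bin-size rules named above can be compared with a short Python sketch. The formulas coded here are the commonly cited textbook forms of these rules and are stated as an assumption; the data are hypothetical, and the quartiles come from `statistics.quantiles`, whose default method may differ slightly from other software.

```python
import math
import statistics

def bin_rules(values):
    """Suggested number of bins (k) or bin width (h) under four common rules."""
    n = len(values)
    data_range = max(values) - min(values)
    sd = statistics.stdev(values)                    # sample standard deviation
    q1, _, q3 = statistics.quantiles(values, n=4)    # quartiles
    iqr = q3 - q1

    h_scott = 3.5 * sd / n ** (1 / 3)
    h_fd = 2 * iqr / n ** (1 / 3)
    return {
        "square_root_bins": math.ceil(math.sqrt(n)),
        "sturges_bins": math.ceil(math.log2(n)) + 1,
        "scott_width": h_scott,
        "scott_bins": math.ceil(data_range / h_scott),
        "freedman_diaconis_width": h_fd,
        "freedman_diaconis_bins": math.ceil(data_range / h_fd),
    }

# Hypothetical dataset of 100 evenly spread values, for illustration only
values = [i / 100 for i in range(100)]
rules = bin_rules(values)
for name, value in rules.items():
    print(f"{name}: {value}")
```

The rules rarely agree exactly, so it is worth drawing the histogram with several of the suggested bin sizes before settling on one.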


Shape of histogram and distribution

• Unimodal (one peak) and symmetric
• Unimodal and asymmetric: right-skewed or left-skewed
• Multimodal (two or more peaks): analyze the dataset after classifying it into unimodal sub-datasets
• Containing outlier

[Figure: example density histograms of x for each shape, with density curves; y-axis: Density]

Histogram and Density histogram

• Histogram: Y = frequency
• Density histogram: Y = frequency density
  density = relative frequency / bin width
  relative frequency = frequency / sample size

[Figure: the same data shown as a histogram (frequency display) and as a density histogram (density display)]

A density histogram is comparable to theoretical probability density distributions.

Kernel density estimation (1)

• Estimation of a distribution that does not depend on the bin number or class interval of a histogram
• The KDE smooths each data point $X_i$ into a small density bump and then sums all these small bumps together to obtain the final density estimate.

$\hat{p}_n(x) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{X_i - x}{h}\right)$: kernel density estimator
$K(x)$: kernel function, generally a smooth, symmetric function
$h > 0$: smoothing bandwidth that controls the amount of smoothing

Kernel density estimation (2)

Examples of kernel functions
• Gaussian kernel: $K(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$
• Epanechnikov kernel: $K(x) = \frac{3}{4\sqrt{5}}\left(1 - \frac{x^2}{5}\right)$ for $|x| < \sqrt{5}$, and $0$ for $|x| \ge \sqrt{5}$
• Rectangular kernel: $K(x) = \frac{1}{2}$ for $|x| < 1$, and $0$ for $|x| \ge 1$

The bandwidth (h) plays a key role, the same as the bin width of a histogram.

Kernel density estimation (3)

Similar to the bin size of a histogram, KDE needs a bandwidth to control the amount of smoothing.

• When h is too small, there are many wiggly structures on the density curve.
• When h is too large, some important structures are obscured by the huge amount of smoothing.

Try changing the value of h to choose an appropriate bandwidth. The following is one example formula for selecting h:

$h = \frac{0.9\,\sigma}{n^{1/5}}$

σ: standard deviation (sd), or IQR instead of sd

Examples of kernel density estimation

[Figure: kernel density estimates of x with a Gaussian kernel and a rectangular kernel, each with bandwidth = 0.03903 and bandwidth = 0.0195; y-axis: Density]
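As a sketch of how a kernel density estimate is built from small density bumps, the following Python uses a Gaussian kernel and a rule-of-thumb bandwidth of the form 0.9 × SD / n^(1/5). The data, function names and the choice of the sample SD in the bandwidth rule are assumptions made for illustration.

```python
import math
import statistics

def gaussian_kernel(u):
    """Standard normal density, used as the kernel function K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, data, h):
    """Kernel density estimate at x: average of kernel bumps centred on the data."""
    return sum(gaussian_kernel((xi - x) / h) for xi in data) / (len(data) * h)

def rule_of_thumb_bandwidth(data):
    """One example bandwidth rule: h = 0.9 * SD / n^(1/5)."""
    return 0.9 * statistics.stdev(data) / len(data) ** 0.2

# Hypothetical data (mg/kg), for illustration only
data = [0.21, 0.35, 0.50, 0.64, 0.79, 0.90, 1.00, 1.01, 1.12]
h = rule_of_thumb_bandwidth(data)

# Evaluate the estimate on a grid; halving h gives a wigglier curve
grid = [-2 + 0.01 * k for k in range(601)]
density = [kde(x, data, h) for x in grid]
area = sum(density) * 0.01      # numerical check: a density integrates to about 1

print(f"h = {h:.4f}, area under curve = {area:.4f}")
print(f"density at x = 0.8: {kde(0.8, data, h):.3f} (h), {kde(0.8, data, h / 2):.3f} (h/2)")
```

Re-running with a smaller or larger h shows the trade-off described above between wiggly structure and over-smoothing.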


Probability plot

Comparing two distributions F(x) and G(x) in graphical expression.
⇒ Usually compare an empirical distribution with a theoretical distribution.

• Q-Q plot, P-P plot, CDF plot
  (The normal Q-Q plot and normal P-P plot are used to check whether an empirical distribution follows a normal distribution. The general Q-Q plot or P-P plot is used to compare the distributions of any two datasets.)

The Q-Q plot and P-P plot follow the 45° line if the two distributions agree.

Q-Q (Quantile-Quantile) Plot (1)

• comparing two probability distributions by plotting their quantiles against each other

The Q-Q plot follows the 45° line y = x if the two distributions agree.

[Figure: Q-Q plot of empirical quantiles against theoretical quantiles (gamma)]

Q-Q (Quantile-Quantile) Plot (2)

Example of calculating quantiles of items
• Order the items from minimum (1) to maximum (n)
• Calculate quantiles using the following formula:

$f_i = \frac{i - 0.5}{n}$ or $f_i = \frac{i}{n + 1}$ $(i = 1 \sim n)$

Example:

Observed value   f_i
0.21             0.05
0.35             0.15
0.50             0.25
0.64             0.35
0.79             0.45
0.90             0.55
1.00             0.65
1.01             0.75
1.12             0.85
5.56             0.95

P-P plot, CDF plot

P-P (probability–probability) plot
• Plots the two cumulative distribution functions (CDF) against each other

CDF plot
The cumulative distribution function (cdf) is the probability that the variable takes a value less than or equal to x.
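The quantile calculation in the example can be reproduced with a few lines of Python. The observed values are those of the example table; pairing them with quantiles of a fitted normal distribution is an assumption made here for illustration, since a normal distribution is available in the Python standard library.

```python
from statistics import NormalDist, mean, stdev

# Observed values from the example table, ranked from minimum to maximum
observed = sorted([0.21, 0.35, 0.50, 0.64, 0.79, 0.90, 1.00, 1.01, 1.12, 5.56])
n = len(observed)

# Plotting positions f_i = (i - 0.5) / n for i = 1 .. n
f = [(i - 0.5) / n for i in range(1, n + 1)]

# Theoretical quantiles of a normal distribution with the sample mean and SD
# (assumption: a normal model, chosen only for this illustration)
model = NormalDist(mu=mean(observed), sigma=stdev(observed))
theoretical = [model.inv_cdf(p) for p in f]

# Each Q-Q point pairs a theoretical quantile with the observed value of the same rank
qq_points = list(zip(theoretical, observed))
for (t, o), p in zip(qq_points, f):
    print(f"f_i = {p:.2f}  theoretical = {t:6.2f}  observed = {o:.2f}")
```

In this example the largest observation (5.56) lies far from the 45° line, which suggests that a normal model does not describe these data well.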

P-P (probability–probability) Plot

• plots the two cumulative distribution functions (CDF) against each other

The P-P plot follows the 45° line y = x if the two distributions agree.

[Figure: P-P plot of empirical probabilities against theoretical probabilities (gamma)]

P-P plot

Example of calculating CDF:

Data1   Rank   Cumulative probability
0.21    1      0.11
0.35    2      0.22
0.50    3      0.33
0.64    4      0.44
0.79    5      0.56
0.90    6      0.67
1.00    7      0.78
1.01    8      0.89
1.12    9      1.00


Cumulative distribution function (CDF)

CDF: the area under the probability distribution function from −∞ to x

Example: normal distribution
• 68% in ±1σ
• 95% in ±2σ

[Figure: normal density curve with the 68% and 95% regions marked between −3σ and +3σ, and the CDF of the normal distribution, pnorm(k, 0, 1)]

Graphical Distribution, Box-plot

[Figure: box plot marking the median, 1st quartile, 3rd quartile, whiskers at −1.5 IQR and +1.5 IQR, and outliers (outside 1.5 IQR)]

• Inter-Quartile Range (IQR) = 3rd quartile (75th percentile) − 1st quartile (25th percentile)
• Normalized IQR: 0.7413 × IQR
  If the data are normally distributed, the normalized IQR converges to the standard deviation.
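The quantities shown in a box plot can be computed as in the following sketch. The data are hypothetical, the function name is made up for this example, and the quartiles use the "inclusive" method of `statistics.quantiles`; other quartile conventions shift the box and the 1.5 IQR limits slightly.

```python
import statistics

def box_plot_summary(values):
    """Quartiles, IQR, normalized IQR and outliers beyond 1.5 IQR from the box."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "normalized_iqr": 0.7413 * iqr,
        "lower_limit": lower_limit,
        "upper_limit": upper_limit,
        "outliers": [v for v in values if v < lower_limit or v > upper_limit],
    }

# Hypothetical data (mg/kg) with one high value, for illustration only
data = [0.21, 0.35, 0.50, 0.64, 0.79, 0.90, 1.00, 1.01, 1.12, 5.56]

summary = box_plot_summary(data)
for name, value in summary.items():
    print(f"{name}: {value}")
print("sample SD:", statistics.stdev(data))
```

Because of the single high value, the sample SD is much larger than the normalized IQR here, which illustrates why the normalized IQR is regarded as a robust measure of spread.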

Example of box plot

[Figure: box plots of Data1, Data2 and Data1 + Data2]

Exe 3.3 Calculation of high percentile

Exercise 3.3

Let’s try to calculate basic statistics using data1 in Excel.
• Maximum value
• Minimum value
• Range
• Average
• Mean (arithmetic mean)
• Median
• Variance
• Sample standard deviation

Using @RISK for curve fitting

We want to estimate not only the high percentiles of the sample data but also the high percentiles of the population.

High percentiles of the population are estimated from a model distribution fitted to the real dataset, using statistical computing software.
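To show the idea of estimating population percentiles from a fitted model without any particular software, here is a minimal Python sketch. It assumes an inverse Gaussian model (the distribution used later in this exercise), fits it by maximum likelihood, and finds percentiles by bisection on the CDF. The data and function names are hypothetical, and this is not a description of how @RISK performs its fitting.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf    # standard normal CDF

def fit_inverse_gaussian(data):
    """Maximum likelihood estimates (mu, lam) of an inverse Gaussian distribution."""
    n = len(data)
    mu = sum(data) / n
    lam = n / sum(1 / x - 1 / mu for x in data)
    return mu, lam

def inverse_gaussian_cdf(x, mu, lam):
    if x <= 0:
        return 0.0
    a = math.sqrt(lam / x)
    return PHI(a * (x / mu - 1)) + math.exp(2 * lam / mu) * PHI(-a * (x / mu + 1))

def inverse_gaussian_percentile(p, mu, lam):
    """Value below which p percent of the fitted distribution lies (bisection)."""
    target = p / 100
    lo, hi = 0.0, mu
    while inverse_gaussian_cdf(hi, mu, lam) < target:   # expand the upper limit
        hi *= 2
    for _ in range(100):                                # halve the interval
        mid = (lo + hi) / 2
        if inverse_gaussian_cdf(mid, mu, lam) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical positive occurrence data (mg/kg), for illustration only
data = [0.21, 0.35, 0.50, 0.64, 0.79, 0.90, 1.00, 1.01, 1.12]
mu, lam = fit_inverse_gaussian(data)
p975 = inverse_gaussian_percentile(97.5, mu, lam)
p99 = inverse_gaussian_percentile(99, mu, lam)
print(f"mu = {mu:.4f}, lambda = {lam:.4f}")
print(f"97.5 percentile = {p975:.3f}, 99 percentile = {p99:.3f}")
```

Fitted high percentiles can exceed the largest observed value, which is the point of using a model: it extrapolates beyond the sample, so the goodness of fit should be checked (for example with a P-P plot) before the percentiles are used.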


Select the dataset for distribution fitting

Move the slider to calculate a high percentile in the theoretical distribution


Example: estimating the 95th percentile

Check the P-P plot to assess how closely the two data sets agree

Exercise 3.3

Let’s calculate high percentiles by fitting an inverse Gaussian distribution.
• 97.5 percentile
• 99 percentile

Well done!


Summary: important points

• Data analysis is critical for risk management.

• Exercise by yourself for better understanding.

• In actual situations, collaboration with government scientists, laboratory analytical chemists, and statisticians is needed.