TWS on Risk Management of Contaminants (MAFF, Japan) 2019/1/24
Training Workshop for Building Capacities “Risk management of contaminants in foods”
January 24, 2019, Tokyo, Japan
Exercise 3 Analysis of Occurrence data
Takanori UKENA, Ph.D.
[email protected]
Ministry of Agriculture, Forestry and Fisheries
Food Safety and Consumer Affairs Bureau

Exercise 3
3.1 Data aggregation, calculation of basic statistics (maximum, minimum, mean, median)
3.2 Creating a frequency table, histogram
3.3 Calculation of high percentile
Data analysis using occurrence data
Purpose: To estimate the population (e.g. the nationwide situation) from sample data
For further consideration
• setting maximum level
• evaluating effectiveness of risk management measures
• time-course analysis

Exe 3.1 Data aggregation
Analysis of surveillance results
1. Laboratory conditions
• Sampling plan
• Internal quality control
– LOD, LOQ and their definitions
– Calibration curve
2. Dataset
• Basic statistics
– mean, median, maximum and minimum value
• Results below LOD or LOQ
– Replace
5. Statistical analysis
• distribution model
• estimation of high percentiles
• exposure assessment

Basic statistics
Estimation of population from sample data.
Population: size N, mean μ, standard deviation σ
Sample: size n, mean m, standard deviation s
Sampling: population → sample; inference: sample → population
Maximum value
Minimum value
Range
Average
– Mean (arithmetic mean)
– Median
Variance
Sample standard deviation

Median, Mode
• Median
– the middle value in a set of values arranged in order of size
– the average of the two middle values if there is no one middle value
– a robust measure of central tendency. Compared with the mean, the median is robust to outliers.
• Mode
– the value in a dataset that appears most often (most frequently occurring value)

Percentile (%ile)
The values are ranked in ascending order, i.e. from smallest to largest.
A percentile is a value below which a certain percentage of observations fall. For example, the 20th percentile is the value below which 20% of the observations may be found.
0 percentile (0%ile): minimum value
25 percentile (25%ile): first quartile (Q1)
50 percentile (50%ile): median or second quartile (Q2)
75 percentile (75%ile): third quartile (Q3)
100 percentile (100%ile): maximum value

Exercise
• Let’s calculate basic statistics of Data1
Maximum value
Minimum value
Range
Average
– Mean (arithmetic mean)
– Median
Variance
Sample standard deviation

Parameter
Mean
・population mean: $\mu$
・sample mean: $\bar{x}$
Variance
・population: $\sigma^2$
・sample: $s^2$
Standard deviation
・population: $\sigma$
・sample: $s$

Mean
Population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Deviation
• Deviation: difference between the observed value of a variable and the mean, $x_i - \bar{x}$
• Squared deviation from the mean, needed to calculate the sample variance: $\sum_{i=1}^{n} (x_i - \bar{x})^2$

Unbiased estimation of variance
• use of n − 1 in the sample variance formula instead of the sample size n
Population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$
Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
$N$: size of the population
$n$: number of observations in the sample
$x_i$: observed values of the items
$\mu$: population mean
$\bar{x}$: sample mean

Standard deviation (SD)
A measure to quantify the amount of variation or dispersion of a set of data values.
Population standard deviation: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}$
Sample standard deviation: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$

Skewness
Degree of distortion from a symmetrical curve.
Sk < 0: left-skewed
Sk = 0: no skew
Sk > 0: right-skewed
$Sk = \frac{1}{n}\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^3$
$n$: sample size, $x_i$: observed values, $\bar{x}$: sample mean, $s$: sample SD
[Figure: histograms of rnorm(1000), -1*rnorm(100)^2, rnorm(1000) and rnorm(100)^2; y-axis: Frequency]

Kurtosis
High kurtosis is an indication of an outlier (or outliers).
$Ku = \frac{1}{n}\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^4$
$n$: sample size, $x_i$: observed values, $\bar{x}$: sample mean, $s$: sample SD
Kurtosis of the normal distribution is defined as 0 or 3, depending on whether 3 is subtracted (excess kurtosis).

Analytical results below LOD, LOQ: calculation of LB, MB and UB
How to deal with results below LOD, LOQ?
Lower bound (LB)
Replacing all the results reported as below the LOD/LOQ by 0.
Medium bound (MB)
i. Replacing all the results reported as below the LOD/LOQ by half their respective LOD/LOQ.
ii. Replacing all the results reported as below the LOD by half their respective LOD, and retaining all the results reported between LOD and LOQ.
Upper bound (UB)
Replacing all the results reported as below the LOD/LOQ by their respective LOD/LOQ.

Example
Assume LOD = 0.03 mg/kg and LOQ = 0.05 mg/kg.
Some results are below LOD or LOQ.

Aggregation of two datasets
Can we combine two datasets (data1 and data2) for further analysis?
For example, two datasets obtained by
a. completely different sampling plans for different purposes
b. slightly different sampling plans for the same purpose
c. the same sampling plan but different targets
d. multi-year surveillance
e. the same sampling plan but with different basic statistics

Statistical test to compare two datasets
Parametric (normal distribution) or non-parametric distribution?
a. Statistical normality test
For contaminants, surveillance datasets usually have a non-parametric distribution.
b. Statistical test to compare medians
c. Statistical test to compare two distributions

Exercise: two major non-parametric statistical tests
a. Mann-Whitney U Test
Sometimes called the Mann-Whitney-Wilcoxon Test or the Wilcoxon Rank Sum Test.
Tests whether two independent datasets come from the same population (comparison of medians).
(The Kruskal–Wallis H test is used to compare medians of three or more independent datasets.)
b. Two-sample Kolmogorov–Smirnov test
Tests whether the two independent datasets come from the same distribution.

Mann-Whitney U Test (1)
Assumptions of the Mann-Whitney U test
1. All the observations from both datasets are independent of each other.
2. The responses are ordinal (i.e., one can at least say, of any two observations, which is the greater).
3. Under the null hypothesis $H_0$, the distributions of both populations are equal.
4. The alternative hypothesis $H_1$ is that the distributions are not equal.

Mann-Whitney U Test (2)
1. Assign numeric ranks to all the observations (put the observations from both datasets into one set), beginning with 1 for the smallest value. Where there are groups of tied values, assign a rank equal to the midpoint of the unadjusted rankings.
2. Add up the ranks for the observations which came from dataset 1.
3. Add up the ranks for the observations which came from dataset 2.
4. The statistic U is then given by:
$U_1 = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_1$
where $n_1$, $n_2$ are the sample sizes of dataset 1 and dataset 2 respectively, and $R_1$ is the sum of the ranks in dataset 1.
$U_2 = n_1 n_2 + \frac{n_2 (n_2 + 1)}{2} - R_2$
where $n_1$, $n_2$ are the sample sizes of dataset 1 and dataset 2 respectively, and $R_2$ is the sum of the ranks in dataset 2.
5. The smaller value of $U_1$ and $U_2$ is the one used when consulting significance tables.

Two-sample Kolmogorov–Smirnov test (1)
Assumptions of the two-sample Kolmogorov–Smirnov test
1. All the observations from both datasets are independent of each other.
2. Under the null hypothesis $H_0$, both samples come from a population with the same distribution.
3. The alternative hypothesis $H_1$ is that the two samples do not come from a population with the same distribution.

Two-sample Kolmogorov–Smirnov test (2)
1. The first dataset has size m with an observed cumulative distribution function F(x), and the second dataset has size n with an observed cumulative distribution function G(x).
2. Calculate
$D_{m,n} = \max_x |F(x) - G(x)|$
3. The null hypothesis is rejected at level α if
$D_{m,n} > c(\alpha) \sqrt{\frac{m + n}{m n}}$
The value of $c(\alpha)$ is given in the table below for the most common levels of α.

α      0.10   0.05   0.025  0.01   0.005  0.001
c(α)   1.073  1.224  1.358  1.517  1.628  1.858

In general: $c(\alpha) = \sqrt{-\frac{1}{2}\ln\alpha}$
Note: for the two-sided statistic $D_{m,n}$ in step 2, the usual large-sample critical value is $\sqrt{-\frac{1}{2}\ln(\alpha/2)}$, which corresponds to reading the table at α/2 (e.g. 1.358 for a two-sided test at the 5% level).

Data aggregation
Exercise: Let’s test, using data1 and data2, whether they can be combined for further analysis.
• Mann-Whitney U test
• Two-sample Kolmogorov–Smirnov test
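
Worked sketch: basic statistics and percentiles
The basic statistics listed earlier (maximum, minimum, range, mean, median, sample variance, sample standard deviation) and a percentile can be computed along the following lines. This is a minimal sketch, not the workshop's own spreadsheet procedure. The function names `percentile` and `basic_statistics` and the example values are hypothetical, and the percentile uses linear interpolation between ranks, which is only one of several conventions in use.

```python
import math
import statistics


def percentile(values, p):
    """Return the p-th percentile (0-100) using linear interpolation.

    Assumption: position = (n - 1) * p / 100 on the sorted data, so that
    the 0th percentile is the minimum and the 100th is the maximum.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    position = (len(ordered) - 1) * p / 100.0
    lower = math.floor(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def basic_statistics(values):
    """Basic statistics of a sample (needs at least two observations)."""
    return {
        "n": len(values),
        "maximum": max(values),
        "minimum": min(values),
        "range": max(values) - min(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sample_variance": statistics.variance(values),  # divides by n - 1
        "sample_sd": statistics.stdev(values),  # square root of the sample variance
    }


# Hypothetical concentrations in mg/kg (not the workshop's Data1).
example = [0.02, 0.04, 0.04, 0.05, 0.07, 0.08, 0.10, 0.40]
for name, value in basic_statistics(example).items():
    print(f"{name}: {value:.4g}")
print("95th percentile:", round(percentile(example, 95), 4))
```

With a single high value in the data, the mean moves noticeably while the median does not, which is the robustness property noted on the Median slide.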
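
Worked sketch: LB, MB and UB substitution
The lower, medium and upper bound rules described earlier can be illustrated with the example limits LOD = 0.03 mg/kg and LOQ = 0.05 mg/kg. The coding of the results is an assumption made for this sketch only: `None` stands for a result reported as below the LOD, a number below the LOQ stands for a result detected but not quantified, and the function name `substitute` and the bound labels "MB-i" and "MB-ii" are hypothetical.

```python
def substitute(results, lod, loq, bound):
    """Replace non-quantified results according to the chosen bound.

    Assumed coding (hypothetical): None or a number below lod = below LOD;
    a number in [lod, loq) = between LOD and LOQ; a number >= loq = quantified.
    bound is one of "LB", "MB-i", "MB-ii", "UB".
    """
    below_lod = {"LB": 0.0, "MB-i": lod / 2, "MB-ii": lod / 2, "UB": lod}
    # None means "retain the reported value" (medium bound, variant ii).
    below_loq = {"LB": 0.0, "MB-i": loq / 2, "MB-ii": None, "UB": loq}
    if bound not in below_lod:
        raise ValueError(f"unknown bound: {bound}")
    replaced = []
    for value in results:
        if value is None or value < lod:
            replaced.append(below_lod[bound])
        elif value < loq and below_loq[bound] is not None:
            replaced.append(below_loq[bound])
        else:
            replaced.append(value)  # quantified, or retained under MB-ii
    return replaced


LOD, LOQ = 0.03, 0.05  # mg/kg, the example limits from the slides
results = [None, 0.04, 0.10, None, 0.20]  # hypothetical results in mg/kg
for bound in ("LB", "MB-i", "MB-ii", "UB"):
    values = substitute(results, LOD, LOQ, bound)
    print(bound, values, "mean =", round(sum(values) / len(values), 4))
```

Comparing the LB and UB means shows how much of the estimate depends on the treatment of non-quantified results rather than on measured values.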
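
Worked sketch: Mann-Whitney U statistic
The five steps of the Mann-Whitney U test can be followed by hand in a few lines. The sketch below only computes U1, U2 and the smaller of the two; it assumes the convention U = n1·n2 + n(n + 1)/2 − R for each dataset, and it does not look up significance. The function names and the example data are hypothetical.

```python
def midranks(values):
    """Rank values from 1 (smallest); tied values share the midpoint rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0  # midpoint of ranks start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def mann_whitney_u(dataset1, dataset2):
    """Return (U1, U2, U) where U is the smaller of U1 and U2."""
    n1, n2 = len(dataset1), len(dataset2)
    if n1 == 0 or n2 == 0:
        raise ValueError("both datasets need at least one observation")
    ranks = midranks(list(dataset1) + list(dataset2))  # step 1: pooled ranks
    r1 = sum(ranks[:n1])  # step 2: rank sum of dataset 1
    r2 = sum(ranks[n1:])  # step 3: rank sum of dataset 2
    u1 = n1 * n2 + n1 * (n1 + 1) / 2.0 - r1  # step 4
    u2 = n1 * n2 + n2 * (n2 + 1) / 2.0 - r2
    return u1, u2, min(u1, u2)  # step 5


# Hypothetical data in mg/kg.
data1 = [0.02, 0.04, 0.04, 0.07]
data2 = [0.04, 0.08, 0.10]
print(mann_whitney_u(data1, data2))
```

A useful check is that U1 + U2 always equals n1·n2, whichever datasets are used.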
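
Worked sketch: two-sample Kolmogorov–Smirnov statistic
The statistic D (the largest gap between the two observed cumulative distribution functions) and the rejection threshold c(α)·sqrt((m + n)/(m·n)) can be sketched as follows. The value of c(α) is passed in by the caller, for example a value read from the table. The function names and the example data are hypothetical, and the threshold is a large-sample approximation, so for very small datasets it should be treated with caution.

```python
import bisect
import math


def ks_statistic(dataset1, dataset2):
    """Largest absolute difference between the two empirical CDFs."""
    a, b = sorted(dataset1), sorted(dataset2)
    m, n = len(a), len(b)
    if m == 0 or n == 0:
        raise ValueError("both datasets need at least one observation")
    d = 0.0
    # The empirical CDFs only change at observed values, so checking
    # every observed value is enough to find the maximum difference.
    for x in a + b:
        f = bisect.bisect_right(a, x) / m  # F(x): share of dataset 1 <= x
        g = bisect.bisect_right(b, x) / n  # G(x): share of dataset 2 <= x
        d = max(d, abs(f - g))
    return d


def ks_threshold(c_alpha, m, n):
    """Rejection threshold c(alpha) * sqrt((m + n) / (m * n))."""
    return c_alpha * math.sqrt((m + n) / (m * n))


def ks_rejects(dataset1, dataset2, c_alpha):
    """True if the null hypothesis (same distribution) is rejected."""
    d = ks_statistic(dataset1, dataset2)
    return d > ks_threshold(c_alpha, len(dataset1), len(dataset2))


# Hypothetical data in mg/kg; 1.358 is one of the tabulated c values.
data1 = [0.02, 0.04, 0.04, 0.07]
data2 = [0.04, 0.08, 0.10, 0.12]
print("D =", ks_statistic(data1, data2))
print("rejected:", ks_rejects(data1, data2, 1.358))
```

If neither this test nor the Mann-Whitney U test rejects the null hypothesis, the two datasets show no detected difference, which supports (but does not prove) that they can be combined.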