Multivariate Data Analysis
• Introduction to Multivariate Data Analysis
• Principal Component Analysis (PCA)
• Multivariate Linear Regression (MLR, PCR and PLSR)
• Laboratory exercises:
– Introduction to MATLAB
– Examples of PCA (cluster analysis of samples, identification and geographical distribution of contamination sources/patterns, …)
– Examples of Multivariate Regression (prediction of the concentration of chemicals from spectral analysis, investigation of correlation patterns and of the relative importance of variables, …)

Romà Tauler (IDAEA, CSIC, Barcelona)

Univariate and Multivariate Data

Univariate
• Only one measurement per sample (pH, absorbance, peak height or area).
• The property is defined by one measurement.
• Total selectivity is needed.
• Interferences should be eliminated before the measurement (separation).
• Numerical treatment is easy.

Multivariate
• Multiple measurements per sample:
– Instrumental measurements (spectra, chromatograms, ...)
– Multiple measurements (constituent concentrations, sensory variables, ...)
• Total selectivity is not needed.
• Interferences can be present.
• Different complexity levels (vectors, matrices, tensors, ...).

Univariate Statistics

One-variable summary:
• Mean: $\mu \rightarrow \bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• Variance: $\sigma^2 \rightarrow s_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n-1}$
• Standard deviation: $\sigma \rightarrow s_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n-1}}$

Relation between 2 variables:
• Covariance: $\sigma_{x,y} \rightarrow s_{x,y} = \frac{\sum_{i=1}^{n} (x_i - \bar{X})(y_i - \bar{Y})}{n-1}$
• Correlation: $\rho_{x,y} \rightarrow r_{x,y} = \frac{s_{x,y}}{s_x s_y}$

Univariate Statistics: Description of one variable

Zn: 4.10, 4.04, 3.82, 3.96, 4.07, 4.23, 3.73, 3.80, 4.23

• Mean: 4.00
– Mean value of the variable.
– Size of the scale of the variable.
• Standard deviation: 0.18.
• Variance: 0.032.
– Dispersion around the mean.
– Spread (dispersion) of the scale of the variable.

Univariate Statistics: Description of the relation between 2 variables

Sn      Ni
0.20    0.0022
0.20    0.0020
0.15    0.0015
0.61    0.0062
0.57    0.0056
0.58    0.0055
0.30    0.0033
0.60    0.0056
0.10    0.0014

• Covariance: 0.00043.
– High absolute values indicate a linear relationship between x and y.
– Sign: direct (+) or inverse (−) relation.
– Depends on the scale sizes of x and y.
• Correlation: 0.996.
– |r| = 1: total linearity; r = 0: absence of a linear relationship.
– Sign: direct (+) or inverse (−) relation.
– Independent of the scale sizes of x and y.

High correlation: redundant information. Low correlation: complementary information.

Multivariate Data: Example

Variables in columns, objects (samples) in rows; data matrix X (9,4), in general (n,m).

Sn      Zn      Fe      Ni
0.20    4.10    0.06    0.0022
0.20    4.04    0.04    0.0020
0.15    3.82    0.08    0.0015
0.61    3.96    0.09    0.0062
0.57    4.07    0.08    0.0056
0.58    4.23    0.07    0.0055
0.30    3.73    0.02    0.0033
0.60    3.80    0.07    0.0056
0.10    4.23    0.05    0.0014

Multivariate Statistics: Description of variables

• Vector of means: $\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$
– Mean value of each variable.
– Sample differences due to the differences in the size scales of the variables.

Sn      Zn      Fe      Ni
(0.37   4.00    0.06    0.0037)

• Variables with higher means have a higher influence on the data description.
• If the scale sizes are very different, a data pretreatment will probably be needed.

Multivariate Statistics: Description of variables

• Vector of standard deviations: s = (s1, s2, s3, ..., sm), with $s_j = \sqrt{\frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}{n-1}}$
– Dispersion of the different variables.
– Shows the differences in the scale ranges (dispersion, spread) among variables.

Sn      Zn      Fe      Ni
(0.22   0.18    0.02    0.0020)

• Variables with higher dispersions will have a higher influence on the data description.
• If the scale ranges of the different variables are very different, a data pretreatment will probably be needed.
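The summary statistics quoted on the preceding slides can be recomputed from the 9 × 4 data matrix. The following is a minimal sketch in Python (standard library only) rather than the MATLAB used in the laboratory exercises. The variable and function names (`X`, `columns`, `covariance`, `correlation`) are illustrative choices, not part of the course material. The results should agree with the slide values up to rounding.

```python
import statistics

# Data matrix X (n = 9 samples, m = 4 variables), copied from the slides.
names = ["Sn", "Zn", "Fe", "Ni"]
X = [
    [0.20, 4.10, 0.06, 0.0022],
    [0.20, 4.04, 0.04, 0.0020],
    [0.15, 3.82, 0.08, 0.0015],
    [0.61, 3.96, 0.09, 0.0062],
    [0.57, 4.07, 0.08, 0.0056],
    [0.58, 4.23, 0.07, 0.0055],
    [0.30, 3.73, 0.02, 0.0033],
    [0.60, 3.80, 0.07, 0.0056],
    [0.10, 4.23, 0.05, 0.0014],
]

n = len(X)
columns = list(zip(*X))  # one tuple of n values per variable

# Vector of means (denominator n) and vector of standard deviations
# (statistics.stdev uses the n - 1 denominator, as in the slide formulas).
means = [sum(col) / n for col in columns]
stds = [statistics.stdev(col) for col in columns]


def covariance(x, y):
    """Sample covariance of two equally long sequences (n - 1 denominator)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def correlation(x, y):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(x, y) / (statistics.stdev(x) * statistics.stdev(y))


sn, ni = columns[0], columns[3]
cov_sn_ni = covariance(sn, ni)
r_sn_ni = correlation(sn, ni)

for name, mean, std in zip(names, means, stds):
    print(f"{name}: mean = {mean:.4f}, std = {std:.4f}")
print(f"cov(Sn, Ni) = {cov_sn_ni:.5f}, r(Sn, Ni) = {r_sn_ni:.3f}")
```

The covariance is tiny only because Ni is recorded on a much smaller scale than Sn, while the correlation is close to 1. This is the scale dependence that the slides use to motivate data pretreatment.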
Multivariate Data: Robust Parameters

Each column sorted in increasing order; the middle row is the median and the central rows give the interquartile spread.

Sn      Zn      Fe      Ni
0.10    3.73    0.02    0.0014
0.15    3.80    0.04    0.0015
0.20    3.82    0.05    0.0020
0.20    3.96    0.06    0.0022
0.30    4.04    0.07    0.0033    (median)
0.57    4.07    0.07    0.0055
0.58    4.10    0.08    0.0056
0.60    4.23    0.08    0.0056
0.61    4.23    0.09    0.0062

• They are less sensitive to the presence of outliers.
• The interquartile range is not necessarily symmetric about the median.
• They show the data structure.

Introduction to Multivariate Data Analysis: Descriptive Statistics (Excel)

                      ppDDD         opDDT         ppDDT         Total DDX     Total POCs    PCB#28
Mean                  6.096511644   2.16580938    7.086504433   66.19020173   78.70933531   7.600267864
Median                1.928828322   0.035         0.099439743   18.10346088   23.83508898   2.1
Mode                  0.01          0.035         0.02          0.145         0.3215        0.9
Standard deviation    20.01177892   12.61189239   59.35342594   199.7332731   229.5487184   23.05609066
Sample variance       400.4712955   159.0598297   3522.829171   39893.38037   52692.61412   531.5833166
Kurtosis              45.08257127   52.62597243   100.808786    37.034318     41.9234197    37.79168548
Skewness              6.553464204   7.205359571   10.01488076   5.826281824   6.155842663   6.004360386
Range                 158.99        103.565       598.939       1536.855      1856.1785     165.535
Minimum               0.01          0.035         0.02          0.145         0.3215        0.065
Maximum               159           103.6         598.959       1537          1856.5        165.6
Sum                   621.8441877   220.9125568   722.8234522   6751.400577   8028.352202   752.4265185
Count                 102           102           102           102           102           99

Probability distributions: Box plots

A box plot summarizes the information on the data distribution primarily in terms of the median, the upper quartile, and the lower quartile. The “box” by definition extends from the upper to the lower quartile. Within the box is a dot or line marking the median. The length of the box, that is, the distance between the upper and lower quartiles, is equal to the interquartile range and is a measure of spread.
The median is a measure of location, and the relative distances of the median from the upper and lower quartiles are a measure of symmetry “in the middle” of the distribution. For example, the median is approximately in the middle of the box for a symmetric distribution, and is positioned toward the lower part of the box for a positively skewed distribution.

[Figure: Box plots, Tucson Precipitation. P (in) versus Month (Jan, July); labels: upper quartile, median, lower quartile, interquartile range (iqr).]

Probability distributions: Box plots

“Whiskers” are drawn outside the box at what are called the “adjacent values.” The upper adjacent value is the largest observation that does not exceed the upper quartile plus 1.5 iqr, where iqr is the interquartile range. The lower adjacent value is the smallest observation that is not less than the lower quartile minus 1.5 iqr. If no data fall outside this 1.5 iqr buffer around the box, the whiskers mark the data extremes.

The whiskers also give information about symmetry in the tails of the distribution. For example, if the distance from the top of the box to the upper whisker exceeds the distance from the bottom of the box to the lower whisker, the distribution is positively skewed in the tails. Skewness in the tails may be different from skewness in the middle of the distribution. For example, a distribution can be positively skewed in the middle and negatively skewed in the tails.

[Figure: Box plots, Tucson Precipitation. P (in) versus Month (Jan, July); labels: whiskers, interquartile range (iqr).]

Probability distributions: Box plots

Any points lying outside the 1.5 iqr buffer around the box are marked by individual symbols as “outliers”. These points are outliers in comparison to what is expected from a normal distribution with the same mean and variance as the data sample.
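The box-plot rules described above (quartiles, 1.5 iqr buffer, adjacent values, outliers) can be sketched in a few lines of Python. This is an illustrative sketch, not the routine behind the Tucson figures. `box_plot_summary` is a hypothetical helper name, and the quartiles use the default ("exclusive") interpolation of `statistics.quantiles`, which may differ slightly from the quartile convention of other software. The first sample is the Sn column from the slides; the second is an invented sample included only to show an outlier.

```python
import statistics


def box_plot_summary(data):
    """Return the quantities a box plot displays for one variable."""
    values = sorted(data)
    # Quartiles; the interpolation convention is an assumption of this sketch.
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Adjacent values: the most extreme observations still inside the fences.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_adjacent": inside[0],
        "upper_adjacent": inside[-1],
        "outliers": outliers,
    }


# Sn column from the slides: no outliers, box not symmetric about the median.
sn = [0.20, 0.20, 0.15, 0.61, 0.57, 0.58, 0.30, 0.60, 0.10]
sn_summary = box_plot_summary(sn)

# Invented sample (not course data) with one value far above the rest.
toy = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
toy_summary = box_plot_summary(toy)

print(sn_summary)
print(toy_summary)
```

For the Sn column the whiskers coincide with the data extremes, and the median sits closer to the lower quartile than to the upper one, which the text describes as positive skew in the middle of the distribution.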
For a standard normal distribution, the median and mean are both zero, and $q_{0.25} = -0.67449$, $q_{0.75} = 0.67449$, $\mathrm{iqr} = q_{0.75} - q_{0.25} = 1.349$, where $q_{0.25}$ and $q_{0.75}$ are the first and third quartiles and iqr is the interquartile range. The whiskers for a standard normal distribution are therefore at $q_{0.75} + 1.5\,\mathrm{iqr} = 0.67449 + 2.0235 = 2.698$ and, by symmetry, at $-2.698$:
Upper whisker = 2.698, Lower whisker = −2.698

[Figure: Box plots, Tucson Precipitation. P (in) versus Month (Jan, July); label: outliers.]

Probability distributions: Box plots

From the cdf of the standard normal distribution, the probability of a value lower than x = −2.698 is 0.0035. This result shows that for a normal distribution, roughly 0.35 percent of the data are expected to fall below the lower whisker. By symmetry, 0.35 percent of the data are expected above the upper whisker. These data values are classified as outliers. Exactly how many outliers might be expected in a sample of normally distributed data depends on the sample size. For example, with a sample size of 100, we expect essentially no outliers, as 0.35 percent of 100 is 0.35, well below 1. With a sample size of 10,000, however, we would expect about 35 positive outliers and 35 negative outliers for a normal distribution.

For a normal distribution:
>> varnorm = randn(10000,3);   % 10000 samples of 3 standard normal variables
>> boxplot(varnorm)            % one box per column
0.35% of 10000 is approximately 35 outliers beyond each whisker.

Parametric vs Robust Statistics

• Parametric: mean and standard deviation.
• Robust (box plot): median, interquartile range (IQR), minimum and maximum.

Parametric vs Robust Statistics

[Figure: two panels for Sn, Zn, Fe and Ni. Left: mean and standard deviation plots. Right: median and IQR plots (box plots).]

• They help to see the differences in size and range of the scales.
• They suggest the use of appropriate data pretreatments to handle these differences.
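The whisker positions and the expected share of outliers for a standard normal distribution can be checked numerically. The sketch below uses Python's standard library instead of MATLAB's `randn` and `boxplot`, and all variable names are illustrative. For simplicity it counts outliers against the theoretical whiskers, whereas a box-plot routine would normally estimate the quartiles from the sample itself. The simulated counts vary with the random seed.

```python
import random
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Quartiles and interquartile range of the standard normal distribution.
q25 = std_normal.inv_cdf(0.25)
q75 = std_normal.inv_cdf(0.75)
iqr = q75 - q25

# Whiskers sit 1.5 iqr beyond the quartiles.
lower_whisker = q25 - 1.5 * iqr
upper_whisker = q75 + 1.5 * iqr

# Probability of a value falling below the lower whisker.
p_below = std_normal.cdf(lower_whisker)

# Simulate 10,000 standard normal values and count the points beyond each
# theoretical whisker (seed fixed only to make the run repeatable).
random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(10_000)]
n_low = sum(v < lower_whisker for v in sample)
n_high = sum(v > upper_whisker for v in sample)

print(f"quartiles: {q25:.5f}, {q75:.5f}; iqr = {iqr:.3f}")
print(f"whiskers: {lower_whisker:.3f}, {upper_whisker:.3f}")
print(f"P(x < lower whisker) = {p_below:.5f}")
print(f"expected outliers per side in 10,000: {10_000 * p_below:.1f}")
print(f"simulated outliers: {n_low} low, {n_high} high")
```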