Statistical Analysis of Metabolomics Data


Xiuxia Du
Department of Bioinformatics & Genomics
University of North Carolina at Charlotte

Outline
• Introduction
• Data pre-treatment
  1. Normalization
  2. Centering, scaling, transformation
• Univariate analysis
  1. Student's t-test
  2. Volcano plot
• Multivariate analysis
  1. PCA
  2. PLS-DA
• Machine learning
• Software packages

Results from data processing
Next comes the analysis of the quantitative metabolite information.

Metabolomics data analysis
• Goals
  – biomarker discovery, by identifying significant features associated with certain conditions
  – disease diagnosis, via classification
• Challenges
  – limited sample size
  – many metabolites/variables
• Workflow: pre-treatment → univariate analysis → multivariate analysis → machine learning

Data Pre-treatment

Normalization (I)
• Goals
  – to reduce systematic variation
  – to separate biological variation from variation introduced in the experimental process
  – to improve the performance of downstream statistical analysis
• Sources of experimental variation
  – sample inhomogeneity
  – differences in sample preparation
  – ion suppression

Normalization (II)
• Approaches
  – Sample-wise normalization: to make samples comparable to each other
  – Feature/variable-wise normalization: to make features more comparable in magnitude to each other
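A common sample-wise approach is normalizing each sample to a constant sum. A minimal NumPy sketch (the matrix layout — samples as rows, features as columns — and the target total are illustrative assumptions, not from the slides):

```python
import numpy as np

def normalize_to_constant_sum(X, total=1000.0):
    """Scale each sample (row) so its feature intensities sum to `total`.

    X is assumed to be a samples x features matrix of peak intensities.
    """
    row_sums = X.sum(axis=1, keepdims=True)
    return X / row_sums * total

# Two samples with the same relative profile but different overall intensity
X = np.array([[100.0, 300.0, 600.0],
              [ 50.0, 150.0, 300.0]])
X_norm = normalize_to_constant_sum(X)
# After normalization both rows sum to 1000 and become identical
```

Normalizing to a reference feature (an internal standard) would instead divide each row by that feature's intensity in the same sample.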
• Sample-wise normalization
  – normalize to a constant sum
  – normalize to a reference sample
  – normalize to a reference feature (an internal standard)
  – sample-specific normalization (e.g., by dry weight or tissue volume)
• Feature-wise normalization (i.e., centering, scaling, and transformation)

Centering, scaling, transformation

Centering
• Converts each concentration to a fluctuation around zero instead of around the mean of the metabolite concentrations
• Focuses on the fluctuating part of the data
• Is applied in combination with data scaling and transformation

Scaling
• Divides each variable by a factor; different variables get different scaling factors
• Aims to adjust for the differences in fold changes between the different metabolites
• Results in the inflation of small values
• Two subclasses
  – scaling factors based on a measure of the data dispersion
  – scaling factors based on a size measure

Scaling: subclass 1 (data dispersion as the scaling factor)
• auto: use the standard deviation as the scaling factor. All metabolites then have a standard deviation of one, so the data are analyzed on the basis of correlations instead of covariances.
• pareto: use the square root of the standard deviation as the scaling factor. Large fold changes are decreased more than small fold changes, and thus become less dominant than in the unscaled data.
• vast: use the standard deviation and the coefficient of variation as scaling factors. This gives higher importance to metabolites with a small relative standard deviation.
• range: use (max − min) as the scaling factor. Sensitive to outliers.

Scaling: subclass 2 (size measure as the scaling factor)
• level: use the average as the scaling factor. The resulting values are changes in percentages relative to the mean concentration. The median can be used as a more robust alternative.

Transformation
• Log and power transformations
• Both reduce large values relatively more than small values
• Log transformation
  – pros: removal of heteroscedasticity
  – cons: unable to deal with the value zero
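The feature-wise pre-treatment steps above can be sketched in NumPy. This is an illustrative sketch, not code from the lecture; the simulated log-normal data and the `offset` used to sidestep log(0) are assumptions:

```python
import numpy as np

def center(X):
    """Mean-center each metabolite (column)."""
    return X - X.mean(axis=0)

def autoscale(X):
    """Center, then divide each column by its standard deviation."""
    return center(X) / X.std(axis=0, ddof=1)

def pareto_scale(X):
    """Center, then divide each column by the square root of its std."""
    return center(X) / np.sqrt(X.std(axis=0, ddof=1))

def log_transform(X, offset=1.0):
    """Log-transform; the offset works around log's inability to handle zero."""
    return np.log(X + offset)

rng = np.random.default_rng(0)
X = rng.lognormal(mean=2.0, sigma=1.0, size=(20, 5))  # 20 samples x 5 metabolites
Z = autoscale(X)
# Each column of Z now has mean ~0 and unit standard deviation,
# so downstream analysis is based on correlations rather than covariances.
```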
• Power transformation
  – pros: similar to the log transformation
  – cons: not able to make multiplicative effects additive

Centering, scaling, transformation
[Figure: effect of each pre-treatment on the same data set — A. original data, B. centering, C. auto, D. pareto, E. range, F. vast, G. level, H. log, I. power]

Log transformation, again
• It is hard to do useful statistical tests with a skewed distribution.
• A skewed or exponentially decaying distribution can be transformed into a Gaussian distribution by applying a log transformation.
http://www.medcalc.org/manual/transforming_data_to_normality.php

Univariate vs. multivariate analysis
• Univariate analysis examines each variable separately.
  – t-tests
  – volcano plot
• Multivariate analysis considers two or more variables simultaneously and takes into account relationships between variables.
  – PCA: Principal Component Analysis
  – PLS-DA: Partial Least Squares-Discriminant Analysis
• Univariate analyses are often used first, to obtain an overview or rough ranking of potentially important features before applying more sophisticated multivariate analyses.

Univariate Statistics
t-test, volcano plot

Univariate statistics
• A basic way of presenting univariate data is to create a frequency distribution of the individual cases. Due to the Central Limit Theorem, many of these frequency distributions can be modeled as a normal (Gaussian) distribution.

Gaussian distribution
• The total area underneath each density curve is equal to 1.

  φ(x) = (1 / (σ√(2π))) · exp( −(1/2)·((x − µ)/σ)² )

  mean = µ, variance = σ², standard deviation = σ

[Figure: normal density curves for several (µ, σ²) pairs (e.g., µ = 0, σ² = 0.2; µ = 0, σ² = 1.0), and the area breakdown of the standard normal curve: about 34.1% of the area lies in each interval of width σ adjacent to µ, 13.6% in each of the next intervals, and 2.1% beyond those.]
https://en.wikipedia.org/wiki/Normal_distribution

Sample statistics

  Sample mean: X̄ = (1/n) ∑ᵢ Xᵢ
  Sample variance: S² = (1/(n−1)) ∑ᵢ (Xᵢ − X̄)²
  Sample standard deviation: S = √S²

https://en.wikipedia.org/wiki/Normal_distribution

t-test (I)
• One-sample t-test: is the sample drawn from a known population?
  Null hypothesis H0: µ = µ0
  Alternative hypothesis H1: µ < µ0
  Test statistic: t = (x̄ − µ0) / (s/√n)
  Sample standard deviation: s = √( (1/(n−1)) ∑ᵢ (xᵢ − x̄)² )

The test statistic t follows a Student's t distribution with n−1 degrees of freedom.

t-test (II): p-value
When the null hypothesis is rejected, the result is said to be statistically significant.

t-test (III)
• Two-sample t-test: are the two populations different?
  Null hypothesis H0: µ1 − µ2 = 0
  Alternative hypothesis H1: µ1 − µ2 ≠ 0
  Test statistic: t = ((x̄1 − x̄2) − (µ1 − µ2)) / √( s1²/n1 + s2²/n2 )
• The two samples should be independent.

t-test (IV)
Equivalent statements:
• The p-value is small.
• The difference between the two populations is unlikely to have occurred by chance, i.e., is statistically significant.

t-test (V)
Equivalent statements:
• The p-value is big.
• The difference between the two populations is said NOT to be statistically significant.

t-test (VI)
• Paired t-test: what is the effect of a treatment?
• Measurements are made on the same individuals before and after the treatment.

Example: subjects participated in a study on the effectiveness of a certain diet on serum cholesterol levels.

  Subject   Before   After   Difference
  1         201      200     −1
  2         231      236     +5
  3         221      216     −5
  5         260      243     −17
  6         228      224     −4
  7         245      235     −10

  H0: µd = 0
  Ha: µd ≠ 0
  Test statistic: t = (d̄ − µd) / (sd/√n)

Volcano plot (I)
• Plots fold change vs. significance
• y-axis: negative log of the p-value
• x-axis: log of the fold change, so that changes in both directions (up and down) appear equidistant from the center
• Two regions of interest: points towards the top of the plot that lie far to either the left- or the right-hand side

Volcano plot (II)
[Figure: example volcano plot]

Multivariate statistics
PCA, PLS-DA

PCA (I)
• PCA is a statistical procedure to transform a set of correlated variables into a set of linearly uncorrelated variables.
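The per-metabolite two-sample t-tests and the volcano-plot coordinates described above can be sketched with SciPy. The simulated data, the use of Welch's (unequal-variance) variant, and the cutoffs p < 0.05 and |log2 fold change| > 1 are illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_features = 100
control = rng.lognormal(2.0, 0.3, size=(10, n_features))
treated = rng.lognormal(2.0, 0.3, size=(10, n_features))
treated[:, :5] *= 4.0  # spike five features with a real fold change

# Two-sample t-test per metabolite (Welch's variant, equal_var=False)
t_stat, p_val = stats.ttest_ind(treated, control, axis=0, equal_var=False)

# Volcano plot coordinates: log2 fold change vs. -log10 p-value
log2_fc = np.log2(treated.mean(axis=0) / control.mean(axis=0))
neg_log10_p = -np.log10(p_val)

# "Interesting" region: top of the plot, far left or far right
significant = (np.abs(log2_fc) > 1) & (p_val < 0.05)
```

With real data, the fold-change and p-value thresholds would be chosen per study, and multiple-testing correction would normally be applied to the p-values.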
PCA (II)
• The uncorrelated variables are ordered so that the first accounts for as much of the variability in the data as possible, and each succeeding one has the highest variance possible among the remaining variables.
• These ordered uncorrelated variables are called principal components.
• By discarding low-variance components, PCA helps us reduce data dimension and visualize the data.

PCA (III)
• The transformation matrix and the transformation [given as equations on the slide]
• p1, p2, …, pn are called the 1st, 2nd, …, and nth principal components, respectively.

PCA (IV)
• Each original sample is represented by an n-dimensional vector:
  S_before transformation = [x1, x2, …, xn]
• After the transformation
  – If all of the principal components are kept, each sample is still represented by an n-dimensional vector:
    S_after transformation = [y1, y2, …, yn]
  – If only m < n principal components are kept, each sample is represented by an m-dimensional vector:
    S_after transformation = [y1, y2, …, ym]

PCA (V)
• The y values are called scores.
• For visualization purposes, m is usually chosen to be 2 or 3.
• As a result, each sample is represented by a 2- or 3-dimensional point in the score plot.
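The score and loading computations above can be sketched with scikit-learn's PCA. The simulated two-group data set (echoing a wild-type vs. knockout design) is invented for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# 12 samples (6 wild-type, 6 knockout) x 50 metabolites,
# with the knockout group shifted on the first 10 metabolites
X = rng.normal(size=(12, 50))
X[6:, :10] += 3.0

pca = PCA(n_components=2)       # keep m = 2 components for visualization
scores = pca.fit_transform(X)   # score plot coordinates: one point per sample
loadings = pca.components_.T    # loadings plot coordinates: one point per metabolite

# Fraction of total variance captured by PC1 and PC2, in decreasing order
explained = pca.explained_variance_ratio_
```

Plotting `scores[:, 0]` against `scores[:, 1]` gives the score plot; the group shift shows up as a separation of the two groups along PC1.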
[Figure: PCA score plot — wild-type (wt) and knockout (ko) samples plotted on PC1 vs. PC2, with the two groups separating along PC1]

PCA (VI)
• Loadings
• [p11, p12], [p21, p22], …, [pn1, pn2] are plotted as points in the loadings plot.

PCA (VII)
[Figure: loadings plot for all variables, each point labeled by its m/z and retention time]