Median Polish Algorithm

Introduction to Computational Chemical Biology
Dr. Raimo Franke
[email protected]
Lecture Series Chemical Biology
http://www.raimofranke.de/
Leibniz-Universität Hannover

Where can you find me?
• Dr. Raimo Franke, Department of Chemical Biology (C-Building), Helmholtz Centre for Infection Research, Inhoffenstr. 7, 38124 Braunschweig
  Tel.: 0531-6181-3415
  Email: [email protected]
  Download of lecture slides: www.raimofranke.de
• Research Topics
  - Metabolomics
  - Biostatistical Data Analysis of omics experiments (NGS (genome, RNAseq), Arrays, Metabolomics, Profiling)
  - Phenotypic Profiling of bioactive compounds with impedance measurements (xCelligence)
  - (Peptide Synthesis, ABPP)
I am looking for BSc, MSc and PhD students, feel free to contact me!

Slide 2 | My journey…

Slide 3 | Outline
• Primer on Statistical Methods
• Multi-parameter phenotypic profiling: using cellular effects to characterize bioactive compounds
• Network Pharmacology, Modeling of Signal Transduction Networks
• Paper Presentations: Derewacz DK, Goodwin CR, McNees CR, McLean JA, Bachmann BO. Antimicrobial drug resistance affects broad changes in metabolomic phenotype in addition to secondary metabolism. Proc Natl Acad Sci U S A. 2013 Feb 5;110(6):2336-41. doi: 10.1073/pnas.1218524110. Epub 2013 Jan 22.

Slide 4 | Chemical Biology Arsenal of Methods
• In silico predictions: Pharmacophore-based target prediction
• Target interactions: Biochemical assays, SPR, ITC, X-Ray, NMR
• Correlation signals: Expression profiling, Proteomics, Metabonomics, Image analysis, Impedance, NCI60 panel
• Chemical Genetics & Chemoproteomics: Protein microarrays, Yeast-3-Hybrid, Phage display, ABPP, Affinity pulldown, Chemical probes

Computational Methods in Chemical Biology
• In silico-based target prediction: use molecular descriptors for in silico screening, molecular docking etc. (typical applications of Cheminformatics)
• Biochemical Assays: SPR: curve fitting to determine k_ass and k_diss; X-ray: model fitting; NMR: chemical shift prediction etc.
• Phenotypic Profiling / HCS: mining of omics data (Proteomics, Metabolomics, automated microscopy etc.): multivariate statistics
• Chemical Systems Biology: Network Pharmacology (graph theory, ODEs)
All approaches have in common: application of computational/statistical methods is a key step for data analysis -> Importance of statistics and statistical data mining

Slide 6 | Short Primer on Statistical Methods
• Intro to statistics slides partly © Professor David Wishart, University of Alberta, Canada
• Free to download, share and remix: http://bioinformatics.ca/

Slide 7 | Statistics
• “There are three kinds of lies: lies, damned lies, and statistics” Benjamin Disraeli (British Prime Minister in the 1870s)
• Statistics is a formalized way of describing impressions and provides a framework for decision making
• I.e. it formalizes the subjective impression that population A is on average taller than population B by calculating a quantitative parameter, e.g. mean or median
• In Chemical Biology research, statistics is becoming more and more important: from simple t-tests to evaluate results of Western Blots to complex multivariate statistics to analyze results from omics experiments
• -> It is vital to have a solid understanding of statistics

Review of some aspects of Statistics
• Descriptive vs. inferential statistics
• Normal Distribution
• T-Test and p-value
• Correlation
• Clustering
• Multivariate Statistics: PCA

Slide 9 | Statistical methods used in data analysis
• Descriptive Statistics: summarizes data from a sample
• Inferential Statistics: draws conclusions from data that are subject to random variation (e.g. sampling variation). It tries to make inferences about a population using information from samples (drawn from the population).
• Very important distinction: sample <-> population
• If the population is too big to characterize directly with descriptive statistics: draw samples -> from the samples, inferences can be made to characterize the underlying population
• Usually: sample mean ≠ population mean

Slide 10 | Distributions
• Underlies all of statistics
• Distribution of a single variable over a population, e.g. height of a population: Univariate Statistics

Slide 11 | Univariate Statistics
• Univariate means a single variable
• If you measure a population using some single measure such as height, weight, test score, IQ, you are measuring a single variable
• If you plot that single variable over the whole population, measuring the frequency with which a given value is reached, you will get the following:

A Bell Curve
Carl Friedrich Gauß (*1777 in Braunschweig)
[Figure: bell curve; x-axis: Height, y-axis: # of each]
Also called a Gaussian or Normal Distribution

Features of a Normal Distribution
Two key parameters characterize any kind of distribution: central tendency and spread
• Symmetric Distribution
• Has an average or mean value (µ) at the centre
• Has a characteristic width called the standard deviation (σ)
• Most common type of distribution known

Normal Distribution
• Almost any set of biological or physical measurements will display some variation and these will almost always follow a Normal distribution
• The larger the set of measurements, the more “normal” the curve
• Minimum set of measurements to get a normal distribution is 30-40
• -> For any statistical test relying on the normal distribution (e.g. t-test) that means that we would need at least 30 biological replicates (often not feasible)

Gaussian Distribution
P(x) - probability density distribution
$$P(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \sim e^{-x^2}$$
[Figure: Gaussian curve; x-axis marked at µ-3σ, µ-2σ, µ-σ, µ, µ+σ, µ+2σ, µ+3σ]

Some Equations
Mean: $\mu = \frac{\sum x_i}{N}$
Variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
Standard Deviation: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$ (uncorrected sample standard deviation; the sample is considered as the entire population)
Corrected standard deviation: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N-1}}$ (when used as an estimator for the population standard deviation)
The uncorrected sample variance tends to underestimate the population variance; to correct for this bias, divide by N-1 instead of N.

Mean, Median & Mode
• Gaussian Normal Distribution: symmetrical; many other distributions are not
• Other distributions: Binomial, Poisson
• Skewed distributions are often the case for real-life measurements
[Figure: skewed distribution with Mode, Median and Mean marked]

Mean, Median, Mode
• In a Normal Distribution the mean, mode and median are all equal
• In skewed distributions they are unequal
• Mean - average value, affected by extreme values in the distribution
• Median - the “middlemost” value, usually half way between the mode and the mean
• Mode - most common value

Probability Distributions with Standard Deviations (z-values, z-scores)

Borders          AUC          Upper tail        AUC          P not within borders
µ ± 1.0 S.D.     0.683        > µ + 1.0 S.D.    0.159        P=32%
µ ± 2.0 S.D.     0.954        > µ + 2.0 S.D.    0.023        P=5%
µ ± 3.0 S.D.     0.9973       > µ + 3.0 S.D.    0.0013       P=0.3%
µ ± 4.0 S.D.     0.99994      > µ + 4.0 S.D.    0.00003
µ ± 5.0 S.D.     0.9999994    > µ + 5.0 S.D.    0.0000003

[Figure: P(x) as defined above; x-axis from µ-3σ to µ+3σ, corresponding to z = -3 to z = 3]

Significance
• Based on the Normal Distribution, the probability that something is >1 SD away (larger or smaller) from the mean is 32%
• Based on the Normal Distribution, the probability that something is >2 SD away (larger or smaller) from the mean is 5%
• Based on the Normal Distribution, the probability that something is >3 SD away (larger or smaller) from the mean is 0.3%

The P-value
• The p-value is the probability, assuming the null hypothesis is true, of obtaining a test statistic (a score, a set of events, a height) at least as extreme as the one that was actually observed (i.e. the probability of the observed result arising purely by chance)
• One "rejects the null hypothesis" when the p-value is less than the significance level α, which is often 0.05 or 0.01
• When the null hypothesis is rejected, the result is said to be statistically significant

Box-plot
Descriptive statistics can be intuitively summarized in a Box-plot.
[Figure: box-plot with labels, from top to bottom: 1.5 x IQR whisker, 75% quantile, IQR, Median, 25% quantile, 1.5 x IQR whisker]
Everything more than 1.5 x IQR above the 75% quantile or below the 25% quantile is considered an "outlier".
IQR = Interquartile Range = 75% quantile - 25% quantile

Boxplots and Normal Distribution
Particularly useful for comparing distributions between several groups or sets of data

Slide 24 | Application Example: Edge Effect and outlier detection
Label-free phenotypic profiling

Slide 25 | Application: Regulation of a metabolite in Pseudomonas aeruginosa (∆pqsE vs. wt)

featureidx   fold          log2fold      tstat          pvalue
52           90.49660842   -6.49979182   -26.68745376   0.00138943

[Figure: comparison of KO and wt]

Slide 26 | Fixing a Skewed Distribution
• A skewed distribution or exponentially decaying distribution can be transformed into a “normal” or Gaussian distribution by applying a log transformation
• This brings the outliers a little closer to the mean because it rescales the x-variable; it also makes the distribution much more Gaussian

Log Transformation
[Figure: two histograms, a skewed distribution on a linear scale and a normal distribution after log transformation; x-axis labels V5 and V4]

Distinguishing 2 Populations
[Figure: Population 1 and Population 2]
The Result
[Figure: the two height distributions; x-axis: Height, y-axis: # of each]
Are they different?

What about these 2 Populations?
[Figure: Population 1 and Population 2]
The Result
[Figure: the two height distributions; x-axis: Height, y-axis: # of each]
Are they different?

Student’s t-Test
• Also called the t-Test
• Used to determine if 2 populations are different
• Formally allows you to calculate the probability of observing a difference between 2 sample means at least as large as the one found, if the population means are the same
• Hypothesis testing: H0: µ1 = µ2; H1: µ1 ≠ µ2
• If the t-Test gives you p=0.4, and α is 0.05, H0 cannot be rejected -> there is no evidence that the 2 populations are different
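
The hypothesis test on the Student's t-Test slide can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the lecture's own code: it uses the pooled-variance (equal variances) form of the two-sample test, the height values are made up, and the function names `t_pdf`, `two_sided_p` and `student_t_test` are hypothetical helpers written for this example. In practice a library routine such as `scipy.stats.ttest_ind` would normally be used instead.

```python
import math
from statistics import mean, variance  # variance() uses the N-1 (corrected) form


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    # math.gamma overflows for very large df; fine for small sample sizes.
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=10_000):
    """Two-sided p-value: area outside [-|t|, |t|], via Simpson's rule."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    return max(0.0, 1 - 2 * area_0_to_t)


def student_t_test(a, b):
    """Two-sample t-test with pooled variance (assumes equal variances)."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df, two_sided_p(t, df)


# Hypothetical height samples (cm) from two populations.
heights_1 = [170.1, 172.4, 168.9, 171.5, 169.8, 173.0]
heights_2 = [170.6, 171.9, 169.2, 172.1, 170.3, 172.5]

alpha = 0.05
t, df, p = student_t_test(heights_1, heights_2)
decision = "reject H0" if p < alpha else "cannot reject H0"
print(f"t = {t:.3f}, df = {df}, p = {p:.3f} -> {decision}")
```

Note that the decision is phrased as "cannot reject H0" rather than "H0 is true": a large p-value only means the data give no evidence of a difference between the population means.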
