
Introduction to Computational Chemical Biology

Dr. Raimo Franke [email protected]
Lecture Series Chemical Biology, Leibniz-Universität Hannover
http://www.raimofranke.de/

Where can you find me?

• Dr. Raimo Franke
  Department of Chemical Biology (C-Building)
  Helmholtz Centre for Infection Research
  Inhoffenstr. 7, 38124 Braunschweig
  Tel.: 0531-6181-3415
  Email: [email protected]
  Download of lecture slides: www.raimofranke.de
• Research Topics
  - Metabolomics
  - Biostatistical Data Analysis of omics (NGS (genome, RNAseq), Arrays, Metabolomics, Profiling)
  - Phenotypic Profiling of bioactive compounds with impedance measurements (xCelligence)
  (Peptide Synthesis, ABPP)

I am looking for BSc, MSc and PhD students, feel free to contact me!

Slide 2 | My journey…

Slide 3 | Outline

• Primer on Statistical Methods
• Multi-parameter phenotypic profiling: using cellular effects to characterize bioactive compounds
• Network Pharmacology, Modeling of Signal Transduction Networks

Paper Presentations:

Proc Natl Acad Sci U S A. 2013 Feb 5;110(6):2336-41. doi: 10.1073/pnas.1218524110. Epub 2013 Jan 22.

Antimicrobial drug resistance affects broad changes in metabolomic phenotype in addition to secondary metabolism.

Derewacz DK, Goodwin CR, McNees CR, McLean JA, Bachmann BO.

Slide 4 | Chemical Biology Arsenal of Methods

• In silico predictions: Pharmacophore-based target prediction
• Target interactions: Biochemical assays, SPR, ITC, X-Ray, NMR
• Correlation signals: Expression profiling, Proteomics, Metabonomics, Image analysis, Impedance, NCI60 panel
• Chemical Genetics & Chemoproteomics: Protein microarrays, Yeast-3-Hybrid, Phage display, ABPP, Affinity pulldown, Chemical probes

Computational Methods in Chemical Biology

• In silico –based target prediction: use molecular descriptors for in silico screening, molecular docking etc. (typical applications of Cheminformatics)

• Biochemical Assays: SPR: curve fitting to determine k_ass and k_diss; X-ray: model fitting; NMR: chemical shift prediction etc.

• Phenotypic Profiling / HCS: mining of omics data (Proteomics, Metabolomics, automated microscopy etc.): multivariate statistics

• Chemical Systems Biology: Network Pharmacology (graph theory, ODEs)

All approaches have in common: application of computational/statistical methods is a key step for data analysis

-> Importance of statistics and statistical data mining

Slide 6 | Short Primer on Statistical Methods

• Intro to statistics slides partly © Professor David Wishart, University of Alberta, Canada • Free to download, share and remix: http://bioinformatics.ca/

Slide 7 | Statistics

• “There are three kinds of lies: lies, damned lies, and statistics” (Benjamin Disraeli, British Prime Minister in the 1870s)

• Statistics is a formalized way of describing impressions and provides a framework for decision making
• I.e. it formalizes the subjective impression that population A is on average taller than population B by calculating a quantitative parameter

• In Chemical Biology research statistics becomes more and more important: from simple t-tests to evaluate results of Western Blots to complex methods to analyze results from omics experiments
• -> It is vital to have a solid understanding of statistics

Review of some aspects of Statistics

• Descriptive vs. inferential statistics
• Normal Distribution
• T-Test and p-value
• Correlation
• Clustering
• Multivariate Statistics: PCA

Slide 9 | Statistical methods used in data analysis

• Descriptive Statistics: summarizes data from a sample

• Inferential Statistics: draws conclusions from data that are subject to random variation (e.g. sampling variation). It tries to make inferences about a population using information from samples (drawn from the population).

• Very important distinction: sample <-> population
• If the population is too big to characterize it directly with descriptive statistics: draw samples -> from the samples, inferences can be made to characterize the underlying population
• Usually: sample mean ≠ population mean

Slide 10 | Distributions

• Underlies all of statistics • Distribution of a single variable over a population: e.g. height of population: Univariate Statistics

Slide 11 | Univariate Statistics

• Univariate means a single variable
• If you measure a population using some single measure such as height, weight, test score, IQ, you are measuring a single variable
• If you plot that single variable over the whole population, measuring the frequency with which a given value is reached, you will get the following:

A Bell Curve

[Figure: bell-shaped histogram; x-axis: height, y-axis: # of each; portrait of Carl Friedrich Gauß (*1777 in Braunschweig)]

Also called a Gaussian or Normal Distribution

Features of a Normal Distribution

Two key parameters characterize any kind of distribution: location and spread

• Symmetric distribution
• Has an average or mean value (µ) at the centre
• Has a characteristic width called the standard deviation (σ)
• Most common type of distribution known

Normal Distribution

• Almost any set of biological or physical measurements will display some variation and these will almost always follow a Normal distribution
• The larger the set of measurements, the more “normal” the curve
• Minimum set of measurements to get a normal distribution is 30-40
• -> For any statistical test relying on the normal distribution (e.g. t-test) that means that we would need at least 30 biological replicates (often not feasible)

Gaussian Distribution

P(x) - probability density distribution

$$P(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \sim e^{-x^2}$$

[Figure: bell curve with x-axis ticks at µ-3σ, µ-2σ, µ-σ, µ, µ+σ, µ+2σ, µ+3σ]

Some Equations

Mean: $\mu = \frac{\sum x_i}{N}$

Variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$

Standard deviation: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$ (uncorrected sample standard deviation; sample considered as the entire population)

Corrected standard deviation: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N-1}}$ (when used as estimator for the population standard deviation)

The uncorrected sample variance underestimates the population variance; to correct for this bias, the factor 1/N is replaced by 1/(N-1).

Mean, Median & Mode
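The estimators above, together with the three location measures, can be checked on a small example. The sketch below is only an illustration: the sample values are invented (chosen to be right-skewed) and the variable names are not from the lecture.

```python
import math
import statistics

# Hypothetical, right-skewed sample (invented values for illustration only)
x = [2.0, 3.0, 3.0, 4.0, 5.0, 7.0, 18.0]
n = len(x)

mean = sum(x) / n
sum_sq = sum((xi - mean) ** 2 for xi in x)  # sum of squared deviations

sd_uncorrected = math.sqrt(sum_sq / n)       # divides by N
sd_corrected = math.sqrt(sum_sq / (n - 1))   # divides by N - 1

# Cross-check against the standard library implementations
assert math.isclose(sd_uncorrected, statistics.pstdev(x))
assert math.isclose(sd_corrected, statistics.stdev(x))

median = statistics.median(x)
mode = statistics.mode(x)

print(f"mean={mean}, median={median}, mode={mode}")
print(f"sd (1/N)={sd_uncorrected:.3f}, sd (1/(N-1))={sd_corrected:.3f}")
```

In this invented sample the single large value pulls the mean above the median, while the mode stays at the most frequent value; the corrected standard deviation is always the larger of the two estimates.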

• Gaussian Normal Distribution: symmetrical, many other distributions are not
• Other distributions: Binomial, Poisson
• Skewed distributions are often the case for real-life measurements

[Figure: skewed distribution with mode, median and mean marked]

Mean, Median, Mode

• In a Normal Distribution the mean, mode and median are all equal
• In skewed distributions they are unequal
• Mean - average value, affected by extreme values in the distribution
• Median - the “middlemost” value, usually between the mode and the mean
• Mode - most common value

Probability Distributions with Standard Deviations (z-values, z-scores)

Range           P within (AUC)   Range             P above      P not within AUC borders
µ ± 1.0 S.D.    0.683            > µ + 1.0 S.D.    0.158        32%
µ ± 2.0 S.D.    0.954            > µ + 2.0 S.D.    0.023        5%
µ ± 3.0 S.D.    0.9973           > µ + 3.0 S.D.    0.0013       0.3%
µ ± 4.0 S.D.    0.99994          > µ + 4.0 S.D.    0.00003
µ ± 5.0 S.D.    0.9999994        > µ + 5.0 S.D.    0.0000003

[Figure: normal density P(x); x-axis ticks µ-3σ, µ-2σ, µ-σ, µ, µ+σ, µ+2σ, µ+3σ correspond to z = -3, -2, -1, 0, 1, 2, 3]

Significance

• Based on the Normal Distribution, the probability that something is >1 SD away (larger or smaller) from the mean is 32%
• Based on the Normal Distribution, the probability that something is >2 SD away (larger or smaller) from the mean is 5%
• Based on the Normal Distribution, the probability that something is >3 SD away (larger or smaller) from the mean is 0.3%

The P-value
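The 32% / 5% / 0.3% figures quoted for 1, 2 and 3 standard deviations are two-sided tail probabilities of the normal distribution, which is also what a p-value is when the test statistic is a z-score. A minimal sketch using only the error function from Python's standard library (the function names are chosen here for illustration):

```python
import math

def prob_within(k):
    """P(|x - mu| <= k * sigma) for a normally distributed x."""
    return math.erf(k / math.sqrt(2.0))

def prob_outside(k):
    """Two-sided tail probability P(|x - mu| > k * sigma)."""
    return math.erfc(k / math.sqrt(2.0))

def prob_above(k):
    """One-sided tail probability P(x > mu + k * sigma)."""
    return 0.5 * prob_outside(k)

for k in (1, 2, 3, 4, 5):
    print(f"z = {k}: within = {prob_within(k):.7f}, "
          f"above = {prob_above(k):.7f}, outside = {prob_outside(k):.7f}")
```

`math.erfc` is used for the tails instead of `1 - erf(...)` because it stays accurate when the tail probability is very small.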

• The p-value is the probability of obtaining a test statistic (a score, a set of events, a height) at least as extreme as the one that was actually observed, assuming that the null hypothesis is true (probability of an observed event arising purely by chance)
• One "rejects the null hypothesis" when the p-value is less than the significance level α, which is often 0.05 or 0.01
• When the null hypothesis is rejected, the result is said to be statistically significant

Box-plot

Descriptive statistics can be intuitively summarized in a box-plot.

[Figure: box-plot, from top to bottom: 1.5 x IQR whisker, 75% quantile, median, 25% quantile, 1.5 x IQR whisker; the box spans the IQR]

Everything more than 1.5 x IQR above the 75% quantile or below the 25% quantile is considered an "outlier". IQR = interquartile range = 75% quantile – 25% quantile

Boxplots and Normal Distribution

particularly useful for comparing distributions between several groups or sets of data
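The box-plot quantities and the 1.5 x IQR outlier rule can be computed directly. The following is a sketch under stated assumptions: the measurement values are invented, the function name `boxplot_summary` is hypothetical, and the "inclusive" quantile method is only one of several common quartile conventions.

```python
import statistics

def boxplot_summary(values, whisker=1.5):
    """Return quartiles, IQR fences and the values flagged as outliers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        "outliers": outliers,
    }

# Hypothetical measurements with one suspicious value (invented for illustration)
measurements = [1.21, 1.24, 1.25, 1.26, 1.27, 1.28, 1.30, 1.31, 1.75]
summary = boxplot_summary(measurements)
print(summary)
```

Other quartile definitions shift the fences slightly, so borderline points can be classified differently by different software.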

Slide 24 | Application Example: Edge Effect and outlier detection

Label-free phenotypic profiling
Slide 25 | Application: Regulation of a metabolite in Pseudomonas aeruginosa (∆pqsE vs. wt)

featureidx   fold          log2fold      tstat          pvalue
52           90.49660842   -6.49979182   -26.68745376   0.00138943

[Figure: KO vs. wt]

Slide 26 | Fixing a Skewed Distribution

• A skewed distribution or exponentially decaying distribution can be transformed into a “normal” or Gaussian distribution by applying a log transformation
• This brings the outliers a little closer to the mean because it rescales the x-variable; it also makes the distribution much more Gaussian

Log Transformation
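The effect of a log transformation on skewness can be sketched with simulated data. Everything here is an assumption for illustration: the data are drawn from a log-normal distribution with arbitrary parameters, and skewness is measured with the simple moment-based coefficient. The last line recomputes the magnitude of the log2 fold change from the fold value 90.49660842 in the application example.

```python
import math
import random

def skewness(values):
    """Moment-based skewness: 0 for a symmetric distribution."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

rng = random.Random(0)  # fixed seed so the sketch is reproducible
raw = [rng.lognormvariate(10.0, 1.0) for _ in range(2000)]  # simulated, right-skewed
logged = [math.log2(v) for v in raw]

skew_raw = skewness(raw)
skew_log = skewness(logged)
print(f"skewness on linear scale: {skew_raw:.2f}")
print(f"skewness after log2 transformation: {skew_log:.2f}")

# Magnitude of the log2 fold change for the fold value in the application example
log2_fold = math.log2(90.49660842)
print(f"log2(90.49660842) = {log2_fold:.8f}")
```

Real intensity data are not exactly log-normal, so the transformed distribution will usually be closer to Gaussian rather than perfectly Gaussian.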

[Figure: two histograms. Left: skewed distribution on a linear scale (x-axis approx. 0 to 64000). Right: normal distribution after log transformation (x-axis approx. 8.0 to 16.0).]

Distinguishing 2 Populations

[Figure: height histograms of Population 1 and Population 2 and the resulting combined histogram; x-axis: height, y-axis: # of each]

Are they different?

What about these 2 Populations?

[Figure: height histograms of Population 1 and Population 2 and the resulting combined histogram; x-axis: height, y-axis: # of each]

Are they different?

Student’s t-Test

• Also called the t-Test
• Used to determine if 2 populations are different
• Formally allows you to calculate the probability of observing a difference between 2 sample means at least as large as the one measured if the population means are the same
• Hypothesis testing: H0: µ1 = µ2; H1: µ1 ≠ µ2
• If the t-Test gives you a p=0.4, and the α is 0.05, H0 cannot be rejected -> there is no evidence that the 2 populations are different (this does not prove that they are the same)
• If the t-Test gives you a p=0.04, and the α is 0.05, H0 is rejected -> the 2 populations are considered different
• Paired and unpaired t-Tests are available: paired is used for “before & after” experiments, while unpaired is for 2 randomly chosen samples

H0: Null Hypothesis
H1: Alternative Hypothesis

One- and two-tailed T-tests

• two-tailed test: deviations of the estimated parameter in either direction from some benchmark value are considered theoretically possible
• one-tailed test: used if only deviations in one direction are considered possible
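The hypothesis test can be sketched in a few lines. The example computes the unpaired Student's t statistic with pooled variance; because Python's standard library has no t-distribution, the two-tailed p-value is estimated here with a label-permutation test, which is a substitute technique and not the classical table lookup. The measurement values and function names are invented for illustration.

```python
import math
import random
import statistics

def t_statistic(a, b):
    """Unpaired Student's t statistic with pooled variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        pooled_var * (1.0 / na + 1.0 / nb))

def permutation_p_value(a, b, n_perm=5000, seed=0):
    """Two-tailed p-value: share of random relabelings with |t| >= observed |t|."""
    rng = random.Random(seed)
    observed = abs(t_statistic(a, b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(t_statistic(pooled[:len(a)], pooled[len(a):])) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical measurements (invented values)
wt = [10.1, 9.8, 10.4, 10.0, 9.9, 10.3]
ko = [11.2, 11.0, 11.5, 10.9, 11.3, 11.1]

t = t_statistic(wt, ko)
p = permutation_p_value(wt, ko)
alpha = 0.05
print(f"t = {t:.2f}, p = {p:.4f}, reject H0: {p < alpha}")
```

A one-tailed variant would drop the absolute values and count only deviations in the direction specified in advance.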

Slide 34 | How likely are findings to be true positives?

• Beware: p-value fixation
• The p=0.05 threshold is chosen by convention and is somewhat arbitrary
• Beware: the p-value is neither as reliable nor as objective as most scientists assume
• -> Very good paper on this: Nuzzo, R. Scientific method: statistical errors. Nature. 2014 Feb 13;506(7487):150-2. doi: 10.1038/506150a.

-> It is wise to also consider the effect size (e.g. fold change)

Slide 35 | What if the Distributions are not Normal? Mann-Whitney U-Test

• Also called the Wilcoxon Rank Sum Test
• Used to determine if 2 non-normally distributed populations are different
• More robust than the t-test against outliers and non-normality, and only slightly less powerful when the data are in fact normally distributed
• Formally allows you to calculate the probability of observing a difference at least as large as the one measured if the 2 samples come from the same population
• If the U-Test gives you a p=0.4, and the α is 0.05, then there is no evidence that the 2 populations are different
• If the U-Test gives you a p=0.04, and the α is 0.05, then the 2 populations are considered different

Data Comparisons

• In many kinds of experiments we want to know what happened to a population “before” and “after” some treatment or intervention
• In other situations we want to measure the dependency of one variable against another
• In still others we want to assess how the observed property matches the predicted property
• In all cases we will measure multiple samples or work with a population of subjects
• The best way to view this kind of data is through a scatter plot

Scatter Plots

• If there is some dependency between the two variables, or if there is a relationship between the predicted and observed variable, or if the “before” and “after” treatments led to some effect, then it is possible to see some clear patterns in the scatter plot
• This pattern or relationship is called correlation

Correlation

[Figure: three scatter plots: “+” correlation, uncorrelated, “-” correlation]

Correlation

[Figure: three scatter plots: high correlation, low correlation, perfect correlation]

Correlation Coefficient

$$r = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{\sqrt{\sum (x_i - \mu_x)^2 \sum (y_i - \mu_y)^2}}$$

[Figure: three scatter plots with r = 0.85, r = 0.4, r = 1.0]

Alternative: Spearman Rank Coefficient (less sensitive towards outliers)

Correlation Coefficient

• Sometimes called coefficient of linear correlation or Pearson product-moment correlation coefficient
• A quantitative way of determining what model (or equation or type of line) best fits a set of data
• Commonly used to assess most kinds of predictions, simulations, comparisons or dependencies

Correlation Coefficient vs. Coefficient of Determination

• R (correlation coefficient) vs. R² (coefficient of determination)
• R and R² are very different
• Do not confuse R with R²
• Do not call R² a correlation coefficient – THIS IS WRONG
• Avoid using R² in discussions or comparisons in scientific papers
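The difference between R and R² is easy to see numerically. This sketch implements the Pearson formula directly; the data points are invented, and the final line only converts the R² = 0.8733 of the Excel example back to R.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data (invented values): y decreases as x increases
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [9.8, 8.1, 6.9, 4.2, 3.1, 0.9]

r = pearson_r(x, y)
r_squared = r ** 2
print(f"R = {r:.3f}, R^2 = {r_squared:.3f}")  # R keeps the sign, R^2 loses it

# Converting the R^2 of the Excel example back to the magnitude of R
r_from_excel = math.sqrt(0.8733)
print(f"sqrt(0.8733) = {r_from_excel:.3f}")
```

R² discards the sign of the relationship and is always smaller in magnitude than R (unless |R| is 0 or 1), which is one reason the two should not be mixed up.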

[Figure: Excel scatter plot (data series 1) with linear trend line, y = 1.1196x - 0.4115, R² = 0.8733; both axes from 0 to 10]

Excel gives you R², not R! Example: R = 0.934

Significance of Correlation

[Figure: two scatter plots, r = 0.85 and r = 0.99]

Is this significant?

Significance & Correlation

Add 2 more points to the plot

[Figure: two scatter plots, r = 0.99 and r = 0.05]

It is important to collect at least 30 data points (to reach normal distribution)

Tricks to Getting Good (but meaningless) Correlation Coefficients (data dredging – manipulation)

• Use only data at extreme ends of the curve or line (r = 0.95). Is this significant?
• Use only a small number of “good” data points (r = 0.95). Is this significant?

Correlation and Outliers

Experimental error or something important?

A single “bad” point can destroy a good correlation

Outliers

• Can be both “good” and “bad”
• When modeling data -- you don’t like to see outliers (suggests the model is bad)
• Often a good indicator of experimental or measurement errors -- only you can know!
• When plotting metabolite concentration data you do like to see outliers
• A good indicator of something significant

Detecting Clusters: plot of height over weight of a human population

[Figure: scatter plot; x-axis: weight, y-axis: height]

Is it Right to Calculate a Correlation Coefficient?

[Figure: same scatter plot with regression line, r = 0.73]

Or is There More to This?

[Figure: same scatter plot with two clusters highlighted]

Any idea what the two clusters might represent?

Or is There More to This?

[Figure: same scatter plot with the two clusters labelled male and female]

Clustering Applications in

• Cheminformatics
• Transcriptomics, Proteomics, Metabolomics
• Microarray or GeneChip Analysis
• 2D Gel or ProteinChip Analysis
• Protein Analysis
• Phylogenetic and Evolutionary Analysis
• Structural Classification of Proteins
• Protein Sequence Families
• …

Clustering

• Definition - a process by which objects that are logically similar in characteristics are grouped together.
• Clustering is different from Classification
• In classification the objects are assigned to pre-defined classes, in clustering the classes are yet to be defined
• Clustering helps in classification

Clustering Requires...

• A method to measure similarity (a similarity measure, e.g. Pearson correlation coefficient) or dissimilarity (a dissimilarity coefficient, e.g. Euclidean distance) between objects (can be stored in a similarity or dissimilarity matrix)
• A threshold value with which to decide whether an object belongs to a cluster
• A way of measuring the “distance” between two clusters (e.g. linkage criterion for hierarchical clustering)
• A cluster seed (an object to begin the clustering process)

Clustering Algorithms

• K-means or Partitioning Methods - divides a set of N objects into M clusters -- with or without overlap
• Hierarchical Methods - produces a set of nested clusters in which each pair of objects is progressively nested into a larger cluster until only one cluster remains
• Self-Organizing Feature Maps - produces a cluster set through iterative “training”

Hierarchical Clustering

1. Find the two closest objects and merge them into a cluster
2. Find and merge the next two closest objects (or an object and a cluster, or two clusters) using some similarity measure and a predefined threshold
3. If more than one cluster remains, return to step 2 until finished

[Figure: initial cluster -> pairwise compare -> select closest -> select next closest]

Rule: λT = λobs ± 50 nm

Hierarchical Clustering

[Figure: step-wise construction of a dendrogram for objects A to F]

• Find the 2 most similar metabolite expression levels or curves
• Find the next closest pair
• Iterate

Heat map

Heatmaps

Descriptors, not clustered

[Figure: heat map. Columns: descriptors (not clustered). Rows: samples (here: compounds) after hierarchical clustering. Mean of each column is set to 0 by subtraction of the mean from each single value in the column.]

Row labels in clustered order (compound name followed by its mode-of-action class number): Apicidin_6, Aphidicolin_4, Amanitin_17, RhizopodinA_1, CytochalasinD_1, ChondramidC_1, Oligomycin_5, CruentarenA_5, MyxothiazolA_16, Dexamethasone_20, CCCP_16, Neopeltolide_16, Doxorubicin_17, GephyronicAcidA_15, H89_8, Cycloheximide_15, Anisomycin_15, Myriaporone_15, Soraphen_21, Camptothecin_17, ArgyrinA_14, OkadaicAcid_12, Indirubin_3_monoxime_7, ArchazolidB_18, SB203580_10, SB202190_10, PD169316_10, Oxamflatin_6, Nocodazole_13, Vinblastine_13, TubulysinB_13, Griseofulvin_13, Taxol_13, ActinomycinD_17, Rapamycin_11, Vioprolide_22, Alsterpaullone_7, SaframycinMx1_4, BrefeldinA_19, MG132_14, Velcade_14, Colchicine_13, Podophyllotoxin_13, Etoposide_17, LY294002_9, Wortmannin_9, EpothiloneB_13, Emetine_15, Methotrexate_4, RatjadonC_20, PMA_8, Trichostatin_6, A23187_2, PurvalanolA_7, Scriptaid_6, CyclosporinA_2, Puromycin_15, Apicularen_18, Staurosporine_8, Mevastatin_3, Simvastatin_3, Tunicamycin_19, ChivosazolA_1, Chelerythrine_8

Slide 61 | 2-dimensional case: 2 descriptors

[Figure: scatter plot; x-axis: Descriptor 1, y-axis: Descriptor 2]

3 samples A, B, C, each characterized by only two descriptors -> usually we have many more (transcript levels, metabolite levels). With 6000 transcript levels we would need to work in 6000-dimensional space, which is hard to imagine, but the mathematical treatment is the same.

Slide 62 | Agglomerative hierarchical clustering

• Average linkage algorithm, four compounds

• Algorithm starts with each compound in a separate cluster
• The distance matrix is searched for the pair of compounds with the smallest distance between them, and these are merged into one cluster: A-B
• The distance matrix is recalculated with updated distances: A and B are replaced by the midpoint between them and the distances of compounds C and D to this midpoint are recalculated: d(AB,C)=2.85, d(AB,D)=4.81, d(C,D)=2.7 remains unchanged.
• Algorithm repeats to find the smallest distance value in the distance matrix: compounds C and D are merged into one cluster and replaced by the midpoint C-D
• Final distance between the two midpoints d(AB,CD)=3.83 -> clusters AB and CD are merged as the last step of the algorithm
• Iterative process
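The merging procedure described above can be sketched as follows. This follows the slide's description literally: each merged pair is replaced by the midpoint of its two representative points. Standard average linkage instead averages all pairwise distances between cluster members, so this is a simplified variant. The coordinates are invented and do not reproduce the distances quoted on the slide.

```python
import math
from itertools import combinations

def midpoint_clustering(points):
    """Agglomerative clustering that replaces each merged pair by its midpoint.

    points: dict mapping a name to a coordinate tuple.
    Returns a list of merges as (name_a, name_b, distance).
    """
    clusters = dict(points)
    merges = []
    while len(clusters) > 1:
        # Search all pairs for the smallest distance (the "distance matrix" step)
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda pair: math.dist(clusters[pair[0]], clusters[pair[1]]))
        distance = math.dist(clusters[a], clusters[b])
        pa, pb = clusters.pop(a), clusters.pop(b)
        # Replace the two merged clusters by the midpoint between them
        clusters[a + b] = tuple((u + v) / 2.0 for u, v in zip(pa, pb))
        merges.append((a, b, distance))
    return merges

# Hypothetical compounds described by two descriptors (invented coordinates)
compounds = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (4.0, 0.0), "D": (4.0, 2.0)}

merges = midpoint_clustering(compounds)
for a, b, d in merges:
    print(f"merge {a} + {b} at distance {d:.2f}")
print(f"direct distance B-C: {math.dist(compounds['B'], compounds['C']):.2f}")
```

In this invented example the direct distance between B and C differs from the height at which their clusters are finally joined, which is the kind of information a dendrogram cannot show.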

Slide 63 | [Figure: three panels, each plotting Descriptor 2 against Descriptor 1]

Slide 64 | The true distance of samples is not always well represented by the dendrogram

dist(AB-CD)=3.83

dist(C-D)=2.7

The distance (or similarity) of B and C is hard to judge from the dendrogram: dist(B-C)=2.61, but what you see instead is the distance between the clusters AB and CD. A dendrogram cannot display all pairwise distances; it is always a compromise, and the algorithm used has a huge impact on the resulting dendrogram.

Slide 65 | Important Consideration regarding Clustering…

Founder of the Bioconductor Project (together with Robert Gentleman), one of the leading statisticians in omics science, says:

Slide 66 | Clustering – how robust is this method?

• PART OF THE MISUNDERSTOOD!
• Most often ignored.
• Cluster structure is treated as reliable and precise
• BUT! Usually the structure is rather unstable, at least at the bottom.
• Can be VERY sensitive to noise and to outliers -> Yes!!
• Homogeneity and Separation

Final Thoughts

• The most overused statistical method in gene expression analysis
• Gives us a pretty red-green picture with patterns
• But the pretty picture tends to be pretty unstable.
• Many different ways to perform hierarchical clustering
• Tends to be sensitive to small changes in the data
• Provides clusters of every size: where to “cut” the dendrogram is user-determined

What is “good clustering”?

• Internal criterion: A good clustering will produce high quality clusters in which:
  - the intra-class (that is, intra-cluster) similarity is high
  - the inter-class similarity is low
• The measured quality of a clustering depends on both the linkage criterion and the similarity measure used
• It is possible to perform a statistical test to check how well two clusters are separated, e.g. a t-test when data points within the clusters are normally distributed, giving you a p-value

Multivariate Statistics

Multiple variables for one population

Biological data sets often contain several variables; they are multivariate. Scatter plots allow us to look at two variables at a time.

The boxplot function can be used to display several variables at a time.

Omics Experiments and Multivariate Statistics

• Metabolomics experiments typically measure many metabolites at once; in other words, the instruments are measuring multiple variables and so metabolomic data are inherently multivariate data
• Metabolomics requires multivariate statistics
• Metabolomics: hundreds to thousands of metabolites
• Transcriptomics: thousands (Pseudomonas: 6000, human: 20,000 and more… splice isoforms, small RNAs etc.)
• Multivariate statistics requires more complex, multidimensional analyses or dimensional reduction methods

Multivariate Statistics – The Trick

• The key trick in multivariate statistics is to find a way that effectively reduces the multivariate data into univariate data
• Once done, you can apply the same univariate concepts such as p-values, t-Tests and ANOVA tests to the data
• The trick is dimensional reduction

Dimension Reduction & PCA

• PCA – Principal Component Analysis
• Process that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components
• Reduces 1000’s of variables to 2-3 key features
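The idea can be sketched without any external library: centre the variables, build the covariance matrix and extract its leading eigenvectors. The sketch below uses power iteration with deflation, which is one simple way to obtain the first few components and is an implementation choice made here; statistical packages typically use a singular value decomposition. The data matrix is invented (3 "control" and 3 "treated" samples, 3 variables).

```python
import math

def pca(data, n_components=2, iterations=500):
    """Minimal PCA via power iteration. Rows are samples, columns are variables."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in data]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    total_variance = sum(cov[i][i] for i in range(p))

    components, explained = [], []
    for _component in range(n_components):
        v = [1.0 / (j + 1) for j in range(p)]  # arbitrary starting vector
        for _step in range(iterations):
            w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                         for i in range(p))
        components.append(v)
        explained.append(eigenvalue / total_variance)
        # Deflation: remove the component just found from the covariance matrix
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(p)]
               for i in range(p)]

    # Scores: projection of each centred sample onto the components
    scores = [[sum(r[j] * v[j] for j in range(p)) for v in components]
              for r in centered]
    return scores, components, explained

# Hypothetical data (invented values): rows 0-2 control, rows 3-5 treated
data = [[1.0, 2.0, 0.5], [1.2, 1.9, 0.4], [0.9, 2.1, 0.6],
        [3.0, 0.5, 0.5], [3.2, 0.4, 0.6], [2.9, 0.6, 0.4]]

scores, components, explained = pca(data)
for label, (pc1, pc2) in zip(["ctrl"] * 3 + ["treated"] * 3, scores):
    print(f"{label}: PC1 = {pc1:+.2f}, PC2 = {pc2:+.2f}")
print("explained variance ratios:", [round(e, 3) for e in explained])
```

The sign of a principal component is arbitrary, so the two groups may appear on either side of zero along PC1; only their separation is meaningful.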

Scores plot

Principal Component Analysis

Hundreds of peaks -> 2 components

[Figure: scores plot; x-axis: PC1 (approx. -30 to 10), y-axis: PC2 (approx. -25 to 25); groups: Control (PAP), Treatment 1 (ANIT), Treatment 2]

Scores plot

PCA captures what should be visually detectable. If you can’t see it, PCA probably won’t help.

Visualizing PCA: Reduction from 3 to 2 dimensions

• PCA of a “bagel”
• Two flashlight projections
• One projection produces a wiener
• Another projection produces an “O”

• The “O” projection captures most of the variation and has the largest eigenvalue (PC1)
• The wiener projection is PC2 and gives depth info

Which projection captures most of the information on the objects?

PCA Plot Nomenclature

• PCA generates two kinds of plots, the scores plot and the loadings plot
• Scores plot (on right) plots the data using the main principal components
• Three groups (e.g. wt, KO1, KO2 or untreated/treatment1/treatment2)
• How well are the groups separated?

PCA Loadings Plot

• Loadings plot shows how much each of the variables (metabolites) contributed to the different principal components
• Variables at the extreme corners contribute most to the scores plot separation (and are probably most interesting to investigate further)
• Loadings plot can guide your analysis

PCA Advice

• In some cases PCA will not succeed in identifying any clear clusters or obvious groupings no matter how many components are used. If this is the case, it is wise to accept the result and assume that the presumptive classes or groups cannot be distinguished
• As a general rule, if a PCA analysis fails to achieve even a modest separation of classes, then it is probably not worthwhile using other statistical techniques to try to separate them

PCA vs. PLS-DA

• Partial Least Squares Discriminant Analysis
• PLS-DA is a supervised classification technique while PCA is an unsupervised technique
• PLS-DA uses “labeled” data while PCA uses no prior knowledge
• PLS-DA enhances the separation between groups of observations by rotating PCA components such that a maximum separation among classes is obtained
• PLS-DA results are essentially prediction models or class predictors
• These models need to be validated and assessed to make sure they are not over-trained or over-fitted

Validating PLS-DA (Permutation)

[Figure: three scores plots: PCA; PLS-DA on labelled data; PLS-DA on permuted data]

Data Analysis Progression

• Unsupervised Methods
  - PCA or cluster to see if natural clusters form or if data separates well
  - Data is “unlabeled” (no prior knowledge)

• Supervised Methods/Machine Learning
  - Data is labeled (prior knowledge)
  - Used to see if data can be classified
  - Helps separate less obvious clusters or features

• Statistical Significance
  - Supervised methods always generate clusters -- this can be very misleading
  - Check if clusters are real by label permutation

Note of Caution

• Supervised classification methods are powerful
  - Learn from experience
  - Generalize from previous examples
  - Perform pattern recognition
• Too many people skip the PCA or clustering steps and jump straight to supervised methods
• Some get great separation and think the job is done - this is where the errors begin…
• Too many don’t assess significance using permutation testing or n-fold cross validation
• If separation isn’t partially obvious by eye-balling your data, you may be treading on thin ice

R is the `lingua franca' of data analysis and statistical computing

• R is a language and environment for statistical computing and graphics (~6000 packages)
• Bioconductor: uses R and provides state of the art tools for analysis of high-throughput omics data (~1000 packages)

New York Times, 2009

Slide 83 | Part II: Applications

• Multi-parameter phenotypic profiling: using cellular effects to characterize bioactive compounds
• Network Pharmacology, Modeling of Signal Transduction Networks

Slide 84 | Target Identification is a Bottleneck in Drug Discovery

[Figure: workflow with the boxes: Screening of extracts for new natural products; Discovery of new chemical entities; Drugs; Probes; Assay development (single target and high content) and screening; Mode of action and Target identification]

Slide 85 | HZI Natural Products Collection

151 natural products:
• ~30%: mode of action is known
• ~30%: some information on bioactivity, but no information on mode of action
• ~40%: no information about bioactivity / mode of action

Slide 86 | Profiling of natural products: Phenotypic Screening (High Content Screening, HCS)

High Content Screening (= phenotypic, cell-based screening):

• Multiparametric interrogation of cellular processes
• Set of output features (high information content) describes a complex cellular phenotype, not only the effect on one protein target
• Often: automated imaging approach to understand compound activities in cellular assays (carried out in MTPs)

[Figure: two readouts. Automatic microscope: antibody staining, automatic image analysis. Cellular impedance (Roche xCelligence): TCRP (time-dependent cellular response profiles).]

Slide 87 | Cell-Based Morphological Screening: Roche xCelligence

[Figure: Control Unit; E-plate 96; Analyzer Station, in incubator]

• label-free system
• real-time kinetics (over days)
• high information content

96-well E-plate: tissue culture plate with interdigitated gold microelectrodes at the bottom of each well

Slide 88 | Electrical Impedance Sensors

• Cells are cultured as a layer on microelectrodes and an alternating voltage is supplied.

• Impedance (Z) = ratio of voltage (V) to current intensity (I)

Impedance is extremely sensitive to changes in and rearrangements of the ionic microenvironment -> the method is susceptible to a wide range of cellular alterations:
• Proliferation
• Morphology (cytoskeleton)
• Volume
• Adhesion
• Migration
• Invasion
• Death (apoptosis)

Slide 89 | TCRPs

[Figure: NCI (0.0 to 1.5) over time t (0 to 120); curve: Archazolid B (V-ATPase)]

Slide 90 | TCRPs

[Figure: NCI over time t; curves: SB202190 (p38 inh.), Archazolid (V-ATPase)]

Slide 91 | TCRPs

[Figure: NCI over time t; curves: SB202190 (p38 inh.), Archazolid (V-ATPase), Apicularen (V-ATPase)]

Slide 92 | TCRPs

[Figure: NCI over time t; curves: SB203580 (p38 inh.), SB202190 (p38 inh.), Archazolid (V-ATPase), Apicularen (V-ATPase)]

Premise: Compounds with the same mode of action produce similar TCRPs

Slide 93 | Reference Compounds: 63 in 21 MoA categories

Class  MoA                  Compounds
1      Actin                ChivosazolA, ChondramidC, CytochalasinD, RhizopodinA
2      Calcium regulation   A23187, CyclosporinA
3      Cholesterol          Mevastatin, Simvastatin
4      DNA                  Aphidicolin (DNA replication), Methotrexate (DNA synthesis), SaframycinMx1 (DNA synthesis)
5      F-ATPase             CruentarenA, Oligomycin
6      HDAC                 Apicidin, Oxamflatin, Scriptaid, Trichostatin
7      CDK                  Alsterpaullone, Indirubin-3-monoxime, PurvalanolA
8      PKC_PKA              Chelerythrine, H89, PMA, Staurosporine
9      PI3K                 LY294002, Wortmannin
10     p38                  PD169316, SB202190, SB203580
11     mTor                 Rapamycin
12     Phosphatase          OkadaicAcid
13     Microtubule          Colchicine, EpothiloneB, Griseofulvin, Nocodazole, Podophyllotoxin, Taxol, TubulysinB, Vinblastine
14     Proteasome           ArgyrinA, MG132, Velcade
15     Protein synthesis    Anisomycin, Cycloheximide, Emetine, GephyronicAcidA, Myriaporone, Puromycin
16     Respiratory chain    CCCP, MyxothiazolA, Neopeltolide
17     Transcription        ActinomycinD, Amanitin, Camptothecin, Doxorubicin, Etoposide
18     V-ATPase             Apicularen, ArchazolidB
19     Vesicle trafficking  BrefeldinA, Tunicamycin
20     Nuclear receptor
21     Lipid synthesis

Slide 94 | Inference of mode of action from TCRPs

Goal: Infer mode of action from TCRP for bioactive compounds with unknown target by comparison with TCRP of compound with known target

Bioinformatics / Statistical data mining methods are needed!

Slide 95 | Experimental Design

• 2 key concepts: randomization and replication -> triplicates are distributed randomly over the microtiter plates (R: sampling without replacement)
• Check for edge effects: boxplots, remove outliers
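Randomized placement of triplicates can be sketched with sampling without replacement. The slide refers to R for this step; the Python sketch below is an equivalent illustration, not the lecture's code. The function name, the plate geometry constants and the short compound list are assumptions chosen for the example.

```python
import random

ROWS = "ABCDEFGH"      # 8 rows of a 96-well plate
COLUMNS = range(1, 13)  # 12 columns

def randomized_layout(compounds, replicates=3, seed=None):
    """Assign each compound to `replicates` randomly chosen, distinct wells."""
    wells = [f"{row}{col}" for row in ROWS for col in COLUMNS]
    needed = len(compounds) * replicates
    if needed > len(wells):
        raise ValueError("not enough wells on one plate")
    rng = random.Random(seed)
    chosen = rng.sample(wells, needed)  # sampling without replacement
    layout = {}
    for i, compound in enumerate(compounds):
        for well in chosen[i * replicates:(i + 1) * replicates]:
            layout[well] = compound
    return layout

layout = randomized_layout(["DMSO", "MG132", "Taxol"], seed=1)
for well in sorted(layout):
    print(well, layout[well])
```

Fixing the seed makes a layout reproducible for record keeping; omitting it gives a fresh random layout for each plate.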

From: United States Patent Application 20090309618

DMSO in all 96 wells

Slide 96 | Issue One: Edge Effect and outliers

NCI[all] = 1.261336
NCI[A1:A12] = 1.314866
NCI[H1:H12] = 1.321426
NCI[col1] = 1.245519
NCI[col12] = 1.271254

Outlier removal (criterion: NCI < 1.1 OR NCI > 1.4): G7, H2, H5, H10

Slide 97 | Is straightforward curve clustering an option?

• Use each individual measurement of cellular impedance as descriptor • Measure of dissimilarity (or distance): Euclidean distance • All pairwise distances: stored in a distance matrix and then used to cluster the curves

Problems:
• Each individual measurement would have to take place at exactly the same time point after compound addition: experimentally not feasible, since we want to combine measurements from many different experiments
• Curse of dimensionality (hundreds of individual measurements)
• In the presence of measurement errors, direct clustering methods do not take advantage of the functional structure of the data

Slide 98 | Dimension Reduction

Main motivation: to avoid the curse of dimensionality (not computational costs)

• By considering more and more attributes (descriptors) in a dataset, we mathematically construct a high-dimensional space.
• The more descriptors are used, the more similar objects appear, because distance functions cannot separate points well in high dimensions.

-> Distance functions lose their usefulness in high dimensionality.
-> We need a clever way to reduce the dimensions without losing information
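The loss of contrast between distances can be illustrated with a small simulation. This is a sketch under simple assumptions (uniformly distributed random points, one random query point); the function name `relative_contrast` is chosen for this example only.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(d_max - d_min) / d_min for distances from one query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

# The contrast between nearest and farthest neighbour shrinks with dimension
for dim in (2, 10, 100, 500):
    print(dim, round(relative_contrast(dim), 3))
```

In low dimensions the farthest point is many times farther away than the nearest one; with hundreds of descriptors all points end up at nearly the same distance from the query.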

Keep the functional structure of the data by fitting the curves with a non-linear parametric model, and then use the model coefficients for partitioning

Slide 99 | Curve fitting with polynomials: Runge's Phenomenon

• TCRPs show many inflection points, so you would need high-degree polynomials
• Polynomial interpolation with polynomials of high degree: problem of oscillation at the edges of an interval.
• The error between the function and the interpolating polynomial gets worse for higher-order polynomials.
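The effect can be reproduced numerically. The sketch below uses the classic Runge function 1/(1 + 25x²) on [-1, 1] with equally spaced nodes, which is an assumption for illustration (it is not a TCRP), and compares the maximum interpolation error for degree 4 and degree 10.

```python
def runge(x):
    """Classic test function for Runge's phenomenon."""
    return 1.0 / (1.0 + 25.0 * x * x)

def lagrange_interpolate(xs, ys, x):
    """Evaluate the interpolating polynomial through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def max_error(degree, n_eval=2001):
    """Largest deviation between function and interpolant on a fine grid."""
    nodes = [-1.0 + 2.0 * k / degree for k in range(degree + 1)]
    values = [runge(x) for x in nodes]
    grid = [-1.0 + 2.0 * k / (n_eval - 1) for k in range(n_eval)]
    return max(abs(lagrange_interpolate(nodes, values, x) - runge(x)) for x in grid)

for degree in (4, 10):
    print(degree, max_error(degree))
```

The largest deviations occur near the interval edges, which is where the oscillation described above shows up.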

[Figure: MG132 TCRP, NCI (0.0 to 2.5) over t (0 to 120). Legend: red: function to be interpolated; blue: 5th-order interpolating polynomial; green: 9th-order interpolating polynomial]

Slide 100 | Cubic Smoothing Splines

$f(x) = a + bx + cx^2 + dx^3$

• A spline is a special function defined piecewise by polynomials

• The results are as good as with high-degree polynomials, while avoiding Runge's phenomenon

• Continuity conditions have to be fulfilled at the nodes, where two cubic polynomials are connected
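Written out for two neighbouring cubic pieces, the continuity conditions take the usual textbook form below. The symbols $S_i$, $x_i$, $y_k$, $t_k$ and $\lambda$ are notation assumed for this sketch, and the penalized least-squares criterion is the standard definition of a cubic smoothing spline rather than something stated on the slide.

```latex
% Piece i of the spline on the interval between two nodes
S_i(x) = a_i + b_i x + c_i x^2 + d_i x^3, \qquad x \in [x_i, x_{i+1}]

% Continuity of value, slope and curvature at the interior node x_{i+1}
S_i(x_{i+1}) = S_{i+1}(x_{i+1}), \qquad
S_i'(x_{i+1}) = S_{i+1}'(x_{i+1}), \qquad
S_i''(x_{i+1}) = S_{i+1}''(x_{i+1})

% Smoothing spline: trade-off between fit and roughness, controlled by \lambda \ge 0
\min_f \; \sum_{k=1}^{n} \bigl( y_k - f(t_k) \bigr)^2 + \lambda \int f''(t)^2 \, dt
```

For $\lambda = 0$ the spline interpolates the data; larger values of $\lambda$ give smoother curves.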

Slide 101 | Splines fitting

[Figure: spline fit of the MG132 TCRP; x-axis: t (0 to 120), y-axis: NCI (0.0 to 2.5)]

Slide 102 | Splines fitting

[Figure: spline fit of the MG132 TCRP; x-axis: t (0 to 120), y-axis: NCI (0.0 to 2.5)]

Slide 103 | Construction of the distance matrix using basis spline coefficients

B-spline basis function of order p with node vector τ

- The basis spline coefficients are used as descriptors for the fitted TCRPs, on which a distance function δ is defined (δ_{i,j} := distance between the i-th and j-th objects). These distances are the entries of the dissimilarity matrix.
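The idea can be sketched with the Cox-de Boor recursion for B-spline basis functions. Everything concrete here is an assumption for illustration: the code takes p as the polynomial degree (conventions for "order" differ), the knot vector and the coefficient vectors are made up, and Euclidean distance is used for δ.

```python
import math

def bspline_basis(i, p, knots, x):
    """Value of the i-th B-spline basis function of degree p at x (Cox-de Boor)."""
    if p == 0:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left = right = 0.0
    span_left = knots[i + p] - knots[i]
    if span_left > 0:
        left = (x - knots[i]) / span_left * bspline_basis(i, p - 1, knots, x)
    span_right = knots[i + p + 1] - knots[i + 1]
    if span_right > 0:
        right = (knots[i + p + 1] - x) / span_right * bspline_basis(i + 1, p - 1, knots, x)
    return left + right

def spline_value(coeffs, p, knots, x):
    """Spline = weighted sum of basis functions; the weights are the coefficients."""
    return sum(c * bspline_basis(i, p, knots, x) for i, c in enumerate(coeffs))

# Hypothetical clamped cubic knot vector: 10 knots -> 6 basis functions
knots = [0, 0, 0, 0, 1, 2, 3, 3, 3, 3]
coeffs_a = [1.0, 0.8, 0.6, 0.9, 1.2, 1.3]  # made-up coefficients of curve a
coeffs_b = [1.0, 0.9, 0.7, 0.9, 1.1, 1.2]  # made-up coefficients of curve b

# One entry of the dissimilarity matrix: distance between coefficient vectors
delta_ab = math.dist(coeffs_a, coeffs_b)
print(spline_value(coeffs_a, 3, knots, 1.5), delta_ab)
```

Each curve is thereby reduced from hundreds of measurements to a handful of coefficients, and the distances no longer depend on the curves being sampled at identical time points.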

Slide 104 | Computation and Extraction of the Basis Spline Coefficients

Slide 105 | Distance Matrix

Slide 106 | NEW: Tukey median polish to identify outliers

[Figure: curves of the three replicates CytochalasinD_1.1, CytochalasinD_1.2 and CytochalasinD_1.3 (blue, green, red); x-axis: rownames(z), y-axis: z[, 1]]

Slide 107 | Median polish algorithm

Tukey median polish to identify outliers

Time     CytochalasinD_1.1   CytochalasinD_1.2   CytochalasinD_1.3
0        1.089540342         1.07404513          1.04484933
0.0837   0.716588433         0.840969034         1.08078887
0.1675   0.532813745         0.671681157         1.05516068
0.2509   0.485253742         0.614535508         1.04695165
0.3342   0.493933221         0.611214223         1.04044449
0.4175   0.521388717         0.637882192         1.03493843
0.5003   0.555486671         0.676663085         1.01621784
0.5837   0.591887344         0.716030087         0.99909901
0.6670   0.628376583         0.761941975         0.98668535
0.7503   0.665220087         0.800722868         0.97857643
0.8337   0.697635285         0.837061639         0.97377115
0.9170   0.727039235         0.874377259         0.97006707
10.003   0.762465681         0.911009085         0.97146862
12.503   0.853865911         1.004102764         0.96866553
15.003   0.923213179         1.060564619         0.97907698
17.503   0.978478434         1.098173293         0.98097908
20.003   1.027544062         1.128064863         0.96876564
22.487   1.057390842         1.149750904         0.95735309
24.987   1.08661766          1.164110579         0.95314846
27.487   1.106013639         1.180033213         0.94904395

• $Y_{tj}$, where j is the j-th replicate and t is a time point in some small interval [a < t < b]
• data = overall effect + row effect + column effect + residual
• $Y_j(t) = m(t) + s_j + e_j(t)$
• Residuals define how well a replicate fits the model -> we can use them for outlier detection
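The decomposition above can be sketched as follows. This is a plain re-implementation of Tukey's median polish for illustration, not the code used in the lecture; the input is the first six rows of the table on this slide, with rows as time points and columns as replicates, so that m(t) corresponds to overall + row effect and s_j to the column effect.

```python
from statistics import median

def median_polish(table, n_iter=10):
    """Decompose table[i][j] into overall + row effect + column effect + residual."""
    n_rows, n_cols = len(table), len(table[0])
    resid = [list(row) for row in table]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep row medians out of the residuals
        for i in range(n_rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Sweep column medians out of the residuals
        for j in range(n_cols):
            m = median(resid[i][j] for i in range(n_rows))
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    return overall, row_eff, col_eff, resid

# First six time points of CytochalasinD_1.1, _1.2 and _1.3 from the slide
table = [
    [1.089540342, 1.07404513, 1.04484933],
    [0.716588433, 0.840969034, 1.08078887],
    [0.532813745, 0.671681157, 1.05516068],
    [0.485253742, 0.614535508, 1.04695165],
    [0.493933221, 0.611214223, 1.04044449],
    [0.521388717, 0.637882192, 1.03493843],
]
overall, row_eff, col_eff, resid = median_polish(table)
for row in resid:
    print([round(v, 3) for v in row])
```

Using medians instead of means keeps a single deviating replicate from distorting the fitted effects, so the deviation remains visible in the residuals.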

Slide 108 |

Slide 109 | Workflow

[Figure: median polish residuals of CytochalasinD_1.1, CytochalasinD_1.2 and CytochalasinD_1.3; axis from -0.5 to 1.0]

• Define a range threshold (worked better than IQR threshold) to automatically detect and remove outliers, here CytochalasinD_1.3
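One possible reading of the range threshold is sketched below: a replicate is flagged when the spread (maximum minus minimum) of its median polish residuals exceeds a cut-off. The residual values and the threshold of 0.5 are hypothetical, and the slide does not state how the range was actually defined.

```python
def residual_range(values):
    """Spread of the residuals of one replicate."""
    return max(values) - min(values)

def flag_replicates(residuals, threshold):
    """Names of replicates whose residual range exceeds the threshold."""
    return [name for name, values in residuals.items()
            if residual_range(values) > threshold]

# Hypothetical median polish residuals per replicate
residuals = {
    "CytochalasinD_1.1": [-0.05, 0.02, 0.00, 0.04],
    "CytochalasinD_1.2": [0.03, -0.02, 0.00, 0.01],
    "CytochalasinD_1.3": [0.90, 0.40, -0.30, 0.10],
}
flagged = flag_replicates(residuals, threshold=0.5)
print(flagged)
```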

Slide 110 |

[Figure: clustering of the compounds; data combined from 8 different runs, DMSO normalized. Labels (compound_MoA class) in plotted order: Apicidin_6, Aphidicolin_4, Amanitin_17, RhizopodinA_1, CytochalasinD_1, ChondramidC_1, Oligomycin_5, CruentarenA_5, MyxothiazolA_16, Dexamethasone_20, CCCP_16, Neopeltolide_16, Doxorubicin_17, GephyronicAcidA_15, H89_8, Cycloheximide_15, Anisomycin_15, Myriaporone_15, Soraphen_21, Camptothecin_17, ArgyrinA_14, OkadaicAcid_12, Indirubin_3_monoxime_7, ArchazolidB_18, SB203580_10, SB202190_10, PD169316_10, Oxamflatin_6, Nocodazole_13, Vinblastine_13, TubulysinB_13, Griseofulvin_13, Taxol_13, ActinomycinD_17, Rapamycin_11, Vioprolide_22, Alsterpaullone_7, SaframycinMx1_4, BrefeldinA_19, MG132_14, Velcade_14, Colchicine_13, Podophyllotoxin_13, Etoposide_17, LY294002_9, Wortmannin_9, EpothiloneB_13, Emetine_15, Methotrexate_4, RatjadonC_20, PMA_8, Trichostatin_6, A23187_2, PurvalanolA_7, Scriptaid_6, CyclosporinA_2, Puromycin_15, Apicularen_18, Staurosporine_8, Mevastatin_3, Simvastatin_3, Tunicamycin_19, ChivosazolA_1, Chelerythrine_8]

MoA classes: 1 Actin, 2 Calcium regulation, 3 Cholesterol, 4 DNA replication, 5 F-ATPase, 6 HDAC, 7 CDK, 8 PKC_PKA, 9 PI3K, 10 p38, 11 mTor, 12 Phosphatase, 13 Microtubule, 14 Proteasome, 15 Protein synthesis, 16 respiratory chain, 17 Transcription, 18 V-ATPase, 19 Vesicle trafficking, 20 Nuclear receptor, 21 Lipid synthesis

• Only partitioning of the data, no class prediction
• Parameters are difficult to optimize, no score for …

Slide 111 | A rank based score – simple but effective

For each compound, the distance matrix is sorted by ascending Euclidean distance. A normalized score is needed to make scores comparable among classes with different numbers of class members (different number of replicates, or different number of MoA class members etc.): divide the achieved rank sum by the ideal rank sum.

Example: Anisomycin. Score = rank sum = 1 + 2 = 3. Normalized score = rank sum / ideal rank sum, here: 3/3 = 1
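The score can be sketched as follows. The interpretation is an assumption: for one query profile, all other profiles are ranked by distance, the ranks of the profiles from the same class are summed, and the sum is divided by the ideal rank sum 1 + 2 + ... + k for k class mates. The distances and names below are made up; only the resulting 3/3 = 1 mirrors the Anisomycin example.

```python
def normalized_rank_score(distances, labels, query):
    """Rank sum of the query's class mates divided by the ideal rank sum."""
    others = sorted((n for n in distances if n != query), key=lambda n: distances[n])
    mate_ranks = [rank for rank, name in enumerate(others, start=1)
                  if labels[name] == labels[query]]
    k = len(mate_ranks)
    if k == 0:
        raise ValueError("query has no other class members")
    ideal = k * (k + 1) // 2  # class mates occupy ranks 1..k
    return sum(mate_ranks) / ideal

labels = {"Anisomycin_1": 15, "Anisomycin_2": 15, "Anisomycin_3": 15, "Taxol_1": 13}

# Made-up distances from Anisomycin_1: both class mates are the nearest neighbours
ideal_case = {"Anisomycin_2": 0.10, "Anisomycin_3": 0.20, "Taxol_1": 0.90}
score_ideal = normalized_rank_score(ideal_case, labels, "Anisomycin_1")

# Made-up distances where a foreign profile sits between the class mates
mixed_case = {"Anisomycin_2": 0.10, "Anisomycin_3": 0.20, "Taxol_1": 0.15}
score_mixed = normalized_rank_score(mixed_case, labels, "Anisomycin_1")

print(score_ideal, score_mixed)
```

A score of 1 means that all class mates are the nearest neighbours; the larger the score, the more foreign profiles are ranked in between.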

The normalized score can be used as a threshold to remove outliers and compound classes that do not work, to decide where additional replicates are needed, or to identify compounds whose quality has to be investigated

The normalized score can also be used to quantitatively evaluate the effect of data mining procedures (e.g. different scaling and normalization methods etc.) on the analysis outcome -> Quantitative Performance Criterion

Slide 112 | Case Study: Jerantinine E

Alkaloid natural product Jerantinine E: MoA was unknown

Destabilization of microtubules.

Slide 113 | Case Study: Jerantinine E

-> Profiling with xCelligence pointed in the right direction; validation with secondary assays is always necessary!

R. Frei et al., Angew Chem Int Ed 2013(50):13373-6

Slide 114 | Summary and Outlook

• We established a promising compound profiling method to predict the mode of action of natural products from TCRPs
• Advantage over other profiling methods: we can observe the kinetics of compound action, not just a snapshot (as in imaging)
• Combination with descriptors from automatic microscope (dimension reduction as pre-requisite was done)
• Cubic Smoothing Splines for dimension reduction: general idea also for time-dependent transcriptome data for instance
• Bioinformatics and intelligent data mining methods allowed us to infer the compound mode of action from growth curves that look simple at first sight

Slide 115 | Important Skills

• Three things that you have to know
• Linux + a bit of shell scripting
• A scripting language (e.g. Perl or Python)
• R
• Statistical Data Mining

PLoS Comput Biol. 2009 December; 5(12): e1000589.
Join BIC, DSMZ: R Club
Great online resources: Bioconductor.org, Coursera.org

Slide 116 | Resources to delve in deeper

• http://simplystatistics.org/ - fantastic blog maintained by leading statisticians working in the field: Rafael Irizarry (Harvard), Jeff Leek and Roger Peng (Johns Hopkins University)
• MOOCs (Massive Open Online Courses): edX, Coursera, Udacity
• edX: Statistics and R for the Life Sciences (new 8-part course by Rafael Irizarry, has just started on the 19th January)
• Coursera: Specialization series “Data Sciences” by Jeff Leek and Roger Peng
• All courses are free, but you can pay a small fee to obtain an official certificate
• Andy Field: YouTube videos, very entertaining, more focus on stats in psychology: https://www.youtube.com/user/ProfAndyField/

Slide 117 |