Statistical Genomics
Statistical Genomics, MSc CoMPLEX, UCL. Dr Andrew Teschendorff, Statistical Cancer Genomics, UCL Cancer Institute ([email protected]). December 2012.

Outline
1. Motivation and biological/clinical background.
2. Statistical tests: parametric vs non-parametric testing, univariate and multivariate regressions, empirical nulls.
3. The multiple-testing problem in genomics: estimating the false discovery rate (FDR).
4. Power calculations in genomic studies.
5. Gene Set Enrichment Analysis (GSEA).
6. Dimensional reduction: singular value decomposition.

Statistical Genomics
Definition: the development and application of statistical methodology to help analyse and interpret data from omic technologies.
Goal: ultimately, the development of statistical algorithms and software to improve the clinical management of complex genetic diseases.

Motivation (biological/clinical)
• "Omic" data sets (e.g. mRNA expression, SNPs, DNA methylation) have revolutionized the field of molecular genetics and medicine.
• Example: an ongoing clinical trial (the MINDACT trial) is assessing a prognostic 70-gene expression signature, called MammaPrint, in deciding whether to give chemotherapy to breast cancer patients.
• Personalized medicine: in cancer, knowing the repertoire of aberrations (genomic and epigenomic) in any given tumour, can we predict which treatments will work on that tumour?
• Improved understanding of the systems-biology principles underlying complex genetic diseases.

Statistical Genomics Flowchart
Biological/clinical question → experimental design → experiment (microarrays/sequencing) → preprocessing (e.g. image analysis) → normalisation → downstream analysis (feature selection, classification, clustering, etc.) → biological verification and interpretation.

Typical tasks in Statistical Genomics
1. Experimental design: (i) large experiments in genomics that profile many samples over a large number of arrays require careful design so as to avoid confounding by technical factors (e.g. chip/batch effects); (ii) power calculations to determine the minimum sample size.
2. Normalisation of raw data: raw data need to be carefully calibrated and normalised (both intra- and inter-array/sample normalisation are needed).
3. Identification of genomic features correlating with a phenotype of interest: the purpose of the experiment is usually to identify genomic features (e.g. mRNA expression levels) that differ between two conditions (e.g. normal versus cancer).
4. Constructing classifiers for prediction: often we want to know whether we can derive a predictor based on genomic features (e.g. can we predict the prognosis of breast cancer patients from epigenetic DNA methylation profiles measured at the time of diagnosis?).

Types of omic data
1. Transcriptomics: genome-wide quantification of mRNA and miRNA expression (continuous-valued data).
2. Proteomics: large-scale quantification of protein expression (continuous-valued).
3. Epigenomics: genome-wide quantification of epigenetic marks (e.g. DNA methylation, the covalent modification of cytosines by a methyl group). Although binary in a single cell, methylation becomes continuous-valued when measured over many cells, owing to (stochastic) variation.
4. Metabolomics: large-scale quantification of metabolite levels (continuous).
5. Genomics: genome-wide quantification of allele-specific copy-number state (continuous and discrete) and SNP profiling (discrete-valued data).

Functional genomics: measuring gene expression
• We can "easily" measure the mRNA levels of most known transcripts and individual exons over ~100-1000 samples with microarray-based technologies (cDNA, oligo and exon arrays), at roughly £100-200 per sample.
• A microarray consists of a solid surface onto which known DNA molecules have been chemically bonded at specific locations.
– Each array location is typically known as a probe and contains many replicates of the same molecule.
– The molecules at each array location are carefully chosen so as to hybridise only with mRNA molecules corresponding to a single gene.

"Omic" data matrices
Raw data (array scans/images) are reduced, via spot/image quantitation, to spot-level summaries, and finally to a data matrix of abundance levels with p features/genes (rows) and n samples (columns). Typically p >> n.

Choosing a statistical test: binary phenotype (0,1)
• Suppose we would like to establish whether two sample distributions are different (each sample distribution is assumed to be representative of its phenotype).
• The main characteristic of a distribution is the mean, its first statistical moment (higher-order moments include variance, skewness, etc.). So, typically, we want a test to determine whether the means differ.
• Parametric testing implicitly assumes a model for how the data are distributed in each sample group, i.e. the model is given by a statistical distribution whose parameters specify the model. Testing relies on parameter estimation (e.g. Student's t-test).
• Non-parametric testing assumes no implicit model, and testing does not involve parameter estimation (e.g. the Wilcoxon rank-sum test).

Student's (unpaired) t-test
Suppose the data are normally distributed in each phenotype (small deviations from normality will not affect the testing). The t-statistic is

$$T = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

The null hypothesis is $\mu_1 = \mu_2$, under which $T \sim t(0,\nu)$, where $t(0,\nu)$ denotes a t-distribution with mean 0 and degrees of freedom

$$\nu = \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1-1} + \dfrac{(s_2^2/n_2)^2}{n_2-1}}$$

[Figure: a comparison of the t-distribution with 4 df (blue) and the standard normal distribution (red), with the same mean and variance.]
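As a minimal usage sketch (my addition, not part of the original slides; the data are simulated), the above test is available in R as t.test, whose default unequal-variance (Welch) mode uses the degrees-of-freedom formula given above:

    # Minimal sketch (not from the slides): Welch's unpaired t-test on
    # simulated, normally distributed data for two phenotypes.
    set.seed(1)
    x1 <- rnorm(20, mean = 0, sd = 1.0)   # phenotype 0
    x2 <- rnorm(15, mean = 1, sd = 1.5)   # phenotype 1

    # var.equal = FALSE (the default) gives the unequal-variance test,
    # with the Welch degrees of freedom nu defined above.
    res <- t.test(x1, x2, var.equal = FALSE)
    res$statistic   # the t-statistic T
    res$parameter   # the (generally non-integer) degrees of freedom nu
    res$p.value     # two-sided P-value under the null mu1 = mu2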
Wilcoxon rank-sum test (unpaired)
• If the normality assumption is grossly violated, it is better to use the non-parametric Wilcoxon rank-sum test (also known as the Mann-Whitney U-test). However, this test is less powerful than a t-test.
• The null hypothesis (for continuous data) is: P(Red > Black) = P(Black > Red).
1. Arrange all $n = n_1 + n_2$ values in order, without regard to phenotype, and assign ranks $1, \dots, n$ (whether the ordering is increasing or decreasing does not affect the statistic below).
2. Sum the ranks of all values within one phenotype, giving $R_1$.
3. The rank sums must then satisfy $R_1 + R_2 = n(n+1)/2$.
4. The statistic is $W_1 = R_1 - n_1(n_1+1)/2$, and correspondingly $W_1 + W_2 = n_1 n_2$.
5. The quantity $\max\!\left(\frac{W_1}{n_1 n_2}, \frac{W_2}{n_1 n_2}\right)$ is directly related to the AUC, the area under the ROC curve.
• Note: the statistic is derived from the ranks, not from the actual values.
• Worked example: with red values {20, 15.2, 10.1} and black values {18.5, 8.6, 6.9, 4.2}, ranking from largest (rank 1) to smallest (rank 7) gives R(red) = 1 + 3 + 4 = 8, hence W(red) = 8 - 6 = 2, W(black) = 12 - 2 = 10, and AUC = 10/12 = 0.83.

Exercises
• Exercise 1: given the data for two phenotypes (black and red) (-1, 2.5, 3.5, 7.5, 4, 9, 10, 10.5, 11, 12, 11.5, 13, 15), find the AUC and the P-value for rejecting the null hypothesis that P(Black > Red) = P(Red > Black). For the P-value calculation you might want to use the R function wilcox.test.
• Exercise 2: now consider the data (1.1, 1.0, 1.2, 4.1, 4.0, 4.2). Compute P-values according to the Wilcoxon test and the t-test separately. What is the AUC in the case of the Wilcoxon test? What does this tell you about using non-parametric tests when sample sizes are small?
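As a usage sketch (my addition, not from the slides), the worked example above can be reproduced in R. Note that rank() assigns rank 1 to the smallest value, which flips W relative to the slide's ranking but leaves max(W1, W2) and the AUC unchanged:

    # Sketch (not from the slides): rank-sum statistic and AUC for the
    # worked example above.
    red   <- c(20, 15.2, 10.1)
    black <- c(18.5, 8.6, 6.9, 4.2)
    n1 <- length(red); n2 <- length(black)

    r     <- rank(c(red, black))          # ranks over the pooled sample
    R.red <- sum(r[1:n1])                 # rank sum for the red phenotype
    W.red <- R.red - n1 * (n1 + 1) / 2    # Mann-Whitney statistic W
    AUC   <- max(W.red, n1 * n2 - W.red) / (n1 * n2)
    AUC                                   # 10/12 = 0.83, as above

    # The built-in test gives the same W together with an exact P-value:
    wilcox.test(red, black)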
Parametric or non-parametric?
Drawbacks of non-parametric testing:
• Given the sample size, there is a minimum achievable P-value (this constitutes a problem when correcting for multiple testing).
• Features of low variance may be given highly significant P-values.
[Schematic: two features A and B, each with groups marked x and o. The x and o values have identical rankings in A and B, but the separation between the groups is much smaller in B.]
• The Wilcoxon test would assign the same P-value to features A and B, i.e. it is blind to the effect sizes of the features.

Testing with continuous phenotypes
• Suppose we want to determine whether a genomic variable (e.g. gene expression) is correlated with a continuous phenotype (e.g. age).
• For this we can use a regression framework, e.g. the linear model

$$y = \alpha + \beta x + \varepsilon, \qquad y' = \beta x + \varepsilon \quad (y' = y - \alpha)$$

With $n$ data points, the least-squares estimate of the slope is

$$\hat{\beta} = \frac{y'^T x}{x^T x} \;\;\Rightarrow\;\; t = \frac{\hat{\beta}}{\widehat{\mathrm{se}}(\hat{\beta})} \sim t(0, n-2)\Big|_{\beta=0}$$

i.e. the t-statistic of the slope follows a t-distribution with $n-2$ degrees of freedom when $\beta = 0$.
• Compare this with the Pearson correlation (for mean-centred $x$ and $y$) and the Fisher Z-transform:

$$\rho_{xy} = \frac{y^T x}{\sqrt{(y^T y)(x^T x)}}, \quad -1 \le \rho_{xy} \le 1, \qquad Z = \frac{1}{2}\log\frac{1+\rho_{xy}}{1-\rho_{xy}} \sim N\!\left(0, \tfrac{1}{n-3}\right)\Big|_{\rho_{xy}=0}$$

Some notes
• The statistical significance of a correlation value depends on the sample size n: e.g. a correlation of "only" 0.1 can be significant if n > 300. Exercise: check this (a sketch doing so is given at the end of this section).
• If there are outliers, using a t-test to evaluate significance is not a good idea, because the residuals won't be normally distributed (the assumption underlying the t-test). In this case we can obtain the "null" distribution of the t-statistic by randomly reassigning the phenotype labels to the expression values (this needs to be done many times, > 1000, to generate a reasonable estimate of the null distribution).
• Null distribution = the distribution of the statistic when the null hypothesis is true. By permuting a large number of times you effectively destroy any potential association between predictor and phenotype, and by definition this constitutes the null hypothesis.
• Often the null distribution cannot be derived analytically, in which case permutation is the only way to derive it and hence to estimate significance. In this case we speak of an empirical null.

Deriving an empirical null
• Given an observed statistic S, to derive its null distribution:
i) randomly permute the phenotype labels;
ii) recompute the statistic with the permuted labels, giving SP;
iii) repeat a large number nP of times (> 1000), giving SP1, SP2, ...;
iv) an empirical P-value can then be calculated as P-value = #(SP > S)/nP.
• [Figure] In the figure's example the noise was modelled from a Gaussian distribution, so, not surprisingly, the analytical and empirical estimates of the P-value are in close agreement (see the sketch below).
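A minimal sketch (my addition, not from the slides) of the recipe above, using as the observed statistic S the regression-slope t-statistic from the linear-model section; since the noise is simulated as Gaussian, the empirical and analytical P-values should indeed agree closely:

    # Sketch (assumptions: simulated Gaussian data; S = t-statistic of
    # the regression slope). Implements steps i)-iv) of the recipe above.
    set.seed(2)
    n <- 50
    x <- rnorm(n)             # phenotype (e.g. age)
    y <- 0.4 * x + rnorm(n)   # genomic variable with Gaussian noise

    tstat <- function(x, y) summary(lm(y ~ x))$coefficients["x", "t value"]
    S <- tstat(x, y)          # observed statistic

    # i)-iii): permute the phenotype labels and recompute the statistic
    nP <- 5000
    SP <- replicate(nP, tstat(sample(x), y))

    # iv): empirical P-value, as defined above (one-sided)
    P.emp <- sum(SP > S) / nP

    # Analytical P-value from the t(0, n-2) null, for comparison:
    P.ana <- pt(S, df = n - 2, lower.tail = FALSE)
    c(empirical = P.emp, analytical = P.ana)   # close agreement expected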
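Finally, a quick check (again my addition) of the exercise in the notes above, using the Fisher Z-transform: how large must n be for a correlation of 0.1 to reach two-sided significance at the 5% level?

    # Sketch: significance of a correlation of rho = 0.1 versus sample
    # size, using Z ~ N(0, 1/(n-3)) under the null rho = 0.
    rho <- 0.1
    n   <- c(100, 300, 400, 1000)
    Z   <- 0.5 * log((1 + rho) / (1 - rho))   # Fisher Z-transform
    P   <- 2 * pnorm(-abs(Z) * sqrt(n - 3))   # two-sided P-values
    data.frame(n = n, P = round(P, 4))
    # P ~ 0.32 at n = 100, but drops below 0.05 once n reaches ~400.

With a one-sided test the threshold is nearer n ≈ 270, broadly consistent with the n > 300 quoted above.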