
Statistical Genomics

MSc CoMPLEX, UCL
Dr Andrew Teschendorff, Statistical Cancer Genomics, UCL Cancer Institute ([email protected])


Outline

1. Motivation and biological/clinical background.
2. Statistical tests: parametric vs non-parametric testing, univariate and multivariate regressions, empirical nulls.
3. The multiple-testing problem in genomics: estimating the false discovery rate (FDR).
4. Power calculations in genomic studies.
5. Gene Set Enrichment Analysis (GSEA).
6. Dimensional reduction: singular value decomposition.



Statistical Genomics

Definition: The development and application of statistical methodology to help analyze and interpret data from omic technologies.

Goal: Ultimately, the development of statistical algorithms and software to improve the clinical management of complex genetic diseases.


Motivation (biological/clinical)

• “Omic” data sets (e.g. mRNA expression, SNPs, DNA methylation) have revolutionized the field of molecular genetics and medicine.

• Example: an ongoing clinical trial (the MINDACT trial) is assessing a prognostic 70-gene expression signature, called MammaPrint, in deciding whether to give chemotherapy to breast cancer patients.



Motivation (biological/clinical)

• Personalized medicine: in cancer, knowing the repertoire of aberrations (genomic & epigenomic) in any given tumour, can we predict which treatments will work on that tumour?

• Improved understanding of systems biology principles underlying complex genetic diseases.


Statistical Genomics Flowchart

Biological/clinical question

Experimental Design

Experiment (microarrays / sequencing)

Preprocessing (e.g image analysis)

Normalisation

Downstream Analysis (feature selection, classification, clustering…etc)

Biological verification and interpretation


Typical tasks in Statistical Genomics

1. Experimental design: (i) large experiments in genomics that profile many samples over a large number of arrays require careful design so as to avoid confounding by technical factors (e.g. chip/batch effects); (ii) power calculations to determine the minimum sample size.
2. Normalisation of raw data: raw data needs to be carefully calibrated and normalised (need for both intra- and inter-array/sample normalisation).
3. Identification of genomic features correlating with a phenotype of interest: the purpose of the experiment is usually to identify genomic features (e.g. mRNA expression levels) that differ between two conditions (e.g. normal versus cancer).
4. Constructing classifiers for prediction: often we want to know whether we can derive a predictor based on genomic features (e.g. can we predict the prognosis of breast cancer patients based on epigenetic DNA methylation profiles measured at the time of diagnosis?).


Types of omic data

1. Transcriptomics: genome-wide quantification of mRNA & miRNA expression (continuous-valued data).
2. Proteomics: large-scale quantification of protein expression (continuous-valued).
3. Epigenomics: genome-wide quantification of epigenetic marks (e.g. DNA methylation: covalent modification of cytosines by a methyl group). Although binary in single cells, it becomes continuous when measured over many cells, due to (stochastic) variation.
4. Metabolomics: large-scale quantification of metabolite levels (continuous).
5. Genomics: genome-wide quantification of allele-specific copy-number state (continuous & discrete) & SNP profiling (discrete-valued data).



Functional genomics: measuring gene expression

• We can “easily” measure the mRNA levels of most known transcripts and individual exons, over ~100-1000 samples, with microarray-based technologies (cDNA-, oligo-, exon-arrays), at ~£100-200 per sample.

• The microarray consists of a solid surface onto which known DNA molecules have been chemically bonded at specific locations.
  – Each array location is typically known as a probe and contains many replicates of the same molecule.
  – The molecules in each array location are carefully chosen so as to hybridise only with the mRNA molecules corresponding to a single gene.

“Omic” data matrices

[Flow diagram: raw data (array scans/images) → intermediate data (spot/image quantitations of the p features) → final data: a matrix of abundance levels, p features × n samples, with p >> n.]


Choosing a statistical test: binary phenotype (0,1)

• Suppose we would like to establish whether two sample distributions are different (the sample distributions are assumed to be representative of each phenotype).
• The main characteristic of a distribution is the mean, the first statistical moment (higher-order moments include variance, skewness, etc.). So, typically, we want a test to determine whether the means are different.
• Parametric testing: implicitly assumes a model for how the data are distributed in each sample group, i.e. the model is given by a statistical distribution, the parameters of which specify the model. Testing relies on parameter estimation (e.g. Student’s t-test).
• Non-parametric testing: no implicit model; testing does not involve parameter estimation (e.g. Wilcoxon rank sum test).


Student’s (unpaired) t-test

• Suppose data is normally distributed in each phenotype (small deviations from normality will not affect the testing):

xx ()  T  1 2 1 2 t-statistic ss22 12 nn12

Null hypothesis: 12  Tt ~ (0,  ) t(0, ) is a t-distribution of mean 0 and  degrees of freedom 2 2 2 s1 / n1  s2 / n2    2 2 2 2 (s1 / n1) /(n1 1)  (s2 / n2 ) /(n2 1)

[Figure: a comparison of the t-distribution with 4 df (in blue) and the standard normal distribution (in red) (same mean and variance).]
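To make the formulas concrete, here is a minimal R sketch (simulated data, purely illustrative) that computes the t-statistic and the degrees of freedom by hand and checks them against the built-in t.test:

set.seed(1);
x1 <- rnorm(20, mean=0, sd=1);
x2 <- rnorm(30, mean=1, sd=1.5);
n1 <- length(x1); n2 <- length(x2);
v1 <- var(x1)/n1; v2 <- var(x2)/n2;
T.stat <- (mean(x1) - mean(x2))/sqrt(v1 + v2);  ## t-statistic (null: mu1 = mu2)
nu <- (v1 + v2)^2/(v1^2/(n1-1) + v2^2/(n2-1));  ## degrees of freedom nu
2*pt(-abs(T.stat), df=nu);                      ## two-sided P-value
t.test(x1, x2);                                 ## should agree with the above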



Wilcoxon rank sum test (unpaired)

• If the normality assumption is grossly violated, it is better to use the non-parametric Wilcoxon rank sum test (Mann-Whitney U-test). However, the test is less powerful than a t-test.
• Null hypothesis (for continuous data): P(Red>Black) = P(Black>Red).

1. Arrange all values (n = n1 + n2 values) in rank order without regard to phenotype and assign ranks. (Example, ranking from the largest value: 20 (1), 18.5 (2), 15.2 (3), 10.1 (4), 8.6 (5), 6.9 (6), 4.2 (7).)
2. Sum the ranks of all values within one phenotype ⇒ R1.
3. Then the following must hold: R1 + R2 = n(n+1)/2.
4. Statistic: W1 = R1 − n1(n1+1)/2 (and W1 + W2 = n1·n2).
5. The statistic W1/(n1·n2), or W2/(n1·n2), is directly related to the AUC (the area under the ROC curve).

• Note: the statistic is derived from the ranks and not from the actual values. In the above example (red values 20, 15.2, 10.1; black values 18.5, 8.6, 6.9, 4.2), R(red) = 1+3+4 = 8 ⇒ W(red) = 8−6 = 2 ⇒ W(black) = 12−2 = 10 ⇒ AUC = 10/12 = 0.83.
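A minimal R sketch of this worked example; note that rank() assigns rank 1 to the smallest value, whereas the slide ranks from the largest, but the resulting AUC is the same:

red   <- c(20, 15.2, 10.1);
black <- c(18.5, 8.6, 6.9, 4.2);
n1 <- length(red); n2 <- length(black);
r  <- rank(c(red, black));      ## pooled ranks (rank 1 = smallest value)
R1 <- sum(r[1:n1]);             ## rank sum of the red group
W1 <- R1 - n1*(n1+1)/2;         ## Mann-Whitney statistic
W1/(n1*n2);                     ## AUC = P(red > black) = 10/12 ~ 0.83
wilcox.test(red, black);        ## built-in test reports the same W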


Wilcoxon rank sum test (unpaired)

• Exercise 1: given data for two phenotypes (black & red) (-1, 2.5, 3.5, 7.5, 4, 9, 10, 10.5, 11, 12, 11.5, 13, 15) find AUC and P-value for rejecting null hypothesis that P(Black>Red)=P(Red>Black). For the P-value calculation you might want to use the R-function wilcox.test.

• Exercise 2: Now consider data (1.1, 1.0, 1.2, 4.1, 4.0, 4.2). Compute P- values according to Wilcoxon test and t-test separately. What is the AUC in the case of the Wilcoxon test? What does this tell you about using non- parametric tests in the case where sample sizes are small?



Parametric or non-parametric?

Drawbacks of non-parametric testing:

• given the sample size, there is a minimum achievable P-value (this constitutes a problem when correcting for multiple testing).
• features of low variance may be given highly significant P-values:

[Illustration: two features A) and B), with group values marked x and o along an axis; the rank orderings of the two groups are identical, but the separation (effect size) differs.]

• the Wilcoxon test would assign the same P-value to features A) and B), i.e. it is blind to the effect sizes of the features.


Testing with continuous phenotypes

• Suppose we want to determine whether a genomic variable (e.g. gene expression) is correlated with a continuous phenotype (e.g. age).
• For this we can use a regression framework, e.g. a linear model:

$$y = \alpha + \beta x + \varepsilon$$

$$y' = \beta x + \varepsilon \qquad (y' = y - \alpha)$$

Least squares estimate ($n$ data points):
$$\hat{\beta} = \frac{x^T y'}{x^T x}$$
$$\Rightarrow\ t = \frac{\hat{\rho}\sqrt{n-2}}{\sqrt{1 - \hat{\rho}^2}} \sim t(0,\, n-2)\Big|_{\beta=0}, \qquad \hat{\rho} = \hat{\beta}\,\frac{\sigma_x}{\sigma_y}$$

• Compare with Pearson correlation & Fisher Z-transform:

$$\rho_{xy} = \frac{y^T x}{n\,\sigma_y\,\sigma_x}, \qquad -1 \leq \rho_{xy} \leq 1$$
$$Z = \frac{1}{2}\log\frac{1 + \rho_{xy}}{1 - \rho_{xy}} \sim N\!\left(0, \tfrac{1}{n-3}\right)\Big|_{\rho=0}$$
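A minimal R sketch (simulated data, illustrative only) relating the regression t-statistic, the Pearson correlation and the Fisher Z-transform:

set.seed(2);
n <- 100;
x <- rnorm(n);
y <- 0.3*x + rnorm(n);
summary(lm(y ~ x));                ## t-statistic for beta, on n-2 df
r <- cor(x, y);                    ## Pearson correlation
r*sqrt(n-2)/sqrt(1 - r^2);         ## the same t-statistic, computed from r
z <- 0.5*log((1 + r)/(1 - r));     ## Fisher Z-transform
2*pnorm(-abs(z)*sqrt(n-3));        ## P-value from the N(0, 1/(n-3)) null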


Some notes

• The statistical significance of correlation values depends on the sample size n (e.g. a correlation of “only” 0.1 can be significant if n > 300). Exercise: check this.
• If there are outliers, using a t-test to evaluate significance is not a good idea, because the residuals won’t be normally distributed (the assumption underlying the t-test). In this case, we can obtain the “null” distribution of the t-statistic by randomly reassigning the phenotype labels to expression values (we need to do this many times, >1000, to generate a reasonable estimate of the null distribution).
• Null distribution = the distribution of the statistic when the null hypothesis is true. By permuting a large number of times you effectively destroy any potential association between predictor and phenotype; by definition, this constitutes the null hypothesis.
• Often the null distribution can’t be derived analytically, in which case permutation is the only approach to derive it and hence estimate significance. In this case, we talk about an empirical null.


Deriving an empirical null

• Given an observed statistic S, to derive the null distribution of the statistic:
  i) randomly permute the phenotype labels;
  ii) recompute the statistic with the permuted labels ⇒ SP;
  iii) repeat a large number of times, nP (>1000) ⇒ (SP1, SP2, …);
  iv) an empirical P-value can be calculated as: P-value = (#SP > S)/nP.

• in this case, the noise was modelled from a Gaussian distribution, so, not surprisingly, the analytical and empirical estimates of the P-value are in close agreement.
• Exercise: (i) run through a similar example yourself; (ii) run the case where the errors are not normally distributed.
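A minimal R sketch of steps (i)-(iv), using a t-statistic on one simulated feature (the data and effect size are illustrative assumptions):

set.seed(3);
n <- 40;
pheno <- rep(c(0, 1), each=n/2);               ## binary phenotype labels
x <- rnorm(n) + 0.5*pheno;                     ## one simulated feature
S <- abs(t.test(x ~ pheno)$statistic);         ## observed statistic
nP <- 1000;
SP <- replicate(nP, abs(t.test(x ~ sample(pheno))$statistic));  ## permuted
sum(SP > S)/nP;                                ## empirical P-value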


Multivariate regression

• Often we need to determine whether a particular genomic covariate/predictor correlates with a phenotype of interest, independently of other covariates (predictors).
• For example, one may want to know if the expression of a particular gene correlates with clinical outcome independently of other prognostic factors (e.g. tumour size). If it doesn’t, or if other factors are stronger predictors, then the value of using the gene expression predictor is diminished.
• Another application is the evaluation of conditional independence (inference of regulatory interactions).
• Multivariate regression: assume a linear model

$$y = X\beta + \varepsilon$$

$y$ $(n \times 1)$: phenotype vector
$\beta$ $(p \times 1)$: regression coefficient vector
$X$ $(n \times p)$: predictor matrix (e.g. a DNAm data matrix)

Estimate:
$$\hat{\beta} = (X^T X)^{-1} X^T y, \qquad z_j = \frac{\hat{\beta}_j}{\hat{\sigma}\sqrt{[(X^T X)^{-1}]_{jj}}}, \quad j = 1, \ldots, p$$
Under the null $\beta_j = 0$: $z_j \sim t(0, \mathrm{df} = n - p - 1)$.
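A minimal R sketch (simulated data): the “t value” column of summary(lm) is the statistic $z_j$ above, on n − p − 1 degrees of freedom:

set.seed(4);
n <- 100; p <- 3;
X <- matrix(rnorm(n*p), n, p);
colnames(X) <- paste0("x", 1:p);
y <- drop(X %*% c(1, 0.5, 0)) + rnorm(n);  ## x3 has a zero coefficient
summary(lm(y ~ X));                        ## "t value" = z_j, df = n-p-1 = 96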

Geometrical interpretation of multivariate linear regression

[Figure: the observation vector y, its projection ŷ onto the plane spanned by the predictors x1 and x2, and the error y − ŷ (to be minimised). A simple example with y a 3-dimensional vector of observations and two predictors x1 & x2.]


Multivariate regression model

• Let y denote the phenotype label, let j denote the predictor of interest (e.g. the expression or methylation of a gene), and let K denote the set of all other covariates.
• We want to determine whether j is associated with y independently of all the other covariates K.
• To do this we can use the following strategy:

Data[y] = Model[K] + Residual[y|K] Data[j] = Model[K] + Residual[j|K]

Data[Residual[y|K]]= Model[Residual[j|K]] + Error

• The last regression is univariate, and its regression coefficient measures the correlation of y and j which remains after the variation due to all other covariates K is removed, i.e. exactly what we want. In the case of linear models, this correlation is called the partial correlation between y and j.
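A minimal R sketch of this residual-on-residual strategy, with a single hypothetical confounder playing the role of K:

set.seed(5);
n  <- 200;
k  <- rnorm(n);                     ## the covariate set K (one confounder)
xj <- 0.8*k + rnorm(n);             ## predictor of interest j
y  <- 0.5*k + 0.3*xj + rnorm(n);    ## phenotype
ry  <- resid(lm(y ~ k));            ## Residual[y|K]
rxj <- resid(lm(xj ~ k));           ## Residual[j|K]
cor(ry, rxj);                       ## partial correlation of y and j given K
summary(lm(ry ~ rxj));              ## the final univariate regression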


Multivariate regression coefficients

• In the univariate case (p=1):

n T yx 2 ˆ yx i1 ii yx y T n 22  yx xx xx xx i1 ii

 yx  yx  <= Pearson correlation coefficient. xy • In the multivariate case (p>1):

K{:} xi i j y x  r & x  x  r k k y j k k x j k K k K Multiple regression coefficient is related to partial correlation,  2 which measures the residual   cor(,) r r    y   y correlation between two yxj yxK j| yx j j x j yx j  2 x j variables after correction for all Partial correlation. other covariates in model.

For proof: see the book by Hastie & Tibshirani, and Cox & Wermuth, “Linear dependencies represented by chain graphs”, Statistical Science 1993, 8:204-218.


Application: inference of regulatory networks

[Figure: a subnetwork consisting of 96 genes centered around the ESR2 gene, distinguishing edges with significant correlation but not significant partial correlation from edges with significant partial correlations.]

Schäfer J, Strimmer K. Bioinformatics 2005;21:754-764.

The multiple-testing problem

• In genomics we are often correlating large numbers of genomic features with a phenotype of interest.
• By random chance, we will get spurious associations: e.g. in tossing a fair coin, the probability of getting 10 heads in a row is only ~0.001, but if you do 10,000 trials, on average you will get 10 heads in a row in about 10 of them ⇒ “false positives”.
• We need to control the number of false positives.
• In principle, this is easy. If we perform m null tests and use a significance threshold of 0.05, then on average we would expect ~m×0.05 false positives. So, instead of 0.05, we would use a threshold of 0.05/m (Bonferroni correction); this effectively guarantees no single false positive.
• However, in many genomic applications this is far too strict (too few true positives to evaluate biological significance).
• We can afford a number of false positives provided we capture a larger number of true positives ⇒ we need a confidence measure that each “gene” is a true positive.


The false discovery rate (FDR)

• The false discovery rate of an experiment is the expected proportion of false positives among the tests that are called positive (significant). As such, it is a function of the (arbitrary) significance threshold t used: FDR(t) = E[FP(t)/P(t)].
• Given the actual experiment, the number of positives P(t) is simply the number of tests (features) passing the threshold t.
• The challenge is to estimate the expected number of false positives E[FP(t)].
• There are different procedures to estimate this number, some more conservative than others, some more biased than others.
• One approach is based on “q-values” (see Storey et al. 2003).


Estimating the FDR

• If there are m tests, only a proportion w of these will be true “nulls” (i.e. not differentially expressed).
• Thus E[FP(t)] ~ (wm)t, since wm is the number of true nulls and t is the probability that any one of them is called significant.
• Thus the problem is to estimate w. Before we look at how to estimate this quantity, consider a simple example: suppose we test 10,000 independent genes (mRNA expression) using a t-test against a phenotype where we know there should be no association, for instance a binary sample label that we generate at random.

What should the P-value distribution look like?



Estimating the FDR

⇒ The distribution should follow a uniform distribution.

Exercise: what do you think the FDR should be?

Estimating the FDR

• Scenario where we have an association: in this case we see a clear skew towards small p-values. We can estimate w by fitting a uniform distribution to the interval (L, 1).
• However, a histogram of this shape does not guarantee that you can reliably identify the true positives, as this also depends on the actual sample size n (this is hard to read from a histogram plot, which is why we need a quantitative estimate such as the FDR).


Estimating the FDR

Formally,
$$\hat{w}(L) = \frac{\#\{p_i \geq L : i = 1, \ldots, m\}}{m(1 - L)}$$
$$\mathrm{FDR}(t, L) = \frac{\hat{w}(L)\, m\, t}{\#\{p_i \leq t : i = 1, \ldots, m\}}$$
We need to choose L so that the histogram of p-values in the range (L, 1) is flat.

Steps in FDR estimation (the data are assumed to be intra- and inter-array normalised; a sketch in R follows the steps):
1. Choose a statistical test.
2. Derive a p-value for each genomic feature.
3. Plot the histogram of p-values.
4. Choose an optimal parameter value L (usually ~0.5 is fine, but the qvalue R-package finds this).
5. Estimate q-values (FDR) from the p-value distribution.
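A minimal R sketch of the estimator above, applied to a hypothetical mixture of null and alternative p-values:

estimate.fdr <- function(pv, t, L=0.5){
  m     <- length(pv);
  w.hat <- sum(pv >= L)/(m*(1 - L));   ## estimated proportion of true nulls
  P.t   <- sum(pv <= t);               ## number of tests called significant
  min(w.hat*m*t/max(P.t, 1), 1);       ## FDR(t, L)
}
## simulated example: 90% uniform nulls, 10% alternatives skewed towards 0
set.seed(6);
pv <- c(runif(9000), rbeta(1000, 0.3, 4));
estimate.fdr(pv, t=0.01);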

Example: mRNA expression of BRCA1 mutant vs wild-type breast cancers (Hedenfalk et al NEJM 2001)

A density histogram of the 3,170 p-values derived from t-tests:

[Figure: density histogram of the p-values for the 3,170 genes.]

Note: estimating the proportion of true nulls requires large numbers of features (~1000).

Storey J D, Tibshirani R PNAS 2003;100:9440-9445




Example: mRNA expression of BRCA1 mutant vs wild-type breast cancers (Hedenfalk et al NEJM 2001)

q-value analysis using the R-package qvalue

Exercise: try to reproduce this using the data available in that package.

Storey JD, Tibshirani R. PNAS 2003;100:9440-9445.

Other performance measures

• It is not only the FDR that needs to be controlled, but often controlling the FNR (false negative rate) is equally important. If the FNR is large, we may miss important biological associations.

• Some definitions (rows: true state of the hypotheses; columns: test calls):

            Test P   Test N
A (alt.)      TP       FN
H (null)      FP       TN

SE = p(P|A) = TP/A          PPV = p(A|P) = TP/P
FNR = p(N|A) = FN/A = 1 − SE      FDR = p(H|P) = FP/P = 1 − PPV

SP = p(N|H) = TN/H          NPV = p(H|N) = TN/N
FPR = p(P|H) = FP/H = 1 − SP      FNDR = p(A|N) = FN/N = 1 − NPV


Typical example

        P      N     Total
A      300    200     500
H      200   9300    9500
Total  500   9500   10000

SE = TP/A = 300/500 = 0.6 ⇒ FNR = 0.4
SP = TN/H = 9300/9500 = 0.98 ⇒ FPR = 0.02

FDR = FP/P = 200/500 = 0.4

This example shows that although the Sensitivity (SE) and Specificity (SP) measures are relatively high, we have a relatively large FDR and FNR. This is because we have a large number of tests (10000) and only a small proportion are truly differentially altered (5%).
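A one-line check of these numbers in R:

TP <- 300; FN <- 200; FP <- 200; TN <- 9300;
A <- TP + FN; H <- FP + TN; P <- TP + FP; N <- FN + TN;
c(SE=TP/A, FNR=FN/A, SP=TN/H, FPR=FP/H, FDR=FP/P, FNDR=FN/N);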


Power calculations in genomics

• Very often, we need to estimate the minimum sample size of an experiment in order to achieve a certain FDR and FNR.

• The FDR/FNR characteristics of an experiment depend on many factors (many of which are unknown), including:

1) the proportion of truly differentially altered features.

2) the distribution of effect sizes (signal to noise ratios).

3) heterogeneity of phenotypes.

4) sample size (this is the factor we can control).



Power calculations in genomics

• For experiments where features are not highly correlated (e.g. mRNA expression arrays, or DNA methylation arrays with well-separated CpGs), we may treat the features as independent tests ⇒ this allows analytical estimation.
• Suppose we have computed for each feature (m features in total) a statistic T, taking a value t, which tells us how well the feature discriminates the two phenotypes. Then:

p( T t )  wpHA ( T  t )  (1  w ) p ( T  t ) F( t ) p ( T  t )  1  F ( t )  p ( T  t )

F( t )  wFHA ( t )  (1  w ) F ( t ) wm2(1 F ( t )) w (1 F ( t )) FDR() t HH m2(1 F ( t )) 1 F ( t ) (1w ) m 2(1 F ( t )) FNR( t ) 1  SE ( t )  1 A  1  2(1  F ( t )) (1 wm ) A

• In the special case, where we declare the top (1-w)m features to be differentially altered, i.e if we pick the threshold t such that 2(1-F(t))=(1-w), then FDR(t)=FNR(t) (Exercise). 1-Dec-12 35

Example

• Consider the following assumptions: equal sample sizes and variances, and normally distributed data (i.e. one may need to log-transform the data):

$$n_1 = n_2 = n, \qquad s_1^2 = s_2^2 = \sigma^2$$

$$T = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} = \sqrt{\frac{n}{2}}\;\frac{\bar{x}_1 - \bar{x}_2}{\sigma} = \sqrt{\frac{n}{2}}\, e, \qquad e \equiv \frac{D}{\sigma}$$

$$p(T > t) = w\, p_H(T > t) + (1 - w)\, p_A(T > t)$$

$$p_H(T > t) \sim t(0,\ \mathrm{df} = 2n - 2)$$

$$p_A(T > t) \sim t\!\left(\sqrt{\tfrac{n}{2}}\, e,\ \mathrm{df} = 2n - 2\right)$$

• We have 3 parameters: n, e, w.
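A minimal R sketch evaluating FDR(t) and FNR(t) under these assumptions via the central and noncentral t-distributions; for simplicity it uses a one-sided rejection region T > t, whereas the slides use a two-sided criterion:

fdr.fnr <- function(t, n, e, w){
  df  <- 2*n - 2;
  ncp <- sqrt(n/2)*e;                  ## noncentrality under the alternative
  FH  <- pt(t, df=df);                 ## F_H(t), central t
  FA  <- pt(t, df=df, ncp=ncp);        ## F_A(t), noncentral t
  FDR <- w*(1 - FH)/(w*(1 - FH) + (1 - w)*(1 - FA));
  c(FDR=FDR, FNR=FA);                  ## FNR = 1 - power
}
fdr.fnr(t=qt(1 - 1e-4, df=98), n=50, e=1, w=0.95);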


Example ctd

• Consider the following assumptions: equal sample sizes and variances, and normally distributed data.
• Assume n = 50 (a total of 100 samples), e = 1, w = 0.95:

At a P-value threshold of 0.0001, power is ~80%, and FDR is still reasonably small.


Example ctd

• Still assuming e=1, w=0.95:

Declaring top 5% of discriminatory features to be differentially altered, at a sample size of n=50 per group, FDR=FNR<0.1.

However, we can see that with n=20 per group FDR=FNR>0.3.



Example ctd

• What if the average effect size is log2(1.5) ~ 0.6 (more realistic)? (still assuming w=0.95)

Declaring top 5% of discriminatory features to be differentially altered, we would now need >100 samples per group to achieve FDR=FNR<0.1.


R-script used for examples

## DoPower.R

library(OCplus);

### frequency of alteration in cases (sets the group-size ratio: n2 = MAF*n1)
MAF <- 1;
### samples per group
n1 <- 50;
n2 <- MAF*n1;
### effect size (quantitative data type)
eff <- log2(2);
### proportion of non-differentially altered features
pi0 <- 0.95;

### perform the power calculation
toc.o <- TOC(n1=n1, n2=n2, sigma=1, D=eff, p0=pi0, legend.show=TRUE, plot=FALSE);

### plot sensitivity and FDR against the significance level
jpeg("Example1-OCplus.jpeg", width=400, height=400);
par(mfrow=c(2,1));
plot(y=toc.o$se, x=log10(toc.o$alpha), ylim=c(0,1), xlab="signlevel", ylab="Se");
plot(y=toc.o$FDR, x=log10(toc.o$alpha), ylim=c(0,1), xlab="signlevel", ylab="FDR");
dev.off();

### sample-size curves, declaring the top (1-pi0) fraction to be altered
jpeg("Example2-OCplus.jpeg", width=400, height=400);
samplesize(n=c(10,25,50,100), p0=pi0, D=eff, crit.style="top", crit=1-pi0, sigma=1);
dev.off();

### repeat with a weaker effect size
eff <- log2(1.5);
jpeg("Example3-OCplus.jpeg", width=400, height=400);
samplesize(n=c(10,25,50,100,150), p0=pi0, D=eff, crit.style="top", crit=1-pi0, sigma=1);
dev.off();



Gene set enrichment analysis (i)

• Once we have derived a list of differentially altered features associated with some phenotype of interest, the next step is to investigate whether there is any enrichment of biological terms among the selected features.
• If the list of differentially altered features is small (e.g. <50, perhaps because of a stringent FDR threshold), we will lack the power for a GSEA. One may therefore relax the threshold to FDR ~ 0.3. Typically, we will require ~200-500 features (1-5%) in our list of interest.
• Relaxing thresholds is justifiable because, if there is no biological association with a phenotype, there is no reason to expect that features with a certain biological function will be ranked higher than a random set of features.
• Often the list of differentially altered features may contain redundancies (e.g. CpGs mapping to the same CpG island/gene, or probes mapping to the same gene). It is important to summarise the features (i.e. remove redundancies) before GSEA.
• We need a database of biological “terms”: this may include Gene Ontologies, molecular signalling pathways, genes on the same cytoband, etc. An example is MSigDB (http://www.broadinstitute.org/gsea/msigdb/).

Gene set enrichment analysis (i)

[Venn diagram: within the features tested in the experiment, the features annotated to a biological term and the features in the list of interest (i.e. associated with the phenotype) form an overlap.]

• The null hypothesis (no-enrichment) distribution is the hypergeometric:

$$P(T = t \mid d) = \frac{\binom{n_p}{t}\binom{N - n_p}{d - t}}{\binom{N}{d}}$$

where $n_p$ = the number of features of biological term “p” among the tested features, $N$ = the number of tested features, $d$ = the number of features in the list of interest, and $T$ = the random variable giving the number of features in the overlap.


Gene set enrichment analysis (i)

• Binomial coefficient:

$$\binom{N}{m} = \frac{N!}{m!\,(N - m)!} = \frac{N(N-1)\cdots(N - m + 1)}{m!}$$

= the number of ways of selecting m objects out of a total of N, without replacement and where the precise order is irrelevant.

• Binomial distribution:

$$B(n \mid \{N, \pi\}) = \binom{N}{n}\,\pi^n (1 - \pi)^{N - n}$$

Under the null: $T \sim B(\{n_p, \pi\})$ and $T' \sim B(\{N - n_p, \pi\})$ $\Rightarrow$ $T + T' \sim B(\{N, \pi\})$. Hence

$$P(T = t \mid T + T' = d) = \frac{p(T = t,\, T' = d - t)}{p(T + T' = d)} = \frac{B(t \mid \{n_p, \pi\})\, B(d - t \mid \{N - n_p, \pi\})}{B(d \mid \{N, \pi\})}$$

$$= \frac{\binom{n_p}{t}\pi^t(1 - \pi)^{n_p - t}\;\binom{N - n_p}{d - t}\pi^{d - t}(1 - \pi)^{N - n_p - (d - t)}}{\binom{N}{d}\pi^d(1 - \pi)^{N - d}} = \frac{\binom{n_p}{t}\binom{N - n_p}{d - t}}{\binom{N}{d}}$$

Gene set enrichment analysis (i)

• P-value for overenrichment:

$$P(T \geq t \mid d) = 1 - P(T < t \mid d) = 1 - \sum_{k=0}^{t-1} p(T = k \mid d)$$

• This is a one-tailed test. For 2x2 contingency tables it is equivalent to a one-tailed Fisher’s exact test:

t d t d M   npp t N  n () d  t Nd

np Nn p N

• In Fisher’s exact test, we ask whether the relative proportions in the 2×2 matrix M are significantly different from random, under the constraint that the rows and columns sum to the fixed marginals.
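A minimal R sketch (the counts N, np, d and t are hypothetical) showing the hypergeometric P-value and its agreement with the one-tailed Fisher exact test:

N <- 10000; np <- 150; d <- 400; t <- 15;     ## illustrative numbers
## P(T >= t) from the hypergeometric distribution:
phyper(t - 1, np, N - np, d, lower.tail=FALSE);
## equivalent one-tailed Fisher exact test on the 2x2 table:
M <- matrix(c(t, np - t, d - t, N - np - (d - t)), nrow=2);
fisher.test(M, alternative="greater")$p.value;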


Example: GSEA on genome-wide DNA methylation data in blood (healthy women & women with ovarian cancer (CA))

[Diagram: three CpG lists and their GSEA results. Age CpGs (391 gene loci): no associations at adj. P < 0.05; 178 hyperM / 213 hypoM. Age-CA CpGs (190 gene loci): 36 terms at adj. P < 0.05; 44 hyperM / 146 hypoM. CA CpGs (2516 gene loci): 154 terms at adj. P < 0.05; 1154 hyperM / 1362 hypoM. Enriched terms include: developmental genes (FOXC1, MYOD1, GATA4, SOX8, …); HOXA9 regulatory programs (SERPINF1, TNFAIP8, …); REST targets (CKDN2B, LHX3, VGF, SPOCK2, …); immune system and T-cell activation (LY9, LY86, IL2, BCL2, TNFRSF7, …); hematopoiesis and lymphoid-myeloid differentiation (TNF, MPO, ELA2, IRF8, CTSG, …); T-cell & leukocyte activation (INS, HOXA9, …); cell adhesion (MMP9, MMP11, MMP13, …); G-protein signalling (EDN2, GPR25, NPY5R, …); NK-mediated cytotoxicity (BRAF, MAPK1, HRAS, IFNG, …); myeloid skewing.]

Testing significance of overlap

• Note that the same test can be used to evaluate whether two lists of features (A & B) overlap in a statistically significant way:

            In B       Not in B             Total
In A          t          a − t                a
Not in A    b − t     N − b − (a − t)        N − a
Total         b          N − b                N

• For example, one may want to compare if two lists of features correlating with the same phenotype but derived from different papers/experiments share a significant overlap.


Gene set enrichment analysis (ii)

[Figure: a GSEA overview illustrating the method; see the legend below.]

Subramanian A et al. PNAS 2005;102:15545-15550

A GSEA overview illustrating the method. (A) An expression data set sorted by correlation with phenotype, the corresponding heat map, and the “gene tags,” i.e., location of genes from a set S within the sorted list. (B) Plot of the running sum for S in the data set, including the location of the maximum enrichment score (ES) and the leading-edge subset.

Gene set enrichment analysis (ii)

1. Rank the N features according to correlation with the phenotype ⇒ a statistic $T_{g_j}$ for each feature $g_j$.
2. Decide on a value for the parameter p.
3. Compute the enrichment score (ES) as below (a sketch in R follows):

$$ES_{hit}(i) = \sum_{g_j \in S,\; j \leq i} \frac{|T_{g_j}|^p}{N_S}, \qquad N_S = \sum_{g_j \in S} |T_{g_j}|^p$$

$$ES_{miss}(i) = \sum_{g_j \notin S,\; j \leq i} \frac{1}{N - n_S}, \qquad n_S = \text{the number of genes in set } S$$

$$ES = \max\{\, |ES_{hit}(i) - ES_{miss}(i)| : i = 1, \ldots, N \,\}$$

If we choose p = 0, then $N_S = n_S$, $ES_{hit}(i) = \sum_{g_j \in S,\, j \leq i} 1/n_S$, and the test is the Kolmogorov-Smirnov test.


Further notes on GSEA

• Because the hypergeometric method requires the definition of a feature list of interest, it is threshold-dependent ⇒ the robustness of the result must be evaluated using a range of thresholds.
• The choice of threshold may bias associations towards biological terms of a certain size (e.g. more stringent thresholds may favour terms with fewer features).
• The KS or modified-KS test method is independent of any threshold but is more sensitive to the precise ranking of features.
• Small but consistent changes in a molecular pathway are more likely to be captured using the KS tests (unless thresholds are relaxed).
• In both cases we need to correct for multiple testing of biological terms. Because of inherent dependencies between biological terms, the best approach is to generate null ranked gene lists by permutation of the phenotype labels and subsequent computation of enrichment values.

Dimensional reduction and unsupervised analysis

• Often we are interested in performing unsupervised analysis on a high-dimensional genomic data set, the goal being to identify the salient patterns of variation in the data, or to do clustering.

• When we have >10,000 features, performing such unsupervised analysis is challenging. For example, two-way hierarchical clustering over 10,000 features and over 100 samples is computationally expensive and results could be highly unstable.

• Despite the high-dimensionality of genomic data sets, there are good reasons to believe that the relevant patterns of variation can be captured by a lower-dimensional subspace spanned by a handful of “components”.

• The process of representing a high-dimensional data set by a low- dimensional subspace is known as dimensional reduction.


A trivial example

• These 5 images are each 100 × 100 pixels, so this data could be represented by a 10,000 × 5 data matrix.
• However, we can easily see that each image can be obtained from any other through a translation (2 parameters) and a rotation (1 angle).
• Thus, all the interesting variation in this high-dimensional data set can be represented in a 3-dimensional subspace.

Dimensional reduction methods

• One of the simplest methods is to filter the features by variance (see the sketch after this list):

1. Compute the variance of each feature across all samples.
2. Rank the features according to variance.
3. Select a specified number of top features (e.g. the top 25%).
4. Perform the unsupervised analysis (e.g. hierarchical clustering).
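A minimal R sketch of variance filtering on a hypothetical data matrix (features in rows):

set.seed(8);
X <- matrix(rnorm(10000*100), nrow=10000);   ## p = 10,000 features x n = 100
v <- apply(X, 1, var);                       ## 1. variance of each feature
sel <- order(v, decreasing=TRUE)[1:2500];    ## 2.-3. top 25% by variance
Xf <- X[sel, ];                              ## 4. cluster on Xf, e.g. hclust()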

• Singular value decomposition (SVD) or Principal Components Analysis (PCA):

1. The data matrix is linearly decomposed into a series of components of variation, ranked by the amount of variance they account for in the data.
2. Each component has two projections: one across samples and another across features. The projections are weight vectors; the weights give the relative importance of the samples and genes for that component.


Singular value decomposition

$$X = U\, D\, V^T, \qquad X\ (p \times s),\quad U\ (p \times k),\quad D\ (k \times k),\quad V^T\ (k \times s)$$

$k = \min(p, s)$ is the maximum number of independent components.

$$D = \mathrm{diag}(d_1, d_2, \ldots, d_k), \qquad d_1 \geq d_2 \geq \ldots \geq d_k \geq 0$$

$$\mathrm{Var}_i = \frac{d_i^2}{\sum_{j=1}^{k} d_j^2}$$

is the fraction of variation explained by component $i$.

• the columns of U are orthogonal to each other and give the weights of the components across features (p labels features).

• the columns of V are orthogonal to each other and give the weights of the components across samples (s labels samples).
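A minimal R sketch of the decomposition on a hypothetical (row-centred) data matrix:

set.seed(9);
X <- matrix(rnorm(2000*50), nrow=2000);   ## p = 2000 features x s = 50 samples
X <- X - rowMeans(X);                     ## centre each feature
sv <- svd(X);                             ## X = U D V^T
U <- sv$u; D <- sv$d; V <- sv$v;
varexp <- D^2/sum(D^2);                   ## fraction of variation per component
head(varexp);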

Interpretation of SVD

[Figure: two classes (red and blue points) in two dimensions, with the 2nd principal component shown as a light-blue line.]

• the dominant principal component is the line at right angles to the light-blue line: however, projection of the data along the top PC does not discriminate blue from red, whereas projection along the 2nd PC (the light-blue line) does. Extrapolated to higher dimensions, this example illustrates the danger of focusing only on the top principal component(s).


Supervised SVD

• The previous example shows how SVD could be used in a supervised fashion to find features correlating with a phenotype of interest. The procedure is as follows (a sketch in R follows the list):

1. Perform SVD on data matrix X (this step is unsupervised).

2. Correlate the columns of V to phenotype of interest using a statistical test (e.g t-test or Pearson correlation). Rigorously, we would need to estimate the number of significant components and only consider those.

3. The component with the best (i.e most significant) correlation to phenotype is selected: label it by j. (Could also select all components that correlate with phenotype of interest).

4. Select the features with the largest absolute weights in the j’th column of the matrix U. A typical criterion is to select, e.g., the top 5% of features, or the features with absolute weights more than 2 standard deviations from the mean.

5. Finally, one can check that the selected features discriminate the phenotype (e.g can use t-test for each selected feature or cluster over all selected features).
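A minimal R sketch of steps 1-4, continuing the svd() sketch above with a hypothetical continuous phenotype:

pheno <- rnorm(ncol(X));                   ## hypothetical phenotype (e.g. age)
pv <- apply(V, 2, function(v) cor.test(v, pheno)$p.value);   ## step 2
j  <- which.min(pv);                       ## step 3: most significant component
w  <- U[, j];                              ## feature weights of component j
sel <- which(abs(w) > mean(abs(w)) + 2*sd(abs(w)));          ## step 4
length(sel);                               ## selected features for step 5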

SVD example: application to Infinium DNA methylation data

[Figure: data matrix of ~27,000 CpGs × 200 samples; significance analysis of the principal components ⇒ approximately 9-10 significant components (red = observed, black = null); components 5 & 6 correlate with age.]


Advantages of SVD

• SVD has many advantages over other dimensional reduction methods, such as the one based on variance filtering:

1. Typically, SVD reduces the dimensionality to ~10 components, much smaller than the typical ~5000 features left after variance filtering.

2. Thus SVD avoids the multiple-testing problem!

3. SVD removes potential redundancy of features: many features will be highly correlated and these are likely to feature in the same component.

4. More robust feature selection: because components represent patterns of variation of several features, if a component correlates with a phenotype, the corresponding features are less likely to be false positives (this can be demonstrated empirically on real data).

5. A given feature can have large absolute weights in multiple components: while this can complicate the interpretation, it also allows features to be part of multiple distinct patterns of variation (e.g distinct molecular pathways if dealing with mRNA expression) in line with biological reality.

Some books + papers

1. Parmigiani, Garrett, Irizarry, Zeger. The Analysis of Gene Expression Data. Springer. Chapters 1-5, 11-12. Somewhat outdated, but still worth reading for the conceptual frameworks.
2. Hastie, Tibshirani, Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd Edition. Springer. More mathematical, but with excellent introductory chapters. PDF freely available from Hastie’s website.
3. Bishop CM. Pattern Recognition and Machine Learning. Springer.
4. Pounds SB. Estimation and control of multiple testing error rates for microarray studies. Briefings in Bioinformatics 2005, Vol. 7, No. 1, 25-36. A review of FDR estimation procedures.
5. Storey JD, Tibshirani R. Statistical significance for genomewide studies. PNAS 2003, 100(16). A seminal paper on FDR estimation. A “must-read” because of its importance and clarity. Source paper for the qvalue R-package.
6. Pawitan Y, et al. False discovery rate, sensitivity and sample size for microarray studies. Bioinformatics 2005, Vol. 21, No. 13. An important paper showing how to estimate the sample size needed to achieve a desired FDR/FNR in genomic studies in the limit of weak correlations. Source paper for the OCplus BioC package.

