UNIVERSITY OF CINCINNATI


Testing for Differentially Expressed and Key Biological Categories in DNA Microarray Analysis

A dissertation submitted to the Graduate School of the University of Cincinnati in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY in the Department of Environmental Health of the College of Medicine, 2007

By

Maureen A. Sartor

Masters in Biomathematics, North Carolina State University, August 2000
B.S., Xavier University, Cincinnati, Ohio, 1998

Committee Chair: Dr. Mario Medvedovic

ABSTRACT

DNA microarrays are a revolutionary technology able to measure the expression levels of thousands of genes simultaneously, providing a snapshot in time of the transcriptome of a tissue or cell culture. Although microarrays have been in existence for several years, research into how best to analyze the data is still ongoing, at least partly because small sample sizes (few replicates) are combined with large numbers of genes. Several challenges remain in maximizing the amount of biological information attainable from a microarray experiment. These challenges lie in the key components of microarray analysis: experimental design, preprocessing, statistical inference, identifying expression patterns, and understanding biological relevance.

In this dissertation we aim to improve the analysis and interpretation of microarray data by concentrating on two key steps in microarray analysis: obtaining accurate estimates of significance when testing for differentially expressed genes, and identifying key biological functions and cellular pathways affected by the experimental conditions. We identify opportunities to enhance analytical techniques, and demonstrate that these enhancements significantly improve the functional interpretation of microarray results. We develop three related Bayesian statistical models to improve the estimates of significance by exploiting the information available from all genes, and functionally relating the variances to their expression levels. These novel methodologies are compared to previously proposed methods both in simulations and with real-world experimental data generated on multiple microarray platforms. In addition, we introduce a logistic regression method for identifying key biological categories and molecular pathways and compare this method with the commonly used Fisher's exact test and other relevant previously developed methods. We make our statistical methods available to the biomedical research community through the use of statistical software widely used for microarray analysis.


ACKNOWLEDGMENTS

I would like to express my gratitude to Dr. Mario Medvedovic for his guidance and support in my research and professional development. I would also like to thank the other members of my committee, Dr. Paul Succop, Dr. Siva Sivaganesan, Dr. Alvaro Puga, and Dr. Michael Wagner for their time and helpful advice and suggestions.

Of course, I cannot forget my husband, George Schmiesing, for his continual encouragement, my children Brixon and Elliot, and my parents for providing me with a strong educational foundation and an appreciation for knowledge.


TABLE OF CONTENTS

CHAPTER 1: Introduction, specific aims, and background ...... 1

1.1. INTRODUCTION ...... 1

1.2. BACKGROUND ...... 4

1.2.1. Microarray technology ...... 4

1.2.2. T-statistics and Bayesian models ...... 6

1.2.3. Microarray data statistics: testing for significance of differential expression ...... 9

1.2.4. Testing for key biological categories ...... 16

1.2.4.1. Gene Ontology Consortium and Kyoto Encyclopedia of Genes and Genomes ...... 16

1.2.4.2. Biological gene set enrichment analysis methods ...... 18

1.2.5. R statistical software and Bioconductor ...... 22

CHAPTER 2: Intensity-based hierarchical Bayes method improves testing for differentially expressed genes in microarrays ...... 23

2.1. OVERVIEW ...... 23

2.2. RESULTS AND DISCUSSION ...... 24

2.2.1. Intensity-based Bayesian model ...... 24

2.2.2. Estimation of hyperparameters ...... 28

2.2.3. Simulation study ...... 31

2.2.4. Results from controlled spike-in datasets ...... 38

2.2.5. Case Studies: Analysis and interpretation of two microarray datasets ...... 41

2.2.5.1. Results from the MEF Ahr-/- dataset ...... 41

2.2.5.2. Results from nickel exposure dataset ...... 43

2.3. CONCLUSIONS ...... 47

2.4. METHODS ...... 50

2.4.1. Mice and exposure protocol ...... 50

2.4.2. Microarray hybridizations ...... 51


2.4.3. Data normalization and analysis ...... 52

CHAPTER 3: Systematic comparisons reveal that logistic regression provides a simple yet powerful approach to identifying enriched biological groups in gene expression data ...... 53

3.1. OVERVIEW ...... 53

3.2. RESULTS AND DISCUSSION ...... 54

3.2.1. LRpath model ...... 54

3.2.2. Simulation study ...... 57

3.2.3. Comparisons with real-world breast cancer datasets ...... 62

3.2.3.1. Subset analyses ...... 62

3.2.3.2. Concordance analysis ...... 64

3.2.4. Application: Results from a human IPF dataset ...... 66

3.3. CONCLUSIONS ...... 69

3.4. MATERIALS AND METHODS ...... 73

3.4.1. Simulation steps ...... 73

3.4.2. Construction of the gold-standard set of GO terms for the breast cancer dataset ...... 74

CHAPTER 4: Full hierarchical and empirical Bayesian spline-based models for the analysis of multiple types of microarray data ...... 75

4.1. OVERVIEW ...... 75

4.2. METHODS ...... 79

4.2.1. Empirical Bayes model ...... 79

4.2.1.1. Estimating the hyperparameters ...... 83

4.2.1.2. Defining the moderated T-statistic...... 83

4.2.2. Full Bayesian model ...... 84

4.2.2.1. Calculation of posterior predictive probabilities ...... 87

4.3. RESULTS ...... 88

4.3.1. Smoothing spline versus local regression ...... 88

4.3.2. Simulation results ...... 90


4.3.3. Breast cancer datasets ...... 93

4.4. DISCUSSION ...... 96

CHAPTER 5: Discussion and conclusions ...... 97

5.1. COMBINED BENEFITS OF IBMT AND LRPATH ...... 97

5.1.1. Simulations ...... 97

5.1.2. Breast cancer results ...... 100

5.2. INTEGRATED DISCUSSION AND CONCLUSIONS ...... 102

5.3. FUTURE DIRECTIONS ...... 103

Bibliography ...... 105

Appendices ...... 115

A.1. APPENDIX FOR CHAPTER 2 ...... 115

A.2. APPENDIX FOR CHAPTER 3 ...... 124

A.2.1. Supplemental figures and tables for Chapter 3 ...... 124

A.2.2. Supplemental applications for Chapter 3 ...... 125

A.2.2.1. Results from AML/ALL dataset ...... 125

A.2.2.2. Results from cardiovascular TCDD dataset ...... 127

A.3. APPENDIX FOR CHAPTER 4 ...... 128

A.3.1. Empirical Bayes smoothing spline estimation ...... 128

A.3.2. Full Bayesian joint and full conditional distributions ...... 129

A.4. RESEARCHER CONTRIBUTIONS ...... 130


LIST OF TABLES

Table 1.1 Recently proposed statistical methods for assessing differential expression of gene transcripts in DNA microarray experiments ...... 10

Table 1.2 Fisher's exact 2x2 contingency table ...... 19

Table 1.3 Methods included in comparisons of statistical approaches to assess enrichment of biological categories ...... 20

Table 2.1 Simulated estimation of prior degrees of freedom for SMT and IBMT ...... 37

Table 2.2 Top significant Gene Ontology categories for the MEF Ahr-/- dataset ...... 42

Table 2.3 Number of significant Gene Ontology terms and assigned genes among methods for Nickel exposure dataset ...... 44

Table 3.1 Significant KEGG pathways and their properties from Application: Human IPF dataset ...... 67

Table 4.1 Performance of methods in simulations under a range of parameter sets ...... 92

Table 4.2 Intra-experiment Pearson correlation coefficients for variance estimates ...... 94

Table 4.3 Correlations between two independent Breast Cancer datasets ...... 95

Table A.1 Control of false positive rate in Affymetrix “spike-in” dataset ...... 119

Table A.2 Full list of significant Gene Ontology categories for MEF Ahr-/- dataset ...... 121

Table A.3 Top ranked genes in IBMT, but not SMT, and vice versa for Nickel data ...... 122

LIST OF FIGURES

Figure 1.1 Central dogma of molecular biology ...... 1

Figure 1.2 Flowchart of a typical dual-channel microarray experiment ...... 5

Figure 1.3 Two extreme methods of identifying differentially expressed genes ...... 8

Figure 1.4 A portion of the cellular component ontology of GO ...... 17

Figure 2.1 Dependence of gene variance on average log-intensities ...... 25

Figure 2.2 Values used in IBMT simulations...... 32

Figure 2.3 The t-test, SMT, and IBMT correctly estimate the proportion of false positives ...... 34

Figure 2.4 Example false positive curves for IBMT and other relevant methods ...... 35

Figure 2.5 Areas under false positive curves for all three strengths of dependency of variance on average spot intensity, and for additional simulations ...... 36

Figure 2.6 Results from the Choe, et al. spike-in experiment...... 39

Figure 2.7 Results from HG-U133 latin-square spike-in experiment...... 40

Figure 3.1 LRpath simulation results: Ability to rank enriched GO terms ...... 58

Figure 3.2 LRpath simulation results: Ability to detect enriched GO categories as significant ...... 60

Figure 3.3 LRpath simulation results: Comparison of bias in p-values ...... 61

Figure 3.4 ROC-like curves for Breast Cancer dataset ...... 64

Figure 3.5 Concordance of methods between two independent Breast Cancer datasets ...... 66

Figure 3.6 LRpath assessment of Gene Ontology (GO) terms over-represented in human idiopathic pulmonary fibrosis (IPF) ...... 70

Figure 4.1 Typical relationships observed between gene log-variances and average log-expression level ...... 77

Figure 4.2 Directed acyclic graph of the empirical Bayesian model...... 80

Figure 4.3 Directed acyclic graph for full hierarchical Bayesian model ...... 85

Figure 4.4 Comparison of smoothing spline and local regression to estimate function ...... 89

Figure 4.5 Convergence properties of the full Bayesian model ...... 91

Figure 4.6 Simulation results for empirical and full spline-based methods ...... 92

Figure 4.7 Comparison of methods through the use of gene set enrichment testing ...... 95

Figure 5.1 Combined benefit of LRpath and IBMT in terms of ranking enriched gene sets ...... 98

Figure 5.2 Combined benefit of LRpath and IBMT in terms of detecting enriched gene sets as significant ...... 99

Figure 5.3 Combined benefit of LRpath and IBMT: ROC-like curves for Breast Cancer dataset ...... 101

Figure A.1 Dependency of log-variance on log-expression intensity in simulation study ...... 116


Figure A.2 Improved relative performance of t-test with higher sample degrees of freedom ...... 117

Figure A.3 Control of false positive rate in simulation studies for additional parameter sets ...... 118

Figure A.4 Variance-Intensity relationship for Affymetrix latin-square experiment ...... 120

Figure A.5 Full ROC curves from Figure 3.4 ...... 124

Figure A.6 Histograms of p-values testing Gene Ontology for AML-ALL dataset ...... 126

LIST OF SYMBOLS

AhR: Aryl hydrocarbon receptor, ChIP: chromatin immunoprecipitation, DAG: directed acyclic graph, DEGs: differentially expressed genes, ER: Estrogen receptor, FDR: False discovery rate, FE: Fisher's exact test, GEO: Gene Expression Omnibus, GO: Gene Ontology, GSEA: Gene set enrichment analysis, IBMT: Intensity-based moderated t-statistic, KEGG: Kyoto Encyclopedia of Genes and Genomes, IPF: idiopathic pulmonary fibrosis, LR: logistic regression, MEF: mouse embryonic fibroblasts, mRNA: messenger ribonucleic acid, RMA: Robust multichip analysis, ROC: Receiver-operator characteristic, SMT: Smyth's moderated t-statistic


CHAPTER 1: Introduction, specific aims, and background

1.1. Introduction

The central dogma of molecular biology, which explains how information is transferred from DNA to RNA to protein, was developed several decades ago and is shown in Figure 1.1. Although the general process is now known to be much more complex than this, it is commonly accepted that a change in the steady state levels of messenger RNA (mRNA) in the cytoplasm of cells may serve as an indicator of a change in protein levels, the main machinery for accomplishing cellular tasks. Hence changes in the relative amounts of mRNA may be an important indicator of how cells respond to changes in their environment.

Figure 1.1 Central dogma of molecular biology. Source: http://faculty.uca.edu/~johnc/rnaprot1440.htm

Microarrays have proven to be a powerful high-throughput technology that allows one to measure the steady state transcriptional expression levels of thousands of genes in parallel, and are an important technology in a wide array of biomedical research, including the fields of toxicology and environmental health. In these particular fields of research, microarrays are widely used to study the molecular and cellular pathways involved in the response to an environmental exposure or toxin. DNA microarrays may be used to investigate gene expression under different treatments or doses of a treatment, different types or stages of a disease, different genotypes, multiple time points after an exposure or treatment, or any combination of the above.

For the fields of toxicology and environmental health, this understanding is an important step in developing treatments for harmful environmental exposures, defining minimum acceptable exposure levels, understanding the interactions between genotype and environmental responses, and enabling researchers to hypothesize the toxicological effects of similar but untested toxins.

Due to the complexity of microarray data, mediocre or sub-standard analysis can easily lead to faulty conclusions, and possibly many months spent researching futile genes or molecular pathways. Conversely, the use of high-quality methods for analyzing groups of genes involved in common molecular and cellular pathways helps investigators identify potential effects of environmental toxicants, novel medications, gene-knockouts, or other experimental conditions not obvious otherwise.

In more recent years, the breadth of use for microarrays has expanded to include measuring differential expression of micro RNAs (miRNAs), differential binding of promoter elements through the use of chromatin immunoprecipitation (ChIP-on-chip experiments), and even genome-wide binding sites (full tiling arrays). Although the applications in this dissertation concentrate on the genome-wide expression microarrays, the methods described are applicable to the above additional technologies as well.

This dissertation covers two important aspects of microarray analysis: testing individual genes for significant differential expression, and identifying functionally-related gene groups affected in a microarray experiment (also called gene set enrichment analysis). The specific aims of the dissertation are to develop improved methods for the above two steps in the analysis of microarray data, to apply the novel methods to simulated and real data, to assess their individual and combined benefits, and to implement the methods as functions in R statistical software for distribution to the biomedical research community. In the background section of this dissertation, we introduce the reader to microarray technology and experimental design, simple statistical methods used to test for differential expression of genes, and more complex methods for testing differential expression, including several hierarchical Bayesian methods. For gene set enrichment analysis, we provide background on the databases used to group genes by function or pathway, on Fisher's exact test and logistic regression, and on other relevant previously proposed methods.

In Chapters 2, 3, and 4, we present the three manuscripts developed from this dissertation research. In Chapter 2, we develop a novel Bayesian adjustment to the T-statistics for testing for differentially expressed genes. Our proposed adjusted T-statistics make use of improved estimates of variance, which, as we show, results in more biologically relevant sets of genes determined to be differentially expressed. This methodology (IBMT, Intensity-Based Moderated T-statistic) uses local regression to predict variance levels across all measured genes to improve upon the variance estimate of each individual gene. The prior variance of each gene is predicted based on its expression level as measured by the spot fluorescence levels on the microarrays.

Chapter 3 is devoted to developing an improved method for gene set enrichment analysis. We present a new method (LRpath) to test for key biological categories enriched with differentially expressed transcripts in microarray experiments. We functionally relate the odds of gene set membership to the significance of differential expression, and calculate adjusted p-values as a measure of statistical significance. The new approach is compared to Fisher's exact (FE) test and other relevant methods in a simulation study and in the analysis of real microarray data. The simulation study preserves the membership structure of the Gene Ontology hierarchy (described below), and tests the effect of several parameters on method performance. Additional assessments, including consistency of the methods across independent datasets, are made using two breast cancer datasets. An R function for implementing LRpath can be downloaded from http://eh3.uc.edu/lrpath.

In Chapter 4, we present full and empirical hierarchical Bayesian models, both of which make use of splines to functionally relate gene variance to expression level. These models provide validation for the IBMT method of Chapter 2, which was not a true empirical Bayes model due to its use of non-parametric local regression. Although the spline-based empirical model (SSIBMT) performed nearly identically to IBMT in simulations and with real-world data, we provide evidence for a slight advantage of SSIBMT over IBMT. The full Bayes model accounts for the uncertainty in the estimated spline function and for the uncertainty in the variance of the gene variances, and differs from the empirical models in other respects as well; nevertheless, it does not demonstrate an advantage over the empirical models.

In Chapter 5, we assess the combined benefits of IBMT and LRpath through objective assessments with simulated and real-world data. Further overall conclusions from the methods of Chapters 2-4 are provided, as well as possible future directions.

1.2. Background

1.2.1. Microarray technology

Microarrays are a revolutionary technology enabling one to measure the relative transcription levels of thousands of genes simultaneously. Two main types of microarrays are currently in use, dual-channel and single-channel. For single-channel microarrays only one biological sample is used per array, whereas for dual-channel arrays, mRNA is extracted from two different tissue samples or cell cultures, fluorescently labeled, and hybridized together on one array (Figure 1.2). In both types the steady state mRNA transcript amounts are measured by the spot fluorescence intensity levels. For each comparison of interest in a microarray experiment, biological and/or technical replicates are often performed, and several comparisons of interest may be involved in one experiment. Several approaches have been described for the experimental design (Kerr & Churchill 2001; Churchill 2002; Yang & Speed 2002), pre-processing and normalization (Kooperberg et al. 2002; Yang et al. 2002; Bolstad et al. 2003; Irizarry et al. 2003a; Irizarry et al. 2003b), and basics of statistical testing of differential expression (Kerr et al. 2000; Wolfinger et al. 2001; Cui & Churchill 2003; Leung & Cavalieri 2003) in microarray data. In what follows, for brevity we assume that the initial steps in microarray analysis (Figure 1.2) have been completed. Olson (Olson 2006) and Allison et al. (Allison et al. 2006) provide general overviews of the complete process of microarray experiments.

Figure 1.2 Flowchart of a typical dual-channel microarray experiment. Source: (Leung & Cavalieri 2003)

Upon completion of the wet-lab portion of the experiment, microarrays are imaged, background adjusted, and often normalized either per array for dual-channel arrays, or between arrays for Affymetrix GeneChips (Figure 1.2). Once the data has been properly normalized, the simplest way to determine a list of differentially expressed genes is to calculate estimates of fold change, and then consider all genes with a magnitude of fold change higher than a pre-determined cutoff level to be significant. A commonly used estimate is the average log-fold change:

$$\hat{\beta}_g = \operatorname{mean}\left(\log \frac{T_g}{C_g}\right), \tag{1.1}$$

where $T_g$ and $C_g$ are the expression intensities of the treated and control samples, respectively, for a specific gene g.
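As a minimal sketch of Eq. 1.1 and the fold change cutoff rule, assuming paired treated and control intensities and base-2 logarithms (the base is not specified above), the calculation could be written as follows. The function names, the cutoff value, and the example intensities are hypothetical and serve only as illustration.

```python
import math

def average_log_fold_change(treated, control):
    """Mean of log2(T/C) across paired replicates for one gene (Eq. 1.1)."""
    ratios = [math.log2(t / c) for t, c in zip(treated, control)]
    return sum(ratios) / len(ratios)

def genes_passing_cutoff(data, cutoff=1.0):
    """Return gene names whose |average log2 fold change| exceeds the cutoff."""
    selected = []
    for gene, (treated, control) in data.items():
        if abs(average_log_fold_change(treated, control)) > cutoff:
            selected.append(gene)
    return selected

# Hypothetical intensities: (treated replicates, control replicates) per gene.
example = {
    "geneA": ([800.0, 1600.0, 1600.0], [200.0, 400.0, 200.0]),  # up-regulated
    "geneB": ([100.0, 120.0, 90.0], [100.0, 110.0, 100.0]),     # unchanged
    "geneC": ([50.0, 25.0, 50.0], [200.0, 200.0, 100.0]),       # down-regulated
}

print(genes_passing_cutoff(example, cutoff=1.0))
```

Note that this rule ignores the variability between replicates entirely, which is the limitation that motivates the test statistics discussed next.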

1.2.2. T-statistics and Bayesian models

One basic tool for testing differential gene expression is the T-statistic. For a specific gene g, the T-statistic can be expressed as:

$$t_g = \frac{\hat{\beta}_g}{SE_g}, \tag{1.2}$$

where $\hat{\beta}_g$ is an estimate of log fold change for gene g and $SE_g$ is the standard error for gene g. The standard error is most simply calculated as:

$$SE_g = \sqrt{\frac{s_g^2}{n}}, \tag{1.3}$$

where $s_g^2$ is the sample variance and n is the number of replicates. We stress that for the T-statistic the standard error calculations are based only on measurements of each gene separately.

The T-statistic and its degrees of freedom (df), which are calculated based on the sample size, are used to determine a p-value for gene g being differentially expressed. Once all the p-values are calculated, a cutoff level may be chosen to create a list of differentially expressed genes. The T-statistic (Eq. 1.2) is problematic for many microarray experiments because with small sample sizes it is often unclear whether the data satisfies the normality assumption, and the Central Limit Theorem cannot be relied upon. Also, underestimating the true standard error of a gene artificially inflates the T-statistic by decreasing its denominator. When testing several thousand genes, it is likely that the standard error of one or two hundred genes (less than 1% of the total number) will be greatly underestimated. Such genes are likely to be false positives. Likewise, if the variability of a truly differentially expressed gene is overestimated, then that gene is likely to be a false negative.
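The sensitivity of the T-statistic to an underestimated standard error can be illustrated with a short sketch. It assumes a one-sample design in which each gene has n replicate log fold changes; the function name and the replicate values are hypothetical.

```python
import math
import statistics

def t_statistic(log_ratios):
    """One-sample T-statistic for a gene's replicate log fold changes.

    Returns (t, df): mean / sqrt(s^2 / n) and n - 1 degrees of freedom.
    """
    n = len(log_ratios)
    mean = statistics.mean(log_ratios)
    s2 = statistics.variance(log_ratios)  # sample variance, n - 1 denominator
    se = math.sqrt(s2 / n)
    return mean / se, n - 1

# Two hypothetical genes with the same mean log fold change (0.5)
# but very different sample variances across three replicates.
tight = [0.49, 0.50, 0.51]  # variance happens to be very small
noisy = [0.10, 0.50, 0.90]

t_tight, df = t_statistic(tight)
t_noisy, _ = t_statistic(noisy)
print(df, round(t_tight, 1), round(t_noisy, 2))
```

With only three replicates, a sample variance as small as that of the first gene can arise by chance, and the resulting T-statistic is many times larger than that of the second gene even though the estimated fold changes are identical.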

When testing for differential expression in each gene separately, the estimates of variance are often poor due to small sample sizes. However, additional information may be gained by combining variance estimates across all genes, and methods that exploit this information have been shown to improve results (Lonnstedt & Speed 2002; Smyth 2004). Several of these are methods for adjusting the sample variances in the calculation of contrast T-statistics, which test for differential expression in each comparison of interest. Several previous adjustments to T-statistics, as well as other methods that exploit information across genes, have been proposed (Long et al. 2001; Newton et al. 2001; Tusher et al. 2001; Efron et al. 2001; Lonnstedt & Speed 2002; Smyth 2004; Eckel et al. 2004; Newton et al. 2004), including three that take into account the larger variance often seen in genes with lower transcript expression levels (Baldi & Long 2001; Jain et al. 2003; Fox & Dimmic 2006). The idea for these variance adjustments is that if the genes' variances are somehow similar, although not necessarily equal, then an average of several thousand similar variance estimates may be used to improve individual poor estimates of the variance of each gene. The moderated variance serves as a trade-off between two extremes: applying a simple fold change cutoff (Eq. 1.1) and a regular T-statistic (Eq. 1.2) for individual genes (Figure 1.3). Using a simple fold change cutoff for ranking purposes implicitly assumes equal variance among all genes, whereas the T-statistic assumes no similarity or trend among variances. A fold change cutoff also does not provide levels of significance as the t-test does.

Many of the above methods rely on Bayesian statistics, a framework for combining prior information with current data to form posterior conclusions. A simple Bayesian model consists only of the likelihood of the data and fully specified prior distributions for the parameters. Often it is convenient to assume a prior distribution that results in a posterior distribution of the same form; for each distribution function of the data, such a prior is referred to as the conjugate prior.

Figure 1.3 Two extreme methods of identifying differentially expressed genes. The t-test, which measures the variance of log-fold change for each gene separately, is at the opposite extreme of estimating a single, identical variance level for all genes, and ranking genes by absolute fold change. Several methods, many of them Bayesian, use a combination of these two extremes.

Empirical Bayesian methods use the data itself to define the prior information. Rather than assuming known values for the parameters of the prior distributions, these parameters are estimated from the observed data, adding an additional level of complexity to the model. The parameters for these priors are called hyperparameters, and this set-up results in what is referred to as a hierarchical model (Gill 2002). The advantage of empirical Bayes methods is that defining prior distributions from independent data or expert opinion is not required. In full Bayesian hierarchical models, instead of exactly specifying the prior distributions, we define hyperpriors for the hyperparameters (distributions for the hyperparameters), allowing them to vary as additional variables. One may specify the hyperpriors based on expert opinion or use dispersed, uninformative prior distributions so that a majority of the information for the posterior is taken from the data.

1.2.3. Microarray data statistics: testing for significance of differential expression

Many diverse Bayesian methods for detecting differentially expressed genes in microarrays have been proposed. These methods can be categorized as Empirical Bayes or Full Bayes, as parametric (usually assuming a log-normal distribution for the expression levels), semi- or non-parametric, as mixture-based or simple hypothesis based, and based on whether normalization is performed separately or within the model (Table 1.1). Below we describe examples of each type of method.

A common type of proposed method adjusts the significance levels of genes through the use of a T-statistic (Eq. 1.2) with a moderated denominator (Newton et al. 2001; Tusher et al. 2001; Lonnstedt & Speed 2002; Efron & Tibshirani 2002; Eckel et al. 2004). Lonnstedt and Speed (Lonnstedt & Speed 2002) proposed a normal empirical Bayes adjustment for replicated, dual-channel array data implemented in the SMA package of R. They developed a statistic, B, the Bayes log posterior odds of differential expression, to determine significance level. Tusher et al. (Tusher et al. 2001) developed SAM, a method that adjusts the denominator of the regular T-statistic by adding a constant, $v_0$, to define a new relative difference estimate:

$$d_g^{SAM} = \frac{\hat{\beta}_g}{SE_g + v_0}. \tag{1.4}$$

Table 1.1 Recently proposed statistical methods for assessing differential expression of gene transcripts in DNA microarray experiments

| Method reference | Empirical or full Bayesian, or frequentist? | Parametric? | Mixture or hypothesis based? | Normalization within or separate? |
|---|---|---|---|---|
| (Baldi & Long 2001) | full Bayes | parametric | hypothesis | separate |
| (Blangiardo et al. 2006) | full & empirical Bayes | parametric | hypothesis | within |
| (Broet et al. 2002) | full Bayes | parametric | mixture | separate |
| (Durbin et al. 2002) | frequentist | parametric | hypothesis | within |
| (Eckel et al. 2004) | empirical Bayes | non-parametric | mixture | separate |
| (Efron et al. 2001; Efron & Tibshirani 2002) | empirical Bayes | non-parametric | mixture | separate |
| (Fox & Dimmic 2006) | full Bayes | parametric | hypothesis | separate |
| (Lewin et al. 2006) | full Bayes | parametric | hypothesis | within |
| (Lonnstedt & Speed 2002) | empirical Bayes | parametric | hypothesis | separate |
| (Newton et al. 2001; Newton et al. 2004) | empirical Bayes | semi-parametric | mixture | separate |
| (Smyth 2004) | empirical Bayes | parametric | hypothesis | separate |
| (Tusher et al. 2001) | frequentist | parametric | hypothesis | separate |

This method serves as a trade-off between the simple fold change cutoff (Eq. 1.1) and the unadjusted T-statistic (Eq. 1.2), since when $v_0$ is zero it is equivalent to the T-statistic, and for large $v_0$ it approximates the fold change ranking. Although SAM is not strictly a Bayesian model, it still serves the same purpose of “shrinking” the estimated variance levels toward a predicted level based on all measured genes.
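The trade-off controlled by the added constant in Eq. 1.4 can be seen in a small sketch. It assumes the same one-sample standard error as Eq. 1.3 and treats the constant as a user-supplied value; how SAM itself selects the constant is not reproduced here. The function names and replicate values are hypothetical.

```python
import math
import statistics

def sam_like_statistic(log_ratios, v0):
    """Relative difference d = mean / (SE + v0) for one gene (Eq. 1.4)."""
    n = len(log_ratios)
    se = math.sqrt(statistics.variance(log_ratios) / n)
    return statistics.mean(log_ratios) / (se + v0)

def rank_genes(data, v0):
    """Gene names ordered by decreasing |d|."""
    return sorted(data, key=lambda g: abs(sam_like_statistic(data[g], v0)),
                  reverse=True)

# Hypothetical replicate log fold changes.
genes = {
    "small_but_tight": [0.19, 0.20, 0.21],  # small effect, tiny variance
    "large_but_noisy": [1.0, 2.0, 3.0],     # large effect, larger variance
}

print(rank_genes(genes, v0=0.0))   # ordering of the ordinary T-statistic
print(rank_genes(genes, v0=10.0))  # ordering close to that of fold change
```

With the constant set to zero the gene with the tiny variance ranks first, as it would under the T-statistic; with a large constant the ranking follows the magnitude of the fold change instead.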

Efron et al. (Efron et al. 2001; Efron & Tibshirani 2002) introduced a non-parametric empirical Bayes method that is closely related to a multiple-testing adjustment for p-values (Benjamini & Hochberg 1995). In this method, the distribution of z-scores for all genes is modeled as a mixture of affected and unaffected genes:

$$f(z) = p_0 f_0(z) + p_1 f_1(z),$$

where $p_1$ is the probability that a gene is affected, $p_0 = 1 - p_1$, and $f_0(z)$ and $f_1(z)$ are the densities of z-scores for unaffected and affected genes, respectively. The null density, $f_0(z)$, is estimated from the data using differences not due to treatment effect, and then the posterior probability that each gene is unaffected is calculated using Bayes' rule: $p_0(z) = p_0 f_0(z) / f(z)$. Eckel et al. (Eckel et al. 2004) discuss an extension of Efron's 2001 method for dual-channel microarray experiments involving a continuous variable of interest, such as time-series or dose-response experiments. All of the above methods assume the log fold changes in expression are approximately normal. Newton et al. (Newton et al. 2001) proposed an empirical Bayesian model assuming gamma distributed intensities, rather than log-normally distributed fold changes as in equations (1.1), (1.2), and (1.4). The same authors (Newton et al. 2004) later developed a semi-parametric hierarchical mixture model using the EM-algorithm for simple, one-comparison experiments.
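The Bayes rule calculation in the two-component mixture model described above can be sketched as follows. For illustration only, both component densities are assumed to be known normal densities; in the method of Efron et al. they are estimated non-parametrically from the data, which this sketch does not attempt. The function names and parameter values are hypothetical.

```python
import math

def normal_pdf(z, mean=0.0, sd=1.0):
    """Density of a normal distribution at z."""
    return math.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def posterior_unaffected(z, p0, f0, f1):
    """Posterior probability that a gene with score z is unaffected."""
    p1 = 1.0 - p0
    mixture = p0 * f0(z) + p1 * f1(z)  # f(z) = p0 f0(z) + p1 f1(z)
    return p0 * f0(z) / mixture

def f0(z):
    """Assumed null density: standard normal."""
    return normal_pdf(z, 0.0, 1.0)

def f1(z):
    """Assumed density for affected genes: normal centered at 3."""
    return normal_pdf(z, 3.0, 1.0)

for z in (0.0, 1.5, 4.0):
    print(z, round(posterior_unaffected(z, 0.9, f0, f1), 4))
```

Under these assumed densities, a z-score midway between the two component means is equally likely under either component, so the posterior probability equals the prior proportion of unaffected genes; larger z-scores lower it.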

In 2001, Baldi and Long (Baldi & Long 2001) proposed a normal Bayesian method developed for single-channel arrays to test for differential expression. In their method, called Cyber-T, the sample variance is replaced by:

$$s_B^2 = \frac{d_0 s_{0g}^2 + (n-1)\, s_g^2}{d_0 + n - 2} \tag{1.5}$$

where $s_{0g}^2$ and $s_g^2$ are the prior and sample variances for each gene, respectively, and $d_0$ and $(n-1)$ are the prior and sample degrees of freedom, respectively. To our knowledge, their method was the first to recognize a dependence of gene variance on spot intensity level and use this information to improve the resulting statistics. Rather than assigning distributions to their hyperparameters or estimating the hyperparameter values from the marginal distributions of the data, as would be done in an empirical Bayes setting, they simply chose values for the prior parameters in an ad hoc manner. Specifically, their method for calculating the prior predicted variance is to calculate the average standard deviation of genes with a “similar” expression intensity level in a moving window with size defined by the user, and choosing $d_0$ simply requires user input. More recently, Fox and Dimmic (Fox & Dimmic 2006) proposed an extension of Cyber-T for a two-sample single-channel comparison. Like Cyber-T, their method assumes a hierarchical normal Bayesian model and uses a moving window average to calculate the prior variances. However, they calculate the prior degrees of freedom based on the moving window size, rather than basing it on the relative spread in variances compared to what would be expected if the variances were from the same distribution. Put more simply, they assume genes with similar expression levels have identical variance. This is an important contrast with both Smyth's method described below and our methods; a further difference is that our methods are applicable to any experimental design. Importantly, they also do not provide an objective way to estimate the moving window size. Durbin et al. (Durbin et al. 2002) also noticed the non-constant relationship between variance and intensity. Rather than using a Bayesian model, they developed a model-based variance-stabilizing transformation of the raw intensities for single-channel arrays that approximates the log transformation for high intensities and is approximately linear at the low end of the intensity range. They model the intensities as $y = \alpha + \mu e^{\eta} + \varepsilon$, where $y$ is the raw intensity, $\alpha$ is the mean background, $\mu$ is the true intensity, and $\eta$ and $\varepsilon$ are error terms with respective variances $\sigma_\eta^2$ and $\sigma_\varepsilon^2$. The proposed transformation,

$$g(y) = \ln\left(y - \alpha + \sqrt{(y - \alpha)^2 + \sigma_\varepsilon^2 / S_\eta^2}\right), \tag{1.6}$$

stabilizes the variance across the intensity range, assuming the data fit this model.

In 2004, Smyth (Smyth 2004) published an empirical Bayes moderated-T adjustment, similar to (Lonnstedt & Speed 2002), that could be generalized to any experimental design and may be used for either single or dual-channel arrays. Although his method does not account for the dependency of variance on expression level, it has solid theoretical underpinnings and estimates the prior degrees of freedom based on how “similar” the variances are. The sample variance and degrees of freedom for the T-statistic, respectively, are replaced by the following formulas:

$$\tilde{s}_g^2 = \frac{d_0 s_0^2 + d_g s_g^2}{d_0 + d_g}; \qquad \tilde{d}_g = d_0 + d_g \tag{1.7}$$

where $s_0$ and $s_g$ are the prior and sample standard deviations, respectively, and $d_0$ and $d_g$ are the prior and sample degrees of freedom, respectively. Note that $s_0$ is a constant for all genes, whereas the prior variance in Cyber-T (Eq. 1.5), $s_{0g}^2$, is gene specific.
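To make the contrast between Eq. 1.5 and Eq. 1.7 concrete, the sketch below evaluates both shrinkage formulas in Python. The function names and the numerical values are hypothetical and chosen only for illustration; neither function reproduces the published software.

```python
def cybert_variance(s2_sample, s2_prior_gene, d0, n):
    """Eq. 1.5: Cyber-T variance with a gene-specific prior variance."""
    return (d0 * s2_prior_gene + (n - 1) * s2_sample) / (d0 + n - 2)

def smyth_variance(s2_sample, s2_prior, d0, dg):
    """Eq. 1.7: moderated variance and degrees of freedom with a common prior."""
    s2_post = (d0 * s2_prior + dg * s2_sample) / (d0 + dg)
    return s2_post, d0 + dg

# Three made-up sample variances shrunk towards a common prior of 0.10.
for s2 in (0.02, 0.10, 0.90):
    s2_post, df_post = smyth_variance(s2, s2_prior=0.10, d0=4, dg=4)
    print(s2, round(s2_post, 3), df_post)
```

With equal prior and sample degrees of freedom, each moderated variance lies halfway between the sample variance and the prior value, so unusually small and unusually large sample variances are both pulled towards the prior.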

More complex hierarchical, full Bayesian models have also recently been proposed, such as that by Lewin et al. (Lewin et al. 2006). They developed a model that combines the normalization step and testing for differential expression. Although their model is parametric and they assume a log-normal distribution of gene expression levels, they do not propose an adjusted T-statistic for determining significance levels. Assuming a two-sample comparison on single-channel arrays, their basic model set-up is:

$$\begin{aligned} y_{g1r} &\sim N\left(\alpha_g - \tfrac{1}{2}\delta_g + \beta_{g1r},\ \sigma_{g1}^2\right), \\ y_{g2r} &\sim N\left(\alpha_g + \tfrac{1}{2}\delta_g + \beta_{g2r},\ \sigma_{g2}^2\right), \end{aligned} \tag{1.8}$$

where $y_{g1r}$ and $y_{g2r}$ are the log-expression levels for gene g and replicate r of conditions 1 and 2, respectively, $\alpha_g$ is the gene effect, $\delta_g$ is the treatment effect, $\beta_{gsr}$ is the array effect for normalizing the arrays, and $\sigma_{gs}^2$ is the gene-specific variance. Important aspects of their model are as follows: they assume an exchangeable prior for variance (i.e., no dependence of variance on intensity level), they use parametric quadratic splines with free-knot locations to determine the array effects, and they use a full Bayesian model given disperse hyperprior distributions.

Blangiardo et al. (Blangiardo et al. 2006) introduced two related hierarchical models (full and empirical Bayesian) for dual-channel arrays that also incorporate the normalization step into the model. Similar to Lewin et al., they assume exchangeable gene variances and assume they are log-normally distributed. Blangiardo et al. performed self-vs-self calibration experiments along with the comparative experiment of interest, and used the calibration arrays to calculate estimates of certain parameters in the model. Their full Bayesian hierarchical set-up is as follows:


xigc ~ Nigc , xg 

igc   ig  g   c   g ;  xg ~ log N ,   (1.9)

 g ~ N ,  ;  ~ N ,  ;   ~ Inv  Gamma ,   g g g g

where i denotes arrays, g denotes genes, and c denotes dyes. The xigc are the unnormalized expression levels, ig, c, and g are normalization effects estimated from the self-self experiment, g is the treatment effect,  and  are given uninformative priors, and

and are given informative priors whose parameters are also estimated from the self- self experiment. For their empirical Bayesian set-up, they used local regression to normalize the data prior to the model, and calculated the hyperparameters for the prior variance using the self- self experiment.

In 2002, Broet et al. (Broet et al. 2002) proposed a fully hierarchical Bayesian mixture model for testing differential expression of genes. They assumed each log-difference in gene expression was normally distributed, and they modeled an unknown, but finite, mixture of normal densities. They used a flat prior for the log-differences and assumed “exchangeable” variances with a conjugate inverse-gamma prior. Their approach employed the reversible-jump Metropolis-Hastings algorithm, which allows one to use Markov chain Monte Carlo (MCMC) sampling methods to sample from posterior distributions with a changing number of parameters.

In Chapters 2 and 4, we develop methods that account for the dependency of gene variance on expression intensity level, improving the estimates of variance of log-differences in gene expression. Our methods may be used with any experimental design for either single or dual-channel arrays, and all parameter values are estimated in a completely data-dependent manner. Our models can be classified as parametric, hierarchical Bayesian models that assume normally distributed log-differences in gene expression. Two of them are empirical Bayesian in nature, and one is a full Bayesian model for obtaining more accurate estimates of gene variance levels. The full Bayesian model is used to calculate posterior distributions of variance, which are then used to calculate non-Bayesian (frequentist) significance levels. In contrast to the exchangeable variance models described above, in which the posterior variance estimates for each gene are a weighted average of the prior mean variance and the sample variances, resulting in “shrinkage” of the variances towards an overall mean, in our methods the variances are instead pulled towards their predicted values based on a regression, i.e. “shrinkage” towards a smoothed function. In the first method introduced, we use local regression to predict prior variance levels, and in the second and third methods we use splines for this purpose.

1.2.4. Testing for key biological categories

1.2.4.1. Gene Ontology Consortium and Kyoto Encyclopedia of Genes and Genomes

The Gene Ontology (GO) database (Ashburner et al. 2000; Harris et al. 2004) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database (Kanehisa 2002; Kanehisa et al. 2006) are two classification schemes for genes that are used in this dissertation. They are standardized vocabularies for assigning genes to one or more functionally related groups. KEGG pathway (www.genome.jp/kegg/pathway.html) is a database collection of pathway maps representing molecular networks of various cellular processes, such as those involved in metabolic processes and signal transduction. GO consists of three structured ontologies (biological process, molecular function, and cellular component) to which gene products of various species are assigned. The ontologies are hierarchical in nature, and can be represented by directed acyclic graphs; each GO term may have multiple parent terms as well as multiple child terms, where child terms are subclasses of their parents (see Figure 1.4 for example). The group of GO terms to which a gene product belongs is termed its GO annotations and may consist of terms from multiple ontologies and from several levels within each ontology. If a gene product is assigned to GO term x, then it is also assigned to all ancestors of x. Consequently, genes for which little is known tend to be assigned to large broad GO terms, while the smaller, most specific GO terms are populated by well-known genes. Thus, the relationship among all GO terms involves complex correlations and structure.
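The rule that an annotation to term x implies annotation to every ancestor of x can be sketched as a traversal of the directed acyclic graph. The Python example below uses a made-up toy ontology with hypothetical term and gene names; it does not use real GO identifiers or any GO software interface.

```python
def ancestors(term, parents):
    """Return all ancestors of `term`, given a child -> parents mapping."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen

def propagate(direct_annotations, parents):
    """Extend each gene's direct annotations with all ancestor terms."""
    full = {}
    for gene, terms in direct_annotations.items():
        closure = set(terms)
        for term in terms:
            closure |= ancestors(term, parents)
        full[gene] = closure
    return full

# Toy ontology: "leaf" has two parents, which share the parent "root".
toy_parents = {"leaf": ["mid_a", "mid_b"], "mid_a": ["root"], "mid_b": ["root"]}
toy_direct = {"gene1": {"leaf"}, "gene2": {"mid_a"}, "gene3": {"root"}}

for gene, terms in sorted(propagate(toy_direct, toy_parents).items()):
    print(gene, sorted(terms))
```

In this toy example the broad term "root" ends up containing all three genes while "leaf" contains only one, which illustrates why gene sets for related GO terms overlap heavily and are far from independent.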

Figure 1.4 A portion of the cellular component ontology of GO. The relationships that exist in the structure of the database are demonstrated. Cellular_component is the highest level, and arrows point to child terms. Genes assigned to a particular term also belong to all ancestor terms. Image is from MathWorks software Demo.


1.2.4.2. Biological gene set enrichment analysis methods

In Chapter 3, we introduce a new method for identifying biologically related groups of genes that show differential expression. That is, we are taking the interpretation of the microarray results a step further than in Chapter 2 by testing for biological categories that are significantly enriched with differentially expressed genes, rather than testing individual genes.

The Gene Ontology (GO) database, described in Section 1.2.4.1 above, is commonly used in testing for categories enriched with differentially expressed genes (Beissbarth 2006; Osborne et al. 2007). Various web-based or downloadable software programs have been developed for easy implementation of Gene Ontology and pathway testing. Onto-Express (Draghici et al. 2003; Khatri et al. 2005), DAVID/EASE (Hosack et al. 2003; Dennis, Jr. et al. 2003), the GOstats package of Bioconductor (Gentleman 2007), GoMiner (Zeeberg et al. 2003; Zeeberg et al. 2005), FuncAssociate (Berriz et al. 2003), FatiGO (Al Shahrour et al. 2004) and GoSurfer (Zhong et al. 2004) are some such tools for identifying biological categories enriched with differentially expressed genes. Onto-Express uses the three branches of GO along with cellular role and location (Draghici et al. 2003), and has recently also incorporated a weighting scheme in the testing which gives heavier weight to genes with higher downstream pathway impact and higher normalized fold change (Khatri et al. 2005; Draghici et al. 2007).

DAVID/EASE may be used to test a list of differentially expressed genes against the three branches of GO, KEGG pathways, chromosomal regions, and several other classification schemes. All of the above tools assume the data follow a hypergeometric distribution, first recognized by (Tavazoie et al. 1999), and implement either Fisher's exact test or the chi-square test.

Previously, Khatri and Draghici (Khatri & Draghici 2005) provided a thorough comparison of several tests that all use counts of significant genes and use either Fisher's exact test or the chi-square test to identify over-represented categories. These tests all have limitations, most markedly the requirement to choose a specific significance cutoff level to distinguish between genes that are changed versus those that are not. Moreover, different threshold choices may lead to dramatically different significantly enriched categories, and thus different biological conclusions (Pan et al. 2005). Furthermore, categories with a large number of genes exhibiting slight differential expression will likely be missed because only the top differentially expressed genes are considered significant.

Fisher's exact test compares the proportion of genes differentially expressed in each category to the proportion expected to be differentially expressed by chance, based on the total number of changed genes. This can be displayed in a 2x2 contingency table (Table 1.2), in which the proportion A:B is compared to C:D.

Table 1.2 Fisher's exact 2x2 contingency table

|                   | # significant | # not significant |
|-------------------|---------------|-------------------|
| # in category     | A             | B                 |
| # not in category | C             | D                 |

Fisher's exact probability is then expressed as:

$$p = \frac{(A+B)!\,(C+D)!\,(A+C)!\,(B+D)!}{A!\,B!\,C!\,D!\,(A+B+C+D)!}, \tag{1.10}$$

where A is the number of differentially expressed genes in the category, B is the number of non-differentially expressed genes in the category, C is the number of differentially expressed genes not belonging to the category, and D is the number of non-differentially expressed genes not in the category. Fisher's exact test computes the sum of the observed probability and all more extreme (less likely) probabilities to obtain a final p-value. An odds ratio, expressing the odds of group membership for significant versus non-significant genes, can also be calculated as a measure of the strength of enrichment with differential expression. This test requires choosing a significance cutoff level to distinguish between genes that are changed versus those that are not.

Furthermore, testing all categories with at least one differentially expressed gene adds a bias to the test, because whichever small, specific category a significant gene is in will automatically be significant. This may be true even after adjustment of probabilities for multiple testing.
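The calculation behind Fisher's exact test can be written out directly from the table probability in Eq. 1.10. The Python sketch below is a minimal illustration using only the standard library: it sums the probabilities of all tables with the observed margins that are no more likely than the observed table, and the function names are hypothetical.

```python
from math import comb

def table_probability(a, b, c, d):
    """Eq. 1.10 rewritten with binomial coefficients (same value)."""
    return comb(a + b, a) * comb(c + d, c) / comb(a + b + c + d, a + c)

def fisher_exact_two_sided(a, b, c, d):
    """Sum the probabilities of all tables as likely or less likely than observed."""
    row1, col1, n = a + b, a + c, a + b + c + d
    p_observed = table_probability(a, b, c, d)
    total = 0.0
    for x in range(max(0, row1 + col1 - n), min(row1, col1) + 1):
        p = table_probability(x, row1 - x, col1 - x, n - row1 - col1 + x)
        if p <= p_observed * (1.0 + 1e-9):  # small tolerance for rounding
            total += p
    return total

def odds_ratio(a, b, c, d):
    """Odds of category membership for significant vs. non-significant genes."""
    return (a * d) / (b * c)

# Small made-up table: A=3, B=1, C=1, D=3.
print(fisher_exact_two_sided(3, 1, 1, 3), odds_ratio(3, 1, 1, 3))
```

Because the counts A through D depend on which genes are called significant, rerunning the same function after changing the significance cutoff can give a different p-value for the same category, which is the threshold dependence discussed above.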

Several methods have recently been proposed to overcome the limitations of such procedures (Table 1.3). Hosack et al. (Hosack et al. 2003) developed an adjustment to Fisher's exact test in EASE to avoid a dependence of significance level on category size, but this test is less powerful and still has the other disadvantages of using the hypergeometric distribution. BayGO still uses significance counts, but applies a Bayesian framework to 3-by-2 contingency tables. Rather than classifying each gene as assigned/not assigned to a category, BayGO classifies each gene as either not in the category, exclusively in the category, or not exclusive to the category (Vencio et al. 2006). Gene Set Enrichment Analysis (GSEA) was one of the earliest proposed methods using continuous information from all genes, and utilizes a weighted Kolmogorov-Smirnov test. Permutations are performed to estimate the null distribution for significance testing by permuting samples (arrays) if the sample size is adequate, or by permuting genes otherwise (Subramanian et al. 2005). Although in theory GSEA could allow for any testing procedure for differentially expressed genes, its implementation is restricted to a simple signal-to-noise ratio.

Table 1.3 Methods included in comparisons of statistical approaches to assess enrichment of biological categories

| Method | Main statistical test | Input data | User choice for DEG test? |
|--------|-----------------------|------------|---------------------------|
| Fisher's exact | Fisher's exact test | Counts of significant and non-significant genes | Yes |
| GSEA (Subramanian et al. 2005) | Weighted Kolmogorov-Smirnov | Normalized intensities for all genes and all arrays | No |
| globaltest (Goeman et al. 2004) | Logistic regression | Normalized intensities of genes in category of interest | No |
| BayGO (Vencio et al. 2006) | Bayesian 3 x 2 contingency table | Counts of significant and non-significant genes | Yes |
| SigPathway (Tian et al. 2005) | t-test (2 hypotheses tested) | Normalized intensities for all genes and all arrays | No |

Other procedures include those proposed by Tian et al. and Goeman et al., implemented in the R packages sigPathway (Tian et al. 2005) and globaltest (Goeman et al. 2004). sigPathway allows for more than two experimental conditions, and determines significance by comparing (1) the sum of association measures (default is the t-test) between each gene and the phenotype of interest to (2) the distribution of sums under the null. The authors of sigPathway stress that the combination of two separate hypothesis tests is required, where the difference between the hypotheses is whether the genes or sample phenotypes are permuted to calculate the null distribution. Notably, globaltest allows the analysis of more complex models involving multiple covariates, using linear or logistic regression but in a different context from that proposed in this dissertation. For the logistic regression model of globaltest, the binary response variable is treatment or control and the explanatory variables are the expression levels of the genes assigned to the category of interest. For each gene group tested, only expression intensities of the genes belonging to the group are used in the analysis; in other words, for each group the method does not incorporate information from the remaining genes.

Only Fisher's exact, BayGO, and our newly proposed method allow the investigator to choose different methods for detection of differentially expressed genes.

We will introduce a more powerful test achieved through the use of logistic regression. Logistic regression is a natural extension of the chi-square test, allowing the explanatory variable to be measured on a continuous scale. In general, logistic regression is used to model the probability of a binary variable, y, as a function of an explanatory variable, x. The functional form used in logistic regression is:

$$y = P(\text{success} \mid x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}, \tag{1.11}$$

where $\alpha$ and $\beta$ are parameters estimated from the data and success is an indicator for membership in a biological category. In Chapter 3, we model the probability, y, that a gene belongs to a biological category as a function of its significance level, x, for being differentially expressed.
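As a rough illustration of how the parameters in Eq. 1.11 can be estimated, the Python sketch below fits a one-predictor logistic regression by Newton-Raphson on the log-likelihood. The data are made up, the function names are hypothetical, and treating x as a significance score where larger means more significant is an assumption made here for illustration; this is not the implementation developed in Chapter 3.

```python
import math

def logistic(alpha, beta, x):
    """Eq. 1.11: P(success | x), written in the equivalent 1 / (1 + e^-z) form."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))

def fit_logistic(xs, ys, iterations=25):
    """Estimate alpha and beta by Newton-Raphson (maximum likelihood)."""
    alpha, beta = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = logistic(alpha, beta, x)
            w = p * (1.0 - p)
            g0 += y - p              # score for alpha
            g1 += (y - p) * x        # score for beta
            h00 += w                 # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        alpha += (h11 * g0 - h01 * g1) / det
        beta += (h00 * g1 - h01 * g0) / det
    return alpha, beta

# Made-up data: x is a significance score, y marks category membership.
xs = [0.1, 0.3, 0.5, 0.8, 1.2, 1.5, 2.0, 2.5, 3.0, 4.0]
ys = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]

alpha_hat, beta_hat = fit_logistic(xs, ys)
print(alpha_hat, beta_hat)
```

A positive slope estimate means that the odds of belonging to the category increase with the significance score, which is the sense in which a category would be called enriched; no cutoff separating significant from non-significant genes is needed.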

1.2.5. R statistical software and Bioconductor

We developed source code for use in Bioconductor (Gentleman et al. 2004), a bioinformatics project within the R statistical platform, to distribute our statistical methods. R is a versatile, free statistical software environment with computing and graphics capabilities that runs on all commonly used operating systems (Ihaka & Gentleman 1996). Bioconductor is a powerful and flexible open-source computing environment within R for computational biology and bioinformatics analysis that facilitates the integration of various biological data. Its popularity for the analysis of microarray data has grown quickly in the past few years, and it is now a widely used and respected microarray analysis software. Among Bioconductor's many libraries are two that allow direct access to genes' mappings to the KEGG and GO databases; limma, which handles preprocessing of microarray data, linear models for testing for differentially expressed genes, and Smyth's empirical Bayes method (Smyth 2004); and GEOquery, which enables direct downloads from the NCBI Gene Expression Omnibus (GEO) public repository of microarray data into R.

CHAPTER 2: Intensity-based hierarchical Bayes method improves testing for differentially expressed genes in microarrays

2.1. Overview

In this chapter, we describe and evaluate a novel Bayes moderated-T statistic that we refer to as IBMT (Intensity-Based Moderated T-statistic). IBMT is an extension of Smyth's Moderated T-statistic (SMT) (Smyth 2004) and accounts for the dependence of variance on gene signal intensity. Like SMT, IBMT can be used with any experimental design, including but not limited to experiments with multiple treatments and/or both technical and biological replicates, experiments with a continuous covariate, and dual-channel experiments with dye-effects. It can also be used with any array platform, for example Affymetrix, dual-channel, tiling arrays, etc.

Similar to Smyth, we use empirical Bayes (EB) theory to estimate all parameters of the hierarchical Bayesian model. We use non-parametric local regression to functionally relate variance and absolute gene expression measurements. This possibility had been previously proposed but to the best of our knowledge has not been further explored (Baldi & Long 2001).


In this chapter, we describe the hierarchical model for gene expression data, detail the procedure for estimating all parameters in the model, and describe the testing procedure for identifying differentially expressed genes. In simulations carefully designed to mimic real microarray data (Karyala et al. 2004; Guo et al. 2004; Sartor et al. 2004), we determine that overall our method outperforms all other tested methods, including the simple T-statistic, fold change cut-off, SMT, and Fox (Fox & Dimmic 2006). We demonstrate that IBMT performs as well as, or better than, any other tested method when using simulated data and “spike-in” Affymetrix experiments (Choe et al. 2005). We also apply our method to two experimental microarray datasets (Wesselkamper et al. 2005a) that, due to their experimental designs, cannot be correctly analyzed with previously proposed methods that account for the variance-intensity relationship (Cyber-T and Fox). We find that our method generally results in higher significance of Gene Ontology (GO) (Harris et al. 2004) groups when testing for an enrichment of differentially expressed genes. We also provide examples of how our method results in biological conclusions that may not have been attained using an alternative method.

2.2. Results and discussion

2.2.1. Intensity-based Bayesian model

Figure 2.1 displays an example of the dependence of gene variance on expression level, taken from the MEF Ahr-/- dataset (see Section 2.4: Methods), similar to the observed dependency published previously (Baldi & Long 2001). The fact that such a dependency exists is intuitive, in view of how the data are measured from the microarray images. Spots with low fluorescence level will likely have fewer pixels measured, and the resulting estimate of expression is an average or median of fewer or lower numbers. Furthermore, transcripts that are lowly expressed are changed by a greater proportion by the addition of a few labeled transcripts, and thus may actually vary more in biological tissue samples. This relationship between variance and expression level can be modeled as

$$s_{0g}^2(\lambda_g) = f(\lambda_g) + \varepsilon_g \tag{2.1}$$

where the average log-expression level of gene g is denoted by $\lambda_g$, $f(\lambda_g)$ is some function of $\lambda_g$ defined on the range of $\lambda_g$, and $s_{0g}^2$ is the estimated prior variance. As explained below, we chose to model the function $s_{0g}^2(\lambda_g)$ using local regression. The use of local regression differs from the window method of Cyber-T in that the window method pools the standard deviation estimates of all genes in the window, whereas local regression uses a weighted average of the log-variances, where the weight for each gene j depends on the difference between the intensity of gene j and the intensity of the gene of interest, g. This relationship on its own can substantially reduce the uncertainty about the true gene expression variances. For example, the relationship shown in Figure 2.1 explains approximately 34% of variability in individual gene expression variances.

Figure 2.1 Dependence of gene variance on average log-intensities. Typical example of the form of dependency of log-variance on average log-spot intensity for dual-channel microarrays. Red line was determined using local regression. Data were from mouse embryo fibroblast Ahr-/- dataset.

For our intensity-based method, we follow a hierarchical Bayesian set-up similar to SMT (Smyth 2004). Individual gene variances for genes with similar overall expression levels are assumed to have been generated by a single probability distribution. The parameters for the distribution of the variances, $d_0$ and $s_{0g}^2$, are termed the hyperparameters, and are estimated from the data using EB theory. In terms of the precision of the gene expression levels, which is defined as the reciprocal of the variance, $1/s_{0g}^2$ is the mean, and the hyperparameter $d_0$ is the prior degrees of freedom and determines the spread of the distribution for a given $s_{0g}^2$. Larger $d_0$ values result in smaller spread of the distribution for the precision and variance of gene expression levels. Similar to previous methods (Baldi & Long 2001; Fox & Dimmic 2006), by assuming a single hyperparameter for the prior degrees of freedom, we make the assumption that the spread of variance estimates about the background variance level is similar across the entire range of fluorescence levels.

Suppose that $\hat{\mu}_g$ is the estimate of the contrast of interest obtained after fitting the appropriate linear model for gene expression data for gene g. In the simplest case when comparing expression levels between two samples, $\hat{\mu}_g$ is just the difference in average log-expression levels for gene g under the two experimental conditions. We assume the $\hat{\mu}_g$ measurements of log-fold change for each gene follow a normal distribution centered at $\mu_g$, the actual log-fold change:

$$\hat{\mu}_g \sim N(\mu_g, v_g \sigma_g^2)$$

where $\sigma_g^2$ is the residual variance in the linear model for gene g and $v_g$ (written $v_{gi}$ when the contrast i is indexed explicitly) is the coefficient of the variance required to calculate the standard error. For a two-sample t-test, $v_{gi}$ is $1/n_1 + 1/n_2$, where $n_1$ and $n_2$ are the number of observations for each experimental condition. Given the variance $\sigma_g^2$, the sample variance for each gene is assumed to follow a scaled chi-square distribution with $d_g$ degrees of freedom:

$$s_g^2 \mid \sigma_g^2 \sim \frac{\sigma_g^2}{d_g} \chi^2_{d_g}.$$

We adopt the conjugate prior distribution for $\sigma_g^2$

$$\frac{1}{\sigma_g^2} \sim \frac{1}{d_0 s_{0g}^2} \chi^2_{d_0}$$

where $d_0$ and $s_{0g}^2$ are the hyperparameters for the degrees of freedom and variance, respectively.

With this model, the closed-form solutions for the posterior mean of the variance and degrees of freedom given the hyperparameters are:

$$df = d_0 + d_g, \qquad \tilde{s}_g^2 = \frac{d_0 s_{0g}^2 + d_g s_g^2}{d_0 + d_g}$$

where df is the posterior degrees of freedom, $d_g$ is the likelihood degrees of freedom, and $\tilde{s}_g^2$ is the posterior mean of the variance. Our goal is to calculate point estimates of the hyperparameters so that we can calculate expected values for the posterior parameters, $\sigma_g^2$ and df.

We can now use the moderated t-statistic:

$$t_{gi} = \frac{\hat{\mu}_{gi}}{\tilde{s}_g \sqrt{v_{gi}}}$$

to test the hypothesis $H_0$: $\mu_g = 0$ vs. $H_A$: $\mu_g \neq 0$ with df degrees of freedom, where $\hat{\mu}_{gi}$ is the estimate of log-fold change for gene g and contrast i, and $\tilde{s}_g$ is the posterior standard deviation.

As demonstrated by Smyth (Smyth 2004), under the null hypothesis, the resulting moderated T-statistic in IBMT is distributed as Student's t with df degrees of freedom. Thus, differentially expressed genes can be identified by calculating p-values and making appropriate multiple comparisons adjustments. However, if the data grossly deviate from the distributional assumptions, the moderated t-statistics can be used as a heuristic score for ranking genes based on the likelihood that they are differentially expressed, or an alternative empirical-based multiple comparison adjustment can be made, as in (Efron 2006a).
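The effect of moderation on the test statistic can be illustrated with a small Python sketch. It assumes the moderated statistic is the estimated log-fold change divided by the posterior standard deviation times the square root of the variance coefficient, and that the gene-specific prior variance and prior degrees of freedom have already been estimated; the function names and all numbers are hypothetical.

```python
import math

def moderated_t(mu_hat, s2_sample, s2_prior_gene, d0, dg, v):
    """Moderated t-statistic and its degrees of freedom for one gene."""
    s2_post = (d0 * s2_prior_gene + dg * s2_sample) / (d0 + dg)
    return mu_hat / math.sqrt(s2_post * v), d0 + dg

def ordinary_t(mu_hat, s2_sample, v):
    """Ordinary t-statistic using only the gene's own sample variance."""
    return mu_hat / math.sqrt(s2_sample * v)

# Three arrays per condition, so v = 1/3 + 1/3 and dg = 4.
v = 1.0 / 3.0 + 1.0 / 3.0

# A made-up gene with a small fold change and an accidentally tiny sample variance.
t_ord = ordinary_t(0.2, 0.001, v)
t_mod, df = moderated_t(0.2, 0.001, s2_prior_gene=0.05, d0=4, dg=4, v=v)
print(t_ord, t_mod, df)
```

For this gene the ordinary statistic is inflated by the tiny sample variance, while the moderated statistic is much smaller because the variance is pulled towards the value predicted for genes of similar intensity. A p-value would then be read from Student's t distribution with df degrees of freedom.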

2.2.2. Estimation of hyperparameters

The formulas for the posterior mean of the variance and degrees of freedom assume known hyperparameters $d_0$ and $s_{0g}^2$. We follow the empirical Bayes approach and estimate the hyperparameters from the data. Gene-specific prior variances are estimated from $f(\lambda_g)$ as given in (Eq. 2.1), where f(·) is a fitted local regression model of adjusted individual genes' log-variances (see Eq. 2.4) on the average log-expression levels. In this way, we avoid having to pre-specify a functional form for this dependency, and obtain predicted variances for each gene given their spot intensities.

To estimate the prior variance and prior degrees of freedom, we use the common empirical Bayesian method of equating the empirical to expected values for the first and second moments of log-variance. According to the hierarchical model, the sampling variance for each gene, marginally, has the following scaled-F distribution (Smyth 2004):

$$s_g^2 \sim s_{0g}^2 F_{d_g, d_0}$$

Consequently, the log-sample variance is distributed as the sum of a constant and Fisher's Z distribution and has the following expected value and variance:

$$E[\log s_g^2] = \log s_{0g}^2 + \psi(d_g/2) - \psi(d_0/2) + \log(d_0/d_g) \tag{2.2}$$

$$\mathrm{var}[\log s_g^2] = \psi'(d_g/2) + \psi'(d_0/2) \tag{2.3}$$

where $\psi(\cdot)$ is the digamma function and $\psi'(\cdot)$ is the trigamma function (Johnson & Kotz 1970; Smyth 2004). We denote with $e_g$ the non-constant part of (Eq. 2.2) for each gene after solving for $\log(s_{0g}^2)$

$$e_g = \log s_g^2 - \psi(d_g/2) + \log(d_g/2), \tag{2.4}$$

with

$$E[e_g] = \log s_{0g}^2 - \psi(d_0/2) + \log(d_0/2). \tag{2.5}$$

Next, we determine the predicted values for $e_g$, $\mathrm{pred}(e_g)$, as a function of average log-intensities by local regression. We define the prior variance for each gene, $s_{0g}^2$, to be the exponential of $\mathrm{pred}(e_g) + \psi(d_0/2) - \log(d_0/2)$, by substituting $\mathrm{pred}(e_g)$ for $E[e_g]$ in (Eq. 2.5) and solving for $\log(s_{0g}^2)$. To calculate the prior degrees of freedom we equate the empirical variance of the log-sample variances with the marginal variance in (Eq. 2.3) and solve for $d_0$. As indicated before, we assume a priori that $\sigma_g^2$ varies with $\lambda_g$, but its variance is constant for all g. Thus, if the $d_g$'s were all the same and $\psi'(d_g/2) = c$, say, then the marginal variance as given in (Eq. 2.3) would be a constant, with a consistent estimator given by

$$\mathrm{mean}\left[(e_g - \mathrm{pred}(e_g))^2\right] = \frac{1}{n} \sum_g \left(e_g - \mathrm{pred}(e_g)\right)^2.$$

This would yield an estimator for $\psi'(d_0/2)$, given by

$$\mathrm{mean}\left[(e_g - \mathrm{pred}(e_g))^2\right] - c. \tag{2.6}$$

When the $d_g$'s are different, the marginal variances in (Eq. 2.3) differ for different g, but by known values $\psi'(d_g/2)$. Thus if we assume that $d_g$ does not vary drastically, in the sense that $\mathrm{mean}[\psi'(d_g/2)] = \frac{1}{n} \sum_g \psi'(d_g/2)$ approaches a constant c as n gets large, then (Eq. 2.6) is a consistent estimate of $\psi'(d_0/2)$. Typically, $d_g$ does not vary substantially with good quality data, and with Affymetrix data $d_g$ is usually constant because intensity values for all genes on all arrays are measured and used in the analysis. Thus $d_0$ can be estimated consistently by solving

$$\psi'(d_0/2) = \mathrm{mean}\left[(e_g - \mathrm{pred}(e_g))^2\right] - \mathrm{mean}\left[\psi'(d_g/2)\right]$$

for $d_0$. Note that if $d_g$ is constant for all genes, then using $\log s_g^2$ in place of $e_g$ results in the same solution for $d_0$.
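Solving the final equation for the prior degrees of freedom requires inverting the trigamma function, which has no closed form. The Python sketch below is one possible way to do this with the standard library only: the trigamma function is evaluated with a recurrence plus an asymptotic series, and the inversion uses bisection, which is a substitute chosen here for simplicity rather than the procedure used in the dissertation's software. Returning an infinite value when the right-hand side is not positive is also an assumption of this sketch.

```python
import math

def trigamma(x):
    """Trigamma function for x > 0 (recurrence plus asymptotic series)."""
    total = 0.0
    while x < 6.0:
        total += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # 1/x + 1/(2x^2) + 1/(6x^3) - 1/(30x^5) + 1/(42x^7) - 1/(30x^9)
    series = inv + 0.5 * inv2 + inv * inv2 * (
        1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 * (1.0 / 42.0 - inv2 / 30.0)))
    return total + series

def trigamma_inverse(y, lo=1e-6, hi=1e6):
    """Solve trigamma(x) = y by bisection; trigamma is strictly decreasing."""
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        if trigamma(mid) > y:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

def estimate_d0(e, pred_e, dg):
    """Moment estimate of d0 from residuals of e_g about their predicted values."""
    n = len(e)
    mean_sq = sum((a - b) ** 2 for a, b in zip(e, pred_e)) / n
    mean_tri = sum(trigamma(d / 2.0) for d in dg) / n
    rhs = mean_sq - mean_tri
    if rhs <= 0.0:
        return math.inf  # no excess spread beyond sampling variation
    return 2.0 * trigamma_inverse(rhs)

# Residuals constructed so that the true answer is d0 = 4 when dg = 4.
r = math.sqrt(trigamma(2.0) + trigamma(2.0))
print(estimate_d0([r, -r], [0.0, 0.0], [4, 4]))  # close to 4 by construction
```

A larger residual spread gives a larger right-hand side and therefore a smaller estimate of the prior degrees of freedom, meaning less shrinkage of each gene's variance towards its predicted value.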


2.2.3. Simulation study

Simulations were designed to imitate a six-slide, single-channel microarray experiment with three treatments and three controls. The simulations were performed to compare the performance of five methods (t-test, fold change, SMT, IBMT, and Fox) with respect to: a) the strength of relationship between variance and signal intensity, b) estimation of the correct prior degrees of freedom, and c) unbiased estimation of the true false positive rate. Average expression intensities were generated assuming a log-normal distribution with a scale parameter of 1.1, shape parameter equal to 0.34, and threshold parameter 5.1 (Figure 2.2a). The parameters for this distribution were chosen to closely fit the actual distribution of average expression intensities seen from real experiments. Simulations were run assuming prior degrees of freedom $d_0 \in \{1, 4, 16, 100\}$. For each prior degrees of freedom, actual and sample standard deviations were simulated for three different strengths of dependency on average log-intensities (Figure 2.2b), referred to as low, medium, and high. The specific functional form used for this was

$$g(x) = p_1 e^{-0.8(x-5)} + p_2$$

with the following values used for p1 and p2: low: p1 = p2 = 0.875; medium: p1 = 1.25 and p2 = 0.5; and high: p1 = 1.5 and p2 = 0.25. To determine differences among the methods due to sample size, additional simulations were run for a 4-slide experiment (two treatment, two control) and a 10-slide experiment (five treatment, five control), with the high strength dependency, and an additional simulation was also run for the 6-slide experiment with no dependency of variance on average intensities. In the case of no dependence, IBMT performed nearly identically to SMT. All simulations were performed with 15000 “genes”, 300 (2%) of which were designed to be “differentially expressed”. Log-ratios for all genes were simulated as described in (Smyth 2004). Actual mean log-ratios for the 300 differentially expressed genes were simulated from the normal distribution $N(0, 3\sigma_g^2)$, and simulated measured mean log-ratios for all genes were assumed to follow the normal distribution $N(\mu, \sigma_g^2/3)$, where $\mu = 0$ if the gene is not differentially expressed, and $\mu$ is the simulated actual mean log-ratio for the 300 (2%) differentially expressed genes.

Figure 2.2 Values used in IBMT simulations. (A) Distribution of average log-expression levels. (B) Three strengths of dependency of gene standard deviation on expression intensity used in simulations.

The simulation process is summarized here:

For all 15000 genes:

1) Simulate $\mu_g$ (the average log-intensities) as random draws from a log-normal distribution,

2) Define function, $f(\mu_g)$, for dependence of variance on $\mu_g$,

3) Simulate $\sigma_g^2$ as random draws from $d_0 \cdot f(\mu_g) / \chi^2_{d_0}$ (chi-square with d0 degrees of freedom),

4) Simulate $s_g^2$ as random draws from $(\sigma_g^2 / d_g) \cdot \chi^2_{d_g}$ (chi-square with dg = 4 degrees of freedom),

5) W.L.O.G., assume the first 300 genes are differentially expressed; simulate their mean log-ratios $\beta_g$ as random draws from $N(0, 2\sigma_g^2)$,

6) For the remaining 14700 non-differentially expressed genes set $\beta_g = 0$,

7) Simulate estimated log-ratios as random draws from $N(\beta_g, \sigma_g^2/3)$.
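The seven steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the dissertation: the function name `simulate_genes` and its arguments are hypothetical, the log-normal is taken to be the three-parameter form (threshold plus the exponential of a normal variate with mean 1.1 and standard deviation 0.34), the variance function f is assumed to be the square of a standard-deviation curve of the form p1*exp(-0.8*(x-5)) + p2, and the multiplier on the variance of the true log-ratios is exposed as `v0` (set to 2 as in step 5).

```python
import math
import random


def simulate_genes(d0=4, dg=4, p1=1.25, p2=0.5, v0=2.0,
                   n_genes=15000, n_de=300, n_rep=3, seed=1):
    """Sketch of simulation steps 1-7 for a single parameter set."""
    rng = random.Random(seed)

    def chi2(df):
        # A chi-square(df) variate is a gamma variate with shape df/2, scale 2.
        return rng.gammavariate(df / 2.0, 2.0)

    genes = []
    for g in range(n_genes):
        # 1) average log-intensity: threshold 5.1 plus a log-normal draw
        mu = 5.1 + rng.lognormvariate(1.1, 0.34)
        # 2) ASSUMPTION: variance function is the squared SD curve
        f = (p1 * math.exp(-0.8 * (mu - 5.0)) + p2) ** 2
        # 3) true variance: d0 * f(mu) / chi-square(d0)
        sigma2 = d0 * f / chi2(d0)
        # 4) sample variance: sigma2 / dg * chi-square(dg)
        s2 = sigma2 / dg * chi2(dg)
        # 5-6) true mean log-ratio: nonzero only for the first n_de genes
        beta = rng.gauss(0.0, math.sqrt(v0 * sigma2)) if g < n_de else 0.0
        # 7) estimated log-ratio, with variance sigma2 / n_rep
        beta_hat = rng.gauss(beta, math.sqrt(sigma2 / n_rep))
        genes.append((mu, sigma2, s2, beta, beta_hat))
    return genes
```

Calling `simulate_genes(d0=16, p1=1.5, p2=0.25)` would correspond to one run of the high-strength setting; averaging over 100 seeds mirrors the replication described in the text.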

Results from the simulations indicate that the added complexity of the model is outweighed by the additional gain in information. Four methods were compared in their ability to correctly estimate the false positive rate, using estimated False Discovery Rates (FDR) (Benjamini & Hochberg 1995): the simple T-statistic (T), Smyth's moderated T-statistic (SMT), our intensity-based moderated-T (IBMT) method, and Fox's method (Fox). All methods except Fox accurately estimate the percent of false positives, as demonstrated by Figure 2.3. When the prior degrees of freedom is low, Fox's method underestimates the percent of false positives (Figure 2.3a and b), suggesting a real risk that Fox's method gives overly optimistic results with real data. Control of the true false positive rate under additional parameter sets gave the same results, which are shown in Figure A.2 in the Appendices.

Figure 2.3 The t-test, SMT, and IBMT correctly estimate the proportion of false positives. All tested methods except Fox (t-test, SMT, and IBMT) correctly control for the true false positive rate. Data shown is the average of 100 simulations and the mid-strength dependence of variance on expression level with (A) dg = 4, d0 = 1, (B) dg = 4, d0 = 4, (C) dg = 4, d0 = 16, and (D) dg = 4, d0 = 100.

We compared the ability of the methods to identify differentially expressed genes by creating false positive rate curves for each parameter set. These were created by ranking the genes by significance level, and then calculating the number of accumulated false positives with rank less than or equal to x. Example false positive rate curves for the five methods are shown in Figure 2.4. Figure 2.5 summarizes the results for all parameter sets by presenting normalized areas under the false positive curves described above. All results shown are the average of 100 simulation runs. All methods performed poorly when the data were simulated with only one prior degree of freedom. As the number of prior degrees of freedom increased, the performance of all methods except the simple t-test improved, with IBMT overall outperforming the other methods.

Figure 2.4 Example false positive curves for IBMT and other relevant methods. Number of falsely implicated differentially expressed genes with rank ≤ x for the simple t-test, fold change cut-off, SMT, Fox, and IBMT methods. Figure shows the accumulation of false positives by gene rank. Data shown is the average of 100 simulations using (A) the high-strength dependence of variance on expression level and d0 = 100, and (B) the mid-strength dependence and d0 = 1.

[Figure 2.5 panels: (A) Low Strength Dependency (dg=4); (B) Mid Strength Dependency (dg=4); (C) High Strength Dependency (dg=4); (D) High Strength Dependency (dg=2); (E) High Strength Dependency (dg=8). x-axis: prior degrees of freedom (d0 = 1, 4, 16, 100); y-axis: % area under FDR curve; series: t-test, fold, SMT, IBMT, Fox.]

Figure 2.5 Areas under false positive curves for all three strengths of dependency of variance on average spot intensity, and for additional simulations. Areas are normalized so that the highest (worst) possible area is 0.50, the lowest (best) being 0. (A) Low strength dependency- the fold change method performed poorest for low d0, while the simple t-test is poorest with high prior degrees of freedom. IBMT performs minimally better than SMT in this case. Fox performs similarly to fold change. (B) Medium strength dependency- Similar to above, but with the advantage of IBMT larger for high d0. (C) High strength dependency- IBMT performs better than all other methods, especially for mid to high d0. (D) 4-slide simulation- Similar to (C), but with overall poorer performance for the t-test, and slightly more advantage by IBMT. (E) 10-slide simulation- Fox now performs significantly better than fold change, but both have very poor performance for low d0. IBMT still performs best.

Fox's method closely followed the performance of the fold change method, with a substantial advantage over fold change with high dependence of variance on signal intensity. However, it had poor performance when the genes' variances were approximately independent (small prior degrees of freedom). Both these results are probably due to this method's assumption that genes with similar intensities have identical variances. For the simulation with no dependence of variance on expression level, the areas under the false positive curves were the same for both SMT and IBMT. The poor performance of the simple T-statistic in these simulations is most likely related to the low number of experimental replicates. We used four sample degrees of freedom, which was insufficient to accurately measure the variance of each gene separately. In additional simulations performed with higher sample degrees of freedom (8, 12, and 16), the simple t-test showed marked improvement over results based on fewer degrees of freedom, while the other methods did not show as much improvement as the degrees of freedom increased (supplemental Figure S3).
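The false positive curves and their normalized areas can be illustrated with a short sketch. The helper names below are hypothetical, and the normalization is an assumption chosen only because it reproduces the stated range: the area under the first `top_n` ranks (trapezoid rule, starting from zero) is divided by `top_n` squared, so that a ranking consisting entirely of false positives gives exactly 0.50 and a ranking with none gives 0.

```python
def false_positive_curve(pvalues, is_de):
    """Cumulative number of false positives with rank <= x, for x = 1..n.

    Genes are ranked by significance level (smallest p-value first);
    is_de[i] is True when gene i is truly differentially expressed.
    """
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    curve, false_positives = [], 0
    for i in order:
        if not is_de[i]:
            false_positives += 1
        curve.append(false_positives)
    return curve


def normalized_area(curve, top_n):
    """Area under the first top_n ranks divided by top_n**2 (ASSUMED scaling)."""
    head = curve[:top_n]
    # Trapezoid rule on the points (0, 0), (1, c1), ..., (top_n, c_top_n).
    area = sum(head) - head[-1] / 2.0
    return area / float(top_n ** 2)
```

For example, p-values `[0.01, 0.02, 0.03, 0.5, 0.9]` with truth labels `[True, False, True, False, False]` give the curve `[0, 1, 1, 2, 3]`.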

Finally, we compared the ability of IBMT and SMT to accurately estimate the prior degrees of freedom (Table 2.1). Since Fox's prior degrees of freedom is dependent only on the free parameter and sample size rather than estimated from the data (default d0=16 for all 4-slide simulations), Fox was not included in this comparison. As expected, the empirical Bayes method that does not account for the relationship between the variance and the magnitude of expression measurements tends to underestimate the prior degrees of freedom, especially for larger d0 values. As the dependency of variance on average intensities increases, this bias grows stronger.

Table 2.1 Simulated estimation of prior degrees of freedom for SMT and IBMT. Values listed are for the function d0/(d0+dg) and are the mean of 100 simulations. Perfect values for each prior degrees of freedom are: d0=1: 0.20, d0=4: 0.50, d0=16: 0.80, and d0=100: 0.962.

Dependency strength of
variance on intensity    Method   d0=1    d0=4    d0=16   d0=100
Low                      SMT      0.200   0.494   0.774   0.923
Low                      IBMT     0.200   0.500   0.800   0.963
Middle                   SMT      0.198   0.472   0.703   0.813
Middle                   IBMT     0.200   0.501   0.800   0.961
High                     SMT      0.194   0.422   0.571   0.630
High                     IBMT     0.200   0.500   0.801   0.962

For the simulation with no dependence of variance on intensity level, using d0=16, both methods accurately estimated the prior degrees of freedom, with estimates of d0/(d0+dg) equal to 0.802 and 0.803 for SMT and IBMT respectively.
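The quantity d0/(d0+dg) reported in Table 2.1 is the weight given to the prior variance when a gene's variance is moderated, which is why its misestimation matters. The sketch below reproduces the "perfect" values for dg = 4 and shows the moderated variance in the form given by Smyth (2004); the function names are hypothetical, and in IBMT the prior variance passed in would be the intensity-dependent value rather than a single constant.

```python
def prior_weight(d0, dg):
    """Fraction of the moderated variance contributed by the prior."""
    return d0 / float(d0 + dg)


def moderated_variance(s2, s0_sq, d0, dg):
    """Weighted average of the prior variance s0_sq and the sample variance s2."""
    return (d0 * s0_sq + dg * s2) / float(d0 + dg)


# Perfect values of d0/(d0+dg) for dg = 4, as listed in the Table 2.1 caption.
perfect = {d0: round(prior_weight(d0, 4), 3) for d0 in (1, 4, 16, 100)}
```

With d0 = 100 and dg = 4 the prior carries about 96% of the weight, so underestimating d0 (as SMT does under strong dependency) shifts weight back toward the noisy per-gene sample variance.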

2.2.4. Results from controlled spike-in datasets

Two publicly available, completely controlled “spike-in” Affymetrix datasets were used to compare the performance of the same methods, plus Cyber-T, on real-world microarray data. The analysis of these experiments is a natural extension of the simulation studies, as the “correct” results are known. The first experiment consisted of three technical replicates each of control RNA samples and samples with known amounts of spiked-in RNA, and comprised 3,860 individual cRNAs. We used the average of the top 10 expression datasets, as reported by Choe et al. (Choe et al. 2005) and available for download at http://www.elwood9.net/spike. The description of all pre-processing steps used for these expression datasets, as well as further detail of the experimental methods, is given in the original publication (Choe et al. 2005). In the original publication, Cyber-T was determined to be the preferred method for identifying differentially expressed genes, with SAM (Tusher et al. 2001) and the simple t-test being the other methods tested. For all six methods (t-test, fold, SMT, IBMT, Cyber-T, and Fox), we ranked the genes by significance level, and then the number of false positives was calculated as a function of the number of genes deemed to be significant. The order of performance in accumulating the fewest false positives, from best to worst, is IBMT, Fox, Cyber-T, SMT, the simple t-test, and finally fold change (Figure 2.6a).


Figure 2.6 Results from the Choe, et al. spike-in experiment. (A) IBMT results in the fewest false positives overall. The other methods, from best to worst, are Fox, Cyber-T, SMT, t-test, and fold change. (B) Comparison of how accurately each method estimates the true proportion of false positives. The simple t-test performs best in correctly estimating its false positive rate, although all methods underestimate the true number of false positives, as previously noted. Fox's method and especially Cyber-T result in the greatest underestimation of false positives.

The ability of the different methods to correctly establish the statistical significance of differential expression was assessed by comparing estimated and empirically established False Discovery Rates (FDR) (Benjamini & Hochberg 1995). The simple t-test performed best in correctly estimating the FDR (Figure 2.6b). Of the four other methods, IBMT and SMT resulted in estimated False Discovery Rates closest to their true proportions of false positives (Figure 2.6b). All five methods underestimate the number of false positives, which under normal circumstances may result in an unacceptable amount of over-confidence in the significance of results. However, we stress that in this experiment even the simple t-test underestimated the true number of false positives, as has been previously noted (Dabney & Storey 2006). The prior degrees of freedom estimated for this study ranged from 4.0 – 5.4 for IBMT and 1.6 – 1.9 for SMT; using the defaults for the other methods, Cyber-T used 10 and Fox used 16.
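The comparison of estimated and empirically established FDR can be sketched as follows. This is a minimal illustration with hypothetical function names, not the analysis code behind Figure 2.6b: `bh_adjust` computes Benjamini-Hochberg adjusted p-values (the estimated FDR at each gene's cut-off), and `observed_fdr` uses the known spike-in status to compute the true proportion of false positives among the genes called significant at a given estimated FDR.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the least to the most significant gene, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def observed_fdr(pvalues, is_spiked, cutoff):
    """True proportion of false positives among genes with estimated FDR <= cutoff."""
    adjusted = bh_adjust(pvalues)
    called = [i for i in range(len(adjusted)) if adjusted[i] <= cutoff]
    if not called:
        return 0.0
    false_calls = sum(1 for i in called if not is_spiked[i])
    return false_calls / float(len(called))
```

A method estimates its FDR accurately when `observed_fdr(p, truth, c)` stays close to `c` across cut-offs; values well above `c` correspond to the underestimation of false positives described above.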


Figure 2.7 Results from HG-U133 latin-square spike-in experiment. (A) Methods that account for the dependency of variance on signal intensity (IBMT, Cyber-T, and Fox) accumulate the fewest false positives. (B) The simple t-test performs best in estimating the true proportion of false positives, and the others, from best to worst, are SMT, IBMT, Cyber-T, and Fox.

The second spike-in dataset used was the Affymetrix HG-U133 latin-square data set, available at http://www.affymetrix.com/support/technical/sample_data/datasets.affx and consisting of 22,300 probe sets. This dataset consists of 14 sets of 3 chips, each having 42 probe sets (0.19%) spiked-in. After preprocessing with RMA, each consecutive pair of triplicates was analyzed separately to identify the 2-fold changes in expression. In addition, IBMT was used to analyze each set of three consecutive triplicates. Figures 2.7a and 2.7b compare the average accumulation of false positives by gene rank and estimation of the true proportion of false positives, respectively. Note the slight improvement in using three sets at a time compared to pairs. Possibly due to the low number of spiked-in genes for this experiment, the ability of IBMT, Cyber-T, and Fox to rank the differentially expressed genes on top could not be differentiated, as the curves for these three methods cross. However, these methods did outperform SMT, fold change, and the t-test, again indicating the importance of accounting for the dependence of variance on gene signal intensity. Similar to the previous spike-in experiment, Figure 2.7b shows that the t-test performed best in estimating the true proportion of false positives, and Cyber-T and Fox resulted in the greatest underestimation of false positives. Prior degrees of freedom for this data set ranged from 7.6 – 19.3 for IBMT and 5.2 – 8.2 for SMT, while Cyber-T and Fox used the same defaults as the previous data set. The relationship between variance and intensity for this study can be seen in Figure A.4 in Appendix A.1.

2.2.5. Case Studies: Analysis and interpretation of two microarray datasets

2.2.5.1. Results from the MEF Ahr-/- dataset

Although simulations and spike-in datasets point to the potential advantage of IBMT and allow a determination of its general behavior, only with the analysis of experimental data can the practical advantages or disadvantages of the method be observed. We compared the t-test based on the simple linear model, fold change cut-off, SMT, and IBMT on two experimental datasets. Cyber-T and Fox's method were not included because they could not be properly used with the experimental designs of these datasets. The first is a comparison of relative RNA levels of wildtype mouse embryo fibroblast (MEF) cells to aryl-hydrocarbon receptor gene (Ahr) knockout MEF cells, involving both technical and biological replicate arrays. The aryl-hydrocarbon receptor protein (AHR) is a critical mediator of the molecular defense against exposures to environmental toxicants by serving as the receptor in a toxicant-activated signaling pathway (Nebert 1989). The top 300 (2.2%) ranked genes from each of the four methods were used to test for Gene Ontology categories significantly enriched with differentially expressed genes to compare the ability of each method to reveal pathways or cellular processes involved in AHR function. We used a fixed number of genes to test Gene Ontology to keep the comparison of methods unbiased. Testing was performed using Expression Analysis Systematic Explorer (EASE), and linking to the three branches of the Gene Ontology database. Fisher's exact probability was calculated for each gene category, and a Bonferroni-adjusted p-value < 0.1 was used as the significance cut-off level (Hosack et al. 2003).
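The category test described above can be illustrated with a plain one-sided Fisher's exact test followed by a Bonferroni adjustment. The sketch is not the EASE implementation, and the function names and argument layout are hypothetical: `k` is the number of top-ranked genes in the category, `n` the number of top-ranked genes tested (300 here), `K` the number of genes on the array annotated to the category, and `N` the number of annotated genes on the array.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher exact p-value P(X >= k) for X ~ Hypergeometric(N, K, n)."""
    total = comb(N, n)
    upper = min(n, K)
    # Sum the hypergeometric probabilities of seeing k or more category genes.
    tail = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return tail / total


def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]
```

A category would then be declared significant when its entry in `bonferroni(...)`, computed over all tested categories, falls below 0.1.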

Table 2.2 shows the top 10 significant Gene Ontology categories for each method. IBMT had the highest number (17) of significant categories as well as the highest number of unique genes (144) involved in those categories. As opposed to the previous analyses, it is difficult to conclude the relative performance of methods based on this analysis. All four methods identified extracellular genes and genes involved in the extracellular space as important categories altered when the Ahr gene is knocked out. This is consistent with what has previously been observed in vascular SMCs (Guo et al. 2004). IBMT further recognized “response to external stimulus” (as well as several of its progeny: response to biotic stimulus, defense response, and immune response) as being significantly affected. Once the AHR is activated by the binding of an exogenous toxicant, the AHR induces the transcriptional activity of a battery of xenobiotic metabolizing genes as part of a host defensive response (Swanson & Bradfield 1993) and interacts with other signaling pathways to either stimulate or depress signal transduction (Puga et al. 2005). In addition, the interaction of the AHR and TGF-β signalling pathways is known to greatly affect those genes that encode extracellular matrix (ECM) and ECM remodeling proteins (Guo et al. 2004). The full list of significant categories and top ranked genes from each method are available as supplemental information at http://eh3.uc.edu/ibmt.

Table 2.2 Top significant Gene Ontology categories for the MEF Ahr-/- dataset. Top ten categories for each of the four compared methods: magnitude of fold change, simple t-test, SMT, and IBMT. The number of genes in each category is given in parentheses. The IBMT method resulted in both the highest number of significant categories using a 0.10 Bonferroni-adjusted p-value cut-off as well as the highest number of genes in a significant category.

Top 10 GO  | t-test | FOLD | SMT | IBMT
1          | Extracellular space (77) | Extracellular (91) | Extracellular (90) | Extracellular (92)
2          | Extracellular (84) | Extracellular space (82) | Extracellular space (81) | Response to biotic stimulus (39)
3          | Integrin binding (5) | Signal transducer activity (67) | Receptor binding (27) | Extracellular space (80)
4          | Spermine/Spermidine biosynthesis (3) | Organogenesis (38) | Chemoattractant activity (8) | Response to external stimulus (46)
5          | Carboxy peptidase activity (6) | Chemoattractant activity (7) | Signal transducer activity (68) | Defense response (34)
6          | Spermidine metabolism (3) | Receptor binding (24) | Response to biotic stimulus (33) | Signal transducer activity (68)
7          | Polyamine biosynthesis (3) | Histogenesis and organogenesis (9) | Chemokine receptor binding/activity (7) | Chemoattractant activity (8)
8          | Receptor binding (22) | Morphogenesis (39) | Integrin binding (5) | Immune response (27)
9          | Adenosylmethionine decarboxylase activity (2) | Serine-type endopeptidase inhibitor activity (9) | G-protein-coupled receptor binding (7) | Response to pest/pathogen/parasite (19)
10         | Spermine metabolism (3) | Glycosaminoglycan binding (7) | Spermine/Spermidine biosynthesis (3) | Chemokine receptor binding/activity (7)
# Bonf<0.1 | 6 | 8 | 13 | 17
# genes    | 92 | 142 | 135 | 144

2.2.5.2. Results from nickel exposure dataset

The second experimental dataset that we analysed using IBMT is a time series response to nickel inhalation in female 129S1/SvImJ strain mouse lung (Wesselkamper et al. 2005a). Data were collected at five times (3, 8, 24, 48, and 72 hours), each being compared to control samples in triplicate. For each time, samples for one array were labelled with opposite dyes. Data was normalized and analysed for differentially expressed genes as described in the methods. As in the previous section, the analysis of this experiment, which must account for both dye-effect and multiple treatment conditions, is an example that cannot be analyzed correctly by either Cyber-T or Fox's method.

We tested for significant GO categories as described above for the top ranked 200 (1.5%) genes in each comparison, and three different p-value cut-off values were used for significance rather than the stricter Bonferroni adjustment due to overall lower p-values from Fisher's exact test in this dataset. Two hundred rather than 300 genes were used in this experiment because only approximately 200 genes were significantly differentially expressed at the earliest time-point based on previous analysis. Table 2.3 displays a summary of the results from testing for significant Gene Ontology categories. IBMT found the highest number of unique genes (666) involved in the significantly identified categories across time. The FOLD method results in the highest number of significant categories overall, and IBMT found the most significant categories using the two smaller p-values of 0.0001 and 0.001.

Table 2.3 Number of significant Gene Ontology terms and assigned genes among methods for Nickel exposure dataset. The number of significant categories, as well as the number of genes assigned to the significant categories, are shown for the five time points for each of three p-value cut-offs.

                    Number of unique genes       Number of significant categories
p-value  Time pt    T     FOLD   SMT    IBMT     T     FOLD   SMT    IBMT
0.0001   03hr       0     0      0      0        0     0      0      0
0.001    03hr       0     0      0      0        0     0      0      0
0.005    03hr       4     6      6      8        2     3      3      4
0.0001   08hr       0     16     0      46       0     1      0      2
0.001    08hr       0     49     12     54       0     5      9      2
0.005    08hr       14    71     53     54       6     28     21     10
0.0001   24hr       25    22     26     26       11    15     15     15
0.001    24hr       52    32     62     56       15    19     21     39
0.005    24hr       65    66     72     69       25    34     35     35
0.0001   48hr       0     0      42     46       0     0      1      2
0.001    48hr       2     9      44     52       1     3      4      6
0.005    48hr       49    34     49     60       8     26     15     25
0.0001   72hr       0     59     57     58       0     9      5      7
0.001    72hr       45    61     63     66       3     17     12     11
0.005    72hr       51    68     77     71       7     42     20     17
Total               307   493    563    666      78    202    161    175

# Zeroes            6     3      3      2
# Best              0     2      4      8        0     6      3      7

Given the nature of this experiment, one would expect that some functional categories would be affected at two or more time points. Therefore, an additional measure of performance is the level of overlap across time points in which categories were found to be significant. To accomplish this aim, we calculated the average number of time points at which each significant category was determined to be significant, using the same three p-value cut-offs as above. The results are, for p-values of 0.0001, 0.001, and 0.005 respectively, FOLD: 1.04, 1.16, and 1.39; T: 1.00, 1.12, and 1.26; SMT: 1.17, 1.44, and 1.45; and IBMT: 1.30, 1.60, and 1.58. Thus, according to the results, the IBMT method gave the most consistent results through time. The list of significant GO categories is available at http://eh3.uc.edu/ibmt with our supporting material.
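The self-consistency measure used here, the average number of time points at which a significant category is significant, can be written down directly. In the sketch below the function name and the input layout (a mapping from a time-point label to the set of categories significant at that time point, for one method and one p-value cut-off) are assumptions made for illustration.

```python
def mean_timepoints_per_category(significant_by_time):
    """Average number of time points at which each significant category appears."""
    counts = {}
    for categories in significant_by_time.values():
        for category in categories:
            counts[category] = counts.get(category, 0) + 1
    if not counts:
        return 0.0
    # Total (category, time point) hits divided by the number of distinct categories.
    return sum(counts.values()) / float(len(counts))
```

A value of 1.00 means that no category was significant at more than one time point, while larger values indicate more overlap across time.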

Acute lung injury is a severe clinical syndrome that results from multiple causes including pneumonia, sepsis, trauma, and inhaled irritants (Ware & Matthay 2000). Pathological conditions associated with the development of acute lung injury include alveolar damage, inflammatory cell influx and activation, pulmonary edema and hemorrhage, alteration of surfactant production, and insufficient gas exchange (Lewis & Jobe 1993; Ware & Matthay 2000; Chollet-Martin 2000). Prior studies have assessed aspects of the molecular mechanisms involved in the pathogenesis of acute lung injury in mice using inhaled nickel (McDowell et al. 2000; Wesselkamper et al. 2000; Prows & Leikauf 2001; McDowell et al. 2002; Hardie et al. 2002; McDowell et al. 2003; Wesselkamper et al. 2005a; Wesselkamper et al. 2005b).

IBMT identified several transcripts that could play significant roles in the development of nickel-induced acute lung injury that were not recognized using the SMT method. For example, following 24 h of nickel exposure, transcripts for three heat shock proteins (HSPs) were found to be induced using the IBMT method but not the SMT method, including heat shock 70kD protein 5 (HSPA5, 2.3-fold), heat shock protein 1B (HSPA1B, 2.4-fold), and heat shock protein 9A (HSPA9A, 2.3-fold). HSPs are a group of genes that are transcriptionally regulated in response to cellular stress. In the lung, induction of HSPs protects against acute lung injury in in vivo (Villar et al. 1993; Villar et al. 1994) and in vitro models (Wong et al. 1996; Wang et al. 1996; Wong et al. 1997). Thus, HSP induction in response to nickel may be involved in an early cytoprotective mechanism in the development of acute lung injury.

Another transcript that was determined to be significantly changed using the IBMT method but not the SMT method was from a group of genes known as aquaporins, which facilitate water movement through the air space-capillary barrier in the lung (King & Agre 1996). Expression of aquaporin 5 (Aqp5), the major water channel gene expressed in alveolar and bronchial epithelium, decreased an estimated 2.3-fold after 48 h of nickel exposure. In previous studies, decreased expression of Aqp5 has been associated with acute lung injury caused by adenoviral infection (Towne et al. 2000) and bleomycin treatment (Gabazza et al. 2004) in mice. These data are consistent with the modulation of Aqp5 expression in regulating fluid homeostasis and abnormal fluid fluxes in the development of pulmonary inflammation and edema associated with acute lung injury.

Finally, another significantly altered transcript that was identified by IBMT and not SMT was fibroblast growth factor 2 (FGF2, a.k.a. basic fibroblast growth factor). Mouse lung FGF2 transcript levels were estimated to be induced 5.6-fold after 72 h of nickel exposure. In the lung, Fgf2 is expressed in alveolar type II cells (Sannes et al. 1996), and may have multiple biological activities in vitro and in vivo, including angiogenesis, mitogenesis, and cellular differentiation (Basilico & Moscatelli 1992). Additionally, induction of Fgf2 expression can influence cell proliferation and biosynthetic events that are important to the proper resolution of tissue injury in the lung (Li et al. 2000; Carreras et al. 2001). Thus, increased Fgf2 expression may be an important molecular event in the pathogenesis of nickel-induced acute lung injury.


Taken together, the IBMT method successfully identified several transcripts that were significantly changed at various times throughout the development of nickel-induced acute lung injury in mice that were not identified by the SMT method. These transcripts have been previously investigated in the development of lung injury, and may have biological relevance in our mouse model. The lists of top-ranked genes by IBMT but not SMT, and vice versa, are available as supplemental information.

2.3. Conclusions

IBMT has the advantage of balancing two important factors in the analysis of microarray data: the degree of independence of variances relative to the degree of identity (i.e. t-test vs. equal variance assumption), and the relationship between variance and signal intensity. We demonstrated that incorporating information about the dependence of the variance of genes on expression intensity level can improve the efficiency of the Empirical Bayes moderated t-statistics, and that properly estimating the prior degrees of freedom is important in estimating the true proportion of false positives. If a non-intensity-based moderated-T is used, and the variance of low expressed genes is higher than average, then an over-representation of low expressed genes will occur in the top ranked differentially expressed transcripts because their variance estimates will be “shrunk” towards the lower overall variability. This in turn results in a higher rate of falsely implicated genes and makes the interpretation of the results more difficult. Indeed, this trend could be seen in the comparison of genes found to be significant in SMT but not IBMT, or vice versa, in the nickel exposure experiment. SMT identified a large number of relatively low expressed genes (49% < 100 signal level; median expression level = 99), often with unknown function, as being significantly changed compared to IBMT (0% < 100 signal level; median expression level = 357). To our knowledge, IBMT is the first method to account for the dependence of gene variance on intensity levels in a completely data-dependent manner, without a need for specification of free parameters by the user, within the empirical Bayes analysis framework. Furthermore, as opposed to Cyber-T (Baldi & Long 2001) and Fox (Fox & Dimmic 2006), IBMT can properly analyze data from any experimental design setup and array platform, including multiple treatments or time series, Affymetrix chips or two-dye arrays, and experiments with both technical and biological replicates. The prior variance levels are estimated using local regression and the prior degrees of freedom are estimated using a consistent estimator based on the Empirical Bayes approach.

The IBMT method outperformed or performed as well as the simple t-statistic, fold change, SMT, and Fox in simulation studies intended to mimic real microarray data and on real microarray data itself. The improved performance of IBMT on spike-in experiments suggests that the pooling of information across genes, as well as accounting for the relationship between the variances and overall intensities of gene expression measurements, is warranted. The “spike-in” Affymetrix datasets also revealed the need to correctly estimate the prior degrees of freedom for correctly estimating the proportion of false positives. By simply accepting user input for this parameter (as in Cyber-T, and indirectly in Fox), one is at risk of either greatly overestimating or underestimating the true accumulation of false positives. For the “spike-in” experiments, this may explain the poorest estimation of the true false positive rate by Cyber-T and Fox. As our results show, all methods underestimated the proportion of false positives in these Affymetrix spike-in datasets. This may partially be due to the design of these experiments, creating correlations that would not be seen in experimental data, or even unintended real changes. However, correlations among genes and microarrays have been observed in experimental data also, and in this case, the significance statistics may be more accurately calculated using a local fdr procedure with an empirical null distribution, as proposed by Efron (Efron 2006a; Efron 2006b), rather than the Benjamini FDR (Benjamini & Hochberg 1995) as applied in this paper. Even if no correlations are expected, Efron's local fdr procedure with the theoretical Normal null may improve accuracy in estimating significance levels for any chosen analysis method.

Our method was also applied to two experimental dual-channel datasets, a simple knockout versus wildtype comparison and a time-series experiment. Analysis of these data indicated that IBMT generated the greatest number of genes involved in GO categories significantly enriched with genes determined to be differentially expressed. Although the biological pathways affected in each experiment can be ascribed with limited certainty, in the time series experiment we examined self-consistency among sampling times. Although affected pathways may change across time, it is reasonable to expect that some should be consistent for at least two or more times. Our analysis showed that IBMT had the highest self-consistency. In addition to the comparison of methods using Gene Ontology, interpretation of the results hinted that biological categories found in the MEF Ahr-/- experiment using IBMT were more consistent with functions previously ascribed to this receptor. IBMT also provided a greater percent of genes directly relevant to what is currently known of the response to Nickel exposure in mice.

We have implemented IBMT as an R function (Ihaka & Gentleman 1996) which can be downloaded along with all other supplemental material from our supporting website http://eh3.uc.edu/ibmt. The function is most easily implemented using the functionality of the limma package (Smyth 2004), but can also be used in conjunction with other linear model or mixed model analyses.


2.4. Methods

2.4.1. Mice and exposure protocol

Two dual-channel microarray experiments were performed. The first was a comparison of wildtype mouse embryo fibroblast (MEF) cells to aryl-hydrocarbon receptor (Ahr) knockout MEF cells. Four biological replicate cell cultures each of wildtype and knockout cells were compared, each with dye labelling switched for the second technical replicate of each biological pair.

The second dataset has been published (Wesselkamper et al. 2005a) and the methods are summarized here. 129S1/SvImJ strain mice (females, age 7-10 weeks) were purchased from The Jackson Laboratory (Bar Harbor, ME). All mice were housed in our animal facilities ≥ 1 week prior to exposure. Nickel aerosol was generated from 50 mM NiSO4·6H2O (Sigma, St. Louis, MO) and monitored as described previously (Wesselkamper et al. 2000). Mice were exposed to 150 ± 15 µg Ni2+/m3 in a 0.32-m3 stainless steel inhalation chamber. All experimental protocols were reviewed and approved by the Institutional Animal Care and Use Committee at the University of Cincinnati Medical Center.

Mice were exposed to aerosolized nickel for 3, 8, 24, 48, and 72 h. Following exposure, mice were killed with pentobarbital (followed by exsanguination), and the lungs were removed, placed in liquid nitrogen, and stored at -80°C. Total cellular RNA was isolated from frozen lung tissue with TRIzol (Invitrogen), and quantity was assessed by A260/A280 spectrophotometric absorbance (SmartSpec 3000, Bio-Rad, Hercules, CA). RNA quality was assessed by separation with a denaturing formaldehyde/agarose/ethidium bromide gel, and quantified by analysis with an Agilent Bioanalyzer (Quantum Analytics, Foster City, CA) (Wesselkamper et al. 2005a).

2.4.2. Microarray hybridizations

The two real-data experiments were performed using Qiagen-Operon's Mus musculus version 1.1 70-mer oligonucleotide library, representing 13,664 annotated transcripts. The first dataset is a simple comparison of wildtype mouse embryo fibroblast (MEF) cells to Ahr-/- MEF cells. A similar microarray comparison performed with mouse smooth muscle cells has previously been published (Karyala et al. 2004; Guo et al. 2004). The second dataset has been published (Wesselkamper et al. 2005a), but we summarize the methods below. RNA quality for both experiments was assessed by separation with a denaturing formaldehyde/agarose/ethidium bromide gel, and quantified by analysis with an Agilent Bioanalyzer (Quantum Analytics, Inc., Foster City, CA). To examine differential gene expression, a 70-mer oligonucleotide library representing 13,443 mouse genes (Operon Biotechnologies, Inc., Huntsville, AL) was used by the Genomic and Microarray Laboratory, Center for Environmental Genetics, University of Cincinnati (http://microarray.uc.edu/) to fabricate microarrays. The microarray hybridizations were carried out as described (Guo et al. 2004; Sartor et al. 2004). For the AHR experiment, each biological replicate consisted of one mouse cell culture, and for the Ni-treatment experiment, each exposure group consisted of nine mice. RNA from three mice was pooled for each microarray, and three separate microarrays per exposure group were compared to non-exposed controls. Both experiments were performed using 20 µg total RNA per array. Each sample of mRNA was reverse transcribed and tagged with either fluorescent Cyanine 3 (Cy3) or Cyanine 5 (Cy5) (e.g., Cy3 for control and Cy5 for 72-h exposure). Cy3 and Cy5 samples were co-hybridized with the printed 70-mers. Following hybridization, slides were washed and scanned at 635 (Cy5) and 532 (Cy3) nm (GenePix 4000B, Axon Instruments, Inc., Union City, CA).

2.4.3. Data normalization and analysis

Microarray protocols and analyses were performed as described in Karyala et al. (2004), Guo et al. (2004), Puga et al. (2004), and Sartor et al. (2004). Briefly, the microarray hybridization data consisted of raw spot intensities generated by the GenePix® Pro v5.0 software, and data normalization was performed for each microarray separately. First, channel-specific local background intensities were subtracted from the median intensity of each channel (Cy3 and Cy5). Second, background-adjusted intensities were log-transformed and the differences (R) and averages (A) of log-transformed values were calculated as $R = \log_2(X_1) - \log_2(X_2)$ and $A = [\log_2(X_1) + \log_2(X_2)]/2$, where $X_1$ and $X_2$ denote the Cy5 and Cy3 intensities after subtracting local backgrounds, respectively. Third, data centering was performed by fitting the array-specific local regression model of R as a function of A (Dudoit et al. 2002). Normalized log-intensities for the two channels were then calculated, and statistical analysis was performed for each gene separately by fitting a mixed effects linear model (Wolfinger et al. 2001). For the Ahr-/- MEF experiment the model used was:

$$Y_{ijkl} = \mu + A_i + S_j + M(S)_{kj} + C_l + \varepsilon_{ijkl},$$

where $Y_{ijkl}$ corresponds to the normalized log-intensity on the $i$th array ($i = 1, \ldots, 8$), with the $j$th treatment ($j = 1, 2$), for the $k$th mouse, and labeled with the $l$th dye ($l = 1$ for Cy5, and 2 for Cy3). $\mu$ is the overall mean log-intensity, $A_i$ is the effect of the $i$th array, $S_j$ is the effect of the $j$th treatment, $M(S)_{kj}$ is the effect of the $k$th mouse with treatment $j$, and $C_l$ is the effect of the $l$th dye. Assumptions about the model parameters were the same as described elsewhere (Wolfinger et al. 2001), with array and mouse effects assumed to be random, and treatment and dye effects assumed to be fixed. The model for the second dataset was as described above, with the exception of no mouse-within-treatment effect, and a higher number of arrays (5·3 = 15) and treatment conditions (6) (Wesselkamper et al. 2005a). Ordinary t-statistics and estimates of fold change were calculated for each gene using this model. The SMT (Smyth 2004) and IBMT significance levels were then calculated as described above.
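The log-ratio and average-intensity calculation described above can be illustrated with a minimal sketch. Python is used here purely for illustration (the analyses in this work were carried out in R), and the function names `ma_values` and `median_center` are hypothetical. The sketch skips spots whose background-subtracted intensity is not positive, and it centers by the median log-ratio as a deliberately simple stand-in for the array-specific local regression fit; both choices are assumptions of the sketch, not the procedure used in this work.

```python
import math


def ma_values(cy5, cy3):
    """Return (R, A) lists from background-subtracted Cy5 (X1) and Cy3 (X2) intensities."""
    r_vals, a_vals = [], []
    for x1, x2 in zip(cy5, cy3):
        if x1 <= 0 or x2 <= 0:
            # log2 is undefined here; dropping the spot is an assumption of this sketch.
            continue
        l1, l2 = math.log2(x1), math.log2(x2)
        r_vals.append(l1 - l2)        # R = log2(X1) - log2(X2)
        a_vals.append((l1 + l2) / 2)  # A = [log2(X1) + log2(X2)] / 2
    return r_vals, a_vals


def median_center(r_vals):
    """Subtract the median log-ratio (a simple stand-in for a local regression fit)."""
    s = sorted(r_vals)
    n = len(s)
    med = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    return [r - med for r in r_vals]


r, a = ma_values([8.0, 4.0, 0.0], [2.0, 4.0, 5.0])
centered = median_center(r)
```

A local regression of R on A would remove intensity-dependent dye bias, which a single median shift cannot do; the median is used only to keep the sketch dependency-free.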

CHAPTER 3: Systematic comparisons reveal that logistic regression provides a simple yet powerful approach to identifying enriched biological groups in gene expression data

3.1. Overview

In this chapter we introduce and test a new logistic regression-based method, LRpath, that uses a significance measure of differential expression on the x-axis and a binary variable indicating category membership on the y-axis (Table 1.1). LRpath functionally relates gene set membership to the statistical significance of genes' differential expression. Logistic regression is a natural extension of the chi-square test, allowing the significance values to remain on a continuous scale. In this chapter, we explain how the logistic regression approach is used to model enriched biological categories and discuss the interpretation of the statistical output. We then compare the sensitivity and specificity of LRpath and several other relevant methods in identifying enriched Gene Ontologies using simulated and real-world microarray data.

Comparisons of such methods are relatively lacking in the field, with one exception being Curtis et al. (2005), where a descriptive comparison of methods is provided using two real microarray datasets. Our simulation study and real-world data analyses were structured so that the true hierarchical GO structure is preserved, thus retaining the natural correlations among categories and current (thus imperfect) gene assignments. The results from real-world data reinforce the simulation findings by additionally preserving the natural correlations among gene expression profiles and imperfect data distributions. Using two independent datasets examining the same biological phenomenon, we show that our method provides reproducible results, and does so more consistently than any other method tested. The use of our method is further demonstrated by analyzing a previously published microarray experiment comparing healthy subjects to those with idiopathic pulmonary fibrosis (IPF) (Pardo et al. 2005). Results from two additional datasets are available in Appendix A.2.

3.2. Results and discussion

3.2.1. LRpath model

Our aim was to develop a method that 1) does not require the choice of a significance cutoff and 2) has the flexibility of allowing the investigator to choose different methods for detecting differentially expressed genes. The method we developed, LRpath, uses logistic regression to model the relationship between gene set membership and differential expression; it does so in terms of odds ratios, a concept already familiar to many investigators in the field. The basic question asked by LRpath is, “Do the odds of a gene belonging to a predefined gene set increase as the significance of differential expression increases?” Similarly to the familiar Fisher's exact and chi-square tests, LRpath tests, separately for each gene set, whether the odds ratio of gene set membership for significant vs. non-significant gene transcripts is greater than 1. However, unlike Fisher's exact and chi-square tests, the logistic regression model allows the data resulting from tests of differential expression to remain on a continuous scale. This removes the need to choose a significance cutoff, and has the advantage of taking into account the distribution of significance levels for genes not belonging to, as well as belonging to, the gene set of interest. As an example, if oxidative stress is affected in the experiment, we would expect to see that genes with significant p-values are more likely to be involved in oxidative stress than genes with less significant p-values, although we may not know how or want to predefine what is “more” and “less” significant.
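For contrast with the continuous-scale approach, the cutoff-based alternative can be sketched as a one-sided Fisher's exact test computed from the hypergeometric distribution. This is an illustrative Python sketch only (the analyses in this work were carried out in R); the function names and the toy p-values are hypothetical, and the one-sided formulation is an assumption of the sketch.

```python
from math import comb


def fisher_enrichment_p(n_genes, n_in_set, n_deg, n_deg_in_set):
    """One-sided Fisher's exact p-value: P(X >= n_deg_in_set) under the hypergeometric null."""
    upper = min(n_in_set, n_deg)
    total = comb(n_genes, n_deg)
    tail = sum(
        comb(n_in_set, k) * comb(n_genes - n_in_set, n_deg - k)
        for k in range(n_deg_in_set, upper + 1)
    )
    return tail / total


def enrichment_at_cutoff(pvalues, in_set, cutoff):
    """Dichotomize genes at `cutoff`, then test the resulting 2x2 table."""
    is_deg = [p < cutoff for p in pvalues]
    n_deg = sum(is_deg)
    n_deg_in_set = sum(1 for d, s in zip(is_deg, in_set) if d and s)
    return fisher_enrichment_p(len(pvalues), sum(in_set), n_deg, n_deg_in_set)


# Toy data (hypothetical): ten genes, five of which belong to the gene set.
pvals = [0.0005, 0.004, 0.02, 0.03, 0.2, 0.5, 0.6, 0.7, 0.8, 0.9]
member = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
p_relaxed = enrichment_at_cutoff(pvals, member, 0.05)
p_strict = enrichment_at_cutoff(pvals, member, 0.001)
```

With the same toy data, the two cutoffs give very different enrichment p-values, which is the dependence on an arbitrary threshold that the logistic regression formulation is intended to avoid.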

To describe our model, we begin with some notation: let x be the significance measures of differential expression for all genes and let y be a binary variable indicating whether each gene belongs to a gene set of interest, both of which are assumed known. If p is the proportion of genes belonging to the gene set (y=1) at a specified x value, then p/(1-p) is the corresponding odds, and log(p/(1-p)), the logit function, is the log odds. Logistic regression is used to model the proportion of genes assigned to the predefined category as a smooth function of the significance level. Specifically, we model:

$$\log\left(\frac{p}{1-p}\right) = \alpha + \beta x,$$

where $\alpha$ is the intercept, $\beta$ is the slope, and both $\alpha$ and $\beta$ are estimated in the model. The logit model has the attractive property that the slope parameter, $\beta$, corresponds to the log odds ratio for each unit change in x. The ratio of odds (odds ratio) for a significant vs. non-significant level of differential expression can easily be calculated along with whether the odds ratio is significantly greater than one.
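The odds-ratio interpretation of the slope follows in one step from the logit model. Writing the intercept and slope as $\alpha$ and $\beta$, and letting $x_1$ and $x_2$ be any two significance levels (symbols introduced here only for this worked step), the odds at a level $x$ are $\exp(\alpha + \beta x)$, so that

$$\frac{p(x_1)/\bigl(1-p(x_1)\bigr)}{p(x_2)/\bigl(1-p(x_2)\bigr)} = \frac{\exp(\alpha + \beta x_1)}{\exp(\alpha + \beta x_2)} = \exp\bigl(\beta\,(x_1 - x_2)\bigr).$$

The intercept cancels, so the odds ratio depends only on the slope and on the distance between the two significance levels; setting $x_1 - x_2 = 1$ recovers the statement that $\beta$ is the log odds ratio per unit change in x, and $\beta = 0$ gives an odds ratio of one for every pair of levels.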

Our logistic regression method proceeds as follows. Initially, the user chooses minimum and maximum category sizes to test, since very small categories have little power to be detected as enriched, and the largest categories are often too vague to be of biological interest. Gene sets to which fewer than 10 analyzed genes belong should be interpreted with caution, due to the reduced reliability of logistic regression based results for small sample sizes. For each category c, y is defined as 1 for genes in c, and 0 for all other genes. The significance statistics, x, are defined as -log(p-values). This common transformation of the p-values has the desirable property of “stretching out” the significant range of p-values and condensing the non-significant range. Logistic regression using the logit link function is then performed, and the p-value for a non-zero slope (odds ratio greater than one) is calculated. When the slope $\beta = 0$, the odds ratio for a significant vs. non-significant level is one, and the proportion of genes belonging to the category of interest does not depend on the significance level x of the gene. When $\beta > 0$, the odds ratio is greater than one, and we conclude that the category of interest is “enriched” with differentially expressed genes. The p-values for enrichment are then adjusted for multiple testing using FDR (false discovery rate) (Benjamini & Hochberg 1995), although Bonferroni or another alternative method may also be used. Biological categories can be sorted by either adjusted p-value or odds ratios. Included in the output is the list of Gene IDs in each category that have a significance level meeting a chosen threshold. This enables the investigator to easily determine which genes were responsible for the significant enrichment of the category.
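The per-category test described above can be sketched as a single-predictor logistic regression fitted by Newton-Raphson. This Python sketch is illustrative only and is not the LRpath implementation (which was written in R); the function names are hypothetical, the natural logarithm is assumed for the -log(p-value) transformation, and a two-sided Wald test is assumed for the slope, since the text does not specify which test is used.

```python
import math


def _score_and_info(a, b, x, y):
    """Score vector (g0, g1) and information matrix entries (h00, h01, h11) at (a, b)."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
        w = p * (1.0 - p)
        g0 += yi - p
        g1 += (yi - p) * xi
        h00 += w
        h01 += w * xi
        h11 += w * xi * xi
    return g0, g1, h00, h01, h11


def fit_logit(x, y, max_iter=50, tol=1e-10):
    """Fit logit(p) = a + b*x by Newton-Raphson; return (a, b, standard error of b)."""
    a = b = 0.0
    for _ in range(max_iter):
        g0, g1, h00, h01, h11 = _score_and_info(a, b, x, y)
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det  # first row of inverse(information) * score
        db = (h00 * g1 - h01 * g0) / det  # second row
        a, b = a + da, b + db
        if abs(da) + abs(db) < tol:
            break
    _, _, h00, h01, h11 = _score_and_info(a, b, x, y)
    se_b = math.sqrt(h00 / (h00 * h11 - h01 * h01))
    return a, b, se_b


def category_enrichment(pvalues, in_category):
    """Regress membership (0/1) on x = -log(p); assumes every p-value is > 0."""
    x = [-math.log(p) for p in pvalues]
    a, b, se_b = fit_logit(x, in_category)
    z = b / se_b
    p_wald = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided Wald p-value (assumption)
    return {"intercept": a, "slope": b, "se": se_b, "z": z, "p_value": p_wald}
```

The sketch does not guard against perfect separation (a category whose members are exactly the most significant genes), in which case the slope estimate diverges; a production implementation would need to handle that case.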

When multiple related comparisons are of interest (e.g. a time course or multiple treatments versus a control) then $\beta$ may be modeled as a vector $(\beta_1, \ldots, \beta_n)$ where each element is the slope at one time point or one treatment. In this case, for each gene set one could test the null hypothesis $\beta_j - \beta_i = 0$. That is, a combined logistic regression may be performed to identify whether a category is significantly more affected by one treatment than another. Alternatively, one could test for each gene set the null hypothesis $\beta_1 = \beta_2 = \cdots = \beta_n = 0$ (odds ratios = 1 for all time points or dose levels) to determine which categories are affected by any dose level or at any time in the experiment. This type of analysis is illustrated in Application 2 in Appendix A.2.

3.2.2. Simulation study

In our simulations we compare the performance of several methods by imitating six- and ten-slide, single-channel microarray experiments with three (five) treated samples and three (five) controls. The simulations were not designed specifically to fit the assumed properties of LRpath (see Section 3.4.1: Materials and Methods for details), and were performed to compare the performance of six methods (Table 1.3 in Chapter 1): LRpath, Fisher's exact, GSEA, BayGO, globaltest, and sigPathway. We performed Fisher's exact test using the following p-value cutoffs for DEGs: 0.001, 0.01, and 0.05, and for BayGO we used a 0.01 cutoff. For sigPathway (NTK and NEK hypotheses) (Tian et al. 2005), we use a ranking procedure that combines the two hypotheses, but separate significance scores, because combined significance scores are not available. All methods were performed using an R package when available, or R code downloaded from the original publication's authors otherwise. Because GSEA is the only method tested to treat increased and decreased transcript levels separately, we made a slight modification so that the absolute value of the signal-to-noise ratio is used. The simulated data were first analyzed for detection of differentially expressed genes (DEGs) using a simple t-test.

The following variables were assessed in the simulations:

1. Number of differentially expressed genes: (500, 1000)
2. Distribution of true fold changes: Normal(0, 2²) and Uniform(-2, 2)
3. Number of enriched categories: (2, 5)
4. Level of enrichment, L: (25%, 50%, 75%, 90%)
5. Number of array replicates: (3, 5), corresponding to 4 and 8 sample degrees of freedom.

Gene expression values of DEGs were assigned to human Entrez Gene IDs so that the desired enrichment level of chosen categories was obtained, with the remaining gene expression values assigned to randomly chosen, unique human Entrez Gene IDs. The Entrez Gene IDs were then mapped to all of their assigned GO terms, rather than simply to the most specific or any single level of the ontologies. This allowed the simulations to preserve the actual correlations and gene distributions existing in the structure of the GO database (further details are in Section 3.4: Materials and Methods). For each variable set tested, we performed the simulation 20 times and then examined across simulation runs (1) the median p-value of non-enriched GO terms (to determine bias), (2) the calculated FDR values from enriched GO terms, and (3) the ranking of enriched GO terms (ranked by significance). We compared the ability of the methods to identify enriched GO terms by comparing the adjusted ranking of enriched categories (Figure 3.1) and the percent of enriched categories having an FDR < 0.10 (Figure 3.2) (Benjamini & Hochberg 1995).
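The FDR-based summary used in the simulations can be illustrated with a short sketch of the Benjamini-Hochberg step-up adjustment, followed by the percent of truly enriched categories falling below a threshold. The Python code is illustrative only (the analyses in this work were carried out in R), and the function names are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p-value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def percent_detected(fdr_values, is_enriched, threshold=0.10):
    """Percent of truly enriched categories whose FDR falls below `threshold`."""
    enriched = [q for q, e in zip(fdr_values, is_enriched) if e]
    return 100.0 * sum(q < threshold for q in enriched) / len(enriched)


q = bh_adjust([0.01, 0.04, 0.03, 0.005])
```

In the simulation setting, `is_enriched` would mark the categories that were constructed to be enriched, so that `percent_detected` corresponds to the quantity plotted in Figure 3.2 for a single simulation run.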

Figure 3.1 LRpath simulation results: Ability to rank enriched GO terms. Log10-rankings of enriched GO terms were calculated to compare the ability of methods to correctly rank these categories at the top of the list. Thus, lower ranking scores are better. Methods are LRpath (LR), Fisher's exact (FE) with the following 3 criteria for detecting DEGs (p<0.001, p<0.01, and p<0.05), globaltest, BayGO and sigPathway (sigPath). Initial 4 parameter sets used 2 enriched categories, 90, 75, 50, and 25% enrichment with DEGs respectively, 500 total DEGs, normally distributed fold changes, and 3 replicates for treated and control groups. Subsequent groups of parameter sets were varied as indicated on the graph. Data shown are averages from 20 simulation runs for each parameter set.

Most obviously, the performance of all methods depends on the level of enrichment, varied between 25% and 90%. Overall, LRpath performed best in ranking the actual enriched GO terms as most significant (Figure 3.1). The performance of Fisher's exact test varied greatly depending on which p-value cutoff was used to identify differentially expressed genes, as previously seen (Pan et al. 2005). For most parameter sets, using the more relaxed cutoff of p<0.05 performed best, while the stricter cutoff of p<0.001 performed much worse. This indicates that for Fisher's exact test, a more relaxed cutoff may be more appropriate for discovering key biological categories than is normally considered justified.

For experiments with more statistical power, as illustrated by our simulations of a 10-slide experiment, using a stricter p-value cutoff for DEG detection may offer better performance. Indeed, using p<0.01 performed better than p<0.05 in ranking GO terms for two of the four parameter sets in the 10-slide experiment, and the performance of the p<0.001 cutoff increased substantially compared to its performance in the smaller simulated experiment. As expected, simulating higher actual fold changes for DEGs resulted in overall better performance of all methods. Simulating twice as many DEGs resulted in overall poorer performance, presumably because there were more DEGs not belonging to an enriched category, thus adding noise. Increasing the number of enriched categories from 2 to 5 had little effect on the methods' overall performance. Of the other methods tested, BayGO, GSEA, and Fisher's exact with the p<0.05 cutoff offered the next best performance, depending on parameter values. In finding the enriched categories to be significant, LRpath consistently outperformed the other methods (Figure 3.2).

The next best performance varied among simulation sets: FE was 2nd best 10 times, BayGO was 2nd best 6 times, and GSEA 4 times. Whereas using a p-value cutoff of 0.05 for DEG detection resulted in the optimal Fisher's exact ranking of categories (for all 6-slide experiment parameter sets), using a stricter p-value cutoff often performs best in terms of the number found significant. Thus, these findings add even more complexity in determining an optimal p-value cutoff for DEGs when implementing Fisher's exact test. Figure 3.2 also demonstrates the large range in the power of the tests in relation to the level of enrichment. For 25% enrichment of a category, no more than 35% of enriched categories were ever identified as significant. For the 90% enrichment level the range of power varied more across methods, but LRpath detected at least 70% of the enriched GO terms on average for every parameter set. The results for globaltest indicate that this method may benefit from the development of an alternative p-value adjustment method, as suggested by its relative performance in ranking the enriched GO terms being dramatically better than its performance in assessing the statistical significance of the enrichment.

Figure 3.2 LRpath simulation results: Ability to detect enriched GO categories as significant. Percent of enriched GO terms having an FDR (False Discovery Rate) less than 0.10. Methods are LRpath (LR), Fisher's exact (FE) with the following 3 criteria for detecting DEGs (p<0.001, p<0.01, and p<0.05), globaltest, BayGO and sigPathway's Tk and Ek hypotheses (sigT and sigE). Initial 4 parameter sets used 2 enriched categories, 90, 75, 50, and 25% enrichment with DEGs respectively, 500 total DEGs, normally distributed fold changes, and 3 replicates for treated and control groups. Subsequent groups of parameter sets were varied as indicated on the graph. Data shown are averages from 20 simulation runs for each parameter set.

To identify biases in the methods, the median p-value for non-enriched GO terms was determined, which should equal 0.50 when no bias exists. As seen in Figure 3.3, LRpath, GSEA, and the NTK test of sigPathway were free of bias overall. The bias in p-values of Fisher's exact test greatly depended on the p-value cutoff used for differentially expressed genes, with p-value < 0.001 exhibiting the greatest bias. It also depended to a lesser extent on the variables used in simulations. The methods globaltest, BayGO, and the NEK test of sigPathway also were biased toward lower p-values by similar amounts.

Figure 3.3 LRpath simulation results: Comparison of bias in p-values. Median p-values of non-enriched GO terms were calculated as a measure of bias in the p-value distributions for each method. When no bias exists, the median p-value is 0.50. Error bars indicate +/- the standard deviations. LRpath, GSEA, and the Tk hypothesis of sigPathway (sigT) do not show bias in this respect, while the p-values for Fisher's exact (FE), BayGO, globaltest (global), and the Ek hypothesis of sigPathway (sigE) are biased towards smaller values.

3.2.3. Comparisons with real-world breast cancer datasets

Although simulations point to the potential advantage of LRpath and provide a detailed comparison of the methods' behavior, only with experimental data, which have the natural correlations among genes and real-world distributions of data points, can these findings be validated. This first experimental dataset was used to compare the performance of the methods (LRpath, Fisher's exact, GSEA, globaltest, sigPathway and BayGO) through subsetting. The study (Sotiriou et al. 2006) consists of human breast carcinoma samples identified as histologic tumor grades 1, 2, or 3 using the Elston-Ellis grading system, and as estrogen receptor (ER) positive or negative. For this analysis, we used samples from non-treated patients (i.e. from the first batch of arrays performed) with histologic grades 1 or 3. Methods were compared by examining the specificity and sensitivity using various sized subsets of the samples. Preprocessed data were downloaded from the NCBI Gene Expression Omnibus (GEO) public repository (GEO accession GSE2990), and in order to have a direct comparison with the other methods, a simple t-test was used to test for DEGs between grade 1 and 3 tumors for input into LRpath, BayGO, and Fisher's exact.

3.2.3.1. Subset analyses

To assess method performance, we tested for enriched GO terms between the 29 grade 1 samples and 12 grade 3 samples, all non-treated with positive ER status. We first created a gold-standard set of enriched GO terms, as well as a set of non-enriched GO terms (see Section 3.4.3: Materials and Methods for details). The gold-standard set included ten GO terms identified by at least four of the six methods, namely, cell division, M phase, mitosis, M phase of mitotic cell cycle, spindle organization and biogenesis, regulation of mitosis, condensed chromosome, mitotic checkpoint, cell cycle checkpoint, and regulation of progression through cell cycle. We then randomly chose subsets of the samples consisting of 3, 4, or 6 replicates, tested for enriched GO terms, and repeated this procedure 20 times to construct reliable Receiver Operating Characteristic (ROC) curves for the methods. We considered only situations in which the true positives outnumbered the false positives to be relevant. The pertinent part of the ROC-like curves, where the false positives do not far outnumber the true positives, is shown in Figure 3.4a, b, and c for 3, 4, and 6 biological replicates respectively. The full ROC curves are available as Figure 7.4 in Appendix A.2.

LRpath proved to be the statistically most powerful method in all cases, with BayGO showing the next best overall performance. This was determined by comparing the height of the ROC curve in the portion of the curve containing fewer than 20 false positives. As expected, results were better for all methods with higher numbers of samples used. The overall performance of globaltest and GSEA was similar according to the ROC-like curves. Interestingly, the overall order of performance level is similar to that found in the simulation study (compare with Figure 3.1), with the best to worst being LRpath, BayGO, GSEA and globaltest, FE with a p<0.01 cutoff, FE with a p<0.001 cutoff, and sigPathway at the point of 20 false positives in Figure 3.4.
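The construction of a ROC-like curve from a ranked list of GO terms and the two gold-standard sets can be sketched as follows. This is an illustrative Python sketch with hypothetical function names and toy identifiers; ignoring terms that fall in neither gold-standard set is an assumption of the sketch, not a documented detail of the analysis.

```python
def roc_like_counts(ranked_terms, gold_positive, gold_negative):
    """Cumulative (false positives, true positives) while walking down a ranked list."""
    tp = fp = 0
    curve = [(0, 0)]
    for term in ranked_terms:
        if term in gold_positive:
            tp += 1
        elif term in gold_negative:
            fp += 1
        else:
            continue  # term in neither gold-standard set: skipped (assumption)
        curve.append((fp, tp))
    return curve


def tp_at_fp(curve, max_fp):
    """Height of the curve (true positives) once at most `max_fp` false positives are allowed."""
    return max(tp for fp, tp in curve if fp <= max_fp)


# Toy ranking (hypothetical identifiers), most significant term first.
ranking = ["a", "x", "b", "n1", "c", "n2"]
curve = roc_like_counts(ranking, {"a", "b", "c"}, {"n1", "n2"})
```

Comparing `tp_at_fp(curve, 20)` across methods mirrors the comparison of curve heights in the region with fewer than about 20 false positives described above.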

Figure 3.4 ROC-like curves for Breast Cancer dataset. Samples for grade 3 vs. grade 1 tumors, all with positive ER status, were randomly selected to create 60 subsets (20 subsets each containing 3, 4, or 6 biological replicates of grade 3 and grade 1). Subsets were then analyzed for DEGs and tested for enriched GO terms using the following methods: LRpath, Fisher's exact (FE) with cutoffs of 0.01 and 0.001 for DEGs, BayGO with 0.01 cutoff, GSEA, globaltest, and sigPathway (sigPath). GO terms were then ranked and the true and false positives determined by comparing to the gold standard sets. Gray line indicates point at which there are an equal number of true and false positives. (A) 3 replicates, (B) 4 replicates, and (C) 6 replicates.

3.2.3.2. Concordance analysis

A frequently recurring concern with microarray data is its reproducibility and generalizability. A high-quality gene set enrichment method will necessarily demonstrate high consistency among independent experiments studying the same biological phenomenon, despite differences in technical factors, quality, patients, etc. On the other hand, a method that overfits the data may result in a large number of biological groups that appear to be significant, but these results will not be reproducible across independent studies. To this end, we examined the consistency of each method's results between two independent datasets: the breast cancer dataset above, and the independent samples with positive ER status from another primary breast tumor study, where each sample was also identified as histologic tumor grade 1, 2, or 3 using the Elston-Ellis grading system (Miller et al. 2005). The analysis of this section has the benefits of being both inherently unbiased by design and containing real data distributions. We separately analyzed the independent histologic grade 1 and 3 samples from this dataset for differentially expressed genes, and separately tested for enriched GO terms between the 39 grade 1 samples and 28 grade 3 samples. For each method, concordance was measured in two ways: (1) the degree of correlation in significance of GO terms between the two datasets (Figure 3.5a), and (2) the number of overlapping GO terms between the two datasets among top ranked lists of increasing length (Figure 3.5b).

The results shown in Figure 3.5 indicate that LRpath has the greatest consistency across datasets. Consistent with the other analyses, Fisher's exact test using a p<0.01 criterion for DEG detection resulted in greater concordance between datasets than using p<0.001. sigPathway demonstrated a marked improvement relative to the other methods in this analysis, indicating that sigPathway's performance may strongly depend on having large sample sizes. Examining the number of overlapping GO terms in the top ranked lists of each method, we see LRpath performing best, Fisher's exact with the p<0.001 criterion for DEGs performing worst, and the other methods' performances indistinguishable from each other. All methods except Fisher's exact with p<0.001 performed markedly better than would be expected by chance, indicating a true signal in answering the question as to what gene sets are enriched between histologic grade 3 vs. grade 1 primary breast cancer tumors. We list the overlapping GO terms among the top 50 ranked for LRpath in Supplemental Table 3 (Additional File 2). Notably, this list of 28 GO terms included all ten of the GO terms identified by at least four methods in the previous section, and consistent with the findings of Sotiriou et al. (2006), were mainly related to cell cycle progression.
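The two concordance measures can be illustrated with a short sketch: the Pearson correlation of -log(p-values) across the GO terms tested in both datasets, and the overlap between the top-k ranked lists. The Python code is illustrative only; the function names and toy p-values are hypothetical, and restricting the correlation to terms present in both datasets, as well as the use of the natural logarithm, are assumptions of the sketch.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def significance_correlation(p_a, p_b):
    """Correlate -log(p) over terms present in both {term: p-value} dictionaries."""
    shared = sorted(set(p_a) & set(p_b))
    return pearson([-math.log(p_a[t]) for t in shared],
                   [-math.log(p_b[t]) for t in shared])


def top_k_overlap(p_a, p_b, k):
    """Number of terms appearing among the k most significant terms of both datasets."""
    top_a = set(sorted(p_a, key=p_a.get)[:k])
    top_b = set(sorted(p_b, key=p_b.get)[:k])
    return len(top_a & top_b)


# Toy enrichment p-values for two datasets (hypothetical).
data_a = {"t1": 0.001, "t2": 0.01, "t3": 0.5, "t4": 0.9}
data_b = {"t1": 0.002, "t2": 0.2, "t3": 0.03, "t4": 0.8}
overlaps = [top_k_overlap(data_a, data_b, k) for k in (1, 2, 4)]
```

Evaluating `top_k_overlap` for increasing k yields the kind of overlap-by-rank curve summarized in Figure 3.5b.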

Figure 3.5 Concordance of methods between two independent Breast Cancer datasets. Reproducibility of the methods (LRpath, Fisher's exact (FE) with cutoffs of 0.01 and 0.001 for DEGs, BayGO, GSEA, globaltest, and sigPathway) was tested by measuring the consistency of gene set enrichment results across two datasets, both comparing grade 3 to grade 1 tumors. (A) Correlation between datasets for each method. As a measure of significance, the -log(p-values) of GO term enrichment were calculated for each method and for each dataset separately, and the Pearson correlation coefficients between datasets were calculated. (B) Overlapping enriched GO terms by rank. Ranked lists of GO terms were generated for each method and each dataset separately. The number of overlapping GO terms was then calculated between datasets for each method for increasing length of ranked lists.

3.2.4. Application: Results from a human IPF dataset

Here we further investigate LRpath's ability to provide informative biological results. We summarize the results of testing the human idiopathic pulmonary fibrosis study (Pardo et al. 2005), performed with the Amersham Biosciences CodeLink Uniset Human I Bioarray, for enriched GO terms and KEGG pathways. Normalized data were downloaded from GEO (GEO accession GDS1252) and the 11 normal and 13 IPF lung tissue samples were analyzed for differentially expressed genes using a hierarchical Bayes moderated t-test, IBMT (see Chapter 2) (Sartor et al. 2006), and then tested for enriched gene sets using LRpath and Fisher's exact. Table 3.1 lists the 6 KEGG pathways that were significant (FDR < 0.05) and illustrates the output of LRpath.

Table 3.1 Significant KEGG pathways and their properties from Application: Human IPF dataset. KEGG pathways having FDR < 0.05 when tested with LRpath are shown for the human IPF data. Odds ratios were calculated based on the difference between a p-value of 0.50 and a p-value of 0.001, and provide a measure of the strength of enrichment; # genes indicates how many analyzed genes belong to each enriched category; and the Entrez Gene ID lists provide the genes most responsible for the enrichment. Actual output from LRpath has additional columns for slope coefficient, p-value, and Ontology for GO terms.

| KEGG ID | Pathway name | # genes | Odds Ratio | LRpath FDR | DEG Entrez IDs (p<0.05) |
|---|---|---|---|---|---|
| hsa04610 | Complement and coagulation cascades | 54 | 3.34 | 0.0002 | 624, 629, 715, 716, 717, 720, 725, 730, 1361, 1380, 2152, 2155, 2157, 3075, 3426, 5054, 5104, 5328, 5624, 7056 |
| hsa04060 | Cytokine-cytokine receptor interaction | 133 | 2.41 | 0.0004 | 268, 608, 650, 1440, 1441, 1896, 2920, 3082, 3552, 3559, 3586, 3593, 3601, 3791, 3977, 4233, 5155, 6351, 6363, 6372, 6376, 6387, 7043, 7422, 7424, 8742, 8807, 9180, 9547, 10563, 23765, 51330, 53832, 56034, 58191 |
| hsa04510 | Focal adhesion | 137 | 2.31 | 0.0008 | 87, 88, 857, 858, 859, 894, 896, 1311, 1499, 2534, 3082, 3479, 3675, 3678, 3679, 3694, 3791, 3909, 3918, 4233, 5155, 5228, 6696, 7058, 7422, 7424, 9855, 10319, 56034 |
| hsa04110 | Cell cycle | 71 | 2.54 | 0.0082 | 890, 891, 894, 896, 983, 1031, 1032, 1111, 1387, 4085, 4087, 4173, 4175, 5111, 5591, 7043, 10912, 11200, 85417 |
| hsa04350 | TGF-beta signaling pathway | 50 | 2.69 | 0.0168 | 94, 268, 650, 652, 1311, 1387, 4052, 4086, 4087, 4091, 7043, 7058, 57154 |
| hsa01430 | Cell Communication | 53 | 2.50 | 0.0395 | 1311, 1825, 1829, 3853, 3909, 3918, 6696, 7058, 10319 |
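The odds ratios reported in Table 3.1 compare a p-value of 0.001 with a p-value of 0.50, which can be sketched as a simple function of the fitted slope. The Python code below is illustrative only; the function names are hypothetical, and the use of the natural logarithm in the -log(p-value) transformation is an assumption, so the slope values it produces should not be read as the coefficients actually reported by LRpath.

```python
import math


def odds_ratio(slope, p_sig=0.001, p_nonsig=0.5):
    """Odds ratio of category membership between two p-value levels, given the logit slope."""
    dx = (-math.log(p_sig)) - (-math.log(p_nonsig))  # distance on the x = -log(p) scale
    return math.exp(slope * dx)


def slope_from_odds_ratio(or_value, p_sig=0.001, p_nonsig=0.5):
    """Inverse of odds_ratio: the slope implied by a reported odds ratio."""
    return math.log(or_value) / (math.log(p_nonsig) - math.log(p_sig))


# Round trip for the first row of Table 3.1 (odds ratio 3.34).
implied_slope = slope_from_odds_ratio(3.34)
recovered = odds_ratio(implied_slope)
```

Because the odds ratio depends on the two reference p-values as well as the slope, odds ratios are comparable across categories only when the same pair of reference levels is used, as in Table 3.1.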

Consistent with what has been reported in human IPF, the significant KEGG pathways included changes in “Complement and coagulation cascades” (Zorzetto et al. 2003; Kubo et al. 2005), “Focal adhesion” (Pardo et al. 2005), “TGF-beta signaling pathway” (Kaminski et al. 2000; Bonniaud et al. 2005) and “Cell communication” (Kaminski & Rosas 2006; Selman et al. 2006). One previously unrecognized KEGG pathway identified by LRpath was “Cell cycle”. Transcripts in this pathway overlapped TGF-beta signaling transcripts that included increased activator SMAD family members (SMAD1; SMAD2; SMAD4; SMAD5) and decreased inhibitor SMAD6. In addition to this expected finding, other increased transcripts included cyclin A2 (CCNA2) and cell division cycle 2, G1 to S and G2 to M (CDC2), which promote both cell cycle G1/S and G2/M transitions and have been associated with fibroblast proliferation (Qi et al. 2007) and endothelial cell apoptosis (Wang et al. 2003). Another decreased transcript was growth arrest and DNA-damage-inducible, gamma (GADD45G), which causes cell cycle arrest at G2/M transition. This decrease would complement the cell cycle progression (Sun et al. 2003).

Interestingly, the changes noted in many members of the “Cytokine-cytokine receptor interaction” pathway [including decreased colony stimulating factor 3 (granulocyte) (CSF3), CSF3R, and chemokine (C-X-C motif) ligand 2 (CXCL2), along with increased interleukin 10 (IL10)] would be consistent with a decrease in granulocyte inflammation, which is contrary to expectations generated with animal models of pulmonary fibrosis (Shen et al. 1988; Chaudhary et al. 2006; Kang et al. 2007), but consistent with the minimal effects of corticosteroids (Collard et al. 2004).

In addition, a total of 112 GO terms containing fewer than 500 analyzed genes (Supplemental Table 4- Additional File 3) were identified as being significantly enriched (FDR < 0.05) with differentially expressed genes. LRpath used with the simple t-test resulted in slightly fewer (106) enriched GO terms by the same criteria, with 85 of them also identified by LRpath + IBMT. Using LRpath as a discriminator, we analyzed the organizational relationship of a subset of biological processes that contained over-representation of transcripts that are altered in human IPF and generated a Directed Acyclic Graph (DAG) of GO categories using GO Tree (Figure 3.6a). Enriched GO terms identified at the lowest level of the DAG include “Blood vessel morphogenesis”, “Bone remodeling” and “Regulation of ossification”. Because “Blood vessel morphogenesis” was a novel finding, we examined the members of this group that led to its identification by LRpath. Interestingly, of the 14 transcripts within this group, 10 were decreased. This included decreased vascular endothelial growth factors/receptor, and transcripts supportive of blood vessel growth. In addition, this was accompanied by an increased anti-angiogenic transcript: collagen, type XV, alpha 1 (Figure 3.6b). Overall, this is consistent with increased blood vessel remodeling and pulmonary hypertension (Nathan et al. 2007).

3.3. Conclusions

Testing for enriched biological categories, such as GO terms, has become a routine part of microarray analysis, and provides investigators with greater biological insight than significant gene lists alone. Here, we introduced a simple yet powerful new method using logistic regression and odds ratios, LRpath, for testing enrichment of biological categories. We also performed an in-depth comparison of our method with several relevant previously proposed methods using both simulated and experimental data. In the simulation study we know the exact truth about enriched GO terms, but the simulated microarray data lack the correlation structure found in real-world data and may have unrealistic distributional properties. On the other hand, the use of subsets of breast cancer microarray data suffers from a potentially biased gold standard, but addresses the other issues appropriately. Finally, the comparison of independent real-world datasets is both inherently free of bias and contains real correlations and distributions, and addresses the question of which methods provide the most reproducible results. Our comparisons provide useful information to investigators seeking to choose the appropriate method for identifying affected gene sets in a microarray experiment. Overall results were concordant between the simulation study and the breast cancer data analyses. The observed concordance of the results in these three different analyses offers strong evidence that our conclusions are valid and reproducible. We acknowledge that the benefit of testing for enriched biological categories is not only dependent on the testing method, but also on the gene groupings employed. Thus, one may gain greater biological insight using GSEA with one of its gene set databases than with LRpath using KEGG or GO. However, comparing the usefulness of gene group databases was not the goal of this thesis.

Figure 3.6 LRpath assessment of Gene Ontology (GO) terms over-represented in human idiopathic pulmonary fibrosis (IPF). (A) Relational structure of biological processes enriched with transcripts altered in human IPF. Directed Acyclic Graph (DAG) of GO terms generated by GO Tree using transcripts in significant categories identified by LRpath. Green Box: highly significant (FDR < 0.0001) or Blue Box: significant (FDR < 0.001) enriched categories arise from Open Box: non-enriched parent or related categories (FDR > 0.05). Enriched GO terms identified lead to “blood vessel morphogenesis”, “bone remodeling” and “regulation of ossification”. (B) Representative transcripts contained in blood vessel morphogenesis GO term and altered in human IPF. Decreased vascular endothelial growth factors/receptor (VEGFA, VEGFC, and KDR) and transcripts supportive of blood vessel growth (COL4A3, EPAS1, ROBO4, and NOTCH4) were accompanied by an increased anti-angiogenic transcript (COL15A1), which is consistent with blood vessel remodeling and pulmonary hypertension. Values are means + SEM. Abbreviations: VEGFA: vascular endothelial growth factor A; VEGFC: vascular endothelial growth factor C; KDR: kinase insert domain receptor (a type III receptor tyrosine kinase) (a.k.a. vascular endothelial growth factor receptor 2); COL4A3: collagen, type IV, alpha 3 (Goodpasture antigen); EPAS1: endothelial PAS domain protein 1 (a.k.a. hypoxia-inducible factor (HIF)-2 alpha); ROBO4: roundabout homolog 4, magic roundabout (Drosophila); NOTCH4: Notch homolog 4 (Drosophila); COL15A1: collagen, type XV, alpha 1.

In simulations, we compared the performance of six methods (LRpath, Fisher’s exact test, GSEA, globaltest, BayGO, and sigPathway). Methods were compared based on correct ranking of enriched categories, detecting enriched categories as statistically significant, and bias-free p-value distributions of non-enriched groups. Our results indicate that, as expected, the power to detect enriched GO terms depends greatly on the level of enrichment, and to a lesser extent on several other parameters tested. For Fisher’s exact test, we conclude that both the test used to detect DEGs (see Section 5.1) and the significance cutoff used to define DEGs affect the results of gene enrichment testing. We find that in general, under the assumption of independent microarray samples, a more relaxed cutoff such as p<0.05 or 0.01 yields better results than stricter cutoffs such as p<0.001. Although globaltest performed nearly as well as GSEA for ranking enriched GO terms, GSEA had far better performance in detecting GO terms as statistically significant and also lacked the bias in p-values of globaltest. However, globaltest has the advantage of handling more complex experimental designs than the other methods and thus may still be the preferred method in some situations. Overall, LRpath outperformed all other methods based both on ranking enriched GO terms and detecting them as statistically significant.

Using subsets of a human breast cancer dataset (Sotiriou et al. 2006) and concordance between the two independent larger sample breast cancer datasets, we showed that LRpath again resulted in the best performance. The results from these analyses were generally in agreement with results of the simulation study. In certain situations, the use of significance values may not be appropriate, for example when testing for categories enriched with genes belonging to a specific cluster, sharing a similar expression profile. In these cases where LRpath, GSEA, and globaltest are not appropriate, our results indicate that BayGO is likely to provide the best performance.

Using the breast cancer (Miller et al. 2005; Sotiriou et al. 2006) and IPF (Pardo et al. 2005) datasets, we also uncovered novel insights into the possible biological mechanisms of these diseases. In breast cancer, we demonstrate the use of LRpath and other methods to detect consistent GO terms distinguishing histologic grade 3 and grade 1 primary breast tumor samples from two independent datasets. In IPF, we demonstrate the use of LRpath to detect over-represented biological categories not presented in the original analysis (i.e. identifying additional pathways including altered “Cell cycle”, decreased “Blood vessel development”, and a counter-intuitive decrease in “Cytokine-cytokine receptor interaction” with respect to animal models).

We have implemented LRpath as an R function (Ihaka & Gentleman 1996) (Additional file 6), which can be downloaded along with all other supplemental material from our supporting website http://eh3.uc.edu/lrpath. The function is designed to automatically test the categories of Gene Ontology terms or KEGG Pathways, but can be modified for use with user-defined categories. This implementation accepts as input significance statistics of the investigator’s choice and allows for duplicate and missing gene identifiers, thus resulting in a powerful, easily implemented, and flexible method. By providing a more accurate assessment of the biological categories enriched with differentially expressed genes, LRpath can empower investigators to gain valuable biological insight into the results of genome-wide expression data.

3.4. Materials and methods

3.4.1. Simulation steps

All simulations were performed with 10,000 “genes”, of which 500 (5%), or in one set 1000 (10%), were designed to be “differentially expressed”. The simulation process is detailed here:

For all 10,000 genes, g:

1. Simulate gene variances, $\sigma_g^2$, as random draws from the $\chi$ distribution

2. Without loss of generality, assume the first 500 (or 1000) genes are differentially expressed

   a. Simulate actual mean log ratios, $\mu_g$, following $N(0, 4\sigma_g)$ (or Uniform($[-2.5,-0.5] \cup [0.5,2.5]$)),

   b. For the remaining 9500 (or 9000) genes, set the actual mean log ratios, $\mu_g = 0$.

3. Simulate normalized estimated expression levels as random draws from $N(\mu_g, \sigma_g^2)$.

4. Randomly select 2 (or 5) GO terms to be “enriched”

5. Randomly assign L% of human Entrez Gene IDs from each enriched GO category to differentially expressed genes.

6. Randomly assign unique Entrez Gene IDs to all other gene expression values, including unassigned differentially expressed genes, as random draws from all human Entrez Gene IDs represented in GO

7. For all GO terms with 10-200 genes, calculate significance statistics (p-values and FDR) for each tested method.
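
Steps 1 to 3 of the list above can be sketched in a few lines of Python. This is a minimal illustration and not the code used for the dissertation: the function name `simulate_expression`, the degrees of freedom `var_df` for the variance draw, and the number of replicates `n_reps` are assumptions made only for this example. The distribution in step 1 is read here as a chi-square, and the uniform alternative of step 2a is used for the effect sizes.

```python
import random

def simulate_expression(n_genes=10000, n_de=500, n_reps=4, var_df=4, seed=0):
    """Sketch of simulation steps 1-3 (var_df and n_reps are assumed values)."""
    rng = random.Random(seed)
    # Step 1: gene variances. A chi-square variate with var_df degrees of
    # freedom is a Gamma variate with shape var_df/2 and scale 2.
    variances = [rng.gammavariate(var_df / 2.0, 2.0) for _ in range(n_genes)]
    # Step 2: the first n_de genes are differentially expressed, with mean
    # log ratios uniform on [-2.5, -0.5] U [0.5, 2.5]; all others are 0.
    means = [0.0] * n_genes
    for g in range(n_de):
        magnitude = rng.uniform(0.5, 2.5)
        means[g] = magnitude if rng.random() < 0.5 else -magnitude
    # Step 3: estimated expression levels drawn from N(mu_g, sigma_g^2).
    data = [[rng.gauss(means[g], variances[g] ** 0.5) for _ in range(n_reps)]
            for g in range(n_genes)]
    return means, variances, data
```

Steps 4 to 7 depend on the GO annotation and on the enrichment methods themselves, so they are not reproduced in this sketch.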

For methods other than LRpath, we used the method’s default values with the following exceptions. For GSEA, we permuted genes rather than samples for experiments with fewer than six replicates (as recommended for small experiments), and for globaltest we used the asymptotic method when the total number of possible permutations was too small to obtain significant p-values. For BayGO we increased the number of simulations from 100 to 1000 for higher accuracy.

3.4.2. Construction of the gold-standard set of GO terms for the breast cancer dataset

Using Entrez Gene IDs, all GO terms containing between 10 and 200 analyzed genes were tested (1761 categories) with LRpath, Fisher’s exact test, globaltest, GSEA, BayGO, and sigPathway. Categories with p-value<0.005 in at least two of the six methods were included in the gold-standard set, resulting in 79 true positive GO terms (Supplemental Table 5a). The number of GO terms with p<0.005 in each of the methods was as follows: BayGO 43, Fisher’s exact 34, globaltest 18, GSEA 218, LRpath 126, sigPathway 26. Of the total number of GO terms in the gold-standard set, this resulted in 39 being detected by BayGO, 23 by Fisher’s exact, 16 by globaltest, 39 by GSEA, 76 by LRpath, and 12 by sigPathway. Categories that did not result in a p-value<0.05 for any method were included in the non-enriched GO term list, resulting in 873 negative categories (Supplemental Table 5b). The remaining categories were deemed to have unknown status and were not used in creating the ROC-like curves.
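
The classification rule described above is simple enough to state as code. The following Python sketch is illustrative only: the function name `classify_categories`, the category labels, and the p-values are hypothetical, and only three methods are shown for brevity.

```python
def classify_categories(pvalues, pos_cut=0.005, neg_cut=0.05, min_methods=2):
    """pvalues maps each category to a dict of {method name: p-value}."""
    status = {}
    for category, by_method in pvalues.items():
        # Count the methods that call this category at the strict cutoff.
        n_hits = sum(1 for p in by_method.values() if p < pos_cut)
        if n_hits >= min_methods:
            status[category] = "enriched"        # gold-standard positive
        elif all(p >= neg_cut for p in by_method.values()):
            status[category] = "non-enriched"    # gold-standard negative
        else:
            status[category] = "unknown"         # excluded from ROC-like curves
    return status

# Hypothetical p-values for three categories and three methods.
example = {
    "cat_A": {"LRpath": 0.001, "GSEA": 0.004, "BayGO": 0.20},  # two strict hits
    "cat_B": {"LRpath": 0.30, "GSEA": 0.08, "BayGO": 0.90},    # no p < 0.05
    "cat_C": {"LRpath": 0.001, "GSEA": 0.04, "BayGO": 0.50},   # one strict hit
}
status = classify_categories(example)
```

The gap between the two cutoffs (0.005 and 0.05) is what produces the “unknown” group: a category with a single strong call, or only weak calls, is neither a confident positive nor a confident negative.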

CHAPTER 4: Full hierarchical and empirical Bayesian spline-based models for the analysis of multiple types of microarray data

4.1. Overview

The purpose of this chapter is to introduce a universally applicable parametric, empirical Bayes approach for the analysis of replicated microarray data, and to validate the robustness of the method with a full hierarchical Bayesian approach. By “universally applicable”, we mean applicable to general experimental designs analyzed with any linear model, in conjunction with any normalization method, and performed on any platform (dual-channel, Affymetrix, chromatin immunoprecipitation arrays, etc.). The units analyzed may be transcript elements, miRNAs, or promoter elements, but for simplicity we will refer to them throughout this chapter as “genes”.

The empirical and full Bayes models introduced here are in the same spirit as a published method, IBMT (see Chapter 2), which is not a truly empirical Bayesian model due to its use of non-parametric local regression. Our new methods validate the simplified statistical model used in IBMT, and results of the empirical model suggest a slight improvement over IBMT. The newly introduced empirical model is an extension of the eBayes function in the limma package (SMT) (Smyth 2004) of Bioconductor (Gentleman et al. 2004), and as such, it is extremely flexible in that it may be used with any microarray platform, any normalization method, and in conjunction with any linear model. Two important relationships that are empirically measured in these models are the degree of commonality among gene variances, and the relationship between gene variance and absolute expression level. The latter was not accounted for in (Smyth 2004), but has been in other methods for limited types of designs and array types (Baldi & Long 2001; Jain et al. 2003; Fox & Dimmic 2006). Motivation for modeling this relationship can be seen in Figure 4.1, where we provide typical examples for replicated (a) dual-channel, (b) RMA normalized Affymetrix (Irizarry et al. 2003a; Irizarry et al. 2003b), and (c) Affymetrix ChIP-chip data. To our knowledge, the methods proposed in this chapter are the first true empirical and full Bayes methods for gene variance estimation exhibiting the properties described above.

The two models illustrate two methods for incorporating spline estimation within the hierarchical Bayesian framework to functionally relate gene variance and absolute expression levels. Splines are piecewise polynomials used to fit a smooth function to a set of data points. The joining points of the piecewise polynomials are called knots. Interpolating splines fit polynomials exactly through all data points, whereas regression/polynomial splines fit piecewise polynomial regression functions to the data points with a number of knots, m, that is much smaller than N, the total number of data points (m << N). Restrictions are placed on splines so that the overall function is smooth at the knots; for example, it is often assumed for a cubic spline that the two functional values meeting at a given knot are identical, as well as their first and second order derivatives (Boor 2001). Smoothing splines, in contrast, define a knot at each data point and include a roughness penalty on the 2nd derivative to smooth the function. Although smoothing splines may be piecewise parametric, i.e., the regressions are parametric and assume the data are normally distributed about the predicted curve, the entire spline is semi-parametric. The use of splines is convenient for modeling a function whose form may vary and therefore should not be restricted to a specific parametric form. Splines, as opposed to other smoothers, are attractive when the above situation occurs in the context of a parametric Bayesian model, because the semi-parametric nature of splines enables one to model a function flexibly while retaining the simplifying assumptions of the normal distribution.

Figure 4.1 Typical relationships observed between gene log-variances and average log-expression level. Data shown are for replicated (A) dual-channel cDNA data, (B) RMA normalized Affymetrix data, (C) MAS normalized Affymetrix data, and (D) quantile normalized Affymetrix promoter tiling data. Red lines were determined using smoothing splines.

Smoothing splines, in particular, have an empirical Bayesian interpretation that will be exploited in this chapter. The Bayesian interpretation of splines has a long, although somewhat sparse, history beginning with (Wahba 1978), where the smoothing spline solution was described as the posterior mean of a Bayesian model using a partially improper prior on the polynomials. The partially improper prior is a random linear function plus an integrated Wiener process (Brownian motion). This was generalized, and more fully developed in terms of reproducing kernel Hilbert spaces, in (Wahba 1990).

We show that the empirical Bayes procedure, which has a much faster running time and the advantage of consistent, closed-form solutions, produces results nearly identical to the full Bayes model, which uses slightly different underlying distributional assumptions and a different spline estimation method. This strong concordance between models points to the robustness of our empirical model. In Section 4.2 we outline the two novel methodologies and display them as directed acyclic graphical structures (DAGs). We detail the empirical Bayes procedure in Section 4.2.1 and the full Bayesian model in Section 4.2.2, and in Section 4.3 we provide results from simulated and experimental data. We have developed R functions for implementing our methods: SSIBMT for the empirical model and fbSpline for the full Bayesian model. The R source code for both methods is freely available at http://eh3.uc.edu/spline_bayes.


4.2. Methods

4.2.1. Empirical Bayes model

Because we expect to gain information regarding the variances from other genes, but not the fold changes of differential expression, we adopt a hybrid frequentist/Bayesian method in that the estimated fold changes are determined from an appropriate frequentist linear or mixed model and the Bayesian model is used to provide more accurate estimates of variance. The goal is to obtain a closed-form solution for the posterior mean and variance, and posterior degrees of freedom, which can be used to formulate a moderated T-statistic. The hierarchical structure of the model can be represented visually by a directed acyclic graph (Figure 4.2), where each node in the graph, given its parents, is independent of its non-descendants in the model.

In the linear model, the estimated log-fold changes, $\hat{\beta}_g$, for any specified contrast are assumed to be approximately normal:

$$\hat{\beta}_g \sim N(\beta_g, v_g \sigma_g^2) \tag{4.1}$$

where $\sigma_g^2$ is the residual variance in the linear model for gene g and $v_g$ is the coefficient of the variance required to calculate the standard error. For a two-sample t-test, $v_g$ is $1/n_1 + 1/n_2$, where $n_1$ and $n_2$ are the number of observations for each experimental condition. In our creation of a moderated T-statistic, the $\hat{\beta}_g$ are taken from the frequentist linear model, such that only estimates for the variance and degrees of freedom are required. Thus, at the lowest level of the empirical Bayes model (Figure 4.2), the sample variances, $s_g^2$, are assumed to follow a scaled chi-square distribution:

$$s_g^2 \mid \sigma_g^2 \sim \frac{\sigma_g^2}{d_g} \chi^2_{d_g} \tag{4.2}$$

Figure 4.2 Directed acyclic graph of the empirical Bayesian model. Each node is necessarily conditionally independent of its non-descendants given its parents, thus providing a simple method to formulate the model’s joint distribution. Data nodes are in gray boxes, latent variables in white boxes. $s_g^2$ are the sample variances for log-fold changes, $d_g$ are the residual degrees of freedom, $\sigma_g^2$ are the true variances, $\mathrm{A}$ is a matrix of spline coefficients, $\lambda$ is the roughness penalty, $d_0$ is the prior degrees of freedom, and $\mu_g$ are the average log-intensities.

where $d_g$ is the contrast degrees of freedom from the test of interest in the linear model and $\sigma_g^2$ are the transcript/probe variances. At the next level, we assume the conjugate prior for the variances, that is, they are assumed to follow a scaled inverse-chi-square:

$$\frac{1}{\sigma_g^2} \sim \frac{1}{d_0 s_{0g}^2} \chi^2_{d_0} \tag{4.3}$$

where $d_0$ is a parameter that is interpreted as the prior degrees of freedom, and $s_{0g}^2$ is the hyperprior/background variance level, estimated by a cubic smoothing spline function f(.) evaluated at $\mu_g$ and defined on the range of $[a,b] = [\min \mu_g, \max \mu_g]$.

The smoothing spline minimizes the following expression:

$$\frac{1}{n} \sum_{g=1}^{n} \left( f(\mu_g) - \log s_g^2 \right)^2 + \lambda \int_a^b \left( f''(u) \right)^2 du \tag{4.4}$$

where the smoothing parameter, $\lambda$, is a fixed constant. Because restrictions are placed on the spline function, we reduce the number of parameters by representing it in terms of a basis matrix, $B$. We use the B-spline basis, which is computationally stable and efficient because the cubic B-spline basis functions, $B_j$, have the property of being nearly orthogonal. Written as a linear combination of the basis functions, the spline function is

$$f(\mu) = \sum_j \alpha_j B_j(\mu) \tag{4.5}$$

where $\alpha_j$ are coefficients, collected in $\mathrm{A}$. Expression (4.4) can now be written in matrix notation as:

$$(Y - B\mathrm{A})^T (Y - B\mathrm{A}) + \lambda \mathrm{A}^T \Omega \mathrm{A} \tag{4.6}$$

where $Y$ is the vector of $\log s_g^2$ values, and the matrix $\Omega$ is defined as

$$\Omega_{ij} = \int_a^b B_i''(\mu) B_j''(\mu) \, d\mu. \tag{4.7}$$

The solution to this minimization is the penalized least squares estimator,

$$(B^T B + \lambda \Omega) \hat{\mathrm{A}} = B^T Y \tag{4.8}$$

(Hastie & Tibshirani 1990).
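
The step from the matrix form (4.6) to the estimator (4.8) can be made explicit. The derivation below is a sketch added for clarity; it writes $\mathrm{A}$ for the vector of spline coefficients, $Q$ for the objective in (4.6), and uses only the symmetry of $\Omega$ implied by its definition in (4.7).

```latex
\begin{align*}
Q(\mathrm{A}) &= (Y - B\mathrm{A})^{T}(Y - B\mathrm{A}) + \lambda\,\mathrm{A}^{T}\Omega\,\mathrm{A} \\
\frac{\partial Q}{\partial \mathrm{A}} &= -2B^{T}(Y - B\mathrm{A}) + 2\lambda\,\Omega\,\mathrm{A} = 0 \\
\Rightarrow\quad (B^{T}B + \lambda\Omega)\,\hat{\mathrm{A}} &= B^{T}Y
\end{align*}
```

Setting $\lambda = 0$ recovers ordinary least squares on the basis functions, while a very large $\lambda$ forces the fitted function toward zero second derivative, that is, toward a straight line.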

The first term of expressions (4.4) and (4.6) is a least squares regression between each knot, which is valid whether or not the data are normal. The 2nd term penalizes large 2nd derivatives of the spline function, keeping it “smooth”. The use of the square for the penalty is the natural result of assuming the errors are normally distributed. The smoothing parameter, $\lambda$, is estimated from the data using generalized cross-validation (GCV), which is known to minimize the expected predictive MSE of $\lambda$, $ER(\lambda)$ (Wahba 1983); the GCV method is also fairly robust to non-homogeneous variance and non-normal error distributions (Wahba 1990, page 65). The spline model assumes homogeneous variance around the spline function, eliminating the need for weighting.

To evaluate the empirical model, we calculate the marginal distribution of $s_g^2$; we integrate the distribution for $s_g^2$ over $\sigma_g^2$ to get the distribution as a function of the hyperparameters only. The marginal distribution of the sample variance is:

$$m(s_g^2) \sim \frac{d_0 \tilde{f}(\mu_g, \mathrm{A}, \lambda)}{\chi^2_{d_0}} \times \frac{\chi^2_{d_g}}{d_g} = \tilde{f}(\mu_g, \mathrm{A}, \lambda) \, F(d_g, d_0) \tag{4.9}$$

where $\tilde{f}(\mu_g, \mathrm{A}, \lambda)$ is the exponential of the spline function. The log of the marginal can now be written as a function of the average log-intensities (with the spline function having coefficients matrix $\mathrm{A}$ and smoothing parameter $\lambda$) and errors distributed as Fisher’s Z, the log of the F-distribution (Johnson & Kotz 1970):

$$m(\log s_g^2) \sim f(\mu_g, \mathrm{A}, \lambda) + \text{Fisher's } Z(d_g, d_0) \tag{4.10}$$

The Fisher’s Z distribution is known to have expected value and variance:

$$\begin{aligned}
E[z] &= \frac{1}{2} \left[ \log\left(\frac{d_0}{d_g}\right) + \psi\left(\tfrac{1}{2} d_g\right) - \psi\left(\tfrac{1}{2} d_0\right) \right] \\
\mathrm{Var}[z] &= \frac{1}{4} \left[ \psi_1\left(\tfrac{1}{2} d_g\right) + \psi_1\left(\tfrac{1}{2} d_0\right) \right]
\end{aligned} \tag{4.11}$$

where $\psi$ and $\psi_1$ are the digamma and trigamma functions respectively. Fisher’s Z is known to closely approximate the normal distribution as $d_g$ and $d_0$ become large (Johnson & Kotz 1970). In addition, the skewness approaches 0 as $d_g$ approaches $d_0$.

4.2.1.1. Estimating the hyperparameters

We use the method of moments to obtain estimates of the hyperparameters. First, we equate the empirical to theoretical expected value and variance of the marginal variance:

$$\begin{aligned}
E[\log s_g^2] &= \log s_{0g}^2 + \psi(d_g/2) - \psi(d_0/2) + \log(d_0/d_g) \\
\mathrm{var}[\log s_g^2] &= \psi_1(d_g/2) + \psi_1(d_0/2)
\end{aligned} \tag{4.12}$$

where $s_{0g}^2$ is the gene-specific prior variance level. First, we calculate $E[\log s_g^2]$ as predicted values from the spline solution, and then solve the following equation for $d_0$:

$$\frac{1}{n} \sum_g \left( \log s_g^2 - E[\log s_g^2] \right)^2 = \mathrm{mean}\left( \psi_1(d_g/2) + \psi_1(d_0/2) \right) \tag{4.13}$$

We can then define $s_{0g}^2$ as the exponential of:

$$\log s_{0g}^2 = E[\log s_g^2] - \psi(d_g/2) + \psi(d_0/2) - \log(d_0/d_g) \tag{4.14}$$
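
The method-of-moments step can be illustrated numerically. The Python sketch below is not the SSIBMT implementation: the function names (`digamma`, `trigamma`, `estimate_d0`, `prior_log_variance`) are hypothetical, the special functions are approximated with a standard recurrence plus asymptotic series because the Python standard library does not provide them, and the trigamma equation for the prior degrees of freedom is solved by plain bisection, which is an assumed choice of solver.

```python
import math

def digamma(x):
    """Digamma for x > 0: shift x upward, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (
        1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))

def trigamma(x):
    """Trigamma for x > 0: shift x upward, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return result + inv + 0.5 * inv2 + inv * inv2 * (
        1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 / 42.0))

def estimate_d0(log_s2, fitted, d_g, lo=1e-3, hi=1e6, tol=1e-8):
    """Solve the variance moment equation for the prior degrees of freedom.

    log_s2: observed log sample variances; fitted: spline predictions of
    E[log s_g^2]; d_g: residual degrees of freedom per gene.
    """
    n = len(log_s2)
    resid_var = sum((y - f) ** 2 for y, f in zip(log_s2, fitted)) / n
    target = resid_var - sum(trigamma(d / 2.0) for d in d_g) / n
    if target <= trigamma(hi / 2.0):
        return float("inf")  # no excess variability: prior dominates
    # trigamma is strictly decreasing, so bisection on d0 is safe.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if trigamma(mid / 2.0) > target:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol * hi:
            break
    return 0.5 * (lo + hi)

def prior_log_variance(fitted_g, d_g, d0):
    """Log prior variance for one gene, for finite d0."""
    return (fitted_g - digamma(d_g / 2.0) + digamma(d0 / 2.0)
            - math.log(d0 / d_g))
```

When the observed spread of log-variances around the spline is no larger than what sampling noise alone would produce, the sketch returns an infinite prior degrees of freedom, meaning the posterior variance would rest entirely on the fitted background level.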

4.2.1.2. Defining the moderated T-statistic

Assuming the hyperparameter estimates $\hat{\mathrm{A}}$ and $\hat{\lambda}$ are given, the solutions for the posterior mean of the degrees of freedom and variance are:

$$\begin{aligned}
df_g &= d_0 + d_g \\
\tilde{s}_g^2 &= \frac{d_0 s_{0g}^2 + d_g s_g^2}{d_0 + d_g}
\end{aligned} \tag{4.15}$$


where $df_g$ is the posterior degrees of freedom and $\tilde{s}_g^2$ is the posterior mean of $\sigma_g^2$ given $s_g^2$ for gene g. Writing the posterior variance in this way provides an intuitive interpretation of the relationship between the prior and sample residual degrees of freedom. That is, $d_0$ serves as a counter-weight to $d_g$ that determines the importance of the estimated background variance level relative to the individual gene’s estimated variance. In turn, $d_0$ is empirically estimated by comparing the observed distribution of sample variances to the expected distribution of sample variances derived from the assumed approximate normality of log-fold changes and sample size (i.e. $d_g$). As in IBMT (see Chapter 2), the posterior variance is still distributed as a chi-square since we used the conjugate prior for the variance (inverse chi-square) in our model. Thus, the moderated T-statistic using the posterior degrees of freedom and variance still follows a T-distribution:

$$t_g = \frac{\hat{\beta}_g}{\tilde{s}_g \sqrt{v_g}} \sim T_{df_g} \tag{4.16}$$

Parametric p-values are then calculated using the posterior t-statistic and posterior degrees of freedom, and the p-values are then adjusted using an appropriate method, such as a parametric or empirical False Discovery Rate (FDR) (Benjamini & Hochberg 1995; Efron 2006a).
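
A short numerical sketch may help show how the posterior variance and the moderated T-statistic combine the prior and sample information. The function name `moderated_t` and all input values below are hypothetical and chosen only for illustration; the p-value step is omitted because the Python standard library has no Student t distribution function.

```python
import math

def moderated_t(beta_hat, s2, d_g, v_g, s0_2, d0):
    """Return (t, df, posterior variance) for one gene.

    beta_hat: estimated log-fold change; s2: sample variance;
    d_g: residual degrees of freedom; v_g: variance coefficient
    (1/n1 + 1/n2 for a two-sample comparison);
    s0_2: prior (background) variance; d0: prior degrees of freedom.
    """
    df = d0 + d_g
    # Posterior variance: average of prior and sample variance, weighted
    # by their respective degrees of freedom.
    s2_post = (d0 * s0_2 + d_g * s2) / df
    t = beta_hat / math.sqrt(v_g * s2_post)
    return t, df, s2_post

# Hypothetical gene: three samples per group, small sample variance.
t, df, s2_post = moderated_t(beta_hat=1.2, s2=0.09, d_g=4,
                             v_g=1.0 / 3 + 1.0 / 3, s0_2=0.25, d0=4)
t_ordinary = 1.2 / math.sqrt((1.0 / 3 + 1.0 / 3) * 0.09)
```

In this example the sample variance is smaller than the background level, so the posterior variance is pulled upward and the moderated statistic is smaller in magnitude than the ordinary t-statistic, while being referred to a T-distribution with more degrees of freedom.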

4.2.2. Full Bayesian model


The full Bayes model is used to examine the robustness of the empirical model, and is offered as an alternative to the empirical procedure. We qualify here that the model is “full Bayes” only so far as to estimate the variance as a function of expression intensity and calculate posterior distributions of variances. This model uses diffuse distributions on the hyperparameters and accounts for the uncertainty in (1) spline function estimation and (2) the relative weighting of prior versus sample variances. The directed acyclic graph for the full Bayesian model is shown in Figure 4.3. The goal of this procedure is to obtain posterior distributions of variance that are used to obtain posterior predictive p-values, rather than to obtain posterior mean variances used to calculate moderated T-statistics as was done in the empirical model or to obtain Bayesian posterior probabilities of differential expression as would be the goal in a completely full Bayes approach.

Figure 4.3 Directed acyclic graph for full hierarchical Bayesian model. Each node is necessarily conditionally independent of its non-descendants given its parents, thus providing a simple method to formulate the model’s joint distribution. Data nodes are in gray boxes, latent variables in white boxes. $s_g^2$ are the sample variances for log-fold changes, $d_g$ are the residual degrees of freedom, $\sigma_g^2$ are the true variances, $\mathrm{A}$ is a matrix of spline coefficients, $\mathrm{T}$ is the vector of knot locations, $\tau$ is the standard deviation of variances around the spline function (corresponding to the prior degrees of freedom of the empirical model), and $\mu_g$ are the average log-intensities.

As with the empirical model, the sample variances, $s_g^2$, are assumed to follow a scaled chi-square distribution, now more conveniently expressed in terms of a Gamma distribution:

$$s_g^2 \sim \mathrm{Gamma}\left( \frac{d_g}{2}, \frac{d_g}{2\sigma_g^2} \right) \tag{4.17}$$

where $d_g$ is again the sample degrees of freedom from the linear model. Rather than using the conjugate prior for the variances, $\sigma_g^2$, we instead use a log-normal prior for the gene variances:

$$\sigma_g^2 \sim \text{log-Normal}\left( \log s_{0g}^2, \tau^2 \right), \qquad \tau^2 \approx \frac{2}{d_0} \tag{4.18}$$

where $\log s_{0g}^2$ is a cubic spline function f(.) evaluated at $\mu_g$ and defined on the range of $[a,b] = [\min \mu_g, \max \mu_g]$. The function f(.) is defined by the matrix $\mathrm{A}$ (the coefficients of the polynomials between each knot), and $\mathrm{T}$ (the vector of knot locations). The log-normal prior is a convenient choice for two reasons: it approximates the conjugate prior distribution for mid to large $d_0$ (empirically determined), and it allows us to estimate the spline function as a function of log-variances with normally distributed errors. As in the empirical model, we formulate the spline in terms of the B-spline basis. The knots are placed at 40 equally spaced quantiles of the data.

At the top level of the DAG, hyper-hyperprior distributions are defined as follows:

$$\mathrm{A} \sim MVN(0, \infty) \quad \text{and} \quad \tau \sim \text{Inverse Gamma}(0.001, 0.001) \tag{4.19}$$

This leads to the joint model distribution in Appendix A.3. A Metropolis-Hastings within Gibbs sampling method was used to estimate the parameters (Appendix A.3.2). The polynomial spline provides posterior distributions of variance, and assuming Gaussian data (i.e. the $\log \sigma_g^2$ are normally distributed around $\log s_{0g}^2$) and uninformative normal priors for spline coefficients, the posterior vector of spline coefficients is multivariate normal. Thus, we can sample from:

$$MVN_{45}\left( \hat{\mathrm{A}}, \mathrm{Cov}(\hat{\mathrm{A}}) \right) \tag{4.20}$$

The log-normal distribution assumed for the prior variance approximates the scaled-inverse-chi-square, and this approximation has been exploited previously, as in (Jaffrezic et al. 2007), where they assumed that the resulting significance statistics still followed a Student T-distribution. However, in simulations with small degrees of freedom we determined the difference to be large enough to cause a substantial deviation in the accuracy of parametric significance statistics (data not shown). This is because the posterior variance is no longer chi-square distributed, but rather a log-normal mixture of chi-square distributions, which leads to the denominator of the test statistic also not being chi-square distributed, so that the statistic

$$t^* = \frac{\hat{\beta}_g}{\sqrt{v_g \tilde{\sigma}_g^2}}$$

does not follow the T-distribution. Therefore, we calculate semi-parametric “posterior predictive probabilities”, using an MCMC sampling method on the posterior distributions of variance, to substitute for the parametric p-values (Meng 1994).

4.2.2.1. Calculation of posterior predictive probabilities

From an appropriate frequentist linear model, we have estimates of log-fold change, $\hat{\beta}_g$, of which the difference from the true, unknown log-fold change, $\beta_g$, is normally distributed given the variance, $\sigma_g^2$:

$$\hat{\beta}_i - \beta_i \sim N(0, v_0 \sigma_i^2) \tag{4.21}$$

In (4.21), $v_0$ is the scale for the variance, defined as $1/n_1 + 1/n_2$, where $n_1$ and $n_2$ are the number of observations for each experimental condition for a single comparison experiment. Under the usual t-test, $\sigma_i^2$ is assumed to be distributed chi-square. Since this can no longer be assumed, we approximate the distribution of $\sigma_i^2$ by sampling from the posterior variances and calculate posterior predictive probabilities (Meng 1994). First, we sample from the posterior variances to obtain sampling distributions for the log-fold changes. With unknown variance distribution, we can write:

$$p(\beta_i - \hat{\beta}_i) = \int p(\beta_i - \hat{\beta}_i \mid \sigma_i^2) \, p(\sigma_i^2) \, d\sigma_i^2 \tag{4.22}$$

That is, $p(\beta_i - \hat{\beta}_i)$ is a mixture of normals depending on the distribution of $\sigma_i^2$. We approximate $p(\beta_i - \hat{\beta}_i)$ by (1) sampling from $N(0, v_0 \sigma_i^2)$ fifty times for each posterior variance sampled, and then (2) combining samples for all $\sigma_i^2$ to obtain the mixture, effectively integrating over $\sigma_i^2$. Finally, we calculate the proportion of sampled values that are more extreme than $\hat{\beta}_i$, giving the posterior predictive probability.
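
The sampling scheme just described can be sketched as follows. This is an illustration under stated assumptions rather than the fbSpline code: the function name `posterior_predictive_p` is hypothetical, the posterior variance draws are taken as given (in practice they would come from the MCMC output), and “more extreme” is interpreted as two-sided, i.e. larger in absolute value.

```python
import random

def posterior_predictive_p(beta_hat, posterior_variances, v0,
                           draws_per_variance=50, seed=0):
    """Monte Carlo approximation of the posterior predictive probability.

    posterior_variances: draws of sigma_i^2 from its posterior distribution;
    v0: variance scale (1/n1 + 1/n2 for a single comparison).
    """
    rng = random.Random(seed)
    n_extreme = 0
    n_total = 0
    for sigma2 in posterior_variances:
        sd = (v0 * sigma2) ** 0.5
        # Step (1): fifty null draws from N(0, v0 * sigma2) per variance draw.
        for _ in range(draws_per_variance):
            if abs(rng.gauss(0.0, sd)) >= abs(beta_hat):
                n_extreme += 1
            n_total += 1
    # Step (2): pooling over all variance draws integrates over sigma_i^2.
    return n_extreme / n_total
```

The resolution of the resulting probability is limited by the total number of draws: with K posterior variance samples the smallest nonzero value is 1/(50K), so very small probabilities require long MCMC runs.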

4.3. Results

4.3.1. Smoothing spline versus local regression

Simulations were performed to compare the performance of smoothing splines versus local regression (SSIBMT vs. IBMT) in accurately estimating the functional relationship between log-variance and expression level for varying levels of sample and prior degrees of freedom. Average log expression intensities were generated assuming a log-normal distribution with a scale parameter of 1.1, shape parameter equal to 0.34, and threshold parameter 5.1 (Figure 4.4a, left axis). The functional relationship was defined as a quadratic similar to that seen in Affymetrix expression data normalized by RMA (Figure 1b): $-0.08(x-8.5)^2 - 0.04(x-8.5) - 0.5$, where x is the vector of average log-expressions (Figure 4.4a, right axis). Log-sample variances were then randomly generated according to the marginal sample variance of the empirical model (10) as instances from $s_g^2 + F(d_g, d_0)$. We estimated the original functional form using smoothing splines and local regression (spans = 0.1, 0.3) for several combinations of $d_g$ (4, 8, 16, and 32) and $d_0$ (2, 4, 8, 16, and 100). Each parameter combination was replicated 10 times, and to measure performance, we calculated the sum of squared errors (SSE) between the original and estimated function at all data points (Figure 4.4b). The results in Figure 4.4b suggest that SSIBMT has a slight edge in performance over IBMT, and that this advantage is greatest for experiments with low sample or prior degrees of freedom.
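As a rough illustration of the simulation setup and the SSE performance measure, the following Python sketch generates average log-intensities and scores a candidate estimate of the variance function. It makes several assumptions that are not stated in the text: the three-parameter log-normal is parameterized as threshold plus exp(Normal(scale, shape)), the function names are hypothetical, and the smoothing spline and local regression fits themselves are not reproduced (any callable can stand in for the estimated function).

```python
import random


def true_log_variance(x):
    """Quadratic relating log-variance to average log-expression."""
    return -0.08 * (x - 8.5) ** 2 - 0.04 * (x - 8.5) - 0.5


def simulate_intensities(n, rng, scale=1.1, shape=0.34, threshold=5.1):
    """Draw average log-intensities from a three-parameter log-normal.

    Assumed parameterization: threshold + exp(Normal(scale, shape)).
    """
    return [threshold + rng.lognormvariate(scale, shape) for _ in range(n)]


def sum_squared_errors(xs, estimated_fn):
    """SSE between an estimated and the true function at all data points."""
    return sum((estimated_fn(x) - true_log_variance(x)) ** 2 for x in xs)


rng = random.Random(1)
xs = simulate_intensities(1000, rng)
# A deliberately poor "estimate" (a constant) to show how the SSE is used.
sse_constant = sum_squared_errors(xs, lambda x: -0.5)
```

A fitted smoothing spline or local regression curve would be passed in place of the constant function, and the resulting SSEs averaged over replicates.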

Figure 4.4 Comparison of smoothing spline and local regression to estimate function (A) Left axis: histogram of log2-average expression levels used for simulations; right axis: functional relationship between log-variances and log2-average expression levels, as typically seen in Affymetrix expression data normalized with RMA. (B) Average SSEs for true versus estimated log-variance functions. Results compare the performance of smoothing splines and local regression under varying sample and prior degrees of freedom. For the local regression, one iteration was used with spans of 0.1 and 0.3.

4.3.2. Simulation results

Simulations were performed assuming a total of 10,000 “genes”, of which 500 (5%) were differentially expressed. The parameters varied in the simulations were the residual (sample) degrees of freedom (4 and 8), the assumed prior distribution of variances around the spline function (scaled-inverse chi-square and log-normal), and the prior degrees of freedom $d_0$ (2, 4, 16, and 100) or, for the log-normal prior, the equivalent variance parameter of the log-normal distribution (0.14, 0.4, 1.75, and 4). The log-normal variance parameter values 0.14 and 0.4 approximate, respectively, the $d_0$ values of 100 and 16. The function relating log-variance to log-expression levels is the same as in Figure 4.4a. DEGs were randomly generated from the Uniform(|0.5, 2.5|) distribution. Each parameter combination was replicated 10 times in order to obtain accurate results. The methods tested were the regular t-test, IBMT, SSIBMT, and fbSpline. IBMT was previously compared to other relevant methods (Baldi & Long 2001; Smyth 2004; Fox & Dimmic 2006), including the use of a simple fold change cutoff for ranking genes.

The convergence properties of the full Bayes model were tested for a subset of simulation runs. The posterior variances converged relatively quickly, as seen in trace plots, and autocorrelation plots revealed adequately low levels of correlation and satisfactory mixing (Figure 4.5). For SSIBMT, we evaluated its ability to correctly estimate the prior degrees of freedom, and found it to be as good as IBMT (see Chapter 2 for IBMT's performance).


Figure 4.5 Convergence properties of the full Bayesian model (A) A typical autocorrelation plot of the variances, $\sigma_g^2$, averaged over a random sample of 100 genes. Error bars are +/- s.d. The plot shown is for $d_g$=4, $d_0$=2 and with the chi-square prior distribution. (B) Trace plot for a randomly chosen gene for the same parameter set as for (A) (top), and for $d_g$=8, a log-normal variance parameter of 0.4, and the log-normal prior distribution (bottom).

Any reliable analysis method should exhibit the property of correctly balancing the estimated and true proportion of false positives. In other words, the method should not greatly over- or under-estimate the significance values. Thus, for SSIBMT and fbSpline, we generated plots of the true versus estimated proportion of false positives, shown in Figure 4.6b for $d_g$=8 and $d_0$=16 with both the chi-square and log-normal prior distributions; similar results were obtained from the other simulation sets.

Results of the simulations revealed SSIBMT to have a slight but significant (p=0.002) advantage over fbSpline, and to have nearly identical performance to IBMT. For each simulation set, the number of false positives was counted in the top-ranked 1-500 “genes” for each method. False positive curves were generated as in Figure 4.6a, and areas under the curve were calculated as overall performance measures for each simulation set (Table 4.1).
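The false positive curve and its area can be sketched as follows. This is a minimal Python illustration, not the code behind Table 4.1: the function names are hypothetical, and the normalization of the area (dividing by the square of the list length, so that values fall between 0 and 1) is an assumption, since the exact scaling used for the reported areas is not given in the text.

```python
def false_positive_curve(ranked_is_true_deg, max_rank=500):
    """Cumulative number of false positives among the top-ranked genes.

    ranked_is_true_deg: booleans ordered from most to least significant,
    True where the gene was simulated as differentially expressed.
    """
    curve = []
    false_positives = 0
    for is_true in ranked_is_true_deg[:max_rank]:
        if not is_true:
            false_positives += 1
        curve.append(false_positives)
    return curve


def normalized_area(curve):
    """Area under the false positive curve, scaled to [0, 1].

    Assumed normalization: sum of the curve divided by n * n, where n is
    the number of ranks considered.
    """
    n = len(curve)
    if n == 0:
        return 0.0
    return sum(curve) / (n * n)


# Toy ranking: the 2nd and 4th genes are false positives.
toy_curve = false_positive_curve([True, False, True, False])
toy_area = normalized_area(toy_curve)
```

Lower areas indicate that false positives accumulate more slowly down the ranked list, which is the sense in which smaller values in Table 4.1 are better.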


Figure 4.6 Simulation results for empirical and full spline-based methods (A) Both methods (SSIBMT and fbSpline) correctly estimate the true proportion of false positives. Seen here are results for $d_g$=8 and $d_0$=16 with the chi-square (top) and log-normal (bottom) prior distributions, averaged over 10 simulation runs. (B) Accumulation of false positives by increasing ranked lists for the same parameter values as in (A, bottom). As seen, IBMT, fbSpline, and SSIBMT all show nearly identical performance.

Table 4.1 Performance of methods in simulations under a range of parameter sets
As a measure of overall performance we show the areas under the false positive curves, an example of which is shown in Figure 4.6b. “lnorm” indicates the log-normal prior distribution and “chi-sq” indicates the conjugate prior distribution. For simulations using the log-normal distribution we chose values of the log-normal variance parameter that resulted in variance levels approximately corresponding to those from the chosen $d_0$ values. The variance for $d_0$ ≤ 4 is undefined.

dg   prior    d0 or lnorm variance   t-test   IBMT    SSIBMT   fbSpline
4    chi-sq   2                      0.298    0.275   0.275    0.286
4    chi-sq   4                      0.290    0.239   0.239    0.245
4    chi-sq   16                     0.283    0.171   0.171    0.171
4    chi-sq   100                    0.282    0.147   0.147    0.148
4    lnorm    4                      0.171    0.160   0.160    0.169
4    lnorm    1.75                   0.218    0.202   0.202    0.208
4    lnorm    0.4                    0.269    0.169   0.169    0.170
4    lnorm    0.14                   0.271    0.146   0.146    0.146
8    chi-sq   2                      0.179    0.167   0.167    0.175
8    chi-sq   4                      0.167    0.144   0.145    0.147
8    chi-sq   16                     0.151    0.102   0.102    0.102
8    chi-sq   100                    0.138    0.080   0.080    0.080
8    lnorm    4                      0.127    0.122   0.122    0.126
8    lnorm    1.75                   0.136    0.126   0.126    0.131
8    lnorm    0.4                    0.131    0.092   0.092    0.092
8    lnorm    0.14                   0.141    0.080   0.080    0.080

4.3.3. Breast cancer datasets

A frequently recurring concern with microarray data is its reproducibility and generalizability. A good analysis method will display high consistency among independent experiments studying the same biological phenomenon, despite differences in technical factors, quality, etc. Thus, to address this concern, we validated our methods using two publicly available breast cancer datasets to examine the transcriptional differences between histologic grade 1 and grade 3 primary tumors. The first study (Sotiriou et al. 2006) consists of human breast carcinoma samples identified as histologic tumor grades 1, 2, or 3 using the Elston-Ellis grading system, and as estrogen receptor (ER) positive or negative. For this analysis, we used samples from non-treated patients (i.e. from the first batch of arrays performed) with histologic grades 1 or 3. Preprocessed data were downloaded from the NCBI Gene Expression Omnibus (GEO) public repository (GEO accession GSE2990) and a total of 29 grade 1 and 12 grade 3 samples were analyzed for differentially expressed genes using the unadjusted t-test, IBMT, SSIBMT, and fbSpline. The second breast cancer dataset (GEO accession GSE3494) consisted of the independent samples with positive ER status from another primary breast tumor study, where each sample was also identified as histologic tumor grade 1, 2, or 3 using the Elston-Ellis grading system (Miller et al. 2005). This resulted in 39 grade 1 samples and 28 grade 3 samples that were used to test for differentially expressed genes with the same methods as above.

These two datasets displayed differing statistical properties. GSE3494 had a large dependence of variance on expression level, whereas GSE2990 revealed a much smaller dependence. GSE2990 overall displayed higher levels of variance than GSE3494, and the prior degrees of freedom estimated for the two datasets also differed: for IBMT they were 4.8 (GSE3494) and 201.6 (GSE2990), and for SSIBMT they were 4.9 (GSE3494) and 203.5 (GSE2990). Nonetheless, we were able to detect significant correlations in significance statistics between the datasets, and an even higher correlation between datasets for the results of gene set enrichment testing.

Results indicate that SSIBMT, fbSpline, and our published method (IBMT) result in higher correlations between datasets than does the unadjusted t-test, both when testing individual genes and when testing biological categories. The dataset GSE2990 had overall higher levels of variance and therefore fewer genes identified as significantly differentially expressed when applying false discovery rates (FDR) (Benjamini & Hochberg 1995). Within each experiment, the posterior variances were more highly correlated with each other than with the variance estimates of the t-statistic (Table 4.2). The same conclusions were reached when substituting p-values for variance estimates.
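For readers unfamiliar with the FDR adjustment cited above, the Benjamini-Hochberg step-up procedure can be written compactly. The Python sketch below is a generic illustration of that published procedure; the function name is hypothetical, and it is not claimed to be the code used for the analyses in this chapter.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy example with four made-up p-values.
fdr_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

Genes whose adjusted value falls below a chosen threshold are declared significant at that FDR level, so a dataset with higher variance and larger raw p-values yields fewer such genes.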

Table 4.2 Intra-experiment Pearson correlation coefficients for variance estimates
Above diagonal: GEO accession ID GSE3494; below diagonal: GSE2990.

            t-test   IBMT     SSIBMT     fbSpline
t-test      1.0      0.956    0.956      0.965
IBMT        0.808    1.0      0.999996   0.999
SSIBMT      0.804    0.997    1.0        0.999
fbSpline    0.920    0.963    0.965      1.0

For gene set enrichment testing, we used Fisher's exact test with two different cutoffs for differentially expressed genes (p<0.01 and p<0.001) and BayGO with a p<0.01 cutoff. BayGO, similar to Fisher's exact test, uses significance counts, but employs a Bayesian framework with 2-by-3 contingency tables (Vencio et al. 2006). All three methods resulted in a higher correlation between datasets when used in conjunction with SSIBMT or fbSpline than with the unadjusted t-test (Figure 4.7). These results suggest that using a more powerful detection procedure for DEGs can lead directly to improved reproducibility across independent microarray experiments.
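To make the role of the DEG cutoff concrete, a one-sided Fisher's exact test for enrichment of a gene set can be computed from the hypergeometric distribution. The Python sketch below is illustrative only: the function name and argument names are hypothetical, and the counts in the example are made up rather than taken from either breast cancer dataset.

```python
from math import comb


def fisher_enrichment_p(deg_in_set, set_size, total_deg, total_genes):
    """One-sided Fisher's exact p-value for over-representation.

    Probability of observing at least `deg_in_set` DEGs in a gene set of
    `set_size` genes, given `total_deg` DEGs among `total_genes` genes.
    """
    max_overlap = min(set_size, total_deg)
    denominator = comb(total_genes, set_size)
    tail = sum(
        comb(total_deg, k) * comb(total_genes - total_deg, set_size - k)
        for k in range(deg_in_set, max_overlap + 1)
    )
    return tail / denominator


# Made-up counts: 12 of 40 genes in a category are DEGs, with 500 DEGs
# called among 10,000 genes at some p-value cutoff.
p_enriched = fisher_enrichment_p(12, 40, 500, 10000)
```

Because `total_deg` and `deg_in_set` both change with the significance cutoff used to call DEGs, the enrichment p-value depends directly on that cutoff and on the test used to rank the genes.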

Table 4.3 Correlations between two independent breast cancer datasets
Inter-experiment Pearson correlation coefficients for -log(p-values) are shown for t-test, IBMT, SSIBMT, and fbSpline.

                 GSE3494
GSE2990 method   t-test   IBMT     SSIBMT   fbSpline
t-test           0.1854   0.1886   0.1887   0.1649
IBMT             0.2144   0.2199   0.2201   0.1932
SSIBMT           0.2144   0.2199   0.2200   0.1932
fbSpline         0.2067   0.2116   0.2117   0.1868
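The inter-experiment correlations of this kind can be computed as in the following Python sketch. It is an illustration under assumptions: the function names are hypothetical, base-10 logarithms are assumed (the base is not stated in the text), and the p-values in the example are made up for genes measured in both datasets.

```python
import math


def neg_log10(p_values):
    """Transform p-values to -log10(p)."""
    return [-math.log10(p) for p in p_values]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up p-values for the same four genes in two independent datasets.
p_dataset_1 = [0.001, 0.20, 0.03, 0.60]
p_dataset_2 = [0.004, 0.35, 0.10, 0.50]
r = pearson(neg_log10(p_dataset_1), neg_log10(p_dataset_2))
```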

Figure 4.7 Comparison of methods through the use of gene set enrichment testing
Correlation coefficients for -log(p-values) from enrichment testing of Gene Ontology (GO) terms between independent sample sets from two breast cancer datasets (GSE2990 and GSE3494) are shown. Three tests (Fisher's exact with p<0.01 cutoff for DEGs, Fisher's exact with p<0.001 cutoff for DEGs, and BayGO with p<0.01 cutoff for DEGs) were used to compare the performance of the t-test, IBMT, SSIBMT, and fbSpline.


4.4. Discussion

In this manuscript we introduced a powerful and universally applicable empirical Bayes model for the detection of differentially expressed gene transcripts or genome binding sites, and we have demonstrated the robustness of this method through comparisons with a novel full Bayesian method. These models empirically measure the relative importance of the individual gene variance level versus the estimated background gene variance level. The background variance in turn is modeled with cubic splines as a function of mean expression level. The spline estimations are fit directly within the hierarchical Bayesian framework. Previously, it has been proposed to account for the dependence of variance on intensity level by placing genes in a certain number of intensity “bins” or windows, and estimating the background variance level for each bin or window separately (Baldi & Long 2001; Strand et al. 2002; Eaves et al. 2002; Fox & Dimmic 2006). However, this approach is less stable and raises the issue of determining the optimal number of bins as a trade-off between bias and variance (Strand et al. 2002), which our spline-based methods circumvent. For high-density and tiling arrays, our empirical Bayes method provides a robust approach to finding individual probe elements with enriched binding. Theoretically, for optimal performance, this would be combined with a feature selection procedure that considers the behavior of neighboring probes to identify larger bound promoter regions.

The full Bayesian model, fbSpline, which accounts for the uncertainty in the estimated spline function and in the variance of gene variances, does not demonstrate any increase in performance over the empirical model, SSIBMT. The two models do, however, produce very similar results as seen in simulations, especially for experiments with high prior degrees of freedom, or correspondingly, a small log-normal variance parameter. For low prior degrees of freedom or a high log-normal variance parameter, i.e. when the functional relationship is more difficult to estimate due to a higher noise level, the empirical model actually demonstrated an advantage over the full Bayesian model.

CHAPTER 5: Discussion and conclusions

5.1. Combined benefits of IBMT and LRpath

5.1.1. Simulations

In this section we provide results showing the combined benefit of using IBMT and LRpath as compared to the t-test and Fisher's exact or BayGO, the other two gene set enrichment testing methods that allow the investigator to choose the method for detecting differentially expressed genes. Simulations were run assuming prior degrees of freedom $d_0$=4 and the medium strength of dependency on average log-intensities, as defined in Chapter 2. All simulations were performed with 10,000 “genes”, with 500 (5%) genes, or in one set 1,000 (10%) genes, designed to be “differentially expressed”. The rest of the simulation steps were identical to those in Chapter 3 and were performed to compare the performance of six methods (t-test + LRpath, IBMT + LRpath, t-test + Fisher's exact, IBMT + Fisher's exact, t-test + BayGO, IBMT + BayGO). The data were first analyzed for detection of differentially expressed genes (DEGs) using both a simple t-test and IBMT. We performed Fisher's exact test using p-value cutoffs of 0.001, 0.01, and 0.05, and for BayGO we used a 0.01 cutoff. For each variable set tested as described in Chapter 3, we examined (1) the calculated FDR values from enriched GO terms, and (2) the ranking of enriched GO terms (ranked by significance). We compared the ability of the methods to identify enriched GO terms by comparing the adjusted ranking of enriched categories (Figure 5.1) and the percent of enriched categories having an FDR < 0.10 (Figure 5.2).

In Figure 5.1, we see that while IBMT and LRpath both provide an individual advantage over the displayed alternatives, used together they result in the best performance. The benefit of LRpath with IBMT as compared to Fisher's exact with IBMT strongly depends on what cutoff is used for differentially expressed genes in Fisher's exact test (compare solid red line to solid blue, black, and brown lines). However, since the optimal cutoff is unknown for any specific experimental dataset, the benefit of using LRpath would also not be known. The use of IBMT with LRpath also provides an additional benefit compared to LRpath with a simple t-test. This advantage was relatively constant across the simulation parameter sets. Practically, Figure 5.1 shows that the difference between using a simple t-test with Fisher's exact versus IBMT with LRpath could mean the difference between the truly enriched gene sets being ranked 30-50 in the list and being ranked in the top few.

Figure 5.1 Combined benefit of LRpath and IBMT in terms of ranking enriched gene sets
Log10-rankings of enriched GO terms were calculated to compare the ability of methods to correctly rank these categories at the top of the list. Thus, lower ranking scores are better. Methods are LRpath (LR), Fisher's exact (FE) with the following 3 criteria for detecting DEGs (p<0.001, p<0.01, and p<0.05), and BayGO, each with IBMT and a t-test. Initial 4 parameter sets (1) used 2 enriched categories, 90, 75, 50, and 25% enrichment with DEGs respectively, 500 total DEGs, normally distributed fold changes, and 3 replicates for treated and control groups. Subsequent groups of parameter sets are the same as in Figure 3.1, namely (2) using 1000, rather than 500, DEGs, (3) using DEGs with higher fold changes, (4) using 5, rather than 2, enriched GO terms, and (5) using 5 replicates instead of 3. Data shown are averages from 20 simulation runs for each parameter set.

Figure 5.2 Combined benefit of LRpath and IBMT in terms of detecting enriched gene sets as significant
Percent of enriched GO terms having an FDR (False Discovery Rate) less than 0.10. Methods are LRpath (LR), Fisher's exact (FE) with the following 3 criteria for detecting DEGs (p<0.001, p<0.01, and p<0.05), and BayGO, each with IBMT and a t-test. Initial 4 parameter sets (1) used 2 enriched categories, 90, 75, 50, and 25% enrichment with DEGs respectively, 500 total DEGs, normally distributed fold changes, and 3 replicates for treated and control groups. Subsequent groups of parameter sets are the same as in Figure 3.2, namely (2) using 1000, rather than 500, DEGs, (3) using DEGs with higher fold changes, (4) using 5, rather than 2, enriched GO terms, and (5) using 5 replicates instead of 3. Data shown are averages from 20 simulation runs for each parameter set.

Looking at the ability of the methods to detect the enriched gene sets as significant (Figure 5.2), we see the same trend as in Figure 5.1. LRpath with IBMT performed best for all simulated parameter sets except for the set using a higher number of replicates, where the t-test and IBMT with LRpath performed approximately equally. This is not surprising, since IBMT's advantage over the t-test diminishes as the number of replicates in the experiment increases. IBMT again displays an advantage over the simple t-test in detecting enriched gene sets regardless of the other method used (LRpath, Fisher's exact, or BayGO). This gives support to the results of the case studies in Chapter 2, where we found IBMT provided more significantly enriched gene sets than the other methods tested. Here, the overall range in performance between IBMT with LRpath and a simple t-test with Fisher's exact is even more striking; it could mean the difference between detecting over 80% of the affected gene sets as being enriched and detecting less than 30% (compare solid red to dashed brown line).

5.1.2. Breast cancer results

Similar to Section 3.2.3.1 of Chapter 3, we analyzed subsets of a breast cancer study (Sotiriou et al. 2006) examining the difference between histologic tumor grade 3 versus grade 1 samples. By constructing a gold standard set of true and false Gene Ontology terms (see Section 4.3 of Chapter 3), and analyzing random subsets of 3, 4, or 6 replicates, we calculated ROC curves the same as in Figure 3.4, except using different methods. The methods tested were Fisher's exact (with cutoffs of 0.01 and 0.001 for DEGs), BayGO, and LRpath, each with IBMT and a simple t-test. LRpath with IBMT proved to be the statistically most powerful approach in all cases, with BayGO + IBMT having the next best performance (Figure 5.3). This was determined by comparing the height of the ROC curve in the portion of the curve containing fewer false positives than true positives. In every case, LRpath, Fisher's exact, and BayGO used in conjunction with IBMT outperformed the same method used with a simple t-test, often with a substantial performance increase. This difference between IBMT and the simple t-test within the same GO enrichment method illustrates the size of the effect that the test for DEGs can have on GO term enrichment analysis, and the importance of having the option to choose the most appropriate DEG testing method. As in the simulation results, the combined benefit of using IBMT with LRpath as opposed to the simple t-test with Fisher's exact strongly depends on the cutoff chosen for DEGs. This difference was as drastic as detecting 28 GO terms with less than 10% false positives using LRpath and IBMT versus not detecting any significant GO terms with Fisher's exact and a t-test (compare solid red to dashed brown or black line in Figure 5.3c).
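The ROC-like comparison described above amounts to walking down a ranked list of GO terms and counting true and false positives against the gold standard sets. The Python sketch below illustrates that bookkeeping under assumptions: the function names are hypothetical, the term labels are placeholders rather than real GO identifiers, and terms absent from both gold standard sets are simply skipped.

```python
def roc_like_curve(ranked_terms, true_terms, false_terms):
    """Cumulative (false positives, true positives) down a ranked list."""
    true_pos = 0
    false_pos = 0
    points = []
    for term in ranked_terms:
        if term in true_terms:
            true_pos += 1
        elif term in false_terms:
            false_pos += 1
        else:
            continue  # term is in neither gold standard set
        points.append((false_pos, true_pos))
    return points


def true_positives_within_fp_fraction(points, max_fp_fraction=0.10):
    """Most true positives reachable while false positives stay within
    the given fraction of all gold-standard calls made so far."""
    best = 0
    for false_pos, true_pos in points:
        if false_pos <= max_fp_fraction * (false_pos + true_pos):
            best = max(best, true_pos)
    return best


# Placeholder term labels, ordered from most to least significant.
ranking = ["term_a", "term_b", "term_x", "term_c", "term_u"]
curve_points = roc_like_curve(ranking, {"term_a", "term_b", "term_c"}, {"term_x"})
detected = true_positives_within_fp_fraction(curve_points)
```

Comparing `detected` across methods at a fixed false positive fraction corresponds to comparing the heights of the curves in the region with few false positives.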

Figure 5.3 Combined benefit of LRpath and IBMT: ROC-like curves for breast cancer dataset
Samples for grade 3 vs. grade 1 tumors, all with positive ER status, were randomly selected to create 60 subsets (20 subsets each containing 3, 4, or 6 biological replicates of grade 3 and grade 1). Subsets were then analyzed for DEGs and tested for enriched GO terms using the following methods: LRpath, Fisher's exact (FE) with cutoffs of 0.01 and 0.001 for DEGs, and BayGO with 0.01 cutoff, each with IBMT and a simple t-test. GO terms were then ranked and the true and false positives determined by comparing to the gold standard sets. Gray line indicates point at which there are an equal number of true and false positives. (A) 3 replicates, (B) 4 replicates, and (C) 6 replicates.


5.2. Integrated discussion and conclusions

As seen in Sections 1.2.3 and 1.2.4 of the background material, numerous methods were previously developed both for testing for differential expression of gene transcripts and for gene set enrichment analysis. Many of these methods were originally published without an attempt to systematically or objectively compare the novel method with previous relevant methods. In this dissertation we have made an effort to be diligent in comparing our novel methods with other relevant methods systematically and objectively. When simulations were performed, they were designed to imitate real microarray data as closely as possible, rather than to fit the assumed distributions of our models, and for LRpath, the simulations were structured to use actual gene assignments to predefined biological groups. For gene set enrichment analysis in particular, an obvious lack of method comparisons exists in the field, and we hope that not only did we help to fill that gap, but also that we provided the community with an integrative approach to objective comparisons with future proposed methods. Without an absolute gold-standard or known truth available for real-world datasets, one is often left to rely solely on biological interpretations. We have attempted to remain objective by using a technique that involves obtaining a consensus between well-designed simulations and comparisons with real-world datasets.

In this dissertation, four methods have been implemented as R functions: IBMT, SSIBMT, fbSpline, and LRpath. We chose the R programming environment as our preferred distribution platform because of its flexibility, its popularity among the microarray community, its role as the vehicle through which to use Bioconductor (a software project dedicated to the analysis of genomic data), and its open-source, open-distribution nature. This will allow researchers to easily use our proposed methods to analyze their microarray data regardless of the platform, species, or library used to generate the data, as well as to compare our methods to other algorithms.


The statistical methodologies in this dissertation were inspired by the analysis of real-world data and developed paying close attention to the properties of real experimental data; as such they are intended to be far more than simply mathematical exercises. Our methods are flexible in that they can be used with any microarray platform, any experimental design, and any linear model; in addition, LRpath allows the investigator to choose the testing method for differentially expressed transcripts. LRpath displayed robust behavior and improved statistical power to discern enriched biological groupings when compared to tested alternatives. By providing a more accurate assessment of the individual changes in mRNA steady state levels as well as the biological categories enriched with differential expression, the methodologies developed in this dissertation can empower investigators to gain valuable biological insight into the results of genome-wide expression data. We hope to have made a valuable contribution to the field.

5.3. Future directions

The methodologies in this dissertation have certain limitations that provide further opportunities for improvement. The Bayesian methods for identifying differentially expressed genes (IBMT, SSIBMT, and fbSpline) assume that the variance for any gene g does not change across treatment conditions (or treatment pairs in the case of dual-channel arrays). In addition, these methods assume no measurement error in the average log-intensities. If the actual measurement error is substantial, it may result in an underestimation of the prior degrees of freedom, so that the methods rely less on the background variance level than would be optimal.

For identifying enriched gene sets, LRpath is currently applied only to Gene Ontology and KEGG. This could easily be expanded to include testing for enriched or cytobands, as well as other functional databases. Alternatively, we could define embedded networks by extending the current classically-defined KEGG pathways to include the members' nearest neighbors in well-defined protein-protein interaction networks. One could then test under what conditions, if any, the embedded networks provide an optimal increase in information over the classical pathways.


Bibliography

Al Shahrour, F., R. Diaz-Uriarte and J. Dopazo. 2004. FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics., 20: 578-580.

Allison, D. B., X. Cui, G. P. Page and M. Sabripour. 2006. Microarray data analysis: from disarray to consolidation and consensus. Nat. Rev. Genet., 7: 55-65.

Ashburner, M., C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin and G. Sherlock. 2000. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet., 25: 25-29.

Baldi, P. and A. D. Long. 2001. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics., 17: 509-519.

Basilico, C. and D. Moscatelli. 1992. The FGF family of growth factors and oncogenes. Adv. Cancer Res., 59: 115-165.

Beissbarth, T. 2006. Interpreting experimental results using gene ontologies. Methods Enzymol., 411: 340-352.

Benjamini, Y. and Y. Hochberg. 1995. Controlling the False Discovery Rate: a Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society B, 57: 289-300.

Berriz, G. F., O. D. King, B. Bryant, C. Sander and F. P. Roth. 2003. Characterizing gene sets with FuncAssociate. Bioinformatics., 19: 2502-2504.

Blangiardo, M., S. Toti, B. Giusti, R. Abbate, A. Magi, F. Poggi, L. Rossi, F. Torricelli and A. Biggeri. 2006. Using a calibration experiment to assess gene-specific information: full Bayesian and empirical Bayesian models for two-channel microarray data. Bioinformatics., 22: 50-57.

Bolstad, B. M., R. A. Irizarry, M. Astrand and T. P. Speed. 2003. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics., 19: 185-193.

Bonniaud, P., P. J. Margetts, K. Ask, K. Flanders, J. Gauldie and M. Kolb. 2005. TGF-beta and Smad3 signaling link inflammation to chronic fibrogenesis. J Immunol., 175: 5390-5395.


de Boor, C. 2001. A Practical Guide to Splines. Springer Verlag, New York.

Broet, P., S. Richardson and F. Radvanyi. 2002. Bayesian hierarchical model for identifying changes in gene expression from microarray experiments. J Comput. Biol, 9: 671-683.

Carreras, I., C. B. Rich, J. A. Jaworski, S. J. Dicamillo, M. P. Panchenko, R. Goldstein and J. A. Foster. 2001. Functional components of basic fibroblast growth factor signaling that inhibit lung elastin gene expression. Am. J. Physiol Lung Cell Mol. Physiol, 281: L766-L775.

Chambers, J. M. and T. J. Hastie. 1992. Statistical Models in S. Wadsworth and Brooks, Pacific Grove, CA.

Chaudhary, N. I., A. Schnapp and J. E. Park. 2006. Pharmacologic differentiation of inflammation and fibrosis in the rat bleomycin model. Am. J Respir. Crit Care Med., 173: 769-776.

Choe, S. E., M. Boutros, A. M. Michelson, G. M. Church and M. S. Halfon. 2005. Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset. Genome Biol., 6: R16.

Chollet-Martin, S. 2000. Polymorphonuclear neutrophil activation during the acute respiratory distress syndrome. Intensive Care Med., 26: 1575-1577.

Churchill, G. A. 2002. Fundamentals of experimental design for cDNA microarrays. Nat. Genet., 32 Suppl: 490-495.

Collard, H. R., J. H. Ryu, W. W. Douglas, M. I. Schwarz, D. Curran-Everett, T. E. King, Jr. and K. K. Brown. 2004. Combined corticosteroid and cyclophosphamide therapy does not alter survival in idiopathic pulmonary fibrosis. Chest, 125: 2169-2174.

Cui, X. and G. A. Churchill. 2003. Statistical tests for differential expression in cDNA microarray experiments. Genome Biol., 4: 210.

Curtis, R. K., M. Oresic and A. Vidal-Puig. 2005. Pathways to the analysis of microarray data. Trends Biotechnol., 23: 429-435.

Dabney, A. R. and J. D. Storey. 2006. A reanalysis of a published Affymetrix GeneChip control dataset. Genome Biol, 7: 401.

Dennis, G., Jr., B. T. Sherman, D. A. Hosack, J. Yang, W. Gao, H. C. Lane and R. A. Lempicki. 2003. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol, 4: 3.

Draghici, S., P. Khatri, R. P. Martins, G. C. Ostermeier and S. A. Krawetz. 2003. Global functional profiling of gene expression. Genomics, 81: 98-104.


Draghici, S., P. Khatri, A. L. Tarca, K. Amin, A. Done, C. Voichita, C. Georgescu and R. Romero. 2007. A systems biology approach for pathway level analysis. Genome Res., 17: 1537-1545.

Dudoit, S., Y. H. Yang, M. J. Callow and T. P. Speed. 2002. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica, 12: 111-139.

Durbin, B. P., J. S. Hardin, D. M. Hawkins and D. M. Rocke. 2002. A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics., 18 Suppl 1: S105-S110.

Eaves, I. A., L. S. Wicker, G. Ghandour, P. A. Lyons, L. B. Peterson, J. A. Todd and R. J. Glynne. 2002. Combining mouse congenic strains and microarray gene expression analyses to study a complex trait: the NOD model of type 1 diabetes. Genome Res., 12: 232-243.

Eckel, J. E., C. Gennings, V. M. Chinchilli, L. D. Burgoon and T. R. Zacharewski. 2004. Empirical bayes gene screening tool for time-course or dose-response microarray data. J. Biopharm. Stat, 14: 647-670.

Efron, B. 2006b. Correlation and large-scale simultaneous significance testing. Department of Statistics, Stanford University technical report.

Efron, B. 2006a. Local false discovery rates. Department of Statistics, Stanford University technical report.

Efron, B., R. Tibshirani, J. D. Storey and V. Tusher. 2001. Empirical bayes analysis of a microarray experiment. J Amer Stat Assoc, 96: 1151-1160.

Efron, B. and R. Tibshirani. 2002. Empirical bayes methods and false discovery rates for microarrays. Genet. Epidemiol., 23: 70-86.

Fox, R. J. and M. W. Dimmic. 2006. A two-sample Bayesian t-test for microarray data. BMC. Bioinformatics., 7: 126.

Gabazza, E. C., M. Kasper, K. Ohta, M. Keane, C. D'Alessandro-Gabazza, H. Fujimoto, Y. Nishii, H. Nakahara, T. Takagi, A. G. Menon, Y. Adachi, K. Suzuki and O. Taguchi. 2004. Decreased expression of aquaporin-5 in bleomycin-induced lung fibrosis in the mouse. Pathol. Int., 54: 774-780.

Gentleman, R. C. Bioconductor package, GOstats vignette: http://bioconductor.org/packages/2.1/bioc/vignettes/GOstats/inst/doc/GOstatsHyperG.pdf. 10-3-2007.

Gentleman, R. C., V. J. Carey, D. M. Bates, B. Bolstad, M. Dettling, S. Dudoit, B. Ellis, L. Gautier, Y. Ge, J. Gentry, K. Hornik, T. Hothorn, W. Huber, S. Iacus, R. Irizarry, F. Leisch, C. Li, M. Maechler, A. J. Rossini, G. Sawitzki, C. Smith, G. Smyth, L. Tierney, J. Y. Yang and J. Zhang. 2004. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol., 5: R80.

Gill, J. 2002. Bayesian methods: a social and behavioral sciences approach. Chapman & Hall/CRC, Boca Raton.

Goeman, J. J., S. A. van de Geer, F. de Kort and H. C. van Houwelingen. 2004. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics., 20: 93-99.

Golub, T. R., D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield and E. S. Lander. 1999. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286: 531-537.

Guo, J., M. Sartor, S. Karyala, M. Medvedovic, S. Kann, A. Puga, P. Ryan and C. R. Tomlinson. 2004. Expression of genes in the TGF-beta signaling pathway is significantly deregulated in smooth muscle cells from aorta of aryl hydrocarbon receptor knockout mice. Toxicol. Appl. Pharmacol., 194: 79-89.

Hardie, W. D., D. R. Prows, A. Piljan-Gentle, M. R. Dunlavy, S. C. Wesselkamper, G. D. Leikauf and T. R. Korfhagen. 2002. Dose-related protection from nickel-induced lung injury in transgenic mice expressing human transforming growth factor-alpha. Am. J. Respir. Cell Mol. Biol., 26: 430-437.

Harris, M. A., J. Clark, A. Ireland, J. Lomax, M. Ashburner, R. Foulger, K. Eilbeck, S. Lewis, B. Marshall, C. Mungall, J. Richter, G. M. Rubin, J. A. Blake, C. Bult, M. Dolan, H. Drabkin, J. T. Eppig, D. P. Hill, L. Ni, M. Ringwald, R. Balakrishnan, J. M. Cherry, K. R. Christie, M. C. Costanzo, S. S. Dwight, S. Engel, D. G. Fisk, J. E. Hirschman, E. L. Hong, R. S. Nash, A. Sethuraman, C. L. Theesfeld, D. Botstein, K. Dolinski, B. Feierbach, T. Berardini, S. Mundodi, S. Y. Rhee, R. Apweiler, D. Barrell, E. Camon, E. Dimmer, V. Lee, R. Chisholm, P. Gaudet, W. Kibbe, R. Kishore, E. M. Schwarz, P. Sternberg, M. Gwinn, L. Hannick, J. Wortman, M. Berriman, V. Wood, C. N. de la, P. Tonellato, P. Jaiswal, T. Seigfried and R. White. 2004. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res., 32: D258-D261.

Hastie, T. J. and R. J. Tibshirani. 1990. Generalized Additive Models. Chapman and Hall, New York.

Hosack, D. A., G. Dennis, Jr., B. T. Sherman, H. C. Lane and R. A. Lempicki. 2003. Identifying biological themes within lists of genes with EASE. Genome Biol., 4: R70.

Ihaka, R. and R. Gentleman. 1996. R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5: 299-314.

Irizarry, R. A., B. M. Bolstad, F. Collin, L. M. Cope, B. Hobbs and T. P. Speed. 2003a. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res., 31: e15.

Irizarry, R. A., B. Hobbs, F. Collin, Y. D. Beazer-Barclay, K. J. Antonellis, U. Scherf and T. P. Speed. 2003b. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics., 4: 249-264.

Jaffrezic, F., G. Marot, S. Degrelle, I. Hue and J. L. Foulley. 2007. A structural mixed model for variances in differential gene expression studies. Genet. Res., 89: 19-25.

Jain, N., J. Thatte, T. Braciale, K. Ley, M. O'Connell and J. K. Lee. 2003. Local-pooled-error test for identifying differentially expressed genes with a small number of replicated microarrays. Bioinformatics., 19: 1945-1951.

Johnson, N. L. and S. Kotz. 1970. Distributions in Statistics: Continuous Univariate Distributions - 2. Wiley, New York.

Kaminski, N., J. D. Allard, J. F. Pittet, F. Zuo, M. J. Griffiths, D. Morris, X. Huang, D. Sheppard and R. A. Heller. 2000. Global analysis of gene expression in pulmonary fibrosis reveals distinct programs regulating lung inflammation and fibrosis. Proc. Natl. Acad. Sci. U. S. A, 97: 1778-1783.

Kaminski, N. and I. O. Rosas. 2006. Gene expression profiling as a window into idiopathic pulmonary fibrosis pathogenesis: can we identify the right target genes? Proc. Am. Thorac. Soc., 3: 339-344.

Kanehisa, M. 2002. The KEGG database. Novartis. Found. Symp., 247: 91-101.

Kanehisa, M., S. Goto, M. Hattori, K. F. Aoki-Kinoshita, M. Itoh, S. Kawashima, T. Katayama, M. Araki and M. Hirakawa. 2006. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res., 34: D354-D357.

Kang, H. R., S. J. Cho, C. G. Lee, R. J. Homer and J. A. Elias. 2007. Transforming Growth Factor (TGF)-beta1 Stimulates Pulmonary Fibrosis and Inflammation via a Bax-dependent, Bid-activated Pathway That Involves Matrix Metalloproteinase-12. J Biol Chem., 282: 7723-7732.

Karyala, S., J. Guo, M. Sartor, M. Medvedovic, S. Kann, A. Puga, P. Ryan and C. R. Tomlinson. 2004. Different global gene expression profiles in benzo[a]pyrene- and dioxin-treated vascular smooth muscle cells of AHR-knockout and wild-type mice. Cardiovasc. Toxicol., 4: 47-73.

Kerr, K. M. and G. A. Churchill. 2001. Experimental design for gene expression microarrays. Biostatistics, : 183-201.

Kerr, K. M., M. Martin and G. A. Churchill. 2000. Analysis of variance for gene expression microarray data. Journal of Computational Biology, : 819-837.

Khatri, P., S. Sellamuthu, P. Malhotra, K. Amin, A. Done and S. Draghici. 2005. Recent additions and improvements to the Onto-Tools. Nucleic Acids Res., 33: W762-W765.

Khatri, P. and S. Draghici. 2005. Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics., 21: 3587-3595.

King, L. S. and P. Agre. 1996. Pathophysiology of the aquaporin water channels. Annu. Rev. Physiol, 58: 619-648.

Kooperberg, C., T. G. Fazzio, J. J. Delrow and T. Tsukiyama. 2002. Improved background correction for spotted DNA microarrays. J. Comput. Biol., 9: 55-66.

Kubo, H., K. Nakayama, M. Yanai, T. Suzuki, M. Yamaya, M. Watanabe and H. Sasaki. 2005. Anticoagulant therapy for idiopathic pulmonary fibrosis. Chest, 128: 1475-1482.

Leung, Y. F. and D. Cavalieri. 2003. Fundamentals of cDNA microarray data analysis. Trends Genet., 19: 649-659.

Lewin, A., S. Richardson, C. Marshall, A. Glazier and T. Aitman. 2006. Bayesian modeling of differential gene expression. Biometrics, 62: 10-18.

Lewis, J. F. and A. H. Jobe. 1993. Surfactant and the adult respiratory distress syndrome. Am. Rev. Respir. Dis., 147: 218-233.

Li, C. M., J. Khosla, I. Pagan, P. Hoyle and P. L. Sannes. 2000. TGF-beta1 and fibroblast growth factor-1 modify fibroblast growth factor-2 production in type II cells. Am. J. Physiol Lung Cell Mol. Physiol, 279: L1038-L1046.

Li, Y. and D. Ruppert. 2007. On the asymptotics of penalized splines. Biometrika.

Long, A. D., H. J. Mangalam, B. Y. Chan, L. Tolleri, G. W. Hatfield and P. Baldi. 2001. Improved statistical inference from DNA microarray data using analysis of variance and a Bayesian statistical framework. Analysis of global gene expression in Escherichia coli K12. J. Biol. Chem., 276: 19937-19944.

Lonnstedt, I. and T. P. Speed. 2002. Replicated microarray data. Statistica Sinica, 12: 31-46.

McDowell, S. A., K. Gammon, C. J. Bachurski, J. S. Wiest, J. E. Leikauf, D. R. Prows and G. D. Leikauf. 2000. Differential gene expression in the initiation and progression of nickel-induced acute lung injury. Am. J. Respir. Cell Mol. Biol., 23: 466-474.

McDowell, S. A., K. Gammon, B. Zingarelli, C. J. Bachurski, B. J. Aronow, D. R. Prows and G. D. Leikauf. 2003. Inhibition of nitric oxide restores surfactant gene expression following nickel-induced acute lung injury. Am. J. Respir. Cell Mol. Biol., 28: 188-198.

McDowell, S. A., A. Mallakin, C. J. Bachurski, K. Toney-Earley, D. R. Prows, T. Bruno, K. H. Kaestner, D. P. Witte, H. Melin-Aldana, S. J. Degen, G. D. Leikauf and S. E. Waltz. 2002. The role of the receptor tyrosine kinase Ron in nickel-induced acute lung injury. Am. J. Respir. Cell Mol. Biol., 26: 99-104.

Meng, X. L. 1994. Posterior predictive p-values. Ann. Stat., 22: 1142-1160.

Miller, L. D., J. Smeds, J. George, V. B. Vega, L. Vergara, A. Ploner, Y. Pawitan, P. Hall, S. Klaar, E. T. Liu and J. Bergh. 2005. An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proc. Natl. Acad. Sci. U. S. A, 102: 13550-13555.

Nathan, S. D., P. W. Noble and R. M. Tuder. 2007. Idiopathic Pulmonary Fibrosis and Pulmonary Hypertension: Connecting the Dots. Am. J Respir. Crit Care Med.

Nebert, D. W. 1989. The Ah locus: genetic differences in toxicity, cancer, mutation, and birth defects. Crit Rev Toxicol, 20: 153-174.

Newton, M. A., C. M. Kendziorski, C. S. Richmond, F. R. Blattner and K. W. Tsui. 2001. On Differential Variability of Expression Ratios: Improving Statistical Inference about Gene Expression Changes from Microarray Data. J. Comput. Biol., 8: 37-52.

Newton, M. A., A. Noueiry, D. Sarkar and P. Ahlquist. 2004. Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics., 5: 155-176.

Olson, N. E. 2006. The microarray data analysis process: from raw data to biological significance. NeuroRx., 3: 373-383.

Osborne, J. D., L. J. Zhu, S. M. Lin and W. A. Kibbe. 2007. Interpreting microarray results with gene ontology and MeSH. Methods Mol Biol, 377: 223-242.

Pan, K. H., C. J. Lih and S. N. Cohen. 2005. Effects of threshold choice on biological conclusions reached during analysis of gene expression by DNA microarrays. Proc. Natl. Acad. Sci. U. S. A, 102: 8961-8965.

Pardo, A., K. Gibson, J. Cisneros, T. J. Richards, Y. Yang, C. Becerril, S. Yousem, I. Herrera, V. Ruiz, M. Selman and N. Kaminski. 2005. Up-regulation and profibrotic role of osteopontin in human idiopathic pulmonary fibrosis. PLoS. Med., 2: e251.

Prows, D. R. and G. D. Leikauf. 2001. Quantitative trait analysis of nickel-induced acute lung injury in mice. Am. J. Respir. Cell Mol. Biol., 24: 740-746.

Puga, A., M. A. Sartor, M. Y. Huang, J. K. Kerzee, Y. D. Wei, C. R. Tomlinson, C. S. Baxter and M. Medvedovic. 2004. Gene expression profiles of mouse aorta and cultured vascular smooth muscle cells differ widely, yet show common responses to dioxin exposure. Cardiovasc. Toxicol., 4: 385-404.

Puga, A., C. R. Tomlinson and Y. Xia. 2005. Ah receptor signals cross-talk with multiple developmental pathways. Biochem Pharmacol, 69: 199-207.

Qi, Y., Y. Tu, D. Yang, Q. Chen, J. Xiao, Y. Chen, J. Fu, X. Xiao and Z. Zhou. 2007. Cyclin A but not cyclin D1 is essential for c-myc-modulated cell-cycle progression. J Cell Physiol, 210: 63-71.

Ruppert, D. 2002. Selecting the number of knots for penalized splines. J Comput Graph Stat, 11: 735-757.

Ruppert, D., M. P. Wand and R. J. Carroll. 2003. Scatterplot Smoothing Semiparametric Regression. Cambridge University Press, Cambridge.

Sannes, P. L., J. Khosla, S. Johnson, M. Goralska, C. McGahan and M. Menard. 1996. Basic fibroblast growth factor in fibrosing alveolitis induced by oxygen stress. Chest, 109: 44S-45S.

Sartor, M., J. Schwanekamp, D. Halbleib, I. Mohamed, S. Karyala, M. Medvedovic and C. R. Tomlinson. 2004. Microarray results improve significantly as hybridization approaches equilibrium. Biotechniques, 36: 790-796.

Sartor, M. A., C. R. Tomlinson, S. C. Wesselkamper, S. Sivaganesan, G. D. Leikauf and M. Medvedovic. 2006. Intensity-based hierarchical Bayes method improves testing for differentially expressed genes in microarray experiments. BMC. Bioinformatics., 7: 538.

Selman, M., A. Pardo, L. Barrera, A. Estrada, S. R. Watson, K. Wilson, N. Aziz, N. Kaminski and A. Zlotnik. 2006. Gene expression profiles distinguish idiopathic pulmonary fibrosis from hypersensitivity pneumonitis. Am. J Respir. Crit Care Med., 173: 188-198.

Shen, A. S., C. Haslett, D. C. Feldsien, P. M. Henson and R. M. Cherniack. 1988. The intensity of chronic lung inflammation and fibrosis after bleomycin is directly related to the severity of acute injury. Am. Rev Respir. Dis., 137: 564-571.

Smyth, G. K. 2004. Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments. Statistical Applications in Genetics and Molecular Biology, 3: Article 3.

Sotiriou, C., P. Wirapati, S. Loi, A. Harris, S. Fox, J. Smeds, H. Nordgren, P. Farmer, V. Praz, B. Haibe-Kains, C. Desmedt, D. Larsimont, F. Cardoso, H. Peterse, D. Nuyten, M. Buyse, M. J. Van de Vijver, J. Bergh, M. Piccart and M. Delorenzi. 2006. Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. J Natl. Cancer Inst., 98: 262-272.

Strand, A. D., J. M. Olson and C. Kooperberg. 2002. Estimating the statistical significance of gene expression changes observed with oligonucleotide arrays. Hum. Mol Genet., 11: 2207-2221.

Subramanian, A., P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette, A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander and J. P. Mesirov. 2005. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U. S. A, 102: 15545-15550.

Sun, L., R. Gong, B. Wan, X. Huang, C. Wu, X. Zhang, S. Zhao and L. Yu. 2003. GADD45gamma, down-regulated in 65% hepatocellular carcinoma (HCC) from 23 Chinese patients, inhibits cell growth and induces cell cycle G2/M arrest for hepatoma Hep-G2 cell lines. Mol Biol Rep., 30: 249-253.

Swanson, H. I. and C. A. Bradfield. 1993. The AH-receptor: genetics, structure and function. Pharmacogenetics, 3: 213-230.

Tavazoie, S., J. D. Hughes, M. J. Campbell, R. J. Cho and G. M. Church. 1999. Systematic determination of genetic network architecture. Nat. Genet., 22: 281-285.

Thackaberry, E. A., Z. Jiang, C. D. Johnson, K. S. Ramos and M. K. Walker. 2005. Toxicogenomic profile of 2,3,7,8-tetrachlorodibenzo-p-dioxin in the murine fetal heart: modulation of cell cycle and extracellular matrix genes. Toxicol Sci., 88: 231-241.

Tian, L., S. A. Greenberg, S. W. Kong, J. Altschuler, I. S. Kohane and P. J. Park. 2005. Discovering statistically significant pathways in expression profiling studies. Proc. Natl. Acad. Sci. U. S. A, 102: 13544-13549.

Towne, J. E., K. S. Harrod, C. M. Krane and A. G. Menon. 2000. Decreased expression of aquaporin (AQP)1 and AQP5 in mouse lung after acute viral infection. Am. J. Respir. Cell Mol. Biol., 22: 34-44.

Tusher, V. G., R. Tibshirani and G. Chu. 2001. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. U. S. A, 98: 5116-5121.

Vencio, R. Z., T. Koide, S. L. Gomes and C. A. Pereira. 2006. BayGO: Bayesian analysis of ontology term enrichment in microarray data. BMC. Bioinformatics., 7: 86.

Villar, J., J. D. Edelson, M. Post, J. B. Mullen and A. S. Slutsky. 1993. Induction of heat stress proteins is associated with decreased mortality in an animal model of acute lung injury. Am. Rev. Respir. Dis., 147: 177-181.

Villar, J., S. P. Ribeiro, J. B. Mullen, M. Kuliszewski, M. Post and A. S. Slutsky. 1994. Induction of the heat shock response reduces mortality rate and organ damage in a sepsis-induced acute lung injury model. Crit Care Med., 22: 914-921.

Wahba, G. 1978. Improper priors, spline smoothing, and the problem of guarding against model errors in regression. J Royal Stat Soc, Series B, 40: 364-372.

Wahba, G. 1983. Bayesian Confidence-Intervals for the Cross-Validated Smoothing Spline. Journal of the Royal Statistical Society Series B-Methodological, 45: 133-150.

Wahba, G. 1990. Spline Models for Observational Data. Society for Industrial and Applied Mathematics, Philadelphia.

Wang, S., M. G. Hasham, I. Isordia-Salas, A. Y. Tsygankov, R. W. Colman and Y. L. Guo. 2003. Upregulation of Cdc2 and cyclin A during apoptosis of endothelial cells induced by cleaved high-molecular-weight kininogen. Am. J Physiol Heart Circ. Physiol, 284: H1917-H1923.

Wang, Y. R., X. Z. Xiao, S. N. Huang, F. J. Luo, J. L. You, H. Luo and Z. Y. Luo. 1996. Heat shock pretreatment prevents hydrogen peroxide injury of pulmonary endothelial cells and macrophages in culture. Shock, 6: 134-141.

Ware, L. B. and M. A. Matthay. 2000. The acute respiratory distress syndrome. N. Engl. J. Med., 342: 1334-1349.

Wesselkamper, S. C., L. M. Case, L. N. Henning, M. T. Borchers, J. W. Tichelaar, J. M. Mason, N. Dragin, M. Medvedovic, M. A. Sartor, C. R. Tomlinson and G. D. Leikauf. 2005a. Gene Expression Changes During the Development of Acute Lung Injury: Role of TGF-beta. Am. J Respir. Crit Care Med., 172: 1399-1411.

Wesselkamper, S. C., S. A. McDowell, M. Medvedovic, T. P. Dalton, H. S. Deshmukh, M. A. Sartor, L. M. Case, L. N. Henning, M. T. Borchers, C. R. Tomlinson, D. R. Prows and G. D. Leikauf. 2005b. The Role of Metallothionein in the Pathogenesis of Acute Lung Injury. Am. J. Respir. Cell Mol. Biol.

Wesselkamper, S. C., D. R. Prows, P. Biswas, K. Willeke, E. Bingham and G. D. Leikauf. 2000. Genetic susceptibility to irritant-induced acute lung injury in mice. Am. J. Physiol Lung Cell Mol. Physiol, 279: L575-L582.

Wolfinger, R. D., G. Gibson, E. D. Wolfinger, L. Bennett, H. Hamadeh, P. Bushel, C. Afshari and R. S. Paules. 2001. Assessing gene significance from cDNA microarray expression data via mixed models. Submitted.

Wong, H. R., R. J. Mannix, J. M. Rusnak, A. Boota, H. Zar, S. C. Watkins, J. S. Lazo and B. R. Pitt. 1996. The heat-shock response attenuates lipopolysaccharide-mediated apoptosis in cultured sheep pulmonary artery endothelial cells. Am. J. Respir. Cell Mol. Biol., 15: 745-751.

Wong, H. R., M. Ryan, S. Gebb and J. R. Wispe. 1997. Selective and transient in vitro effects of heat shock on alveolar type II cell gene expression. Am. J. Physiol, 272: L132-L138.

Yang, Y. H., S. Dudoit, P. Luu, D. M. Lin, V. Peng, J. Ngai and T. P. Speed. 2002. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res., 30: e15.

Yang, Y. H. and T. Speed. 2002. Design issues for cDNA microarray experiments. Nat. Rev. Genet., 3: 579-588.

Zeeberg, B. R., W. Feng, G. Wang, M. D. Wang, A. T. Fojo, M. Sunshine, S. Narasimhan, D. W. Kane, W. C. Reinhold, S. Lababidi, K. J. Bussey, J. Riss, J. C. Barrett and J. N. Weinstein. 2003. GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol, 4: R28.

Zeeberg, B. R., H. Qin, S. Narasimhan, M. Sunshine, H. Cao, D. W. Kane, M. Reimers, R. M. Stephens, D. Bryant, S. K. Burt, E. Elnekave, D. M. Hari, T. A. Wynn, C. Cunningham-Rundles, D. M. Stewart, D. Nelson and J. N. Weinstein. 2005. High-Throughput GoMiner, an 'industrial-strength' integrative gene ontology tool for interpretation of multiple-microarray experiments, with application to studies of Common Variable Immune Deficiency (CVID). BMC. Bioinformatics., 6: 168.

Zhong, S., K. F. Storch, O. Lipan, M. C. Kao, C. J. Weitz and W. H. Wong. 2004. GoSurfer: a graphical interactive tool for comparative analysis of large gene sets in Gene Ontology space. Appl. Bioinformatics., 3: 261-264.

Zorzetto, M., I. Ferrarotti, R. Trisolini, L. L. Agli, R. Scabini, M. Novo, A. De Silvestri, M. Patelli, M. Martinetti, M. Cuccia, V. Poletti, E. Pozzi and M. Luisetti. 2003. Complement receptor 1 gene polymorphisms are associated with idiopathic pulmonary fibrosis. Am. J Respir. Crit Care Med., 168: 330-334.

Appendices

A.1. Appendix for Chapter 2

Outline for Supplemental Material:

1) Dependency of log-variance on log-expression intensity in simulation study
2) Improved relative performance of t-test with higher sample degrees of freedom
3) Control of false positive rate in simulation study for additional parameter sets
4) Control of false positive rate in Affymetrix “spike-in” dataset
5) Variance-Intensity relationship for Affymetrix Latin-square experiment
6) Full list of significant Gene Ontology categories for MEF Ahr-/- dataset
7) List of significant Gene Ontology categories for Nickel time course
8) Top ranked genes in IBMT, but not SMT, and vice versa for Nickel data

Figure A.1 Dependency of log-variance on log-expression intensity in simulation study. Example of local regression estimation of log-variance. Similar results were found using other parameter conditions.

Figure A.2 Improved relative performance of t-test with higher sample degrees of freedom. Accumulation of false positives by increasing number of genes determined to be significant for the simple t-test (black), SMT (red), and IBMT (blue). Sample degrees of freedom are A) 4, B) 8, C) 12, and D) 16, while the prior degrees of freedom remain constant at 16. The simple t-test improves relative to SMT and IBMT as the sample degrees of freedom increase.
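The accumulation curves described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to produce the figure: the function name, the toy p-values, and the truth labels are all hypothetical.

```python
def false_positive_accumulation(p_values, is_true_de):
    """Count false positives as more and more genes are called significant.

    p_values   : one p-value per gene (hypothetical input)
    is_true_de : True where the gene was simulated as differentially expressed
    Returns a list whose k-th entry (0-based) is the number of false
    positives among the k+1 top-ranked genes.
    """
    # Rank genes from most to least significant.
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    curve = []
    false_positives = 0
    for i in order:
        if not is_true_de[i]:
            false_positives += 1
        curve.append(false_positives)
    return curve


# Hypothetical toy data: 6 genes, 3 of them truly differentially expressed.
p_values = [0.001, 0.20, 0.004, 0.03, 0.50, 0.01]
is_true_de = [True, False, True, False, False, True]
curve = false_positive_accumulation(p_values, is_true_de)
print(curve)  # [0, 0, 0, 1, 2, 3]
```

A method whose curve stays lower for longer ranks the truly changed genes ahead of the unchanged ones, which is the comparison the four panels make.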

Figure A.3 Control of false positive rate in simulation studies for additional parameter sets. Actual vs. estimated false positive rates were plotted for a sample of parameter sets in the simulations described. 300 genes were simulated as differentially expressed, and results shown are the average of 100 simulations. Similar results, showing that all three methods (t-test, SMT, and IBMT) correctly controlled the false positive rate, were obtained using different sample degrees of freedom.

Table A.1 Control of false positive rate in Affymetrix “spike-in” dataset. Comparison of methods to control false positive rate at the 1% and 5% levels. The first column lists the cutoff value used for each row. The second column displays the number of genes that should be determined significant using the corresponding cutoff from the first column, if the false positive rate is correctly controlled. The third column shows the number of genes estimated to be significant using the corresponding Benjamini-Hochberg FDR cutoff. The number and percent of “extra” genes determined to be significant, beyond what would be found according to the true q-value, is indicated in the last column. For both cutoffs, IBMT finds the lowest percent of additional genes.

Method | x (q-value cutoff) | # genes significant by true q-value | # genes significant by estimated q-value | # (%) extra genes
IBMT | 0.05 | 926 | 1555 | 629 (40%)
Cyber-T | 0.05 | 871 | 1624 | 753 (46%)
SMT | 0.05 | 708 | 1517 | 809 (53%)
t-test | 0.05 | 737 | 1309 | 572 (44%)
IBMT | 0.01 | 659 | 1216 | 557 (46%)
Cyber-T | 0.01 | 617 | 1363 | 746 (55%)
SMT | 0.01 | 344 | 1179 | 835 (71%)
t-test | 0.01 | 253 | 931 | 678 (73%)
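The last column of Table A.1 can be recomputed from the two gene counts in each row. The sketch below is only a consistency check. It assumes that the percentage is taken relative to the number of genes called significant by the estimated q-value, which is what the tabulated values agree with. The function name is hypothetical.

```python
# Rows of Table A.1: (method, q-value cutoff, genes significant by the
# true q-value, genes significant by the estimated q-value).
rows = [
    ("IBMT", 0.05, 926, 1555),
    ("Cyber-T", 0.05, 871, 1624),
    ("SMT", 0.05, 708, 1517),
    ("t-test", 0.05, 737, 1309),
    ("IBMT", 0.01, 659, 1216),
    ("Cyber-T", 0.01, 617, 1363),
    ("SMT", 0.01, 344, 1179),
    ("t-test", 0.01, 253, 931),
]


def extra_genes(n_true, n_estimated):
    """Number and rounded percent of genes called beyond the true count."""
    extra = n_estimated - n_true
    # Assumption: the percent is relative to the estimated count.
    percent = round(100 * extra / n_estimated)
    return extra, percent


for method, cutoff, n_true, n_estimated in rows:
    extra, percent = extra_genes(n_true, n_estimated)
    print(f"{method:8s} {cutoff:.2f} {extra:4d} ({percent}%)")
```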

Figure A.4 Variance-Intensity relationship for Affymetrix Latin-square experiment. The HG-U133 Latin-square experiment illustrates a typical relationship between log-variance and average log-intensity after preprocessing the raw data with RMA.

Table A.2 Full list of significant Gene Ontology categories for MEF Ahr-/- dataset. Gene Ontology categories that had a Bonferroni-adjusted p-value<0.10 for each of the 4 tested methods, listing category (# significant genes in category) and Bonferroni-adjusted p-value.

GO term rank | T-test | FOLD | SMT (eBayes) | IBMT
1 | Extracellular space (77) 5.3E-003 | Extracellular (91) 2.6E-005 | Extracellular (90) 5.9E-005 | Extracellular (92) 1.8E-006
2 | Extracellular (84) 9.1E-003 | Extracellular space (82) 6.25E-005 | Extracellular space (81) 1.4E-004 | Response to biotic stimulus (39) 1.0E-005
3 | Integrin binding (5) 2.8E-002 | Signal transducer activity (67) 1.7E-002 | Receptor binding (27) 1.2E-003 | Extracellular space (80) 6.7E-005
4 | Spermidine biosynthesis (3) 4.1E-002 | Organogenesis (38) 3.5E-002 | Chemoattractant activity (8) 3.6E-003 | Response to external stimulus (46) 2.7E-004
5 | Spermine biosynthesis (3) 4.1E-002 | Chemoattractant activity (7) 4.3E-002 | Signal transducer activity (68) 4.0E-003 | Defense response (34) 2.9E-004
6 | Carboxy peptidase activity (6) 6.8E-002 | Receptor binding (24) 4.9E-002 | Response to biotic stimulus (33) 8.6E-003 | Signal transducer activity (68) 2.0E-003
7 | | Histogenesis and organogenesis (9) 7.8E-002 | Chemokine receptor binding (7) 1.9E-002 | Chemoattractant activity (8) 3.1E-003
8 | | Morphogenesis (39) 8.6E-002 | Chemokine activity (7) 1.9E-002 | Immune response (27) 6.1E-003
9 | | | Integrin binding (5) 1.9E-002 | Response to pest/pathogen/parasite (19) 9.8E-003
10 | | | G-protein-coupled receptor binding (7) 2.4E-002 | Chemokine activity (7) 1.6E-002
11 | | | Spermidine biosynthesis 3.2E-002 | Chemokine receptor binding (7) 1.6E-002
12 | | | Spermine biosynthesis (3) 3.2E-002 | G-protein-coupled receptor binding (7) 2.1E-002
13 | | | Defense response (29) 6.1E-002 | Spermine biosynthesis (3) 3.0E-002
14 | | | | Spermidine biosynthesis (3) 3.0E-002
15 | | | | Extracellular matrix (18) 3.0E-002
16 | | | | Receptor binding (23) 7.4E-002
17 | | | | Response to stress (29) 8.4E-002

Table A.3 Top ranked genes in IBMT, but not SMT, and vice versa for Nickel data. Genes in top-ranked list for IBMT, but not SMT, and vice versa. For each gene, the Entrez Gene ID: Gene Symbol (Average expression level) and fold change versus control are shown.

Time | IBMT, not SMT | SMT, not IBMT
03 hr | 108014: Sfrs9 (415) -2.47 | 68646: 1110020G09Rik (264) -2.54
03 hr | 19172: Psmb4 (810) 2.38 | 69697: 2310057J16Rik (155) -2.95
03 hr | 69216: Ccdc23 (332) -2.81 | 24099: Tnfsf13b (141) -3.42
03 hr | 13806: Eno1 (1174) 2.71 | 16671: Krt1-3 (101) -3.57
03 hr | 22240: Dpysl3 (262) -3.26 | AL021127: AL021127 (110) -3.54
03 hr | 17120: Mad1l1 (268) -3.76 | 66253: Aig1 (108) -3.17
03 hr | 75434: 1700001C02Rik (370) -4.23 | 15357: Hmgcr (128) -3.17
03 hr | 234353: D430018P08 (258) -3.68 | 14962: H2-Bf (183) 3.53
03 hr | 11857: Arhgdib (170) -4.42 | 16907: Lmnb2 (131) -4.01
03 hr | 50724: Sap30l (235) -5.52 | 67235: Zfp99 (76) -4.05
03 hr | 100952: Emilin1 (358) -3.62 | 70730: 6330409D20Rik (65) -6.45
03 hr | 54614: Prpf40b (258) 3.22 | 72792: 2810459M11Rik (195) -3.19
03 hr | 104458: Rars (291) -4.65 | 74356: 4931428F04Rik (99) -4.19
03 hr | 19737: Rgs5 (222) -4.24 | 15978: Ifng (115) 3.7
03 hr | 18821: Pln (142) -8.09 | 69698: 2310046K01Rik (91) -3.93
03 hr | 59054: Mrps30 (234) -6.49 | 50501: Prok2 (58) -6.02
03 hr | 13163: Daxx (276) -5.48 | 19113: Prlpe (98) -5.73
03 hr | 66925: Sdhd (388) -5.12 | 545156: Kalrn (149) -4.17
03 hr | 18113: Nnmt (107) -9.78 | 75799: 4930444P10Rik (45) -8.92
03 hr | 72297: B3gnt3 (130) -13.21 | 12774: Ccr5 (58) 3.86
03 hr | 78482: 1700123L14Rik (220) -15.35 | 16528: Kcnk4 (109) -5.51
03 hr | 14767: Gpr66 (132) -13.77 | 71934: Car13 (67) -5.36
08 hr | 72341: Tmem103 (3141) 2.22 | 17022: Lum (125) -2.09
08 hr | 56312: Nupr1 (442) 2.15 | 18739: Pitpnm (165) -2.05
08 hr | 68603: Pmvk (255) 2.42 | 70357: Kcnip1 (65) -2.13
08 hr | 212706: C330016O10Rik (270) 2.25 | 75502: Cklfsf2b (94) -2.39
08 hr | 17071: Ly6f (4703) 2.25 | 66275: 1810009K13Rik (189) 2.21
08 hr | 15006: H2-Q1 (245) -2.23 | 12182: Bst1 (105) 2.29
08 hr | 15433: Hoxd13 (875) 2.31 | 70567: 5730455O13Rik (97) -2.69
08 hr | 56295: Higd1a (412) 2.74 | 30052: Pcsk1n (101) -2.91
08 hr | 12315: Calm3 (395) -2.38 | 68184: Denr (138) 2.55
08 hr | 18481: Pak3 (468) 2.76 | 80708: Pacsin3 (97) -2.56
08 hr | 70235: Wdr51a (187) -2.9 | 73456: Izumo1 (91) 3.1
08 hr | 18725: Pira2 (4812) 3.86 | 57249: Gabrq (110) 2.93
08 hr | 16581: Kifc2 (120) -4.13 | 21784: Tff1 (66) -3.68
08 hr | 18784: Pla2g5 (127) -5.85 | 18111: Nnat (94) -3.4
08 hr | 12526: Cd8b (162) -4.88 | 74238: Mterfd3 (77) -2.95
24 hr | 12409: Cbr2 (31685) -1.68 | 17768: Mthfd2 (65) 2.56
24 hr | 30806: Adamts8 (278) 1.93 | 75870: Tcam1 (88) -2.14
24 hr | 69202: 2610009E16Rik (6090) -1.96 | 14682: Gnaq (125) 2.14
24 hr | 14828: Hspa5 (1442) 2.17 | 13835: Epha1 (143) -2.05
24 hr | 14728: Gp49b (356) 2.01 | 74281: Spatc1 (119) -2.17
24 hr | 15124: Hba-ps3 (26563) -2.26 | 236733: Usp11 (105) -2.17
24 hr | 12051: Bcl3 (262) 2.04 | 74536: 9030409C19Rik (136) -2.65
24 hr | 14601: Ghrh (228) 2.06 | 113862: V1rc5 (70) -2.78
24 hr | 15526: Hspa9a (456) 2.2 | 70603: Mutyh (76) -3.51
24 hr | 15511: Hspa1b (371) 2.38 | 18196: Nsg1 (66) -2.41
24 hr | 27280: Phlda3 (241) 2.54 | 69382: 1700024P04Rik (96) -2.71
24 hr | 83490: Pik3ap1 (210) -3.21 | 67981: Hormad1 (64) -3.72
24 hr | 16740: L1Md-Tf9 (8818) 3.39 | 14029: Evx2 (51) -2.98
24 hr | 19217: Ptger2 (177) 3.13 | 15061: H28 (51) -3.63
48 hr | 67938: Mylc2b (4966) -2.03 | 56873: Lmbr1 (173) 2.4
48 hr | 66734: Map1lc3a (1212) -2.18 | 20017: Rpo1-2 (131) 2.63
48 hr | 17975: Ncl (1064) 2.23 | 14786: Grb7 (173) -2.62
48 hr | 11830: Aqp5 (1509) -2.3 | 22608: Nsep1 (137) -3.03
48 hr | 51938: Ccdc39 (321) -2.34 | 74561: Nkx6-3 (140) -2.85
48 hr | 83553: Tktl1 (6955) -2.33 | 71355: Col24a1 (118) -3.24
48 hr | 210992: Aytl2 (1476) -2.72 | Y07611: Y07611 (115) -2.98
48 hr | 67648: 4930542C12Rik (785) -2.63 | 72181: Nsun4 (65) -6.06
48 hr | 110956: D17H6S56E-5 (265) 2.62 | 69315: 1700001L19Rik (119) -3.09
48 hr | 16737: L1Md-Tf5 (14763) 2.34 | 80721: Slc19a3 (110) -3.9
48 hr | 70223: Nars (330) 2.96 | 18793: Plaur (82) -2.51
48 hr | 224824: Pex6 (369) -2.53 | 226016: 5730446C15Rik (173) -3.35
48 hr | 56430: Rsn (240) 2.59 | 67430: 4921536K21Rik (90) -2.87
48 hr | 14118: Fbn1 (502) 2.67 | 78548: 5430417C01Rik (84) -2.93
48 hr | 12628: Cfh (993) 3.33 | 77080: 9230110F15Rik (112) -4.11
48 hr | 13204: Dhx15 (269) 2.61 | 11828: Aqp3 (168) 2.99
48 hr | 20763: Sprr2i (380) 2.86 | 14465: Gata6 (162) 2.69
48 hr | 16010: Igfbp4 (772) 3.96 | 13003: Cspg2 (136) 3.38
48 hr | 72240: 1600014C23Rik (535) -3.5 | 57738: Slc15a2 (88) -3.97
48 hr | 19273: Ptprl (268) -3.05 | 100978: Nfxl1 (76) 5.06
48 hr | 13587: Rnase2 (357) -2.84 | 74437: 4933402E13Rik (73) -4.83
48 hr | 18730: Pira7 (3705) 4.63 | 13840: Epha6 (123) 2.8
48 hr | 50926: Hnrpdl (463) 2.68 | 16822: Lcp2 (79) 3.57
48 hr | 17912: Myo1b (219) -3.01 | 22222: Ubr1 (164) 3.21
48 hr | 18597: Pdha1 (345) 3.13 | 74227: 1700016A09Rik (87) -5.79
48 hr | 12331: Cap1 (755) 3.66 | 73456: Izumo1 (91) 7.7
48 hr | 68311: Lypd2 (346) -4.22 | 16703: Krtap8-1 (96) -5.59
48 hr | 22630: Ywhaq (600) 3.61 | 113862: V1rc5 (70) -3
48 hr | 13711: Elf5 (168) -14.15 | 14126: Ms4a2 (73) -6.74
48 hr | 12425: Cckar (142) -12.01 | AK017085: AK017085 (89) 3.34
72 hr | 69202: 2610009E16Rik (6090) -2.4 | 52118: Pvr (176) 2.46
72 hr | 78185: 4930524L23Rik (2647) 2.57 | 27273: Pdk4 (137) 2.74
72 hr | 15213: Hey1 (344) -2.64 | 14682: Gnaq (125) 2.59
72 hr | 15124: Hba-ps3 (26563) -2.75 | 77669: 9130221D24Rik (81) 4.47
72 hr | AJ400878: AJ400878 (480) -3.04 | 13983: Esr2 (83) -3.08
72 hr | 72461: Prcp (497) 5.36 | 56353: Rybp (84) 3.16
72 hr | 14173: Fgf2 (206) 5.6 | 29820: Tnfrsf19 (89) -2.78
72 hr | 12322: Camk2a (187) 7.13 | 78767: 2610021K21Rik (59) 9.63
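A comparison of this kind, listing genes that are top-ranked by one method but not by the other, can be sketched with a simple list comparison. The example below is illustrative only. The function name, the placeholder gene labels, and the cutoff `top_n` are hypothetical, and the size of the top-ranked lists used for Table A.3 is not assumed here.

```python
def exclusive_top_genes(ranked_a, ranked_b, top_n):
    """Genes among the top_n of ranking A that are absent from the top_n of B.

    ranked_a, ranked_b : gene identifiers ordered from most to least
                         significant by two different methods
    top_n              : hypothetical size of each top-ranked list
    """
    top_b = set(ranked_b[:top_n])
    # Preserve the ranking order of method A in the output.
    return [gene for gene in ranked_a[:top_n] if gene not in top_b]


# Placeholder rankings (not real results).
ranked_ibmt = ["geneA", "geneB", "geneC", "geneD"]
ranked_smt = ["geneB", "geneE", "geneA", "geneF"]

ibmt_not_smt = exclusive_top_genes(ranked_ibmt, ranked_smt, top_n=3)
smt_not_ibmt = exclusive_top_genes(ranked_smt, ranked_ibmt, top_n=3)
print(ibmt_not_smt)  # ['geneC']
print(smt_not_ibmt)  # ['geneE']
```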

A.2. Appendix for Chapter 3

A.2.1. Supplemental figures and tables for Chapter 3

Figure A.5 Full ROC curves from Figure 3.4. Samples for grade 3 vs. grade 1 tumors, all with positive ER status, were randomly selected to create 60 subsets (20 subsets each containing 3, 4, or 6 biological replicates of grade 3 and grade 1). Subsets were then analyzed for DEGs and tested for enriched GO terms using the following methods: LRpath, Fisher’s exact (FE) with cutoffs of 0.01 and 0.001 for DEGs, BayGO with 0.01 cutoff, GSEA, globaltest, and sigPathway (sigPath). GO terms were then ranked and the true and false positives determined by comparing to the gold standard sets. Gray line indicates point at which there are an equal number of true and false positives. (A) 3 replicates, (B) 4 replicates, and (C) 6 replicates.
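The counting behind curves of this type, ranking GO terms and tallying true and false positives against a gold standard set, can be sketched as below. This is a minimal sketch under stated assumptions: the function name and the placeholder term labels are hypothetical, and the construction of the gold standard sets themselves is not reproduced.

```python
def roc_counts(ranked_terms, gold_standard):
    """Cumulative (false positives, true positives) along a ranked term list.

    ranked_terms  : GO term labels ordered from most to least significant
    gold_standard : set of terms treated as truly enriched
    """
    true_positives = 0
    false_positives = 0
    points = []
    for term in ranked_terms:
        if term in gold_standard:
            true_positives += 1
        else:
            false_positives += 1
        points.append((false_positives, true_positives))
    return points


# Placeholder labels (not real GO identifiers or results).
ranked_terms = ["term1", "term2", "term3", "term4", "term5"]
gold_standard = {"term1", "term3", "term4"}
points = roc_counts(ranked_terms, gold_standard)
print(points)  # [(0, 1), (1, 1), (1, 2), (1, 3), (2, 3)]
```

Points with equal numbers of true and false positives, such as (1, 1) here, lie on the gray reference line described in the caption.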

A.2.2. Supplemental applications for Chapter 3

A.2.2.1. Results from AML/ALL dataset

This is the familiar dataset from Golub et al. (1999), also analyzed in the globaltest method publication, consisting of 27 acute lymphoblastic leukemia (ALL) patient samples and 11 acute myeloid leukemia (AML) patient samples. Table 1 lists the Gene Ontology (GO) terms (Ashburner et al. 2000) found to be significantly enriched (FDR<0.10) (Benjamini & Hochberg 1995) using LRpath in conjunction with IBMT analysis (Sartor et al. 2006). Protease- and peptidase-related activities differed most notably between AML and ALL patient profiles. Response to stress, as well as chemokine and cytokine activities, were also identified as differing between AML and ALL patients.
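The FDR<0.10 criterion refers to the Benjamini-Hochberg procedure. A minimal sketch of the standard step-up adjustment follows. It is a generic illustration with hypothetical p-values, not the implementation used for the analyses reported here.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never decrease as the raw p-values increase.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for five GO terms.
p_values = [0.01, 0.04, 0.03, 0.005, 0.6]
adjusted = benjamini_hochberg(p_values)
n_significant = sum(q < 0.10 for q in adjusted)
print(n_significant)  # 4
```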

Enriched GO terms were also tested with GSEA (Subramanian et al. 2005), globaltest (Goeman et al. 2004), and BayGO (Vencio et al. 2006). We summarize the concordance between methods by correlating the –log(p-values) using Pearson’s correlation coefficient (Table 2). As the table shows, the results of LRpath using logistic regression are most concordant with those of Fisher’s Exact test, but with more power. Overall, Fisher’s Exact test has the most similarity with the other methods. LRpath is also significantly correlated with all other methods. GSEA and globaltest were not significantly correlated with BayGO.
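The concordance measure described above can be sketched as follows. The p-values in the example are hypothetical. The base of the logarithm is an assumption (base 10 is used), although Pearson's correlation is unaffected by that choice because changing the base only rescales the values.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def concordance(p_a, p_b):
    """Correlate -log10(p-values) from two methods over the same GO terms.

    All p-values must be greater than zero, and neither list may be constant.
    """
    log_a = [-math.log10(p) for p in p_a]
    log_b = [-math.log10(p) for p in p_b]
    return pearson(log_a, log_b)


# Hypothetical p-values for the same four GO terms from two methods.
method_a = [0.001, 0.02, 0.30, 0.80]
method_b = [0.004, 0.01, 0.50, 0.60]
print(round(concordance(method_a, method_b), 3))
```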

Figure 1 shows the distribution of p-values from all methods tested. Fisher's Exact test displays a fairly flat histogram of p-values, and indeed no GO terms were determined to be significant at the FDR 0.10 level for this dataset. BayGO resulted in a humped distribution (interestingly with a maximum possible p-value of 0.6) and also did not identify any significant GO terms. For this dataset globaltest appears to be somewhat unstable, as it found virtually all GO terms to be significant, seen in Figure 1 as a histogram concentrated almost entirely at low p-values. GSEA displays a fairly flat p-value distribution, although it did identify four enriched GO terms. However, none of these (sodium ion transport, regulation of cell adhesion, sodium ion binding, and physiological interaction between organisms) overlapped with the terms identified by LRpath. Only LRpath resulted in an enrichment of low p-values with an otherwise flat distribution.

Figure A.6 Histograms of p-values testing Gene Ontology for AML-ALL dataset. Only LRpath shows an enrichment of low p-values with an otherwise flat distribution.

A.2.2.2. Results from cardiovascular TCDD dataset

This dataset was generated on Affymetrix GeneChip Murine Genome U74 Version 2 arrays and is a TCDD (2,3,7,8-tetrachlorodibenzo-p-dioxin) dose-response series on gene expression during murine cardiovascular development (Thackaberry et al. 2005). Pregnant mice were dosed with 1.5, 3.0, or 6.0 µg TCDD/kg on gestational day (GD) 14.5, and microarray analysis was used to characterize global changes in fetal cardiac gene expression on GD 17.5. Eight control mice received corn oil and were compared to four mice at each dosage level. This experiment was chosen to illustrate the use of our logistic regression method for determining biological categories enriched with genes that are differentially expressed anywhere in the experiment (at any dose level). Data preprocessed by the Affymetrix MAS5.0 method were downloaded from NCBI GEO (GEO accession GDS1302) and analyzed for differentially expressed genes using IBMT (Sartor et al. 2006). We tested the following null hypothesis using our logistic regression method: $\beta_1 = \beta_2 = \beta_3 = 0$, where $\beta_1$, $\beta_2$, and $\beta_3$ are the slope coefficients for the 1.5, 3, and 6 µg/kg dose levels respectively, against both the GO database and KEGG pathways.

Results indicated that 11 GO terms and 3 KEGG pathways were significantly enriched with differentially expressed genes at the FDR<0.10 level (Table 3). Most notably, the KEGG pathway for “Metabolism of xenobiotics by cytochrome P450” was significant (FDR=0.008). Twelve of the 46 analyzed genes in this pathway (including Cyp1a1, Cyp1b1, Adh1, Adh5, Adh7, Gstm1 and several other P450s) were found to be differentially expressed (p<0.05) for at least one dose level. The other significant KEGG pathways and 10 of the 11 GO terms were immune related, most notably related to complement activation. This highly significant result was not identified or discussed in the original publication.

A.3. Appendix for Chapter 4

A.3.1. Empirical Bayes smoothing spline estimation

For computational purposes, our SSIBMT R function makes use of the R function smooth.spline, which uses P-splines. P-splines are a generalization of smoothing splines and approximate them under appropriate conditions, i.e. a large enough number of knots and a penalty on the integrated squared second derivative (Ruppert et al. 2003). Smoothing splines are the special case of P-splines that uses this same penalty term and places a knot at each data point. Theoretical background for the convergence of P-splines to smoothing splines for large numbers of knots is found in (Li & Ruppert 2007) (http://legacy.orie.cornell.edu/~davidr/papers/index.html).

For empirical asymptotic properties and the practice of using a smaller number of knots for large datasets, see (Chambers & Hastie 1992; Ruppert 2002). It has been observed that the number of knots makes little difference as long as it is above a certain minimum; we adopt the suggested default given in (Ruppert 2002): min(n/4, 40). By using a reduced number of knots, we enable the application of SSIBMT to extremely large datasets, such as high density promoter or tiling microarrays that may involve several million probe elements.
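The knot rule quoted above is simple enough to state in code. The following Python sketch is for illustration only and is not taken from the SSIBMT implementation: the function name `default_num_knots`, the use of integer division for n/4, and the lower bound of one knot for very small n are all assumptions.

```python
def default_num_knots(n_points, cap=40):
    """Default knot count min(n/4, cap), with n/4 rounded down.

    The lower bound of 1 is an assumption added so that very small
    inputs still return a usable value.
    """
    return max(1, min(n_points // 4, cap))


# Small datasets use n/4 knots; large ones are capped.
knots_small = default_num_knots(100)
knots_large = default_num_knots(2_000_000)
```

Because the cap does not grow with n, the cost of the spline fit stays manageable even for arrays with millions of probe elements.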


A.3.2. Full Bayesian joint and full conditional distributions

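The joint distribution for the full Bayesian model is derived from the DAG using the properties that each node is independent of its non-descendants given its parents, and that each node is independent of all other nodes given its parents, children, and children's parents. The joint distribution is defined as

\[
P(s_g^2, d_g, \sigma_g^2, \mu_g, \tau^2, \boldsymbol{\beta})
= \prod_{g=1}^{n} \left[ p(s_g^2 \mid \sigma_g^2, d_g)\; p(\sigma_g^2 \mid \tau^2, \boldsymbol{\beta}, \boldsymbol{\kappa}, \mu_g) \right] p(\tau^2)\, p(\boldsymbol{\beta})
\]

\[
= \prod_{g=1}^{n} \mathrm{Gamma}\!\left(\frac{d_g}{2}, \frac{d_g}{2\sigma_g^2}\right)
\prod_{g=1}^{n} \log N\!\left(f(\mu_g, \boldsymbol{\beta}, \boldsymbol{\kappa}), \tau^2\right)
\cdot IG(0.001, 0.001) \cdot MVN(\mathbf{0}, \mathbf{I})
\]

\[
= \prod_{g=1}^{n} \frac{\left(\frac{d_g}{2\sigma_g^2}\right)^{d_g/2}}{\Gamma\!\left(\frac{d_g}{2}\right)}
\left(s_g^2\right)^{\frac{d_g}{2}-1} \exp\!\left(-\frac{d_g s_g^2}{2\sigma_g^2}\right)
\prod_{g=1}^{n} \frac{1}{\sigma_g^2 \sqrt{2\pi\tau^2}}
\exp\!\left(-\frac{\left(\log\sigma_g^2 - f(\mu_g, \boldsymbol{\beta}, \boldsymbol{\kappa})\right)^2}{2\tau^2}\right)
\]
\[
\quad \times\; \frac{0.001^{0.001}}{\Gamma(0.001)} \left(\tau^2\right)^{-(0.001+1)} \exp\!\left(-\frac{0.001}{\tau^2}\right)
\cdot \frac{1}{(2\pi)^{N/2}} \exp\!\left(-\frac{1}{2}\boldsymbol{\beta}^{T}\boldsymbol{\beta}\right)
\]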

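The full conditional for $\sigma_g^2$, the gene variances, is

\[
P(\log\sigma_{g=i}^2 \mid s_g^2, \log\sigma_{g \neq i}^2, \tau^2, \boldsymbol{\beta}, \mu_g, \boldsymbol{\kappa})
\propto \left(\sigma_g^2\right)^{-\frac{d_g}{2}} \exp\!\left(-\frac{d_g s_g^2}{2\sigma_g^2}\right)
\cdot \frac{1}{\sigma_g^2} \exp\!\left(-\frac{\left(\log\sigma_g^2 - f(\mu_g, \boldsymbol{\beta}, \boldsymbol{\kappa})\right)^2}{2\tau^2}\right)
\]

\[
= \left(\sigma_g^2\right)^{-\left(\frac{d_g}{2}+1\right)}
\exp\!\left(-\frac{d_g s_g^2}{2\sigma_g^2} - \frac{\left(\log\sigma_g^2 - f(\mu_g, \boldsymbol{\beta}, \boldsymbol{\kappa})\right)^2}{2\tau^2}\right)
\]

This is not a standard distribution, and sampling from it therefore requires the Metropolis-Hastings MCMC technique. The full conditional for $\tau^2$, the variance of $\log\sigma_g^2$, is

\[
P(\tau^2 \mid \log\sigma_g^2, \boldsymbol{\beta}, \mu_g, \boldsymbol{\kappa})
\propto \left(\tau^2\right)^{-\frac{n}{2}}
\exp\!\left(-\frac{1}{2\tau^2} \sum_{g=1}^{n} \left(\log\sigma_g^2 - f(\mu_g, \boldsymbol{\beta}, \boldsymbol{\kappa})\right)^2\right)
\cdot \frac{0.001^{0.001}}{\Gamma(0.001)} \left(\tau^2\right)^{-(0.001+1)} \exp\!\left(-\frac{0.001}{\tau^2}\right)
\]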


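This is the standard case of a normal variance with known mean under the conjugate prior system. The posterior is known to be:

\[
\mathrm{InverseGamma}\!\left(0.001 + \frac{n}{2},\; 0.001 + \frac{1}{2}\sum_{g=1}^{n} \left(\log\sigma_g^2 - f(\mu_g, \boldsymbol{\beta}, \boldsymbol{\kappa})\right)^2\right)
\]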
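The conjugate update can be checked directly. As a short sketch, write $\tau^2$ for the variance of the log gene variances, $r_g = \log\sigma_g^2 - f_g$ for the deviation of gene $g$ from its fitted spline value $f_g$ (a shorthand introduced here for illustration only), and $a = b = 0.001$ for the prior shape and rate. Multiplying the normal likelihood of the $n$ log-variances by the inverse gamma prior and collecting powers of $\tau^2$ gives

\[
p(\tau^2 \mid \cdot)
\;\propto\; \left(\tau^2\right)^{-\frac{n}{2}} \exp\!\left(-\frac{1}{2\tau^2}\sum_{g=1}^{n} r_g^2\right)
\times \left(\tau^2\right)^{-(a+1)} \exp\!\left(-\frac{b}{\tau^2}\right)
\;=\; \left(\tau^2\right)^{-\left(a + \frac{n}{2} + 1\right)}
\exp\!\left(-\frac{b + \frac{1}{2}\sum_{g=1}^{n} r_g^2}{\tau^2}\right),
\]

which is the kernel of an inverse gamma distribution with shape $a + n/2$ and second parameter $b + \frac{1}{2}\sum_{g} r_g^2$, so no Metropolis-Hastings step is needed for this parameter.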

Each node in the DAG, given its parents, children, and its children's parents, is independent of all other nodes. Thus, the full conditional for the spline coefficients can be written as $P(\boldsymbol{\beta} \mid \log\sigma_g^2, \tau^2, \mu_g, \boldsymbol{\kappa})$. With uninformative normal priors for the spline coefficients, the full conditional posterior for the coefficients of the spline function is multivariate normal. Using 40 knots results in 44 degrees of freedom (df = number of knots + degree (3) + 1), that is, 44 spline coefficients plus the intercept:

\[
MVN_{45}\!\left(\hat{\boldsymbol{\beta}}, \mathrm{Cov}(\hat{\boldsymbol{\beta}})\right)
\]

Since the full conditionals for $\tau^2$ and $\boldsymbol{\beta}$ are known distributions but the full conditional for $\sigma_g^2$ is not, we use the Metropolis-Hastings within Gibbs sampling scheme, sampling from each of these three full conditionals in turn at each iteration. The fbSpline function uses a default burn-in of 500 iterations, which can be adjusted by the user.
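To make the sampling scheme concrete, the following Python sketch shows two of the three updates: a random-walk Metropolis-Hastings step for each log gene variance and a Gibbs draw for the variance of the log gene variances. It is an illustration under several assumptions and is not the fbSpline R code: the fitted spline values `f_values` are held fixed (the multivariate normal update of the spline coefficients is omitted), a single degrees-of-freedom value `d` is shared by all genes, and the function names, proposal step size, starting values, and example data are all hypothetical.

```python
import math
import random


def log_target(x, s2, d, f, tau2):
    """Log full conditional, up to a constant, of x = log(sigma_g^2)."""
    # Gamma likelihood of s2 given sigma2 = exp(x), plus normal prior on x.
    return -0.5 * d * x - 0.5 * d * s2 * math.exp(-x) - (x - f) ** 2 / (2.0 * tau2)


def mh_update(x, s2, d, f, tau2, rng, step=0.5):
    """One random-walk Metropolis-Hastings update of x = log(sigma_g^2)."""
    proposal = x + rng.gauss(0.0, step)
    log_ratio = log_target(proposal, s2, d, f, tau2) - log_target(x, s2, d, f, tau2)
    # Accept with probability min(1, exp(log_ratio)).
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return proposal
    return x


def draw_tau2(log_sigma2, f_values, rng, a=0.001, b=0.001):
    """Gibbs draw of tau^2 from InverseGamma(a + n/2, b + SS/2)."""
    n = len(log_sigma2)
    ss = sum((x - f) ** 2 for x, f in zip(log_sigma2, f_values))
    shape = a + 0.5 * n
    rate = b + 0.5 * ss
    # If G ~ Gamma(shape, scale=1/rate), then 1/G ~ InverseGamma(shape, rate).
    return 1.0 / rng.gammavariate(shape, 1.0 / rate)


def sample(s2, d, f_values, n_iter=1000, burn_in=500, seed=0):
    """Alternate the two updates; return posterior mean variances and tau^2 draws."""
    rng = random.Random(seed)
    log_sigma2 = [math.log(v) for v in s2]  # start at the sample variances
    tau2 = 1.0
    sigma2_sums = [0.0] * len(s2)
    tau2_draws = []
    for it in range(n_iter):
        log_sigma2 = [
            mh_update(x, s2_g, d, f_g, tau2, rng)
            for x, s2_g, f_g in zip(log_sigma2, s2, f_values)
        ]
        tau2 = draw_tau2(log_sigma2, f_values, rng)
        # The spline-coefficient update (which would refresh f_values) is omitted.
        if it >= burn_in:
            tau2_draws.append(tau2)
            sigma2_sums = [t + math.exp(x) for t, x in zip(sigma2_sums, log_sigma2)]
    kept = n_iter - burn_in
    return [t / kept for t in sigma2_sums], tau2_draws


# Hypothetical sample variances for four genes with 4 degrees of freedom each.
post_mean_sigma2, tau2_draws = sample(
    [0.5, 1.2, 0.8, 2.0], d=4, f_values=[0.0] * 4, n_iter=600, burn_in=500, seed=1
)
```

The Metropolis-Hastings step works on the log scale, so the target includes the change-of-variables factor and proposals can never produce a negative variance.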

A.4. Researcher contributions

For Chapter 2, Mario Medvedovic participated in the conception of the methodology and provided guidance in the development, design, and drafting of the manuscript, and Siva Sivaganesan contributed to the statistical details of the method. Craig R. Tomlinson, Scott C. Wesselkamper, and George D. Leikauf provided interpretation of the biological results from the dual-channel datasets, and Craig R. Tomlinson additionally oversaw the microarray hybridizations for the two dual-channel experiments. We further acknowledge Saikumar Karyala for growing the MEF wildtype and Ahr-/- cells, and Danielle Halbleib for performing the microarray hybridizations. Funding was provided through the Center for Environmental Genetics by the NIEHS grant no. P30 ES06096. For Chapter 3, Mario Medvedovic provided guidance in the development, design, and drafting of the manuscript and George D. Leikauf provided interpretation of the biological results for the breast cancer and IPF results. For Chapter 4, Mario Medvedovic and Siva Sivaganesan contributed to the development and design of the methodologies. We thank Michael Wagner, Paul Succop, and Alvaro Puga for a thorough reading of the dissertation and many helpful suggestions.
