UNIVERSITY OF CINCINNATI

Date: 22-Oct-2009

I, Johannes M Freudenberg, hereby submit this original work as part of the requirements for the degree of: Doctor of Philosophy in Biomedical Engineering

It is entitled: Bayesian Infinite Mixture Models for Clustering and Simultaneous Context Selection Using High-Throughput Data

Student Signature: Johannes M Freudenberg

This work and its defense approved by:

Committee Chair: Mario Medvedovic, PhD

Bruce Aronow, PhD

Michael Wagner, PhD

Jaroslaw Meller, PhD

11/18/2009 248

Bayesian Infinite Mixture Models for Gene Clustering and Simultaneous Context Selection Using High-Throughput Gene Expression Data

A dissertation submitted to the

Graduate School

of the University of Cincinnati

in partial fulfillment of the

requirements for the degree of

Doctor of Philosophy

in the Department of Biomedical Engineering

of the Colleges of Engineering and Medicine

by

Johannes M. Freudenberg

M.S., 2004, Leipzig University, Leipzig, Germany (German Diplom degree)

Committee Chair: Mario Medvedovic, Ph.D.

Abstract

Applying clustering algorithms to identify groups of co-expressed genes is an important step in the analysis of high-throughput genomics data in order to elucidate affected biological pathways and transcriptional regulatory mechanisms. As these data become ever more abundant, their integration with both existing biological knowledge and other experimental data becomes as crucial as the ability to perform such analysis in a meaningful but virtually unsupervised fashion.

Clustering analysis often relies on ad-hoc methods such as k-means or hierarchical clustering with Euclidean distance, but model-based methods such as the Bayesian Infinite Mixtures approach have been shown to produce better, more reproducible results. Further improvements have been accomplished by context-specific gene clustering algorithms designed to determine groups of co-expressed genes within a given subset of biological samples termed context. The complementary problem of finding differentially co-expressed genes given two or more contexts has been addressed but relies on the a priori definition of contexts and has not been used to facilitate the clustering of biological samples.

Here we describe a new computational method that uses Bayesian infinite mixture models to cluster genes while simultaneously using the concept of differential co-expression as a unique similarity measure to find groups of similar samples. We compute a novel per-gene differential co-expression score (DCS) that is reproducible and biologically meaningful. To evaluate, annotate, and display clustering results, we present the integrated software package CLEAN, which contains functionality for performing Clustering Enrichment Analysis, a method to functionally annotate clustering results and to assign a novel gene-specific functional coherence score.


We apply our method to a number of simulated datasets, comparing it to other commonly used clustering algorithms, and we re-analyze several breast cancer studies. We find that our unsupervised method determines patient groupings highly predictive of clinically relevant factors such as estrogen receptor status, tumor grade, and disease-specific survival. Integrating these data with computationally and literature-derived information by applying CLEAN to the corresponding clusterings as well as the DCS signature substantiates these findings.

Our results demonstrate the range of applications our methodology provides, offering a comprehensive analysis tool to study gene co-expression and differential co-expression patterns specific to the biological conditions of interest while simultaneously determining subsets of such biological conditions using a unique similarity measure that is complementary to the currently existing methods. It allows us to further our understanding of highly complex diseases such as breast cancer, and it has the potential to greatly facilitate research in many other, not yet as intensively studied areas.


Preface

I would like to thank everyone who helped and supported me to reach this milestone on my Bioinformatics journey. I especially thank my advisor and mentor Dr. Mario Medvedovic for all of his support, insight, and encouragement. Discussions and conversations with him are always inspiring and enlightening and I am deeply grateful for the opportunity to work with him.

I also thank Dr. Bruce Aronow, Dr. Jarek Meller, and Dr. Michael Wagner for serving on my thesis committee, but more importantly for their expertise, advice, and support over the past years. Thanks and gratitude also to all my collaborators, especially Dr. Siva Sivaganesan whose expertise in Bayesian Statistics is invaluable for this dissertation, Monika Ray for trusting me and challenging my ways, and Xiangdong Liu, Vineet Joshi, and Zhen Hu. I thank my fellow travelers Jing Chen, Sivakumar Gowrisankar, Ranga Chandra Gudivada, Rachana Jain, and Mukta Phatak for leading the way. I want to also thank my parents for their love and support (and for sending care packages across the ocean) and my daughter Anouk for reminding me every day what's really important in life.

Finally, I wish to thank my partner, my friend, my love Un Kyong Ho: This is for you.


Table of Contents

Abstract
Preface
List of Tables
List of Figures
List of Algorithms
Chapter I – Background
  Introduction
  Clustering
  Gaussian finite and infinite mixture models
  Context-specific clustering
  Differential Co-expression
  Outline of this dissertation
Chapter II – Semiparametric Bayesian model for unsupervised differential co-expression analysis identifies novel molecular subtypes
  Introduction
  Results
  Context-specific infinite mixture model
  Recovery of simulated contexts
  Identifying molecular subtypes in breast cancer gene expression data
  Differentially co-expressed genes
  Comparison to other outcome predictors
  Reproducibility of differential co-expression signature across independent datasets
  Meta-analysis based on the differential co-expression signature
  Discussion and Conclusions
  Methods
  Infinite mixture model
  Differential co-expression score
  Simulation study
  Breast cancer studies
Chapter III: CLEAN – CLustering Enrichment ANalysis
  Background
  Results
  Comparing clustering results using the CLEAN score
  Reproducibility and the comparison with cluster-wide scores
  Unsupervised selection of informative genes
  Computational Infrastructure
  Discussion
  Conclusions
  Methods
  Data Preprocessing, Gene Selection and Clustering
  Clustering Enrichment Analysis
  Defining Functional Categories
  Obtaining All Possible Gene Clusters
  Determining significant functional enrichment
  Procedure for Non-hierarchical Methods
  R package and FTreeView tool
Chapter IV: Additional Applications
  Simulation studies
  A “synthetic” dataset
  Large-scale breast cancer study – revisited
  Normal tissue dataset
Bibliography
Appendix
  A. Gene Lists
  B. R packages
    B.1 Contributions to gimmR
    B.2 CLEAN

List of Tables

Table 2.1: Patient characteristics of DCIM derived patient groups for the Schmidt et al. dataset.
Table 2.2: Comparison of computationally derived patient groups for the Schmidt et al. dataset
Table 2.3: Predicting survival outcome using different clinical, molecular, and computational methods.
Table 2.4: Predicting survival outcome using the top 200 DCE genes with different clustering algorithms
Table 2.5: Comparison of patient groupings derived from individual and joint analyses
Table 2.6: Overlap of the 500 DCS signature with other well-known breast cancer signatures
Table 2.7: Overview of the breast cancer studies used
Table 3.1. Contingency table of genes with significant and non-significant CLEAN score in human and mouse tissues.
Table 3.2. Contingency table of genes with significant and non-significant cwCLEAN score in human and mouse tissues.
Table A1. Top 200 DCS genes for the Schmidt et al. (2008) dataset.
Table A2: Top 200 genes differentially expressed between two major contexts for the Schmidt et al. (2008) dataset.


List of Figures

Figure 1.1 Number of samples published each month through the Gene Expression Omnibus.
Figure 2.1. Contexts are defined by unique local gene co-expression patterns.
Figure 2.2. Differences between contexts are defined by differential co-expression, not differential expression.
Figure 2.3. DCIM derived contexts in breast cancer data are predictive of patient survival
Figure 2.4. Reproducibility of the gene-specific DCS
Figure 2.5. Heatmaps of top 500 DCS gene signature in three breast cancer studies
Figure 2.6. Heatmaps of top 500 DCS gene signature in breast cancer meta-study. Expression patterns are remarkably consistent with individual datasets. DCIM sample clusterings are highly correlated with ER status, tumor grade, and patient survival.
Figure 2.7. Directed Acyclic graph describing the DCIM computational model
Figure 3.1. Calculating functional coherence scores
Figure 3.2. Comparison of clustering methods.
Figure 3.3. Integrating cluster analysis and functional knowledge
Figure 3.4. Reproducibility of CLEAN and cwCLEAN scores in a specific dataset
Figure 3.5. Differences in the reproducibility of CLEAN and cwCLEAN scores
Figure 3.6. Unsupervised selection of informative genes.
Figure 3.7. Integrated software package
Figure 3.8. Expression patterns of genes with statistically significant CLEAN scores in four independent breast cancer datasets
Figure 4.1. Recovery of gene clusters in simple simulation scenario
Figure 4.2. Recovery of gene clusters in two fold-changes scenario
Figure 4.3. Recovery of gene and sample clusters non-informative samples scenario
Figure 4.4. Heatmap of the expression dataset generated by the Microarray Quality Control (MAQC) Consortium
Figure 4.5. ROC curves for sample clustering of the MAQC dataset
Figure 4.6. Kaplan-Meier curves estimating patient survival in joint breast cancer data set
Figure 4.7. DCIM clustering and survival analysis of the joint breast cancer dataset
Figure 4.8. Clustering Enrichment Analysis (CLEAN) of the DCIM derived gene clustering
Figure 4.9. Screenshot of fTreeView displaying DCIM derived gene clustering and contexts for the normal tissue study
Figure 4.10. Screenshot of fTreeView displaying DCIM and CLEAN results for the GSE3526 dataset
Figure 4.11. Combining the display of CLEAN scores and the DCS for the GSE3526 dataset

List of Algorithms

Algorithm 2.1. Gene-specific DCS.
Algorithm 3.1. Clustering Enrichment Analysis (CLEAN).
Algorithm 3.2. Averaging gene clusterings over n runs of a non-hierarchical clustering algorithm


Chapter I – Background

Introduction

The expression of genetic information is a fundamental process found in virtually all living cells (footnote 1). Parts of the DNA, the “genetic blueprint”, are transcribed into a “working copy”, a shorter RNA molecule or transcript, which in turn is translated into protein. This general concept is known as the Central Dogma of molecular biology [1] and is a highly complex process regulated by an intricate network of interactions between DNA, RNA, and proteins. According to the Central Dogma, messenger RNA abundance in a cell should correlate with protein abundance. Indeed, such correlation appears to be reasonably strong [2, 3], and RNA expression measurements are now widely used to study the function of genes, as the development of DNA microarrays during the mid-1990s [4] made it possible to monitor genome-wide messenger RNA abundance, a quantum leap from the precursor technique Southern blotting [5].

This invention subsequently spawned a new field contending with a whole host of challenges and new developments such as array types (e.g., oligonucleotide, bead, and spotted arrays), standardization and quality control (e.g., MIAME [6] and MAQC [7]), and new (and old) statistical problems (e.g., data normalization, reproducibility assessment, high dimensionality, data mining).

Most if not all of these areas are still being pursued, some with more intensity than others, while whole transcriptome re-sequencing techniques are now starting to emerge, promising a new set of opportunities as well as challenges. Yet so far, the use of high-throughput technologies to assess gene expression levels has become ubiquitous, analyzing samples of almost any species, tissue type, disease state, and experimental condition imaginable and generating an incredible wealth of whole genome expression data. The requirement by many scientific journals to publish such data along with the manuscript has led to a steady increase of gene expression data stored in repositories accessible by the public. Furthermore, the growth rate of such repositories itself has been increasing ever since their inception (Figure 1.1). As of September 2009, the Gene Expression Omnibus (GEO) [8] alone provides over 350,000 publicly available samples, where a sample represents a single gene expression experiment as a vector of relative expression levels of a platform-specific set of “genes” or DNA sequences.

Footnote 1: There are living cells, such as mature mammalian erythrocytes, which lack a nucleus and do not synthesize RNA.

Figure 1.1 Number of samples published each month through the Gene Expression Omnibus (GEO) [8], a repository for high-throughput gene expression data. While the monthly numbers vary by up to several thousand samples, the overall trend has been a steady increase since the site's inception. GEO now holds more than 350,000 publicly available samples. The graph is based on GEO series summary data and was generated as of September 2009.

Despite or possibly because of this enormous amount of data, it remains a challenge to fully harness the wealth of information it represents. New methods are needed to gain novel insight into the underlying biological processes and to infer new knowledge from these data, which requires means to recognize patterns and to organize data in meaningful ways with little or no human intervention.

Clustering

Clustering is a commonly used technique to partition a data set into smaller, more manageable units or clusters. More specifically, an object (e.g., a gene) characterized by a set of features or attributes (e.g., biological conditions) is placed in a group of other objects having similar attribute values (e.g., expression levels). Ideally, members of such a cluster are most similar to each other with respect to the considered attributes but are dissimilar to all other objects. Based on this general concept, many clustering algorithms have been proposed, differing not only in the search strategy employed to determine clusters but also in the definition of object similarity. For an excellent review of many clustering methods, the reader is referred to Jain et al. [9].

Applying cluster analysis to any kind of high-throughput gene expression data reflecting relative transcript abundance is a natural choice, as genes function in concert, are part of intricate pathways, and are regulated by various elaborate regulatory mechanisms. It is reasonable to expect gene co-expression and co-regulation to be reflected in DNA expression data. In other words, functionally related genes can be expected to produce similar expression profiles.

Therefore, given a group of genes with similar expression patterns, it is reasonable to hypothesize that these genes share a functional relationship ('guilt-by-association') [10].

However, this paradigm highly depends on adequate methods and experimental design: Using improper distance measures, making assumptions not suitable to the data, or applying inappropriate algorithms may lead to irrelevant or even misleading clusterings. Yet even the most accurate clustering analysis cannot overcome the fact that co-expression does not necessarily imply co-regulation or functional relatedness. It can merely serve as a hypothesis generator requiring further independent analysis and testing. In order to generate the strongest hypotheses possible, it is therefore crucial to develop appropriate algorithms that provide researchers with useful, meaningful, and reproducible clusterings.

The first clustering method applied to DNA expression microarray data was hierarchical clustering using uncentered correlation as similarity measure and average linkage [11]. Other well-known pattern recognition algorithms such as k-means [12] and self-organizing maps (SOMs) [13] followed. It should be noted that all of these methods are ad hoc. In other words, these methods do not take into account the variability of the data and therefore have no means to establish the statistical significance of the results or to estimate reproducibility. In contrast, model-based clustering algorithms are designed to fit a statistical model given the data. That is, model-based clustering methods explicitly account for uncertainties and measurement errors in the data and provide confidence estimates along with the actual clustering results.
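To make the similarity measure mentioned above concrete, the following minimal sketch contrasts uncentered correlation with the ordinary (centered) Pearson correlation. The function names and the two toy profiles are hypothetical and chosen for illustration only; they are not taken from [11].

```python
import math

def uncentered_correlation(x, y):
    """Correlation computed without subtracting the profile means."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("similarity is undefined for an all-zero profile")
    return dot / (norm_x * norm_y)

def pearson_correlation(x, y):
    """Centered correlation: subtract each profile's mean first."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return uncentered_correlation([a - mean_x for a in x],
                                  [b - mean_y for b in y])

# Hypothetical expression profiles of two genes across four samples.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [4.0, 3.0, 2.0, 1.0]

r_uncentered = uncentered_correlation(gene_a, gene_b)
r_centered = pearson_correlation(gene_a, gene_b)
```

In this toy case the two measures disagree strongly: the profiles are perfectly anti-correlated after centering, yet the uncentered measure remains positive because both genes are expressed above the reference level in every sample.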

Gaussian finite and infinite mixture models

Among the most commonly used model-based clustering methods are Gaussian mixture models. This Bayesian approach has proven extremely valuable in clustering expression data, which are roughly normally distributed on the logarithmic scale. Here, similar expression profiles are assumed to be generated by a common underlying Gaussian pattern. Once such a model is postulated, the posterior probability distribution of the model parameters given the data can be estimated using computational methods such as Gibbs sampling [14]. Like many clustering methods (e.g., k-means, SOMs), finite Gaussian mixture models require the number of clusters to be pre-specified. In contrast, infinite Gaussian mixtures allow the optimal number of clusters to be estimated as a model parameter. Medvedovic et al. [15, 16] show that taking the uncertainty of the true cluster number into account by using infinite rather than finite mixtures dramatically improves the reproducibility of the clustering results as well as their sensitivity and specificity compared not only to heuristic methods (hierarchical clustering, k-means) but also to model-based methods (finite mixture models, MCLUST [17]).
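As an illustration of the idea that similar profiles are generated by a common Gaussian pattern, the sketch below computes the probability that one expression profile belongs to each of two clusters in a finite mixture. It assumes, purely for simplicity, that the cluster patterns, mixing weights, and a single shared standard deviation are known and fixed; in the Bayesian setting described above these quantities are themselves estimated, for example by Gibbs sampling. All names and numbers are hypothetical.

```python
import math

def log_gaussian(x, mean, sd):
    """Log density of a univariate normal distribution."""
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - (x - mean) ** 2 / (2.0 * sd * sd)

def membership_probabilities(profile, patterns, weights, sd):
    """Probability of each cluster given one profile and fixed mixture parameters."""
    log_post = []
    for pattern, weight in zip(patterns, weights):
        log_lik = sum(log_gaussian(x, m, sd) for x, m in zip(profile, pattern))
        log_post.append(math.log(weight) + log_lik)
    top = max(log_post)  # subtract the maximum for numerical stability
    unnormalized = [math.exp(v - top) for v in log_post]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Two hypothetical cluster patterns over three samples, equal mixing weights.
patterns = [[1.0, 1.0, -1.0], [-1.0, 0.0, 1.0]]
weights = [0.5, 0.5]
profile = [0.9, 1.2, -0.8]

probs = membership_probabilities(profile, patterns, weights, sd=0.5)
```

Here the profile lies close to the first pattern, so nearly all of the probability mass is assigned to the first cluster.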

Clustering in the context of DNA microarray data usually refers to the discovery of groupings of genes. However, clustering of samples has been shown to be exceedingly valuable in its own right. In this case, biological samples (e.g., patients) are the objects to be grouped into clusters and genes are the attributes. In one application, this has led to the discovery of novel breast cancer subtypes [18]. Based on these findings a number of gene signatures have been devised to predict clinical outcome [19, 20], tumor subtype [21], and histologic grade [22], some of which are currently studied in clinical trials to improve treatment of the disease [23].

Context-specific clustering

Traditionally, genes and conditions are clustered separately and independently. The underlying assumption in this case is that genes are tightly co-regulated under all conditions of the experiment. Likewise, all genes are assumed to be relevant to determine the similarity of samples. In contrast, so-called biclustering methods [24-29] are designed to determine subsets of genes co-regulated only in a subset of experimental conditions or samples. The motivation here is that a given set of genes may act in concert as part of a common pathway under one condition but may act independently under another. Hence their co-regulation is context-specific. Such biological contexts include different tissue types, disease states, developmental stages, environmental factors, etc. As a result, corresponding gene clusters are specific to a subset of samples, and conversely, a grouping of samples is sensible only for a subset of genes. The search for gene clusters can inform the simultaneous search for sample groupings and vice versa, an advantage over sequential two-way clustering methods where both searches are independent. Unfortunately, like the traditional two-way clustering methods, most biclustering methods are also ad hoc, making it difficult to determine the significance of a given bicluster. Furthermore, biclustering results are often too complex to interpret as it is unclear how the different biclusters relate to each other.

Differential Co-expression

The complementary problem of searching for differential co-expression has only recently gained more attention [30-32]. The goal is to find genes co-expressed in one biological context but not in another due to, for example, dysregulation in a disease state such as breast cancer. A number of methods have been described in recent years to solve this problem with applications in prostate cancer [33], leukemia [34, 35], or muscle growth [36]. These studies show that differentially co-expressed genes, known to be disease related, could not be identified by differential expression analysis [33, 34, 36]. In a particular application, gene expression of developing muscle tissue in a bovine animal model (Wagyu cattle) was compared to a double-muscled model (Piedmontese cattle) expressing a version of the myostatin (MSTN) transcription factor known to carry the causal mutation for the observed phenotype. This gene was not found to be differentially expressed in multiple studies but was identified as differentially co-expressed with downstream transcriptional targets such as MYL2 [36].
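The distinction between differential expression and differential co-expression can be illustrated with a small sketch. The expression values below are invented for illustration and the helper function is hypothetical; the sketch simply compares the Pearson correlation of one gene pair within two pre-defined contexts and is not one of the methods cited above.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long expression vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical log-expression values of two genes in five samples per context.
gene1_context_a = [-1.0, -0.5, 0.0, 0.5, 1.0]
gene2_context_a = [-0.9, -0.6, 0.1, 0.4, 1.0]

# In context B, gene 2 takes the same values in a different order, so its
# mean is unchanged but it no longer tracks gene 1.
gene1_context_b = [-1.0, -0.5, 0.0, 0.5, 1.0]
gene2_context_b = [0.4, -0.9, 1.0, -0.6, 0.1]

r_a = pearson(gene1_context_a, gene2_context_a)
r_b = pearson(gene1_context_b, gene2_context_b)

# Difference in mean expression of gene 2 between the two contexts.
mean_shift = abs(sum(gene2_context_a) / 5 - sum(gene2_context_b) / 5)
```

Both genes have the same average expression in the two contexts, so a test for differential expression would find nothing, while the correlation between the genes drops from close to one in context A to close to zero in context B.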

A drawback of all these methods is that they operate under the assumption that the biological contexts are known and search for changes in correlation of expression levels between those contexts, thus requiring prior specification of contexts. To the best of our knowledge, no method has been proposed to explicitly search for contexts using differential gene co-expression to distinguish between biological samples.

Outline of this dissertation

The principal contribution of this dissertation is a new Bayesian method to simultaneously cluster both genes and biological samples using high-throughput gene expression data. We postulate a Gaussian infinite mixture model that incorporates the concept of differential gene co-expression to separate biological samples into contexts. We then design and implement computational methods to estimate the posterior distribution of the model parameters given data which include clustering of both genes and samples. Finally, we compute a novel score to determine differentially co-expressed genes.

The remainder of this dissertation is divided into three chapters. Chapter II comprises a manuscript submitted for publication describing the design and implementation of our Differential Co-expression Infinite Mixtures (DCIM) model and a novel, gene-specific differential co-expression score (DCS). We outline the statistical and computational framework to group biological samples into contexts and genes into global and further into local clusters, and to compute the DCS. We then apply our methods to a set of simulation scenarios and to a number of large-scale breast cancer studies. Chapter III is adapted from a recently published manuscript [37] describing a novel method, Clustering Enrichment Analysis (CLEAN), to functionally annotate gene clusterings. The CLEAN software package also contains tools to generate and display gene and sample clusterings, expression intensities, and functional annotations, which can be interactively viewed by the user. In addition, CLEAN can be used to compare results of different gene clustering algorithms. In Chapter IV, we show further results, such as the evaluation and comparison of gene clusterings and the application to other tissue types, combining the methods described in Chapters II and III.


Chapter II – A Semiparametric Bayesian model for unsupervised differential co-expression analysis identifies novel molecular subtypes

Introduction

Examination of genome-wide patterns of gene expression levels is frequently used to characterize differences and similarities between biological samples at the molecular level, and to elucidate underlying biological pathways and molecular networks. The analysis of gene expression profiles commonly focuses on either differential expression or co-expression [38]. In the former case, the goal is to identify genes whose expression level varies between two or more sample types or conditions. In contrast, co-expression (cluster) analysis is used to group together genes with similar expression patterns across different samples, and to group samples with similar global expression profiles.

In the case of breast cancer, such studies have led to the discovery of distinct molecular subtypes differing in clinical, histological, and molecular characteristics, as well as treatment response and disease outcome. They point to a diverse etiology of the disease with distinct molecular signatures involving numerous biological processes. Some of these findings are currently used in clinical trials aiming to improve patient prognosis and treatment [23].

More recently, differential co-expression [30-32] has been used to characterize dysregulation of gene expression regulatory networks in prostate cancer [33], leukemia [34, 35], or muscle growth [36]. In such analyses, a group of genes which are co-expressed within one biological context (e.g., normal prostate tissue samples) but not within another context (e.g., prostate tumor samples) are said to be differentially co-expressed. These studies demonstrated that some of the known disease related genes, which could not be identified by differential expression analysis, were actually differentially co-expressed [33, 34, 36]. All differential co-expression analysis methods to date require the prior definition of biological contexts within which the co-expression is to be compared.

Here we present a novel probabilistic approach to jointly uncover contexts (i.e., groups of samples) with specific co-expression patterns, and groups of genes with different co-expression patterns across such contexts. Our probabilistic differential co-expression infinite mixture (DCIM) model implicitly defines a new similarity measure for grouping biological samples based on the similarity of the gene co-expression structure within each sample. Two samples will be similar according to this measure if the same groups of genes are co-clustered in both samples, regardless of the overall similarity of the gene expression patterns in the classic sense such as high correlation and small Euclidean distance. To the best of our knowledge, this is the first time that changing co-expression patterns have been used to cluster samples based on gene expression data, and the first framework for completely unsupervised analyses of differential co-expression.

The DCIM model is based on the Bayesian semi-parametric Dirichlet process mixtures [39], also referred to as the infinite mixture model [15, 40]. The context specificity of the gene co-expression patterns is specified as in the context-specific infinite mixture model [41]. To facilitate de-novo search for contexts, we impose additional Dirichlet process-like priors on the membership of samples in different contexts. The use of infinite mixtures allows us to avoid specifying the number of global and local gene expression clusters as well as the number of contexts. Co-expression relationship and co-memberships in the same context are estimated by integrating over all possible values of these key parameters.
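The way a Dirichlet process prior avoids fixing the number of groups can be sketched with its standard Chinese restaurant process representation, in which a new object joins an existing group with probability proportional to the group's size or opens a new group with probability proportional to a concentration parameter. The function name and the symbol alpha are assumptions made for illustration; the exact priors used in the DCIM model are those given in the Methods section and may differ in detail.

```python
def crp_allocation_probabilities(group_sizes, alpha):
    """Prior allocation probabilities for one additional object.

    group_sizes: sizes of the groups formed by the objects allocated so far
    alpha: concentration parameter (assumed symbol), must be positive
    Returns one probability per existing group, followed by the
    probability of opening a new group.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    denominator = sum(group_sizes) + alpha
    probabilities = [size / denominator for size in group_sizes]
    probabilities.append(alpha / denominator)
    return probabilities

# Nine objects already allocated to three groups; where does the tenth go?
probs = crp_allocation_probabilities([5, 3, 1], alpha=1.0)
```

Because the last entry is never zero, the number of groups is free to grow with the data, which is what makes it possible to treat it as an unknown rather than a fixed input.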

Using the new methodology we revisit the problem of identifying molecular subtypes of breast cancers. We find that the patient groupings induced by the differential co-expression strongly predict disease outcome. Differentially co-expressed genes as well as the patterns of differential co-expression are highly reproducible across independent expression datasets.

Finally, we find that our differential co-expression 'signal' is complementary to other predictive parameters such as estrogen receptor (ER) status, lymph node (LN) status, and AURKA expression, as well as the 'signals' contained in the clusters of samples created using traditional similarity measures.

Results

Context-specific infinite mixture model

The DCIM model is based on the assumption that global gene clusters, consisting of genes with similar expression patterns across all samples, are grouped further into local clusters within each context consisting of samples with identical co-expression structure. In Figure 2.1.A, samples (i.e., columns) are organized into three contexts, and genes (i.e., rows) form four global clusters. Within context X, global clusters 1 and 3 are further grouped into a single local cluster and global clusters 2 and 4 are grouped into another local cluster. Consequently, within context X all gene expression profiles form only two local clusters. Similarly, within context Y, gene clusters 1 and 4 form a local cluster and gene clusters 2 and 3 form a local cluster. Since the local clustering of genes is different between groups of samples X and Y, these two groups form two different contexts. As a result, each context is characterized by a unique co-clustering structure of genes.
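The grouping just described can be written down as a small data structure. The sketch below encodes only the local clusters of contexts X and Y as stated in the text and checks that their co-clustering structures differ; the variable and function names are hypothetical.

```python
from itertools import combinations

# Local clusters per context, given as sets of global cluster labels.
local_clusters = {
    "X": [{1, 3}, {2, 4}],
    "Y": [{1, 4}, {2, 3}],
}

def co_clustered_pairs(groups):
    """All pairs of global clusters that share a local cluster."""
    pairs = set()
    for group in groups:
        pairs.update(combinations(sorted(group), 2))
    return pairs

pairs_x = co_clustered_pairs(local_clusters["X"])
pairs_y = co_clustered_pairs(local_clusters["Y"])

# Two groups of samples form one context only if these structures agree.
same_structure = pairs_x == pairs_y
```

For X and Y the two sets of co-clustered pairs have nothing in common, which is why the corresponding groups of samples are treated as two different contexts.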

Figure 2.1. Contexts are defined by unique local gene co-expression patterns. A) Genes grouped into global gene clusters, marked 1-4, are assumed to each derive from common underlying gene expression patterns. Depending on the respective biological contexts, marked X, Y, and Z, global clusters are grouped further locally. Conversely, biological samples are in the same biological context if they have the same groupings of co-clustered genes. As a result, each context is characterized by a unique co-clustering structure of genes. Given two contexts and their respective gene co-expression measures, we compute a differential co-expression score (DCS, right side panel). The data displayed in this heatmap were generated by the simulation algorithm (see Methods) at the σ=0.5 noise level. B) Average ROC curves were obtained for repeatedly simulated data with noise levels ranging from σ=0.4 to σ=0.8, with σ=0.5 displayed here, by averaging the FPRs (incorrectly co-clustered pairs of samples) and TPRs (correctly co-clustered pairs) for each distinct tree cut level. The curve for the DCIM algorithm (blue solid line) is consistently above the curves for hierarchical clustering with Euclidean distance (red dashed line) and Pearson correlation (green dotted line). C) To summarize ROC curves over all simulations at a given noise level σ, we compute the area under the curve (AUC) for each simulation and plot the average AUC against σ.

The DCIM model is specified in terms of a Bayesian Network [42]. A directed acyclic graph (DAG) is used to specify dependences in the model in terms of directed Markov independence. The local probability distributions for the key variables specifying the allocation of genes into global clusters (C), the allocation of global clusters into local clusters within each context (L), and the allocation of samples into different contexts (D) are given in terms of the priors derived from the respective Dirichlet processes. These distributions define the probabilities of each object (gene, global cluster, sample) being allocated to any of the already existing groups (global clusters, local clusters, contexts) or to a new group. The ability to create new groups when needed allows us to avoid specifying the exact number of groups of any kind a priori. The DAG and local probabilities define the unique joint probability distribution of data and all parameters in the model. Posterior distributions of all parameters given the data are estimated using a Gibbs sampler. Marginal posterior distributions of the three key allocation variables (C, D, L) are summarized in terms of the posterior pair-wise probabilities (PPPs) of global and local co-expression for each pair of genes and the PPPs of belonging to the same context for each pair of samples. It has previously been shown that such PPPs can be used as direct estimates of statistical confidence in two genes belonging to the same cluster [41]. Technical details pertaining to postulating the model and performing statistical inference are provided in the Methods section. Computational procedures for fitting the model are implemented in the R package gimmR, which can be downloaded freely from http://ClusterAnalysis.org.

Using the local PPPs of co-expression derived from the model, we apply a heuristic algorithm to search for differences between the local gene clusterings and to identify genes that are differentially co-expressed between two contexts (see Methods for details). The higher the resulting differential co-expression score (DCS) is for a gene, the more likely it is that this gene's co-clustering partners differ between the two contexts. In Figure 2.1.A, genes with high DCS between context X and contexts Y+Z are indicated by the thick blue bar on the right-hand side of the heatmap. Genes in cluster 1 distinguish context Y from contexts X and Z, while genes in cluster 2 distinguish context X from contexts Y and Z. Taken together, they define all three contexts.

Recovery of simulated contexts

We first evaluate our method using a series of simulated datasets at different noise levels with the data structure shown in Figure 2.1.A. Receiver operating characteristic (ROC) curves were constructed by computing the number of correctly (true positives) or incorrectly (false positives) co-clustered pairs of samples for each clustering and comparing the result to two traditional hierarchical clustering methods (Figure 2.1.B). ROC curves were summarized by calculating the average area under the curve (AUC) for each scenario. Average AUCs were consistently higher for our DCIM algorithm than for the traditional clustering algorithms (Figure 2.1.C), indicating a higher level of precision in reconstructing sample groupings across the whole range of noise levels.
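The pair-based ROC evaluation described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used in this study: the function names `pairwise_rates` and `auc` are hypothetical, and it assumes that each tree cut level yields one flat clustering of the samples, which contributes one (FPR, TPR) point.

```python
from itertools import combinations

def pairwise_rates(true_labels, predicted_labels):
    """Return (FPR, TPR) computed over all pairs of samples.

    A pair is a true positive if the two samples share a true context and
    are co-clustered; it is a false positive if they are co-clustered but
    belong to different true contexts.
    """
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = predicted_labels[i] == predicted_labels[j]
        if same_true and same_pred:
            tp += 1
        elif same_true:
            fn += 1
        elif same_pred:
            fp += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return fpr, tpr

def auc(points):
    """Trapezoidal area under ROC points, anchored at (0, 0) and (1, 1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
```

In this sketch, one would call `pairwise_rates` once per tree cut level, collect the resulting points, and pass them to `auc`; averaging over simulations is left out for brevity.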

To further accentuate the conceptual difference between the sample groupings based on our context-building algorithm and traditional similarity measures, we slightly modified the simulation procedure (Figure 2.2.A), leaving the co-expression structure unchanged but modifying the relative expression levels. For example, all “samples” in the first context still have identical co-expression structure, but the mean expression profile of the first two “samples” is different from the mean expression profile of the last three “samples”. As expected, groupings based on traditional similarity measures no longer corresponded to the underlying context structure. In contrast, the DCIM algorithm continues to correctly identify the underlying contexts (Figures 2.2.B and 2.2.C). These results indicate that, in general, our DCIM method is expected to produce groupings of biological samples that differ from the groupings constructed using traditional similarity measures.


Figure 2.2. Differences between contexts are defined by differential co-expression, not differential expression. A) Heatmap of a simulated data set (σ=0.3) to illustrate the imposed gene clusters and contexts. The simulation procedure was slightly modified, leaving the co-expression structure unchanged but modifying the relative expression levels by adding an additional fold change level. B) Gene clustering ROC curve at σ=0.5. C) Average gene clustering AUC plotted against σ.

Table 2.1: Patient characteristics of DCIM derived patient groups for the Schmidt et al. dataset [43] as shown in Figure 2.3.A (blue and green sample cluster, respectively).

Clinical parameter          Odds ratio   One-sided Fisher p-value
ER status (ER+, ER-)        9.33         7.0×10^-9
Tumor size (≤2cm, >2cm)     1.97         1.8×10^-2
Tumor grade (G1, G2, G3)    (N/A)        5.0×10^-9

Table 2.2: Comparison of computationally derived patient groups for the Schmidt et al. dataset [43] showing the ratio of the number of patients placed in poor/favorable survival groups by both algorithms divided by the number of all patients.

                      Pearson correlation   Euclidean distance   k-Means   DCIM
Pearson correlation   1                     0.74                 0.94      0.74
Euclidean distance    0.74                  1                    0.77      0.77
k-Means               0.94                  0.77                 1         0.80
DCIM                  0.74                  0.77                 0.80      1

Identifying molecular subtypes in breast cancer gene expression data

We examined the biological importance of uncovering differential co-expression (DCE) structure by performing alternative molecular sub-typing of breast cancer samples in a recent breast cancer gene expression dataset [43]. Figure 2.3.A shows the resulting hierarchical clustering of patient samples and the expression patterns of the 200 most differentially co-expressed genes, which were selected by ranking their DCS. Two clear sample groups, or contexts, are noticeable. A closer examination of the samples in these two contexts revealed a high correlation with key clinical parameters such as estrogen receptor (ER) status, tumor grade, and tumor size (Table 2.1). Membership in the two contexts was also highly predictive of the disease outcome, as indicated by Kaplan-Meier survival curves (Figure 2.3.C; logrank p-value = 5.1×10^-5) and statistically significant differences in 10-year survival rates (60.9% vs. 81.2%, p-value = 3.4×10^-3). Traditional similarity/distance measures induced considerably different sample groupings (Table 2.2), which had little or no correlation with the disease outcome (Table 2.3).

Table 2.3: Predicting survival outcome using different clinical, molecular, and computational methods (Schmidt et al. dataset [43]).

                                                              Size of patient groups
Category        Parameter / method                             poor survival   favorable survival   logrank p-value
Clinical        Tumor size (≤2cm, >2cm)                        88              112                  0.17
Clinical        Tumor grade (G1, G2/G3)                        165             35                   3.7×10^-3
Clinical        ER status                                      38              162                  0.12
Molecular       AURKA expression, <, > median                  100             100                  4.1×10^-3
Molecular       AURKA expression, k-Means (k=2)                62              138                  8.7×10^-5
Computational   Hierarchical clustering, Pearson correlation   71              129                  0.16
Computational   Hierarchical clustering, Euclidean distance    18              182                  0.037
Computational   k-Means clustering                             64              136                  0.043
Computational   DCIM                                           65              135                  5.1×10^-5


Figure 2.3. DCIM derived contexts in breast cancer data are predictive of patient survival. DCIM was applied to a large-scale breast cancer dataset [43], identifying contexts and the differential co-expression score (DCS). A) Resulting hierarchical clusterings of patients were split at the top level to define two contexts marked (1) and (2) in the top panel, which displays a heatmap of the 200 genes with the highest gene-specific DCS. The bottom panel shows the average expression profile for the three global gene clusters marked in the heatmap with corresponding color sidebars. Within context (1), all three gene clusters are co-expressed following the same pattern, while within context (2) the three clusters have distinguishable expression patterns. The right-hand panel shows significantly enriched functional categories for these genes as determined by CLEAN [37]. B) Empirical distribution of all pairwise gene-gene correlation coefficients (Pearson correlation) for the 154 genes marked by the left sidebar in panel A). Correlation is significantly higher in context (1) (solid line) than in context (2) (dashed line). As a control, the top right plot shows correlations for 154 randomly selected genes and the same two contexts, while the bottom right plot shows correlations for the same genes but with randomly reassigned context labels. C) To estimate patient survival we compute Kaplan-Meier curves for the two contexts. The estimated 10-year survival for the patient group corresponding to context (1) was 61% compared to 81% for context (2). The difference between the two curves was highly significant (logrank test p=5.1×10^-5). D) We used available (dichotomized) clinical, molecular, and computational parameters and their pairwise combinations to fit one-parameter and two-parameter Cox proportional hazard models and assessed the model fit for each parameter using the Akaike Information Criterion (AIC). The model combining the patient groups defined by DCIM contexts and by AURKA expression best predicts patient survival (red box).


Differentially co-expressed genes

The functional analysis of the 200 genes most differentially co-expressed between the two major contexts revealed enrichment for genes both positively and negatively associated with ER status (Figure 2.3.A). Genes negatively associated with ER status were tightly co-regulated within the “poor-prognosis” samples in context 1, but showed no co-expression within context 2. This cluster was also enriched for ESR1 regulatory targets as established in recent ChIP-chip experiments [44]. Genes positively associated with ER status are also tightly co-regulated, forming a large cluster (blue, red and green clusters combined) within context 1, but are generally less (red bar) or not at all (blue and green bars) co-regulated within context 2. This cluster was also enriched for Cell Adhesion, Cell Communication, and Mammary Gland Development genes. These differential co-expression patterns are also reflected in the distribution of pairwise correlations shown in Figure 2.3.B. Within context 1, the Pearson correlation coefficient between gene pairs is significantly higher than within context 2. Complete results of the functional analysis for the 200 most differentially co-expressed genes are provided in Table A1.

Table 2.4: Predicting survival outcome using the top 200 DCE genes with different clustering algorithms (Schmidt et al. dataset [43]).

Gene list           Clustering algorithm    Size group 1   Size group 2   logrank p-value
Top 200 DCE genes   Euclidean distance      181            19             1.01×10^-1
Top 200 DCE genes   Pearson's correlation   115            85             9.05×10^-1
Top 200 DCE genes   k-Means                 179            21             1.52×10^-2

As in the second scenario of our simulation study (Figure 2.2), sample groupings based on the differential co-expression of genes with the highest DCS were considerably different from sample groupings generated by traditional similarity measures on these same genes. Furthermore, the differences in disease outcomes were much smaller for the sample groupings generated by the traditional hierarchical clustering methods and the k-means algorithm (Table 2.4). This indicates that our method not only identifies functionally important genes, but also that the implicit similarity measure based on differential co-expression is necessary to optimally utilize the expression patterns of these genes in predicting the disease outcome.

Comparison to other outcome predictors

We compared the strength of association between disease outcome and the patient groupings induced by the DCIM algorithm to several alternative groupings based on important clinical and molecular parameters, and to unsupervised clustering of patient samples based on the traditional similarity measures (Table 2.3). Among the parameters with statistically significant predictive ability for disease outcome were tumor grade and expression of aurora kinase A (AURKA), a proliferation-associated gene shown to be a powerful predictor of survival for breast cancer [45]. Tumor size and ER status did not yield patient groups significantly different with respect to the disease outcome for this dataset. Given the high level of enrichment of ER status related genes among differentially co-expressed genes, it is particularly interesting that in this dataset ER status on its own was not strongly associated with the disease outcome. Among the unsupervised computational methods we compared, the k-Means algorithm and hierarchical clustering with Euclidean distance created patient groups with marginally statistically significant differences in disease outcome. The unsupervised analysis based on differential co-expression yielded the highest statistical significance for differences in survival between sample groupings.

To evaluate the independent contribution of the differential co-expression signature to predictive models based on other variables, we systematically evaluated the benefit of combining two classification methods using Cox regression. We found that the model significantly improved when including the DCIM based classification as compared to using any other variable alone. In particular, the model combining DCIM and AURKA expression had the lowest overall Akaike Information Criterion (AIC) (395.3), indicating the best model fit (Figure 2.3.D).

Reproducibility of differential co-expression signature across independent datasets

The reproducibility of DCE scores was assessed by repeating the analysis on two additional breast cancer datasets [46, 47]. The high correlations between DCS and the highly significant overlaps between lists of genes with the highest DCS for different datasets indicate the reproducibility of differential gene co-expression (Figure 2.4). Using information from all three datasets, we constructed a “differential co-expression signature set” by selecting 500 genes with top-ranking DCSs in all three datasets. Using only these genes to re-analyze all three datasets with the DCIM algorithm yielded remarkably consistent patterns of differential co-expression in all three datasets (Figure 2.5). The analysis of disease outcomes for sample groupings induced by differential co-expression in these two additional datasets was consistent with the results for the Schmidt et al. dataset. Despite the fact that the Miller et al. dataset [46] also contained samples from lymph node positive patients, the overall gene expression patterns in the two contexts were concordant with the expression patterns in the other two datasets. The lymph node status was in this case the strongest single predictor of the disease outcome, but the co-expression signal together with the lymph node status provided the best model fit in explaining the disease outcome among all 2-predictor combinations.


Figure 2.4. Reproducibility of the gene-specific DCS. The DCIM algorithm was applied to three breast cancer studies (Schmidt et al. [43], GSE11121; Miller et al. [46], GSE3494; Desmedt et al. [47], GSE7390) and the per-gene DCS were computed. The top panel shows Pearson (lower triangle) and Spearman correlation (upper triangle) comparing the DCS for the three datasets. To further evaluate our approach of selecting top-ranked genes based on DCS, we determined odds ratios (bottom left) and Fisher test p-values (bottom right) for cutoffs ranging from 1 to 1,000 in each of the three pairwise comparisons.

Table 2.5: Comparison of patient groupings derived from individual and joint (“meta”-) analyses suggesting that DCIM derived contexts are consistent.

                                      Individual analyses
                                      Favorable survival   Poor survival
Joint analysis   Favorable survival   494 (49.9%)          65 (6.6%)
Joint analysis   Poor survival        139 (14.1%)          291 (29.4%)


Figure 2.5. Heatmaps of top 500 DCS gene signature in three breast cancer studies. Expression patterns are remarkably consistent across different datasets. DCIM sample clusterings are highly correlated with ER status and patient survival.

Meta-analysis based on the differential co-expression signature

The predictive potential of the differential co-expression signature was then tested in the analysis of a 'super'-set (N = 989) comprising the three independent datasets described above and three additional studies [22, 48, 49]. Using the DCIM algorithm to cluster samples based on the 500 DC signature genes (Figure 2.6), we again observed a patient grouping with significantly different disease outcomes (logrank p = 3.8×10^-3), a highly significant correspondence to the groupings found when analyzing the data sets individually (Table 2.5, odds ratio = 15.9, Fisher p-value = 1.6×10^-77), and a correlation with ER status and tumor grade (Figure 2.6).
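As a quick arithmetic check, the reported odds ratio can be recomputed from the four counts in Table 2.5. The snippet below is a minimal sketch with hypothetical helper names (`odds_ratio`, `agreement`); it does not recompute the Fisher p-value.

```python
# Counts from Table 2.5: rows = joint analysis, columns = individual analyses,
# each ordered as (favorable survival, poor survival).
table_2_5 = [[494, 65],
             [139, 291]]

def odds_ratio(t):
    """Cross-product ratio of a 2x2 contingency table."""
    return (t[0][0] * t[1][1]) / (t[0][1] * t[1][0])

def agreement(t):
    """Fraction of patients assigned to the same group by both analyses."""
    total = sum(sum(row) for row in t)
    return (t[0][0] + t[1][1]) / total

print(round(odds_ratio(table_2_5), 1))  # 15.9
print(round(agreement(table_2_5), 3))   # 0.794
```

The four counts sum to 989, the size of the combined dataset, and roughly 79% of patients receive the same assignment in the joint and the individual analyses.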

Figure 2.6. Heatmaps of top 500 DCS gene signature in breast cancer meta-study. Expression patterns are remarkably consistent with individual datasets. DCIM sample clusterings are highly correlated with ER status, tumor grade, and patient survival.

Discussion and Conclusions

We have developed a novel model-based probabilistic method, DCIM, which allows us to organize biological samples into contexts based on the similarity of their groupings of co-clustered genes. This definition of similarity is different from commonly used approaches, where the similarity between sample gene expression profiles is a function of differences between expression levels for individual genes.

We demonstrated that breast cancer sample groupings based on differential co-expression are more strongly correlated with the disease outcome than the sample groupings produced by traditional clustering techniques. This echoes earlier findings showing that some disease-related genes are not differentially expressed but differentially co-expressed. In one particular application, gene expression of developing muscle tissue in a bovine animal model (Wagyu cattle) was compared to a double-muscled model (Piedmontese cattle) expressing a version of the myostatin (MSTN) transcription factor known to carry the causal mutation for the observed phenotype. This gene was not differentially expressed in multiple studies but was identified as differentially co-expressed with downstream transcriptional targets such as MYL2 [36].

Differentially co-expressed genes identified by our algorithm are functionally related to the etiology of breast cancer and are reproducible across independent breast cancer datasets. Finally, the DC signature was complementary to other clinical, pathological and molecular predictors, improving upon the precision of any other single predictor when used in the multivariate analysis.

The general features of expression profiles as well as lists of differentially co-expressed genes were highly reproducible across different datasets. When used in the meta-analysis of a combined dataset consisting of 989 breast cancer samples based on the DC signature, the DCIM algorithm produced sample groupings that were strikingly concordant with sample groupings obtained in separate analyses of individual datasets.

Gene expression profiling of breast cancer samples has been used to derive numerous distinct, but often overlapping, gene lists that are predictive of the disease outcome [23]. On the other hand, it has been shown that the gene expression profile of a single gene (AURKA) can serve as a surrogate for the predictive ability of most such lists [45]. Indeed, AURKA, along with many other proliferative markers, is among the top genes that are differentially expressed in the classical sense between the sample groupings generated by the DCIM algorithm (Table A.2).

More interestingly, we found that our 500 gene DC signature has significant overlap with the experimentally derived list of “intrinsic genes” [21] (Table 2.6). The “intrinsic genes” signature, consisting of genes with a high between-to-within-tumor ratio of variability in expression, has served as the gold standard for molecularly classifying breast tumors [21, 50, 51], and has also been shown to have predictive ability independent of the clinical parameters. The significant overlap with the “intrinsic genes” set indicates that our unsupervised algorithm is capable of identifying a subset of the experimentally identified informative gene set that could not be identified by any other currently available clustering algorithm.

Table 2.6: Overlap of the 500 DCS signature with other well-known breast cancer signatures.

Gene list                                          Reference                OR     Fisher p
Tumor subtypes (“Intrinsic breast cancer genes”)   Hu et al. [21]           3.96   6.38E-30
Histologic grade                                   Sotiriou et al. [22]     2.49   2.23E-03
Clinical outcome (“70-genes signature”)            van't Veer et al. [19]   3.75   8.45E-03
Metastasis (“76-genes signature”)                  Wang et al. [20]         2.49   7.96E-01

Methods

Infinite mixture model

Motivation. In a typical application we have a set of gene expression signatures of biological samples which represent a range of different phenotypes. Examples include clinical parameters such as age, gender, tumor stage and tumor grade, or molecular parameters such as estrogen receptor (ER) and progesterone receptor (PR) status. The different phenotypes likely correspond to a host of genetic pathways and other regulatory programs regulating the many functions of a cell within the particular tissue samples. Some of the activated pathways are expected to be common to all of the samples under investigation, while others are likely to be specific to a particular subset of samples. Liu et al. show that explicitly accounting for different sample subtypes (or contexts) significantly improves gene clustering, but their approach requires an a priori definition of contexts [41]. However, the optimal definition of contexts based on clinical parameters is not always known or obvious but may, for example, correspond to molecular breast cancer subtypes [18, 51]. We therefore propose a method to infer contexts directly from the data.

Figure 2.1 illustrates the rationale for our Differential Co-expression Infinite Mixtures (DCIM) model. As in Liu et al. [41], we call a group of samples of the same subtype a context. A group of genes co-expressed in all contexts is a global cluster, that is, these genes have a common expression pattern across all samples in the experiment. The common expression is assumed to be derived from a common underlying pattern. Global clusters that are indistinguishable locally, that is, within a given context, are grouped further into a local cluster of co-expressed genes. Conversely, groups of genes that are co-expressed (or co-clustered) only within specific subsets of samples but not across all samples thereby define contexts. As a result, each context is characterized by a unique gene co-clustering structure. To formally describe global and local gene clusters as well as contexts and their relation to the measured intensities, we postulate the above-mentioned Bayesian infinite mixture model.


Model parameters. Suppose we have a gene expression dataset for $N$ genes and $M$ biological samples or experimental conditions. $X$ is the $N \times M$ expression matrix where $x_{ij}$ is the relative expression level of gene $i$ in sample $j$. Accordingly, $x_i = (x_{i1}, x_{i2}, \dots, x_{iM})^T$ is the expression profile for gene $i$ and $x_j = (x_{1j}, x_{2j}, \dots, x_{Nj})^T$ is the expression signature for sample $j$. Each $x_i$ is assumed to be generated by one out of $Q$ underlying patterns. $C = (c_1, c_2, \dots, c_N)^T$ is the corresponding index variable; $c_i = q$ means that expression profile $x_i$ is generated by underlying pattern $q$, $q = 1, \dots, Q$, $Q \le N$. Patterns are represented by the $M$-variate normal distribution $N_M(\mu_q, \Sigma_q)$, that is, $c_i = q$ implies $x_i \sim N_M(\mu_q, \Sigma_q)$.

Likewise, samples within the same biological context are assumed to generate similar expression signatures. $D = (d_1, d_2, \dots, d_M)^T$ is an $M$-dimensional index variable; $d_j = r$ means that expression signature $x_j$ is assumed to originate from context $r$, $r = 1, \dots, R$, $R \le M$, and each context $r$ comprises $1 \le M_r \le M$ samples or experimental conditions. The two extreme cases are $R = 1$ and $R = M$. The former means that all expression signatures originate from the same context and is equivalent to a simpler clustering model where contexts are not defined. The latter means that each sample defines its own context.

Given $D$ and a gene expression profile $x_i$, $R$ subprofiles $x_i^1, x_i^2, \dots, x_i^R$ are defined such that $x_i^r = (x_{i j_r(1)}, x_{i j_r(2)}, \dots, x_{i j_r(M_r)})$ and $d_j = r$ for expression signatures $j_r(1), j_r(2), \dots, j_r(M_r)$. In other words, the $R$ contexts define $R$ subprofiles of a given gene expression profile.

Each expression profile is grouped into one of Q clusters representing the underlying overall patterns. We term this the global gene clustering. Thus, each subprofile of an expression profile is implicitly grouped into one of Q global clusters. Locally however, that is, within a given context, groups of subprofiles may be indistinguishable and therefore form local groups of subprofiles termed local clusters. In other words, global gene clusters are grouped further into local gene clusters depending on the biological context of the sample. The local gene clustering structure is represented by a Q × R matrix L where lqr = t means that, within context r, global cluster q is grouped into local cluster t.

Figure 2.7. Directed acyclic graph describing the DCIM computational model. Nodes represent random variables and edges indicate dependencies between nodes such that each random variable is conditionally independent of its non-descendants given its parent nodes (local Markov property).

Model specification. To specify the distribution of the expression data as well as the parameters described above, we postulate a Bayesian hierarchical model [52], displayed as a directed acyclic graph [42] in Figure 2.7. Here, nodes represent random variables and edges between the nodes indicate dependencies between nodes such that each random variable is conditionally independent of its non-descendants given its parent nodes (local Markov property).

The joint probability distribution of the random variables is given by

$$p(X, C, D, L, M, M^*, \Sigma, \alpha, \beta, \upsilon, a, a', \lambda, \tau) = p(X \mid C, M, \Sigma)\, p(C \mid \alpha)\, p(M \mid L, D, M^*)\, p(\Sigma \mid \beta, \upsilon)\, p(L \mid a)\, p(D \mid a')\, p(M^* \mid \lambda, \tau)\, p(\alpha)\, p(a)\, p(\beta)\, p(\upsilon)\, p(a')\, p(\lambda)\, p(\tau)$$

where $M = \{\mu_1, \dots, \mu_Q\}$ and $\Sigma = \{\Sigma_1, \dots, \Sigma_Q\}$ are the mean vectors and variance-covariance matrices defining the expression patterns of the profiles $x_i$, that is, $p(x_i \mid c_i = q, M, \Sigma) = f_N(x_i \mid \mu_q, \Sigma_q)$, where $f_N(\cdot \mid \mu, \Sigma)$ is the multivariate normal density function. Distributions for the random variables defining the global gene clustering $C$, the local gene clustering $L$, and the sample to context assignment $D$ are each defined according to the infinite mixture paradigm, which does not require a priori specification of the number of clusters and contexts, respectively [15, 41]. For example, the prior probability that a sample $j$ will be placed in an already existing context $r$ is

$$p(d_j = r \mid D_{-j}, a') = \frac{n_{-j,r}}{M - 1 + a'}$$

while the prior probability of $j$ being placed in a new context is

$$p(d_j \ne d_{j'},\, j \ne j' \mid D_{-j}, a') = \frac{a'}{M - 1 + a'}$$

where $n_{-j,r}$ is the number of samples currently in context $r$ without $j$. Probability distributions for the other parameters have been described previously [41] and are listed below.
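To make the prior allocation rule concrete, the following Python sketch evaluates both probabilities for a single sample given the current context labels of all other samples. It is an illustration only: the function name `context_prior` and the label `"new"` are hypothetical, and the concentration parameter (called `a` in the code) is passed in as a fixed number rather than sampled.

```python
from collections import Counter

def context_prior(contexts_without_j, a):
    """Prior probabilities of placing sample j into each existing context
    or into a new one, given the labels of the other M - 1 samples."""
    denom = len(contexts_without_j) + a    # M - 1 + a
    counts = Counter(contexts_without_j)   # n_{-j,r} for each existing context r
    probs = {r: n / denom for r, n in counts.items()}
    probs["new"] = a / denom               # probability of opening a new context
    return probs

# Three other samples: two in context 1, one in context 2; a = 1
print(context_prior([1, 1, 2], a=1.0))  # {1: 0.5, 2: 0.25, 'new': 0.25}
```

Larger contexts attract new members in proportion to their size, while the concentration parameter controls how readily a new context is opened; the probabilities always sum to one.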

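Variables in the model:

$x_i = (x_{i1}, x_{i2}, \dots, x_{iM})$, $i = 1, \dots, N$: observed gene expression profiles for all $N$ genes

$\mu_q = (\mu_{q1}, \dots, \mu_{qM})$, $q = 1, \dots, Q$: the mean profile for global cluster $q$

$x_i^r = (x_{i j_r(1)}, x_{i j_r(2)}, \dots, x_{i j_r(M_r)})$ where $d_j = r$, $r = 1, \dots, R$: the expression profile for gene $i$ within context $r$, $i = 1, \dots, N$

$\mu^*_{tr}$: mean expression profile for the local cluster $t$ within context $r$

$M = (\mu_1, \dots, \mu_Q)$

$\Sigma = (\Sigma_1, \dots, \Sigma_Q)$, where each $\Sigma_q$ is a diagonal matrix with context-specific cluster variances on the diagonal, that is, $\Sigma_q = \mathrm{diag}\big((\sigma^2_{q1}, \dots, \sigma^2_{q1}), (\sigma^2_{q2}, \dots, \sigma^2_{q2}), \dots, (\sigma^2_{qR}, \dots, \sigma^2_{qR})\big)$

$M^* = [(\mu^*_{11}, \dots, \mu^*_{Q_1 1}), \dots, (\mu^*_{1R}, \dots, \mu^*_{Q_R R})]$

Hyperparameters $\lambda$, $\tau$, $\beta$ and $\upsilon$ are all assumed to be context-specific: $\tau = (\tau_1, \dots, \tau_R)$, $\beta = (\beta_1, \dots, \beta_R)$, $\upsilon = (\upsilon_1, \dots, \upsilon_R)$; $\lambda = (\lambda_1, \dots, \lambda_R)$, $\lambda_r = (\lambda_{r1}, \dots, \lambda_{rM_r})$, where $M_r$ is the number of samples within context $r$.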

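Conditional distributions given parent nodes:

$$p(x_i \mid c_i = q, M, \Sigma) = f_N(x_i \mid \mu_q, \Sigma_q)$$

$$p\big((\mu_{q1}, \mu_{q2}, \dots, \mu_{qR}) \mid L, M^*, \Sigma\big) = \big(f_N(\mu_{q1} \mid \mu^*_{L_{q1} 1}, \Sigma_1),\ f_N(\mu_{q2} \mid \mu^*_{L_{q2} 2}, \Sigma_2),\ \dots,\ f_N(\mu_{qR} \mid \mu^*_{L_{qR} R}, \Sigma_R)\big)$$

$$p(c_i = q \mid C_{-i}, \alpha) = \frac{n_{-i,q}}{N - 1 + \alpha},\ q = 1, \dots, Q, \qquad p(c_i \ne c_{i'},\, i' \ne i \mid C_{-i}, \alpha) = \frac{\alpha}{N - 1 + \alpha},\ i = 1, \dots, N,$$

where $n_{-i,q}$ is the number of profiles placed in global cluster $q$ not counting the profile $i$.

$$p(L_{qr} = t \mid C, a) = \frac{n_{-qrt}}{Q - 1 + a},\ t = 1, \dots, Q, \qquad p(L_{kr} \ne L_{k'r},\, k' \ne k \mid C, a) = \frac{a}{Q - 1 + a},$$

where $n_{-qrt}$ is the number of global clusters currently placed in local cluster $t$ within context $r$ without counting the $q$th global cluster.

$$p(\mu^*_{tr} \mid \lambda_r, \tau_r) = f_N(\mu^*_{tr} \mid \lambda_r, \tau_r^{-1} I), \qquad p(\sigma^{-2}_{tr} \mid \beta_r, \upsilon_r) = f_G\Big(\sigma^{-2}_{tr} \,\Big|\, \frac{\beta_r}{2}, \frac{\beta_r \upsilon_r}{2}\Big), \quad r = 1, \dots, R$$

$$p(\upsilon_r \mid \sigma^2_{xr}) = f_G\Big(\upsilon_r \,\Big|\, \frac{1}{2}, \frac{\sigma^{-2}_{xr}}{2}\Big), \qquad p(\beta_r) = f_G\Big(\beta_r \,\Big|\, \frac{1}{2}, \frac{1}{2}\Big)$$

$$p(\tau_r \mid \sigma^2_{xr}) = f_G\Big(\tau_r \,\Big|\, \frac{1}{2}, \frac{\sigma^2_{xr}}{2}\Big), \qquad p(\lambda_r \mid \mu_{xr}, \sigma^2_{xr}) = f_N(\lambda_r \mid \mu_{xr}, \sigma^2_{xr} I)$$

$$p(\alpha^{-1}) = f_G\Big(\alpha^{-1} \,\Big|\, \frac{1}{2}, \frac{1}{2}\Big)$$

where

$$\mu_{xr} = \frac{\sum_{i=1}^{N} x_i^r}{N}, \qquad \sigma^2_{xr} = \frac{\sum_{i=1}^{N} (x_i^r - \mu_{xr})'(x_i^r - \mu_{xr})}{N n_r - 1}$$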

Posterior Conditional Distributions:


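$$p(\mu^*_{tr} \mid C, L, \sigma^2_{tr}, X, \lambda_r, \tau_r) = f_N\left(\mu^*_{tr} \,\middle|\, \frac{\tau_r^{-1}\, \bar{x}^r_{tr} + \frac{\sigma^2_{tr}}{n^*_{tr}}\, \lambda_r}{\tau_r^{-1} + \frac{\sigma^2_{tr}}{n^*_{tr}}},\ \frac{\tau_r^{-1}\, \frac{\sigma^2_{tr}}{n^*_{tr}}}{\tau_r^{-1} + \frac{\sigma^2_{tr}}{n^*_{tr}}}\, I\right), \quad \text{where } \bar{x}^r_{tr} = \frac{\sum_{i:\, L_{c_i r} = t} x_i^r}{n^*_{tr}}$$

and $n^*_{tr}$ is the total number of expression profiles grouped in global clusters which are placed in the local cluster $t$ within the context $r$. Similarly, the variance for all global clusters placed in the local cluster $t$ within the context $r$ is

$$p(\sigma^{-2}_{tr} \mid X, M^*, \beta_r, \upsilon_r) = f_G\left(\sigma^{-2}_{tr} \,\middle|\, \frac{n_r n^*_{tr} + \beta_r}{2},\ \frac{s^2_{tr} + \beta_r \upsilon_r}{2}\right), \quad \text{where } s^2_{tr} = \sum_{i:\, L_{c_i r} = t} (x_i^r - \mu^*_{tr})'(x_i^r - \mu^*_{tr})$$

$$f(\lambda_r \mid \mu^*_{1r}, \dots, \mu^*_{Q_r r}, \tau_r) = f_N\left(\lambda_r \,\middle|\, \frac{\sigma^2_{xr}\, \frac{1}{Q_r} \sum_{t=1}^{Q_r} \mu^*_{tr} + \frac{\tau_r^{-1}}{Q_r}\, \mu_{xr}}{\frac{\tau_r^{-1}}{Q_r} + \sigma^2_{xr}},\ \frac{\frac{\tau_r^{-1}}{Q_r}\, \sigma^2_{xr}}{\frac{\tau_r^{-1}}{Q_r} + \sigma^2_{xr}}\, I\right)$$

$$f(\tau_r \mid \mu^*_{1r}, \dots, \mu^*_{Q_r r}, \lambda_r) = f_G\left(\tau_r \,\middle|\, \frac{n_r Q_r + 1}{2},\ \frac{\sum_{t=1}^{Q_r} (\mu^*_{tr} - \lambda_r)'(\mu^*_{tr} - \lambda_r) + \sigma^2_{xr}}{2}\right)$$

$$f(\upsilon_r \mid \sigma^{-2}_{1r}, \dots, \sigma^{-2}_{Q_r r}, \beta_r) = f_G\left(\upsilon_r \,\middle|\, \frac{Q_r \beta_r + 1}{2},\ \frac{\beta_r \sum_{t=1}^{Q_r} \sigma^{-2}_{tr} + \sigma^{-2}_{xr}}{2}\right)$$

$$f(\beta_r \mid \sigma^{-2}_{1r}, \dots, \sigma^{-2}_{Q_r r}, \upsilon_r) \propto \Gamma\Big(\frac{\beta_r}{2}\Big)^{-Q_r} \exp\Big(-\frac{\beta_r^{-1}}{2}\Big) \Big(\frac{\beta_r}{2}\Big)^{\frac{Q_r \beta_r - 3}{2}} \prod_{t=1}^{Q_r} (\upsilon_r \sigma^{-2}_{tr})^{\frac{\beta_r}{2}} \exp\Big(-\frac{\sigma^{-2}_{tr} \upsilon_r \beta_r}{2}\Big)$$

$$f(\alpha \mid Q, N) \propto \frac{\alpha^{Q - \frac{3}{2}} \exp\big(-\frac{1}{2\alpha}\big)\, \Gamma(\alpha)}{\Gamma(N + \alpha)}$$

$$p(c_i = q \mid C_{-i}, L, x_i, M, \Sigma) \propto \frac{n_{-i,q}}{N - 1 + \alpha}\, f_N(x_i \mid \mu_q, \Sigma_q)$$

$$p(c_i \ne c_{i'},\, i' \ne i \mid C_{-i}, x_i, \mu_x, \sigma^2_x) \propto \frac{\alpha}{N - 1 + \alpha} \int f_N(x_i \mid \mu_q, \Sigma_q)\, p(\mu_q, \Sigma_q \mid \lambda, \tau)\, d\mu_q\, d\Sigma_q$$

$$p(L_{qr} = t \mid X, a) \propto \frac{n_{-qrt}}{Q - 1 + a}\, f_N\Big(\bar{x}^r_q \,\Big|\, \mu^*_{tr}, \frac{\sigma^2_{tr}}{n_q} I\Big)$$

$$p(L_{kr} \ne L_{k'r},\, k' \ne k \mid X, a) \propto \frac{a}{Q - 1 + a} \int f_N\Big(\bar{x}^r_q \,\Big|\, \mu^*_{tr}, \frac{\sigma^2_{tr}}{n_q} I\Big)\, p(\mu^*_{tr}, \sigma^2_{tr})\, d(\mu^*_{tr}, \sigma^2_{tr})$$

where $Q_r$ is the number of local clusters within context $r$, $n_{-i,q}$ is the number of profiles in global cluster $q$ without counting profile $i$, $n_{-qrt}$ is the number of global clusters grouped into local cluster $t$ within context $r$ not counting the $q$th global cluster, and $\bar{x}^r_q = \sum_{i:\, c_i = q} x_i^r / n_q$.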
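The update for a local cluster mean is a standard conjugate normal update, and it can be illustrated with a short Python sketch. The code below is an assumption-laden illustration, not the gimmR implementation: the function name `local_mean_posterior` is hypothetical, all variances are treated as known scalars, and `tau` is interpreted as a precision so that the prior variance is `1 / tau`.

```python
def local_mean_posterior(xbar, sigma2, n_star, lam, tau):
    """Posterior mean vector and (scalar) variance of a local cluster mean.

    xbar   : average subprofile of all genes in the local cluster
    sigma2 : within-cluster variance
    n_star : number of profiles in the local cluster
    lam    : prior mean vector
    tau    : prior precision (prior variance is 1 / tau)
    """
    prior_var = 1.0 / tau
    data_var = sigma2 / n_star              # variance of the cluster average
    total = prior_var + data_var
    # Precision-weighted compromise between the data average and the prior mean
    mean = [(prior_var * x + data_var * l) / total for x, l in zip(xbar, lam)]
    var = prior_var * data_var / total
    return mean, var

mean, var = local_mean_posterior(xbar=[2.0], sigma2=1.0, n_star=4, lam=[0.0], tau=1.0)
```

With four profiles, unit variances and a prior mean of zero, the posterior mean is pulled most of the way from the prior mean toward the data average (about 1.6 in this example), and the posterior variance (about 0.2) is smaller than both the prior variance and the variance of the average.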

Model fit. Given data, that is, a set of gene expression profiles, the goal is to estimate the posterior distribution of the model parameters $p(C, D, L, M, M^*, \Sigma, \alpha, \beta, \upsilon, a, a', \lambda, \tau \mid X)$ and, in particular, the marginal distribution of the parameters $C$, $L$, and $D$ given the data, which describe the gene clusterings and the contexts.

We accomplish this goal by implementing a Gibbs sampler [53], a Markov chain Monte Carlo algorithm that can be used when the conditional probability for each random variable in the model is known given all other variables. Our implementation is an extension of the previously described algorithm [41]. In particular, we use the following posterior conditional probability distribution to update contexts D, that is, to place sample j into existing context r:

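$$p(d_j = r \mid x_j, D_{-j}, L, \Sigma, \lambda, \tau) \propto \frac{n_{-j,r}}{M - 1 + a'} \prod_{q=1}^{Q_r} \int \prod_{\{i \mid c_i = t,\, L_{tr} = q\}} f_N(x_{ij} \mid \mu_{jrq}, \sigma^2_{rq})\, p(\mu_{jrq}, \sigma^2_{rq} \mid \lambda, \tau)\, d\mu_{jrq}$$

and, to place sample $j$ into a new context:

$$p(d_j \ne d_{j'},\, j \ne j' \mid x_j, D_{-j}, C, \Sigma, \lambda, \tau) \propto \frac{a'}{M - 1 + a'} \prod_{q=1}^{Q} \int \prod_{\{i \mid c_i = q\}} f_N(x_{ij} \mid \mu_{jq}, \sigma^2_{q})\, p(\mu_{jq}, \sigma^2_{q} \mid \lambda, \tau)\, d\mu_{jq}$$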

Notice that we integrate out the mean parameter $\mu_q$ in the likelihood function. As a result, the posterior probability function does not depend on the mean vectors, but it accounts for the specific local gene clusterings $L$; that is, the better a local clustering $L_r$ fits sample $j$, the higher the likelihood.

All other conditional posterior distribution functions are as described in [41] and reproduced above. We also adapt the “reverse annealing” procedure [16, 41] to counteract situations where the Gibbs samples do not “mix” well.

To summarize the Gibbs samples of clustering variable C and the context variable D we use the previously described approach of first averaging clusterings thus computing posterior pairwise probabilities (PPPs) and then generating a hierarchical clustering with the PPPs as the similarity measure and average linkage as the agglomeration strategy [15, 41].
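As an illustration of the averaging step, the following Python sketch computes PPPs as the fraction of recorded Gibbs samples in which two items share a cluster label. The function name and the list-of-label-vectors input format are assumptions made for this example, not the gimmR interface.

```python
def posterior_pairwise_probabilities(samples):
    """Estimate PPPs from Gibbs samples of a clustering variable.

    samples: list of label vectors, one per retained Gibbs step;
             samples[g][i] is the cluster label of item i at step g.
    Returns an N x N matrix whose (i, j) entry is the fraction of
    steps in which items i and j were co-clustered.
    """
    n = len(samples[0])
    counts = [[0] * n for _ in range(n)]
    for labels in samples:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(samples) for c in row] for row in counts]

ppp = posterior_pairwise_probabilities([[0, 0, 1], [0, 1, 1]])
```

The quantity 1 - PPP can then serve as the distance for average linkage hierarchical clustering.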

We extend this approach to compute “local” or context-specific PPPs. Let S be a context, S = {sample j}. After each Gibbs step g, we have recorded the parameters C(g), D(g), and L(g). Given g, we can now generate a sample-specific clustering Cs(g) := (c(s)1(g), c(s)2(g), …, c(s)N(g))^T for all s in S, where

o Cs(g) is an index variable: c(s)i = q(s) means that, after Gibbs step g, gene i was assigned to sample-specific cluster q(s) for sample s,

o ds(g) = r (i.e., “sample s was in context r after Gibbs step g”),

o lqr(g) = t (i.e., “within context r, global cluster q was grouped into local cluster t after Gibbs step g”),

o ci(g) = q (i.e., “gene i was assigned to global cluster q after Gibbs step g”),

o and, given these three assignments, c(s)i(g) := t.

In other words, the sample-specific local clustering after each Gibbs step is identical to the local clustering corresponding to the context the sample was a member of after that Gibbs step. Cs(g) is generated for all samples s in S and for all Gibbs steps after the “burn-in” phase.


Local PPPs specific to context S can now be computed for all gene pairs using the same procedure as for the global gene PPPs.
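The construction of the sample-specific clusterings can be sketched in a few lines of Python. The data layout is an assumption made for illustration: `c[i]` holds the global cluster of gene i, `d[s]` the context of sample s, and `l[q][r]` the local cluster of global cluster q within context r, all for one Gibbs step.

```python
def sample_specific_clustering(c, d, l, s):
    """Return (c(s)_1, ..., c(s)_N) for sample s at one Gibbs step.

    c: global cluster label per gene, c[i] = q
    d: context label per sample, d[s] = r
    l: local cluster lookup, l[q][r] = t
    """
    r = d[s]                        # context that sample s belongs to
    return [l[q][r] for q in c]     # map each gene's global cluster to its local cluster

def context_clusterings(steps, context_samples):
    """Collect sample-specific clusterings for all samples in a context
    over all retained (post burn-in) Gibbs steps."""
    return [sample_specific_clustering(c, d, l, s)
            for (c, d, l) in steps
            for s in context_samples]

c = [0, 0, 1, 2]                    # four genes in three global clusters
d = [0, 1]                          # two samples in two contexts
l = [[0, 0], [0, 1], [1, 1]]        # l[q][r]
local = context_clusterings([(c, d, l)], context_samples=[0, 1])
```

The resulting label vectors can be averaged into local PPPs in the same way as the global clusterings.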

Availability. The algorithms described above are available as part of the R package gimmR, which can be downloaded from our website (http://Clusteranalysis.org/). The software generates output files that can be viewed and analyzed both directly in R [54] and using the FTreeView program [37]. A list of the R functions contributed to the gimmR package and short descriptions can be found in Appendix B.

Differential co-expression score

As described above, our model describes groups of samples or contexts and both global and local, that is context-specific, gene clusterings. Given this framework and given two contexts, we consider a pair of genes differentially co-expressed (DCE), if they are co-clustered in one context, but not in the other. Contexts are user-defined but this choice is typically guided by the posterior distribution of the D parameter in the model.

The definition of co-expression depends on pairs, if not larger groups, of genes, but it does not depend on groupings of samples. Gene co-expression groups (i.e., clusters) can be viewed by the user with tools such as TreeView [55], which also facilitate further analysis such as determining functional enrichment of gene clusters [37]. Differential expression refers to single genes but depends on at least two groups of samples. In this case, researchers are accustomed to assigning a score (fold change, p-value, etc.) to each gene, allowing them to rank and prioritize genes. The challenge with the analysis of differential co-expression lies in the fact that it depends on groupings of both genes and biological samples. Simultaneously viewing clusterings for two contexts would be too complex for all practical purposes, as would be an attempt to prioritize pairs of genes based on a score. Therefore, we here propose a gene-specific differential co-expression score (DCS) which assigns a single number to each gene while at the same time accounting for the context-specific clusterings of the genes. The intuition is that the majority of genes typically will remain co-clustered with the same group of genes regardless of context. A smaller number of genes, however, are expected to have changing co-expression patterns. This behavior is reflected in the differences of pairwise gene distances for the context-specific clusterings. The score will be computed by averaging these differences for each gene, not over all possible gene pairs but over only the pairs within a local gene cluster within either context.

Given two contexts c1 and c2, we compute the gene-specific DCS as follows:

Algorithm 2.1. Gene-specific DCS.

1) For each context c,

a. Compute the N×N posterior pairwise probability (PPP) matrix PPPc of any two genes being co-clustered within c.

b. Construct the hierarchical tree Tc by applying average linkage hierarchical clustering with the local PPP matrix as the similarity measure.

2) Calculate the N×N matrix Diff = (dij) = abs(PPPc1 - PPPc2) of absolute differences between the two PPP matrices.

3) For each context c,

a. Cut Tc at all possible levels to obtain a list of gene clusters Gc, where cutting Tc at level (1-p) induces a gene clustering such that the average PPP between each pair of genes within a resulting cluster is greater than p.

b. For each gene cluster g in Gc,

i. For each gene i, compute the score DCScluster(i, g, c):

DCScluster(i, g, c) = Σj dij/(|g|-1), summing over all genes j in g with j≠i, if gene i is in g, where |g| is the size of cluster g;

DCScluster(i, g, c) = 0, if i is not in g.

4) For each gene i, compute the gene-specific score DCSgene(i) = max{g,c}(DCScluster(i, g, c)).
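The steps of Algorithm 2.1 can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the gimmR implementation: the clusters obtained by cutting the tree at all possible levels are taken to be the merge nodes of the average linkage agglomeration, singleton clusters are skipped (their denominator |g|-1 is zero), and ties between equally similar cluster pairs are broken by taking the first pair found. All function names are hypothetical.

```python
def average_linkage_clusters(sim):
    """Return every non-singleton cluster created while agglomerating
    with average linkage on a similarity matrix (list of lists)."""
    clusters = {i: [i] for i in range(len(sim))}
    next_id, found = len(sim), []
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # average similarity between all cross-cluster pairs
                s = sum(sim[i][j] for i in clusters[a] for j in clusters[b])
                s /= len(clusters[a]) * len(clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        _, a, b = best
        merged = sorted(clusters.pop(a) + clusters.pop(b))
        clusters[next_id] = merged
        found.append(merged)
        next_id += 1
    return found

def dcs_gene(ppp_c1, ppp_c2):
    """Gene-specific differential co-expression score for two contexts."""
    n = len(ppp_c1)
    # Step 2: matrix of absolute PPP differences
    diff = [[abs(ppp_c1[i][j] - ppp_c2[i][j]) for j in range(n)] for i in range(n)]
    scores = [0.0] * n
    for ppp in (ppp_c1, ppp_c2):                     # Steps 1 and 3: per context
        for g in average_linkage_clusters(ppp):
            for i in g:                              # DCScluster(i, g, c)
                s = sum(diff[i][j] for j in g if j != i) / (len(g) - 1)
                scores[i] = max(scores[i], s)        # Step 4: maximum over g and c
    return scores

# Genes 0-2 co-cluster in context 1; gene 2 joins gene 3 in context 2.
ppp1 = [[1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [0, 0, 0, 1]]
ppp2 = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
scores = dcs_gene(ppp1, ppp2)
```

In this toy example, gene 2, which changes its co-clustering partners between the contexts, receives the maximal score, and so does gene 3, whose only partner in context 2 is gene 2.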

Simulation study

We designed a simple data simulation procedure to study the ability of different algorithms to correctly identify gene clusters and sample clusters or contexts as previously described [41]. As in the example shown in Figure 2.1, each simulated N×M data matrix X comprises four gene clusters and three contexts. Clusters 1 and 2 each have 20 genes while clusters 3 and 4 each have 80 genes. Each of the three contexts has five samples. Thus, M=15 and N=200. Each gene expression profile xi is assumed to be generated by one of four underlying patterns representing the four gene clusters such that xi ~ N(μc, σ²), where μc = (μc1, …, μcM) and gene i is generated by pattern c. For clusters 3 and 4, μc is assumed to be identical for all samples, that is, “low” (=0) and “high” (=1), respectively. In contrast, for cluster 1, μc is assumed to be “high” for samples 1-5 and “low” for samples 6-15, while for cluster 2, μc is “high” for samples 6-10 and “low” otherwise. Thus, only gene clusters 1 and 2 allow distinguishing the three contexts. The noise parameter σ is the same for all clusters and contexts and ranges from 0.4 to 0.8 across simulations. Each simulation is repeated 100 times. Figure 2.1 shows a heatmap of one of the simulated datasets at the σ=0.5 noise level. In addition, we slightly modify the procedure described above to generate a second set of simulations where the “high” values are set to 2 instead of 1 for samples 1-2, 6-8, and 11-12, thus leaving the co-expression patterns (and contexts) intact but changing the expression levels in some samples (Figure 2.2).
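A minimal Python sketch of the first simulation design is given below. It follows the cluster sizes and mean patterns stated above; the function name `simulate_dataset`, the `seed` argument, and the use of Python's standard random number generator are assumptions made for illustration only.

```python
import random

N_SAMPLES = 15
CLUSTER_SIZES = {1: 20, 2: 20, 3: 80, 4: 80}
# Zero-based sample indices at which each cluster's mean is "high" (=1)
HIGH_SAMPLES = {1: range(0, 5), 2: range(5, 10), 3: range(0, 0), 4: range(0, 15)}

def simulate_dataset(sigma, seed=None):
    """Simulate one 200 x 15 expression matrix and the true gene cluster labels."""
    rng = random.Random(seed)
    data, labels = [], []
    for c in (1, 2, 3, 4):
        mu = [1.0 if j in HIGH_SAMPLES[c] else 0.0 for j in range(N_SAMPLES)]
        for _ in range(CLUSTER_SIZES[c]):
            # independent Gaussian noise with the same sigma for every entry
            data.append([rng.gauss(m, sigma) for m in mu])
            labels.append(c)
    return data, labels

X, gene_clusters = simulate_dataset(sigma=0.5, seed=1)
```

The second simulation set could be obtained by replacing the "high" value with 2 for the listed samples before adding noise.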

Given a hierarchical clustering of samples, we compute Receiver Operating Characteristic (ROC) curves based on the number of correctly or incorrectly co-clustered pairs of samples after cutting the tree at each possible distinct level (1, …, M). Given hierarchical tree T and level p, cutting T horizontally at level p induces a sample clustering with Np clusters, 1 ≤ Np ≤ M. For a given level p, we can therefore compute the true positive rate (TPR) and false positive rate (FPR) by assessing for each pair of co-clustered samples whether or not they both are in the same ‘true’ context. As a result, we have M FPR-TPR pairs, one for each cutting level, and can plot the ROC curve. We average ROC curves over multiple simulations by averaging the corresponding FPRs and TPRs at each distinct tree cutting level.

Since the ROC curves are piecewise linear, it is straightforward to compute the area under the curve (AUC). When comparing the performance of two clustering algorithms, we cannot assume independence between AUC measures as they also depend on the ‘difficulty’ of the particular simulated dataset. We therefore use a paired t-test rather than the unpaired test to compare AUCs of DCIM to other methods at a given noise level, with DCIM as the reference. We compute error bars for the AUC plots using the 95% confidence interval of the paired t-test statistic.
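The evaluation described above can be sketched in Python as follows: pair-based TPR and FPR for one cut of the tree, the trapezoidal AUC of the resulting piecewise linear curve, and the paired t statistic on per-dataset AUCs. Function names are hypothetical, and the sketch assumes that both co-clustered and non-co-clustered true pairs exist.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, stdev

def pair_rates(predicted, truth):
    """(FPR, TPR) of co-clustered sample pairs for one clustering."""
    tp = fp = pos = neg = 0
    for i, j in combinations(range(len(truth)), 2):
        same_true = truth[i] == truth[j]
        same_pred = predicted[i] == predicted[j]
        pos += same_true
        neg += not same_true
        tp += same_pred and same_true
        fp += same_pred and not same_true
    return fp / neg, tp / pos

def auc(points):
    """Area under a piecewise linear ROC curve given (FPR, TPR) points."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

def paired_t(auc_a, auc_b):
    """Paired t statistic for AUCs of two methods on the same datasets."""
    d = [a - b for a, b in zip(auc_a, auc_b)]
    return mean(d) / (stdev(d) / sqrt(len(d)))

truth = [0, 0, 1, 1]
cuts = [[0, 1, 2, 3], [0, 0, 1, 1], [0, 0, 0, 0]]   # clusterings from successive cut levels
roc = [pair_rates(cut, truth) for cut in cuts]
area = auc(roc)
```

Pairing the differences per simulated dataset removes the dataset-specific 'difficulty' from the comparison.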


Breast cancer studies

Data preprocessing and gene selection. Raw data files (Affymetrix HG-U133A, HG-U133+2.0 CEL files) for six human breast cancer datasets (GEO expression series GSE11121 [43], GSE1456 [48], GSE2990 [22], GSE3494 [46], GSE7390 [47], and GSE9195 [49]) were downloaded from the public repository GEO [56]. Each dataset was RMA-preprocessed [57] separately using the Entrez Gene-based custom CDF (version 10) from the Psychiatry/MBNI Microarray Lab at the University of Michigan (‘Brainarray’) [58]. We applied a mild variation filter using Cancer Outlier Profile Analysis (COPA, 95th percentile) [59] to select the top 10,000 genes to be clustered in each of the human breast cancer datasets. In each dataset, expression profiles were centered by setting the median expression value of each gene to zero (subtracting the gene-specific medians). Table 2.7 summarizes patient characteristics for the three independent datasets.
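The filtering and centering steps can be illustrated with a short Python sketch. It assumes the commonly used COPA definition (median-center each gene, scale by the median absolute deviation, and take a high percentile of the scaled values) and a nearest-rank percentile; both choices, and all function names, are assumptions and may differ from the implementation used for the analyses.

```python
import math
import statistics

def median_center(profile):
    """Subtract the gene-specific median so the median expression is zero."""
    med = statistics.median(profile)
    return [v - med for v in profile]

def copa_score(profile, pct=95):
    """COPA-style score: pct-th percentile of the median/MAD-scaled profile.
    Assumes a non-zero median absolute deviation."""
    centered = median_center(profile)
    mad = statistics.median(abs(v) for v in centered)
    scaled = sorted(v / mad for v in centered)
    rank = max(1, math.ceil(pct / 100 * len(scaled)))   # nearest-rank percentile
    return scaled[rank - 1]

def top_genes(profiles, k):
    """Indices of the k genes with the highest COPA scores."""
    order = sorted(range(len(profiles)), key=lambda i: copa_score(profiles[i]), reverse=True)
    return order[:k]

profiles = [[1, 2, 3, 4, 5], [1, 2, 3, 4, 100]]
selected = top_genes(profiles, k=1)
```

In the toy example, the second profile contains an outlying sample and is therefore ranked first.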

Table 2.7: Overview of the breast cancer studies used.

Reference                        [43]                        [46]                         [47]
GEO accession                    GSE11121                    GSE3494                      GSE7390
Number of patients               200                         251                          198
Endpoint                         Metastasis-free survival    Disease-specific survival    Metastasis-free survival
Treatment                        Surgery, 125 (63%) also     Surgery, 110 (43%) patients  Surgery
                                 received irradiation        also received adjuvant
                                                             therapy
ER status (+/-/NA)               162/38/-                    213/34/4                     134/34/-
LN status (+/-/NA)               -/200/-                     84/158/9                     -/198/-
Tumor grade (1/2/3/NA)           29/136/35/-                 67/128/54/2                  30/83/83/2
Age (years): mean (standard      60 (12)                     62 (14)                      46 (7)
deviation)
Tumor size (cm): mean (standard  2.1 (1.0)                   2.2 (1.3)                    2.2 (0.8)
deviation)


We created a joint expression data set from these six data sets as follows. First we identified unique patient samples based on the patient annotation made available on the GEO website. Next, we RMA-preprocessed and per-gene normalized all unique samples separately for each microarray platform (Affymetrix HG-U133A and HG-U133+2.0, respectively) again using the Entrez Gene-based custom CDF (version 10). Finally, we combined the resulting expression sets by matching all 11,961 Entrez Gene ID based probesets represented on both platforms.

Survival analysis and other statistical analyses. We computed Kaplan-Meier curves and Cox proportional hazard regression using the survival package in R [54]. Survival times and end points together with other annotation data were obtained directly from the GEO website using the GEOquery R package. Where multiple end points were available, we chose disease-specific or metastasis-free survival rather than overall survival (Table 2.7). The included studies differed considerably by observation time, while typical endpoints in clinical trials are 5-year overall or disease-specific survival. To make results more comparable across studies, we therefore censored the observation time at 5 and/or 10 years. For the Cox proportional hazard model fit, we dichotomized parameters as follows. Tumor size: ≤/> 2cm; tumor grade: grade 1/grades 2 and 3; ER status: +/-; AURKA gene expression (median): ≤/> median after preprocessing; AURKA gene expression (k-Means): cluster 1/cluster 2; computational methods: cluster 1/cluster 2.
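As a small illustration of the administrative censoring and dichotomization, consider the following Python sketch. The function names and the encoding of events (1 = event observed, 0 = censored) are assumptions for this example; the actual analysis used the survival package in R.

```python
def censor_at(time, event, horizon):
    """Censor follow-up at a fixed horizon (same time unit as `time`).

    Patients observed beyond the horizon are treated as event-free
    at the horizon; all others are left unchanged.
    """
    if time > horizon:
        return horizon, 0
    return time, event

def dichotomize(value, cutoff):
    """Return 1 if value exceeds the cutoff and 0 otherwise (the ≤/> split)."""
    return int(value > cutoff)

follow_up = [(7.2, 1), (3.0, 1), (12.5, 0)]            # (years, event indicator)
censored_5y = [censor_at(t, e, 5) for t, e in follow_up]
large_tumor = dichotomize(2.4, cutoff=2.0)             # tumor size in cm
```

An event occurring after the horizon is thus counted as censored at the horizon rather than as an event.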

All statistical analyses were performed using the statistical programming environment R version 2.7.1 [54] and Bioconductor release 2.2 [60]. This includes other clustering algorithms, namely hierarchical clustering with Euclidean distance or 1-Pearson correlation as distance measure and k-Means clustering. We applied average linkage as the agglomeration strategy when DCIM's 1-PPP or 1-Pearson correlation was the distance measure and complete linkage for Euclidean distance. Unless otherwise stated, k-Means implies k=2.

Functional annotation of gene lists and clusterings was performed with the R package CLEAN [37] with functional categories derived from GO [61], KEGG [62], and L2L [63], among other computationally and literature derived categories.

An R package containing the DCIM algorithm and the scripts used for our analysis are available at the supplemental website (http://Clusteranalysis.org/).


Chapter III: CLEAN – CLustering Enrichment ANalysis2

Background

Identifying groups of co-expressed genes through cluster analysis has been successfully used to elucidate affected biological pathways and postulate transcriptional regulatory mechanisms [64, 65]. The integration of biological knowledge in such analyses has been most commonly facilitated by assessing the enrichment of clusters with genes from pre-defined functionally coherent gene lists (“functional categories”). The concept of “functionally related genes clustering together” has been established by ad-hoc visual examination of hierarchical clustering results and their enrichment by genes from the same functional category [11]. The first assessment of statistical significance of such enrichments was performed by analyzing results of k-means clustering [12] using the hypergeometric distribution [66]. Similar strategies have also been used in the analysis of lists of differentially expressed genes [67], gene lists constructed based on genome-wide Chromatin Immunoprecipitation (ChIP) [68, 69] and epigenomics experiments [70], as well as the general approach to integrate lists of genes derived by various experimental and knowledge-based procedures [71]. Introducing biological knowledge through such post-hoc analysis has been important for interpreting results and separating reproducible, biologically meaningful gene clusters from clusters that may have resulted from random fluctuations in the data. For both of these objectives, reproducibility of conclusions made is of utmost importance.

2 This chapter has previously been published in [37].

The first two concept-defining papers [11, 66] also highlight the dichotomy that exists to this day in using hierarchical vs. partitioning clustering procedures. Hierarchical procedures do not necessitate specifying the “right” number of clusters, a parameter generally unknown in advance whose estimation from the data leads to instability in clustering results [15]. On the other hand, selecting “meaningful” clusters in a hierarchical clustering that can then be correlated with functional categories using the hypergeometric distribution is still mostly performed by ad-hoc visual inspection of related heatmaps. Algorithms for systematic testing of all possible clusters have also been developed [72-74], but results of such analyses are difficult to summarize. Postulating the “right” number of clusters or choosing “good” clusters in an ad-hoc fashion before correlating them with functional categories can result in poor reproducibility since a slightly different number of clusters or slightly different “good” clusters can result in a different interpretation of the results. This problem is akin to choosing the “optimal” cut-off criteria for selecting differentially expressed genes before performing similar functional analyses. It has been shown that results of such analyses are highly sensitive to changes in the cut-off used, with different cut-offs yielding different conclusions [75]. In the analysis of differentially expressed genes, computational alternatives have been developed that do not require setting such thresholds [76-78], but they are generally not applicable in the knowledge-based assessment of clustering results.

A frequently encountered problem in analyzing genome-wide experimental data is to choose among results produced by different clustering algorithms. Criteria such as homogeneity and separation are relatively straightforward to compute but are mostly of theoretical interest. A more relevant criterion from a biological perspective is the overall functional coherence of resulting gene clusters. Most of the methods developed to date for this purpose require specification of the number of clusters [79, 80]. Comparing different methods at a fixed number of clusters is problematic as some methods might create a better clustering structure when more clusters are allowed and others could create better clusterings when few clusters are allowed. To circumvent this problem, ROC curves have been used to assess false and true positive rates of co-clustered gene pairs using the functional categories as a gold standard [41, 81]. However, this same strategy lacked discriminative power when a large number of large functional categories, such as Gene Ontology (GO) terms, served as a gold standard, and it again required fixing the number of clusters [81, 82].

We developed an analytical framework and flexible computational infrastructure for integrating knowledge-based functional categories into the cluster analysis of gene expression data. The framework is based on the simple, conceptually appealing and biologically interpretable gene-specific functional coherence CLustering Enrichment ANalysis (CLEAN) score derived by correlating the clustering structure as a whole with functional categories of interest. The CLEAN score is gene-specific and it differentiates between the levels of functional coherence for genes within the same cluster. The statistical significance of coherence scores is established by comparing them to the empirical null-distribution obtained by randomly permuting gene identifiers. The corresponding computational infrastructure is based on an open-source R package for the data analysis and an open-source Java viewer for visually integrating and analyzing expression data and associated knowledge-based functional categories.

We investigate the reproducibility of the findings based on the CLEAN scores, and demonstrate its utility in comparing the functional coherence of clusterings produced by different algorithms and in selecting genes with informative expression patterns. Being gene-specific, the CLEAN score facilitates easy comparisons of functional coherence of different hierarchical structures (e.g., generated by different clustering algorithms) and selection of genes based on functional coherence of their expression pattern without the need to fix the number of clusters.

On the other hand, we demonstrate that differentiating between the levels of functional coherence for genes within the same cluster leads to significant improvements in reproducibility of findings across independent microarray datasets when compared to traditional cluster-wide analyses. Furthermore, genes selected based on the CLEAN score produced more precise sample groupings than genes selected using the cluster-wide score.

Results

Given a hierarchical clustering of genes based on their expression profiles and a set of functional categories (e.g., Gene Ontologies), the CLustering Enrichment ANalysis (CLEAN) score for a gene is calculated as follows (Figure 3.1):

1. Fisher's exact test for enrichment is calculated for all functional categories containing the gene and for all possible clusters containing this gene (Figure 3.1).

2. The CLEAN score is then computed as the maximum –log10(q-value) of enrichment tests across all pairs of clusters containing the gene and functional categories containing the gene (see methods for details).
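The two steps above, together with the cluster-wide variant (cwCLEAN) used for comparison in this chapter, can be sketched in Python as follows. This is a simplified illustration and not the CLEAN R package: the enrichment test is the one-sided hypergeometric form of Fisher's exact test, Benjamini-Hochberg adjusted p-values stand in for q-values, and the adjustment is applied across all cluster-category pairs at once. These choices and all function names are assumptions.

```python
from math import comb, log10

def enrichment_p(k, cluster_size, cat_size, n_genes):
    """One-sided Fisher's exact (hypergeometric) p-value P(overlap >= k)."""
    upper = min(cluster_size, cat_size)
    hits = sum(comb(cat_size, x) * comb(n_genes - cat_size, cluster_size - x)
               for x in range(k, upper + 1))
    return hits / comb(n_genes, cluster_size)

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (stand-in for q-values)."""
    m = len(pvals)
    adjusted, running = [0.0] * m, 1.0
    for pos, i in enumerate(sorted(range(m), key=lambda i: pvals[i], reverse=True)):
        rank = m - pos                          # rank of p-value i in ascending order
        running = min(running, pvals[i] * m / rank)
        adjusted[i] = running
    return adjusted

def clean_scores(clusters, categories, n_genes):
    """Return (CLEAN, cwCLEAN) score lists indexed by gene.

    clusters:   list of sets of gene indices (all clusters of the tree)
    categories: list of sets of gene indices (functional categories)
    """
    tests = [(g, c) for g in clusters for c in categories]
    qvals = bh_adjust([enrichment_p(len(g & c), len(g), len(c), n_genes) for g, c in tests])
    clean, cw = [0.0] * n_genes, [0.0] * n_genes
    for (g, c), q in zip(tests, qvals):
        score = -log10(q)
        for gene in g:
            cw[gene] = max(cw[gene], score)            # any category counts
            if gene in c:
                clean[gene] = max(clean[gene], score)  # category must contain the gene
    return clean, cw

clean, cw = clean_scores([{0, 1, 2}, {3, 4}], [{0, 1, 5}], n_genes=10)
```

In the toy example, gene 2 sits in the enriched cluster but is not a member of the enriched category, so it receives a cwCLEAN score but a CLEAN score of zero.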

The clustering-specific null-distribution of the CLEAN score is established by randomly permuting gene identifiers. Statistically significant scores are then used to facilitate selection of genes or gene clusters, as well as the assessment of functional coherence and the comparison of clustering results produced by different algorithms. The integrated clustering viewer/browser, Functional TreeView (FTreeView), is used for integrative browsing and visual display of gene clusters and associated functional categories.

Figure 3.1. Calculating functional coherence scores. Given a hierarchical clustering of genes based on their expression profiles and a set of functional categories (e.g., Gene Ontologies), the CLustering Enrichment ANalysis (CLEAN) score for a gene is calculated as the maximum of –log10(Fisher's Exact Test q-value) of enrichment tests across all pairs of clusters containing the gene and functional categories containing the gene (see methods for details). The cluster-wide CLEAN score (cwCLEAN) is calculated in a similar fashion except that the maximum is taken over all clusters that contain the gene and all functional categories regardless of whether they contain the gene or not.

When multiple category types are used, the joint CLEAN score is calculated as the maximum of CLEAN scores for each category type. Here we focus on three sets of functional categories: Gene Ontology (GO) categories [61], Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways [62, 83], and a custom set of Co-regulation Groups (CG) based on the computational analysis of gene promoters and regulatory motif definitions in the Transfac database, version 12.1 [84] (see methods).

All currently used algorithms assign statistical significance of functional enrichment to whole clusters instead of individual genes within the cluster. To compare the properties of the CLEAN score to currently used methods we define a cluster-wide CLEAN (cwCLEAN) score to serve as a surrogate for this traditional type of analysis. The cwCLEAN score is defined as the maximum of –log10(q-value) for all clusters containing the gene regardless of whether the enriched functional categories contain the gene or not (Figure 3.1). We analyzed several public microarray datasets to demonstrate the statistical properties and utility of the CLEAN framework, and to compare its performance to traditionally used approaches.

Comparing clustering results using the CLEAN score

The CLEAN score provides a tool to compare the functional coherence of clustering results produced by different clustering algorithms on a gene-by-gene basis without requiring a pre-defined number of gene clusters. We used four independent large-scale breast cancer gene expression datasets (GEO accession numbers GSE1456 [48], GSE3494 [46], GSE7390 [47], and GSE11121 [43]) to demonstrate the utility of the CLEAN score in choosing the clustering structure with the highest functional coherence. For each dataset we compared the performance of three typical clustering algorithms: the context-specific infinite mixture model (CSIMM) [41], Euclidean distance-based hierarchical clustering, and Pearson's correlation-based hierarchical clustering. For all three algorithms, the hierarchical clustering was constructed using the average linkage principle, and the algorithms were applied to expression data with and without prior variance-rescaling (see methods).


Figure 3.2. Comparison of clustering methods. We compared the functional coherence of six clustering algorithms: context-specific infinite mixture model (CSIMM), Euclidean distance-based and Pearson's correlation-based hierarchical clustering, each with and without prior variance-rescaling of the data, across four independent human breast cancer datasets (GEO expression series GSE1456 [48], GSE3494 [46], GSE7390 [47], and GSE11121 [43]). For all six algorithms, the hierarchical clustering was constructed using the average linkage principle. The number of genes common to all four datasets after filtering was 6,150. CLEAN score thresholds are plotted on the x-axis and the corresponding numbers of genes with a CLEAN score at or above the threshold are plotted on the y-axis. A larger area under the curve implies higher functional coherence.

For each clustering algorithm the total number of genes (y-axis) with a CLEAN score as high as or higher than the given threshold was plotted against all possible threshold levels (x-axis). Two conclusions can be drawn immediately from the results in Figure 3.2. First, variance-based rescaling of the data significantly improved the functional coherence of the resulting clustering. While the CSIMM model is capable of compensating for this effect to some extent, the performance of both CSIMM and Euclidean distance-based algorithms improved after the data was re-scaled. Since the Pearson's correlation coefficient implicitly performs such re-scaling, there is little difference between its performance with and without re-scaling. After the data was re-scaled, all three algorithms performed almost identically, indicating that the re-scaling is the key step in improving the functional coherence of the data. The second conclusion is that these non-trivial results were reproduced in all four independent breast cancer datasets, which is an important indication of their applicability to other datasets of this type.
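The statement that Pearson's correlation implicitly performs the re-scaling can be checked numerically with a short Python sketch. Here variance-rescaling is assumed to mean dividing each gene profile by its sample standard deviation (the exact procedure is given in the methods and may differ); the function names are hypothetical.

```python
from math import sqrt

def rescale(profile):
    """Divide a profile by its sample standard deviation (assumed rescaling)."""
    m = sum(profile) / len(profile)
    sd = sqrt(sum((v - m) ** 2 for v in profile) / (len(profile) - 1))
    return [v / sd for v in profile]

def pearson(x, y):
    """Pearson's correlation coefficient of two profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def euclidean(x, y):
    """Euclidean distance between two profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x, y = [1, 2, 3, 4], [10, 20, 30, 50]
r_before, r_after = pearson(x, y), pearson(rescale(x), rescale(y))
d_before, d_after = euclidean(x, y), euclidean(rescale(x), rescale(y))
```

The correlation is unchanged by the rescaling, whereas the Euclidean distance between the two similarly shaped profiles shrinks substantially.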

Reproducibility and the comparison with cluster-wide scores

We used the same four independent breast cancer gene expression datasets (GSE1456 [48], GSE3494 [46], GSE7390 [47], and GSE11121 [43]) and a study comparing tissue-specific gene expression patterns in mouse and human [85] to investigate reproducibility of the CLEAN scores. We first assessed the contribution of the functional data to the reproducibility of the clustering results by comparing the correlation between the CLEAN scores (Figure 3.3.A) to the correlation of pairwise distances used to construct the hierarchical clustering of genes (Figure 3.3.B) in two datasets (GSE3494 and GSE7390). In this analysis pairwise distances are based on the Bayesian posterior pairwise probabilities (PPPs) produced by the CSIMM algorithm [41]. The substantially higher correlation for the CLEAN score (0.82 vs. 0.52 for PPPs) indicated a significant increase in reproducibility of results in terms of functional coherence of the gene expression patterns over the simple clustering that does not incorporate an assessment of functional coherence. The heatmap of expression profiles for the genes with the highest CLEAN scores in both datasets (circled in Figure 3.3.A) showed a coherent pattern of expression within both datasets (Figure 3.3.C), and all these genes are related to the immune system, a functional group commonly implicated in the etiology of cancer in general.

Figure 3.3. Integrating cluster analysis and functional knowledge. Genes were clustered using the CSIMM [41] algorithm and variance-scaled data from two independent breast cancer datasets (GSE3494 [46] and GSE7390 [47]), and CLEAN scores were computed for both clusterings. The number of genes common to both datasets after filtering was 8,567. A) The gene-specific CLEAN scores for the two datasets were plotted against each other and the Pearson's correlation coefficient was computed. A small random error was added in the scatter plot to better visualize overlapping data points. B) Pairwise similarity measures between genes computed by CSIMM were also plotted and correlated. C) Expression profiles of genes with the very highest CLEAN scores in both datasets showed strong co-expression in both datasets. All genes in this cluster are immunity related.

Next, we performed a comprehensive study of the reproducibility of the CLEAN scores in four breast cancer datasets and five clustering algorithms described in the previous section (since Pearson's correlation implicitly re-scales data, only the Pearson's correlation clustering with re-scaled data was used in this case). The heatmap in Figure 3.4 represents the similarities of clusterings for different algorithms and different datasets in terms of the CLEAN and the cwCLEAN scores. Three groupings of clusterings-by-score type combinations clearly emerge: clusterings formed using Euclidean distances and un-scaled data, cwCLEAN scores for clusterings based on the re-scaled data and CSIMM algorithm using un-scaled data, and CLEAN scores based on the re-scaled data and CSIMM algorithm using un-scaled data.

Figure 3.4. Reproducibility of CLEAN and cwCLEAN scores. The reproducibility of the functional coherence results for 6 different clustering algorithms was assessed by calculating all pairwise Pearson's correlation coefficients between scores for all algorithms applied to four independent human breast cancer datasets (GEO expression series GSE1456 [48], GSE3494 [46], GSE7390 [47], and GSE11121 [43]). Rows and columns in this symmetric heatmap represent specific scores for a specific clustering in a specific dataset. The symmetric hierarchical clustering of rows and columns was constructed using pairwise Pearson's correlations between different scores as the similarity measure and applying the complete linkage principle.

The improvement in reproducibility was further assessed by analyzing differences in correlations between CLEAN and cwCLEAN scores of all 6 pairs of breast cancer datasets for three different clustering algorithms (Figure 3.5.A). All differences are positive, which indicates that the correlation coefficient was higher for CLEAN scores in each of the 6 pairs for all three algorithms. The increased reproducibility was also evident in the analysis utilizing the statistical significance cut-offs established by randomizing the gene labels for each clustering separately. For each pair of datasets we constructed a 2-by-2 contingency table based on the statistical significance scores (as in Table 3.1), and calculated differences in the odds ratios and the statistical significance of overlaps between lists of statistically significant genes in different datasets for a given algorithm and functional coherence score (CLEAN or cwCLEAN) (Figure 3.5.B). All differences in odds ratios were positive, again indicating higher reproducibility of CLEAN scores. Similarly, differences in the statistical significances (-log10(p-values)) of the Fisher's Exact tests for the same contingency tables were also all positive, indicating the higher reproducibility of CLEAN scores (Figure 3.5.C).

Figure 3.5. Differences in the reproducibility of CLEAN and cwCLEAN scores. Improvements in the reproducibility of CLEAN over cwCLEAN scores were demonstrated by box plots of differences in correlation coefficients, and odds ratios and p-values in 2-by-2 contingency tables of statistically significant scores. A) Box plots of differences in correlations between CLEAN and cwCLEAN scores of all 6 pairs of breast cancer datasets for three different clustering algorithms. All differences are positive, which indicates that the correlation coefficient was higher for CLEAN scores in each of the 6 pairs. B) Box plots of differences in odds ratios for 2-by-2 contingency tables of statistically significant CLEAN and cwCLEAN scores for all 6 pairs of breast cancer datasets and three different clustering algorithms. All differences are positive, indicating higher reproducibility of CLEAN scores. C) Box plots of differences in the statistical significances (-log10(p-values)) of the Fisher's Exact test for the same contingency tables as in B). The fact that all differences are positive again indicates higher reproducibility of CLEAN scores.

We repeated a similar type of analysis for mouse and human datasets profiling gene expression in different tissues (79 human and 61 mouse tissue types) [85]. After matching human and mouse probes using HomoloGene identifiers [56] we obtained 10,287 common genes that were represented on both microarray platforms. We constructed CSIMM-based gene clusterings for both species and applied CLEAN separately for the human and mouse datasets using both GO and KEGG based functional categories. A statistically significant association between the genes with statistically significant scores in the two species, assessed using the Fisher's exact test for two-by-two tables, was firmly established for both CLEAN and cwCLEAN scores (Table 3.1 and 3.2, respectively). However, the statistical significance and the strength of association were considerably higher for the CLEAN score (odds ratio=3.82 and p-value=4.4×10^-207) than for the cwCLEAN score (odds ratio=1.49 and p-value=1.8×10^-17).

Table 3.1. Contingency table of genes with significant and non-significant CLEAN score in human and mouse tissues. This table shows results for the gene-specific CLEAN score. The odds ratio is 3.82 and the Fisher Exact test p-value is 4.4×10^-207.

                                              Human CLEAN score
                                              Significant (> 2.7)    Non-significant (< 2.7)
Mouse CLEAN score   Significant (> 3.2)       2,057                  1,173
                    Non-significant (< 3.2)   2,222                  4,835

Table 3.2. Contingency table of genes with significant and non-significant cwCLEAN score in human and mouse tissues. When using the cluster-wide cwCLEAN score, the odds ratio is 1.49 and the Fisher Exact test p-value is 1.8×10^-17.

                                                Human cwCLEAN score
                                                Significant (> 2.7)    Non-significant (< 2.7)
Mouse cwCLEAN score   Significant (> 3.2)       4,852                  1,315
                      Non-significant (< 3.2)   2,937                  1,183
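The odds ratios reported for Tables 3.1 and 3.2 can be checked directly from the cell counts. The short Python sketch below does this arithmetic; the function name `odds_ratio` is ours and serves only as an illustration.

```python
def odds_ratio(a, b, c, d):
    """Sample odds ratio (a*d)/(b*c) for the 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)

# Rows: mouse significant / non-significant.
# Columns: human significant / non-significant.
clean_table = (2057, 1173, 2222, 4835)     # Table 3.1
cwclean_table = (4852, 1315, 2937, 1183)   # Table 3.2

clean_or = odds_ratio(*clean_table)
cwclean_or = odds_ratio(*cwclean_table)
print(round(clean_or, 2), round(cwclean_or, 2))  # 3.82 1.49
```

Both tables sum to the 10,287 genes common to the two platforms.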


Unsupervised selection of informative genes

Reproducibly identifying genes whose expression patterns can delineate biologically meaningful groups of samples has been an important problem in computational biomedicine. We focus on the situation when the identity of samples belonging to different groups or even the number of the groups is not known in advance. In this case, the informative genes have to be selected in an unsupervised fashion. By studying the problem of classifying samples from different tissue types in the integrated mouse-human dataset, we demonstrate the utility of using the CLEAN score to select informative genes. We first identify genes with statistically significant CLEAN scores in mouse and human tissue expression profiling datasets. Then we show that expression profiles of these genes facilitate better separation of samples from different tissue types than expression profiles of genes not having statistically significant CLEAN scores.

Furthermore, we demonstrate that the improvements in precision are significantly larger when using the CLEAN score than when using the cluster-wide cwCLEAN scores.

We created a total of 6 different gene lists and assessed their abilities to distinguish different tissue types in the combined human-mouse expression dataset. Gene lists were as follows:

All Genes: All 10,287 genes present in both microarray platforms.

Significant CLEAN score: Genes with significant CLEAN scores in both human and mouse datasets.

Non-significant CLEAN score: Genes with non-significant CLEAN scores in both human and mouse datasets.

Significant cluster score: Genes with significant cwCLEAN scores in both human and mouse datasets.

Non-significant cluster score: Genes with non-significant cwCLEAN scores in both human and mouse datasets.

COPA genes: Genes identified by applying Cancer Outlier Profiler Analysis (COPA) [59]. This COPA list of 2,668 genes was generated by performing COPA separately for the human and mouse datasets, selecting the top 5,000 genes in each, and using the genes that overlap between the two datasets. This procedure was tuned to produce a number of genes similar to the number of genes with significant CLEAN scores.

Tissue samples were then clustered based on each of these gene lists using average linkage hierarchical clustering with Euclidean distance. Co-clustered pairs of samples derived from the same tissue type (regardless of whether they are human or mouse derived) were considered true positives, and co-clustered pairs of samples derived from different tissues were considered false positives. By cutting the hierarchical tree structure at all possible levels and each time recording the number of true and false positives, we determined the receiver operating characteristic (ROC) for each of the gene lists. Since the number of positive pairs (232) is small compared to the number of negative pairs (38,828), we plotted the number of false positive pairs divided by the total number of positive pairs on the x-axis, as recently described [86], instead of the traditional false positive rate. Genes with significant CLEAN scores in both human and mouse tissue expression sets were significantly better at separating different tissue types than genes with non-significant CLEAN scores (Figure 3.6.A). Genes with significant cwCLEAN scores were marginally better at separating different tissue types than genes with non-significant cwCLEAN scores (Figure 3.6.B), but the difference was considerably smaller than for the CLEAN score. Using COPA to select informative genes was ineffective, as it did not show any improvement over using all genes (Figure 3.6.C).
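The pair-counting scheme behind these ROC curves can be sketched in a few lines of Python. This is an illustration under our own assumptions, not the code used in the study: each tree cut is supplied as a list of cluster labels (the hierarchical clustering itself is not reproduced), the y-axis is taken to be the usual true positive rate, and the names `pair_counts` and `roc_points` are hypothetical.

```python
from itertools import combinations

def pair_counts(cluster_labels, tissue_labels):
    """Count co-clustered sample pairs from the same tissue (TP) and from different tissues (FP)."""
    tp = fp = 0
    for i, j in combinations(range(len(cluster_labels)), 2):
        if cluster_labels[i] == cluster_labels[j]:
            if tissue_labels[i] == tissue_labels[j]:
                tp += 1
            else:
                fp += 1
    return tp, fp

def roc_points(cuts, tissue_labels):
    """One (x, y) point per tree cut: x = FP / #positive pairs, y = TP / #positive pairs."""
    n_pos = sum(1 for a, b in combinations(tissue_labels, 2) if a == b)
    points = []
    for labels in cuts:
        tp, fp = pair_counts(labels, tissue_labels)
        points.append((fp / n_pos, tp / n_pos))
    return points

# Toy example: four samples, two tissues, three cuts of a hypothetical tree.
tissues = ["liver", "liver", "brain", "brain"]
cuts = [
    [0, 1, 2, 3],  # every sample in its own cluster
    [0, 0, 1, 2],  # the two liver samples merged
    [0, 0, 0, 0],  # a single cluster containing all samples
]
points = roc_points(cuts, tissues)
```

Because the x-axis is normalized by the number of positive pairs rather than negative pairs, it can exceed 1, as it does for the single-cluster cut in the toy example.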

Figure 3.6. Unsupervised selection of informative genes. Genes were clustered based on their expression across different tissue samples, and functional coherence scores were calculated for the human and mouse datasets separately. The ability of different groups of genes to facilitate correct grouping of samples from the same tissue type in the combined human-mouse dataset was assessed by constructing ROC curves. The ROC curve for clustering samples based on all 10,287 genes is included in each plot (red line) for reference. A) ROC curves for clustering samples based on genes with statistically significant CLEAN scores in both mouse and human datasets, and genes not statistically significant in either of the datasets. B) Same as A) for the cwCLEAN instead of CLEAN scores. C) ROC curves based on genes selected using COPA. The number of selected genes was identical to the number of genes with statistically significant CLEAN scores used in A).

Computational Infrastructure

We developed an open-source R package [54] that performs Clustering Enrichment Analysis (CLEAN). Typically, the user will provide a gene expression data set and a clustering of the genes. The package is intended for hierarchical clusterings but can also accommodate non-hierarchical clusterings such as k-means [12]. The package is compatible with a number of common input formats. GO and KEGG functional categories are derived from the respective Bioconductor packages [60], and users can provide their own functional categories. The CLEAN package provides functions to compute the CLEAN score and generate output files to interactively display expression data together with gene and sample clusterings, and functional cluster annotation.


Figure 3.7. Integrated software package. CLEAN was implemented as an add-on R package [54]. The package integrates routines for calculating gene-specific functional coherence scores and the interactive Java-based viewer Functional TreeView (FTreeView). The figure shows a screenshot of an FTreeView session displaying CLEAN results for one breast cancer dataset, GSE3494 [46]. FTreeView was developed from the original Java TreeView [55] by adding panel 3, which displays functional cluster annotations generated by the CLEAN R package. This functionality enables seamless integration and browsing of functional categories associated with each cluster of genes (panel 2), which in turn can be selected based on the functional coherence scores (panel 1). The selected cluster of genes (panel 2), which we identified based on the overall high CLEAN scores (panel 1), is highly enriched for genes associated with immunity-related Gene Ontology terms (FDR < 10^-60) as well as two KEGG pathways, and putative targets of the Interferon Consensus Sequence-binding protein (ICSBP) transcription factor. These results can be viewed interactively at http://Clusteranalysis.org using the Java Web Start version of FTreeView.

In addition, we extended the Java-based expression data viewing software TreeView [55] to interactively display functional cluster annotations and the cwCLEAN scores produced by the CLEAN R package. Figure 3.7 shows a screenshot of the new viewer, which we named Functional TreeView (FTreeView), displaying CLEAN results for the breast cancer dataset GSE3494 [46]. Panel 1 displays the per-gene functional coherence scores for individual category types. The broader the red bar, the higher the score. Green indicates statistically non-significant functional coherence scores. Guided by the display of the CLEAN scores, the user can choose a subset of genes by selecting a branch of the hierarchical gene clustering tree (panel 2). Functional cluster annotations generated by CLEAN for the selected group of genes are displayed in panel 3. The interactive display of functional annotations is the major new feature of FTreeView, and it allows for seamless integration and browsing of functional categories associated with each cluster of genes. Such an integrated view of clustering results, expression patterns, and the enriched functional categories facilitates straightforward interactive identification of functionally coherent patterns of expression. For example, the selected cluster of genes (panel 2), which we identified based on the overall high CLEAN scores (panel 1), is highly enriched for genes associated with immunity-related Gene Ontology terms (FDR < 10^-60) as well as two KEGG pathways, and putative targets of the Interferon Consensus Sequence-binding protein (ICSBP) transcription factor. FTreeView is available as a stand-alone or as a Web Start application from our server (http://Clusteranalysis.org).

Discussion

Integrating biological knowledge encoded in lists of functionally related genes into the analysis of genome-wide functional genomics data is an increasingly important aspect of analyzing genomics data. In the context of cluster analysis, such integration is necessary for selecting meaningful clusters of genes, and for the adequate biological interpretation of patterns defined by such clusters. We developed a computational framework for analytically and visually integrating knowledge-based functional categories with the cluster analysis of genomics data.

The framework is based on the gene-specific functional coherence score derived by correlating the clustering structure as a whole with functional categories of interest. The statistical significance of coherence scores is established by comparing them to the empirical null distribution obtained by randomly permuting gene identifiers.

We established the reproducibility of the CLEAN score across related gene expression datasets, and its utility in comparing the functional coherence of different clusterings and in unsupervised selection of genes that discriminate between biologically meaningful groups of samples. When compared to the commonly used cluster-wide assessment of functional coherence, the CLEAN score exhibits higher reproducibility across different microarray datasets.

Genes selected based on the CLEAN score produced more precise sample groupings than genes selected using either the cluster-wide score or the COPA algorithm.

It is important to note that when using the CLEAN score instead of the traditional cluster-wide approach, one cannot use the guilt-by-association principle [10] to hypothesize the function of non-annotated genes. Our analysis of the four breast cancer datasets yielded one obvious example of a relevant gene (FOXM1) with a high cwCLEAN score and a CLEAN score of zero in all four breast cancer datasets. FOXM1 is a proliferation-associated transcription factor [87] which has recently been clearly implicated as an important regulator in the progression [88]. However, functional annotations for this gene (Gene Ontologies and KEGG) do not reflect these recent findings, and consequently FOXM1 was not associated with the “cell cycle” cluster based on the CLEAN score (Figure 3.8).

[Figure 3.8: heatmap panels for datasets GSE11121, GSE1456, GSE3494, and GSE7390. Cluster annotation labels: Nuclear part, RNA processing; Mitochondrion; Mitosis; Cell cycle; DNA replication; Extracellular region, organ development, cell adhesion; Immune response, signal transducer activity, receptor activity, response to stress; System process, (G-protein coupled) receptor activity, plasma membrane.]

Figure 3.8. Expression patterns of genes with statistically significant CLEAN scores in four independent breast cancer datasets. The heatmap indicates that all genes belong to clusters with coherent expression patterns in each dataset. Functional categories on the right-hand side indicate the enriched functional categories for each global cluster of co-expressed genes. This heatmap can be interactively browsed using FTreeView at http://ClusterAnalysis.org.

One way to think about the difference between the CLEAN and cwCLEAN scores is in terms of the difference between assuming functional coherence based on co-clustering alone (guilt-by-association, cwCLEAN) and requiring additional pre-existing evidence of a functional relationship (CLEAN). Our results indicate that, for the breast cancer and tissue datasets, pre-existing evidence of functional relationships is overall more reliable than guilt-by-association relationships arising from the cluster analysis alone. It is possible that in other situations new functional relationships would dominate the existing ones and the opposite would be the case. Calculating the difference between the two scores can quickly point to novel functional relationships arising from the data analysis alone.

A systematically different approach to integrating the experimental data and prior knowledge is to incorporate the functional information into the clustering algorithm itself [89-93]. While conceptually appealing, such methods have more limited applicability than the framework presented here and have not been widely used. Our framework follows the commonly utilized post-hoc integration approach, in which cluster analysis is performed first using the experimental data and integration is achieved in the post-hoc analysis. The ability to validate the clusters produced by analyzing experimental data, and the transparency about how exactly the different types of information are utilized in constructing clusters, are most likely the reasons for the popularity and wide usage of post-hoc approaches. When the functional knowledge is used in the process of constructing clusters, it can no longer be employed to provide guidance in selecting biologically meaningful clusters.

Conclusions

We directly demonstrate that integrating prior biological knowledge encoded in lists of functionally coherent genes improves the reproducibility of clustering results. We also demonstrate that our gene-specific functional coherence score, which differentiates between the levels of functional coherence for genes within the same cluster, shows higher reproducibility than the cluster-wide score. The CLEAN score also produced more informative genes for distinguishing different sample types than the cluster-wide score. This indicates that the gene-specificity of the CLEAN score is a fundamentally different and, at least in some circumstances, better approach for integrating biological knowledge with results of the cluster analysis than previously used cluster-based scores.


Methods

Data Preprocessing, Gene Selection and Clustering

Raw data files (Affymetrix HG-U133A CEL files) of four independent human breast cancer datasets (GEO expression series GSE1456 [48], GSE3494 [46], GSE7390 [47], and GSE11121 [43]) were downloaded from the public repository GEO [56]. Each dataset was RMA-preprocessed [57] separately using the Entrez Gene-based custom CDF (version 10) from the Psychiatry/MBNI Microarray Lab at the University of Michigan (‘Brainarray’) [58].

Preprocessed data files of a large-scale tissue expression data set [85] were also downloaded from the same repository. The tissues included both human (GEO dataset record GDS596) and mouse (GDS592). For genes with multiple probesets per Entrez gene ID, the probeset with the highest median expression value was selected as the representative probeset for that gene. To match corresponding genes across species, the HomoloGene database [56] was used.
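The probeset selection rule can be illustrated with a short Python sketch. The data layout (a dictionary from probeset ID to gene ID and expression values), the function name `select_representative`, and all identifiers in the example are hypothetical and chosen only for illustration.

```python
from statistics import median

def select_representative(probesets):
    """Map each gene ID to the probeset with the highest median expression.

    probesets: dict mapping probeset ID -> (gene ID, list of expression values).
    """
    best = {}
    for probe_id, (gene_id, values) in probesets.items():
        m = median(values)
        if gene_id not in best or m > best[gene_id][1]:
            best[gene_id] = (probe_id, m)
    return {gene: probe for gene, (probe, _) in best.items()}

# Hypothetical identifiers and values, for illustration only.
example = {
    "probe_a": ("gene_1", [5.0, 6.0, 7.0]),
    "probe_b": ("gene_1", [8.0, 9.0, 7.5]),
    "probe_c": ("gene_2", [3.0, 3.5, 4.0]),
}
representative = select_representative(example)
```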

We applied a mild variation filter using Cancer Outlier Profiler Analysis (COPA, 95th percentile) [59] to select the top 10,000 genes to be clustered in each of the human breast cancer datasets (GSE1456, GSE3494, GSE7390, GSE11121). In each dataset, expression values were centered by setting the median value of each gene to zero (subtracting the gene-specific medians), and clustering analyses were performed for each dataset and species independently using hierarchical clustering with three different similarity or distance measures:

• Context-Specific Infinite Mixture Models (CSIMM) [41]. For any given pair of genes, this Bayesian method estimates the posterior pairwise probability (PPP) of the genes being co-clustered. The resulting PPP matrix is used as the similarity measure for the hierarchical clustering algorithm.

• Pearson Correlation of gene expression values as the similarity measure.

• Euclidean Distance based on per-gene normalized expression values as the distance measure.

All three hierarchical clustering algorithms used Average Linkage.

Each clustering analysis was then repeated after further variance-based re-scaling of each dataset, dividing expression levels by their standard deviation for each gene and each dataset separately. When computing the Pearson correlation, expression values are implicitly divided by the standard deviation. Thus, this additional normalization step did not significantly affect Pearson's correlations.
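The effect of the two normalization steps on the similarity measures can be illustrated numerically. The following Python sketch uses made-up expression values and our own helper names. It centers two genes by their medians and then shows that dividing each gene by its standard deviation leaves the Pearson correlation unchanged while altering the Euclidean distance.

```python
from math import sqrt
from statistics import median, stdev

def center_by_median(values):
    """Subtract the gene-specific median so the centered median is zero."""
    m = median(values)
    return [v - m for v in values]

def scale_by_sd(values):
    """Divide a gene's expression values by their standard deviation."""
    s = stdev(values)
    return [v / s for v in values]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Made-up expression values for two genes across four samples.
gene_a = center_by_median([1.0, 2.0, 4.0, 8.0])
gene_b = center_by_median([10.0, 30.0, 20.0, 90.0])

r_before = pearson(gene_a, gene_b)
r_after = pearson(scale_by_sd(gene_a), scale_by_sd(gene_b))
d_before = euclidean(gene_a, gene_b)
d_after = euclidean(scale_by_sd(gene_a), scale_by_sd(gene_b))
```

Here `r_before` and `r_after` agree up to floating-point rounding, whereas `d_before` and `d_after` differ substantially, so the re-scaling step matters for the Euclidean-distance clustering.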

All statistical analyses were performed using the statistical programming environment R version 2.7.1 [54] and Bioconductor release 2.2 [60].

Clustering Enrichment Analysis

Clustering Enrichment Analysis (CLEAN) is based on testing every possible cluster within a gene clustering for statistically significant enrichment of biological categories. A background gene list (e.g., all genes represented on the microarray) is given as well as a hierarchical clustering of some or all genes in the background list. The method was implemented as described in Algorithm 3.1.

Algorithm 3.1. Clustering Enrichment Analysis (CLEAN).

1. Define one or more sets of biological categories with sufficient representation in the background gene list
2. Determine all possible gene clusters within a given size range
3. For each gene cluster
   3.1 For each functional category, determine the 2x2 contingency table and perform Fisher's Exact test
   3.2 Compute q-values, that is the adjusted Fisher p-values, to account for multiple testing
   3.3 Record categories the cluster is significantly enriched with and corresponding q-values
4. Compute the cluster-wide cwCLEAN score
   4.1 Determine the minimum q-value for each gene cluster
   4.2 Sort gene clusters by minimum q-value starting with the lowest (i.e., most significant) q-value.
   4.3 Prune cluster supersets with less significant q-value to avoid ‘spill-over’ effect, i.e., remove gene clusters whose significant functional enrichment score is likely due to a single subtree.
   4.4 For each gene, find the minimum q-value over all remaining clusters the gene is member of
   4.5 For each gene, find the minimum q-value over all category sets (e.g., GO and KEGG)
   4.6 The cwCLEAN is the resulting -log10-transformed minimum per-gene q-value.
5. Compute the gene-specific CLEAN score
   5.1 For each gene, find the minimum q-value over all clusters and all categories the gene is member of.
   5.2 For each gene, find the minimum q-value over all category sets (e.g., GO and KEGG)
   5.3 The CLEAN is the resulting -log10-transformed minimum per-gene q-value.
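The distinction between steps 4 and 5 can be sketched as follows. This is a simplified illustration rather than the CLEAN package implementation: it assumes that significant (cluster, category) pairs and their q-values are already available from step 3, it handles a single category set, it omits the pruning of step 4.3, and it assigns a score of zero to genes without any qualifying enrichment. The function name `clean_scores` and the input layout are our own assumptions.

```python
from math import log10

def clean_scores(genes, enrichments):
    """Return (CLEAN, cwCLEAN) dicts of -log10(minimum q-value) per gene.

    enrichments: iterable of (cluster_genes, category_genes, q_value), one
    entry per significantly enriched (cluster, category) pair. Both gene
    collections are sets, clusters contain only genes listed in `genes`,
    and q-values are assumed to be strictly positive.
    """
    min_q_cluster = {g: 1.0 for g in genes}  # cluster-wide: any enriched category counts
    min_q_gene = {g: 1.0 for g in genes}     # gene-specific: gene must belong to the category
    for cluster, category, q in enrichments:
        for g in cluster:
            min_q_cluster[g] = min(min_q_cluster[g], q)
            if g in category:
                min_q_gene[g] = min(min_q_gene[g], q)

    def to_score(q):
        return -log10(q) if q < 1.0 else 0.0

    clean = {g: to_score(q) for g, q in min_q_gene.items()}
    cwclean = {g: to_score(q) for g, q in min_q_cluster.items()}
    return clean, cwclean

# Toy example: one cluster of three genes enriched for a category containing two of them.
genes = ["g1", "g2", "g3"]
enrichments = [({"g1", "g2", "g3"}, {"g1", "g2"}, 1e-4)]
clean, cwclean = clean_scores(genes, enrichments)
```

In the toy example all three genes receive the same cwCLEAN score, but only the two genes annotated with the enriched category receive a non-zero CLEAN score. This mirrors the situation described for FOXM1 in the Discussion.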

Defining Functional Categories

A functional category is defined as a non-empty set of genes representing a biological concept such as “cell cycle”, “immunological synapse” or “cytokine-cytokine receptor activation”. The method is designed to accommodate any set of functional categories such that each category is comprised of a list of genes that has a user-specified minimum overlap with the background gene list. Here, sets of categories were either downloaded from publicly accessible databases such as Gene Ontology (GO) [61] and Kyoto Encyclopedia of Genes and Genomes (KEGG) [62, 83], or were defined based on the Transfac database [84].

More specifically, functional categories based on GO and KEGG were downloaded as R packages [54, 60] while co-regulation based categories (CG) were derived computationally. Transcription factor and corresponding gene promoter data [84] and DNA sequence data [94] were downloaded. For each of the 304 human transcription factors with at least one position-weight matrix (PWM) in the Transfac version 12.1 database, a score was computed for each of the 24,190 genes, as to how likely they were to have a corresponding binding motif within 1 kbp of their transcriptional start site. The respective 750 top-scoring genes (or fewer if the score was not significant for at least 750 genes) were assigned to each transcription factor to define the respective functional categories.


For compatibility, all gene identifiers were converted to Entrez gene IDs, and matched across species where necessary using the HomoloGene database [56]. Subsequent analyses were restricted to categories that had at least ten genes in common with the respective background gene list (e.g., the genes represented on the microarray platform).

Obtaining All Possible Gene Clusters

Given a hierarchical gene tree, a list of all possible gene clusters is obtained by recursively traversing the tree structure and at each node recording the list of corresponding genes. The size of clusters is limited to a user-specified range. Here, clusters smaller than 10 genes or larger than 1,000 genes were disregarded.
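A minimal Python sketch of this traversal is given below. It assumes, purely for illustration, that the tree is represented as nested tuples whose leaves are gene identifiers (strings). The function name `all_clusters` and its parameters are hypothetical and do not reflect the interface of the CLEAN package.

```python
def all_clusters(tree, min_size=10, max_size=1000):
    """Return the gene list at every internal node whose size lies in [min_size, max_size]."""
    clusters = []

    def visit(node):
        if isinstance(node, str):  # leaf: a single gene
            return [node]
        genes = []
        for child in node:         # internal node: collect genes from all subtrees
            genes.extend(visit(child))
        if min_size <= len(genes) <= max_size:
            clusters.append(genes)
        return genes

    visit(tree)
    return clusters

# Toy tree with five genes; size limits lowered so the example yields clusters.
toy_tree = ((("a", "b"), "c"), ("d", "e"))
clusters = all_clusters(toy_tree, min_size=2, max_size=3)
```

For strongly unbalanced trees with thousands of genes, a recursive implementation like this one may exceed Python's default recursion limit, in which case an explicit stack would be preferable.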

Determining Significant Functional Enrichment

To determine whether a functional category is over-represented in a given gene cluster, that is, more genes of the functional category are present in the cluster than expected by chance, a two-by-two contingency table (number of genes in the cluster and category, in the cluster and not in the category, etc.) is constructed and Fisher's Exact Test is performed. The procedure is repeated for each category within a category set (e.g., “GO”, “KEGG”) and q-values (i.e., adjusted p-values) are computed to control the false discovery rate (FDR) [95]. Categories with a q-value not greater than a user-defined cutoff are considered significant. The default q-value cutoff is 0.1.
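The test and the multiple-testing adjustment can be sketched in plain Python. The sketch makes two assumptions that the text does not settle: the enrichment test is taken to be the one-sided (over-representation) Fisher's Exact test, computed as a hypergeometric tail probability, and the q-values are taken to be Benjamini-Hochberg adjusted p-values, whereas the text only states that adjusted p-values controlling the FDR [95] are used. The function names are our own.

```python
from math import comb

def fisher_enrichment_p(k, cluster_size, category_size, background_size):
    """One-sided Fisher's Exact test p-value: P(overlap >= k) under the hypergeometric null."""
    total = comb(background_size, cluster_size)
    upper = min(cluster_size, category_size)
    tail = sum(
        comb(category_size, x) * comb(background_size - category_size, cluster_size - x)
        for x in range(k, upper + 1)
    )
    return tail / total

def bh_qvalues(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q

# Toy example: background of 20 genes, a category of 5 genes, a cluster of 5 genes,
# 4 of which belong to the category.
p = fisher_enrichment_p(4, 5, 5, 20)
q = bh_qvalues([0.01, 0.04, 0.03])
significant = [value <= 0.1 for value in q]  # default q-value cutoff of 0.1
```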

Procedure for Non-hierarchical Methods

The procedure can be extended to non-hierarchical methods such as k-means [12]. For a fixed number k of clusters, a mutually exclusive set of gene clusters is already given and step 2 in Algorithm 3.1 is skipped. If the user specifies a range K of cluster numbers, step 2 in Algorithm 3.1 is preceded by Algorithm 3.2, which generates a hierarchical gene clustering as a means to average over multiple runs of the non-hierarchical clustering algorithm.

Algorithm 3.2. Averaging gene clusterings over n runs of a non-hierarchical clustering algorithm.

1. For each k in K, run the non-hierarchical clustering algorithm
2. For each gene i
   2.1 For each gene j ≠ i, count the number n_ij of clusterings where i and j are in the same cluster
   2.2 For each gene j ≠ i, compute p_ij = n_ij / n, where n is the size of K
3. Use the p_ij as a similarity metric and average linkage as a summary method to generate a hierarchical clustering.
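Step 2 of Algorithm 3.2 can be sketched in a few lines of Python. The sketch assumes that the clusterings from step 1 are supplied as lists of cluster labels, one list per run, and it stops before the hierarchical clustering of step 3. The function name `coclustering_similarity` is hypothetical.

```python
def coclustering_similarity(clusterings):
    """Return p[i][j], the fraction of runs in which genes i and j share a cluster.

    clusterings: list of label lists, one per run, all of the same length.
    Diagonal entries are left at 0.0 because only pairs with j != i are defined.
    """
    n_runs = len(clusterings)
    n_genes = len(clusterings[0])
    p = [[0.0] * n_genes for _ in range(n_genes)]
    for i in range(n_genes):
        for j in range(n_genes):
            if i != j:
                n_ij = sum(1 for labels in clusterings if labels[i] == labels[j])
                p[i][j] = n_ij / n_runs
    return p

# Toy example: four genes clustered with k = 2, 3 and 4.
runs = [
    [0, 0, 1, 1],  # k = 2
    [0, 0, 1, 2],  # k = 3
    [0, 1, 2, 3],  # k = 4
]
similarity = coclustering_similarity(runs)
```

The resulting matrix is symmetric and can be converted to a distance (for example 1 - p_ij) before being passed to an average linkage routine.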

R package and FTreeView tool

An R package to perform CLEAN and an open-source Java tool to interactively display gene expression data, gene and sample clustering, gene annotation, and functional cluster annotation can be freely downloaded from the authors' web site (http://ClusterAnalysis.org). A list of the R functions contained in the CLEAN package together with short descriptions and examples can be found in Appendix B.


Chapter IV: Additional Applications

Simulation studies

In chapter II we describe a series of simulation studies to test the DCIM algorithm and to compare it to other algorithms. The description of the study results focuses on the recovery of simulated contexts. Here we expand the above report of our simulation studies by showing a) gene clustering results, and b) additional simulation scenarios.

In each of the simulations we generate “samples” (i.e., vectors of simulated relative expression levels) according to the framework described above. That is, each dataset comprises four global gene clusters which are grouped further within three distinct contexts such that each context has two local gene clusters (Figure 2.1). We repeat the simulations at five different noise levels, each time generating 100 datasets. We then run the DCIM method as well as other clustering algorithms on each of the simulated datasets and record the results. These algorithms include two hierarchical clustering methods using either Euclidean distance or Pearson's correlation to measure distance or similarity, respectively, as well as the Infinite Mixtures based methods GIMM [15, 96] and CSIMM [41]. The latter method uses a priori defined contexts, and we specify the simulated contexts in order to use this method as a ‘gold standard’ for the evaluation of the gene clustering results. GIMM and CSIMM do not provide sample clusterings and are only used in the comparison of gene clusterings.

As described in chapter II, Receiver Operating Characteristics (ROC) curves are derived from the number of correctly (true positives) and incorrectly (false positives) co-clustered pairs of samples and genes, respectively. The sliding threshold to generate the ROC is the height of the hierarchical gene tree at which it is cut to induce a discrete gene clustering. The area under the curve (AUC) is computed to compare ROCs, where the ideal AUC is equal to 1 and a truly random signal produces an AUC of 0.5.
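The two reference values for the AUC can be reproduced with a simple trapezoidal-rule sketch in Python. The source does not state how the AUC was computed, so the integration method is an assumption, and the function name `auc` is our own.

```python
def auc(points):
    """Area under a ROC curve by the trapezoidal rule.

    points: iterable of (false positive rate, true positive rate) pairs.
    """
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

perfect = auc([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)])   # ideal clustering
diagonal = auc([(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)])  # random co-clustering
print(perfect, diagonal)  # 1.0 0.5
```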

Figure 4.1 shows gene clustering results for the simple simulation model. The Infinite Mixture Model (IMM) based algorithms (GIMM, CSIMM, DCIM) are better able to recover gene clusters than the two ad-hoc methods, as illustrated by the differences in AUC. This difference becomes more pronounced as the noise level of the data increases. It should be noted that the simulated noise levels range from lower than typically observed to about the level of the simulated fold changes. Differences between IMM based methods, if any, are small. As expected, CSIMM with known contexts gives the best gene clustering results (Figure 4.1.C).

Figure 4.1. Recovery of gene clusters in simple simulation scenario. A) Heatmap of a simulated data set (σ=0.3) to illustrate the imposed global and local gene clusters and sample groupings (contexts). B) ROC curve at σ=0.7 for gene clustering. C) Average gene clustering AUC plotted against σ.

Next, we slightly modify the simulation strategy by adding an additional fold change level (Figure 4.2). AUCs are similar to the previous scenario; the additional fold change level appears to actually help to slightly improve the gene clustering. In contrast, sample clusterings are influenced by this change as the ad-hoc clustering methods group samples by fold change level rather than by affected groups of genes (see chapter II).


Figure 4.2. Recovery of gene clusters in two fold-changes scenario. A) Heatmap of a simulated data set (σ=0.3) to illustrate the imposed gene clusters and contexts. B) Gene clustering ROC curve at σ=0.8. C) Average gene clustering AUC plotted against σ.

One of the benefits we expect from considering contexts when clustering genes is the ability to account for samples that are uninformative. That is, the gene expression sub-profile for these samples reflects only the background noise of the measurements. Clustering methods which do not take contexts into account weigh all samples equally and would be expected to perform poorly in the presence of a significant number of non-informative samples, since it becomes increasingly hard to distinguish gene clusters as the number of uninformative samples rises. In contrast, the DCIM and CSIMM algorithms exploit information on contexts and are expected to properly account for such samples.

To test this assertion, we extend the first simulation scenario by adding an increasing number of ‘non-informative samples’, that is, vectors with the same noise level σ but without any signal (λ=0, Figure 4.3.A). After clustering, ROCs and AUCs are determined as before, but for the analysis of sample clusterings we use only the informative samples. Figure 4.3 shows a representative example of the simulation results with σ ranging from 0.4 to 0.8 and Nnoninformative, the number of non-informative samples, ranging from 5 to 100. As expected, the recovery of gene clusters remains high for the DCIM algorithm and CSIMM but decreases for hierarchical clustering with Euclidean distance as Nnoninformative increases. The AUC for DCIM clustering of samples remains high for up to 20 non-informative samples but decreases somewhat as Nnoninformative approaches 100.

Figure 4.3. Recovery of gene and sample clusters in the non-informative samples scenario. Nnoninformative, the number of non-informative samples, ranged from 5 to 100. A) Heatmap of a simulated data set (σ=0.3) with 20 non-informative samples to illustrate the imposed gene clusters and contexts. B) Gene clustering ROC curve at σ=0.7 and Nnoninformative=20. C) Average gene clustering AUC plotted against Nnoninformative at σ=0.7. D) Sample clustering ROC curve at σ=0.7 and Nnoninformative=20. Only the informative samples are used to determine true and false positive sample pairs. E) Average sample clustering AUC plotted against Nnoninformative at σ=0.7. Only the informative samples are used to compute the ROCs.

A “synthetic” dataset

The ‘two fold-changes’ simulation scenario illustrates a key difference between sample groupings determined by the DCIM method and standard sample clustering algorithms. While the former groups samples based on similarities and differences of the respective gene co-expression patterns, the latter use direct measures of similarity of expression levels. To further illustrate this point, we use a dataset generated by the Microarray Quality Control (MAQC) Consortium [7]. Universal reference (UR) RNA and brain RNA were mixed at four different ratios (100:0, 75:25, 25:75, 0:100). These four synthetic samples were then hybridized to high-density oligonucleotide microarrays at six different laboratories, and each time the hybridization experiments were repeated five times (Figure 4.4). This experimental design is reminiscent of the ‘two fold-changes’ simulation scenario in that the same groups of genes are expected to be affected in all samples but at different expression fold changes. That is, we do not expect to observe differential co-expression of genes. However, large differences in expression levels are expected between brain and UR RNA [97].

The raw data files are publicly available (GEO accession GSE5350) and are pre-processed and pre-filtered (N=10,000) as described in the previous chapters. The resulting data set is then clustered using DCIM and hierarchical sample clustering methods. ROC curves are generated as previously described by determining the numbers of correctly and falsely co-clustered sample pairs using relevant sample annotations (RNA concentration, source of the majority of the RNA in the sample, and the laboratory where the hybridization was performed) as gold standard (Figure 4.5). Much like for the simulated data, clustering algorithms using traditional similarity measures group samples by RNA source (UR, brain) and by RNA concentration, while the DCIM method does not detect these ‘signals’, as indicated by ROC curves that run along the diagonal line (Figure 4.5 A, B). The DCIM sample clustering does, however, indicate slight differences between centers, which may be due to site-specific hybridization effects. ROC curves indicate that these differences are not detected by the standard methods (Figure 4.5.C).


Figure 4.4. Heatmap of the expression dataset generated by the Microarray Quality Control (MAQC) Consortium [7]. Universal reference and brain RNA were mixed at four different ratios (100:0, 75:25, 25:75, 0:100). Samples were then hybridized to high-density oligonucleotide microarrays at six different centers with five replicates each for a total of 120 arrays.

Figure 4.5. ROC curves for sample clustering of the MAQC dataset [7]. Pre-processed and pre-filtered data was clustered using DCIM and hierarchical sample clustering methods. ROC curves were generated by determining the numbers of correctly and falsely co-clustered sample pairs using A) RNA concentration, B) source of the majority of the RNA in the sample (brain or UR), and C) laboratory where the hybridization was performed, as gold standard.

Large-scale breast cancer study – revisited

In chapter II, we describe the joint analysis of six breast cancer data sets [22, 43, 46-49].

We find that the completely unsupervised, DCIM determined sample groupings (contexts) are better predictors of patient survival than standard clustering algorithms using the same data (Figure 4.6).

Figure 4.6. Kaplan-Meier curves estimating patient survival in the joint breast cancer data set. Data was analyzed using DCIM and hierarchical sample clustering with Pearson correlation as well as k-means (k=2). Each hierarchical clustering was cut at the top level to generate two patient groups. Kaplan-Meier curves and corresponding logrank p-values were computed based on available disease-specific patient survival data.

We then split each of the two top-level contexts further into two sub-contexts, indicated by the colors black, red, green, and blue in the hierarchical clustering shown in Figure 4.7.A, and repeat the survival analysis. We again observe highly significant differences (logrank p-value 1.1 × 10^-12) in patient survival between the four groups (Figure 4.7.B). The 10-year survival rate ranges from 56% to 86%.
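For readers unfamiliar with how a 10-year survival rate is read off a Kaplan-Meier curve, the following is a minimal sketch of the product-limit estimator. It is a generic textbook implementation, not the code used for Figures 4.6 and 4.7, it does not include the logrank test, and the follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier (product-limit) survival estimate.

    times  -- follow-up time of each patient
    events -- 1 if the event was observed, 0 if the patient was censored
    Returns a list of (time, survival probability) at each event time.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        leaving = sum(1 for ti in times if ti == t)  # events plus censored
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


def survival_at(curve, t):
    """Estimated survival probability at time t (right-continuous step function)."""
    s = 1.0
    for time, surv in curve:
        if time <= t:
            s = surv
    return s


# Hypothetical follow-up times in years for one patient group.
times = [2, 3, 3, 5, 8, 10, 12, 12]
events = [1, 0, 1, 1, 0, 1, 0, 0]
curve = kaplan_meier(times, events)
ten_year = survival_at(curve, 10)
```

Censored patients leave the risk set without lowering the curve, which is why the estimate differs from the simple fraction of patients still alive at a given time.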


Figure 4.7. DCIM clustering and survival analysis of the joint breast cancer dataset. A) The heatmap of gene expression intensity measures was ordered by DCIM global gene and sample clusterings. B) Kaplan-Meier curves were computed based on available disease specific patient survival data. The color of each curve in the plot corresponds to the patient clustering shown in panel A.

We then apply the CLEAN procedure [37] described in chapter III to the global gene clustering using functional categories derived from GO [61], KEGG [62], and L2L [63], among other computationally and literature derived categories. The integrated software package CLEAN also allows us to generate and view output files that simultaneously display gene and sample clustering, relative gene expression levels (heatmap), and functional cluster annotations. In Figure 4.8 we show a series of details of the heatmap, each representing a gene cluster with high CLEAN scores accompanied by a table showing selected functional annotations for the cluster. The blue boxes in each heatmap correspond to the four patient groups shown in Figure 4.7. Each of these gene clusters, that is, their expression patterns together with their functional annotation, may explain the significant differences in survival we observed between the patient groups. For example, a large group of genes annotated with proliferation related categories such as ‘cell cycle’ is up-regulated in the first two groups (“black” and “red”) with poor survival but is down-regulated in the “blue” group, which has the best outcome among the four groups. For another cluster of genes positively correlated with estrogen receptor (ER) status, expression is up-regulated in this patient group along with the “green” group while the same genes are down-regulated in a number of patients in the “red” and “black” groups. Positive ER status is well-known to be associated with better patient survival in breast cancer [23]. This and the other CLEAN derived functional cluster annotations displayed in Figure 4.8 (proliferation, cell adhesion, and metabolism) and their expression patterns are consistent with the observed patient survival for the four groups.
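The enrichment statistics reported for each gene cluster (Fisher test significance and log odds ratio) can be illustrated with a short sketch. It computes a one-sided Fisher exact test from the hypergeometric distribution and a log odds ratio from the corresponding 2x2 table. The natural logarithm and the 0.5 continuity correction are assumptions made here for illustration. The CLEAN package may compute these quantities differently, and the counts in the example are hypothetical.

```python
from math import comb, log


def fisher_enrichment(k, cluster_size, category_size, n_genes):
    """One-sided Fisher exact test for over-representation.

    k             -- genes in the cluster that belong to the category
    cluster_size  -- number of genes in the cluster
    category_size -- number of genes in the functional category
    n_genes       -- number of genes in the whole analysis
    Returns (p_value, log_odds_ratio).
    """
    # Upper tail of the hypergeometric distribution: P(X >= k).
    upper = min(cluster_size, category_size)
    total = comb(n_genes, cluster_size)
    p_value = sum(
        comb(category_size, x) * comb(n_genes - category_size, cluster_size - x)
        for x in range(k, upper + 1)
    ) / total
    # Cells of the 2x2 table; 0.5 is added to each cell (an assumption)
    # so that the odds ratio stays finite when a cell is empty.
    a = k + 0.5
    b = cluster_size - k + 0.5
    c = category_size - k + 0.5
    d = n_genes - cluster_size - category_size + k + 0.5
    return p_value, log((a * d) / (b * c))


# Hypothetical counts: 3 of 5 cluster genes fall into a 5-gene category
# out of 20 genes in total.
p_value, log_or = fisher_enrichment(k=3, cluster_size=5, category_size=5, n_genes=20)
```

A positive log odds ratio indicates that category members are over-represented in the cluster relative to the remaining genes.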

Type of Evidence  Description                                             Fisher FDR      log OR
Literature        Cancer-related gene expression module [98]              3.6 × 10^-84    5.1
Literature        Doxorubicin-resistant cancer cells [99]                 2.3 × 10^-42    5.7
Literature        Poor prognosis marker in breast cancer [19]             2.9 × 10^-42    4.6
Literature        Up-regulated in undifferentiated cancer [100]           1.7 × 10^-39    4.9
GO                Mitotic cell cycle                                      9.5 × 10^-36    3.7
KEGG              Cell cycle                                              9.8 × 10^-14    3.4

Type of Evidence  Description                                             Fisher FDR      log OR
Literature        Positively correlated with ER status [19]               6.8 × 10^-115   3.0
Literature        Negatively correlated with BRCA1 germline status [19]   1.1 × 10^-12    2.0
GO                extracellular region                                    3.2 × 10^-10    1.2


Type of Evidence  Description                                             Fisher FDR      log OR
Literature        Cancer module [98]                                      1.2 × 10^-96    3.3
Literature        Cancer module [98]                                      6.4 × 10^-80    2.5
GO                Extracellular matrix                                    4.3 × 10^-36    1.2
GO                Cell adhesion                                           3.4 × 10^-19    1.3
KEGG              Focal adhesion                                          1.4 × 10^-11    1.5

Type of Evidence  Description                                             Fisher FDR      log OR
GO                                                                        2.5 × 10^-46    1.8
KEGG              Oxidative phosphorylation                               7.7 × 10^-27    2.7
GO                Extracellular region                                    3.2 × 10^-10    1.2

Figure 4.8. Clustering Enrichment Analysis (CLEAN) of the DCIM derived gene clustering. Each of the panels shows a detail of the heatmap in Figure 4.7 representing a gene cluster with high CLEAN scores. Below each heatmap is a table showing selected functional annotations for the gene cluster. The blue boxes correspond to the four patient groups shown in Figure 4.7.
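The "Fisher FDR" column reports Fisher test p-values adjusted for multiple testing. As an illustration of such an adjustment, the sketch below implements the Benjamini-Hochberg procedure [95]. That this is the exact adjustment behind the values in Figure 4.8 is an assumption, and the input p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that the adjusted values
    # never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four functional categories.
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

Each adjusted value estimates the false discovery rate incurred when all categories with an equal or smaller p-value are called enriched.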

Normal tissue dataset.

Mainly due to availability, the majority of our analyses focus on breast cancer samples. However, here we analyze a set of 353 normal tissue samples derived from ten post-mortem donors. We use this example to demonstrate the utility of the various aspects of the DCIM method and the CLEAN software package. The dataset, generated by Neurocrine Biosciences, San Diego, CA, is downloaded (GEO accession GSE3526), and pre-processed and pre-filtered as previously described. We then use DCIM to generate gene and sample clusterings (Figure 4.9). The sample clustering reflects the different tissue types; most prominent is the clear separation between brain tissues and other tissue types. Next, we compute the gene-specific differential co-expression score (DCS) for the three top-level contexts. Finally, we perform CLEAN using the above named functional category types. We use the functionality provided in the CLEAN software package to generate and display output files containing the clusterings, heatmap, DCS, and functional cluster annotation (Figure 4.9).

Figure 4.9. Screenshot of fTreeView displaying DCIM derived gene clustering and contexts for the normal tissue study (GEO accession GSE3526). Pre-processed and pre-filtered data was analyzed with DCIM. Tissue types are reflected in the sample grouping, with the separation between brain-specific tissues and non-brain tissues being the most obvious. CLEAN was then performed. Results are shown using fTreeView; the sidebars to the right of the heatmap display the DCS for the top three contexts defined by the hierarchical sample clustering. The functional annotation panel (bottom right) displays enriched functional categories for the selected gene cluster. Brain specific categories are highlighted in yellow.


Figure 4.10. Screenshot of fTreeView displaying DCIM and CLEAN results for the GSE3526 dataset. A gene cluster with high DCS was identified by visual inspection (1). This gene cluster was highly expressed in a set of co-clustered cerebellum samples (2). By selecting these genes, a list of functional categories is displayed (3). The gene cluster in this case was enriched for a number of cerebellum related mouse phenotype (MP) categories.

Figure 4.11. Combining the display of CLEAN scores and the DCS for the GSE3526 dataset. Displayed is the gene expression heatmap, gene and sample clusterings, CLEAN scores for twelve category types including GO, KEGG, L2L, among others (red/green sidebar), and the DCS (blue sidebar plot). The plot suggests a striking correlation between CLEAN score and DCS. Genes with high DCS also have a high CLEAN score for at least one of the category types.


These files can then be interactively viewed by the user. Figure 4.10 shows one such example. A gene cluster with high DCS is identified by visual inspection (1). This gene cluster is highly expressed in a set of co-clustered cerebellum samples (2). After selecting these genes by mouse click in the gene clustering panel, a list of relevant functional categories is displayed (3). The gene cluster in this case is enriched for a number of cerebellum related mouse phenotype (MP) categories.

Lastly, in Figure 4.11 we combine the display of CLEAN scores and the DCS (blue sidebar plot). Both scores are determined per gene facilitating their comparison. The plot suggests a striking correlation between CLEAN score and DCS. Genes with high DCS also have a high CLEAN score for at least one of the category types.
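Because both scores are available per gene, the visual impression of a correlation could be quantified, for example with a rank correlation. The following sketch computes Spearman's correlation coefficient with average ranks for ties. It is offered only as one possible way to summarize such a comparison. No such coefficient is reported in this work, and the score vectors below are hypothetical.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-gene scores, not values from this study.
dcs = [0.1, 0.4, 0.2, 0.9, 0.7]
clean = [1.0, 3.5, 2.0, 8.0, 8.0]
rho = spearman(dcs, clean)
```

A rank-based measure is used in the sketch because the two scores are on different scales and need not be linearly related.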


Bibliography

1. Crick F: Central dogma of molecular biology. Nature 1970, 227:561-563.

2. Greenbaum D, Colangelo C, Williams K, Gerstein M: Comparing protein abundance and mRNA expression levels on a genomic scale. Genome Biol 2003, 4:117.

3. Washburn MP, Koller A, Oshiro G, Ulaszek RR, Plouffe D, Deciu C, Winzeler E, Yates JR, III: Protein pathway and complex clustering of correlated mRNA and protein expression analyses in Saccharomyces cerevisiae. Proc Natl Acad Sci U S A 2003, 100:3107-3112.

4. Schena M, Shalon D, Davis RW, Brown PO: Quantitative Monitoring of Gene-Expression Patterns with A Complementary-DNA Microarray. Science 1995, 270:467-470.

5. Southern EM: Detection of specific sequences among DNA fragments separated by gel electrophoresis. J Mol Biol 1975, 98:503-517.

6. Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W, Ball CA, Causton HC, Gaasterland T, Glenisson P, Holstege FC, Kim IF, Markowitz V, Matese JC, Parkinson H, Robinson A, Sarkans U, Schulze-Kremer S, Stewart J, Taylor R, Vilo J, Vingron M: Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet 2001, 29:365-371.

7. Shi L, Reid LH, Jones WD, Shippy R, Warrington JA, Baker SC, Collins PJ, de LF, Kawasaki ES, Lee KY, Luo Y, Sun YA, Willey JC, Setterquist RA, Fischer GM, Tong W, Dragan YP, Dix DJ, Frueh FW, Goodsaid FM, Herman D, Jensen RV, Johnson CD, Lobenhofer EK, Puri RK, Scherf U, Thierry-Mieg J, Wang C, Wilson M, Wolber PK, Zhang L, Amur S, Bao W, Barbacioru CC, Lucas AB, Bertholet V, Boysen C, Bromley B, Brown D, Brunner A, Canales R, Cao XM, Cebula TA, Chen JJ, Cheng J, Chu TM, Chudin E, Corson J, Corton JC, Croner LJ, Davies C, Davison TS, Delenstarr G, Deng X, Dorris D, Eklund AC, Fan XH, Fang H, Fulmer-Smentek S, Fuscoe JC, Gallagher K, Ge W, Guo L, Guo X, Hager J, Haje PK, Han J, Han T, Harbottle HC, Harris SC, Hatchwell E, Hauser CA, Hester S, Hong H, Hurban P, Jackson SA, Ji H, Knight CR, Kuo WP, LeClerc JE, Levy S, Li QZ, Liu C, Liu Y, Lombardi MJ, Ma Y, Magnuson SR, Maqsodi B, McDaniel T, Mei N, Myklebost O, Ning B, Novoradovskaya N, Orr MS, Osborn TW, Papallo A, Patterson TA, Perkins RG, Peters EH, Peterson R, Philips KL, Pine PS, Pusztai L, Qian F, Ren H, Rosen M, Rosenzweig BA, Samaha RR, Schena M, Schroth GP, Shchegrova S, Smith DD, Staedtler F, Su Z, Sun H, Szallasi Z, Tezak Z, Thierry-Mieg D, Thompson KL, Tikhonova I, Turpaz Y, Vallanat B, Van C, Walker SJ, Wang SJ, Wang Y, Wolfinger R, Wong A, Wu J, Xiao C, Xie Q, Xu J, Yang W, Zhang L, Zhong S, Zong Y, Slikker W, Jr.: The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotechnol 2006, 24:1151-1161.

8. Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Muertter RN, Edgar R: NCBI GEO: archive for high-throughput functional genomic data. Nucleic Acids Res 2009, 37:D885-D890.

9. Jain AK, Murty MN, Flynn PJ: Data Clustering: A Review. ACM Computing Surveys 1999, 31:264-323.

10. Wolfe C, Kohane I, Butte A: Systematic survey reveals general applicability of "guilt-by-association" within gene coexpression networks. BMC Bioinformatics 2005, 6:227.

11. Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 1998, 95:14863-14868.

12. MacQueen JB: Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability 1967, 1:281-297.

13. Kohonen T: Self-Organized Formation of Topologically Correct Feature Maps. Biological Cybernetics 1982, 43:59-69.

14. Geman S, Geman D: Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE Transactions on Pattern Analysis and Machine Intelligence 1984, 6:721-741.

15. Medvedovic M, Sivaganesan S: Bayesian infinite mixture model based clustering of gene expression profiles. Bioinformatics 2002, 18:1194-1206.

16. Medvedovic M, Yeung KY, Bumgarner RE: Bayesian mixture model based clustering of replicated microarray data. Bioinformatics 2004, 20:1222-1232.

17. Fraley C, Raftery AE: MCLUST: Software for model-based cluster analysis. Journal of Classification 1999, 16:297-306.

18. Sorlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, Hastie T, Eisen MB, van de Rijn M, Jeffrey SS, Thorsen T, Quist H, Matese JC, Brown PO, Botstein D, Lonning PE, Borresen-Dale AL: Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci U S A 2001, 98:10869-10874.

19. van 't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH: Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415:530-536.


20. Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, Meijer-van Gelder ME, Yu J, Jatkoe T, Berns EM, Atkins D, Foekens JA: Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet 2005, 365:671-679.

21. Hu Z, Fan C, Oh DS, Marron JS, He X, Qaqish BF, Livasy C, Carey LA, Reynolds E, Dressler L, Nobel A, Parker J, Ewend MG, Sawyer LR, Wu J, Liu Y, Nanda R, Tretiakova M, Ruiz OA, Dreher D, Palazzo JP, Perreard L, Nelson E, Mone M, Hansen H, Mullins M, Quackenbush JF, Ellis MJ, Olopade OI, Bernard PS, Perou CM: The molecular portraits of breast tumors are conserved across microarray platforms. BMC Genomics 2006, 7:96.

22. Sotiriou C, Wirapati P, Loi S, Harris A, Fox S, Smeds J, Nordgren H, Farmer P, Praz V, Haibe-Kains B, Desmedt C, Larsimont D, Cardoso F, Peterse H, Nuyten D, Buyse M, Van de Vijver MJ, Bergh J, Piccart M, Delorenzi M: Gene Expression Profiling in Breast Cancer: Understanding the Molecular Basis of Histologic Grade To Improve Prognosis. J Natl Cancer Inst 2006, 98:262-272.

23. Sotiriou C, Pusztai L: Gene-expression signatures in breast cancer. N Engl J Med 2009, 360:790-800.

24. Cheng Y, Church GM: Biclustering of expression data. Proc Int Conf Intell Syst Mol Biol 2000, 8:93-103.

25. Ihmels J, Friedlander G, Bergmann S, Sarig O, Ziv Y, Barkai N: Revealing modular organization in the yeast transcriptional network. Nat Genet 2002, 31:370-377.

26. Ben-Dor A, Chor B, Karp R, Yakhini Z: Discovering local structure in gene expression data: the order-preserving submatrix problem. J Comput Biol 2003, 10:373-384.

27. Tanay A, Sharan R, Shamir R: Discovering statistically significant biclusters in gene expression data. Bioinformatics 2002, 18 Suppl 1:S136-S144.

28. Murali TM, Kasif S: Extracting conserved gene expression motifs from gene expression data. Pac Symp Biocomput 2003:77-88.

29. Liu X, Wang L: Computing the maximum similarity bi-clusters of gene expression data. Bioinformatics 2007, 23:50-56.

30. Choi JK, Yu U, Yoo OJ, Kim S: Differential coexpression analysis using microarray data and its application to human cancer. Bioinformatics 2005, 21:4348-4355.

31. Cho SB, Kim J, Kim JH: Identifying set-wise differential co-expression in gene expression microarray data. BMC Bioinformatics 2009, 10:109.

32. Choi Y, Kendziorski C: Statistical Methods for Gene Set Co-expression Analysis. Bioinformatics 2009.


33. Lai Y, Wu B, Chen L, Zhao H: A statistical method for identifying differential gene-gene co-expression patterns. Bioinformatics 2004, 20:3146-3155.

34. Kostka D, Spang R: Finding disease specific alterations in the co-expression of genes. Bioinformatics 2004, 20 Suppl 1:i194-i199.

35. Watson M: CoXpress: differential co-expression in gene expression data. BMC Bioinformatics 2006, 7:509.

36. Hudson NJ, Reverter A, Dalrymple BP: A differential wiring analysis of expression data correctly identifies the gene containing the causal mutation. PLoS Comput Biol 2009, 5:e1000382.

37. Freudenberg JM, Joshi VK, Hu Z, Medvedovic M: CLEAN: CLustering Enrichment ANalysis. BMC Bioinformatics 2009, 10:234.

38. Allison DB, Cui X, Page GP, Sabripour M: Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet 2006, 7:55-65.

39. Ferguson TS: A Bayesian analysis of some nonparametric problems. The Annals of Statistics 1973, 1:209-230.

40. Neal RM: Markov Chain Sampling Methods for Dirichlet Process Mixture Models. Journal of Computational and Graphical Statistics 2000, 9:249-265.

41. Liu X, Sivaganesan S, Yeung KY, Guo J, Bumgarner RE, Medvedovic M: Context-specific infinite mixtures for clustering gene expression profiles across diverse microarray dataset. Bioinformatics 2006, 22:1737-1744.

42. Cowell RG, Dawid PA, Lauritzen SL, Spiegelhalter DJ: Probabilistic Networks and Expert Systems. New York: Springer; 1999.

43. Schmidt M, Bohm D, von TC, Steiner E, Puhl A, Pilch H, Lehr HA, Hengstler JG, Kolbl H, Gehrmann M: The humoral immune system has a key prognostic impact in node-negative breast cancer. Cancer Res 2008, 68:5405-5413.

44. Carroll JS, Meyer CA, Song J, Li W, Geistlinger TR, Eeckhoute J, Brodsky AS, Keeton EK, Fertuck KC, Hall GF, Wang Q, Bekiranov S, Sementchenko V, Fox EA, Silver PA, Gingeras TR, Liu XS, Brown M: Genome-wide analysis of estrogen receptor binding sites. Nat Genet 2006, 38:1289-1297.

45. Haibe-Kains B, Desmedt C, Sotiriou C, Bontempi G: A comparative study of survival models for breast cancer prognostication based on microarray data: does a single gene beat them all? Bioinformatics 2008, 24:2200-2208.

46. Miller LD, Smeds J, George J, Vega VB, Vergara L, Ploner A, Pawitan Y, Hall P, Klaar S, Liu ET, Bergh J: An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proc Natl Acad Sci U S A 2005, 102:13550-13555.

47. Desmedt C, Piette F, Loi S, Wang Y, Lallemand F, Haibe-Kains B, Viale G, Delorenzi M, Zhang Y, d'Assignies MS, Bergh J, Lidereau R, Ellis P, Harris AL, Klijn JG, Foekens JA, Cardoso F, Piccart MJ, Buyse M, Sotiriou C: Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series. Clin Cancer Res 2007, 13:3207-3214.

48. Pawitan Y, Bjohle J, Amler L, Borg AL, Egyhazi S, Hall P, Han X, Holmberg L, Huang F, Klaar S, Liu ET, Miller L, Nordgren H, Ploner A, Sandelin K, Shaw PM, Smeds J, Skoog L, Wedren S, Bergh J: Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two population-based cohorts. Breast Cancer Res 2005, 7:R953-R964.

49. Loi S, Haibe-Kains B, Desmedt C, Wirapati P, Lallemand F, Tutt AM, Gillet C, Ellis P, Ryder K, Reid JF, Daidone MG, Pierotti MA, Berns EM, Jansen MP, Foekens JA, Delorenzi M, Bontempi G, Piccart MJ, Sotiriou C: Predicting prognosis using molecular profiling in estrogen receptor-positive breast cancer treated with tamoxifen. BMC Genomics 2008, 9:239.

50. Perou CM, Sorlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA, Fluge O, Pergamenschikov A, Williams C, Zhu SX, Lonning PE, Borresen-Dale AL, Brown PO, Botstein D: Molecular portraits of human breast tumours. Nature 2000, 406:747-752.

51. Sorlie T, Tibshirani R, Parker J, Hastie T, Marron JS, Nobel A, Deng S, Johnsen H, Pesich R, Geisler S, Demeter J, Perou CM, Lonning PE, Brown PO, Borresen-Dale AL, Botstein D: Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc Natl Acad Sci U S A 2003, 100:8418-8423.

52. Gelman A, Carlin JB, Stern HS, Rubin DB: Bayesian Data Analysis. Second edition. New York: CRC Press; 2003.

53. Gelfand AE, Smith AFM: Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association 1990, 85:398-409.

54. R Development Core Team: R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2008.

55. Saldanha AJ: Java Treeview--extensible visualization of microarray data. Bioinformatics 2004, 20:3246-3248.

56. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Edgar R, Federhen S, Feolo M, Geer LY, Helmberg W, Kapustin Y, Khovayko O, Landsman D, Lipman DJ, Madden TL, Maglott DR, Miller V, Ostell J, Pruitt KD, Schuler GD, Shumway M, Sequeira E, Sherry ST, Sirotkin K, Souvorov A, Starchenko G, Tatusov RL, Tatusova TA, Wagner L, Yaschenko E: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2008, 36:D13-D21.

57. Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19:185-193.

58. Dai M, Wang P, Boyd AD, Kostov G, Athey B, Jones EG, Bunney WE, Myers RM, Speed TP, Akil H, Watson SJ, Meng F: Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res 2005, 33:e175.

59. Tomlins SA, Rhodes DR, Perner S, Dhanasekaran SM, Mehra R, Sun XW, Varambally S, Cao X, Tchinda J, Kuefer R, Lee C, Montie JE, Shah RB, Pienta KJ, Rubin MA, Chinnaiyan AM: Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science 2005, 310:644-648.

60. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5:R80.

61. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25:25-29.

62. Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 2000, 28:27-30.

63. Newman JC, Weiner AM: L2L: a simple tool for discovering the hidden significance in microarray expression data. Genome Biol 2005, 6:R81.

64. Slonim DK: From patterns to pathways: gene expression data analysis comes of age. Nat Genet 2002, 32 Suppl:502-508.

65. Do JH, Choi DK: Clustering approaches to identifying gene expression patterns from DNA microarray data. Mol Cells 2008, 25:279-288.

66. Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture. Nat Genet 1999, 22:281-285.

67. Khatri P, Draghici S, Ostermeier GC, Krawetz SA: Profiling Gene Expression Using Onto-Express. Genomics 2002, 79:266-270.


68. Wei CL, Wu Q, Vega VB, Chiu KP, Ng P, Zhang T, Shahab A, Yong HC, Fu Y, Weng Z, Liu J, Zhao XD, Chew JL, Lee YL, Kuznetsov VA, Sung WK, Miller LD, Lim B, Liu ET, Yu Q, Ng HH, Ruan Y: A global map of p53 transcription-factor binding sites in the human genome. Cell 2006, 124:207-219.

69. Sartor MA, Schnekenburger M, Marlow JL, Reichard JF, Wang Y, Fan Y, Ma C, Karyala S, Halbleib D, Liu X, Medvedovic M, Puga A: Genomewide Analysis of Aryl Hydrocarbon Receptor Binding Targets Reveals an Extensive Array of Gene Clusters that Control Morphogenic and Developmental Programs. Environ Health Perspect 2009, 117:1139-1146.

70. Rakyan VK, Down TA, Thorne NP, Flicek P, Kulesha E, Graf S, Tomazou EM, Backdahl L, Johnson N, Herberth M, Howe KL, Jackson DK, Miretti MM, Fiegler H, Marioni JC, Birney E, Hubbard TJ, Carter NP, Tavare S, Beck S: An integrated resource for genome-wide identification and analysis of human tissue-specific differentially methylated regions (tDMRs). Genome Res 2008, 18:1518-1529.

71. Tomlins SA, Mehra R, Rhodes DR, Cao X, Wang L, Dhanasekaran SM, Kalyana-Sundaram S, Wei JT, Rubin MA, Pienta KJ, Shah RB, Chinnaiyan AM: Integrative molecular concept modeling of prostate cancer progression. Nat Genet 2007, 39:41-51.

72. Toronen P: Selection of informative clusters from hierarchical cluster tree with gene classes. BMC Bioinformatics 2004, 5:32.

73. Buehler EC, Sachs JR, Shao K, Bagchi A, Ungar LH: The CRASSS plug-in for integrating annotation data with hierarchical clustering results. Bioinformatics 2004, 20:3266-3269.

74. Varshavsky R, Horn D, Linial M: Global considerations in hierarchical clustering reveal meaningful patterns in data. PLoS ONE 2008, 3:e2247.

75. Pan KH, Lih CJ, Cohen SN: Effects of threshold choice on biological conclusions reached during analysis of gene expression by DNA microarrays. Proc Natl Acad Sci U S A 2005, 102:8961-8965.

76. Sartor MA, Leikauf GD, Medvedovic M: LRpath: a logistic regression approach for identifying enriched biological groups in gene expression data. Bioinformatics 2009, 25:211-217.

77. Newton MA, Quintana FA, den Boon JA, Sengupta S, Ahlquist P: Random-set methods identify distinct aspects of the enrichment signal in gene-set analysis. The Annals of Applied Statistics 2007, 1:85-106.

78. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 2005, 102:15545-15550.

79. Yeung KY, Medvedovic M, Bumgarner RE: From co-expression to co-regulation: how many microarray experiments do we need? Genome Biol 2004, 5:R48.

80. Datta S, Datta S: Methods for evaluating clustering algorithms for gene expression data using a reference set of functional classes. BMC Bioinformatics 2006, 7:397.

81. Liu X, Jessen WJ, Sivaganesan S, Aronow BJ, Medvedovic M: Bayesian hierarchical model for transcriptional module discovery by jointly modeling gene expression and ChIP-chip data. BMC Bioinformatics 2007, 8:283.

82. Guo X, Liu R, Shriver CD, Hu H, Liebman MN: Assessing semantic similarity measures for the characterization of human regulatory pathways. Bioinformatics 2006, 22:967-973.

83. Kanehisa M, Araki M, Goto S, Hattori M, Hirakawa M, Itoh M, Katayama T, Kawashima S, Okuda S, Tokimatsu T, Yamanishi Y: KEGG for linking genomes to life and the environment. Nucleic Acids Res 2008, 36:D480-D484.

84. Wingender E, Chen X, Fricke E, Geffers R, Hehl R, Liebich I, Krull M, Matys V, Michael H, Ohnhauser R, Pruss M, Schacherer F, Thiele S, Urbach S: The TRANSFAC system on gene expression regulation. Nucleic Acids Res 2001, 29:281-283.

85. Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, Cooke MP, Walker JR, Hogenesch JB: A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A 2004, 101:6062-6067.

86. Johnson DS, Li W, Gordon DB, Bhattacharjee A, Curry B, Ghosh J, Brizuela L, Carroll JS, Brown M, Flicek P, Koch CM, Dunham I, Bieda M, Xu X, Farnham PJ, Kapranov P, Nix DA, Gingeras TR, Zhang X, Holster H, Jiang N, Green RD, Song JS, McCuine SA, Anton E, Nguyen L, Trinklein ND, Ye Z, Ching K, Hawkins D, Ren B, Scacheri PC, Rozowsky J, Karpikov A, Euskirchen G, Weissman S, Gerstein M, Snyder M, Yang A, Moqtaderi Z, Hirsch H, Shulha HP, Fu Y, Weng Z, Struhl K, Myers RM, Lieb JD, Liu XS: Systematic evaluation of variability in ChIP-chip experiments using predefined DNA targets. Genome Res 2008, 18:393-403.

87. Wierstra I, Alves J: FOXM1, a typical proliferation-associated transcription factor. Biol Chem 2007, 388:1257-1274.

88. Fu Z, Malureanu L, Huang J, Wang W, Li H, van Deursen JM, Tindal DJ, Chen J: Plk1-dependent phosphorylation of FoxM1 regulates a transcriptional programme required for mitotic progression. Nat Cell Biol 2008, 10:1076-1082.

89. Dotan-Cohen D, Melkman AA, Kasif S: Hierarchical tree snipping: clustering guided by prior knowledge. Bioinformatics 2007, 23:3335-3342.


90. Huang D, Wei P, Pan W: Combining gene annotations and gene expression data in model-based clustering: weighted method. OMICS 2006, 10:28-39.

91. Huang D, Pan W: Incorporating biological knowledge into distance-based clustering analysis of microarray gene expression data. Bioinformatics 2006, 22:1259-1268.

92. Lee SI, Batzoglou S: Application of independent component analysis to microarrays. Genome Biol 2003, 4:R76.

93. Tan MP, Smith EN, Broach JR, Floudas CA: Microarray data mining: a novel optimization-based approach to uncover biologically coherent structures. BMC Bioinformatics 2008, 9:268.

94. Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 2007, 35:D61-D65.

95. Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: a Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society B 1995, 57:289-300.

96. Medvedovic M, Guo J: Bayesian Model-Averaging in Unsupervised Learning From Microarray Data. BIOKDD 2004 2004.

97. Shippy R, Fulmer-Smentek S, Jensen RV, Jones WD, Wolber PK, Johnson CD, Pine PS, Boysen C, Guo X, Chudin E, Sun YA, Willey JC, Thierry-Mieg J, Thierry-Mieg D, Setterquist RA, Wilson M, Lucas AB, Novoradovskaya N, Papallo A, Turpaz Y, Baker SC, Warrington JA, Shi L, Herman D: Using RNA sample titrations to assess microarray platform performance and normalization techniques. Nat Biotechnol 2006, 24:1123-1131.

98. Segal E, Friedman N, Koller D, Regev A: A module map showing conditional activity of expression modules in cancer. Nat Genet 2004, 36:1090-1098.

99. Kang HC, Kim IJ, Park JH, Shin Y, Ku JL, Jung MS, Yoo BC, Kim HK, Park JG: Identification of genes with differential expression in acquired drug-resistant gastric cancer cells using high-density oligonucleotide microarrays. Clin Cancer Res 2004, 10:272-284.

100. Rhodes DR, Yu J, Shanker K, Deshpande N, Varambally R, Ghosh D, Barrette T, Pandey A, Chinnaiyan AM: Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proc Natl Acad Sci U S A 2004, 101:9309-9314.


Appendix

A. Gene Lists

Table A1. Top 200 DCS genes for the Schmidt et al. (2008) dataset.

Entrez Gene ID  Gene Symbol  Gene Name
18              ABAT         4-aminobutyrate aminotransferase
79852           ABHD9        abhydrolase domain containing 9
59272           ACE2         angiotensin I converting enzyme (peptidyl-dipeptidase A) 2
58              ACTA1        actin, alpha 1, skeletal muscle
57180           ACTR3B       ARP3 actin-related protein 3 homolog B (yeast)
133             ADM          adrenomedullin
57016           AKR1B10      aldo-keto reductase family 1, member B10 (aldose reductase)
230             ALDOC        aldolase C, fructose-bisphosphate
419             ART3         ADP-ribosyltransferase 3
429             ASCL1        achaete-scute complex homolog 1 (Drosophila)
8424            BBOX1        butyrobetaine (gamma), 2-oxoglutarate dioxygenase (gamma-butyrobetaine hydroxylase) 1
53335           BCL11A       B-cell CLL/lymphoma 11A (zinc finger protein)
597             BCL2A1       BCL2-related protein A1
221061          C10orf38     chromosome 10 open reading frame 38
89927           C16orf45     chromosome 16 open reading frame 45
51313           C4orf18      chromosome 4 open reading frame 18
79614           C5orf23      chromosome 5 open reading frame 23
765             CA6          carbonic anhydrase VI
768             CA9          carbonic anhydrase IX
55799           CACNA2D3     calcium channel, voltage-dependent, alpha 2/delta 3 subunit
794             CALB2        calbindin 2, 29kDa (calretinin)
820             CAMP         cathelicidin antimicrobial peptide
875             CBS          cystathionine-beta-synthase
6364            CCL20        chemokine (C-C motif) ligand 20
50489           CD207        CD207 molecule, langerin
28513           CDH19        cadherin 19, type 2
1001            CDH3         cadherin 3, type 1, P-cadherin (placental)
1117            CHI3L2       chitinase 3-like 2
9435            CHST2        carbohydrate (N-acetylglucosamine-6-O) sulfotransferase 2
4435            CITED1       Cbp/p300-interacting transactivator, with Glu/Asp-rich carboxy-terminal domain, 1
9076            CLDN1        claudin 1
9071            CLDN10       claudin 10

98

9073 CLDN8 claudin 8 1690 COCH factor C homolog, cochlin (Limulus polyphemus) 1299 COL9A3 collagen, type IX, alpha 3 1373 CPS1 carbamoyl-phosphate synthetase 1, mitochondrial 1396 CRIP1 cysteine-rich protein 1 (intestinal) 1410 CRYAB crystallin, alpha B 51084 CRYL1 crystallin, lambda 1 1448 CSN3 casein kappa 6376 CX3CL1 chemokine (C-X3-C motif) ligand 1 51700 CYB5R2 cytochrome b5 reductase 2 1535 CYBA cytochrome b-245, alpha polypeptide 1672 DEFB1 defensin, beta 1 22943 DKK1 dickkopf homolog 1 (Xenopus laevis) 8788 DLK1 delta-like 1 homolog (Drosophila) 1824 DSC2 desmocollin 2 1825 DSC3 desmocollin 3 1828 DSG1 desmoglein 1 1830 DSG3 desmoglein 3 (pemphigus vulgaris antigen) 10085 EDIL3 EGF-like repeats and discoidin I-like domains 3 1956 EGFR epidermal growth factor receptor (erythroblastic leukemia viral (v-erb-b) oncogene homolog, avian) 2001 ELF5 E74-like factor 5 (ets domain transcription factor) 60481 ELOVL5 ELOVL family member 5, elongation of long chain fatty acids (FEN1/Elo2, SUR4/Elo3-like, yeast) 2019 EN1 engrailed homeobox 1 957 ENTPD5 ectonucleoside triphosphate diphosphohydrolase 5 2049 EPHB3 EPH receptor B3 2173 FABP7 fatty acid binding protein 7, brain 2203 FBP1 fructose-1,6-bisphosphatase 1 2273 FHL1 four and a half LIM domains 1 2346 FOLH1 folate (prostate-specific membrane antigen) 1 2348 FOLR1 folate receptor 1 (adult) 8323 FZD6 frizzled homolog 6 (Drosophila) 79623 GALNT14 UDP-N-acetyl-alpha-D-galactosamine 2678 GGT1 gamma-glutamyltransferase 1 2717 GLA galactosidase, alpha 2824 GPM6B glycoprotein M6B 57211 GPR126 G protein-coupled receptor 126 10149 GPR64 G protein-coupled receptor 64 51704 GPRC5B G protein-coupled receptor, family C, group 5, member B 2938 GSTA1 glutathione S- A1 2950 GSTP1 glutathione S-transferase pi

99

3215 HOXB5 homeobox B5 57110 HRASLS HRAS-like suppressor 3294 HSD17B2 hydroxysteroid (17-beta) dehydrogenase 2 3355 HTR1F 5-hydroxytryptamine (serotonin) receptor 1F 3475 IFRD1 interferon-related developmental regulator 1 3495 IGHD immunoglobulin heavy constant delta 28410 IGHV3-72 immunoglobulin heavy variable 3-72 7850 IL1R2 interleukin 1 receptor, type II 8821 INPP4B inositol polyphosphate-4-phosphatase, type II, 105kDa 84223 IQCG IQ motif containing G 3655 ITGA6 integrin, alpha 6 8645 KCNK5 potassium channel, subfamily K, member 5 3779 KCNMB1 potassium large conductance calcium-activated channel, subfamily M, beta member 1 3783 KCNN4 potassium intermediate/small conductance calcium-activated channel, subfamily N, member 4 65987 KCTD14 potassium channel tetramerisation domain containing 14 10656 KHDRBS3 KH domain containing, RNA binding, signal transduction associated 3 5655 KLK10 -related peptidase 10 43849 KLK12 kallikrein-related peptidase 12 25818 KLK5 kallikrein-related peptidase 5 5653 KLK6 kallikrein-related peptidase 6 5650 KLK7 kallikrein-related peptidase 7 11202 KLK8 kallikrein-related peptidase 8 3868 KRT16 16 (focal non-epidermolytic palmoplantar keratoderma) 3887 KRT81 keratin 81 3892 KRT86 keratin 86 3906 LALBA lactalbumin, alpha- 3934 LCN2 lipocalin 2 (oncogene 24p3) 3945 LDHB lactate dehydrogenase B 8825 LIN7A lin-7 homolog A (C. elegans) 8543 LMO4 LIM domain only 4 79782 LRRC31 leucine rich repeat containing 31 8581 LY6D lymphocyte antigen 6 complex, D 4111 MAGEA12 melanoma antigen family A, 12 4117 MAK male germ cell-associated kinase 115123 MARCH3 membrane-associated ring finger (C3HC4) 3 8685 MARCO macrophage receptor with collagenous structure 4199 ME1 malic enzyme 1, NADP(+)-dependent, cytosolic 4240 MFGE8 milk fat globule-EGF factor 8 protein 8190 MIA melanoma inhibitory activity 57553 MICAL3 associated monoxygenase, calponin and LIM domain containing 3

100

4281 MID1 midline 1 (Opitz/BBB syndrome) 10962 MLLT11 myeloid/lymphoid or mixed-lineage leukemia (trithorax homolog, Drosophila); translocated to, 11 79083 MLPH melanophilin 4319 MMP10 matrix metallopeptidase 10 (stromelysin 2) 4321 MMP12 matrix metallopeptidase 12 (macrophage ) 10232 MSLN mesothelin 94025 MUC16 mucin 16, cell surface associated 4603 MYBL1 v-myb myeloblastosis viral oncogene homolog (avian)-like 1 4609 MYC v-myc myelocytomatosis viral oncogene homolog (avian) 83988 NCALD neurocalcin delta 10397 NDRG1 N-myc downstream regulated gene 1 57447 NDRG2 NDRG family member 2 10763 NES 81831 NETO2 neuropilin (NRP) and tolloid (TLL)-like 2 23114 NFASC neurofascin homolog (chicken) 9603 NFE2L3 nuclear factor (erythroid-derived 2)-like 3 10874 NMU neuromedin U 56654 NPDC1 neural proliferation, differentiation and control, 1 64943 NT5DC2 5'-nucleotidase domain containing 2 54959 ODAM odontogenic, ameloblast asssociated 4953 ODC1 ornithine decarboxylase 1 79627 OGFRL1 opioid growth factor receptor-like 1 10439 OLFM1 olfactomedin 1 5004 ORM1 orosomucoid 1 5005 ORM2 orosomucoid 2 5121 PCP4 Purkinje cell protein 4 10158 PDZK1IP1 PDZK1 interacting protein 1 79605 PGBD5 piggyBac transposable element derived 5 26227 PHGDH phosphoglycerate dehydrogenase 51050 PI15 peptidase inhibitor 15 8544 PIR pirin (iron-binding nuclear protein) 5320 PLA2G2A phospholipase A2, group IIA (platelets, synovial fluid) 58473 PLEKHB1 pleckstrin homology domain containing, family B (evectins) member 1 5354 PLP1 proteolipid protein 1 (Pelizaeus-Merzbacher disease, spastic paraplegia 2, uncomplicated) 10687 PNMA2 paraneoplastic antigen MA2 23532 PRAME preferentially expressed antigen in melanoma 5613 PRKX protein kinase, X-linked 58503 PROL1 proline rich, lacrimal 1 10942 PRSS21 protease, serine, 21 (testisin) 29968 PSAT1 phosphoserine aminotransferase 1

101

5746 PTH2R parathyroid hormone 2 receptor 9200 PTPLA protein tyrosine phosphatase-like (proline instead of catalytic arginine), member A 5799 PTPRN2 protein tyrosine phosphatase, receptor type, N polypeptide 2 5806 PTX3 pentraxin-related gene, rapidly induced by IL-1 beta 1827 RCAN1 regulator of calcineurin 1 6019 RLN2 relaxin 2 152015 ROPN1B ropporin, rhophilin associated protein 1B 6261 RYR1 ryanodine receptor 1 (skeletal) 6271 S100A1 S100 calcium binding protein A1 6285 S100B S100 calcium binding protein B 29970 SCHIP1 schwannomin interacting protein 1 6338 SCNN1B sodium channel, nonvoltage-gated 1, beta (Liddle syndrome) 11341 SCRG1 scrapie responsive protein 1 54847 SIDT1 SID1 transmembrane family, member 1 28965 SLC27A6 solute carrier family 27 (fatty acid transporter), member 6 6519 SLC3A1 solute carrier family 3 (cystine, dibasic and neutral amino acid transporters, activator of cystine, dibasic and neutral amino acid transport), member 1 8501 SLC43A1 solute carrier family 43, member 1 11254 SLC6A14 solute carrier family 6 (amino acid transporter), member 14 6590 SLPI secretory leukocyte peptidase inhibitor 9751 SNPH syntaphilin 25928 SOSTDC1 sclerostin domain containing 1 6663 SOX10 SRY (sex determining region Y)-box 10 6664 SOX11 SRY (sex determining region Y)-box 11 25803 SPDEF SAM pointed domain containing ets transcription factor 6715 SRD5A1 steroid-5-alpha-reductase, alpha polypeptide 1 (3-oxo-5 alpha-steroid delta 4- dehydrogenase alpha 1) 6489 ST8SIA1 ST8 alpha-N-acetyl-neuraminide alpha-2,8-sialyltransferase 1 79921 TCEAL4 transcription elongation factor A (SII)-like 4 83439 TCF7L1 transcription factor 7-like 1 (T-cell specific, HMG-box) 7018 TF transferrin 25907 TMEM158 transmembrane protein 158 55076 TMEM45A transmembrane protein 45A 11013 TMSL8 thymosin-like 8 27242 TNFRSF21 tumor necrosis factor receptor superfamily, member 21 7162 TPBG trophoblast glycoprotein 25823 TPSG1 gamma 1 7200 TRH thyrotropin-releasing hormone 23321 TRIM2 tripartite 
motif-containing 2 23650 TRIM29 tripartite motif-containing 29 25893 TRIM58 tripartite motif-containing 58 7103 TSPAN8 tetraspanin 8

102

57348 TTYH1 tweety homolog 1 (Drosophila) 347733 TUBB2B , beta 2B 7345 UCHL1 ubiquitin carboxyl-terminal esterase L1 (ubiquitin thiolesterase) 54490 UGT2B28 UDP glucuronosyltransferase 2 family, polypeptide B28 51442 VGLL1 vestigial like 1 (Drosophila) 8838 WISP3 WNT1 inducible signaling pathway protein 3 54361 WNT4 wingless-type MMTV integration site family, member 4 7545 ZIC1 Zic family member 1 (odd-paired homolog, Drosophila)

Table A2. Top 200 genes differentially expressed between two major contexts (marked green and blue in Figure 2.3A) for the Schmidt et al. (2008) dataset.

Entrez Gene ID    Gene Symbol    Gene Name
16    AARS    alanyl-tRNA synthetase
18    ABAT    4-aminobutyrate aminotransferase
150    ADRA2A    adrenergic, alpha-2A-, receptor
9590    AKAP12    A kinase (PRKA) anchor protein (gravin) 12
57763    ANKRA2    ankyrin repeat, family A (RFXANK-like), 2
9582    APOBEC3B    apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like 3B
9824    ARHGAP11A    Rho GTPase activating protein 11A
259266    ASPM    asp (abnormal spindle) homolog, microcephaly associated (Drosophila)
54829    ASPN    asporin
55210    ATAD3A    ATPase family, AAA domain containing 3A
6790    AURKA    aurora kinase A
9212    AURKB    aurora kinase B
332    BIRC5    baculoviral IAP repeat-containing 5 (survivin)
641    BLM    Bloom syndrome
7832    BTG2    BTG family, member 2
10950    BTG3    BTG family, member 3
699    BUB1    BUB1 budding uninhibited by benzimidazoles 1 homolog (yeast)
701    BUB1B    BUB1 budding uninhibited by benzimidazoles 1 homolog beta (yeast)
705    BYSL    bystin-like
219654    C10orf56    chromosome 10 open reading frame 56
79786    C16orf44    chromosome 16 open reading frame 44
56942    C16orf61    chromosome 16 open reading frame 61
29105    C16orf80    chromosome 16 open reading frame 80
78995    C17orf53    chromosome 17 open reading frame 53
220134    C18orf24    chromosome 18 open reading frame 24
114899    C1QTNF3    C1q and tumor necrosis factor related protein 3
24141    C20orf103    chromosome 20 open reading frame 103
64921    CASD1    CAS1 domain containing 1
890    CCNA2    cyclin A2
9133    CCNB2    cyclin B2
22948    CCT5    chaperonin containing TCP1, subunit 5 (epsilon)
8872    CDC123    cell division cycle 123 homolog (S. cerevisiae)
983    CDC2    cell division cycle 2, G1 to S and G2 to M
991    CDC20    cell division cycle 20 homolog (S. cerevisiae)
993    CDC25A    cell division cycle 25 homolog A (S. pombe)
8318    CDC45L    CDC45 cell division cycle 45-like (S. cerevisiae)
83461    CDCA3    cell division cycle associated 3
55143    CDCA8    cell division cycle associated 8
1013    CDH15    cadherin 15, M-cadherin (myotubule)
1033    CDKN3    cyclin-dependent kinase inhibitor 3 (CDK2-associated dual specificity phosphatase)
81620    CDT1    chromatin licensing and DNA replication factor 1
1054    CEBPG    CCAAT/enhancer binding protein (C/EBP), gamma
1058    CENPA    centromere protein A
1063    CENPF    centromere protein F, 350/400ka (mitosin)
55839    CENPN    centromere protein N
55165    CEP55    centrosomal protein 55kDa
10036    CHAF1A    chromatin assembly factor 1, subunit A (p150)
8208    CHAF1B    chromatin assembly factor 1, subunit B (p60)
1111    CHEK1    CHK1 checkpoint homolog (S. pombe)
8483    CILP    cartilage intermediate layer protein, nucleotide pyrophosphohydrolase
57396    CLK4    CDC-like kinase 4
1307    COL16A1    collagen, type XVI, alpha 1
10328    COX4NB    COX4 neighbor
51380    CSAD    cysteine sulfinic acid decarboxylase
1503    CTPS    CTP synthase
79075    DCC1    defective in sister chromatid cohesion homolog 1 (S. cerevisiae)
55794    DDX28    DEAD (Asp-Glu-Ala-Asp) box polypeptide 28
10212    DDX39    DEAD (Asp-Glu-Ala-Asp) box polypeptide 39
1723    DHODH    dihydroorotate dehydrogenase
55526    DHTKD1    dehydrogenase E1 and transketolase domain containing 1
85458    DIXDC1    DIX domain containing 1
1736    DKC1    dyskeratosis congenita 1, dyskerin
9787    DLG7    discs, large homolog 7 (Drosophila)
1869    E2F1    E2F transcription factor 1
79733    E2F8    E2F transcription factor 8
10682    EBP    emopamil binding protein (sterol isomerase)
1842    ECM2    extracellular matrix protein 2, female organ and adipocyte specific
1909    EDNRA    endothelin receptor type A
54821    ERCC6L    excision repair cross-complementing rodent repair deficiency, complementation group 6-like
9700    ESPL1    extra spindle pole bodies homolog 1 (S. cerevisiae)
9156    EXO1    exonuclease 1
2146    EZH2    enhancer of zeste homolog 2 (Drosophila)
54478    FAM64A    family with sequence similarity 64, member A
2175    FANCA    Fanconi anemia, complementation group A
2178    FANCE    Fanconi anemia, complementation group E
26263    FBXO22    F-box protein 22
2237    FEN1    flap structure-specific endonuclease 1
2305    FOXM1    forkhead box M1
51363    GALNAC4S-6ST    B cell RAG associated protein
8836    GGH    gamma-glutamyl hydrolase (conjugase, folylpolygammaglutamyl hydrolase)
64785    GINS3    GINS complex subunit 3 (Psf3 homolog)
2697    GJA1    gap junction protein, alpha 1, 43kDa
55830    GLT8D1    glycosyltransferase 8 domain containing 1
29899    GPSM2    G-protein signaling modulator 2 (AGS3-like, C. elegans)
23560    GTPBP4    GTP binding protein 4
51512    GTSE1    G-2 and S-phase expressed 1
8352    HIST1H3C    histone cluster 1, H3c
3149    HMGB3    high-mobility group box 3
55806    HR    hairless homolog (mouse)
51182    HSPA14    heat shock 70kDa protein 14
10581    IFITM2    interferon induced transmembrane protein 2 (1-8D)
3481    IGF2    insulin-like growth factor 2 (somatomedin A)
3669    ISG20    interferon stimulated exonuclease gene 20kDa
83700    JAM3    junctional adhesion molecule 3
56888    KCMF1    potassium channel modulatory factor 1
3755    KCNG1    potassium voltage-gated channel, subfamily G, member 1
23116    KIAA0423    KIAA0423
23247    KIAA0556    KIAA0556
3832    KIF11    kinesin family member 11
9928    KIF14    kinesin family member 14
10112    KIF20A    kinesin family member 20A
9493    KIF23    kinesin family member 23
11004    KIF2C    kinesin family member 2C
24137    KIF4A    kinesin family member 4A
3833    KIFC1    kinesin family member C1
4001    LMNB1    lamin B1
4016    LOXL1    lysyl oxidase-like 1
7804    LRP8    low density lipoprotein receptor-related protein 8, apolipoprotein e receptor
10234    LRRC17    leucine rich repeat containing 17
2615    LRRC32    leucine rich repeat containing 32
4148    MATN3    matrilin 3
55388    MCM10    minichromosome maintenance complex component 10
4171    MCM2    minichromosome maintenance complex component 2
4174    MCM5    minichromosome maintenance complex component 5
4175    MCM6    minichromosome maintenance complex component 6
4199    ME1    malic enzyme 1, NADP(+)-dependent, cytosolic
9833    MELK    maternal embryonic leucine zipper kinase
4239    MFAP4    microfibrillar-associated protein 4
4288    MKI67    antigen identified by monoclonal antibody Ki-67
57496    MKL2    MKL/myocardin-like 2
4330    MN1    meningioma (disrupted in balanced translocation) 1
6183    MRPS12    mitochondrial ribosomal protein S12
4597    MVD    mevalonate (diphospho) decarboxylase
9961    MVP    major vault protein
4605    MYBL2    v-myb myeloblastosis viral oncogene homolog (avian)-like 2
4675    NAP1L3    nucleosome assembly protein 1-like 3
64151    NCAPG    non-SMC condensin I complex, subunit G
23397    NCAPH    non-SMC condensin I complex, subunit H
10403    NDC80    NDC80 homolog, kinetochore complex component (S. cerevisiae)
4771    NF2    neurofibromin 2 (bilateral acoustic neuroma)
11188    NISCH    nischarin
4839    NOL1    nucleolar protein 1, 120kDa
9221    NOLC1    nucleolar and coiled-body phosphoprotein 1
51203    NUSAP1    nucleolar and spindle associated protein 1
4969    OGN    osteoglycin
4958    OMD    osteomodulin
4998    ORC1L    origin recognition complex, subunit 1-like (yeast)
23594    ORC6L    origin recognition complex, subunit 6 like (yeast)
80310    PDGFD    platelet derived growth factor D
5163    PDK1    pyruvate dehydrogenase kinase, isozyme 1
23590    PDSS1    prenyl (decaprenyl) diphosphate synthase, subunit 1
29990    PILRB    paired immunoglobin-like type 2 receptor beta
8544    PIR    pirin (iron-binding nuclear protein)
5347    PLK1    polo-like kinase 1 (Drosophila)
57088    PLSCR4    phospholipid scramblase 4
9055    PRC1    protein regulator of cytokinesis 1
5557    PRIM1    primase, DNA, polypeptide 1 (49kDa)
5688    PSMA7    proteasome (prosome, macropain) subunit, alpha type, 7
5690    PSMB2    proteasome (prosome, macropain) subunit, beta type, 2
23198    PSME4    proteasome (prosome, macropain) activator subunit 4
8624    PSMG1    Down syndrome critical region gene 2
9232    PTTG1    pituitary tumor-transforming 1
26255    PTTG3    pituitary tumor-transforming 3
5813    PURA    purine-rich element binding protein A
5817    PVR    poliovirus receptor
29127    RACGAP1    Rac GTPase activating protein 1
5888    RAD51    RAD51 homolog (RecA homolog, E. coli) (S. cerevisiae)
5902    RANBP1    RAN binding protein 1
55544    RBM38    RNA binding motif protein 38
10181    RBM5    RNA binding motif protein 5
8434    RECK    reversion-inducing-cysteine-rich protein with kazal motifs
6241    RRM2    ribonucleotide reductase M2 polypeptide
22929    SEPHS1    selenophosphate synthetase 1
23450    SF3B3    splicing factor 3b, subunit 3, 130kDa
6502    SKP2    S-phase kinase-associated protein 2 (p45)
8140    SLC7A5    solute carrier family 7 (cationic amino acid transporter, y+ system), member 5
6631    SNRPC    small nuclear ribonucleoprotein polypeptide C
10615    SPAG5    sperm associated antigen 5
8404    SPARCL1    SPARC-like 1 (mast9, hevin)
57405    SPC25    SPC25, NDC80 kinetochore complex component, homolog (S. cerevisiae)
6491    STIL    SCL/TAL1 interrupting locus
3925    STMN1    stathmin 1/oncoprotein 18
81493    SYNC1    syncoilin, intermediate filament 1
10460    TACC3    transforming, acidic coiled-coil containing protein 3
6924    TCEB3    transcription elongation factor B (SIII), polypeptide 3 (110kDa, elongin A)
6949    TCOF1    Treacher Collins-Franceschetti syndrome 1
7043    TGFB3    transforming growth factor, beta 3
54962    TIPIN    TIMELESS interacting protein
7083    TK1    thymidine kinase 1, soluble
79838    TMC5    transmembrane channel-like 5
63923    TNN    tenascin N
7153    TOP2A    topoisomerase (DNA) II alpha 170kDa
22974    TPX2    TPX2, microtubule-associated, homolog (Xenopus laevis)
9319    TRIP13    thyroid hormone receptor interactor 13
10024    TROAP    trophinin associated protein (tastin)
7272    TTK    TTK protein kinase
7296    TXNRD1    thioredoxin reductase 1
11065    UBE2C    ubiquitin-conjugating enzyme E2C
7371    UCK2    uridine-cytidine kinase 2
56886    UGCGL1    UDP-glucose ceramide glucosyltransferase-like 1
55355    URLC9    hypothetical protein DKFZp762E1312
27183    VPS4A    vacuolar protein sorting 4 homolog A (S. cerevisiae)
57728    WDR19    WD repeat domain 19
9942    XYLB    xylulokinase homolog (H. influenzae)
6935    ZEB1    zinc finger E-box binding homeobox 1
23414    ZFPM2    zinc finger protein, multitype 2
10127    ZNF263    zinc finger protein 263
54816    ZNF280D    suppressor of hairy wing homolog 4 (Drosophila)
23090    ZNF423    zinc finger protein 423
11130    ZWINT    ZW10 interactor

B. R packages

B.1 Contributions to gimmR

computeDCEscore: A wrapper function to compute the DCE score given two contexts. Description: After running gimm and posthoc with the estimate_context option set to "y" and the intFiles option set to "TRUE", the user may specify two sets of samples (called contexts) to first compute the local pairwise posterior probabilities for all pairs of genes for both contexts and then compute a gene-specific differential co-expression score reflecting the differences between the two contexts.

generateLocalPPPs: Wrapper function to compute local pairwise posterior probabilities given a set of samples (context). Description: After running gimm and posthoc with the estimate_context option set to "y" and the intFiles option set to "TRUE", the user may specify a set of samples (called context) to compute the local pairwise posterior probabilities for all pairs of genes.

getBinarySubTree: Get the left or right subtree of a given hierarchical clustering (non-recursive).

getPerm: Gets all the permutations from hierarchical clustering (non-recursive).

splitMatrix: Split a matrix into sub-matrices. Description: This function is similar to split() but returns a list of (sub-)matrices.
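The behaviour described for splitMatrix can be illustrated with a minimal base-R sketch. It is not the gimmR implementation: the function name split_matrix_rows, its arguments, and the choice to split by rows rather than columns are assumptions made for illustration only.

```r
# Hypothetical helper (not part of gimmR): split the rows of a matrix by a
# grouping vector and return a list of sub-matrices. Base split() applied to
# a matrix would return plain vectors instead.
split_matrix_rows <- function(m, groups) {
  stopifnot(is.matrix(m), length(groups) == nrow(m))
  # Split the row indices, then subset the matrix for each group.
  idx <- split(seq_len(nrow(m)), groups)
  # drop = FALSE keeps single-row groups as matrices.
  lapply(idx, function(i) m[i, , drop = FALSE])
}

# Toy data: 4 genes (rows) by 3 samples (columns).
m <- matrix(1:12, nrow = 4,
            dimnames = list(paste0("g", 1:4), paste0("s", 1:3)))
parts <- split_matrix_rows(m, c("a", "b", "a", "b"))
```

Here parts is a named list with one 2 x 3 sub-matrix per group, and row and column names are preserved.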

B.2 CLEAN

CLEAN-package: R package for Clustering Enrichment Analysis. Description: Given a hierarchical gene clustering and a list of functional categories, this package performs functional enrichment analysis of all possible clusters and generates files to simultaneously display gene expression data, gene clustering, sample clustering, and functional annotation of gene clustering.


References: Freudenberg et al. [37].

Examples:

    data(gimmOut)
    require(CLEAN.Rn)
    res <- runCLEAN(gimmOut, species = "Rn")
    generateTreeViewFiles(gimmOut,
        functionalCategories = getFunctionalCategories("geneRIFs", species = "Rn"))
    # same as
    generateTreeViewFiles(gimmOut, functionalCategories = "geneRIFs", species = "Rn")

    # multiple category types
    generateTreeViewFiles(gimmOut,
        functionalCategories = c("geneRIFs", "CpGislands", "GO", "KEGG"),
        species = "Rn")
    # sample description = part of each sample name before the first underscore
    trt <- sapply(colnames(gimmOut$clustData)[-(1:2)],
        function(str) strsplit(str, split = "_")[[1]][1])
    generateTreeViewFiles(gimmOut, cclust = NA, verbose = FALSE,
        functionalCategories = c("geneRIFs", "CpGislands", "GO", "KEGG"),
        species = "Rn", callTreeView = TRUE, sampleDesc = trt)

    # non-hierarchical clustering
    d <- nonHierarchicalClustering(
        function(m, k, ...) kmeans(m, k, ...)$cluster,
        gimmOut$clustData[, -(1:2)], k = 2:4, nstart = 10)
    generateTreeViewFiles(gimmOut, rclust = d, cclust = NA, verbose = FALSE,
        functionalCategories = c("geneRIFs", "CpGislands", "GO", "KEGG"),
        species = "Rn", callTreeView = TRUE, sampleDesc = trt)

    # geneList enrichment
    geneList <- gimmOut$clustData[, 1]
    require(org.Rn.eg.db)
    # one should really use the list of genes represented on the
    # microarray instead
    allGenes <- unique(keys(org.Rn.egSYMBOL))
    res <- geneListEnrichment(geneList, allGenes, functionalCategories = "GO",
        species = "Rn", sigFDR = 0.01, maxGenesInCategory = 10000)
    genesInEnrichedCategories(res[, 1], geneList, funcCategories = "GO",
        species = "Rn")

runCLEAN, generateTreeViewFiles, geneListEnrichment: Wrapper functions to perform functional Clustering Enrichment Analysis (CLEAN). Description: These functions take a gene expression data set, hierarchical clusterings of genes and samples, and a list of gene sets representing functional categories. They perform hierarchical clustering (if not provided) and then the Clustering Enrichment Analysis. Finally, they generate files to display data, clustering, and functional annotation using tools such as the Java-based fTreeView.

getFunctionalCategories: Retrieve functional categories from a species-specific library. Description: Loads the species-specific library or libraries and retrieves the functional categories of the type(s) given in "functionalCategories".

getAllClusters: Generate a list of all possible clusters from a hierarchical tree. Description: The function takes an hclust object and returns a list of all possible clusters of desired minimum to maximum size. Clusters are determined by cutting the tree at a given level and recording the leaf node labels of the resulting subtree(s).

convertFunctionalCategories: Function to convert Entrez gene ID-based functional categories from one species to another using the Homologene database. Description: This function loads the Homologene table and a list of functional categories and converts Entrez gene IDs from one species to another. Currently implemented species are "h" (Homo sapiens), "m" (Mus musculus), and "r" (Rattus norvegicus). Use the following command to download the Homologene table into the working directory:

    download.file("ftp://ftp.ncbi.nih.gov/pub/HomoloGene/current/homologene.data",
        destfile = "homologene.data", mode = "wb")

This function was adapted from convertSmc() of the PGSEA package.

getBinarySubTree: Get the left or right subtree of a given hierarchical clustering. Description: Reads a hierarchical clustering and returns the left or right child of the root node.

r2cdt, r2gtr, r2atr, r2fni: Functions to generate fTreeView files. Description: r2cdt, r2gtr, and r2atr are slightly modified functions from the package ctc. r2cdt generates a tab-delimited cdt file from a dataframe where the first column contains gene IDs, the second column contains gene names or descriptions, and the remaining columns represent samples. Column names are assumed to be sample names. In addition, corresponding hierarchical gene and sample clusterings are required to re-order genes and samples accordingly. r2gtr and r2atr generate tab-delimited TreeView files representing the hierarchical gene and sample clustering, respectively. r2fni generates the Functional Node Information file for the Java application fTreeView using a hierarchical gene clustering and the corresponding functional Cluster Enrichment Analysis results. This tab-delimited file is then displayed interactively as the user selects a node (i.e., gene cluster) in the gene tree.

call.treeview: Function to call the standalone fTreeView Java application. Description: This function calls fTreeView, a Java application to display the heatmap, gene clustering, sample clustering, and functional annotation files generated with the CLEAN package.

cdt2r, gtr2r, atr2r: Functions to import files in TreeView format into R.
Description: cdt2r() imports a cdt file and converts it to a dataframe where rows represent genes and columns represent samples. The first column of the dataframe contains the gene identifiers and the second column contains gene names or descriptions. Column names represent sample names.
Details: cdt2r() reads a tab-delimited file in generalized cdt format where the first row contains sample names, followed by additional optional rows. The first two or three columns contain gene IDs and descriptions, and optional additional columns, followed by gene expression data. The function returns a dataframe with gene IDs and descriptions in the first two columns, gene expression data in the remaining columns, and column names representing samples. gtr2r() and atr2r() read tree files representing the hierarchical gene tree or sample tree structure, respectively. Both functions return an hclust object.
Value: cdt2r() returns a dataframe (see Details). gtr2r() and atr2r() each return an hclust object.

plotCLEANscore: Function to generate a diagnostic plot using the CLEAN score. Description: This function generates a plot of the number of genes with CLEAN scores <= a threshold against that threshold. These plots can be used, for example, to compare different clustering algorithms.

Examples:

    data(gimmOut)
    require(org.Rn.eg.db)
    allGenes <- unique(keys(org.Rn.egSYMBOL))
    # plotCLEANscore is intended for larger datasets. Here we use the larger
    # background gene list to somewhat better demonstrate the
    # plotCLEANscore() function. In real life, the background list should be
    # the list of genes represented on the microarray
    fClustAnnotations <- runCLEAN(gimmOut, bkgList = allGenes,
        functionalCategories = "CpGislands", species = "Rn",
        maxGenesInCategory = 10000)
    plotCLEANscore(fClustAnnotations, fCategoryName = "CpGislands")

nonHierarchicalClustering: Convert a non-hierarchical clustering into a hierarchical tree structure.
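The cluster enumeration described under getAllClusters can be sketched in base R as follows. This is an illustration, not the CLEAN implementation: the name all_clusters and its arguments are hypothetical, and instead of cutting the tree at successive levels the sketch walks the merge matrix of the hclust object, which visits the same set of subtrees (every internal node once).

```r
# Hypothetical sketch (not the CLEAN code): list the leaf labels under every
# internal node of an hclust tree, keeping clusters within a size range.
all_clusters <- function(hc, min_size = 2, max_size = length(hc$order)) {
  labels <- if (is.null(hc$labels)) as.character(seq_along(hc$order)) else hc$labels
  n_merges <- nrow(hc$merge)
  members <- vector("list", n_merges)
  for (i in seq_len(n_merges)) {
    # In hc$merge, a negative entry -j is leaf j; a positive entry k refers
    # to the cluster formed at the earlier merge step k.
    members[[i]] <- unlist(lapply(hc$merge[i, ], function(ch) {
      if (ch < 0) -ch else members[[ch]]
    }))
  }
  sizes <- lengths(members)
  keep <- sizes >= min_size & sizes <= max_size
  lapply(members[keep], function(idx) labels[idx])
}

# Toy data: two tight pairs (a, b) and (c, d) plus an outlier e.
x <- matrix(c(0, 1, 10, 11, 30), ncol = 1,
            dimnames = list(c("a", "b", "c", "d", "e"), NULL))
hc <- hclust(dist(x), method = "average")
clusters <- all_clusters(hc, min_size = 2, max_size = 4)
```

With the size limits above, the root cluster containing all five leaves is excluded, so only the two pairs and their union remain. Each retained cluster could then be tested for functional enrichment.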

LRpath: Testing GO terms or KEGG with logistic regression. Description: This function uses logistic regression to test for enriched biological categories in gene expression data. Our method models the probability of a randomly selected gene belonging to a specific category given the significance level of that gene. For categories significantly affected by the experimental condition, this probability will increase as the significance statistic increases. Categories with significant p-values and positive slope coefficients are enriched with differentially expressed genes. References: Sartor et al. [76]
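The logistic regression idea behind LRpath can be sketched on simulated data with base R's glm(). This is not the LRpath code: the simulated p-values, the variable names, and the use of -log10(p) as the significance statistic are assumptions made for illustration.

```r
# Illustrative sketch only: regress category membership on a per-gene
# significance statistic, using simulated data.
set.seed(1)
n_genes <- 2000
# Assume 100 of the 2000 genes belong to the category being tested.
in_category <- rep(c(TRUE, FALSE), times = c(100, 1900))
# Simulated p-values: category genes are skewed toward small p-values,
# all other genes are uniform (no differential expression).
p_values <- ifelse(in_category, rbeta(n_genes, 0.3, 1), runif(n_genes))
# Assumed significance statistic: larger means more significant.
sig_stat <- -log10(p_values)

# Model P(gene is in category | significance statistic).
fit <- glm(in_category ~ sig_stat, family = binomial())
coefs <- summary(fit)$coefficients
slope <- coefs["sig_stat", "Estimate"]
slope_p <- coefs["sig_stat", "Pr(>|z|)"]
```

A positive slope with a small p-value corresponds to the case described above, in which the category is enriched with differentially expressed genes. In practice the test would be repeated for every category and the resulting p-values corrected for multiple testing.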
