Integrating diverse biological sources and computational methods for the

analysis of high-throughput expression data

Inauguraldissertation

zur

Erlangung des akademischen Grades eines

Doktors der Naturwissenschaften (Dr. rer. nat.)

der

Mathematisch-Naturwissenschaftlichen Fakult¨at

der

Ernst-Moritz-Arndt-Universit¨atGreifswald

vorgelegt von

Nitesh Kumar Singh

geboren am 20.03.1983

in Kanpur

Greifswald, 04.08.2014

Dekan: Professor Dr. Klaus Fesser

1. Gutachter: Professor Dr. Volkmar Liebscher

2. Gutachter: Professor Dr. Andrew Dalby

Tag der Promotion: ...... 29.10.2014

CONTENTS

Contents

Contents...... IV

I. Integrating diverse biological sources and computational methods for the analysis of high-throughput expression data1

1. Introduction 5

1.1. Complexity of high-throughput expression data...... 6

1.2. Biological units underlying complex phenotypes...... 8

1.3. Interaction networks based on orthology relationships...... 11

2. Results 13

2.1. Reducing data complexity of expression data for the sample classification problem 13

2.1.1. Feature selection methods...... 13

2.1.2. Comparison of feature selection methods...... 17

2.2. Identifying biological units underlying complex phenotypes...... 19

2.2.1. Clustering sharing functional annotation and expression profile sim-

ilarity...... 19

2.2.2. Network inference and validation...... 22

2.3. Creating interaction networks based on orthology relationships...... 25

3. Discussion and Outlook 29

II. Publications 31

4. Contributions to the publications 33

IV 5. Publication #1 34

5.1. Identifying Genes Relevant to Specific Biological Conditions in Time Course

Microarray Experiments...... 34

5.2. Supplement...... 47

6. Publication #2, under review 49

6.1. Elucidating complex phenotypes based on high-throughput expression and bio-

logical annotation data...... 49

6.2. Supplement...... 97

7. Publication #3 99

7.1. Derivation of an interaction/regulation network describing pluripotency in human 99

7.2. Supplement...... 108

III. Appendix 109

8. CD with supplementary information 111

9. Affidavit 113

10.Curriculum vitae and publications 115

11.Acknowledgements 117

12.Bibliography 119

Bibliography...... 119

Part I.

Integrating diverse biological sources and computational methods for the analysis of high-throughput expression data

1

Summary

High-throughput expression data have become the norm in molecular biology research. How- ever, the analysis of expression data is statistically and computationally challenging and has not kept up with their generation. This has resulted in large amounts of unexplored data in public repositories. After pre-processing and quality control, the typical expression analysis workflow follows two main steps. First, the complexity of the data is reduced by removing the genes that are redundant or irrelevant for the biological question that motivated the experiment, using a feature selection method. Second, relevant genes are investigated to extract biological information that could aid in the interpretation of the results. Different methods, such as func- tional annotation, clustering, network analysis, and/or combinations thereof are useful for the latter purpose.

Here, I investigated and presented solutions to three problems encountered in the expression data analysis workflow.

First, I worked on reducing complexity of high-throughput expression data by selecting rele- vant genes in the context of the sample classification problem. The sample classification problem aims to assign unknown samples into one of the known classes, such as ‘healthy’ and ‘diseased’.

For this purpose, I developed the relative signal-to-noise ratio (rSNR), a novel feature selec- tion method which was shown to perform significantly better than other methods with similar objectives.

Second, to better understand complex phenotypes using high-throughput expression data, I developed a pipeline to identify the underlying biological units, as well as their interactions.

3 These biological units were assumed to be represented by groups of genes working in synchro- nization to perform a given function or participate in common biological processes or pathways.

Thus, to identify biological units, those genes that had been identified as relevant to the pheno- type under consideration through feature selection methods were clustered based on both their functional annotations and expression profiles. Relationships between the associated biological functions, processes, and/or pathways were investigated by means of a co-expression network.

The developed pipeline provides a new perspective to the analysis of high-throughput expression data by investigating interactions between biological units.

Finally, I contributed to a project where a network describing pluripotency in mouse was used to infer the corresponding network in human. Biological networks are context-specific.

Combining network information with high-throughput expression data can explain the control mechanisms underlying changes and maintenance of complex phenotypes. The human network was constructed on the basis of orthology between mouse and human genes and . It was validated with available data in the literature.

The methods and strategies proposed here were mainly trained and tested on microarray expression data. However, they can be easily adapted to next-generation sequencing and pro- teomics data.

1. Introduction

High-throughput methodologies have transformed molecular biology research. In particu- lar, DNA microarrays have led the way to study thousands of genes in parallel, and, hence, global cellular activity [Hoheisel, 2006]. Recent advances in technology and reduction in the cost of high-throughput experiments have resulted in an enormous amount of data available in public repositories. Thus, over 1 million samples are currently present in the Gene Expres- sion Omnibus (GEO,http://www.ncbi.nlm.nih.gov/geo/, [Barrett et al., 2013]). The analysis of high-throughput expression data, however, has not kept up with their generation, and presents problems at multiple levels.

After pre-processing and quality assessment of the raw expression data, a typical gene ex- pression analysis workflow follows two main steps (see Figure 1.1). First, the complexity of expression data is reduced by selecting those genes that are relevant to the problem under investigation. High-throughput expression experiments measure the expression levels of thou- sands of genes in parallel [Guyon and Elisseeff, 2003]. Most measured genes are redundant or irrelevant to the problem being studied [Tusher et al., 2001, Wolfinger et al., 2001]. Relevant genes can be selected using feature selection methods such as fold-change, t-statistic and vari- ants, significance analysis of microarrays (SAM,[Tusher et al., 2001]), signal-to-noise ratio, or combinations of them. Second, the relevant genes are further mined to gain novel information and insights into the biological problem being addressed. Several methods are available for this type of analysis. A simple strategy consists in classifying the relevant genes into groups, according to the biological functions, processes, and/or pathways in which they are involved.

5 CHAPTER 1. INTRODUCTION

Databases such as (GO, [Botstein et al., 2000]) and the Kyoto Encyclopedia of

Genes and Genomes (KEGG, [Kanehisa, 2000, Kanehisa et al., 2014]) can be used to obtain biological annotation information for the genes. On this basis, enrichment analysis can evaluate the statistical significance of the over-representation or depletion of given biological annotation terms among relevant genes [Hosack et al., 2003]. Another popular approach for analyzing sets of relevant genes is clustering. Clustering genes based on their expression profiles often identifies groups of genes with similar expression patterns, which often function within related biological processes and/or pathways [Eisen et al., 1998, Shannon et al., 2003]. Finally, the expression profiles of relevant genes can be studied in the context of an interaction network [Kong et al.,

2014, Siatkowski et al., 2013, Warsow et al., 2010]. These analytical methods address different problems and reveal distinct insights into the expression data. I investigated and presented solutions to three problems encountered at different levels of the expression analysis workflow.

In the next sections, I describe each of these problems in more detail and present my solutions to them.

1.1. Complexity of high-throughput expression data

The main challenge in the analysis of high-throughput expression data is the availability of a very small number of samples in comparison to the number of genes being considered [Duval and Hao, 2010]. Experimental variation in measured levels poses another prob- lem [Jirapech-Umpai and Aitken, 2005]. As a result, the analysis of high-throughput expression data is time-consuming and computationally expensive. Feature selection methods reduce the complexity of the data by selecting those genes that are relevant to the problem being inves- tigated. These relevant genes are commonly used as input for subsequent analysis algorithms.

Feature selection methods have the following main objectives: a) to avoid overfitting and im- prove algorithm accuracy; b) to provide fast and cost effective algorithm; and c) to gain a better understanding of the biological process underlying the data [Saeys et al., 2007].

6 1.1. COMPLEXITY OF HIGH-THROUGHPUT EXPRESSION DATA

Figure 1.1.: Overview of a typical gene expression analysis workflow.

7 CHAPTER 1. INTRODUCTION

I investigated feature selection methods in the context of sample classification, an interesting problem in high-throughput data analysis where an unknown sample is classified into one of the known classes according to their gene expression profiles [Diaz Uriarte and Alvarez de Andres,

2006, Statnikov et al., 2005]. For example, a patient can be classified into having either high risk or low risk of developing cancer by comparing his expression profile with the expression profiles of healthy individuals and cancer patients. The advantages of feature selection in the context of sample classification are exemplified in Figure 1.2. Here, a test sample with unknown class label is compared with a database of microarray samples to determine its correct class label. The microarray samples in the database are separated into either of the two classes:

‘cancer patients’ or ‘healthy individuals’. Similarity scores are computed between the test sample and the samples in the database. The test sample is assigned to the class of the database sample that returns the highest similarity score. In an alternative scenario, a feature selection algorithm is applied to the database, reducing its complexity and resulting in a summarized database. In the summarized database, features relevant to each class label are selected, resulting in smaller sets of genes representing the samples of each class. Using a smaller number of genes for computing the similarity score reduces the computational cost of determining the correct label of the test sample. Additionally, feature selection methods often provide better separation between the similarity scores computed for different class labels. With these two aims, I developed the relative signal-to-noise ratio (rSNR), a novel feature selection method that efficiently distinguishes between different phenotypes based on the expression profiles of the samples. The rSNR was compared with standard feature selection methods, and shown to perform significantly better.

1.2. Biological units underlying complex phenotypes

Complex phenotypes are often the result of the concerted action of several genes and/or proteins . These genes participate in or are associated with similar biological functions, processes,

8 1.2. BIOLOGICAL UNITS UNDERLYING COMPLEX PHENOTYPES

Figure 1.2.: Illustration of the advantages of feature selection methods in the sample classifica- tion problem.

9 CHAPTER 1. INTRODUCTION and/or pathways. For example, the interaction between the sonic hedgehog (Shh), retinoic acid

(RA), and fibroblast growth factor (Fgf) signaling pathways has been shown to influence early mouse caudal development [Ribes et al., 2009]. Also, the transforming growth factor-beta/bone morphogenic signaling pathway, which is fundamentally important during embryonic development and adult tissue homeostasis in metazoans, depends on interactions with other pathways, such as mitogen-activated protein kinase (MAPK), Wnt/Wg, Hedgehog (Hh), Notch, and others for proper functioning [Guo and Wang, 2009]. Dysregulation of the interactions between the biological functions, processes, and/or pathways has been associated with several phenotypic anomalies resulting in diseases. A comprehensive study of these interactions can provide new insights into the experimental data. After selecting the genes that are relevant to a given phenotype investigated with high-throughput expression experiments using one of the abovementioned feature selection methods, I analyzed the expression data to identify the biological units underlying complex phenotypes as well as interactions between these biological units.

Genes in high-throughput expression datasets are commonly clustered based on either of two criteria: functional annotation similarity or expression profile similarity. Both of these clustering strategies have their own advantages and disadvantages. Clustering based on functional anno- tation will cluster together genes with similar functional annotation. However, genes associated with a given functional annotation can be the target of other genes with the same functional annotation, and, hence, exhibit a time shift in their expression profiles [Cordell, 2009, Emily et al., 2009, Iossifov et al., 2008, Peri et al., 2003]. The functional annotation of a gene signi-

fies the biological functions it performs or biological processes and/or pathways it is associated with. Gene functional annotations can be obtained from databases such as Gene Ontology (GO,

[Botstein et al., 2000]) and the Kyoto Encyclopedia of Genes and Genomes (KEGG, [Kanehisa,

2000, Kanehisa et al., 2014]). On the other hand, genes exhibiting similar expression profiles are not always associated with similar functional annotations [Young, 2003]. As a result, clus- tering based on expression profile similarity can result in clusters of genes performing different

10 1.3. INTERACTION NETWORKS BASED ON ORTHOLOGY RELATIONSHIPS molecular function, or associated with different biological processes and/or pathways. Hence, a clustering method based on both criteria could be more advantageous than clustering based on only one of the criteria.

With this aim in view, I developed a pipeline to identify biological units consisting of clusters of genes sharing similar functional annotation as well as exhibiting similar expression profiles. More precisely, genes were first clustered based on their functional annotation similarities. Then, the genes in the resulting clusters were further separated based on expression profiles using fuzzy clustering, resulting in sub-clusters of genes. Finally, functional annotation enrichment was performed on the sub-clusters to identify the associated biological functions, processes, and/or pathways.

To study the interactions between biological functions, processes, and/or pathways associated with the sub-clusters, an interaction network of sub-clusters was created using a co-expression- based network inference method. Each edge in the network was associated with the absolute

Pearson correlation coefficient between the expression profiles of the sub-clusters associated with the edge. I showed that edges with higher absolute Pearson correlation coefficients are more likely to signify real biological interactions. Further, the network comprising the 10% of the edges with the highest absolute Pearson correlation coefficients was analyzed to identify interrelations between sub-clusters and associated biological functions, processes, and/or pathways.

1.3. Interaction networks based on orthology relationships

Large-scale biological interaction networks can greatly contribute to our understanding of the cellular machinery. Unfortunately, the available interaction networks are far from perfect, both in terms of accuracy and coverage. Considering high false positive rates, it has been suggested that the yeast interactomes are roughly 50% complete. In the case of human, the entire interactome coverage is merely 10% [Hart et al., 2006, Tyagi et al., 2012]. In recent years, many computational methods have emerged to predict protein-protein interactions [Marbach

11 et al., 2012, Shoemaker and Panchenko, 2007, Siegenthaler and Gunawan, 2014, Valencia and

Pazos, 2002]. Some of these computational methods are based on genomic data [Marcotte, 1999,

Ouzounis et al., 1999], protein sequences [Gomez et al., 2003] or phylogenetic profiles [Pellegrini et al., 1999].

As part of my doctoral research, I contributed to a project where evidences for regulatory interactions were transferred across species. The idea, which had already been implemented in a different context [Dutkowski and Tiuryn, 2009, Matthews, 2001, Rhodes et al., 2005], consists of predicting an interaction between a pair of proteins or genes a and b if there exists a known interaction between a0 and b0 in another species, where a and b are orthologs of a0 and b0 respectively. The method assumes that regulatory interactions are conserved between organisms.

In particular, we mapped the interactions from a literature-curated mouse pluripotency network

(the PluriNetWork, [Som et al., 2010]) to human, the species of maximum interest. This resulted in an interaction network for human for which experimental data are much more limited. The predicted networks can be used to analyze expression data in the context of interaction networks

[Warsow et al., 2010].

These three problems were the focus of my doctoral research. I was able to solve specific problems in high-throughput expression data analysis in general and microarray expression data in particular. The analysis of interactions between biological units underlying complex phenotype provides a new perspective in the analysis of high-throughput expression data.

2. Results

2.1. Reducing data complexity of expression data for the sample

classification problem

2.1.1. Feature selection methods

Feature selection is an important step in high-throughput expression data analysis. Apart from reducing the computational complexity (in both time and space), it allows to focus on relevant genes in the context of the problem under investigation. Here, I explored multiple feature selection methods to select relevant genes for the microarray sample classification problem. In the same context, Hafemeister et al. had previously shown that one-nearest neighbor classification significantly outperforms other classification algorithms. Hence, I chose a one-nearest neighbor classifier as classification algorithm. I explored two fundamentally different approaches to feature selection:

1. I assumed that genes/proteins interacting with other genes/proteins are biologically im-

portant. STRING [Jensen et al., 2009], a database of known and predicted protein-protein

(and protein-gene) interactions, was used as a source of information for interactions. I used

these biologically important genes in two different ways. First, I used them as gene dataset,

consisting of expression profiles of selected genes. Second, I used the selected genes to cre-

ate a link dataset, where a link signifies interactions between two genes and the expression

profile of the link was derived from the expression profiles of the two associated genes.

13 CHAPTER 2. RESULTS

2. I developed the relative signal-to-noise ratio (rSNR), a feature selection method which

depends entirely on the expression profiles of genes.

Additionally, as a control, random subsets of genes were selected as input to the classification algorithm. In brief, I developed the following feature selection methods:

1. STRING-based gene selection: Each interaction in STRING is associated with a

confidence score, which signifies the probability that the interaction exists. All genes

associated with at least one interaction with confidence score higher than a given cut-off

threshold were selected. The resulting subset of genes is further referred to as ‘linked gene

dataset’.

2. STRING-based link selection: For each pair of genes in a ‘linked gene dataset’, a ‘link’

was added to the corresponding ‘link dataset’ if there exists an interaction in STRING

between the respective pair of genes. Thus, only those genes that exists in the ‘linked gene

dataset’ were used to create a ‘link dataset’. The expression profile of the link was defined

as mean of the expression profiles of the two genes associated with the link.

3. relative signal-to-noise ratio: Given a microarray dataset comprising several experi-

mental conditions, the rSNR calculates a gene rank for each experimental condition by

comparing the variation in the expression pattern of the genes in the experimental condi-

tion of interest with their variation in the remaining experimental conditions of the dataset.

In a cross-validation framework, the rSNR generates a gene ranking for each experimental

condition of the training dataset. Then, the test samples are systematically compared with

each experimental condition of the training dataset, using only those genes that ranked

higher for that experimental condition in the previous step. The rSNR computes the gene

ranking for an experimental condition by defining a positive and a negative set in the

training dataset. The positive set corresponds to the experimental condition for which

gene ranking is being computed. The negative set comprises all experimental conditions

in the dataset not involving any of the experimental factors defining the positive set. The

14 2.1. REDUCING DATA COMPLEXITY OF EXPRESSION DATA FOR THE SAMPLE CLASSIFICATION PROBLEM

experimental conditions in the dataset can be categorized based on experimental factors,

e.g., tissue type.

The definition of the positive and the negative sets can be exemplified for the AtGen-

Express dataset. AtGenExpress consists of microarray data describing stress response in

Arabidopsis thaliana [Kilian et al., 2007]. The dataset can be divided based on two ex-

perimental factors, tissue type and stress treatment. Let us assume we are interested in

calculating the rSNR gene ranking for the experimental condition involving the experi-

mental factors cold (stress treatment) and root (tissue type), ‘cold-root’. In this case,

the positive set is the experimental condition of interest, i.e. ‘cold-root’. On the other

hand, the negative set comprises all the experimental conditions that are not associated

with either experimental factor ‘cold’ or ‘root’. The remaining experimental conditions

are excluded from the rSNR calculation for ‘cold-root’. The definition of the positive and

the negative sets for the experimental condition ‘cold-root’ is illustrated in Figure 2.1.

After the positive and the negative sets are defined, the signal-to-noise ratio is calculated

for both the positive (SNR+) and the negative (SNR−) set. For experimental condition

k, the SNR+ for gene g is given by:

µ+ + k,g SNRk,g = + σk,g

+ + where, µk,g and σk,g are the mean and the standard deviation respectively of the expression values for gene g in the positive set. The corresponding SNR− is given by:

N P µi,g i=1 − N SNR = v u N u P 2 u (µi,g − µ¯g) t i=1 N

15 CHAPTER 2. RESULTS

Figure 2.1.: Example showing the division of training dataset into positive and negative sets for the calculation of the rSNR for the experimental condition ‘cold-root’ in the AtGenExpress dataset.

where, µi,g is the mean expression value of gene g under experimental condition i,µ ¯g is the mean expression value of gene g across all experimental conditions in the negative set.

i ∈ {1,...,N } is the set of experimental conditions in the negative set and N is the total

number of experimental conditions in the negative set.

Finally, the rSNR is defined for experimental condition k and gene g as:

+ SNRk,g rSNR = − SNRk,g

The rSNR ranking is generated for each experimental condition in the training dataset

16 2.1. REDUCING DATA COMPLEXITY OF EXPRESSION DATA FOR THE SAMPLE CLASSIFICATION PROBLEM

which is used to select fewer genes for that experimental condition.

4. Random selection: For each cut-off threshold on the STRING confidence scores, the

number of genes in the corresponding ‘linked gene dataset’ were counted. Further, the

same number of genes were randomly selected to create a control dataset. Twenty-five

random dataset were created for each cut-off threshold.

2.1.2. Comparison of feature selection methods

All the four feature selection methods that I developed were used to select smaller subsets of relevant genes from the original dataset. The number of genes for each cut-off threshold, based on STRING confidence score, were determined by counting the number of genes in the ‘linked gene dataset’. These smaller subsets of relevant genes were used as input for the classification algorithm. At different cut-off thresholds, the accuracies of the classification algorithm using different feature selection methods were compared. The comparisons were performed on two time-series microarray datasets: 1) AtGenExpress (described in rSNR section above). 2) EDGE, a microarray expression dataset describing mouse response to various toxins at different dosage levels [Hayes et al., 2005]. The rSNR performed significantly better than the two STRING-based feature selection methods, as well as randomly selected genes on both the datasets (See Figure

2.2). The accuracy of the classification algorithm deteriorates significantly for the last cut-off chosen for the rSNR together with other feature selection methods. This could be due to the reason that very few genes were left for the classification algorithm to predict the class of the test data.

Further analysis were performed on the rSNR-generated gene ranking on the AtGenExpress dataset, demonstrating its effectiveness in selecting important genes. The rSNR was also com- pared with two standard feature selection methods designed with similar purpose, namely the signal-to-noise ratio (SNR) and Welch’s t-test on the AtGenExpress dataset. The SNR has been extensively used as a feature selection method while Welch’s t-test is an adaption of Student’s t-test to be used on two samples with unequal variance. The Signal-to-noise ratio is defined as:

17 CHAPTER 2. RESULTS

Figure 2.2.: Classification accuracy using feature selection methods on time-course expression data. A) Accuracy calculated on the AtGenExpress dataset. B) Accuracy calculated on the EDGE dataset. Error bars indicate 1 standard deviation away from the mean.

18 2.2. IDENTIFYING BIOLOGICAL UNITS UNDERLYING COMPLEX PHENOTYPES

µ − µ SNR = 1 2 . σ1 + σ2

Welch’s t-score is defined as:

µ − µ Welch’s t-score = 1 2 , q σ2 σ2 1 + 2 n1 n2 where, µ1 and µ2 are mean for class 1 and class 2 respectively while σ1 and σ2 are standard deviation for class 1 with sample size n1 and class 2 with sample size n2 respectively. The rSNR performed significantly better compared to both standard feature selection meth- ods (see Figure 2.3). Additionally, the usability of the rSNR in improving accuracy of the classification algorithm was demonstrated on a static microarray dataset [Engreitz et al., 2010].

2.2. Identifying biological units underlying complex phenotypes

Complex phenotypes are the result of the interactions between groups of genes performing similar functions or participating in similar biological processes and/or pathways, which consti- tute basic biological units. I developed a pipeline to identify and characterize such biological units. The input to the pipeline is a gene expression dataset consisting of relevant genes selected by feature selection methods. First, the genes are clustered based on functional annotations and expression profiles resulting in clusters and sub-clusters of genes. The sub-clusters are used to create a co-expression based network, part of which is used for the validation and the analysis of the network (see Figure 2.4). The pipeline is described in detail in the following sections.

2.2.1. Clustering genes sharing functional annotation and expression profile similarity

The first step of the pipeline identifying biological units underlying complex phenotypes is clustering the genes that are relevant to the phenotype under consideration, selected using

19 CHAPTER 2. RESULTS

Figure 2.3.: Cross-validation result of the comparison between rSNR and standard feature selec- tion methods on the AtGenExpress dataset. Error bars indicate 1 standard deviation away from the mean. some of the abovementioned feature selection methods. Clusters of genes sharing similar func- tional annotations and exhibiting similar expression profiles have been shown to share regula- tory mechanisms [Allocco et al., 2004, Altman and Raychaudhuri, 2001, Nguyen et al., 2011,

Schulze and Downward, 2001]. Therefore, I chose to cluster genes based on both the prop- erties, i.e., functional annotation and expression profile. First, genes were clustered based on their functional annotation similarity using ‘functional annotation clustering’ method of DAVID

(http://david.abcc.ncifcrf.gov/, [Huang et al., 2009b,a]). The whole mouse genome was used as background to identify significantly over- or under- represented functional categories among the genes in each cluster. Only the functional annotation categories ‘GOTERM BP FAT’,

‘GOTERM MF FAT’, and ‘KEGG’ were considered for the enrichment analysis. Next, genes

20 2.2. IDENTIFYING BIOLOGICAL UNITS UNDERLYING COMPLEX PHENOTYPES

Figure 2.4.: A graphical overview of the pipeline to identify biological units underlying complex phenotype.

21 CHAPTER 2. RESULTS in each cluster were further divided into sub-clusters based on their expression profiles. The R package ‘Mfuzz’ [Kumar and Futschik, 2007], an implementation of the fuzzy c-means algorithm, was used for this purpose. Unlike hard clustering, where each element can be part of only one cluster, fuzzy clustering calculates the probability of an element to be part of a cluster. Hence, fuzzy clustering captures the ambiguity in the data, which is important in the context of noisy data such as microarray. Each of these sub-clusters of genes sharing functional annotation and expression profile similarity were then assigned an expression profile. The expression profile for each sub-cluster is its centroid, calculated by the fuzzy c-mean clustering algorithm. The centroid of a sub-cluster is defined as the mean of its members, weighted by their membership values. The centroid of a sub-cluster Ck is given by:

N X m xiwik Y = i=1 , k N X m wik i=1 where wik indicates the membership of element xi to sub-cluster Ck. m ∈ R is a parameter controlling the ‘fuzziness’. For m = 1, genes are assigned to exactly one cluster, while for m → ∞ genes are uniformly assigned to all clusters. xi is a vector representing the expression profile of gene i. Finally, functional enrichment was performed on the genes of the sub-clusters using the R package ‘RDAVIDWebService’ [Fresno and Fernandez, 2013], to identify molecular functions, biological processes, and/or pathways associated with the sub-clusters.

2.2.2. Network inference and validation

To analyze the interactions between the biological functions, processes, and/or pathways as- sociated with the sub-clusters, I created a network describing interactions between sub-clusters.

Specifically, the expression profiles of the sub-clusters were used to infer a co-expression net- work. In the network, nodes represent biological units represented by sub-clusters. Each edge connecting two nodes was associated with the absolute Pearson correlation coefficient between

22 2.2. IDENTIFYING BIOLOGICAL UNITS UNDERLYING COMPLEX PHENOTYPES the expression profiles of the corresponding sub-clusters. Self-edges were removed from the network. Edges with higher absolute Pearson correlation coefficients are more likely to consti- tute biologically significant interactions [Aoki et al., 2007, D’haeseleer et al., 2000, Eisen et al.,

1998, Ihmels et al., 2004]. Therefore, I hypothesize that a biologically significant network can be created from the above mentioned network by selecting edges with higher absolute Pearson correlation coefficient score than a given threshold. Two networks were created, a ‘top network’, using the 10% of the edges with highest absolute Pearson correlation coefficients, and a ‘bottom network’, using the 10% of the edges with lowest absolute Pearson correlation coefficients. To verify the assumption that the edges associated with higher absolute Pearson correlation coef-

ficients are more likely to signify real biological interactions, the biological relevance of the top network was compared with that of the bottom network using the following three criteria:

1. Number of transcriptional regulators shared between genes in sub-clusters

connected by an edge: Common transcriptional regulators of the genes in the sub-

clusters associated with the edges of the top network were compared with those of the

bottom network. Basically, motifs in the promoter regions of the genes in each sub-cluster

were identified using MEME [Bailey et al., 2006, Bailey and Elkan, 1994] and TOMTOM

[Gupta et al., 2007]. For each pair of sub-clusters connected by an edge, the motifs that

were identified in the corresponding promoter regions were compared and common motifs

were statistically assessed using Fisher’s exact test, with the entire database of known mo-

tifs as background. Finally, P-values obtained for pairs of sub-clusters in the top network

were compared with those computed for pairs of sub-clusters in the bottom network using

the Wilcoxon Rank-Sum test. Assuming the edges in the top network are more likely to

represent real biological interactions, the P-values for the edges in the top network should

be significantly smaller than the P-values for the edges in the bottom network.

2. Number of pairwise interactions between genes in sub-clusters connected by

an edge: For each edge in the network, lists of pairwise interactions between the genes of

23 CHAPTER 2. RESULTS

the associated sub-clusters were computed. Next, the number of interactions in this list

present in the STRING database (version 9.1) were calculated. This number was then com-

pared to the total number of interactions in the abovementioned list, giving the STRING

connectivity score for that edge. Mathematically, the STRING connectivity score between

1 2 3 N1 1 2 3 N2 cluster C1 = {G1,G1,G1,...,G1 } with N1 genes and cluster C2 = {G2,G2,G2,...,G2 }

with N2 genes is given as:

 j j N1 N2  i i X X 1, if{(G1,G2) ∈ STRING} ∧ G1 6= G2  i=1 j=1 0, otherwise STRING connectivity (C1,C2) = . N1 · N2

STRING connectivity scores for pairs of sub-clusters in the top network were compared

with those computed for pairs of sub-clusters in the bottom network using the Wilcoxon

Rank-Sum test. Assuming that the edges in the top network are biologically more relevant

than those in the bottom network, the STRING connectivity scores of the edges in the

top network are expected to be significantly higher than those of the bottom network.

3. GO annotation similarity between each pair of sub-clusters connected by an

edge: For calculating GO annotation similarity, the overlapping genes between the two

associated sub-clusters were first excluded. Then, the functional distance between the

genes in the two associated sub-clusters were computed using the approach described in

[Ruths et al., 2009]. Basically, the method computes pairwise similarity of genes based

on an estimate of their distance to the closest common ancestor in the GO annotation

tree, similar to Wang et al. [Wang et al., 2007]. An graphical description of the GO

annotation tree for the GO term ‘GO:2001240’, associated with “negative regulation of

extrinsic apoptotic signaling pathway”, is presented in figure 2.5. The GO annotation

similarity for an edge is defined as average similarity between all pairs of genes of the

24 2.3. CREATING INTERACTION NETWORKS BASED ON ORTHOLOGY RELATIONSHIPS

two sub-clusters connected by the edge. Assuming that biologically-relevant interactions

are more enriched in the top network than in the bottom network, the GO annotation

similarity for the edges in the top network is expected to be significantly higher than that

for the edges in the bottom network.

The biological relevance of the top network as compared to the bottom network was demon- strated on two microarray expression datasets. The first dataset describes the differentiation of mouse embryonic stem cells into embryoid bodies [Hailesellasse Sene et al., 2007], while the second dataset describes the comparison between the developing embryonic liver and the regen- erating adult liver in mice [Otu et al., 2007]. The top network performed significantly better than the bottom network on all the three criteria evaluating biological relevance (See Figure

2.6). After the validation of the network, the top network, containing 10% of the edges with highest absolute Pearson correlation coefficient, was used for further analysis. The pipeline was able to identify biological units specific to the dataset and problem being investigated and the analysis of top networks revealed interesting relationships between biological units, few of which has already been mentioned in various studies.

2.3. Creating interaction networks based on orthology relationships

An interaction network is needed to study interactions between proteins. However, informa- tion on protein-protein and protein-gene interactions is sparse and the coverage is poor. Due to research interest, most interaction information is available for model organisms like Arabidopsis or mouse, but little is known about non-model organisms, including humans. However, inter- action data on model organisms can be used to predict interactions in non-model organisms.

In this context, I contributed to a project where a literature curated interaction network of a model organism, namely mouse, was used to predict interactions in human. Here, we used interolog mapping, and assumed that if a pair of proteins/genes that is known to interact in a given model organism interact, then their orthologs are likely to interact in the target organism.

25 CHAPTER 2. RESULTS

Figure 2.5.: A GO semantic tree for the GO term ‘GO:2001240’, “negative regulation of extrinsic apoptotic”. Source: QuickGO – http://www.ebi.ac.uk/QuickGO.

26 2.3. CREATING INTERACTION NETWORKS BASED ON ORTHOLOGY RELATIONSHIPS

Figure 2.6.: Comparison of the top and bottom networks describing the differentiation of mESCs into embryoid bodies. P-values were computed using the Wilcoxon Rank-Sum test. A) Average significance (log10 P-value) of observing common transcriptional reg- ulators binding to the promoters of genes in sub-clusters connected by edges. B) Average STRING connectivity score between sub-clusters connected by edges. C) Average GO annotation similarity score between sub-clusters connected by edges.

As a source network to predict interactions and construct a regulatory network in humans, a literature curated mouse network describing pluripotency [Som et al., 2010] was used.

The interolog-based method predicts many false-positive interactions in the target organism.

The main reason for the false positives is the lack of evolutionary conservation of interactions in distantly related organisms. In such cases, the pair of proteins interacts in the source organism but not in the target organism. Four separate filtering methods were used to minimize the false positives: 1) phylogenetic profiling, 2) GO semantic similarity, 3) gene co-expression, and

4) RNAi data. The performance of all filtering methods were compared, and the RNAi-based method was shown to perform best.

Finally, an interaction network for human was created from existing mouse pluripotency network. The predicted interactions were filtered using RNAi data to minimize false positive interactions in the new network.

27

3. Discussion and Outlook

As part of my doctoral research, I investigated three problems encountered in the workflow of high-throughput expression data analysis. I was the main contributor to the corresponding former two projects, and contributed to the third project.

The first project focused on the challenges in reducing data complexity, an important step in the workflow of high-throughput expression data analysis. I explored several feature selec- tion methods to reduce the data complexity and improve the accuracy of sample classification algorithms. First, I used interaction information to select biologically relevant genes. The per- formance of such feature selection methods was comparable to that of random gene selection.

However, alternative sources of interaction information, such as databases describing interactions in specific biological processes, should be advantageous. Ideally, one would like to use process- specific databases, e.g., a database of interactions involving only stress response in Arabidopsis should help in selecting genes relevant to stress response in the Arabidopsis. On the other hand, the rSNR based ranking, derived from expression profiles of genes, performed significantly better than other feature selection methods, either developed or standard. The rSNR-based ranking can be used to summarize the database of microarray samples, where each sample will be rep- resented by fewer important genes than the original gene list. This can be advantageous in a scenario where test data are compared with the database to determine the best match of the test data. A summarized database will lead to fewer genes being used for computing the similarity scores and, hence, less computational cost incurred in such database searches.

The second project was centered on the identification of the biological units underlying com-

29 plex phenotypes using high-throughput expression data and biological annotation. More pre- cisely, the input to the pipeline I developed is the expression profiles of a set of relevant genes, selected with the help of above mentioned feature selection methods. Genes were first clustered based on their functional annotation similarity, resulting in clusters of genes. The genes in these clusters were further clustered based on expression profile similarity. Clustering approaches using both criteria, i.e., expression profiles and biological annotation similarity, have advantages over clustering using only one criterion. Genes clustering based only on biological annotation fails to utilize information from the high-throughput expression data, and, hence, is not specific to the condition under study. In addition to that, it also disregards the fact that genes associated with similar biological functions, processes and/or pathways may exhibit different expression profiles. In such case, it is difficult to derive a representative expression profile for the cluster.

On the other hand, clustering approaches solely based on expression profile similarity present different problems. For example, genes associated with different biological functions, processes, and/or pathways may exhibit similar expression profiles and can be clustered together. This, in turn, poses a problem for standard functional analysis, such as Gene Set Enrichment analysis.

In the future, I plan to improve the developed pipeline by using alternative approaches, such as using a weight function to cluster the genes based on biological annotation and expression profile similarity simultaneously. Also, instead of constructing a co-expression-based network to study the interactions between the identified biological units, an alternative network inference method or an existing regulatory network can be used.

Development of new sophisticated methods have generated vast amounts of data. Technology continues to advance, and even larger datasets are expected to be generated in the future.

Analyzing these data is a daunting challenge, but promises to reveal in-depth information about the biological systems at the molecular level. With my doctoral research work, I have contributed to this new, exciting field, improving the workflow of high-throughput expression data analysis at key steps, as well as providing new insights into real biological problems.

Part II.

Publications

31

4. Contributions to the publications

1. Identifying Genes Relevant to Specific Biological Conditions in Time Course Microarray Experiments

Singh N K, Repsilber D, and Liebscher V and Taher L, and Fuellen G (2013). PLoS One

8, e76561.

Contribution Statement

Nitesh Kumar Singh accquired and analyzed three microarray datasets. He developed

different feature selection methods, particularly the rSNR method for the sample classi-

fication problem. He further analyzed the rSNR method on multiple datasets. He had a

major contribution in writing the manuscript and preparing the figures.

2. Elucidating complex phenotypes based on high-throughput

expression and biological annotation data

Singh N K, Ernst M, Liebscher V, Fuellen G,and Taher L (2014). PLoS One (Submitted).

Contribution Statement

Nitesh Kumar Singh processed and analyzed two time-series microarray datasets. He de-

33 CHAPTER 4. CONTRIBUTIONS TO THE PUBLICATIONS

veloped a pipeline to identify biological units underlying complex phenotypes. He used

functional annotation clustering and fuzzy clustering, performed enrichment analysis on

the clusters of genes and created co-expression based networks of the clusters. He validated

these networks using three criteria. He had a major contribution in writing the manuscript

and creating figures and tables.

3. Derivation of an interaction/regulation network describing

pluripotency in human

Som A, Lustrek S, Singh N K, and Fuellen G (2012). Gene 502 99-107.

Contribution Statement

Nitesh Kumar Singh helped in preprocessing of data. He provided scripts for transfer-

ring interaction information from network of one species to another species. He was also

involved in critically analyzing the manuscript.

34 5 Article

Identifying Genes Relevant to Specific Biological Conditions in Time Course Microarray Experiments

Nitesh Kumar Singh1, Dirk Repsilber2, Volkmar Liebscher3, Leila Taher1*, Georg Fuellen1* 1 Institute for Biostatistics and Informatics in Medicine and Ageing Research, Department of Medicine, University of Rostock, Rostock, Germany, 2 Institute for Genetics and Biometry, Leibniz Institute for Farm Animal Biology, Dummerstorf, Germany, 3 Institute for Mathematics and Informatics, Ernst Moritz Arndt University of Greifswald, Greifswald, Germany

Abstract Microarrays have been useful in understanding various biological processes by allowing the simultaneous study of the expression of thousands of genes. However, the analysis of microarray data is a challenging task. One of the key problems in microarray analysis is the classification of unknown expression profiles. Specifically, the often large number of non- informative genes on the microarray adversely affects the performance and efficiency of classification algorithms. Furthermore, the skewed ratio of sample to variable poses a risk of overfitting. Thus, in this context, feature selection methods become crucial to select relevant genes and, hence, improve classification accuracy. In this study, we investigated feature selection methods based on gene expression profiles and protein interactions. We found that in our setup, the addition of protein interaction information did not contribute to any significant improvement of the classification results. Furthermore, we developed a novel feature selection method that relies exclusively on observed gene expression changes in microarray experiments, which we call ‘‘relative Signal-to-Noise ratio’’ (rSNR). More precisely, the rSNR ranks genes based on their specificity to an experimental condition, by comparing intrinsic variation, i.e. variation in gene expression within an experimental condition, with extrinsic variation, i.e. variation in gene expression across experimental conditions. Genes with low variation within an experimental condition of interest and high variation across experimental conditions are ranked higher, and help in improving classification accuracy. We compared different feature selection methods on two time-series microarray datasets and one static microarray dataset. We found that the rSNR performed generally better than the other methods.

Citation: Singh NK, Repsilber D, Liebscher V, Taher L, Fuellen G (2013) Identifying Genes Relevant to Specific Biological Conditions in Time Course Microarray Experiments. PLoS ONE 8(10): e76561. doi:10.1371/journal.pone.0076561 Editor: Andrew R. Dalby, University of Westminster, United Kingdom Received June 14, 2013; Accepted August 28, 2013; Published October 11, 2013 Copyright: ß 2013 Singh et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Funding: LT and GF are funded by the German Research Foundation (DFG, FUE 583/2-2). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing Interests: The authors have declared that no competing interests exist. * E-mail: [email protected] (LT); [email protected] (GF)

Introduction number of non-informative variables involved: a regular micro- array dataset comprises from 6000 to 60,000 genes [3]. First, as for DNA microarrays can be classified into static experiments, any large-scale dataset, classification algorithms require substantial where a snapshot of gene expression in different samples is computational resources. In this regard, the current affordability of measured, and time series experiments, where a temporal process massive computer power and recent advent of cloud computing is measured over a period. While static experiments may reveal have opened new possibilities. In particular, web-based work- genes that are expressed under specific conditions, time series benches such as Galaxy [4], standalone, comprehensive collections experiments may help in determining the temporal profiles of the of data analysis and integration tools such as Chipster [5], and genes expressed under a specific condition, as well as interactions large, active communities devoted to open source and open between them [1]. development projects such as the Bioconductor [6] have made An interesting problem in microarray analysis is the classifica- data-intensive biology available to virtually all scientists. Second, tion of unknown expression profiles with the goal of assigning and more importantly, the performance of most classification them to one or many predefined classes. Such classes represent algorithms is affected by the relatively low signal-to-noise ratio of various phenotypes, for example, diseases. Moreover, classifying such datasets. Furthermore, because often only a few dozens of microarray data by cross-comparing microarray data from samples are available, most algorithms face the risk of overfitting different laboratories and phenotypes could be helpful not only [7]. Reducing the number of genes using feature selection methods to identify unknown samples, but to reveal obscure associations not only results in a more efficient management of the between complex phenotypes, such as shared pathogenic pathways computational resources and a lower the risk of overfitting, but among different diseases. Such approaches have been made more also enables a better biological understanding of the data. feasible in recent years with the availability of large database Many studies have shown that integrating microarray data with repositories of high throughput gene expression data, such as the additional biological information improves classification accuracy. Gene Expression Omnibus (GEO) [2]. However, classifying For example, Bar-Joseph et al. discuss how protein-DNA binding microarray data is a challenging task, mainly because of the large data and protein interaction data can be used to constrain the

PLOS ONE | www.plosone.org 1 October 2013 | Volume 8 | Issue 10 | e76561 CHAPTER 5.1 PUBLICATION #1

Relevant Genes in Microarrays Experiments number of hypotheses that can explain a specific expression Here, we show that identifying biologically relevant features pattern [1]. More generally, protein interactions and their substantially improves microarray classification. First, we explore dynamics are considered helpful, and even essential for under- the addition of protein interaction information as a means to select standing biological processes [8]. For instance, de la Fuente argues features specific to particular experimental conditions and improve for the importance of integrating gene expression data with microarray classification. We found that, in the form presented network information to identify dysfunctional regulatory networks here, the addition of protein interaction information resulted in no in disease [9], and, consistently, several studies suggest that improvement in classification. Second, we introduce a novel pathway dysregulation is a stronger biomarker for cancer feature selection method based on the SNR, which we call the compared to the dysregulation of individual genes [10–12]. Also, ‘‘relative signal-to-noise ratio’’ (rSNR). Given a microarray dataset Ma et al. showed that an approach to identifying genes associated comprising various experimental conditions, the rSNR is a feature with a given phenotype that combines expression and protein ranking method that ranks genes based on their specificity to a interaction data outperforms other approaches that use either gene given experimental condition, by comparing variation in gene expression or protein interaction [13]. Similarly, Wu et al. used expression within that particular experimental condition, with both gene expression and protein interaction data to prioritize variation in gene expression across other experimental conditions. potential cancer-related genes for further investigation, with Basically, the rSNR can be expressed as a quotient of SNRs or encouraging results [14]. Additionally, gene combinations have CVs, and in practice, gives higher rank to genes with high been shown to be more effective than individual genes in expression values and low standard deviations in the experimental classifying cancer versus healthy samples [15], and pluripotent condition of interest. We tested this and other feature selection versus non-pluripotent cells [16]. Consequently, although the exact methods on two time-series microarray datasets and one static contribution of protein interaction information is difficult to assess microarray dataset. The rSNR method performed generally better [17], such information is potentially useful to identify features with than other feature selection methods, and its application substan- biological relevance to specific experimental conditions, and tially improved classification accuracy. Our results also suggest improve microarray classification. that the rSNR rank could be used to reduce the number of genes Signal-to-noise (SNR) ratios have been extensively used in representing a microarray experiment in a database, hence, various fields. In image processing, the SNR is defined as the mean making searches across the entire database more efficient. of the variable being measured divided by its standard deviation [18]: Methods

m Classifying Gene Expression Time-course Data Using SNR~ : s Correlation Tests To classify time course microarray experiments we adopted a nearest neighbor approach based on Pearson correlation. Training In this case, the standard deviation represents noise and other and test data consisted of gene expression profiles from two or interference in comparison to the mean. The reciprocal of the more different time points. When considering test datasets SNR is known as coefficient of variation (CV), which has been comprising more than two time points, we first split the data into widely applied as quality control and validation method for the test subsets consisting of pairs of consecutive time points. We then analysis of microarray assays, see, e.g., [19–21]. For instance, evaluated the classification on each of these subsets, and decided Raman et al. used the CV to investigate differences in the the final classification of the entire test data by majority voting. variability of expression levels with regards to quality control, and With the goal of improving classification performance we found that the CV was greater for the microarrays that failed the examined different data features. First, we evaluated different quality control inspection compared to those that did not [21]. manners in which gene expression profiles from two different time The SNR has also been proposed and successfully employed as a points can be combined into what we call transition profiles. Then, feature selection method for classification problems. For the sake we incorporated protein interaction information from the of clarity, we will refer to the Signal-to-Noise ratio used for STRING database into the definition of such transition profiles. classification problems as SNR . In the context of feature selection Finally, we compared different feature selection approaches to for microarray classification, the SNR measures the effectiveness extract the genes that are the most relevant to specific biological of a feature in discriminating between two classes and is defined as conditions. [22–26]: Notation m {m SNR~ 1 2 , In the following, we adopt notation by Hafemeister et al. [30]. s1zs2 Let where, m1 and m2 are the mean expression value for class 1 and N i [ fg1,:::,N represent the experimental conditions/queries, class 2 respectively while s and s are the standard deviation for 1 2 N g [ fg1,:::,G be the running index of the genes, class 1 and class 2 respectively. The SNR has been extensively used with some modifications N t [ fg1,:::,T denote the time points, or in combination with other methods. For example, Lakshmi and N Oi,g,t [ : expression value of gene g of experiment i at time t, T Mukherjee used a ‘‘maximized’’ SNR that identifies features with N Oi,g,: [ : expression time course of gene g of experiment i, G the aim of increasing the distance between the category profiles N Oi : [ : expression value of all genes at time t of , ,t [27]; Mishra and Sahu used SNR in combination with various experiment i. This vector of expression values is also called clustering methods and classifiers [28]; and Goh et al. applied the expression profile at time t of experiment i, multiple passes of feature selection using SNR and Pearson | N O [ N T : collection of all expression values of gene g for correlation coefficient [29]. :,g,: all experiments i across all time points t,

PLOS ONE | www.plosone.org 2 October 2013 | Volume 8 | Issue 10 | e76561 5 Article

Relevant Genes in Microarrays Experiments

G|T N Oi,:,: [ : collection of all expression values for experiment STRING is associated with a confidence score, which is associated i for all genes g across all time points t. with the probability that the interaction exists.

Gene Expression and Transition Profiles Datasets As test/training data we used pairs of expression profiles from We trained and tested our method on the two time-series two time points of the same experimental condition. We converted microarray datasets used by Hafemeister et al. [30], consisting of each of these pairs of expression profiles into a vector of expression expression data describing Arabidopsis thaliana stress response values to which we refer as transition profile. We used two different (‘‘AtGenExpress’’) [31] and the EDGE toxicology database kinds of transition profiles: (‘‘EDGE’’) [32]. We further tested our method on a static microarray dataset used by Engreitz et al. [33]. A brief description N Differential transition profile (DTP). The DTP mea- of each of the three datasets follows. sures the change in expression value for all genes between two AtGenExpress. We downloaded this dataset from the TAIR time points tx and ty. The DTP of an experimental condition i database [http://www.arabidopsis.org/, [31]]. It consists of for a pair of time points tx and ty is calculated as: microarray expression data from Arabidopsis thaliana when exposed to various stress treatments. The dataset has 232 samples comprising 9 stress treatments and 2 tissue types (root and shoot) at 8 time points (T = 8), with 2 replicates at each time point. In DTP ~O {O : i,:,ðÞtx,ty i,:,tx i,:,ty total, there are 18 unique combinations of stress treatments and tissue types to which we will refer as experimental conditions where, Oi,:,tx and Oi,:,ty are vectors containing the expression (N = 18). We will refer to the variables in the experiment, i.e., stress values of all genes under experimental condition i at time points tx treatment and tissue type, as ‘‘experimental factors’’. The original and ty respectively. dataset contained expression values for 22810 genes. Hafemeister et al. reduced it to 2074 genes by applying a 2-fold-change filter N Mean transition profile (MTP). The mean transition [30]. All our analyses are based on the reduced dataset of 2074 profile is the mean expression value for all genes between two genes (G = 2074). time points. The MTP of an experimental condition i for a pair EDGE. We obtained this dataset directly from its original of time points tx and ty is calculated as follows: author [32]. It contains gene expression values from mice when treated with various toxins at different dosage levels. The dataset has 216 samples comprising 7 toxin treatments at various dosage levels at up to 12 time points, ranging from 2 to 192 hours Oi,:,t zOi,:,t MTP ~ x y : (T = 12), with replicates ranging from 1 to 40 at different time i,:,ðÞtx,ty 2 points. In total, there are 11 unique combinations of toxin treatments and dosage levels, to which we will refer as where, Oi,:,tx and Oi,:,ty are defined as for the DTP. experimental conditions. We will refer to the variables in the To obtain a transition profile, two expression profiles are experiment, i.e., toxin treatment and dosage level, as ‘‘experimen- combined into a single vector of expression values. Hence, we also tal factors’’. The dataset contains expression values for 1600 genes compared transition profiles with single gene expression profiles: (G = 1600). Since our validation framework requires samples from at least 4 time points, we discarded experimental conditions with N Time point expression profile. Here, we based the fewer than 4 time points. This resulted in a reduced EDGE dataset classification on individual gene expression profiles. To make with 6 unique experimental conditions (N = 6). All our analyses are the comparison with the classification based on pairs of gene based on this reduced dataset. expression profiles fair, the individual gene expression profiles Engreitz dataset. We obtained this dataset from Jesse M. were taken from a set of two profiles (see subsection below). Engreitz [33]. It is a collection of 32 disease-associated microarray We evaluated the performance of our method using the above- experiments. These microarray experiments compared normal to mentioned expression and transition profiles on the AtGenExpress diseased tissue for Duchenne muscular dystrophy, breast cancer and EDGE datasets. and Huntington’s disease. Hence, these 32 microarray experi- ments can be categorized into 3 experimental conditions (N = 3), Similarity Between Expression and Transition Profiles based on the associated diseases. Each microarray experiment consists of arrays or expression profiles which can be categorized We classified gene expression and transition profiles according to the 1-nearest neighbor rule (1NN). Similarity between in either ‘‘normal’’ or ‘‘diseased’’ (T = 2). Each microarray expression or transition profiles was evaluated using the Pearson experiment consisted of a unique set of genes. In order to create correlation coefficient. Thus, for a given transition profile selected a single dataset with around 3000 common genes, we excluded 5 as test data, we computed, pairwise, the Pearson correlation microarray experiments. The reduced dataset contains 3378 genes coefficient between it and all the transition profiles in the training (G = 3378). The expression values in the reduced dataset were data. Then, we examined the Pearson correlation coefficient of quantile normalized. each pair, and labeled the test data with the experimental condition of the transition profile in the training data for which we STRING Database obtained the highest Pearson correlation coefficient. To make the We obtained protein interaction information from STRING, a comparison between expression and transition profiles fair, in the database of known and predicted protein-protein (and protein- case of time point expression profiles, the test data consisted of two gene) interactions [http://string-db.org/, [34–39]]. The interac- expression profiles. We computed, pairwise, the Pearson correla- tions in STRING are derived from various sources and contain tion coefficient between all the expression profiles in the test data both physical and functional associations. Each interaction in and all the expression profiles in the training data. Finally, the test

PLOS ONE | www.plosone.org 3 October 2013 | Volume 8 | Issue 10 | e76561 CHAPTER 5.1 PUBLICATION #1

Relevant Genes in Microarrays Experiments data were labeled with the experimental condition of the Generation of Linked Gene and Link Dataset expression profile in the training data for which we obtained the We also investigated the effect of adding protein interaction highest Pearson correlation coefficient. information on microarray classification. We retrieved this information from the protein interaction database STRING Leave-one-out Cross-validation [http://string-db.org/[34–39]], and refer hereafter to interactions We evaluated the predictive power of our method for different (both omnidirectional and directional) obtained from this source as choices of parameters and features using leave-one-out cross- ‘‘links’’. First, we created a dataset of genes for which we observe validation. In the following, we assume that pairs of time-points at least one interaction in STRING. We decided on these are sorted by time in ascending order. interactions using a range of STRING confidence score thresholds. We call this dataset ‘‘linked gene dataset’’. Second, N AtGenExpress and EDGE datasets. The test dataset was we converted the ‘‘linked gene dataset’’, into a dataset describing generated by randomly selecting 2 time points from a given interactions, rather than genes. We call this dataset ‘‘link dataset’’. experimental condition. For these 2 time points, we used one Link datasets include exactly one link for each pair of genes in the of the replicates as test dataset, and excluded all other ‘‘linked gene dataset’’ for which there exists an interaction in replicates from the training and test datasets. The training STRING. Then, similarly to Warsow et al. [40], we define the dataset consisted of all the remaining time points for this ‘‘expression value’’ of a link as the mean expression value of the experimental condition together with all the time points for two genes involved: other experimental conditions. We generated ten such test/ training datasets for each experimental condition, and z O:,gx,: O:,gy,: repeated this random sub-sampling validation procedure 30 O:,l,:~ : times for AtGenExpress, and 100 times for EDGE. 2

N Engreitz dataset. This dataset contains 27 microarray where link l implies a protein interaction between gene gx and gy. experiments categorized into 3 experimental conditions. Each and O.,l,. is a N6T matrix containing the link expression values of microarray experiment has expression profiles in ‘‘normal’’ the link l for all experiments and time points. and ‘‘diseased’’ states. We treated ‘‘normal’’ as time point 0 We compared the classification performance of classifiers and ‘‘diseased’’ as time point 1. For the cross validation, we relying on gene, linked gene and link datasets. selected one microarray experiment (2 time points) as test data, and the rest as training data. This process was repeated so that Feature Selection Methods each microarray experiment was selected as test data exactly Microarray experiments are intrinsically noisy in that they once. involve a very large number of genes, most of which exhibit irrelevant variation [41]. In this context, the aim of feature selection methods is to identify and exclude such non-informative Evaluation of the Classification genes. We evaluated different feature selection methods. First, we If the predicted experimental condition for the test data is the used information from STRING to select biologically relevant same experimental condition from which the test data had been genes, and created a linked gene dataset and a link dataset taken, then, the classification is correct; otherwise, the classification consisting of only these selected genes. Second, we developed the is incorrect. We used the accuracy to evaluate the classification rSNR, a novel feature selection method that identifies biologically results for different methods, parameters, and features: relevant genes based on their SNR. Finally, we compared our classification results with those obtained by selecting genes correct predictions randomly. Except for the rSNR, which is explained in detail in accuracy~ : total predictions the following section, the remaining aforementioned methods are outlined as follows:

For the Engreitz dataset, in addition to the accuracy we used 1. STRING-based link selection. Our link datasets included the area under the Receiver Operating Characteristic (ROC) only those links representing interactions between genes curve (AUC). Each of the 27 microarray experiments were used present on the microarray and with a confidence score in to query the remaining 26 microarray experiments exactly once STRING greater than a given cut-off, which was selected from in order to determine whether they correspond to the same {0, 250, 500, 750, 900}. experimental condition (i.e., disease). Thus, given a query 2. STRING-based gene selection. Our linked genes datasets microarray experiment, we computed 26 correlation coefficients. included only those genes present on the microarray and for Then, we defined a cut-off on those correlation coefficients. All which there is an interaction in STRING with a confidence microarray experiments for which we obtained correlation score greater than a given cut-off. As for the STRING-based coefficients higher than the cut-off were classified as ‘‘positives’’. link selection, cut-off scores were selected from {0, 250, 500, Out of these positive microarray experiments, those that indeed 750, 900}. It follows that for each cut-off score, the link and corresponded to the experimental condition of the query linked genes datasets included information of exactly the same microarray experiment were considered ‘‘true positives’’ (TP); genes. those corresponding to a different experimental condition were 3. Random Selection. Randomly selected genes were used as considered ‘‘false positives’’ (FP). We then computed the true controls. For each STRING cut-off score, we counted the positive rate (TPR) and false positive rate (FPR). The ROC number of genes present in the linked genes dataset, and curve represents the TPR as a function of the FPR for different randomly selected the same number of genes to create a cut-off values. The AUC reported is the average computed for control dataset. Leave-one-out cross-validation was performed all 27 microarray experiments taken as query. on this control dataset. The process of creating a dataset from

PLOS ONE | www.plosone.org 4 October 2013 | Volume 8 | Issue 10 | e76561 5 Article

Relevant Genes in Microarrays Experiments

randomly selected genes was repeated 25 times for each STRING cut-off score. z SNRk,g rSNRk,g~ { : SNRk,g Relative Signal-to-Noise Ratio (rSNR) The rSNR ranks genes according to their association with a given experimental condition. In regards to our classification The rSNR can be also be interpreted as the ratio between two problem, this means that when the transition profile of a test data coefficients of variation: is compared with the transition profile of an experimental s condition in the training data, only those genes that were found ðÞm1,:::,mN{ ,g to be relevant for that experimental condition will be used to assess m { ðÞm1,:::,mN{ ,g CV rSNRk,g~ z ~ z the similarity between the two profiles. In a cross-validation s k,g CV framework, the rSNR gene rank is computed based exclusively on z m the training data. Hence, the sets of relevant genes are also based k,g exclusively on the training data, and will be different for each z { where, CV and CV are the coefficients of variation of the cross-validation fold. In order to rank the genes based on their positive and negative set respectively. The CV describes the rSNR, for each experimental condition we first define a positive normalized dispersion of the expression level of a gene, i.e., the and a negative set in the training data. Basically, the positive set is variability of its expression value with respect to its mean. In the training data for the experimental condition of interest, while microarray experiments, the variability in gene expression levels the negative set comprises the remaining training data not originates from two sources, technical and biological. Technical involving any of the experimental factors defining the positive variability is primarily controlled using pre-processing, normali- set. For example, let us assume that we are computing the rSNR zation and replication [42–45]. Additional technical variability can for the experimental condition involving the experimental factors be assumed to be approximately constant across experimental cold and root (‘‘cold-root’’) in the AtGenExpress dataset. In this conditions. In the case of the positive set, most of the biological case, the positive set is the training data for the experimental variability arises from the effect of the experimental factors condition of interest (‘‘cold-root’’), while the negative set is the involved in the experimental condition of interest across time training data involving neither ‘‘cold’’ nor ‘‘root’’. Other points. Thus, CVz quantifies the biological variability within a experimental conditions sharing experimental factors with the particular experimental condition, i.e., ‘‘intrinsic noise’’. In the positive set are excluded from the rSNR calculation. The case of the negative set, most of the biological variability is due to definition of the positive and the negative set is exemplified in differences in the effect of the experimental factors considered Figure 1A. Next, for all genes we calculate the signal-to-noise ratio z { under the various experimental conditions represented in the in the positive set (SNR ) and in the negative set (SNR ) { dataset. Thus, CV quantifies the biological variability between separately. For the experimental condition k, the SNRz for gene g experimental conditions, i.e., ‘‘extrinsic noise’’. With regards to the is given by: rSNR, this implies that genes that show little variation within the experimental condition of interest but whose expression levels vary z m across experimental conditions will tend to exhibit high rSNR SNRz ~ k,g k,g z scores. For the AtGenExpress dataset, we found that genes with sk,g high rSNR scores tend to be highly expressed in the positive set, z z and show little variation across time points, consistent with the where, mk,g and sk,g are the mean and the standard deviation respectively of the expression values for gene g in the positive set. expression pattern of fast-response genes to the interventions And the SNR{ is given by: applied there (data not shown). The rSNR calculation described above is repeated for all experimental conditions in the training data. Hence, for each P experimental condition, we obtain a list of genes with their N{ m i~1 i,g m corresponding rSNR scores, which is then used to select relevant { ~ rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiN{ ~ (m1,:::,mN{ ),g SNRk,g P genes. After feature selection, each experimental condition in the N{ 2 m {m s(m ,:::,mN ),g i~1 ðÞi,g g 1 training data is represented by a separate list of relevant genes and N{ their corresponding expression values. Subsequently, when test data are compared with a given experimental condition (e.g. where, m is the mean expression value of gene g under i,g ‘‘cold-root’’), only the genes that were found to be relevant to that experimental condition i, mg is the mean of the mean expression particular experimental condition (‘‘cold-root’’) are used for value of gene g across all experimental conditions in the negative similarity score calculation. This process is repeated for each set, and m and s are, respectively, the ðÞm1,:::,mN{ ,g ðÞm1,:::,mN{ ,g combination of test data and experimental condition in the mean and standard deviation of the expression value of gene g training data. For a fair comparison with the results based on the across all experimental conditions in the negative set. i[fg1,:::,N{ link and linked genes datasets, we selected the same number of is the set of experimental conditions in the negative set and N{ is genes based on their rSNR scores, as obtained for each STRING the total number of experimental conditions in the negative set. cut-off score (see previous subsection). Figure 1B illustrates the calculation of the mean and standard deviation for the positive and the negative set. Comparison with Standard Feature Selection Methods Finally, we define the rSNR for experimental condition k and We compared the performance of the rSNR with two standard gene g as: feature selection methods. Like the rSNR, these feature selection methods rank genes and select genes based on a ranking:

PLOS ONE | www.plosone.org 5 October 2013 | Volume 8 | Issue 10 | e76561 CHAPTER 5.1 PUBLICATION #1

Relevant Genes in Microarrays Experiments

Figure 1. Example showing the calculation of the rSNR for the experimental condition ‘‘cold-root’’ in the AtGenExpress dataset. A) Division of the training dataset into positive and negative sets. The positive set corresponds to the experimental condition of interest (i.e., ‘‘cold-root’’, in red). Only some of the remaining experimental conditions in the training set, namely those not involving any of the experimental factors that define the experimental condition of interest, are used to build the negative set (in dark gray). B) Calculation of the mean and standard deviation for the positive and negative sets.T1 to T6 are time points. Numbers inside the boxes represent the number of replicates for an experimental condition at a given time point. m and s represents the mean and standard deviation respectively. For the positive set, we compute a mean and a standard deviation at each time point. For the negative set, we compute a mean at each time point. We then compute the mean and the standard deviation of the negative set as the mean and the standard deviation of the means computed at each time point respectively. doi:10.1371/journal.pone.0076561.g001

1. SNR: As previously described, SNR has been extensively used as feature selection method. In particular, it is often used to measure the effectiveness of a feature in discriminating m {m 0 { ~ sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi1 2 between two classes, and defined as: Welch st score s2 s2 1 z 2 m {m n1 n2 SNR~ 1 2 s1zs2 where, mi, si and ni are the mean, standard deviation and sample size, respectively. where, m1 and m2 are the mean expression value for class 1 and It is noteworthy that SNR and Welch’s t-test differ only in their class 2 respectively while s1 and s2 are the standard deviation for class 1 and class 2 respectively. estimation of the variance. 2. Welch’s t-test: Welch’s t-test is an adaptation of Student’s t- test to be used on two samples with unequal variance:

PLOS ONE | www.plosone.org 6 October 2013 | Volume 8 | Issue 10 | e76561 5 Article

Relevant Genes in Microarrays Experiments

Selecting Optimal Parameters for the Nearest-neighbor method to the entire datasets. We compared the performance of Classification the method using different gene expression and transition profiles, We began by evaluating the performance of our 1NN classifier as well as with previous work. Figure 3A represents scenario 1, in on the entire AtGenExpress (2074 genes) and EDGE (1600 gene) which replicates of the test data are included in the training data datasets. These datasets had been previously investigated by (for better comparability with the results by Hafemeister et al), Hafemeister et al in the context of time-series microarray while Figure 3B represents scenario 2, in which replicates of the classification [30]. Hafemeister et al. suggested to use piecewise test data are excluded from the training data. Among the three constant functions to model time courses, and implemented them expression and transition profiles, MTP performed best in most as Hidden Markov Models. Parameter estimation and inference cases. This suggests that the mean expression value of the genes, was achieved using a Bayesian approach. Since this is a state-of- rather than their change in expression, is specific to the the-art method, we considered their results as benchmark for experimental condition. Moreover, MTP performed better than comparison with our preliminary analysis. We also used their single time point expression profiles, indicating, as expected, that results as a baseline to improve with our feature selection two time points considered together contain more information approach. Since we modified these datasets from the original than two single time points considered separately for the ones (4 experimental conditions were removed from EDGE classification task. Also, MTP performs comparably to the method dataset, see subsection on datasets), here we report the accuracy by Hafemeister et al. Hence, MTP was chosen as the transition obtained by running the Python package available from profile formula for further evaluations. In addition, all further Hafemeister et al. [30] on the modified datasets. We were also evaluations exclude the replicates from the training data. able to reproduce the results reported by Hafemeister et al. on the original dataset (data not shown). Additionally, to make our results The rSNR Performed Best among All Evaluated Feature comparable, we considered two different scenarios: 1) including Selection Methods replicates of the test data in the training data (as presented by Next, we evaluated different feature selection methods with the Hafemeister et al.), and 2) excluding replicates of the test data aim of improving classification performance. We compared the from the training data (see Figure 2). The test data comprised at performance of gene and link-based methods, using randomly least 2 time points in scenario 1, and exactly 2 time points in selected genes as controls (see Methods). To render all feature scenario 2. In scenario 1, the test data were defined by randomly selection methods comparable, the classification decision was selecting 1) the number of time points, 2) the particular time always based on the same number of genes, independently of the points, and 3) the replicates used. The rest of the data were used as feature selection method. The results are shown in Figures 4A and training data. The selected time points were sorted in ascending 4B. The accuracy of the classifier based on randomly selected order according to time, and pairs of consecutive time points were genes decreased with the amount of information available for the used to generate transition profiles. Then, we compared each classification. The decrease in accuracy is gradual, but drops transition profile against all possible transition profiles of the training data, and computed the similarity value (Pearson significantly at the end for both AtGenExpress and EDGE correlation coefficient) between them. Each transition profile in datasets, suggesting the existence of a necessary and sufficient the test data was labeled with the experimental condition for which number of genes to describe experimental conditions. Both we obtained the highest similarity value. Finally, the test data were STRING-based feature selection methods failed to perform better labeled with an experimental condition using majority voting, or than the random selection for AtGenExpress. Only the method highest sum of similarity scores in case of tie. Under scenario 2, based on the linked genes dataset performed better than random replicates of the test data were excluded from the training data; for EDGE. Additionally, we observed that the performance of the otherwise, we followed the same setup. Note that, if the number of classifier based on the link dataset was generally not better than time points selected for the test data equaled the total number of that of the linked genes dataset. There are some plausible time points available for a particular experimental condition, explanations for the poor performance of STRING-based feature including one replicate for each time point in the test data and selection methods, and more specifically, the link-based feature excluding all other replicates from the training data would result in selection method: an empty training data for that experimental condition (see Figure N S1). To avoid such a situation, under scenario 2 we limited the size The STRING database is not a process-specific protein of the test data to only 2 time points. interaction database. A process-specific database, e.g., a database of interactions involved in stress response of Arabidopsis, would probably help in selecting interactions Results and Discussion specific to each experimental condition in AtGenExpress. We set out to improve the classification of microarray time N Some genes that might be relevant for the experimental series by various means. We evaluated how gene expression conditions under study are not present in STRING, and, profiles from two different time points can be combined, yielding hence, information provided by these genes is lost. transition profiles. We incorporated protein interaction informa- N It is disputable whether genes with more links in STRING are tion into the definition of such transition profiles, and selected more important, or simply more extensively studied. genes based on the same information. Finally, we compared different feature selection approaches, and developed the rSNR The rSNR feature selection method exhibited the best overall (relative Signal-to-Noise Ratio) method to extract the genes that performance, achieving higher accuracy values compared to are the most relevant to a specific biological condition. randomly selected genes for both datasets. Additionally, applying the rSNR feature selection method generally resulted in a The Mean Transition Profile (MTP) Performed Best among significantly higher accuracy as compared to the other feature our Profile Generation Methods selection methods. A detailed analysis of the rSNR method based We evaluated our method on the AtGenExpress and the EDGE on the AtGenExpress dataset is presented in the following section. datasets using leave-one-out cross-validation. First, we applied our

PLOS ONE | www.plosone.org 7 October 2013 | Volume 8 | Issue 10 | e76561 CHAPTER 5.1 PUBLICATION #1

Relevant Genes in Microarrays Experiments

Figure 2. Framework for the evaluation of the performance for methods classifying time-course expression data. In this example, T1 to T6 represent time points. Numbers inside the boxes indicate the number of replicates for each experimental condition at each time point. Test data consists of two time points, T1 and T3, and is taken from the experimental condition 1. In scenario 1, the replicates of the test data are part of the training data, while in scenario 2, the replicates of the test data have been excluded from the training data. doi:10.1371/journal.pone.0076561.g002

Figure 3. Accuracy of cross-validations using different param- Figure 4. Effect of feature selection methods on the classifica- eters. A) Accuracy computed under scenario 1 (see Methods). B) tion of time-course expression data. Accuracy was calculated Accuracy computed under scenario 2 (see Methods). Error bars indicate based on the A) AtGenExpress and B) EDGE datasets. Error bars indicate 1 standard deviation away from the mean. 1 standard deviation away from the mean. doi:10.1371/journal.pone.0076561.g003 doi:10.1371/journal.pone.0076561.g004

PLOS ONE | www.plosone.org 8 October 2013 | Volume 8 | Issue 10 | e76561 5 Article

Relevant Genes in Microarrays Experiments rSNR Performed Distinctly Better than Random Gene To verify that this is indeed the case, we selected 200 genes from Selection different sections of the rSNR-based sorted gene list, and To better understand the rSNR method, we systematically performed cross-validation, reporting average accuracy values. decreased the number of features in the dataset by uniformly As control, we used 200 randomly selected genes. As shown in removing 200 genes in a step-wise fashion, and compared the Figure 5B, the accuracy of the classifier is minimum when 200 performance of the rSNR with control datasets containing the genes are selected from the bottom of the rSNR-based sorted gene same number of randomly selected genes (Figure 5A). The rSNR list. Accuracy increases gradually as we select genes from higher in performed significantly better than the controls. However, as the list, until it reaches a plateau. The graph shows that genes at observed for different STRING cut-off scores (Figure 4A), with a the top of the rSNR-based sorted gene list are more important for small number of genes, the accuracy obtained using the rSNR the classification than genes at the bottom, suggesting that the becomes similar to random, suggesting, as expected, that there is a rSNR indeed identifies relevant genes for the experimental minimum number of genes required to distinguish between condition under study. experimental conditions. Comparing the rSNR with Similar Feature Selection Relevant Genes Scored High on the rSNR-based Gene Methods Rank We compared the performance of the rSNR with SNR and For each experimental condition in the training dataset, we Welch’s t-test. As shown in Figure 6 and Text S1, the rSNR created a gene list, sorted in ascending order (from bottom to top) performs significantly better than the other two feature selection according to their rSNR scores. For the rSNR to constitute a methods. reliable feature selection method in the context of microarray classification, the genes on the top of the list should be relevant to Testing Methods on Non-time Series Microarray Data the specific experimental condition of interest. On the other hand, In addition to testing our methods on time-series datasets, we the bottom of the list should contain genes considered to be noise. applied them to the static microarray dataset from Engreitz et al.

Figure 5. Evaluation of the effect of the rSNR-based rank on the classification of experiments from the AtGenExpress dataset. A) Accuracy was calculated based on gene sets of uniformly decreasing size, selected based on rSNR ranking. B) Accuracy was calculated based on gene sets belonging to different rSNR-based rank sections. Error bars indicate 1 standard deviation away from the mean. doi:10.1371/journal.pone.0076561.g005

PLOS ONE | www.plosone.org 9 October 2013 | Volume 8 | Issue 10 | e76561 CHAPTER 5.1 PUBLICATION #1

Relevant Genes in Microarrays Experiments

Figure 6. Cross-validation results of the comparison between rSNR and other feature selection methods on the AtGenExpress dataset. Error bars indicate 1 standard deviation away from the mean. doi:10.1371/journal.pone.0076561.g006

Engreitz et al. classified microarray experiments into three disease types, achieving an area under the ROC curve (AUC) of 0.729. As discussed in the Methods section, the application of our method required the modification of the original dataset. Hence, direct comparison with the results of Engreitz et al. is not appropriate. Even randomly selected genes perform extremely well at classifying these data (Figure 7A), an observation reminiscent of results reported by Venet et al. [46]. Here, the rSNR results in a slightly lower accuracy than other methods, but shows a superior performance in terms of the AUC (Figure 7B). A higher AUC indicates a better predictive ability of the rSNR. Thus, in a database search setup, where it is not necessarily the best candidate which is of interest, but rather a set of highly scoring candidates, a high rSNR is a useful feature for improving classification results.

Conclusions We first showed that the nearest neighbor method performs comparably to the method developed by Hafemeister et al. We found that for the purpose of classification, mean expression profiles describe time-series transitions based on microarray experiments better than differential expression profiles and single time point expression profiles. We used biological information as a feature selection criterion wherein a selected feature is known to involve a protein interaction. The source of this biological information was protein interaction information from the STRING database. We investi- gated the performance of different methods for reducing the high dimensionality of microarray data based on such information. We Figure 7. Effect of feature selection methods on the classifica- found that, compared to simple expression profiles, the addition of tion of the Engreitz dataset. A) Accuracy. B) Area under the ROC curve (AUC). information on interactions does not provide a clear advantage in doi:10.1371/journal.pone.0076561.g007 the classification of microarray experiments. Moreover, the application of feature selection methods relying on interactions Finally, we proposed a novel method for feature selection that resulted in performances comparable to those obtained with we called the relative Signal to Noise Ratio (rSNR). The rSNR randomly selected genes. Alternative forms of including such gives a score to each gene based on its relevance for each information, or the use of interactions from more specific experimental condition. This score can then be used to select databases, describing the process of interest, might be of advantage. relevant genes. We showed that the performance of classifiers

PLOS ONE | www.plosone.org 10 October 2013 | Volume 8 | Issue 10 | e76561 5 Article

Relevant Genes in Microarrays Experiments based on genes with low rSNR scores is substantially worse than Supporting Information that of classifiers based on genes with high rSNR scores. This result indicates that the genes relevant for classification of the Figure S1 Example of cross-validation when all time experimental condition of interest rank high, in contrast to points of an experimental condition are selected as test irrelevant genes, which may be considered noise. Due to its data. simplicity, our method is particularly attractive for database (TIF) searches. In a preprocessing step, the microarray datasets for Text S1 Results of comparison between rSNR and different experimental conditions in the database can be summa- Significance Analysis of Microarray (SAM). rized using only the most relevant genes. Hence, when a query (PDF) microarray is searched against the entire database, instead of comparing the expression profiles of all genes, only the expression Author Contributions profiles of genes relevant to an experimental condition need to be Conceived and designed the experiments: NKS DR GF. Performed the compared. Both in terms of memory and performance, such a experiments: NKS. Analyzed the data: NKS DR LT. Wrote the paper: procedure would demand relatively few computational resources. NKS VL LT GF.

References 1. Bar-Joseph Z (2004) Analyzing time series gene expression data. Bioinformatics microarray assays run in a clinical context. J Mol Diagn 7: 404–412. Available: 20: 2493–2503. Available: http://dx.doi.org/10.1093/bioinformatics/bth283. http://dx.doi.org/10.1016/S1525-1578(10)60570-3. 2. Edgar R, Domrachev M, Lash AE (2002) Gene Expression Omnibus: NCBI 20. Mutch DM, Berger A, Mansourian R, Rytz A, Roberts M-A (2002) The limit gene expression and hybridization array data repository. Nucleic Acids Res 30: fold change model: a practical approach for selecting differentially expressed 207–210. genes from microarray data. BMC Bioinformatics 3: 17. 3. Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. 21. Raman T, O’Connor TP, Hackett NR, Wang W, Harvey B-G, et al. (2009) The Journal of Machine Learning Research 3: 1157–1182. Available: http://dl. Quality control in microarray assessment of gene expression in human airway acm.org/citation.cfm?id = 944968. epithelium. BMC Genomics 10: 493. Available: http://dx.doi.org/10.1186/ 4. Goecks J, Nekrutenko A, Taylor J, Galaxy Team (2010) Galaxy: a 1471-2164-10-493. comprehensive approach for supporting accessible, reproducible, and transpar- 22. Hengpraprohm S, Chongstitvatana P (2009) Feature selection by weighted-SNR ent computational research in the life sciences. Genome Biol 11: R86. Available: for cancer microarray data classification. International Journal of Innovative http://dx.doi.org/10.1186/gb-2010-11-8-r86. Computing, Information and Control 5. 5. Kallio MA, Tuimala JT, Hupponen T, Klemela¨ P, Gentile M, et al. (2011) 23. Huang CJ, Liao WC (2003) A comparative study of feature selection methods for Chipster: user-friendly analysis software for microarray and other high- probabilistic neural networks in cancer classification. Tools with Artificial throughput data. BMC Genomics 12: 507. Available: http://dx.doi.org/10. Intelligence, 2003. Proceedings. 15th IEEE International Conference on. IEEE. 1186/1471-2164-12-507. 451–458. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber = 1250224. 6. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, et al. (2004) 24. Pomeroy SL, Tamayo P, Gaasenbeek M, Sturla LM, Angelo M, et al. (2002) Bioconductor: open software development for computational biology and Prediction of central nervous system embryonal tumour outcome based on gene bioinformatics. Genome Biol 5: R80. Available: http://dx.doi.org/10.1186/gb- expression. Nature 415: 436–442. Available: http://www.nature.com/nature/ 2004-5-10-r80. journal/v415/n6870/abs/415436a.html. 7. Duval B, Hao J-K (2010) Advances in metaheuristics for gene selection and 25. Ryu J, Cho SB (2002) Gene expression classification using optimal feature/ classification of microarray data. Brief Bioinform 11: 127–141. Available: classifier ensemble with negative correlation. Neural Networks, 2002. IJCNN’02. http://dx.doi.org/10.1093/bib/bbp035. Proceedings of the 2002 International Joint Conference on. IEEE, Vol. 1. pp. 8. Beyer A, Bandyopadhyay S, Ideker T (2007) Integrating physical and genetic 198–203. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber = 1005469. maps: from genomes to interaction networks. Nat Rev Genet 8: 699–710. 26. Slonim DK, Tamayo P, Mesirov JP, Golub TR, Lander ES (2000) Class Available: http://dx.doi.org/10.1038/nrg2144. prediction and discovery using gene expression data. Proceedings of the fourth annual international conference on Computational molecular biology. ACM. 9. Fuente A de la (2010) From ‘differential expression’ to ‘differential networking’ - pp. 263–272. Available: http://dl.acm.org/citation.cfm?id = 332564. identification of dysfunctional regulatory networks in diseases. Trends Genet 26: 27. Lakshmi K, Mukherjee S (2006) An improved feature selection using maximized 326–333. Available: http://dx.doi.org/10.1016/j.tig.2010.05.001. signal to noise ratio technique for TC. Information Technology: New 10. Chuang H-Y, Lee E, Liu Y-T, Lee D, Ideker T (2007) Network-based Generations, 2006. ITNG 2006. Third International Conference on. IEEE. classification of breast cancer metastasis. Mol Syst Biol 3: 140. Available: http:// pp. 541–546. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber = 1611649. dx.doi.org/10.1038/msb4100180. 28. Mishra D, Sahu B (2011) Feature Selection for Cancer Classification: A Signal- 11. Parsons DW, Jones S, Zhang X, Lin JC-H, Leary RJ, et al. (2008) An integrated to-noise Ratio Approach. International Journal of Scientific & Engineering genomic analysis of human glioblastoma multiforme. Science 321: 1807–1812. Research 2. Available: http://dx.doi.org/10.1126/science.1164382. 29. Goh L, Song Q, Kasabov N (2004) A novel feature selection method to improve 12. Rapaport F, Zinovyev A, Dutreix M, Barillot E, Vert J-P (2007) Classification of classification of gene expression data. Proceedings of the second conference on microarray data using gene networks. BMC Bioinformatics 8: 35. Available: Asia-Pacific bioinformatics-Volume 29. Australian Computer Society, Inc. http://dx.doi.org/10.1186/1471-2105-8-35. pp. 161–166. Available: http://dl.acm.org/citation.cfm?id = 976542. 13. Ma X, Lee H, Wang L, Sun F (2007) CGI: a new approach for prioritizing genes 30. Hafemeister C, Costa IG, Scho¨nhuth A, Schliep A (2011) Classifying short gene by combining gene expression and protein-protein interaction data. Bioinfor- expression time-courses with Bayesian estimation of piecewise constant matics 23: 215–221. Available: http://dx.doi.org/10.1093/bioinformatics/ functions. Bioinformatics 27: 946–952. Available: http://dx.doi.org/10.1093/ btl569. bioinformatics/btr037. 14. Wu C, Zhu J, Zhang X (2012) Integrating gene expression and protein-protein 31. Kilian J, Whitehead D, Horak J, Wanke D, Weinl S, et al. (2007) The interaction network to prioritize cancer-associated genes. BMC Bioinformatics AtGenExpress global stress expression data set: protocols, evaluation and model 13: 182. Available: http://dx.doi.org/10.1186/1471-2105-13-182. data analysis of UV-B light, drought and cold stress responses. The Plant Journal 15. Chopra P, Lee J, Kang J, Lee S (2010) Improving cancer classification accuracy 50: 347–363. Available: http://onlinelibrary.wiley.com/doi/10.1111/j.1365- using gene pairs. PLoS One 5: e14305. Available: http://dx.doi.org/10.1371/ 313X.2007.03052.x/full. journal.pone.0014305. 32. Hayes KR, Vollrath AL, Zastrow GM, McMillan BJ, Craven M, et al. (2005) 16. Scheubert L, Schmidt R, Repsilber D, Lusˇtrek M, Fuellen G (2011) Learning EDGE: a centralized resource for the comparison, analysis, and distribution of biomarkers of pluripotent stem cells in mouse. DNA research 18: 233–251. toxicogenomic information. Molecular pharmacology 67: 1360–1368. Available: Available: http://dnaresearch.oxfordjournals.org/content/18/4/233.short. http://molpharm.aspetjournals.org/content/67/4/1360.short. 17. Staiger C, Cadot S, Kooter R, Dittrich M, Mu¨ller T, et al. (2012) A critical 33. Engreitz JM, Morgan AA, Dudley JT, Chen R, Thathoo R, et al. (2010) evaluation of network and pathway-based classifiers for outcome prediction in Content-based microarray search using differential expression profiles. BMC breast cancer. PLoS One 7: e34796. Available: http://dx.doi.org/10.1371/ Bioinformatics 11: 603. Available: http://dx.doi.org/10.1186/1471-2105-11- journal.pone.0034796. 603. 18. Constantinides CD, Atalar E, McVeigh ER (2005) Signal-to-noise measure- 34. Jensen LJ, Kuhn M, Stark M, Chaffron S, Creevey C, et al. (2009) STRING 8–a ments in magnitude images from NMR phased arrays. Magnetic Resonance in global view on proteins and their functional interactions in 630 organisms. Medicine 38: 852–857. Available: http://onlinelibrary.wiley.com/doi/10.1002/ Nucleic acids research 37: D412–D416. Available: http://nar.oxfordjournals. mrm.1910380524/abstract. org/content/37/suppl_1/D412.short. 19. Daly TM, Dumaual CM, Dotson CA, Farmen MW, Kadam SK, et al. (2005) 35. Snel B, Lehmann G, Bork P, Huynen MA (2000) STRING: a web-server to Precision profiling and components of variability analysis for Affymetrix retrieve and display the repeatedly occurring neighbourhood of a gene. Nucleic

PLOS ONE | www.plosone.org 11 October 2013 | Volume 8 | Issue 10 | e76561 CHAPTER 5.1 PUBLICATION #1

Relevant Genes in Microarrays Experiments

acids research 28: 3442–3444. Available: http://nar.oxfordjournals.org/ 41. Ding C, Peng H (2005) Minimum redundancy feature selection from microarray content/28/18/3442.short. gene expression data. Journal of bioinformatics and computational biology 3: 36. Szklarczyk D, Franceschini A, Kuhn M, Simonovic M, Roth A, et al. (2011) The 185–205. Available: http://www.worldscientific.com/doi/abs/10.1142/ STRING database in 2011: functional interaction networks of proteins, globally S0219720005001004. integrated and scored. Nucleic acids research 39: D561–D568. Available: 42. Ritchie ME, Silver J, Oshlack A, Holmes M, Diyagama D, et al. (2007) A http://nar.oxfordjournals.org/content/39/suppl_1/D561.short. comparison of background correction methods for two-colour microarrays. 37. Von Mering C, Huynen M, Jaeggi D, Schmidt S, Bork P, et al. (2003) STRING: Bioinformatics 23: 2700–2707. Available: http://bioinformatics.oxfordjournals. a database of predicted functional associations between proteins. Nucleic acids org/content/23/20/2700.short. research 31: 258–261. Available: http://nar.oxfordjournals.org/content/31/1/ 43. Smyth GK, Michaud J, Scott HS (2005) Use of within-array replicate spots for 258.short. assessing differential expression in microarray experiments. Bioinformatics 21: 38. Von Mering C, Jensen LJ, Kuhn M, Chaffron S, Doerks T, et al. (2007) 2067–2075. Available: http://bioinformatics.oxfordjournals.org/content/21/9/ STRING 7–recent developments in the integration and prediction of protein 2067.short. interactions. Nucleic acids research 35: D358–D362. Available: http://nar. 44. Smyth GK, Speed T (2003) Normalization of cDNA microarray data. Methods oxfordjournals.org/content/35/suppl_1/D358.short. 31: 265–273. Available: http://www.sciencedirect.com/science/article/pii/ 39. Von Mering C, Jensen LJ, Snel B, Hooper SD, Krupp M, et al. (2005) STRING: S1046202303001555. known and predicted protein–protein associations, integrated and transferred 45. Yang YH, Speed T (2002) Design issues for cDNA microarray experiments. Nat across organisms. Nucleic acids research 33: D433–D437. Available: http://nar. Rev Genet 3: 579–588. Available: http://dx.doi.org/10.1038/nrg863. oxfordjournals.org/content/33/suppl_1/D433.short. 40. Warsow G, Greber B, Falk SSI, Harder C, Siatkowski M, et al. (2010) 46. Venet D, Dumont JE, Detours V (2011) Most random gene expression ExprEssence–revealing the essence of differential experimental data in the signatures are significantly associated with breast cancer outcome. PLoS context of an interaction/regulation net-work. BMC Syst Biol 4: 164. Available: Comput Biol 7: e1002240. Available: http://dx.doi.org/10.1371/journal.pcbi. http://dx.doi.org/10.1186/1752-0509-4-164. 1002240.

PLOS ONE | www.plosone.org 12 October 2013 | Volume 8 | Issue 10 | e76561 5.2. SUPPLEMENT

5.2. Supplement

Supplementary information can be found on the CD attached to this thesis.

6 Article

Elucidating complex phenotypes based on high‐throughput expression and biological annotation data Nitesh Kumar Singh1, Mathias Ernst1, Volkmar Liebscher2, Georg Fuellen1,* and Leila

Taher1,*

1Institute for Biostatistics and Informatics in Medicine and Ageing Research, Rostock University

Medical Center, Rostock, Germany.

2Institute for Mathematics and Informatics, Ernst Moritz Arndt University of Greifswald, Greifswald,

Germany.

* Corresponding authors: Dr. Leila Taher, Prof. Dr. Georg Fuellen, Institute for Biostatistics and Informatics in Medicine and Ageing Research, Ernst‐Heydemann‐Str. 8, D‐18057 Rostock, Germany. E‐mail: leila.taher@uni‐rostock.de, fuellen@uni‐rostock.de.

Abstract The interpretation of large gene expression datasets describing complex phenotypes is obscured by the fact that such phenotypes are likely to be controlled by the concerted action of multiple genes. Combining gene expression analysis with annotation regarding the functions, processes and pathways in which the genes are involved has the potential to elucidate the biological interactions underlying a given phenotype. Here, we present an approach that integrates gene expression and biological annotation data to identify biological units and their interactions that influence a phenotype of interest. First, we divide genes with similar biological annotation into clusters. Second, we group the genes within each cluster into sub‐clusters, based on their expression profiles. Finally, we construct a co‐expression network of sub‐clusters to analyze the interactions between the biological units represented by these sub‐clusters. We applied our approach to two microarray expression datasets describing the differentiation of mouse embryonic stem cells into embryoid bodies, and mouse liver development and regeneration, respectively. 1 CHAPTER 6.1 PUBLICATION #2, under review

For the first dataset, our findings confirm that developmental processes and apoptosis have a key role in cell differentiation. Furthermore, we suggest that processes related to pluripotency and lineage commitment, which are known to be critical for development, interact mainly indirectly, through genes implicated in more general biological processes. For the second dataset, we concluded that the transcriptional mechanisms beneath liver regeneration are fundamentally different from those regulating embryonic liver development. Moreover, we provide evidence that supports the relevance of cell organization in the developing liver for proper liver function. Understanding how genes involved in specific biological functions, processes and/or pathways interact depending on particular experimental conditions is crucial to decipher the molecular basis of health, disease, and drug response. This study provides a new approach to examining gene expression data that can be easily extended to other high‐throughput expression data.

Author Summary Complex phenotypes are controlled by the concerted action of multiple genes. Combining gene expression analysis with functional annotation information has been shown to be useful in revealing the biological interactions underlying complex phenotypes. Here, we propose a novel method to identify the biological units underpinning the maintenance or progression of a given phenotype relying on both expression and functional annotation data. These biological units are represented by clusters of genes associated with similar biological functions, processes, and/or pathways and exhibiting similar expression profiles. Subsequently, we demonstrate how to the expression profiles of these clusters can be used to create a co‐expression network of clusters. These networks make the data more interpretable and can be used to guide further experiments.

2 6 Article

Introduction Complex phenotypes are usually the product of the coordinated action of numerous genes underpinning multiple interacting biological processes and/or pathways 1–7. For example, plant flowering is regulated by multiple distinct yet interacting pathways with common downstream target genes 1. Further, proper functioning of the transforming growth factor‐beta/bone morphogenic protein signaling pathway, which is fundamentally important during embryonic development and adult tissue homeostasis in metazoans, depends on interactions with other pathways, such as mitogen‐activated protein kinase MAPK, Wnt/Wg, Hedgehog Hh, Notch, and others 3. Also, early mouse caudal development has been shown to rely on the interaction between the sonic hedgehog Shh, retinoic acid RA, and fibroblast growth factor Fgf signaling pathways 7. Dysregulation of these interactions has been associated with phenotypic anomalies, such as developmental defects and diseases. Therefore, a comprehensive exploration of the biological interactions between different biological functions, processes, and pathways would facilitate the interpretation of experimental data.

Complex phenotypes are commonly explored within a single experiment in which the expression level of thousands of genes is simultaneously evaluated. Microarray technologies were developed and are routinely employed with that purpose. Depending on the object of the study, microarray experiments can be designed as static, to test differences between two conditions, such as ’normal‘ and ’diseased‘, or as time‐series, to evaluate the progression of a given condition over time. In most cases, only a relatively small fraction of the genes on the microarray will be of significance to the biological question being addressed 8,9. Selecting differentially expressed genes is the first step towards identifying biologically meaningful genes.

After selecting differentially expressed genes, the complexity of microarray data can be further reduced by clustering these genes. Two basic strategies are widely applied: i genes are clustered based on the similarity in their function or in the biological processes or pathways in which they are involved, determined based on databases such as Gene

3 CHAPTER 6.1 PUBLICATION #2, under review

Ontology GO 10 or Kyoto Encyclopedia of Genes and Genomes KEGG 11,12; and ii genes are clustered based on the similarity of their expression profiles. Genes with similar biological annotation are likely to be correlated or anti‐correlated, in case of inhibition in their expression profiles 13–16. However, target genes and their regulators may have similar annotation despite exhibiting a time shift in their expression profiles. More generally, genes with identical biological annotation may show completely different expression patterns, simply because they are subject to different regulatory mechanisms 17–19. On the other hand, genes with similar expression profiles will not necessarily be associated with the same biological functions, processes and/or pathways 20. Alternatively, a clustering method combining both biological annotation and expression data would identify sets of co‐expressed genes involved in the same specific functions, processes and/or pathways. Such an approach has the potential of recognizing the genes that underpin the maintenance or progression of a given complex phenotype, since genes that are both co‐expressed and share similar biological annotations have been shown to share regulatory mechanisms 21,22,19,23.

Microarray data can be analyzed by reconstructing a network that captures biological interactions. Recognizing precise condition‐specific interactions, as well as the typical modular organization of most gene expression programs is a key challenge in systems biology. Indeed, many studies have attempted to unravel the biological interactions and networks underlying complex phenotypes by applying ‘reverse engineering‘ algorithms, which use gene expression profiles as their starting point 24–26. Most of these approaches focus on single biological functions, processes and/or pathways 27–29. However, since most complex phenotypes are likely to be controlled by multiple genes, an approach able to identify their biological functions as well as the processes and/or pathways in which they are involved, and describe their interactions would provide a new perspective on microarray data.

Here we present a strategy that clusters genes based both on their biological annotation GO and KEGG as well as their expression profiles across time see Figure 1 for an overview. This information allows us to investigate possible interactions between the 4 6 Article

different biological functions, processes and/or pathways represented by these clusters. We do this by creating a co‐expression network which is not based on the expression pattern of individual genes, but rather on clusters of genes. We demonstrate how this network can be used to study the biological interactions underlying complex phenotypes. In particular, we applied our approach to two different microarray datasets. First, we analyzed changes in different developmental processes as mouse embryonic stem cells mESCs differentiate into embryoid bodies 30. Consistent with the relatively high expression of the core pluripotency gene Pou5f1, our analysis did not reveal any prevalent developmental processes early in the differentiation process, but rather up‐ and down‐ regulation of genes implicated in normal cell functioning, cell cycle, and cell death, among others . Furthermore, our results suggest that the interactions between processes known to be crucial for development, namely, pluripotency and lineage‐specific differentiation, are mainly indirect, through other, more general biological processes. Second, we compared the functional basis of mouse liver regeneration with development, and found no similarity between the two, a finding in accordance with Out et al 31. In particular, our data support a major role for the transcriptional regulation of biosynthetic processes in mouse liver regeneration, and highlight the relevance of the spatial organization of the cells in the developing liver for its function. In summary, we describe here a novel approach that provides new, valuable insights for gene expression data analysis.

5 CHAPTER 6.1 PUBLICATION #2, under review

Results

Clustering genes based on biological annotation and expression profile The development and analysis of more or less simplified models of biological processes and pathways has proven to be useful to further our understanding of complex phenotypes 32,33. Thus, our first aim was to recognize biological units that could be easily linked to known biological functions, processes, and pathways. For this purpose, we designed a two‐step procedure. First, we clustered together genes associated with similar GO terms and KEGG pathways, through a process to which we refer as ‘functional annotation clustering’ see Materials and Methods. The resulting clusters are partially overlapping, since some genes have multiple functional annotations depending on the amount of information available in the associated literature. Second, we divided the aforementioned clusters into also possibly partially overlapping sub‐clusters of genes exhibiting similar expression patterns using a fuzzy clustering algorithm see Materials and Methods. We first applied our approach to a gene expression dataset describing the differentiation of mESCs into embryoid bodies over a period of two weeks see Materials and Methods. The differentiation of stem cells to form embryoid bodies has been long regarded as a model for the formation of cells from all germ layers 34.

Subjecting the set of 2,984 differentially expressed genes to functional annotation clustering resulted in 16 clusters comprising 1,123 genes see Materials and Methods and Table S1 in Text S1. Eight of these clusters were involved in developmental processes, while the remaining 8 were related to normal cell functioning and presumably involved mainly ‘house‐keeping‘ genes. The largest cluster C2, see Table S1 in Text S1 included 460 genes associated with ‘positive regulation of macromolecule metabolic processes’, while the smallest cluster C16, see Table S1 in Text S1 included only 20 genes associated with ‘positive regulation of cell division’. We observed a variable degree of overlap between any two clusters, with a maximum overlap of 95 genes between C2 and C14, both

6 6 Article

associated with positive and negative regulation of macromolecule metabolic processes respectively. The expression profiles of genes within a given cluster were more highly correlated than expected by chance P‐value 0.01, Figure 2A. However, they were only weakly correlated, with a median pairwise Pearson correlation coefficient of 0.1. Indeed, the fact that genes within a given cluster are likely to be associated with similar biological functions, processes or pathways, does not imply that they exhibit similar expression profiles. In particular, many genes within the same cluster displayed antagonistic expression patterns, as evidenced by the higher median pairwise absolute Pearson correlation coefficient of 0.6. Further fuzzy clustering based on the expression profiles of the genes in each of these 16 clusters see Materials and Methods filtered out genes with inconsistent expression patterns. As a result of this procedure, we identified 51 sub‐ clusters Table S1 in Text S1 comprising a total of 755 genes. By construction, the expression profiles of genes within the same sub‐clusters were not only more highly correlated than expected by chance P‐value 0.01, Figure 2C, but also more than those of genes within the same clusters medians 0.8 and 0.1 respectively, P‐value 2.0x10‐9, Wilcoxon Rank‐Sum test, see also Figure S1. Similarly, the number of sub‐clusters within each cluster is correlated with the number of genes in the cluster Pearson correlation coefficient 0.97. The largest sub‐cluster SC 7.3, see Table S2 in Text S1 consists of 57 genes, while the smallest sub‐clusters included only 12 genes SC 9.1, SC 13.3 and SC 14.1, see Table S2 in Text S1. The maximum overlap between any two sub‐clusters contained 32 genes, and was observed between a sub‐cluster associated with ‘embryonic organ development’ SC 4.1 and a sub‐cluster associated with ‘embryonic development’ SC 7.3. The majority of the sub‐clusters 27, see Table S2 in Text S1 and Figures S2 and S3 showed expression profiles that increased approximately monotonically as the cells differentiate with a log2 fold‐change greater than 1 from the first to the last time point. Eighteen sub‐clusters see Table S2 in Text S1 and Figures S2 and S3 exhibited expression profiles with the opposite trend with a log2 fold‐change smaller than ‐1 from the first to the last time point. Finally, despite the expression of their gene members significantly

7 CHAPTER 6.1 PUBLICATION #2, under review

changing over time, the remaining 6 sub‐clusters exhibited relatively constant expression profiles with a log2 fold‐change between ‐1 and 1 from the first to the last time point.

Biologically related genes are likely to interact with each other more often than would be expected by chance 35. To verify that our clusters and sub‐clusters indeed represent sets of genes with common functions and/or involved in similar processes or pathways, we quantified the number of experimentally verified and computationally predicted interactions in the STRING database 36 that are found between the members of a given cluster or sub‐cluster, and compared with the number observed for gene sets matching the sizes of the clusters or sub‐clusters randomly sampled among all genes represented in the 16 clusters or 51 sub‐clusters. We refer to this number as the ’intra‐ cluster or sub‐cluster STRING connectivity score‘ see Materials and Methods. We found that the average intra‐cluster and intra‐sub‐cluster STRING connectivity scores were significantly greater than expected by chance P‐values 0.01, Figures 2B and D. We saw no significant difference between the intra‐cluster STRING connectivity score and the intra‐ sub‐cluster STRING connectivity score P‐value 0.7, Wilcoxon Rank‐Sum test, suggesting that sub‐clustering based on expression profiles preserves the biological relationships between the genes in the clusters.

Finally, we examined the sub‐clusters for markers of pluripotency by comparing with genes in the ‘Plurinetwork’ 37. Most sub‐clusters were significantly enriched in such markers when compared to the genes whose expression was assayed in the microarray experiment 27/51 sub‐clusters, P‐values 0.05, Fisher’s Exact Test. Significantly, most sub‐clusters enriched in pluripotency markers had expression profiles that decreased as the cells differentiate 15/27 sub‐clusters, P‐value 0.001, hypergeometric test. As expected, all 4 sub‐clusters comprising Pou5f1, a characteristic of pluripotent stem cells 37 exhibited decreasing expression profiles over time. These results are consistent with a down‐regulation of the pluripotency network, which is a prerequisite of germ layer formation.

8 6 Article

Together, our observations suggest that, starting with biological annotation and gene expression data, our clustering method is able to identify groups of genes representing biological units that have the potential of advancing our understanding of complex phenotypes.

Network inference and validation To better understand the biological interactions underlying the differentiation of mESCs into embryoid bodies, we created a co‐expression network of sub‐clusters. A co‐ expression network of sub‐clusters is an undirected graph where nodes are sub‐clusters, and two nodes are connected by an edge if the expression profiles of the corresponding sub‐clusters are similar. Edges between two nodes in the network represent biological interactions between the functions, processes, and/or pathways associated with the respective sub‐clusters. We quantified the similarity between two expression profiles using the absolute Pearson correlation coefficient. Moreover, since Pearson correlation is a symmetric measure, edges are non‐directional. To avoid or minimize redundancy, self‐ edges were excluded. Several studies have used co‐expression networks to shed light on the mechanisms controlling a given phenotype 38–40.

We hypothesized that network edges associated with higher absolute Pearson correlation coefficients are more likely to represent real biological interactions. To verify this, we separated the edges of the co‐expression network of sub‐clusters describing the differentiation of mESCs into embryoid bodies into two groups, defining a ’top network‘ comprising the 10% of the edges with the highest absolute Pearson correlation coefficients and a ’bottom network‘ comprising the 10% of the edges with the lowest absolute Pearson correlation coefficients, see Materials and Methods. We found that sub‐ clusters connected by edges in the top network shared a number of genes which was significantly greater than that observed for the bottom network P‐value 2.1x10‐23, Wilcoxon Rank‐Sum test, see Figure 3A. This hints at the mutual interaction between the biological functions, processes and/or pathways represented by these sub‐clusters. Also, we analyzed the promoter regions of the genes in a sub‐cluster, searching for binding sites

9 CHAPTER 6.1 PUBLICATION #2, under review

for transcriptional regulators. If the top network describes actual biological interactions, the expression of genes in sub‐clusters connected by edges in the top network is likely to be controlled by a partially overlapping set of transcriptional regulators. Thus, we computed and compared the statistical significance of observing a given number of common transcriptional regulators between pairs of sub‐clusters connected by edges in the top and the bottom networks see Materials and Methods. We found that it was more likely to observe a significant enrichment in common transcriptional regulators for pairs of sub‐clusters connected by edges in the top network as compared with the bottom network P‐value 1.5x10‐6, Wilcoxon Rank‐Sum test, see Figure 3B. This result suggests that many genes in sub‐clusters connected by edges in the top network are co‐regulated. While these findings support our hypothesis, they might be influenced by the fact that the higher the overlap between two sub‐clusters, the higher the absolute Pearson correlation coefficient between the corresponding expression profiles. Hence, despite of its biological significance, the overlap between sub‐clusters, and thus, the number of common transcriptional regulators between two sub‐clusters, are not independent from the definition of the top and bottom networks. To rule out that our results only reflect the overlap between pairs of sub‐clusters, we next defined and computed two measures that explicitly exclude the genes that are common between any two sub‐clusters. First, we quantified the number of interactions in the STRING database 36 between pairs of genes in respective pairs of sub‐clusters connected by an edge, leaving out genes that are common between the corresponding sub‐clusters. We further refer to this measure as the ‘STRING connectivity score’ see Materials and Methods. We found that the edges of the top network had a significantly greater STRING connectivity score than those of the bottom network P‐value 0.0008, Wilcoxon Rank‐Sum test, see Figure 3C. Second, we estimated the similarity between any two sub‐clusters connected by an edge based on the GO annotation of their gene members, excluding those genes that are shared between the corresponding sub‐clusters see Materials and Methods. Once more, those sub‐clusters connected by an edge in the top network exhibited greater GO annotation similarities than those in the bottom network P‐value 0.001, Wilcoxon Rank‐Sum test, see Figure 3D.

10 6 Article

Taken together, the above results suggest that edges between nodes associated with high Pearson correlation coefficients are likely to describe real biological interactions between the functions, processes and/or pathways represented by the corresponding sub‐clusters.

Network analysis In order to better understand the interactions between the functions, processes and/or pathways involved in germ layer formation, we next analyzed the aforementioned top network, i.e., the network constructed with the 10% of the edges 128 edges in total with the highest absolute Pearson correlation coefficients Figure 4. Twelve out of the 51 sub‐clusters identified for the dataset describing the differentiation of mESCs into embryoid bodies see Table S2 in Text S1 were absent from this network. In particular, 4 out of the 6 sub‐clusters displaying relatively constant expression profiles in the dataset were among these 12 absent sub‐clusters, a significantly higher number than expected by chance P‐value 0.02, hypergeometric test. This observation suggests that function, processes, and/or pathways associated with sub‐clusters with relatively constant expression profiles participate in relatively weak interactions. Also, only one SC 7.2, see Table S2 in Text S1 out of the 12 sub‐clusters absent in the top network belonged to a development‐related cluster. Given that we identified a total of 18 development‐related sub‐clusters in the dataset see Table S2 in Text S1, this number is significantly lower than expected by chance P‐value 0.02, hypergeometric test, and is consistent with the importance of functions, processes, and pathways associated with development for germ layer formation.

Compared to random networks, biological networks have a greater fraction of highly connected nodes or ’hubs’ 41,42. At the gene level, hubs have been shown to be important for mediating interactions between genes represented by less connected nodes, improving the coordination between different parts of the network 43,44. Analogously, we hypothesize that highly connected sub‐clusters play important coordinating roles. Based on this assumption, we calculated the number of edges of each sub‐cluster in the network, i.e., the degree of the nodes. The median degree of the network was 7, which is

11 CHAPTER 6.1 PUBLICATION #2, under review

significantly higher than the median expected by chance P‐value 0.0001, computed by observing how many out of 10,000 random networks have median degree 7 or higher. In particular, we found 13 sub‐clusters with degree 10 or higher see Table S2 in Text S1. Seven of these sub‐cluster hubs were associated with developmental processes, functions, and/or pathways, and the other 6 with general regulatory functions such as ‘nucleotide binding’ and ‘DNA binding’, highlighting the relevance of transcriptional regulation and development to germ layer formation. In addition, although the degree of the sub‐clusters was only moderately correlated with their size Pearson correlation coefficient r 0.5, the 13 highest‐degree sub‐clusters were among the largest identified for the dataset median size 36 as compared to 20.5 for the remaining 26 sub‐clusters, P‐value 0.001, Wilcoxon Rank‐Sum test. Furthermore, we found that among the 16 sub‐clusters that had degrees higher than the median, 12 exhibited expression profiles that increased over time, with a log 2 fold‐change between the last and the first point in the time series greater than or equal to 1. This is significantly higher than expected by chance, considering that the entire network comprised a total of 20 sub‐clusters with such expression profiles see Table S2 in Text S1, P‐value 0.02, hypergeometric test. Overall, these results suggest that cell differentiation and germ layer formation is dominated by gene activation, mediated by coordinated transcriptional regulation.

Finally, many of the edges associated with the highest absolute Pearson correlation coefficient thicker edges in Figure 4 connect sub‐clusters representing developmental processes. Among them, we observed that most sub‐clusters representing similar biological functions, processes, and/or pathways but exhibiting antagonistic expression profiles are not directly connected by an edge, but rather through other sub‐clusters. For example, the two sets of sub‐clusters comprising SCs 15.3, 9.2, and 7.1, with decreasing expression profiles, and SCs 15.1, 9.3, 7.3 and 5.1, with increasing expression profiles, both annotated with organ, tube, embryonic and reproductive developmental processes, are not directly connected by any edges. The former includes genes such as Pou5f1, Sox2, Mycn, Zic3, Sall1, Esrrb, and Myc, which are considered to be the hallmarks of pluripotency. The latter comprises genes such as Hand1, Gata2, Gata4, Gata6, T, Foxa2, Id1, Fgf8, Hoxa1, and,

12 6 Article

Hoxb3, which are associated with differentiation into specific lineages. Tight regulation of all these genes is likely to be crucial for germ layer formation. However, we hypothesize that this is mainly accomplished by transcription factors in other biological units or other regulatory mechanisms, such as those mediated by miRNAs.

Together, the above results indicate that the co‐expression network of sub‐clusters captures real interactions between meaningful biological units, and, thus, constitutes a useful model of complex phenotypes.

Early vs. late cell differentiation After validating our approach to infer interactions between biological units that are relevant for a given complex phenotype, we applied it to gain new insights into the differentiation of mESC into embryoid bodies. To this end, we first clustered all samples in the aforementioned gene expression dataset using hierarchical clustering using Euclidean distance and complete linkage, see Figure S4. We concluded that the dataset was divided into two groups, namely ‘early’ and ‘late’ differentiation, which is consistent with results reported by Hailesellasse Sene et al. 30. Because we assumed that, despite the clear division, many genes would exhibit gradual changes, we constructed two datasets: ‘early differentiation’, comprising the first 6 time points 0, 6, 12, 18, 24, and 36h, and ‘late differentiation’, comprising the last 7 24h, 36h, 48h, 4d, 7d, 9d, and 14d. Next, for each dataset we identified two sets of clusters, and subsequently sub‐clusters, which were in turn used to infer two respective co‐expression networks of sub‐clusters. These networks were constructed as described in previous sections, based on the absolute Pearson correlation between the expression profiles of any two sub‐clusters. As before, we further defined two top networks comprising the 10% of the edges in the co‐expression networks with the highest absolute Pearson correlation coefficients, see Materials and Methods, to which we refer as ‘early differentiation’ and ‘late differentiation’ networks.

For early differentiation we obtained 10 clusters comprising 932 genes. All 10 clusters were associated with normal cell functioning, transcription machinery, cell cycle, and apoptosis. None was significantly enriched in biological functions, processes or

13 CHAPTER 6.1 PUBLICATION #2, under review

pathways associated with development. We also observed that, in this dataset, the expression of the pluripotency marker Pou5f1 decreases very gradually over time. Thus, compared to 0h, at 36h Pou5f1 exhibits a log2 fold‐change of 0.07, reaching ~50% of its total loss across the entire time series. This suggests that the cells maintain characteristics of pluripotency throughout the first six time points in the early differentiation dataset, which explains the lack of development‐related clusters in the corresponding network. These 10 clusters were further divided into 20 sub‐clusters comprising a total of 654 genes see Table S3 in Text S1.

The early differentiation network featured 19 edges and 15 out of 20 sub‐clusters identified among differentially expressed genes in the corresponding dataset Figure 5A. The 4 sub‐clusters with highest degree 4 or higher, as compared to the median of 2, see Table S3 in Text S1 were associated with ’transcription factor binding‘, ’RNA processing‘, ’protein localization‘, and ’death‘. Six out of 10 sub‐clusters exhibiting expression profiles that increased over time and 9 out of 10 sub‐clusters exhibiting expression profiles that decreased over time were included in the early differentiation network. We also noted that there was no significant difference between the degree of sub‐clusters with increasing expression profiles and that of sub‐clusters with decreasing expression profiles. Both results suggest a balance between up‐regulated and down‐regulated biological functions, processes, and/or pathways in early differentiation. In contrast to what we observed for the complete dataset Figure 4, in the early differentiation network sub‐clusters with antagonistic expression profiles were separated into disconnected sub‐networks Figure 5A. The weak correlation between the expression profiles of the respective sub‐clusters during the initial stages of differentiation is already evident from Figures S2 and S3. Indeed, unlike the sub‐clusters with expression profiles decreasing over time, which exhibited approximately monotonic patterns, most sub‐clusters with generally increasing expression profiles showed an initial downward trend and an inflection point around 6 hours i.e., the second time point. We conjecture that, while the regulation of antagonistic functions, processes, and/or pathways involving pluripotency and lineage commitment across the entire differentiation series entails indirect mutual interactions at the transcriptional level

14 6 Article

see Figure 4, during early differentiation it is mainly controlled by other mechanisms. Very little is known regarding the first 24 hours of embryoid body development using low adherence culture dishes which were used in this case, instead of the standard hanging drop method. However, we found that, while up‐regulation of cell‐cycle SC 10.2 primarily reflects the expression of genes associated with meiosis Tdrd1, Smc1b, Fanca, Mki67, Psmc3ip, Rad51c, Stag3, Stag2, Rad50, Syce2, Smc4, Cks2, Myh9, Piwil1, Nek2, Rec8, down‐regulation of cell cycle SC 10.1 reflects the expression of genes associated with mitosis Haus3, Haus4, Cdca5, Ccng1, Ercc6l, Tipin, Usp16, Ran, Tubb3, Ranbp1, Nup37, Birc5, Ruvbl1, Fzr1, Bub3, Cdc26, Zwint. Indeed, some evidence suggests that meiosis is a feature of early germ cell formation 45. Moreover, our results suggest that the down‐regulated mitosis‐related genes SC 10.1 interact with genes related to nucleotide binding, protein localization, and RNA processing SCs 6.1, 7.3 and 3.2. Specifically, the genes involved in protein localization SC 7.3 are also implicated in RNA transport Nupl1, Phax, Nmd3, Tpr, Sec13, Nup37, Nup35, Ndc1, and some of them Nupl1, Tpr, Sec13, Nup37, Nup35, Ndc1 might regulate the level of pluripotency genes in the nucleus 46.

For late differentiation, we obtained 22 clusters comprising 1,331 genes. Five out of these 22 clusters were associated with development, while the rest were related to normal cell functioning. The 22 clusters were further divided into 45 sub‐clusters comprising 906 genes see Table S4 in Text S1. The observation that the late differentiation network comprised more developmental related clusters than the early differentiation network is a consequence of most developmental related genes being significantly differentially expressed relatively late in time. Furthermore, the development related sub‐clusters in the early and late differentiation networks together do not fully recapitulate those in the complete differentiation dataset see Figures 4 and 5 and Tables S2, S3 and S4 in Text S1, suggesting that many development related genes are differentially expressed across the entire time range, and not necessarily between close time points.

The late differentiation network comprised 99 edges and 38 out of 45 sub‐clusters Figure 5B. Seven sub‐clusters, namely those associated with ’chordate embryonic development‘ SC 1.2, ’regulation of cell death‘ SC 8.2, ’transcription factor binding‘ SC 15 CHAPTER 6.1 PUBLICATION #2, under review

9.2, ’nucleotide binding‘ SC10.1 and 10.2, ’vasculature development‘ SC 19.1 and ’reproductive developmental process‘ SC 22.3, had degree 11 or higher as compared to the median of 3.5, see Table S4 in Text S1. This suggests a major role for sub‐clusters associated with such biological annotations in late differentiation. Most high‐degree sub‐ clusters exhibited expression profiles that increased with time see Figure 5B and Table S4 in Text S1. Moreover, out of 15 sub‐clusters in the network with increasing expression profiles, 14 had degrees higher than the median P‐value 1.5x10‐5, hypergeometric test. Additionally, most 5/7 sub‐clusters excluded from the network showed expression profiles that decreased over time. Together, these results highlight the importance of gene activation in late differentiation. Finally, consistent with the observations on the network inferred from the entire dataset, in both the early differentiation and late differentiation networks we observed that the sub‐clusters with highest degree tend to be among the largest, while the sub‐clusters excluded from the networks comprise fewer genes. Based on this, we suggest that progression of germ layer formation is controlled through the coordinated interaction of complex molecular function, biological processes, and/or pathways. In contrast to what we found in the network describing the differentiation across the entire time series Figure 4, where sub‐clusters concerned with pluripotency and differentiation with antagonistic expression profiles are only connected through other sub‐clusters, the late differentiation network exhibits direct edges between sub‐clusters with antagonistic expression profiles Figure 5B. A likely reason for this is that the late differentiation network specifically captures the interaction between sub‐clusters annotated with basic biological functions. For example, SC 7.1 comprises many genes involved in signaling pathways Hedgehog, Notch, Jak‐STAT, MAPK that directly control cell cycle SC 3.2KEGG Pathway: mmu05200 11,12. This explains the strong interaction observed between the two sub‐clusters.

Last, we compared the biological functions, processes, and/or pathways driving early differentiation and late differentiation. To this end, we first mapped the sub‐clusters in the early differentiation dataset to the sub‐clusters in the late differentiation dataset see Materials and Methods. All mapping pairs were associated with normal cell functioning

16 6 Article

see Table S5 in Text S1, which was expected since the early differentiation dataset did not include any developmental related sub‐clusters see red edges in Figure 5. We identified three pairs of mapped sub‐clusters related to protein localization between the early and late differentiation datasets see Table S5 in Text S1. This is consistent with the morphological changes accompanying cell differentiation, which most likely involve major protein transport and relocalization. However, only one of these mapped sub‐clusters was included in the early differentiation network, while all 3 were present in the late differentiation network, advocating for the increased importance of these processes as the cells differentiate and the system moves to a complex multicellular stage. We also identified sub‐clusters associated with cell death as some of those with highest degrees in both the early and late differentiation networks see Figure 5, suggesting its relevance in cell differentiation and germ layer formation, a fact well documented in the literature 49–51. Finally, we observed a relatively large number of sub‐clusters connected through edges with sub‐clusters associated with ’transcription factor binding’ in both the early and the late differentiation networks with maximum degrees 5 and 12 respectively, indicating a high significance of transcriptional regulation across all stages of germ layer formation.

Embryonic liver development vs adult liver regeneration

Adult liver has a unique ability to renew and repair after injury. To compare the transcriptional program of the regenerating liver with that of the developing liver, Otu et al. generated and analyzed microarray data of adult liver after partial hepatectomy as well as of the developing embryonic liver 31. We applied a procedure similar to that described above to both datasets see Materials and Methods for more details. For the adult liver regeneration dataset we obtained 12 clusters that were further divided into 28 sub‐ clusters Table S6 in Text S1. The largest cluster C6, 387 genes was associated with ’nucleotide binding‘, suggesting a general importance for transcriptional activity in liver regeneration. The second largest cluster C1 and the largest sub‐cluster SC 1.5, 67 genes were associated with ’cell cycle‘, whose regulation is necessarily relevant to tissue repair. In turn, the embryonic liver development dataset generated 24 clusters, which were further divided into 53 sub‐clusters Table S7 in Text S1. The largest cluster C4, 696 genes was 17 CHAPTER 6.1 PUBLICATION #2, under review

associated with ‘metal ion binding‘, and the largest sub‐cluster SC 3.1, 84 genes with ’embryonic development‘. Control of metal homeostasis has been shown to be essential throughout development as early as at the blastocyst stage 52. Furthermore, in agreement with the developmental nature of the data, we found 7 clusters associated with ’developmental process’.

As described for previous datasets, we then constructed two co‐expression networks of sub‐clusters, based on the absolute Pearson correlation coefficient between any two pairs of sub‐clusters in the liver regeneration and in the liver development datasets, respectively. As before, we further defined two ‘top’ and two ‘bottom’ networks comprising the 10% of the edges in the co‐expression networks with the highest and the lowest absolute Pearson correlation coefficients respectively, see Materials and Methods. We showed that edges with high absolute Pearson correlation coefficients are more likely represent biologically meaningful interactions than those with low absolute Pearson correlation coefficients see Figures 6 and 7. The corresponding top networks were used for further analysis. The top network for the adult liver regeneration dataset ‘liver regeneration network’ comprised 38 edges and 25 out of 28 total sub‐clusters Figure 8A. The network consisted of many disconnected sub‐networks, with the largest sub‐network involving 11 sub‐clusters. These 11 sub‐clusters were associated with normal metabolism, biosynthesis of small molecules and amino acids, and drug metabolism. Drug metabolism is presumably linked to the fact that the mice in this study had been given an analgesic drug following hepatectomy, but could also be attributed to the re‐establishment of adequate liver function over time. Additionally, as expected considering the annotation of the largest clusters, many sub‐clusters were associated with ’nucleotide binding‘ and ’cell cycle‘. However, none of the 7 sub‐clusters associated with ’cell cycle’ were part of the largest sub‐ network. Instead, ‘cell cycle’ sub‐clusters formed separate sub‐networks, and were mainly connected with sub‐clusters associated with ’nucleotide binding’. This suggests a strong interaction between ’cell cycle‘ and ’nucleotide binding‘, and a specific transcriptional control of the cell cycle in the regenerating liver. Re‐entry of hepatocytes into cell cycle and hepatocyte proliferation to reconstitute lost liver mass is well documented 53,54.

18 6 Article

Furthermore, our results underscore a strong role for transcriptional regulation of biosynthetic processes represented by SC 6.2 and SCs 3.2 and 8.2, respectively. Unlike the sub‐clusters describing the differentiation of mESCs into embryoid bodies, the sub‐clusters of the liver regeneration dataset did not exhibit monotonically increasing or decreasing expression profiles data not shown. Instead, for this dataset, the highest expression levels were observed between 12 hr and 18 hr. Similar findings were reported by Hsieh et al. 55. The top network for the developing embryonic liver ‘liver development network’ comprised 138 edges and 51 sub‐clusters out of 53 sub‐clusters Figure 8B. The two sub‐ clusters excluded from the network were related to ’metal ion binding‘; the fact that the network comprised 5 other sub‐clusters with similar annotation indicates that such molecular function is indeed relevant, but regulated at multiple levels and, possibly, organized in a redundant configuration. Furthermore, the liver development network presented a relatively high median degree of 6 P‐value 0.0001, computed by observing how many out of 10,000 random networks have median degree 6 or higher, suggesting mutual interactions between all biological functions, processes, and/or pathways involved. Similar to the network describing early differentiation of mESCs into embryoid bodies Figure 5A, sub‐clusters in the liver development network exhibiting increasing and decreasing expression profiles were mostly separated into disconnected sub‐networks. Twenty‐six sub‐clusters exhibited expression profiles decreasing over time, while 27 sub‐ clusters showed opposite trends. The sub‐cluster with the highest degree 10, SC 10.1, Figure 8B, Table S7 in Text S1 in the liver development network was related to ’blood vessel development‘; the development of hepatic vasculature has been shown to play an integral part in the generation of the liver bud 56. Along with this sub‐cluster, the 3 sub‐ clusters with the next highest degree 9, all putative hubs, exhibited expression patterns increasing over time. These observations indicate that gene activation has a major role in liver development. Moreover, while the highest‐degree sub‐clusters with decreasing expression profiles were found to be related to development and cell motion Figure 8B, top network, those with increasing expression profiles were associated with development and metabolism Figure 8B, bottom network. We hypothesize that this reflects the fact

19 CHAPTER 6.1 PUBLICATION #2, under review

that the liver becomes functional only after its cells have attained the proper spatial organization.

We did not find any one‐to‐one mapping between the liver regeneration dataset and the liver development dataset, indicating that the biological functions, processes, and/or pathways involved in liver regeneration and liver development are fundamentally different. Otu et al. 31 reached similar conclusions.

Discussion and Conclusions The primary outcome of microarray analysis is the identification of genes that are differentially expressed between two or more conditions. Perhaps the simplest method for detecting such genes consists of evaluating the log ratio of their expression between pairs of conditions. All genes differing by more than a cut‐off value are then considered differentially expressed 57. This filtering approach is usually effective, but does not provide information regarding the confidence that the identified genes are indeed differentially expressed and, hence, it is unreliable 58. Commonly used statistical methods include the standard t‐test and its analogues 59. Statistical significance of changes observed in expression relies on the accurate estimation of gene‐specific variances. This is fundamentally difficult when the number of replicates is small 60. Moreover, since these methods calculate a probability P‐value for each gene independently of others, the problem of multiple testing arises. A classical approach to this concern is to control the false discovery rate FDR 61,62. In practice, volcano plots, which combine the outcome of a statistical test with the magnitude of the changes, are routinely employed as a means to filter genes by visual inspection 63. In any case, the thresholds set on both fold‐change and significance are arbitrary, and, ultimately, determined individually for each dataset. Accordingly, we selected relevant genes by applying a combination of FDR and fold‐change filtering. For the particular processes analyzed in this manuscript, we found that gene sets of 3,000 genes work reasonably well. Both larger and smaller gene sets pose a problem for the functional annotation clustering. The upper size limit is defined by a procedural limitation see Materials and Methods. On

20 6 Article

the other hand, small gene sets tend to yield fewer clusters and smaller functional clusters, enriched in few and generic functional categories.

After selecting the genes that are putatively relevant to the complex phenotype of interest, we clustered them. The goal was to identify coherent gene sets representing biological functions, processes, and/or pathways. Genes are commonly clustered either based on their biological annotations or on their expression profiles. The former strategy does not utilize the information arising from the high‐throughput experiment, and, hence, is not specific to the condition under study. Additionally, this approach disregards the fact that, as we have shown here, genes associated with similar biological functions, processes and/or pathways may exhibit different expression profiles. In this case, it is not possible to derive a representative expression profile for the cluster based on that of its members, leading to difficulties in the interpretation of the results. On the other hand, the latter strategy results in clusters comprising genes with various biological functions or belonging to multiple processes or pathways. In turn, this poses a problem for standard functional analyses, such as Gene Set Enrichment analysis. Indeed, lack of enrichment often hinders a full assessment of the data. In contrast, our method clusters genes at two levels: first, based on their biological annotation, and second, based on their expression profiles. As a result, we obtain overlapping gene clusters, each comprising genes with both similar biological functions, processes, and/or pathways, and similar expression patterns. Thus, by design, these clusters constitute a simple but useful model to describe the biological units within a real biological system. An alternative approach to clustering that we consider worth exploring would involve the use of a weighting function to cluster the genes based on expression and biological annotation simultaneously, in a comparable manner as proposed by Liu et al. 64.

We then created a co‐expression network of sub‐clusters to study the interactions between the biological functions, processes and/or pathways represented by them. In particular, we constructed a simple sub‐cluster co‐expression network, based on pairwise correlations between the expression profiles of the sub‐clusters. This is a value‐based approach, in which two sub‐clusters are connected with each other in the network if and 21 CHAPTER 6.1 PUBLICATION #2, under review

only if the correlation between their expression profiles is above a certain threshold. Many network inference methods require such thresholds 65. These thresholds are arbitrary, and should be adjusted depending on the dataset and nature of problem being addressed. In general, the stricter the threshold, the sparser the inferred network. Only robust conclusions are supported independently of the chosen threshold. However, many alternative network inference methods could, in principle, be used to improve network reliability 66. In particular, a network inference method relying both on expression profiles and biological annotation would likely result in a worthwhile representation of the complex dynamics of interaction between biological functions, processes, and/or pathways. However, gold standard datasets, that would allow a suitable assessment of such methods, are scarce. Consequently, alternative validation strategies would be required to evaluate the method. We designed this study as a proof of principle for a generic method applicable to any gene expression dataset, and mostly relied on biologically meaningful insights as a means of method validation. Network‐based approaches have been successfully applied to compare different gene expression datasets e.g., 67. However, such methods depend on an ’expert network’ which is, given the data available under most circumstances, static. Our strategy is fundamentally different in that it allows us to build condition‐specific networks, and compare the impact of different treatments or conditions on both relevant biological functions, processes, and/or pathways and their interactions.

In summary, our co‐expression networks of sub‐clusters represent interacting biological units and provide new insights into gene expression data. Traditionally microarray data is analyzed at the gene level, by identifying differentially expressed genes and inferring a network to study the interactions between those genes. Not many studies focus on interactions between more complex biological units, such as entire pathways, and even less on their dynamics. Biological functions, processes, and/or pathways do not work in isolation, though. Moreover, their exact composition and interactions are known to change over time or across conditions. Relying solely on biological annotation and expression data, our method constitutes a novel approach to identifying and characterizing biologically relevant units and to understanding the interactions between them.

22 6 Article

Materials and Methods

Datasets and differential expression analysis

We analyzed 2 different datasets. GSE2972 GDS2666 corresponds to an mRNA expression profiling experiment for mouse embryonic stem cells mESC, line R1 undergoing undirected differentiation into embryoid bodies across 11 time points 30. Expression measurements span a period of two weeks following LIF removal: 0h undifferentiated mESCs, 6h, 12 h, 18h, 24h, 36h, 48h, 4d, 7d, 9d, and 14d, with 3 replicates per time point. Affymetrix probes were mapped to 12,917 transcripts in the Mouse Genome Informatics MGI database http://www.informatics.jax.org/ 68. Differentially expressed genes were identified between pairs of time points by applying the Empirical Bayes approach of the R package “Limma” 69. P‐values were corrected for multiple testing using the false discovery rate FDR 70. To validate the method we selected genes with FDR 0.01 and log2 fold‐change 1 from the entire dataset. This resulted in a total of 2,984 genes. We also divided the entire dataset into two overlapping subsets: ‘early differentiation’, comprising the first 6 time points, and ‘late differentiation’, comprising the last 7. Here, differentially expressed genes were identified using a slightly larger FDR to enable a higher sensitivity FDR 0.05 . A total of 2,735 genes were considered differentially expressed in comparisons among early differentiation time points, while 4,665 genes were considered differentially expressed in comparisons among late differentiation time points. If necessary, to grant more specificity, we further restricted the set of differentially expressed genes to a maximum of 3,000 genes, which is also an operational requirement for the functional annotation clustering see below. For this purpose, we designed a rank aggregation method that ranks genes based on two criteria. First, we sorted the genes based on the number of pairwise comparisons for which the gene is significantly differentially expressed. Second, if two genes are significantly differentially expressed for the same number of comparison, the gene with the smallest FDR across all comparisons was ranked higher.

23 CHAPTER 6.1 PUBLICATION #2, under review

GSE6998 corresponds to an mRNA expression profiling experiment comparing the developing embryonic liver and the regenerating adult liver in mice 31. We considered liver gene expression at 10.5, 11.5, 12.5, 13.5, 14.5, and 16.5 days post‐conception, as well at 1, 2, 6, 12, 18, 24, 30, 48, and 72 hours after partial hepatectomy in adult mice. Affymetrix probes were mapped to 19,808 transcripts in the MGI database. As per experimental design, the dataset was divided into 2 datasets, one for embryonic liver development ‘development dataset’, and one for adult liver regeneration ‘regeneration dataset’. Differentially expressed genes were identified between pairs of time points by applying the Empirical Bayes t‐statistics of the R package “Limma” 69. P‐values were corrected for multiple testing using the false discovery rate FDR 70. We selected genes with FDR 0. 01, restricting the final datasets to 3,000 genes using the aforementioned ranking aggregation method. The final development and regeneration datasets contained 3,000 and 2,401 genes respectively.

Functional Annotation Clustering

Functional Annotation Clustering was performed using DAVID http://david.abcc.ncifcrf.gov/, 18,71, with default parameters. DAVID does not process jobs with more than 3,000 genes due to resource limitations. Significantly over‐ or under‐ represented functional categories were identified using the whole mouse genome as background. We only considered the functional annotation categories ’GOTERM_BP_FAT‘, ’GOTERM_MF_FAT‘, and ’KEGG‘. Based on a standard enrichment analysis on a given gene set, DAVID clusters highly similar functional annotation terms by using a fuzzy clustering approach. Similarity between functional annotation terms is determined based on the genes associated with those terms. Basically, similar functional annotation terms should be associated with many of the same genes. The resulting functional annotation clusters are composed of functional annotation terms. In turn, each functional annotation term was assigned an FDR representing its enrichment in the gene set. We discarded functional annotation terms with FDR 0.01. Then, for each functional annotation cluster we constructed a gene cluster comprising all genes associated with the functional annotation

24 6 Article

terms in that particular functional annotation cluster. Finally, gene clusters were pairwise compared. Clusters reciprocally sharing over 90% of their members were merged together.

Expression Profile Clustering We then divided each gene cluster into sub‐clusters based on the expression profiles of its members. To this end, we used the R package ‘Mfuzz’ 72, an implementation of the fuzzy c‐mean algorithm. In contrast to hard clustering, fuzzy clustering establishes a degree of belonging for each element to each cluster, rather than an assignment of each element to just one cluster, reflecting ambiguity in the data. This is particularly important in the context of microarray data, which tend to be noisy. Setting a threshold on these gradual membership values allows controlling the definition of the clusters so that they comprise more or less tightly co‐expressed genes. Thus, given a collection of N elements

X x 1 ,x 2 ,...,x N  the fuzzy c‐mean algorithm returns a collection of c clusters

C C 1 ,C 2 ,...,C c  and a matrix of real values values W  (w ij ) with w ij 0,1 , i  1,...,N ,

j  1,...,c , where w ij indicates the membership of element xi to cluster C j The centroid of a given cluster is then defined as the mean of its members, weighted by their membership values. Hence the centroid of sub‐cluster C k is:

N x w m i1 iik Yk  N , w m i1 ik where m is a parameter controlling the “fuzziness”. Basically, m determines the influence of noise on the cluster analysis; while form  1 genes are assigned to exactly one cluster, for m  genes are uniformly assigned to all clusters. In our case, xi is a vector representing the expression profile of gene G i across the time series under study. To estimate the number of sub‐clusters, we ran the algorithm starting with 2 sub‐clusters, and iteratively increased the number of sub‐clusters by 1 at each step. We stopped this iteration when we obtained at least one sub‐cluster with less than U  10 members. The number of sub‐clusters at the previous step was then selected as parameter, so that all final sub‐clusters contain at least U members. To select a value for the parameter controlling

25 CHAPTER 6.1 PUBLICATION #2, under review

the “fuzziness” m we employed the inbuilt method of Mfuzz. By specifying a membership cut‐off value on the matrix W we can assign a gene to one or more sub‐clusters. As suggested by the developer of Mfuzz, to reduce the within‐sub‐cluster variation, we chose 0.7 as membership cut‐off value. Further we refer to the centroids of the sub‐clusters as their ‘expression profiles’.

Sub‐cluster annotation

Functional enrichment for the sub‐clusters was computed using the R package ‘RDAVIDWebService’ 73. We employed two different strategies to identify the biological function of the sub‐clusters:

 Enrichment against genome. Each sub‐cluster was compared with the entire mouse genome to identify enriched functional categories.

 Enrichment against cluster. Also, each sub‐cluster was compared with all the genes from their respective cluster. The genes from the cluster were used as ‘background’ to identify the enriched functional categories as compared to the clusters.

Cluster sub‐cluster validation

To demonstrate that clusters or sub‐clusters are not just random collections of genes, but meaningful groups of genes sharing biological annotation and exhibiting similar expression patterns, we used the STRING database http://string‐db.org/36. STRING is a database of known and predicted protein‐protein protein‐gene interactions. Functionally related genes are more likely to interact with each other than with genes related to other functions 13–16. On this basis, we counted the number of pairwise interactions between members of a cluster or sub‐cluster present in the STRING database version 9.1. We refer to this number as “intra‐cluster ‐sub‐cluster STRING connectivity score”. For each cluster sub‐cluster, 100 random set of genes of similar size were created. The random genes were sampled from the set of genes that constitute all clusters all sub‐clusters. Finally, we compared the intra‐cluster ‐sub‐cluster STRING connectivity scores of the clusters with the respective average intra‐cluster ‐sub‐cluster STRING connectivity

26 6 Article

scores of the 100 random gene sets. Self‐interactions were excluded from this analysis. The intra‐cluster STRING connectivity score for a given cluster C containing N genes G i G 1 ,G 2 ,G 3 ,...,G N  is defined as:

ij i j NN1,if ( G , G )  STRING G G  ij110, otherwise intra clusterSTRING connectivity (C)  . N2

Biologically meaningful clusters sub‐clusters are expected to have higher intra‐cluster STRING connectivity scores than random gene sets.

Co‐expression network inference The expression profiles of the sub‐clusters were used to construct a co‐expression network. Basically, we computed the absolute Pearson correlation coefficient between pairs of vectors representing the expression profiles of any two sub‐clusters. Nodes in the network represent sub‐clusters, while edges are weighted by the absolute Pearson correlation coefficient between the expression profiles of the two sub‐clusters at their ends. We excluded self‐edges sub‐clusters interacting with themselves. Edges associated with higher absolute Pearson correlation coefficients are, in principle, more likely to constitute biologically‐significant interactions 74–77. Therefore, a threshold can be applied to the resulting network to retain only edges corresponding to the most biologically significant interactions.

Co‐expression network evaluation and validation To verify that edges associated with high absolute Pearson correlation coefficients are more likely to represent real biological interactions, we created two separate networks. We first sorted the edges according to their absolute Pearson correlation coefficient. Then, we defined a ‘top network’ using the 10% of the edges with the highest absolute Pearson correlation coefficients, and a ‘bottom network’ using the 10% of the edges with the lowest absolute Pearson correlation coefficients. We used the following four criteria to assess the quality of the top network as compared with that of the bottom network:

1. Sub‐cluster overlap.

27 CHAPTER 6.1 PUBLICATION #2, under review

1 2 3 N 1 1 2 3 N 2 Let C 1 G 1 ,G 1 ,G 1 ,...,G 1  and C 2 G 2 ,G 2 ,G 2 ,...,G 2  be two sub‐clusters containing

N 1 and N 2 genes, respectively, connected by an edge in a network. We defined the

overlap between C 1 and C 2 as:

2CC12 Overlap(, C12 C ) NN12

where C 1 C 2 is the number of genes in C 1 that are also members of C 2 . We hypothesize that interactions between biological functions, processes, and/or pathways involve a relatively large number of common genes. Therefore, we expect the sub‐ clusters associated with edges in the top network to exhibit a significantly higher overlap than the sub‐clusters associated with edges in the bottom network. The overlap between two sub‐clusters, however, is not independent from the absolute Pearson correlation coefficient between the corresponding expression profiles. Hence, we do not strictly consider the sub‐cluster overlap as validation.

2. Common transcriptional regulators.

For the genes in each sub‐cluster, we first located the transcription start site using the annotation in the Mouse Genome Database http://www.informatics.jax.org/, 68. Subsequently we obtained the sequence comprising 500 base‐pairs bp upstream and 500 bp downstream of the transcription start site, and finally ran MEME 78,79 to identify enriched motifs. MEME was run with default parameters, imposing 10 motifs. The resulting log‐odds matrices of the motifs were used to search against a database of known motifs JASPAR CORE 80, using TOMTOM 81 with default parameters. Given a pair of sub‐clusters connected by an edge, we then compared the motifs that were identified in the corresponding promoter regions. The number of motifs commonly identified for the two sets of promoter regions were statistically assessed using Fisher's exact test, using the entire database of known motifs as background. Finally, we analyzed the P‐values obtained for pairs of sub‐clusters in the top network with those computed for pairs of sub‐clusters in the bottom network using the Wilcoxon Rank‐Sum

28 6 Article

test. Assuming the edges in the top network are more likely to represent real biological interactions, involving common transcriptional regulators, the P‐values for the edges in the top network should be considerably smaller than the P‐values for the edges in the bottom network.

3. STRING connectivity.

1 2 3 N 1 1 2 3 N 2 Let C 1  G 1 ,G 1 ,G 1 ,...,G 1  and C 2  G 2 ,G 2 ,G 2 ,...,G 2  be two sub‐clusters containing

N 1 and N 2 genes, respectively, connected by an edge in a network. We first computed the number of pairwise interactions between members of these two sub‐clusters that are actually present in STRING version 9.1. Then, we compared this number with the total possible number of interactions:

ij i j NN12 1,if ( G12 , G )  STRING G 1 G 2  ij110, otherwise STRING connectivity(, C12 C ) . NN12

We refer to this value as the ‘STRING connectivity’ score between two sub‐clusters. Assuming the edges in the top network are more likely to represent actual biological interactions than those in the bottom network, the edges in the top network should have a higher STRING connectivity score than the edges in the bottom network.

4. GO annotation similarity.

1 2 3 N 1 1 2 3 N 2 Let C 1  G 1 ,G 1 ,G 1 ,...,G 1  and C 2  G 2 ,G 2 ,G 2 ,...,G 2  be two sub‐clusters

containing N 1 and N 2 genes respectively, connected by an edge in a network. We first

excluded all genes in C 1 that are also members of C 2 . If any of the resulting sub‐ clusters was empty, this particular pair of sub‐clusters was excluded from subsequent calculations. Then, we computed the functional distance between the genes in the two sub‐clusters using the approach described in 82. Briefly, the method computes pairwise similarity of genes based on an estimate of their distance to the closest common ancestor in the GO annotation tree, similar to Wang et al. 83. The similarity

29 CHAPTER 6.1 PUBLICATION #2, under review

between two gene sets is defined as the average similarity between all pairs of genes. Assuming that biologically‐relevant interactions are more enriched in the top network than in the bottom network, we expect the GO annotation similarity for the edges in the top network to be greater than that for the edges in the bottom network.

Degree analysis of co‐expression network After verifying that edges associated with a higher absolute Pearson correlation coefficient are more likely to represent actual biological interactions, we created a network including only the edges associated with the highest 10% of the absolute Pearson correlation coefficients. The sub‐clusters missing from this network are assumed to have a minor influence on the regulation of other sub‐clusters. We then calculated the number of edges connecting each sub‐cluster to other sub‐clusters in the network ’sub‐cluster degree’.

Comparisons between different biological units across different datasets In order to compare the sets of sub‐clusters from two datasets, we first defined a mapping function that identifies pairs of sub‐clusters in both datasets representing the same biological functions, processes, and/or pathways. For this purpose, given two datasets X and Y, we first calculated a pairwise overlap matrix for all sub‐clusters in each

X Y dataset. More precisely, we compare each sub‐cluster C i in X with each sub‐cluster C j in Y, by determining the number of common gene members and dividing it by number of gene

X members in the largest sub‐cluster. Finally, a sub‐cluster C i in X is mapped onto sub‐

Y X Y Y cluster C j in Y if and only if C i has the highest overlap with C j , and C j has the highest

X overlap with C i . All mappings including multiple sub‐clusters were removed. Finally, after visual inspection we discarded the mapping between SC 6.2 ‘nucleotide binding’ in the early differentiation of mESCs into embryoid bodies dataset and SC 3.1 ‘cell cycle’ in the late differentiation of mESCs into embryoid bodies datasets GSE2972.

30 6 Article

Acknowledgements This work was supported in part by the German Research Foundation DFG, Priority Program Pluripotency and Cellular Reprogramming, FUE 583/2‐2 and the German Federal Ministry of Education and Research BMBF, Validierung des Innovationspotenzials wissenschaftlicher Forschung – VIP, 03 V0396.

31 CHAPTER 6.1 PUBLICATION #2, under review

References

1. Boss PK, Bastow RM, Mylne JS, Dean C 2004 Multiple pathways in the decision to flower: enabling, promoting, and resetting. Plant Cell 16 Suppl: S18–S31. Available: http://dx.doi.org/10.1105/tpc.015958.

2. Gordon MD, Nusse R 2006 Wnt signaling: multiple pathways, multiple receptors, and multiple transcription factors. J Biol Chem 281: 22429–22433. Available: http://dx.doi.org/10.1074/jbc.R600015200.

3. Guo X, Wang X‐F 2009 Signaling cross‐talk between TGF‐beta/BMP and other pathways. Cell Res 19: 71–88. Available: http://dx.doi.org/10.1038/cr.2008.302.

4. Larkindale J, Hall JD, Knight MR, Vierling E 2005 Heat stress phenotypes of Arabidopsis mutants implicate multiple signaling pathways in the acquisition of thermotolerance. Plant Physiol 138: 882–897. Available: http://dx.doi.org/10.1104/pp.105.062257.

5. Louis DN, Gusella JF 1995 A tiger behind many doors: multiple genetic pathways to malignant glioma. Trends in Genetics 11: 412–415. Available: http://dx.doi.org/10.1016/S0168‐95250089125‐8.

6. Parekh DB, Ziegler W, Parker PJ 2000 Multiple pathways control protein kinase C phosphorylation. EMBO J 19: 496–503. Available: http://dx.doi.org/10.1093/emboj/19.4.496.

7. Ribes V, Le Roux I, Rhinn M, Schuhbaur B, Dolle P 2009 Early mouse caudal development relies on crosstalk between retinoic acid, Shh and Fgf signalling pathways. Development 136: 665–676. Available: http://dx.doi.org/10.1242/dev.016204.

8. Tusher VG, Tibshirani R, Chu G 2001 Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences 98: 5116–5121. Available: http://dx.doi.org/10.1073/pnas.091062498.

9. Wolfinger RD, Gibson G, Wolfinger ED, Bennett L, Hamadeh H, et al. 2001 Assessing Gene Significance from cDNA Microarray Expression Data via Mixed Models. Journal of Computational Biology 8: 625–637. Available: http://dx.doi.org/10.1089/106652701753307520.

10. Botstein D, Cherry JM, Ashburner M, Ball CA, Blake JA, et al. 2000 Gene Ontology: tool for the unification of biology. Nature Genetics 25: 25–29. Available: http://dx.doi.org/10.1038/75556.

11. Kanehisa M 2000 KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research 28: 27–30. Available: http://dx.doi.org/10.1093/nar/28.1.27.

32 6 Article

12. Kanehisa M, Goto S, Sato Y, Kawashima M, Furumichi M, et al. 2014 Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Research 42: D199– D205. Available: http://dx.doi.org/10.1093/nar/gkt1076.

13. Cordell HJ 2009 Detecting gene‐gene interactions that underlie human diseases. Nat Rev Genet 10: 392–404. Available: http://dx.doi.org/10.1038/nrg2579.

14. Emily M, Mailund T, Hein J, Schauser L, Schierup MH 2009 Using biological networks to search for interacting loci in genome‐wide association studies. Eur J Hum Genet 17: 1231– 1240. Available: http://dx.doi.org/10.1038/ejhg.2009.15.

15. Iossifov I, Zheng T, Baron M, Gilliam TC, Rzhetsky A 2008 Genetic‐linkage mapping of complex hereditary disorders to a whole‐genome molecular‐interaction network. Genome Res 18: 1150–1162. Available: http://dx.doi.org/10.1101/gr.075622.107.

16. Peri S, Navarro JD, Amanchy R, Kristiansen TZ, Jonnalagadda CK, et al. 2003 Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res 13: 2363–2371. Available: http://dx.doi.org/10.1101/gr.1680803.

17. Choi D, Fang Y, Mathers WD 2006 Condition‐specific coregulation with cis‐regulatory motifs and modules in the mouse genome. Genomics 87: 500–508. Available: http://dx.doi.org/10.1016/j.ygeno.2005.11.015.

18. Huang DW, Sherman BT, Lempicki RA 2009 Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 4: 44–57. Available: http://dx.doi.org/10.1038/nprot.2008.211.

19. Nguyen TT, Foteinou PT, Calvano SE, Lowry SF, Androulakis IP 2011 Computational identification of transcriptional regulators in human endotoxemia. PLoS One 6: e18889. Available: http://dx.doi.org/10.1371/journal.pone.0018889.

20. Young ET 2003 Multiple Pathways Are Co‐regulated by the Protein Kinase Snf1 and the Transcription Factors Adr1 and Cat8. Journal of Biological Chemistry 278: 26146–26158. Available: http://dx.doi.org/10.1074/jbc.M301981200.

21. Allocco DJ, Kohane IS, Butte AJ 2004 Quantifying the relationship between co‐expression, co‐regulation and gene function. BMC Bioinformatics 5: 18. Available: http://dx.doi.org/10.1186/1471‐2105‐5‐18.

22. Altman RB, Raychaudhuri S 2001 Whole‐genome expression analysis: challenges beyond clustering. Curr Opin Struct Biol 11: 340–347.

23. Schulze A, Downward J 2001 Navigating gene expression using microarrays–a technology review. Nat Cell Biol 3: E190–E195. Available: http://dx.doi.org/10.1038/35087138.

24. Bansal M, Belcastro V, Ambesi‐Impiombato A, Bernardo D di 2007 How to infer gene networks from expression profiles. Molecular Systems Biology 3. Available: http://dx.doi.org/10.1038/msb4100120.

33 CHAPTER 6.1 PUBLICATION #2, under review

25. Gardner TS 2003 Inferring Genetic Networks and Identifying Compound Mode of Action via Expression Profiling. Science 301: 102–105. Available: http://dx.doi.org/10.1126/science.1081900.

26. Nachman I, Regev A, Friedman N 2004 Inferring quantitative models of regulatory networks from expression data. Bioinformatics 20: i248–i256. Available: http://dx.doi.org/10.1093/bioinformatics/bth941.

27. Hsu C‐L, Yang U‐C 2012 Discovering pathway cross‐talks based on functional relations between pathways. BMC Genomics Suppl 7:S25: 13.

28. Kelder T, Eijssen L, Kleemann R, Erk M van, Kooistra T, et al. 2011 Exploring pathway interactions in insulin resistant mouse liver. BMC Systems Biology 5: 127. Available: http://dx.doi.org/10.1186/1752‐0509‐5‐127.

29. Tarca AL, Draghici S, Khatri P, Hassan SS, Mittal P, et al. 2008 A novel signaling pathway impact analysis. Bioinformatics 25: 75–82. Available: http://dx.doi.org/10.1093/bioinformatics/btn577.

30. Hailesellasse Sene K, Porter CJ, Palidwor G, Perez‐Iratxeta C, Muro EM, et al. 2007 Gene function in early mouse embryonic stem cell differentiation. BMC Genomics 8: 85. Available: http://dx.doi.org/10.1186/1471‐2164‐8‐85.

31. Otu HH, Naxerova K, Ho K, Can H, Nesbitt N, et al. 2007 Restoration of liver mass after injury requires proliferative and not embryonic transcriptional patterns. J Biol Chem 282: 11197– 11204. Available: http://dx.doi.org/10.1074/jbc.M608441200.

32. Khatri P, Sirota M, Butte AJ 2012 Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges. PLoS Comput Biol 8: e1002375. Available: http://dx.doi.org/10.1371/journal.pcbi.1002375.

33. Wang K, Li M, Hakonarson H 2010 Analysing biological pathways in genome‐wide association studies. Nature Reviews Genetics 11: 843–854. Available: http://dx.doi.org/10.1038/nrg2884.

34. Doetschman TC, Eistetter H, Katz M, Schmidt W, Kemler R 1985 The in vitro development of blastocyst‐derived embryonic stem cell lines: formation of visceral yolk sac, blood islands and myocardium. J Embryol Exp Morphol 87: 27–45.

35. Scott J, Ideker T, Karp RM, Sharan R 2006 Efficient algorithms for detecting signaling pathways in protein interaction networks. J Comput Biol 13: 133–144. Available: http://dx.doi.org/10.1089/cmb.2006.13.133.

36. Mering C von, Huynen M, Jaeggi D, Schmidt S, Bork P, et al. 2003 STRING: a database of predicted functional associations between proteins. Nucleic Acids Res 31: 258–261.

34 6 Article

37. Som A, Harder C, Greber B, Siatkowski M, Paudel Y, et al. 2010 The PluriNetWork: an electronic representation of the network underlying pluripotency in mouse, and its applications. PLoS One 5: e15165. Available: http://dx.doi.org/10.1371/journal.pone.0015165.

38. Butte AJ, Kohane IS 2000 Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Pac Symp Biocomput: 418–429.

39. Carter SL, Brechbuhler CM, Griffin M, Bond AT 2004 Gene co‐expression network topology provides a framework for molecular characterization of cellular state. Bioinformatics 20: 2242–2250. Available: http://dx.doi.org/10.1093/bioinformatics/bth234.

40. Stuart JM 2003 A Gene‐Coexpression Network for Global Discovery of Conserved Genetic Modules. Science 302: 249–255. Available: http://dx.doi.org/10.1126/science.1087447.

41. Barabási A‐L, Oltvai ZN 2004 Network biology: understanding the cell’s functional organization. Nat Rev Genet 5: 101–113. Available: http://dx.doi.org/10.1038/nrg1272.

42. Oltvai ZN, Barabási A‐L, Jeong H, Tombor B, Albert R 2000 The large‐scale organization of metabolic networks. Nature 407: 651–654. Available: http://dx.doi.org/10.1038/35036627.

43. Costanzo M, Baryshnikova A, Bellay J, Kim Y, Spear ED, et al. 2010 The Genetic Landscape of a Cell. Science 327: 425–431. Available: http://dx.doi.org/10.1126/science.1180823.

44. Lehner B, Crombie C, Tischler J, Fortunato A, Fraser AG 2006 Systematic mapping of genetic interactions in Caenorhabditis elegans identifies common modifiers of diverse signaling pathways. Nat Genet 38: 896–903. Available: http://dx.doi.org/10.1038/ng1844.

45. Clark AT 2004 Spontaneous differentiation of germ cells from human embryonic stem cells in vitro. Human Molecular Genetics 13: 727–739. Available: http://dx.doi.org/10.1093/hmg/ddh088.

46. Yang J, Cai N, Yi F, Liu G‐H, Qu J, et al. 2014 Gating pluripotency via nuclear pores. Trends in Molecular Medicine 20: 1–7. Available: http://dx.doi.org/10.1016/j.molmed.2013.10.003.

47. Hindley C, Philpott A 2013 The cell cycle and pluripotency. Biochemical Journal 451: 135– 143. Available: http://dx.doi.org/10.1042/BJ20121627.

48. Williamson AJK, Smith DL, Blinco D, Unwin RD, Pearson S, et al. 2007 Quantitative Proteomics Analysis Demonstrates Post‐transcriptional Regulation of Embryonic Stem Cell Differentiation to Hematopoiesis. Molecular & Cellular Proteomics 7: 459–472. Available: http://dx.doi.org/10.1074/mcp.M700370‐MCP200.

49. Cheung H‐H, Liu X, Rennert OM 2012 Apoptosis: Reprogramming and the Fate of Mature Cells. ISRN Cell Biology 2012: 1–8. Available: http://dx.doi.org/10.5402/2012/685852.

35 CHAPTER 6.1 PUBLICATION #2, under review

50. Duval D, Trouillas M, Thibault C, Dembele D, Diemunsch F, et al. 2006 Apoptosis and differentiation commitment: novel insights revealed by gene profiling studies in mouse embryonic stem cells. Cell Death Differ 13: 564–575. Available: http://dx.doi.org/10.1038/sj.cdd.4401789.

51. Zhai Z, Ha N, Papagiannouli F, Hamacher‐Brady A, Brady N, et al. 2012 Antagonistic Regulation of Apoptosis and Differentiation by the Cut Transcription Factor Represents a Tumor‐Suppressing Mechanism in Drosophila. PLoS Genet 8: e1002582. Available: http://dx.doi.org/10.1371/journal.pgen.1002582.

52. Kambe T, Weaver BP, Andrews GK 2008 The genetics of essential metal homeostasis during development. genesis 46: 214–228. Available: http://dx.doi.org/10.1002/dvg.20382.

53. Kohler UA, Kurinna S, Schwitter D, Marti A, Schafer M, et al. 2013 Activated Nrf2 impairs liver regeneration in mice by activation of genes involved in cell cycle control and apoptosis. Hepatology. Available: http://dx.doi.org/10.1002/hep.26964.

54. Michalopoulos GK 2007 Liver regeneration. J Cell Physiol 213: 286–300. Available: http://dx.doi.org/10.1002/jcp.21172.

55. Hsieh H‐C, Chen Y‐T, Li J‐M, Chou T‐Y, Chang M‐F, et al. 2009 Protein Profilings in Mouse Liver Regeneration after Partial Hepatectomy Using iTRAQ Technology. Journal of Proteome Research 8: 1004–1013. Available: http://dx.doi.org/10.1021/pr800696m.

56. Zhao R, Duncan SA 2005 Embryonic development of the liver. Hepatology 41: 956–967. Available: http://dx.doi.org/10.1002/hep.20691.

57. Saeys Y, Inza I, Larranaga P 2007 A review of feature selection techniques in bioinformatics. Bioinformatics 23: 2507–2517. Available: http://dx.doi.org/10.1093/bioinformatics/btm344.

58. Chen Y, Dougherty ER, Bittner ML 1997 Ratio‐based decisions and the quantitative analysis of cDNA microarray images. J Biomed Opt 2: 364–374. Available: http://dx.doi.org/10.1117/12.281504.

59. Devore JL, Peck R 1996 Statistics: The Exploration and Analysis of Data. Wadsworth Pub Co. Available: http://www.amazon.com/Statistics‐The‐Exploration‐Analysis‐ Data/dp/0534228968%3FSubscriptionId%3D0JYN1NVW651KCA56C102%26tag%3Dtechki e‐ 20%26linkCode%3Dxm2%26camp%3D2025%26creative%3D165953%26creativeASIN%3 D0534228968.

60. Lee M‐LT, Kuo FC, Whitmore GA, Sklar J 2000 Importance of replication in microarray gene expression studies: Statistical methods and evidence from repetitive cDNA hybridizations. Proceedings of the National Academy of Sciences 97: 9834–9839. Available: http://dx.doi.org/10.1073/pnas.97.18.9834.

36 6 Article

61. Gusnanto A, Calza S, Pawitan Y 2007 Identification of differentially expressed genes and false discovery rate in microarray studies. Curr Opin Lipidol 18: 187–193. Available: http://dx.doi.org/10.1097/MOL.0b013e3280895d6f.

62. Pawitan Y, Michiels S, Koscielny S, Gusnanto A, Ploner A 2005 False discovery rate, sensitivity and sample size for microarray studies. Bioinformatics 21: 3017–3024. Available: http://dx.doi.org/10.1093/bioinformatics/bti448.

63. Cui X, Churchill GA 2003 Statistical tests for differential expression in cDNA microarray experiments. Genome Biology 4: 210. Available: http://dx.doi.org/10.1186/gb‐2003‐4‐4‐210.

64. Liu J, Wang W, Yang J 2004 Gene Ontology friendly biclustering of expression profiles. Proc IEEE Comput Syst Bioinform Conf: 436–447.

65. Huynh‐Thu VA, Irrthum A, Wehenkel L, Geurts P 2010 Inferring Regulatory Networks from Expression Data Using Tree‐Based Methods. PLoS ONE 5: e12776. Available: http://dx.doi.org/10.1371/journal.pone.0012776.

66. Marbach D, Costello JC, Küffner R, Vega NM, Prill RJ, et al. 2012 Wisdom of crowds for robust gene network inference. Nature Methods 9: 796–804. Available: http://dx.doi.org/10.1038/nmeth.2016.

67. Warsow G, Greber B, Falk SSI, Harder C, Siatkowski M, et al. 2010 ExprEssence–revealing the essence of differential experimental data in the context of an interaction/regulation net‐ work. BMC Syst Biol 4: 164. Available: http://dx.doi.org/10.1186/1752‐0509‐4‐164.

68. Blake JA, Richardson JE, Davisson MT, Eppig JT 1997 The Mouse Genome Database MGD. A comprehensive public resource of genetic, phenotypic and genomic data. The Mouse Genome Informatics Group. Nucleic Acids Res 25: 85–91.

69. Smyth GK 2004 Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 3: Article3. Available: http://dx.doi.org/10.2202/1544‐6115.1027.

70. Benjamini Y, Hochberg Y 1995 Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society Series B Methodological 57, No. 1: 289–300.

71. Huang DW, Sherman BT, Lempicki RA 2009 Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res 37: 1–13. Available: http://dx.doi.org/10.1093/nar/gkn923.

72. Kumar L, E Futschik M 2007 Mfuzz: a software package for soft clustering of microarray data. Bioinformation 2: 5–7.

73. Fresno C, Fernández EA 2013 RDAVIDWebService: a versatile R interface to DAVID. Bioinformatics 29: 2810–2811. Available: http://dx.doi.org/10.1093/bioinformatics/btt487.

37 CHAPTER 6.1 PUBLICATION #2, under review

74. Aoki K, Ogata Y, Shibata D 2007 Approaches for extracting practical information from gene co‐expression networks in plant biology. Plant Cell Physiol 48: 381–390. Available: http://dx.doi.org/10.1093/pcp/pcm013.

75. D’haeseleer P, Liang S, Somogyi R 2000 Genetic network inference: from co‐expression clustering to reverse engineering. Bioinformatics 16: 707–726.

76. Eisen MB, Spellman PT, Brown PO, Botstein D 1998 Cluster analysis and display of genome‐ wide expression patterns. Proc Natl Acad Sci U S A 95: 14863–14868.

77. Ihmels J, Levy R, Barkai N 2004 Principles of transcriptional control in the metabolic network of Saccharomyces cerevisiae. Nat Biotechnol 22: 86–92. Available: http://dx.doi.org/10.1038/nbt918.

78. Bailey TL, Williams N, Misleh C, Li WW 2006 MEME: discovering and analyzing DNA and protein sequence motifs. Nucleic Acids Research 34: W369–W373. Available: http://dx.doi.org/10.1093/nar/gkl198.

79. Bailey TL EC 1994 Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 2: 28–36.

80. Mathelier A, Zhao X, Zhang AW, Parcy F, Worsley‐Hunt R, et al. 2014 JASPAR 2014: an extensively expanded and updated open‐access database of transcription factor binding profiles. Nucleic Acids Res 42: D142–D147. Available: http://dx.doi.org/10.1093/nar/gkt997.

81. Gupta S, Stamatoyannopoulos JA, Bailey TL, Noble WS 2007 Quantifying similarity between motifs. Genome Biol 8: R24. Available: http://dx.doi.org/10.1186/gb‐2007‐8‐2‐r24.

82. Ruths T, Ruths D, Nakhleh L 2009 GS2: an efficiently computable measure of GO‐based similarity of gene sets. Bioinformatics 25: 1178–1184. Available: http://dx.doi.org/10.1093/bioinformatics/btp128.

83. Wang JZ, Du Z, Payattakool R, Yu PS, Chen C‐F 2007 A new method to measure the semantic similarity of GO terms. Bioinformatics 23: 1274–1281. Available: http://dx.doi.org/10.1093/bioinformatics/btm087.

38 6 Article

Figures

Figure 1. Overview of the method. Genes in the experiment are first clustered based on functional annotation similarity. Subsequently, the genes in each cluster are grouped into sub‐clusters, based on similarity of their expression profiles. The expression profiles of the sub‐clusters are then used to infer a co‐expression network of sub‐clusters based on the absolute Pearson correlation coefficient. The network is validated by creating two networks: the ‘top network’, consisting of the 10% of the edges with the highest absolute Pearson

39

CHAPTER 6.1 PUBLICATION #2, under review

correlation coefficient, and the ‘bottom network’, consisting of the 10% of the edges with the lowest absolute Pearson correlation coefficient. Finally, the ‘top network’ is used for further detailed analysis.

40

6 Article

Figure 2. Cluster and sub‐cluster validation for the differentiation of mESCs into embryoid bodies. The boxes represent the interquartile range; the medians are indicated by thick lines; the whiskers extend to the most extreme data points which are no more than 1.5 times of the interquartile range of the data. P‐values were determined empirically by sampling 100 random gene sets matching the sizes of the clusters from the set of all genes represented in the 16 clusters or 51 sub‐clusters. A Average pairwise correlation between the expression profiles of genes in the same cluster, as compared to random gene sets. B Average intra‐ cluster STRING connectivity between genes in the same cluster, as compared to random gene sets. C Average pairwise correlation between the expression profiles of genes in the same sub‐cluster, as compared to random gene sets. D Average intra‐sub‐cluster STRING connectivity between genes in the same sub‐cluster, as compared to random gene sets.

41

CHAPTER 6.1 PUBLICATION #2, under review

Figure 3. Comparison of the top and bottom networks describing the differentiation of mESCs into embryoid bodies. The boxes represent the interquartile range; the medians are indicated by thick lines; the whiskers extend to the most extreme data points which are no more than 1.5 times interquartile range of the data. P‐values were computed using the Wilcoxon Rank‐

42

6 Article

Sum test. A Average sub‐cluster overlap for sub‐clusters connected by edges in the top network as compared to the bottom network. B Average significance log10 P‐value of observing common transcriptional regulators binding to the promoters of genes in sub‐ clusters connected by edges in the top network as compared to the bottom network. C Average STRING connectivity score between sub‐clusters connected by edges in the top network as compared to the bottom network. D Average GO annotation similarity score between sub‐clusters connected by edges in the top network as compared to the bottom network.

Figure 4. Top network describing the differentiation of mESCs into embryoid bodies. Node labels include sub‐cluster numbering and first annotation term from ‘Sub‐cluster Enrichment’ annotation. Nodes are colored in yellow if the log2 fold‐change between the last

43

CHAPTER 6.1 PUBLICATION #2, under review

and the first time point is greater than 1, blue if the log2 fold‐change between the last and the first time point is less than ‐1, and white otherwise.

Figure 5. Top networks describing early and late differentiation of mESCs into embryoid bodies. See Figure 4 for details regarding the layout. Nodes in red indicate the nodes that has been successfully mapped between the early and late differentiation datasets. Edges in red indicate those edges that connect to at least one sub‐cluster that has been successfully mapped between the early and late differentiation datasets. A Early differentiation network, constructed based on time points 0 to 36 hours of the differentiation of mESCs into embryoid bodies. B Late differentiation network, constructed based on time points 24 hours to 14 days of the differentiation of mESCs into embryoid bodies.

44

6 Article

Figure 6. Comparison of the top and bottom networks describing adult mouse liver regeneration after partial hepatectomy. See Figure 3 for details regarding the box plots. A Average sub‐cluster overlap for sub‐clusters connected by edges in the top network as compared to the bottom network. B Average significance log10 P‐value of shared transcriptional regulators binding to the promoters of genes in sub‐clusters connected by 45

CHAPTER 6.1 PUBLICATION #2, under review

edges in the top network as compared to the bottom network. C Average STRING connectivity score between sub‐clusters connected by edges in the top network as compared to the bottom network. D Average GO annotation similarity score between sub‐clusters connected by edges in top network as compared to the bottom network.

46

6 Article

Figure 7. Comparison of the top and bottom networks describing mouse embryonic liver development. See Figure 3 for details regarding the box plots. A Average sub‐cluster overlap for sub‐clusters connected by edges in the top network as compared to the bottom network. B Average significance log10 P‐value of shared transcriptional regulators binding to the promoters of genes in sub‐clusters connected by edges in the top network as compared to the bottom network. C Average STRING connectivity score between sub‐clusters connected by edges in the top network as compared to the bottom network. D Average GO annotation similarity score between sub‐clusters connected by edges in top network as compared to the bottom network.

47

CHAPTER 6.1 PUBLICATION #2, under review

Figure 8. Top networks describing the mouse adult liver regeneration and mouse embryonic liver development. See Figure 4 for details regarding the layout. A Liver regeneration network, constructed based on time points 1 hour to 72 hours after partial hepatectomy in adult mice. B Liver development network, constructed based on time points 10.5 days to 16.5 days post‐conception in mice.

48

6.2. SUPPLEMENT

6.2. Supplement

Supplementary information can be found on the CD attached to this thesis.

7 Article

Gene 502 (2012) 99–107

Contents lists available at SciVerse ScienceDirect

Gene

journal homepage: www.elsevier.com/locate/gene

Derivation of an interaction/regulation network describing pluripotency in human

Anup Som a ,1, Mitja Luštrek a,b,1, Nitesh Kumar Singh a, Georg Fuellen a,⁎ a Institute for Biostatistics and Informatics in Medicine and Ageing Research, University of Rostock, Ernst-Heydemann-Str. 8, 18057, Rostock, Germany b Jožef Stefan Institute, Department of Intelligent Systems, Jamova cesta 39, 1000 Ljubljana, Slovenia article info abstract

Article history: Identification of the key genes/proteins of pluripotency and their interrelationships is an important step in Accepted 9 April 2012 understanding the induction and maintenance of pluripotency. Experimental approaches have accumulated Available online 20 April 2012 large amounts of interaction/regulation data in mouse. We investigate how far such information can be trans- ferred to human, the species of maximum interest, for which experimental data are much more limited. To Keywords: address this issue, we mapped an existing mouse pluripotency network (the PluriNetWork) to human. We Interolog mapping transferred interaction and regulation links between genes/proteins from mouse to human on the basis of Phylogenetic profiling Co-expression orthologous relationship of the genes/proteins (called interolog mapping). To reduce the number of false Gene Ontology positives, we used four different methods: phylogenetic profiling, Gene Ontology semantic similarity, gene RNA interference co-expression, and RNA interference (RNAi) data. The methods and the resulting networks were evaluated Stem cell by a novel approach using the information about the genes known to be involved in pluripotency from the literature. The RNAi method proved best for filtering out unlikely interactions, so it was used to construct the final human pluripotency network. The RNAi data are based on human embryonic stem cells (hESCs) that are generally considered to be in a (primed) epiblast stem cell state. Therefore, we assume that the final human network may reflect the (primed) epiblast stem cell state more closely, while the mouse net- work reflects the (unprimed/naïve) embryonic stem cell state more closely. © 2012 Elsevier B.V. All rights reserved.

1. Introduction inferring interactions in human based on mouse, which is the most close- ly related species with rich interaction/regulation data, we created a In recent years the field of embryonic stem cell (ESC) and pluripo- view of the human pluripotency network, an important first step in tency research has gained importance because of its therapeutic po- systems-level understanding of the underlying mechanisms. tential in regenerative medicine (Boiani and Scholer, 2005; Rao and To make full use of the currently available interaction (and regula- Orkin, 2006; Jaenisch and Young, 2008). Unraveling the mechanisms tion) data, computational methods have been developed to predict underlying pluripotency and reprogramming in human may open a new interactions. These methods are based on diverse attributes, con- new era in medicine (Cohen and Melton, 2011). These mechanisms cepts, and data types, such as interologs (Matthews et al., 2001; Yu et involve protein–protein interactions, which participate in key biolog- al., 2004), gene expression profiles (Ideker et al., 2002), Gene Ontology ical processes in cells. They are supplemented by protein–DNA inter- (GO) annotations (Wu et al., 2006), phylogenetic profiling (Pellegrini actions describing gene regulation by the control of transcription. et al., 1999), domain interactions (Ng et al., 2003), and co-evolution Describing and interpreting such a network of interaction and regula- (Jothi et al., 2005). Some machine learning methods, such as support tion (i.e., stimulation and inhibition) links are essential tasks of com- vector machines (SVMs), were also used to predict protein–protein in- putational biology. teractions based on sequence data (Shen et al., 2007; Guo et al., 2008). Considering how beneficial the knowledge of the mechanisms under- Among these methods, the interolog approach has been widely imple- lying pluripotency and reprogramming in human would be, the descrip- mented (Rhodes et al., 2005). The method assumes that protein–protein tion of the gene/protein interaction/regulation underlying pluripotency interactions are conserved between organisms and that pairs of proteins (called the pluripotency network) in human is remarkably limited. By whose orthologs are known to interact in a model organism probably in- teract in the organism of interest (target organism) as well (Walhout et al., 2000). Numerous studies were published in which mainly human Abbreviation: RNAi, RNA interference; hESCs, human embryonic stem cells; ESC, protein–protein interactions were predicted based on interolog detec- embryonic stem cell; GO, Gene Ontology; SVM, support vector machine; EDS, evolu- tion (Han et al., 2004; Lehner and Fraser, 2004; Rhodes et al., 2005; tionary dissimilarity score; CC, Cellular Component; BP, Biological Process; TF, tran- Tirosh and Barkai, 2005; Brown and Jurisica, 2005; Persico et al., 2005; scription factor; LCA, lowest common ancestor. Huang et al., 2007). ⁎ Corresponding author. Tel.: +49 381 4947360; fax: +49 381 4947203. – E-mail address: [email protected] (G. Fuellen). A potential problem in predicting protein protein interactions 1 These authors contributed equally to this work. using an interolog-based method is that it may generate false positive

0378-1119/$ – see front matter © 2012 Elsevier B.V. All rights reserved. doi:10.1016/j.gene.2012.04.025 CHAPTER 7.1 PUBLICATION #3

100 A. Som et al. / Gene 502 (2012) 99–107 interactions, i.e., interactions that are falsely predicted to exist in the target organism. The false positive interactions appear due to two reasons. The first reason is false positives in original interactions obtained experimentally in the model organism (von Mering et al., 2002; Sprinzak et al., 2003; Yu et al., 2008). The second reason is the lack of evolutionary conservation of interactions, in particular when applied to phylogenetically distant organisms (Mika and Rost, 2006; Brown and Jurisica, 2007). In such cases an interaction does exist in the model organism but not in the target organism. We minimized the appearance of false positive interactions in the interaction data of the model organism (i.e., the first reason) by consid- ering the high-quality literature-curated mouse PluriNetWork as the model network (Som et al., 2010). To reduce the number of false posi- tives due to the lack of evolutionary conservation of interactions (i.e., the second reason), we filtered the interactions using four methods: (1) phylogenetic profiling, (2) GO semantic similarity, (3) gene co- expression, and (4) considering RNAi data. A novel approach was adopted to evaluate the relative performance of these four methods. The best of them (that is, RNAi) was finally selected to filter out the un- likely interactions, resulting in the final predicted human pluripotency network.

2. Materials and methods

Fig. 1 shows the flowchart of our approach to mapping the mouse pluripotency network to human. Its steps are described in detail in the rest of the paper.

2.1. Model network: mouse pluripotency network

We previously assembled a network of 547 molecular interactions, stimulations and inhibitions involved in mouse pluripotency called PluriNetWork (Som et al., 2010), which we consider the model network. It is shown in Fig. 2 (a high-resolution JPEG image and a Cytoscape version of the network are given in Supplementary Fig. S1). It is based on a collection of primary research data from 177 publications involving 264 mouse genes/proteins. It includes the core circuit of Oct4 (Pou5f1), Sox2, Nanog and Klf4, its periphery Esrrb, c-Myc, Nr5a2, Stat3, and Sall4 (red region), connections to upstream signaling pathways such as Acti- Fig. 1. Flowchart of the approach used for the derivation of human pluripotency vin, Wnt, FGF, BMP, Insulin, Notch, and LIF (green region), and epigenet- network. ic regulators such as Dnmt3a, Dnmt3b, Hdac1, Hdac2, and Kdm3a (blue region). A detailed description of the network assembly, its properties, only HomoloGene had an ortholog for Ctbp2 (C-terminal binding protein the associated biological information, and its applications is found in 2), which we nevertheless decided to omit from our set of human ortho- the publication of Som et al. (2010). logs. We also looked for co-orthologs of mouse pluripotency proteins in human. The Ensemble ortholog dataset showed that two mouse pro- 2.2. Ortholog identification teins Lefty1 (Left right determination factor 1)andZfx(Zinc finger protein X-linked) have co-orthologs in human, whereas Inparanoid and Homolo- Orthologs of mouse pluripotency genes/proteins in human were Gene did not support these co-orthologs. Therefore, we assumed that identified from three publicly available ortholog databases: (1) Ensembl only one-to-one relationships exist in the ortholog data. In summary, (Release 62, April 2011) [http://www.ensembl.org], (2) InParanoid (Ver- we obtained a clean set of human orthologs of mouse pluripotency sion 7.0, June 2009) (Berglund et al., 2008), and (3) HomoloGene (Re- players that contains 262 genes/proteins (Supplementary Table S1). lease 64, February 2011) [http://www.ncbi.nlm.nih.gov/homologene/], as follows. We exported mouse–human ortholog pairs from Ensembl 2.3. Mouse-to-human interolog mapping by the help of BioMart software (Haider et al., 2009). The mouse Ensembl gene ID in the PluriNetWork was used as an identifier to mine Ensembl In the PluriNetWork, protein–protein interactions and regulatory with BioMart. From the InParanoid database, the complete dataset of protein–DNA interactions (i.e., stimulations and inhibitions) are col- mouse–human orthologs was downloaded. We then extracted the lectively called links. We do not distinguish a gene and its protein orthologs of mouse pluripotency genes in human. InParanoid is one product—they are both referred to by the gene name. For all links of the best ortholog databases, especially as it identifies more correct between mouse genes, the mouse–human ortholog pairs were inves- co-orthologs (defined as two or more genes that were duplicated after tigated. If both genes comprising a mouse link have human orthologs, the speciation and hence are orthologs to one or more genes in then these human orthologs were predicted to be linked by an inter- another species) than other such databases (Chen et al., 2007). Finally, olog. The interolog detection strategy was initially developed to the HomoloGene online web interface was used to establish mouse– transfer information on protein–protein interactions (from yeast to human ortholog pairs. All three databases contained orthologs for all higher organisms) (Walhout et al., 2000; Matthews et al., 2001), the genes except two. None had an ortholog for Ins1 (Insulin 1)and but it can also be employed for regulatory links (Yu et al., 2004; 7 Article

A. Som et al. / Gene 502 (2012) 99–107 101

Fig. 2. Manual layout of the PluriNetWork in Cytoscape. Nodes (264) are genes/proteins, edges (547) are stimulations (arrows), inhibitions (T-bar arrows) and interactions (lines). The top third of the network includes upstream signaling pathways (green region), the middle is composed of the core circuitry of pluripotency (Pou5f1—also known as Oct4, Sox2 and Nanog) and its periphery (red region), and the left part includes epigenetic factors and related mechanisms (blue region). A high-resolution JPEG image and a Cytoscape version of the network are presented in Supplementary Fig. S1.

Yellaboina et al., 2007). This method assumes that links are conserved whether the transferred links truly belong to the human pluripotency between organisms: pairs of proteins whose orthologs are known to network and thus to reduce the number of false positives due to interact in other species probably interact in the species of interest the second reason, we used four different link evaluation methods: as well. Based on human orthologs of mouse pluripotency players (1) phylogenetic profiling, (2) GO semantic similarity, (3) gene co- (genes/proteins), we transferred the links from mouse to human, expression, and (4) considering RNAi data. The first three methods resulting in the initial human version of the mouse PluriNetWork. are well known for their potential to improve the quality of predicted The predicted network consists of 262 nodes (genes/proteins) linked interactions, while the fourth one is new. by 545 links (Supplementary Fig. S2). Thus, with the exception of two For each link, each of the four link evaluation methods provided a links to genes that had no ortholog (to Ins 1 and to Ctbp2), the value (we called it link value) corresponding to the probability that predicted network is identical to the original one. the link is involved in pluripotency. However, in order to actually filter out false positive interologs, we needed a threshold for the values pro- 2.4. Filtering out false positive interologs vided by each of the four methods to separate the links to be included in the network from those to be excluded. Furthermore, in order to select The links transferred from mouse to human may include false the best of the four filtering methods, we need to evaluate the networks positives due to false positives in the original interactions obtained constructed using each of them. The four methods are described in the experimentally in mouse (von Mering et al., 2002; Sprinzak et al., following four subsections. The selection of the thresholds and the eval- 2003; Yu et al., 2008) or the lack of evolutionary conservation of uation of the networks is described in the final subsection. interactions (Mika and Rost, 2006; Brown and Jurisica, 2007). We minimized the first reason (i.e., the appearance of false positive in- 2.4.1. Phylogenetic profiling method teractions in the model organism) by considering the high-quality Phylogenetic profiling assumes that two proteins displaying a literature-curated PluriNetWork as the model network. To evaluate similar phylogenetic profile (i.e., a similar presence/absence pattern CHAPTER 7.1 PUBLICATION #3

102 A. Som et al. / Gene 502 (2012) 99–107 in a set of reference organisms) are functionally linked (Pellegrini et al., methods for measuring the semantic similarity (Guo et al., 2006; 1999). In other words, if both proteins are either conserved or deleted Wang et al., 2007). in several organisms, this is an indication of a link between them. A binary phylogenetic profile of a gene is represented by a vector of 0 and 1, 2.4.3. Gene co-expression method depending on the presence or absence of the gene's homolog in the set This method assumes that interacting pairs of proteins tend to be co- of reference organisms. We constructed binary profiles for all mouse plur- expressed. Human pluripotency-specificgeneexpressiondatawereused ipotency genes, using 14 reference vertebrate genomes (incl. human) to measure the co-expression values (twelve samples: GSM530601–3, 6, listed in Supplementary Table S2. A Blastp phylogenetic profile (Enault 9, 11, and 13–18 from the Gene Expression Omnibus series GSE21222). Of et al., 2003) of a mouse gene is represented by a vector of normalized the twelve samples, six samples (GSM530601–3, 6, 9, 11) are of the bit scores obtained from BLAST when searching for homologs of the pro- (primed) epiblast type (hESC), and six samples (GSM530613–18) are of tein encoded by the gene in the 14 other genomes. The normalized Blastp the (unprimed) naïve type (see Fig. 4C of Hanna et al., 2010 from the (Altschul et al., 1997) bit scores were taken from the InParanoid ortholog paper that is associated with the GSE21222 dataset). Expression values dataset. We then calculated an “evolutionary dissimilarity score” (EDS) were preprocessed using MAS5 algorithm (Hubbell et al., 2002). for each link using binary and Blastp profiles. The EDS of the linked Co-expression of a pair of genes was first assessed by calculating the genes i and j is defined as the sum of the absolute differences of binary Pearson correlation coefficient of their expression values. We calculated or Blastp scores across the profile, i.e. the correlation coefficient for each link in three different manners: (1) using the hESC type data only (six samples), which we called hESC correlation score of the link, (2) using the naïve type data only (also six XN ¼ − ; samples), which we called naïve correlation score, and (3) using both EDSij Pik Pjk k¼1 hESC and naïve type data (i.e., using all the twelve samples), which we called hESC+naïve correlation score. The correlation coefficients range from −1to1.Valuescloseto1or−1 indicate likely interaction (nega- where Pik and Pjk denote the presence or absence of the homologs of the tive values mean one gene/protein probably inhibits the other) and genes i and j in the genome k, or the Blastp bit scores of the proteins values close to zero indicate no interaction. We thus used the absolute encoded by the genes i and j when searching for them in the genome k, values of correlation coefficients of pairs of genes as the link values. and N is the number of genomes. Additionally, we used a method to measure the change of frequency of We defined the EDS based on the hypothesis that a pair of inter- interaction (that is, its startup or shutdown) known as LinkScore acting genes should feature similar evolutionary changes among spe- (Warsow et al., 2010). The LinkScore method calculates the amount of cies, elaborating on the fundamental assumption of phylogenetic change in interaction between two genes/proteins, usually measured by profiling that co-evolving genes are functionally linked. According two gene expression experiments. We computed the LinkScores of each to our definition of EDS, two proteins with similar phylogenetic pro- pair of linked genes from pluripotent states (we again considered hESC, files should have a low EDS, and two proteins with dissimilar profiles naïve, and hESC+naïve samples separately) to non-pluripotent state, should have a high EDS. A link with a low EDS indicates that the link and used them as the link values. For the pluripotent states we used the should be included in the human pluripotency network, whereas a already described twelve samples. For the non-pluripotent state, we link with a high EDS is likely a false positive and should be excluded used the non-pluripotent counterparts of our samples (GSE7178 and from the network. GSE17772, samples GSM172865–73 and GSM443832–34). A high Link- Score means that both genes interact more in the non-pluripotent state, 2.4.2. GO semantic similarity method which indicates that the link should not be included in the human pluri- This method assumes that interacting proteins share the same potency network. subcellular localization (Shin et al., 2009) and are involved in similar biological processes (Ewing et al., 2007). We assessed these two proper- 2.4.4. RNAi method ties by the similarity of the genes according to their Cellular Component Recently, a genome-wide RNAi screen was conducted to identify (CC) and Biological Process (BP) GO terms (Schlicker et al., 2006). We genes which regulate self-renewal and pluripotency properties in used three variants of the GO semantic similarity method: (1) BP simi- hESC (Chia et al., 2010). In this study, the authors screened a small in- larity, (2) CC similarity, and (3) BP+CC similarity. The BP and CC simi- terfering RNA (siRNA) library targeting 21,121 human genes and larities were computed as the similarities of the GO BP and CC terms of reported the importance of their role in hESC. Each of the 21,121 the genes that are linked. The BP+CC was computed as the average of human genes was assigned a numerical score called Fav. Higher values the BP and CC similarities if the link consisted of either two transcription of Fav indicated a more important role in hESC. For example, in their factors (TFs) or two non-TFs, or as the BP similarity otherwise. The rea- score sheet Pou5f1 was placed on top with the highest score, indicat- son for not considering the CC similarity of links between a TF and non- ing Pou5f1's critical importance in hESC, a fact established by several TF is as follows. The PluriNetWork contains several transcriptional links other studies. (i.e., a link between a TF and a non-TF, such as a signaling protein), e.g. For the purpose of filtering out false positive interologs, we need- Sox2–Fgf4 and Pou5f1–Fgf4. Naturally, the TFs, Sox2 and Pou5f1 are lo- ed to assign values to links instead of genes. We considered three cated in the nucleus, whereas the location of Fgf4 is the extracellular approaches (i.e., three variants of processing the RNAi data) to trans- space. In such cases the CC similarity cannot reflect the probability of form two gene scores to a link score: (1) the minimum of the Fav stimulation/inhibition. scores of the linked genes, (2) the arithmetic mean of the scores, We calculated the GO BP and CC semantic similarities between and (3) the geometric mean of the scores. The rationale for (1) is interacting genes using Resnik's term similarity method as imple- that if one of the linked genes is not involved in human pluripotency, mented in the GOSim package (Frohlich et al., 2007). Resnik's method the link between it and another gene does not belong to the human is based on the information content of the lowest common ancestor pluripotency network, either. The rationale for (2) is that since the re-

(LCA) of two terms (Resnik, 1999). The more frequently a term oc- lation between the Fav score and the involvement in pluripotency is curs, the lower is its information content. If the LCA of two terms de- somewhat uncertain, averaging the scores of both genes cancels out scribes a generic concept, these terms are not very similar and this is some of the uncertainty. The last method, (3) geometric mean, is reflected in the low information content of their LCA. The correspond- striving for a balance between (1) and (2): it averages the Fav scores, ing genes probably do not interact and are thus assigned a low link but if one of the scores if very low and the other very high, their geo- value. Resnik's method is considered the best among the existing metric mean is still very low, as in (1). 7 Article

A. Som et al. / Gene 502 (2012) 99–107 103

2.5. Evaluation of the methods Table 2 Genes experimentally shown to be required, or shown not to be required for the induc- tion and/or maintenance of pluripotency in human. Each of the four link evaluation methods provided us with link values for a subset of the 545 links in the initial predicted human Genes References Genes not References pluripotency network. This is because some of the data required to required required use the methods was not available for all the links—for example, a ACVR1 Schnerch et al. (2010) CTNNB1 Lam et al. (2008) number of genes have no GO annotations, so the links associated DNMT3B Adewumi et al. (2007) ESRRB Xie et al. (2009) with these genes have no GO semantic similarity scores. The sizes of DPPA4 Assou et al. (2007) FBX15 Rao (2004) FGFR1 Schnerch et al. (2010) IL6ST Schnerch et al. (2010) these subsets are given in Table 1. The size of the intersection of the HELLS Assou et al. (2007) JAK1 Schnerch et al. (2010) four subsets, which contains the links for which all four methods pro- KLF4 Pera and Tam (2010) KLF2 Greber et al. (2007) vided link values, is 406. The link evaluation methods were compared LEFTY1 Schnerch et al. (2010) KLF5 Greber et al. (2007) on these 406 links. NANOG Pera and Tam (2010) LIF Pera and Tam (2010) NODAL Pera and Tam (2010) NR5A2 Xie et al. (2009) In order to compare the link evaluation methods and to evaluate the PHF17 Assou et al. (2007) SMAD1 Schnerch et al. (2010) — resulting pluripotency networks, we needed some reference data POU5F1 Pera and Tam (2010) STAT3 Schnerch et al. (2010) reliable and independent information on which links belong to the SMAD2 Schnerch et al. (2010) TBX3 Greber et al. (2007) human pluripotency network. Such data for links between genes is SOX2 Pera and Tam (2010) hard to come by, but there is some literature on genes/proteins experi- TGFB1 Schnerch et al. (2010) ZIC3 Assou et al. (2007) mentally shown to be or not to be involved in human pluripotency. We found 15 genes known to be involved in human pluripotency and 12 genes known not to be involved (Table 2). We translated this informa- corresponding ranks for the four link evaluation methods are shown tion on genes/proteins into information on links as follows. If both in Supplementary Table S3. To construct the human network, we genes in a link are known to be involved in human pluripotency, we needed to decide for each link whether to include it or not, which considered the link to belong to the human network. If at least one of means that we needed thresholds for the values provided by the the genes is known not to be involved, we considered the link not to be- link evaluation methods. Ideally, the links above such a threshold long to the human network. If one of the genes is known to be involved would all be involved in human pluripotency, and the links below it and we had no information on the other, we assigned a probability to would not be involved. As the threshold we used the number of that link belonging to the human network. Since one gene in such a links to be included in the human network, which we denoted n. link is already known to be involved, the probability for the whole This made it possible to compare the four link evaluation methods, link is equal to the probability of the other gene being involved. There- since each method uses a different scale. Each method and each fore, the probability of such a link belonging to the human network was value n split the links into those to be included in the human network computed as the number of link endpoints known to be involved in (the top-ranked n links) and those not to be included (the bottom- human pluripotency (314, that is the number of times that the 15 ranked 406—n links). For example, if we chose the RNAi method genes known to be involved are the endpoint of a link), divided by the and the threshold 100, the 100 links with the highest link values number of link endpoints on which we had some literature-based infor- according to the RNAi method would be included in the human net- mation (436 endpoints, 314 involved and 122 not involved), which work, and the remaining 306 links would be excluded. For the links equals 0.72. (Note that a “link endpoint” is a gene, but we use this to be included, the correctness scores from the third column (inclu- term to convey that we count it once for every occurrence in a link.) sion) of Table 3 were added up, and for the links to be excluded, the We used the counts of link endpoints instead of simple gene counts be- scores from the fourth column (exclusion) were added up. The total cause we wished to count each gene once for each occurrence in a link, sum of these gave the overall quality of the human pluripotency net- since this gives a greater weight to the genes involved in more links. We work for a given link evaluation method and a given number of links ignored the links for which we had no information on either gene. The n. The value n for which this score was the largest, was considered the correctness scores for inclusion in or exclusion from the human pluripo- optimal threshold for a given method. tency network are given in Table 3. A link is assigned the same score if the involvements of the first and the second gene are reversed. For 3. Results example, if the third case were (yes, no) instead of (no, yes), the cor- rectness scores for that case would be the same. We first constructed the initial human pluripotency network by Based on the link values assigned by the link evaluation methods, transferring the links between the genes from the mouse pluripo- the links were ranked in the order of the probability that they should tency network. Afterwards we compared the four link evaluation be included in the human pluripotency network (i.e., the higher rank methods used to filter out the false positive links, as described in of a link by a given method, the more likely the link is involved in the Materials and methods section, and constructed the final human pluripotency according to that method). The link values and the network using the best of them. The following two sections present

Table 3 Table 1 Correctness scores for the inclusion of a link between genes in the human pluripotency The number of links (genes/proteins) in the pluripotency network at various steps of network or the exclusion from the network, depending on whether the genes that are our approach to transfer the mouse pluripotency network to human. linked are known to be involved in human pluripotency based on the literature. Step Number Involvement in human Inclusion in the Exclusion from the Interactions identified experimentally in mouse 547 (264) pluripotency based human network human network Interologs transferred to human based on orthologous relationship 545 (262) on the literature Interactions for which phylogenetic profiling data was available 545 First gene Second gene Interactions for which GO semantic similarity data was available 480 Interactions for which gene co-expression data was available 453 Yes Yes 1 0 Interactions for which RNAi data was available 540 No No 0 1 Interactions for which data for all four link evaluation methods 406 Yes No 0 1 was available Yes Unknown 0.72 0.28 Interactions remaining after filtering by the RNAi method 215 (148) No Unknown 0 1 Interactions remain in the final network 196 (136) Unknown Unknown 0 0 CHAPTER 7.1 PUBLICATION #3

104 A. Som et al. / Gene 502 (2012) 99–107 the results of the comparison of the link evaluation methods and the final human network.

3.1. Comparison of the link evaluation methods

Each of the four link evaluation methods has multiple variants, which we compared to choose the best variant for each method. For phylogenetic profiling, we compared binary profiles with profiles consisting of Blastp bit scores. For GO semantic similarity, we compared similarities of BP, CC and BP+CC GO terms. For gene co- expression, we compared the degree of co-expression computed with the Pearson correlation coefficient and LinkScore using hESC, naïve and hESC+naïve samples. For the RNAi method, we compared the link values computed as the minimum, arithmetic mean and geo- metric mean of gene values. The results of the comparisons revealed that for the phylogenetic profiling method binary profile is the best variant (see Supplementary Fig. S3), for the GO semantic similarity method CC similarity is the best variant, for the gene co-expression method LinkScore using hESC samples is the best variant, and for the RNAi method arithmetic mean of gene values is the best variant fi (the comparison of the variants of each of the four link evaluation Fig. 3. Comparison of the interolog ltering methods. Threshold versus correctness score is plotted for the four link evaluation methods. A higher correctness score indicates that methods, as described in the next paragraph, are presented in Supple- the method better filters out possible false positive interactions. The plot shows that the mentary Fig. S3). We selected as the best variant the one with the RNAi method is the best method followed by the co-expression method. highest average and peak correctness score. After selecting the best variant of each link evaluation method, we compared the four methods using these variants. The results are pre- After filtering out possible false positive interactions by the RNAi sented in Fig. 3. The horizontal axis of the graph represents the method, we observed that for a number of genes the majority of threshold (n), which ranges from 0 (no links included) to 406 (all their links were filtered out. This observation led us to the conjecture links included). The vertical axis shows the number of links whose in- that they are not involved in human pluripotency. We decided to de- clusion in the human network or exclusion from it is correct based on lete all genes from the network for which at least 80% of the links the literature. The correctness curves show the correctness of each were filtered out. The threshold of 80% was set to the highest value method at a given threshold. Of particular interest is the threshold such that the remaining genes not involved in human pluripotency at which a method reaches the highest correctness, since that is the based on the literature were deleted. By applying this criterion, ten threshold for inclusion in the pluripotency network that would be more genes (and 18 links) shown in Table 4 were deleted from the fil- chosen for that method. The comparison of the link evaluation tered network. Out of the ten deleted genes, the literature reported methods shows that the RNAi method is the best performer followed that seven genes (Ctnnb1, Esrrb, Klf2, Klf5, Nr5a2, Smad1, and by the gene co-expression method. The most commonly used phylo- Stat3) are not involved in human pluripotency (Table 2) and the genetic profiling and GO semantic similarity methods performed role of the other genes (Mbd3, Myc and Sall4) in hESCs is unknown. poorly. Finally, we deleted the remaining gene that the literature reported

3.2. Derivation of human pluripotency network

Since the RNAi method proved to be the best link evaluation meth- od, we used it to filter out false positive interactions from the human pluripotency network. Fig. 4 shows the correctness score with respect to the threshold for the RNAi method for all the links for which we had RNAi information, which are 540 out of 545 links. RNAi data (Fav score) was missing for four genes (Kdm1a, Kdm4c, Kdm5c, and mTOR), which appeared in five links. The threshold thus ranges from 0 (no links included) to 540 (all links included). The maximum possible correctness that could be achieved by the method is 307.52, which is calculated as follows. We have literature- based information on 373 links. For 139 of them we know that both genes in the link are involved in pluripotency (so the link is involved as well) or that one of the genes is not involved (so the link is not in- volved, either). If all of these links are included in/excluded from the network correctly, they contribute 139 to the correctness score. For 234 links we know that one gene is involved, so if all of them are included in the network, they contribute 234×0.72=168.52 to the correctness score (as per Table 3). The maximum number achieved by the RNAi method is 250.29 when the threshold is 215, which indi- cates that the top-ranking 215 links are to be retained in the human network. Thereafter, we used the gene co-expression method to eval- Fig. 4. Threshold versus correctness score for the RNAi method for all the links for which uate the remaining five links (on which we had no RNAi data). All the we had RNAi information, which is 540 out of 545 links. The correctness score reaches the fi fi ve links were ltered out by the co-expression method, so we delet- highest value (250.29) when the threshold is 215, which indicates the top-ranking 215 ed them from the human network. links are to be retained in the human network. 7 Article

A. Som et al. / Gene 502 (2012) 99–107 105 not to be involved in human pluripotency – LIF – and its links. As a re- is fairly problematic. A large number of GO annotations of human sult, LIFR was disconnected from the network, so we deleted it as well. genes come from mouse genes (the annotation that was made for The final human pluripotency network retained 196 links and 136 the mouse gene was transferred to the human gene) (http://www. genes (see Supplementary Table S4). This means that the majority geneontology.org/). of links (approximately 64%) were filtered out from the predicted Even though the gene co-expression methods performed better network. Fig. 5 thus shows the final putative interaction/regulation than phylogenetic profiling and GO semantic similarity methods, it network of human pluripotency (a high-resolution JPEG image and a was still unable to filter many genes properly. The method is based Cytoscape version of the network are given in Supplementary Fig. S4). on the notion that interacting proteins are co-expressed. However, expression data are usually derived from a heterogeneous mixture 4. Discussion of cells and cellular compartments. Genes may have very specific ex- pression patterns based on a variety of cellular activities. Thus, over- In this study, we mapped a mouse pluripotency network to human by lapping local expression patterns may not be identifiable in a global interolog detection and then used the RNAi method as a filter to increase co-expression measure. Moreover, the human pluripotency network the quality of the transferred network. We first established orthologous is composed of TFs, signaling proteins and epigenetic factors. It is, there- relationships of mouse–human pluripotency genes/proteins by the fore, possible that interacting signaling proteins or interacting TFs are combination of the three most popular publicly available databases, strongly co-expressed, but this may not be true for an interacting pair namely Ensembl, InParanoid and HomoloGene. Even given an adequate of a signaling protein and a TF (i.e., both signaling proteins and TFs set of orthologs, the predicted network is expected to contain several may have their specific expression patterns, including time delays if a false positive interactions. To filter out such false positives, we used TF is activated by a signal, or if a TF activates a signal). The Pearson cor- four methods to evaluate the interactions. To find out the best method, relation coefficient variant of the gene co-expression method utilizes we evaluated their relative performance. Interestingly, we found that co-expression pattern only, while the LinkScore variant looks for a dif- the most widely used methods (i.e., phylogenetic profiling, GO semantic ference between pluripotent and non-pluripotent samples. The latter similarity, and gene co-expression) performed worse than the RNAi performed better which confirms our observation that false positives method. The high performance of the RNAi method is not entirely sur- are caused by the lack of involvement in pluripotency, not the lack of prising, because the RNAi experiment was conducted precisely to identi- interaction. fy the genes which regulate self-renewal and pluripotency in hESCs (i.e., After filtering out false positives, the final human pluripotency net- it directly measured the involvement and/or criticality of the genes in work retained 196 links and 136 genes, which means that the majority hESCs). However, it is worth examining why the other methods per- of links (approximately 64%) were filtered out from the predicted formed relatively poorly. network. This result implies that the underlying mechanisms of pluripo- A likely reason for the poor performance of phylogenetic profiling is tency significantly differ between mouse and human. This is somewhat that it is too general. It predicts interactions based on evolutionary con- surprising considering that 99% of mouse pluripotency genes have servation of genes only and does not take into account the function of human orthologs. Furthermore, mouse and human genomes are highly the genes. For example, Esrrb, LIF, and Il6st (also called Gp130) are conservedingeneral,withabout85%ofallthemousegeneshaving well conserved across the genomes, so phylogenetic profiling is unable human orthologs. However, the difference in the mechanism of pluripo- to filter them out, even though they are known not to be involved in tency between the species was explained by Tesar et al. (2007),who human pluripotency (i.e., the phylogenetic profiling method is unable found fundamental differences between the mouse and human state of to measure the degree of involvement in pluripotency; rather it mea- pluripotency usually investigated that is an embryonic mouse stem cell sures the degree of interaction between genes). However, it is possible (naïve/unprimed) versus an epiblast-like human stem cell (primed). that these genes are involved in other cellular phenomena (i.e., they are We inspected the pathways in the final human network, namely false positives not because they do not interact, but because they are not TGF-beta/Activin/Nodal, Wnt signaling and LIF signaling, and found involved in pluripotency). that the TGF-beta/Activin/Nodal pathway exists (Fig. 5; yellow re- The underlying hypothesis of the GO semantic similarity method is gion), the Wnt pathway was disconnected from the core network that interacting proteins share the same sub-cellular localization and (brown region) and the LIF signaling pathway was deleted. It was are involved in similar biological processes. As described in the GO reported that the activation of the TGF-beta/Activin/Nodal branch semantic similarity method, because of transcriptional links the CC through SMAD2/3 is associated with pluripotency in human and is score cannot reflect the probability of interaction. The BP score is not re- required for the maintenance of the undifferentiated state in hESCs liable, either, because the observations that pair of proteins shares the (Vallier et al., 2005; James et al., 2005). Our filtered human network same biological process do not guarantee that they in fact interact. thus agrees with the current state of experimental knowledge of the Finally, GO annotations are known to be incomplete and erroneous TGF-beta/Activin/Nodal pathway. However, Wnt proteins are also (Done et al., 2010). Particularly for the species human, GO annotation believed to play an important role in controlling hESC maintenance (Sato et al., 2004), but the Wnt pathway was disconnected. A possible reason is that different Wnt genes are required for the maintenance of Table 4 the undifferentiated state of the ESCs in human and in mouse. The Wnt The nodes deleted by the 80% criterion (i.e., a node was deleted when at least 80% of family has 19 members (genes). In the mouse PluriNetWork,twoWnt fi the links attached to it were ltered out). genes, namely Wnt3a and Wnt5a, were included. In human, it was Gene Total no. of links No. of links Fraction of reported in the literature that Wnt3, Wnt5a and Wnt10B are involved name attached to the gene deleted links deleted in pluripotency mediated by the Wnt pathway (Katoh, 2008). Further- ESRRB 16 14 87% more, the RNAi screening result confirmed that Wnt3 and Wnt10B KLF2 5 4 80% (together with Wnt2B and Wnt9B) play a more important role in KLF5 11 10 91% hESCs than Wnt3a and Wnt5a (Chia et al., 2010). Several experimental MBD3 6 5 83% studies reported that unlike in mouse, the LIF signaling pathway is not NR5A2 23 19 82% SALL4 12 10 83% required to maintain hESC (Okita and Yamanaka, 2006; Sun et al., SMAD1 11 9 82% 2006). Our filtered human network matches these experimental results. MYC 15 13 86% The main limitation of the human pluripotency network derived STAT3 32 30 94% from the mouse network is the missing links. Links are missing because CTNNB1 7 6 86% they are absent from the mouse network or are human-specific. We CHAPTER 7.1 PUBLICATION #3

106 A. Som et al. / Gene 502 (2012) 99–107

Fig. 5. The derived human pluripotency network after filtering out putative false positives. This network contains 136 nodes (genes/proteins) and 196 links. A high-resolution JPEG image and a Cytoscape version of the network are found in Supplementary Fig. S4. The box on the left marks the TGF-beta/Activin/Nodal pathways and the box on the right marks the disconnected Wnt pathway (see text).

believe that the mouse network reflected the current knowledge of 5. Conclusions mouse pluripotency fairly well, and as new links are discovered, they can be added to the mouse network and transferred to human. Some In conclusion, we derived a putative human pluripotency network of the missing links, however, are human-specific. Considering that for mouse, for which experimental data are much more plentiful than the size of the derived human network is less than half of the mouse for human. The quality of the predicted network was improved by network and that we have no reason to believe that the true human net- using genome-wide RNAi screening data, which directly measured the work is smaller than the mouse network, the human-specific links may involvement and/or criticality of the genes in hESCs. The predicted net- well be in the majority. Some of these links may be transferred from work will be useful to understand the biology underlying pluripotency, other species or inferred from gene expression and RNAi data, although and scientists are expected to benefit from the access to a human net- it is doubtful that these approaches would yield much reliable informa- work of pluripotency players and mechanisms, which will help them tion. We may consider them in the future, but the only sure way to fill in make sense of high-throughput data. Most importantly, given recent in- the missing links is experimental identification. It is also future work to vestigations, we assume that the human network may reflect the find out how useful networks from other murine cell types may be for “primed” epiblast stem cell state more closely, while the mouse net- estimating the human pluripotency network. We expect that the main work reflects the “unprimed”,or,“naïve”, ESC state more closely. It is determinant of usefulness will be the proximity of the murine cell future work, requiring more experimental data, to disentangle the dif- type to the specific kind of pluripotency featured by hESC. In particular, ference in the developmental state and the species difference. a network from mouse Epiblast stem cells (EpiSC) may be very close, In the future, we are interested in mapping the pluripotency net- whereas networks of proliferating (cancer) cells of mice are expected work to more organisms of interest where the experimental data to be less related, though they may still share features related to prolif- are also limited (e.g., Axolotl, chicken, rat, etc.) and to investigate eration/renewal. Networks from very different cell types are expected their evolution. Multi-species pluripotency networks should be useful to be only remotely related; they should miss the core pluripotency net- to identify species-specific pathways evolution, and afford a deeper work around Pou5f1, Sox2 and Nanog as well as much of its periphery. understanding of the evolution of pluripotency. 7 Article

A. Som et al. / Gene 502 (2012) 99–107 107

Supplementary data to this article can be found online at http:// Lam, H., Patel, S., Wong, J., Chu, J., Lau, A., Li, S., 2008. Localized decrease of ß-catenin contributes to the differentiation of human embryonic stem cells. Biochem. dx.doi.org/10.1016/j.gene.2012.04.025. Biophys. Res. Commun. 372, 601–606. Lehner, B., Fraser, A.G., 2004. A first-draft human protein-interaction map. Genome Biol. 5 (9), R63. Acknowledgments Matthews, L.R., et al., 2001. Identification of potential interaction networks using sequence-based searches for conserved protein–protein interactions or “interologs”. We thank the two anonymous referees for their helpful comments. Genome Res. 11, 2120–2126. – Funding by the DFG SPP 1356, Pluripotency and Cellular Reprogramming Mika, S., Rost, B., 2006. Protein protein interactions more conserved within species than across species. PLoS Comput. Biol. 2 (7), e79. (FU583/2-1), is also gratefully acknowledged. We thank Boris Greber Ng, S.K., Zhang, Z., Tan, S.H., 2003. Integrative approach for computationally inferring for valuable advice in preparing the manuscript. protein domain interactions. Bioinformatics 19 (8), 923–929. Okita, K., Yamanaka, S., 2006. Intracellular signaling pathways regulating pluripotency of embryonic stem cells. Curr. Stem Cell Res. Ther. 1 (1), 103–111. References Pellegrini, M., Marcotte, E.M., Thompson, M.J., Eisenberg, D., Yeates, T.O., 1999. Assign- ing protein functions by comparative genome analysis, protein phylogenetic Adewumi, O., et al., 2007. Characterization of human embryonic stem cell lines by the profiles. Proc. Natl. Acad. Sci. U. S. A. 96 (8), 4285–4288. International Stem Cell Initiative. Nat. Biotechnol. 25, 803–816. Pera, M.F., Tam, P.P., 2010. Extrinsic regulation of pluripotent stem cells. Nature 465 Altschul, S.F., et al., 1997. Gapped BLAST and PSIBLAST: a new generation of protein (7299), 713–720. database search programs. Nucleic Acids Res. 25, 3389–3402. Persico, M., et al., 2005. HomoMINT: an inferred human network based on orthology Assou, S., et al., 2007. A meta-analysis of human embryonic stem cells transcriptome mapping of protein interactions discovered in model organisms. BMC Bioinformatics integrated into a web-based expression atlas. Stem Cells 25, 961–973. 6 (Suppl. 4), S21. Berglund, A.C., Sjolund, E., Ostlund, G., Sonnhammer, E.L.L., 2008. InParanoid 6: eukaryotic Rao, M., 2004. Conserved and divergent paths that regulate self-renewal in mouse and ortholog clusters with inparalogs. Nucleic Acids Res. 36, D263–D266. human embryonic stem cells. Dev. Biol. 275, 269–286. Boiani, M., Scholer, H.R., 2005. Regulatory networks in embryo-derived pluripotent Rao, S., Orkin, S.H., 2006. Unraveling the transcriptional network controlling ES cell stem cells. Nat. Rev. Mol. Cell Biol. 6 (11), 872–884. pluripotency. Genome Biol. 7, 230. Brown, K.R., Jurisica, I., 2005. Online predicted human interaction database. Bioinformatics Resnik, P., 1999. Semantic similarity in a taxonomy, an information-based measure and 21 (9), 2076–2082. its application to problems of ambiguity in natural language. J Artif. Intell Res. 11, Brown, K.R., Jurisica, I., 2007. Unequal evolutionary conservation of human protein 95–130. interactions in interologous networks. Genome Biol. 8 (5), R95. Rhodes, D.R., et al., 2005. Probabilistic model of the human protein–protein interaction Chen, F., Mackey, A.J., Vermunt, J.K., Roos, D.S., 2007. Assessing performance of orthology network. Nat. Biotechnol. 23 (8), 951–959. detection strategies applied to eukaryotic genomes. PLoS One 2, e383. Sato,N.,Meijer,L.,Skaltsounis,L.,Greengard,P.,Brivanlou,A.H.,2004.Maintenance Chia, N.Y., et al., 2010. A genome-wide RNAi screen reveals determinants of human of pluripotency in human and mouse embryonic stem cells through activation of embryonic stem cell identity. Nature 468, 316–320. Wnt signaling by a pharmacological GSK-3-specific inhibitor. Nat. Med. 10, Cohen, D.E., Melton, D., 2011. Turning straw into gold, directing cell fate for regenera- 55–63. tive medicine. Nat. Rev. Genet. 12, 243–252. Schlicker, A., Domingues, F.S., Rahnenfuhrer, J., Lengauer, T., 2006. A new measure for Done, B., Khatri, P., Done, A., Draghici, S., 2010. Predicting novel human Gene Ontology functional similarity of gene products based on Gene Ontology. BMC Bioinformatics annotations using semantic analysis. IEEE/ACM Trans. Comput. Biol. Bioinform. 7 (1), 7, 302. 91–99. Schnerch, A., Cerdan, C., Bhatia, M., 2010. Distinguishing between mouse and human Enault, F., Suhre, K., Abergel, C., Poirot, O., Claverie, J.M., 2003. Annotation of bacterial pluripotent stem cell regulation, the best laid plans of mice and man. Stem Cells genomes using improved phylogenomic profiles. Bioinformatics 19 (Suppl. 1), 28, 419–430. i105–i107. Shen, J., et al., 2007. Predicting protein–protein interactions based only on sequences Ewing, R.M., et al., 2007. Large-scale mapping of human protein–protein interactions information. Proc. Natl. Acad. Sci. U. S. A. 104 (11), 4337–4341. by mass spectrometry. Mol. Syst. Biol. 3, 89. Shin, C.J., Wong, S., Davis, M.J., Ragan, M.A., 2009. Protein–protein interaction as a Frohlich, H., Speer, N., Poustka, A., BeiSZbarth, T., 2007. GOSim—an R-package for predictor of subcellular location. BMC Syst. Biol. 3, 28. computation of information theoretic GO similarities between terms and gene Som, A., et al., 2010. The PluriNetWork, an in-silico representation of the network products. BMC Bioinformatics 8, 166. underlying pluripotency in mouse, and its applications. PLoS One 5 (12), e15165. Greber, B., Lehrach, H., Adjaye, J., 2007. Silencing of core transcription factors in human Sprinzak, E., Sattath, S., Margalit, H., 2003. How reliable are experimental protein–protein EC cells highlights the importance of autocrine FGF signaling for self-renewal. BMC interaction data? J. Mol. Biol. 327 (5), 919–923. Dev. Biol. 7, 46. Sun, Y., Li, H., Yang, H., Rao, M.S., Zhan, M., 2006. Mechanisms controlling embryonic Guo, X., Liu, R., Shriver, C.D., Hu, H., Liebman, M.N., 2006. Assessing semantic similarity stem cell self-renewal and differentiation. Crit. Rev. Eukaryot. Gene Expr. 16 (3), measures for the characterization of human regulatory pathways. Bioinformatics 211–231. 22, 967–973. Tesar, P.J., et al., 2007. New cell lines from mouse epiblast share defining features with Guo, Y., et al., 2008. Using support vector machine combined with auto covariance to human embryonic stem cells. Nature 448, 196–199. predict protein–protein interactions from protein sequences. Nucleic Acids Res. Tirosh, I., Barkai, N., 2005. Computational verification of protein–protein interactions 36, 3025–3030. by orthologous co-expression. BMC Bioinformatics 6, 40. Haider, S., Ballester, B., Smedley, D., Zhang, J., Rice, P., Kasprzyk, A., 2009. BioMart Vallier,L.,Alexander,M.,Pedersen,R.A.,2005.Activin/nodalandFGFpathways central portal—unified access to biological data. Nucleic Acids Res. 37, 23–27. cooperate to maintain pluripotency of human embryonic stem cells. J. Cell Sci. Han, K., Park, B., Kim, H., Hong, J., Park, J., 2004. HPID: the human protein interaction 118, 4495–4509. database. Bioinformatics 20 (15), 2466–2470. von Mering, C., et al., 2002. Comparative assessment of large-scale data sets of protein– Hanna, J., et al., 2010. Human embryonic stem cells with biological and epigenetic protein interactions. Nature 417, 399–403. characteristics similar to those of mouse ESCs. Proc. Natl. Acad. Sci. U. S. A. 107, Walhout, A.J., et al., 2000. Protein interaction mapping in C. elegans using proteins 9222–9227. involved in vulval development. Science 287, 116–122. Huang, T.W., Lin, C.Y., Kao, C.Y., 2007. Reconstruction of human protein interolog Wang, J.Z., Du, Z., Payattakool, R., Yu, P.S., Chen, C.F., 2007. A new method to measure network using evolutionary conserved network. BMC Bioinformatics 8, 152. the semantic similarity of go terms. Bioinformatics 23, 1274–1281. Hubbell, E., Liu, W.M., Mei, R., 2002. Robust estimators for expression analysis. Bioin- Warsow, G., et al., 2010. ExprEssence—revealing the essence of differential experimental formatics 18, 1585–1592. data in the context of an interaction/regulation network. BMC Syst. Biol. 4, 164. Ideker, T., Ozier, O., Schwikowski, B., Siegel, A.F., 2002. Discovering regulatory and Wu, X., Zhu, L., Guo, J., Zhang, D.Y., Lin, K., 2006. Prediction of yeast protein–protein signalling circuits in molecular interaction networks. Bioinformatics 18 (Suppl. interaction network, insights from the Gene Ontology and annotations. Nucleic 1), S233–S240. Acids Res. 34 (7), 2137–2150. Jaenisch, R., Young, R.A., 2008. Stem cells, the molecular circuitry of pluripotency and Xie, C.Q., et al., 2009. Expression profiling of nuclear receptors in human and mouse nuclear reprogramming. Cell 132, 567–582. embryonic stem cells. Mol. Endocrinol. 23, 724–733. James, D., Levine, A.J., Besser, D., Hemmati-Brivanlou, A., 2005. TGFß/activin/nodal Yellaboina, S., Goyal, K., Mande, S.C., 2007. Inferring genome-wide functional linkages signaling is necessary for the maintenance of pluripotency in human embryonic in E. coli by combining improved genome context methods, comparison with stem cells. Development 132, 1273–1282. high-throughput experimental data. Genome Res. 17, 527–535. Jothi, R., Kann, M.G., Przytycka, T.M., 2005. Predicting protein–protein interaction by search- Yu, H., et al., 2004. Annotation transfer between genomes, protein–protein interologs ing evolutionary tree automorphism space. Bioinformatics 21 (Suppl. 1), i241–i250. and protein–DNA regulogs. Genome Res. 14, 1107–1118. Katoh, M., 2008. WNT signaling in stem cell biology and regenerative medicine. Curr. Yu, H., et al., 2008. High-quality binary protein interaction map of the yeast interactome Drug Targets 9, 565–570. network. Science 322 (5898), 104–110. CHAPTER 7.1 PUBLICATION #3

7.2. Supplement

Supplementary information can be found on the CD attached to this thesis.

Part III.

Appendix

109

8. CD with supplementary information

All supplementary information for publications Identifying Genes Relevant to Spe- cific Biological Conditions in Time Course Microarray Experiments, Elucidating com- plex phenotypes based on high-throughput expression and biological annotation data and Derivation of an interaction/regulation network describing pluripotency in human are available on the CD attached to this thesis. The content of the CD is shown below as a directory tree. The source codes related to Identifying Genes Relevant to Specific Biological Condi- tions in Time Course Microarray Experiments and Elucidating complex phenotypes based on high-throughput expression and biological annotation data are provided as R scripts in the zip files.

/ Identifying Genes Relevant to Specific Biological Conditions in Time Course Microarray Experiments Figure S1.tif TextS1.pdf rSNR.Source.Code.Zip Elucidating complex phenotypes based on high-throughput expression and biological annotation data Figure S1.tif Figure S2.png Figure S3.png Figure S4.png TextS1.pdf Identifying.Biological.Unit.Zip Derivation of an interaction/regulation network describing pluripotency in human Figure S1.cys Figure S1.jpeg Figure S2.cys Figure S3.doc Figure S4.cys Figure S4.jpeg Table S1.xls Table S2.doc Table S3.xls Table S4.xls

111

9. Affidavit

Hiermit erkl¨are ich, dass diese Arbeit bisher von mir weder an der Mathematisch- Naturwissenschaftlichen Fakult¨at der Ernst-Moritz-Arndt-Universit¨at Greifswald noch einer anderen wissenschaftlichen Einrichtung zum Zwecke der Promotion eingereicht wurde. Ferner erkl¨are ich, dass ich diese Arbeit selbst¨andig verfasst und keine anderen als die darin angegebenen Hilfsmittel und Hilfen benutzt und keine Textabschnitte eines Dritten ohne Kennzeichnung ubernommen¨ habe.

Nitesh Kumar Singh

113

10. Curriculum vitae and publications

Personal Data

• Birth: 20.03.1983 in Kanpur

Professional Experience

• Since February 2010: Phd student in biostatistics and bioinformatics at Institute for Biostatistics and Informatics in Medicine and Ageing Research, University of Rostock

Education

• 2007-2010: M. Sc. in Bioinformatics at University of Saarland in Saarbrucken (Germany), Center for Bioinformatics.

• 2004-2006: M. Sc. in Biotechnology and Bioinformatics at Sri Ramaswamy Memo- rial University in Chennai (India).

• 2001-2004: B. Sc. in Biotechnology at Magadh University in Patna (India).

Publications

1. Singh N K , Ernst M, Liebscher V, Fuellen G and Taher L, (2014) Elucidating complex phenotypes based on high-throughput expression and biological annota- tion data. PLoS One (Submitted).

2. Singh N K , Repsilber D, Liebscher V, Taher L and Fuellen G, (2013) Iden- tifying genes relevant to specific biological conditions in time course microarray experiments. PLoS One 8 e76561.

3. Som A, Lustrek M, Singh N K and Fuellen G, (2012) Derivation of an interac- tion/regulation network describing pluripotency in human., Gene 502 99–107.

4. Singh N K, Goodman A, Walter P, Helms V and Hayat S, (2011) TMBHMM: a frequency profile based HMM for predicting the topology of transmembrane beta barrel proteins and the exposure status of transmembrane residues., Biochim Biophys Acta 1814 664–670.

5. Singh N K, Selvam S M and Chakravarthy P, (2006) T-iDT : tool for identification of drug target in bacteria and validation by Mycobacterium tuberculosis., In Silico Biol 6 485–493.

115

11. Acknowledgements

It would not have been possible to write this doctoral thesis without the help and support of the kind people around me, to only some of whom it is possible to give particular mention here. In particular, I wish to express my gratitude to my research guide, Professor Dr. Georg Fuellen of Rostock University for his continued encouragement and invaluable suggestions during this work. In this I would also like to include my gratitude to Professor Dr. Volkmar Liebscher of Greifswald University who have provided constant support for my research work. Furthermore, I am deeply indebted to Dr. Leila Taher, who constantly helped me with her suggestions and comments. Members of Institute for Biostatistics and Informatics in Medicine and Ageing Re- search in Rostock also deserve my sincerest thanks, their friendship and assistance has meant more to me than I could ever express. I wish to thank my parents and other members of my family. Their love provided my inspiration and was my driving force. I owe them everything and wish I could show them just how much I love and appreciate them. I hope that this work makes you proud. Last but not the least, I would like to express my gratitude to the invisible force that helped me in the tough times.

117

12. Bibliography

Bibliography

D. J. Allocco, I. S. Kohane, and A. J. Butte. Quantifying the relationship between co- expression, co-regulation and gene function. BMC Bioinformatics, 5(1):18, 2004. doi: 10.1186/1471-2105-5-18. URL http://dx.doi.org/10.1186/1471-2105-5-18.

R. B. Altman and S. Raychaudhuri. Whole-genome expression analysis: challenges beyond clustering. Curr Opin Struct Biol, 11(3):340–347, Jun 2001.

K. Aoki, Y. Ogata, and D. Shibata. Approaches for extracting practical information from gene co-expression networks in plant biology. Plant Cell Physiol, 48(3):381–390, Mar 2007. doi: 10.1093/pcp/pcm013. URL http://dx.doi.org/10.1093/pcp/pcm013.

T. Bailey and C. Elkan. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol, 2:28–36, 1994.

T. L. Bailey, N. Williams, C. Misleh, and W. W. Li. Meme: discovering and analyzing DNA and protein sequence motifs. Nucleic Acids Research, 34(Web Server):W369– W373, Jul 2006. doi: 10.1093/nar/gkl198. URL http://dx.doi.org/10.1093/nar/ gkl198.

T. Barrett, S. E. Wilhite, P. Ledoux, C. Evangelista, I. F. Kim, M. Tomashevsky, K. A. Marshall, K. H. Phillippy, P. M. Sherman, M. Holko, and et al. Ncbi geo: archive for functional genomics data sets–update. Nucleic Acids Research, 41(D1):D991–D995, Jan 2013. ISSN 1362-4962. doi: 10.1093/nar/gks1193. URL http://dx.doi.org/10. 1093/nar/gks1193.

D. Botstein, J. M. Cherry, M. Ashburner, C. A. Ball, J. A. Blake, H. Butler, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, and et al. Gene ontology: tool for the unification of biology. Nature Genetics, 25(1):25–29, May 2000. ISSN 1061-4036. doi: 10.1038/75556. URL http://dx.doi.org/10.1038/75556.

H. J. Cordell. Detecting gene-gene interactions that underlie human diseases. Nat Rev Genet, 10(6):392–404, Jun 2009. doi: 10.1038/nrg2579. URL http://dx.doi.org/ 10.1038/nrg2579.

P. D’haeseleer, S. Liang, and R. Somogyi. Genetic network inference: from co-expression clustering to reverse engineering. Bioinformatics, 16(8):707–726, Aug 2000.

R. Diaz Uriarte and S. Alvarez de Andres. Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7(1):3, 2006. ISSN 1471-2105. doi: 10.1186/1471-2105-7-3. URL http://dx.doi.org/10.1186/1471-2105-7-3.

J. Dutkowski and J. Tiuryn. Phylogeny-guided interaction mapping in seven eu- karyotes. BMC Bioinformatics, 10(1):393, 2009. ISSN 1471-2105. doi: 10.1186/ 1471-2105-10-393. URL http://dx.doi.org/10.1186/1471-2105-10-393.

119 BIBLIOGRAPHY

B. Duval and J.-K. Hao. Advances in metaheuristics for gene selection and classification of microarray data. Brief Bioinform, 11(1):127–141, Jan 2010. doi: 10.1093/bib/ bbp035. URL http://dx.doi.org/10.1093/bib/bbp035.

M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A, 95(25):14863–14868, Dec 1998.

M. Emily, T. Mailund, J. Hein, L. Schauser, and M. H. Schierup. Using biological networks to search for interacting loci in genome-wide association studies. Eur J Hum Genet, 17(10):1231–1240, Oct 2009. doi: 10.1038/ejhg.2009.15. URL http: //dx.doi.org/10.1038/ejhg.2009.15.

J. M. Engreitz, A. A. Morgan, J. T. Dudley, R. Chen, R. Thathoo, R. B. Altman, and A. J. Butte. Content-based microarray search using differential expression profiles. BMC Bioinformatics, 11:603, 2010. doi: 10.1186/1471-2105-11-603. URL http:// dx.doi.org/10.1186/1471-2105-11-603.

C. Fresno and E. A. Fernandez. Rdavidwebservice: a versatile r interface to david. Bioinformatics, 29(21):2810–2811, Nov 2013. doi: 10.1093/bioinformatics/btt487. URL http://dx.doi.org/10.1093/bioinformatics/btt487.

S. M. Gomez, W. S. Noble, and A. Rzhetsky. Learning to predict protein-protein in- teractions from protein sequences. Bioinformatics, 19(15):1875–1881, Oct 2003. ISSN 1460-2059. doi: 10.1093/bioinformatics/btg352. URL http://dx.doi.org/10.1093/ bioinformatics/btg352.

X. Guo and X.-F. Wang. Signaling cross-talk between tgf-beta/bmp and other pathways. Cell Res, 19(1):71–88, Jan 2009. doi: 10.1038/cr.2008.302. URL http://dx.doi.org/ 10.1038/cr.2008.302.

S. Gupta, J. A. Stamatoyannopoulos, T. L. Bailey, and W. S. Noble. Quantifying sim- ilarity between motifs. Genome Biol, 8(2):R24, 2007. doi: 10.1186/gb-2007-8-2-r24. URL http://dx.doi.org/10.1186/gb-2007-8-2-r24.

I. Guyon and A. Elisseeff. An introduction to variable and feature selection. The Journal of Machine Learning Research, 3:1157–1182, 2003. URL http://dl.acm. org/citation.cfm?id=944968.

K. Hailesellasse Sene, C. J. Porter, G. Palidwor, C. Perez-Iratxeta, E. M. Muro, P. A. Campbell, M. A. Rudnicki, and M. A. Andrade-Navarro. Gene function in early mouse embryonic stem cell differentiation. BMC Genomics, 8:85, 2007. doi: 10.1186/ 1471-2164-8-85. URL http://dx.doi.org/10.1186/1471-2164-8-85.

G. T. Hart, A. K. Ramani, and E. M. Marcotte. How complete are current yeast and human protein-interaction networks? Genome Biology, 7(11):120, 2006. ISSN 1465-6906. doi: 10.1186/gb-2006-7-11-120. URL http://dx.doi.org/10.1186/ gb-2006-7-11-120.

K. Hayes, A. Vollrath, G. Zastrow, B. McMillan, M. Craven, S. Jovanovich, D. Rank, S. Penn, J. Walisser, J. Reddy, et al. Edge: a centralized resource for the comparison, analysis, and distribution of toxicogenomic information. Molecular pharmacology, 67 (4):1360–1368, 2005. URL http://molpharm.aspetjournals.org/content/67/4/ 1360.short.

120 BIBLIOGRAPHY

J. D. Hoheisel. Microarray technology: beyond transcript profiling and genotype anal- ysis. Nature Reviews Microbiology, 7(3):200–210, Mar 2006. ISSN 1471-0064. doi: 10.1038/nrg1809. URL http://dx.doi.org/10.1038/nrg1809. D. A. Hosack, G. Dennis, B. T. Sherman, H. Lane, and R. A. Lempicki. Identifying biological themes within lists of genes with ease. Genome Biology, 4(10):R70, 2003. ISSN 1465-6906. doi: 10.1186/gb-2003-4-10-r70. URL http://dx.doi.org/10.1186/ gb-2003-4-10-r70. D. W. Huang, B. T. Sherman, and R. A. Lempicki. Systematic and integrative analysis of large gene lists using david bioinformatics resources. Nat Protoc, 4(1):44–57, 2009a. doi: 10.1038/nprot.2008.211. URL http://dx.doi.org/10.1038/nprot.2008.211. D. W. Huang, B. T. Sherman, and R. A. Lempicki. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res, 37(1):1–13, Jan 2009b. doi: 10.1093/nar/gkn923. URL http://dx.doi.org/10. 1093/nar/gkn923. J. Ihmels, R. Levy, and N. Barkai. Principles of transcriptional control in the metabolic network of saccharomyces cerevisiae. Nat Biotechnol, 22(1):86–92, Jan 2004. doi: 10.1038/nbt918. URL http://dx.doi.org/10.1038/nbt918. I. Iossifov, T. Zheng, M. Baron, T. C. Gilliam, and A. Rzhetsky. Genetic-linkage mapping of complex hereditary disorders to a whole-genome molecular-interaction network. Genome Res, 18(7):1150–1162, Jul 2008. doi: 10.1101/gr.075622.107. URL http: //dx.doi.org/10.1101/gr.075622.107. L. Jensen, M. Kuhn, M. Stark, S. Chaffron, C. Creevey, J. Muller, T. Doerks, P. Julien, A. Roth, M. Simonovic, et al. String 8 - a global view on proteins and their functional interactions in 630 organisms. Nucleic acids research, 37(suppl 1):D412–D416, 2009. URL http://nar.oxfordjournals.org/content/37/suppl_1/D412.short. T. Jirapech-Umpai and S. Aitken. Feature selection and classification for microarray data analysis: Evolutionary methods for identifying predictive genes. BMC Bioinformatics, 6(1):148, 2005. ISSN 1471-2105. doi: 10.1186/1471-2105-6-148. URL http://dx.doi. org/10.1186/1471-2105-6-148. M. Kanehisa. Kegg: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28(1):27–30, Jan 2000. ISSN 1362-4962. doi: 10.1093/nar/28.1.27. URL http://dx. doi.org/10.1093/nar/28.1.27. M. Kanehisa, S. Goto, Y. Sato, M. Kawashima, M. Furumichi, and M. Tanabe. Data, information, knowledge and principle: back to metabolism in kegg. Nucleic Acids Research, 42(D1):D199–D205, Jan 2014. ISSN 1362-4962. doi: 10.1093/nar/gkt1076. URL http://dx.doi.org/10.1093/nar/gkt1076. J. Kilian, D. Whitehead, J. Horak, D. Wanke, S. Weinl, O. Batistic, C. D’Angelo, E. Bornberg-Bauer, J. Kudla, and K. Harter. The atgenexpress global stress expres- sion data set: protocols, evaluation and model data analysis of uv-b light, drought and cold stress responses. The Plant Journal, 50(2):347–363, 2007. URL http: //onlinelibrary.wiley.com/doi/10.1111/j.1365-313X.2007.03052.x/full. B. Kong, T. Yang, L. Chen, Y.-q. Kuang, J.-w. Gu, X. Xia, L. Cheng, and J.-h. Zhang. Protein-protein interaction network analysis and gene set enrichment anal- ysis in epilepsy patients with brain cancer. Journal of Clinical Neuroscience, 21

121 BIBLIOGRAPHY

(2):316–319, Feb 2014. ISSN 0967-5868. doi: 10.1016/j.jocn.2013.06.026. URL http://dx.doi.org/10.1016/j.jocn.2013.06.026.

L. Kumar and M. E. Futschik. Mfuzz: a software package for soft clustering of microarray data. Bioinformation, 2(1):5–7, 2007.

D. Marbach, J. C. Costello, R. Kuffner, N. M. Vega, R. J. Prill, D. M. Camacho, K. R. Allison, A. Aderhold, K. R. Allison, R. Bonneau, and et al. Wisdom of crowds for robust gene network inference. Nature Methods, 9(8):796–804, Jul 2012. ISSN 1548- 7105. doi: 10.1038/nmeth.2016. URL http://dx.doi.org/10.1038/nmeth.2016.

E. M. Marcotte. Detecting protein function and protein-protein interactions from genome sequences. Science, 285(5428):751–753, Jul 1999. ISSN 1095-9203. doi: 10.1126/ science.285.5428.751. URL http://dx.doi.org/10.1126/science.285.5428.751.

L. R. Matthews. Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or ”interologs”. Genome Research, 11(12):2120–2126, Dec 2001. ISSN 1088-9051. doi: 10.1101/gr.205301. URL http: //dx.doi.org/10.1101/gr.205301.

T. T. Nguyen, P. T. Foteinou, S. E. Calvano, S. F. Lowry, and I. P. Androulakis. Computational identification of transcriptional regulators in human endotoxemia. PLoS One, 6(5):e18889, 2011. doi: 10.1371/journal.pone.0018889. URL http: //dx.doi.org/10.1371/journal.pone.0018889.

H. H. Otu, K. Naxerova, K. Ho, H. Can, N. Nesbitt, T. A. Libermann, and S. J. Karp. Restoration of liver mass after injury requires proliferative and not embry- onic transcriptional patterns. J Biol Chem, 282(15):11197–11204, Apr 2007. doi: 10.1074/jbc.M608441200. URL http://dx.doi.org/10.1074/jbc.M608441200.

C. A. Ouzounis, A. J. Enright, I. Iliopoulos, and N. C. Kyrpides. Protein interaction maps for complete genomes based on gene fusion events. Nature, 402(6757):86–90, Nov 1999. ISSN 0028-0836. doi: 10.1038/47056. URL http://dx.doi.org/10.1038/ 47056.

M. Pellegrini, E. M. Marcotte, M. J. Thompson, D. Eisenberg, and T. O. Yeates. Assign- ing protein functions by comparative genome analysis: Protein phylogenetic profiles. Proceedings of the National Academy of Sciences, 96(8):4285–4288, Apr 1999. ISSN 1091-6490. doi: 10.1073/pnas.96.8.4285. URL http://dx.doi.org/10.1073/pnas. 96.8.4285.

S. Peri, J. D. Navarro, R. Amanchy, T. Z. Kristiansen, C. K. Jonnalagadda, V. Suren- dranath, V. Niranjan, B. Muthusamy, T. K. B. Gandhi, M. Gronborg, N. Ibarrola, N. Deshpande, K. Shanker, H. N. Shivashankar, B. P. Rashmi, M. A. Ramya, Z. Zhao, K. N. Chandrika, N. Padma, H. C. Harsha, A. J. Yatish, M. P. Kavitha, M. Menezes, D. R. Choudhury, S. Suresh, N. Ghosh, R. Saravana, S. Chandran, S. Krishna, M. Joy, S. K. Anand, V. Madavan, A. Joseph, G. W. Wong, W. P. Schiemann, S. N. Constan- tinescu, L. Huang, R. Khosravi-Far, H. Steen, M. Tewari, S. Ghaffari, G. C. Blobe, C. V. Dang, J. G. N. Garcia, J. Pevsner, O. N. Jensen, P. Roepstorff, K. S. Deshpande, A. M. Chinnaiyan, A. Hamosh, A. Chakravarti, and A. Pandey. Development of hu- man protein reference database as an initial platform for approaching systems biology in humans. Genome Res, 13(10):2363–2371, Oct 2003. doi: 10.1101/gr.1680803. URL http://dx.doi.org/10.1101/gr.1680803.

122 BIBLIOGRAPHY

D. R. Rhodes, S. A. Tomlins, S. Varambally, V. Mahavisno, T. Barrette, S. Kalyana- Sundaram, D. Ghosh, A. Pandey, and A. M. Chinnaiyan. Probabilistic model of the human protein-protein interaction network. Nature Biotechnology, 23(8):951–59, Aug 2005. ISSN 1087-0156. doi: 10.1038/nbt1103. URL http://dx.doi.org/10.1038/ nbt1103. V. Ribes, I. Le Roux, M. Rhinn, B. Schuhbaur, and P. Dolle. Early mouse caudal development relies on crosstalk between retinoic acid, shh and fgf signalling pathways. Development, 136(4):665–676, Feb 2009. ISSN 1477-9129. doi: 10.1242/dev.016204. URL http://dx.doi.org/10.1242/dev.016204. T. Ruths, D. Ruths, and L. Nakhleh. GS2: an efficiently computable measure of go- based similarity of gene sets. Bioinformatics, 25(9):1178–1184, May 2009. ISSN 1460-2059. doi: 10.1093/bioinformatics/btp128. URL http://dx.doi.org/10.1093/ bioinformatics/btp128. Y. Saeys, I. Inza, and P. Larranaga. A review of feature selection techniques in bioin- formatics. Bioinformatics, 23(19):2507–2517, Oct 2007. ISSN 1460-2059. doi: 10. 1093/bioinformatics/btm344. URL http://dx.doi.org/10.1093/bioinformatics/ btm344. A. Schulze and J. Downward. Navigating gene expression using microarrays–a technology review. Nat Cell Biol, 3(8):E190–E195, Aug 2001. doi: 10.1038/35087138. URL http://dx.doi.org/10.1038/35087138. W. Shannon, R. Culverhouse, and J. Duncan. Analyzing microarray data using cluster analysis. Pharmacogenomics, 4(1):41–52, Jan 2003. doi: 10.1517/phgs.4.1.41.22581. URL http://dx.doi.org/10.1517/phgs.4.1.41.22581. B. A. Shoemaker and A. R. Panchenko. Deciphering protein-protein interactions. part ii. computational methods to predict protein and domain interaction partners. PLoS Comput Biol, 3(4):e43, 2007. ISSN 1553-7358. doi: 10.1371/journal.pcbi.0030043. URL http://dx.doi.org/10.1371/journal.pcbi.0030043. M. Siatkowski, V. Liebscher, and G. Fuellen. Cellfatescout - a bioinformatics tool for elucidating small molecule signaling pathways that drive cells in a specific direction. Cell Communication and Signaling, 11(1):85, 2013. ISSN 1478-811X. doi: 10.1186/ 1478-811x-11-85. URL http://dx.doi.org/10.1186/1478-811X-11-85. C. Siegenthaler and R. Gunawan. Assessment of network inference methods: How to cope with an underdetermined problem. PLoS ONE, 9(3):e90481, Mar 2014. ISSN 1932-6203. doi: 10.1371/journal.pone.0090481. URL http://dx.doi.org/10.1371/ journal.pone.0090481. A. Som, C. Harder, B. Greber, M. Siatkowski, Y. Paudel, G. Warsow, C. Cap, H. Scholer, and G. Fuellen. The plurinetwork: an electronic representation of the network under- lying pluripotency in mouse, and its applications. PLoS One, 5(12):e15165, 2010. doi: 10.1371/journal.pone.0015165. URL http://dx.doi.org/10.1371/journal.pone. 0015165. A. Statnikov, C. F. Aliferis, I. Tsamardinos, D. Hardin, and S. Levy. A compre- hensive evaluation of multicategory classification methods for microarray gene ex- pression cancer diagnosis. Bioinformatics, 21(5):631–643, Mar 2005. ISSN 1460- 2059. doi: 10.1093/bioinformatics/bti033. URL http://dx.doi.org/10.1093/ bioinformatics/bti033.

123 BIBLIOGRAPHY

V. G. Tusher, R. Tibshirani, and G. Chu. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A, 98(9):5116–5121, Apr 2001. doi: 10.1073/pnas.091062498. URL http://dx.doi.org/10.1073/pnas.091062498.

M. Tyagi, K. Hashimoto, B. A. Shoemaker, S. Wuchty, and A. R. Panchenko. Large- scale mapping of human protein interactome using structural complexes. EMBO Rep, 13(3):266–271, Mar 2012. doi: 10.1038/embor.2011.261. URL http://dx.doi.org/ 10.1038/embor.2011.261.

A. Valencia and F. Pazos. Computational methods for the prediction of protein inter- actions. Current Opinion in Structural Biology, 12(3):368–373, Jun 2002. ISSN 0959- 440X. doi: 10.1016/s0959-440x(02)00333-0. URL http://dx.doi.org/10.1016/ S0959-440X(02)00333-0.

J. Z. Wang, Z. Du, R. Payattakool, P. S. Yu, and C.-F. Chen. A new method to measure the semantic similarity of go terms. Bioinformatics, 23(10):1274–1281, May 2007. doi: 10.1093/bioinformatics/btm087. URL http://dx.doi.org/10.1093/ bioinformatics/btm087.

G. Warsow, B. Greber, S. S. I. Falk, C. Harder, M. Siatkowski, S. Schordan, A. Som, N. Endlich, H. Scholer, D. Repsilber, K. Endlich, and G. Fuellen. Expressence– revealing the essence of differential experimental data in the context of an interac- tion/regulation net-work. BMC Syst Biol, 4:164, 2010. doi: 10.1186/1752-0509-4-164. URL http://dx.doi.org/10.1186/1752-0509-4-164.

R. D. Wolfinger, G. Gibson, E. D. Wolfinger, L. Bennett, H. Hamadeh, P. Bushel, C. Afshari, and R. S. Paules. Assessing gene significance from cdna microarray ex- pression data via mixed models. Journal of Computational Biology, 8(6):625–637, Nov 2001. doi: 10.1089/106652701753307520. URL http://dx.doi.org/10.1089/ 106652701753307520.

E. T. Young. Multiple pathways are co-regulated by the protein kinase snf1 and the transcription factors adr1 and cat8. Journal of Biological Chemistry, 278(28):26146– 26158, May 2003. doi: 10.1074/jbc.M301981200. URL http://dx.doi.org/10.1074/ jbc.M301981200.

124