Active Module Discovery: Integrated Approaches of Co-Expression and PPI Networks and MicroRNA Data

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Ayat Hatem, M.Sc.

Graduate Program in Electrical and Computer Engineering

The Ohio State University

2014

Dissertation Committee:

Ümit V. Çatalyürek, Advisor

Yuejie Chi

Kun Huang

Füsun Özgüner

© Copyright by Ayat Hatem 2014

Abstract

Integrating protein-protein interaction (PPI) networks with gene expression data to extract active modules has been shown to be promising for detecting meaningful biomarkers for cancer and other diseases. However, current algorithms suffer from several drawbacks: they focus only on the highly differentially expressed genes; they analyze dependencies between genes in the PPI network only, neglecting genes whose interactions are not yet known; and they use only mRNA gene expression data, ignoring other types of data such as gene mutation information and microRNA expression. In addition, sequencing the mRNA with next generation sequencing technology (RNA-Seq) has recently become the new standard for measuring gene expression. However, existing algorithms either cannot handle RNA-Seq data or return large modules that are hard to analyze. Therefore, we need new approaches that address the current drawbacks while integrating RNA-Seq data into the module discovery process.

This work explores some of the drawbacks of current active module discovery algorithms. We first discuss the differences between RNA-Seq data and microarray data. With experimental evidence, we show that RNA-Seq is more powerful than microarray in providing better active modules at the expense of generating larger ones. Therefore, new approaches are needed to handle RNA-Seq data.

Afterwards, we present a new workflow, PRASE, that is specifically designed to obtain better active modules from RNA-Seq data. PRASE employs a variation of the well-known PageRank algorithm to preprocess the gene expression p-values. Then, it applies a scaling function to construct new p-values for the genes. These new p-values redefine the importance of the genes: a gene is important based not only on its own value but also on the values of the surrounding genes, which boosts the importance of genes that might not be differentially expressed from the p-value perspective. Finally, PRASE uses the new p-values with existing active module discovery algorithms to extract the final modules. We applied our workflow to breast invasive carcinoma, oligodendroglioma tumor, and colorectal cancer cell line datasets. Using PRASE, we obtain more specialized modules that contain information overlooked by existing algorithms.

Finally, we present our novel microRNA-mRNA integration technique, Mica, that efficiently integrates microRNA and mRNA expressions with the PPI network to discover more disease-specific active modules. The novelty of Mica lies in the early integration of microRNA expression with mRNA expression to better highlight the indirect dependencies between genes. We applied Mica on microRNA-Seq and mRNA-Seq data sets of 699 invasive ductal carcinoma samples and 150 invasive lobular carcinoma samples from the Cancer Genome Atlas Project (TCGA). The Mica modules unravel new and interesting dependencies between the genes and miRNAs. Additionally, the modules accurately differentiate between case and control samples while being highly enriched with disease-specific pathways and genes.

To my parents, Karim, Omar, and Maleeka.

Acknowledgments

I would like to thank and express my gratitude to my advisor, Prof. Ümit V. Çatalyürek, for his continuous and generous support and guidance throughout my study at OSU. Prof. Çatalyürek showed great faith in my abilities and allowed me to work quite independently, but at the same time provided invaluable guidance at the necessary times.

I would also like to thank the dissertation examination committee members, including Prof. Füsun Özgüner, Prof. Kun Huang, Prof. Yuejie Chi, and Prof. Dawn Chandler. The discussion and comments I received during my defense were invaluable, opening my mind to new ideas and research directions.

I would also like to thank Prof. Kamer Kaya for his support and the various discussions we had, some of which generated ideas used in my work.

I want to thank all of my colleagues and friends at the HPC lab, including Erdem Sariyuce, Mehmet Deveci, Anas AbuDolah, and Izzet Senturk. Also, I would like to thank the former members of the HPC lab, including Erik Saule, Onur Kucuktunc, and Doruk Bozdağ. It has been a privilege to know such a great group of people. In particular, I would like to mention Doruk Bozdağ and Erik Saule for the numerous fruitful and interesting discussions.

I would like to extend my deepest gratitude and love to my mother and my late father, who supported me during my research career and always encouraged me to follow my dreams. I would also like to thank my children, Omar and Maleeka, for their sense of humor and their wonderful characters; they totally changed my life. I can't describe how grateful I am to my husband, Karim, whose sweet presence has brought happiness into my life. He was always there for me in my tough times and always encouraged me to go forward with my PhD and never give up.

Finally, I acknowledge the support of the Graduate School of The Ohio State University, for the University Fellowship Award, and the support from the National Science Foundation.

Vita

September 15th, 1985 ...... Born - Giza, Egypt

July 2007 ...... B.S., Computer Engineering, Cairo University, Cairo, Egypt

August 2009 ...... M.S., Software Engineering, Nile University, Cairo, Egypt

September 2009–August 2010 ...... University Fellow, The Ohio State University, Columbus, OH, USA

September 2010–Spring 2013 ...... Grad. Research Assoc., The Ohio State University, Columbus, OH, USA

Spring 2013–Present ...... Grad. Teaching Assoc., The Ohio State University, Columbus, OH, USA

Publications

Research Publications

A. Hatem, K. Kaya, J. Parvin, K. Huang, Ü. V. Çatalyürek, "MICA: MicroRNA Integration for Active Module Discovery," In the 13th European Conference on Computational Biology (ECCB), Submitted

K. Kaya, A. Hatem, H. G. Özer, K. Huang, Ü. V. Çatalyürek, "High-Performance Computing in High-Throughput Sequencing," In Biological Knowledge Discovery Handbook, John Wiley & Sons, Editors M. Elloumi, A. Y. Zomaya, 2014

L. Wang, A. Hatem, Ü. V. Çatalyürek, M. Morrison, Z. Yu, "Metagenomic Insights into the Carbohydrate-Active Enzymes Carried by the Microorganisms Adhering to Solid Digesta in the Rumen of Cows," In PLoS One, vol. 8, no. 11, pg. e78507, Nov 2013

A. Hatem, D. Bozdağ, A. E. Toland, Ü. V. Çatalyürek, "Benchmarking Short Sequence Mapping Tools," In BMC Bioinformatics, vol. 14, no. 1, pg. 184, 2013

A. Hatem, K. Kaya, Ü. V. Çatalyürek, "PRASE: PageRank-based Active Subnetwork Extraction," In Proc. of ACM Conference on Bioinformatics, Computational Biology and Biomedical Informatics (BCB), Sep 2013

A. Hatem, K. Kaya, Ü. V. Çatalyürek, "Microarray vs. RNA-Seq: A comparison for active subnetwork discovery," In Proc. of ACM Conference on Bioinformatics, Computational Biology and Biomedical Informatics (BCB), Oct 2012

A. Hatem, D. Bozdağ, Ü. V. Çatalyürek, "Benchmarking Short Sequence Mapping Tools," In Proc. of IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Dec 2011

D. Bozdağ, A. Hatem, Ü. V. Çatalyürek, "Exploring Parallelism in Short Sequence Mapping Using Burrows-Wheeler Transform," In Proc. of 9th IEEE International Workshop on High Performance Computational Biology (in conjunction with IPDPS), 2010

A. Hatem, D. Bozdağ, Ü. V. Çatalyürek, "Benchmarking Short Sequence Alignment Tools," In Abstract, Bioinformatics, 2010 Ohio Collaborative Conference, 2010

Fields of Study

Major Field: Electrical & Computer Engineering

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

1. Introduction
   1.1 Dissertation Outline and Summary of Contributions

2. Background and Related Work
   2.1 DNA and the central dogma
       2.1.1 Measuring gene expression levels
       2.1.2 Other elements in the central dogma
   2.2 Active module discovery problem
   2.3 microRNA and mRNA integration

3. An Evaluation of RNA-Seq Mapping Tools
   3.1 Background
       3.1.1 Features
       3.1.2 Tools' description
       3.1.3 Default options of the tested tools
       3.1.4 Evaluation criteria
   3.2 Methods
       3.2.1 Benchmark design
       3.2.2 Use case: SNP calling
   3.3 Results and discussion
       3.3.1 Mapping options
       3.3.2 Input properties
       3.3.3 Algorithmic features
       3.3.4 Scalability
       3.3.5 Accuracy evaluation
       3.3.6 Rabema evaluation
       3.3.7 Use case: SNP calling
   3.4 Conclusion

4. Efficiency of RNA-Seq Data for Active Module Discovery in Comparison to MicroArrays
   4.1 Background
       4.1.1 Tools for Active Module Discovery
       4.1.2 Microarray vs. RNA-Seq: History
   4.2 Experimental Evaluation
       4.2.1 Colorectal cancer cell lines
       4.2.2 Oligodendroglioma tumors
   4.3 Conclusion and Future Work

5. PRASE: PageRank-based Active Module Extraction
   5.1 Background
       5.1.1 Active module extraction tools
       5.1.2 PageRank for gene ranking
   5.2 PRASE
       5.2.1 Input network and matrix construction
       5.2.2 Re-ranking
       5.2.3 Scaling and combining
   5.3 Experimental Results
       5.3.1 Breast invasive carcinoma
       5.3.2 Colorectal cancer cell line (CRC)
       5.3.3 Oligodendroglioma tumors
   5.4 Conclusions

6. MICA: MicroRNA Integration for Active Module Discovery
   6.1 Background
   6.2 Methods
       6.2.1 Data integration
       6.2.2 ICA on gene expression values
       6.2.3 Connected module extraction
   6.3 Results
       6.3.1 Results on ILC data
       6.3.2 Results on IDC data
   6.4 Conclusion

7. Conclusions and Future Directions
   7.1 Summaries and our findings
   7.2 Future Work

Bibliography

List of Tables

2.1 Famous active module discovery algorithms and their features
3.1 Features supported by the tools
3.2 Sensitivity evaluation of the different tools
3.3 Rabema evaluation
3.4 SNP calling results
4.1 Size of active modules obtained by the different tools
4.2 Number of DE genes in each active module
4.3 Occurrence of significant genes in the different modules
4.4 Top three hub nodes in each module
4.5 Size of active modules found by the tools
4.6 Hub node analysis
5.1 Standard names for the curated gene sets
5.2 GO enrichment analysis for the BRCA dataset
5.3 Pathway enrichment analysis for the BRCA data set
5.4 Number of DE and significant genes in each module
5.5 Percentages of DE genes in each module
5.6 Summary of improvements
6.1 Size of the modules obtained using Mica and ICA for the ILC data set
6.2 Pathway enrichment analysis for Mica, ICA, and DEGAS on the ILC data
6.3 The components obtained by ICA and Mica on the IDC data set
6.4 Pathway enrichment analysis for ICA, DEGAS, and Mica on the IDC data
6.5 DO enrichment analysis for ICA, DEGAS, and Mica

List of Figures

1.1 PRASE workflow
1.2 MICA workflow
2.1 Central dogma of biology
3.1 Evaluation criteria
3.2 Default options effect using wgsim
3.3 Default options effect
3.4 Quality threshold vs. number of mismatches
3.5 Effect of changing the number of mismatches using a synthetic data set extracted using wgsim
3.6 Effect of changing the number of mismatches using a synthetic data set extracted using ART
3.7 Effect of changing the number of mismatches using a real data set
3.8 Effect of changing the seed length using a synthetic data set
3.9 Effect of changing the seed length using a real data set
3.10 Effect of changing the read length using a synthetic data set extracted using wgsim
3.11 Effect of changing the read length using a ART generated data set
3.12 Effect of using paired-end data using a wgsim synthetic data set
3.13 Effect of changing the genome type using wgsim generated synthetic data set
3.14 Effect of changing the genome type using ART generated synthetic data set
3.15 Effect of enabling gapped alignment using a real data set
3.16 Speedup when using multithreading and multiprocessing
4.1 Visualization of the MicroNet modules
5.1 PRASE workflow
5.2 Evaluation of the modules obtained for the BRCA dataset
5.3 Size of the active modules from the CRC dataset
5.4 Size of active modules for the Oligo dataset
5.5 Percentage of important genes in the jActiveModules module
6.1 Mica: The workflow
6.2 Random t-score distribution
6.3 AUC for Mica, ICA, and DEGAS for a 10-fold cross validation
6.4 Overlap between Important pathways enriched in both Mica and ICA modules
6.5 mica15 module. The red nodes are for the nodes in the Hemostasis pathway

Chapter 1: Introduction

In complex diseases, genes do not act in isolation; rather, they interact in pathways and modules to perform their designated functions [1]. In addition, their interaction patterns change with the type of the cell and the condition [2]. A well-structured characterization and analysis of such modules has always intrigued researchers, especially for extremely heterogeneous diseases. Cancer is such a disease: the tissue from which the tumor derives differs across cancer types, and each cancer type can have many subtypes. Identifying a biologically correct and valid module is important for each cancer type and subtype, since the treatment options and their success rates can differ significantly [3].

One way to find such modules is to look for clusters of genes with certain properties, e.g., dense clusters, in different biological networks, such as the protein-protein interaction (PPI) network or the gene co-expression network [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]. A more efficient method is the integration of different biological data to better highlight these gene modules [15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27]. Following this idea, various techniques that integrate gene expression values or p-values with biological networks to extract such gene modules have been proposed, e.g., [28, 29, 30, 31, 32, 33, 34, 19, 35, 36, 37]. Such extracted modules are called active modules since the gene expression data, which changes dynamically, is integrated with the PPI network, which is static. Hence, the word active comes from the notion that these modules are active in certain cells or conditions. Following this track, many algorithms have been developed to make better use of the network structure and of other types of data as well, such as genotypic data. An excellent review and categorization of these algorithms was recently provided in [24].

Although the gene expression signature-based algorithms have proven to be flexible in practice, they do not provide a be-all and end-all solution for the active module discovery problem. Today, high throughput sequencing technology generates an unprecedented amount of data in different areas, such as mRNA-Seq and microRNA-Seq. Integrating these different types of data would increase the accuracy of active module detection and provide a better picture of how the underlying cell works. However, many of the existing algorithms and workflows do not exploit such heterogeneity. Besides, these algorithms are usually restricted to the proteins/genes in the networks they use and ignore the other genes in the gene expression data whose interaction patterns are not yet known.

Specifically, the drawbacks of current active module discovery algorithms can be summarized as follows (see Chapter 2 for more details):

• The use of p-value or fold-change based criteria to define the differentially expressed genes.

• The underestimation of genes that might not be differentially expressed from the p-value perspective but might be important based on their interaction patterns with surrounding important genes.

• Even when the above problem is solved using information flow based algorithms, it is addressed per sample, and it is not obvious how to extend the solution to find the active module across all of the samples.

• The assumption that genes should exhibit linear correlations across the samples.

• The focus on integrating only gene expression and the PPI network, while the cell uses other mechanisms to further regulate the expression levels of the genes. The effect of such mechanisms might not be apparent at the gene expression level; therefore, other possible active modules might not be detected.

• Most of the algorithms focus only on the genes existing in the PPI network while ignoring other genes that are not yet in the PPI network but have important relations to other genes or would increase the rank of other genes.

• Most of the tools were designed and tested using only microarray data, and it is not obvious how they would perform on high throughput data, such as mRNA-Seq.

An important question regarding data generated by high throughput sequencing is how to map the data and obtain gene and microRNA expression values. To this end, many tools have been developed to map the short sequences to a reference genome. However, the quality of the mapping is still questionable, and an effective evaluation of the mapping, and hence of the final expression values used for active module discovery, is needed to understand the efficiency and the importance of using high throughput sequencing datasets in the active module discovery problem.

1.1 Dissertation Outline and Summary of Contributions

In this dissertation, we present an in-depth evaluation, design, and implementation of different approaches to efficiently integrate different types of biological data. We discuss high throughput sequencing technology and how it has affected the measurement of gene expression. Unlike older techniques, such as microarrays, the quality of gene expression measurements from mRNA-Seq data depends heavily on the quality of the mapping step. To this end, Chapter 3 presents a comparison between the different short sequence mapping tools and an evaluation of the effect of the different settings on the quality of the output. In addition, we discuss the different approaches used in short sequence mapping and the state-of-the-art tools that implement those approaches.

In Chapter 4, we compare the effect of mRNA-Seq and microarray data on the quality of the discovered active modules. It is important to understand the significance of mRNA-Seq datasets in discovering more disease-related active modules; otherwise, microarray datasets, which are cheaper, would be more suitable to further advance the field. Therefore, an in-depth evaluation is carried out in this chapter to address this point.

In Chapters 5 and 6, we present the two approaches we implemented to effectively and efficiently integrate different types of high throughput data for active module discovery. Figures 1.1 and 1.2 show the workflows of the two approaches. The first workflow, PRASE, makes use of the properties of mRNA-Seq data to create a gene co-expression network and then adjusts the p-values of the genes to highlight the most important ones. Basically, a variation of the PageRank algorithm [38] is used to propagate the significance of a gene to its neighbors in a gene co-expression network constructed from the mRNA-Seq data. Since mRNA-Seq data contains a complete picture of which genes exist in the cell, the gene co-expression network contains relations that would otherwise be missed by networks based on physical interactions. After using PageRank to propagate gene significance, we adjust the p-values accordingly and generate new ones that are then used with current active module discovery tools.

Figure 1.1: PRASE workflow

The second workflow, Mica, presented in Chapter 6, provides a new solution for active module discovery by integrating microRNA-Seq, mRNA-Seq, and PPI network data in one framework. Additionally, a novel microRNA-mRNA integration method is introduced instead of relying on the common correlation-based integration method. Basically, we modify the gene expression values, generated from the mRNA-Seq data, with the microRNA expression values to better approximate the actual protein expression values. Afterwards, the genes are clustered into groups using independent component analysis (ICA) [39]. Finally, the clusters of genes are mapped to the PPI network to extract the active modules.
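The idea of spreading gene significance over a co-expression network, as described for PRASE, can be illustrated with a minimal sketch. The code below is plain personalized PageRank computed by power iteration; it is not the PRASE variation of PageRank and it does not include the scaling step that produces new p-values. The toy network, its edge weights, the function name, and the choice of -log10(p) as the restart weight are all assumptions made only for illustration.

```python
import math


def personalized_pagerank(adj, prior, damping=0.85, tol=1e-12, max_iter=1000):
    """Personalized PageRank by power iteration on a weighted undirected graph.

    adj   : dict mapping gene -> {neighbor: positive edge weight}
    prior : dict mapping gene -> non-negative restart weight (illustrative)
    """
    genes = sorted(adj)
    total = sum(prior[g] for g in genes)
    restart = {g: prior[g] / total for g in genes}
    strength = {g: sum(adj[g].values()) for g in genes}
    rank = dict(restart)
    for _ in range(max_iter):
        # Genes without neighbors hand their score back via the restart vector.
        dangling = sum(rank[g] for g in genes if strength[g] == 0)
        new_rank = {}
        for g in genes:
            # Score flowing into g from each neighbor, split by edge weight.
            inflow = sum(rank[n] * w / strength[n] for n, w in adj[g].items())
            new_rank[g] = ((1 - damping) * restart[g]
                           + damping * (inflow + dangling * restart[g]))
        converged = max(abs(new_rank[g] - rank[g]) for g in genes) < tol
        rank = new_rank
        if converged:
            break
    return rank


# Hypothetical toy co-expression network (weights stand in for correlations).
network = {
    "A": {"B": 0.9},
    "B": {"A": 0.9, "C": 0.9},
    "C": {"B": 0.9, "D": 0.2},
    "D": {"C": 0.2},
}
# Hypothetical p-values: A and C are significant, B and D are not.
p_values = {"A": 0.001, "B": 0.4, "C": 0.001, "D": 0.4}
prior = {g: -math.log10(p) for g, p in p_values.items()}
scores = personalized_pagerank(network, prior)
```

In this toy example, B and D have the same p-value, but B is strongly connected to two significant genes while D has only one weak link, so B receives the higher score. This mirrors the intuition that a gene can be important because of its neighborhood rather than its own p-value.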

[Figure: microRNA expression profiles (miRNA 1 to miRNA m) and gene expression profiles (gene 1 to gene n) of controls and cases enter an integration step, labelled "miRNA r: z_{r,s} > t", that yields adjusted gene expressions; ICA is applied to the adjusted values; the output of ICA and the PPI network feed the connected module extraction step, which returns modules 1 to 3.]

Figure 1.2: MICA workflow
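The last step of the Mica workflow, mapping a cluster of genes to the PPI network to obtain connected modules, can be sketched as follows. This is only one plausible reading of that step, namely taking the connected components of the PPI subgraph induced by the cluster; the actual procedure is described in Chapter 6. The function name, the `min_size` parameter, and the toy data are hypothetical.

```python
from collections import deque


def connected_modules(cluster, ppi_edges, min_size=2):
    """Split a gene cluster into connected components of the induced PPI subgraph."""
    cluster = set(cluster)
    neighbors = {g: set() for g in cluster}
    for u, v in ppi_edges:
        # Keep only interactions whose two endpoints are both in the cluster.
        if u in cluster and v in cluster and u != v:
            neighbors[u].add(v)
            neighbors[v].add(u)
    seen, modules = set(), []
    for start in sorted(cluster):
        if start in seen:
            continue
        # Breadth-first search collects one connected component.
        component, queue = [], deque([start])
        seen.add(start)
        while queue:
            gene = queue.popleft()
            component.append(gene)
            for n in sorted(neighbors[gene]):
                if n not in seen:
                    seen.add(n)
                    queue.append(n)
        if len(component) >= min_size:
            modules.append(sorted(component))
    return modules


# Hypothetical PPI edges and a hypothetical gene cluster (e.g. from ICA).
ppi = [("g1", "g2"), ("g2", "g3"), ("g4", "g5"), ("g5", "g9")]
cluster = ["g1", "g2", "g3", "g4", "g5", "g6"]
modules = connected_modules(cluster, ppi)
```

Here `g6` has no interaction inside the cluster and is dropped as a singleton, and `g9` is ignored because it is not part of the cluster, leaving two connected modules.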

The contributions of this dissertation can be summarized as follows:

• A comprehensive comparison of the different short sequence mapping tools. The comparison covers different perspectives, such as the effect of changing the default settings, the algorithm, the type and size of the inputs, and the reference genomes.

• A comprehensive analysis of the effect of mRNA-Seq and microarray data on the quality of the extracted active modules. The study shows how mRNA-Seq data greatly affects the size and the comprehensiveness of the obtained active modules, hence arguing for the development of better and more efficient tools for solving the active module discovery problem.

• The first workflow to make use of mRNA-Seq data properties and integrate gene co-expression and PPI networks to extract more disease-related active modules.

• A novel integration approach for microRNA and mRNA data that is further integrated with the PPI network in one framework for more accurate active module discovery.

Chapter 2: Background and Related Work

There are different types of biological data, each giving a different perspective on how the cell works. The data types are generated by different technologies, each developed to measure or observe the behavior of different elements in the cell. Due to the specific properties of each data type, many algorithms were developed to analyze them separately. Recently, however, more focus has been placed on integrating the data types to gain a better understanding of the cell's mechanisms.

In this chapter, we first explain the main elements of the cell and how the cell functions. Then, we briefly discuss the new technologies and their pros and cons to further understand their impact on the active module discovery problem. Finally, we present the state-of-the-art algorithms for the active module discovery problem and discuss their main drawbacks.

2.1 DNA and the central dogma

Deoxyribonucleic acid (DNA) is the blueprint of biological life. DNA is found in all living organisms, storing complex information about the function and the behavior of each cell. In most cells, the DNA molecule consists of two biopolymer strands coiled around each other, forming a double helix. The strands are composed of smaller organic molecules called nucleotides. Each nucleotide contains one of four bases: guanine (G), adenine (A), thymine (T), or cytosine (C). Therefore, the nucleotides are usually referred to as T, C, G, or A.

A gene is a small segment of the DNA that codes for proteins, which are the molecules responsible for defining how the cell works and functions. Additionally, a gene can code for more than one protein. For instance, in human DNA, there are around 20,000 genes coding for around 100,000 proteins. To realize this one-to-many relation, each coding gene can generate different isoforms, each with a different sequence. Each isoform is then mapped to the corresponding protein. Similar to DNA, genes are composed of nucleotides. On the other hand, the main building blocks of proteins are called amino acids.

Genes in the DNA go through different phases in order to be finally translated into proteins. These phases are called the central dogma of biology. A simplified view of the central dogma is shown in Figure 2.1. First, genes in the DNA are transcribed into RNA, which is a single-stranded biopolymer. Then, the genes in the RNAs are shaped into one of their isoforms, producing mRNAs. Finally, the mRNAs are translated into the final proteins. A gene is called expressed in a cell if the cell-specific mRNA contains one or more copies of the gene sequence. Additionally, a gene is called differentially expressed (DE) in a diseased cell if it is either up-regulated, i.e., there are more copies of the gene sequence in the diseased cell than in the healthy cell, or down-regulated, i.e., there are fewer copies of the gene sequence in the diseased cell than in the healthy cell.
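The definition of up- and down-regulation can be made concrete with a small sketch that labels a gene from its average expression in diseased and healthy samples using the log2 fold change. The threshold, the pseudocount, and the function name are hypothetical choices for illustration; real analyses rely on statistical tests over many samples, and fold-change criteria have their own limitations, as noted in Chapter 1.

```python
import math


def regulation(case_mean, control_mean, min_abs_log2fc=1.0, pseudocount=1.0):
    """Label a gene as up, down, or unchanged from mean expression levels.

    The pseudocount avoids division by zero for genes with no reads.
    """
    log2fc = math.log2((case_mean + pseudocount) / (control_mean + pseudocount))
    if log2fc >= min_abs_log2fc:
        return "up", log2fc
    if log2fc <= -min_abs_log2fc:
        return "down", log2fc
    return "unchanged", log2fc


# Hypothetical mean expression values (diseased, healthy).
up_label, up_fc = regulation(7.0, 1.0)      # more copies in the diseased cell
down_label, down_fc = regulation(1.0, 7.0)  # fewer copies in the diseased cell
same_label, same_fc = regulation(3.0, 3.0)  # same level in both
```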

[Figure: DNA → (transcription) → RNA → (translation) → Proteins]

Figure 2.1: Central dogma of biology

2.1.1 Measuring gene expression levels

Measuring gene expression levels and determining which isoform is active are crucial for understanding which genes are differentially expressed in the disease, thus shedding light on the disease mechanism. Many techniques have been used to make these measurements, such as Northern blots, expressed sequence tags (ESTs), serial analysis of gene expression (SAGE), and reverse transcription PCR (RT-PCR) [40, 41]. However, these techniques suffered from limitations on the number of genes that can be analyzed in parallel [41]. Newer techniques, such as microarrays and RNA-Seq, were developed to better measure gene expression levels.

Microarray technology is based on combining the RNA in the cell with short sequences of genes that might exist in the cell. For instance, if the sequence of gene g binds to the RNA in cell C, then we know that g is expressed in C. However, microarrays require prior knowledge of the structure of a gene. Additionally, analyzing genes in new genomes is hard because probes are unavailable for these genomes [40].

RNA-Seq is a more recent technique for gene expression analysis that uses high throughput sequencing [42]. Basically, the RNA, which is single-stranded, is converted to cDNA, which is double-stranded. The cDNA library is then sequenced using high throughput sequencing technology, generating thousands or even millions of short sequences, also called reads. The short sequences are then either mapped to a reference genome to generate the gene expression levels and isoforms or, in the absence of a reference genome, assembled de novo to generate a complete transcriptome. RNA-Seq made measuring gene expression levels easier and more accurate. For instance, unlike microarrays, RNA-Seq can detect unidentified genes while not requiring any information about the distinct isoforms of the gene [40]. On the other hand, although its cost is continuously decreasing, RNA-Seq has always been regarded as a more expensive technique than microarrays.

2.1.2 Other elements in the central dogma

Figure 2.1 shows the main steps through which genes are finally transformed into proteins. However, the cell uses different post-transcriptional mechanisms that further modify the final protein expression level. Therefore, gene expression levels are not always representative of the corresponding protein expression levels.

One of the well-known mechanisms the cell uses to regulate protein expression levels is microRNAs (miRNAs). miRNAs are small non-coding RNAs used by the cell to post-transcriptionally regulate gene expression levels [43]. They inhibit protein synthesis either by stopping protein translation or by degrading the mRNA. miRNAs constitute an inhibition mechanism that has been shown to be very important in different diseases, specifically in cancer progression [44]. For instance, miRNAs were found to be differentially expressed in breast cancer, in addition to successfully classifying estrogen and progesterone receptor status and HER2/neu status [45].

Recently, high throughput sequencing has also been used to measure the expression levels of miRNAs. miRNAs are processed in the same way as mRNAs. However, miRNA sequences are much shorter than mRNA sequences. In addition, their number is much smaller than the number of mRNAs, e.g., approximately 1000 miRNAs exist in human cells [46].

2.2 Active module discovery problem

The active module discovery problem is basically the problem of extracting groups of genes exhibiting certain properties. Such properties could be similar interaction patterns, high correlation, or maximizing a certain function. A commonly assumed and well-studied property is the density of interactions between genes. Even though focusing on the network structure alone does not use dynamic data, it formed the basis for active module discovery. Therefore, we first briefly discuss the dense module extraction problem and its key challenges.

Many algorithms have been developed to find clusters of densely interacting genes in the PPI network or the gene co-expression network, including [14, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]. While they address them differently, most of the algorithms focus on the following challenges:

• Generating overlapping clusters or, in other words, soft clusters, hence, allowing genes to exist in more than one cluster.

• Handling the high degree problem of hub nodes. If hub nodes are not properly handled, they would lead to one or two large clusters and other singleton clusters, hence, leading to non-informative ones.

• Handling noisy interactions. The interactions (edges) in the PPI network are usually noisy. Therefore, a confidence score for each edge should be taken into consideration to handle the noisy edges.

To handle the above challenges, Asur et al. developed an algorithm that took the topological properties of the underlying PPI network into consideration in the clustering phase [4]. They used two graph-based metrics, namely, clustering coefficient and betweenness centrality, to find similar vertices and group them together. The proposed algorithm explicitly handled the hub nodes and tried to cluster them into multiple groups. Additionally, the weights of the edges in the PPI network, i.e., confidence scores, were taken into consideration. Shih and Parthasarathy introduced an iterative Markov Clustering based algorithm to solve the soft clustering problem [9]. Zhang and Li solved the soft clustering problem by introducing a consensus clustering based algorithm [10]. Basically, given different possible input clusters for the data, the algorithm generated more than one consensus cluster. Even though it was not designed specifically for PPI networks, it was shown to be effective in obtaining meaningful clusters from the PPI network as well. Inoue et al. solved the soft clustering and the hub node problems by introducing a random walk based algorithm [11]. The algorithm first generated a diffusion model of the PPI network; then random walks were applied on this model to generate the clusters. Another random walk based algorithm was introduced by Macropol et al. [5]. The algorithm performed repeated random walks with restarts on the PPI network to find local clusters. The surroundings of a seed node were examined to see how close they were to the seed node. The stopping criterion was either finding a cluster of size k or the shortest distance from the seed node to a potential protein exceeding a certain threshold. Li et al. developed another algorithm that was based on dividing the PPI network into three subgraphs: one for high-degree nodes, one for low-degree nodes, and one for the relations between high-degree and low-degree nodes [6]. Li et al. aimed at finding overlapping clusters; hence, they allowed high-degree nodes to exist in more than one cluster. The average size of the obtained modules was five.

Even though the mentioned algorithms have proven effective in extracting functional modules, using only one type of data suffers from several drawbacks [19, 24]. One important drawback is that the PPI network is static and cannot give actual information about the underlying dynamics of the cell [36, 47]. In more detail, the interactions in the PPI networks are obtained under different conditions and from different cells. However, the edges in the PPI network do not contain the underlying condition information [48]. Hence, even though the static graph based algorithms assume that they are returning functionally important modules, such modules might turn out to be false positives. Another drawback is the noise and the bias in the high-throughput technologies used to measure the interactions [49]. Therefore, depending on only one type of data would raise doubts about the quality and reproducibility of the results. Thus, the integration of other forms of dynamic data, such as gene expression, has become inevitable [24, 35, 48].

Most of the algorithms designed to solve the active module discovery problem have been mainly concerned with integrating gene expression data with the PPI network. Table 2.1 shows the most famous algorithms and the common features between them. The first algorithm to introduce the idea of integrating microarray gene expression data and the PPI network for active module discovery was jActiveModules, developed by Ideker et al. [28]. Ideker et al. defined the most active module as the connected module that has the highest weight, where the weights on the nodes are calculated from the genes' p-values. Finding the maximum weighted module is an NP-hard problem. Therefore, they introduced a simulated annealing based algorithm to approximate the solution. A key feature of jActiveModules is that it does not restrict its search to the weighted nodes only but also considers nodes connecting other important weighted nodes. This feature is highly important for microarray data since microarray data contains information for only a few genes. Additionally, jActiveModules processes each sample separately and then finds the most informative module among all of the samples. On the other hand, jActiveModules assumes that there are control-case pairs, which is not always true. In addition, it gives more importance to genes that are differentially expressed (DE). However, there are other genes that might not be DE but whose interaction patterns might be a marker for the disease [50].
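The node-weight aggregation described above can be illustrated with a minimal sketch. It assumes the inverse-normal transform of p-values and the 1/sqrt(k) normalization commonly described for this family of scoring functions; the background calibration and the simulated annealing search are omitted, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_EPS = 1e-12  # assumption: clamp p-values away from 0 and 1 to keep the transform finite


def gene_z_score(p_value: float) -> float:
    """Map a p-value to a z-score with the inverse standard normal CDF."""
    p = min(max(p_value, _EPS), 1.0 - _EPS)
    return NormalDist().inv_cdf(1.0 - p)


def module_score(p_values: list[float]) -> float:
    """Aggregate the z-scores of the k genes in a module as sum(z) / sqrt(k)."""
    z_scores = [gene_z_score(p) for p in p_values]
    return sum(z_scores) / sqrt(len(z_scores))


if __name__ == "__main__":
    # A module of two significant genes scores higher than a mixed module.
    print(module_score([0.01, 0.01]), module_score([0.01, 0.5]))
```

Under this scoring, adding a gene with p = 0.5 contributes a z-score of zero but still increases k, so the module score decreases; this is one way to see why such functions favor DE genes.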

Following this track, many other algorithms have been introduced either to tune jActiveModules or to introduce a new optimization function. For instance, the GXNA algorithm adjusts jActiveModules by using another scoring function; instead of using p-values, it directly uses the gene expression values [51]. PinnacleZ is an implementation of the algorithm introduced by Chuang et al. [52]. The notable change in the algorithm of Chuang et al. in comparison to jActiveModules is the function used to determine the relevance of a module to case samples in comparison to the control samples. Specifically, for each module discovered, they calculated the mutual information between the module scores and the sample class. The Heinz algorithm also works on the p-values [29]. However, Heinz combines all of the p-values together into one value. Then, it finds the active modules by transforming the problem into the well-known prize-collecting Steiner tree problem (PCST). Despite the optimality of the algorithm, relying on the assumption that a gene has similar p-values across the samples is not appropriate, especially for large datasets and heterogeneous diseases. Backes et al. also introduced an integer linear programming based algorithm to find the maximal weighted module [53]. However, they imposed three constraints to address the problem: first, the module should be the heaviest and not only of maximal weight, i.e., a dense module; second, the module should be reachable from a root node; and third, the module size should be at most k.

Instead of defining the problem as finding the maximal weighted module, Ulitsky et al. introduced two algorithms that pose other definitions, MATISSE [54] and DEGAS [30].

The MATISSE algorithm tries to find groups of connected genes exhibiting the same expression behavior. Such modules are discovered by projecting gene correlation values onto the PPI network and finding modules of genes with similar correlation. MATISSE is further restricted to highly DE genes by using a fold change threshold. An obvious drawback of MATISSE is the assumption of a linear correlation between the genes that is maintained across most of the samples. The DEGAS algorithm is a set cover based algorithm that searches for the set of k DE genes that cover most of the samples. DEGAS uses the p-value with a cutoff of 0.05 to determine whether a gene is DE or not. However, the p-value does not always reflect the differential expression and hence the importance of the genes.

Table 2.1: Famous active module discovery algorithms and the features common between them. DE refers to the method used to define DE genes, Local net topology refers to the use of the network topology, Samples diff. refers to the assumption that samples are different, and Case-control pairs refers to the assumption that there is a pair of case-control samples for each patient. p stands for p-value, z stands for z-score, and fc stands for fold change.

Tool    Basic algorithm    DE    Local net topology    Samples diff.    Case-control pairs    Ref
jActiveModules    Score summation    p    [28]
GXNA    Score summation    [51]
heinz    Score summation    p    [29]
PinnacleZ    Score summation    z    [52]
MATISSE    Correlation    fc    [54]
DEGAS    Set cover    p    [30]
NetWalk    Random walks    [56]

Another approach that has recently gained popularity is to model the problem as an information flow problem. The popularity of the information flow based approach is two-fold: first, genes that are not differentially expressed gain importance based on how important their interactions are; second, it takes the global and local network structure into consideration. An example of an algorithm developed based on this idea is NetWalk [55, 56]. NetWalk is a random walk with restart based algorithm. In NetWalk, the gene expression values are used to weight the nodes. Then, the weighted nodes are used to calculate the transition probability from one node to another. A random walk based approach is then applied on the weighted PPI network to generate the final rank of each node. NetWalk finally extracts the active modules by extracting the connected edges with the highest weights. NetWalk also has an interactive interface to compare the active modules obtained from each sample. On the other hand, using information flow for integrating gene expression with the PPI network poses the problem of how to combine and find the most informative active module across all of the samples. In general, information flow based approaches were mainly used to integrate other data types, such as disease similarity data [32] and mutated genes information [57].
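The information flow idea can be illustrated with a generic random walk with restart on a node-weighted graph. This is a minimal sketch under stated assumptions, not NetWalk's actual implementation: the transition probability to a neighbor is taken to be proportional to that neighbor's weight, the restart distribution is proportional to the node weights, all weights are assumed positive, and the function name and restart value are hypothetical.

```python
def random_walk_with_restart(adj, weight, restart=0.3, max_iters=200, tol=1e-12):
    """Rank nodes by the stationary distribution of a weighted walk with restart.

    adj:    dict mapping each node to a list of its neighbors
    weight: dict mapping each node to a positive weight (e.g. from expression)
    """
    nodes = sorted(adj)
    total_weight = sum(weight[n] for n in nodes)
    restart_vec = {n: weight[n] / total_weight for n in nodes}
    rank = dict(restart_vec)

    for _ in range(max_iters):
        nxt = {n: restart * restart_vec[n] for n in nodes}
        for u in nodes:
            denom = sum(weight[v] for v in adj[u])
            if denom == 0:
                # Isolated node: return its mass through the restart distribution.
                for n in nodes:
                    nxt[n] += (1 - restart) * rank[u] * restart_vec[n]
                continue
            for v in adj[u]:
                # Step to neighbor v with probability proportional to weight[v].
                nxt[v] += (1 - restart) * rank[u] * weight[v] / denom
        delta = sum(abs(nxt[n] - rank[n]) for n in nodes)
        rank = nxt
        if delta < tol:
            break
    return rank


if __name__ == "__main__":
    graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    weights = {"a": 1.0, "b": 1.0, "c": 5.0}
    print(random_walk_with_restart(graph, weights))
```

In this toy path graph, node b receives a high rank through its heavily weighted neighbor c even though its own weight is low, which mirrors how genes that are not DE can gain importance from their interactions.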

2.3 microRNA and mRNA integration

Many works integrating miRNAs and mRNA depend on the fact that miRNAs degrade target mRNA; hence, the effect on the gene should be apparent at the gene expression level. For instance, mirConnX constructs an association network between miRNAs, mRNAs, and TFs by applying correlation measures on pairs of them [58]. Another framework, CoMeTa, was also developed to predict miRNA targets [59]. The basic idea is to group miRNA target genes based on their co-expression. Accordingly, CoMeTa can discover new miRNA targets de novo. Jayaswal et al. define a new measure, UD, for the association between miRNAs and mRNAs [60]. The main goal of the UD measure is to calculate the association between miRNA and mRNA without requiring matched miRNA and mRNA samples. The UD measure is based on the average difference in expression between the control and case samples for miRNAs and mRNAs. After discretizing the average expression values, a statistical test is used to measure the independence between the changes in miRNA and mRNA expression. mirDREM is another algorithm developed to understand the relation between miRNAs and mRNAs [61]. mirDREM constructs a probabilistic model for the regulation of mRNA expression values using miRNAs and TFs. mirDREM is not concerned with predicting new miRNA targets; rather, it is concerned with modeling the dynamic behavior of miRNAs and their effect on mRNAs in the different conditions.

Indeed, the above mentioned work is valuable for predicting new miRNA targets. However, for understanding the dynamic behavior of miRNAs and mRNAs and the relation between miRNA target genes, such methods are not sufficient, since the final protein expression level can be significantly affected by miRNAs without having any apparent effect on the gene expression level [62, 63]. A possible solution for overcoming the correlation constraint at the expression level between miRNA and mRNAs was introduced by Cun and Fröhlich [64]. The solution is based on integrating the PPI network with the miRNA target gene network and then applying random walks on the heterogeneous network to rank the genes accordingly. Indeed, such integration would work around the miRNA and mRNA integration problem. However, by focusing only on prioritizing genes through the PPI network, they cannot detect connected modules of genes with indirect dependencies, e.g., through other genes not in the PPI network or through other genes with no change in expression at the mRNA level. Additionally, Cun and Fröhlich do not treat each sample differently; rather, they calculate the t-score for each gene using all of the samples.

Chapter 3: An Evaluation of RNA-Seq Mapping Tools

Next-generation sequencing (NGS) technology has evolved rapidly in the last five years, leading to the generation of hundreds of millions of sequences (reads) in a single run. The number of generated reads varies between 1 million for long reads generated by the Roche/454 sequencer (≈400 base pairs (bps)) and 2.4 billion for short reads generated by Illumina/Solexa and ABI/SOLiD™ sequencers (≈75 bps). The invention of the high-throughput sequencers has led to a significant cost reduction, e.g., a Megabase of DNA sequence costs only $0.1 [65].

Nevertheless, the large amount of generated data tells us almost nothing about the DNA, as stated by Flicek and Birney [66]. This is due to the lack of proper analysis tools and algorithms. Therefore, bioinformatics researchers started to think about new ways to efficiently handle and analyze this large amount of data.

One of the areas that attracted many researchers is the alignment (mapping) of the generated sequences, i.e., the alignment of reads generated by NGS machines to a reference genome. This is because an efficient and highly accurate alignment of this large number of reads is a crucial part of the workflow of many applications, such as genome resequencing [66], DNA methylation [67], RNA-Seq [68], ChIP sequencing, SNPs detection [69], genomic structural variants detection [70], and metagenomics [71]. Therefore, numerous tools have been developed to undertake this challenging task, including MAQ [72], RMAP [73], GSNAP [74], Bowtie [75], Bowtie2 [76], BWA [77], SOAP2 [78], Mosaik [79], FANGS [80], SHRiMP [81], BFAST [82], MapReads [83], SOCS [84], PASS [85], mrFAST [70], mrsFAST [86], ZOOM [87], Slider [88], SliderII [89], RazerS [90], RazerS3 [91], and Novoalign [92]. Moreover, GPU-based tools have been developed to optimally map more reads, such as SARUMAN [93] and SOAP3 [94]. However, due to using different mapping techniques, each tool provides different trade-offs between speed and quality of the mapping. For instance, the quality is often compromised in the following ways to reduce runtime:

• Neglecting base quality score.

• Limiting the number of allowed mismatches.

• Disabling gapped alignment or limiting the gap length.

• Ignoring SNP information.

In most cases, it is unclear how such compromises affect the performance of newly developed tools in comparison to the state of the art ones. Therefore, many studies have been carried out to provide such comparisons. Some of the available studies were mainly focused on providing new tools (e.g., [74, 77]). The remaining studies tried to provide a thorough comparison while each covering a different aspect (e.g., [95, 96, 97, 98, 99]).

For instance, Li and Homer [95] classified the tools into groups according to the used indexing technique and the features the tools support, such as gapped alignment, long read alignment, and bisulfite-treated reads alignment. In other words, in that work, the main focus was classifying the tools into groups rather than evaluating their performance on various settings.

Similar to Li and Homer, Fonseca et al. [99] provided another classification study. However, they included more tools in the study, around 60 mappers, while being more focused on providing a comprehensive overview of the characteristics of the tools.

Ruffalo et al. [97] presented a comparison between Bowtie, BWA, Novoalign, SHRiMP, mrFAST, mrsFAST, and SOAP2. Unlike the above mentioned studies, Ruffalo et al. evaluated the accuracy of the tools in different settings. They defined a read to be correctly mapped if it maps to the correct location in the genome and has a quality score higher than or equal to the threshold. Accordingly, they evaluated the behavior of the tools while varying the sequencing error rate, indel size, and indel frequency. However, they used the default options of the mapping tools in most of the experiments. In addition, they considered small simulated data sets of 500,000 reads of length 50 bps while using an artificial genome of length 500 Mbp and the human genome of length 3 Gbp as the reference genomes.

Another study was done by Holtgrewe et al. [96], where the focus was the sensitivity of the tools. They enumerated the possible matching intervals with a maximum distance k for each read. Afterwards, they evaluated the sensitivity of the mappers according to the number of intervals they detected. Holtgrewe et al. used the suggested sensitivity evaluation criteria to evaluate the performance of SOAP2, Bowtie, BWA, and SHRiMP2 on both simulated and real datasets. However, they used small reference genomes (the S. cerevisiae genome of length 12 Mbp and the D. melanogaster genome of length 169 Mbp). In addition, the experiments were performed on small real data sets of 10,000 reads. For evaluating the performance of the tools on real data sets, Holtgrewe et al. used RazerS to detect the possible matching intervals. RazerS is a fully sensitive mapper, hence it is a very slow mapper [86]. Therefore, scaling the suggested benchmark process to realistic whole genome mapping experiments with millions of reads is not practical. Nevertheless, after the initial submission of this work, RazerS3 [91] was published, significantly improving the running time of the evaluation process.

Schbath et al. [98] also focused on evaluating the sensitivity of the tools. They evaluated whether a tool correctly reports a read as unique or not. In addition, for non-unique reads, they evaluated whether a tool detects all of the mapping locations. However, in their work, like many previous studies, the tools were used with default options, and they tested the tools with a very small read length of 40 bps. Additionally, the error model they used did not include indels and allowed only 3 mismatches.

Even though many studies have been published for evaluating short sequence mapping tools, the problem is still open and several perspectives were not tackled in the current studies. For instance, the above studies did not consider the effect of changing the default options and using the same options across the tools. In addition, some of the studies used small data sets (e.g., 10,000 and 500,000 reads) while using small reference genomes (e.g., 169 Mbp and 500 Mbp) [96, 97]. Furthermore, they did not take the effect of input properties and algorithmic features into account. Here, input properties refer to the type of the reference genome and the properties of the reads, including their length and source. Algorithmic features, on the other hand, pertain to the features provided by the mapping tool regarding its performance and utility. Therefore, there is still a need for a quantitative evaluation method to systematically compare mapping tools in multiple aspects. In this work, we address this problem and present two different sets of experiments to evaluate and understand the strengths and weaknesses of each tool. The first set includes the benchmarking suite, consisting of tests that cover a variety of input properties and algorithmic features. These tests are applied on real RNA-Seq data and genomic resequencing synthetic data to verify the effectiveness of the benchmarking tests. The real data set consists of 1 million reads while the synthetic data sets consist of 1 million reads and 16 million reads. Additionally, we have used multiple genomes with sizes varying from 0.1 Gbp to 3.1 Gbp. The second set includes a use case experiment, namely, SNP calling, to understand the effects of mapping techniques on a real application.

Furthermore, we introduce a new, albeit simple, mathematical definition for the mapping correctness. We define a read to be correctly mapped if it is mapped while not violating the mapping criteria. This is in contrast to previous works, which define a read to be correctly mapped if it maps to its original genomic location. Clearly, if one knows “the original genomic location”, there is no need to map the reads. Hence, even though such a definition can be considered more biologically relevant, it is neither sufficient nor computationally achievable. For instance, a read could be mapped to the original location with two mismatches (i.e., substitution errors or SNPs) while there might exist a mapping with an exact match to another location. If a tool does not have any a priori information about the data, it would be impossible to choose the two-mismatch location over the exact matching one. One can only hope that such a tool can return “the original genomic location” when the user asks the tool to return all matching locations with two mismatches or less. Indeed, as later shown in the results, our suggested definition is computationally more accurate than the naïve one. In addition, it complements other definitions such as the one suggested by Holtgrewe et al. [96].
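The difference between the two definitions of correctness can be made concrete with a small sketch. It assumes that the mapping criteria consist only of a maximum number of substitution mismatches (no indels or quality scores); the toy reference, the read, and the function names are hypothetical.

```python
def count_mismatches(read: str, window: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(1 for r, g in zip(read, window) if r != g)


def satisfies_criteria(read: str, reference: str, position: int,
                       max_mismatches: int) -> bool:
    """True if mapping `read` at `position` does not violate the mismatch limit."""
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        return False  # the read would run past the end of the reference
    return count_mismatches(read, window) <= max_mismatches


if __name__ == "__main__":
    reference = "TTACCTAATTGGACGTACGG"
    read = "ACGTAC"
    original_location = 2   # the read was sampled here, then gained two mismatches
    exact_location = 12     # an unrelated location that matches the read exactly

    for pos in (original_location, exact_location):
        print(pos, satisfies_criteria(read, reference, pos, max_mismatches=2))
```

With a limit of two mismatches both locations are acceptable, whereas with a limit of zero only the exact location is; a tool without prior knowledge has no basis for preferring the original location.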

To assess our work, we apply these tests on nine well known short sequence mapping tools, namely, Bowtie, Bowtie2, BWA, SOAP2, MAQ, RMAP, Novoalign, GSNAP, and mrFAST (mrsFAST). Unlike the other tools in this study, mrFAST (mrsFAST) is a fully sensitive exact mapper that reports all the mapping locations. Therefore, comparing the mapping accuracy of mrFAST with that of the remaining tools is beneficial in further understanding the behavior of the different tools, even though comparing the execution time will not be fair. Moreover, we compare the performance of these tools with that of FANGS, a long read mapping tool, to show their effectiveness in handling long reads. The remaining tools were chosen according to the indexing techniques they use. Therefore, we can emphasize the effect of the indexing technique on the performance. The experiments are carried out while using the same options for the tools, whenever possible.

The chapter is organized as follows: in the next section, we briefly describe the sequence mapping problem, the mapping techniques used by the tools, and various evaluation criteria used to evaluate the performance of the tools, including other definitions for mapping correctness. Then, we discuss how we designed the benchmarking suite and give a real application for the mapping problem. Finally, we present and explain the results for our benchmarking suite.

3.1 Background

Inexact matching of DNA sequences to a genome is a special case of string matching. It requires incorporating the known properties or features of the DNA sequences and the sequencing technologies, adding additional complexity to the mapping process. In this section, we first give a brief description of a set of features of DNA and sequencing technologies. Then, we explain how the tools used in this study work and support these features. Additionally, we describe the default options setup and show how divergent they are among the tools. Finally, we compare the evaluation criteria used in previous studies.

3.1.1 Features

• Seeding represents the first few tens of base pairs of a read. The seed part of a read is expected to contain fewer erroneous characters due to the specifics of the NGS technologies. Therefore, the seeding property is mostly used to maximize performance and accuracy.

• Base quality scores provide a measure of the correctness of each base in the read. The base quality score is assigned by a phred-like algorithm [100, 101]. The score is Q = −10 log10(e), where e is the probability that the base is wrong. Some tools use the quality scores to decide mismatch locations. Others accept or reject the read based on the sum of the quality scores at mismatch positions.

• Existence of indels necessitates inserting or deleting nucleotides while mapping a sequence to a reference genome (gaps). The complexity of choosing a gap location increases with the read length. Therefore, some tools do not allow any gaps while others limit their locations and numbers.

• Paired-end reads result from sequencing both ends of a DNA molecule. Mapping paired-end reads increases the confidence in the mapping locations due to having an estimation of the distance between the two ends.

• Color space read is a read type generated by SOLiD sequencers. In this technology, overlapping pairs of letters are read and given a number (color) out of four numbers [81]. The reads can be converted into bases; however, performing the mapping in color space has advantages in terms of error detection.

• Splicing refers to the process of cutting the RNA to remove the non-coding part (introns), keeping only the coding part (exons), and joining the exons together. Therefore, when sequencing the RNA, a read might be located across exon-exon junctions. The process of mapping such reads back to the genome is hard due to the variability of the intron length. For instance, the intron length ranges between 250 and 65,130 nt in eukaryotic model organisms [102].

• SNPs are variations of a single nucleotide between members of the same species. SNPs are not mismatches. Therefore, their locations should be identified before mapping reads in order to correctly identify actual mismatch positions.

• Bisulphite treatment is a method used for the study of the methylation state of the DNA [67]. In bisulphite treated reads, each unmethylated cytosine is converted to uracil. Therefore, they require special handling in order not to misalign the reads.
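As an illustration of the base quality score feature above, the following sketch computes Q = −10 log10(e), its inverse, and the sum of quality scores at mismatch positions. The acceptance rule and its threshold are assumptions for illustration only and do not reproduce the rule of any specific tool; all function names are hypothetical.

```python
from math import log10


def phred_quality(error_probability: float) -> float:
    """Phred-like score Q = -10 * log10(e) for a base error probability e."""
    return -10.0 * log10(error_probability)


def error_probability(quality: float) -> float:
    """Inverse transform: e = 10^(-Q / 10)."""
    return 10.0 ** (-quality / 10.0)


def mismatch_quality_sum(read: str, window: str, qualities: list[int]) -> int:
    """Sum of the base quality scores at positions where read and reference differ."""
    return sum(q for r, g, q in zip(read, window, qualities) if r != g)


def accept_alignment(read: str, window: str, qualities: list[int],
                     threshold: int) -> bool:
    """Hypothetical rule: accept if the mismatch quality sum is within the threshold."""
    return mismatch_quality_sum(read, window, qualities) <= threshold


if __name__ == "__main__":
    print(phred_quality(0.001))   # an error probability of 1 in 1000
    print(accept_alignment("ACGT", "ACCT", [30, 30, 20, 30], threshold=70))
```

A mismatch at a low-quality base contributes little to the sum, so such a rule tolerates more mismatches where the sequencer itself was uncertain.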

3.1.2 Tools’ description

For most of the existing tools (and for all the ones we consider), the mapping process starts by building an index for the reference genome or the reads. Then, the index is used to find the corresponding genomic positions for each read. There are many techniques used to build the index [95]. The two most common techniques are the following:

• Hash Tables:

The hash based methods are divided into two types: hashing the reads and hashing the genome. In general, the main idea for both types is to build a hash table for subsequences of the reads/genome. The key of each entry is a subsequence while the value is a list of positions where the subsequence can be found. Hashing based tools include the following tools:

GSNAP [74] is a genome indexing tool. The hash table is built by dividing the reference genome into overlapping oligomers of length 12 sampled every 3 nucleotides. The mapping phase works by first dividing the read into smaller substrings, finding candidate regions for each substring, and finally combining the regions for all of the substrings to generate the final results. GSNAP was mainly designed to detect complex variants and splicing in individual reads. However, in this study, GSNAP is only used as a mapper to evaluate its efficiency.

Novoalign [92] is a genome indexing tool. Similar to GSNAP, the hash table is built by dividing the reference genome into overlapping oligomers. The mapping phase uses the Needleman-Wunsch algorithm with affine gap penalties to find the global optimum alignment.

mrFAST and mrsFAST [70, 86] are genome indexing tools. They build a collision free hash table to index k-mers of the genome. mrFAST and mrsFAST are both developed with the same method; however, the former supports gaps and mismatches while the latter supports only mismatches to run faster. Therefore, in the following, we will use mrsFAST for experiments that do not allow gaps and mrFAST for experiments that allow gaps. Unlike the other tools, mrFAST and mrsFAST report all of the available mapping locations for a read. This is important in many applications such as structural variants detection.

FANGS [80] is a genome indexing tool. In contrast to the other tools, it is designed to handle the long reads generated by the 454 sequencer.

MAQ [72] is a read indexing tool. The algorithm works by first constructing multiple hash tables for the reads. Then, the reference genome is scanned against the tables to find the mapping locations.

RMAP [73] is a read indexing tool. Similar to MAQ, RMAP pre-processes the reads to build the hash table, then the reference genome is scanned against the hash table to extract the mapping locations.

Most of the newly developed tools are based on indexing the genome. Nevertheless, MAQ and RMAP are included in this study to investigate the effectiveness of our benchmarking tests on evaluating read indexing based tools. In addition, we investigate if there is any potential for the read indexing technique to be used in new tools.
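The genome hashing idea described above (subsequence as key, list of positions as value) can be sketched as follows. This is a generic seed-and-verify illustration, not the algorithm of any of the listed tools: it assumes the first k bases of the read are used as the seed and that only substitution mismatches are allowed; the function names and the toy genome are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(genome: str, k: int, step: int = 1) -> dict:
    """Hash table from each k-mer (sampled every `step` bases) to its positions."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, step):
        index[genome[pos:pos + k]].append(pos)
    return index


def map_read(read: str, genome: str, index: dict, k: int,
             max_mismatches: int) -> list:
    """Look up the read's first k bases, then verify each candidate position."""
    hits = []
    for pos in index.get(read[:k], []):
        window = genome[pos:pos + len(read)]
        if len(window) != len(read):
            continue  # candidate runs past the end of the genome
        mismatches = sum(1 for r, g in zip(read, window) if r != g)
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


if __name__ == "__main__":
    genome = "TTACCTAATTGGACGTACGG"
    index = build_kmer_index(genome, k=3)
    print(map_read("ACGTAC", genome, index, k=3, max_mismatches=2))
```

Note that the location at position 2 ("ACCTAA", two mismatches) is not reported even with a limit of two mismatches, because one of its mismatches falls inside the seed; this is the kind of trade-off between speed and sensitivity that seeding introduces.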

• Burrows-Wheeler Transform (BWT):

BWT [103] is an efficient data indexing technique that maintains a relatively small memory footprint when searching through a given data block. BWT was extended by Ferragina and Manzini [104] to a newer data structure, named FM-index, to support exact matching. By transforming the genome into an FM-index, the lookup performance of the algorithm improves for the cases where a single read matches multiple locations in the genome. However, the improved performance comes with a significantly larger index build-up time compared to hash tables.

BWT based tools include the following:

Bowtie [75] starts by building an FM-index for the reference genome and then uses the modified Ferragina and Manzini [104] matching algorithm to find the mapping location. There are two main versions of Bowtie, namely Bowtie and Bowtie 2. Bowtie 2 is mainly designed to handle reads longer than 50 bps. Additionally, Bowtie 2 supports features not handled by Bowtie. It was noticed that both versions had different performance in the experiments. Therefore, both versions are included in this study.

BWA [77] is another BWT based tool. The BWA tool uses the Ferragina and Manzini [104] matching algorithm to find exact matches, similar to Bowtie. To find inexact matches, the authors provided a new backtracking algorithm that searches for matches between substrings of the reference genome and the query within a certain defined distance.

SOAP2 [78] works differently than the other BWT based tools. It uses the BWT and the hash table techniques to index the reference genome in order to speed up the exact matching process. On the other hand, it applies a “split-read strategy”, i.e., splitting the read into fragments based on the number of mismatches, to find inexact matches.
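Exact matching with an FM-index can be sketched with the textbook backward search. The version below is a toy illustration under stated assumptions: the suffix array is built by naive sorting and the occurrence table is stored in full, both of which are far too slow and memory-hungry for a real genome, and none of the tools' modifications for inexact matching are included. Function names are hypothetical.

```python
def bwt_and_suffix_array(text: str):
    """Return the BWT of text + '$' and its suffix array (naive construction)."""
    text += "$"  # sentinel, assumed smaller than every other character
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    return bwt, suffix_array


def build_fm_index(bwt: str):
    """first[c]: count of characters smaller than c; occ[c][i]: count of c in bwt[:i]."""
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return first, occ


def find_exact(pattern: str, text: str) -> list:
    """All start positions of pattern in text, found by backward search."""
    bwt, suffix_array = bwt_and_suffix_array(text)
    first, occ = build_fm_index(bwt)
    lo, hi = 0, len(bwt)  # half-open interval of matching suffix-array rows
    for ch in reversed(pattern):
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[i] for i in range(lo, hi))


if __name__ == "__main__":
    print(find_exact("ACG", "TTACCTAATTGGACGTACGG"))
```

The search touches the index once per pattern character regardless of how many locations match, and all matching locations end up in one contiguous suffix-array interval, which is why reads that match multiple genomic locations are handled efficiently.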

In addition to providing different mapping techniques, each tool handles only a subset of the features of the DNA sequences and the sequencing technologies. Moreover, there are differences in the way the features are handled, which are summarized in Table 3.1. For instance, BWA, SOAP2, and GSNAP accept or reject an alignment based on counting the number of mismatches between the read and the corresponding genomic position. On the other hand, Bowtie, MAQ, and Novoalign use a quality threshold (i.e., alignment score) to perform the same function. The quality threshold is different from the mapping quality. The former is the probability of the occurrence of the read sequence given an alignment location, while the latter is the Bayesian posterior probability for the correctness of the alignment location calculated from all of the alignments found for the read.

In some cases, the features are partially supported. For example, SOAP2 supports gapped alignment only for paired end reads, while BWA limits the gap size. Therefore, considering only one of the above features when comparing between the tools would lead to under- or over-estimation of the tools’ performance.

3.1.3 Default options of the tested tools

Table 3.1: Features supported by each tool. PE: paired-end only, mm.: mismatches, QS: base quality score, count: total count of mismatches in the read, AS: alignment score, and empty cells mean not supported.

Feature    Bowtie    Bowtie2    BWA    SOAP2    MAQ    RMAP    GSNAP    FANGS    Novoalign    mrFAST    mrsFAST
Seed mm.    ≤ 3    Any    Up to 2    Any    Any
Non-seed mm.    QS    AS    Count    Count    QS    Count    Count    Count    QS    Count    Count
Var. seed len.    > 5    Any    > 28
Mapping qual.    Yes    Yes    Yes    Yes
Gapped align.    Yes    Yes    PE    PE    Yes    Yes    Yes    Yes
Colorspace    Yes    Yes    Yes    Yes
Splicing    Yes
SNP tolerance    Yes
Bisulphite reads    Yes    Yes    Yes    Yes

In general, using a tool’s default options yields a good performance while maintaining a good output quality. Most users use the tools with the default options or only tweak some of them. Therefore, it is important to understand the effect of using these options and the kind of compromises made while using them. For the nine tools considered in this work, the most crucial default options are the following:

• Maximum number of mismatches in the seed: The seed-based tools use a default value of 2.

• Maximum number of mismatches in the read: Bowtie2, BWA, and GSNAP determine the number of mismatches based on the read length. It is 10 for RMAP, 2 for mrsFAST, and 5 for SOAP2, FANGS, and mrFAST.

• Seed length: It is 24 for MAQ, 32 for RMAP, and 28 for Bowtie. BWA disables seeding while SOAP2 considers the whole read as the seed.

• Quality threshold: It is equal to 70 for MAQ and Bowtie while it depends on the read length and the genome size for Novoalign.

• Splicing: This option is enabled for GSNAP.

• Gapped alignment: It is enabled for Bowtie2, GSNAP, BWA, Novoalign, and MAQ while it is disabled for SOAP2.

• Minimum and maximum insert sizes for paired-end mapping: The insert size represents the distance between the two ends. The values used for the minimum and the maximum insert sizes are 0 and 250 for Bowtie and MAQ, 0 and 500 for BWA and Bowtie2, 400 and 500 for SOAP2, and 100 and 400 for RMAP. mrFAST and mrsFAST do not have default values for the minimum and maximum insert sizes.

Indeed, as will be shown in the results section, having different default values leads to different results for the same data set. Hence, using the same values when comparing the tools is important.

3.1.4 Evaluation criteria

In general, the performance of the tools is evaluated by considering three aspects, namely, the throughput or the running time, the memory footprint, and the mapping percentage. The throughput is the number of base pairs mapped per second (bps/sec), while the memory footprint is the memory required by the tool to store and process the read or genome index. The mapping percentage is the percentage of reads each tool maps.
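As a small illustration of the two numeric criteria, the sketch below computes the mapping percentage and the throughput. It assumes that the number of mapped base pairs equals the number of mapped reads times the read length; the function names and the run figures in the example are hypothetical.

```python
def mapping_percentage(mapped_reads, total_reads):
    """Percentage of the input reads that the tool mapped."""
    return 100.0 * mapped_reads / total_reads


def throughput_bps(mapped_reads, read_length, running_time_sec):
    """Mapped base pairs per second.

    Assumption: every mapped read contributes `read_length` base pairs.
    """
    return mapped_reads * read_length / running_time_sec


# Hypothetical run: 800,000 of 1,000,000 reads of length 100 mapped in 400 s.
print(mapping_percentage(800_000, 1_000_000))   # 80.0
print(throughput_bps(800_000, 100, 400.0))      # 200000.0
```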

The mapping percentage is further divided into a correctly mapped reads part and an error (false positives) part. Many definitions of the error have been suggested in previous studies. For instance, for simulated reads, the naïve and most used definition of error is the percentage of reads mapped to an incorrect location (i.e., a location other than the genomic location the read was originally extracted from) [77, 74]. Clearly, this definition is neither sufficient nor computationally correct. Figure 3.1 gives an example explaining the drawbacks of this definition. After applying sequencing errors, the read does not exactly match the original genomic location. Since the tools do not have any a priori information about the data, it would be impossible to choose the location with two mismatches as the best mapping location over the exactly matching one. Therefore, the naïve criterion would judge the tool as incorrectly mapping the read if the tool returned either alignment (2) or (3), while in fact it picked a more accurate match.

[Figure 3.1 content: a reference sequence (... C C C G C C G G A A A T T ...), the simulated read (C C G C C G G G A A), and three candidate alignments of the read: (1) MQ=40, (2) MQ=35, (3) MQ=50.]

Figure 3.1: An example showing how the different evaluation criteria work. In the upper part of the figure, the sequence in blue is the original genomic position where the simulated read was extracted from. After applying sequencing errors, the read does not exactly match the original location (3 mismatches). In the lower part of the figure, three possible alignment locations for the read are shown with their mapping quality score (MQ). The naïve criterion would only consider alignment (1) as the correct alignment. For the Ruffalo et al. [97] criterion, if the used threshold is 30, then (1) is correctly mapped while (2) and (3) are incorrectly mapped-strict. On the other hand, if the threshold is 40, then (3) is considered as incorrectly mapped-relaxed. The Holtgrewe et al. [96] criterion would detect (1) and (2) and consider them correctly mapped, while (3) would be considered as incorrectly mapped.

The naïve definition of the error was further modified by Ruffalo et al. [97] to develop a more concrete definition. The authors incorporated the mapping quality information such that a read is correctly mapped if it is mapped to the original genomic location while having a mapping quality greater than a certain threshold. They further categorized the incorrectly mapped reads into incorrectly mapped-strict and incorrectly mapped-relaxed. The incorrectly mapped-strict are the reads that were mapped with a quality higher than the threshold while not mapped to the original genomic location. On the other hand, the incorrectly mapped-relaxed are the reads that were mapped to an incorrect location with a quality higher than the threshold and for which there is no correct mapping with a mapping quality higher than the threshold. As an example, in Figure 3.1, if the used threshold is 30, then the read would be considered correctly mapped if the tool returned alignment (1), while it would be considered as incorrectly mapped-strict if the tool returned either alignment (2) or (3). On the other hand, if the used threshold is 40, a read would be incorrectly mapped-relaxed if the tool returned alignment (3). Indeed, this is a valuable evaluation criterion; however, many tools, such as SOAP2, RMAP, and BWA, do not use quality scores in the mapping phase. In addition, not all of the tools report the mapping quality.
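The categories above can be illustrated with a short Python sketch applied to the three alignments of Figure 3.1. It is one possible reading of the definitions as summarized here, not the implementation of Ruffalo et al.: it returns the most specific label, treats "higher than the threshold" as a strict inequality, and uses hypothetical names (`Alignment`, `classify`).

```python
from collections import namedtuple

# at_original_location: True if the alignment is at the position the
# simulated read was extracted from; mq: reported mapping quality.
Alignment = namedtuple("Alignment", ["name", "at_original_location", "mq"])


def classify(returned, candidates, threshold):
    """Label the alignment a tool returned, given all candidate alignments."""
    if returned.mq <= threshold:
        return "below threshold"          # not counted as mapped
    if returned.at_original_location:
        return "correctly mapped"
    # Wrong location: relaxed only if no correct alignment clears the threshold.
    correct_exists = any(a.at_original_location and a.mq > threshold
                         for a in candidates)
    if correct_exists:
        return "incorrectly mapped-strict"
    return "incorrectly mapped-relaxed"


# The three alignments of Figure 3.1.
a1 = Alignment("(1)", True, 40)
a2 = Alignment("(2)", False, 35)
a3 = Alignment("(3)", False, 50)
candidates = [a1, a2, a3]

for threshold in (30, 40):
    for a in candidates:
        print(threshold, a.name, classify(a, candidates, threshold))
```

With a threshold of 30 the sketch labels (1) as correctly mapped and (2) and (3) as incorrectly mapped-strict; with a threshold of 40 it labels (3) as incorrectly mapped-relaxed, matching the example in the text.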

Another definition was introduced by Holtgrewe et al. [96]. Unlike the previous works, the authors tried to find a gold standard for each read, where a gold standard refers to all of the possible matching intervals for each read with a maximum distance k from the read. To enumerate all of the possible matching intervals, the authors used RazerS to detect the initial seed location for each interval. Afterwards, they developed a method to find the boundary of the interval centered at the seed and with a maximum distance k from the read. They named the suggested evaluation method Rabema. As an example, a possible interval with k = 3 would contain alignments (1) and (2) in Figure 3.1. Accordingly, Holtgrewe et al. defined the false negatives as the intervals missed by the mapper and the false positives as the intervals returned by the mapper and not included in the gold standard. However, because Rabema consists of a seed detection phase and an enumeration phase and depends on RazerS to return the seed locations for the matching intervals, it is impractical to apply to large genomes and long reads; e.g., RazerS took 25 hours to map 1 million reads of length 100 to the Human genome, and its running time doubled when the read length increased from 75 to 100 [86]. Therefore, Holtgrewe et al. suggested another mode, an oracle mode, which makes use of the original location of simulated reads. The oracle mode uses the original location as the seed location instead of using RazerS to detect the initial seed locations. However, this method is only suitable when it is known a priori that the possible mapping locations for a read are around the simulated location (e.g., alignment (3) in Figure 3.1 would be missed in the oracle mode). Nevertheless, after the initial submission of this work, RazerS3 [91] was published, making a significant improvement in Rabema’s running time and alleviating the slowness problem.

Even though the suggested definition of a gold standard quantitatively estimates the sensitivity of each mapper, it suffers from a couple of drawbacks. First, the definition does not take into consideration whether the alignments violate the mapping criterion of the mapper. For instance, in Figure 3.1, the sensitivity of the mapper would increase if it detected alignments (1), (2), and (3). However, if the mapping criterion of the mapper is to allow a maximum of two mismatches, then alignment (1) should not have been detected by the mapper and should be considered as a wrong alignment or error. Second, quality-aware tools, such as Bowtie, MAQ, and Novoalign, would be incorrectly evaluated by Rabema since they use the quality threshold to accept or reject a read instead of calculating the edit or Hamming distance. Therefore, they might map a read with more mismatches than the limit allowed by Rabema.
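The gold-standard idea can be sketched in a few lines of Python. This is a deliberately simplified illustration and not Rabema itself: it enumerates start positions by brute force, uses the Hamming distance instead of matching intervals under edit distance, and all function names are hypothetical.

```python
def hamming(a, b):
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def gold_standard(reference, read, k):
    """All start positions where the read matches with at most k mismatches."""
    n = len(read)
    return {i for i in range(len(reference) - n + 1)
            if hamming(reference[i:i + n], read) <= k}


def evaluate(gold, reported):
    """False negatives, false positives, and sensitivity of a mapper's output."""
    reported = set(reported)
    false_negatives = gold - reported
    false_positives = reported - gold
    sensitivity = len(gold & reported) / len(gold) if gold else 1.0
    return false_negatives, false_positives, sensitivity


reference = "ACGTACGTTTACGAACGT"
gold = gold_standard(reference, "ACGT", k=1)
print(sorted(gold))                       # [0, 4, 10, 14]
print(evaluate(gold, reported=[0, 4, 7]))  # ({10, 14}, {7}, 0.5)
```

Note that position 7 is counted as a false positive regardless of the mapper's own acceptance rule, which is the first drawback discussed above.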

3.2 Methods

3.2.1 Benchmark design

In this section, we present the features covered by our benchmarking suite. In addition, we explain how they were previously addressed by the tools we mention in this work. However, two algorithmic features, namely SNPs and Splicing awareness, are not presented in the results section due to being supported only by one tool. The tests are categorized as follows:

• Mapping options

Quality threshold: MAQ, Bowtie, and Novoalign use the quality threshold to determine the number of allowed mismatches. Therefore, setting a quality threshold is similar to explicitly setting the number of mismatches. However, there is no hard limit on the actual number of mismatches. The impact of varying the quality threshold while finding a mapping between the quality threshold and the number of mismatches has not been studied before.

Number of mismatches: Changing the number of allowed mismatches affects the percentage of mapped reads. This effect was studied in [74]; however, the mismatches were generated uniformly on the genome, which does not mimic the real distribution of mismatches.

Seed length: Seeding-based tools impose limits on the number of mismatches in the seed part. As a result, increasing or decreasing the length of the seed part affects the percentage of mapped reads. The effect of the seed length has not been studied in detail before.

• Input properties

Read length: The read length varies between 30 bps for ABI’s SOLiD and Illumina’s Solexa sequencers and up to 500 bps for Roche’s 454. Therefore, the impact of the read length should be considered for throughput evaluation. Even though the effect of the read length is explored in several studies, the default options were usually used, leading to incomparable trade-offs.

Paired-end reads: Mapping paired reads requires the mapping of both ends within a maximum distance between them. Hence, it adds a constraint while finding the corresponding genomic locations.

Genome type: The efficiency of most algorithms is tested by using the Human genome as the reference. However, each genome has its own properties, such as the percentage of repeated regions and ambiguous characters. Therefore, using a single genome does not reveal the effect of these properties. To the best of our knowledge, BWA [77] was the only tool to test its performance on a large genome other than the Human genome.

• Algorithmic features

Gapped alignment: Gapped alignment is important for variant discovery due to its ability to detect indel polymorphisms [95]. Bowtie2, GSNAP, Novoalign, BWA, and mrFAST are the only tools to support it for single-end reads, while the remaining tools support it for paired-end reads only. However, from the results provided by the previous studies, it is not obvious how gapped alignment affects the performance of the tools in comparison to allowing only mismatches.

SNP awareness: Incorporating SNP information into mapping allows considering minor alleles as matches rather than mismatches. Currently, this feature is provided only by GSNAP. It was shown in [74] that integrating SNP information affected around 8% of the reads and allowed mapping 0.4% of the unmapped reads.

Splicing awareness: Reads located across exon-exon junctions would be wrongly aligned using standard alignment algorithms. Splicing awareness is only required for certain types of data such as RNA-Seq data. The only tool that currently supports splicing while performing the mapping phase is GSNAP. It was shown in [74] that the alignment yield increased by 8-9% when splicing detection based on known splice junctions was introduced. However, there was only a 0.3-0.6% increase in the case of detecting novel splice junctions.

• Scalability

The scalability of the mapping tools may be different under different parallel settings. Many tools support multithreading, which is expected to yield linearly increasing speedup with the increase in the number of CPU cores. On the other hand, using multiprocessing is more general and may improve the throughput even for tools that do not support multithreading (e.g., MAQ and RMAP), where multiprocessing refers to using more than one process in a distributed memory fashion while communicating through a message passing interface.

• Accuracy evaluation

Each tool is expected to map a set of reads based on its mapping criteria. However, a subset of the reads might not be mapped (i.e., false negatives) due to the use of heuristics in the mapping algorithm or the limitations of the default options. Moreover, some of the tools map a subset of these reads while violating the mapping criteria.

• Rabema evaluation

The Rabema benchmark enumerates all of the possible matching locations. Then, it evaluates whether the tool detected the possible matching locations with the specified error rate or not. Therefore, the Rabema evaluation is a valuable one and adds another perspective when comparing the tools.
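For the scalability tests listed above, the usual quantities are the speedup and the parallel efficiency, and multiprocessing amounts to giving each process its own share of the reads. The sketch below illustrates both under simple assumptions; the function names and the timings in the example are hypothetical and no mapping tool is actually invoked.

```python
def speedup(serial_time, parallel_time):
    """How many times faster the parallel run is than the serial run."""
    return serial_time / parallel_time


def efficiency(serial_time, parallel_time, workers):
    """Speedup per worker; 1.0 corresponds to ideal linear scaling."""
    return speedup(serial_time, parallel_time) / workers


def split_reads(reads, workers):
    """Split the reads into `workers` nearly equal contiguous chunks,
    one chunk per process."""
    size, extra = divmod(len(reads), workers)
    chunks, start = [], 0
    for w in range(workers):
        end = start + size + (1 if w < extra else 0)
        chunks.append(reads[start:end])
        start = end
    return chunks


# Hypothetical timings: 1200 s on one core, 400 s on four cores.
print(speedup(1200.0, 400.0))          # 3.0
print(efficiency(1200.0, 400.0, 4))    # 0.75
print([len(c) for c in split_reads(list(range(10)), 4)])  # [3, 3, 2, 2]
```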

3.2.2 Usecase: SNP Calling

SNP calling is the process of detecting genetic variations in a given genome. The genetic variations contribute to the generation of different for the same gene, leading to increasing the risk of having complex diseases. Therefore, the discovery of SNPs is a very important process that needs to be done accurately. Many tools have been developed to detect SNPs including ssahaSNP [105] and SNPdetector [106]. These tools were developed to analyze the DNA sequences generated using either the Sanger or the direct PCR amplification methods. However, with the development of the next generation sequencing technology, new tools are required to analyze the new data [107]. The newly developed tools work by first mapping sequences to a reference genome, then using statistical analysis methods to extract SNPs [107] after filtering out low-quality mismatches. Therefore, accurately mapping the reads to the reference genome is a very crucial task in the SNP calling pipeline.
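The map-then-call pipeline described above can be illustrated with a toy pileup caller in Python. This is only a sketch under strong assumptions (ungapped alignments, a simple majority rule instead of a statistical model, hypothetical thresholds and names); it is not how any of the cited SNP callers works.

```python
from collections import Counter, defaultdict


def call_snps(reference, alignments, min_quality=20, min_depth=3,
              min_fraction=0.8):
    """Toy SNP caller.

    alignments: list of (start, sequence, qualities) for ungapped alignments.
    Bases below `min_quality` are filtered out before counting. A position is
    called when enough filtered reads agree on a non-reference base.
    """
    pileup = defaultdict(Counter)
    for start, seq, quals in alignments:
        for offset, (base, q) in enumerate(zip(seq, quals)):
            if q >= min_quality:
                pileup[start + offset][base] += 1

    snps = {}
    for pos, counts in sorted(pileup.items()):
        depth = sum(counts.values())
        base, n = counts.most_common(1)[0]
        if (depth >= min_depth and base != reference[pos]
                and n / depth >= min_fraction):
            snps[pos] = (reference[pos], base)
    return snps


reference = "ACGTACGT"
alignments = [
    (0, "ACTT", [30, 30, 30, 30]),
    (1, "CTTA", [30, 30, 30, 30]),
    (2, "TTAC", [30, 30, 30, 30]),
    (2, "TTAG", [30, 30, 30, 5]),   # the final G is low quality and ignored
]
print(call_snps(reference, alignments))   # {2: ('G', 'T')}
```

A read mapped to the wrong location would add spurious bases to the pileup, which is why the mapping step matters for the calls.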

3.3 Results and discussion

In this section, we present the results from our benchmarking tests. The experiments were performed on a cluster of quad-core AMD Opteron CPUs at 2.4 GHz with 32 GB of RAM. We used SOAP2 v2.20, Bowtie v0.12.6 and v2.0.0-beta5, BWA v0.5.0, MAQ v0.7.0, RMAP v2.05.0, FANGS v0.2.3, GSNAP v2010-07-27, Novoalign v2.07.0, and mrFAST and mrsFAST v2.5.0.4.

Performance evaluation: The performance is evaluated by considering two factors, namely, the mapping percentage and the throughput. The mapping percentage is the percentage of reads each tool maps, while the throughput is the number of mapped base pairs per second (bps/sec). The throughput is calculated by dividing the number of reads mapped by the running time. For genome indexing based tools, the running time includes only the matching time, while it includes the indexing and matching time for read indexing based tools. However, the running time for mrsFAST also includes the indexing time even though it is a genome indexing based tool. This is because the sensitivity of mrsFAST in the experiments depends on the window size used in the indexing phase. Therefore, the index is rebuilt in most of the experiments to maintain full sensitivity for mrsFAST.

In addition, the mapping percentage is further divided into the following:

• Correctly mapped reads: The percentage of reads mapped within the mapping criteria.

• Error: The percentage of reads mapped while violating the mapping criteria. As shown in the background section, this definition provides another evaluation perspective that was not covered by older definitions.

• Amb: The percentage of reads mapped to more than one location with the same number of mismatches. Most of the tools can return more than one mapping location for Amb reads if desired. However, RMAP only reports the number of Amb reads while not providing any information regarding the mapping location and the number of mismatches. Therefore, we will not be able to report the mismatches distribution for the Amb reads reported by RMAP.
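The three categories above can be made concrete with a small Python sketch that labels each read from the mismatch counts of its reported locations. The order of the checks (error before Amb) is an assumption made for the illustration, since the text does not say how a read that is both ambiguous and in violation of the criteria is counted; the function names are hypothetical.

```python
from collections import Counter


def categorize(hit_mismatches, max_mismatches):
    """Label one read from the mismatch counts of its reported locations."""
    if not hit_mismatches:
        return "unmapped"
    best = min(hit_mismatches)
    if best > max_mismatches:
        return "error"      # mapped while violating the mapping criteria
    if hit_mismatches.count(best) > 1:
        return "amb"        # several locations with the same mismatch count
    return "correct"


def percentages(reads, max_mismatches):
    """Percentage of reads in each category."""
    counts = Counter(categorize(hits, max_mismatches) for hits in reads)
    return {label: 100.0 * n / len(reads) for label, n in counts.items()}


# Four toy reads with a limit of 5 mismatches.
reads = [[2], [6], [3, 3, 4], []]
print(percentages(reads, max_mismatches=5))
```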

Data sets: We evaluated the tools on two types of data sets, namely, synthetic data and real data. The synthetic data set mimics reads generated from genomic sequencing while the real data set is for RNA-Seq. The data sets are further generated as follows:

• Synthetic data: A number of tools are available to extract synthetic data sets in FASTQ format from a reference genome, including wgsim [108], dwgsim [109], Mason [110], and ART [111]. wgsim generates reads with a uniform error distribution, while dwgsim provides a uniformly increasing/decreasing error rate. On the other hand, Mason and ART mimic the error rates of Illumina and 454 sequencers. In this study, we use wgsim and ART to generate the synthetic data from the Human genome. wgsim helps in providing a fair comparison between the tools by using a uniform error distribution model, resulting in the same quality score for each base. Therefore, all of the tools can be allowed exactly the same number of mismatches regardless of the technique used to set the maximum number. For wgsim, the reads were generated with a 0.09% SNP mutation rate, a 0.01% indel mutation rate, a 2% uniform sequencing base error rate, and a maximum insert size of 500, which are the same parameters used in [77]. Additionally, Dohm et al. [112] showed that the sequencing error rate for Illumina changes between 0.3% at the beginning of the read and 3.8% at the end of the read. Moreover, according to the error rates and indel rates used by the Mason simulator [110], an indel rate of 0.01% is acceptable. We determined the number of reads to generate using wgsim based on the used tool and the experiment. On the other hand, ART does not explicitly allow the user to choose the number of generated reads. ART generates reads that cover the whole genome with a given coverage level. Therefore, to obtain 1 million reads, we used ART to generate reads that cover the whole genome with 1x coverage. Then, we randomly selected 1 million reads from the output reads.

To make sure that the results are not affected by different wgsim runs, we generated 13 different wgsim data sets and ran a sample of the tools independently on each data set. The sample included BWA, GSNAP, Bowtie, Bowtie2, and SOAP2. We found that the maximum standard deviation from the average was 0.03 (results are not included). Since there is no significant change between the runs, we carry out each experiment only once on a single data set.

• Real data: There are many types of real data sets, such as RNA-Seq data, ChIP-Seq data, and DNA sequences, that are used in different applications. It is important in our evaluation process to choose the right data set type to better evaluate the applicability of the tools in the different applications. Therefore, we prefer to use RNA-Seq data sets as they are used in many applications including SNP and alternative splicing detection. The used data set consists of 1 million reads generated by an Illumina sequencer after isolating mRNA from the Spretus mouse colon tissues. The mouse genome version mm9 was used as the reference genome. Indeed, as will be shown, the tools have similar behavior on both the mouse and the human genomes. Therefore, there is no contradiction in using the human genome for generating the synthetic data while the mouse genome is used for the real data.
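To illustrate what a uniform error model means for the synthetic reads, the following Python sketch draws reads from a reference and substitutes each base independently with a fixed probability, then subsamples a fixed number of reads as was done for the ART output. It is a toy stand-in and not wgsim or ART: it models substitutions only (no SNPs, indels, or paired ends), and the function names and the short reference are hypothetical.

```python
import random


def simulate_reads(reference, n_reads, read_length, error_rate, seed=0):
    """Draw reads at random positions and apply uniform substitution errors.

    Returns a list of (original_start, read_sequence) pairs.
    """
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(reference) - read_length + 1)
        bases = list(reference[start:start + read_length])
        for i, b in enumerate(bases):
            if rng.random() < error_rate:
                # Replace the base by a different nucleotide.
                bases[i] = rng.choice([x for x in "ACGT" if x != b])
        reads.append((start, "".join(bases)))
    return reads


def subsample(reads, n, seed=0):
    """Randomly select n reads without replacement."""
    return random.Random(seed).sample(reads, n)


reference = "ACGTTGCAACGTAGCTAGCTAGGATCC"
reads = simulate_reads(reference, n_reads=20, read_length=8, error_rate=0.02)
print(len(subsample(reads, 5)))   # 5
```

Keeping the original start position with each read is what makes it possible to check later whether a tool mapped the read back to where it came from.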

First, we present the effect of the default options. The results for this experiment are given in Figures 3.2 and 3.3. Figure 3.2 shows the results when using wgsim to generate the synthetic data, while Figure 3.3 shows the results using ART. As stated previously, tools try to use the options that yield a good performance while maintaining a good output quality. For instance, as shown in Figure 3.2, Bowtie achieves a throughput of around 1.6 · 10^5 bps/s at the expense of mapping only 67.58% of the reads. On the other hand, BWA maps 91% of the reads at the expense of having a throughput of 0.1 · 10^5 bps/s. Additionally, SOAP and mrsFAST (Figures 3.2 and 3.3) appear to provide the smallest mapping percentage, while in fact they allow only 2 mismatches whereas other tools such as mrFAST and GSNAP allow more than 5 mismatches. Therefore, using only the default options to build our conclusions would be misleading. Indeed, further experiments show that BWA obtains a high throughput when allowed to use the same options as Bowtie (see BWA-ND in Figure 3.2). Moreover, BWA achieves a higher throughput than Bowtie in other experiments. Therefore, it is important to use the same options to truly understand how the tools behave.

[Plot: throughput (bps/s, x 10^5) versus mapped percentage for Bowtie, Bowtie2, BWA, BWA-ND, SOAP, GSNAP, Novoalign, MAQ, RMAP, mrsFAST, and mrFAST.]

Figure 3.2: Mapping 1 million reads of length 125 extracted from the Human genome using wgsim. Each tool was allowed to use its own default options. BWA-ND refers to BWA’s results while using Bowtie’s default options, which are 2 mismatches in the seed, 3 mismatches in the whole read, and disabling gapped alignment.

[Plot: throughput (bps/s, x 10^5) versus mapped percentage for Bowtie, Bowtie2, BWA, SOAP, GSNAP, Novoalign, MAQ, RMAP, mrsFAST, and mrFAST.]

Figure 3.3: Mapping 1 million reads of length 100 extracted from the Human genome using ART. Each tool was allowed to use its own default options.

In the remaining experiments, unless otherwise stated, the numbers of mismatches in the seed and in the whole read are fixed to 2 and 5, respectively, while the quality threshold is kept at 100. The minimum and maximum insert sizes allowed are 0 and 500, respectively. In addition, the splicing, SNPs, and gapped alignment options are disabled, unless otherwise stated. For the number of reported hits, tools are only allowed to report one location, except for mrsFAST, which does not have this option and reports all of the mapped locations. The default values are used for the remaining options.

3.3.1 Mapping options

Quality threshold is one of the two main metrics used for mismatch tolerance. The other main metric is the explicit specification of the number of mismatches. To compare the tools fairly, a relationship between the two metrics should be found, which is the main target of this experiment. In this experiment, wgsim is used to generate the data set instead of ART or a real data set. The different base quality scores in real data cause quality threshold based tools to allow more mismatches than the other tools. For instance, when allowing a quality threshold of 70 and 5 mismatches for the remaining tools, Bowtie and MAQ map reads with up to 10 mismatches while the other tools are limited to 5 (results are not shown). Therefore, MAQ and Bowtie had a larger mapping percentage than the other tools, and hence the comparison is not fair. Nevertheless, in the following, we show how the quality threshold can be used to mimic the behavior of the explicit specification of the number of mismatches.

[Bar chart: percentage mapped per tool (Bowtie, Bowtie2, BWA, SOAP, GSNAP, Novoalign, MAQ, RMAP, mrsFAST), broken down into reads mapped with 0 to 7 mismatches (mms), error, and amb.]

Figure 3.4: Mapping 1 million reads of length 125 extracted using wgsim on the Human genome while allowing up to 7 mismatches and a quality threshold of 140. The error is 0.6% for SOAP2 and MAQ and 0.45% for GSNAP.

For wgsim generated synthetic data, quality thresholds of 60, 80, 100, 120, and 140 should correspond to 3, 4, 5, 6, and 7 mismatches, respectively. To assess our conclusion, we designed an experiment where all tools were allowed a maximum of 7 mismatches while using a quality threshold of 140. Figure 3.4 shows that the tools map the reads with the same maximum number of mismatches while having similar mapping rates. The remaining differences in the mapping rates are due to the pruning of the search space done by the default options of some of the tools. In addition, other tools incorrectly mapped some of the reads, causing an increase in the mapping percentage. For instance, 0.6% of the reported hits for MAQ and SOAP2 are considered as error (i.e., reads mapped while violating the mapping criteria), while Bowtie’s default options limit the allowed number of backtracks to find mismatches. On the other hand, GSNAP and mrsFAST map around 92% of the reads even though GSNAP reports error hits. This is because they are not seed-based tools, which allows more mismatches to be found in the first few base pairs. Additionally, mrsFAST is a fully sensitive mapper; therefore, it can detect reads missed by other tools.
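The correspondence between quality thresholds and mismatch counts stated above (60, 80, 100, 120, and 140 for 3 to 7 mismatches) is what one obtains if every mismatch contributes a fixed penalty of 20 to the threshold. The Python sketch below encodes that reading; the penalty of 20 is inferred from the listed pairs rather than stated in the text, and the function names are hypothetical.

```python
def mismatches_allowed(quality_threshold, penalty_per_mismatch=20):
    """Largest number of mismatches whose total penalty fits the threshold.

    Assumption: all bases share one quality score, so each mismatch costs
    the same fixed penalty (20 here, inferred from the listed pairs).
    """
    return quality_threshold // penalty_per_mismatch


def threshold_for(mismatches, penalty_per_mismatch=20):
    """Quality threshold that mimics an explicit mismatch limit."""
    return mismatches * penalty_per_mismatch


for t in (60, 80, 100, 120, 140):
    print(t, mismatches_allowed(t))
```

This only holds when all bases have the same quality score, as in the wgsim data; with the varying qualities of real data the number of tolerated mismatches is no longer fixed.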

Number of mismatches: The number of mismatches affects not only the percentage of mapped reads but also the throughput. In particular, the mapping percentage increases nonlinearly with the number of mismatches. Figure 3.5 shows the effect of the number of mismatches in more detail using a wgsim generated data set. There is a 20% increase in the percentage of mapped reads when allowing 3 mismatches instead of 2. On the other hand, there is less than a 0.7% increase when allowing 7 mismatches instead of 6. In addition, the error percentage decreases for a large number of mismatches. For instance, SOAP2’s error percentage is 21% when allowing 2 mismatches, while it is reduced to 1% when allowing 6 mismatches. Additionally, mrsFAST mapped around 0.1-0.5% more reads than the remaining tools since it is a fully sensitive mapper. From the throughput point of view, the tools behave differently. For instance, Bowtie, MAQ, RMAP, and mrsFAST are able to maintain almost the same throughput, while the throughput increases for SOAP2 and GSNAP and decreases for BWA. The degradation in BWA’s performance is due to exceeding the default number of mismatches, leading to excessive backtracking to find mismatch locations.

Additionally, we used a data set of 1 million reads of length 100 generated by ART to evaluate the tools. The results for this experiment are shown in Figure 3.6. Similar to the wgsim results, the increase in the percentage of mapped reads is larger when allowing 3 mismatches instead of 2 than the increase when allowing 7 mismatches instead of 6. Unlike the wgsim results, Bowtie maintains a higher throughput than Bowtie2 for the different numbers of mismatches. This is due to the difference in the read length between the wgsim and ART data sets (100 for ART instead of 125). Moreover, Bowtie uses the quality threshold while Bowtie2 does not.

To further understand the behavior of the tools, the same set of experiments is applied to the real mouse mRNA data set. The results given in Figure 3.7 show that the error percentage for GSNAP still decreases for a large number of mismatches. In addition, there is a small reduction in BWA’s throughput for a large number of mismatches. Interestingly, the throughput of mrsFAST differs between the synthetic data and the real data. On the synthetic data set, mrsFAST’s throughput is higher than that of RMAP while remaining the same across the different numbers of mismatches.

[Two bar charts per tool (Bowtie, Bowtie2, BWA, SOAP, GSNAP, Novoalign, MAQ, RMAP, mrsFAST) for 2 to 7 t-mms: percentage mapped, including amb and error (top), and throughput in bps/sec on a logarithmic scale (bottom).]

Figure 3.5: Comparing the different tools while changing the total mismatches from 2 to 7. T-mms stands for the maximum allowed mismatches. A data set of 1 million reads of length 125 extracted from the Human genome using wgsim was used in this experiment.

[Two bar charts per tool (Bowtie, Bowtie2, BWA, SOAP, GSNAP, Novoalign, MAQ, RMAP, mrsFAST) for 2 to 7 mms: percentage mapped, including Amb and Error (top), and throughput in bps/s on a logarithmic scale (bottom).]

Figure 3.6: Comparing the different tools while changing the total mismatches from 2 to 7. T-mms stands for the maximum allowed mismatches. A data set of 1 million reads of length 100 extracted from the Human genome using ART was used in this experiment.

[Two bar charts per tool (Bowtie, Bowtie2, BWA, SOAP, GSNAP, Novoalign, MAQ, RMAP, mrsFAST) for 2 to 7 t-mms: percentage mapped, including amb and error (top), and throughput in bps/sec on a logarithmic scale (bottom).]

Figure 3.7: Comparing the different tools while changing the total mismatches from 2 to 7. T-mms stands for the maximum allowed mismatches. A real mRNA data set of 1 million reads of length 51 bps extracted from the Spretus mouse strain and mapped against the mouse genome version mm9 was used in this experiment.

On the other hand, on the real data, the throughput of mrsFAST decreases with the increase in the number of mismatches. In addition, there is a 7x reduction in the throughput between 4 t-mms and 5 t-mms. To maintain full sensitivity for a small read length and a large number of mismatches, mrsFAST requires the use of a small window size when building the index (a window size of 8 for 5 t-mms instead of 10 for 4 t-mms). The smaller the window size, the longer it takes to process the index. Additionally, there is a limit on the window size (min 8 and max 14). Therefore, mrsFAST starts to lose its sensitivity for detecting mapping locations for 6 and 7 t-mms.
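The window sizes quoted above are consistent with a pigeonhole argument: if a read with at most e mismatches contains at least e + 1 non-overlapping windows, at least one window matches exactly and can be found in the index. The Python sketch below computes the largest window size that satisfies this condition. That mrsFAST applies exactly this rule is an assumption made for the illustration; the function name is hypothetical, and the limits of 8 and 14 are the ones stated in the text.

```python
def largest_full_sensitivity_window(read_length, mismatches,
                                    min_window=8, max_window=14):
    """Largest window size w such that the read holds at least
    mismatches + 1 non-overlapping windows (pigeonhole condition).

    Returns None if no allowed window size keeps full sensitivity.
    """
    needed = mismatches + 1
    for w in range(max_window, min_window - 1, -1):
        if read_length // w >= needed:
            return w
    return None


# Reads of length 51, as in the real mouse data set.
for e in (4, 5, 6, 7):
    print(e, largest_full_sensitivity_window(51, e))
```

For reads of length 51 this gives a window of 10 for 4 mismatches and 8 for 5 mismatches, and no admissible window for 6 or 7 mismatches, in line with the loss of sensitivity described above.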

Seed length: Theoretically, when fixing the number of allowed mismatches in the seed and in the whole read, changing the seed length affects the mapping results. Specifically, a shorter seed allows more mismatches to be found in the remaining part of the read. Therefore, the percentage of mapped reads would increase even though the throughput would decrease. On the other hand, having a longer seed would result in pruning some parts of the search tree as soon as possible, leading to a throughput improvement. The aim of this experiment is to study this trade-off. As shown in the results given in Figure 3.8 using a wgsim data set, the tools behave as expected. However, there are some exceptions. For instance, when increasing the seed length from 32 to 36, the percentage of mapped reads for SOAP2 and Bowtie decreases; however, the throughput is not affected. In addition, there is a 0.8% increase in the percentage of mapped reads for Bowtie when increasing the seed length from 28 to 32. This behavior is due to the backtracking property that stops once a certain limit is reached. Therefore, as a result of having fewer erroneous bases in the seed part, Bowtie can continue deeper in the depth-first search without exceeding the backtracking limit.

We also carried out the same experiment on the real mouse mRNA data set. The results given in Figure 3.9 show that the same behavior for Bowtie is still obtained on real data. However, Bowtie has only a 0.01% increase when increasing the seed length from 28 to 32 instead of the 0.8% obtained on synthetic data.
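The expected effect of the seed length follows directly from the two mismatch limits, as the following Python sketch shows. It only checks the acceptance rule (a limit on mismatches inside the seed plus a limit on the whole read) and does not model backtracking or any tool's search strategy; the function names and the toy read are hypothetical.

```python
def mismatch_positions(read, ref):
    """Indices at which the read differs from the reference window."""
    return [i for i, (r, g) in enumerate(zip(read, ref)) if r != g]


def passes(read, ref, seed_length, max_seed_mm, max_total_mm):
    """Accept if the seed (first `seed_length` bases) and the whole read
    both respect their mismatch limits."""
    positions = mismatch_positions(read, ref)
    in_seed = sum(1 for i in positions if i < seed_length)
    return in_seed <= max_seed_mm and len(positions) <= max_total_mm


ref = "AAAAAAAAAAAA"
read = "ACACAACAAAAA"   # mismatches at indices 1, 3, and 6

print(passes(read, ref, seed_length=6, max_seed_mm=2, max_total_mm=5))  # True
print(passes(read, ref, seed_length=8, max_seed_mm=2, max_total_mm=5))  # False
```

With the shorter seed the third mismatch falls outside the seed and the alignment is accepted; with the longer seed it counts against the seed limit and the alignment is rejected.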

3.3.2 Input properties

[Two bar charts per tool (Bowtie, BWA, SOAP) for seed lengths 20, 24, 28, 32, and 36: percentage mapped (top) and throughput in bps/sec on a logarithmic scale (bottom).]

Figure 3.8: The effect of changing the seed length on the BWT based tools. The tools were used to map 16 million reads of length 70 bps on the Human genome. SOAP2 does not support seed length < 28.

[Two bar charts per tool (Bowtie, BWA, SOAP) for seed lengths 20, 24, 28, 32, and 36: percentage mapped (top) and throughput in bps/sec on a logarithmic scale (bottom).]

Figure 3.9: The effect of changing the seed length on the BWT based tools. The tools were used to map a real mRNA data set of 1 million reads of length 51 bps extracted from the Spretus mouse strain on the mouse genome version mm9. SOAP2 does not support seed length < 28.

Read length: Longer reads tend to have more mismatches in addition to requiring more time to be fully mapped [113]. In general, for a fixed number of mismatches, increasing the read length decreases the percentage of mapped reads. Therefore, the aim of this experiment is to understand the effect of the read length. The results in Figure 3.10 show that the mapping percentage decreases with the increase in the read length while the error percentage increases. As an example, 95% of FANGS’ output for read length 500 is error compared to 12% of its output for read length 200. This is due to the increase in the number of erroneous bases with the increase in the read length. Therefore, it becomes harder to map the reads within the specified mapping criteria. In addition, Bowtie, Bowtie2, and BWA were the only short sequence mapping tools that managed to map long reads. In particular, the maximum read length was 128 for MAQ, 300 for RMAP, 200 for GSNAP, and 199 for mrsFAST, while SOAP2 took more than 24 hours to map the reads of length 300, and hence its results are not reported. On the other hand, mrsFAST’s run on read length 36 terminated unexpectedly. This is probably due to the small read length and the large number of mismatches. From the throughput point of view, the tools do not maintain the same behavior. For instance, the throughput of Bowtie and SOAP2 decreases for long read lengths. This is due to the backtracking property and the split strategy [78] used by Bowtie and SOAP2, respectively, to find inexact matches. Moreover, Bowtie is better than Bowtie2 for read lengths 36 and 70. On the other hand, even though the throughput of BWA and GSNAP increases with the read length, it starts to decrease for read lengths 500 and 200, respectively. GSNAP works by combining position lists to create candidate mapping regions. Therefore, for long reads, the throughput decreases due to the increase in the work needed to generate and combine position lists. For mrsFAST, the throughput increases with the read length since fewer mapping locations are available for longer reads than for shorter ones.

[Plots: percentage of mapped reads (with amb and error portions) and throughput (bps/sec, log scale) for Bowtie, Bowtie2, BWA, SOAP, MAQ, RMAP, GSNAP, FANGS, Novoalign, and mrsFAST at read lengths 36, 70, 125, 200, 300, and 500.]

Figure 3.10: The effect of changing the read length from 36 to 500. The reads were extracted from the Human genome. RMAP and MAQ are slower than the other tools. Therefore, 1 million reads were used to test MAQ and RMAP while 16 million reads were used for the remaining ones.

[Plots: percentage of mapped reads (with amb and error portions) and throughput (bps/sec, log scale) for Bowtie, Bowtie2, BWA, SOAP, GSNAP, Novoalign, MAQ, RMAP, and mrsFAST at read lengths 36, 70, and 100.]

Figure 3.11: The effect of mapping 1 million reads generated by ART on the mouse genome version mm9 while changing the read length from 36 to 100.

Additionally, we carried out the same experiment on synthetic data sets generated by the ART tool. We did not carry out the experiment on a real data set due to the lack of proper real data sets that have different read lengths but exactly the same coverage, are generated by the same sequencer, and are extracted from the same tissue. The results for this experiment are shown in Figure 3.11. Similar to the wgsim results, the error percentage increases with the read length for GSNAP and SOAP2. Interestingly, the percentage of mapped reads for Bowtie, MAQ, and Novoalign is not significantly affected by the increase in the read length in comparison to the other tools. This is due to the fact that the longer the read is, the smaller the quality scores become for the bases at the end of the read [114]. Therefore, Bowtie, MAQ, and Novoalign can map the reads with more mismatches while maintaining the same quality threshold.
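The difference between a mismatch-count criterion and a quality-threshold criterion can be sketched as follows. This is a simplified model, assuming that a quality-aware tool sums the quality values at the mismatched positions and compares the sum to a threshold; the exact scoring used by Bowtie, MAQ, and Novoalign differs in detail. All names, quality values, and thresholds below are hypothetical.

```python
def mismatch_positions(read, ref):
    """Return the 0-based positions at which read and reference differ."""
    return [i for i, (a, b) in enumerate(zip(read, ref)) if a != b]


def accepted_by_count(read, ref, max_mismatches):
    """Mismatch-count criterion: every mismatch costs the same."""
    return len(mismatch_positions(read, ref)) <= max_mismatches


def accepted_by_quality(read, ref, quals, max_quality_sum):
    """Quality criterion (assumed model): sum the base qualities at the
    mismatched positions and compare the sum to a threshold."""
    total = sum(quals[i] for i in mismatch_positions(read, ref))
    return total <= max_quality_sum


# Hypothetical 70-base read whose last 10 bases have low quality (10)
# and carry 7 mismatches; the first 60 bases have high quality (40).
ref70 = ("ACGT" * 18)[:70]
bases = list(ref70)
for pos in range(63, 70):
    bases[pos] = "A" if bases[pos] != "A" else "C"
long_read = "".join(bases)
quals = [40] * 60 + [10] * 10

print(accepted_by_count(long_read, ref70, max_mismatches=5))
print(accepted_by_quality(long_read, ref70, quals, max_quality_sum=70))
```

The read is rejected by a limit of 5 mismatches but accepted by a quality threshold of 70, since the 7 mismatches sit on low-quality bases at the end of the read.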

Paired-end: Mapping paired-end reads affects the performance of the tools due to the added constraint of mapping both ends within a maximum insert size. Therefore, in this experiment, we want to understand how the performance of the tools is affected when mapping paired-end reads instead of single-end reads. The results in Figure 3.12 (ungapped bars) show that the throughput decreases for all of the tools when mapping paired-end reads, except for BWA, which maintained almost the same throughput, and MAQ, which had a small increase. Even though all of the algorithms work by finding mapping locations for each end alone and then finding the best pair, GSNAP was the only tool to face a 90% drop in the throughput. Additionally, the percentage of mapped reads is lower when mapping paired-end reads because the mapping criteria used for single-end reads are also applied to paired-end reads.

[Plots: percentage of mapped reads (with amb and error portions) and throughput (bps/sec, log scale) for Bowtie, Bowtie2, SOAP, BWA, GSNAP, Novoalign, RMAP, MAQ, and mrsFAST; series: se-ungapped, se-gapped, pe-ungapped, and pe-gapped.]

Figure 3.12: The effect of mapping paired-end reads of length 70 to the Human genome. 1 million reads were used to test RMAP and MAQ while 16 million reads were used to test the other tools. SE and PE refer to single end and paired end, respectively. Error is only provided for PE due to exceeding the allowed insert size.

Even though the maximum insert size was 500, tools such as BWA, SOAP, and GSNAP reported paired-end mappings that exceed it. Novoalign is the exception, as it explicitly requires the user to set the standard deviation of the insert size.
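The insert size constraint can be checked with a short sketch. It assumes that the insert size is measured as the outer distance between the two mates and that a valid pair maps to opposite strands; individual tools may define both differently. The `MateAlignment` type and the coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MateAlignment:
    start: int     # 0-based leftmost reference position of the mate
    length: int    # number of reference bases covered by the mate
    reverse: bool  # True if the mate maps to the reverse strand


def insert_size(m1, m2):
    """Outer distance spanned by the two mates (assumed definition)."""
    left = min(m1.start, m2.start)
    right = max(m1.start + m1.length, m2.start + m2.length)
    return right - left


def is_valid_pair(m1, m2, max_insert):
    """A pair is valid if the mates are on opposite strands and the
    outer distance does not exceed the maximum insert size."""
    return m1.reverse != m2.reverse and insert_size(m1, m2) <= max_insert


forward_mate = MateAlignment(start=1000, length=70, reverse=False)
near_mate = MateAlignment(start=1380, length=70, reverse=True)
far_mate = MateAlignment(start=1500, length=70, reverse=True)

print(insert_size(forward_mate, near_mate), is_valid_pair(forward_mate, near_mate, 500))
print(insert_size(forward_mate, far_mate), is_valid_pair(forward_mate, far_mate, 500))
```

A filter of this kind is one way to count the paired-end mappings that are reported as errors because they exceed the allowed insert size.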

Genome type: To capture the effect of the genome type, we designed an experiment in which the Human, Chimpanzee, Mouse, Zebrafish, Lancelet, A. mellifera, and C. elegans genomes were used as reference genomes. The sizes of these genomes are 3.1 Gbps, 3.0 Gbps, 2.5 Gbps, 1.5 Gbps, 0.9 Gbps, 0.57 Gbps, and 0.1 Gbps, respectively. Theoretically, for genome indexing based tools, the throughput is expected to slightly increase with the decrease in the genome size. However, the results in Figure 3.13 show that some tools do not act as expected. For instance, there is a difference in the throughput between the Chimpanzee and the Human genomes even though their sizes are similar. In addition, the throughput of SOAP2 and Novoalign decreases significantly for the Zebrafish genome, while GSNAP did not finish its run on the same genome despite running for two days.

[Plots: percentage of mapped reads (with amb and error portions) and throughput (bps/s, log scale) for Bowtie, Bowtie2, BWA, SOAP, GSNAP, Novoalign, MAQ, RMAP, and mrsFAST on the Human, Chimp, Mouse, Zebrafish, Lancelet, A. mellifera, and C. elegans genomes.]

Figure 3.13: 16 million reads of length 70bps were generated from the Human, Zebrafish, Lancelet, Chimpanzee, A. mellifera, and C. elegans genomes using wgsim for this test. 1 million reads were used for MAQ and RMAP.

The reason for this behavior is the large repetition rate in the Zebrafish genome. For instance, while mapping 1 million randomly generated reads from the Zebrafish genome, around 600 reads were mapped to more than 100,000 locations each, whereas for the Lancelet genome the maximum number of locations is around 10,000, reached by only 1 read. Additionally, mrsFAST detects more than 8 billion locations while mapping reads to the Zebrafish genome, whereas it detects only 24 million while mapping reads to the Lancelet genome. Hence, for GSNAP, the large repetition rate leads to long genomic position lists, resulting in a significant slowdown. Another interesting result is the ability of most of the tools to map more than 96% of the reads for the Zebrafish data set compared to around 91% for the Human and 89% for the Lancelet. The large mapping percentage is also due to the large repetition rate: because the reads are generated synthetically, a large number of them come from the repeated regions. As a result, the probability of finding a mapping location increases.

In addition to the above results, we also noticed that Bowtie scales better than Bowtie2 on different genomes. Moreover, MAQ’s throughput for the C. elegans genome is larger than Novoalign’s while maintaining a comparable mapping percentage. Therefore, read indexing based tools might perform better than some genome indexing based tools for small genomes, despite being very slow for large genomes.
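The effect of the repetition rate on the number of candidate mapping locations can be illustrated with a toy exact-match index. This is only a sketch under simplifying assumptions (exact matching, reads as long as the indexed k-mers, tiny artificial genomes); it is not the indexing scheme of any of the evaluated tools, and all names are hypothetical.

```python
import random
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map every k-mer of the genome to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def location_counts(reads, index):
    """Number of exact mapping locations for each read."""
    return [len(index[r]) if r in index else 0 for r in reads]


def sample_reads(genome, k, starts):
    """Extract reads of length k at the given start positions."""
    return [genome[s:s + k] for s in starts]


K = 10
# Artificial genomes of 200 bases: one built from a single repeated unit,
# one drawn at random with a fixed seed.
repetitive_genome = "ACGTACGTAC" * 20
rng = random.Random(0)
diverse_genome = "".join(rng.choice("ACGT") for _ in range(200))

starts = [0, 50, 100, 150]
rep_counts = location_counts(sample_reads(repetitive_genome, K, starts),
                             build_kmer_index(repetitive_genome, K))
div_counts = location_counts(sample_reads(diverse_genome, K, starts),
                             build_kmer_index(diverse_genome, K))

print("repetitive genome:", rep_counts)
print("diverse genome:   ", div_counts)
```

Reads drawn from the repeated unit hit every copy of it, so each position list is long, while reads from the random genome typically have very few locations. Every read still finds at least one location, which is consistent with a high mapping percentage for synthetically generated reads.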

To further understand the behavior of the tools, we generated a data set of 1 million reads using ART. Figure 3.14 shows the results using the ART data sets. Similar to the wgsim results, SOAP2 and Novoalign still encounter a significant decrease in the throughput when mapping the Zebrafish data set. Additionally, Bowtie still scales better than Bowtie2 with the different genomes. Interestingly, GSNAP finished its run on the Zebrafish data set even though it still faces a decrease in the throughput. On the other hand, unlike the wgsim results, mrsFAST encounters a decrease in the throughput when mapping the Zebrafish data set. It is not obvious why mrsFAST encounters such a decrease even though its performance on the other genomes remains the same regardless of using wgsim or ART.

In general, the throughput for the tools increased when using ART instead of wgsim to generate the data sets. However, the relative performance between the tools and the different genomes is still the same.

[Plots: percentage of mapped reads (with amb and error portions) and throughput (bps/s, log scale) for Bowtie, Bowtie2, BWA, SOAP, GSNAP, Novoalign, MAQ, RMAP, and mrsFAST on the Human, Chimp, Mouse, Zebrafish, Lancelet, A. mellifera, and C. elegans genomes.]

Figure 3.14: 1 million reads of length 70bps were generated from the Human, Zebrafish, Lancelet, Chimpanzee, A. mellifera, and C. elegans genomes using ART.

3.3.3 Algorithmic features

Gapped alignment should improve the mapping percentage at the cost of decreasing the throughput. We designed an experiment to understand the effect of gapped alignment. Tools were used to map synthetically generated reads of length 70 to the Human genome while allowing one gap of length 3. However, mrFAST does not provide any option to limit the gap size. The results in Figure 3.12 show that the mapping percentage increases by 4% for SOAP2 and 1.5% for mrFAST in case of gapped alignment, while there is no change for BWA and GSNAP. However, there is a drop of 15% and 75% in the throughput for BWA and GSNAP, respectively. The decrease for GSNAP is due to the overhead added to the algorithm to find pairs of candidate regions that co-localize within a maximum allowed gap size. The algorithm tries to find a crossover between the two regions without exceeding the maximum number of mismatches, leading to a significant decrease in the throughput. Interestingly, the decrease in the throughput is smaller for the real data set, as shown in Figure 3.15. However, the decrease is still larger than the decrease in the throughput for the remaining tools.

For the real data set, mrsFAST (mrFAST) is not included in the results since the minimum allowed window size in the indexing phase does not guarantee full sensitivity for mrFAST.

[Plots: percentage of mapped reads and throughput (bps/sec, log scale) for Bowtie2, BWA, GSNAP, and Novoalign; series: se-ungapped and se-gapped.]

Figure 3.15: mRNA data set of 1 million reads extracted from the Spretus mouse strain is used in this experiment and mapped on the mouse genome version mm9. mrsFAST is used for ungapped alignment and mrFAST is used for gapped alignment.

3.3.4 Scalability

In this experiment, we tested the multithreading behavior. In addition, pMap [115] was used to run multiple instances of each tool on a number of processors on a single node to test the multiprocessing effect. pMap is an open-source MPI-based tool that enables parallelization of existing short sequence mapping tools by partitioning the reads and distributing the work among the different processors. A single node was used in the multiprocessing experiment to understand the effect of a good implementation of multithreading. The results for both experiments are given in Figure 3.16.

We can observe from the multithreading results that the tools had almost a linear speedup up to 4 threads. However, when increasing to 8 threads, Bowtie was the only tool to achieve an 8x speedup. In addition, BWA had a similar speedup in both multithreading and multiprocessing. For the multiprocessing experiment, FANGS achieved almost a 6x speedup while there was only a small improvement for MAQ and RMAP. Most of the remaining tools were able to maintain more than a 5x speedup for 8 processors; however, this is less than a linear speedup. One reason for this degradation is the overhead of the distribution and merging steps required by distributed memory systems. As expected, multithreading provides almost a linear speedup; however, it is limited by the number of cores.

In general, using multiprocessing provides more degrees of freedom by parallelizing tools that do not support multithreading and by making use of the available computational resources.

Another important observation is the effect of the indexing method on the total speedup. Read indexing based tools did not have any significant speedup in comparison to the genome indexing based ones, which had more than a 5x speedup. Therefore, genome indexing is more efficient when designing a tool based on read partitioning parallelism.
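The read partitioning idea and the speedup figures discussed above can be made concrete with a small sketch. It only mimics the partitioning step and the speedup arithmetic; it does not use MPI and is not pMap's implementation. The function names and the running times are hypothetical.

```python
def partition_reads(reads, n_workers):
    """Split the reads into n_workers contiguous, nearly equal chunks."""
    base, extra = divmod(len(reads), n_workers)
    chunks, start = [], 0
    for worker in range(n_workers):
        size = base + (1 if worker < extra else 0)
        chunks.append(reads[start:start + size])
        start += size
    return chunks


def speedup(t_serial, t_parallel):
    """Ratio of the serial running time to the parallel running time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_workers):
    """Fraction of the ideal (linear) speedup that was achieved."""
    return speedup(t_serial, t_parallel) / n_workers


# Ten toy reads distributed over four workers.
chunks = partition_reads([f"read{i}" for i in range(10)], 4)
print([len(c) for c in chunks])

# Hypothetical timings: 800 s on one processor, 160 s on eight.
print(speedup(800.0, 160.0), efficiency(800.0, 160.0, 8))
```

With these made-up timings, a 5x speedup on 8 processors corresponds to an efficiency of 0.625, the kind of sub-linear scaling that distribution and merging overhead would produce.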

[Plots: speedup for Bowtie, Bowtie2, BWA, SOAP, and GSNAP with 2, 4, and 8 threads (upper panel), and for Bowtie, BWA, SOAP, MAQ, RMAP, GSNAP, and FANGS with 2, 4, and 8 processors (lower panel).]

Figure 3.16: 16 million reads of length 125 were mapped to the Human genome while using multithreading (the upper figure) or multiprocessing (the lower figure).

3.3.5 Accuracy evaluation

The aim of this experiment is to evaluate the percentage of reads each tool actually maps out of the set of mappable reads. A read is mappable if the distance between the read and its original genomic location does not violate the mapping criteria. In this experiment, the reads were generated using ART to measure the sensitivity of the tools when the distribution of mismatches varies. The mapping criteria were fixed to five mismatches for Bowtie2, SOAP, GSNAP, BWA, mrsFAST, and RMAP. For the remaining tools, a quality threshold of 100 was used. In general, gapped alignment was disabled. The results given in Table 3.2 show that Bowtie did not map around 0.14% of the set of mappable reads (i.e., false negatives) while Bowtie2 did not map around 7.71%. Moreover, Bowtie mapped 93% of the reads while Bowtie2 only mapped 85%. Nevertheless, the sensitivity of both tools can be increased by changing the default options at the expense of significantly decreasing the throughput. Interestingly, BWA, SOAP2, and mrsFAST mapped all of the mappable reads without any error.

Table 3.2: Evaluating the sensitivity of the tools on a data set of 1 million reads of length 70 generated by ART. The numbers are in percentage. The Reported mapped percentage is the total percentage of reads mapped by each tool. It is equal to Actual Mapped + (Expected Unmapped - Actual Unmapped), while Reported correct is the total number of correctly mapped reads.

Bowtie Bowtie2 BWA SOAP2 MAQ RMAP GSNAP Novoalign mrsFAST
Mapped
Expected 93.57 93.25 91.29 91.29 93.57 90.12 93.25 96.18 93.25
Actual 93.43 85.54 91.29 91.29 92.92 82.53 93.25 96.02 93.25
Error 0.73 0.03
Unmapped
Expected 6.43 6.75 8.71 8.71 6.43 9.88 6.75 3.82 6.75
Actual 6.25 6.68 8.32 6.83 5.08 8.29 3.66 3.81 6.62
Error 1.73 1.25 1.5 2.97
Reported mapped 93.61 85.61 91.68 93.17 94.27 84.11 96.34 96.03 93.38
Reported correct 93.61 85.61 91.68 90.71 93.02 82.61 93.37 96.03 93.38
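The relation given in the caption of Table 3.2 and the false negative percentages quoted in the text can be checked with a few lines. The helper names are ours; the input values are copied from the Expected and Actual rows of Table 3.2 for three of the tools.

```python
def reported_mapped(actual_mapped, expected_unmapped, actual_unmapped):
    """Reported mapped = Actual Mapped + (Expected Unmapped - Actual Unmapped)."""
    return actual_mapped + (expected_unmapped - actual_unmapped)


def false_negatives(expected_mapped, actual_mapped):
    """Percentage of mappable reads that the tool did not map."""
    return expected_mapped - actual_mapped


# (expected mapped, actual mapped, expected unmapped, actual unmapped)
table_rows = {
    "Bowtie": (93.57, 93.43, 6.43, 6.25),
    "Bowtie2": (93.25, 85.54, 6.75, 6.68),
    "GSNAP": (93.25, 93.25, 6.75, 3.66),
}

for tool, (exp_m, act_m, exp_u, act_u) in table_rows.items():
    rep = round(reported_mapped(act_m, exp_u, act_u), 2)
    fn = round(false_negatives(exp_m, act_m), 2)
    print(f"{tool}: reported mapped = {rep}, false negatives = {fn}")
```

The term Expected Unmapped - Actual Unmapped is the percentage of unmappable reads that a tool mapped anyway, which is why GSNAP's reported mapped percentage exceeds its expected one.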

In general, the tools were able to map a percentage of the unmappable reads; however, these reads were mapped with a large error percentage. For instance, even though GSNAP mapped around 3% of the unmappable reads, only 0.03% of them were correctly mapped. Therefore, even though GSNAP maps the largest percentage of reads, other tools such as BWA and Novoalign are more accurate and precise than GSNAP.

It is important to note that the percentage of reads mapped from the unmappable reads is similar to the percentage of incorrectly mapped reads-relaxed given in the work of Ruffalo et al. [97]. However, they define a read to be unmappable if it has a mapping quality less than a certain threshold, while we consider it unmappable if it violates the mapping criteria of the tool.

3.3.6 Rabema evaluation

The aim of this experiment is to evaluate the tools based on the number of reads with a specified error rate that each tool has been able to map. Unlike the previous experiment, this experiment does not take into consideration how each tool works. Therefore, it is similar to evaluating the efficiency of each mapping algorithm (i.e., seeding vs. non-seeding, quality scores vs. mismatches). The experiment is performed on a synthetic data set of reads of length 100 extracted from the Human genome using ART. The maximum allowed error rate was 5%, i.e., 5 mismatches in this case. The results for this experiment are shown in Table 3.3. Rabema takes the output SAM file from each tool as its input. However, MAQ and RMAP do not produce their output in the SAM format. Therefore, they are not included. Additionally, GSNAP results are not included since GSNAP garbles the quality scores in its SAM output.

Table 3.3: Rabema evaluation results on the different tools using a data set of 1 million reads of length 100 extracted from the Human genome using ART. The maximum allowed error is 5% (i.e., 5 mismatches in this case). #Reads is the number of reads expected to be mapped with certain Error. The remaining columns for the tools show the percentage of reads detected by each tool out of the #Reads. Invalid mappings (i.e., reads mapped with errors more than the assigned error rate threshold) for Bowtie and Novoalign are 567,531 and 587,542 reads, respectively.

Error  #Reads  Bowtie  Bowtie2  BWA    SOAP2  Novoalign  mrsFAST
0      832     100     100      100    100    97.24      100
1      6316    96.99   100      100    100    98.29      100
2      23495   97.30   97.16    100    99.97  98.70      100
3      55941   97.00   95.92    99.85  95.78  98.84      100
4      98063   96.48   94.22    99.49  96.43  99.02      100
5      135096  95.63   91.14    98.76  97.34  99.12      100

As shown in the results, both Novoalign and Bowtie are evaluated as mapping invalid reads. This is because Rabema does not take the quality scores into consideration and just calculates the edit distance. Therefore, from the mismatches perspective, the reads have more than 5 mismatches. However, from the quality threshold perspective, they have a quality score less than the specified threshold. Therefore, at the end, they are valid mappings.
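An evaluation that relies only on the edit distance can be sketched as follows. This is a textbook dynamic programming implementation of the Levenshtein distance combined with an error rate threshold; it is not Rabema's code, and the function names and toy sequences are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def within_error_rate(read, ref_segment, max_error_percent):
    """Valid if the edit distance is at most max_error_percent of the read length."""
    max_errors = (max_error_percent * len(read)) // 100
    return edit_distance(read, ref_segment) <= max_errors


# Toy reads of length 100 against a reference segment of 100 A's:
# five or six bases at the end are replaced by C.
ref_segment = "A" * 100
read_5_errors = "A" * 95 + "C" * 5
read_6_errors = "A" * 94 + "C" * 6

print(within_error_rate(read_5_errors, ref_segment, 5))
print(within_error_rate(read_6_errors, ref_segment, 5))
```

A check of this kind ignores base qualities entirely, so a mapping with six low-quality mismatches is flagged as invalid even when it satisfies a quality-based criterion.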

In general, BWA has been able to detect almost all of the reads with the correct error rate. This suggests that most of the mismatches exist at the end of the read. In addition, the seeding technique is a valid method, especially if it can speed up the mapping process. Even though SOAP2 is a seed based tool, similar to BWA, it could not detect as many correct reads as BWA. Bowtie2 missed some of the reads; however, it can detect them by changing its sensitivity at the cost of increasing the running time. On the other hand, mrsFAST mapped all of the reads with the correct error rate since it is a fully sensitive mapper.

3.3.7 Use case: SNP calling

The aim of this experiment is to understand how the different mapping techniques affect the quality of SNP calling. The tools were used to map an mRNA data set of 23 million reads extracted from the Spretus mouse strain. Then Partek [116], a genomic suite developed to analyze NGS data, was used to detect SNPs. The mouse genome version mm9 was used as the reference genome in this experiment. A quality threshold of 70 was used for Bowtie and Novoalign while the remaining tools were allowed 5 mismatches. In addition, gapped alignment was enabled for Bowtie2, BWA, GSNAP, and Novoalign. Table 3.4 shows the results for this experiment. The SNP detection step was done for GSNAP and SOAP2 after filtering out the erroneous reads. The log-odds ratio represents how accurate the SNP is. The small log-odds ratio for some of the SNPs is due to either the small number of reads that support that SNP or the mixed genotype calls. We can observe that there is a large number of accurately detected SNPs. This is expected due to the high divergence of the Spretus strain from other mouse strains. For the sake of completeness, we include the total number of detected SNPs; however, in our analysis, we focus only on the number of accurately detected SNPs shown in the last column. The results show that GSNAP detected the largest number of accurate SNPs while Novoalign detected the smallest. In addition, more than 94% of the highly accurate SNPs detected by Novoalign were also detected by the other tools (not shown). To further understand the reason for the low number of SNPs detected by Bowtie and Novoalign, we carried out the same experiment while using a quality threshold of 100. The number of highly accurate SNPs increased to 1474 and 1100 for Bowtie and Novoalign, respectively. Moreover, the reads with more than 5 mismatches did not contribute to the increase in the number of SNPs. This is due to the fact that SNPs have a high quality score. Therefore, a read with a SNP would be sequenced with a small number of errors.

Table 3.4: SNP calling results when using the different tools. Each row represents a different tool while each column shows the number of SNPs detected with the log-odds ratio, a measure of the accuracy of the detected SNP, centered around the given values. The larger the log-odds ratio is, the more accurate the detected SNP becomes.

           Log-odds ratio
Tools      5       100    200    300   400   500   600  700  800  900  1000  1000000
Bowtie     89479   24337  5082   2231  1076  648   426  281  0    0    0     1171
Bowtie2    200914  62178  10018  4200  2052  1156  767  537  0    0    0     2035
BWA        192050  52115  9028   4049  1894  1087  737  525  0    0    0     2067
SOAP2      174475  49302  8552   3824  1837  1030  704  508  0    0    0     1941
Novoalign  69798   17586  4061   1875  936   519   363  252  0    0    0     941
GSNAP      207920  69015  11416  4928  2482  1325  971  617  0    0    0     2602

3.4 Conclusion

There have been many studies carried out to analyze the performance of short sequence mapping tools and choose the best tool among them. However, the analysis of short sequence mapping tools is still an active problem with many aspects that have not been addressed yet. In this work, we provided a benchmarking study for short sequence mapping tools while tackling different aspects that have not been covered by previous studies. We mainly focused on studying the effect of different input properties, algorithmic features, and changing the default options on the performance of the different tools. Additionally, we provided a set of benchmarking tests which extensively analyze the performance of the different tools. Each of the benchmarking tests stresses a different aspect. The benchmarking tests were further applied on a variety of short sequence mapping tools, namely, Bowtie, Bowtie2, BWA, SOAP2, MAQ, RMAP, GSNAP, Novoalign, mrsFAST (mrFAST), and FANGS.

The experiments show that some tools report an error percentage (i.e., reads mapped while violating the mapping criteria). Among these tools are GSNAP and SOAP. GSNAP reported the highest error percentage in the experiments. Additionally, the error increases with the read length and decreases with the number of mismatches. Nevertheless, GSNAP was one of the tools that report the largest mapping percentage in most of the experiments, even after filtering out the error reads. The main reason for mapping more reads is allowing any number of mismatches in the seed part. From a real application perspective, GSNAP’s filtered output helped in detecting the largest number of SNPs.

The evaluation of Bowtie, Bowtie2, BWA, mrsFAST, and Novoalign shows their ability to correctly map the reads. Moreover, Novoalign mapped the largest percentage of reads, similar to GSNAP, especially for highly repeated genomes. However, it maintained the lowest throughput among the genome indexing tools in most of the experiments.

mrsFAST’s running time is highly affected by the read length and the number of mismatches. Our experiments show that it is better to use mrsFAST for longer reads. It can also be used for short reads but only with a small number of mismatches.

In general, genome indexing based tools performed better than read indexing tools in all of the experiments. However, MAQ was faster than Novoalign for small genomes. Therefore, there is a potential for read indexing tools to be used for small genomes. In addition to providing the worst performance, read indexing does not provide any significant speedup in case of using read partitioning based parallelism. Therefore, the read indexing method is not preferred when designing a new read partitioning mapping tool.

Interestingly, the genome type experiment revealed many strengths and weaknesses for the tools. For instance, the performance of SOAP, GSNAP, and Novoalign is highly dependent on the genome type; the throughput decreased significantly for the Zebrafish genome. This is due to the large repetition rate in the Zebrafish genome. In addition, the tools behaved differently on the Human and the Chimpanzee genomes despite their comparable genome sizes. The results of the genome type experiment suggest that the different properties of the genomes affect the performance of the tools. Therefore, further investigations are required to understand the different properties of the genomes and their effect on the different mapping techniques.

Even though there are differences between the results for the real data sets and the synthetic ones, both kinds of experiments are important as they give us different perspectives when comparing the tools. The control on the number of mismatches for the wgsim synthetic data allows us to know exactly what the throughput of each tool is while looking for exactly the same number of mismatches. Therefore, it becomes easier to understand why a tool is faster than another one or why a tool seems to map more reads than the other ones. At the same time, it is important to look at the behavior of the tools in case of real data and real-like synthetic data (e.g., ART) to further understand how they behave in the real world. For instance, for the number of mismatches experiment, even though Bowtie looks like it maps a percentage of reads similar to the other tools in case of 7 t-mms, it actually maps the reads with a maximum of 4 t-mms. Therefore, its output reads are more accurate than those of the other tools.

In general, there is no single best tool among all of the tools; each tool was the best under certain conditions. The short sequence mapping problem is still an active problem and new tools still need to be developed. However, these new tools should be application-specific. By taking the target application into consideration, more accurate results can be obtained. For instance, for genome assembly, we can analyze the reference genome and estimate the number of reads that can be mapped for the different regions (e.g., repeated regions) based on the coverage information in the sequencing process. Another example for an application with very specific properties is the mapping of RNA-Seq data, which contain short sequences for the exon regions rather than the intron regions of the genome. Therefore, for well-studied genomes, if a small number of reads were mapped to different intron regions, we can expect them to be wrongly mapped and look for other mapping locations with a larger number of mismatches or a lower mapping quality.

Chapter 4: Efficiency of RNA-Seq Data for Active Module Discovery in Comparison to MicroArrays

Microarrays have been very useful for biological research over the past few decades. However, with cutting-edge high-throughput sequencing technologies, researchers now have the chance to investigate biological systems in a more accurate way. There have been several works which compare microarray and RNA-Seq data for different problems and applications [117, 118, 40, 119, 41]. These works show that using RNA-Seq instead of microarray data can be better because of the limitations imposed by the microarray design. RNA-Seq data has less background noise, its range for expression level quantification is higher, it can better distinguish different isoforms and allelic expressions, and it usually requires less cost and less genomic material [42].

Discovering active modules in protein-protein interaction (PPI) networks has been studied by several researchers, leading to the development of different algorithms and tools [120, 121, 50, 122, 29, 123, 124, 30, 54, 125, 126]. It has been shown that the active modules found by these tools are useful to discover some significant genes which are overlooked by other techniques [52]. Furthermore, an active module is a better, compact, and accurate model of what is going on at the molecular level. For these reasons and more, these networks have been used as biomarkers for classification purposes in diseases such as breast cancer [52]. To the best of our knowledge, all available active module discovery tools were designed and evaluated using only microarray data. That is, there is no work which investigates using RNA-Seq data for active module discovery.

In this work, we investigate the difference between using microarray and RNA-Seq data while discovering active modules in PPI networks. We design a set of experiments using two different data sets, one for colorectal cancer and the other for oligodendroglioma tumors, to see the potential of RNA-Seq data. Our experiments show that RNA-Seq indeed has enormous potential when searching for significant genes, since it helps the tools to detect a set of genes which are not found if the microarray data is used. Besides, for the colorectal cancer data, the results from different tools are more consistent with each other when RNA-Seq data is used. However, these state-of-the-art tools generate larger active networks with RNA-Seq data, which reduces the compactness and hence the effectiveness of an active module as a biomarker.

The rest of the chapter is organized as follows: In the next section, we describe the tools used in this study for module analysis, and a brief comparison of microarrays and RNA-Seq is given. Section 4.2 describes our experimental setting and evaluates the results. Section 4.3 concludes the study.

4.1 Background

The problem of discovering active modules in PPI networks using gene expression data was introduced by Ideker et al. [28]. Afterward, many algorithms such as Beisser et al. [120], GiGA [121], heinz [29], DEGAS [30], MATISSE [54], and CEZANNE [125] were developed to solve this problem.

4.1.1 Tools for Active Module Discovery

In general, the tools can be categorized into two classes based on how the data sets are used [127]:

1. Treatment-control based tools: These tools divide the collected samples into two classes: treatment and control. The control samples, for example, can represent healthy cells, whereas the treatment samples can represent cancer cells or healthy cells after gene knockout experiments. After this classification, a set of connected genes with significantly different behaviors on the treatment and control samples is extracted.

2. Co-expression-based tools: Instead of dividing the samples into more than one class, the tools in this category analyze the co-expression patterns of gene pairs across all samples. They then extract the connected components in which the genes show the same co-expression behavior across the samples.
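To illustrate what a co-expression pattern between a gene pair can look like in practice, the following minimal sketch computes the Pearson correlation of two expression profiles across samples. The choice of Pearson correlation and the example values are assumptions made for illustration only; the co-expression tools cited above may use other similarity measures.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


# Hypothetical expression values of three genes across four samples.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # rises together with gene_a
gene_c = [8.0, 6.0, 4.0, 2.0]   # falls while gene_a rises

corr_ab = pearson(gene_a, gene_b)
corr_ac = pearson(gene_a, gene_c)
```

A pair such as gene_a and gene_b, whose correlation is close to 1, would be a candidate for the same co-expressed component, whereas gene_a and gene_c are anti-correlated.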

In this study, we use three treatment-control based tools: jActiveModules [28], heinz [29], and ExprEssence [126]. Another treatment-control based tool, DEGAS, is designed to extract active modules when outliers exist [30]. Since the datasets used in this work are extracted from a cell line study, they do not contain outliers. Hence, DEGAS is excluded from our experiments.

jActiveModules

The tool jActiveModules is the first one developed to extract active modules using gene expression data [28]. The tool returns the largest connected component with the highest score as the active module. jActiveModules first assigns a score to each gene based on the gene’s p-values for the samples in the dataset. Thereafter, for each connected component, the algorithm aggregates the scores of the genes inside the component to generate a single score for it. The problem of finding the module with the highest score is NP-hard [28]. Hence, the authors proposed a search heuristic based on simulated annealing to find such modules. In general, the algorithm suffers from a long running time and large output modules.

heinz

Even though jActiveModules is a popular tool, it suffers from many drawbacks, including the non-optimality of the results and its long running time. Dittrich et al. [29] developed heinz to alleviate these problems. The tool heinz works by first assigning a score to each gene in the network using the aggregated p-value of all of the samples in which the gene exists. The aim of the scoring function is to discriminate between the correct and noise p-values by assigning a positive score to the former and a negative score to the latter. After scoring the genes in the network, heinz finds the optimal highest-scoring subgraph in this vertex-weighted graph. The problem of finding the subgraph with the optimal score is transformed into the well-known prize-collecting Steiner tree (PCST) problem and then solved using the optimal PCST algorithm of Ljubić et al. [128]. The main drawback of heinz is that it uses the aggregation of the p-values of all samples as a single p-value. Therefore, it implicitly assumes that all of the samples exhibit the same behavior for the same disease.
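The objective that heinz optimizes, a connected subgraph of maximum total node score, can be illustrated on a toy network. The sketch below is not the PCST-based algorithm of heinz; it is a brute-force enumeration that is only feasible for a handful of nodes, and the graph, gene names, and scores are hypothetical.

```python
from itertools import combinations


def is_connected(nodes, adjacency):
    """Return True if the given nodes induce a connected subgraph."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        current = stack.pop()
        for neighbor in adjacency[current]:
            if neighbor in nodes and neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen == nodes


def best_connected_subgraph(scores, adjacency):
    """Exhaustively search for the connected node set with the highest score sum."""
    best_nodes, best_score = frozenset(), float("-inf")
    genes = sorted(scores)
    for size in range(1, len(genes) + 1):
        for subset in combinations(genes, size):
            if not is_connected(subset, adjacency):
                continue
            total = sum(scores[g] for g in subset)
            if total > best_score:
                best_nodes, best_score = frozenset(subset), total
    return best_nodes, best_score


# Hypothetical toy network: path A - B - C - D, plus the edge B - E.
toy_adjacency = {
    "A": {"B"},
    "B": {"A", "C", "E"},
    "C": {"B", "D"},
    "D": {"C"},
    "E": {"B"},
}
# Positive scores stand for signal-like genes, negative scores for noise-like genes.
toy_scores = {"A": 2.0, "B": -1.0, "C": 3.0, "D": -4.0, "E": -0.5}

module, score = best_connected_subgraph(toy_scores, toy_adjacency)
```

In this toy case the best module includes the negatively scored gene B, because it connects the two positively scored genes A and C. This is the kind of trade-off that a positive/negative scoring scheme is meant to allow.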

ExprEssence

While the above tools focus on extracting connected components, ExprEssence focuses on extracting individual links where the change in the genes’ expression values indicates a regulatory change such as stimulation or inhibition. Similar to the previous tools, ExprEssence starts by calculating a score for each gene. However, instead of using p-values, it directly uses the gene expression values. Afterward, the scores of the genes sharing the same link are used to calculate the corresponding link’s score.

4.1.2 Microarray vs. RNA-Seq: History

Both microarrays and RNA-Seq revolutionized transcriptome analysis and have been successfully used for quantifying the transcriptomes of different species or organisms [42]. Before the microarray era, researchers used other techniques to analyze the transcripts of a given species, such as Northern blots, expressed sequence tags (ESTs), serial analysis of gene expression (SAGE), and reverse transcription PCR (RT-PCR) [40, 41]. However, these techniques suffered from limitations on the number of genes that can be analyzed in parallel [41]. Therefore, there was a need for a new technique to better analyze the different transcripts.

A whole picture of a cell’s gene expression pattern was not available before the development of gene expression microarrays [129]. They have been extensively used in many applications, including gene expression detection, single nucleotide polymorphism (SNP) analysis, and mutation detection [41]. However, microarrays require prior knowledge about the structure of a gene. Additionally, analyzing genes in new genomes is hard due to the unavailability of probes for these genomes [40].

RNA-Seq is a more recent technique for transcriptome analysis using deep sequencing [42]. It made measuring gene expression levels easier and more accurate. For instance, unlike microarrays, RNA-Seq can detect unidentified genes, and it does not require any information about the distinct isoforms of a gene [40]. Therefore, it is more successful in detecting novel transcript isoforms. On the other hand, although its cost is continuously decreasing, RNA-Seq has generally been regarded as a more expensive technique than microarrays.

Many studies have compared microarrays and RNA-Seq, such as [117, 118, 40, 119, 41]. However, none of these studies measures the true potential of RNA-Seq data for active module analysis. In this work, we compare microarrays and RNA-Seq to see if RNA-Seq is a good alternative. In our experiments, we compare the effectiveness of both techniques with respect to different structural and biological aspects of the active modules.

4.2 Experimental Evaluation

We first describe our experiments, which were carried out using two different datasets and three active module discovery tools. We then present and evaluate their results. We used jActiveModules v2.23 and ExprEssence v1.2.1a, both of which are available as plugins for Cytoscape v2.8.3 [130]. For heinz, we used the public version that is available in the BioNet package [131].

The human PPI network we used contains 11,203 genes and 57,235 interactions. The network was assembled by Chuang et al. [52] from yeast two-hybrid experiments [132, 133], interactions predicted by the ontology and co-citation [134], and interactions curated in the literature [135, 136, 137].

We carried out experiments with two datasets. In the first dataset, the gene expression values are measured by both microarray and RNA-Seq techniques on RNA extracted from fluorouracil (5-FU)-resistant and -nonresistant human colorectal cancer cell lines. The microarray data was published as part of an expression analysis study by Griffith et al. [138] in 2008, while the RNA-Seq analysis of the same RNA was published later, in 2010 [139]. To prepare the 5-FU resistant cell line (MIP/5-FU) [140], Griffith et al. passed the 5-FU sensitive cell line (MIP101) [141] through an increasing concentration of 5-FU, resulting in a 5-FU resistant cell line. They calculated the p-values for the output gene expression values using the Wilcoxon rank-sum test and Fisher’s exact test for the microarray and RNA-Seq datasets, respectively. Fisher’s exact test was used for the RNA-Seq data due to the differences in counts between the two samples. Afterwards, they applied the Benjamini and Hochberg step-up false discovery rate controlling procedure [142] for multiple testing correction. The gene expression values in the microarray and RNA-Seq data were 85% correlated, while the p-values were only 27% correlated.
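The Benjamini and Hochberg step-up correction mentioned above can be sketched in a few lines. This is a generic textbook version of the procedure that returns adjusted p-values; it is not the code used by Griffith et al., and the input p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity (step-up).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
```

A gene would then be called significant at a false discovery rate of 5% if its adjusted p-value is below 0.05.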

To further analyze the potential of RNA-Seq for active module analysis and see how the tools behave with it, we used a second RNA-Seq dataset, which contains gene expression values for six different samples. Five of the six samples are the treatment samples, and they are extracted from five different versions of the oligodendroglioma tumor disease. The sixth sample is from tumor initiating cells and is used as the control sample (see footnote 1).

4.2.1 Colorectal cancer cell lines

The colorectal cancer cell line RNA-Seq data we used contains 36,952 genes, of which only 11,853 are expressed and have a corresponding Entrez gene ID. However, only 7,456 of these genes exist in the PPI network. On the other hand, the microarray data contains 2,510 genes, of which 2,410 are expressed and have a corresponding Entrez gene ID. Only 1,656 of these genes exist in the PPI network. The intersection between the two datasets contains 2,200 genes, of which only 2,008 are expressed and have an Entrez gene ID. Only 1,398 genes of the intersection set exist in the PPI network.

We calculated the number of differentially expressed (DE) genes in each dataset. We assume that a gene is differentially expressed if it has a fold change ≥ 2 and a p-value < 0.05 [139]. We found that the number of DE genes is 251, 209, and 86 in the RNA-Seq data, the microarray data, and their intersection, respectively. However, only 157, 128, and 53 of these DE genes, respectively, exist in the PPI network.
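The DE criterion above amounts to a simple filter. The sketch below applies it to a few made-up genes; the gene names and values are hypothetical, and the fold change is assumed to be reported as a magnitude (so that a two-fold decrease is also recorded as 2).

```python
def is_differentially_expressed(fold_change, p_value, min_fold=2.0, max_p=0.05):
    """DE criterion from the text: fold change >= 2 and p-value < 0.05."""
    return fold_change >= min_fold and p_value < max_p


# Hypothetical genes mapped to (fold change magnitude, p-value).
genes = {
    "geneA": (3.1, 0.001),   # DE
    "geneB": (1.2, 0.010),   # significant, but the fold change is too small
    "geneC": (2.0, 0.049),   # DE (boundary fold change is included)
    "geneD": (4.5, 0.200),   # large fold change, but not significant
}

de_genes = sorted(
    name for name, (fc, p) in genes.items() if is_differentially_expressed(fc, p)
)
```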

For the rest of the experiments, we name the output modules as follows:

1. MicroNet: represents an output module obtained by using the microarray data.

2. RnaNet: represents an output module obtained by using the RNA-Seq data.

3. MicroInterNet/RnaInterNet: represent output modules obtained by using microarray/RNA-Seq gene expression values for the set of genes existing in both datasets only.

Footnote 1: http://www.alexaplatform.org/alexa_seq/Oligo/Summary.htm

The combination of the three active MicroNet modules from the tools is visualized in Figure 4.1. As the figure shows, heinz is more focused, whereas ExprEssence can find nodes from different parts of the PPI network since it returns links rather than a connected module.

Module size analysis

Table 4.1: The size of the active modules returned by the different tools.

                  MicroNet          RnaNet            MicroInterNet     RnaInterNet
Tool              #nodes  #edges    #nodes  #edges    #nodes  #edges    #nodes  #edges
jActiveModules    219     486       2,330   11,259    233     498       1,773   6,982
heinz             36      41        198     675       18      19        44      71
ExprEssence       210     168       2,039   2,288     177     138       170     138

When the data is changed to RNA-Seq, the sizes of the active modules also change. With respect to this criterion, the tools behaved differently on the two datasets, as Table 4.1 shows. The sizes of the output networks obtained by each tool increase drastically when RNA-Seq data is used instead of microarray data. The increase in the number of nodes is around 10× for jActiveModules, 5.5× for heinz, and 10× for ExprEssence. Furthermore, the RnaNet network size for heinz is more than 10× smaller than that of the other tools. These results suggest that when RNA-Seq data is used, heinz is still able to maintain more specific and focused results.
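The increase ratios quoted above can be rechecked directly from the node counts in Table 4.1. The short sketch below only redoes this arithmetic; the variable names are illustrative.

```python
# Node counts (MicroNet, RnaNet) taken from Table 4.1.
sizes = {
    "jActiveModules": (219, 2330),
    "heinz": (36, 198),
    "ExprEssence": (210, 2039),
}

# Growth factor of each module when switching from microarray to RNA-Seq data.
ratios = {tool: rna / micro for tool, (micro, rna) in sizes.items()}

# How much smaller the heinz RnaNet module is than the other RnaNet modules.
heinz_vs_others = {
    tool: sizes[tool][1] / sizes["heinz"][1]
    for tool in sizes
    if tool != "heinz"
}
```

The ratios come out at roughly 10.6 for jActiveModules, 5.5 for heinz, and 9.7 for ExprEssence, and the heinz RnaNet module is more than ten times smaller than either of the other two.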

Figure 4.1: The visualization of the MicroNet networks obtained by the three tools. The yellow, turquoise, and green nodes show the modules from jActiveModules, heinz, and ExprEssence. The black nodes exist in all three networks. The red nodes exist both in ExprEssence and heinz. The blue nodes exist in both ExprEssence and jActiveModules modules. The purple nodes exist in both jActiveModules and heinz.

When the sizes of MicroInterNet and RnaInterNet are examined, it is observed that the changes in network size are smaller than in the MicroNet and RnaNet case. The increase ratios are 2.4× for heinz and 7.8× for jActiveModules. However, the number of genes in the RnaInterNet module for jActiveModules is larger than the number of input genes, which is not desired. On the other hand, a slight reduction is observed for ExprEssence, which uses the expression values instead of the p-values in its scoring function. For this dataset, the expression values of the microarray and RNA-Seq data are 85% correlated, whereas the p-values are only 27% correlated. Hence, it is expected that ExprEssence maintains the same network size.

Significant gene analysis

Table 4.2: Number of DE genes in each active module. We focus on the 53 genes which exist in both microarray and RNA-Seq data.

Tool              MicroNet      RnaNet       MicroInterNet   RnaInterNet
jActiveModules    18   8.2%     37   1.6%    19   8.2%       36   2.0%
heinz             6    16.6%    8    4.0%    1    5.6%       4    9.0%
ExprEssence       30   14.2%    48   2.4%    29   16.4%      26   15.3%

In this experiment, we focus on the importance of the genes contained in each module. Table 4.2 shows the number of DE genes that exist in the module returned by each tool. The percentages in the table are the ratio of the number of DE genes to the number of nodes in a module. As the numbers show, the number of DE genes increases when RNA-Seq data is used instead of microarray data in all cases except the ExprEssence intersection networks (29 vs. 26). On the other hand, the percentage of DE genes decreases in all cases except the heinz intersection networks. Since ExprEssence returns links between genes rather than connected components, it returns the largest number of DE genes, and compared with the other tools, its DE percentages are much better for the intersection networks. In fact, its MicroInterNet DE percentage is among the highest in this experiment. For all tools, an increase in the number of DE genes is expected, since the RNA-Seq data is larger and hence the output network is expected to be larger. However, this may not be desirable in practice since, as the percentages in Table 4.2 show, the increase in the number of DE genes (2× for jActiveModules) is small compared with the increase in network size (10× for jActiveModules).

Table 4.3: Individual existence of significant genes in different modules. M and R denote the existence of the corresponding gene in the microarray and RNA-Seq data. ✓ refers to the existence of the gene in the corresponding module.

Genes: TYMS (R), TK1 (R), CDH1 (M&R), UMPS (M&R), ABCB1 (M&R), GDF15 (M&R), TNFRSF1B (M&R)

jActiveModules    MicroNet
jActiveModules    RnaNet           ✓ ✓ ✓ ✓ ✓ ✓ ✓
jActiveModules    MicroInterNet    ✓
jActiveModules    RnaInterNet      ✓ ✓ ✓ ✓ ✓ ✓ ✓
heinz             MicroNet
heinz             RnaNet           ✓ ✓ ✓ ✓ ✓
heinz             MicroInterNet
heinz             RnaInterNet      ✓ ✓ ✓
ExprEssence       MicroNet         ✓ ✓ ✓
ExprEssence       RnaNet           ✓ ✓ ✓ ✓ ✓ ✓ ✓
ExprEssence       MicroInterNet    ✓ ✓ ✓
ExprEssence       RnaInterNet      ✓ ✓ ✓

In addition to the number of DE genes in each module, we also evaluate the significance of the module by examining its relation to the resistance to the 5-FU drug in colorectal cancer cells. Table 4.3 shows the results of this analysis. The significant genes we use for this experiment are as follows:

1. TYMS (thymidylate synthetase) is a gene biologically shown to be involved in the regulation of apoptotic processes [143]. The 5-FU drug works by inhibiting the protein product of TYMS [144]. In addition, over-expression of this gene is known to increase resistance to the 5-FU drug [145, 146, 144]. The gene is differentially expressed in the RNA-Seq data. As shown in Table 4.3, jActiveModules is the only tool capable of returning this gene in the RnaInterNet module. Note that jActiveModules explores neighbors up to two nodes away from the active nodes. Therefore, it can return genes that do not exist in the input dataset.

2. TK1 (thymidine kinase) is an enzyme shown to be related to the increase in TYMS deficiency [144]. TK1 is not differentially expressed in the RNA-Seq data. However, due to the 1.2 fold change in its expression levels, ExprEssence is able to return this gene in the RnaNet module.

3. CDH1 (cadherin 1) is a classical cadherin gene. It was observed that CDH1 is down-regulated in 5-FU resistant cells [147]. As described in [147], SNAI1 is the main reason for the suppression of CDH1. Therefore, we checked the existence of both genes in both the microarray and RNA-Seq data, and we looked for any direct interaction between the two genes in the output modules. CDH1 is found to be differentially expressed in the RNA-Seq data, while SNAI1 exists only in the RNA-Seq data and is not differentially expressed. SNAI1 does not exist in any of the output modules. A possible reason for its absence is the absence of a direct interaction between SNAI1 and any differentially expressed gene in the PPI network.

4. UMPS (uridine monophosphate synthetase) is one of the 5-FU metabolism genes that is believed to critically affect the response of the tumor to 5-FU [148, 149]. UMPS is differentially expressed in both datasets, but it has a fold change > 2 only in the RNA-Seq data. As shown in Table 4.3, UMPS is found only in RnaNet and RnaInterNet by heinz and jActiveModules. On the other hand, ExprEssence returns this gene in all of its modules. Interestingly, an interaction between UMPS and the HNF4A gene is implied in most of the modules. HNF4A is an important gene that is known to be involved in many important functions, such as polarity and organization of tissues and proliferation of tumor cell lines [139]. HNF4A is known to be dysregulated in colorectal cancer [150]. However, according to our dataset, HNF4A is down-regulated only in the RNA-Seq data.

5. ABCB1 is a multidrug resistance gene that is known to be differentially expressed in colorectal cancer cells [140]. It exists in both datasets we used, but it has a p-value < 0.02 only in the RNA-Seq data. Table 4.3 shows that ABCB1 is returned as a part of all ExprEssence modules. However, jActiveModules and heinz returned this gene only in RnaNet and RnaInterNet.

6. GDF15 is the growth differentiation factor 15 protein. This protein has a role in 5-FU resistance in colon cancer [146]. We found this gene to be differentially expressed only in the RNA-Seq data. jActiveModules and heinz returned this gene in both the RnaNet and RnaInterNet modules. Interestingly, our results imply an interaction between the GDF15 and HNF4A genes. However, we do not know whether this implication is biologically reasonable or not.

7. TNFRSF1B (tumor necrosis factor receptor superfamily, member 1B) is a gene known to be related to 5-FU resistance [146]. According to our experiments, heinz is not able to return this gene in any of its output modules, even though the gene exists in the microarray data with a p-value < 0.05.

Table 4.4: Top three hub nodes in each module and their number of edges.

(a) jActiveModules
MicroNet           RnaNet             MicroInterNet      RnaInterNet
hub      #edges    hub      #edges    hub      #edges    hub      #edges
GRB2     51        HNF4A    737       GRB2     51        HNF4A    549
SHC1     28        RPS27A   108       EGFR     31        GTF2F2   72
FYN      25        RPS3     105       SRC      27        RPS25    69

(b) heinz
MicroNet           RnaNet             MicroInterNet      RnaInterNet
hub      #edges    hub      #edges    hub      #edges    hub                #edges
EGFR     10        HNF4A    71        GRB2     8         HNF4A              21
GRB2     9         RPS15A   30        CRK      5         RPL12              9
ONECUT1  7         EEF1A1   29        INSR     4         RPS3, CDH1, RPS5   8

(c) ExprEssence
MicroNet           RnaNet             MicroInterNet      RnaInterNet
hub      #edges    hub      #edges    hub      #edges    hub      #edges
HNF4A    21        HNF4A    441       HNF4A    18        HNF4A    20
CDKN1A   19        CDKN1A   56        CDKN1A   17        CD44     9
CD44     9         BCL2     47        CD44     8         BCL2     7

In summary, RNA-Seq gene expression values helped the tools return the important genes as part of the output modules. In contrast, the tools generally did not return these genes in the microarray-based modules, even though some of them exist in the microarray data. Surprisingly, heinz was able to return five of the seven important genes we found in the literature related to the 5-FU-resistance behavior. Additionally, compared with the other tools, it showed a much smaller increase in network size when used with RNA-Seq data.

Hub-node analysis

In this experiment, we observed how the hub nodes of the active modules change for different data and tools. Table 4.4 shows the three hub nodes with the highest number of connections in each module. All the tools return HNF4A as the first hub for both RnaNet and RnaInterNet modules. We believe that this is due to the better accuracy of the RNA-Seq data which leads to better relevant node detection. As mentioned previously, HNF4A is known to be dysregulated in colorectal cancer [150] and is an important gene involved in many functions including polarity and organization of tissues [139].

4.2.2 Oligodendroglioma tumors

This dataset contains only RNA-Seq gene expression values for five treatment samples (oligodendroglioma tumor samples) and one control sample (tumor initiating cells). The dataset contains 49,868 genes, of which only 10,823 are expressed and have an Entrez gene ID. Furthermore, the number of DE genes (i.e., p-value < 0.05 and fold change > 2) is 1,086. However, only 7,082 of the 49,868 genes are in the PPI network, and 726 of them are differentially expressed.

The results of the experiments are given in Tables 4.5 and 4.6. Interestingly, the tools show a totally different behavior on the oligodendroglioma dataset when compared with the colorectal cancer dataset. For this dataset, jActiveModules provides the smallest module with only 160 nodes, while heinz returns the largest module with 1,593 nodes. Moreover, 30% of the genes in the jActiveModules and heinz modules are DE. In contrast, it was shown in previous studies (e.g., [29, 30, 126]) that when the greedy search algorithm of jActiveModules was used with microarray data, jActiveModules returned the largest module with the smallest percentage of DE genes in comparison to the other tools.

This behavior is due to the nature of the datasets used and how the tools handle the input p-values/expression values. The colorectal cancer samples contain replicates of the same sample, whereas the oligodendroglioma dataset contains five different samples, each representing a different version of the disease. Both heinz and ExprEssence aggregate the input p-values/expression values into a single p-value/expression value, and they use this single value to score each node and hence the connected components and links. In contrast, jActiveModules calculates the score of each connected component for each sample separately. Then it gives an aggregated score to each connected component by measuring how much this component represents each sample. Finally, it returns the connected component with the highest aggregated score. Therefore, if the variance between the input samples is high (such as for the samples in the oligodendroglioma dataset), the output of the heinz scoring function may be inaccurate.

Table 4.5: Sizes of the active modules found by the tools and the number of differentially expressed genes in them when the oligodendroglioma dataset is used.

Tool              #nodes   #edges   #DE
jActiveModules    160      355      48    30.0%
heinz             1,593    6,812    470   29.9%
ExprEssence       802      2,098    102   12.7%

Table 4.6: Hub-node analysis of the tools when using the oligodendroglioma dataset.

jActiveModules                             heinz              ExprEssence
hub                              #edges    hub      #edges    hub      #edges
RPS10                            21        EEF1A1   93        RPS3     140
RPS15                            20        RPS16    89        RPL26    121
RPS8, RPS6, RPS3, RPS26, RPS27A            RPS3A    88        RPS25    115

Table 4.6 shows the hub nodes. In contrast to the colorectal cancer dataset, each tool returned an almost entirely different set of hub nodes. We believe this happens because the tools handle the expression values and p-values differently and because the correlation between the oligodendroglioma samples is low.

4.3 Conclusion and Future Work

The discovery of active modules in protein-protein interaction networks has given biologists a new perspective on gene expression data. Since the introduction of the problem in 2002 by Ideker et al. [28], many algorithms have been developed to provide better accuracy and new dimensions for the problem. However, all of these studies used microarray data to evaluate the quality of the output results and the performance of the different tools.

In this work, we investigated the effectiveness of RNA-Seq data for extracting active modules. To achieve this goal, we used both RNA-Seq and microarray gene expression values for the same RNA sample extracted from colorectal cancer cell lines to discover the active modules. To further understand the effectiveness of RNA-Seq data, we used another RNA-Seq gene expression dataset extracted from five different oligodendroglioma cancer samples. The results showed that RNA-Seq can be more useful than microarrays in detecting relevant and overlooked active modules. In addition, the RNA-Seq based modules in our experiments contained more biologically significant genes.

In our experiments, RNA-Seq data enabled a better evaluation of the performance of the different tools. For instance, it is mentioned in many studies that jActiveModules returns the largest module in comparison to the other tools. However, on the oligodendroglioma dataset, jActiveModules returned the smallest module. Moreover, around 30% of the returned genes were found to be differentially expressed in the input data. This suggests that the performance of the tools depends highly on the type of the input data. This is why we believe that experiments with RNA-Seq data are necessary to better understand the algorithms and to further evaluate the performance of the tools.

Since the output modules become large with RNA-Seq data, they become less effective as biomarkers. Therefore, new algorithms or new tools for the active module discovery problem will be needed. For some applications and use cases, it is better to return more focused and compact modules. Alternatively, a set of plugins or additions to the tools that annotate their output could make the results easier to analyze.

As future work, we will investigate new scoring functions to detect more specific and focused modules. We are planning to use both the relative and individual properties of the different genes to determine the importance of the genes in the output module. Moreover, the directions of the interactions in the PPI network can be included in the scoring function for better detection of relevant genes.

Chapter 5: PRASE: PageRank-based Active Module Extraction

In complex diseases, genes do not act in isolation; rather, they tend to interact in pathways and modules to perform their designated functions [1]. Therefore, many researchers have focused on characterizing such modules and defining their properties.

One possible method to detect such modules is to detect dense clusters of genes in different networks such as protein-protein interaction (PPI) networks [9, 151]. Such algorithms are based on the idea that genes performing the same function interact heavily with each other in comparison to their interaction pattern with other genes [9].

However, depending only on one type of data to extract the modules would lead to suboptimal results [54], thus, making it harder to explain the underlying behavior of the disease. Therefore, the integration of different types of data is critical for understanding the disease mechanism.

According to the types of data used in the integration process, the problem can be defined as detecting either genotypic modules or phenotypic modules [152]. Genotypic modules refer to using genotype data, such as gene mutation information, to detect modules enriched with genes having genetic alterations related to the disease mechanism [152]. In a recent work, Vandin et al. addressed this problem in their algorithm, HotNet [57]. They integrated disease-specific mutated gene information with the PPI network to detect sets of k mutated genes existing in the largest number of samples. To achieve this goal, they first constructed an influence graph containing only the mutated genes. Then, they calculated the weights of the edges of the influence graph by applying a diffusion process [153] on the PPI network to calculate the influence between all pairs of mutated genes. In a further work, Vandin et al. focused on returning sets of mutated genes that are mutually exclusive, i.e., genes that are not mutated in the same sample [154]. They relied on the observation that different mutated genes can perturb the same pathway while not being mutated at the same time in the same sample.

Phenotypic modules, on the other hand, refer to using data, such as gene expression, to extract the group of interacting genes that best explain the underlying disease behavior. One of the pioneering works in this field was carried out by Ideker et al. [28]. They named the discovered modules as active modules, referring to how these modules unexpectedly contain many interacting differentially expressed genes.

To extract the active modules, Ideker et al. integrated gene expression data with the PPI network. They scored each candidate module by calculating the sum of the Z-scores of all the genes in the module, where the Z-score of each gene is calculated from the corresponding p-value of that gene. The problem of detecting the highest-scoring module is NP-hard; therefore, they provided a simulated-annealing-based heuristic to discover the active modules.
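The scoring scheme described above can be sketched as follows. The text only states that each Z-score is derived from the gene's p-value and that module scores are sums of Z-scores; the specific mapping through the inverse normal CDF of 1 − p and the optional division by the square root of the module size are assumptions made for this illustration, and the p-values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_score(p_value):
    """Map a p-value in (0, 1) to a Z-score (assumed mapping: inverse normal CDF of 1 - p)."""
    return NormalDist().inv_cdf(1.0 - p_value)


def module_score(p_values, normalize=False):
    """Sum the Z-scores of the genes in a module; optionally scale by sqrt(module size)."""
    total = sum(z_score(p) for p in p_values)
    return total / sqrt(len(p_values)) if normalize else total


# Hypothetical p-values for two candidate modules of three genes each.
strong_module = [0.001, 0.01, 0.03]
weak_module = [0.40, 0.55, 0.70]

strong = module_score(strong_module)
weak = module_score(weak_module)
```

Small p-values map to large positive Z-scores, so a module dominated by significant genes scores higher than one made of genes with unremarkable p-values. The normalization option keeps scores of modules with different sizes on a comparable scale.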

After Ideker et al., many algorithms have been proposed for active module (phenotypic module) discovery. Most of these algorithms use a greedy approach, while the algorithm heinz by Dittrich et al. employs integer programming to find optimal solutions [29]. The problem definition, and hence the functions to be optimized, have been slightly altered in these and subsequent studies to cover various aspects and maximize the usefulness of the discovered modules. For instance, Ulitsky et al. aimed to find modules (a.k.a. functional modules) whose genes show correlated expression [54]. Later, they extended their algorithm with an edge-weighting scheme and a probabilistic model for module connectivity [125]. In addition, rather than using the gene expression values directly, Ulitsky et al. converted the expression values to binary values, where a gene is on if it is differentially expressed and off otherwise [30]. With this modification, the problem is defined as discovering the modules with a certain number of differentially expressed genes.

One drawback of the current definitions is that they focus only on returning the module containing highly differentially expressed genes. However, a gene might not be highly differentially expressed when observed individually but still exhibit a coordinated dysregulation with surrounding genes [155]. Such genes can be easily overlooked by current tools.

The active modules can be exploited in various ways; to detect de novo pathways, new biomarkers, or for knockdown experiments. Recently, many algorithms have been proposed to extract module biomarkers from the protein-protein interaction (PPI) network [50, 52, 156, 122, 123, 155]. The main difference between the original active module and biomarker extraction problems is that the former focuses on extracting the most comprehensive connected module with the largest score. On the other hand, the latter aims to extract a number of relatively small modules which are used to differentiate between the case and control samples.

All of the above active module discovery algorithms were designed and tested using microarray data. However, the more recent alternative, RNA-Seq, produces data which exhibit different properties than microarray reads. It sequences whole-cell mRNA instead of looking for the existence of certain genes in a microarray. RNA-Seq made measuring the gene expression levels easier and more accurate: unlike microarrays, it can detect de novo genes without a priori information regarding the distinct isoforms for the gene [40]. Therefore, it has been more successful than microarrays in detecting novel transcript isoforms. However, the output datasets of RNA-Seq are much larger, making them harder to analyze. In a recent work, Hatem et al. showed that RNA-Seq data can yield promising results while discovering relevant active modules at the expense of generating large ones [157]. Therefore, new algorithms are required to discover smaller, high-quality active modules.

In this work, we focus on the original active module extraction problem. Our work tackles two main concerns: first, including important but not necessarily differentially expressed genes in the network; second, detecting smaller and more focused active modules to facilitate any further analysis while making use of the RNA-Seq properties to return more accurate and disease-related modules. To address these points, we introduce a novel workflow, PRASE, which adjusts the gene expression p-values while making use of the RNA-Seq data properties to enrich the outcome of the current active module discovery tools. Our workflow starts by constructing a gene co-expression network from the RNA-Seq data, which is more accurate than the microarray data and contains a complete image of the mRNA in the cell. Therefore, by constructing a gene co-expression network, we make use of all the possible dependencies between the coding genes in the cell. Such dependencies might not exist in the PPI network due to missing information. Using the p-values for the genes in the co-expression network, PRASE employs the personalized PageRank algorithm, a variant of the famous PageRank algorithm originally used by Google to rank web pages [38], to exploit the gene-gene interactions in a more elegant way. In this way, the complete dependency information between the genes obtained from the RNA-Seq data is used to boost the p-value of the important genes. Finally, the PageRank values are adjusted to generate new p-values, which are then fed with the RNA-Seq specific PPI network to the active module extraction tools.

Using PRASE, the importance of the genes which interact with many differentially expressed genes will increase. Hence, they will be contained in the output module with a larger probability. The new p-values obtained via PRASE are further used with two popular tools, jActiveModules [28] and heinz [29], to extract the final output active module.

The effectiveness of PRASE is extensively evaluated using a number of evaluation criteria, including the size of the network, the percentage of differentially expressed genes, the percentage of disease-related genes, and GO and pathway enrichment analysis. In general, a technique is considered superior to the remaining ones if it performs best on most of the criteria. For instance, a technique that provides a smaller module with a large percentage of differentially expressed genes is considered better than another one with a larger module and a smaller percentage. Note that we focus on the percentage rather than the absolute number of differentially expressed genes.

The rest of the chapter is organized as follows: In Section 5.1, the background material is given for active module extraction and PageRank algorithm. Section 5.2 describes the proposed workflow. A thorough experimental evaluation of the workflow is presented in Section 5.3. Section 5.4 concludes the work mentioned in this chapter.

5.1 Background

5.1.1 Active module extraction tools

Many tools have been developed to detect the most active module using different

metrics. In general, these tools are divided into two groups based on the type of the

input: gene expression values or p-values. Examples for expression-value-based tools

are DEGAS [30] and GXNA [51]. DEGAS uses the gene expression values to calculate the

p-value for each gene. Afterwards, it determines a gene is on/off based on its p-value

and a given threshold. Finally, it looks for the set of at least k connected on genes

covered by at least l samples. GXNA uses the gene expression values to calculate a combined score for each module. Then, it returns the module with the highest score.

The second category, i.e., p-value based tools, contains jActiveModules [28] and

heinz [29]. jActiveModules works by calculating a combined score for each module

S in each sample using the p-values. Then, it calculates a combined score for S across

all the samples. It then returns the module with the highest combined score. Since

the problem of finding the highest weighted module is NP-hard [28], jActiveModules

uses a search heuristic to find these modules. Many algorithms have been developed

to provide either a better scoring function or a better search heuristic. Nevertheless,

jActiveModules is still widely used since it has a very easy and simple user interface

and very few parameters to tweak while providing good results.

The tool heinz uses an integer linear programming approach to find the optimal

module with the highest score. It first aggregates all the p-values of a given node across

all the samples into a single p-value. The aggregation function returns a negative score

for noise and a positive score for the correct p-values. heinz elegantly transforms the problem into the well-known prize-collecting Steiner tree problem and finds an optimal solution by using the algorithm described in [128].

5.1.2 PageRank for gene ranking

The PageRank algorithm has been developed to provide an accurate ranking of web pages [38]. It has been used in many different areas including bioinformatics.

PageRank follows the random-surfer model iteratively: at each iteration, the PageRank score of a node i is equally distributed to i's neighbors with probability δ. The remaining (1 − δ) probability is uniformly distributed to all other nodes. That is, at each iteration, the process is restarted from an arbitrary node. In PageRank, the high-ranked nodes distribute their scores to their immediate neighbors, hence, boost their ranking. As the algorithm iterates, these contributions propagate to the other nodes. Formally, the PageRank of node i at iteration t, denoted as $r_i^t$, is equal to

$$r_i^t = \frac{1 - \delta}{N} + \delta \sum_{j=1}^{N} \frac{r_j^{t-1} w_{ij}}{d_j} \tag{5.1}$$

where N is the number of nodes in the network, $d_j$ is the degree of node j, and $w_{ij}$ is equal to 1 if nodes i and j are connected, and to 0, otherwise.
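The iteration in Eq. (5.1) can be sketched in a few lines. The following is only a minimal illustration in plain Python, not the implementation used in this work: the function name `pagerank`, the uniform starting vector, the fixed iteration count, and the small star-shaped example graph are all assumptions made for the sketch.

```python
def pagerank(adj, delta=0.85, iters=100):
    """Iterate Eq. (5.1) on an undirected, unweighted graph.

    adj[i][j] is 1 if nodes i and j are connected and 0 otherwise.
    The uniform start and the fixed iteration count are assumptions
    of this sketch; they are not prescribed by the text.
    """
    n = len(adj)
    degree = [sum(row) for row in adj]  # d_j
    r = [1.0 / n] * n
    for _ in range(iters):
        r = [
            (1.0 - delta) / n
            + delta * sum(
                r[j] * adj[i][j] / degree[j] for j in range(n) if degree[j] > 0
            )
            for i in range(n)
        ]
    return r


# Hypothetical 4-node star: node 0 is connected to nodes 1, 2, and 3.
star = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
scores = pagerank(star)
```

In this toy graph the hub (node 0) ends up with a higher score than any leaf, which is the behavior the random-surfer description predicts.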

Morrison et al. used the personalized variant of the algorithm to rank the genes based on their expression values [158]. In this variant, the process is restarted with (1 − δ) probability at each iteration. The restart probability is not equally distributed to the PageRank scores of the genes. To de/prioritize the genes, the fold changes of the genes are used as the personalization vector. Winter et al. also used the personalized PageRank to analyze biological networks where the personalization vector is obtained from the Pearson correlations, and a transcription-factor network is used for the gene interactions [159]. Recently, Ivan et al. employed the personalized PageRank for similar purposes [160].

The PageRank algorithm and its variants have been used to rank the genes according to their interactions with known, disease-related ones. Nevertheless, all of these algorithms are more concerned with prioritizing genes rather than taking PageRank one step further to prioritize and extract a module. It has been shown that cancer-related proteins maintain a large number of interactions when compared to non-cancer-related proteins [161]. Here, by using PageRank, we incorporate the topology information and use these interactions to detect genes and gene networks which play an active role in the disease.

Similar to our work, Vandin et al. also used a random-walk based algorithm to calculate the dependency between the different mutated genes [57]. They used the dependency information to construct a graph representing the dependency between pairs of mutated genes. Finally, they returned the module of mutated genes that best represents the disease mechanism. However, unlike our work, they do not use any gene expression data. Therefore, all of the mutated genes are treated equally and they do not have any initial prioritization. In addition, the edges in the returned module do not represent physical interactions between the genes; hence, the output is a set of mutated genes rather than a physical interaction network.

5.2 PRASE

The proposed approach works in multiple steps to utilize RNA-Seq data. The overall workflow is summarized in Figure 5.1.

Figure 5.1: The PRASE workflow: PRASE first generates the gene co-expression network from the set of genes in the RNA-Seq data. Then it generates the corresponding adjacency matrix. The PageRank algorithm uses the old p-values, p, and the adjacency matrix as inputs for re-ranking. The new p-values, denoted with p', are generated by scaling the PageRank output. Then they are used with the RNA-Seq PPI network for the active module extraction process.

5.2.1 Input network and matrix construction

There are two networks required in our workflow: the input network required for the module extraction tools and the PageRank input network. For the former, the required input network is a PPI network. However, to make use of the information contained in the RNA-Seq data and reduce the false-positive rate, we are using the

PPI network containing only RNA-Seq genes. The RNA-Seq network is extracted using the extraction tool provided in the BioNet package [162].

For PageRank, a gene co-expression network is constructed and used as an input.

Indeed, other types of networks can be used such as the PPI network. However, the

PPI network is incomplete and does not contain all of the interaction information

between different genes. On the other hand, the gene co-expression network is able to capture indirect dependencies and possible interaction patterns between these genes. In addition, having a complete set of active coding genes (obtained from the RNA-Seq data), we also have the ability to retrieve possible interactions between them. Therefore, applying PageRank on the RNA-Seq gene co-expression network can boost the rank of the most important genes even if they are not differentially expressed from the p-value perspective. The simplest method to construct the gene co-expression network is to put an edge between a pair of genes if their Pearson correlation is above a threshold. There are also other variations of this simple method. In this work, we used Ruan et al.'s rank-based method to construct the gene co-expression network, which tries to minimize the false positives and obtain only the accurate connections [13]. Note that any other gene co-expression construction algorithm can be integrated into PRASE.
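As an illustration of the simplest construction mentioned above (thresholding pairwise Pearson correlations), the following sketch builds an unweighted adjacency matrix from expression profiles. It is not the rank-based method of Ruan et al. used in this work; the function names, the threshold of 0.8, and the toy expression values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant profile: treat as uncorrelated (assumption)
    return cov / math.sqrt(var_x * var_y)


def coexpression_adjacency(expr, threshold=0.8):
    """Connect two genes if their Pearson correlation exceeds the threshold.

    expr maps a gene name to its expression values across samples.
    Returns the sorted gene names and a symmetric 0/1 adjacency matrix.
    """
    genes = sorted(expr)
    n = len(genes)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if pearson(expr[genes[i]], expr[genes[j]]) > threshold:
                adj[i][j] = adj[j][i] = 1
    return genes, adj


# Hypothetical expression profiles over four samples.
expr = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8.5], "C": [4, 1, 3, 2]}
genes, adj = coexpression_adjacency(expr)
```

With only a handful of samples the correlation estimates are unreliable, which is why a sufficiently large number of samples matters for this kind of network.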

A drawback of using the gene co-expression network is the requirement of a large number of samples to accurately construct edges between the genes. Hence, in case of a low number of samples, the RNA-Seq PPI network may be an alternative to obtain the PageRank values.

To generate the adjacency matrix of the gene co-expression network for PageRank, we use the ftM2adjM function in the R package. Currently, we are treating the network as unweighted (and undirected in case of the PPI network).

5.2.2 Re-ranking

A gene which is not differentially expressed can still be important if it connects

many important genes (e.g., hub nodes). But, considering the state-of-the-art algo-

rithms used for active module extraction, it may be ignored in the output module.

As explained above, by incorporating network structure and using PageRank, we aim

to boost the importance of such genes. This can yield a module that contains these

genes and more differentially expressed ones which were discarded in the first place.

For personalization, we modified the PageRank equation (5.1):

$$r_i^t = (1 - \delta)\, q_i + \delta \sum_{j=1}^{N} \frac{r_j^{t-1} w_{ij}}{d_j} \tag{5.2}$$

where $r_i^0 = q_i$ for each gene i and

$$q_i = \frac{1 - p_i}{\sum_{j=1}^{N} (1 - p_j)}. \tag{5.3}$$

There are two important points: first, there is an inverse correlation between the p-value and PageRank score of a gene, i.e., a high PageRank score implies a significant gene, thus a small p-value. Hence, we use 1 − p and not p for initialization and personalization. Second, we use the summation as the denominator in (5.3) to make the output of (5.2) similar to a probability distribution rather than a gene ranking.
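A minimal sketch of the personalized iteration in Eqs. (5.2) and (5.3) is given below. It is an illustration in plain Python rather than the R code of PRASE; the function name, the fixed iteration count, and the three-gene path graph are assumptions of the sketch.

```python
def personalized_pagerank(adj, p_values, delta=0.5, iters=100):
    """Iterate Eq. (5.2) with the personalization vector of Eq. (5.3).

    adj is a symmetric 0/1 adjacency matrix and p_values holds one
    p-value per gene. The sketch assumes at least one p-value is
    below 1, so that the normalizing sum in Eq. (5.3) is positive.
    """
    n = len(adj)
    degree = [sum(row) for row in adj]
    total = sum(1.0 - p for p in p_values)
    q = [(1.0 - p) / total for p in p_values]  # Eq. (5.3)
    r = list(q)  # r_i^0 = q_i
    for _ in range(iters):
        r = [
            (1.0 - delta) * q[i]
            + delta * sum(
                r[j] * adj[i][j] / degree[j] for j in range(n) if degree[j] > 0
            )
            for i in range(n)
        ]
    return r


# Hypothetical path graph 0 - 1 - 2: the middle gene is not significant
# on its own (p = 0.9) but connects two significant genes (p = 0.01).
path = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
p_values = [0.01, 0.9, 0.01]
scores = personalized_pagerank(path, p_values)
```

In this toy example the middle gene starts with the smallest personalization weight but receives score from both neighbors, so its final score rises well above its initial value. This is the boosting effect that the re-ranking step aims for.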

5.2.3 Scaling and combining

As mentioned above, the PageRank scores have an inverse correlation with the

p-values. Therefore, they cannot be used directly as p-values and they need to be

scaled such that the maximum r, i.e., r = 1, maps to the most significant p-value, i.e., p' = 0. The naive method to do that is employing a linear scaling:

$$p'_i = 1 - \frac{r_i}{\max(r)}. \tag{5.4}$$

Exponential scaling can also be used to obtain the desired mapping and the corresponding p',

$$p'_j = \exp(-s \, r_j), \tag{5.5}$$

where s is chosen to minimize the difference:

$$\left| \sum_{i=1}^{N} p'_i - \sum_{i=1}^{N} p_i \right| \tag{5.6}$$

Since the scaling is non-linear and the sum of new p-values approximates to the sum of old ones, we believe this is a more viable alternative.
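The two scaling options can be sketched as follows. The text does not state how s is found, so the bisection search below is purely an assumption of this sketch; it relies on the fact that the sum of exp(−s·r_j) decreases monotonically in s when all scores are positive. The function names and the example values are hypothetical.

```python
import math


def linear_scale(r):
    """Eq. (5.4): the largest score maps to p' = 0."""
    r_max = max(r)
    return [1.0 - x / r_max for x in r]


def exp_scale(r, old_p):
    """Eq. (5.5) with s chosen so that the sums in (5.6) match.

    Assumes every score in r is positive. The bracket-doubling and
    bisection search is an illustrative choice, not the procedure
    described in the text.
    """
    target = sum(old_p)

    def total(s):
        return sum(math.exp(-s * x) for x in r)

    lo, hi = 0.0, 1.0
    while total(hi) > target and hi < 1e12:  # widen the bracket
        hi *= 2.0
    for _ in range(200):  # bisection on the monotone function total(s)
        mid = (lo + hi) / 2.0
        if total(mid) > target:
            lo = mid
        else:
            hi = mid
    s = (lo + hi) / 2.0
    return s, [math.exp(-s * x) for x in r]


# Hypothetical PageRank scores and original p-values for three genes.
r = [0.5, 0.3, 0.2]
old_p = [0.01, 0.2, 0.6]
s, new_p = exp_scale(r, old_p)
lin_p = linear_scale(r)
```

In both variants a larger PageRank score yields a smaller new p-value; only the exponential variant keeps the sum of the new p-values close to the sum of the old ones.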

Even though the new p-values better reflect the structure of the network, they are not designed to totally ignore the original measurements [158]. Hence, genes that were differentially expressed with the original p-value should not be ignored. To solve this issue, we merged p with the scaled p' as follows:

$$p'_j = \begin{cases} p_j & \text{if } p_j < \min(0.05,\, p'_j) \\ p'_j & \text{otherwise,} \end{cases} \tag{5.7}$$

where 0.05 is a parameter that defines which genes are DE from the perspective of the old p-value. Indeed, a change in this value will result in a change in the output. Therefore, we are using the largest acceptable threshold value of 0.05 to make sure that we are not missing any important genes.
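The merging rule of Eq. (5.7) amounts to a one-line selection per gene. A small sketch, with a hypothetical function name and made-up values, may make the three possible cases concrete.

```python
def combine(old_p, scaled_p, threshold=0.05):
    """Eq. (5.7): keep the original p-value when it is already below
    both the DE threshold and the scaled value; otherwise use the
    scaled value."""
    return [
        p if p < min(threshold, ps) else ps
        for p, ps in zip(old_p, scaled_p)
    ]


# Hypothetical values for three genes.
old_p = [0.01, 0.20, 0.04]
scaled_p = [0.50, 0.03, 0.02]
merged = combine(old_p, scaled_p)
```

The first gene keeps its original p-value because it was already significant, the second takes the scaled value because its original p-value was not significant, and the third takes the scaled value because the scaling made it even more significant.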

5.3 Experimental Results

We implemented PRASE in R. The necessary files of the workflow are freely

available at http://bmi.osu.edu/hpc/software/prase/index.html. We made use

of the available implementations of the module extraction, adjacency matrix con-

struction, and PageRank. We used two module discovery tools for the experiments:

jActiveModules and heinz where the former is provided as a plugin for Cytoscape [130]

and the latter is a part of the BioNet package [162]. We picked these tools since they

are widely accepted, they use the p-values as input, and they are easy to use. We used a PPI network with 11,203 genes and 57,235 interactions. The network was assembled by Chuang et al. [52].

We carried out the experiments with three datasets: breast invasive carcinoma

(BRCA), colorectal cancer cell line (CRC), and oligodendroglioma tumor (Oligo) datasets. We picked the datasets so as to cover different types of control/case relations. For instance, the BRCA control/case samples are for the healthy and diseased tissues whereas the CRC control/case samples are for the same disease before and after introducing the 5-FU drug. On the other hand, the control/case samples for

Oligo are for different types of cancer tissues.

The BRCA dataset is for the invasive ductal carcinoma subtype. We obtained the dataset from the TCGA portal². It contains 114 control/case samples which are extracted from healthy and tumor tissues, respectively, where each control/case sample pair was extracted from the same patient. The dataset does not contain replicates. The DESeq package was used with the unnormalized gene expression values to calculate the p-values [163].

² https://tcga-data.nci.nih.gov/tcga/

In the CRC dataset, RNA-Seq was used to measure the gene expression values for

fluorouracil (5-FU)-resistant and -nonresistant CRC lines. The RNA-Seq data was published as a part of an expression analysis [139]: to prepare the 5-FU resistant cell,

MIP/5-FU [140], Griffith et al. passed the 5-FU sensitive cell line, MIP101 [141], through an increasing concentration of 5-FU resulting in a 5-FU resistant cell. The p-values were calculated using Fisher’s exact test. Afterwards, the Benjamini and

Hochberg’s step-up false discovery rate controlling procedure was applied for multiple testing correction.
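For reference, the Benjamini and Hochberg step-up adjustment can be sketched as below. This is a generic illustration of the procedure in plain Python, with a hypothetical function name and made-up p-values; it is not the code that was used to process the CRC dataset.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    The i-th smallest p-value is multiplied by m / i, and a running
    minimum from the largest rank downwards keeps the adjusted values
    monotone.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
adj_p = benjamini_hochberg(raw)
```

A gene is then called significant at a false discovery rate of 0.05 if its adjusted p-value does not exceed 0.05.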

The last RNA-Seq dataset, the Oligo dataset, contains the gene expression values for six different samples where five of them are extracted from five different versions of the disease representing the case samples. The sixth one is extracted from tumor initiating cells representing the control sample³.

To measure the significance of the output modules, we looked for the number (percentage) of DE genes in each module. In addition, we used gene sets from the literature which are known to be related to the diseases we use in this work. A summary of these sets is given in Table 5.1.

Table 5.1: Standard names for the (curated) gene sets from MSigDB and KEGG pathway (last row). PPI: number of genes from the gene set that exist in the PPI network we are using.

Name                                                     Alias      Size  PPI
NUTT_GBM_VS_AO_GLIOMA_DN                                 GSEA1      45    38
NUTT_GBM_VS_AO_GLIOMA_UP                                 GSEA2      46    40
SCHUETZ_BREAST_CANCER_DUCTAL_INVASIVE_DN                 GSEA3      84    58
SCHUETZ_BREAST_CANCER_DUCTAL_INVASIVE_UP                 GSEA4      352   258
TURASHVILI_BREAST_DUCTAL_CARCINOMA_VS_DUCTAL_NORMAL_DN   GSEA5      198   120
TURASHVILI_BREAST_DUCTAL_CARCINOMA_VS_DUCTAL_NORMAL_UP   GSEA6      44    28
Oligodendroglioma pathway                                OligoPath  29    29

³ http://www.alexaplatform.org/alexa_seq/Oligo/Summary.htm

We assume that a gene is differentially expressed (DE) if the change in its expression value is ≥ 2 and its p-value is ≤ 0.05. Due to the randomness of the seed genes in jActiveModules, we ran each experiment three times and the averages are given.

The output modules are evaluated using the following criteria:

• Network size: It is easier to analyze the modules when they are smaller. However, this criterion alone cannot evaluate the effectiveness as it does not measure the quality of the returned module.

• Percentage of DE genes: We used the percentage of DE genes in the network instead of their actual number since the maximum number of DE genes can be obtained by taking the whole PPI network, which is obviously not a desired output.

• Percentage of disease-related genes: When the disease-related gene percentage is higher, the output module is more focused on the disease.

• GO and pathway enrichment analysis: In complex diseases, the underlying biological mechanisms are still obscure and instead of analyzing the existence of DE (or important) genes, it may be better to analyze a collective functionality.
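The DE definition and the two percentage criteria above can be expressed compactly. The sketch below is only illustrative: it interprets the "change in its expression value" as a fold change, which is an assumption, and the function names and gene labels are hypothetical.

```python
def is_de(fold_change, p_value, min_fold=2.0, max_p=0.05):
    """DE criterion used in this chapter: change >= 2 and p-value <= 0.05.

    Treating the change as a fold change is an assumption of this sketch.
    """
    return fold_change >= min_fold and p_value <= max_p


def percentage_in_module(module_genes, reference_genes):
    """Percentage of module genes that belong to a reference set,
    e.g., the DE genes or a disease-related gene set."""
    module = set(module_genes)
    if not module:
        return 0.0
    return 100.0 * len(module & set(reference_genes)) / len(module)


# Hypothetical module of four genes, two of which are in the reference set.
module = ["A", "B", "C", "D"]
reference = {"A", "C", "Z"}
pct = percentage_in_module(module, reference)
```

Because the measure is normalized by the module size, returning the whole PPI network does not inflate it, in contrast to the absolute number of DE genes.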

5.3.1 Breast invasive carcinoma

The BRCA dataset contains 20,530 genes where only 9,463 of them are expressed genes that exist in the PPI network. Each sample has around 700 DE genes; however, there is no DE gene common among the 57 samples. Therefore, we calculated the number of DE genes while considering 20% and 30% of the samples as outliers. The number of DE genes is 29 and 111 for 20% and 30% outliers, respectively. The BRCA dataset contains a large number of samples; therefore, the gene co-expression network can be accurately constructed.

The sizes of the modules using exponential scaling are shown in Figure 5.2(a).

With PRASE, the module size for heinz decreases from 271 to 261 nodes. The same module is obtained with different δ values. On the other hand, the average size of a jActiveModules module increases from 126 to 145 nodes with δ = 0.5. However, we improved the quality of the output modules as shown in Figure 5.2(b). The GSEA3,

GSEA4, GSEA5, and GSEA6 gene sets were obtained from MSigDB (Table 5.1).

These datasets are specific to invasive ductal carcinoma. PRASE improved the quality of heinz modules w.r.t. the DE gene percentage by 0.7% and GSEA gene percentage by 0.5%-1%. Meanwhile, jActiveModules networks’ quality is improved by 1%-3% with δ = 0.5. Linear scaling was also applied on the p-value, however, it did not yield any good results (results are not included).

For a better evaluation, we performed GO enrichment and pathway analyses by using DAVID [164]. GO annotations usually suffer from repeated annotations and large overlaps. We used DAVID’s clustering to get rid of the redundant terms. A summary of the annotations is shown in Table 5.2.

In general, the modules were enriched with extracellular region, regulation of phosphorylation, and response to stimulus related annotations. The overexpression of some extracellular region related genes is known to be involved in breast cancer, especially in the metastatic one (e.g., [165, 166]). Using PRASE, we detected more extracellular region related genes while improving the p-value for the related GO term. For instance, the percentage increased from 30% for heinz to 38% while the p-value changed from 5.5 × 10^-10 to 6.2 × 10^-21.

(a) Module sizes with various δ values and exponential scaling (p'-exp). Bar chart of the number of nodes and the number of edges for jActiveModules (p at δ = 0; p'-exp at δ = 0.3, 0.5, 0.85) and heinz (p; p'-exp).

(b) Percentages of DE and important genes in the networks, shown for the series GSEA3, GSEA4, GSEA5, GSEA6, DE (20%), and DE (30%). The y-axis shows the ratio of the DE and important genes to all the genes in the module. GSEA3, GSEA4, GSEA5, and GSEA6 are aliases for the gene sets in rows 3, 4, 5, and 6 of Table 5.1, respectively. DE (X%) denotes the case where X% of the samples are considered as outliers. The results for heinz do not change with δ. Therefore, a single exponential scaling (p'-exp) column is given for heinz.

Figure 5.2: Evaluation of the modules obtained for the BRCA dataset

Table 5.2: GO analysis summary of jActiveModules and heinz for the BRCA dataset. The six settings are jActiveModules with p and with p' (δ = 0.3, 0.5, 0.85) and heinz with p and with p'. The first column gives the number of settings in which the annotation is marked as enriched.

Marked  Annotation
6       Extracellular region
2       Nuclear and cell division, mitotic and cell cycles, spindle, and organelle fission
4       Response to hormone and endogenous stimulus
2       Response to organic substance
5       Plasma membrane
2       Extracellular matrix
6       Regulation of phosphorylation
4       Regulation of transferase and kinase activity
4       Regulation and positive regulation of protein metabolic process and map kinase activity
3       Cell migration, motility, and localization
4       Regulation of cell migration, and locomotion
3       Platelet alpha granule and vesicle
4       Response to wounding
3       Wound healing and coagulation
4       Cell and biological adhesion
4       Blood vessel and vasculature development
1       Cell junction and focal adhesion
1       Response to hypoxia, oxygen levels, and progesterone stimulus
3       Glycosaminoglycan, polysaccharide, and pattern binding
1       Hemopoietic, and immune system development
2       Cell activation
2       Regulation of cell death and apoptosis
3       Chemical homeostasis
1       Regulation of system process
3       Skeletal system development
1       Neuron development
3       Regulation of cell communication
2       Cellular homeostasis
2       Behavior and taxis
1       Anchoring and cell junctions
1       Muscle organ and tissue development
4       Defense and inflammatory response
2       Heparin binding
1       Protein dimerization activity
2       Integrin complex, cell-substrate, cell matrix adhesion, and Integrin-mediated signaling
1       Regulation of , nuclear division, and organelle organization
1       Positive regulation of cell proliferation
1       Cytokine and chemokine activity and leukocyte migration
1       Regulation of transmission and system, neurological system, and multicellular organismal processes
3       Regulation of response to external stimulus
2       Regulation and positive regulation of cell adhesion
1       Growth factor binding
1       Response to extracellular stimulus
1       Growth factor activity
1       Cytokine binding
1       Regulation and positive and negative regulation of -protein ligase activity
1       Behavior, regulation of cellular localization, and positive regulation of transport and protein transport
2       Response to organic substance
1       Homeostatic process
1       Urogenital system development
1       Second-messenger mediated signaling
1       Regulation of mitotic cell cycle

The GO terms related to inflammatory responses and response to wounding were enriched in most of the modules except for heinz with p. It is known that the existence of inflammation related genes contributes to the growth of the tumor [167]. Therefore, the introduction of this annotation in heinz with PRASE is highly relevant to the breast cancer behavior.

A comparison of GO analyses with different δ values for jActiveModules reveals that the annotations are consistent for δ = 0.3 and δ = 0.5. However, for δ = 0.85, we start to see slightly different annotations. This is most likely due to the increase in the impact of node-node interactions relative to the original p-values. We observed similar consistencies among different contributions for the GO analyses for both tools.

A summary of the pathway enrichment analysis results is shown in Table 5.3.

Using jActiveModules with PRASE results in the removal of pathways that are not closely related to breast cancer, such as the renal cell carcinoma pathway.

Meanwhile, we encounter the enrichment with other pathways such as chemokine

signaling. It has been shown that chemokines are critical for cancer progression [168].

On the other hand, some of these pathways are not related to breast cancer, such

as Arrhythmogenic right ventricular cardiomyopathy (ARVC). Moreover, important

pathways such as the ErbB signaling pathway are not significantly enriched in the

output module anymore.

The original heinz modules were not enriched with as many pathways as the jActiveModules modules, despite their relatively larger size. However, with PRASE, heinz modules were enriched with Focal adhesion, ECM-receptor, and cytokine-cytokine receptor interaction pathways (p-value between 2e-03 and 7e-07). Focal adhesion related genes, specifically PTK2, are known to be DE in breast cancer [169]. Moreover, the focal adhesion pathway ultimately affects the signaling pathway. On the contrary, some pathways such as Gap junction and Progesterone-mediated oocyte maturation were enriched only in the module extracted with p and were not enriched in the remaining modules. These pathways have also been mentioned as breast-cancer related [170].

Table 5.3: Pathway analysis of jActiveModules and heinz for the BRCA dataset. The six settings are jActiveModules with p and with p' (δ = 0.3, 0.5, 0.85) and heinz with p and with p'. The first column gives the number of settings in which the pathway is marked as enriched.

Marked  Pathway
4       Focal adhesion
5       ECM-receptor interaction
4       Hematopoietic cell lineage
6       Pathways in cancer
5       Cytokine-cytokine receptor interaction
3       Regulation of actin cytoskeleton
3       Hypertrophic cardiomyopathy (HCM)
3       ErbB signaling pathway
5       Bladder cancer
3       Dilated cardiomyopathy
2       Leukocyte transendothelial migration
1       Renal cell carcinoma
1       Pancreatic cancer
1       Chemokine signaling pathway
1       Long-term depression
3       Cell cycle
2       Arrhythmogenic right ventricular cardiomyopathy (ARVC)
2       Cell adhesion molecules (CAMs)
2       Complement and coagulation cascades
1       Proteasome
1       Gap junction
1       Progesterone-mediated oocyte maturation
1       Prostate cancer
1       Melanoma
1       Glioma

5.3.2 Colorectal cancer cell line (CRC)

The CRC dataset contains 36,952 coding and non-coding genes where only 11,853 of them are expressed coding genes and have a corresponding Entrez gene ID. However, only 7,456 of these genes exist in the PPI network. In addition, there are 251 DE genes where only 157 of them exist in the PPI network. The CRC data contains only 2 samples; therefore, the gene co-expression network cannot be accurately constructed. Hence, the RNA-Seq PPI network is used in this experiment.

(Bar chart of the number of nodes and the number of edges for p at δ = 0 and for p'-lin and p'-exp at δ = 0.3, 0.5, 0.85; jActiveModules and heinz.)

Figure 5.3: The number of nodes/edges in the modules with different δ values and scaling functions for the CRC dataset: p'-lin and p'-exp refer to using linear and exponential scaling, respectively.

The module sizes for jActiveModules and heinz are shown in Figure 5.3. The

figure includes both linear and exponential scaling results. In the figure, δ = 0 implies that p' = p. A positive δ leads to updates on the old p-values due to the connections between the genes, and a larger δ increases the impact of these updates. For this experiment, we achieved a 70% reduction in the network size for jActiveModules with δ = 0.85 and exponential scaling.

To measure the significance of the output modules, we looked for the number of DE genes in each module. In addition, we looked for genes that are known to be related to CRC and popular in the literature (OMIM⁴ and MSigDB⁵). We found 7 genes related to the resistance behavior: TYMS [143], TK1 [144], CDH1 [147], UMPS [148], ABCB1 [140], GDF15 [146], and TNFRSF1B [146]. All of these significant genes are DE in the RNA-Seq data. As shown in Table 5.4, only 5% of the network detected by jActiveModules with the original p-value, p, was DE, and all 7 significant genes were present. With PRASE, the ratio of DE genes increased to 8%, and 4 of the 7 significant genes were present, namely, TYMS, CDH1, UMPS, and ABCB1. heinz also detected exactly these 4 genes with the addition of GDF15.

Table 5.4: The number of DE and significant (SIG) genes in each module for jActiveModules. The numbers in parentheses are the percentages of DE genes in the module. heinz (with or without PageRank) detected 34 (16%) DE and 5 SIG genes.

       p        p'-lin                        p'-exp
       δ=0      δ=0.3    δ=0.5      δ=0.85    δ=0.3    δ=0.5    δ=0.85
#DE    97 (5%)  84 (9%)  105 (10%)  49 (8%)   78 (8%)  90 (8%)  46 (8%)
#SIG   7        5        5          4         5        5        4

A further analysis revealed that over-expression of TYMS is known to increase the resistance behavior for the 5-FU drug [146] while UMPS is believed to critically affect the response of the tumor to the drug [148]. Hence, in addition to a reduction in the module size and an increase in the DE gene percentage, PRASE helped in detecting genes that are believed to be highly relevant to the drug resistance.

⁴ http://www.ncbi.nlm.nih.gov/omim
⁵ http://www.broadinstitute.org/gsea/msigdb/

Table 5.5: Percentages of DE genes detected by jActiveModules and heinz for the Oligo dataset.

                 p     p'-lin                 p'-exp
                 δ=0   δ=0.3  δ=0.5  δ=0.85   δ=0.3  δ=0.5  δ=0.85
jActiveModules   33%   29%    31%    32%      25%    23%    29%
heinz            29%   29%    29%    29%      29%    28%    26%

5.3.3 Oligodendroglioma tumors

The Oligo dataset contains 49,868 coding and non-coding genes where only 10,823 of them are expressed protein coding genes and have an Entrez gene ID. 7,082 out of the 10,823 genes exist in the PPI network and 726 of them are DE. Similar to the

CRC dataset, the gene co-expression network cannot be accurately constructed from the 6 samples. The RNA-Seq PPI network is used in this experiment.

The sizes of the networks for jActiveModules and heinz with and without

PRASE (see δ = 0) are shown in Figure 5.4. Using jActiveModules with δ = 0.85, we obtained a 55% reduction in the network size with almost the same DE-gene percentage. However, there was no improvement for heinz with PRASE.

We further analyzed the output modules w.r.t. the hub nodes they contain. We found that all the modules include the Epidermal Growth Factor Receptor⁶ (EGFR) as the hub node, except for δ = 0.85 with exponential scaling, which returned different hub nodes: CD44, RPL4, and RPLP0. EGFR is a transmembrane protein that binds to EGF. Binding to the ligand leads to receptor dimerization and tyrosine autophosphorylation, leading to cell proliferation. Glioma cells increase the expression of this gene to boost the tumor behavior. Therefore, it is considered a target for

⁶ http://www.ncbi.nlm.nih.gov/gene/1956

[Figure 5.4, panels (a) jActiveModules and (b) heinz: each panel plots the number of nodes and the number of edges of the output module for p (δ=0), p'-lin (δ=0.3, 0.5, 0.85), and p'-exp (δ=0.3, 0.5, 0.85).]

Figure 5.4: Module sizes with various δ values and scaling functions for the Oligo dataset: p'-lin and p'-exp refer to using linear and exponential scaling, respectively.

new drug development [171]. CD44⁷ is a gene that is involved in cell-cell interactions and cell migration. In addition, it is a receptor for the hyaluronic acid (HA) ligand. The protein also participates in many cellular processes, including tumor metastasis. It is also found to be overexpressed in invasive oligodendroglioma tumors [172, 173]. However, it was contained only in the modules obtained with PRASE, not in the original ones. On the other hand, RPL4 is one of the top 50 markers in anaplastic oligodendroglioma according to MSigDB.

We further examined the enrichment of known oligodendroglioma-related genes and pathways in the output modules. We obtained two gene sets from MSigDB and one other set from oligodendroglioma-related pathways found in KEGG (Table 5.1: rows 1, 2, and 7). The first gene set from MSigDB, GSEA1, contains 45 marker genes for anaplastic oligodendroglioma, of which only 38 exist in the PPI network. The marker genes were obtained by performing microarray expression analysis for the 12,000 genes in a set of 50 gliomas: 28 glioblastomas and 22 anaplastic oligodendrogliomas. The second gene set, GSEA2, contains 46 marker genes for glioblastoma multiforme, of which only 40 exist in the PPI network. The genes were also extracted from the same microarray experiment. Glioblastoma multiforme and oligodendrogliomas are similar brain gliomas. However, the former is more aggressive than the latter [174]. We analyzed the percentages of these genes in the output modules.

The results of the analysis for jActiveModules are shown in Figure 5.5. For GSEA1 and GSEA2, p'-exp with δ = 0.85 returned the largest percentage, 3.6%. However, for the KEGG pathway, p'-lin with δ = 0.85 returned the largest percentage, 5.7%. In general, PRASE with the different δ values and scaling functions enhanced the

⁷ http://www.ncbi.nlm.nih.gov/gene/960

[Figure 5.5 plot: percentage of genes (y-axis, 0 to 6) from GSEA1, GSEA2, and OligoPath for p (δ=0), p'-lin (δ=0.3, 0.5, 0.85), and p'-exp (δ=0.3, 0.5, 0.85).]

Figure 5.5: Percentages of important genes in the jActiveModules networks for the Oligo dataset. The y-axis shows the ratio of the important genes to all the genes in the module. GSEA1, GSEA2, and OligoPath are aliases for the gene sets in rows 1, 2, and 7 of Table 5.1, respectively.

percentage of important genes in the output modules. For heinz, using PRASE did not affect the percentages. All heinz modules obtained in our experiments reached approximately 1.6%, 0.5%, and 0.7% for the important genes from GSEA1, GSEA2, and OligoPath, respectively.
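The percentages reported above are, per the caption of Figure 5.5, the ratio of the important genes to all the genes in a module. A minimal sketch of that ratio follows; the function name and the toy gene lists are illustrative assumptions and are not taken from the Oligo modules.

```python
def important_gene_percentage(module_genes, important_genes):
    """Percentage of the module's genes that belong to a reference gene set."""
    module = set(module_genes)
    if not module:
        return 0.0
    return 100.0 * len(module & set(important_genes)) / len(module)


# Toy, made-up example: 2 of the 5 module genes are in the reference set.
module = ["EGFR", "CD44", "RPL4", "GENE_A", "GENE_B"]
reference_set = ["CD44", "RPL4", "GENE_X"]
print(important_gene_percentage(module, reference_set))  # 40.0
```

Because the denominator is the module size, a smaller module with the same number of reference genes yields a higher percentage.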

5.4 Conclusions

PageRank has been used to prioritize genes using expression data and various biological networks such as transcription factor networks [159] and GO networks [158]. These studies only focus on ranking genes rather than taking PageRank one step further to extract or prioritize modules. In this work, we proposed a workflow, PRASE, which uses PageRank to calibrate the p-values and detect important and overlooked patterns while using RNA-Seq data.

Our evaluation showed that PRASE can effectively improve the quality of the output modules. For instance, a 70% reduction in the jActiveModules module size, together with an increase in the percentage of DE genes for the CRC data, clearly indicates that the workflow is promising. Nevertheless, further evaluation may still be required to quantitatively measure the effectiveness of using PageRank. Another potential measure is the betweenness centrality of the genes returned in the modules. Therefore, in the future, we plan to apply this measure, among others, to improve the effectiveness of the workflow.

In addition to the datasets, the effectiveness of the workflow depends on some other parameters, including the scaling method, the damping factor δ, and the merging threshold. Across the different datasets, exponential scaling provided better results than linear scaling. Therefore, we highly recommend exponential scaling as the default scaling method. For the damping factor, we observed that δ = 0.85 generated smaller, good-quality modules when using the PPI network to generate the PageRank values. On the other hand, δ = 0.5 provided the smallest module when using the gene co-expression network instead of the PPI network. We believe such a change in the δ value is due to the degree of the nodes in both networks; the PPI network is sparser than the gene co-expression network. However, further experiments on other datasets might be required to confirm this hypothesis. For the merging threshold, the smaller the threshold is, the smaller the set of old p-values we merge with the new p-values. As a result, any change in the threshold will lead to a change in the obtained active

Table 5.6: A summary of the improvements by PRASE in the experiments. "improved": cases improved with PRASE; "best": cases for which we obtained the best results with PRASE; "n.s.": cases where the effect of PRASE is not significant.

                  BRCA       CRC        Oligo
jActiveModules    improved   improved   best
heinz             improved   n.s.       n.s.

modules. Therefore, in order to not neglect any important genes, we recommend using the largest threshold value of 0.05 to do the merging.
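To make the role of the damping factor δ concrete, the sketch below runs the textbook PageRank power iteration on a small undirected graph. It is only an illustration under stated assumptions: it uses uniform teleportation and is not the PageRank variation or the scaling functions that PRASE applies to the p-values; the function name and the toy star graph are hypothetical.

```python
def pagerank(adjacency, delta=0.85, tol=1e-10, max_iter=1000):
    """Textbook PageRank by power iteration.

    adjacency: dict mapping each node to a list of its neighbours.
    delta: damping factor (probability of following an edge).
    """
    nodes = list(adjacency)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Every node receives the uniform teleportation share first.
        new_rank = {v: (1.0 - delta) / n for v in nodes}
        for v in nodes:
            neighbours = adjacency[v]
            if neighbours:
                share = delta * rank[v] / len(neighbours)
                for u in neighbours:
                    new_rank[u] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                for u in nodes:
                    new_rank[u] += delta * rank[v] / n
        change = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Toy star graph: one hub connected to three leaves.
star = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
high = pagerank(star, delta=0.85)
low = pagerank(star, delta=0.3)
print(high["hub"] - high["a"], low["hub"] - low["a"])
```

In this toy graph a larger δ lets the network structure dominate, so the gap between the hub and the leaves widens; a smaller δ pulls all ranks toward the uniform value.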

Even though extracting the active modules significantly helps in understanding the disease mechanism, the picture is still incomplete as we do not understand how the genotypic changes are related to the phenotypic ones [152]. Recent research tried to tackle this problem and bridge the gap between the two data types [175, 176]. However, the problem is still open and further work is needed. Therefore, we plan to address this problem by extending our work and integrating genotype data in the network extraction process.

Chapter 6: MICA: MicroRNA Integration for Active Module Discovery

The discovery of disease-related modules has been an important problem for a long time. The focus at first was on extracting dense gene clusters from biological networks, e.g., PPI or gene co-expression networks [4, 5]. However, such an approach has proven insufficient for extracting comprehensive modules [24]. Therefore, new algorithms and techniques have been proposed to discover more accurate disease-related modules.

One fruitful technique for extracting such modules is based on integrating gene expression values and the PPI network into one framework [24]. The integration idea was first introduced by Ideker et al. [28] and many others then followed the same approach, e.g., [29, 30]. These discovered modules are called active modules since the gene expression data, which is dynamically changing, is integrated with the static PPI network. Hence, the modules are active in certain cells or conditions. Even though the proposed algorithms have shown their efficiency in discovering disease-related active modules, they still do not make use of the vast amount of available heterogeneous data. Therefore, the discovered modules do not give a complete picture of the disease behavior. Additionally, they focus on the genes in the PPI network, discarding other genes for which we do not yet have any information regarding their interaction patterns.

MicroRNAs (miRNAs) are small non-coding RNAs that are used by the cell to post-transcriptionally regulate gene expression levels [43]. miRNAs inhibit protein synthesis either by stopping the protein translation or by performing mRNA degradation. miRNAs constitute an important inhibition mechanism that has been shown to be very important in different diseases, specifically in cancer progression [44]. For instance, miRNAs were found to be differentially expressed in breast cancer, in addition to successfully classifying estrogen and progesterone receptors and HER2/neu status [45]. Hence, using miRNAs for the active module discovery is a promising technique to increase the accuracy and success rate of the cancer treatments.

Most of the works that integrate miRNA and mRNA data assume that the miRNA effect on the mRNA is distinguishable from the gene expression levels [58, 177]. However, the protein expression level can be significantly affected by the miRNA without having any apparent effect on the gene expression level [62]. [64] suggested another method to integrate miRNA and mRNA by integrating the PPI network and the miRNA-target gene network into one heterogeneous network. They focused on prioritizing the genes using the suggested network. Indeed, such integration would work around the miRNA-mRNA integration problem. However, by focusing only on prioritizing genes through the PPI network, they cannot detect connected modules of genes with indirect dependencies, e.g., through other genes not in the PPI network or through other genes with no change in expression at the mRNA level.

Even though the techniques using gene expression levels provide valuable information, they cannot show the whole picture. Here, we try to exploit another miRNA and mRNA interaction pattern, which is the inhibition of protein translation rather than mRNA degradation. We believe that if the gene expression levels are adjusted based on the expression levels of the corresponding miRNAs, novel and interesting gene-gene dependencies can be unraveled.

In this work, we propose a workflow, Mica, which employs heterogeneous data sources and adopts independent component analysis [178] to extract active modules. To unravel new types of gene-gene dependencies, we provide a novel data integration technique that adjusts the expression level of the genes based on the expression level of the corresponding miRNA. These dependencies are then mapped back to the PPI network to extract the connected modules. Compared to existing active module discovery tools, Mica is less dependent on the given biological network it uses; hence, it does not need to ignore the information for the entities which are not in the network.

There are three types of interactions between a group of miRNAs and a target gene: synergetic, complementary, and additive. A synergetic effect implies that all the miRNAs affecting the gene must be expressed together in order to have mRNA degradation or protein inhibition [179]. Alternatively, miRNAs can act in a complementary way, requiring only one out of the miRNA set to be expressed [179]. In an additive interaction, each miRNA alone has an effect, while the overall effect is increased if multiple miRNAs are expressed [180]. Here, we will focus on the complementary and the additive effects.

The rest of this chapter is organized as follows: In Section 6.1, we provide background on the techniques we used in this work. Our methods and experimental results are presented in Section 6.2 and Section 6.3, respectively. Section 6.4 concludes the work presented in this chapter.

6.1 Background

Independent Component Analysis (ICA) is a well-known technique used to solve the Blind Source Separation problem. Given an input with multiple, linearly mixed sources, it tries to distinguish the sources by minimizing the statistical dependencies between them [178]. In the context of gene expression, ICA decomposes an input expression matrix into its possible expression modes [181]. For an $n \times m$ input gene expression matrix $X$, where rows correspond to genes and columns correspond to samples, ICA decomposes $X$ into:

$$X^{T} = A \times S \qquad (6.1)$$

such that $S$ is an $\ell \times n$ matrix for $\ell \le m$. The rows of $S$ are (statistically) as independent as possible and correspond to the independent components. The columns of $S$ correspond to the genes, and the entry $S_{cg}$ shows the contribution of a gene $g$ to the component $c$. $A$ is an $m \times \ell$ matrix whose rows correspond to samples. The entry $A_{sc}$ shows the contribution of each component $c$ for a sample $s$. Many approximation algorithms have been proposed to find $A$ and $S$ in an efficient way, e.g., fastICA [39], JADE [182], and InfoMax [183]. fastICA tries to identify non-Gaussian components under the assumption that Gaussian components represent the noise. This algorithm can get stuck in a local minimum; hence multiple runs, and thus multiple estimates, may be necessary [184, 185].
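The shapes in (6.1) can be checked with a small numeric sketch. The code below does not estimate components (that is what fastICA and the other algorithms do); it only mixes a made-up $S$ with a made-up $A$ to show how each row of $X^{T}$, i.e., each sample, is a weighted combination of the component rows. All values and the helper name `matmul` are illustrative assumptions.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# m = 3 samples, l = 2 components, n = 4 genes (all values are made up).
A = [
    [1.0, 0.0],  # sample 0 uses only component 0
    [0.5, 0.5],  # sample 1 mixes both components equally
    [0.0, 2.0],  # sample 2 uses only component 1, doubled
]
S = [
    [1.0, -1.0, 0.0, 2.0],  # component 0: one weight per gene
    [0.0, 3.0, 1.0, -1.0],  # component 1: one weight per gene
]

X_T = matmul(A, S)  # m x n matrix: one row per sample, one column per gene
for row in X_T:
    print(row)
```

Here $A$ is $3 \times 2$ and $S$ is $2 \times 4$, so $X^{T}$ is $3 \times 4$, matching an input $X$ with $n = 4$ genes and $m = 3$ samples.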

ICA has been used extensively to cluster different genes together or for sample classification [181, 186, 187, 188, 189, 190, 191, 192]. All of these studies have shown the efficiency of ICA in producing biologically relevant results.

6.2 Methods

Mica consists of three main parts as shown in Figure 6.1:

[Figure 6.1 diagram: microRNA expression profiles and gene expression profiles (controls and cases) feed the integration step, which uses the miRNAs whose z-scores exceed the threshold to produce the adjusted gene expressions; ICA is applied to the adjusted expressions; the output of ICA is combined with the PPI network in the connected module extraction step, which yields the modules.]

Figure 6.1: Mica: The workflow starts with integrating miRNA and mRNA data by adjusting the mRNA data using the miRNA data. Then, ICA is applied on the resulting new gene-expression matrix. Finally, for each independent component obtained by ICA, the largest connected module from the PPI network is extracted using the significant genes in the component.

6.2.1 Data integration

The miRNA and gene expression data are usually integrated using correlation-based methods with the assumption that the miRNA effect on mRNA should be apparent at the gene expression level. However, an miRNA can also act by inhibiting protein translation rather than by suppressing gene expression, and traditional approaches cannot exploit this effect. Our novel integration scheme uses miRNA expression levels to adjust the gene expression. Hence, if a gene is affected by an miRNA at the inhibition level, the proposed integration makes the effect visible at the expression level. To do this, for each sample $s$, we first compute

$$\beta_{g,s} = \frac{\sum_{\{r:\ r \text{ affects } g,\ Z_{r,s} < 0\}} |Z_{r,s}|}{\sum_{\{r:\ r \text{ affects } g,\ Z_{r,s} > 0\}} Z_{r,s}} \qquad (6.2)$$

where $Z_{r,s}$ is the z-score in sample $s$ of an miRNA $r$ that is experimentally verified to affect gene $g$. It is calculated by

$$Z_{r,s} = (x_{r,s} - \mu_r)/\sigma_r \qquad (6.3)$$

where $x_{r,s}$ is the expression level of miRNA $r$ in sample $s$, and $\mu_r$ and $\sigma_r$ are the mean and standard deviation of $r$'s expression level across all the control samples. In (6.2), the miRNAs are divided into two groups since they affect a gene differently. In general, when an miRNA $r$ is down-regulated, i.e., has a negative z-score, the expression of $g$ will increase. On the other hand, when $r$ is up-regulated, the expression of $g$ will decrease. Accordingly, the final gene expression is calculated as

$$e'_{g,s} = \beta_{g,s} \times e_{g,s} \qquad (6.4)$$

where $e_{g,s}$ and $e'_{g,s}$ are the original and adjusted expression levels of gene $g$ in sample $s$.

For data integration, (6.4) is applied to each gene-sample pair. To avoid noise, only the miRNAs with an absolute z-score of at least $t_R$ in more than 10% of the samples are kept. Additionally, $\beta_{g,s}$ must be $> t_R$ or $< \frac{1}{t_R}$ in order to modify $e_{g,s}$, i.e., we require that either the up-regulated or the down-regulated group of miRNAs has a significant effect on $g$.

As mentioned above, miRNAs can affect the genes in a synergetic, complementary, or additive way. Our integration equation (6.4) is additive and partially complementary, i.e., the gene expression level will be affected more if several miRNAs affect it (additive). Yet, when only a single miRNA is active in the sample, it will still affect the expression level (complementary). In the end, our goal is to better highlight the dependencies between the genes rather than to find exact protein expression values; there are many unknown factors affecting the actual protein expression.
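The integration step in (6.2) to (6.4), together with the two $t_R$ filters, can be sketched as follows. This is a minimal illustration under stated assumptions, not the R implementation of Mica: the function names are hypothetical, the sample standard deviation is used for $\sigma_r$ (the text does not say whether the sample or population form is meant), and the expression is left unchanged when either miRNA group is empty, a case that (6.2) does not define.

```python
from statistics import mean, stdev


def z_scores(values, control_values):
    """Eq. (6.3): z-scores of one miRNA, using the control samples for mu and sigma."""
    mu = mean(control_values)
    sigma = stdev(control_values)  # assumption: sample standard deviation
    return [(x - mu) / sigma for x in values]


def keep_mirna(z_across_samples, t_r=4.0):
    """Keep an miRNA only if |z| >= t_R in more than 10% of the samples."""
    hits = sum(1 for z in z_across_samples if abs(z) >= t_r)
    return hits > 0.10 * len(z_across_samples)


def beta(z_values):
    """Eq. (6.2) for one gene in one sample; z_values belong to miRNAs targeting the gene."""
    down = sum(abs(z) for z in z_values if z < 0)
    up = sum(z for z in z_values if z > 0)
    if down == 0 or up == 0:
        return None  # assumption: undefined ratio, so no adjustment is made
    return down / up


def adjust_expression(expression, z_values, t_r=4.0):
    """Eq. (6.4), applied only when beta > t_R or beta < 1 / t_R."""
    b = beta(z_values)
    if b is not None and (b > t_r or b < 1.0 / t_r):
        return b * expression
    return expression


# Down-regulated miRNAs dominate: beta = (6 + 3) / 1.5 = 6 > 4, expression rises.
print(adjust_expression(10.0, [-6.0, -3.0, 1.5]))  # 60.0
# Balanced groups: beta = 1, inside [1/4, 4], expression is left unchanged.
print(adjust_expression(10.0, [-2.0, 2.0]))  # 10.0
```

The ratio form makes the adjustment symmetric: a dominant up-regulated group gives $\beta_{g,s} < 1/t_R$ and lowers the expression, while a dominant down-regulated group gives $\beta_{g,s} > t_R$ and raises it.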

6.2.2 ICA on gene expression values

After the data integration step, the adjusted gene expression values are fed to ICA, for which the R version of the fastICA algorithm is used [39]. To avoid local minima and unreliable independent component estimates, we follow the method in [185]: we run fastICA $\kappa$ times and obtain different independent component estimates at each run. Then, the Pearson correlation coefficients between the components from different estimates are computed to identify the most similar ones. We construct a $\kappa$-partite similarity graph $G = (V, E)$ where $V = V_1 \cup \cdots \cup V_\kappa$ is the set of all components returned by ICA and $V_i$ is the set of components obtained in the $i$th run. The edge set $E$ contains an edge $(c, c')$ if the Pearson correlation coefficient between $c$ and $c'$ is at least 0.9 and they are not obtained in the same run, i.e., $c \in V_i$, $c' \in V_j$, $i \neq j$. To obtain the final component set, we partition $G$ into its maximal connected subgraphs. Then, for each connected subgraph $C$ of $G$ with at least $\kappa$ vertices, we construct a final representative component by computing the average of the $|C|$ rows corresponding to the vertices in $C$.
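The consensus step described above can be sketched with a union-find over the similarity graph. This is a minimal illustration, assuming the component estimates of each run are already available as lists of per-gene weights; the function names and the toy data are hypothetical. The sketch follows the text literally and uses the signed Pearson correlation, so it does not handle components that differ only by a sign flip.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def consensus_components(runs, min_corr=0.9):
    """runs: list of kappa runs, each a list of component vectors (one weight per gene)."""
    kappa = len(runs)
    vertices = [(i, comp) for i, run in enumerate(runs) for comp in run]
    parent = list(range(len(vertices)))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path compression
            v = parent[v]
        return v

    # Add an edge only between components that come from different runs.
    for a in range(len(vertices)):
        for b in range(a + 1, len(vertices)):
            (run_a, comp_a), (run_b, comp_b) = vertices[a], vertices[b]
            if run_a != run_b and pearson(comp_a, comp_b) >= min_corr:
                parent[find(a)] = find(b)

    groups = {}
    for v in range(len(vertices)):
        groups.setdefault(find(v), []).append(vertices[v][1])

    # Keep connected subgraphs with at least kappa vertices and average their rows.
    result = []
    for members in groups.values():
        if len(members) >= kappa:
            n_genes = len(members[0])
            result.append(
                [sum(m[g] for m in members) / len(members) for g in range(n_genes)]
            )
    return result


# Three toy runs: one component recurs in every run, the other one does not.
runs = [
    [[1.0, 2.0, 3.0, 4.0], [1.0, -1.0, 1.0, -1.0]],
    [[1.1, 2.0, 2.9, 4.2], [1.0, -1.0, -1.0, 1.0]],
    [[0.9, 2.1, 3.0, 3.9], [-1.0, 1.0, 1.0, -1.0]],
]
print(consensus_components(runs))
```

Only the recurring component survives, because its three estimates form a connected subgraph with $\kappa = 3$ vertices, while each unstable estimate stays isolated.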

An important parameter of ICA is the number of components $\ell$ to be generated; when $\ell$ is large, ICA will probably return subcomponent-type structures which are not very interesting [193]. A naïve method is setting $\ell = m$, the number of samples, which is not useful in our case since we have hundreds of them. We follow another approach [191] based on an earlier method proposed by [194]. We first apply Singular Value Decomposition (SVD) to the actual gene expression matrix to reduce the dimensionality. We do the same for a randomly permuted version of the same matrix. The variance obtained from each SVD component of the actual matrix is used to draw a curve of the information gain. A similar curve is also generated for the randomly permuted case. The optimal number of components is the point of intersection of these two curves, i.e., the point from which the information obtained from the random components is higher than the information obtained from the actual components.

The matrices $S$ and $A$ generated by ICA can be used to determine which genes are significant in each component and which components are significant in each sample, respectively. There are different options to pick the significant components, e.g., [195, 185, 189]. Here, we used a variant of the correlation method suggested by [189]. Basically, instead of calculating the correlation between the component weight across the samples and the type (control/case) of the samples, the Wilcoxon signed-rank test is used to calculate a p-value for each component based on its weight distribution over the controls and cases. The Bonferroni correction method is then used to correct the p-value. We further compute $\mu$ and $\sigma$ for each component by using its weights in the control samples. We then compute the z-score for each component-case sample pair. Hence, a component is significant for a case if the corresponding z-score is at least a threshold $t_C$.

To determine the set of genes related to a component, we use the z-score threshold based method [195, 188], which was shown to be effective in returning the most important genes for each component. We calculate the z-score of each gene in a component from its weight and from $\mu$ and $\sigma$, which are computed using all the gene weights inside this component. Then, for each component, the genes with a z-score of at least $t_G$ are considered to be members of the component.
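The gene selection rule can be sketched in a few lines. This is an illustration only, assuming one component is given as a mapping from gene to weight; the function name and the toy weights are hypothetical, the sample standard deviation is an assumption, and the signed z-score is used because the text states the threshold without an absolute value.

```python
from statistics import mean, stdev


def member_genes(weights, t_g=2.0):
    """Return {gene: z-score} for genes whose z-score within the component is >= t_G.

    weights: dict mapping each gene to its weight in one component (one row of S).
    """
    values = list(weights.values())
    mu = mean(values)
    sigma = stdev(values)  # assumption: sample standard deviation
    z = {gene: (w - mu) / sigma for gene, w in weights.items()}
    return {gene: score for gene, score in z.items() if score >= t_g}


# Toy component: nine genes with weight 0 and one strongly weighted gene.
toy = {f"g{i}": 0.0 for i in range(9)}
toy["g9"] = 10.0
print(member_genes(toy))  # only g9 passes the threshold
```

Relaxing `t_g` returns more member genes, which is the mechanism the connected module extraction step relies on when the largest connected module is too small.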

6.2.3 Connected module extraction

The connected PPI modules are extracted by mapping the set of member genes in each component to the PPI network and extracting the largest connected module. If there is no connected module, or if the largest one is not large enough, the threshold $t_G$ used to pick the member genes for each component is relaxed to allow more connectivity. However, as the results will show, each component yields a large connected module in the PPI network. In addition, recent studies also showed that the components generated by ICA (or similar techniques) are either highly enriched in the PPI network [177] or highly enriched with signaling pathways [188].
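Extracting the largest connected module amounts to finding the largest connected component of the PPI subgraph induced by the member genes. The breadth-first sketch below assumes the PPI network is given as a list of undirected edges; the function name and the toy edge list are illustrative, not BioGRID data.

```python
from collections import deque


def largest_connected_module(ppi_edges, member_genes):
    """Largest connected component of the PPI subgraph induced by member_genes."""
    members = set(member_genes)
    adjacency = {g: set() for g in members}
    for u, v in ppi_edges:
        # Keep only interactions whose two endpoints are both member genes.
        if u in members and v in members and u != v:
            adjacency[u].add(v)
            adjacency[v].add(u)

    best, seen = set(), set()
    for start in sorted(members):
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        if len(component) > len(best):
            best = component
    return best


# Toy network: X is in the PPI but is not a member gene; F has no interactions.
edges = [("A", "B"), ("B", "C"), ("D", "E"), ("C", "X")]
print(sorted(largest_connected_module(edges, ["A", "B", "C", "D", "E", "F"])))
# ['A', 'B', 'C']
```

Member genes that are absent from the PPI network, such as F here, simply end up as isolated vertices and never enter a module.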

Each component we found after the second step is expected to generate a connected module. It is crucial to define a scoring function to determine which module is the most important one, i.e., contains important member genes. Although a large module is preferable, we do not want the modules to be too large. Therefore, after determining the member genes in each component $c$, the following scoring function is used:

$$\mathrm{scr}(c) = \frac{\sum_{g \in c} Z_{cg}}{\sqrt{|c|}} \qquad (6.5)$$

where $|c|$ is the number of member genes in $c$. We used $\sqrt{|c|}$ instead of $|c|$ since we want to give a higher score to larger modules. A gene $g$ will have a high $Z_{cg}$ value if it is significant for $c$. Therefore, if a connected module contains many important genes, the module is considered to be important.
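A short numeric sketch shows why dividing by $\sqrt{|c|}$ rather than $|c|$ favors larger modules. The function name and the z-score values are illustrative assumptions.

```python
from math import sqrt


def module_score(z_scores):
    """Eq. (6.5): sum of the member-gene z-scores divided by sqrt(|c|)."""
    return sum(z_scores) / sqrt(len(z_scores))


small = [3.0] * 4    # 4 member genes, each with z-score 3
large = [3.0] * 16   # 16 member genes, each with z-score 3

print(module_score(small))  # 6.0
print(module_score(large))  # 12.0
# Dividing by |c| instead would give 3.0 for both modules.
print(sum(small) / len(small), sum(large) / len(large))
```

With equally significant genes, the score grows with $\sqrt{|c|}$, so size is rewarded but less than linearly.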

6.3 Results

We implemented our proposed workflow Mica in R and used the available implementation of the fastICA algorithm. To demonstrate the effectiveness of the proposed workflow, that is, the added benefits of early integration of microRNA datasets, we compared the modules obtained by our workflow Mica against the ones obtained using ICA and DEGAS [30] with the original gene expression values. DEGAS is a set-cover based algorithm known for its efficiency in detecting dysregulated pathways. It tries to detect a module with at least k differentially expressed (DE) genes shared between most of the samples. We tuned the DEGAS parameters to detect the best module according to a measure provided by the tool, which is based on how far the size of the module is from that of a randomly generated subnetwork of k genes. We set the maximum number of modules for DEGAS to 5. Still, it returned a single module in the experiments. In the rest of the text, DEGAS output modules are referred to as degas, ICA modules as ica, and Mica modules as mica.

We carried out the experiments on two datasets for two breast cancer subtypes: invasive lobular carcinoma (ILC) and invasive ductal carcinoma (IDC).

Both datasets are from TCGA (https://tcga-data.nci.nih.gov/tcga/) and both contain RNA-Seq and miRNA-Seq data. High-throughput sequencing data was used in our experiments since it can provide a complete picture of all the miRNAs and mRNAs in the cell without requiring any a priori information. The main aim of using two different subtypes of the same disease is to understand how well different techniques detect modules specific to each subtype.

The ILC dataset has 106 control samples and 153 case samples. All of the 259 samples have gene expression information. Out of the 153 cases, only 150 contain miRNA expression data as well. Therefore, only these 150 cases are used in our experiments. The IDC dataset shares the 106 control samples with the ILC dataset. It also has 714 case samples with gene expression information; however, only the 699 case samples that also have miRNA expression information are used in our experiments.

The PPI network used for the module extraction was obtained from the BioGRID (http://thebiogrid.org) database (rel. 3.2.104). It contains 139,539 interactions between 18,170 proteins. The experimentally validated miRNA-target interactions used in data integration are obtained from miRTarBase (rel. 4.5) [196].

The number of runs $\kappa$ for ICA is set to 100, the threshold $t_R$ is set to 4, and $t_C$ and $t_G$ are set to 2. We set the thresholds high since we only want to keep the values that have the potential to be important.

The quality of the output modules is verified using different methods, including pathway enrichment analysis, GO enrichment analysis, disease ontology (DO) enrichment analysis, and evidence in the literature on the importance of the modules/genes. Enrichment analysis is performed using ReactomePA [197], FunDo [198], and clusterProfiler [199].

Table 6.1: Size of the modules obtained using Mica and ICA. # is the component number, S is the number of samples a component covers, |c| is the size of the component, |c|ppi is the number of genes that are both in the component and the PPI network, N and E are the number of nodes and edges, respectively, for the largest connected module in the PPI, and scr(c) is the score of the largest connected module. The missing component is a very small one.

(a) ICA
#    S      |c|     |c|ppi   N      E       scr(c)
1    55     754     657      221    348     39.43
3    54     279     267      103    143     25.33
4    28     703     641      274    510     50.70
5    4      542     448      116    141     28.80
6    7      349     320      116    337     26.68
7    2      204     176      30     29      12.81

(b) Mica
#    S      |c|     |c|ppi   N      E       scr(c)
1    103    501     475      164    272     55.63
2    49     284     242      21     21      12.71
3    67     1007    879      339    585     49.51
4    30     455     446      283    506     52.41
5    68     931     876      541    1535    66.91
6    9      889     752      253    354     46.04
7    3      790     738      410    1297    51.04

6.3.1 Results on ILC data

The Mica modules are meaningfully different from ICA modules. Table 6.1 shows the number of samples they cover, the size of each component, the number of member genes in the PPI network, the size of the largest connected module, and the score. In general, for each of ICA and Mica components, there is a large connected module in the PPI network. Interestingly, Mica modules have higher scores than ICA modules in addition to being more common across the samples.

We also use DEGAS on the ILC dataset for comparison purposes. The degas module consists of 347 genes with 730 interactions between them, and the number of DE genes in this module is 200. The quality, i.e., the module size p-value, is 0.19, which can be considered large. We tried different options for DEGAS to get a better module; however, this is the best module we obtained.

Statistical analysis of the obtained components: An important step is to

first ensure that the obtained Mica components, hence the active modules, cannot

[Figure 6.2 plot: distribution of the random t-scores (x-axis: t-score, from -10 to 15), with the t-scores of the components mica1 to mica7 marked.]

Figure 6.2: Random t-score distribution.

be obtained from a random matrix. Therefore, we set our null hypothesis to be that the t-score calculated for each component from its weights across the case and control samples in the $A$ matrix can be obtained from a random input matrix. Accordingly, we generated 1000 random matrices by randomly permuting the modified gene expression values of each gene across the case and control samples. Afterwards, we applied Mica to the random matrices and calculated the t-scores for the randomly generated components. For each of the 1000 runs, we only kept the max/min t-score value. Finally, using the t-scores from the random runs, we generated the distribution of the random t-scores and compared our actual t-scores against it. The random t-score distribution and the components' t-score values are shown in Figure 6.2. Clearly, the components cannot randomly gain such high t-scores (i.e., p-value = 0). Therefore, the null hypothesis is rejected.
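The comparison against a random t-score distribution can be sketched as an empirical permutation test. The code below is a deliberately simplified stand-in: instead of permuting the expression matrix and rerunning Mica, it shuffles the case/control assignment of one component's weights and keeps the absolute t-score of each random run. The Welch form of the t-score, the function names, the fixed seed, and the toy weights are all assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import mean, variance


def welch_t(cases, controls):
    """Welch t-score between the case and control weights of one component."""
    se = sqrt(variance(cases) / len(cases) + variance(controls) / len(controls))
    return (mean(cases) - mean(controls)) / se


def null_extremes(weights, n_cases, n_runs=1000, seed=0):
    """Absolute t-scores obtained after randomly reassigning samples to the two groups."""
    rng = random.Random(seed)
    shuffled = list(weights)
    extremes = []
    for _ in range(n_runs):
        rng.shuffle(shuffled)
        extremes.append(abs(welch_t(shuffled[:n_cases], shuffled[n_cases:])))
    return extremes


def empirical_p_value(observed_t, extremes):
    """Fraction of random runs whose t-score is at least as extreme as the observed one."""
    return sum(1 for t in extremes if t >= abs(observed_t)) / len(extremes)


# Toy component weights: cases are clearly shifted away from controls.
cases = [5.0 + 0.1 * i for i in range(10)]
controls = [0.1 * i for i in range(10)]
observed = welch_t(cases, controls)
null = null_extremes(cases + controls, n_cases=len(cases), n_runs=200)
print(observed, empirical_p_value(observed, null))
```

An empirical p-value of 0 in such a test means that none of the random runs reached the observed t-score, so the true p-value is bounded by roughly one over the number of runs rather than being exactly zero.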

[Figure 6.3 plot: AUC (y-axis, 0.95 to 1.00) for MICA, ICA, and DEGAS.]

Figure 6.3: AUC for Mica, ICA, and DEGAS for a 10-fold cross validation.

Classification using modified and original gene expression: It is important to ensure that the modified gene expression data better differentiate between case and control samples. To this end, we compared the prediction accuracy of Mica modules on the modified gene expression data with that of ICA and DEGAS modules on the original data. For Mica modules, a Support Vector Machine (SVM) was trained on each module separately, with the genes in each module used as the input features. Afterwards, a vote was taken among the modules to determine the output classification. The same was applied to ICA but with the original data. For DEGAS, no voting was required since it only has one module. The results for a 10-fold cross validation are shown in Figure 6.3. In general, Mica and ICA obtain a better classification accuracy than DEGAS, with Mica being more stable across the different runs and obtaining an AUC value of 1 in almost all of the runs.
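The voting step can be sketched independently of the classifier. The code below assumes that each module's SVM has already produced a 0/1 label (control/case) for every sample and combines them by simple majority; the text does not specify the voting rule, so the majority rule, the tie handling (ties go to control), and the function name are assumptions.

```python
def majority_vote(per_module_predictions):
    """Combine per-module 0/1 predictions (1 = case) into one label per sample.

    per_module_predictions: list with one entry per module, each a list of labels
    for the same samples in the same order.
    """
    n_modules = len(per_module_predictions)
    n_samples = len(per_module_predictions[0])
    voted = []
    for s in range(n_samples):
        case_votes = sum(pred[s] for pred in per_module_predictions)
        # Assumption: a strict majority is needed to call a sample a case.
        voted.append(1 if 2 * case_votes > n_modules else 0)
    return voted


# Three hypothetical module classifiers, three samples.
predictions = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
]
print(majority_vote(predictions))  # [1, 0, 1]
```

With a single module, as for DEGAS, the vote reduces to that module's own prediction.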

Active modules analysis: The next step is to see which genes exist in each active module, how the different active modules overlap, and the enrichment of each module with important GO annotations. Interestingly, there was not a large overlap between Mica, ICA, and DEGAS; degas overlaps with 12% of mica5 while ica4 overlaps with 17% of mica6. Nevertheless, there were some similarities in the top enriched GO annotations (i.e., with corrected p-value $< 10^{-15}$). Among the top similar ones are translational elongation between ica6 and mica7, positive regulation of biological process between ica4 and mica6, cellular macromolecule metabolic process between mica1 and degas, and organelle organization between mica4 and degas. On the other hand, the top different ones included protein transport in ica1, cardiovascular system development and extracellular matrix organization in ica5, response to endoplasmic reticulum stress in mica2, RNA processing and splicing in mica3, and cell cycle and cell cycle process in mica5.

Since we are working with active modules that are going to be further used to extract important pathways, we also performed pathway enrichment analysis to better evaluate the quality of the active modules. The results are shown in Table 6.2. Similar to GO annotations, some pathways are common between Mica, ICA, and DEGAS. For instance, both degas and mica5 were enriched with the cell cycle pathway; however, according to Table 6.2, the p-value for mica5 ($1.19 \times 10^{-21}$) was much smaller than the p-value for degas ($7.79 \times 10^{-3}$). Remarkably, mica5 was enriched with more cell cycle-related pathways, such as the cell cycle, mitotic, and checkpoints pathways, with BRCA1 common among most of these pathways. Mutations in BRCA1 lead to genetic instability and deficiency in the different cell cycle phases [200]. Additionally, its absence results in breast cancer formation.

Table 6.2: Pathway enrichment analysis for Mica, ICA, and DEGAS on the ILC data.

                                                                            MICA                     ICA                      DEGAS
Pathway                                                                     %      pval           #  %      pval           #  %      pval
Reactome
Unfolded Protein Response                                                   23.81  6.78 × 10^-05  2  3.64   8.20 × 10^-03  4  -      -
Processing of Capped Intron-Containing Pre-mRNA                             5.60   4.21 × 10^-03  1  -      -              -  -      -
mRNA Splicing                                                               5.30   4.21 × 10^-03  3  -      -              -  -      -
Cell Cycle, Mitotic                                                         18.48  1.19 × 10^-21  5  -      -              -  11.53  7.79 × 10^-3
Mitotic M-M/G1 phases                                                       13.31  3.75 × 10^-18  5  -      -              -  -      -
Elastic fibre formation                                                     4.74   4.05 × 10^-05  6  11.21  7.30 × 10^-11  5  -      -
Molecules associated with elastic fibres                                    3.95   2.81 × 10^-04  6  -      -              -  -      -
3'-UTR-mediated translational regulation                                    8.29   3.77 × 10^-05  7  22.41  8.20 × 10^-14  6  -      -
L13a-mediated translational silencing of Ceruloplasmin expression           8.29   3.77 × 10^-05  7  22.41  8.20 × 10^-14  6  -      -
Formation of a pool of free 40S subunits                                    7.80   3.98 × 10^-05  7  19.83  5.30 × 10^-12  6  -      -
Eukaryotic Translation Initiation                                           8.29   3.98 × 10^-05  7  18.97  3.03 × 10^-11  6  -      -
Antigen Presentation: Folding, assembly and peptide loading of class I MHC  -      -              -  4.52   1.62 × 10^-06  1  -      -
Interferon alpha/beta signaling                                             -      -              -  5.88   7.99 × 10^-05  1  -      -
Golgi Cisternae Pericentriolar Stack Reorganization                         -      -              -  2.71   5.10 × 10^-04  1  -      -
ER-Phagosome pathway                                                        -      -              -  4.98   6.98 × 10^-04  1  -      -
PERK regulated gene expression                                              -      -              -  2.19   3.49 × 10^-03  4  -      -
Toll Like Receptor 4 (TLR4) Cascade                                         -      -              -  5.47   4.25 × 10^-03  4  -      -
Cytokine Signaling in Immune system                                         -      -              -  10.21  4.25 × 10^-03  4  -      -
Antigen Presentation: Folding, assembly and peptide loading of class I MHC  -      -              -  2.55   6.00 × 10^-03  4  -      -
Extracellular matrix organization                                           -      -              -  21.55  5.25 × 10^-15  5  -      -
Molecules associated with elastic fibres                                    -      -              -  9.48   3.27 × 10^-09  5  -      -
Integrin cell surface interactions                                          -      -              -  11.21  2.02 × 10^-07  5  -      -
Degradation of collagen                                                     -      -              -  8.62   5.17 × 10^-06  5  -      -
Translation                                                                 -      -              -  24.13  8.66 × 10^-14  5  -      -
Cap-dependent Translation Initiation                                        -      -              -  22.41  8.66 × 10^-14  6  -      -
Eukaryotic Translation Initiation                                           -      -              -  22.41  8.66 × 10^-14  6  -      -
GTP hydrolysis and joining of the 60S ribosomal subunit                     -      -              -  21.55  2.74 × 10^-13  6  -      -
Peptide chain elongation                                                    -      -              -  18.10  9.89 × 10^-11  6  -      -
Nonsense Mediated Decay Independent of the Exon Junction Complex            -      -              -  18.10  1.71 × 10^-10  6  -      -
Repair synthesis for gap-filling by DNA polymerase in TC-NER                -      -              -  -      -              -  1.73   7.32 × 10^-3
Removal of the Flap Intermediate from the C-strand                          -      -              -  -      -              -  1.72   7.32 × 10^-3
Telomere Maintenance                                                        -      -              -  -      -              -  3.75   7.32 × 10^-3
KEGG
Pancreatic cancer                                                           6.70   1.05 × 10^-04  1  6.03   4.15 × 10^-03  5  -      -
Pathways in cancer                                                          15.24  1.05 × 10^-04  1  14.66  2.59 × 10^-03  5  -      -
Small cell lung cancer                                                      7.31   1.05 × 10^-04  1  7.75   7.07 × 10^-04  5  -      -
Chronic myeloid leukemia                                                    6.09   7.01 × 10^-04  1  6.89   1.26 × 10^-03  5  -      -
Colorectal cancer                                                           5.49   8.10 × 10^-04  1  5.17   9.87 × 10^-03  5  -      -
Bladder cancer                                                              4.27   2.18 × 10^-03  1  -      -              -  -      -
Prostate cancer                                                             6.09   2.24 × 10^-03  1  -      -              -  -      -
Non-small cell lung cancer                                                  74.27  8.10 × 10^-03  1  -      -              -  -      -
Protein processing in ER                                                    52.38  4.65 × 10^-11  2  12.22  1.10 × 10^-08  1  -      -
Spliceosome                                                                 6.19   1.24 × 10^-03  3  -      -              -  -      -
Osteoclast differentiation                                                  8.70   1.85 × 10^-06  6  -      -              -  -      -
Complement and coagulation cascades                                         4.74   1.62 × 10^-03  6  -      -              -  -      -
Ribosome                                                                    7.07   1.76 × 10^-10  7  17.24  3.34 × 10^-14  6  -      -
ECM-receptor interaction                                                    -      -              -  11.21  3.83 × 10^-07  6  -      -
Focal adhesion                                                              -      -              -  16.28  3.83 × 10^-07  6  -      -
TGF-beta signaling pathway                                                  -      -              -  7.76   7.07 × 10^-04  6  -      -

Figure 6.4: Overlap between important pathways enriched in both Mica and ICA modules. Orange is for Mica, blue is for ICA, and green is for genes in both. A) Pathways in cancer (mica1 and ica5), B) Protein processing in endoplasmic reticulum (mica2 and ica1), C) Ribosome (mica7 and ica6).

Pathways that are highly enriched in both Mica and ICA modules include the pathways in cancer, ribosome, and protein processing in endoplasmic reticulum pathways. Figure 6.4 shows the overlap between Mica and ICA on those pathways.

The pathways in cancer pathway is enriched in both mica1 and ica5. Remarkably, mica1 contains key breast cancer genes including ERBB2, MYC, RB1, and NFKB1. Additionally, mica1 is more common across the samples than ica5. ERBB2 is a growth factor receptor gene that is overexpressed in breast cancer and is usually related to the aggressiveness of the tumor and its resistance to chemotherapy [201]. The RB1 gene is mutated in breast cancer [202], while the NFKB1 gene has a major role in invasive breast cancer [203]. MYC is a multifunctional protein that plays a role in cell cycle progression and cellular transformation. Amplification of MYC is a frequent event in breast cancer and is often associated with the metastatic form of the tumor [204]. The protein processing in endoplasmic reticulum pathway is another interesting pathway that is enriched in both mica2 and ica1. The endoplasmic reticulum (ER) is an essential organelle involved in many important functions such as protein folding and secretion. In cancer cells, the unfolded protein response (UPR) and ER-associated degradation (ERAD) pathways, which are parts of the protein processing in ER pathway, are both activated to help the survival and metastasis of the cancer cells [205]. Interestingly, the EDEM1 and SEL1L genes (mica2) are important parts of the ERAD component, in addition to being de-regulated in cancer cells [205].

Since mica1, mica2, ica1, and ica5 contain interesting pathways, we further performed disease ontology enrichment analysis on these modules using FunDO [198]. The top diseases enriched in the modules, after Bonferroni correction, are: cancer (2.11 × 10^-21) and breast cancer (1.11 × 10^-4) in mica1, cancer (1.15 × 10^-3) in mica2, cancer (2.34 × 10^-12) in ica5, and cancer (6.2 × 10^-5) and melanoma (1.1 × 10^-4) in ica1. Clearly, mica1 is the module most enriched for and related to cancer in general and breast cancer in particular.
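As an illustration of the kind of computation behind such corrected enrichment p-values, the following minimal Python sketch assumes a one-sided hypergeometric test followed by Bonferroni correction. The exact statistic used by FunDO is not restated here, so the choice of test, the function names, and the toy numbers are assumptions made only for illustration.

```python
from math import comb


def hypergeom_enrichment_p(universe_size, term_size, module_size, overlap):
    """P(X >= overlap) when module_size genes are drawn without replacement
    from a universe in which term_size genes carry the annotation."""
    upper = min(term_size, module_size)
    total = comb(universe_size, module_size)
    tail = sum(
        comb(term_size, k) * comb(universe_size - term_size, module_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


# Toy numbers (hypothetical): a 20,000-gene universe, a 100-gene module,
# and three disease terms tested against it.
raw = [
    hypergeom_enrichment_p(20000, 500, 100, 15),  # strong over-representation
    hypergeom_enrichment_p(20000, 300, 100, 2),   # close to expectation
    hypergeom_enrichment_p(20000, 50, 100, 0),    # no overlap required
]
corrected = bonferroni(raw)
```

Bonferroni correction is the most conservative of the common multiple-testing corrections, which is why only very small raw p-values remain significant after it.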

6.3.2 Results on IDC data

Invasive Ductal Carcinoma (IDC) is another well-known breast cancer subtype. Previous works showed that IDC and ILC act differently and have different sets of DE genes [206, 207]. Nevertheless, we expect to find some common pathways between them, even though each pathway might include different sets of genes [208].

Similar to ILC, we first used the dataset with ICA and Mica to see how different the output is when the miRNA data is added. As shown in Table 6.3, there is a significant difference between the ICA and Mica modules. Mica produced more high-scoring modules than ICA. In addition, Mica produced 66 modules while ICA produced 35 modules. We further analyzed the highest scoring modules from the two methods, namely, ica18, ica21, and ica30 from ICA and mica7, mica15, mica33, mica42, and mica63 from Mica. Those modules are the highest scoring modules with a score > 60. By comparing the modules from ICA and Mica, we found that the most similar ones are mica42 and ica30, with 266 genes present in both. The remaining Mica and ICA modules did not have any significant overlap.
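The module comparison above amounts to counting shared genes between every pair of Mica and ICA modules. A minimal sketch, assuming modules are given as plain gene lists; the function names and the toy gene sets (small subsets built from genes named in this section plus placeholder names, not the real modules) are illustrative assumptions:

```python
def module_overlap(module_a, module_b):
    """Return the shared genes and the Jaccard index of two gene lists."""
    a, b = set(module_a), set(module_b)
    shared = a & b
    union = a | b
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, jaccard


def best_matches(modules_x, modules_y):
    """For each module in modules_x, find the module in modules_y that
    shares the most genes with it (None if nothing is shared)."""
    result = {}
    for name_x, genes_x in modules_x.items():
        best_name, best_shared = None, set()
        for name_y, genes_y in modules_y.items():
            shared, _ = module_overlap(genes_x, genes_y)
            if len(shared) > len(best_shared):
                best_name, best_shared = name_y, shared
        result[name_x] = (best_name, len(best_shared))
    return result


# Toy subsets (hypothetical, for illustration only).
mica_toy = {
    "mica42": ["BRCA1", "BRCA2", "RAD51", "TOP3A", "HMG20B"],
    "mica_other": ["GENE_Z"],
}
ica_toy = {
    "ica30": ["BRCA1", "BRCA2", "RAD51", "CKS2"],
    "ica_other": ["GENE_X", "GENE_Y"],
}
matches = best_matches(mica_toy, ica_toy)
shared_genes, jaccard_index = module_overlap(mica_toy["mica42"], ica_toy["ica30"])
```

The raw shared-gene count favors large modules, so the Jaccard index is returned alongside it as a size-normalized alternative.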

By further examining the genes in mica42 and ica30, we found that both contain BRCA1, BRCA2, BRIP1, BLM, RAD51, UBE2C, and CKS2. BLM and RAD51 have a tumorigenic significance [209], UBE2C and CKS2 are among the genes that are DE in IDC [210], and BRCA1, BRIP1, and BRCA2 are known to be mutated in breast cancer (http://cancer.sanger.ac.uk/cancergenome/projects/census/). On the other hand, only mica42 contains the TOP3A, HMG20B, RAD51C, CDC6, and U2AF1 genes.

Table 6.3: The components obtained by ICA and Mica. # is the component number, S is the number of samples a component covers, |c| is the size of the component, |c|ppi is the number of genes that are both in the component and the PPI network, N and E are the numbers of nodes and edges, respectively, of the largest connected module in the PPI network, and scr(c) is the score of the largest connected module. The missing components are either too small, have a very small connected module, or have a score of less than 30.

(a) ICA

#    S    |c|   |c|ppi   N     E      scr(c)
1    418  533   477      114   140    42.29
2    130  643   556      95    105    24.5
3    201  507   441      130   182    45.78
4    199  660   488      72    92     22.36
5    15   638   542      102   124    30.08
7    28   388   341      118   179    52.08
13   184  492   419      55    69     33.4
14   693  812   659      185   248    40.82
15   64   752   622      117   131    34.5
17   246  500   450      97    108    41.98
18   87   897   849      391   775    61.95
21   123  744   669      303   522    61.43
23   136  386   343      77    109    46.12
24   201  503   447      112   137    26.47
25   253  423   376      110   153    49.62
26   173  690   601      197   316    44.53
29   6    708   612      186   234    34.55
30   513  675   649      454   1851   83.63
31   42   540   457      171   252    33.83
32   38   603   502      111   140    27.59
34   16   749   588      176   220    45.63
35   554  501   457      84    95     45.25

(b) Mica

#    S    |c|   |c|ppi   N     E      scr(c)
1    324  595   538      154   182    45.82
2    76   571   526      212   329    37.71
3    523  535   473      68    78     35.5
5    319  679   604      169   249    37.61
7    296  400   374      147   234    61.78
8    174  655   592      188   266    36.24
11   414  661   583      136   176    34.89
15   336  317   267      42    47     59.76
16   255  733   670      299   458    47.19
17   216  542   425      67    86     36.61
21   436  272   258      101   208    58.79
23   208  543   473      91    113    33.63
24   262  570   512      167   275    34.7
25   309  532   483      184   244    57.42
26   328  403   377      152   243    54.86
27   278  455   389      80    88     31.39
28   262  655   579      162   214    36.99
31   257  682   602      202   726    50.76
33   245  289   280      138   297    79.69
35   380  495   433      106   135    31.81
36   160  768   662      286   909    54.72
37   169  534   471      135   199    30.98
38   166  700   619      178   218    36.7
39   132  665   607      197   298    36.41
42   544  682   633      348   1063   66.97
45   185  634   565      156   202    32.8
61   99   535   433      78    104    40.37
63   242  243   230      101   188    66.43
64   186  565   506      163   222    49.83
65   1    494   444      159   246    58.05
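The quantities |c|ppi, N, and E reported in Table 6.3 can be obtained with a simple graph traversal. The sketch below is one possible way to compute them, assuming the PPI network is given as an undirected edge list; the function name and the toy network are hypothetical, and scr(c) is not computed because its definition is given elsewhere in the text.

```python
from collections import deque


def largest_connected_module(component_genes, ppi_edges):
    """Map a component onto the PPI network and return the size of the
    mapped gene set plus the largest connected module among those genes."""
    genes = set(component_genes)
    ppi_nodes = {node for edge in ppi_edges for node in edge}
    mapped = genes & ppi_nodes  # genes in both the component and the PPI

    # Keep only interactions whose two endpoints are both mapped genes.
    adjacency = {gene: set() for gene in mapped}
    for u, v in ppi_edges:
        if u in mapped and v in mapped and u != v:
            adjacency[u].add(v)
            adjacency[v].add(u)

    best, visited = set(), set()
    for start in sorted(mapped):
        if start in visited:
            continue
        current = {start}
        queue = deque([start])
        while queue:  # breadth-first search over one connected module
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in current:
                    current.add(neighbor)
                    queue.append(neighbor)
        visited |= current
        if len(current) > len(best):
            best = current

    # Each edge inside the module is seen from both endpoints.
    n_edges = sum(len(adjacency[g] & best) for g in best) // 2
    return {"c_ppi": len(mapped), "N": len(best), "E": n_edges, "nodes": best}


# Toy network (hypothetical): gene Q is not in the PPI, Z is not in the component.
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("D", "E"), ("C", "Z")]
toy_component = ["A", "B", "C", "D", "E", "Q"]
summary = largest_connected_module(toy_component, toy_edges)
```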

The HMG20B gene interacts directly with BRCA2. Inhibition of the interaction between HMG20B and BRCA2 leads to tumor progression [211]. The TOP3A and BLM genes interact with the RMI1 gene, forming a complex that is very important for genome stability [212]. Mutations in this complex increase the risk of breast cancer in addition to other types of cancer [213]. The RAD51C gene was also found to be mutated in breast cancer [214]. The de-regulation of CDC6 poses a serious risk of carcinogenesis [215], while U2AF1 is a splicing factor protein that is mutated in cancer in general [216].

The degas module on IDC data contains 386 genes with 1,056 interactions and 190 DE genes. Based on the quality measure, the module has a p-value of 0, i.e., it cannot be randomly obtained. There are 105 genes present in all of degas, ica30, and mica42, including BRIP1, RAD51, BLM, UBE2C, and CKS2. However, degas did not contain other cancer-related genes, including BRCA1, BRCA2, XRCC1, XRCC2, and RRM2. Additionally, none of the genes that exist exclusively in mica42 exist in degas.

In addition to examining the different obtained modules, we performed classification analysis using the different modules and datasets to ensure that the adjusted gene expression data correlate better with the disease behavior. Similar to the ILC dataset, an SVM was trained on the active modules obtained from each tool separately. Then, a 10-fold cross validation was performed using the original data for ICA and DEGAS and the modified gene expression data for Mica. The three tools performed almost the same, with Mica having the lowest error of 0.0013. The errors for ICA and DEGAS were 0.0038 and 0.0063, respectively.
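The evaluation protocol described above can be sketched as a plain 10-fold cross-validation loop. To keep the sketch dependency-free, the SVM is replaced here by a nearest-centroid classifier, which is a stand-in and not the classifier used in this work; the function names and the toy expression matrix are likewise assumptions for illustration.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def nearest_centroid_fit(rows, labels):
    """Stand-in classifier: store the mean expression vector per class."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def nearest_centroid_predict(model, row):
    """Return the class whose centroid is closest in squared distance."""
    return min(
        model,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(row, model[lab])),
    )


def cross_val_error(rows, labels, k=10, seed=0):
    """Fraction of samples misclassified when each fold is held out once."""
    errors = 0
    for held_out in k_fold_indices(len(rows), k, seed):
        held = set(held_out)
        train = [i for i in range(len(rows)) if i not in held]
        model = nearest_centroid_fit(
            [rows[i] for i in train], [labels[i] for i in train]
        )
        errors += sum(
            nearest_centroid_predict(model, rows[i]) != labels[i]
            for i in held_out
        )
    return errors / len(rows)


# Toy data (hypothetical): two module genes, 20 samples per class.
toy_rows = [[0.01 * i, 0.0] for i in range(20)] + [
    [5.0 + 0.01 * i, 5.0] for i in range(20)
]
toy_labels = ["normal"] * 20 + ["tumor"] * 20
toy_error = cross_val_error(toy_rows, toy_labels)
```

In this setting the rows would hold the expression values of the genes in one active module, so that each tool's modules are scored by how well they separate the sample classes.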

To better evaluate the ICA, DEGAS, and Mica modules, we further performed pathway enrichment analysis, as shown in Table 6.4. Many pathways are common between mica42, ica30, and degas, such as Cell cycle, Telomere maintenance, and DNA strand elongation. However, mica42 alone was enriched with the p53 signaling pathway. Interestingly, many important pathways were enriched in mica15 but not in the modules of any other tool, including the complement and coagulation cascades, platelet degranulation, and Hemostasis pathways. All of these pathways are part of the hemostatic system of the cell. Hemostatic elements are considered important in facilitating the metastatic potential of breast cancer [217]. Additionally, a proteomic-based study has shown the complement and coagulation pathway to be DE in IDC [218]. Figure 6.5 shows the genes in the mica15 module. Among the nodes in this network that are also in the Hemostasis pathway is the APOA1 gene. APOA1 was found to be DE in IDC samples vs. control samples in a proteomic study [219]. In addition, mutations in this gene lead to poor outcome for post-surgery breast cancer patients [220]. Other interesting genes in mica15 are GADD45A, GADD45B, and GADD45G. GADD45 genes are stress sensor genes that are activated in response to cell stress and DNA damage. GADD45 genes were found to be down-regulated in cancer. Additionally, they are considered potential therapeutic targets in cancer [221].

Figure 6.5: mica15 module. The red nodes are the nodes in the Hemostasis pathway.

Table 6.4: Pathway enrichment analysis for ICA, DEGAS, and Mica on the IDC data.

Database | Pathway | MICA (%, pval, #) | ICA (%, pval, #) | DEGAS (%, pval)
KEGG | Complement and coagulation cascades | 42.86, 1.17 × 10^-23, 15 | - | -
KEGG | DNA replication | 6.32, 6.68 × 10^-17, 42 | 5.51, 1.13 × 10^-18, 30 | -
KEGG | Mismatch repair | 3.16, 5.53 × 10^-07, 42 | 3.30, 1.11 × 10^-10, 30 | -
KEGG | Homologous recombination | 2.59, 3.57 × 10^-04, 42 | 2.64, 6.97 × 10^-06, 30 | -
KEGG | Base excision repair | 2.59, 1.65 × 10^-03, 42 | 2.64, 5.46 × 10^-05, 30 | -
KEGG | p53 signaling pathway | 3.45, 7.86 × 10^-03, 42 | - | -
KEGG | Spliceosome | - | 6.60, 8.20 × 10^-04, 21 | -
KEGG | Oocyte meiosis | - | 4.63, 1.43 × 10^-03, 30 | -
Reactome | Platelet degranulation | 21.43, 6.66 × 10^-08, 15 | - | -
Reactome | Common Pathway | 11.90, 6.66 × 10^-08, 15 | - | -
Reactome | Chylomicron-mediated lipid transport | 9.52, 7.98 × 10^-06, 15 | - | -
Reactome | Platelet activation, signaling and aggregation | 23.81, 9.16 × 10^-06, 15 | - | -
Reactome | Intrinsic Pathway | 9.52, 2.81 × 10^-05, 15 | - | -
Reactome | Retinoid metabolism and transport | 11.90, 4.16 × 10^-05, 15 | - | -
Reactome | Hemostasis | 30.95, 6.80 × 10^-05, 15 | - | -
Reactome | Diseases associated with visual transduction | 11.90, 1.29 × 10^-04, 15 | - | -
Reactome | Platelet Aggregation (Plug Formation) | 9.52, 3.18 × 10^-04, 15 | - | -
Reactome | p130Cas linkage to MAPK signaling for integrins | 7.14, 3.97 × 10^-04, 15 | - | -
Reactome | GRB2:SOS provides linkage to MAPK signaling for Integrins | 7.14, 3.97 × 10^-04, 15 | - | -
Reactome | mRNA Splicing | 9.42, 1.52 × 10^-04, 33 | 6.60, 7.65 × 10^-05, 21 | -
Reactome | mRNA Processing | 10.14, 1.52 × 10^-04, 33 | 6.93, 2.45 × 10^-04, 21 | -
Reactome | Cell Cycle, Mitotic | 32.76, 3.86 × 10^-52, 42 | 31.28, 4.26 × 10^-64, 30 | 17.62, 4.74 × 10^-13
Reactome | DNA strand elongation | 7.18, 3.92 × 10^-25, 42 | 5.95, 6.16 × 10^-26, 30 | 2.59, 1.45 × 10^-04
Reactome | Resolution of Sister Chromatid Cohesion | 12.07, 1.57 × 10^-22, 42 | 11.45, 9.46 × 10^-28, 30 | 6.74, 2.29 × 10^-07
Reactome | Leading Strand Synthesis | 3.45, 6.05 × 10^-13, 42 | 2.64, 1.40 × 10^-11, 30 | -
Reactome | Polymerase switching | 3.45, 6.05 × 10^-13, 42 | 2.64, 1.40 × 10^-11, 30 | -
Reactome | DNA Repair | 8.62, 3.75 × 10^-12, 42 | 8.15, 2.16 × 10^-14, 30 | -
Reactome | DNA Replication Pre-Initiation | 6.90, 2.29 × 10^-11, 42 | 5.51, 8.67 × 10^-10, 30 | 4.4, 7.74 × 10^-05
Reactome | M/G1 Transition | 6.90, 2.29 × 10^-11, 42 | 5.51, 8.67 × 10^-10, 30 | 4.40, 7.74 × 10^-05
Reactome | Telomere C-strand synthesis initiation | 1.72, 2.31 × 10^-07, 42 | 1.32, 9.12 × 10^-07, 30 | -
Reactome | Telomere Maintenance | 5.17, 2.68 × 10^-07, 42 | 4.41, 4.87 × 10^-07, 30 | 3.63, 1.01 × 10^-03
Reactome | Fanconi Anemia pathway | 3.16, 8.48 × 10^-07, 42 | 2.86, 1.23 × 10^-07, 30 | -
Reactome | Removal of the Flap Intermediate | 2.30, 1.37 × 10^-06, 42 | 2.20, 2.13 × 10^-08, 30 | -
Reactome | Global Genomic NER (GG-NER) | 3.45, 1.75 × 10^-06, 42 | 3.08, 4.60 × 10^-07, 30 | -
Reactome | Phosphorylation of Emi1 | 1.44, 2.04 × 10^-05, 42 | 1.10, 5.57 × 10^-05, 30 | -
Reactome | Nucleotide Excision Repair | 3.74, 2.40 × 10^-05, 42 | 3.30, 1.33 × 10^-05, 30 | -
Reactome | Transcription-coupled NER (TC-NER) | 3.45, 3.60 × 10^-05, 42 | 3.08, 1.46 × 10^-05, 30 | -
Reactome | Post-transcriptional Silencing By Small RNAs | - | 1.79, 1.49 × 10^-06, 18 | -
Reactome | Pre-NOTCH Transcription and Translation | - | 2.05, 1.77 × 10^-05, 18 | -
Reactome | Cohesin Loading onto Chromatin | - | 1.53, 1.41 × 10^-03, 18 | -
Reactome | Small Interfering RNA (siRNA) Biogenesis | - | 1.28, 8.16 × 10^-03, 18 | -
Reactome | Mitotic Telophase/Cytokinesis | - | 1.53, 8.16 × 10^-03, 18 | -
Reactome | p53-Independent G1/S DNA damage checkpoint | - | - | 2.59, 8.80 × 10^-03

The DO enrichment analysis using FunDO is shown in Table 6.5. In general, the Mica and ICA modules are more significantly enriched with cancer and breast cancer genes than DEGAS, with Mica better enriched with breast cancer and cancer than ICA. Additionally, mica15 is enriched with metastatic to brain disease genes, with APOA1 among those genes.

Table 6.5: DO enrichment analysis for ICA, DEGAS, and Mica.

name | DO | Corrected p-value
mica7 | cancer | 5.38 × 10^-7
mica15 | liver cancer, systematic infection, metastatic to brain | 4.67 × 10^-9, 1.16 × 10^-8, 6.66 × 10^-8
mica33 | cancer | 5.2 × 10^-5
mica42 | cancer, breast cancer | 6.21 × 10^-35, 5.72 × 10^-7
mica63 | cancer | 2.30 × 10^-4
ica18 | breast cancer, cancer | 4.59 × 10^-6, 6.21 × 10^-35
ica21 | cancer | 1.36 × 10^-5
ica30 | cancer, breast cancer | 2.78 × 10^-33, 1.96 × 10^-6
degas | cancer, breast cancer | 1.78 × 10^-14, 3.14 × 10^-4

6.4 Conclusion

The unprecedented amount of publicly available disease-related data encourages the development of new methodologies and algorithms for better analysis and further understanding of disease behavior. In this work, we proposed a new workflow, Mica, that successfully integrates miRNA data, mRNA data, and the PPI network in a novel way to obtain active modules which can serve as powerful biomarkers.

The experimental results show that the modules found by Mica are more disease-related while unraveling new dependencies between the genes which were hidden by previous techniques. Despite the simplicity of the proposed workflow, Mica successfully includes many novel ideas, including how we adjust the gene expression levels with the miRNA expression to mimic the protein expression level and how we work on the genes first to get the related ones and map them to the PPI network, rather than working only on the genes existing in the PPI network. To the best of our knowledge, this is the first study that integrates miRNA, mRNA, and PPI network information for active module extraction. Furthermore, Mica provides information regarding which modules are active in which set of samples, hence making it easier to understand the disease behavior for different patients.

The results obtained from the IDC and ILC datasets show the ability of Mica to generate disease-specific modules. Still, there are some pathways common between IDC and ILC, such as the cell cycle pathway, with BRCA1 and BRCA2 retrieved with Mica in both datasets.

Further improvements to Mica would add more value to the results and make them easier to understand. For instance, it would be beneficial to extract a smaller module of 10 or 20 genes from each module that can be further used as a module biomarker. Additionally, each module can be broken into smaller ones, each of which can be considered a possible pathway. Hence, we can further understand how the different pathways interact together. Pathway extraction can also benefit from adding directionality information to the PPI network. We are planning to tackle all such improvements in our future work.

Chapter 7: Conclusions and Future Directions

In this dissertation, we have addressed the active module discovery problem. We re-investigated the problem from the perspective of high throughput data and studied their effect on the quality of the discovered active modules. Furthermore, instead of integrating only gene expression data and the PPI network, we took the active module discovery problem one step further and integrated microRNA data to provide a more accurate picture of the underlying diseased cell. The contributions of our work can be summarized as follows.

7.1 Summaries and our findings

High throughput sequencing and the quality of the output. In Chapter 3, we evaluated the quality of short sequence mapping tools and observed the effect of the different inputs, settings, and algorithmic techniques on the quality and accuracy of the mapping. Moreover, we provided a set of benchmarking tests that extensively analyze the performance of the different tools. Each of the benchmarking tests stresses a different aspect. Even though many tools are available for short sequence mapping, the mapping problem is still open and further improvements are needed to improve the mapping quality. A possible improvement is to make the tools more application-specific. By taking the target application into consideration, more accurate results can be obtained. For instance, for mRNA-Seq data, which is our focus here, we can make use of the property that it should only contain reads from exon regions rather than from intron regions. Therefore, for well-studied genomes, if a small number of reads were mapped to different intron regions, we can expect them to be wrongly mapped and look for other mapping locations with more mismatches or lower mapping quality.

mRNA-Seq data effectively generate better active modules than Microarrays. In Chapter 4, we compared the performance of active module discovery tools when using Microarray data against using mRNA-Seq data. The active module discovery tools were evaluated with data sets for Colorectal Cancer and Oligodendroglioma with both RNA-Seq and Microarray gene expressions. The results showed that RNA-Seq can be more useful than microarrays in detecting relevant and overlooked active modules. In addition, RNA-Seq based modules in our experiments contained more biologically significant genes. On the other hand, the active modules obtained with the mRNA-Seq data were much larger than their counterparts obtained from the Microarray data. Therefore, they become less effective as biomarkers. As a result, new algorithms are needed to return smaller, more focused active modules.

Network smoothing of p-values returns important genes. In Chapter 5, we proposed a workflow, PRASE, that effectively made use of the mRNA-Seq properties and adjusted the p-values for the genes to better highlight the most important genes. Such important genes might not be differentially expressed; however, their importance came from their location in the PPI network. Our evaluation showed that PRASE can effectively improve the quality of the output modules. For instance, a 70% reduction in jActiveModules module size, together with an increase in the percentage of DE genes for the Colorectal Cancer data set, clearly indicated that the workflow was promising. Nevertheless, a further evaluation may still be required to quantitatively measure the effectiveness of PRASE. Another potential measure could be the betweenness centrality of the genes returned in the modules.

microRNA integration with mRNA boosts the quality of the discovered active modules. In Chapter 6, we discussed the importance of integrating other types of data, such as the microRNA data, with the mRNA data for active module discovery. Each type of data captures part of the picture of how the cell works. Therefore, the successful integration of the different data types would lead to more disease-specific results. To effectively integrate the microRNA-Seq and mRNA-Seq data, we introduced our workflow, Mica, that integrated miRNA, mRNA, and the PPI network in a novel way to obtain active modules which could serve as powerful biomarkers. The experimental results showed that the modules found by Mica are more disease-related while unraveling new dependencies between the genes which were hidden by previous techniques. Despite the simplicity of the proposed workflow, Mica successfully includes many novel ideas, including modifying the gene expression levels with the miRNA expression to unravel new dependencies and working on the genes first to get the related ones and map them to the PPI network, rather than working only on the genes existing in the PPI network. To the best of our knowledge, this is the first study that integrates miRNA, mRNA, and PPI network information for active module extraction. Furthermore, Mica provided information regarding which modules are active in which set of samples, hence making it easier to understand the disease behavior for different patients.

7.2 Future Work

In this dissertation, we introduced new techniques to solve the active module discovery problem. Our proposed techniques are based on the efficient integration and utilization of the different data types to effectively obtain more disease-specific active modules. Nevertheless, the problem is still open and further improvements are needed to shed more light on how complex diseases work. Here, we discuss possible improvements that can be addressed to return more disease-specific active modules.

The smaller the active module, the easier it is to analyze. Our workflows can effectively return more disease-related active modules. However, the modules returned by Mica are considered large, which makes them harder to analyze. Therefore, we can further improve the modules by breaking each one of them into a smaller center module, which can act as a biomarker, and other small modules interacting with it. To achieve such an improvement, we can use a summation-based criterion, such as returning the module with the highest score, similar to jActiveModules. Unlike jActiveModules, the size of Mica modules allows for using an exact algorithm to return the maximal scoring submodule. The score in this phase can be calculated from the weights of the genes in the ICA output. Additionally, such an approach can help in extracting possible pathways as well.
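To make the summation-based criterion concrete, the sketch below searches for the connected submodule with the highest total gene weight by exhaustive enumeration. It is only a toy stand-in for an exact algorithm: enumeration is exponential in the number of genes, so a real module would need something like the integer-programming style of exact approach in [29]. The function names, gene labels, and weights are hypothetical.

```python
from itertools import combinations


def is_connected(nodes, adjacency):
    """Check that the given nodes induce a connected subgraph."""
    nodes = set(nodes)
    start = next(iter(nodes))
    stack, seen = [start], {start}
    while stack:
        current = stack.pop()
        for neighbor in adjacency.get(current, ()):
            if neighbor in nodes and neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen == nodes


def max_scoring_submodule(weights, edges):
    """Exhaustively find the connected gene subset with the largest summed
    weight. Feasible only for very small modules (2^n subsets)."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    genes = sorted(weights)
    best_nodes, best_score = None, float("-inf")
    for size in range(1, len(genes) + 1):
        for subset in combinations(genes, size):
            score = sum(weights[g] for g in subset)
            # Only pay for the connectivity check if the score improves.
            if score > best_score and is_connected(subset, adjacency):
                best_nodes, best_score = set(subset), score
    return best_nodes, best_score


# Toy module (hypothetical): a path A-B-C-D-E with signed gene weights.
toy_weights = {"A": 2.0, "B": -0.5, "C": 1.5, "D": -3.0, "E": 1.0}
toy_path = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
core_nodes, core_score = max_scoring_submodule(toy_weights, toy_path)
```

In the toy example the weakly negative gene B is kept because it connects two positively weighted genes, while the strongly negative gene D cuts E off from the core.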

Mutation data gives another perspective on the disease functionality. With the availability of high throughput technology, more and more data are produced, giving new perspectives on and understanding of the disease mechanism. Mutated gene information is one type of data that has greatly benefited from the improvement in high throughput technology. Many works have focused on integrating mutation data with the PPI network or the gene expression to extract the sets of mutated genes that cooperate in causing the disease [57, 154]. Other works also focused on understanding which mutated genes lead to the differential expression of other important genes [176, 152]. However, to the best of our knowledge, mutation data has never been used before to study the active module discovery problem or to extract biomarkers. Therefore, a promising direction is to extract the genes that facilitate the communication between the mutated genes, and hence the active modules connecting the mutated genes.

PPI network as a hypergraph instead of a regular graph. An interaction between proteins in the PPI network is usually represented as an edge between two proteins. However, the actual interactions between the proteins are far more complex. For instance, three or more proteins may directly interact together in a complex to perform a certain function. Therefore, a more realistic representation for the PPI network is a hypergraph rather than a simple graph. We believe that if the PPI network is represented as a hypergraph, interesting active modules and relations between the genes can be further discovered. To generate such a representation, verified protein complex information existing in the available databases can be used and combined together in one hypergraph. As a first step, the generated PPI hypergraph can be used to perform simple types of analysis, e.g., centrality measures, to see how the actual network structure affects the importance of the genes.
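A minimal sketch of such a representation is given below, assuming each interaction or complex is stored as a set of proteins (a hyperedge). The class and method names are hypothetical. The TOP3A, BLM, and RMI1 complex and the HMG20B and BRCA2 interaction are taken from Section 6.3.2; the BRCA2 and RAD51 entry is a toy addition included only for illustration.

```python
from collections import defaultdict
from itertools import combinations


class PPIHypergraph:
    """PPI network in which one interaction may join two or more proteins."""

    def __init__(self):
        self.hyperedges = []                 # list of frozensets of proteins
        self.membership = defaultdict(set)   # protein -> hyperedge indices

    def add_interaction(self, proteins):
        """Add a pairwise interaction or a multi-protein complex."""
        members = frozenset(proteins)
        if len(members) < 2:
            raise ValueError("an interaction needs at least two proteins")
        index = len(self.hyperedges)
        self.hyperedges.append(members)
        for protein in members:
            self.membership[protein].add(index)
        return index

    def hyperdegree(self, protein):
        """Number of interactions or complexes the protein takes part in."""
        return len(self.membership.get(protein, ()))

    def pairwise_projection(self):
        """Flatten to an ordinary graph: every pair inside a hyperedge
        becomes an edge, which loses the complex membership."""
        edges = set()
        for members in self.hyperedges:
            for u, v in combinations(sorted(members), 2):
                edges.add((u, v))
        return edges


hypergraph = PPIHypergraph()
hypergraph.add_interaction(["TOP3A", "BLM", "RMI1"])  # three-protein complex
hypergraph.add_interaction(["HMG20B", "BRCA2"])       # pairwise interaction
hypergraph.add_interaction(["BRCA2", "RAD51"])        # toy entry (illustrative)
projected = hypergraph.pairwise_projection()
```

The hyperdegree is the simplest centrality measure on this structure; comparing it with the ordinary degree in the pairwise projection shows how flattening a complex into pairs inflates the apparent connectivity of its members.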

Bibliography

[1] D.-Y. Cho, Y.-A. Kim, and T. M. Przytycka, “Network biology approach to complex diseases,” PLoS computational biology, vol. 8, no. 12, p. e1002820, 2012.

[2] X. Chang, T. Xu, Y. Li, and K. Wang, “Dynamic modular architecture of protein-protein interaction networks beyond the dichotomy of ’date’ and ’party’ hubs,” Scientific reports, vol. 3, 2013.

[3] A.-L. Barabási, N. Gulbahce, and J. Loscalzo, “Network medicine: a network-based approach to human disease,” Nature Reviews Genetics, vol. 12, no. 1, pp. 56–68, 2011.

[4] S. Asur, D. Ucar, and S. Parthasarathy, “An ensemble framework for clustering protein–protein interaction networks,” Bioinformatics, vol. 23, no. 13, pp. i29–i40, 2007.

[5] K. Macropol, T. Can, and A. K. Singh, “RRW: repeated random walks on genome-scale protein networks for local cluster discovery,” BMC bioinformatics, vol. 10, no. 1, p. 283, 2009.

[6] M. Li, J. Wang, and J. Chen, “A graph-theoretic method for mining overlapping functional modules in protein interaction networks,” in Bioinformatics Research and Applications. Springer, 2008, pp. 208–219.

[7] Y. Zhang, E. Zeng, T. Li, and G. Narasimhan, “Weighted consensus clustering for identifying functional modules in protein-protein interaction networks,” in Machine Learning and Applications, 2009. ICMLA’09. International Conference on. IEEE, 2009, pp. 539–544.

[8] K. Steinhaeuser and N. V. Chawla, “Identifying and evaluating community structure in complex networks,” Pattern Recognition Letters, vol. 31, no. 5, pp. 413–421, 2010.

[9] Y.-K. Shih and S. Parthasarathy, “Identifying functional modules in interaction networks through overlapping markov clustering,” Bioinformatics, vol. 28, no. 18, pp. i473–i479, 2012.

[10] Y. Zhang and T. Li, “Extending consensus clustering to explore multiple clustering views.” in SDM. SIAM, 2011, pp. 920–931.

[11] K. Inoue, W. Li, and H. Kurata, “Diffusion model based spectral clustering for protein-protein interaction networks,” PLoS one, vol. 5, no. 9, p. e12623, 2010.

[12] D. Ucar, F. Altiparmak, H. Ferhatosmanoglu, and S. Parthasarathy, “Mutual information based extrinsic similarity for microarray analysis,” in Bioinformatics and Computational Biology. Springer, 2009, pp. 424–436.

[13] J. Ruan, A. K. Dean, and W. Zhang, “A general co-expression network-based approach to gene expression analysis: comparison and applications,” BMC systems biology, vol. 4, no. 1, p. 8, 2010.

[14] J. Ihmels, S. Bergmann, and N. Barkai, “Defining transcription modules using large-scale gene expression data,” Bioinformatics, vol. 20, no. 13, pp. 1993–2003, 2004.

[15] M. S. Cline, M. Smoot, E. Cerami, A. Kuchinsky, N. Landys, C. Workman, R. Christmas, I. Avila-Campilo, M. Creech, B. Gross et al., “Integration of biological networks and gene expression data using cytoscape,” Nature protocols, vol. 2, no. 10, pp. 2366–2382, 2007.

[16] M. W. Covert, E. M. Knight, J. L. Reed, M. J. Herrgard, and B. O. Palsson, “Integrating high-throughput and computational data elucidates bacterial networks,” Nature, vol. 429, no. 6987, pp. 92–96, 2004.

[17] A. R. Joyce and B. Ø. Palsson, “The model organism as a system: integrating ’omics’ data sets,” Nature Reviews Molecular Cell Biology, vol. 7, no. 3, pp. 198–210, 2006.

[18] D. W. Huang, B. T. Sherman, and R. A. Lempicki, “Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists,” Nucleic acids research, vol. 37, no. 1, pp. 1–13, 2009.

[19] T. Aittokallio and B. Schwikowski, “Graph-based methods for analysing networks in cell biology,” Briefings in bioinformatics, vol. 7, no. 3, pp. 243–255, 2006.

[20] A. Tanay, R. Sharan, M. Kupiec, and R. Shamir, “Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data,” Proceedings of the National Academy of Sciences of the United States of America, vol. 101, no. 9, pp. 2981–2986, 2004.

[21] J. Reimand, L. Tooming, H. Peterson, P. Adler, and J. Vilo, “Graphweb: mining heterogeneous biological networks for gene modules with functional significance,” Nucleic acids research, vol. 36, no. suppl 2, pp. W452–W459, 2008.

[22] W.-P. Lee and W.-S. Tzou, “Computational methods for discovering gene networks from expression data,” Briefings in bioinformatics, vol. 10, no. 4, pp. 408–423, 2009.

[23] H. Ge, A. J. Walhout, and M. Vidal, “Integrating ’omic’ information: a bridge between genomics and systems biology,” TRENDS in Genetics, vol. 19, no. 10, pp. 551–560, 2003.

[24] K. Mitra, A.-R. Carvunis, S. K. Ramesh, and T. Ideker, “Integrative approaches for finding modular structure in biological networks,” Nature Reviews Genetics, vol. 14, no. 10, pp. 719–732, 2013.

[25] C. L. Myers, D. Robson, A. Wible, M. A. Hibbs, C. Chiriac, C. L. Theesfeld, K. Dolinski, and O. G. Troyanskaya, “Discovery of biological networks from diverse functional genomic data,” Genome biology, vol. 6, no. 13, p. R114, 2005.

[26] R. S. Savage, Z. Ghahramani, J. E. Griffin, J. Bernard, and D. L. Wild, “Discovering transcriptional modules by bayesian data integration,” Bioinformatics, vol. 26, no. 12, pp. i158–i167, 2010.

[27] N. Huang, P. K. Shah, and C. Li, “Lessons from a decade of integrating cancer copy number alterations with gene expression profiles,” Briefings in bioinformatics, vol. 13, no. 3, pp. 305–316, 2012.

[28] T. Ideker, O. Ozier, B. Schwikowski, and A. F. Siegel, “Discovering regulatory and signalling circuits in molecular interaction networks,” Bioinformatics, vol. 18, no. Suppl 1, pp. S233–S240, 2002.

[29] M. T. Dittrich, G. W. Klau, A. Rosenwald, T. Dandekar, and T. Müller, “Identifying functional modules in protein-protein interaction networks: an integrated exact approach,” Bioinformatics, vol. 24, no. 13, pp. i223–i231, 2008.

[30] I. Ulitsky, A. Krishnamurthy, R. M. Karp, and R. Shamir, “DEGAS: De novo discovery of dysregulated pathways in human diseases,” PLoS one, vol. 5, no. 10, p. e12267, 2010.

[31] M. Li, X. Wu, J. Wang, and Y. Pan, “Towards the identification of protein complexes and functional modules by integrating ppi network and gene expression data,” BMC bioinformatics, vol. 13, no. 1, p. 109, 2012.

[32] O. Vanunu, O. Magger, E. Ruppin, T. Shlomi, and R. Sharan, “Associating genes and protein complexes with disease via network propagation,” PLoS computational biology, vol. 6, no. 1, p. e1000641, 2010.

[33] Z. Guo, Y. Li, X. Gong, C. Yao, W. Ma, D. Wang, Y. Li, J. Zhu, M. Zhang, D. Yang et al., “Edge-based scoring and searching method for identifying condition-responsive protein–protein interaction sub-network,” Bioinformatics, vol. 23, no. 16, pp. 2121–2128, 2007.

[34] X.-M. Zhao, R.-S. Wang, L. Chen, and K. Aihara, “Uncovering signal transduction networks from high-throughput data by integer linear programming,” Nucleic acids research, vol. 36, no. 9, pp. e48–e48, 2008.

[35] Z. Wu, X. Zhao, and L. Chen, “Identifying responsive functional modules from protein-protein interaction network,” Molecules and cells, vol. 27, no. 3, pp. 271–277, 2009.

[36] S. R. Hegde, P. Manimaran, and S. C. Mande, “Dynamic changes in protein functional linkage networks revealed by integration with gene expression data,” PLoS computational biology, vol. 4, no. 11, p. e1000237, 2008.

[37] Z.-P. Liu, Y. Wang, X.-S. Zhang, and L. Chen, “Identifying dysfunctional crosstalk of pathways in various regions of Alzheimer’s disease brains,” BMC systems biology, vol. 4, no. Suppl 2, p. S11, 2010.

[38] L. Page, S. Brin, R. Motwani, and T. Winograd, “The PageRank citation ranking: Bringing order to the web,” Stanford InfoLab, Tech. Rep. 1999-66, 1999.

[39] A. Hyvärinen, “Fast and robust fixed-point algorithms for independent component analysis,” Neural Networks, IEEE Transactions on, vol. 10, no. 3, pp. 626–634, 1999.

[40] J. H. Malone and B. Oliver, “Microarrays, deep sequencing and the true measure of the transcriptome,” BMC Biology, vol. 9, p. 34, 2011.

[41] F. Stahl, B. Hitzmann, K. Mutz, D. Landgrebe, M. Lübbecke, et al., “Transcriptome analysis,” in Genomics and Systems Biology of Mammalian Cell Culture, ser. Advances in Biochemical Engineering/Biotechnology, W. S. Hu and A.-P. Zeng, Eds. Springer Berlin / Heidelberg, 2012, vol. 127, pp. 1–25.

[42] Z. Wang, M. Gerstein, and M. Snyder, “RNA-Seq: a revolutionary tool for transcriptomics,” Nature Reviews Genetics, vol. 10, pp. 57–63, 2009.

[43] M. R. Fabian, N. Sonenberg, and W. Filipowicz, “Regulation of mRNA translation and stability by microRNAs,” Annual review of biochemistry, vol. 79, pp. 351–379, 2010.

[44] M. V. Iorio and C. M. Croce, “microRNA involvement in human cancer,” Carcinogenesis, vol. 33, no. 6, pp. 1126–1133, 2012.

[45] C. Blenkiron, L. D. Goldstein, N. P. Thorne, I. Spiteri, S.-F. Chin, M. J. Dunning, N. L. Barbosa-Morais, A. E. Teschendorff, A. R. Green, I. O. Ellis et al., “MicroRNA expression profiling of human breast cancer identifies new markers of tumor subtype,” Genome Biol, vol. 8, no. 10, p. R214, 2007.

[46] L. X. Garmire and S. Subramaniam, “Evaluation of normalization methods in mammalian microRNA-Seq data,” RNA, vol. 18, no. 6, pp. 1279–1288, 2012.

[47] S. Bandyopadhyay, M. Mehta, D. Kuo, M.-K. Sung, R. Chuang, E. J. Jaehnig, B. Bodenmiller, K. Licon, W. Copeland, M. Shales et al., “Rewiring of genetic networks in response to DNA damage,” Science, vol. 330, no. 6009, pp. 1385–1389, 2010.

[48] T. Ideker and N. J. Krogan, “Differential network biology,” Molecular systems biology, vol. 8, no. 1, 2012.

[49] S. Fields, “High-throughput two-hybrid analysis,” FEBS journal, vol. 272, no. 21, pp. 5391–5399, 2005.

[50] S. A. Chowdhury, R. K. Nibbe, M. R. Chance, and M. Koyutürk, “Subnetwork state functions define dysregulated subnetworks in cancer,” J. Comput. Bio., vol. 18, no. 3, pp. 263–281, 2011.

[51] S. Nacu, R. Critchley-Thorne, P. Lee, and S. Holmes, “Gene expression network analysis and applications to immunology,” Bioinf., vol. 23, no. 7, pp. 850–858, 2007.

[52] H.-Y. Chuang, E. Lee, Y.-T. Liu, D. Lee, and T. Ideker, “Network-based classification of breast cancer metastasis,” Mol. Syst. Biol., vol. 3, p. 140, 2007.

[53] C. Backes, A. Rurainski, G. W. Klau, O. Müller, D. Stöckel, A. Gerasch, J. Küntzer, D. Maisel, N. Ludwig, M. Hein et al., “An integer linear programming approach for finding deregulated subgraphs in regulatory networks,” Nucleic acids research, vol. 40, no. 6, pp. e43–e43, 2012.

[54] I. Ulitsky and R. Shamir, “Identification of functional modules using network biology and high-throughput data,” BMC Systems Biology, vol. 1, p. 8, 2007.

[55] K. Komurov, M. A. White, and P. T. Ram, “Use of data-biased random walks on graphs for the retrieval of context-specific networks from genomic data,” PLoS computational biology, vol. 6, no. 8, p. e1000889, 2010.

[56] K. Komurov, S. Dursun, S. Erdin, and P. T. Ram, “NetWalker: a contextual network analysis tool for functional genomics,” BMC genomics, vol. 13, no. 1, p. 282, 2012.

[57] F. Vandin, E. Upfal, and B. J. Raphael, “Algorithms for detecting significantly mutated pathways in cancer,” Journal of Computational Biology, vol. 18, no. 3, pp. 507–522, 2011.

[58] G. T. Huang, C. Athanassiou, and P. V. Benos, “mirConnX: condition-specific mRNA-microRNA network integrator,” Nucleic acids research, vol. 39, no. suppl 2, pp. W416–W423, 2011.

[59] V. A. Gennarino, G. D’Angelo, G. Dharmalingam, S. Fernandez, G. Russolillo, R. Sanges, M. Mutarelli, V. Belcastro, A. Ballabio, P. Verde et al., “Identification of microRNA-regulated gene networks by expression analysis of target genes,” Genome research, vol. 22, no. 6, pp. 1163–1172, 2012.

[60] V. Jayaswal, M. Lutherborrow, and Y. H. Yang, “Measures of association for identifying microRNA-mRNA pairs of biological interest,” PloS one, vol. 7, no. 1, p. e29612, 2012.

[61] M. H. Schulz, K. V. Pandit, C. L. L. Cardenas, N. Ambalavanan, N. Kaminski, and Z. Bar-Joseph, “Reconstructing dynamic microRNA-regulated interaction networks,” Proceedings of the National Academy of Sciences, vol. 110, no. 39, pp. 15686–15691, 2013.

[62] D. Baek, J. Villén, C. Shin, F. D. Camargo, S. P. Gygi, and D. P. Bartel, “The impact of microRNAs on protein output,” Nature, vol. 455, no. 7209, pp. 64–71, 2008.

[63] E. Huntzinger and E. Izaurralde, “Gene silencing by microRNAs: contributions of translational repression and mRNA decay,” Nature Reviews Genetics, vol. 12, no. 2, pp. 99–110, 2011.

[64] Y. Cun and H. Fröhlich, “Network and data integration for biomarker signature discovery via network smoothed t-statistics,” PloS one, vol. 8, no. 9, p. e73074, 2013.

[65] National human genome institute. [Online]. Available: http://www.genome.gov

[66] P. Flicek and E. Birney, “Sense from sequence reads: methods for alignment and assembly,” Nat Methods, vol. 6, no. 11s, pp. S6–S12, 2009.

[67] S. Cokus, S. Feng, X. Zhang, et al., “Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning,” Nat, vol. 452, no. 7184, pp. 215–219, 2008.

[68] M. Sultan, M. Schulz, H. Richard, et al., “A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome,” Science, vol. 321, no. 5891, pp. 956–960, 2008.

[69] C. Van Tassel, T. Smith, L. Matukumalli, et al., “SNP discovery and allele frequency estimation by deep sequencing of reduced representation libraries,” Nat Methods, vol. 5, no. 3, pp. 247–252, 2008.

[70] C. Alkan, J. Kidd, T. Marques-Bonet, et al., “Personalized copy number and segmental duplication maps using next-generation sequencing,” Nat Genet, vol. 41, no. 10, pp. 1061–1067, 2009.

[71] J. Qin, R. Li, J. Raes, et al., “A human gut microbial gene catalogue established by metagenomic sequencing,” Nat, vol. 464, no. 7285, pp. 59–65, 2010.

[72] H. Li, J. Ruan, and R. Durbin, “Mapping short DNA sequencing reads and calling variants using mapping quality scores,” Gen Res, vol. 18, no. 11, pp. 1851–1858, 2008.

[73] A. D. Smith, Z. Xuan, and M. Zhang, “Using quality scores and longer reads improves accuracy of Solexa read mapping,” BMC Bioinf, vol. 9, no. 1, pp. 128+, 2008.

[74] T. Wu and S. Nacu, “Fast and SNP-tolerant detection of complex variants and splicing in short reads,” Bioinf, vol. 26, no. 7, pp. 873–881, 2010.

[75] B. Langmead, C. Trapnell, M. Pop, and S. Salzberg, “Ultrafast and memory-efficient alignment of short DNA sequences to the human genome,” Gen Biol, vol. 10, no. 3, pp. R25+, 2009.

[76] B. Langmead and S. L. Salzberg, “Fast gapped-read alignment with Bowtie 2,” Nat Meth, vol. 9, pp. 357–359, 2012.

[77] H. Li and R. Durbin, “Fast and accurate short read alignment with Burrows-Wheeler transform,” Bioinf, vol. 25, no. 14, pp. 1754–1760, 2009.

[78] R. Li, C. Yu, Y. Li, et al., “SOAP2: an improved ultrafast tool for short read alignment,” Bioinf, vol. 25, no. 15, pp. 1966–1967, 2009.

[79] Mosaik. [Online]. Available: http://bioinformatics.bc.edu/marthlab/Mosaik

[80] S. Misra, R. Narayanan, S. Lin, et al., “FANGS: high speed sequence mapping for next generation sequencers,” in ACM Symposium on Applied Computing: 2010; Sierre, Switzerland, 2010, pp. 1539–1546.

[81] S. Rumble, P. Lacroute, A. Dalca, et al., “SHRiMP: Accurate mapping of short color-space reads,” PLoS Comput Biol, vol. 5, no. 5, pp. e1000386+, 2009.

[82] N. Homer, B. Merriman, and S. Nelson, “BFAST: an alignment tool for large scale genome resequencing,” PLoS ONE, vol. 4, no. 11, p. e7767, 2009.

[83] Mapreads. [Online]. Available: http://solidsoftwaretools.com/gf/project/mapreads/

[84] B. Ondov, A. Varadarajan, K. Passalacqua, et al., “Efficient mapping of Applied Biosystems SOLiD sequence data to a reference genome for functional genomic applications,” Bioinf, vol. 24, no. 23, pp. 2776–2777, 2008.

[85] D. Campagna, A. Albiero, A. Bilardi, et al., “PASS: a program to align short sequences,” Bioinf, vol. 25, no. 7, pp. 967–968, 2009.

[86] F. Hach, F. Hormozdiari, C. Alkan, F. Hormozdiari, I. Birol, E. E. Eichler, and S. C. Sahinalp, “mrsFast: a cache-oblivious algorithm for short-read mapping,” Nat Method, vol. 7, no. 8, pp. 576–577, 2010.

[87] H. Lin, Z. Zhang, M. Zhang, et al., “ZOOM! Zillions of oligos mapped,” Bioinf, vol. 24, no. 21, pp. 2431–2437, 2008.

[88] N. Malhis, Y. Butterfield, M. Ester, et al., “Slider - Maximum use of probability information for alignment of short sequence reads and SNP detection,” Bioinf, vol. 25, no. 1, pp. 6–13, 2008.

[89] N. Malhis and S. Jones, “High quality SNP calling using Illumina data at shallow coverage,” Bioinf, vol. 26, no. 8, pp. 1029–1035, 2010.

[90] D. Weese, A.-K. Emde, T. Rausch, A. Döring, and K. Reinert, “RazerS-fast read mapping with sensitivity control,” Genome Res., vol. 19, pp. 1646–1654, 2009.

[91] D. Weese, M. Holtgrewe, and K. Reinert, “RazerS 3: Faster, fully sensitive read mapping,” Bioinf., vol. 28, no. 20, pp. 2592–2599, 2012.

[92] Novoalign. [Online]. Available: http://www.novocraft.com

[93] J. Blom, T. Jakobi, D. Doppmeier, S. Jaenicke, J. Kalinowski, J. Stoye, and A. Goesmann, “Exact and complete short-read alignment to microbial genomes using graphics processing unit programming,” Bioinf, vol. 27, no. 10, pp. 1351–1358, 2011.

[94] C.-M. Liu, T. Wong, E. Wu, R. Luo, S.-M. Yiu, Y. Li, B. Wang, C. Yu, X. Chu, K. Zhao, R. Li, and T.-W. Lam, “SOAP3: ultra-fast GPU-based parallel alignment tool for short reads,” Bioinf, vol. 28, no. 6, pp. 878–879, 2012.

[95] H. Li and N. Homer, “A survey of sequence alignment algorithms for next-generation sequencing,” Brief in Bioinf, vol. 11, no. 5, pp. 473–483, 2010.

[96] M. Holtgrewe, A. Emde, D. Weese, et al., “A novel and well-defined benchmarking method for second generation read mapping,” BMC Bioinf, vol. 12, no. 1, pp. 210+, 2011.

[97] M. Ruffalo, T. LaFramboise, and M. Koyutürk, “Comparative analysis of algorithms for next-generation sequencing read alignment,” Bioinf, vol. 27, no. 20, pp. 2790–2796, 2011.

[98] S. Schbath, V. Martin, M. Zytnicki, et al., “Mapping reads on a genomic sequence: an algorithmic overview and a practical comparative analysis,” Jr Comp Biol, vol. 19, no. 6, pp. 796–813, 2012.

[99] N. A. Fonseca, J. Rung, A. Brazma, and J. C. Marioni, “Tools for mapping high-throughput sequencing data,” Bioinf, 2012.

[100] B. Ewing, L. Hillier, M. Wendl, and P. Green, “Base-calling of automated sequencer traces using phred. I. Accuracy assessment.” Genome Res, vol. 8, no. 3, pp. 175–185, 1998.

[101] B. Ewing and P. Green, “Base-calling of automated sequencer traces using phred. II. Error probabilities.” Genome Res, vol. 8, no. 3, pp. 186–194, 1998.

[102] M. Deutsch and M. Long, “Intron-exon structures of eukaryotic model organisms,” Nucl. Acids Res., vol. 27, no. 15, pp. 3219–3228, 1999.

[103] M. Burrows and D. Wheeler, “A block-sorting lossless data compression algorithm,” Digital Equipment Corporation, Palo Alto (CA), Tech. Rep. 124, 1994.

[104] P. Ferragina and G. Manzini, “Opportunistic data structures with applications,” in 41st Annual Symposium on Foundations of Computer Science, 2000.

[105] ssahaSNP. [Online]. Available: http://www.sanger.ac.uk/resources/software/ssahasnp/

[106] J. Zhang, D. Wheeler, I. Yakub, et al., “SNPdetector: A software tool for sensitive and accurate SNP detection,” PLoS Comput Biol, vol. 1, no. 5, p. e53, 2005.

[107] R. Li, Y. Li, X. Fang, et al., “SNP detection for massively parallel whole-genome resequencing,” Genome Res, vol. 18, no. 6, pp. 1124–1132, 2009.

[108] wgsim. [Online]. Available: https://github.com/lh3/wgsim

[109] dwgsim. [Online]. Available: http://sourceforge.net/projects/dnaa

[110] M. Holtgrewe, “Mason - A read simulator for second generation sequencing data,” Institut für Mathematik und Informatik, Freie Universität Berlin, Berlin, Germany, Tech. Rep. TB-B-10-06, 2010.

[111] W. Huang, L. Li, J. Myers, and G. Marth, “ART: a next-generation sequencing read simulator,” Bioinf, vol. 28, no. 4, pp. 593–594, 2012.

[112] J. C. Dohm, C. Lottaz, T. Borodina, and H. Himmelbauer, “Substantial biases in ultra-short read data sets from high-throughput DNA sequencing,” Nucl Acid Res., vol. 36, no. 16, p. e105, 2008.

[113] J. Schroder, H. Schroder, S. Puglisi, et al., “SHREC: a short-read error correction method,” Bioinf, vol. 25, no. 17, pp. 2157–2163, 2009.

[114] K. E. McElroy, F. Luciani, and T. Thomas, “GemSIM: general, error-model based simulator of next-generation sequencing data,” BMC Genomics, vol. 13, p. 74, 2012.

[115] pmap. [Online]. Available: http://bmi.osu.edu/hpc/software/pmap/pmap.html

[116] Partek Inc., “Partek Genomics Suite™ v2.6,” Partek Inc., St. Louis, Tech. Rep., 2010.

[117] D. Bottomly, N. A. R. Walter, J. E. Hunter, P. Darakjian, S. Kawane, et al., “Evaluating gene expression in C57BL/6J and DBA/2J mouse striatum using RNA-Seq and microarrays,” PLoS ONE, vol. 6, no. 3, p. e17820, 2012.

[118] X. Fu, N. Fu, S. Guo, Z. Yan, Y. Xu, et al., “Estimating accuracy of RNA-Seq and microarrays with proteomics,” BMC Genomics, vol. 10, p. 161, 2009.

[119] J. C. Marioni, C. E. Mason, S. M. Mane, M. Stephens, and Y. Gilad, “RNA-Seq: An assessment of technical reproducibility and comparison with gene expression arrays,” Genome Res., vol. 18, pp. 1509–1517, 2008.

[120] D. Beisser, S. Brunkhorst, T. Dandekar, G. W. Klau, M. T. Dittrich, et al., “Robustness and accuracy of functional modules in integrated network analysis,” Bioinf., 2012.

[121] R. Breitling, A. Amtmann, and P. Herzyk, “Graph-based iterative group analysis enhances microarray interpretation,” BMC Bioinf., vol. 5, p. 100, 2004.

[122] P. Dao, K. Wang, C. Collins, M. Ester, A. Lapuk, et al., “Optimally discriminative subnetwork markers predict response to chemotherapy,” Bioinf., vol. 27, no. 13, pp. i205–i213, 2011.

[123] J. Dutkowski and T. Ideker, “Protein networks as logic functions in development and cancer,” PLoS Comp. Bio., vol. 7, no. 9, p. e1002180, 2011.

[124] H. Ma, E. E. Schadt, L. M. Kaplan, and H. Zhao, “COSINE: condition-specific sub-network identification using a global optimization method,” Bioinf., vol. 27, no. 9, pp. 1290–1298, 2011.

[125] I. Ulitsky and R. Shamir, “Identifying functional modules using expression pro- files and confidence-scored protein interactions,” Bioinf., vol. 25, no. 9, pp. 1158–1164, 2009.

[126] G. Warsow, B. Greber, S. S. Falk, C. Harder, M. Siatkowski, et al., “ExprEssence - revealing the essence of differential experimental data in the context of an interaction/regulation network,” BMC Sys. Biol., vol. 4, p. 164, 2010.

[127] C. D. Lasher, C. L. Poirel, and T. M. Murali, “Cellular response networks,” in Problem solving handbook in computational biology and bioinformatics, L. Heath and N. Ramakrishnan, Eds. Springer, 2011.

[128] I. Ljubić, R. Weiskircher, U. Pferschy, G. W. Klau, P. Mutzel, et al., “An algorithmic framework for the exact solution of the prize-collecting Steiner tree problem,” Math. Program. Ser. B., vol. 105, pp. 427–449, 2006.

[129] M. Schena, D. Shalon, R. W. Davis, and P. O. Brown, “Quantitative monitoring of gene expression patterns with a complementary DNA microarray,” Science, vol. 270, no. 5235, pp. 467–470, 1995.

[130] M. Smoot, K. Ono, J. Ruscheinski, P.-L. Wang, and T. Ideker, “Cytoscape 2.8: new features for data integration and network visualization,” Bioinf., vol. 27, no. 3, pp. 431–432, 2011.

[131] D. Beisser, G. W. Klau, T. Dandekar, T. Müller, and M. T. Dittrich, “BioNet: an R-package for the functional analysis of biological networks,” Bioinf., vol. 26, no. 8, pp. 1129–1130, 2009.

[132] J.-F. Rual, K. Venkatesan, T. Hao, T. Hirozane-Kishikawa, A. Dricot, et al., “Towards a proteome-scale map of the human protein-protein interaction network,” Nature, vol. 437, pp. 1173–1178, 2005.

[133] U. Stelzl, U. Worm, M. Lalowski, C. Haenig, F. H. Brembeck, et al., “A human protein-protein interaction network: A resource for annotating the proteome,” Cell, vol. 122, no. 6, pp. 957–968, 2005.

[134] A. K. Ramani, R. C. Bunescu, R. J. Mooney, and E. M. Marcotte, “Consolidating the set of known human protein-protein interactions in preparation for large-scale mapping of the human interactome,” Genome Biology, vol. 6, p. R40, 2005.

[135] C. Alfarano, C. E. Andrade, K. Anthony, N. Bahroos, M. Bajec, et al., “The biomolecular interaction network database and related tools 2005 update,” Nucl. Acids Res., vol. 33, no. Suppl 1, pp. D418–D424, 2005.

[136] G. Joshi-Tope, M. Gillespie, I. Vastrik, P. D’Eustachio, E. Schmidt, et al., “Reactome: a knowledgebase of biological pathways,” Nucl. Acids Res., vol. 33, no. Suppl 1, pp. D428–D432, 2005.

[137] S. Peri, J. D. Navarro, R. Amanchy, T. Z. Kristiansen, C. K. Jonnalagadda, et al., “Development of human protein reference database as an initial platform for approaching systems biology in humans,” Genome Res., vol. 13, pp. 2363–2371, 2003.

[138] M. Griffith, M. J. Tang, O. L. Griffith, R. D. Morin, S. Y. Chan, et al., “ALEXA: a microarray design platform for alternative expression analysis,” Nat. Methods, vol. 5, no. 2, p. 118, 2008.

[139] M. Griffith, O. L. Griffith, J. Mwenifumbo, R. Goya, A. S. Morrissy, et al., “Alternative expression analysis by RNA sequencing,” Nat. Methods, vol. 7, pp. 843–847, 2010.

[140] I. T. Tai, M. Dai, D. A. Owen, and L. B. Chen, “Genome-wide expression analysis of therapy-resistant tumors reveals SPARC as a novel target for cancer therapy,” J. Clin. Invest., vol. 115, no. 6, pp. 1492–1502, 2005.

[141] R. M. Niles, S. A. Wilhelm, G. D. J. Steele, B. Burke, T. Christensen, et al., “Isolation and characterization of an undifferentiated human colon carcinoma cell line (MIP-101).” Cancer Invest., vol. 5, no. 6, pp. 545–552, 1987.

[142] Y. Benjamini and Y. Hochberg, “Controlling the false discovery rate: a practical and powerful approach to multiple testing.” J. R. Statist. Soc. Ser. B, vol. 57, no. 1, pp. 289–300, 1995.

[143] P. Bruni, G. Minopoli, T. Brancaccio, M. Napolitano, R. Faraonio, et al., “Fe65, a ligand of the Alzheimer’s β-amyloid precursor protein, blocks cell cycle progression by down-regulating thymidylate synthase expression,” J. Biol. Chem., vol. 277, pp. 35481–35488, 2002.

[144] N. Zhang, Y. Yin, S.-J. Xu, and W.-S. Chen, “5-fluorouracil: Mechanisms of resistance and reversal strategies,” Molecules, vol. 13, pp. 1551–1569, 2008.

[145] W. Ichikawa, T. Takahashi, K. Suto, Y. Shirota, Z. Nihei, et al., “Simple combinations of 5-FU pathway genes predict the outcome of metastatic gastric cancer patients treated by S-1,” Int. J. Cancer, vol. 119, pp. 1927–1933, 2006.

[146] R. Matsuyama, S. Togo, D. Shimizu, N. Momiyama, T. Ishikawa, et al., “Predicting 5-fluorouracil chemosensitivity of liver metastases from colorectal cancer using primary tumor specimens: Three-gene expression model predicts clinical response,” Int. J. Cancer, vol. 119, pp. 406–413, 2006.

[147] H. Hoshino, N. Miyoshi, K.-i. Nagai, Y. Tomimaru, H. Nagano, et al., “Epithelial-mesenchymal transition with expression of SNAI1-induced chemoresistance in colorectal cancer,” Biochem. Biophys. Res. Commun., vol. 390, no. 3, pp. 1061–1065, 2009.

[148] M. Griffith, J. C. Mwenifumbo, P. Y. Cheung, J. E. Paul, T. J. Pugh, et al., “Novel mRNA isoforms and mutations of uridine monophosphate synthetase and 5-fluorouracil resistance in colorectal cancer,” Pharmacogenomics, vol. 17, 2012.

[149] W. Ichikawa, H. Uetake, Y. Shirota, H. Yamada, T. Takahashi, et al., “Both gene expression for orotate phosphoribosyltransferase and its ratio to dihydropyrimidine dehydrogenase influence outcome following fluoropyrimidine-based chemotherapy for metastatic colorectal cancer,” Br. J. Cancer, vol. 89, pp. 1486–1492, 2003.

[150] K. Chellappa, L. Jankova, J. M. Schnabl, S. Pan, Y. Brelivet, et al., “Src tyrosine kinase phosphorylation of nuclear receptor HNF4α correlates with isoform-specific loss of HNF4α in human colon cancer,” PNAS, 2012.

[151] R. W. Solava, R. P. Michaels, and T. Milenković, “Graphlet-based edge clustering reveals pathogen-interacting proteins,” Bioinformatics, vol. 28, no. 18, pp. i480–i486, 2012.

[152] Y.-A. Kim and T. M. Przytycka, “Bridging the gap between genotype and phenotype via network approaches,” Frontiers in genetics, vol. 3, 2012.

[153] Y. Qi, Y. Suhail, Y.-y. Lin, J. D. Boeke, and J. S. Bader, “Finding friends and enemies in an enemies-only network: a graph diffusion kernel for predicting novel genetic interactions and co-complex membership from yeast genetic interactions,” Genome research, vol. 18, no. 12, pp. 1991–2004, 2008.

[154] F. Vandin, E. Upfal, and B. J. Raphael, “De novo discovery of mutated driver pathways in cancer,” Genome research, vol. 22, no. 2, pp. 375–385, 2012.

[155] S. Erten, S. A. Chowdhury, X. Guan, R. K. Nibbe, J. S. Barnholtz-Sloan, M. R. Chance, and M. Koyutürk, “Identifying stage-specific protein subnetworks for colorectal cancer,” BMC Proceedings, vol. 6, no. Suppl 7, p. S1, 2012.

[156] P. Dao, R. Colak, R. Salari, F. Moser, E. Davicioni, et al., “Inferring cancer subnetwork markers using density-constrained biclustering,” Bioinf., vol. 26, no. 18, pp. i625–i631, 2010.

[157] A. Hatem, K. Kaya, and Ü. V. Çatalyürek, “Microarray vs. RNA-Seq: A comparison for active subnetwork discovery,” in Proc. of the 12th ACM Conference on Bioinformatics, Computational Biology and Biomedicine (BCB’12), 2012.

[158] J. L. Morrison, R. Breitling, D. J. Higham, and D. R. Gilbert, “GeneRank: Using search engine technology for the analysis of microarray experiments,” BMC Bioinf., vol. 6, p. 233, 2005.

[159] C. Winter, G. Kristiansen, S. Kersting, J. Roy, D. Aust, et al., “Google goes cancer: Improving outcome prediction for cancer patients by network-based ranking of marker genes,” PLOS Comp. Biol., vol. 8, no. 5, p. e1002511, 2012.

[160] G. Iván and V. Grolmusz, “When the Web meets the cell: using personalized PageRank for analyzing protein interaction networks,” Bioinformatics, vol. 27, no. 3, pp. 405–407, 2011.

[161] T. Ideker and R. Sharan, “Protein networks in disease,” Genome Res., vol. 18, no. 4, pp. 644–652, 2008.

[162] D. Beisser, G. W. Klau, T. Dandekar, T. Müller, and M. T. Dittrich, “BioNet: an R-package for the functional analysis of biological networks,” Bioinf., vol. 26, no. 8, pp. 1129–1130, 2009.

[163] S. Anders and W. Huber, “Differential expression analysis for sequence count data,” Gen. Biol., vol. 11, p. R106, 2010.

[164] D. W. Huang, B. T. Sherman, and R. A. Lempicki, “Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources,” Nature Protoc., vol. 4, no. 1, pp. 44–57, 2009.

[165] R. Nahta, D. Yu, M.-C. Hung, G. N. Hortobagyi, and F. J. Esteva, “Mechanisms of disease: understanding resistance to HER2-targeted therapy in human breast cancer,” Nat. Clinic. Prac. Oncol., vol. 3, pp. 269–280, 2006.

[166] R. J. Pietras, B. M. Fendly, V. R. Chazin, M. D. Pegram, S. Howell, and D. J. Slamon, “Antibody to HER-2/neu receptor blocks DNA repair after cisplatin in human breast and ovarian cancer cells.” Oncogene, vol. 9, no. 7, pp. 1829–1838, 1994.

[167] F. Balkwill and A. Mantovani, “Inflammation and cancer: back to Virchow?” The lancet, vol. 357, no. 9255, pp. 539–545, 2001.

[168] S. L. Hembruff and N. Cheng, “Chemokine signaling in cancer: Implications on the tumor microenvironment and therapeutic targeting,” Cancer Ther., vol. 7, no. A, pp. 254–267, 2009.

[169] V. M. Golubovskaya and W. Cance, “Focal Adhesion Kinase and p53 signal transduction pathways in cancer,” Front. in Biosci., vol. 15, pp. 901–912, 2010.

[170] L. Zuo, W. Li, and S. You, “Progesterone reverses the mesenchymal phenotypes of basal phenotype breast cancer cells via a membrane progesterone receptor mediated pathway,” Breast Can. Res., vol. 12, p. R34, 2010.

[171] C. Horbinski, J. Hobbs, K. Cieply, S. Dacic, and R. L. Hamilton, “EGFR expression stratifies oligodendroglioma behavior,” Am J Pathol., vol. 179, no. 4, pp. 1638–1644, 2011.

[172] S. Hagel, “CD44 expression in primary and recurrent oligodendrogliomas and in adjacent gliotic brain tissue,” Neuropathology and App. Neurobiology, vol. 25, no. 4, pp. 313–318, 1999.

[173] S. M. Ranuncolo, V. Ladeda, S. Specterman, M. V. MD, J. Lastiri, A. Morandi, E. Matos, E. B. D. K. Joffe, L. Puricelli, and M. G. Pallotta, “CD44 expression in human gliomas,” Journal of Surgical Oncology, vol. 79, no. 1, pp. 30–36, 2002.

[174] G. N. Fuller, C. H. Rhee, K. R. Hess, L. S. Caskey, R. Wang, J. M. Bruner, W. K. A. Yung, and W. Zhang, “Reactivation of insulin-like growth factor binding protein 2 expression in glioblastoma multiforme: A revelation by parallel gene expression profiling,” Cancer Res., vol. 59, p. 4228, 1999.

[175] N. Atias and R. Sharan, “iPoint: an integer programming based algorithm for inferring protein subnetworks,” Molecular BioSystems, 2013.

[176] Y.-A. Kim, S. Wuchty, and T. M. Przytycka, “Identifying causal genes and dysregulated pathways in complex diseases,” PLoS computational biology, vol. 7, no. 3, p. e1001095, 2011.

[177] S. Zhang, C.-C. Liu, W. Li, H. Shen, P. W. Laird, and X. J. Zhou, “Discovery of multi-dimensional modules by integrative analysis of cancer genomic data,” Nucleic acids research, vol. 40, no. 19, pp. 9379–9391, 2012.

[178] A. Hyvärinen, “Independent component analysis: recent advances,” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 371, no. 1984, 2013.

[179] S. Chavali, S. Bruhn, K. Tiemann, P. Sætrom, F. Barrenäs, T. Saito, K. Kanduri, H. Wang, and M. Benson, “MicroRNAs act complementarily to regulate disease-related mRNA modules in human diseases,” RNA, vol. 19, no. 11, pp. 1552–1562, 2013.

[180] J. S. Tsang, M. S. Ebert, and A. van Oudenaarden, “Genome-wide dissection of microRNA functions and cotargeting networks using gene set signatures,” Molecular cell, vol. 38, no. 1, pp. 140–153, 2010.

[181] W. Liebermeister, “Linear modes of gene expression determined by independent component analysis,” Bioinformatics, vol. 18, no. 1, pp. 51–60, 2002.

[182] J.-F. Cardoso and A. Souloumiac, “Blind beamforming for non-Gaussian signals,” in IEE Proceedings F (Radar and Signal Processing), vol. 140, no. 6. IET, 1993, pp. 362–370.

[183] A. J. Bell and T. J. Sejnowski, “An information-maximization approach to blind separation and blind deconvolution,” Neural computation, vol. 7, no. 6, pp. 1129–1159, 1995.

[184] J. Himberg, A. Hyvärinen, and F. Esposito, “Validating the independent components of neuroimaging time series via clustering and visualization,” Neuroimage, vol. 22, no. 3, pp. 1214–1222, 2004.

[185] P. Chiappetta, M.-C. Roubaud, and B. Torrésani, “Blind source separation and the analysis of microarray data,” Journal of Computational Biology, vol. 11, no. 6, pp. 1090–1109, 2004.

[186] S.-I. Lee, S. Batzoglou et al., “Application of independent component analysis to microarrays,” Genome biology, vol. 4, no. 11, pp. R76–R76, 2003.

[187] A. Frigyesi, S. Veerla, D. Lindgren, and M. Höglund, “Independent component analysis reveals new and biologically significant structures in micro array data,” BMC bioinformatics, vol. 7, no. 1, p. 290, 2006.

[188] A. E. Teschendorff, M. Journée, P. A. Absil, R. Sepulchre, and C. Caldas, “Elucidating the altered transcriptional programs in breast cancer using independent component analysis,” PLoS computational biology, vol. 3, no. 8, p. e161, 2007.

[189] R. Schachtner, D. Lutter, P. Knollmüller, A. M. Tomé, F. J. Theis, G. Schmitz, M. Stetter, P. G. Vilda, and E. W. Lang, “Knowledge-based gene expression classification via matrix factorization,” Bioinformatics, vol. 24, no. 15, pp. 1688–1697, 2008.

[190] J. M. Engreitz, B. J. Daigle, J. J. Marshall, and R. B. Altman, “Independent component analysis: Mining microarray data for fundamental human gene expression modules,” Journal of biomedical informatics, vol. 43, no. 6, pp. 932–944, 2010.

[191] M. Rotival, T. Zeller, P. S. Wild, S. Maouche, S. Szymczak, A. Schillert, R. Castagné, A. Deiseroth, C. Proust, J. Brocheton et al., “Integrating genome-wide genetic variations and monocyte expression data reveals trans-regulated gene modules in humans,” PLoS genetics, vol. 7, no. 12, p. e1002367, 2011.

[192] R. A. Verdugo, T. Zeller, M. Rotival, P. S. Wild, T. Münzel et al., “Graphical modeling of gene expression in monocytes suggests molecular mechanisms explaining increased atherosclerosis in smokers,” PloS one, vol. 8, no. 1, p. e50888, 2013.

[193] Y.-O. Li, T. Adalı, and V. D. Calhoun, “Estimating the number of independent components for functional magnetic resonance imaging data,” Human brain mapping, vol. 28, no. 11, pp. 1251–1266, 2007.

[194] J. L. Horn, “A rationale and test for the number of factors in factor analysis,” Psychometrika, vol. 30, no. 2, pp. 179–185, 1965.

[195] M. Scholz, S. Gatzek, A. Sterling, O. Fiehn, and J. Selbig, “Metabolite fingerprinting: detecting biological features by independent component analysis,” Bioinformatics, vol. 20, no. 15, pp. 2447–2454, 2004.

[196] S.-D. Hsu, F.-M. Lin, W.-Y. Wu, C. Liang, W.-C. Huang, W.-L. Chan, W.-T. Tsai, G.-Z. Chen, C.-J. Lee, C.-M. Chiu et al., “miRTarBase: a database curates experimentally validated microRNA–target interactions,” Nucleic acids research, vol. 39, no. suppl 1, pp. D163–D169, 2011.

[197] G. Yu, ReactomePA: Reactome Pathway Analysis, 2014, r package version 1.4.0.

[198] J. D. Osborne, J. Flatow, M. Holko, S. M. Lin, W. A. Kibbe, L. J. Zhu, M. I. Danila, G. Feng, and R. L. Chisholm, “Annotating the human genome with disease ontology,” BMC genomics, vol. 10, no. Suppl 1, p. S6, 2009.

[199] G. Yu, L. Wang, Y. Han, and Q. He, “clusterProfiler: an R package for comparing biological themes among gene clusters.” OMICS: A Journal of Integrative Biology, vol. 16, no. 5, pp. 284–287, 2012.

[200] C.-X. Deng, “BRCA1: cell cycle checkpoint, genetic instability, DNA damage response and cancer evolution,” Nucleic acids research, vol. 34, no. 5, pp. 1416–1426, 2006.

[201] F. Revillion, J. Bonneterre, and J. Peyrat, “ERBB2 oncogene in human breast cancer and its clinical significance,” European Journal of Cancer, vol. 34, no. 6, pp. 791–808, 1998.

[202] A. Hollestelle, J. H. Nagel, M. Smid, S. Lam, F. Elstrodt, M. Wasielewski, S. S. Ng, P. J. French, J. K. Peeters, M. J. Rozendaal et al., “Distinct gene mutation profiles among luminal-type and basal-type breast cancer cell lines,” Breast cancer research and treatment, vol. 121, no. 1, pp. 53–64, 2010.

[203] F. Lerebours, S. Vacher, C. Andrieu, M. Espie, M. Marty, R. Lidereau, and I. Bieche, “NF-kappa B genes have a major role in inflammatory breast cancer,” BMC cancer, vol. 8, no. 1, p. 41, 2008.

[204] A. D. Singhi, A. Cimino-Mathews, R. B. Jenkins, F. Lan, S. R. Fink, H. Nassar, R. Vang, J. H. Fetting, J. Hicks, S. Sukumar et al., “Myc gene amplification is often acquired in lethal distant breast cancer metastases of unamplified primary tumors,” Modern Pathology, vol. 25, no. 3, pp. 378–387, 2011.

[205] Y. C. Tsai and A. M. Weissman, “The unfolded protein response, degradation from the endoplasmic reticulum, and cancer,” Genes & cancer, vol. 1, no. 7, pp. 764–778, 2010.

[206] H. Zhao, A. Langerød, Y. Ji, K. W. Nowels, J. M. Nesland et al., “Different gene expression patterns in invasive lobular and ductal carcinomas of the breast,” Molecular biology of the cell, vol. 15, no. 6, pp. 2523–2536, 2004.

[207] N. Wasif, M. A. Maggard, C. Y. Ko, A. E. Giuliano et al., “Invasive lobular vs. ductal breast cancer: a stage-matched comparison of outcomes,” Annals of surgical oncology, vol. 17, no. 7, pp. 1862–1869, 2010.

[208] G. Turashvili, J. Bouchal, K. Baumforth, W. Wei, M. Dziechciarkova et al., “Novel markers for differentiation of lobular and ductal invasive breast carcinomas by laser microdissection and microarray analysis,” BMC cancer, vol. 7, no. 1, p. 55, 2007.

[209] S.-l. Ding, J.-C. Yu, S.-T. Chen, G.-C. Hsu, S.-J. Kuo, Y. H. Lin, P.-E. Wu, and C.-Y. Shen, “Genetic variants of BLM interact with RAD51 to increase breast cancer susceptibility,” Carcinogenesis, vol. 30, no. 1, pp. 43–49, 2009.

[210] X.-J. Ma, R. Salunga, J. T. Tuggle, J. Gaudet, E. Enright, P. McQuary, T. Payette, M. Pistone, K. Stecker, B. M. Zhang et al., “Gene expression profiles of human breast cancer progression,” Proceedings of the National Academy of Sciences, vol. 100, no. 10, pp. 5974–5979, 2003.

[211] M. Lee, M. Daniels, M. Garnett, and A. Venkitaraman, “A mitotic function for the high-mobility group protein HMG20b regulated by its interaction with the BRC repeats of the BRCA2 tumor suppressor,” Oncogene, vol. 30, no. 30, pp. 3360–3369, 2011.

[212] K.-L. Chan, P. S. North, and I. D. Hickson, “BLM is required for faithful chromosome segregation and its localization defines a class of ultrafine anaphase bridges,” The EMBO journal, vol. 26, no. 14, pp. 3397–3409, 2007.

[213] K. Broberg, E. Huynh, K. S. Engström, J. Björk, M. Albin, C. Ingvar, H. Olsson, and M. Höglund, “Association between polymorphisms in RMI1, TOP3A, and BLM and risk of cancer, a case-control study,” BMC cancer, vol. 9, no. 1, p. 140, 2009.

[214] E. Levy-Lahad, “Fanconi anemia and breast cancer susceptibility meet again,” Nature genetics, vol. 42, no. 5, 2010.

[215] P. Li, Y. Lin, Y. Zhang, Z. Zhu, K. Huo et al., “SSX2IP promotes metastasis and chemotherapeutic resistance of hepatocellular carcinoma,” Journal of translational medicine, 2013.

[216] A. R. Grosso, S. Martins, and M. Carmo-Fonseca, “The emerging role of splicing factors in cancer,” EMBO reports, vol. 9, no. 11, pp. 1087–1093, 2008.

[217] I. Lal, K. Dittus, and C. E. Holmes, “Platelets, coagulation and fibrinolysis in breast cancer progression,” Breast Cancer Research, vol. 15, no. 4, pp. 1–11, 2013.

[218] M.-N. Song, P.-G. Moon, J.-E. Lee, M. Na, W. Kang, Y. S. Chae, J.-Y. Park, H. Park, and M.-C. Baek, “Proteomic analysis of breast cancer tissues to identify biomarker candidates by gel-assisted digestion and label-free quantification methods using LC-MS/MS,” Archives of pharmacal research, vol. 35, no. 10, pp. 1839–1847, 2012.

[219] I. Pucci-Minafra, P. Cancemi, M. R. Marabeti, N. N. Albanese, G. Di Cara, P. Taormina, and A. Marrazzo, “Proteomic profiling of 13 paired ductal infiltrating breast carcinomas and non-tumoral adjacent counterparts,” PROTEOMICS-Clinical Applications, vol. 1, no. 1, pp. 118–129, 2007.

[220] M.-C. Hsu, K.-T. Lee, W.-C. Hsiao, C.-H. Wu, H.-Y. Sun, I.-L. Lin, and K.-C. Young, “The dyslipidemia-associated SNP on the APOA1/C3/A5 gene cluster predicts post-surgery poor outcome in Taiwanese breast cancer patients: a 10-year follow-up study,” BMC cancer, vol. 13, no. 1, p. 330, 2013.

[221] A. Cretu, X. Sha, J. Tront, B. Hoffman, and D. A. Liebermann, “Stress sensor Gadd45 genes as therapeutic targets in cancer,” Cancer therapy, vol. 7, no. A, p. 268, 2009.
