Active Module Discovery: Integrated Approaches of Gene Co-Expression and PPI Networks and MicroRNA Data
Dissertation
Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University
By
Ayat Hatem, M.Sc.
Graduate Program in Electrical and Computer Engineering
The Ohio State University
2014
Dissertation Committee:
Ümit V. Çatalyürek, Advisor
Yuejie Chi
Kun Huang
Füsun Özgüner

© Copyright by Ayat Hatem 2014

Abstract
Integrating protein-protein interaction (PPI) networks with gene expression data to extract active modules has been shown to be promising for detecting meaningful biomarkers for cancer and other diseases. However, current algorithms suffer from several drawbacks: they focus only on the highly differentially expressed genes; they analyze dependencies between genes in the PPI network only, neglecting the genes whose interactions are not yet known; and they use only mRNA gene expression data, ignoring other types of data such as gene mutation information and microRNA expression. In addition, sequencing mRNA with next-generation sequencing technology (RNA-Seq) has recently become the new standard for measuring gene expression.
However, existing algorithms either cannot handle RNA-Seq data or return large modules that are hard to analyze. Therefore, new approaches are needed to address the current drawbacks while integrating RNA-Seq data into the module discovery process.
This work explores some of the drawbacks of current active module discovery algorithms. We first discuss the differences between RNA-Seq data and microarray data. With experimental evidence, we show that RNA-Seq is more powerful than microarray in providing better active modules at the expense of generating larger ones. Therefore, new approaches are needed to handle RNA-Seq data.
Afterwards, we present a new workflow, PRASE, that is specifically designed to obtain better active modules from RNA-Seq data. PRASE employs a variation of the well-known PageRank algorithm to preprocess the gene expression p-values. Then, it applies a scaling function to construct new p-values for the genes.
These new p-values redefine the importance of the genes: a gene is important not only based on its own value but also based on the values of the surrounding genes, thereby boosting the importance of genes that might not be differentially expressed from the p-value perspective. Finally, PRASE uses the new p-values with existing active module discovery algorithms to extract the final modules. We applied our workflow to colorectal cancer, oligodendroglioma tumor, and breast cancer datasets.
Using PRASE, we obtain more specialized modules which contain information that is overlooked by existing algorithms.
Finally, we present our novel microRNA-mRNA integration technique, Mica, that efficiently integrates microRNA and mRNA expressions with the PPI network to discover more disease-specific active modules. The novelty of Mica lies in the early integration of microRNA expression with mRNA expression to better highlight the indirect dependencies between genes. We applied Mica to microRNA-Seq and mRNA-Seq data sets of 699 invasive ductal carcinoma samples and 150 invasive lobular carcinoma samples from the Cancer Genome Atlas Project (TCGA). The Mica modules unravel new and interesting dependencies between the genes and miRNAs. Additionally, the modules accurately differentiate between case and control samples while being highly enriched with disease-specific pathways and genes.
To my parents, Karim, Omar, and Maleeka.
Acknowledgments
I would like to express my gratitude to my advisor, Prof. Ümit V. Çatalyürek, for his continuous and generous support and guidance throughout my study at OSU. Prof. Çatalyürek showed great faith in my abilities and allowed me to work quite independently, while providing invaluable guidance at the necessary times.
I would also like to thank the dissertation examination committee members, Prof. Füsun Özgüner, Prof. Kun Huang, Prof. Yuejie Chi, and Prof. Dawn Chandler. The discussion and comments I received during my defense were invaluable, opening my mind to new ideas and research directions.
I would also like to thank Prof. Kamer Kaya for his support and the various discussions we had, some of which generated ideas used in my work.
I want to thank all of my colleagues and friends at the HPC lab, including Erdem Sariyuce, Mehmet Deveci, Anas AbuDolah, and Izzet Senturk. I would also like to thank the former members of the HPC lab, including Erik Saule, Onur Kucuktunc, and Doruk Bozdağ. It has been a privilege to know such a great group of people. In particular, I would like to mention Doruk Bozdağ and Erik Saule for the numerous fruitful and interesting discussions.
I would like to extend my deepest gratitude and love to my mother and my late father, who supported me during my research career and always encouraged me to follow my dreams. I would also like to thank my children, Omar and Maleeka, for their sense of humor and their wonderful characters; they totally changed my life. I can't describe how grateful I am to my husband, Karim, whose sweet presence has brought happiness into my life. He was always there for me in my tough times, always encouraging me to go forward with my PhD and never to give up.
Finally, I acknowledge the support of the Graduate School of The Ohio State University for the University Fellowship Award, and the support from the National Science Foundation.
Vita
September 15th, 1985 ...... Born - Giza, Egypt
July 2007 ...... B.S., Computer Engineering, Cairo University, Cairo, Egypt
August 2009 ...... M.S., Software Engineering, Nile University, Cairo, Egypt
September 2009–August 2010 ...... University Fellow, The Ohio State University, Columbus, OH, USA
September 2010–Spring 2013 ...... Grad. Research Assoc., The Ohio State University, Columbus, OH, USA
Spring 2013–Present ...... Grad. Teaching Assoc., The Ohio State University, Columbus, OH, USA
Publications
Research Publications
A. Hatem, K. Kaya, J. Parvin, K. Huang, Ü. V. Çatalyürek, "MICA: MicroRNA Integration for Active Module Discovery," In the 13th European Conference on Computational Biology (ECCB), Submitted
K. Kaya, A. Hatem, H. G. Özer, K. Huang, Ü. V. Çatalyürek, "High-Performance Computing in High-Throughput Sequencing," In Biological Knowledge Discovery Handbook, John Wiley & Sons, Editors M. Elloumi, A. Y. Zomaya, 2014
L. Wang, A. Hatem, Ü. V. Çatalyürek, M. Morrison, Z. Yu, "Metagenomic Insights into the Carbohydrate-Active Enzymes Carried by the Microorganisms Adhering to Solid Digesta in the Rumen of Cows," In PLoS One, vol. 8, no. 11, pg. e78507, Nov 2013
A. Hatem, D. Bozdağ, A. E. Toland, Ü. V. Çatalyürek, "Benchmarking Short Sequence Mapping Tools," In BMC Bioinformatics, vol. 14, no. 1, pg. 184, 2013
A. Hatem, K. Kaya, Ü. V. Çatalyürek, "PRASE: PageRank-based Active Subnetwork Extraction," In Proc. of ACM Conference on Bioinformatics, Computational Biology and Biomedical Informatics (BCB), Sep 2013
A. Hatem, K. Kaya, Ü. V. Çatalyürek, "Microarray vs. RNA-Seq: A comparison for active subnetwork discovery," In Proc. of ACM Conference on Bioinformatics, Computational Biology and Biomedical Informatics (BCB), Oct 2012
A. Hatem, D. Bozdağ, Ü. V. Çatalyürek, "Benchmarking Short Sequence Mapping Tools," In Proc. of IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Dec 2011
D. Bozdağ, A. Hatem, Ü. V. Çatalyürek, "Exploring Parallelism in Short Sequence Mapping Using Burrows-Wheeler Transform," In Proc. of 9th IEEE International Workshop on High Performance Computational Biology (in conjunction with IPDPS), 2010
A. Hatem, D. Bozdağ, Ü. V. Çatalyürek, "Benchmarking Short Sequence Alignment Tools," In Abstract, Bioinformatics, 2010 Ohio Collaborative Conference, 2010
Fields of Study
Major Field: Electrical & Computer Engineering
Table of Contents

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

1. Introduction
    1.1 Dissertation Outline and Summary of Contributions
2. Background and Related Work
    2.1 DNA and the central dogma
        2.1.1 Measuring gene expression levels
        2.1.2 Other elements in the central dogma
    2.2 Active module discovery problem
    2.3 microRNA and mRNA integration
3. An Evaluation of RNA-Seq Mapping Tools
    3.1 Background
        3.1.1 Features
        3.1.2 Tools' description
        3.1.3 Default options of the tested tools
        3.1.4 Evaluation criteria
    3.2 Methods
        3.2.1 Benchmark design
        3.2.2 Use case: SNP Calling
    3.3 Results and discussion
        3.3.1 Mapping options
        3.3.2 Input properties
        3.3.3 Algorithmic features
        3.3.4 Scalability
        3.3.5 Accuracy evaluation
        3.3.6 Rabema evaluation
        3.3.7 Use case: SNP calling
    3.4 Conclusion
4. Efficiency of RNA-Seq Data for Active Module Discovery in Comparison to MicroArrays
    4.1 Background
        4.1.1 Tools for Active Module Discovery
        4.1.2 Microarray vs. RNA-Seq: History
    4.2 Experimental Evaluation
        4.2.1 Colorectal cancer cell lines
        4.2.2 Oligodendroglioma tumors
    4.3 Conclusion and Future Work
5. PRASE: PageRank-based Active Module Extraction
    5.1 Background
        5.1.1 Active module extraction tools
        5.1.2 PageRank for gene ranking
    5.2 PRASE
        5.2.1 Input network and matrix construction
        5.2.2 Re-ranking
        5.2.3 Scaling and combining
    5.3 Experimental Results
        5.3.1 Breast invasive carcinoma
        5.3.2 Colorectal cancer cell line (CRC)
        5.3.3 Oligodendroglioma tumors
    5.4 Conclusions
6. MICA: MicroRNA Integration for Active Module Discovery
    6.1 Background
    6.2 Methods
        6.2.1 Data integration
        6.2.2 ICA on gene expression values
        6.2.3 Connected module extraction
    6.3 Results
        6.3.1 Results on ILC data
        6.3.2 Results on IDC data
    6.4 Conclusion
7. Conclusions and Future Directions
    7.1 Summaries and our findings
    7.2 Future Work

Bibliography
List of Tables

2.1 Famous active module discovery algorithms and their features
3.1 Features supported by the tools
3.2 Sensitivity evaluation of the different tools
3.3 Rabema evaluation
3.4 SNP calling results
4.1 Size of active modules obtained by the different tools
4.2 Number of DE genes in each active module
4.3 Occurrence of significant genes in the different modules
4.4 Top three hub nodes in each module
4.5 Size of active modules found by the tools
4.6 Hub node analysis
5.1 Standard names for the curated gene sets
5.2 GO enrichment analysis for the BRCA dataset
5.3 Pathway enrichment analysis for the BRCA data set
5.4 Number of DE and significant genes in each module
5.5 Percentages of DE genes in each module
5.6 Summary of improvements
6.1 Size of the modules obtained using Mica and ICA for the ILC data set
6.2 Pathway enrichment analysis for Mica, ICA, and DEGAS on the ILC data
6.3 The components obtained by ICA and Mica on the IDC data set
6.4 Pathway enrichment analysis for ICA, DEGAS, and Mica on the IDC data
6.5 DO enrichment analysis for ICA, DEGAS, and Mica
List of Figures

1.1 PRASE workflow
1.2 MICA workflow
2.1 Central dogma of biology
3.1 Evaluation criteria
3.2 Default options effect using wgsim
3.3 Default options effect
3.4 Quality threshold vs. number of mismatches
3.5 Effect of changing the number of mismatches using a synthetic data set extracted using wgsim
3.6 Effect of changing the number of mismatches using a synthetic data set extracted using ART
3.7 Effect of changing the number of mismatches using a real data set
3.8 Effect of changing the seed length using a synthetic data set
3.9 Effect of changing the seed length using a real data set
3.10 Effect of changing the read length using a synthetic data set extracted using wgsim
3.11 Effect of changing the read length using an ART generated data set
3.12 Effect of using paired-end data using a wgsim synthetic data set
3.13 Effect of changing the genome type using wgsim generated synthetic data set
3.14 Effect of changing the genome type using ART generated synthetic data set
3.15 Effect of enabling gapped alignment using a real data set
3.16 Speedup when using multithreading and multiprocessing
4.1 Visualization of the MicroNet modules
5.1 PRASE workflow
5.2 Evaluation of the modules obtained for the BRCA dataset
5.3 Size of the active modules from the CRC dataset
5.4 Size of active modules for the Oligo dataset
5.5 Percentage of important genes in the jActiveModules module
6.1 Mica: The workflow
6.2 Random t-score distribution
6.3 AUC for Mica, ICA, and DEGAS for a 10-fold cross validation
6.4 Overlap between important pathways enriched in both Mica and ICA modules
6.5 mica15 module. The red nodes are for the nodes in the Hemostasis pathway
Chapter 1: Introduction
In complex diseases, genes do not act in isolation; rather, they interact in pathways and modules to perform their designated functions [1]. In addition, their interaction patterns change with the type of the cell and the condition [2].
A well-structured characterization and analysis of such modules have always been intriguing for the researchers, especially for extremely heterogeneous diseases. Cancer is such a disease: the derivative tissue differs for many cancer types. Besides, each cancer type can have many subtypes. Identifying a biologically correct and valid module is important for each cancer type and subtype since the treatment options and their success rates can significantly differ [3].
One way to find such modules is to look for clusters of genes with certain properties, e.g., dense clusters, in different biological networks, such as the protein-protein interaction (PPI) network or the gene co-expression network [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]. A more effective method is to integrate different biological data to better highlight these gene modules [15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27]. Following this idea, various techniques that integrate gene expression values or p-values with biological networks to extract such gene modules have been proposed, e.g., [28, 29, 30, 31, 32, 33, 34, 19, 35, 36, 37]. Such extracted modules are called active modules since the gene expression data, which changes dynamically, is integrated with the PPI network, which is static. Hence, the word active comes from the notion that these modules are active in certain cells or conditions. Following this track, many algorithms have been developed to make better use of the network structure and of other types of data, such as genotypic data. An excellent review and categorization of these algorithms was recently provided [24].

Although the gene expression signature-based algorithms have proven to be flexible in practice, they do not provide a be-all and end-all solution for the active module discovery problem. Today, high throughput sequencing technology generates an unprecedented amount of data in different areas, such as mRNA-Seq and microRNA-Seq. Integrating the different types of data would increase the accuracy of active module detection and provide a better picture of how the underlying cell works. However, many of the existing algorithms and workflows do not exploit such heterogeneity. Besides, these algorithms are usually restricted to the proteins/genes in the networks they use and ignore the other genes in the gene expression data whose interaction patterns are not yet known.
Specifically, the drawbacks of current active module discovery algorithms can be summarized as follows (see Chapter 2 for more details):
• The use of p-value or fold change based criteria to define the differentially expressed genes.
• The under-estimation of genes that might not be differentially expressed from the p-value perspective but might be important based on their interaction patterns with surrounding important genes.
• Even when the above problem is solved using information flow based algorithms, it is addressed per sample, and it is not obvious how to extend the solution to find the active modules across all of the samples.
• The assumption that genes should exhibit linear correlations across the samples.
• The focus on integrating only gene expression and the PPI network, while there are other mechanisms the cell uses to regulate the expression levels of the genes. The effect of such mechanisms might not be apparent at the gene expression level; therefore, other possible active modules might not be detected.
• Most of the algorithms focus only on the genes existing in the PPI network while ignoring other genes that are not yet present in the PPI network but have important relations to other genes or would increase the rank of other genes.
• Most of the tools were designed and tested using only microarray data, and it is not obvious how they would perform using high throughput data, such as mRNA-Seq.
An important question regarding data generated by high throughput sequencing is how to map the data and finally obtain gene and microRNA expression values. To this end, many tools have been developed to map the short sequences to a reference genome. However, the quality of the mapping is still questionable, and an effective evaluation of the mapping, and hence of the final expression values used for active module discovery, is needed to understand the efficiency and the importance of using high throughput sequencing datasets in the active module discovery problem.
1.1 Dissertation Outline and Summary of Contributions
In this dissertation, a deep evaluation, design, and implementation of different approaches to efficiently integrate different types of biological data are presented. We discuss the high throughput sequencing technology and how it has affected the measurement of gene expression. Unlike older techniques, such as microarrays, the quality of measuring gene expression from mRNA-Seq data depends highly on the quality of the mapping step. To this end, a comparison between the different short sequence mapping tools and an evaluation of the effect of the different settings on the quality of the output are presented in Chapter 3. In addition, we discuss the different approaches used in short sequence mapping and the state-of-the-art tools that implement those approaches.
In Chapter 4, the effect of mRNA-Seq and microarray data on the quality of the discovered active modules is compared. It is important to understand the significance of mRNA-Seq datasets in discovering more disease-related active modules; otherwise, microarray datasets, which are cheaper, would be more suitable to further advance the field. Therefore, a deep evaluation is carried out in this chapter to address this point.
In Chapters 5 and 6, we present the two approaches we implemented to effectively and efficiently integrate different types of high throughput data for active module discovery. Figures 1.1 and 1.2 show the workflows of the two approaches. The first workflow, PRASE, makes use of the properties of mRNA-Seq data to create a gene co-expression network and then adjusts the p-values of the genes to highlight the most important ones. Basically, a variation of the PageRank algorithm [38] is used to propagate the significance of a gene to its neighbors in a gene co-expression network constructed from the mRNA-Seq data. Since mRNA-Seq data contains a complete picture of which genes are present in the cell, the gene co-expression network contains relations that would otherwise be missed by networks based on physical interactions. After using PageRank to propagate gene significance, we adjust the p-values accordingly and generate new ones that are then used with current active module discovery tools.

Figure 1.1: PRASE workflow

The second workflow, Mica, presented in Chapter 6, provides a new solution for active module discovery by integrating microRNA-Seq data, mRNA-Seq data, and the PPI network in one framework. Additionally, a novel microRNA-mRNA integration method is introduced instead of relying on the common correlation-based integration method. Basically, we modify the gene expression values, generated from the mRNA-Seq data, with the microRNA expression values to better approximate the actual protein expression values. Afterwards, the genes are clustered into groups using independent component analysis (ICA) [39]. Finally, the clusters of genes are mapped to the PPI network to extract the active modules.
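As a rough illustration of the significance propagation idea described above, the following Python sketch runs a personalized-PageRank-style iteration on a small weighted co-expression graph. It is not the PRASE implementation: the function name `propagate_scores`, the use of -log10(p) as the initial score, the damping value, and the toy graph are all assumptions made for illustration, and the scaling step that converts ranks back into p-values is omitted.

```python
def propagate_scores(adjacency, scores, damping=0.85, iterations=50):
    """Personalized-PageRank-style propagation on an undirected weighted graph.

    adjacency: dict node -> dict of neighbor -> positive edge weight (symmetric)
    scores:    dict node -> non-negative initial significance, e.g. -log10(p)
    Returns a dict node -> rank; ranks sum to 1 when no node is isolated.
    """
    nodes = list(adjacency)
    total = sum(scores[n] for n in nodes)
    # Restart distribution: the normalized initial significance of each gene.
    restart = {n: scores[n] / total for n in nodes}
    # Weighted degree of each node, used to split its rank among neighbors.
    strength = {n: sum(adjacency[n].values()) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        new_rank = {}
        for n in nodes:
            # Rank flowing into n from each neighbor m, proportional to weight.
            incoming = sum(rank[m] * w / strength[m]
                           for m, w in adjacency[n].items())
            new_rank[n] = (1 - damping) * restart[n] + damping * incoming
        rank = new_rank
    return rank


# Hypothetical toy network: gene B has a weak p-value but sits between two
# highly significant genes A and C.
toy_network = {
    "A": {"B": 1.0},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"B": 1.0},
}
toy_scores = {"A": 3.0, "B": 0.1, "C": 3.0}  # assumed -log10(p) values
toy_ranks = propagate_scores(toy_network, toy_scores)
```

In this toy example, gene B ends up with a much larger share of the total rank than its own initial score would give it, which mirrors the stated goal of boosting genes that are surrounded by significant neighbors.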
[Figure: microRNA expression profiles and gene expression profiles (controls and cases) -> integration -> adjusted gene expressions -> ICA -> connected module extraction on the PPI network -> modules]

Figure 1.2: MICA workflow
The contributions of this dissertation can be summarized as follows:
• A comprehensive comparison between the different short sequence mapping tools. The comparison covers different perspectives, such as the effect of changing the default settings, the algorithm, the type and size of the inputs, and the reference genomes.
• A comprehensive analysis of the effect of mRNA-Seq and microarray data on the quality of the extracted active modules. The study shows how mRNA-Seq data greatly affects the size and the comprehensiveness of the obtained active modules, hence arguing for the development of better and more efficient tools for solving the active module discovery problem.
• The first workflow to make use of mRNA-Seq data properties and integrate gene co-expression and PPI networks to extract more disease-related active modules.
• A novel integration approach for microRNA and mRNA data that is further integrated with the PPI network in one framework for more accurate active module discovery.
Chapter 2: Background and Related Work
There are different types of biological data, each giving a different perspective of how the cell works. The data types are generated by different technologies, each developed to measure or observe the behavior of different elements in the cell. Due to the specific properties of each data type, many algorithms were developed to analyze them separately. Recently, however, more focus has been placed on integrating the data types to gain a better understanding of the cell mechanism.
In this chapter, we first explain the main elements of the cell and how the cell functions. Then, we briefly discuss the new technologies and their pros and cons to understand their impact on the active module discovery problem. Finally, we present the state-of-the-art algorithms for the active module discovery problem and discuss their main drawbacks.
2.1 DNA and the central dogma
Deoxyribonucleic acid (DNA) is the blueprint of biological life. DNA is found in all living organisms, storing complex information about the function and behavior of each cell. In most cells, the DNA molecule consists of two biopolymer strands coiled around each other to form a double helix. The strands are composed of smaller organic molecules called nucleotides. Each nucleotide contains one of four bases: guanine (G), adenine (A), thymine (T), or cytosine (C). Therefore, the nucleotides are usually referred to as T, C, G, or A.

A gene is a small segment of the DNA that codes for proteins, which are molecules responsible for defining how the cell works and functions. Additionally, a gene can code for more than one protein. For instance, in human DNA, there are around 20,000 genes coding for around 100,000 proteins. To achieve this one-to-many relation, each coding gene can generate different isoforms, each with a different sequence. Each isoform is then mapped to the corresponding protein. Similar to DNA, genes are composed of nucleotides. Proteins, on the other hand, are composed of molecules called amino acids.

Genes in the DNA go through different phases in order to be finally translated into proteins. These phases are called the central dogma of biology. A simplified view of the central dogma is shown in Figure 2.1. First, genes in the DNA are transcribed into RNA, which is a single-stranded biopolymer. Then, the genes in the RNAs are shaped into one of their isoforms, producing mRNAs. Finally, the mRNAs are translated into the final proteins. A gene is called expressed in a cell if the cell-specific mRNA contains one or more copies of the gene sequence. Additionally, a gene is called differentially expressed (DE) in a diseased cell if it is either up-regulated, i.e., there are more copies of the gene sequence in the diseased cell than in the healthy cell, or down-regulated, i.e., there are fewer copies of the gene sequence in the diseased cell than in the healthy cell.
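The up-regulated and down-regulated definitions above can be made concrete with a small Python sketch based on the log2 fold change between a diseased and a healthy measurement. This is only an illustration: the function name `classify_regulation`, the pseudocount, and the threshold of one log2 unit are assumptions, and practical analyses rely on statistical tests across replicate samples rather than on a single pair of counts.

```python
import math


def classify_regulation(healthy_count, diseased_count,
                        min_abs_log2fc=1.0, pseudocount=1.0):
    """Label a gene by comparing its copy counts in diseased vs. healthy cells.

    The pseudocount avoids division by zero for genes with no observed copies.
    The threshold (an assumed value) is the minimum absolute log2 fold change
    required before the gene is called differentially expressed.
    """
    log2fc = math.log2((diseased_count + pseudocount) /
                       (healthy_count + pseudocount))
    if log2fc >= min_abs_log2fc:
        return "up-regulated"
    if log2fc <= -min_abs_log2fc:
        return "down-regulated"
    return "not differentially expressed"


# Hypothetical counts for three genes (healthy, diseased).
labels = {
    "gene1": classify_regulation(10, 100),
    "gene2": classify_regulation(100, 10),
    "gene3": classify_regulation(50, 55),
}
```

With these assumed counts, the first gene is labeled up-regulated, the second down-regulated, and the third is not called differentially expressed because its change is small.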
[Figure: DNA -> (transcription) -> RNA -> (translation) -> proteins]

Figure 2.1: Central dogma of biology
2.1.1 Measuring gene expression levels
Measuring gene expression levels and determining which isoform is active are crucial for understanding which genes are differentially expressed in a disease, thus shedding light on the disease mechanism. Many techniques have been used for these measurements, such as Northern blots, expressed sequence tags (ESTs), serial analysis of gene expression (SAGE), and reverse transcription PCR (RT-PCR) [40, 41]. However, these techniques suffered from limitations on the number of genes that can be analyzed in parallel [41]. Newer techniques, such as microarrays and RNA-Seq, were developed to better measure gene expression levels.

Microarray technology is based on combining the RNA in the cell with short sequences of genes that might exist in the cell. For instance, if the sequence of gene g binds with the RNA in cell C, then we know that g is expressed in C. However, microarrays require prior knowledge about the structure of a gene. Additionally, analyzing genes in new genomes is hard due to the unavailability of probes for these genomes [40].

RNA-Seq is a more recent technique for gene expression analysis using high throughput sequencing [42]. Basically, the RNA, which is single-stranded, is converted to cDNA, which is double-stranded. The library of cDNA is then sequenced using high throughput sequencing technology, generating thousands or even millions of short sequences, also called reads. The short sequences are then either mapped to a reference genome to generate the gene expression levels and isoforms or, in the absence of a reference genome, assembled de novo to generate a complete transcriptome. RNA-Seq made measuring gene expression levels easier and more accurate. For instance, unlike microarrays, RNA-Seq can detect unidentified genes without requiring any information about the distinct isoforms of the gene [40]. On the other hand, although its cost is continuously decreasing, RNA-Seq has always been regarded as a more expensive technique than microarrays.
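To make the step from mapped reads to expression levels more tangible, the sketch below counts how many reads were assigned to each gene and rescales the counts to counts per million (CPM). This is a minimal illustration under stated assumptions: the input is a hypothetical list with one gene label per read (None for unmapped reads), the function name `counts_per_million` is made up, and CPM is only one common normalization, not necessarily the one used in this work.

```python
from collections import Counter


def counts_per_million(read_gene_assignments):
    """Turn per-read gene assignments into counts-per-million values.

    read_gene_assignments: iterable with one entry per read, holding the gene
    the read was mapped to, or None if the read could not be mapped.
    """
    # Count mapped reads per gene, ignoring unmapped reads.
    counts = Counter(g for g in read_gene_assignments if g is not None)
    total_mapped = sum(counts.values())
    # Rescale so that the values of all genes sum to one million.
    return {gene: c * 1_000_000 / total_mapped for gene, c in counts.items()}


# Hypothetical mapping result for four reads, one of which is unmapped.
example_cpm = counts_per_million(["geneA", "geneA", "geneB", None])
```

Normalizing by the total number of mapped reads makes samples with different sequencing depths comparable, which is needed before expression levels from different samples are contrasted.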
2.1.2 Other elements in the central dogma
Figure 2.1 shows the main steps through which genes are finally transformed into proteins. However, the cell uses different post-transcriptional mechanisms that further modify the final protein expression level. Therefore, gene expression levels are not always representative of the corresponding protein expression levels.

One of the well-known mechanisms the cell uses to regulate protein expression levels is microRNAs (miRNAs). miRNAs are small non-coding RNAs used by the cell to post-transcriptionally regulate gene expression levels [43]. They inhibit protein synthesis either by stopping protein translation or by degrading the mRNA. miRNAs constitute an important inhibition mechanism that has been shown to be very important in different diseases, specifically in cancer progression [44]. For instance, miRNAs were found to be differentially expressed in breast cancer, and their expression successfully classified estrogen receptor, progesterone receptor, and HER2/neu status [45].

Recently, high throughput sequencing has also been used to measure the expression levels of miRNAs. miRNAs are processed in the same way as mRNAs. However, miRNA sequences are much shorter than mRNA sequences. In addition, their number is much smaller than the number of mRNAs, e.g., approximately 1000 miRNAs exist in human cells [46].
2.2 Active module discovery problem
The active module discovery problem is basically the problem of extracting genes exhibiting certain properties. Such properties could be similar interaction patterns, high correlation, or maximizing a certain function. A well assumed and studied behavior is the density of interactions between genes. Even though focusing on the network structure lacks the use of dynamic data, it formed the basis for active module discovery. Therefore, we briefly discuss first the dense module extraction problem and the key challenges.
Many algorithms have been developed to find clusters of densely interacting genes in the PPI network or the gene co-expression network, including the work of [14, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]. While addressing the challenges differently, most of the algorithms focus on solving the following challenges:
• Generating overlapping clusters or, in other words, soft clusters, hence allowing genes to exist in more than one cluster.
• Handling the high degree problem of hub nodes. If hub nodes are not properly handled, they would lead to one or two large clusters and other singleton clusters, hence leading to non-informative ones.
• Handling noisy interactions. The interactions (edges) in the PPI network are usually noisy. Therefore, a confidence score for each edge should be taken into consideration to handle the noisy edges.
To handle the above challenges, Asur et al. developed an algorithm that took the topological properties of the underlying PPI network into consideration in the clus- tering phase [4]. They used two graph-based matrices, namely, clustering coefficient and betweenness centrality to find similar vertices and to group them together. The proposed algorithm explicitly handled the hub nodes and tried to cluster them into multiple groups. Additionally, the weights of the edges in the PPI network, i.e., con-
fidence scores, were taken into consideration. Shih and Parthasarathy introduced an iterative based Markov Clustering algorithm to solve the soft clustering problem [9].
Zhang and Li solved the soft clustering problem by introducing a consensus clustering based algorithm [10]. Basically, given different possible input clusters for the data, the algorithm generated more than one consensus cluster. Even though it was not designed for PPI networks specifically, it was shown to be effective in obtaining mean- ingful clusters from the PPI network as well. Inoue et al. solved the soft clustering and the hub node problem by introducing a random walk based algorithm [11]. The algorithm first generated a diffusion model of the PPI network, then random walks were applied on this model to generate the clusters. Another random walk based algorithm was introduced by Macropol et al [5]. The algorithm performed repeated random walks with restarts on the PPI network to find local clusters. The surround- ings of a seed node were examined to see how approximate they were to the seed node.
The stopping criterion was either finding a cluster of size k or the shortest distance from the seed node to a candidate protein exceeding a certain threshold. Li et al. developed another algorithm that was based on dividing the PPI network into three subgraphs, one for high-degree nodes, one for low-degree nodes, and one for the relations between high-degree and low-degree nodes [6]. Li et al. aimed at finding overlapping clusters, hence, they allowed high-degree nodes to exist in more than one cluster. The average size of the obtained modules was five.
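The random walk with restart idea mentioned above can be illustrated with a minimal sketch. It is not the implementation of Macropol et al.; the function names, the restart probability of 0.3, the unweighted toy graph, and the "take the top k nodes" rule are all assumptions made for illustration only.

```python
def random_walk_with_restart(adjacency, seed, restart=0.3, tol=1e-12, max_iter=10000):
    """Approximate the stationary visiting probabilities of a walk that
    jumps back to `seed` with probability `restart` at every step.

    adjacency: dict mapping each node to the list of its neighbours.
    """
    prob = {node: 0.0 for node in adjacency}
    prob[seed] = 1.0
    for _ in range(max_iter):
        nxt = {node: 0.0 for node in adjacency}
        for node, mass in prob.items():
            neighbours = adjacency[node]
            if not neighbours:
                # Isolated node: send its walking mass back to the seed.
                nxt[seed] += (1.0 - restart) * mass
                continue
            share = (1.0 - restart) * mass / len(neighbours)
            for nb in neighbours:
                nxt[nb] += share
        nxt[seed] += restart  # restart step (total probability mass is 1)
        change = sum(abs(nxt[n] - prob[n]) for n in adjacency)
        prob = nxt
        if change < tol:
            break
    return prob


def local_cluster(adjacency, seed, k):
    """Return the k nodes with the highest affinity to the seed (assumed rule)."""
    prob = random_walk_with_restart(adjacency, seed)
    return sorted(prob, key=prob.get, reverse=True)[:k]


# Toy path graph a - b - c - d (hypothetical, not real PPI data).
toy_ppi = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
scores = random_walk_with_restart(toy_ppi, "a")
```

In this toy graph the affinity decreases with the distance from the seed, which is the property a local clustering procedure of this kind relies on.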
Even though the mentioned algorithms have proven their efficiency in extracting functional modules, using only one type of data suffers from a lot of drawbacks [19, 24].
One important drawback is that the PPI network is actually static and cannot give actual information about the underlying dynamics of the cell [36, 47]. In more detail, the interactions in the PPI networks are obtained under different conditions and from different cells. However, the edges in the PPI network do not contain the underlying condition information [48]. Hence, even though the static graph based algorithms assume that they are returning functionally important modules, such modules might turn out to be false positives. Another drawback is the noise and the bias in the high throughput technologies used to measure the interactions [49]. Therefore, depending only on one type of data would raise doubts about the quality and reproducibility of the results. Thus, the integration of other forms of dynamic data, such as the gene expression, has become inevitable [35, 48, 24].
Most of the algorithms designed to solve the active module discovery problem have been mainly concerned with integrating gene expression data with the PPI network. Table 2.1 shows the most famous algorithms and the common features between them. The first algorithm to introduce the idea of integrating Microarray gene expression data and the PPI network for active module discovery was jActiveModules developed by Ideker et al. [28]. Ideker et al. defined the highest active module as the connected module that has the highest weight, where the weights on the nodes are calculated from the genes’ p-values. Finding the maximum weighted module is an NP-hard problem. Therefore, they introduced a simulated annealing based algorithm to approximate the solution. A key feature of jActiveModules is that it does not restrict its search to the weighted nodes only but also considers nodes connecting other important weighted nodes. This feature is highly important for Microarray data since Microarray data contains information for only a few genes. Additionally, jActiveModules processes each sample separately and then finds the most informative module among all of the samples. On the other hand, jActiveModules assumes that there are control-case pairs, which is not always true. In addition, it gives more importance to genes that are differentially expressed (DE). However, there are other genes that might not be DE but their interaction patterns might be a marker for the disease [50].
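As a small illustration of how node weights can be derived from p-values and combined into a module weight, the sketch below uses the form commonly attributed to Ideker et al.: each p-value is turned into a z-score through the inverse normal CDF, and the scores of the k genes in a module are summed and divided by the square root of k. The text above only states that the weights come from the p-values, so the exact formula, the function names, and the omission of the calibration against random gene sets are assumptions of this sketch.

```python
from math import sqrt
from statistics import NormalDist


def z_score(p_value):
    """Map a p-value in (0, 1) to a z-score; small p-values give large scores."""
    return NormalDist().inv_cdf(1.0 - p_value)


def module_score(p_values):
    """Aggregate score of a module: sum of z-scores divided by sqrt(k)."""
    z = [z_score(p) for p in p_values]
    return sum(z) / sqrt(len(z))


# Hypothetical p-values for two candidate modules of three genes each.
active = module_score([0.01, 0.02, 0.03])
inactive = module_score([0.40, 0.50, 0.60])
```

A search procedure such as simulated annealing would then look for the connected subnetwork that maximizes this kind of score.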
Following this track, many other algorithms have been introduced either to tune jActiveModules or to introduce a new optimization function. For instance, the GXNA algorithm adjusts jActiveModules by using another scoring function; instead of using p-values, it directly uses the gene expression values [51]. PinnacleZ is an implementation of the algorithm introduced by Chuang et al. [52]. The notable change in the algorithm of Chuang et al. in comparison to jActiveModules is the function used to determine the relevance of a module to case samples in comparison to the control samples. Specifically, for each module discovered, they calculated the mutual information between the module scores and the sample class. The Heinz algorithm also works on the p-values [29]. However, Heinz combines all of the p-values together into one value. Then, it finds the active modules by transforming the problem into the well-known prize-collecting Steiner tree problem (PCST). Despite the optimality of the algorithm, relying on the assumption that a given gene should have similar p-values across the samples is not realistic, especially for large datasets and heterogeneous diseases. Backes et al. also introduced an integer linear programming based algorithm to find the maximal weighted module [53]. However, they put three constraints to address the problem: first, the module should be the heaviest and not only of maximal weight, i.e., a dense module; second, the module should be reachable from a root node; and third, the module size should be at most k.
Instead of defining the problem as finding the maximal weighted module, Ulitsky et al. introduced two algorithms that pose other definitions, MATISSE [54] and DEGAS [30].
MATISSE algorithm tries to find the groups of connected genes exhibiting the same expression behavior. Such modules are discovered by projecting gene correlation values into the PPI network and finding modules of genes with similar correlation.
MATISSE is further restricted to highly DE genes by using a fold change threshold.
An obvious drawback of MATISSE is the assumption of linear correlation between the genes that is maintained across most of the samples. The DEGAS algorithm is a set cover based algorithm that searches for the set of k DE genes that cover most of the samples. DEGAS uses the p-value with a cutoff of 0.05 to determine whether a gene is DE. However, the p-value does not always reflect the differential expression and hence the importance of the genes.
Another approach to solve the problem that recently gained popularity is the modeling of the problem as an information flow based one. The appeal of the information flow based approach is twofold: first, genes that are not differentially expressed gain importance based on the importance of their interactions; second, it takes the global and local network structure into consideration. An example of an algorithm developed based on this idea is NetWalk [55, 56]. NetWalk is a random walk with restart based algorithm. In NetWalk, the gene expression values are used to weight the nodes. Then, the weighted nodes are used to calculate the transition probability from one node to another. A random walk based approach is then applied on the weighted PPI network to generate the final rank of each node. NetWalk finally extracts the active modules by extracting the connected edges with the highest weights. NetWalk also has an interactive interface to compare the active modules obtained from each sample. On the other hand, using information flow for integrating gene expression with the PPI network poses the problem of how to combine and find the most informative active module across all of the samples. In general, information flow based approaches were mainly used to integrate other data types, such as disease similarity data [32] and mutated genes information [57].
Table 2.1: Famous active module discovery algorithms and the features common between them. DE refers to the method used to define DE genes, Local net topology refers to the use of the network topology, Samples diff. refers to the assumption that samples are different, Case-control pairs refers to the assumption that there is a case-control pair of samples for each patient. p stands for p-value, z stands for z-score, and fc stands for fold change.
tool    Basic algorithm    DE    Local net topology    Samples diff.    Case-control pairs    ref
jActiveModules    Score summation    p    [28]
GXNA    Score summation    [51]
heinz    Score summation    p    [29]
PinnacleZ    Score summation    z    [52]
MATISSE    Correlation    fc    [54]
DEGAS    Set cover    p    [30]
NetWalk    Random walks    [56]
2.3 microRNA and mRNA integration
Many works integrating miRNAs and mRNA depend on the fact that miRNAs degrade target mRNA, hence, the effect on the gene should be apparent at the gene expression level. For instance, mirConnX constructs an association network between miRNAs, mRNAs, and TFs by applying correlation measures on pairs of them [58].
Another framework, CoMeTa, is also developed to predict miRNA targets [59]. The basic idea is to group miRNA target genes based on their co-expression. Accordingly, CoMeTa can discover new miRNA targets de novo. Jayaswal et al. define a new measure, UD, for association between miRNAs and mRNAs [60]. The main goal of using the UD measure is to calculate the association between miRNA and mRNA without requiring a match between miRNA and mRNA samples. The UD measure basically calculates the average difference in expression between the control and case samples for miRNAs and mRNAs. After discretizing the average expression values, a statistical test is used to measure the independence between the change in miRNA and mRNA expressions. mirDREM is another algorithm developed to understand the relation between miRNAs and mRNAs [61]. mirDREM constructs a probabilistic model for the regulation of mRNA expression values using miRNAs and TFs. mirDREM is not concerned with predicting new miRNA targets; rather, it is concerned with modeling the dynamic behavior of miRNAs and their effect on mRNAs in the different conditions.
Indeed, the above mentioned work is valuable for predicting new miRNA targets. However, for understanding the dynamic behavior of miRNAs and mRNAs and the relation between miRNA target genes, such methods are not sufficient since the final protein expression level can be significantly affected by miRNAs without having any apparent effect on the gene expression level [62, 63]. A possible solution for overcoming the correlation constraint at the expression level between miRNA and mRNAs was introduced by Cun and Fröhlich [64]. The solution is based on integrating the PPI network with the miRNA target gene network and then applying random walks on the heterogeneous network to rank the genes accordingly. Indeed, such integration would work around the miRNA and mRNA integration problem. However, by focusing only on prioritizing genes through the PPI network, they cannot detect connected modules of genes with indirect dependencies, e.g., through other genes not in the PPI network or through other genes with no change in expression at the mRNA level. Additionally, Cun and Fröhlich do not treat each sample differently; rather, they calculate the t-score for each gene using all of the samples.
Chapter 3: An Evaluation of RNA-Seq Mapping Tools
Next-generation sequencing (NGS) technology has evolved rapidly in the last five
years, leading to the generation of hundreds of millions of sequences (reads) in a
single run. The number of generated reads varies between 1 million for long reads
generated by Roche/454 sequencer (≈400 base pairs (bps)) and 2.4 billion for short reads generated by Illumina/Solexa and ABI/SOLiD™ sequencers (≈75 bps). The invention of the high-throughput sequencers has led to a significant cost reduction, e.g., a Megabase of DNA sequence costs only $0.1 [65].
Nevertheless, the large amount of generated data tells us almost nothing about the DNA, as stated by Flicek and Birney [66]. This is due to the lack of proper analysis tools and algorithms. Therefore, bioinformatics researchers started to think about new ways to efficiently handle and analyze this large amount of data.
One of the areas that attracted many researchers is the alignment (mapping) of the generated sequences, i.e., the alignment of reads generated by NGS machines to a reference genome. An efficient alignment of this large amount of reads with high accuracy is a crucial part of the workflow of many applications, such as genome resequencing [66], DNA methylation [67], RNA-Seq [68], ChIP sequencing, SNPs detection [69], genomic structural variants detection [70], and metagenomics [71]. Therefore, numerous tools have been developed to undertake this challenging task including MAQ [72], RMAP [73], GSNAP [74], Bowtie [75], Bowtie2 [76], BWA [77], SOAP2 [78], Mosaik [79], FANGS [80], SHRiMP [81], BFAST [82], MapReads [83], SOCS [84], PASS [85], mrFAST [70], mrsFAST [86], ZOOM [87], Slider [88],
SliderII [89], RazerS [90], RazerS3 [91], and Novoalign [92]. Moreover, GPU-based
tools have been developed to optimally map more reads such as SARUMAN [93] and
SOAP3 [94]. However, due to using different mapping techniques, each tool provides
different trade-offs between speed and quality of the mapping. For instance, the
quality is often compromised in the following ways to reduce runtime:
• Neglecting base quality score.
• Limiting the number of allowed mismatches.
• Disabling gapped alignment or limiting the gap length.
• Ignoring SNP information.
In most cases, it is unclear how such compromises affect the performance of newly developed tools in comparison to the state of the art ones. Therefore, many studies have been carried out to provide such comparisons. Some of the available studies were mainly focused on providing new tools (e.g., [74, 77]). The remaining studies tried to provide a thorough comparison while each covering a different aspect (e.g., [95, 96,
97, 98, 99]).
For instance, Li and Homer [95] classified the tools into groups according to the used indexing technique and the features the tools support such as gapped alignment,
long read alignment, and bisulfite-treated reads alignment. In other words, in that work, the main focus was classifying the tools into groups rather than evaluating their performance on various settings.
Similar to Li and Homer, Fonseca et al. [99] provided another classification study.
However, they included more tools in the study, around 60 mappers, while being more focused on providing a comprehensive overview of the characteristics of the tools.
Ruffalo et al. [97] presented a comparison between Bowtie, BWA, Novoalign,
SHRiMP, mrFAST, mrsFAST, and SOAP2. Unlike the above mentioned studies,
Ruffalo et al. evaluated the accuracy of the tools in different settings. They defined a read to be correctly mapped if it maps to the correct location in the genome and has a quality score higher than or equal to the threshold. Accordingly, they evaluated the behavior of the tools while varying the sequencing error rate, indel size, and indel frequency. However, they used the default options of mapping tools in most of the experiments. In addition, they considered small simulated data sets of 500,000 reads of length 50 bps while using an artificial genome of length 500Mbp and the Human genome of length 3Gbp as the reference genomes.
Another study was done by Holtgrewe et al. [96], where the focus was the sensitivity of the tools. They enumerated the possible matching intervals with a maximum distance k for each read. Afterwards, they evaluated the sensitivity of the mappers according to the number of intervals they detected. Holtgrewe et al. used the suggested sensitivity evaluation criteria to evaluate the performance of SOAP2, Bowtie, BWA, and SHRiMP2 on both simulated and real datasets. However, they used small reference genomes (the S. cerevisiae genome of length 12 Mbp and the D. melanogaster genome of length 169 Mbp). In addition, the experiments were performed on small real data sets of 10,000 reads. For evaluating the performance of the tools on real data sets, Holtgrewe et al. used RazerS to detect the possible matching intervals. RazerS is a fully sensitive mapper, hence it is very slow [86]. Therefore, scaling the suggested benchmark process to realistic whole genome mapping experiments with millions of reads is not practical. Nevertheless, after the initial submission of this work, RazerS3 [91] was published, making a significant improvement in the running time of the evaluation process.
Schbath et al. [98] also focused on evaluating the sensitivity of the sequencing tools. They evaluated whether a tool correctly reports a read as unique or not. In addition, for non-unique reads they evaluated whether a tool detects all of the mapping locations. However, in their work, like many previous studies, the tools were used with default options, and they tested the tools with a very small read length of 40 bps. Additionally, the error model they used did not include indels and allowed only 3 mismatches.
Even though many studies have been published for evaluating short sequence mapping tools, the problem is still open and further perspectives were not tackled in the current studies. For instance, the above studies did not consider the effect of changing the default options and using the same options across the tools. In addition, some of the studies used small data sets (e.g., 10,000 and 500,000 reads) while using small reference genomes (e.g., 169 Mbp and 500 Mbp) [97, 96]. Furthermore, they did not take the effect of input properties and algorithmic features into account.
Here, input properties refer to the type of the reference genome and the properties of the reads including their length and source. Algorithmic features, on the other hand, pertain to the features provided by the mapping tool regarding its performance
and utility. Therefore, there is still a need for a quantitative evaluation method to systematically compare mapping tools in multiple aspects. In this work, we address this problem and present two different sets of experiments to evaluate and understand the strengths and weaknesses of each tool. The first set includes the benchmarking suite, consisting of tests that cover a variety of input properties and algorithmic features. These tests are applied on real RNA-Seq data and genomic resequencing synthetic data to verify the effectiveness of the benchmarking tests. The real data set consists of 1 million reads while the synthetic data sets consist of 1 million reads and 16 million reads. Additionally, we have used multiple genomes with sizes varying from 0.1 Gbp to 3.1 Gbp. The second set includes a use case experiment, namely,
SNP calling, to understand the effects of mapping techniques on a real application.
Furthermore, we introduce a new, albeit simple, mathematical definition for the mapping correctness. We define a read to be correctly mapped if it is mapped while not violating the mapping criteria. This is in contrast to previous works where they define a read to be correctly mapped if it maps to its original genomic location.
Clearly, if one knows “the original genomic location”, there is no need to map the reads. Hence, even though such a definition can be considered more biologically relevant, unfortunately this definition is neither sufficient nor computationally achievable. For instance, a read could be mapped to the original location with two mismatches (i.e., substitution error or SNP) while there might exist a mapping with an exact match to another location. If a tool does not have any a-priori information for the data, it would be impossible to choose the two-mismatch location over the exact matching one. One can only hope that such a tool can return “the original genomic location” when the user asks the tool to return all matching locations with two mismatches or less. Indeed, as later shown in the results, our suggested definition is computationally more accurate than the naïve one. In addition, it complements other definitions such as the one suggested by Holtgrewe et al. [96].
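The difference between the two correctness definitions can be made concrete with a short sketch. It is only an illustration under simplifying assumptions: the mapping criterion is reduced to a maximum number of mismatches (no gaps, no quality scores), and the reference, the read, and all function names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def naive_correct(reported_pos, origin_pos):
    """Naive definition: correct only if the read maps back to its origin."""
    return reported_pos == origin_pos


def criteria_correct(reference, read, reported_pos, max_mismatches):
    """Criteria-based definition: correct if the reported location satisfies
    the mapping criterion (here: at most `max_mismatches` mismatches)."""
    window = reference[reported_pos:reported_pos + len(read)]
    return len(window) == len(read) and hamming(window, read) <= max_mismatches


# Hypothetical reference: the read originates from position 0 (two sequencing
# errors were introduced) and matches position 14 exactly.
reference = "TTGACCAGTACCCCTTGACGAGTT"
read = "TTGACGAGTT"
origin = 0
```

With these toy sequences, a tool that reports position 14 is penalized by the naive definition although it found an exact match, while the criteria-based definition accepts both position 14 and, when two mismatches are allowed, position 0.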
To assess our work, we apply these tests on nine well known short sequence mapping tools, namely, Bowtie, Bowtie2, BWA, SOAP2, MAQ, RMAP, Novoalign,
GSNAP, and mrFAST (mrsFAST). Unlike the other tools in this study, mrFAST (mrsFAST) is a fully sensitive exact mapper that reports all the mapping locations. Therefore, comparing the mapping accuracy of mrFAST with the remaining tools is beneficial in further understanding the behavior of the different tools, even though comparing the execution time will not be fair. Moreover, we compare the performance of these tools with that of FANGS, a long read mapping tool, to show their effectiveness in handling long reads. The remaining tools were chosen according to the indexing techniques they use. Therefore, we can emphasize the effect of the indexing technique on the performance. The experiments are carried out while using the same options for the tools, whenever possible.
The chapter is organized as follows: in the next section, we briefly describe the sequence mapping problem, the mapping techniques used by the tools, and various evaluation criteria used to evaluate the performance of the tools including other definitions for mapping correctness. Then, we discuss how we designed the benchmarking suite and give a real application for the mapping problem. Finally, we present and explain the results for our benchmarking suite.
3.1 Background
Inexact matching of DNA sequences to a genome is a special case of string matching. It requires incorporating the known properties or features of the DNA sequences and the sequencing technologies, adding additional complexity to the mapping process. In this section, we first give a brief description of a set of features of DNA and sequencing technologies. Then, we explain how the tools used in this study work and support these features. Additionally, we describe the default options setup and show how divergent they are among the tools. Finally, we compare the evaluation criteria used in previous studies.
3.1.1 Features
• Seeding represents the first few tens of base pairs of a read. The seed part of a
read is expected to contain fewer erroneous characters due to the specifics of the
NGS technologies. Therefore, the seeding property is mostly used to maximize
performance and accuracy.
• Base quality scores provide a measure on correctness of each base in the read.
The base quality score is assigned by a phred-like algorithm [100, 101]. The
score Q is equal to −10 log₁₀(e), where e is the probability that the base is wrong. Some tools use the quality scores to decide mismatch locations. Others
accept or reject the read based on the sum of the quality scores at mismatch
positions.
• Existence of indels necessitates inserting or deleting nucleotides while mapping
a sequence to a reference genome (gaps). The complexity of choosing a gap
location increases with the read length. Therefore, some tools do not allow any
gaps while others limit their locations and numbers.
• Paired-end reads result from sequencing both ends of a DNA molecule. Mapping
paired-end reads increases the confidence in the mapping locations due to having
an estimation of the distance between the two ends.
• Color space read is a read type generated by SOLiD sequencers. In this tech-
nology, overlapping pairs of letters are read and given a number (color) out of
four numbers [81]. The reads can be converted into bases, however, performing
the mapping in color space has advantages in terms of error detection.
• Splicing refers to the process of cutting the RNA to remove the non-coding part
(introns) and keeping only the coding part (exons) and joining them together.
Therefore, when sequencing the RNA, a read might be located across exon-exon
junctions. The process of mapping such reads back to the genome is hard due
to the variability of the intron length. For instance, the intron length ranges
between 250 and 65,130 nt in eukaryotic model organisms [102].
• SNPs are variations of a single nucleotide between members of the same species.
SNPs are not mismatches. Therefore, their locations should be identified before
mapping reads in order to correctly identify actual mismatch positions.
• Bisulphite treatment is a method used for the study of the methylation state
of the DNA [67]. In bisulphite treated reads, each unmethylated cytosine is
converted to uracil. Therefore, they require special handling in order not to
misalign the reads.
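To make the base quality score feature listed above concrete, the following sketch converts between error probabilities and phred-like scores and sums the quality scores at mismatch positions, which is the quantity some tools compare against a threshold. The function names are hypothetical, and the ASCII offset of 33 assumes Sanger-style FASTQ encoding of the quality string.

```python
from math import log10


def phred_from_error(error_probability):
    """Q = -10 * log10(e), where e is the probability that the base is wrong."""
    return -10.0 * log10(error_probability)


def error_from_phred(quality):
    """Inverse of the formula above: e = 10 ** (-Q / 10)."""
    return 10.0 ** (-quality / 10.0)


def decode_quality(quality_string, offset=33):
    """Decode an ASCII quality string (offset 33 is an assumption)."""
    return [ord(ch) - offset for ch in quality_string]


def mismatch_quality_sum(read, reference_window, qualities):
    """Sum of the base quality scores at the mismatch positions."""
    return sum(q for r, g, q in zip(read, reference_window, qualities) if r != g)


# Hypothetical read with one low-quality mismatch at the third base.
quals = decode_quality("II5I")
penalty = mismatch_quality_sum("ACGT", "ACCT", quals)
```

A mismatch at a low-quality base contributes little to the sum, so a read with several such mismatches can still be accepted by a tool that thresholds on this sum.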
3.1.2 Tools’ description
For most of the existing tools (and for all the ones we consider), the mapping process starts by building an index for the reference genome or the reads. Then, the index is used to find the corresponding genomic positions for each read. There are many techniques used to build the index [95]. The two most common techniques are the following:
• Hash Tables:
The hash based methods are divided into two types: hashing the reads and
hashing the genome. In general, the main idea for both types is to build a
hash table for subsequences of the reads/genome. The key of each entry is a
subsequence while the value is a list of positions where the subsequence can be
found. Hashing based tools include the following tools:
GSNAP [74] is a genome indexing tool. The hash table is built by dividing
the reference genome into overlapping oligomers of length 12 sampled every 3
nucleotides. The mapping phase works by first dividing the read into smaller
substrings, finding candidate regions for each substring, and finally combining
the regions for all of the substrings to generate the final results. GSNAP was
mainly designed to detect complex variants and splicing in individual reads.
However, in this study, GSNAP is only used as a mapper to evaluate its effi-
ciency.
Novoalign [92] is a genome indexing tool. Similar to GSNAP, the hash table is
built by dividing the reference genome into overlapping oligomers. The mapping phase uses
the Needleman-Wunsch algorithm with affine gap penalties to find the global optimum alignment.
mrFAST and mrsFAST [70, 86] are genome indexing tools. They build a collision free hash table to index k-mers of the genome. mrFAST and mrsFAST are both developed with the same method; however, the former supports gaps and mismatches while the latter supports only mismatches to run faster. Therefore, in the following, we will use mrsFAST for experiments that do not allow gaps and mrFAST for experiments that allow gaps. Unlike the other tools, mrFAST and mrsFAST report all of the available mapping locations for a read. This is important in many applications such as structural variants detection.
FANGS [80] is a genome indexing tool. In contrast to the other tools, it is designed to handle the long reads generated by the 454 sequencer.
MAQ [72] is a read indexing tool. The algorithm works by first constructing multiple hash tables for the reads. Then, the reference genome is scanned against the tables to find the mapping locations.
RMAP [73] is a read indexing tool. Similar to MAQ, RMAP pre-processes the reads to build the hash table, then the reference genome is scanned against the hash table to extract the mapping locations.
Most of the newly developed tools are based on indexing the genome. Nevertheless, MAQ and RMAP are included in this study to investigate the effectiveness of our benchmarking tests on evaluating read indexing based tools. In addition, we investigate if there is any potential for the read indexing technique to be used in new tools.
• Burrows-Wheeler Transform (BWT):
BWT [103] is an efficient data indexing technique that maintains a relatively
small memory footprint when searching through a given data block. BWT
was extended by Ferragina and Manzini [104] to a newer data structure, named
FM-index, to support exact matching. By transforming the genome into an FM-
index, the lookup performance of the algorithm improves for the cases where a
single read matches multiple locations in the genome. However, the improved
performance comes with a significantly large index build up time compared to
hash tables.
BWT based tools include the following:
Bowtie [75] starts by building an FM-index for the reference genome and then
uses the modified Ferragina and Manzini [104] matching algorithm to find the
mapping location. There are two main versions of Bowtie namely Bowtie and
Bowtie 2. Bowtie 2 is mainly designed to handle reads longer than 50 bps.
Additionally, Bowtie 2 supports features not handled by Bowtie. It was noticed
that both versions had different performance in the experiments. Therefore,
both versions are included in this study.
BWA [77] is another BWT based tool. The BWA tool uses the Ferragina and
Manzini [104] matching algorithm to find exact matches, similar to Bowtie. To
find inexact matches, the authors provided a new backtracking algorithm that
searches for matches between a substring of the reference genome and the query
within a certain defined distance.
SOAP2 [78] works differently than the other BWT based tools. It uses the
BWT and the hash table techniques to index the reference genome in order to
speed up the exact matching process. On the other hand, it applies a “split-read
strategy”, i.e., split the read into fragments based on the number of mismatches,
to find inexact matches.
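The exact matching step on an FM-index, which the BWT based tools above build on, can be sketched as a backward search. This is a teaching sketch, not the code of Bowtie, BWA, or SOAP2: the suffix array is built by naive sorting, the occurrence table is stored uncompressed, and all names are hypothetical.

```python
def build_fm_index(genome):
    """Build a tiny FM-index: suffix array, first-column offsets, and
    occurrence counts of the Burrows-Wheeler transformed text."""
    text = genome + "$"  # "$" is a sentinel smaller than A, C, G, T
    # Naive suffix array construction; fine for a toy genome only.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    # first[c]: number of characters in the text strictly smaller than c.
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return suffix_array, first, occ


def exact_match_positions(index, read):
    """Backward search: process the read from its last character to its
    first, narrowing the suffix array interval [lo, hi) at every step."""
    suffix_array, first, occ = index
    lo, hi = 0, len(suffix_array)
    for ch in reversed(read):
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])


# Hypothetical toy genome.
fm = build_fm_index("ACGTACGTTAGC")
```

All occurrences of a read end up in one contiguous suffix array interval, which is why a read that matches multiple locations costs little extra lookup time.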
In addition to providing different mapping techniques, each tool handles only a subset of the DNA sequences and the sequencing technologies features. Moreover, there are differences in the way the features are handled, which are summarized in Table 3.1. For instance, BWA, SOAP2, and GSNAP accept or reject an alignment based on counting the number of mismatches between the read and the corresponding genomic position. On the other hand, Bowtie, MAQ, and Novoalign use a quality threshold (i.e., alignment score) to perform the same function. The quality threshold is different from the mapping quality. The former is the probability of the occurrence of the read sequence given an alignment location while the latter is the Bayesian posterior probability for the correctness of the alignment location calculated from all of the alignments found for the read.
In some cases, the features are partially supported. For example, SOAP2 supports gapped alignment only for paired end reads, while BWA limits the gap size. Therefore, considering only one of the above features when comparing between the tools would lead to under- or over-estimation of the tools’ performance.
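The genome hashing technique and the mismatch-count acceptance rule described in this section can be combined in a short seed-and-verify sketch. It does not reproduce any specific tool: the k-mer length, the lack of gapped alignment, and the function names are assumptions chosen to keep the example small.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Hash table: key is a k-mer, value is the list of its genome positions."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def count_mismatches(a, b):
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_read(genome, index, read, k, max_mismatches):
    """Look up every k-mer of the read, turn each hit into a candidate start
    position, and keep the candidates within the mismatch budget."""
    hits = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            start = pos - offset
            if start < 0 or start + len(read) > len(genome):
                continue
            window = genome[start:start + len(read)]
            if count_mismatches(window, read) <= max_mismatches:
                hits.add(start)
    return sorted(hits)


# Hypothetical toy genome and read.
genome = "TTGACCAGTACCCCTTGACGAGTT"
kmer_index = build_kmer_index(genome, 4)
read = "TTGACGAGTT"
```

Changing only the mismatch budget changes the set of reported locations for the same read, which is one reason why comparing tools on a single feature can be misleading.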
3.1.3 Default options of the tested tools
Table 3.1: Features supported by each tool. PE: paired-end only, mm.: mismatches, QS: base quality score, count: total count of mismatches in the read, AS: alignment score, and empty cells mean not supported.
Bowtie    Bowtie2    BWA    SOAP2    MAQ    RMAP    GSNAP    FANGS    Novoalign    mrFAST    mrsFAST
Seed mm.    ≤ 3    Any    Up to 2    Any    Any
Non-seed mm.    QS    AS    Count    Count    QS    Count    Count    Count    QS    Count    Count
Var. seed len.    > 5    Any    > 28
Mapping qual.    Yes    Yes    Yes    Yes
Gapped align.    Yes    Yes    PE    PE    Yes    Yes    Yes    Yes
Colorspace    Yes    Yes    Yes    Yes
Splicing    Yes
SNP tolerance    Yes
Bisulphite reads    Yes    Yes    Yes    Yes
In general, using a tool’s default options yields a good performance while maintaining a good output quality. Most users use the tools with the default options or only tweak some of them. Therefore, it is important to understand the effect of using these options and the kind of compromises made while using them. For the nine tools considered in this work, the most crucial default options are the following:
• Maximum number of mismatches in the seed: the seed based tools use a default
value of 2.
• Maximum number of mismatches in the read: Bowtie2, BWA, and GSNAP
determine the number of mismatches based on the read length. It is 10 for
RMAP, 2 for mrsFAST, and 5 for SOAP2, FANGS, and mrFAST.
• Seed length: It is 24 for MAQ, 32 for RMAP, and 28 for Bowtie. BWA disables
seeding while SOAP2 considers the whole read as the seed.
• Quality threshold: It is equal to 70 for MAQ and Bowtie while it depends on
the read length and the genome size for Novoalign.
• Splicing: This option is enabled for GSNAP.
• Gapped alignment: It is enabled for Bowtie2, GSNAP, BWA, Novoalign and
MAQ while it is disabled for SOAP2.
• Minimum and maximum insert sizes for paired-end mapping: The insert size
represents the distance between the two ends. The values used for the minimum
and the maximum insert sizes are 0 and 250 for Bowtie and MAQ, 0 and 500
for BWA and Bowtie2, 400 and 500 for SOAP2, and 100 and 400 for RMAP.
mrFAST and mrsFAST do not have default values for max and min insert sizes.
Indeed, as will be shown in the results section, different default values lead to different results for the same data set. Hence, using the same values when comparing the tools is important.
3.1.4 Evaluation criteria
In general, the performance of the tools is evaluated by considering three aspects, namely, the throughput or the running time, the memory footprint, and the mapping percentage. The throughput is the number of base pairs mapped per second (bps/sec) while the memory footprint is the memory required by the tool to store and process the read or genome index. The mapping percentage is the percentage of reads each tool maps.
The mapping percentage is further divided into a correctly mapped reads part and an error (false positives) part. Many definitions have been suggested for the error in previous studies. For instance, for simulated reads, the naïve and most used definition of error is the percentage of reads mapped to an incorrect location (i.e., a location other than the genomic location the read was originally extracted from) [77, 74]. Clearly, this definition is neither sufficient nor computationally correct. Figure 3.1 gives an example explaining the drawbacks of this definition. After applying sequencing errors, the read does not exactly match the original genomic location. Since the tools do not have any a priori information about the data, a tool cannot be expected to choose the original location, which now contains mismatches, as the best mapping location over an exactly matching one. Therefore, the naïve criterion would judge the tool as incorrectly mapping the read if the tool returned either alignment (2) or (3) while in fact it picked a more accurate match.
[Figure 3.1 content: reference sequence ...CCCGCCGGAAATT..., read CCGCCGGGAA, and three candidate alignments of the read: (1) MQ=40, (2) MQ=35, (3) MQ=50.]
Figure 3.1: An example showing how the different evaluation criteria work. In the upper part of the figure, the sequence in blue is the original genomic position from which the simulated read was extracted. After applying sequencing errors, the read does not exactly match the original location (3 mismatches). In the lower part of the figure, three possible alignment locations for the read are shown with their mapping quality scores (MQ). The naïve criterion would only consider alignment (1) as the correct alignment. Under the criterion of Ruffalo et al. [97], if the threshold is 30, then (1) is correctly mapped while (2) and (3) are incorrectly mapped-strict. On the other hand, if the threshold is 40, then (3) is considered incorrectly mapped-relaxed. The criterion of Holtgrewe et al. [96] would detect (1) and (2) and consider them correctly mapped while (3) would be considered incorrectly mapped.
The naïve definition of the error was further modified by Ruffalo et al. [97] to develop a more concrete definition. The authors incorporated the mapping quality information such that a read is correctly mapped if it is mapped to the original genomic location with a mapping quality greater than a certain threshold. They further categorized the incorrectly mapped reads into incorrectly mapped-strict and incorrectly mapped-relaxed. The incorrectly mapped-strict reads are those mapped with a quality higher than the threshold but not to the original genomic location. On the other hand, the incorrectly mapped-relaxed reads are those mapped to an incorrect location with a quality higher than the threshold when there is no correct mapping for the read with a mapping quality higher than the threshold. As an example, in Figure 3.1, if the threshold is 30, then the read would be considered correctly mapped if the tool returned alignment (1) while it would be considered incorrectly mapped-strict if the tool returned either alignment (2) or (3). On the other hand, if the threshold is 40, the read would be incorrectly mapped-relaxed if the tool returned alignment (3). Indeed, this is a valuable evaluation criterion; however, many tools, such as SOAP2, RMAP, and BWA, do not use quality scores in the mapping phase. In addition, not all of the tools report the mapping quality.
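The categories described by Ruffalo et al. can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the `Alignment` record, the `classify` function, and the extra "below-threshold" label are hypothetical names introduced here, alignment (1) is taken to be the one at the original location (as in the naïve criterion), and "greater than the threshold" is read as a strict inequality. The more specific relaxed label is reported whenever its extra condition holds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    label: str       # hypothetical identifier, e.g. "(1)"
    at_origin: bool  # True if located at the read's original genomic position
    mq: int          # mapping quality reported by the tool


def classify(returned, candidates, threshold):
    """Label the alignment a tool returned, following the categories in the text."""
    if returned.mq <= threshold:
        # Not covered by the three categories described above.
        return "below-threshold"
    if returned.at_origin:
        return "correct"
    # Is there any correct alignment whose quality also exceeds the threshold?
    confident_correct = any(c.at_origin and c.mq > threshold for c in candidates)
    return "incorrect-strict" if confident_correct else "incorrect-relaxed"


# The three alignments of Figure 3.1 (assumption: (1) is the original location).
a1 = Alignment("(1)", True, 40)
a2 = Alignment("(2)", False, 35)
a3 = Alignment("(3)", False, 50)
candidates = [a1, a2, a3]

labels_at_30 = [classify(a, candidates, 30) for a in candidates]
label_a3_at_40 = classify(a3, candidates, 40)
```

With a threshold of 30 the sketch labels (1) as correct and (2) and (3) as incorrect-strict; with a threshold of 40 it labels (3) as incorrect-relaxed, because alignment (1) no longer exceeds the threshold.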
Another definition was introduced by Holtgrewe et al. [96]. Unlike the previous works, the authors tried to find a gold standard for each read, where a gold standard refers to all of the possible matching intervals for each read within a maximum distance k from the read. To enumerate all of the possible matching intervals, the authors used RazerS to detect the initial seed location for each interval. Afterwards, they developed a method to find the boundary of the interval centered at the seed and within a maximum distance k from the read. They named the suggested evaluation method Rabema. As an example, a possible interval with k = 3 would contain alignments (1) and (2) in Figure 3.1. Accordingly, Holtgrewe et al. defined the false negatives as the intervals missed by the mapper and the false positives as the intervals returned by the mapper and not included in the gold standard. However, Rabema consists of a seed detection phase and an enumeration phase and depends on RazerS to return the seed locations for the matching intervals, which makes it impractical to apply to large genomes and long reads. For example, RazerS took 25 hours to map 1 million reads of length 100 to the Human genome, and its running time doubled when the read length was increased from 75 to 100 [86]. Therefore, Holtgrewe et al. suggested another mode, an oracle mode, which makes use of the original location of simulated reads. The oracle mode uses the original location as the seed location instead of using RazerS to detect the initial seed locations. However, this method is only suitable when it is known a priori that the possible mapping locations for a read are around the simulated location (e.g., alignment (3) in Figure 3.1 would be missed in the oracle mode). Nevertheless, after the initial submission of this work, RazerS3 [91] was published, significantly improving the running time of Rabema and alleviating the slowness problem.
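The false negative and false positive definitions above can be sketched in a few lines of Python. This is a simplified illustration, not Rabema's implementation: the function name `evaluate_against_gold`, the representation of a gold-standard interval as an inclusive `(start, end)` pair on a single sequence, and the use of reported positions instead of reported intervals are all assumptions made here.

```python
def evaluate_against_gold(gold_intervals, reported_positions):
    """Compare a mapper's reported positions for one read with its gold standard.

    gold_intervals: list of inclusive (start, end) matching intervals.
    reported_positions: list of positions returned by the mapper.
    """
    def in_gold(pos):
        return any(start <= pos <= end for start, end in gold_intervals)

    # An interval counts as found if at least one reported position falls in it.
    found = [
        any(start <= pos <= end for pos in reported_positions)
        for start, end in gold_intervals
    ]
    false_negatives = found.count(False)  # gold intervals the mapper missed
    false_positives = sum(1 for pos in reported_positions if not in_gold(pos))
    sensitivity = (len(found) - false_negatives) / len(found) if found else 0.0
    return {
        "false_negatives": false_negatives,
        "false_positives": false_positives,
        "sensitivity": sensitivity,
    }


# Hypothetical read with two gold intervals; the mapper hits one and adds a stray hit.
result = evaluate_against_gold([(100, 105), (500, 503)], [102, 900])
```

In this made-up case the mapper finds one of the two intervals and reports one position outside the gold standard, giving one false negative, one false positive, and a sensitivity of 0.5.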
Even though the suggested definition for a gold standard quantitatively estimates the sensitivity of each mapper, it suffers from a couple of drawbacks. First, the definition does not take into consideration whether the alignments violate the mapping criterion of the mapper. For instance, in Figure 3.1, the sensitivity of the mapper would increase if it detected alignments (1), (2), and (3). However, if the mapping criterion of the mapper is to allow a maximum of two mismatches, then alignment (1) should not have been detected by the mapper and should be considered a wrong alignment or error. Second, quality-aware tools, such as Bowtie, MAQ, and Novoalign, would be incorrectly evaluated by Rabema since they use the quality threshold to accept or reject a read instead of calculating the edit or Hamming distance. Therefore, they might map a read with more mismatches than the limit allowed by Rabema.
3.2 Methods
3.2.1 Benchmark design
In this section, we present the features covered by our benchmarking suite. In addition, we explain how they were previously addressed by the tools we mention in this work. However, two algorithmic features, namely SNPs and Splicing awareness, are not presented in the results section due to being supported only by one tool. The tests are categorized as follows:
• Mapping options
Quality threshold: MAQ, Bowtie, and Novoalign use the quality threshold to determine the number of allowed mismatches. Therefore, setting a quality threshold is similar to explicitly setting the number of mismatches. However, there is no hard limit on the actual number of mismatches. The impact of varying the quality threshold, together with the correspondence between the quality threshold and the number of mismatches, has not been studied before.
Number of mismatches: Changing the number of allowed mismatches affects the percentage of mapped reads. This effect was studied in [74]; however, the mismatches were generated uniformly on the genome, which does not mimic the real distribution of mismatches.
Seed length: Seeding-based tools impose limits on the number of mismatches in the seed part. As a result, increasing or decreasing the length of the seed part affects the percentage of mapped reads. The effect of the seed length has not been studied in detail before.
• Input properties
Read length: The read length varies between 30 bps for ABI’s SOLiD and Illumina’s Solexa sequencers and up to 500 bps for Roche’s 454. Therefore, the impact of read length should be considered for throughput evaluation. Even though the effect of the read length has been explored in several studies, the default options were usually used, leading to incomparable trade-offs.
Paired-end reads: Mapping paired reads requires the mapping of both ends within a maximum distance between them. Hence, it adds a constraint while finding the corresponding genomic locations.
Genome type: The efficiency of most algorithms is tested by using the Human genome as the reference. However, each genome has its own properties such as the percentage of repeated regions and ambiguous characters. Therefore, using a single genome does not reveal the effect of these properties. To the best of our knowledge, BWA [77] was the only tool to test its performance on a large genome other than the Human genome.
• Algorithmic features
Gapped alignment: Gapped alignment is important for variant discovery due to its ability to detect indel polymorphisms [95]. Bowtie2, GSNAP, Novoalign, BWA, and mrFAST are the only tools to support it for single-end reads while the remaining tools support it for paired-end reads only. However, from the results provided by the previous studies, it is not obvious how gapped alignment affects the performance of the tools in comparison to allowing only mismatches.
SNP awareness: Incorporating SNP information into mapping allows considering minor alleles as matches rather than mismatches. Currently, this feature is provided only by GSNAP. It was shown in [74] that integrating SNP information affected around 8% of the reads and allowed mapping 0.4% of unmapped reads.
Splicing awareness: Reads located across exon-exon junctions would be wrongly aligned using standard alignment algorithms. Splicing awareness is only required for certain types of data such as RNA-Seq data. The only tool that currently supports splicing while performing the mapping phase is GSNAP. It was shown in [74] that the alignment yield increased by 8-9% when splicing detection based on known splice junctions was introduced. However, there was only a 0.3-0.6% increase when detecting novel splice junctions.
• Scalability
The scalability of the mapping tools may be different under different parallel
settings. Many tools support multithreading, which is expected to yield linearly
increasing speedup with the increase in the number of CPU cores. On the other
hand, using multiprocessing is more general and may improve the throughput
even for tools that do not support multithreading (e.g., MAQ and RMAP),
where multiprocessing refers to using more than one process in a distributed
memory fashion while communicating through a message passing interface.
• Accuracy evaluation
Each tool is expected to map a set of reads based on its mapping criteria.
However, a subset of the reads might not be mapped (i.e., false negatives) due
to using heuristics in the mapping algorithm or the default options limitations.
Moreover, some of the tools map a subset of these reads while violating the
mapping criteria.
• Rabema evaluation
Rabema benchmark enumerates all of the possible matching locations. Then,
it evaluates whether the tool detected the possible matching locations with the
specified error rate or not. Therefore, Rabema evaluation is a valuable one and
helps in adding another perspective when comparing between the tools.
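For the multiprocessing setting mentioned under Scalability, the reads first have to be partitioned among the processes. The sketch below shows only that partitioning step, under assumptions made here: the function name `split_fastq` is hypothetical, the input is taken to be the lines of a four-line-per-record FASTQ file, and records are dealt out round-robin. Launching the mapper on each chunk and exchanging results through a message passing interface is not shown.

```python
def split_fastq(lines, n_chunks):
    """Distribute 4-line FASTQ records round-robin over n_chunks lists of lines."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    # Group the flat list of lines into records of four lines each.
    records = [lines[i:i + 4] for i in range(0, len(lines), 4)]
    chunks = [[] for _ in range(n_chunks)]
    for index, record in enumerate(records):
        chunks[index % n_chunks].extend(record)
    return chunks


# Three made-up reads split over two processes.
example = []
for name in ("r1", "r2", "r3"):
    example += ["@" + name, "ACGT", "+", "IIII"]
chunks = split_fastq(example, 2)
```

Each chunk could then be handed to a separate process running a tool that does not support multithreading, with the per-chunk outputs merged afterwards.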
3.2.2 Use case: SNP Calling
SNP calling is the process of detecting genetic variations in a given genome. The genetic variations contribute to the generation of different phenotypes for the same gene and can increase the risk of complex diseases. Therefore, the discovery of SNPs is a very important process that needs to be done accurately. Many tools have been developed to detect SNPs, including ssahaSNP [105] and SNPdetector [106]. These tools were developed to analyze the DNA sequences generated using either the Sanger or the direct PCR amplification methods. However, with the development of the next generation sequencing technology, new tools are required to analyze the new data [107]. The new tools work by first mapping sequences to a reference genome, then using statistical analysis methods to extract SNPs [107] after filtering out low-quality mismatches. Therefore, accurately mapping the reads to the reference genome is a crucial task in the SNP calling pipeline.
3.3 Results and discussion
In this section, we present the results from our benchmarking tests. The experiments were performed on a cluster of quad-core AMD Opteron CPUs at 2.4 GHz with 32 GB of RAM. We used SOAP2 v2.20, Bowtie v0.12.6, Bowtie2 v2.0.0-beta5, BWA v0.5.0, MAQ v0.7.0, RMAP v2.05.0, FANGS v0.2.3, GSNAP v2010-07-27, Novoalign v2.07.0, and mrFAST and mrsFAST v2.5.0.4.
Performance evaluation: The performance is evaluated by considering two factors, namely, the mapping percentage and the throughput. The mapping percentage is the percentage of reads each tool maps while the throughput is the number of mapped base pairs per second (bps/sec). The throughput is calculated by dividing the number of reads mapped by the running time. For genome indexing based tools, the running time includes only the matching time while it includes the indexing and matching time for read indexing based tools. However, the running time for mrsFAST also includes the indexing time even though it is a genome indexing based tool. This is because the sensitivity of mrsFAST in the experiments depends on the window size used in the indexing phase. Therefore, the index is rebuilt in most of the experiments to maintain full sensitivity for mrsFAST.
In addition, the mapping percentage is further divided into the following:
• Correctly mapped reads: The percentage of reads mapped within the mapping
criteria.
• Error: The percentage of reads mapped while violating the mapping criteria.
As shown in the background section, this definition provides another evaluation
perspective that was not covered by older definitions.
• Amb: The percentage of reads mapped to more than one location with the same
number of mismatches. Most of the tools can return more than one mapping
location for Amb reads if desired. However, RMAP only reports the number of
Amb reads while not providing any information regarding the mapping location
and the number of mismatches. Therefore, we will not be able to report the
mismatches distribution for the RMAP reported Amb reads.
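The two performance factors and the breakdown of the mapping percentage can be expressed as a short Python sketch. It rests on assumptions not stated in the text: the base-pair count is taken to be the number of mapped reads times a fixed read length, the three categories are taken to be disjoint so that they sum to the mapping percentage, and the function names are hypothetical.

```python
def throughput_bps(mapped_reads, read_length, running_time_sec):
    """Mapped base pairs per second (assumes every read has the same length)."""
    return mapped_reads * read_length / running_time_sec


def mapping_breakdown(total_reads, correct, error, amb):
    """Percentages of correctly mapped, error, and Amb reads, plus their sum."""
    def pct(count):
        return 100.0 * count / total_reads

    return {
        "correct": pct(correct),
        "error": pct(error),
        "amb": pct(amb),
        "mapped": pct(correct + error + amb),
    }


# Made-up counts, for illustration only.
bps = throughput_bps(mapped_reads=1_000_000, read_length=125, running_time_sec=1000)
breakdown = mapping_breakdown(total_reads=1000, correct=900, error=10, amb=20)
```

For read indexing based tools (and for mrsFAST, as explained above), `running_time_sec` would include the indexing time as well as the matching time.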
Data sets: We evaluated the tools on two types of data sets, namely, synthetic data and real data. The synthetic data set mimics reads generated from genomic sequencing while the real data set is for RNA-Seq. The data sets are further generated as follows:
• Synthetic data: There are a number of tools available to extract synthetic data sets in FASTQ format from a reference genome, including wgsim [108], dwgsim [109], Mason [110], and ART [111]. wgsim generates reads with a uniform error distribution while dwgsim provides a uniformly increasing/decreasing error rate. On the other hand, Mason and ART mimic the error rates of Illumina and 454 sequencers. In this study, we use wgsim and ART to generate the synthetic data from the Human genome. wgsim helps in providing a fair comparison between the tools by using a uniform error distribution model, resulting in the same quality score for each base. Therefore, all of the tools can be allowed exactly the same number of mismatches regardless of the technique used to set the maximum number. For wgsim, the reads were generated with a 0.09% SNP mutation rate, a 0.01% indel mutation rate, a 2% uniform sequencing base error rate, and a maximum insert size of 500, which are the same parameters used in [77]. Additionally, Dohm et al. [112] showed that the sequencing error rate for Illumina changes between 0.3% at the beginning of the read and 3.8% at the end of the read. Moreover, according to the error rates and indel rates used by the Mason simulator [110], an indel rate of 0.01% is acceptable. We determined the number of reads to generate using wgsim based on the tool and the experiment. On the other hand, ART does not explicitly allow the user to choose the number of generated reads; it generates reads that cover the whole genome with a given coverage level. Therefore, to obtain 1 million reads, we used ART to generate reads that cover the whole genome with 1x coverage. Then, we randomly selected 1 million reads from the output reads.
To make sure that the results are not affected by different wgsim runs, we generated 13 different wgsim data sets and ran a sample of the tools independently on each data set. The sample included BWA, GSNAP, Bowtie, Bowtie2, and SOAP2. We found that the maximum standard deviation from the average was 0.03 (results are not included). Since there is no significant change between the runs, we carry out each experiment only once on a single data set.
• Real data: There are many types of real data sets, such as RNA-Seq data, ChIP-Seq data, and DNA sequences, that are used in different applications. It is important in our evaluation process to choose the right data set type to better evaluate the applicability of the tools in the different applications. Therefore, we prefer to use RNA-Seq data sets as they are used in many applications including SNP and alternative splicing detection. The data set consists of 1 million reads generated by an Illumina sequencer after isolating mRNA from Spretus mouse colon tissues. The mouse genome version mm9 was used as the reference genome. Indeed, as will be shown, the tools have similar behavior on both the mouse and the human genomes. Therefore, there is no contradiction in using the human genome for generating the synthetic data while the mouse genome is used for the real data.
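The random selection of a fixed number of reads from the ART output, described under the synthetic data above, can be done in a single pass without loading all reads into memory. The sketch below uses reservoir sampling; this is an assumption made for illustration, since the text does not say how the selection was performed, and the function name `reservoir_sample` is hypothetical.

```python
import random


def reservoir_sample(records, k, seed=0):
    """Return k records chosen uniformly at random from an iterable of records."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    sample = []
    for index, record in enumerate(records):
        if index < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Replace an existing entry with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                sample[slot] = record
    return sample


# Select 10 of 100 made-up read identifiers.
chosen = reservoir_sample((f"read_{i}" for i in range(100)), k=10)
```

For paired-end data, each record would need to hold both ends of a pair so that mates are kept or dropped together.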
First, we present the effect of the default options. The results for this experiment are given in Figures 3.2 and 3.3. Figure 3.2 shows the results when using wgsim to generate the synthetic data while Figure 3.3 shows the results using ART. As stated previously, tools try to use the options that yield a good performance while maintaining a good output quality. For instance, as shown in Figure 3.2, Bowtie achieves a throughput of around 1.6 × 10^5 bps/s at the expense of mapping only 67.58% of the reads. On the other hand, BWA maps 91% of the reads at the expense of having a throughput of 0.1 × 10^5 bps/s. Additionally, SOAP and mrsFAST (Figures 3.2 and 3.3) appear to provide the smallest mapping percentage, while in fact they allow only 2 mismatches whereas other tools such as mrFAST and GSNAP allow more than 5 mismatches. Therefore, using only the default options to build our conclusions would be misleading. Indeed, further experiments show that BWA obtains a high throughput when allowed to use the same options as Bowtie (see BWA-ND in Figure 3.2). Moreover, BWA achieves a higher throughput than Bowtie in other experiments. Therefore, it is important to use the same options to truly understand how the tools behave.
[Figure 3.2 plot: throughput (bps/s, scale ×10^5) against mapped percentage for Bowtie, Bowtie2, BWA, BWA-ND, SOAP, GSNAP, Novoalign, MAQ, RMAP, mrsFAST, and mrFAST.]
Figure 3.2: Mapping 1 million reads of length 125 extracted from the Human genome using wgsim. Each tool was allowed to use its own default options. BWA-ND refers to BWA’s results while using Bowtie’s default options which are 2 mismatches in the seed, 3 mismatches in the whole read, and disabling gapped alignment.
[Figure 3.3 plot: throughput (bps/s, scale ×10^5) against mapped percentage for Bowtie, Bowtie2, BWA, SOAP, GSNAP, Novoalign, MAQ, RMAP, mrsFAST, and mrFAST.]
Figure 3.3: Mapping 1 million reads of length 100 extracted from the Human genome using ART. Each tool was allowed to use its own default options.
In the remaining experiments, unless otherwise stated, the number of mismatches in the seed and in the whole read are fixed to 2 and 5, respectively, while the quality threshold is kept at 100. The minimum and maximum insert sizes allowed are 0 and 500, respectively. In addition, the splicing, SNPs, and gapped alignment options are disabled, unless otherwise stated. For the number of reported hits, tools are only allowed to report one location, except for mrsFAST, which does not have this option and reports all of the mapped locations. The default values are used for the remaining options.
3.3.1 Mapping options
Quality threshold is one of the two main metrics used for mismatch tolerance.
The other main metric is the explicit specification of the number of mismatches.
To compare fairly between the tools, a relationship between the two metrics should
be found, which is the main target of this experiment. In this experiment, wgsim
is used to generate the data set instead of using ART or a real one. The different
base quality scores in real data cause quality threshold based tools to allow more
mismatches than the other tools. For instance, when allowing a quality threshold
of 70 and 5 mismatches for the remaining tools, Bowtie and MAQ map reads with
up to 10 mismatches while the other tools are limited to 5 (results are not shown).
Therefore, MAQ and Bowtie had a mapping percentage larger than the other tools,
hence, the comparison is not fair. Nevertheless, in the following, we show how the
quality threshold can be used to mimic the behavior of the explicit specification of
the number of mismatches.
For wgsim generated synthetic data, quality thresholds of 60, 80, 100, 120, and
140 should correspond to 3, 4, 5, 6, and 7 mismatches. To assess our conclusion, we
designed an experiment where all tools were allowed a maximum of 7 mismatches
while using a quality threshold of 140. Figure 3.4 shows that the tools map the reads
with the same maximum number of mismatches while having similar mapping rates.
However, the differences in the mapping rates are due to the pruning of the search
space done by the default options for some of the tools. In addition, other tools
incorrectly mapped some of the reads causing an increase in the mapping percentage.
For instance, 0.6% of reported hits for MAQ and SOAP2 are considered as error (i.e., reads mapped while violating the mapping criteria) while Bowtie’s default options limit the allowed number of backtracks to find mismatches. On the other hand, GSNAP and mrsFAST map around 92% of the reads even though GSNAP reports error hits. This is because they are not seed-based tools and thus allow more mismatches to be found in the first few base pairs. Additionally, mrsFAST is a fully sensitive mapper; therefore, it can detect reads missed by other tools.
[Figure 3.4 plot: percentage mapped per tool, broken down into amb, error, and 0 to 7 mismatches (mms), for Bowtie, Bowtie2, BWA, SOAP, GSNAP, Novoalign, MAQ, RMAP, and mrsFAST.]
Figure 3.4: Mapping 1 million reads of length 125 extracted using wgsim on the Human genome while allowing up to 7 mismatches and a quality threshold of 140. The error is 0.6% for SOAP2 and MAQ and 0.45% for GSNAP.
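The correspondence reported above for wgsim data (quality thresholds of 60, 80, 100, 120, and 140 for 3, 4, 5, 6, and 7 mismatches) is consistent with each mismatch contributing a fixed penalty of 20 to the quality sum. The sketch below encodes that reading; the constant penalty of 20 is an inference from the listed pairs rather than a value stated in the text, and the function names are hypothetical.

```python
# Assumption: with wgsim's uniform base qualities, each mismatch adds the same
# penalty to the quality sum; 20 is inferred from the pairs 60->3, ..., 140->7.
PER_MISMATCH_PENALTY = 20


def max_mismatches(quality_threshold, penalty=PER_MISMATCH_PENALTY):
    """Largest number of mismatches whose summed penalty stays within the threshold."""
    return quality_threshold // penalty


def quality_threshold_for(mismatches, penalty=PER_MISMATCH_PENALTY):
    """Quality threshold that mimics an explicit limit on the number of mismatches."""
    return mismatches * penalty


thresholds = [60, 80, 100, 120, 140]
equivalent_mismatches = [max_mismatches(q) for q in thresholds]
```

The relationship only holds when all bases carry the same quality score. On real data, where base qualities vary, the same threshold admits a varying number of mismatches, which is why quality threshold based tools mapped reads with more mismatches than the other tools.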
Number of mismatches: The number of mismatches affects not only the percentage of mapped reads but also the throughput. In particular, the mapping percentage increases nonlinearly with the number of mismatches. Figure 3.5 shows the effect of the number of mismatches in more detail using a wgsim generated data set. There is a 20% increase in the percentage of mapped reads when allowing 3 mismatches instead of 2. On the other hand, there is less than a 0.7% increase when allowing 7 mismatches instead of 6. In addition, the error percentage decreases for a large number of mismatches. For instance, SOAP2’s error percentage is 21% when allowing 2 mismatches while it is reduced to 1% when allowing 6 mismatches. Additionally, mrsFAST mapped around 0.1-0.5% more reads than the remaining tools since it is a fully sensitive mapper. From the throughput point of view, the tools behave differently. For instance, Bowtie, MAQ, RMAP, and mrsFAST are able to maintain almost the same throughput while the throughput increases for SOAP2 and GSNAP and decreases for BWA. The degradation in BWA’s performance is due to exceeding the default number of mismatches, leading to excessive backtracking to find mismatch locations.
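The diminishing gain from each additional allowed mismatch can be made plausible with a back-of-the-envelope model. The sketch below is not the analysis performed in this study: it assumes that the only source of mismatches is the 2% uniform sequencing error rate used for wgsim, that errors at different bases of a 125 bp read are independent, and that a read is mappable whenever its error count does not exceed the limit. SNPs, indels, repeats, and tool heuristics are ignored, and the function names are hypothetical.

```python
from math import comb


def error_count_pmf(k, read_length=125, error_rate=0.02):
    """Binomial probability that a read carries exactly k sequencing errors."""
    return comb(read_length, k) * error_rate**k * (1 - error_rate) ** (read_length - k)


def fraction_within(max_mismatches, read_length=125, error_rate=0.02):
    """Fraction of reads with at most max_mismatches sequencing errors."""
    return sum(
        error_count_pmf(k, read_length, error_rate)
        for k in range(max_mismatches + 1)
    )


# Gain in the mappable fraction when raising the limit by one mismatch.
gain_2_to_3 = fraction_within(3) - fraction_within(2)
gain_6_to_7 = fraction_within(7) - fraction_within(6)
```

Under these assumptions the gain from 2 to 3 mismatches is roughly a fifth of the reads, while the gain from 6 to 7 is under one percentage point, which is in line with the nonlinear trend described above.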
Additionally, we used a data set of 1 million reads of length 100 generated by ART to evaluate the tools. The results for this experiment are shown in Figure 3.6. Similar to the wgsim results, the increase in the percentage of mapped reads is larger when allowing 3 mismatches instead of 2 than the increase when allowing 7 mismatches instead of 6. Unlike the wgsim results, Bowtie maintains a higher throughput than Bowtie2 for the different numbers of mismatches. This is due to the difference in the read length between the wgsim and ART data sets (100 for ART instead of 125). Moreover, Bowtie uses the quality threshold while Bowtie2 does not.
To further understand the behavior of the tools, the same set of experiments is applied on the mouse mRNA real data set. The results given in Figure 3.7 show that the error percentage for GSNAP still decreases for a large number of mismatches. In addition, there is a small reduction in BWA’s throughput for a large number of mismatches. Interestingly, the throughput for mrsFAST is different between the synthetic data and the real data. In the synthetic data set, mrsFAST’s throughput is higher than RMAP’s while maintaining the same throughput across the different numbers of mismatches.
[Figure 3.5 panels: percentage mapped per tool (amb, error, 2 to 7 t-mms) and throughput in bps/sec per tool on a logarithmic scale (2 to 7 t-mms), for Bowtie, Bowtie2, BWA, SOAP, GSNAP, Novoalign, MAQ, RMAP, and mrsFAST.]
Figure 3.5: Comparing the different tools while changing the total mismatches from 2 to 7. T-mms stands for the maximum allowed mismatches. A data set of 1 million reads of length 125 extracted from the Human genome using wgsim was used in this experiment.
[Figure 3.6 panels: percentage mapped per tool (Amb, Error, 2 to 7 mms) and throughput in bps/s per tool on a logarithmic scale (2 to 7 mms), for Bowtie, Bowtie2, BWA, SOAP, GSNAP, Novoalign, MAQ, RMAP, and mrsFAST.]
Figure 3.6: Comparing the different tools while changing the total mismatches from 2 to 7. T-mms stands for the maximum allowed mismatches. A data set of 1 million reads of length 100 extracted from the Human genome using ART was used in this experiment.
[Figure 3.7 panels: percentage mapped per tool (amb, error, 2 to 7 t-mms) and throughput in bps/sec per tool on a logarithmic scale (2 to 7 t-mms), for Bowtie, Bowtie2, BWA, SOAP, GSNAP, Novoalign, MAQ, RMAP, and mrsFAST.]
Figure 3.7: Comparing the different tools while changing the total mismatches from 2 to 7. T-mms stands for the maximum allowed mismatches. A real mRNA data set of 1 million reads of length 51 bps extracted from the Spretus mouse strain and mapped against the mouse genome version mm9 was used in this experiment.
On the other hand, on the real data, the throughput decreases with the increase in the number of mismatches. In addition, there is a 7x reduction in the throughput between 4 t-mms and 5 t-mms. To maintain full sensitivity for a small read length and large number of mismatches, mrsFAST requires the use of a small window size when building the index (window size of 8 for 5 t-mms instead of 10 for 4 t-mms).
The smaller the window size, the longer it takes to process the index. Additionally, there is a limit on the window size (min 8 and max 14). Therefore, mrsFAST starts to lose its sensitivity for detecting mapping locations for 6 and 7 t-mms.
Seed length: Theoretically, when fixing the number of allowed mismatches in the seed and in the whole read, changing the seed length affects the mapping results. Specifically, a shorter seed allows more mismatches in the remaining part of the read to be found. Therefore, the percentage of mapped reads would increase even though the throughput would decrease. On the other hand, having a longer seed would result in pruning some parts of the search tree as soon as possible, leading to throughput improvement. The aim of this experiment is to study this trade-off. As shown in the results given in Figure 3.8 using a wgsim data set, the tools behave as expected. However, there are some exceptions. For instance, when increasing the seed length from 32 to 36, the percentage of mapped reads for SOAP2 and Bowtie decreases; however, the throughput is not affected. In addition, there is a 0.8% increase in the percentage of mapped reads for Bowtie when increasing the seed length from 28 to 32. This behavior is due to the backtracking property that stops once a certain limit is reached. Therefore, as a result of having fewer erroneous bases in the seed part, Bowtie can continue further in the depth-first search without exceeding the backtracking limit.
We also carried out the same experiment on the real mouse mRNA data set. The results given in Figure 3.9 show that the same behavior for Bowtie is still obtained on real data. However, Bowtie has only a 0.01% increase when increasing the seed length from 28 to 32 instead of the 0.8% obtained on synthetic data.
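The interaction between the seed length and the two mismatch limits can be illustrated with a small check of the mapping criteria. The sketch assumes that the seed is the first `seed_len` bases of the read, that only substitutions are considered (no gaps), and that the limits are the 2 seed mismatches and 5 total mismatches used in these experiments; the function names and the example sequences are hypothetical.

```python
def mismatch_positions(read, reference):
    """Positions at which the read and the equally long reference window differ."""
    if len(read) != len(reference):
        raise ValueError("read and reference window must have the same length")
    return [i for i, (a, b) in enumerate(zip(read, reference)) if a != b]


def satisfies_criteria(read, reference, seed_len, max_seed_mm=2, max_total_mm=5):
    """True if the alignment respects both the seed and the whole-read limits."""
    positions = mismatch_positions(read, reference)
    seed_mm = sum(1 for i in positions if i < seed_len)  # mismatches inside the seed
    return seed_mm <= max_seed_mm and len(positions) <= max_total_mm


# A made-up 40 bp read with mismatches at positions 0, 10, and 25.
reference = "A" * 40
bases = list(reference)
for position in (0, 10, 25):
    bases[position] = "C"
read = "".join(bases)

accepted_with_seed_28 = satisfies_criteria(read, reference, seed_len=28)
accepted_with_seed_20 = satisfies_criteria(read, reference, seed_len=20)
```

With a seed of 28 bases all three mismatches fall inside the seed and the alignment is rejected, whereas with a seed of 20 bases only two do and the alignment is accepted. This is the sense in which a shorter seed lets more reads be mapped. An alignment for which the check returns False but which a tool nevertheless reports would count towards the error category.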
[Figure 3.8 panels: percentage mapped and throughput in bps/sec (logarithmic scale) for Bowtie, BWA, and SOAP with seed lengths 20, 24, 28, 32, and 36.]
Figure 3.8: The effect of changing the seed length on the BWT based tools. The tools were used to map 16 million reads of length 70 bps on the Human genome. SOAP2 does not support seed length < 28.
[Figure 3.9 panels: percentage mapped and throughput in bps/sec (logarithmic scale) for Bowtie, BWA, and SOAP with seed lengths 20, 24, 28, 32, and 36.]
Figure 3.9: The effect of changing the seed length on the BWT based tools. The tools were used to map a real mRNA data set of 1 million reads of length 51 bps extracted from the Spretus mouse strain on the mouse genome version mm9. SOAP2 does not support seed length < 28.
3.3.2 Input properties
Read length: Longer reads tend to have more mismatches besides requiring more time to be fully mapped [113]. In general, for a fixed number of mismatches, increasing the read length decreases the percentage of mapped reads. Therefore, the aim of this experiment is to understand the read length effect. The results in Figure 3.10 show that the mapping percentage decreases with the increase in the read length while the error percentage increases. As an example, 95% of FANGS’ output for read length 500 is error compared to 12% of its output for read length 200. This is due to the increase in the number of erroneous bases with the increase in the read length. Therefore, it becomes harder to map the reads within the specified mapping criteria. In addition, Bowtie, Bowtie2, and BWA were the only short sequence mapping tools that managed to map long reads. In particular, the maximum read length was 128 for MAQ, 300 for RMAP, 200 for GSNAP, and 199 for mrsFAST, while SOAP2 took more than 24 hours to map the reads of length 300 and hence its result is not reported. On the other hand, mrsFAST’s run on read length 36 terminated unexpectedly. This is probably due to the small read length and the large number of mismatches. From the throughput point of view, tools do not maintain the same behavior. For instance, the throughput of Bowtie and SOAP2 decreases for long read lengths. This is due to the backtracking property and the split strategy [78] used by Bowtie and SOAP2, respectively, to find inexact matches. Moreover, Bowtie is better than Bowtie2 for read lengths 36 and 70. On the other hand, even though the throughput of BWA and GSNAP increases with the read length, it starts to decrease for read lengths 500 and 200, respectively. GSNAP works by combining position lists to create candidate mapping regions. Therefore, for long reads, the throughput decreases due to the increase in the work needed to generate and combine position lists. For mrsFAST, the throughput increases with the read length since fewer mapping locations are available for longer reads than for short ones.
Additionally, we carried out the same experiment on synthetic data sets generated by the ART tool. We did not carry out the experiment on a real data set due to the lack of proper real data sets that have different read lengths, have exactly the same coverage, generated by the same sequencer, and extracted from the same tissue. The results for this experiment are shown in Figure 3.11. Similar to wgsim results, the error
[Figure 3.10 panels: percentage mapped (legend entries: amb, error); throughput (bps/sec, logarithmic axis). Tools: Bowtie, Bowtie2, BWA, SOAP, MAQ, RMAP, GSNAP, FANGS, Novoalign, mrsFAST. Read lengths: 36, 70, 125, 200, 300, 500.]
Figure 3.10: The effect of changing the read length from 36 to 500. The reads were extracted from the Human genome. RMAP and MAQ are slower than the other tools. Therefore, 1 million reads were used to test MAQ and RMAP while 16 million reads were used for the remaining ones.
[Figure 3.11 panels: percentage mapped (legend entries: amb, error); throughput (bps/sec, logarithmic axis). Tools: Bowtie, Bowtie2, BWA, SOAP, GSNAP, Novoalign, MAQ, RMAP, mrsFAST. Read lengths: 36, 70, 100.]
Figure 3.11: The effect of mapping 1 million reads generated by ART on the mouse genome version mm9 while changing the read length from 36 to 100.
percentage increases with the increase in the read length for GSNAP and SOAP2. Interestingly, the percentage of mapped reads for Bowtie, MAQ, and Novoalign is not significantly affected by the increase in the read length in comparison to the other tools. This is because the longer the read is, the lower the quality scores become for the bases at its end [114]. Therefore, Bowtie, MAQ, and Novoalign can map the reads with more mismatches while maintaining the same quality threshold.
Paired-end. Mapping paired-end reads affects the performance of the tools due to the added constraint of mapping both ends within a maximum insert size. Therefore, in this experiment, we want to understand how the performance of the tools
[Figure 3.12 panels: percentage mapped (legend entries: amb, error); throughput (bps/sec, logarithmic axis). Tools: Bowtie, Bowtie2, SOAP, BWA, GSNAP, Novoalign, RMAP, MAQ, mrsFAST. Series: se-ungapped, se-gapped, pe-ungapped, pe-gapped.]
Figure 3.12: The effect of mapping paired-end reads of length 70 to the Human genome. 1 million reads were used to test RMAP and MAQ while 16 million reads were used to test the other tools. SE and PE refer to single end and paired end, respectively. Error is only provided for PE due to exceeding the allowed insert size.
is affected while mapping paired-end reads instead of single-end reads. The results in Figure 3.12 (ungapped bars) show that the throughput decreases for all of the tools while mapping paired-end reads, except for BWA, which was able to maintain almost the same throughput, and MAQ, which had a small increase. Even though all of the algorithms work by finding mapping locations for each end alone and then finding the best pair, GSNAP was the only tool to face a drop of 90% in the throughput. Additionally, the percentage of mapped reads is lower while mapping paired-end reads because the mapping criteria used for single-end reads are also applied to paired-end reads.
Even though the maximum insert size was 500, tools such as BWA, SOAP, and GSNAP mapped paired-end reads whose insert size exceeded this maximum. Novoalign is an exception, since it explicitly requires the user to set the standard deviation for the insert size.
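The insert size constraint can be illustrated with a minimal sketch. It assumes that the insert size is the outer distance between the leftmost and the rightmost mapped base of the two ends, with 0-based leftmost mapping positions; the function names and the example positions are hypothetical and are not taken from any of the evaluated tools.

```python
def insert_size(pos1: int, pos2: int, read_len: int) -> int:
    """Outer distance covered by a mapped pair (assumed definition)."""
    left = min(pos1, pos2)
    right = max(pos1, pos2) + read_len
    return right - left


def violates_insert_size(pos1: int, pos2: int, read_len: int,
                         max_insert: int = 500) -> bool:
    """True if the pair spans more than the allowed maximum insert size."""
    return insert_size(pos1, pos2, read_len) > max_insert


# Hypothetical leftmost mapping positions for three pairs of 70 bp reads.
pairs = [(1000, 1300), (1000, 1431), (5000, 4600)]
flags = [violates_insert_size(a, b, read_len=70) for a, b in pairs]
print(flags)  # [False, True, False]
```

Under this assumed definition, a pair flagged as True would be counted in the error portion reported for paired-end mapping.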
Genome type. To capture the effect of the genome type, we designed an experiment in which the Human, Chimpanzee, Mouse, Zebrafish, Lancelet, A. mellifera, and C. elegans genomes were used as reference genomes. The sizes of these genomes are 3.1 Gbps, 3.0 Gbps, 2.5 Gbps, 1.5 Gbps, 0.9 Gbps, 0.57 Gbps, and 0.1 Gbps, respectively. Theoretically, for genome indexing based tools, the throughput is expected to slightly increase with the decrease in the genome size. However, the results in Figure 3.13 show that some tools do not act as expected. For instance, there is a difference in the throughput between the Chimpanzee and the Human genomes even though their sizes are similar. In addition, SOAP2's and Novoalign's throughput decreases significantly for the Zebrafish genome, while GSNAP did not finish its run on the same genome despite running for two days. The reason for this behavior is the large repetition rate in the Zebrafish genome. For instance, while mapping 1 million randomly generated reads from the Zebrafish genome, around 600 reads were mapped to more than 100,000 locations, in comparison to the Lancelet, where the maximum number of locations is around 10,000 for only 1 read. Additionally, mrsFAST detects more than 8 billion locations while mapping reads to the Zebrafish genome, whereas it detected only 24 million while mapping reads to the Lancelet genome. Hence, for GSNAP, the large repetition rates lead to long genomic position lists, resulting in a significant slowdown of GSNAP. Another interesting result is the ability of most of the tools to map more than 96% of the reads for the Zebrafish data set compared to around 91% for the
[Figure 3.13 panels: percentage mapped (legend entries: Amb, Error); throughput (bps/s, logarithmic axis). Tools: Bowtie, Bowtie2, BWA, SOAP, GSNAP, Novoalign, MAQ, RMAP, mrsFAST. Genomes: Human, Chimp, Mouse, Zebrafish, Lancelet, A. mellifera, C. elegans.]
Figure 3.13: 16 million reads of length 70bps were generated from the Human, Zebrafish, Lancelet, Chimpanzee, A. mellifera, and C. elegans genomes using wgsim for this test. 1 million reads were used for MAQ and RMAP.
Human and 89% for the Lancelet. The large mapping percentage is also due to the large repetition rate. Because the reads are generated synthetically, a large number of them would be generated from the repeated regions. As a result, the probability of finding a mapping location increases. In addition to the above results, we also noticed that Bowtie scales better than Bowtie2 on different genomes. Moreover, MAQ's throughput for the C. elegans genome is larger than Novoalign's while maintaining a comparable mapping percentage. Therefore, read indexing based tools might perform better than some genome indexing based tools for small genomes despite being very slow for large genomes.
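The link between the repetition rate and the number of mapping locations can be shown on toy sequences. The sketch below is only an illustration under simplifying assumptions (exact matching on short strings, hypothetical sequences and helper names); it is not how any of the evaluated tools locates reads.

```python
def count_locations(genome: str, read: str) -> int:
    """Number of exact, possibly overlapping, occurrences of read in genome."""
    count = 0
    start = genome.find(read)
    while start != -1:
        count += 1
        start = genome.find(read, start + 1)
    return count


def max_locations(genome: str, read_len: int) -> int:
    """Largest location count over all reads of read_len drawn from genome."""
    reads = {genome[i:i + read_len] for i in range(len(genome) - read_len + 1)}
    return max(count_locations(genome, r) for r in reads)


# Two hypothetical 24-base "genomes": one highly repetitive, one not.
repetitive = "ACGT" * 6
diverse = "ACGTTGCAAGCTTAGGCATCGATC"
print(max_locations(repetitive, 6), max_locations(diverse, 6))  # 5 1
```

Every read drawn from the repetitive sequence maps somewhere, and usually to several places, which mirrors both the higher mapping percentage and the longer position lists observed for the Zebrafish genome.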
To further understand the behavior of the tools, we generated a data set of 1 million reads using ART. Figure 3.14 shows the results using the ART data sets. Similar to wgsim results, SOAP2 and Novoalign still encounter a significant decrease in the throughput when mapping the Zebrafish data set. Additionally, Bowtie still scales better than Bowtie2 with the different genomes. Interestingly, GSNAP finished its run on the Zebrafish data set even though it still faces a decrease in the throughput.
On the other hand, unlike wgsim results, mrsFAST encounters a decrease in the throughput when mapping the Zebrafish data set. It is not obvious why mrsFAST encounters such a decrease even though its performance on the other genomes remains the same regardless of using wgsim or ART.
In general, the throughput for the tools increased when using ART instead of wgsim to generate the data sets. However, the relative performance between the tools and the different genomes is still the same.
3.3.3 Algorithmic features
Gapped alignment should improve the mapping percentage albeit decreasing the throughput. We designed an experiment to understand the effect of gapped alignment. Tools were used to map synthetically generated reads of length 70 to the Human genome while allowing one gap of length 3. However, mrFAST does not provide any option to limit the gap size. The results in Figure 3.15 show that the mapping percentage increases by 4% for SOAP2 and 1.5% for mrFAST in case of gapped alignment, while there is no change for BWA and GSNAP. However, there is a drop of 15% and 75% in the throughput for BWA and GSNAP, respectively.
The decrease for GSNAP is due to the overhead added to the algorithm to find
[Figure 3.14 panels: percentage mapped (legend entries: Amb, Error); throughput (bps/s, logarithmic axis). Tools: Bowtie, Bowtie2, BWA, SOAP, GSNAP, Novoalign, MAQ, RMAP, mrsFAST. Genomes: Human, Chimp, Mouse, Zebrafish, Lancelet, A. mellifera, C. elegans.]
Figure 3.14: 1 million reads of length 70bps were generated from the Human, Zebrafish, Lancelet, Chimpanzee, A. mellifera, and C. elegans genomes using ART.
pairs of candidate regions that co-localize within a maximum allowed gap size. The algorithm tries to find a crossover between the two regions without exceeding the maximum number of mismatches, leading to a significant decrease in the throughput. Interestingly, the decrease in the throughput is less for the real data set as shown in Figure 3.12. However, the decrease is still larger than the decrease in the throughput for the remaining tools.
For the real data set, mrsFAST (mrFAST) is not included in the results since the minimum allowed window size in the indexing phase does not guarantee a full sensitivity for mrFAST.
3.3.4 Scalability
In this experiment, we tested the multithreading behavior. In addition, pMap [115] was used to run multiple instances of each tool on a number of processors on a single node to test the multiprocessing effect. pMap is an open-source MPI-based tool that enables parallelization of existing short sequence mapping tools by partitioning the reads and distributing the work among the different processors. A single node was used in the multiprocessing experiment to understand the effect of a good implementation of multithreading. The results for both experiments are given in Figure 3.16.
We can observe from the multithreading results that the tools had almost a linear speedup up to 4 threads. However, when increasing to 8 threads, Bowtie was the only tool to achieve 8x speedup. In addition, BWA had a similar speedup in both multithreading and multiprocessing. For the multiprocessing experiment, FANGS achieved almost a 6x speedup while there was a small improvement for MAQ and RMAP. For the remaining tools, most of them were able to maintain more than a 5x speedup for
[Figure 3.15 panels: percentage mapped; throughput (bps/sec, logarithmic axis). Tools: Bowtie2, BWA, GSNAP, Novoalign. Series: se-ungapped, se-gapped.]
Figure 3.15: mRNA data set of 1 million reads extracted from the Spretus mouse strain is used in this experiment and mapped on the mouse genome version mm9. mrsFAST is used for ungapped alignment and mrFAST is used for gapped alignment.
8 processors; however, this is less than a linear speedup. One reason for this degradation is the overhead of the distribution and merging steps required by distributed memory systems. As expected, multithreading provides almost a linear speedup; however, it is limited by the number of cores.
In general, using multiprocessing provides more degrees of freedom by parallelizing tools that do not support multithreading and by making use of the available computational resources.
Another important observation is the effect of the indexing method on the total speedup. Read indexing based tools did not have any significant speedup in comparison to the genome indexing based ones, which had more than 5x speedup. Therefore, genome indexing is more efficient when designing a tool based on read-partitioning parallelism.
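For reference, the speedup figures discussed in this section follow the usual definition of serial running time divided by parallel running time. The sketch below also computes the parallel efficiency; the timings are hypothetical placeholders chosen for illustration and are not measurements from our experiments.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Ratio of the single-worker running time to the parallel running time."""
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, workers: int) -> float:
    """Speedup divided by the number of threads or processes (1.0 is linear)."""
    return speedup(t_serial, t_parallel) / workers


# Hypothetical running times in seconds (NOT measured values).
t_one_worker = 800.0
t_eight_threads = 100.0    # a linear 8x speedup
t_eight_processes = 160.0  # a 5x speedup, i.e. below linear

print(speedup(t_one_worker, t_eight_threads))          # 8.0
print(efficiency(t_one_worker, t_eight_processes, 8))  # 0.625
```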
3.3.5 Accuracy evaluation
The aim of this experiment is to evaluate the percentage of reads each tool actually maps out of the set of mappable reads. A read is mappable if the distance between the read and its original genomic location does not violate the mapping criteria. In this experiment, the reads were generated using ART to measure the sensitivity of the tools when the distribution of mismatches varies. The mapping criterion was fixed to five mismatches for Bowtie2, SOAP, GSNAP, BWA, mrsFAST, and RMAP. For the remaining tools, a quality threshold of 100 was used. In general, gapped alignment was disabled. The results given in Table 3.2 show that Bowtie did not map around 0.14% of the set of mappable reads (i.e., false negatives) while Bowtie2 did not map around 7.71%. Moreover, Bowtie mapped 93% of the reads
[Figure 3.16 panels: speedup with 2, 4, and 8 threads for Bowtie, Bowtie2, BWA, SOAP, and GSNAP; speedup with 2, 4, and 8 processors for Bowtie, BWA, SOAP, MAQ, RMAP, GSNAP, and FANGS.]
Figure 3.16: 16 million reads of length 125 were mapped to the Human genome while using multithreading (the upper figure) or multiprocessing (the lower figure).
Table 3.2: Evaluating the sensitivity of the tools on a data set of 1 million reads of length 70 generated by ART. The numbers are in percentage. The Reported mapped percentage is the total percentage of reads mapped by each tool. It is equal to Actual Mapped + (Expected Unmapped - Actual Unmapped), while Reported correct is the total number of correctly mapped reads.

                    Bowtie  Bowtie2  BWA    SOAP2  MAQ    RMAP   GSNAP  Novoalign  mrsFAST
Mapped
  Expected          93.57   93.25    91.29  91.29  93.57  90.12  93.25  96.18      93.25
  Actual            93.43   85.54    91.29  91.29  92.92  82.53  93.25  96.02      93.25
  Error (reported for a subset of the tools): 0.73, 0.03
Unmapped
  Expected          6.43    6.75     8.71   8.71   6.43   9.88   6.75   3.82       6.75
  Actual            6.25    6.68     8.32   6.83   5.08   8.29   3.66   3.81       6.62
  Error (reported for a subset of the tools): 1.73, 1.25, 1.5, 2.97
Reported mapped     93.61   85.61    91.68  93.17  94.27  84.11  96.34  96.03      93.38
Reported correct    93.61   85.61    91.68  90.71  93.02  82.61  93.37  96.03      93.38
while Bowtie2 only mapped 85%. Nevertheless, the sensitivity of both tools can be increased by changing the default options at the expense of significantly decreasing the throughput. Interestingly, BWA, SOAP2, and mrsFAST mapped all of the mappable reads without any error.
In general, the tools were able to map a percentage of the unmappable reads; however, these reads were mapped with a large error percentage. For instance, even though GSNAP mapped around 3% of the unmappable reads, only 0.03% of them were correctly mapped. Therefore, even though GSNAP maps the largest percentage of reads, other tools such as BWA and Novoalign are more accurate and precise than GSNAP.
It is important to note that the percentage of reads mapped from the unmappable reads is similar to the percentage of incorrectly mapped reads-relaxed given in the work of Ruffalo et al. [97]. However, they define a read to be unmappable if it has a mapping
Table 3.3: Rabema evaluation results on the different tools using a data set of 1 million reads of length 100 extracted from the Human genome using ART. The maximum allowed error is 5% (i.e., 5 mismatches in this case). #Reads is the number of reads expected to be mapped with certain Error. The remaining columns for the tools show the percentage of reads detected by each tool out of the #Reads. Invalid mappings (i.e., reads mapped with errors more than the assigned error rate threshold) for Bowtie and Novoalign are 567,531 and 587,542 reads, respectively.

Error  #Reads   Bowtie  Bowtie2  BWA    SOAP2  Novoalign  mrsFAST
0      832      100     100      100    100    97.24      100
1      6316     96.99   100      100    100    98.29      100
2      23495    97.30   97.16    100    99.97  98.70      100
3      55941    97.00   95.92    99.85  95.78  98.84      100
4      98063    96.48   94.22    99.49  96.43  99.02      100
5      135096   95.63   91.14    98.76  97.34  99.12      100
quality less than a certain threshold while we consider it as unmappable if it violates
the mapping criteria for the tool.
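The relationship between the rows of Table 3.2 can be checked with a short calculation. The sketch below copies the Expected and Actual values of three tools from the table; the dictionary layout and the function names are illustrative choices, and the rounding to two decimals is an assumption about how the table values were reported.

```python
# Expected/Actual percentages copied from Table 3.2 for three of the tools.
table = {
    "Bowtie":  {"exp_mapped": 93.57, "act_mapped": 93.43,
                "exp_unmapped": 6.43, "act_unmapped": 6.25},
    "Bowtie2": {"exp_mapped": 93.25, "act_mapped": 85.54,
                "exp_unmapped": 6.75, "act_unmapped": 6.68},
    "GSNAP":   {"exp_mapped": 93.25, "act_mapped": 93.25,
                "exp_unmapped": 6.75, "act_unmapped": 3.66},
}


def false_negatives(row: dict) -> float:
    """Mappable reads that the tool failed to map (in percent)."""
    return round(row["exp_mapped"] - row["act_mapped"], 2)


def reported_mapped(row: dict) -> float:
    """Actual Mapped + (Expected Unmapped - Actual Unmapped)."""
    return round(row["act_mapped"]
                 + (row["exp_unmapped"] - row["act_unmapped"]), 2)


for tool, row in table.items():
    print(tool, false_negatives(row), reported_mapped(row))
```

For Bowtie and Bowtie2 this reproduces the false negative percentages of 0.14 and 7.71 quoted in the text, and for all three tools it reproduces the Reported mapped row of the table.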
3.3.6 Rabema evaluation
The aim of this experiment is to evaluate the tools based on the number of reads with a specified error rate that each tool has been able to map. Unlike the previous experiment, this experiment does not take into consideration how each tool works. Therefore, it is similar to evaluating the efficiency of each mapping algorithm (i.e., seeding vs. non-seeding, quality scores vs. mismatches). The experiment is performed on a synthetic data set of reads of length 100 extracted from the Human genome using ART. The maximum allowed error rate was 5%, i.e., 5 mismatches in that case. The results for this experiment are shown in Table 3.3. Rabema takes the output SAM file from each tool as the input. However, MAQ and RMAP do not create the output in the SAM format. Therefore, they are not included. Additionally, GSNAP results are not included since GSNAP does not report the quality scores correctly in the SAM format.
As shown in the results, both Novoalign and Bowtie are evaluated as mapping invalid reads. This is because Rabema does not take the quality scores into consideration and only calculates the edit distance. Therefore, from the mismatches perspective, the reads have more than 5 mismatches. However, from the quality threshold perspective, the reads remain below the specified quality threshold. Therefore, in the end, they are valid mappings.
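The difference between the two perspectives can be sketched as follows. This is a simplified illustration only: it assumes that a quality-aware criterion bounds the sum of the Phred scores at the mismatched positions, which approximates but does not reproduce the exact rules of Bowtie or Novoalign, and the read, the reference, the quality values, and the thresholds are hypothetical.

```python
def mismatch_positions(read: str, ref: str) -> list:
    """Positions at which an ungapped alignment of read and ref disagrees."""
    return [i for i, (a, b) in enumerate(zip(read, ref)) if a != b]


def passes_mismatch_limit(read: str, ref: str, max_mm: int = 5) -> bool:
    """Mismatch-count view: every mismatch costs the same."""
    return len(mismatch_positions(read, ref)) <= max_mm


def passes_quality_limit(read: str, ref: str, quals: list,
                         max_qual_sum: int = 100) -> bool:
    """Quality view (assumed): sum of Phred scores at mismatched positions."""
    return sum(quals[i] for i in mismatch_positions(read, ref)) <= max_qual_sum


# Hypothetical read whose last six bases are low quality and all mismatch.
ref = "A" * 20
read = "A" * 14 + "C" * 6
quals = [40] * 14 + [10] * 6

print(passes_mismatch_limit(read, ref))        # False: 6 mismatches > 5
print(passes_quality_limit(read, ref, quals))  # True: 6 * 10 = 60 <= 100
```

A mapping such as this one is valid under the quality-based criterion, yet an evaluation based purely on the mismatch count would classify it as invalid.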
In general, BWA has been able to detect almost all of the reads with the correct error rate. This suggests that most of the mismatches exist at the end of the read. In addition, the seeding technique is a valid method, especially if it can speed up the mapping process. Even though SOAP2 is a seed based tool, similar to BWA, it could not detect as many correct reads as BWA. Bowtie2 missed some of the reads; however, it can detect them by changing its sensitivity at the cost of increasing the running time. On the other hand, mrsFAST mapped all of the reads with the correct error rate since it is a fully sensitive mapper.
3.3.7 Use case: SNP calling
The aim of this experiment is to understand how the different mapping techniques affect the quality of SNP calling. The tools were used to map an mRNA dataset of 23 million reads extracted from the Spretus mouse strain. Then Partek [116], a genomic suite developed to analyze NGS data, is used to detect SNPs. The mouse genome version mm9 was used as the reference genome in this experiment. A quality threshold of 70 was used for Bowtie and Novoalign while the remaining tools were allowed 5 mismatches. In addition, gapped alignment was enabled for Bowtie2, BWA, GSNAP, and Novoalign. Table 3.4 shows the results for this experiment. The SNP detection
step was done for GSNAP and SOAP2 after filtering out the erroneous reads. The log-odds ratio represents how accurate the SNP is. The small log-odds ratio for some of the SNPs is due to either the small number of reads that support that SNP or the mixed genotype calls. We can observe that there is a large number of accurately detected SNPs. This is expected due to the high divergence of the Spretus strain from other mouse strains. For the sake of completeness, we include the total number of detected SNPs; however, in our analysis, we focus only on the number of accurately detected SNPs shown in the last column. The results show that GSNAP detected the largest number of accurate SNPs while Novoalign detected the smallest. In addition, more than 94% of the highly accurate SNPs detected by Novoalign were also detected by the other tools (not shown). To further understand the reason for the low number of SNPs detected by Bowtie and Novoalign, we carried out the same experiment while using a quality threshold of 100. The number of highly accurate SNPs increased to
1474 and 1100 for Bowtie and Novoalign, respectively. Moreover, the reads with more than 5 mismatches did not contribute to the increase in the number of SNPs. This is due to the fact that SNPs have a high quality score. Therefore, a read with a SNP would be sequenced with a small number of errors.
3.4 Conclusion
There have been many studies carried out to analyze the performance of short sequence mapping tools and choose the best tool among them. However, the analysis of short sequence mapping tools is still an active problem with many aspects that have not been addressed yet. In this work, we provided a benchmarking study for short sequence mapping tools while tackling different aspects that have not been covered
Table 3.4: SNP calling results when using the different tools. Each row represents a different tool while each column shows the number of SNPs detected with the log-odds ratio, a measure of the accuracy of the detected SNP, centered around the given values. The larger the log-odds ratio is, the more accurate the detected SNP becomes.

           Log-odds ratio
Tools      5       100    200    300   400   500   600  700  800  900  1000  1000000
Bowtie     89479   24337  5082   2231  1076  648   426  281  0    0    0     1171
Bowtie2    200914  62178  10018  4200  2052  1156  767  537  0    0    0     2035
BWA        192050  52115  9028   4049  1894  1087  737  525  0    0    0     2067
SOAP2      174475  49302  8552   3824  1837  1030  704  508  0    0    0     1941
Novoalign  69798   17586  4061   1875  936   519   363  252  0    0    0     941
GSNAP      207920  69015  11416  4928  2482  1325  971  617  0    0    0     2602
by previous studies. We mainly focused on studying the effect of different input properties, algorithmic features, and changing the default options on the performance of the different tools. Additionally, we provided a set of benchmarking tests which extensively analyze the performance of the different tools. Each of the benchmarking tests stresses a different aspect. The benchmarking tests were further applied on a variety of short sequence mapping tools, namely, Bowtie, Bowtie2, BWA, SOAP2, MAQ, RMAP, GSNAP, Novoalign, mrsFAST (mrFAST), and FANGS.
The experiments show that some tools report an error percentage (i.e., reads mapped while violating the mapping criteria). Among these tools are GSNAP and SOAP. GSNAP reported the highest error percentage in the experiments. Additionally, the error increases with the read length and decreases with the number of mismatches. Nevertheless, GSNAP was one of the tools that reported the largest mapping percentage in most of the experiments even after filtering out the error reads. The main reason for mapping more reads is allowing any number of mismatches in the seed part. From a real application perspective, GSNAP's filtered output helped in detecting the largest number of SNPs.
The evaluation of Bowtie, Bowtie2, BWA, mrsFAST, and Novoalign shows their ability to correctly map the reads. Moreover, Novoalign mapped the largest percentage of reads, similar to GSNAP, especially for highly repetitive genomes. However, it maintained the lowest throughput among the genome indexing tools in most of the experiments.
mrsFAST’s running time is highly affected by the read length and the number of mismatches. Our experiments show that it is better to use mrsFAST for longer reads.
It can also be used for short reads but only with a small number of mismatches.
In general, genome indexing based tools performed better than read indexing tools in all of the experiments. However, MAQ was faster than Novoalign for small genomes.
Therefore there is a potential for read indexing tools to be used for small genomes.
In addition to providing the worst performance, read indexing does not provide any significant speedup when read partitioning based parallelism is used. Therefore, the read indexing method is not preferred when designing new read partitioning mapping tools.
Interestingly, the genome type experiment revealed many strengths and weaknesses for the tools. For instance, the performance of SOAP, GSNAP, and Novoalign is highly dependent on the genome type; the throughput decreased significantly for the Zebrafish genome. This is due to the large repetition rate in the Zebrafish genome. In addition, the tools behaved differently on the Human and the Chimpanzee genomes despite their comparable genome sizes. The results of the genome type experiment suggest that the different properties of the genomes affect the performance of the tools. Therefore, further investigations are required to understand the different properties of the genomes and their effect on the different mapping techniques.
Even though there are differences between the results for the real data sets and the synthetic ones, both experiments are important as they give us a different perspective when comparing the tools. The control over the number of mismatches for the wgsim synthetic data allows us to know exactly what the throughput of each tool is while looking for exactly the same number of mismatches. Therefore, it becomes easier to understand why a tool is faster than another one or why a tool seems to map more reads than the other ones. At the same time, it is important to look at the behavior of the tools in case of real data and real-like synthetic data (e.g., ART) to further understand how they behave in the real world. For instance, for the number of mismatches experiment, even though Bowtie looks like it maps a percentage of reads similar to the other tools in case of 7 t-mms, it actually maps the reads with a maximum of 4 t-mms. Therefore, the output reads are more accurate than those of the other tools.
In general, there is no single best tool among all of the tools; each tool was the best under certain conditions. The short sequence mapping problem is still an active problem and new tools still need to be developed. However, these new tools should be application-specific. By taking the target application into consideration, more accurate results can be obtained. For instance, for genome assembly, we can analyze the reference genome and estimate the number of reads that can be mapped for the different regions (e.g., repeated regions) based on the coverage information in the sequencing process. Another example of an application with very specific properties is the mapping of RNA-Seq data, which contain short sequences from the exon regions rather than the intron regions of the genome. Therefore, for well-studied genomes, if a small number of reads were mapped to different intron regions, we can expect them to be wrongly mapped and look for other mapping locations with a larger number of mismatches or a lower mapping quality.
Chapter 4: Efficiency of RNA-Seq Data for Active Module Discovery in Comparison to MicroArrays
Microarrays have been very useful for biological research over the past few decades.
However, with cutting-edge high-throughput sequencing technologies, researchers now have the chance to investigate biological systems in a more accurate way. Several works have compared microarray and RNA-Seq data for different problems and applications [117, 118, 40, 119, 41]. These works show that using RNA-Seq instead of microarray data can be better because of the limitations imposed by the microarray design. RNA-Seq data has less background noise, its range for expression level quantification is higher, it is better at distinguishing different isoforms and allelic expressions, and it usually requires less cost and less genomic material [42].
Discovering active modules in protein-protein interaction (PPI) networks has been studied by several researchers, leading to the development of different algorithms and tools [120, 121, 50, 122, 29, 123, 124, 30, 54, 125, 126]. It has been shown that the active modules found by these tools are useful for discovering some significant genes which are overlooked by other techniques [52]. Furthermore, an active module is a more compact and accurate model of what is going on at the molecular level. For these reasons and more, these networks have been used as biomarkers for classification purposes in diseases such as breast cancer [52]. To the best of our knowledge, all available active module discovery tools were designed and evaluated using only microarray data. That is, there is no work which investigates using RNA-Seq data for active module discovery.
In this work, we investigate the difference between using microarray and RNA-Seq data while discovering active modules in PPI networks. We design a set of experiments using two different datasets, one for colorectal cancer and the other for oligodendroglioma tumors, to see the potential of RNA-Seq data. Our experiments show that RNA-Seq indeed has enormous potential for finding significant genes, since it helps the tools to detect a set of genes which are not found if the microarray data is used. Besides, for the colorectal cancer data, the results from different tools are more consistent with each other when RNA-Seq data is used. However, these state-of-the-art tools generate larger active networks with RNA-Seq data, which reduces the compactness and hence the effectiveness of an active module as a biomarker.
The rest of the chapter is organized as follows: In the next section, we describe the tools used in this study for module analysis, and a brief comparison of microarrays and RNA-Seq is given. Section 4.2 describes our experimental setting and evaluates the results. Section 4.3 concludes the study.
4.1 Background
The problem of discovering active modules in PPI networks using the gene expression data was introduced by Ideker et al. [28]. Afterward, many algorithms such as Beisser et al. [120], GiGA [121], heinz [29], DEGAS [30], MATISSE [54], and CEZANNE [125] were developed to solve this problem.
4.1.1 Tools for Active Module Discovery
In general, the tools can be categorized in two classes based on how the datasets are used [127]:
1. Treatment-control based tools: These tools divide the collected samples into
two classes; treatment and control. The control samples, for example, can
represent healthy cells whereas the treatment samples can represent cancer or
healthy cells after gene knockout experiments. After this classification, a set
of connected genes with significantly different behaviors on the treatment and
control samples are extracted.
2. Co-expression-based tools: Instead of dividing the samples into more than one
class, the tools in this category analyze the co-expression patterns of gene pairs
across all samples. After that, they extract the connected components where the
genes show the same co-expression behavior across the samples.
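The two categories start from different per-gene or per-gene-pair statistics, which the following sketch contrasts. It is an illustration under stated assumptions: Welch's t-statistic stands in for whatever differential expression test a treatment-control tool uses, Pearson correlation stands in for the co-expression measure, and all expression values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(treatment: list, control: list) -> float:
    """Treatment-control view: difference of one gene between two classes."""
    se = sqrt(variance(treatment) / len(treatment)
              + variance(control) / len(control))
    return (mean(treatment) - mean(control)) / se


def pearson(x: list, y: list) -> float:
    """Co-expression view: similarity of two genes across all samples."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x)
                      * sum((b - my) ** 2 for b in y))


# Hypothetical expression values of one gene in two sample classes.
gene_treatment = [8.0, 9.0, 10.0]
gene_control = [2.0, 3.0, 4.0]
print(welch_t(gene_treatment, gene_control))  # large positive statistic

# Hypothetical expression values of two genes across six unlabeled samples.
gene_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
gene_y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
print(pearson(gene_x, gene_y))  # perfectly co-expressed pair
```

In both cases the resulting statistics are then combined with the PPI network to extract connected sets of genes.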
In this study, we use three treatment-control based tools: jActiveModules [28], heinz [29], and ExprEssence [126]. Another treatment-control based tool, DEGAS, is designed to extract the active modules when outliers exist [30]. Since the datasets used in this work are extracted from a cell line study, they do not contain outliers. Hence, DEGAS is excluded from our experiments.
jActiveModules
The tool jActiveModules is the first one developed to extract the active modules using gene expression data [28]. The tool returns the largest connected component with the highest score as the active module. jActiveModules first assigns a score to each gene based on the gene's p-values for the samples in the dataset. Thereafter, for each connected component, the algorithm aggregates the scores of the genes inside the component to generate a single score for it. The problem of finding the module with the highest score is an NP-hard problem [28]. Hence, the authors proposed a search heuristic based on simulated annealing to find such modules. In general, the algorithm suffers from a long running time and large output modules.
heinz
Even though jActiveModules is a popular tool, it suffers from several drawbacks, including the non-optimality of its results and its long running time. Dittrich et al. [29] developed heinz to alleviate these problems. The tool heinz first assigns a score to each gene in the network using the aggregated p-value of all of the samples in which the gene exists. The aim of the scoring function is to discriminate between signal and noise p-values by assigning a positive score to the former and a negative score to the latter. After scoring the genes in the network, heinz finds the optimal highest-scoring subgraph in this vertex-weighted graph. The problem of finding the subgraph with the optimal score is transformed into the well-known prize-collecting Steiner tree (PCST) problem and then solved by using the exact PCST algorithm of Ljubić et al. [128]. The main drawback of heinz is that it aggregates the p-values of all samples into a single p-value. Therefore, it implicitly assumes that all of the samples exhibit the same behavior for the same disease.
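The sign behavior of such a scoring function can be illustrated with a minimal sketch. It assumes a log-ratio form with a significance threshold `tau` and a shape parameter `a` in (0, 1). Both values and the function name `node_score` are hypothetical choices made for this illustration and are not taken from the text. The sketch only shows how p-values below the threshold receive positive scores and p-values above it receive negative scores.

```python
import math


def node_score(p_value, tau=0.01, a=0.5):
    """Illustrative score: positive if p_value < tau, negative if p_value > tau.

    Assumed form: (a - 1) * (log(p) - log(tau)) with 0 < a < 1.
    tau and a are hypothetical parameters chosen for this sketch.
    """
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must lie in (0, 1]")
    return (a - 1.0) * (math.log(p_value) - math.log(tau))


# Small p-values (likely signal) score positively, large ones negatively.
for p in (0.0001, 0.01, 0.5):
    print(p, round(node_score(p), 3))
```

With such a score, a connected subgraph can only gain from including a negatively scored gene if that gene connects enough positively scored genes to offset its cost.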
ExprEssence
While the above tools focus on extracting connected components, ExprEssence focuses on extracting individual links where the change in the genes’ expression values indicates a regulatory change such as stimulation or inhibition. Similar to the previous tools, ExprEssence starts by calculating a score for each gene. However, instead of using p-values, it directly uses the gene expression values. Afterward, the scores of the genes sharing the same link are used to calculate the corresponding link’s score.
4.1.2 Microarray vs. RNA-Seq: History
Both microarrays and RNA-Seq revolutionized transcriptome analysis and have been successfully used for quantifying the transcriptomes of different species and organisms [42]. Before the microarray era, researchers used other techniques to analyze the transcripts of a given species, such as Northern blots, expressed sequence tags (ESTs), serial analysis of gene expression (SAGE), and reverse transcription
PCR (RT-PCR) [40, 41]. However, these techniques suffered from limitations on the number of genes that can be analyzed in parallel [41]. Therefore, there was a need for a new technique to better analyze the different transcripts.
A complete picture of a cell’s gene expression pattern was not available before the development of gene expression microarrays [129]. They have been extensively used in many applications including gene expression detection, single nucleotide polymorphism (SNP) analysis, and mutation detection [41]. However, microarrays require prior knowledge about the structure of a gene. Additionally, analyzing genes in new genomes is hard due to the unavailability of probes for these genomes [40].
RNA-Seq is a more recent technique for transcriptome analysis using deep sequencing [42]. It made measuring the gene expression levels easier and more accurate. For instance, unlike microarrays, RNA-Seq can detect unidentified genes and it does not require any information about the distinct isoforms of the gene [40]. Therefore, it is more successful in detecting novel transcript isoforms. On the other hand, although its cost is continuously decreasing, RNA-Seq has generally been regarded as a more expensive technique than microarrays.
Many studies have compared microarrays and RNA-Seq, such as [117, 118, 40, 119, 41]. However, none of these works measures the true potential of RNA-Seq data for active module analysis. In this work, we compare microarrays and RNA-Seq to see if RNA-Seq is a good alternative. In our experiments, we compare the effectiveness of both techniques with respect to different structural and biological aspects of the active modules.
4.2 Experimental Evaluation
We first describe our experiments carried out using two different datasets and three active module discovery tools. We then present and evaluate their results. We used jActiveModules v2.23 and ExprEssence v1.2.1a, both of which are available as plugins for Cytoscape v2.8.3 [130]. For heinz, we used the public version which is available in the BioNet package [131].
The human PPI network we used contains 11,203 genes and 57,235 interactions. The network was assembled by Chuang et al. [52] from yeast two-hybrid experiments [132, 133], interactions predicted by ontology and co-citation [134], and interactions curated from the literature [135, 136, 137].
We carried out experiments with two datasets. In the first set, the gene expression values are calculated by both microarray and RNA-Seq techniques to analyze the RNA which is extracted from fluorouracil (5-FU)-resistant and -nonresistant human colorectal cancer cell lines. The microarray data was published as a part of an expression analysis study by Griffith et al. [138] in 2008, while the RNA-Seq analysis for the same RNA was later published in 2010 [139]. To prepare the 5-FU-resistant cell line (MIP/5-FU) [140], Griffith et al. passed the 5-FU-sensitive cell line (MIP101) [141] through increasing concentrations of 5-FU, resulting in a 5-FU-resistant cell line. They calculated the p-values for the output gene expression values using the Wilcoxon rank-sum test and Fisher’s exact test for the microarray and RNA-Seq datasets, respectively. Fisher’s exact test was used for the RNA-Seq data due to the differences in counts between the two samples. Afterwards, they applied Benjamini and Hochberg’s step-up false discovery rate controlling procedure [142] for multiple testing correction. The gene expression values in the microarray and RNA-Seq data were 85% correlated, while the p-values were only 27% correlated.
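For readers unfamiliar with the step-up procedure mentioned above, the following is a minimal sketch of the standard Benjamini-Hochberg adjustment. The function name and the example p-values are hypothetical. The sketch is not the code used by Griffith et al.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

A gene is then declared significant at a false discovery rate of, for example, 5% if its adjusted value is below 0.05.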
To further analyze the potential of RNA-Seq on active module analysis and see how the tools behave with it, we used a second RNA-Seq dataset which contains gene expression values for six different samples. Five out of the six samples are the treatment samples and they are extracted from five different versions of the oligodendroglioma tumor disease. The sixth sample is from tumor initiating cells and used as the control sample [footnote 1].
4.2.1 Colorectal cancer cell lines
The colorectal cancer cell line RNA-Seq data we used contains 36,952 genes where only 11,853 of them are expressed and have a corresponding Entrez gene ID. However, only 7,456 of these genes exist in the PPI network. On the other hand, the microarray data contains 2,510 genes where 2,410 of them are expressed and have a corresponding Entrez gene ID, and only 1,656 of these genes exist in the PPI network. The intersection between the two datasets contains 2,200 genes where only 2,008 genes are expressed and have an Entrez gene ID. Only 1,398 genes of the intersection set exist in the PPI network.
We calculated the number of differentially expressed (DE) genes in each dataset.
We assume that a gene is differentially expressed if it has a fold change ≥ 2 and a
p-value < 0.05 [139]. We found that the number of DE genes is 251, 209, and 86 in
RNA-Seq data, microarray data, and their intersection, respectively. However, only
157, 128, and 53 of these DE genes, respectively, exist in the PPI network.
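The DE criterion above can be expressed as a short filter. In this minimal sketch, the gene names, fold changes, and p-values are hypothetical, and the fold change is assumed to be given as a magnitude (so that up- and down-regulation are treated alike).

```python
def is_differentially_expressed(fold_change, p_value,
                                min_fold_change=2.0, max_p_value=0.05):
    """Apply the criterion: fold change >= 2 and p-value < 0.05."""
    return fold_change >= min_fold_change and p_value < max_p_value


# Hypothetical genes: name -> (fold-change magnitude, p-value).
genes = {
    "geneA": (3.1, 0.001),   # passes both thresholds
    "geneB": (1.2, 0.0004),  # significant p-value but small fold change
    "geneC": (2.5, 0.2),     # large fold change but not significant
}

de_genes = [name for name, (fc, p) in genes.items()
            if is_differentially_expressed(fc, p)]
print(de_genes)  # ['geneA']
```

Restricting `de_genes` further to the genes present in the PPI network corresponds to the second set of counts reported above.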
For the rest of the experiments, we name the output modules as follows:
1. MicroNet: represents an output module obtained by using the microarray
data.
2. RnaNet: represents an output module obtained by using the RNA-Seq data.
3. MicroInterNet/RnaInterNet: represent output modules obtained by using microarray/RNA-Seq gene expression values for the set of genes existing in both datasets only.
Footnote 1: http://www.alexaplatform.org/alexa_seq/Oligo/Summary.htm
The combination of the three active MicroNet modules from the tools is visualized in Figure 4.1. As the figure shows, the heinz module is more focused, whereas ExprEssence can find nodes from different parts of the PPI network since it returns links rather than a connected module.
Module size analysis
Table 4.1: The size of the active modules returned by the different tools.

                 MicroNet          RnaNet            MicroInterNet     RnaInterNet
Tool             #nodes  #edges    #nodes  #edges    #nodes  #edges    #nodes  #edges
jActiveModules   219     486       2,330   11,259    233     498       1,773   6,982
heinz            36      41        198     675       18      19        44      71
ExprEssence      210     168       2,039   2,288     177     138       170     138
When the data is changed from microarray to RNA-Seq, the sizes of the active modules change as well. With respect to this criterion, the tools behave differently on the two datasets, as Table 4.1 shows. The sizes of the output networks obtained by each tool increase drastically when the RNA-Seq data is used instead of the microarray data. The increase is around 10× for jActiveModules, 5.5× for heinz, and 10× for ExprEssence. Furthermore, the RnaNet network for heinz is more than 10× smaller than those of the other tools. These results suggest that when RNA-Seq data is used, heinz is still able to maintain more specific and focused results.
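The growth factors quoted above follow directly from the node counts in Table 4.1. A short sketch of the arithmetic (the variable names are illustrative):

```python
# Node counts copied from Table 4.1: tool -> (MicroNet, RnaNet).
node_counts = {
    "jActiveModules": (219, 2330),
    "heinz": (36, 198),
    "ExprEssence": (210, 2039),
}

# Growth factor of the module size when moving from microarray to RNA-Seq.
growth = {tool: rna / micro for tool, (micro, rna) in node_counts.items()}

for tool, factor in growth.items():
    print(f"{tool}: {factor:.1f}x")
# jActiveModules: 10.6x
# heinz: 5.5x
# ExprEssence: 9.7x
```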
Figure 4.1: The visualization of the MicroNet networks obtained by the three tools. The yellow, turquoise, and green nodes show the modules from jActiveModules, heinz, and ExprEssence, respectively. The black nodes exist in all three networks. The red nodes exist in both the ExprEssence and heinz modules. The blue nodes exist in both the ExprEssence and jActiveModules modules. The purple nodes exist in both the jActiveModules and heinz modules.
When the sizes of MicroInterNet and RnaInterNet are examined, the changes in network size are observed to be smaller than in the MicroNet and RnaNet case. The increase ratios are 2.4× for heinz (18 to 44 nodes) and about 7.6× for jActiveModules (233 to 1,773 nodes). However, the number of genes in the RnaInterNet module for jActiveModules is larger than the number of input genes, which is clearly not desired. On the other hand, a slight reduction is observed for ExprEssence, which uses the expression values instead of the p-values in its scoring function. For this dataset, the expression values of the microarray and RNA-Seq data are 85% correlated, whereas the p-values are only 27% correlated. Hence, it is expected that ExprEssence maintains roughly the same network size.
Significant gene analysis
Table 4.2: Number of DE genes in each active module. We focus on the 53 genes which exist in both microarray and RNA-Seq data.

                 MicroNet       RnaNet         MicroInterNet  RnaInterNet
Tool             #DE  %         #DE  %         #DE  %         #DE  %
jActiveModules   18   8.2%      37   1.6%      19   8.2%      36   2.0%
heinz            6    16.6%     8    4.0%      1    5.6%      4    9.0%
ExprEssence      30   14.2%     48   2.4%      29   16.4%     26   15.3%
In this experiment, we focus on the importance of the genes contained in each module. Table 4.2 shows the number of DE genes that exist in the module returned by each tool. The percentages in the table are the ratio of the number of DE genes to the number of nodes in a module. As the numbers show, the number of DE genes increases when RNA-Seq data is used instead of microarray data, with the exception of the ExprEssence intersection modules (29 vs. 26). On the other hand, the percentage of DE genes decreases, except for the heinz intersection modules (5.6% vs. 9.0%). Since ExprEssence returns links between genes rather than connected components, it returns the largest number of DE genes for most modules and, compared with the other tools, its DE percentages are much better for the intersection networks. Its MicroInterNet DE percentage (16.4%) is among the best in this experiment. For all tools, an increase in the number of DE genes is expected since the RNA-Seq data is larger and hence the output network is expected to be larger. However, this may not be desirable in practice since, as the percentages in Table 4.2 show, the increase in the number of DE genes (2× for jActiveModules) is small compared with the increase in network size (10× for jActiveModules).

Table 4.3: Individual existence of significant genes in different modules. M and R denote the existence of the corresponding gene in the microarray and RNA-Seq data. A mark in a cell refers to the existence of the gene in the corresponding module.
Columns: TYMS (R), TK1 (R), CDH1 (M&R), UMPS (M&R), ABCB1 (M&R), GDF15 (M&R), TNFRSF1B (M&R).
Rows: MicroNet, RnaNet, MicroInterNet, and RnaInterNet for each of jActiveModules, heinz, and ExprEssence.
(Cell marks omitted.)
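The comparison between the growth in module size and the growth in the number of DE genes can be reproduced from Tables 4.1 and 4.2. A small sketch for jActiveModules (the helper name `de_percentage` is illustrative):

```python
# (number of nodes, number of DE genes) for jActiveModules,
# copied from Tables 4.1 and 4.2.
micro_net = (219, 18)
rna_net = (2330, 37)


def de_percentage(nodes, de_genes):
    """Share of DE genes among all nodes of a module, in percent."""
    return 100.0 * de_genes / nodes


size_growth = rna_net[0] / micro_net[0]  # growth of the module size
de_growth = rna_net[1] / micro_net[1]    # growth of the DE gene count

print(f"MicroNet DE share: {de_percentage(*micro_net):.1f}%")
print(f"RnaNet DE share:   {de_percentage(*rna_net):.1f}%")
print(f"size growth {size_growth:.1f}x vs. DE growth {de_growth:.1f}x")
```

Because the module grows about five times faster than its DE gene count, the DE share drops even though the absolute number of DE genes rises.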
In addition to the number of DE genes in each module, we also try to evaluate
the significance of the module by examining its relation with the resistance to the
5-FU drug in colorectal cancer cells. Table 4.3 shows the results for this analysis.
The significant genes we use for this experiment are as follows:
1. TYMS (thymidylate synthetase) is a gene biologically shown to be involved in the regulation of apoptotic processes [143]. The 5-FU drug works by inhibiting the protein product of TYMS [144]. In addition, the over-expression of this gene is known to increase the resistance to the 5-FU drug [145, 146, 144]. The gene is differentially expressed in the RNA-Seq data. As shown in Table 4.3, jActiveModules is the only tool capable of returning this gene in the RnaInterNet module. Note that jActiveModules explores the neighbors up to two nodes away from the active nodes. Therefore, it can return genes that do not exist in the input dataset.
2. TK1 (thymidine kinase) is an enzyme shown to be related to the increase in TYMS deficiency [144]. TK1 is not differentially expressed in the RNA-Seq data. However, due to the 1.2-fold change in its expression levels, ExprEssence is able to return this gene in the RnaNet module.
3. CDH1 (cadherin 1) is a classical cadherin gene. It was observed that CDH1 is
down-regulated in 5-FU resistant cells [147]. As described in [147], SNAI1 is
the main reason for the suppression of the CDH1. Therefore, we checked the
existence of both genes in both microarray and RNA-Seq data. We looked for
any direct interaction between the two genes in the output modules. CDH1 is
found differentially expressed in the RNA-Seq data while SNAI1 exists only in
the RNA-Seq data and is not differentially expressed. SNAI1 does not exist in any
of the output modules. A possible reason for its absence can be the absence of
direct interaction between SNAI1 and any differentially expressed gene in the
PPI network.
4. UMPS (uridine monophosphate synthetase) is one of the 5-FU metabolism genes that is believed to critically affect the response of the tumor to 5-FU [148, 149]. UMPS is differentially expressed in both datasets, with a fold change > 2 only in the RNA-Seq data. As shown in Table 4.3, UMPS is only found in RnaNet and RnaInterNet by heinz and jActiveModules. On the other hand, ExprEssence returns this gene in all its modules. Interestingly, an interaction between UMPS and the HNF4A gene is implied in most of the modules. HNF4A is an important gene that is known to be involved in many important functions such as polarity and organization of tissues, and proliferation of tumor cell lines [139]. HNF4A is known to be dysregulated in colorectal cancer [150]. However, according to our dataset, HNF4A is down-regulated only in the RNA-Seq data.
5. ABCB1 is a multidrug resistance gene that is known to be differentially expressed in colorectal cancer cells [140]. It exists in both datasets we used, but it has a p-value < 0.02 only in the RNA-Seq data. Table 4.3 shows that ABCB1 is returned as a part of all ExprEssence modules. However, jActiveModules and heinz returned this gene only in RnaNet and RnaInterNet.
6. GDF15 is the growth differentiation factor 15 protein. This protein has a role
in 5-FU resistance in colon cancer [146]. We found this gene to be differentially
expressed only in the RNA-Seq data. jActiveModules and heinz returned this
gene in both RnaNet and RnaInterNet modules. Interestingly, our results
imply an interaction between the GDF15 and HNF4A genes. However, we do
not know whether this implication is biologically reasonable or not.
7. TNFRSF1B (tumor necrosis factor receptor super-family, member 1B) gene is
known to be related to the 5-FU resistance [146]. According to our experiments,
heinz is not able to return this gene in any of the output modules. However,
it exists in the microarray data with a p-value < 0.05.
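Item 1 above notes that jActiveModules explores neighbors up to two nodes away from the active nodes, which is how a gene absent from the input data can still appear in a module. A minimal breadth-first sketch of such a bounded neighborhood is given below. The toy graph and node names are invented for illustration and are not taken from the PPI network, and the function is not the tool's actual implementation.

```python
from collections import deque


def within_hops(adjacency, seeds, max_hops=2):
    """Return {node: distance} for all nodes within max_hops of any seed."""
    distance = {seed: 0 for seed in seeds}
    queue = deque(distance)
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


# Hypothetical path graph A - B - C - D; only A is an active (seed) node.
toy_graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
reachable = within_hops(toy_graph, ["A"])
print(reachable)  # {'A': 0, 'B': 1, 'C': 2}
```

In this toy example, node C is reachable even if it carries no expression data of its own, while D lies three hops away and is excluded.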
Table 4.4: Top three hub nodes in each module and their number of edges.

(a) jActiveModules
MicroNet           RnaNet             MicroInterNet      RnaInterNet
hub       #edges   hub       #edges   hub       #edges   hub           #edges
GRB2      51       HNF4A     737      GRB2      51       HNF4A         549
SHC1      28       RPS27A    108      EGFR      31       GTF2F2        72
FYN       25       RPS3      105      SRC       27       RPS25         69

(b) heinz
MicroNet           RnaNet             MicroInterNet      RnaInterNet
hub       #edges   hub       #edges   hub       #edges   hub           #edges
EGFR      10       HNF4A     71       GRB2      8        HNF4A         21
GRB2      9        RPS15A    30       CRK       5        RPL12, RPS3   9
ONECUT1   7        EEF1A1    29       INSR      4        CDH1, RPS5    8

(c) ExprEssence
MicroNet           RnaNet             MicroInterNet      RnaInterNet
hub       #edges   hub       #edges   hub       #edges   hub           #edges
HNF4A     21       HNF4A     441      HNF4A     18       HNF4A         20
CDKN1A    19       CDKN1A    56       CDKN1A    17       CD44          9
CD44      9        BCL2      47       CD44      8        BCL2          7
In summary, the RNA-Seq gene expression values helped the tools to return the important genes as a part of the output modules. On the other hand, the tools did not return these genes in the microarray-based modules, even though some of them exist in the microarray data. Surprisingly, heinz was able to return five out of the seven important genes we found in the literature related to the 5-FU-resistance behavior. Additionally, compared to the other tools, it did not encounter a significant increase in the network size when used with RNA-Seq data.
Hub-node analysis
In this experiment, we observed how the hub nodes of the active modules change for different data and tools. Table 4.4 shows the three hub nodes with the highest number of connections in each module. All the tools return HNF4A as the first hub for both the RnaNet and RnaInterNet modules. We believe that this is due to the better accuracy of the RNA-Seq data, which leads to better detection of relevant nodes. As mentioned previously, HNF4A is known to be dysregulated in colorectal cancer [150] and is an important gene involved in many functions including polarity and organization of tissues [139].
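Extracting the hub nodes of a module amounts to ranking its nodes by degree. The following minimal sketch does this for a module given as an edge list. The edge list and node names are hypothetical, and ties are broken alphabetically as an arbitrary choice for this illustration.

```python
from collections import Counter


def top_hubs(edges, k=3):
    """Return the k nodes with the most edges as (node, degree) pairs."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Sort by descending degree, then by name to break ties deterministically.
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))[:k]


# Hypothetical module with five nodes.
module_edges = [("g1", "g2"), ("g1", "g3"), ("g1", "g4"),
                ("g2", "g3"), ("g4", "g5")]
print(top_hubs(module_edges))  # [('g1', 3), ('g2', 2), ('g3', 2)]
```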
4.2.2 Oligodendroglioma tumors
This dataset contains only RNA-Seq gene expression values for five treatment samples (oligodendroglioma tumor samples) and one control sample (tumor initiating cells). The dataset contains 49,868 genes where only 10,823 genes are expressed and have an Entrez gene ID. Furthermore, the number of DE genes (i.e., p-value < 0.05 and fold change > 2) is 1,086. However, only 7,082 of the 49,868 genes are in the PPI network, and 726 of them are differentially expressed.
The results of the experiments are given in Tables 4.5 and 4.6. Interestingly, the tools show a totally different behavior on the oligodendroglioma dataset when compared with the colorectal cancer dataset. For this dataset, jActiveModules provides the smallest module with only 160 nodes while heinz returns the largest module with 1,593 nodes. Moreover, about 30% of the genes in the jActiveModules and heinz modules are DE. In contrast, previous studies (e.g., [29, 30, 126]) showed that when the greedy search algorithm of jActiveModules was used with microarray data, jActiveModules returned the largest module with the smallest percentage of DE genes in comparison to the other tools.
This behavior is due to the nature of the datasets used and how the tools handle the input p-values/expression values. The colorectal cancer samples contain replicates of the same sample, whereas the oligodendroglioma samples consist of five different samples, each representing a different version of the disease. Both heinz and ExprEssence aggregate the input p-values/expression values into a single p-value/expression value, and they use this single value to score each node and hence the connected components and links. On the contrary, jActiveModules calculates the score for each connected component for each sample separately. Then it gives an aggregated score for each connected component by measuring how much this component represents each sample. Finally, it returns the connected component with the highest aggregated score. Therefore, if the variance between the input samples is high (such as the samples in the oligodendroglioma dataset), the output of the heinz scoring function may be inaccurate.
Table 4.5: Sizes of the active modules found by the tools and the number of differentially expressed genes in them when the oligodendroglioma dataset is used.

                 Network size
Tool             #nodes   #edges   #DE   DE %
jActiveModules   160      355      48    30.0%
heinz            1,593    6,812    470   29.9%
ExprEssence      802      2,098    102   12.7%
Table 4.6: Hub-node analysis of the tools when using the oligodendroglioma dataset.

jActiveModules                            heinz             ExprEssence
hub                              #edges   hub      #edges   hub     #edges
RPS10                            21       EEF1A1   93       RPS3    140
RPS15                            20       RPS16    89       RPL26   121
RPS8, RPS6, RPS3, RPS26, RPS27A           RPS3A    88       RPS25   115
Table 4.6 shows the hub nodes. In contrast to the colorectal cancer dataset, each tool returned an almost entirely different set of hub nodes. We believe this happens because the tools handle the expression values and p-values differently and the correlation between the oligodendroglioma samples is low.
4.3 Conclusion and Future Work
The discovery of active modules in protein-protein interaction networks allowed biologists to have a new perspective on gene expression data. Since the introduction of the problem in 2002 by Ideker et al. [28], many algorithms have been developed to provide better accuracy and new dimensions for the problem. However, all of these studies used microarray data to evaluate the quality of the output results and the performance of the different tools.
In this work, we investigated the efficiency of RNA-Seq data in extracting the active modules. To achieve this goal, we used both RNA-Seq and microarray gene expression values for the same RNA sample extracted from colorectal cancer cell lines to discover the active modules. To further understand the effectiveness of RNA-Seq data, we used another RNA-Seq gene expression dataset extracted from five different oligodendroglioma cancer samples. The results showed that RNA-Seq can be more useful than microarrays in detecting relevant and overlooked active modules. In addition, the RNA-Seq-based modules in our experiments contained more biologically significant genes.
In our experiments, RNA-Seq data helped in better evaluating the performance of the different tools. For instance, it is mentioned in many studies that jActiveModules returns the largest module in comparison to the other tools. However, on the oligodendroglioma dataset, jActiveModules returned the smallest module. Moreover, around 30% of the returned genes were found to be differentially expressed in the input data. This suggests that the performance of the tools highly depends on the type of the input data. This is why we believe that experiments with RNA-Seq data are necessary to understand the algorithms better and evaluate the performance of the tools further.
Since the sizes of the output modules grow with RNA-Seq data, the modules become less effective as biomarkers. Therefore, new algorithms or new tools for the active module discovery problem will be needed. For some applications and use cases, it is better to return more focused and compact modules. Alternatively, a set of plugins or additions to the tools that annotate their output may be needed for easier analysis of the results.
As future work, we will investigate new scoring functions to detect more specific and focused modules. We are planning to use both the relative and individual properties of the different genes to determine the importance of the genes in the output module. Moreover, the directions of the interactions in the PPI network can be included in the scoring function for better detection of relevant genes.
Chapter 5: PRASE: PageRank-based Active Module Extraction
In complex diseases, genes do not act in isolation; rather, they tend to interact together in pathways and modules to perform the designated function [1]. Therefore, many researchers focused on characterizing such modules and defining their properties.
One possible method to detect such modules is to detect dense clusters of genes in different networks such as protein-protein interaction (PPI) networks [9, 151]. Such algorithms are based on the idea that genes performing the same function interact heavily with each other in comparison to their interaction pattern with other genes [9].
However, depending only on one type of data to extract the modules would lead to suboptimal results [54], thus, making it harder to explain the underlying behavior of the disease. Therefore, the integration of different types of data is critical for understanding the disease mechanism.
According to the types of data used in the integration process, the problem can be defined as either detecting genotypic modules or phenotypic modules [152]. Genotypic
modules refer to using genotype data, such as gene mutation information, to detect
modules enriched with genes having genetic alterations related to the disease mechanism [152]. In a recent work, Vandin et al. addressed this problem in their algorithm, HotNet [57]. They integrated disease-specific mutated gene information with the PPI network to detect sets of k mutated genes existing in the largest number of samples. To achieve this goal, they first constructed an influence graph containing only the mutated genes. Then, they calculated the weights of the edges of the influence graph by applying a diffusion process [153] on the PPI network to calculate the influence between all pairs of mutated genes. In a further work, Vandin et al. focused on returning sets of mutated genes that are mutually exclusive, i.e., genes whose mutations do not occur in the same sample [154]. They relied on the observation that different mutated genes can perturb the same pathway while not being mutated at the same time in the same sample.
Phenotypic modules, on the other hand, refer to using phenotype data, such as gene expression, to extract the group of interacting genes that best explain the underlying disease behavior. One of the pioneering works in this field was carried out by Ideker et al. [28]. They named the discovered modules as active modules, referring to how these modules unexpectedly contain many interacting differentially expressed genes.
To extract the active modules, Ideker et al. integrated gene expression data with the
PPI network. They scored each candidate module by calculating the sum of all the genes’ Z-scores in the module, where the Z-score for each gene is calculated from the corresponding p-value of that gene. The problem of detecting the highest-scoring module is NP-hard; therefore, they provided a simulated-annealing-based heuristic to discover the active modules.
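The scoring described above can be sketched as follows. The conversion of a p-value to a Z-score uses the inverse standard normal CDF, and the module score is the sum of the Z-scores of its genes. The optional division by the square root of the module size is an assumption added here so that modules of different sizes become comparable, since the text above only states the sum. Function names and example p-values are hypothetical.

```python
from statistics import NormalDist


def z_score(p_value):
    """Convert a p-value in (0, 1) to a Z-score; small p gives large Z."""
    return NormalDist().inv_cdf(1.0 - p_value)


def module_score(p_values, normalize=False):
    """Sum of gene Z-scores; optionally divided by sqrt(module size)."""
    if not p_values:
        return 0.0
    total = sum(z_score(p) for p in p_values)
    return total / len(p_values) ** 0.5 if normalize else total


# Hypothetical modules: one with two significant genes, one mixed.
print(module_score([0.01, 0.01]))
print(module_score([0.01, 0.5]))
print(module_score([0.01, 0.01], normalize=True))
```

A p-value of 0.5 contributes a Z-score of zero, so non-informative genes neither raise nor lower the plain sum.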
After Ideker et al., many algorithms have been proposed for active module (phenotypic module) discovery. Most of these algorithms use a greedy approach, while the algorithm heinz by Dittrich et al. employs integer programming to find optimal solutions [29]. The problem definition and hence the functions to be optimized have been slightly altered in these and subsequent studies to cover various aspects and maximize the usefulness of the discovered modules. For instance, Ulitsky et al. aimed to find modules (a.k.a. functional modules) whose genes show correlated expression [54]. Later, they extended their algorithm with an edge-weighting scheme and a probabilistic model for module connectivity [125]. In addition, rather than using the gene expression values directly, Ulitsky et al. converted the expression values to binary, where a gene is on if it is differentially expressed and off otherwise [30]. With this modification, the problem is defined as discovering the modules with a certain number of differentially expressed genes.
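The binary view of expression described above can be illustrated with a minimal sketch. The p-value threshold, the gene names, and the helper names are hypothetical. The sketch only shows the on/off conversion and the counting of on genes in a candidate module, not the search procedure of any particular tool.

```python
def binarize(p_values, threshold=0.05):
    """Map each gene to True (on) if its p-value is below the threshold."""
    return {gene: p < threshold for gene, p in p_values.items()}


def count_on_genes(module, state):
    """Number of genes in the candidate module that are switched on."""
    return sum(1 for gene in module if state.get(gene, False))


# Hypothetical p-values for four genes.
gene_p_values = {"g1": 0.001, "g2": 0.2, "g3": 0.03, "g4": 0.6}
state = binarize(gene_p_values)

candidate_module = ["g1", "g2", "g3"]
print(count_on_genes(candidate_module, state))  # 2
```

A candidate module would then be accepted only if this count reaches the required number of differentially expressed genes.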
One drawback of the current definitions is that they focus only on returning the
module containing highly differentially expressed genes. However, a gene might not be
highly differentially expressed when observed but exhibit a coordinated dysregulation
with surrounding genes [155]. Such genes can be easily overlooked by current tools.
The active modules can be exploited in various ways: to detect de novo pathways, to find new biomarkers, or to guide knockdown experiments. Recently, many algorithms have been proposed to extract module biomarkers from the protein-protein interaction (PPI) network [50, 52, 156, 122, 123, 155]. The main difference between the original active module and biomarker extraction problems is that the former focuses on extracting the most comprehensive connected module with the largest score, whereas the latter aims to extract a number of relatively small modules which are used to differentiate between the case and control samples.
All of the above active module discovery algorithms were designed and tested using microarray data. However, the more recent alternative, RNA-Seq, produces data that exhibits different properties than microarray reads. It sequences whole-cell mRNA instead of looking for the existence of certain genes in a microarray. RNA-Seq made measuring the gene expression levels easier and more accurate: unlike microarrays, it can detect de novo genes without a priori information regarding the distinct isoforms of the gene [40]. Therefore, it has been more successful than microarrays in detecting novel transcript isoforms. However, the output datasets of RNA-Seq are much larger, making them harder to analyze. In a recent work, Hatem et al. showed that RNA-Seq data can yield promising results in discovering relevant active modules at the expense of generating large ones [157]. Therefore, new algorithms are required to discover smaller, high-quality active modules.
In this work, we focus on the original active module extraction problem. Our work tackles two main concerns: first, including important but not necessarily differentially expressed genes in the network; second, detecting smaller and more focused active modules to facilitate further analysis while making use of the RNA-Seq properties to return more accurate and disease-related modules. To address these points, we introduce a novel workflow, PRASE, which adjusts the gene expression p-values while making use of the RNA-Seq data properties to enrich the outcome of the current active module discovery tools. Our workflow starts by constructing a gene co-expression network from the RNA-Seq data, which is more accurate than the microarray data and contains a complete image of the mRNA in the cell. Therefore, by constructing a gene co-expression network, we make use of all the possible dependencies between the coding genes in the cell. Such dependencies might not exist in the PPI network due to missing information. Using the p-values for the genes in the co-expression network, PRASE employs the personalized PageRank algorithm, a variant of the famous PageRank algorithm originally used by Google to rank web pages [38], to exploit the gene-gene interactions in a more elegant way. In this way, the complete dependency information between the genes obtained from the RNA-Seq data is used to boost the p-value of the important genes. Finally, the PageRank values are adjusted to generate new p-values, which are then fed, together with the RNA-Seq-specific PPI module, to the active module extraction tools.
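To make the idea of personalized PageRank concrete, the following is a minimal power-iteration sketch on a toy undirected network. It is not the PRASE implementation. The toy graph, the p-values, the damping factor of 0.85, and the use of -log10(p) as the restart weight of each gene are all assumptions made for this illustration.

```python
import math


def personalized_pagerank(adjacency, personalization, damping=0.85,
                          tol=1e-10, max_iter=200):
    """Power iteration for PageRank with a non-uniform restart distribution."""
    nodes = sorted(adjacency)
    total = sum(personalization[n] for n in nodes)
    restart = {n: personalization[n] / total for n in nodes}
    rank = dict(restart)
    for _ in range(max_iter):
        # Every node first receives its share of the restart probability.
        new_rank = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            neighbors = adjacency[n]
            if neighbors:
                share = damping * rank[n] / len(neighbors)
                for neighbor in neighbors:
                    new_rank[neighbor] += share
            else:
                # Dangling node: hand its mass back to the restart distribution.
                for m in nodes:
                    new_rank[m] += damping * rank[n] * restart[m]
        delta = sum(abs(new_rank[n] - rank[n]) for n in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical co-expression network (undirected, stored symmetrically).
adjacency = {
    "g1": ["g2", "g3"],
    "g2": ["g1", "g3"],
    "g3": ["g1", "g2", "g4"],
    "g4": ["g3"],
}
# Hypothetical p-values; the restart weight -log10(p) is an assumption.
p_values = {"g1": 0.001, "g2": 0.04, "g3": 0.5, "g4": 0.9}
weights = {gene: -math.log10(p) for gene, p in p_values.items()}

ranks = personalized_pagerank(adjacency, weights)
for gene in sorted(ranks, key=ranks.get, reverse=True):
    print(gene, round(ranks[gene], 3))
```

In this toy example, g3 obtains a high rank despite its unremarkable p-value because it is connected to the significant genes g1 and g2, which mirrors the intended effect of boosting genes that interact with many differentially expressed genes.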
Using PRASE, the importance of the genes which interact with many differentially expressed genes will increase. Hence, they will be contained in the output module with a larger probability. The new p-values obtained via PRASE are further used with two popular tools, jActiveModules [28] and heinz [29], to extract the final output active module.
The effectiveness of PRASE is extensively evaluated using a number of evaluation criteria, including the size of the network, the percentage of differentially expressed genes, the percentage of disease-related genes, and GO and pathway enrichment analysis. In general, a technique is considered superior to the others if it performs best on most of the criteria. For instance, a technique that provides a smaller module with a large percentage of differentially expressed genes is considered better than another one with a larger module and a smaller percentage. Note that we focus on the percentage rather than the absolute number of differentially expressed genes.
The rest of the chapter is organized as follows: In Section 5.1, the background material is given for active module extraction and PageRank algorithm. Section 5.2 describes the proposed workflow. A thorough experimental evaluation of the workflow is presented in Section 5.3. Section 5.4 concludes the work mentioned in this chapter.
5.1 Background
5.1.1 Active module extraction tools
Many tools have been developed to detect the most active module using different
metrics. In general, these tools are divided into two groups based on the type of the
input: gene expression values or p-values. Examples for expression-value-based tools
are DEGAS [30] and GXNA [51]. DEGAS uses the gene expression values to calculate the
p-value for each gene. Afterwards, it determines whether a gene is on or off based on its p-value
and a given threshold. Finally, it looks for the set of at least k connected on genes
covered by at least l samples. GXNA uses the gene expression values to calculate a combined score for each module. Then, it returns the module with the highest score.
The second category, i.e., p-value based tools, contains jActiveModules [28] and
heinz [29]. jActiveModules works by calculating a combined score for each module
S in each sample using the p-values. Then, it calculates a combined score for S across
all the samples. It then returns the module with the highest combined score. Since
the problem of finding the highest weighted module is NP-hard [28], jActiveModules
uses a search heuristic to find these modules. Many algorithms have been developed
to provide either a better scoring function or a better search heuristic. Nevertheless,
jActiveModules is still widely used since it has a very easy and simple user interface
and very few parameters to tweak while providing good results.
The tool heinz uses an integer linear programming approach to find the optimal
module with the highest score. It first aggregates all the p-values of a given node across
all the samples into a single p-value. The aggregation function returns a negative score
for noise and a positive score for the correct p-values. heinz elegantly transforms the
problem into the well-known prize-collecting Steiner tree problem and finds an optimal solution by using the algorithm described in [128].
5.1.2 PageRank for gene ranking
The PageRank algorithm has been developed to provide an accurate ranking of web pages [38]. It has been used in many different areas including bioinformatics.
PageRank follows the random-surfer model iteratively: at each iteration, with probability δ, the PageRank score of a node i is distributed equally among i's neighbors. The remaining (1 − δ) probability is distributed uniformly among all the nodes. That is, at each iteration, the process is restarted from an arbitrary node with probability (1 − δ). In PageRank, the high-ranked nodes distribute their scores to their immediate neighbors and hence boost their ranking. As the algorithm iterates, these contributions propagate to the other nodes. Formally, the PageRank of node i at iteration t, denoted as $r_i^t$, is equal to
\[ r_i^t = \frac{1-\delta}{N} + \delta \sum_{j=1}^{N} \frac{r_j^{t-1} w_{ij}}{d_j} \tag{5.1} \]
where N is the number of nodes in the network, $d_j$ is the degree of node j, and $w_{ij}$ is equal to 1 if nodes i and j are connected, and to 0 otherwise.
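The iteration in Eq. (5.1) can be illustrated with a minimal sketch. The function name, the convergence tolerance, the uniform starting vector, and the handling of isolated nodes are assumptions made for this illustration only; they are not taken from the PageRank implementation used in this work.

```python
import numpy as np

def pagerank(adj, delta=0.85, tol=1e-10, max_iter=1000):
    """Iterate Eq. (5.1) on an unweighted, undirected adjacency matrix."""
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    deg = adj.sum(axis=0)        # d_j: degree of node j
    deg[deg == 0] = 1.0          # assumption: isolated nodes pass on no score
    r = np.full(n, 1.0 / n)      # assumption: uniform starting scores
    for _ in range(max_iter):
        # r_i^t = (1 - delta)/N + delta * sum_j w_ij * r_j^(t-1) / d_j
        r_next = (1.0 - delta) / n + delta * (adj @ (r / deg))
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r

# Toy star graph: node 0 is connected to nodes 1, 2, and 3.
star = np.array([[0, 1, 1, 1],
                 [1, 0, 0, 0],
                 [1, 0, 0, 0],
                 [1, 0, 0, 0]])
scores = pagerank(star)
```

In this toy graph the hub (node 0) ends up with a higher score than any of its three neighbors, which all receive the same score.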
Morrison et al. used the personalized variant of the algorithm to rank the genes based on their expression values [158]. In this variant, the process is restarted with (1 − δ) probability at each iteration, but the restart probability is not distributed equally among the genes. To prioritize or deprioritize the genes, their fold changes are used as the personalization vector. Winter et al. also used the personalized PageRank to analyze biological networks, where the personalization vector is obtained from the Pearson correlations and a transcription-factor network is used for the gene interactions [159]. Recently, Ivan et al. employed the personalized PageRank for similar purposes [160].
The PageRank algorithm and its variants have been used to rank the genes according to their interactions with known, disease-related ones. Nevertheless, all of these algorithms are more concerned with prioritizing genes rather than taking PageRank one step further to prioritize and extract a module. It has been shown that cancer-related proteins maintain a large number of interactions when compared to non-cancer-related proteins [161]. Here, by using PageRank, we incorporate the topology information and use these interactions to detect genes and gene networks which play an active role in the disease.
Similar to our work, Vandin et al. also used a random-walk based algorithm to calculate the dependency between the different mutated genes [57]. They used the dependency information to construct a graph representing the dependency between pairs of mutated genes. Finally, they returned the module of mutated genes that best represents the disease mechanism. However, unlike our work, they do not use any gene expression data. Therefore, all of the mutated genes are treated equally and they do not have any initial prioritization. In addition, the edges in the returned module do not represent physical interactions between the genes; hence, the output is a set of mutated genes rather than a physical interaction network.
5.2 PRASE
The proposed approach works in multiple steps to utilize RNA-Seq data. The overall workflow is summarized in Figure 5.1.
Figure 5.1: The PRASE workflow: PRASE first generates the gene co-expression network from the set of genes in the RNA-Seq data. Then it generates the corresponding adjacency matrix. The PageRank algorithm uses the old p-values, p, and the adjacency matrix as inputs for re-ranking. The new p-values, denoted with p′, are generated by scaling the PageRank output. Then they are used with the RNA-Seq PPI network for the active module extraction process.
5.2.1 Input network and matrix construction
There are two networks required in our workflow: the input network required for the module extraction tools and the PageRank input network. For the former, the required input network is a PPI network. However, to make use of the information contained in the RNA-Seq data and reduce the false-positive rate, we are using the
PPI network containing only RNA-Seq genes. The RNA-Seq network is extracted using the extraction tool provided in the BioNet package [162].
For PageRank, a gene co-expression network is constructed and used as an input.
Indeed, other types of networks can be used such as the PPI network. However, the
PPI network is incomplete and does not contain all of the interaction information
between different genes. On the other hand, the gene co-expression network has the ability to capture indirect dependencies and possible interaction patterns between these genes. In addition, having a complete set of active coding genes (obtained from the RNA-Seq data), we also have the ability to retrieve possible interactions between them. Therefore, applying PageRank on the RNA-Seq gene co-expression network can boost the rank of the most important genes even if they are not differentially expressed from the p-value perspective. The simplest method to construct the gene co-expression network is to put an edge between a pair of genes if their Pearson correlation is above a threshold. There are also other variations of this simple method. In this work, we used Ruan et al.'s rank-based method to construct the gene co-expression network, which tries to minimize the false positives and obtain only the accurate connections [13]. Note that any other gene co-expression construction algorithm can be integrated into PRASE.
A drawback of using the gene co-expression network is the requirement of a large number of samples to accurately construct edges between the genes. Hence, in case of a low number of samples, the RNA-Seq PPI network may be an alternative to obtain the PageRank values.
To generate the adjacency matrix of the gene co-expression network for PageRank, we use the ftM2adjM function in the R package. Currently, we are treating the network as unweighted (and undirected in case of the PPI network).
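As an illustration of the simplest construction mentioned above (an edge whenever the Pearson correlation of two genes exceeds a threshold), the following sketch builds an unweighted, undirected adjacency matrix from an expression matrix. It is not Ruan et al.'s rank-based method, which is the one used in this work; the function name, the threshold value of 0.8, and the toy expression values are assumptions chosen for the example.

```python
import numpy as np

def coexpression_adjacency(expr, threshold=0.8):
    """expr: genes x samples matrix. Returns a symmetric 0/1 adjacency matrix."""
    corr = np.corrcoef(expr)               # gene-by-gene Pearson correlation
    adj = (corr > threshold).astype(int)   # edge when correlation exceeds the threshold
    np.fill_diagonal(adj, 0)               # no self-loops
    return adj

# Hypothetical expression values for three genes across four samples.
expr = np.array([[1.0, 2.0, 3.0, 4.0],    # gene A
                 [2.0, 4.1, 6.2, 7.9],    # gene B, closely tracks gene A
                 [4.0, 1.0, 3.5, 2.0]])   # gene C, unrelated pattern
adj = coexpression_adjacency(expr)
```

Only genes A and B are connected in this toy example. With few samples, correlations of this kind become unreliable, which is the drawback discussed above.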
5.2.2 Re-ranking
A gene which is not differentially expressed can still be important if it connects
many important genes (e.g., hub nodes). But, considering the state-of-the-art algo-
rithms used for active module extraction, it may be ignored in the output module.
As explained above, by incorporating network structure and using PageRank, we aim
to boost the importance of such genes. This can yield a module that contains these
genes and more differentially expressed ones which were discarded in the first place.
For personalization, we modified the PageRank equation (5.1):
\[ r_i^t = (1-\delta)\, q_i + \delta \sum_{j=1}^{N} \frac{r_j^{t-1} w_{ij}}{d_j} \tag{5.2} \]
where $r_i^0 = q_i$ for each gene i and
\[ q_i = \frac{1-p_i}{\sum_{j=1}^{N} (1-p_j)}. \tag{5.3} \]
There are two important points: first, there is an inverse correlation between the p-value and PageRank score of a gene, i.e., a high PageRank score implies a significant gene, thus a small p-value. Hence, we use 1 − p and not p for initialization and
personalization. Second, we use the summation as the denominator in (5.3) to make
the output of (5.2) similar to a probability distribution rather than a gene ranking.
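A minimal sketch of the personalized iteration in Eqs. (5.2) and (5.3) is given below. The function name, the stopping rule, and the toy path graph are assumptions for illustration; the actual workflow relies on an existing PageRank implementation.

```python
import numpy as np

def personalized_pagerank(adj, p_values, delta=0.5, tol=1e-10, max_iter=1000):
    """Iterate Eq. (5.2) with the personalization vector q of Eq. (5.3)."""
    adj = np.asarray(adj, dtype=float)
    p = np.asarray(p_values, dtype=float)
    q = (1.0 - p) / (1.0 - p).sum()   # Eq. (5.3): small p-value -> large q_i
    deg = adj.sum(axis=0)             # d_j: degree of gene j
    deg[deg == 0] = 1.0               # assumption: isolated genes pass on no score
    r = q.copy()                      # r_i^0 = q_i
    for _ in range(max_iter):
        r_next = (1.0 - delta) * q + delta * (adj @ (r / deg))   # Eq. (5.2)
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r

# Path graph 0 - 1 - 2: the end genes are significant, the middle gene is not.
path = np.array([[0, 1, 0],
                 [1, 0, 1],
                 [0, 1, 0]])
p_values = np.array([0.01, 0.60, 0.01])
q = (1.0 - p_values) / (1.0 - p_values).sum()
r = personalized_pagerank(path, p_values)
```

In this toy example the middle gene has a large p-value but connects two significant genes, so its final score r[1] is higher than its starting value q[1]. This is the boosting effect that the re-ranking step aims for.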
5.2.3 Scaling and combining
As mentioned above, the PageRank scores have an inverse correlation with the
p-values. Therefore, they cannot be used directly as p-values and they need to be
scaled such that the maximum r, i.e., r = 1, maps to the most significant p-value,
i.e., p′ = 0. The naive method to do that is employing a linear scaling:
\[ p'_i = 1 - \frac{r_i}{\max(r)}. \tag{5.4} \]
Exponential scaling can also be used to obtain the desired mapping and the corresponding p′,
\[ p'_j = \exp(-s \cdot r_j), \tag{5.5} \]
where s is chosen to minimize the difference:
\[ \sum_{i=1}^{N} p'_i - \sum_{i=1}^{N} p_i \tag{5.6} \]
Since the scaling is non-linear and the sum of new p-values approximates to the sum of old ones, we believe this is a more viable alternative.
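The two scaling functions can be sketched as follows. How s is searched for is not specified above, so the bisection used here, which drives the absolute difference between the two sums toward zero, is an assumption of this sketch, as are the function names and the toy values.

```python
import numpy as np

def linear_scale(r):
    """Eq. (5.4): the top-ranked gene maps to p' = 0."""
    r = np.asarray(r, dtype=float)
    return 1.0 - r / r.max()

def exponential_scale(r, p, iters=200):
    """Eq. (5.5), with s picked by bisection so that sum(p') is close to sum(p)."""
    r = np.asarray(r, dtype=float)
    target = float(np.sum(p))
    lo, hi = 0.0, 1.0
    # sum(exp(-s * r)) shrinks as s grows, so enlarge hi until it brackets the target
    while np.exp(-hi * r).sum() > target and hi < 1e12:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.exp(-mid * r).sum() > target:
            lo = mid
        else:
            hi = mid
    s = 0.5 * (lo + hi)
    return np.exp(-s * r), s

# Hypothetical PageRank scores and original p-values for four genes.
r = np.array([0.40, 0.30, 0.20, 0.10])
p = np.array([0.01, 0.04, 0.30, 0.70])
p_lin = linear_scale(r)
p_exp, s = exponential_scale(r, p)
```

Both mappings give the smallest new p-value to the gene with the largest PageRank score; only the exponential one is tuned so that the new p-values sum to roughly the same total as the old ones.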
Even though the new p-values better reflect the structure of the network, they are not meant to totally ignore the original measurements [158]. Hence, genes that were differentially expressed according to the original p-value should not be ignored. To solve this issue, we merged p with the scaled p′ as follows:
\[ p'_j = \begin{cases} p_j & \text{if } p_j < \min(0.05,\, p'_j) \\ p'_j & \text{otherwise,} \end{cases} \tag{5.7} \]
where 0.05 is a parameter that defines which genes are DE from the perspective of the old p-value. Indeed, a change in this value will result in a change in the output.
Therefore, we are using the largest acceptable threshold value of 0.05 to make sure that we are not missing any important genes.
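The merging rule of Eq. (5.7) amounts to an element-wise selection, as the following sketch shows. The function name and the toy values are assumptions for illustration.

```python
import numpy as np

def merge_p_values(p, p_new, threshold=0.05):
    """Eq. (5.7): keep the original p-value when it is below both the
    threshold and the scaled value; otherwise use the scaled value."""
    p = np.asarray(p, dtype=float)
    p_new = np.asarray(p_new, dtype=float)
    keep_old = p < np.minimum(threshold, p_new)
    return np.where(keep_old, p, p_new)

# Hypothetical original and scaled p-values for four genes.
p     = np.array([0.01, 0.03, 0.20, 0.04])
p_new = np.array([0.30, 0.02, 0.01, 0.50])
merged = merge_p_values(p, p_new)
```

The first and last genes keep their original significant p-values, while the second and third take the smaller scaled values, so a gene never loses significance that either source assigns to it below the threshold.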
5.3 Experimental Results
We implemented PRASE in R. The necessary files of the workflow are freely
available at http://bmi.osu.edu/hpc/software/prase/index.html. We made use
of the available implementations of the module extraction, adjacency matrix con-
struction, and PageRank. We used two module discovery tools for the experiments:
jActiveModules and heinz where the former is provided as a plugin for Cytoscape [130]
and the latter is a part of the BioNet package [162]. We picked these tools since they
are widely accepted, they use the p-values as input, and they are easy to use. We used a PPI network with 11,203 genes and 57,235 interactions. The network was assembled by Chuang et al. [52].
We carried out the experiments with three datasets: breast invasive carcinoma
(BRCA), colorectal cancer cell line (CRC), and oligodendroglioma tumor (Oligo) datasets. We picked the datasets so as to cover different types of control/case relations. For instance, the BRCA control/case samples are for the healthy and diseased tissues whereas the CRC control/case samples are for the same disease before and after introducing the 5-FU drug. On the other hand, the control/case samples for
Oligo are for different types of cancer tissues.
The BRCA dataset is for the invasive ductal carcinoma subtype. We obtained the dataset from the TCGA portal². It contains 114 control/case samples which are extracted from healthy and tumor tissues, respectively, where each control/case sample pair was extracted from the same patient. The dataset does not contain replicates. The DESeq package was used with the unnormalized gene expression values to calculate the p-values [163].
² https://tcga-data.nci.nih.gov/tcga/
In the CRC dataset, RNA-Seq was used to measure the gene expression values for
fluorouracil (5-FU)-resistant and -nonresistant CRC lines. The RNA-Seq data was published as a part of an expression analysis [139]: to prepare the 5-FU resistant cell,
MIP/5-FU [140], Griffith et al. passed the 5-FU sensitive cell line, MIP101 [141], through an increasing concentration of 5-FU resulting in a 5-FU resistant cell. The p-values were calculated using Fisher’s exact test. Afterwards, the Benjamini and
Hochberg’s step-up false discovery rate controlling procedure was applied for multiple testing correction.
The last RNA-Seq dataset, the Oligo dataset, contains the gene expression values for six different samples where five of them are extracted from five different versions of the disease representing the case samples. The sixth one is extracted from tumor initiating cells representing the control sample³.
To measure the significance of the output modules, we looked for the number
(percentage) of DE genes in each module. In addition, we used gene sets from the literature which are known to be related to the diseases we use in this work. A summary of these sets is given in Table 5.1.
Table 5.1: Standard names for the (curated) gene sets from MSigDB and KEGG pathway (last row). PPI: number of genes from the gene set that exist in the PPI network we are using.
Name                                                     Alias      Size  PPI
NUTT_GBM_VS_AO_GLIOMA_DN                                 GSEA1        45   38
NUTT_GBM_VS_AO_GLIOMA_UP                                 GSEA2        46   40
SCHUETZ_BREAST_CANCER_DUCTAL_INVASIVE_DN                 GSEA3        84   58
SCHUETZ_BREAST_CANCER_DUCTAL_INVASIVE_UP                 GSEA4       352  258
TURASHVILI_BREAST_DUCTAL_CARCINOMA_VS_DUCTAL_NORMAL_DN   GSEA5       198  120
TURASHVILI_BREAST_DUCTAL_CARCINOMA_VS_DUCTAL_NORMAL_UP   GSEA6        44   28
Oligodendroglioma pathway                                OligoPath    29   29
³ http://www.alexaplatform.org/alexa_seq/Oligo/Summary.htm
We assume that a gene is differentially expressed (DE) if the change in its expression value is ≥ 2 and its p-value is ≤ 0.05. Due to the randomness of the seed genes in jActiveModules, we ran each experiment three times and report the averages.
The output modules are evaluated using the following criteria:
• Network size: It is easier to analyze the modules when they are smaller. How-
ever, this criterion cannot solely evaluate the effectiveness as it does not measure
the quality of the returned module.
• Percentage of DE genes: We used the percentage of DE genes in the network
instead of their actual number since the maximum number of DE genes can
be obtained by taking the whole PPI network, which is obviously not a desired
output.
• Percentage of disease-related genes: When the disease-related gene percentage
is higher, the output module is more focused on the disease.
• GO and pathway enrichment analysis: In complex diseases, the underlying bi-
ological mechanisms are still obscure and instead of analyzing the existence of
DE (or important) genes, it may be better to analyze a collective functionality.
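The first three criteria are simple set computations, sketched below for a single module. The function name and the toy gene lists are hypothetical and are not results from our experiments.

```python
def module_percentages(module_genes, de_genes, disease_genes):
    """Return the module size and the percentages of DE and disease-related genes."""
    module = set(module_genes)
    size = len(module)
    if size == 0:
        return {"size": 0, "de_pct": 0.0, "disease_pct": 0.0}
    return {
        "size": size,
        "de_pct": 100.0 * len(module & set(de_genes)) / size,
        "disease_pct": 100.0 * len(module & set(disease_genes)) / size,
    }

# Hypothetical toy inputs.
stats = module_percentages(
    module_genes=["TYMS", "CDH1", "UMPS", "EGFR"],
    de_genes=["TYMS", "CDH1", "TK1"],
    disease_genes=["TYMS"],
)
```

Because both percentages are normalized by the module size, returning the whole PPI network would maximize the number of DE genes but not their percentage.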
5.3.1 Breast invasive carcinoma
The BRCA dataset contains 20,530 genes where only 9,463 of them are expressed genes that exist in the PPI network. Each sample has around 700 DE genes; however, no DE gene is common to all 57 samples. Therefore, we calculated the number of DE genes while considering 20% and 30% of the samples as outliers. The number of DE genes is 29 and 111 for 20% and 30% outliers, respectively. The BRCA dataset contains a large number of samples; therefore, the gene co-expression network can be accurately constructed.
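One way to read the outlier-based count is sketched below: a gene is counted as DE if it is DE in all but at most the given fraction of samples. This reading, the function name, and the toy matrix are assumptions of the sketch rather than a description of the exact procedure used.

```python
import numpy as np

def common_de_genes(de_calls, outlier_fraction):
    """de_calls: genes x samples boolean matrix of per-sample DE calls.
    Returns the indices of genes that are DE in all but the allowed outliers."""
    de_calls = np.asarray(de_calls, dtype=bool)
    n_samples = de_calls.shape[1]
    # small epsilon guards against floating-point error in the product
    allowed_outliers = int(np.floor(outlier_fraction * n_samples + 1e-9))
    required = n_samples - allowed_outliers
    return np.flatnonzero(de_calls.sum(axis=1) >= required)

# Hypothetical DE calls for four genes across five samples.
calls = np.array([[1, 1, 1, 1, 1],
                  [1, 1, 1, 1, 0],
                  [1, 1, 1, 0, 0],
                  [1, 0, 0, 0, 0]])
strict = common_de_genes(calls, 0.0)
relaxed_20 = common_de_genes(calls, 0.2)
relaxed_40 = common_de_genes(calls, 0.4)
```

Allowing more outliers can only enlarge the set of commonly DE genes, which matches the increase from 29 to 111 genes reported above when moving from 20% to 30% outliers.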
The sizes of the modules using exponential scaling are shown in Figure 5.2(a).
With PRASE, the module size for heinz decreases from 271 to 261 nodes. The same module is obtained with different δ values. On the other hand, the average size of a jActiveModules module increases from 126 to 145 nodes with δ = 0.5. However, we improved the quality of the output modules as shown in Figure 5.2(b). The GSEA3,
GSEA4, GSEA5, and GSEA6 gene sets were obtained from MSigDB (Table 5.1).
These gene sets are specific to invasive ductal carcinoma. PRASE improved the quality of heinz modules w.r.t. the DE gene percentage by 0.7% and the GSEA gene percentage by 0.5%-1%. Meanwhile, the quality of the jActiveModules networks is improved by 1%-3% with δ = 0.5. Linear scaling was also applied to the p-values; however, it did not yield good results (results are not included).
For a better evaluation, we performed GO enrichment and pathway analyses by using DAVID [164]. GO annotations usually suffer from repeated annotations and large overlaps. We used DAVID’s clustering to get rid of the redundant terms. A summary of the annotations is shown in Table 5.2.
In general, the modules were enriched with extracellular region, regulation of phosphorylation, and response to stimulus related annotations. The overexpression of some extracellular region related genes is known to be involved in breast cancer, especially in the metastatic one (e.g., [165, 166]). Using PRASE, we detected more extracellular region related genes while improving the p-value for the related GO term.
For instance, the percentage increased from 30% for heinz to 38% while the p-value changed from 5.5 × 10^−10 to 6.2 × 10^−21.
(a) Module sizes (number of nodes and number of edges) for jActiveModules and heinz with various δ values and exponential scaling (p′-exp).
(b) Percentages of DE and important genes in the networks. The y-axis shows the ratio of the DE and important genes to all the genes in the module. GSEA3, GSEA4, GSEA5, and GSEA6 are aliases for the gene sets in rows 3, 4, 5, and 6 of Table 5.1, respectively. DE (X%) denotes the case where X% of the samples are considered as outliers. The results for heinz do not change with δ. Therefore, a single exponential scaling (p′-exp) column is given for heinz.
Figure 5.2: Evaluation of the modules obtained for the BRCA dataset
Table 5.2: GO analysis summary of jActiveModules and heinz for the BRCA dataset.
Columns: jActiveModules with p and with p′ (δ = 0.3, 0.5, 0.85); heinz with p and with p′.
Rows (GO annotation clusters):
-Extracellular region
-Nuclear and cell division, mitotic and cell cycles, spindle, and organelle fission
-Response to hormone and endogenous stimulus
-Response to organic substance
-Plasma membrane
-Extracellular matrix
-Regulation of phosphorylation
-Regulation of transferase and kinase activity
-Regulation and positive regulation of protein metabolic process and MAP kinase activity
-Cell migration, motility, and localization
-Regulation of cell migration and locomotion
-Platelet alpha granule and vesicle
-Response to wounding
-Wound healing and coagulation
-Cell and biological adhesion
-Blood vessel and vasculature development
-Cell junction and focal adhesion
-Response to hypoxia, oxygen levels, and progesterone stimulus
-Glycosaminoglycan, polysaccharide, and pattern binding
-Hemopoietic and immune system development
-Cell activation
-Regulation of cell death and apoptosis
-Chemical homeostasis
-Regulation of system process
-Skeletal system development
-Neuron development
-Regulation of cell communication
-Cellular homeostasis
-Behavior and taxis
-Anchoring and cell junctions
-Muscle organ and tissue development
-Defense and inflammatory response
-Heparin binding
-Protein dimerization activity
-Integrin complex, cell-substrate, cell matrix adhesion, and integrin-mediated signaling
-Regulation of mitosis, nuclear division, and organelle organization
-Positive regulation of cell proliferation
-Cytokine and chemokine activity and leukocyte migration
-Regulation of transmission and system, neurological system, and multicellular organismal processes
-Regulation of response to external stimulus
-Regulation and positive regulation of cell adhesion
-Growth factor binding
-Response to extracellular stimulus
-Growth factor activity
-Cytokine binding
-Regulation and positive and negative regulation of ubiquitin-protein ligase activity
-Behavior, regulation of cellular localization, and positive regulation of transport and protein transport
-Response to organic substance
-Homeostatic process
-Urogenital system development
-Second-messenger mediated signaling
-Regulation of mitotic cell cycle
The GO terms related to inflammatory responses and response to wounding were
enriched in most of the modules except for heinz with p. It is known that the existence of inflammation related genes contributes to the growth of the tumor [167].
Therefore, the introduction of this annotation in heinz with PRASE is highly relevant to the breast cancer behavior.
A comparison of GO analyses with different δ values for jActiveModules reveals that the annotations are consistent for δ = 0.3 and δ = 0.5. However, for δ = 0.85, we
start to see slightly different annotations. This is most likely due to the increased
impact of node-node interactions relative to the original p-values. We observed
similar consistencies among different contributions for the GO analyses for both tools.
A summary of the pathway enrichment analysis results is shown in Table 5.3.
Using jActiveModules with PRASE results in the removal of pathways that are
not closely related to breast cancer, such as the renal cell carcinoma pathway.
Meanwhile, we encounter the enrichment with other pathways such as chemokine
signaling. It has been shown that chemokines are critical for cancer progression [168].
On the other hand, some of these pathways are not related to breast cancer, such
as Arrhythmogenic right ventricular cardiomyopathy (ARVC). Moreover, important
pathways such as the ErbB signaling pathway are not significantly enriched in the
output module anymore.
The original heinz modules were enriched with fewer pathways than the jActiveModules modules despite their relatively larger
size. However, with PRASE, heinz modules were enriched with Focal adhesion,
ECM-receptor, and cytokine-cytokine receptor interaction pathways (p-value between
2e-03 and 7e-07). Focal adhesion related genes, specifically, PTK2, are known to be DE in breast cancer [169]. Moreover, the focal adhesion pathway ultimately affects the P53 signaling pathway. On the contrary, some pathways such as Gap junction and Progesterone-mediated oocyte maturation were enriched only in the module extracted with p and were not enriched in the remaining modules. These pathways have also been mentioned as breast-cancer related [170].
Table 5.3: Pathway analysis of jActiveModules and heinz for the BRCA dataset.
Columns: jActiveModules with p and with p′ (δ = 0.3, 0.5, 0.85); heinz with p and with p′.
Rows (pathways):
-Focal adhesion
-ECM-receptor interaction
-Hematopoietic cell lineage
-Pathways in cancer
-Cytokine-cytokine receptor interaction
-Regulation of actin cytoskeleton
-Hypertrophic cardiomyopathy (HCM)
-ErbB signaling pathway
-Bladder cancer
-Dilated cardiomyopathy
-Leukocyte transendothelial migration
-Renal cell carcinoma
-Pancreatic cancer
-Chemokine signaling pathway
-Long-term depression
-Cell cycle
-Arrhythmogenic right ventricular cardiomyopathy (ARVC)
-Cell adhesion molecules (CAMs)
-Complement and coagulation cascades
-Proteasome
-Gap junction
-Progesterone-mediated oocyte maturation
-Prostate cancer
-Melanoma
-Glioma
5.3.2 Colorectal cancer cell line (CRC)
The CRC dataset contains 36,952 coding and non-coding genes where only 11,853 of them are expressed coding genes and have a corresponding Entrez Gene ID. However, only 7,456 of these genes exist in the PPI network. In addition, there are 251 DE genes where only 157 of them exist in the PPI network. The CRC data contains only 2 samples; therefore, the gene co-expression network cannot be accurately constructed.
Hence, the RNA-Seq PPI network is used in this experiment.
Figure 5.3: The number of nodes/edges in the modules with different δ values and scaling functions for the CRC dataset: p′-lin and p′-exp refer to using linear and exponential scaling, respectively.
The module sizes for jActiveModules and heinz are shown in Figure 5.3. The
figure includes both linear and exponential scaling results. In the figure, δ = 0 implies that p′ = p. A positive δ leads to updates of the old p-values due to the connections between the genes, and a larger δ increases the impact of these updates. For this experiment, we achieved a 70% reduction in the network size for jActiveModules with δ = 0.85 and exponential scaling.
To measure the significance of the output modules, we looked for the number of DE genes in each module. In addition, we looked for genes that are known to be related to CRC and popular in the literature (OMIM⁴ and MSigDB⁵). We found 7 genes related to the resistance behavior: TYMS [143], TK1 [144], CDH1 [147], UMPS [148], ABCB1 [140], GDF15 [146], and TNFRSF1B [146]. All of these significant genes are DE in the RNA-Seq data. As shown in Table 5.4, only 5% of the genes in the network detected by jActiveModules with the original p-value, p, were DE, and all 7 significant genes were among them. With PRASE, the ratio of DE genes increased to 8%, and 4 of the 7 significant genes were among them, namely, TYMS, CDH1, UMPS, and ABCB1. heinz also detected exactly these 4 genes with the addition of GDF15.
Table 5.4: The number of DE and significant (SIG) genes in each module for jActiveModules. The numbers in parentheses are the percentages of DE genes in the module. heinz (with or without PageRank) detected 34 (16%) DE and 5 SIG genes.
        p         p′-lin                          p′-exp
        δ=0       δ=0.3     δ=0.5      δ=0.85     δ=0.3     δ=0.5     δ=0.85
#DE     97 (5%)   84 (9%)   105 (10%)  49 (8%)    78 (8%)   90 (8%)   46 (8%)
#SIG    7         5         5          4          5         5         4
A further analysis revealed that over-expression of TYMS is known to increase the resistance behavior for the 5-FU drug [146] while UMPS is believed to critically affect the response of the tumor to the drug [148]. Hence, in addition to a reduction in the module size and an increase in the DE gene percentage, PRASE helped in detecting genes that are believed to be highly relevant to the drug resistance.
⁴ http://www.ncbi.nlm.nih.gov/omim
⁵ http://www.broadinstitute.org/gsea/msigdb/
Table 5.5: Percentages of DE genes detected by jActiveModules and heinz for the Oligo dataset.
                 p      p′-lin                    p′-exp
                 δ=0    δ=0.3   δ=0.5   δ=0.85    δ=0.3   δ=0.5   δ=0.85
jActiveModules   33%    29%     31%     32%       25%     23%     29%
heinz            29%    29%     29%     29%       29%     28%     26%
5.3.3 Oligodendroglioma tumors
The Oligo dataset contains 49,868 coding and non-coding genes where only 10,823 of them are expressed protein coding genes and have an Entrez Gene ID. 7,082 out of the 10,823 genes exist in the PPI network and 726 of them are DE. Similar to the
CRC dataset, the gene co-expression network cannot be accurately constructed from the 6 samples. The RNA-Seq PPI network is used in this experiment.
The sizes of the networks for jActiveModules and heinz with and without
PRASE (see δ = 0) are shown in Figure 5.4. Using jActiveModules with δ = 0.85, we obtained a 55% reduction in the network size with almost the same DE-gene percentage. However, there was no improvement for heinz with PRASE.
We further analyzed the output modules w.r.t. the hub nodes they contain. We found that all the modules include the Epidermal Growth Factor Receptor⁶ (EGFR) as the hub node except for δ = 0.85 with exponential scaling, which returned different hub nodes: CD44, RPL4, and RPLP0. EGFR is a transmembrane protein that binds to EGF. Binding to the ligand leads to receptor dimerization and tyrosine autophosphorylation, leading to cell proliferation. Glioma cells increase the expression of this gene to boost the tumor behavior. Therefore, it is considered a target for new drug development [171]. CD44⁷ is a gene that is involved in cell-cell interactions and cell migration. In addition, it is a receptor for the hyaluronic acid (HA) ligand. The protein also participates in many cellular functions such as tumor metastasis. It is also found to be overexpressed in invasive oligodendroglioma tumors [172, 173]. However, it was not contained in any of the modules except the ones obtained by PRASE. On the other hand, RPL4 is one of the top 50 markers in anaplastic oligodendroglioma according to MSigDB.
⁶ http://www.ncbi.nlm.nih.gov/gene/1956
Panels: (a) jActiveModules, (b) heinz. Axes: number of nodes and number of edges.
Figure 5.4: Module sizes with various δ values and scaling functions for the Oligo dataset: p′-lin and p′-exp refer to using linear and exponential scaling, respectively.
We further looked at the enrichment of known oligodendroglioma related genes/
pathways in the output modules. We obtained two gene sets from MSigDB and one
other set from oligodendroglioma related pathways found in KEGG (Table 5.1: rows
1, 2, and 7). The first gene set from MSigDB, GSEA1, contains 45 marker genes
for anaplastic oligodendroglioma where only 38 of them exist in the PPI network.
The marker genes were obtained by performing microarray expression analysis for
the 12,000 genes in a set of 50 gliomas (28 glioblastomas and 22 anaplastic
oligodendrogliomas). The second gene set, GSEA2, contains 46 marker genes for glioblastoma
multiforme where only 40 of them exist in the PPI network. The genes were also
extracted from the same microarray experiment. Glioblastoma multiforme and
oligodendrogliomas are similar brain gliomas. However, the former is more aggressive than
the latter [174]. We analyzed the percentages of these genes in the output modules.
The results of the analysis for jActiveModules are shown in Figure 5.5. For GSEA1 and GSEA2, p′-exp with δ = 0.85 returned the largest percentage, 3.6%. However, for the KEGG pathway, p′-lin with δ = 0.85 returned the largest percentage, 5.7%. In general, PRASE with the different δ values and scaling functions enhanced the percentage of important genes in the output modules. For heinz, using PRASE did not affect the percentages. All heinz modules obtained in our experiments reached approximately 1.6%, 0.5%, and 0.7% for the important genes from GSEA1, GSEA2, and OligoPath, respectively.
⁷ http://www.ncbi.nlm.nih.gov/gene/960
Figure 5.5: Percentages of important genes in the jActiveModules networks for the Oligo dataset. The y-axis shows the ratio of the important genes to all the genes in the module. GSEA1, GSEA2, and OligoPath are aliases for the gene sets in rows 1, 2, and 7 of Table 5.1, respectively.
5.4 Conclusions
PageRank has been used to prioritize the genes using expression data and various biological networks such as transcription factor networks [159] and GO networks [158].
These studies only focus on ranking genes rather than taking PageRank one step further to extract or prioritize modules. In this work, we proposed a workflow, PRASE, which uses PageRank to calibrate the p-values and detect important and overlooked patterns while using RNA-Seq data.
Our evaluation showed that PRASE can effectively improve the quality of the output modules. For instance, a 70% reduction in jActiveModules module size
while increasing the percentage of DE genes for the CRC data clearly indicates that
the workflow is promising. Nevertheless, a further evaluation may still be required
to quantitatively measure the effectiveness of using PageRank. Another potential
measure could be the betweenness centrality of the genes returned in the modules.
Therefore, in the future, we plan to apply this measure and others as well to improve
the effectiveness of the workflow.
In addition to the datasets, the effectiveness of the workflow depends on some
other parameters including the scaling method, the damping factor δ, and the merging
threshold. Among the different datasets, exponential scaling provided better
results than linear scaling. Therefore, we highly recommend exponential scaling to
be the default scaling method. For the damping factor, we observed that δ = 0.85
generated smaller and good quality modules when using the PPI network to generate
the PageRank values. On the other hand, δ = 0.5 provided the smallest module when
using the gene co-expression network instead of the PPI. We believe such a change
in the δ value is due to the degree of the nodes in both networks; the PPI network is
sparser than the gene co-expression network. However, further experiments on other datasets
might be required to prove this hypothesis. For the merging threshold, the smaller
the threshold is, the smaller the set of old p-values we merge with the new p-values.
As a result, any change in the threshold will lead to a change in the obtained active modules. Therefore, in order to not neglect any important genes, we recommend using the largest threshold value of 0.05 to do the merging.
Table 5.6: A summary of the improvements by PRASE in the experiments. : cases improved with PRASE. ?: cases for which we obtained the best results with PRASE —: cases where the effect of PRASE is not significant.
BRCA CRC Oligo jActiveModules ? heinz ——
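The role of the damping factor δ discussed above can be illustrated with a minimal power-iteration sketch of textbook PageRank. This is not the PRASE variant; the toy graph, the function name `pagerank`, and the uniform handling of dangling nodes are assumptions made only for illustration.

```python
def pagerank(adj, delta=0.85, iters=100):
    """Textbook PageRank by power iteration.

    adj maps each node to the list of its neighbours (every neighbour must
    also be a key of adj); delta is the damping factor.
    """
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Teleportation term: every node receives (1 - delta) / n.
        new = {v: (1.0 - delta) / n for v in nodes}
        for v in nodes:
            if adj[v]:
                share = delta * rank[v] / len(adj[v])
                for u in adj[v]:
                    new[u] += share
            else:
                # Dangling node (assumption): spread its rank uniformly.
                for u in nodes:
                    new[u] += delta * rank[v] / n
        rank = new
    return rank


# Hypothetical star-shaped toy network: A interacts with B, C, and D.
toy_ppi = {"A": ["B", "C", "D"], "B": ["A"], "C": ["A"], "D": ["A"]}
uniform = pagerank(toy_ppi, delta=0.0)   # delta = 0 ignores the network
damped = pagerank(toy_ppi, delta=0.85)   # the hub A collects most of the rank
```

With δ = 0 every node receives the same value regardless of the topology, while larger δ values let the network structure dominate, which is why the best δ can differ between a sparse and a dense network.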
Even though extracting the active modules significantly helps in understanding the disease mechanism, the picture is still incomplete as we do not understand how the genotypic changes are related to the phenotypic ones [152]. Recent research tried to tackle this problem and bridge the gap between the two data types [175, 176].
However, the problem is still open and further work is needed. Therefore, we plan to address this problem by extending our work and integrating genotype data in the network extraction process.
Chapter 6: MICA: MicroRNA Integration for Active Module Discovery
The discovery of disease-related modules has been an important problem for a long time. The focus at first was on extracting dense gene clusters from biological networks, e.g., PPI or gene co-expression networks [4,5]. However, such an approach has proven insufficient for extracting comprehensive modules [24]. Therefore, new algorithms and techniques have been proposed to discover more accurate disease-related modules.
One fruitful technique for extracting such modules is based on integrating gene expression values and the PPI network into one framework [24]. The integration idea was first introduced by Ideker et al. [28] and many others then followed the same approach, e.g., [29, 30]. These discovered modules are called active modules since the gene expression data, which is dynamically changing, is integrated with the static PPI network. Hence, the modules are active in certain cells or conditions. Even though the proposed algorithms have shown their efficiency in discovering disease-related active modules, they still do not make use of the vast amount of available heterogeneous data.
Therefore, the discovered modules do not give a complete picture of the disease behavior. Additionally, they focus on the genes in the PPI network, discarding other genes for which we do not yet have any information regarding their interaction patterns.
MicroRNAs (miRNAs) are small non-coding RNAs that are used by the cell to post-transcriptionally regulate gene expression levels [43]. miRNAs inhibit protein synthesis by either stopping the protein translation or by performing mRNA degradation. miRNAs constitute an important inhibition technique that has been shown to be very important in different diseases, specifically, in cancer progression [44]. For instance, miRNAs were found to be differentially expressed in breast cancer in addition to successfully classifying estrogen and progesterone receptors, and HER2/neu status [45]. Hence, using miRNAs for the active module discovery is a promising technique to increase the accuracy and success rate of the cancer treatments.
Most of the works that integrate miRNA and mRNA data assume that the miRNA effect on the mRNA is distinguishable from the gene expression levels [58, 177]. However, the protein expression level can be significantly affected by the miRNA without having any apparent effect on the gene expression level [62]. The authors of [64] suggested another method to integrate miRNA and mRNA by integrating the PPI network and miRNA-target gene network into one heterogeneous network. They focused on prioritizing the genes using the suggested network. Indeed, such integration would work around the miRNA-mRNA integration problem. However, by focusing only on prioritizing genes through the PPI network, they cannot detect connected modules of genes with indirect dependencies, e.g., through other genes not in the PPI network or through other genes with no change in expression at mRNA level.
Even though the techniques using gene expression levels provide valuable information, they cannot show the whole picture. Here, we try to exploit another miRNA and mRNA interaction pattern, which is the inhibition of protein translation rather than mRNA degradation. We believe that if the gene expression levels are adjusted
based on the expression levels of the corresponding miRNAs, novel and interesting
gene-gene dependencies can be unraveled.
In this work, we propose a workflow Mica which employs heterogeneous data
sources and adopts independent component analysis [178] to extract active modules.
To unravel new types of gene-gene dependencies, we provide a novel data integration
technique that adjusts the expression level of the genes based on the expression level
of the corresponding miRNA. These dependencies are then mapped back to the PPI
network to extract the connected modules. Compared to existing active module discovery
tools, Mica is less dependent on the given biological network it uses and hence
does not need to ignore the information for the entities which are not in the network.
There are three types of interactions between a group of miRNAs and a target
gene: synergetic, complementary, and additive. A synergetic effect implies that all the miRNAs affecting the gene must be expressed together in order to have mRNA degradation or protein inhibition [179]. Alternatively, miRNAs can act in a complementary manner, requiring only one out of the miRNA set to be expressed [179]. In an additive interaction, each miRNA alone has an effect while the overall effect is increased if multiple miRNAs are expressed [180]. Here, we will focus on the complementary and the additive effects.
The rest of this chapter is organized as follows: In Section 6.1, we provide background on the techniques we used in this work. Our methods and experimental results are presented in Section 6.2 and Section 6.3, respectively. Section 6.4 concludes the work mentioned in this chapter.
6.1 Background
Independent Component Analysis (ICA) is a well-known technique used to solve the
Blind Source Separation problem. Given an input with multiple, linearly mixed
sources, it tries to distinguish the sources by minimizing the statistical dependencies
between them [178]. In the context of gene expression, ICA decomposes an input
expression into its possible expression modes [181]. For an n × m input gene expression
matrix X, where rows correspond to genes and columns correspond to samples,
ICA decomposes X into:
$$X^{T} = A \times S \qquad (6.1)$$
such that S is an ℓ × n matrix for ℓ ≤ m. The rows of S are (statistically) as independent
as possible and correspond to the independent components. The columns of S
correspond to the genes and the entry $S_{cg}$ shows the contribution of a gene g to the component c. A is an m × ℓ matrix whose rows correspond to samples. The entry
$A_{sc}$ shows the contribution of each component c for a sample s. Many approximation algorithms have been proposed to find A and S in an efficient way, e.g., fastICA [39],
JADE [182], and InfoMax [183]. fastICA tries to identify non-Gaussian components under the assumption that Gaussian components represent the noise. This algorithm can get stuck in a local minimum; hence, multiple runs, and thus multiple estimates, can be necessary [184, 185].
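The dimensions in (6.1) can be made explicit as a reading aid. The element-wise form below follows directly from the definitions of A and S given above; only the index ranges are written out.

```latex
\underbrace{X^{T}}_{m \times n}
  \;=\;
\underbrace{A}_{m \times \ell} \times \underbrace{S}_{\ell \times n},
\qquad
(X^{T})_{sg} \;=\; \sum_{c=1}^{\ell} A_{sc}\, S_{cg},
\quad s = 1,\dots,m,\;\; g = 1,\dots,n .
```

In words, the expression of gene g in sample s is modeled as a sum over components, each weighted by how strongly it is present in the sample ($A_{sc}$) and how strongly the gene contributes to it ($S_{cg}$).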
ICA has been used extensively to cluster different genes together or for sample classification [181, 186, 187, 188, 189, 190, 191, 192]. All of these studies have shown the efficiency of ICA in producing biologically relevant results.
6.2 Methods
Mica consists of three main parts as shown in Figure 6.1:
[Figure 6.1 diagram: microRNA expression profiles and gene expression profiles (controls and cases) → Integration → adjusted gene expressions → ICA → output of ICA → connected module extraction on the PPI network → modules 1, 2, and 3.]
Figure 6.1: Mica: The workflow starts with integrating miRNA and mRNA data by adjusting the mRNA data using the miRNA data. Then, ICA is applied on the resulting new gene-expression matrix. Finally, for each independent component obtained by ICA, the largest connected module from the PPI network is extracted using the significant genes in the component.
6.2.1 Data integration
The miRNA and gene expression data are usually integrated using correlation-based
methods with the assumption that the miRNA effect on mRNA should be
apparent on the gene expression level. However, miRNAs can also act by inhibiting
protein translation rather than by suppressing gene expression, and traditional approaches cannot exploit this effect. Our novel integration scheme uses miRNA expression levels to adjust the gene expression. Hence, if a gene is affected by an miRNA at the inhibition level, the proposed integration makes the effect visible on the expression level. To do this, for each sample s, we first compute
$$\beta_{g,s} = \frac{\sum_{\{r:\ r \text{ affects } g,\ Z_{r,s} < 0\}} |Z_{r,s}|}{\sum_{\{r:\ r \text{ affects } g,\ Z_{r,s} > 0\}} Z_{r,s}} \qquad (6.2)$$
where $Z_{r,s}$ is the z-score of miRNA r in sample s that is experimentally verified to
affect gene g. It is calculated by
$$Z_{r,s} = (x_{r,s} - \mu_r)/\sigma_r \qquad (6.3)$$
where $x_{r,s}$ is the expression level of miRNA r in sample s, and $\mu_r$ and $\sigma_r$ are the
mean and standard deviation of r's expression level across all the control samples.
In (6.2), the miRNAs are divided into two groups since they affect a gene differently.
In general, when an miRNA r is down-regulated, i.e., has a negative z-score, then the
expression of g will increase. On the other hand, when r is up-regulated then the
expression of g will decrease. Accordingly, the final gene expression is calculated as
$$e'_{g,s} = \beta_{g,s} \times e_{g,s} \qquad (6.4)$$
where $e_{g,s}$ and $e'_{g,s}$ are the original and adjusted expression levels of gene g.
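Equations (6.2) to (6.4) can be sketched for a single gene-sample pair as follows. This is a minimal illustration rather than the Mica implementation (which is written in R): the function names, the use of the sample standard deviation, and the fallback when no up-regulated miRNA exists are assumptions that the text does not specify.

```python
from statistics import mean, stdev


def z_scores(values, control_idx):
    """Eq. (6.3): z-score of every sample relative to the control samples.

    Assumption: the sample standard deviation is used.
    """
    controls = [values[i] for i in control_idx]
    mu, sigma = mean(controls), stdev(controls)
    return [(x - mu) / sigma for x in values]


def beta(z_for_gene):
    """Eq. (6.2): z_for_gene holds the z-scores, in one sample, of the
    miRNAs known to target the gene."""
    down = sum(abs(z) for z in z_for_gene if z < 0)
    up = sum(z for z in z_for_gene if z > 0)
    if up == 0:
        # Assumption: the ratio is undefined here, so leave the gene unchanged.
        return 1.0
    return down / up


def adjust(e, b):
    """Eq. (6.4): adjusted expression level."""
    return b * e


# Hypothetical numbers: two down-regulated and one up-regulated miRNA.
b = beta([-3.0, -1.0, 2.0])      # (3 + 1) / 2
e_adjusted = adjust(10.0, b)     # expression is scaled up
z = z_scores([1.0, 2.0, 3.0, 6.0], control_idx=[0, 1, 2])
```

A β above 1 means that the down-regulated miRNAs dominate, so the adjusted expression increases, which matches the direction described in the text.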
For data integration, (6.4) is applied to each gene-sample pair. To avoid noise,
only the miRNAs with an absolute z-score of at least $t_R$ in more than 10% of the samples
are kept. Additionally, $\beta_{g,s}$ must be $> t_R$ or $< 1/t_R$ in order to modify $e_{g,s}$, i.e., we want either the up-regulated or the down-regulated group of miRNAs to have a
significant effect on g.
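The two noise filters just described can be sketched as follows. The function names are hypothetical, and the default threshold of 4 is the value of the miRNA threshold reported later in the experimental setup.

```python
def keep_mirna(z_row, t_r=4.0, min_fraction=0.10):
    """Keep an miRNA only if |z| >= t_r in more than min_fraction of samples."""
    hits = sum(1 for z in z_row if abs(z) >= t_r)
    return hits > min_fraction * len(z_row)


def gated_adjust(e, b, t_r=4.0):
    """Apply e' = b * e only when b > t_r or b < 1 / t_r; otherwise keep e."""
    if b > t_r or b < 1.0 / t_r:
        return b * e
    return e


# Hypothetical z-scores of two miRNAs over ten samples.
sparse = [5.0] + [0.0] * 9            # extreme in exactly 10% of the samples
frequent = [5.0, -4.5] + [0.0] * 8    # extreme in 20% of the samples
kept = [keep_mirna(sparse), keep_mirna(frequent)]
```

Because the rule says "more than 10%", the first miRNA is dropped and the second is kept. Likewise, a β of 2 leaves the expression untouched, while a β of 5 or 0.2 modifies it.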
As mentioned above, miRNAs can affect the genes in a synergetic, complementary,
or additive way. Our integration equation (6.4) is additive and partially complementary,
i.e., the gene expression level will be affected more if several miRNAs affect
it (additive). Yet, when only a single miRNA is active in the sample, it will still
affect the expression level (complementary). In the end, our goal is to better highlight
the dependencies between the genes rather than to find exact protein expression
values; there are many unknown factors affecting the actual protein expression.
6.2.2 ICA on gene expression values
After the data integration step, the adjusted gene expression values are then fed
to the ICA for which the R version of the fastICA algorithm is used [39]. To avoid local
minima and unreliable independent component estimates, we follow the method
in [185]: we run fastICA κ times and obtain different independent component estimates
at each run. Then, the Pearson correlation coefficients between the components
from different estimates are computed to distinguish the most similar ones. We constructed
a κ-partite similarity graph G = (V, E) where $V = V_1 \cup \cdots \cup V_\kappa$ is the set
of all components returned by ICA and $V_i$ is the set of components obtained in the
ith run. The edge set E contains an edge (c, c') if the Pearson correlation coefficient
between c and c' is at least 0.9 and they are not obtained in the same run, i.e., $c \in V_i$,
$c' \in V_j$, $i \neq j$. To obtain the final component set, we partition G into its maximal connected subgraphs. Then for each connected subgraph C of G with at least κ vertices, we construct a final representative component by computing the average of the
|C| rows corresponding to the vertices in C.
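The consensus step above can be sketched with a union-find over the similarity graph. This is an illustrative reading of the procedure, not the authors' R code: the function names and toy vectors are hypothetical, and the sketch uses the signed correlation exactly as stated, so sign-flipped estimates of the same component would not be matched.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def consensus_components(runs, min_corr=0.9):
    """runs: one list of component vectors per ICA run (kappa runs in total).

    Returns the average vector of every connected subgraph of the similarity
    graph that has at least kappa vertices.
    """
    kappa = len(runs)
    verts = [(i, comp) for i, run in enumerate(runs) for comp in run]
    parent = list(range(len(verts)))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    # Add an edge only between components that come from different runs.
    for a in range(len(verts)):
        for b in range(a + 1, len(verts)):
            (run_a, comp_a), (run_b, comp_b) = verts[a], verts[b]
            if run_a != run_b and pearson(comp_a, comp_b) >= min_corr:
                parent[find(a)] = find(b)

    groups = {}
    for a in range(len(verts)):
        groups.setdefault(find(a), []).append(verts[a][1])
    return [
        [sum(col) / len(group) for col in zip(*group)]
        for group in groups.values()
        if len(group) >= kappa
    ]


# Hypothetical toy input: three runs, each with one stable and one unstable
# component over four genes.
runs = [
    [[1.0, 2.0, 3.0, 4.0], [4.0, 1.0, 3.0, 2.0]],
    [[1.1, 2.0, 3.2, 3.9], [1.0, -1.0, -1.0, 1.0]],
    [[0.9, 2.1, 2.9, 4.2], [0.0, 5.0, 0.0, 1.0]],
]
result = consensus_components(runs)
```

Only the component that reappears in all three runs survives; the unstable ones form singleton subgraphs and are discarded.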
An important parameter of ICA is the number of components ℓ to be generated; when ℓ is large, ICA will probably return subcomponent-type structures which are not very interesting [193]. A naïve method is setting ℓ = m, the number of samples, which is not useful in our case since we have hundreds of them. We follow another approach [191] based on an earlier method proposed by [194]. We first apply Singular
Value Decomposition (SVD) to the actual gene expression matrix to reduce the dimensionality. We do the same for a randomly permuted version of the same matrix.
The actual variance obtained from each SVD component is used to draw a curve of the information gain. A similar curve is also generated for the randomly permuted case.
The optimal number of components would be the point of intersection of these two curves, i.e., when the information obtained from the random components is higher than the information obtained from the actual components.
The matrices S and A generated by ICA can be used to determine which genes are significant in each component and which components are significant in each sample, respectively. There are different options to pick the significant components, e.g.,
[195, 185, 189]. Here, we used a variant of the correlation method suggested by [189].
Basically, instead of calculating the correlation between the component weight across the samples and the type (control/case) of the samples, the Wilcoxon signed-rank test is used to calculate a p-value for each component based on its weight distribution over the controls and cases. The Bonferroni correction method is then used to correct
the p-value. We further compute µ and σ for each component by using its weights in the control samples. We then compute the z-score for each component-case sample pair. Hence, a component is significant for a case if the corresponding z-score is at least a threshold $t_C$.
To determine the set of genes related to a component, we use the z-score threshold based method [195, 188], which was shown to be effective in returning the most important genes for each component. We calculated the z-score of each gene in a component by using its weight, µ, and σ that are computed by using all the gene weights inside this component. Then for each component, the genes with a z-score of at least $t_G$ are considered to be members of the component.
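The two z-score thresholds described above can be sketched as follows. The function names and toy weights are hypothetical; the sketch uses the signed z-score as written in the text and the sample standard deviation, which is an assumption.

```python
from statistics import mean, stdev


def member_genes(weights, t_g=2.0):
    """weights maps gene -> weight in one component (one row of S).

    A gene is a member if its z-score, computed from all gene weights of the
    component, is at least t_g.
    """
    values = list(weights.values())
    mu, sigma = mean(values), stdev(values)
    return {g for g, w in weights.items() if (w - mu) / sigma >= t_g}


def significant_cases(control_weights, case_weights, t_c=2.0):
    """Indices of the case samples in which a component is significant.

    mu and sigma come from the component's weights in the control samples
    (one column of A restricted to controls).
    """
    mu, sigma = mean(control_weights), stdev(control_weights)
    return [i for i, w in enumerate(case_weights) if (w - mu) / sigma >= t_c]


# Hypothetical component in which only g10 carries a large weight.
weights = {"g1": 0.1, "g2": -0.2, "g3": 0.0, "g4": 0.3, "g5": -0.1,
           "g6": 0.2, "g7": -0.3, "g8": 0.0, "g9": 0.1, "g10": 5.0}
members = member_genes(weights)
cases = significant_cases([0.0, 0.1, -0.1, 0.05, -0.05], [0.0, 0.5, 0.1])
```

In this toy example only g10 passes the gene threshold, and the component is significant only for the second case sample.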
6.2.3 Connected module extraction
The connected PPI modules are extracted by mapping the set of member genes in each component to the PPI network and extracting the largest connected module.
If there is no connected module or if the largest one is not large enough, the threshold $t_G$ used to pick the member genes for each component is relaxed to allow more connectivity. However, as the results will show, each component yields a large connected module in the PPI network. In addition, recent studies also showed that the components generated by ICA (or similar techniques) are either highly enriched in the PPI network [177] or highly enriched with signaling pathways [188].
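The extraction step can be sketched as a breadth-first search over the PPI subgraph induced by the member genes. The function name and the toy edge list are hypothetical; the relaxation of the gene threshold is not shown.

```python
from collections import deque


def largest_connected_module(ppi_edges, members):
    """Largest connected set of member genes in the induced PPI subgraph."""
    members = set(members)
    adj = {g: set() for g in members}
    for u, v in ppi_edges:
        # Keep only interactions whose two endpoints are both member genes.
        if u in members and v in members and u != v:
            adj[u].add(v)
            adj[v].add(u)

    best, seen = set(), set()
    for start in adj:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in component:
                    component.add(v)
                    queue.append(v)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Hypothetical PPI edges; X is in the network but not a member gene.
edges = [("A", "B"), ("B", "C"), ("D", "E"), ("C", "X")]
module = largest_connected_module(edges, {"A", "B", "C", "D", "E", "F"})
```

Member genes that are absent from the network, such as F here, end up as isolated vertices and never enter the module.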
Each component we found after the second step is expected to generate a connected module. It is crucial to define a scoring function to determine which module is the most important one, i.e., containing important member genes. Although a large module is preferable, we do not want the modules to be too large. Therefore, after
determining the member genes in each component c, the following scoring function is used:
$$\mathrm{scr}(c) = \frac{\sum_{g \in c} Z_{cg}}{\sqrt{|c|}} \qquad (6.5)$$
where |c| is the number of member genes in c. We used $\sqrt{|c|}$ instead of |c| since we want to give a higher score to larger modules. A gene g will have a high $Z_{cg}$ value if it is significant for c. Therefore, if a connected module contains many important genes, the module is considered to be important.
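The effect of normalizing by the square root of the module size in (6.5) can be seen in a short sketch. The function name and the numbers are hypothetical.

```python
from math import sqrt


def module_score(z_values):
    """Eq. (6.5): sum of the member genes' z-scores divided by sqrt(|c|)."""
    return sum(z_values) / sqrt(len(z_values))


# Two hypothetical modules whose genes all have the same z-score of 3.
small = module_score([3.0] * 4)
large = module_score([3.0] * 16)
```

Dividing by |c| would give both modules the same score (their mean z-score), whereas dividing by the square root lets the larger module score higher, which is the preference stated above.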
6.3 Results
We implemented our proposed workflow Mica in R and used the available implementation of the fastICA algorithm. To demonstrate the effectiveness of the proposed workflow, that is, the added benefits of early integration of microRNA datasets, we compared the modules obtained by our workflow Mica against the ones obtained using ICA and DEGAS [30], using the original gene expression values. DEGAS is a set-cover based algorithm known for its efficiency in detecting dysregulated pathways. It tries to detect a module with at least k differentially expressed (DE) genes shared between most of the samples. We tuned the DEGAS parameters to detect the best module according to a measure provided by the tool based on how far the size of the module is from a randomly generated subnetwork of k genes. We set the maximum number of modules for DEGAS to 5. Still, it returned a single module in the experiments. In the rest of the text, DEGAS output modules are referred to as degas, ICA modules as ica, and Mica modules as mica.
We carried out the experiments on two datasets for two breast-cancer subtypes: invasive lobular carcinoma (ILC) and invasive ductal carcinoma (IDC) datasets.
Both datasets are from TCGA (https://tcga-data.nci.nih.gov/tcga/) and they both
contain RNA-Seq and miRNA-Seq data. High throughput sequencing data was used
in our experiments since it can provide a complete image about all the miRNAs and
mRNAs in the cell without requiring any a-priori information. The main aim of using two different subtypes of the same disease is to understand how different techniques are able to detect modules specific to each subtype.
The ILC dataset has 106 control samples and 153 case samples. All of the 259 samples have gene expression information. Out of the 153 cases, only 150 contain miRNA expression data as well. Therefore, only the 150 cases are used in our experiments. The IDC dataset shares the 106 control samples with the ILC. It also has
714 case samples with gene expression information; however, only 699 case samples, which also have miRNA expression information, are used in our experiments.
The PPI network used for the module extraction was obtained from the BioGRID
(http://thebiogrid.org) database (rel. 3.2.104). It contains 139,539 interactions between 18,170 proteins. The experimentally validated miRNA-target interactions used in data integration are obtained from miRTarBase (rel. 4.5) [196].
The number of runs κ for ICA is set to 100 while tR threshold is set to 4 and tC and tG are set to 2. We set the threshold high since we only want to keep the values that would have a potential of being important.
The quality of the output modules is verified using different methods, including pathway enrichment analysis, GO enrichment analysis, disease ontology (DO) enrichment analysis, and finally using the evidence in the literature on the importance of the modules/genes. Enrichment analysis is performed using ReactomePA [197],
FunDo [198], and clusterProfiler [199].
Table 6.1: Size of the modules obtained using Mica and ICA. # is the component number, S is the number of samples a component covers, |c| is the size of the component, |c|ppi is the number of genes that are both in the component and the PPI network, N and E are the number of nodes and edges, respectively, for the largest connected module in the PPI, and scr(c) is the score of the largest connected module. The missing component is a very small one.

(a) ICA
#   S     |c|    |c|ppi   N     E      scr(c)
1   55    754    657      221   348    39.43
3   54    279    267      103   143    25.33
4   28    703    641      274   510    50.70
5   4     542    448      116   141    28.80
6   7     349    320      116   337    26.68
7   2     204    176      30    29     12.81

(b) Mica
#   S     |c|    |c|ppi   N     E      scr(c)
1   103   501    475      164   272    55.63
2   49    284    242      21    21     12.71
3   67    1007   879      339   585    49.51
4   30    455    446      283   506    52.41
5   68    931    876      541   1535   66.91
6   9     889    752      253   354    46.04
7   3     790    738      410   1297   51.04
6.3.1 Results on ILC data
The Mica modules are meaningfully different from ICA modules. Table 6.1 shows the number of samples they cover, the size of each component, the number of member genes in the PPI network, the size of the largest connected module, and the score. In general, for each of ICA and Mica components, there is a large connected module in the PPI network. Interestingly, Mica modules have higher scores than ICA modules in addition to being more common across the samples.
We also use DEGAS on the ILC dataset for comparison purposes. The degas module consists of 347 genes with 730 interactions between them and the number of DE genes in this module is 200. The quality, i.e., the module size p-value, is 0.19, which can be considered large. We tried different options for DEGAS to get a better module; however, this is the best module we obtained.
Statistical analysis of the obtained components: An important step is to
first ensure that the obtained Mica components, hence the active modules, cannot
be obtained from a random matrix. Therefore, we set our null hypothesis to be that the t-score calculated for each component from its weight across the case and control samples in the A matrix can be obtained if we have a random input matrix. Accordingly,
we generated 1000 random matrices by randomly permuting the modified gene
expression values for each gene across the case and control samples. Afterwards, we
applied Mica on the random matrices and calculated the t-score for the randomly
generated components. For each of the 1000 runs, we only kept the max/min t-score value.
Finally, using the t-scores from the random runs, we generated the distribution for
the random t-scores and compared our actual t-scores against it. The random t-score
distribution and the components' t-score values are shown in Figure 6.2. Clearly, the
components cannot randomly gain such a high t-score (i.e., p-value = 0). Therefore,
the null hypothesis is rejected.
[Figure 6.2 plot: distribution of the random t-scores (x-axis: t-score, −10 to 15; y-axis: 0.00 to 0.30), with markers for mica1 through mica7.]
Figure 6.2: Random t-score distribution.
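The logic of comparing an observed t-score against a permutation-based null distribution can be sketched as follows. This is a deliberately simplified illustration: it shuffles the weights of a single component instead of permuting the expression matrix and re-running the whole workflow, it assumes a Welch t-statistic because the text does not name the exact test, and all names and numbers are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, stdev


def t_score(cases, controls):
    """Welch t-statistic between two groups (assumed form of the t-score)."""
    var_cases = stdev(cases) ** 2 / len(cases)
    var_controls = stdev(controls) ** 2 / len(controls)
    return (mean(cases) - mean(controls)) / sqrt(var_cases + var_controls)


def permutation_null(weights, n_cases, runs=1000, seed=0):
    """t-scores obtained after randomly reassigning weights to the groups."""
    rng = random.Random(seed)
    null = []
    for _ in range(runs):
        shuffled = weights[:]
        rng.shuffle(shuffled)
        null.append(t_score(shuffled[:n_cases], shuffled[n_cases:]))
    return null


def empirical_p(observed, null_scores):
    """Fraction of null t-scores at least as extreme as the observed one."""
    extreme = sum(1 for t in null_scores if abs(t) >= abs(observed))
    return extreme / len(null_scores)


# Hypothetical component weights for six cases and six controls.
cases = [2.1, 2.4, 1.9, 2.6, 2.2, 2.5]
controls = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2]
observed = t_score(cases, controls)
null = permutation_null(cases + controls, n_cases=len(cases))
p_value = empirical_p(observed, null)
```

An empirical p-value of zero, as reported above, simply means that none of the random runs produced a t-score as extreme as the observed one.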
[Figure 6.3 plot: AUC (y-axis, 0.95 to 1.00) for MICA, ICA, and DEGAS.]
Figure 6.3: AUC for Mica, ICA, and DEGAS for a 10-fold cross validation.
Classification using modified and original gene expression: It is important
to ensure that the modified gene expression data better differentiate between case and
control samples. To this end, a comparison between the prediction accuracy using
Mica modules on the modified gene expression data and ICA and DEGAS modules on the original data was carried out. Basically, for Mica modules, a Support Vector Machine (SVM) was trained on each module separately, with the genes in each module used as the input features. Afterwards, a voting was performed between the modules to determine the output classification. The same was applied on ICA but with the original data. For DEGAS, no voting was required since it only has one module. The results for a 10-fold cross validation are shown in Figure 6.3. In general, Mica and
ICA obtain a better classification accuracy than DEGAS, with Mica being more stable across the different runs and obtaining an AUC value of 1 in almost all of the runs.
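The voting step between per-module classifiers can be sketched independently of the SVM training itself. The function name, the labels, and the tie-breaking rule are assumptions; the text does not describe how ties are resolved.

```python
from collections import Counter


def majority_vote(per_module_predictions):
    """Combine per-module predictions into one label per sample.

    per_module_predictions holds one list of labels per module, all aligned
    by sample.
    """
    voted = []
    for labels in zip(*per_module_predictions):
        counts = Counter(labels).most_common()
        if len(counts) > 1 and counts[0][1] == counts[1][1]:
            # Tie-break (assumption): fall back to the first module's label.
            voted.append(labels[0])
        else:
            voted.append(counts[0][0])
    return voted


# Hypothetical predictions of three module-specific classifiers on 3 samples.
predictions = [
    ["case", "control", "case"],
    ["case", "case", "control"],
    ["control", "case", "case"],
]
final = majority_vote(predictions)
```

Each sample is assigned the label predicted by most modules, so a single module that misclassifies a sample is outvoted by the others.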
Active modules analysis: The next step is to see which genes exist in each
active module, how the different active modules overlap, and the enrichment of each
module with important GO annotations. Interestingly, there was not a large overlap
between Mica, ICA, and DEGAS; degas overlaps with 12% of mica5 while ica4
overlaps with 17% of mica6. Nevertheless, there were some similarities in the top enriched
GO annotations (i.e., with corrected p-value < $10^{-15}$). Among the top similar
ones are: translational elongation between ica6 and mica7, positive regulation of
biological process between ica4 and mica6, cellular macromolecule metabolic process
in mica1 and degas, and organelle organization between mica4 and degas. On the other hand, the top different ones included protein transport in ica1, cardiovascular system development and extracellular matrix organization in ica5, response to endoplasmic reticulum stress in mica2, RNA processing and splicing in mica3, and cell cycle and cell cycle process in mica5.
Since we are working with active modules that are going to be further used to extract important pathways, we further performed pathway enrichment analysis to better evaluate the quality of the active modules. The results are shown in Table 6.2.
Similar to GO annotations, some pathways are common between Mica, ICA, and
DEGAS. For instance, both degas and mica5 were enriched with the cell cycle pathway; however, the p-value for degas was much smaller than the p-value in mica5. Remarkably, mica5 was enriched with more cell cycle-related pathways, such as the cell cycle, mitotic, and check points pathways, with BRCA1 common among most of these pathways. Mutations in BRCA1 lead to genetic instability and deficiency in the different cell cycle phases [200]. Additionally, its absence results in breast cancer formation.
Table 6.2: Pathway enrichment analysis for Mica, ICA, and DEGAS on the ILC data.
Columns: Database; Pathway; MICA (%, pval, #); ICA (%, pval, #); DEGAS (%, pval). Each value group below is listed in the order %, pval, #.

Reactome
Unfolded Protein Response | 23.81, 6.78 × 10^-05, 2 | 3.64, 8.20 × 10^-03, 4
Processing of Capped Intron-Containing Pre-mRNA | 5.60, 4.21 × 10^-03, 1
mRNA Splicing | 5.30, 4.21 × 10^-03, 3
Cell Cycle, Mitotic | 18.48, 1.19 × 10^-21, 5 | 11.53, 7.79 × 10^-3
Mitotic M-M/G1 phases | 13.31, 3.75 × 10^-18, 5
Elastic fibre formation | 4.74, 4.05 × 10^-05, 6 | 11.21, 7.30 × 10^-11, 5
Molecules associated with elastic fibres | 3.95, 2.81 × 10^-04, 6
3'-UTR-mediated translational regulation | 8.29, 3.77 × 10^-05, 7 | 22.41, 8.20 × 10^-14, 6
L13a-mediated translational silencing of Ceruloplasmin expression | 8.29, 3.77 × 10^-05, 7 | 22.41, 8.20 × 10^-14, 6
Formation of a pool of free 40S subunits | 7.80, 3.98 × 10^-05, 7 | 19.83, 5.30 × 10^-12, 6
Eukaryotic Translation Initiation | 8.29, 3.98 × 10^-05, 7 | 18.97, 3.03 × 10^-11, 6
Antigen Presentation: Folding, assembly and peptide loading of class I MHC | 4.52, 1.62 × 10^-06, 1
Interferon alpha/beta signaling | 5.88, 7.99 × 10^-05, 1
Golgi Cisternae Pericentriolar Stack Reorganization | 2.71, 5.10 × 10^-04, 1
ER-Phagosome pathway | 4.98, 6.98 × 10^-04, 1
PERK regulated gene expression | 2.19, 3.49 × 10^-03, 4
Toll Like Receptor 4 (TLR4) Cascade | 5.47, 4.25 × 10^-03, 4
Cytokine Signaling in Immune system | 10.21, 4.25 × 10^-03, 4
Antigen Presentation: Folding, assembly and peptide loading of class I MHC | 2.55, 6.00 × 10^-03, 4
Extracellular matrix organization | 21.55, 5.25 × 10^-15, 5
Molecules associated with elastic fibres | 9.48, 3.27 × 10^-09, 5
Integrin cell surface interactions | 11.21, 2.02 × 10^-07, 5
Degradation of collagen | 8.62, 5.17 × 10^-06, 5
Translation | 24.13, 8.66 × 10^-14, 5
Cap-dependent Translation Initiation | 22.41, 8.66 × 10^-14, 6
Eukaryotic Translation Initiation | 22.41, 8.66 × 10^-14, 6
GTP hydrolysis and joining of the 60S ribosomal subunit | 21.55, 2.74 × 10^-13, 6
Peptide chain elongation | 18.10, 9.89 × 10^-11, 6
Nonsense Mediated Decay Independent of the Exon Junction Complex | 18.10, 1.71 × 10^-10, 6
Repair synthesis for gap-filling by DNA polymerase in TC-NER | 1.73, 7.32 × 10^-3
Removal of the Flap Intermediate from the C-strand | 1.72, 7.32 × 10^-3
Telomere Maintenance | 3.75, 7.32 × 10^-3

KEGG
Pancreatic cancer | 6.70, 1.05 × 10^-04, 1 | 6.03, 4.15 × 10^-03, 5
Pathways in cancer | 15.24, 1.05 × 10^-04, 1 | 14.66, 2.59 × 10^-03, 5
Small cell lung cancer | 7.31, 1.05 × 10^-04, 1 | 7.75, 7.07 × 10^-04, 5
Chronic myeloid leukemia | 6.09, 7.01 × 10^-04, 1 | 6.89, 1.26 × 10^-03, 5
Colorectal cancer | 5.49, 8.10 × 10^-04, 1 | 5.17, 9.87 × 10^-03, 5
Bladder cancer | 4.27, 2.18 × 10^-03, 1
Prostate cancer | 6.09, 2.24 × 10^-03, 1
Non-small cell lung cancer | 74.27, 8.10 × 10^-03, 1
Protein processing in ER | 52.38, 4.65 × 10^-11, 2 | 12.22, 1.10 × 10^-08, 1
Spliceosome | 6.19, 1.24 × 10^-03, 3
Osteoclast differentiation | 8.70, 1.85 × 10^-06, 6
Complement and coagulation cascades | 4.74, 1.62 × 10^-03, 6
Ribosome | 7.07, 1.76 × 10^-10, 7 | 17.24, 3.34 × 10^-14, 6
ECM-receptor interaction | 11.21, 3.83 × 10^-07, 6
Focal adhesion | 16.28, 3.83 × 10^-07, 6
TGF-beta signaling pathway | 7.76, 7.07 × 10^-04, 6