A Probabilistic Approach for Automated Discovery of Biomarkers using Expression Data from microarray or RNA-Seq datasets

A dissertation submitted to the

Graduate School

of the University of Cincinnati

in partial fulfillment of the

requirements for the degree of

DOCTOR OF PHILOSOPHY

In the Department of Molecular & Cellular Physiology

of the College of Medicine

By

Gopinath Sundaramurthy

MS, University of Cincinnati

2015

Thesis Advisor and Committee Chair: Dr. Hamid Eghbalnia

Abstract

The response to perturbations in cellular systems is governed by a large number of molecular circuits that coalesce into a complex network. In complex diseases, the breakdown of cellular components is brought about by multiple molecular and environmental perturbations. While individual signatures of cellular components might vary significantly among clinical patients, commonality in signs and symptoms of disease progression is a compelling indicator that key cellular sub-processes follow similar trajectories.

Our approach aims for an enhanced understanding of the effect of disease perturbations on the cell by developing an automated platform that assigns more significance to changes that occur at the sub-network level, focusing on genes that are “wired” together and change together. The platform that we have developed is motivated by the study of concomitant expression changes in sub-networks. The analysis by our platform produces a small subset of signaling and regulatory genes that are wired together and change together beyond random chance. In order to evaluate the effectiveness of our platform in producing subsets that can distinguish diseases and disease subtypes, we used publicly available RNA-Seq and microarray breast cancer expression datasets. Each dataset was analyzed independently using our platform, and the disease-related sub-network perturbations among breast cancer subtypes were identified. The resulting subset was subjected to standard multi-way classification, and predictions based on our approach were compared with PAM50 predictions. Biomarkers identified from the microarray and RNA-Seq datasets reproduced the PAM50 classification with 100% and 80% agreement respectively, despite having only 10% of genes in common with the PAM50. This proof-of-concept analysis using breast cancer datasets is indicative of the platform’s stable cross-validation results. This platform can potentially be used for automated and unbiased computational discovery of disease-related genes. Our results suggest that probabilistic and automated approaches may offer a powerful complement to existing approaches by providing an unbiased initial screen.


This research is dedicated to my father, Mr. Sundaramurthy Arunachalam.


Acknowledgement

This dissertation would not have been possible without the help of my family, friends and mentors.

I would first like to express my sincere appreciation to my committee chair, professor and mentor, Dr. Hamid Eghbalnia, for his support, guidance, and encouragement throughout my graduate study and research. Without his persistence, advice and support, this dissertation would not have been possible. I would also like to express my heartfelt appreciation to Dr. Steven Kleene for constant support and encouragement.

I would also like to thank the other members of my dissertation committee, Dr. Yana Zavros, Dr. Jarek Meller, Dr. Judith Heiny and Dr. Anil Jegga, for their valuable feedback, insights and advice throughout my research. I would also like to thank Dr. Nelson Horseman and Dr. Jay Hove, who had previously served on my committee, for their support and guidance. I would also like to thank Jeannie Cummins and Betty Young for making Cincinnati my second home.

I would like to thank my mom, Ms. Shanthi Sundaramurthy, and my dad, Mr. Sundaramurthy Arunachalam, for providing me an excellent home where I could learn, grow and develop. My research would not have been possible without their support, determination, inspiration and encouragement. I would like to thank my wife, Dhivya Jeganathan, without whose unwavering support I would not have been able to complete my dissertation. I would also like to take this opportunity to thank my brothers and sisters-in-law, Swaminathan Sundaramurthy and Nirupama Srinivasan, and Palani Sundaramurthy and Sampoorni Deivasigamani, for always being there and supporting me through my challenging times.

Finally, I would also like to thank my friends Shruti, Shatrunjai and Preeti for making my PhD experience a memorable one. I would especially like to thank Dr. Sun Wook Kim, Dr. Shreya Ghosh and Dr. Kirthi Radhakrishnan for all the support through my research and dissertation.


TABLE OF CONTENTS

Abstract

Acknowledgement

List of Figures

List of Tables

Chapter I: Introduction and Historical Perspective

I - 1 Biological Complexity and Emergence

I - 2 Complex Diseases

I - 3 Organizing Principles of Biological Systems

I - 4 Network Biology

I - 5 Network Theory and Properties of Biological Networks

I - 6 Dynamics of Biological Networks

I - 6.1 Differential Expression Analysis

I - 7 Pathway Analysis

I - 7.1 Classification of Pathway Analysis

I - 7.2 Limitations of Pathway Analysis

Chapter II: Aim of the Thesis

Chapter III: Methods

III - 1 Genomic and Network Data

III - 1.1 Genomic Data

III - 1.2 Network Data

III - 2 Probability of Change (Nodes and Edges)

III - 3 Hub Interaction Score

III - 4 Path Analysis

III - 5 Path and Feature Genes Selection

III - 6 Biomarker Selection, Validations, and Functional Analysis

III - 7 Pathway Analysis Software Architecture

III - 7.1 GSE Analyzer

III - 7.2 GSE Project Compiler

III - 7.3 GSE Run Scheduler

III - 7.4 GSE EMD Calculation

III - 7.5 GSE Path Analysis Calculations

III - 7.6 GSE Feature Selection

III - 7.7 GSE Validation, Results and Reports

Chapter IV: Expression Analysis of Breast Cancer Subtypes

IV - 1 Molecular Subtypes of Breast Cancer

IV - 2 Dataset Information

IV - 3 Study Design

IV - 4 Results

IV - 4.1 Probability of change (POC)

IV - 4.2 Path Analysis

IV - 4.3 Feature genes and Classification Performance

IV - 5 Discussion

IV - 6 Follow-up Analysis using TCGA datasets

Chapter V: Expression Analysis of Sepsis, SIRS And Septic Shock

V - 1 Abstract

V - 1.1 Background

V - 1.2 Method

V - 1.3 Results and Conclusion

V - 2 Background

V - 2.1 Molecular Mechanism of Sepsis, SIRS and Septic Shock

V - 2.2 Clinical Diagnosis

V - 2.3 Limitations of diagnostic methods

V - 3 Dataset Information

V - 4 Network Medicine and Human Inflammatory Disease Physiology

V - 4.1 Differential Expression patterns for studying multigenic diseases:

V - 4.2 Network Data

V - 4.3 Identifying “Creative elements”

V - 5 Methods

V - 5.1 Probability of change (POC) for nodes and edges:

V - 5.2 Temporal Causality for Nodes and Edge Scores

V - 5.3 Hub Interaction Score

V - 5.4 Study Design and Path Analysis

V - 6 Results

V - 6.1 Performance of POC in time-series analysis

V - 6.2 Path Analysis and Functional Genomics

V - 6.3 Biomarkers and Classification Performance

Chapter VI: Summary and Future Directions

VI - 1 Software Architecture:

VI - 1.1 Cloud Service:

VI - 1.2 Smarter multi-processing

VI - 1.3 User Interface:

VI - 1.4 Automation:

VI - 1.5 Improve run-times:

VI - 1.6 Improve monitoring and reporting:

VI - 1.7 Improved Data Management Schemes:

VI - 1.8 Integration with Public API’s:

VI - 2 Algorithm

VI - 2.1 Network Analysis:

VI - 2.2 Multiple dataset integration

VI - 2.3 miRNA-mRNA Analysis

VI - 2.4 Identifying analogous paths across species

VI - 2.5 Improve precision and recall for RNA-Seq dataset classification

VI - 2.6 New bioinformatics uses for the statistical metric: prospecting site-directed mutagenesis

Chapter VIII: References

List of Figures

Figure 1: The Difficulties in studying Complex Diseases

Figure 2: Overall Aim of Most Differential Expression Analysis is extracting useful genes from the thousands analyzed by high-throughput analysis

Figure 3: Comparing our approach with the frequentist approach

Figure 4: Detailed flow chart of steps within the path analysis algorithm

Figure 5: Probability of Change Calculation for Nodes

Figure 6: Calculation for the Probability of Change for the Edge

Figure 7: Calculation of Hub Scores using EMD Distance Calculation

Figure 8: steps involved in path analysis and the software architecture

Figure 9: General Structure of Genomic Expression data obtained from GEO Public Database

Figure 10: Extracting Genomic data and Other Relevant Information for The Project

Figure 11: Building Projects based on user defined configuration files

Figure 12: Run Scheduler which assigns and monitors all the processors dedicated for the analysis

Figure 13: Modules required for Calculation of Hub Interaction Scores

Figure 14: Modules of the Path Analysis calculations

Figure 15: Biomarker Identification using Feature Selection

Figure 16: Biomarker Validation using Classification

Figure 17: Modules required for Report Generation and creating summary

Figure 18: Methodology for the molecular classification of breast cancer

Figure 19: molecular classification of breast cancer

Figure 20: Prevalence of breast cancer subtypes in normal women population

Figure 21: Overall Design of the Study

Figure 22: Combination of runs required to identify disease related genes and biomarkers using the microarray and RNA-Seq dataset

Figure 23: Study Process for Individual Combination

Figure 24: Plots show the distribution of POC scores of nodes and edges for various subtype comparisons

Figure 25: Distribution of Node and Edge POC scores for different subtype combinations

Figure 26: Degree Distribution of nodes in the Undirected and Directed graph

Figure 27: Probability of Change (POC) as a metric for differential expression

Figure 28: Comparing POC with frequentist metrics like absolute fold change and p-value

Figure 29: POC vs Absolute fold change for all genes in every subtype comparison

Figure 30: Plots of POC vs p-value for various subtypes comparisons

Figure 31: POC vs p-value for all genes in every subtype comparison

Figure 32: Venn Diagrams of Path Analysis Results from individual pairwise analysis

Figure 33: Venn Diagrams of Path Analysis Results from individual pairwise analysis-II

Figure 34: Heatmap of biomarkers identified in the microarray analysis

Figure 35: Heatmap of Biomarkers Identified in the RNA-Seq Analysis

Figure 36: Network Map of Identified Biomarkers

Figure 37: Clinical Diagnosis for Sepsis, SIRS, and Septic Shock

Figure 38: Algorithm and Study Design for the time series analysis of sepsis, SIRS and septic shock

Figure 39: Study design for sepsis, SIRS and septic shock analysis

Figure 40: Disease profile of sepsis

Figure 41: Disease profile of Septic Shock

Figure 42: Disease profile of SIRS

Figure 43: Plots details the POC vs p-value scores for the time-series sepsis analysis

Figure 44: Plots details the POC vs p-value scores for the time-series septic shock analysis

Figure 45: Plots details the POC vs p-value scores for the time-series septic shock analysis

Figure 46: Disease-related genes from analysis

Figure 47: plots shows the distribution between EMD scores, POC and node degree respectively

Figure 48: Feature Selection of inflammatory Biomarkers

Figure 49: sub-figure a show how the feature Selection was used to identify biomarkers

Figure 50: Prospective methodology to study micro-RNA m-RNA expression

Figure 51: Motivation behind the Functional Enrichment Analysis using miRNA-mRNA interactions

Figure 52: Prospective methodology to study pathways across species

Figure 53: Illustrates the importance of correlating conformational structure to structure and function of

Figure 54: Illustrates steps involved for the different types of analysis that can be done

Figure 55: Overall cost of amino-acid replacement

Figure 56: Heatmap of the Neighborhood effect during site-directed mutagenesis

Figure 57: Understanding EMD costs using Ramachandran plots

Figure 58: Example of tripeptide analysis using our methodology

List of Tables

Table 1: Counts of Overlapping genes between RNA-Seq and Microarray analysis along with the intersecting genes from both these analysis

Table 2: List of biomarkers identified from the Microarray Analysis

Table 3: List of Biomarkers Identified from the Breast Cancer RNA-Seq Analysis

Table 4: Classification Report for Microarray Training Set Analysis

Table 5: Classification Report for Microarray Test dataset

Table 6: Classification Report for RNA-Seq training dataset analysis

Table 7: Classification Report for RNA-Seq test dataset analysis

Table 8: Classification Report for biomarkers identified from intrinsic lists using the training dataset

Table 9: Comparison of the correlation between RNA-Seq and microarray POC scores between two independent runs. ‘previous run’ column reports the published analysis while the ‘current run’ reports the results from analysis of the new TCGA dataset

Table 10: comparison of correlation in each platform as measured by mean raw expression values and POC scores

Table 11: Selective pathways that showed significant changes during sepsis temporal analysis

Table 12: Selective pathways that showed significant changes during septic shock temporal analysis

Table 13: Selective pathways that showed significant changes during septic shock temporal analysis

Table 14: Biomarkers identified for sepsis, septic shock and SIRS

Table 15: Multi-class classification result for the test and training dataset

Table 16: Preliminary results from our miRNA-mRNA Analysis for sepsis

Table 17: Shows variation in EMD distances based on changes

Chapter I: Introduction and Historical Perspective

Current disease diagnostic methodologies are derived from 19th-century approaches in which diseases were largely classified using Oslerian clinicopathological correlations, that is, simple correlations between clinical symptoms and pathological analysis (Beresford, 2010; Loscalzo & Barabasi, 2011). Diseases were defined by the principal organ system that displayed the signs and symptoms, and the gross anatomic pathology and histopathology were correlated (Loscalzo & Barabasi, 2011). This approach continued to dominate much of medicine for another century, during which techniques to identify pathological markers continued to be refined. These new techniques included biochemical measurements, immunohistochemistry, flow cytometry and many more (Sorger et al., 2011). Although this approach served the medical community, it has limited applicability in the era of genomic science due to its strong reliance on reductionist principles (Auffray & Nottale, 2008; Mazzocchi, 2008; Wülfingen, 2009).

Reductionist principles in biomedicine, which define simple mechanistic associations to control disease pathobiology, have dominated Western scientific theory for centuries (Beresford, 2010; Chan & Loscalzo, 2012a; M H V Van Regenmortel, 2004). This approach entails dissecting biological systems into their constituent parts and thereby explaining the chemical and biological basis for various life processes (Mazzocchi, 2008; M H V Van Regenmortel, 2004; Marc H V Van Regenmortel, 2004). Research into the workings of normal and diseased tissues by Dr. Johannes Müller and Dr. Theodor Schwann laid the foundation for cell theory, tumor theory, and cellular pathology. This paradigm was helpful in the clinical science of the 19th century, when clinicians could study the physiology of diseased tissues by describing them with respect to their normal counterparts. Since the discovery of DNA as a transmitter of heritable information, the fundamental aim of genetics has been connecting phenotypes to genotypes (Auffray & Nottale, 2008). The reductionist approach was very successful in the study of disease traits that followed Mendelian genetics (Botstein & Risch, 2003b). These diseases are generally brought about by the modification of one or more genes, and the disease phenotype is usually the outcome of the modified gene/protein. The effect of a genetic mutation on the final protein product provided the basis for researchers to understand the probable roles these genes played in normal and disease physiology (Lo et al., 2012; Schadt, 2009). Techniques such as ‘positional cloning’ led to the identification of thousands of genes. Some of the diseases whose genes were identified using this method include chronic granulomatous disease, X-linked muscular dystrophies, cystic fibrosis, and Fanconi anemia. Predisposition to cancers is caused by variations in the retinoblastoma gene and in breast cancer genes like BRCA (Botstein & Risch, 2003a). Much of the knowledge about these diseases was obtained by studying the families segregating high-risk alleles. However, families carrying these alleles were rare in the population (C. S. Carlson, Eberle, Kruglyak, & Nickerson, 2004).

Linkage mapping was successfully used to map the gene implicated in cystic fibrosis (Botstein & Risch, 2003a). However, linkage analysis may also result in false linkages and missed diagnoses in diseases having a high degree of heterogeneity and complex inheritance patterns. For instance, the gene BRCA1 has been shown to account for only a fraction of all breast cancers (Botstein & Risch, 2003a). To expedite the process of identifying and studying thousands of human disease genes at a fraction of the cost, the Project (HCP) was created (de la Iglesia et al., 2013). Within a couple of years of its inception, the HCP created the most accurate genomic map available and the necessary infrastructure to make it accessible to researchers around the world.

With the new wealth of knowledge now available, it has become increasingly clear that Mendelian inheritance alone does not entirely explain the genotypic and mutational variability, clinical phenotype variability, and highly variable associated risks observed in various diseases (Botstein & Risch, 2003a). The molecular basis of complex diseases such as cancers, allergy, and obesity indicates that current and future diagnostic techniques need to move away from simple Mendelian genetics (Botstein & Risch, 2003b). Studies show that these disease phenotypes are brought about by the complex interaction of multiple genes and environmental factors, and that it is difficult to isolate the effect of an individual factor on the overall disease outcome (Pawson & Linding, 2008). Many of these genes/proteins are also involved in many different processes, thereby making it difficult to investigate the role of a single gene in the disease physiology.


Scientists also began to realize that the reductionist approach had begun to reach its limits (Mazzocchi, 2008). The new wealth of knowledge revealed that biological systems are extremely complex and have emergent properties which cannot be fully explained or predicted from the individual components (Bizzarri, Cucina, Conti, & D’Anselmi, 2008; J. M. Carlson & Doyle, 2001). The reductionist approach, which was successful earlier, underestimated this complexity by generalizing pathophenotypes and failing to consider individualized, patient-specific nuances. The reductionist approach can therefore result in incomplete or faulty diagnostic techniques that can be detrimental to the early identification and treatment of complex diseases. Some of the limitations of traditional disease definitions include (Barabási, Gulbahce, & Loscalzo, 2011a; Loscalzo & Barabasi, 2011):

• Disease definition generally focuses on late-appearing manifestations in a dysfunctional organ system. The disease definitions also generally neglect preclinical pathophenotypes or environmental or genetic susceptibility factors. Therefore, the therapeutic strategies are not generally focused and are designed to target intermediate pathophenotypes.

• The conventional disease paradigm neglects disease physiology that extends beyond the disease-defining organ system. It also does not consider environmental (stochastic) and molecular (deterministic) factors that may play a role in the disease progression.

• Conventional classification of diseases is excessively inclusive for a range of pathophenotypes which were largely classified in the pre-molecular era. These classifications obscure subtle yet important demarcations among diseases, thereby losing vital information which can be important to providing the right therapeutics to the patient.

• Another drawback is the use of the reductionist approach in identifying the disease mechanism, which treats the disease pathology as a consequence of a direct abnormality in one or more molecular effectors.


I - 1 Biological Complexity and Emergence

Several studies have suggested moving away from the reductionist approach and towards more systems-based approaches in genetics (Kay, 1995; Keller & Harel, 2007; Keller, 2005). The strongest impetus for this movement was provided by the Human Genome Project, which identified different levels of organization within genomes. Systems-based approaches involve investigating the interactions between the constituent parts of the biological system and the environment that give rise to novel or ‘emergent’ features, which are ignored when the constituent parts are considered in isolation. A key component of the systems-based approach is the identification of ‘emergent properties’, which are properties of the whole biological system or network and not characteristic of any individual component alone. Thus the concept of ‘emergence’ complements ‘reduction’ when reductionist principles provide limited explanation of complex phenomena and diseases (Chan & Loscalzo, 2012b; You, 2004). Although biology has always been a science of complex systems, the study of emergence and complexity in biological systems is quite recent and is being advanced by the advent of powerful computational systems. Emergent properties also tend to resist any attempt at being predicted or deduced by explicit calculations (Helikar, Konvalina, Heidel, & Rogers, 2008; M H V Van Regenmortel, 2004) because the components of these systems interact in many ways, including feed-forward and/or feed-back control, whose dynamic outcome cannot always be predicted satisfactorily using linear mathematical models that disregard cooperative and non-additive effects. A new method of modeling/mathematics is required for capturing the outcomes of these systems (Aderem & Smith, 2004).
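The limitation of linear, additive models noted above can be illustrated with a deliberately simple toy example. The sketch below is not taken from the dissertation: the function names and the AND-like response are hypothetical, chosen only to show how a cooperative effect escapes a model that sums individual contributions.

```python
def cooperative_response(a_active, b_active):
    """Hypothetical AND-like response: output appears only when both inputs are active."""
    return 1.0 if (a_active and b_active) else 0.0


def additive_prediction(baseline, effect_a, effect_b):
    """Linear model: combined response = baseline + sum of the individual effects."""
    return baseline + effect_a + effect_b


# Measure each factor in isolation, as a reductionist experiment would.
baseline = cooperative_response(False, False)
effect_a = cooperative_response(True, False) - baseline
effect_b = cooperative_response(False, True) - baseline

# Predict the joint response from the isolated effects, then compare with the
# response actually produced when both factors act together.
predicted = additive_prediction(baseline, effect_a, effect_b)
observed = cooperative_response(True, True)

print(f"additive prediction: {predicted}, observed response: {observed}")
```

In this toy system each factor appears to have no effect when studied alone, so the additive model predicts no response, while the combined system responds fully. Real cellular circuits are far richer, but the mismatch has the same origin: the response depends on the interaction rather than on either part.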

A property is considered emergent if it satisfies the following criteria:

1. It is irreducible to the properties of the parts.

2. It could not be predicted before its instantiation.

One of the best examples illustrating the drawback of genetic reductionism comes from knockout studies performed on model organisms to study the hypothetical function of genes established from in-vitro experiments. In knockout experiments, a gene that is considered important is removed or inactivated through various processes and the effect on the organism is studied. In some cases there was negligible to no effect on the organism despite the gene having been shown to be important in in-vitro studies, while in other cases the outcome was completely different from what was predicted (Edwards & Palsson, 2000; He & Zhang, 2006). The large variation between the in-vitro and in-vivo results can be attributed to the fact that some of these genes may simply have been part of a larger pathway or network, so that the final phenotypic outcome of the experiment is the result of the modification to a network property and not a protein property. In a similar manner, diseases such as cancer can be viewed as emergent phenotypes caused by alterations arising at the tissue level of organization.

I - 2 Complex Diseases

Complex diseases, such as cancers, sepsis, Parkinson’s disease, and obesity, are caused by a combination of molecular and environmental factors which dysregulate similar components of the cellular system, as shown in Figure 1. Various factors, including gene pleiotropy (Barrenäs et al., 2012; del Sol, Balling, Hood, & Galas, 2010; Gelfond, Ibrahim, Gupta, Chen, & Cody, 2013; Kitano, 2004; Schadt, 2009), gene interactions (Kitano, 2004; Schadt, 2009), and environmental factors (Milne, Carneiro, O’Morain, & Offerhaus, 2009; Schadt, 2009), explain the lack of correlation between phenotypes and genotypes in complex diseases. Genes and their respective proteins influence each other’s function, either synergistically or through modification of other genes and proteins. Genes are also known to exhibit pleiotropy, wherein a single gene produces multiple phenotypes. Thus a mutated gene may continue to possess some, all, or none of its initial traits, further obfuscating the identification of a genotype for a given disease phenotype.

Environmental and molecular factors that differ from patient to patient can also limit the use of simple Mendelian techniques. In a ‘simple’ Mendelian disease, a mutation in one gene is a major cause of the overall disease phenotype, and variation in the levels of the factors causing the genetic mutation can account for the severity of the disease (Botstein & Risch, 2003b). Complex diseases such as cancers, autism, sepsis, and obesity are caused by a number of complex factors which vary between patients (Y. Liu & Chance, 2013). In many cases, it is similar components, such as the cell cycle and pathways, that are affected during the disease rather than individual genes. This is because the perturbation in each individual factor alone cannot account for the disease; however, when these factors interact with each other, they give rise to an emergent disease phenotype (Bodenmiller et al., 2010; Kim, Wuchty, & Przytycka, 2011). Figure 1 shows the factors that influence a few complex metabolic diseases.

FIGURE 1: THE DIFFICULTIES IN STUDYING COMPLEX DISEASES

Figure 1 Details: Complex diseases such as cancers are brought about by multiple genes acting in combination with environmental and life-style factors. Understanding complex diseases is difficult due to the heterogeneity in the disease, patients, knowledge and samples.


Complex diseases are also known to be very heterogeneous, whereby the same disease can display a large number of phenotypes (Altman, 2012). The origin of several complex diseases has been linked to the breakdown of cellular biomolecular networks. Cells have a strong self-preservation strategy. This makes them very adaptable to a constantly changing environment, especially during the course of infectious diseases (Chavali et al., 2008). To achieve adaptability, cells constantly monitor the internal and external environment using signaling pathways (Antunes et al., 2009; Schadt, 2009; Supper et al., 2009). They maintain homeostatic conditions within the cell by using regulatory pathways to modify the levels of various genes. When there is an internal or external breakdown of cellular components, the cell modifies the levels of certain genes so that the system is returned to homeostasis, failing which it triggers apoptosis. Thus cellular systems are designed to either weather the changing environment or trigger apoptosis when the cell is too badly damaged. Cancer cells are known to hijack this mechanism to their advantage during their growth phase. As seen in many cancers, when presented with anti-growth signals, these cancer cells choose alternative paths to acquire key cancer hallmarks when the direct route is blocked. This is also the main reason why levels of cancer genes can vary markedly within the same type of cancer.

In another complex disease, autism, heterogeneity and patient-specific phenotypes are also observed (Lee, Chuang, Kim, Ideker, & Lee, 2008). Autism is considered the most complex heritable disorder, and the emergence of the disease has been linked to a rare genetic variation (Ptak & Petronis, 2008). It is evident from these studies on complex diseases that it is necessary to investigate the interactions between genes and their products in response to environmental perturbations.

I - 3 Organizing Principles of Biological Systems

Cellular components exert their functions by interacting with other components in the cellular system (Barabási et al., 2011a). These functions can be exerted within the same cell, across cells, or even across organs. At the system level, these molecular interactions coalesce into a highly complex interconnected network called the human ‘interactome’, consisting of various elements including DNA, RNA, protein-coding genes, proteins, and metabolites (Przytycka, Singh, & Slonim, 2010; Vidal, Cusick, & Barabási, 2011). The human ‘interactome’ consists of approximately 25,000 protein-coding genes, approximately 1000 metabolites, and an unidentified number of distinct proteins and functional RNA molecules. The ‘interactome’ network achieves relative functional autonomy through variable feedback cross-linkages, redundancy, and modularity. Under normal conditions, cells are designed to adapt to perturbations in the environment by orchestrating intracellular and intercellular molecular interactions between various genomic and proteomic elements (Dunham et al., 2012; Gitter, Carmi, Barkai, & Bar-Joseph, 2013; Mei, Zhao, & Fu, 2012). However, in complex disease conditions, such as cancers, key cellular networks break down in response to multiple molecular and environmental perturbations (Barabási et al., 2011a; del Sol et al., 2010; Kitano, 2004). These emergent disease phenotypes arise due to a combination of various molecular perturbations that can vary between patients while dysregulating similar components of the cellular system (Kim et al., 2011; K.-Q. Liu, Liu, Hao, Chen, & Zhao, 2012; Y. Liu & Chance, 2013).

In order to unravel the behavior of cells under disease conditions, it is necessary to understand cellular robustness, which is a key property of the cellular network (Beisser et al., 2012; Kitano, 2004). Cellular robustness is a fundamental and ubiquitous system-level phenomenon which can be defined as the ability of the cell to maintain its function despite external and internal perturbations (Kitano, 2004). Understanding the architecture and tradeoffs of cellular robustness in evolving cellular systems is important in understanding complex diseases and developing effective therapies (Kitano, 2004; Whitacre, 2012). This robustness is enabled by feedbacks and redundancies in the complex network, which allow changes in the structure and components of the system while still maintaining the essential functionalities. Because multiple paths exist for achieving a cellular functionality, when a failure occurs in one path the system is able to bypass the failure and maintain the functionality. This fail-safe mechanism is achieved through redundancy and diversity within the cell (Kitano, 2004; Whitacre, 2012). Redundancy refers to the presence of several identical components/paths to attain the same functionality. Duplicated genes are an example of cellular redundancy (Whitacre, 2012). Diversity refers to the ability of the cell to use components different from the original to achieve a given functionality (Whitacre, 2012). Cellular diversity is also known as phenotypic plasticity. For instance, the metabolic pathways in yeast can change to metabolize either glucose or lactic acid, depending on the environmental conditions.

Another important property of a complex system is modularity. Modularity allows a complex system to localize the response to environmental stimuli to minimize the effect on the overall network (Whitacre, 2012). A cell within a multi-cellular system is an example of modular design. Modularity decouples the low-level environmental perturbations from the higher-level functionality, thereby preventing catastrophic failures in the complex system.

I - 4 Network Biology

Network biology relies largely on network theory, which indicates that cells, in both disease and normal conditions, are organized around a set of core organizing principles (Barabási et al., 2011a). A contemporary approach to human diseases requires a systems perspective of these interactions. Understanding the complexity of cells and network principles can address the role of various cellular constituents during the course of the disease. To better understand the roles that molecular and environmental perturbations play in complex disease, we need to define not only the topology of the disease network but also the dynamics of the network during the course of the disease (Loscalzo & Barabasi, 2011). These perspectives are essential in understanding the determinants of disease expression. The topology of the molecular network defines the interplay between the cellular constituents. Network analysis allows the development of a framework to understand the collective behavior and large-scale response of biological systems to stimuli and perturbations.


Network analysis allows the individual components of a biological system to be integrated into a single system and examined through the interactions of these components. Components such as genes, proteins, and metabolites can be best characterized as networks. One way to represent these systems graphically is to use vertices (or nodes) and edges between these nodes. The nodes of these networks represent the various molecular components, such as proteins, genes, and metabolites, while the edges depict the type of interactions between the nodes. There can be different types of molecular networks to represent different types of interactions. Protein-protein interaction networks depict physical binding interactions between different proteins. In metabolic networks, the nodes represent metabolites or proteins which participate in various biochemical reactions, and the directed edges depict the direction and type of biochemical reaction. Regulatory networks depict directed regulatory information, such as transcription factors and their gene products. They also represent post-translational modifications such as those seen with kinases. RNA networks depict the effects that micro-RNA, short-interfering RNA, and DNA have on gene expression.

I - 5 Network Theory and Properties of Biological Networks

The Human Genome Project (HGP) has played a very important role in the advancement of network medicine. It was the first large-scale study developed to map the entire human genome, with the goal of better understanding the human genome and diseases. Other high-throughput technologies such as microarray and RNA-Seq leveraged the data obtained from the HGP and the sequences of all the genes to advance the analysis of the human system. Currently, the focus of life-science research has largely moved on from studying genes in isolation to studying the roles of molecular components and their interactions in a global view of cellular functions. Studies have indicated that the interactions between various molecular components are not random but are instead governed by a set of organizing principles. The role these molecular components might play in the disease can be identified using either systems biology or experimental methodologies such as high-throughput screenings. Systems biology methodology involves integrating currently available knowledge about the molecular network with properties of the network to identify the regions of the network that are most likely involved in the disease. High-throughput screening methodologies such as microarray expression, methylation studies, genome-wide association studies, and RNA-Seq analysis have provided an unprecedented amount of data about the state of cells/tissues/organisms under various conditions. These methodologies can provide a snapshot of the cell at a given instant in time, revealing to researchers the state of molecular components during the course of the disease. The results from these high-throughput methodologies, coupled with a network theory framework, can provide researchers with a powerful tool to better understand the complex nature of disease.

Investigating the structural properties of molecular networks can reveal useful information regarding the evolution of key biological properties. These key biological properties can also reveal how disease conditions might use these properties to their advantage and the possible mechanisms for restoring homeostasis to the system. Given the nature of biological systems, molecular components and their interactions need to be properly classified in order to extract meaningful information from the analysis. For this reason, biological networks are classified by the types of interactions they represent, as listed below.

• Protein-Protein Interaction Networks: These molecular networks depict how different proteins coordinate with each other to enable all the biological processes (Böde et al., 2007). Although the sequences of most genes and proteins are available, there are a large number of proteins whose function and interactions are not completely understood. Therefore, the interactions for these proteins are usually obtained from large-scale high-throughput investigations in which methodologies such as yeast two-hybrid maps and tandem affinity purifications are used to develop binary interactions for nearly all human proteins. Information on these interactions is held in many publicly available databases such as the Molecular INTeraction database (MINT) (Stelzl et al., 2005), the IntAct database, the Biomolecular Interaction Network Database (BIND), and the Human Protein Reference Database (HPRD) (Han, 2008).


• Signaling and Regulatory Networks: Cells have evolved to constantly monitor the status of their internal and external environments. Cells transmit this information to their nuclei, which adapt to their environments by modifying the levels of proteins and genes (Eungdamrong & Iyengar, 2004). The regulatory network contains information about transcription factors, post-translational modifications, and associations with other biomolecules. The signaling and regulatory network within the cell helps accomplish this task by transmitting information in a directed fashion. These molecular networks illustrate how signal transmission occurs and the many ways it may affect the expression of genes and proteins within the cell. Databases such as JASPAR and TRANSFAC contain regulatory network information, whereas signaling network information is present in databases such as TRANSPATH and MIST.

• Metabolic and Biochemical Networks: This type of network is well described and is considered to be the most comprehensive biological network. It is used to study and model metabolic reactions in various organisms. These metabolic maps detail the dynamics of a series of biochemical reactions and the enzymes that catalyze these reactions. They also include a mathematical description of the biochemical reactions in the system. The network model can be either a static network map or a dynamic network model, depending on the depth of knowledge available in the area. This network is represented using an XML-based language called Systems Biology Markup Language (SBML), which provides the framework necessary for the representation and simulation of these models (Hucka et al., 2010). SBML has also been adapted to represent other networks, including signaling and regulatory networks. Databases such as KEGG (Kyoto Encyclopedia of Genes and Genomes) (Ogata et al., 1999), BioCyc (L. Chen, Huang, Shi, Cai, & Chou, 2010), EcoCyc (L. Chen et al., 2010), and Reactome (Joshi-Tope et al., 2005) provide a comprehensive network map of all the metabolic reactions known. More recently, the human metabolism map for the liver described the various biochemical reactions occurring in the liver in mathematical notation (Jerby, Shlomi, & Ruppin, 2010).


• RNA Networks: RNA-RNA and RNA-DNA interactions are represented in RNA networks (Enerly et al., 2011; Witkos, Koscianska, & Krzyzosiak, 2011). These include the roles that micro-RNA and other small RNAs play in regulating the expression of genes in the cell. Databases such as TargetScan, miRBase, miRDB, and PicTar contain information about micro-RNA and their predicted gene targets (Witkos et al., 2011).

Similar to networks, individual genes can be classified into different groups based on important characteristics, which sheds more light on the evolutionary roles of these genes. Some of these classifications are listed below.

• Essential Genes: Essential genes comprise the minimal set of genes in an organism that are considered vital for the survival of the organism (Zhang & Lin, 2009). Although human embryos contain tens or even hundreds of mutations to genes, only a few mutations to key genes may impact the maintenance and growth of cells. These genes are known as essential genes and are usually highly conserved. They evolve at a much slower rate than other genes in the genome. The knockdown or knockout of these genes in an organism is usually lethal, as they affect the early development of the organism. Understanding the roles these genes play in the molecular network will help researchers to identify potential drug targets for complex diseases.

• Disease-related Genes: Disease-related or disease-causing genes play a significant role in the progression of the disease (Lee et al., 2008). Unlike essential genes, whose mutations are usually lethal to the organism, disease-causing genes usually do not play a significant role in the development of the organism. They usually come into play once the organism is born, or sometimes their effects can be seen only when the organism reaches reproductive age. Most disease-causing genes are likely not essential genes, but they do have a strong tendency to interact with essential genes.

The primary motivation behind using network theory is to study and understand the role and function of molecular components in overall disease and normal physiology (Zhu, Gerstein, & Snyder, 2007). In this study, network theory is employed to understand the role of various molecular components in normal and disease physiology (Lee et al., 2008; Schadt, 2009). Such analyses can help in identifying viable drug targets, determining the functions of genes and proteins, and diagnosing and treating diseases based on disease physiology.

1.1. Undirected Graphs: It is essential to understand the mathematical description of a graph to have a better understanding of graph theory (Mason & Verwoerd, 2007; Pavlopoulos et al., 2011). A graph can be defined as

$G = (V, E)$, where

• V is the set of vertices representing the various molecular components in the cell, such as proteins, DNA, and RNA.

• E is the set of edges representing the interactions between the various molecular components, $E = \{(i, j) \mid i, j \in V\}$, where each edge (i, j) connects two neighboring molecular components i and j. Interactions such as binding are best represented by these edges, as they are directionless. In the case of protein-protein interaction networks, these edges indicate direct binding between two different proteins.

Two neighboring elements can be connected by a single edge or, as sometimes happens in protein-protein interaction networks, by more than one edge. When a network contains more than one edge connecting two molecular components, such edges are called multi-edge connections.
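As a minimal sketch of the definition above, an undirected graph can be stored as an adjacency map in which each edge (i, j) is recorded in both directions. The protein names and the helper `build_undirected` are hypothetical and serve only as an illustration; multi-edge connections are not modeled here.

```python
from collections import defaultdict

def build_undirected(vertices, edges):
    """Return an adjacency map {vertex: set(neighbors)} for G = (V, E)."""
    adjacency = defaultdict(set)
    for v in vertices:
        adjacency[v]  # touch the key so isolated vertices are kept
    for i, j in edges:
        if i not in vertices or j not in vertices:
            raise ValueError(f"edge ({i}, {j}) uses a vertex outside V")
        # Undirected: store the edge in both directions.
        adjacency[i].add(j)
        adjacency[j].add(i)
    return dict(adjacency)

# Hypothetical toy network of four proteins; P4 has no known interaction.
V = {"P1", "P2", "P3", "P4"}
E = [("P1", "P2"), ("P2", "P3")]
ppi = build_undirected(V, E)
```

Storing neighbors as sets makes the lookup "does i bind j?" symmetric by construction, which mirrors the directionless nature of binding interactions.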

1.2. Directed Graphs: A directed graph represents interactions that usually occur in a sequential manner (Mason & Verwoerd, 2007; Pavlopoulos et al., 2011). A directed graph can be defined as

$G = (V, E, f)$, where

• V is the set of vertices representing the various molecular components in the cell, such as proteins, DNA, and RNA.

• E is the set of edges representing the interactions between ordered pairs of molecular components, $E = \{(i, j) \mid i, j \in V\}$, where each edge (i, j) connects two neighboring molecular components i and j and is directed from i to j.

• f is a function that maps every interaction in E to an ordered pair of molecular components in V.

Directed graphs are suitable for representing biological pathways, sequential interactions, or time-dependent flow of information in networks. They are best suited to represent metabolic, biochemical, signal transduction, or regulatory networks.

1. Weighted Graphs: Individual weights can be associated with the edges of both directed and undirected graphs, and such graphs are called weighted graphs (Pavlopoulos et al., 2011). Beyond the topology, in most molecular networks, not all interactions are equal. Interactions between certain molecular components are stronger than others because of various factors such as binding energies, locations of molecular components, and the state of molecular components. Weighted graphs define these ‘strengths’ of the various interactions in the network in a quantifiable manner. Hence, they have become the natural way to overlay experimental data on topological network information. Weighted graphs are defined as

$G = (V, E)$, where

• V is the set of vertices representing the various molecular components in the cell, such as proteins, DNA, and RNA.

• E is the set of edges representing the interactions between pairs of molecular components, $E = \{(i, j) \mid i, j \in V\}$, and is associated with a weight function $w: E \to \mathbb{R}$, where $\mathbb{R}$ denotes the set of real numbers. The weight $w_{ij}$ is the value assigned to the edge connecting two neighboring molecular components i and j. The weights on the edges are used to represent the reliability of the edge in the network and are widely used throughout the fields of bioinformatics and computer science. They are frequently used in the analysis of disease networks because they can be made to represent the importance of interactions based on experimental high-throughput analysis.
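The weight function can be illustrated with a short sketch in which each undirected edge is mapped to a real number. The weights below are hypothetical reliability scores between 0 and 1, and the 0.5 cut-off is an arbitrary assumption chosen only for the example.

```python
def build_weighted(edges):
    """Map each undirected edge {i, j} to a real-valued weight w_ij."""
    weights = {}
    for i, j, w in edges:
        # frozenset makes (i, j) and (j, i) refer to the same edge.
        weights[frozenset((i, j))] = float(w)
    return weights

def weight(weights, i, j, default=0.0):
    """Return w_ij, or `default` when i and j do not interact."""
    return weights.get(frozenset((i, j)), default)

# Hypothetical reliability scores for three interactions.
w = build_weighted([("P1", "P2", 0.9), ("P2", "P3", 0.4), ("P3", "P4", 0.7)])

# Keep only the edges whose weight passes an assumed cut-off of 0.5.
strong = sorted(tuple(sorted(e)) for e, x in w.items() if x >= 0.5)
```

Filtering by weight in this way is one simple means of restricting an analysis to the better-supported interactions.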

1.3. Bipartite Graph: Bipartite graphs are a special case of undirected graphs (Pavlopoulos et al., 2011). Bipartite graphs are undirected graphs defined by $G = (V, E)$ such that

• V is the set of vertices representing the various molecular components in the cell, such as proteins, DNA, and RNA. The molecular components of bipartite graphs are partitioned into two disjoint sets $V_1$ and $V_2$.

• E is the set of edges representing the interactions between pairs of molecular components, $E = \{(u, v) \mid u, v \in V\}$, such that $u \in V_1$ and $v \in V_2$, or $u \in V_2$ and $v \in V_1$.
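The bipartite condition can be checked directly from the definition: the two vertex sets must not overlap, and every edge must have one endpoint in each set. The sketch below assumes a hypothetical micro-RNA/target-gene network; the node names are illustrative only.

```python
def is_bipartite_partition(v1, v2, edges):
    """True if V1 and V2 are disjoint and every edge joins V1 to V2."""
    if v1 & v2:
        return False
    return all(
        (u in v1 and v in v2) or (u in v2 and v in v1)
        for u, v in edges
    )

# Hypothetical example: micro-RNAs on one side, target genes on the other.
mirnas = {"miR-a", "miR-b"}
genes = {"G1", "G2", "G3"}
targets = [("miR-a", "G1"), ("miR-a", "G2"), ("G3", "miR-b")]

ok = is_bipartite_partition(mirnas, genes, targets)
# Adding a gene-gene edge violates the bipartite condition.
broken = is_bipartite_partition(mirnas, genes, targets + [("G1", "G2")])
```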

1.4. Subgraph: If $G = (V, E)$ is an undirected or directed graph, then $G_1 = (V_1, E_1)$ is a subgraph of G if $V_1 \subseteq V$ and $E_1 \subseteq E$, where every edge in $E_1$ connects vertices in $V_1$.
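One common way to obtain a subgraph is to choose the vertex subset and keep every edge of the original graph whose endpoints both fall inside it (the induced subgraph). This is only one valid choice of the edge subset; the function name and toy data below are assumptions made for illustration.

```python
def induced_subgraph(edges, v1):
    """Return E1: the edges of E whose endpoints both lie in V1."""
    return [(i, j) for i, j in edges if i in v1 and j in v1]

# Hypothetical four-node network and a three-node module within it.
E = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "C")]
V1 = {"A", "B", "C"}
E1 = induced_subgraph(E, V1)
```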

1.5. Degree of Network: Network theory defines many important properties which reveal useful information about the entire network. One such property is the degree of a node, which is the number of interactions that a node (molecular component) has in a biological network. The definition of the degree of a node varies depending on the type of network.

1.5.1. Undirected Network: For undirected networks, the degree of a node is the number of connections or interactions the node shares in that network. It is defined by $\deg(i) = k(i) = |N(i)|$, where N(i) is the set of neighbors of node i.

1.5.2. Directed Network: For directed networks, the degree of each node depends on the direction of the interactions. It is subdivided into in-degree and out-degree, which represent the number of incoming and outgoing edges of node i, respectively.
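Both notions of degree can be computed from an edge list, as in the sketch below. The toy regulatory network (one transcription factor and two genes) and the function names are hypothetical; the same edge list is read once as undirected and once as directed purely to contrast the two definitions.

```python
from collections import Counter

def degrees(edges):
    """Degree k(i) = |N(i)| for an undirected edge list."""
    neighbors = {}
    for i, j in edges:
        neighbors.setdefault(i, set()).add(j)
        neighbors.setdefault(j, set()).add(i)
    return {node: len(nbrs) for node, nbrs in neighbors.items()}

def in_out_degrees(edges):
    """In-degree and out-degree for a directed list of (source, target)."""
    out_deg = Counter(i for i, _ in edges)
    in_deg = Counter(j for _, j in edges)
    return in_deg, out_deg

# Hypothetical network: TF regulates G1 and G2, and G1 regulates G2.
regulatory_edges = [("TF", "G1"), ("TF", "G2"), ("G1", "G2")]
k = degrees(regulatory_edges)
in_deg, out_deg = in_out_degrees(regulatory_edges)
```

A `Counter` returns 0 for nodes that never appear as a source or target, so `in_deg["TF"]` is well defined even though nothing regulates TF in this example.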


Robustness in biological systems arises from the integration of multiple signaling domains, molecular circuits, and cross-regulatory interconnections. Scale-free networks (SFNs) have played a very important role in explaining the robustness seen in biological systems (Beisser et al., 2012; J. M. Carlson & Doyle, 2001; Kitano, 2004). An important property of SFNs is the presence of highly connected nodes called hubs, which have a large number of interacting partners (greater than the average number of neighbors in the scale-free network). The functional integrity of an SFN is robust to the loss of a random node with low connectivity, whereas SFNs are fragile when a node with high connectivity is lost. This helps explain the robust-yet-fragile nature of biological systems. Hubs are also called “problem distributors” because they tend to distribute perturbations across the network due to their high connectivity. Researchers often refer to the top 20% of the nodes in the network as ‘hubs’ (J. Hu, Song, & Chen, n.d.; Vallabhajosyula, Chakravarti, Lutfeali, Ray, & Raval, 2009). These hub genes are very important in maintaining the global network structure of biological networks because they are well connected, and modulation of these nodes leads to cascading downstream effects. Disruption of these hub genes severely affects information transfer in real-world networks. Disruption of random nodes in the network usually has a very low impact, while disruption of hub genes generally tends to have large-scale, widespread effects, as seen in many knockout studies. Since hub genes play such a vital role in maintaining the robustness and global structure of the network, they tend to be essential and well conserved relative to non-hub genes.
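The "top 20%" convention for hubs can be sketched as a simple ranking, assuming that nodes are ranked by degree and that ties are broken alphabetically; both choices, along with the node names and degree values, are assumptions made for this example rather than part of the cited definitions.

```python
import math

def top_hubs(degree, fraction=0.2):
    """Return the `fraction` of nodes with the highest degree."""
    # Sort by descending degree, then by name for a deterministic order.
    ranked = sorted(degree, key=lambda n: (-degree[n], n))
    n_hubs = max(1, math.ceil(fraction * len(ranked)))
    return ranked[:n_hubs]

# Hypothetical degree table for a ten-node network.
degree = {"A": 9, "B": 2, "C": 1, "D": 7, "E": 1,
          "F": 3, "G": 1, "H": 2, "I": 1, "J": 1}
hubs = top_hubs(degree)
```

In a scale-free network most nodes have few partners, so a small fraction of the ranking typically captures the nodes whose loss would be most disruptive.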

Hubs in scale-free networks are generally classified into inter-modular hubs and intra-modular hubs, based on their network topology and function. Inter-modular hubs have low correlation with interacting partners and tend to play an important role in network connectivity. Intra-modular hubs have a high correlation with interacting partners. Hub bridges or ‘creative elements’ are specific interactions that link hubs in scale-free networks (Csermely, 2008). These overlapping elements are nodes that are present in two or more network modules and play a substantial role in each module. They are often found in regions of overlap between functional modules, and their neighborhoods occupy a comparably large space in multiple modules. Hub bridges usually undergo transformation after repeated stress by rewiring the neighborhood topology. Therefore, they redistribute the stress across other hubs or bridges. Overlaps and bridges are also prominent sites of modulation during cellular adaptation and are prevalent in human signaling processes. These dynamic elements continuously change the structure of their linkages. Such rewiring might preferentially connect two modules with strong links to problem-solving specialized elements. Since these bridges undergo active remodeling during the course of the disease, they have been shown to be useful drug targets (Rozenblatt-Rosen et al., 2012; Shou et al., 2011).

I - 6 Dynamics of Biological Networks

Recent advances in genomic technologies have paved the way for more detailed reconstructions of the cell’s interaction networks, such as the signaling, regulatory, and metabolic networks. Recent research has focused on understanding the role of a complex network of interacting genes and proteins in cell physiology (Poirel, Rodrigues, Chen, Tyson, & Murali, 2013a). These advances have helped researchers construct detailed, predictive, and mechanistic models of fundamental physiological processes such as the yeast cell cycle (Papin, Hunter, Palsson, & Subramaniam, 2005; Poirel, Rodrigues, Chen, Tyson, & Murali, 2013b). Pathways in cells are defined as a series of actions among molecules that bring about a common function, product, or change in state (National Human Genome Research Institute). Although these pathways tend to be defined based on certain changes they bring about, more often than not, biological pathways are highly cross-linked with varying degrees of feedback. Pathways also exhibit non-linear behavior from various feedback mechanisms that maintain homeostasis during external perturbations. These evolving interactions, along with feedback mechanisms, give rise to a dynamic complex system that exhibits non-trivial emergent and self-organizing behavior. These behaviors result in permanent or transient functional cellular phenotypes, such as organization of the cytoskeleton or calcium waves in cells.


I - 6.1 Differential Expression Analysis

FIGURE 2: OVERALL AIM OF MOST DIFFERENTIAL EXPRESSION ANALYSIS IS EXTRACTING USEFUL GENES FROM THE THOUSANDS ANALYZED BY HIGH-THROUGHPUT ANALYSIS

A cell’s gene expression profile reflects the state of the signaling, metabolic, RNA, and regulatory networks within the system. The complex interactions and feedback systems within these networks drive the evolution of the overall cellular system towards an equilibrium state (Whitacre, 2012). The characteristic phenotypes exhibited by distinct cell types within an organism depend on different equilibrium states of the system, which in turn depend on the various regulatory interactions, even though the cells have identical genomic data. Therefore, it can be said that the functional phenotypic state of a given cell corresponds to trajectories in the gene expression state space.

Even with prior knowledge, it is usually difficult to determine how changes in information flow along the gene regulatory and signaling networks play a role in the disease physiology. Over the last two decades, numerous high-throughput methodologies have been developed to identify differentially expressed disease genes. Studying these differentially expressed genes provides insight into how the cells may have changed or adapted to the changing internal and external environments. Experimental expression analyses such as microarray and RNA-Seq can play a role in elucidating some of these underlying molecular mechanisms (Huber, 2003; Slonim & Yanai, 2009). Some of the biggest challenges in modeling these experimental analyses include the large number of genes in a high-throughput study, low sample numbers due to the prohibitive cost of high-throughput analysis, and heterogeneity among samples (Dalman, Deeter, Nimishakavi, & Duan, 2012). Most high-throughput methods such as microarray and RNA-Seq measure the levels and states of thousands of genes from a given sample, but due to the high costs associated with these studies, the number of samples traditionally used in these analyses is quite low, as shown in Figure 2. Some of these experiments measure tens of thousands of genes, but the number of samples is usually less than 10. Even the larger and well-funded studies contain only a few dozen samples for each condition. This gives rise to what is described as the “thin and tall” problem, wherein a large number of genes are measured at any given time but too few samples are present in the analysis to determine whether the changes in the genes are significant.

Some of the drawbacks with respect to the network data lie mainly in the complex interaction types between the molecular components and the lack of detailed kinetic and chemical parameters defining the interactions between different cellular components (Chan & Loscalzo, 2012a; Han, 2008; Kim, Wuchty, & Przytycka, 2010; Kim et al., 2011). Without complete mechanistic knowledge of how the changing levels affect the interactions between the various molecular components, it is very difficult to extract meaningful information from the list of differentially expressed genes. Due to some of these limitations, most analysis methodologies only trace selected pathways in the network rather than dynamically modeling the network as a whole. The most common method of network module reconstruction is list-based network seeding, where genes whose expression levels have varied significantly serve as the basis on which the modules/sub-networks are reconstructed.


I - 7 Pathway Analysis

I - 7.1 Classification of Pathway Analysis

Pathway analysis has been used as an umbrella term in the current scientific literature. This term encompasses many different types of analysis used to study how changes in molecular entities affect the phenotype via various interactions. Over the past decade there has been an explosion in the number of methodologies employing pathway analysis in the biomedical literature. Pathway analysis has become especially prevalent in high-throughput studies for examining differential sets of genes or proteins. According to Khatri et al., the functional pathway analysis performed over the last decade can be broadly classified as follows.

I - 7.1.1 Over-representation analysis

With the explosion of genomic data from genomic analyses such as microarrays, functional analysis of genes provided the best insight to researchers. These methodologies provided statistical criteria to evaluate whether pathways have been modulated during the course of the disease. These included providing researchers with a framework to identify pathways where genes have been over- and under-expressed during the course of the disease. Subsequent statistical methods leveraged false discovery rates and thresholds to identify the pathways that might have changed. The pathway data contained lists of genes and were usually curated from existing literature. The pathways were then tested for over-representation or under-representation in the differential expression of genes using the binomial distribution or chi-square distribution as a model. The majority of the tools which used over-representation differed on the basis of the statistics or the input pathway and gene databases. One of the biggest drawbacks of this methodology was that it was very gene-centric. Genes were considered as independent components and not part of an interacting system, and not all probe expression values were considered in the overall analysis. Similarly, pathways were also treated in an independent fashion and usually lacked cross-linking to other pathways, which is usually how they occur in a network setting. Another major drawback is the statistics used in the analysis. Only the most significant genes based on their statistics were considered in the pathway analysis, while genes that marginally missed the set criteria, such as a fold change of at least 2 and a p-value < 0.05, were eliminated. Some of the commonly used tools which employ over-representation analysis in their algorithm include:

o GoMiner

o GenMAPP

o Onto-Express

o GOToolBox
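To make the over-representation test concrete, the following is a minimal sketch of the binomial model mentioned above; it is not the implementation used by any of the listed tools. The symbols are assumptions introduced for the example: N genes measured, K of them annotated to the pathway, n genes called differentially expressed, and k of those falling in the pathway. The p-value is the probability of seeing k or more pathway genes if each differentially expressed gene landed in the pathway independently with probability K/N.

```python
from math import comb

def binomial_enrichment_p(k, n, K, N):
    """Upper-tail probability P(X >= k) for X ~ Binomial(n, K/N)."""
    p = K / N  # chance that a randomly chosen gene belongs to the pathway
    return sum(
        comb(n, x) * p**x * (1 - p) ** (n - x)
        for x in range(k, n + 1)
    )

# Hypothetical experiment: 1000 genes measured, 50 in the pathway,
# 20 differentially expressed, 5 of which belong to the pathway.
p_value = binomial_enrichment_p(k=5, n=20, K=50, N=1000)
```

Because only one gene in twenty would be expected to fall in this pathway by chance, observing five of twenty yields a small upper-tail probability. The independence assumption built into the model is exactly the gene-centric limitation noted above.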

I - 7.1.2 Functional Class Scoring

The next iteration of methodologies addressed some of the major concerns that stem from over-representation analysis, such as using only statistically significant genes which pass certain criteria and the lack of interaction data among the genes. Pathways which exhibited weaker but coordinated changes in a set of genes were now ranked much higher because of the changes employed in these methodologies. These methods employ statistical approaches such as ANOVA, t-test, and z-score to measure and rank gene-level changes from the genomic analysis, but the choice of gene-level statistic has minimal effect on the overall analysis. In the next step, individual gene-level statistics are aggregated into a pathway-level statistic, which usually employs statistics such as the Kolmogorov-Smirnov statistic. This is followed by assessing the statistical significance of the pathway-level statistics. One of the biggest drawbacks of these methods is that, like over-representation analysis, the pathways are considered independent entities with no interactions or overlap among each other. Some of the commonly used software that employ functional class scoring include:

o GSEA

o sigPathway

o SAM-GS
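The aggregation step of functional class scoring can be sketched with a plain two-sample Kolmogorov-Smirnov statistic that compares gene-level scores inside a pathway with those of the remaining genes. This is a simplified illustration and not the weighted running-sum statistic used by GSEA; the scores below are hypothetical gene-level values, and significance would still need to be assessed separately (for example by permutation).

```python
def ks_statistic(sample_a, sample_b):
    """Maximum distance between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # The CDF difference can only change at observed values.
    for x in sorted(set(a) | set(b)):
        cdf_a = sum(v <= x for v in a) / len(a)
        cdf_b = sum(v <= x for v in b) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Hypothetical gene-level scores: pathway genes versus background genes.
pathway_scores = [2.1, 1.8, 2.5, 1.9]
background_scores = [0.1, -0.3, 0.4, 2.0, -1.2, 0.2]
d_pathway = ks_statistic(pathway_scores, background_scores)
```

Unlike over-representation analysis, no threshold is applied to individual genes here: a pathway whose genes are all modestly shifted still produces a large statistic.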


I - 7.1.3 Pathway Topology

Recent advances in genomic and proteomic high-throughput technologies have helped produce a large number of publicly available pathway knowledgebases. These databases not only provide information about interacting molecular entities, but also provide greater detail such as the type of interactions, cellular localization, and cell-type information. Some of the publicly available knowledgebases include:

o KEGG

o Reactome

o Biocarta

o NCI-Nature

o Panther Database

These pathway databases can be completely reorganized, and based on the new topology, the pathway-level statistics can be recalculated. The statistical methods used in these methodologies are similar to the ones used in the previous analyses. One of the biggest shortcomings is that the method is very much in its infancy, with many of the knowledgebases containing fragmented information about the interactions. Also, some interactions that are very cell-type dependent or disease dependent are not accurately captured in these databases.

I - 7.1.4 Pathway Analysis for Complex Diseases

Complex diseases such as obesity, cancer, and Alzheimer’s disease have been extensively studied using classical biomedicine-based methodologies such as molecular biology, genetics, and other experimental biology. With the advent of high-throughput molecular measures, pathway analysis has provided the statistical framework required to reduce the dimensionality of datasets and provide a focused group of targets, which can then be biologically validated. As discussed in the previous sections, pathway analysis methodologies have undergone an evolution, and pathway topology based modeling has been found to be best suited to study complex diseases because of its ability to integrate pathway knowledgebases. The methodology discussed in this thesis also falls under the pathway topology category.

Pathway topology (PT) methodologies integrate knowledgebases such as KEGG, MetaCyc, Reactome, and Pathway Commons as part of their analysis. Unlike over-representation analysis (ORA) and functional class scoring (FCS), PT goes beyond just pathways by employing detailed interaction information such as the type, directionality, and connectivity between molecular entities in the overall analysis. This is very important because the knowledgebases are constantly being updated based on newly available information. For PT based models, the mathematical model describes the process involved in scoring the pathways from the graph and high-throughput (HT) experimental measures. This score is usually a form of statistical significance, which results in a ranked list of genes/paths/pathways. From a statistical perspective, PT based modeling differentiates itself from ORA and FCS by computing statistics at the individual gene level. To calculate the score at the pathway level, PT based methods extend the scores from gene-level statistics to the pathway level using the relationships defined in the knowledgebase.

NODE SCORING METHODS

To understand and reverse engineer the disease physiology, most PT methodologies employ a bottom-up approach. They begin by modeling the node changes and study how these affect paths, modules and functional pathways of the organism. PT methodologies differ from ORA and FCS in the way the node-level statistics are modeled. The most commonly used models in statistical analysis of pathway topology are frequentist and probabilistic models. Of these, Bayesian statistics is the older: it was developed in the 18th century, and by the 20th century it was rapidly being replaced by frequentist theory because the latter was easier to compute. Based on the statistics used, interactions in these networks can be classified into three classes: correlation, mutual information, and Bayesian. Many of the pathway analysis tools use multi-level hierarchical scoring (Mitrea et al., 2013). On the first level, scoring is done at the level of a node or a pair of nodes (edge/interaction). The scores in the second level are an aggregate of scores for a functional pathway, and the last level measures the statistical significance of the entire pathway. We discuss some of the commonly used node-level scoring models below. Most of these models incorporate pathway topology information in the node scores. In the following sections, we discuss a few alternative scoring measures, similar to our approach, that are used in the topological analysis of networks to study complex diseases.

Graph Measure

Network theory has provided vital measures which can be employed in PT analysis. These measures help in understanding the importance of a node with respect to other nodes in the network. Centrality is one such measure, which describes the relative importance of a given node in the context of the overall network. There are several variants of the centrality measure that can be used to study genes and their interactions, including degree centrality, closeness centrality, betweenness centrality and eigenvector centrality. Degree centrality, which is commonly used in information theory based models, scores nodes based on the number of directed edges entering and leaving a node. Closeness, on the other hand, is based on the sum of the shortest distances between a node and every other node. Many tools, including Pathway-Guide and Pathway Express, use a perturbation factor, which uses gene expression values to unravel how a perturbation travels downstream (Voichita, Donato, & Draghici, 2012). Most of these tools vary in the way experimental high-throughput data is integrated and in how the impact factor of the pathway is calculated (Aittokallio & Schwikowski, 2006).
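The two simplest centrality measures described above can be illustrated with a minimal sketch. The four-node graph and the function names below are hypothetical and are not part of any of the cited tools; the second function returns the raw sum of shortest distances, whose reciprocal is commonly reported as closeness centrality.

```python
from collections import deque

# Hypothetical directed graph: each key lists the nodes it points to.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["D"], "D": []}


def degree_centrality(graph):
    """Count the edges entering plus leaving each node of a directed graph."""
    degree = {node: len(targets) for node, targets in graph.items()}
    for targets in graph.values():
        for target in targets:
            degree[target] = degree.get(target, 0) + 1
    return degree


def shortest_distance_sum(graph, source):
    """Sum of shortest-path lengths from source to every reachable node (BFS)."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return sum(distance.values())


degrees = degree_centrality(graph)
farness_from_a = shortest_distance_sum(graph, "A")
```

In this toy graph node C has the highest degree, because it has two incoming edges and one outgoing edge.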

Similarity measure

Similarity measures estimate the co-expression, behavioral similarity and co-regulation of two or more molecular components in high-throughput measurements. Commonly studied similarity measures include correlation coefficients and covariance. The POC metric, which was developed in our pathway analysis, differs from similarity measures in that it measures the consistency of change between two conditions, even when the interactions are non-linear.

Correlation


Correlation is a classical statistical approach that measures the dependence between two random variables X and Y. The correlation coefficient r, commonly referred to as the Pearson correlation coefficient, quantifies this dependence as shown in the equation below.

$$r = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{\sum X^{2} - n\bar{X}^{2}}\,\sqrt{\sum Y^{2} - n\bar{Y}^{2}}}$$

When gene changes between conditions are linear, either up or down regulated, the correlation coefficients are high. Correlation based methods are generally employed in unsupervised studies to identify new biological relationships, as in the case of genetic networks inferred from high-throughput analysis (Månsson et al., 2004). One of the well-known methodologies that employs correlation as part of its analysis is ScorePage (Data, 2004). The main reason we did not choose a correlation measure is that it is very susceptible to outlier values in high-throughput measures, which drastically decreases the accuracy, especially for non-linear interactions. High-throughput molecular measurements are inherently very noisy, resulting in many outlier values which cannot easily be eliminated.

Mutual Information:

Mutual information is a measure of dependency between variables and has its roots in information theory. It is based on the concept of entropy as defined by Shannon (Shannon, 1948).

Mutual information I of two variables measures the amount of information that one variable contains about another and is defined by the equation below:

$$I(X, Y) = H(X) - H(X \mid Y)$$

In biological networks, a large mutual information between two molecular components can be interpreted as a strong interaction. One of the most popular methods to use this approach is ARACNE, which uses mutual information to prune the network of indirect interactions based on the mutual information of the edges (Margolin et al., 2006). Like correlation, mutual information does not perform well when there is a non-linear relationship between the two variables.
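The definition I(X, Y) = H(X) − H(X|Y) can be illustrated on discretized expression states. The sketch below is not ARACNE; it assumes the data have already been binned into labels such as "up" and "down" (an assumption made for illustration) and uses the identity H(X|Y) = H(X, Y) − H(Y).

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete observations."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """I(X, Y) = H(X) - H(X|Y), with H(X|Y) = H(X, Y) - H(Y)."""
    joint = list(zip(x, y))
    conditional = entropy(joint) - entropy(y)
    return entropy(x) - conditional


# Hypothetical discretized expression states across four samples.
gene_x = ["up", "up", "down", "down"]
gene_same = ["up", "up", "down", "down"]       # identical to gene_x
gene_unrelated = ["up", "down", "up", "down"]  # independent of gene_x

mi_same = mutual_information(gene_x, gene_same)            # one full bit
mi_unrelated = mutual_information(gene_x, gene_unrelated)  # zero bits
```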

Probabilistic model

The Bayesian inference model is very useful in incorporating prior knowledge into network models.

Bayes’ rule for two variables X and Y, where X is the value of the parameters and Y is the high-throughput data, is defined as:

$$p(X \mid Y) = \frac{p(Y \mid X)\, p(X)}{p(Y)}$$

where $p(X)$ and $p(X \mid Y)$ are the prior and posterior probabilities, respectively, and $p(Y)$ is the marginal probability of the data. One of the biggest drawbacks of this methodology, which it shares with POC, is the computationally expensive calculation required for large networks. Friedman et al. used a Bayesian inference model to represent the genetic networks of S. cerevisiae (Friedman, 2004). POC, on the other hand, uses a Bradley-Terry model, which uses a logistic regression model to estimate the probabilities of nodes and edges between two disease conditions.

The biggest drawback of the Naïve Bayes model is that it assumes all the features are conditionally independent. If some of the features depend on each other, as is often the case in a large feature space, the prediction might be poor. POC, which is based on logistic regression, splits the feature space linearly and works with reasonable accuracy even if some of the variables are correlated. One aspect where Naïve Bayes works better than POC is with less training data, since its estimates are based on the joint density function, whereas the estimates of a logistic regression model may overfit the data (Ballester & Mitchell, 2010).

Pathway Scoring Methods

One of the crucial steps in bridging the gap between genotype and phenotype is studying changes in molecular components that affect pathways and modules. These associations provide valuable insights into the disease phenotype and help us identify biomarkers from the deregulated paths. Traditional disease classification, which is based on correlation between genes and phenotypes, has not been useful in studying complex diseases (Z.-P. Liu, Wang, Zhang, & Chen, 2012). Chuang et al. employed a protein network based approach to identify biomarkers not as genes but as entire sub-networks (Chuang, Lee, Liu, Lee, & Ideker, 2007). Similar to our study, they demonstrated that sub-network markers were more reliable and classified breast cancer with higher accuracy than traditional individual marker genes. Another study by Taylor et al. (Taylor et al., 2009), using protein-interaction networks, showed the importance of hubs in complex diseases. They identified intra- and inter-modular hubs that co-expressed with the hub proteins and showed that the altered modularity of these hubs could be used for prognosis of breast cancer. Ray et al. (Ray, Ruan, & Zhang, 2008) used co-expression networks to analyze cis-regulatory networks across complex diseases like Alzheimer’s disease and cardiovascular disease. Their analysis identified 6 co-expressed gene modules, many of which contained hub genes.

Two other methods that employed Bayesian probabilistic models to study changes in the network were BPA (Isci, Ozturk, Jones, & Otu, 2011) and BAPA-IGGFD (Y. Zhao et al., 2012). In these methods, for a given set of phenotypes, a discretized fold change profile is calculated for each gene. This data is used as input to the Bayesian network model, and the output of the method is a list of pathways scored by SKL divergence.

I - 7.2 Limitations of Pathway Analysis

Pathway analysis was developed with the aim of understanding which molecular pathways are affected during the course of a disease. The topological information stored in most of these networks describes genes, proteins and RNAs and their interacting partners. All these molecular entities interacting together give rise to a complex interactome network. Most pathway analysis methodologies discussed above are geared towards studying how changes in gene/protein expression levels might modulate the pathway as a whole. The modulation is usually identified from the overall gene expression level and not from that of individual probes or alternative splice variants. These pathway analysis methodologies also do not take into consideration the mutation status of the genes and how it might affect the different interactions, as is usually the case in many cancers.

The biological significance of pathways is dependent on the static network maps present in many pathway knowledge bases. The interactions in many of these publicly available knowledge bases are derived from high-throughput screening methodologies. Many genes in these knowledge bases are predicted, hypothetical or pseudo-genes whose interactions may also be derived from various high-throughput screening methodologies. With the data explosion, manual curation of all the genes and their interactions is becoming increasingly difficult. Several manually curated knowledge bases exist, such as the NCI-Nature cancer signaling pathways. Manually curating all the interactions using currently available technologies and methodologies could take over a decade.

We know from various biological experiments that the interactions between molecular entities such as genes and proteins are very much dependent on the cell state and cell type. The strength or importance of these interactions therefore varies, and the dynamics of the interactions depend on the system under study. Most studies use a static network map which does not account for the cell type and cell state, so all the interactions in the network map are treated equally. This makes it difficult to detect cell-type or cell-state specific disease pathways when other genes are strongly modulated.

Metabolic pathways detail the mechanisms and the dynamics of the different biochemical events occurring within cells. The interactions in these biochemical maps are usually represented as ordinary differential equations. The parameters and values of the various molecular components of these networks are usually derived experimentally from binding studies or through metabolomics studies. Simulating these metabolic networks over time shows how variation in enzyme levels affects the metabolic products and how these variations affect the cell’s overall metabolic phenotype. One of the best examples of such a network is the human metabolome project, which contains the entire human metabolic network represented as ordinary differential equations. Although detailed information is available for most metabolic biochemical events, the same is not true for signaling and regulatory networks, where detailed dynamics are rarely available. The dynamics of these interactions describe how changes in protein levels affect the kinetics of the overall interactions. The kinetics might also provide greater insight into how mutations and alternative transcripts of these proteins affect interactions with their neighbors, and hence the overall network, during the course of the disease.


Chapter II: Aim of the Thesis

The main aim of the thesis is to provide a framework for coherent network-based analysis of complex diseases that includes all interacting genes in the network irrespective of their fold-change levels. We propose a novel method, using a probabilistic approach, to produce a “feature gene” set that is more representative of network properties.

Our method analyzes expression change in sub-networks of genes, which we postulate to be representative of the emergent disease phenotype. In order to measure sub-network changes, we assign a probability of change to each gene, to each pair of “connected” genes, to a “neighborhood”, and to regulatory paths connecting neighborhoods of genes. Genes are “connected” if there is prior evidence for their interaction from existing databases (for example Reactome, NCI-Nature signaling database). For “neighborhoods”, we draw on concepts from scale-free biological networks (SFN) (Albert, 2005; Barabási, 2009): we use the concept of ‘hub’ proteins, which are essential and well-conserved proteins (Cukuroglu, Gursoy, & Keskin, 2010; Liang et al., 2012).

Hubs have also been shown to play an important role as “problem distributors” and are thought to provide robustness to cells by distributing perturbations across the network (del Sol et al., 2010; Kitano, 2004). Moreover, disease-related genes have a strong tendency to lie in the vicinity of these hubs (Csermely, 2008). ‘Party hubs’ are known to interact simultaneously with a large number of partners with similar biological function (Agarwal, Deane, Porter, & Jones, 2010; Mirzarezaee, Araabi, & Sadeghi, 2010). These sites help in identifying prominent components of disease dysregulation and disease genes that might be responsible for overall fragility (Barabási, Gulbahce, & Loscalzo, 2011b; del Sol et al., 2010). Our final list of disease-related genes is identified by considering connected and directed paths in which genes interact with highly active hubs and whose activity has a high probability of having been perturbed during the course of the disease.

In order to evaluate our approach, we investigate the significantly changing genes in breast cancer subtypes using our platform. Breast cancer is a well-studied complex disease with a wealth of publicly available data. It is one of the first complex diseases in which genome analysis played a large role in the diagnosis, classification and treatment of the disease (Cancer & Atlas, 2012a). The sub-network based biomarker discovery approach proposed here demonstrates that biomarkers that correspond well to genes previously identified as biologically relevant disease-related genes can be identified in a robust and automated manner. Moreover, using standard machine learning classification approaches, our biomarkers are able to classify the disease subtypes with promising accuracy.


FIGURE 3: COMPARING OUR APPROACH WITH THE FREQUENTIST APPROACH

Figure 3 Details: In the standard approach (top portion of figure) genes are analyzed for fold-change and p-value (statistical analysis) in order to identify differentially expressed genes. The bottom portion of the figure illustrates our POC based approach.


Chapter III: Methods

The platform being discussed in this section was designed for the analysis of genomic expression data for biomarker discovery and functional pathway enrichment using a probability-based scoring approach.

The platform is designed to accept disease-specific input from a variety of platforms, including:

● Agilent Expression Microarray

● Affymetrix Expression Microarray

● Level III RNA-Seq Datasets

The output from the platform was a set of feature genes that were involved in regulation, were connected by being in the same directed path, and showed the most significant relative changes within a sub-network. The output from our platform was used for subtype classification with a linear SVM from the standard Python library “sklearn”. For comparison with the ‘frequentist’ approach, fold changes and p-values were also obtained: from the GEO2R script, based on PAM50 classifications, for individual probes in the microarray dataset, and from the DESeq package for the RNA-Seq dataset. For each subtype, the sample set was randomly split into a training dataset consisting of 80% of the samples in the original dataset, with the remainder used as a test dataset. Datasets with fewer than 10 samples were used only as training datasets.
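The per-subtype split described above can be sketched as follows. This is a minimal illustration rather than the platform's code: the function name, the fixed random seed, the sample identifiers and the subtype labels are all assumptions made for the example.

```python
import random


def split_by_subtype(samples_by_subtype, train_fraction=0.8, min_samples=10, seed=0):
    """Randomly split each subtype's samples into training and test lists.

    Subtypes with fewer than min_samples samples go entirely to training.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    train, test = {}, {}
    for subtype, samples in samples_by_subtype.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)
        if len(shuffled) < min_samples:
            train[subtype], test[subtype] = shuffled, []
            continue
        cut = int(round(train_fraction * len(shuffled)))
        train[subtype], test[subtype] = shuffled[:cut], shuffled[cut:]
    return train, test


# Hypothetical sample identifiers for two subtypes.
samples = {
    "LumA": ["lumA_%02d" % i for i in range(20)],
    "Normal": ["normal_%02d" % i for i in range(6)],
}
train, test = split_by_subtype(samples)
```

With 20 samples the split is 16 for training and 4 for testing, while the 6-sample subtype is used for training only.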

The platform was developed using Python 2.7, R statistical packages, and C. Tk was used in certain instances for developing the graphical user interfaces. The development of the platform, unit testing, and integration of software modules utilized a completely independent dataset for a different complex disease (sepsis, SIRS and septic shock: GSE13904) in order to minimize the risk of customizing the platform for the cancer subtyping dataset. For validation purposes, well-characterized breast cancer subtyping datasets were used. The workflow for identifying disease-related genes and biomarkers in order to differentiate subtypes involves a series of steps (Figure 3 and Figure 4). Our approach also requires integrating different sources of data, and the processes that transform the data, in order to identify disease-related genes and biomarkers (Figure 3 and Figure 4). There is also a discrete workflow for path analysis, a process for identifying disease-related genes that change coherently in the network (Figure 3).

FIGURE 4: DETAILED FLOW CHART OF STEPS WITHIN THE PATH ANALYSIS ALGORITHM

Figure 4 Details: POC scores for genes and their interactions within hub-hub sub-networks and randomly generated network topology were used to identify probable disease-related genes that change between two subtype conditions.


III – 1 Genomic and Network Data

III – 1.1 Genomic Data

With the advent of high-throughput analysis, there has been an explosion in the amount of publicly available genomic data on the internet. One of the biggest challenges in the earliest days of genomic analysis was the lack of standards for representing expression analysis results. This made it increasingly difficult for programmers to design software packages that work with genomic data. There were also multiple manufacturers of high-throughput gene chips, such as Affymetrix and Agilent. While the different manufacturers all designed these chips to capture expression levels, their process flows, data and analysis methodologies were very different. Many genomic cores adopted their own standards for publishing, which made it very difficult to write analysis software that could work on the results. The high cost associated with high-throughput screening made it difficult for most researchers to repeat these experiments to obtain results in the format they required. Most of this research was also publicly funded, and the data it produced needed to be publicly accessible. These factors, along with many others, formed the basis for adopting a common format, as seen in the dataset formats of Gene Expression Omnibus (GEO), which is supported and hosted by the NCBI, and ArrayExpress, which is hosted and supported by EBI. The GSE Analyzer used in our framework was designed to accommodate the multiple gene-expression file formats that are generally publicly available. Commonly used public formats include those of GEO and ArrayExpress and the Level III RNA-Seq format available from the TCGA database.

III – 1.2 Network Data

In order to create the signaling and regulatory maps, the interactions between genes in the genomic data were identified and downloaded from Pathway Commons (www.pathwaycommons.org) in Simple Interaction Format (SIF). SIF represents gene/protein interactions as simple pairwise interactions, which can be classified as directed or undirected. Interactions involving molecular complexes, multiple products, and substrates are expanded to pairwise gene interactions by using the appropriate combinatorics. Interaction information from various databases including STRING, Reactome, and the National Cancer Institute Pathway Interaction Database (NCI-PID) is used to construct the network. A directed interaction network is constructed based on the presence of string classifiers such as ‘Component_of’, ‘Metabolic_catalysis’, ‘Sequential_catalysis’, and ‘State_change’, which indicate directed interactions. An undirected interaction network is constructed based on the presence of string classifiers such as ‘Co_control’, ‘Interacts_with’, ‘In_same_component’, and ‘reacts_with’, which indicate undirected interactions.

Directed and undirected networks are created for all the genes available in the genomic data. The networks consist of “node” genes represented by their gene ids. The gene names and Entrez ids are stored as node attributes in the networks. Connection types and sources of the interaction are stored as edge attributes. The network is further pruned to remove isolated nodes (nodes with zero degree) and all self-loops. Multiple edges between a pair of nodes are replaced by a single edge.
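The construction and pruning rules above can be sketched in a few lines. The sketch assumes tab-separated SIF lines of the form source, interaction type, target; the gene identifiers are placeholders, and the function name is hypothetical rather than taken from the platform. Storing edges in sets collapses multiple edges into one, and a node whose only edge is a self-loop never enters either network, which has the same effect as removing isolated nodes.

```python
# Interaction-type labels as listed in the text.
DIRECTED_TYPES = {"Component_of", "Metabolic_catalysis",
                  "Sequential_catalysis", "State_change"}
UNDIRECTED_TYPES = {"Co_control", "Interacts_with",
                    "In_same_component", "reacts_with"}


def build_networks(sif_lines):
    """Return (directed_edges, undirected_edges) parsed from SIF lines."""
    directed, undirected = set(), set()
    for line in sif_lines:
        fields = line.strip().split("\t")
        if len(fields) != 3:
            continue  # skip malformed lines
        source, kind, target = fields
        if source == target:
            continue  # drop self-loops
        if kind in DIRECTED_TYPES:
            directed.add((source, target))
        elif kind in UNDIRECTED_TYPES:
            undirected.add(frozenset((source, target)))
    return directed, undirected


# Placeholder gene identifiers; not real interaction data.
lines = [
    "GENE_A\tState_change\tGENE_B",
    "GENE_A\tState_change\tGENE_B",    # duplicate edge, collapsed
    "GENE_B\tInteracts_with\tGENE_A",
    "GENE_A\tInteracts_with\tGENE_B",  # same undirected edge, collapsed
    "GENE_C\tState_change\tGENE_C",    # self-loop, dropped
]
directed, undirected = build_networks(lines)
```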

III - 2 Probability of Change (Nodes and Edges)

FIGURE 5: PROBABILITY OF CHANGE CALCULATION FOR NODES


FIGURE 6: CALCULATION FOR THE PROBABILITY OF CHANGE FOR THE EDGE

Each “node” in the network represents a gene. The probability of change (POC) for a node denotes the probability that the gene expression has changed between two conditions (disease vs. control). The model for calculating POC is analogous to the game-theoretic Bradley-Terry model (Huang, Weng, & Lin, 2006). The idea is to estimate the parameters of a log-likelihood model using available data, and subsequently use the model to calculate POC. The conceptual model for estimating log-likelihood can be stated mathematically as:

$$l(Q) = \sum_{i=1}^{m} \sum_{j=1}^{m} \left[ w_{ij} \ln q_i - w_{ij} \ln (q_i + q_j) \right]$$

where $Q = (q_1, \ldots, q_m)$, $w_{ij}$ is the number of times “i” beats “j” in head-to-head games, m is the number of items being compared, and we require $\sum_{i=1}^{m} q_i = 1$. In our implementation, the interpretation of $w_{ij}$ for expression values is the number of times the expression value in a disease condition “beats” that in the control condition, as shown in Figure 5. The conceptual description provides the clarity of the POC concept; however, since the original Bradley-Terry model is for paired data, an extension is needed in order to build $w_{ij}$ in our case.

For pairwise comparison of breast cancer subtypes, the POC score for each gene or microarray probe was calculated using the Bradley-Terry model. The Bradley–Terry–Luce (BTL) model for paired comparison assumes that for any element $a \in A$ there exists a real number $\pi(a)$ such that, for all $a, b \in A$, the probability that the gene expression in subtype ‘a’ is higher than that in subtype ‘b’ is given by

$$p(a, b) = \frac{\pi(a)}{\pi(a) + \pi(b)}$$

where p(a, b) is the probability that ‘a’ is chosen when (a, b) is presented.

Using the BTL model, the abilities of the individual expression values from the subtypes in the pairwise comparisons were estimated, and the probability of perturbation of a particular gene was computed using

$$POC = \frac{p(S_1)}{p(S_1) + p(S_2)}$$

A probability of perturbation score of zero represents a very low likelihood of alteration in the gene’s expression levels, whereas a score close to one represents a high likelihood of alteration in gene expression levels between the two subtypes. In microarray genomic data, where certain genes are represented by more than one probe, the probe with the highest POC value for the given pair of subtypes is used.
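To make the “beats” counting concrete, the sketch below treats the two conditions as two Bradley-Terry items and builds the win counts from all cross-condition pairs of expression values. This all-pairs construction is an assumption made for illustration; the text states only that an extension of the paired model is needed, and the final mapping from this win probability to the reported POC score is not reproduced here. For two items, maximizing the log-likelihood under the sum-to-one constraint gives the closed form w_ab / (w_ab + w_ba). The expression values are made up.

```python
def win_counts(group_a, group_b):
    """Count how often a value in group_a exceeds one in group_b, and vice versa."""
    w_ab = sum(1 for a in group_a for b in group_b if a > b)
    w_ba = sum(1 for a in group_a for b in group_b if b > a)
    return w_ab, w_ba


def bradley_terry_win_probability(group_a, group_b):
    """Two-item Bradley-Terry estimate of p(a, b) = pi(a) / (pi(a) + pi(b))."""
    w_ab, w_ba = win_counts(group_a, group_b)
    if w_ab + w_ba == 0:
        return 0.5  # every comparison tied: no evidence either way
    return w_ab / (w_ab + w_ba)


# Hypothetical expression values of one gene in two conditions.
disease = [5.0, 6.0, 7.0]
control = [1.0, 2.0, 6.5]

p_disease_higher = bradley_terry_win_probability(disease, control)  # 7 of 9 pairs
```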

In the analysis for breast cancer subtyping, the POC of a given gene was defined as the probability that the gene’s expression values changed between one cancer subtype and another. The probability of two interacting genes changing simultaneously between two subtype conditions was represented by the value attached to an edge in the network. The presence of an edge was determined through the use of the databases identified earlier. Where an edge existed, an edge data set was constructed using all pairwise evaluations of the two genes’ expression values. For each pair of subtypes, a square matrix was created whose rows and columns represented the samples for a set of interacting genes, as shown in Figure 6. The value of each cell of the matrix was the product of the gene expression values of the corresponding row and column. The values in the matrix were subjected to the same analysis as the node values, and a POC for the edge was computed. For genes represented by more than one probe, the featured microarray probe from the node analysis was used for the calculation. An edge POC score close to zero represented a weak edge with a low likelihood of the genes interacting together, whereas scores close to one represented a high likelihood of interaction.
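The edge data set described above can be sketched as an outer product of the two genes' expression values. The layout is an assumption made for illustration: rows index the samples of the first gene and columns the samples of the second, within a single subtype. The function names and values are hypothetical.

```python
def edge_product_matrix(gene_u, gene_v):
    """Square matrix whose (r, c) cell is gene_u[r] * gene_v[c] across samples."""
    if len(gene_u) != len(gene_v):
        raise ValueError("both genes need one expression value per sample")
    return [[u * v for v in gene_v] for u in gene_u]


def flatten(matrix):
    """Return the matrix cells as one list, row by row."""
    return [cell for row in matrix for cell in row]


# Hypothetical expression values of two interacting genes in two samples.
gene_u = [1.0, 2.0]
gene_v = [3.0, 4.0]

matrix = edge_product_matrix(gene_u, gene_v)
edge_values = flatten(matrix)
```

The flattened values from each subtype could then be compared in the same way as the node values to obtain an edge score.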

III - 3 Hub Interaction Score

FIGURE 7: CALCULATION OF HUB SCORES USING EMD DISTANCE CALCULATION

A direct approach to evaluating the set of most active “paths” in the network would have required the evaluation of activity between all pairs of genes for which a directed path exists. As part of this evaluation, a subsequent comparison of the activity level to a “background” would have been required. The estimate for the “background” activity involved the simulation of “random” sub-networks under random assignment of probabilities. A direct approach was therefore computationally prohibitive for networks beyond static models. Instead, a biologically inspired approach was used to quantitatively identify the genes that are most likely to propagate the perturbation through the network. These genes are called hub genes or hub nodes. Hub nodes interact with many other nodes (high degree) and are the nodes within the top 10% of the degree distribution curve. Hubs are known to play a key role in maintaining the homeostasis of the system. Investigating the dynamics of hubs and their immediate neighborhoods identifies the most promising nodes along the path that has been activated by the disease. We employed the Wasserstein metric to quantify the changes in a hub and its neighborhood. For each gene expression value, identified by the index j, there were n samples.

The “background” was denoted by the set of values indexed by i and j, by the random variable $Q_{ij}$, with the corresponding distribution $u_j$. We assumed that each sample of our gene expression data was drawn from the same parameterized family of distributions U, but that the distribution had been changed (perturbed), due to disease conditions, by a perturbation function that belonged to the family of functions F. $Q_{ij}$ denotes the sample data from the perturbed distribution. Each j represents the perturbation of the j-th sample. We measured the overall perturbation by finding an estimator T that minimizes the “energy” needed for perturbing the data $P_{ij}$ to $Q_{ij}$. The natural measure for this is the Wasserstein distance, which measures the minimum cost of “perturbing” the experimental values $P_{ij}$ to $Q_{ij}$.

The specific implementation is as follows:

● $P_i$, $Q_i$: bipartite network representatives; $W_i$ is the weight of the cluster

● P (Subtype Signatures) = $\{(P_1, W_{p_1}), \ldots, (P_m, W_{p_m})\}$

● Q (Control Signatures) = $\{(Q_1, W_{q_1}), \ldots, (Q_n, W_{q_n})\}$

● EMD can be defined as the cost of “moving supplies” from P to Q

$$EMD(P, Q) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} d_{ij} f_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}} \qquad [3]$$

where $f_{ij}$ denotes the flows and $d_{ij}$ is the distance between two states.

The hubs and their interacting partners were represented as a bipartite network, where all the interactions were treated as undirected. The values of the nodes and the edges in the bipartite network were the POC values for the given disease conditions. The signature for each hub gene within a subtype consisted of a unique edge POC score (distance) and the POC score (weight) for all the hubs interacting with the hub gene under consideration. POC values for the bipartite network representing “background” values were defined for the same hub-neighborhood topology, but with the POC values replaced by the average change of nodes and edges for the given condition. The discrete version of the Wasserstein distance (a modified Earth mover’s distance (EMD)) was used to calculate the minimum cost of transforming the distribution of a disease subtype into the “background” distribution, as shown in Figure 7. EMD was used as a distance metric based on bipartite network matching and was defined as the minimum cost of matching the bins of two distributions (Applegate, Dasu, Krishnan, & Urbanek, 2011; Demirci, Shokoufandeh, Keselman, Bretzner, & Dickinson, 2006). The final EMD score was a measure of the change in the neighborhood of a hub. Hubs with higher EMD scores had a greater likelihood of changes in their neighborhood. Although all the hubs could have been ranked and analyzed further, only the top half of the active hubs were analyzed in order to save time and computational resources, since earlier experiments had suggested that the bottom half does not contribute. The top half of the active hubs was representative because they accounted for the average span in biological networks (average span is three).
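For intuition, the earth mover's distance between two signatures can be sketched in the special case where each signature is a list of (position, weight) pairs on a line, with the edge POC playing the role of position and the node POC the role of weight. This is a simplification under stated assumptions, not the modified EMD used by the platform: the ground distance is taken to be the absolute difference of positions, and both signatures are normalized to unit total weight so that the denominator of the EMD formula equals one. The function name and the numbers are hypothetical.

```python
def emd_1d(signature_p, signature_q):
    """Earth mover's distance between two 1-D signatures of (position, weight) pairs."""
    total_p = sum(w for _, w in signature_p)
    total_q = sum(w for _, w in signature_q)
    # Signed mass at each position: +P and -Q, each normalized to unit mass.
    events = [(x, w / total_p) for x, w in signature_p]
    events += [(x, -w / total_q) for x, w in signature_q]
    events.sort()
    cost, carried, previous = 0.0, 0.0, events[0][0]
    for position, mass in events:
        # Mass not yet matched must be carried across the gap to this position.
        cost += abs(carried) * (position - previous)
        carried += mass
        previous = position
    return cost


# Hypothetical hub signature and a "background" signature.
subtype_signature = [(0.2, 1.0), (0.8, 1.0)]
background_signature = [(0.5, 2.0)]

hub_score = emd_1d(subtype_signature, background_signature)
```

Here half of the mass moves from 0.2 to 0.5 and half from 0.8 to 0.5, so the cost is 0.15 + 0.15.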

III - 4 Path Analysis

Upon identification of the key hubs that underwent changes, the subsequent analysis was performed to identify the directed paths connecting the hubs that displayed significant changes between disease subtypes. In the first step, sub-networks were created from the disease network to include all simple chains (a sequence of connected nodes) between the key hubs. The values of the nodes and edges in the sub-network were retained from the original disease subtype.


In order to identify the paths and the key genes in the sub-network that had changed during the course of the disease, we relied on probability propagation. We treated gene expression probabilities (node values) as the probability of change, p, and propagated them along edges (as messages) to downstream nodes, where they were accumulated. The intensity of the propagation is given by the edge values computed earlier, represented by the matrix M. The measure of message intensity can be computed directly through multi-step matrix multiplication, where the number of steps is at most four, since all hubs are separated by at most four intervening genes. The approach is similar to computing Markov chain transition probabilities. Next, we identified the nodes having a transition score greater than one that could be obtained by random chance. The “random chance” level was assessed by creating numerous sample networks with the same topology but with different edge and node probabilities. The edge and node probabilities were randomly sampled from the empirical probability distribution of the edges and nodes in the original disease network. All genes in the sub-network having transition scores larger than three standard deviations of the control sub-network were selected for further analysis. The 3*sigma level indicated that there is a less than 1 in 370 chance that the detected change in the entire path (not just one gene) could be by chance.
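The propagation and thresholding steps can be sketched as follows. The sketch makes two assumptions that the text does not fix: the accumulated score of a node is the sum of the messages it receives over one to four steps (the vector p multiplied by successive powers of M), and the chance level is the mean plus three standard deviations of scores from resampled networks. The function names, the three-gene chain and all numbers are hypothetical.

```python
from statistics import mean, stdev


def propagate(node_poc, edge_poc, max_steps=4):
    """Accumulate messages pushed along directed edges for up to max_steps steps.

    node_poc maps gene -> probability of change; edge_poc maps
    (source, target) -> edge value.
    """
    current = dict(node_poc)
    accumulated = {gene: 0.0 for gene in node_poc}
    for _ in range(max_steps):
        following = {gene: 0.0 for gene in node_poc}
        for (source, target), weight in edge_poc.items():
            following[target] += current[source] * weight
        for gene, value in following.items():
            accumulated[gene] += value
        current = following
    return accumulated


def exceeds_chance(score, random_scores, n_sigma=3.0):
    """True when score lies more than n_sigma standard deviations above the random mean."""
    return score > mean(random_scores) + n_sigma * stdev(random_scores)


# Hypothetical chain A -> B -> C with made-up POC values.
node_poc = {"A": 0.9, "B": 0.5, "C": 0.4}
edge_poc = {("A", "B"): 0.8, ("B", "C"): 0.5}
scores = propagate(node_poc, edge_poc)

# Made-up scores for gene C from resampled networks with the same topology.
random_scores_c = [0.10, 0.20, 0.15, 0.12, 0.18]
c_selected = exceeds_chance(scores["C"], random_scores_c)
```

In this chain, gene C accumulates 0.25 directly from B and 0.36 from A via B, giving 0.61.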

III - 5 Path and Feature Genes Selection

The final necessary step in biomarker discovery involves the reduction of the set of “wired genes” obtained through network analysis to a small set of reporter genes. Existing classification tools can be used for this step, yet classification algorithms are known to suffer from severe performance degradations when the number of input genes (data dimension) is high. These algorithms often include a dimension reduction step. However, the dimension reduction step prior to feature selection is often the weak link when the dimension is high. Our reduced set of genes obtained based on concomitant change of “wired” components in the network is expected to result in improved performance of the classification step.

The top genes selected in the previous step (hub-to-hub sub-networks) were considered to be putative reporter genes. By construction, the algorithm had selected these genes because they were deemed “best” in distinguishing the state of hub-hub sub-networks in disease (relative to random fluctuations). The top genes from each of the sub-networks were pooled to create the set of genes that effectively represents the changes in expression levels between the two conditions. In the case of breast cancer subtyping, the genes that exhibited change across all pairwise comparisons of subtypes were identified and selected.

The pooled genes were further analyzed to select feature genes capable of effectively representing the multi-class classification required for subtyping. These genes represent a very small fraction of the genes in the original data set, approximately 1%.

The sklearn “Feature ranking with recursive feature elimination and cross-validation” was used to perform the classification task (Pedregosa, Weiss, & Brucher, 2011). The algorithms implemented in sklearn represent well-studied classification approaches. The input for the feature ranking algorithm was the matrix of reporter genes and their expression values used in the calculation of the POC. The input matrix was then normalized to values between -1 and 1 using the sklearn preprocessing tool MinMaxScaler (Pedregosa et al., 2011). Pathway enrichment analysis was a standard step in current approaches.
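The scaling and feature-ranking step described above can be sketched with scikit-learn as follows. The synthetic matrix produced by `make_classification`, the five-fold cross-validation, and the `LinearSVC` settings are assumptions made for illustration; they are not the datasets or parameters used in this work.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the reporter-gene matrix (samples x genes).
X, y = make_classification(n_samples=120, n_features=30, n_informative=6,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=0)

# Scale every gene (column) to the range [-1, 1].
X_scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

# Recursive feature elimination with cross-validation around a linear SVM.
selector = RFECV(
    estimator=LinearSVC(dual=False, max_iter=5000),
    step=1,                              # remove one gene per iteration
    cv=StratifiedKFold(n_splits=5),
    scoring="accuracy",
)
selector.fit(X_scaled, y)

# Column indices of the genes retained by the selector.
selected = [i for i, keep in enumerate(selector.support_) if keep]
print("genes kept:", selector.n_features_)
```

For brevity the scaler is fitted on the full matrix before cross-validation; a stricter setup would fit it inside each fold.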

III - 6 Biomarker Selection, Validations, and Functional Analysis

To test the precision and recall of our classifier in predicting the breast cancer subtype, we created a training set and a blind set for testing (test set). We used a multi-class one-vs-rest classifier based on a linear support vector machine (SVM). The efficiency of the multi-class classification was measured using the sklearn metric classification report, which measures the precision and recall of prediction (Pedregosa et al., 2011).
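As a reminder of what the classification report measures, the following sketch computes per-class precision and recall directly from true and predicted labels. It is a plain-Python illustration rather than the sklearn implementation, and the subtype labels and predictions are invented for the example.

```python
def precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} for a multi-class prediction."""
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)  # correct calls
        fp = sum(t != label and p == label for t, p in pairs)  # false alarms
        fn = sum(t == label and p != label for t, p in pairs)  # missed samples
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        report[label] = (precision, recall)
    return report


# Invented labels for six blind-set samples.
y_true = ["LumA", "LumA", "LumB", "Basal", "Basal", "HER2"]
y_pred = ["LumA", "LumB", "LumB", "Basal", "Basal", "HER2"]
report = precision_recall(y_true, y_pred)
for label, (prec, rec) in report.items():
    print(f"{label:6s} precision={prec:.2f} recall={rec:.2f}")
```

Here one Luminal A sample is mislabeled as Luminal B, which lowers the recall of Luminal A and the precision of Luminal B while leaving the other classes untouched.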

In order to select key biomarkers (features), we iterated over the features using the feature selection algorithm. In each step of the iteration, additional genes with a low contribution to classification were removed. The iterations continued as long as the performance of the algorithm remained “constant”. A degraded performance after gene removal indicated that the minimum number of genes that could effectively differentiate the subtypes for a blind dataset had been identified. In practice, because the performance measure was noisy, the removal of additional genes was continued until three or more degraded measures of performance were observed.
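The stopping rule can be sketched as a small loop. The callbacks `score_fn` and `rank_fn` are hypothetical placeholders for the classifier score and the feature ranking, and the sketch assumes that the three degraded measures must occur consecutively, which the text does not state explicitly.

```python
def eliminate_features(features, score_fn, rank_fn, tolerance=0.0, patience=3):
    """Drop the weakest feature repeatedly until `patience` degraded scores occur.

    score_fn(features) -> float  (hypothetical, e.g. cross-validated accuracy)
    rank_fn(features)  -> the feature to remove next (hypothetical ranking)
    Returns the smallest feature set whose score matched the baseline.
    """
    current = list(features)
    baseline = score_fn(current)
    best = list(current)
    degraded = 0
    while len(current) > 1 and degraded < patience:
        current.remove(rank_fn(current))
        if score_fn(current) >= baseline - tolerance:
            best = list(current)   # performance held: accept the smaller set
            degraded = 0           # assumption: only consecutive drops count
        else:
            degraded += 1
    return best


# Toy example: only g1, g2 and g3 carry signal (illustrative values).
importance = {"g1": 5, "g2": 4, "g3": 3, "g4": 2, "g5": 1, "g6": 0}
score = lambda feats: 0.9 if {"g1", "g2", "g3"} <= set(feats) else 0.6
weakest = lambda feats: min(feats, key=importance.get)
best = eliminate_features(list(importance), score, weakest)
print(best)
```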

Molecular functional enrichment and pathway enrichment were studied for all these genes using DAVID. This standard enrichment approach was implemented using a Python web API to send various GO terms and pathways (KEGG, Reactome and BioCarta) to DAVID’s web services (Dennis et al., 2003; Jiao et al., 2012). The results reflect the various pathways and molecular functions that were selectively enriched during each disease condition; these were analyzed further to uncover the role of these pathways in disease physiology.

The feature genes from the RNA-Seq and microarray results were analyzed using the cBioPortal for Cancer Genomics online tool (Gao et al., 2013; Schroeder, Gonzalez-Perez, & Lopez-Bigas, 2013). The feature genes were cross-referenced with curated breast invasive carcinoma studies to identify common cancer genes, genomic alterations, co-expression, and frequently altered neighbor genes.


III - 7 Pathway Analysis Software Architecture

FIGURE 8: STEPS INVOLVED IN PATH ANALYSIS AND THE SOFTWARE ARCHITECTURE

The software system architected for pathway analysis is composed of seven functional modules. Decomposing the software into modules and clearly defining the interface of each module facilitates easy modification of the overall process. New modules can be inserted, or the intermediate files altered as necessary, provided that the interface requirements of the modules are satisfied.

Figure 8 details the overall architecture of the system we designed for biomarker identification and validation, and the programming modules that accomplish these tasks. These modules are explained in detail in the subsequent sections of this chapter.


III - 7.1 GSE Analyzer

FIGURE 9: GENERAL STRUCTURE OF GENOMIC EXPRESSION DATA OBTAINED FROM GEO PUBLIC DATABASE


FIGURE 10: EXTRACTING GENOMIC DATA AND OTHER RELEVANT INFORMATION FOR THE PROJECT

The GEO database, hosted by NCBI, is the most popular public database used for high-throughput analysis. The data deposited into GEO has to strictly adhere to the format defined by NCBI for both microarray and RNA-Seq expression level analysis. Expression datasets present in GEO are usually of the following types, as shown in Figure 9:

• GSM Files: These files contain the results obtained from a single chip. The data of importance to us are:

a) The name of the probe (column 1)

b) The value measured for the probe (column 2)

c) The p-value, which measures the significance of the reading (column 3, optional)

• GPL Files: These files contain annotations describing the type of microarray/RNA-Seq platform. They provide detailed information about the probes used and their corresponding genes.

• GSE Files: These files usually contain the data of multiple GSM files, detailing the overall experiment conducted.

• GDS Files: These files contain curated information that summarizes the GSE and GSM files. The expression value of the genes is normalized to each gene for each sample.

Most of the datasets used in this thesis were obtained from the GEO database. Figure 9 describes the general structure of a GSE file, which usually contains a GPL section and multiple GSM sections. These files are generally stored in a tab-separated format called SOFT. The GPL section is composed of further sections, of which the platform information section and the probe information section are of importance to us.

The series section of the GSE files usually contains detailed information about the experiments conducted including the protocol, patient information if any, and other relevant information regarding the studies.

Finally, the GSM or the sample information section contains the normalized expression level of the probe/genes.
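The sample section of a SOFT file can be read with a few lines of Python. The sketch below assumes the usual SOFT conventions (entity lines starting with `^`, attribute lines starting with `!`, and a data table enclosed by `!sample_table_begin` and `!sample_table_end`); the function `parse_soft` and the sample record are hypothetical and much simpler than the GSE Analyzer itself.

```python
def parse_soft(lines):
    """Collect per-sample metadata and (probe, value, p-value) rows from SOFT text."""
    samples = {}
    current = None          # record of the sample currently being read
    in_table = False
    header_seen = False
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("^"):
            kind, _, name = line[1:].partition("=")
            in_table = False
            current = None
            if kind.strip().upper() == "SAMPLE":
                current = samples.setdefault(name.strip(), {"meta": {}, "rows": []})
        elif current is None:
            continue        # ignore platform/series sections in this sketch
        elif line.lower().startswith("!sample_table_begin"):
            in_table, header_seen = True, False
        elif line.lower().startswith("!sample_table_end"):
            in_table = False
        elif in_table:
            if not header_seen:
                header_seen = True      # first table line is the column header
                continue
            fields = line.split("\t")
            p_value = float(fields[2]) if len(fields) > 2 and fields[2] else None
            current["rows"].append((fields[0], float(fields[1]), p_value))
        elif line.startswith("!"):
            key, _, val = line[1:].partition("=")
            current["meta"].setdefault(key.strip(), []).append(val.strip())
    return samples


# Invented sample record used only to exercise the parser.
demo = """^SAMPLE = GSM0000001
!Sample_title = tumor_1
!Sample_characteristics_ch1 = subtype: Basal
#ID_REF = probe identifier
#VALUE = normalized signal
!sample_table_begin
ID_REF\tVALUE\tDETECTION P-VALUE
1007_s_at\t7.25\t0.001
1053_at\t5.5\t0.04
!sample_table_end
""".splitlines()
parsed = parse_soft(demo)
print(parsed["GSM0000001"]["rows"])
```

Real files may contain non-numeric placeholders in the value column, which this sketch does not handle.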

The GSE Analyzer module extracts the information relevant to the study and classifies the data into the various subtypes (Figure 10). The classification can either be performed manually by the user or evaluated automatically based on the series information present in the SOFT file. The module additionally classifies the samples based on information present in the header of the individual sample files. Upon reading the GSE SOFT file, the module prepares two human-editable configuration files. Since every source of data can have its own set of attributes, data sources are not consistent. In order to accommodate these differences in input format, the GSE Analyzer module was architected with the “smartness” to automatically detect and consolidate data based on fuzzy logic. The Analyzer can also be fed user-supplied classification mapping information to further improve its cross-referencing capability.

In addition, the GSE Analyzer has been designed to be capable of parsing additional file types including the file formats used by ArrayExpress and TCGA Level III data.


Currently, the framework has only been designed to analyze one pair of conditions at a time. For studies where there are more than two conditions, the study is split into multiple pairwise run combinations, and each of these splits is run independently. In order to accommodate this, the GSE Analyzer restructures the input genomic data into two folder sections, namely:

• SOFT Folder: The SOFT folder contains all the original information extracted from the GSE SOFT file. The data is stored in human-readable format. The SOFT folder remains unchanged across run combinations and is used to compile individual projects. This folder also contains the configuration necessary to create the run combinations.

• Project Folder: The project folder contains the files necessary for completing path analysis for a given combination. It contains multiple human-editable configuration files that hold parameters such as the computation cluster to be picked for the run, the number of processors designated for the run, the debugging status, and so on. The folder also contains local copies of the generated values to support the massively distributed computational model. If the dataset contains p-values associated with the expression levels, p-value cutoffs can be set in the configuration files to enable the system to automatically determine whether to use a gene.

The GSE Analyzer module also extracts the relevant information from the GSE files, classifies the data and then prepares the files necessary to compile a given project. These files are then consumed by the GSE Project Compiler.
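Splitting a multi-condition study into independent pairwise runs, as described above, amounts to enumerating all unordered pairs of conditions. The sketch below uses the standard library; the function name and the subtype labels are illustrative only.

```python
from itertools import combinations


def pairwise_runs(conditions):
    """Return every unordered pair of conditions; each pair is one independent run."""
    return list(combinations(sorted(set(conditions)), 2))


# Four illustrative subtypes give six pairwise run combinations.
runs = pairwise_runs(["LumA", "LumB", "Basal", "HER2"])
for first, second in runs:
    print(f"project: {first}_vs_{second}")
```

A study with k conditions therefore produces k(k-1)/2 projects, which is why the per-project resource allocation discussed in the next section matters.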


III - 7.2 GSE Project Compiler

FIGURE 11: BUILDING PROJECTS BASED ON USER DEFINED CONFIGURATION FILES

The GSE Project Compiler module prepares, compiles and, if necessary, modifies the data for the pairwise condition runs (Figure 11). In cases where there are more than two conditions, project runs help manage the individual pairwise runs, which can be integrated at a later stage. In order to support distributed writes, the Project Compiler creates copies of all the data that it needs, including the network graphs and the test and training expression data for the designated run. Creating copies of the data for each run prevents the corruption that could occur when multiple processes write to the same data files.

Another important role of this module is to allocate computational resources for the computationally heavy analysis performed in the subsequent steps. This is necessary since executing the entire analysis on a single-core machine could take more than 20 days to complete, depending on the size of the dataset. The system was designed as a distributed system capable of running on hundreds of machines. For our experiments, we ran our simulations on a cluster of 5 hosts, each containing 12 to 16 processors and 32 GB to 64 GB of RAM. The output generated by the distributed system is integrated at a later stage.

The inputs to this module are:

• SOFT and Project Folder Information: User-defined input and configuration generated by the GSE Analyzer module. This contains the data that designates the directory structure, hostnames, and number of processors available on each host.

• Run Configuration Files: The path analysis algorithm is very processor intensive and hence is designed to be executed in a distributed fashion. The user decides the cluster to run on and the number of CPUs on each host to designate for the run. The user can also define the run type and conditions (e.g., time-series or simple run).

• Biological Network Data: This data identifies all the viable genes that can be used in the analysis and creates a topological network representing them.

The GSE Project Compiler produces as output the run files that are necessary for the edge and node calculations. The output contains the following data:

• Test and Training Datasets: The Project Compiler uses the sample and subtype information stored in the SOFT folder and the configurations set in the project folder to create the test and training datasets used throughout the project. The sample values are combined and then split into training and test sets in a ratio of 80:20. If fewer than 20 samples are encountered in the dataset, then all the samples are used for training. If the training data contains more than 50 samples, the Bradley-Terry computation becomes very expensive. To overcome this, the training dataset size is limited to 50 samples, with the rest used as the test dataset.


• Network Graphs: Based on the platform information present in the SOFT folder, the module identifies all the unique gene ids present in the analysis. If genes have associated p-values, then for every condition only genes that have over 20 significant sample values are selected. The Entrez gene ids extracted from the GSE SOFT folder are cross-referenced with the Pathway Commons database of edges, and a directed and an undirected network are constructed for the genes present in the analysis. Genes that do not have any edges play no role in the path analysis of the project. These network graphs contain the topological information necessary to identify such genes and exclude them from the project being considered. A new topological network is generated for every project.

• Run Files: Run files create the queue systems for all the processes that a given machine should execute. Because the analysis can run for a prolonged period of time, separate queue systems are created for the machines and for their individual processors. Thus, even if a process fails during execution due to bad values, only that process needs to be brought up again, and the other parallel runs are not affected. The first run list is created for the nodes. For genes that have multiple probes representing them on the microarray, this helps identify which probe is required during the calculation of the edge scores. This is mainly done to reduce the number of edges being calculated, as some genes have more than one probe representing them. The probe that scores prominently high in most conditions is generally selected to represent the node, and the new edge run list is then created for the edges in the network.
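The rules for building the test and training datasets listed above can be written down compactly. The sketch assumes that 80% of the samples go to training and that samples are shuffled before splitting; the function name, the parameter names and the fixed seed are hypothetical.

```python
import random


def split_samples(samples, train_fraction=0.8, min_total=20, max_train=50, seed=0):
    """Split sample ids into (train, test) following the rules described above."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)    # assumption: random assignment
    if len(shuffled) < min_total:
        return shuffled, []                  # too few samples: train on all of them
    # 80% for training, but never more than max_train samples.
    n_train = min(int(round(train_fraction * len(shuffled))), max_train)
    return shuffled[:n_train], shuffled[n_train:]


for n in (10, 40, 100):
    train, test = split_samples([f"S{i}" for i in range(n)])
    print(n, "samples ->", len(train), "train /", len(test), "test")
```

With 100 samples the cap of 50 training samples applies, so the remaining 50 samples form the test set.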

III - 7.3 GSE Run Scheduler


FIGURE 12: RUN SCHEDULER WHICH ASSIGNS AND MONITORS ALL THE PROCESSORS DEDICATED FOR THE ANALYSIS.

The GSE Run Manager module is one of the most complex and vital modules in the system. Its functions include:

1. Optimal run job scheduling

2. Coordinating run jobs across the processors of a job

3. Graceful error handling of any failed jobs: re-run the job and pick up from where the failed job stopped

4. Storing the updated weighted network graph for subsequent processing

The components of the Run Manager include:


1. Run Scheduler

2. Graph Compiler

3. Worker Processes

The Run Scheduler is the core piece of the module and contains the logic to coordinate and manage the workers, as shown in Figure 12. The Graph Compiler identifies all the genes in the high-throughput analysis and creates a graph based on the identified genes. The worker processes managed by the Run Manager perform the weight calculations for the nodes and edges.

The GSE Run compiler uses the run list for the nodes and sends the runs to their respective machines for execution. It spawns a separate process for every processor to ensure optimal utilization of computing resources. If one of the processors completes earlier than the others, the module is capable of restructuring the runs so that the slowest processor shares some of its runs with the faster one.
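One simple way to obtain this behaviour is a shared job pool from which idle workers pull the next job, combined with resubmission of failed jobs. The sketch below is an illustration under that assumption and not the scheduler used here: it uses threads instead of processes to stay short, it re-runs a failed job from the start rather than resuming it, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_jobs(jobs, worker, n_workers=4, max_retries=2):
    """Run jobs from a shared pool; idle workers pick up the remaining jobs.

    Failed jobs are resubmitted up to `max_retries` times.
    Returns (results, failed).
    """
    results, failed = {}, []
    attempts = {job: 0 for job in jobs}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        pending = {pool.submit(worker, job): job for job in jobs}
        while pending:
            # Iterate over a snapshot; resubmitted jobs are handled next round.
            for future in as_completed(list(pending)):
                job = pending.pop(future)
                try:
                    results[job] = future.result()
                except Exception:
                    attempts[job] += 1
                    if attempts[job] <= max_retries:
                        pending[pool.submit(worker, job)] = job
                    else:
                        failed.append(job)
    return results, failed


calls = {}

def flaky_square(x):
    """Toy worker that fails once for job 3 to simulate a bad value."""
    calls[x] = calls.get(x, 0) + 1
    if x == 3 and calls[x] == 1:
        raise ValueError("bad value")
    return x * x


results, failed = run_jobs(range(6), flaky_square)
print(results, failed)
```

Because every worker takes its next job from the same pool, a fast worker automatically ends up processing more jobs than a slow one, and a single failed job does not disturb the others.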

Once the POC scores for the nodes are calculated, the probe selection component within the GSE Run Manager identifies the microarray id that is representative of the Entrez gene id and writes it into the project folder. It also uses this knowledge to create the edge run list for all the edges in the network, based on the topological network created by the GSE Project Compiler. The new edge run list is again sent to the GSE Run Scheduler module for execution.

Once the runs are complete, the Graph Compiler component uses the newly calculated POC scores for the edges and nodes to create a weighted directed graph and a weighted undirected graph. These graphs have the same topology as the input network, but their weights are equal to the POC scores calculated using the Bradley-Terry model. The compiled edge and node POC scores are written into the project folder, where they are used by other modules at a later stage. The weighted networks are also saved into the project folder for the next stage of analysis.


III - 7.4 GSE EMD Calculation

FIGURE 13: MODULES REQUIRED FOR CALCULATION OF HUB INTERACTION SCORES

The main purpose of this module is to help identify the vital disease hubs in the network. To accomplish this, the program first identifies and ranks all the hubs in the network based on their neighborhood scores, as shown in Figure 13. The EMD distance is used to score each hub and its neighborhood. From this list, the top-scoring hubs are identified and their sub-networks are stored in the project folder.

For an undirected network, the degree of a node is given by its total number of interacting partners. For a directed network, the degree is further divided into in-degree and out-degree, based on the direction of the interaction. Hubs are vital genes in a scale-free network that have a greater than average degree. It is this feature that helps hubs maintain robustness in cells during times of perturbation. To calculate the hub score, weighted undirected networks were used exclusively.

The GSE hub detection component orders all the nodes in the undirected network by their degree, and the nodes representing the top 20% of the degree were designated as hubs. For every gene designated as a hub, a bipartite network comprising the gene and its interacting partners was created from the weighted undirected network. The EMD scoring module contains an EMD distance implementation that was originally written in C. The code has been adapted to calculate the distance between two bipartite networks and to write the results of the calculation into a file in the project folder. The compiled C code is managed within the Python framework using the multi-processing module. The input for the C program comprises a ‘disease weighted’ bipartite network and a ‘background weighted’ bipartite network. The ‘background weighted’ bipartite network is created using the topology of the ‘disease weighted’ bipartite network, but with different weights. The sum of all the weights in the entire network is calculated for both the edges and the nodes, and these values are used for all the node and edge weights in the background bipartite network. This helps create a background that is unique to that disease condition while ensuring that it does not over-bias the scoring system.
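Hub designation and the extraction of a hub's bipartite neighborhood can be sketched as follows. The sketch interprets “the top 20% of the degree” as the top 20% of nodes when ranked by degree, which is an assumption; the function names and the toy network are hypothetical, and the EMD computation itself is not reproduced.

```python
def select_hubs(edges, top_fraction=0.2):
    """Return the top `top_fraction` of nodes ranked by degree, plus all degrees."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Highest degree first; ties broken by node name for reproducibility.
    ranked = sorted(degree, key=lambda n: (-degree[n], n))
    n_hubs = max(1, int(len(ranked) * top_fraction))
    return ranked[:n_hubs], degree


def hub_neighborhood(hub, weighted_edges):
    """Return {partner: weight} for the star-shaped neighborhood of a hub."""
    partners = {}
    for (u, v), w in weighted_edges.items():
        if u == hub:
            partners[v] = w
        elif v == hub:
            partners[u] = w
    return partners


# Toy weighted undirected network (illustrative weights).
weighted_edges = {("A", "B"): 0.9, ("A", "C"): 0.4, ("A", "D"): 0.7,
                  ("B", "C"): 0.2, ("D", "E"): 0.5}
hubs, degree = select_hubs(weighted_edges.keys())
neighborhood = hub_neighborhood(hubs[0], weighted_edges)
print(hubs, neighborhood)
```

The resulting `{partner: weight}` mapping is the kind of structure that would be compared against its background counterpart by the EMD scoring step.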

The output of the EMD scoring is the hub and its final score for the given study conditions. The EMD distance is calculated in order to reduce the number of hub-hub sub-networks that are analyzed. The path analysis for these sub-networks is extremely expensive, and the computation time grows as the number of paths being analyzed goes up. The top 25% of the hubs with the highest scores, usually about 140 hubs, are then chosen for the next stage of analysis. The hub-hub sub-network creation component identifies and creates the sub-networks that connect all the selected hubs with each other. The algorithm identifies all the possible paths that connect these hubs and are at most three hops long. This usually yields around 20,000 unique sub-networks. Some of these sub-networks are very dense, with plenty of connecting paths between the two hubs, while others are sparse, with few or no directed connecting paths between them.
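Enumerating the directed paths between two hubs under a hop limit can be done with a bounded depth-first search. The sketch assumes that “three hops” means at most three edges on a path; the function name and the toy graph are hypothetical.

```python
def paths_between(graph, source, target, max_hops=3):
    """Enumerate simple directed paths from source to target with <= max_hops edges."""
    found = []
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == target and len(path) > 1:
            found.append(path)
            continue
        if len(path) - 1 == max_hops:
            continue                      # hop budget exhausted on this branch
        for nxt in graph.get(node, ()):
            if nxt not in path:           # keep paths simple (no repeated genes)
                stack.append((nxt, path + [nxt]))
    return found


# Toy directed network between two hubs H1 and H2.
graph = {"H1": ["a", "b"], "a": ["H2", "c"], "b": ["c"],
         "c": ["H2", "d"], "d": ["H2"]}
found = sorted(paths_between(graph, "H1", "H2"))
for path in found:
    print(" -> ".join(path))
```

The union of the genes on the returned paths forms one hub-hub sub-network; the two four-edge routes through gene d are excluded by the hop limit.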


III - 7.5 GSE Path Analysis Calculations

FIGURE 14: MODULES OF THE PATH ANALYSIS CALCULATIONS

In biological systems, disease-related genes are likely to lie on paths in these sub-networks that undergo large change/rewiring between the two conditions. To identify these genes, it is necessary to evaluate which of the paths connecting these sub-networks have undergone coherent change during the disease condition, as described in Figure 14. In the previous step, the EMD Calculation module identified and created thousands of the top changing hub-hub sub-networks. The Path Analysis module accomplishes this evaluation by studying how a signal may be propagated in these directed sub-networks. From the hub score calculation and selection, the hubs with the largest neighborhood change were identified. Based on the changes in their interactions, it can be assumed that these hubs are activated and that information is most likely to flow through these networks to the adjacent network during times of stress. By tracking which of the genes and paths undergo the most change during the transmission of these signals, we can identify which genes have significantly affected the topology of the overall network. Markov chain transitions were used to study these weighted sub-networks and how they change during the Markov transitions. The average width of most bio-molecular networks is about 3, so the three-step transition probabilities were calculated for all the edges and nodes in the disease sub-network. To identify which of the genes have changed significantly, we created 100 random control networks that had the same topology as the disease sub-network but whose weights were randomly sourced from various nodes and edges in the disease network. The three-step transitions for all the 100 control sub-networks were calculated, and the genes whose probabilities had changed by more than three standard deviations were identified as disease-related genes. We note that a larger number of control sub-networks is likely to yield more accurate estimates; our choice was shaped by the resources available to meet the computational demands.

The GSE path analysis module is multi-process enabled, and each process usually handles one job at a time. Once the job is complete, the job manager allocates another job from the pool. In this case, each disease sub-network is considered a job. For each disease sub-network, the module creates 100 random sub-networks that have the same topology as the disease network but whose edge and node weights are sourced from the overall list of edge and node POC scores. The disease sub-network and the random control sub-networks undergo edge correction, in which the edges are adjusted to better suit the sub-network topology and node weights using maximum likelihood estimation. Once the disease and control sub-networks are edge corrected, the Markov chain transitions for both the disease and control networks are calculated for up to three transition states, and the final node scores are calculated. If a gene in the disease sub-network has a value more than three standard deviations above or below the control gene transition values, then the gene is deemed significant. The genes that did not pass the cutoff are then ranked by their deviation from the cutoff and are used to complete the path between the hubs. This entire process is accomplished by the Markov chain calculation and analysis module. The significant genes and paths are stored in the project folder for later analysis.
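The comparison against random control sub-networks can be sketched as below. The sketch keeps the topology fixed, resamples node and edge weights from supplied pools, and flags nodes whose three-step score lies more than three control standard deviations from the control mean. The edge-correction step is omitted, and all function names, pools and weights are hypothetical.

```python
import random
import statistics


def transition_scores(node_w, edge_w, steps=3):
    """Score each node by pushing node weights along directed edges for `steps` hops."""
    current = dict(node_w)
    total = {n: 0.0 for n in node_w}
    for _ in range(steps):
        nxt = {n: 0.0 for n in node_w}
        for (u, v), w in edge_w.items():
            nxt[v] += current[u] * w
        for n in node_w:
            total[n] += nxt[n]
        current = nxt
    return total


def significant_genes(node_w, edge_w, node_pool, edge_pool,
                      n_controls=100, n_sigma=3.0, seed=0):
    """Flag nodes whose score lies more than n_sigma control SDs from the control mean."""
    rng = random.Random(seed)
    observed = transition_scores(node_w, edge_w)
    controls = {n: [] for n in node_w}
    for _ in range(n_controls):
        # Same topology, weights resampled from the pools.
        rand_nodes = {n: rng.choice(node_pool) for n in node_w}
        rand_edges = {e: rng.choice(edge_pool) for e in edge_w}
        for n, s in transition_scores(rand_nodes, rand_edges).items():
            controls[n].append(s)
    flagged = []
    for n, values in controls.items():
        mean, sd = statistics.mean(values), statistics.stdev(values)
        if sd > 0 and abs(observed[n] - mean) > n_sigma * sd:
            flagged.append(n)
    return flagged


# Toy chain a -> b -> c with strongly elevated weights (illustrative values).
node_w = {"a": 0.9, "b": 0.9, "c": 0.9}
edge_w = {("a", "b"): 0.9, ("b", "c"): 0.9}
flagged = significant_genes(node_w, edge_w,
                            node_pool=[0.1, 0.15, 0.2],
                            edge_pool=[0.1, 0.15, 0.2])
print(flagged)
```

Gene a has no incoming edge, so its control scores have zero variance and it is never flagged; this is one reason the genes that miss the cutoff are still needed to complete the paths between hubs.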


III - 7.6 GSE Feature Selection

FIGURE 15: BIOMARKER IDENTIFICATION USING FEATURE SELECTION

Disease-related genes may vary from a few dozen to a few hundred genes, depending on how many pathways were modulated by the disease pathology. These genes provide an insight into the state of the bio-molecular network and the affected pathways that might give rise to the disease phenotype.

Although this is a smaller set than the original genome, it is still difficult to use these genes as biomarkers. The cost of analyzing a large number of genes, along with the noise inherent in larger datasets, makes them less attractive for clinical use. Clinical biomarkers are generally smaller subsets, usually between 10 and 100 genes. The GSE Feature Selection module helps reduce the disease-related genes identified from multiple conditions into a smaller set, as shown in Figure 15. It also ensures that the biomarkers being identified are not clustered around a small sub-network. Clustered biomarkers are more prone to errors because the perturbations giving rise to the disease can be very different and hence may not activate a given sub-network of genes. Complex diseases are also usually multi-gene and multi-pathway diseases. Many cancers with a similar phenotype have completely different downstream pathways activated. This makes choosing biomarkers that are part of a few clustered downstream pathways prone to errors. To account for this problem, the module identifies unique features from every individual sub-network and then integrates them, making sure that there is useful representation from different parts of the network.

The overall module is mainly composed of three individual components. The first component, the hub sub-network analysis component, identifies all the genes that were marked significant in a given sub-network. If the number of genes in the given sub-network is over 10, it uses feature selection with elimination to identify the top changing genes in that sub-network; these genes are called reporter genes. The hub sub-network analysis component identifies the reporter genes from all the sub-networks of the given disease condition, pools them together, writes them into the project folder, and then moves on to the next disease condition. The feature selection with elimination module obtains the reporter genes from every disease condition and pools them together. These reporter genes are representative of the overall disease network. The feature selection with elimination protocol uses the sklearn feature selection with a linear SVM algorithm to identify the biomarkers. Expression values from the project folder are used as data points for the feature selection algorithm. In cases where more than one disease subtype is present in the overall disease condition, a multi-class support vector machine is used. The feature elimination algorithm is run until the prediction scores are about 90% of the maximum. Feature selection algorithms usually have better prediction performance when the number of features is very high, as they are able to use more data points to differentiate the conditions. This would sometimes mean taking nearly all the features, which defeats the purpose of dimensionality reduction. For this project, it was found that 90% of the maximum predictability gave good performance with most datasets.
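One reading of the 90% rule is: among the feature-set sizes evaluated during elimination, keep the smallest one whose prediction score still reaches 90% of the best score observed. The sketch below implements that reading only; the function name and the score table are invented for illustration.

```python
def smallest_adequate_set(scores_by_size, fraction=0.9):
    """Return the smallest feature count whose score reaches `fraction` of the best."""
    best = max(scores_by_size.values())
    adequate = [k for k, s in scores_by_size.items() if s >= fraction * best]
    return min(adequate)


# Invented prediction scores for different numbers of retained genes.
scores_by_size = {5: 0.62, 10: 0.81, 20: 0.88, 40: 0.93, 80: 0.95}
n_genes = smallest_adequate_set(scores_by_size)
print("genes retained:", n_genes)
```

In this example the threshold is 0.9 × 0.95 = 0.855, so 20 genes are retained even though 80 genes score slightly higher.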


III - 7.7 GSE Validation, Results and Reports

FIGURE 16: BIOMARKER VALIDATION USING CLASSIFICATION


FIGURE 17: MODULES REQUIRED FOR REPORT GENERATION AND CREATING SUMMARY

The final module of the software architecture validates the biomarkers identified by the GSE Feature Selection module against an unseen test dataset and generates the necessary reports, as shown in Figures 16 and 17. The aim of the validation is to check the ability of the machine learning algorithm to correctly classify the samples from an unseen dataset after being trained. The accuracy of the prediction is usually represented as a receiver operating characteristic (ROC) curve in the case of diseases that have just two outcomes. For diseases that have multiple outcomes or phenotypes, like breast cancer, a classification report is used, which measures the recall and accuracy with which a multi-class classifier predicts the condition from an unknown dataset. The single- and multi-class SVMs used in this analysis were obtained from the Python scikit-learn package. The training dataset for the SVM consisted of the original expression values used in creating the POC scores for the edges and nodes. The predictive capability of the biomarkers was tested against the unseen test datasets, which retain the original sample labeling. Depending on the type of disease, an ROC curve or a classification report was created detailing the accuracy of the results.


Along with validating the biomarkers with the test datasets, results comparing the individual gene and hub scores were also plotted. Fold changes and their associated p-values between the individual conditions were calculated for the individual runs and plotted against the node POC scores for comparison. These plots helped in studying the variation between the POC scores, fold change and p-value. They also provided a better understanding of the disease profiles associated with the different conditions.

The disease-related genes in each condition were sent to DAVID via a web API to identify the pathways that were probably enriched and how they might play a role in the overall disease physiology. The results also provided better insight into how different GO terms were modified during the disease. A functional report was prepared with all this pathway-level information and deposited into the project folder.


Chapter IV: Expression Analysis of Breast Cancer Subtypes

Breast cancer is a heterogeneous complex of diseases targeting the same anatomical site (Weigelt, Baehner, & Reis-filho, 2010). Breast cancer remains the most diagnosed form of cancer in women, affecting over 1.2 million women worldwide and accounting for 29% of cancers in women (Polyak, Shipitsin, Campbell-Marrotta, Bloushtain-Qimron, & Park, 2009). In terms of mortality, breast cancer is second only to lung cancer in the United States, responsible for over 14% of all cancer-related deaths (Sandhu, Parker, Jones, Livasy, & Coleman, 2010).

Breast cancer was historically perceived as a single disease that showed a large number of histopathological features and diverse treatment outcomes. Treatment decisions for breast cancer were based solely on clinical pathological variables such as tumor size, lymph node status, presence of metastasis and histological grade of the tumor. It has long been recognized that breast cancer is heterogeneous at the clinical, histological, and molecular levels (Norum, Andersen, & Sørlie, 2014). Since the 1960s, researchers have known of the heterogeneous and complex nature of breast cancers and have made numerous attempts to classify them based on various characteristics and to create a standard taxonomy (Al-Lazikani, Banerji, & Workman, 2012). The development of a classification system was not well received by the scientific community. The main idea behind building a standard classification system was to be able to identify subtypes based on various morphological and biological characteristics of the disease and then use this information to target and treat the disease. Currently, the WHO classifies breast cancer into 17 subtypes based on their specific morphological features (Sinn & Kreipe, 2013). Invasive ductal carcinoma is the most common morphological subtype of breast cancer, accounting for over 80% of invasive breast cancers, followed by invasive lobular carcinoma, which accounts for about 10%. The other, less common morphological subtypes of breast cancer include mucinous, cribriform, micropapillary, papillary, tubular, medullary, metaplastic and inflammatory carcinomas (Sinn & Kreipe, 2013).


By 1970, researchers had already identified that breast cancer displayed distinct clinical characteristics based on the expression status of the estrogen receptor (ER) (Reis-Filho & Pusztai, 2011). Based on the ER status, breast cancer was divided into two subtypes (Reis-Filho & Pusztai, 2011). Three predictive markers were then added to the clinicopathological variables to help doctors, and these guided endocrine therapy (Norum et al., 2014). The markers were the expression levels of ER, the progesterone receptor (PR) and HER2 (ERBB2). The identification of excessive HER2 expression led to the development of the monoclonal antibody trastuzumab (Weigelt et al., 2010). Personalized treatment built on expression data for breast cancer receptors led to the development of online (internet-based) applications, which in turn improved the decision-making ability of both patients and doctors and led to a steady decline in breast cancer-related mortality in the US (Parker et al., 2009). Applications such as Adjuvant! Online and the Nottingham Prognostic Index helped form guidelines for breast cancer treatment, including those of the US National Cancer Institute. Although the histopathological approaches were reducing the mortality rate of breast cancer, they were not sufficient to provide personalized treatment regimens for patients. Only a small percentage of patients ultimately derived any benefit, while almost everyone faced the risk of severe side effects.

It was not until the pivotal study by Perou’s group (Sorlie et al.) using cDNA microarrays that the heterogeneity of breast cancer was rediscovered (Sorlie et al., 2003). For the first time, the disease was subtyped based on the transcriptomic profile of the protein-coding region of the genome. In their study, they identified about 500 genes that showed significant expression level variations between the different subtypes and were thought to play a role in the disease physiology. They defined these genes as an ‘intrinsic gene’ list. When the samples were clustered using these genes, four distinct subtypes of breast cancer, namely Luminal, Basal, HER2 and Normal-like, were identified, as shown in Figure 18 (Sorlie et al., 2003). In a follow-up study, the same group further divided the Luminal subtype into Luminal A and Luminal B subtypes based on their clinical outcomes. These molecular subtypes were confirmed and expanded upon in follow-up studies by other independent groups (Guiu et al., 2012).

FIGURE 18: METHODOLOGY FOR THE MOLECULAR CLASSIFICATION OF BREAST CANCER.

IV - 1 Molecular Subtypes of Breast Cancer

FIGURE 19: MOLECULAR CLASSIFICATION OF BREAST CANCER.

Figure 19 details the properties of the different breast cancer subtypes. Luminal A breast cancers have been shown to have high expression of ER and a very active ER pathway, and do not show large changes in proliferation-related genes. They also show up-regulation of genes such as GATA, FOXA1 and LIV1 (Damrauer et al., 2014; Eroles, Bosch, Pérez-Fidalgo, & Lluch, 2012; X. Hu et al., 2009; Perou & Børresen-Dale, 2011). They do not show gene amplification of HER2 or its related pathway (Perou & Børresen-Dale, 2011; Taherian-Fard, Srihari, & Ragan, 2014). Luminal A breast cancers are usually well differentiated and show low histological grade (Desmedt et al., 2008). They have a very low proliferation index and hence usually have an excellent prognosis. Tumors of this subtype are usually associated with better recurrence-free survival and overall survival than any other breast cancer subtype (Bandyopadhyay & Ali-Fehmi, 2013).

FIGURE 20: PREVALENCE OF BREAST CANCER SUBTYPES IN NORMAL WOMEN POPULATION

Luminal B breast cancers, like Luminal A, show an active ER pathway, but its activity is lower than in the Luminal A subtype. They show increased expression of proliferation-related genes and luminal cytokeratins, and increased expression of ERBB2 (HER2) related pathways (Bandyopadhyay & Ali-Fehmi, 2013). Unlike the Luminal A subtype, Luminal B breast cancers show higher histological grade and proliferation rates, which usually translate to worse clinical outcomes. Because of these differences, Luminal B breast cancers generally have a poorer prognosis than Luminal A, with greater chances of relapse. One of the most common gene mutations seen in many of the luminal cancers is in PIK3CA.

The HER2 breast cancer subtype accounts for about 15% of all breast carcinomas. On immunohistochemistry, the HER2 subtype is HER2 positive and ER and PR negative. These tumors also commonly exhibit positive breast lymph nodes and greater nodal volume when compared to the Luminal A subtype (West et al., 2001). In around 40% of cases, the HER2 subtype shows mutations in p53 (Figure 20) (West et al., 2001). HER2 subtypes also show significant over-expression of HER2 pathway related genes such as GRB7, along with lower expression of ER and its related genes (X. Hu et al., 2009). Because of the over-expression of proliferation-related genes and HER2, these tumors generally behave aggressively. With recent advancements in treatment, these tumors respond well to anti-HER2 therapies and usually show good overall survival rates and low relapse rates (Bandyopadhyay & Ali-Fehmi, 2013).

The basal-like breast cancer subtype lacks expression of the ER and ERBB2/HER2 receptors and shows increased expression of EGFR pathway genes. Its name derives from the over-expression of cytokeratins characteristic of basal/myoepithelial cells. Immunohistochemistry testing of this subtype is negative for all three receptors (ER, HER2 and PR). About 75% of these tumors also show mutations in the p53 gene. Basal-like breast cancers are usually of higher histological grade, show greater chromosomal instability and have a higher mitotic rate than Luminal tumors. They also seem to be more prevalent in younger women of African and Hispanic descent and are morphologically similar to tumors arising from BRCA1 germ-line mutations. These tumors have a poorer prognosis than luminal tumors, with a greater chance of relapse within 3 years even after the primary tumor has been treated. The metastasizing tumor has a higher chance of affecting the central nervous system (Bandyopadhyay & Ali-Fehmi, 2013). Recently this subtype has been further divided to better aid treatment and diagnosis (Bandyopadhyay & Ali-Fehmi, 2013). These subtypes include:

o Claudin-low subtype: characterized by low expression of ESR1, ERBB2 and cell-cell adhesion genes such as claudin and E-cadherin (Prat & Perou, 2011). In addition, it displays low proliferation and increased expression of genes related to the immune system, cell-cell communication and cell migration (Bandyopadhyay & Ali-Fehmi, 2013).

o Apocrine subtype: exhibits increased expression of ERBB2 and androgen receptor (AR) signaling genes, and variations in the levels of genes related to lipid metabolism (Bandyopadhyay & Ali-Fehmi, 2013).

o Interferon-related subtype: characterized by increased expression of interferon-related genes, including STAT1, and usually associated with poor prognosis (Bandyopadhyay & Ali-Fehmi, 2013).

Normal-like breast cancer is sometimes classified along with the ER-negative branch even though it is poorly characterized; the literature suggests that these cancers may be an artifact of sample representation (Sweeney et al., 2014). ER-negative breast cancer subtypes show aggressive clinical behavior. Even though breast cancers could be subtyped based on their transcriptomic profile, this was not done in large-scale clinical settings due to the limitations of microarray methods (Kao, Chang, Hsu, & Huang, 2011). During that time, subtyping of breast cancers was done through immunohistochemistry of receptor expression and proliferation gene markers. The FDA has also approved a number of prognostic and diagnostic gene signature assays for commercial use, all of which show varying levels of efficacy. Some of the commonly used assays include MammaPrint, Oncotype DX, Theros, and MapQuant Dx (Kao et al., 2011).

The PAM50 assay proved to be a more reliable measurement of the molecular subtypes and provided better results than immunohistochemistry and the other assays available at that time. Many studies have continued to show the usefulness of this assay for both prognosis and prediction of the course of breast cancer treatment (Arpino et al., 2013; Sweeney et al., 2014).

Microarray and other genomic analyses have revolutionized the way we look at breast cancer. The discovery of molecular subtypes, followed by their validation in independent analyses, has offered an understanding of the clinical heterogeneity seen in breast cancer. These analyses have also provided clinicians with a powerful tool for the prognosis, diagnosis and treatment of breast cancer subtypes, and have paved the way for more individualized and effective treatment of breast cancer. They have also helped patients manage the disease effectively and plan their treatment in consultation with their physician (Kao et al., 2011).

IV - 2 Dataset Information

The dataset was obtained from a study designed to determine the effectiveness of 76 genes as a prognostic signature for all types of invasive breast carcinoma. The 76-gene signature was initially developed using a supervised analysis of microarray data and then trained with data obtained from 115 breast cancer samples.

The 76-gene set was derived from the following two independent analyses:

o The first dataset consisted of samples from patients who were ER positive. The set of 60 genes identified from this analysis was able to predict distant metastasis within 5 years in patients.

o The second dataset consisted of samples from patients who were ER negative. The 16 genes identified from this analysis were able to predict distant metastasis within 5 years in patients.

The genes identified from this analysis outperformed the guidelines set by the US NCI, especially for patients who had a good prognosis and would not need therapy. The biggest limitation of this gene set was that it relied largely on the expression levels of proliferation-related genes to make its decision. It also required fresh or frozen samples obtained from patients in a time-dependent manner, and the 16 genes identified for ER-negative breast cancers did not perform well for triple-negative breast cancers (Reis-Filho & Pusztai, 2011).

IV - 3 Study Design

FIGURE 21: OVERALL DESIGN OF THE STUDY

FIGURE 22: COMBINATION OF RUNS REQUIRED TO IDENTIFY DISEASE RELATED GENES AND BIOMARKERS USING THE MICROARRAY AND RNA-SEQ DATASET.

This analysis aims to compare the disease-related genes and biomarkers identified between breast cancer subtypes using two independent breast cancer datasets derived from two different platforms (Figure 21 & Figure 22). The performance of these biomarkers will be evaluated based on their ability to correctly classify an unseen dataset from their respective platform. The biggest advantage of microarray platforms is that they are a better-established methodology for expression level analysis. RNA-Seq, on the other hand, is a newer technology that measures expression levels more accurately than microarrays. The biggest drawback of RNA-Seq analysis is that its standardization and normalization protocols are not as robust as those for microarray analysis (S. Zhao, Fung-Leung, Bittner, Ngo, & Liu, 2014).

FIGURE 23: STUDY PROCESS FOR INDIVIDUAL COMBINATION

The microarray gene expression data used in this analysis are available from the Gene Expression Omnibus under accession number GSE7390. The microarray data were obtained from 198 frozen biopsy samples using Affymetrix U133A GeneChips. Next-generation RNA-Seq gene expression values were obtained from the TCGA data portal. In this study, the Level 3 gene expression levels from the RNA-Seq dataset, measured in reads per kilobase of transcript per million mapped reads (RPKM), were analyzed. The PAM50 molecular signature was used to classify the genomic data (both microarray and RNA-Seq datasets) into five breast cancer subtypes, namely Luminal A (LumA), Luminal B (LumB), Basal (Basal), HER2 (HER2) and Normal-like (Normal). The overall analysis is illustrated in Figures 21 and 22, which detail the steps common to both analyses. In the case of the RNA-Seq analysis, RPKM values are used to analyze differential expression, while log2 normalized intensities are used in the case of the microarray data.
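As a reminder of what an RPKM value encodes, the following is a minimal sketch of the standard RPKM normalization. The Level 3 TCGA data used in this study already arrive as RPKM values, so this is illustrative only; the function name and the example numbers are assumptions and are not taken from the analysis.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    read_count         -- reads mapped to the gene
    gene_length_bp     -- gene (transcript) length in base pairs
    total_mapped_reads -- total mapped reads in the sample
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and library size must be positive")
    # 1e9 = 1e3 (per kilobase) * 1e6 (per million reads)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads, 2 kb long, in a library of 10 million reads.
example_rpkm = rpkm(500, 2000, 10_000_000)
print(example_rpkm)
```

Normalizing by both gene length and library size is what makes values comparable across genes and across samples.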

The methodology only allows pairwise comparisons of disease conditions. Since the breast cancer dataset was classified into five mutually exclusive subtypes, it is necessary to identify the disease-related genes between every pair of subtypes before the biomarkers for the overall disease are identified. Details of the multiple combinations of runs are shown in Figure 22. Disease-related genes between every pair of disease subtypes are calculated independently using the process flow detailed in Figure 23. Feature selection with elimination is then performed on all the disease-related genes combined to identify the biomarkers that best differentiate the breast cancer subtypes. Heat maps of the biomarkers from the RNA-Seq and microarray analyses are generated, and the performance of the biomarkers in classifying the subtypes is evaluated for both the RNA-Seq and microarray derived biomarkers.
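The pairwise run structure can be sketched as follows. This is a minimal illustration, assuming only that each unordered pair of the five subtypes is analyzed once and that the per-run gene sets are pooled before feature selection; the function names and the placeholder gene labels are hypothetical.

```python
from itertools import combinations

# Subtype labels as abbreviated in the text.
SUBTYPES = ["LumA", "LumB", "Basal", "HER2", "Normal"]


def pairwise_runs(subtypes):
    """Return every unordered pair of subtypes, one pair per analysis run."""
    return list(combinations(subtypes, 2))


def pool_candidates(genes_per_run):
    """Union the disease-related genes found in each pairwise run.

    genes_per_run maps a (subtype, subtype) pair to a set of gene labels.
    """
    pooled = set()
    for genes in genes_per_run.values():
        pooled |= set(genes)
    return pooled


runs = pairwise_runs(SUBTYPES)
print(len(runs))  # 5 choose 2 = 10 pairwise runs

# Placeholder gene labels, for illustration only.
toy = {("LumA", "LumB"): {"GENE_A"}, ("LumA", "Basal"): {"GENE_A", "GENE_B"}}
print(sorted(pool_candidates(toy)))
```

The pooled set is the input to the feature selection step; a gene found in several pairwise runs appears in it only once.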

FIGURE 24: PLOTS SHOW THE DISTRIBUTION OF POC SCORES OF NODES AND EDGES FOR VARIOUS SUBTYPE COMPARISONS

FIGURE 25: DISTRIBUTION OF NODE AND EDGE POC SCORES FOR DIFFERENT SUBTYPE COMBINATIONS

Figure 25 Details: Subfigures a and b plot the degree distribution of nodes in the undirected and directed networks, respectively. The graphs used in our analysis were downloaded from curated databases in Pathway Commons 2 and compiled according to the procedure described in the methods section. The y-axis of the histogram is presented on a log scale, while the x-axis gives the number of neighboring interacting partners for each gene. The properties of these graphs are similar to those of scale-free networks, in which a large number of genes have small node degrees while a few genes have higher than average node degrees (power-law distribution).
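A degree distribution of the kind described above can be computed from an edge list in a few lines. The sketch below treats the graph as undirected and uses a made-up toy edge list; the node labels and function names are assumptions, and duplicate edges or self-loops are not handled.

```python
from collections import Counter


def node_degrees(edges):
    """Count interacting partners per node in an undirected edge list."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree


def degree_histogram(edges):
    """Map each degree value to the number of nodes having that degree."""
    return Counter(node_degrees(edges).values())


# Toy graph: one hub "H" with four partners, plus one extra edge A-B.
toy_edges = [("H", "A"), ("H", "B"), ("H", "C"), ("H", "D"), ("A", "B")]
hist = degree_histogram(toy_edges)
for k in sorted(hist):
    print(k, hist[k])
```

Plotting the node counts against degree on a log-scaled y-axis gives the kind of histogram the figure describes, with many low-degree nodes and a few hubs.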

FIGURE 26: DEGREE DISTRIBUTION OF NODES IN THE UNDIRECTED AND DIRECTED GRAPH

IV - 4 Results

An automated approach to identify subtypes has been developed. In the section titled “Probability of Change”, the results demonstrate that POC is a robust, but not identical, indicator of change in comparison with fold change and p-value. In the “Path Analysis” section it is shown that the algorithm produces primarily the same paths for both the RNA-Seq and microarray datasets. Further, the selected feature genes are shown to classify the breast cancer subtypes in an automated way without manual intervention. The classification outcomes using these selected genes or biomarkers were compared with the classification outcomes using the PAM50 gene set. We were able to achieve similar performance levels by directly running the sklearn algorithm on well-studied breast cancer intrinsic subtyping genes. For comparison purposes, absolute fold changes, p-values and POC scores for probes/genes were calculated for every pairwise combination of subtypes. Overall, a network of 14000 genes and 140 hub-to-hub sub-networks were analyzed.

The goal of this study was to develop and evaluate an automated framework that uses the graded function of POC for the discovery of disease-related genes and biomarkers. In the discussion of these results, the breast cancer data set is used to evaluate the capacity of the platform to differentiate breast cancer subtypes. In addition, the overall efficacy of POC in generating gene candidates with discriminatory power is evaluated using standard methods of machine learning.

FIGURE 27: PROBABILITY OF CHANGE (POC) AS A METRIC FOR DIFFERENTIAL EXPRESSION

Figure 27 Details: Sub-figures a-c show the scatter plot profiles of POC vs. fold change for breast cancer subtypes. Sub-figures e-f show the variations in POC with the adjusted p-value of the t-test for genes in cancer subtypes. The relationship between POC and fold change is linear at low fold change values but shows more scatter at higher fold change values, suggesting that the two measures differ somewhat at higher fold changes. On the log-log scale, POC and p-value have an inverse relationship. In physiologically similar cases, for example Luminal A vs. Luminal B, POC remains small, as would be expected, indicating that changes are not reliably identifiable. In Luminal A vs. Normal-like and Luminal A vs. Basal, changes are reliably identifiable. Fold change and p-value alterations are largely reflected in POC values.

FIGURE 28: COMPARING POC WITH FREQUENTIST METRICS LIKE ABSOLUTE FOLD CHANGE AND P-VALUE

Figure 28 Details: We use the common p-value cut-off of 0.05 to examine the relationship between POC and AFC (absolute fold change) for all probes in the microarray data set. Here we show pairwise comparisons between Luminal A and all other subtypes as a demonstration. Panel (a) shows probes with p-value < 0.05, and panel (b) shows probes with non-significant p-values. The red line marks the 2-fold cut-off commonly used in practice. The red rectangle marks probes whose fold change was low (< 2-fold) but which retained a POC value > 0.5. At lower AFC values, POC and AFC are correlated irrespective of p-value. As the AFC increases, the POC and AFC values diverge slightly (wider dispersion) for p-value < 0.05, while for p-value > 0.05 they remain reasonably correlated. In a few exceptional cases, the behavior of p-value and AFC differs from POC. Several specific examples are shown in panels c-h and are examined in detail in the results section.

IV - 4.1 Probability of change (POC)

An all vs. all pairwise comparison of breast cancer subtypes was performed using the GEOquery and limma R packages from the Bioconductor project. The results from the packages included the absolute fold change (AFC) and adjusted p-value for all pairwise subtype comparisons. Corresponding POC values for the genes were calculated for all pairwise subtype comparisons. Computations with POC in our framework do not use any threshold values, but for comparison purposes a POC threshold of 0.5 was used, with 0.5 indicating a 50-50 chance of detecting a change in expression (see Figure 24 for the distribution of POC scores). Results based on POC values were predominantly concordant with results obtained using the p-value (p-value < 0.05) and AFC (AFC > 2x) threshold tests. Over 90% of the genes with a POC value over 0.5 were found to have a significant p-value (< 0.05) with AFC > 2. For genes with POC < 0.5, one or both of the following conditions held: the gene had an AFC < 2x (Figure 27a), or it had a non-significant p-value (Figure 27b).

When the p-value for differential expression was < 0.05, the POC value tracked more closely with the ability to separate gene expression values than with fold change. For some genes with p-value < 0.05 and fold change higher than 2x, the POC remained lower than 0.5. These cases were subjected to a more detailed analysis, which revealed that the POC measure weights separability of expression values more heavily than AFC. Genes with small fold changes were found to have low POC values unless the changes in gene expression were highly coherent. The same analysis also revealed that a number of genes with distinctive differential profiles would have been eliminated from further analysis by the two-fold limit used by many frequentist methods. Higher POC values were indicative of statistical significance as well as coherent change.
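The concordance check described above (what fraction of high-POC genes also pass the conventional p-value and fold-change cut-offs) can be sketched as below. This is an illustrative sketch only: it does not compute POC itself, the function name is hypothetical, and the toy values are invented rather than taken from the datasets.

```python
def concordance_with_thresholds(genes, poc_cut=0.5, p_cut=0.05, afc_cut=2.0):
    """Fraction of high-POC genes that also pass the p-value and AFC cut-offs.

    genes is an iterable of (poc, adjusted_p_value, abs_fold_change) tuples.
    Returns None when no gene exceeds the POC cut-off.
    """
    high_poc = [g for g in genes if g[0] > poc_cut]
    if not high_poc:
        return None
    agree = sum(1 for _, p, afc in high_poc if p < p_cut and afc > afc_cut)
    return agree / len(high_poc)


# Invented values: (POC, adjusted p-value, absolute fold change).
toy_genes = [
    (0.9, 0.001, 3.0),  # high POC, significant, large fold change
    (0.7, 0.010, 2.5),  # high POC, significant, large fold change
    (0.6, 0.200, 1.4),  # high POC but fails both conventional cut-offs
    (0.3, 0.040, 2.2),  # passes conventional cut-offs but low POC
    (0.1, 0.500, 1.1),  # low on every measure
]
print(concordance_with_thresholds(toy_genes))
```

Genes such as the third and fourth toy entries are the discordant cases that the text singles out for closer inspection.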

Dispersion plots of POC vs. fold change and POC vs. p-value illustrate the corresponding relationships (Figures 28a-f and Figures 29-31). The association between POC and p-value follows a reciprocal relation, $\log(p\text{-value}) \cdot \log(\mathrm{POC}) = \text{constant}$, indicating a strong inverse correlation. The relationship between POC and AFC is strongly linear when the POC is low, but dispersed at higher AFC values. For Luminal A vs. Luminal B (Fig. 28a), which are physiologically similar ER+ breast cancer subtypes, the AFC ranged from no change to approximately 4x, while POC remained below 0.5. For the same subtypes, the POC vs. p-value plot shows that very few genes (< 10 with AFC > 2) attained a significant p-value (Figure 28d). The plots show that only a few genes were identified as having significant changes (p-value < 0.05) and an AFC > 2. The highest POC values obtained for these genes were approximately 0.45, indicating the absence of definitive high-probability changes. Figures 28b and 28c, which illustrate the variation in POC with fold change and p-value for Luminal A and physiologically dissimilar subtypes such as Basal-like and Normal-like, exhibited favorable trends. Generally, we found that higher POC values are observed when the corresponding genes have a significant p-value and AFC > 1.5. For well distinguishable subtypes, there is a compact and well-defined correlation between genes with POC < 0.5 and genes that are considered not to be significantly changing according to standard analysis. Comparison of any subtype with the Normal-like subtype (Fig. 28b, 28e) indicated several genes with higher POC values. Some of these genes had p-value < 0.05 and AFC > 1.3, although not all genes with p-value < 0.05 and AFC > 2 had POC > 0.5. The Luminal A vs. Basal comparison (Fig. 28c, 28f) had a large number of genes showing a difference of two-fold or more and p-value < 0.05, which is highly correlated with high-POC genes.
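The reciprocal relation between POC and p-value can be rearranged to make the inverse behavior explicit. The following is a sketch under stated assumptions: natural logarithms are used, both quantities lie strictly between 0 and 1, and the symbol $c$ for the constant is introduced here for illustration (a different logarithm base only rescales $c$).

```latex
% Assumptions: 0 < p < 1 and 0 < POC < 1, so both logarithms are negative
% and their product c is positive.
\begin{align*}
  \ln(p)\,\ln(\mathrm{POC}) &= c, \qquad c > 0 \\
  \ln(\mathrm{POC}) &= \frac{c}{\ln(p)} \\
  \mathrm{POC} &= \exp\!\left(\frac{c}{\ln(p)}\right) \\[4pt]
  p \to 0^{+} &\;\Rightarrow\; \frac{c}{\ln(p)} \to 0^{-}
      \;\Rightarrow\; \mathrm{POC} \to 1 \\
  p \to 1^{-} &\;\Rightarrow\; \frac{c}{\ln(p)} \to -\infty
      \;\Rightarrow\; \mathrm{POC} \to 0
\end{align*}
```

Under these assumptions, smaller p-values correspond to POC values closer to 1, which matches the inverse trend seen in the POC vs. p-value plots.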

We examined the relationship between absolute fold change (AFC) and the corresponding POC scores (Figure 27). For clarity, we limited the comparisons in this figure to microarray probes involving Luminal A vs. all the other subtypes. The microarray probes were divided into two scatter plots based on p-value significance, with statistically significant probes (p-value < 0.05) plotted in Figure 27a and the remainder (p-value > 0.05) in Figure 27b. More than 90% of the genes with POC > 0.5 were found to have a significant p-value with AFC > 2. Similarly, most genes with POC < 0.5 exhibited non-significant adjusted p-values, or AFC < 2, or both. While POC broadly captured the p-value and fold change, we noted some cases in which genes with statistically significant changes and high fold values did not have a high POC. A detailed analysis of these cases revealed that the expression values for these genes had high scatter and variance, indicating that POC favors consistency over average fold change.

Some representative discordant instances were characterized using five examples, shown in Figures 27c-27e, corresponding to p-value < 0.05, and Figures 27f-g, corresponding to p-value > 0.05. Figure 27c identifies the gene CRABP1 (which participates in retinoic acid-mediated differentiation and proliferation processes (Honecker et al., 2014)), with a significant p-value and greater than two-fold AFC but a low POC value. POC scores favor consistent change (for example, higher interquartile difference and less variance) over high AFC. Figure 27e is a box plot of the gene CA12 (related to zinc metalloenzymes that participate in many biological processes, including respiration, acid-base balance and calcification (Damaghi, Wojtkowiak, & Gillies, 2013)), with a high POC value but AFC < 2. The interquartile difference between the two subtypes was reflected in the POC measure, indicating a case that should be considered in more detail, but the gene would have been eliminated from consideration under standard measures. Intermediate between these two cases (Fig. 27d) is a gene with an AFC slightly below 2-fold. This gene would also have been eliminated from further consideration according to standard criteria, yet its interquartile difference appears to be better discriminated than that of FAXDC2 (part of the fatty-acid biosynthesis and oxidation process (Olsen et al., 2006)). Overall, for genes with p-value < 0.05, the POC provides an effective measure for detecting change and is not adversely affected by low AFC.

Next, we considered in detail genes with p-value > 0.05 and high POC. Our comparative statistical analysis showed that these genes were eliminated as a result of low sample counts (Normal-like sample counts are low). This property of POC may have no impact on false discovery rates: while these genes are not eliminated in this first stage, the inclusion of their graded values in the context of network neighborhoods in the subsequent analysis yields combined scores that exclude them from further consideration as disease-related genes. Finally, the case of POLR2L (part of the machinery that catalyzes transcription of DNA into RNA (Acker, Murroni, Mattei, Kedinger, & Vigneron, 1996)) illustrates the concordance of p-value < 0.05 and high AFC with a high POC value, as should be expected (Fig. 27h).

FIGURE 29: POC VS ABSOLUTE FOLD CHANGE FOR ALL GENES IN EVERY SUBTYPE COMPARISON

FIGURE 30: PLOTS OF POC VS P-VALUE FOR VARIOUS SUBTYPES COMPARISONS

FIGURE 31: POC VS P-VALUE FOR ALL GENES IN EVERY SUBTYPE COMPARISON

Figure 31 e-f illustrate the variation in the POC values of genes with the corresponding p-value of the t-test across two breast cancer subtypes. Notably, genes with low (highly significant) p-values had higher POC values; more specifically, high POC values were good indicators of statistical significance. Overall, the POC metric was effective in capturing both the fold change and the significance of the genes, particularly when the fold change was consistent.

FIGURE 32: VENN DIAGRAMS OF PATH ANALYSIS RESULTS FROM INDIVIDUAL PAIRWISE ANALYSIS

IV - 4.2 Path Analysis

The POC values of genes were used in path analysis. Supplementary Table S1 from our manuscript lists the total number of disease-related genes that were identified as exhibiting significant change between two disease subtypes based on network analysis performed on the microarray and RNA-Seq datasets. The complete sets of significant genes identified from the individual pairwise analyses of the microarray and RNA-Seq datasets are listed in Supplementary File S1 and File S2, respectively (digital files are not included in the print copy of the thesis but can be found in our manuscript). Venn diagrams comparing the genes selected from the RNA-Seq dataset and the microarray dataset are shown in Figures 32 and 33. A larger subset of significant genes was identified from the RNA-Seq dataset than from the microarray dataset. Despite differences in the techniques and the reported values between RNA-Seq and microarray, 70%-80% of the disease-related genes identified using the path analysis were common to both datasets for most pairwise subtype comparisons. However, when comparisons were made between a disease subtype and the Normal-like subtype, the percentage of intersecting genes common to the microarray and RNA-Seq datasets dropped to less than 50%, suggesting that the “Normal-like” subtype may not have been well defined across platforms.

FIGURE 33: VENN DIAGRAMS OF PATH ANALYSIS RESULTS FROM INDIVIDUAL PAIRWISE ANALYSIS-II

The individual pairwise analysis of the subtypes resulted in the identification of a total of 503 unique genes for the microarray dataset and 688 unique genes for the RNA-Seq dataset. These genes, listed in Supplementary File S3 and File S4 (digital files not included in the print copy of the thesis), along with the genes identified in both datasets, can potentially play key roles in differentiating the subtypes (these genes are listed). The feature genes identified using this method overlapped with four of the 50 PAM50 genes in the microarray analysis and five of the 50 PAM50 genes in the RNA-Seq analysis. When genes in the immediate neighborhood of these feature genes were included in the list, the overlap was found to be about 80%. The number of paths selected in each pairwise comparison depended on the reported physiological similarities between the subtypes (Figures 32 and 33). For example, for Luminal A and Luminal B, which have similar disease physiologies, a smaller subset of feature genes was identified, whereas comparing Luminal A and HER2, which are physiologically dissimilar, yielded a larger set. This result is consistent with the expectation that physiologically similar conditions exhibit different regulatory behavior on only a small number of wired paths.

TABLE 1: COUNTS OF OVERLAPPING GENES BETWEEN RNA-SEQ AND MICROARRAY ANALYSIS ALONG WITH THE INTERSECTING GENES FROM BOTH THESE ANALYSES

Conditions                  Microarray Analysis          RNA-Seq Analysis             Intersecting Genes
                            (No. of genes identified)    (No. of genes identified)    (No. of genes intersecting)
Luminal A vs Luminal B      87                           375                          56
Luminal A vs Basal          86                           308                          67
Luminal A vs HER2           174                          388                          133
Luminal A vs Normal-like    100                          147                          45
Luminal B vs Basal          277                          342                          195
Luminal B vs HER2           115                          399                          94
Luminal B vs Normal-like    97                           176                          48
Basal vs HER2               172                          388                          125
Basal vs Normal-like        127                          230                          67
HER2 vs Normal-like         115                          270                          66
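The overlap percentages implied by Table 1 can be recomputed directly from its counts. The sketch below is illustrative; because the text does not state which gene count serves as the denominator for the reported percentages, it reports the intersecting fraction relative to both the microarray and the RNA-Seq counts. The variable and function names are assumptions.

```python
# Counts copied from Table 1: (microarray genes, RNA-Seq genes, intersecting genes).
TABLE_1 = {
    ("Luminal A", "Luminal B"): (87, 375, 56),
    ("Luminal A", "Basal"): (86, 308, 67),
    ("Luminal A", "HER2"): (174, 388, 133),
    ("Luminal A", "Normal-like"): (100, 147, 45),
    ("Luminal B", "Basal"): (277, 342, 195),
    ("Luminal B", "HER2"): (115, 399, 94),
    ("Luminal B", "Normal-like"): (97, 176, 48),
    ("Basal", "HER2"): (172, 388, 125),
    ("Basal", "Normal-like"): (127, 230, 67),
    ("HER2", "Normal-like"): (115, 270, 66),
}


def overlap_fractions(microarray, rnaseq, shared):
    """Return the shared fraction relative to each platform's gene count."""
    return shared / microarray, shared / rnaseq


for (a, b), counts in TABLE_1.items():
    frac_micro, frac_rna = overlap_fractions(*counts)
    print(f"{a} vs {b}: {frac_micro:.0%} of microarray genes, "
          f"{frac_rna:.0%} of RNA-Seq genes")
```

Reporting both denominators avoids committing to one reading of the percentages quoted in the text.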

TABLE 2: LIST OF BIOMARKERS IDENTIFIED FROM THE MICROARRAY ANALYSIS

Gene Name   EntrezGene  Description
BUB1        699         BUB1 budding uninhibited by benzimidazoles 1 homolog (yeast)
ARHGAP6     395         Rho GTPase activating protein 6
CFTR        1080        Cystic fibrosis transmembrane conductance regulator (ATP-binding cassette sub-family C, member 7)
CDC2        983         Cell division cycle 2, G1 to S and G2 to M
GART        2618        Phosphoribosylglycinamide formyltransferase, phosphoribosylglycinamide synthetase, phosphoribosylaminoimidazole synthetase
RPL5        6125        Ribosomal protein L5
MED1        5469        Mediator complex subunit 1
BCL2        596         B-cell CLL/lymphoma 2
AURKB       9212        Aurora kinase B
ARHGEF12    23365       Rho guanine nucleotide exchange factor (GEF) 12
RPL31       6160        Ribosomal protein L31
ESR1        2099        Estrogen receptor 1
RPL38       6169        Ribosomal protein L38
POLR2A      5430        Polymerase (RNA) II (DNA directed) polypeptide A, 220kDa
CYP1B1      1545        Cytochrome P450, family 1, subfamily B, polypeptide 1
TIAM1       7074        T-cell lymphoma invasion and metastasis 1
RPL7        6129        Ribosomal protein L7
PLCE1       51196       Phospholipase C, epsilon 1
DAPK1       1612        Death-associated protein kinase 1
PTPN11      5781        Protein tyrosine phosphatase, non-receptor type 11 (Noonan syndrome 1)
RASGRF1     5923        Ras protein-specific guanine nucleotide-releasing factor 1
TTK         7272        TTK protein kinase
CDC25B      994         Cell division cycle 25 homolog B (S. pombe)
CDC20       991         Cell division cycle 20 homolog (S. cerevisiae)
CDC25A      993         Cell division cycle 25 homolog A (S. pombe)
KIT         3815        V-kit Hardy-Zuckerman 4 feline sarcoma viral oncogene homolog
NUDC        10726       Nuclear distribution gene C homolog (A. nidulans)
MDM2        4193        Mdm2, transformed 3T3 cell double minute 2, p53 binding protein (mouse)
CDC42       998         Cell division cycle 42 (GTP binding protein, 25kDa)
PCM1        5108        Pericentriolar material 1
CYP21A2     1589        Cytochrome P450, family 21, subfamily A, polypeptide 2
CYP24A1     1591        Cytochrome P450, family 24, subfamily A, polypeptide 1
PTPN2       5771        Protein tyrosine phosphatase, non-receptor type 2
PCK1        5105        Phosphoenolpyruvate carboxykinase 1 (soluble)
NMT2        9397        N-myristoyltransferase 2
CBLB        868         Cas-Br-M (murine) ecotropic retroviral transforming sequence b
MAPK14      1432        Mitogen-activated protein kinase 14
ARHGAP1     392         Rho GTPase activating protein 1
DUSP1       1843        Dual specificity phosphatase 1
RPL13       6137        Ribosomal protein L13
MAPT        4137        Microtubule-associated protein tau
CENPA       1058        Centromere protein A
RET         5979        Ret proto-oncogene
CYP7A1      1581        Cytochrome P450, family 7, subfamily A, polypeptide 1
ERBB2       2064        V-erb-b2 erythroblastic leukemia viral oncogene homolog 2, neuro/glioblastoma derived oncogene homolog (avian)
F2R         2149        Coagulation factor II (thrombin) receptor

FIGURE 34: HEATMAP OF BIOMARKERS IDENTIFIED IN THE MICROARRAY ANALYSIS

Figure 34 Details: The figure shows the heat map of the biomarker genes identified in the microarray analysis. The performance of these biomarkers on the training and test sets is visualized as a heat map. The color bar gives the normalized and rescaled expression values for the genes in the heat map. Each column represents one biomarker gene, while each row represents one sample. The orange/teal bar on the left denotes training (orange) and test (teal) samples. The breast cancer subtypes of the samples are color-coded in the adjacent bar. The explanation of the subtype color codes is provided in the space to the right of the heat map.


TABLE 3: LIST OF BIOMARKERS IDENTIFIED FROM THE BREAST CANCER RNA-SEQ ANALYSIS

Gene Name  EntrezGene  Description
IL2        3558    Interleukin 2
AURKA      6790    Aurora kinase A
BCL2       596     B-cell CLL/lymphoma 2
ENPP7      339221  Ectonucleotide pyrophosphatase/phosphodiesterase 7
SLC5A11    115584  Solute carrier family 5 (sodium/glucose cotransporter), member 11
FGFR1      2260    Fibroblast growth factor receptor 1 (fms-related tyrosine kinase 2, Pfeiffer syndrome)
BLM        641     Bloom syndrome
MLSTD1     55711   Male sterility domain containing 1
APOC3      345     Apolipoprotein C-III
INS        3630    Insulin
MYT1       4661    Myelin transcription factor 1
CBR1       873     Carbonyl reductase 1
KIF20A     10112   Kinesin family member 20A
MAPT       4137    Microtubule-associated protein tau
SLC22A4    6583    Solute carrier family 22 (organic cation transporter), member 4
MAFA       389692  V-maf musculoaponeurotic fibrosarcoma oncogene homolog A (avian)
PNMT       5409    Phenylethanolamine N-methyltransferase
MYOD1      4654    Myogenic differentiation 1
SLC6A2     6530    Solute carrier family 6 (neurotransmitter transporter, noradrenalin), member 2
NRG1       3084    Neuregulin 1
AMPH       273     Amphiphysin
HIST1H1A   3024    Histone cluster 1, H1a
FBP2       8789    Fructose-1,6-bisphosphatase 2
ESR1       2099    Estrogen receptor 1
RHAG       6005    Rh-associated glycoprotein
MYB        4602    V-myb myeloblastosis viral oncogene homolog (avian)
ATP7B      540     ATPase, Cu++ transporting, beta polypeptide
CDC20      991     Cell division cycle 20 homolog (S. cerevisiae)
AR         367     Androgen receptor (dihydrotestosterone receptor; testicular feminization; spinal and bulbar muscular atrophy; Kennedy disease)
SLC28A3    64078   Solute carrier family 28 (sodium-coupled nucleoside transporter), member 3
ACSL6      23305   Acyl-CoA synthetase long-chain family member 6
SLC38A5    92745   Solute carrier family 38, member 5
ZBTB16     7704    Zinc finger and BTB domain containing 16
SLC5A1     6523    Solute carrier family 5 (sodium/glucose cotransporter), member 1
SLC10A1    6554    Solute carrier family 10 (sodium/bile acid cotransporter family), member 1
CCNE1      898     Cyclin E1
SGOL1      151648  Shugoshin-like 1 (S. pombe)
CDC2       983     Cell division cycle 2, G1 to S and G2 to M
RHCG       51458   Rh family, C glycoprotein
NGFR       4804    Nerve growth factor receptor (TNFR superfamily, member 16)
CENTA1     11033   Centaurin, alpha 1
ASAH3      125981  N-acylsphingosine amidohydrolase (alkaline ceramidase) 3
KIT        3815    V-kit Hardy-Zuckerman 4 feline sarcoma viral oncogene homolog
ELOVL2     54898   Elongation of very long chain fatty acids (FEN1/Elo2, SUR4/Elo3, yeast)-like 2
SLC17A3    10786   Solute carrier family 17 (sodium phosphate), member 3
TTK        7272    TTK protein kinase
GFAP       2670    Glial fibrillary acidic protein
TOP2A      7153    Topoisomerase (DNA) II alpha 170kDa
HPSE2      60495   Heparanase 2
CYP27B1    1594    Cytochrome P450, family 27, subfamily B, polypeptide 1
CYP24A1    1591    Cytochrome P450, family 24, subfamily A, polypeptide 1
GSTP1      2950    Glutathione S-transferase pi
CDK6       1021    Cyclin-dependent kinase 6
G6PC2      57818   Glucose-6-phosphatase, catalytic, 2
AURKB      9212    Aurora kinase B
ELOVL4     6785    Elongation of very long chain fatty acids (FEN1/Elo2, SUR4/Elo3, yeast)-like 4
FURIN      5045    Furin (paired basic amino acid cleaving enzyme)


FIGURE 35: HEATMAP OF BIOMARKERS IDENTIFIED IN THE RNA-SEQ ANALYSIS

Figure 35 Details: The figure shows a heat map of the biomarker genes identified in the RNA-Seq analysis, visualizing the performance of these biomarkers on the training and test sets. The color bar identifies the normalized and rescaled expression values for genes in the heat map. Each column represents one biomarker gene, while each row represents one sample. The orange/teal bar on the left denotes training (orange) and test (teal) samples. Breast cancer subtypes are color-coded in the adjacent bar. The explanation of the subtype color codes is provided to the right of the heat map.


TABLE 4: CLASSIFICATION REPORT FOR MICROARRAY TRAINING SET ANALYSIS

Microarray Training Dataset
Subtype      Precision  Recall  F1-score  Support
Luminal B    1.0        1.0     1.0       36
Luminal A    1.0        1.0     1.0       70
HER2         1.0        1.0     1.0       23
Basal        1.0        1.0     1.0       35
Avg. total   1.0        1.0     1.0       164

TABLE 5: CLASSIFICATION REPORT FOR MICROARRAY TEST DATASET

Microarray Test Dataset
Subtype      Precision  Recall  F1-score  Support
Luminal B    1.0        1.0     1.0       9
Luminal A    1.0        1.0     1.0       8
HER2         1.0        1.0     1.0       3
Basal        1.0        1.0     1.0       9
Avg. total   1.0        1.0     1.0       29

TABLE 6: CLASSIFICATION REPORT FOR RNA-SEQ TRAINING DATASET ANALYSIS

RNA-Seq Training Dataset
Subtype      Precision  Recall  F1-score  Support
Luminal B    1.0        0.98    1.0       50
Luminal A    0.98       1.0     1.0       50
HER2         1.0        1.0     1.0       44
Basal        1.0        1.0     1.0       50
Avg. total   1.0        1.0     1.0       194

TABLE 7: CLASSIFICATION REPORT FOR RNA-SEQ TEST DATASET ANALYSIS

RNA-Seq Test Dataset
Subtype      Precision  Recall  F1-score  Support
Luminal B    0.59       0.63    0.61      67
Luminal A    0.93       0.71    0.81      164
HER2         0.25       0.82    0.38      11
Basal        1.0        1.0     1.0       40
Avg. total   0.81       0.73    0.76      282

TABLE 8: CLASSIFICATION REPORT FOR BIOMARKERS IDENTIFIED FROM INTRINSIC GENE LISTS USING THE TRAINING DATASET

RNA-Seq Training Dataset
Subtype      Precision  Recall  F1-score  Support
Luminal B    0.57       0.64    0.61      67
Luminal A    0.88       0.74    0.81      164
HER2         0.42       0.91    0.57      11
Basal        0.95       1.0     0.98      40
Avg. total   0.80       0.76    0.77      282

IV - 4.3 Feature genes and Classification Performance

After identifying the genes that changed between subtypes, the next step in our analysis was to determine the subset of reporter genes that can be used as biomarkers by using a feature selection algorithm.

Subsequently, all the reporter genes were combined to create a much larger list of genes. The feature selection algorithm was rerun iteratively to identify the minimum number of genes that could effectively differentiate the subtypes for our dataset. The analysis identified 46 significant feature genes, or biomarkers, in the microarray analysis and 59 in the RNA-Seq analysis. Tables 4-8 list the reporter and feature genes, or biomarkers, that were identified from the microarray and RNA-Seq analyses, respectively. Figures 34 and 35 show heat maps of the expression levels of the 46 and 59 feature genes in the microarray and RNA-Seq datasets, respectively, for the various breast cancer subtypes. These heat maps visualize the distinct patterns in the expression levels of the feature genes, which can be used to differentiate the disease subtypes. The expression levels in the heat maps were rescaled to lie between -1 and 1. The samples from the training dataset and test dataset are demarcated by the blue and white bars, respectively, on the left-hand side of each heat map.
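The rescaling step mentioned above can be illustrated with a small sketch. The text does not state how the rescaling was performed, so the per-gene linear (min-max) mapping below is an assumption, and the function name and sample values are hypothetical.

```python
def rescale_to_unit_range(values):
    """Linearly map a list of numbers onto the interval [-1, 1].

    Assumption: a simple min-max mapping per gene; the original
    analysis may have used a different rescaling.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant profile carries no contrast; place it at the midpoint.
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


# Hypothetical normalized expression values of one gene across four samples.
expression = [2.0, 4.0, 6.0, 10.0]
rescaled = rescale_to_unit_range(expression)
print(rescaled)
```

With this kind of mapping, the lowest value of each gene is drawn at one end of the color bar and the highest at the other, which makes genes with very different absolute expression levels comparable within one heat map.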

In order to test the ability of the feature genes to classify samples into breast cancer subtypes, a multi-class classifier based on support vector machines was trained using the training dataset. The feature genes were then used to classify samples in both the training and the test datasets. Tables 4-5 indicate that the feature genes obtained from the microarray analysis classified all the samples with 100% precision and 100% recall, suggesting that the feature genes could effectively differentiate the disease subtypes. Tables 6-7 indicate that the feature genes obtained from the RNA-Seq analysis classified the samples with 81% precision and 78% recall, suggesting that the feature genes could effectively differentiate the disease in most conditions.
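To make the columns of the classification reports concrete, the following sketch computes per-class precision, recall, F1-score and support from true and predicted labels. It is an illustration in plain Python, not the code used in this work; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return precision, recall, F1 and support for every class label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correctly assigned to its own class
        else:
            fp[pred] += 1    # wrongly assigned to the predicted class
            fn[truth] += 1   # missed by the true class
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        support = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": support}
    return report


# Toy example: one Luminal A sample is misclassified as HER2.
y_true = ["Basal", "Basal", "HER2", "LumA", "LumA", "LumA"]
y_pred = ["Basal", "Basal", "HER2", "LumA", "LumA", "HER2"]
report = per_class_report(y_true, y_pred)
for label, row in report.items():
    print(label, row)
```

In the toy example a single misassigned Luminal A sample halves the HER2 precision while HER2 recall stays at 1.0. The same mechanism may explain why a small class such as HER2 (support 11 in Table 7) can show high recall together with low precision.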

FIGURE 36: NETWORK MAP OF IDENTIFIED BIOMARKERS


IV - 5 Discussion

Our overall goal was to provide an automated framework that removes the need for statistical thresholds, and uses concomitant gene expression changes along directed regulatory paths as an indicator of disease rewiring. Disease conditions modify signaling and regulatory pathways and these perturbations are subsequently dissipated across the disease network through hubs. By studying how the hubs dissipate these signals and by tracing the changes occurring in the hub-hub sub-network, we identified key paths and disease-related genes that are affected during the course of the disease.

POC provides an effective measure for identifying changing expression levels. It is effective in the network setting and succinctly captures the essential features of ‘frequentist’ measures (p-value and fold change) in a single score. ‘Frequentist’ statistical methods, such as the t-test, ANOVA, and Fisher's exact test, have been developed to assess the statistical significance of a gene's differential expression. The Student's t-test is the most commonly used statistic; it expresses the significance of a gene's differential expression as a p-value (Dalman et al., 2012). The fold change metric, on the other hand, provides insight into the magnitude by which the gene has changed between two disease conditions. These measures, taken together or individually, are commonly used to reduce the dimensionality of gene expression data by providing a selection criterion. POC aims to provide a probabilistic measure of gene changes between conditions.
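For readers less familiar with the two ‘frequentist’ measures discussed above, the sketch below computes a log2 fold change and a pooled-variance Student's t statistic for one gene in two conditions. It illustrates only the conventional measures, not the POC score, whose definition is given elsewhere in this work. The function names and expression values are hypothetical, and converting the t statistic to a p-value (which requires the t distribution) is omitted.

```python
import math
from statistics import mean, variance


def log2_fold_change(group_a, group_b):
    """log2 of the ratio of mean expression in condition B over condition A."""
    return math.log2(mean(group_b) / mean(group_a))


def student_t_statistic(group_a, group_b):
    """Two-sample Student's t statistic with pooled (equal) variance."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a)
                  + (nb - 1) * variance(group_b)) / (na + nb - 2)
    standard_error = math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    return (mean(group_b) - mean(group_a)) / standard_error


# Hypothetical linear-scale expression values of one gene in two conditions.
condition_a = [2.0, 4.0, 6.0]
condition_b = [8.0, 16.0, 24.0]
fold = log2_fold_change(condition_a, condition_b)
t_stat = student_t_statistic(condition_a, condition_b)
print(fold, t_stat)
```

A threshold-based workflow would keep the gene only if both numbers exceed chosen cut-offs; the point made in the text is that POC avoids such cut-offs by carrying a single probabilistic score through the analysis.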

A POC score is not used as a threshold; a gene and its POC value are retained throughout the computational process. POC scores were independent of the directionality of gene change (up- or down-regulation). We used pairwise pathway analysis to identify an initial candidate list for subtyping (Figures 34 and 35). In most cases, pairwise comparisons of disease phenotypes revealed a higher number of disease-related genes in the RNA-Seq data set than in the microarray data set. This result may have been due to higher overall POC values in the RNA-Seq data. The higher POC values could be traced to the higher accuracy of RNA-Seq detection methods (Z. Wang, Gerstein, & Snyder, 2009).


Based on functional analysis performed using the DAVID tool (Dennis et al., 2003), the sets of breast cancer disease genes from the microarray data set and from the RNA-Seq data set were each assessed as significantly associated with an OMIM disease annotation. Functional pathway analysis revealed the enrichment of several pathways that were previously implicated in breast cancer. These include the ErbB signaling pathway (Guedj et al., 2012; Sorlie et al., 2003; S. Wang & Biology, 2011), the insulin signaling pathway (Jin & Esteva, 2008), cell cycle pathways (Gasco, Shami, & Crook, 2002), the p53 signaling pathway (Cancer & Atlas, 2012b; Gasco et al., 2002; Miller et al., 2005), and the TGFβ signaling pathway (Bierie & Moses, 2006; Buck & Knabbe, 2006). Regulation of the cell cycle and programmed cell death were among the GO biological processes that had changed significantly.

Our algorithm’s ability to automatically rediscover genes that previous studies have suggested to be strongly linked to this cancer is extremely promising. Nine biomarker genes were found to be common between the RNA-Seq feature genes and the microarray feature genes: CYP24A1, MAPT, ESR1, KIT, TTK, CDK1, AURKB, CDC20 and BCL2. Between the RNA-Seq feature set and the PAM50 set, genes including BCL2, CCNE1, CDC20, ESR1, and MAPT were identified as common. Genes common between the microarray feature set and the PAM50 set include ERBB2, ESR1, CDC20, MAPT, MDM2 and BCL2.

Many of the genes identified in our analysis are part of vital cellular programs that indicate the hallmarks of various cancers. Mutation or expression level modifications of these genes can lead to cellular rewiring, which contributes to oncogenesis. Nine of the 46 microarray feature genes have been implicated in various cancers (as reported in the Sanger Cancer Gene Census). ERBB2, a feature gene identified in our analysis, is a well-documented breast cancer gene that has also been implicated in other cancers, such as ovarian cancer and non-small cell lung carcinoma (Cancer & Atlas, 2012b). Other feature genes identified from our analysis, such as PCM1, KIT, PTPN11, CBLB, and BCL2, have been shown to be mutated in other cancers such as leukemia (Vogelstein & Kinzler, 2004). The gene RET has been implicated in medullary thyroid cancer (Takacova et al., 2014). A similar analysis of the RNA-Seq feature genes revealed that 10 of the feature genes were cancer genes. Most of these cancer genes, for example FGFR1 and IL2, were implicated in some form of lymphoma, whereas genes such as BCL2, BLM, CDK6, ZBTB16, KIT, and ACSL6 were implicated in leukemia (Vogelstein & Kinzler, 2004). Many of these genes have been shown to play a key role in breast cancer in follow-up genomic analyses. One such gene is DUSP1, which is induced by oxidative stress and growth factors. In these studies, DUSP1 has been shown to reduce the growth of cancer cells and to be differentially expressed in different breast cancer subtypes (C. Chen, Hardy, & Mendelson, 2011). Another important gene, involved in nuclear migration, that was overexpressed in the Basal subtype is NUDC. Microtubule-stabilizing and -destabilizing genes, including NUDC, have been shown to play a key role in the overall progression of breast cancer and its treatment (Bhat & Setaluri, 2007).

In the microarray heat map, genes such as PCM1, MAPK14, PTPN11, CBLB, RPL31, RPL38, and RPL13 exhibit distinctive behavior (Figure 34). The changes in the expression levels of these genes support the view that ribosomal protein synthesis undergoes a significant change between subtypes. This result has also been noted in other analyses of breast cancer, where the differential expression of ribosomal proteins affects the mTOR and PI3K pathways (Eroles et al., 2012). The RNA-Seq heat map also displays large changes in the AR, MYB, MAPT, TTK, ELOVL4, and PNMT genes. A majority of these genes are known to play a role in the maintenance of microtubules, which are known to be affected in some forms of breast cancer (McGrogan, Gilmartin, Carney, & McCann, 2008). AURKA, which in the heat map was differentially expressed in LumB and Basal, was shown in an aggregated meta-analysis to be a prognostic and subtyping marker gene (Wirapati et al., 2008). Another important enzyme is PNMT, which was differentially expressed in the HER2 subtype in our analysis and was previously shown to be differentially expressed in HER2-amplified breast cancer using high-resolution genomic and expression analysis (Staaf et al., 2010).

The feature genes obtained from the microarray data set were able to differentiate the disease subtypes more accurately than those from the RNA-Seq data set (Table 5-6). The microarray feature genes classified the samples from the blind data set into breast cancer subtypes with 100% accuracy (Table 7-8). With the RNA-Seq data, we were able to achieve 81% accuracy in the test set, while the training set was classified with 100% accuracy (Table 7, 8). This was similar to the accuracy obtained for feature genes derived from a previously published intrinsic breast cancer gene list (Table 9). The smaller training set in our analysis may partially explain these results, and we expect higher accuracies once our computational resources are expanded to use larger sample sizes. Additionally, Dillies et al. (2013) have suggested that microarray data sets have more consistent normalization methods than RNA-Seq data sets. RNA-Seq is a newer technology that lacks consensus around normalization methodologies. The level 3 RNA-Seq data set utilized in our work measures gene levels in units of RPKM, a measure that, along with proper normalization, remains the subject of vigorous debate. Although our study of the empirical distribution of RPKM values suggested the use of a log scale for the feature selection, careful examination of the distribution of values remains a necessary task for improved normalization. With proper normalization for RNA-Seq, we expect to achieve results that are on par with microarray in terms of robustness.


IV - 6 Follow-up Analysis using TCGA datasets

With newer breast cancer microarray and RNA-Seq datasets becoming available at TCGA, we opted to re-examine our analysis with the newer data. A number of studies evaluating the concordance of RNA-Seq and microarray biomarkers for breast cancer have yielded inconclusive results (Marioni, Mason, Mane, Stephens, & Gilad, 2008; Soneson & Delorenzi, 2013; Z. Wang et al., 2009). Our goal here was to revisit the concordance hypothesis, and to evaluate the degree to which POC values are reflective of correlations at the level of raw reported data values. We postulated that similarity in POC values across the two data sets would suggest similar outcomes, because POC values are the basis for all subsequent computation in our network-based model. As the measure of similarity, we used the Pearson correlation coefficient, implemented using the scikit-learn Python package (Pedregosa et al., 2011). From TCGA, we used the 'AgilentG4502A' dataset, which contains the microarray data. We also downloaded the level 3 data for both RNA-Seq and microarray and converted them into the legacy format. The samples from these datasets were classified into their respective subtypes using PAM50 classification, and the assignments were cross-checked against the existing literature.

TABLE 9: COMPARISON OF THE CORRELATION BETWEEN RNA-SEQ AND MICROARRAY POC SCORES BETWEEN TWO INDEPENDENT RUNS. ‘PREVIOUS RUN’ COLUMN REPORTS THE PUBLISHED ANALYSIS WHILE THE ‘CURRENT RUN’ REPORTS THE RESULTS FROM ANALYSIS OF THE NEW TCGA DATASET.


Using the same algorithms as in our previously published analysis, we computed the POC scores for all edges and nodes. Using the Pearson correlation coefficient, we compared the POC scores of all genes computed from the new microarray and RNA-Seq datasets with the POC scores from the published analysis. The correlation between a gene's POC, for a given subtype combination, in the RNA-Seq and microarray datasets was calculated for both the previously published study and our current analysis, and the results are tabulated in Table 9. Correlations between POC values across the two distinct sets of data (previous and current) show a remarkable consistency, although correlations in the current study are consistently lower by approximately 5%. These results suggest that POC values remain largely consistent across these two distinct data sets.
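The comparison described above amounts to computing a Pearson correlation over the genes shared by two sets of scores. The sketch below shows that computation in plain Python; it is an illustration rather than the code used in this work, and the gene-to-POC dictionaries contain made-up values.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-deviations divided by the product of the
    # root sums of squared deviations (the 1/n factors cancel).
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cross / (spread_x * spread_y)


# Hypothetical POC scores for one subtype pair on the two platforms.
poc_rnaseq = {"ESR1": 0.95, "BCL2": 0.80, "MAPT": 0.60, "KIT": 0.30}
poc_microarray = {"ESR1": 0.90, "BCL2": 0.70, "MAPT": 0.65, "KIT": 0.20}

shared = sorted(set(poc_rnaseq) & set(poc_microarray))
r = pearson_r([poc_rnaseq[g] for g in shared],
              [poc_microarray[g] for g in shared])
print(round(r, 3))
```

Because the coefficient is the covariance normalized by the two standard deviations, it falls when the covariance between platforms decreases relative to the spread of scores within each platform.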

The moderate to strong correlations observed among POC values are expected to be higher than the correlations that are structurally present in the raw data. A lower correlation would indicate either a) an increase in the variance, as measured by POC, in each data set, or b) a decrease in the covariance between RNA-Seq and mRNA values, as measured by POC. Increased within-sample variance, or decreased covariance, as measured by POC, would lead to a weaker discrimination capability for the POC, which is an undesirable property. It remains to be shown that the correlations between POC scores are not discernible directly from the raw scores.

To ascertain that the correlation between the POC scores was not an artifact of the raw expression values, we compared the correlation of the POC scores of all genes between a given pair of conditions with the correlation of the mean raw expression values; this is shown in table 10. We first calculated the correlation of the POC scores for every gene between the microarray dataset from TCGA (current analysis) and the GEO data (previously published results). Then we calculated the correlation scores for the mean raw expression values of genes for both these datasets. Similarly, we calculated the correlation scores for both POC scores and expression values for the RNA-Seq analysis. Based on Table 11, we can conclude that, for most subtype combinations, the RNA-Seq platform shows greater correlation scores than the microarray platform for both raw expression values and POC values. A likely reason is that RNA-Seq platforms can measure changes in gene expression more accurately than microarray platforms, which translates into stronger correlations.


Similarly, when comparing between subtypes on the same platform, the POC scores also consistently show greater correlation scores than the raw expression values. These findings indicate that, for a given platform, POC scores capture expression changes more consistently than the raw expression values do.

TABLE 10 : COMPARISON OF CORRELATION IN EACH PLATFORM AS MEASURED BY MEAN RAW EXPRESSION VALUES AND POC SCORES.

The combined results illustrate the consistent behavior of POC values across studies. While RNA-Seq scores and mRNA expression scores based on POC remain correlated (moderately to strongly) across the two sets of studies, the RNA-Seq data and the microarray data are only moderately concordant. The moderate concordance may suggest an explanation for the range of results documented in the literature, from good concordance to weak concordance. The threshold used for selection and the method of normalization are likely to play a significant role in the outcome of a concordance evaluation. It may be suggested that the continuous gradation of POC values is able to capture the dependence of concordance on the threshold through the moderate POC correlations exhibited in table 9.


Chapter V: Expression Analysis of Sepsis, SIRS And Septic Shock

V - 1 Abstract

V - 1.1 Background

Sepsis and septic shock are leading causes of death for critically ill patients, which can be attributed mainly to delayed diagnosis of the onset of sepsis. Reliable biomarkers that can detect the onset of sepsis at earlier stages of development could be used to improve patient outcomes; however, the underlying pathological mechanisms of sepsis and septic shock are poorly understood, making accurate and early clinical diagnosis of the disease extremely difficult.

V - 1.2 Method

In this study, we extend our novel path-centric approach to the early detection of sepsis and to the use of time-series data for the analysis of complex diseases. Wiring and information flow, concepts with analogs in network theory, are used as a basis for identifying significant changes in networks in order to retain genes that would otherwise be eliminated due to p-values or fold changes. This approach uses probabilistic analysis of high-throughput data (mRNA and miRNA), along with directed and enriched signaling network information, to identify functionally and differentially active pathways. Our integrated approach to analyzing temporal variation and enforcing consistency among multiple data sets increases the likelihood of identifying more reliable and functional biomarkers.

V - 1.3 Results and Conclusion

Our analysis identified pathways with significantly altered interactions that were found to be associated with, for example, lymphocyte proliferation and maturation as well as cell-cell interactions. One significantly altered pathway was the NFAT pathway, which is known to regulate a number of immune processes during the course of the pathology. We expect to expand this platform to aid in the discovery of prognostic and diagnostic molecular biomarkers associated with survival or other clinically relevant end-points.

V - 2 Background

Pediatric systemic inflammatory response syndrome (SIRS) is a complex whole-body immunological response to a variety of insults including trauma, surgery, ischemia, severe tissue injury, and pancreatitis (Wong et al., 2009). Manifestations of SIRS include, but are not limited to:

• Body temperature: less than 36 °C or greater than 38 °C

• Heart rate: greater than 90 beats per minute

• Tachypnea: high respiratory rate, with greater than 20 breaths/min

• White blood cell count: less than 4000 cells/mm3

When patients meet two or more of the above criteria, with or without infection, they are classified as having systemic inflammatory response syndrome (SIRS) (Herzum & Renz, 2008). A patient is diagnosed with sepsis when they meet the criteria for SIRS and have a confirmed infection (Herzum & Renz, 2008). Patients who are diagnosed with sepsis and also show signs of cardiovascular dysfunction, disturbed perfusion, metabolic acidosis, neurological disorders or other organ failure are classified as having severe sepsis or septic shock (Herzum & Renz, 2008). Sepsis, SIRS and septic shock are emergent immune disorders that come about due to dysregulation of the innate and adaptive immune systems (Iskander et al., 2013). Sepsis is associated with a high mortality rate (László, Trásy, Molnár, & Fazakas, 2015) and a very large financial burden (Iskander et al., 2013). Even with the extensive research done in recent years, sepsis is still the largest cause of death in non-cardiac ICUs; mortality rates can be as high as 60% in septic shock patients, and sepsis can account for as much as 40% of total ICU expenditure (László et al., 2015; Wong et al., 2009). Current understanding of the behavior of the circulating immune physiology during disease progression is very limited, and researchers have concluded that the diverse syndrome is mainly due to an imbalance in the inflammatory network.
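The classification rules stated above can be written as a short decision procedure. The sketch below is purely illustrative: it encodes only the thresholds and rules as listed in this section, the class and function names are hypothetical, and it is not clinical guidance.

```python
from dataclasses import dataclass


@dataclass
class Vitals:
    temperature_c: float       # body temperature in degrees Celsius
    heart_rate_bpm: float      # beats per minute
    respiratory_rate: float    # breaths per minute
    wbc_per_mm3: float         # white blood cells per cubic millimeter


def count_sirs_criteria(v: Vitals) -> int:
    """Count how many of the four criteria listed in the text are met."""
    criteria = [
        v.temperature_c < 36.0 or v.temperature_c > 38.0,
        v.heart_rate_bpm > 90,
        v.respiratory_rate > 20,
        v.wbc_per_mm3 < 4000,
    ]
    return sum(criteria)


def classify(v: Vitals, infection_confirmed: bool, organ_dysfunction: bool) -> str:
    """Apply the SIRS -> sepsis -> severe sepsis/septic shock rules."""
    if count_sirs_criteria(v) < 2:
        return "criteria not met"
    if not infection_confirmed:
        return "SIRS"
    if organ_dysfunction:
        return "severe sepsis or septic shock"
    return "sepsis"


# Hypothetical patient: fever and elevated heart rate, no confirmed infection.
patient = Vitals(temperature_c=38.5, heart_rate_bpm=110,
                 respiratory_rate=18, wbc_per_mm3=9000)
print(classify(patient, infection_confirmed=False, organ_dysfunction=False))
```

The sketch makes visible why the three conditions are hard to separate: the same physiological criteria gate all of them, and the distinction rests entirely on confirming infection and organ dysfunction.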

FIGURE 37: CLINICAL DIAGNOSIS FOR SEPSIS, SIRS, AND SEPTIC SHOCK

The physiologic alterations for sepsis, SIRS, and septic shock continue to be defined by non-specific alterations in physiology, including temperature, as well as heart and respiration rates (King, Bauzá, Mella, & Remick, 2013). The current gold standard for differential diagnosis of sepsis and septic shock from SIRS involves positive blood cultures with supporting evidence, both clinically and from other anatomical sites (King et al., 2013). These tests can sometimes be inconclusive because the medical determinants are not specific to sepsis and are usually susceptible to variations such as prior antibiotic administration, low pathogen levels in the circulatory system, or contamination. Blood cultures also tend to be time consuming, making time-critical clinical interventions difficult (Iskander et al., 2013).

V - 2.1 Molecular Mechanism of Sepsis, SIRS and Septic Shock

The main cause of most of these syndromes is the excessive activation of the antigen recognition system. This leads to the release of pro-inflammatory immune mediators into the system, leading to serious systemic dysfunction (Cornell, Wynn, Shanley, Wheeler, & Wong, 2010; King et al., 2013). Immune cells express receptors on their surface, called pattern recognition receptors (PRRs), which are capable of triggering the body's defense reaction. These receptors are primed to respond to molecular patterns present in disease-causing pathogens, such as bacterial lipopolysaccharide (LPS) (Janols et al., 2014; King et al., 2013; László et al., 2015). On detection of microbes, microbe-associated molecular patterns (MAMPs) activate innate immunity via the PRRs, which in turn triggers the body's inflammatory immune response against the pathogens. During microbial infection, microbial constituents or necrotic tissue trigger the inflammatory response, resulting in the activation of the systemic immune response. When the inflammatory response is excessive and systemic, it results in sepsis. The excessive immune response leads to a marked imbalance of cytokine levels, known as a cytokine storm, which turns the normally advantageous response against infections into an excessive, body-damaging response (Janols et al., 2014; László et al., 2015; Nduka & Parrillo, 2009).

SIRS is brought about by physiological responses very similar to those in sepsis, but the triggers for the immune reactions are very different. In many conditions, such as pancreatitis, the presence of a large amount of necrotic debris in the system is enough to trigger an immune reaction. Cellular necrosis from injury releases mitochondrial DNA into the circulatory system, which is capable of eliciting inflammatory signals. Danger-associated molecular patterns (DAMPs) are more often triggered by host entities that are usually harmless than by foreign pathogenic entities (Iskander et al., 2013; László et al., 2015; Nduka & Parrillo, 2009). Both DAMPs and pathogen-associated molecular patterns (PAMPs) elicit similar physiological responses, making it difficult to distinguish the two via a single molecule or molecular pattern.

V - 2.2 Clinical Diagnosis

Due to the severe nature of these syndromes, early diagnosis is very important. Administering the right clinical treatment early significantly affects patient outcomes, but the heterogeneous nature of the disease makes diagnosis very difficult. Treatments such as broad-spectrum antibiotics are not very effective in many cases, as some of the bacterial strains might be antibiotic resistant (Janols et al., 2014). Also, it has been clinically shown that administering the right treatment in the first hour of hypotension significantly improves the survival rate among patients (Kumar et al., 2006). Some of the main difficulties in diagnosing these disorders are:

• Sepsis is a heterogeneous disease, which makes it difficult to diagnose.

• Sepsis is not an infection of the bloodstream, yet blood is usually the bodily fluid used for identifying infection.

• Sub-optimal sensitivity of the clinical identification of sepsis:

  o Prior antibiotic treatment can severely affect the ability to detect clinical manifestations of the disease.

  o Clinical tests may report positive even when the pathogens are mostly phagocytized.

• Most of these conditions are diagnosed in the ICU setting. The use of sedatives and inotropes to treat patients in the ICU makes organ dysfunction in conditions like septic shock very difficult to identify.

• Clinical symptoms manifested by SIRS, septic shock and sepsis are very similar.

The current 'gold standard' diagnosis of sepsis involves positive blood culture. It is usually supported with evidence of infection both clinically and from other anatomical sites. Negative or inconclusive blood cultures could also result from prior antibiotic administration. The biggest drawbacks of blood cultures are that they are time-consuming and thus make it difficult to take time-critical decisions.

Two commonly used clinical biomarkers are:

• Procalcitonin (PCT): A precursor to calcitonin that is synthesized in the thyroid ‘C’ cells (Kibe, Adams, & Barlow, 2011). The secretion of PCT is a component of the inflammatory response of the immune system and is usually specific to microbial infections, making it a good candidate for a blood-based biomarker. PCT levels are also known to increase during kidney failure. Even though PCT is able to track the rise of infection in the system, based on current clinical research it lacks the accuracy to be used as a prognostic biomarker without clinical judgment.

• C-reactive protein (CRP): CRP is an acute-phase reactant protein produced in the liver in response to the flood of cytokines that is generally released by the system in response to a microbial infection. CRP itself has been shown to have both anti- and pro-inflammatory properties. Like PCT, CRP does not possess the accuracy to be used as a prognostic or diagnostic biomarker for sepsis, but the levels of CRP have been strongly correlated with the onset of septic shock (Pierrakos & Vincent, 2010).

V - 2.3 Limitations of diagnostic methods

Identification of a robust set of biomarkers is an important step towards early diagnosis and treatment. One of the biggest problems facing the diagnosis of sepsis, SIRS, and septic shock is the lack of robust clinical or genomic biomarkers to accurately differentiate the conditions. Two promising biomarkers have been clinically tested, procalcitonin (PCT) and C-reactive protein (CRP), but these markers only distinguish between sepsis with infection and SIRS without infection, and cannot predict whether the patient will go into shock (László et al., 2015). The relationship between the biomarker and the state of disease physiology has not been fully understood, thus making it difficult to obtain a clear prognosis (Janols et al., 2014; Y. Liu & Chance, 2013; Russell, 2011).

• The most preferred method for clinical diagnosis is non-related blood-derived protein-level markers, which have a higher rate of false positives.

• These markers only distinguish between sepsis with infection and SIRS without infection, but cannot predict whether the patient will go into shock.

• The relationship between the biomarker and the state of disease physiology has not been fully understood, thus making it difficult to obtain a clear prognosis.

• Molecular biomarkers that have been identified using the traditional expression-level-based seeding process have yielded lower success rates in clinical diagnosis.

V - 3 Dataset Information

The original microarray data used is currently available at Gene Expression Omnibus under accession number GSE13904 (Wong et al., 2009). Total RNA was isolated from whole blood samples using the PaxGene Blood RNA System (PreAnalytiX, Qiagen/Becton Dickinson, Valencia, CA) according to the manufacturer’s specifications. Microarray hybridization was performed by the Affymetrix Gene Chip Core facility at Cincinnati Children’s Hospital Research Foundation using the Human Genome U133 Plus 2.0 Gene-Chip (Affymetrix, Santa Clara, CA).

Samples for the microarray analysis were collected from children under the age of 10 admitted to the pediatric intensive care unit (PICU) of Cincinnati Children’s Hospital. Only patients who met the standard criteria for sepsis, SIRS or septic shock were included in the study. The samples obtained from the patients were whole white blood cells. Blood samples were drawn on day 1 and day 3 after admission to the PICU. All patients in the study were assigned one of the three classifications (sepsis, SIRS or septic shock) on day 1 of admission and were reclassified, if necessary, on day 3 based on the same criteria.

Day 1 in the study was defined as the first 24 hours after the patient met the criteria, either on admission to the PICU or after an initial admission to the PICU with a non-study-related classification. Samples for day 3 were obtained 48 hours after the first blood draw. The disease was classified according to strict guidelines. Severity of the disease was classified based on the Pediatric Risk of Mortality III score, and organ failure in patients with sepsis was classified using pediatric-specific criteria. Control patients for this study were recruited from the ambulatory department using established criteria.

The overall goal of the study was to evaluate the specificity of septic shock genomic expression signatures by comparing the genomic signatures of patients suffering from septic shock with those of other critically ill patients. The analysis revealed that common patterns exist across all critically ill patients, including patients who later go into septic shock, and that there is a unique pattern for each disease type. Patients suffering from septic shock were characterized by persistent activation of genes related to innate immune pathways and the inflammatory system. Functional analysis of the modulated genes indicated that genes related to zinc and other heavy metals were also modulated. This is important because zinc homeostasis is known to play a key role in the normal functioning of the innate and adaptive immune systems. Also, genes related to the IL-10 pathway were unregulated in patients with septic shock, while genes from the TGF-β pathway were uniquely unregulated on day 3 in septic shock patients. IL-4 was persistently down-regulated in patients with septic shock.

One of the biggest limitations of the study was that expression levels obtained from whole-blood-derived RNA samples reflected RNA levels from multiple white blood cell populations and were not reflective of the exact immune response. The study also lacked well-defined pathological parameters: because the syndrome lacks measurable clinical parameters, it is very difficult to gauge the exact onset of the disease when a patient is admitted. In this study, day 1 was defined as the day on which a patient admitted to the hospital was diagnosed with the condition and enrolled in the study. A number of factors can affect the day 1 assignment, including the patient’s immune state, late or early admission to the hospital, previous medical history and medications, and age. Therefore, day 1 and day 3, as defined in the study, may not accurately reflect the onset and progression of the disease.

V - 4 Network Medicine and Human Inflammatory Disease Physiology

Common genetic variants affect many complex diseases, including many inflammatory syndromes. Sepsis, SIRS and septic shock can be viewed as emergent syndromes brought about by combinatorial and simultaneous changes to multiple genes over time. Changes in circulating immune gene expression and rewiring of cellular signaling pathways can result in large-scale alterations to cellular composition during transformation.

The “progression” network provides a view of the changes to the state or maturation of the circulating immune cells triggered by various signaling events during the course of the disease. Network medicine reveals the evolution of novel phenotypic properties through simple changes in the underlying interactome.

V - 4.1 Differential Expression patterns for studying multigenic diseases:

The large number of genes in the expression data, the complex interaction types between gene products, and the lack of detailed kinetic and chemical parameters for the genes in the network make it difficult to model the network as a whole. It is difficult to determine information flow along the gene networks with a priori knowledge. The lack of detailed knowledge of the system’s parameters, coupled with the relatively few parameters that are measured simultaneously, makes recreating the network dynamics very difficult. Due to these limitations, most bioinformatics methodologies trace anomalous patterns of gene expression in the network rather than dynamically modeling the network as a whole.

These anomalous gene expression patterns serve as the foundation on which the network can be built. The most common method of network module reconstruction is list-based network seeding, in which genes whose expression levels have varied significantly serve as the basis on which the modules or sub-networks are reconstructed.

V - 4.2 Network Data

Curated biological interaction data (STRING, Reactome (Joshi-Tope et al., 2005), and the National Cancer Institute Pathway Interaction Database (NCI-PID) (Schaefer et al., 2009)) used in our platform were obtained from Pathway Commons (Cerami, Gross, Demir, & Rodchenkov, 2011). The interaction data were used to divide the network into directed and undirected sub-networks based on the presence or absence of interaction directionality. The interactions obtained from Pathway Commons were resolved into the Simple Interaction Format (SIF), which represents gene/protein interactions as simple pairwise interactions. The genes and their interactions were selected in an inclusive manner meant to represent the maximum possible number of genes in the given study; they were not selected to represent the disease outcome. Undirected interactions such as ‘Co control’, ‘Interacts with’, ‘In same component’ and ‘Reacts with’ formed the undirected network, while directed interactions such as ‘Component of’, ‘Metabolic catalysis’, ‘Sequential catalysis’ and ‘State change’ formed the directed network. Isolated nodes were removed from both networks, and multiple edges between nodes were replaced by a single edge. Details on the development of these networks can be found in our published literature. We observed that the degree distribution of nodes in both networks followed a power law (scale-free network) and that the average width of the directed network was approximately 3 (Vallabhajosyula et al., 2009).
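The split into directed and undirected networks described above can be sketched as follows. This is a minimal illustration and not the platform's code: the function name `split_sif`, the upper-case spellings of the interaction types, the tab-separated three-column record layout, and the choice to skip self-loops are assumptions made for the example.

```python
# Interaction types named in the text. The upper-case spellings are an
# assumption about how they appear in the SIF export.
UNDIRECTED = {"CO_CONTROL", "INTERACTS_WITH", "IN_SAME_COMPONENT", "REACTS_WITH"}
DIRECTED = {"COMPONENT_OF", "METABOLIC_CATALYSIS", "SEQUENTIAL_CATALYSIS", "STATE_CHANGE"}


def split_sif(lines):
    """Split SIF records (source, type, target) into directed and undirected edge sets."""
    directed, undirected = set(), set()
    for line in lines:
        parts = line.strip().split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed records
        source, kind, target = parts
        if source == target:
            continue  # assumption: self-loops are not kept in this sketch
        if kind in DIRECTED:
            # Using a set collapses multiple edges between the same ordered pair.
            directed.add((source, target))
        elif kind in UNDIRECTED:
            # frozenset ignores orientation, so A-B and B-A become one edge.
            undirected.add(frozenset((source, target)))
    return directed, undirected


if __name__ == "__main__":
    records = [
        "A\tSTATE_CHANGE\tB",
        "A\tSTATE_CHANGE\tB",      # duplicate edge, collapsed
        "B\tINTERACTS_WITH\tC",
        "C\tINTERACTS_WITH\tB",    # same undirected edge, collapsed
    ]
    directed_edges, undirected_edges = split_sif(records)
    print(len(directed_edges), len(undirected_edges))
```

Because only nodes that appear in at least one edge are ever stored, isolated nodes are absent from both edge sets by construction.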

V - 4.3 Identifying “Creative elements”

Creative elements are genes that are usually bottlenecks in the network; they are usually hubs or are connected to important party hubs in the system. For each selected hub, we identified all simple paths in the directed network between that hub and all other selected hubs. Genes separated by a maximum of 4 steps, which is approximately the average diameter of the network, were identified and selected for further analysis. Individual hub-subgraphs were created for every unique disease condition and every unique hub using the genes identified from the path analysis. To identify paths along which there is an increased probability of information flow, a Markov chain process was applied. For the Markov process, the nodes and edges of each disease-condition hub-subgraph were assigned the disease-condition-specific probability of change values, and the final value of the 5-step transition probability was obtained. To create a robust control for this Markov chain process, 1000 hub-subgraphs were generated whose node and edge values were assigned randomly, based on the distribution of values obtained from the original disease network, and the 5-step transition probability was calculated for every node in the network. Disease-specific genes whose transition probability values were greater than 3 standard deviations of the random-value networks were selected for the next level of analysis.
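A rough sketch of the Markov chain step and its randomized control is given below. It is illustrative only and rests on several assumptions that the text does not fix: the transition weight of an edge is taken as the edge POC multiplied by the POC of the target node, rows are normalized to sum to one, nodes without outgoing weight are treated as absorbing, and "greater than 3 standard deviations" is read as exceeding the random mean by more than three standard deviations. All function names are hypothetical.

```python
import random
import statistics


def transition_matrix(nodes, edges, node_poc, edge_poc):
    """Row-normalise edge_poc * node_poc[target] into a transition matrix."""
    index = {name: i for i, name in enumerate(nodes)}
    size = len(nodes)
    matrix = [[0.0] * size for _ in range(size)]
    for source, target in edges:
        matrix[index[source]][index[target]] = edge_poc[(source, target)] * node_poc[target]
    for i, row in enumerate(matrix):
        total = sum(row)
        if total > 0:
            matrix[i] = [value / total for value in row]
        else:
            matrix[i][i] = 1.0  # assumption: no outgoing weight means absorbing
    return matrix


def step_distribution(matrix, start, steps=5):
    """Probability of being at each node after `steps` transitions from `start`."""
    size = len(matrix)
    dist = [0.0] * size
    dist[start] = 1.0
    for _ in range(steps):
        dist = [sum(dist[i] * matrix[i][j] for i in range(size)) for j in range(size)]
    return dist


def significant_nodes(nodes, edges, node_poc, edge_poc, hub, trials=1000, seed=0):
    """Nodes whose 5-step probability exceeds the random mean by more than 3 SD."""
    rng = random.Random(seed)
    start = nodes.index(hub)
    observed = step_distribution(transition_matrix(nodes, edges, node_poc, edge_poc), start)
    node_pool = list(node_poc.values())   # empirical distribution of node scores
    edge_pool = list(edge_poc.values())   # empirical distribution of edge scores
    samples = [[] for _ in nodes]
    for _ in range(trials):
        rand_nodes = {name: rng.choice(node_pool) for name in nodes}
        rand_edges = {edge: rng.choice(edge_pool) for edge in edges}
        dist = step_distribution(transition_matrix(nodes, edges, rand_nodes, rand_edges), start)
        for i, value in enumerate(dist):
            samples[i].append(value)
    selected = []
    for i, name in enumerate(nodes):
        mean = statistics.mean(samples[i])
        spread = statistics.pstdev(samples[i])
        if observed[i] > mean + 3 * spread:
            selected.append(name)
    return selected
```

The random networks keep the topology fixed and resample only the scores, which matches the description of the control as a network with values drawn from the distribution of the original disease network.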

V - 5 Methods

V - 5.1 Probability of change (POC) for nodes and edges:

For pairwise comparison of inflammatory disease conditions, such as control and sepsis day 1, the POC score for each gene or microarray probe was calculated using the Bradley–Terry model. The Bradley–Terry–Luce (BTL) model for paired comparison, which gives the probability that a gene’s expression in control ‘a’ is higher than that in sepsis day 1 ‘b’, assumes that for any element $a \in A$ there exists a real number $\pi_a$ such that, for all $a, b \in A$,

$$p(a, b) = \frac{\pi_a}{\pi_a + \pi_b}$$

where $p(a, b)$ is the probability that ‘a’ is chosen when $(a, b)$ is presented.

Using the BTL model, the probability of perturbation of a particular gene was computed from the abilities of the individual expression values in the pairwise comparison of disease conditions:

$$\mathrm{POC} = \frac{p(S_1)}{p(S_1) + p(S_2)}$$

A POC score of zero represents very low likelihood of alteration in gene’s expression levels, whereas a score of one represents high likelihood of alteration in gene expression levels between the two conditions.

In microarray genomic data, where certain genes are represented by more than one probe, the probe with the highest POC value for the given pair of conditions is used. More details about the background and implementation can be found in our published literature.
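Two small pieces of this section can be illustrated in code: the BTL choice probability and the rule of keeping the highest-scoring probe per gene. The sketch below is not the published implementation. The function names, the dictionary-based inputs and the probe and gene identifiers are hypothetical, and the way abilities are estimated from expression values is deliberately left out.

```python
def btl_probability(ability_a, ability_b):
    """Bradley-Terry-Luce probability that item a is chosen over item b."""
    if ability_a <= 0 or ability_b <= 0:
        raise ValueError("abilities must be positive")
    return ability_a / (ability_a + ability_b)


def collapse_probes(probe_poc, probe_to_gene):
    """For each gene, keep the highest POC among the probes that map to it."""
    gene_poc = {}
    for probe, poc in probe_poc.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # probe without a gene annotation is skipped
        if gene not in gene_poc or poc > gene_poc[gene]:
            gene_poc[gene] = poc
    return gene_poc


if __name__ == "__main__":
    print(btl_probability(3.0, 1.0))
    scores = {"probe_1": 0.2, "probe_2": 0.8, "probe_3": 0.5}
    mapping = {"probe_1": "GENE_X", "probe_2": "GENE_X", "probe_3": "GENE_Y"}
    print(collapse_probes(scores, mapping))
```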

V - 5.2 Temporal Causality for Node and Edge Scores

The dataset used in this analysis contains data obtained at two time points (day 1 and day 3). This enabled us to use the information present in time-series expression data to improve the probability of perturbation scores. We multiplied the probabilities of perturbation obtained over the course of the disease. This improves the sensitivity of identifying differential changes through comparison with similar-but-not-“same” conditions. It also provides a basis for estimating the probability with which node and edge probabilities change between two conditions.
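As a minimal sketch of the multiplication step, assuming the per-comparison POC scores are held in dictionaries keyed by gene (or by edge), the temporal score could be computed as below. The function name `temporal_poc` and the rule of keeping only keys scored in both comparisons are assumptions of this example.

```python
def temporal_poc(poc_control_day1, poc_day1_day3):
    """Multiply POC scores from two consecutive pairwise comparisons."""
    shared = poc_control_day1.keys() & poc_day1_day3.keys()
    return {key: poc_control_day1[key] * poc_day1_day3[key] for key in shared}


if __name__ == "__main__":
    first = {"GENE_X": 0.8, "GENE_Y": 0.5}    # control vs day 1 (made-up values)
    second = {"GENE_X": 0.5, "GENE_Z": 0.9}   # day 1 vs day 3 (made-up values)
    print(temporal_poc(first, second))
```

Under this reading, a gene scores highly only if it changes with high probability in both comparisons, which is why genes with a large but inconsistent change are down-weighted.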

V - 5.3 Hub Interaction Score

In our analysis, key hubs are defined as hubs that display large changes in the expression values of the hub gene and its immediate interacting partners. The larger the change between two conditions, the greater the importance of the hub in rewiring the network topology. We introduce a metric that captures the change in the distribution of a hub and its immediate partners between two conditions and that is not overly biased by the degree of the hub. This was implemented using a modified version of the earth mover’s distance (EMD). EMD is commonly used in content-based image retrieval algorithms to describe and summarize different features of an image. For example, the one-dimensional distribution of image intensities describes the overall brightness content of a grayscale image, and a three-dimensional distribution can play a similar role for color images. The input for the EMD calculation is composed of signatures constructed from probability of change values at different conditions. An individual condition’s signature is composed of the unique edge probability of change score (distance) and the interacting partner’s probability of change score (weight) for all of the hub’s interacting partners.

The signature for control is defined on the same hub-neighborhood topology but with values replaced by the average change of nodes and edges for the given condition. The distance metric is calculated for all the hubs in the system for a given condition with respect to their corresponding control signatures.

EMD calculates the minimum cost of transforming the distribution of one condition into the distribution of another, making it an attractive method for studying activity around a hub.

The specific implementation follows this approach:

● $P_i$, $Q_j$: bipartite network representatives (clusters); $w_{p_i}$ and $w_{q_j}$ are the weights of the clusters

● $P$ (control signature) $= \{(P_1, w_{p_1}), \ldots, (P_m, w_{p_m})\}$

● $Q$ (disease signature) $= \{(Q_1, w_{q_1}), \ldots, (Q_n, w_{q_n})\}$

● EMD can be defined as the cost of “moving supplies” from $P$ to $Q$:

$$\mathrm{EMD}(P, Q) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} d_{ij} f_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}}$$

where $f_{ij}$ is the set of flows and $d_{ij}$ is the distance between two states.

For more information about the implementation details please refer to our published analysis.
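To make the idea concrete, the sketch below computes a plain one-dimensional EMD between two hub signatures. It is a simplified stand-in and not the modified EMD used by the platform: each signature is assumed to be a list of (edge POC, partner POC) pairs, the edge POC is used as the position on a line, the partner POC is used as the weight, and weights are normalized so that both signatures carry unit mass. With unit mass the total flow is one, so the ratio in the EMD definition reduces to the transport cost. The function name `emd_1d` is hypothetical.

```python
def emd_1d(signature_p, signature_q):
    """1-D EMD between signatures given as lists of (position, weight) pairs."""

    def normalise(signature):
        total = sum(weight for _, weight in signature)
        if total <= 0:
            raise ValueError("signature needs positive total weight")
        return [(position, weight / total) for position, weight in signature]

    # Mass from P counts as supply (+), mass from Q as demand (-).
    events = [(position, weight) for position, weight in normalise(signature_p)]
    events += [(position, -weight) for position, weight in normalise(signature_q)]
    events.sort(key=lambda event: event[0])

    cost, surplus, previous = 0.0, 0.0, events[0][0]
    for position, mass in events:
        # Unmatched mass must be carried across the gap to the next position.
        cost += abs(surplus) * (position - previous)
        surplus += mass
        previous = position
    return cost


if __name__ == "__main__":
    control = [(0.10, 0.3), (0.20, 0.3), (0.15, 0.4)]   # made-up signature
    disease = [(0.60, 0.5), (0.80, 0.2), (0.15, 0.3)]   # made-up signature
    print(emd_1d(control, disease))
```

Normalizing the weights is one simple way to reduce the influence of hub degree, since a hub with many partners does not accumulate more total mass than a hub with few.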

Figure 38 Details - Methodology employed to identify biomarkers in the study of sepsis, SIRS and septic shock. POC scores are assigned to genes and their interactions in the “Network setup and POC scoring” step, followed by a path analysis step that is used to identify and quantify changes in biological pathways. Candidate paths are selected by first identifying “hub” genes with significant changes and then finding paths that connect hubs, called hub-to-hub sub-networks. Hub-to-hub sub-networks are then measured against “random networks”: networks with identical topology but randomly assigned scores. Random scores are drawn from an empirical distribution constructed from the calculated POC weights. Genes in paths identified as “significantly changed” are used to extract feature sets. Using our uniquely identified genes, a standard “Feature Selection” step is used to arrive at a subset that sensitively classifies signals that differentiate the two conditions.

FIGURE 38: ALGORITHM AND STUDY DESIGN FOR THE TIME SERIES ANALYSIS OF SEPSIS, SIRS AND SEPTIC SHOCK

V - 5.4 Study Design and Path Analysis

This analysis aims to identify disease-related genes and biomarkers for sepsis, SIRS and septic shock using a time-series microarray dataset. The performance of these biomarkers will be evaluated based on their ability to correctly subtype an unseen dataset from the platform. The dataset contains time-series data in which blood samples for the microarray analysis were obtained on two days separated by a 48-hour interval. The control for this dataset was obtained from pediatric ICU patients who were not suffering from any of the above conditions.

The microarray gene expression data used in this analysis is currently available at Gene Expression Omnibus under accession number GSE13904. The different steps involved in the analysis are illustrated in Figure 38. The samples from the dataset were classified into sepsis day 1, sepsis day 3, SIRS day 1, SIRS resolved day 3, septic shock day 1 and septic shock day 3 based on the metadata available within the GEO file. Samples whose metadata did not fit the above classification were removed from the analysis.

The study was then divided into three broad projects, defined below:

1) Sepsis: this project identifies disease-related genes from time-series samples tagged control, sepsis day 1 and sepsis day 3

2) Septic Shock: this project identifies disease-related genes from time-series samples tagged control, septic shock day 1 and septic shock day 3

3) SIRS Resolved: this project identifies disease-related genes from time-series samples tagged control, SIRS day 1 and SIRS resolved day 3

The results from the individual time-series runs are merged to create a universal sepsis, septic shock and SIRS dataset that captures all the information from the individual datasets. Because our methodology only allows pairwise comparisons of disease conditions, the overall process was divided into disease-condition-specific runs, as detailed in Figure 39. POC values for all the nodes and edges were calculated. From the individual control vs day 1 and day 1 vs day 3 scores, the overall temporal scores were calculated for each disease condition. Once the overall POC scores for all edges and nodes had been computed for a given disease condition, the next step was to identify the dysregulated paths. We used key biological and network properties to reduce the total number of paths analyzed without missing key paths in the network. By studying how the hubs dissipate signals and tracing the changes occurring in these hub-hub sub-networks, we identified key paths and genes that are affected between subtypes. Studying these paths and the genes in them sheds more light on the underlying disease physiology, and the genes can serve as potential disease biomarkers. One of the most important aspects of the analysis is that the number of genes deemed important between two disease conditions is decided by their expression values and their network topology and not by any statistical criteria. Pathway analysis is then performed on the overall models to identify the disease-related genes in sepsis, SIRS and septic shock.

We wanted to make sure that the biomarkers we identified were not clustered in a small region of the overall network. We accomplished this by identifying feature ‘reporter’ genes for sub-networks using a feature selection algorithm and then combining the reporter genes from all the combination runs to create a much larger list of genes. We re-ran the feature selection algorithm with cross-validation until we identified the minimum number of genes that could effectively differentiate the subtypes for a blind dataset.

Feature selection with elimination was then performed on all the disease-related genes combined to identify the biomarkers that best differentiate the disease conditions. Heat maps of the biomarkers are generated, and the performance of the biomarkers in classifying these disorders is evaluated.

FIGURE 39: STUDY DESIGN FOR SEPSIS, SIRS AND SEPTIC SHOCK ANALYSIS

Figure 39 Details: Using path analysis performed for each subtype pair, the disease-related genes were identified. Sub-network feature selection algorithms identified biomarkers that can effectively classify disease subtypes.

V - 6 Results

The main aim of the study is to identify and rank the most differentially modulated paths along the various pathways of sepsis, SIRS and septic shock while taking into consideration the time course of the disease. From these modulated paths, we wish to identify reliable molecular biomarkers that can consistently differentiate patients suffering from sepsis, SIRS and septic shock based on their blood samples. In addition, the overall efficacy of time-series POC in generating biomarker candidates with discriminatory power is evaluated using standard methods of machine learning.

FIGURE 40: DISEASE PROFILE OF SEPSIS

Figure 40 Details: The sepsis dataset is composed of control, sepsis day 1 and sepsis day 3 samples. Sub-figure a describes the temporal order of the samples. Sub-figures b and c plot the variations in POC scores and absolute fold change (AFC) for all gene expression values between control and day 1 and between control and day 3, respectively. Sub-figure d plots the variation of POC scores between day 1 and day 3 of sepsis, while sub-figure e plots the overall temporal POC scores with the AFC on day 3.

V - 6.1 Performance of POC in time-series analysis

Probability of Change (POC) is a logistic regression model that employs a game-theory-based model to calculate the probability with which a given gene’s or probe’s expression value changes, either up or down, between two different disease conditions. In our previously published study, we examined the relationship between POC scores and more commonly used ‘frequentist metrics’ such as p-value and absolute fold change (AFC). In that study, we demonstrated that POC provides an effective measure for capturing both the absolute fold change and the significance (p-value) of the genes, while preserving the distinct disease profiles between disease conditions. Furthermore, we identified several prominent genes with high POC that would have been eliminated by many frequentist methods because they showed small fold changes (less than 2-fold difference).

FIGURE 41: DISEASE PROFILE OF SEPTIC SHOCK

Figure 41 Details: Sub-figures a and b plot the variations in POC scores and absolute fold change (AFC) for all gene expression values between control and septic shock day 1 and day 3, respectively. Sub-figure c plots the variation of POC scores between day 1 and day 3 of septic shock, while sub-figure d plots the overall temporal POC scores with the AFC on day 3.

In the current study, we further extend the concept of POC to capture features from time-series microarray datasets. The microarray expression data used in this analysis was obtained from sepsis, septic shock and SIRS patients on day 1 and day 3 after admission to the ICU. The control data was obtained from patients who were admitted to the ICU but were not diagnosed with any inflammatory condition. The control data is considered the first time point in our analysis, while day 1 and day 3 are considered time points two and three, respectively, thereby providing a linear progression of time as shown in Figure 40a.

FIGURE 42: DISEASE PROFILE OF SIRS

Figure 42 Details: Sub-figures a and b plot the variations in POC scores and absolute fold change (AFC) for all gene expression values between control and SIRS day 1 and day 3, respectively. Sub-figure c plots the variation of POC scores between day 1 and day 3 of SIRS, while sub-figure d plots the overall temporal POC scores with the AFC on day 3.

FIGURE 43: PLOTS DETAILS THE POC VS P-VALUE SCORES FOR THE TIME-SERIES SEPSIS ANALYSIS

FIGURE 44: PLOTS DETAILS THE POC VS P-VALUE SCORES FOR THE TIME-SERIES SEPTIC SHOCK ANALYSIS

Figures 40b-e, 41a-d and 42a-d plot the log2-normalized AFC of gene expression values with the corresponding variations in POC values between control, day 1 and day 3, and for the overall change, for patients diagnosed with sepsis, septic shock and SIRS, respectively. From these figures, it is evident that there are a large number of genes whose expression levels change consistently (AFC > 2 and POC > 0.5) for control vs day 1 and control vs day 3 in sepsis (Figure 40b and c), septic shock (Figure 41a-b) and SIRS (Figure 42a-b). It is also evident that a larger number of genes are consistently over- or under-expressed (POC > 0.5 and AFC > 2) in septic shock (Figure 41a-b) than in the other conditions, leading to more pathway dysregulation.

FIGURE 45: PLOTS DETAILS THE POC VS P-VALUE SCORES FOR THE TIME-SERIES SEPTIC SHOCK ANALYSIS

The overall expression profile of genes in septic shock also differs from that of SIRS and sepsis. The sepsis and SIRS plots show a narrow, tighter profile with a nearly linear relationship between POC and AFC, whereas the septic shock profile is more diffuse, with more genes having larger AFC but lower POC scores. This usually arises from inconsistent changes in expression levels, as shown in our previous publication. It may also reflect the heterogeneity of the patient samples in the case of septic shock diagnosis.

Next, we use POC to understand how the genes change their expression levels during the course of the disease. Between day 1 and day 3, a large number of SIRS (Figure 42c) and sepsis (Figure 40d) genes show prominent POC changes (POC around 0.4) when compared to septic shock (Figure 41c), which shows very little change (POC around 0.2). It is essential to capture these changes, as these genes might play an important role in modulating pathways that help in the patient’s recovery, as in the case of SIRS. These temporal changes are captured in a single score using temporal causality. Figure 40e plots POC vs AFC, where the AFC is taken from the day 3 expression levels while the POC is a single score that is a composite of the control to day 1 and day 1 to day 3 POC scores. Similar plots were also generated for septic shock (Figure 41d) and SIRS (Figure 42d). Figure 40e clearly demonstrates that the overall POC scores for sepsis are an amalgamation of the POC scores in Figure 40b, c and d. The overall POC is scored based on how consistently a gene’s expression level has changed through the course of the disease. This is evident in Figure 40e, which has fewer genes with high AFC and low POC scores than Figures 40b and 40c, while maintaining the overall profile of gene expression changes across these conditions. Hence it can be seen that POC, being a probabilistic model, succinctly captures time-series changes in gene expression, which can then be used effectively in a network setting.

FIGURE 46: DISEASE-RELATED GENES FROM ANALYSIS

Figure 46 Details: Sub-figure a details the exact count of disease-related genes identified in each analysis based on the individual time-series analysis with control. The common set of disease-related genes was then identified across the different disease conditions and is represented as a Venn diagram in sub-figure b.

FIGURE 47: PLOTS SHOWING THE DISTRIBUTION BETWEEN EMD SCORES, POC AND NODE DEGREE RESPECTIVELY

V - 6.2 Path Analysis and Functional Genomics

The biggest difference between our analysis and a ‘frequentist’ analysis is that we did not rely on cut-offs for determining disease-related genes. In our previously published work, we demonstrated that the platform consistently identifies similar disease-related genes even across different expression analysis methodologies such as microarray and RNA-Seq. The first step in the identification of disease-related genes is to find key hubs in these networks. In this study, we classified as hubs the genes in the top 10% of the degree distribution, following existing literature on defining hubs in networks. The significant paths and disease-related genes were identified between these hubs. The number of disease-related genes identified depends entirely on the number of dysregulated paths and does not rely on any cut-offs, as is the case for frequentist methods. If a disease condition is brought about by a large number of dysregulated paths, as in the case of septic shock, the number of disease-related genes is high compared to conditions like sepsis.
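The hub definition used here (genes in the top 10% by degree) can be sketched as follows. The function name `top_degree_hubs`, the edge-list input, the alphabetical tie-breaking and the rule of always returning at least one hub for a non-empty network are assumptions of this example rather than details taken from the platform.

```python
from collections import Counter


def top_degree_hubs(edges, fraction=0.10):
    """Return the nodes whose degree places them in the top `fraction` of nodes."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    # Highest degree first; ties are broken alphabetically for reproducibility.
    ranked = sorted(degree, key=lambda node: (-degree[node], node))
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


if __name__ == "__main__":
    # A star-shaped toy network: one centre connected to ten leaves.
    star = [("CENTRE", "LEAF_%d" % i) for i in range(10)]
    print(top_degree_hubs(star))
```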

TABLE 11: SELECTED PATHWAYS THAT SHOWED SIGNIFICANT CHANGES DURING SEPSIS TEMPORAL ANALYSIS

Disease Pathways: Sepsis

ID  Pathway                                           FDR
1   Signalling by NGF                                 7.466E-36
2   EGFR1 Signaling Pathway                           3.283E-31
3   Metabolism of lipids and lipoproteins             2.058E-29
4   IL-3 Signaling Pathway                            2.146E-29
5   B Cell Receptor Signaling Pathway                 3.682E-29
6   NGF signalling via TRKA from the plasma membrane  2.109E-28
7   Neurotrophin signaling pathway                    6.606E-28
8   Metabolism                                        3.819E-26
9   Proteoglycans in cancer                           5.352E-26
10  MAPK signaling pathway                            8.954E-26
11  BDNF signaling pathway                            1.184E-25
12  IL-6 Signaling Pathway                            2.375E-25
13  ErbB1 downstream signaling                        1.072E-24
14  MAPKinase Signaling Pathway                       9.189E-24
15  Ras Pathway                                       1.276E-23

Figure 46a lists the number of disease-related genes identified as significant in the SIRS, sepsis and septic shock analyses: septic shock had the most, with 372 disease-related genes, followed by SIRS with 219 and sepsis with 206. Figure 46b shows the overlapping disease-related genes between the different analyses. All three conditions share a significant number (around 130) of disease-related genes, which is to be expected as these conditions share similar inflammatory pathways. Conditions like sepsis share around 90 percent of their disease-related genes with physiologically similar conditions like septic shock, while sharing only around 70 percent with conditions like SIRS, where inflammation is not brought about by an infection.

TABLE 12: SELECTED PATHWAYS THAT SHOWED SIGNIFICANT CHANGES DURING SEPTIC SHOCK TEMPORAL ANALYSIS

Disease Pathways: Septic Shock

ID  Pathway                                           FDR
1   NGF signalling via TRKA from the plasma membrane  1.127E-17
2   Signalling by NGF                                 3.383E-16
3   Signaling Pathways in Glioblastoma                4.385E-16
4   Neurotrophin signaling pathway                    6.827E-16
5   MAPKinase Signaling Pathway                       2.350E-15
6   Ras Pathway                                       6.465E-15
7   MAPK signaling pathway                            1.116E-14
8   Prostate Cancer                                   1.792E-14
9   EGF receptor signaling pathway                    2.149E-14
10  B Cell Receptor Signaling Pathway                 3.095E-14
11  Thromboxane A2 receptor signaling                 5.913E-14
12  Metabolism                                        6.236E-14
13  IL-3 Signaling Pathway                            7.600E-14
14  p38 MAPK Pathway                                  7.885E-14
15  Angiogenesis                                      8.641E-14

TABLE 13: SELECTED PATHWAYS THAT SHOWED SIGNIFICANT CHANGES DURING SIRS TEMPORAL ANALYSIS

Disease Pathways: SIRS

ID  Pathway                                                                 FDR
1   Neurotrophin signaling pathway                                          1.689E-23
2   BDNF signaling pathway                                                  4.260E-21
3   NGF signalling via TRKA from the plasma membrane                        5.750E-21
4   B Cell Receptor Signaling Pathway                                       7.872E-20
5   Prostate Cancer                                                         1.136E-19
6   Insulin Signaling                                                       1.620E-19
7   Signaling Pathways in Glioblastoma                                      2.198E-19
8   Signalling by NGF                                                       6.044E-19
9   EGFR1 Signaling Pathway                                                 1.728E-18
10  Ras Pathway                                                             2.304E-18
11  Thromboxane A2 receptor signaling                                       6.670E-18
12  IL-3 Signaling Pathway                                                  6.841E-18
13  Signaling events mediated by Hepatocyte Growth Factor Receptor (c-Met)  8.152E-18
14  MAPKinase Signaling Pathway                                             2.873E-17
15  EGF receptor signaling pathway                                          3.231E-17

Traditional genomic analyses of these conditions have not yet yielded useful biomarkers. The original authors of the expression analysis identified modulated genes, and their functional analysis indicated that genes related to zinc and other heavy metals were modulated. This is important because zinc homeostasis is known to play a key role in the normal function of the innate and adaptive immune systems. Also, genes related to the IL-10 and IL-4 pathways were unregulated in patients with septic shock, while genes from the TGF-β pathway were uniquely unregulated on day 3 in septic shock patients. In order to better understand the pathways that were dysregulated by the disease-related genes from our analysis, we conducted functional pathway analysis using the DAVID online tools. The top dysregulated pathways from the functional analyses of sepsis, septic shock and SIRS are listed in Tables 11, 12 and 13, respectively. Inflammatory pathways such as the IL-3, B cell receptor signaling and neurotrophin pathways are common among all three conditions.

FIGURE 48: FEATURE SELECTION OF INFLAMMATORY BIOMARKERS

Figure 48 Details: A recursive feature elimination example with automatic tuning of the number of features selected with cross-validation. Our analysis identified 61 disease-related genes that gave the highest number of correct classifications from the overall list of 418 genes.

Functional analysis was carried out for three separate sets of genes, (i) sepsis (disease) related genes, (ii) septic shock (disease) related genes and (iii) SIRS (disease) related genes, and for (iv) all disease-related genes combined.

(i) Sepsis (disease) related genes

During the functional analysis of the sepsis disease-related genes with DAVID and the WEB-based GEne SeT AnaLysis Toolkit (J. Wang, Duncan, Shi, & Zhang, 2013), over a third of the disease-related genes were found to be related to metabolic and cellular processes such as apoptosis. Our analysis also found many genes associated with immunological processes: multiple mitogen-activated kinase genes, multiple interferon regulatory factor genes, multiple tyrosine-protein kinase genes, tumor necrosis factor, PTK2, and CASP3 all showed significant modulation. The disease association analysis of the overall sepsis results using the WEB-based GEne SeT AnaLysis Toolkit (J. Wang et al., 2013) revealed multiple genes related to stress (CASP3, MTOR, MAPK14), Fanconi anemia (FANCI, FANCD2, NPM1, EIF2AK2), drug interaction (PXN, CEBPA, PTK2, PPIA, SMAD3) and immunological deficiency syndromes (LCK, ADA, PTPRC, FYN).

Our analysis of the disease-related genes identified well-known inflammatory pathways, including multiple interleukin pathways (IL1, IL2, IL4, IL5, IL6, IL10 and IL12) (H. Zhao, Li, Lu, Sheng, & Yao, 2015), the interferon-gamma pathway, TGF beta, and MAPK signaling (Beutler, 2004). Multiple TLR pathways underwent significant changes, including the TLR 2, 3, 5, 7, 8 and 10 pathways, all of which have been shown to play an important role in sepsis (Tsujimoto et al., 2008). ARF6 and its downstream pathways play an important role in stabilizing the microvasculature. Further analysis of the functional pathways showed that multiple pathways related to ARF6 trafficking and its downstream events were dysregulated; these pathways have been shown to be significant during sepsis (London et al., 2010). Previous studies have shown that microvascular leakage is a major contributor to morbidity and mortality during sepsis (Goldenberg, Steinberg, Slutsky, & Lee, 2011). The S1P1 pathway, which increases cortical actin via PAL, Rac and cortactin and in turn increases junctional targeting of E-cadherin, showed significant changes in our analysis. A number of pathways related to the EGF receptor and its network of downstream signaling also exhibited significant changes. Research has shown that EGFR and its downstream network have a major impact on inflammatory reactions and the innate immune system through multiple mechanisms (Pastore, Mascia, Mariani, & Girolomoni, 2008), including upregulation of multiple TLR receptors.


(ii) Septic Shock (disease) related genes

The WEB-based GEne SeT AnaLysis Toolkit (J. Wang et al., 2013) was used to obtain biological insights about the 372 disease-related genes that significantly differentiated samples from patients with septic shock. Abnormalities in the immune and lymphatic systems were identified. Some of the genes identified in our analysis, such as BCR, ADAM17, BTK, ZAP70, CFTR and ITK, have been shown to cause physiological abnormalities in leukocytes (Schibler, 2012). Some genes, such as CFTR, were found to be associated with abnormalities in the cardiovascular system, respiratory system, and metabolic/homeostatic control system. Functional pathway analysis using DAVID revealed inflammatory pathways very similar to those found for sepsis, including IGF1, EGF receptor signaling and its downstream pathways, ARF6 signaling pathways, S1P1 pathways, interleukin pathways and the TGF pathway. The analysis also identified pathways that were not present in the sepsis analysis, namely the RAF1 pathway, which plays a role in zinc homeostasis. Zinc transporter 1 (ZnT1) binds to the regulatory region of RAF-1 and activates it. The activation is likely to occur through the lowering of cytosolic Zn levels, which inhibits their reaction (Kambe, Tsuji, Hashimoto, & Itsumura, 2015). In septic shock, cytosolic zinc levels fall drastically, leading to the activation of RAF-1 and its downstream pathways. The dysregulation of Zn homeostasis-related genes during septic shock was also identified by the original authors in their genomic analysis (Wong et al., 2009), which is in accordance with our analysis.

(iii) SIRS (disease) related genes

The overall SIRS analysis identified 219 disease-related genes specific to SIRS. The patients from whom the samples were obtained were alive and recovering on day 3 of the analysis. A large number of the genes that showed significant changes in our analysis were associated with the immune system (33 genes), neoplasm (25 genes), blood and blood-forming tissues (30 genes) and the lymphatic system (30 genes). The functional pathway analysis using DAVID identified some novel cross-regulation between the immune system and energy metabolism that has not been documented in the existing literature. Pathways including the mTOR and LKB1 signaling pathways were significantly dysregulated, with the LKB1 pathway among the top pathways identified during the functional analysis. AMPK and its activators LKB1 and SIRT1, along with the Foxo family of transcription factors, are regulators of immune activation and Treg function, both of which showed significant pathway dysregulation. These pathways directly or indirectly exert their influence on the mTOR pathway (Procaccini, Galgani, De Rosa, & Matarese, 2012), which may play an important role in patient recovery.

(iv) All disease-related genes combined

When we conducted phenotypic functional analysis on the combined disease-related genes from all three forms of the disease using WebGestalt, it revealed the most significant abnormalities in the lymphatic system, blood and blood-forming tissues, the endocrine system, blood neoplasms and neoplasms of the nervous system. Bone density and metabolism/homeostasis, especially calcium homeostasis, also showed significant phenotypic abnormalities. The functional pathways of the combined genes were similar to the pathways identified individually in the sepsis, SIRS and septic shock analyses.

From the combined list of disease-related genes, using recursive feature elimination, we identified 61 biomarkers that could reclassify samples from an unseen test dataset. Many of these 61 biomarkers have been shown to play an important role in the development of the immune system or during inflammation. We discuss a few of these biomarkers in detail below.

• ADA: Adenosine deaminase (also known as adenosine aminohydrolase, or ADA) is an enzyme (EC 3.5.4.4) involved in purine metabolism whose primary role in humans is the development and maintenance of the immune system. Expression of ADA was higher in patients suffering from sepsis but not in the other two conditions. ADA deficiency in humans has been linked to severe combined immunodeficiency (SCID), which results in defective T-cell receptor signaling, thymic cell death and pulmonary inflammation (Haskó & Cronstein, 2004).

• SGK1: Serum glucocorticoid regulated kinase 1 (SGK1) is an enzyme that plays an important role in the cellular stress response. Like ADA, SGK1 shows higher gene expression levels during sepsis than in the other two conditions, and its activities influence the regulation of transport, hormone release, neuroexcitability, inflammation, cell proliferation and apoptosis. Elevated levels of SGK1 enhance Th17 differentiation via IL-23R (Graham & Xavier, 2013).

• FANCD2: FANCD2 is part of the Fanconi anemia complementation group, which includes many other genes such as FANCA, FANCB, FANCC and FANCD1 (BRCA2). As the Fanconi anemia proteins play an important role in the maintenance of hematopoietic stem cells, genetic disorders of this region are characterized by bone marrow failure and cancer predisposition. Recent studies have also linked cytokine hypersensitivity of hematopoietic cells to apoptotic cues, which can explain its role in cellular and immunity remodeling (Sebastian-Leon et al., 2014; Sejas et al., 2007).

• GSK3B: Glycogen synthase kinase 3 beta is an important regulator of the balance between pro- and anti-inflammatory cytokine and chemokine production in the central nervous system. It has also been shown to play a role in the innate and adaptive immune responses by influencing the proliferation, differentiation and survival of T cells (Beurel, Michalek, & Jope, 2010).

• MMP9: Matrix metalloproteinase-9 plays a role in the migration of inflammatory cells across the extracellular matrix. MMP-9 deficiency has been shown to protect against mortality in an endotoxic shock model in mice, and selective MMP-9 blocking has been shown to be a possible therapeutic treatment for sepsis (Renckens et al., 2006).

• PRKCA: Protein kinase C alpha is a particularly important mediator of intracellular immune signaling. PKC-regulated signaling pathways play a significant role in many aspects of immune responses, from the development, differentiation, activation and survival of lymphocytes to macrophage activation (Tan & Parker, 2003).

• DGAT2: Diacylglycerol O-acyltransferase 2 is responsible for the synthesis of triglycerides and plays a role in overall fatty acid metabolism. DGAT, which is present in macrophages and is involved in lipid storage capacity, has been shown to play a role in macrophage activation (Roy et al., 2013).

As a future direction for the study, we plan to conduct RT-PCR-based experiments to confirm some of the biomarkers identified in our analysis.

V - 6.3 Biomarkers and Classification Performance

Ideally, sepsis, septic shock and SIRS biomarkers should reflect the biology of the disease, as evidenced by the biochemical changes that characterize the host response to infection at the cellular and sub-cellular levels. It is unlikely that any single biomarker can satisfy all the needs and expectations of sepsis research and management. Having identified the genes that change between the different inflammatory conditions, the next step in our analysis was to determine the subset of genes that can be used as biomarkers to effectively differentiate the disease conditions.

TABLE 14: BIOMARKERS IDENTIFIED FOR SEPSIS, SEPTIC SHOCK AND SIRS

ID | Gene Name | Gene Symbol
1717 | 7-dehydrocholesterol reductase | DHCR7
6868 | ADAM metallopeptidase domain 17 | ADAM17
79993 | ELOVL family member 7, elongation of long chain fatty acids (yeast) | ELOVL7
2177 | Fanconi anemia, complementation group D2 | FANCD2
6011 | G protein-coupled receptor kinase 1 | GRK1
8826 | IQ motif containing GTPase activating protein 1 | IQGAP1
3717 | Janus kinase 2 | JAK2
8850 | K(lysine) acetyltransferase 2B | KAT2B
8569 | MAP kinase interacting serine/threonine kinase 1 | MKNK1
4905 | N-ethylmaleimide-sensitive factor | NSF
9610 | Ras and Rab interactor 1 | RIN1
50650 | Rho guanine nucleotide exchange factor (GEF) 3 | ARHGEF3
6667 | Sp1 transcription factor | SP1
7272 | TTK protein kinase | TTK
2180 | acyl-CoA synthetase long-chain family member 1 | ACSL1
100 | adenosine deaminase | ADA
55331 | alkaline ceramidase 3 | ACER3
60496 | aminoadipate-semialdehyde dehydrogenase-phosphopantetheinyl transferase | AASDHPPT
6310 | ataxin 1 | ATXN1
814 | calcium/calmodulin-dependent protein kinase IV | CAMK4
1455 | casein kinase 1, gamma 2 | CSNK1G2
1495 | catenin (cadherin-associated protein), alpha 1, 102kDa | CTNNA1
983 | cell division cycle 2, G1 to S and G2 to M | CDK1
993 | cell division cycle 25 homolog A (S. pombe) | CDC25A
80184 | centrosomal protein 290kDa | CEP290
1124 | chimerin (chimaerin) 2 | CHN2
1588 | cytochrome P450, family 19, subfamily A, polypeptide 1 | CYP19A1
10395 | deleted in liver cancer 1 | DLC1
84649 | diacylglycerol O-acyltransferase homolog 2 (mouse) | DGAT2
2909 | glucocorticoid receptor DNA binding factor 1 | GRLF1
2932 | glycogen synthase kinase 3 beta | GSK3B
2887 | growth factor receptor-bound protein 10 | GRB10
8365 | histone cluster 1, H4l; histone cluster 1, H4k; histone cluster 4, H4; histone cluster 1, H4h; histone cluster 1, H4j; histone cluster 1, H4i; histone cluster 1, H4d; histone cluster 1, H4c; histone cluster 1, H4f; histone cluster 1, H4e; histone cluster 1, H4b; histone cluster 1, H4a; histone cluster 2, H4a; histone cluster 2, H4b | HIST1H4A, HIST1H4B
9759 | histone deacetylase 4 | HDAC4
56261 | hypothetical protein KIAA1434 | GPCPD1
3690 | integrin, beta 3 (platelet glycoprotein IIIa, antigen CD61) | ITGB3
4318 | matrix metallopeptidase 9 (gelatinase B, 92kDa gelatinase, 92kDa type IV collagenase) | MMP9
4214 | mitogen-activated protein kinase kinase kinase 1 | MAP3K1
9261 | mitogen-activated protein kinase-activated protein kinase 2 | MAPKAPK2
11343 | monoglyceride lipase | MGLL
10135 | nicotinamide phosphoribosyltransferase | NAMPT
8021 | nucleoporin 214kDa | NUP214
4983 | oligophrenin 1 | OPHN1
5108 | pericentriolar material 1 | PCM1
5347 | polo-like kinase 1 (Drosophila) | PLK1
3778 | potassium large conductance calcium-activated channel, subfamily M, alpha member 1 | KCNMA1
5663 | presenilin 1 | PSEN1
5578 | protein kinase C, alpha | PRKCA
5562 | protein kinase, AMP-activated, alpha 1 catalytic subunit | PRKAA1
5566 | protein kinase, cAMP-dependent, catalytic, alpha | PRKACA
5494 | protein phosphatase 1A (formerly 2C), magnesium-dependent, alpha isoform | PPM1A
5770 | protein tyrosine phosphatase, non-receptor type 1 | PTPN1
5788 | protein tyrosine phosphatase, receptor type, C | PTPRC
5914 | retinoic acid receptor, alpha | RARA
6197 | ribosomal protein S6 kinase, 90kDa, polypeptide 3 | RPS6KA3
6446 | serum/glucocorticoid regulated kinase 1 | SGK1
27316 | similar to RNA binding motif protein, X-linked; similar to hCG2011544; RNA binding motif protein, X-linked | LOC100129585, LOC100131735, RBMX
6850 | spleen tyrosine kinase | SYK
6622 | synuclein, alpha (non A4 component of amyloid precursor) | SNCA
207 | v-akt murine thymoma viral oncogene homolog 1 | AKT1
6714 | v-src sarcoma (Schmidt-Ruppin A-2) viral oncogene homolog (avian) | SRC

To identify the features, all the disease-related genes from the individual analyses were combined and the feature selection algorithm was run on this set of genes. From a total of 418 disease-related genes, we identified 61 feature genes/biomarkers that could effectively differentiate the disease conditions. Figure 48 shows that, with the 61 selected features listed in Table 14, the feature selection algorithm achieved its highest cross-validation score, that is, the fewest misclassifications. Table 14 lists the 61 genes identified from the analysis, while Figure 49 shows a heatmap of the expression levels of these genes across all conditions. All expression levels in this heatmap were preprocessed and rescaled between -1 and 1. The heatmap covers both the training dataset and the blind test dataset; a color bar alongside the heatmap distinguishes the training samples from the test samples, as described in the Figure 49 caption.
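One common way to rescale expression values to the range -1 to 1 is linear min-max scaling, sketched below. The text does not state which rescaling was actually applied, so the choice of min-max scaling, the function name and the sample values are all assumptions made for illustration.

```python
def rescale_to_unit_range(values):
    """Linearly map a list of expression values onto [-1, 1] (min-max scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A gene with constant expression carries no contrast; map it to 0.
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


# Hypothetical expression values for one gene across five samples.
gene_expression = [4.0, 6.0, 8.0, 10.0, 12.0]
scaled = rescale_to_unit_range(gene_expression)
# scaled == [-1.0, -0.5, 0.0, 0.5, 1.0]
```

Scaling each gene separately, as here, puts genes with very different absolute expression levels on a common color scale in a heatmap.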


FIGURE 49: SUB-FIGURE A SHOWS HOW FEATURE SELECTION WAS USED TO IDENTIFY BIOMARKERS. Figure 49 Details: Recursive feature elimination was used with automatic tuning of the number of features selected by cross-validation. Our analysis identified 61 disease-related genes, which give the highest number of correct classifications from the overall list of 418 genes. The performance of these biomarkers on the training and test sets is visualized as a heat map in sub-figure b. The color bar identifies the normalized and rescaled expression values for genes in the heat map. Each column represents one biomarker gene, while each row represents one sample. The orange/teal bar on the left denotes training (orange) and test (teal) samples. Sepsis, septic shock and SIRS samples are color-coded in the adjacent bar. The explanation of the subtype color codes is provided to the right of the heat map.

The genes were clustered on the testing dataset, and the resulting order was then applied to the overall heat map. Tight gene clusters can be seen in the sepsis and SIRS expression values, but these clusters are not prominent in the septic shock samples. This may be due to the highly heterogeneous nature of septic shock compared with sepsis and SIRS. Genes upregulated in sepsis, such as SGK1, PCM1, and CEP290, are related to many inflammatory pathways, such as G-protein coupled receptor, interleukin-associated and EGFR pathways. Modulation of genes such as ADA, on the other hand, is directly linked to the development and maintenance of the immune system.


To test how effective the identified feature genes were at classifying samples, we created a support-vector-based multi-class classifier, which was trained using the training dataset. Once the classifier was trained, we first tested how well it classified the samples of the training dataset. The results of this analysis are presented in Table 15. The table shows that the genes identified all training samples with 100% precision and 100% recall, suggesting that the feature genes could effectively differentiate the disease conditions. We then applied the classifier to the unseen test dataset; these results are also presented in Table 15. Here too, the classifier identified all the samples with 100% precision and recall.

TABLE 15: MULTI-CLASS CLASSIFICATION RESULT FOR THE TEST AND TRAINING DATASET

Subtypes | Training Precision | Training Recall | Training F1-score | Training Support | Test Precision | Test Recall | Test F1-score | Test Support
Sepsis | 1.0 | 1.0 | 1.0 | 14 | 1.0 | 1.0 | 1.0 | 4
SIRS | 1.0 | 1.0 | 1.0 | 36 | 1.0 | 1.0 | 1.0 | 10
Septic Shock | 1.0 | 1.0 | 1.0 | 83 | 1.0 | 1.0 | 1.0 | 22
Avg. / total | 1.0 | 1.0 | 1.0 | 133 | 1.0 | 1.0 | 1.0 | 36
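For reference, the per-class metrics reported in Table 15 follow the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 is their harmonic mean. The sketch below computes them from two label lists. The function name is hypothetical, the class sizes mirror the test-set support column (4, 10 and 22), and the single misclassification in the second example is invented purely to show how the metrics respond to an error.

```python
def per_class_report(true_labels, predicted_labels):
    """Per-class precision, recall, F1 and support from two label lists."""
    pairs = list(zip(true_labels, predicted_labels))
    report = {}
    for cls in sorted(set(true_labels)):
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[cls] = {"precision": precision, "recall": recall,
                       "f1": f1, "support": tp + fn}
    return report


# Class sizes taken from the test-set support column of Table 15.
truth = ["Sepsis"] * 4 + ["SIRS"] * 10 + ["Septic Shock"] * 22

# Perfect predictions reproduce the all-1.0 rows of the table.
perfect = per_class_report(truth, list(truth))

# Invented counter-example: one SIRS sample predicted as Sepsis.
with_one_error = list(truth)
with_one_error[4] = "Sepsis"
imperfect = per_class_report(truth, with_one_error)
```

In the counter-example, the precision for Sepsis drops to 4/5 and the recall for SIRS drops to 9/10, while Septic Shock is unaffected.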


Chapter VI: Summary and Future Directions

Modeling complex diseases using interaction networks is challenging because these disease conditions are emergent phenotypes caused by alterations in more than one gene in combination with environmental and lifestyle conditions. In addition, complex disorders tend to share symptoms with other diseases, making them hard to identify and treat correctly. Existing approaches do not address these challenges completely because most statistical and computational methods are not adequately suited to describing large complex systems. In our methodology, we addressed this by leveraging key biological network properties to identify altered pathways and by characterizing changes between disease phenotypes based on sets of genes that are related along paths.

The main goals in designing the algorithm are listed below.

• Reliable identification of disease-related genes and biomarkers for complex diseases: The biggest challenge in studying complex diseases has been differentiating complex disease subtypes. In this thesis, we have demonstrated the effectiveness of our methodology in differentiating subtypes of breast cancer by correctly classifying unknown breast cancer test samples. We later extended the analysis to study and differentiate unknown samples from sepsis, SIRS and septic shock, with promising results showing reasonable accuracy and precision.

• Minimize the use of arbitrary thresholds: Many commonly used frequentist methods employ thresholds to identify differentially expressed genes. In many cases, the thresholds used are not disease specific, and even slight variations in thresholds can have a severe impact on the genes selected. Our methodology avoided most thresholds by employing a global network analysis and creating random baseline networks to identify both disease-related genes and biomarkers.

• Independence from expression platforms: The two common platforms used to study mRNA expression are microarray and RNA-Seq. Although some studies indicate a strong correlation between the two techniques, significant studies published in highly regarded journals have demonstrated a substantial discordance, indicating that there are problems either in the techniques themselves or in the manner in which the datasets are compared. Internal reproducibility of the RNA-Seq data was greater than that of the microarray data. Many studies have shown that the correlations between the RNA-Seq data and the individual microarrays were low, but correlations between the RNA-Seq values and the geometric mean of the microarray values were moderate. However, in our methodology, we have demonstrated reasonable concordance between RNA-Seq and microarray data with respect to both the POC scores and disease-related gene identification, as demonstrated using the breast cancer dataset.
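The comparison against random baseline networks mentioned under "Minimize the use of arbitrary thresholds" can be illustrated with a short sketch. It flags an observed path score when it lies more than three standard deviations from the mean of scores obtained on random baseline networks, which is the cutoff this chapter reports for the final path-analysis step. The function name and all numeric scores are hypothetical, and the sketch does not reproduce how path scores or baseline networks are actually generated.

```python
from statistics import mean, stdev


def exceeds_baseline(observed_score, baseline_scores, n_sd=3.0):
    """True if the observed score lies more than n_sd sample standard
    deviations away from the mean of the random-baseline scores."""
    mu = mean(baseline_scores)
    sigma = stdev(baseline_scores)
    if sigma == 0:
        # Degenerate baseline: any deviation from the mean counts.
        return observed_score != mu
    return abs(observed_score - mu) > n_sd * sigma


# Hypothetical scores of the same path on five random baseline networks.
baseline = [0.48, 0.50, 0.52, 0.49, 0.51]

strong_path = exceeds_baseline(0.60, baseline)  # far outside the baseline spread
weak_path = exceeds_baseline(0.53, baseline)    # within three standard deviations
```

Because the cutoff is expressed relative to the spread of the baseline itself, it adapts to each dataset rather than relying on a fixed fold-change or p-value threshold.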

In the current iteration of our methodology, our primary aim was to standardize the methodology and demonstrate its value in identifying disease-related genes and biomarkers in a platform-independent manner. The next step in the evolution of our methodology would be to identify ways to scale the platform so that more people are able to access and use it in their analyses. This would include porting the infrastructure to run on cloud platforms and improving the performance and features of the algorithm. We also hope to improve our algorithm by providing the infrastructure needed to include multiple datasets, study miRNA-mRNA interactions, and study disease-related genes and pathways across species. Some of the known limitations of our approach are discussed below.

• Sample Size: The reliability of the logistic regression model (Bradley-Terry model) used in the POC score calculation depends on the sample size. Larger sample sizes (typically larger than 20) improve the accuracy of POC scores significantly. However, the expense of obtaining larger datasets motivates the collection of smaller samples. Thus, careful examination of POC values may be required in studies with small sample sizes, making our approach similar to existing semi-automated approaches. We can overcome this limitation by integrating POC scores from multiple independent datasets, thereby improving the reliability of each gene's POC score. The implementation details for integrating multiple datasets can be found in the algorithm improvement section.

• Threshold in Path Analysis: Our approach performs a global network analysis that does not make primary use of thresholds for the selection of genes. However, we note that our computational steps make secondary use of a threshold in order to speed up processing and to streamline the interpretation of the final path analysis results. The threshold used in the final step to identify pathways outside of three standard deviations is explicitly noted as the only threshold used. Although we selected three standard deviations based on multiple iterative studies, it would be necessary to study the sensitivity of the algorithm to variation in this cutoff for other disease conditions, especially when the disease genotypes are very similar.

• Network: Another limitation of our analysis is the lack of curated network information for all the genes in the genomic datasets. Although this limitation applies to any method of analysis, there appears to be a stronger bias for genes involving directed interactions. Only 25-30% of the genes were used in the path analysis because the remaining genes either did not have curated directed information or had no interacting partner in the network. As a result, it is possible that several disease-related genes were excluded from the path analysis because they were not part of the directed network. It is important to identify approaches to integrate some or all of these genes, as there may be vital disease-related biomarkers among the excluded genes. Including these genes may also improve recall and precision during classification, as some of them may show more robust changes between disease conditions. One possible solution would be to extend our algorithm to include undirected interactions between genes and thereby extend the analytic capabilities of our approach.

• Missing Gene Expression Values: Expression analyses do not always include all known genes. As we currently do not perform data imputation, these genes are excluded from our analysis even though they may be topologically vital proteins. One approach to overcome this limitation is to impute values for the missing genes in the network. The imputed POC score for a missing gene can be derived from the mean value of the other genes in the network. If these genes are identified as significant in the path analysis, their importance can be examined using follow-up genomic analysis.
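The mean-based imputation suggested under "Missing Gene Expression Values" could look like the following sketch. It fills in a missing gene's POC score with the mean of the scores that are available for the other genes in the network. Whether the mean should be taken over the whole network or only over a gene's direct neighbours is not specified in the text; the whole-network reading is an assumption here, as are the function name and the example scores.

```python
from statistics import mean


def impute_missing_scores(scores, network_genes):
    """Return a score for every gene in the network, filling gaps with the
    mean of the scores that are available for the other network genes."""
    known = [scores[g] for g in network_genes if g in scores]
    if not known:
        raise ValueError("no gene in the network has a measured score")
    fill_value = mean(known)
    return {g: scores.get(g, fill_value) for g in network_genes}


# Hypothetical POC scores; GSK3B is in the network but was not measured.
measured = {"ADA": 0.9, "SGK1": 0.5, "MMP9": 0.7}
network = ["ADA", "SGK1", "MMP9", "GSK3B"]
completed = impute_missing_scores(measured, network)
```

Measured scores are left untouched, so an imputed gene can only enter the path analysis with a neutral, average score and would still need follow-up confirmation if it turns out to be significant.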

VI - 1 Software Architecture:

The current version, which was used for standardization using breast cancer, was mainly developed to prototype our methodology. Improvements in performance, user interface, and packaging will enable the prototype to better scale for large-scale use by the scientific and research communities. The next step, in terms of architecture development, is to refactor sections of the code to enable us to scale our application, make it cloud compatible, and open source the platform so that other researchers can take this forward.

There are several aspects that need attention before the software can scale to multiple users and some of them have been listed below:

VI - 1.1 Cloud Service:

One of the biggest challenges in scaling the platform is the computationally intensive nature of the application. During the development of the platform we had access to our own computing cluster, owned by Dr. Hamid, which was sufficient for most of our analyses. This involved running a single analysis at a time across multiple computers for weeks before analyzing the results. As we plan to expand the platform beyond our lab, one of the biggest challenges will be to identify the right computational architecture for deploying the code in a scalable fashion. Users would need access to the computationally intensive infrastructure required by the platform while retaining control over their own private data.

We believe the Amazon EC2 service would be an ideal deployment model for our platform. It allows us to customize an Ubuntu instance with the packages and code necessary for our analysis. Users can instantiate our EC2 C4 compute cluster, which comes prepackaged with all the necessary code and data files. User expression data can be uploaded into Amazon S3 buckets, and users can then mount the bucket on their respective instance to begin the analysis. The Amazon EC2 instance also allows users to decide the number of processors and the amount of RAM they would like to dedicate to their analysis, which in turn depends on the amount of data and how fast they want to complete the analysis. Users can dedicate more computational resources (RAM and processors) if they want to complete their analysis within a few hours, or choose a cheaper alternative if they intend to run the code for a longer duration.

By taking the Amazon EC2 route, we can avoid the expensive process of building packages for multiple OS versions as users no longer need to install the packages on their personal machines.

VI - 1.2 Smarter multi-processing

Scoring all paths of varying length in a scale-free network is an NP-hard (non-deterministic polynomial-time hard) problem. One of the biggest challenges during our research was developing a methodology that allows simultaneous scoring of millions of network paths in a reasonable amount of time. To achieve this, we developed a custom map-reduce type of algorithm that breaks the overall network down into smaller sub-networks based on the most active hubs. Even with this custom map-reduce algorithm and its job manager, the computation times are still quite substantial. We hope to improve the algorithm by making the job manager that controls multi-tasking smarter, so that no processor sits idle.
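Why a smarter job manager can reduce idle processor time is easy to see in a small simulation. The sketch below compares a fixed round-robin split of sub-network jobs with a dynamic policy in which the next job always goes to the worker that becomes free first, largest jobs first. It is only an illustration: the job costs are invented, the largest-first ordering is an assumed heuristic, and the code simulates schedules rather than reproducing the platform's actual job manager.

```python
import heapq


def makespan_static(job_costs, n_workers):
    """Round-robin pre-assignment: worker i gets jobs i, i + n, i + 2n, ..."""
    loads = [0.0] * n_workers
    for index, cost in enumerate(job_costs):
        loads[index % n_workers] += cost
    return max(loads)


def makespan_dynamic(job_costs, n_workers):
    """Dynamic assignment: the largest remaining job goes to whichever
    worker finishes its current work first."""
    finish_times = [0.0] * n_workers
    heapq.heapify(finish_times)
    for cost in sorted(job_costs, reverse=True):
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)


# Hypothetical run times (hours) for eight hub-based sub-networks:
# two large hubs and six small ones.
costs = [8, 1, 1, 1, 7, 1, 1, 1]

static_hours = makespan_static(costs, n_workers=2)    # 17.0
dynamic_hours = makespan_dynamic(costs, n_workers=2)  # 11.0
```

With the round-robin split both large hubs land on the same worker, which finishes after 17 hours while the other worker is idle after 4. The dynamic policy finishes all jobs after 11 hours, close to the ideal of 10.5 hours for 21 hours of total work on two workers.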

VI - 1.3 User Interface:

In the current iteration of the software, most user input is obtained by guiding users through the data upload, data extraction and run configuration process via a command line interface (CLI). To make the platform more user friendly and accessible to a larger audience, we would need to improve the user interface, as most biologists are not comfortable using a terminal to initiate their runs. Currently, using the CLI, users can edit the configuration for project runs, which includes designing the combination runs, the fields of data to be extracted from the input files, the number of processors to be used for runs, and a number of other configuration settings. CLI-based prompts would not be ideal when the platform is deployed as a web service. Hence we will need to develop browser-based configuration editors where users can edit the configuration settings using commonly used web browsers such as Internet Explorer or Firefox.

VI - 1.4 Automation:

The platform currently consists of multiple individual modules, such as the GSE reader, RNA-Seq reader, GSE-run module and many more, and each of these modules has its own configuration file that needs to be edited via the CLI. We did not automate the entire process because the output data from one module needed to be transformed before it could be sent as input to the next module. In the next iteration of the platform, we plan to automate these data transformations based on the type of analysis being done and to integrate the modules according to the study design, so that users can simply start a run and check its status.

VI - 1.5 Improve run-times:

Global network analysis that studies all paths in a network is computationally intensive, as it is an NP-hard problem. As the sample size of an analysis increases, the computational power required by the logistic regression models that calculate the POC scores also increases drastically. In some of our runs, such as the breast cancer runs, each study combination took over 4 days to complete. This is not ideal in many circumstances and is definitely not scalable to many users. To make the platform production-ready, we should develop new strategies to improve the run times for these combinations, using techniques including but not limited to dimensionality reduction of the sample matrix, smarter sub-network creation, and identifying hotspots in the disease network. Ideally, we would expect to bring the total run time for each combination to under 1 day, which would be about a 75% reduction in the overall run time.

VI - 1.6 Improve monitoring and reporting:

In the current version, the reports are simple text and the plots are generated using matplotlib. Although these plots and results are sufficient for our analysis, we hope to extend them and make them more interactive as we scale the platform. We hope to employ simple GitHub Flavored Markdown to present all the results in a dashboard for every user. We also plan to move all the plots to an interactive online platform called Plotly, where users can view, modify and export plots in vector graphics formats such as SVG. These Plotly graphs can also be easily embedded into the dashboard, making it easier to produce reports for every project.

VI - 1.7 Improved Data Management Schemes:

One of the challenges with respect to the input data is that expression datasets are publicly available in multiple data formats. Most of the data formats depend on the expression analysis platform and/or the data repository. To accommodate all these datasets, we developed an intermediate data format that contains all the information necessary for our analysis. All external formats were converted to this intermediate format based on user-supplied information. The next step in improving the data management scheme would be to create standardized templates for different input data sources, which users can simply import instead of entering all the information needed to extract the data for a given study design. Such a template would also provide the necessary transformations for gene expression levels and a framework for data imputation for genes whose value is either missing or unknown.

VI - 1.8 Integration with Public APIs:

In recent years, there has been an explosion of biological APIs covering many features, including dataset import, gene identification and summary, functional analysis, pathway topology and many more scenarios.

The biggest advantage of using APIs in our platform is that they give us access to the latest data every time we run our analysis. They also avoid the need to download and store extremely large datasets/databases.

Extremely large databases make it difficult to deploy the platform as an Amazon EC2 instance, and the freshness of the data cannot be guaranteed, as the data in the original source might have changed since the last download.


VI - 2. Algorithm

The algorithm I developed was designed to identify coherent changes in paths along directed interactomes. We have used the breast cancer dataset to show the algorithm was capable of identifying coherent changes and the resultant biomarkers were able to re-classify the unseen test dataset. In the next phase, we plan to improve the precision of the algorithm by improving both the network and the types of analysis that can be done using our platform. Some of these improvements are discussed below.

VI - 2.1 Network Analysis:

Another limitation within our analysis is the lack of curated network information of all the genes in the genomic datasets. Although this limitation applies to any method of analysis, there appears to be a stronger bias for genes involving directed interactions. Only 25-30% of the genes were used in the path analysis because the remaining genes did not have curated directed information, or no interacting partner could be found in the network. As a result, it is possible that several disease-related genes were excluded from path analysis because they were not part of the directed network. Future work in curating these networks will improve the accuracy of our analysis. Moreover, a future, more advanced version of our algorithm, will include undirected interactions between genes and will extend the analytic capabilities of our approach.

VI - 2.2 Multiple dataset integration

POC scores are used to measure the reliability of changes for both edges and nodes in the network. The POC uses a logistic regression model to measure the probability that a node or edge has changed between two conditions. A POC score close to 1 indicates a consistent change, while a score close to 0 indicates no change in gene expression level. Since the node and edge scores are probabilities, it is possible to integrate POC scores from individual analyses to enrich models. This is accomplished by conducting the analysis independently for each dataset. A new overall network can then be created in which the node and edge POC scores are the simple mean of the POC scores from the individual analyses, so that the network represents the overall results from multiple datasets. This would be very useful in studying significant paths across all the metastasis phases of cancer in multiple datasets.
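The averaging step described above can be sketched as follows. This is a minimal illustration and not the platform's code: the function name `integrate_poc`, the dictionary layout, and the gene labels are hypothetical, and the choice to average a node or edge only over the datasets in which it appears is an assumption.

```python
from statistics import mean

def integrate_poc(score_sets):
    """Combine POC scores from independent analyses by taking the simple mean.

    score_sets: list of dicts, one per dataset, mapping a node (gene name)
    or an edge (tuple of two gene names) to its POC score in [0, 1].
    Assumption: a key missing from a dataset is averaged only over the
    datasets that contain it.
    """
    collected = {}
    for scores in score_sets:
        for key, poc in scores.items():
            collected.setdefault(key, []).append(poc)
    return {key: mean(values) for key, values in collected.items()}

# Hypothetical scores from two independent dataset analyses.
dataset_1 = {"GENE_A": 0.9, "GENE_B": 0.6, ("GENE_A", "GENE_B"): 0.8}
dataset_2 = {"GENE_A": 0.7, "GENE_B": 0.4}
overall_network = integrate_poc([dataset_1, dataset_2])
```

Because the inputs are probabilities, the mean stays in [0, 1] and can be used in the overall network exactly like a single-dataset POC score.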


VI - 2.3 miRNA-mRNA Analysis

FIGURE 50: PROSPECTIVE METHODOLOGY TO STUDY MICRO-RNA M-RNA EXPRESSION

MicroRNAs (miRNAs) are a class of endogenous, evolutionarily conserved, small non-coding RNA molecules; the related RNA interference mechanism was described in 1998 by Andrew Fire and Craig Mello (Ciesla et al., 2011; Enerly et al., 2011; Mikaelian, Scicchitano, Mendes, Thomas, & Leroy, 2012). miRNAs play an important role in regulating key biological processes in plants, animals and some viruses, including development, proliferation, apoptosis, stress response and tumourigenesis (Enerly et al., 2011). These small RNA molecules, which usually range from about 19-23 non-coding nucleotides in length, are transcribed in the nucleus from clustered intronic regions by polymerase to form a hairpin structure, the pri-miRNA (Mikaelian et al., 2012). The Drosha and Pasha enzymes then cleave the polyadenylated tails and prepare a pre-miRNA that is ready to be translocated into the cytosol (Buffa et al., 2011; Witkos et al., 2011).

Once in the cytosol, Dicer processes the pre-miRNA complex to form active miRNA. These active miRNAs have short variable 3’ and 5’ ends, which play an important role in gene regulation. They specifically target the 3’ untranslated region (UTR) of messenger RNA (mRNA) and are hypothesized to act similarly to transcription factors in function (Sales et al., 2010). The binding to mRNA affects protein production, and the most noticeable and consistent effects can be seen at the protein level (Sales et al., 2010).

Synergistic regulation between miRNA and mRNA is important in understanding the role of genes and their regulation in complex diseases. It is important to study the role of multiple miRNAs, as they concurrently decrease the stability and suppress the expression of genes at a post-transcriptional level during the progression of disease. It is therefore necessary to model the role that miRNAs and their interacting feedbacks have on the overall disease interactome. A possible extension of our approach could enable the analysis of microRNA modulations and their respective mRNA expression levels during the progression of complex diseases, as shown in Figure 50.

FIGURE 51: MOTIVATION BEHIND THE FUNCTIONAL ENRICHMENT ANALYSIS USING MIRNA-MRNA INTERACTIONS

Typically, miRNA analyses are performed independently of the mRNA expression functional analyses. The influence of miRNAs on the expression of their interacting mRNAs can be viewed as analogous to the interactions between hub genes and their interacting partners. The impact of inter- and intra-modular hubs between two conditions is measured using the Hub Interaction Score (HIS), where a higher HIS indicates a greater impact of a hub on the disease condition. Similar to hub genes, the impact of a miRNA is measured by the changes in expression levels of its interacting partners during the course of the disease.

The interacting partners of a given miRNA and the weights of these interactions are obtained from miRNA databases such as targetscan (Witkos et al., 2011). If the miRNA has a functional impact on the mRNA expression level, then we hypothesize that a change in the miRNA level would also cause a change in the mRNA expression level and hence would increase the Hub Interaction Score (HIS) distance when compared to the control (Figure 51). We plan to improve the overall path analysis scoring module by enriching the probability scores of high-confidence nodes using the newly generated HIS scores, and thereby identify target pathways that are most likely to be modulated during disease progression. This will help to better understand the underlying disease physiology, thereby improving path detection. Figure 51 details our implementation of functional enrichment using miRNA-mRNA expression data for sepsis patients. For the mRNA expression values, we used the values from our previous time-series expression analysis (Dataset id GSE13904), and the miRNA expression values were obtained from the GEO database (Dataset id GSE13205).

TABLE 16: PRELIMINARY RESULTS FROM OUR MIRNA-MRNA ANALYSIS FOR SEPSIS

Modified HIS Score | Targetscan Selection Criteria | miRNA Name | miRNA POC Score
3.82 | top one percent | hsa-mir-551b | 0.97
3.71 | top one percent | hsa-mir-520d | 0.81
3.6 | top one percent | hsa-mir-520c | 0.61
3.47 | top one percent | hsa-mir-411 | 0.69
3.3 | top one percent | hsa-mir-560 | 0.94
3.22 | top one percent | hsa-mir-124a | 1
3.15 | top one percent | hsa-mir-521 | 1
3.15 | top one percent | hsa-mir-517b | 1
3.11 | top one percent | hsa-mir-564 | 0.5
3.06 | top one percent | hsa-mir-489 | 1
3.06 | top one percent | hsa-mir-191 | 0.5
3.05 | top one percent | hsa-mir-588 | 0.53
3.01 | top one percent | hsa-mir-614 | 1
2.96 | top one percent | hsa-mir-365 | 0.72
2.64 | top one percent | hsa-mir-598 | 0.53
2.62 | top one percent | hsa-mir-551a | 1
2.61 | top one percent | hsa-mir-572 | 0.94
2.6 | top one percent | hsa-mir-554 | 1
2.57 | top one percent | hsa-mir-491 | 0.53
2.55 | top one percent | hsa-mir-649 | 0.83
2.51 | top one percent | hsa-mir-769-5p | 0.75
2.47 | top one percent | hsa-mir-383 | 0.5
2.46 | top one percent | hsa-mir-299-5p | 0.5
2.44 | top one percent | hsa-mir-449b | 0.83
2.44 | top one percent | hsa-mir-134 | 0.5

POC scores for miRNA can be calculated in the same way as for mRNA node expression levels, and the HIS module can be used to identify which miRNA has the greatest influence in terms of reducing the expression levels of its interacting mRNA partners, as described in Figure 51. The weights of the edges in Figure 51 were obtained from recomputed scores from targetscan (Witkos et al., 2011). Initial analysis yielded some interesting results, as shown in Table 16 above. miRNAs such as hsa-mir-551b have been shown to be modulated in many complex diseases, including cancers and Alzheimer's disease. hsa-mir-520d, on the other hand, has been shown to play an important role in maintaining stem cell nature (Tsuno, Wang, Shomori, Hasegawa, & Miura, 2014). hsa-mir-411 was shown to be differentially expressed in FSHD myoblasts and to play a role in regulating myogenesis (Harafuji, Schneiderat, Walter, & Chen, 2013). One limitation of this approach is that the miRNA interactions are based on predicted interacting partners, which might not be validated targets. These predicted targets might influence the ranking of a miRNA's impact based on the HIS. In the future, we hope to improve the path analysis algorithm to include the effect of microRNAs on their immediate interacting partners, which could shed light on how these changes shape the complex disease network.

VI - 2.4 Identifying analogous paths across species

One of the key features of our methodology is that we employ POC scores to measure the consistency of change of a gene between two conditions. To study most complex diseases, researchers employ model organisms such as mice or rats, in which the disease is emulated and studied. Since model organisms do not always share exactly the same genes/pathways, extrapolating expression analyses across organisms using traditional frequentist methodology is usually challenging. POC, on the other hand, allows us to compare the POC scores of analogous genes across organisms by studying how consistently the expression levels change between disease conditions. More details about the comparisons are shown in figure 52.

FIGURE 52: PROSPECTIVE METHODOLOGY TO STUDY PATHWAYS ACROSS SPECIES

VI - 2.5 Improve precision and recall for RNA-Seq dataset classification

Although the biomarkers identified using our approach were able to reclassify unseen test microarray datasets with reasonable precision and recall, they did not perform equally well for certain breast cancer subtypes in the RNA-Seq datasets. While the recall was consistently high, the precision for the HER2 subtype was much lower than for any other subtype. Further analysis is required to identify the exact reason for the poor precision, but we believe it might be due to the wide range of expression values. One possible cause is the normalization required to preprocess RNA-Seq datasets before classification using machine learning techniques. Microarray techniques, on the other hand, have robust normalization methodologies, and hence the classification of the unseen datasets was more robust. One possible way to improve the classification scores for RNA-Seq datasets would be to employ alternative normalization techniques before the biomarker classification analysis.
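The distinction between high recall and low precision for a single subtype can be made concrete with a small worked example. The sketch below uses toy labels invented purely for illustration; they are not results from this study, and the function name `precision_recall` is hypothetical.

```python
def precision_recall(true_labels, predicted_labels, subtype):
    """Per-subtype precision and recall from paired true/predicted labels."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == subtype and p == subtype)
    fp = sum(1 for t, p in pairs if t != subtype and p == subtype)
    fn = sum(1 for t, p in pairs if t == subtype and p != subtype)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels (illustrative only): every true HER2 sample is recovered,
# but two LumA samples are also called HER2.
truth = ["HER2", "HER2", "LumA", "LumA", "LumA", "Basal"]
predicted = ["HER2", "HER2", "HER2", "HER2", "LumA", "Basal"]
her2_precision, her2_recall = precision_recall(truth, predicted, "HER2")
# her2_precision is 0.5 and her2_recall is 1.0
```

In this toy case the classifier finds every HER2 sample (recall 1.0) yet half of its HER2 calls are wrong (precision 0.5), which is the pattern described above for the RNA-Seq datasets.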

VI - 2.6 New bioinformatics uses for the statistical metric: prospecting protein site-directed mutagenesis

VI – 2.6.1 Abstract

The Earth Mover's Distance (EMD), which was the basis for the Hub Interaction Score (HIS) in our path analysis, evaluates dissimilarity between two multi-dimensional distributions in some feature space where a distance measure between single features is given. In this study, we extended our HIS algorithm to examine dissimilarity between two multi-dimensional distributions of protein secondary structure during site-directed mutagenesis experiments. Site-directed mutagenesis experiments are targeted modifications of potential functional sites, and they play a key role in deciding the most probable sequence modifications that would lead to desired structural changes. In the absence of a high-resolution structure for a protein, research into structure-function relationships has relied on analysis of site-directed mutagenesis effects as well as chemical modification of reactive residues. The key step of selecting the mutagenesis site relies in part on empirical knowledge of evolutionary sequence information. The aim of our platform (µsap) is to use structural information, which is more conserved than sequence information, to highlight the conformational changes associated with sequence modification in proteins targeted for mutagenesis studies.

In this study we present a novel computational approach to assist in predicting the most probable sequence modifications that would lead to desired structural changes. Among other predictive functions, µsap can be used to determine sequence changes that could potentially lead to large conformational changes, which in turn may result in loss of function while possibly retaining structure. µsap uses multimer databases and their associated secondary structure developed from low mutual sequence identity proteins in order to parameterize remote homologies. The parameterized distances between homologous proteins can be subsequently used to inform the relative importance of primary structure, secondary structure, and the length of multimer fragment, by comparing predictions to known results. µsap can also play a key role in protein engineering by suggesting possibilities for engineering new proteins through novel fragment modifications in functional regions.


VI – 2.6.2 Introduction

FIGURE 53: ILLUSTRATES THE IMPORTANCE OF CORRELATING CONFORMATIONAL STRUCTURE TO STRUCTURE AND FUNCTION OF PROTEINS. Figure 53 Details: sub-figure b illustrates the relationship between local structural occupancy and its conformational state. The plot shows a heat map of the conformational occupancy of glycine over the phi and psi angular distribution.

Proteins are made up of sequences of amino acids bound together by peptide bonds. The spatial orientation of key amino acids within a protein enables it to perform its various biological functions (Dunbrack, 2006). To better understand the function and interactions of a new protein at a molecular level, it is necessary to study the three-dimensional structure and configuration of key elements within the protein. Protein structures are usually observed using techniques such as X-ray crystallography, NMR spectroscopy and dual polarization interferometry (Koehl & Levitt, 1999; Koehl, 2001).


Protein structures are largely determined by the rotations about the alpha carbon atoms of the amino acids (Acids & Summary, 2006). Complex protein structures are usually described at multiple levels, namely:

1) Primary Structure: The primary structure of a protein describes the linear sequence of amino acids that make up the various chains within the protein. The primary structure is held together by strong covalent bonds, known as peptide bonds, that bind the amino acids together. The ordering of the amino acids in these chains is determined by genes via the translation process. For most proteins, the primary structure is stored in FASTA format in the PDB database (Heinig & Frishman, 2004).

2) Secondary Structure: The secondary structure of a protein describes the highly regular local sub-structures formed mostly by hydrogen bonds between residues of the primary polypeptide chains. The local sub-structures can be defined by two dihedral angles, called φ and ψ, on the Ramachandran plot (Carugo & Djinović-Carugo, 2013; Zhou, O'Hern, & Regan, 2011). There are two prominent types of secondary structure:

a. Alpha helix

b. Beta sheets

The Dictionary of Protein Secondary Structure (DSSP) is commonly used to describe the secondary structure of proteins using single-letter codes such as H (α-helix) and E (extended β-strand) (Rost, 2001). The secondary structures of available proteins are usually present in the PDB file.

3) Tertiary Structure: The tertiary structure of a protein describes the three-dimensional structure of monomeric and multimeric protein molecules. The stability of the structure is driven by tertiary interactions such as salt bridges and hydrogen bonds, which help hide the hydrophobic residues while exposing hydrophilic residues to water.

4) Quaternary Structure: Some large proteins, such as multimers, are composed of multiple protein subunits. The quaternary structure of a protein defines the stable three-dimensional structure of multi-subunit proteins and how the subunits fit together.

Predicting the structure and function of proteins based on their amino acid sequence has always been a very challenging problem. While sequences are less conserved, as seen in homology-based modeling, most of the functional sites are conserved, as shown in figure 53. Protein function predictions are generally employed on proteins which are poorly studied or understood. Most proteins are composed of well-conserved structural and functional domains. In many of these cases, structural similarity is a good indicator of functional similarity. For most stable proteins, three-dimensional structures can be determined using techniques such as X-ray crystallography and NMR (Koehl, 2001). These results are then captured and stored in databases such as the PDB.

One of the most important factors for the stability of a protein is hydrophobicity. Hydrophobicity effects are evident in many facets of protein structure: stabilization of protein globular structure in solution, protein–protein interactions associated with protein subunit assembly, protein–receptor binding, and other intermolecular biorecognition processes (L. Wang, Eghbalnia, & Markley, 2007). Determining the contribution of a specific amino acid side chain in a polypeptide chain to the overall hydrophobicity remains an important challenge.

Some proteins do not fold into their biochemically functional (stable) forms due to variations in conditions such as temperature and pH (L. Wang et al., 2007). These proteins tend to denature under the conditions used for three-dimensional structure measurement and change their functional secondary and tertiary structure. These changes can sometimes be temporary, but many of them can be permanent. As chaperones and heat shock proteins are absent during the measurement, the structure measured is not reflective of the true structure, and hence the function, of the protein. One of the accepted ways of studying the functional sites of such proteins is mutational studies (Shanmugam & Natarajan, 2014; Tokuriki & Tawfik, 2009). Computational models using a data-driven approach can be constructed to predict the functional sites of proteins based on data already available from other well-documented protein structures.

Predictive (a priori) knowledge can be used as a powerful tool in a range of experimental investigations when performing targeted modifications of protein sequences with an unknown secondary or tertiary structure (Eswar et al., 2006; Gromiha, 2007; Rost & Sander, 1993). But how much a priori information can be gained just from a sequence before an experiment begins? How reliable and sensitive is it? Can it be derived without using “black-box” approaches?

Site-directed mutagenesis experiments are targeted modifications of potential functional sites, and they play a key role in deciding the most probable sequence modifications that would lead to desired structural changes (Shanmugam & Natarajan, 2014). Site-directed mutagenesis experiments are important in understanding the structural and functional sites of proteins by modifying one or more sites in the overall protein sequence. For a successful site-directed mutational study, it is important that the modified/mutant protein preserves the important structural and functional elements so that the protein is modified and transported to the right site in the cell. One of the most important factors for a successful study is the stability of the modified protein. Protein mutant stability has been shown to be strongly dependent on secondary structure and on the location of residues based on accessible surface area. The choice of amino acid for mutagenesis is usually driven by empirical sequence-based cost matrices such as BLOSUM and PAM, as shown in Figure 47b.

Prediction of sequences has been shown to improve with classification of data based on mutations in helical, strand, coil and turn regions and on hydrophobic and hydrogen bonds. Inclusion of neighboring and surrounding residues leads to remarkable improvements in the correlation in all the subgroups of mutations.

The success of site-directed mutagenesis experiments has been shown to depend on

(i) Original residue,

(ii) Mutant residue,

(iii) Neighboring environment of the residue targeted for modification, as well as

(iv) pH and temperature.

In our study we focused on using empirical information from a combination of structural propensity, neighbourhood, and accessible surface area to derive a biophysically inspired measure for evaluating “pseudo-energetic” changes. This measure can be used for: a) transforming sequence into a “compressed” numeric data, b) measuring “pseudo-energetic” variability along the sequence, or c) evaluating changes due to site mutations.


VI – 2.6.3 Method

The application was developed using the C and Tcl programming languages, with SQLite databases used to sequentially access the tripeptide combinations as required, and consists of a tripeptide knowledgebase and a change cost calculation.

Tri-peptide Knowledgebase

The first task is to create a knowledgebase of all known proteins against which an unknown protein can be compared. To accomplish this, a database of all three-residue peptide sequences (tripeptides) was created from publicly available databases. All the PDB sequences, including the FASTA sequences, were downloaded from PDB.org based on the list obtained from PDBSelect. Secondary structural elements of these proteins were then calculated from their atomic coordinates using the STRIDE program downloaded from the website (add website) as of Dec 2009. STRIDE was used to derive structural information (STRIDE is similar to DSSP, but utilizes both hydrogen bond energy and main-chain dihedral angles rather than hydrogen bonds alone). The distribution of tri-peptide angles was represented by a set of bins (or clusters) in a discrete 5-degree Ramachandran grid. We call this representation the discrete signature of the tri-peptide. A normalization factor is introduced in order to avoid favoring smaller signatures in the case of partial matching.
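The construction of a discrete signature can be sketched as follows. This is a minimal Python illustration rather than the C/Tcl implementation: the function names are hypothetical, each observation is assumed to be a single (phi, psi) pair per tripeptide occurrence, and the normalization is assumed to be division by the total hit count.

```python
from collections import Counter

BIN_SIZE = 5  # degrees, matching the 5-degree Ramachandran grid in the text

def tripeptides(sequence):
    """Return the overlapping three-residue windows of a sequence."""
    return [sequence[i:i + 3] for i in range(len(sequence) - 2)]

def bin_centre(angle):
    """Map an angle in degrees to the centre of its 5-degree bin in [-180, 180)."""
    wrapped = (angle + 180.0) % 360.0 - 180.0
    return (wrapped // BIN_SIZE) * BIN_SIZE + BIN_SIZE / 2

def discrete_signature(angle_pairs):
    """Build a normalized signature {(phi_bin, psi_bin): weight}.

    angle_pairs: observed (phi, psi) pairs for one tripeptide.
    Assumption: weights are hit counts divided by the total number of hits.
    """
    counts = Counter((bin_centre(phi), bin_centre(psi)) for phi, psi in angle_pairs)
    total = sum(counts.values())
    return {cell: hits / total for cell, hits in counts.items()}

# Hypothetical observations for one tripeptide (not taken from the PDB).
signature = discrete_signature([(-62, -41), (-63, -44), (-120, 130)])
windows = tripeptides("MFGRD")  # ['MFG', 'FGR', 'GRD']
```

Normalizing by the total hit count keeps signatures from well-sampled and sparsely sampled tripeptides on the same scale before they are compared.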

FIGURE 54: ILLUSTRATES STEPS INVOLVED FOR THE DIFFERENT TYPES OF ANALYSIS THAT CAN BE DONE


(i) For each protein, the amino acid sequence is decomposed into tri-peptides as shown in figure 54. For each tripeptide, the torsion angles were discretized and normalized before they were deposited into a tripeptide database. SQLite was used to construct these databases so that they could be retrieved as required.

FIGURE 55: OVERALL COST OF AMINO-ACID REPLACEMENT Figure 55 Details: Details the various costs involved in changing a Valine (V) amino acid to Alanine (A) for the three adjacent tripeptides

(ii) Cost Calculation: The next step is to identify the best possible substitution for the given input protein as shown in figure 55. The user inputs include

a. PDB format (if structure is available), or FASTA sequence (if structure is not available),

b. The modification site (amino acid position)

c. Type(s) of modification required


FIGURE 56: HEATMAP OF THE NEIGHBORHOOD EFFECT DURING SITE-DIRECTED MUTAGENESIS

Figure 56 Details: Heatmap showing the cost of site-directed mutagenesis when an amino-acid ‘Q’ is changed to all the other possible amino acids. Regions with red indicate higher costs for change and thus more structural

TABLE 17: SHOWS VARIATION IN EMD DISTANCES BASED ON CHANGES

Depending on the input options and requirements selected by the user, preliminary data, including a decomposition of the target region into tripeptide sequences, are built. The input sequence around the region of interest is decomposed into tripeptides, as seen in figure 56. Amino acids at the site of interest are replaced one at a time, and the cost of each replacement is calculated. The best possible sequence substitution is determined by identifying the lowest cost of amino acid replacement using the 1st Wasserstein distance (W).
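The replace-one-at-a-time scan can be sketched as below. This is an illustrative outline under stated assumptions, not the µsap implementation: the function names are hypothetical, the per-tripeptide cost is passed in as a callable (in the method described here it would be the W distance between tripeptide signatures), and the total cost of a substitution is assumed to be the sum over the overlapping tripeptides that contain the mutated position.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def affected_windows(sequence, position):
    """Return (start, tripeptide) for every tripeptide window covering `position` (0-based)."""
    starts = range(max(0, position - 2), min(position, len(sequence) - 3) + 1)
    return [(start, sequence[start:start + 3]) for start in starts]

def rank_substitutions(sequence, position, tripeptide_cost):
    """Try each alternative residue at `position` and rank by total cost, lowest first.

    tripeptide_cost(original, mutant) is a caller-supplied function; summing it
    over the affected windows is an assumption made for this sketch.
    """
    ranked = []
    for residue in AMINO_ACIDS:
        if residue == sequence[position]:
            continue
        mutant = sequence[:position] + residue + sequence[position + 1:]
        cost = sum(
            tripeptide_cost(original, mutant[start:start + 3])
            for start, original in affected_windows(sequence, position)
        )
        ranked.append((cost, residue))
    return sorted(ranked)

def toy_cost(original, mutant):
    """Placeholder cost with no biological meaning, standing in for W."""
    return sum(abs(ord(a) - ord(b)) for a, b in zip(original, mutant))

ranking = rank_substitutions("AAVAA", 2, toy_cost)
best_cost, best_residue = ranking[0]
```

For an interior position the scan touches three adjacent tripeptides, which matches the three windows shown for the V to A change in figure 55.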


The details of the distance calculation are discussed in the section below. Based on amino acid properties and the probable replacement scores, the appropriate replacement amino acid can be determined. For each tripeptide source-target pair, the cost is specified as that of the “least-expensive” reshaping of the discrete signature of the source into that of the target. The application then uses the databases to score all possibilities using W, and determines the mutation with the best possible score. This process can be extended to study the entire protein to account for the neighbor effect. The scores are then ranked, and the best possible overall sequences are returned to the user.

Alternatively, it could also be used to construct a new type of tri-peptide substitution matrix based on W, which also takes into account the neighbor effect. The cost of these substitutions can be seen in the heatmap in Figure 56, along with the substitution scores described in Table 17.

Wasserstein distance (W) or Earth Mover’s Distance (EMD)

W is used to compute the distance between two distributions represented by signatures. The signatures are sets of weighted features that capture the distributions. The torsion angles, along with the (normalized) hit counts, represent the features for the sequences under study. For a given pair of tripeptides (the original tripeptide (P) and the tripeptide with the replacement (Q)), the W distance metric captures the change in the distribution of an amino acid and its immediate neighbors between the original and the mutant. It was implemented using a modified version of the published Earth Mover's Distance, customized to study tripeptides. In terms of work done or cost of transformation, $p_i$ and $q_j$ are the cluster representatives and $w_{p_i}$ and $w_{q_j}$ are the weights of the clusters:

● P (Signatures) = $\{(p_1, w_{p_1}), \ldots, (p_m, w_{p_m})\}$

● Q (Signatures) = $\{(q_1, w_{q_1}), \ldots, (q_n, w_{q_n})\}$

● W can be defined as the cost of “moving supplies" from P to Q

$$W(P,Q) = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} d_{ij} f_{ij}}{\sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij}}$$

Here $d_{ij}$ is the ground distance between $p_i$ and $q_j$, $f_{ij}$ is the flow from $p_i$ to $q_j$, and the numerator is the work done or total cost. Once the transportation problem is solved and we have found the optimal flow F, the earth mover's distance is defined as the work normalized by the total flow. W naturally extends the notion of a distance between single elements to that of a distance between sets, or distributions, of elements. The distance calculation between P and Q can be illustrated using the Ramachandran plots shown in Figure 57, where the cost of changing the central residue in AAA from A to K is much lower than the cost of changing A to P. For combinations of cost changes, we can study the cost change in a heat map for matrices, as shown in figure 56.
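The transportation problem behind W can be sketched in a few lines of Python. The block below is a minimal illustration and not the modified implementation used in µsap: it assumes integer hit counts as weights, a Euclidean ground distance on (phi, psi) bin centres with each angle wrapped at 360 degrees, and it finds the optimal flow with a plain successive-shortest-path search. The function names `ground_distance` and `emd` are hypothetical.

```python
import math

def ground_distance(a, b):
    """Euclidean distance between two (phi, psi) points, wrapping each angle at 360 degrees."""
    total = 0.0
    for x, y in zip(a, b):
        diff = abs(x - y) % 360.0
        diff = min(diff, 360.0 - diff)
        total += diff * diff
    return math.sqrt(total)

def emd(sig_p, sig_q):
    """W(P, Q) for signatures given as lists of ((phi, psi), integer_weight)."""
    m, n = len(sig_p), len(sig_q)
    source, sink = m + n, m + n + 1
    # Residual graph; each edge is [target, capacity, cost, index of reverse edge].
    graph = [[] for _ in range(m + n + 2)]

    def add_edge(u, v, capacity, cost):
        graph[u].append([v, capacity, cost, len(graph[v])])
        graph[v].append([u, 0, -cost, len(graph[u]) - 1])

    supply = sum(w for _, w in sig_p)
    for i, (p, w) in enumerate(sig_p):
        add_edge(source, i, w, 0.0)
        for j, (q, _) in enumerate(sig_q):
            add_edge(i, m + j, supply, ground_distance(p, q))
    for j, (_, w) in enumerate(sig_q):
        add_edge(m + j, sink, w, 0.0)

    total_flow, total_work = 0, 0.0
    while True:
        # Bellman-Ford: cheapest path from source to sink in the residual graph.
        dist = [math.inf] * len(graph)
        parent = [None] * len(graph)
        dist[source] = 0.0
        for _ in range(len(graph) - 1):
            changed = False
            for u, edges in enumerate(graph):
                if dist[u] == math.inf:
                    continue
                for k, (v, capacity, cost, _rev) in enumerate(edges):
                    if capacity > 0 and dist[u] + cost < dist[v] - 1e-9:
                        dist[v] = dist[u] + cost
                        parent[v] = (u, k)
                        changed = True
            if not changed:
                break
        if dist[sink] == math.inf:
            break  # no more mass can be moved
        # Find the bottleneck along the path, then push that much flow.
        push, v = math.inf, sink
        while v != source:
            u, k = parent[v]
            push = min(push, graph[u][k][1])
            v = u
        v = sink
        while v != source:
            u, k = parent[v]
            graph[u][k][1] -= push
            graph[v][graph[u][k][3]][1] += push
            v = u
        total_flow += push
        total_work += push * dist[sink]
    # Work normalized by total flow, as in the formula for W(P, Q).
    return total_work / total_flow if total_flow else 0.0
```

When the two signatures have different total weights, the flow saturates at the smaller total, so the function returns a partial-match distance; identical signatures give a distance of zero.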

A similar tri-peptide analysis can be applied to the entire protein being studied, as shown in figure 57.

Cost(AAA to AKA) < Cost(AAA to APA)

FIGURE 57: UNDERSTANDING EMD COSTS USING RAMACHANDRAN PLOTS

Figure 57 Details: The plots above show the distribution of phi and psi angles on Ramachandran plots for the tripeptides AKA, AAA and APA. It is clear from the plots that the cost of moving AAA to APA would be greater than AAA to AKA.

VI – 2.6.4 Discussion

The use of structural information further refines the information provided by BLOSUM and PAM matrices. Scoring by tripeptide windows is a key step in obtaining accurate predictions. Using torsion angle distributions as a surrogate for favorable energy states is justified by the key role of these angles in conformational studies. The choice of W is driven by the following natural interpretation in the context of our application:

● It is applicable to the more general variable-size signatures, which subsume histograms, while it reflects the notion of nearness properly.

● The notion of a partial match is very natural. This is important because we are not always sure that the conformational space is completely sampled.

● It is a true metric when the ground distance is a metric and the two signatures have equal total weight, and therefore endows the conformational space with a metric structure.

● It is bounded from below by the distance between the centers of mass of the two signatures (when the ground distance is induced by a norm).

VI – 2.6.5 Future Directions

We plan to extend W scoring by assigning weights based on user-defined knowledge. This would allow us to refine scoring based on similar or distant secondary structure of the target tri-peptide. Also, the inclusion of an amino acid index in the calculation will enhance the existing scoring mechanism. In the long run, we plan to develop a web form for the program where people can check their mutation’s stability. We can also construct a new type of tri-peptide substitution matrix based on W, which takes into account additional neighbor effects.

Primary Structure: MFGRDPFDSL FERMFKEFFA TPMTGTTMIQ SSTGIQISGK GFMPISIIEG DQHIKVIAWL PGVNKEDIIL NAVGDTLEIR AKRSPLMITE SERIIYSEIP EEEEIYRTIK LPATVKEENA SAKFENGVLS VILPKAESSI KKGINIE

FIGURE 58: EXAMPLE OF TRIPEPTIDE ANALYSIS USING OUR METHODOLOGY


Chapter VIII: References

Acids, B. A., & Summary, C. (2006). Branched-Chain Amino Acids: Metabolism, Physiological Function, and Application, (3), 333–336.

Acker, J., Murroni, O., Mattei, M. G., Kedinger, C., & Vigneron, M. (1996). The gene (POLR2L) encoding the hRPB7.6 subunit of human RNA polymerase. Genomics, 32(32), 86–90. http://doi.org/S0888-7543(96)90079-8 [pii]; 10.1006/geno.1996.0079 [doi]

Aderem, A., & Smith, K. D. (2004). A systems approach to dissecting immunity and inflammation. Seminars in Immunology. http://doi.org/10.1016/j.smim.2003.10.002

Agarwal, S., Deane, C. M., Porter, M. A., & Jones, N. S. (2010). Revisiting Date and Party Hubs: Novel Approaches to Role Assignment in Protein Interaction Networks, 6(6). http://doi.org/10.1371/journal.pcbi.1000817

Aittokallio, T., & Schwikowski, B. (2006). Graph-based methods for analysing networks in cell biology. Briefings in Bioinformatics, 7(3), 243–55. http://doi.org/10.1093/bib/bbl022

Al-Lazikani, B., Banerji, U., & Workman, P. (2012). Combinatorial drug therapy for cancer in the post-genomic era. Nature Biotechnology, 30(7), 679–92. http://doi.org/10.1038/nbt.2284

Albert, R. (2005). Scale-free networks in cell biology. Journal of Cell Science, 118(Pt 21), 4947–57. http://doi.org/10.1242/jcs.02714

Altman, R. B. (2012). Introduction to Translational Bioinformatics Collection. PLoS Computational Biology, 8(12), e1002796. http://doi.org/10.1371/journal.pcbi.1002796

Antunes, M. S., Morey, K. J., Tewari-Singh, N., Bowen, T. a, Smith, J. J., Webb, C. T., … Medford, J. I. (2009). Engineering key components in a synthetic eukaryotic signal transduction pathway. Molecular Systems Biology, 5(270), 270. http://doi.org/10.1038/msb.2009.28

Applegate, D., Dasu, T., Krishnan, S., & Urbanek, S. (2011). Unsupervised clustering of multidimensional distributions using earth mover distance. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’11 (p. 636). http://doi.org/10.1145/2020408.2020508

Arpino, G., Generali, D., Sapino, A., Del Matro, L., Frassoldati, A., de Laurentis, M., … Dogliotti, L. (2013). Gene expression profiling in breast cancer: a clinical perspective. Breast (Edinburgh, Scotland), 22(2), 109–20. http://doi.org/10.1016/j.breast.2013.01.016

Auffray, C., & Nottale, L. (2008). Scale relativity theory and integrative systems biology: 1. Founding principles and scale laws. Progress in Biophysics and Molecular Biology, 97(1), 79–114. http://doi.org/10.1016/j.pbiomolbio.2007.09.002

Ballester, P. J., & Mitchell, J. B. O. (2010). A machine learning approach to predicting protein-ligand binding affinity with applications to molecular docking. Bioinformatics, 26(9), 1169–1175. http://doi.org/10.1093/bioinformatics/btq112

Bandyopadhyay, S., & Ali-Fehmi, R. (2013). Breast carcinoma: molecular profiling and updates. Clinics in Laboratory Medicine, 33(4), 891–909. http://doi.org/10.1016/j.cll.2013.08.009

Barabási, A.-L. (2009). Scale-free networks: a decade and beyond. Science (New York, N.Y.), 325(5939), 412–3. http://doi.org/10.1126/science.1173299

Barabási, A.-L., Gulbahce, N., & Loscalzo, J. (2011a). Network medicine: a network-based approach to human disease. Nature Reviews. Genetics, 12(1), 56–68. http://doi.org/10.1038/nrg2918

Barabási, A.-L., Gulbahce, N., & Loscalzo, J. (2011b). Network medicine: a network-based approach to human disease. Nature Reviews. Genetics, 12(1), 56–68. http://doi.org/10.1038/nrg2918

184

Barrenäs, F., Chavali, S., Alves, A. C., Coin, L., Jarvelin, M.-R., Jörnsten, R., … Benson, M. (2012). Highly interconnected genes in disease-specific networks are enriched for disease-associated polymorphisms. Genome Biology, 13(6), R46. http://doi.org/10.1186/gb-2012-13-6-r46 Beisser, D., Brunkhorst, S., Dandekar, T., Klau, G. W., Dittrich, M. T., & Müller, T. (2012). Robustness and accuracy of functional modules in integrated network analysis. Bioinformatics (Oxford, England), 28(14), 1887–94. http://doi.org/10.1093/bioinformatics/bts265 Beresford, M. J. (2010). Medical reductionism: lessons from the great philosophers. QJM : Monthly Journal of the Association of Physicians, 103(9), 721–4. http://doi.org/10.1093/qjmed/hcq057 Beurel, E., Michalek, S. M., & Jope, R. S. (2010). Innate and adaptive immune responses regulated by glycogen synthase kinase-3 (GSK3). Trends in Immunology, 31(1), 24–31. http://doi.org/10.1016/j.it.2009.09.007 Beutler, B. (2004). Inferences, questions and possibilities in Toll-like receptor signalling. Nature, 430(6996), 257–263. http://doi.org/10.1038/nature02761 Bhat, K. M. R., & Setaluri, V. (2007). Microtubule-associated proteins as targets in cancer chemotherapy. Clinical Cancer Research, 13(10), 2849–2854. http://doi.org/10.1158/1078-0432.CCR-06-3040 Bierie, B., & Moses, H. L. (2006). Tumour microenvironment: TGFbeta: the molecular Jekyll and Hyde of cancer. Nature Reviews. Cancer, 6(7), 506–20. http://doi.org/10.1038/nrc1926 Bizzarri, M., Cucina, a, Conti, F., & D’Anselmi, F. (2008). Beyond the oncogene paradigm: understanding complexity in cancerogenesis. Acta Biotheoretica, 56(3), 173–96. http://doi.org/10.1007/s10441-008-9047-8 Böde, C., Kovács, I. a, Szalay, M. S., Palotai, R., Korcsmáros, T., & Csermely, P. (2007). Network analysis of protein dynamics. FEBS Letters, 581(15), 2776–82. http://doi.org/10.1016/j.febslet.2007.05.021 Bodenmiller, B., Wanka, S., Kraft, C., Urban, J., Campbell, D., Pedrioli, P. G., … Aebersold, R. 
(2010). Phosphoproteomic analysis reveals interconnected system-wide responses to perturbations of kinases and phosphatases in yeast. Science Signaling, 3(153), rs4. http://doi.org/10.1126/scisignal.2001182 Botstein, D., & Risch, N. (2003a). Discovering genotypes underlying human phenotypes : past successes for mendelian disease , future approaches, 33(march). http://doi.org/10.1038/ng1090 Botstein, D., & Risch, N. (2003b). Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nature Genetics, 33 Suppl(march), 228–37. http://doi.org/10.1038/ng1090 Buck, M. B., & Knabbe, C. (2006). TGF-beta signaling in breast cancer. Annals of the New York Academy of Sciences, 1089, 119–26. http://doi.org/10.1196/annals.1386.024 Buffa, F. M., Camps, C., Winchester, L., Snell, C. E., Gee, H. E., Sheldon, H., … Ragoussis, J. (2011). microRNA-associated progression pathways and potential therapeutic targets identified by integrated mRNA and microRNA expression profiling in breast cancer. Cancer Research, 71(17), 5635–45. http://doi.org/10.1158/0008-5472.CAN-11-0489 Cancer, T., & Atlas, G. (2012a). Comprehensive molecular portraits of human breast tumours. Nature, 490(7418), 61–70. http://doi.org/10.1038/nature11412 Cancer, T., & Atlas, G. (2012b). Comprehensive molecular portraits of human breast tumours. Nature, 490(7418), 61–70. http://doi.org/10.1038/nature11412 Carlson, C. S., Eberle, M. a, Kruglyak, L., & Nickerson, D. a. (2004). Mapping complex disease loci in whole-genome association studies. Nature, 429(6990), 446–52. http://doi.org/10.1038/nature02623

185

Carlson, J. M., & Doyle, J. (2001). Complexity and robustness. Carugo, O., & Djinović-Carugo, K. (2013). A proteomic Ramachandran plot (PRplot). Amino Acids, 44(2), 781–790. http://doi.org/10.1007/s00726-012-1402-z Cerami, E. G., Gross, B. E., Demir, E., & Rodchenkov, I. (2011). Pathway Commons , a web resource for biological pathway data, 39(November 2010), 685–690. http://doi.org/10.1093/nar/gkq1039 Chan, S. Y., & Loscalzo, J. (2012a). The emerging paradigm of network medicine in the study of human disease. Circulation Research, 111(3), 359–74. http://doi.org/10.1161/CIRCRESAHA.111.258541 Chan, S. Y., & Loscalzo, J. (2012b). The emerging paradigm of network medicine in the study of human disease. Circulation Research, 111(3), 359–74. http://doi.org/10.1161/CIRCRESAHA.111.258541 Chavali, A. K., Gianchandani, E. P., Tung, K. S., Lawrence, M. B., Peirce, S. M., & Papin, J. a. (2008). Characterizing emergent properties of immunological systems with multi-cellular rule-based computational modeling. Trends in Immunology, 29(12), 589–99. http://doi.org/10.1016/j.it.2008.08.006 Chen, C., Hardy, D., & Mendelson, C. (2011). Progesterone receptor inhibits proliferation of human breast cancer cells via induction of MAPK phosphatase 1 (MKP-1/DUSP1). Journal of Biological Chemistry, 286(50), 43091–102. http://doi.org/10.1074/jbc.M111.295865 Chen, L., Huang, T., Shi, X.-H., Cai, Y.-D., & Chou, K.-C. (2010). Analysis of protein pathway networks using hybrid properties. Molecules (Basel, Switzerland), 15(11), 8177–92. http://doi.org/10.3390/molecules15118177 Chuang, H.-Y., Lee, E., Liu, Y.-T., Lee, D., & Ideker, T. (2007). Network-based classification of breast cancer metastasis. Molecular Systems Biology, 3(140), 140. http://doi.org/10.1038/msb4100180 Ciesla, M., Skrzypek, K., Kozakowska, M., Loboda, A., Jozkowicz, A., & Dulak, J. (2011). MicroRNAs as biomarkers of disease onset. Analytical and Bioanalytical Chemistry, 401(7), 2051–61. 
http://doi.org/10.1007/s00216-011-5001-8 Cornell, T. T., Wynn, J., Shanley, T. P., Wheeler, D. S., & Wong, H. R. (2010). Mechanisms and regulation of the gene-expression response to sepsis. Pediatrics, 125(6), 1248–58. http://doi.org/10.1542/peds.2009-3274 Csermely, P. (2008). Creative elements: network-based predictions of active centres in proteins and cellular and social networks. Trends in Biochemical Sciences, 33(12), 569–76. http://doi.org/10.1016/j.tibs.2008.09.006 Cukuroglu, E., Gursoy, A., & Keskin, O. (2010). Analysis of hot region organization in hub proteins. Annals of Biomedical Engineering, 38(6), 2068–78. http://doi.org/10.1007/s10439-010-0048-9 Dalman, M. R., Deeter, A., Nimishakavi, G., & Duan, Z.-H. (2012). Fold change and p-value cutoffs significantly alter microarray interpretations. BMC Bioinformatics, 13 Suppl 2(Suppl 2), S11. http://doi.org/10.1186/1471-2105-13-S2-S11 Damaghi, M., Wojtkowiak, J. W., & Gillies, R. J. (2013). pH sensing and regulation in cancer. Frontiers in Physiology, 4 DEC(December), 1–10. http://doi.org/10.3389/fphys.2013.00370 Damrauer, J. S., Hoadley, K. a, Chism, D. D., Fan, C., Tiganelli, C. J., Wobker, S. E., … Kim, W. Y. (2014). Intrinsic subtypes of high-grade bladder cancer reflect the hallmarks of breast cancer biology. Proceedings of the National Academy of Sciences of the United States of America, 111(8), 3110–5. http://doi.org/10.1073/pnas.1318376111 Data, E. (2004). Statistical Applications in Genetics and Molecular Biology Calculating the Statistical Significance of Changes in Pathway Activity From Gene Expression Data Calculating the Statistical Significance of Changes in Pathway Activity From Gene Expression Data . Statistical Applications

186

in Genetics and Molecular Biology, 3(1). de la Iglesia, D., García-Remesal, M., de la Calle, G., Kulikowski, C., Sanz, F., & Maojo, V. (2013). The impact of computer science in molecular medicine: enabling high-throughput research. Current Topics in Medicinal Chemistry, 13(5), 526–75. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/23548020 del Sol, A., Balling, R., Hood, L., & Galas, D. (2010). Diseases as network perturbations. Current Opinion in Biotechnology, 21(4), 566–71. http://doi.org/10.1016/j.copbio.2010.07.010 Demirci, M. F., Shokoufandeh, A., Keselman, Y., Bretzner, L., & Dickinson, S. (2006). Object Recognition as Many-to-Many Feature Matching. International Journal of Computer Vision, 69(2), 203–222. http://doi.org/10.1007/s11263-006-6993-y Dennis, G., Sherman, B. T., Hosack, D. a, Yang, J., Gao, W., Lane, H. C., & Lempicki, R. a. (2003). DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biology, 4(5), P3. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/12734009 Desmedt, C., Haibe-Kains, B., Wirapati, P., Buyse, M., Larsimont, D., Bontempi, G., … Sotiriou, C. (2008). Biological processes associated with breast cancer clinical outcome depend on the molecular subtypes. Clinical Cancer Research : An Official Journal of the American Association for Cancer Research, 14(16), 5158–65. http://doi.org/10.1158/1078-0432.CCR-07-4756 Dillies, M.-A., Rau, A., Aubert, J., Hennequet-Antier, C., Jeanmougin, M., Servant, N., … Jaffrézic, F. (2013). A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Briefings in Bioinformatics, 14(6), 671–83. http://doi.org/10.1093/bib/bbs046 Dunbrack, R. L. (2006). Sequence comparison and protein structure prediction. Current Opinion in Structural Biology, 16(3), 374–384. http://doi.org/10.1016/j.sbi.2006.05.006 Dunham, I., Kundaje, A., Aldred, S. F., Collins, P. J., Davis, C. a, Doyle, F., … Lochovsky, L. (2012). 
An integrated encyclopedia of DNA elements in the human genome. Nature, 489(7414), 57–74. http://doi.org/10.1038/nature11247 Edwards, J. S., & Palsson, B. O. (2000). Robustness analysis of the Escherichia coli metabolic network. Biotechnology Progress, 16(6), 927–39. http://doi.org/10.1021/bp0000712 Enerly, E., Steinfeld, I., Kleivi, K., Leivonen, S.-K., Aure, M. R., Russnes, H. G., … Børresen-Dale, A.- L. (2011). miRNA-mRNA integrated analysis reveals roles for miRNAs in primary breast tumors. PloS One, 6(2), e16915. http://doi.org/10.1371/journal.pone.0016915 Eroles, P., Bosch, A., Pérez-Fidalgo, J. A., & Lluch, A. (2012). Molecular biology in breast cancer: intrinsic subtypes and signaling pathways. Cancer Treatment Reviews, 38(6), 698–707. http://doi.org/10.1016/j.ctrv.2011.11.005 Eswar, N., Webb, B., Marti-Renom, M. A., Madhusudhan, M. S., Eramian, D., Shen, M.-Y., … Sali, A. (2006). Comparative protein structure modeling using Modeller. Curr Protoc Bioinformatics (Vol. Chapter 5). http://doi.org/10.1002/0471250953.bi0506s15.Comparative Eungdamrong, N. J., & Iyengar, R. (2004). Modeling cell signaling networks. Biology of the Cell / under the Auspices of the European Cell Biology Organization, 96(5), 355–62. http://doi.org/10.1016/j.biolcel.2004.03.004 Friedman, N. (2004). Inferring cellular networks using probabilistic graphical models. Science (New York, N.Y.), 303(5659), 799–805. http://doi.org/10.1126/science.1094068 Gao, J., Aksoy, B. A., Dogrusoz, U., Dresdner, G., Gross, B., Sumer, S. O., … Schultz, N. (2013). Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Science Signaling, 6(269), pl1. http://doi.org/10.1126/scisignal.2004088

187

Gasco, M., Shami, S., & Crook, T. (2002). The p53 pathway in breast cancer. Breast Cancer Research : BCR, 4(2), 70–6. Retrieved from http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=138723&tool=pmcentrez&rendertype=a bstract Gelfond, J. a, Ibrahim, J. G., Gupta, M., Chen, M.-H., & Cody, J. D. (2013). Differential expression analysis with global network adjustment. BMC Bioinformatics, 14, 258. http://doi.org/10.1186/1471- 2105-14-258 Gitter, A., Carmi, M., Barkai, N., & Bar-Joseph, Z. (2013). Linking the signaling cascades and dynamic regulatory networks controlling stress responses. Genome Research, 23(2), 365–76. http://doi.org/10.1101/gr.138628.112 Goldenberg, N. M., Steinberg, B. E., Slutsky, A. S., & Lee, W. L. (2011). Broken barriers: a new take on sepsis pathogenesis. Science Translational Medicine, 3(88), 88ps25. http://doi.org/10.1126/scitranslmed.3002011 Graham, D. B., & Xavier, R. J. (2013). From genetics of inflammatory bowel disease towards mechanistic insights. Trends in Immunology, 34(8), 371–378. http://doi.org/10.1016/j.it.2013.04.001 Gromiha, M. M. (2007). Prediction of protein stability upon point mutations. Biochemical Society Transactions, 35(Pt 6), 1569–73. http://doi.org/10.1042/BST0351569 Guedj, M., Marisa, L., de Reynies, a, Orsetti, B., Schiappa, R., Bibeau, F., … Theillet, C. (2012). A refined molecular taxonomy of breast cancer. Oncogene, 31(9), 1196–206. http://doi.org/10.1038/onc.2011.301 Guiu, S., Michiels, S., André, F., Cortes, J., Denkert, C., Di Leo, a, … Reis-Filho, J. S. (2012). Molecular subclasses of breast cancer: how do we define them? The IMPAKT 2012 Working Group Statement. Annals of Oncology : Official Journal of the European Society for Medical Oncology / ESMO, 23(12), 2997–3006. http://doi.org/10.1093/annonc/mds586 Han, J.-D. J. (2008). Understanding biological functions through molecular networks. Cell Research, 18(2), 224–37. http://doi.org/10.1038/cr.2008.16 Harafuji, N., Schneiderat, P., Walter, M. 
C., & Chen, Y. (2013). miR-411 is up-regulated in FSHD myoblasts and suppresses myogenic factors. Orphanet Journal of Rare Diseases, 8, 55. http://doi.org/10.1186/1750-1172-8-55 Haskó, G., & Cronstein, B. N. (2004). Adenosine: An endogenous regulator of innate immunity. Trends in Immunology, 25(1), 33–39. http://doi.org/10.1016/j.it.2003.11.003 He, X., & Zhang, J. (2006). Why do hubs tend to be essential in protein networks? PLoS Genetics, 2(6), e88. http://doi.org/10.1371/journal.pgen.0020088 Heinig, M., & Frishman, D. (2004). STRIDE: A web server for secondary structure assignment from known atomic coordinates of proteins. Nucleic Acids Research, 32(WEB SERVER ISS.), 500–502. http://doi.org/10.1093/nar/gkh429 Helikar, T., Konvalina, J., Heidel, J., & Rogers, J. a. (2008). Emergent decision-making in biological signal transduction networks. Proceedings of the National Academy of Sciences of the United States of America, 105(6), 1913–8. http://doi.org/10.1073/pnas.0705088105 Herzum, I., & Renz, H. (2008). Inflammatory markers in SIRS, sepsis and septic shock. Current Medicinal Chemistry, 15(6), 581–7. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/18336272 Honecker, F., Rohlfing, T., Harder, S., Braig, M., Gillis, A. J. M., Glaesener, S., … Balabanov, S. (2014). Proteome analysis of the effects of all-trans retinoic acid on human germ cell tumor cell lines. Journal of Proteomics, 96, 300–313. http://doi.org/10.1016/j.jprot.2013.11.010

188

Hu, J., Song, Y., & Chen, S. (n.d.). Finding important hubs in scale-free gene networks, 1–10. Hu, X., Stern, H. M., Ge, L., O’Brien, C., Haydu, L., Honchell, C. D., … Cavet, G. (2009). Genetic alterations and oncogenic pathways associated with breast cancer subtypes. Molecular Cancer Research : MCR, 7(4), 511–22. http://doi.org/10.1158/1541-7786.MCR-08-0107 Huang, T., Weng, R. C., & Lin, C. (2006). Generalized Bradley-Terry Models and Multi-Class Probability Estimates, 7, 85–115. Huber, W. (2003). Analysis of microarray gene expression data, 1–37. Hucka, M., Smith, L., Wilkinson, D., Bergmann, F., Hoops, S., Keating, S., … Schaff, J. (2010). The Systems Biology Markup Language (SBML): Language Specification for Level 3 Version 1 Core. Nature Precedings. http://doi.org/10.1038/npre.2010.4959 Isci, S., Ozturk, C., Jones, J., & Otu, H. H. (2011). Pathway analysis of high-throughput biological data within a bayesian network framework. Bioinformatics, 27(12), 1667–1674. http://doi.org/10.1093/bioinformatics/btr269 Iskander, K. N., Osuchowski, M. F., Stearns-Kurosawa, D. J., Kurosawa, S., Stepien, D., Valentine, C., & Remick, D. G. (2013). Sepsis: multiple abnormalities, heterogeneous responses, and evolving understanding. Physiological Reviews, 93(3), 1247–88. http://doi.org/10.1152/physrev.00037.2012 Janols, H., Wullt, M., Bergenfelz, C., Bj??rnsson, S., Lickei, H., Janciauskiene, S., … Bredberg, A. (2014). Heterogeneity among septic shock patients in a set of immunoregulatory markers. European Journal of Clinical Microbiology and Infectious Diseases, 33(3), 313–324. http://doi.org/10.1007/s10096-013-1957-y Jerby, L., Shlomi, T., & Ruppin, E. (2010). Computational reconstruction of tissue-specific metabolic models: application to human liver metabolism. Molecular Systems Biology, 6(401), 1–9. http://doi.org/10.1038/msb.2010.56 Jiao, X., Sherman, B. T., Huang, D. W., Stephens, R., Baseler, M. W., Lane, H. C., & Lempicki, R. a. (2012). 
DAVID-WS: a stateful web service to facilitate gene/protein list analysis. Bioinformatics (Oxford, England), 28(13), 1805–6. http://doi.org/10.1093/bioinformatics/bts251 Jin, Q., & Esteva, F. J. (2008). Cross-talk between the ErbB/HER family and the type I insulin-like growth factor receptor signaling pathway in breast cancer. Journal of Mammary Gland Biology and Neoplasia, 13(4), 485–98. http://doi.org/10.1007/s10911-008-9107-3 Joshi-Tope, G., Gillespie, M., Vastrik, I., D'Eustachio, P., Schmidt, E., de Bono, B., … Stein, L. (2005). Reactome: A knowledgebase of biological pathways. Nucleic Acids Research, 33, 428–432. http://doi.org/10.1093/nar/gki072 Kambe, T., Tsuji, T., Hashimoto, A., & Itsumura, N. (2015). The Physiological, Biochemical, and Molecular Roles of Zinc Transporters in Zinc Homeostasis and Metabolism. Physiological Reviews, 95(3), 749–784. http://doi.org/10.1152/physrev.00035.2014 Kao, K.-J., Chang, K.-M., Hsu, H.-C., & Huang, A. T. (2011). Correlation of microarray-based breast cancer molecular subtypes and clinical outcomes: implications for treatment optimization. BMC Cancer, 11, 143. http://doi.org/10.1186/1471-2407-11-143 Kay, L. E. (1995). Who wrote the book of life? Information and the transformation of molecular biology, 1945-55. Science in Context, 8, 609–634. http://doi.org/10.1017/S0269889700002210 Keller, E. F. (2005). The century beyond the gene. Journal of Biosciences, 30, 3–10. http://doi.org/10.1007/BF02705144 Keller, E. F., & Harel, D. (2007). Beyond the gene. PLoS ONE, 2. http://doi.org/10.1371/journal.pone.0001231

189

Kibe, S., Adams, K., & Barlow, G. (2011). Diagnostic and prognostic biomarkers of sepsis in critical care. The Journal of Antimicrobial Chemotherapy, 66 Suppl 2, ii33–40. http://doi.org/10.1093/jac/dkq523 Kim, Y., Wuchty, S., & Przytycka, T. M. (2010). Simultaneous Identification of Causal Genes and Dys- Regulated Pathways in Complex Diseases. Methods, 263–280. Kim, Y., Wuchty, S., & Przytycka, T. M. (2011). Identifying Causal Genes and Dysregulated Pathways in Complex Diseases, 7(3). http://doi.org/10.1371/journal.pcbi.1001095 King, E. G., Bauzá, G. J., Mella, J. R., & Remick, D. G. (2013). Pathophysiologic mechanisms in septic shock. Laboratory Investigation, 94(August 2013), 4–12. http://doi.org/10.1038/labinvest.2013.110 Kitano, H. (2004). Biological robustness. Nature Reviews. Genetics, 5(11), 826–37. http://doi.org/10.1038/nrg1471 Koehl, P. (2001). Protein structure similarities. Current Opinion in Structural Biology, 11(3), 348–353. http://doi.org/10.1016/S0959-440X(00)00214-1 Koehl, P., & Levitt, M. (1999). A brighter future for protein structure prediction. Nature Structural Biology, 6(2), 108–111. http://doi.org/10.1038/5794 Kumar, A., Roberts, D., Wood, K. E., Light, B., Parrillo, J. E., Sharma, S., … Cheang, M. (2006). Duration of hypotension before initiation of effective antimicrobial therapy is the critical determinant of survival in human septic shock. Critical Care Medicine, 34, 1589–1596. http://doi.org/10.1097/01.CCM.0000217961.75225.E9 László, I., Trásy, D., Molnár, Z., & Fazakas, J. (2015). Sepsis: From Pathophysiology to Individualized Patient Care. Journal of Immunology Research, 2015. http://doi.org/10.1155/2015/510436 Lee, E., Chuang, H.-Y., Kim, J.-W., Ideker, T., & Lee, D. (2008). Inferring pathway activity toward precise disease classification. PLoS Computational Biology, 4(11), e1000217. http://doi.org/10.1371/journal.pcbi.1000217 Liang, D., Han, G., Feng, X., Sun, J., Duan, Y., & Lei, H. (2012). 
Concerted perturbation observed in a hub network in Alzheimer’s disease. PloS One, 7(7), e40498. http://doi.org/10.1371/journal.pone.0040498 Liu, K.-Q., Liu, Z., Hao, J., Chen, L., & Zhao, X. (2012). Identifying dysregulated pathways in cancers from pathway interaction networks. BMC Bioinformatics. BMC Bioinformatics. http://doi.org/10.1186/1471-2105-13-126 Liu, Y., & Chance, M. R. (2013). Pathway analyses and understanding disease associations. Current Genetic Medicine Reports, 1(4), 230–238. http://doi.org/10.1007/s40142-013-0025-3 Liu, Z.-P., Wang, Y., Zhang, X.-S., & Chen, L. (2012). Network-based analysis of complex diseases. IET Systems Biology, 6(1), 22–33. http://doi.org/10.1049/iet-syb.2010.0052 Lo, K., Raftery, A. E., Dombek, K. M., Zhu, J., Schadt, E. E., Bumgarner, R. E., & Yeung, K. Y. (2012). Integrating external biological knowledge in the construction of regulatory networks from time- series expression data. BMC Systems Biology, 6(1), 101. http://doi.org/10.1186/1752-0509-6-101 London, N. R., Zhu, W., Bozza, F. A., Smith, M. C. P., Greif, D. M., Sorensen, L. K., … Li, D. Y. (2010). Targeting Robo4-dependent Slit signaling to survive the cytokine storm in sepsis and influenza. Science Translational Medicine, 2(23), 23ra19. http://doi.org/10.1126/scitranslmed.3000678 Loscalzo, J., & Barabasi, A.-L. (2011). Systems biology and the future of medicine. Wiley Interdisciplinary Reviews. Systems Biology and Medicine, 3(6), 619–27. http://doi.org/10.1002/wsbm.144

190

M??nsson, R., Tsapogas, P., ??kerlund, M., Lagergren, A., Gisler, R., & Sigvardsson, M. (2004). Pearson Correlation Analysis of Microarray Data Allows for the Identification of Genetic Targets for Early B-cell Factor. Journal of Biological Chemistry, 279(17), 17905–17913. http://doi.org/10.1074/jbc.M400589200 Margolin, A. a, Wang, K., Lim, W. K., Kustagi, M., Nemenman, I., & Califano, A. (2006). Reverse engineering cellular networks. Nature Protocols, 1(2), 662–71. http://doi.org/10.1038/nprot.2006.106 Marioni, J. C., Mason, C. E., Mane, S. M., Stephens, M., & Gilad, Y. (2008). RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Research, 18(9), 1509–17. http://doi.org/10.1101/gr.079558.108 Mason, O., & Verwoerd, M. (2007). Graph theory and networks in Biology. IET Systems Biology, 1(2), 89–119. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/17441552 Mazzocchi, F. (2008). Exceeding the limits of reductionism and determinism using complexity theory. Molecular Biology. McGrogan, B. T., Gilmartin, B., Carney, D. N., & McCann, A. (2008). Taxanes, microtubules and chemoresistant breast cancer. Biochimica et Biophysica Acta - Reviews on Cancer, 1785, 96–132. http://doi.org/10.1016/j.bbcan.2007.10.004 Mei, J., Zhao, J., & Fu, Y. (2012). Analysis of Functional Modules in Protein Networks Using Graph Clustering Method. Advanced Materials Research, 482-484, 612–615. http://doi.org/10.4028/www.scientific.net/AMR.482-484.612 Mikaelian, I., Scicchitano, M., Mendes, O., Thomas, R. a, & Leroy, B. E. (2012). Frontiers in Preclinical Safety Biomarkers: MicroRNAs and Messenger RNAs. Toxicologic Pathology, (June). http://doi.org/10.1177/0192623312448939 Miller, L. D., Smeds, J., George, J., Vega, V. B., Vergara, L., Ploner, A., … Bergh, J. (2005). An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. 
Proceedings of the National Academy of Sciences of the United States of America, 102(38), 13550–5. http://doi.org/10.1073/pnas.0506230102 Milne, A. N., Carneiro, F., O’Morain, C., & Offerhaus, G. J. a. (2009). Nature meets nurture: molecular genetics of gastric cancer. Human Genetics, 126(5), 615–28. http://doi.org/10.1007/s00439-009- 0722-x Mirzarezaee, M., Araabi, B. N., & Sadeghi, M. (2010). Features analysis for identification of date and party hubs in protein interaction network of Saccharomyces Cerevisiae. BMC Systems Biology, 4, 172. http://doi.org/10.1186/1752-0509-4-172 Mitrea, C., Taghavi, Z., Bokanizad, B., Hanoudi, S., Tagett, R., Donato, M., … Drăghici, S. (2013). Methods and approaches in the topology-based analysis of biological pathways. Frontiers in Physiology, 4(October), 278. http://doi.org/10.3389/fphys.2013.00278 Nduka, O. O., & Parrillo, J. E. (2009). The pathophysiology of septic shock. Critical Care Clinics, 25(4), 677–702, vii. http://doi.org/10.1016/j.ccc.2009.08.002 Norum, J. H., Andersen, K., & Sørlie, T. (2014). Lessons learned from the intrinsic subtypes of breast cancer in the quest for precision therapy. The British Journal of Surgery, 101(8), 925–38. http://doi.org/10.1002/bjs.9562 Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., & Kanehisa, M. (1999). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 27(1), 29–34. http://doi.org/10.1093/nar/27.1.29 Olsen, J. V., Blagoev, B., Gnad, F., Macek, B., Kumar, C., Mortensen, P., & Mann, M. (2006). Global, In

191

Vivo, and Site-Specific Phosphorylation Dynamics in Signaling Networks. Cell, 127, 635–648. http://doi.org/10.1016/j.cell.2006.09.026 Papin, J. a, Hunter, T., Palsson, B. O., & Subramaniam, S. (2005). Reconstruction of cellular signalling networks and analysis of their properties. Nature Reviews. Molecular Cell Biology, 6(2), 99–111. http://doi.org/10.1038/nrm1570 Parker, J. S., Mullins, M., Cheang, M. C. U., Leung, S., Voduc, D., Vickery, T., … Bernard, P. S. (2009). Supervised risk predictor of breast cancer based on intrinsic subtypes. Journal of Clinical Oncology : Official Journal of the American Society of Clinical Oncology, 27(8), 1160–7. http://doi.org/10.1200/JCO.2008.18.1370 Pastore, S., Mascia, F., Mariani, V., & Girolomoni, G. (2008). The epidermal growth factor receptor system in skin repair and inflammation. The Journal of Investigative Dermatology, 128(6), 1365– 1374. http://doi.org/10.1038/sj.jid.5701184 Pavlopoulos, G. a, Secrier, M., Moschopoulos, C. N., Soldatos, T. G., Kossida, S., Aerts, J., … Bagos, P. G. (2011). Using graph theory to analyze biological networks. BioData Mining, 4(1), 10. http://doi.org/10.1186/1756-0381-4-10 Pawson, T., & Linding, R. (2008). Network medicine. FEBS Letters, 582(8), 1266–70. http://doi.org/10.1016/j.febslet.2008.02.011 Pedregosa, F., Weiss, R., & Brucher, M. (2011). Scikit-learn : Machine Learning in Python, 12, 2825– 2830. Perou, C. M., & Børresen-Dale, A.-L. (2011). Systems biology and genomics of breast cancer. Cold Spring Harbor Perspectives in Biology, 3(2). http://doi.org/10.1101/cshperspect.a003293 Pierrakos, C., & Vincent, J.-L. (2010). Sepsis biomarkers: a review. Critical Care (London, England), 14, R15. http://doi.org/10.1186/cc8872 Poirel, C. L., Rodrigues, R. R., Chen, K. C., Tyson, J. J., & Murali, T. M. (2013a). Top-down network analysis to drive bottom-up modeling of physiological processes. Journal of Computational Biology : A Journal of Computational Molecular Cell Biology, 20(5), 409–18. 
http://doi.org/10.1089/cmb.2012.0274 Poirel, C. L., Rodrigues, R. R., Chen, K. C., Tyson, J. J., & Murali, T. M. (2013b). Top-down network analysis to drive bottom-up modeling of physiological processes. Journal of Computational Biology : A Journal of Computational Molecular Cell Biology, 20(5), 409–18. http://doi.org/10.1089/cmb.2012.0274 Polyak, K., Shipitsin, M., Campbell-Marrotta, L., Bloushtain-Qimron, N., & Park, S. Y. (2009). Breast tumor heterogeneity: causes and consequences. Breast Cancer Research : BCR, 11 Suppl 1(19), S18. http://doi.org/10.1186/bcr2279 Prat, A., & Perou, C. M. (2011). Deconstructing the molecular portraits of breast cancer. Molecular Oncology, 5(1), 5–23. http://doi.org/10.1016/j.molonc.2010.11.003 Procaccini, C., Galgani, M., De Rosa, V., & Matarese, G. (2012). Intracellular metabolic pathways control immune tolerance. Trends in Immunology, 33(1), 1–7. http://doi.org/10.1016/j.it.2011.09.002 Przytycka, T. M., Singh, M., & Slonim, D. K. (2010). Toward the dynamic interactome: it’s about time. Briefings in Bioinformatics, 11(1), 15–29. http://doi.org/10.1093/bib/bbp057 Ptak, C., & Petronis, A. (2008). Epigenetics and complex disease: from etiology to new therapeutics. Annual Review of Pharmacology and Toxicology, 48, 257–76. http://doi.org/10.1146/annurev.pharmtox.48.113006.094731

192

Ray, M., Ruan, J., & Zhang, W. (2008). Variations in the transcriptome of Alzheimer’s disease reveal molecular networks involved in cardiovascular diseases. Genome Biology, 9(10), R148. http://doi.org/10.1186/gb-2008-9-10-r148 Reis-Filho, J. S., & Pusztai, L. (2011). Gene expression profiling in breast cancer: classification, prognostication, and prediction. Lancet, 378(9805), 1812–23. http://doi.org/10.1016/S0140- 6736(11)61539-0 Renckens, R., Roelofs, J. J. T. H., Florquin, S., de Vos, A. F., Lijnen, H. R., van’t Veer, C., & van der Poll, T. (2006). Matrix Metalloproteinase-9 Deficiency Impairs Host Defense against Abdominal Sepsis. The Journal of Immunology, 176(6), 3735–3741. http://doi.org/10.4049/jimmunol.176.6.3735 Rost, B. (2001). Protein Secondary Structure Prediction Continues to Rise. J. Struct. Biol., 134, 204–218. http://doi.org/10.1006/jsbi.2000.4336 Rost, B., & Sander, C. (1993). Prediction of protein secondary structure at better than 70% accuracy. Journal of Molecular Biology. http://doi.org/10.1006/jmbi.1993.1413 Roy, C., Gupta, A., Fisette, A., Lapointe, M., Poursharifi, P., Richard, D., … Cianflone, K. (2013). C5a Receptor Deficiency Alters Energy Utilization and Fat Storage. PLoS ONE, 8(5). http://doi.org/10.1371/journal.pone.0062531 Rozenblatt-Rosen, O., Deo, R. C., Padi, M., Adelmant, G., Calderwood, M. a, Rolland, T., … Vidal, M. (2012). Interpreting cancer genomes using systematic host network perturbations by tumour virus proteins. Nature, 487(7408), 491–5. http://doi.org/10.1038/nature11288 Russell, J. a. (2011). Gene expression in human sepsis: what have we learned? Critical Care (London, England), 15(1), 121. http://doi.org/10.1186/cc9384 Sales, G., Coppe, A., Bisognin, A., Biasiolo, M., Bortoluzzi, S., & Romualdi, C. (2010). MAGIA, a web- based tool for miRNA and Genes Integrated Analysis. Nucleic Acids Research, 38(Web Server issue), W352–9. http://doi.org/10.1093/nar/gkq423 Sandhu, R., Parker, J. S., Jones, W. D., Livasy, C. 
a., & Coleman, W. B. (2010). Microarray-Based Gene Expression Profiling for Molecular Classification of Breast Cancer and Identification of New Targets for Therapy. Laboratory Medicine, 41(6), 364–372. http://doi.org/10.1309/LMLIK0VIE3CJK0WD Schadt, E. E. (2009). Molecular networks as sensors and drivers of common human diseases. Nature, 461(7261), 218–23. http://doi.org/10.1038/nature08454 Schaefer, C. F., Anthony, K., Krupa, S., Buchoff, J., Day, M., Hannay, T., & Buetow, K. H. (2009). PID: the Pathway Interaction Database. Nucleic Acids Research, 37(Database issue), D674–9. http://doi.org/10.1093/nar/gkn653 Schibler, K. R. (2012). Physiology and Abnormalities of Leukocytes. In Neonatology (pp. 804–818). Milano: Springer Milan. http://doi.org/10.1007/978-88-470-1405-3_105 Schroeder, M. P., Gonzalez-Perez, A., & Lopez-Bigas, N. (2013). Visualizing multidimensional cancer genomics data. Genome Medicine, 5, 9. http://doi.org/10.1186/gm413 Sebastian-leon, P., Vidal, E., Minguez, P., Conesa, A., Amadoz, A., Armero, C., … Vidal-, A. (2014). Understanding disease mechanisms with models of signaling pathway activities ., 1–19. http://doi.org/10.1186/s12918-014-0121-3 Sejas, D. P., Rani, R., Qiu, Y., Zhang, X., Fagerlie, S. R., Nakano, H., … Pang, Q. (2007). Inflammatory Reactive Oxygen Species-Mediated Hemopoietic Suppression in Fancc-Deficient Mice. The Journal of Immunology, 178(8), 5277–5287. http://doi.org/10.4049/jimmunol.178.8.5277


Shanmugam, A., & Natarajan, J. (2014). Combination of site directed mutagenesis and secondary structure analysis predicts the amino acids essential for stability of M. leprae MurE. Interdisciplinary Sciences, Computational Life Sciences, 6(1), 40–47. http://doi.org/10.1007/s12539-014-0185-1
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Tech. J.
Shou, C., Bhardwaj, N., Lam, H. Y. K., Yan, K.-K., Kim, P. M., Snyder, M., & Gerstein, M. B. (2011). Measuring the evolutionary rewiring of biological networks. PLoS Computational Biology, 7(1), e1001050. http://doi.org/10.1371/journal.pcbi.1001050
Sinn, H. P., & Kreipe, H. (2013). A brief overview of the WHO classification of breast tumors, 4th edition, focusing on issues and updates from the 3rd edition. Breast Care, 8(2), 149–154. http://doi.org/10.1159/000350774
Slonim, D. K., & Yanai, I. (2009). Getting started in gene expression microarray analysis. PLoS Computational Biology, 5(10), e1000543. http://doi.org/10.1371/journal.pcbi.1000543
Soneson, C., & Delorenzi, M. (2013). A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics, 14(1), 91. http://doi.org/10.1186/1471-2105-14-91
Sorger, P. K., Allerheiligen, S. R. B., Abernethy, D. R., Altman, R. B., Brouwer, K. L. R., Califano, A., … Lalonde, R. (2011). Quantitative and systems pharmacology in the post-genomic era: new approaches to discovering drugs and understanding therapeutic mechanisms. In An NIH white paper by the QSP workshop group (pp. 1–48).
Sorlie, T., Tibshirani, R., Parker, J., Hastie, T., Marron, J. S., Nobel, A., … Botstein, D. (2003). Repeated observation of breast tumor subtypes in independent gene expression data sets. Proceedings of the National Academy of Sciences of the United States of America, 100(14), 8418–23. http://doi.org/10.1073/pnas.0932692100
Staaf, J., Jönsson, G., Ringnér, M., Vallon-Christersson, J., Grabau, D., Arason, A., … Borg, A. (2010). High-resolution genomic and expression analyses of copy number alterations in HER2-amplified breast cancer. Breast Cancer Research: BCR, 12(3), R25. http://doi.org/10.1186/bcr2568
Stelzl, U., Worm, U., Lalowski, M., Haenig, C., Brembeck, F. H., Goehler, H., … Wanker, E. E. (2005). A human protein-protein interaction network: a resource for annotating the proteome. Cell, 122(6), 957–68. http://doi.org/10.1016/j.cell.2005.08.029
Supper, J., Spangenberg, L., Planatscher, H., Dräger, A., Schröder, A., & Zell, A. (2009). BowTieBuilder: modeling signal transduction pathways. BMC Systems Biology, 3, 67. http://doi.org/10.1186/1752-0509-3-67
Sweeney, C., Bernard, P. S., Factor, R. E., Kwan, M. L., Habel, L. A., Quesenberry, C. P., … Caan, B. J. (2014). Intrinsic subtypes from PAM50 gene expression assay in a population-based breast cancer cohort: differences by age, race, and tumor characteristics. Cancer Epidemiology, Biomarkers & Prevention: A Publication of the American Association for Cancer Research, Cosponsored by the American Society of Preventive Oncology, 23(5), 714–24. http://doi.org/10.1158/1055-9965.EPI-13-1023
Taherian-Fard, A., Srihari, S., & Ragan, M. A. (2014). Breast cancer classification: linking molecular mechanisms to disease prognosis. Briefings in Bioinformatics. http://doi.org/10.1093/bib/bbu020
Takacova, M., Bullova, P., Simko, V., Skvarkova, L., Poturnajova, M., Feketeova, L., … Pastorekova, S. (2014). Expression pattern of carbonic anhydrase IX in medullary thyroid carcinoma supports a role for RET-mediated activation of the HIF pathway. American Journal of Pathology, 184(4), 953–965. http://doi.org/10.1016/j.ajpath.2014.01.002
Tan, S.-L., & Parker, P. J. (2003). Emerging and diverse roles of protein kinase C in immune cell signalling. The Biochemical Journal, 376(Pt 3), 545–52. http://doi.org/10.1042/BJ20031406
Taylor, I. W., Linding, R., Warde-Farley, D., Liu, Y., Pesquita, C., Faria, D., … Wrana, J. L. (2009). Dynamic modularity in protein interaction networks predicts breast cancer outcome. Nature Biotechnology, 27(2), 199–204. http://doi.org/10.1038/nbt.1522
Tokuriki, N., & Tawfik, D. S. (2009). Stability effects of mutations and protein evolvability. Current Opinion in Structural Biology, 19(5), 596–604. http://doi.org/10.1016/j.sbi.2009.08.003
Tsujimoto, H., Ono, S., Efron, P. A., Scumpia, P. O., Moldawer, L. L., & Mochizuki, H. (2008). Role of Toll-like receptors in the development of sepsis. Shock (Augusta, Ga.), 29(3), 315–21. http://doi.org/10.1097/SHK.0b013e318157ee55
Tsuno, S., Wang, X., Shomori, K., Hasegawa, J., & Miura, N. (2014). Hsa-miR-520d induces hepatoma cells to form normal liver tissues via a stemness-mediated process. Scientific Reports, 4. http://doi.org/10.1038/srep03852
Vallabhajosyula, R. R., Chakravarti, D., Lutfeali, S., Ray, A., & Raval, A. (2009). Identifying hubs in protein interaction networks. PLoS ONE, 4(4), e5344. http://doi.org/10.1371/journal.pone.0005344
Van Regenmortel, M. H. V. (2004). Biological complexity emerges from the ashes of genetic reductionism. Journal of Molecular Recognition: JMR, 17(3), 145–8. http://doi.org/10.1002/jmr.674
Van Regenmortel, M. H. V. (2004). Reductionism and complexity in molecular biology. Scientists now have the tools to unravel biological complexity and overcome the limitations of reductionism. EMBO Reports, 5(11), 1016–20. http://doi.org/10.1038/sj.embor.7400284
Vidal, M., Cusick, M. E., & Barabási, A.-L. (2011). Interactome networks and human disease. Cell, 144(6), 986–98. http://doi.org/10.1016/j.cell.2011.02.016
Vogelstein, B., & Kinzler, K. W. (2004). Cancer genes and the pathways they control. Nature Medicine, 10(8), 789–99. http://doi.org/10.1038/nm1087
Voichita, C., Donato, M., & Draghici, S. (2012). Incorporating gene significance in the impact analysis of signaling pathways. Proceedings - 2012 11th International Conference on Machine Learning and Applications, ICMLA 2012, 1(1), 126–131. http://doi.org/10.1109/ICMLA.2012.230
Wang, J., Duncan, D., Shi, Z., & Zhang, B. (2013). WEB-based GEne SeT AnaLysis Toolkit (WebGestalt): update 2013. Nucleic Acids Research, 41(Web Server issue), W77–83. http://doi.org/10.1093/nar/gkt439
Wang, L., Eghbalnia, H. R., & Markley, J. L. (2007). Nearest-neighbor effects on backbone alpha and beta carbon chemical shifts in proteins. Journal of Biomolecular NMR, 39(3), 247–57. http://doi.org/10.1007/s10858-007-9193-3
Wang, S., & Biology, C. (2011). ERBB Receptors and Breast Cancer.
Wang, Z., Gerstein, M., & Snyder, M. (2009). RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews. Genetics, 10(1), 57–63. http://doi.org/10.1038/nrg2484
Weigelt, B., Baehner, F. L., & Reis-Filho, J. S. (2010). The contribution of gene expression profiling to breast cancer classification, prognostication and prediction: a retrospective of the last decade, (November 2009), 263–280. http://doi.org/10.1002/path
West, M., Blanchette, C., Dressman, H., Huang, E., Ishida, S., Spang, R., … Nevins, J. R. (2001). Predicting the clinical status of human breast cancer by using gene expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 98(20), 11462–7. http://doi.org/10.1073/pnas.201162998
Whitacre, J. M. (2012). Biological robustness: paradigms, mechanisms, and systems principles. Frontiers in Genetics, 3(May), 67. http://doi.org/10.3389/fgene.2012.00067
Wirapati, P., Sotiriou, C., Kunkel, S., Farmer, P., Pradervand, S., Haibe-Kains, B., … Delorenzi, M. (2008). Meta-analysis of gene expression profiles in breast cancer: toward a unified understanding of breast cancer subtyping and prognosis signatures. Breast Cancer Research: BCR, 10(4), R65. http://doi.org/10.1186/bcr2124
Witkos, T. M., Koscianska, E., & Krzyzosiak, W. J. (2011). Practical Aspects of microRNA Target Prediction. Current Molecular Medicine, 11(2), 93–109. Retrieved from http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3182075&tool=pmcentrez&rendertype=abstract
Wong, H. R., Cvijanovich, N., Allen, G. L., Lin, R., Anas, N., Meyer, K., … Shanley, T. P. (2009). Genomic expression profiling across the pediatric systemic inflammatory response syndrome, sepsis, and septic shock spectrum. Critical Care Medicine, 37(5), 1558–66. http://doi.org/10.1097/CCM.0b013e31819fcc08
Wülfingen, B. B. von. (2009). Biology and the systems view. EMBO Reports, 10, 37–41.
You, L. (2004). Toward computational systems biology. Cell Biochemistry and Biophysics, 40(2), 167–84. http://doi.org/10.1385/CBB:40:2:167
Zhang, R., & Lin, Y. (2009). DEG 5.0, a database of essential genes in both prokaryotes and eukaryotes. Nucleic Acids Research, 37(Database issue), D455–8. http://doi.org/10.1093/nar/gkn858
Zhao, H., Li, W., Lu, Z., Sheng, Z., & Yao, Y. (2015). The Growing Spectrum of Anti-Inflammatory Interleukins and Their Potential Roles in the Development of Sepsis. Journal of Interferon & Cytokine Research, 35(4), 242–251. http://doi.org/10.1089/jir.2014.0119
Zhao, S., Fung-Leung, W.-P., Bittner, A., Ngo, K., & Liu, X. (2014). Comparison of RNA-Seq and Microarray in Transcriptome Profiling of Activated T Cells. PLoS ONE, 9(1), e78644. http://doi.org/10.1371/journal.pone.0078644
Zhao, Y., Chen, M. H., Pei, B., Rowe, D., Shin, D. G., Xie, W., … Kuo, L. (2012). A Bayesian Approach to Pathway Analysis by Integrating Gene-Gene Functional Directions and Microarray Data. Statistics in Biosciences, 4(1), 105–131. http://doi.org/10.1007/s12561-011-9046-1
Zhou, A. Q., O'Hern, C. S., & Regan, L. (2011). Revisiting the Ramachandran plot from a new angle. Protein Science, 20(7), 1166–1171. http://doi.org/10.1002/pro.644
Zhu, X., Gerstein, M., & Snyder, M. (2007). Getting connected: analysis and principles of biological networks. Genes & Development, 21(9), 1010–24. http://doi.org/10.1101/gad.1528707
