A Network-Based Approach to Associate High Density Lipoprotein

A network-based approach to associate High

Density Lipoprotein (HDL)’s subspeciation with

its cardiovascular protective functions

A dissertation submitted to the

Graduate School

of the University of Cincinnati

in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

in the Department of Biomedical Engineering

of the College of Engineering

2012

Jingyuan Deng

B.S., University of Science and Technology of China, 2007

Committee Chair: Long (Jason) Lu, Ph.D.

Committee Members: Sean Davidson, Ph.D., Jarek Meller, Ph.D., and Rao Marepalli, Ph.D.

Abstract

High density lipoproteins (HDL) are a heterogeneous group of particles composed of proteins and lipids in approximately equal mass. The most abundant HDL proteins are apolipoprotein (apo) A-I and A-II, yet recent proteomic studies have identified up to 50 additional proteins within HDL.

HDL is well known for its critical role in preventing cardiovascular disease (CVD), which is mainly achieved through the mediation of reverse cholesterol transport (RCT). In addition, recent studies have shown that HDL displays a series of similarly diverse functions related to CVD protection, including anti-oxidation, anti-inflammation, and endothelial relation. Moreover, a growing body of evidence (including our research) suggests that these diverse HDL functions may be mediated by distinct stable subspecies that happen to co-fractionate with classically defined "HDL".

To better characterize the structural composition and functionality of HDL subspecies, our collaborator has applied three orthogonal, non-density-based separation techniques (gel filtration (GF), anion exchange (AE), and isoelectric focusing (IEF)) to isolate HDL from human plasma. These techniques fractionated normal human plasma into phospholipid-containing subfractions, and the HDL-associated proteins and their distributions were then determined using mass spectrometry. Given these proteomic profiles of HDL proteins, our work is to systematically identify the structural HDL subspecies and study their biological functions. In the first step, we assume that HDL-associated proteins that have similar co-migration patterns when separated by different techniques are likely to form distinct lipoprotein subspecies. For a protein pair, consistently high similarity in migration patterns across techniques therefore provides the strongest evidence of co-existence in the same particle. Accordingly, we developed two novel scoring systems to quantitatively measure the similarity between proteins' co-migration patterns. Using these scoring systems, we built the co-migration similarity network and validated it using both literature search and gene co-expression analysis. However, we noticed that, because of the limitations of the experimental techniques, the co-migration analysis may be biased against low-abundance proteins or highly labile complexes.

To overcome these limitations in determining the co-occurrence of proteins in the same HDL particle, we turned to a complementary approach: computational prediction of protein-protein interactions (PPIs), which has been shown to be successful in organisms such as yeast and human. In the second step, we therefore constructed a predicted HDL interactome network by integrating several of the most relevant features associated with protein interactions, such as genome-wide sequence, function, gene expression, and curated datasets from the literature. We then merged it with the previously built co-migration similarity network to obtain the generalized interactome network. This generalized HDL interactome network is more accurate, complete, and unbiased than the individual networks.

Finally, to quantitatively connect the HDL subspecies with their biological functions, our collaborator also performed a cholesterol efflux assay, the most common way to measure the protective activity of HDL, on each of the samples. We then applied a network-based classification to identify, from the generalized HDL interactome network, functional modules that optimally correlate with the functional activity profiles. A module, consisting of a group of HDL proteins, may correspond to the entirety or part of an HDL particle that carries out HDL's CVD-protective function. These modules provide models of the molecular mechanisms underlying CVD protection, and they are significantly more reproducible and accurate than individual marker proteins.


Acknowledgements

First, I would like to express my gratitude to my advisor and mentor, Dr. Long (Jason) Lu, for all the valuable supervision, motivation, and opportunities in the lab throughout my graduate studies. His willingness, high expectations, and encouragement have empowered me with the knowledge and confidence to engage in my own pursuits. I deeply value Jason's respect for his students' independence and his interest in them as people in addition to being budding scientists. I am very appreciative of and grateful for all the training with him, and I am quite sure that without his care, supervision, and friendship, I would not have been able to complete this work.

Next, I would like to express my sincere gratitude to my committee, Dr. Sean Davidson, Dr. Jarek Meller, and Dr. Rao Marepalli, for their time and effort in providing support and guidance throughout the duration of this thesis, and especially for their insightful advice, useful suggestions, and encouragement in our committee meetings.

I would also like to thank Dr. Jun Ma and Dr. Michael Wagner in the Division of Biomedical Informatics at Cincinnati Children's Hospital for all the friendly talks and useful discussions. I am especially grateful for the valuable questions they raised during my presentations in our journal club. Many thanks to Dr. Xiaodong Lin at Rutgers University for all the valuable suggestions about the mathematical models of this project.

As a graduate student in Bioinformatics at the University of Cincinnati, I am very lucky to have many outstanding and brilliant fellow students. I have learned a lot from them. Sincere thanks to all the current and past members of our lab for their assistance, support, and friendship. Thanks also to Scott, a graduate student in Dr. Sean Davidson's lab, for all the discussions and suggestions about the experimental parts of the project. I would especially like to thank the many good friends I met here, Zhaowei Ren, Yixuan Guo, Shuai Shao, Feng He, and more. Thanks for all the happiness, great food, and other fun we shared.

Most important of all, I would like to thank my parents for their unconditional love, infinite care, and limitless support through all of my endeavors. They taught me how to be a good person with good character, which will benefit me all my life. I would also like to especially thank my husband for his support throughout my pursuit of the Ph.D. He has been my most important daily source of support and motivation through this process. This dissertation could not have been completed without him. I would like to dedicate this work to him.

My Publications

1. He F, Wen Y, Deng J, Lin X, Lu LJ, Jiao R and Ma J (2008) Probing intrinsic properties of a robust morphogen gradient in Drosophila. Developmental Cell. 15(4):558-67.

2. Deng J, Wang W, Lu LJ and Ma J (2010) A two-dimensional simulation model of the Bicoid gradient in Drosophila. PLoS One. 5(4): e10275.

3. Zhang M, Deng J, Fang C, Zhang X and Lu LJ (2010) Biomolecular network analysis and applications. Knowledge-Based Bioinformatics. 11: 253-288.

4. He F, Wen Y, Cheung D, Deng J, Lu LJ, Jiao R and Ma J (2010) Distance measurements via the morphogen gradient of bicoid in Drosophila embryos. BMC Developmental Biology. 10:80.

5. Gordon SM, Deng J, Lu LJ and Davidson WS (2010) Proteomic characterization of human plasma High Density Lipoprotein fractionated by gel filtration chromatography. Journal of Proteome Research. 9(10): 5239-5249.

6. Deng J, Deng L, Su S, Zhang M, Lin X, Wei L, Minai A, Hassett DJ and Lu LJ (2010) Investigating the predictability of essential genes across distantly related organisms using an integrative approach. Nucleic Acids Research. 1-13.

7. Ren J, Jegga A, Zhang M, Deng J, Liu J, Gordon C, Aronow B, Lu LJ, Zhang B and Ma J (2011) A Drosophila model of the neurodegenerative disease SCA17 reveals a role of RBP-J/Su(H) in modulating the pathological outcome. Human Molecular Genetics. 20(17): 3424-3436.

8. Deng J, Tan L, Lin X, Lu Y and Lu LJ (2012) Exploring the optimal strategy to predict essential genes in microbes. Biomolecules 2(1), 1-22.

9. Deng J, Su S, Lin X, Hassett DJ and Lu LJ (2012) A Statistical Model for Improving the Accuracy of Transposon Mutagenesis Determined Essential Genes (under review).

10. Deng J, Gordon SM, Davidson WS and Lu LJ (2012) Protein co-migration pattern analysis to identify functional HDL subspecies.

11. Deng J, Gordon SM, Davidson WS and Lu LJ. Co-migration pattern analysis to identify High Density Lipoprotein (HDL) subspeciations. Arteriosclerosis, Thrombosis and Vascular Biology 2012 Scientific Sessions.

12. Shah SA, Gordon SM, Lu LJ, Deng J, Dolan ML, Urbina ME and Davidson WS. HDL Subspecies - A More Powerful Approach to Assess Cardiovascular Risk in Youth With Type 2 Diabetes. Arteriosclerosis, Thrombosis and Vascular Biology 2012 Scientific Sessions.


Contents

Abstract

Acknowledgements

My Publications

Contents

List of figures

List of tables

1 Chapter 1. Overview

1.1 Overview of HDL related background information

1.1.1 HDL composition

1.1.2 Overview of HDL functions

1.1.3 Current evidence for HDL subspeciation

1.2 Overview of Network-based approaches to identify Biomarkers

1.2.1 Hubs-based methods

1.2.2 Module-Based methods

1.2.3 Dysregulated Pathway-Based methods

1.3 Significance and our contribution

2 Chapter 2. Analysis of co-migration patterns of HDL protein reveals structural subspecies

2.1.1 Introduction

2.2 Results

2.2.1 HDL associated proteins exhibit distinct patterns using orthogonal separation chromatography techniques

2.2.2 Novel scoring systems to quantify the similarity of proteins' co-migration patterns

2.2.3 Analysis of co-migration networks to identify HDL subspecies

2.2.4 Functional analysis of the identified HDL subspecies

2.2.5 Validation of the identified HDL subspecies

2.3 Discussion

2.4 Methods and Material

2.4.1 Data sources

2.4.2 Random permutation test

2.4.3 Binomial test

2.4.4 Function enrichment analysis

3 Chapter 3. Mapping of the generalized HDL interactome network by integrating co-migration network and predicted interactome network

3.1 Introduction

3.2 Results

3.2.1 Assessing the performance of individual features

3.2.2 Logistic regression framework

3.2.3 Merging with co-migration network

3.3 Discussion

3.4 Materials and methods

3.4.1 Data sources

3.4.2 Genomic features for predicting HDL protein interactions

3.4.3 Evaluating the predictive power of features using Nomogram

3.4.4 Logistic Regression Procedure

4 Chapter 4. Identification of functional modules from the HDL interactome map

4.1 Introduction

4.2 Results

4.2.1 Functional analysis of HDL subfractions

4.2.2 The framework of the functional HDL subspecies identification

4.2.3 Functional HDL subspecies identification

4.2.4 Subnetwork markers are reproducible across different samples

4.2.5 Classification performance of the module markers

4.3 Methods and Materials

4.3.1 Cholesterol efflux assay

4.3.2 Function enrichment analysis

4.3.3 Random permutation test

5 Chapter 5. Other computational biology projects

5.1 Unraveling the mechanistic basis of gene essentiality

5.1.1 Investigating the predictability of essential genes across distantly related organisms using an integrative approach

5.1.2 A Statistical Framework for Improving Genomic Annotations of Essential Genes

5.2 Probing the Developmental Robustness in Fruit Fly Drosophila

5.2.1 Introduction

5.2.2 Results and Discussion

Bibliography

List of figures

Figure 2-1 The outline of the systematic approach to identify structural HDL subspecies.

Figure 2-2. Protein, cholesterol and PL profile of normal human plasma separated by our (A) GF method; (B) AE method; (C) IEF method.

Figure 2-3 The Venn diagram of the identified HDL associated proteins from normal human plasmas using three non-density based separation techniques. The values on the graph indicate the numbers of proteins each part contains.

Figure 2-4 Relative abundance profiles of known HDL proteins across fractions generated by (A) gel filtration, (B) anion exchange, and (C) isoelectric focusing.

Figure 2-5. (a) The full C-score similarity network and (b) the full S-score similarity network. Different colors denote nodes with different degrees. Red nodes have higher degrees than yellow nodes.

Figure 2-6 Node degree distributions of the (a) C-score co-migration network and (b) S-score co-migration network.

Figure 2-7 Co-migration patterns of the top clique with size three in 10 plasma samples separately. This clique contains three proteins: APOH, CFAI and HEMO. Curves with different colors represent migration patterns of different proteins.

Figure 2-8 Example HDL subspecies with protein apoE. The color of the node is proportional to the fold change of gene expression in response to the apoE deletion. The blue edges are literature-supported edges.

Figure 2-9 Example HDL subspecies with known protein complex structure support.

Figure 3-1. The distribution of Fold enrichment score (FES) of mRNA co-expression.

Figure 3-2. The distribution of Fold enrichment score (FES) of functional similarity.

Figure 3-3. The FES distribution of similarity of phylogenetic trees.

Figure 3-4. The FES distribution of protein pair co-occurrence.

Figure 3-5. The Nomogram for visualization of the features. Each feature has a corresponding line indicating the relationship between a feature value and its predictive contribution assessed by Naïve Bayes analysis. The number on the line is the value of the feature and each value corresponds to a point score above. The longer the line is, the more predictive power the feature has in prediction.

Figure 3-6. ROC curve based on all features combined, as well as individual features.

Figure 3-7. Generalized HDL interactome network.

Figure 4-1. The activity profiles of cholesterol efflux by GF subfractions in three normal human plasmas.

Figure 4-2. The outline of the network-based framework.

Figure 4-3 The size distribution of the identified discriminative modules.

Figure 4-4 Correlation score distributions between (a) protein modules / (b) single proteins and functional activity.

Figure 4-5 The protein expression profile of (a) the single protein and (b) the single module that have the highest correlation score, along with the functional activity profile.

Figure 4-6 Overlap in module markers vs. single-gene markers identified from the three samples. The single-gene analysis was performed by using the same number of top discriminative genes as the number of genes covered by module markers.

Figure 4-7 The selected predictive modules used in the logistic regression classifier. They were extracted from the generalized HDL interactome network. The red edges indicate interactions originally from the co-expression network, the green edges indicate interactions from the predicted HDL interactome network, and the blue edges indicate interactions occurring in both networks.

Figure 5-1 Comparison of genomes and essential genes in EC and AB. The square represents 4289 EC total genes; the rectangle represents 3308 AB total genes. The overlap of the two represents 1198 orthologs determined by the RBH method. The rectangle with dashed border represents the total 302 EC essential genes. The rectangle with diagonal brick shades represents the total 499 AB essential genes. The rectangle within the dashed border and with diagonal brick shades represents the common essential genes in both species. The area of each rectangle is approximately proportional to the number of genes it represents.

Figure 5-2 The Nomogram for visualization of the 13 selected features. Each feature has a corresponding line indicating the relationship between a feature value and its predictive contribution assessed by Naive Bayes analysis. The number on the line is the value of the feature and each value corresponds to a point score above. The longer the line is, the more predictive power the feature has in prediction.

Figure 5-3 ROC curves plot the TPR versus FPR for different thresholds of classifier probability output. (A) and (B): EC -> AB; (C) and (D): AB -> EC. (A) Ten-fold cross-validations on the EC essential gene data set. (B) Predictions of AB essential genes. The classifier was trained on the EC data set and evaluated on AB essential genes. (C) Ten-fold cross-validations on the AB essential gene data set. (D) Predictions of EC essential genes. The classifier was trained on the AB data set and evaluated on EC essential genes.

Figure 5-4 ROC curves for: (A) Ten-fold cross-validations on the EC essential gene data set. (B) Predictions of PA essential genes. The classifier was trained on the EC data set and evaluated on PA essential genes.

Figure 5-5 ROC curves for: (A) Ten-fold cross-validations on the EC essential gene data set. (B) Predictions of BS essential genes. The classifier was trained on the EC data set and evaluated on BS essential genes.

Figure 5-6 The integrative approach significantly extends the coverage of homology mapping. IG stands for the integrative approach. RBH stands for the reciprocal best hit approach. For the IG method, the cutoffs are set to be the same as the number of essential genes in each organism, i.e. (PA: 678, AB: 499, BS: 192).

Figure 5-7 Three factors have strong associations with false TM assignments. (A) Gene length. The lengths of TmEs are significantly shorter than those in the PEC dataset and total genes. Many of these short genes may be false essential genes. (B) Position of insertions. Essential genes mistakenly assigned to be non-essential by TM often have insertions in the 25% extreme ends (5% in the 5' end and 20% in the 3' end). These insertions do not completely disrupt a gene's function. (C) Number of insertions. 75% of the essential genes mistakenly assigned to be non-essential by TM have only one insertion in them.

Figure 5-8 Illustration of the statistical model.

Figure 5-9 Robustness of our model at subsaturation levels of transposon insertions. The dashed line shows p-values of the Fisher's exact test to examine whether the true essential rate in PNTmEs is significantly lower than that in the original TmEs set. Similarly, the solid line shows p-values of the Fisher's exact test to examine whether the true essential rate in our PETmNs is significantly higher than that in the original TmNs set.

Figure 5-10 Simulated Bcd distributions in a Drosophila embryo. A. A simulated embryo at nuclear cycle 14 showing the local total Bcd concentration [Btot] (arbitrary units). The A–P position is shown as absolute distance x (in mm) from the anterior. The ratio of total Bcd molecules in the cortical layer to those in the inner part of the embryo is 1.88 at nuclear cycle 14 (see text for further details). B. A plot of local DNA-bound Bcd concentration [Bbound] (arbitrary units) within the cortical layer as a function of fractional embryo length x/L. In this and other figures presented in this report, [Bbound] at each A–P position represents the mean [Bbound] value of all cubes within the cortical layer of the embryo at that A–P position. C. Same as B except [Bbound] is on ln scale. Linearity of ln[Bbound] indicates an exponential Bcd protein gradient; see text legend for more information about Adjusted R2 values to further evaluate the quality of exponential fitting of the simulated data.

Figure 5-11 Stability of nuclear Bcd concentrations.

Figure 5-12 Scaling properties of the Bcd gradient.

List of tables

Table 2-1 75 identified HDL subspecies ranked by the significance of co-migration scores.

Table 2-2 31 identified HDL subspecies from the filtered co-migration networks ranked by the significance of co-migration scores.

Table 3-1. The top 50 predicted interactions along with their labels in the gold-standard dataset.

Table 3-2. Performance of the top predictions evaluated by precision and recall rate.

Table 4-1 Performance of the classifier on the test set using selected modules as input.

Table 4-2 Performance of the classifier on the test set using selected top single proteins as input.

Table 5-1 Improvement of overlaps with the PEC dataset using our model.

Table 5-2 Validation using allelic exchange experiments in P. aeruginosa PAO1. E – Essential; N – Non-essential.

1 Chapter 1. Overview

1.1 Overview of HDL related background information

1.1.1 HDL composition

High density lipoprotein (HDL) is a highly heterogeneous family of particles of different sizes, composed of lipids and proteins in approximately equal mass. It enables lipids such as cholesterol and triglycerides to be transported within the bloodstream. HDL is the smallest and densest of the lipoprotein particles in human plasma. It consists of a hydrophobic core containing cholesteryl esters and triglycerides, surrounded by a hydrophilic surface monolayer of phospholipids, unesterified cholesterol, and apolipoproteins. The most abundant apolipoproteins are apoA-I, which accounts for roughly 70% of the protein mass, and apoA-II, which accounts for an additional 15-20%. Recent proteomic studies on HDL using high-resolution mass spectrometry (1-4), including two from our team (5, 6), have identified upwards of 75 distinct proteins. Besides the major HDL proteins apoA-I and apoA-II, these include other classical apolipoproteins (7) such as apoC, apoE, apoD, apoM, and apoA-IV, as well as enzymes and transfer proteins such as lecithin:cholesterol acyl transferase (LCAT) (8), paraoxonase, cholesteryl ester transfer protein (CETP) (9), and phospholipid transfer protein (PLTP) (10).

Many of these newly identified HDL-associated proteins mediate functions that are surprisingly outside the realm of lipid transport and metabolism. For example, HDL was found to be a host for numerous protease inhibitors as well as mediators of the complement cascade, suggesting a possible role for HDL in innate immunity (11). This has created substantial interest in the possibility that HDL may play many more roles than previously supposed.

1.1.2 Overview of HDL functions

Many epidemiological studies have shown that plasma levels of high-density lipoprotein cholesterol (HDL-C) and its major protein, apolipoprotein A-I (apoA-I), are inversely related to the risk of atherosclerosis and cardiovascular disease (CVD). This suggests that HDL, especially its major protein component apoA-I, has the ability to reduce CVD risk. This is accomplished through the transport of cholesterol from peripheral cells, such as macrophage-derived foam cells in the vessel wall, to the liver for catabolism. This process is known as reverse cholesterol transport (RCT) and is considered the classical protective function of HDL against atherosclerosis (12, 13). Besides this lipid transport activity, HDL has also been found to have a series of similarly diverse functions that together contribute to its CVD protection. First, HDL can prevent oxidative modification of LDL via the HDL-associated protein paraoxonase 1 (PON1) (14). LDL oxidation is commonly considered to contribute to the initiation and progression of atherosclerosis, and PON1 is able to degrade oxidized LDL phospholipids, which prevents the accumulation of oxidized lipids in LDL and their proinflammatory action (15). Second, HDL has been shown to have anti-inflammatory properties through its ability to inhibit the expression of adhesion molecules on endothelial cells, such as VCAM-1, E-selectin, ICAM-1 and endotoxin (16), and to prevent monocyte adhesion to the endothelium (17). Finally, HDL can stimulate nitric oxide (NO) production by vascular endothelial cells (18). NO acts as a potent vasodilator and modulator of vascular tone, and decreased levels of NO are associated with pre-atherosclerotic vascular sites. These antioxidative, anti-inflammatory, and pro-vasodilatory properties of HDL may be as important as its well-known cholesterol efflux function in protecting against the development of cardiovascular disease.

1.1.3 Current evidence for HDL subspeciation

In clinical diagnosis, the most commonly used HDL measurement is the HDL cholesterol (HDL-C) level, i.e. the amount of cholesterol contained in HDL particles. While there is no question that high plasma HDL-C is inversely correlated with CVD on a population basis, it has not proven to be a good absolute measure of CVD risk in individuals. There are many examples of people with higher HDL-C levels who have fewer problems with CVD, while others with low HDL-C levels have increased rates. In other words, although high plasma levels of HDL cholesterol might be associated with better cardiovascular health, simply increasing one's HDL level might not improve cardiovascular health. This observation suggests that HDL's CVD-protective role may not be performed by the whole HDL pool. Indeed, the presence of over 50 proteins in association with HDL, combined with the spatial constraints of HDL's small size, makes it unlikely that all of these proteins could be present on a single HDL particle. Moreover, there is significant evidence that the major HDL proteins form particles and that not all of these particles are the same (19). Thus, it is likely the existence of particular HDL particles, rather than their absolute quantity, that influences HDL's activity. Given the similarly diverse functions that HDL performs, this has led to the proposal of HDL subspeciation: the idea that the entire pool of HDL is composed of several distinct particles, each with a distinct proteomic composition and a distinct biological role. Some question this idea, suggesting that HDL does not form stable particles but exists instead as a transient ensemble of proteins that randomly exchange.

However, there is significant evidence that HDL proteins truly segregate into compositionally stable particles. Asztalos et al. (20) have shown highly distinct HDL protein patterns using a two-dimensional gel electrophoresis system. In addition, our proteomic profiling of HDL density (5) and gel filtration (6) subfractions clearly indicated that HDL proteins distribute in distinct patterns across the HDL spectrum. There is also strong evidence that major HDL activities rely on cooperative interactions between associated proteins. A classic example is apoA-I's cofactor role with lecithin:cholesterol acyl transferase (LCAT), the enzyme responsible for the esterification of free cholesterol in HDL; the presence of apoA-I stimulates LCAT activity by several orders of magnitude relative to its absence. The most striking example of on-particle cooperation is the discovery that a specific HDL subparticle can mediate the lysis of T. brucei, a trypanosome responsible for African sleeping sickness (21, 22). This HDL particle contains apoA-I, apoL-I, and haptoglobin-related protein (HRP), forming a complex that is taken up by the trypanosome via the HRP moiety, allowing apoL-I to permeabilize its lysosomes to lethal effect. This is the strongest evidence yet for distinct particles within classically defined HDL that perform highly specialized functions through cooperative protein interaction (23). Given the multitude of known HDL proteins, it is easy to imagine the existence of additional unknown subspecies that may contribute to HDL cardio-protection.

1.2 Overview of Network-based approaches to identify Biomarkers

Massive amounts of biological data are now available in public databases. This enables us to build protein networks representing different biological aspects, such as protein-protein interaction (PPI) networks and pathway networks. Recently, a series of increasingly sophisticated network-based tools have been developed to predict potential disease genes and pathways as novel biomarkers that offer better targets for drug development and disease classification. These tools can generally be grouped into three categories: hubs-based methods, module-based methods, and dysregulated pathway-based methods.

1.2.1 Hubs-based methods

Hubs are the nodes in a network that have the highest degrees. In a protein network, the proteins represented by hubs usually have special and important biological roles. Evidence suggests that hub proteins tend to be encoded by essential genes, to evolve more slowly, and to be more conserved than non-hub proteins (24-26). Therefore, in the human PPI network, hubs might be expected to be associated with cancer-related genes. Wachi et al. (27) analyzed the topological features of the interactome network using genes differentially expressed in lung squamous cancer tissue. For genes with a given degree k, they calculated the fraction that were differentially expressed in lung cancer tissue, and then measured the correlation between these fractions and connectivity (measured by degree). They found that the protein products of genes that were up-regulated in squamous cell carcinoma of the lung were more likely to be hubs than proteins whose expression levels were not affected. In a separate study, Jonsson et al. (28) investigated cancer proteins in an extensive human interactome network. Their results showed that known cancer proteins exhibited a different network topology from proteins not linked to cancer. In particular, they found that cancer proteins had, on average, twice as many interaction partners as non-cancer proteins.
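The degree-versus-fraction analysis described above can be illustrated with a minimal sketch. It is not the implementation used in the cited studies: the toy edge list, the set of "flagged" (for example, differentially expressed) genes, the helper names, and the use of a Pearson correlation are all assumptions made for illustration.

```python
from collections import defaultdict


def node_degrees(edges):
    """Return {node: degree} for an undirected edge list (self-loops ignored)."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:
            neighbors[u].add(v)
            neighbors[v].add(u)
    return {node: len(nbrs) for node, nbrs in neighbors.items()}


def fraction_flagged_by_degree(degrees, flagged):
    """For each degree k, the fraction of degree-k nodes that are in `flagged`."""
    total = defaultdict(int)
    hits = defaultdict(int)
    for node, k in degrees.items():
        total[k] += 1
        if node in flagged:
            hits[k] += 1
    return {k: hits[k] / total[k] for k in sorted(total)}


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


# Hypothetical toy network and flagged gene set (illustrative only).
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
flagged = {"A", "B"}

degrees = node_degrees(edges)
fractions = fraction_flagged_by_degree(degrees, flagged)
ks = list(fractions)
r = pearson(ks, [fractions[k] for k in ks])
print(degrees, fractions, round(r, 3))
```

In this toy example the fraction of flagged nodes rises with degree, which is the kind of trend the hub-based studies look for; with real data, degrees with very few nodes would need to be binned or filtered before computing the correlation.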

1.2.2 Module-Based methods

Protein networks have been found to be naturally divided into modules. Each module is a discrete unit composed of a group of highly connected components that performs a relatively independent function (29-31). Accurately identifying modules in protein networks contributes greatly to the detection of module-based prognostic biomarkers for disease. Currently, many computational methods have been developed that combine a panel of gene expression profiles with the protein network, with the goal of identifying modules under context-dependent conditions. In these methods, each module is a set of connected proteins whose expression levels are combined to determine the module's overall activity, which in turn is used to predict the phenotypic class of novel samples. An intriguing question, therefore, is how the proteins cooperate with each other to produce the module's overall activity. To answer this question, we group the current approaches to defining module activity in cancer into three categories: (1) aggregate expression, (2) differential expression, and (3) data-driven logic function.

(1) Aggregate expression

In aggregate expression, the activity of each module is homogeneous and is defined as the aggregate (sum) of its components' activities. This is the most widely used way to define a module's activity. A number of approaches have been demonstrated for extracting relevant modules whose genes have coherent expression profiles that discriminate disease status. One classical example is Chuang et al. (32), who used a scoring and searching procedure to extract modules whose activities across patients were highly discriminative of metastasis. By overlaying the gene expression data onto the PPI network, they defined the expression profile of a given module using the aggregate expression definition. They then applied a greedy search to identify modules with optimal discriminative power for metastatic status, and trained a logistic regression classifier on the identified modules to predict the metastatic status of new samples. Compared with the individual gene markers identified in previous studies, the identified modules were significantly more reproducible between different breast cancer cohorts, and the classification achieved higher prediction accuracy.
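A minimal sketch of aggregate module activity combined with a greedy search is given below. It is an illustration under stated assumptions, not the procedure of Chuang et al.: the sum is scaled by the square root of the module size so that modules of different sizes can be compared, and the discriminative score is a simple absolute difference of class means; both choices, and all function names, are assumptions.

```python
from math import sqrt


def module_activity(expr, module):
    """Per-sample aggregate activity: sum of member expression values,
    scaled by sqrt(module size) (the scaling is an assumption)."""
    members = sorted(module)
    n_samples = len(expr[members[0]])
    scale = sqrt(len(members))
    return [sum(expr[g][j] for g in members) / scale for j in range(n_samples)]


def separation(activity, labels):
    """Stand-in discriminative score: |mean(class 1) - mean(class 0)|."""
    pos = [a for a, lab in zip(activity, labels) if lab == 1]
    neg = [a for a, lab in zip(activity, labels) if lab == 0]
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))


def greedy_module(seed, adj, expr, labels):
    """Grow a connected module from a seed, adding the neighbour that most
    improves the score, until no neighbour improves it."""
    module = {seed}
    best = separation(module_activity(expr, module), labels)
    while True:
        frontier = set().union(*(adj[g] for g in module)) - module
        scored = [(separation(module_activity(expr, module | {g}), labels), g)
                  for g in sorted(frontier)]
        if not scored:
            break
        top, gene = max(scored)
        if top <= best:
            break
        module.add(gene)
        best = top
    return module, best
```

With `expr` mapping each gene to its per-sample expression values and `adj` mapping each gene to its set of network neighbours, `greedy_module` returns the grown module and its score.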

Guo et al. (33) proposed an edge-based scoring and searching approach to extract modules from the interactome network using gene expression profiles. In this approach, each edge in the network is assigned a responsive score, defined as the covariance of the expression levels of the two connected proteins. The score of a connected subnetwork is then defined as the sum of the scores of all edges within it. They applied a simulated annealing procedure, which has the advantage of being able to escape local optima, to identify candidate modules. They applied the method to a human prostate cancer dataset and found that the identified modules covered the majority of previously reported prostate cancer genes.
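The edge-based score can be illustrated with a short sketch. The sample covariance (denominator n - 1) is an assumption, as are the function names; the simulated annealing search itself is not shown.

```python
def covariance(x, y):
    """Sample covariance of two equally long expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def subnetwork_score(edges, expr):
    """Score of a subnetwork: sum of the covariance scores of its edges."""
    return sum(covariance(expr[u], expr[v]) for u, v in edges)


# Hypothetical toy profiles: y rises with x, z falls as x rises.
toy_expr = {"x": [1, 2, 3], "y": [2, 4, 6], "z": [3, 2, 1]}
toy_score = subnetwork_score([("x", "y"), ("x", "z")], toy_expr)
```

In this toy case the positively co-varying edge contributes 2.0 and the anti-correlated edge contributes -1.0, so the subnetwork score is 1.0.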

(2) Differential expression

In Differential expression, the module activity is defined as the difference in expression levels between interacting genes/proteins. The additive formulation of module activity described above can only highlight the coordinate dysregulation of interacting proteins that are positively correlated, and it overlooks the effects of inhibitory and other complex forms of interaction found in cancer progression. Taylor et al. (34) addressed these challenges by developing a method to identify prognostic signatures in breast cancer from the human PPI network. They assumed that during tumor progression the rewiring of signaling networks drives phenotypic alterations while maintaining the robustness of the network, suggesting that there may be differences in hub-type association with cancer. They computed the relative expression of PPI network hubs with respect to each of their interacting partners, and used it to determine for which hubs the relative expression differed significantly between patients who survived for more than 5 years and those who died. They then employed the affinity propagation algorithm with these network signatures to predict patient outcome. Their prognostic signature gave good outcome prediction: among patients with the poor-prognosis modularity signature, only 48% survived more than 5 years. The analysis of the two breast cancer patient populations suggests that changes in the modules of the human interactome may be useful as an indicator of breast cancer prognosis.

An alternative is provided by Nacu et al. (35), who used the T-statistic of normalized microarray data to measure the extent to which each gene is differentially expressed. For a given module, the score is calculated by averaging the absolute values of the T-statistics of its genes, and parametric assumptions can be incorporated into this scoring system. Since an increase in expression level leads to a positive T-statistic and a decrease leads to a negative one, the average of the absolute T-statistics can detect significant expression changes in both directions. They applied their method to lymphocyte data (36) and retrieved several known and potential pathways, which indicates that their method has the potential to identify functional subnetworks based on differential gene expression.
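The reason for averaging absolute values can be seen in a small sketch. The Welch form of the two-sample statistic is an assumption made for illustration, and the function names are hypothetical; this is not the scoring code of Nacu et al.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(group_a, group_b):
    """Two-sample t-statistic for one gene (Welch form, an assumption)."""
    standard_error = sqrt(variance(group_a) / len(group_a)
                          + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / standard_error


def module_score(t_values):
    """Average absolute t-statistic over the genes of a module."""
    return sum(abs(t) for t in t_values) / len(t_values)


# One gene up and one gene down: a plain mean would cancel to 0,
# whereas the mean absolute value keeps both changes.
up_and_down = [3.0, -3.0]
score = module_score(up_and_down)
```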

(3) Data-driven Logic function

In Data-driven Logic function, protein components that encode basic logic functions (such as AND, OR and NOT) are further combined within modules to code for complex programs. These methods can represent polychromatic modules and conditional interactions, and can deal with multiclass scenarios that have more than two outcomes. A recent study by Chowdhury et al. (37) introduced a method based on subnetwork state functions. In this method, coordinate dysregulation is formulated by calculating the mutual information between subnetwork state functions and the corresponding phenotype. The authors derived a lower bound on the information that a state function provides about the phenotype and, using this bound, developed bottom-up enumeration algorithms that effectively shrink the subnetwork space and efficiently identify informative state functions. Subnetworks identified by this algorithm can be used to train neural networks for phenotype classification. Similarly, Hwang et al. (38) proposed a hypergraph-based learning algorithm named HyperGene for cancer outcome prediction and biomarker identification. They classified the gene expression profiles into three states (basal, up and down) rather than the two used in Chowdhury's study. They then built a hypergraph with phenotype information as vertices and gene expression states as hyperedges. A global solution of this regularization framework is obtained by optimizing a cost function, which is calculated as the weighted sum of inconsistent labelings of both phenotypes and hyperedges.

In addition, Ideker et al. (39) developed a method named “Network-Guided Forests (NGF)” to identify network modules whose logic relates to key biological and clinical outcomes. NGF is based on random forests with biological constraints induced by a PPI network. Specifically, the NGF framework learns a set of decision trees in which each tree maps to a connected component of the PPI network. The collection of all tree outputs is used to predict the disease state of the biological sample. They applied the approach to breast cancer metastasis and found that, in cancer, the most predictive network decision functions rely on both coherent and opposing gene activities, such as the combination of oncogenes and tumor suppressors. This suggests that medical genetics should move beyond cataloguing individual cancer genes to cataloguing their combinatorial logic.

1.2.3 Dysregulated Pathway-Based methods

The module-based methods above focus on extracting modules/subnetworks from the human interactome network based on coherent expression profiles; similar approaches have been applied to pathway analysis. Incorporating pathway information with gene expression makes it possible to classify disease based on the activity of entire or partial signaling pathways instead of on the expression profiles of individual proteins or genes. In such methods, the pathway information is curated from sources such as the MetaCyc (40), Biocarta (41) and KEGG (42) databases.

Tian et al. (43) proposed a statistical framework for determining whether a specified group of genes derived from a pathway has a coordinated association with a phenotype of interest. Specifically, they tested whether the observed associations of the genes in a pathway are a random sample from the background distribution of all observed associations, where the background distribution is generated by permuting the phenotypes. The test score was adjusted by essentially replacing it with its quartile. Every group of genes was assigned a score defined as the aggregated average of the test statistics of its member genes, and groups with high scores were more likely to be differentially expressed with respect to the phenotype.
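The idea of building a background distribution by permuting phenotypes can be sketched as follows. This is a simplified stand-in and not the framework of Tian et al.: the gene-level statistic (absolute difference of group means), the exhaustive enumeration of label assignments, and all function names are assumptions chosen to keep the example small and deterministic.

```python
from itertools import combinations


def mean_difference(values, case_indices):
    """Mean of the case samples minus mean of the remaining samples."""
    cases = [v for i, v in enumerate(values) if i in case_indices]
    controls = [v for i, v in enumerate(values) if i not in case_indices]
    return sum(cases) / len(cases) - sum(controls) / len(controls)


def gene_set_statistic(expr, genes, case_indices):
    """Average absolute gene-level statistic over the gene set."""
    return sum(abs(mean_difference(expr[g], case_indices)) for g in genes) / len(genes)


def label_permutation_p_value(expr, genes, case_indices):
    """Compare the observed statistic with every relabelling of the samples
    that keeps the number of cases fixed."""
    n_samples = len(expr[genes[0]])
    observed = gene_set_statistic(expr, genes, set(case_indices))
    null = [gene_set_statistic(expr, genes, set(c))
            for c in combinations(range(n_samples), len(case_indices))]
    return sum(1 for s in null if s >= observed) / len(null)
```

For larger cohorts the exhaustive enumeration would be replaced by random permutations of the phenotype labels.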

1.3 Significance and our contribution

Despite advances in cardiovascular disease (CVD) health promotion, heart disease remains the leading cause of mortality in the United States, killing one person every 30 seconds. As a major complication of the current obesity epidemic and Type 2 Diabetes, CVD will continue to be a premier health problem for the foreseeable future. In both clinical and molecular biological research, HDL is an attractive target because it is the body's natural defense against CVD. However, current strategies to boost HDL levels do so indiscriminately, without regard for functionality. While there is no question that high plasma levels of HDL are inversely correlated with CVD on a population basis, there are many examples of individuals with high HDL who have CVD and vice versa (44). The implication is clear: not all HDL are created equal. HDL comprises numerous particle populations, some of which are likely to be cardioprotective and others of which may not be. It is quite possible that raising HDL, particularly through mechanisms that interfere with its catabolism, may increase the wrong sorts of particles at the expense of the cardioprotective ones. This concept was perhaps most poignantly demonstrated by the recent failure of Torcetrapib, a CETP inhibitor that failed to reduce vascular plaque despite raising HDL-C by some 70+% (45). Once the protective subspecies are known, therapies can be brought online to raise them specifically without altering the others. Considering the tremendous resources that have been poured into non-specific raising of plasma HDL-C, it seems prudent to invest more research into understanding HDL subspeciation and its connection to CVD.

My dissertation work is directed at developing and validating a network-based approach that combines advanced proteomic analysis with a network-based computational framework to uncover new HDL subspecies in normal human plasma and associate them with known HDL cardio-protective functions. In Chapter 2, we developed two novel correlation-based scoring systems to quantitatively measure the similarity between HDL proteins' co-migration patterns generated by three chromatographic separation techniques. We then built two co-migration similarity networks based on the similarity scores and identified structural HDL subspecies from these networks. Functional analysis of these HDL subspecies shows that they have a variety of different but related biological functions, such as reverse cholesterol transport, anti-oxidation, anti-inflammation and endothelial relaxation.

Consistently high similarity in migration patterns provides strong evidence that two proteins co-exist in the same subspecies. However, the experimental techniques have limitations: they are biased against low-abundance proteins and highly labile complexes. To compensate for these limitations in determining the co-occurrence of proteins in the same HDL particle, we turned to an alternative, integrative computational approach. In Chapter 3, we constructed a combined HDL interactome network by integrating the co-migration similarity network and a predicted HDL interactome network. Interacting proteins are often found to share common properties, e.g., similar phylogenetic profiles and co-expression patterns. We developed a logistic regression framework to build a predicted HDL interaction network by integrating several relevant features associated with protein interactions (mRNA co-expression, functional similarity, similar phylogenetic profiles and co-occurrence in annotated gene sets). We then combined the predicted HDL interactome network with the previously built co-migration similarity network to obtain the generalized interactome network. Merging the two networks enables us to obtain a more complete and less biased network.

Protein interactome networks have been found to be naturally divided into functional modules. In Chapter 4, we identified functional modules responsible for known HDL function from the HDL interactome network. A functional activity profile (cholesterol efflux) was obtained from the Gel Filtration separation technique. We developed a network-based classification in which functional modules that optimally correlate with the functional activity profile were identified from the HDL interactome network. A module, consisting of a group of HDL proteins, may correspond to the entirety or part of an HDL particle that carries out HDL's CVD-protective function. These network-based modules outperformed individual proteins as markers for HDL function in both reproducibility and accuracy.

Chapter 5 lists other computational biology projects that I participated in and completed during my graduate study. One project is the unraveling of the mechanistic basis of gene essentiality (24). Rapid and accurate identification of new essential genes in under-studied microorganisms will significantly improve our understanding of how a cell works and our ability to re-engineer microorganisms. However, predicting essential genes across distantly related organisms remains a challenge. Here, we present a machine learning-based integrative approach that reliably transfers essential gene annotations between distantly related bacteria. Ten-fold cross-validation in the same organism yielded AUC scores between 0.86 and 0.93. Cross-organism predictions yielded AUC scores between 0.69 and 0.89. The transferability is likely affected by growth conditions, the quality of the training data set and the evolutionary distance. We are thus the first to report that gene essentiality can be reliably predicted using features trained and tested in a distantly related organism.

The second project is the development of a statistical framework for improving genomic annotations of essential genes. Most of the essential gene annotations in genomic databases are determined by whole-genome transposon mutagenesis (TM). However, there are substantial systematic biases associated with TM experiments. We developed a statistical framework capable of incorporating these factors and correcting current TM-based essential gene annotations. Evaluated against the essential gene profile in E. coli, our model significantly improved the accuracy of the original TM datasets by filtering out false essential and non-essential assignments. Our method also showed encouraging results in improving subsaturation-level TM datasets. In addition, our model is broadly applicable to other bacteria, such as Pseudomonas aeruginosa PAO1 and Francisella tularensis novicida. In summary, our method is a promising tool for improving the genomic annotation of essential genes and enabling large-scale explorations of gene essentiality.

The third project probes developmental robustness in the fruit fly Drosophila. Bicoid (Bcd) is a Drosophila morphogenetic protein responsible for patterning the anterior structures in embryos. Recent experimental studies have revealed important insights into the behavior of this morphogen gradient, making it necessary to develop a model that can recapitulate the biological features of the system, including its dynamic and scaling properties. We present a biologically realistic 2-D model of the dynamics of the Bcd gradient in Drosophila embryos. This model is based on equilibrium binding of Bcd molecules to non-specific, low-affinity DNA sites throughout the Drosophila genome. It considers both the diffusion media within which the Bcd gradient is formed and the dynamic and other relevant properties of the bcd mRNA from which Bcd protein is produced. Our model recapitulates key features of the Bcd protein gradient observed experimentally, including its scaling properties and the stability of its nuclear concentrations during development. Our simulation results suggest that, in our model, Bcd protein diffusion is important for the formation of an exponential gradient in the embryo. They demonstrate that highly complex biological systems can be effectively modeled with relatively few parameters.

2 Chapter 2. Analysis of co-migration patterns of HDL protein reveals structural subspecies

2.1 Introduction

High density lipoprotein (HDL) is a blood-borne assembly of lipids, such as phospholipids, cholesterol and triglycerides, and different kinds of apolipoproteins. HDL plays critical roles in the prevention of cardiovascular disease (CVD), the major cause of mortality in the United States. Recent proteomics studies have identified upwards of 75 distinct HDL associated proteins (See 1.1 overview of HDL composition). Many of the newly identified proteins mediate functions that are, surprisingly, outside the realm of lipid transport. For example, HDL was found to be a host for numerous protease inhibitors as well as mediators of the complement cascade, suggesting a possible role in innate immunity. Recently, strong evidence has emerged suggesting that the diverse HDL functions are mediated by distinct HDL subspecies, each containing a unique protein composition that performs distinct biological functions (See 1.2 overview of HDL functions and 1.3 Current evidence for HDL subspeciation).

To better characterize the structural composition and functionality of HDL subspecies, our collaborator has applied three non-density based orthogonal separation chromatography techniques (Gel filtration (GF), Anion exchange (AE), and Isoelectric focusing (IEF)) for the isolation of HDL from human plasma. In general, these techniques fractionate normal human plasma into phospholipid-containing subfractions, after which the HDL associated proteins and their distributions are determined using mass spectrometry. Given the proteomic profiles of HDL proteins, our work is to systematically identify the structural HDL subspecies and study their biological functions. Fig. 2-1 illustrates the outline of our approach. We assume that HDL associated proteins that have similar co-migration patterns when separated by different techniques are likely to form distinct lipoprotein subspecies. We developed two novel scoring systems to quantitatively measure the similarity between proteins' co-migration patterns and identified candidate HDL subspecies based on these scoring systems. We further validated the identified subspecies using both literature search and gene co-expression analysis. Functional analysis was also performed, and the identified subspecies are enriched with CVD-protective functions. These identified subspecies revealed novel interactions among HDL proteins and provided models of the underlying molecular mechanism of CVD.

Density gradient ultracentrifugation (UC) is currently the preferred method of HDL isolation because density is a major resolving factor between non-lipid and lipid-bound protein. The underlying principle of this method is to quantitatively float the relatively light lipid-bound proteins away from the heavy non-lipid-associated proteins. However, this method often involves prolonged centrifugation steps, which are required to float the lipoproteins into the gradients, and the use of high salt concentrations, which can modify protein structure and deplete many apolipoproteins from the fractions. In addition, the high salt must be removed before further analysis of the lipoprotein fractions, which results in poor recoveries (46). Moreover, Van't Hooft et al. showed that more than one half of the apolipoprotein E dissociated from HDL during ultracentrifugation (47). This leads to misleading results for apoE and its interaction partners. Thus, there is a significant need to analyze the HDL proteome using alternative separation techniques. In this work, instead of UC, Dr. Davidson's lab applied three other attractive non-density based separation techniques for the isolation of HDL proteins.

Figure 2-1 The outline of the systematic approach to identify structural HDL subspecies.

2.2 Results

2.2.1 HDL associated proteins exhibit distinct patterns using orthogonal separation chromatography techniques

Our collaborator has developed three non-density based orthogonal separation chromatography techniques to fractionate normal human plasma into phospholipid-containing subfractions. They then isolated the lipid-associated proteins and determined their identities and relative distributions across subfractions using mass spectrometry. The first separation technique is gel filtration (GF) chromatography, which fractionates subparticles by molecular size. This method was used on three human plasma samples. GF separated each plasma sample into 17 successive size-based fractions and identified 106 HDL associated proteins in total (Fig. 2-2A). The second technique is anion exchange (AE) chromatography, which separates particles by charge. The AE method was applied to four normal human plasma samples and identified 140 HDL associated proteins in total (Fig. 2-2B). The third technique is isoelectric focusing (IEF) chromatography, which separates particles based on the isoelectric point, i.e., the pH at which a particle has a net charge of zero. IEF was applied to three plasma samples and identified 93 proteins (Fig. 2-2C). The numbers of HDL associated proteins identified by the different techniques differ. Fig. 2-3 shows the Venn diagram of the three sets of identified HDL associated proteins. The majority of the identified proteins (76) are common to the three techniques, and there are also proteins unique to each technique. This uniqueness is caused by the different underlying separation principles, which suggests that the results complement each other and that combining them yields a comprehensive HDL protein pool. Therefore, in our analysis, we focus on the 159 unique HDL associated proteins.

Figure 2-2. Protein, cholesterol and PL profiles of normal human plasma separated by our (A) GF method; (B) AE method; (C) IEF method.

Figure 2-3 The Venn diagram of the identified HDL associated proteins from normal human plasmas using three non-density based separation techniques. The values on the graph indicate the numbers of proteins each part contains.

Previous proteomic profiling of HDL density and gel filtration subfractions clearly shows that HDL proteins distribute in distinct patterns across the HDL spectrum (5, 6), which provides strong evidence that HDL proteins segregate into distinct subspecies. The current proteomic study revealed observations that support this point. Fig. 2-4 shows the relative abundance profiles of the apolipoproteins across fractions generated by (A) gel filtration, (B) anion exchange, and (C) isoelectric focusing. As expected, the most abundant protein, apoA-I, occurs in almost every fraction, while the less abundant apolipoproteins display relatively distinct distributions. On closer inspection, we noticed that several apolipoprotein pairs, such as apoA-I & apoA-IV, apoA-II & apoJ, and apoA-IV & apoC-I, display similar co-migration patterns, indicating that they may be sequestered to the same sets of HDL particles through potential structural interactions and perform distinct biological functions.

Figure 2-4 Relative abundance profiles of known HDL proteins across fractions generated by (A) gel filtration (B) anion exchange, and (C) isoelectric focusing.

2.2.2 Novel scoring systems to quantify the similarity of proteins’ co-migration patterns

Given the observation that HDL associated proteins segregate into distinct subspecies, we hypothesized that proteins with similar co-migration patterns across different separation techniques may be involved in the same subparticles. To quantitatively measure the similarity of co-migration patterns between proteins, we developed two novel scoring systems, named C-scores and S-scores, which capture the similarity from two different aspects. Broadly, the C-score system is based on the count of overlapping fractions that contain both proteins. This method is straightforward, but it ignores protein abundance information and is biased toward high-abundance proteins because they occur in almost every fraction. The S-score system is based on the variation between two proteins' normalized migration profiles; it does not require the relative abundances in different fractions to be independent, and it is more resistant to errors in the low peptide counts produced by low-abundance proteins. Detailed descriptions of the two systems follow:

Count-overlapping Approach (C-score)

Given the proteomic data, we first developed a method named “count-overlapping” to calculate the similarity between two proteins' abundance profiles across all the fractions. Specifically, for a given protein pair $(P_x, P_y)$, the count-overlapping score (C-score) in a given plasma sample $k$ is calculated as:

$$C_k(P_x, P_y) = \sum_{i} I_i(P_x, P_y)$$

where $i$ indexes the $i$th fraction for that sample and $I_i$ is a binary variable: $I_i = 1$ if the $i$th fraction contains at least one count of both $P_x$ and $P_y$, and 0 otherwise. In this method, each HDL protein pair in each plasma sample received a C-score. Next, we combined the C-scores over all 10 plasma samples. To reduce the data complexity, we defined the combined C-score for a given protein pair as:

$$C(P_x, P_y) = \Big(\sum_{k} C_k(P_x, P_y)\Big) \times \prod_{k} I\big(C_k(P_x, P_y) > 0\big)$$
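The count-overlapping score can be sketched in a few lines. The per-sample function follows the definition directly; the combination rule used here (the sum over samples, set to zero unless the pair overlaps in every sample) is one reading of the combined C-score and should be checked against the definition above. Function names and the toy spectral counts are hypothetical.

```python
from math import prod


def c_score(counts_x, counts_y):
    """Number of fractions in which both proteins have at least one count."""
    if len(counts_x) != len(counts_y):
        raise ValueError("profiles must cover the same fractions")
    return sum(1 for a, b in zip(counts_x, counts_y) if a > 0 and b > 0)


def combined_c_score(per_sample_scores):
    """Sum of per-sample C-scores, gated by overlap in every sample
    (assumed reading of the combined score)."""
    gate = prod(1 if c > 0 else 0 for c in per_sample_scores)
    return sum(per_sample_scores) * gate


# Hypothetical spectral counts over five fractions of one sample.
profile_x = [0, 3, 5, 0, 1]
profile_y = [2, 1, 0, 0, 4]
overlap = c_score(profile_x, profile_y)  # fractions 2 and 5 contain both
```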

The variance-based approach (S-score)

Taking into account the protein abundance information, we developed the second scoring system based on the paired variance to measure the similarity of the co-migrating patterns of two proteins. In this approach, for each protein pair (Px, Py), the co-migration score in a given plasma sample k is calculated by

$$S_k(P_x, P_y) = 1 - \frac{\sum_{i=1}^{N} \left(\tilde{P}_{x,i} - \tilde{P}_{y,i}\right)^2}{N - 1}$$

where $i$ indexes the $i$th fraction and $N$ is the number of fractions for that sample; $\tilde{P}_{x,i}$ and $\tilde{P}_{y,i}$ are the normalized values of $P_x$ and $P_y$ in the $i$th fraction, calculated by dividing the spectral counts in the $i$th fraction by the total counts over all $N$ fractions:

$$\tilde{P}_{x,i} = \frac{P_{x,i}}{\sum_{j=1}^{N} P_{x,j}} \quad \text{and} \quad \tilde{P}_{y,i} = \frac{P_{y,i}}{\sum_{j=1}^{N} P_{y,j}}.$$

The larger the S-score, the more similar the co-migration patterns of the two proteins.
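A minimal sketch of the per-sample S-score follows, assuming the score is one minus the sum of squared differences between the two normalized profiles divided by N - 1. The function names and the toy counts are hypothetical.

```python
def normalize(counts):
    """Divide each fraction's spectral count by the total over all fractions."""
    total = sum(counts)
    if total == 0:
        raise ValueError("protein has no counts in this sample")
    return [c / total for c in counts]


def s_score(counts_x, counts_y):
    """Variance-based similarity of two migration profiles in one sample."""
    n = len(counts_x)
    if n != len(counts_y) or n < 2:
        raise ValueError("profiles must cover the same (at least two) fractions")
    norm_x = normalize(counts_x)
    norm_y = normalize(counts_y)
    squared = sum((a - b) ** 2 for a, b in zip(norm_x, norm_y))
    return 1.0 - squared / (n - 1)


# Profiles with the same shape but different abundance score 1.0,
# because normalization removes the overall abundance.
same_shape = s_score([1, 2, 3], [2, 4, 6])
# Profiles peaking in different fractions score lower.
disjoint = s_score([4, 0, 0], [0, 0, 4])
```

The first example shows why this score is less sensitive to overall abundance than the count-based C-score.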

Since we have three different separation techniques and 10 different samples, to quantify the co-migration of a protein pair $(P_x, P_y)$ across all the plasma samples, we used an un-weighted geometric mean to combine the scores of the 10 samples:

$$S(P_x, P_y) = 1 - \prod_{k=1}^{10} \big(1 - S_k(P_x, P_y)\big)^{1/10}$$

This approach does not require the fractions to be independent, an assumption required by the Pearson correlation coefficient, and it also alleviates the bias against low-abundance proteins.

Protein pairs with high C-scores and S-scores consistently co-separate across different separation techniques and plasma samples, implying their co-occurrence in the same HDL particle.
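The combination of per-sample S-scores can be sketched as follows, assuming the combined score is one minus the un-weighted geometric mean of the per-sample dissimilarities (1 - S_k). The function name is hypothetical.

```python
from math import prod


def combine_s_scores(sample_scores):
    """1 minus the geometric mean of (1 - S_k) over all samples."""
    if not sample_scores:
        raise ValueError("at least one sample score is required")
    n = len(sample_scores)
    return 1.0 - prod(1.0 - s for s in sample_scores) ** (1.0 / n)


# Equal per-sample scores are returned unchanged (up to rounding).
uniform = combine_s_scores([0.75, 0.75])
# A perfect score in any one sample drives the combined score to 1.
one_perfect = combine_s_scores([1.0, 0.2, 0.5])
```

Because the dissimilarities are multiplied, a pair that co-migrates perfectly in a single sample receives the maximal combined score under this formula, which is worth keeping in mind when interpreting high S-scores.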

2.2.3 Analysis of co-migration networks to identify HDL subspecies

Based on the C-scores and S-scores, two co-migration networks were built separately. In the C-score system, each protein pair with a C-score larger than 0 was considered connected in the network. This generated a network containing 62 nodes and 602 edges, with an average degree of 20 (Fig. 2-5a). In the S-score system, each HDL protein pair was assigned a co-migration score between 0 and 1, so the resulting co-migration matrix was continuous. All the elements of this matrix form a background empirical distribution. The confidence of the co-migration score of each protein pair was described by its position in this background distribution: pairs with high score percentiles tend to be more reliable than those with low percentiles. To obtain statistical significance, we selected the upper 5% tail as the predicted edges of the network, and the confidence level of each edge was described by the p-value of its co-migration score with respect to the background empirical distribution. Using this strategy, the S-score co-migration network contains 60 nodes and 628 edges, with an average degree of 20 (Fig. 2-5b).
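Selecting the upper tail of the empirical score distribution can be sketched as below. The function name, the tie handling and the toy scores are assumptions made for illustration.

```python
def top_tail_edges(pair_scores, tail=0.05):
    """Keep the pairs whose score lies in the upper `tail` fraction of all
    pairwise scores; report each kept pair with its empirical p-value,
    i.e. the fraction of all scores that are at least as large."""
    values = sorted(pair_scores.values(), reverse=True)
    n = len(values)
    keep = max(1, int(n * tail))
    cutoff = values[keep - 1]
    edges = {}
    for pair, score in pair_scores.items():
        if score >= cutoff:  # ties at the cutoff are all kept
            edges[pair] = sum(1 for v in values if v >= score) / n
    return edges


# Hypothetical scores for 20 protein pairs; the top 10% gives two edges.
toy_scores = {("P%d" % i, "Q"): i / 20 for i in range(1, 21)}
toy_edges = top_tail_edges(toy_scores, tail=0.1)
```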

Figure 2-5. (a) The full C-score similarity network and (b) The full S-score similarity network. Different colors denote nodes with different degrees. Red nodes have higher degrees than yellow nodes.

A clique in a network is a subset of its nodes in which every two nodes are connected by an edge. A maximal clique is a clique that cannot be extended: adding any further neighboring node would yield a set that is no longer a clique. Thus, maximal cliques in the co-migration networks may correspond to distinct HDL subspecies whose components have very similar co-migration patterns. We identified all the maximal cliques in both networks using the Bron-Kerbosch algorithm (48). There are 113 maximal cliques in the C-score co-migration network, with sizes from 3 to 16, and 110 maximal cliques in the S-score co-migration network, with sizes from 3 to 19. The two sets of maximal cliques have 75 cliques in common. These 75 cliques may represent HDL subspecies that are consistent across different plasmas, separation techniques, and analysis approaches.
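For reference, a compact sketch of Bron-Kerbosch enumeration with pivoting is given below. It is a generic textbook variant and not necessarily the implementation used in this work; the function names, the `min_size` filter and the toy graph are assumptions.

```python
def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def maximal_cliques(adj, min_size=1):
    """Enumerate all maximal cliques with at least min_size members."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            if len(current) >= min_size:
                cliques.append(sorted(current))
            return
        # Pivot on the node with the most neighbours among the candidates.
        pivot = max(candidates | excluded, key=lambda v: len(adj[v] & candidates))
        for v in list(candidates - adj[pivot]):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


# Toy network: a triangle A-B-C with a pendant node D attached to C.
toy_adj = build_adjacency([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")])
toy_cliques = maximal_cliques(toy_adj)
```

Setting `min_size=3` keeps only cliques of three or more proteins, matching the smallest clique size reported above.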

We explored the 75 candidate HDL subspecies further by assigning them significance scores. We defined the co-migration score of a given subspecies as the un-weighted average of the S-scores of all edges within it. To assess the significance of the co-migration score, we performed random permutation experiments (see Materials and Methods), and the p-value of each of the 75 subspecies was calculated based on the global null distribution. All 75 cliques have significant co-migration scores. Table 2-1 shows the 75 identified HDL subspecies, ranked by p-value within each size.
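The scoring of a candidate subspecies and a permutation-style significance estimate can be sketched as follows. The null model used here (random protein sets of the same size drawn from the network) is an assumption made for illustration; the actual permutation procedure is the one described in Materials and Methods. All names and the toy data are hypothetical.

```python
import random
from itertools import combinations


def clique_score(members, pair_scores):
    """Un-weighted average S-score over all protein pairs in the set.
    pair_scores is keyed by alphabetically sorted protein pairs."""
    pairs = list(combinations(sorted(members), 2))
    return sum(pair_scores[p] for p in pairs) / len(pairs)


def permutation_p_value(members, pair_scores, proteins, n_perm=1000, seed=0):
    """Fraction of random same-size protein sets scoring at least as high
    as the candidate (with the usual +1 correction)."""
    rng = random.Random(seed)
    observed = clique_score(members, pair_scores)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(proteins, len(members))
        if clique_score(sample, pair_scores) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data: three tightly co-migrating proteins among eight.
toy_proteins = list("ABCDEFGH")
toy_pair_scores = {p: (0.9 if set(p) <= {"A", "B", "C"} else 0.1)
                   for p in combinations(toy_proteins, 2)}
toy_p = permutation_p_value(["A", "B", "C"], toy_pair_scores, toy_proteins)
```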

Table 2-1 75 identified HDL subspecies ranked by the significance of co-migration scores.

Size    HDL subspecies (member proteins)

3 APOH CFAI HEMO

3 CO4B FIBB ITIH2

4 APOB FIBA FIBB FIBG

4 APOA1 APOA2 APOC3 APOM

4 A2MG FIBA FIBG MUC

4 ALBU ANT3 APOA4 PEDF

4 CERU CLUS HEP2 ITIH4

5 ALBU HEMO HRG IGHG1 PLMN

5 APOA1 APOA2 APOC3 CLUS PON1

5 ALBU APOA1 APOA2 ITIH4 TTHY

5 AACT ALBU CLUS FETUA KNG1

6 ALBU APOA1 APOA2 CLUS IGHG1 THRB

6 APOA1 APOA2 CLUS IGHG1 THRB VTNC

6 ALS APOA1 APOA2 CLUS ECM1 ITIH4

7 AACT ALBU APOA1 HRG IGHG1 KAC LAC

7 APOA1 APOA2 APOA4 APOC3 CLUS ECM1 ITIH4

7 APOA1 APOA2 APOA4 CLUS ECM1 HEP2 ITIH4

7 APOA1 CO4B FIBB FIBG IGHG1 ITIH1 LAC

7 AACT ALS APOA1 APOA2 CLUS ITIH4 KNG1

8 ALBU APOA1 CFAB HEMO HRG IGHG1 KAC LAC

8 APOA1 APOA2 CLUS FIBA FIBG IGHG1 MUC VTNC

8 APOA1 APOA2 APOC1 APOC3 CO3 IGHA1 IGHG1 LAC

8 ALBU APOA1 GELS HEMO HRG IGHG1 KAC LAC

8 APOA1 CFAH CLUS FIBB FIBG HPT IGHA1 MUC

8 APOA1 APOA2 CLUS FIBA FIBG IGHG1 ITIH1 VTNC

8 ALBU APOA1 APOA2 APOA4 GELS HEP2 IGHG1 LAC

8 A1AT ALBU APOA1 APOA2 CFAB HRG IGHG1 ITIH4

8 AACT ALBU APOA1 APOA2 CLUS HRG IGHG1 PGRP2

8 ALBU APOA1 APOA2 APOC3 CLUS IGHG1 ITIH4 KNG1

8 APOA1 CO3 CO4B IGHA1 IGHG1 KAC LAC MUC

8 ALBU ANT3 APOA1 APOA2 APOA4 CLUS HEP2 ITIH4

8 APOA1 APOC3 CO3 CO4B IGHA1 IGHG1 LAC MUC

8 A1AT AACT ALBU APOA1 APOA2 HRG IGHG1 ITIH4

8 ALBU APOA1 APOA2 CLUS CO9 HEP2 ITIH4 LAC

8 APOA1 APOC3 APOE CO4B FIBG IGHG1 ITIH1 LAC

8 A1AT AACT ALBU APOA1 APOA2 HEP2 IGHG1 ITIH4

9 APOA1 APOA2 CLUS FIBA FIBB FIBG IGHG1 ITIH1 LAC

9 APOA1 CO3 HV305 IGHA1 IGHG1 KAC KV302 LAC MUC

9 APOA1 APOA2 APOC3 CLUS CO3 IGHA1 IGHG1 LAC MUC

9 APOA1 APOC1 CO3 HV305 IGHA1 IGHG1 KAC KV302 LAC

9 ALBU APOA1 APOA2 APOA4 APOC3 CLUS IGHG1 ITIH4 LAC

9 ALBU APOA1 APOA2 APOA4 GELS HEMO HRG IGHG1 LAC

9 ALBU APOA1 APOA2 APOC3 CLUS HPT IGHA1 IGHG1 KNG1

9 APOA1 CO4B FIBB FIBG IGHA1 IGHG1 KAC LAC MUC

9 ALBU APOA1 APOA2 CLUS HPT HRG IGHA1 IGHG1 KNG1

9 ALBU APOA1 APOA2 APOA4 CFAB HEMO HRG KAIN PGRP2

9 APOA1 APOA2 APOC3 APOE CLUS FIBG IGHG1 ITIH1 LAC

9 ALBU APOA1 APOA2 APOA4 CLUS HEP2 IGHG1 ITIH4 LAC

9 AACT ALBU APOA1 APOA2 CLUS HRG IGHG1 ITIH4 LAC

9 APOA1 APOE CO4B FIBG IGHA1 IGHG1 KAC LAC MUC

9 AACT ALBU APOA1 APOA2 CLUS HEP2 IGHG1 ITIH4 LAC

9 APOA1 APOC3 APOE CO4B FIBG IGHA1 IGHG1 LAC MUC

9 AACT ALBU APOA1 APOA2 CLUS HRG IGHG1 ITIH4 KNG1

9 AACT ALBU APOA1 APOA2 CLUS HEP2 IGHG1 ITIH4 KNG1

9 AACT ALBU ANT3 APOA1 APOA2 CLUS HEP2 ITIH4 KNG1

10 ALBU APOA1 APOA2 APOA4 APOC3 CLUS HPT IGHA1 IGHG1 LAC

10 ALBU APOA1 HV304 HV305 IGHA1 IGHG1 KAC KV302 LAC LV102

10 ALBU APOA1 APOC1 HPT HV305 IGHA1 IGHG1 KAC KV302 LAC

10 ALBU APOA1 APOA2 APOA4 CFAB CLUS HEMO HRG IGHG1 LAC

10 ALBU APOA1 APOA2 APOA4 APOC1 APOC3 HPT IGHA1 IGHG1 LAC

10 ALBU APOA1 APOA2 APOA4 CLUS FIBB HRG IGHG1 ITIH4 LAC

10 ALBU APOA1 APOA2 APOC1 APOC3 APOE HPT IGHA1 IGHG1 LAC

10 ALBU APOA1 APOA2 APOA4 CFAB CLUS HEMO HRG IGHG1 PGRP2

10 ALBU APOA1 APOA2 APOA4 CFAB CLUS HRG IGHG1 ITIH4 LAC

10 ALBU APOA1 APOC1 APOE HPT IGHA1 IGHG1 KAC KV302 LAC

10 ALBU APOA1 APOA2 APOA4 CFAB CFAI HEMO HRG IGHG1 PGRP2

11 ALBU APOA1 FIBA FIBB FIBG HPT HRG IGHA1 IGHG1 KAC LAC

11 ALBU APOA1 APOA2 APOA4 CLUS FIBB HPT HRG IGHA1 IGHG1 LAC

11 ALBU APOA1 APOA2 APOC3 FIBG HPT HV304 IGHA1 IGHG1 LAC MUC

11 ALBU APOA1 APOE FIBG HPT IGHA1 IGHG1 KAC KV302 LAC MUC

12 ALBU APOA1 APOA2 CLUS FIBA FIBB FIBG HPT IGHA1 IGHG1 LAC MUC

12 ALBU APOA1 APOA2 CLUS FIBA FIBB FIBG HPT HRG IGHA1 IGHG1 LAC

12 ALBU APOA1 APOA2 FIBA FIBB FIBG HPT HV304 IGHA1 IGHG1 LAC MUC

12 ALBU APOA1 APOA2 APOC3 APOE CLUS FIBG HPT IGHA1 IGHG1 LAC MUC

14 ALBU APOA1 FIBA FIBB FIBG HPT HV304 HV305 IGHA1 IGHG1 KAC KV302 LAC MUC

We noticed that the 75 identified subspecies are largely enriched with abundant apolipoproteins such as apoA-I and apoA-II. We plotted the node degree distributions of the C-score and S-score co-migration networks (Fig. 2-6). Several known HDL proteins have very high node degrees. For example, the most abundant protein, apoA-I, has a degree of 57 in the C-score network, which indicates that it will be included in almost every subspecies. In practice, we are more interested in subspecies among the low-abundance HDL proteins, because many of these proteins were newly identified in recent proteomics studies and may reveal novel results. Therefore, in our second analysis, we removed the widely distributed proteins.

Specifically, we removed proteins occurring in more than five fractions within all the samples. Twelve proteins were eliminated from the network (APOA1, APOA2, ALBU, CLUS, CO4B, FIBA, FIBB, FIBG, HPT, IGHA1, IGHG1 and LAC). These proteins are almost all well-known and relatively abundant HDL-associated proteins. After this filtering, the resulting C-score network contains only 47 nodes and 208 edges with an average node degree of 9, and the S-score network has 48 nodes and 243 edges with an average node degree of 10.
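The reported average node degrees can be checked from the node and edge counts, since every undirected edge contributes to the degree of two nodes. The following is a minimal sketch; the helper name `average_degree` is introduced here only for illustration.

```python
def average_degree(n_nodes: int, n_edges: int) -> float:
    """Average node degree of an undirected graph: each edge touches two nodes."""
    return 2 * n_edges / n_nodes


# Node and edge counts of the filtered networks, as reported in the text.
c_score_avg = average_degree(47, 208)   # filtered C-score network
s_score_avg = average_degree(48, 243)   # filtered S-score network

print(round(c_score_avg, 2), round(s_score_avg, 2))
```

Both values round to the integers quoted above (9 and 10).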

The C-score similarity network generates 65 maximal cliques with sizes ranging from three to seven, and the S-score network generates 60 maximal cliques with sizes from three to eight. The number of overlapped cliques between the two networks is 31. For these 31 HDL subspecies identified from the filtered network, we also assigned co-migration significance scores as introduced above. Table 2-2 shows the 31 HDL subspecies ranked by the p-values with fixed size. For the subspecies of size three, the top subspecies contains the proteins apoH (APOH), complement factor I (CFAI) and hemopexin (HEMO). Fig. 2-7 shows the migration patterns of this subspecies in each of the 10 plasma samples. The migration patterns are quite similar, which indicates that our scoring system is able to detect proteins with similar co-migration patterns. Another example is the subspecies containing hemopexin (HEMO), histidine-rich glycoprotein (HRG) and plasminogen (PLMN). This subspecies ranked third in Table 2-2, and HRG and PLMN have been shown to interact in the literature (49).
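The clique-overlap step can be sketched as follows. This is an illustration only: it assumes that "overlapped" means the same protein set is a maximal clique in both networks, it uses the basic Bron-Kerbosch recursion rather than whatever clique finder was actually used, and the node names P1 to P5 and their edges are toy values, not data from this study.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (protein, protein) edges."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def maximal_cliques(adj, min_size=3):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    found = set()

    def expand(r, p, x):
        # r: current clique, p: candidates to extend it, x: already explored nodes
        if not p and not x:
            if len(r) >= min_size:
                found.add(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return found


# Toy networks (hypothetical edges, for illustration only).
c_edges = [("P1", "P2"), ("P2", "P3"), ("P1", "P3"),
           ("P3", "P4"), ("P4", "P5"), ("P3", "P5")]
s_edges = [("P1", "P2"), ("P2", "P3"), ("P1", "P3"), ("P3", "P4")]

c_cliques = maximal_cliques(build_adjacency(c_edges))
s_cliques = maximal_cliques(build_adjacency(s_edges))
shared = c_cliques & s_cliques   # cliques present in both networks
print(sorted(sorted(clique) for clique in shared))
```

In this toy case the C-score network has two triangles and the S-score network has one, so a single clique is shared.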

Figure 2-6 Node degree distributions of the (a) C-score co-migration network and (b) S-score co-migration network.

Table 2-2 31 identified HDL subspecies from the filtered co-migration networks ranked by the significance of co-migration scores.

Size of subspecies HDL subspecies Size of subspecies HDL subspecies

3 APOH CFAI HEMO 4 CFAB HEMO HRG KAC

3 APOC3 APOE MUC 4 APOA4 CFAB HRG ITIH4

3 HEMO HRG PLMN 4 ANT3 APOA4 HEP2 ITIH4

3 ALS ECM1 ITIH4 4 APOA4 GELS HEMO HRG

3 APOC3 CO3 MUC 4 APOE KAC KV302 MUC

3 APOC3 APOE ITIH1 4 AACT ALS ITIH4 KNG1

3 APOC1 APOC3 APOE 4 APOC1 APOE KAC KV302

3 APOC3 HV304 MUC 4 A1AT CFAB HRG ITIH4

3 A1AT HEP2 ITIH4 4 APOA4 ECM1 HEP2 ITIH4

3 AACT HRG PGRP2 5 HV304 HV305 KAC KV302 LV102

3 AACT HRG ITIH4 5 HV304 HV305 KAC KV302 MUC

3 APOC1 APOC3 CO3 5 CO3 HV305 KAC KV302 MUC

3 APOA4 APOC1 APOC3 5 CFAB CFAI HEMO HRG PGRP2

3 APOA4 GELS HEP2 5 AACT ANT3 HEP2 ITIH4 KNG1

3 APOA4 APOC3 ITIH4 5 APOC1 CO3 HV305 KAC KV302

6 APOA4 CFAB HEMO HRG KAIN PGRP2

Figure 2-7 Co-migration patterns of the top clique with size three in 10 plasma samples separately. This clique contains three proteins APOH, CFAI and HEMO. Curves with different colors represent migration patterns of different proteins.

2.2.4 Functional analysis of the identified HDL subspecies

We performed functional enrichment analysis on these HDL subspecies. We collected function annotations of HDL proteins from Gene Ontology (50). For each identified subspecies, to examine whether it is enriched for a certain function, we used the hypergeometric test to test the null hypothesis that its members were picked randomly from the genome (see Methods and Material). For the originally identified 75 HDL subspecies, we found that 60 of them have significantly enriched functions after Benjamini correction. The most enriched functions include reverse cholesterol transport, complement activation, platelet activity, coagulation, innate immune response, and regulation of acute inflammatory response. These results are consistent with previously known HDL functions such as reverse cholesterol transport, anti-oxidation, anti-inflammation and endothelial relaxation. For the 31 smaller HDL subspecies identified from the filtered co-migration networks, we found that 16 of them have significantly enriched functions after Benjamini correction. The enriched functions include acute inflammatory response, phospholipid efflux, phospholipid binding, coagulation and regulation of vesicle-mediated transport.

2.2.5 Validation of the identified HDL subspecies

Validate the identified HDL subspecies in public databases

We validated the identified HDL subspecies using known protein interaction data downloaded from the Human Protein Reference Database (HPRD) and the IntAct database (see Methods and Material). For the originally identified 75 HDL subspecies, we found that 72 of them have at least one literature-supported edge. We used a binomial test to examine the significance of this ratio, and the p-value is smaller than 1E-6 (see Methods and Material). For each individual subspecies, we also assessed the significance of its literature support. We generated random cliques of the same size as the given clique and counted how many literature-supported edges they have, which forms a background distribution. We then evaluated the significance of the literature support using this background distribution. Of the 75 identified cliques, 64 have significant literature support. For the 31 smaller HDL subspecies, we also performed validation using known interaction data. The result shows that only 5 of the 31 subspecies have literature support, which indicates that the currently known interactions occur mainly among the abundant proteins, and that our 31 subspecies may reveal many novel HDL protein interactions that have not been reported before.

Validate the HDL subspecies using global gene expression data

Our proteomic data provided direct evidence for the co-expression of components within the same subspecies at the proteomic level. We assume that they should also be co-expressed at the genomic level. To test this, we downloaded human mRNA expression data from the Gene Expression Omnibus (GEO) (see Methods and Material). The mRNA co-expression of two genes was calculated as the Pearson correlation coefficient between their expression profiles. The mRNA co-expression of an HDL subspecies was defined as the unweighted average of the co-expression scores of all edges within it. We also assessed the significance of the subspecies' co-expression scores using random permutation experiments. Among the 75 originally identified HDL subspecies, 45 have significant mRNA co-expression scores. For the 31 smaller HDL subspecies, 15 have significant mRNA co-expression scores.
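The subspecies co-expression score described above can be sketched as the mean Pearson correlation over all protein pairs in a subspecies. The gene names G1 to G3 and their expression profiles below are hypothetical toy values, and the function names are introduced here only for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (assumes non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def subspecies_coexpression(members, profiles):
    """Unweighted average of pairwise co-expression over all edges of a clique."""
    pairs = list(combinations(members, 2))
    return sum(pearson(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)


# Toy expression profiles over four samples (hypothetical values).
profiles = {
    "G1": [1.0, 2.0, 3.0, 4.0],
    "G2": [2.0, 4.0, 6.0, 8.0],
    "G3": [4.0, 3.0, 2.0, 1.0],
}
score = subspecies_coexpression(["G1", "G2", "G3"], profiles)
print(round(score, 3))
```

Here G1 and G2 are perfectly correlated while G3 is anti-correlated with both, so the three pairwise values are 1, -1 and -1 and their average is -1/3.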

Validate the identified HDL subspecies using apoE knockout gene expression data

Knocking out a key component of a given subspecies and assessing the gene expression changes of the other components in the same subspecies can provide direct evidence for the HDL subspecies. Here we used the apoE knockout gene expression data downloaded from GEO (see Methods and Material). We compared gene expression between apoE-deficient mice and wild-type mice and defined a gene as significantly changed if its expression level changed at least 5-fold. Among our 75 HDL subspecies, 7 contain the apoE protein, and all of these apoE-containing subspecies contain at least one gene with a significant expression change. Fig. 2-8 shows two examples of the identified subspecies that contain apoE. Both subspecies have significant co-migration scores and gene co-expression scores, and they are also enriched with literature-supported interaction edges. We noticed that when apoE is knocked out, several components show significant expression changes, such as HPT, APOA1, APOC, KAC and MUC.

Figure 2-8 Examples of HDL subspecies containing apoE. The color of each node is proportional to the fold change in gene expression in response to the apoE deletion. Blue edges are literature-supported edges.

Validate the HDL subspecies using protein structure information

Protein complex structure data provide strong evidence for structural HDL subspecies. We also validated the identified HDL subspecies using known protein complexes downloaded from the Protein Data Bank (PDB). Specifically, for each candidate HDL subspecies, we ran BLAST searches of its protein components against PDB to see whether some of them occur in the same PDB record and are matched to different protein chains. This situation would suggest that these components could structurally be part of the same protein complex. For example, we identified an HDL subspecies with four components, apoB, fibrinogen alpha chain (FIBA), fibrinogen beta chain (FIBB) and fibrinogen gamma chain (FIBG), shown in Fig. 2-9a, where FIBA, FIBB and FIBG are matched to the human fibrinogen complex (3GHG) (51). In this complex, FIBA, FIBB and FIBG also physically interact with each other. Another promising example is the HDL subspecies shown in Fig. 2-9b. Three components of this subspecies are matched to the known protein complex 2ESG (52), where albumin (ALBU) is matched to chain C, Ig alpha-1 chain C region (IGHA1) is matched to chain A, and Ig kappa chain C region (KAC) is matched to chain L. This supports the view that ALBU, IGHA1 and KAC belong to the same HDL particle. In addition to the structural evidence, this subspecies is also supported by literature, gene co-expression and apoE knockout data.

Figure 2-9 Examples of HDL subspecies supported by known protein complex structures.

2.3 Discussion

In this project, we systematically identified and characterized structural HDL subspecies and their biological functions by analyzing protein co-migration patterns generated by three orthogonal chromatographic separation techniques. This work is innovative in the following aspects. First, current studies of HDL proteomics are few and are limited in that, by looking only at pools of total HDL or very broad density subsets (i.e., HDL2 or HDL3), they do not take into account the heterogeneity of the HDL population. Our collaborator's proteomic experiments fractionated the total HDL pool into successive fractions to allow discrimination between subspecies. Greater fractionation of the total HDL population allows us to examine protein variances between particles more closely and to identify specific HDL subspecies. Second, this is the first report that computationally analyzed MS-determined co-migration patterns and designed an approach to systematically infer HDL subspecies. It uncovered the relationship between HDL subspecies and function in a way that has not been attempted before. As such, it filled a major gap in our understanding of the compositional and functional heterogeneity of HDL particles.

Our work is also significant for current cardiovascular disease research. First, a better understanding of the HDL subspecies that are either protective against or permissive to CVD will lead directly to new diagnostic tests for these species. For example, if we find that a particular connection (i.e. chemical cross-link) between protein X and Y is correlated with the development of CVD, we can envision using the interacting peptides as a basis for generating antibodies targeted to the junctions between proteins X and Y. Such an antibody could then be used in a clinical assay to screen patients for this deleterious HDL subspecies. Second, the active HDL subspecies identified in this work may focus new HDL-raising therapies that offer more specificity than those currently under exploration by the pharmaceutical industry (53, 54). Current strategies to boost HDL levels do so indiscriminately without regard for functionality; since HDL is composed of numerous particle populations, simply raising HDL may increase the wrong sorts of particles at the expense of the cardio-protective ones. Our research provides a better understanding of the subspecies makeup of the fractions classically referred to as “HDL”.

An obvious criticism of our proteomic data concerns the validity of the assumption that the PPIs on PL-rich particles are stable enough to survive the various separation strategies. Indeed, each proposed method has drawbacks that can potentially perturb PPIs, e.g. dilution effects in GF and salt effects in AE. However, we emphasize that our correlation scoring system is designed to combine data from all three separation methods and derive a global score that reflects the likelihood of an interaction between candidate proteins. Therefore, it is tolerant of a potential artifactual lack of co-migration due to experimental perturbation in one of the separations. While this would lower the derived global score, the failure of a given complex to survive one of the separations does not necessarily preclude the identification of the interaction.

We applied our analysis to two datasets: the whole proteomic dataset and a filtered dataset from which we eliminated the 12 most abundant proteins. The rationale for this lies mainly in the characteristics of the HDL proteins. The 12 most abundant proteins (e.g., apoA-I and apoA-II) have wide distributions across fractions; this means they have large overlaps in migration patterns with other proteins, resulting in large scores (especially C-scores). Therefore, these proteins tend to be hubs in the co-migration networks and are likely to be included in the cliques. Also, these abundant proteins are all well-studied proteins with known HDL functions. For example, we know that apoA-I interacts with many HDL proteins that participate in diverse functions. Therefore, we expected that it would occur in many identified subspecies. In the analysis of the whole proteomic dataset, including these proteins causes the identified subspecies to overlap heavily. In contrast, we are more interested in the low-abundance proteins, which are relatively newly identified. Identifying the associations among those proteins is more meaningful, as they may reveal novel interactions among HDL proteins. That is why we performed the second analysis on the filtered dataset. As expected, the identified subspecies have relatively small sizes and very low overlaps. These subspecies covered a number of distinct proteins with functions varying from lipid transport to innate immunity.

Since our result provides a high-resolution view of the functional specificity of HDL subspecies, in the future more targeted therapies can be explored to boost the cardio-protective HDL particles, even if they make up a minor fraction of total HDL. In addition, small-molecule therapies that mimic the cardio-protective effects of identified beneficial HDL subspecies could be explored. Furthermore, the validated network-based approach can also be applied to correlate HDL subspecies with CVD status, resulting in effective disease biomarkers. In the long term, therapeutic strategies can be designed to modify certain HDL subspecies based on their effects, with the goal of reducing CVD.

2.4 Methods and Material

2.4.1 Data sources

Novel human plasma subfractionation methods to identify protein composition.

(A) Gel filtration (GF) approach: GF separates by molecular diameter. The principal roadblock to non-density-based proteomic studies on HDL is the co-elution of high-abundance plasma proteins, e.g., albumin and immunoglobulins, which are not associated with lipids. These co-eluting proteins interfere with MS analysis by “swamping out” low-abundance proteins and preventing their identification. To make MS analysis of the three non-density-based HDL isolation methods possible, our collaborator has made novel use of an existing commercial compound, Lipid Removal Agent (LRA; Supelco), which allows phospholipid particles (i.e., HDL) to be retained from isolated fractions while contaminating plasma proteins are washed away. Our collaborator also developed a triple Superdex GF method that effectively separates HDL from the most abundant plasma protein, serum albumin. The cholesterol trace shows that LDL and VLDL co-elute in one large peak, with HDL eluting in a broad peak of its own (Fig. 2-2A).

(B) Anion exchange (AE) approach: In AE chromatography, the positively charged stationary phase binds negatively charged analytes, which are then competed off the column by a salt gradient. Our collaborator developed a new technique using a 10/30 MonoQ AE column with a gradient of 0-45% perchlorate. The lipid and protein profiles from a typical separation of normal human plasma are shown in Fig. 2-2B.

(C) Isoelectric focusing (IEF) approach: Liquid-phase IEF technique separates proteins based on isoelectric point (pI) or the pH at which a protein has a net charge of zero. Our collaborator has recently acquired the necessary equipment to perform the isolation, Rotofor (Biorad), and just finished optimizing this technique (Fig. 2-2C).

These new techniques dramatically increase our ability to examine protein variances between particles and allow us to study all lipoprotein classes, not just what is traditionally considered HDL, in human plasma.

The known HDL protein interaction data were downloaded from the Human Protein Reference Database (HPRD): http://www.hprd.org/, and the IntAct database: http://www.ebi.ac.uk/intact/. In HPRD, all interactions have strong experimental support, such as cross-linking, affinity chromatography or immunoprecipitation. These experiments were conducted either in vivo or in vitro based on normal human plasma; thus, they may represent the HDL interactions under normal conditions. In IntAct, all binary interactions are manually curated from publications.

The human gene expression data were downloaded from the Gene Expression Omnibus (GEO) database at NCBI: http://www.ncbi.nlm.nih.gov/geo/. We used the following microarray data: GSE3526: Comparison of normal human tissue (55) and GSE3059: Profile gene expression in human peripheral blood cells (56).

The mouse apolipoprotein E knockout gene expression data (GSE28031 and GSE2372) were also downloaded from GEO.

The protein complex structure data were downloaded from the Protein Data Bank: http://www.rcsb.org/pdb/home/home.do, which is a repository for the 3-D structural data of large biological molecules such as proteins and protein complexes.

2.4.2 Random permutation test

To test the individual significance of the identified subspecies, we randomly shuffled the labels of all the proteins with respect to their co-expression patterns (meaning that after shuffling, protein A might have the co-expression pattern of the “old” protein C), then identified subspecies based on the shuffled data. We repeated this process 50,000 times and collected all “random” subspecies identified in each iteration. We sorted these “random” subspecies by size and counted the number of literature-supported edges (or gene co-expression evidence edges for the gene co-expression validation) occurring within each random subspecies. Then, for each of the 75 “true” identified subspecies, we compared its number of literature-supported edges with those of the “random” ones of identical size. These “random” subspecies form a global null distribution for each size, and the significance p-value of the “true” subspecies was calculated as the fraction of “random” ones with an equal or greater number of literature-supported edges. We used 0.05 as the p-value cutoff; subspecies with a p-value smaller than 0.05 were considered significant.
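The two building blocks of this test, label shuffling and the empirical p-value, can be sketched as follows. This is a simplified illustration: the step that re-identifies subspecies from the shuffled data is omitted, and the protein names, patterns and null counts are toy values chosen for the example.

```python
import random


def shuffle_labels(patterns, rng):
    """Randomly reassign the existing patterns to the protein labels."""
    names = list(patterns)
    shuffled = names[:]
    rng.shuffle(shuffled)
    return {new: patterns[old] for new, old in zip(shuffled, names)}


def empirical_p_value(observed, null_counts):
    """Fraction of random subspecies with at least as many supported edges."""
    at_least = sum(1 for count in null_counts if count >= observed)
    return at_least / len(null_counts)


# Toy patterns (hypothetical): after shuffling, a protein may carry another's pattern.
patterns = {"A": [1, 0, 0], "B": [0, 1, 0], "C": [0, 0, 1]}
shuffled = shuffle_labels(patterns, random.Random(0))

# Toy null distribution for one clique size: supported-edge counts of 100 random cliques.
null_counts = [0] * 90 + [1] * 7 + [2] * 2 + [3] * 1
p_value = empirical_p_value(2, null_counts)
print(p_value, p_value < 0.05)
```

With these toy counts, 3 of the 100 random cliques have two or more supported edges, so a true clique with two supported edges would pass the 0.05 cutoff.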

2.4.3 Binomial test

For the 75 identified HDL subspecies, we validated them using the known protein-protein interaction data downloaded from public databases. We found that 72 of them have at least one literature-supported edge. We used a binomial test to examine the overall significance of this ratio. For the 159 HDL proteins in total, there are 12561 possible protein pairs, among which there are 210 literature-supported edges. In our candidate subspecies there are 2471 protein pairs (edges), and 654 of them have literature support. The number of literature-supported edges obtained by randomly selecting 2471 edges from the pool (12561 edges) with replacement follows a binomial distribution with n=2471 and p=210/12561, so the expected number of supported edges is 2471*210/12561, which is about 41. The probability of getting 654 or more supported edges is less than 1e-6, which is the corresponding significance p-value.
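The arithmetic of this test can be reproduced with the Python standard library. The sketch below sums the binomial tail in log space to avoid floating-point underflow; the variable names are chosen here for illustration, and only the numbers come from the text.

```python
from math import comb, exp, lgamma, log

n_pairs_total = comb(159, 2)          # all possible pairs among 159 proteins
n, k = 2471, 654                      # edges in the subspecies, supported edges
p = 210 / n_pairs_total               # background rate of literature support
expected = n * p                      # expected supported edges under the null


def log_binom_pmf(i, n, p):
    """Natural log of the binomial probability of exactly i successes."""
    return (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log(1 - p))


# log of P(X >= k), computed with the log-sum-exp trick.
logs = [log_binom_pmf(i, n, p) for i in range(k, n + 1)]
peak = max(logs)
log_tail = peak + log(sum(exp(v - peak) for v in logs))
log10_tail = log_tail / log(10)

print(n_pairs_total, round(expected, 1), log10_tail < -6)
```

The pair count and the expected value agree with the figures quoted above, and the tail probability lies far below the 1e-6 bound.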

2.4.4 Function enrichment analysis

Gene Ontology (GO) provides controlled vocabulary terms, called GO terms, to describe gene function. We collected function annotations of HDL-associated proteins from GO. For each identified clique, to examine whether it is enriched for a certain function, we used the hypergeometric test to test the null hypothesis that its members were picked randomly from the genome. The p-values were calculated as

$$P = \sum_{i=q}^{\min(m,p)} \frac{\binom{p}{i}\binom{n-p}{m-i}}{\binom{n}{m}},$$

where n is the total number of genes in the genome, m is the number of genes in the clique, p is the number of genes in the genome annotated with the given GO term, and q is the number of genes in the clique that are also annotated with that GO term. The smaller the p-value, the more significantly the clique is enriched for the given function.
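The hypergeometric tail probability can be evaluated exactly with integer binomial coefficients. The following is a minimal sketch using the symbols n, m, p and q as defined above; the function name and the toy numbers in the usage lines are assumptions made for illustration.

```python
from math import comb


def hypergeom_enrichment_p(n, m, p, q):
    """P(at least q of the m clique genes carry a GO term held by p of n genes)."""
    upper = min(m, p)                 # cannot draw more annotated genes than exist
    total = comb(n, m)                # all ways to pick m genes from the genome
    tail = sum(comb(p, i) * comb(n - p, m - i) for i in range(q, upper + 1))
    return tail / total


# Toy example: genome of 20 genes, 4 annotated, clique of 5 genes with 2 annotated.
p_value = hypergeom_enrichment_p(n=20, m=5, p=4, q=2)
print(round(p_value, 4))
```

Setting q to zero sums the whole distribution and returns 1, which is a convenient sanity check for the implementation.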

Chapter 3. Mapping of the generalized HDL interactome network by integrating co-migration network and predicted interactome network

3.1 Introduction

Similar co-migration patterns among proteins across fractions may imply their co-existence in the same HDL particle. In Chapter 2, we examined the co-migration patterns across three orthogonal separation techniques. For a protein pair, consistently high similarity in migration patterns across techniques provides the strongest evidence of co-existence in the same particle, and we built the co-migration similarity network on that basis. However, we also noticed that, because of the limitations of the experimental techniques, the co-migration analysis may be biased against low-abundance proteins or highly labile complexes. On the other hand, computational techniques that have been shown to be successful for predicting protein complexes in organisms such as yeast and human provide an alternative approach to compensate for these limitations in determining the co-occurrence of proteins in the same HDL particle. Here, protein interactions are defined broadly as co-complexation and thus are not necessarily direct physical contacts.

Previous research has demonstrated that many sequence, structural and functional genomics features correlate with protein interactions in protein complexes and have predictive power for protein interaction. For example, proteins are more likely to interact if they share similar phylogenetic profiles (57), have co-adapted mutations over the course of evolution (58-60), are co-expressed (61), or have homologs which are known to interact in another organism (62-64). In addition, interaction predictions can be further improved by integrating different features (65-69). Various machine learning methods have been applied, ranging from the simple Naïve Bayes (66) to the more sophisticated boosting (70) and decision-tree-based methods (71, 72).

In this work, we took into account the most relevant features associated with protein interactions, such as genome-wide sequence, function, gene expression and curated datasets from the literature, for HDL protein interaction prediction. We first assembled these features as potential predictors of interaction and evaluated their predictive power using known HDL interactions downloaded from public databases. We then combined these features into an integrated interaction prediction using a logistic regression classifier, which does not require the features to be independent. Using this framework, each HDL protein pair was assigned a probability score of interaction. With a proper cutoff chosen, the top-scoring protein pairs were taken as predicted interactions. These predicted interactions were used to build the predicted HDL interactome network. We then combined it with the previously built co-migration similarity network to obtain the generalized interactome network. This generalized interactome network was more complete and accurate than the individual networks.

3.2 Results

3.2.1 Assessing the performance of individual features

We focused on the four most relevant features known to be associated with protein interactions. Our rationale for applying these features to HDL complexes, which consist of both proteins and lipids, is based on our previous successful applications of these features to predict protein complexes, including helical membrane protein-protein interactions (PPIs), which are often mediated by lipids (73).

(A) mRNA co-expression:

This feature is context-dependent and derived from functional genomics experiments. It has been widely used in protein interaction prediction since large-scale microarray data became available. Subunits of the same protein complex often show significant co-expression in terms of similar mRNA levels or expression profiles; e.g., the expression patterns of the subunits of a complex are correlated over a time course (66, 74). This co-expression relationship has been shown to be useful in predicting protein interactions in both yeast and human (66, 70, 75). We obtained mRNA expression data of normal human subjects from public databases such as GEO. For each pair of HDL proteins, the mRNA co-expression index was calculated as the correlation coefficient between their mRNA expression profiles over different samples (see 3.3 Methods and Materials).

To quantify the performance of this feature, we calculated the fold enrichment score (FES) as follows. We first divided the numeric feature into several consecutive bins, e.g., from low to high. Then, for each bin, we calculated the fraction of known HDL interactions (see 3.3 Methods and Materials) falling into the bin, as well as the fraction of all HDL protein pairs falling into the bin. The FES is defined as the ratio of these two frequencies. For each bin, the larger the FES, the more confident we are that the bin is enriched with true HDL interactions. We plotted the FES distribution of this feature, and Fig. 3-1 shows that the known HDL interactions are enriched in bins with larger correlation scores. There is a positive correlation between FES and correlation score. This is consistent with previous research showing that this feature is a good indicator of protein interactions.
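The FES calculation can be sketched as below, assuming that the denominator is the fraction of all candidate protein pairs falling in a bin. The bin boundaries, feature values and interaction flags are toy values chosen for illustration, not data from this study.

```python
from bisect import bisect_right


def fold_enrichment_scores(values, known_flags, boundaries):
    """FES per bin: (fraction of known interactions in bin) / (fraction of all pairs in bin)."""
    n_bins = len(boundaries) + 1
    all_counts = [0] * n_bins
    known_counts = [0] * n_bins
    for value, known in zip(values, known_flags):
        b = bisect_right(boundaries, value)   # index of the bin holding this value
        all_counts[b] += 1
        if known:
            known_counts[b] += 1
    n_all, n_known = sum(all_counts), sum(known_counts)
    scores = []
    for a, k in zip(all_counts, known_counts):
        # An empty bin has no defined enrichment.
        scores.append(None if a == 0 else (k / n_known) / (a / n_all))
    return scores


# Toy data: correlation score per protein pair and whether the pair is a known interaction.
values = [-0.5, -0.2, 0.1, 0.2, 0.4, 0.6, 0.7, 0.9]
known = [False, False, False, False, True, False, True, True]
fes = fold_enrichment_scores(values, known, boundaries=[0.0, 0.5])
print([round(s, 3) for s in fes])
```

In this toy case the FES rises from the lowest to the highest bin, which is the pattern described for a predictive feature.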

[Plot: mRNA co-expression; x-axis: Low to High; y-axis: FES]

Figure 3-1. The distribution of Fold enrichment score (FES) of mRNA co-expression.

(B) Functional similarity:

This is also a common feature used to predict protein interactions. Interacting proteins often function in the same biological process and have similar molecular functions. This means that two proteins that interact are more likely to belong to the same biological process or molecular function than to different ones. We collected functional annotations of HDL proteins from Gene Ontology (GO). For each pair of HDL proteins, we calculated the functional similarity score between them using three widely used semantic similarity methods: Resnik's similarity, Lin's similarity and Jiang's similarity (76) (see 3.3 Methods and Materials). Fig. 3-2 shows the FES distribution of this feature using different similarity methods and GO classifications. In general, the known HDL interactions are enriched in the bins with high functional similarity in all three methods, and the positive correlation holds across the different similarity methods and GO categories.

[Plot panels: Resnik's, Lin's and Jiang's similarity, each for Biological Process and for Molecular Function; x-axis: Low to High; y-axis: FES]

Figure 3-2. The distribution of Fold enrichment score (FES) of functional similarity.

Interacting protein pairs are often found to co-evolve, e.g., insulin and its receptors, and dockerins and cohesins (77). In these situations, the phylogenetic trees of the interacting protein pairs would show a greater degree of similarity than those of non-interacting proteins. The phylogenetic trees are constructed from the corresponding distance matrices of multiple sequence alignments, and the similarity of two phylogenetic trees can be calculated as the linear correlation between these distance matrices (see 3.3 Methods and Materials). High correlation values are interpreted as indicative of true interactions (59, 78). Here, for each HDL protein pair, the linear correlation between their distance matrices was calculated. Fig. 3-3 shows that the known HDL interactions are more enriched in the higher correlation categories than random pairs. This is consistent with the observation that interacting proteins have more similar phylogenetic trees (77, 78).
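The tree-similarity measure can be sketched as a Pearson correlation over the upper triangles of two distance matrices built on the same set of species. The matrices below are toy values, and the function names are introduced here only for illustration.

```python
from math import sqrt


def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (assumes non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def tree_similarity(dist_a, dist_b):
    """Linear correlation between two distance matrices over the same species."""
    return pearson(upper_triangle(dist_a), upper_triangle(dist_b))


# Toy distance matrices over four species (hypothetical values).
dist_a = [[0, 1, 2, 3], [1, 0, 2, 3], [2, 2, 0, 1], [3, 3, 1, 0]]
dist_b = [[2 * d for d in row] for row in dist_a]      # same tree shape, scaled
dist_c = [[0, 3, 1, 2], [3, 0, 1, 2], [1, 1, 0, 3], [2, 2, 3, 0]]

print(round(tree_similarity(dist_a, dist_b), 3),
      round(tree_similarity(dist_a, dist_c), 3))
```

A uniformly scaled matrix gives a correlation of 1, while the rearranged matrix gives a negative correlation, illustrating how dissimilar trees score low.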

[Plot: similarity of phylogenetic trees; x-axis: Low to High; y-axis: FES]

Figure 3-3. The FES distribution of similarity of phylogenetic trees.

(D) Co-occurrence in annotated gene sets

The Molecular Signatures Database (MSigDB) is a collection of annotated gene sets (79). The gene sets in the database are divided into several major collections. The first collection is the curated gene sets, which are collected from public sources such as online pathway databases, publications in PubMed and the knowledge of domain experts. It contains the following sub-collections: chemical and genetic perturbations, where gene sets represent gene expression signatures of genetic and chemical perturbations; and canonical pathways, where gene sets are derived from online pathway databases such as BioCarta and KEGG. Usually, these gene sets are canonical representations of a biological process compiled by domain experts. The second major collection is the motif gene sets, which contain microRNA targets, where gene sets contain genes that share a 3'-UTR microRNA binding motif, and transcription factor targets, where gene sets contain genes that share a transcription factor binding site defined in the TRANSFAC database. MSigDB also contains computational gene sets and GO gene sets.

Here we used the gene sets in the first two collections, the curated gene sets and the motif gene sets. For each HDL protein pair, we calculated its co-occurrence in these gene sets. Fig. 3-4 shows that the chemical and genetic perturbation gene sets and the pathway gene sets are highly predictive features: the true interactions are significantly enriched in the bins with high co-occurrence.
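The co-occurrence feature described above can be sketched as follows. This is a minimal illustration, assuming that co-occurrence is the number of gene sets containing both members of a pair; the function name `cooccurrence_counts` and the toy gene sets are hypothetical and are not taken from MSigDB or from the original analysis.

```python
from itertools import combinations


def cooccurrence_counts(gene_sets, proteins):
    """Count, for every unordered protein pair, how many gene sets contain both.

    Assumption: co-occurrence is a simple count of shared gene sets.
    """
    wanted = set(proteins)
    counts = {pair: 0 for pair in combinations(sorted(wanted), 2)}
    for members in gene_sets.values():
        present = sorted(wanted & set(members))
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


# Hypothetical toy gene sets, for illustration only.
toy_sets = {
    "PATHWAY_A": ["FIBA", "FIBB", "FIBG"],
    "PATHWAY_B": ["FIBA", "FIBB", "THRB"],
    "MOTIF_C": ["APOA1", "THRB"],
}
counts = cooccurrence_counts(
    toy_sets, ["FIBA", "FIBB", "FIBG", "THRB", "APOA1"]
)
print(counts[("FIBA", "FIBB")])
```

Pairs are keyed in alphabetical order, so each unordered pair appears exactly once.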

[Figure 3-4: three panels (Chemical and genetic perturbation, Pathway gene sets, Motif gene sets); x-axis: co-occurrence bins from Low to High; y-axis: FES.]

Figure 3-4. The FES distribution of protein pair co-occurrence.

To compare the predictive power of different features, we performed a Naïve Bayes analysis and ranked all features according to the coverage length of the log-odds ratio (24) (see Section 3.4, Material and methods) (Fig. 3-5). As the figure shows, the longer the overall coverage length, the greater the contribution of the corresponding feature to the target class, i.e., protein interaction. The most predictive feature is pathway co-occurrence, which is calculated from shared pathway information. This demonstrates that two proteins taking part in a common pathway are more likely to be within the same protein complex than two proteins that belong to different pathways. The next most predictive features are motif co-occurrence, functional similarity and co-evolution. These features are all well known to be associated with protein interactions.

Figure 3-5. The Nomogram for visualization of the features. Each feature has a corresponding line indicating the relationship between a feature value and its predictive contribution assessed by Naïve Bayes analysis. The number on the line is the value of the feature and each value corresponds to a point score above. The longer the line is, the more predictive power the feature has in prediction.

3.2.2 Logistic regression framework

In order to optimally combine these features, we used the logistic regression framework (73) (see Section 3.4, Material and methods). We used the “gold-standard” dataset to estimate the optimal coefficient associated with each feature. A protein pair was predicted as a true interaction if its probability score of interaction exceeded a cutoff. The gold-standard positive set contains 110 true interactions among the 159 HDL associated proteins, downloaded from the Human Protein Reference Database (HPRD) (see Section 3.4, Material and methods). Previous research has shown that each protein has an average of five physical partners (80), and the average number of co-complex partners is larger than the number of physical partners (about 15~20). Taken together, we assume that each HDL protein has on the order of ten co-complex protein partners. Thus, there should be ~800 co-complex protein interactions in total among the 159 HDL associated proteins, and the ratio of the positive set (true interactions) to the negative set is about 1:15. To make a more balanced training set, each time we randomly selected 110*15 “non-gold-standard” protein pairs as the “gold-standard” negative set which, together with the 110 “gold-standard” positive pairs, formed the complete training set.
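The arithmetic behind the ~800 interactions and the 1:15 ratio can be checked with a short sketch. It assumes that each of the ten partners per protein is counted once per pair (hence the division by two); the variable names are illustrative only.

```python
from math import comb

n_proteins = 159
partners_per_protein = 10   # order-of-magnitude assumption stated in the text
n_gold_positive = 110

# All unordered pairs among the 159 HDL associated proteins.
all_pairs = comb(n_proteins, 2)

# Each interaction is shared by two proteins, so divide by two.
expected_true = n_proteins * partners_per_protein // 2

# Expected number of non-interacting pairs per interacting pair.
neg_per_pos = (all_pairs - expected_true) / expected_true

# Size of the sampled negative set used in each training round.
n_sampled_negatives = n_gold_positive * 15

print(all_pairs, expected_true, round(neg_per_pos, 1), n_sampled_negatives)
```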

To obtain a less biased evaluation of the model performance, we used a ten-fold cross-validation approach. For each trial, we divided the training set into ten subsets of equal size: nine subsets were used as the training set to build the classifier, and the remaining one was used as the test set. The average AUC value among the ten test sets was reported as the final performance score of the model. We applied the model to 1000 different training sets assembled by fixing the gold-standard positive set and randomly sampling the negative set. The average AUC score over all the models is 0.858. Figure 3-6 shows the ROC curves of the model based on each feature individually and on all features integrated together. The model that integrates all the features outperforms the models built on individual features, especially at low false positive rates (<0.2). Among the individual classifiers, when the false positive rate is low (<0.005), the best classifiers are those based on pathway co-occurrence. When the false positive rate is higher (>0.1), the best classifiers are those based on functional similarity.

For each training-testing process, we also applied the model to the whole dataset, so that each HDL protein pair received a probability score of interaction. The final interaction score for a given protein pair was assigned by averaging the scores from all the training models. We assumed that there are about 800 actual HDL protein interactions. Therefore, the top 800 protein pairs were predicted to be true interactions. To evaluate the performance of our predicted interactions, we further calculated the precision rate (the fraction of predicted interactions that are true interactions) and the recall rate (the fraction of true interactions that are predicted to interact). Table 3-2 lists the performance at different score cutoffs. The precision rate is very high (70%) for the top 20 predictions and decreases to 7.25% for the top 800 predictions. At the same time, the corresponding recall rate increases from 13% to 53%. This indicates that our predicted interactions are able to cover the majority of the true interactions, and it provides the rationale for our particular choice of cutoff. To compare this performance with “random guesses”, we also calculated the fold enrichment score. The FES for the top 20 predictions is 77, and even for the top 800 predictions the score is still as high as 8. Table 3-1 lists the top 50 predictions along with their labels in the gold-standard dataset. Of the 50 predictions, 23 are true interactions. This performance is significantly better than “random guesses”, which would achieve only 0.9% accuracy. The other pairs are not necessarily incorrect, since the current HPRD database only includes 1/3 of human proteins.
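The precision and recall values quoted here follow directly from the counts in Table 3-2 and can be recomputed with a minimal sketch. The function names are hypothetical. The fold enrichment helper assumes that FES is the precision divided by the fraction of true interactions expected by chance; the background rate is left as an explicit argument because its exact value is not stated in this section.

```python
def precision_recall(n_predicted, n_confirmed, n_gold_positive):
    """Precision and recall of a set of top-ranked predictions."""
    precision = n_confirmed / n_predicted
    recall = n_confirmed / n_gold_positive
    return precision, recall


def fold_enrichment(precision, background_rate):
    """Assumed definition: precision relative to the chance rate."""
    return precision / background_rate


# (top predictions, consistent with HPRD), taken from Table 3-2.
rows = [(20, 14), (50, 23), (100, 27), (200, 38), (800, 58)]
n_gold_positive = 110

for top_n, hits in rows:
    p, r = precision_recall(top_n, hits, n_gold_positive)
    print(f"top {top_n}: precision={p:.2%}, recall={r:.0%}")
```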

[Figure 3-6: ROC curves; x-axis: False positive rate (1-Specificity); y-axis: True positive rate (Sensitivity); curves: Combined, Chemical perturbation, Pathway co-occurrence, Motif co-occurrence, Co-evolution, Co-expression, Functional similarity.]

Figure 3-6. ROC curves based on all features combined, as well as on individual features.

Table 3-1. The top 50 predicted interactions along with their labels in gold-standard dataset.

Protein A    Protein B    Probability score of interaction    Label in gold standard

FIBB_HUMA FIBG_HUMA 0.999882 1

C1R_HUMA C1S_HUMA 0.990632 1

FIBA_HUMA FIBB_HUMA 0.989535 1

FIBA_HUMA FIBG_HUMA 0.983859 1

CO8A_HUMA CO9_HUMA 0.897101 1

CO8A_HUMA CO8B_HUMA 0.885635 1

C1S_HUMA CO3_HUMA 0.858405 0

CO5_HUMA CO6_HUMA 0.855424 1

CO6_HUMA CO8A_HUMA 0.854804 0

PROC_HUMA THRB_HUMA 0.846423 1

CO3_HUMA CO5_HUMA 0.837978 1

CO5_HUMA CO9_HUMA 0.828467 0

IGHG3_HUMA LAC_HUMA 0.817204 0

C1S_HUMA IC1_HUMA 0.815675 1

PROS_HUMA THRB_HUMA 0.807055 1

CO8B_HUMA CO9_HUMA 0.804102 0

PROC_HUMA PROS_HUMA 0.77369 1

CO2_HUMA CO3_HUMA 0.772506 1

CO5_HUMA CO8A_HUMA 0.771818 0

C1R_HUMA IC1_HUMA 0.769883 1

CO6_HUMA CO8B_HUMA 0.766277 0

C1QB_HUMA C1QC_HUMA 0.759848 1

C1S_HUMA LAC_HUMA 0.747148 0

CO6_HUMA CO9_HUMA 0.743985 0

C1R_HUMA CO2_HUMA 0.739398 0

KG1_HUMA THRB_HUMA 0.738327 1

CO8A_HUMA CO8G_HUMA 0.736607 1

FA9_HUMA THRB_HUMA 0.736314 1

IGHG1_HUMA LAC_HUMA 0.726099 0

FIBG_HUMA THRB_HUMA 0.718969 0

CO8B_HUMA CO8G_HUMA 0.709074 0

CO2_HUMA CO5_HUMA 0.70648 1

CO5_HUMA CO8B_HUMA 0.700997 1

CO2_HUMA CO4A_HUMA 0.699563 0

CO2_HUMA CO8A_HUMA 0.693396 0

FA12_HUMA KLKB1_HUMA 0.692921 1

C1R_HUMA CO3_HUMA 0.677411 0

FA10_HUMA PROS_HUMA 0.671909 1

FA10_HUMA THRB_HUMA 0.665546 0

IGJ_HUMA LAC_HUMA 0.654371 0

CO2_HUMA CO9_HUMA 0.645729 0

FIBB_HUMA THRB_HUMA 0.642428 0

C1S_HUMA CO7_HUMA 0.634043 0

C1R_HUMA CFAB_HUMA 0.631446 0

CO3_HUMA CO4A_HUMA 0.628338 0

FIBG_HUMA PLM_HUMA 0.626791 0

PLM_HUMA THRB_HUMA 0.625727 0

AT3_HUMA THRB_HUMA 0.622355 1

CO2_HUMA CO6_HUMA 0.620364 0

FIBG_HUMA PROS_HUMA 0.61456 0

Table 3-2. Performance of the top predictions evaluated by precision and recall rate

Top predictions    Consistent with HPRD    Precision / Recall    Fold enrichment score (FES)

20    14    70% / 13%    76.9

50    23    46% / 21%    50.6

100    27    27% / 25%    29.7

200    38    19% / 35%    20.9

800    58    7.25% / 53%    8.0

3.2.3 Merging with co-migration network

After building the predicted HDL interactome network, we merged it with the previously built co-migration similarity network to obtain the generalized HDL interactome network. The rationale is that proteins with similar co-migration patterns across different separation techniques are more likely to form distinct subspecies. Moreover, the accuracy of the co-migration similarity network has been validated (in Chapter 2) using different biological sources of evidence. Merging the two networks therefore yields a more complete and less biased interactome network. The predicted HDL interactome network contains 103 proteins and 800 edges, the co-migration network contains 60 nodes and 628 edges, and the combined generalized HDL interactome network contains 134 nodes and 1382 edges. Our final HDL interactome network thus consists of 1382 interactions among HDL associated proteins, among which 110 are in the gold-standard positive set. Fig. 3-7 shows this generalized HDL interactome network. In this figure, the green and red edges indicate interactions from the predicted interactome network and the co-migration network, respectively, and the blue edges indicate interactions common to the two networks.
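The merging step amounts to a union of two undirected edge sets in which each edge keeps a label for its origin, matching the green/red/blue colouring of Fig. 3-7. The sketch below assumes edges are unordered protein pairs; `merge_networks` and the toy edges are hypothetical.

```python
def merge_networks(predicted_edges, comigration_edges):
    """Union of two undirected edge lists, labelling each edge by its source."""
    pred = {frozenset(e) for e in predicted_edges}
    comig = {frozenset(e) for e in comigration_edges}
    merged = {}
    for edge in pred | comig:
        if edge in pred and edge in comig:
            merged[edge] = "both"          # blue edges in Fig. 3-7
        elif edge in pred:
            merged[edge] = "predicted"     # green edges
        else:
            merged[edge] = "co-migration"  # red edges
    return merged


# Hypothetical toy networks.
merged = merge_networks(
    [("A", "B"), ("B", "C")],
    [("B", "A"), ("C", "D")],
)
nodes = set().union(*merged)
print(len(nodes), len(merged))
```

Using `frozenset` for edges makes ("A", "B") and ("B", "A") the same interaction, so shared edges are not counted twice.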

Figure 3-7. Generalized HDL interactome network. Each node is an HDL associated protein. The green and red edges indicate interactions from the predicted HDL interactome and the co-migration network, respectively. The blue edges indicate interactions present in both networks.

3.3 Discussion

In this work, we first constructed a high-quality predicted HDL interactome network in human by integrating a diverse set of functional and genomic features based on gene sequence, function, comparative genomics and curated gene sets from the literature. These features are shown to be highly correlated with interactions from different biological aspects. A logistic regression classifier was applied to optimally combine these features for interaction prediction. Using cross-validation, our model achieved an AUC score of 0.86, and our predictions covered the majority of the known interactions in the gold-standard positive set. The accuracy of the predicted interactome network depends strongly on the selected features, the quality of the gold-standard dataset, and the chosen classifier. To further improve the prediction accuracy, more biologically relevant features can be considered in the future. For example, we have shown that protein domain information is the most useful feature in gene essentiality prediction (24), as the function of a protein is performed by its constituent domains rather than by the entire protein. Domain-domain interaction information could therefore improve our current predictions. In addition, the current gold-standard dataset only contains 110 known HDL interactions; as more such experimental data are deposited into the database, we expect the results to improve substantially.

The predicted HDL interactome network contains 125 unique proteins and 852 edges, and the co-migration network contains 60 nodes and 628 edges. There are only 98 common edges between the two networks. This small overlap suggests that the two networks capture protein interactions from different aspects. The co-migration network reveals structural protein particles from direct experimental evidence, although the experiments may be biased against low abundance proteins or highly unstable particles. The predicted interactome network suggests protein interactions from indirect evidence, namely the different sources of predictive features. Thus, combining the two networks provides a more comprehensive interactome network. Fig. 3-7 shows that the two networks are relatively separate but connected through several components such as the complement proteins CO3, CO9, CO4B, CO2, CFAI, CO8B, etc.

3.4 Material and methods

3.4.1 Data sources

The known HDL protein interaction pairs were downloaded from the Human Protein Reference Database (HPRD) (ref): http://www.hprd.org/ . Currently, there are 110 known PPIs among 159 HDL proteins deposited in this database. Interactions in this database all have strong experimental support, such as cross-linking, affinity chromatography or immunoprecipitation. These experiments were conducted either in vivo or in vitro on normal human plasma; thus, they may represent the HDL interactions under normal conditions.

The human Gene expression data were downloaded from Gene Expression Omnibus (GEO) datasets in NCBI: http://www.ncbi.nlm.nih.gov/geo/.

The functional annotations of HDL proteins are downloaded from the Gene Ontology database: http://www.geneontology.org/.

The multiple sequence alignments (MSA) were obtained by searching for homologous proteins with BLAST and aligning them with the COBALT Multiple Alignment Tool (81). The distance matrix was calculated using the Protdist program from PHYLIP: http://www.phylip.com/.

3.4.2 Genomic features for predicting HDL protein interactions

mRNA co-expression: We used the following microarray data: GSE3526, a comparison of gene expression across normal human tissues, and GSE3059, a gene expression profile of human peripheral blood cells. These experiments measure gene expression in normal human tissues or normal blood cells. To quantify mRNA co-expression, Pearson correlation coefficients |rij| were calculated across the multiple sets of data. A pair of proteins with |rij| > 0.7 was considered co-expressed and interacting.
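The co-expression rule can be illustrated with a small sketch, assuming each gene is represented by a vector of expression values across the same samples. The helper names and the toy expression vectors are hypothetical; only the |r| > 0.7 threshold comes from the text.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def is_coexpressed(x, y, threshold=0.7):
    """Apply the |r| > threshold rule to two expression profiles."""
    return abs(pearson(x, y)) > threshold


# Hypothetical expression profiles over four samples.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [4.0, 3.0, 2.0, 1.0]    # perfectly anti-correlated with gene_a
gene_c = [1.0, -1.0, 1.0, -1.0]  # weakly correlated with gene_a

print(is_coexpressed(gene_a, gene_b), is_coexpressed(gene_a, gene_c))
```

Because the absolute value is used, strongly anti-correlated genes also count as co-expressed under this rule.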

Functional similarity: We collected functional annotations of HDL associated proteins from the Gene Ontology (GO), which provides a controlled vocabulary of terms, called GO terms, for describing gene product characteristics, together with gene product annotation data. Each gene annotation carries an evidence code that describes the specific type of evidence on which the annotation is based. Common evidence codes include EXP: Inferred from Experiment, IDA: Inferred from Direct Assay, IPI: Inferred from Physical Interaction, IMP: Inferred from Mutant Phenotype and IGI: Inferred from Genetic Interaction (50). Annotations with the IPI code are inferred from physical interactions and may therefore contain the same information as our gold-standard dataset, so we removed the gene annotations with the IPI evidence code.

The GO ontology is structured as a directed acyclic graph in which every GO term has a defined relationship with other terms. To measure the similarity between two GO terms, we calculated the semantic similarity between them, using three well-known semantic similarity methods for GO. The first was Resnik's similarity, defined as:

\[ sim(t, t^*) = IC_{ms}(t, t^*) = \max_{\hat{t} \in Pa(t, t^*)} IC(\hat{t}), \qquad IC(\hat{t}) = -\log P(\hat{t}) \]

where t and t* are two different GO terms, and Pa(t, t*) denotes the set of all common ancestors of GO terms t and t*. IC(t) denotes the information content of term t, which is defined as the negative logarithm of the probability of observing term t. The information content of each GO term is precomputed for each ontology based on empirical observation. The second was Lin's similarity, which is expressed as:

\[ sim(t, t^*) = \frac{2\, IC_{ms}(t, t^*)}{IC(t) + IC(t^*)} \]

And the third was Jiang and Conrath's similarity:

\[ sim(t, t^*) = 1 - \min\left(1,\; IC(t) - 2\, IC_{ms}(t, t^*) + IC(t^*)\right) \]

Similarity of phylogenetic trees: To construct the phylogenetic trees, we first downloaded the sequence alignment files for the HDL proteins from the HomoloGene database: http://www.ncbi.nlm.nih.gov/homologene, where the sequence alignments are constructed exclusively by clustering the sequences' BLAST results from sequenced eukaryotic genomes. Then, to obtain a quantitative indicator of the interaction between two proteins, the multiple sequence alignments (MSAs) of both proteins were reduced to the set of organisms common to the two proteins. The MSA of each protein was used to construct the corresponding inter-sequence distance matrix. Such matrices are commonly used to construct the corresponding phylogenetic trees. Finally, the linear correlation between the two distance matrices was calculated as the indicator of the similarity between the phylogenetic trees of the two proteins.
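The procedure described above (restrict to common organisms, then correlate the distance matrices) can be sketched as follows. The function names, organism lists and distance values are hypothetical, and the sketch assumes that only the off-diagonal entries of the symmetric matrices enter the correlation.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def tree_similarity(dist_a, orgs_a, dist_b, orgs_b):
    """Correlate two distance matrices over their common organisms."""
    common = sorted(set(orgs_a) & set(orgs_b))
    ia = {org: k for k, org in enumerate(orgs_a)}
    ib = {org: k for k, org in enumerate(orgs_b)}
    pairs = list(combinations(common, 2))
    xs = [dist_a[ia[p]][ia[q]] for p, q in pairs]
    ys = [dist_b[ib[p]][ib[q]] for p, q in pairs]
    return pearson(xs, ys)


# Hypothetical distance matrices for two proteins.
orgs_a = ["human", "mouse", "rat", "chicken"]
dist_a = [
    [0.00, 0.10, 0.12, 0.40],
    [0.10, 0.00, 0.05, 0.42],
    [0.12, 0.05, 0.00, 0.43],
    [0.40, 0.42, 0.43, 0.00],
]
orgs_b = ["human", "mouse", "chicken", "zebrafish"]
dist_b = [
    [0.00, 0.20, 0.80, 1.20],
    [0.20, 0.00, 0.84, 1.25],
    [0.80, 0.84, 0.00, 1.10],
    [1.20, 1.25, 1.10, 0.00],
]

similarity = tree_similarity(dist_a, orgs_a, dist_b, orgs_b)
print(round(similarity, 3))
```

Here the two proteins share human, mouse and chicken, and their distances over these organisms are proportional, so the correlation is close to one.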

3.4.3 Evaluating the predictive power of features using a nomogram

We evaluated the predictive power of all the features using the nomogram described in our previous study (24). In a Naïve Bayes analysis, the probability of a class c (interaction or non-interaction) given an instance X with a set of attribute values X = (a_1, a_2, ..., a_n) is

\[ P(c \mid X) = \frac{P(a_1, a_2, \ldots, a_n \mid c)\, P(c)}{P(X)} = \frac{P(c) \prod_i P(a_i \mid c)}{P(X)} \qquad (1) \]

The equation translates to

\[ \operatorname{logit} P(c \mid X) = \operatorname{logit} P(c) + \sum_i \log \frac{P(a_i \mid c)}{P(a_i \mid \bar{c})} \qquad (2) \]

where \( \operatorname{logit} p = \log \frac{p}{1-p} \) and \( \bar{c} \) is the alternative class.

Since the odds are defined as

\[ Odds = \frac{P(c \mid X)}{P(\bar{c} \mid X)} = \frac{P(c) \prod_i P(a_i \mid c)}{P(\bar{c}) \prod_i P(a_i \mid \bar{c})}, \]

the terms in the summation can be expressed as odds ratios (OR):

\[ OR(a_i) = \frac{P(a_i \mid c)}{P(a_i \mid \bar{c})} = \frac{P(c \mid a_i) / P(\bar{c} \mid a_i)}{P(c) / P(\bar{c})} \]

The logOR values are normalized as point values in the range of [-100, 100] and plotted against the first line in the nomogram.

Further, we take the right-hand term in (2) and call it F(c|X):

\[ F(c \mid X) = \sum_i \log \frac{P(a_i \mid c)}{P(a_i \mid \bar{c})} = \sum_i \log OR(a_i) \]

Equation (2) then translates to (3) and (4):

\[ \log \frac{P(c \mid X)}{1 - P(c \mid X)} = \log \frac{P(c)}{1 - P(c)} + F(c \mid X) \qquad (3) \]

\[ P(c \mid X) = \left[ 1 + e^{-\log \frac{P(c)}{1 - P(c)} - F(c \mid X)} \right]^{-1} \qquad (4) \]

By summing up all the logOR values according to the first line, F(c|X) is calculated and the probability follows from equation (4). To give a direct view, F(c|X) and P(c|X) are plotted on aligned axes at the bottom of the nomogram, so that P(c|X) can be read off directly from F(c|X).

In Fig. 3-5, each feature has a corresponding line indicating the relationship between a feature value and its predictive contribution as assessed by Naïve Bayes analysis. The vertical axis represents the feature values, and the horizontal axis represents the normalized log OR. To make a prediction, the contribution of each feature is measured as a point score (topmost axis in the nomogram), and the individual point scores are summed to determine the probability of interaction (bottom two axes of the nomogram). The longer the line, the more predictive power the feature has. Since we are interested in predicting interacting rather than non-interacting pairs, the features with a positive coverage length are considered useful features.

When the value of a feature is unknown, its contribution is 0 points. Therefore, when nothing is known about the interaction, the total point score is 0 and the corresponding probability equals the unconditional prior. Besides enabling prediction, the Naïve Bayes nomogram reveals the structure of the model and the relative influence of the feature values on the class probability. For the training dataset, pathway co-occurrence is a feature with high potential influence on the probability of interaction: the probability of interaction increases monotonically with the feature value in the 2-D view. The larger the value, the more likely the pair is to interact.
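How the log odds ratios combine into a posterior probability through equations (2) to (4) can be illustrated with a short sketch. The conditional probabilities and the prior below are hypothetical values chosen for the example, not estimates from the HDL data, and the normalization of logOR to the [-100, 100] point scale is omitted.

```python
from math import exp, log


def log_odds_ratio(p_a_given_c, p_a_given_not_c):
    """log OR(a_i) = log P(a_i | c) - log P(a_i | not c)."""
    return log(p_a_given_c / p_a_given_not_c)


def posterior(prior, log_ors):
    """Equation (4): posterior from the prior and the summed log ORs."""
    f = sum(log_ors)  # F(c | X)
    return 1.0 / (1.0 + exp(-log(prior / (1.0 - prior)) - f))


# Hypothetical example: prior of 1:15 and two observed feature values.
prior = 1 / 16
log_ors = [
    log_odds_ratio(0.6, 0.05),  # e.g. high pathway co-occurrence
    log_odds_ratio(0.5, 0.20),  # e.g. high functional similarity
]

print(round(posterior(prior, log_ors), 4))
print(round(posterior(prior, []), 4))  # no evidence: returns the prior
```

With no observed features the sum F(c|X) is zero and the posterior equals the prior, which matches the statement that an unknown feature contributes 0 points.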

3.4.4 Logistic Regression Procedure

In logistic regression, all numerical features must have a linear relationship with the logit; those without this linear property were converted into categorical values by dividing the data into several successive categories. Each piece of evidence, i.e., a genomic feature taking on a particular value, was fitted with a coefficient. For each HDL pair, its logit was therefore estimated as a linear combination of all the different pieces of evidence, and the probability of interaction is uniquely determined by its logit. A protein pair was predicted as a true interaction if its probability score exceeded a cutoff. In this process, all we need to estimate are the coefficients for each categorical value of each feature. This was achieved by maximum likelihood estimation (MLE).

Specifically, in our model, the probability that a given HDL pair is a true interaction is given by:

\[ p(y = 1) = \frac{1}{1 + e^{-z}} \]

Here y = 1 means the interaction is true, and y = 0 otherwise. z is the logit of this protein pair, calculated by combining all the related features:

\[ z = \beta_0 + \sum_{i=1}^{4} \sum_{j=1}^{n_i} \beta_{ij}\, I(x_i = j) \]

In this logit, \( \beta_0 \) is the intercept, \( \beta_{ij} \) is the coefficient for the jth category of the ith feature, and \( n_i \) is the total number of categories of the ith feature. \( I(x_i = j) \) is an indicator function: it is equal to one when the ith feature of this protein pair belongs to the jth category, and 0 otherwise.

During the MLE process, we used the known data (the known HDL protein interaction pairs) to estimate the coefficients. The likelihood function is given by:

\[ L(y_1, \ldots, y_m) = \prod_{k=1}^{m} f(y_k \mid g_k) = \prod_{k=1}^{m} \left\{ \frac{e^{z(g_k)}}{1 + e^{z(g_k)}} \right\}^{I(y_k = 1)} \left\{ \frac{1}{1 + e^{z(g_k)}} \right\}^{I(y_k = 0)} \]

Here \( g_k \) denotes the features of the kth gold-standard protein pair.
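The categorical logit and the corresponding log-likelihood can be sketched as follows. The coefficients, category names and the two-feature layout are hypothetical (the model in the text sums over its own set of features), and no fitting is performed: the sketch only evaluates the quantities that MLE would maximize.

```python
from math import exp, log


def logit_score(intercept, coefs, categories):
    """z = beta_0 + sum_i beta[i][category of feature i]."""
    return intercept + sum(coefs[i][cat] for i, cat in enumerate(categories))


def prob_interaction(z):
    """p(y = 1) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + exp(-z))


def log_likelihood(intercept, coefs, pairs, labels):
    """Log of the likelihood over labelled gold-standard pairs."""
    total = 0.0
    for categories, y in zip(pairs, labels):
        p = prob_interaction(logit_score(intercept, coefs, categories))
        total += log(p) if y == 1 else log(1.0 - p)
    return total


# Hypothetical coefficients for two categorical features.
intercept = -2.0
coefs = [
    {"low": 0.0, "high": 1.5},  # e.g. pathway co-occurrence
    {"low": 0.0, "high": 0.5},  # e.g. functional similarity
]
pairs = [("high", "high"), ("low", "low")]
labels = [1, 0]

print(prob_interaction(logit_score(intercept, coefs, pairs[0])))
print(log_likelihood(intercept, coefs, pairs, labels))
```

In practice the coefficients would be obtained by maximizing this log-likelihood with a numerical optimizer or a standard statistics package.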

Chapter 4. Identification of functional modules from the HDL interactome map

4.1 Introduction

A network perspective has been widely used to describe the structure of wide-ranging systems in nature and society, including the Internet, power grids, neuronal networks and social networks. Despite the seemingly significant differences among these applications, they all share common topological features. Networks may therefore provide an intuitive framework for characterizing biological systems. In contemporary systems biology, network representations have been applied to describe various molecular systems, including the protein interactome, metabolic networks, transcriptional regulatory networks and functional association networks. Network approaches have successfully predicted protein functions, guided large-scale experiments, facilitated drug discovery and design, and expedited novel biomarker identification.

Protein-protein interaction (PPI) networks have been found to be naturally divided into modules. Each module, or subnetwork, is a discrete set of proteins that forms a single connected component and performs a relatively independent task. A number of studies have reported that high modularity in a PPI network often indicates a strong correspondence between PPI modules and functional units. Accurate identification of modules from large-scale PPI networks contributes greatly to protein function annotation. Another promising application of modular analysis is the identification of module-based prognostic biomarkers for diseases such as different cancers.

In characterizing HDL functions, this perspective of network-based modules is especially relevant because accumulating evidence suggests that different HDL functions are carried out by distinct HDL particles. To characterize the structural HDL subspecies and their associated biological functions, our collaborator applied three novel orthogonal chromatographic separation techniques to fractionate normal human plasma into subfractions and identified HDL associated proteins and their relative distributions across these fractions (see Chapter 2). For the subfractions derived from gel filtration (GF) chromatography, our collaborator further performed a cholesterol efflux assay to measure their functional activities.

Given the protein abundance data, the functional activity data and the generalized HDL interactome network, we developed and validated a novel approach that combines advanced proteomic analysis, a functional assay and a network-based computational framework to identify new HDL subspecies in normal human plasma and associate them with known HDL functions. The identified HDL subspecies may suggest distinct functional units, and interactions within and among these subspecies may reveal novel molecular mechanisms underlying CVD. Moreover, the identified subspecies are more reproducible and predictive than individual proteins.

This work was highly innovative because the creative integration of computational and experimental approaches uncovered the relationship between HDL subspeciation and function in a way that has not been attempted before. As such, it filled a major gap in our understanding of the compositional and functional heterogeneity of HDL particles.

4.2 Results

4.2.1 Functional analysis of HDL subfractions

The cholesterol efflux assay is the most commonly used assay to measure the protective activities of HDL. Here, the subfractions derived from the GF separation technique were subjected to the cholesterol efflux assay. Each HDL fraction was evaluated for its ability to promote the efflux of cholesterol. Results of the assay are expressed as the fractional release of cellular cholesterol to equal volumes of the HDL fractions. Fig. 4-1 shows the cholesterol efflux activity profiles of the GF subfractions in three normal human plasmas.

Figure 4-1. The activity profiles of cholesterol efflux by GF subfractions in three normal human plasmas.

4.2.2 The framework of the functional HDL subspecies identification

The goal of the framework is to search the HDL interactome network for subnetworks, or modules, whose abundance profiles across the fractions are highly correlated with the functional activity profiles. We take the following steps (Fig. 4-2):

Figure 4-2. The outline of the network-based framework.

Constructing data structure: From the proteomic characterization and the functional profiling of HDL fractions for cholesterol efflux activity, we obtained the protein abundance level Xijk for each HDL protein Pi, i=1,…,N, in each fraction Fj, j=1,…,L, and for each human subject or sample Sk, k=1,…,S. At the same time, we obtained the functional activity value Yjk in each fraction Fj for each sample Sk. Thus, the HDL protein abundance data can be arranged into a matrix X with N rows and LS columns, in which each element Xi,jk represents the abundance level of the ith protein in the jth fraction of the kth sample. The corresponding functional activity profile can be arranged into a vector Y with LS columns, in which each element Yjk represents the functional activity value in the jth fraction of the kth sample. First, we normalized both the protein abundance matrix and the functional activity profile using these two equations:

\[ \hat{X}_{i,jk} = \frac{X_{i,jk} - \bar{X}_i}{\sqrt{\frac{1}{LS} \sum_{jk} \left( X_{i,jk} - \bar{X}_i \right)^2}}, \qquad \bar{X}_i = \frac{1}{LS} \sum_{jk} X_{i,jk} \]

\[ \hat{Y}_{jk} = \frac{Y_{jk} - \bar{Y}}{\sqrt{\frac{1}{LS} \sum_{jk} \left( Y_{jk} - \bar{Y} \right)^2}}, \qquad \bar{Y} = \frac{1}{LS} \sum_{jk} Y_{jk} \]

where each \( X_{i,jk} \) and \( Y_{jk} \) is z-transformed to \( \hat{X}_{i,jk} \) and \( \hat{Y}_{jk} \) with mean = 0 and s.d. = 1.
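The z-transformation of one protein's abundance row (or of the activity vector) can be sketched as below. The function name and the toy values are hypothetical, and the sketch assumes the standard deviation is computed with a 1/LS (population) normalization.

```python
from math import sqrt


def z_transform(values):
    """Rescale a profile to mean 0 and (population) standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


# Hypothetical abundance profile of one protein across L*S columns.
profile = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z_profile = z_transform(profile)
print(z_profile)
```

A protein with a constant profile has zero standard deviation and would need to be excluded or handled separately before applying this transformation.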

Scoring subnetworks: We overlay the normalized protein abundance data onto the generalized HDL interactome network and define the expression profile of a given subnetwork/module. Our recent research indicates that a static conglomerate network may not accurately reflect the properties of the system under a specific condition. Directly identifying modules from a conglomerate network may therefore give misleading results; for example, a protein component may not even be expressed under a given condition. By overlaying protein abundance levels onto the HDL PPI network, we effectively take into account the context-specific information in each fraction.

As Fig. 4-2 shows, assuming a candidate module M contains R different protein components, the expression vector of this module is defined as the average of the abundance profiles of its components:

\[ M_{jk} = \frac{1}{R} \sum_{i=1}^{R} \hat{X}_{i,jk} \]

We then examine the discriminative power of this module with respect to the functional activity profile using the correlation score S(M), defined as the absolute Pearson correlation coefficient between the two vectors:

\[ S(M) = \left| r_{M,Y} \right| = \left| \frac{\sum_{jk} (M_{jk} - \bar{M})(\hat{Y}_{jk} - \bar{\hat{Y}})}{\sqrt{\sum_{jk} (M_{jk} - \bar{M})^2 \, \sum_{jk} (\hat{Y}_{jk} - \bar{\hat{Y}})^2}} \right| \]
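The module profile and its correlation score can be sketched as follows. The protein names, abundance values and activity vector are hypothetical; because the Pearson correlation is invariant to shifting and rescaling, the toy profiles are not z-transformed first.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def module_profile(abundance, members):
    """Average the abundance profiles of the module members column-wise."""
    rows = [abundance[m] for m in members]
    return [sum(col) / len(rows) for col in zip(*rows)]


def correlation_score(abundance, members, activity):
    """S(M): absolute Pearson correlation with the activity profile."""
    return abs(pearson(module_profile(abundance, members), activity))


# Hypothetical abundance profiles and activity profile.
abundance = {
    "P1": [-1.0, 0.0, 1.0, 0.0],
    "P2": [1.0, 0.0, 1.0, 0.0],
}
activity = [0.0, 0.0, 1.0, 0.0]

single = correlation_score(abundance, ["P1"], activity)
module = correlation_score(abundance, ["P1", "P2"], activity)
print(round(single, 3), round(module, 3))
```

In this toy case the averaged module profile tracks the activity profile more closely than either protein alone, which is the effect the scoring function is designed to detect.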

Searching significant subnetworks/modules:

Simple hill climbing: Given the correlation score function S(M), we apply a greedy search to identify a set of modules for which S(M) is locally optimal and which can best classify the existence of the function. Specifically, candidate modules were seeded with a single protein and expanded iteratively. At each iteration, the proteins considered were the neighbors of proteins in the current module that lie within a distance d of the seed. All such neighboring proteins were tested, and the addition that yielded the maximal score increase was accepted. The search stops when no addition increases the score beyond a specified improvement rate r, and it outputs the optimal set of modules for the classification. We adopted appropriate values of d and r in order to keep the search local while avoiding overfitting. The significance of the identified modules was assessed by random permutation experiments, for example by permuting the abundance vectors of individual proteins in the network or the functional activities of the fractions.
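The greedy expansion can be sketched as below. This is an illustration under stated assumptions rather than the original implementation: the improvement rate r is interpreted as a relative increase (the new score must exceed the current score times 1 + r), ties are broken alphabetically, and the graph, abundance profiles and activity vector are hypothetical.

```python
from collections import deque
from math import sqrt


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = sqrt(sxx * syy)
    return sxy / denom if denom > 0 else 0.0  # constant profile scores 0


def score(members, abundance, activity):
    """S(M): |Pearson r| between the averaged module profile and activity."""
    rows = [abundance[m] for m in members]
    profile = [sum(col) / len(rows) for col in zip(*rows)]
    return abs(pearson(profile, activity))


def within_distance(graph, seed, d):
    """Nodes at most d edges away from the seed (breadth-first search)."""
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if dist[node] == d:
            continue
        for nb in graph[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return set(dist)


def grow_module(seed, graph, abundance, activity, d=2, r=0.01):
    """Greedy hill climbing from one seed protein."""
    allowed = within_distance(graph, seed, d)
    module = {seed}
    best = score(module, abundance, activity)
    while True:
        neighbours = {nb for m in module for nb in graph[m]}
        candidates = (neighbours & allowed) - module
        if not candidates:
            break
        new_score, pick = max(
            (score(module | {c}, abundance, activity), c)
            for c in sorted(candidates)
        )
        if new_score <= best * (1 + r):  # assumed relative improvement rule
            break
        module.add(pick)
        best = new_score
    return module, best


# Hypothetical chain network P1 - P2 - P3 - P4.
graph = {"P1": ["P2"], "P2": ["P1", "P3"], "P3": ["P2", "P4"], "P4": ["P3"]}
abundance = {
    "P1": [-1.0, 0.0, 1.0, 0.0],
    "P2": [1.0, 0.0, 1.0, 0.0],
    "P3": [0.0, 1.0, 0.0, 0.0],
    "P4": [0.0, 0.0, 1.0, 0.0],
}
activity = [0.0, 0.0, 1.0, 0.0]

found, found_score = grow_module("P1", graph, abundance, activity)
print(sorted(found), round(found_score, 3))
```

Starting from P1, the search adds P2 because it raises the score, then stops because adding P3 would lower it; P4 is never considered because it lies more than two steps from the seed.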

Classification evaluation:

A logistic regression model is trained on the module activity matrix, with each detected module as a predictor variable and the functional activity as the response variable. To select a set of modules that best fits the training sets, feature selection methods such as forward selection, backward selection or stepwise selection are applied. To keep the training and testing processes independent, we selected a set of optimized modules using the data from the first two samples and then tested them on the data from the third, and vice versa. After the model is selected, we use five-fold cross-validation to test its performance. The average AUC value among the five test sets was reported as the final classification performance for the module set. In addition, to compare the predictive performance of our modules with that of individual proteins, we trained the logistic regression model on the original protein abundance matrix.

4.2.3 Functional HDL subspecies identification

The generalized HDL interactome network contains 122 nodes and 1276 edges. The proteomic data obtained from the GF separation technique contain 106 unique proteins. Therefore, we restricted our analysis to the 106 proteins with both protein abundance information and functional activity data. In the current method, each protein in the interactome network was used as a “seed” to generate a subnetwork that locally optimizes the correlation between its expression profile and the functional activity profile. We chose candidates from the 2-step neighbors of the seed nodes and set the improvement rate r=0.01 to find locally optimal modules. A total of 85 significant subnetworks, comprising 91 genes, were identified using the protein abundance data generated by the GF technique in 3 human plasmas, based on the random test for statistical significance (see Material and Methods). Fig. 4-3 shows the size distribution of the identified significant modules. The average size of these modules is six, and the majority of modules have sizes from four to eight.

[Figure 4-3: histogram; x-axis: Sizes of modules (3 to 10); y-axis: Frequency.]

Figure 4-3 The size distribution of the identified discriminative modules

To show the improvement of modules versus single proteins, we calculated, for each identified significant module, the correlation score between its expression profile and the functional activity profile. Fig. 4-4(a) shows the distribution of correlation scores for these modules. All modules have correlation scores larger than 0.75, and the majority have scores around 0.85. We also plot the correlation score distribution of individual proteins in Fig. 4-4(b). Here the frequency gradually decreases as the correlation score increases, and only one protein has a correlation score larger than 0.6. This result demonstrates that the modules’ abundance levels across fractions are more discriminative of the presence of the HDL function than those of individual proteins, which supports the idea that HDL’s efflux ability is mainly carried out by compositionally distinct HDL particles rather than by several independent, well-studied HDL proteins. Therefore, each significant module may be viewed as a putative marker for HDL’s protective function, one based not on a single protein but on the aggregate behavior of proteins connected in a functional network. This feature is a significant departure from the previous co-migration analysis, which does not provide functional insight into the identified markers.

Figure 4-4 Correlation score distributions between (a) protein modules / (b) single proteins and functional activity.

We also plotted the expression profiles of the top single protein (fibrinogen alpha chain, FIBA) and the top module (apoA2, apoE, apoH, C1QC, C4BP, CO5, FIBB and ITIH4), i.e., those with the highest correlation scores. Fig. 4-5(a) shows the expression profile of FIBA, which has a correlation score of 0.61 with the functional activity profile (shown as the red curve). Fig. 4-5(b) shows the expression profile of the top module, which has a correlation score of 0.86. Integrating protein expression with network information thus substantially increases the discriminative power for the functional activity.

Figure 4-5 The protein expression profile of (a) single protein and (b) single module that have the highest correlation score, along with the functional activity profile.

For these significant modules, we performed functional enrichment analysis. We collected functional annotations of HDL proteins from Gene Ontology. For each module, we used the hypergeometric test to examine whether it is enriched for a given function (see Methods and Materials). Of the 85 significant modules, 60 have significantly enriched functions after Benjamini correction. The most enriched functions include reverse cholesterol transport, complement activation, platelet activity, coagulation, innate immune response, and regulation of acute inflammatory response. These results are consistent with previously known HDL functions such as reverse cholesterol transport, anti-oxidation, anti-inflammation and endothelial relaxation.
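The hypergeometric enrichment test for one module and one annotation can be sketched as follows. The function name and argument names are hypothetical; the background is assumed to be the set of analyzed HDL proteins (106 in this analysis), and the multiple-testing correction is not shown.

```python
from math import comb


def hypergeom_pvalue(N, K, n, k):
    """P(X >= k): probability of seeing k or more annotated proteins in a
    module of size n drawn from N background proteins, K of them annotated."""
    upper = min(K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)
```

For example, `hypergeom_pvalue(106, K, n, k)` would give the raw p-value for a module of n proteins containing k of the K proteins carrying a given Gene Ontology term; K, n and k here are placeholders, not values from this study.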

4.2.4 Subnetwork markers are reproducible across different samples

To examine the overlap between the subnetworks identified across different samples using our network-based approach, we identified the significant modules in each of the three samples separately, obtaining 85, 67 and 70 significant modules, respectively. The three module sets contain 86, 69 and 72 unique proteins, respectively, and combining the three sets gives 89 unique proteins. We calculated the protein agreement among the three module sets and found that 60 of the 89 proteins occur in all three sets (Fig. 4-6). To compare the reproducibility of modules with that of individual proteins, we ranked the single-protein correlation scores in each of the three samples and chose the top 86, 69 and 72 proteins from the respective ranking lists as the predicted single markers. Combining the three single-marker sets gives 99 unique proteins, 47 of which occur in all three samples. The reproducibility of the module markers across samples is significantly higher than that of the individual markers (Fig. 4-6; P-value = 0.004, Fisher's exact test).
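The comparison of the two overlap proportions can be sketched with a Fisher exact test on the counts quoted above. This is an illustrative sketch: the function name is hypothetical, and the one-sided alternative is an assumption, since the text does not state whether a one- or two-sided test was used.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """P(top-left cell >= a) with margins fixed, for the table [[a, b], [c, d]]."""
    row1, col1, total = a + b, a + c, a + b + c + d
    hi = min(row1, col1)
    tail = sum(comb(col1, x) * comb(total - col1, row1 - x)
               for x in range(a, hi + 1))
    return tail / comb(total, row1)


# Counts from the text: 60 of 89 module-marker proteins versus 47 of 99
# single-protein markers were found in all three samples.
p_value = fisher_one_sided(60, 89 - 60, 47, 99 - 47)
```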

Figure 4-6 Overlap in module markers vs. single-gene markers identified from the three samples.

The single-gene analysis was performed by using the same number of top correlated genes as the number of genes covered by module markers.

4.2.5 Classification performance of the module markers

The identified modules are intended to serve as biomarkers that connect HDL particles with CVD protective function. We evaluated the performance of the identified modules in terms of their ability to classify the cholesterol efflux function. To do this, we first selected two samples as “known” subjects and the remaining one as the “blind” subject. We then applied the module selection procedure described above to the abundance data and functional activity data of the “known” subjects. The CVD protective function profile was converted into a binary vector using the median functional activity of the “known” subjects as the cutoff. We built a logistic regression classifier using the identified significant modules as predictors and the functional activity as the response variable. The model was trained with the selected modules on the “known” subjects and then tested on the “blind” subject to evaluate its predictive power. We applied the stepwise feature selection procedure to choose a set of optimal features, and three modules were selected into the final model (Fig. 4-7). The three modules contain 7, 5 and 7 protein components, respectively, and the overlap among them is very low. As the figure shows, edges with different colors indicate that the corresponding interactions were derived from different types of networks (the co-migration network or the predicted HDL interactome network).

The fact that these predictive modules comprise different types of edges implies that the two networks are complementary and that the combined network is more complete and less biased than either network alone. Functional enrichment analysis shows that the three modules are enriched for the following functions: acute inflammatory response, complement activation, and regulation of cholesterol import. Using the three modules as input, we built the logistic regression model. An AUC of 0.952 was achieved on the “blind” test set. Table 4-1 shows the performance on the test set: among the 17 test fractions, 15 were correctly predicted, an accuracy of 88%. Because the logistic model selected only three modules as covariates, the risk of overfitting was kept low.

To compare the performance of the modules versus individual proteins, we also trained the logistic regression classifier on the original protein abundance matrix. Here the single-protein markers were selected from the two “known” subjects. We chose as many top single proteins, ranked by correlation score, as there are proteins in the three modules. The model was built on the two “known” subjects using the selected protein markers as input and then tested on the “blind” subject. Table 4-2 shows the performance of this model: among the 17 fractions, only 12 were correctly predicted, so the accuracy decreased from 88% to 70% and the AUC decreased from 0.952 to 0.80. This result demonstrates that the identified modules have better predictive power than the top single proteins.

Figure 4-7 The selected predictive modules used in the logistic regression classifier. They were extracted from the generalized HDL interactome network. Red edges indicate interactions that originate from the co-expression network, green edges indicate interactions from the predicted HDL interactome network, and blue edges indicate interactions that occur in both networks.

Table 4-1 Performance Table of the classifier on the test set using selected modules as input

                      Positive (with function)    Negative (no function)
Predicted Positive    10                          2
Predicted Negative    0                           5

Table 4-2 Performance Table of the classifier on the test set using selected top single proteins as input

                      Positive (with function)    Negative (no function)
Predicted Positive    5                           0
Predicted Negative    5                           7
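The accuracies quoted in the text follow directly from the two performance tables. The short sketch below recomputes them, together with sensitivity and specificity, which the text does not report; the function name `summarize` is hypothetical.

```python
def summarize(tp, fp, fn, tn):
    """Accuracy, sensitivity and specificity from a 2x2 performance table."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Table 4-1 (modules) and Table 4-2 (top single proteins); rows are the
# predicted class, columns the observed class.
modules = summarize(tp=10, fp=2, fn=0, tn=5)
singles = summarize(tp=5, fp=0, fn=5, tn=7)
```

Under this reading of the tables, the module-based classifier misses no functional fraction, while the single-protein classifier misses half of them.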

4.3 Methods and Materials

4.3.1 Cholesterol efflux assay

The cholesterol efflux function of HDL was assayed on plasma fractions collected by the GF separation technique described above. Our collaborator measured the ability of different HDL fractions to promote the efflux of cholesterol from a mouse macrophage cell line using a method described by Sakr et al. J774 cells were labeled by incubation with [3H]-cholesterol, and the labeled cells were then exposed to the various HDL fractions. [3H]-cholesterol was measured in the medium (effluxed) and in the cells (remaining), so that cholesterol efflux can be calculated as [3H]-cholesterol in medium / ([3H]-cholesterol in medium + [3H]-cholesterol in cells) * 100. Our collaborator compared the activity in all fractions of a given separation technique to create a relative functional activity profile (Fig. 4-1).
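The efflux formula above amounts to the following one-line calculation. The function name and the example counts are hypothetical; the inputs stand for the [3H]-cholesterol measurements in the medium and in the cells.

```python
def percent_efflux(medium_counts, cell_counts):
    """Percent cholesterol efflux: medium / (medium + cells) * 100."""
    total = medium_counts + cell_counts
    if total <= 0:
        raise ValueError("total counts must be positive")
    return 100.0 * medium_counts / total
```

For instance, `percent_efflux(250, 750)` gives 25.0, i.e. a quarter of the label was effluxed to the medium.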

4.3.2 Function enrichment analysis

4.3.3 Random permutation test

To test the individual significance of the identified modules, we randomly shuffled the expression patterns of all proteins in the interactome network and then identified modules using the same procedure on the shuffled network. We repeated this process 50,000 times and collected all “random” modules identified in each iteration. We grouped these “random” modules by size and calculated the correlation between their expression profiles and the functional activity profile. The correlation scores of the “random” modules form a global null distribution for each size, and the p-value of a “true” module was calculated as the fraction of “random” modules of the same size with an equal or larger correlation score. Modules with a p-value smaller than 0.05 were considered significant.

Chapter 5. Other computational biology projects

5.1 Unraveling the mechanistic basis of gene essentiality

5.1.1 Investigating the predictability of essential genes across distantly related organisms using an integrative approach

Essential genes are defined as those that, when disrupted, confer a lethal phenotype on microorganisms under defined conditions. They can provide important information on cell function and can be used as potential targets for antibiotic drug development. Studying gene essentiality is also important in basic science because it is a crucial step toward understanding the complex relationship between genotype and phenotype (82). Therefore, rapid and accurate identification of new essential genes in under-studied microorganisms will significantly improve our understanding of how a cell works and our ability to re-engineer microorganisms (83, 84).

Experimental identification of essential genes can be accomplished either by targeted mutagenesis, where specific genes are identified prior to genetic manipulation and confirmatory studies are based upon the experimental data, or by random mutagenesis, where the target genes are identified only after the experimental disruptions (85). While targeted mutagenesis produces more reliable results with higher accuracy, random mutagenesis appears to be more cost-effective. Nonetheless, genome-scale systematic screening for lethal gene disruptions by either approach is a formidable undertaking. To circumvent the expense and difficulty of these screens, researchers attempt to predict essential genes in under-studied organisms. These attempts often rely on homology mapping to help elucidate essential genes. However, this method has several limitations. First, homology mapping is limited to the conserved orthologs between species, which often correspond to a small portion of a target bacterial genome (86). In addition, although essential genes tend to be conserved, conserved genes are often not essential. Finally, beyond what homology mapping can test, differences in genetic regulation or protein modification, genetic redundancy, or divergence in cellular pathways or processes between organisms may also have great bearing on relative essentiality.

In this study, we developed a machine learning-based integrative approach as an alternative to transfer gene essentiality annotations between organisms. This approach identifies relevant features of essential genes and makes predictions using a weighted combination of hallmark features. In this project, we focused on four bacterial species that have well-characterized essential genes, and tested the transferability between three pairs among them. For each pair, we trained our classifier to learn traits associated with essential genes in one organism, and applied it to make predictions in the other. The predictions were then evaluated by examining the agreements with the known essential genes in the target organisms.

Transferring essential gene annotations between E. coli and A. baylyi

In order to test the hypothesis that essential gene annotations can be transferred between distantly related organisms, we chose to analyze a pair of relatively distantly related organisms: E. coli (EC) and A. baylyi (AB). We selected this pair for two reasons: (1) essential genes are well characterized in both organisms, and (2) both AB and EC are γ-proteobacteria. We first used a reciprocal best hit (RBH) method to compare the genomes of the two organisms (87). EC and AB share 1198 orthologs, which represent 28% and 36% of the EC and AB genomes, respectively (Figure 5-1). We also examined the overlap between the two essential gene datasets based on these orthologs (Figure 5-1). The EC and AB essential datasets have 195 essential genes in common, making up 65% and 39% of the two essential gene sets, respectively. Clearly, both organisms have a substantial portion of unique essential genes, which is consistent with a previous report that bacterial species share a limited number of common essential genes (88).
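The reciprocal best hit criterion can be sketched as follows, assuming the best hit of every gene in the other genome has already been determined by a sequence search. The function name and the dictionary layout are hypothetical; the percentages are recomputed from the gene counts given in the text and in Figure 5-1.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b.
    Each argument maps a gene to its top-scoring hit in the other genome."""
    return sorted((a, b) for a, b in best_a_to_b.items()
                  if best_b_to_a.get(b) == a)


# Shares quoted in the text, recomputed from the gene counts.
ortholog_share_ec = 100 * 1198 / 4289   # orthologs as % of EC genes
ortholog_share_ab = 100 * 1198 / 3308   # orthologs as % of AB genes
common_share_ec = 100 * 195 / 302       # common essentials as % of EC essentials
common_share_ab = 100 * 195 / 499       # common essentials as % of AB essentials
```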

Figure 5-1 Comparison of genomes and essential genes in EC and AB. The square represents 4289 EC total genes; the rectangle represents 3308 AB total genes. The overlap of the two represents 1198 orthologs determined by the RBH method. The rectangle with dashed border represents the total 302 EC essential genes. The rectangle with diagonal brick shades represents the total 499 AB essential genes. The rectangle within the dashed border and with diagonal brick shades represents the common essential genes in both species. The area of each rectangle is approximately proportional to the number of genes it represents.

Selecting suitable features for predicting gene essentiality

It is becoming increasingly apparent that genomic sequences represent only one aspect of the complex genetic relationships that have evolved under diverse selection pressures (89); therefore, it is necessary to consider a variety of features. Our study considered three main types of features: (i) those intrinsic to a gene’s sequence (e.g. GC content, protein length); (ii) those derived from genomic sequence (e.g. localization signals and codon adaptation measures); and (iii) experimental functional genomics data (e.g. gene-expression microarray data). We used three criteria to select the most suitable features. First, the features should be easily obtained and available for most microorganisms. Second, the features should have high predictive power for gene essentiality. Third, the features should minimize biological redundancy. Using these criteria, from a total of 28 characteristic features that we considered, we identified 13 that are potentially associated with gene essentiality in EC and only weakly correlated among themselves (Figure 5-2). Interestingly, these features span different aspects from sequence to function, which suggests that gene essentiality is likely determined not solely by the genomic sequence of a gene but by multiple aspects of biology. Among the 13 features, the strongest turns out to be DES (domain enrichment in essential genes), which has not been considered by previous studies. The next four strongest features are CBI, Nc, PHYS and L_aa, consistent with previous studies (87, 90, 91).

Figure 5-2 The Nomogram for visualization of the 13 selected features. Each feature has a corresponding line indicating the relationship between a feature value and its predictive contribution assessed by Naive Bayes analysis. The number on the line is the value of the feature and each value corresponds to a point score above. The longer the line is, the more predictive power the feature has in prediction.

Cross-validations of the classifier using E. coli essential gene set

The 13 selected features were then used as input variables for four classifiers: Naive Bayes, logistic regression, decision tree and CN2 rule. The input of the classifiers contained the features of each gene, plus the class labels when the genes were used as training data. Each classifier scheme independently generated a probability score of gene essentiality. The best performance was obtained by combining the output probability scores of these diverse classifiers using an unweighted approach, and this combination was therefore used as the final prediction. The 10-fold cross-validation result shown in the ROC curve indicated that, at 1% FPR, the classifier achieved 45% TPR (Figure 5-3A). The area under the curve (AUC) of the classifier is 0.93, and the positive predictive value (PPV, or precision) is 0.70 with the probability threshold set at 0.5. Because the training dataset is imbalanced (essential : non-essential = 1:13), a slightly higher cost can be assigned to false positives to avoid making excessive false positive predictions. This is equivalent to raising the probability threshold for the predictions, which yields fewer false positives. With the probability threshold set at 0.75, the precision of our predictions increased by 14% to 0.80 (108/135) (Figure 5-3A).
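The combination and thresholding steps can be sketched as below. This is a sketch under an explicit assumption: the "unweighted approach" is taken to mean a plain average of the per-classifier probabilities, which the text does not state. Function names are hypothetical.

```python
def ensemble_probability(per_classifier_probs):
    """Unweighted mean of each gene's probabilities across classifiers.
    per_classifier_probs: one list of per-gene probabilities per classifier."""
    return [sum(p) / len(p) for p in zip(*per_classifier_probs)]


def precision_at(threshold, probs, labels):
    """Precision (PPV) among genes whose probability reaches the threshold."""
    called = [y for p, y in zip(probs, labels) if p >= threshold]
    return sum(called) / len(called) if called else float("nan")


# Figures quoted in the text: 108 true essential genes among 135 predictions
# at threshold 0.75, and the relative gain over the precision at 0.5.
precision_075 = 108 / 135
relative_gain = (0.80 - 0.70) / 0.70
```

Raising the threshold trades coverage for precision: fewer genes are called essential, but a larger share of the calls is correct.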

Figure 5-3 ROC curves plot the TPR versus FPR for different thresholds of classifier probability output. (A) and (B): EC -> AB; (C) and (D): AB -> EC. (A) Ten-fold cross-validations on the EC essential gene data set. (B) Predictions of AB essential genes. The classifier was trained on EC dataset and evaluated on AB essential genes. (C) Ten-fold cross-validations on the AB essential gene data set. (D) Predictions of EC essential genes. The classifier was trained on AB data set and evaluated on EC essential genes.

Predicting AB essential genes by integrating intrinsic and context-dependent features

After the 10-fold cross-validation on known essential genes in EC, we applied the classifier to predict AB essential genes, denoted EC -> AB. The accuracy was evaluated by examining the agreement with the assignments from the gene knockout experiments in AB. At 1% FPR, the classifier achieved a 28% TPR (Figure 5-3B). The AUC score is 0.80 and the PPV is 0.81 at the threshold of 0.5. That is, among the 212 predictions that received the highest scores in AB, about 172 are true essential genes. This precision is high considering that a random selection of 212 AB genes would be expected to contain only 32 essential genes.
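The random baseline quoted above follows from the share of essential genes in the AB genome. The short sketch below reproduces the arithmetic; the function name is hypothetical, and the counts are those given in the text and in Figure 5-1.

```python
def expected_random_hits(n_predictions, n_essential, n_genes):
    """Expected number of essential genes in a random draw of n_predictions."""
    return n_predictions * n_essential / n_genes


random_hits = expected_random_hits(212, 499, 3308)  # AB: 499 essential of 3308
observed_hits = 0.81 * 212                           # PPV times number of calls
fold_enrichment = observed_hits / random_hits
```

Under these numbers the classifier recovers roughly five times as many essential genes as a random selection of the same size.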

We then performed a reciprocal prediction of EC essential genes using the AB essential gene data set, denoted AB -> EC. The prediction yielded a ROC curve with an AUC score of 0.89 and a PPV of 0.43 (Figure 5-3C and D). We speculate that the lower precision arose because the AB data set contains 100 genes associated with biosynthetic functions (e.g. amino acids, cofactors) that are needed for survival only on minimal media (92). Including genes that are essential only on minimal media in the training set may have led our classifier to learn characteristics unique to these genes, resulting in a poorer classification of the ‘true’ essential genes.

Transferring essential gene annotations between E. coli and P. aeruginosa

We next conducted predictions between EC and P. aeruginosa PAO1 (PA). PA is a ubiquitous and opportunistic pathogen capable of causing chronic infection of the lungs of cystic fibrosis patients. It is Gram-negative and belongs to the same class, γ-proteobacteria, as EC and AB.

A set of 678 PA essential genes, or 12% of its total genes, has been identified by transposon mutagenesis (93). Due to the random nature of transposon insertion events, the results of transposon mutagenesis often contain systematic bias. For example, the essential genes determined by transposon mutagenesis contain a disproportionately higher percentage of short proteins because shorter proteins are more likely to be missed by transposons (93, 94).

Using the same feature selection strategy as we employed to predict essential genes between EC and AB, we identified a set of nine features in EC. Note that they differ from those used in EC->AB. We then used the same method to predict essential genes in PA by learning the features from EC, denoted EC->PA, and generated a ROC curve with an AUC score of 0.69 and a PPV of 0.57 (Figure 5-4A and B). The reciprocal PA->EC prediction showed a similar pattern of decreased accuracy (AUC = 0.79 and PPV = 0.41).

Figure 5-4 ROC curves for: (A) Ten-fold cross-validations on the EC essential gene data set. (B) Predictions of PA essential genes. The classifier was trained on EC dataset and evaluated on PA essential genes.

Transferring essential gene annotations between E. coli and B. subtilis

To explore the limit of the transferability, we also attempted to predict essential genes in B. subtilis (BS). Unlike EC, PA and AB, BS is a Gram-positive bacterium. The evolutionary distance between EC and BS is substantially greater than that between EC and the other two species: it is estimated to be around 3000 million years. Among the 271 essential genes listed in (95), we included 192 that were determined by experimental techniques and disregarded 79 genes that were predicted by homology mapping from other bacteria, mostly E. coli. Using the strategy described in previous sections, we applied our methods to transfer essential gene annotations from EC to BS, denoted EC->BS, and compared the predictions with the available known essential gene dataset in BS. The prediction in BS generated a ROC curve with an AUC score of 0.80 and a PPV of 0.54 (Figures 5-5A and B). Similarly, the reciprocal BS->EC prediction yielded a ROC curve with an AUC score of 0.86 and a PPV of 0.48. The results suggest that, despite the long evolutionary distance between BS and EC, there are common characteristics underlying EC and BS essential genes, represented by features that can still be recognized by our machine learning approach.

Integrative genomics significantly improves the accuracy and coverage compared with homology mapping

In order to illustrate the substantial improvement in coverage by our method, we first used homology mapping to transfer essential gene annotations from EC to AB. Among the 302 known essential genes in EC, 234 could be directly mapped to AB genes using an RBH approach. Therefore, the corresponding 234 orthologous genes in AB were predicted by RBH to be essential. Among these 234 predictions, 195 were truly essential according to the AB essential gene dataset (Figure 5-6). Note that these 234 orthologs are the maximal number of predictions homology mapping can make, given the definition of orthologs. We then selected cutoffs so that our method made as many predictions as there are essential genes in the target organism, i.e. 499 in AB. Of the 195 genes correctly predicted by homology mapping, our approach also predicted 189 (97%) as essential. In addition, our approach made 77 unique predictions that could not be made by homology mapping.
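The coverage limit of homology mapping can be made explicit with the counts given above. The sketch below only recomputes ratios from numbers in the text; the variable names are hypothetical.

```python
ec_essential = 302      # known EC essential genes
rbh_predictions = 234   # EC essential genes with an RBH ortholog in AB
rbh_correct = 195       # of those, essential in AB
ab_essential = 499      # known AB essential genes
also_found_by_ig = 189  # RBH-correct genes also predicted by our approach

mappable_fraction = rbh_predictions / ec_essential  # share of EC essentials that map
rbh_precision = rbh_correct / rbh_predictions       # correct among RBH predictions
rbh_coverage = rbh_correct / ab_essential           # AB essentials reachable by RBH
agreement = also_found_by_ig / rbh_correct          # overlap with our approach
```

Even with good precision, homology mapping can recover at most about two fifths of the AB essential genes, because it cannot make predictions outside the ortholog set.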

We use the following three examples to illustrate the discrepancies between our method and homology mapping. First, ACIAD0822 was determined as essential by both targeted mutagenesis (92) and our prediction. This gene has been annotated with the function of aspartyl/glutamyl-tRNA amidotransferase and has no ortholog in EC (96). Its closest homolog in EC is b1394, which is involved in the fatty acid metabolic process (GO:0006631), a function different from that of ACIAD0822. In addition, b1394 is a non-essential gene. In this case, homology mapping is unable to predict ACIAD0822 as essential. In contrast, the integrated effect of four strong features (PHYS, PA, CAI and Cyto) enabled our method to correctly assign this gene as essential.

In another example, ACIAD2634 was determined as essential by both targeted mutagenesis (92) and our prediction. Its closest homolog in EC is b2499, a non-essential gene with the same function. In this case, RBH incorrectly predicted it as non-essential. However, the combined influence of DES, PHYS, Nc and CAI allowed our method to override the incorrect assignment by RBH. On the other hand, ACIAD2640 was determined as an essential gene by both targeted mutagenesis (25) and the RBH approach, while our method incorrectly predicted it as non-essential. The main reason we failed to predict it as essential is that this gene has a paralog in AB, which resulted in a strongly unfavorable PA score. Its predicted subcellular localization further reinforced the incorrect assignment. Our predictions in PA and BS suggested a similar conclusion (Figure 5-6).

Figure 5-6 The integrative approach significantly extends the coverage of homology mapping. IG stands for the integrative approach; RBH stands for the reciprocal best hit approach. For the IG method, the cutoffs are set to the number of essential genes in each organism (PA: 678, AB: 499, BS: 192).

DISCUSSIONS

By taking advantage of the abundant genomic sequences and functional genomics data available in four bacterial species, EC, PA, AB and BS, we have developed a machine learning-based approach that predicts essential genes by integrating features potentially associated with gene essentiality in Prokaryotes. Although essential gene data sets are also available in many other genomes (97), most of them were determined by transposon mutagenesis, whose results may contain systematic biases (93, 98). To strike a balance between the comprehensiveness and the validity of our analysis, we chose to include all three bacterial species (EC, AB and BS) whose essential genes were determined by targeted mutagenesis, which is considered to give the highest quality, and the one (PA) whose essential genes were determined by transposon mutagenesis by two independent groups. Our 10-fold cross-validations in four organisms showed AUC scores 0.9, suggesting that gene essentiality, albeit a complex property, is highly predictable by learning the characteristics underlying gene essentiality. We believe this is the best cross-validation result in the same organism to date in predicting essential genes. We attribute this significant improvement over previous studies to incorporating both intrinsic and context-dependent features. In particular, we discovered domain enrichment, which has not been considered in previous studies, as the strongest feature.

Our study is also significant in that it is the first report that gene essentiality can be reliably transferred between distantly related organisms using a machine learning-based approach. When using our method to transfer essentiality between distantly related organisms, the accuracy of predicting essential genes can be affected by the following four factors:

First, the essential gene data set on which the classifiers are trained should be of high quality. Errors in the training dataset will significantly reduce the accuracy of predictions, as we observed in PA->EC. Second, the essentiality should be transferred under the same or highly similar growth conditions. Gene essentiality is likely a contextual property (89). Organisms are likely to use different sets of essential genes under different conditions. Predicting essential genes under conditions different from those of the training set will likely result in decreased predictive accuracy, as we observed between EC and AB. However, a recent study on E. coli conditional essential genes showed that <20% of the total essential genes are different between glycolysis and glucose metabolisms (99). Therefore, our algorithm will still be useful in capturing the majority of the essential genes in the target organism even when the growth conditions are different, although the best performance will be achieved under the same or highly similar growth conditions. Third, the evolutionary distance seems to play an important role in the accuracy of predictions. It is encouraging to see that the classifier can transfer gene essentiality between Gram-negative and Gram-positive bacteria, although the accuracy of the transfer is lower than between Gram-negative bacteria. An interesting future direction would be to investigate how far our method can be extended. For example, to what extent can essential genes be transferred between Prokaryotes and Eukaryotes?

The comparison between our method and homology mapping highlights the limitations of homology mapping. Homology mapping is most useful in closely related organisms, such as S. cerevisiae and S. mikatae (91). However, in more divergent organisms, it is severely limited by the number of conserved orthologs. In contrast, our approach does not have this limitation because the prediction is based on features that can be computed for all genes; therefore, it can readily explore the gene space for which homology mapping is inapplicable.

In summary, by integrating features available for all genes, our method provides a valuable alternative for predicting essential genes beyond orthologs. The application of our approach to bacterial species has the potential to significantly improve our ability to re-engineer microorganisms as well as to respond to emergency situations, such as a bioterrorist attack or the bioremediation of oil-contaminated Gulf regions. Although our research was performed in Prokaryotes, where the highest quality essential gene datasets are available, the conclusions drawn from this study are expected to be valid in other domains of life as well.

5.1.2 A Statistical Framework for Improving Genomic Annotations of Essential Genes

In current genomic databases, most of the annotations of microbial essential genes are derived directly or indirectly from transposon mutagenesis (TM) experiments. In fact, whole-genome TM followed by sequence-based identification of insertion sites is the most frequently used experimental approach to determine essential genes in microorganisms (93, 94, 100-113). Using this approach, essential genes for a variety of bacterial species, such as Escherichia coli and Pseudomonas aeruginosa, have been identified, greatly increasing the insights into the essential processes necessary for growth of these bacteria under defined conditions.

Transposons are segments of DNA that can move (transpose) from one location in a genome to another (114, 115). The locations to which a transposon can move depend on the sequence that the transposase recognizes and cleaves, although the recognition sequence for some transposons is unclear or has yet to be determined. TM results in disruption of the region of the genome where the transposon is inserted. If an insertion within a predicted ORF allows the resulting strain to form a colony on appropriate solidified media, it is unlikely that the ORF is essential for viability under those conditions (Fig. S1). Therefore, TM identifies essential genes using a “negative” approach, i.e., identifying many regions that are not essential and presuming that everything else is essential.

Due to the random nature of transposon insertion events, a number of factors may create systematic biases in TM experiments. For example, it is inevitable that some genes, especially shorter ones, will be missed simply by chance (94, 101, 106). This creates false positive errors, in which non-essential genes are determined to be essential by TM. On the other hand, an insertion may take place in any part of a gene, including the extreme ends, where it may not fully disrupt the function of the gene product (93, 94, 100-102, 104, 106, 116). This creates false negative errors, in which essential genes are determined to be non-essential by TM. These biases from TM experiments have introduced substantial errors into the essential gene annotations in current genomic databases (117). To make large-scale integrative and comparative analysis of essential genes possible, these biases must be quantitatively assessed and corrected.

In this project, we developed a novel Poisson-model-based statistical framework to simulate the TM insertion process and subsequently correct the experimental biases. Briefly, the statistical framework works as follows. We first quantitatively assessed the effects of potential factors that may affect the accuracy of TM results, such as gene length and relative insertion positions, and incorporated the relevant factors into the framework. By iteratively optimizing the parameters, we finalized the model and inferred the actual insertion events that occurred in each gene given the observed insertion information. Finally, we described each gene's essentiality as a probability and provided corrections for possible biases in the TM-assigned annotations. We took advantage of the definitive mapping of essential genes in E. coli MG1655 determined by single-gene knockout experiments (PEC set) (118) to identify the errors in the essential gene annotations produced by TM experiments (Gerdes set) (101) by comparing their assignments. Although the single-gene knockout experiments are not completely error free, they have been considered the least error-prone (118). We also recognize that the essential genes uniquely identified by TM may have biological significance, as they may represent genes essential for fitness, as suggested by Gerdes et al. (98); however, since our focus was on genes essential for survival, we still refer to them as “errors”.

Assessing the effects of four factors on the accuracy of TM experiments

Previous studies suggested that the accuracy of TM experiments may be affected by four main factors: gene length, insertions in the distal regions, number of insertions per gene, and polar effects (93, 101). To quantitatively assess each factor's influence on the accuracy of TM results, we examined the association of each factor with the FETmEs and FNTmNs, respectively (Fig. 5-7).

(1) Gene length: potentially causing false positive errors

In TM experiments, genes in which no transposon hit has been detected are assigned as essential. The average detectable insertion density is about 1 per 400 bp in TM experiments at saturation levels (94, 101, 106). This suggests that relatively short genes (e.g., ≤300 bp) can easily be missed simply by chance and thus be incorrectly labeled as essential. To quantitatively assess this influence, we compared the lengths of essential genes in the Gerdes and PEC sets, using the total genes as a control (Fig. 5-7A). The Student's t-test shows that the average length of essential genes in the Gerdes set (730 bp) is significantly shorter than that in the PEC set (1,003 bp) (P-value < 10^-11). It is also significantly shorter than that of total genes (982 bp) (P-value < 10^-25), while the difference between the PEC set and total genes is not significant (P-value > 0.05).

(2) Insertions in the 5'- and 3'-ends of genes: potentially causing false negative errors

In TM experiments, insertions occurring at the extreme ends of a gene's coding sequence sometimes do not sufficiently disrupt its function (93, 94, 100-102, 104, 106, 116). In these cases, essential genes may be mistakenly assigned as non-essential, creating a false negative error. We compared the distributions of insertion positions within the ORFs between FNTmNs and TNTmNs. As expected, the FNTmNs have a higher percentage of transposon insertions in the 3'- and 5'-ends than the TNTmNs (Fig. 5-7B). To assess the significance of this difference, we simulated purely random insertion experiments within the coding sequences. The P-values showed that the FNTmNs have significantly more insertions than the TNTmNs in the 3'-most 20% and the 5'-most 5% of the coding sequence. We named this the “25% extreme ends” rule and used it later in the model.
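The “25% extreme ends” rule can be illustrated with a short sketch. The function names and the 0-based offset convention (distance from the 5' start of the coding sequence) are assumptions made for illustration; only the 5% and 20% fractions come from the text.

```python
def is_extreme_end(offset_bp, gene_length_bp,
                   five_prime_frac=0.05, three_prime_frac=0.20):
    """Return True if an insertion lies in the 5'-most 5% or 3'-most 20% of an ORF.

    offset_bp is the 0-based distance of the insertion from the 5' start of
    the coding sequence (a hypothetical convention chosen for this sketch).
    """
    if not 0 <= offset_bp < gene_length_bp:
        raise ValueError("insertion lies outside the coding sequence")
    relative = offset_bp / gene_length_bp
    return relative < five_prime_frac or relative >= 1.0 - three_prime_frac


def count_by_region(offsets_bp, gene_length_bp):
    """Split a gene's insertions into (extreme-end count, middle count)."""
    n_ends = sum(is_extreme_end(o, gene_length_bp) for o in offsets_bp)
    return n_ends, len(offsets_bp) - n_ends
```

For a 1,000 bp gene with insertions at offsets 10, 500 and 850, `count_by_region([10, 500, 850], 1000)` returns `(2, 1)`: two insertions in the extreme ends and one in the middle.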

(3) Number of insertions per gene: potentially causing false negative errors

Occasionally, a few insertions in a gene are insufficient to completely disrupt its function, especially when the target gene is relatively long (93, 94, 101, 106, 116, 119). We plotted the distribution of the number of insertions per gene for both FNTmNs and TNTmNs (Fig. 5-7C). The histogram showed that about 75% of the FNTmNs, which were mistakenly assigned as non-essential by TM, harbored only a single insertion. The average number of insertions per gene among FNTmNs was 1.56, significantly smaller than that among TNTmNs (4.13) (P-value ≤ 10^-26).

(4) Polar effects: potentially causing false positive errors

In TM experiments, polar effects can occur when a transposon inserts into a dispensable gene and prevents the transcription of essential genes downstream in the same operon (94, 101, 106). Because the insertion in this dispensable gene disrupts the expression of the downstream essential genes and causes the death of these mutants, the dispensable gene will be incorrectly labeled as essential, causing a false positive error.

To estimate the number of FETmEs caused by polar effects, we examined all TmEs within each of 2,665 operons that were inferred experimentally or computationally. If a TETmE resides downstream of a FETmE, this FETmE is considered to be potentially caused by a polar effect. Among the 429 FETmEs, only 46 may be caused by polar effects. The small number is likely due to the fact that many TM experiments have been designed to prevent polar effects, typically by designing transposons with a strong or regulatable promoter downstream of the transposase but still within the transposon.
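The operon scan described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name, the representation of an operon as a list of gene identifiers ordered in the direction of transcription, and the use of plain sets for the TM and gold-standard assignments are all assumptions.

```python
def flag_polar_candidates(operons, tm_essential, true_essential):
    """Return FETmEs that have at least one TETmE downstream in the same operon.

    operons        : list of gene-id lists, each ordered 5'->3' along the
                     transcript (hypothetical input format)
    tm_essential   : set of genes assigned as essential by TM (TmEs)
    true_essential : set of genes essential in the gold standard
    """
    candidates = set()
    for genes in operons:
        for index, gene in enumerate(genes):
            # FETmE: essential according to TM but not in the gold standard.
            if gene not in tm_essential or gene in true_essential:
                continue
            downstream = genes[index + 1:]
            # TETmE: essential according to both TM and the gold standard.
            if any(d in tm_essential and d in true_essential for d in downstream):
                candidates.add(gene)
    return candidates
```

With `operons = [["a", "b", "c"], ["d", "e"]]`, `tm_essential = {"a", "c", "d"}` and `true_essential = {"c"}`, only gene `a` is flagged, because `c` is a true essential gene downstream of it, whereas `d` has no essential gene downstream.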

[Figure 5-7 panels. (A) Gene length. (B) Position of insertions: percentage of genes versus percentile of ORF length (5'->3'), for false and true non-essential genes. (C) Number of insertions: percentage of genes versus number of insertions, for false and true non-essential genes.]

Figure 5-7 Three factors have strong associations with false TM assignments. (A) Gene length. The TmEs are significantly shorter than the essential genes in the PEC dataset and than total genes. Many of these short genes may be false essential genes. (B) Position of insertions. Essential genes mistakenly assigned as non-essential by TM often have insertions in the 25% extreme ends (5% at the 5' end and 20% at the 3' end). These insertions do not completely disrupt a gene's function. (C) Number of insertions. 75% of the essential genes mistakenly assigned as non-essential by TM have only one insertion.

Developing the statistical framework to correct TM errors

We then developed a statistical framework capable of incorporating the factors that have strong associations with FETmE and FNTmN assignments. This model not only estimates the overall error rates, but also assigns a score indicating the probability that an individual gene is essential given the TM assignments. We incorporated three of the above four factors into our model. Since polar effects are responsible for relatively few FETmEs, we chose not to include this factor. The general idea of this model is illustrated in Fig. 5-8. This model requires two assumptions:

(A) Transposons insert randomly and independently into the coding region of a gene; and

(B) Each transposon has the same ability to disrupt a gene's function, although this ability may vary at different regions of a gene.

Assumption (A) does not require transposon insertions to be uniformly distributed along the entire genome. It therefore accommodates the insertion “hot” and “cold” spots observed in microbial genomes and provides a more realistic approximation of the transposon insertion process.

If we assume that transposons insert into a gene independently and that insertions occur at a constant rate during mutagenesis, then this process can be characterized by a Poisson distribution (120, 121). The probability that k insertions occur within a gene of length L can be expressed as:

$$\mathrm{Poisson}(k;\, rL) = \frac{e^{-rL}\,(rL)^{k}}{k!}.$$

Here r is the local insertion density on the DNA fragment, estimated by counting the number of insertions within a 30 kb-long region flanking the coding sequence. If k=0, this equation describes the probability that this ORF is missed in TM experiments.
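As a worked illustration of the Poisson expression, the sketch below evaluates the probability that a gene receives no insertion at all. The function names are hypothetical; the density of 1 insertion per 400 bp and the 30 kb window are the figures quoted in the text.

```python
import math


def local_density(insertions_in_window, window_bp=30_000):
    """Estimate r as insertions per bp in the region flanking the gene."""
    return insertions_in_window / window_bp


def poisson_pmf(k, rate_per_bp, length_bp):
    """Probability of exactly k insertions in a gene of the given length."""
    mean = rate_per_bp * length_bp
    return math.exp(-mean) * mean ** k / math.factorial(k)


def miss_probability(rate_per_bp, length_bp):
    """Probability that a gene receives no insertion at all (the k = 0 case)."""
    return poisson_pmf(0, rate_per_bp, length_bp)
```

At a density of 1 per 400 bp, `miss_probability(1 / 400, 300)` is about 0.47, whereas `miss_probability(1 / 400, 1000)` is about 0.08, which illustrates why short genes are so easily missed by chance.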

Based on assumption (B), we defined two parameters, P1 and P2, both in the range (0, 1), representing the probability that an individual transposon insertion disrupts a gene's function when the insertion occurs at the 25% extreme ends or in the middle of a gene, respectively. We assumed the same probability (P1) for the 5'-most 5% and the 3'-most 20% of a gene.

Under these two assumptions, we can calculate the probability of being essential for each gene given the TM assignments.

(A) If it is a TM-assigned essential gene (TmE), we have:

$$\Pr(E = 1 \mid \mathrm{TmE}) = \Pr(E = 1 \mid n_{\mathrm{real}} \ge n_{\mathrm{obs}} = 0) \qquad (1)$$

(B) Similarly, if it is a TM-assigned non-essential gene (TmN), we have:

$$\Pr(E = 1 \mid \mathrm{TmN}) = \Pr(E = 1 \mid n_{\mathrm{real}} \ge n_{\mathrm{obs}} > 0) \qquad (2)$$

Here E is a binary variable: E = 1 if the gene is essential and E = 0 otherwise. $n_{\mathrm{real}}$ represents the real number of insertions occurring in this gene during the TM experiment, and $n_{\mathrm{obs}}$ represents the observed number of insertions in the TM dataset. Here $n_{\mathrm{obs}} = 0$ if the gene is assigned as essential by TM, and $n_{\mathrm{obs}} > 0$ if it is assigned as non-essential. $n_{\mathrm{obs}}$ can be further separated into two parts, $n_{3'\,\mathrm{or}\,5'\,\mathrm{ends}}$ and $n_{\mathrm{middle}}$, representing the observed insertions occurring at the 25% extreme ends or in the middle of a gene, respectively. In the transposon insertion process, if an insertion hits a true essential gene and disrupts its function, the mutant dies and the insertion is not observable in the TM dataset; therefore, the real insertion number $n_{\mathrm{real}}$ is greater than or equal to the observed insertion number $n_{\mathrm{obs}}$. For a non-essential gene, the mutant survives whether or not an insertion disrupts the gene's function, so the real insertion number $n_{\mathrm{real}}$ always equals the observed insertion number $n_{\mathrm{obs}}$.

After deriving these equations, we used an iterative procedure (122-124) to estimate the values of the unknown parameters that determine Eqs. (1) and (2).
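The full derivation and the iterative estimation are not reproduced in this section. The sketch below is therefore only one possible reading of Eqs. (1) and (2), built from the two stated assumptions plus several additional ones that are not taken from the text: a prior essential rate `prior`, insertions landing in the extreme ends with probability 0.25, and Bayes' rule applied to the observed end and middle counts. Under Poisson thinning, the observable insertions in an essential gene are Poisson with means 0.25·rL·(1 − P1) and 0.75·rL·(1 − P2), while in a non-essential gene all insertions are observable.

```python
import math


def posterior_essential(n_ends, n_middle, rate_per_bp, length_bp,
                        p1, p2, prior, end_frac=0.25):
    """Posterior probability that a gene is essential given observed insertions.

    Illustrative assumption, not the dissertation's derivation: the likelihood
    ratio (essential vs. non-essential) for the observed counts simplifies to
        exp(rL * (end_frac * p1 + (1 - end_frac) * p2))
        * (1 - p1) ** n_ends * (1 - p2) ** n_middle.
    """
    mean = rate_per_bp * length_bp
    log_ratio = mean * (end_frac * p1 + (1.0 - end_frac) * p2)
    log_ratio += n_ends * math.log(1.0 - p1) + n_middle * math.log(1.0 - p2)
    odds = prior / (1.0 - prior) * math.exp(log_ratio)
    return odds / (1.0 + odds)
```

Under these assumptions, a gene with no observed insertion becomes more likely to be essential the longer it is, and a single insertion in the middle of a gene lowers the posterior far more than a single insertion in an extreme end, which matches the qualitative behavior described in the text.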

Figure 5-8 Illustration of the statistical model.

Validating the model in E. coli TM dataset

The Gerdes dataset contains 615 TmEs and 3218 TmNs. Using the above algorithm, we obtained the converged values P1 = 0.92 and P2 = 0.995 and an overall essential rate ρess = 12.9%. Since our model assigns each individual gene a score indicating its probability of being essential, we ranked the genes in the TmEs and TmNs separately. Among the 615 TmEs, the expected number of essential genes ($\sum_{i=1}^{n} P(E_i = 1)$) we estimated was 479. Using the expected number of essential genes as the cutoff, the top 479 genes were named predicted essential genes among the TM-assigned essential genes (PETmEs), and the remaining 136 genes were named predicted non-essential genes among the TM-assigned essential genes (PNTmEs). Similarly, among the 3218 TmNs, the expected number of essential genes we estimated was 16. Using this cutoff, the top 16 genes were named predicted essential genes among the TM-assigned non-essential genes (PETmNs), and the remaining ones were named predicted non-essential genes among the TM-assigned non-essential genes (PNTmNs).
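The ranking-and-cutoff step can be sketched as follows. The function name and the dictionary input are hypothetical, and rounding the expected count to the nearest integer is an assumption; the text does not state how the expected number was converted into a cutoff.

```python
def split_by_expected_count(prob_by_gene):
    """Rank genes by essentiality score and cut at the expected number of essentials.

    prob_by_gene maps a gene id to its estimated probability of being
    essential. The cutoff is the sum of these probabilities, rounded to the
    nearest integer (an assumption of this sketch).
    """
    expected = round(sum(prob_by_gene.values()))
    ranked = sorted(prob_by_gene, key=prob_by_gene.get, reverse=True)
    return ranked[:expected], ranked[expected:]
```

Applied separately to the TmEs and the TmNs, the first returned list corresponds to the PETmEs or PETmNs and the second to the PNTmEs or PNTmNs.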

To assess the accuracy of our predictions, we compared our results with the PEC dataset (Table 5-1). Among the 479 PETmEs, 176 (or 37%) were true essential genes, a significantly higher proportion than in the original TmEs (186/615 = 30%) (P-value = 0.013, Fisher's exact test). Remarkably, among the 136 PNTmEs that we filtered out, only 10 (or 7.4%) were true essential genes, a significantly lower proportion than the 30% in the original TmEs (P-value < 10^-8). On the other hand, among the 16 PETmNs, 5 (or 31%) were true essential genes, a significantly higher proportion than in the original TmNs (2.3%) (P-value < 10^-4). These results strongly indicate that our model enhanced the accuracy of the original TM assignments.

Table 5-1 Improvement of overlaps with the PEC dataset using our model.

TM dataset (Gerdes set)    Our statistical model    PEC dataset (gold standard)
                                                    E (259)      N (3574)
TmEs (615)                 PETmEs (479)             176          303
                           PNTmEs (136)             10           126
TmNs (3218)                PETmNs (16)              5            11
                           PNTmNs (3202)            68           3134
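The true essential rates quoted in the text can be re-derived from the counts in Table 5-1. The short sketch below only repeats that arithmetic; the dictionary layout and function name are illustrative choices.

```python
# Counts from Table 5-1: group -> (essential in PEC, non-essential in PEC)
TABLE_5_1 = {
    "PETmEs": (176, 303),
    "PNTmEs": (10, 126),
    "PETmNs": (5, 11),
    "PNTmNs": (68, 3134),
}


def true_essential_rate(*groups):
    """Fraction of PEC-essential genes in the union of the named groups."""
    essential = sum(TABLE_5_1[g][0] for g in groups)
    total = sum(sum(TABLE_5_1[g]) for g in groups)
    return essential / total
```

For example, `true_essential_rate("PETmEs")` is 176/479 (about 37%), `true_essential_rate("PETmEs", "PNTmEs")` is 186/615 (about 30%) for the original TmEs, and `true_essential_rate("PETmNs", "PNTmNs")` is 73/3218 (about 2.3%) for the original TmNs.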

Testing the model’s robustness in analyzing subsaturation level TM datasets

Compared with a saturated TM experiment, an unsaturated TM experiment generally has a higher false positive rate (FPR), because genes are more likely to be missed by transposon insertions and thus incorrectly assigned as essential. To test our model's applicability to unsaturated TM datasets, we randomly removed 10%, 30% and 50% of the total insertions from the Gerdes dataset to simulate different subsaturation levels of TM experiments. We then applied our model to these subsaturated datasets. The results suggested that the model is robust when analyzing subsaturated TM datasets (Fig. 5-9). In Fig. 5-9, the lower curve (dashed line) shows the P-values of the Fisher's exact test examining whether the true essential rate in PNTmEs is significantly lower than that in the original TmEs set. Similarly, the upper curve (solid line) shows the P-values of the Fisher's exact test examining whether the true essential rate in our PETmNs is significantly higher than that in the original TmNs set. For each of the three removal levels (10%, 30% and 50%), we repeated the random removal 100 times to obtain the error bars. The results showed that under each of these subsaturation conditions, our model still significantly improved the TM results.
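The random-removal step can be sketched as follows. The representation of the dataset as a list of (gene, position) pairs and the function names are assumptions made for illustration; the sketch covers only the subsampling and the bookkeeping of genes that lose all their insertions, not the model fitting.

```python
import random


def subsample_insertions(insertions, removal_fraction, rng):
    """Randomly drop the given fraction of insertions and return the remainder.

    insertions is a list of (gene_id, position) pairs (a hypothetical format).
    """
    n_keep = len(insertions) - round(removal_fraction * len(insertions))
    return rng.sample(insertions, n_keep)


def hit_genes(insertions):
    """Set of genes that carry at least one insertion."""
    return {gene for gene, _position in insertions}


def newly_missed_genes(original, subsampled):
    """Genes hit in the original dataset but with no insertion after removal.

    In a subsaturated experiment these genes would be assigned as essential.
    """
    return hit_genes(original) - hit_genes(subsampled)
```

Repeating `subsample_insertions` with, for example, `random.Random(seed)` for 100 different seeds at each removal level would give the replicate datasets from which error bars can be computed.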

[Figure 5-9 plot, titled “Robustness of the model”: log10[P-value] (y-axis, -1 to -10) versus subsaturation level of the TM experiment (x-axis: 10%, 30%, 50%).]

Figure 5-9 Robustness of our model at subsaturation levels of transposon insertions. The dashed line showed p-values of the Fisher’s exact test to examine whether the true essential rate in PNTmEs is significantly lower than that in the original TmEs set. Similarly, the solid line showed p-values of the Fisher’s exact test to examine whether the true essential rate in our PETmNs is significantly higher than that in the original TmNs set.

Testing the model’s applicability to P. aeruginosa by allelic exchange experiments

The ultimate test is to see whether our model is applicable to other microorganisms. A set of essential genes has been determined by TM to a saturated level in another γ-proteobacterium, P. aeruginosa PAO1 (Jacobs dataset) (93). This TM dataset contains 678 putative essential genes and 4892 non-essential genes. If we assume that the probability that an individual transposon insertion disrupts a gene's function is the same across species, i.e., P1 = 0.92 and P2 = 0.995 as in E. coli, then the overall essential rate ρess we estimated for PAO1 is 10.1% and the expected numbers of PETmEs and PETmNs are 540 and 15, respectively.

Because a whole-genome single-gene knockout dataset is not yet available for this organism, we chose to pursue allelic exchange experiments to validate our predictions in PAO1.

To make sure that our experimental procedure can correctly identify essential genes, we first tested it on PA4238, an essential gene known from the literature, as a positive control. As expected, PA4238 was determined to be essential by the allelic exchange experiments. According to our model, PA4238 has a rank of 60 out of 678, which places it above the cutoff of 540 (the expected number of essential genes among TmEs); therefore it was predicted to be essential by our model. In this case, the results from the TM assignment, our model and the allelic exchange experiment are consistent (Table 5-2). We then selected six ORFs from the PAO1 TM dataset to further test our model: PA0723, PA2954 and PA2143 were selected from the TmEs, and PA3746, PA4260 and PA0985 were selected from the TmNs. Among the three TmEs (PA2954, PA0723 and PA2143), PA0723 was a true essential gene, but PA2954 and PA2143 turned out to be non-essential. According to our model, PA0723 was ranked 421 out of 678. Because the expected number of essential genes among TmEs is 540, it was correctly predicted to be essential. In contrast, PA2954 and PA2143 were given ranks of 590 and 664, respectively. They were both predicted to be false positive errors (i.e., non-essential genes) because their ranks fell below the cutoff of 540. Our model was correct in all three cases. Among the three TmNs (PA3746, PA4260 and PA0985), PA0985 was a true non-essential gene, but PA3746 and PA4260 were found to be essential. Our model assigned PA3746 a rank of 6 out of 4892. Because this rank is within the top 15, the expected number of essential genes among the TmNs, PA3746 was predicted to be essential. In contrast, PA0985 and PA4260 were assigned ranks of 106 and 104, respectively, and were predicted to be non-essential. Our model was correct in two of the three cases.

Overall, among the seven genes tested by allelic exchange experiments, our model agreed with the experimental validations in six of them. Remarkably, among all three cases that directly contradicted the TM assignments, allelic exchange experiments supported our predictions.

Table 5-2 Validation using allelic exchange experiments in P. aeruginosa PAO1. E – Essential; N – Non-essential.

PAO1 gene                    Length    Local insertion density    Assignment by TM    Rank by our model    Assignment by our model    Assignment by allelic exchange experiments
PA3746                       1374      4.9521                     N                   6/4289               E                          E
PA4260                       822       3.4102                     N                   104/4289             N                          E
PA0985                       1497      7.5415                     N                   106/4289             N                          N
PA4238 (positive control)    1002      3.3564                     E                   60/678               E                          E
PA0723                       249       7.4446                     E                   421/678              E                          E
PA2954                       570       2.0336                     E                   590/678              N                          N
PA2143                       288       2.1479                     E                   664/678              N                          N

DISCUSSION

In this study, we developed a statistical framework to systematically filter out errors (both FETmEs and FNTmNs) and thus improve the accuracy of TM-determined essential gene datasets. This model is significant in four ways:

First, our model significantly enhances the accuracy of the original TM assignments. In the E. coli TmEs, our PNTmEs, i.e., the false essential genes filtered out by our method, had a significantly lower true essential rate than that in the original TmEs. In contrast, in the TmNs, our PETmNs, i.e., the predicted essential genes from TmNs, had a significantly higher true essential rate than that in the original TmNs. The confidence scores generated by our model were shown to have positive correlations with the enrichment of true essential genes.

Second, our model is robust when analyzing subsaturated TM datasets and tolerates experimental errors, as demonstrated with the simulated subsaturated TM datasets. This is potentially useful for reducing the time and costs currently associated with TM experiments.

Third, because our model adopts simple but more realistic assumptions, it is applicable across multiple microorganisms. We applied it to the sequenced strain P. aeruginosa PAO1, in which only a TM dataset is available. Among the seven chosen genes for an experimental test, six of them were assigned correctly by our model with an overall accuracy of 86%. We have also demonstrated its broad applicability to microorganisms in F. novicida.

Finally, our model is flexible and able to incorporate more potential factors. For example, we assigned different weights to the insertions based on the positions where they were inserted in the genes. In the future, we will consider other factors that might affect the accuracy of TM experiments (119).

Our approach differs substantially from previous approaches. Our model is capable of estimating the probability of being essential for individual genes, rather than only estimating an overall essential rate for the whole genome as in the method of Jacobs et al. In addition, Jacobs et al. used a multinomial distribution, which cannot incorporate the difference between “hot” and “cold” spots. Furthermore, their method is not applicable to correcting false negative errors, i.e., detecting true essential genes among TmNs. The method of Lamichhane et al. was applicable to a subsaturation level of TM. Since a gold-standard dataset is not yet available for M. tuberculosis, we cannot compare our subsaturation results with theirs. In addition, their method shares the limitations of the method of Jacobs et al. in disregarding false negative errors and differences in insertion density. By taking these features into account, our model is expected to be more realistic and accurate.

5.2 Probing the Developmental Robustness in Fruit Fly Drosophila

5.2.1 Introduction

Development is a robust process that must achieve a precise and reproducible outcome despite inevitable variations among individual embryos (125-128). One such variation is the size of embryos. In Drosophila, patterning along the anterior-posterior (A-P) axis exhibits scaling properties such that embryonic structures are patterned in proportion to embryo length (129-132).

Bicoid (Bcd), a morphogenetic protein in Drosophila, is responsible for instructing patterning along the A-P axis (133-135). How developmental scaling is achieved is a subject of intense interest (136-138). Several models have been proposed to explain scaling along the A-P axis in Drosophila, but they currently remain at a theoretical level (139-145). For example, it was proposed recently that, theoretically, scaling could be achieved through the special operation of cytoplasmic streaming (145), but such a streaming mechanism has not yet been observed experimentally.

According to a widely held model (146), Bcd gradient formation can be described as a dynamical process of localized protein production and embryo-wide diffusion and decay. To fully describe and understand the Bcd gradient system, it is necessary to have a realistic model that considers the special geometry of the embryo, the dynamics of the media within which the Bcd gradient is formed, and key features of the bcd mRNA from which Bcd is produced. For example, unlike what is assumed in most of the current models, the maternally deposited bcd mRNA is not restricted to a single point at the anterior tip (147-150). Instead, it is distributed in a cloud-like shape in early embryos with its own volume, shape, density and location. How the distribution of bcd mRNA may affect the shape and formation of the Bcd gradient remains an important question. This question is further highlighted by a recent report that challenged the widely held diffusion model (151). In that report, it was shown that bcd mRNA distribution follows a dynamic process and, based on these findings, it was suggested that the shape and formation of the Bcd protein gradient are dictated by bcd mRNA distributions.

In this report, we develop a realistic 2-D simulation model of Bcd gradient formation. Our model considers both the distribution dynamics and the amount of bcd mRNA from which Bcd protein is synthesized. It also follows the doubling of nuclear number at each nuclear cycle and nuclear movements during development. Our model represents a realistic and quantitative description of the dynamics of the Bcd gradient system. Our model readily recapitulates both the experimentally observed scaling properties of the Bcd gradient and the stability of Bcd concentrations inside the nuclei during development. Our model also allows us to study the effects of dynamic distributions of bcd mRNA on Bcd gradient formation.

5.2.2 Results and Discussion

Basic features of the model

In our model, a Markov chain is applied to simulate the dynamics of Bcd protein diffusion inside the Drosophila embryo. At each time t, the Bcd distribution within the embryo is treated as one of the Markov states. The transition from one Markov state to the next is determined by diffusion and spatially uniform degradation of Bcd molecules at a transition probability (set to 1 in this work). Thus, the entire dynamic diffusion process of Bcd in the embryo is characterized by a chain of successive transitions from one Markov state to another as a function of time, with the current state depending solely on the previous state. We consider two forms of Bcd molecules in our model: free-diffusing molecules, Bfree, that are in the cytoplasm, and immobile molecules, Bbound, that are bound to low-affinity DNA sites inside the nuclei (see Discussion for further information). The binding and unbinding of Bcd to these DNA sites are characterized by the rate constants k1 and k2, respectively, and expressed as:

$$B_{\mathrm{free}} + \mathrm{DNA}_{\mathrm{sites}} \underset{k_2}{\overset{k_1}{\rightleftharpoons}} B_{\mathrm{bound}}, \qquad (1)$$

At equilibrium, the ratio of bound to free Bcd molecules is determined by the association constant KA = k1/k2:

$$\frac{[B_{\mathrm{bound}}]}{[B_{\mathrm{free}}]} = [\mathrm{DNA}_{\mathrm{sites}}]\,K_A, \qquad (2)$$

where [Bfree] is the concentration of free-diffusing Bcd in the cytoplasm, [Bbound] is the concentration of DNA-bound, immobile Bcd inside the nucleus, thus also representing the nuclear concentration of Bcd, and [DNAsites] is the concentration of low-affinity DNA sites, which are in excess relative to the total number of Bcd molecules. The concentrations of free-diffusing and DNA-bound Bcd molecules inside the embryo can be expressed by the following set of equations:

$$\frac{\partial [B_{\mathrm{free}}]}{\partial t} = D\nabla^{2}[B_{\mathrm{free}}] - k_1 [B_{\mathrm{free}}][\mathrm{DNA}_{\mathrm{sites}}] + k_2 [B_{\mathrm{bound}}] - \omega [B_{\mathrm{free}}], \qquad (3)$$

$$\frac{\partial [B_{\mathrm{bound}}]}{\partial t} = k_1 [B_{\mathrm{free}}][\mathrm{DNA}_{\mathrm{sites}}] - k_2 [B_{\mathrm{bound}}] - \omega [B_{\mathrm{bound}}], \qquad (4)$$

$$[B_{\mathrm{tot}}] = [B_{\mathrm{free}}] + [B_{\mathrm{bound}}], \qquad (5)$$

where [Btot] is the local total Bcd concentration, D is the diffusion constant of free-diffusing Bcd molecules inside the cytoplasm, and ω is the degradation rate (space-independent) of Bcd molecules. Thus, with a fast chemical equilibrium assumption, the change in local total Bcd concentration [Btot] can be expressed as:

$$\frac{\partial [B_{\mathrm{tot}}]}{\partial t} = \frac{D}{1 + [\mathrm{DNA}_{\mathrm{sites}}]\,K_A}\,\nabla^{2}[B_{\mathrm{tot}}] - \omega [B_{\mathrm{tot}}]. \qquad (6)$$



In our 2-D simulations, a slice of the Drosophila embryo is represented as an ellipse, within which the number of nuclei is doubled after each nuclear division. We model the embryo as a homogeneous medium between nuclear cycles 1-9. At the onset of nuclear cycle 10, all nuclei move to the cortex, resulting in two distinct homogeneous media within the embryo: the cortical layer and the inner part. In our model, Bcd protein is produced within an anteriorly localized bcd mRNA sphere. Fig. 5-10A represents a simulated 2-D embryo at nuclear cycle 14 showing the total local Bcd concentration [Btot]. Our simulation results realistically recapitulate the biological system and reveal several important features that are consistent with experimental observations. First, as seen experimentally (129, 131, 133, 152, 153), Bcd is concentrated within the cortical layer (Fig. 5-10A). Within this layer, the nuclear Bcd concentration forms an exponential gradient (see Fig. 5-10B,C) with a length constant λ of 94 µm, which is fully consistent with experimentally measured values (129, 131). Second, unlike other models in which Bcd production is restricted to a single point at the anterior, resulting in a Bcd concentration gradient that follows an exponential function throughout the entire A-P length, our simulated results realistically reproduce the experimentally observed "deviation" at the anterior part of the embryo, a region where Bcd concentrations do not follow the exponential function (129, 131, 152).
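To illustrate how a diffusion-and-decay equation of the form of Eq. (6) yields an exponential gradient, the following is a deliberately simplified 1-D sketch, not the 2-D model described here. It assumes a point-like source in the first grid cell, no-flux boundaries, a single homogeneous medium, and illustrative parameter values (an effective diffusion constant of 1 µm²/s and a degradation rate of 10^-4 per second) chosen only so that the theoretical steady-state length constant, the square root of the effective diffusion constant divided by ω, equals 100 µm. None of these values are taken from the dissertation.

```python
import math


def simulate_gradient_1d(n_cells=100, dx=5.0, d_eff=1.0, omega=1e-4,
                         production=1.0, dt=10.0, t_end=6e4):
    """Explicit finite-difference integration of dB/dt = D_eff * d2B/dx2 - omega * B.

    1-D segment of n_cells * dx micrometers, no-flux ends, constant production
    in the first (anterior) cell. All parameter values are illustrative.
    """
    alpha = d_eff * dt / dx ** 2
    if alpha > 0.5:
        raise ValueError("time step too large for the explicit scheme")
    b = [0.0] * n_cells
    for _ in range(int(t_end / dt)):
        new = [0.0] * n_cells
        for i in range(n_cells):
            left = b[i - 1] if i > 0 else b[i]              # no-flux boundary
            right = b[i + 1] if i < n_cells - 1 else b[i]   # no-flux boundary
            new[i] = b[i] + alpha * (left - 2.0 * b[i] + right) - omega * dt * b[i]
        new[0] += production * dt                           # anterior source
        b = new
    return b


def fitted_length_constant(profile, dx, start, stop):
    """Least-squares slope of ln(B) versus x over cells [start, stop); returns -1/slope."""
    xs = [i * dx for i in range(start, stop)]
    ys = [math.log(profile[i]) for i in range(start, stop)]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    return -1.0 / slope
```

Fitting the simulated profile between 50 and 200 µm from the source, for example with `fitted_length_constant(simulate_gradient_1d(), 5.0, 10, 40)`, should give a length constant close to the theoretical 100 µm. Small differences arise from the finite segment length and the grid spacing.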

Figure 5-10 Simulated Bcd distributions in a Drosophila embryo. A. A simulated embryo at nuclear cycle 14 showing the local total Bcd concentration [Btot] (arbitrary units). The A-P position is shown as absolute distance x (in µm) from the anterior. The ratio of total Bcd molecules in the cortical layer to those in the inner part of the embryo is 1.88 at nuclear cycle 14 (see text for further details). B. A plot of local DNA-bound Bcd concentration [Bbound] (arbitrary units) within the cortical layer as a function of fractional embryo length x/L. In this and other figures presented in this report, [Bbound] at each A-P position represents the mean [Bbound] value of all cubes within the cortical layer of the embryo at that A-P position. C. Same as B except that [Bbound] is on an ln scale. Linearity of ln[Bbound] indicates an exponential Bcd protein gradient; see text legend for more information about adjusted R^2 values to further evaluate the quality of exponential fitting of the simulated data.

Stability of Bcd gradient at different nuclear cycles

One of the most striking features of the Bcd gradient dynamics is the stability of Bcd concentrations inside individual nuclei during nuclear cycles 10-14 (153). To specifically evaluate our model on this critical feature, we simulated Bcd gradient formation and analyzed Bcd concentration profiles at distinct nuclear cycles. We plotted two separate profiles: the local nuclear Bcd concentration [Bbound] (see Eqs. 1-5) and the Bcd concentration inside individual nuclei [Bn] (computed as [Bbound] divided by the nuclear number), which are shown in Fig. 5-11A and B, respectively. In these figures, Bcd concentration profiles at nuclear cycles 10-14 are shown. Our results demonstrate that, even as the nuclear number doubles after each nuclear division during this period, the profiles of Bcd concentrations inside individual nuclei [Bn] remain stable (Fig. 5-11B). Consistent with the live-imaging data (153), our simulated [Bn] profiles maintain <10% variation between nuclear cycles 10-14 throughout the entire A-P length. Similar to nuclear cycle 14 (Fig. 5-10B,C), the simulated nuclear Bcd concentration profiles at other nuclear cycles also follow an exponential function with a length constant consistent with experimental values and exhibit the proper anterior "deviation" (not shown).

Figure 5-11 Stability of nuclear Bcd concentrations.

Scaling properties of the Bcd gradient

As suggested by our recent studies (131), the scaling properties of embryonic patterning along the A-P axis can be directly traced to the scaling properties of the Bcd gradient itself. To determine whether our model can recapitulate such properties of the Bcd gradient, we conducted two different analyses. In our simulations, we assumed that the amount of bcd mRNA is correlated with the embryo volume. In our first analysis, we simulated two individual embryos of distinct lengths (550 μm and 600 μm). Fig. 5-12A and B show the nuclear Bcd concentration profiles at nuclear cycle 14 expressed as a function of, respectively, absolute distance from the anterior x (in μm) and fractional embryo length x/L. Our simulation results show that, while these two profiles in the anterior and middle portions of the embryo are apart from each other in the x plot, they converge in the x/L plot, most notably in the broad mid-portion of the embryo. When the same bcd mRNA amount was applied to these two embryos, such a convergence was not observed (data not shown). In a second analysis, we simulated a group of 50 embryos that differ in size. We calculated nuclear Bcd concentration noise (standard deviation divided by the mean) at nuclear cycle 14 for these embryos as a function of either x or x/L. Our results (Fig. 5-12C) show a significantly lower [Bbound] noise in x/L than in x, a finding that is fully consistent with experimental data (131). In contrast, when there is no correlation between bcd mRNA amount and embryo size in our simulations, [Bbound] noise actually became higher as a function of x/L than of x. These two sets of analyses demonstrate that, with a simple assumption that the amount of bcd mRNA deposited into an egg is proportional to egg volume (or a Bcd production rate proportional to egg volume), our model can readily reproduce the experimentally observed scaling properties of the Bcd gradient.
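The noise comparison described above can be sketched with a deliberately simplified toy calculation. The Python example below is not the 2-D simulation model of this chapter: it assumes a 1-D exponential profile B(x) = A exp(-x/λ) with a fixed length constant, and it represents "Bcd production proportional to egg volume" by an amplitude proportional to L^3 (embryos of fixed shape). The names bcd_profile, noise and noise_at, the length constant of 100 μm, the reference length of 550 μm, and the uniform 500-600 μm length distribution are all hypothetical choices made for illustration.

```python
import math
import random
import statistics

LAMBDA_UM = 100.0   # assumed length constant (um); hypothetical value
L_REF_UM = 550.0    # assumed reference embryo length (um); hypothetical value


def bcd_profile(x_um, length_um, scaled_source):
    """Toy profile B(x) = A * exp(-x / lambda).

    If scaled_source is True, the amplitude A grows as L**3, a stand-in for
    production proportional to egg volume; otherwise A is the same for all embryos.
    """
    amplitude = (length_um / L_REF_UM) ** 3 if scaled_source else 1.0
    return amplitude * math.exp(-x_um / LAMBDA_UM)


def noise(values):
    """Noise as defined in the text: standard deviation divided by the mean."""
    return statistics.pstdev(values) / statistics.mean(values)


def noise_at(fraction, lengths, scaled_source):
    """Return (noise at fixed absolute x, noise at fixed x/L) across embryos."""
    x_abs = fraction * L_REF_UM
    at_x = [bcd_profile(x_abs, L, scaled_source) for L in lengths]
    at_x_over_l = [bcd_profile(fraction * L, L, scaled_source) for L in lengths]
    return noise(at_x), noise(at_x_over_l)


rng = random.Random(0)
lengths = [rng.uniform(500.0, 600.0) for _ in range(50)]  # 50 toy embryos

for scaled in (True, False):
    n_x, n_xl = noise_at(0.5, lengths, scaled)
    label = "volume-scaled source" if scaled else "size-independent source"
    print(f"{label}: noise in x = {n_x:.3f}, noise in x/L = {n_xl:.3f}")
```

In this toy setting the mid-embryo noise is lower in x/L than in x when the amplitude scales with embryo size, and the ordering reverses when the amplitude is independent of size. This mirrors the qualitative trend reported above, but it says nothing quantitative about the full model.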


Figure 5-12 Scaling properties of the Bcd gradient.

Discussions

The 2-D simulation model of the Bcd gradient described in this report realistically captures the dynamics of the biological system. It considers the exponential increase in nuclear number and nuclear movements in early embryos. Consistent with experimental findings (153), our simulated results show that the profiles of Bcd concentrations inside individual nuclei [Bn] are stable during nuclear cycles 10-14 (Fig. 5-11B). Our 2-D model can recapitulate or explain other important features of the Bcd gradient system. In particular, it can recapitulate the scaling properties of the Bcd gradient with a simple assumption that the rate of Bcd production in an embryo is proportional to the embryo volume. It also allows us to simulate realistically the production of Bcd within a bcd mRNA sphere and analyze the consequences of the dynamic distributions of bcd mRNA on Bcd protein gradient formation.

The 2-D model described in this report represents a useful tool to simulate the Bcd gradient system in a realistic way. Since our simulation procedure is based on Bcd distribution dynamics in both X and Y axes on a real 2-D plane, it could be readily expanded into a 3-D model. Our model can also be used in future simulations of Bcd gradient dynamics under specific physical or genetic perturbations that may affect, e.g., bcd mRNA localization (150, 154, 155), Bcd protein diffusion/degradation, or developmental clock distortions (129, 156, 157).

Finally, although our current model focuses on the deterministic behaviors of Bcd distribution dynamics, it could be used as a foundation to explore stochastic properties of the system.


Bibliography

1. Rezaee F, Casetta B, Levels JH, Speijer D, & Meijers JC (2006) Proteomic analysis of high-density lipoprotein. Proteomics 6(2):721-730.

2. Vaisar T, et al. (2007) Shotgun proteomics implicates protease inhibition and complement activation in the antiinflammatory properties of HDL. J Clin Invest 117(3):746-756.

3. Karlsson H, Leanderson P, Tagesson C, & Lindahl M (2005) Lipoproteomics II: mapping of proteins in high-density lipoprotein using two-dimensional gel electrophoresis and mass spectrometry. Proteomics 5(5):1431-1445.

4. Heller M, et al. (2005) Mass spectrometry-based analytical tools for the molecular protein characterization of human plasma lipoproteins. Proteomics 5(10):2619-2630.

5. Davidson WS, et al. (2009) Proteomic analysis of defined HDL subpopulations reveals particle-specific protein clusters: relevance to antioxidative function. Arterioscler Thromb Vasc Biol 29(6):870-876.

6. Gordon SM, Deng J, Lu LJ, & Davidson WS (2010) Proteomic characterization of human plasma high density lipoprotein fractionated by gel filtration chromatography. J Proteome Res 9(10):5239-5249.

7. Link JJ, Rohatgi A, & de Lemos JA (2007) HDL cholesterol: physiology, pathophysiology, and management. Curr Probl Cardiol 32(5):268-314.

8. Yokoyama S (2006) Assembly of high-density lipoprotein. Arterioscler Thromb Vasc Biol 26(1):20-27.

9. Barter PJ & Kastelein JJ (2006) Targeting cholesteryl ester transfer protein for the prevention and management of cardiovascular disease. J Am Coll Cardiol 47(3):492-499.

10. Kontush A & Chapman MJ (2006) Antiatherogenic small, dense HDL--guardian angel of the arterial wall? Nat Clin Pract Cardiovasc Med 3(3):144-153.

11. Gordon S, Durairaj A, Lu JL, & Davidson WS (2010) High-Density Lipoprotein Proteomics: Identifying New Drug Targets and Biomarkers by Understanding Functionality. Curr Cardiovasc Risk Rep 4(1):1-8.

12. Moore RE, et al. (2005) Increased atherosclerosis in mice lacking apolipoprotein A-I attributable to both impaired reverse cholesterol transport and increased inflammation. Circ Res 97(8):763-771.

13. Cuchel M & Rader DJ (2006) Macrophage reverse cholesterol transport: key to the regression of atherosclerosis? Circulation 113(21):2548-2555.

14. Mackness MI, Arrol S, & Durrington PN (1991) Paraoxonase prevents accumulation of lipoperoxides in low-density lipoprotein. FEBS Lett 286(1-2):152-154.

15. Aviram M & Rosenblat M (2004) Paraoxonases 1, 2, and 3, oxidative stress, and macrophage foam cell formation during atherosclerosis development. Free Radic Biol Med 37(9):1304-1316.

16. Cockerill GW, Rye KA, Gamble JR, Vadas MA, & Barter PJ (1995) High-density lipoproteins inhibit cytokine-induced expression of endothelial cell adhesion molecules. Arterioscler Thromb Vasc Biol 15(11):1987-1994.

17. Barter PJ, Baker PW, & Rye KA (2002) Effect of high-density lipoproteins on the expression of adhesion molecules in endothelial cells. Curr Opin Lipidol 13(3):285-288.

18. Kuvin JT, Harati NA, Pandian NG, Bojar RM, & Khabbaz KR (2002) Postoperative cardiac tamponade in the modern surgical era. Ann Thorac Surg 74(4):1148-1153.

19. Movva R & Rader DJ (2008) Laboratory assessment of HDL heterogeneity and function. Clin Chem 54(5):788-800.

20. Asztalos BF, Sloop CH, Wong L, & Roheim PS (1993) Two-dimensional electrophoresis of plasma lipoproteins: recognition of new apo A-I-containing subpopulations. Biochim Biophys Acta 1169(3):291-300.

21. Rifkin MR (1978) Identification of the trypanocidal factor in normal human serum: high density lipoprotein. Proc Natl Acad Sci U S A 75(7):3450-3454.

22. Raper J, Fung R, Ghiso J, Nussenzweig V, & Tomlinson S (1999) Characterization of a novel trypanosome lytic factor from human serum. Infect Immun 67(4):1910-1916.

23. Shiflett AM, Bishop JR, Pahwa A, & Hajduk SL (2005) Human high density lipoproteins are platforms for the assembly of multi-component innate immune complexes. J Biol Chem 280(38):32578-32585.

24. Deng J, et al. (2011) Investigating the predictability of essential genes across distantly related organisms using an integrative approach. Nucleic Acids Res 39(3):795-807.

25. Jeong H, Mason SP, Barabasi AL, & Oltvai ZN (2001) Lethality and centrality in protein networks. Nature 411(6833):41-42.

26. Fraser HB, Hirsh AE, Steinmetz LM, Scharfe C, & Feldman MW (2002) Evolutionary rate in the protein interaction network. Science 296(5568):750-752.

27. Wachi S, Yoneda K, & Wu R (2005) Interactome-transcriptome analysis reveals the high centrality of genes differentially expressed in lung cancer tissues. Bioinformatics 21(23):4205-4208.

28. Jonsson PF & Bates PA (2006) Global topological features of cancer proteins in the human interactome. Bioinformatics 22(18):2291-2297.


29. Hartwell LH, Hopfield JJ, Leibler S, & Murray AW (1999) From molecular to modular cell biology. Nature 402(6761 Suppl):C47-52.

30. Eisenberg D, Marcotte EM, Xenarios I, & Yeates TO (2000) Protein function in the post-genomic era. Nature 405(6788):823-826.

31. Ravasz E, Somera AL, Mongru DA, Oltvai ZN, & Barabasi AL (2002) Hierarchical organization of modularity in metabolic networks. Science 297(5586):1551-1555.

32. Chuang HY, Lee E, Liu YT, Lee D, & Ideker T (2007) Network-based classification of breast cancer metastasis. Mol Syst Biol 3:140.

33. Guo Z, et al. (2007) Edge-based scoring and searching method for identifying condition-responsive protein-protein interaction sub-network. Bioinformatics 23(16):2121-2128.

34. Taylor IW, et al. (2009) Dynamic modularity in protein interaction networks predicts breast cancer outcome. Nat Biotechnol 27(2):199-204.

35. Nacu S, Critchley-Thorne R, Lee P, & Holmes S (2007) Gene expression network analysis and applications to immunology. Bioinformatics 23(7):850-858.

36. Critchley-Thorne RJ, et al. (2007) Down-regulation of the interferon signaling pathway in T lymphocytes from patients with metastatic melanoma. PLoS Med 4(5):e176.

37. Chowdhury SA, Nibbe RK, Chance MR, & Koyuturk M (2011) Subnetwork state functions define dysregulated subnetworks in cancer. J Comput Biol 18(3):263-281.

38. Hwang T, Tian Z, Kuang R, & Kocher JP (2008) Learning on weighted hypergraphs to integrate protein interactions and gene expressions for cancer outcome prediction. Proceedings of the 2008 IEEE International Conference on Data Mining, pp 293-302.


39. Dutkowski J & Ideker T (2011) Protein networks as logic functions in development and cancer. PLoS Comput Biol 7(9):e1002180.

40. Caspi R, et al. (2006) MetaCyc: a multiorganism database of metabolic pathways and enzymes. Nucleic Acids Res 34(Database issue):D511-516.

41. Nishimura D (2001) BioCarta. Biotech Software & Internet Report 2(3):117-120.

42. Kanehisa M & Goto S (2000) KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28(1):27-30.

43. Tian L, et al. (2005) Discovering statistically significant pathways in expression profiling studies. Proc Natl Acad Sci U S A 102(38):13544-13549.

44. Gordon T, Castelli WP, Hjortland MC, Kannel WB, & Dawber TR (1977) High density lipoprotein as a protective factor against coronary heart disease. The Framingham Study. Am J Med 62(5):707-714.

45. Kastelein JJ, et al. (2007) Effect of torcetrapib on carotid atherosclerosis in familial hypercholesterolemia. N Engl J Med 356(16):1620-1630.

46. Sawle A, Higgins MK, Olivant MP, & Higgins JA (2002) A rapid single-step centrifugation method for determination of HDL, LDL, and VLDL cholesterol, and TG, and identification of predominant LDL subclass. J Lipid Res 43(2):335-343.

47. van't Hooft F & Havel RJ (1982) Metabolism of apolipoprotein E in plasma high density lipoproteins from normal and cholesterol-fed rats. J Biol Chem 257(18):10996-11001.

48. Akkoyunlu EA (1973) The enumeration of maximal cliques of large graphs. SIAM Journal on Computing 2:1-6.

49. Angles-Cano E, Gris JC, Loyau S, & Schved JF (1993) Familial association of high levels of histidine-rich glycoprotein and plasminogen activator inhibitor-1 with venous thromboembolism. J Lab Clin Med 121(5):646-653.

50. Ashburner M, et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25(1):25-29.

51. Kollman JM, Pandi L, Sawaya MR, Riley M, & Doolittle RF (2009) Crystal structure of human fibrinogen. Biochemistry 48(18):3877-3886.

52. Almogren A, Furtado PB, Sun Z, Perkins SJ, & Kerr MA (2006) Purification, properties and extended solution structure of the complex formed between human immunoglobulin A1 and human serum albumin by scattering and ultracentrifugation. J Mol Biol 356(2):413-431.

53. Spillmann F, Schultheiss HP, Tschope C, & Van Linthout S (2010) High-density lipoprotein-raising strategies: update 2010. Curr Pharm Des 16(13):1517-1530.

54. Degoma EM & Rader DJ (2011) Novel HDL-directed pharmacotherapeutic strategies. Nat Rev Cardiol 8(5):266-277.

55. Roth RB, et al. (2006) Gene expression analyses reveal molecular relationships among 20 regions of the human CNS. Neurogenetics 7(2):67-80.

56. Ma J, Dempsey AA, Stamatiou D, Marshall KW, & Liew CC (2007) Identifying leukocyte gene expression patterns associated with plasma lipid levels in human subjects. Atherosclerosis 191(1):63-72.

57. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, & Yeates TO (1999) Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci U S A 96(8):4285-4288.

58. Gobel U, Sander C, Schneider R, & Valencia A (1994) Correlated mutations and residue contacts in proteins. Proteins 18(4):309-317.

59. Pazos F, Helmer-Citterich M, Ausiello G, & Valencia A (1997) Correlated mutations contain information about protein-protein interaction. J Mol Biol 271(4):511-523.

60. Pazos F & Valencia A (2008) Protein co-evolution, co-adaptation and interactions. EMBO J 27(20):2648-2655.

61. Jansen R, Greenbaum D, & Gerstein M (2002) Relating whole-genome expression data with protein-protein interactions. Genome Res 12(1):37-46.

62. Walhout AJ, et al. (2000) Protein interaction mapping in C. elegans using proteins involved in vulval development. Science 287(5450):116-122.

63. Matthews LR, et al. (2001) Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or "interologs". Genome Res 11(12):2120-2126.

64. Yu H, et al. (2004) Annotation transfer between genomes: protein-protein interologs and protein-DNA regulogs. Genome Res 14(6):1107-1118.

65. Marcotte EM, Pellegrini M, Thompson MJ, Yeates TO, & Eisenberg D (1999) A combined algorithm for genome-wide prediction of protein function. Nature 402(6757):83-86.

66. Jansen R, et al. (2003) A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science 302(5644):449-453.

67. Ge H, Walhout AJ, & Vidal M (2003) Integrating 'omic' information: a bridge between genomics and systems biology. Trends Genet 19(10):551-560.

68. von Mering C, et al. (2002) Comparative assessment of large-scale data sets of protein-protein interactions. Nature 417(6887):399-403.

69. Bork P, et al. (2004) Protein interaction networks from yeast to human. Curr Opin Struct Biol 14(3):292-299.

70. Lu LJ, Xia Y, Paccanaro A, Yu H, & Gerstein M (2005) Assessing the limits of genomic data integration for predicting protein networks. Genome Res 15(7):945-953.

71. Lin N, Wu B, Jansen R, Gerstein M, & Zhao H (2004) Information assessment on predicting protein-protein interactions. BMC Bioinformatics 5:154.

72. Zhang LV, Wong SL, King OD, & Roth FP (2004) Predicting co-complexed protein pairs using genomic and proteomic data integration. BMC Bioinformatics 5:38.

73. Xia Y, Lu LJ, & Gerstein M (2006) Integrated prediction of the helical membrane protein interactome in yeast. J Mol Biol 357(1):339-349.

74. Ge H, Liu Z, Church GM, & Vidal M (2001) Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nat Genet 29(4):482-486.

75. Rhodes DR, et al. (2005) Probabilistic model of the human protein-protein interaction network. Nat Biotechnol 23(8):951-959.

76. Wang H, Azuaje F, et al. (2004) Gene expression correlation and Gene Ontology-based similarity: an assessment of quantitative relationships. Proceedings of the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, pp 25-31.

77. Goh CS & Cohen FE (2002) Co-evolutionary analysis reveals insights into protein-protein interactions. J Mol Biol 324(1):177-192.

78. Pazos F & Valencia A (2002) In silico two-hybrid system for the selection of physically interacting protein pairs. Proteins 47(2):219-227.

79. Subramanian A, et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102(43):15545-15550.

80. Kumar A & Snyder M (2002) Protein complexes take the bait. Nature 415(6868):123-124.

81. Papadopoulos JS & Agarwala R (2007) COBALT: constraint-based alignment tool for multiple protein sequences. Bioinformatics 23(9):1073-1079.

82. Dowell RD, et al. (2010) Genotype to phenotype: a complex problem. Science 328(5977):469.

83. Gibson DG, et al. (2010) Creation of a bacterial cell controlled by a chemically synthesized genome. Science 329(5987):52-56.

84. Pennisi E (2010) Genomics. Synthetic genome brings new life to bacterium. Science 328(5981):958-959.

85. Pucci MJ (2006) Use of genomics to select antibacterial targets. Biochem Pharmacol 71(7):1066-1072.

86. Bruccoleri RE, Dougherty TJ, & Davison DB (1998) Concordance analysis of microbial genomes. Nucleic Acids Res 26(19):4482-4486.

87. Chen Y & Xu D (2005) Understanding protein dispensability through machine-learning analysis of high-throughput data. Bioinformatics 21(5):575-581.

88. Zalacain M, et al. (2003) A global approach to identify novel broad-spectrum antibacterial targets among proteins of unknown function. J Mol Microbiol Biotechnol 6(2):109-126.

89. D'Elia MA, Pereira MP, & Brown ED (2009) Are essential genes really essential? Trends Microbiol 17(10):433-438.

90. Gustafson AM, Snitkin ES, Parker SC, DeLisi C, & Kasif S (2006) Towards the identification of essential genes using targeted genome sequencing and comparative analysis. BMC Genomics 7:265.

91. Seringhaus M, Paccanaro A, Borneman A, Snyder M, & Gerstein M (2006) Predicting essential genes in fungal genomes. Genome Res 16(9):1126-1135.

92. de Berardinis V, et al. (2008) A complete collection of single-gene deletion mutants of Acinetobacter baylyi ADP1. Mol Syst Biol 4:174.

93. Jacobs MA, et al. (2003) Comprehensive transposon mutant library of Pseudomonas aeruginosa. Proc Natl Acad Sci U S A 100(24):14339-14344.

94. Liberati NT, et al. (2006) An ordered, nonredundant library of Pseudomonas aeruginosa strain PA14 transposon insertion mutants. Proc Natl Acad Sci U S A 103(8):2833-2838.

95. Kobayashi K, et al. (2003) Essential Bacillus subtilis genes. Proc Natl Acad Sci U S A 100(8):4678-4683.

96. Barbe V, et al. (2004) Unique features revealed by the genome sequence of Acinetobacter sp. ADP1, a versatile and naturally transformation competent bacterium. Nucleic Acids Res 32(19):5766-5779.

97. Hannay K, Marcotte EM, & Vogel C (2008) Buffering by gene duplicates: an analysis of molecular correlates and evolutionary conservation. BMC Genomics 9:609.

98. Gerdes S, et al. (2006) Essential genes on metabolic maps. Curr Opin Biotechnol 17(5):448-456.

99. Joyce AR, et al. (2006) Experimental and computational assessment of conditionally essential genes in Escherichia coli. J Bacteriol 188(23):8259-8271.

100. Hutchison CA, et al. (1999) Global transposon mutagenesis and a minimal Mycoplasma genome. Science 286(5447):2165-2169.

101. Gerdes SY, et al. (2003) Experimental determination and system level analysis of essential genes in Escherichia coli MG1655. J Bacteriol 185(19):5673-5684.

102. Lamichhane G, et al. (2003) A postgenomic method for predicting essential genes at subsaturation levels of mutagenesis: application to Mycobacterium tuberculosis. Proc Natl Acad Sci U S A 100(12):7213-7218.

103. Salama NR, Shepherd B, & Falkow S (2004) Global transposon mutagenesis and essential gene analysis of Helicobacter pylori. J Bacteriol 186(23):7926-7935.

104. Glass JI, et al. (2006) Essential genes of a minimal bacterium. Proc Natl Acad Sci U S A 103(2):425-430.

105. Suzuki N, et al. (2006) High-throughput transposon mutagenesis of Corynebacterium glutamicum and construction of a single-gene disruptant mutant library. Appl Environ Microbiol 72(5):3750-3755.

106. Gallagher LA, et al. (2007) A comprehensive transposon mutant library of Francisella novicida, a bioweapon surrogate. Proc Natl Acad Sci U S A 104(3):1009-1014.

107. French CT, et al. (2008) Large-scale transposon mutagenesis of Mycoplasma pulmonis. Mol Microbiol 69(1):67-76.

108. Cameron DE, Urbach JM, & Mekalanos JJ (2008) A defined transposon mutant library and its use in identifying motility genes in Vibrio cholerae. Proc Natl Acad Sci U S A 105(25):8736-8741.

109. Langridge GC, et al. (2009) Simultaneous assay of every Salmonella Typhi gene using one million transposon mutants. Genome Res 19(12):2308-2316.

110. Murray GL, et al. (2009) Genome-wide transposon mutagenesis in pathogenic Leptospira species. Infect Immun 77(2):810-816.

111. Molina-Henares MA, et al. (2010) Identification of conditionally essential genes for growth of Pseudomonas putida KT2440 on minimal medium through the screening of a genome-wide mutant library. Environ Microbiol 12(6):1468-1485.

112. Lamichhane G, et al. (2011) Essential metabolites of Mycobacterium tuberculosis and their mimics. MBio 2(1):e00301-00310.

113. Christen B, et al. (2011) The essential genome of a bacterium. Mol Syst Biol 7:528.

114. Berg DE & Howe MM (1989) Mobile DNA (American Society for Microbiology, Washington, D.C.) pp xii, 972 p., [975] p. of plates.

115. Hamer L, DeZwaan TM, Montenegro-Chamorro MV, Frank SA, & Hamer JE (2001) Recent advances in large-scale transposon mutagenesis. Curr Opin Chem Biol 5(1):67-73.

116. Akerley BJ, et al. (2002) A genome-scale analysis for identification of genes required for growth or survival of Haemophilus influenzae. Proc Natl Acad Sci U S A 99(2):966-971.

117. Zhang R & Lin Y (2009) DEG 5.0, a database of essential genes in both prokaryotes and eukaryotes. Nucleic Acids Res 37(Database issue):D455-458.

118. Kato J & Hashimoto M (2007) Construction of consecutive deletions of the Escherichia coli chromosome. Mol Syst Biol 3:132.

119. Gerdes SY, et al. (2002) From genetic footprinting to antimicrobial drug targets: examples in cofactor biosynthetic pathways. J Bacteriol 184(16):4555-4572.

120. Good IJ (1986) Some statistical applications of Poisson's work. Statistical Science 1(2):157-180.

121. Ross SM (1996) Stochastic processes (Wiley, New York) 2nd Ed pp xv, 510 p.

122. Lehmann EL & Casella G (1998) Theory of point estimation (Springer, New York) 2nd Ed pp xxvi, 589 p.

123. Zolman JF (1993) Biostatistics: experimental design and statistical inference (Oxford University Press, New York) pp xv, 343 p.

124. Balakrishnan N, Melas VB, & Ermakov SM (2000) Advances in stochastic simulation methods (Birkhäuser, Boston) pp xxvi, 386 p.

125. Martinez Arias A & Hayward P (2006) Filtering transcriptional noise during development: concepts and mechanisms. Nat Rev Genet 7(1):34-44.

126. Kerszberg M & Wolpert L (2007) Specifying positional information in the embryo: looking beyond morphogens. Cell 130(2):205-209.

127. Lander AD (2007) Morpheus unbound: reimagining the morphogen gradient. Cell 128(2):245-256.

128. Lewis J (2008) From signals to patterns: space, time, and mathematics in developmental biology. Science 322(5900):399-403.

129. Houchmandzadeh B, Wieschaus E, & Leibler S (2002) Establishment of developmental precision and proportions in the early Drosophila embryo. Nature 415(6873):798-802.

130. Lott SE, Kreitman M, Palsson A, Alekseeva E, & Ludwig MZ (2007) Canalization of segmentation and its evolution in Drosophila. Proc Natl Acad Sci U S A 104(26):10926-10931.

131. He F, et al. (2008) Probing intrinsic properties of a robust morphogen gradient in Drosophila. Dev Cell 15:558-567. PMID: 18854140.

132. Manu, et al. (2009) Canalization of gene expression in the Drosophila blastoderm by gap gene cross regulation. PLoS Biol 7:e1000049.

133. Driever W & Nüsslein-Volhard C (1988) A gradient of bicoid protein in Drosophila embryos. Cell 54:83-93.

134. Struhl G, Struhl K, & Macdonald P (1989) The gradient morphogen bicoid is a concentration-dependent transcriptional activator. Cell 57:1259-1273.

135. Ephrussi A & Johnston DS (2004) Seeing is believing. The bicoid morphogen gradient matures. Cell 116(2):143-152.

136. Day SJ & Lawrence PA (2000) Measuring dimensions: the regulation of size and shape. Development 127(14):2977-2987.

137. Patel NH & Lall S (2002) Precision patterning. Nature 415(6873):748-749.

138. Ben-Zvi D, Shilo BZ, Fainsod A, & Barkai N (2008) Scaling of the BMP activation gradient in Xenopus embryos. Nature 453(7199):1205-1211.

139. Howard M & Ten Wolde PR (2005) Finding the center reliably: robust patterns of developmental gene expression. Phys Rev Lett 95(20):208103.

140. Houchmandzadeh B, Wieschaus E, & Leibler S (2005) Precision domain specification in the developing Drosophila embryo. Phys Rev E 72:061920.

141. Aegerter-Wilmsen T, Aegerter CM, & Bisseling T (2005) Model for the robust establishment of precise proportions in the early Drosophila embryo. J Theor Biol 234(1):13-19.

142. McHale P, Rappel WJ, & Levine H (2006) Embryonic pattern scaling achieved by oppositely directed morphogen gradients. Phys Biol 3(2):107-120.

143. Ishihara S & Kaneko K (2006) Turing pattern with proportion preservation. J Theor Biol 238(3):683-693.

144. Bergmann S, et al. (2007) Pre-steady-state decoding of the Bicoid morphogen gradient. PLoS Biol 5(2):e46.

145. Hecht I, Rappel WJ, & Levine H (2009) Determining the scale of the Bicoid morphogen gradient. Proc Natl Acad Sci U S A 106(6):1710-1715.

146. Wolpert L (1969) Positional information and the spatial pattern of cellular differentiation. J Theor Biol 25:1-47.

147. Berleth T, et al. (1988) The role of localization of bicoid RNA in organizing the anterior pattern of the Drosophila embryo. EMBO J 7:1749-1756.

148. Frigerio G, Burri M, Bopp D, Baumgartner S, & Noll M (1986) Structure of the segmentation gene paired and the Drosophila PRD gene set as part of a gene network. Cell 47:735-746.

149. St. Johnston D, Driever W, Berleth T, Richstein S, & Nüsslein-Volhard C (1989) Multiple steps in the localization of bicoid mRNA to the anterior pole of the Drosophila oocyte. Development Suppl.:13-19.

150. Crauk O & Dostatni N (2005) Bicoid determines sharp and precise target gene expression in the Drosophila embryo. Curr Biol 15(21):1888-1898.

151. Spirov A, et al. (2009) Formation of the bicoid morphogen gradient: an mRNA gradient dictates the protein gradient. Development 136(4):605-614.

152. Holloway DM, Harrison LG, Kosman D, Vanario-Alonso CE, & Spirov AV (2006) Analysis of pattern precision shows that Drosophila segmentation develops substantial independence from gradients of maternal gene products. Dev Dyn 235(11):2949-2960.

153. Gregor T, Wieschaus EF, McGregor AP, Bialek W, & Tank DW (2007) Stability and nuclear dynamics of the bicoid morphogen gradient. Cell 130(1):141-152.

154. Driever W & Nüsslein-Volhard C (1988) The bicoid protein determines position in the Drosophila embryo in a concentration dependent manner. Cell 54:95-104.

155. Ferrandon D, Elphick L, Nüsslein-Volhard C, & St Johnston D (1994) Staufen protein associates with the 3'UTR of bicoid mRNA to form particles that move in a microtubule-dependent manner. Cell 79(7):1221-1232.

156. Lucchetta EM, Lee JH, Fu LA, Patel NH, & Ismagilov RF (2005) Dynamics of Drosophila embryonic patterning network perturbed in space and time using microfluidics. Nature 434(7037):1134-1138.

157. Lucchetta EM, Vincent ME, & Ismagilov RF (2008) A precise Bicoid gradient is nonessential during cycles 11-13 for precise patterning in the Drosophila blastoderm. PLoS One 3(11):e3651.
