Deconvolution of PPI Networks: Approximation Algorithms and Optimization Techniques

Dong Hyun Kim

School of Computer Science McGill University Montréal

May 2012

A thesis submitted to McGill University in partial fulfilment of the requirements for the degree of Doctor of Philosophy

c Dong Hyun Kim, 2012 ii Contents

Abstract vi

Résumé ix

Declaration xi

Acknowledgments xii

List of Figures xiv

1 Introduction 1 1.1 Experimental Techniques ...... 3 1.1.1 Yeast-Two Hybrid (Y2H) Method...... 4

1.1.2 Affinity Purification / Mass Spectrometry (AP-MS) ...... 6 1.1.3 Indirect Approaches ...... 9 1.2 Computational Analyses of PPI Data ...... 11 1.2.1 Graph Models for PPI Networks ...... 11 1.2.2 Identification of -Protein Interactions ...... 14 1.2.3 Identification of Protein Complexes ...... 16 1.3 AP-MS based PPI Networks ...... 19 1.3.1 Protein Complexes in Yeast ...... 20 1.3.2 Modularity in Yeast Protein Complexes ...... 21 1.4 Contributions and Thesis Outline ...... 23

2 Approximation Algorithms and Optimization Techniques 27

iii iv

2.1 NP-hard Optimization Problems ...... 27 2.2 Graph Parameters ...... 29 2.3 Approximation Algorithms ...... 32 2.3.1 Lower-bounding the Optimum ...... 33 2.3.2 Polynomial Time Approximation Schemes ...... 35 2.3.3 LP-based Algorithms ...... 39 2.4 Genetic Algorithms ...... 41

3 Protein Quantification with Shared Peptides 43 3.1 Problem Formulation for Protein Quantification ...... 46 3.2 Hardness of Protein Quantification Problems ...... 50 3.3 Approximation Algorithms for Protein Quantification ...... 53 3.3.1 An Algorithm for Multicover ...... 53 3.3.2 An Algorithm for Minimum Protein Types ...... 54 3.3.3 An Algorithm for Minimum Uniform Error ...... 56 3.3.4 An Algorithm for Minimum Error Sum ...... 62 3.4 Experimental Results ...... 63 3.4.1 Performance on Simulated Data ...... 63 3.4.2 Protein Quantification from AP-MS Data ...... 66 3.5 Discussions ...... 72 3.6 Bibliographic Notes ...... 74

4 Direct PPI Networks from AP-MS Data 75 4.1 Background ...... 76 4.2 Mathematical Modelling and Problem Formulation ...... 77 4.2.1 A Probabilistic Model for AP-MS Data ...... 78 4.2.2 Problem Formulation ...... 80 4.3 Algorithm for A-DIGCOM ...... 82 4.3.1 Identification of Weakly Connected Regions ...... 84 4.3.2 Identification of Densely Connected Regions ...... 90 v

4.3.3 Cut-based Genetic Algorithm ...... 92 4.3.4 Restricting the Solution Space for GA ...... 95 4.4 Experimental Results ...... 96 4.4.1 Randomized Hill-climbing Algorithm ...... 97

4.4.2 Choosing a Tolerance Level δ and Handling Numerical Errors . . . . 97 4.4.3 Generation of Scale-free Networks ...... 98 4.4.4 Calculation of Connectivity Matrix from Peptide Counts ...... 98 4.4.5 Model Validation ...... 99 4.4.6 Accuracy of the Algorithm ...... 100 4.4.7 Inferring Direct Interactions from AP-MS Experimental Data . . . . . 105 4.5 Discussions ...... 106 4.6 Bibliographic Notes ...... 110

5 Hypergraph Modelling of PPI Data 111 5.1 Preliminaries ...... 113 5.2 Clique Cover for Graphs with Bounded Treewidth ...... 116 5.2.1 Running Time of Treewidth-based Algorithm ...... 119 5.2.2 Modifications for Clique Partition ...... 120 5.3 Planar Clique Cover ...... 120 5.3.1 Clique Cover for Planar Graphs with Bounded Branchwidth ...... 121 5.3.2 Modifications for Clique Partition ...... 123 5.3.3 Baker’s Technique on Planar Graphs ...... 123 5.4 Experimental Results ...... 126 5.4.1 Simulated and Biological Networks ...... 127 5.4.2 Performance of the Treewidth and Branchwidth-based Algorithms . 128 5.4.3 Performance of PTAS for Planar Graphs ...... 130 5.4.4 Clique Cover in Biological Networks ...... 130 5.5 Discussions ...... 131 5.6 Bibliographic Notes ...... 132 vi

6 Conclusion and Future Directions 133

Bibliography 137 Abstract

Understanding the organization of protein-protein interactions (PPIs) as a complex network is one of the main pursuits in proteomics today. With the help of high-throughput experimen- tal techniques, a large amount of PPI data has become available, providing us a rough picture of how interact in biological systems. One of the leading technologies for identifying protein interactions is affinity-purification followed by mass spectrometry (AP-MS). While the AP-MS method provides the ability to detect protein interactions at biologically reasonable ex- pression levels, this technique still suffers from poor accuracy as well as lack of sound approach to interpret the obtained interaction data.

In this thesis, we look for sources of systematic errors and limitations of the data from AP- MS experiments, and propose several approaches for improvements. In particular, we identify various problems that arise within the experimental pipeline, and propose combinatorial algo- rithms developed using the theory of approximation algorithms and discrete mathematics. The first part of the thesis deals with quantification of proteins from MS-based experiments. Exist- ing approaches for protein quantification often use each detected peptide as an indicator for the originating protein. These approaches ignore peptides that belong to more than one protein family. We attack this problem of protein quantification by taking these shared peptides into con- sideration, and propose a framework for estimating protein abundance via linear programming.

In AP-MS data, the identified protein interactions contain a large number of indirect inter- actions as artifacts from a series of simultaneous physical interactions. The second part of this thesis studies the problem of distinguishing direct physical interactions from indirect interac- tions. To do this, we first propose a probabilistic graph model for the PPI data, and design a combinatorial algorithm suited for graphs with underlying structure that is evident in PPI net- works.

While the traditional model for PPI networks is a binary graph representing pairwise inter- actions, a large number of interactions involve more than two interaction partners. Such col- lections of proteins interacting concertedly are known as protein complexes, and various ap- proaches have been proposed to identify the complexes in the network. When these complexes are overlapping, however, the existing complex detection methods often fail to identify the con- stituent complexes. Taking one step further on this line of research, the last part of this thesis discusses the problem of modelling PPI networks as hypergraphs by studying the clique cover problem on sparse networks.

For each problem discussed throughout the thesis, we obtain either an exact algorithm, an algorithm with provably good guarantee on the output quality, or a heuristic with efficient run-

vii viii

ning time. Furthermore, each of the proposed algorithms is empirically tested against biological data as well as simulated data, in order to validate both computational efficiency and biological soundness. Résumé

Comprendre l’organisation des interactions protéine-protéine (IPP) en tant que réseau com- plexe est un des plus grands problèmes de la protéomique moderne. Avec l’aide de techniques expérimentales à haut débit, une grand quantité de données de IPP est devenue disponible, nous procurant ainsi une image approximative du fonctionnement des interactions entre protéines dans des systèmes biologiques. Une des technologies de fine pointe pour identifier des interac- tions protéiques est la purification d’affinité suivie de la spectrométrie de masse (PA-SM). Même si la méthode de PA-SM nous permet de détecter des interactions de protéines à des niveaux d’expression raisonnables biologiquement, cette technique souffre encore d’une précision défi- ciente et d’un manque d’une approche saine pour interpréter les résultats d’interactions obtenus par celle-ci.

Dans cette thèse, nous cherchons des sources d’erreurs systématiques et des limites des données provenant d’expériences PA-SM et nous proposons plusieurs approches amenant des améliorations. En particulier, nous identifions divers problèmes présents dans la procédure ex- périmentale et proposons des algorithmes combinatoires développés en utilisant la théorie des algorithmes d’approximation et des mathématiques discrètes. La première partie de la thèse étudie la quantification des protéines provenant d’une expérience basée sur la spectrométrie de masse. Les approches existantes pour la quantification de protéines utilisent souvent chaque peptide détecté comme un indicateur de la protéine d’origine. Ces approches ignorent les pep- tides qui appartiennent à plus d’une protéines. Nous attaquons ce problème de quantification de protéines en prenant ces peptides partagés en considération et proposons un cadre de travail pour estimer l’abondance des protéines via la programmation linéaire.

Les interactions de protéines identifiées dans les données de PA-SM contiennent un grand nombre d’interactions indirectes qui sont des artéfacts d’une série d’interactions physiques si- multanées. La seconde partie de cette thèse étudie le problème de distinction des interactions physiques directes de celles qui sont indirectes. Pour ce faire, nous proposons premièrement un modèle d’un graphe probabiliste pour les données de IPPs et concevons un algorithme combi- natoire adapté aux graphes ayant des propriétés observées dans de vrais réseaux de IPPs.

Malgré le fait que le modèle traditionnel des réseaux de IPPs est un graphe binaire représen- tant les interactions par paires, un grand nombre d’interactions impliquent plus de deux parte- naires d’interaction. De tels groupes de protéines interagissant de concert se nomment des com- plexes de protéines. Plusieurs approches ont déjà été proposées afin d’identifier les complexes dans un réseau. Cependant, lorsque ces complexes se chevauchent, les méthodes existantes échouent. La dernière partie de cette thèse discute du problème de modélisation des réseaux de IPPs comme des hypergraphes en étudiant le problème de couverture par cliques sur des réseaux

ix x

clairsemés.

Pour chaque problème discuté au cours de cette thèse, nous obtenons un algorithme exact, un algorithme avec de bonnes garanties prouvables sur la qualité de la sortie, ou une heuristique avec un temps d’exécution efficient. De plus, chaque algorithme proposé est testé empirique- ment avec des données biologiques et simulées dans le but de valider l’efficience computation- nelle et la signification biologique. Declaration

This thesis contains no material which has been accepted in whole, or in part, for any other de- gree or diploma. Except for results whose authors are mentioned, materials presented in Chap- ters 3, 4, and 5 of this thesis is an original contribution to knowledge.

In Section 3.4 of Chapter 3, the biological data for experimental studies was generously pro- vided by Benoit Coulombe’s lab at Institut de recherches cliniques de Montréal (IRCM).

In Section 4.3.1 of Chapter 4, the work presented in Theorems 4.1 and 4.3 are done jointly with Ashish Sabharwal and my thesis advisors.

All other parts were done independently under the supervision of my thesis advisors, Adrian Vetta and Mathieu Blanchette.

xi xii Acknowledgements

During my first visit to Montréal in February 2005, my old advisor Sue Whitesides told me a number of reasons why McGill is one of the best places to study computer science and discrete maths. Seven years later, I can acknowledge every one of those reasons, and possibly double the list with my own.

First and foremost, I express my deepest gratitude to my thesis advisors, Mathieu Blanchette and Adrian Vetta, for their tireless efforts to guide my research, and making my life as a graduate student ever so enjoyable. Mathieu has introduced me to the field of protein-protein interac- tions, invited me to his Barbados workshop on the subject, was always ready to suggest the next steps in research, and taught me all about what makes good science. Adrian has introduced me to the field of approximation algorithms, spent countless hours discussing various lines of at- tacks to problems, and has provided me with seemingly infinite amount of resources in discrete mathematics. Without their guidance, this thesis, or my own scientific becoming, would have been impossible.

I thank my collaborators and colleagues: Ashish Sabharwal’s visit to Montreal resulted in the first part of work presented in Chapter 4; Mathieu L.-A., Javier, Mathieu R. and Javad taught me a thing or two about proteomics and machine learning; human PPI data used in Chapter 3 was generously provided by Benoit Coulombe’s lab at Institut de recherches cliniques de Montréal (IRCM).

I acknowledge NSERC, CIHR, Walter C. Sumner Fellowship, and my advisors’ generous fund- ing for financial support.

Special thanks to the “family” for all the laughters over pints: js, mk, sh, jm, sc, ij, sy, sj, jy, the list goes on. And finally, to my parents, for their love and patience over the years. Thank you.

xiii xiv List of Figures

1.1 The yeast PPI network ...... 2 1.2 A schematic of Y2H experiment...... 5 1.3 Pipeline of TAP experiment ...... 7 1.4 Protein interactions from the perspective of Y2H and AP-MS ...... 9 1.5 Degree distribution of yeast PPI networks ...... 12

2.1 Tree decomposition of a graph ...... 30 2.2 Branch decomposition of a graph ...... 31 2.3 A characteristics matrix and its perfect phylogenetic tree ...... 35

3.1 Bipartite graph model from peptide-protein relationships ...... 47 3.2 Robustness of the algorithms for protein quantification ...... 65 3.3 Peptide-protein graph from an AP-MS experiment (CDK9) ...... 68 3.4 Peptide-protein graph from an AP-MS experiment (POLR2A) ...... 69 3.5 Peptide-protein graph from an AP-MS experiment (RPAP3) ...... 70 3.6 Result of our algorithm on CDK9 dataset ...... 71 3.7 Result of our algorithm on RPAP3 dataset ...... 72

4.1 A schematic for indirect interactions in AP-MS data ...... 79 4.2 A direct interaction network and its connectivity matrix ...... 81 4.3 The outcome of our 3 phase algorithm for direct PPI network ...... 83 4.4 Comparison of results from our direct PPI data against Y2H interaction network from Yu et al...... 108

5.1 Node types in nice tree decomposition ...... 116 5.2 Sphere-cut decomposition ...... 122 5.3 Planar clique cover using Baker’s technique ...... 125

xv xvi

5.4 Performance of the clique cover algorithm for graphs with bounded treewidth129 5.5 Performance comparison of treewidth-based algorithm vs. branchwidth- based algorithm ...... 130 Chapter 1

Introduction

Upon completion of the Project1, one of the compelling topics in the biological sciences is the large-scale study of proteins encoded by the genes – a field re- ferred to as proteomics. Indeed, it is the proteins that execute the functions programmed in the genes, and analyzing the proteome will thus lead us to a better understanding of how cellular processes work in living organisms.

While the genome is considered as a long sequence of DNA, proteome is often re- garded as a much more complex system. Because cellular processes are caused by pro- teins interacting concertedly within the cell, the proteins in the proteome form a large network called protein-protein interaction (PPI) network, sometimes referred to as the interactome. As a result, both structural and functional analyses of the proteome often require detailed examinations of the PPI network. Efforts to construct the PPI network have begun by developing various experimental techniques to identify the interactions. Using these technologies, Saccharomyees cerevisiae (the yeast) was shown to contain approximately 6000 proteins, and over 78,000 interactions have now been identified. To give an example, Figure 1.1 depicts a small portion of the yeast PPI network, using only

1 An international scientific research program whose goal was to sequence the DNA, and iden- tify the genes of human genome; began in 1990, completed in 2003. (http://www.ornl.gov/sci/ techresources/Human_Genome/home.shtml)

1 2 Chapter 1. Introduction

Figure 1.1: The yeast PPI network constructed using Cytoscape software (http://www.cytoscape. org) with high confidence PPI data from Krogan et al. [93] the highest confidence interaction data from Krogan et al. [93]. Humans, more com- plex organisms as we are, are expected to have approximately 25,000 proteins sharing as many as 650,000 interactions [130]. Such sheer volume of data makes it intractable to analyze manually, and thus computational analyses have become essential to deeper understanding of the proteome.

More recently, it has been revealed that many of these interactions may occur only when three or more specified proteins are all present at the site of an interaction. Such a collection of proteins interacting concertedly is called a protein complex, and this the- ory around protein complexes opened a new set of computational problems, such as the identification and functional annotation of protein complexes. To make things even more interesting, many of these interactions depend on the phase of the cell develop- ment cycle during which the interaction takes place, as well as the localization within the cell [80, 97]. Consequently, such temporal and spatial dependence makes the struc- ture of PPI networks much more dynamic than originally expected. 1.1. Experimental Techniques 3

Due to the natural structure of the proteome as a network, graph theory has played a crucial role in computational analyses of PPI networks. As shall be demonstrated later, however, it is often difficult to formulate an optimization problem that correctly cap- tures all the properties of PPI networks, and even when a clean optimization problem is formulated, the problem often turns out to be computationally intractable. In many of these cases, heuristic algorithms targeting local optima have been popular, and very few algorithms with provable guarantee of output quality have been proposed. As such, clean mathematical models and sound algorithms remain in high demand in the field of protein-protein interactions.

This thesis and its contributions can be divided into three parts; each part addresses a particular challenge related to the analysis of PPI networks, and provides a novel ap- proach to reduce errors and noise present in the data. The principal results of this thesis will be outlined in Section 1.4. Prior to that, however, we first give an introduction to the field of protein-protein interactions from the perspective of bioinformatics and dis- crete mathematics. This introductory chapter consists of three parts: in Section 1.1, we give a brief survey on various experimental techniques for identifying protein interac- tions. Section 1.2 then discusses computational approaches to analyze the datasets pro- duced by those experimental techniques; one of the leading technologies that is preva- lent in the literature is affinity purification followed by mass spectrometry (AP-MS). In Section 1.3, we review studies on PPI networks produced by AP-MS experiments, and point out the limitations of existing computational approaches, which will then lead us to the topics of this thesis, namely: (1) protein quantification; (2) predicting direct inter- action network; and (3) hypergraph modelling of PPI networks.

1.1 Experimental Techniques

In this section, we introduce several medium to high throughput experimental meth- ods that have been proposed to identify protein interactions, and discuss the nature 4 Chapter 1. Introduction

and the quality of the data obtained by each technique. Each technique makes use of different mechanisms to identify the interactions, and bears its own advantages and dis- advantages. For example, some techniques identify direct, physical interactions while others are designed to find indirect, functional relationships between proteins. There- fore, understanding the intrinsic differences between the datasets from various experi- mental techniques will allow us to realistically model the given data, and devise highly customized algorithms for the formulated problems.

1.1.1 Yeast-Two Hybrid (Y2H) Method.

Yeast two hybrid (Y2H) is an in vivo2 technique to detect PPIs by testing physical in- teractions between two proteins [49, 75, 137]. It was discovered that transcription fac- tors found in eukaryotic organisms have two distinct domains: (1) a binding domain

(BD), and (2) an activating domain (AD) [49]. The binding domain is a module respon- sible for binding to a promoter DNA sequence while the activating domain activates the transcription. The transcription is inactivated while the two domains are far from each other, but it can be restored when a binding domain is in close proximity with an acti- vating domain.

A Y2H experiment starts by fusing the binding domain into a protein X (known as the bait), and the activating domain into a protein Y (known as the prey). If the bait X and the prey Y interact physically, the activating domain is in close proximity to the binding domain. The binding domain then binds to upstream activating sequence (UAS) of a promoter, and thus activates the transcription of the reporter gene3. Figure 1.2 shows a schematic representation of a Y2H system. To do this experiment in a high through- put manner, either a matrix of prey clones or a library of random cDNA fragments are

2 An experiment is said to be in vivo (Latin for “within the living”) when the subject is a living organism as opposed to in vitro, a controlled environment. 3 In order to determine whether a gene is expressed, a reporter gene (which is easily identifiable when expressed) is attached to a regulatory sequence of the gene of interest. Commonly used reporter genes often induce visually identifiable characteristics, e.g. LacZ, GFP,etc. 1.1. Experimental Techniques 5

Y AD X Mediator

BD UAS

Transcription of Reporter Gene

Figure 1.2: A Y2H experiment. Protein X (the bait) is fused with a BD that binds to the upstream activating sequence (UAS) of a promoter. The bait interacts with Protein Y (the prey) that is fused with an AD, activating the transcription of a reporter gene via the mediator complex.

used [11, 50, 142]: In the matrix approach, a matrix of prey clones is mated with baits, and the interacting prey proteins are identified by the expression of a reporter gene and the position on the matrix. In the library approach, each bait is screened against a li- brary of random cDNA fragments, and the interaction partners are identified by DNA sequencing.

One advantage of a Y2H experiment is that it allows us to identify physical (rather than functional) associations between interaction partners. Furthermore, being an in vivo high throughput technique, Y2H produces large scale PPI data that potentially in- cludes even the weaker, transient interactions. As such, the Y2H PPI data provide us a comprehensive snapshot of the proteome. However, because each interaction must occur between the proteins fused with the bait-prey pair, the interactions identified by Y2H experiments are restricted to binary interactions. There are other limitations of Y2H systems: proteins that initiate transcription on their own cannot be targeted since they would result in expression of the reporter gene regardless of the presence of bait proteins; the protein fusion may cause changes in the structure of the protein, and the use of such bait proteins may result in incorrect interaction partners. Such a non-native environment causes high false positive ratios for Y2H experiments. To rectify these is- sues for Y2H experiments, many in silico4 methods have been proposed to post-process Y2H experimental data (see Section 1.2.2). With the help of these computational analy-

4 Computational analyses are often called in silico methods as opposed to in vivo or in vitro. 6 Chapter 1. Introduction

sis tools, Y2H experiments have been the most widely used approach to obtain physical interaction data.

Protein-fragment Complementation Assay (PCA). Another similar technique has been proposed by Michnick [54, 103, 134]: Protein-fragment Complementation Assay (PCA) is a technique in which fragments of a reporter protein can be separately expressed with proteins (bait and prey) that are to interact with each other. In this method, the tar- geted proteins (bait and prey) are fused with incomplete fragments of a third protein (reporter), and are separately expressed in vivo. The interaction between the bait and the prey brings the fragments of the reporter in close proximity, allowing the reporter to reform, and then to be identified. While similar to Y2H experiments, PCA experiments can test protein interactions at various subcellular compartments within a pathway of interest in a quantitative manner, making it desirable for drug discovery purposes. Typ- ically, PCA detects many interactions for membrane proteins whereas Y2H identifies many interactions for nuclear proteins [78].

1.1.2 Affinity Purification / Mass Spectrometry (AP-MS)

Affinity Purification followed by Mass Spectrometry (AP-MS) [36, 58, 93, 119], some- times referred to as tandem affinity purification (TAP), is a technique that allows us to determine co-complex membership of protein interactions. This method is a two step process consisting of the purification of protein complexes, and the identification of the interaction partners via mass spectrometry.

Affinity purification. A TAP tag consists of a calmodulin binding peptide (CBP) fol- lowed by a tobacco etch virus (TEV) protease cleavage site and the IgG binding domains of Staphylococus protein A [118, 119]. The open reading frame of the target protein 1.1. Experimental Techniques 7

Calmodulin binding Protein A peptide TEV protease cleavage site + Bait

Cell extract TAP tag

IgG 1st Affinity Contaminants Bait beads Column

Cut at TEV protease Prey proteins cleavage

Calmodulin 2nd Affinity Bait beads Column

natively formed Bait protein complex

Figure 1.3: Pipeline of TAP experiments. At the first affinity column, the bait protein is cut at the TEV protease cleavage site after washing. At the second affinity column, the bait-prey pairs are pulled down by Calmodulin beads, and the resulting eluate form native complexes which can then be identified using gel electrophoresis and mass spectrometry.

(known as the bait) is fused with the TAP tag at the end, and is expressed in vivo within the cells of the host organism (e.g. yeast or human). Once the AP-tagged bait protein is expressed, it can interact with other proteins (known as preys), and form protein com- plexes. These protein complexes go through two purification processes: Initially, the protein A is binding tightly to an IgG matrix. After washing out the contaminants, the link between the protein A and IgG matrix is released by the protease. The eluate of this step is then incubated with calmodulin-coated beads, and after washing for the second time, the targeted protein complexes are released. See Figure 1.3 for a schematic repre- sentation of the AP process. 8 Chapter 1. Introduction

Mass spectrometry. The result of the purification processes is the target protein com- plexes formed by the interaction partners of the bait protein. These protein complexes go through gel electrophoresis, resulting in component peptide fragments that can be identified using mass spectrometers.

Mass spectrometers then read in the peptide fragments, and produce ions with charges based on their masses. Polypeptide sequences are then identified using their mass-to- charge ratios. Various methods have been proposed to convert the peptide molecules into ions in the gas phase using electrospray ionization (ESI) [144] and matrix assisted laser desorption ionionzation (MALDI) [85, 116]. Furthermore, there are several algo- rithms to analyze the mass spectra produced by MS and either identify the peptides present, or even quantify the peptide abundance [59, 115, 135].

Due to its ability to perform at biologically reasonable expression levels in human cells, as well as the ability to detect protein complexes with fewer false positives [36], AP-MS approaches have become increasingly popular. One caveat of this approach is, however, that a significant number of the co-purified prey proteins are in fact indirect interaction partners of the bait protein. This is an artifact of chains of binary interac- tions that occur simultaneously, and AP-MS cannot distinguish such indirect interac- tions from direct physical interactions. For example, Figure 1.4 shows how the same complex can be perceived differently by different experimental techniques. In Chap- ter 4 of this thesis, we discuss the problem of identifying direct physical interactions from AP-MS data.

Added to the difficulty of interpretation from indirect interactions, AP-MS carries a high chance of capturing contaminants at the AP level: gel contamination, nonspecific binding to TAP columns, etc. To reduce the false positives from these contaminants, several computational approaches have been proposed [124, 128]. Some of these ap- proaches make use of the network topology to assign a purification score to each hit [35]. On the other hand, Lavallée-Adam et al. [95] recently provided a Bayesian model specif- ically for contaminants to directly identify false positive interactions. 1.1. Experimental Techniques 9

X X X X Actual protein Y Z Y Z interaction Y Z

X X X

Binary interaction methods (Y2H, PCA) Y Z Y Z Y Z

X X X Co-complex membership methods (AP-MS) Y Z Y Z Y Z

Figure 1.4: Different types of protein interactions among three proteins, and how they are viewed in dif- ferent experimental techniques (assuming all three proteins have been tagged as a bait). Note that bi- nary testing methods (e.g. Y2H and PCA) cannot distinguish the last two types of interactions, while co-complex membership testing approaches (e.g. AP-MS) identify the first two types of interactions as the same topology.

1.1.3 Indirect Approaches

Aside from the methods described above, there are indirect approaches to associate pro- teins with functional interactions.

Correlated mRNA expression (Synexpression). Genes can be partitioned into distinct groups depending on the mRNA levels measured under a variety of different cellular conditions. Then, the genes in the same partition are enriched to encode proteins that may interact with each other [44]. Though this technique gives a broad coverage5, the accuracy of the data is poor as it is difficult to infer direct physical (or even functional) interactions solely from gene expression levels.

5 Coverage refers to (correctly identified interactions) / (total number of true interactions). Accuracy, on the other hand, refers to (correctly identified interactions)/(number of identified interactions) 10 Chapter 1. Introduction

Method Interaction Type Y2H physical, binary PCA physical, binary AP-MS physical, complex Synexpression functional Genetic interactions functional Genome analysis functional

Table 1.1: Different techniques to identify protein-protein interactions, and types of interactions captured by each technique.

Genetic interactions. If two nonessential genes cause lethality when simultaneously knocked out, they are often functionally associated, and thus their encoded proteins may interact [12, 148]. This technique allows pairwise detection of functionally associ- ated interactions.

Genome analysis. Some protein interactions can be detected via computational meth- ods [112]. For example, (1) interacting proteins often show similar phylogenetic pro- files,6 and (2) seemingly unrelated genes are sometimes fused together into one polypep- tide chain. Though fast and inexpensive, these computational methods rely heavily on the existing data such as phylogenetic trees, and orthology7 between proteins which are prone to biases towards well-studied proteins.

Observe that different techniques have different goals in detecting interactions. While Y2H, PCA, and AP-MS methods are designed to discover physical bindings between pro- teins, the others seek to predict functional associations between entities such as tran- scriptional regulators and their associated pathways. The different coverage and ac- curacy of the data causes each technique to produce a unique distribution of interac- tions [141]. Consequently, when comparing or merging PPI data from different experi- mental techniques, one must carefully take these intrinsic differences into account. Ta- ble 1.1 summarizes different aspects of the methods discussed in this section.

6 Across a number of species, each gene can be checked for presence, resulting in a binary vector called a phylogenetic profile. If two proteins interact in the same biological pathway, the corresponding genes would show similar phylogenetic profiles. 7 Homologous sequences are orthologous if they were separated by a speciation event, as opposed to paralogous if caused by a gene duplication event. 1.2. Computational Analyses of PPI Data 11

1.2 Computational Analyses of PPI Data

The traditional way to model PPI networks is to construct a combinatorial graph where each node represents a protein, and two nodes are joined by an edge if and only if there is an interaction between the corresponding proteins. While this classical model captures the most basic information about the interactome, it does not accommodate various sources of errors that are common in many experimental datasets.

On the other hand, protein complexes are hidden away within the constructed (bi- nary) PPI network. Discovering these complexes would allow us to model the inter- actome as a more complex system. Therefore, further post-experimental analyses are imperative to obtain a proper view of PPI networks. In this section, we review compu- tational methods for post-processing experimental PPI data, and we discuss how such approaches can help characterize unknown properties of the interactome.

1.2.1 Graph Models for PPI Networks

One approach to the analysis of PPI networks is looking at topological properties of the networks. While studying the topological structure of PPI networks may not immedi- ately lead us to biological discoveries, understanding the characteristics of the networks is typically a crucial step toward modelling real world networks.

Local Structure Models

Network motifs. Network motifs are patterns of interconnections (induced subgraphs) that occur in a given biological network much more often than they would in random networks. Shen-Orr et al. [127] have shown that the transcriptional regulatory network of Escherichia coli contains three highly frequent motifs, each with a distinct function in gene expression. This suggests that identifying different motifs in a PPI network would help us describe different functional protein groups within the network and, further- 12 Chapter 1. Introduction

Figure 1.5: Degree distribution of yeast PPI network of 1210 proteins with 2357 interactions [93]. On a log log plot, the distribution fits a power law distribution with y ax γ where a 874.87 and 1.809. = − = γ = − more, determine the function of uncharacterized proteins.

Graphlets. A similar notion of local structure called graphlets has been proposed by

Przulj et al. [117]. Graphlets are all possible simple graphs over 3 5 vertices (up to − isomorphism). In order to analyze the global structure of PPI networks, Przulj et al. looked at the frequency distribution of each graphlet appearing in the PPI networks, and compared it against other global model networks such as Erdös-Renyi random graphs, scale-free networks, and geometric random networks (see below). While network mo- tifs for a given network capture the most significant local structures within the network, this bottom-up approach of graphlet distribution provides a metric to compare different networks. More specifically, the distribution of each graphlet measured by the relative frequency of graphlets allows us to analyze networks of different sizes. This is particu- larly useful when PPI networks are compared to various graph models (see next section) whose network sizes may vary. 1.2. Computational Analyses of PPI Data 13

Global Structure Models

Scale-free networks. In many real world networks, including PPI networks, the degree distribution of the nodes follows the power law P k k γ; that is, there are many more ( ) − ≈ nodes with a small number of neighbours than nodes with a very large number of neigh- bours. Networks that exhibit such a degree distribution are called scale-free [10], and this property has profound effects on real world networks. For example, scale-free networks are resistant to random failures on the nodes due to the small number of high-degree hubs – there are many more low-degree nodes which are equally likely to fail. This ro- bustness can also be observed in PPI networks, and the connectivity of a node can be correlated to the essentiality of the corresponding protein. For example, studies on the yeast PPI network [79] have shown that:

1. 93% of the yeast proteins have degree at most 5, but only 21% of these caused lethality when deleted;

2. only 0.7% of the proteins had degree at least 15, but 62% of these are essential.

These results suggest that non-essential proteins, whose disruption is non-lethal, tend to have lower degree than essential proteins. Moreover, proteins in the same func- tional group tend to show similar number of interacting partners. Such structure-function relationships can be exploited for the prediction of protein interactions or complexes that are yet unknown, as we discussed in Section 1.1.3.

Geometric random networks. A geometric random network [113] is a graph model where the nodes correspond to randomly distributed points in a metric space, and nearby nodes are joined by an edge. Przulj et al. [117] used the graphlet distributions to charac- terize PPI networks as a geometric random network. In particular, upon comparing the yeast and fruitfly PPI networks against various random graph models (including scale- free graphs), the graphlet distributions of these PPI networks were closer to that of ge- ometric random networks than any other model. This result suggests that the geomet- 14 Chapter 1. Introduction

ric random network may be a better model for PPI networks than the widely accepted scale-free networks.

1.2.2 Identification of Protein-Protein Interactions

The PPI data from high throughput experiments such as Y2H, PCA or AP-MS contain a significant amount of noise, and such high false positive/negative ratios deteriorate the overall confidence level of identified interactions. These high throughput methods have all been used to perform a large-scale study of yeast proteome [58, 70, 93, 134, 149], but comparative assessments of those datasets continuously reveal that a significant portion of identified interactions in one study is unique to that dataset and not observed in others [78, 141]. Therefore, in silico approaches to improve the confidence level of PPI data are an attractive topic of research.

Topology-based PPI detection. Inevitably, the PPI data from high throughput experi- ments is used as the initial input. Since the global view of the currently known interac- tome may be heavily biased due to incomplete experiments, several authors often resort to examining the local topological structure. For example, Saito et al. [122, 123] devel- oped a measure called interaction generalities (IG1, IG2) which considers the restricted neighbourhood for each protein in the network. The main idea behind this measure is that if a protein interacts with many interaction partners but these interaction part- ners exhibit no further interactions among themselves then these interactions are more likely to be false positives. This idea works particularly well for PPI data from Y2H exper- iments, since some “sticky” proteins in the Y2H assays have a tendency to turn on the positive signals by themselves, irrespective of their interaction partners.

More recently, two other measures have been proposed by Chen et al. [27, 26], and Pei and Zhang [111]. The method of Chen et al., called Interaction Reliability by Alter- native Path (IRAP) [27], considers alternative paths between two interacting partners. If a pair of proteins exhibit a direct interaction, it is likely that there are alternative se- 1.2. Computational Analyses of PPI Data 15

quences of interactions that join the two proteins. Hence, IRAP looks for the shortest (measured by distance) such alternative path for each pair of proteins, and the final confidence level is determined by the strength of the discovered alternative path.

On the other hand, Pei and Zhang [111] have shown that, by considering all alterna- tive paths of different lengths, a better result can be achieved. For each pair of vertices, they consider all paths of length k > 1. Then, the proposed measure called PathRatio is computed as a weighted sum of all the paths between two vertices. Since enumerat- ing all paths between two vertices is computationally hard, a rough estimate is obtained by only computing for small values of k . Using the PPI data from various sources, the PathRatio method achieved better results in functional homogeneity, localization ho- mogeneity, and gene expression distance when compared against a simulation using IRAP. On the other hand, the iterative hill-climbing version of IRAP, called IRAP 26 , ∗ [ ] showed similar performance to PathRatio.

While these topology-based methods attempt to assign confidence levels to inter- actions, the resulting PPI networks still contain high ratios of false positives and false negatives. This may be due to the fact that the models used in these methods do not truly reflect the nature of the experimental PPI data – for example, the input PPI data may contain a large number of indirect interactions from AP-MS experiments, but no known algorithm addresses this issue explicitly (the algorithms discussed in this section are evaluated only using the PPI data from Y2H experiments).

Phylogeny-based PPI detection. The poor accuracy and coverage of topology-based methods may also be due to the lack of information contained in the PPI data itself. To address this, several approaches have been proposed using phylogenetic informa- tion. For example, a protein interaction in one species may be inferred from Rosetta Stone proteins (two genes fused into a single one) in another organism (Gene Fusion

Method [131]). Or, in the case of bacterial genomes, the conservation of the order of genes may be a good indication of interactions (the Gene Order Conservation Method [37]). 16 Chapter 1. Introduction

A potentially more powerful approach is using a set of reference organisms’ genomes. A phylogenetic profile of a protein A is a vector whose i -th entry represents the presence or the absence of protein A in organism i . Under the assumption that physically in- teracting proteins coevolve, the interaction can be inferred from the similarity of two phylogenetic profiles (the Phylogenetic Profile Method [112]). Moreover, the phyloge- netic trees for two proteins could also be used in comparative analyses. In this method, a multiple sequence alignment (MSA) is first obtained from a set of reference organ- isms, and the phylogenetic trees for protein A and B are constructed using the position of orthologs in the MSA. Then the similarity score can be computed by comparing the distance matrices of the two phylogenetic trees. For a more comprehensive exposition of these methods see Jothi and Przytycka [82].

Protein-protein docking. Another approach for detecting protein interactions is done by structural analysis of protein molecules: namely, protein-protein docking. Given a pair, or a small set of proteins, accurate and efficient protein docking algorithms pro- vide a useful tool for understanding both the tendency for protein interactions and the structure of protein complexes. Various docking algorithms have been proposed, and often work in two phases: (1) A population of candidate conformations are generated; and (2) each of the candidates is scored in order to find the most likely structural confor- mations (see, for example, Azé et al. [6] and Choi [31]). To verify the correctness of these methods, they are often compared against a database of NMR and X-ray structures [77].

1.2.3 Identification of Protein Complexes

While the PPI networks represent binary interactions between two proteins, protein complexes with three or more interacting partners are difficult to identify immediately from these networks. However, the protein complexes are often highly connected sub- graphs in PPI networks, and thus graph clustering algorithms are often used to detect these complexes. 1.2. Computational Analyses of PPI Data 17

Molecular Complex Detection (MCODE) algorithm. Bader and Hogue [7] proposed the three-stage algorithm, MCODE, to identify protein complexes. First, the vertices are given weights according to a local density measure, called the core-clustering coefficient.

Let N [v ] denote the subgraph induced by a vertex v and its neighbours. Then, a k -core of v is a subgraph of N [v ], induced by vertices with degree at least k . The highest k -core refers to a nonempty k -core with the highest k , and the core-clustering coefficient of a vertex v is then calculated as the edge density of the highest k -core of v .

After each vertex is assigned a weight, the second stage of the algorithm recursively starts to expand the clusters. The vertex with the highest weight is initially set to form a cluster, and all of its neighbours are joined to the cluster if (i) the weight of the neigh- bour is above a given threshold; and (ii) the neighbour has not been previously explored. When no further expansion is possible, the next heaviest unexplored vertex is set to form a singleton cluster, and the process is repeated. Note that the clusters found so far are disjoint at this stage of the algorithm.

Finally, the third phase the the algorithm is a post-processing step, where clusters that are not sufficiently dense are discarded from the set of clusters. Furthermore, any unexplored vertices are joined to nearby clusters. At this stage, these unexplored vertices may be joined to multiple clusters, thereby creating overlapping clusters.

Markov Cluster algorithm. The Markov Cluster (MCL) algorithm, proposed by Van

Dongen [139], simulates random walks on the graph by alternating two operations, ex- pansion and inflation, to transform one set of probabilities into another. First, a weight matrix M of G is normalized into M˜ so that each column of M˜ sums to 1. Then each

column of M˜ represents a stochastic process, where each entry M˜ (i , j ) corresponds to ˜ the probability of going from vertex i to vertex j . The inflation ∆r of M for a power coefficient r is defined as:

M˜ p,q r ˜ ( ) ∆r (M (p,q)) = n P M˜ i ,q r i =1( ( )) 18 Chapter 1. Introduction

Hence, higher values of r would yield tighter clusters by favouring more probable walks on the graph.

On the other hand, the expansion operation creates longer walks on the graph by computing the e -th power of the associated matrix M˜ , where e is the expansion param- eter. This operation will try to expand the information flow on the network. The MCL al- gorithm runs the expansion and the inflation process as alternatively until M˜ stabilizes. When the algorithm reaches its equilibrium, the resulting matrix would correspond to a collection of star-like components, each of which is declared a protein complex.

Min-cut based algorithms. Though not applied directly to the protein complex detec- tion problem, several approaches have been proposed to use minimum cut algorithms when finding clusters in biological networks (e.g. [69, 126]). These approaches first start with a similarity graph as an input, and repeatedly partition the set of vertices into clus- ters until a partition satisfies a particular stopping criterion. For example, in Hartuv et al. [69], the notion of highly connected subgraph (HCS) is used as a stopping criterion. V (H) A subgraph H is a HCS if the global minimum s t cut of H is at least | 2 | . The initial − PPI network is then recursively partitioned into two clusters by minimum s t cuts until − each connected component is a HCS.

Simultaneous clustering algorithm. More recently, Narayanan et al. [106] proposed a framework, called JointCluster, that finds a clustering of multiple networks that inte- grates large-scale datasets including PPI data and gene expression data. Extending from a previous study on clustering a single graph by Kannan et al. [84], JointCluster finds sparse cuts to generate clusters that allow theoretical approximation guarantees on the quality of the detected clustering relative to the optimal clustering. The biological sig- nificance of the resulting clusters preserved across physical and coexpression networks is verified by known protein complexes in the literature or enrichment of functionally coherent pathways. 1.3. AP-MS based PPI Networks 19

In a recent study by Brohee and van Helden [20], four clustering algorithms – MCL, Restricted Neighbourhood Search Clustering (RNSC) by King et al. [90], Super Param- agnetic Clustering (SPC) by Blatt et al. [15], and MCODE - were evaluated against PPI networks built from annotated MIPS database. In particular, to test for robustness of the algorithms against false positives and false negatives, edges were randomly added and removed to generate 41 altered graphs. After comparing against the hand-curated pro- tein complexes from the MIPS database, the MCL algorithm clearly outperformed the other methods under most conditions, while RNSC and MCL show similar behaviours in some tests such as altered graphs with edge removal but no edge addition.

Note that these complex detection methods use similarity matrices that are often populated using the number of times each pairwise PPI is observed. However, the sim- ilarity between two proteins can be interpreted in various ways, e.g., (1) the confidence level of a direct interaction, (2) the strength of an interaction, or (3) the timing of an interaction to distinguish those that are transient. However, the existing complex detec- tion methods do not reflect on how these similarity matrices are interpreted. In partic- ular, if the PPI data contains many indirect interactions, clusters that highly overlap will be be difficult to separate using the existing PPI data. Furthermore, because the clus- tering algorithms tend to first create disjoint clusters, and then join the ones that are close to each other, they often do not distinguish overlapping protein complexes well. As such, methods to discover a collection of overlapping clusters are still in demand in order to properly model PPI networks as hypergraphs8.

1.3 AP-MS based PPI Networks

In the previous section, we introduced the types of post-processing and analysis that need to be done computationally, and reviewed different approaches proposed in the literature. In general, these approaches do not take into account the experimental tech-

8 A hypergraph is a set system where, in the context of PPI networks, each set corresponds to a protein complex. 20 Chapter 1. Introduction

nique from which the dataset originates, and such generic models often fail to capture the inherent nature of the produced data.

To give an example, the methods introduced in Sections 1.2.1 and 1.2.2 assume that the dataset contains only direct, binary interactions. This is a reasonable assumption when the dataset is from Y2H experiments, but when analyzing AP-MS data, indirect interactions must be filtered out prior to applying these methods. On the other hand, the complex detection methods discussed in Section 1.2.3 are designed to find individual clusters within the network, and as is the case for most clustering algorithms, finding overlapping complexes remains a difficult problem in general.

In this section, we take a brief look at how the analyses are carried out for PPI net- works constructed from AP-MS experiments. In particular, we focus on two of the most recent large scale studies on AP-MS based PPI networks from a bioinformatics point of view.

1.3.1 Protein Complexes in Yeast

In Krogan et al. [93], the protein-protein interactions in the yeast were detected by the AP-MS method. In order to improve the accuracy as well as the coverage, two MS pu- rifications were done independently. Furthermore, both interacting partners for each protein-protein interaction were tagged and purified in order to ensure greater data con- sistency and reproducibility. After 2357 successful purifications of proteins, 4087 yeast proteins were identified as preys with high confidence scores from the mass spectrom- etry.

After the interactome with confidence scores had been established, protein com- plexes were detected using the Markov Clustering algorithm (described in Section 1.2.3), where the expansion and inflation operators were appropriately chosen to optimize the overlap against the MIPS database. The Markov Clustering algorithm identified 547 dis- joint protein complexes, half of which were not present in the MIPS database. Further- 1.3. AP-MS based PPI Networks 21

more, new proteins were identified for most complexes that had been known previously.

For the purpose of the graph theoretic analysis of the protein complexes, a soft- ware plug-in (http://genepro.ccb.sickkids.ca) for the Cytoscape environment was written to model the interactome in two different ways: (1) the traditional PPI net- work model, and (2) the protein complex network model, where each protein complex is represented as a node, and two nodes are joined by an edge if and only if the correspond- ing complexes share common subunits. Using these two models, the authors were able to verify (or re-verify) several hypotheses and findings from the previous literature; for example, (1) proteins in the same complex should have similar function and co-localize to the same subcellular compartment, and (2) highly connected proteins within the net- work tend to be highly conserved, and conversely, highly conserved proteins tend to be more highly connected and central (measured by betweenness9) to the network.

On the other hand, as discussed in Section 1.2.3, the Markov Clustering algorithm does not necessarily separate two or more complexes that share subunits, and the au- thors point out that the identified clusters may contain multiple complexes. Identifying such overlapping complexes remains to be explored, and better suited complex detec- tion algorithms will provide a finer view of the interactome.

1.3.2 Modularity in Yeast Protein Complexes

Gavin et al. [58] also studied the protein-protein interactions in the yeast via the AP- MS method, and successfully purified 1993 distinct proteins, for which 2760 interacting partners were identified via mass spectrometry. In order to measure the reproducibil- ity, 138 purifications were done repeatedly, and 69% of the repeated purifications were common, giving rough estimates of the false-positive and false-negative ratios.

As acknowledged by the authors, the current graph clustering approaches for iden-

9 Betweenness is a measure of graph centrality of a node. For a given node v , Betweeness(v ) is typically defined as the number of shortest s t paths that pass through v (sometimes normalized by the total number of s t paths), where s = v =−t . − 6 6 22 Chapter 1. Introduction

tifying protein complexes are inappropriate for the PPI data from AP-MS experiments since they do not explicitly capture the nature of the purification output. To avoid this problem, the socio-affinity index measure was devised to quantify the propensity of pro- teins forming an interaction. This index measures the log-odds of the number of times two proteins are observed together, relative to what would be expected from their fre- quency in the data set. In particular, the socio-affinity index A(i , j ) is defined as

A(i , j ) = Si ,j i =bai t +Si ,j j =bai t + M i ,j , | |

where Si ,j i =bai t measures the tendency for protein j to be pulled down when protein i is | used as a bait (known as the spoke model), and M i ,j measures the tendency for two pro- teins to co-purify when other proteins are tagged as a bait (known as the matrix model).

The socio-affinity model was the first attempt to quantify physical measurements of the interactions purely from the AP-MS data. These computed values can populate an n n matrix representing a PPI network, and a graph clustering algorithm (not de- × scribed) is then used in order to define protein complexes within this network. However, since each protein can belong to multiple protein complexes, a disjoint clustering from a single run of the clustering algorithm is unlikely to form the appropriate protein com- plexes. To avoid this pitfall, several runs of the clustering algorithm are carried out using different parameters.

Using this approach, 1784 different sets of complexes were generated, and were com- pared against hand-curated known complexes. Consequently, the sets of complexes that score highly ( > 70% ) in coverage and accuracy show similar clusterings. Thus the top scoring set of complexes is merged with other high scoring sets of complexes to form variants of complex formulations called complex isoforms. Interestingly, these complex isoforms showed that the proteins within each complex can be partitioned into two types: (1) core components, which are the subunits of a complex that are present in most isoforms, and (2) modules, which are present in only some of the isoforms, but al- ways together. Furthermore, each module (which can be regarded as a building block 1.4. Contributions and Thesis Outline 23

of a complex) appeared to be present in multiple complexes, where the core compo- nents differ. The authors verified that this organization of complexes is due to biological phenomena by looking at the co-expression level during the cell cycle, as well as co- localization in the cell. Furthermore, the cores and modules that were detected within the same complex tend to belong to the same functional category, or otherwise, the core–module cross-talk between functional categories show many known connections such as between protein synthesis, transcription, and the cell cycle.

This modularity of protein complexes gives a new insight towards the structure of PPI networks in that each protein complex can be further partitioned into functional building blocks – cores and modules, and this structural property of protein complexes may be beneficial to designing new complex detection algorithms. On the other hand, while the socio-affinity index attempts to quantify the probability measure for direct protein interactions, the spoke model and the matrix model described above only con- sider 1- or 2-hop neighbours of each protein from the raw PPI data. From the bioin- formatics standpoint, devising a better suited measure that fully models the stochastic nature of the AP-MS PPI data would certainly improve our understanding of the pro- teome.

1.4 Contributions and Thesis Outline

Due to its ability to test for protein co-complex memberships, AP-MS has been the method of choice for various recent studies in high-throughput proteomics. However, the AP-MS method has its own drawbacks that require systematic curation as well as reinterpretation of the experimental data. In this thesis, we (1) investigate three such limitations from a mathematical viewpoint, (2) formulate combinatorial optimization problems for each identified limitation, and (3) study algorithmic approaches with the appropriate applications in mind.

Chapter 2. We give an introduction to the mathematical tools that are used extensively 24 Chapter 1. Introduction

throughout this thesis. This includes techniques from approximation algorithms and combinatorial optimization as well as graph parameters that are useful for designing efficient algorithms for computationally intractable problems.

Chapter 3. Protein Quantification. When testing the existence and the abundance of proteins, mass spectrometry can only look at short fragments of proteins (known as peptides), and then identify which protein the peptide belongs to. However, there are a number of “ambiguous” peptides that belong to several different pro- teins, which makes the identification of constituent proteins a difficult task. We study the problem of quantifying the protein abundance despite the presence of these ambiguous peptides: given a set S of peptide sequences and a set W of pro-

tein sequences, together with the peptide abundance σ, find the protein abun- dance X for each protein. We formulate this problem into various mathematical programs for which we show their hardness, and devise approximation algorithms that can predict absolute abundance for all constituent proteins. This manuscript

is to be submitted [89].

Ethan Kim, Adrian Vetta, Mathieu Blanchette, “Protein quantification with • shared peptides using multi cover algorithms”, in preparation.

Chapter 4. Direct PPI network. Once the PPI data from AP-MS experiments are quan- tified, one needs to be sure that the identified interactions are accurate. However, as discussed in this chapter, a significant fraction of the detected interactions are artifacts of indirect interactions, i.e. detected via chains of simultaneous interac- tions. We study the problem of separating the direct interactions from indirect ones, in order to minimize the false positives in the AP-MS data. Here, we intro- duce a probabilistic graph model that captures the direct and indirect interactions within the AP-MS data. Then, under this model, the problem of distinguishing di- rect interactions from indirect ones is formulated, and we provide an algorithm

that is highly customized for the AP-MS data. This work is published in [88]. 1.4. Contributions and Thesis Outline 25

Ethan Kim, Ashish Sabharwal, Adrian Vetta, Mathieu Blanchette, “Predict- • ing Direct Protein Interactions from Affinity Purification Mass Spectrometry Data”, Algorithms for Molecular Biology, 5:34, (2010).

Chapter 5. Hypergraph modelling of PPI data. PPI networks are traditionally represented as combinatorial graphs where each interaction is assumed to occur between two proteins. Recent studies, however, revealed interactions involving several proteins working simultaneously. Such discoveries of protein complexes as building blocks of the proteome led to much interest in modelling PPI networks as hypergraphs. We thus study the problem of constructing hypergraphs from binary PPI data us- ing algorithms for the (edge) clique cover problem. This work resulted in various efficient algorithms for both PPI networks and other restricted classes of graphs such as graphs with bounded parameters, and an approximation algorithm in the

case of planar graphs, as published in [14].

Mathieu Blanchette, Ethan Kim, Adrian Vetta, “Clique Cover on Sparse Net- • works”, The 9th SIAM Meeting on Algorithm Engineering & Experiments (ALENEX) 93-102, Kyoto, Japan (2012)

Chapter 6. We conclude and give a list of open problems. 26 Chapter 1. Introduction Chapter 2

Approximation Algorithms and Optimization Techniques

This chapter introduces topics from discrete mathematics that are preliminary to dis- cussions throughout the thesis. In particular, we focus on algorithmic techniques that are frequently used to tackle problems in computational biology. As such, during this pedagogical exposition, we shall discuss problems with applications in computational biology in order to motivate the reader. We shall first define NP-hard optimization problems in Section 2.1. Then, Section 2.2 discusses graph parameters that often al- low us to design efficient algorithms for computationally hard problems. Various tech- niques to design approximation algorithms are discussed in Section 2.3, including α- approximation algorithms, polynomial time approximation schemes, and LP-based meth- ods. Finally we introduce genetic algorithms in Section 2.4.

2.1 NP-hard Optimization Problems

Combinatorial optimization problems ask to find an optimal object from a finite set

of objects [125]. Typical problems include shortest path, minimum spanning tree, and

27 28 Chapter 2. Approximation Algorithms and Optimization Techniques

travelling salesman problem, and these problems can often be stated as: given A, find B of size k , where k is the size of the solution for which the problem tries to either minimize or maximize. If k is indeed the minimum (maximum, resp.) possible value admitting a solution, we say k is optimum.

From the perspective of computational complexity, we say that a problem Π is in P if there is an algorithm to solve Π efficiently.1 On the other hand, if a problem Π is such that its solution can be verified efficiently, we say Π is in NP. By definition, this implies that problems in P belongs to NP. The hardest problems in NP deserves special attention: if every problem in NP can be reduced to a problem Π within polynomial time (without requiring that Π belongs to NP), we say that Π is NP-hard. Furthermore, if Π also belongs to NP itself, we say that Π is NP-complete. Therefore, if an NP-complete problem (the hardest problem in NP) can be solved efficiently within polynomial time, then every problem in NP can be solved in polynomial time. Therefore, NP-complete problems lie at the core of the class NP.

Consequently, it is of great interest in the theoretical computer science community to decide whether NP P = , thereby P = NP. Since it is widely believed that this is not \ ; true, algorithm designers face difficulties in finding efficient solutions to these prob- lems. Indeed, as is the case for problems that are discussed in this thesis, optimization problems in computational biology often turn out to be NP-hard. As a result, one must apply various techniques to either find plausible solutions to these problems, or restrict ourselves to special cases of the problem that allow efficient algorithms. We shall intro- duce some of these techniques in the following sections.

1 An algorithm is said to be efficient if its running time can be bounded by a polynomial in the input length. We avoid more rigorous definitions for these classes of problems; precise definitions involving Turing machines and certificates can be found in various complexity theory texts, for example [4, 56]. 2.2. Graph Parameters 29

2.2 Graph Parameters

Throughout this thesis, let G = (V, E ) be a finite graph with vertices V and edges E . Unless stated otherwise, we assume G is simple and undirected. The number of vertices

and edges are V = n and E = m, respectively. | | | | A graph parameter Γ is a function that assigns a non-negative number to every graph G . For a class of graphs , Γ( ) denotes the maximum Γ(G ) over all graphs G , and G G ∈ G we say a graph class has bounded Γ if Γ( ) O(1). G G ∈ Graph parameters are useful when designing efficient algorithms with parameter- ized running time. In particular, a problem is fixed-parameter tractable (FPT) if it can be solved in f (k ) I O(1) time, where f is a computable function depending on some · | | parameter k , independent of the input size I . | | A good example is treewidth, which measures how “tree-like” a given graph is. To define treewidth, we need the notion of tree decompositions.

Definition 2.1. [120] A tree decomposition of a graph G = (V, E ) is a pair (X = Xi i { | ∈ N ,T = (N , L)) where each tree node i N is associated with a set of vertices Xi V , such } ∈ ⊆ that

S (1) i N Xi = V. ∈

(2) For each (v,w ) E , there is an i N with v,w Xi . ∈ ∈ ∈

(3) For each v V , the set of tree nodes i N v Xi induces a subtree of T . ∈ { ∈ | ∈ }

For clarity, we shall refer to (tree) nodes of the tree decomposition T , and vertices of the original graph G . The width of a tree decomposition (X,T ) is defined as maxi N Xi ∈ | |− 1, and the treewidth of a graph G , denoted tw(G ), is the minimum width over all tree de- compositions of G . Following this definition, it is easy to see that all trees have treewidth of 1. See Figure 2.1 for an example of a tree decomposition of width 3. In general, it is NP-

complete to determine the treewidth of a graph [3]. However, for a fixed k , graphs with 30 Chapter 2. Approximation Algorithms and Optimization Techniques

1 2 1 2 3 4 5 4 5 6 7 8 4 5

9 10 11 7 8 Graph G 4

6 7

9 7 8

11 3 4 7

6 9 10 Decomposition Tree T

Figure 2.1: A graph G and its tree decomposition T of width 3. treewidth k can be recognized, and a width k tree decomposition can be constructed in linear time [17].

A tree decomposition of a graph is especially useful for designing divide and con- quer algorithms: given a decomposition tree T , one can pick an arbitrary tree node r as root of T . Then, for each tree node X, the problem can be solved for the subgraph induced by the vertices associated with X, and all its descendant nodes. Building up the solution from leaf nodes to r often provides a polynomial time algorithm (parameter- ized by the treewidth) if solutions to the subproblems can be combined using dynamic programming2.

Another related parameter for graph sparsity is branchwidth.

Definition 2.2. [121] A branch decomposition (T,φ) of a graph G is characterized by a ternary tree3 T , and a bijection φ from the leaves of T to the edges of G .

2 Dynamic programming is a method for solving complex problems by subdividing the problem into simpler subproblems. This technique is useful when there are overlapping subproblems, so that one can solve the problem in a bottom-up fashion. 3 A tree T is a ternary tree if every non-leaf node has degree 3. 2.2. Graph Parameters 31

1 2 3 1,4 1,2 4,5 2,5 4 5 6

7 8 9 Graph G 4,7

7,8

5,8 8,9 5,6 6,9 2,3 3,6 Branch decomposition T

Figure 2.2: A graph G and its branch decomposition T . The e -separation shown on the decomposition tree T corresponds to a middle-set of 5,6,8 . { }

Let us denote e to be a tree edge in T . Removing e from T partitions into T1 and T2, and this partition induces a partition of edges in G , called an e -separation, associated

with the leaves of T1 and T2. The set of vertices in G that are shared by both G1 and G2 is called the middle-set of e , and the width of this separation is the number of vertices in the middle-set.

Given a branch decomposition (T,φ), the width of this branch decomposition is the maximum width over all e -separations in T , and the branchwidth of G , denoted bw(G ), is the minimum width over all branch decompositions. Figure 2.2 gives an example of a graph and a branch decomposition. It is well known that the branchwidth is closely

related to the treewidth of a graph by the sandwich theorem [121]:

3 bw(G ) tw(G ) + 1 bw(G ). ≤ ≤ 2

These graph parameters often allow us to design efficient algorithms to solve oth- erwise NP-hard problems, and such problems come in different flavours: The input 32 Chapter 2. Approximation Algorithms and Optimization Techniques

graphs sometimes have bounded graph parameters due to the nature of the problem. In some other cases, different techniques can be used when the graph has unbounded parameters, so we can restrict ourselves to the bounded case. Sometimes, the graph decomposition gives rise to efficient algorithms to find near-optimal solutions.

All these techniques will be exploited extensively for modelling PPI networks as hy- pergraphs in Chapter 5. Indeed, graph decomposition finds applications in a number of different areas within computational biology. The perfect phylogeny problem, which will be discussed in the next section, is polynomially equivalent to the problem of trian- gulating coloured graphs, and it can be solved in polynomial time when the treewidth is bounded [16, 92]. Within the field of proteomics, several studies have shown that pro- tein interaction networks often exhibit low treewidth (see, e.g. Yamaguchi et al. [147] and Cheng et al. [28]). This opens a gate for designing efficient algorithms for many problems on PPI networks. For example, Dost et al. [43] study the problem of querying the existence of a pathwayQ within a base network G . While the problem of deciding if a graph is a subgraph of another is an NP-hard problem in general [56], they focus on the restricted class of graphs with bounded treewidth, and provide an efficient algorithm to query such pathways in a database of molecular networks.4

2.3 Approximation Algorithms

Many optimization problems in bioinformatics are NP-hard, and often researchers re- sort to heuristics for finding solutions that are close to optimal. A useful line of attack is the theory of approximation algorithms, which offers compelling techniques for solving NP-hard optimization problems with provable guarantees. For a minimization prob- lem, an algorithm is an α-approximation algorithm if it can find in polynomial time a solution of value at most α OPT, where OPT is the value of an optimal solution. The · 4 In Dost et al.’s work, similarity scores between vertices are computed using their sequence similar- ity, which is then used when matching a query network to the base network. The classical problem of subgraph isomorphism on graphs with bounded parameters has been studied by Eppstein [45, 46]. 2.3. Approximation Algorithms 33

usage of approximation algorithms in bioinformatics includes genome assembly (short- est superstring problem), multiple sequence alignment problem, and phylogenetic tree construction (Steiner tree problem), just to name a few.

Despite the diverse application of approximation algorithms in bioinformatics, the area of PPI networks has seen limited applications. In particular, to the best of our knowledge, no combinatorial algorithm has been proposed so far to identify direct pro- tein interactions (Chapter 4) or protein complexes (Chapter 5) purely from the topolog- ical structure of the given PPI network. Such scarcity of applications may be due to the fact that even defining a clean objective function is a nontrivial task. In fact, most ex- isting approaches propose different objective functions, for which heuristic algorithms are used. Moreover, it is often unclear whether these approaches directly capture the experimental processes that produced the PPI data.

In this section, we present a glimpse of approximation algorithms by discussing a few problems as examples. Since the focus of this section is to introduce techniques for devising approximation algorithms, we refrain from discussing the latest results on each problem with the best approximation guarantee. For more extensive surveys, Vazi- rani [140], and Williams and Shmoys [145] give excellent tutorials on the field.

2.3.1 Lower-bounding the Optimum

When designing an approximation algorithm, one needs to compare the cost of the ob- tained solution with the cost of an optimal solution in order to establish the approxima- tion guarantee. This is typically done by finding a good lower bound (for minimization problems) on the cost of an optimal solution. For example, consider the vertex cover problem:

CARDINALITY VERTEX COVER. Given an undirected graph G = (V, E ), find a set C V ⊆ such that every edge has at least one endpoint incident at C , and C is minimized. | | 34 Chapter 2. Approximation Algorithms and Optimization Techniques

A lower bound on the size of an optimal vertex cover can be found using a maximal matching5 of G . With respect to a maximal matching , any vertex cover has to pick at M least one endpoint of the edges in . Therefore, the size of is at most the size of an M M optimal vertex cover. This gives rise to the following simple algorithm:

1. Find a maximal matching in G . M 2. Output both endpoints of edges in . M

Observe that the chosen vertices are adjacent to all the edges of G , as otherwise any uncovered edge would have been added to the matching. Furthermore, the size of the cover is exactly 2 . Since OPT, the size of the cover is at most 2 OPT, and thus |M | |M | ≤ · the algorithm is a 2-approximation algorithm.

Application: Phylogenetic Trees [18] Consider a set of m species with n possible char- acteristics. We can represent this by a 0 1 matrix M where M i ,j = 1 if species i has − characteristic j . Then, we say M has a perfect phylogenetic tree if there exists a tree T such that: (1) there are m leaves, one for each species; (2) Each characteristic label cor- responds to one edge in T (there may be edges with blank labels); (3) for each leaf si , the path from root to si contains exactly those characteristics possessed by si . See Figure 2.3 for an example of M and a possible perfect phylogenetic tree.

Let Sj be the species that have characteristic c j = 1. Then the following characteri- zation holds.

Theorem 2.3. M has a perfect phylogenetic tree if and only if S1,...,Sn form a laminar { } family.6

Thus, we say that two characteristics ci and c j have a conflict if Si and Sj intersect, while

Si Sj = and Sj Si = . If we remove characteristics so that there are no conflicts, then \ 6 ; \ 6 ; 5 Given a graph G = (V, E ), a subset of edges E is a matching if no two edges in share a vertex. We say a matching is maximal if no more edgesM of E ⊆ can be added to . M 6 \M M A set system is a laminar family if, for each pair of sets Si and Sj , either Si Sj , or Sj Si , or Si Sj = . ⊆ ⊆ ∩ ; 2.3. Approximation Algorithms 35

R

c3 c2

c4 c1 c1 c2 c3 c4 c5 ∅ ∅ s1 1 1 0 0 0 s2 0 0 1 0 0 c s4 s2 s5 5 s3 1 1 0 0 1 ∅ s4 0 0 1 1 0 s1 s3 s5 0 1 0 0 0

Figure 2.3: Matrix M indicating the characteristics of a set of species (left) and a perfect phylogenetic tree for M (right). Tree edges with indicate unlabelled edges. ; there is a perfect phylogenetic tree for the remaining characteristics. Thus, we want to find the smallest set of characteristics to remove. We can turn this problem into an instance of vertex cover: define a conflict graph whose vertices correspond to character-

istics c1,...,cn , and vertices ci and c j share an edge if and only if they conflict. Then, a minimum cardinality vertex cover in the conflict graph consists of a minimum set of conflicting characteristics to remove.

2.3.2 Polynomial Time Approximation Schemes

Some NP-hard optimization problems are approximable to an arbitrary degree defined

by the error parameter ε > 0. For an NP-hard optimization problem Π with an objective function f , we say that algorithm A is an approximation scheme for Π if on input (I ,ε), A outputs a solution s such that:

f (I ,s ) (1 + ε) OPT if Π is a minimization problem. • ≤ · f (I ,s ) (1 ε) OPT if Π is a maximization problem. • ≥ − ·

The algorithm A is called a polynomial time approximation scheme (PTAS), if for each

fixed ε > 0, the running time of A is bounded by a polynomial in the size of the instance 36 Chapter 2. Approximation Algorithms and Optimization Techniques

I . Note that the running time of a PTAS depends arbitrarily on ε. If, in addition, the running time is bounded by a polynomial in both I and 1/ε, we say the algorithm is | | fully polynomial time approximation scheme (FPTAS). Here we give an example of an FPTAS for the knapsack problem.

KNAPSACKPROBLEM. Given a setS = a 1,...,a n of objects with cost(a i ) Z+, profit(a i ) { } ∈ ∈ Z+, and B Z+, find a subset of S whose total cost is bounded by B, and whose total ∈ profit is maximized.

The design of a PTAS is often done via first mapping the problem instance to a coarser instance, and the new instance is then solved exactly by dynamic programming ap- proach. Before describing the FPTAS for the knapsack problem, we introduce the no- tion of pseudo-polynomial time algorithms. An algorithm is efficient if its running time is bounded by a polynomial in I , the size of the instance. However, some problems | | contain numbers as part of the problem instance. For example, a knapsack problem instance contains cost and profit for each object a i . Let Iu denote the size of a prob- | | lem instance I , where all the numbers in the instance are written in unary. Then, an algorithm is a pseudo-polynomial time algorithm if its running time is bounded by a polynomial in Iu . Because the knapsack problem is NP-hard, we do not know whether | | an efficient algorithm exists; however, there is a pseudo-polynomial time algorithm for this problem.

Let P be the profit of the most profitable object in S. Then nP is an upper bound on the profit of an optimal solution. For every 1 i n and 1 p nP, let S(i ,p) ≤ ≤ ≤ ≤ denote a subset of a 1,...,a i whose total profit is precisely p and whose total cost is { } minimized. Further, define A(i ,p) to be the cost of the set S(i ,p). If no such set exists, set A(i ,p) = . Then A(1,p) is known for each value of p, where 1 p nP, and the ∞ ≤ ≤ following recurrence holds. 2.3. Approximation Algorithms 37

 min A(i ,p),cost(a i +1) + A(i ,p profit(a i +1)) , if profit(a i +1) < p A(i + 1,p) = { − }  A(i ,p), otherwise

A dynamic programming algorithm using this recurrence will compute all values of

A(i ,p) inO(n 2P) time. Finally, the optimal solution is found in timeO(nP) by looking for max p A(n,p) B . Thus, this is a pseudo-polynomial time algorithm for the knapsack { | ≤ } problem.

Observe that, if the profit of each object is bounded by a polynomial in n, the above algorithm would run in polynomial time. This observation leads us to an FPTAS for the knapsack problem. We first scale the profit down to small numbers so that the profit of each object is polynomially bounded by n. Then we can run the above algorithm to compute the optimal solution with respect to the scaled profit function. By carefully scaling with respect to the error parameter ε, we shall obtain a solution that is at least

(1 ε)OPT in time polynomial in I and 1/ε. − | | Here is the FPTAS for the knapsack problem:

1. Let K = εP/n.

profit a i 2. For each object a , define profit a ( ) . i 0( i ) = K b c 3. Run the dynamic programming algorithm with the scaled profit function, profit a 0( i ) to obtain the most profitable set S . 0

4. Return S . 0

We now show that profit S 1 OPT. Let S denote the optimal set with respect ( 0) ( ε) ∗ ≥ − to the original profit function. For each object a i , due to the rounding down at Step 2, 38 Chapter 2. Approximation Algorithms and Optimization Techniques

K profit0(a i ) is smaller than profit(a i ) by a difference at most K . Therefore, ·

profit(S∗) K profit0(S∗) nK . − · ≤

Furthermore, S must be at least as good as S under the scaled profits. By scaling the 0 ∗ profits back up by K we have:

profit(S0) K profit0(S∗) profit(S∗) nK = OPT εP (1 ε) OPT, ≥ · ≥ − − ≥ − · since OPT P. ≥ The running time of the algorithm is O n 2 P O n 3 1 , which is polynomial in ( K ) = ( ε ) b c · both n and 1/ε.

Application: Target Protein Set. Due to limited resources, structural biologists often face the problem of which protein to study. Because of diverse nature of protein struc- tures, certain protein molecules are more difficult to crystallize than others. It would be beneficial, therefore, if the chosen set of proteins are not too expensive to investigate. Furthermore, certain proteins are more important than others due to biological rele- vance, homology against existing results on similar proteins, etc. As such, researchers want to find the set of target proteins that maximizes the importance staying under their budget.

We can model this simple problem of task selection as follows: for each protein a i , define the cost of investigating it to be cost(a i ), and the importance of the particular protein to be profit(a i ). If we let B to be the total resources that we have at our dis- posal, an optimum solution to this knapsack problem gives us the most profitable set of proteins we can study. 2.3. Approximation Algorithms 39

2.3.3 LP-based Algorithms

Linear programming gives a profound framework in which approximation algorithms can be designed. Many NP-hard optimization problems can be stated as integer pro- grams. While solving an integer program itself is NP-hard in general, solving a (frac- tional) linear program is not, and thus the linear relaxation of the integer program pro- vides a natural way of bounding the cost of the optimal solution.

A widely used approach for designing LP-based approximation algorithms is LP- rounding. In this approach, a fractional solution to the LP-relaxation is first obtained, and it is then converted to an integral solution by a rounding scheme. The approxima- tion guarantee is established by comparing the cost of the fractional and integral solu- tions. For more detailed discussion on the this technique, see [140]. Here we present a simple example of the LP-rounding technique via a weighted version of the set cover problem.

WEIGHTED SET COVER. Given a universe U of n elements, a collection of subsets of U,

= S1,...,Sk , and a cost function c : Z+, find a minimum cost subcollection of S { } S → that covers all elements of U. S We can formulate this problem as an integer program by assigning a binary variable xS for each set S , whose value is set to 1 if S is picked in the set cover, and 0 other- ∈ S wise. Furthermore, for each element e U, at least one of the sets containing e must be ∈ picked. So we have:

X minimize c(S) xS S · X ∈S subject to xS 1 e U S:e S ≥ ∀ ∈ ∈ xS 0,1 S ∈ { } ∀ ∈ S 40 Chapter 2. Approximation Algorithms and Optimization Techniques

Let OPT denote the optimum solution of this integer program. The LP-relaxation of this integer program is obtained by replacing the domain of variables xS with 0 xS 1 for ≤ ≤ all S . The derived linear program can then be solved optimally to obtain OPTf . By ∈ S construction, OPTf OPT since the optimum integral solution is feasible for the LP. ≤

Then, we can apply the rounding scheme as follows: pick all sets S for which xS 1/α ≥ in the OPTf , where α is the frequency of the most frequent element (i.e. the maximum number of sets containing an element). Let C denote the resulting subcollection of sets chosen by the rounding scheme. To see C is a valid set cover, consider an arbitrary element e . Since e is in at most α sets, one of these must have been assigned at least

1/α by OPTf . Thus, e is covered by C .

To check the approximation ratio, the rounding scheme increases xS by a factor of at most α. Therefore, the cost of C is at most α OPTf . ·

Application: Representative Protein Set. In Section 2.3.2, we introduced the problem of choosing a set of target proteins to investigate. Here we modify this problem slightly. Instead of maximizing the profit we gain by studying a set of proteins, we consider the notion of representative protein set: from a large collection of protein molecules, we would like to study the ones that somehow represent the entire collection of proteins. Such a representative profile would provide us a succinct, yet global picture of the given protein sample.

We can model this problem as follows: for each protein x, define the cost function w (x) 0 measured by the difficulty of crystallizing x. Then, for each pair (x,y ) of pro- ≥ teins, define a distance function d (x,y ) 0 measured by the sequence alignment of the ≥ two proteins. We can then construct a graph G = (V, E ) where each vertex corresponds to a protein, and two vertices are joined by an edge if and only if the corresponding pro- teins are within some prescribed distance . We say a subset V V of vertices is repre- ∆ 0 ⊆ sentative, if every vertex in V is adjacent to at least one vertex in V . Then the problem 0 of finding a minimum cost representative set is precisely the dominating set problem. 2.4. Genetic Algorithms 41

WEIGHTED DOMINATING SET. Given a graph G = (V, E ) and a cost function c : V Z+, → find a minimum cost subset D V such that every vertex not in D is joined to at least ⊆ one vertex in D.

It is well known that dominating set is equivalent to set cover [83]. To reduce domi- nating set to set cover, construct a set cover instance as follows: the universe U is V . For each vertex v V (G ), create a set Sv containing the vertex v and all vertices adjacent to ∈ v . For each set Sv , define its cost to be c(v ). Now, if D is a dominating set for G , then

C = Sv : v D is a feasible set cover with the same cost. Conversely, if C = Sv : v D { ∈ } { ∈ } is a set cover, then D is a dominating set for G with the same cost.

Therefore, we can solve the representative protein set problem using the LP-rounding scheme for set cover.

2.4 Genetic Algorithms

In order to solve optimization problems, a search heuristic called a genetic algorithm (GA) often offers useful solutions. Inspired by natural evolution, genetic algorithms are especially useful when a large number of candidate solutions can be efficiently gener- ated, and pairs of candidates can be somehow “merged” to obtain a better offspring of the two parent solutions. Each candidate solution is typically encoded as a string of characters. A genetic algorithm can then be designed via the following four main phases:

I. Initialization: Many individual solutions are randomly generated to form an initial population that spans the entire search space.

II. Fitness function: Each candidate solution is evaluated by the fitness function that gives a score based on the optimization criteria.

III. Crossover: Given two parent solutions, the crossover operation generates an off- spring based on its parents. Various methods have been proposed for the crossover 42 Chapter 2. Approximation Algorithms and Optimization Techniques

operation, with one point “cut-and-swap” being the simplest form, while it can be fully customized in a problem-specific way.

IV. Mutation: In order to maintain genetic diversity from one generation to the next, a prescribed portion of the population goes through a mutation operation. The mutation operation is typically done by randomly switching characters in the can- didate string, which can be fine-tuned depending on the application.

Using these operators, a large pool of candidate solutions are maintained from one generation to another. Since a genetic algorithms is a search heuristic, one typically cannot provide a theoretical bound on the output quality. Further, the fitness operator only tells us whether one candidate solution is better than other existing candidates, and thus it is never clear when to stop the algorithm. However, when a large initial popula- tion can be easily generated, and the fitness function can be evaluated quickly, carefully designed GAs provide an efficient approach to find good solutions. In this thesis, we shall present a highly customized genetic algorithm to find direct physical PPI network in Chapter 4. For more detailed discussions on this optimization technique, see [64, 94]. Chapter 3

Protein Quantification with Shared Peptides

With the advance of mass spectrometry (MS) based techniques, biologists in proteomics research are not only interested in determining the presence of proteins within a mix- ture, but also the quantitative nature of the constituent proteins, i.e. their abundance. MS-based shotgun proteomics approaches this problem by first shattering the (unknown) protein molecules into shorter fragments known as peptides, and then reading the pep- tides present in the mixture using a mass spectrometer.

Mass spectrometers typically work by first producing ions with charges based on the masses of the peptides, and then separating the ions by their mass-to-charge ratios. Various methods have been proposed to convert the peptide molecules into ions in the gas phase using electrospray ionization (ESI) [144] and matrix assisted laser desorption ionization (MALDI) [85, 116]. The produced ions can then be detected and processed into mass spectra. There are several algorithms to analyze the mass spectra and either identify the peptides present or quantify the peptide abundance [59, 115, 135].

Among these quantification methods, approaches known as label-free methods have become popular due to the low cost and simplicity of the experiments. These methods

43 44 Chapter 3. Protein Quantification with Shared Peptides

work under the premise that more abundant peptides would produce a larger number of spectra. By counting the number of spectra, we can obtain measures such as spec- tral counting [30, 109], and peptide counting [87, 143]. Peptide count is the measure of how well a given protein is represented in a sample by counting the number of unique peptides. On the other hand, spectral counting simply counts the number of spectra for each peptide. As such, while peptide counting measures the coverage of the parent protein sequence, spectral counting provides a concrete measure of peptide abundance that can later be translated into protein abundance.

This chapter deals with the next step in the protein quantification problem: translat- ing the peptide abundance into protein abundance. In order to identify (and quantify) constituent proteins in a mixture, the detected peptides often serve as indicators for the parent proteins. For example, one of the most widely used approaches for MS pro- tein identification is the Mascot score1[114]. Given a set of peptide mass spectra, the Mascot score measures the probability that the match between the mass spectra and a particular protein sequence is a random event (thus a lower probability yielding a higher score), and is calculated for each protein in the protein sequence database. To compute this measure, each protein sequence in the database is digested and fragmented in sil- ico, and the resulting “theoretical spectrum” is compared against the experimental spec- trum observed by mass spectrometers. Consequently, the computation of Mascot scores involves considering all possible digested peptides for each protein in the database.

However, such an approach creates a problem when there is a peptide that belongs to more than one parent protein; in other words, the mapping of peptides to proteins is no longer one-to-one. These ambiguous peptides are called shared peptides, and

Nesvizhskii and Aebersold [108] recently highlighted the complications in protein iden- tification with the presence of shared peptides. Unfortunately, shared peptides are preva- lent in several mass spectrometry datasets. For example, proteins in a large protein fam- ily have similar sequences that may create identical peptides when trypsinized, and al-

1 Mascot score is the in silico method to identify proteins from a sequence database, used by the Mascot software (http://www.matrixscience.com). 45

ternative splicing may also create nearly identical protein isoforms. However, in most studies of protein identification and quantification, shared peptides are often simply ignored [100, 151].

In this chapter, we study the problem of identifying and quantifying the constituent proteins from MS spectra, without having to exclude the existence of shared peptides. Depending on the mixture of proteins under investigation, the shared peptides can con- stitute as much as 50% of the peptides in the mixture [42, 81]. Consequently, incorpo- rating the shared peptides into analyses would certainly lead to a more accurate quan- tification of constituent proteins.

Recently, shared peptides have been a useful tool for protein quantification in sev- eral studies. For example, Jin et al. [81] proposed a statistical model that finds groups of proteins sharing significant number of peptides, namely peptide-sharing closure groups (PSCG). In their work, proteins grouped together in the same PSCG were shown to be bi- ologically related, and when applied to data from differential expression analysis, all the peptides within each group showed consistent abundance relative to controls, justifying the inclusion of shared peptides in protein quantification studies.

On the other hand, Dost et al. [42] recently proposed a clean combinatorial model for quantifying relative protein abundance. Their model is built around peptide-protein relationships and, using the relative peptide abundance data from multiple samples, they set up a linear program to obtain relative protein abundance. As their input data is relative (e.g. peptide si is expressed 5 times more in sample x than in sample y , say), their output is also in relative ratios across multiple samples. As such, solving the linear program optimally (which gives a fractional value to the abundance of each protein) sufficed to give a viable solution.

As described, existing studies of protein quantification using shared peptides focus primarily on relative abundance measures across different samples. This may be be- cause the quality of peptide abundance measures is relatively poor, and thus most stud- ies only consider the differential profiling aspect of protein abundance. However, vari- 46 Chapter 3. Protein Quantification with Shared Peptides

ous methods for absolute peptide quantification have been recently proposed [9, 61, 91], and the quality of peptide abundance measures is improving rapidly. In this chapter, therefore, we consider the problem of absolute protein quantification under the as- sumption that absolute peptide abundance is available. In Section 3.1, we modify the model of Dost et al. to suit our needs for absolute protein quantification, and formu- late various optimization problems. Then, in Section 3.2, we establish the computa- tional hardness of these problems, and design approximation algorithms in Section 3.3. In Section 3.4, we evaluate the performance of our approaches using both simulated and real MS data. Finally, Section 3.5 discusses how our approaches can be modified to handle degenerate cases, and incorporate additional biological data such as peptide detectability.

3.1 Problem Formulation for Protein Quantification

In this section, we formulate optimization problems for protein quantification. Let W =

w1,...,wm denote the set of all known protein sequences, and letS = s1,...,sn denote { } { } the set of peptides we observe from a given mass spectrometry experiment2. We assume that we are given the set of protein sequences and peptide sequences, together with the abundance of each observed peptide sequence. Then we want to find the protein abundance that can be constructed from the given peptide abundance. For simplicity, we make the following assumptions.

1. For each protein w j , a peptide si may only appear in w j at most once.

2. The absolute abundance for each observed peptide is given as a positive integer, and thus protein abundance should be a nonnegative integer.

Lifting the first assumption can be done easily as shall be discussed in Section 3.5. The absolute peptide abundance can be estimated by the spectral counting method, as

2 We use w for words and s for syllables as an analogy to proteins and peptides. 3.1. Problem Formulation for Protein Quantification 47

1 =4 s1 s1 =QYUA w1 x1 =4 w1 =QYUATUAIL 2 =5 s2 s2 =TUAIL w2 x2 =7 w2 =LTAUQWTA 3 =8 s3 s3 =QWTA w3 x3 =1 w3 =QWTATUAIL 4 =7 s4 s4 =LTAU

Figure 3.1: An example of the bipartite graph from peptides (s1,...,s4) and the parent proteins (w1,...,w3). Peptides s2 and s3 are shared peptides. The abundance of the peptides are indicated by σi while x j ’s denote the protein abundance. it measures the number of corresponding mass spectra for each peptide.

Let us construct a bipartite graph G = (S,W ; E ) with S = n, W = m, where (i , j ) E | | | | ∈ if and only if peptide si is a substring of protein w j . Let A be an n m matrix where Ai ,j = × 1 if (i , j ) E , and 0 otherwise. Let σi be a positive integer that denotes the multiplicity ∈ of a peptide si , for 1 i n. Then, let x j denote the unknown variable for multiplicity ≤ ≤ of protein w j . Figure 3.1 gives an example of the formulation. Then, we can encode our problem into a system of linear equations as follows:

Problem 3.1.

m X Ai ,j x j = σi i = 1...n j =1 ·

+ x j Z 0 j = 1...m ∈ ∪ { }

This model captures the dependencies of peptides on proteins, and our problem re- duces to solving a system of linear equations. However, there remain a few practical issues. The multiplicity of peptides σi from MS contains a significant amount of noise, 48 Chapter 3. Protein Quantification with Shared Peptides

which would lead to inaccurate solutions or non-satisfiable system of equations. Fur- ther, even after assuming the peptide multiplicities are accurate, the system may not contain a unique, integral solution. We thus need a relaxed set of constraints and an optimization criterion to select the most plausible solution. We propose to modify this model in four different ways to progressively capture the data being analyzed in an in- creasingly realistic manner.

MULTICOVER. We can relax the equality constraints: rather than requiring x j ’s sum up to precisely σi for each peptide Wi , we turn them into covering inequalities, and mini- mize the sum of x j ’s.

m X minimize x j j =1 m X subject to: Ai ,j x j σi i = 1...n j =1 ≥

+ x j Z 0 j = 1...m ∈ ∪ { }

Observe that this problem is closely related to the set cover problem. In fact, our for- mulation above is a generalization of set cover known as multicover: instead of requiring each element (peptide, in our case) to be covered at least once, each element needs to be covered at least the prescribed number of times. If each set (protein, in our case) can be picked at most once, the problem is called constrained multicover, while if a set can be covered arbitrarily many times, the problem is called unconstrained multicover, which is precisely the formulation above.

MINIMUMPROTEINTYPES. One further refinement from above is to look for a solution with the smallest number of different proteins, that is, minimize the number of x j ’s with 3.1. Problem Formulation for Protein Quantification 49

positive values.

m X minimize χ(x j ) j =1 m X subject to: Ai ,j x j σi i = 1...n j =1 ≥

+ x j Z 0 j = 1...m ∈ ∪ { }

where χ(x j ) = 1 if x j > 0, and χ(x j ) = 0 otherwise.

In this problem, we want to find a solution that uses a smallest possible number of distinct protein types. Since the objective function is nonlinear, the resulting problem is no longer an integer linear program. While nonlinear integer programs are much harder class of problems in general, as we shall see later, we can still solve the above program, via a reduction to set cover.

MINIMUMUNIFORMERROR. Instead of simply looking for covers on σi ’s, we can allow an

error term ε such that the x j ’s sum to a value within σi ε. ±

minimize ε m X subject to: σi ε Ai ,j x j σi + ε i = 1...n − ≤ j =1 ≤

+ x j Z 0 j = 1...m ∈ ∪ { }

This formulation looks for a solution such that, for each peptide si , the proposed

multiplicities for proteins containing si sum up to a value within ε of σi . This introduc- tion of additive error term helps us incorporate noise in the MS data while adhering to

the peptide abundance σi by minimizing the error term. One issue with this approach is that some peptides are significantly harder to detect than others, and thus the error 50 Chapter 3. Protein Quantification with Shared Peptides

for those peptide multiplicities are much larger than peptides that are easier to identify. We can resolve this issue of non-uniform errors using the following formulation.

MINIMUMERRORSUM. While MINIMUMUNIFORMERROR looks for a solution where the uni- form error term ε is minimized, here we introduce an individual error term εi for each P σi , and minimize the sum i εi .

n X minimize εi i =1 m X subject to: σi εi Ai ,j x j σi + εi i = 1...n − ≤ j =1 ≤

+ x j Z 0 j = 1...m ∈ ∪ { }

Using these error terms allow different error for each peptide. We note that error terms εi can be weighted by “detectability” that would penalize differently any discrep- ancies for different peptides. We discuss such extensions in Section 3.5.

Each of these four problems captures a different aspect of protein quantification, and requires a different algorithmic technique. Although some problems clearly model the mass spectrometry data more realistically than others, the theoretical results build up progressively, and thus we shall discuss each problem in this chapter.

3.2 Hardness of Protein Quantification Problems

In this section, we study the computational hardness of the problems posed in Sec- tion 3.1. To show the hardness of a given problem, we need to find a problem that is known to be hard, and reduce its instances to our problem of interest. For that purpose, 3.2. Hardness of Protein Quantification Problems 51

we consider the following two NP-hard problems3.

SETCOVER. [56, 86] Given a set S of elements and a collection of sets W over the ground set S, choose a minimum cardinality sub-collection of W that covers all elements in S.

EXACTCOVER. 56, 86 Given a collection W of sets over elements S, find a subset W [ ] 0 ⊆ W such that each element in S belongs to exactly one set in W . 0

Before showing the hardness of our four problems, we first consider the original for- mulation; Problem 3.1 is simply a system of linear equations, but we can define an inte- ger program by adding an objective function as in MULTICOVER.

Problem 3.2.

m X minimize x j j =1 m X subject to: Ai ,j x j = σi i = 1...n j =1 ·

x j 0 j = 1...m ≥ x j Z ∈

The hardness of this problem is evident from EXACTCOVER. Due to unavoidable er- rors in MS measurements, this problem is of less practical interest, and we refrain from further discussing algorithmic approaches.

Theorem 3.3. Problem 3.2 is NP-hard.

Proof. Take an instance Π of EXACTCOVER, and transform into an instance φ(Π) of Prob-

lem 3.2 by assigning σi = 1 for all 1 i n. If φ(Π) finds a solution of any size, then Π ≤ ≤ has a solution for EXACTCOVER. Otherwise, Π has no solution for EXACTCOVER. 3 See Section 3.6 for inapproximability results on these problems. 52 Chapter 3. Protein Quantification with Shared Peptides

Next, the hardness of the multicover problem is well known [140].

Theorem 3.4. [140] MULTICOVER is NP-hard.

Proof. Consider the decision problem of MULTICOVER(K). Take an instance Π of SET-

COVER(K), and cast into an instance φ(Π) of MULTICOVER(K), where σi = 1 for all i . If Π contains a solution of size k , then the corresponding chosen sets form a solution for φ(Π) of size k . Conversely, if φ(Π) has a multicover of size k , then there are at most k x j ’s with positive values. These x j ’s correspond to a set cover of size k for Π.

The hardness of MINIMUMPROTEINTYPES comes from a reduction from SETCOVER.

Theorem 3.5. MINIMUMPROTEINTYPES is NP-hard.

Proof. Take an instance of set cover Π, and cast into an instance φ(Π) of MINIMUMPRO-

TEINTYPES by assigning the coverage requirement σi = 1 for each element. If Π admits a set cover of size k , then the corresponding chosen proteins form a solution for φ(Π) of size k . Conversely, if φ(Π) admits a solution of size k , the proteins chosen with positive multiplicity correspond to a set cover of size k in Π.

The error minimization problems can be shown to be hard by reductions from EX-

ACTCOVER.

Theorem 3.6. MINIMUMUNIFORMERROR is NP-hard.

Proof. Take an instance Π of EXACTCOVER. Then, cast into an instance φ(Π) of MINIMU-

MUNIFORMERROR by assigning σi = 1 for 1 i n. If φ(Π) contains a solution with ε = 0, ≤ ≤ then Π contains an exact cover. Otherwise, Π contains no solution.

A similar proof yields hardness for MINIMUMERRORSUM.

Theorem 3.7. MINIMUMERRORSUM is NP-hard. 3.3. Approximation Algorithms for Protein Quantification 53

3.3 Approximation Algorithms for Protein Quantification

In this section, we design approximation algorithms for the formulated problems. While

MULTICOVER and MINIMUMPROTEINTYPES are both covering problems, MINIMUMUNIFORMER-

ROR and MINIMUMERRORSUM share a common theme of minimizing the error terms. For the covering problems, we obtain O(logn) approximation guarantees, and we design randomized algorithms for the error minimization problems.

3.3.1 An Algorithm for Multicover

The MULTICOVER problem formulated in Section 3.1 is an unconstrained version: the solution vector x may contain any nonnegative integers. In other words, each protein may be picked multiple times in order to meet the coverage requirement σ. For the purpose of approximation algorithms, however, we can instead focus on the constrained 0 1 problem, where each protein can be either picked once, or not picked at all. − Lemma 3.8. If there is an α-approximation algorithm for constrained multicover, then there is an α-approximation algorithm for MULTICOVER.

Proof. Consider the unconstrained multicover problem. We first take the LP relaxation of the integer program, and solve the LP in polynomial time to obtain the fractional opti- mum solution, denoted by OPTf . From the fractional optimum OPTf : x = (x1,x2,...,xm ), each x j can be split into an integral part I j and a fractional part Fj , where 0 < Fj < 1.

Then, OPTf = I + F . Now consider the “residual” IP: 54 Chapter 3. Protein Quantification with Shared Peptides

m X minimize x j0 (3.1) j =1 m X subject to: Ai ,j x j0 σi0 i = 1...n j =1 · ≥

0 x j0 1 j = 1...m ≤ ≤

x j0 Z ∈ where m X σi0 = max σi Ai ,j I j , 0 { − j =1 } to make sure σi0 remains nonnegative. Then F = (F1, F2,...) forms an optimal solution for the LP relaxation of the above residual IP (since otherwise we can replace it with a better solution for the residual IP and improve OPTf ).

Now, suppose we have an approximate solution A that is at most α F for some α > 1. · Then our overall solution is:

I + A I + αF α(I + F ) = α(OPTf ) α OPT ≤ ≤ ≤ ·

Therefore, from the standpoint of approximation algorithms, we can focus on the constrained multicover problem. Due to their resemblance, standard techniques for set cover can be applied with similar approximation guarantees, including the simple greedy algorithm (which iteratively picks the set containing the largest number of un- covered elements), or the LP rounding technique using the residual solution F .

3.3.2 An Algorithm for Minimum Protein Types

In Section 3.2, we showed that MINIMUMPROTEINTYPES is NP-hard via a reduction from set cover. While MINIMUMPROTEINTYPES is formulated as a nonlinear integer program, 3.3. Approximation Algorithms for Protein Quantification 55

a harder class of problems in general, our problem is in fact equivalent to the set cover problem.

Theorem 3.9. Any algorithm for SETCOVER can be used to find a solution for MINIMUMPRO-

TEINTYPES of the same size.

Proof. Let Π denote an instance of MINIMUMPROTEINTYPES. Now consider the corre- sponding set cover instance φ(Π) constructed by replacing the coverage requirement

with σi = 1 for each element i . Suppose φ(Π) contains a set cover of size k . Take the cho-

sen sets in φ(Π), and assign the corresponding x j ’s with arbitrarily large values so that

each coverage requirement σi is satisfied. Since the value of the optimization function

is the number of x j ’s with positive values, we have a solution for MINIMUMPROTEINTYPES of value k .

This implies that any approximation algorithm for SETCOVER can be used to compute a solution for MINIMUMPROTEINTYPES with the same approximation guarantee, and the inapproximability results for SETCOVER also hold for MINIMUMPROTEINTYPES. See Sec- tion 3.6 for inapproximability results on SETCOVER.

Finding the protein abundance σ for Minimum Protein Types. Observe that, in the proof of Theorem 3.9, our algorithm assigns arbitrarily large values to the x j ’s chosen by the SETCOVER algorithm. In practice, however, such arbitrarily large values would result in solutions undesirable for biological purposes. To resolve this, we first run the SET-

COVER algorithm to choose the smallest set of proteins (which would be assigned posi- tive abundances), and then remove the unchosen proteins from the bipartite graph G .

Then, we apply the algorithm for MULTICOVER on the residual bipartite graph. This way, one can be sure that we choose the smallest possible number of proteins, and further, assign the smallest possible abundances to the chosen proteins. 56 Chapter 3. Protein Quantification with Shared Peptides

3.3.3 An Algorithm for Minimum Uniform Error

We reduce the problem to a 0,1 problem. { }

Lemma 3.10. If there is an α-approximation algorithm for constrained MINIMUMUNI-

FORMERROR, then there is an α-approximation algorithm for unconstrained MINIMUMU-

NIFORMERROR.

Proof. Let’s first spell out the original problem:

minimize ε (ILP1) m X subject to: σi ε Ai ,j x j σi + ε i = 1...n − ≤ j =1 ≤

+ x j Z 0 j = 1...m ∈ ∪ { } which we shall call ILP1. Now, consider the LP relaxation of ILP1, namely LP1. Then, by definition, OPT(LP1) OPT(ILP1). Each value assigned to x j in OPT(LP1) is a fractional ≤ value, which can be split into:

OPT(LP1) = I + F

where 0 < Fj < 1 for all j . We can then keep the solution given by I , and consider the residual problem given by the following integer program, called ILP2:

minimize ε (ILP2) m X subject to: σi0 ε Ai ,j x j σi0 + ε i = 1...n − ≤ j =1 ≤

0 x j 1 j = 1...m ≤ ≤ x j Z j = 1...m ∈

Pm where σi0 = max σi j 1 Ai ,j I j , 0 . { − = } 3.3. Approximation Algorithms for Protein Quantification 57

Now, consider the optimum solution for the residual integer program, namely OPT(ILP2).

By definition, we have the linear relaxation LP2 such that OPT(LP2) OPT(ILP2). For ≤ OPT(LP2), we claim that F OPT(LP2), and F is a feasible solution. To show F OPT(LP2), ≤ ≤ suppose otherwise. Then, OPT(LP2) can be pasted onto I to form:

I + OPT(LP2) < I + F OPT(LP1) ≤ contradicting the optimality of OPT (LP1).

Finally, suppose there is an approximate solution A for OPT(LP2) that is at most α F · for some α > 1. Then our overall solution is:

I + A I + α F α(I + F ) = α OPT(LP1) α OPT(ILP1) ≤ · ≤ · ≤ ·

Therefore, the overall approach would yield an α approximation algorithm. −

We thus focus on the constrained 0 1 problem. We can design a randomized round- − ing scheme where the fractional solution F is used as a probability, as shown in Algo- rithm 3.3.1.

Algorithm 3.3.1: Randomized Rounding Scheme for MINIMUMUNIFORMERROR Input: Fractional solution vector F from OPT(LP1). Output: Binary solution vector x.

for j = 1,...,m do p j Fj (each 0 < Fj < 1 is treated as a probability.); Flip← a p j -biased coin; if heads then

x j 1; end ← else

x j 0; end ← end Return x. 58 Chapter 3. Protein Quantification with Shared Peptides

P To analyze the rounding scheme, we define a random variable Yi x j where = (i ,j ) E ∈ x j 0,1 is the result of tossing a coin for each j with probability p j (a Poisson trial). By ∈ { } linearity of expectation, we have:

X E [Y ] = E [Y1] + E [Y2] + ... = E [p j ] = τi . (i ,j ) E ∈ P where τi Fj , the coverage achieved by F . Since the fractional solution F has at = (i ,j ) E ∈ most the optimum error from the integer program, the expected solution E [Y ] is opti- mal.

We now show that Algorithm 3.3.1 gives a small tail probability.

Lemma 3.11. Algorithm 3.3.1 returns a solution such that

6logn 1 Pr [Yi > (1 + δ)τi + ] < δ2 n 2 and 4logn 1 Pr [Yi < (1 δ)τi 2 ] < 2 − − δ n

Proof. To estimate the tail probability, we apply the Chernoff bound [29, 104] on Yi . For the bound on the upper tail, we have:

2 τi δ /3 Pr [Yi > (1 + δ)τi ] < e − and for the bound on the lower tail:

2 τi δ /2 Pr [Yi < (1 δ)τi ] < e − −

We show each case separately. 3.3. Approximation Algorithms for Protein Quantification 59

Case I. (Upper tail): Suppose 6logn . Then it follows that: τi δ2 ≥ 6logn τi 2 ≥ δ 2 τi δ 2logn ⇒ 3 ≥ τ δ2/3 2 loge i log(n ) ⇒ ≥ 2 1 τi δ /3 e − 2 ⇒ ≤ n and it follows that

1 Pr Y 1 if 6logn (1) [ i > ( + δ)τi ] < 2 τi δ2 n ≥

Now consider the case 6logn . We introduce an absolute error D such that 1 τi < δ2 ( +

δ)τi = τi + D. Then we can express as D = δτi , and rewrite the assumption with the additive error D:

6logn 6logn τi < 2 τi < 2 δ ⇒ (D/τi ) p D < τi 6logn ⇒ ·

Moreover, the Chernoff bound for the upper tail now becomes:

D2 3τ Pr [Yi > τi + D] e − i ≤ 2 (pτi 6logn) 3τ < e − i (by assumption) 1 e 2logn = − = n 2

Rewriting D in δ gives p 6logn D < τi 6logn = 2 , · δ 60 Chapter 3. Protein Quantification with Shared Peptides

and thus we obtain

6logn 1 6logn Pr [Yi > τi + ] < if τi < 2 (2) δ2 n 2 δ

Combining (1) and (2), we have

6logn 1 Pr [Yi > (1 + δ)τi + ] < . δ2 n 2

Case II (Lower tail): Consider the lower tail as below.

2 τi δ /2 Pr [Yi > (1 δ)τi ] < e − −

4logn 2 1 First, suppose . Rearranging this gives e τi δ /2 , and it follows that τi δ2 − n 2 ≥ ≤ 1 Pr Y 1 if 4logn (3) [ i < ( δ)τi ] < 2 τi δ2 − n ≥

Now consider the case 4logn . As before, let D be an additive error such that τi < δ2

(1 δ)τi = τi D. Then we have D = δτi , and it follows that − − 4logn 4logn τi < τi < δ2 D 2 ⇒ ( τi ) D2 < 4logn ⇒ τi p D < τi 4logn ⇒ 3.3. Approximation Algorithms for Protein Quantification 61

Now, the Chernoff bound can be rewritten:

2 τi δ /2 Pr [Yi < τi D] < e −

− D2 2τ = e − i 2 (pτi 4logn) 2τ < e − i 1 e 2logn = − = n 2

Rewriting D in δ gives p 4logn D < τi 4logn = 2 , · δ and thus we obtain

4logn 1 Pr Y if 4logn (4) [ i < τi 2 ] < 2 τi < δ2 − δ n

Combining (3) and (4), we have

4logn 1 Pr [Yi < (1 δ)τi 2 ] < 2 − − δ n

As a result, for each si , flipping a coin independently with probability p j for each neighbour w j results in a value within the error bound of

4logn 6logn [(1 δ)τi 2 , (1 + δ)τi + 2 ] − − δ δ with high probability for both the upper and lower tail. We can then put all the si ’s together to build a bound on the randomized algorithm.

Theorem 3.12. The randomized rounding scheme finds a solution with optimal expected value for MINIMUMUNIFORMERROR with high probability.

Proof. We use the union bound to put all si ’s together. Consider the upper tail: for each i = 1...n, let Ai denote the event that Yi is greater than the specified bound. By 62 Chapter 3. Protein Quantification with Shared Peptides

1 Lemma 3.11, Pr [Ai ] n 2 for each Ai . By the union bound, we have: ≤ X 1 1 Pr A Pr A n [ i i ] [ i ] = n 2 = n ∪ ≤ i ·

1 Hence, the probability that no si falls above the bound is 1 n . −

Similarly, for the lower tail, the probability that any si falls below the specified bound

1 1 is n . Hence, the probability that no si falls below the bound is 1 n . −

3.3.4 An Algorithm for Minimum Error Sum

Here we discuss the MINIMUMERRORSUM problem where we allow independent error term εi for each peptide abundance σi . As before, we focus on the constrained 0 1 − problem via LP relaxation.

Lemma 3.13. If there is an α-approximation algorithm for constrained MINIMUMERROR-

SUM, then there is an α-approximation algorithm for unconstrained MINIMUMERRORSUM.

Proof. Similar to the proof of Lemma 3.10.

We can thus focus on the constrained, 0 1 problem. The randomized rounding − scheme for MINIMUMUNIFORMERROR shown in Algorithm 3.3.1 can be applied here, too.

Theorem 3.14. There is a randomized rounding scheme that finds a solution with opti- mal expected value for MINIMUMERRORSUM with high probability.

Proof. Similar to the proofs of Lemma 3.11 and Theorem 3.12. 3.4. Experimental Results 63

3.4 Experimental Results

In this section, we discuss how our algorithms for the protein quantification problem have been tested empirically against simulated data as well as biological data from AP- MS experiments.

3.4.1 Performance on Simulated Data

String Trypsinization. To generate our simulated test data, we took a collection of 1000 random strings of length 150 over the alphabet of size 4. The abundance of each string w j was then set to a random value between 1 and 100. We call the resulting mul- tiset of strings xˆ. Then, we cut the strings at random into approximately 30 fragments to simulate the trypsinization of proteins. The string fragments are counted to form the peptide abundance σ for this input string set.

We acknowledge that the parameters chosen for the generation of protein / peptide data are not based on a realistic model of protein sequences (e.g. alphabet of size 20, realistic length for protein sequences, realistic trypsinization process, etc.). However, note that the performance of our approaches depends directly on the edge density of the bipartite graph constructed from the strings. The chosen parameters ensure that each protein consists of up to 30 peptides, and that the edge density of the resulting bipartite graph is similar to that seen in the real AP-MS data.

We first ran our algorithms on the synthetic data, and compared the results against the protein abundance of the original set of strings xˆ. Table 3.1 summarizes the results of these comparisons. However, because each problem has its own distinct objective function, comparing the output of our algorithms based on any one objective function is not very meaningful. Furthermore, the abundance xˆ of the original set of strings may not be “optimal” for the input vector σ depending on the objective. In fact, as shown in

Table 3.1, the original abundance vector xˆ does not achieve the optimum objective value 64 Chapter 3. Protein Quantification with Shared Peptides

Problem Objective Problem Objective Original xˆ 6439 Original xˆ 0 MULTICOVER MINIMUMUNIFORMERROR Our result 5371 Out result 74 Original xˆ 1000 Original xˆ 0 MINIMUMPROTEINTYPES MINIMUMERRORSUM Our result 887 Our result 782

Table 3.1: Output quality of the algorithms compared to the original abundance vector xˆ (Objective indi- cates the value from the corresponding objective function). Note that in MULTICOVER and MINIMUMPRO- TEINTYPES, our solution achieves better objective values than the original abundance of strings, while in the error minimizing problems (i.e. MINIMUMUNIFORMERROR,MINIMUMERRORSUM) the original string abundances produces no error.

for MULTICOVER and MINIMUMPROTEINTYPES, while our approximate algorithms provide slightly better solutions than xˆ.

Consequently, we instead base our performance analyses on the distance between the proposed solutions and the original abundance x. Let x,x denote two solution vec- ˆ 0 tors. Then, we define the distance between x and x as the L -distance: 0 1

X d (x,x 0) = xi xi0 i | − |

where xi denotes the i -th component of x. We can interpret this L 1-distance between our proposed solution x and the original abundance xˆ as the error present in our pre- diction.

Robustness Analysis. When Gentleman and Huber [60] discussed errors that are preva- lent in proteomics, they classified the measurement errors as two distinct types: stochas- tic errors and systematic errors. Stochastic errors have random variability, and thus can be reduced by simply repeating the experiment. On the other hand, systematic errors are recurrent in the measurements, which makes them difficult to identify and remove. From the perspective of the protein quantification problem, stochastic errors are em- bedded into the peptide abundance σ. In our formulation of the problems, MINIMUMU-

NIFORMERROR and MINIMUMERRORSUM in particular, we attempt to remove the stochas- tic errors by looking for a solution with the smallest error ε. 3.4. Experimental Results 65

5500

5000

4500

4000

3500

3000

Errors 2500

2000 MUE MC 1500 MES

1000

500 MPT 0 20 30 40 50 60 70 80 90 100

Coverage (%)

Figure 3.2: Robustness of the algorithms for MPT (MINIMUMPROTEINTYPES), MC (MULTICOVER), MES (MIN- IMUMERRORSUM) and MUE (MINIMUMUNIFORMERROR). Errors indicate the L 1 distance from the original protein abundances xˆ (averaged over 5 distinct partial datasets).

On the other hand, because some peptides are notoriously difficult to identify, and the sensitivity of mass spectrometers is relatively poor, the mass spectrometry data is far from comprehensive. As a result, certain peptides (and their abundances) are missing in data, and form a systematic error that are recurrent throughout the dataset. We test our algorithms against such systematic errors by analyzing the robustness of the algorithms.

In order to test the robustness of our algorithms, we randomly selected and removed4 a specified proportion of constraints before running the algorithms. For example, if we remove 25% of the constraints (the peptide abundance σ) from the system, solving the problem with the remaining 75% of the constraints can simulate the problem of quanti- fying the protein abundance with 75% sensitivity of the MS data.

After removing a fixed proportion of the input data (peptide abundance σ), we ran our algorithm for each problem and compared the results in terms of the errors pro-

4 We note that, by removing a particular peptide abundance σi , we simply remove the corresponding constraint in the system rather than setting σi = 0 for that peptide. 66 Chapter 3. Protein Quantification with Shared Peptides

duced as estimated by the metric d (x,xˆ). Figure 3.2 shows the result of this comparison for different input coverage (i.e. percentage of the remaining data). Among the four dif- ferent problem formulations, MINIMUMPROTEINTYPES showed a closest fit to the original abundance xˆ across the different input coverage, while MINIMUMERRORSUM and MUL-

TICOVER showed similar performance until the input coverage fell below 50%. Further- more, when comparing the amount of errors produced between input coverage at 50% and at 100%, the algorithm for MINIMUMPROTEINTYPES is shown to be the most robust among the four problem formulations.

3.4.2 Protein Quantification from AP-MS Data

To test our algorithms on real biological data, several datasets have been acquired from AP-MS experiments by our collaborators5. Spectral count is used to estimate the abun- dance of peptides, as it measures peptide concentrations in protein mixtures based on the number of tandem mass spectra obtained for each peptide [22, 30, 100]. Recall that each AP-MS experiment consists of a bait protein, together with peptides from its in- teracting partners, the prey proteins. We obtained the dataset from 3 separate mass spectrometry runs for 3 bait proteins, RPAP3, CDK9, and POLR2A. Since each of these datasets is produced by a distinct mass spectrometry experiment, we treat them as dif- ferent samples.

As shown in Table 3.2, the three datasets from AP-MS experiments contain approxi- mately 2000 peptides that belong to 1000 proteins. However, when restricted to only the connected components (in the bipartite graph) containing shared peptides, the dataset contains approximately 500 peptides over 100 proteins.

While the nontrivial portion of the problem is smaller in terms of the number of peptides and proteins, the corresponding bipartite graphs are much denser due to the peptides shared by many proteins. In fact, peptides in our datasets are shared by as

5 Our test data was generously provided by Benoit Coulombe’s lab at Institut de recherches cliniques de Montréal (IRCM), and consists of human PPI data from AP-MS experiments. 3.4. Experimental Results 67

Number CDK9 POLR2A RPAP3 proteins 1593 1386 901 Overall peptides 2592 2861 1689 components 1320 1211 703 proteins 125 70 125 Non-trivial peptides 672 525 589 components 21 19 26

Table 3.2: Network analysis of prey proteins from AP-MS experiments. Overall indicates the total number of proteins, peptides, and connected components for each dataset, while non-trivial indicates only the connected components containing shared peptides.

many as 19 proteins. For example, Figures 3.3 through 3.5 show the bipartite graphs for the dataset with CDK9, POLR2A, RPAP3 as the bait, respectively.

We executed our algorithms for MINIMUMPROTEINTYPES and MINIMUMERRORSUM for the nontrivial components (containing shared peptides) of each dataset. Furthermore, for smaller connected components with fewer than 20 proteins, we also computed exact solutions to the corresponding integer programs6 , and found the results to be promis- ing. In the case of MINIMUMPROTEINTYPES, both the integral optimum solution and our approximate solution often picked the same set of proteins (FP 10%, FN 15%), with ∼ ∼ only marginal differences in the abundance. In the case of MINIMUMERRORSUM, the sum P of errors ( i εi ) in our approximate solution differs from that of integral optimum solu- tion by at most the maximum peptide abundance value (σi ) in that component. While a pleasant result, this is no surprise since we obtain the significant part of our protein abundance from the optimum of LP relaxation.

Within our dataset, there were various instances where a group of peptides are shared by a large number of proteins. In such instances, our algorithm for MINIMUMPROTEIN-

TYPES picked out only a few proteins as constituent proteins. For example, Figure 3.6 shows a connected component from CDK9 dataset, where all the peptides are shared by variants of the histone family. Among the 14 proteins that share the peptides in this component, only 3 proteins were selected with positive abundance values. This partic-

6 Gurobi Optimizer (http://www.gurobi.com/) was used to solve the integer programs as well as their LP-relaxations. 68 Chapter 3. Protein Quantification with Shared Peptides

SLDLDSIIAEVK,AEAESLYQSK, STMQELNSR, LLQDFFNGR, NKPIPALR,EDNIQAKEENMDTSNTSISK, DAPQPYELNTAINCRDEVVSPLPSALQGPSGSLSAPPAASVISAPPSSSSR,GFQRPVYLFHK,DEVVSPLPSALQGPSGSLSAPPAASVISAPPSSSSR, RIWNWR,MPIEGSENPERPFLEK, VLGTSPEAIDSAENR,TLGVDLVALATR, YLTVATVFR, TPEVTCVVVDVSHEDPEVKFNWYVDGVEVHNAK,TPEVTCVVVDVSHEDPEVK, QLWWRR, NSTIPTK, KEYLALQK, EQLENSPSR, HNSVEDSVTK, THTCPPCPAPELLGGPSVFLFPPKPK, LLRDYQELMNTK, DIENQYETQITQIEHEVSSSGQEVQSSAK, LVNHFVEEFKR, STLEPVEK, FLREQIEK, GGGGTELGPPAPPRPR, YYGYRNPSCEDGR, IDRWFLHR, NERPDGVLLTFGGQTALNCGVELTK, THNLEPYFESFINNLR, AIEFLNNPPEEAPR, IKFEMEQNLR, RGGGGTELGPPAPPRPR, TYSLSSSFSSSSSTR, WYFTR, QTQIFTTYSDNQPGVLIQVYEGER, GFAFVEFETKEQAAK, HLRPGGILVLEPQPWSSYGKR, GHNQPCLLVGSGR, EATAGNPGGQTVR, AEDTAVYYCAR, AQYEDIAQK, FSSSSGYGGGSSR, ATAGDTHLGGEDFDNR, DRVEASSLPEVR, STSGGTAALGCLVK, MKEIAEAYLGYPVTNAVITVPAYFNDSQR, TLNAETPK, TEQLYSQK, GGGGGGYGSGGSSYGSGGGSYGSGGGGGGGR, LEILSGDHEQR, NGYQPHRPPGGGGGK, WLSSQPSFKLEPTQGHR, DHYIAAQVEQQHK,LEPTQGHR, QGVDADINGLR, CQEVISWLDANTLAEKDEFEHK, GFAFVEFETK, KLETLDLDVR, LFVEALGQIGPAPPLK, MSSTFIGNSTAIQELFKR, VEILANDQGNR, ELSDLESAR, IMNTFSVMPSPK, SCDKTHTCPPCPAPELLGGPSVFLFPPKPK, SLVNLGGSK, QEIECQNQEYSLLLSIK, DNNLLGR, GQHHQQQQAAGGSESHPVPPTAPLTPLLHGEGASQQPR, NSCNVGGGGGGFK, AILVDLEPGTMDSVR, FKTPEDAQAVINAYTEINK, AASSKPEEIKMR, SPVGLSSDGISSSSSSSR,VHAAADKHNSVEDSVTK, EIVHIQAGQCGNQIGAK, AAAIGIDLGTTYSCVGVFQHGK, WSNWEIPVSTDGK, HHGPISTTPGIIPQK, MASTFIGNSTAIQELFK, FNWYVDGVEVHNAK, SLNNQFASFIDKVR, LLQDFFDGRDLNK, HSPA6, SRPTSEGSDIESTEPQK, IINEPTAAAIAYGLDR, YVAPPSLR, INVYYNESSSQK, LASYLDKVQALEEANNDLENK, HWPFQVINDGDKPK, NPSCEDGRLR, VLKPEWFR, CDFALFLGASSENAGTLGTVAGSAAGLK, IGHM, EENMDTSNTSISK, MSGECAPNVSVSVSTSHTTISGGGSR, EIETYHNLLEGGQEDFESSGAGK, HSPA1A, WLSSQPSFK, ISDHSSVKQEYTHK, SSSLNFSFPSLPTMGQMPGHSSDTSGLSFSQPSCK, GPSVFPLAPSSK, LLQDFFNGK, CGVEADKELSCR, IGHV4-31, ELEQVCNPIISGLYQGAGGPGPGGFGAQGPK, AIEFLNNPPEEAPRKPGIFPK, KLEHVIK, ISVYYNEASSHK, SGAFGHLFRPDNFIFGQSGAGNNWAK,TUBB3, NKLNDLEDALQQAK, ARFEELCSDLFR, KFQYGNYCK, MREIVHIQAGQCGNQIGAK, HSPA1B, TLTETIYKNYYR, SGPFGQLFRPDNFIFGQTGAGNNWAK, IGHG1, VVSVLTVLHQDWLNGK, SRDGYVDISLLVSFNK, SILEQLAEKNFELVINLSMR, VFNTGGAPR, TUBB6, YCGQLQMIQEQISNLEAQITDVR, FWEVISDEHGIDPTGTYHGDSDLQLDR, HSHSQLPVGTGNKRPGDPK, QYISSHNSVFNHPLPPPPPVTYQVGYGHLSTLVK, GPSEETGGAVFDHPAK, MASTFIGNSTAIQELFKR, SKAEAESLYQSK, LSKEEIER, VNATGPQFVSGVIVK, TTPPVLDSDGSFFLYSK, QVLDNLTMEK, IQLKPEQFSSYLTSPDVGFSSYELVATPHNTSK, HSADGIPPTVLR, NQVALNPQNTVFDAKR, ITITNDK, GPAVGIDLGTTYSCVGVFQHGK,LLQDFFNGKELNK, DITDPLSLNTCTDEGHVVLASPLK, CCNT2, GHYTEGAELVDSVLDVVRK, QTQTFTTYSDNQPGVLIQVYEGER,DAGTIAGLNVLR, LALDLEIATYR, KRT1, KRT9, QVLADIAK, GHYTEGAELVDSVLDVVR, VVSVLTVLHQDWLNGKEYK, RPVINAGDGVGEHPTQALLDIFTIR, TPHVLVLGSGVYR, YLTVAAIFR, FWEVISDEHGIDPSGNYVGDSDLQLER, FWEVISDEHGIDPAGGYVGDSALQLER, GTLDPVEK, IWNWR, RWYFTR, SCDTPPPCPR, GGSGGSYGGGGSGGGYGGGSGSR, TTPSYVAFTDTER, STAGDTHLGGEDFDNR, KPGIFPK, AAPPDVGEERR, NIISSTALFLAAKVEEQAR, FPGQLNADLR, GSGGGSSGGSIGGR, LDKAQIHDLVLVGGSTR, FQYGNYCK, FWEVISDEHGIDPTGSYHGDSDLQLER, EIVHIQAGQCGNQIGTK, WYVDGVEVHNAK, HGVQELEIELQSQLSKK, DAGVIAGLNVLR, AQIHDLVLVGGSTR, QGQSQAASSSSVTSPIK, TUBB, HCWKLEILSGDHEQR, TUBB2A, TNAENEFVTIK, LVNHFVEEFK, MVNHFIAEFK, NSSYFVEWIPNNVK,LAVNMVPFPR, ARFEELNADLFR, ILALDCGLK, VPQFSFSR, EAESCDCLQGFQLTHSLGGGTGSGMGTLLLSK, YKAEDEVQR, TPEDAQAVINAYTEINKK, GRDPVEILIPK, HHHHHPLPAAGFKK,QLENMEANVK, VAHACLHPLEPLLDTK, FYMIQSFTQFPGNSVAPAALFLAAK, FPGQLNADLRK, GSYGSGGSSYGSGGGSYGSGGGGGGHGSYGSGSSSGGYR, MTLDDFR, NQVALNPQNTVFDAK, FNKNIISSTALFLAAK, TUBB8, GFYPSDIAVEWESNGQPENNYK, IGHG3, NYSPYYNTIDDLKDQIVDLTVGNNK, LARP7, DTLAAISEVLYVDLLEGDTECHAR, MEPCE, CCNT1, CAD, NQVSLTCLVK, TPEVTCVVVDVSHEDPEVQFK, TVTNAVVTVPAYFNDSQR, KEWESCDCLQGFQLTHSLGGGTGSGMGTLLISK, GFYPSDIAVEWESNGQPEDNYK, THNLEPYFESFINNLRR, QSLEASLAETEGR,ADLEMQIESLTEELAYLK, IINEPTAAAIAYGLDKK, EQLENTPSRR, INVYYNEAAGNKYVPR, NMMAACDPR, IREEYPDR, GGGPQAQSHGEAR, QETSLSGSQYNINFQQGPSISLHSGLHHRPDK,CTQLVR, IMNTFSVVPSPK, TPEVTCVVVDVSHEDPDVQFKWYVDGVEVHNAK, IRLENEIQTYR, VEIIANDQGNR, VEASSLPEVR, HYLSEELR, FGVDPDKELSYR, AKHAEELAAQK, LSLDDLLQR, FSSCGGGGGSFGAGGGFGSR, KGPAAIQK, TIDDLKNQILNLTTDNANILLQIDNAR, HSPA8, NIISSTALFLAAK,EQLENTPSR, TLHEWLQQHGIPGLQGVDTR, FLEQQNQVLQTK, GGSGGSHGGGSGFGGESGGSYGGGEEASGSGGGYGGGSGK, KLTTDGK, LNVSQLTINTAIVYMHR, ISEQFTAMFR, SISISVAR,VDLLNQEIEFLK,TAAENDFVTLKK, ALEESNYELEGK, VCNPIITK, MSATFIGNSTAIQELFK, LHFFMPGFAPLTSR, LENEIQTYR, GHYTEGAELVDAVLDVVR, IGHG4, TPLGDTTHTCPR, VQVEYK, CAPSAGSPAAAVGR, SCFPASLTASR, GGSGGGGSISGGGYGSGGGSGGR, YLDGLTAER, WELLQQVDTSTR, SGGGGGGGLGSGGSIR, SSSEDAESLAPR, SGNTDKPRPPPLPSEPPPPLPPLPK, HSSQTSNLAHK, QISNLQQSISDAEQR, HGVQELEIELQSQLSK, LLDTIGISQPQWR, VLGTPVETIELTEDRR, QVDFWFGDANLHK, TLLEGEESR, SKELTTEIDNNIEQISSYK, GGGGSFGYSYGGGSGGGFSASSLGGGFGGGSR, TAVCDIPPR, NSKIEISELNR, QEYEQLIAK, VQVEYKGETK, MSATFIGNSTAIQELFKR, SGPFGQIFRPDNFVFGQSGAGNNWAK, TNAENEFVTIKK, FSSSGGGGGGGR,DAEAWFNEK, ILDKCNEIINWLDK, DPVEILIPK, AAPPDVGEER, TUBB4B, GHYTEGAELVDAVLDVVRK, RSTSSFSCLSR,FGGFGGPGGVGGLGGPGGFGPGGYPGGIHEVSVNQSLLQPLNVK,IEISELNR, QVDFWFGDANLHKDR, TTPPVLDSDGSFFLYSR, QQAANLLQDMGQR, VAHTCLHPQESLPDTR, MAVTFIGNSTAIQELFK, EVDEQMLNVQNK, TKPYIQVDIGGGQTK, MKEIAEAYLGK, IIQKDIIK, WFESSGIHVAALVVGECCPTPSHWSATR, LAADFSVPLIIDIK, VTHAVVTVPAYFNDAQR, SQIHDIVLVGGSTR, GRDVLDLGCNVGHLTLSIACK, KRT2, VLDELTLTQADLEMQIESLTEELAYLKK, VTMQNLNDR, KTLTETIYK, MAATFIGNSTAIQELFK, IGH@, AETECQNTEYQQLLDIK, LTPEEIER, TVYVELLPK, IGHG2, GSSSGGGYSSGSSSYGSGGR, GGGFGGGSSFGGGSGFSGGGFGGGGFGGGR, SINPDEAVAYGAAVQAAILSGDK,FEELNADLFR, MRIPVAGGDK, EAESCDCLQGFQLTHSLGGGTGSGMGTLLISK, YEELQITAGR, LDKSQIHDIVLVGGSTR, RGPSEETGGAVFDHPAK, INVYYNEATGGK, TUBB4A, STSESTAALGCLVK, IEWLESHQDADIEDFKAK, SSAVVELDLEGTR, NSCNVGGGGGGFKHPAFK, HHHHHPLPAAGFK, VIECNVR, VYFLPITPHYVTQVIR, VKQVLADIAK, LTTPTYGDLNHLVSATMSGVTTCLR, GPSVFPLAPCSR, LASYLDK, DNHLLGTFDLTGIPPAPR, GFSSGSAVVSGGSR, GGSISGGGYGSGGGK, YCVQLSQIQAQISALEEQLQQIR, ISSVLAGGSCR, HSHSQLPVGTGNK, HAEELAAQKR, KRT10, ADLEMQIESLTEELAYLKK, TQEKVNATGPQFVSGVIVK, DAPQPYELNTAINCR, ESPGAAATSSSGPQAQQHR, KRT77, LASYLDKVR, VVSVLTVVHQDWLNGKEYK, HKMGEEVIPLR, LVQNGTEPSSLPFLDPNARPLVPEVSIK, LLESLGYSLYASLGTADFYTEHGVK, ITITNDQNR, WELLQQMNVGTRPINLEPIFQGYIDSLKR,SKEEAEALYHSK, RSRPTSEGSDIESTEPQK, HYLSEELRLPPQTLEGDPGAEGEEGTTTVR, LPPQTLEGDPGAEGEEGTTTVRK, EQLENSPSRR, TYSLSSSFSSSSSTRK, MAVTFIGNSTAIQELFKR, VVSVLTVVHQDWLNGK, NLDLDSIIAEVK, VLYDAEISQIHQSVTDTNVILSMDNSR,YQELQITAGR, LLEGEECR, VNATGPQFVSGVIVKIISTEPLPGR, MALLATVLGRF, VTAVDWHFEEAVDGECPPQR, FWEVISDEHGIDPTGTYHGDSDLQLER, CCVECPPCPAPPVAGPSVFLFPPKPK, ELTTEIDNNIEQISSYK, NELESYAYSLK, CGNVVYISIPHYK, LPPQTLEGDPGAEGEEGTTTVR, HRGQHHQQQQAAGGSESHPVPPTAPLTPLLHGEGASQQPR, HAEELAAQK, VSLKEYR, YENEVALR, IISTEPLPGRK, MREIVHLQAGQCGNQIGAK, KCCVECPPCPAPPVAGPSVFLFPPKPK,TPEVTCVVVDVSHEDPEVQFNWYVDGVEVHNAK, HSPA5, SRPTSEGSDIESTEPQKQCSK, MVGLDIDSR, DVLDLGCNVGHLTLSIACK, AASSKPEEIK, SQYAYAAQNLLSHHDSHSSVILK, LALGIPLPELR, KVLILGSGGLSIGQAGEFDYSGSQAIK, EIVHLQAGQCGNQIGAK,YLTVAAVFR, WTLLQEQGTK, EYLALQK, IISTEPLPGR, WVHLNWGDEGLK,WVHLNWGDEGLKR, EIEYEVVR, IEIESFYEGEDFSETLTR, NVNHSWIER, HLRPGGILVLEPQPWSSYGK, IPVAGGDKAASSKPEEIK,TSENLALTGVDHSLPQDGSNAFISQK, HPQPGAVELAAK, QLDSIVGER, KRT16, VLDELTLAR, AQYEEIAQR, NVQALEIELQSQLALK, ALEEANADLEVK, KRT14, IKEWYEK, IDTRNELESYAYSLK, LALDVEIATYR, NLSDVATKQEGLESVLK, KRT6B, KLVEGEEGR, LVQLLVK, VILEIDNARLAADDFR, LVQNCLWTLR, KRT15, SQIFSTASDNQPTVTIK, QNLEPLFEQYINNLRR, KRT4, LKYENEVALR, KRT17, QSVEADINGLRR, SQYEQLAEQNRK, LNTIPLFVQLLYSSVENIQR, YEELQQTAGR, KRT6C, HPEAEMAQNSVR, TWNDPSVQQDIK, TAAENEFVALK, GSLGGGFSSGGFSGGSFSR, KRT5, KRT6A, KRT75, NVSTGDVNVEMNAAPGVDLTQLLNNMR, IINEPTAAAIAYGLDKR, TAAENEFVALKK,FASFIDK, KRT80, SLLEGEGSSGGGGR,VLDELTLTK, EGLLAIFK, ITPSYVAFTPEGER, NQLTSNPENTVFDAK, VSLAGACGVGGYGSR, NHEEMNALR,TLVTQNSGVEALIHAILR, JUP, LALDIEIATYR, KRT19, LLNDEDPVVVTK, HVAAGTQQPYTDGVR,EQALRMSVEADINGLR, GQLEANLLQVLEKVEEFR, KRT13, TAAENEFVVLKK, KRT8, TMQNTSDLDTAR, LAEPSQLLK, SEIIELNR, NLALCPANHAPLQEAAVIPR,LLNQPNQWPLVK, LALDIDIATYR, LKYDNELALR, KRT3, LAADDFR, KRT7, GRLDSELK,

FLEQQNK, AEIEHAKAQR,

TAAENEFVVLK, LIVQIDNAK, KRT84, KRT38, LVVQIDNAKLAADDFR,

KRT37, QVVSSSEQLQSYQAEIIELR,QLVESDINSLR, LASYLEK, QNQEYQVLLDVK, LNVEVDAAPTVDLNR, KRT33B, LVLQIDNAKLAADDFR, KRT40, QVVSSSEQLQSYQAEIIELRR, QLVESDINSLRR, LANYLEK, LVVQIDNAK, DVDCAYLR, FLEQQNKLLETK, TKYETEVSLR, ETMQFLNDR, CQYETLVENNRR, KRT31,LVVEIDNAK, LASYLTR, LTAEVENAKCQNSK, QNHEQEVNTLR, TIEELQQK, QLERENAELESR, KRT36, ENAELESR, YESELSLR, LTAEVENAK, KNHEEEVNSLR, KRT33A, ARLEGEINMYR, YETEVSLR, SGGVCGPSPPCITTVSVNESLLTPLNLEIDPNAQCVK,ATAENEFVVLK, ATAENEFVVLKK,LGLDIEIATYR, KRT35, ECCQSNLEPLFEGYIETLRR, KRT32, ATAENEFVALKK, EAECVEADSGR, LGLDIEIATYRR, SDLEAQVESLKEELLCLK, LNVEVDAAPPVDLNR, KSDLEANVEALVEESSFLR, LTAEIENAK, DLNMDCIVAEIK, CQLGDRLNVEVDAAPPVDLNR, LVLQIDNAK, RILDELTLCK, CKLAGLEEALQK, SDLERQNQEYQVLLDVR, LQFYQNR, DALESTLAETEAR,QNQEYQVLLDVR, QLVEADINGLR, LLEGEEQRLCEGVGSVNVCVSSSR,KRT86, LYEEEIR, ARLECEINTYR,NHEEEVNSLR, ARLEGEINTYR, ATAENEFVALK, QLVESDINGLRR, LECEINTYR, GLTGFGSR, KRT85, LYEEEIRVLQAHISDTSVIVK, ILDELTLCK, KRT83, LAGLEEALQK, QVVSSSEQLQSCQAEIIELRR, LAGLEEALQKAK, HGETLRR, QLVEADINGLRR, CCITAAPYR, KRT82, QLVESDINGLR, SDLESQVESLREELICLK, RLLEGEEQR, ECCQSNLEPLFEGYIETLR, KRT34, VLQAHISDTSVIVK, WQFYQNQR, LASDDFR, ILDDLTLCK, FAAFIDK, LLEGEEQR, LAELEGALQK, LLETKWQFYQNQR,CCISAAPYR, EYQEVMNSK, FASFINK, AQYDDIASR, QDMACLLK, KDVDCAYLR, LLETKLQFYQNR, DVDCAYLRK, QVVSSSEQLQSCQAEIIELR, KYEEEVALR, AEAESWYR, SRAEAESWYR, YEEEVALR, AKQDMACLLK, LCEGVGSVNVCVSSSR, LASELNHVQEVLEGYKK, LASELNHVQEVLEGYK,

FAAFIDKVR, CKLAELEGALQK, TKEEINELNR, RLYEEEIR,

EGFPITAIR, LDHKFDLMYAK, KHLEINPDHPIVETLR, VDGAYVAVK, MEDENNRLR, NDEELNKLLGGVTIAQGGVLPNIQAVLLPK, VAPDEHPILLTEAPLNPK, LOC100287364, RAB1C, GYISPYFINTSK, NDSPTQIPVSSDVCR, TEEDSEEVREQK, AALLETLSLLLAK,LVLPSLLAALEEESWR, DSTYSLSSTLTLSK, GNLANVIR, NKSTGQMVALK, VILHLKEDQTEYLEER, IDIIPNPQER, HVAYVFQALIYWIK, LRAENLQLLTENELHR, KDFSETYER, DYLELEK, LOC100287478, LOC100287513, RAB3D, RAB4A, VGGTSDVEVNEKK,AAVEEGIVLGGGCALLR, LQAEAQQLRK, HIYYITGETK,ALLFIPR,ELHINLIPNKQDR, ILGLCLLQNELCPITLNR, GQPVAPYNTTQFLMDDHDQEEPDLK, VTNYIFDSLR, EFSFLDILR, TGDEKDVSV, YHTSQSGDEMTSLSEYVSR, MIA-RAB4B, VGKGEPGAAPLSAPAFSLVFPFLK, LSVADSQAEAK, LADQCTGLQGFLVFHSFGGGTGSGFTSLLMER, DQVANSAFVER, DLVILLYETALLSSGFSLEDPQTHANR, NNPLYHAGAVAFSISAGIPK, GQNGDDSSAGGDFPPPAEVEPTPEAELLAQPCHDSEASK, LOC100287205, TFVEKYEK, VDNALQSGNSQESVTEQDSK, YFPTQALNFAFK, GHYTIGK, DLVVLLFETALLSSGFSLEDPQTHSNR, RLQQLQACTGQQSCR, TSGAPGSPQTPPER, TALLDAAGVASLLTTAEVVVTEIPKEEK, CEFQDAYVLLSEKK, TSDSPWFLSGSETLGR, SFNRGEC, TAVAPIER, HNDDEQYAWESSAGGSFTVR, HWKPYYK, LOC728369, LQAEAQQLR, EIIDLVLDR, ELKIDIIPNPQER, NPATTNQTEFER,CDK2, EIIDLVLDRIR, VPDCFQR, RAB37, LLIESLEK, TLLVNCQNK, VYACEVTHQGLSSPVTK, AAYFGIYDTAK, DPYALDLIDKLLVLDPAQR,GMLSTHLTSMFEYLAPPR, NPDDITNEEYGEFYK, LGIHEDSQNR,YESLTDPSKLDSGK, EYLELEK, NDEELNKLLGR, ACTBL2, RAB10, TFEQLHSTIGHQALEDILPFLLK, VLPLEALVTDAGEVTEAGK, LLIYWASTR, CDK13, ELISNASDALDKIR, RAB1A, GANPVEIRR, MEQFQK, VVTLWYRPPELLLGER, NVAIFTAGQESPIILR, LRQENQMWNR, DYLELEKR, VGLQVVAVK, AYVRDPYALDLIDK, EDAANNYAR, H2AFX, LOC100287441, LOC100287404, VDNALQSGNSQESVTEQDSKDSTYSLSSTLTISK, HIST1H2AC, LSVDYGKK,TIGGGDDSFNTFFSETGAGKHVPR,GHYTIGKEIIDLVLDR, TKPIWTR, RAPFDLFENK, ELELELDR, CALMEQVAHQTIVMQFILELAK, SLC25A4, LAGNTLGSR, USP17L1P, RAB8A, KGSQITQQSTNQSR, ELISNSSDALDKIR, SYELPDGQVITIGNER, QYDSVECPFCDEVSK, HSQFIGYPITLFVEKER,APFDLFENK, ADLINNLGTIAK, YHTESLQNMSK, HIST1H2AA, HIST3H2A, KDSSSVVEWTQAPK, IDVEDILCKAEAISLQMVK,LGPGGLDPVEVYESLPEELQK, NPEILAIAPVLLDALTDPSR, AASQSTQVPTITEGVAAALLLLK, LLIYGASSR, LADFGLAR, HFSVEGQLEFR, RAB43,ALMLQGVDLLADAVAVTMGPK, KPLVIIAEDVDGEALSTLVLNR, DFLAGAVAAAVSK, GEHLLLFLVQTVAR, RAB3A, IGK@, SRSPYSSR, QLFHPEQLITGKEDAANNYAR, FHTESLQGR, LSQAEEETR, LOC100287178, IGKC, SLC25A5,SLC25A6, RLSELLR, GCN1L1, LYACEVTHQGLSSPVTK, PSSLSASVGDRVTITCR, LLVLDPAQR, RNLDIERPTYTNLNR,SIQFVDWCPTGFK, ELELELDRLR, HEXIM2, DSYVGDEAQSKR, FLQEQNK, LQIWDTAGQER, HSPD1, TBC1D15, CDC37, SWEQKLEEMR, ILQLLKHENVVNLIEICR, WSSGVGGSGGGSSGR, HIST1H2AE, DFSETYER, LOC100287238, EQGVLSFWR, HLEINPDHSIIETLR,KHSQFIGYPITLYLEK, HSQFIGYPITLFVEK, LVQDVANNTNEEAGDGTTTATVLAR, VDNALQSGNSQESVTEQDSKDSTYSLSSTLTNSK, HSQFLGYPITLYLEKER, IIAPPER, RAB12, VTDALNATR, HFGMLR, QIGSVIRNPEILAIAPVLLDALTDPSR, SSGSAALLALTWTCLLVR, YFAGNLASGGAAGATSLCFVYPLDFAR, TUBA1C, AVFVDLEPTVIDEVR,QLFHPEQLITGK, SIYYITGESK,FICIGALYSELLAVSSK, EGEKGQNGDDSSAGGDFPPPAEVEPTPEAELLAQPCHDSEASK, WSFLFSLTDLK, INQQEEPGFEVITR, HLQLAIRNDEELNK, USP17L5, RAB1B, DFLAGGIAAAISK, EGFPITALR, GLADQCTGLQGFLVFHSFGGGTGSGFTSLLMER,FDGALNVDLTEFQTNLVPYPR,TUBA1B, HLEINPDHPIVETLR, QVEELAAEVQR, QELVRDYLELEK, SMPWNVDTLSK, VLMENEKEGFPITALR, AYHEQLSVAEITNACFEPANQMVK, AGLQFPVGR, HIST1H2AB, HKVYACEVTHQGLSSPVTK, HSP90AB1, UBR5, SGTISTSAAAAAAALEASNASSYLTSASSLAR,LTWEEKK, HEXIM1, HLQLAIR, DSTYSLSSTLTISK, DFLAGGVAAAISK, HSP90AA1, LOC100287327, ISSIQSIVPALEIANAHR, EDQTEYLEER,HSQFIGYPITLYLEK, H2AFJ, GVMLAVDAVIAELKK, LKELEVAEGGK, ILPEIIPILEEGLR, VLAFLSSVAGDALTR, TVAAPSVFIFPPSDEQLK, KHLEINPDHSIIETLR, HGLEVIYMIEPIDEYCVQQLK, IWHHTFYNELR, RAB39B, SGTASVVCLLNNFYPR, DRGSGLLGSQPQPVIPASVIPEELISQAQVVLQGK, NDEELNK, ACTG1, RAB3C, RAB4B, SLSQSFENLLDEPAYGLIQK, VLEKDAEVIVDWRPLDDALDSSSILYAR, GAWSNVLR, HSP90AB2P, GQPVAPYNTTQFLMDDHDQEEPDLKTGLYSK, ACTB, USP17L2, LOC728379, RWDDSQK, AANVLITR, RSIQFVDWCPTGFK, RVFIMDSCDELIPEYLNFIR, HWRPYLELSWAEK, LSQAEEETRR, LOC100287144, GLGDCLVK, QYDSVECPFCDEVSKYEK, TUBA1A,AYHEQLTVAEITNACFEPANQMVK,TIGGGDDSFNTFFSETGAGK, VTIAQGGVLPNIQAVLLPK, IQEIIEQLDVTTSEYEKEK, GANPVEIR, CDK9, AASTAPSSTSTPAASSAGLIYIDPSNLR, HIST1H2AI, ELEVAEGGKAELER, VLPQLISTITASVQNPALR, FSSVQLLGDLLFHISGVTGK, VDNALQSGNSQESVTEQDSKDSTYSLSSTLTLSK, AKFENLCK, IRAEMFAK, RAB3B, LISQIVSSITASLR, VVVITKHNDDEQYAWESSAGGSFTVR, LOC100288520, HYGFNEILK, FLLGYFPWDSTKEER, RLGPGGLDPVEVYESLPEELQK, RHWRPYLELSWAEK, TQSPGGCSAEAVLAR, LSDGVAVLK, KISSIQSIVPALEIANAHR, AVCMLSNTTAIAEAWAR, EGEEAGPGDPLLEAVPK,ASEAKEGEEAGPGDPLLEAVPK, ENVNSLLPVFEEFLK,LFNDSSPVVLEESWDALNAITK, TKPIWTRNPDDITNEEYGEFYK, ALLFVPR, NTPVQSPVSLGEDLQWWPDKDGTK, LQQLQACTGQQSCR, AGFAGDDAPR, RAB35, LKVGLQVVAVK, SKWSFLFSLTDLK, IDSDDALNHDFFWSDPMPSDLK, LGIHEDSTNR, LGAPAAGGEEEWGQQQR, HIST1H2AG, USP17L3, USP17, RAB8B, RAB30, GSQITQQSTNQSR, VGINYQPPTVVPGGDLAK, HIST2H2AA3, HSQFLGYPITLYLEK, VATWVDETLSSVASKLEHTAQTYSELQGER, HIST2H2AC, VAPEEHPVLLTEAPLNPK, LIGQIVSSITASLR, YMACCLLYR, SLTNDWEDHLAVK,LVSSPCCIVTSTYGWTANMER, AFPQLGGRPGPEGEGSLESQPPPLQTQACPESSCLR, NLDIERPTYTNLNR, IVLLSANSIR, HIST1H2AD, HIST1H2AH, GYSFTTTAER, HENVVNLIEICR, IHFPLATYAPVISAEK, APFDLFENR, FENLCK, RAPFDLFENR,HSQFIGYPITLYLEKER,EKYIDQEELNK, AENLQLLTENELHR, DSYVGDEAQSK, IGQGTFGEVFKAR, HIST1H2AJ, NPDDITQEEYGEFYK, LTYQDAVNLQNYVEEK, LSVDYGK, LTWEEK, WSESEPYR, TLTIVDTGIGMTK, FYEQFSK,TKPIWTRNPDDITQEEYGEFYK, VRELELELDR, TKASPYNR, TUBA4B, LGIHEDSTNRR, IRYESLTDPSK, LSAVEAIANAISVVSSNGPGNR, NDEELNKLLGK, FTLSEIK, GVVDSEDLPLNISR, YIDQEELNK, RLGGDDAR, EQVANSAFVER, SFYTAIAQAFLSNEK, TGLYSKR, HIYYITGETKDQVANSAFVER,TLTLVDTGIGMTK,RVFIMDNCEELIPEYLNFIR, ELISNASDALDK, VLQDWNALK,CTADILLLDTLLGTLVK, QELIKEYLELEK, ELISNSSDALDK, LPAPPIGAAASSGGGGGGGSGGGGGGASAAPAPPGLSGTTSPR, VMQMLLNGLYYIHR, QIFHPEQLITGK, DPYALDLIDK, FTLSEIKR, GSQITQQSTNQSRNPATTNQTEFER,IGQGTFGEVFK,

YEEIVKEVSTYIK,YEEIVK, TITIEVEPSDTIENVK, HIST1H2BC,HIST1H2BH, H3F3A, H3F3B, ISGLIYEETR, MQREEQCILYLGPR, ALAELFGLLVK, POTED, CTAGE15P, FASWALESDNNTALLLSK, AIGNIVTGTDEQTQVVIDAGALAVFPSLLTNPK, YSNENLDLAR, LTEEETVCLDLDKVEAYR, ETKPIPNLIFAIEQYEK, ELWFSDDPNVTK, AVLLAGPPGTGK, NLLIFENLIDLK, SYSDPPLK,ATENDIYNFFSPLNPVR, EATADDLIKVVEELTR, KTLEEMELR, POMZP3, AKAP9, PPP3CA, KDGNASGTTLLEALDCILPPTRPTDKPLR, DNIQGITKPAIR, VFPGSTTEDYNLIVIER, NVSSFPDDATSPLQENR, IAYTFAR, IGLVRPGTALELLEAQAATGFIVDPVSNLR, ELVKHLK, SLSAIDR, GTEDITSPHGIPLDLLDR, VLVNDAQKVTEGQQER, NNQGTVNWSVDDIVK, NMITGTSQADCAVLIVAAGVGEFEAGISK, TITLEVEPSDTIENVK, HIST1H2BG, FAEQDAKEEANK,TTDEGAKNNEESPTATVAEQGEDITSK,QSTTHLADGPFAVLVDYIR, SLLSILQR, QLEVEPEEPEAENKHKPR, TLELQGLINDLQR, ETGHVSGPDGQNPEK, DGEQHEDLNEVAK, ALESSIAPIVIFASNR, GQAVTLLPFFTSLTGGSLEELRR, LSPPYSSPQEFAQDVGR, LLHHDDPEVLADTCWAISYLTDGPNER, DGNASGTTLLEALDCILPPTRPTDKPLR, ALYYLQIHPQELR, UBC, HIST1H2BM, IVSGEAESVEVTPENLQDFVGKPVFTVER, LKQLIDK, FVSSLLTALFR, ISGLGLTPEQK, YSVQLLTPANLLAK, YVELFLNSTAGASGGAYEHR, MQIFVK, H3F3AP5, KLLASLVK, NLTWTLSNLCR, GIAPGDER, POM121C, KIAA1598, PPP3CB, LPLQDVYK, HIST1H3A, HIST2H4A, DAVTYTEHAKR, STGEAFVQFASQEIAEK, LOC100290337,TEPATGFIDGDLIESFLDISRPK, YGFGEAGKPK, SDDELLDDFFHDQSTATSQAGTLSSIPTALTR,EEQCILYLGPR, WITPVLLLIDFYEK, LSSDVLTLLIK, QLQNIIQATSR, SLELLPIILTALATKK, TGGLEIDSDFGGFR, DSIEKEHVEEISELFYDAK, VCSPVTVR, KELELK, INERMPPR, RPS27A, AMGIMNSFVNDIFER, HIST2H2BF, KLPFQR, LOC100287399, CTAGE9, VSLERLDLDLTADSQPPVFK, NKNPAPPIDAVEQILPTLVR, HNRNPF, YYVTIIDAPGHR, HIST1H4A, QVHPDTGISSK, EIAQDFKTDLR, DAVTYTEHAK, LOC100288966, CTAGE8, PRKDC, TQEEVAALK, ERC1, ERC2, TLTVELEPSDTVENLK, FKBP5, HUWE1, TRIM28, LDLDLTADSQPPVFK, KPNA2, NPAPPIDAVEQILPTLVR, LONP1, TENPLILIDEVDKIGR, DSP, SVQNDSQAIAEVLNQLKDMLANFR, FANCI, TADKLQEFLQTLR, AIFM1, ALGTEVIQLFPEK, RUVBL1, LDPSIFESLQK, LOC731751, KTQEEVMALK, EVSTYIK, EEF1A1, HIST1H2BL, ESYSVYVYK, DDB1, STELLIR, HIST1H4C, KEEDLLR, QVISYEK, HNRNPH1, KLIYFQLHR, TDCSPIQFESAWALTNIASGTSEQTK, YVEVFK, EEF1A1P5, EIQTAVR, FGIEPNAELIYEVTLK, DVAFTVGEGEDHDIPIGIDK,HADHSSLTLGSGSSTTR, KQLAAFLEGFYEIIPK, TIRDIIALNPLYR, NQCTQVVQER, SLELLPIILTALATK, LNDGSQITYEK, QAASGLVGQENAR, EHALLAYTLGVK, UBA52, HIST1H3I, EIAQDFK, HIST1H4E, LLLPGELAK, HIST1H2BD, VFLENVIR, DCQLNAHKDHQYQFLEDAVR, LLGASELPIVTPALR, ITGEAFVQFASQELAEK,VHIEIGPDGR, GLPWSCSADEVQR,KTEPATGFIDGDLIESFLDISRPK, IGGIGTVPVGR, HIST1H3D, VLFICTANVTDTIPEPLRDR, HKQSLEEAAK, VGQQGDSNNNLSPFSIALLLSVTR, SITIIGGGFLGSELACALGR, VEAGDVIYIEANSGAVKR, POM121, INVS, PPP3CC, TLSDYNIQK, YKEVYAAAAEVLGLILR, SDPNRETDDTLVLSFVGQTR, QHIEVLK, HIST1H2BN, TVTAMDVVYALKR, VLVNDAQK, ASLSLIEK, UBB, DFSAFINLVEFCR, THINIVVIGHVDSGK, IQDKEGIPPDQQR, DIIALNPLYR, LLQLQEQMR, MNLQEIPPLVYQLLVLSSK, SATEQSGTGIR, GLGLDESGLAK, FDSSHDRNEPFVFSLGK, EKLEQAAIVK, QLAAFLEGFYEIIPK, RQQQAATSESSQSEASVR, FQWDLNAWTK, GINSSNVENQLQATQAAR, HTGPNSPDTANDGFVR, QLIVGVNK, H2BFS, POTEB, CTAGE4, LTEDKADVQSIIGLQR, EKQPPIDNIIR, ALCGLDESK, SVEEVASEIQPFLR, EALLLVTVLTSLSK, SNIWVAGDAACFYDIK, QLKLDPSIFESLQK, LQSVQALTEIQEFISFISK, ALGLDSANEK, EVLGSLPNVFSALCLNAR, POTEC, CTAGE6P, STAPSAAASASASAAASSPAGGGAEALELLEHCGVCR, QDQIQQVVNHGLVPFLVSVLSK, LAQPYVGVFLK, ETQSQLETER, VTEAFDYLSFLPLQTVQR, IDNSVLVLIVGLSTVGAGAYAYK, TALALAIAQELGSK, MQLFVK, HIST1H2BK, HIST1H3B, VFLENVIRDAVTYTEHAK, EVSTYIKK, HIST3H3, VETGVLKPGMVVTFAPVNVTTEVK, TLSDYNLQK, HIST1H2BE, QTVAVGVIK,STTTGHLIYK,

DCDC2C, LOC100506888, SCN5A, PKD1, PRAMEF3, LOC440258, RPL36AL, MAP4K4, PANK3, C7orf73, RAB5C, RPL7, EIF4E, ACTR3, PCBP1, RAB39A, KGTVAPHDQSPR, LOC652826, CALM2, FSNISAAK, KDFGFLVDILYSALR, HLMPYWGI, LFAQLAGDDMEVSATELMNILNK, FLVTVIKDLLGLCEQK, LALVTGGEIASTFDHPELVK, CMKIPTGQGYAAK,GTADGGGSVLGR, GKGEGGAGSR, MGLTPGMLGPSR, EKDTIR, KLEGETYR,ETKTFGGGGGGAR,AGEEAKR, YISAAPGAEAK,GLFLPEDENLR, RNEFLAR, KIAESEIK, LNIPVSQVNPR, ATP1A4,

LIIVEGCQR, KLGGTIDDCELVEGLVLTQK, VIDPATATSVDLR,TAHNLENVLIHFWER,ISEPEADVESVLGVSNLLQVLQKPK, QLIDYER, MISEGDIGGIAQITSSLFLGR,TDGFGIDTCR, LGFEEFK, AVGHPFVIQLGR, LFVTGLFSLNQDIPAFK,QVLLSAAEAAEVILR, VQDDEVGDGTTSVTVLAAELLR, LOC100510063, TCEB3C, LOC100294386, NPIP, PRAMEF25, L1RE2, RPL36A, TNIK, PANK1, CEP85L, RAB5B, RPL7P9, EIF4E1B, ACTR3C, PCBP2, RAB6B, ENAPAIIFIDEIDAIATKR, VFDKDGNGYISAAELR, VPSEVQQR, TEKDNAALAR, ESLAAIEKR, LLVAELQR, LQVLELR, EALNMERNNR, LECVEPNCR, ALFLIPR, GNLHFIR, KLEEIK, LVLLGESAVGK, TTHFVEGGDAGNREDQINR, WALWFFK, DITYFIQQLLR, INISEGNCPER, LQLWDTAGQER, CCT4, LTN1, DUSP14, CAPNS1, XPO1, CCT2, CAMK2D, DSG4, MADD, SCUBE1, FNTA, VPS13C, TDRD3, MAP7D2, PUM1, ASPH, OXTR, CCDC147, ATP1A1, LAKENAPAIIFIDEIDAIATK, DTDSEEEIREAFR, ENAPAIIFIDEIDAIATK, EAFSLFDKDGDGTITTK, DALSDLALHFLNK, ALIAGGGAPEIELALR,WTLLSLVLSQHVK, VLGNTSGLLQLLFNR,RPVIRPNVGFWR, VPLADMPHAPIGLYFDTVADKIHSVSR,THYSNIEANESEEVR, ILGGVISAISEAAAQYNPEPPPPR,LVLDSIIWAFK, LDINLLDNVVNCLYHGEGAQQR,GATQQILDEAER, LTSFIGAIAIGDLVK,

AQDIEAGDGTTSVVIIAGSLLDSCTK, NFLTSLVAGLSTER, KHGATLVHCAAGVSR, RLFAQLAGDDMEVSATELMNILNK, YYGLQILENVIK, SLHDALCVLAQTVK, IPTGQEYAAK, TGEIQFSR, EKGEGGAGSR, CQLISR, QQVLLEK, DGSMNVSVK,ILESSIPMEYAK, SSGSGRTK,GHVLSLALQMYGCR,MHLGLVIPK, RNEMLAR, IIAEADGER, LOC100509473, TCEB3CL, SCN3A, PKD1P1, PRAMEF22, LOC440264, RPL36AP7, MINK1, PANK2, LOC100289196, RAB5A, RPL7P32, EIF4EP2, ACTR3B, PCBP3, RAB6A, AVAGDASESALLK, PSMC4, CALM3,

NSETDTIIFIR,MPAPGASLALR, QLTEHLR, AGTERWMK,GCQLMATATLK,DGADIHSDLFISIAQALLGGTAR,SCVFLSIAVK, QENKLK, VLPRGLAAR,EMLLLTRMAEDVK,RDEMLGLVPMR, SPQEEALQR,SSAAAAAALDLSGRR,CQDVSAGSLQELALLTGIISK, TFEEAAAQLLESSVQNLFK, ALLDQKK, EKENELK, AAMCKPLMQNR,NAAQELATLLLSLPAPASVQQQSK,LPFETFR, ELEFLR, MGKEQELVQAVK,IAAQDLLLAVATDFQNESAAALAAAATR, LDSLTYK, SSDPKSR, SSNNNGSVR,LTDHCLPLFR, YEEILDAR, QTELQDK, CTNILR,TEDPDLPAFYFDPLINPISHR,ESANVEDLEPVR, QNIENVSVEMLLR,VIQATGGGAYK, FFCFQVLEHQVK,LPDFLQR, ILLLEAGPK, LLGLLFPLLAR, FLDALISLLS,AEPYCSVLPGFTFIQHLPLSER,VTVHRNK, VSFELFADKVPK,SNVKPNSGELDPLYVVEVLLR,LTADDLR, LLLPWLEAR,HSEVQQANLK, LIGEYGLR, LASYVEKVR,NVVHQLSVTLEDLYNGVTK,QLGYIHRNFAGNR, RLDLAGPLLAFLFR, LSNQVSTIVSLLSTLCR, KQESTVMVLR, VIHLSNLPHSGYSDSAVLK,LVGSQEELASWGHEYVR, AIAYLFPSGLFEKR, TLFSFLGEIEELR, AALSSQQQQQLALLLQQFQTLK,LFGSAANVVSAK,SVNSLDGLASVLYPGCDTLDKVFTYAK, DLEEFFSTVGK,TIEYLEEVAITFAK, LSTHLQQEGSELLWYLAEK,LFIGGLPNYLNDDQVKELLTSFGPLK, KNFIAVSAANR,TPGQVLSLISSLGFNTPIAEK, AIGPHDVLATLLNNLK, FADDQLIIDFDNFVR, IFVGGLSPDTPEEK, VYEFLDKLDVVR, LAVDTDTFSFGVVVLETLAGQR,ENIWSASEELLLR, TYQQSCVSSCR,IADFGLAR,ACLIFFDEIDAIGGAR,ISGSILNELIGLVR, VNDVVPWVLDVILNK,DISEASVFDAYVLPK, APLDLDKYVEIAR, ALVDGPCTQVR,NYQQNYQNSESGEKNEGSESAPEGQAQQR,SIATLAITTLLK, STIIGESISR,ISLGLPVGAVINCADNTGAK, LTTDFNVIVEALSK, IADGYEQAAR, IQCJ, TSPY4, LUC7L2, TSPY1,

MYBPC1, TDRD6, CCDC22, TTN, CDC42BPB, DNAJA3, OR6K4P, PYGL, SCRIB, FSIP2, RAD50, ARHGEF2, PPP1R15A, NUP205, GTF3C4, NET1, KTN1, ZMYM3, UBR4, ZMAT1, SKI, CASKIN1, IPO9, CLCC1, TNN, CHSY1, FBXL19, GOLGA4, OFD1, BARD1, PRPF8, DHX29, FCHO1, PANK4, XPOT, ULK1, COQ6, IPO4, CDC45, HNRNPUL1, PRMT3, PPIA, SKIV2L2, KRT39, CLTC, COPB2, RPS9, KRT23, DNAJA4, DHX9, POLR2B, HECTD1, PUF60, MATR3, PSMD2, MRPS9, SREK1, GIGYF2, ZBED1, ACSL3, RBM39, HADHA, FASTKD5, U2AF2, MYLK2, MASTL, SF3B1, CAPN2, HNRNPD, PSMD3, IRAK1, IQGAP2, KRTAP11-1, FYN, PSMC2, CAND1, KIAA0368, RPS26, PPP6C, RPL14, COPG2, YBX1, RARS, RPL23, SSB, CCT5, LEELKR, VPERLR,

KDGAEIDK, CRVVSR, LVVPELSSR, LTPESTR, DVIVRAADCK,RDGADIHSDLFISIAQALLGGTAR,KLFCPQK, IFVDIEK,AFAALPSSRPVYDIQSPDFAEELR,KEVDENK, KQNELK, SESLESPRGER, EATSVPR,VLVDSLVEDDRTLQSLLTPQPPLLK,ILPVNLKAVK, VQDFLQR,LLEEQLQHEISNK,NTRVYPCVWCK, KCLPIR, NREMVDVRPR, LREATEAK, QDSSLGGRAR, QITEEQIK, EILRER, AQTELDPPR,SSNNNGSVRTA, LDLRSCR, RYEEILDAR, LKNELLK, LLLSYGASR,TDMIQALGGVEGILEHTLFK,ARQEGGYR, KLQDLIK, VYFGGFFIR,AAQVFALLFVTEYLTK,APFQASSPQDLR, LVSRCGAVR,SFAVGTLAETIQGLGAASAQFVSR,SKFLDALISLLS, HLPSTEPDPHVVR,LINFIRLK, IIPGFMCQGGDFTR,KSNVKPNSGELDPLYVVEVLLR, LVSQIDNTK,KFDVNTSAVQVLIEHIGNLDR, SFKPDFGAESIYGGFLLGVR, QVVNIPSFIVR, LASYVEK,NVVHQLSVTLEDLYNGVTKK,FSDHVALLSVFQAWDDAR, LDLAGPLLAFLFR, LFHFLGIFLAK,VGRPSNIGQAQPIIDQLAEEAR,YQLLQLVEPFGVISNHLILNK,AELATEEFLPVTPILEGFVILR, AIAYLFPSGLFEK, SSSRSPR, QLREEQR, LALFPLLPK,VLSEAAISASLEKFEIPVK,GIAYVEFVDVSSVPLAIGLTGQR,KLDSLTTSFGFPVGAATLVDEVGVDVAK, VVELSYWEWLPLLK,IFVEFTSVFDCQK, NFIAVSAANR,ILGTPDYLAPELLLGR,ICFELLELLK, LFAQLAGEDAEISAFELQTILR,GFGFVLFK, HDADGQATLLNLLLR,VADLVHILTHLQLLR, VLNSIISSLDLLPYGLR, TYQQSCVSSCRR, SDVWSFGILLTELVTK,FVNLGIEPPKGVLLFGPPGTGK, IDLRPVLGEGVPILASFLR, TEALSVIELLLK, NIVEAAAVR,GYYSLETFTYLLALK,LVAIVDVIDQNR, SVGDGETVEFDVVEGEKGAEAANVTGPGGVPVQGSK,FGAQNESLLPSILVLLQR,DFVSEQLTSLLVNGVQLPALGENKK, GSAITGPVAK, ICHQIEYYFGDFNLPR, WVGGPEIELIAIATGGR, IQCJ-SCHIP1, NT5DC4, TSPY3, TSPY2,

CDY2B, GSTA3, NFIX, CCDC72, KRTAP2-4, RPS27P17, AQFEGIVTDLIR, VISGVLQLGNIVFKK, EKPYFPIPEEYTFIQNVPLEDR, NVVHQLSVTLEDLYNGATR, ITFTGEADQAPGVEPGDIVLLLQEK, FQGEDTVVIASKPYAFDR, GIFEALRPLETLPVEGLIR, AFHNEAQVNPERK, VFEVSLADLQNDEVAFRK, ASQVECTGAR, LVINGNPITIFQERDPSK, IIDVVYNASNNELVR, TLGILGLGR, EVAAFAQFGSDLDAATQQLLSR, LAATNALLNSLEFTK, TKVHAELADVLTEAVVDSILAIK, NSLLASYIHYVFR,AKLEEAILGSIGAR, YAIQLITAASLVCR,LLIVSTTPYSEKDTK, TVLEHYALEDDPLAAFK,RAPDQAAEIGSR, IGGDAATTVNNSTPDFGFGGQKR,IGGGIDVPVPR, MELQEIQLKEAK, IEHYASR, ELQHDVQK, TVEFLNEK, EVPDADSLQR, AQFEGIVTDLIRR, CDY1, GSTA1, NFIA, EXOG, KRTAP2-1, RPS27, LQQELDDLLVDLDHQR, NQSQGYNQWQQGQFWGQKPWSQHYHQGYY, TIVITSHPGQIVK, YGEQGLR, SLSALGNVISALAEGSTYVPYR, TPM3P4, RYDDPEVQK, AHQANQLYPFAISLIESVR, VLCELADLQDKEVGDGTTSVVIIAAELLK, ACQSIYPLHDVFVR, YLSLANNK, GALQNIIPASTGAAK, LTPEEEEILNK, QADVNLVNAK, VLSIGDGIAR, VLANPGNSQVAR, AQLGVQAFADALLIIPK, KLEGDSTDLSDQIAELQAQIAELK, GYFEYIEENKYSR, NVVHQLSVTLEDLYNGATRK, LYDILGVPPGASENELKK, ILQDSLGGNCR, TTSGDDACNLTSFRPATLTVTNFFK, DVYVPNTTYR,DKVQAGDVITIDK, TQGFLALFSGDTGEIKSEVR,QSRLEQEEQQR, IPDEFDNDPILVQQLR,AINQQTGAFVEISR, MILIQDGSQNTNVDKPLR, QAVTNPNNTFYATK,

DOCK7, RUVBL2, TIMM50, KHSRP, TPM3, DMD, LVEPQISELNHR, SYNE2, ELMQLEK, TMF1, AATEELLANK, XIRP2, GEGLEYENIK, ASANEMLIAGR, TRISNLPTVK, QADKVWR, KLEELK, CTRPICEPCR, LTEGCSFR, HSPA9, TTPSVVAFTADGER, MYH9, QNKELK, HNRNPU,NFILDQTNVSAAAQR, DNAJA1, LIIEFK, DNAJA2, NVLCSACSGQGGK, KIF5B, TNLSVHEDKNR, DYNC1H1,ENFIPTIVNFSAEEISDAIR, TCP1,SLLVIPNTLAVNAAQDSTDLVAK, RPS3A, LITEDVQGK, LRRC15, LTLFGNSLK, GAPDH, LISWYDNEFGYSNR, RPS8, LLACIASRPGQCGR, PHGDH, DLPLLLFR, ATP5A1, EIVTNFLAGFEA, KPNB1,GALQYLVPILTQTLTK, CCT6A,VHAELADVLTEAVVDSILAIK,

STNGDTFLGGEDFDQALLR, SAIFSITYPSQDVFLVIK, SYTEDWAIVIR,FVQCPDGELQK, AVLIAGQPGTGK,AQGPQQQPGSEGPSYAK, TIALNGVEDVR,IINDLLQSLR,GGGGPGGGGPGGGSAGGPSQPPGGGGPGIRK, VSHLLGINVTDFTR, SSGPTSLFAVTVAPPGAR, ITFHGEGDQEPGLEPGDIIIVLDQK, ITFTGEADQAPGVEPGDIVLLLQEKEHEVFQR, NTIQWLENELNR, LLGQFTLIGIPPAPR, ILDDDTIITTLENLKR, YPVNSVNILK, VFEVSLADLQNDEVAFR, LYDNPWR, VPTANVSVVDLTCR, ADGYVLEGKELEFYLR, AGTGVDNVDLEAATR, STVAQLVK, EHIKNPDWR, VATAQDDITGDGTTSNVLIIGELLK, TFHIFYYLLSGAGEHLK, DLPEHAVLK, QISQAYEVLSDAK, VIEPGCVR, SLSALGNVISALAEGSTYVPYRDSK, EEIKIK, KNLEDGINNLK, KPFSVSSTPTMSR, AIECLEK, VLENAEGAR, VLIPIHEANR,LLVKCLSLK, GTSYQSPHGIPIDLLDR,GLGLDDALEPR, VLLDLSAFLK,AQGPQQQPGSEGPSYAKK, KLASQGDSISSQLGPIHPPPR,VQISPDSGGLPER, MELQELQLKEAK, CDY1B, CDY2A, GSTA5, GSTA2, NFIC, NFIB, LOC729973, PPP2R5A, KRTAP2-3, KRTAP2-2, RPS27P9, RPS27L, QAVTNPNNTFYATKR, QAQQERDELADEIANSSGK, AVVVCPK, VNFPENGFLSPDKLSLLEK, IGLVEALCGFQFTFK, ISFLENNLEQLTK, VQVALEELQDLKGVWSELSK, FATEAAITILR, TTDGYLLR, CDSDILPLR, VIHDNFGIVEGLMTTVHAITATQK, NCIVLIDSTPYR, QNLSKEELIAELQDCEGLIVR, AVDSLVPIGR, AAVENLPTFLVELSR, GIDPFSLDALSK,

QLATVNEQPLQNGFEELIQWTK, TILEMLLGFLK, LDGSRLIK, EWAAQYR, LVLDAFALPLTNLFK, HGSGSGHSSSYGQHGSGSGWSSSSGR, ESLNASIVDAINQAADCWGIR, VELSEQQLQLWPSDVDKLSPTDNLPR, NKEEAAEYAK, INALTAASEAACLIVSVDETIK, LVEGILHAPDAGWGNLVYVVNYPK, IVSRPEELREDDVGTGAGLLEIK, TNTPADVFIVFTDNETFAGGVHPAIALR, TQLRNEFIGLQLLDVLAR, LPHLPGLEDLGIQATPLELK, TDYNASVSVPDSSGPER, FAGGDYTTTIEAFISASGR, RAB11A, NOA1, BIVM-ERCC5, EDNRB, ZFP57, USP17L4, MYH6, DDX17, KIR3DL3, ACAA2, PFKFB4, MAPK6, TBC1D4, SDLRHLR, DRLHLR, IREFCQR, AVASWSR, EAVSILK, MKPEFNVR, IEELEEELEAERTAR, KIVDQIRPDR, KNAVVMDQEPAGNR, SSWPNSEK, QCALAALR, HALREIK, MEKENQK,

MRPS31,SELLSQLQQHEEESR, MMS19,ACDSVTSNVLPLLLEQFHK, RPS7,TLTAVHDAILEDLVFPSEIVGKR, MAGED2,EIDKNDHLYILLSTLEPTDAGILGTTK, CSE1L, IIIPEIQK, HRNR,QSLGHGQHGSGSGQSPSPSR, STOML2, ILEPGLNILIPVLDR, AHITLGCAADVEAVQTGLDLLEILR,CNP, RPS6,GCIVDANLSVLNLVIVK, CCT7,INALTAASEAACLIVSVDETIKNPR, TARDBP,AFAFVTFADDQIAQSLCGEDLIIK, CCT3, TLIQNCGASTIR, TROVE2, TRDELEVIHLIEEHR, INF2, NEFIGLQLLDVLAR, NDUFA9,NFDFEDVFVKIPQAIAQLSK, HNRNPK,GSYGDLGGPIITTQVTIPK, EPRS, LQAILEDIQVTLFTR,

IEPLSPELVAAASAVADSLPFDK, SGNYTVLQVVEALGSSLENPEPR, HVVFIAQR, VPNSNPPEYEFFWGLR, FLESVEGNQNYPLLLLTLLEK, GPYESGSGHSSGLGHR, APVPGTPDSLSSGSSR, DKPELQFPFLQDEDTVATLLECK, LFNLSKEDDVR, VHTVEDYQAIVDAEWNILYDKLEK, TSDLIVLGLPWK, TAVETAVLLLR, IPTHLFTFIQFK, SSQLLWEALESLVNR, SSVSGIVATVFGATGFLGR, ILSISADIETIGEILKK, YNVYPTYDFACPIVDSIEGVTHALR, RAB11B, SNURF, ERCC5, EDNRA, PRKCE, USP17L8, MYH7, DDX5, KIR2DL3, MYO5B, PFKFB3, MAPK4, TBC1D1,

MYH4, LOC729562, CGN, LOC100509661, ROCK2, SENP7, PFKM, PNMA5, PAPD4, CDKN2B, KLHL9, TAOK1, CDRT1, LOC100505868, LOC338579, THEX1, BAT2L, ANGPT1, RAB11FIP4, CD209, AP1M1, TMEM120B, NPHP3, SHROOM3, LOC653566, KNMEQTVK, KSLFMLK, QREVLR, AFLVRSR, KNEIQK, EELKLK, EQWWLK, RLLESLR, QEEQRR, AGARLDVR, APMTTVR, IRSLLER, ENLESRLK, EALASHLR, ITWEEK, LTIMLEK, ELEKIK, GPSYSLR, EATQELIEDLR, AAVGELPEKSK, VFLSGMPELR, FAYKDEYEK, YEQAEHFR, ELKENLDR, EKSIFLVAHR,

MYH2, NOS2, HERPUD1, OVOL3, SOS1, MTMR6, PFKP, MOAP1, USHBP1, CDKN2A, KLHL13, TAOK2, FBXW10, ZFHX2, ANKRD30B, ERI1, PRRC2B, ANGPT4, RAB11FIP3, CLEC4M, AP1M2, TMEM120A, NPHP3-ACAD11, SHROOM2, SPCS2,

05-Sep, EIF3FP3, BEX2, TRPC7, STRCP1, LOC643751, RPS18, RPL27AP6, EIF4A1, ANXA2P2, LOC100291405, ARF3, SNRPB, NDEL1, EEF1DP1, GFPT2, ATAD3C, U2AF1, RPS2, PKP1, QRCPPGVVPACHNSK, DLVEAVAHILGIR,EMPPTNPIR,LGLALNFSVFHYEIANSPEEAISLAK,QESGYLIEEIGDVLLAR, RQESGYLIEEIGDVLLAR, QLHDDYFYHDEL, EALLGVQEDVDEYVK,LLQLLGRLPLFGLGR, AWGILTFK,LLDAVDTYIPVPAR,HYAHTDCPGHADYVK, STAISLFYELSENDLNFIK, TFCQLILDPIFK,SAYESQPIR, VAVLQALASTVNR, RMQEMLQR, IQDALSTVLQYAEDVLSGK, MREENMER, YELLEEK, ASQPSAHISPR, NVFDEAILAALEPPEPK, IPDWFLNR, NQSFCPTVNLDKLWTLVSEQTR, VLITTDLLAR, SALSGHLETVILGLLK, VELSDVQNPAISITENVLHFK, MLAEDELRDAVLLVFANK, VLGLVLLR, ISALNIVGDLLR, IASLEVENQSLR, VIQQLEGAFALVFK, GLLLFVDEADAFLR,

WFNGQPIHAELSPVTDFR, AVIDLNNR,GTGIVSAPVPK, TYSYLTPDLWK,SPNQNVQQAAAGALR, AVQYLSSQDEK, FASN, SFN, ARHGEF1, RCN2, MRPS34, TUFM, EEF2, PTCD3,

TLLEGSGLESIISIIHSSLAEPR,ALGLGVEQLPVVFEDVVLHQATILPK, SAYQEAMDISKK, LAEQAER, LLLKSHSR, HLKAEVDAEKPGATDR,LSEEEILENPDLFLTSEATDYGR, SDYDREALLGVQEDVDEYVK, LPLFGLGR, HEEEAFTAFTPAPEDSLASVPYPPLLR,DLEKPFLLPVEAVYSVPGR, GTVVTGTLER,ALLELQLEPEELYQTFQR,YVEPIEDVPCGNIVGLVGVDQFLVK, VSNSPSQAIEVVELASAFSLPICEGLTQR, AIGIEPSLATYHHIIR, SEPT5-GP1BB, EIF3F, BEX1, TRPC6, STRC, CDC42, RPS18P5, RPL27A, EIF4A2, ANXA2, PTPLAD1, ARF1, SNRPN, NDE1, EEF1D, GFPT1, ATAD3A, LOC100509022, RPS2P40, LOC652798,

TTSNDIVEIFTVLGIEAVR,LSGEAFDWLLGEIESK, SSGPYGGGGQYFAKPR, LFIGGLSFETTDESLR, NLRDIDEVSSLLR, TVGATALPR, NR3C1, KCLQAGMNLEAR, BTNL8, KENPGGTGLEK, KCNV1, KELSETLDFK, SHPRH, TSGGVNPEPCPICAR,CCDC39, STCGLLETQIKK, WDSUB1, IESLGLR, ZFP37, LLKFESCGK, APAF1, SGLLMKLINLCTR, RELB, EIEAAIERK, TCHH, REQQER, LOC442132, LRDQEK, GTF3C5, KVEEELR, DUOX2, KEIEDIR, TMEM57, KLEEATAAR, CSPP1, RYLTQGITQGK, CNTFR, VCEKDPALK, C14orf133, SYAPELGR, C9orf142,CPGESLINPGFKSK, PDGFRB, ISLIILIMLWQK, LPIN3, AESHMQWAWGRLPK,ZNF83, CMNCGKVFSR, RIT1, RSFHEVR, CDR2L, RGMSILR, MSH5, VSGQEGGLR, CDK11B, VSMAPPR, STK32C, ARLPGEPLVLLPGR, DBNL, NEQESAVHPR,ANKRD20A8P,SLALETVQNDLR, PLCB1, RVETAMEACSLPSSR,BCAR1, QGKNVLAK, CCHCR1, RLEEEVR, MCC, RLQQTER, RFC1, TLALLDEEPK, HLA-G, SWTAADTAAQISKR,C11orf51, QLCPESIK, HCRP1,AFGGPVPTSALRTVNLGTPNQDA,PHF3, SLRQSTIAK, CTCF, SDLGVHLR, ST8SIA6,LSTGLMITSVAVELCKNVK,ILK, ELLRER, PLSCR1, MLLTRK,

POLR2A, HNRNPA1, CCT8,

SIAANMTFAEIVTPFNIDRLQELVR,GFVENSYLAGLTPTEFFFHAMGGR, SESPKEPEQLR, NQGGYGGSSSSSSYGSGR,LFVTNDAATILR, AIADTGANVVVTGGK,

HNF1B, SVSTTGLQTAGR, RPP30,YTISSALNLMQICKGK,TIGD6, LTALFCCNASGTEK,UGGT2, LEGKMMNGLR,TMPRSS11A, GDSGGPLVTR, MAN2B2, SSLKGLAR, PLXNA1, VQIDYK, RNF167, CVDPWLTQTR, NUMA1, EAEKQR, CPNE2, VITLSLAGR, CELF4, MLNKQQSEDDVR, WFDC5, RCCHSACGR, DENND2D, NPPAGYFQQK, ARHGAP5,MPLQDLVTAEKPIPLFVEK,DYNC2H1, TVLRGSGNLLR, RPL7P34, AGNCYILAEPK, BUB3, NMGYVQQR, DKC1, IMLPGVLR, MBLAC2, LIELVDR, VPS13D, QLNLTIR, PHLPP1, FEEIDLSGNK, LRP2, ILVESQNR, DDX12P, FSGVCPDSR, CCNL1, NDSLRTNVFVR,SERPINA10, NLQVSRVLR, AKNAD1, ITENAANKK, CCNO, EVQPAPELEIK, MRPL38, EYFGEK, UCN, ALLLLLAER, CCT6P4, ESIAALHR, CHRDL2, QSDQDITK, ICT1, EDVKLHR, BCOR, EGGHPATK, FAM47B, KLEDAWAR, C9orf96, QSPGSLKAVLK, ZSCAN29, LLTQSAPK, C17orf66, LTIKSEMR, HADH, CHTMAFVTR, KIAA1217, LRDQTR, KLF10, SPVSAPKLPK, JAKMIP3, LPCHPNDGTR,LOC100131193,IRDSEAQLQVLLR, ZNF148, LNTCAMKGGLLR,ARRDC5, NTFTPGEK,

MIDN, QGPQSPER, TBC1D19,SLNSMCTELSIPLARK,DAPK1, YNTNNGAPK, NKX3-1, QLSSELGDLEK, LGALS3, GNDVAFHFNPR, SPOCK3,MPPHPPLCLLSQMLK,PHF20L1, GSEVTAPVASDSSYR,SUPT4H1, DLRHLR, CTSZ, GQTCYRPLR, GNAZ, IDRHLR, LOC100506052, AIFHLTR, SREBF1, ALHFLTR, ALDH1L1, DALENGR, NAB1, TTKMGFQISR, FAM105B, EKELLIHER, FAM98A, RSVLSPK, TBKBP1,LEEALEAAQGEARGAQLR,EPHB6, GAQRAHIR, HIST1H1D, VAGAATPKK, C2CD3, SSFLSSPER, SNCAIP, QSRQNSR, PRSS33, KSAACGQPR, YSK4, LIDFGCAR, ZNF737, SSILTAHK, ZSCAN4, SSGKNLER, HOMER1, LKQEIDNAR, WNT2B, SVGEGAREWIR, SPRYD3, VCGTLLEYLGK, PCDHB17, MMETPLPK, GRIN2D, IEVMSSK, RYR1, DATTEKNK, DOCK3, LDSMVSEGK, EML6,CRAFQLETGQLVECVR,EMB, KNESLGQ, MAGI3, VTGTIGMAEKR,ARHGAP27, CVRDLVR, RSAD1, LELLEEVLALGLR, TWIST1, LPLPGPPER, ARL6IP1, EINKLLK, LOC100287367,ASLTWASPR, FANCM, ELSLVEQR, STRBP, ILRDLCNR, TMEM233, SQYAPSPDFKR, ITPR3, NAGEKIK,

AMOT, LEAELATAR, GFI1B,RSHSGTRPFACDICGK,IGFN1, KDCGQYSVTLR, NSRP1, QMARVNAK, CCDC23, QKSAQQELK, DIAPH2, ILNELAELK, C10orf137, SVDSSIKK, MYH14, LEGELEELK, CNPY3, GDTAALGGKK, RRP7A, STWPQKR, LAPTM4B, KMVAPWTR, HIGD1A,EAPFVPVGIAGFAAIVAYGLYK,TRIP13,HNIVFGDYTWTEFDEPFLTR, COL4A4, KGESGIGAK, CEP164, RLEDLR, PACRGL, SSLSTSSPESARK, PCCA, YSSAGTVEFLVDSK, ATP9A, VKSSNIQVGDLIIVEK,CRTC3, SSSGLQSSR, KISS1, EAAPGNHGR,LOC100507508,SPPASSAALCVWLAPR,OPLAH, LEGLLSR, CCDC112, LEKEIR, PARP14, IEKIER, C11orf2, LEKELR, ARHGEF9, LEEKIR, ANKRD6, LSGDSRACR, PRAMEF23, FVKMSIR, C16orf48, RPTSAQGR, TP53I13, QPSSSGAKR, RGL1,HYLDSDPAEEYELVQVISEDK,FAM163A,STAFGMNGLGGIAAKLFAAPFSMR, TRIP10, LKLEVQK, BPTF, IKLEGGIK, GPR45, VPERIR, HARS, QMAVREK, MLLT4,ARTAMPAISVLDLLQDEER,PNKP, FREMTDSSHIPVSDMVMYGYR,PKN1, QMNIDVATWVR, CDH11, SSCTRPIMPTCLR, HOXB3,AHQNAYALPSNYQPPLK,RBMY2FP, SPSGCLRSAR, ARID5B, TVGSALPSSPKR, DNAH8, LSLDTMKR,

IL1RAPL1, RDTLLIR, KLC3, RLAQENVWLR, NAA38, MIVGTLK, CDKL5, MLRTLK, LOC652586, YSSLLRGASR, CHTF18, QVDGERR, USP42, FHEHENGKSR, RARS2, FQSVESSTR, ZNF516, LLNGASQADGAR,ATXN7L1, KHQNGTK, NANOS1, YLGSALELR, NAA16, ILKCYEQK, SEMA4D, QTECLNYIR, KIAA0226L, QQHVEDVQR, RBM22, ASTMPRLDPPEDK, CYTH2, KQELLVEIQR, MAGEB16, VAFLVNFMLHK, ZNF668, MEVEAAEAR, QSOX1, RDTGAALLAESR, ITSN1, AEQERK, CD6, IENKESR, FAM186A, KVMQVWTEK, CHD9, EIEDLLRR, ABP1, ELRLQPSSTTTMAK,CCDC168, ILLLSEEK, RINT1, MMLVLEK, PSMC5, IDILDSALLRPGR, EXOSC4, KSCEMGLQLR, LYSMD4, MAVGTRGTLLK, EIF4G1, AASLLLEILGLLCK,CCDC18, PAIDLGQELR, FRY, FMAELKELR, HIVEP3, RSSMESPK, ACSS2, ELADEALQK, NUB1, EWRGAGMAQK, F5, SLDRQGIQR, LYST, AICGLSR, PLEK, AIQMASR, RHBDL2, QPREER, GDF7,TLAQAAGAAAVPAAAVPR,CTC1, QAPGGGWDCR, PRKG1, QIMQGAHSDFIVR, KCNH7, HESAKADLLR, SCXB, CGLQGARR,

FAT4, LTGELDR, ZHX2, DQLAIAASR, FBXL7, LTDEGLR, RNF215, LLDALLQR, PIEZO2, GADAGNGIR, PSD3, NAVSVHHALASK, PDE1B, SECAIVYNDR, TNS1, TSVHIPR, LMAN1L, QLAQAER, C15orf39, GQTLDGTFLR, PDE1C, GEASVVDLK, CACNA1E, NAPMFQR, REEP5, NCMTDLLAK, N4BP3, MAQQEQRR, TNMD, KIYMEIDPVTR, C19orf45, EAPPHGAPPR, CLEC4F, GSCPLRK, FBRSL1, WTRSCAPGMAR,METTL10,SSGADGSGGAAVAAR,KLHL29, HGCVVIK, LAMC1, NISQDLEKQAAR, CARS, ADSSGQQAPDYR, PHF7, EAARASR, TM7SF4, FLGPCGWK, C19orf60,LAGGAGGGSRAGR, OVOL1, GHLTDPQSR, HLA-B, FNSDAASPRMAPR, TANK, RQEVSSPR, IFT74, SSAARPVSR, CXorf59, EMDAGVSGLRK, CCDC63, KEIEDLR, PTPRT, KGYHEIR, MYC, RSFFALR, NPAS2, SVTREGTGTTR, DICER1, ENFNSQQK, CLDN15, CTNIGGLELSR, CX3CL1, AIILETR, CYP2D6,FGDIVPLGMTHMTSR,KAT7, YMKSQTILR, MCM3,TVDLQDAEEAVELVQYAYFKK,ESD, SYPGSQLDILIDQGK,SH3TC2, EGECLTLCK, PLCH1, NTKGGGLEGR, IQUB, NCINLQNEAQKR,

SLC9A2, LLIRENQPK, WDFY3, TCIIWDLNK, POLN, LHFVCSPR, ALDH3B2, VAIGGQSNESDR, PPIL2, LDGRSLIK, SCFD2, IAVDELFTSLR, KDM3B, NCAIISDVKVR, OPA1,GNSSESIEAIREYEEEFFQNSK,NEDD4, IVLSNNTPR, WDR11, MCPPLTTK, KIFC3, IKVEHLK, PSD2, VGSDDLER, MTPN, GEDVNRTLEGGR, ZNF444, REHVLR, ZNFX1, NGNQDCR, MYBPC3, SSSFRTPR, SLC4A8, AKEEEEAEK, LOC81691, NCNLRALK, AHNAK2, EGKGSPLK, CDKN2AIP, DKILSMAEGIK, LTB4R, YPSEGHR, UCHL3, WLPLEANPEFLK,C10orf76, EAAFFKELAVR, NSD1, FEELKAQK, GCH1, TVTSTMLGVFR, ALB, LTCASLQK, GALNT5, KMQNALGR, ACAP3, LFVSGVR, CDK19, KPMQLPRSMVK, ERGIC3, ISSAVTPVK, SYNE1, KFEAELK, UBE2D4, DLGSPQHPPPR, P4HTM, QALFQQEMAR, MED14, DKIIPPDPITK, PTPN7, RTWESPGVR, PIK3AP1,QFIDEYVETVDMLK,MYO18B, WDGPQNK, AK8, DAEEQVKLK, TRIM4, KVMHLQDVEVK, PTPRG, NSMVVPSERAR,KIAA1826, LELEKR, KIF3B, LEIEKR, PAG1, IPPESAVDTMLTAR,LOC441736, EERAQWAEQQR,

SFI1, ESAQGLR, C12orf51, SINEMLLSR, LPHN3, VFLCPGLLK, MIS12, EIEQLQEK, CNNM4, SPAHPTPLSR, HMGCL, AVSTSSMGTLPK, SYT12, VTVAESSSDGR, C2orf73, ACPSTPESR, GNG12, SSKTASTNNIVQAR, MBNL1, AVSVTPIR, UROC1, RCWSGNQK, TLE3, VEEKDSLSR, MCTP2, STIAFAHR, IQGAP1, NKYQELINDIAR, SYNM, ENLLLEEELR, ASUN, EELAEAEIIK, HELLS, GGQSGLNLSK, ACSL5,DNSNNTLYLQMSSLR,HMX1, CLAWCEPR, CHD4, LHDMLGPHMLRR, ATN1, EAEQRAR, TRAF3, QEAVMGK, ZNF181, LSKGMIPDWESR, MPRIP, EPGLESK, MPHOSPH8, HQNSALHFAK, ANXA6, MAKPAQGAK, SH3GL1, MVSEKVGGAEGTK, MDM2,QQHISNDCANLFPLVDLSIR,THADA, FSALTILGSIAEVFHVPEGR,SCARB2, DQSVGDPK, CCDC66, EISPATPNMQK, DLL1, LISRLATQR, PRRC2C, QPRAGPIK, CALHM2,AAVAPVTWSVISLLR, GIPR, GWHHCRLR, NEXN, EYEELIK, LAMA3, HMETQAKDLR, SCNN1A, KETLDR, FAM129B, TDMDQIITSK, SLC22A11, TQTAPAGR, DNAH1, AYPTMPR, POF1B, NVVYER, HPS4, LPTGGDAPQEHGK, HIF3A, MRPAAGAARRPR,

ABCB11, SGGQKQR, SMCR8, KAFAGELEK, HEATR1, SPQILVPTLFNLLSR, KIF1B, CLQLLTHTFNR, TPM1, MEIQEIQLKEAK, MMGT1, MTPLGSGPPR, KIDINS220, VLQMLDTVR, RNFT2, KLELEK, FAM120B,CLSSYTSVKENFDK, TREM2, QLGEKGPCQR, HES2, AGDAAELR, ELN, SLSPELR, GPS2, MEQKMK, ANKRD32, LDCTFIK, ANKRD30A, MEQMKK, LOC646934, GAEGEARR, ASIC4, IPNRGSAR, LRP1B, IENAWGIR,LOC100293539,VGGSGLGPPR, PAPPA2,LGEECDDGDLVSGDGCSK,PC, VFRACTELGIR, PVRL2, MARAAALLPSR,ARHGEF25,TDRILGVMGGMLR,DPY19L2P4, AGELQEVAAWR, ACVR1C, MTRALCSALR, CHPT1, MAVGASIAAR, ORC1, KLHYQTYR, TWISTNB, EQLDAELLR, SLAMF7, KLCEGDCLSPLHR, TGM1, KEVELAPGASDR, ARID5A, GLTENSR, SYT13, HSVAGELR, TOM1L1, YNLPLDIQNR, MTFMT, MPSPSPK, CLRN1, IHHLSEK, POLR3G, KMEELEK, FAM170A, TCVSSLCVNK, C9orf79, SQKTLGCADK, ANKRA2, FILPNR, TMEM214, DLASYLNYK, NNAT, QVLGERR, WDFY4, STEGVDIVMDNLK, TCOF1, EAASGTTPQKSR,FAM174B, LDSSALR,

SGCE, QANNFIISR, RER1, QNEEFRPFIRR,LOC100287837,APCPPRATSLR, EMILIN2, GYEGQLR, ECE2, LLVPGPSGER, SECISBP2,QQPQDNFKNNVK, SMC2, YEVLEK, C6orf25,GPGPGTGKGMGMGR, PKP4, FAAYIRAAVR, VPS41, QFVTGGKK, LINC00313,ECVGATGLNAAR, UBXN4, MLEERNR, MYO7B, QQVQAKR, PLS1, ENSTTTISR, ELFN1, AASAAAAGSLK, MRAS, LVVVGDGGVGK,B3GALNT2, ERCGDPCR, DEPDC4, LSKEDVWK, TRIM25, LTVMYSQINGASR,ERVFC1-1, WASGVDGGLYEHKTFMYPVAK,RNF213, SCVQSAVGMLR, SUGP2, SQDWKFALR, COL25A1, SYEHMAKIR, KLHL12, SILNSMNSLR, F11, SCALSNLACIR, SEMA5A, VCVGQNREER, FAM198A, CRNPAELR, PSMA2,YNEDLELEDAIHTAILTLK,PDHB, MAAVSGLVR, NFKBIB, AGVACLGK, GGA2,VTFGNRVTSSLGDIPVSR,IGSF6, MGTASRSNIAR, HERC5, VTQIACGR, HMG20B, KILPNGPK, NREP, MKNFHLK, KIAA1755, SKELMEAVLR, GIGYF1, REVELR, DNAAF1, RDLEIR, UBN2, RIEDLR, C9orf91, SLAESALLEPQVRR, ABCG4, RDLELR, BCAT1, GVDNKIR, C22orf25, CPGYGTR, ZNF687, FISHKK,

EHBP1, EQLLLDELVALVNKR,GTPBP10, GMGHKFLK,LOC100134174,GAQVGTRGSR, ODF3L2, MGTLSCDSTPR,ARHGEF11, TAVGSSDSK, RUSC1, MPPTCSPGLR, GCKR, MKGGSATK, MRPL15, QYKPMSLNR, USH1C, MAGCCCKGNIGR, RALGDS, VTSQDKAPAVIR, UBXN2A, FNITHR, RBBP5, GSCFLINTADR, DYTN, KEAGNIK, ALG13, FSAFLDK, RB1CC1, DEAIQTALK, LRRC4,YLVEVDQASFQCSAPFIMDAPR,DPY30, ERPPNPIEFLASYLLK,LOC100131673, VRAEGELEPR, VCL, GQGSSPVAIQK, C16orf7, LRLQEAANR, ZNF236, CPQCFR, SH2D4A, RLSLAAQK, MLL, SSQIPKR, TTC38, SLGGIEGCLSKPK, FLT3LG, GGGHPALGSR, ANK3, DSNQRPKNNR, MYO5A, EEERPQIR, ZFYVE28, DECIPQDR, HAUS7, LPLVLLGTR, KIF11, ILQDSLGGR, ITPR1, LEELGDQR, INTS7, SGKLGEQCEAVVR, POLR1A, LTITFPAMVHR, HYAL1,GALSLEDQAQMAVEFK,FNDC7, APANIQVSFDSGALK,FBXO31, IVLEVRER, CCDC135, QVETQLDYLK, FLNC, VIPNGNGTFR, BIRC6, NANDANSAAR, AHNAK, EGQTPKAGLR, ZBTB48, KDLQSHMIK, SOX8, LLSESEK, GPR179, EAGSMGSR, TOB2,CVHIGEMVDPVVELAAKR,

MAEA, AGGQAGTRR, SRRM2, SSPELTRK,LOC100507023,LEAPSMGHSPR,LOC145783, TGMERELIESR, SPTBN1, QLQEDAAR, C8orf58, DGAGEGLAR, KIAA0947, KGGEESLR, PLEKHH2, DLEHCLQGNIK, PPP2R3A, SSLWNVVMR, C11orf35, SSWVGRMLR, AUH, LEGLLAFK, TXLNA, EAVASQR,LOC100289079,EELSNVLAAMK, CENPE, EKEDQIK, CCDC75, AEEKLESYR, SIN3B, RGAAGGNLSSR, RBM25, GAIEVLIR, LPPR1, AVGNNTQR, FN3KRP, LGEMRLK, KIAA1257, QRSQIK, SUPT6H, LEELLIK, MED12L, IAGFDSIDKK, RAD52,ALRSFGNALGNCILDK,DTNB, QLFIEMR, C4orf27, MVGGGGKR, CRABP1, ALGVNAMLR, EXT2, CASVKCNIR, XRN1, IEEEVEK, MAML3, MEQHPVGLPR, CLSTN1, GNLAGLTLR, NFRKB, ELAALFQLWLETK, SWT1, RLSVEIDTLR, CERKL, EQNKLK, DDX58, MMNDSILR, ZNF761, KEFLSTAQGNR, IL6, EFLQSSLR, OIP5, MVCYLLK, FAM111A, ESEKIIENFK, PMFBP1, KLEEENENLR, SLC4A4, LLLMPLK, PSME3, LIISELR, GK, LRWLLDNVR, FAM193A, KQLNHVK, KIAA0355,ASTFLTDLFSTVFR,

ASNSD1, DENLTANEVLK, ZNF233, NQELPLR, ENAH, QERQER, EDN1, SSFHDPKLK, CAMK4, LKTVEEAAAPR,RAD51AP2, MSSVYLK, RBM20, VACSVPGAR, HMGXB3, QPSPSSSR, NEBL, KDLENEIK, YTHDF2, LENNENKPVTNSR,LOC100287896, LEEEDIK, INTS3, KFLACR, LOC401317, FLEGQDAK, FCRL5, TVGNQTGIK, UTP11L, GVTNQTGLK, SNRNP70, AAAAAASLR, TULP2, HEASLAIR, LCE4A, SHCHRPK, DOCK6, ILVKCLSLK, VWF, TCQSLHINEMCQER,C13orf35, VTVQTDDSNK, PPP1R7, LELGSNRIR, C1QTNF8, GEKGEAGVR, ZMYM4, IVKETVR, RSPRY1, ISPHGLEAR, GALNTL2, SEKPDCMER, ZNF253, STDLTTHK, HIST1H1C, KAGGTKPK, ZNF185, SWITKQDESEGR, CAMKK1, NGVDPPPR, UPF3B, EKDTLR, GABPA, NKPTMNYEK, DAP3,NDWHGGAIVSALSQTGSLFKPR,RHOT2, RLDQEK, TRIP11, QSDTMTEKER, TAP2, SIGAEEHEVCR, THSD4, TSVPLHR, YIPF4, RHLSGSIASPDVK, GINS3, IYQEGWR, USP43, CPHCQVLQQGMVK, CFH, IYKENER, EXO1, GAIACAEK, REEP1, GQGALSER, FBLN1, SSAATASWR,

CTPS, SGSSSPDSEITELK, BBS9, ITIDTNK, DRD5, SSAACAPDKSLR, GORAB, LMEEKNK, NEB, AGDALNER, NBPF3, AEMNILEINKK, RASAL3,ESLATLSELDLGAER, PHC3, IPLHSPPSK, ENC1, EIVEEAIR, SUPT5H, QAIEGVGNLR, ANO2, LKLMDEPALR, LARP1B, TGTHMSRAK, SPON2, WTLGSGPSEAIR, LUZP1, KYLNEK, PLEC, ARAEEESR, UNC5A, VTWRGEGR, EBPL, GWRFAAAVR, CRLS1, DVMLIAAVFYVRYR, CVLVGDGAVGKSSLIVSYTCNGYPAR,RHOV, SHB, QSSPSPSR, KAT2B, QTIVELAK, AHCTF1, NKLEDELK, TULP1, SSADLKER, SLK, QTLDALNYLHDNK,MRPL46, ITEADEK, VRK1, TVLQLSLR, EYA3, SLLLIQSR, MIA3, TVLVVKDR, LDLRAP1, GEELSAAAIK, FTH1, MKLQNQR, GOLGA6L2, KMWQEEK, SIK3, DLKAENLLLDANLNIK, PFKL, EVQKAMDDK, GON4L, CCEEIQPR, TTC34, LGLLQLK, FRMD3, KIVEIHK, TFAP4, VPSLQHFRK, LRRC38, KLLVAGNR, LOC644950,LTIMPKEIQLAR, MAP9, TIEDRNIK, SLC26A6, MGLADASGPR, CEP290, QIDSQKETLLSR,LOC644667, QAPESRK, NRIP3, RLMETNLSK,

CHD3, IDGGITGALRQEFK,ZNF850, SFTFHSALIR, TFDP1, RVCDALNVLR, OR2AT4, ISSLEGR, NARG2, VEMGLKLEK, KITLG, RQPSLTR, WDR43, LQAKESPQR, CCDC114, YLEIEER, PPP2R5C, MGGTSREELR, ZDHHC6, IDMSAARR, PDE6C, LAHLLQADR, ACADVL, QSWVRAR, CCPG1, ENQKLK, NRBF2, ENKQLK, FAM83H, SLESCLLDLR, EEA1, QLLIQQK, ZHX3, YPEEQLK, ZNF335, FLSHEDLR, PARD6G,QAAWRSTGPAVTL,CALML6, AQNQESELR, ANKRD24, ARDAAEAR, MTTP, LKAVVEAK, PRRT1, ELLGSLK, SOST, MRVEVGIGSR, WAC, ELEKLK, WARS2, EIEKLK, MRPS10, ELGISIK, EPHA3, EISVAIK, C3orf26, QPKECFLIQPK, VRK3, LDAKDGR, TCP10L2, MLEGQLEAR, FAM26F, ASDVQDLLK, IRF2BP1, EEAVAEGARR, DGKI, EGGSRSPR, SPATA24, EKLAFEK, MYLK, TWKELATCR, GTPBP1, YQAMVHCGSIR, ACVR2A,RGTSVDVDLWLITAFHEK,DNAH17, QLKACHR, EIF2C4,EFGIVVHNEMTELTGR,GFRA1, LDCVKASDQCLK, DNAH3, LEGYHR, LOC100130466,QTQQAASK, GFRA2, SSYISICNR,

ESYT1, LYMKLVMR,LOC100507141,VWEGHEGLQSEMALSSTCSALLIR,PHLDB2, YSSSSLSHMGAYSR,GTF3C1, KNITNDIR, APOL1, VRAEEAGAR, SRSF7,NPPGFAFVEFEGPRDAEDAVR,ZNF841, EPWTVVSQVK, DST, LDDMLSELR, PRSS12, ESLSSVCGLR, DACH2, IQEKQIQQEK, DENND5A, MLVCLGAR, PCDP1, KSGACPNK, CLIP4, QNAISSNK, TIGD7, LKSIIIGK, METTL4, MEYHTMIR, ARF5,MLQEDELRDAVLLVFANK,DDX3X, TAAFLLPILSQIYSDGPGEALR,APOE, EGAERGLSAIR, CDC5L, LECLKEDVQR, GLYR1, TAEKEGAR, MED13L, TAERQEK, CCDC116, LLLEKNLK, MCM10, LIRLSQIK, MCAM,ANSTSTGKPGLAREQGCAR,MALSU1, EIYELEK, DISC1, QFSCLSLR, RNF32, IQVQALR, CEP152, EIQLEAQIK, ACTR2, QEYQEKGVR, STK32A, LLEPNPDQR, TRIM22, LQVALQR, FAM101A, VSPEPTHEIR, DDX41, ILESVAEGR, DGCR2, SQTCVDIK, CKAP5, QCKSEFPIR, HACL1, SEEQVSGAK, CFP, CEELQGQK, ARR3, KGEEESQK, TRAIP, EIMSLKK, TTC25, KLSEVGR, RDX, QIEEQTIK, RIN2, MTAWTMGAR, CALB1, DLCEKNK, KANK3, YIEELER,

ATP6V0A4, FYVGDGYK, CCDC83, EDVEEAMKEK, SLC6A15, GTTGVPKNSK, NGRN,GTGSGALPSGQKLEELK,IMPA1, NEMNVMLK, RPS15AP17, QVLLRPCSEVIIR, MYO18A,ALLADAQLMLDHLK,MCPH1, MAAPILK, SDHD, LSAVCGALGGR, COL21A1, QVCTDVIK, GTF2E2, KPASQKK, PXDNL, VLGDPGTR, SPINK5, LFCPQDK, IPO5,LVLEQVVTSIASVADTAEEK,CHD8, YTEDLDIK, LAS1L, CLALSGR, KIF24, CIQSLR, RPLP0, CLISQR, IL1RAPL2, CALFYSYIR, KIAA1109, LSVGSNRDR, ZNF248, TFIESLK, DDX3Y, QSGGASTASK, NRG1, CALPPRLK, KIAA1875, EPLSSLR, EIF3D, VADWTGATYQDKR,RNF146, GASWLGRR, JAG2, CLDGRR, DNAH10, GVMSDPNFLR, IGF2BP1,AISVHSTPEGCSSACK, FOSB, AELESEIAELQK, MAP3K9, SLVDGYK, AP5B1, EPLKVVLR, DDX4, TGGLFGSR, GPC3, TVGGGGGEAEGKR, BDP1, METDLKEIR, S1PR3, MRPYDANKR, RPS20, LIDLHSPSEIVK, CSF3R, LEPPMLR, CHSY3, LEEFLR, OSBP,LWIDQSGEIDIVNHKTGDK,DRAP1, KPGSGGRK, NPC2, SGINCPIQK, CFTR, MQRSPLEK, THAP4, DASKATGGVR,

GCET2, ILIFEK, FDX1, MGPGGVGVR, RASAL1, EMPGAPSPLR, ZNF316, HQWAHAEEKPHR,ADAM15, ALLARGTK, LRRC41,ALHLSDLFSPLPILELTR,LOC646214, SDSVLLGMEGSVK, NCOA6, MEIRCNIEK, MPP5, ALLAKEGK, PLP2, HTAAPTDPADGPV, SLIT2, CTSPRR, NSF, LLDCVPIGPR, C19orf40, VRNSNNLK, MB21D1, DCLKLMK, TSNARE1, SSCPQER, ASB16, QGAELDAR, ACBD4, GLCSQLR, YTHDC2, IAVNIALER, GRID2, RAILVMNPATAK, DGCR8, LLIDPNCSGHSPR, DPH2, EALAHLR, TMEM194B, GEAAAAALSVR, CAMSAP2, SKSLADIK, ZNF709, QCDKAFSCSR, RPS6KC1, RDYLEK, PSAP, MYALFLLASLLGA, CCNL2, SAPYKGSEIR, FASTKD1, VQFHLMELNR, SARDH, HLSSPIR, LOC339535, WEGVSGK, XIRP1, LLDIEEAVHR, JAK1, QRNLLTR, RAPGEF5, EQDQSVLVLKK, KCNAB2, DKILSEEGQR, SLC7A6, NVLAAITR, FPGS, GVTTQVAAR, CSMD1, ITCVQLNNR, NPM1, DMQLLSISR, LMNB1, LLQGEEER, DLG5, ELKELK, PPP6R1,VAGALVQNTEKGPNAEQLR,MOS, LLLGATLPR, DIO1, VLLILFPDR, HNRNPA2B1,GFGFVTFDDHDPVDKIVLQK,

CNOT1,APGFVYAWLELISHR, DRG1,TCNLILIVLDVLKPLGHKK,UBXN11, MAFMTR, FCHSD2, AELEQKIDEAR, COPA,NFDKLSFLYLITGNLEK,DDHD1, LEPLILKR, SLC1A5, NIFPSNLVSAAFR, DHH,AGDSVLAPGGDALRPARVAR,MEGF10, HGEKTMYR, SPEG, LQFFEER, SLC9A5, YLSQLLMR, OTC, QSDLDTLAK, FAT1, IEARDQAR, CDC34, VDAERDGVK, ORAI2, DCVAKSVGGK, PPP6R3, QDVLNKSVMK, KIF23, QLEMQNK, AARS2, VVADHIR, LINC00479, LLIPRWR, CARNS1, VVCGVGRGDRPLR,SMARCAL1, CRAAMPVLK, GIN1, VGHEVLR, VWA5B2, ISQERLCR, SV2A,GGLSDGEGPPGGRGEAQR,GTF2IRD1, ILAVADKIK, DTX3L, QECVSLADSK, POLD1, GLLPQILENLLSAR, PIK3R6, VLVLGDDR, ITIH4, LCVDPR, ZFYVE27, QEDLQRVR, KIAA0020, AIAFAHDSTR, IPO7, GIDQCIPLFVEAALER, UTS2, LTPEELER, RBCK1, EYQEDLALR, EIF3E, HLVFPLLEFLSVK, HLA-C, SWTAADTAAQITLR, GNAS,SNEYQLIDCAQYFLDKIDVIK,HMGCR, VPDNCCR, DHX33, EAISDSLLRK, CDH26, NEQGGVR, TMCO6,TEDAGLELLACPVLR,TRAF3IP1, MNAAVVR, MYO15A,LHRLINPNFFGYQDAPWK,ZNF223, SALTVHCK,

RAPH1, ETSLLLR, RRNAD1, QLAVNLTR, STAG1, EYDVAVEAIR, ROPN1L,SMSQASKELGPCAAATSPTTFLR,MSI1,IFVGGLSVNTTVEDVKQYFEQFGK, NLRC5, QLNLITR,LOC100506012,CAYQGLALSPR, LGALS3BP,TLQALEFHTVPFQLLAR,MON2, AWDVLLDHIQSAALSK,AFF4, AVYYLDAVVSFIECGNALEK,AGTR1, KDILQLLK, ARFGEF2,GSSLSGTDDGAQEVVKDILEDVVTSAIK,RPL38, KIEEIK, UPF2, ALFIVPR, IBTK, MLIDIISSKMISHGIK,PIKFYVE, SSALDTRR, TOMM70A,ILLDQVEEAVADFDECIR,NUP133, QHEIVLK, SORCS1, VGAGGGSQAR, IL10, FLPCENK, TBX18, KLGAEEAAR, SDCCAG8, TNRDLEIK, JMJD1C, YEDLLK, C2orf16, ELAAHLR, DOPEY2, ILEPVLLLLLQPK, DNAH7, RLQILK, FAM214A, IINENYNPK, DROSHA, SSRSPSR, RILPL1, LELDRLR, PIK3CA, SLWVINSALR, IPO13,MLLIAVLEAIGGQASR,GAPVD1, LSAQAQVAEDILDKYR,METTL3, QLDSLRER, ZFHX3,LECDSCGKLFSNILILK,BCKDHA, HLQTYGEHYPLDHFDK,KLK11, LLCGATLIAPR, TSEN54, SSSSPRSINK, TAGAP, ESQLAGR, ITGA2B, LSLNAELQLDR, PLEKHA4, QDSSGLR, LHX2, ISSAIDR, EFCAB10, LLLYCLGK, RPL14P5, ALVDGPCTQRR,CCDC167, ASNYEKELK,

RPL4, NIPGITLLNVSK, GSDMD, SSSQPCAWLRSR, WFIKKN2, ADFPLSVVR, DLGVLDVIFHPTQPWVFSSGADGTVR,BOP1, BHLHB9, KIELHVK, CHN2, QPGCYTLALR, PASK, IPVSVWMK, PKN3, QLHVELK, PLA2G5, KLVYCLK, RAN, SNYNFEKPFLWLAR,ANKRD40, TPESTKPGPVCQPPVSQSR,PEG3, KLLSLGLK, XYLT1, ITNWNR, GALNT6, RHMPLR, C6orf99, DGGLAMLPILVSK, RASEF, IVLAGDAAVGK, TBC1D5, QIKDILK, INSC, LSCMSRLIELCR, FAM190A, IVSQNLSTR, PIAS1,LPFYDLLDELIKPTSLASDNSQR,RPN1, LDAQVKELVLK, NEK10, LILPNKQK, MTOR,AIQIDTWLQVIPQLIAR,MRPS14, HLADHGQLSGIQR, RPL24, QINWTVLYR, ELL2, DLSYTLK, HADHB,EVVDYIIFGTVIQEVK,RPS13, LILIESR, KIF3A,INLSLSTLGNVISALVDGK,NCAPD2, LLNILGLIFK, CGRRF1, LYEALQK, MCOLN2, LGLQILK, SAMHD1,LTDNIFLEILYSTDPK,tcag7.977, TKVGIIAGR, TNRC6A, MDMNSIKEPQSR, WDR61,YLASGAIDGIINIFDIATGK,RPS10, IAIYELLFK, ILF2,INNVIDNLIVAPGTFEVQIEEVR,CPNE1,EALAQTVLAEVPTQLVSYFR, BCR, ELDLEKGLEMR, SNRPB2,SLYALFSQFGHVVDIVALK,RPS16, LLEPVLLLGK, WDR26,LALLNVATQGVHLWDLQDR,TIFTGHTAVVEDVSWHLLHESLFGSVADDQK,RBBP4,

ASAP2, EIISEVQR, APRT, LAPVPFFSLLQYE, MRPL10,TVPFLPLLGGCIDDTILSR,DEFT1P2, SSLEQMIRK, NUP188,ASPESQEPLIQLVQAFVR,YRDC, MERSEELNK, CNTRL, QLEEAIQLKK, SLC25A3,YYALCGFGGVLSCGLTHTAVVPLDLVK,LRRC59, LVTLPVSFAQLK, ADCK3,LANFGGLAVGLGFGALAEVAKK,ARL1, FQVWDLGGQTSIRPYWR,KIAA1967,QEGLDGGLPEEVLFGNLDLLPPPGK, TPR,GQNLLLTNLQTIQGILER, MECOM, ASGTKLTEPR, UNC45A,VLISNLLDLLTEVGVSGQGR,ASB7, LLLEFK, SETX, IILEFK, CPSF2, VLELAQLLDQIWR, MRPS7, KPVEELTEEEKYVR, MLL2, IYEEQNR, XRN2, NSPGSQVASNPR, POLM, AFLTGLAR, SERPINH1,DQAVENILVSPVVVASSLGLVSLGGK,PCNA, FSASGELGNGNIK, CCT6B, GIDPFSLDSLAK, PTBP1,VTPQSLFILFGVYGDVQR,RPS4X, LRECLPLIIFLR, TIVAINKDPEAPIFQVADYGIVADLFK,ETFA, CEP170,KPLTTSGFHHSEEGTSSSGSKR, RPL18, TNRPPLSLSR, P4HA2,SQVLDYLSYAVFQLGDLHR,MCM6, LFLDFLEEFQSSDGEIK, IARS, FVDILTNWYVR, HNRNPAB, GFGFILFK, ADRM1,SQSAAVTPSSTTSSTR, TBR1, LSPVLDGVSELR, MYO1B,NFHVFYQLLSGASEELLNK,GTPBP4, ASLTVLEK, KCMF1, SLFVQELLLSTLVR, DHX36, DCAVLSAIIDLIK, SRSF3,NPPGFAFVEFEDPRDAADAVR,NDUFA4, FYSVNVDYSK, RPL15, FFEVILIDPFHK, PARP1,SLQELFLAHILSPWGAEVK,

MDN1, SLLQAWGLILR, PSMC3,EKAPSIIFIDELDAIGTK,CPNE3, SPLGEVAIRDIVQFVPFR,XPNPEP3, AILFVPR, C21orf2, NVLTAILLLLR, TFRC, ILNIFGVIK, ASNS, ELYLFDVLR, GEMIN5, LVFCLLELLSR, IPO11,VLLDFLDQHPFSFTPLIQR,LOC100287551,QTQTFITYSDNQPGVLIQVYEGER, IRS4, LNTEVASVVVQLLSIR,RPS24, TTPDVIFVFGFR, CPSF6,AVSDASAGDYGSAIETLVTAISLIK,RPL3, NNASTDYDLSDK, NAP1L1, GIPEFWLTVFK, EFTUD2, SFVEFILEPLYK, EAATQAQQTLGSTIDKATGILLYGLASR,QARS, POLR2C,FYYNVESCGSLRPETIVLSALSGLK, SUB1,GISLNPEQWSQLKEQISDIDDAVR, CLPX, LLQDANYNVEK, GGT7,GLSGLTQVLLNVLTLNR,SERPINB5, ILVVNAAYFVGK, GNL3,QITIIDSPSFIVSPLNSSSALALR,TTI1, ALADILSESLHSLATSLPR,CLINT1, SLLLLAYLIR, SNRPD1,YFILPDSLPLDTLLVDVEPK,HK1, SANLVAATLGAILNR, EMD, TYGEPESAGPSR, RPL19, LLADQAEAR, DDX1,FLVLDEADGLLSQGYSDFINR,VASP, VKEEIIEAFVQELR,HNRNPM, AFITNIPFDVK, PSMC1,VAEEHAPSIVFIDEIDAIGTK,NAP1L4, GIPEFWFTIFR, HSPB1, LFDQAFGLPR, PHKB, LFLWGQALYIIAK,MRPS18B, YLESEEYQER, CDC123, TLSDIFLLFK, MAP7D1,TPETLLPFAEAEAFLKK,ASPRV1, ILGVWDTAVSLGK, KPNA7, NITWTLSNLCR, RPS5, TIAECLADELINAAK,FANCD2, LLLGIDILQPAIIK, FLG2, SVVTVIDVFYK,

RAB15, IQIWDTAGQER, RAB33B, IQLWDTAGQER, RPS14, TPGPGAQSALR, DPM1, ENLPLIVWLLVK, MSH6, IIDFLSALEGFK, KPNA6,DYVLNCSILNPLLTLLTK,UBE3C, QLAVLTELPFVVPFEER,SEC61A1, FSGNLLVSLLGTWSDTSSGGPAR,LDHAL6B, LIIVSNPVDILTYVAWK,LDHA, LLIVSNPVDILTYVAWK, VCP, QAAPCVLFFDELDSIAK,RPN2, SIVEEIEDLVAR, RPL6, ASITPGTILIILTGR, IARS2, SCQTALVEILDVIVR, MSH2, LNLVEAFVEDAELR,MRPS27, EALDVLGAVLK, RPL13, STESLQANVQR, LPCAT1,TALGVAELTVTDLFR,TRAP1, GVVDSEDIPLNLSR, NCL,TLVLSNLSYSATEETLQEVFEK,ZC3HAV1, ATDLGGTSQAGTSQR,SPATA5, AVAPSIIFFDELDALAVER,

Figure 3.3: Peptide-protein graph constructed from an AP-MS experiment with CDK9 as a bait. Proteins are drawn as red triangular vertices while peptides are drawn as blue round vertices. Note that there are various peptides shared by several different proteins. 3.4. Experimental Results 69

CIESNMLTDMTLQGIEQISK,NSINQVVQLR,LSGEAFDWLLGEIESK, DCQIAHGAAQFLR,ITSQIFIGPTYYQR, AQIHDLVLVGGSTR,LVNHFVEEFK, QIALAVLSTELAVRK,LVQNGTEPSSLPFLDPNARPLVPEVSIK, DVDAAYMSK, MHGGGPPSGDSACPLR,SMVVSGAK, NICEGGEEMDNKFGVEQPEGDEDLTK, ARGPIQILNR, ICRPLLIVEK, WFESSGIHVAALVVGECCPTPSHWSATR,AIVHAVGQELQVTGPFNLQLIAKDDQLK, ISPWLLR, LKELINISK, MTIGHLIECLQGK, LVNHFVEEFKR, RPVINAGDGVGEHPTQALLDIFTIR, FHPKPSDLHLQTGYKVER, KGFDQEEVFEKPTR, CQEVISWLDANTLAEKDEFEHK, LCPPGIPTPGSGLPPPR, GVSENIMLGQLAPAGTGCFDLLLDAEK, VSANKGEIGDATPFNDAVNVQK, QATKDAGVIAGLNVLR, TLHEWLQQHGIPGLQGVDTR, YEELQSLAGK, VVVENGELIMGILCK, MQEEEEVVDK, QAMGVYITNFHVR, ELEQVCNPIISGLYQGAGGPGPGGFGAQGPK,YKAEDEVQR, NSVTGGTAAFEPSVDYCVVK, INISQVIAVVGQQNVEGKR, GTCGIQYR, VFNTGGAPR, RSGLELYAEWK, NALESYAFNMK, ERLFEASDPYQVHVCNLCGIMAIANTR, EIEYEVVR, LQAEIEGLK, VVVENGELIMGILCKK, AKDILCR, TTSGYAGGLSSAYGGLTSPGLSYSLGSSFGSGAGSSSFSR, AGVSQVLNR, SAVEDEGLKGK, AFYPEEISSMVLTK, CDFALFLGASSENAGTLGTVAGSAAGLK, VVLPCNLLR, IVATLPYIKQEVPIIIVFR, KRT7, RMAEEFR, KELEQVCNPIISGLYQGAGGPGPGGFGAQGPK, RLAADFSVPLIIDIK, VRGNLMGK, AAAIGIDLGTTYSCVGVFQHGK, LLESLGYSLYASLGTADFYTEHGVK, DVFLER, LRNLTYSAPLYVDITK, LSKEEIER, RLDLAGPLLAFLFR, MKEIAEAYLGYPVTNAVITVPAYFNDSQR,QTQIFTTYSDNQPGVLIQVYEGER, KRT8, LLVDSNNPK, GGSGSGPTIEEVD, FLSSAAAVSKEHPVVISK, ISSSSFSR, YEELQVTAGK, IKDILAK, VQVSYKGETK, NLTYSAPLYVDITK, IEWLESHQDADIEDFK, FHLPPR, AQYEDIANR, ASLEAAIADAEQRGELAIK, LEHTTLR, NQLTSNPENTVFDAK, LTPEEIER, HLALLCDTMTCR, GPIQILNRQPMEGR, GFSSGSAVVSGGSRR,RSTSSFSCLSR, DAGVIAGLNVLR, SQIFSTASDNQPTVTIK, EELSALVAPAFAHTSQVLVDK, SKEEAEALYHSK, RVQFGVLSPDELK, IPQIGDKFASR, SDIDEIVLVGGSTR, GQPLPPDLLQQAK, RISDEECFVLGMEPR, LLQDFFNGR, GSSSGGGYSSGSSSYGSGGR,LQAEIEGLKGQR, ATAGDTHLGGEDFDNRLVNHFVEEFK, IEIESFYEGEDFSETLTR, HGGGGGGFGGGGFGSR, LEGLTDEINFLR, KRT79, IVEDAPPIDLQAEAQHASGEVEEPPR, HSPA1B, NERPDGVLLTFGGQTALNCGVELTK, HGVNRQDTGPLMK, IDTRNELESYAYSLK, NQVALNPQNTVFDAKR,FKEIAEAYLGYPVTNAVITVPAYFNDSQR, HSPA1L, DNGDRIDLR, HIDQLKER, LLQDFFDGR, MAEIGEHVAPSEAANSLEQAQAAAER, GFSSGSAVVSGGSR, KFGDPVVQSDMK, SLSEYNNFK, IFVNGCWVGIHK, VTHAVVTVPAYFNDAQR, FGGFGGPGGVGGLGGPGGFGPGGYPGGIHEVSVNQSLLQPLNVK, AEIQELAMVPR, SINPDEAVAYGAAVQAAILMGDK, LSSFVTK, HSPA1A, VYEGERPLTK, KRT75, HWPFQVINDGDKPK, HPQPGAVELAAK, TLLEGEESR, FASFIDK, HAIYDKLDDDGLIAPGVR, WELLQQVDTSTR, TNAENEFVTIKK, NKLNDLEEALQQAK, NQDDLTHKLADIVK, EIAEAYLGYPVTNAVITVPAYFNDSQR, NSTIPTK, GNSQYPGAK, LLQDFFDGRDLNK, TKPYIQVDIGGGQTK, FEQIYLSKPTHWER, LDKAQIHDLVLVGGSTR, RLSSFVTK, FSSCGGGGGSFGAGGGFGSR,VDLLNQEIEFLK, NSKIEISELNR, NQDDLTHK, DNNLLGR, VEIIANDQGNR, TWNDPSVQQDIK, SAVEDEGLK, LALGIPLPELR, NLDLDSIIAEVK, VQFGVLSPDELKR, GFDQEEVFEKPTR, IINEPTAAAIAYGLDK, GSYGSGGSSYGSGGGSYGSGGGGGGHGSYGSGSSSGGYR, QISNLQQSISDAEQR,SISISVAGGGGGFGAAGGFGGR, GRLDSELR, TAETGYIQRR, TVTLPENEDELESTNRR, ITITNDK, KVTHAVVTVPAYFNDAQR, ELSDLESAR, YEELQQTAGR, FRFDYTNER, YEELQVTVGR, YEDEINKR, NQVALNPQNTVFDAK, ATAGDTHLGGEDFDNR, IINEPTAAAIAYGLDKR,QIALAVLSTELAVR, VSLAGACGVGGYGSR, EIAEAYLGK,FDDAVVQSDMK, LTFASTLSHLRR, ITITNDKGR, VCNPIITK, HSPA5, THNLEPYFESFINNLRR, SISISVAR, RIPFGFK, CNEIINWLDK, KRT2, TLPHFIKDDYGPESR, QQLDSFDEFIQMSVQR, NQLTSNPENTVFDAKR, RFDDAVVQSDMK, VYFLPITPHYVTQVIR, VDPEIQNVK, KRT5, TTPSYVAFTDTER, FLEQQNQVLQTK, MVNHFIAEFK, AVEEKIEWLESHQDADIEDFK,VNEISVEVDSDPR, TAAENDFVTLKK, KLLEGEECR, ELINISK, ARFEELCSDLFR, MKEIAEAYLGK, SKAEAESLYQSK, GSGGGSSGGSIGGR, ILLSPER, EVAYCSTYTHCEIHPSMILGVCASIIPFPDHNQSPR, IINEPTAAAIAYGLDR, KRT6A,KRT6C, STLEPVEK, TPHVLVLGSGVYR, GEIGDATPFNDAVNVQK, NQVAMNPTNTVFDAKR, IEWLESHQDADIEDFKAK, TSQNSELNNMQDLVEDYKK, GSKINISQVIAVVGQQNVEGK, FEELCSDLFR, SFYPEEVSSMVLTK, AKFEELNMDLFR,RIIAHAQLLEQHR, QNLEPLFEQYINNLR, LPSDLHPIK, GGSGGGGSISGGGYGSGGGSGGR, KRT6B, AQYEEIANR, QEDMPFTCEGITPDIIINPHAIPSR, FLEQQNQVLQTKWELLQQVDTSTR, SLDLDSIIAEVKAQYEDIAQK, MVQEAEK, CAD, LSLDDLLQR, LPSDLHPIKVVEGVK, LSKEDIER, TFAPEEISAMVLTK, FRELPAGINSIVAIASYTGYNQEDSVIMNR, VLYDAEISQIHQSVTDTNVILSMDNSR, YSPTSPK, MVNHFIAEFKR, ITPSYVAFTPEGER, AQYEEIAQR, SVGEVMGIGR, IEISELNR,EIKIEISELNR, QNLEPLFEQYINNLRR, LSKEEVER, LLQFHVATMVDNELPGLPR, LFEASDPYQVHVCNLCGIMAIANTR, NKITITNDQNR, IIAHAQLLEQHR, VEILANDQGNR,QATKDAGTIAGLNVLR, DAGTIAGLNVMR, MSGECAPNVSVSVSTSHTTISGGGSR, TNAENEFVTIK, TYQDIQNTIKK, DVDNAYMIK, WTLLQEQGTK, SCLENSSRPTSTIWVSMLAR, GTLDPVEK, ELEEIVQPIISK, LLDTIGISQPQWR, GGSISGGGYGSGGGK, QLDSIVGER, HSPA6, LYGSAGPPPTGEEDTAEKDEL, RDVFLER, LQGEIAHVK, LSSEDKETMEK, ITITNDQNR, GEVAYIDGQVLVPPGYGQDVR, TAAENDFVTLK,KRT1, RGNSQYPGAK, QMDIIVSEVSMIR, DNHLLGTFDLTGIPPAPR, YEELQITAGR, TVTNAVVTVPAYFNDSQR, SLVNLGGSK, GGGFGGGSSFGGGSGFSGGGFGGGGFGGGR,YLDGLTAER, DYQELMNTK, VLIAQEK, QTQTFTTYSDNQPGVLIQVYEGER, ASDPGLPAEEPK, MIVTPQSNRPVMGIVQDTLTAVR, TTSNDIVEIFTVLGIEAVRK, LREQGSLLGK, GKDFNLELAIK, QTQTFTTYSDNQPGVFTQVYEGER, STAGDTHLGGEDFDNR, GHNQPCLLVGSGR, TPSLTVFLLGQSAR, NKLNDLEDALQQAK, LDSELKNMQDMVEDYR, TFIGKIPIMLR, VQFGVLSPDELK, HSPA8, NQTAEKEEFEHQQK, VTAVDWHFEEAVDGECPPQR,

SFEEAFQK, DDYGPESR, POLR2B, KSAIGQR, DAGTIAGLNVLR,LLQDFFNGK, DVDPVRTTSNDIVEIFTVLGIEAVR, AIVHAVGQELQVTGPFNLQLIAK, DVDGAYMTK, AQYEDIAQK, TLSSSTQASIEIDSLYEGIDFYTSITR, TQISLVR, ASDPGLPAEEPKEK, MDDDVFLR, VLILGSGGLSIGQAGEFDYSGSQAIK, AHNNELEPTPGNTLR, EGEEQLQTQHQK, ARFEELNADLFR, SLNNQFASFIDKVR, AEAESLYQSKYEELQITAGR, ALKEENIQTLLINPNIATVQTSQGLADK, VQVEYKGETK, INISQVIAVVGQQNVEGK,MATNTVYVFAK, KVLILGSGGLSIGQAGEFDYSGSQAIK, YPETTEGGRPK, RLNSPIGR, QDVEALWENMAVIDCFASDHAPHTLEEK, SDQSRLDSELK, LNDLEDALQQAKEDLAR, SINPDEAVAYGAAVQAAILSGDK, VLSEPNPRPVFGICLGHQLLALAIGAK, NSLESYAFNMK, HISPGDTK, EATAGNPGGQTVR, NICEGGEEMDNK, AYFLGYMVHR, TLGVDLVALATR, GGGGGGYGSGGSSYGSGGGSYGSGGGGGGGR, SLNNQFASFIDK, VQVEYK, IVATLPYIK, EHPVVISK, THNLEPYFESFINNLR, LLRDYQELMNTK, SLGTSAGSLVHISYLEMGHDITR, SQIHDIVLVGGSTR, VLGTSPEAIDSAENR, LAADFSVPLIIDIK, SLDLDSIIAEVK,AEAESLYQSK, MAEEFR, IPQIGDK, ATGYPLAYVAAK, FEELNADLFR, TSSSFAAAMAR, VNHFIAEFK, MALLATVLGRF, YAYTGECR, ILALDCGLK, LFVEALGQIGPAPPLK, ILPWSTFR, KNILLTIGSYK,VIECNVR,AAFALGGLGSGFASNR, ATVEDEKLQGK, IFHINPR, LDKSQIHDIVLVGGSTR, KITSQIFIGPTYYQR, NQVAMNPTNTVFDAK, NTTIPTKQTQTFTTYSDNQPGVLIQVYEGER, THSTHPDDEDSGPYK, MDTLAHVLYYPQKPLVTTR, GPAVGIDLGTTYSCVGVFQHGK,HWPFMVVNDAGRPK, ISDEECFVLGMEPR, VSGDDVIIGK, LNLSVTTPYNADFDGDEMNLHLPQSLETR, POLR2A, ISNLLSDYGYHLR,

GHLMAITR, TVIKEGEEQLQTQHQK, TLQEDLVK, NTYQSAMGK, DNNLLGK, DVLSNAHIQNELEREFER, QEVPIIIVFR, MQEEEEVVDKMDDDVFLR, LTFASTLSHLR,

AKQDVIEVIEK, GNEVLYNGFTGR, SIAANMTFAEIVTPFNIDR, LLLAALGR,

CSFEETVDVLMEAAAHGESDPMK, TSETGIVDQVMVTLNQEGYK, HMCDGDIVIFNR, TSETGIVDQVMVTLNQEGYKFCK, LOC100287551, DFNLELAIK, KSLGTSAGSLVHISYLEMGHDITR, ELDDRDHYGNK, YSPTSPTYSPTTPK, GNEVLYNGFTGRK, SMEYLR, VELDRK, IYTDAGR, QTQTFITYSDNQPGVLIQVYEGER, MIWNAQK, TVTLPENEDELESTNR, LLFQELMSMSIAPR, DCSTFLR, QIFSLIIPGHINCIR, RDCSTFLR, KILLSPER, YSLATGNWGDQK, YSLATGNWGDQKK, QLHNTLWGMVCPAETPEGHAVGLVK, EMLPHVGVSDFCETK, ELPAGINSIVAIASYTGYNQEDSVIMNR,GPIQILNR, LDLAGPLLAFLFR, FNQAIAHPGEMVGALAAQSLGEPATQMTLNTFHYAGVSAK, QDTGPLMK,

YARPEWMIVTVLPVPPLSVRPAVVMQGSAR, QAQENATLLFNIHLR,

ILNDARDK, QDVIEVIEK,

VLSEKDVDPVR, INAGFGDDLNCIFNDDNAEK,

SIAANMTFAEIVTPFNIDRLQELVR, LTHVYDLCK,

NEQNGAAAHVIAEDVK, LVIVNGDDPLSR,

FDYTNER, TYQDIQNTIK,

DVLSNAHIQNELER, IIITEDGEFK,

LTMEQIAEK, MSVTEGGIKYPETTEGGRPK,

EGLIDTAVK, FGVEQPEGDEDLTKEK, VHEIFKR, TAETGYIQR, THSTHPDDEDSGPYKHISPGDTK, YSPTSPTYSPTSPVYTPTSPK, VVEGVKELSK, RVDFSAR, FHPKPSDLHLQTGYK, TLQEDLVKDVLSNAHIQNELER, TVITPDPNLSIDQVGVPR, KIIITEDGEFK, SGLELYAEWK, KLVIVNGDDPLSR, INAGFGDDLNCIFNDDNAEKLVLR, YSPTSPTYSPTSPK, YTPQSPTYTPSSPSYSPSSPSYSPTSPK, VHEIFK, RTLQEDLVK, VRILPWSTFR, ELYHVISFDGSYVNYR, LEHTTLRK, TTSNDIVEIFTVLGIEAVR, YGEDGLAGESVEFQNLATLKPSNK, GFVENSYLAGLTPTEFFFHAMGGR, NVTLGVPR, KLTMEQIAEK, FGVEQPEGDEDLTK, ALQEWILETDGVSLMR, QTFENQVNR, RNEQNGAAAHVIAEDVK,LFYSNIQTVINNWLLIEGHTIGIGDSIADSK,VYMHLPQTDNK, YTPTSPSYSPSSPEYTPTSPK,

GSCGIGGGIGGGSSR, LYAHTIAGFLDPEK, QIFHPEQLITGK, FWEVISDEHGIDPTGTYHGDSDLQLER, CCVECPPCPAPPVAGPSVFLFPPKPK, KDSETGENIR, FMDDIAALVSTIASDIVSR, TQGFLALFSGDTGEIK, DKQHLDDITAAR, LKASENSESEYSR,YVLGEETTK, VLAQNSGFDLQETLVK,SVTLLIK, ALQDLENAASGDATVR, HNDDEQYAWESSAGGSFTVR, ATHLIAAR, LYELHVFTFGSR, EAESCDCLQGFQLTHSLGGGTGSGMGTLLISK, YLTVAAVFR, VVSVLTVVHQDWLNGKEYK, YDDPEVQK, IDISPAPENPHYCLTPELLQVK,LKEALQPLINR, QASQGMVGQLAAR,IRGTSYQSPHGIPIDLLDR, ASLENSLEETKGR, IGSVLAVFEAAASAQSAGASQSR, MREIVHLQAGQCGNQIGAK, TTPSVVAFTADGER, SGSPISSEERR, AAIAECEEVRR, IREFYR, ETLIEWK, GGSGGSHGGGSGFGGESGGSYGGGEEASGSGGGYGGGSGK, SQVFSTAADGQTQVEIK, SLTNDWEDHLAVK, VQALEEANNDLENK,SGGGGGGGLGSGGSIR, VLADVAIIFSGLHPTNFPIEK, TPNEEIDRQNDDQR, YSDPQIK, KEVVHTVSLHEIDVINSR, LLIVSTTPYSEKDTK, RRPEEQEEEPQPR, LTTPTYGDLNHLVSATMSGVTTCLR, AQFEGIVTDLIR, VSSQAEDTSSSFDNLFIDR, SVSSSVQVCPEVGKR, FLDTLLEELHLK, TLLDIDNTR, SGPFGQIFRPDNFVFGQSGAGNNWAK, IFDSFAK, VHAELADVLTEAVVDSILAIK, HKSETDTSLIR, QEYEQLIAK, VLQAQECGHLHVVNPDWLWSCLER, VQQTVQDLFGR, EVATNSELVQSGK, VLDELTLAR, SAVRPASLNLNR, AKLEEAILGSIGAR, TEALTQAFR, ARDYDAMGSQTK, GPPAPSSSLPIR, TLKLTTPTYGDLNHLVSATMSGVTTCLR, NAVITVPAYFNDSQR, SQDSEEHDSTFPLIDSSSQNQIR, TNKVYDITER, HFSVEGQLEFR, TLNDMRQEYEQLIAK, STSESTAALGCLVK, KHLEINPDHPIVETLR, FSSSSGYGGGSSR, DIWPPAQAPTSSQELAGAPEPQGSCAQGGR, TUBA4B, TTPPMLDSDGSFFLYSK, DAGQISGLNVLR, DRVPPSSEASEHHPR, QNYEEMQAK, RPRD1B, NPDDITNEEYGEFYK,EQVANSAFVER, KEAESCDCLQGFQLTHSLGGGTGSGMGTLLISK, HIYYITGETK, KIVPELK, TUBB4B, IGHG2, ASNGDAWVEAHGK, GSWACSIFDLK, GYQTSPDLR, VYSLFLDESR, IRCEEEDVEMSEDAYTVLTR, ALLFIPR, LDSQEKDATCELPLQK, VNTQSSSNSTLPER, ALQFLEEVK, CCT6A, TKVHAELADVLTEAVVDSILAIK, KDIENQYETQITQIEHEVSSSGQEVQSSAK, IMNTFSVVPSPK, LEGCSHPVVMK, MRENVHLQAGQCGNQIGAK, EIVHLQAGQCGNQIGAK, EFESVLVDAFSHVAR, HLEINPDHPIVETLR, KRT14, QEIECQNQEYSLLLSIK, STNGDTFLGGEDFDQALLR, SVYGGEFIQQLK, HSP90AA1, LASYLDKVR, APSTYGGGLSVSSSR, MAVTFIGNSTAIQELFKR,MSATFIGNSTAIQELFKR, IGH@, KSELFNPVSLDCK, LLPLHHMPTQLLSIEESLALQK, KRT16, LVLSPDAPDR, KEPESCDCLQGFQLTHSLGGGTGSGMGTLLISK, GFYPSDIAVEWESNGQPENNYK, SLSNSNPDISGTPTSPDDEVR, VQAGDVITIDKATGK, LYSPSQIGAFVLMK, TQELAFATHQDPADPK, GTEVQVDDIKR, AELEAAVR, ALHIVEQLLEENITEEFLMECGR, KLSELSNSQQSVQTLSLWLIHHR,FYEAFSK, DIENQYETQITQIEHEVSSSGQEVQSSAK, FWEVISDEHGIDPTGTYHGDSDLQLDR, VVSVLTVVHQDWLNGK, LTFLYLANDVIQNSKR, INVYYNEATGGK, ISEQFTAMFR, VPAEGAPTAAVAEVR, MAVTFIGNSTAIQELFK, THTCPPCPAPELLGGPSVFLFPPKPK, GIDPFSLDALSK, GGSGGSYGGGGSGGGYGGGSGSR, GFYPSDIAVEWESNGQPEDNYK,GPSVFPLAPSSK, AQFEGIVTDLIRR, AQLGVQAFADALLIIPK, KLTFLYLANDVIQNSK, ADLINNLGTIAK, NVSTGDVNVEMNAAPGVDLTQLLNNMR, IASLPVEVQEVSLLDK, HSP90AB1, HSQFIGYPITLYLEKER, NQILNLTTDNANILLQIDNAR, DECIDPFSK, HRVSSQAEDTSSSFDNLFIDR, MLADFLR, TKYEHELALR,LLEGEDAHLSSSQFSSGSQSSR,SQYEQLAEQNRK, VLENAEGAR, TTSGDDACNLTSFRPATLTVTNFFK, KTQELAFATHQDPADPK,LLIVSTTPYSEK, TQGFLALFSGDTGEIKSEVR, AAIAECEEVGR,SFCSNFCYQASK, NLDIERPTYTNLNR, MSATFIGNSTAIQELFK,TUBB4A, SCDKTHTCPPCPAPELLGGPSVFLFPPKPK, FITPAHYSDVVDER, LENEIQTYR, SDLEMQYETLQEELMALKK, NAIDDGCVVPGAGAVEVAMAEALIK, HSQFIGYPITLYLEK, ALLFVPR, GPPAPSSSLPIRQEPSSFR, LISQIVSSITASLR, TUBB, VATAQDDITGDGTTSNVLIIGELLK, LTFLYLANDVIQNSK, VTMQNLNDR, NHEEEMKDLR, MTLDDFR, GHYTIGKEIIDLVLDR, RISEQFTAMFR, TTPPVLDSDGSFFLYSK, HSPA9, RYDDPEVQK, DOCK7, RUVBL2, POLR2M, RPAP2, LAAEIDDR, FENLCK, HSQFLGYPITLYLEK, EEESCDCLQGFQLTHSLGGGTGSGMGTLLISK,MSMKEVDEQMLNVQNK, QALYGDKKPR, MAATFIGNSTAIQELFK, DTLMISR, TIDDLKNQILNLTTDNANILLQIDNAR, KRT9, GTEVSEPSPPVRDPEGVTQAPGVEPSNGLEKPAR, KEQESCDCLQGFQLTHSLGGGTGSGMGTLLISK, GPSVFPLAPCSR, IGHG4, IGHG1, YHTSQSGDEMTSLSEYVSR, QLFHPEQLITGKEDAANNYAR, VINEPTAAALAYGLDK, VNVTHKEEIILTPIEVAIEDMQK, LQEAFSK,DKVQAGDVITIDK, GLGLDDALEPR, KAELEAAVR, VLPGLLVPLQITLGDIYTQLK, LASYLDK, IKFEMEQNLR, FDGALNVDLTEFQTNLVPYPR, EIIDLVLDRIR, ITIADQGEQQSEENASTK, TNVLPFR, GLQDVLR, YENEVALR, EDAANNYAR, QLFHPEQLITGK, EVDEQMLNVQNK, GTLVTVSSASTK, ELISNASDALDK, YESLTDPSK, EHYHATALGAK, MAATFIGNSTAIQELFKR, EQQIVIQSSGGLSKDDIENMVK, NYSPYYNTIDDLKDQIVDLTVGNNK,ALEEANADLEVK, GHYTEGAELVDSVLDVVR,INVYYNEATGGKYVPR, NSSYFVEWIPNNVK,MREIVHIQAGQCGNQIGAK, NPDDITQEEYGEFYK, ADLEMQIESLTEELAYLK, TPEVTCVVVDVSHEDPEVK, ALQDLENAASGDAAVHQR, RPRD1A, IDIIPNPQER, ELCAQPGQVVAPGAVLVR, KEWESCDCLQGFQLTHSLGGGTGSGMGTLLISK,AVLVDLEPGTMDSVR, ESQNSLDESLPFR, LCGYPLCQK, IQSLPDLSR, YIDQEELNK, HSP90AB2P, RTIAPCQK, AHGELHEQFK, SAIFSITYPSQDVFLVIK,IKEETEIIEGEVVEIQIDRPATGTGSK, YAIQLITAASLVCR, LASYLDKVQALEEANNDLENK, TIQFVDWCPTGFK, AVFVDLEPTVIDEVR, TAVCDIPPRGLK, FDLMYAK, AYHEQLTVAEITNACFEPANQMVK, TAVCDIPPR, VREEYPDR, WQEGNVFSCSVMHEALHNHYTQK, FNWYVDGVEVHNAK, SYNPEGESSGR, AAIAECEEVR, SQYEQLAEQNR, CTDP1, TLGAPASSER, STSGGTAALGCLVK, MKETAENYLGHTAK, QVLDNLTMEK, TIGGGDDSFNTFFSETGAGKHVPR, QSLEASLAETEGR, LSVNMVPFPR, ALTVPELTQQMFDAK,MSSTFIGNSTAIQELFKR, SCDTPPPCPICPAPELLGGPSVFLFPPKPK, KSFCSNFCYQASK, FFEAQIPK, RP11-631M21.2, GHYTEGAELVDAVLDVVR, ETGVDLTKDNMALQR, TILTYAEEDLELR, GLLRPHVPPAAITTLAR, DYDAMGSQTK, FVQCPDGELQK, CCT6B, LPNVTGSHMHLPFAGDIYSED, YCGQLQMIQEQISNLEAQITDVR,ELNGSEAATPR, LHFFMPGFAPLTSR, TUBB3, TYEQIKVDENENCSSLGSPSEPPQTLDLVR, EDMAALEKDYEEVGADSADGEDEGEEY, TIGGGDDSFNTFFSETGAGK, LLGQFTLIGIPPAPR, HSQFLGYPITLYLEKER, TUBA1C, SGSPISSEER, SSQTSQNQGLGRPTLEGDEETSEVEYTVNKGPASSNR, ADLEMQIESLTEELAYLKK,EIETYHNLLEGGQEDFESSGAGK, EIEEAPDIRK, FPGQLNADLR, EEQSGHSGEEVQLCSK, TSDIDNPSHFEK, IHFPLATYAPVISAEK,YMACCLLYR,LIGQIVSSITASLR, YDDPEVQKDIK, SIYYITGESK, FPGQLNADLRK, TPEVTCVVVDVSHEDPEVQFK,VVSVLTVLHQDWLNGK, NSLLASYIHYVFR, DSTEVEISTGER, MIESLTK, AVLIAGQPGTGK, SVYENDVLEQLK,DFAPVIVEAFK, IRLENEIQTYR, LEKEIETYHNLLEGGQEDFESSGAGK, QAVTNPNNTFYATK, KRT10, AVPPPQPQMFGEELPDAQDGEQPGPSRR, GHYTEGAELVDSVLDVVRK, KLAVNMVPFPR, SCDTPPPCPR, TUBA1B, FWEVISDEHGIDPSGNYVGDSDLQLER,GHYTEGAELVDAVLDVVRK, NLTGLSSGTEK, SSQTSQNQGLGRPTLEGDEETSEVEYTVNK, SEITLVGISKK, ASENSESEYSR, QFSSSYLSR, KLADQCTGLQGFLVFHSFGGGTGSGFTSLLMER, QAVTNPNNTFYATKR, SYTEDWAIVIR, KQISGQYSGSPQLLK, TTEMETIYDLGTK, GTSYQSPHGIPIDLLDR, IREEYPDR, IGHG3, ERVEAVNMAEGIIHDTETK, GIDPFSLDSLAK, HGVQELEIELQSQLSKK, RPEEQEEEPQPR, GHYTIGK, EIVHIQAGQCGNQIGAK, KIEFER, YVLGEETTKSQDSEEHDSTFPLIDSSSQNQIR, AETECQNTEYQQLLDIK,LKYDNELALR, AILVDLEPGTMDSVR,LAVNMVPFPR, NAEKYAEEDR, SNSWVNTGGPK, NLLYIYPQSLNFANR, FVQCPDGELQKR,EYQDAFLFNELKGETMDTS, SELFNPVSLDCK, GFEPQAPEDLAQR, STMQELNSR, DVNAAIATIK, NTTIPTKK,QAASSLQQASLK, NSLPDALLPNLLDR, YAIQLITAASLVCRK, AIAEVDVGTDKAQNSDPILDTSSLVPGCSSVDNIK, EEERHPDFQLLK,VYDITER, SIQFVDWCPTGFK,TUBA1A, AVCMLSNTTAVAEAWAR, NMMAACDPR, KRT13, HGVQELEIELQSQLSK, YCVQLSQIQAQISALEEQLQQIR, GGGGSFGYSYGGGSGGGFSASSLGGGFGGGSR,FSSSGGGGGGGR, AVPPPQPQMFGEELPDAQDGEQPGPSR, EIVHIQAGQCGNQIGTK, EIIDLVLDR, EAESCDCLQGFQLTHSLGGGTGSGMGTLLLSK, MASTFIGNSTAIQELFKR, LVLSPDAPDRATHLIAAR, SCDTPPPCPGCPAPELLGGPSVFLFPPKPK, DAEAWFNEK, TPLGDTTHTCPR, DSPRPGKPDER, YMACCLLYRGDVVPK, LADQCTGLQGFLVFHSFGGGTGSGFTSLLMER, TUBB8, QSVEADINGLRR, GLADQCTGLQGFLVFHSFGGGTGSGFTSLLMER, GSSGGGCFGGSSGGYGGLGGFGGGSFR, DYEEVGVDSVEGEGEEEGEEY, LTTPTYGDLNHLVSATMSGVTTSLR, ALEAELNDLM, AYHEQLSVAEITNACFEPANQMVK, AVFVDLEPTVIDEVRTGTYR, KEAESCDCLQGFQLTHSLGGGTGSGMGTLLLSK,TUBB6, VLDELTLTK, YLNKEIEEAPDIR, MASTFIGNSTAIQELFK, SKELTTEIDNNIEQISSYK, KYVYFQGTGDMNAPPGSR, AVCMLSNTTAIAEAWAR, LSVDYGK, LDHKFDLMYAK, LHFFMPGFAPLTAR, SLLEGEGSSGGGGR, ENSPAAFPDR, YLTVATVFR, ALEESNYELEGK, QSVEADINGLR, EIEEAPDIR, AFVHWYVGEGMEEGEFSEAR, FWEVISDEHGIDPAGGYVGDSALQLER, GSLGGGFSSGGFSGGSFSR, LYAHTIAGFLDPEKK, RNLDIERPTYTNLNR,VGINYQPPTVVPGGDLAK, INVYYNESSSQK, NVQALEIELQSQLALK,LKYENEVALR, TREHYHATALGAK, GLCAECGQDLTQLQSK, SKVLADVAIIFSGLHPTNFPIEK, FAPNLITVK,VHTDYYAK,

KRT24,

QTMQNLNDR,

LTYQDAVNLQNYVEEK,SFYTAIAQAFLSNEK,QLSIDTRPFRPASEGNPSDDPEPLPAHR, VAPDEHPILLTEAPLNPK, EACGVIVELIK,AQTEGINISEEALNHLGEIGTK, VVTTNYKPVANHQYNIEYER, QSLVDMAPK, LSSDVLTLLIK, WAIIGWLLTTCTSNVAASNAK, NAGVEGSLIVEK, RYEEIVK, FAELKEK, HTVYPKPEEWPKSEYSELDEDESQAPYDPNGKPER,FIIENTDLAVANSIR, GGGGPGGGGPGGGSAGGPSQPPGGGGPGIRK, TGLSSEQTVNVLAQILKR, AQGPQQQPGSEGPSYAK, NLDPDVFNQIVK,QDVCQSYSEK, TAVAPIER, FAM75A7, RGPD5, LOC100509836, CBWD5, VEFTEDLQK, TPADIYR, YSVQLLTPANLLAK, EHPFVLQSVGGQTLTVFTESSSDK,QNTGVWLVK, SVSLTGAPESVQK, TTPNSGDVQVTEDAVR, VLLDLSAFLK, FAM75A2, RGPD1, ANAHFILK, MQEENITR, GYLVTQDELDQTLEEFK, DGPSAGCTIVTALLSLAMGRPVR,EIFDIAFPDEQAEALAVER, GAAAAAAASGAAGGGGGGAGAGAPGGGR,NIWLAESVLDILTEQR, VGGTSDVEVNEK, GANPVEIR, KLEDGPK, NRDNDPNDYVEQDDILIVK, KIGQQPQQPGAPPQQDYTK, TLTAEEAEEEWER, RAPDQAAEIGSR, VLKIEEVSDTSSLQPQASLK, GHWDDVFLDSTQR, TALALAIAQELGSK, LLLQVQHASK, GDFLNYALSLMR, STTTGHLIYK, FYYNVESCGSLRPETIVLSALSGLK, LOC100509749, CBWD3, GTEDITSPHGIPLDLLDR, LSQQLDKVVTTNYKPVANHQYNIEYER, EIGVQNVK, QAQVATGGGPGAPPGSQPDYSAAWAEYYR, TGLSSEQTVNVLAQILK, QNLFLGSLTSR, YNHLVYSQIPAAVK, YDEAIDCYTK, LNEAKQDFETVLLLEPGNK, DLLADKELWAR, CTADILLLDTLLGTLVK, LPLQDVYK, ITELTDENVKFIIENTDLAVANSIR, NAQNPSLHHPR, TISHVIIGLK, IIRPSETAGR, GYLVTQDELDQTLEEFKAQFGDKPSEGRPR,YLLQEQLK, ILCFYGPPGVGK, LLQNVAR, VLAHLAPLFDNPK, IQEIIEQLDVTTSEYEKEK, ISSIQSIVPALEIANAHR, VQISPDSGGLPER, MHFSLKE, DISCLNRDPAR, ATFLGLTNEK, ACTBL2, DFLAGAVAAAVSK, KVDNDYNALR, YLSQQWAK, YEEIVKEVSTYIK, FYYNVESCGSLRPETIVLSALSGLKK, SLC25A4, FAM75A4, RGPD4, VTFKDEPGEGSGVAR, VPFCPMVGSEVYSTEIK, AECRPAASENYMR, NSSQDDLFPTSDTPR, QDFETVLLLEPGNK, LOC100510407, LOC653510, DQGGFGDRNEYGSR, YNIMAFNAADKVNFATWNQAR, IPDEFDNDPILVQQLRR, URI1, NVAIFTAGQESPIILR, YYVTIIDAPGHRDFIK, WNPTAGVAFEYDPDNALR, GAWSNVLR, FAM75A6, RGPD8, QYLMNLEQR, VEAGDVIYIEANSGAVK, VLLLPLER, SYELPDGQVITIGNER, GMGGAFVLVLYDEIK, KMSQEVCTLK, SQQVIVQGVHELYDLEETPVSWKDDTER, ALIVVQQGMTPSAK, KTIMQLCHDR,DIIALNPLYR, TENPLILIDEVDKIGR,LLCHLTPSIYTEFPDETLR, TQLVWLVR,GVMLAVDAVIAELKK, LVQDVANNTNEEAGDGTTTATVLAR, KLASQGDSISSQLGPIHPPPR, TLTAEEAEEEWERR, SNKQNLFLGSLTSR, LALPSRVQK, LVELYR, TSDIFEADIANDVK, GEHLLLFLVQTVAR,SFYTAIAQAFLSNEKLPNLECIQNANK, CDTYATEFDLEAEEYVPLPK, LSQQLDK, DLVDITKQPVVYLK, YYVTIIDAPGHR, HTVYPKPEEWPK, AIECYTR, AIELQLQVK, AYSIVIR, SLC25A5, AFVDVVNGEYVPR, C19orf2, RPAP3, SLC25A6, LOC100510529, CBWD7, QCVVGPNHAAFLLEDGR, DSYVGDEAQSKR, RUVBL1, LDPSIFESLQK, GTF2F2, POLR2E, LONP1, INTS3, HSPD1, EEF1A1, KDGNASGTTLLEALDCILPPTRPTDKPLR, POLR2C, KLSDLQTQLSHEIQSDVLTIN, KHSRP, QLEDGDQPESK, GTF2F1, IYQEEEMPESGAGSEFNRK, TIMM50, TIALNGVEDVR, FAQLALER, LOC730429, DRGSGLLGSQPQPVIPASVIPEELISQAQVVLQGK, IEEVSDTSSLQPQASLK, KISQALASK, IIAPPER, QAASGLVGQENAR, EHPFVLQSVGGQTLTVFTESSSDKLSLEGIVVQR, LQIEESSKPVR, SVEMHHEALSEALPGDNVGFNVK, LGLIPLISDDIVDK, TDLTVLVAHNDDPTDQMFVFFPEEPK, IQAGDPVAR,TIRDIIALNPLYR, IVSGEAESVEVTPENLQDFVGKPVFTVER,DIIHNPQALSPQFTGILQLLQSR, HDELLAEHIK,VTDALNATR, KPLVIIAEDVDGEALSTLVLNR, IINDLLQSLR, IYQEEEMPESGAGSEFNR, QSRLEQEEQQR, FAM75A5, RGPD2, SISCEEATCSDTSESILEEEPQENQKK, UBR5, DFLAGGVAAAISK, LVHTNEVTVLLGDNWFAK, WFWSIVEK, IRAQTEGINISEEALNHLGEIGTK, THINIVVIGHVDSGK, ITELTDENVK, FAM75A1, RGPD3, GOLGA8A, CBWD1, IQFKQDDGTGPEK, EFRPEDQPWLLR, AQGPQQQPGSEGPSYAKK, YFPTQALNFAFK, ILGLCLLQNELCPITLNR, GIAADGANALLPANR, LLCDSVVLQPYLR, ALESSIAPIVIFASNR, FAVAESDCNLAVALNR, GMGGAFVLVLYDELK, GOLGA8B, CBWD2, NSTGSGHSAQELPTIR, LSLEGIVVQR, QHVLDMLFSAFEK, EHALLAYTLGVK, DNDPNDYVEQDDILIVK, HVAYVFQALIYWIK, IWHHTFYNELR, AVLLAGPPGTGK, HSVGVVIGR, TTPNSGDVQVTEDAVRR, TVLEHYALEDDPLAAFK, FAM75A3, LIF, ACTG1, LRENQLPR, AQFGDKPSEGRPR, QLEVEPEEPEAENKHKPR, VLFICTANVTDTIPEPLRDR,LAQLTLEQILEHLDNLR, FSDLFSLAEEYEDSSTKPPK,ALMLQGVDLLADAVAVTMGPK, CEFQDAYVLLSEKK, EQGVLSFWR, NTPVQSPVSLGEDLQWWPDKDGTK,FICIGALYSELLAVSSK, ACTB, NMITGTSQADCAVLIVAAGVGEFEAGISK, SEYSELDEDESQAPYDPNGKPER, SYDYEAWAK, ILDELDKDDSTHESLSQESESEEDGIHVDSQK, NNPLYHAGAVAFSISAGIPK, MKIEEVK, MILIQDGSQNTNVDKPLR, KTGLSSEQTVNVLAQILK, YFKDYR, NTWELKPEYR, HQYYNLK, EVSTYIK, DCTCEEFCPECSVEFTLDVR, GLGLDESGLAK, AINQQTGAFVEISR, GNSRPGTPSAEGGSTSSTLR, APDQAAEIGSR, EEVTELLAR,RTDLTVLVAHNDDPTDQMFVFFPEEPK, LAQPYVGVFLK, YSNENLDLAR, GKGAAAAAAASGAAGGGGGGAGAGAPGGGR,DLALVSR, TALLDAAGVASLLTTAEVVVTEIPKEEK,AAVEEGIVLGGGCALLR, IGGIGTVPVGR, LGLIPLISDDIVDKLQYSR, ILHDFYIEK, ISQALASK, LSAVEAIANAISVVSSNGPGNR,RSGTISTSAAAAAAALEASNASSYLTSASSLAR, VEAGDVIYIEANSGAVKR, TEVSFTLNEDLANIHDIGGKPASVSAPR,RLQIEESSKPVR, LASQGDSISSQLGPIHPPPR, LRLDTGPQSLSGK, VVVVDCKK, AASTAPSSTSTPAASSAGLIYIDPSNLR, LSLELFGR, DSIEKEHVEEISELFYDAK, VETGVLKPGMVVTFAPVNVTTEVK,QLIVGVNK, LSDLQTQLSHEIQSDVLTIN,PYANQPTVR, DAFADAVQR, VLNHFSIMQQR, IPDEFDNDPILVQQLR, EKPLLIFEILQR,VLELEPNNFEATNELR, LCYVALDFEQEMATAASSSSLEK, ERVEAGDVIYIEANSGAVK, GELDLTGAK, YFGIKR, NYLDWLTSIPWGK, QNFFSQTPILQALQHVQASCDEAHK, GYISPYFINTSK, LYVPLYSSK, LDLLYR, LEHTAQTYSELQGER, VAPEEHPVLLTEAPLNPK, VFMEDVGAEPGSILTELGGFEVK,WSSGVGGSGGGSSGR, GYSFTTTAER, GSGLLGSQPQPVIPASVIPEELISQAQVVLQGK, DSYVGDEAQSK,

TASTLLKPAPSGLPSER, EATADDLIKVVEELTR, NVVHQLSVTLEDLYNGATR, POTED, YDSQVAEENR, ELLADLER, STVWQVR, AAYNPGQAVPWNAVK, QAVEQQIQSHR, ESTLHLVLR, LVLPSLLAALEEESWR,VLPQLISTITASVQNPALR, TPVIDADKPVSSQLR,VTDFGDKVEDPTFLNQLQSGVNR, TFKEICAVSR,GTGAASFDEFGNSK, IFYPETTDVYDRK,VLWLDEIQQAVDEANVDEDR, CLIATGGTPR,SNIWVAGDAACFYDIK, LFTEVEGTCTGK,MFYHISLEHEILLHPR, LLLQGEADQSLLTFIDK, LLSSDRNPPIDDLIK, N4BP2L2, C7orf73, LOC440258, AKAP9, MAP4K4, TFAP2A, H2AFX, PCBP1, PPP1CA, DNSVFHVSNLEFPFRK, ANLFYDVQFK, YEYPIQR, LAEAEETAR, SLSALGNVISALAEGSTYVPYRDSK, RRPEILSFFSTNLQR, DAVWWLHTVVPSISK, TYEVSLR, ALYYLQIHPQELR, NVVHQLSVTLEDLYNGATRK, QISQAYEVLSDAK, YLIYDIIK, IQDKEGIPPDQQR, SPHQNVCEQAVWALGNIIGDGPQCR, YTGEEDGAGGHSPAPPQTEECLR, VVCAGYDNGDIKLFDLR, VHLTPYTVDSPICDFLELQR, ILQDSLGGNCR, IIIEDLLEATR, VLPLEALVTDAGEVTEAGK,AHQANQLYPFAISLIESVR, GIFEALRPLETLPVEGLIR, TQKEIGDIAGVADVTIR, APDLFPTDFK, FQATSSGPILREEFEAR, VLWLDEIQQAVDEANVDEDRAK,ELWFSDDPNVTK, SITIIGGGFLGSELACALGR,VDKNDIFAIGSLMDDYLGLVS, VGLFTEIGPMSCFISR, NLLIFENLIDLK, LKLGAIFLEGVTVK, TLTVELEPSDTVENLK, EWNLFYQK, GHKEIINAIDGIGGLGIGEGAPEIVTGSR, SPAPPLLHVAALGQK, SLSALGNVISALAEGSTYVPYR, UBA52, FENNSWVFMR, GQAVTLLPFFTSLTGGSLEELRR, LOC645084, CEP85L, L1RE2, KIAA1598, TNIK, TFAP2C, HIST2H2AB, PCBP2, PPP1CB, DSQVVQVVLDGLK, RPPPHILDQVK, TVPLYESPR, FKQISQAYEVLSDAK, ILLGQNR, KLEEIK, EALNMERNNR, KELELK, ALFLIPR, NGGRSLR, NDEELNKLLGGVTIAQGGVLPNIQAVLLPK, INISEGNCPER, EIFLSQPILLELEAPLK, SEHQGALWDCLLSFIR, LLASINSTVR, LIIEFK, WLNCPR, VFGFDSFK, YLATGDFGGNLHIWNLEAPEMPVYSVK, HDLPPYR, ANLEAFTVDKDITLTNDKPATAIGVIGNFTDAER, NPEILAIAPVLLDALTDPSR, ENVNSLLPVFEEFLK,FYFVGDEDLLEIIGNSK, ENFIPTIVNFSAEEISDAIR,AVELDLVPGR, SPISVAAAAIYMASQASAEK,SLIINTNPVEVYK, GVLLDIDDLQTNQFK,VNAELQAR, ALGTEVIQLFPEK,YFGPNLLNTVK, TMDEDIVIQQDDEIRLK, KPNA3, LOC100287399, RPS27A, LOC100288966, TLSDYNLQK, TITLEVEPSDTIENVK, UBB, GCN1L1, DYNC1H1, GTF2B, IQGAP2, AIFM1, POLR2G, PRKDC, INTS1, DDB1, DNAJA1, RNGTT, NKPFFDICTSR, RECQL5,GSVSASEQGTLNPTAQDPFQLSAPGVSLK, WDR92, LVATSLEGK, KIAA1967,IQVSSEKEAAPDAGAEPITADSDPAYSSK, KIF5B, QLDDKDEEINQQSQLVEK, LOC731751, KPNA4, KEEDLLR, DAQVVQVVLDGLSNILK, GHGECPTTENTETFIR, QIGSVIRNPEILAIAPVLLDALTDPSR, GEPGAAPLSAPAFSLVFPFLK,LALESICLLLGESTTDWK, KLVPLLLEDGGEAPAALEAALEEK,QVQMAATHIAR, EIGDIAGVADVTIR,SSTSNANDIIPECADKYYDALVK, IYDVEQTR, LNDGSQITYEK, ISGLGLTPEQK,HSIPSEMEFDPNSNPPCYK, YGFVIAVTTIDNIGAGVIQPGR, SLNQSLR, LSPSAASDAVLSALLSIFSR,DLLFILTAK, SDPNRETDDTLVLSFVGQTR,ELYDKGGEQAIK, TIVITSHPGQIVK, KPVAIFK, GTGVIQLYEIQHGDLK, ADSWVEKEEPAPSN, HVAVTNMNEHSSR, AQNVTLEAILQNATSDNPVVQLSAVQAAR, MDWSIEAAVATFAQARPPGIYK, UBC, IEEFVPPDENCPLKEASSR, EIINAIDGIGGLGIGEGAPEIVTGSR, VLLLSSPGLEELYR, LYLVDLAGSEK, MQLFVK, TLSDYNIQK, SLGPPQGEEDSVPR, DBF4, LOC100289196, LOC440264, INVS, MINK1, TFAP2B, HIST1H2AA, PCBP3, PPP1CC, WKPPSLNSVDFR, ILPEIIPILEEGLR, AASQSTQVPTITEGVAAALLLLK,SSLQSQCLNEVLK, LVPLLLEDGGEAPAALEAALEEK, VIDVGSEWR, FCSNLCLPK, VLNSIISSLDLLPYGLR, KYDYYYNTDSK, SATEQSGTGIR,AALSASEGEEVPQDKAPSHVPFLLIGGGTAAFAAAR,AIVFRPFKGEVVDAVVTQVNK, GEVVDAVVTQVNK, LQSVQALTEIQEFISFISK, ALGQEADKGLSGCGIVYCR, DCWTVAFGNAYNQEER, FSATEVTNK, QLEESVDALSEELVQLR, MQIFVK, EAAWAISNLTISGR, LAEALAFRQDLEVVSSTVR, SFQNQIAAIQR, LEELHVIDVK, KTEPATGFIDGDLIESFLDISRPK,ITFHGEGDQEPGLEPGDIIIVLDQK, VNFPENGFLSPDKLSLLEK, GVTQVTTQPK, POTEB, FNSQPVGDCDFNVR, NDRDQVSFLIR, ELFLTAGGAGGLHLWK, QEGLDGGLPEEVLFGNLDLLPPPGK, FQGEDTVVIASKPYAFDR, ESTIHLVLR, LFNDSSPVVLEESWDALNAITK,FSSVQLLGDLLFHISGVTGK, SRQELEQHSVDTASTSDAVTFITYVQSLK,VEDVLGK, SPISVAAAAIYMASQASAEKR,KAVELDLVPGR, NPNAVLTLVDDNLAPEYQK,EIEDIIEEVTVGYIR, IDNSVLVLIVGLSTVGAGAYAYK,DKVVVGIVLWNIFNR, TMDEDIVIQQDDEIR,GFVLYPVK, DVLIQGLIDENPGLQLIIR, QTITESSSLLLSQLTSLDPQGPPR, VVEELTR, HYNGEAYEDDEHHPR, POTEC, YDSQVAEENRFHPSMLSNYLK, RLEALER, GFASVSEK, ALVSHNGSLINVGSLLQR, SAEIDSDDTGGSAAQK, TITIEVEPSDTIENVK,

KPNA1, PPP2R2A, QVVEYYSHR, NLQASGLTTLGQALR, POLR2J2, LLGHEVEDVIIK, IGLVEALCGFQFTFK, TDGFGIDTCR, AFADAMEVIPSTLAENAGLNPISTVTELR, ILLYACR, LLHHDDPEVLADTCWAISYLTDGPNER, SFSLLVR,VTEPSAPCQALVSIGDLQATFHGIR,SHGIDLDHNR, SSAGERHGSR,WVTCASAVQK,VLVVHDGFEGLAK, QKSIQR, GTEGRTGPVAVR,SELLSQLQQHEEESR,LPNGEGTR, FAISLPKAR,KMWEELPEVVR,DSNTQIEQLR, LGALPSMPR,FQAMEETHLRHMK,ISQKAER, LEVDLDK, VEAVLSR, QFQVLDAPLLK,LSVGSNRDR,WCGHKEVPPR,EQQETIDKLR,CMMLTNVQMLNKEPEDMITGER,LLGRCSGTR, QLEIYR, SPPPPPPR,IPMSKMMTVLGAK,GKGCYNTR, CLKSCKPR,MAGQRVELPCK,KEQAGER, LMEPNLIK,ELEGELEAEQKR,SFTTTIHK,ETKTFGGGGGGAR,AIGPHDVLATLLNNLK, LRVPLLEILK, TVLEQVLPI, EVATAIR, TILEMLLGFLK, LLLPWLEAR,IAALQAFADQLIAAGHYAK,HFVTISSPLATQIPQAVGAAYAAK, DIEQFTEFFENPAFRPDGLK, DIKPSNLLVGEDGHIK, ALVSFSDYQIFQELIK, LPADTCLLEFARLVR, VLDFEHFLPMLQTVAK, LASYVEKVR, GANFLTQILLRPGASDLTGSFR,GGGGNFGPGPGSNFR, RDNYPQSVPR, IKELVVTQLGYDTR,DLVEAVAHILGIR, GQGSVSASVTEGQQNEQ,TYEEGLKHEANNPQLK, QAAPCVLFFDELDSIAK, YLVEVEELAEEVLADKR, VKEEIIEAFVQELR, GEGIKMTPR,ATGREEGLEVGPPEVLDTLSQLLK,FLEFQYLTGGLVDPEVHGR, GALQYLVPILTQTLTK, SPLGEVAIRDIVQFVPFR, CPNE7, YSELGHPFGYLK, DDX26B,

DIVQFVPFR, PILLFLIDTSASMNQR, IPSEQEQLR, GNTSGSHIVNHLVR,ITFTGEADQAPGVEPGDIVLLLQEKEHEVFQR, VIEPGCVR, YLWNNIK, LGFEEFK, IDDVVNTR, VIDPATATSVDLR,NCDYQQEADNSCIYVNK, LYYVCTAPHCGHR,LLGASELPIVTPALR, GINSSNVENQLQATQAAR, KPNA6, PPP2R2D, VPNACLFTINKEDHTLGNIIK, IVQVALNGLENILR, SFFSEIISSISDVK, UPF1, DNAJA2, CAPNS1, CCT4, POLR2I, KPNA2, TIMP1, PPP6R1, SMC4, FLG, TTN, PFKM, CCDC19, NUP205, MRPS31, NFRKB, PRX, BDH1, CCDC158, SIK3, FCHO1, COL16A1, NALCN, EMILIN1, MYO18B, KIAA1109, PDGFD, CENPE, FRY, VAC14, ERCC4, SRCIN1, CHD5, DDX1, SMYD3, DSCAM, HENMT1, EFCAB7, MYH6, MAD2L1, TDRD3, SF3B1, SETX, DNAJB1, RPS2, MMS19, CLTC, SPTAN1, BCKDHA, ELP3, CAMKK2, POLR1B, LPCAT1, MYL6, KRT23, DPM1, HNRNPA2B1, MYO1C, PFKP, FASN, SEC13, STIP1, VCP, PDRG1, VASP, P4HA1, MYO9B, DSP, KPNB1, CPNE3, INTS6, SQLLKDPQVLFAGYK, VPHPLEHK, VLVEGPLNNLR, CFLSWLVK,ELYDRYGEQGLR,ITFTGEADQAPGVEPGDIVLLLQEK,THYSNIEANESEEVRQFR, LFAQLAGDDMEVSATELMNILNK, DALSDLALHFLNK, ALIAGGGAPEIELALR,ITHEVDELTQIIADVSQDPTLPRTEDHPCQK, EAVFFQSHSAR, TDCSPIQFESAWALTNIASGTSEQTK,QDQIQQVVNHGLVPFLVSVLSK,

NVFLLGFIPAK, EISFAYEVLSNPEKR, THYSNIEANESEEVR, AQDIEAGDGTTSVVIIAGSLLDSCTK, ITHEVDELTQIIADVSQDPTLPR, NKNPAPPIDAVEQILPTLVR, HLACLPR,SCTLGLQPHRGPSVCPR,NNALEKEK, HHEASSRADSSR,QLVGTNK, QLVRTPQR, KMETEAELR, VTSLNRQR, HLESFPK, VELGDSDLK, LLQCAEPYK,LRTVQLNVCSSEEVEK,LQELQHLK, DLKAENLLLDANLNIK,VSSDKLALCHLELTR, GTPGEKGPR, INVSVSKNLNLK, RLAELER, KLSSPTTPR, SSMRAASLK, LNDDAKR, LKNGIQK, AVDNILR,DIVTESNKFDLVSFIPLLR,EKEAFEK, REMVYASR,ELMLGEDTGLPKR,FLVLDEADGLLSQGYSDFINR, KIEELK, AVGDPSPAVKWMK,LCANEEIMR, SSLLSATR, LKENQAR, AIQDEIR, ILESSIPMEYAK,ICFELLELLK, IILEFK,FKEIAEAYDVLSDPR,TYSYLTPDLWK, FAEFLLPLLIEK,KFDVNTSAVQVLIEHIGNLDR,DLQSWINGIR, HLQTYGEHYPLDHFDK,ILALVPPWTR, SQALEFPDQPDIAEDLKDLITR,ELFFLPLGFALK, TALGVAELTVTDLFR,ALGQNPTNAEVLK, LASYVEK, ENLPLIVWLLVK,GFGFVTFDDHDPVDKIVLQK,MSLLQLVEILQSK,LNIIIVAEGAIDTQNKPITSEK, TLLEGSGLESIISIIHSSLAEPR,DVAWAPSIGLPTSTIASCSQDGR, LAYINPDLALEEK, NAPAIIFIDELDAIAPK,GFNLNPLNQDELK, TPKDESANQEEPEAR, LNTEWSELENLVLK,KPTGLFYLLDEESNFPHATSQTLLAK, LSSEVEALR,AAVENLPTFLVELSR, KPNA5, PPP2R2B, LYGPTNFSPIINHVAR, DFQQLLQGISEDVPHR, POLR2J,

TNTPADVFIVFTDNETFAGGVHPAIALR,ALAELFGLLVK,KPVEELTEEEKYVR, TPETLLPFAEAEAFLKK,YYALCGFGGVLSCGLTHTAVVPLDLVK, NILETLLQMK,SLQALGEVIEAELR, ASWESLDEEWRAPQAGSR, SALTEMCVLYDVLSIVR, LTDNIFLEILYSTDPK, AFIPAIDSFGFETDLR, SCSLVLEHQPDNIK, INEILSNALK, VVLAITDLSLPLGR,NVVHQLSVTLEDLYNGVTK, TPGQVLSLISSLGFNTPIAEK, KNFIAVSAANR, LLLAVFVTPLTDLR,QQTNTLLNLVWR,VELSEQQLQLWPSDVDKLSPTDNLPR, LLMEELVNFIER, VIGSGCNLDSAR,ISEPEADVESVLGVSNLLQVLQKPK,TAAPGHDLSDTVQADLSK, IESEGLLSLTTQLVK, ACLIFFDEIDAIGGAR, HTGPNSPDTANDGFVR, FAM90A19, NFIX, CCDC72, RPS27P17, DFSALESQLQDTQELLQEENRQK, FAHIDGDHLTLLNVYHAFK, GNVFSSPTAAGTPNKETAGLK, EAETFKEQGNAYYAK, DSTFDLPADSIAPFHICYYGR, VEENFVILFSDLTMHELK, LTSFIGAIAIGDLVK, AVDSLVPIGR, LATNAAVTVLR, QADVNLVNAK, ENAPAIIFIDEIDAIATK, TWFGTFIR, LAPAHVDEQVFLYENAR, FEDEELQQILDDIQTK, AFLTGVDPILGHQLSAR, IAAQLLQQEQK, LSNQVSTIVSLLSTLCR, DDKHGSYEDAVHSGALND, FAM90A17P, NFIA, EXOG, RPS27, IIGLDQVAGMSETALPGAFKTR, TLATDILMGVLK, ITGEAFVQFASQELAEK, LEQYTSAIEGTK, NAQLELKK, MIIEDETEFCGEELLHSVLQCK, FAVALDSEQNNIHVK, VQDDEVGDGTTSVTVLAAELLR, HALIIYDDLSK, ETEGDVTSVK, TLGILGLGR, VISGVLQLGNIVFKK, HRLDLGEDYPSGK, PSMC4, TIFTQEIFTEQVVTAHAVR, LFGSAANVVSAK, FEDEELQQILDDIQTKR, AVLQLLVEGALHR, LVSIPEEIGK, ILQQIEEPLALASGALPDWCEQLTSK, VLCELADLQDKEVGDGTTSVVIIAAELLK, LQEALER, EALGDAQQSVR, LWGIPDQAR, VELHSTCQTISVDR, SLHDALCVLAQTVK, LKEIVTNFLAGFEA, FAEAFEAIPR, TQTSDPAMLPTMIGLLAEAGVR, LQQELEFLEVQEEYIKDEQK, YVEVFK, QLLQANPILEAFGNAK, HQSFVLVGETGSGK, GLPWSCSADEVQR, TROVE2, HUWE1, MRPS7, MAP7D1, SLC25A3, MRPS35, FARSA, CDC42EP1, MED17, SAMHD1, EFTUD2, FKBP8, HNRNPM, MED24, DNAJA4, MASTL, MYLK2, MSH2, UBE3C, CNP, INTS9, LDHA, LTN1, LRCH1, MRPS27, PSMC2, HNRNPH1, MYH9,LQQELDDLLVDLDHQR, DHX15, EVDDLGPEVGDIK, CALD1,ASVDTKEAEGAPQVEAGK, DNAJC7, FREALGDAQQSVR, FTSJD2,VAELALSLSSTSDDEPPSSVSHGAK, SUPT5H,SLGGDVASDGDFLIFEGNR, CCT2, QVPLSAAEAAEVILR, ATP5A1,FENAFLSHVVSQHQALLGTIR, CCT8, TVGATALPR, PHGDH, DLPLLLFR, INTS2,QAQGLQQELGGLHSALLR, ZBED1,LKPESSQQPGQDALAVK, POLR2D, TLNYTAR, INTS5, IASVVGILGHLASR, TDSQKDQEVYDFVDPNTEDVAVPEQGNAHIGSFVSFFK,LRCH2, HECTD1, DLVDKGGDIFLDQLAR, TCP1,SLLVIPNTLAVNAAQDSTDLVAK, HNRNPF,HSGPNSADSANDGFVR, SAAEMSGRGSVLASLSPLR, QADKVWR, KLEELK, LTEGCSFR, ATENDIYNFFSPLNPVR, IAQLEEQLDNETKER, LQLPVWEYKDR, MQNDTAENETTEKEEK, KDYNEAYNYYTK, QLLLCQFLMALSIVR, SSVGETVYGGSDELSDDITQQQLLPGVKDPNLWTVK, QVLLSAAEAAEVILR, VVDALGNAIDGKGPIGSK, LVPGGGATEIELAK, VTADVINAAEK, ENAPAIIFIDEIDAIATKR, LOC652826, IAEFTTNLTEEEEKSK, SNLGSVVLQLK, QLLLELMGILPTVR, LALFPLLPK, EFETAETLLNSEVHMLLEHR, GNTELFGGQVDGDNETLSVVSASLASASLLDTNRR, YLFDLPLK, GQVKPSTSSQPILSAPGPTK, YPVNSVNILK, YVELFLNSTAGASGGAYEHR, NDDDEEEAARER, AVQFFVQALR, ISAIHILDVLVLNGTDVR, TPHYGSQTPLHDGSR, EALLSSAVDHGSDEVK, EVAAFAQFGSDLDAATQQLLSR, AIADTGANVVVTGGK, ILQDGGLQVVEK, STGEAFVQFASQEIAEK, VSHLLGINVTDFTR, TCTDIKPEWLVK, SNPEATNQPVTEQEILNIFQGVIGGDNIR, SLESSQTDLK, ALIPSLEGRFEDEELQQILDDIQTK, VPHATGALNELLQLWMGCR, NNLHVLPDELGDLPLVK, MEPLATVESLEQYLLK, AFHNEAQVNPERK, ALLQEMPLTALLR,QSTTHLADGPFAVLVDYIR, TSSVFEDPVISK, TPETLLPFAEAEAFLK,IQTQPGYANTLR,MAVDQDWPSVYPVAAPFKPSAVPLPVR, LPSGKVGFSK,LSPVGWVSSSQGK, SLAGAAQILLK, YVGETQPTGQIK,SFVEFILEPLYK,VLAQQGEYSEAIPILR,AFITNIPFDVK, HREDIEDYISLFPLDDVQPSK,NVVHQLSVTLEDLYNGVTKK, ILGTPDYLAPELLLGR, NFIAVSAANR, LNLVEAFVEDAELR,QLAVLTELPFVVPFEER, DKPELQFPFLQDEDTVATLLECK, VLKPLLSGSIPVEQFVQTLEK, LLIVSNPVDILTYVAWK, VLGNTSGLLQLLFNR, ALEEAANSGGLNLSAR, EALDVLGAVLK, FDDGAGGDNEVQR, SYSDPPLK, FAM90A8, FAM90A9, NFIC, NFIB, LOC729973, PPP2R5A, RPS27P9, RPS27L, VIQYLAYVASSHK, IRVESLLVTAISK, YEIEETETVTK, LAYELYTEALGIDPNNIK, FFELIQGTEIDIFSYKPTLLTSK, GIYKDDIAQVDYVEPSQNTISLK, LAVEAVLR, GIRPAINVGLSVSR, DIDEVSSLLR, NAGNCLSPAVIVGLLK, LAKENAPAIIFIDEIDAIATK,

VRPDYTAQNLDHGK, QSLGHGQHGSGSGQSPSPSR, GNFTLPEVAECFDEITYVELQKEEAQK, AEAGDNLGALVR, VVQHSNVVINLIGRDWETK, IIDEVVNKFLDDLGNAK,LLESLDQLELRVEALR, FYQQVTK,DDTHKVDVINFAQNK, GAQDVAGGSNPGAHNPSANLAR,LLQEEEQER, GQGCTAYDVAVNSDFYRR,ELVITIAR, IEIQNIFEEAQSLVR,AVWNVLGNLSEIQGEVLSFDGNR, VDNALQSGNSQESVTEQDSK, SGTASVVCLLNNFYPR, IEGDETSTEAATR,LVIASTLYEDGTLDDGEYNPTDDRPSR, MELQEIQLKEAK, LNTIPLFVQLLYSSVENIQR, DGADIHSDLFISIAQALLGGTAR, TQGNVFATDAILATLMSCTR, RYLPDGSYEDWGVDELIITD, QESGYLIEEIGDVLLAR, AHYTHSDYQYSQR, ILLAELEQLKGQGK, SHSWVNSAYAPGGSK, ILVLTELLER, VYEFLDKLDVVR, LVEGILHAPDAGWGNLVYVVNYPK, QLREEQR, VINEEYK, ASHEEVEGLVEK, IDLLQAFSQLICTCNSLK, TQLRNEFIGLQLLDVLAR, AQIFANTVDNAR, AQELIQATNQILSHR, SSGPYGGGGQYFAKPR, QLDAGEEATTTK, LAQQISDEASR, AQENLNPLVVLNLFK, TDYNASVSVPDSSGPER, TPM3P4, LPLFGLGR, GPYESGSGHSSGLGHR, EKPYFPIPEEYTFIQNVPLEDR, YEEIDNAPEER, LPHLPGLEDLGIQATPLELK, LLESLDQLELR, TLQQNAESR,LLPSAPQTLPDGPLASPAR, SSLSSHSHQSQIYR,SWSSYFSLPNPFR, LNTEVASVVVQLLSIR,ELQQAQTTRPESTQIQPQPGFCIK, IQELGDLYTPAPGR,QLFEIDSVDTEGKGDIQQAR, SVEQQVIGFSGLSDDKNYK,VDNALQSGNSQESVTEQDSKDSTYSLSSTLTISK, DSTYSLSSTLTISK, ADQFEYVMYGK, VYRIEGDETSTEAATR,

JUP, PESGAAAARPRR, DNAJA3,FSQLAEAYEVLSDEVKR, EIF3D, VADWTGATYQDKR, POLR2F, KIPIIIR, ARHGEF1,RQESGYLIEEIGDVLLAR, PKP2,TSSVPEYVYNLHLVENDFVGGR, VIM, ISLPLPNFSSLNLR, DOCK6, VAELYLPLLSIAR, ASB6,LQLPENFDIHPVGSLAEK, PSMD3, ELDTVTLEDIKEHVK, TARDBP, MSEYIR, GIGYF2,AALSSQQQQQLALLLQQFQTLK, RBBP7, TVALWDLR, LGALS3BP,TLQALEFHTVPFQLLAR, INTS7,LQQQQAQQPLQQQQQR, INF2, NEFIGLQLLDVLAR, KRT18, LASYLDR, SCAF1,AARPPPAASATPTAQPLPQPPAPR,HNRNPA1, NQGGYGGSSSSSSYGSGR, P4HA2,SQVLDYLSYAVFQLGDLHR, USP9X, GSASDWYDALCILLR, POLR3A, NNIFYILLR, HNRNPK, ILSISADIETIGEILK, MRPS34, AWGILTFK, HRNR,GSGSGQSPSSGQHGTGFGR, HNRNPU,NFILDQTNVSAAAQR, TUFM,HYAHTDCPGHADYVK, NDUFA9,NFDFEDVFVKIPQAIAQLSK, BAG2, SEC16A, IRS4, PIH1D1, BAG5, IGKC, POLR2H, TPM3,

TLTVEVSVETIRNPQQQESLK, QISDGEREELNLTANR,GLANPEPAPEPK, YVDVLNPSGTQR,DAEIPEGAARGPHR, DVWQVIVKPR, MQNSDFLR, FEELLLQASK,LAQNILSYLDLKSDEWEY, VDNALQSGNSQESVTEQDSKDSTYSLSSTLTNSK,YLDLEEEADTTK, VDNALQSGNSQESVTEQDSKDSTYSLSSTLTDSK, LSAYVSYGGLLMR, VYLLMK, DSQLYAVDYETLTRPFSGR, GPYESGSGHSSGLGHQESR, SSGPTSLFAVTVAPPGAR, DLEKPFLLPVEAVYSVPGR, SSVSGIVATVFGATGFLGR, MELQELQLKEAK, TLVTQNSGVEALIHAILR, RDGADIHSDLFISIAQALLGGTAR, SVYSWDIVVQR, ITTPYMTK, LLLKSHSR, IQEQVQQTLAR, QDVDNASLAR, TLLACVLWVLK, TALLHALASSDGVQIHNTENIR, HDADGQATLLNLLLR, TSDLIVLGLPWK, DVGRPNFEEGGPTSVGR, AIFTGHSAVVEDVAWHLLHESLFGSVADDQK, SDLAVPSELALLK, SATDKQQELLVSLATVIFVASQK, SSQLLWEALESLVNR, ASLENSLREVEAR, GAEETSWSGEER, SESPKEPEQLR, APQLLIAPFKEEDEWDSPHIVR, IIEDCSNSEETVK, LQQQPGCTAEETLEALILK, ILSISADIETIGEILKK, LLQLLGRLPLFGLGR, HGSGSGHSSSYGQHGSGSGWSSSSGR, DLPEHAVLK, LLDAVDTYIPVPAR, VFEISPFEPWITR, FLDDLGNAK,TLQQNAESRFN, LVLIGSNHSLPFLK,DAQGQPGLER, WFQPVANAADAEAVR,LNTEVASVVVQLLSIRR, GQGCTAYDVAVNSDFYR,NRPFMGSISQQNIR, AVIEVQTLITYIDLK,TELQGLIGQLDEVSLEKNPCIR, RTVAAPSVFIFPPSDEQLK,TVAAPSVFIFPPSDEQLK, LQGDANNLHGFEVDSR,LHCESESFK,

GAVSAEQVIAGFNR, SLANAESQQQR, HQEEESETEEDEEDTPGHK, LEGGSGGDSEVQR, LOC728989, LOC100294412, FAM50B, NBPF4, SPIN2B, PCDHB8, LOC651921, POC1A, PFKFB4, ZCCHC18, LOC642846, LOC653566, LOC652147, USF1, LOC653707, OR2M2, LOC100293491, BAT2L, GRIA4, CDK11A, CDKL4, RAD21L1, AMOT, FAN1, AKAP2, QQIAEGK, QEMETQVR, EEEENRLR, DQEEIEDQSPPCPR, ATGSANMTK, VVSRGNK, YATDSETVEPIISQLVTVLLK, CAKFSPDGR, QCALAALR, ASLYVIR, MAELYR, EKSIFLVAHR, WTELGALDILQMLGR, RAQHNEVER, EAMVKPFEK, LICTTVPK, LYQDEK, ELEKIK, IDISRR, RDSLEEGELR, IALREIR, ILSPKVK, NKLEGEIR, EVVEKR, QAVAKGQSTPR,

PFDN2, IIETLTQQLQAK, LTGHLHAQGTPPYVINLDPAVHEVPFPANIDIR,GPN1, FLG2, FSNSSSSNEFSK, PSMC5,LLREELQLLQEQGSYVGEVVR,

MVGGVLVER, SMSLVLDEFYSSLR, HSSWSEGEEHGYSSGHSR, TMLELLNQLDGFEATK, PDE4DIP, HIP1R, FAM50A, NBPF6, SPIN2A, PCDHB13, ATR, POC1B, PFKFB3, ZCCHC12, DDX11, SPCS2, SNRNP200, USF2, PDCL2, OR2M1P, GOLM1, PRRC2B, GRIA1, CDK11B, CDKL1, SNPH, AMOTL1, RSPH3, PALM2-AKAP2,

FAM201B, PIAS2, ANXA2P2, EIF3FP3, ARHGEF33, ANKHD1-EIF4EBP3, FOXK1, MEX3A, EPHA10, PRDX4, CYP2D7P1, TRIM6, CDK16, CCDC162P, APITD1-CORT, SLCO1B7, RPL23P8, PPAN-P2RY11, TNFRSF6B, AKT3, LOC100291405, RPS18, SRPK1, CSNK2A1, PPP2CA, LOC643751, HNRPDL, ARF3, MPAATLR, LTADPDSEIATTSLR, SALSGHLETVILGLLK, IQDALSTVLQYAEDVLSGK, LKEAGQGR, TTSEIFLSSTAEGADLR, LASVPEYR, EIISAAEHFSMIR, LEGVVTR, GLFIIDDKGILR, ACLGEPLAR, ELISDLER, LSLPADIR, AIARQMGK, TLISGLPGR, LSLVGLAK, NLYIISVK, QLSLDVRR, SLEQELASPILDIEDLVK, KEVIIAK, VELSDVQNPAISITENVLHFK, IPDWFLNR, QAELLEKR, LIDWGLAEFYHPGQEYNVR, GAGYTFGQDISETFNHANGLTLVSR, NVFDEAILAALEPPEPK, GFGFVLFK, MLAEDELRDAVLLVFANK,

LOC100134240, PIAS1, ANXA2, EIF3F, MORN2, ANKHD1, KIAA0415, MEX3B, EPHA8, PRDX1, CYP2D6, TRIM6-TRIM34, CDK17, CCDC162, CORT, LST-3TM12, RPL23, PPAN, RTEL1, AKT2, PTPLAD1, RPS18P5, SRPK2, CSNK2A1P, PPP2CB, CDC42, HNRNPD, ARF1,

CUL4A, NDEL1, GFPT2, ATAD3C, MYO5B, PRPS1, CAMK2B, CALM3, ELPQLVASVIESESEILHHEK,QQLVKLEMNTLNVMLGTLNLGNLLLTGDK, EAYLLLQLFK,TLSALLLPAAGLDDVSLPVAPR, DFEPILEQQIHQDDFGESK, TEQGTVLIGSEHGLIEVDPVSR, QLSQSLLPAIVELAEDAK, YFAQEALTVLSLA,SLELLPIILTALATKK, FVSSLLTALFR,VADLVHILTHLQLLR,LAVDTDTFSFGVVVLETLAGQR, TIADIIR, TLIQNCGASTIR,LDGLVETPTGYIESLPR,NVDLLSDMVQEHDEPILK, LSTLIEFLLHR, LPSFAIPYAIDVLTTGSYPR,SPALQVLR, AIGIEPSLATYHHIIR,LSPPYSSPQEFAQDVGR, DCQLNAHKDHQYQFLEDAVR, EQAELTGLR, AQETEAAPSQAPADEPEPESAAAQSQENQDTRPK,NVESGEEELASK, EIDKNDHLYILLSTLEPTDAGILGTTK,ETGANLAICQWGFDDEANHLLLQNNLPAVR, HKLDVTSVEDYK, IASLLGLLSK, ATSPADALQYLLQFAR,INALTAASEAACLIVSVDETIK, INALTAASEAACLIVSVDETIKNPR, MAP1A, DKIPCEK, MACF1,LFQVENQSAQEKVK, TCHH, EQEERLEQR, 14-Sep,VFMIPVENENHCDFVKLR,LOC100128703, RASVASPGEK, SLC1A6, RLHSADGQR, DHX38, QEQEKR, GTF3C5, KVEEELR, DUOX2, KEIEDIR, PDE1C, RSSLNSISSSDAK,LOC494150, VEQQKK, RYR2, YNILTLMR, HIST1H1D, VAGAATPKK, KLHL10, KVYNIPGISPDMMK,KIAA1244, SINTAVR, GBP2, EKAIEVER, TIDGILLLIER, ISALNIVGDLLR, VIQQLEGAFALVFK, GLLLFVDEADAFLR,

LNVGMENKVVQLQR, QVLMEHK, VTSIADR, VYAILTHGIFSGPAISR,DLKPENLLLASK, LTQYIDGQGRPR,EAFSLFDKDGDGTITTK, VFDKDGNGYISAAELR, UBR4, IQGAP3, IKBKAP, PPP2R1A, FANCI, IRAK1, CCT3, NAP1L1, MED14, PTCD3, TRIM28, PGAM5, MAGED2, CCT5, INTS4, CCT7,

TLLPLLLESSTESVAEISSNSLER,QAACTIVEALATIPSR, DASGFWSSLVNPATGLAEVEGENAQR, GVLVEIEDLPASHFR, FITVGWGR, VHELQGNAPSDPDAVSAEEALK,AISHEHSPSDLEAHFVPLVK, QLSQSLLPAIVELAEDAKWR, SLELLPIILTALATK, EALLLVTVLTSLSK,TASVLWPWINR, AIQFLHQDSPSLIHGDIK,ELGIWEPLAVK, EILSEVER, GIPEFWLTVFK, YAVLYQPLFDKR,VTPENAGQWKPDELQVLEK,LELSNLEIPHQGVQVEGDGFSHAIR, VSNSPSQAIEVVELASAFSLPICEGLTQR, SAYESQPIR, VLVNDAQKVTEGQQER, MNEAFGDTK,HSQYHVDGSLEKDR, VSTDLLR, GPIAFWAR, VPNSNPPEYEFFWGLR,IAILTCPFEPPKPK, WVGGPEIELIAIATGGR,VYSLQHLDPQGAQELLEFTIR, LLSDDYEQVR, QVKPYVEEGLHPQIIIR,SQDAEVGDGTTSVTLLAAEFLK, CUL4B, NDE1, GFPT1, ATAD3A, ACAA2, PRPS2, CAMK2G, CALM2,

ASB8, SSSMWYIMQSIQSK, SOGA1, SSSLVSVR, ARHGAP21, DLISRR, TGM2, LAEKEETGMAMR, PJA1, KTNSEVPMHR, SWT1, EELSAELLR, ZMYM3, NTRVYPCVWCK, SPTA1, LPEITDLK, EIF3L, FEELNRTLR, TARS2, EAQQSLR, FAM198A, SRESPGGDLR, LAD1, LERYHTAIR, NCOA1, YLLDKDEK, NAA16, ILKCYEQK, VPS41, EGMEIPNLR, FHL2, ISVCPAMR, MEPCE, SSPLPAKGR, NHS, QENIASGISAK, CSPG4, LVLGQEELR, PSMA3, AVENSSTAIGIR, RAE1, CWEVQDSGQTIPK, LRRC53, IKGQNLK, RGP1, DSWLAELAGER, DNAH17, QLKACHR, C7orf59, SSSHSCFPGGR, IQSEC3, AACTIQTAFR, CCNB3, TNFKEDSLVK, ZNHIT6, SLLDNLRNK, HOPX, TLGSAVSPQQR, PROCA1, QVTQAGR, FAM99A,MAAAGPMTGAAASR,SPATA16, QMETMGKR, KIAA1257, QRSQIK, TEX264, KLCAYPR, ATP6AP1L, DTAEEKELLR, OLFM2,CICTAVIPAQSTCSRDGR,CFH, IVSSAMEPDR, CCDC87, ASPSPFCPELR, DST, ITEIQYTCR, MAGI3, KHWLSK, ZNF469,EAGGQAQAMELPEAQPRQAR,GPR171, QLYRNK, WNT7B, KVLEDR, ZNF208, WPSTLR, THSD1, VSSLSPSQCR, SLC45A4, LLWLSMLK, MYOF, KVIEDR, PTPRC, VDVYGYVVK,

CDH11, DTNRNSLSR, MYH1, DTLETQLSR, TOM1L1, ISKNYNHK, TRAF3IP3, GTQTKAEGPTIK, IGLL3P,GSWTGPGCWPRGFQSK,SEC14L1, LDVDAPR, FLJ43944, NNSTLPK, MGRN1, APDSTLR, PLAGL2, VAGGAKEK, RTL1, MKEASVNPSGAR,PRPF40A, FYVEDLK, USP13, NQAIQALR, ARL9, QEKQER, ZNF250,SSESPGDQVMAAAR,PTGDS, VLADAGGGR, C7orf34, SMPPLAPQLCR, NTN1, RPCAPTR, LAMTOR3,AINTAMYHIMMFQLVKDHIR,SPAG17, VITSQGTVVK, C6, NSGLTEEEAK, SSPO, AGWARAR, UROC1, QVSAINR, ZNF285, SSSLHNHHR, IMMT, TVEGALEER, WDHD1, KNLNLSK, HSD11B1L, DRASAAEAVR, SDHD, LSAVCGALGGR, DNAAF2, TSLDGARR, GIT1, TPIDYAR, MCM3,TVDLQDAEEAVELVQYAYFK,BRPF3, EPGPFVPR, PENK, VGAAAARAR, TIMM44, TIESETVRTSEVLR, PRRC1,QMIYSAARAIAGMYK,MED13, TNEKQEK, DPT, GATTTFSAVER,LOC645954, MWRAVSLATR, DOCK8, SYQASPDLR, KIAA2026, ILGEFLQEK, PRR25, SSRPSCANVLLR, BRIX1, IPGPVCK, LYRM1, EGTSGRLR, P2RX4,TCEVAAWCPVEDDTHVPQPAFFK,STXBP2, AGVEARAGPR, PARP9, VQEEMARK, MICAL3, DWVSPWLPR, CAMSAP2, KEELESK, SOCS6, LPLPNKMK,

GRIA3, DHGLLHLK, DGAT2L6,EYVMSMGVCPVSSSALK,CYBB, QSISNSESGPR,LOC100508384,MASFFSTLTFPSLK,PLA2G4E, EGLSPCHLLTVR, FNIP2,LPSCEVLGAGMKMDQQAVCELLK,ODF2, MKGDTVNVR, ALPK1, WPFVPEK, VRTN, QEKEAGR, IEKMELALVPVSAHGNFYEGDCYVILSTR,AVIL, HTR3B, VNMSNQVPR, DTNA, TQGGVSPR, MOSPD2, KEISVNGK, LARP1, RDFVEAPPPK, NUP214, SSVLPSPSGR, BRCA2, QMLNDKK, NAV2, AGTSAASSWGGGK,NCAM2, EALNPETIEIK, GTPBP5, APARCFSAR, ESPNL, SLSLLLK, SRCAP, AEGHRVLIFTQMTR, CXCL2, ACLNPASPMVK, YLPM1, SSYLESPR, PHC3, IPLHSPPSK, ERBB2IP, LLQPGDK, ARG2, LSSLGCHLK, GAD1, TFDRSTK, CAGE1, LNPADIK, MKL2, YEEAIKQTR, AJUBA, LSGLGGPRK, SEZ6L, CRYYSNLR, CD248, ACRELGGDLATPR,FAM203A, LGRAGGKPGGR, PSMA5, SSLIILK, RHBDL2, QPREER, TSTD2, MSLINPSSSLK, SHQ1,SGVLQRLQDELSDVIDIK,TRIM4, KVMHLQDVEVK, MAPK7,QPVPWETVYPGADR,KALRN, MAPPTPPK, GPR153, YRADLK, UBA2, SRFDIK, EYS, TSVSHMPIR, GRHL1, LARIEEPK, SMC5, MATPSKK, KCNQ5, STSANISR, FNDC3B, TLSIAPGQCR, SP140, SHEESLAHTGTLRR,

GLYAT, VKQTQR, SMAP2, MQEMGNGKANR,NUDT21, LTTEILGR, CCDC63, KEIEDLR, GPRIN1, KAEALASGK, PTPRT, KGYHEIR, CD200, GELSQGVQK, TTLL1, WTVSNLR, NDUFS1, LSVAGNCR, HAPLN1, LTSDYLK, FAF2, QNLQEEKER, CHADL, ALPALPSAR, FBXL13, CVRLSDASVMK,LOC729040, SPGWSEEVVR, MYLIP, YVFDIK, COL6A1, GTPGTRGPR, RERE, EEASSLLK, WDR82,YTHAANTVVYSSNK,MUC16, IQTEPTSSLTLGLRK, DMBT1, VEILYR, ITGA7, IVTCAHR, TNFRSF25, AEGAPGRTR, BOD1L, DKDVTLSPVK, CDR2L, RGMSILR, TMEM79, MTEQETLALLEVK, ABCE1, NVEDLSGGELQR, RBFA, DAALFPGCER, RAB23, NEEAEALAKR, ERN1, TYKGGSVR, ZNF687, FISHKK, PIP4K2C, FLDFITNIFA, PIGO, MASRFSR, PNMA1, LSAYVIR, MOAP1, LSAYVLR, BAZ2A,RDGEVDATASSIPELER,GNAS, RNAVSSSTNNSR, NUB1, EWRGAGMAQK,ATP6V0A2,KNLLELIEYTHMLR,C14orf180, AQALRGLR, KIF16B, LIEEMEEK, DDAH1, KEVDMMK, MDN1, LLSVLAQVFTELAQK, GBP7, MLSHKMK, CD1B, ILLYETCPR, GHDC, EYELVLTDR, KIAA1217, QVGEAVATLK, PMPCA,GLDTVVALLADVVLQPR,ZNF30, ISRQKPR,

DNA2,AGCSPSDIGIIAPYRQQLK,KDM2A, QLQSDGK, MRI1, TLEAIRYSR, DMXL2, DTLDECGLR, LARP4B, KPSYAEICQR, HRASLS2, MALARPRPR, KDM5B, RMGCPTPK, XKR6, IDMPRK, ANKRD27, HIASGNQK, SLC43A3, ACGTSIFQNR, LRRC41,ALHLSDLFSPLPILELTR,DENND2D, NPPAGYFQQK, FIGN, GTGKTLLGR, RPL19, RLASSVLR, RC3H1, KEIMAQLEER, C16orf70, MKIHSPSPHK, FXR2, EEGMVPFIFVGTR,GHRLOS2, VCCELELR, GRIK1, FMNLIK, CHSY1, SSNNNGSVRTA, TRIM22, ISSLNKR, MYO16, DLLAKSLYSR, UTF1, DADPTWTLR, TTC17, KAIDCLR, GPI, NLVTEDVMRMLVDLAK,C12orf51, SINEMLLSR, CNNM4, SPAHPTPLSR, LCP2, LSQEINK, RARS, STIIGESISR, SCARF1, DRPARDGAAVSR, FADS6, EPARSAHR, MRS2, KTLLADR, HEATR5A, SHPFTNPR, PRSS16, TELSACGPLGR, SPN, TGALVLSR, ZNF525, SYPLPAIVDFIVER, SH3TC2, EGECLTLCK, CACNA1S, KLEQKPK, KIF21A, QKLIDELENSQK, FAT2, GNNGSAFR, LOC441736,EERAQWAEQQR,SLC22A11, TQTAPAGR, GNA12, QRGEPLR, PPARGC1B, QAGLRSK, TLE3, VEEKDSLSR, SYNE1, SMEMVKTK, FBXO31, KDEFSTK, ABCA13, VQYDLK,

PLCB1, ELLDVGNIGR, DACT1, ESKAEQAESK, KIF20B, QEQKEVSVMR, C1orf65, QSQEQWQEKEQR, SPIRE1, DALSLEEILR,LOC100507699, RLHACAR, MAGEB17, YEEALR, ASUN,ADPECCSILHGLVAAVETLCK,UCN, ALLLLLAER, IL4I1, MPSSHRILHK, MTFR1, QSVLWSR, HLA-F, GAAVLGVDHRVR,ADARB2-AS1, CSPGEKSQLVR, DAK, IKMATADEIVK,LOC642620,RQMPQGITSCMLHLDGK,UPF3B, EKDTLR, FNTA, EKDTIR, GTF2H5,VGELADQNAFSLTQK,SMTN, AGSLAALEKR, GNAL, QDMLAEKVLAGK,PABPC1L, SSLPWLPK, GNPTG, RDPSPVSDR, SEMA6C, VSRGPFSPR, OTC, QSDLDTLAK, C17orf85, GLYADTR, EXOSC9, ETPLSNCER, FEM1C, GADVNRK, SMARCAL1, TEGRLQQK, FAAH2, VMAGPGIKR, MICU1, ISQERGK, PVRL2, MARAAALLPSR, KCNMA1, FIAANDK, FAM53B, SSSFSLPSR, SYNM, ENLLLEEELR, APLP2, AKEQLEIR, ZNF791, ALSCSSYIR, MFN2, QFVEHASEK, GPSM3, LTGMGGAR, GGA2,VTFGNRVTSSLGDIPVSR,OSM, LADLEQR, DDX41, QVSNIGR, IQCH, GILSMIER, ZNF425, RVPACTAPGR, FGD3, FHGQFLLPELK, ACACA, GSVLEPEGTVEIK, USH2A, DGTIVYTGLETR, F5, SLDRQGIQR, NDRG4, AGLQELR,

MMGT1, MTPLGSGPPR, VPS13A, DGTGNQMLQASK, BYSL,QQQEELEAEHGTGDKPAAPR,SEMA4D, HGGTAELK, COX4I2, MLPRAAWSLVLR, CDC5L, LECLKEDVQR, JPH3, NKVAHFSR, SLC7A10, GHGEGRLGCR, PEG3, IYDQEK, SLFN14, VIFGEENR, CD36, VFNGKDNISK, C19orf55, QAQPTSR, KANSL2, CSSQSLPMTR, BBX, KCVPVPR, SGCE, QANNFIISR, FREM3, CEVTVLDALPR, CIITA,LSLYNNCICDVGAESLAR,GOLGA4, KTEALLEAK, NFSSASALQIHERTHTGEKPFVCNICGR,SALL4, TLK2, RTQSDLTIEK, ACADVL, QSWVRAR, C10orf103, WGPTSNNPR, ZKSCAN4, MAREPR, WAC, ELEKLK, WARS2, EIEKLK, ACBD4, CALSLPR, CSF2RB, EEECSPVLR, SRRM1, RMPPPPR, LEPRE1, MQLCLEK, C8orf74, EALRLER, C8orf82, ERNFLR, SYNE2, KNLEDGINNLK, CARKD,HAGPGAVMALRRPGER,AHI1, IEKSPAPQK, CA3, EPMTVSSDQMAK,LOC100507203, QVQTQR, MCTP2, STIAFAHR, TRIT1, VSAQEQR, NPAT, LSIGQRNGGLR, C3orf77, LSCQRTIPMTGK, PCCA, YSSAGTVEFLVDSK,MAP3K10, WLDGLFFPR, EFCAB6, MAIIPDWLR, USP24, ESNSLCPAGIR, PACSIN3, ADSAVSQEQLR, PRPH, ADQLCQQELR, CHEK2, ILGETSLMR, MRPL4, AEVRGGGR,

TCOF1, EAASGTTPQKSR,CASP8AP2,MDVLQTDVSQNFGLELDTKR,STOX2, TRSTTLR, FUBP1, IQPQLDR, PLEKHA7, IIKSSSK, KIF13B, LGAPEARR, TBC1D24, LQPEVQR, PNMA3, RVLSGATLPDK, DNAH12, ILITFR, GCET2, ILIFEK, GQVCVMIHSGSRGLGHQVATDALVAMEK,C22orf28, ZMYM5, LYENEK, AKR1C3, IADGSVMR, ARHGAP32, TCTTTRVSK, ESRRA, KACQACR, AGPAT4, MDLAGLLK, RELB, EVEVTACLVWK, LRRN3, IEATNNPR, UNC5A, VTWRGEGR, SPAG5, EVRDLK, ASB2, DGDEEALKTMIK, DOCK10, RTWLESMAK, LOC388906,AGGCPSSLASSR, PLEKHG6, IETYGGR, MDC1, IQKVEPAGR, TLN1, AVEDEATK, TPRN, TATALADRAIR,CDC42BPA,KLAMQEFMEINER, OXCT1, MLGAMQVSK, CARS, ADSSGQQAPDYR, MYBL1, KPNPNTSKVVK, FXYD3, DASLPVPGQR, LSM14A, VSRPENEQLR, DDX31, ESKIQR, TBKBP1,LEEALEAAQGEARGAQLR,DNAH2, FNIINMTFPTK, FAM161B, ETQEATR, WBSCR16, RLSGPGLGR, XIRP2, GEGLEYENIK, NBPF8, AEMNILQINEK, ESYT1, LYMKLVMR, CHMP2A, KIIADIK, NBPF3, AEMNILEINKK, IFI16, LPQEQR, POLR1D, CPLASTNKR, CASC4, SLALWPR, ZNF155, RSLIHVMSVEK, TTC40,ATRTAALQPTPSSGTSGK,

CAPN3, MEICADELKK, SRRD, KSGNLSEK, MSTO1, LSRQPQR, C3orf35, SSVQSNILILQ, PACSIN1, VLEDVGK, HAUS6,EIKLEELIDSLGSNPFLTR,SKIV2L2, KSNVKPNSGELDPLYVVEVLLR,LOC644667, QAPESRK, VLCLSQSEGRVVSLTGICGAEEVEGGQDPK,PDCD11, COL9A2, ASPASRGPR, ERVFC1-1,WASGVDGGLYEHKTFMYPVAK,SIRPD, LLGLTGLLSK, CYP7B1, LSSYSTTIR, ZNF429, RIYVGEK, THOC2, KEENGTMGVSK, TOB2,CVHIGEMVDPVVELAAKR,SMARCA2, SLGDLEK, CLPX, SIIKEPESAAEAVK, ERC1, KIAELESH, CNOT1, RCPSILGGLAPEK, TJP3, QDVQMKPVK,ARHGEF10L,VSFQPKLSPDLTR, TAF1A, ISIMYSTLR, CALCOCO1, LQLEGQVTELR, CD320, LACLAGELR, LRRC55, GSLQHCCCLLPK, APRT, LAPVPFFSLLQYE, PRPF8,TDMIQALGGVEGILEHTLFK,RNF146, GASWLGRR, C8orf85, VEATLAR, SUV420H2, LELLVGCIAELR, INCENP, KEEQQR, MOBP, EIVDRK, PREX2, ISSGNIQER, ATG4A, ILQCFLDR, OR4K3, MIADFLR, TRIM68, QNVDRK, ITGB3, QSVSRNR, MOK, RTLSLPR, TMF1, SKTVESAEGK, SLAMF7, QEEYIEEKK, NCAPH2, SPQQSAALPR, CDC42BPB, VKDANLTLESK,HLA-DPA1, MRPEDRMFHIR, NCKAP5L, RVFDLER, DUX4, MERGTGETR, ASB4, IQMAALSEISK, UTP18, LSKDNLK,

TAF7, LLSTDAEAVRTR, GLG1, DRLQYR, FBXO40, IKNLANK, CCDC25, LSSAHVYLR, BMPR1A, CATLDTR, ADARB1, IESGEGTIPVR, CTPS, VPLLLEEQGVVDYFLR,TIGD6, QFSLEEK, SCRIB, MQEEEARK, ARMC8, TLPCLVR, DLEC1, LNNSSPCDIR, EPB41, PKSQVSEEEGK, CECR2, QEETPVLTR, SMARCA1, TAMEFQRR, PABPC4, EAAQKDSK, XRN2, DCEGLSREK, CADPS, LQGGPGLHLK, AFMID, LGAEEALR, ANKRD26P1, LQEEIAR, C14orf49, LQELEAR, HSPA4, SQVISNAK, NTS, ALEADFLTNMHTSKISK,SEC23A, VFTKDMHGR, OR2AT4, ISSLEGR, ALG13, FSAFLDK, DGKI, LAPAMIR, SLIT2, RIVTGNPR, C3orf15, LQEIRLEVLK, GTPBP3,DVLETPVDLAGFPVLLSDTAGLR,ZNF85, FSTLTTHK, EPB41L3, LSRSPLK, NRG1, CALPPRLK, C3P1, GIQDLHVQIR, TMEM201, ALSLGTIPSLTR, CDO1, TLADLIR, ZNF674, SSYHLIR, SLC7A6OS, LSRDSQQR, LTA4H, LFGEKFR, DCC, MGDTVLLK, LARP4, KLTTDPDLILEVLR, AHSA1, KAKPAPSK, HOOK3, SQREMEEK, KCNB1, EQMNEELK, DLG5, ELKELK, PRKAG1,GIVSLSDILQALVLTGGEK,QRFPR, LCEQTEEK, M6PR, ESWQTEEK, UBTF, KLAEEQQK,

TSEN54, SSSSPRSINK, TAF3, LKDGLVR, UBAP1L, GVRLELAGAR, WHSC1L1, GASEISDSCKPLK, NAP1L4, GIPEFWFTIFR, FMR1, EVDQLR, ADRA2A, GRSASGLPR, FOXP1, QAILESPEKK, ZNF391, LFSQRKPCK, KLK11, LLCGATLIAPR, C9orf50, MNSSLSFCSWKK, C4orf27, MVGGGGKR, TRIM37, TDVKNTLSEIK, MITF, LQREQQR, RP1L1, EVFGRGGQPGPK, PLXNA1, VQIDYK, ZFP37, LLKFESCGK, LRRTM2, MLPALGMACPPK, PIGN, VSAQQFDDAFLK, ZC3H4,TVLWNPEDLIPLPIPK, TPM1, MEIQEIQLKEAK, ZHX2, KLCEEDLEK, ADAM15, ALLARGTK, AKAP17A, LQEAELR, 11-Mar, LPEAAAAK, FAM120B,CLSSYTSVKENFDK, KDM3B, NCAIISDVKVR, NFAT5, KSLHSCSVK, C19orf45, EAPPHGAPPR, DOCK4, LECLVHR, BBS9, ITIDTNK, ARHGEF2, SESLESPRGER, CLCN5, DLIISIENARK, FAM163A,STAFGMNGLGGIAAKLFAAPFSMR,ATP2B3, TLMELR, PLEC, VEEARR, SMARCA5, RWVPTLR, C20orf132, MPLSRR, ZNF518B, AGAKRPVK, VCAN, MFINIK, SRSF7,NPPGFAFVEFEGPRDAEDAVR,KIAA0586,ISPSGIDSATTVAAATAAAIATAAPLIK, DOCK1, LDKDDLEK, UNC45B, DFIDMRVK, PLCH1, QLPSPQSLK, HPSE, VLMASVQGSKR, FGF19, QLYKNR, CAND1,IDLRPVLGEGVPILASFLR,

C19orf47,AKATSSATTAAAPTLR,MEX3C, EILSAAEHFSMIR, SLCO1B1, ITPTDSR, CPSF6,AVSDASAGDYGSAIETLVTAISLIK,MYH10, SLEAEILQLQEELASSER,TMEM14B, LFLPEAVALR, ANK2, MNGNTALAIAKR, TBL2, QDPYLLK, PRKAA2, MPPLIADSPKAR, OSMR, KLCTHK, ZNF223, SALTVHCK, RAD51AP2, MSSVYLK, EEF2,GVQYLNEIKDSVVAGFQWATK,DAP3,NDWHGGAIVSALSQTGSLFKPR, GAS2L2, QSGPRTK, TNFAIP8L1, EFTRSR, STX18, LYSEQR, RP1, LYATDGR, ZBTB11, YLTNER, UBE2D2, IYQTDR, BMX, LRHPVSTK, AGK, GDFITIGSRK, SUCLG1, VSGSNGLAAAR, SLIRP, CILPFDK, IPO5,ATAAFILANEHNVALFK,EMB, KNESLGQ, SKOR1, GMKDTQR, MTUS1, QIAAPKAK, C1orf123, LMDSVALK, PHF1, DRFISGR, LOC100133315,KDGGGLYEK, SKA3, SLASTLDCETARLQR, MTAP, NVDCVLLAR, APBA3, DFPTISR, GLYR1, MAAVSLR, ARID5A, LGAYELVTGRR, CYB5R1, KSPPEPR, TJP2, KPSPEPR, KIF4A, IMKEEVK, C19orf40, VRNSNNLK, SPTBN1, ELVDRK, DYNC2H1, YYREMK, NIT1, MVLAISSCR, EMID1, LKVLEAK, EPS8, LKLLDAK, HEXIM2,TQSPGGCSAEAVLAR,NIPBL, RSTVNEKPK, SBF1, GLLALLFPLR,

SEMA4G,GQAQNYSTLLLEEASAR,KIR2DS5, EGTFNHTLR, ZNF645, SPPPSTLQGR, PHKG2, KQWIGK, RAB3IL1, IDMLQAEVTALK, RNF103, TPEGIYR, YTHDC1, MAADSREEK, ULK1, LPDFLQR, E2F6, AYDITNVLDGIDLVEK,TTLL4, IKDVVVK, CEP44, ATGDLKR, PRSS12, ESLSSVCGLR, MRPL55, EAQLQSR, MREG, AREMLLK, RPL36, RAMELLK, AARS,NVGCLQEALQLATSFAQLR,ATXN7L1, KHQNGTK, VWF, TCQSLHINEMCQER, URB2, QLENQNPQGR, MYCN, DHVPELVK, C16orf7, SMELYR, EHBP1, EQLLLDELVALVNKR,CACNA1C, QLTLPEEDK, ZNF253, STDLTTHK, MAST4, LQIPGLTLDCR, C12orf34, LACLQR, ENAH, ASSTSTPEPTR, SCN4A, IALDCNLPR, NAALADL1, TAGSVILR, LOC151760, KLDPLLK, ACTR2, QLYLER, PTPRU, ATYKEMIR, ELCDGLELENEDVYMASTFGDFIQLLVRK,ALDH1L1, HOOK1,CSQVQQDHLNQTDASATK, THUMPD3, STLAYGMLR, RPL38, KIEEIK, CALHM2,AAVAPVTWSVISLLR,BAZ1B, RTPIGTDR, TMEM194B, GEAAAAALSVR, ATP5B,VALVYGQMNEPPGAR,SPACA7, SSGNIFHKEQQR, TES, YKSEALGVGDVK, ANXA7, LLLAIVGQ, CCT7P2, IIIQALR, TTC27, IILVNVR, MTMR2, LILAGALR, SUPT7L, SSFDLLPR, AASS,QDLVISLLPYVLHPLVAK,

SPTBN2, SAVAWEER, LAG3, LPAGVGTR, TEKT2, GCLSLNLR, ANKRD34A, RSPGLLER, SULF1, MPYDFDIR, TDG, MAVKEGK, OFD1, MPLPSPTESR, HIPK1, TPEEHELETGIK, RPS24, TTPDVIFVFGFR, PPIL4, KNLSIDER, ENO4, SCGTTRELQK, LRRC38, KLLVAGNR, DMD, LVEPQISELNHR, PCLO, KQAELDK, C9orf80, VAILAELDKEK, CDC45, SKFLDALISLLS, PPP1R14A, MAAQRLGK, DHX40, SISDARAR,LOC100509073,PPTDSPLER, ZNRF1, DGMLYLGSR, FKBP2, MRLSWFR, LONRF2, KVDLGYISDVDK, SLC9A5, YLSQLLMR, ZNF619, TAQLNISK, BAI1, TVPLDALR, EIF4E2,TPGRPTSSQSYEQNIK,AFAP1L1, LDLDKR, MFHAS1, LLLMGHK, NOP56, VLLGVGDPK, IBTK, QKCGATPK, CSTF2, MAGLTVR, CRTC2, LPSALNR, KIAA0226L, QQHVEDVQR, ACSL5,DNSNNTLYLQMSSLR,AP1S2, AIEQADLLQEEAETPR,DDHD1, LEPLILKR, CSF3R, LEPPMLR, FAM59A, SEAVREECR, PIAS4, QVELIRNSR, DCLK1,TAHSFEQVLTDITDAIK,MYO1A, VPGSAVPEAGR, RUFY1, HLSCTVGDLQTK, SUSD5, SRGAHLASADELR, PNPLA5, LLQGARR, UGGT2, ILFLDVLFPLAVDK, VPS54, LSLLLDNER, EIF2C4, QEEISRLVK, DCDC2,STVGSSDNSSPQPLKR,

FAM19A5, LPQGTVR, RAPGEF3, DAIGLQPDAR, DHX33, EAISDSLLRK, MTOR, DASAVSLSESK, SACS, KATTHTK, DNAJC1, LGASEKNER, USP9Y, IIEDCSNSEDTIK, LRBA, LLASKSEGIR, FAM214B, RGPGLGAR, PIK3CA, SLWVINSALR, NPM1, DMQLLSISR,LOC100131673,VRAEGELEPR, ANGEL2,LTQLAMLLAEISSVAHQK,ACRV1, NQSFCNKI, ZMAT1, LPFETFR, TMCO6,TEDAGLELLACPVLR,RSAD1, LELLEEVLALGLR,RABGGTB, MSEYLR, DBF4B,MMHVQQLSLASLCVKK,LRRIQ1, DTLVISVK, SNAP47,LLTELESPAWWPFSSK,RNF25, GRVPGVDR, PRAMEF10, IQDPQLR, KIF1C, QSKDAPK, ABHD1, TKIDAIR, ZNF318, EKFGSFLCHK, MUC6,FGAACAPTCQMLATGVACVPTK,ZEB1, VAVDGNVIR, RGS3, ACVAAACTVAAR, PFKL, SEWGSLLEELVAEGK,GPR112, IIPTVDR, WEE1,ADIFALALTVVCAAGAEPLPR,SLCO1B3, LSLVGIAK, ZNF91, SSTLTEHK, CPSF2, VLELAQLLDQIWR, NTELAVFHDETEIQNQTDLLSLSGK,IARS, FCHSD2, AELEQKIDEAR, TRPM7, QAEKISR, SCO2, SAEQISDSVR, MON2,AVEGPSTVLTTAVMTDLPVISNILSR,ATP13A1, CTLVTTLQMFK, MRPS16,FVEQLGSYDPLPNSHGEK,TMPRSS4, IPMETFR, GAK, LSIAEVVHQLQEIAAAR,FBP1, SPNGKLR, UPF2, ALFIVPR, APOB, EQHLFLPFSYK, TTI1,ALADILSESLHSLATSLPR,

APOL1, VRAEEAGAR, RTN4RL1, LLRAEDFR, FSIP1, LDKILAK, MMAA, VLLYHR, IARS2, SCQTALVEILDVIVR, ASB7, LLLEFK, CDCP1, IDATVVR, GOLGA3,LQAQVECSHSSQQR,C19orf24, AKAGWAGDPR, PASK, IPVSVWMK, GRM7, IPQERK, C4orf29, LDILYR, METTL3, QLDSLRER, PARN, EQYDEKR, KRT78, YEEIAR, MGLL, YEELAR, ARL1,GTGLDEAMEWLVETLK,SQSTM1, DHRPPCAQEAPR, UACA, KEDMLLK, ASAP2, EIISEVQR, KIF27, MKEDLIK, PARP11, SNTQAPRLAPSHR, LHX2, ISSAIDR, RAF1,VKEERPLFPQILSSIELLQHSLPK,AFAP1, LQTKEEELLK, CNNM1, VEVEVGK, HAX1, ITKPDGIVEER, MUM1, LSQMQGAVR, LONRF3, LRPMGFK, VDAC1, AVPPTYADLGKSAR, ASGR1, MKSLESQLEK,LOC643371,CISEQFTAMFRCK, NISCH,TLNLAGNLLESLSGLHK,MRPL37, LQPDQLR, POLA1, QEALERLK, CRYGA, IIPHTSSHK, IGBP1,AAQQQEEQEEKEEEDDEQTLHR,BCR, ELDLEKGLEMR, GAB2, KAKPTPLDLR, MYO18A,ALLADAQLMLDHLK, PIN4, MPMAGLLK, FAM184A, LSQLEEER, EEF1D,SLAGSSGPGASSGTSGDHGELVVR,OTOG, LSSVFLR, KPRP, RSEPIYNSR, C13orf35, VTVQTDDSNK, SYNCRIP, DLFEDELVPLFEK, FUS, AAIDWFDGKEFSGNPIK,

HFM1, LKMAPIK, NDUFS3,ESAGADTRPTVRPR, RPS6, LIEVDDER, DUOXA2, QTARGVK, NCAPD2, LLNILGLIFK, EAATQAQQTLGSTIDKATGILLYGLASR,QARS, SAT1, NLSQVARR, TMPRSS6, EDSTCISLPK, SNED1, LASYTVR, ATP8B1, AMVVDLVKR, ASB15, KLVEAIK, GART,GVEITGFPEAQALGLEVFHAGTALK,POLR1A, RVDYAAR, STK40, LLGAGRAR, RCOR3, QVNSALK, ELL2, DLSYTLK, POLR2K, LVVFDAR, RPL15, FFEVILIDPFHK, WDR11,FLLTGLLSGLPAPQFAIR,MED15, LHHQNQQQIQQQQQQLQR,TFDP1, RVCDALNVLR, UTS2, LTPEELER, LRRK2, ELKILNLSK, SRSF3,NPPGFAFVEFEDPRDAADAVR,FASTKD5,VHSYNASETSQLLSVSEGELILHK, C17orf75, NVIAVLEEFMK, EPRS, VAVQGDVVR, CEP170,KPLTTSGFHHSEEGTSSSGSKR,MED1, VPLILNLIR, PMF1, LYTENR, ADCK3,LANFGGLAVGLGFGALAEVAKK,HNRNPAB, GFGFILFK, XPR1, SIWPLIR, ACAP3, MAEEMR, TRIP13, VVNAVLTQIDQIKR, MATR3,YQLLQLVEPFGVISNHLILNK,PRR11, DTIFPSR, DHX9,FSDHVALLSVFQAWDDAR,IPO4, LLGLLFPLLAR, BAZ1A, LSSSFSSR, RPS15AP17, QVLLRPCSEVIIR,KIAA0368, TEALSVIELLLK, ANLN,LANLAATICSWEDDVNHSFAK,NUP210, CDAIVDLIHDIQIVSTTR,NOLC1, VVPSDLYPLVLGFLR,XPNPEP3, AILFVPR, GAPDH,VIHDNFGIVEGLMTTVHAITATQK,EIF3E, HLVFPLLEFLSVK,

MYO1B,NFHVFYQLLSGASEELLNK,GAPVD1, LSAQAQVAEDILDKYR,HAS1, LAVEALVR, ASNS, ELYLFDVLR, WDR26,LALLNVATQGVHLWDLQDR,CEP55, SEELLSQVQFLYTSLLK,TOMM70A, ILLDQVEEAVADFDECIR,ARAF, EERPLFPQILATIELLQR,MED23, TFADFLVYEFSTSAGGQQLNK,MRPS18B, YLESEEYQER, MRPS9, AIAYLFPSGLFEK, TRIO,LLVRPTSSETPSAAELVSAIEELVK,MRPS28,FLGATTDTTVLEANAVLLGIQESK, CLCC1, LDSLTYK, NCL,TLVLSNLSYSATEETLQEVFEK,SUB1,GISLNPEQWSQLKEQISDIDDAVR, EIF3I, SYSSGGEDGYVR, EMD, TYGEPESAGPSR, PDE4D,NSSIASDIHGDDLIVTPFAQVLASLR,RPS10, IAIYELLFK, KCMF1, SLFVQELLLSTLVR,ANKRD40, TPESTKPGPVCQPPVSQSR,MTR, AALFALQNLFEEK,DYNC1LI2,NNAASEGVLASFFNSLLSK,RBM39, DLEEFFSTVGK, RPRD2, VEITPESILSALSK, ADRM1,SQSAAVTPSSTTSSTR,KIAA0355, ASTFLTDLFSTVFR,TRMT11, SPFWILSIPSEDIAR, RPL6, ASITPGTILIILTGR, DSG1, IHSDCAANQQVTYR,NTPCR, SSGVPVDGFYTEEVR,RPAP1, LLAQLDPSLVAFLR,C15orf44, AALAFGFLDLLK,ZCCHC11,NTESLGELWLGLLR,C21orf2, NVLTAILLLLR, ARMC5,FLLPGLEEELEEAVGR,PPP6C, APLDLDKYVEIAR, CLINT1, SLLLLAYLIR, HK1, SANLVAATLGAILNR,MRPS23, LETVGSIFSR, IQCB1, ALYSILDEVIFK, RPS14, TPGPGAQSALR, PPOX, SILLGLLLGAGR, UNC45A,VLISNLLDLLTEVGVSGQGR,UXT, YETFISDVLQR, CPNE1,EALAQTVLAEVPTQLVSYFR,TPR, GQNLLLTNLQTIQGILER,

SPATA5L1,ASTPAILFLDEIDSILGAR,XPO1, LDINLLDNVVNCLYHGEGAQQR,MSH6, IIDFLSALEGFK, PSMA2,YNEDLELEDAIHTAILTLK,NDUFA10, LLQYSDALEHLLTTGQGVVLER,NUBPL,LAQTLGLEVLGDIPLHLNIR, LDHAL6B, LIIVSNPVDILTYVAWK,NYQQNYQNSESGEKNEGSESAPEGQAQQR,YBX1, EWSR1, GDATVSYEDPPTAK, DLIDDATNLVQLYHVLHPDGQSAQGAK,MRPS22, C20orf4,WISLTNFISEATVEK, AKAP8L, TVEDLDGLIQQIYR, ISYNA1, SVLVDFLIGSGLK, MAP1B, TPDTSTYCYETAEK,HEATR6, ACALQVLSAILEGSK,MRPL10, TVPFLPLLGGCIDDTILSR,SPATA5, AVAPSIIFFDELDALAVER,SLFN11, ELIQSSDLQAFFETK, RPL13, STESLQANVQR, NUBP2,TLEEGHDFIQEFPGSPAFAALTSIAQK,RFC3, TVAQSQQLETNSQR,SETD1A, FQGSGAATETAESR, GPN3,FLDPDMYSLLEDSTSDLR,KEAESCDCLQGFQLTHSLGGGTGSGMGTLLISK4:3MKEIAEAYLGYPVTNAVITVPAYFNDSQR7:6 Incorrect!

Figure 3.4: Peptide-protein graph constructed from an AP-MS experiment with POLR2A as a bait. Proteins are drawn as red triangular vertices while peptides are drawn as blue round vertices. 70 Chapter 3. Protein Quantification with Shared Peptides

ELTTEIDNNIEQISSYK,QSLEASLAETEGR, LALDIEIATYR, SDDTVVYYCAR, ENDLSVLR, SRGFGFILFK,DLKDYFTK, LKYENEVALR, YCVQLSQIQAQISALEEQLQQIR, LGLDIEIATYR, QEPSQGTTTFAVTSILR, QSVEADINGLR, RVLDELTLTK, VAAEDWK, GSSGGGCFGGSSGGYGGLGGFGGGSFR, MFVGGLSWDTSKK, FLEQQNQVLETK, TRPV5, EYFGEFGEIEAIELPMDPK, GSLGGGFSSGGFSGGSFSR, LLQDSVDFSLADAINTEFK, SKELTTEIDNNIEQISSYK, LKYDNELALR, DVDAAYMSK, EIVLTQSPGTLSLSPGER, IGHV1OR15-1, LSISIDTSK, HNRNPAB, SEITELR, KRT7, WLQGSQELPR, IREYFGEFGEIEAIELPMDPK, KRT84, IFVGGLNPEATEEK, ALEESNYELEGK, KDVDAAYMSK, HTGPNSPDTANDGFVR, VIM, WLAWYQQKPGK, YVEVFK, IGKV3D-20, QSVEADINGLRR, VLDELTLTK, GFPSVLR, FFSDCK, FGEVVDCTIK, RSTSSFSCLSR,YLDFSSIITEVR, KRT4, FHTVSGSKCEIK, EVYQQQQYGSGGR, IRLENEIQTYR, KRT13, STSSFSCLSR, LQGEIAHVK, VTMQNLNDR, IGH@, LALDIDIATYR, VTISLDTSK, DASGVTFTWTPSSGK, GFGFILFK, LNDLEEALQQAKEDLAR, LQGEIAHVKK, KRT74, LALDIDIATYRK, IQNGAQGIR, ALEEANADLEVK,NQILNLTTDNANILLQIDNAR, LLIYVASSR, SEDTAMFFCAR, YHQIGSGKCEIK, NVQALEIELQSQLALK, SLNSFGGCLEGSR,KRT73, IGHA2, NVQDAIADAEQR, DVDNAYMIK, SCFV, VEDTAVYYCAR, SNNVEMDWVLK, KRT8, ZCWPW2, STGEAFVQFASQEIAEK,HNRNPH2, EGRPSGEAFVELESEDEVK, FGEVVDCTIKTDPVTGR, KRT78, IGHA1, KRT10, EVFTSSSSSSSR, ISSVLAGGSCR, AETECQNTEYQQLLDIK,TKYEHELALR, FSGSGSGTDFTLTISR, TLNNQFASFIDKVR, NQFSLR, ADLEMQIESLTEELAYLK, SKEEAEALYHSK, VLYDAEISQIHQSVTDTNVILSMDNSR, VTGEADVEFATHEDAVAAMAK, DLTEYLSR, EVQLLESGGGLVQPGGSLR, HNRNPH1, FASFIDK, YVELFLNSTAGASGGAYEHR, HNRPDL, SEITELRR, FLEQQNK, TFGQGTKVEIK, QAPGQGLEWMGR, TPLTATLSK, TFTCTAAYPESK, GLPWSCSADEVQR, IIAATIENAQPILQIDNAR, ATGIPDRFSGSGSGTDFTLTISR, DAASVDKVLELK, TIDDLKNQILNLTTDNANILLQIDNAR, QRPSEIKDYSPYFK,ASLENSLEETK, FASFIDKVR, LEGLTDEINFLR, APSTYGGGLSVSSSR, HGGGGGGFGGGGFGSR, YQELQVSAQLHGDR, GGSSSGGGYGSGGGGSSSVK, KRT16, AQYEEIAQR, IGKV3-20, DNSKNTLYLQMNSLR, DAEEWFFTK, SLDLDSIIAEVR, DLNYCFSGMSDHR, NHEEEMKDLR, LAADDFR, VDDTAVYYCAR, AEDTALYYCAK, YGDGGSTFQSTTGHCVHMR,VTGEADVEFATHEDAVAAMSK, MFIGGLSWDTSKK, SLLEGEGSSGGGGR, NKILTATVDNANVLLQIDNAR, LLEGEECR, KLLEGEECR, GFCFITYTDEEPVKK, APSTYGGGLSVSSR, ATGVPDRFSGSGSATDFTLTISR, GRFTISR, GGSGGGGSISGGGYGSGGGSGGR, NVQDAIADAEQRGEHALK, NTLYLQMNSLR, QRPAEIKDYSPYFK,SGGGGGGGGCGGGGGVSSLR, DAETWFLSK, ASLENSLEETKGR, TAAENEFVTLKK, LLIYGASSR, NSLYLQMNSLR, FSGSGSATDFTLTISR, AEDTAVYYCAR, LENEIQTYR, KRT14, ATLFCR, AEDTAVYYCAK, GLEWIGR, VHIEIGPDGR, NKIIAATIENAQPILQIDNAR, KRT2, KRT76, ASQSVSSSYLAWYQQKPGQAPR, SRGFGFVLFK, LASYLDKVR, ADLEMQIESLKEELAYLK, KRT3,AIGGGLSSVGGGSSTIK,GRLDSELR, VTISVDTSK, SQYEQLAEQNR, DSTYSLSSTLTLSK, DAEAWFNEK, GFGFVLFK, TAAENDFVTLKK, LLRDYQELMNVK, SLVGLGGTKSISISVAGGGGGFGAAGGFGGR,KRT79, ATENDIYNFFSPLNPVR, LLEGEDAHLSSSQFSSGSQSSR,IKEWYEK, TEAESWYQTK, TVAAPSVFIFPPSDEQLK, HKLYACEVTHQGLSSPVTK, FGEVVDCTLKLDPITGR, NVSTGDVNVEMNAAPGVDLTQLLNNMR, GSCGIGGGIGGGSSR, ATGIPDRFSGSGSGTDFTLTITR,IGK@, GRFTVSR, IREYFGGFGEVESIELPMDNK, KDAEAWFNEK, TTAENEFVMLKK, ISIGGGSCAISGGYGSR, SQYEQLAEQNRK,ADLEMQIESLTEELAYLKK, TKYETELNLR, TSFTSVSR, ILTATVDNANVLLQIDNAR, WQQGNVFSCSVMHEALHDHYTQK,FTISRDDSK, IDASKNEEDEGHSNSSPR, IQDWYDKK, SISISVAR, KRT75, VDNALQSGNSQESVTEQDSKDSTYSLSSTLTLSK, YIEVFK, MTLDDFR, FEMEQNLR, NSKIEISELNR, KQISNLQQSISDAEQR, KRT6A, VQWKVDNALQSGNSQESVTEQDSK, FSSSGGGGGGGRFSSSSGYGGGSSR, SEIIELNR, ATGGGLSSVGGGSSTIK, AEDTGVYYCAK, EVATNSELVQSGK, VLDELTLAR, YGSGGGSKGGSISGGGYGSGGGK, NMQDMVEDYR, GSGGGSSGGSIGGR, YEDEINKR, KRT6B, LYACEVTHQGLSSPVTK, WQQGNIFSCSVMHEALHNR, SEDSALYYCAR, HSGPNSADSANDGFVR, FGEVVDCTLK, GGSGGSYGGGSGSGGGSGGGYGGGSGGGHSGGSGGGHSGGSGGNYGGGSGSGGGSGGGYGGGSGSR,ALEEANTELEVK, LASYLDK, TSQNSELNNMQDLVEDYKK,AQYEEIANR, LLIYGVSSR, SCDTPPPCPGCPAPELLGGPSVFLFPPKPK, QEIECQNQEYSLLLSIK, QSSVSFRSGGSR,SRGSGGLGGACGGAGFGSR,SGTASVVCLLNNFYPR, SEDTAVYYCAR, HNRNPD, WELLQQVDTSTR, SKAEAESLYQSK, TPLGDTTHTCPR, GLEWVAFIR, NYSPYYNTIDDLKDQIVDLTVGNNK, FSGSGSGTDFTLTITR, LSCAASGFTFR, HNRNPF, TIEELQNK, SCDTPPPCPNCPAPELLGGPSVFLFPPKPK, ADDTAVYYCAR,TEDTAVYYCTR, VTVSSASTK, MFIGGLSWDTTKK, YCGQLQMIQEQISNLEAQITDVRQEIECQNQEYSLLLSIK, LALDVEIATYR, LRAEDTAVYYCAR,GTLVTVSSASTK, VDNALQSGNSQESVTEQDSK,RTVAAPSVFIFPPSDEQLK, TLLEGEESR, FLEQQNQVLQTKWELLQQVDTSTR, KRT5, SGFSSVSVSR, ALPAPIEK, KGPAAIQK, YEELQVTVGR, TEAESWYQTKYEELQQTAGR,LALDVEIATYRK, NTKQEIAEINR, HSEAATAQREEWK, SISISVAGGGGGFGAAGGFGGR, VELKTPLGDTTHTCPR, LSGGLGAGSCR, QGVDADINGLR, SRTEAESWYQTK, FSGSGSGTDFTLK, IGHG1, KRT17, SCDTPPPCPYCPAPELLGGPSVFLFPPKPK, GTAVTVSSASTK, IFVGGLSPDTPEEK, MSGECAPNVSVSVSTSHTTISGGGSR, WTLLQEQGTK, STSGGTAALGCLVK,LSSVTAADTAVYYCAR, GTTVVVSSASTK, TNAENEFVTIKK, IGHV4-31, GLEWVGFIR, ESESVDKVMDQK, VCGRGGGGSFGYSYGGGSGGGFSASSLGGGFGGGSR,TEELNREVATNSELVQSGK, SDDTAVYYCAR, NHEEEMNALR, SFNRGEC, IGHG3, EPQVYTLPPSRDELTK, ITGEAFVQFASQELAEK, STMQELNSR, VDLLNQEIEFLK, NLDLDSIIAEVK,YEELQQTAGR, IGKC, TKFETEQALR, DVDAAYMNKVELEAK,MSGDLSSNVTVSVTSSTISSNVASK, SFSTASAITPSVSR,ADTLTDEINFLR, SCDTPPPCPR, SLVNLGGSK, YEELQVTAGR, HKVYACEVTHQGLSSPVTK, GDDTAVYYCAR, FNWYVDGVEVHNAK, TPEVTCVVVDVSHEDPDVQFK, GQPREPQVYTLPPSR, QVLDNLTMEK, FSSCGGGGGSFGAGGGFGSR,IEISELNR, VSNKALPAPIEK,SLRSDDTAVYYCAR, SEDTAVYFCAR, ILTATVDNANILLQIDNAR, SRAEAESWYQTK,GLGVGFGSGGGSSSSVK, VSLAGACGVGGYGSR, QNLEPLFEQYINNLR, VYACEVTHQGLSSPVTK, GTLVIVSSASTK, VQALEEANNDLENK, GTTVIVSSASTK,IGHM, TTPPVLDSDGSFFLYSK, EIKIEISELNR, DSGVPDRFSGSGSGTDFTLK, CPAPELLGGPSVFLFPPKPK, SDDTAVYFCAR, GLVWVSR, LDNLQQEIDFLTALYQAELSQMQTQISETNVILSMDNNR,NKLNDLEEALQQAK, SCDTPPPCPICPAPELLGGPSVFLFPPKPK, LSCAASGFTFSR, GPSVFPLAPSSK, QISNLQQSISDAEQR, GGSGGSYGGGGSGGGYGGGSGSR, QNLEPLFEQYINNLRR, EPQVYTLPPSR, TPEVTCVVVDVSHEDPEVQFNWYVDGVEVHNAK,TPEVTCVVVDVSHEDPEVQFK,KCCVECPPCPAPPVAGPSVFLFPPKPK, GGSGGSHGGGSGFGGESGGSYGGGEEASGSGGGYGGGSGK, WYVDGVEVHNAK, GGSISGGGYGSGGGK, EPQVYTLPPSREEMTK, LOC100290146, GFSSGSAVVSGGSR, GPSVFPLAPSSKSTSGGTAALGCLVK, FLEQQNQVLQTK, AQYEDIAQK, EIETYHNLLEGGQEDFESSGAGK, GSYGSGGSSYGSGGGSYGSGGGGGGHGSYGSGSSSGGYR, GLPAPIEK, CCVECPPCPAPPVAGPSVFLFPPKPK, WQQGNVFSCSVMHEGLHNHYTQK, TAAENDFVTLK, THTCPPCPAPELLGGPSVFLFPPKPK, SGGGGGGGLGSGGSIR, GSSSGGGYSSGSSSYGSGGR, TPEVTCVVVDVSHEDPEVKFNWYVDGVEVHNAK, IGHG2, GFSSGSAVVSGGSRR, GGGFGGGSSFGGGSGFSGGGFGGGGFGGGR, VVSVLTVLHQDWLNGK,WQQGNVFSCSVMHEALHNHYTQK, SGGGFSSGSAGIINYQR, SCDKTHTCPPCPAPELLGGPSVFLFPPKPK, KRT9, THNLEPYFESFINNLRR, DVDNAYMIKVELQSK,KRT1,VELQSKVDLLNQEIEFLK, NQVSLTCLVK, NHKEEMSQLTGQNSGDVNVEINVAPGK, VVSVLTVVHQDWLNGKEYK, TTPPMLDSDGSFFLYSK,SRWQQGNVFSCSVMHEALHNHYTQK, TLLDIDNTR, LTVDKSR, DTLMISR, TPEVTCVVVDVSHEDPEVK, LNDLEDALQQAKEDLAR, VVSVLTVLHQDWLNGKEYK, QFSSSYLSR, KDVDGAYMTK, FDPWGQGTLVTVSSASTK,VVSVLTVVHQDWLNGK, TLNDMRQEYEQLIAK, GFYPSDIAVEWESNGQPENNYK, SGGGGGRFSSCGGGGGSFGAGGGFGSR, FSSSSGYGGGSSR, SLDLDSIIAEVKAQYEDIAQK, GPSVFPLAPCSR, KDIENQYETQITQIEHEVSSSGQEVQSSAK, WQEGNVFSCSVMHEALHNHYTQK, SDQSRLDSELK, TTPPVLDSDGSFFLYSR, QEYEQLIAK, TNAENEFVTIK, QGVDADINGLRQVLDNLTMEK,

GGGGGGYGSGGSSYGSGGGSYGSGGGGGGGR, HGVQELEIELQSQLSK, LALDLEIATYR, IGHG4, DIENQYETQITQIEHEVSSSGQEVQSSAK, STSESTAALGCLVK, GGGGSFGYSYGGGSGGGFSASSLGGGFGGGSR, NKLNDLEDALQQAK, SLDLDSIIAEVK, SDLEMQYETLQEELMALKK, GFYPSDIAVEWESNGQPEDNYK, YGPPCPSCPAPEFLGGPSVFLFPPKPK, HGVQELEIELQSQLSKK, THNLEPYFESFINNLR, QLWWRR, DVDGAYMTK, LEKEIETYHNLLEGGQEDFESSGAGK, LLRDYQELMNTK, IKFEMEQNLR, AEAESLYQSKYEELQITAGR, YCGQLQMIQEQISNLEAQITDVR, SLNNQFASFIDK, LASYLDKVQALEEANNDLENK,VQALEEANNDLENKIQDWYDK, SLNNQFASFIDKVR,LRSEIDNVK,

AVAREESGKPGAHVTVK, HIST1H2AH, SMVAVMDSDTTGK, YGKIETIEVMEDR, VAPDEHPILLTEAPLNPK, VSYARPSSEVIKDANLYISGLPR, RGPPPPPR, SSLGQSASETEEDTVSVSKK, HGSGSGQSSSYGPYR, SITVLVEGENTR, SKGYAFIEFASFEDAK,SISLYYTGEK, SRGFGFVTFSSMAEVDAAMAARPHSIDGR,KLFVGGIK, HIST1H2AG, AAVLLEQER, GRHGSGSGHSSSYGQHGSGSGWSSSSGR, LFADAVQELLPQYK, LFVGGIKEDTEEHHLR, LFIGGLSFETTDDSLREHFEK,EDSVKPGAHLTVKK, GPPPSYGGSSR, DVYLSPR, KPGDLSDELR, RLFAQLAGDDMEVSATELMNILNK, SSSGQSSGYSQHGSGSGHSSGYGQHGSR, LAQHITYVHQHSR, FGYVDFESAEDLEK, GFGFVDFNSEEDAK, IDTIEIITDR, LLGGVTIAQGGVLPNIQAVLLPK, H2AFJ, LFAQLAGDDMEVSATELMNILNK, GFGFVTYATVEEVDAAMNARPHK,IEVIEIMTDR, GIEKPPFELPDFIK, KLFIGGLSFETTEESLR, YHTINGHNCEVKK, AVSREDSVKPGAHLTVK, SSSRGPYESGSGHSSGLGHQESR, AGILTTLNAR, LTDCVVMR, DYFEQYGKIEVIEIMTDR, GFAFVTFDDHDSVDKIVIQK, ACTBL2, YGPPPSYPNLK, LFIGGLNTETNEK, DRDYSDHPSGGSYR, SSSSGQHGSGLGESSGFGHHESSSGQSSSYSQHGSGSGHSSGYGQHGSR, TLLAILR, GFGFVTFSSMAEVDAAMAARPHSIDGR, H2AFX, TTGIVMDSGDGVTHTVPIYEGYALPHAILR, LELQGPR, NDLAVVDVR, VGAGAPVYLAAVLEYLTAEILELAGNAAR,VTIAQGGVLPNIQAVLLPK, SMVAVMDSDTTGKLGFEEFK, QTGIVLNRPVLR, NDEELNKLLGK, HIST1H2AE, CAPNS1, TDGFGIDTCR, WGTLTDCVVMRDPQTK, FAANPNQNK, ELAVL1, VLVDQTTGLSR, YHTINGHNAEVR, HIST1H2AJ, SHFEQWGTLTDCVVMR, SSGPYGGGGQYFAKPR,LFIGGLSFETTDDSLR, HGSGSGQSSSYSPYGSGSGWSSSR, TQRPADVIFATVR, ISLGMPVGPNAHK, LFIGGLSFETTEESLR, SYELPDGQVITIGNER, HLQLAIRNDEELNK, VAPEEHPVLLTEAPLNPK, LFIGGLNTETNEKALEAVFGK, ALEAVFGK,LTIHGDLYYEGKEFETR, HGSGSGHSSSYGQHGSGSGWSSSSGR, RFELYFQGPSSNKPR, QKVEGTEPTTAFNLFVGNLNFNK, TEADAEKTFEEK, LFIGGLSFETTEESLRNYYEQWGK, VVTRHPDLK, HIST3H2A, THYSNIEANESEEVR,KLFIGGLSFETTDESLR, RGFAFVTFDDHDSVDK,IFVGGIKEDTEEYNLR, GFAFVTFDDHDTVDK, RGFGFVTFDDHDPVDK, HIST1H2AI, HNRNPA3, RBMX, SF3B2, FTVAELK, HRNR, QSSSYGPHGYGSGR, MCM7, ALLLLLVGGVDQSPR, NCL, HLQLAIR, THYSNIEANESEEVRQFR, SLFSSIGEVESAK, AINTLNGLR, GFGFVTFDDHDPVDKIVLQK, LGFEEFK, KIFVGGIK, HNRNPA1, IIAPPER, SISLYYTGEKGQNQDYR, AGLQFPVGR, YLWNNIK, DSYESYGNSR,LAEIGAPIQGNREELVER, GPYESGSGHSSGLGHQESR, FDLLWLIQDRPDR, VTQDELKEVFEDAAEIR, HIST1H2AB, NDEELNKLLGR, LFIGGLSFETTDESLR, EDSVKPGAHLTVK, DYGHSSSRDDYPSR, GGGGNFGPGPGSNFRGGSDGYGSGR, GFGFVTYSCVEEVDAAMCARPHK,EDSQRPGAHLTVKK, ACTG1, ACTB, ELAVL4, EKKPGDLSDELR, HNRNPA2B1, GGGGNFGPGPGSNFR, HIST1H2AD, QEYDESGPSIVHR, QSLGHGQHGSGSGQSPSPSR, FLQEFYQDDELGKK, HIST1H2AC, LDAMFR, LTIHGDLYYEGK, TGISDVFAK, VEGTEPTTAFNLFVGNLNFNK, EESGKPGAHVTVKK, NQGGYGGSSSSSSYGSGR, GFAFVTFDDHDTVDKIVVQK, YHTINGHNCEVK, AIKVEQATKPSFESGR, GHYESGSGQTSGFGQHESGSGQSSGYSK, SLEQNIQLPAALLSR, IFVGGIKEDTEEHHLR, VEQATKPSFESGR, LHDAFFK, LTDCVVMRDPASK, ELAVL2, GPYESGSGHSSGLGHR, MVDVVEKEDVNEAIR, H2AFV, GIEKPPFELPDFIKR, VFGNEIKLEKPK, VFGNEIK, CAPNS2, SRGFGFVTYSCVEEVDAAMCARPHK,SSGSPYGGGYGSGGGSGGYGSR, GSGSGQSPSSGQHGTGFGR, IAQPGDHVSVTGIFLPILR, GGNFGFGDSR, GFAFVTFDDHDSVDK, AGFAGDDAPR, GYSFTTTAER, IVEVLLMK, GFAFVTFESPADAK, TGIQEMR, EDSQRPGAHLTVK, IPQALEK, HGSGSGQSSSYGPYGSGSGWSSSR, QVVQGLLSETYLEAHR, KFGYVDFESAEDLEK, GEEGHDPKEPEQLR,KLFIGGLSFETTDDSLR, GFAFVTFESPADAKDAAR, VGEPVALSEEERLK, HGSGSGHSSSHGQHGSGSSYSYSR, EVVNKDVLDVYIEHR, ALELTGLK, GFGFVTFDDHDPVDK, LCYVALDFEQEMATAASSSSLEK, GFGFVTMTNYDEAAMAIASLNGYR, H2AFZ, SRGFGFVTYATVEEVDAAMNARPHK,SESPKEPEQLR, ALSRQEMQEVQSSR, YHTVNGHNCEVR, YHTINGHNAEVRK, YLWNNIKK, TLETVPLER, EESGKPGAHVTVK, HLQLAIRGDEELDSLIK, NYYEQWGK, NMGGPYGGGNYGPGGSGGSGGYGGR, DYFEEYGKIDTIEIITDR, GFGDGYNGYGGGPGGGNFGGSPGYGGGR, QEMQEVQSSR,DYFEEYGK,

LOC100287178,LOC100287513, FGR, CDK6, TVAPTECS, MIAGQVLDINLAAEPK, TTPSYVAFTDTER, VLRDNIQGITKPAIR, HIST2H2BF, HSVGVVIGR, SPRR2A, DKLEEILR, HAAENPGKYNILGTNTIMDK, TEEGPTLSYGR, SDAYYCTGDVTAWTK, FGFR3, CDK18, SDVEAIFSK, TVTAMDVVYALKR, H2BFS, FQDELESGKRPK, NFILDQTNVSAAAQR, GNLGAGNGNLQGPR, USP17L2, NREELGFRPEYSASQLK, YES1, CDK14, AGVETTTPSK, SYSCQVTHEGSTVEK, DAGTIAGLNVLR, IGGDAATTVNNSTPDFGFGGQKR,GGGGPGGGGPGGGSAGGPSQPPGGGGPGIRK, FADQKNPPNQSSNERPPSLLVIETK, HLYTKDIDIHEVR, DLSAAGIGLLAAATQSLSMPASLGR, USP17, VPPPPPIAR, NQVAMNPTNTVFDAK, VVDRDSEEAEIIR, VQVEYKGETK, VEILANDQGNR, HIST1H2BK, FLFENQTPAHVYYR, EKPYFPIPEEYTFIQNVPLEDR, YQLLQLVEPFGVISNHLILNK, FGFR2, CDK4, LLIYSNNQRPSGVPDR, ISGLIYEETRGVLK, HIST1H2BO, HIST1H2BD, NTHATTHNAYDLEVIDIFKIER, USP17L1P, DAVTYTEHAKR, SNLELFKEELK, NGQDLGVAFK, GIDLLKK, MAK, CDK9, VDSLLENLEK, VFIGNLNTLVVKK, GTLDPVEK,HSPA6, EIQTAVR, HNRNPC, VEIIANDQGNR, LTVLSQPK, SHKSYSCQVTHEGSTVEK, HIST1H2BC, IINDLLQSLR, MILIQDGSQNTNVDKPLR, QQVPSGESAILDR, LOC100287205, ESYSVYVYK, SPRR2G, KPGQSFQEQVEHYRDK, GYFEYIEENKYSR, GDADQASNILASFGLSAR, HCK, CDK15, HIST2H4A, VLKQVHPDTGISSK, SPRR2B, NDQRPSGVPDR, HIST1H4A, HIST2H2BE, GDDLQAIKK, TVTAMDVVYALK, KHSRP, U2SURP, NPPNQSSNERPPSLLVIETK, HNRNPU, FIEIAAR, MATR3, LCSLFYTNEEVAK, PARP1, SLQELFLAHILSPWGAEVK, LOC649330, QTQTFTTYSDNQPGVLIQVYEGER, HIST1H2BE, LOC100287441, FGFR1, CDK2, CYAT1, IINEPTAAAIAYGLDK, LLLPGELAK, QPCQPPPVCPTPK, HSPA8, KESYSIYVYK, KESYSVYVYK, MYSYPAR, DNIQGITKPAIR, HIST1H4C, VAPSKWEAVDESELEAQAVTTSK, YNILGTNTIMDK, IGPYQPNVPVGIDYVIPK, VCSTNDLKELLIFNK, FLQEQNK, LOC100287327, IADFGLAR, LADFGLAR, AMGIMNSFVNDIFER, AINQQTGAFVEISR, SVSLTGAPESVQK, ATLVCLISDFYPGAVTVAWK, AAPSVTLFPPSSEELQANK, HIST1H4E, HIST1H2BM, VTISCSGSR, HNRNPCL1, HIST1H2BB, TIQGHLQSENFK, DIDIHEVR, GPLPLSSQHRGDADQASNILASFGLSAR, LYN, CDK5, IGLV1-44, QKVDSLLENLEK, GPAVGIDLGTTYSCVGVFQHGK, TLGDFAAEYAK, MIASQVVDINLAAEPK, SQIHDIVLVGGSTR, LOC100287238, LLQDFFNGKELNK, DAVTYTEHAK, QVHPDTGISSK, HIST1H2BG, LYLVSDVLYNSSAK, LLEQYKEESK, VVHIMDFQR, BLK, CDK3, VFSATLGLVDIVK, LOC100287144, HIST1H2BL, KLASQGDSISSQLGPIHPPPR, IGGGIDVPVPR, FGPLASVK, DLPEHAVLK, DSFDDRGPSLNPVLDYDHGSR, STAGDTHLGGEDFDNR, DNIQGITKPAIRR, HIST1H2BJ, SPRR2E, LYSILQGDSPTK, ISKEVLAGRPLFPHVLCHNCAVEFNFGQK, ITPENLPQILLQLK, VVSEDFLQDVSASTK, FEELNADLFR, VQISPDSGGLPER, SPRR2D, AAAEIYEEFLAAFEGSDGNKVK, SSGPTSLFAVTVAPPGAR, SFQQSSLSR, LTVNPGTK, FYN, CDK1, YAASSYLSLTPEQWK, HIST1H2BN, LOC100287478, SYSCQVTHEGSTVEKTVAPTECS, SGTSASLAISGLR, VDSLLENLEKIEK, VFIGNLNTLVVK, IINEPTAAAIAYGLDKK,ARFEELNADLFR, ISGLIYEETR,VFLENVIRDAVTYTEHAK, HIST1H2BH, ICK, CDK12, LOC100287404, ADSSPVKAGVETTTPSK, CDK20, CDK16, QSNNKYAASSYLSLTPEQWK, USP17L3, LCK, CDK13, LOC100288520,LOC100287364, FGFR4, CDK17,

GSYGDLGGPIITTQVTIPK, ELGTVMR, FESPEVAER,MAAPIDR, YYVTIIDAPGHR, ASGPPVSELITK, HIST3H3, IQDKEGIPPDQQR, ALYETELADARR, TUBB, C7orf73, AKAP9, ENO3, LOC100506888, CUL3, ZNF461, CHD4, RBFOX1, IDEPLEGSEDRIITITGTQDQIQNAQYLLQNSVK, ALAAAGYDVEKNNSR, YEEIVK, KLPFQR, LADALQELR, TUBB2A, DTDSEEEIREAFR, DAALATALGDKK, IILDLISESPIKGR, VGSEIER, MGANSLER, UBB, KATGAATPK, LLIHQSLAGGIIGVK, H3F3B,HIST1H3D,HIST1H3B, LTTPTYGDLNHLVSATMSGVTTCLR, VFDKDGNGYISAAELR, VETGVLKPGMVVTFAPVNVTTEVK, EIAQDFK, CEP85L, KIAA1598, ENO1, TCEB3C, CUL4A, ZNF20, CHD5, RBFOX3, TITIEVEPSDTIENVK, LMNB1, LFQECCPHSTDR, VGEVTYVELLMDAEGK, GNFGGSFAGSFGGAGGHAPGVAR, EEF1A1P5, KLEEIK, KELELK, SGETEDTFIADLVVGLCTGQIK, TEKDNAALAR, LEGMFR, QCGKAFIR, HHYEQQQEDLAR, ILDVEIIFNER, UBC, KASGPPVSELITK, HIST1H1E, LLEGEEER, UBA52, LMNA, SGPFGQIFRPDNFVFGQSGAGNNWAK, HNRNPK, ALRTDYNASVSVPDSSGPER, CALM2, CALM3, HNRNPM, IGGIGTVPVGR, RVTIMPK, RPS27A, TTIPEEEEEEEEAAGVVVEEELFHQQGTPR, TUBB2B, EADIDGDGQVNYEEFVQMMTAK, HIST1H1D, KALAAAGYDVEK, TUBB4B, STELLIR, AGGPTTPLSPTR, IILDLISESPIK, DKFNECGHVLYADIK, QGGGGGGGSVPGIER, EEF1A1, EAFSLFDKDGDGTITTK, YRPGTVALR, KCGGIDK, THINIVVIGHVDSGK, H3F3A, IITITGTQDQIQNAQYLLQNSVK, TUBB4A, LOC100289196, INVS, ENO2, TCEB3CL, CAPN3, ZNF571, CHD3, RBFOX2, EIAQDFKTDLR, TITLEVEPSDTIENVK, VAVEEVDEEGKFVR, AFITNIPFDVK, VGQTIER, ILSISADIETIGEILKK, LPLQDVYK, VAGAATPKK, DTDSEEEIR, ALAAAGYDVEK, HIST1H3A, ILSISADIETIGEILK, EHALLAYTLGVK, SGVSLAALKK, TDYNASVSVPDSSGPER, MKDTDSEEEIR, INEILSNALK,MDERALPK,

ILEVDLK,SGKAPILIATDVASR,APILIATDVASR, QVSDLISVLR, ISDSEGFKANLSLLR, ATGATQQDANASSLLDIYSFWLNR, KTTIENIQLPHTLLSR, HIANYISGIQTIGHR, SNLGSVVLQLK, GMQPTEFFQSLGGDGER,SEFYSLQEQHLGLALDVDR, KSNSSPSR, AISEQESAGQLEIR,ILIGTVFDK, TLSVSRR, NGQVNLTVR,QDLMIEDNLLK, IDEVPSR, QYNKLK, QACRTPSR,KTEPATGFIDGDLIESFLDISRPK,CTNILR, SQELEVKNAAANDK,LRWLLDNVR, NYYGYQGYR,ASSLSESSPPK,VTIFTVPENCR,DAINAETVLVLVNAVYFK,VIHDNFGIVEGLMTTVHAITATQK, LLAAVEAFYSPPSHDRPR, ALDVMVSTFHK, DIVQFVPYRR,TLLADQGEIR, LLIYETEAK,DTSSVAVNLPVPSGVAFPAVFLR,MPCQLHQVIVAR, FSGVPDR, FRHENIIGINDIIR,RLPDAHSDYAR,LSGEAFDWLLGEIESK,NQGGSSWEAPYSR, LASYVEKVR,QRQEEPPPGPQRPDQSAAAAGPGDPK, GGHPPAIQSLINLLADNR, FLICTDVAAR, AAFGISDSYVDGSSFDPQRR,ALIEILATR, QQLSAEELDAQLDAYNAR,YSGGNYRDNYDN, TNEGVIEFR, SRSF4, LLQLVEDR, QAGEVTYADAHK, LFVGNLPADITEDEFKR, FATHAAALSVR, SPAVKPAAAPKQPVGGGQK, VVPSDLYPLVLGFLR,RTCCFPALPK, NLNPEDIDQLITISGMVIR,SVAGGFVYTYK, NFGDQPDIR,TLATDILMGVLK, EVDDLGPEVGDIK, DDX5, SFPQ, NOLC1, MCM4, SF3B3, DHX15, PSMB8, NBPF3, RBBP6, ALG13, PON3, SPECC1L, MAGI2, CCDC39, CCDC93, VPS13D, RHPN2, DDB1, BARD1, DYNC1H1, GK, HNRNPUL2, LIMA1, DSC1, SERPINB12, GAPDH, CHERP, S100A4, CPNE1, MTA2, RBM25, RPAP1, CAPN2, IGKV2-24, MAPK1, RBM14, POLR2A, ZNF326, KRT23, C19orf43, NCOA5, DDX1, SRRM2, ANXA6, ALYREF, RBM3, SRSF6, DDX17, AGEVFIHKDK,

WNLDELPKFEK, FGQGGAGPVGGQGPR,NONO, DNQLSEVANK, RVVPSDLYPLVLGFLR,GLQVDLQSDGAAAEDIVASEQSLGQK, YQQLFEDIRGQSDIAITK, FLAVGLVDNTVR, KFVIHPESNNLIIIETDHNAYTEATK,HQSFVLVGETGSGK, HRLDLGEDYPSGK, NFYQEHPDLAR, FAQHGTFEYEYSQR,

STCIYGGAPK, LLEVDLK, VLEEANQAINPK, GIVEFASKPAAR,AVVIVDDR, NKPGPYSSVPPPSAPPPKK, ALADDDFLTVTGK, NVSEELDRTPPEVSK, IRVESLLVTAISK, GMKPTEFFQSLGGDGER,AEMNILEINKK, QNVNILIDTEK, FSAFLDK, ILIGTVFMK, STLDINEMVR,SASPQRAARPR, RISVADR, QENKMR, ELVHDRFHK, MDLRQACR,EATADDLIKVVEELTR,LLLSYGASR, SALEQMR, WLLDNVR, LVQIASR,HRADHPPAEVTSHAASGAK,VQDQDLPNTPHSK, FCFDLFQEIGKDDR, GALQNIIPASTGAAK, EQGIQDPIKGGDVR, SELKELLTR, SDPFLEFFR,EFEEESKQPGVSEQQR,EYSSELNAPSQESDSHPR,IIERDTSSVAVNLPVPSGVAFPAVFLR, LGLKEFYILWTK,FSGVPDRFSGSGAGTDFTLK, VADPDHDHTGFLTEYVATR, YSGSYNDYLR,INAGFGDDLNCIFNDDNAEKLVLR, SGYGFNEPEQSR, LASYVEK,AHQCGDDDKTRPLVK,DPYGFGDSR, FLVLDEADGLLSQGYSDFINR,SLSYSPVER, SEIDLLNIRR,SLGTADVHFER,YYDSRPGGYGYGYGR,

LSGLNAFDIAEELVK,NQDLAPNSAEQASILSLVTK, ISFEEFCAVVGGLDIHK, HILLAVANDEELNQLLK, VLHEAEGHIVTCETNTGEVYR, ISLVLGGDHSLAIGSISGHAR,FGQGAHHAAGQAGNEAGR, GVTFLFPIQAK,IVSGEAESVEVTPENLQDFVGKPVFTVER, GRIA4, CCDC72, CNN2, GOLGA6D, DVLIQGLIDENPGLQLIIR, TKAPDDLVAPVVK, CSGEEQSLEQCQHR, NDQDTWDYTNPNLSGQGDPGSNPNKR,ATCNNYTPQFQLRPSYSSCFPQYR, ILVNQLSVDDVNVLTCATGTLSNLTCNNSK, LFVGSIPK, SSELQAIKTELTQIK, DILCGAADEVLAVLKNEK, GFGFVYFQNHDAADK, SVVTVIDVFYK, VAVGELTDEDVK, DLLDLLVEAK, LYGPTNFSPIINHVAR, LQETLSAADR, DVNLASCAADGSVK, EATLQDCPSGPWGK, VFNVFCLYGNVEK, GRIA1, EXOG, CNN1, GOLGA6A, GRPAVCQPQGR, NKTLVTQNSGVEALIHAILR, TGYTLDVTTGQR, SNIDALLSRLEQIAAEQK, LTAIDILTTCAADIQR, LFVGGLKGDVAEGDLIEHFSQFGTVEK, GHRHQEEESETEEDEEDTPGHK, QINIHNLSAFYDSELFR, VGSTSENITQK, LAGANPAVITCDELLLGHEK, QINVSTDDSEVK, LVGGDNLCSGR, TDNAGDQHGGGGGGGGGAGAAGGGGGGENYDDPHKTPASPVVHIR, SPLGEVAIRDIVQFVPFR, RFEPCSSSYLPLRPSEGFPNYCTPPR, LVQLLVK, LCDSYEIRPGK, VFIGNLNTALVK, LIGLSATLPNYEDVATFLR, KLFVGGLK, HSSWSEGEEHGYSSGHSR, ESLVVNYEDLAAR, NINITKDLLDLLVEAK, QLFSSLFSGILK, LWSLDSDEPVADIEGHTVR, GQWGTVCDDGWDIKDVAVLCR, LCFSTAQHAS,

CPNE3, DIVQFVPFR, MBD3, ILF2, PPP3R1, H2AFY, SNRPD3, ARG1, SBSN, DDX21, LONP1, EEVIDFSKPFMSLGISIMIK, KLEELK, RHLYDPK, EKADGTEQVER, PRKDC, LAAVVSACK, PRPF4,IWSLDSDEPVADIEGHTVR, CD5L,ELGCGAASGTPSGILYEPPAEK, HNRNPL,AITHLNNNFMFGQK, KPRP, RLDQCPESPLQR, JUP, NLSDVATKQEGLESVLK, HNRNPR, NLATTVTEEILEK, RALY,VLAGQTLDINMAGEPKPDRPK, SNRNP200,HLILPEKYPPPTELLDLQPLPVSALR, HNRNPA0,LFIGGLNVQTSESGLR, FLG2, FSNSSSSNEFSK, MCM2, DNNELLLFILK, DDX3X,HVINFDLPSDIEEYVHR,

INQVFHGSCITEGNELTK, LWIANYSLPR, GVWGSVCDDNWGEKEDQVVCK, YYGGGSEGGRAPK, CPVEIPPIR, VSVELTNSLFK, AGPIWDLR, VTVPLVR, DILCGAADEVLAVLK, AVPKEDIYSGGGGGGSR, QSSYGQHGSGSSQSSGYGQYGSR, RDNNELLLFILK, SFLLDLLNATGK, DFSAFINLVEFCR, FEPIHGNFLLTGAYDNTAK, HQNQWYTVCQTGWSLR, NPNGPYPYTLK, SSPVEFECINEK, TSFSPCVPQCQTQGSYGSFTEQHR, TMQNTSDLDTAR, DLYEDELVPLFEK, SNIDALLSR, HLSDHLSELVEQTLSDLEQSK, SGGGGGGGGSSWGGR, SGQSSYGQHSSGSSQSSGYGQHGSR, IFASIAPSIYGHEDIKR, KGADSLEDFLYHEGYACTSIHGDR, SLGTIQQCCDAIDHLCR, QAEVLAEFER, LVGGDNLCSGRLEVLHK, SSSGLLEWESK, TFIIDYYFEVVQK, KLSGLNAFDIAEELVK,VKPAPDETSFSEALLKR, ISFEEFCAVVGGLDIHKK, GVTIASGGVLPNIHPELLAK, FLILPDMLK, TVNTAVAITLACFGLAR,FGQGVHHAAGQAGNEAGR,GAVEALAAALAHISGATSVDQR, LSSDVLTLLIK, GRIA3, GRIA2, LOC729973, PPP2R5A, CNN2P9, CNN3, GOLGA6C, GOLGA6B, NLLIFENLIDLK, AGIEAGNINITSGEVFEIEEHISER, IWLDNVR, ASLNGADIYSGCCTLK, LRPEPCISLEPRPRPLPR, TLVTQNSGVEALIHAILR, GYAFITFCGK, LQASNVTNKNDPK, YAQAGFEGFK, EDIYSGGGGGGSR, SSQSNYGQHGSGSSQSSGYGQHGSSSGQTTGFGQHR, EWTLEAGALVLADR, HAIPIIK,

KFESEILEAISQNSVVIIR, LTEEETVCLDLDKVEAYR,AELIVQPELK, HNGPNDASDGTVR,ATENDIANFFSPLNPIR, MELQEIQLKEAK,DFPVESVKLTEVPVEPVLTVHPESK, VLVDVER, ALINADELASDVAGAEALLDR, RHPYFYAPELLFFAK, AFGYYGPLR, AYHNSPAYLAYINAK, LTEQKGEQQIQK, RFEDLGSR, GIAYVEFVDVSSVPLAIGLTGQR, GVDEVTIVNILTNR, ASAISVTVLNVIEGPVFRPGSK, QYGLGPNGGIVTSLNLFATR, KIAA1949, EEA1, MKRN4P, MYT1, LOC100508206, C9orf174, TPM3P4, WTLNSR, QLEMEK, REEPQR, SLSGCPRAK, LGLLGDSVDIFKGIPFAAPTK, LEKEAAR, DINTDFLLVVLR, FLDQNLQK, ALLQAILQTEDMLK, YIEIFR, VHIDIGADGR,

PNN, FKQESTVATER, SNRNP70, IHMVYSK, SPTAN1,IAALQAFADQLIAAGHYAK, SHCIAEVENDEMPADLPSLAADFVESK,ALB, SRSF3, NPPGFAFVEFEDPRDAADAVR, SMARCE1, IAAEIAQAEEQAR, SUPT16H, ELAAQLNEEAK, YLPM1,SQAEPLSGNKEPLADTSSNQQK, RBM39, DLEEFFSTVGK, ANXA2,SALSGHLETVILGLLK, DSG1, ESSNVVVTER, LTGHLHAQGTPPYVINLDPAVHEVPFPANIDIR,GPN1, DHX9, AAECNIVVTQPR, DSP, HNRNPH3, TPM3,

NQCTQVVQER, VLLQEEGTR, EIAENALGK, STGEAFVQFASK, LAQFEPSQR, PPP1R18, MXRA5, MKRN1, MYT1L, CEL, KIAA1529, MELQELQLKEAK, LTEVPVEPVLTVHPESK, EFEDPRDAPPPTR, GLVSSDELAKDVTGAEALLER, HPYFYAPELLFFAK, DSCPLDCK, SRAEAALEEESR, HTDVQFYTEVGEITTDLGK, TTVQQEPLESGAK, SKGIAYVEFVDVSSVPLAIGLTGQR, TNQELQEINR, IHSDCAANQQVTYR, SMSLVLDEFYSSLR, FSDHVALLSVFQAWDDAR, TLELQGLINDLQR,ETQSQLETER, ATGEADVEFVTHEDAVAAMSK,DGMDNQGGYGSVGR,

GBAP1, PHKA1P1, FAM98A, SP5, PTX4, PNPLA6, BRD4, CSNK2A1, ESRP2, PABPC3, SMARCC2, FAM133B, HSP90AB3P, RBBP4, SNRPD2P1, SRSF8, SNRPB, CAMK2B, LLLPHWAK, LMRGLLHCMIR, RLDVTVQSFGWSDR, FMRSDHLAK, LEEQFR, HSDFSRLAR, QLSLDINKLPGEK, GGPNIITLADIVKDPVSR, YIELFR, FGPALSVK, TQDECILHFLR, ALAEFEEK, EQVANSAFVER, TVALWDLR, NNTQVLINCR, VGDVYIPR, VLGLVLLR,

DLKPENLLLASK, LTQYIDGQGRPR,

GBA, PHKA1, FAM98B, SP3, REV1, PNPLA7, BRD2, CSNK2A1P, ESRP1, PABPC1, SMARCC1, FAM133A, HSP90AB1, RBBP7, SNRPD2, SRSF2, SNRPN, CAMK2G,

HDAC1, YGLVTYATYPK, GLRAMSVGSGLR,NPPGFAFVEFEDPRDAEDAVR,NPPGFAFVEFEGPRDAEDAVR, AIGPHDVLATLLNNLK, IYNSIYIGSQDALIAHYPR, VNWATTPSSQKK, DTSNHFHVFVGDLSPEITTEDIK,LFLDFLEEFQSSDGEIK, TTGEGTSLRGDINVCIVGDPSTAK, LKGEATVSFDDPPSAK, AAIDWFDGKEFSGNPIK, TKDIEDVFYK, EAGDVCYADVYRDGTGVVEFVR,DSNFAGDLVR, KSEYTQPTPIQCQGVPVALSGR,VADPVVTFCETVVETSSLK, TCFVDCLIEQTHPEIR, QVTITGSAASISLAQYLINVR, INISEGNCPER, FINYVK, GPSSVEDIK, ELQCLTPR, IKSDHPGISITDLSK,LVEGILHAPDAGWGNLVYVVNYPK, FTEYETQVK, LLYDTFSAFGVILQTPK,VTGQHQGYGFVEFLSEEDADYAIK, CST8, KNETGVLR, CSPP1, RYLTQGITQGK, NOS2, AGGGQALSR, PLA2G4B, MAVAEVSR, PTPRG, NSIVVPSER, TRIP10, ALSLWPR, RBM15, KNSASAER, MRTO4, QLGLPTALKR, ARL6IP4, GGLSGPQGGR, LAMB4, ANSTITQLTANITK,GOLGA2, KLGELQEK,LOC100507381, MLSSFLR, DPP7, MLSAYLR,

LFENLR, YGEYFPGTGDLR, CFB, SRSF7, SF3B1, TIAL1, MCM6, FUS, SRSF1, DDX42, EFTUD2, PCBP2, NPM1, SSRP1, TARDBP, SF3B4,

CLVNLIEK, VSEADSSNADWVTK,AFSYYGPLR, VYVGNLGTGAGKGELER,DTPGHGSGWAETPR, ICFELLELLK,FEDVVNQSSPK, SAFAPFGK,YLQLAEELIRPER,DQGSRHDSEQDNSDNNTIFVQGLGENVTIESVADYFK,HVEEFSPR, APKPDGPGGGPGGSHMGGNYGDDR, YGAIRDIDLK, SHEGETAYIR,VVQGDIGEANEDVTQIVEILHSGPSK,NFYNEHEEITNLTPQQLIDLR, SFVEFILEPLYK, GLAEDIENEVVQITWNR,IITLAGPTNAIFK, LVVPASQCGSLIGK,GPSSVEDIKAK, MSVQPTVSLGGFEITPPVVLR,ADVIQATGDAICIFR, FYVPPTQEDGVDPVEAFAQNVLSK, TSDLIVLGLPWK, FGGNPGGFGNQGGFGNSR,NLDVGANIFIGNLDPEIDEK, NQDATVYVGGLDEK, HDAC2,

COL16A1, GLPGPPGSK,YLHSNMMAATLCGSPMYMAPEVIMSQHYDAK,ULK2, ADAM28, RYNENQDEIR, MATN1,MFAVGVGNAVEDELR,ATP2A1, GCSSGTGVR, SULT1C2, SILDQSISSFMR, INTS10, LPANLLYK, SPTBN2, LEQLAARFDR, CNGB1, AVEKMPR, IMPACT, KMGIAEAASK, AGTPBP1, INAEPSESDTAR, ATP6AP2,IPDVAALSMGFSVK, FANCM, ELSLVEQR, NCCRP1, TVIAQHHVAPR,LOC100289085, LLSSRSR, ANKRD63, LGLRLDR, DNAJC8, LLLDQEQKK, TCHH, LEQRDR, LOC729175,EPEKEAGAGALPR, DOCK10, RTWLESMAK, PPP1R3F, TSVEAAVAPR, C6orf99, DGGLAMLPILVSK, ABHD11, TAMLLALQR, CCT2, VLVDMSR, ADCK5, SSKVNDVEAIR, ARMCX6,DAGFPFSQDINSHLASLSMAR,SLC52A1, MAAPTLGR, ITPR3, KVHGDVVK, PCDH17, MKTTSTK, ZNF284, WASGILR, BCL2L10, SVAAWLR, PRKCE, TEDPKLR,

EXOC6B,SVYDGEEHGRFMEK, UBR1, ILTCMQGMEEIR, POLR2C,ITELTDENVKFIIENTDLAVANSIR,CD1A, FILGLLDAGKAHLQR,ABCC10, SLSLGQR, NID2, IESALLDGSERK, LAMA1, EPKELDPMLPR, JMJD6, GALVSGSGLR, NAT1, GIMDIEAYLER, USH2A, DGTIVYTGLETR, TNK2, LGDGSFGVVRR, BTN1A1, KVEISIPASSLPR, PGS1, KVDCLESTLEK, CLIP1, LLDLDALRK, SDHB, TCLQASR, POLI,EAMYNQLGLTGCAGVASNK,RAG1, TVLDQAR, TSPAN12, YSFLRGTK, THEMIS, EPLHVVATK, C1orf144, ITQKESR, KIF26B, LEQLASR, MLLT4, MVSMMEGVIQVR, PLEKHJ1, ELQALSR, ABCF3, LEKAEAR, CCDC112, FLQQTGGR, HERC1, LNASDLRLPSR, CCDC124, AQIEDTLR, KIAA0825, LALTDAIK, ASXL3, EEPRVPPLK, SNW1, GLQTVHINENFAK, ANPEP, LNYTLSQGHR, TRMT112, LQATEVR,

MLL, SETKSGDK, AQP5, SFGPAVVMNR, FLNA, TPCEEILVK, KDM3A, NEKEPMVLK, PSKH2, LLILEAGHR, ARHGAP20, TLLIDGR, DENND5B, IKTDVGR, MUC7,LPPSPNNPPKFPNPHQPPK,ANO2, LSPQAGSR, CYP2F1, RSIEER, C9orf80,MAANSSGQGFQNK, IPO13, GLAGGQQPR, TXLNG2P,GSGGGNETIGAKQGR,SNX13, LGRTSLPR, RASAL2, LLLEYKAR, FLJ22184,AGGIGGATIPESAPRAGPTR,KLHL34, DRWTAAGALPR, LRRC8A, MANLTELELIR, CRH, LMEIIGK, FAM105B, ESLTLLR, YIPF6, KSNTLLR, NAP1L2, TVTEDFPK, AK3, EDDKPETVINR, C1orf87, SSAWKTPR, SCYL1, ATRDPFAPSR, SUDS3, KLDQQYK, SERPINA12, VVDVSVPR, ZKSCAN3,VLQVPMLAHGGCCREDAVVASR,CEP97, MAPAYLPR, ABCF1, LKLLEEER, MORF4L1, TSGLQQKNVEVK, AKAP13, KNVLQGGESTK,

MOXD1, KEGLSFNK, ZNF643, DFPVGERR, SYT4, MLMNREIIK, LOC93622, AAEAARMGR, NUP205, LEEGSELEKK, ZFR2, CLESLAALR, ZBED1,YFGFDTNAEGCILQWK,FUBP3, SINQQSGAHVELQR,SMCR8, LADHRTQ, AS3MT, LFKHSK, IBTK, TRLDGVR, OR1Q1, GGFMKWMSR, LLHHDDPEVLADTCWAISYLTDGPNER,KPNA2, UNC80, MESRGLR, ANKRD26, LENEKQSK, RBL1, VIAIDSDAESPAK, PROS1, RANSLLEETK, CGN, QLKGLEEK, VARS, CLVALKER, ABCC13, SAAKGSDSR, SRCAP, IFCFILSTR, CMA1, LLLFLLCSR, MACROD1, LIICVFLEK, C10orf47, FPSNIIVTNGAAR,PRPS1L1, SRSPISAK, GPLD1, MQVAMAR, RYR3, MLANTGRK, SLC6A10P, EGTPDTAPTSR, TRIM56, ITGLCPFGPR, MTR, QGHYESLKER, HDAC6, QKLGGASQGR, MYRIP, MDTLAVALR,

RFESD, NLDGSAQDPEKR, NKX3-1, QLSSELGDLEK, XKR5, ATDLAGKR, PCDHB8, VDFLTGEIRLK, FAM211B, KLTDAIK, RGP1, DSWLAELAGER, CEP135, ELALSDLR, FKBP15, IMNQVFQSLR, FABP7, MVEAFCATWK,LOC100508805, WITQKR, RBM34, LNNSELMGRK, TRAPPC9, LTNRSPR, TIMM8A, LNSMLLNDK, LENG1, NKDNVAR, TYK2, HGIPLEEVAKK, FANCA, LTAASLSR, MGAT5, SLAEKQNLEK, SPECC1, MIRALEEK, GBP2, MLEEIQK, MAN2B2, SSLKGLAR, TLR7, TMESESLRTLEFR, C5orf30, QLDAGLAR, ZNF687, FISHKK, MFI2, LSVMGCDVLK, SACS, IENLSYDAK, PTPRB, EKSLLNIMMLVPHK,KIAA1841, FNKDAQR, SCEL, MNKSLNR, SH3TC2, EGECLTLCK, CEP350, GLDIESTSK, TCP10, LEGQLEAGEPK, 09-Mar, SSSHSGREVVMR,

SETX, VPVSKTPK, LIN37, NQLDAVLQCLLEK, ACSM5, MRPWLR, CHMP4A, AEVWRTLR, MED27, SANQMGVSAKR, C2orf77, ERVALAK, CDSN, IILQPCGSK, CYP2E1, RVCAGEGLAR, CCDC88C, QENIQLAADAR,CDC42BPG, AQGQQEELLQR, IFNE, LIIFQQR, MORN1, VLQISAGVR, CPE, LLIPGNYK, CDH11, KLADLYGSK, CA3, EPMTVSSDQMAK, LRRC9, TVDLIPVDQFR, SAP30BP, MARLESR, JMJD5, LEKTVPR, BRP44L, SPEIISGR, MAP1A, VAIDRSR, CCDC168, DSVEQVIR, DACT2, SPMDKVLR,LOC100506191,VLTAMVGK, MYOM3, LESGDRIR, PPP2R3C, DWKEVLR, USP24, ISFLLGASR, MAPK8IP2, ISEGSSPIR, NCAPD3, GSVMPSVIR, DTX4, LLLVAWDR, PLXNA3, YESLLR, ATP13A5, ELSEARIR, CEP250, ELKEAAR,

RTKN2, HCCEELMKIEIMSPR, TJP2, QGVNTMR, NUMA1, NSLISSLEEEVSILNR, RPL38, KIEEIK, GEMIN4, IFIQTLEANACR, FCHSD2, AELEQKIDEAR, STX3, QLEITGK, WWC2, RTSWIDPR, ITPR1, KTEEGNNKPQK, TPM1, MEIQEIQLKEAK, HMGB1, LNMGKGDPK, RGS12, VLVVDLGGSSSR, BBX, GQRSTPLTHDGQPK,ATXN7L1, KHQNGTK, B3GALT5,LEDVFVGLCLERLNIR, CALY, MAPLQTVR, CCHCR1, LREQLSDTER, SETDB1, LREAMAALR, CWF19L1, LMDAAELVK, SPR, GWALYCAGK, FBXO22, KCNEVK, HDAC9, ILLGDDSQK, CCNDBP1, MLVAENGKK, LAMA4, GKNLSKPK, PARPBP, KLAAVMR, DSE, SSIVPEVK, DDX23,RPAVVYIGSAGKPHER,C10orf114, GVPTPSGDR, APBA3, DFPTISR, MYH10,SLEAEILQLQEELASSER,RAD51AP2, MSSVYLK, SNED1, LASYTVR,

CCDC122, KVLFNLK, ZNF469, YILGIASSR, FLG, HHEASSRADSSR, ARID5B, LILPYER, GSN, LMATQVSKGIR, FKTN, IESKDPR, TC2N, IQTQTPR, SCN9A, ITWRSR, SEMA6D, LTWRSR, IREB2,GPYLLGVKAVLAESYEK,SPAG1, TVIDKSR, DAGLA, LKVFLCCTR, MMP16, ILLTFSTGR, NCAPG, CLILCYELLK, FAM9B, HAGKDPVR, AKAP1, SLLSSPTK, N4BP2L2, ITLGEEKDR, R3HCC1, ALELQHK, RRP1B, LLEMTSRK, AZGP1,HVEDVPAFQALGSLNDLQFFR,NAF1, IFEIFGPVAHPFYVLR,PRPF40A, QAAPPIELDAVWEDIRER,SAV1, NQSFLR, BEGAIN, EKLSALQEQK, C14orf38, KLEEMR, NBPF8, AEMNILQINEK, PDZD2, ANSLVTLGSHR, TRIM37, NVEAVRNAK, DIAPH1, AAELEKK, ETFA, SPDTFVR, SGCZ, SLIMEAPR, SCLT1, LILEHQEK,

SEZ6L, LLLHDKDR, AKNA, SSLSADLR, TEP1, GLDLSTCPIALK, FAM99A,MAAAGPMTGAAASR,BTN3A2, QQKEEIGLSR, RIOK2, QQKSAVR, ALDOA, SKGGVVGIK, SMC1A, DLTLEENQVK, ADCYAP1, VLDQLSAGK, TRMT12, LKLSNIADR, PRPF8, SGLNQIPNRR, MED14, LLIDSVHAR, C10orf118, LLSLNTDK, TNS1, SPARDVIFR, TTC25, KLSEVGR, SEMA4D, HGGTAELK, REEP3, VSWMISR, PLD5, LLLSFWK, SIK3, DLKAENLLLDANLNIK,CTTNBP2, NGHTDCVR, FBXO18, LLITYSNR, S100G, SPEELKR, TRIM66, TSGSLCPR, TRIO, VSTMLDRLHSTR, WNT8A, SPGSAQSLGK, ARFGEF2, TLKINDEK, ATF2, VWVQSLEK, ARVCF, DAEWTTVFK, RPUSD3, TLQCLGLR, ZMAT1, LPFETFR, S100A9, NIETIINTFHQYSVK, GLOD5, LRHLPSR,

FBXO33, VSPAEQPR, NDUFA10,LLQYSDALEHLLTTGQGVVLER,WDR55, LKFWDMAQLR, MYL6,VLDFEHFLPMLQTVAK,TIAM1, TLDSHASR, KCNQ5, STSANISR, AGBL4, LNPVVEK, NCKAP1L, KFPNIDVR, CCDC166, MPSLTASR, PLCL1, SLEAIEEKDLR, C20orf111,VDASGSVASLSVGEGTGVR,UTRN, DTLTQLNAK, MED25, DAILNELK, KIAA1429, ENSGAIEASVK, WHSC1L1, GASEISDSCKPLK, USP7, LGLEAGDTDDPPR, ARID5A, LGAYELVTGRR, LYAR, VKDAVEQQGEVK, FUT8, RMAESLR, HTATSF1, VLDEEGSER, KLRAP1, EQILTNKTLK, ACAT2, VAVLSQNR, MOK, RTLSLPR, DCLK2, LKQHFNNALPK, PITRM1, QGLCVLRR, RIMS1, IGEEARR, DST, GVVDPEFR, LRIG3, MSAPSLR, ACO1, RPQDKVAVSDMK, IL18RAP, DEIIER, IRGC, EAAVLQEIR, CCDC87, ELKSIPR,

ARID4B, EDRGTPINK, SP140, EVFAIQGNK, RBCK1, NTSLNPQELQR, GVINP1, LISDSKR, PCDHGA12,CQLNLDILMEDK, RC3H1, KEIMAQLEER, EXOC1, MTAIKHALQR, FAM120B,CLSSYTSVKENFDK, GNA12, MADVGGQRSQR,C6orf136, ARTSVLPGLR, NUDC, GKDMVVDIQR, MTOR, ILLNIEHR, KIAA1109, KENEGSAK, RAD50,VFQTEAELQEVISDLQSK,PAFAH1B1, MVSASEDATIK, DACT1, EAVVAKPK, OR51A7, SVLNNSEVK, PCDHGA1, KLTGCSR, RBBP8, KLASCSR, TMEM198, MPGTVATLR, ATP5A1, VGSAAQTR, LOC646214,SDSVLLGMEGSVK, FBLN7, LAALQNSVGR, GJA10, QLEVDPSNGKK,ANKRD12,KQMALLMQMTAR, MYLIP, YVFDIK, MAP3K1, EMENKETLK, GBE1, LAMAIPDK, IPO7, NWGVMKMILMK,MYBBP1A,AQHQQALSSLELLNVLFR,PCCB, TAQEYGGIRHGAK, JMJD1C, YEDLLK,

HDAC4, LQETGLR, MS4A13, LGREVSR, DGKZ, MLKSVSR, L1RE2, EALNMERNNR, KCNQ2, EHVDRHGCIVK, DLG5, ELKELK, BAIAP3, KAAGQALK, LCN8, VLKYFTR, ZNF629, VLGAQSVPR, PCCA, VTEDTSSVLR, TAB1, CRTASGVLGLCK,C17orf56, GEFPGVVLSK, LUZP1, ISSELEMLR, DTAGEVAGDTGGDTVGYTETSANVKTMG,MLNR, AK8, SLSSTCARR, EXO1, EASEPIHVR, CYFIP1, TVCFQNLR, HSPA12A, LPEGHLK, ADAMTS4, RFASLSR, GLS, STGLRTSDPR, DRGX, EALEAQQSLGR, EFCAB7, NSKLMEPNLIK,LOC100287869,VPSLREAMEK, EIF3G, VTNLSEDTR, ZHX3, YPEEQLK, C19orf60, DAPIRTLVQR, GTF3C1, KNITNDIR, PSD4, EAGEAPKPGEEVK, PCM1, LPEMEPLVPR, PTPRZ1, VKLAQLAEK, ZNF14, QCGKAFLR, GLG1, CAIGVTHFQLVQMK,

TTBK1, VATISPRR,LOC100507483, LSSLWAR, FLJ45964,VCAAALARAWRPENQAR,ZNF619, TAQLNISK, ENTPD1, VLDVVER, MAN1C1,ALWPLPQLPQPSEWELGATPCLSWR,ZRANB1, LFDEVLDRDVQK, CASC4, SLALWPR, HDDC2, QITQLLPEDLR, AP4B1, PYLGSEDVVKELK, PARP11, SNTQAPRLAPSHR,ARHGEF1, LLLKSHSR, HOXD9, MLGGSAGRLK, MMEL1, HVIPETNSR, SLFN11, AACALLNSGGGVIR,RNF123, LTIAILR, SARS2, AAVTEAVR, CCDC12, RVPQAKPVAVEEK, PAH, SWTIPSSLR, DMD, LVEPQISELNHR, POLR2B,IVATLPYIKQEVPIIIVFR,C14orf166, HILGFDTGDAVLNEAAQILR,ASB2, DGDEEALKTMIK, C2CD2, AGQAVSSGGILR, YJEFN3, VWPGAER, TKTL2, DLAMFR, UPF2, MIKGGGLK, HIVEP3, VAPPERR, LYST, LTVLAVNR, GBP1, IMKNEIQDLQTK, SWAP70, MGSLKEELLK, NIPAL3, LAGSKDPR,

PSMD1, NNNTDLMILK, OTC, QSDLDTLAK, IL31RA, IPAIQEK, C15orf5, RPPEQTR, TFAP4,AAILQQTAEYIFSLEQEK,RFX7, VPHSGKTEGSTAGAQIPSK,USP15, KVVEQGMFVK, PRIC285, ILGLEYSLR, ALDH1B1, MLRFLAPR, MADD, CRELYYCVK, SMU1, YIHLENLLAR, C8orf74, RADVLLLK, LINC00479, LLIPRWR, MOS, LLLGATLPR, ERAL1, LVQSALR, CCDC25, LSSAHVYLR, NUDT14, MERIEGASVGR, DAK, IKMATADEIVK, VAC14, MADILLR, KIAA1107, ARLENMSPR, S100A6, LQDAGIAR, MAST4, SLDWNSLLRQK, KIN, RVEGIQYEDISK, C1orf141, QAEIILARR, AHNAK2,VPDVEVSLPSMEVDVQAPR,NDUFS8, MRCLTTPMLLR, PRPF31,VIVDANNLTVEIENELNIIHK,RAPH1, ETSLLLR, PIK3R6, LAQAYHR, CAPZA1, DQNEAQTAKEFIK, EHD2, TSFIQYLLEQEVPGSR, XYLB, TKILATGGASHNR,

SHROOM3, SSPATADKR, CHSY1, SSNNNGSVRTA, NOL10, LLEQQELR, VWA3A, GLDSLVAIMR, PIAS4, QVELIRNSR, ZSCAN2, MAADIPR, DOCK6, VAELYLPLLSIAR, EIF2AK2, GVDYIHSK, TPR,GQNLLLTNLQTIQGILER,PRR11, DTIFPSR, SIM1, EKENSEFYELAK, ISYNA1, SVLVDFLIGSGLK, RAD54B,SLLGYDPQILQMLK, NEK10, LILPNKQK, COQ6, ILLLEAGPK, CSE1L, IIIPEIQK, C13orf35, VTVQTDDSNK, MAGEH1, KIVTEEFVR, CDC5L, ILLGGYQSR, GNAL, QDMLAEKVLAGK,FLJ39061, KGPYCPR, ASAP2, EIISEVQR, MRPS5,VSGSINMLSLTQGLFR, BBS9, LTELLSER, PTBP1,KLPIDVTEGEVISLGLPFGK,HNRNPUL1, HLPSTEPDPHVVR, TAF5, QTLLAVLQFLR, PRPF3,VLGTEAVQDPTKVEAHVR,SRSF10, YGPIVDVYVPLDFYTR,PABPN1, TSLALDESLFR, POF1B, NLEQENQNLR, RPL15, FFEVILIDPFHK,

ADARB2, MRAAELALR, RBM45, FSPFGDIQDIWVVR, RPL6, ASITPGTILIILTGR, PDCD6, LSDQFHDILIR, INA, QILELEER, RPS10, IAIYELLFK, TRIP12,GGPVKIDPLALVQAIER,CCT6A, AQLGVQAFADALLIIPK,EWSR1, GDATVSYEDPPTAK,MRE11A, GNDTFVTLDEILR, TLN1, TLAESALQLLYTAK, CCT5,WVGGPEIELIAIATGGR,RBM17, LLQSQLQVK, DAZAP1,SQAPGQPGASQWGSR, PLG, LSSPADITDK, DDX18, TLAFLIPAVELIVK, HAL, ALDYLAIGIHELAAISER,SNRPD1, NREPVQLETLSIR,AVAAAAAAAVTPAAIAAATTTLAQEEPVAAPEPK,SRRM1, PSMA2,YNEDLELEDAIHTAILTLK, GGH, TAFYLAEFFVNEAR,IGKV1OR1-1, LLIYAASSLQSGIPSR,C22orf28, QIGNVAALPGIVHR, DDX50,GAVDALAAALAHISGASSFEPR,PELP1, LLLLESVSGLLQPR, SUB1,GISLNPEQWSQLKEQISDIDDAVR,TCOF1, VLTELLEQER, SART1, GLAAALLLCQNK, KPNB1, AAVENLPTFLVELSR,NYQQNYQNSESGEKNEGSESAPEGQAQQR, YBX1, RPS6,GCIVDANLSVLNLVIVK, RALYL, VFIGNLNTAIVK,

RSL1D1, LLSSFDFFLTDAR, S100A8, LLETECPQYIR,

Figure 3.5: Peptide-protein graph constructed from an AP-MS experiment with RPAP3 as a bait. Proteins are drawn as red triangular vertices while peptides are drawn as blue round vertices. 3.4. Experimental Results 71

HIST1H2AH, x=7

HIST3H2A, x=1

H2AFJ, σ=2 NDEELNKLLGGVTIAQGGVLPNIQAVLLPK, HIST1H2AC, σ=1 NDEELNKLLGR, H2AFX, x=2

σ=1 HLQLAIRNDEELNK, HIST1H2AA,

AGLQFPVGR, σ=1 HIST1H2AG,

σ=1 HLQLAIR, HIST1H2AE,

σ=1 NDEELNK, HIST1H2AB,

HIST2H2AA3, σ=8 VTIAQGGVLPNIQAVLLPK,

HIST1H2AJ, σ=2 NDEELNKLLGK,

HIST1H2AD,

HIST1H2AI,

HIST2H2AC,

Figure 3.6: A connected component of the bipartite graph from the CDK9 dataset, where all peptides in the component are shared peptides that belong to histones. Our algorithm for MINIMUMPROTEINTYPES chooses only 3 proteins (HIST1H2AH, HIST3H2A and H2AFX) to be in the solution. Blue round vertices indicate peptides while proteins are drawn as red triangular vertices.

ular selection of proteins may not be very accurate due to the relatively small number of peptides. However, the existing approach of simply ignoring shared peptides within MS data would lead us to miss this instance entirely. On the other hand, using the naive ap- proach of assigning each protein the average value of constituent peptide abundances would lead us to selecting all the histones in this component, resulting in poor quanti- tative data.

Such an effect is more prominent in the example shown in Figure 3.7. Here, the pep- tide abundances are taken from an AP-MS run with RPAP3 as the bait protein. While the

peptides are shared by four distinct proteins, our algorithms for MINIMUMPROTEINTYPES

and MINIMUMERRORSUM both picked HNRNP1 and HNRNP2 (heterogeneous ribonucle- oprotein particles) with positive abundances. This is indeed a reasonable identification as the remaining candidate proteins are supported by only 1 or 2 constituent peptides. 72 Chapter 3. Protein Quantification with Shared Peptides

σ=5 HTGPNSPDTANDGFVR,

σ=2 YVEVFK,

σ=2 FFSDCK,

σ=1 IQNGAQGIR,

TRPV5, σ=4 SNNVEMDWVLK,

σ=2 EGRPSGEAFVELESEDEVK,

HNRNPH2, x=5 σ=5 STGEAFVQFASQEIAEK,

σ=2 VTGEADVEFATHEDAVAAMAK,

HNRNPH1, σ=4 YVELFLNSTAGASGGAYEHR, x=6

σ=2 GLPWSCSADEVQR,

σ=1 DLNYCFSGMSDHR,

σ=6 YGDGGSTFQSTTGHCVHMR, HNRNPF,

σ=4 VTGEADVEFATHEDAVAAMSK,

σ=1 VHIEIGPDGR,

σ=3 ATENDIYNFFSPLNPVR,

Figure 3.7: A connected component of the bipartite graph from the RPAP3 dataset. Our algorithms for MINIMUMPROTEINTYPES and MINIMUMERRORSUM both select HNRNPH1 and HNRNP2 in the solution while discarding the less supported proteins.

3.5 Discussions

In this chapter, we studied the protein quantification problem by first modelling the mass spectrometry data using mathematical programs. While the four problems for- mulated are all expressed as integer programs, we studied their hardness from combi- natorial standpoints by reductions from SETCOVER and EXACTCOVER. Then, we devised approximation algorithms for the corresponding problems which also showed promis- ing empirical performances when tested on synthetic data as well as biological data.

There remain various extensions to the models discussed here. For example, all four problems (MULTICOVER,MINIMUMPROTEINTYPES,MINIMUMUNIFORMERROR,MINI-

MUMERRORSUM) model the mass spectrometry data under the assumption that each peptide may occur in a protein at most once. Lifting this assumption can be achieved easily by modifying the adjacency matrix A so that Ai ,j is the number of times peptide si appears in protein w j , and the resulting problems can be attacked using the same 3.5. Discussions 73

approaches.

Moreover, one may wish to incorporate into our models the “detectability” c(i ) of each peptide7, indicating the difficulty for correctly detecting a particular peptide. Re- cent studies have shown that peptide detectability can be estimated reliably [2, 102]. For example, Tang et al. proposed a machine learning approach to estimate the pep- tide detectability using its sequence length and its neighbouring regions in the parent

protein sequence [133]. To incorporate this into our model of MINIMUMERRORSUM, we P can minimize the objective function i c(i ) εi . The algorithm for the modified objec- · tive function need not change, as our randomized rounding scheme simply rounds the fractional solution from the LP relaxation.

In the case of MULTICOVER and MINIMUMPROTEINTYPES, we can integrate this no- P tion of peptide detectability to each protein w j by p j c i . Then, given this ( ) = (i ,j ) E ( ) ∈ penalty function for each protein, we can ask to minimize the total cost of chosen pro- P P teins: j p(j )x j for MULTICOVER, and j p(j )χ(x j ) for MINIMUMPROTEINTYPES. The re- sulting problems can be solved under the same framework, using algorithms for the corresponding weighted set cover problems.

Another factor that can be incorporated into our model is the set of peptides that were not observed in the MS data. For every unobserved peptide sk , we can introduce its absence to our model by adding sk to the bipartite graph with the abundance value

σk = 0. As these are additional constraints to the integer programs, they would help us obtain a more accurate solutions albeit with slower running time. However, as discussed in Section 3.4.1, the coverage of a protein by the identified peptides is typically low due to low sensitivity of mass spectrometers. As a result, we cannot simply interpret un- detected peptides as zero abundance, and the addition of these additional constraints should be avoided until the sensitivity of MS data improves.

7 The detectability c(i ) for peptide si can be set higher if it is less likely to observe the correct abundance of si , say. 74 Chapter 3. Protein Quantification with Shared Peptides

3.6 Bibliographic Notes

The set cover problem and its generalization into the multicover problem have been studied extensively in the approximation algorithms community. Lund and Yannakakis [101] showed that SETCOVER cannot be approximated within a factor of o(logn) unless P = NP. Moreover, Feige [48] showed that, unless NP DTIME (nO(loglogn)), there is no ⊂ (1 o(1))lnn approximation. Thus the approximability of the set cover problem is es- − sentially resolved if P = NP. When the input set system admits a special underlying 6 structure, the lower bound on the approximation ratio for the general set cover (and multicover) can be improved: for example, Brönnimann and Goodrich [21] showed an improved O(1) approximation ratio when the elements are points on the plane while the sets correspond to disks on the plane. Further, there are various algorithms specialized for different restricted cases of set systems [25, 47] that may provide better approxima- tion ratios. Unfortunately, however, the set systems constructed from our PPI dataset do not admit these underlying structures.

The results discussed in this chapter are to be submitted [89]. Chapter 4

Direct PPI Networks from AP-MS Data

With the help of high-throughput experimental techniques, a large amount of PPI data has recently become available, providing us with a rough picture of how proteins inter- act in biological systems. However, interaction data from high-throughput experiments often suffers from relatively high error rates and protocol-specific biases. Therefore, in- ferring the physical PPI network from high-throughput data remains a challenge in sys- tems biology.

As discussed in Chapter 1, affinity purification followed by mass spectrometry iden- tification (AP-MS) is an increasingly popular approach to observe protein-protein inter- actions (PPI) in vivo. One drawback of AP-MS, however, is that it is prone to detecting indirect interactions mixed with direct physical interactions. Therefore, the ability to distinguish direct interactions from indirect ones is of much interest. In this chapter, we propose a simple probabilistic model for the interactions captured by AP-MS exper- iments, under which the problem of separating direct interactions from indirect ones is formulated. Then, given idealized quantitative AP-MS data, we propose an approach to identify the most likely set of direct interactions that produced the observed data.

This chapter is organized as follows. In Section 4.1, we give a brief overview of related research around the AP-MS technology. Then, Section 4.2 describes the mathematical

75 76 Chapter 4. Direct PPI Networks from AP-MS Data

modelling of an AP-MS experiment, and formulates an optimization problem. In Sec- tion 4.3, we describe the overall algorithm, which is based on a collection of graph theo- retic approaches that succeed at inferring a large fraction of the network nearly exactly, followed by a genetic algorithm that infers the remainder of the network. Finally, in Sec- tion 4.4, we test the accuracy of our method using both biological and simulated PPI networks. Here, we apply our algorithm to the prediction of direct interactions based on a large set of AP-MS PPI data in yeast [93].

4.1 Background

In an AP-MS experiment, a protein of interest (the bait) is first tagged and expressed in vivo. The bait is then immuno-precipitated (IP), together with all of its interacting partners (the preys), and finally, preys are identified using mass spectrometry. Like Y2H and other high-throughput experimental methods, however, AP-MS suffers from experi- mental noise. A number of approaches have been proposed to separate true interactions from false-positives. These approaches mostly focus on reducing false-positives due to protein misidentification from MS data [63, 107, 150], on detecting contaminants [19], or a combination of both [26, 33, 35, 111, 122, 123]. These methods often make use of the guilty-by-association principle, and quantify the confidence level of an interaction by considering alternative paths between two protein molecules. In this context, a true interaction between bait b and prey p is considered a true positive if, at some point in the set of cells considered, there exists a complex that contains both b and p. One con- cern with this line of reasoning is that, as the sensitivity of the AP-MS methods improves (and thus the stability of the complexes that can be detected decreases), the transience of detectable interactions will increase to a point where, eventually, every protein may be shown to marginally interact with every other protein.

A key property of AP-MS approaches is that a significant number of the co-purified prey proteins are in fact indirect interaction partners of the bait protein, in the sense that 4.2. Mathematical Modelling and Problem Formulation 77

they do not interact physically and directly with the bait, but interact with it through a chain of physical interactions involving other proteins in the complex. Therefore, it is critical, when interpreting AP-MS-derived PPI networks, to correctly interpret the mean- ing of the term “interaction”. Although not designed to identify physical interactions, the increasing amount of AP-MS data available justifies the need for separating direct phys- ical interactions from indirect ones. This is the problem we consider in this chapter:

“Given the results of a set of AP-MS experiments, filtered for protein misiden- tifications and contaminants, how can we distinguish direct (physical) from indirect interactions?”

Note that since the false-positive filtering methods listed above consider indirect in- teractions as true-positives, they cannot be used to address this problem. Gordân et al. [65] study the related problem of distinguishing direct vs. indirect interactions be- tween transcription factors (TF) and DNA. While the objective of their study is similar to ours, their method makes use of information specific to TF-DNA interactions (e.g. TF binding data, motifs from protein binding microarrays), and thus is not immediately ap- plicable to the problem on general PPI networks. In fact, to our knowledge, no existing approach seems directly applicable.

4.2 Mathematical Modelling and Problem Formulation

Throughout this chapter, we make the assumption that appropriate methods have been used to reduce as much as possible protein misidentifications and contaminants, in such a way that all interactions detected are either direct or indirect interactions. Our task is to separate the former from the latter. To save from confusion, therefore, we note that false positives (resp. false negatives) henceforth refer to falsely detected (resp. un- detected) direct interactions inferred by our algorithm. 78 Chapter 4. Direct PPI Networks from AP-MS Data

4.2.1 A Probabilistic Model for AP-MS Data

We first describe a simple model of the AP-MS PPI data that shall be used throughout this chapter. Though admittedly rather simplistic, our model has the benefit of allowing the formulation of a well-defined computational problem.

Let Gd i r e ct = (V, Ed i r e ct ) be the graph of direct (physical) interactions over the set V of proteins considered, and let N (b) = p V : (b,p) Ed i r e ct be the set of direct in- { ∈ ∈ } teraction partners of protein b. We model the physical process through which PPIs are identified in an AP-MS experiment as follows. If a bait protein b is in contact with a di- rect interaction partner p N (b), the affinity purification (AP) process (see Section 1.1.2 ∈ in Chapter 1) on b will pull down p, which will then be identified through mass spec- trometry. In addition, if p interacts with p N p at the same time as it interacts with 0 ( ) ∈ b, protein p may also be pulled down, even though the two proteins b and p only 0 0 indirectly interact. In general, any protein x that is connected to b by a series of simulta- neous direct interactions may be pulled down by b. As a result, all interaction partners of b (direct or indirect) will be identified together. Figure 4.1 depicts an example of this effect.

In order to distinguish direct physical interactions from indirect ones, the availabil- ity of quantitative AP-MS data is helpful. Although quantitative AP-MS remains in its in- fancy, prey abundance can be estimated fairly accurately using various approaches, e.g. the peptide count [55], spectral count [100], sequence coverage [51], protein abundance index [73], and our own approach of quantifying absolute protein abundance discussed in Chapter 3. Combined with the increasing accuracy and sensitivity of mass spectrom- eters, these methods are becoming more reliable. Throughout the discussions in this chapter, we thus assume that this quantitative data is available to us.

To model the quantitative AP-MS data, we first consider the strength of a direct in- teraction, which can be measured by the energy required to break it. To give an example, suppose the PPI network consists of only two proteins, b and p. Then, let A(b,p) denote 4.2. Mathematical Modelling and Problem Formulation 79

b Cell b p4

p2 p p b 1 2

p1 p1 p2 p3 p4 p5 p6 p7 p8 p7 p3 b (b) b p6 p3 p5 p5 p6 p5 b p6 p1

p8 p p7 p p4 2 8

(a) (c)

Figure 4.1: A schematic for indirect interactions in AP-MS data. (a) In a cell, multiple copies of a bait protein b are expressed, and interact (directly or indirectly) with other prey proteins p1,...,p8. (b) After the pull-down on the bait b, MS detects all prey proteins, including indirect interaction partners p4 and p8. (c) The direct interaction network should, however, contain only edges between direct interaction partners.

the observed abundance of a prey protein p obtained by the bait b, and let χ(b,p) denote the true abundance of interactions between b and p. Since protein interactions may be

disrupted by the AP process, we expect that A(b,p) is correlated with the strength of the interaction between b and p, as well as the true abundance χ(b,p). We model the strength of an interaction using the probability pˆ(b,p) that the interaction between b and p survives the purification process, and assume that the interaction breaks with

probability 1 pˆ(b,p). Then, the observed abundance of protein p obtained from the − pull-down on b would be

A(b,p) χ(b,p) pˆ(b,p). ∝ ·

As another example, consider now a system of three proteins b, p1, p2, where (b,p1)

and (p1,p2) form direct interactions, but b and p2 do not interact directly. As before,

let χ(b,p1,p2) denote the true abundance of protein interactions that occur simulta-

neously, i.e., number of times both (b,p1) and (p1,p2) occur together. Then, A(b,p2) ∝ 80 Chapter 4. Direct PPI Networks from AP-MS Data

χ(b,p1,p2) pˆ(b,p1) pˆ(p1,p2). In general, the observed amount of protein p obtained · · upon pull-down of b will be proportional to the probability that b and p remain con- nected after each edge (u ,v ) E (Gd i r e ct ) is broken with probability pˆ(u ,v ). Our goal ∈ is then to infer Gd i r e ct from the set of observed abundances A(u ,v ). In this paper, we make the following simplifying assumptions:

1. All direct interactions (u ,v ) Ed i r e ct survive with the uniform probability pˆ, and ∈ fail independently with probability 1 pˆ. − 2. All possible direct interactions take place at the same time, irrespective of the pres- ence of other interactions, and with the same frequency.

Albeit rather strong, these assumptions provide a useful starting point for separating direct interactions from indirect ones (see Section 4.5 for possible relaxation of these assumptions). Despite its simplicity, our mathematical modelling of AP-MS does fit ex- isting biological data reasonably well (see Section 4.4.5). We note that Asthana et al. [5] have proposed an approach similar to our probabilistic graph model to identify novel members of known protein complexes. However, their goal is not to identify indirect interactions, so their approach is not applicable to our problem.

4.2.2 Problem Formulation

We are now ready to formulate the algorithmic problem addressed in this chapter. We henceforth consider the (unknown) direct interaction network Gd i r e ct as a probabilistic graph, where each edge in Gd i r e ct survives the AP-MS process with probability pˆ, and fails otherwise. Let denote a random graph obtained from Gd i r e ct by removing edges G in Ed i r e ct independently with probability 1 pˆ. Then, define PGd i r e ct (u ,v ) to be the prob- − ability that vertices u and v remain connected (directly or indirectly) in : G

PGd i r e ct (u ,v ) = Pr[ there exists at least one path from u to v in ]. G 4.2. Mathematical Modelling and Problem Formulation 81

GDirect GDirect PGDirectPGDirect a a b b abab c d c e d f e f a 15/85/85/165/325/32a 15/85/85/165/325/32 c c b 15/85/165/325/32b 15/85/165/325/32 c c 11/21/41/411/21/41/4 d d 11/21/211/21/2 d d e e 11/411/4 f f 1 1

e e f f (a) (b)

Figure 4.2: An example of (a) a direct interaction network GDi r e ct and (b) its connectivity matrix PGDi r e ct 1 calculated with pˆ = 2 . Assuming each edge of GDi r e ct survives with probability pˆ, the probability of connectivity between each pair of protein can be estimated via sampling of the probabilistic network.

We call PGd i r e ct the connectivity matrix of Gd i r e ct . See Figure 4.2 for an example of a direct interaction network and its connectivity matrix. In Figure 4.2(b), the entry PGd i r e ct (b,d ) 5 is assigned 16 , because, out of all 64 subgraphs of Gd i r e ct , only 20 of those maintain the vertices b and d in the same component. In general, PGd i r e ct (u ,v ) can be estimated from Gd i r e ct by straight-forward Monte Carlo sampling. However, its exact computation

(known as the two-terminal network reliability problem [34, 138]) is #P-complete [138].

A set of AP-MS experiments where all proteins have been tagged and used as baits

yields an approximation of A(x,y ) for all pairs of proteins (x,y ), which can be trans-

formed into an estimate M (x,y ) of PGd i r e ct (x,y ) through an appropriate normalization.

We are thus interested in inferring Gd i r e ct from M :

EXACT DIRECT INTERACTION GRAPHFROM CONNECTIVITY MATRIX (E-DIGCOM) Given: An n n connectivity matrix M × Find: A graph G = (V, E ) such that PG (u ,v ) = M (u ,v ) for each u ,v V . ∈ 82 Chapter 4. Direct PPI Networks from AP-MS Data

In a more realistic setting, the connectivity matrix PG would not be observed pre- cisely, and the E-DIGCOM problem may not admit a solution. We are thus interested in an approximate, optimization version of the problem:

APPROXIMATE DIRECT INTERACTION GRAPHFROM CONNECTIVITY MATRIX (A-DIGCOM)

Given: An n n connectivity matrix M and a tolerance level 0 δ 1 × ≤ ≤ Find: A graph G = (V, E ) such that the number of pairs (u ,v ) V V such that PG (u ,v ) ∈ × | − M (u ,v ) δ is maximized. | ≤

Note that although the complexity of the DIGCOM problems are currently unknown, the reverse problem of finding the connectivity matrix M given an input graph G (net- work reliability problem) is well studied. Furthermore, related network design problems have been studied extensively in the computer networking community. For example, the reliable network design problem is to choose a minimal set of edges over a set a nodes so that the resulting network has at least the prescribed all-pairs terminal relia-

bility; various algorithms including branch-and-bound heuristics [76] and genetic algo- rithms [38, 39] have been proposed.

4.3 Algorithm for A-DIGCOM

Our algorithm for the A-DIGCOM problem has three main phases outlined as below.

PHASE I. We start by identifying, based on the connectivity matrix M , vertices from Gd i r e ct with low degree, together with edges incident to them. As most PPI networks ex-

hibit the properties of scale-free networks [10], this resolves the edges incident to a significant portion of the vertices ( 75%, in our networks). ∼

PHASE II. At the other end of the spectrum, Gd i r e ct contains dense clusters that often cor- respond to protein complexes. We use a quasi-clique finding heuristic to identify such dense clusters from the connectivity matrix M . 4.3. Algorithm for A-DIGCOM 83

(a) (b)

(c) (d)

Figure 4.3: The outcome of our direct PPI detection algorithm after each phase. (a) The original solution space; (b) After detecting weakly connected regions; (c) After detecting dense clusters; (d) The genetic algorithm detects the remaining interconnecting regions.

PHASE III. To infer the remainder of the network, we use a novel genetic algorithm (see Sec- tion 2.4). This highly customized genetic algorithm makes use of the findings from the previous two steps in order to dramatically reduce the dimension of the prob- lem space, and to guide the mating process between parent candidates to create good offspring solutions.

Figure 4.3 gives an example of the output after each phase of our algorithm. In what follows, we describe each phase of the algorithm in detail. 84 Chapter 4. Direct PPI Networks from AP-MS Data

4.3.1 Identification of Weakly Connected Regions

I-a. Finding cut edges

A cut edge in a graphG is an edge (u ,v ) whose removal would result in u and v belonging to two distinct connected components (e.g. edge (c,d ) in Figure 4.2(a)). The following theorem allows the identification of all cut edges based on the connectivity matrix PG .

Theorem 4.1. A pair of vertices u and v from V forms a cut edge in G if and only if the following two conditions hold:

(i) PG (u ,v ) = pˆ

(ii) V can be partitioned into V = Vu Vv , where Vu = x V : PG (x,u ) PG (x,v ) ∪ { ∈ ≥ } and Vv = x V : PG (x,u ) < PG (x,v ) , such that s Vu and t Vv ,PG (s,t ) = { ∈ } ∀ ∈ ∀ ∈ PG (s,u ) pˆ PG (v,t ). · ·

Proof. Necessity is trivial. For sufficiency, suppose the conditions (i) and (ii) hold, and

(u ,v ) is not a cut edge. Then, to keep the graph connected, there must be an edge (s,t ) = 6 (u ,v ) joining Vu and Vv . Since (s,t ) is an edge, p(s,t ) pˆ. However, by assumption, we ≥ have p(s,t ) = p(s,u ) pˆ p(v,t ) < pˆ, which is a contradiction. · ·

The above theorem immediately provides an efficient algorithm to test whether a pair of vertices forms a cut edge in time O( V 2): recursively find edges satisfying condi- | | tions in Theorem 4.1. Observe that removing a cut edge (u ,v ) allows us to decompose the graph into two connected components (subgraphs induced by Vu and Vv , respec- tively), and the probability of connectivity between every pair of vertices in Vu (Vv , resp.) remains the same after removing (u ,v ). Therefore, the submatrices that correspond to

Vu and Vv can be treated as independent subproblems, and one can recursively detect cut edges in the remaining subproblems.

Note that a special case of cut edges is an edge whose endpoint is a degree-1 vertex. Because removing a cut edge does not change the connectivity between any two nodes 4.3. Algorithm for A-DIGCOM 85

on the same side of the cut, this allows us to repeatedly identify and remove degree-1 ver- tices, i.e., the corresponding rows and columns in the input matrix. Hence, if the entire graph is assumed to be a tree, this algorithm can reconstruct the graph completely. On the other hand, PPI networks tend to contain many cut edges due to their sparsity. Con- sequently, this simple characterization for cut edges allows a significant simplification of our problem by identifying all cut edges.

Theorem 4.2. If G is a tree, E-DIGCOM can be solved efficiently.

I-b. Finding degree-2 vertices

We now consider the problem of identifying degree-2 vertices from the connectivity ma- trix M . After degree-1 vertices, which are identified in the previous step, they constitute the next most frequent vertices in the biological networks we studied. While we do not have a full characterization of these vertices, the following theorem gives a set of neces- sary conditions that lead to a heuristic to predict degree-2 vertices.

Theorem 4.3. Let s be a degree-2 vertex in G such that N (s ) = u ,v . Then, the following { } four conditions must hold.

2 (i) Low connectivity: for each t V, PG (s,t ) < 2pˆ pˆ . ∈ − (ii) Symmetry: PG (s,u ) = PG (s,v ).

(iii) Neighbourhood: for each t V s,u ,v ,PG (s,t ) < PG (s,u ). ∈ − { } (iv) “Triangle” inequality: for each t V s,u ,v ,PG (s,t ) < max PG (u ,t ),PG (v,t ) . ∈ − { } { }

Before we prove this, consider the following results on graph composition. Let G1 =

(V1, E1) and G2 = (V2, E2) be two graphs. Then the following hold.

Fact 4.4. (Series composition) Suppose V1 V2 = c , and a new graph G is constructed ∩ { } by joining G1 and G2 at c. Then, for any s V1 c and for any t V2 c ,PG (s,t ) = ∈ − { } ∈ − { } PG1 (s,c) PG2 (c,t ). · 86 Chapter 4. Direct PPI Networks from AP-MS Data

Proof. Since c is a cut vertex, PG (s,c) = PG1 (s,c), and PG (c,t ) = PG2 (c,t ). For the vertices s and t to be connected, all of s,c,t must be in the same connected component. Because every edge removal is independent, we have PG (s,t ) = PG1 (s,c) PG2 (c,t ). ·

Fact 4.5. (Parallel composition) Suppose V1 V2 = s,t , and a new graph G is constructed ∩ { } by joining G1 and G2 at s and t (possibly leading to parallel edges between s and t ). Then,

PG (s,t ) = PG1 (s,t ) + PG2 (s,t ) PG1 (s,t ) PG2 (s,t ). − ·

Proof. The event that s and t are not connected is when s and t are disconnected in both G1 and G2. Thus 1 PG (s,t ) = (1 PG1 (s,t )) (1 PG2 (s,t )) = 1 PG1 (s,t ) PG2 (s,t ) + − − · − − − PG1 (s,t ) PG2 (s,t ). ·

Now we are ready to prove Theorem 4.3.

Proof. We prove each condition separately.

Condition (i) Low connectivity. Since s has degree 2, it becomes disconnected from

2 2 2 the rest of the graph with probability (1 pˆ) . Thus, PG (s,t ) 1 (1 pˆ) = 2pˆ pˆ . The − ≤ − − − equality can only hold if PG (s,u ) = PG (s,v ) = 1, which is impossible.

Condition (ii) Symmetry. Let PG (s,u ) (s,u ) denote the connectivity of s and u when −{ } edge (s,u ) is removed from E . Then, we have:

PG (s,u ) = pˆ + PG (s,u ) (s,u ) pˆ PG (s,u ) (s,u ) (by Fact 4.5) −{ } − · −{ } 2 = pˆ + pˆ PG (s,u ),(s,v ) (v,u ) pˆ PG (s,u ),(s,v ) (v,u ) (by Fact 4.4) · −{ } − · −{ } 2 = pˆ + pˆ PG (s,u ),(s,v ) (u ,v ) pˆ PG (s,u ),(s,v ) (u ,v ) · −{ } − · −{ }

= pˆ + PG (s,v ) (s,v ) pˆ PG (s,v ) (s,v ) (by Fact 4.4) −{ } − · −{ } = PG (s,v ) (by Fact 4.5) 4.3. Algorithm for A-DIGCOM 87

Condition (iii) Neighbourhood Next, we show that for any t V s,u ,v , PG (s,t ) < ∈ − { } PG (s,u ). For a subgraph H G , let p(H) denote the probability of observing H from a ⊆ probabilistic graph G . We can write this probability as

E (H) E (G ) E (H) p(H) = pˆ| | (1 pˆ)| |−| | · −

Thus, for any two subgraphs Hi ,Hj G , such that E (Hi ) = E (Hj ) , p(Hi ) = p(Hj ). Let ⊆ | | | | H(s,t ) be the set of subgraphs of G where s and t are connected. We can then write the

probabilities PG (s,t ) and PG (s,u ) as the sum of the probabilities of these subgraphs.

X PG (s,t ) = p(Hi ) Hi H(s,t ) ∈X PG (s,u ) = p(Hj ) Hj H(s,u ) ∈

Observe that these two sets of subgraphs may overlap:

H(s,t ) = H(s,t ,u ) H(s,t ,u¯ ) ∪ H(s,u ) = H(s,t ,u ) H(s,t¯,u ) ∪ where H(s,t ,u ) is the set of subgraphs such that s,t , and u are all connected, while H(s,t ,u¯ ) is the set of subgraphs where s,t belong to the same connected component that doesn’t contain u . Thus, we now focus on H(s,t ,u¯ ) and H(s,t¯,u ). The subgraphs in H(s,t¯,u ) can be partitioned as follows, depending on which component v belongs to:

(i) s,u ,v , t : Vertices s,u ,v belong to the same component, and t belongs to an- { } { } other component.

(ii) s,u , v , t : Vertices s,u belong to the same component, and each of v and t { } { } { } belongs to distinct components.

(iii) s,u , v,t : Vertices s,u belong to the same component, and v,t belong to an- { } { } other component. 88 Chapter 4. Direct PPI Networks from AP-MS Data

It is easy to see that the cases (i) and (ii) are nonempty, since (s,u ),(s,v ) E (G ). ∈ Finally, consider the set of subgraphs in H(s,t ,u¯ ). Let H be a subgraph in this set. Since s and u belong to distinct components, (s,u ) / E (H), and thus (s,v ) E (H) in ∈ ∈ order to have s and t connected. We then make the following operation on H to con- struct H : (1) remove s,v from H, and (2) insert s,u . Then, H has s and u in the same 0 ( ) ( ) 0 component, and v and t in another component. Therefore, H belongs to Case (iii) of 0 H s,t¯,u . Furthermore, note that H and H have the same number of edges. Therefore, ( ) 0 there is a mapping from subgraphs in H(s,t ,u¯ ) to subgraphs in Case (iii) of H(s,t¯,u ) with an equal number of edges. Since the cases (i) and (ii) are nonempty, it follows that

PG (s,t ) < PG (s,u ).

Condition (iv) “Triangle” inequality. Given the probabilistic graph G , we partition the subgraphs of G into four cases depending on the existence of the two edges (s,u ) and (s,v ).

(s,u ),(s,v ) E : occurs with probability pˆ2. • ∈ (s,u ) E ,(s,v ) / E : occurs with probability pˆ(1 pˆ). • ∈ ∈ − (s,u ) / E ,(s,v ) E : occurs with probability pˆ(1 pˆ). • ∈ ∈ − (s,u ),(s,v ) / E : occurs with probability (1 pˆ)2. • ∈ −

We can thus rewrite the probability PG (s,t ) as follows.

2 PG (s,t ) = pˆ (1 pˆ) PG s (u ,t ) + pˆ (1 pˆ) PG s (v,t ) + pˆ PG s (u or v,t ) (4.1) · − · −{ } · − · −{ } · −{ } where PG s (u or v,t ) denotes the probability that u or v is connected to t in the graph −{ } G s . We write PG (u ,t ) and PG (v,t ) similarly: − { }

2 2 PG (u ,t ) = (1 pˆ ) PG s (u ,t ) + pˆ PG s (u or v,t ) (4.2) − · −{ } · −{ } 2 2 PG (v,t ) = (1 pˆ ) PG s (v,t ) + pˆ PG s (u or v,t ) (4.3) − · −{ } · −{ } 4.3. Algorithm for A-DIGCOM 89

Subtracting (4.1) from (4.2), and (4.1) from (4.3) gives:

PG (u ,t ) PG (s,t ) = (1 pˆ) PG s (u ,t ) pˆ(1 pˆ) PG s (v,t ) − − · −{ } − − · −{ } PG (v,t ) PG (s,t ) = (1 pˆ) PG s (v,t ) pˆ(1 pˆ) PG s (u ,t ) − − · −{ } − − · −{ }

If PG S (u ,t ) > pˆ PG S (v,t ), it follows that PG (u ,t ) > PG (s,t ). Otherwise, PG s (u ,t ) −{ } · −{ } −{ } ≤ pˆ PG s (v,t ) implies pˆ PG s (v,t ) < PG s (v,t ), and it follows that PG (v,t ) > PG (s,t ). · −{ } · −{ } −{ } This completes the proof.

These necessary conditions allow us to rule out vertices that cannot have degree greater than 2. As a result, Theorem 4.3 provides a heuristic (Algorithm 4.3.1) for pre- dicting the set of degree 2 vertices as well as the edges incident to them. It is easy to see that Algorithm 4.3.1 runs in time O( V 2), and in practice, our studies have shown that | | vertices satisfying these conditions while having degree higher than two are rare (see Table 4.1).

Unlike the case of degree-1 vertices, we cannot simply remove degree-2 vertices without affecting the remaining entries in the matrix. Therefore, as shown in Algo- rithm 4.3.1, degree-2 vertices and their incident edges are simply marked as such in the solution, but are not removed from the input matrix.

Algorithm 4.3.1: Prediction of degree 2 vertices Input: Probability matrix M and vertex set V Output: Degree 2 vertices V 2 and their incident edges E 2 foreach vertex s V do if s satisfies conditions∈ in Theorem 4.3 then V 2 V 2 s ; ← ∪ { } u ,v the two vertices with M s,u = M s,v = max M s,t : t V s ; E 2 ←E 2 (s,u ),(s,v ) ; { ∀ ∈ − { }} end ← ∪ { } end Return V 2 and E 2 90 Chapter 4. Direct PPI Networks from AP-MS Data

4.3.2 Identification of Densely Connected Regions

We now turn to the problem of finding densely connected regions in the network. These regions may correspond to protein complexes, where tagging any one of the members of the complex results in the identification of all other members with high probability. While correctly predicting the physical interactions within each complex is a difficult task, separating these dense regions from the remainder of the network is key to im- proving the accuracy of the genetic algorithm (Phase III).

In order to build an algorithm for detecting dense regions, we consider the connec- tivity between nodes in an n-clique Kn , which we denote as CliqueConn(n). First, let us recall a classical result in graph enumeration.

Lemma 4.6. [62, 67] Let γn denote the number of connected simple graphs over n labeled vertices. We can count this number by the recurrence:

 1 if n = 1 γn = n 1 Pn 1  n n i   (2)  ( 2− ) 2 n i −1 i i 2 γi otherwise − = · ·

Using this combinatorial result, we can construct the formula of CliqueConn(n) when 1 pˆ = 2 , which is the value we used for our analyses. The formula for other values of pˆ can be formulated using a similar proof.

1 Lemma 4.7. Let G be a complete, probabilistic graph on n vertices with pˆ = 2 . Then, for any two vertices in G , the probability that they are connected is:

n n  X i 2 γi C l iqu eConn(n) = n− n i (2) ( 2− ) i =2 2 −

Proof. We write the probability as follows.

σ2 + σ3 + ... + σn C l iqu eConn(n) = n 2(2) 4.3. Algorithm for A-DIGCOM 91

Algorithm 4.3.2: Detecting dense clusters in the network Input: Probability matrix M , and minimum cluster size k

Output: Set of possible k -cliques in Gd i r e ct . t CliqueConn(k ) ← Gt (V, Et ), where Et = (u ,v ) : u ,v V and M (u ,v ) t foreach← connected component{ S V ∈do ≥ } Find cliques of size k in S ⊆ end Return cliques discovered.

where σi denotes the number of subgraphs of Kn with i vertices, where the two given n  vertices u and v are connected. To compute σi , there are i 2 ways to pick i 2 vertices − − (in addition to u and v ) for the connected components, and γi ways to keep them all

n i connected, and finally, 2( 2− ) ways to assign edges among the remaining vertices. So we have  n  n i σi = γi 2( 2− ), i 2 · · − and finally, we have n n  X i 2 γi C l iqu eConn(n) = n− n i (2) ( 2− ) i =2 2 −

Lemma 4.7 gives rise to a heuristic (see Algorithm 4.3.2) that finds a set of dense re- gions covering the network. While the algorithm guarantees to identify all cliques of size k k contained within G , sets of vertices that do not form a k -clique may also be 0 d i r e ct ≥ reported, provided that they are sufficiently connected among themselves, possibly via vertices outside the dense cluster. However, for sufficiently large values of k , we found this to be a very rare occurrence. While finding cliques in a graph is a computation- ally intensive task in general, the construction of Gt (in Algorithm 4.3.2) for large values of k creates few small connected components and leaves the remaining vertices iso- lated. Therefore, in practice, Algorithm 4.3.2 can be implemented to run in a reasonable amount of time (see Table 4.5(ii)).

Based on the connectivity matrix M , our algorithm identifies (possibly overlapping) 92 Chapter 4. Direct PPI Networks from AP-MS Data

clusters of proteins of size at least k such that, for every pair u ,v in each cluster, M (u ,v ) ≥ tk for some threshold tk . For appropriately chosen values of k and tk , the discovered clusters correspond to cliques in Gd i r e ct with high accuracy (see Section 4.4.6).

The dense regions discovered at this phase provide us with (1) the set of edges within each dense region; and (2) sparse cuts between disjoint dense regions. The edge set within each cluster will be used for initial candidates in the genetic algorithm, whereas the cuts defined by the clusters will be used as crossover points during the crossover operation in the genetic algorithm.

4.3.3 Cut-based Genetic Algorithm

To predict the remaining section of the network, we use a customized genetic algorithm that aims at finding an optimal solution to the A-DIGCOM problem. We first devise a solution to a generalization of the A-DIGCOM problem, and then show how the results of Phase I and Phase II of the algorithm are used to improve the performance.

Genetic algorithms have been shown to be an effective family of heuristics for a wide variety of optimization problems [64], including network design under connec- tivity constraints [38, 39]. A genetic algorithm models a set of candidate solutions as in- dividuals of a population. From this population, pairs of promising candidate solutions are mated, and their offspring solutions inherit properties of the parents with some ran- dom mutations. Over generations, this process of natural selection improves the fitness of the population.

The A-DIGCOM problem is a hard optimization problem, because (i) the size of the

n search space is huge – 2(2) for a graph of size n, and (ii) there is no known polynomial- time algorithm to evaluate a proposed candidate solution (i.e. compute PG from G ). For these reasons, a straight-forward genetic algorithm implementation failed to produce satisfactory results (data not shown). Instead, we use a more sophisticated approach by making use of the results obtained in previous sections in order to reduce the search 4.3. Algorithm for A-DIGCOM 93

space and to guide the mating operations for more effective search.

The genetic algorithm aims to solve a generalization of A-DIGCOM. First, we allow each edge (u ,v ) in the network to survive with a non-uniform probability pˆ(u ,v ), in- stead of one constant probability pˆ over all edges. Secondly, we assume that we are

given two sets of edges EYES and ENO that indicate the set of edges that are guaranteed to be in the solution, and guaranteed not to be in the solution, respectively. This will later allow us to factor in the outcomes of the previous sections. Therefore, the edges whose presence remains to be determined are EM AY BE = EYES ENO . ∪

Encoding of candidate solutions. To represent a candidate solution, we first create a

hash table that maps each putative edge in EM AY BE to an integer. Each candidate is then

encoded as a list of integers (edges). Edges in EYES, which are part of all solutions, are not explicitly listed, in order to save space. Since the networks we consider are sparse

( E =O( V )), such an encoding technique significantly reduces the space requirements. | | | |

Initial population. The initial population of candidates is generated using a preferen-

tial attachment model [10] using the following observations: (i) The average connectiv- 1 P ity of vertex u , avgCon(u ) = V v V u M (u ,v ) is strongly positively correlated with | | ∈ −{ } the degree of u in Gd i r e ct ; (ii) the age of a vertex, measured by when the vertex was in- troduced to the graph, is positively correlated with the degree of the vertex. Therefore, during the generation of each candidate, we choose the next vertex to be added with probability proportional to its average connectivity. This results in a candidate solu-

tion where the degree of most vertices is likely to be close to their true degree in Gd i r e ct . Furthermore, in order to create candidates that are clustered similarly to the true direct interaction graph, we include the set of edges predicted by Algorithm 4.3.2 to each initial candidate. 94 Chapter 4. Direct PPI Networks from AP-MS Data

Fitness function. The fitness of a candidate solution G , fitness(G ) is obtained by first estimating the probability matrix PG using 500 Monte Carlo samples. To compute these samples, we break each edge independently with probability pˆ, and count the number of times each pair of vertices remain connected. Given 500 samples, we then count the number of vertex pairs (u ,v ) whose estimated connectivity PG (u ,v ) is within the tolerance level δ, i.e., M (u ,v ) δ (see Section 4.4.2 on how to choose δ). ±

Crossover. The crossover operation needs to hybridize two parent candidates to pro- duce offsprings preserving the good properties of the parents. This operation will be guided by a randomly chosen balanced cut V = V1 V2. Let G1 and G2 denote the two ∪ parent networks and let E1(Gi ) and E2(Gi ) denote the edges of Gi such that both end- points lie in V1 and V2, respectively. Furthermore, let E1,2(Gi ) denote the edges of Gi that cross from V and V . Mating G and G results in two children G V, E , and 1 2 1 2 0 = ( 0) G V, E such that: 00 = ( 00)

E 0 = E1(G1) E2(G2) (E1,2(G1) E1,2(G2)) ∪ ∪ ∩

E 00 = E2(G1) E1(G2) (E1,2(G1) E1,2(G2)) ∪ ∪ ∩

While choosing a random cut as the crossover point is a reasonable strategy to con- struct a new pair of offsprings, our studies have shown that a planned strategy in choos- ing the crossover points results in better performance with less chance of premature convergence. In particular, if the crossover point is chosen as a dense cut in the parent networks, then the connectivity among vertices within each partition would be deterio- rated significantly. This results in offsprings with much poorer fitness than their parents. On the other hand, if the parents are hybridized at a sparse cut, the connectivity among vertices within each partition is disrupted much less. Therefore, crossover operations are best done by selecting sparse balanced cuts ( V1 V2 ). Finding sparse balanced cuts | | ≈ | | is a well-studied problem in combinatorial optimization, for which various approxima- tion algorithms exist [96, 140]. However, these algorithms assume that the graph itself, 4.3. Algorithm for A-DIGCOM 95

not the connectivity matrix M , is given as input. We therefore use a simple heuristic that avoids cutting through the dense regions of the network. To generate these sparse cuts, we contract each dense region identified in Algorithm 4.3.2 to a single vertex, and then generate a weighted (by the number of vertices in each dense region) balanced partition of the vertices at random.

Mutation. In order to introduce variability to the population of candidates, a small number of edges (5 10%) are randomly inserted or deleted. Moreover, observe that the − child network constructed as above may not remain connected. Aside from the random mutation, therefore, we employ a simple local search that greedily adds edges to keep the network connected.

Genetic algorithm parameter selection. The various parameters of the genetic algo- rithm were selected based on the resulting performance on the Yu et al. data set. Two main parameters that affect the performance significantly are the population size and the selection criteria. For selection criteria, we tested several different selection criteria by setting the probability of choosing a candidate as a parent. The best compromise between running time and accuracy was obtained using a population size of 500 and, a selection probability for a parent proportional to fitness(ci ) minFit, where minFit is the − fitness of the worst candidate in the population.

4.3.4 Restricting the Solution Space for GA

While our genetic algorithm offers a plausible method for the A-DIGCOM problem, one can reduce the size of the solution space, which typically results in faster convergence to better solutions, using the results in Theorem 4.1 and 4.3. First, recall that finding all cut edges decomposes the problem into independent subproblems on 2-edge-connected components. Second, the identification of degree-2 vertices defines two sets of edges

EYES and ENO that constitute all putative edges incident to the identified degree-2 ver- 96 Chapter 4. Direct PPI Networks from AP-MS Data

3 tices. In other words, EM AY BE forms the subgraph of G induced by the set V + of vertices with degree 3. ≥

Furthermore, observe that the edges in EYES form parallel paths between vertices in V 3+. A classical result in network reliability (see Fact 4.5) suggests that these parallel paths can be merged into a single meta-edge whose reliability can be efficiently com- puted. To be more formal, consider the set

1(u ,v ), 2(u ,v ),..., k (u ,v ) {P P P } which is the set of paths between u and v in EYES. These paths can then be replaced by a single edge u ,v with survival probability p u ,v 1 1 p Pi (u ,v ) . By merg- ( ) ˆ( ) = Πi =1...k ( ˆ| |) − − ing every set of parallel paths, we obtain a compact network over V 3+ that efficiently encodes the edges in EYES. Since our genetic algorithm handles the case where the edge survival probability is non-uniform, this compact encoding results in substantial gains in the running time for estimating the fitness of the candidates, as well as in time and space requirements for handling large population sizes. In our applications, this allows us to remove approximately 70 75% of the original set of vertices. −

4.4 Experimental Results

In this section, we provide the details on how our algorithm is tested on various datasets. Then, the accuracy of the predicted direct PPI network is evaluated using simulated data. Finally, our algorithm is tested on a well-known yeast AP-MS dataset to compare against PPI datasets from other sources. 4.4. Experimental Results 97

4.4.1 Randomized Hill-climbing Algorithm

For performance comparisons, we tested our algorithm against a simple randomized hill-climbing approach. In this approach, we start with a randomly chosen spanning tree G 1 of the vertex set V . At the i t h iteration, we first sample the connectivity proba-

i bility PG i of G , using the Monte Carlo simulation. Then, we randomly pick a vertex pair u ,v with probability proportional to

D(u ,v ) P , i ,j V D(i , j ) ∀ ∈ where D(u ,v ) = M (u ,v ) PG i (u ,v ) . If the selected pair u ,v are connected by an edge | − | i i in G , but M (u ,v ) > PG i (u ,v ), then we remove (u ,v ) from G . On the other hand, if u i and v are not connected by an edge, but M (u ,v ) < PG i (u ,v ), then we add (u ,v ) to G . We repeat this local optimization heuristic while making sure the candidate solution remains connected.

4.4.2 Choosing a Tolerance Level δ and Handling Numerical Errors

In order to deal with numerical errors from the Monte Carlo sampling, we use an ad-

ditive tolerance level δ. Note that the sampling process for estimating the probability

matrix PG is a binomial process, which, by the central limit theorem, is closely approxi- mated by a normal distribution. The confidence interval is largest when the estimated probability is equal to 0.5, in which case we obtain a confidence interval of

z 1 α/2 pˆ δ = pˆ − , ± ± 2pn where pˆ denotes the fraction of samples where the two vertices are connected after n

samples, and z 1 α/2 is the z-value for desired level of confidence. Using this formula, we − conclude: 98 Chapter 4. Direct PPI Networks from AP-MS Data

1. When n = 20000 (the input matrix M by sampling the test networks), we obtain a 95% confidence interval of size at most 2 δ = 2 0.007 = 0.014. · ·

2. When n = 500 (the connectivity matrix by sampling each candidate solution in our genetic algorithm), the 95% confidence interval is of size at most 2 δ = 2 0.04 = · · 0.08.

With the chosen tolerance level δ, we modify our algorithm as appropriate each time we compare two connectivity probabilities. For example, in Theorem 4.1, the first con- dition PG (u ,v ) = pˆ is modified to PG (u ,v ) [pˆ δ,pˆ+δ]; and in Theorem 4.3, we modify ∈ − 2 2 the first condition PG (s,t ) < 2pˆ pˆ to PG (s,t ) < 2pˆ pˆ + δ. − −

4.4.3 Generation of Scale-free Networks

In order to generate artificial scale-free networks, we used two generation models: the preferential attachment model, and the duplication model. In the preferential attach- ment model, we evaluated the degree distribution of the two biological networks (GYu and GDIP ) and used the Barabási-Albert algorithm to construct a scale-free network with attachment factor 1.5 (each iteration adds a new vertex with 1 2 edges attached to ex- − isting vertices). In the duplication model, at each iteration, we randomly pick a vertex to duplicate with probability proportional to its degree and randomly drop the duplicated edges with probability at 0.5 in order to fit the degree distributions and sparsity of bio- logical networks. While there are various other random graph models proposed for PPI networks, these scale-free network models provided a sufficiently diverse set of initial population for our genetic algorithm.

4.4.4 Calculation of Connectivity Matrix from Peptide Counts

The peptide count of a prey protein in an AP-MS experiment is the number of different peptides that have been observed by mass spectrometry for that protein. We note that 4.4. Experimental Results 99

the peptide counts are biased towards preys with longer protein sequences, and to rec- tify this propensity, we normalized the abundance data by the protein sequence lengths to obtain the abundance ratios R(i , j ). In order to turn the normalized abundance ratios into the connectivity matrix for our probabilistic graph model, we used a simple logistic function 1 M i , j ˆ ( ) = R(i ,j ) α β − 1 + e − where the parameters α,β are chosen so that the computed distribution of pˆ fits the 2 simulated connectivity distribution of GYu , using a χ test (α = 2.8921,β = 0.6318). In − the cases where R(i , j ) differ from R(j ,i ), we choose the average of the two entries to symmetrize the matrix.

4.4.5 Model Validation

To test our approach, we first sought to validate our model of AP-MS data. To this end, we used one of the most comprehensive yeast AP-MS networks, obtained by Krogan et al. [93]. The dataset reports the Mascot score [114] and the number of peptides detected for each bait-prey pair. The complete set of interactions reported contains 2186 proteins and 5496 interactions (Krogan et al. Table S6); we call the resulting network G K rog anFu l l . The authors identified a subset of these interactions as high-confidence, based on their Mascot scores (Krogan et al. Table S5). We call this set of high-confidence interactions

G K rog anHi g h ; this network consists of 1210 proteins and 2357 interactions. We expect

that G K rog anHi g h is relatively rich in direct interactions, whereas the complete set of in-

teractions G K rog anFu l l consists in part of indirect interactions.

Considering G K rog anHi g h as a direct interaction network, we simulated Monte Carlo

sampling to estimate PGK rog anHi g h , using pˆ = 0.5 and 50,000 samples, which yields a 95%

confidence interval of size at most 0.007 on each PGK rog anHi g h (u ,v ) entry. Next, we normal-

ized the peptide counts of the interactions in G K rog anFu l l using protein lengths. We then

compared PGK rog anHi g h to the normalized peptide counts of the interactions inG K rog anFu l l . 100 Chapter 4. Direct PPI Networks from AP-MS Data

We expect that a significant fraction of low-confidence interactions in G K rog anFu l l − G K rog anHi g h are likely to be indirect interactions. If our model is correct, their peptide

counts should then be correlated with the corresponding entries in PGK rog anHi g h . Indeed, the positive linear correlation between the predicted connectivity PGK rog anHi g h and the ob- served normalized peptide counts is very significant (regression p-value of 8.17 10 11, − × Student t -test). Furthermore, this correlation is strongest when pˆ 0.5, as compared to ≈ pˆ = 0.3 or 0.7, justifying the use of this value in our subsequent analyses.

4.4.6 Accuracy of the Algorithm

The ideal validation of the accuracy of our algorithm would involve (i) constructing a connectivity matrix M using actual quantitative AP-MS data; (ii) predicting direct inter- actions based on M using our algorithm; and then (iii) comparing our predictions to experimentally generated direct interaction data. Yeast 2-Hybrid (Y2H) experiments are less prone to detect indirect interactions than are AP-MS methods, and several large- scale efforts have been reported [53, 74, 137]. Unfortunately, for a number of techni- cal reasons, the overlap between AP-MS PPI networks and Y2H networks remains very small [149]. As a consequence, Y2H data cannot be used directly to validate predictions made on AP-MS data. Instead, we had to rely on partially-synthetic data set, where an actual network of high-quality Y2H interactions is assumed to form the direct interac- tion graph, and a connectivity matrix is generated from it using Monte Carlo sampling, under our model. Two sets of Y2H interactions were used: (i) GYu is the network con- structed from the gold standard dataset of Yu et al. [149]. This network consists of 1090 proteins and 1318 interactions with high confidence of direct interactions; (ii)GDIP is the core, high-quality, physical interaction network of yeast, available at the DIP database, version 20090126CR [146], consisting of 1406 proteins and 1967 interactions.

These biological networks were complemented with two artificial 1000-vertices net- works. The first was generated using the preferential attachment model (PAM) [10]. For the second, we used the duplication model (DM) [13], which, in contrast to the PAM, 4.4. Experimental Results 101

generates graphs containing several dense clusters. The resulting artificial “direct” in- teractions graphs are called GPAM and GDM and contain 1500 2000 interactions each. − We then used the Monte Carlo sampling approach described above to estimate the con-

nectivity matrices PGYu , PGDIP , PGPAM , and PGDM . These will form the input to our inference algorithm, whose output will then be compared to the corresponding direct interaction graph. It is important to note that these input matrices are not perfectly accurate and may contain sampling errors. However, it is easy to bound the size of the errors with high probability and use this as a tolerance level within our algorithm. We also note that the results presented in this section only aim at evaluating the performance of the infer- ence algorithm on input data that was generated exactly according to our probabilistic model. As such, the error rates reported may be considered as lower bounds for those on actual biological data. An assumption-free evaluation is provided later in this section.

Identification of weakly connected vertices

Theorem 4.1 provides an efficient algorithm that guarantees the identification of all cut edges, provided that the given connectivity matrix is precise. We say that a vertex v is a 1-cut vertex if all edges incident on v are cut edges. By applying Theorem 4.1 recursively to detect cut edges and decomposing the graph into two connected components, we can detect and remove all 1-cut vertices from the input connectivity matrix. Table 4.1 (i) reports the number of 1-cut vertices that are detected by the recursive algorithm from Theorem 4.1. In both the Yu and the DIP network, 1-cut vertices constitute approxi- mately 50% of the network, and identifying them allows a significant reduction in the problem size. We note that the inaccuracies in the input connectivity matrices could, in principle, have introduced errors in the detection of cut edges. However, this rare event was never observed on any of our networks.

Algorithm 4.3.1 (see Methods) guarantees to efficiently identify all degree-2 vertices (again, provided that the connectivity matrix is known), but may also incorrectly flag some higher-degree vertices. As seen in Table 4.1 (ii), nearly all degree-2 vertices were 102 Chapter 4. Direct PPI Networks from AP-MS Data

(i) 1-cut vertices (ii) Degree 2 vertices Network Total Remaining real pred. FDR(%) FNR(%) real pred. FDR(%) FNR(%) Yu 1090 552 552 0 0 195 207 7.7 2.05 331 DIP 1406 656 656 0 0 309 326 5.82 0.64 424 PAM 1000 457 457 0 0 351 363 3.58 0.28 180 DM 1000 323 323 0 0 117 126 11.9 5.12 551

Table 4.1: Performance of detecting weakly connected vertices. Number of vertices detected as (i) 1-cut vertices, and (ii) degree 2 vertices in the real network and the predicted network. False discovery (FDR) and false negative (FNR) ratios are given in percentage. Remaining: the number of vertices remaining after identifying 1-cut vertices and degree 2 vertices.

identified, with a low false-discovery rate ranging from 6% to 9%. Moreover, the false- positives incorrectly detected as degree-2 vertices indeed had small degrees, and their predicted neighbours were mostly correct (but incomplete) predictions. Flagging degree- 2 vertices reduces the problem size further by 15% to 36%.

After repeatedly detecting and removing 1-cut vertices and degree-2 vertices from the problem space, the edges adjacent to approximately 70% of the vertices are detected with very low error rate. The remaining vertices only constitute approximately 30% of the original network. We call this remaining subset the hard core of the connectivity matrix. Because it is more densely connected than the rest of the network, the topology of hard core is more difficult to reconstruct.

Running our algorithm on the PAM simulated data yields similar resolution and error rates as on the Y2H networks. However, our DM network is found to be less amenable to these strategies, leaving 55% of vertices unresolved and resulting in an error rate approx- imately twice that seen for other networks. This is simply due to the fact that networks generated by the duplication model do not contain as many 1-cut vertices or degree 2 vertices when compared to other networks, including biological Y2H networks.

Identification of dense regions

Our dense region detection algorithm aims at identifying all edges that belong to a k - clique in GDi r e ct , for a given value of k . Table 4.2 reports the accuracy of the algorithm. 4.4. Experimental Results 103

k 7 k 6 k 5 Network = = = real pred. FD(%) FN(%) real pred. FD(%) FN(%) real pred. FD(%) FN(%) Yu 42 54 22.22 0 66 104 36.53 0 146 308 55.19 5.48 DIP 0 42 100 0 86 112 26.79 4.65 184 266 35.34 6.52 PAM 0 31 100 0 0 45 100 0 0 96 100 0 DM 254 346 28.32 2.36 194 267 29.21 2.57 488 718 34.96 4.30

Table 4.2: Performance of quasi-clique predictions. Number of edges that belong to maximal cliques of size k . Real: actual number of edges that belong to maximal cliques of size k ; pred: predicted number of maximal k -clique edges; FD(false discovery ratio): percentage of false positives in the predicted set; FN(false negative ratio): percentage of false negatives in the real set.

As expected, our algorithm achieves extremely high sensitivity for clique edges. How- ever, the false-discovery rate is quite high, especially for the case of DIP and PAM, k = 7. This is due to the fact that distinguishing a 5-clique from, say, a quasi-clique of size 7 is extremely difficult, causing false-positive predictions. We note however that these er- roneous predictions are mostly inconsequential, as the intra-cluster topology of each dense region shall only be used in generating the initial candidate solutions for the ge- netic algorithm.

Cut-based genetic algorithm

The various parameters of the genetic algorithm (population size, mate selection prob- ability, mutation rate, etc.) were optimized for the running time and accuracy of the so- lution based on GYu . Although our genetic algorithm could in principle be used on any connectivity matrix, running it on the full matrix of > 1000 proteins is impossible: the search space is huge, and the amount of time required to evaluate the fitness of a given candidate solution is too large. However, as discussed previously, applying first the 1- cut and degree-2 vertex detection algorithms significantly reduces the problem size and makes it accessible to our genetic algorithm. Table 4.3 (i) reports the accuracy of the genetic algorithm predictions on the hard core of each connectivity matrix. We note that since the network to be inferred is relatively highly connected, the problem is sig- nificantly more difficult than the identification of 1-cuts and degree-2 vertices. Indeed, the false-discovery and false-negative rates range from 35% to 55% for most datasets. 104 Chapter 4. Direct PPI Networks from AP-MS Data

(i) reduced network (ii) overall network Network real pred. FDR(%) FNR(%) real pred. FDR(%) FNR(%) Yu 563 552 43.65 44.76 1318 1390 14.96 10.31 DIP 931 890 35.50 38.34 1967 2041 17.34 14.23 PAM 473 421 49.88 55.39 1538 1462 16.14 20.28 DM 1138 1295 43.39 35.58 1869 1804 32.81 35.15

Table 4.3: Performance of genetic algorithm and overall algorithm. (i) Performance of the genetic algo- rithm on the reduced network obtained from removing 1-cut vertices and degree-2 vertices. (ii) Overall performance of the combined prediction pipeline on the complete connectivity matrix. Real and pre- dicted describe the number of edges in the real and predicted networks, respectively.

For a comparison, an algorithm that would pick edges randomly would achieve 98.75% false-discovery and false-negative rates.

Combining the three phases of the algorithm, the overall error rate obtained on each data set ranges from 10 to 20% false-discovery and false-negative rates, except for the DM data set, which fares considerably worse, for the reasons explained earlier.

To the best of our knowledge, there has been no other efforts to solve the DIGCOM problems (neither exact, nor approximate version). We thus compared our approach to a simple hill climbing search algorithm on the Yu et al. data set (see Methods). We let this algorithm run over several days (as opposed to few hours spent using our ap- proach), with multiple restarts, and discovered that it provides very poor sensitivity and specificity (see Table 4.4 for the best results obtained). This is not surprising since the hill climbing method is highly dependent on the initial solution (in this case, a spanning tree chosen randomly based on the connectivity matrix) and the search space is simply

Network Real Predicted FDR(%) FNR(%) (i) Hill-climbing 1318 2108 86.67 78.68 (ii) Hill-climbing + weakly conn. nodes 1318 1517 43.57 35.05 (iii) Our approach (GA) 1318 1390 14.96 10.31

Table 4.4: Comparison of our method to simple hill-climbing approach. (i) Accuracy of the hill-climbing approach used over the complete network; (ii) Accuracy of the hill-climbing approach after fixing the weakly connected nodes using our algorithm; (iii) Accuracy of our combined pipeline using the genetic algorithm. 4.4. Experimental Results 105

Network (i) 1-cut & degree 2 vertices (secs) (ii) Quasi-clique predictions (iii) Genetic algorithm Yu 0:00:29 0:31:02 15:00:00 DIP 0:00:41 0:48:14 15:00:00 PAM 0:00:19 0:29:49 15:00:00 DM 0:00:24 1:18:03 15:00:00

Table 4.5: Running times for each phase of our algorithm. (i) Detecting 1-cut vertices and degree two vertices; (ii) Predicting quasi-clique clusters; and (iii) Running the genetic algorithm. We are reporting the average run time over three runs on each network. The implementation was tested on a Powermac G5 2Ghz with 4GB of RAM. Note that the genetic algorithm was run for a fixed amount of time, and the top scoring candidates achieved the quality as shown in Table 3. The times are shown in hh:mm:ss format.

too large to exhaustively search for the a good initial solution. We also tested the hill- climbing approach in the same setting as the genetic algorithm, i.e. combining it with the 1-cut edges and degree-2 vertices detection algorithms. Here, the hill-climbing ap- proach showed a better sensitivity and specificity than the pure hill-climbing approach, but still performed much worse than our genetic algorithm (Table 4.4). Furthermore, the improvement over the pure hill-climbing approach was mostly due to the high sen- sitivity and specificity of our algorithm for detecting weakly connected vertices.

To provide an idea of the running times, Table 4.5 gives the empirical data from our experiments. The first two phases (detecting weakly connected vertices and recognizing dense regions) were run within seconds while the genetic algorithm was run for a fixed amount of time.

4.4.7 Inferring Direct Interactions from AP-MS Experimental Data

In order to apply our algorithm to biological data from AP-MS experiments, we used the raw data reported by Krogan et al. [93] for the 2186 putative interactions of G K rog anFu l l . We only considered the subnetwork of tagged proteins, and further focussed our efforts on the analysis of 77 proteins that are well separated in the tag-induced subnetwork. Quantitative abundance estimates were derived from the peptide counts reported for each prey, and an experimentally derived connectivity matrix M was obtained after nor- malization (see Section 4.4.4). Our full prediction algorithm was then run on the es- 106 Chapter 4. Direct PPI Networks from AP-MS Data

timated connectivity matrix, resulting in a direct interaction graph prediction we call

G K i m that consists of 164 interactions (see Table 4.6). The network G K i m was compared to G K rog anHi g h , the set of high-confidence interactions reported by Krogan et al., and Top to G K rog anHi g h , a subset of G K rog anHi g h consisting of the 164 (to compare against G K i m ) most confident interactions they reported.

Both G K rog anFu l l and G K rog anHi g h overlap G K i m quite substantially. These three sets of predictions were then compared against a set of high-quality binary interactions from

GYu . In Y2H experiments, the interaction partners are separately screened using a ge- netic readout. Therefore, interactions from GYu are believed to be direct, and thus used to test against the predictions from AP-MS data. On the other hand, these interactions may reflect only a subset of all direct interactions among the 77 proteins.

As shown in Figure 4.4, our results show that the high-confidence AP-MS dataG K rog anHi g h exhibited very little overlap with the direct binary interaction set GYu . 72.6% of inter- actions in G K rog anHi g h are disjoint from GYu , and 25% of GYu remains undetected by Top G K rog anHi g h . Furthermore, even the top scoring set of interactions G K rog anHi g h showed high discrepancy ratios against GYu . In contrast, G K i m produced by our algorithm coin- cides with GYu with better sensitivity and specificity. Given the crudeness of the method in translating the AP-MS data into a connectivity matrix, our algorithm has thus per- formed relatively well in predicting direct interactions from real AP-MS data.

4.5 Discussions

Approaches for determining bait-prey abundance remain in their infancy, and to date, no large-scale PPI networks have this type of quantitative data. As these approaches improve in accuracy, so will the results of our method. Furthermore, as the sensitiv- ity of AP-MS pipelines improves, the fraction of indirect interactions detected will also increase, thereby making the ability to distinguish them even more critical. 4.5. Discussions 107

Protein A Protein B Protein A Protein B Protein A Protein B YOL012C YKR048C YNL031C YJL115W YLR418C YIL035C YBL003C YGL241W YNL031C YBL003C YDR054C YGL173C YBL003C YKR048C YNL031C YLL022C YDR054C YLR418C YBL003C YGL207W YNL031C YIL035C YER112W YCR077C YBL003C YIL035C YNL031C YGL207W YER112W YDR432W YGR103W YIL035C YNL031C YER164W YMR229C YDR432W YGR103W YMR229C YOR039W YER164W YOR206W YMR229C YGR103W YOR272W YER164W YGL207W YHL034C YOR310C YGR103W YKR048C YBL002W YOL012C YHL034C YDR432W YGR103W YNL189W YBL002W YDR224W YHL034C YLR175W YGR103W YLR347C YBL002W YKR048C YHL034C YMR229C YMR304W YGR159C YBL002W YBR245C YHL034C YGL173C YMR304W YDR432W YBL002W YGL207W YOL076W YGR159C YMR304W YLR197W YBL002W YIL035C YOL076W YDR432W YMR075W YBL002W YLL022C YPL001W YNL088W YGR159C YMR075W YDR432W YBR010W YBR245C YCR077C YDR432W YMR075W YNL189W YBR010W YGL207W YER146W YCR077C YOL004W YAL034C YBR010W YDR432W YHR064C YGL173C YOL004W YMR075W YBR010W YKR048C YOR048C YDR432W YOL004W YDR432W YBR010W YPL001W YGL241W YKR048C YOL004W YIL035C YBR010W YBL002W YDR225W YGL207W YOR310C YGL120C YBR010W YLL022C YDR225W YKR048C YOR310C YNL189W YCR060W YGR159C YDR225W YGL241W YLR455W YOR310C YJR041C YNL189W YGL207W YOR039W YLR455W YOR206W YJR041C YDR432W YGL207W YGL244W YLR455W YNL088W YJR041C YMR125W YGL207W YIL035C YLR455W YMR229C YJR041C YPL178W YGL207W YBR279W YLR455W YDR496C YNL209W YDR432W YGL207W YOL145C YBR009C YPL001W YNL209W YER146W YGL207W YNL088W YBR009C YGL207W YNL209W YKR048C YGL207W YGL173C YNL030W YPL001W YGL244W YLR418C YGL207W YOR061W YNL030W YBR009C YGL244W YOL145C YDR224C YGL207W YNL030W YLL022C YOR123C YGL207W YDR224C YKR048C YNL030W YGL207W YOR123C YGL244W YGL120C YGR159C YKR024C YMR229C YDR234W YHR064C YGL120C YDR432W YKR024C YGL173C YOR061W YOR039W YMR128W YGL120C YBR103W YDR432W YIL035C YBR009C YGR272C YHL034C YDL229W YBR169C YIL035C YDR172W YGR272C YDR432W YDL229W YMR229C YIL035C YOR061W YMR014W YDR432W YDL229W YGL207W YIL035C YER164W YMR014W YGR272C YDL229W YDR432W YPL153C YGR159C YNL147W YGL173C YDL229W YKR048C YPL153C YNL189W YDR432W YDL060W YDL229W YGR159C YJL115W YOR061W YGL147C YDR432W YDL229W YHR064C YJL115W YIL035C YPL178W YLR347C YOL145C YBR279W YJL115W YPL153C YPL178W YMR125W YOL145C YDR172W YLR410W YGR159C YGR159C YOR310C YOL145C YOR123C YLR410W YKR048C YML062C YNL189W YBR279W YLR418C YLR197W YOR310C YML062C YGR159C YBR279W YIL035C YLR197W YLR347C YMR125W YNL189W YBR279W YGL244W YDL040C YMR229C YNL139C YLR347C YBR279W YOR123C YDL040C YLR175W YNL139C YNL189W YDR365C YMR229C YDL040C YLR197W YNL189W YLR347C YDR365C YGR159C YLR418C YGL207W YDR289C YNL189W YNL031C YPL001W YLR418C YOR123C YDR289C YLR347C YNL031C YBR009C YLR418C YNL189W

Table 4.6: The set of direct interactions among 77 yeast proteins, predicted by our algorithm. The connec- tivity matrix is generated from normalized peptide counts, and then our algorithm is run to predict the direct interactions. 108 Chapter 4. Direct PPI Networks from AP-MS Data

332 164 164 121 121 121

31 90 242 42 79 85 15 106 58

T op GY u GKroganHigh GY u GKroganHigh GY u GKim

Figure 4.4: Inferring direct interactions from actual AP-MS dataset. Overlap between the Y2H interaction network of Yu et al. and various AP-MS-based networks: (a) High-confidence set of interactions from Kro- gan et al. (b) The set of 164 highest scoring interactions from Krogan et al. (c) The set of 164 interactions predicted as direct interactions by our algorithms, based on the AP-MS data from Krogan et al.

In this chapter, we lay the groundwork for modelling indirect interactions in AP- MS experiments. We formulate the DIGCOM problems, which aim at distinguishing direct interactions from indirect ones, and provide a set of theoretical and heuristic approaches that are shown to be highly accurate on both biological PPI networks and simulated networks. Despite the unrealistic assumptions that should be relaxed, our re- sults show that the predicted set of interactions fits the experimental data reasonably well. In addition, applying our algorithms to a large-scale AP-MS data set from Krogan et al. results in predictions that overlap Y2H data approximately 35% more often than the equivalent number of top-scoring interactions reported by these authors.

The DIGCOM problems raise a number of challenging, yet fascinating computa- tional and mathematical problems. Is the solution to the exact DIGCOM problem, if it exists, always unique? We suspect it is. What is the computational complexity of the exact and approximate DIGCOM problems? We believe they are NP-hard, and possi- bly not even in NP. Are there types of graph substructures, other than those discussed here, that can be unambiguously inferred from PG ? Are there special properties of PPI networks, other than the power-law degree distribution, of which an algorithm can take advantage to make more accurate predictions and/or provide approximation or proba- bilistic guarantees? 4.5. Discussions 109

The model and algorithm proposed here are only a first step toward an accurate de- tection of direct interactions from AP-MS data. Several generalizations and improve- ments are worth investigating. First, the abundance of an interaction is not constant and needs to be modelled more accurately. Second, the strength of all physical inter- actions is non-uniform, and some interactions may be more prone to disruption by the affinity purification process than others. Given sufficient quantitative AP-MS data, one may study a generalization of the DIGCOM problems that aims at identifying not only the set of direct interactions, but also their individual strengths and abundances. While modelling these aspects is in theory possible, the amount and quality of experimental data required is currently unavailable, and the computational complexity of the result- ing problems are likely to be daunting.

Perhaps a more significant limitation of our model is that all direct interactions are assumed to occur simultaneously, though it is clear that certain interactions are either mutually exclusive, or restricted to specific subcellular compartments or conditions. One can investigate approaches to decompose the observed network into a family of simultaneously occurring interactions. In order to do so, complementary experimental data, such as comprehensive protein localization assays or cell cycle expression data, would be required to reduce the space of possible solutions in a biologically meaningful manner.

An additional assumption that may need to be relaxed is the independence of the edge failures, which may not hold in cases where the loss of an interaction between two proteins causes a significant destabilization of the larger complex they belong to. Un- fortunately, in the presence of strong dependencies between edge failures, it becomes almost impossible to distinguish direct from indirect interactions. Nonetheless, it may be possible to at least identify complexes where such dependencies hold, by studying subsets of proteins for which the AP-MS data differs significantly from our model. 110 Chapter 4. Direct PPI Networks from AP-MS Data

4.6 Bibliographic Notes

The results discussed in this chapter have been published in [88]. Chapter 5

Hypergraph Modelling of PPI Data

PPI networks are traditionally modelled as graphs whose vertices represent proteins, and two vertices are joined by an edge if and only if the two corresponding proteins interact with each other. However, many recent discoveries of protein complexes as building blocks of the proteome led us to believe that protein-protein interactions may involve any number of interaction partners [7, 58, 93]. Consequently, various methods have been proposed for detecting protein complexes (e.g. [7, 90, 129]). We discussed some of these methods in detail in Chapter 1.

One of the limitations of the existing approaches is that the identified protein com- plexes are often disjoint, and the sparse inter-cluster bridges remain as binary interac- tions. Some recently proposed methods do allow finding overlapping clusters [1, 7, 98], but most of these approaches are designed to first identify dense clusters, and then try to merge the nearby clusters. While intuitive, such a strategy often fails to find a globally optimum structure, if, for example, the optimization criteria is to minimize the number of clusters. In fact, none of these methods guarantees to find the optimum solution for the problem they are designed for. As a result, finding a comprehensive model for PPI networks as a collection of protein complexes remains unsolved to date.

This chapter is concerned with the problem of modelling binary PPI data as a hyper-

111 112 Chapter 5. Hypergraph Modelling of PPI Data

graph, i.e., a set system where each set represents a protein complex. In order to model this problem, let G = (V, E ) denote the combinatorial graph constructed from the widely available binary PPI data (e.g. Y2H interaction data, or the direct AP-MS data produced from Chapter 4). Then, each protein complex in the PPI network corresponds to a set S of vertices in G , where each pair of vertices in S share an edge between them. In other words, each protein complex corresponds to a clique in G . Our problem of finding a hypergraph (with possibly a small number of sets, or hyperedges) can then be stated as finding a clique cover of minimum cardinality.

(EDGE)CLIQUE COVER. Given a graph G = (V, E ), find a smallest collection Q of cliques in G such that every edge in E belongs to at least one clique in Q.

A similar problem can be defined as a partitioning problem:

(EDGE)CLIQUE PARTITION. Given a graph G = (V, E ), find a smallest collection Q of cliques in G such that every edge in E belongs to precisely one clique in Q.

Therefore, we draw our attention to the clique cover problem, with an application to hypergraph modelling of PPI data. In fact, our theoretical pursuit on the clique cover problem gives rise to efficient algorithms for sparse networks (including PPI networks), where the graph sparsity is measured by various graph parameters.

Related Work. In addition to graph theoretic aspects, the clique cover problem has been studied extensively from the standpoint of computational complexity. In particu- lar, there are numerous studies concerning approximability and fixed-parameter tractabil- ity. Clique cover is NP-hard in general [110], even when the input graph is planar [24] or has bounded degree [71]. Furthermore, Lund and Yannakakis have shown that clique cover is not approximable within a factor of V ε for some ε > 0 unless P = NP [101], | | thereby removing the hope of good approximation algorithms in the general case. In the case of clique partition problem, the problem has also been shown to be NP-complete 5.1. Preliminaries 113

for various restricted classes of graphs [23, 56, 57, 72].

A parameterized problem is fixed-parameter tractable (FPT) if it can be solved in

f (k ) I O(1) time, where f is a computable function depending on some parameter k , · | | independent of the input size I . Recently, Gramm et al. [66] showed that the clique | | cover problem is FPT when the size of the cover is chosen for the parameter k . Similarly,

Mujuni and Rosamond [105] have shown that the clique partition problem is FPT with the output size chosen as the parameter. These algorithms run in polynomial time in the input size but exponential time in the number of cliques in the solution. As a result, these algorithms are well suited for dense graphs where a few cliques can cover the entire graph, but they are not suitable for sparse graphs that require a large number of small cliques in the solution.

In this chapter, we fill the gap by designing efficient algorithms for such sparse graphs. We first introduce some definitions and related results in Section 5.1. Then, in Sec- tion 5.2, we present an exact algorithm for clique cover when the input graph has bounded treewidth. Section 5.3 discusses the problem restricted to planar graphs, and we provide a polynomial time approximation scheme. Finally, we show the performance of our al- gorithm from experimental studies in Section 5.4. In particular, our algorithm shows efficient and practical running time for both real and simulated biological networks. Furthermore, our PTAS for planar graphs shows a clear trade-off between the approxi- mation ratio and the running time when tested against random planar graphs.

5.1 Preliminaries

Throughout this chapter, we focus on the clique cover problem for sparse networks. Where possible, we shall discuss how the algorithms for the clique cover problem can be modified to solve the clique partition problem. Various measures of network sparsity have been proposed in the past, with possibly the best known being treewidth. We recall the definition of tree decomposition from Section 2.2. 114 Chapter 5. Hypergraph Modelling of PPI Data

Definition 5.1. [120] Tree decomposition of a graph G = (V, E ) is a pair (X = Xi i { | ∈ I ,T = (I , F )) where each node i I is associated with a set of vertices Xi V , such that } ∈ ⊆ S (1) Each vertex belongs to at least one node: i I Xi = V. ∈

(2) Each edge is induced by at least one node: (v,w ) E , there is an i I with v,w ∀ ∈ ∈ ∈ Xi .

(3) For each v V , the set of nodes i I v Xi induces a subtree of T . ∈ { ∈ | ∈ }

Then, the width of a tree decomposition (X,T ) is defined as maxi I Xi 1, and the ∈ | | − treewidth of a graph G , denoted tw(G ), is the minimum width over all tree decomposi- tions of G .

An alternative definition of treewidth can be given using k -trees.

Definition 5.2. [3] A k -tree is defined recursively as follows:

(i) The complete graph on k vertices is a k -tree, and

(ii) A k -tree G with n + 1 vertices (n k ) can be constructed from a k -tree H with n ≥ vertices by adding a vertex adjacent to exactly k vertices that form a k -clique in H.

Definition 5.3. A graph is a partial k -tree if it is a subgraph of a k -tree.

Then, the class of partial k -trees is equivalent to the class of graphs with treewidth k . Note that this recursive definition provides a simple algorithm to construct a graph of treewidth k ; we shall use this approach to generate a set of synthetic test data in

Section 5.4.1. In general, it is NP-complete to determine the treewidth of a graph [3]. However, when k is fixed, graphs with treewidth k can be recognized, and width k tree decompositions can be constructed, in linear time [17].

From the empirical studies of our input PPI data (shown in Section 5.4, Table 5.1), the treewidth of PPI networks is often small compared to the network sizes. We thus assume, 5.1. Preliminaries 115

where applicable, that the input graph has a bounded treewidth, and its optimum tree decomposition can be constructed efficiently. Furthermore, for ease of exposition, we assume that the decomposition tree T admits a nice structure as defined below.

Definition 5.4. [92] A tree decomposition (X,T ) is called nice if the tree T is rooted, and for each node i I , one of the following holds: ∈

1. LEAF: node i is a leaf of T , and Xi = 1. | |

OIN 2. J : node i has exactly two children j1 and j2 such that Xi = X j1 = X j2 .

3. INTRODUCE: node i has exactly one child j , and Xi = X j v . ∪ { }

4. FORGET: node i has exactly one child j , and X j = Xi v . ∪ { }

Figure 5.1 illustrates an example of these node types. It is easy to see that if tw(G ) k , ≤ then G also admits a nice tree decomposition of width k , with O(n) tree nodes: given ≤ an arbitrary decomposition tree T , one can repeatedly split each node Xi until all nodes satisfy the conditions above.

Another closely related graph parameter is branchwidth. We recall its definition from Section 2.2.

Definition 5.5. [121] A branch decomposition (T,φ) of a graph G is characterized by a ternary tree1 T , and a bijection φ from the leaves of T onto the edges of G .

Let e be a tree edge in T . Removing e from T partitions into T1 and T2, and this partition induces a partition of edges in G , called an e -separation, associated with the

leaves of T1 and T2. The set of vertices in G that are shared by both G1 and G2 is called the middle-set of e , and the width of this separation is the number of vertices in the middle-set.

Given a branch decomposition (T,φ), the width of this branch decomposition is the maximum width over all e -separations in T , and the branchwidth of G , denoted bw(G ), 1 A tree T is a ternary tree if every non-leaf node has degree 3. 116 Chapter 5. Hypergraph Modelling of PPI Data

Xi ab Xi ab cv c

Xj ab Xj ab c cv

(1) (2)

X i ab cd

Xj1 Xj2 ab ab cd cd

(3)

Figure 5.1: Node types in a nice tree decomposition: (1) introduce node: Xi = X j v ; (2) forget node: ∪ { } X j = Xi v ; (3) join node: Xi = X j1 = X j2 ∪ { } is the minimum width over all branch decompositions. It is well known that the branch-

3 width is closely related to the treewidth of graph [121]: bw(G ) tw(G ) + 1 2 bw(G ). For ≤ ≤ planar graphs, Fomin and Thilikos gave an upper bound on the branchwidth:

Theorem 5.6. [52] For any planar graph G , bw(G ) p4.5n 2.122pn. ≤ ≈

5.2 Clique Cover for Graphs with Bounded Treewidth

In this section, we design a dynamic programming algorithm for finding a minimum clique cover for a graph G where a nice tree decomposition (X,T ) is given. Let k denote the width of that decomposition. First, let us define E (Xi ) to be the set of edges in the subgraph induced by the vertices in Xi . Furthermore, we let Vi denote the union of all vertices in Xi and its descendent nodes. Similarly, let Gi denote the subgraph of G in- duced by the vertices Vi . Finally, for some v V (G ), let δ(v ) denote the set of edges that ∈ 5.2. Clique Cover for Graphs with Bounded Treewidth 117

are incident to v in G .

We shall in fact design an algorithm for a generalization of the clique cover problem, where we are given a subset S of edges that are already covered, i.e., our solution need not cover S, but may use these edges in the cliques. Then, the original clique cover problem is a special case where S = . Since our dynamic programming is formulated around the ; decomposition tree T , we often speak of a subgraph Gi where a certain subset S of edges is already covered, denoted by Gi (S). Now we can define a cost function:

Ci (S) = minimum size of clique cover for Gi where S is already covered.

Then, our final solution is precisely Cr ( ) where r is the root of T . Our dynamic ; programming algorithm will proceed from the leaves of T up to its root, computing, for each node Xi , the value of Ci (S) for every possible subset S of E (Xi ). Depending on the type of node, Ci (S) is computed differently.

LEAFNODE. Suppose i is a leaf node. Then, by the definition of nice tree decomposition,

Xi = 1, and thus Ci (S) = 0 for all S E (Xi ), trivially. | | ⊆

FORGETNODE. Suppose i is a forget node. Then, it has one descendant X j = Xi v for ∪ { } some unique vertex v .

Lemma 5.7. If i is a Forget node, then for any subset S E (Xi ),Ci (S) = C j (S). ⊆

Proof. Note that since Xi X j , the corresponding graphs Gi and G j are the same. Fur- ⊂ thermore, since v is not in Xi , S δ(v ) = . Therefore, S E (Xi ) = S E (X j ) and we have ∩ ; ∩ ∩ Ci (S) = C j (S).

INTRODUCENODE. Suppose i is an introduce node. Then, it has one descendant node X j

such that Xi = X j v for some unique vertex v . Consider an arbitrary clique cover ∪ { } W for Gi (S). Since the cliques in can be partitioned into Qv = cliques containing v and W Qv = Qv , the following recurrence relation holds for each introduce node. W − 118 Chapter 5. Hypergraph Modelling of PPI Data

Lemma 5.8. If i is an Introduce node, then for any subset S E (Xi ): ⊆

Ci (S) = min Qv +C j (S E (Qv )) : Qv is a clique cover for δ(v ) S { | | ∪ − }

Proof. Let be a clique cover for Gi (S). We partition the cover =Qv Qv as defined, W W ∪ and consider the edgeset in Qv . Since these edges are covered by Qv , Qv needs only cover Gi (S Qv ). Moreover, since the cliques in Qv do not contain v , Qv is a cover for − ∪ G j (S Qv ). Therefore, Qv C j (S Qv ), and the lemma follows. − ∪ | | ≥ ∪

To compute Ci (S), we need to consider all possible clique covers, Qv , for δ(v ) S. − Since Xi k , this is the number of ways to partition k vertices, and is given by the k th | | ≤ Bell number B(n) k !2k . ≤

JOINNODE. Finally, suppose i is a join node. Then, it has two children nodes j1 and j2 such that Xi = X j1 = X j2 , and Gi = G j1 G j2 . Therefore, a clique cover for Gi contains ∪ cliques that belong to G j1 or G j2 . We need to ensure not to double count cliques that belong in both G j1 and G j2 .

Lemma 5.9. Let S E (Xi ) be the set of already covered edges, and let R = E (Xi ) S be the ⊆ − edges to be covered. Then,

Ci (S) = min C j1 (S R2) +C j2 (S R1) : R1 R and R2 = R R1 { ∪ ∪ ∀ ⊆ − }

Proof. Assuming S is already covered, let be a minimum clique cover for Gi (S) with W cost Ci (S). By definition, any clique that belongs to both G j1 and G j2 must also belong to

Xi . Thus, the cliques in can be partitioned as =Q1 Q2 Q3, where W W ∪ ∪

Q1 = q q Vj1 and q Xi { ∈ W | ⊆ 6⊆ }

Q2 = q q Vj2 and q Xi { ∈ W | ⊆ 6⊆ } Q3 = q q Xi . { ∈ W | ⊆ } 5.2. Clique Cover for Graphs with Bounded Treewidth 119

Thus, = Q1 + Q2 + Q3 . Now, the edgeset R can be partitioned to R = R1 R2 such |W | | | | | | | ∪ that:

R1 = e R e covered by Q1 or Q3 { ∈ | } R2 = e R e covered only by Q2 = R R1 { ∈ | } −

By definition, Q1 Q3 needs to cover the edges in R1. Furthermore, Q3 needs to cover the ∪ edges in G j1 E (X j1 ), and thus Q1 Q3 C j1 (S R2). On the other hand, the cliques inQ2 − | ∪ | ≥ ∪ only need to cover the edges in R2 together with G j2 E (X j2 ), and thus Q2 C j2 (S R1), − | | ≥ ∪ and the result follows.

Therefore, we can compute Ci (S) for any given subset S E (Xi ). Observe that the ⊆ recurrence relation looks at all possible bipartitions of R. Since the number of edges in

k k  (2) E (Xi ) is at most 2 , we need to check at most 2 different partitions of R. Once the bipartition of R is fixed, it takes constant time to look up the values from C j1 and C j2 .

Note that, while these recurrence relations calculate the size of clique covers, they are also constructive: a little bookkeeping at each node will allow us to construct the optimal cover at the root node.

5.2.1 Running Time of Treewidth-based Algorithm

For each node i I , we compute Ci (S) for every S E (Xi ). Since tw(G ) = k , each node ∈ ⊆ contains at most k vertices. Let ρ denote the maximum number of edges induced in

ρ any node. Then we need to consider 2 cases. Then, for each fixed S E (Xi ), we carry ⊆ out one of the four recurrence relations. Leaf nodes and Forget nodes can be computed in constant time. An Introduce node can be computed in O(B(k ) k ) time, where B(k ) · is k th Bell number. Finally, a Join node takes O(2ρ) to compute. Therefore, the dynamic programming algorithm takes 2ρ max B(k ) k ,2ρ O(n) =O(4ρn) time overall. While ρ · { · }· k  can be as large as 2 in theory, this is rarely the case as shown in our experimental tests 120 Chapter 5. Hypergraph Modelling of PPI Data

(see Section 5.4, Table 5.1).

Theorem 5.10. There is a linear time algorithm for computing minimum clique cover for graphs with fixed treewidth k .

5.2.2 Modifications for Clique Partition

It is straightforward to modify the above algorithm to solve the clique partition problem: instead of assuming that edges already covered can be re-used to form other cliques, we simply delete those edges and solve for remaining edgeset. If we redefine the cost function Ci (S) to be the size of the minimum clique partition for the graph Gi S, the − same recurrence holds for each node type. The only difference is when, at an Introduce node, finding a local solution for δ(v ), we look for a clique partition rather than a cover.

5.3 Planar Clique Cover

In this section, we study the clique cover problem restricted to planar graphs, and present a PTAS for planar graphs. While planar graphs are possibly the most restricted class of interest for the clique cover problem (the largest clique being just K4), the problem re- mains NP-hard [136]. Furthermore, as treewidth is unbounded in planar graphs [68, 120], simply applying our algorithm from Section 5.2 would result in an exponential running time.

Instead, we shall design an exact polynomial time algorithm for planar graphs with bounded branchwidth. As we shall see, the algorithm runs in O(2k n) time with k being the branchwidth, and since bw(G ) p4.5n when G is planar, this would be the first ≤ subexponential algorithm for the clique cover problem on planar graphs.

Then, we will use our exact algorithm to construct a polynomial time approximation scheme. Baker [8] has proposed a divide-and-conquer technique to design approxima- tion schemes for various optimization problems on planar graphs. We will show that her 5.3. Planar Clique Cover 121

technique can be applied to the planar clique cover problem, using our exact algorithm as a subroutine, resulting in a (1 + ε) approximation algorithm.

5.3.1 Clique Cover for Planar Graphs with Bounded Branchwidth

As with its counterpart, treewidth, it is NP-complete to determine if a graph has a branch decomposition of width at most k in general, but this decomposition can be found in linear time when k is fixed. We thus assume that the input graph G is given together

with a branch decomposition (T,φ) of width at most k . Now pick an arbitrary edge e of T , and subdivide it to create a root node r . Then each tree node X is associated with a

subset of edges in E (G ), namely the leaf nodes of the subtree rooted at X. We let E (S) denote the edges in the subgraph induced by a subset of vertices S.

Define the middle-set of X, denoted by mid(X), to be the middle-set of the edge be- tween X and its parent node. Since the root node r has no parent, set mid(r ) = . Then, ; we create a table WX [ ] indexed by a subset of edges as follows: · F

WX [ ] = minimum clique cover for edges in X with already covered. F F

Since mid(X) is a cutset in G , we can paste together solutions from each subproblem

by computing only the entries WX [ ] where is a subset of edges in E (mid(X)). F F Before describing the recurrence relation for this table, we study the middle-set of three adjacent edges. Consider the sphere-cut branch decomposition for planar graphs,

as studied by Dorn et al. [41]; here, each middle-set defines a closed curve (namely, a noose) on the planar embedding of the input graph that intersects only the vertices in

the middle-set. Let X be a tree node with two children nodes X L and XR , and a parent

node XP . The three edges adjacent to X define 3 middle-sets which we denote by OP , OL,

OR for parent edge, left and right child edge, respectively. Since OR (OP OL) = and − ∪ ; OL (OP OR ) = , the vertices of OP OL OR can be partitioned as follows: − ∪ ; ∪ ∪ 122 Chapter 5. Hypergraph Modelling of PPI Data

XP

OP

X OL OR

OL OR

XL XR OP

(a) (b)

Figure 5.2: A sphere cut decomposition at a node X and its planar embedding. (a) A tree node X has three adjacent edges, each of which defines a middle-set that forms a simple curve called a noose; (b) The middle-set of OP is drawn as solid lines. Inside the noose OP are OL and OR . Here, portal vertices P are drawn as red circles, intersection vertices are drawn as blue squares, and symmetric difference vertices are drawn as green diamonds.

Portal vertices P =OL OR OP • ∩ ∩

Intersection vertices I =OL OR P • ∩ −

Symmetric Difference vertices D =OP (P I ) • − ∪

Figure 5.2 gives an example of a sphere cut decomposition at a tree node X.

Lemma 5.11. The table WX [ ] can be calculated as F

WX [ ] = min WX1 [ F2] + WX2 [ F1] : F1 F2 is a partition of E (I P) F { F ∪ F ∪ ∪ ∪ }

Proof. For an arbitrary clique cover for X with already covered, consider the cliques F covering the edges E (I P). Observe that, by planarity of G , any clique intersecting with ∪ I P only contains either vertices of X1 or vertices of X2. Therefore, we can partition ∪ the edges in E (I P) into F1 and F2, where F1 is to be covered by cliques in X1, and ∪ F2 is covered by cliques in X2. For each partition, the solution from the subproblem

WX1 [ F2] together with the solution from WX2 [ F1] gives the solution for WX [ ]. F ∪ F ∪ F Since we consider all possible partitions of E (I P), the lemma follows. ∪ 5.3. Planar Clique Cover 123

The algorithm runs in a bottom-up manner. Observe that to compute each state

of a node, we need to consider all bipartitions of E (I P). Since I P mid(X1), F ∪ ∪ ⊆ E (I P) = O(k ), and thus there are 2O(k ) such partitions. Moreover, to compute for all | ∪ | states , we consider 2O(k ) subsets of edges in E (mid(X)). Altogether, the above dynamic F programming algorithm runs in 2O(k )O(n) time.

Lemma 5.12. There is a linear time algorithm to compute a minimum clique cover for planar graphs with bounded branchwidth.

5.3.2 Modifications for Clique Partition

As with our treewidth-based algorithm, the above algorithm can be easily modified to solve the clique partition problem: rather than assuming is already covered, one can F simply delete those edges and solve for the remaining edges.

5.3.3 Baker’s Technique on Planar Graphs

Baker [8] has proposed a general approach to design approximation algorithms for var- ious NP-hard problems on planar graphs. Here, we show how this technique, together with our exact algorithm in Section 5.3.1, results in a (1 + ε) approximation algorithm.

Baker’s technique is a divide-and-conquer approach, where the input graph is de- composed into layers of subgraphs defined by the distance from a chosen vertex. Ap- plying this technique to the planar clique cover problem, we obtain Algorithm 5.3.1. In short, we pick an arbitrary vertex r , and define the level of each vertex of G r as dis- − { } tance from r . Then, we slice up the graph into layers of subgraphs, where each subgraph is solved independently. The clique covers for the layers are then merged to construct a solution for the entire graph G . The edges on the boundary of each layer are covered by cliques from adjacent layers, which can be bounded to obtain an approximation ratio. See Algorithm 5.3.1 for details, and Figure 5.3 shows a schematic of the approach. 124 Chapter 5. Hypergraph Modelling of PPI Data

Algorithm 5.3.1: PTAS for Planar Clique Cover Input: A planar graph G = (V, E ), 0 < ε < 1 Output: A clique cover C of G . r an arbitrary vertex of G ; foreach← vertex v V r do level(v ) distance∈ \{ from} r ; end ← k 2/ε ; for←i d= 0,1,...,e k 1 do for j = 0,1,2...− do Gi j subgraph of G induced by the vertices at levels j k + 1 through (j +←1)k + i ; Ci j minimum clique cover for Gi j ; end ← end for i = 0,1,...,k 1 do S − Ci j Ci j . end ←

Return a clique cover from C0,...,Ck 1 with the minimum size. { − }

Lemma 5.13. Algorithm 5.3.1 finds a clique cover of weight at most (1 + ε) OPT.

Proof. Let Q denote the optimum solution, and given some congruence class i mod k , ∗ let us assume that the above approximation scheme divides the problem into m pieces. Since the solution for each piece is solved exactly, we have

m m X X Ci = Ci j Qi∗ j , | | j =1 | | ≤ j =1 | |

where Qi∗ j denotes the optimum solution restricted to the graph Gi j . We therefore need to compare the two values Pm Q and Q . To compare these, consider the number j 1 i∗ j ∗ = | | | | of cliques in Q that contain vertices at levels i mod k . For planar graphs, each clique ∗ belongs to at most 2 consecutive levels. Therefore, for at least one value of i , 0 i k , ≤ ≤ there are at most 2 Q cliques containing a vertex at level i mod k . Since these cliques k ∗ d e| | 5.3. Planar Clique Cover 125

(a) (b)

(c) (d)

Figure 5.3: Planar clique cover using Baker’s technique. (a) For each value of 0 i k 1, the graph is sliced into layers; (b) Each layer is treated as an independent subproblem; (c) Each≤ ≤ layer− is solved using our clique cover algorithm on planar graphs with bounded branchwidth; (d) The solutions from the layers are pasted together. The number of edges doubly-covered by cliques from adjacent layers can be bounded to give the approximation ratio.

Pm are double counted in the sum j 1 Qi∗ j , we have = | |

m X 2 Q 1 Q i∗ ( + k ) ∗ i =1 | | ≤ | | and it follows that C 1 2 Q . i ( + k ) ∗ | | ≤ | |

What remains now is an algorithm to compute the clique cover for each Gi j . Note that Lemma 5.12 provides an algorithm for planar graphs with bounded branchwidth, and thus it suffices to show that each Gi j also has bounded branchwidth. Tamaki’s the- 126 Chapter 5. Hypergraph Modelling of PPI Data

orem [132] directly provides an answer.

Definition 5.14. Given a plane-embedded graph G , the face-incidence graph = ( , ) G V E of G consists of vertices from faces of G , and two vertices in are joined by an edge if V V and only if the corresponding faces in G share a vertex.

Theorem 5.15. [132] Given a planar graph G , there is a linear time algorithm that finds a branch decomposition with width at most the radius of the face-incidence graph of G .

Since the face-incidence graph of Gi j has a bounded radius, each Gi j is a planar graph with bounded branchwidth. Therefore, our algorithm from Lemma 5.12 can com- pute the minimum clique cover for the subgraph in each Gi j .

Recall that our dynamic programming algorithm for graphs with branchwidth k ≤ runs in time 2O(k )O(n). Due to Theorem 5.6, directly applying this algorithm to planar graphs gives an exact solution in time 2O(pn)O(n). On the other hand, the PTAS using Baker’s technique involves decomposing the graph into n/k layers, and solving each piece exactly in time 2O(k )O(n/k ). Trying for every congruent classes between 1 and k 1, 2 − n O(k ) O(k ) n the overall algorithm takes k k 2 O(n/k ) 2 O( k ) to obtain a solution of value · d e · ≈ 2 PTAS at most (1+ k )OPT . Therefore, while our provides an approximation with varying degree of approximation ratio, if one wants to get a solution any closer than 1 1 OPT , ( + pn ) one may be better off running our dynamic programming algorithm directly to the graph to obtain an exact solution. This trade-off is clearly shown empirically in Table 5.2.

Theorem 5.16. There is a PTAS for planar clique cover.

5.4 Experimental Results

We have implemented the described algorithms and tested them against both real bi- ological PPI data and simulated data. As a comparison, we considered Gramm et al.’s algorithm [66] for the decision problem version of clique cover: given the size of clique 5.4. Experimental Results 127

cover k as an input parameter, their algorithm works by first applying a set of reduc- tion rules to reduce the problem instance, and then using a search tree algorithm on the reduced instance, in time exponential in k . While their algorithm works well in cases where the solution size is small, it performed poorly against our test data which con- tains several hundreds of cliques. This is mainly because their reduction rules did not significantly decrease the size of our input graphs (especially biological networks): the solutions for the reduced instances would still contain a large number of cliques, result- ing in inefficient running time for the search tree algorithm. As their input parameter is the size of the minimum clique cover, it motivates our development of approaches using edge sparsity for these networks.

5.4.1 Simulated and Biological Networks

Each of our algorithms was tested on both simulated and actual biological networks.

First, Krogan et al. [93] obtained an extensive dataset on yeast protein interactions. Tak- ing the largest connected component from the dataset, with 323 vertices and 742 edges, we formed a model network, G K rog an . Various studies have shown that PPI networks ex- hibit the properties of scale-free networks [10]. Many generative models for scale-free networks have also been proposed, and we used the two most frequently used mod- els [10, 32] to create test graphs for our algorithm: (1) preferential attachment model

(denoted by GPAM ), and (2) duplication model (denoted by GDM ). In both cases, we set the parameters of generative models so that the resulting networks show similar charac- teristics to that of real PPI data; density of E 2 V , and degree distribution P k k γ ( ) − | | ≈ | | ∼ where γ 1.7. ≈ To investigate the behaviour of our algorithms on denser graphs, we generated a set of partial k -trees. Recall that a k -tree is a maximal graph with treewidth k such that no edge can be inserted without increasing its treewidth, and a graph is a partial k -tree if it is a subgraph of a k -tree. The partial k -trees of given treewidth have been generated by first generating a k -tree, and randomly removing edges to obtain desired edge density. 128 Chapter 5. Hypergraph Modelling of PPI Data

n m tw ρ tree decomp. clique cover algo. # of cliques (hh:mm) (hh:mm) Krogan 323 742 11 21 2:13 6:08 482 PAM 300 634 14 29 2:41 9:21 421 DM 300 794 21 43 4:13 17:47 583

Table 5.1: Performance of treewidth-based exact clique cover algorithm; we show the treewidth of each graph, maximum # of edges per tree node (ρ), time taken to compute an optimal tree decomposition and a clique cover, and size of the solution.

Since the generation process of k -trees is similar to that of the preferential attachment model, this allows us to create graphs with higher density than GPAM while preserving low treewidth.

5.4.2 Performance of the Treewidth and Branchwidth-based Algorithms

Table 5.1 reports results obtained on real and simulated biological networks. Both Kro- gan’s PPI network and simulated networks exhibit relatively low treewidths for their size. Figure 5.4 gives a more comprehensive view of the running times from empirical testing. As expected, the running time increases linearly with n for graphs with fixed treewidths (Figure 5.4(a)), but exponentially with treewidth k for graphs with fixed n (Figure 5.4(b)). Running times for partial k -trees (Figures 5.4(c) and (d)) follow the same trends, al- though they are somewhat higher due to the higher edge density.

While our branchwidth-based exact algorithm was designed for planar graphs, it is easy to modify the algorithm to handle non-planar graphs with bounded branch- width (but at the expense of higher time complexity due to non-planarity). Figure 5.5 shows that, in practice, the running time of the treewidth-based algorithm grows slower than that of the branchwidth-based algorithm, allowing it to handle graphs with larger treewidths. 5.4. Experimental Results 129

80,000 80,000 tw=12 n=300 70,000 70,000 n=250 60,000 60,000

50,000 50,000 n=200 40,000 tw=11 40,000 n=150 30,000 30,000

Time ( seconds ) ( seconds Time 20,000 tw=10 ) ( seconds Time 20,000 n=100 tw=9 10,000 10,000 n=50 tw=8

0 0 50 100 150 200 250 300 8 9 10 11 12 # of vertices Treewidth (a) (b)

160,000 160,000

tw = 12 140,000 140,000 n = 200 120,000 120,000 n = 150 100,000 100,000 tw = 11 80,000 80,000

60,000 60,000 n = 100 Time ( seconds ) ( seconds Time Time ( seconds ) ( seconds Time 40,000 tw = 10 40,000 n = 50 20,000 tw = 9 20,000 tw = 8 0 0 40 80 120 160 200 8 9 10 11 12 # of vertices Treewidth (c) (d)

Figure 5.4: Performance of the treewidth-based algorithm on simulated networks: (a) scale-free networks (PAM) with fixed treewidth; (b) scale-free networks (PAM) with fixed graph size; (c) partial k -trees with fixed treewidth; (d) partial k -trees with fixed graph size. 130 Chapter 5. Hypergraph Modelling of PPI Data

17 17

16 bw-based 16 15 bw-based

14 15

13 y = 1.4219x - 0.2377 R² = 0.9914 14 y = 1.3651x + 1.3206 R² = 0.9838 log2 ( time ) log2 ( time ) tw-based 12 tw-based 13

11 y = 1.0604x + 0.8255 R² = 0.9789 12 y = 0.8246x + 4.5411 R² = 0.9959 10

8 9 10 11 12 8 9 10 11 12

Treewidth Treewidth

(a) (b)

Figure 5.5: Performance comparison of treewidth-based algorithm vs. branchwidth-based algorithm on scale-free networks; y -axes are shown in log scale to exemplify the difference in exponents in the running time. (a) scale-free networks with 50 vertices; (b) scale-free networks with 100 vertices; data points with the same treewidth values in each chart are taken from the same network.

5.4.3 Performance of PTAS for Planar Graphs

To test our PTAS on planar graphs, we generated a set of random planar graphs us- ing the simple algorithm by Denise and Vasconcellos [40]. Then, we ran both our ex- act branchwidth-based algorithm and the PTAS with varying values of ε to explore the trade-off between running time and quality of the solution. As shown in Table 5.2, the promised approximation ratios are almost exactly realized and substantial speed-ups are obtained for relatively large values of ε, compared to the exact branch-decomposition based algorithm. The running time increases exponentially with 1/ε, and the exact al- gorithm starts to become faster than the PTAS when ε becomes sufficiently small.

5.4.4 Clique Cover in Biological Networks

When executed on the yeast protein-protein interaction network of Krogan et al. [93], our treewidth-based algorithm finds a clique cover that includes 93 cliques of size 5 or 5.5. Discussions 131

ε # of cliques time (hh:mm:ss) 0.5 159 00:12:08 0.2 124 00:39:17 0.1 116 02:36:18 0.05 113 03:53:42 OPT 109 02:46:31

Table 5.2: Performance of PTAS for planar graph with n = 200,m = 407,t w = 10. OPT was obtained using the branch-decomposition based algorithm. more. While the PPI data may not admit a unique clique cover on the network, we man- ually verified that most discovered cliques correspond to known complexes, such as the RNA polymerase II, the RSC complex, the mediator complex, and the 20S proteasome. In addition, three highly overlapping complexes, SWR1 (a chromatin remodelling com- plex), NuA4 (a histone acetyltransferase complex), and INO80 (another chromatin re- modelling complex) are correctly identified, despite the fact that SWR1 and NuA4 share three subunits (ARP4, GOD1, and YAF9) and SWR1 and INO80 share four (ARP4, GOD1, RVB1, RVB2). This suggests that our algorithm is capable of identifying biologically rel- evant protein complexes, even those that share a significant number of subunits.

5.5 Discussions

In this chapter, we studied the clique cover problem on sparse networks as measured by treewidth and branchwidth, with an application to protein-complex discovery in PPI networks. We gave exact polynomial-time algorithms for graphs with bounded treewidth and bounded branchwidth, and built on the latter using Baker’s technique to obtain a polynomial time approximation scheme for planar graphs.

Our empirical studies show that the biological networks as well as synthetic net- works with similar characteristics (e.g. edge density, degree distribution) indeed exhibit low treewidth, and our algorithms showed practical running times on these test net- works. Moreover, our branchwidth based PTAS algorithm shows practical running time for computing solutions close to the optimal. 132 Chapter 5. Hypergraph Modelling of PPI Data

In proteomics research, existing experimental methods for detecting binary interac- tions often suffer from false negatives, i.e., some edges are not detected in the experi- ments. Therefore, in the direction towards hypergraph modelling of PPI networks, one may wish to cover the edges with quasi-cliques: Here, the optimization function needs to be modified slightly. One possible formulation may be to find a minimum cardinal- ity quasi-clique cover, where a quasi-clique is defined by some lower bound on the edge density. On the other hand, there may be classes of graphs other than the ones discussed here that admit polynomial time exact algorithms, for example, graphs with bounded genus or bounded degree.

5.6 Bibliographic Notes

The results discussed in this chapter have been published in [14]. Chapter 6

Conclusion and Future Directions

Researchers in systems biology strive to uncover complex interactions in biological sys- tems, and protein-protein interactions lie at the heart of their efforts. As opposed to the classical reductionist paradigm, systems biologists consider biological phenomena as a complex system, and thus high throughput experimental technologies are vital to their studies. Indeed, the recent development of high throughput techniques has provided a huge momentum – the amount of PPI data available is rapidly growing, and large-scale studies of the human proteome are also underway.

Unfortunately, high throughput technologies remain in their infancy, and produce a large amount of data with relatively poor accuracy. As a result, the accumulating data in the literature presents us with both an opportunity and a challenge. While the analysis of the PPI networks provides invaluable information on the inner workings of the cell, the significant amount of noise within the data makes it difficult to handle. AP-MS is a good example of such a technology that suffers from noisy output data.

Noise in proteomics data can often be classified as two distinct types [60]: stochas- tic errors and systematic errors. Stochastic errors are measurement errors with random variability, which can be reduced by simply repeating the experiment. Systematic er- rors, on the other hand, are recurrent in the measurements, and thus require a careful

133 134 Chapter 6. Conclusion and Future Directions

modelling of every step of the experiment in order to correctly interpret the data.

In this thesis, we considered the overall pipeline of the AP-MS method, and looked for sources of systematic errors. Here we highlight the novel contributions put forward in this thesis, and discuss future research directions.

Protein Quantification with Shared Peptides. Chapter 3 is concerned with protein quantification from quantitative MS data. While significant progress has been made, the problem of protein identification (and quantification) still remains unresolved due to shared peptides prevalent in many MS datasets [108]. We proposed several approaches to protein quantification in the presence of shared peptides by defining a set of opti- mization problems based on a clean combinatorial model.

One of the limitations in our approach comes from the assumption that every pep- tide can be detected by MS equally well – and thus the errors in the peptide abundances are equally likely. While we have empirically tested the robustness of our algorithms by varying the coverage of the data (see Section 3.4.1), an approach that directly models the noisy MS data would likely yield more practical results.

Open Problem 6.1. Does there exist an approach for protein quantification with shared peptides that incorporates peptide detectability?

Peptide detectability may be estimated either using a suitably large MS/MS dataset [2, 133] or using the physical properties of each peptide [99]. In Section 3.5, we discussed possible ways to extend our combinatorial approach by incorporating peptide detectabil- ity as weighted errors. Alternatively, we may also attack this problem using a probabilis- tic framework where observed peptide abundances are thought to be random variables that incorporate detectability. Then, we can look for protein abundances maximizing the probability of observing the given peptide abundances using various Bayesian ap- proaches. 135

Predicting Direct PPI Network. In Chapter 4, we studied the problem of distinguish- ing direct interactions from indirect ones within the PPI data from AP-MS experiments. We tackled this problem by proposing a probabilistic graph model, and formulated the DIGCOM problems to identify the direct interaction network from quantitative PPI data. Our algorithm consists of three main phases: (1) Identify weakly connected vertices (and their neighbours); (2) Discover dense clusters in the network; (3) Identify remaining di- rect interactions via a genetic algorithm.

As discussed in Section 4.5, the DIGCOM problems raise a number of challenging computational problems to investigate further. From the standpoint of computational complexity, the hardness of the problems are not yet known, and we conjecture that they are NP-hard.

Open Problem 6.2. What is the computational complexity of E-DIGCOM and A-DIGCOM?

On the other hand, the probabilistic graph model is built around several assump- tions. For example, the abundance of protein complexes is not constant, and the strength of all physical interactions is non-uniform as some interactions may be more prone to disruption by the affinity purification process than others.

Open Problem 6.3. Given sufficient AP-MS data, can we generalize the DIGCOM prob- lems to handle individual interaction strengths and abundances?

In fact, if we were given the individual interaction strength for each pair of proteins (possibly from a complementary dataset such as protein co-crystallization or physical models), the second and third phases of our algorithm can be easily modified to handle non-uniform interaction probabilities. The first phase of the algorithm would require more careful modifications to handle arbitrary survival probability for each edge.

PPI Networks as Hypergraphs. In Chapter 5, we considered the problem of modelling PPI networks as hypergraphs via the (edge) clique cover problem on the binary PPI net- work. In particular, we devised an exact algorithm for graphs with bounded treewidth 136 Chapter 6. Conclusion and Future Directions

which showed promising performance on both simulated data and real PPI data. Fur- thermore, our theoretical pursuit on clique cover for planar graphs with bounded branch- width resulted in a PTAS for planar graphs.

However, existing experimental techniques for detecting binary interactions often suffer from false negatives, i.e., some edges are missing in the PPI network. Conse- quently, protein complexes that should ideally form cliques instead appear as quasi- cliques or simply dense subgraphs. In the direction towards modelling PPI networks as hypergraphs, therefore, we may wish to relax the assumption that each protein complex forms a clique.

Open Problem 6.4. Does there exist a quasi-clique cover algorithm?

Quasi-cliques can be defined in various ways: one possible definition of a quasi- clique is a dense subgraph with a lower bound on the edge density. Using this defini- tion, one may generalize our algorithm for graphs with bounded treewidth. Following our algorithm, the same type of dynamic programming algorithm can be conceived – however, our approach of finding cliques in the local neighbourhood of a vertex should now be extended to multi-hop neighbours depending on the edge density, resulting in an increase in the running time. On the other hand, one may devise an algorithm spe- cialized for graphs with other properties of PPI networks, for example the scale-freeness of networks.

There are many problems in PPI networks to pursue further studies, and the work presented in this thesis opens the door to several opportunities with important impli- cations in systems biology. One of the main pursuits in the field is a comprehensive understanding of the proteome as a complex dynamic system, and any improvements in tackling the problems discussed here would take us one step closer towards this goal. Bibliography

[1] B. Adamcsek, G. Palla, I. J. Farkas, I. Derényi, and T. Vicsek. CFinder: locating cliques and overlapping modules in biological networks. Bioinformatics (Oxford, England), 22(8):1021–1023, Apr. 2006.

[2] P.Alves, R. J. Arnold, M. V. Novotny, P.Radivojac, J. P.Reilly, and H. Tang. Advance- ment in protein inference from shotgun proteomics using peptide detectability. Pacific Symposium on Biocomputing, pages 409–420, 2007.

[3] S. Arnborg, D. G. Corneil, and A. Proskurowski. Complexity of finding embeddings in a k -tree. Society for Industrial and Applied Mathematics. Journal on Algebraic and Discrete Methods, 8(2):277–284, 1987.

[4] S. Arora. Computational Complexity: A Modern Approach. Cambridge University Press, 1 edition, 2009.

[5] S. Asthana. Predicting protein complex membership using probabilistic network reliability. Genome Research, 14(6):1170–1175, 2004.

[6] J. Azé, T. Bourquard, S. Hamel, A. Poupon, and D. Ritchie. Using Kendall-Tau Meta-Bagging to Improve Protein-Protein Docking Predictions. In Lecture Notes in Bioinformatics 7036, PRIB 2011, pages 284–295, 2011.

[7] G. D. Bader and C. W. V.Hogue. An automated method for finding molecular com- plexes in large protein interaction networks. BMC bioinformatics, 4:2, Jan. 2003.

[8] B. Baker. Approximation algorithms for NP-complete problems on planar graphs. Journal of the ACM, 41(1), 1994.

[9] M. Bantscheff, M. Schirle, G. Sweetman, J. Rick, and B. Kuster. Quantitative mass spectrometry in proteomics: a critical review. Analytical and bioanalytical chem- istry, 389(4):1017–1031, Oct. 2007.

[10] A. L. Barabasi and R. Albert. Emergence of scaling in random networks. Science (New York, N.Y.), 286(5439):509–512, Oct. 1999.

[11] P. L. Bartel, J. A. Roecklein, D. SenGupta, and S. Fields. A protein linkage map of Escherichia coli bacteriophage T7. Nature Genetics, 12(1):72–77, Jan. 1996.

137 138 Bibliography

[12] H. M. Berman, T. N. Bhat, P. E. Bourne, Z. Feng, G. Gilliland, H. Weissig, and J. Westbrook. The protein data bank and the challenge of structural genomics. Nature structural biology, 7 Suppl:957–959, Nov. 2000.

[13] A. Bhan, D. Galas, and T. Dewey. A duplication growth model of gene expression networks. Bioinformatics, 18:1486–1493, 2002.

[14] M. Blanchette, E. Kim, and A. Vetta. Clique Cover on Sparse Networks. In The 9th SIAM Meeting on Algorithm Engineering & Experiments (ALENEX), Kyoto, Japan, 2012.

[15] M. Blatt, S. Wiseman, and E. Domany. Superparamagnetic Clustering of Data. Phys. Rev. Lett., 76:3251–3254, Apr 1996.

[16] H. Bodlaender and T. Kloks. A simple linear time algorithm for triangulating three- colored graphs. Journal of Algorithms, 15(1):160–172, 1993.

[17] H. L. Bodlaender. A linear-time algorithm for finding tree-decompositions of small treewidth. SIAM Journal on Computing, 25(6):1305–1317, 1996.

[18] H. L. Bodlaender, M. R. Fellows, M. T.Hallett, H. T.Wareham, and T.J. Warnow. The hardness of perfect phylogeny, feasible register assignment and other problems on thin colored graphs. Theoretical Computer Science, 244(1-2):167–188, 2000.

[19] A. Breitkreutz, H. Choi, J. Sharom, L. Boucher, V.Neduva, B. Larsen, Z.Y.Lin, B. Bre- itkreutz, C. Stark, G. Liu, J. Ahn, D. Dewar-Darch, Z. Qin, T. Pawson, A. Gingras, A. I. Nesvizhskii, and M. Tyers. Global architecture of the yeast kinome interaction net- work. Science, 2010.

[20] S. Brohée and J. van Helden. Evaluation of clustering algorithms for protein- protein interaction networks. BMC bioinformatics, 7:488, 2006.

[21] H. Brönnimann and M. Goodrich. Almost optimal set covers in fi- nite VC-dimension. Discrete & Computational Geometry, 14:463–479, 1995. 10.1007/BF02570718.

[22] P. C. Carvalho, J. Hewel, V. C. Barbosa, and J. R. Yates. Identifying differences in protein expression levels by spectral counting and feature selection. Genetics and molecular research : GMR, 7(2):342–356, 2008.

[23] M. R. Cerioli, L. Faria, T. O. Ferreira, C. A. J. Martinhon, F.Protti, and B. Reed. Par- tition into cliques for cubic graphs: planar case, complexity and approximation. Discrete Applied Mathematics, 156(12):2270–2278, 2008.

[24] M.-S. Chang and H. Müller. On the tree-degree of graphs. In Graph-theoretic Concepts in Computer Science (Boltenhagen, 2001), pages 44–54, 2001. BIBLIOGRAPHY 139

[25] C. Chekuri, K. L. Clarkson, and S. Har-Peled. On the set multi-cover problem in ge- ometric settings. In Proceedings of the 25th Annual Symposium on Computational geometry, SCG ’09, pages 341–350, New York, NY, USA, 2009. ACM.

[26] J. Chen, W. Hsu, M. L. Lee, and S. Ng. Increasing confidence of protein interac- tomes using network topological metrics. Bioinformatics, 22(16):1998–2004, Aug. 2006.

[27] J. Chen, W. Hsu, M. L. Lee, and S.-K. Ng. Systematic assessment of high- throughput experimental data for reliable protein interactions using network topology. In Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence, ICTAI ’04, pages 368–372, Washington, DC, USA, 2004. IEEE Computer Society.

[28] Q. Cheng, P. Berman, R. Harrison, and A. Zelikovsky. Efficient alignments of metabolic networks with bounded treewidth. In IEEE International Conference on Data Mining Workshops (ICDMW), pages 687–694, 2010.

[29] H. Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, 23(4):pp. 493–507, 1952.

[30] H. Choi, D. Fermin, and A. I. Nesvizhskii. Significance analysis of spectral count data in label-free shotgun proteomics. Molecular & Cellular Proteomics, 7(12):2373–2385, December 2008.

[31] V. Choi. Yucca: an efficient algorithm for small-molecule docking. Chemistry & biodiversity, 2(11):1517–1524, Nov. 2005.

[32] F. Chung, L. Lu, T. G. Dewey, and D. J. Galas. Duplication models for biological networks. Journal of Computational Biology, 10(5):677–687, Oct. 2003.

[33] P. Cloutier, R. Al-Khoury, M. Lavallée-Adam, D. Faubert, H. Jiang, C. Poitras, A. Bouchard, D. Forget, M. Blanchette, and B. Coulombe. High-resolution map- ping of the protein interaction network for the human transcription machinery and affinity purification of RNA polymerase II-associated complexes. Methods (San Diego, Calif.), 48(4):381–386, Aug. 2009.

[34] C. J. Colbourn. The Combinatorics of Network Reliability. Oxford University Press, Inc., 1987.

[35] S. R. Collins, P.Kemmeren, X.-C. Zhao, J. F.Greenblatt, F.Spencer, F.C. P.Holstege, J. S. Weissman, and N. J. Krogan. Toward a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae. Molecular & cellular proteomics : MCP, 6(3):439–450, Mar. 2007. 140 Bibliography

[36] B. Coulombe, M. Blanchette, and C. Jeronimo. Steps towards a repertoire of comprehensive maps of human protein interaction networks: the Human Pro- teotheque Initiative (HuPI). Biochemistry and Cell Biology, 86(2):149–156, Apr. 2008.

[37] T. Dandekar, B. Snel, M. Huynen, and P.Bork. Conservation of gene order: a finger- print of proteins that physically interact. Trends in Biochemical Sciences, 23(9):324 – 328, 1998.

[38] B. Dengiz, F.Altiparmak, and A. Smith. Efficient optimization of all-terminal reli- able networks, using an evolutionary approach. Reliability, IEEE Transactions on, 46(1):18–26, 1997.

[39] B. Dengiz, F. Altiparmak, and A. Smith. Local search genetic algorithm for op- timal design of reliable networks. Evolutionary Computation, IEEE Transactions on, 1(3):179–188, 1997.

[40] A. Denise and M. Vasconcellos. The random planar graph. Congressus Numeran- tium, 113:61–79, 1996.

[41] F. Dorn, E. Penninkx, H. Bodlaender, and F. Fomin. Efficient exact algorithms on planar graphs: exploiting sphere cut branch decompositions. Algorithms–ESA 2005, 3669:95–106, 2005.

[42] B. Dost, N. Bandeira, X. Li, Z. Shen, S. P.Briggs, and V. Bafna. Accurate mass spec- trometry based protein quantification via shared peptides. Journal of Computa- tional Biology, 19, 2012.

[43] B. Dost, T. Shlomi, N. Gupta, E. Ruppin, V. Bafna, and R. Sharan. QNet: a tool for querying protein interaction networks. Journal of Computational Biology, 15(7):913–925, 2008.

[44] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences of the United States of America, 95(25):14863–14868, Dec. 1998.

[45] D. Eppstein. Subgraph isomorphism in planar graphs and related problems. J. Graph Algorithms and Applications, 3(3):1–27, 1999.

[46] D. Eppstein. Diameter and treewidth in minor-closed graph famillies. Algorith- mica, 27:275–291, 2000.

[47] G. Even, D. Rawitz, and S. Shahar. Hitting sets when the VC-dimension is small. Inf. Process. Lett., 95(2):358–362, July 2005.

[48] U. Feige. A threshold of lnn for approximating set cover. J. ACM, 45(4):634–652, 1998. BIBLIOGRAPHY 141

[49] S. Fields and S. O. A novel genetic system to detect protein-protein interactions. Nature, 340(6230):245–6, 1989.

[50] R. L. Finley and R. Brent. Interaction mating reveals binary and ternary con- nections between Drosophila cell cycle regulators. Proceedings of the National Academy of Sciences of the United States of America, 91(26):12980–12984, Dec. 1994.

[51] L. Florens, M. Washburn, J. Raine, R. Anthony, M. Grainger, J. Haynes, J. Moch, N. Muster, J. Sacci, D. Tabb, and et al. A proteomic view of the Plasmodium falci- parum life cycle. Nature, 419(6906):520–526, 2002.

[52] F. V. Fomin and D. M. Thilikos. New upper bounds on the decomposability of planar graphs. Journal of Graph Theory, 51(1):53–81, 2006.

[53] M. Fromont-Racine, J. Rain, and P. Legrain. Toward a functional analysis of the yeast genome through exhaustive two-hybrid screens. Nat Genet, 16(3):277–282, July 1997.

[54] A. Galarneau, M. Primeau, L.-E. Trudeau, and S. W. Michnick. β-Lactamase pro- tein fragment complementation assays as in vivo and in vitro sensors of protein- protein interactions. Nature Biotechnology, 20(6):619–622, June 2002.

[55] J. Gao, G. Opiteck, M. Friedrichs, A. Dongre, and S. Hefta. Changes in the protein expression of yeast as a function of carbon source. J Proteome Res, 2(6):643–649, 2003.

[56] M. R. Garey and D. S. Johnson. Computers and Intractability : A Guide to the Theory of NP-Completeness. WH Freman,1979, 1979.

[57] M. R. Garey, D. S. Johnson, and L. Stockmeyer. Some simplified NP-complete graph problems. Theoretical Computer Science, 1(3):237–267, 1976.

[58] A. Gavin, M. Bösche, R. Krause, P. Grandi, M. Marzioch, A. Bauer, J. Schultz, J. M. Rick, A. Michon, C. Cruciat, M. Remor, C. Höfert, M. Schelder, M. Brajen- ovic, H. Ruffner, A. Merino, K. Klein, M. Hudak, D. Dickson, T. Rudi, V. Gnau, A. Bauch, S. Bastuck, B. Huhse, C. Leutwein, M. Heurtier, R. R. Copley, A. Edel- mann, E. Querfurth, V. Rybin, G. Drewes, M. Raida, T. Bouwmeester, P. Bork, B. Seraphin, B. Kuster, G. Neubauer, and G. Superti-Furga. Functional organi- zation of the yeast proteome by systematic analysis of protein complexes. Nature, 415(6868):141–147, 2002.

[59] L. Y. Geer, S. P. Markey, J. A. Kowalak, L. Wagner, M. Xu, D. M. Maynard, X. Yang, W. Shi, and S. H. Bryant. Open mass spectrometry search algorithm. Journal of proteome research, 3(5):958–964, Aug. 2004.

[60] R. Gentleman and W. Huber. Making the most of high-throughput protein- interaction data. Genome Biology, 8(10):112, 2007. 142 Bibliography

[61] S. A. Gerber, J. Rush, O. Stemman, M. W. Kirschner, and S. P.Gygi. Absolute quan- tification of proteins and phosphoproteins from cell lysates by tandem MS. Pro- ceedings of the National Academy of Sciences, 100(12):6940–6945, 2003.

[62] E. Gilbert. Enumeration of labeled graphs. Canad. J. Math, 8:405–411, 1956.

[63] J. Gilmore, D. Auberry, J. Sharp, A. White, K. Anderson, and D. Daly. A bayesian es- timator of protein-protein association probabilities. Bioinformatics, 24(13):1554– 5, 2008.

[64] D. E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learn- ing. Addison-Wesley Professional, January 1989.

[65] R. Gordân, A. J. Hartemink, and M. L. Bulyk. Distinguishing direct versus indirect transcription factor-DNA interactions. Genome Research, 19(11):2090–2100, Nov. 2009.

[66] J. Gramm, J. Guo, F. Hüffner, and R. Niedermeier. Data reduction and exact algo- rithms for clique cover. ACM Journal of Experimental Algorithmics, 13, 2009.

[67] J. Gross and J. Yellen. Handbook of Graph Theory (Discrete Mathematics and its Applications). CRC, 2003.

[68] R. Halin. S-functions for graphs. Journal of Geometry, 8:171–186, 1976. 10.1007/BF01917434.

[69] E. Hartuv, A. O. Schmitt, J. Lange, S. Meier-Ewert, H. Lehrach, and R. Shamir. An algorithm for clustering cDNA fingerprints. Genomics, 66(3):249 – 256, 2000.

[70] Y. Ho, A. Gruhler, A. Heilbut, G. D. Bader, L. Moore, S. Adams, A. Millar, P. Taylor, K. Bennett, K. Boutilier, L. Yang, C. Wolting, I. Donaldson, S. Schandorff, J. Shew- narane, M. Vo, J. Taggart, M. Goudreault, B. Muskat, C. Alfarano, D. Dewar, Z. Lin, K. Michalickova, A. R. Willems, H. Sassi, P. A. Nielsen, K. J. Rasmussen, J. R. An- dersen, L. E. Johansen, L. H. Hansen, H. Jespersen, A. Podtelejnikov, E. Nielsen, J. Crawford, V. Poulsen, B. D. Sørensen, J. Matthiesen, R. C. Hendrickson, F. Glee- son, T. Pawson, M. F. Moran, D. Durocher, M. Mann, C. W. V. Hogue, D. Figeys, and M. Tyers. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415(6868):180–183, 2002.

[71] D. N. Hoover. Complexity of graph covering problems for graphs of low degree. Journal of Combinatorial Mathematics and Combinatorial Computing, 11:187– 208, 1992.

[72] H. B. Hunt III, M. V. Marathe, V. Radhakrishnan, and R. E. Stearns. The complex- ity of planar counting problems. SIAM Journal on Computing, 27(4):1142–1167 (electronic), 1998. BIBLIOGRAPHY 143

[73] Y. Ishihama, Y. Oda, T. Tabata, T. Sato, T. Nagasu, J. Rappsilber, and M. Mann. Ex- ponentially modified protein abundance index (e mPAI ) for estimation of abso- lute protein amount in proteomics by the number of sequenced peptides per pro- tein. Mol Cell Proteomics, 4(9):1265–1272, 2005.

[74] T. Ito, T. Chiba, R. Ozawa, M. Yoshida, M. Hattori, and Y. Sakaki. A comprehen- sive two-hybrid analysis to explore the yeast protein interactome. Proceedings of the National Academy of Sciences of the United States of America, 98(8):4569–4574, 2001.

[75] T. Ito, K. Tashiro, S. Muta, R. Ozawa, T. Chiba, M. Nishizawa, K. Yamamoto, S. Kuhara, and Y. Sakaki. Toward a protein-protein interaction map of the budding yeast: A comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins. Proceedings of the National Academy of Sciences of the United States of America, 97(3):1143–1147, Feb. 2000.

[76] R.-H. Jan, F.-J.Hwang, and S.-T. Chen. Topological optimization of a communica- tion network subject to a reliability constraint. Reliability, IEEE Transactions on, 42(1):63–70, 1993.

[77] J. Janin, K. Henrick, J. Moult, L. T. Eyck, M. J. E. Sternberg, S. Vajda, I. Vakser, S. J. Wodak, and Critical Assessment of PRedicted Interactions. CAPRI: a Critical As- sessment of PRedicted Interactions. Proteins, 52(1):2–9, July 2003.

[78] L. J. Jensen and P. Bork. Not comparable, but complementary. Science, 322(5898):56–57, 2008.

[79] H. Jeong, S. P. Mason, A. L. Barabasi, and Z. N. Oltvai. Lethality and centrality in protein networks. Nature, 411(6833):41–42, 05 2001.

[80] R. Jin, S. Mccallen, C.-C. Liu, Y. Xiang, E. Almaas, and X. J. Zhou. Identifying dy- namic network modules with temporal and spatial constraints. In Pacific Sympo- sium on Biocomputing, pages 203–214, 2009.

[81] S. Jin, D. Daly, D. Springer, and J. Miller. The effects of shared peptides on pro- tein quantitation in label-free proteomics by LC/MS/MS. Journal of Proteome Re- search, 7(1):164–169, 2008.

[82] R. Jothi and P. T. M. Computational approaches to predict protein-protein and domain-domain interactions. In M. I. and Z. A., editors, Bioinformatics Algo- rithms: Techniques and Applications. Wiley Press, 2008.

[83] V. Kann. On the Approximability of NP-complete Optimization Problems. PhD thesis, Department of Numerical Analysis and Computing Science, Royal Institute of Technology, Stockholm, 1992.

[84] R. Kannan, S. Vempala, and A. Vetta. On clusterings: good, bad and spectral. Jour- nal of the ACM, 51(3):497–515, 2004. 144 Bibliography

[85] M. Karas and F. Hillenkamp. Laser desorption ionization of proteins with molec- ular masses exceeding 10,000 daltons. Analytical Chemistry, 60(20):2299–2301, 1988.

[86] R. M. Karp. Reducibility among combinatorial problems. Complexity of Computer Computations, 40(4):85–103, 1972.

[87] S. M. Khan, B. Franke-Fayard, G. R. Mair, E. Lasonder, C. J. Janse, M. Mann, and A. P.Waters. Proteome analysis of separated male and female gametocytes reveals novel sex-specific Plasmodium biology. Cell, 121(5):675–687, June 2005.

[88] E. Kim, A. Sabharwal, A. Vetta, and M. Blanchette. Predicting direct protein inter- actions from affinity purification mass spectrometry data. Algorithms for Molecu- lar Biology, 5(1):34, 2010.

[89] E. Kim, A. Vetta, and M. Blanchette. Protein quantification with shared peptides using multi cover algorithms. in preparation, 2011.

[90] A. D. King, N. Przulj, and I. Jurisica. Protein complex prediction via cost-based clustering. Bioinformatics (Oxford, England), 20(17):3013–3020, Nov. 2004.

[91] D. S. Kirkpatrick, S. A. Gerber, and S. P. Gygi. The absolute quantification strat- egy: a general procedure for the quantification of proteins and post-translational modifications. Methods, 35(3):265 – 273, 2005.

[92] T. Kloks. Treewidth: Computations and Approximations (Lecture Notes in Com- puter Science). Springer, 1 edition, Sept. 1994.

[93] N. J. Krogan, G. Cagney, H. Yu, G. Zhong, X. Guo, A. Ignatchenko, J. Li, S. Pu, N. Datta, A. P. Tikuisis, T. Punna, J. M. Peregrín-Alvarez, M. Shales, X. Zhang, M. Davey, M. D. Robinson, A. Paccanaro, J. E. Bray, A. Sheung, B. Beattie, D. P. Richards, V. Canadien, A. Lalev, F. Mena, P. Wong, A. Starostine, M. M. Canete, J. Vlasblom, S. Wu, C. Orsi, S. R. Collins, S. Chandran, R. Haw, J. J. Rilstone, K. Gandi, N. J. Thompson, G. Musso, P.St Onge, S. Ghanny, M. H. Y. Lam, G. But- land, A. M. Altaf-Ul, S. Kanaya, A. Shilatifard, E. O’Shea, J. S. Weissman, C. J. Ingles, T. R. Hughes, J. Parkinson, M. Gerstein, S. J. Wodak, A. Emili, and J. F. Greenblatt. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Na- ture, 440(7084):637–643, Mar. 2006.

[94] W. B. Langdon and R. Poli. Foundations of Genetic Programming. Springer, Dec. 2010.

[95] M. Lavallée-Adam, P. Cloutier, B. Coulombe, and M. Blanchette. Modeling Con- taminants in AP-MS/MS Experiments. Journal of Proteome Research, 10(2):886– 895, 2011.

[96] T. Leighton and S. Rao. Multicommodity max-flow min-cut theorems and their use in designing approximation algorithms. Journal of the ACM, 48:787–832, 1999. BIBLIOGRAPHY 145

[97] E. D. Levy and J. B. Pereira-Leal. Evolution and dynamics of protein interac- tions and networks. Current Opinion in Structural Biology, 18(3):349 – 357, 2008. Nucleic acids / Sequences and topology.

[98] M. Li, J. Wang, J. Chen, Z. Cai, and G. Chen. Identifying the overlapping com- plexes in protein interaction networks. International Journal of Data Mining and Bioinformatics, 4(1):91–108, 2010.

[99] Y. F. Li, R. J. Arnold, H. Tang, and P. Radivojac. The Importance of Peptide De- tectability for Protein Identification, Quantification, and Experiment Design in MS/MS Proteomics. Journal of Proteome Research, 9(12):6288–6297, 2010.

[100] H. Liu, R. G. Sadygov, and J. R. Yates. A model for random sampling and estima- tion of relative protein abundance in shotgun proteomics. Analytical chemistry, 76(14):4193–4201, July 2004.

[101] C. Lund and M. Yannakakis. On the hardness of approximating minimization problems. Journal of the ACM, 41(5):960–981, 1994.

[102] P. Mallick, M. Schirle, S. S. Chen, M. R. Flory, H. Lee, D. Martin, J. Ranish, B. Raught, R. Schmitt, T. Werner, B. Kuster, and R. Aebersold. Computational prediction of proteotypic peptides for quantitative proteomics. Nature Biotech- nology, 25(1):125–131, Jan. 2007.

[103] S. W. Michnick. Protein fragment complementation strategies for biochemical network mapping. Current Opinion in Biotechnology, 14(6):610 – 617, 2003.

[104] M. Mitzenmacher and E. Upfal. Probability and Computing: Randomized Algo- rithms and Probabilistic Analysis. Cambridge University Press, Jan. 2005.

[105] E. Mujuni and F. Rosamond. Parameterized complexity of the clique partition problem. In Proc. Fourteenth Computing: The Australasian Theory Symposium (CATS 2008), 2008.

[106] M. Narayanan, A. Vetta, E. E. Schadt, and J. Zhu. Simultaneous clustering of mul- tiple gene expression and physical interaction datasets. PLoS Computational Bi- ology, 6(4):e1000742, 13, 2010.

[107] A. Nesvizhskii. Protein identification by tandem mass spectrometry and sequence database searching. Methods Mol Biol., 367:87–119, 2007.

[108] A. I. Nesvizhskii and R. Aebersold. Interpretation of shotgun proteomic data: the protein inference problem. Molecular & cellular proteomics : MCP, 4(10):1419– 1440, Oct. 2005. 146 Bibliography

[109] W. M. Old, K. Meyer-Arendt, L. Aveline-Wolf, K. G. Pierce, A. Mendoza, J. R. Sevin- sky, K. A. Resing, and N. G. Ahn. Comparison of label-free methods for quanti- fying human proteins by shotgun proteomics. Molecular & Cellular Proteomics, 4(10):1487–1502, 2005.

[110] J. Orlin. Contentment in graph theory: covering graphs with cliques. Indagationes Mathematicae (Proceedings), 80(5):406–424, 1977.

[111] P.Pei and A. Zhang. A topological measurement for weighted protein interaction network. In Computational Systems Bioinformatics Conference, 2005. Proceedings. 2005 IEEE, pages 268–278, 2005.

[112] M. Pellegrini, E. M. Marcotte, M. J. Thompson, D. Eisenberg, and T. O. Yeates. As- signing protein functions by comparative genome analysis: protein phylogenetic profiles. Proceedings of the National Academy of Sciences of the United States of America, 96(8):4285–4288, Apr. 1999.

[113] M. Penrose. Random Geometric Graphs (Oxford Studies in Probability). Oxford University Press, USA, July 2003.

[114] D. N. Perkins, D. J. C. Pappin, D. M. Creasy, and J. S. Cottrell. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis, 20(18):3551–3567, 1999.

[115] P.A. Pevzner, V.Dancík, and C. L. Tang. Mutation-tolerant protein identification by mass spectrometry. Journal of computational biology : a journal of computational molecular cell biology, 7(6):777–787, 2000.

[116] U. Pieles, W. Zürcher, M. Schär, and H. Moser. Matrix-assisted laser desorption ionization time-of-flight mass spectrometry: a powerful tool for the mass and se- quence analysis of natural and modified oligonucleotides. Nucleic Acids Research, 21(14):3191–3196, 1993.

[117] N. Przulj, D. G. Corneil, and I. Jurisica. Modeling interactome: scale-free or geo- metric? Bioinformatics, 20(18):3508–3515, 2004.

[118] O. Puig, F. Caspary, G. Rigaut, B. Rutz, E. Bouveret, E. Bragado-Nilsson, M. Wilm, and B. Séraphin. The tandem affinity purification (TAP) method: a general proce- dure of protein complex purification. Methods (San Diego, Calif.), 24(3):218–229, July 2001.

[119] G. Rigaut, A. Shevchenko, B. Rutz, M. Wilm, M. Mann, and B. Séraphin. A generic protein purification method for protein complex characterization and proteome exploration. Nature Biotechnology, 17(10):1030–1032, 1999.

[120] N. Robertson and P. D. Seymour. Graph minors. II. Algorithmic aspects of tree- width. Journal of algorithms, 7(3):309–322, 1986. BIBLIOGRAPHY 147

[121] N. Robertson and P. D. Seymour. Graph minors. X. Obstructions to tree- decomposition. Journal of Combinatorial Theory. Series B, 52(2):153–190, 1991.

[122] R. Saito, H. Suzuki, and Y. Hayashizaki. Interaction generality, a measurement to assess the reliability of a protein-protein interaction. Nucl. Acids Res., 30(5):1163– 1168, Mar. 2002.

[123] R. Saito, H. Suzuki, and Y. Hayashizaki. Construction of reliable protein-protein interaction networks with a new interaction generality measure. Bioinformatics, 19(6):756–763, Apr. 2003.

[124] M. E. Sardiu, Y. Cai, J. Jin, S. K. Swanson, R. C. Conaway, J. W. Conaway, L. Flo- rens, and M. P. Washburn. Probabilistic assembly of human protein interaction networks from label-free quantitative proteomics. Proceedings of the National Academy of Sciences of the United States of America, 105(5):1454–1459, Feb. 2008.

[125] A. Schrijver. Combinatorial Optimization (3 volume, A,B, & C). Springer, 1 edition, Feb. 2003.

[126] R. Sharan and R. Shamir. CLICK: a clustering algorithm with applications to gene expression analysis. In Proceedings of the Eighth International Conference on In- telligent Systems for Molecular Biology. AAAI Press, Aug. 2000.

[127] S. S. Shen-Orr, R. Milo, S. Mangan, and U. Alon. Network motifs in the transcrip- tional regulation network of Escherichia coli. Nat Genet, 31(1):64–68, 05 2002.

[128] M. E. Sowa, E. J. Bennett, S. P.Gygi, and J. W. Harper. Defining the human deubiq- uitinating enzyme interaction landscape. Cell, 138(2):389–403, July 2009.

[129] V. Spirin and L. A. Mirny. Protein complexes and functional modules in molecular networks. Proceedings of the National Academy of Sciences of the United States of America, 100(21):12123–12128, Oct. 2003.

[130] M. P. H. Stumpf, T. Thorne, E. de Silva, R. Stewart, H. J. An, M. Lappe, and C. Wiuf. Estimating the size of the human interactome. Proceedings of the Na- tional Academy of Sciences, 105(19):6959–6964, 2008.

[131] K. Suhre. Inference of gene function based on gene fusion events. Springer Proto- cols, 396, November 2007.

[132] H. Tamaki. A linear time heuristic for the branch-decomposition of planar graphs. In G. D. Battista and U. Zwick, editors, European Symposium on Algorithms (ESA), volume 2832 of Lecture Notes in Computer Science, pages 765–775. Springer, 2003.

[133] H. Tang, R. J. Arnold, P. Alves, Z. Xun, D. E. Clemmer, M. V. Novotny, J. P. Reilly, and P. Radivojac. A computational approach toward label-free protein quantifi- cation using predicted peptide detectability. Bioinformatics (Oxford, England), 22(14):e481–8, 2006. 148 Bibliography

[134] K. Tarassov, V. Messier, C. R. Landry, S. Radinovic, M. M. S. Molina, I. Shames, Y. Malitskaya, J. Vogel, H. Bussey, and S. W. Michnick. An in vivo map of the yeast protein interactome. Science, 320(5882):1465–1470, 2008.

[135] J. A. Taylor and R. S. Johnson. Sequence database searches via de novo peptide sequencing by tandem mass spectrometry. Rapid Communications in Mass Spec- trometry, 11(9):1067–1075, 1997.

[136] R. Uehara. NP-complete problems on a 3-connected cubic planar graph and their application. Technical report, Tokyo Woman’s Christian University, Tokyo, Sept. 1996.

[137] P. Uetz, L. Giot, G. Cagney, T. A. Mansfield, R. S. Judson, J. R. Knight, D. Lock- shon, V. Narayan, M. Srinivasan, P. Pochart, A. Qureshi-Emili, Y. Li, B. Godwin, D. Conover, T. Kalbfleisch, G. Vijayadamodar, M. Yang, M. Johnston, S. Fields, and J. M. Rothberg. A comprehensive analysis of protein-protein interactions in Sac- charomyces cerevisiae. Nature, 403(6770):623–627, 2000.

[138] L. G. Valiant. The complexity of enumeration and reliability problems. SIAM Jour- nal on Computing, 8(3):410–421, 1979.

[139] S. M. van Dongen. Graph clustering by flow simulation. PhD thesis, University of Utrecht, The Netherlands, 2000.

[140] V. Vazirani. Approximation Algorithms. Springer, 2004.

[141] C. von Mering, R. Krause, B. Snel, M. Cornell, S. G. Oliver, S. Fields, and P. Bork. Comparative assessment of large-scale data sets of protein-protein interactions. Nature, 417(6887):399–403, 05 2002.

[142] A. J. Walhout, R. Sordella, X. Lu, J. L. Hartley, G. F.Temple, M. A. Brasch, N. Thierry- Mieg, and M. Vidal. Protein interaction mapping in C. elegans using proteins in- volved in vulval development. Science (New York, N.Y.), 287(5450):116–122, Jan. 2000.

[143] X. Wang, J. Venable, P.LaPointe, D. M. Hutt, A. V. Koulov, J. Coppinger, C. Gurkan, W. Kellner, J. Matteson, H. Plutner, J. R. Riordan, J. W. Kelly, J. R. Yates, and W. E. Balch. Hsp90 cochaperone Aha1 downregulation rescues misfolding of CFTR in cystic fibrosis. Cell, 127(4):803–815, Nov. 2006.

[144] C. M. Whitehouse, R. N. Dreyer, M. Yamashita, and J. B. Fenn. Electrospray in- terface for liquid chromatographs and mass spectrometers. Analytical Chemistry, 57(3):675–679, 1985. PMID: 2581476.

[145] D. P.Williamson and D. B. Shmoys. The design of approximation algorithms. Cam- bridge University Press, 2011. BIBLIOGRAPHY 149

[146] I. Xenarios, L. Salwínski, X. J. Duan, P.Higney, S. M. Kim, and D. Eisenberg. DIP,the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res, 30(1):303–305, January 2002.

[147] A. Yamaguchi, K. F.Aoki, and H. Mamitsuka. Graph complexity of chemical com- pounds in biological pathways. In Genome Informatics Vol. 14, pages 376–377, 2003.

[148] P. Ye, B. D. Peyser, X. Pan, J. D. Boeke, F. A. Spencer, and J. S. Bader. Gene func- tion prediction from congruent synthetic lethal interactions in yeast. Molecular systems biology, 1:2005.0026, 2005.

[149] H. Yu, P.Braun, M. A. Yildirim, I. Lemmens, K. Venkatesan, J. Sahalie, T. Hirozane- Kishikawa, F.Gebreab, N. Li, N. Simonis, T. Hao, J. Rual, A. Dricot, A. Vazquez, R. R. Murray, C. Simon, L. Tardivo, S. Tam, N. Svrzikapa, C. Fan, A. de Smet, A. Motyl, M. E. Hudson, J. Park, X. Xin, M. E. Cusick, T. Moore, C. Boone, M. Snyder, F. P. Roth, A. Barabasi, J. Tavernier, D. E. Hill, and M. Vidal. High-quality binary protein interaction map of the yeast interactome network. Science, 322(5898):104–110, Oct. 2008.

[150] B. Zhang, B. Park, T. Karpinets, and N. Samatova. From pull-down data to protein interaction networks and complexes with biological relevance. Bioinformatics, 24(7):979–86, 2008.

[151] B. Zhang, N. C. VerBerkmoes, M. A. Langston, E. Uberbacher, R. L. Hettich, and N. F. Samatova. Detecting differential and correlated protein expression in label- free shotgun proteomics. Journal of Proteome Research, 5(11):2909–2918, Nov. 2006.