Mutually Exclusive Weighted Graph Matching Algorithm

for Protein-Protein Interaction Network Alignment

A thesis submitted to

Graduate School

of the University of Cincinnati

in partial fulfillment of the

requirements for the degree of

Master of Science

in the Department of Electrical Engineering and Computer Science

by

Brandan Dunham

Bachelor of Science, Shawnee State University

May 2011

Thesis Advisor & Committee Chair: Dr. Yizong Cheng

Abstract

Motivation: Understanding and analyzing how proteins interact within a species is a topic of interest in various biological studies. These interactions can be assembled into a full network, allowing researchers to view the system of interactions that occurs globally. Given such global networks, one common area of research is comparing the protein interactions produced by different species to determine their similarity.

Results: In this paper, we analyze current alignment methodologies for both a constrained and an unconstrained version of the protein-protein interaction network alignment problem. We also provide a new algorithm, Mutual Exclusion Matching, which scores higher and finds larger alignments than most other algorithms. Finally, we analyze all tested algorithms with a variety of biological and topological measures, showing that Mutual Exclusion Matching produces some of the best alignments found throughout the course of the study, including finding larger connected components, conserving a larger number of edges, and producing some of the highest biologically scoring alignments when compared to other algorithms. Code necessary for running Mutual Exclusion Matching, as well as some example datasets, can be found at: https://sourceforge.net/projects/mutual-exclusion-matching/


Acknowledgments

I would like to thank my advisors: Dr. Cheng, for advising my thesis and providing suggestions on what I needed to do next at various writing steps; Dr. Ralescu, for lectures on artificial intelligence and for always being willing to spend time talking to me about problems I may have had; and Dr. Meller, for teaching me about the biological data and resources that made studying this problem possible.

I would also like to thank Shikha Chaganti, for helping me proofread and review various aspects of this paper. Without her assistance, writing this paper would have been much more difficult. Additionally, I would like to thank her and Wilma Lam for working with me in classes and making my time in Cincinnati more enjoyable.

Finally, I would like to thank my parents, Chris and Dona Dunham, for supporting my decision to go to graduate school and assisting me when I needed help.

Contents

Chapter 1 Introduction 1

1.1 Introduction ...... 1

Chapter 2 History 4

2.1 Problem Background ...... 4

2.2 Terminology ...... 7

2.2.1 Protein-Protein Interaction Network ...... 7

2.2.2 Local and Global Graph Alignment ...... 7

2.2.3 Constrained and Unconstrained Graph Matching Problem ...... 7

2.2.4 Edge Correctness ...... 8

2.2.5 Sequence Similarity ...... 8

2.2.6 Functional Similarity ...... 9

2.2.7 Maximum Common Subgraph ...... 9

2.2.8 Largest Connected Component ...... 9

2.3 Local Aligners ...... 10

2.4 Global Aligners ...... 10

2.5 Global Alignment Algorithms ...... 14

2.5.1 GRAAL ...... 14

2.5.2 HubAlign ...... 14

2.5.3 Ghost ...... 15

2.5.4 Magna ...... 15

2.5.5 Netal ...... 15

2.5.6 Graemlin ...... 16

2.5.7 GEDEVO ...... 16

2.5.8 PINALOG ...... 17

2.5.9 SPINAL ...... 17

2.5.10 PISWAP ...... 17

2.5.11 IsoRank ...... 18

2.5.12 Natalie ...... 18

2.5.13 Belief Propagation ...... 19

2.5.14 Maximum Weighted Matching ...... 19

Chapter 3 Algorithm 20

3.1 Background ...... 20

3.2 IsoRank ...... 21

3.3 First Mutual Exclusion Algorithm ...... 22

3.4 Mutual Exclusion Matching ...... 24

3.5 Example Problem ...... 26

Chapter 4 Experimentation 35

4.1 Datasets ...... 35

4.2 Constrained Results ...... 38

4.3 Unconstrained Results ...... 46

4.4 Biological Comparison ...... 52

4.4.1 Average Biological Results ...... 52

4.4.2 Topological Usefulness ...... 53

4.5 Future Improvements ...... 56

Chapter 5 Conclusion 62

5.1 Conclusion ...... 62

Appendix A 71

List of Figures

3.1 Example Problem: Graph A ...... 29

3.2 Example Problem: Graph B ...... 29

3.3 Example Problem: Weighted Bipartite Matching from A and B ...... 29

3.4 Example Problem: Full constrained matching problem between Graph A and B ...... 30

3.5 Example Problem: Conserved edge constrained matching problem between Graph A and B ...... 31

3.6 Example Problem: Cross Product (S) matrix between Graph A and Graph B. Only showing node pairs with possible conserved edges ...... 31

3.7 Example Problem: Conserved edge constrained matching problem between Graph A and B. 1st iteration ...... 32

3.8 Example Problem: Cross Product (S) matrix between Graph A and Graph B. Only showing node pairs with possible conserved edges. 1st iteration ...... 32

3.9 Example Problem: Cross Product (S) matrix between Graph A and Graph B. Only showing node pairs with possible conserved edges. 2nd iteration ...... 33

3.10 Example Problem: Conserved edge constrained matching problem between Graph A and B. 3rd iteration ...... 34

3.11 Example Problem: Cross Product (S) matrix between Graph A and Graph B. Only showing node pairs with possible conserved edges. 3rd iteration ...... 34

4.1 Scores of the best version of each algorithm, averaged over 11 α values. Scores are shown relative to the upper bound calculated by NetalignMR. PPIs are ordered from largest to smallest by Constrained Edge Product ...... 40

4.2 Scores from the best version of each algorithm, for the 10 largest PPI pairs, averaged over 11 α values. Scores are shown relative to the upper bound calculated by NetalignMR. PPIs are ordered from largest to smallest by Constrained Edge Product ...... 41

4.3 Logarithmically scaled running time to find the best solution for each algorithm, averaged over 11 α values. PPIs are ordered from largest to smallest by Constrained Edge Product ...... 42

4.4 Average EC for each PPI pair. PPIs are ordered from largest to smallest by Constrained Edge Product ...... 44

4.5 Relative percentage of EC to the best found value for each PPI pair. PPIs are ordered from largest to smallest by Constrained Edge Product ...... 44

4.6 Relative value of LCC to the best found value for each PPI pair. PPIs are ordered from largest to smallest by Constrained Edge Product ...... 45

4.7 Average scoring of various algorithms on the unconstrained alignment problem across all α. PPIs are ordered from largest to smallest by Edge Product ...... 48

4.8 Maximal Edge Correctness across all α for the unconstrained alignment problem. PPIs are ordered from largest to smallest by Edge Product ...... 49

4.9 Maximal LCC across all α for the unconstrained alignment problem. PPIs are ordered from largest to smallest by Edge Product ...... 50

4.10 Average number of protein matches found for each algorithm across α values ...... 52

4.11 Average of biological scores across various α ...... 53

4.12 Average of LCC biological scores across various α ...... 54

4.13 Average of LCC size across all α ...... 55

4.14 The number of times each algorithm found the maximum biological similarity between two species for a given α ...... 56

4.15 The number of times each algorithm found the maximum biological similarity between two species for a given α from 0 to 0.9 ...... 57

4.16 Scores produced by aligning Drosophila melanogaster and Schizosaccharomyces pombe 972h by α value ...... 58

List of Tables

4.1 Constrained alignment algorithm settings. All non-listed settings are run at default values ...... 39

4.2 Total time taken by the best run of each of the different algorithms to find a best solution, besttime, and to converge, totaltime, in seconds ...... 43

4.3 Average across all constrained problems of the maximal values found across all α ...... 44

4.4 Average values for biological measures for each constrained problem ...... 46

4.5 Average values for biological measures for each constrained problem within their Largest Connected Components for α values less than 1 ...... 46

4.6 Unconstrained alignment algorithms ...... 48

4.7 Average across all unconstrained problems of the maximal values found across all α ...... 49

4.8 Average values for biological measures for each unconstrained problem ...... 51

4.9 Average values for biological measures for each unconstrained problem within their Largest Connected Components for α values less than 1 ...... 51

4.10 Comparison of α values used for scoring for the constrained problem, y axis, vs α values used for finding the best solution, x axis ...... 60

4.11 Percentage of optimal solution found for each α for the unconstrained problem ...... 60

4.12 Comparison of α values used for scoring the unconstrained problem, y axis, vs α values used for finding the best solution, x axis ...... 61

4.13 Percentage of optimal solution found for the unconstrained problem for each α ...... 61

A.1 Species Tested ...... 71

A.2 Species Pairs by Cross Product Graph Size ...... 72

A.3 Species Pairs by Conserved Cross Product Graph Size ...... 73

Chapter 1

Introduction

1.1 Introduction

One of the most popular uses of computers is to automate large tasks that would be very expensive or infeasible to do by hand. One common task is the processing and comparison of similar data, which is done for various reasons: facial recognition software relies on the processing and comparison of similar images, credit card companies compare previous purchases to find outliers and detect fraud, and web services such as Google use keywords and links to and from various webpages to determine similarity and importance for search algorithms. All of these are tasks which, if done by hand, would be very long processes requiring a great deal of work. A common way to compare the similarities and differences between a group of objects is to make a graph of the main aspects that define them and compare the node and edge structures of the graphs. For this reason, trying to find the similarity between two graphs is a problem of importance in various fields of research. A few of the fields that formulate their comparison problems as graph matching problems include:

• Matching of molecules, comparing their similar subgraphs before and after chemical reactions, using graphs of elements and chemical bonds[1]

• Matching of CAD mechanical images [2]

• Social network customer identification[3]

Given two graphs, the problem of detecting the similarity between them usually revolves around deciding whether one graph is the same as the other, a problem known as graph isomorphism. Graph isomorphism is a well-researched problem whose complexity is believed to fall between P and NP-Complete[4][5][6]. Generalizing the problem further yields the problem of deciding whether a smaller graph is contained within a larger or equally sized graph, a problem known as subgraph isomorphism. Subgraph isomorphism is a known NP-Complete problem, and thus intractable for large graphs[7][8]. Various methods have been used to solve the above problems. Some algorithms attempt to speed up finding a matching by pruning the search space[9][10], while others attempt to reduce the memory used during the computation[10]. Finally, some algorithms solve the problem in polynomial time for specifically defined types of graphs, such as trees[11], partial k-trees[8], and planar graphs[12].

Since the problem of determining whether two graphs match perfectly cannot be solved in a reasonable amount of time in the general case for large graphs, most algorithms rely on some other way to compare two given graphs. One technique revolves around finding, from within a set of graphs, the graph most similar to a new graph[13]. Other techniques involve mathematically reducing the search space to a set of identifying features to be compared[14]. However, one of the main ways to compare the similarity between two graphs is to find a maximal mapping between the graphs, that is, a maximally sized graph that exists as a subgraph within both graphs being compared. Finding the maximal such graph would not only solve the graph isomorphism problem previously stated, but also the more general subgraph isomorphism problem, which is well known to be NP-Complete. While the problem cannot be solved exactly in a reasonable amount of time, algorithms can be created to attempt to find a suboptimal maximum common subgraph (MCS), which gives a measurement of the similarity between a pair of graphs. This methodology has been applied in many areas of study, including computer vision.

The rest of this paper will focus on a new algorithm, Mutual Exclusion Matching, designed to find a good MCS. The algorithm is designed for the problem of aligning protein-protein interaction networks (PPIs). In PPIs, the goal is not only to conserve a large number of edges, but also to align similar nodes using a scoring metric that captures the value of mapping one node to another. While this makes the problem slightly different from the general graph matching problem, the algorithm is given a weight to emphasize node matches versus edges, which can be adjusted to ignore node weighting entirely. Thus, this new problem contains the graph matching problem previously stated as a special case, yielding a problem that is NP-Hard.

Chapter 2

History

2.1 Problem Background

While general graph matching applies to many sciences, this paper will focus on the bioinformatics problem of protein-protein interaction networks (PPIs). PPIs aid in the understanding of how various diseases and organisms function. Species can be analyzed in terms of the proteins they contain and what functions these proteins undertake in relation to the species as a whole. An interaction between proteins is a suggestion that the proteins work together or close to each other for some biological or biochemical event. The detection of these interaction pairs is vital to understanding the protein structure of the organism as a whole. By aligning all proteins with the other proteins they interact with, a protein-protein interaction network is formed, with a full topology showing the interactions (edges) between the various proteins (nodes). These networks, and comparisons of the similarity of multiple networks, have been used for finding different protein functionalities within organisms, finding common protein patterns across different species[15], understanding the phylogenetic relationships and histories of various species[16], metabolic pathway analysis[15], understanding diseases[17], computing function orthologs[16], and the correlation of biological networks with gene expression data[15]. With these structures holding so much information, it is also an important area of research not only to examine them, but to be able to find how closely related two entire protein networks are to each other. This yields the previously defined NP-Hard problem of trying to find the MCS between two networks.

Because of the rapid expansion of the field of bioinformatics, the number of data sampling techniques, the number of different samples taken, and the number of algorithms for predicting protein interactions from given data are all constantly growing, making the ability to quickly match PPIs vital. As previously stated, the problem is similar, but not identical, to the general subgraph isomorphism problem due to the additional information carried by the nodes of the graph. General graphs typically contain no information within their nodes, and can be defined only by the edges they contain. PPIs, however, are a type of Attributed Relational Graph (ARG). In an ARG, each node is labeled with some meaningful knowledge that can suggest which nodes are similar across the networks being compared. For some ARGs, such as chemical compounds, the elements comprising the various networks can be matched by name alone.

Each node in a PPI represents a given protein found within a species. Each protein is encoded by a sequence of nucleotides describing its biological makeup, which can be used to determine the similarity of a pair of nodes. These sequences can be used directly, or converted into a more distinct sequence of amino acids, which in some cases can be better for similarity comparisons. The Basic Local Alignment Search Tool (BLAST)[18] can be used to align the sequences, either nucleotides or amino acids, of two proteins and determine the amount of similarity between them[18]. The BLAST aligner, given a sequence to compare against a database of sequences, produces the information needed for bit scores, which summarize the quality of the alignment between two sequences, normalized bit scores, which represent the bit score normalized relative to the database's search space, and e-values, which represent the expected number of matches at least as good as the one found that would occur by chance alone[19]. The scores given by the aligner can then be used for scoring the matching of a pair of nodes.

The similarity of two nodes can also be determined using Gene Ontology (GO), which describes the different functionalities of given proteins[20]. The Gene Ontology project maintains a tree of known protein information, broken down in a top-down fashion from top-most nodes covering general information to bottom nodes containing more precise information. The hierarchy is broken into three sets covering biological processes, cellular components, and molecular functions, each with a unique root node[20]. Each protein can be mapped to a set of GO terms, and the similarity of each pair of terms can be calculated based on entropy within the tree. The probabilities needed for the entropy calculations can be estimated, from a large example set, as the chance of a given node being the exact node matching a GO term or an ancestor of it. By using GO probabilities, sequence information, or both, the value of matching two nodes can be calculated.

2.2 Terminology

2.2.1 Protein-Protein Interaction Network

A Protein-Protein Interaction network (PPI) is a set of proteins for a given species, with an edge connecting each pair of proteins found to interact with each other in various biological studies. Through one of several biological methods, such as yeast two-hybrid[21] or mass spectrometry[22], interacting proteins within a species are detected and made into a graph, providing a topological overview of the species as a whole.

2.2.2 Local and Global Graph Alignment

PPIs are compared based on the structure of their graphs, as well as the known similarities between the proteins they contain. A local alignment consists of aligning a small subnetwork of each graph with a high similarity. Local alignment algorithms typically find several local alignments independent of one another, creating a many-to-many mapping between the nodes in two networks. A global alignment attempts to align an entire graph onto another graph, creating a one-to-one mapping between the nodes in each network.

2.2.3 Constrained and Unconstrained Graph Matching Problem

When matching weighted graphs, there exists a weighted bipartite graph between the two graphs representing the value of mapping a node in one graph to a node in the other graph. In some cases, only node pairs that have a non-zero value in the bipartite graph are allowed to be paired, forming a constrained alignment problem. In the unconstrained graph matching problem, any node from one graph can be mapped to any node in the other graph, regardless of whether or not the pair appears in the weighted bipartite graph.
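To make the distinction concrete, the following minimal Python sketch (illustrative only; the variable names and helper function are not taken from the thesis code) builds the two candidate-pair sets an aligner would consider under each formulation.

```python
# A minimal sketch of the constrained vs. unconstrained candidate-pair sets.
# W is assumed to be a dictionary mapping (node_a, node_b) to a similarity weight > 0.
from itertools import product

def candidate_pairs(nodes_a, nodes_b, W, constrained=True):
    """Return the node pairs an aligner is allowed to match."""
    if constrained:
        # Only pairs supported by a non-zero bipartite weight may be matched.
        return [(a, b) for (a, b) in W if W[(a, b)] > 0]
    # Unconstrained: any node of A may map to any node of B (weight 0 if absent).
    return list(product(nodes_a, nodes_b))

# Toy usage:
nodes_a, nodes_b = ["a1", "a2"], ["b1", "b2"]
W = {("a1", "b1"): 0.8, ("a2", "b2"): 0.3}
print(candidate_pairs(nodes_a, nodes_b, W, constrained=True))   # 2 pairs
print(candidate_pairs(nodes_a, nodes_b, W, constrained=False))  # 4 pairs
```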

2.2.4 Edge Correctness

The edges conserved between two networks, also known as a conserved pathway, can be measured by the percentage of the edges from the smaller network mapped to the larger network. This measure is known as edge correctness. Alternatively, edge scoring can be computed based on the number of edges connected to the proteins in the mapping. Penalizing the number of edges connected to proteins used from the larger network but not in the mapping can be done with a measure known as Induced Conserved Structure (ICS), while penalizing edges connected to proteins used from both networks that do not exist in the mapping can be done using a measure known as Symmetric Substructure Score (S3). In the case of a constrained matching problem, a conserved edge forms a square, containing one edge from each graph, the conserved edge, as well as the two edges matching the two nodes from each graph to each other. Because of this, the number of edges matched in the constrained problem is sometimes referred to as the sum of squares.

2.2.5 Sequence Similarity

Sequence similarity is the similarity calculated between two proteins based on their sequence structure. This can be done by comparing nucleotides or amino acids using the Basic Local Alignment Search Tool (BLAST)[18]. This tool uses a heuristic local alignment algorithm, related to minimal edit distance computation, to compute the similarity of the sequences.

2.2.6 Functional Similarity

Functional Similarity is the calculated similarity between two proteins based on the gene functions and relationships they are known to be involved with. Gene Ontology[20] (GO) provides a hierarchy of gene functions. The similarity of two gene functions can be determined based on their distances from one another on the hierarchy, or based on the distance of their lowest common parent from the top of the hierarchy.

2.2.7 Maximum Common Subgraph

The best-scoring graph that is also a subgraph of both graphs involved is known as the maximum common subgraph (MCS). In the case of general graphs without scores on the nodes, this is the graph with the most edges that is still contained within both graphs to be aligned. With weighted nodes, it represents the graph common to both graphs being compared that maximizes a given scoring function. The problem of finding this graph is NP-Complete in the general graph case, and NP-Hard in the case with weighted nodes.

2.2.8 Largest Connected Component

The Largest Connected Component (LCC) of a graph is the portion of the graph with the most nodes in which every node is connected to every other node by a pathway of conserved edges. It is typically measured by the number of nodes it contains.

2.3 Local Aligners

Because of the typical size of PPIs and the NP-Hard nature of finding an optimal solution, most algorithms originally only attempted to find several local solutions to the problem. Pathblast[23] is one such algorithm, which compares two sets of PPIs, looking for conserved pathways (sets of conserved, connected edges) between them. Other algorithms, such as NetworkBlast[24] and MaWiSh[25], also attempt to find several small subgraph alignments between a set of graphs, with nodes being used in multiple subgraphs, creating many-to-many matchings. These algorithms were at the forefront of a field that attempted to uncover conserved pathways and molecular structures from protein interaction networks.

While the problem of local network alignment is still useful for finding small, conserved networks between multiple species, several more modern algorithms try to find global matchings between networks for a more general similarity comparison between entire organisms.

2.4 Global Aligners

Global alignments are similar to the local alignments found by the algorithms above, but with the constraint that each node be mapped to at most one node from the other network, forming a one-to-one matching. Ideally, this will yield a large connected component (LCC), a large subgraph common to both networks with all nodes connected by a single pathway of edges, that contains most of the nodes of the smaller network.

To analyze global network alignments, the first step is to define a fitness function that determines how well the alignments are matched, and thus what should be optimized. The majority of algorithms for comparing global alignments between PPIs use a general scoring formula of fitness for a matching. Given two PPIs P1 = (V1, E1) and P2 = (V2, E2), and a given alignment f : V1 → V2 with inverse f⁻¹ : V2 → V1, a one-to-one mapping of V1 to V2 such that each node from one graph is mapped to no more than one node of the other, let f(E1) represent the edges (u, v) ∈ E1 that also exist as (f(u), f(v)) ∈ E2. Given all of the above, a fitness score can be calculated from

Score = \sum_{n \in V_1} w(n, f(n)) \cdot a + |f(E_1)| \cdot b \qquad (2.1)

where an edge contributes 1 (or a nonzero edge weight) to |f(E1)| if and only if (u, v) ∈ E1 and (f(u), f(v)) ∈ E2, w returns the weight given by the biological probability calculated for matching the node n from V1 to f(n) from V2, and a and b are weights emphasizing biological information, a, or topological matching, b, such that a + b = 1. The values a and b are user input variables, where a = 1 would give the bipartite matching problem solvable by the Hungarian algorithm in O(N³), while b = 1 would yield the maximum common subgraph problem, which is NP-Complete. Weights from interactions are typically normalized between 0 and 1, while edges are typically worth 1 each, although this can be easily customized in several algorithms.
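As a concrete illustration of Equation 2.1, the short Python sketch below computes the fitness of a given one-to-one mapping; the data structures (dicts and frozenset edges) are assumptions made for the example, not the representation used by any particular aligner.

```python
# Sketch of the fitness score in Equation 2.1: a weighted sum of node similarity
# and conserved edges, with a + b = 1.
def alignment_score(f, w, edges1, edges2, a=0.5, b=0.5):
    """f: dict mapping V1 -> V2 (one-to-one); w: dict of (u1, u2) -> weight;
    edges1, edges2: sets of undirected edges stored as frozensets."""
    node_term = sum(w.get((u, f[u]), 0.0) for u in f)
    conserved = sum(1 for e in edges1
                    if all(u in f for u in e)
                    and frozenset(f[u] for u in e) in edges2)
    return a * node_term + b * conserved

# Toy usage: both edges are conserved, plus the mapped node weights.
edges1 = {frozenset(("a1", "a2")), frozenset(("a2", "a3"))}
edges2 = {frozenset(("b1", "b2")), frozenset(("b2", "b3"))}
f = {"a1": "b1", "a2": "b2", "a3": "b3"}
w = {("a1", "b1"): 1.0, ("a2", "b2"): 0.5, ("a3", "b3"): 0.2}
print(alignment_score(f, w, edges1, edges2, a=0.3, b=0.7))  # 1.91
```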

Given the above fitness equation and problem statement, one of the largest remaining splits between aligners is how to handle nodes which the protein similarity measure used suggests are not similar. Several algorithms, including L-GRAAL, MI-GRAAL, GHOST, HubAlign, and SPINAL, align any pair of nodes that scores well, regardless of the weight given in the bipartite matching. Thus, these algorithms must take into account the possibility of matching any node to any node. Other algorithms, such as Belief Propagation, add the following additional constraint to the above equation:

u \rightarrow f(u) \text{ is permitted} \iff w(u, f(u)) > 0 \qquad (2.2)

Thus, in the constrained case, nodes cannot be aligned if no biological information is presented to suggest their alignment. This yields a smaller search space for solving the problem, but the problem is still unsolvable in a manageable time for medium and larger sized PPIs.

This matching problem is equivalent to the problem of matching two ARGs.

In this paper, we will also analyze each final alignment topologically and biologically.

Topologically, four of the most common ways to analyze a graph are Edge Correctness (EC), Induced Conserved Structure (ICS), Symmetric Substructure Score (S3), and Largest Connected Component (LCC). LCC can be defined as the maximal number of nodes in the final solution that are connected to each other by a single pathway of edges. A good alignment should have a large number of nodes connected together in one single structure globally to show the similarity between two PPIs; a smaller LCC represents a matching containing multiple local alignments rather than a global one. Edge Correctness is the percentage of edges from the smaller graph that exist in the conserved matching, and can be defined as follows:

EC = \frac{|f(E_1)|}{|E_1|} \qquad (2.3)

ICS is the ratio of the number of conserved edges to the number of edges of the larger network whose endpoints are both used in the mapping. This can be defined as follows:

ICS = \frac{|f(E_1)|}{\sum_{(u_2, v_2) \in E_2} \mathbf{1}[\, f^{-1}(u_2) \in V_1 \text{ and } f^{-1}(v_2) \in V_1 \,]} \qquad (2.4)

Finally, S3 penalizes missing edges from both sides, combining the metrics of EC and ICS as follows:

S^3 = \frac{|f(E_1)|}{|E_1| - |f(E_1)| + \sum_{(u_2, v_2) \in E_2} \mathbf{1}[\, f^{-1}(u_2) \in V_1 \text{ and } f^{-1}(v_2) \in V_1 \,]} \qquad (2.5)
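The three edge-based measures can be computed directly from an alignment; the sketch below is an illustration under the same assumed data structures as the earlier fitness example, not the evaluation code used in this study.

```python
# Sketch of EC, ICS and S3 (Equations 2.3-2.5) for an alignment f: V1 -> V2.
def topological_scores(f, edges1, edges2):
    conserved = {e for e in edges1
                 if all(u in f for u in e)
                 and frozenset(f[u] for u in e) in edges2}
    image = set(f.values())
    # Edges of the larger network induced by the image of V1.
    induced = {e for e in edges2 if all(v in image for v in e)}
    ec = len(conserved) / len(edges1)
    ics = len(conserved) / len(induced) if induced else 0.0
    s3 = len(conserved) / (len(edges1) + len(induced) - len(conserved))
    return ec, ics, s3

# Toy usage: one of two edges is conserved, and the image induces two edges.
edges1 = {frozenset(("a1", "a2")), frozenset(("a2", "a3"))}
edges2 = {frozenset(("b1", "b2")), frozenset(("b1", "b3"))}
f = {"a1": "b1", "a2": "b2", "a3": "b3"}
print(topological_scores(f, edges1, edges2))  # (0.5, 0.5, 0.333...)
```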

For biological comparison, the analysis in this paper uses Gene Ontology to compute functional similarity. Each node in each PPI is associated with a number of Gene Ontology (GO) terms that are listed in the Gene Ontology tree. For each term, the probability of it or one of its children appearing over the set of all species is calculated, giving a probability of occurrence for each term. Given the lowest common parent on the GO tree of the two terms t1 and t2 being compared, LCP(t1, t2), and the probability of a term t being used, p(t), the similarity score of two terms can be defined as

sim(t_1, t_2) = -\log(p(\mathrm{LCP}(t_1, t_2))) \qquad (2.6)

This calculation, known as Resnik[26][27], gives the similarity between GO terms based on the information gain of the lowest common parent with respect to the entire tree of probabilities. For each pair of proteins matched in the final conserved network, each GO term from one protein is matched to the highest scoring GO term in the other protein. These scores are then averaged, yielding a functional similarity score.
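A small sketch of this scoring step is shown below; the term probabilities p and the lowest-common-parent function lcp are assumed to be precomputed, and averaging the best matches over both proteins' term lists is one reasonable reading of the procedure described above rather than a statement of the exact convention used.

```python
# Sketch of Resnik similarity (Equation 2.6) and the greedy best-match averaging
# used to score a matched protein pair. `p` maps a GO term to its probability and
# `lcp` returns the lowest common parent of two terms; both are assumed inputs.
import math

def resnik(t1, t2, p, lcp):
    return -math.log(p[lcp(t1, t2)])

def protein_pair_score(terms1, terms2, p, lcp):
    """Average, over both proteins' terms, of each term's best Resnik match."""
    if not terms1 or not terms2:
        return 0.0
    best1 = [max(resnik(t, s, p, lcp) for s in terms2) for t in terms1]
    best2 = [max(resnik(s, t, p, lcp) for t in terms1) for s in terms2]
    return sum(best1 + best2) / (len(best1) + len(best2))
```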

2.5 Global Alignment Algorithms

The following is a list of the popular global alignment algorithms for matching Protein-Protein Interaction networks.

2.5.1 GRAAL

GRAAL[28], MI-GRAAL[29], and L-GRAAL [27] are members of the GRAAL family of graph alignment algorithms. GRAAL uses a method of pruning non-similar nodes from possible matchings using a concept known as graphlets, which represent different configurations of graphs with a small number of nodes. GRAAL counts the number of unique variations of each type of graphlet configuration, usually for up to 4 or 5 nodes, that each node takes part in, and creates a similarity measure for aligning the nodes topologically. After this, pruning non-similar nodes occurs based on these probabilities and provided biological weighting information, allowing an underlying algorithm such as integer based Lagrangian relaxation (L-GRAAL) to solve the problem for the remaining possible matches.

2.5.2 HubAlign

HubAlign[30] attempts to solve the graph alignment problem by calculating the topological importance of nodes and edges, propagating a weight measure from low degree nodes to high degree nodes. By slowly removing low degree nodes and adding their weight to their neighbors, higher weight values move towards nodes that are believed to be more topologically important, known as hubs. With this calculated information about the importance of nodes to the network, an alignment is created from the maximal scoring node pairs and extended greedily to obtain a global alignment.

2.5.3 Ghost

Ghost[31] uses Laplacian-based methodology to determine a spectral signature, and uses it to make an initial alignment which it then extends iteratively to better solutions within a local search space. Notably, this paper introduces ICS as a graph topology measure, to penalize mapping sparse regions from a smaller PPI to dense regions of a larger PPI.

2.5.4 Magna

Magna[32] uses a genetic algorithm approach to finding the maximal alignment between two protein-protein interaction graphs. The algorithm works from an initial pool of either random alignments or alignments produced by other solutions, and uses a crossover function to maximize the edges conserved by a combination of two previous alignments. The algorithm converges when the genetic algorithm can no longer improve the alignment. The authors of this paper also introduce the topological scoring concept of S3 to penalize mapping sparse and dense regions together in an alignment mapping.

2.5.5 Netal

Netal[33] uses multiple matrices, each containing different types of information. After matrices containing biological, interaction, and topological information are initialized and calculated, Netal relies on a greedy alignment to find the best possible matchings, yielding good alignments in a very fast running time.

2.5.6 Graemlin

Graemlin 2.0[34] produces both global and local alignments through the usage of topological data as well as additional biological information not required by most aligners, such as phylogenetic trees. The algorithm creates several local alignments that are then combined into a global alignment probabilistically, with a user specified threshold determining whether to add larger alignments for greater coverage or smaller alignments with better alignment scores. The usage of phylogenetic trees, as well as training examples for its learning algorithms, requires Graemlin to use more information than most aligners, but it is also one of the few algorithms designed for multiple network alignment, able to align over 10 different PPIs at once.

2.5.7 GEDEVO

GEDEVO[35] uses a combination of a genetic algorithm with an optimizer based on the minimal edit distance of the graphs being matched. The algorithm uses the number of insertions and deletions between two possible alignments in addition to how high the given alignment scores to determine how often the matching gets used in the creation of child matchings, and which matchings survive to the next iteration. A variety of different functions are used to create new possible alignments from the given set of alignments remaining.

2.5.8 PINALOG

PINALOG[36] works by first splitting nodes into highly connected graph subnetworks known as communities within their respective networks. Once communities from each PPI have been located, PINALOG then matches communities to each other, maximizing the biological similarities between the given node sets. Finally, nodes are added to nearby communities, resulting in a many-to-many matching, that is then greedily reduced to one-to-one.

2.5.9 SPINAL

SPINAL[37] aligns protein-protein interaction networks in a two step process. The first step calculates the probability of a pair of nodes being matched based on the maximal scores of its neighborhood, that is, the best sum of probabilities from its possible neighbors constrained by the one-to-one matching rule for global alignments. In the second phase, a seed-and-extend technique is used to slowly add and swap new pairs into the final matching.

2.5.10 PISWAP

PISWAP[38] iteratively improves a given alignment using the 2-opt or 3-opt technique.

Given an initial alignment, either from a matching relying only on biological information, or the result of another matching algorithm, two or three protein pairs are broken, mixed, and re-added to the mapping repeatedly where it improves the alignment. The paper shows the algorithm can dramatically improve upon the final results of other aligners, especially when using the 3-opt technique.

2.5.11 IsoRank

IsoRank[16] is one of the first algorithms designed for the purpose of matching two PPIs.

IsoRank uses a version of the PageRank algorithm to allow weight to flow to nodes that have a high importance. Given a matrix X, representing a weighting for each node pair, the matrix is iteratively multiplied by the adjacency matrices of graphs A and B. After each matrix multiplication step, the new weights are passed to a matching function which finds a maximal solution given the proposed weights. This solution is then used with the actual data to produce a fitness score. The algorithm continues running until X converges to an eigenvector, with the matching generating the highest fitness score given as the solution.

2.5.12 Natalie

Natalie[39][40] uses a maximal structural matching formulation to align two PPIs through an approach relying on linear programming and Lagrangian relaxation. In addition to providing a good solution for the alignment problem, this Lagrangian matching is also notable for providing a general upper bound on the problem, allowing the algorithm to show approximately how close it is to a maximal matching.

2.5.13 Belief Propagation

Bayati, Gleich, Saberi, & Wang’s Belief Propagation[41][42] algorithm relies on messages being passed from nodes to their neighbors within their respective graphs, as well as passing messages to an additional set of nodes that represent each potential overlapping edge. Belief propagation is specifically made for the constrained matching case.

2.5.14 Maximum Weighted Matching

Finally, maximum weighted matching (MWM) is a process that solves the assignment problem without the constraint of edges. That is, given a single score representing the value of each possible node pair, the optimal solution can be found in polynomial time. In this paper, MWM is a primal-dual version of the Hungarian algorithm coded by Bayati, Gleich, Saberi, and Wang[41]. The Hungarian algorithm can solve the problem of maximizing weight on a bipartite graph with a one-to-one matching in cubic time. However, the sparsity of PPI adjacency matrices can allow the problem to be solved faster, particularly for the constrained problem. This technique does not take into account any sort of edges or neighborhoods, but relies on the problem having been reduced to a single set of weights representing each possible node pair matching. While this algorithm does not solve the graph matching problem, it works well for finding the best matching at a given iteration of several iterative aligners such as IsoRank.
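For illustration, the a = 1 special case can be solved with any Hungarian-style solver; the sketch below uses SciPy's implementation on a small dense weight matrix (the thesis itself relies on the primal-dual code of Bayati et al., so this is only a stand-in).

```python
# Sketch: maximum weighted bipartite matching of node-pair weights alone,
# i.e. the a = 1 special case solved by a Hungarian-style algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

def max_weight_matching(W):
    """W[i, j] is the weight of matching node i of graph A to node j of graph B."""
    rows, cols = linear_sum_assignment(W, maximize=True)
    pairs = [(int(i), int(j)) for i, j in zip(rows, cols)]
    return pairs, W[rows, cols].sum()

W = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.8, 0.3],
              [0.0, 0.4, 0.7]])
pairs, total = max_weight_matching(W)
print(pairs, total)  # [(0, 0), (1, 1), (2, 2)] 2.4
```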

Chapter 3

Algorithm

3.1 Background

While most algorithms have been tested against the general case only, the constrained case still represents an NP-Hard problem, but one sparse enough to allow a good estimate of the maximal scoring alignment. The constrained case has been shown to be useful for a variety of different graph matchings, and was previously tested against PPIs by Bayati, Gleich, Saberi, & Wang[41][42]. In their work, they created their own new algorithm, Belief Propagation, along with a variation of Natalie, NetalignMR, and a version of IsoRank for the constrained case, SPAIsoRank. We now present a new algorithm, Mutual Exclusion Matching, which relies on using the best currently found matching to find a better matching, for both the constrained and unconstrained versions of the matching problem, starting from a formulation of SPAIsoRank for the constrained case.

3.2 IsoRank

We start with the formulation of SpaIsoRank, shown in Algorithm 1, a variation of IsoRank for the constrained graph matching case. SpaIsoRank, which is based on PageRank, uses a vector, X, with each entry representing a possible pair of nodes to be matched. This vector is repeatedly multiplied by a matrix representing the cross product of the matrices to be matched, S, until it converges. At each step, the weights of X are given to a matching algorithm, such as the Hungarian algorithm, that extracts the maximal-scoring set of pairs under the constraint that no two selected pairs share a node, in order to find a good global maximum. Finally, the matching calculated from the manipulated vector is used on the actual data for scoring.

Algorithm 1 SpaIsoRank[42]
Require: Undirected graphs to match, A and B. Weighted bipartite graph, W, connecting A and B. Percentage weight for emphasizing edges vs nodes, β.
Ensure: M is a good node matching
 1: S ← A cross B where (Node_a, Node_b) ∈ W
 2: P ← normalize(S)
 3: α ← 1 − β
 4: oldX ← X ← v ← W / ΣW
 5: fbest ← 0
 6: M ← null
 7: while ‖X − oldX‖ > .001 do
 8:   oldX ← X
 9:   X ← β P^T X + α v
10:   newM, matchPairs ← Maximal Weighted Matching(X)
11:   weight ← Σ_{(Node_a, Node_b) ∈ newM} W(Node_a, Node_b)
12:   edges ← matchPairs^T × S × matchPairs × 0.5
13:   if α × weight + β × edges > fbest then
14:     fbest ← α × weight + β × edges
15:     M ← newM
16:   end if
17: end while
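To make the iteration concrete, the condensed Python sketch below mirrors the structure of Algorithm 1 on dense matrices, indexing node pairs as a single flat vector; it is an illustration of the update X ← βPᵀX + αv followed by a matching step, not the released SpaIsoRank code.

```python
# Sketch of a SpaIsoRank-style iteration: propagate pair weights through the
# cross-product matrix, then extract a one-to-one matching from the result.
# Dense matrices and all-pairs indexing are simplifications for illustration.
import numpy as np
from scipy.optimize import linear_sum_assignment

def spa_isorank_like(S, v, n_a, n_b, beta=0.5, iters=50):
    """S: (n_pairs x n_pairs) cross-product matrix over node pairs (pair = a * n_b + b);
    v: bipartite weight per pair, normalized to sum to 1."""
    alpha = 1.0 - beta
    P = S / np.maximum(S.sum(axis=0, keepdims=True), 1)   # column-normalize S
    x = v.copy()
    best, best_score = None, -1.0
    for _ in range(iters):
        x = beta * (P.T @ x) + alpha * v                   # PageRank-style update
        rows, cols = linear_sum_assignment(x.reshape(n_a, n_b), maximize=True)
        match = np.zeros_like(v)
        match[rows * n_b + cols] = 1.0
        score = alpha * (v * match).sum() + beta * 0.5 * (match @ S @ match)
        if score > best_score:
            best_score = score
            best = [(int(a), int(b)) for a, b in zip(rows, cols)]
    return best, best_score
```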

3.3 First Mutual Exclusion Algorithm

The algorithm above is largely based on PageRank, which works well for modeling movement throughout a network, such as internet traffic, but does not deal well with the constraint that only one node from graph A can be matched to one node from graph B. For example, if we already have a good match between a node from graph A and a node in graph B, an ideal algorithm would most likely deemphasize matchings between these nodes and the other nodes they could potentially be matched to. This concept is not addressed by the matrix multiplication step in IsoRank; it is only addressed by the MWM function, and thus the algorithm has no real guarantee that X is moving towards an improved solution for the majority of its run.

From this knowledge, an idea forms of using mutual exclusivity as the primary force behind finding a good matching. Rather than using flow through a network, as IsoRank does, or manipulating the problem into a mathematical model, as L-GRAAL and Natalie do, the idea for this algorithm is to replace IsoRank's repeated multiplication of a vector X that converges to an eigenvector with multiplication by a set of probabilities that get changed based on the matching selected by a mutually exclusive matching algorithm such as the Hungarian or a primal-dual algorithm.

A simple example is shown in Algorithm 2. The algorithm basically consists of three steps. The first step breaks the problem of graph matching into a simpler weighted bipartite graph matching problem by assigning each possible node pair a weight. This is done by calculating an expected value for each possible matched node pair, adding together the weight each pair gets from the bipartite graph and the probability that each potential neighboring node pair will be in the final solution. The probabilities are all equal in the first iteration, meaning that the first iteration will just be the bipartite weight from W combined with the number of potential edges each node pair could connect to from S. This creates a weighted bipartite graph in the form of a matrix containing a weighted value for each pair of nodes being matched. Step 2 of the algorithm passes these expected values into an algorithm that can accurately find a good matching for the newly created weighted bipartite graph from step 1. This can be done using the Hungarian algorithm or any similar bipartite graph matching method. The final step, given the best matching found thus far by step 2, is to update the weighting influence on the problem. Each node pair that was not used in the matching found in step 2 has its influence decreased, subsequently decreasing the value all of its neighbors receive in the next iteration during step 1. The algorithm converges when the matching found is stable, that is, when the matching chosen receives all of its weight in step 1 from node pairs that are selected in the matching in step 2.

Algorithm 2 High Level Example of a Mutual Exclusion based algorithm
Require: S, a Cross Product of graphs A and B, and W, a bipartite weighting of A and B
Ensure: M is a good node matching
1: weights ← (1) × size(W)  // array of ones, length W
2: while not converged do
3:   xf ← weight of the expected value of matching each node pair
4:   newM, newMatchPairs ← best matching for xf, from a mutually exclusive algorithm
5:   weights ← updated based on the best found matching from the previous step
6: end while
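Read as code, Algorithm 2 (and its concrete form in Algorithm 3 below) amounts to the following Python sketch, again using dense all-pairs indexing for brevity; the structure is mine and only illustrates the decay-and-reset idea, not the released implementation.

```python
# Sketch of the mutual-exclusion idea: score each candidate pair by its bipartite
# weight plus the expected support from neighboring pairs, take the best one-to-one
# matching, then decay the influence of every pair left out of the best matching.
import numpy as np
from scipy.optimize import linear_sum_assignment

def mutual_exclusion_sketch(S, v, n_a, n_b, beta=0.5, step_weight=0.5, tol=1e-3):
    alpha, n_pairs = 1.0 - beta, len(v)
    weights = np.ones(n_pairs)          # influence of each candidate pair
    best_match, best_score, bscore = None, -1.0, 1.0
    while bscore >= tol:
        # Step 1: expected value of each pair = node weight + weighted edge support.
        xf = alpha * v + 0.5 * beta * (S @ weights) / n_pairs
        # Step 2: best one-to-one matching for these expected values.
        rows, cols = linear_sum_assignment(xf.reshape(n_a, n_b), maximize=True)
        match = np.zeros(n_pairs)
        match[rows * n_b + cols] = 1.0
        score = alpha * (v * match).sum() + beta * 0.5 * (match @ S @ match)
        if score > best_score:
            best_match, best_score, bscore = match, score, 1.0
        # Step 3: decay the influence of pairs outside the best matching found so far.
        bscore *= step_weight
        weights *= step_weight
        weights[best_match == 1.0] = 1.0
    return best_match, best_score
```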

To formulate the high level idea given by Algorithm 2 into a full algorithm, we started with SpaIsoRank. From there, we replaced the matrix multiplication of X with a vector, weights, that represents the probability of a given node pair being in the final solution. In each iteration of the algorithm, the value of a node pair's probability is decreased by a given amount if the pair does not exist in the best found matching. All pairs in the best matching have their probabilities reset to a value of one. When applying these probabilities, we treat the cross product of matrices A and B as a directed graph, so that we can multiply each edge by the probability value of the node pair the edge is moving away from. In this way, node pairs not chosen in the best matching slowly have their influence removed from the other pairs. This gives us a new algorithm that slowly excludes the influence of pairs not in the best matching, shown in Algorithm 3.

3.4 Mutual Exclusion Matching

While the above algorithm works nicely, it can be improved. The algorithm relies on a small stepWeight to reach a good solution, which can take a while to converge. Additionally, this approach is prone to getting stuck on local maxima well below better known solutions, due to its reliance on the bipartite graph weights in the first step and the size of the bipartite graph weights relative to the size of the edge weights. Both types of weights are set to sum to 1, but, since there are more edges than nodes in most graphs, node weights tend to be higher. In order to avoid both of these factors, we can change our probability vector, weights, to contain two probabilities per node pair. The probabilities represent the chance of the node pair being in the final matching divided by all possible node pairs, and the chance of the node pair being in the final matching divided by only the remaining probabilities. The non-normalized weight from the first problem makes mostly small, local changes, commonly increasing the influence of the node weight part of the algorithm rather than the edge weight, while the new weighting vector represents the chance of the node pair existing relative to other remaining pairs, which constantly increases the value of selected node pairs, increasing the influence of the remaining edges and pushing the algorithm to find good solutions that contain them. The push between these two forces leads to an algorithm that can rapidly yield near optimal results on smaller problems, and good results on larger problems.

Algorithm 3 First Mutual Exclusion Algorithm
Require: Undirected graphs to match, A and B. Weighted bipartite graph, W, connecting A and B. Percentage weight for emphasizing edges vs nodes, β. A stepWeight, between 0 and 1.
Ensure: M is a good node matching
 1: S ← A cross B where (Node_a, Node_b) ∈ W
 2: α ← 1 − β
 3: v ← W / ΣW
 4: fbest ← 0
 5: M ← matchPairs ← null  // matchPairs is a binary vector of which pairs are in the matching
 6: bscore ← 1
 7: weights ← (1) × size(W)  // array of ones, length W
 8: tol ← .001
 9: while bscore > 0 do
10:   xf ← .5 × β × (S × weights) / size(W)
11:   xf ← xf + α × v
12:   newM, newMatchPairs ← Maximal Weighted Matching(xf)
13:   weight ← Σ_{(Node_a, Node_b) ∈ newM} W(Node_a, Node_b)
14:   edges ← newMatchPairs^T × S × newMatchPairs × 0.5
15:   if α × weight + β × edges > fbest then
16:     fbest ← α × weight + β × edges
17:     M ← newM
18:     matchPairs ← newMatchPairs
19:     bscore ← 1
20:   end if
21:   bscore ← bscore × stepWeight
22:   if bscore < tol then
23:     break
24:   end if
25:   weights(:) ← weights(:) × stepWeight
26:   weights(matchPairs=1) ← 1
27:   weights(weights(:) < tol) ← 0  // final step truncated in the source; plausible reconstruction
28: end while

With two probabilities per node pair, scoring will now include two probabilities multiplied together for each edge. In order to converge with two sets of weights, the algorithm will also have to switch back and forth between which set it is altering at any given time. By switching between the two sets of probabilities whenever a new best is found, the algorithm attempts to solve both constraints iteratively, allowing a larger stepWeight, thanks to knowledge of the previous two bests, and converges to a better solution than the previous algorithm. Thus, we present a final algorithm for the constrained case, Mutual Exclusion Matching, shown in Algorithm 4.

Finally, given the Mutual Exclusion Matching algorithm for the constrained problem, a version for the unconstrained problem can be produced by simply replacing S with the general operation of multiplying by the adjacency matrices A and B directly, as shown in Algorithm 5.

Algorithm 4 Mutual Exclusion Matching
Require: Undirected graphs to match, A and B. Weighted bipartite graph, W, connecting A and B. Percentage weight for emphasizing edges vs nodes, β. A stepWeight, between 0 and 1.
Ensure: M is a good node matching
 1: S ← A cross B where (Node_a, Node_b) ∈ W
 2: α ← 1 − β
 3: v ← W / ΣW
 4: fbest ← 0
 5: M ← matchPairs ← null
 6: bscores ← (1,1)
 7: weights ← (1,1) × size(W)  // two arrays of ones, length W
 8: steptype ← 1
 9: tol ← .001
10: while Σ bscores > 0 do
11:   x ← weights(:,1) / Σ weights(:,1)
12:   xf ← .5 × β × (S × (x .× weights(:,2)))
13:   xf ← xf + α × v
14:   newM, newMatchPairs ← Maximal Weighted Matching(xf)
15:   weight ← Σ_{(Node_a, Node_b) ∈ newM} W(Node_a, Node_b)
16:   edges ← newMatchPairs^T × S × newMatchPairs × 0.5
17:   if α × weight + β × edges > fbest then
18:     fbest ← α × weight + β × edges
19:     M ← newM
20:     matchPairs ← newMatchPairs
21:     bscores ← (1,1)
22:     steptype ← 3 − steptype  // flip between type 1 and 2
23:   end if
24:   bscores(steptype) ← bscores(steptype) × stepWeight
25:   weights(:,steptype) ← weights(:,steptype) × stepWeight
26:   weights(matchPairs=1, steptype) ← 1
27:   weights(weights(:,steptype) < tol, steptype) ← 0  // final step truncated in the source; plausible reconstruction
28: end while

Algorithm 5 Mutual Exclusion Matching Full
Require: Undirected graphs to match, A and B. Weighted bipartite graph, W, connecting A and B. Percentage weight for emphasizing edges vs nodes, β. A stepWeight, between 0 and 1.
Ensure: M is a good node matching
 1: α ← 1 − β
 2: v ← W / ΣW
 3: fbest ← 0
 4: M ← matchPairs ← null
 5: bscores ← (1,1)
 6: weights ← (1,1) × size(W)  // two arrays of ones, length W
 7: steptype ← 1
 8: wtol ← .01
 9: btol ← max(stepWeight^4, wtol)  // for faster convergence
10: while Σ bscores > 0 do
11:   x ← weights(:,1) / Σ weights(:,1)
12:   xf ← .5 × β × (A × (x .× weights(:,2)) × B)
13:   xf ← xf + α × v
14:   newM, newMatchPairs ← Maximal Weighted Matching(xf)
15:   weight ← Σ_{(Node_a, Node_b) ∈ newM} W(Node_a, Node_b)
16:   edges ← Σ_{(u,v) ∈ A} 1[(newM(u), newM(v)) ∈ B]
17:   if α × weight + β × edges > fbest then
18:     fbest ← α × weight + β × edges
19:     M ← newM
20:     matchPairs ← newMatchPairs
21:     bscores ← (1,1)
22:     steptype ← 3 − steptype  // flip between type 1 and 2
23:   end if
24:   bscores(steptype) ← bscores(steptype) × stepWeight
25:   weights(:,steptype) ← weights(:,steptype) × stepWeight
26:   weights(matchPairs=1, steptype) ← 1
27:   weights(weights(:,steptype) < wtol, steptype) ← 0  // final step truncated in the source; plausible reconstruction
28: end while

3.5 Example Problem

As stated previously, this algorithm slowly reduces the influence, on the rest of the algorithm, of node pairs not found in the best matching. We will now walk through a simple example of the Mutual Exclusion Matching algorithm on the constrained matching problem.

Figure 3.1: Example Problem: Graph A

Figure 3.2: Example Problem: Graph B

Figure 3.3: Example Problem: Weighted Bipartite Matching from A and B

In the images above, Figures 3.1 and 3.2 show two example graphs, A and B, each containing 10 nodes and 8 edges. Figure 3.3 represents all possible node mappings from each node in A to each node in B. Since this is a constrained test case, only a fraction of the node mappings are possible; if this were unconstrained, the image would be a complete bipartite graph, with weight zero for all remaining node pairs. Using the data from Figures 3.1, 3.2, and 3.3, we can now create a view of the full graph matching problem below.

Figure 3.4: Example Problem: Full constrained matching problem between Graph A and B

In Figure 3.4, we can now see the full scope of the problem. Normally, each edge between nodes in graph A and graph B would carry a weight, W, representing the value of matching those two nodes together. In order to simplify the problem, for the rest of this example we are going to assume we are solving the case with a = 0 and b = 1, that is, 100% focused on edge scoring over node scoring. Thus, we only need to concern ourselves with conserving edges in the final mapping. Possible conserved edges can be seen in the above image by finding squares containing two nodes from graph A and two nodes from graph B, with one edge connecting the nodes in graph A, one edge connecting the nodes in graph B, and two matching edges, each connecting one of the two nodes from A to one of the two nodes in B. This formulation of a square is why the number of conserved edges is sometimes referred to as the sum of squares.

Keeping in mind that we are only concerned with the squares formed in the above image, we can simplify the image by removing all edges that cannot be conserved by matching the graphs, and all nodes that cannot contribute to conserved edges.

Figure 3.5: Example Problem: Conserved edge constrained matching problem between Graph A and B

Figure 3.6: Example Problem: Cross Product (S) matrix between Graph A and Graph B. Only showing node pairs with possible conserved edges

Figure 3.5 above shows a simplified version of the full matching problem, now only dealing with possibly conserved edges. Using this, we can create the cross product (S) matrix, containing possible node pairs linked by possible conserved edges, in Figure 3.6.

From Figures 3.5 and 3.6, we can see there are five possible edges that can be conserved. We can also see that it is not possible to keep all of the edges, since keeping an edge requires keeping both node pairs surrounding it, and we cannot map a single node from graph A to two nodes from graph B. For example, we cannot keep both a4 b9 and a4 b6, since a4 can only map to one of them.

Applying MWM to the S matrix can be done by assigning each node pair a value equal to the number of possible edges directed into it. This results in the following alignment.

Figure 3.7: Example Problem: Conserved edge constrained matching problem between Graph A and B. 1st iteration

Figure 3.8: Example Problem: Cross Product (S) matrix between Graph A and Graph B. Only showing node pairs with possible conserved edges. 1st iteration

As we can see in Figures 3.5 and 3.6, the MWM matching maximized the given weights, choosing four node pairs with a maximum of one half edge and one node pair with a maximum of two half edges, for a value of 6 × .5 = 3. However, this does not take into account the fact that we need to choose both node pairs surrounding an edge to conserve it in the final matching. When we score this matching on the actual matrix, we only get a value of 1.

The next step in the algorithm is to consider the bipartite S matrix, and weight each outgoing edge based on the probability of existence of the node pair it is leaving from.

Figure 3.9: Example Problem: Cross Product (S) matrix between Graph A and Graph B. Only showing node pairs with possible conserved edges. 2nd iteration

In Figure 3.9, the edge weights are represented by decreasing the size of the edges that come from node pairs not in the current best matching. With these new weights applied, the values for each node pair passed into MWM, calculated by summing the incoming edge weights, will be different in this iteration, as node pairs a6 b10, a7 b1, a4 b6, and a9 b2 will have decreased from the previous iteration, since some of the node pairs that gave them value in the previous iteration have now been decreased.

Figure 3.10: Example Problem: Conserved edge constrained matching problem between Graph A and B. 3rd iteration

Figure 3.11: Example Problem: Cross Product (S) matrix between Graph A and Graph B. Only showing node pairs with possible conserved edges. 3rd iteration

In Figure 3.11, we can now see that the algorithm has chosen a matching conserving two edges, one of the two possible best solutions (the other would be to conserve a4 b6, a7 b9, a9 b2). The squares created by a6, b10, a4, b9 and a4, b9, a7, b1 conserve two edges, giving a final score of 2. Node pair a9 b2 is also matched, but does not have any effect on the final value because we are scoring the case a = 0, b = 1, and thus only conserved edges have value.

We have now given a new algorithm, Mutual Exclusion Matching, that scores well and converges in only a couple dozen fast steps. In the next section, we will give an overview of a set of PPI data and the results of finding good matchings between them with Mutual Exclusion Matching and various other constrained and unconstrained matching algorithms.

Chapter 4

Experimentation

4.1 Datasets

Data for the PPIs used in this experiment were collected from BioGRID.org[43]. BioGRID is an online database containing a collection of interaction data over a large set of species. BioGRID's August 2015 release, v3.4.127, contained interaction data for 112 species. Each of these files contains unique IDs for each interaction and protein, allowing access to further data from BioGRID's website.

While BioGRID gives the data needed to construct the PPIs, in order to do a weighted matching between the networks, we need a separate set of data containing protein similarity. For this, we relied on BLAST[18]. BLAST gives back scores based on the matchings found and the substitution penalties chosen, but knowledge of the raw score alone is, in the words of Stephen Altschul, like "citing a distance without specifying feet, meters, or light years."[19] Instead, the most common metrics are normalized bit scores, which incorporate knowledge about the substitution penalties into the calculation to give a score that represents the search space necessary to find a match this good by chance, and e-values, which represent the chance of finding a match this good in the given search space. For these datasets, we utilized bit scores capped at a value of 100, but the possible matchings would look very similar using e-values as well.

Given the choice between using amino acids and nucleotides for BLAST weighting, amino acids were used in this study. Protein sequences consist of up to twenty unique amino acid identifiers, in contrast to the four bases that make up nucleotide sequences. Each of the twenty amino acids represents one or more combinations of three bases, meaning that multiple nucleotide combinations can represent the same amino acid. In addition to different nucleotide sequences yielding the same amino acid result, protein sequences also have the advantage of having non-coding regions removed. For these reasons, this study chose protein sequence matching over nucleotide matching. However, it is noteworthy that recent studies have suggested that non-coding regions could have more of an impact on genetic functions than originally realized[44], and that synonymous substitutions, where nucleotide sequences mutate to other sequences representing the same protein, may indicate changes in biological functionality[45].

To gather the necessary data for the BLAST algorithm to use, sequences were collected from the Universal Protein Resource (UniProt)[46]. UniProt is a collection of resources on a large set of proteins, including functional annotations and sequences. Links to the necessary UniProt webpage for each protein were retrieved from BioGRID. From

UniProt, protein sequences were gathered for each protein in each species in the original list of PPIs. These sequences were then combined into a BLAST database using an offline

BLAST tool provided by the National Center for Biotechnology Information (NCBI). After

the database was formed, each sequence was run against the database, with only matches whose expectation value fell below the default threshold of e = 10 being kept. All substitution matrix weightings were left at their default values. For the unconstrained case, the data generated in the previous steps is enough to begin testing. For the constrained case, however, the final step in data preparation was to create the cross product of each pair of PPIs, preserving only edges where both pairs of nodes on each network are linked together in the bipartite weighting created from the BLAST results. All normalized bit scores obtained from BLAST were capped at a value of 100, in order not to emphasize one good matching too much over another, after which all scores were divided by the maximal score found, placing all values of the bipartite weighted graph between zero and one.
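The capping and normalization just described can be summarized by a short sketch; the function and variable names are illustrative and not taken from the actual pipeline.

def normalize_bit_scores(bit_scores, cap=100.0):
    """bit_scores: dict mapping (protein_a, protein_b) -> BLAST bit score."""
    if not bit_scores:
        return {}
    # Cap every score at 100 so one very strong match does not dominate the weighting.
    capped = {pair: min(score, cap) for pair, score in bit_scores.items()}
    # Divide by the largest capped score, placing all weights in [0, 1].
    max_score = max(capped.values())
    return {pair: score / max_score for pair, score in capped.items()}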

For the tests below, only 12 of the initial 112 species were chosen. This is because several of the species have very few interactions, fewer than 100, with several under 10, which yields a small comparison with a good chance of having no conserved edges in the constrained case. Of the 12 remaining species, 2 pairs did not yield any possible edge conservation for the constrained case, creating a constrained cross product graph of size 0; thus, 12 choose 2 minus 2, or 64, pairs of networks were created for the constrained comparison.
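Spelled out as a quick check, the constrained pair count is

$\binom{12}{2} - 2 = 66 - 2 = 64.$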

The unconstrained case uses the full 66 pairs of networks from the 12 networks chosen. For the constrained case, the smallest few pairs contain only a handful of possible conserved edges, while the majority contain more than a couple thousand; the largest pair, Homo sapiens - Saccharomyces cerevisiae, contains over 5.5 million potentially conserved edges. This should yield a good variety of tests of how well the algorithms do with larger graphs, as well as how close they are to optimal on the smaller graphs. In the datasets for

the unconstrained case, the cross product of two PPIs yields networks ranging from a few hundred thousand edges to over a billion. Data on the full set of species can be found in

Appendix A.1, while information about the pairs of species being matched for the unconstrained and constrained cases can be found in Appendix A.2 and Appendix A.3 respectively.

Finally, for a biological comparison between the final matches, we retrieved the GO data for each protein from BioGRID, as well as a copy of the GO tree from the official

Gene Ontology website. In order to score each pair of GO terms between a matched set of proteins, each GO term from one protein was greedily aligned to the maximal scoring match from the other protein, with the average of all the maximum scores being used as the final biological score for the matching. Resnik[26][27] scoring was used for each pair of

GO terms, as outlined in Equation 2.6. Rather than use all GO terms in our comparisons, we followed the method outlined by Malod-Dognin and Pržulj[27] in using only experimentally validated GO terms, excluding annotations inferred from PPIs.
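One plausible reading of this greedy scoring scheme is sketched below; resnik_similarity stands in for the Resnik measure of Equation 2.6, and the names and exact aggregation are assumptions rather than the code used in the study.

def biological_score(matching, go_terms, resnik_similarity):
    """matching: list of matched (protein_a, protein_b) pairs;
    go_terms: dict mapping a protein to its experimentally validated GO terms."""
    max_scores = []
    for a, b in matching:
        for term_a in go_terms.get(a, ()):
            candidates = [resnik_similarity(term_a, term_b) for term_b in go_terms.get(b, ())]
            if candidates:
                # Greedily align term_a to the best-scoring GO term of the other protein.
                max_scores.append(max(candidates))
    # The final biological score is the average of all the maximum scores.
    return sum(max_scores) / len(max_scores) if max_scores else 0.0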

4.2 Constrained Results

For the constrained alignment problem, Mutual Exclusion Matching is compared to a set of algorithms coded by Mohsen Bayati, David F. Gleich, Amin Saberi, and Ying Wang in a Matlab package called Netalign. The algorithms in the package specifically target the constrained matching problem, and are based on a few different prior algorithms.

NetalignBP and NetalignSCBP are versions of belief propagation[41][42] made by Bayati et al. specifically for the constrained problem. NetalignMR, Lagrange, and LP are based on

Klau's algorithm, Natalie[39]. Finally, SpaIsoRank is based on the work of Singh[16]. Of these algorithms, only the best of each group was used. Based on testing, NetalignMR,

NetalignBP, SpaIsoRank, and Mutual Exclusion Matching were the final algorithms compared. Algorithms were run 11 times per problem, once for each α value from 0.0 to 1.0 in increments of 0.1. Each algorithm was also run with a variety of settings, with the best results for each problem being used to compile a final result set for the algorithm, consisting of 11 runs per pair of PPIs. Belief Propagation was split into two groups, one running faster settings to solve the problem quickly, and another solving the problem more slowly with smaller step sizes and a larger maximum number of iterations. NetalignMR, in addition to attempting to solve the problem, provides an upper bound on the best possible solution value. Full details on the algorithms run can be found in Table 4.1. Algorithms are compared in four categories: fitness score, running time, topology, and biological quality.

Table 4.1: Constrained alignment algorithm settings. All non-listed settings are run at default values

Algorithm | Settings | Total Versions | Best Versions
SpaIsoRank | rtype ∈ {1, 2} | 2 | rtype 2 runs better more often
Belief Propagation Short | dtype ∈ {2, 3}; gamma ∈ {.9, .99, .995, .999}; max iterations = 100 | 8 | gamma .995 runs best in most cases
Belief Propagation Long | (dtype ∈ {2, 3}; gamma = .995; max iterations = 250), (dtype ∈ {2, 3}; gamma = .999; max iterations = 1000) | 4 | gamma .999 runs best in most cases
NetalignMR | rtype = 1; gamma = .4; stepm ∈ {5, 25, 50} | 3 | Lower stepm tends to run better
Mutual Exclusion Matching | stepWeight ∈ {0, .01, .05, .1, .15, .25, .5, .75, .9} | 9 | .05 - .25 runs better on larger networks, .5 - .9 runs better on smaller networks

Figure 4.1: Scores of the best version of each algorithm, averaged over 11 α values. Scores are shown relative to the upper bound calculated by NetalignMR. PPIs are ordered from largest to smallest by Constrained Edge Product.

The first metric for comparing the various alignments is the score of the fitness function. In this metric, Mutual Exclusion Matching and NetalignMR were the two most dominant algorithms. NetalignMR did better on smaller PPI sets, since it can run until the optimal solution is found. On larger sets, where the optimal alignment is not as easy to find, Mutual

Exclusion Matching produced the best results. Full results can be found in Figures 4.1 and

4.2.

The second metric, running time, represents the feasibility of running a given algorithm on larger graphs in a small amount of time. Times are calculated both in terms of the time the algorithm spends finding the best alignment and the time each algorithm spends before it converges or stops looking for a better alignment. Results of these metrics can be found in Figure 4.3 and Table 4.2. IsoRank finds its maximal solution faster than any other algorithm, with Mutual Exclusion the second fastest. While IsoRank finds its solution quickly, the solution found is typically not good, as shown in the scoring results in Figure 4.1. The other algorithm best at finding a good solution in terms of fitness score, NetalignMR, takes far more time to run than any other algorithm. While it runs relatively fast for the first and third largest matchings, this only occurs because it fails to find a good score later in the matching, and returns a solution found in its first few steps. This can be seen from its poor score on these pairs in Figure 4.1, and from the overall time it takes to complete in Table 4.2. IsoRank is similar to NetalignMR in this sense, commonly finding its best possible solution in the first few iterations, yielding a best-solution time of near zero. Overall, Mutual Exclusion Matching's low running time and high scoring are favorable when compared to these other algorithms.

Figure 4.2: Scores from the best version of each algorithm, for the 10 largest PPI pairs, averaged over 11 α values. Scores are shown relative to the upper bound calculated by NetalignMR. PPIs are ordered from largest to smallest by Constrained Edge Product.

Figure 4.3: Logarithmically scaled running time to find the best solution for each algorithm, averaged over 11 α values. PPIs are ordered from largest to smallest by Constrained Edge Product.

Table 4.2: Total time taken by the best run of each algorithm to find a best solution (Best Time) and to converge (Total Time), in seconds
Algorithm | Best Time (s) | Total Time (s)
SpaIsoRank | 235 | 6317
Belief Propagation Short | 11605 | 15593
Belief Propagation Long | 53064 | 119310
NetalignMR | 854520 | 1196000
Mutual Exclusion Matching | 1756 | 3854

For topology, measurements are made on the maximal value found across all α values rather than the average, since averaging in the topology found at high α values, which weight semantic similarity rather than topology, would be meaningless. The topology of the conserved networks is measured in Figure 4.4, Figure 4.5, Figure 4.6, and Table 4.3. In all four topological measures (EC, ICS, S3, and LCC), the results produced are rather similar for each PPI pair, usually with most algorithms producing a result within 5% of each other, as seen in Figure 4.4. The reason for this can be found in how highly constrained this version of the problem is relative to the size of the networks as a whole. Since most possible edge matchings are removed in the constrained case, each algorithm is forced to choose from a smaller selection of possible matchings, limiting the number of edges that can be conserved. Since such small variance does not show easily on a graph of the actual values, Figures 4.5 and 4.6 present the data relative to the best found solutions in order to highlight the differences between the algorithms' topologies. EC scores follow a pattern similar to the fitness score, with Mutual Exclusion Matching winning the majority of larger alignments and NetalignMR winning the majority of smaller alignments. For LCC, Mutual Exclusion Matching does well in the top 20 networks and overall, but competes closely with Belief Propagation for the top spot; both algorithms do very well at finding a good LCC. Edge measures, particularly ICS and S3, are very low due to the limitations enforced by the constrained problem. Averaging all solutions together shows Mutual Exclusion slightly losing to NetalignMR in Edge Correctness, and placing 3rd and 4th in ICS and S3, behind both NetalignBP variants in both measures. Mutual Exclusion creates the largest average LCC.

Figure 4.4: Average EC for each PPI pair. PPIs are ordered from largest to smallest by Constrained Edge Product.

Figure 4.5: Relative percentage of EC to the best found value for each PPI pair. PPIs are ordered from largest to smallest by Constrained Edge Product.

Table 4.3: Average across all constrained problems of the maximal values found across all α

Algorithm | EC | ICS | S3 | LCC
SpaIsoRank | 0.1365 | 0.0887 | 0.0400 | 347
Belief Propagation Short | 0.1572 | 0.1272 | 0.0531 | 459
Belief Propagation Long | 0.1595 | 0.1245 | 0.0530 | 464
NetalignMR | 0.1635 | 0.1143 | 0.0522 | 412
Mutual Exclusion Matching | 0.1626 | 0.1165 | 0.0518 | 466

Figure 4.6: Relative value of LCC to the best found value for each PPI pair. PPIs are ordered from largest to smallest by Constrained Edge Product.

Finally, biological scores are calculated using Resnik scoring and listed in Tables 4.4 and 4.5. No algorithm strongly outperforms any other in biological scoring, but Mutual Exclusion, IsoRank, and NetalignMR tend to perform better than Belief Propagation on average. Rather than only compare the average value given to the matches, we also check the average value of the biological scores within the largest connected component, to ensure that good matches are being created within the main connected network rather than scattered across several small networks in the background. For this measure, we excluded results found with an α value of 1, since those matchings ignore topology, creating non-meaningful LCCs. Within LCCs, IsoRank outperforms the other alignment algorithms in terms of biological similarity, which could be related to it finding the smallest LCCs of all the alignment algorithms tested.

Table 4.4: Average values for biological measures for each constrained problem

Algorithm | Cellular | Molecular | Biological Process | Max Cellular | Max Molecular | Max Biological Process
SpaIsoRank | 1.1715 | 0.9687 | 1.2922 | 1.2289 | 1.0323 | 1.3718
Belief Propagation Short | 1.1634 | 0.9653 | 1.287 | 1.2380 | 1.0423 | 1.3844
Belief Propagation Long | 1.1637 | 0.9647 | 1.2830 | 1.2368 | 1.0395 | 1.3834
NetalignMR | 1.1711 | 0.9681 | 1.2905 | 1.2395 | 1.0392 | 1.3840
Mutual Exclusion Matching | 1.1715 | 0.9722 | 1.2956 | 1.2335 | 1.0397 | 1.3764

Table 4.5: Average values for biological measures for each constrained problem within their Largest Connected Components for α values less than 1

Algorithm | Cellular | Molecular | Biological Process | Max Cellular | Max Molecular | Max Biological Process
SpaIsoRank | 1.2966 | 0.9851 | 1.4431 | 1.6842 | 1.3349 | 1.9579
Belief Propagation Short | 1.1199 | 0.8747 | 1.2642 | 1.4745 | 1.2639 | 1.6967
Belief Propagation Long | 1.1298 | 0.8792 | 1.2769 | 1.4685 | 1.2610 | 1.6714
NetalignMR | 1.1933 | 0.8792 | 1.2769 | 1.4685 | 1.2610 | 1.8202
Mutual Exclusion Matching | 1.1797 | 0.9664 | 1.3789 | 1.5564 | 1.3315 | 1.7962

4.3 Unconstrained Results

For the unconstrained alignment problem, Mutual Exclusion Matching is compared to

IsoRank, L-GRAAL, and Spinal. Due to the size of the networks involved, a 1 hour time limit was imposed on IsoRank, L-GRAAL, and Mutual Exclusion, as those three aligners produce iteratively better results and can be stopped after a time limit. IsoRank usually finds its best solution rather early, and converges within an hour on most alignments with an α value greater than .2 or .3. L-GRAAL used a 1 hour time limit in its initial paper, which suggests that it should run fine given an hour as well. Mutual Exclusion Matching converges for most small and medium sized alignments within the 1 hour time limit, and gives a decent matching for the larger comparisons within this time. The primary reason for Mutual Exclusion Matching's longer running time on the constrained problem is the MWM algorithm, which finds the maximal matching in cubic time. Spinal does not give iterative solutions, and was thus allowed a longer running time, commonly taking between 1 and 4 hours to converge for most alignments, with some exceeding 6 hours, and one matching between the two yeast species exceeding 24 hours before converging. Since time was constrained, comparisons will be limited to score, topology, and biology. In addition to the constraints on time, the computers running these matchings were limited to 8 GB of RAM, and thus some algorithms were unable to run all pairs of PPIs.
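The MWM step referred to above, finding a maximum weight bipartite matching, can be carried out with an off-the-shelf solver; a minimal sketch using SciPy's assignment solver is shown below, assuming the node-pair values are arranged in a dense score matrix (the cubic worst case of such solvers is what drives the running time mentioned here).

import numpy as np
from scipy.optimize import linear_sum_assignment

def maximum_weight_matching(score_matrix):
    """score_matrix[i, j]: value of matching node i of the first network to node j of the second."""
    rows, cols = linear_sum_assignment(score_matrix, maximize=True)
    total = float(score_matrix[rows, cols].sum())
    return list(zip(rows.tolist(), cols.tolist())), total

# Small usage example with a 3x3 score matrix.
scores = np.array([[0.9, 0.1, 0.0],
                   [0.2, 0.8, 0.3],
                   [0.0, 0.4, 0.7]])
pairs, total = maximum_weight_matching(scores)   # pairs = [(0, 0), (1, 1), (2, 2)]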

IsoRank, L-GRAAL, and Mutual Exclusion successfully ran 62 of the 66 possible matches under these conditions. Spinal could run all 66 comparisons in the RAM provided, but was reduced to 64 total comparisons after removing 2 alignments that took over 6 hours. Table 4.6 shows the algorithms and the comparisons that were excluded from their runs. Because not all algorithms ran all 66 comparisons, all averages are calculated across the 61 pairs which all 4 algorithms ran. However, all graphs which split the values by PPI pair show all values calculated from each alignment algorithm. All algorithms were run with default settings, with Mutual Exclusion Matching running with a stepWeight equal to .4, which proved to be decent at finding a good result quickly on larger alignments. Two computers were used to calculate the results, a 2.6 GHz Fedora Linux machine and a 2.7 GHz Windows 10 machine, each with 8 GB of RAM. Comparisons between the two showed approximately the same amount of calculation done on each machine for the same PPI pairs within the 1 hour time limit. For Spinal, and in a couple of cases L-GRAAL, the algorithm was run with an α of 0.99999 in place of an α of 1.0, to avoid crashes and freezes that occurred when testing the algorithms.

A comparison of the average scores given by each alignment algorithm can be found in Figure 4.7. Mutual Exclusion is the dominant algorithm in terms of scoring, winning in almost all comparisons with the exception of the four that it could not run under the given hardware limitations. L-GRAAL typically comes in second, with Spinal beating it only occasionally. Mutual Exclusion achieves or ties the best scoring average over fifty times in sixty-six comparisons, including all 19 of the largest comparisons in which it was run.

Table 4.6: Unconstrained alignment algorithms
Algorithm | Missing alignments
Mutual Exclusion Matching | Homo - Sacc, Homo - Mus, Arab - Homo, Dros - Homo
IsoRank | Homo - Sacc, Homo - Mus, Arab - Homo, Dros - Homo
L-GRAAL | Homo - Sacc, Homo - Mus, Arab - Homo, Dros - Homo
Spinal | Homo - Sacc, Sacc - Schi

Figure 4.7: Average scoring of various algorithms on the unconstrained alignment problem across all α. PPIs are ordered from largest to smallest by Edge Product.

Comparisons of the topological metrics of the alignment algorithms can be found in Figures 4.8 and 4.9 as well as Table 4.7. Figure 4.8 shows the edge correctness found for each alignment, on a scale of 0% to 100%. While the maximal found EC varies mostly based on the set of networks compared, Mutual Exclusion finds more conserved edges than the other aligners on most networks, with an EC of over 90% for several alignments. Mutual Exclusion finds 11 alignments containing over 95% of the edges from the smaller network in the larger, including an overlap of over 98% between Rat and Baker's yeast, a matching for which no other alignment algorithm tested found an overlap of 80% or more. Figure 4.9 shows the maximal LCC found by each alignment algorithm, with results similar to those of Edge Correctness. The overall results in Table 4.7 show that Mutual Exclusion Matching aligns a better topology in terms of EC and LCC than any other algorithm, and places second in ICS and S3 only to L-GRAAL. L-GRAAL removes potential node matches with dissimilar local topology in its initial preparation stage, thus it is unsurprising that it placed first in measures based on the local topological similarity of the nodes matched.

Figure 4.8: Maximal Edge Correctness across all α for the unconstrained alignment problem. PPIs are ordered from largest to smallest by Edge Product.

Table 4.7: Average across all unconstrained problems of the maximal values found across all α
Algorithm | EC | ICS | S3 | LCC
Mutual Exclusion | 0.6791 | 0.3033 | 0.2059 | 1666
IsoRank | 0.1842 | 0.0860 | 0.0482 | 318
L-GRAAL | 0.5668 | 0.4724 | 0.3118 | 1453
Spinal | 0.5605 | 0.2407 | 0.1603 | 1516

Figure 4.9: Maximal LCC across all α for the unconstrained alignment problem. PPIs are ordered from largest to smallest by Edge Product.

Finally, a comparison between the various biological calculations can be found in Tables 4.8 and 4.9. Table 4.8 shows that IsoRank provides the best average for GO term matching, followed by L-GRAAL and Mutual Exclusion. Looking at the values from within the LCC in Table 4.9 shows large improvements for IsoRank, L-GRAAL, and Mutual Exclusion compared to their global averages. For the global average, IsoRank tends to produce better biological scores on average, while L-GRAAL tends to produce the best maximal scores. When focusing on the biological scoring from within the LCC each algorithm creates, Mutual Exclusion does better than L-GRAAL on average, but loses to it in the maximum match found. IsoRank again does the best within its LCC, but also provides the smallest LCC of all algorithms tested.

Table 4.8: Average values for biological measures for each unconstrained problem
Algorithm | Cellular | Molecular | Biological Process | Max Cellular | Max Molecular | Max Biological Process
Mutual Exclusion | 0.8242 | 0.5770 | 0.8568 | 0.9665 | 0.7637 | 1.0513
IsoRank | 0.9131 | 0.6925 | 0.9753 | 0.9721 | 0.7665 | 1.0565
L-GRAAL | 0.8363 | 0.6043 | 0.8753 | 1.0619 | 0.8851 | 1.1770
Spinal | 0.7110 | 0.4307 | 0.6965 | 0.8382 | 0.5604 | 0.8590

Table 4.9: Average values for biological measures for each unconstrained problem within their Largest Connected Components for α values less than 1
Algorithm | Cellular | Molecular | Biological Process | Max Cellular | Max Molecular | Max Biological Process
Mutual Exclusion | 0.8430 | 0.5598 | 0.8371 | 1.2085 | 0.9570 | 1.3507
IsoRank | 1.1821 | 0.7699 | 1.1495 | 1.6510 | 1.2164 | 1.7102
L-GRAAL | 0.7756 | 0.5206 | 0.7879 | 1.2529 | 1.0007 | 1.4420
Spinal | 0.7193 | 0.4221 | 0.7156 | 0.8692 | 0.5641 | 0.9079

The results found suggest that L-GRAAL is better than Mutual Exclusion at finding biological scores based on global averages, and better at finding a similar node structure based on ICS and S3 scores. However, a comparison of the number of matches produced by each algorithm in Figure 4.10 shows another possible cause for the drop in Mutual Exclusion's global averages. While L-GRAAL and Spinal only return a subset of the possible nodes from the smaller network to be mapped to the larger, the unconstrained

Mutual Exclusion algorithm and IsoRank always provide a matching for each node, regardless of whether or not a good match exists for it. This explains why the algorithm does worse than L-GRAAL on the overall average biological scoring, but performs similarly to or better than L-GRAAL when looking only at nodes from within the LCC each provides. Future work should remove nodes from the output matching based on how much score they contribute to the final solution. While this would reduce EC and fitness score slightly, it should not degrade their quality too much, as only the lowest scoring nodes would be removed. LCC would probably be reduced, but it would also likely improve Mutual Exclusion's ICS and S3 metrics. We leave pruning the final matching to return a smaller set that scores well to future work.
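Such a pruning pass could be as simple as the sketch below; the threshold, names, and scoring function are hypothetical and not part of the current implementation.

def prune_low_scoring_pairs(matching, pair_score, threshold):
    """matching: list of (node_a, node_b) pairs; pair_score: function giving the
    contribution of a pair to the final solution; threshold: minimum contribution to keep."""
    return [(a, b) for a, b in matching if pair_score(a, b) >= threshold]

Choosing the threshold relative to the average pair contribution, rather than as an absolute value, would keep the pruning comparable across network pairs of very different sizes.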

Figure 4.10: Average number of protein matches found for each algorithm across α values

4.4 Biological Comparison

4.4.1 Average Biological Results

In this section, a comparison is made between two of the best constrained alignment algorithms, NetalignMR and Mutual Exclusion, and two of the best unconstrained alignment algorithms, Mutual Exclusion and L-GRAAL. A comparison of the average of the biological scores, both overall and within the found LCCs, can be seen in Figures 4.11 and 4.12. The plots show all four algorithms plus the average of matching two random proteins together for the constrained and unconstrained cases. This shows that the unconstrained alignment algorithms score worse than a random constrained alignment until an α value of around .5 or .6 is used. At α equal to 1, a sharp spike can be seen in the LCC graph, while the overall average graph shows a decline. The primary reason for the LCC spike is the small size of the LCCs at α equal to 1, as shown in Figure 4.13, which makes the biological scoring of an LCC at α equal to 1 a poor metric. Overall, both unconstrained algorithms fail to perform as well as the constrained algorithms, showing that there still exists a strong correlation between sequence similarity and GO term similarity.

Figure 4.11: Average of biological scores across various α

4.4.2 Topological Usefulness

The final analysis between the algorithms shows that the algorithms did make use of the topological information, and not just the sequence similarity information, when selecting biological matches. Figures 4.14 and 4.15 show the number of times each α value was used to find the best biological results. Both the overall and LCC averages make use of several α values, finding maximum biological similarities using a combination of topological and sequence similarity data. However, it is apparent that in the majority of cases, maximizing biological similarity was driven mainly by the sequence similarity data.

Figure 4.12: Average of LCC biological scores across various α

Figure 4.13: Average of LCC size across all α

Figure 4.14: The number of times each algorithm found the maximum biological similarity between two species for a given α

4.5 Future Improvements

Of the 682 comparison tests run by both L-GRAAL and Mutual Exclusion Matching, L-GRAAL scored higher by more than 30% fewer than 50 times. Of those times, it won by both more than 30% and more than 500 points only 4 times. In contrast, Mutual Exclusion outscored L-GRAAL by both 500 points and 30% over 90 times. Analyzing the 4 times Mutual Exclusion loses by a large amount reveals a trend. Figure 4.16 shows the scores of L-GRAAL and Mutual Exclusion across all α values for the alignment of the Fission Yeast and Fruit Fly PPIs. In the figure we can see that Mutual Exclusion wins with α values less than .4, but sharply declines at α equal to .5, without ever recovering thereafter. Given that scoring is based on Equation 2.1, it is clear that this is less than optimal. As α increases by .1 each step, β decreases by .1 each step. Assuming all of the scoring in the previous step came from the β side of the equation, the score should drop by no more than 10% per α step, yet the score drops by over 40% between α values of .4 and .5. This is a sign that the algorithm gets stuck at a local maximum far earlier than desirable. Since Mutual Exclusion Matching is a greedy algorithm, it only moves towards steps that increase its goal, which means it can fail to see better options multiple steps ahead.

Figure 4.15: The number of times each algorithm found the maximum biological similarity between two species for a given α from 0 to 0.9
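To make this reasoning explicit, suppose Equation 2.1 has the convex-combination form (an assumption about its exact form, with $T(M)$ the topology term and $W(M)$ the sequence-similarity term of a matching $M$):

$S_\alpha(M) = \alpha\, T(M) + (1 - \alpha)\, W(M).$

For a fixed matching $M$,

$S_{\alpha + 0.1}(M) - S_\alpha(M) = 0.1\,(T(M) - W(M)) \ge -0.1\, W(M),$

so simply reusing the matching found at the previous α loses at most $0.1\, W(M)$ in score; a drop of over 40% between α values of .4 and .5 therefore indicates that the search settled on a substantially worse matching, rather than the reweighting itself forcing the decline.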

A good indication of how well the algorithm is working can be obtained by determining how often the algorithm produces a suboptimal solution such as the one shown in Figure 4.16. Rather than look for areas with more than a 10% decline, we will look for what we define here as a suboptimal local search space matching. A search space is a given set of states that an algorithm can move through, containing all possible solutions to the problem. Since the number of states in the total search space of this algorithm is far too large to iterate through, we define the local search space as the set of states that the algorithm went through for each PPI comparison across all α. That is, the matchings produced at each iteration of each run of a species pair make up the local search space for that species pair. Using this local search space, we can compare the solutions found in each iteration of each comparison made on each set of PPIs to the final solutions chosen, and see whether or not a better result should have been found.

Figure 4.16: Scores produced by aligning Drosophila melanogaster and Schizosaccharomyces pombe 972h by α value
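The local search space comparison just described can be sketched as follows; the names are illustrative, states[alpha] is assumed to hold every matching visited during the run for that α (with the final, chosen matching last), and score(matching, alpha) is assumed to evaluate Equation 2.1.

def local_search_space_report(states, score, alphas):
    report = {}
    for target in alphas:
        chosen = score(states[target][-1], target)      # solution actually returned for this alpha
        best_value, best_source = chosen, target
        for source in alphas:                           # scan every run's visited states
            for matching in states[source]:
                value = score(matching, target)
                if value > best_value:
                    best_value, best_source = value, source
        report[target] = {
            "best_found_while_solving": best_source,                                  # Tables 4.10 and 4.12
            "fraction_of_local_optimum": chosen / best_value if best_value else 1.0,  # Tables 4.11 and 4.13
        }
    return report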

Tables 4.10 and 4.11 show how well the constrained algorithm did within its local search space. Table 4.10 shows where the best solution was found for each α weighting. For example, the number in the 4th row down, 2nd column, is 29: the best possible solution for the fitness function with α equal to 0.3 was found 29 times while searching for a good solution with a fitness function α of .1. Thus, 29 times it would have been better to try to solve the problem with an α value of .1 to get a solution for an α value of .3. Ideally, the cells moving diagonally from the top left to the bottom right would all contain 64, representing that for all 64 constrained attempts the best solution for each α was found while searching for the solution at that same α value. Unfortunately, this is not the case. While the algorithm does well, α values .2 through .7 do not solve the problem as well as a lower α value could, while the solution for the fitness score at an α value of 0.0 can commonly be improved by using an α value of .1. Table 4.11 shows the percentage of the maximum value in the local search space attained by the found solution. The algorithm runs at only 98% of optimal for the lowest values of α, slowly increasing to 100% as α increases. This means that only small improvements can be made from the local search space, so the current matching is not too bad. This can probably be attributed to the simplicity of the constrained problem and the number of attempts to run it with different stepWeights.

Table 4.10: Comparison of α values used for scoring for the constrained problem (y axis) vs α values used for finding the best solution (x axis)
α | 0 | .1 | .2 | .3 | .4 | .5 | .6 | .7 | .8 | .9 | 1
0.0 | 36 | 25 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0.1 | 2 | 46 | 15 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0.2 | 0 | 31 | 27 | 5 | 0 | 1 | 0 | 0 | 0 | 0 | 0
0.3 | 0 | 29 | 16 | 15 | 1 | 1 | 2 | 0 | 0 | 0 | 0
0.4 | 0 | 15 | 19 | 10 | 15 | 1 | 4 | 0 | 0 | 0 | 0
0.5 | 0 | 6 | 11 | 12 | 13 | 15 | 5 | 1 | 1 | 0 | 0
0.6 | 0 | 1 | 3 | 3 | 9 | 10 | 17 | 14 | 6 | 1 | 0
0.7 | 0 | 0 | 0 | 0 | 2 | 2 | 9 | 31 | 18 | 2 | 0
0.8 | 0 | 0 | 0 | 0 | 1 | 1 | 3 | 3 | 42 | 14 | 0
0.9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 59 | 0
1.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 64

Table 4.11: Percentage of the optimal solution found for each α for the constrained problem
α | Percentage
0.0 | 0.983
0.1 | 0.989
0.2 | 0.990
0.3 | 0.990
0.4 | 0.991
0.5 | 0.993
0.6 | 0.996
0.7 | 0.998
0.8 | 0.999
0.9 | 1.0
1.0 | 1.0

Table 4.12: Comparison of α values used for scoring for the unconstrained problem (y axis) vs α values used for finding the best solution (x axis)
α | 0 | .1 | .2 | .3 | .4 | .5 | .6 | .7 | .8 | .9 | 1
0.0 | 45 | 9 | 5 | 3 | 3 | 1 | 0 | 0 | 0 | 0 | 0
0.1 | 27 | 16 | 11 | 4 | 5 | 2 | 1 | 0 | 0 | 0 | 0
0.2 | 22 | 16 | 15 | 4 | 6 | 2 | 1 | 0 | 0 | 0 | 0
0.3 | 19 | 18 | 13 | 7 | 5 | 2 | 2 | 0 | 0 | 0 | 0
0.4 | 16 | 18 | 13 | 6 | 8 | 3 | 2 | 0 | 0 | 0 | 0
0.5 | 9 | 19 | 11 | 11 | 9 | 4 | 2 | 1 | 0 | 0 | 0
0.6 | 6 | 20 | 9 | 12 | 6 | 2 | 5 | 5 | 1 | 0 | 0
0.7 | 5 | 26 | 16 | 6 | 3 | 2 | 3 | 4 | 1 | 0 | 0
0.8 | 4 | 21 | 17 | 6 | 3 | 2 | 4 | 2 | 5 | 2 | 0
0.9 | 4 | 4 | 16 | 14 | 9 | 4 | 3 | 3 | 1 | 8 | 0
1.0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 62

Table 4.13: Percentage of the optimal solution found for each α for the unconstrained problem
α | Percentage
0.0 | 0.984
0.1 | 0.940
0.2 | 0.927
0.3 | 0.906
0.4 | 0.827
0.5 | 0.814
0.6 | 0.820
0.7 | 0.869
0.8 | 0.928
0.9 | 0.975
1.0 | 1.0

The performance results are much worse for the unconstrained matching algorithm. Tables 4.12 and 4.13 show that Mutual Exclusion Matching performs poorly for most high α values, getting stuck on local maxima far earlier than desired. The percentage breakdown shows that the algorithm runs at only 80% - 86% of optimal for some α values, based on the known possible solutions. These results suggest that the algorithm gets stuck on local maxima too early when attempting to find a solution with a moderate to high α value. This could be caused by the initial step in the algorithm. When the algorithm first starts, the first solution it selects is based on the weight of each possible matching and the percentage of total possible edges that are connected to that node matching. Since several of the larger graphs have a much larger number of possible edges than they have total weight, the initial step commonly overemphasizes the weights, and continues to do so for several steps afterwards. Future progress on this algorithm should try to offset this by either making the α value flexible while the algorithm is running, allowing more edges to be found earlier by using a lower α, or by seeding the initial weighting matrix with the values found by solving the matching with α equal to zero. Finding a way to more accurately give priority to edge weights could yield up to approximately a 20% increase for certain α values, which could lead to a 10% overall improvement in the algorithm's performance. Notably, the α range .4 - .8 also includes the primary region where L-GRAAL outscores Mutual Exclusion in biological similarity measures, a result that may change with a more optimal algorithm.
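Either suggested remedy could be prototyped along the lines below; run_mutual_exclusion, its parameters, and the warm-up schedule are hypothetical stand-ins, not part of the existing code.

def run_with_topology_seed(run_mutual_exclusion, graphs, target_alpha, warmup_iterations=20):
    # Warm up with alpha = 0 so conserved edges are emphasized early.
    seed = run_mutual_exclusion(graphs, alpha=0.0, max_iterations=warmup_iterations)
    # Continue at the target alpha, seeding the weighting matrix with the warm-up result.
    return run_mutual_exclusion(graphs, alpha=target_alpha, initial_matching=seed)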

Chapter 5

Conclusion

5.1 Conclusion

Aligning Protein-Protein Interaction networks is an important task, providing insight into how a species works, which could lead to, among other things, a better understanding of phylogenetic relationships between species and more insight into how diseases work. The importance of aligning networks has led to a large variety of alignment algorithms being produced to try to solve the NP-Hard PPI network alignment problem. In the preceding sections, we outlined some of the methods previously used to solve the problem and presented our own new algorithm, Mutual Exclusion Matching. Mutual Exclusion Matching aligns a pair of Protein-Protein Interaction networks quickly while producing a high number of conserved edges, with an Edge Correctness measure of over 95% when aligning several moderate sized networks to larger networks. We have also shown that the algorithm produces the highest scores of any algorithm tested and yields near optimal solutions on the constrained cases testing smaller networks with smaller amounts of data. Mutual Exclusion Matching also conserved the largest LCCs of any alignment algorithm. Furthermore, the results showed that while a better biological matching can be derived from only nodes with similar sequences, matching the entire networks can yield a biological mapping almost as good, with a much larger amount of topology conserved. Finally, we showed that despite Mutual Exclusion Matching outscoring all other algorithms in terms of fitness score, it still has room for improvement. By lowering running times with a greedy matching strategy, masking unlikely pairs or loosening constraints to try to yield more optimal results, or adding a function that removes low scoring pairs from the final matching, the algorithm and its results could be improved further.

Bibliography

[1] James J McGregor. Backtrack search algorithms and the maximal common subgraph

problem. Software: Practice and Experience, 12(1):23–34, 1982.

[2] Luigi P Cordella, Pasquale Foggia, Carlo Sansone, and Mario Vento. Fast graph

matching for detecting cad image components. In Pattern Recognition, 2000.

Proceedings. 15th International Conference on, volume 2, pages 1034–1037. IEEE,

2000.

[3] Chun-Ta Lu, Hong-Han Shuai, and S Yu Philip. Identifying your customers in social

networks. 2014.

[4] Uwe Schöning. Graph isomorphism is in the low hierarchy. Journal of Computer and

System Sciences, 37(3):312–323, 1988.

[5] Scott Fortin. The graph isomorphism problem. Technical report, Technical Report

96-20, University of Alberta, Edomonton, Alberta, Canada, 1996.

[6] Anna Lubiw. Some np-complete problems similar to graph isomorphism. SIAM

Journal on Computing, 10(1):11–21, 1981.

[7] Colin De La Higuera, Jean-Christophe Janodet, Émilie Samuel, Guillaume Damiand,

64 and Christine Solnon. Polynomial algorithms for open plane graph and subgraph

isomorphisms. Theoretical Computer Science, 498:76–99, 2013.

[8] Jiří Matoušek and Robin Thomas. On the complexity of finding iso- and other

morphisms for partial k-trees. Discrete Mathematics, 108(1):343–364, 1992.

[9] Julian R Ullmann. An algorithm for subgraph isomorphism. Journal of the ACM

(JACM), 23(1):31–42, 1976.

[10] Luigi P Cordella, Pasquale Foggia, Carlo Sansone, and Mario Vento. A (sub) graph

isomorphism algorithm for matching large graphs. Pattern Analysis and Machine

Intelligence, IEEE Transactions on, 26(10):1367–1372, 2004.

[11] Birgit Jenner, Klaus-Jörn Lange, and Pierre McKenzie. Tree isomorphism and some

other complete problems for deterministic logspace. DIRO, Université de Montréal,

997, 1997.

[12] Guillaume Damiand, Colin de la Higuera, Jean-Christophe Janodet, Émilie Samuel,

and Christine Solnon. A polynomial algorithm for submap isomorphism.

[13] Huahai He and Ambuj K Singh. Closure-tree: An index structure for graph queries. In

Data Engineering, 2006. ICDE’06. Proceedings of the 22nd International Conference

on, pages 38–38. IEEE, 2006.

[14] David Knossow, Avinash Sharma, Diana Mateus, and Radu Horaud. Inexact matching

of large and sparse graphs using laplacian eigenvectors. In Graph-Based

Representations in Pattern Recognition, pages 144–153. Springer, 2009.

[15] Brian P Kelley, Roded Sharan, Richard M Karp, Taylor Sittler, David E Root,

Brent R Stockwell, and Trey Ideker. Conserved pathways within bacteria and yeast as

revealed by global protein network alignment. Proceedings of the National Academy of

Sciences, 100(20):11394–11399, 2003.

[16] Rohit Singh, Jinbo Xu, and Bonnie Berger. Global alignment of multiple protein

interaction networks with application to functional orthology detection. Proceedings of

the National Academy of Sciences, 105(35):12763–12768, 2008.

[17] Jean-François Rual, Kavitha Venkatesan, Tong Hao, Tomoko Hirozane-Kishikawa,

Amélie Dricot, Ning Li, Gabriel F Berriz, Francis D Gibbons, Matija Dreze, Nono

Ayivi-Guedehoussou, et al. Towards a proteome-scale map of the human

protein–protein interaction network. Nature, 437(7062):1173–1178, 2005.

[18] Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J

Lipman. Basic local alignment search tool. Journal of molecular biology,

215(3):403–410, 1990.

[19] Stephen Altschul. The statistics of sequence similarity scores.

[20] Michael Ashburner, Catherine A Ball, Judith A Blake, David Botstein, Heather

Butler, J Michael Cherry, Allan P Davis, Kara Dolinski, Selina S Dwight, Janan T

Eppig, et al. Gene ontology: tool for the unification of biology. Nature genetics,

25(1):25–29, 2000.

[21] Ulrich Stelzl, Uwe Worm, Maciej Lalowski, Christian Haenig, Felix H Brembeck, Heike

Goehler, Martin Stroedicke, Martina Zenkner, Anke Schoenherr, Susanne Koeppen,

66 et al. A human protein-protein interaction network: a resource for annotating the

proteome. Cell, 122(6):957–968, 2005.

[22] Yuen Ho, Albrecht Gruhler, Adrian Heilbut, Gary D Bader, Lynda Moore, Sally-Lin

Adams, Anna Millar, Paul Taylor, Keiryn Bennett, Kelly Boutilier, et al. Systematic

identification of protein complexes in saccharomyces cerevisiae by mass spectrometry.

Nature, 415(6868):180–183, 2002.

[23] Brian P Kelley, Bingbing Yuan, Fran Lewitter, Roded Sharan, Brent R Stockwell, and

Trey Ideker. Pathblast: a tool for alignment of protein interaction networks. Nucleic

acids research, 32(suppl 2):W83–W88, 2004.

[24] Roded Sharan, Silpa Suthram, Ryan M Kelley, Tanja Kuhn, Scott McCuine, Peter

Uetz, Taylor Sittler, Richard M Karp, and Trey Ideker. Conserved patterns of protein

interaction in multiple species. Proceedings of the National Academy of Sciences of the

United States of America, 102(6):1974–1979, 2005.

[25] Mehmet Koyut¨urk,Ananth Grama, and Wojciech Szpankowski. Pairwise local

alignment of protein interaction networks guided by models of evolution. In Research

in Computational Molecular Biology, pages 48–65. Springer, 2005.

[26] Philip Resnik. Using information content to evaluate semantic similarity in a

taxonomy. arXiv preprint cmp-lg/9511007, 1995.

[27] Noël Malod-Dognin and Nataša Pržulj. L-graal: Lagrangian graphlet-based network

aligner. Bioinformatics, page btv130, 2015.

[28] Oleksii Kuchaiev, Tijana Milenković, Vesna Memišević, Wayne Hayes, and Nataša

Pržulj. Topological network alignment uncovers biological function and phylogeny.

Journal of the Royal Society Interface, 7(50):1341–1354, 2010.

[29] Oleksii Kuchaiev and Nataša Pržulj. Integrative network alignment reveals large

regions of global network similarity in yeast and human. Bioinformatics,

27(10):1390–1396, 2011.

[30] Somaye Hashemifar and Jinbo Xu. Hubalign: an accurate and efficient method for

global alignment of protein–protein interaction networks. Bioinformatics,

30(17):i438–i444, 2014.

[31] Rob Patro and Carl Kingsford. Global network alignment using multiscale spectral

signatures. Bioinformatics, 28(23):3105–3114, 2012.

[32] Vikram Saraph and Tijana Milenković. Magna: maximizing accuracy in global

network alignment. Bioinformatics, 30(20):2931–2940, 2014.

[33] Behnam Neyshabur, Ahmadreza Khadem, Somaye Hashemifar, and Seyed Shahriar

Arab. Netal: a new graph-based method for global alignment of protein–protein

interaction networks. Bioinformatics, 29(13):1654–1662, 2013.

[34] Jason Flannick, Antal Novak, Chuong B Do, Balaji S Srinivasan, and Serafim

Batzoglou. Automatic parameter learning for multiple local network alignment.

Journal of computational biology, 16(8):1001–1022, 2009.

[35] Rashid Ibragimov, Maximilian Malek, Jiong Guo, and Jan Baumbach. Gedevo: an

evolutionary graph edit distance algorithm for biological network alignment. In

OASIcs-OpenAccess Series in Informatics, volume 34. Schloss

Dagstuhl-Leibniz-Zentrum fuer Informatik, 2013.

[36] Hang TT Phan and Michael JE Sternberg. Pinalog: a novel approach to align protein

interaction networks - implications for complex detection and function prediction.

Bioinformatics, 28(9):1239–1245, 2012.

[37] Ahmet E Aladağ and Cesim Erten. Spinal: scalable protein interaction network

alignment. Bioinformatics, 29(7):917–924, 2013.

[38] Leonid Chindelevitch, Cheng-Yu Ma, Chung-Shou Liao, and Bonnie Berger.

Optimizing a global alignment of protein interaction networks. Bioinformatics, page

btt486, 2013.

[39] Gunnar W Klau. A new graph-based method for pairwise global network alignment.

BMC bioinformatics, 10(Suppl 1):S59, 2009.

[40] Mohammed El-Kebir, Jaap Heringa, and Gunnar W Klau. Lagrangian relaxation

applied to sparse global network alignment. In Pattern Recognition in Bioinformatics,

pages 225–236. Springer, 2011.

[41] Mohsen Bayati, David F Gleich, Amin Saberi, and Ying Wang. Message-passing

algorithms for sparse network alignment. ACM Transactions on Knowledge Discovery

from Data (TKDD), 7(1):3, 2013.

[42] Mohsen Bayati, Margot Gerritsen, David F Gleich, Amin Saberi, and Ying Wang.

Algorithms for large, sparse network alignment problems. In Data Mining, 2009.

ICDM’09. Ninth IEEE International Conference on, pages 705–710. IEEE, 2009.

[43] Chris Stark, Bobby-Joe Breitkreutz, Teresa Reguly, Lorrie Boucher, Ashton

Breitkreutz, and Mike Tyers. Biogrid: a general repository for interaction datasets.

Nucleic acids research, 34(suppl 1):D535–D539, 2006.

[44] Peter Andolfatto. Adaptive evolution of non-coding dna in drosophila. 2005.

[45] Chava Kimchi-Sarfaty, Jung Mi Oh, In-Wha Kim, Zuben E Sauna, Anna Maria

Calcagno, Suresh V Ambudkar, and Michael M Gottesman. A "silent" polymorphism

in the mdr1 gene changes substrate specificity. Science, 315(5811):525–528, 2007.

[46] UniProt Consortium et al. The universal protein resource (uniprot). Nucleic acids

research, 36(suppl 1):D190–D195, 2008.

Appendix A

Table A.1: Species Tested
Species | Description | Nodes | Edges
Arabidopsis thaliana Columbia | Mouse-ear cress, plant | 8221 | 21058
Bos taurus | Domestic Cow | 374 | 320
Caenorhabditis elegans | Roundworm | 3902 | 7782
Drosophila melanogaster | Fruit Fly | 8165 | 38922
Homo sapiens | Human | 18788 | 174792
Human Immunodeficiency Virus 1 | HIV-1 | 1021 | 1018
Mus musculus | Mouse | 8296 | 17675
Plasmodium falciparum 3D7 | Malaria parasite | 1226 | 2442
Rattus norvegicus | Rat | 3085 | 3782
Saccharomyces cerevisiae S288c | Baker's Yeast | 6274 | 222891
Schizosaccharomyces pombe 972h | Fission Yeast | 3889 | 53749
Xenopus laevis | African Clawed Frog | 462 | 486

Table A.2: Species Pairs by Cross Product Graph Size

Species1 | Species2 | Edge Product | Weight Total | Constrained Edge Product
Homo sapiens | Saccharomyces cerevisiae S288c | 38959563672 | 212259 | 5568202
Saccharomyces cerevisiae S288c | Schizosaccharomyces pombe 972h | 11980168359 | 42955 | 1400664
Homo sapiens | Schizosaccharomyces pombe 972h | 9394895208 | 150699 | 1658376
Drosophila melanogaster | Saccharomyces cerevisiae S288c | 8675363502 | 67387 | 557744
Drosophila melanogaster | Homo sapiens | 6803254224 | 425448 | 1122158
Arabidopsis thaliana Columbia | Saccharomyces cerevisiae S288c | 4693638678 | 94110 | 951616
Mus musculus | Saccharomyces cerevisiae S288c | 3939598425 | 122511 | 750062
Arabidopsis thaliana Columbia | Homo sapiens | 3680769936 | 416700 | 1601854
Homo sapiens | Mus musculus | 3089448600 | 758331 | 1830878
Drosophila melanogaster | Schizosaccharomyces pombe 972h | 2092018578 | 48341 | 153432
Caenorhabditis elegans | Saccharomyces cerevisiae S288c | 1734537762 | 36063 | 235244
Caenorhabditis elegans | Homo sapiens | 1360231344 | 200403 | 418588
Arabidopsis thaliana Columbia | Schizosaccharomyces pombe 972h | 1131846442 | 68982 | 293394
Mus musculus | Schizosaccharomyces pombe 972h | 950013575 | 90766 | 239968
Rattus norvegicus | Saccharomyces cerevisiae S288c | 842973762 | 48564 | 136308
Arabidopsis thaliana Columbia | Drosophila melanogaster | 819619476 | 126234 | 139306
Drosophila melanogaster | Mus musculus | 687946350 | 226805 | 182356
Homo sapiens | Rattus norvegicus | 661063344 | 277942 | 333842
Plasmodium falciparum 3D7 | Saccharomyces cerevisiae S288c | 544299822 | 9864 | 12528
Homo sapiens | Plasmodium falciparum 3D7 | 426842064 | 43290 | 29100
Caenorhabditis elegans | Schizosaccharomyces pombe 972h | 418274718 | 26249 | 67334
Arabidopsis thaliana Columbia | Mus musculus | 372200150 | 246615 | 237702
Caenorhabditis elegans | Drosophila melanogaster | 302891004 | 61204 | 47378
Human Immunodeficiency Virus 1 | Saccharomyces cerevisiae S288c | 226903038 | 13244 | 1158
Rattus norvegicus | Schizosaccharomyces pombe 972h | 203278718 | 35554 | 37446
Homo sapiens | Human Immunodeficiency Virus 1 | 177938256 | 60330 | 8386
Arabidopsis thaliana Columbia | Caenorhabditis elegans | 163873356 | 72876 | 66240
Drosophila melanogaster | Rattus norvegicus | 147203004 | 79479 | 25632
Caenorhabditis elegans | Mus musculus | 137546850 | 116104 | 63868
Plasmodium falciparum 3D7 | Schizosaccharomyces pombe 972h | 131255058 | 7488 | 3208
Saccharomyces cerevisiae S288c | Xenopus laevis | 108325026 | 10315 | 50072
Drosophila melanogaster | Plasmodium falciparum 3D7 | 95047524 | 16599 | 3148
Homo sapiens | Xenopus laevis | 84948912 | 47030 | 81490
Arabidopsis thaliana Columbia | Rattus norvegicus | 79641356 | 101957 | 48168
Bos taurus | Saccharomyces cerevisiae S288c | 71325120 | 9180 | 16630
Mus musculus | Rattus norvegicus | 66846850 | 162702 | 56256
Bos taurus | Homo sapiens | 55933440 | 44075 | 29924
Human Immunodeficiency Virus 1 | Schizosaccharomyces pombe 972h | 54716482 | 10560 | 232
Arabidopsis thaliana Columbia | Plasmodium falciparum 3D7 | 51423636 | 19261 | 2586
Mus musculus | Plasmodium falciparum 3D7 | 43162350 | 25763 | 3086
Drosophila melanogaster | Human Immunodeficiency Virus 1 | 39622596 | 19787 | 106
Caenorhabditis elegans | Rattus norvegicus | 29431524 | 46093 | 11938
Schizosaccharomyces pombe 972h | Xenopus laevis | 26122014 | 8018 | 14648
Arabidopsis thaliana Columbia | Human Immunodeficiency Virus 1 | 21437044 | 22747 | 302
Caenorhabditis elegans | Plasmodium falciparum 3D7 | 19003644 | 7914 | 686
Drosophila melanogaster | Xenopus laevis | 18916092 | 13659 | 6376
Human Immunodeficiency Virus 1 | Mus musculus | 17993150 | 36763 | 128
Bos taurus | Schizosaccharomyces pombe 972h | 17199680 | 6669 | 4560
Bos taurus | Drosophila melanogaster | 12455040 | 12129 | 2802
Arabidopsis thaliana Columbia | Xenopus laevis | 10234188 | 20408 | 16076
Plasmodium falciparum 3D7 | Rattus norvegicus | 9235644 | 9531 | 738
Mus musculus | Xenopus laevis | 8590050 | 29565 | 12582
Caenorhabditis elegans | Human Immunodeficiency Virus 1 | 7922076 | 10110 | 10
Arabidopsis thaliana Columbia | Bos taurus | 6738560 | 19588 | 4442
Bos taurus | Mus musculus | 5656000 | 26682 | 5010
Human Immunodeficiency Virus 1 | Rattus norvegicus | 3850076 | 13654 | 20
Caenorhabditis elegans | Xenopus laevis | 3782052 | 8315 | 3424
Bos taurus | Caenorhabditis elegans | 2490240 | 7837 | 998
Human Immunodeficiency Virus 1 | Plasmodium falciparum 3D7 | 2485956 | 3776 | 2
Rattus norvegicus | Xenopus laevis | 1838052 | 11631 | 2646
Bos taurus | Rattus norvegicus | 1210240 | 11648 | 1296
Plasmodium falciparum 3D7 | Xenopus laevis | 1186812 | 1931 | 164
Bos taurus | Plasmodium falciparum 3D7 | 781440 | 1860 | 108
Human Immunodeficiency Virus 1 | Xenopus laevis | 494748 | 0 | 0
Bos taurus | Human Immunodeficiency Virus 1 | 325760 | 0 | 0
Bos taurus | Xenopus laevis | 155520 | 2353 | 298

Table A.3: Species Pairs by Conserved Cross Product Graph Size

Species1 | Species2 | Edge Product | Weight Total | Constrained Edge Product
Homo sapiens | Saccharomyces cerevisiae S288c | 38959563672 | 212259 | 5568202
Homo sapiens | Mus musculus | 3089448600 | 758331 | 1830878
Homo sapiens | Schizosaccharomyces pombe 972h | 9394895208 | 150699 | 1658376
Arabidopsis thaliana Columbia | Homo sapiens | 3680769936 | 416700 | 1601854
Saccharomyces cerevisiae S288c | Schizosaccharomyces pombe 972h | 11980168359 | 42955 | 1400664
Drosophila melanogaster | Homo sapiens | 6803254224 | 425448 | 1122158
Arabidopsis thaliana Columbia | Saccharomyces cerevisiae S288c | 4693638678 | 94110 | 951616
Mus musculus | Saccharomyces cerevisiae S288c | 3939598425 | 122511 | 750062
Drosophila melanogaster | Saccharomyces cerevisiae S288c | 8675363502 | 67387 | 557744
Caenorhabditis elegans | Homo sapiens | 1360231344 | 200403 | 418588
Homo sapiens | Rattus norvegicus | 661063344 | 277942 | 333842
Arabidopsis thaliana Columbia | Schizosaccharomyces pombe 972h | 1131846442 | 68982 | 293394
Mus musculus | Schizosaccharomyces pombe 972h | 950013575 | 90766 | 239968
Arabidopsis thaliana Columbia | Mus musculus | 372200150 | 246615 | 237702
Caenorhabditis elegans | Saccharomyces cerevisiae S288c | 1734537762 | 36063 | 235244
Drosophila melanogaster | Mus musculus | 687946350 | 226805 | 182356
Drosophila melanogaster | Schizosaccharomyces pombe 972h | 2092018578 | 48341 | 153432
Arabidopsis thaliana Columbia | Drosophila melanogaster | 819619476 | 126234 | 139306
Rattus norvegicus | Saccharomyces cerevisiae S288c | 842973762 | 48564 | 136308
Homo sapiens | Xenopus laevis | 84948912 | 47030 | 81490
Caenorhabditis elegans | Schizosaccharomyces pombe 972h | 418274718 | 26249 | 67334
Arabidopsis thaliana Columbia | Caenorhabditis elegans | 163873356 | 72876 | 66240
Caenorhabditis elegans | Mus musculus | 137546850 | 116104 | 63868
Mus musculus | Rattus norvegicus | 66846850 | 162702 | 56256
Saccharomyces cerevisiae S288c | Xenopus laevis | 108325026 | 10315 | 50072
Arabidopsis thaliana Columbia | Rattus norvegicus | 79641356 | 101957 | 48168
Caenorhabditis elegans | Drosophila melanogaster | 302891004 | 61204 | 47378
Rattus norvegicus | Schizosaccharomyces pombe 972h | 203278718 | 35554 | 37446
Bos taurus | Homo sapiens | 55933440 | 44075 | 29924
Homo sapiens | Plasmodium falciparum 3D7 | 426842064 | 43290 | 29100
Drosophila melanogaster | Rattus norvegicus | 147203004 | 79479 | 25632
Bos taurus | Saccharomyces cerevisiae S288c | 71325120 | 9180 | 16630
Arabidopsis thaliana Columbia | Xenopus laevis | 10234188 | 20408 | 16076
Schizosaccharomyces pombe 972h | Xenopus laevis | 26122014 | 8018 | 14648
Mus musculus | Xenopus laevis | 8590050 | 29565 | 12582
Plasmodium falciparum 3D7 | Saccharomyces cerevisiae S288c | 544299822 | 9864 | 12528
Caenorhabditis elegans | Rattus norvegicus | 29431524 | 46093 | 11938
Homo sapiens | Human Immunodeficiency Virus 1 | 177938256 | 60330 | 8386
Drosophila melanogaster | Xenopus laevis | 18916092 | 13659 | 6376
Bos taurus | Mus musculus | 5656000 | 26682 | 5010
Bos taurus | Schizosaccharomyces pombe 972h | 17199680 | 6669 | 4560
Arabidopsis thaliana Columbia | Bos taurus | 6738560 | 19588 | 4442
Caenorhabditis elegans | Xenopus laevis | 3782052 | 8315 | 3424
Plasmodium falciparum 3D7 | Schizosaccharomyces pombe 972h | 131255058 | 7488 | 3208
Drosophila melanogaster | Plasmodium falciparum 3D7 | 95047524 | 16599 | 3148
Mus musculus | Plasmodium falciparum 3D7 | 43162350 | 25763 | 3086
Bos taurus | Drosophila melanogaster | 12455040 | 12129 | 2802
Rattus norvegicus | Xenopus laevis | 1838052 | 11631 | 2646
Arabidopsis thaliana Columbia | Plasmodium falciparum 3D7 | 51423636 | 19261 | 2586
Bos taurus | Rattus norvegicus | 1210240 | 11648 | 1296
Human Immunodeficiency Virus 1 | Saccharomyces cerevisiae S288c | 226903038 | 13244 | 1158
Bos taurus | Caenorhabditis elegans | 2490240 | 7837 | 998
Plasmodium falciparum 3D7 | Rattus norvegicus | 9235644 | 9531 | 738
Caenorhabditis elegans | Plasmodium falciparum 3D7 | 19003644 | 7914 | 686
Arabidopsis thaliana Columbia | Human Immunodeficiency Virus 1 | 21437044 | 22747 | 302
Bos taurus | Xenopus laevis | 155520 | 2353 | 298
Human Immunodeficiency Virus 1 | Schizosaccharomyces pombe 972h | 54716482 | 10560 | 232
Plasmodium falciparum 3D7 | Xenopus laevis | 1186812 | 1931 | 164
Human Immunodeficiency Virus 1 | Mus musculus | 17993150 | 36763 | 128
Bos taurus | Plasmodium falciparum 3D7 | 781440 | 1860 | 108
Drosophila melanogaster | Human Immunodeficiency Virus 1 | 39622596 | 19787 | 106
Human Immunodeficiency Virus 1 | Rattus norvegicus | 3850076 | 13654 | 20
Caenorhabditis elegans | Human Immunodeficiency Virus 1 | 7922076 | 10110 | 10
Human Immunodeficiency Virus 1 | Plasmodium falciparum 3D7 | 2485956 | 3776 | 2
Human Immunodeficiency Virus 1 | Xenopus laevis | 494748 | 0 | 0
Bos taurus | Human Immunodeficiency Virus 1 | 325760 | 0 | 0