View the System of Interactions That Occur Globally
Total Page:16
File Type:pdf, Size:1020Kb
Mutually Exclusive Weighted Graph Matching Algorithm for Protein-Protein Interaction Network Alignment A thesis submitted to Graduate School of the University of Cincinnati in partial fulfilment for the requirements for the degree of Master of Science in the Department of Electrical Engineering and Computer Science by Brandan Dunham Bachelor of Science, Shawnee State University May 2011 Thesis Advisor & Committee Chair: Dr. Yizong Cheng Abstract Motivation: Understanding and analyzing proteins interacting within a species is a topic of interest in various biological studies. These interactions can be aligned as a full network, allowing research to view the system of interactions that occur globally. Given this global alignment, one common area of research is comparing the protein interactions produced by different species to determine similarity. Results: In this paper, we analyze current alignment methodologies for both a constrained and an unconstrained version of the protein-protein interaction network alignment problem. We also provide a new algorithm, Mutual Exclusion Matching, which scores higher and finds larger alignments than most other algorithms. Finally, we analyze all algorithms tested in a variety of biological and topological measures, showing that Mutual Exclusion produces some of the best alignments found throughout the course of the study, including finding larger connected components, conserving a larger number of edges, and finding some of the highest biologically scoring alignments when compared to other algorithms. Code necessary for running Mutual Exclusion Matching, as well as some example datasets, can be found at: https://sourceforge.net/projects/mutual-exclusion-matching/ ii Acknowledgments I would like to thank my advisors, Dr. Cheng, for advising my thesis and providing suggestions on what I needed to do next at various writing steps, Dr. Ralescu, for lectures on artificial intelligence and always being wiling to spend time talking to me about problems I may have had, and Dr. Meller, for teaching me about the biological data and resources that made studying this problem possible. I would also like to thank Shikha Chaganti, for helping me proofread and review various aspects of this paper. Without her assistance, writing this paper would have been much more difficult. Additionally, I would like to thank her and Wilma Lam for working with me in classes and making my time in Cincinnati more enjoyable. Finally, I would like to thank my parents, Chris and Dona Dunham, for supporting my decision to go to graduate school and assisting me when I needed help. iv Contents Chapter 1 Introduction 1 1.1 Introduction . .1 Chapter 2 History 4 2.1 Problem Background . .4 2.2 Terminology . .7 2.2.1 Protein-Protein Interaction Network . .7 2.2.2 Local and Global Graph Alignment . .7 2.2.3 Constrained and Unconstrained Graph Matching Problem . .7 2.2.4 Edge Correctness . .8 2.2.5 Sequence Similarity . .8 2.2.6 Functional Similarity . .9 2.2.7 Maximum Common Subgraph . .9 2.2.8 Largest Connected Component . .9 2.3 Local Aligners . 10 2.4 Global Aligners . 10 2.5 Global Alignment Algorithms . 14 v 2.5.1 GRAAL . 14 2.5.2 HubAlign . 14 2.5.3 Ghost . 15 2.5.4 Magna . 15 2.5.5 Netal . 15 2.5.6 Graemlin . 16 2.5.7 GEDEVO . 16 2.5.8 PINALOG . 17 2.5.9 SPINAL . 17 2.5.10 PISWAP . 17 2.5.11 IsoRank . 18 2.5.12 Natalie . 18 2.5.13 Belief Propagation . 19 2.5.14 Maximum Weighted Matching . 19 Chapter 3 Algorithm 20 3.1 Background . 20 3.2 IsoRank . 21 3.3 First Mutual Exclusion Algorithm . 22 3.4 Mutual Exclusion Matching . 24 3.5 Example Problem . 26 Chapter 4 Experimentation 35 4.1 Datasets . 35 4.2 Constrained Results . 38 vi 4.3 Unconstrained Results . 46 4.4 Biological Comparison . 52 4.4.1 Average Biological Results . 52 4.4.2 Topological Usefulness . 53 4.5 Future Improvements . 56 Chapter 5 Conclusion 62 5.1 Conclusion . 62 Appendix A 71 vii List of Figures 3.1 Example Problem: Graph A . 29 3.2 Example Problem: Graph B . 29 3.3 Example Problem: Weighted Bipartite Matching from A and B . 29 3.4 Example Problem: Full constrained matching problem between Graph A and B 30 3.5 Example Problem: Conserved edge constrained matching problem between Graph A and B . 31 3.6 Example Problem: Cross Product (S) matrix between Graph A and Graph B. Only showing node pairs with possible conserved edges . 31 3.7 Example Problem: Conserved edge constrained matching problem between Graph A and B. 1st iteration . 32 3.8 Example Problem: Cross Product (S) matrix between Graph A and Graph B. Only showing node pairs with possible conserved edges. 1st iteration . 32 3.9 Example Problem: Cross Product (S) matrix between Graph A and Graph B. Only showing node pairs with possible conserved edges. 2nd iteration . 33 3.10 Example Problem: Conserved edge constrained matching problem between Graph A and B. 3rd iteration . 34 viii 3.11 Example Problem: Cross Product (S) matrix between Graph A and Graph B. Only showing node pairs with possible conserved edges. 3rd iteration . 34 4.1 Scores of the best version of each algorithm, averaged over 11 α values. Scores are shown relative to the upper bound calculated by NetalignMR. PPIs are ordered from largest to smallest by Constrained Edge Product . 40 4.2 Scores from the best version of each algorithm, for the 10 largest PPI pairs, averaged over 11 α values. Scores are shown relative to the upper bound calculated by NetalignMR. PPIs are ordered from largest to smallest by Con- strained Edge Product . 41 4.3 Logarithmically scaled running time to find the best solution for each algo- rithm, averaged over 11 α values. PPIs are ordered from largest to smallest by Constrained Edge Product . 42 4.4 Average EC for each PPI pair PPIs are ordered from largest to smallest by Constrained Edge Product . 44 4.5 Relative percentage of EC to the best found value for each PPI pair PPIs are ordered from largest to smallest by Constrained Edge Product . 44 4.6 Relative value of LCC to the best found value for each PPI pair. PPIs are ordered from largest to smallest by Constrained Edge Product . 45 4.7 Average scoring of various algorithms on the unconstrained alignment problem across all α. PPIs are ordered from largest to smallest by edge product . 48 4.8 Maximal Edge Correctness across all α for the unconstrained alignment prob- lem. PPIs are ordered from largest to smallest by Edge Product . 49 ix 4.9 Maximal LCC across all α for the unconstrained alignment problem. PPIs are ordered from largest to smallest by Edge Product . 50 4.10 Average number of protein matches found for each algorithm across α values 52 4.11 Average of biological scores across various α .................. 53 4.12 Average of LCC biological scores across various α ............... 54 4.13 Average of LCC size across all α ......................... 55 4.14 The number of times each algorithm found the maximum biological similarity between two species for a given α ........................ 56 4.15 The number of times each algorithm found the maximum biological similarity between two species for a given α from 0 to 0.9 . 57 4.16 Scores produced by aligning Drosophila melanogaster and Schizosaccharomyces pombe 972h by α value . 58 x List of Tables 4.1 Constrained alignment algorithm settings. All non-listed settings are run at default values . 39 4.2 Total time taken by the best run of each of the different algorithms to find a best solution, besttime, and to converge, totaltime, in seconds . 43 4.3 Average across all constrained problems of the maximal values found across all α ........................................ 44 4.4 Average values for biological measures for each constrained problem . 46 4.5 Average values for biological measures for each constrained problem within their Largest Connected Components for α values less than 1 . 46 4.6 Unconstrained alignment algorithms . 48 4.7 Average across all unconstrained problems of the maximal values found across all α ........................................ 49 4.8 Average values for biological measures for each unconstrained problem . 51 4.9 Average values for biological measures for each unconstrained problem within their Largest Connected Components for α values less than 1 . 51 xi 4.10 Comparison of α values used for scoring for the constrained problem, y axis vs α values used for finding the best solution, x axis . 60 4.11 Percentage of optimal solution found for each α for the unconstrained problem 60 4.12 Comparison of α values used for scoring the unconstrained problem, y axis vs α values used for finding the best solution, x axis . 61 4.13 Percentage of optimal solution found for the unconstrained problem for each α 61 A.1 Species Tested . 71 A.2 Species Pairs by Cross Product Graph Size . 72 A.3 Species Pairs by Conserved Cross Product Graph Size . 73 xii Chapter 1 Introduction 1.1 Introduction One of the most popular usages of computers is to automate large tasks that would be very expensive or computationally unfeasible to do by hand. One common task is the processing and comparison of similar data, which is done for various reasons. These reasons include facial recognition software that relies on the processing and comparison of similar images, credit card companies comparing previous purchases to find outliers to detect fraud, and web services such as google using key words and links to and from various webpages to detect similarity and importance for searching algorithms. All of these represent concepts which, if done by hand, would be very long processes requiring lots of work.