
ANALYSIS OF PROTEIN-PROTEIN INTERACTION NETWORKS USING HIGH PERFORMANCE SCALABLE TOOLS

An Honors Thesis

Presented to

the Department of Computer Science

of the University of New Orleans

In Partial Fulfillment

of the Requirements for the Degree of

Bachelor of Science, with University High Honors

by

Bikesh Pandey

May 2018

ACKNOWLEDGEMENTS

I would like to thank Dr. Shaikh Arifuzzaman for providing me with an opportunity to work with him on large-scale graph mining problems and also supporting me throughout this entire process.

I would also like to thank Dr. Summa for providing important guidance as well as essential knowledge and help throughout my undergraduate career.

Similarly, I want to thank the entire Computer Science department for helping me be a part of projects, and for providing me with a great learning environment.

I would also like to thank Ms. Erin Sutherland for assisting with the thesis as well as bearing with me despite the setbacks and delays in writing it up.

I also want to thank my family, specifically my parents and my brother, who despite trying times kept inspiring me to push myself to complete difficult tasks while fighting hectic schedules.


TABLE OF CONTENTS

ACKNOWLEDGEMENTS

LIST OF TABLES

LIST OF FIGURES

Abstract

Introduction

Related Work

Datasets

Computation Requirement

Basics of Graph Theory

Using the Tools

Categorization of Analysis Metrics

Results
A. Global Analysis
B. Community Analysis
C. Local Analysis
D. The “Hub”
E. Scalability
F. Network Analysis Tools

Comparison: Networkx vs Igraph

Conclusion

BIBLIOGRAPHY


LIST OF TABLES

Table 1. Sample datasets used in the experiment

Table 2. Global analysis metrics results 1

Table 3. Global metrics results 2

Table 4. Global metrics results 3

Table 5. Community analysis metrics results

Table 6. Dominantly central nodes in 5 network datasets

Table 7. Speedups compared: sequential vs. MPI-based tool

Table 8. Speedups compared: networkx and igraph


LIST OF FIGURES

Figure 1. A graph colored on the basis of BC from least (red) to greatest (blue)

Figure 2. Average Clustering Coefficient in the datasets

Figure 3. A subgraph in Homo sapiens. Node colors represent modularity classes and node sizes represent degrees

Figure 4. Top three proteins with highest centrality values in Homo sapiens


Abstract

Protein-Protein Interaction (PPI) research currently generates an extraordinary number of publications and attracts interest from computer scientists and biologists alike because of the underlying potential of the source material that researchers can work with.

PPI networks are networks of protein complexes formed by biochemical events or electrostatic forces that serve a biological function [1]. As the analysis of protein networks grows, we gain more information about proteins, genomes and their influence on life. Today, PPI networks are used to study diseases, improve drugs and understand other processes in medicine and health that will eventually help mankind.

Though PPI network research is considered extremely important, there is an issue: few people have sufficient interdisciplinary knowledge in both biology and computer science, and this limits our rate of progress in the field.

Most biologists that are not expert coders need a way of calculating graph values and information that will help them analyze the graphs better without having to manipulate the data themselves. In this research, I test a few ways of achieving results through the use of available frameworks and algorithms, present the results and compare each method’s efficacy.

My analysis takes place on very large datasets, where I calculate several centrality measures and other graph metrics, and I also visualize the networks in order to gain further insight. I also note the significant effect of MPI and multithreading on the results obtained, which suggests that building scalable tools will improve the analysis immensely.


Introduction

In recent times, we have invested a lot in researching DNA and specific genes and how they can affect how an organism behaves and functions. The DNA of an organism gives it the possibility of having certain traits; however, only the activated proteins produced from that genetic information actually bring those traits to fruition. There are many ways to detect how important specific proteins are to an organism, for example by looking at how often certain proteins interact and which proteins are involved in many interactions. We usually do this by building graphs from data obtained from giant protein datasets. However, analyzing these graphs with network algorithms is costly and takes up a lot of resources. This thesis examines how we can reduce that cost by minimizing the overhead required to perform most of these operations and by identifying the best ways to do so.

Networks are important because, given a vast array of complicated information about several different entities, they abstract the information by creating nodes and edges among proteins, making the information easier to read and interpret.

Mining different kinds of data has recently become an extremely valuable way of distilling a mass of information into a small, abstracted form. For example, big companies like Facebook continually mine data from their users (social networks, web graphs), creating reliable information. Similarly, mining biological data helps us generate a large volume of data with which to establish protein-protein interactions (PPI).

Research in this field helps us analyze data at the molecular level, but it also has its drawbacks: analyzing large datasets requires a lot of computation, and that computation needs to be scalable.


Here I analyze a few computation methods, determine their suitability, and present the results obtained.

Related Work

Proteins are linear-chain biomolecules that are the basis of functional networks in all organisms. Protein interactions can be very helpful in shedding light on different types of proteins and how they function. For instance, most cancers are caused by increasing interaction edge weights of oncogenes and decreasing interaction edge weights of tumor suppressor genes [2].

Most current work in the field focuses on creating large dumps of protein interaction values that can be worked with computationally. Creating networks reveals several insights, but further analysis through computation is required [3]. These computations need to be scalable [32] and efficient if we want to glean as much information as possible.

The String Database (StringDB), the largest protein interaction repository available and the source of the datasets used in this research, contains PPI networks covering 2031 organisms, 9.6 million proteins and 1380 million protein interactions [4]. Several such network databases exist that provide large amounts of data suitable for analysis.


Datasets

PPI networks from StringDB database for several organisms were used for the analysis.

StringDB is a database repository that contains several protein identifiers and details, including protein interaction values. The scores in StringDB represent the likelihood of an interaction between any two proteins, and their main purpose here is to identify important proteins. The database is helpful for getting important information on different species of organisms and their specific proteins and genes. It provides detailed information on networks of high-confidence interactions, which helps us calculate betweenness centrality, degrees of proteins and many other kinds of information that subsequently help us determine the specific impact of one particular protein on the overall graph. The networks are represented as edge lists with several interaction values based on various kinds of evidence, such as interaction and coexpression scores. The datasets used are summarized in Table 1.

Network                    Nodes    Edges      Source
Homo sapiens               19247    4274001    StringDB
Acetobacterium woodii      3439     369956     StringDB
Albugo laibachii           5849     1443060    StringDB
Dinoroseobacter shibae     3567     412618     StringDB
Bacillus cytotoxicus       3765     298873     StringDB
Enterococcus faecium       2833     247580     StringDB
Francisella novicida       1838     105234     StringDB
Streptococcus peroris      1636     127571     StringDB
Thermus aquaticus          2514     200519     StringDB
Zinderia insecticola       201      8299       StringDB

Table 1. Sample datasets used in the experiment


These datasets contain an edge weight, valued on a scale of 0-1000, between two proteins. This weight is the overall interaction score: the sum of all the categorical scores such as the coexpression score, neighborhood score, experimental score and several other values given by the database. The datasets identify proteins using unique protein identifiers called Ensembl Protein IDs, determined by Ensembl.org. Further details on these proteins and other genes can be found at the Ensembl Genome Browser [5].

We first need to analyze the datasets from StringDB. The provided values, when teamed with proper analysis, help us calculate betweenness centrality (BC), closeness centrality (CC), degree and diameter for a network of proteins and determine the important ones on the basis of these values.
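As a rough illustration of the starting point, the sketch below loads a StringDB-style edge list into NetworkX. It assumes a whitespace-separated file with a header and columns protein1, protein2, combined_score; the file name and the 700 cutoff for high-confidence interactions are illustrative assumptions, not values prescribed by this thesis.

    import networkx as nx

    def load_string_network(path, min_score=700):
        """Load a StringDB-style edge list (protein1 protein2 combined_score)
        into an undirected weighted graph, keeping only high-confidence edges."""
        G = nx.Graph()
        with open(path) as f:
            next(f)  # skip the header line
            for line in f:
                p1, p2, score = line.split()[:3]
                if int(score) >= min_score:
                    G.add_edge(p1, p2, weight=int(score))
        return G

    # Hypothetical file name for the Homo sapiens network.
    G = load_string_network("9606.protein.links.txt")
    print(G.number_of_nodes(), "proteins,", G.number_of_edges(), "interactions")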

Computation Requirement

Since the datasets are extremely large, the analysis needs to run on multiple processors to complete successfully. The framework used for the analysis is a Message Passing Interface (MPI) based distributed-memory parallel system, where each processor has its own local memory. The processors do not have any shared memory, and they communicate by exchanging messages. Compute resources are the physical resources on which individual jobs are executed. Our current resources include two HPC Linux clusters, at LONI (Louisiana Optical Network Infrastructure) and at the University of New Orleans (UNO). The LONI QueenBee system is a 680-compute-node cluster with 50.7 TFlops peak performance running the Red Hat Enterprise Linux 4 operating system. Each node contains two quad-core Xeon 64-bit processors operating at a core frequency of 2.33 GHz. The compute cluster at UNO is a small cluster with 2 large-memory computing nodes, each with 16 cores and 512 GB of RAM, connected by a QDR InfiniBand interconnect and running the Linux operating system.

Basics of Graph Theory

Graph analysis is a complicated process, and there are many features that can be elicited from a graph or that are natural properties of graphs. Here I explain some of the graph metrics analyzed in this experiment, including diameter, betweenness centrality, closeness centrality, degree, degree centrality, clustering coefficient, k-core, triangles and modularity.

The diameter of a graph is the greatest shortest-path distance between any two vertices in the graph (the maximum eccentricity). The most fundamental way of calculating it is to compute the shortest path for every pair of vertices; the longest of these shortest paths is the diameter. Some graphs contain disjoint sets of nodes, and these graphs produce an infinite value for the diameter.

Figure 1. A graph colored on the basis of BC from least (red) to greatest (blue).

Most of the regular graphs produced using the datasets in this experiment, though, will produce short diameter values, which denote the reachability of the nodes within the graph from other nodes.
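A minimal NetworkX sketch of this calculation, assuming the graph G from the loading sketch above; a disconnected graph is reported as having an infinite diameter, matching the treatment of the disconnected dataset in the Results section:

    import math
    import networkx as nx

    def safe_diameter(G):
        # The diameter is only defined for connected graphs; report a
        # disconnected graph as having infinite diameter.
        if nx.is_connected(G):
            return nx.diameter(G)
        return math.inf

    # Connected components give the component count and largest component size.
    components = list(nx.connected_components(G))
    print("diameter:", safe_diameter(G))
    print("components:", len(components), "largest:", max(len(c) for c in components))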

Betweenness Centrality is a measure of centrality in a graph based on shortest paths. There can be several shortest paths between any pair of vertices in a graph. Betweenness Centrality is an exact measure of how often a given node appears on the shortest paths between pairs of vertices. In Figure 1, we can see that the nodes on the outside of the graph, which interact with fewer nodes overall, have lower values compared to the nodes in the center. This is because the nodes in the center lie on most of the shortest paths between pairs of vertices in the graph. It is especially useful to learn the values of betweenness centrality because nodes with high centrality values have been shown to have a significant effect on any graph. They play a role in communication and signaling in biochemical and regulatory pathways [25]. According to NetworkX, the betweenness centrality of a node $v$ is the sum of the fractions of all-pairs shortest paths that pass through $v$:

$$c_B(v) = \sum_{s,t \in V} \frac{\sigma(s,t \mid v)}{\sigma(s,t)}$$

where $V$ is the set of nodes, $\sigma(s,t)$ is the number of shortest $(s,t)$-paths, and $\sigma(s,t \mid v)$ is the number of those paths passing through some node $v$ other than $s, t$. If $s = t$, $\sigma(s,t) = 1$, and if $v \in \{s, t\}$, $\sigma(s,t \mid v) = 0$. [37]
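A short NetworkX sketch of this computation (graph G assumed from the loading sketch; the sampled variant using the k parameter is an optional approximation for the largest networks):

    import networkx as nx

    # Exact betweenness centrality, normalized as in the formula above.
    bc = nx.betweenness_centrality(G, normalized=True)

    # For very large graphs, sampling source nodes gives an approximation.
    bc_approx = nx.betweenness_centrality(G, k=256, seed=42, normalized=True)

    # The most "between" proteins are candidate hub nodes (see the Results section).
    print(sorted(bc, key=bc.get, reverse=True)[:3])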

Closeness Centrality for a node is defined in terms of the sum of the shortest-path lengths between that node and all other nodes in the graph: the closer a node is to the center, the closer it is to all the individual nodes. This is also a very good indicator of how much emphasis a node has in the graph overall. In NetworkX, the closeness centrality of a node $u$ is the reciprocal of the sum of the shortest-path distances from $u$ to all other nodes. Since the sum depends on the number of nodes in the graph, closeness is normalized by the sum of the minimum possible distances, $n - 1$:

$$C(u) = \frac{n-1}{\sum_{v=1}^{n-1} d(v,u)}$$

where $d(v,u)$ is the shortest-path distance between $v$ and $u$, and $n$ is the number of nodes in the graph. [38]
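The corresponding NetworkX calls, again assuming the graph G from the loading sketch; degree centrality (discussed next) is computed the same way:

    import networkx as nx

    # Closeness centrality for every node, normalized as in the formula above.
    cc = nx.closeness_centrality(G)

    # Degree centrality: the fraction of the other nodes each node is connected to.
    dc = nx.degree_centrality(G)

    print(max(cc, key=cc.get), max(dc, key=dc.get))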

The degree of a node in a graph is the number of edges incident to the node. The degree centrality of a node is the fraction of the other nodes it is connected to.

The clustering coefficient measures the degree to which nodes in a graph tend to cluster together into tightly knit groups. In most graphs, especially graphs associated with social networks, nodes are usually part of a compact group of densely connected nodes such that the links within the group are stronger than an average link between two nodes in the graph [26]. The formulas used for calculating clustering in NetworkX are as follows [39]. For unweighted graphs, the clustering of a node $u$ is the fraction of possible triangles through that node that exist,

$$c_u = \frac{2\,T(u)}{\deg(u)\,(\deg(u)-1)}$$

where $T(u)$ is the number of triangles through node $u$ and $\deg(u)$ is the degree of $u$.

For weighted graphs, the clustering is defined as the geometric average of the subgraph edge weights,

$$c_u = \frac{1}{\deg(u)\,(\deg(u)-1)} \sum_{vw} \left(\hat{w}_{uv}\,\hat{w}_{uw}\,\hat{w}_{vw}\right)^{1/3}$$

where the edge weights $\hat{w}_{uv}$ are normalized by the maximum weight in the network, $\hat{w}_{uv} = w_{uv}/\max(w)$.
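A small NetworkX sketch of both variants (graph G assumed from the loading sketch, where edge weights are stored under the key "weight"):

    import networkx as nx

    # Unweighted clustering: fraction of possible triangles through each node.
    clust = nx.clustering(G)

    # Weighted clustering: geometric average of the (normalized) subgraph edge weights.
    clust_w = nx.clustering(G, weight="weight")

    # The network-wide average reported in the Results section (Figure 2).
    print("average clustering:", nx.average_clustering(G))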


Coreness is a measure that helps to identify tightly interlinked groups within a network. A k-core is a maximal group of entities, all of which are connected to at least k other entities in the group. So, a graph will be a 4-core graph if all its nodes are connected to at least 4 other nodes in the graph.
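A hedged NetworkX sketch of the coreness computation (graph G assumed from the loading sketch; self-loops, if any, must be removed first because core_number does not allow them):

    import networkx as nx

    # core_number gives, for each node, the largest k such that the node is in a k-core.
    G.remove_edges_from(nx.selfloop_edges(G))  # core_number requires no self-loops
    core = nx.core_number(G)

    # The maximum k-core values reported in Table 4 correspond to the largest k here.
    max_k = max(core.values())
    densest = nx.k_core(G, k=max_k)
    print("max k-core:", max_k, "with", densest.number_of_nodes(), "proteins")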

Counting triangles means counting the sets of three nodes that are all connected to each other. Since triangles and calculations of transitivity are very important for finding communities, predicting links and filtering spam in real-life networks, triangle-counting algorithms are very important in graph theory. Because most exact algorithms cannot scale well to large networks, estimation algorithms matter a great deal [27].
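For networks of the sizes in Table 1, exact counting with NetworkX is still feasible; a minimal sketch (graph G assumed from the loading sketch):

    import networkx as nx

    # nx.triangles returns, for each node, the number of triangles it participates in.
    tri_per_node = nx.triangles(G)

    # Each triangle is counted once at each of its three corners, so divide by 3.
    total_triangles = sum(tri_per_node.values()) // 3
    print("triangles:", total_triangles)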

Modularity, similar to clustering, is a way of identifying tightly knit communities in a general graph. Both are good indicators of community structure, but they differ in how they identify communities: the clustering coefficient looks at triangle densities, whereas modularity looks at edge densities within and between modules.
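A hedged sketch of a modularity-based community analysis in NetworkX; this uses the built-in greedy modularity heuristic rather than the specific method behind the community results reported later, and again assumes the graph G from the loading sketch:

    import networkx as nx
    from networkx.algorithms import community

    # Detect communities by greedily maximizing modularity.
    communities = community.greedy_modularity_communities(G)

    # Modularity of the resulting partition plus basic community statistics.
    Q = community.modularity(G, communities)
    sizes = [len(c) for c in communities]
    print("communities:", len(sizes), "max size:", max(sizes), "modularity:", round(Q, 3))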

Using the Tools

For the experiments conducted in this thesis, several tools are used: NetworkX, igraph, Gephi, GNU tools and SNAP. Most of the direct calculations of centralities are done through a modified framework developed earlier [32] to make a wholesome tool for the analysis of protein interactions. The framework uses a Linux-based architecture, mainly C++ code along with shell and Python scripts, and uses MPI to accelerate the analysis. Several efficient triangle-counting algorithms are used [31][33][35]. Jobs are submitted through PBS qsub scripts, which use the Moab scheduling mechanism to allow multiprocessing. Each job request comes with pre- and post-processing options that allow for further changes in the code. Other direct analysis took place on my personal computer, which ran NetworkX and igraph for comparison.

The main service used for visualization was GNUPlot [8], which produces several plots and distributions based on statistical operations on a given set of data. Since this tool is already well made, generating the plots does not require a lot of specific coding. For additional visualization, Gephi [9], the Java-based visualization platform, is used. Its open-source and extensible design means that many features can be highlighted in the detailed interactive graphs it produces. For analysis with Gephi, the network dataset is first converted to the GEXF format, allowing for changes in nodes and attributes if needed. Several graph layout algorithms can be used, such as Force Atlas, Yifan Hu and Fruchterman-Reingold, and every node can be sorted and highlighted based on its betweenness, degree or any other graph metric.
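A one-line sketch of that conversion step with NetworkX (graph G assumed from the loading sketch; the output file name is hypothetical):

    import networkx as nx

    # Write the network to GEXF so it can be opened directly in Gephi;
    # node and edge attributes such as the interaction weight are preserved.
    nx.write_gexf(G, "homo_sapiens_ppi.gexf")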

Categorization of Analysis Metrics

The experiments done on the data can be divided into three categories: global, community and local metrics.

The global metrics are usually lightweight and operate on the entire graph, measuring properties of the graph as a whole.


The community analysis takes place on clusters that have a function or appear segregated. Every community is a set of nodes that are more strongly connected to one another than to any nodes lying outside that community.

Local metrics are usually trickier because they take the analysis down to the node level and hence require a lot of computation, for example when calculating local clustering coefficients. Several centrality measures, such as Betweenness Centrality, Closeness Centrality and Degree Centrality, were also used to identify major nodes, as these are good indicators of how much effect a particular node has on the entire network.

Using a multi-tier approach is usually considered good practice for implementing such analysis. Starting with the coarsest analysis and getting finer with each iteration allows oddities found at the global scope to become the focus of later iterations, allowing for specific analysis where necessary. This also means that most of the analysis can be done even on a basic personal computer. However, many tasks, especially in the local metrics analysis, take more resources and a long time to finish.


Results

The results obtained from the tool using the metric classification system are as follows:

A. Global analysis

Network                    Diameter
Acetobacterium woodii      6
Albugo laibachii           6
Bacillus cytotoxicus       5
Dinoroseobacter shibae     5
Homo sapiens               6
Enterococcus faecium       5
Francisella novicida       4
Streptococcus peroris      Infinite
Thermus aquaticus          5
Zinderia insecticola       3

Table 2. Global analysis metrics results 1

Diameter analysis was computed on all the datasets using the NetworkX library algorithm. The maximum value of 6 shows that there is good reachability in the connected datasets: the farthest protein can be reached in just six hops. However, one of the datasets appears to be disconnected, as it returned an infinite diameter, meaning not all nodes can be reached from all other nodes. Further analysis was conducted on selected datasets.


Network                    Min. degree   Max. degree   Avg. degree   Number of comp.   Max. comp. size
Acetobacterium woodii      1             2075          172.51        1                 4192
Albugo laibachii           1             2676          493.44        21                5798
Bacillus cytotoxicus       1             1746          159.51        2                 3803
Dinoroseobacter shibae     1             2371          229.04        1                 3574
Homo sapiens               1             10853         444.12        1                 19247

Table 3. Global metrics results 2

The basic global metrics calculations involve degree, component and diameter calculations, along with counting triangles and finding the maximum k-core. Tables 3 and 4 present these results for five PPI networks. It is clear from the tables that the Homo sapiens dataset has a significantly higher number of proteins and interactions, and its analysis is significantly more intensive.

Network                    Max. k-core   Triangles
Acetobacterium woodii      146           6.36M
Albugo laibachii           566           215.12M
Bacillus cytotoxicus       146           6.41M
Dinoroseobacter shibae     172           13.06M
Homo sapiens               791           321.6M

Table 4. Global metrics results 3

In terms of triangle density, Albugo laibachii leads the way even though it has fewer triangles (215M) than Homo sapiens (321M). In the same dataset, the average clustering is very high too, indicating that proteins interact closely with their entire neighborhood, as can be observed in Figure 2.


Figure 2. Average Clustering Coefficient (Avg. CC) in the datasets: Acetobacterium woodii, Albugo laibachii, Bacillus cytotoxicus and Dinoroseobacter shibae.

B. Community Analysis

Network                    Comm. size (max.)   Comm. size (avg.)   Number of comm.   Modularity
Acetobacterium woodii      2075                172.51              1                 4192
Albugo laibachii           2676                493.44              21                5798
Bacillus cytotoxicus       1746                159.51              2                 3803
Dinoroseobacter shibae     2371                229.04              1                 3574
Homo sapiens               10853               444.12              1                 19247

Table 5. Community analysis metrics results

Community analysis metrics are used to reveal functional units in the networks. For each dataset, a certain number of functional communities is detected. The results obtained here are direct results, but results obtained from Gephi can provide further insight. The interactive graphs produced by Gephi can be used to zoom in and look at nodes and connected communities in detail. For example, in Figure 3, which shows a subgraph of the Homo sapiens dataset, we can easily highlight nodes and their neighbors and look in detail at how they connect to the overall graph. Another metric that is important in community analysis is the k-core, which was already calculated as part of the global analysis. Albugo laibachii has a maximum core of 566, which means it has a subgraph in which the minimum degree is 566. This shows that for Albugo laibachii and Homo sapiens, there are large cohesive subgroups.

Figure 3 A subgraph in Homo Sapiens. Node colors represent modularity classes and node sizes represent degrees.


C. Local Analysis

Several local metrics, including the nodal clustering coefficient, seed expansion and centralities, are calculated here. Since these are nodal, only targeted analyses are done on the nodes that show properties that demand notice.

In all the datasets, the nodes with high degrees are very few and most of the nodes have lower degrees. We can look upon individual nodes to gain further information regarding their properties.

D. The “Hub”

Finding important details in protein interactions hinges on correctly predicting “hub” proteins, which appear to exert influence over many nodes and sometimes over an entire network [12]. Especially when it comes to drug targets, such nodes are important to pathways because they either appear in many interactions or appear in interactions with extremely high interaction values, i.e. importance. The following results come from a specific look into the possible hub nodes in the Homo sapiens dataset and five of the other datasets.


i. Homo Sapiens

Figure 4. Top three proteins with highest centrality values in Homo sapiens (represented with Ensembl IDs: ENSP00000344818, ENSP00000351686, ENSP00000328973), plotted against degree, closeness and betweenness centralities.

As can be inferred from Figure 4, the following three proteins have the highest centrality scores for Homo sapiens: ENSP00000344818 (UBC protein), ENSP00000351686 (PRDM10 protein), and ENSP00000328973 (TSPO protein). Looking into these specific proteins in the available research, they do indeed have a lot of importance. Ubiquitin C (UBC), as its name suggests, is a protein found ubiquitously throughout eukaryotic tissues. This explains the higher value of betweenness centrality for this protein. The UBC protein is encoded by the UBC gene, which regulates cellular ubiquitin levels under stress [13]. The UBC protein contributes to liver development, and hence a lack of UBC genes in unborn fetuses leads to embryonic lethality [14]. PRDM10 is a protein that has been linked to transcriptional regulation [15]. Some studies on mice have indicated that it may also help in the development of the central nervous system [16]. The TSPO protein, encoded by the TSPO gene, is found in the outer mitochondrial membrane. Generally, TSPO has been linked with cholesterol transport, with mixed evidence [17], and has also been associated with immune response [18] and heart regulation [19], depending on the kind of tissue it is working in.

Similar results for the other five datasets, based on betweenness centrality, are provided below:

Network: Protein 1 (BC); Protein 2 (BC); Protein 3 (BC)

Enterococcus faecium: 565664.EFXG_02548 (0.011887061433603347); 565664.EFXG_01606 (0.011623847640634836); 565664.EFXG_01777 (0.00977935720780547)

Francisella novicida: 676032.FN3523_1668 (0.01248583615719816); 676032.FN3523_1601 (0.007447159209474835); 676032.FN3523_0880 (0.0066714619860835656)

Streptococcus peroris: 888746.HMPREF9180_0918 (0.017006028424795053); 888746.HMPREF9180_0031 (0.009293037994997906); 888746.HMPREF9180_1288 (0.005658432514832793)

Thermus aquaticus: 498848.TaqDRAFT_3009 (0.015220965253378323); 498848.TaqDRAFT_3881 (0.010151699052185423); 498848.TaqDRAFT_4767 (0.00879464538827585)

Zinderia insecticola: 871271.ZICARI_135 (0.01662779736530126); 871271.ZICARI_136 (0.01637686851432466); 871271.ZICARI_131 (0.014341243711253905)

Table 6. Dominantly central nodes in 5 network datasets

ii. Enterococcus faecium

For Enterococcus faecium, proteins 1 and 3 had no detailed information, and those proteins are not annotated in the STRING libraries either. Protein 565664.EFXG_01606 also did not have any specific annotation available, but based on its protein interactions it is closely linked to EFXG_00019, a DNA polymerase III subunit beta, with a score of 0.999 out of 1 [20].

iii. Francisella novicida

In this case, 676032.FN3523_1668 represents DNA polymerase I [21]. DNA polymerase I works on DNA replication in prokaryotes; it is the most abundant polymerase and is responsible for filling gaps during replication and repair. 676032.FN3523_1601 is a heat shock protein (GroL) that prevents misfolding and promotes the refolding and proper assembly of unfolded polypeptides under stress [22]. 676032.FN3523_0880 is a guaA enzyme, a GMP (guanosine monophosphate) synthase that acts as a catalyst during the synthesis of GMP from XMP (xanthosine monophosphate) [23].

iv. Streptococcus peroris

888746.HMPREF9180_0918 is an ADP-ribosylglycohydrolase enzyme [24], HMPREF9180_0031 is a DNA polymerase [25], and 888746.HMPREF9180_1288 is an inosine-5'-monophosphate dehydrogenase that catalyzes the conversion of inosine 5'-phosphate (IMP) to xanthosine 5'-phosphate (XMP). Details on this bacterium's enzymes were scarce because of the lack of specific research into the proteins. Such limitations will have significant impacts until the creation and distribution of large protein datasets conform to more coherent standards and the data collection is more accurate.

v. Thermus aquaticus

According to StringDB, 498848.TaqDRAFT_3009 belongs to a family of proteins whose purpose is to unwind nucleic acids, 498848.TaqDRAFT_3881 is a DNA-directed DNA polymerase, and 498848.TaqDRAFT_4767 is a DNA polymerase III subunit beta.

vi. Zinderia insecticola

For Zinderia insecticola, protein 1, 871271.ZICARI_135, is a Ribonuclease III enzyme that digests double-stranded RNA and is involved in processing the primary rRNA transcript to produce the immediate precursors to the large and small rRNAs; it also takes part in RNA silencing and the pnp autoregulatory mechanism [28]. 871271.ZICARI_136 is a putative GTP-binding protein that is necessary for accurate and efficient protein synthesis under stress conditions. It is also thought to act as a fidelity factor of the translation reaction, catalyzing a one-codon backward translocation of tRNAs on improperly translocated ribosomes [29]. 871271.ZICARI_131 is a replicative DNA helicase that works in initiation and elongation during chromosome replication; it also contains active sites for DNA binding and ATP binding [30].

E. Scalability

The project largely works with parallelizable code that runs on multiple processors, using MPI to divide tasks among processors and collect results. Along with truly parallel algorithms, some local metrics used sequential methods in a task-parallel way to speed up the process. This design is effective in increasing the speedup of the processes using the tool, sometimes by up to ten fold compared to a straight sequential run using networkx only, provided the machine running it has several processors.

Network                    Sequential runtime (s)   MPI-based tool runtime (s)   Speedup
Acetobacterium woodii      576                      62                           9.29
Albugo laibachii           820                      95                           8.63
Bacillus cytotoxicus       540                      58                           9.31
Dinoroseobacter shibae     680                      72                           9.44
Homo sapiens               1280                     130                          9.85

Table 7. Speedups compared: sequential vs. MPI-based tool
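The thesis framework itself is built on C++ and MPI, but the task-parallel idea can be sketched in Python with mpi4py; this is an illustrative sketch, not the actual tool, and it assumes a headerless whitespace-separated edge list whose file name is hypothetical. Each rank computes local clustering for its share of the nodes and rank 0 gathers the pieces; it would be launched with something like mpirun -n 8 python script.py.

    from mpi4py import MPI
    import networkx as nx

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Every rank loads the graph and takes an interleaved slice of the nodes.
    G = nx.read_weighted_edgelist("homo_sapiens.edges")
    my_nodes = list(G.nodes())[rank::size]

    # Each rank computes local clustering coefficients for its own nodes only.
    local = nx.clustering(G, nodes=my_nodes)

    # Rank 0 gathers the partial dictionaries and merges them.
    parts = comm.gather(local, root=0)
    if rank == 0:
        clustering = {k: v for part in parts for k, v in part.items()}
        print("computed clustering for", len(clustering), "nodes")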


F. Network Analysis Tools

There are several network analysis tools: the ones used for our experiment, SNAP and NetworkX, and others like igraph, PEGASUS, CINET and Pajek. NetworkX is an open-source Python-based software package for studying complex networks and contains a large collection of network algorithms. Pajek is a tool for the analysis and visualization of networks with thousands to millions of vertices. The Stanford Network Analysis Project (SNAP) is a general-purpose network analysis library. Another toolkit, Network Workbench, provides an online portal for network researchers. PEGASUS is a peta-scale distributed graph mining system that provides large-scale algorithms for several graph mining tasks and runs on clouds. CINET is another versatile web-based tool for analyzing unlabeled (unsigned) networks.

There are general differences among these tools, and all of them have some limitations. For this experiment, NetworkX's lack of scalable parallel algorithms meant that a task-parallel system had to be created to speed up processes, and CINET lacks support for signed networks.

Comparison: Networkx vs Igraph

Two of the most frequently used scripting analysis tools are networkx and igraph on the Python platform. Since igraph and most other tools are not directly installed on the LONI platform, a comparison was made on a modest device (i7-4770HQ, 12 GB DDR3 RAM and an Nvidia 970M). The comparison tests these easy-to-use tools, neither of which supports native multithreading and both of which run on the Python platform. Igraph, though, runs on C libraries rather than natively on Python libraries like networkx. Both are also incredibly easy to use for calculating different measures on a graph.

Network                    Betweenness centrality runtime (s)    PageRank runtime (s)
                           igraph        networkx                igraph        networkx
Enterococcus faecium       7.05          704                     1.59          17.4
Francisella novicida       1.95          207                     1.50          15.3
Streptococcus peroris      2.63          212                     0.615         7.78
Thermus aquaticus          6.19          606                     1.21          9.78
Zinderia insecticola       0.05          1.55                    0.027         0.23

Table 8. Speedups compared: networkx and igraph

Most of the calculations were significantly faster on igraph, most likely because of its efficient underlying C libraries as opposed to networkx's native Python execution. As can be seen from Table 8, betweenness calculation was anywhere between 31 and 106 times faster, and PageRank calculation was anywhere between 8 and 12 times faster, on igraph compared to networkx. This demonstrates that igraph is much better at utilizing limited resources for analysis. But a comparison of the tools would be incomplete if it were made only on the basis of performance. Igraph's C library uses appropriate indexing, making it faster to load graphs. The other advantage that igraph has over networkx is that it works in R and C too.
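A sketch of how such a timing comparison can be scripted; the conversion from the NetworkX graph G (assumed from the loading sketch) into an igraph graph and the timing helper are illustrative, not the exact benchmark script used:

    import time
    import networkx as nx
    import igraph as ig

    # Build an igraph graph from the same edge list as the NetworkX graph G.
    g = ig.Graph.TupleList(G.edges(), directed=False)

    def timed(label, fn):
        start = time.perf_counter()
        fn()
        print(label, round(time.perf_counter() - start, 2), "s")

    # Betweenness centrality in both libraries on the same network.
    timed("igraph betweenness", lambda: g.betweenness())
    timed("networkx betweenness", lambda: nx.betweenness_centrality(G))

    # PageRank in both libraries.
    timed("igraph pagerank", lambda: g.pagerank())
    timed("networkx pagerank", lambda: nx.pagerank(G))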

The advantage that networkx has over igraph is that it is extremely easy to install, since it is a standard Python package that is widely available and used, compared to the slightly cumbersome installation of igraph. For instance, LONI's QueenBee already had the networkx package built in, but igraph was impossible to install or use without greater authorization on the server. The documentation for networkx is far superior, and I was able to glean information regarding specific calculations much more easily. It also has more algorithms built into its libraries than igraph. For visualizing graphs, networkx does better because of its ability to handle dynamically changing graphs; igraph has a more difficult time coping with changes in the graph it is working on.

Conclusion

Protein interaction research is very important at the moment, with thousands of research papers published every year on the topic, because the underlying information is severely consequential: it could help drug discovery and help uncover genetic patterns.

The field is only growing, but the tools used for such research are quite obtuse and difficult to use. The foundation for conducting such research is the proper development of tools that enable it efficiently. Several tools for network research are currently being worked on. Inventing a multidisciplinary approach to handling the datasets, from their creation to their in-depth analysis, might be the best way forward. However, people who do not work in the fields of algorithm design and computer science need to be able to focus on the biological aspect of research without having to worry about the intricacies of dealing with poorly developed and extremely complex software. So, there should be a larger emphasis on developing, and understanding, scalable, effective tools that can be used for multidisciplinary studies, to reduce the overhead for actual experts in the field.

Also, as the analysis above shows, there are stark differences among similar-looking, similarly acting tools. Igraph and networkx provide a good example: they run on the same base platform and offer similar metrics; however, one is advantageous in situations where all you need is core speed for static analysis, and the other when you need to test more metrics or when you use dynamic graphs. The analysis also highlights the importance of highly scalable, multiprocessing-enabled algorithms. Since the effects of Moore's law have now weakened because of the physical limitations of silicon chips [36], developments in hardware are slower than they used to be. Properly developed algorithms that use hardware resources wisely are hence all the more important for building tools that handle large datasets, like the ones discussed here, and proper analysis of those tools helps researchers in different fields pick the correct tool for their purposes.


BIBLIOGRAPHY

1. "Protein-Protein Interaction Networks." Nature News, Nature Publishing Group, www.nature.com/subjects/protein-protein-interaction-networks.
2. D. C. Altieri, "Survivin, cancer networks and pathway-directed drug discovery," Nature Reviews Cancer, vol. 8, no. 1, pp. 61-70, 2008.
3. Z. Li et al., "Corrigendum: The OncoPPi Network of Cancer-Focused Protein-Protein Interactions to Inform Biological Insights and Therapeutic Strategies," Nature Communications, vol. 8, p. 15350, 2017, doi:10.1038/ncomms15350.
4. STRING: functional protein association networks. https://string-db.org/
5. Ensembl genome browser. http://www.ensembl.org
6. NetworkX tool. https://networkx.github.io/
7. SNAP. http://snap.stanford.edu/
8. GNUPlot. http://gnuplot.info/
9. Gephi - the open graph viz platform. https://gephi.org/
10. V. Blondel, J. Guillaume, R. Lambiotte, and E. Lefebvre, "Fast unfolding of communities in large networks," Journal of Statistical Mechanics: Theory and Experiment, vol. 10, p. 10008, 2008.
11. U. Raghavan, R. Albert, and S. Kumara, "Near linear time algorithm to detect community structures in large-scale networks," CoRR, vol. abs/0709.2938, 2007.
12. B. Schwikowski, P. Uetz, and S. Fields, "A network of protein-protein interactions in yeast," Nature Biotechnology, vol. 18, no. 12, pp. 1257-1261, 2000.
13. O. Wiborg, M. Pedersen, A. Wind, L. Berglund, K. Marcker, and J. Vuust, "The human ubiquitin multigene family: some genes contain multiple directly repeated ubiquitin coding sequences," The EMBO Journal, vol. 4, no. 3, p. 755, 1985.
14. K. Ryu et al., "The mouse polyubiquitin gene Ubc is essential for fetal liver development, cell-cycle progression and stress tolerance," The EMBO Journal, vol. 26, no. 11, pp. 2693-2706, 2007.
15. "STRING PRDM10," https://string-db.org/cgi/network.pl?taskId=eU6OEL2pwmaP
16. NCBI PRDM10. https://www.ncbi.nlm.nih.gov/gene/56980
17. J.-J. Lacapere and V. Papadopoulos, "Peripheral-type benzodiazepine receptor: structure and function of a cholesterol-binding protein in steroid and bile acid biosynthesis," Steroids, vol. 68, no. 7, pp. 569-585, 2003.
18. M. Pawlikowski, "Immunomodulating effects of peripherally acting benzodiazepines," Peripheral Benzodiazepine Receptors, pp. 125-135, 1993.
19. X. Qi, J. Xu, F. Wang, and J. Xiao, "Translocator protein (18 kDa): a promising therapeutic target and diagnostic tool for cardiovascular diseases," Oxidative Medicine and Cellular Longevity, vol. 2012, 2012.
20. "EFXG_01606," https://string-db.org/network/565664.EFXG_01606
21. "FN3523_1668," https://string-db.org/network/676032.FN3523_1668
22. "FN3523_1601," https://string-db.org/network/676032.FN3523_1601
23. "FN3523_0880," https://string-db.org/network/676032.FN3523_0880
24. "HMPREF9180_0918," https://string-db.org/network/888746.HMPREF9180_0918
25. M. H. Ung, C.-C. Liu, and C. Cheng, "Integrative Analysis of Cancer Genes in a Functional Interactome," Scientific Reports, vol. 6, p. 29228, 2016.
26. "Clustering Coefficient in Graph Theory," https://www.geeksforgeeks.org/clustering-coefficient-graph-theory/
27. M. Al Hasan and V. S. Dave, "Triangle counting in large networks: a review," WIREs Data Mining and Knowledge Discovery, vol. 8, e1226, 2018, doi:10.1002/widm.1226.
28. "ZICARI_135," https://string-db.org/network/871271.ZICARI_135
29. "ZICARI_136," https://string-db.org/network/871271.ZICARI_136
30. "ZICARI_131," https://string-db.org/network/871271.ZICARI_131
31. S. Arifuzzaman, M. Khan, and M. Marathe, "A Space-Efficient Parallel Algorithm for Counting Exact Triangles in Massive Networks," 2015 IEEE 17th International Conference on High Performance Computing and Communications, New York, NY, 2015, pp. 527-534.
32. S. Arifuzzaman and B. Pandey, "Scalable Mining and Analysis of Protein-Protein Interaction Networks," 3rd IEEE International Conference on Big Data Intelligence and Computing (DataCom), Orlando, FL, USA, 2017, pp. 1098-1105.
33. S. Arifuzzaman, M. Khan, and M. Marathe, "A fast parallel algorithm for counting triangles in graphs using dynamic load balancing," 2015 IEEE International Conference on Big Data (Big Data), Santa Clara, CA, 2015, pp. 1839-1847.
34. S. Abdelhamid et al., "CINET 2.0: A CyberInfrastructure for Network Science," 2014 IEEE 10th International Conference on e-Science, Sao Paulo, 2014, pp. 324-331.
35. S. Arifuzzaman, "Parallel Mining and Analysis of Triangles and Communities in Big Networks," VTechWorks, 2016.
36. J. R. Powell, "The Quantum Limit to Moore's Law," Proceedings of the IEEE, vol. 96, no. 8, pp. 1247-1248, Aug. 2008.
37. NetworkX documentation: Betweenness Centrality. https://networkx.github.io/documentation/networkx-1.10/reference/generated/networkx.algorithms.centrality.betweenness_centrality.html
38. NetworkX documentation: Closeness Centrality. https://networkx.github.io/documentation/networkx-1.10/reference/generated/networkx.algorithms.centrality.closeness_centrality.html
39. NetworkX documentation: Clustering. https://networkx.github.io/documentation/networkx-1.10/reference/generated/networkx.algorithms.cluster.clustering.html