Analysis of Protein-Protein Interaction Networks Using High Performance Scalable Tools

Total Page:16

File Type:pdf, Size:1020Kb

Analysis of Protein-Protein Interaction Networks Using High Performance Scalable Tools University of New Orleans ScholarWorks@UNO Senior Honors Theses Undergraduate Showcase 5-2018 Analysis of Protein-Protein Interaction Networks Using High Performance Scalable Tools Bikesh Pandey Follow this and additional works at: https://scholarworks.uno.edu/honors_theses Part of the Computer Sciences Commons Recommended Citation Pandey, Bikesh, "Analysis of Protein-Protein Interaction Networks Using High Performance Scalable Tools" (2018). Senior Honors Theses. 113. https://scholarworks.uno.edu/honors_theses/113 This Honors Thesis-Unrestricted is protected by copyright and/or related rights. It has been brought to you by ScholarWorks@UNO with permission from the rights-holder(s). You are free to use this Honors Thesis-Unrestricted in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s) directly, unless additional rights are indicated by a Creative Commons license in the record and/or on the work itself. This Honors Thesis-Unrestricted has been accepted for inclusion in Senior Honors Theses by an authorized administrator of ScholarWorks@UNO. For more information, please contact [email protected]. ANALYSIS OF PROTEIN-PROTEIN INTERACTION NETWORKS USING HIGH PERFORMANCE SCALABLE TOOLS An Honors Thesis Presented to the Department of Computer Science of the University of New Orleans In Partial Fulfillment of the Requirements for the Degree of Bachelor of Science, with University High Honors by Bikesh Pandey May 2018 ACKNOWLEDGEMENTS I would like to thank Dr. Shaikh Arifuzzaman for providing me with an opportunity to work with him on large-scale graph mining problems and also supporting me throughout this entire process. I would also like to thank Dr. Summa for providing important feedback as well as providing essential knowledge and help throughout my undergraduate career. Similarly, I want to thank the entire Computer Science department for helping me be a part of projects, and for providing me with a great learning environment. I would also like to think Ms. Erin Sutherland for assisting with the Thesis as well as bearing with me despite the setbacks and delays in writing it up. I also want to thank my family, specifically my parents and my brother, who despite trying times kept inspiring me to push myself to complete difficult tasks while fighting hectic schedules. ii TABLE OF CONTENTS ACKNOWLEDGEMENTS ................................................................................................................ ii LIST OF TABLES ............................................................................................................................... iv LIST OF FIGURES ............................................................................................................................ v Abstract ............................................................................................................................................... vi Introduction ....................................................................................................................................... 1 Related Work ..................................................................................................................................... 2 Datasets ............................................................................................................................................... 3 Computation Requirement .......................................................................................................... 4 Basics of Graph Theory ................................................................................................................. 5 Using the tools .................................................................................................................................. 8 Categorization of Analysis Metrics ........................................................................................... 9 Results ................................................................................................................................................. 11 A. Global Analysis ................................................................................................................................... 11 B. Community Analysis........................................................................................................................ 13 C. Local Analysis ..................................................................................................................................... 15 D. The “Hub” ............................................................................................................................................. 15 E. Scalability ............................................................................................................................................. 19 F. Network Analysis Tools ................................................................................................................. 20 Comparison: Networkx vs Igraph ............................................................................................ 20 Conclusion .......................................................................................................................................... 22 BIBLIOGRAPHY ................................................................................................................................ 24 iii LIST OF TABLES Table 1. Sample Datasets used in the experiment .............................................................. 3 Table 2. Global Analysis metrics results 1 ............................................................................. 11 Table 3. Global Metrics Results 2 .............................................................................................. 12 Table 4. Global Metrics Result 3 ................................................................................................ 12 Table 5. Community Analysis metrics results ...................................................................... 13 Table 6. Dominantly central nodes in 5 network datasets ............................................. 17 Table 7. Speedups compared: sequential vs MPI based tool ......................................... 19 Table 8. Speedups compared: networkx and igraph ......................................................... 21 iv LIST OF FIGURES Figure 1. A graph colored on the basis of BC from least (red) to greatest (blue) .. 5 Figure 2. Average Clustering Coefficient in the datasets…………………………….…... 13 Figure 3. A subgraph in Homo sapiens. Node colors represent modularity classes and node sizes represent degrees…………………………………………………………... 14 Figure 4. Top three proteins with highest centrality values in Homo sapiens….. 16 v Abstract Protein-Protein Interaction (PPI) Research currently generates an extraordinary amount of publications and interest in fellow computer scientists and biologists alike because of the underlying potential of the source material that researchers can work with. PPI networks are the networks of protein complexes formed by biochemical events or electrostatic forces serving a biological function [1]. Since the analysis of the protein networks is now growing, we have more information regarding protein, genomes and their influence on life. Today, PPI networks are used to study diseases, improve drugs and understand other processes in medicine and health that will eventually help mankind. Though PPI network research is considered extremely important in the field, there is an issue – we do not have enough people who have enough interdisciplinary knowledge in both the fields of biology and computer science; this limits our rate of progress in the field. Most biologists that are not expert coders need a way of calculating graph values and information that will help them analyze the graphs better without having to manipulate the data themselves. In this research, I test a few ways of achieving results through the use of available frameworks and algorithms, present the results and compare each method’s efficacy. My analysis takes place on very large datasets where I calculate several centralities and other data from the graph using different metrics, and I also visualize them in order to gain further insight. I also managed to note the significance of MPI and multithreading on the results thus obtained that suggest building scalable tools will help improve the analysis immensely. vi 1 Introduction In the recent times, we have invested a lot in researching DNA and specific genes and how they have the possibility of affecting how an organism behaves and functions. DNA in an organism gives it the possibility to have certain traits; however, only the activated proteins produced from the genetic information actually bring the traits to fruition. There are many ways in which we can detect how important specific proteins are to an organism by looking at how often certain proteins interact and how specific proteins are involved in all interactions. We usually do this by building graphs using data obtained from giant protein datasets. However, the analysis of these graphs using network is really costly and takes up a lot of resources. The following research paper deals in how we can possibly reduce the cost by minimizing the overhead required to perform most of these operations by finding out the best ways to do so. Networks are important because in a vast array for complicated information
Recommended publications
  • Sagemath and Sagemathcloud
    Viviane Pons Ma^ıtrede conf´erence,Universit´eParis-Sud Orsay [email protected] { @PyViv SageMath and SageMathCloud Introduction SageMath SageMath is a free open source mathematics software I Created in 2005 by William Stein. I http://www.sagemath.org/ I Mission: Creating a viable free open source alternative to Magma, Maple, Mathematica and Matlab. Viviane Pons (U-PSud) SageMath and SageMathCloud October 19, 2016 2 / 7 SageMath Source and language I the main language of Sage is python (but there are many other source languages: cython, C, C++, fortran) I the source is distributed under the GPL licence. Viviane Pons (U-PSud) SageMath and SageMathCloud October 19, 2016 3 / 7 SageMath Sage and libraries One of the original purpose of Sage was to put together the many existent open source mathematics software programs: Atlas, GAP, GMP, Linbox, Maxima, MPFR, PARI/GP, NetworkX, NTL, Numpy/Scipy, Singular, Symmetrica,... Sage is all-inclusive: it installs all those libraries and gives you a common python-based interface to work on them. On top of it is the python / cython Sage library it-self. Viviane Pons (U-PSud) SageMath and SageMathCloud October 19, 2016 4 / 7 SageMath Sage and libraries I You can use a library explicitly: sage: n = gap(20062006) sage: type(n) <c l a s s 'sage. interfaces .gap.GapElement'> sage: n.Factors() [ 2, 17, 59, 73, 137 ] I But also, many of Sage computation are done through those libraries without necessarily telling you: sage: G = PermutationGroup([[(1,2,3),(4,5)],[(3,4)]]) sage : G . g a p () Group( [ (3,4), (1,2,3)(4,5) ] ) Viviane Pons (U-PSud) SageMath and SageMathCloud October 19, 2016 5 / 7 SageMath Development model Development model I Sage is developed by researchers for researchers: the original philosophy is to develop what you need for your research and share it with the community.
    [Show full text]
  • Networkx Tutorial
    5.03.2020 tutorial NetworkX tutorial Source: https://github.com/networkx/notebooks (https://github.com/networkx/notebooks) Minor corrections: JS, 27.02.2019 Creating a graph Create an empty graph with no nodes and no edges. In [1]: import networkx as nx In [2]: G = nx.Graph() By definition, a Graph is a collection of nodes (vertices) along with identified pairs of nodes (called edges, links, etc). In NetworkX, nodes can be any hashable object e.g. a text string, an image, an XML object, another Graph, a customized node object, etc. (Note: Python's None object should not be used as a node as it determines whether optional function arguments have been assigned in many functions.) Nodes The graph G can be grown in several ways. NetworkX includes many graph generator functions and facilities to read and write graphs in many formats. To get started though we'll look at simple manipulations. You can add one node at a time, In [3]: G.add_node(1) add a list of nodes, In [4]: G.add_nodes_from([2, 3]) or add any nbunch of nodes. An nbunch is any iterable container of nodes that is not itself a node in the graph. (e.g. a list, set, graph, file, etc..) In [5]: H = nx.path_graph(10) file:///home/szwabin/Dropbox/Praca/Zajecia/Diffusion/Lectures/1_intro/networkx_tutorial/tutorial.html 1/18 5.03.2020 tutorial In [6]: G.add_nodes_from(H) Note that G now contains the nodes of H as nodes of G. In contrast, you could use the graph H as a node in G.
    [Show full text]
  • Networkx: Network Analysis with Python
    NetworkX: Network Analysis with Python Salvatore Scellato Full tutorial presented at the XXX SunBelt Conference “NetworkX introduction: Hacking social networks using the Python programming language” by Aric Hagberg & Drew Conway Outline 1. Introduction to NetworkX 2. Getting started with Python and NetworkX 3. Basic network analysis 4. Writing your own code 5. You are ready for your project! 1. Introduction to NetworkX. Introduction to NetworkX - network analysis Vast amounts of network data are being generated and collected • Sociology: web pages, mobile phones, social networks • Technology: Internet routers, vehicular flows, power grids How can we analyze this networks? Introduction to NetworkX - Python awesomeness Introduction to NetworkX “Python package for the creation, manipulation and study of the structure, dynamics and functions of complex networks.” • Data structures for representing many types of networks, or graphs • Nodes can be any (hashable) Python object, edges can contain arbitrary data • Flexibility ideal for representing networks found in many different fields • Easy to install on multiple platforms • Online up-to-date documentation • First public release in April 2005 Introduction to NetworkX - design requirements • Tool to study the structure and dynamics of social, biological, and infrastructure networks • Ease-of-use and rapid development in a collaborative, multidisciplinary environment • Easy to learn, easy to teach • Open-source tool base that can easily grow in a multidisciplinary environment with non-expert users
    [Show full text]
  • STATS 701 Data Analysis Using Python Lecture 27: Apis and Graph Processing Some Slides Adapted from C
    STATS 701 Data Analysis using Python Lecture 27: APIs and Graph Processing Some slides adapted from C. Budak Previously: Scraping Data from the Web We used BeautifulSoup to process HTML that we read directly We had to figure out where to find the data in the HTML This was okay for simple things like Wikipedia… ...but what about large, complicated data sets? E.g., Climate data from NOAA; Twitter/reddit/etc.; Google maps Many websites support APIs, which make these tasks simpler Instead of scraping for what we want, just ask! Example: ask Google Maps for a computer repair shop near a given address Three common API approaches Via a Python package Service (e.g., Google maps, ESRI*) provides library for querying DB Example: from arcgis.gis import GIS Via a command-line tool Ultimately, all three of these approaches end up submitting an Example: twurl https://developer.twitter.com/ HTTP request to a server, which returns information in the form of a Via HTTP requests JSON or XML file, typically. We submit an HTTP request to a server Supply additional parameters in URL to specify our query Example: https://www.yelp.com/developers/documentation/v3/business_search * ESRI is a GIS service, to which the university has a subscription: https://developers.arcgis.com/python/ Web service APIs Step 1: Create URL with query parameters Example (non-working): www.example.com/search?key1=val1&key2=val2 Step 2: Make an HTTP request Communicates to the server what kind of action we wish to perform https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol#Request_methods Step 3: Server returns a response to your request May be as simple as a code (e.g., 404 error)..
    [Show full text]
  • Introduction to Network Science & Visualisation
    IFC – Bank Indonesia International Workshop and Seminar on “Big Data for Central Bank Policies / Building Pathways for Policy Making with Big Data” Bali, Indonesia, 23-26 July 2018 Introduction to network science & visualisation1 Kimmo Soramäki, Financial Network Analytics 1 This presentation was prepared for the meeting. The views expressed are those of the author and do not necessarily reflect the views of the BIS, the IFC or the central banks and other institutions represented at the meeting. FNA FNA Introduction to Network Science & Visualization I Dr. Kimmo Soramäki Founder & CEO, FNA www.fna.fi Agenda Network Science ● Introduction ● Key concepts Exposure Networks ● OTC Derivatives ● CCP Interconnectedness Correlation Networks ● Housing Bubble and Crisis ● US Presidential Election Network Science and Graphs Analytics Is already powering the best known AI applications Knowledge Social Product Economic Knowledge Payment Graph Graph Graph Graph Graph Graph Network Science and Graphs Analytics “Goldman Sachs takes a DIY approach to graph analytics” For enhanced compliance and fraud detection (www.TechTarget.com, Mar 2015). “PayPal relies on graph techniques to perform sophisticated fraud detection” Saving them more than $700 million and enabling them to perform predictive fraud analysis, according to the IDC (www.globalbankingandfinance.com, Jan 2016) "Network diagnostics .. may displace atomised metrics such as VaR” Regulators are increasing using network science for financial stability analysis. (Andy Haldane, Bank of England Executive
    [Show full text]
  • Graph Database Fundamental Services
    Bachelor Project Czech Technical University in Prague Faculty of Electrical Engineering F3 Department of Cybernetics Graph Database Fundamental Services Tomáš Roun Supervisor: RNDr. Marko Genyk-Berezovskyj Field of study: Open Informatics Subfield: Computer and Informatic Science May 2018 ii Acknowledgements Declaration I would like to thank my advisor RNDr. I declare that the presented work was de- Marko Genyk-Berezovskyj for his guid- veloped independently and that I have ance and advice. I would also like to thank listed all sources of information used Sergej Kurbanov and Herbert Ullrich for within it in accordance with the methodi- their help and contributions to the project. cal instructions for observing the ethical Special thanks go to my family for their principles in the preparation of university never-ending support. theses. Prague, date ............................ ........................................... signature iii Abstract Abstrakt The goal of this thesis is to provide an Cílem této práce je vyvinout webovou easy-to-use web service offering a database službu nabízející databázi neorientova- of undirected graphs that can be searched ných grafů, kterou bude možno efektivně based on the graph properties. In addi- prohledávat na základě vlastností grafů. tion, it should also allow to compute prop- Tato služba zároveň umožní vypočítávat erties of user-supplied graphs with the grafové vlastnosti pro grafy zadané uži- help graph libraries and generate graph vatelem s pomocí grafových knihoven a images. Last but not least, we implement zobrazovat obrázky grafů. V neposlední a system that allows bulk adding of new řadě je také cílem navrhnout systém na graphs to the database and computing hromadné přidávání grafů do databáze a their properties.
    [Show full text]
  • Evolving Networks and Social Network Analysis Methods And
    DOI: 10.5772/intechopen.79041 ProvisionalChapter chapter 7 Evolving Networks andand SocialSocial NetworkNetwork AnalysisAnalysis Methods and Techniques Mário Cordeiro, Rui P. Sarmento,Sarmento, PavelPavel BrazdilBrazdil andand João Gama Additional information isis available atat thethe endend ofof thethe chapterchapter http://dx.doi.org/10.5772/intechopen.79041 Abstract Evolving networks by definition are networks that change as a function of time. They are a natural extension of network science since almost all real-world networks evolve over time, either by adding or by removing nodes or links over time: elementary actor-level network measures like network centrality change as a function of time, popularity and influence of individuals grow or fade depending on processes, and events occur in net- works during time intervals. Other problems such as network-level statistics computation, link prediction, community detection, and visualization gain additional research impor- tance when applied to dynamic online social networks (OSNs). Due to their temporal dimension, rapid growth of users, velocity of changes in networks, and amount of data that these OSNs generate, effective and efficient methods and techniques for small static networks are now required to scale and deal with the temporal dimension in case of streaming settings. This chapter reviews the state of the art in selected aspects of evolving social networks presenting open research challenges related to OSNs. The challenges suggest that significant further research is required in evolving social networks, i.e., existent methods, techniques, and algorithms must be rethought and designed toward incremental and dynamic versions that allow the efficient analysis of evolving networks. Keywords: evolving networks, social network analysis 1.
    [Show full text]
  • Network Science, Homophily and Who Reviews Who in the Linux Kernel? Working Paper[+] › Open-Access at ECIS2020-Arxiv.Pdf
    Network Science, Homophily and Who Reviews Who in the Linux Kernel? Working paper[P] Open-access at http://users.abo.fi/jteixeir/pub/linuxsna/ ECIS2020-arxiv.pdf José Apolinário Teixeira Åbo Akademi University Finland # jteixeira@abo. Ville Leppänen University of Turku Finland Sami Hyrynsalmi arXiv:2106.09329v1 [cs.SE] 17 Jun 2021 LUT University Finland [P]As presented at 2020 European Conference on Information Systems (ECIS 2020), held Online, June 15-17, 2020. The ocial conference proceedings are available at the AIS eLibrary (https://aisel.aisnet.org/ecis2020_rp/). Page ii of 24 ? Copyright notice ? The copyright is held by the authors. The same article is available at the AIS Electronic Library (AISeL) with permission from the authors (see https://aisel.aisnet.org/ ecis2020_rp/). The Association for Information Systems (AIS) can publish and repro- duce this article as part of the Proceedings of the European Conference on Information Systems (ECIS 2020). ? Archiving information ? The article was self-archived by the rst author at its own personal website http:// users.abo.fi/jteixeir/pub/linuxsna/ECIS2020-arxiv.pdf dur- ing June 2021 after the work was presented at the 28th European Conference on Informa- tion Systems (ECIS 2020). Page iii of 24 ð Funding and Acknowledgements ð The rst author’s eorts were partially nanced by Liikesivistysrahasto - the Finnish Foundation for Economic Education, the Academy of Finland via the DiWIL project (see http://abo./diwil) project. A research companion website at http://users.abo./jteixeir/ECIS2020cw supports the paper with additional methodological details, additional data visualizations (plots, tables, and networks), as well as high-resolution versions of the gures embedded in the paper.
    [Show full text]
  • Network Science
    This is a preprint of Katy Börner, Soma Sanyal and Alessandro Vespignani (2007) Network Science. In Blaise Cronin (Ed) Annual Review of Information Science & Technology, Volume 41. Medford, NJ: Information Today, Inc./American Society for Information Science and Technology, chapter 12, pp. 537-607. Network Science Katy Börner School of Library and Information Science, Indiana University, Bloomington, IN 47405, USA [email protected] Soma Sanyal School of Library and Information Science, Indiana University, Bloomington, IN 47405, USA [email protected] Alessandro Vespignani School of Informatics, Indiana University, Bloomington, IN 47406, USA [email protected] 1. Introduction.............................................................................................................................................2 2. Notions and Notations.............................................................................................................................4 2.1 Graphs and Subgraphs .........................................................................................................................5 2.2 Graph Connectivity..............................................................................................................................7 3. Network Sampling ..................................................................................................................................9 4. Network Measurements........................................................................................................................11
    [Show full text]
  • Networkx Reference Release 1.9.1
    NetworkX Reference Release 1.9.1 Aric Hagberg, Dan Schult, Pieter Swart September 20, 2014 CONTENTS 1 Overview 1 1.1 Who uses NetworkX?..........................................1 1.2 Goals...................................................1 1.3 The Python programming language...................................1 1.4 Free software...............................................2 1.5 History..................................................2 2 Introduction 3 2.1 NetworkX Basics.............................................3 2.2 Nodes and Edges.............................................4 3 Graph types 9 3.1 Which graph class should I use?.....................................9 3.2 Basic graph types.............................................9 4 Algorithms 127 4.1 Approximation.............................................. 127 4.2 Assortativity............................................... 132 4.3 Bipartite................................................. 141 4.4 Blockmodeling.............................................. 161 4.5 Boundary................................................. 162 4.6 Centrality................................................. 163 4.7 Chordal.................................................. 184 4.8 Clique.................................................. 187 4.9 Clustering................................................ 190 4.10 Communities............................................... 193 4.11 Components............................................... 194 4.12 Connectivity..............................................
    [Show full text]
  • Exploring Network Structure, Dynamics, and Function Using Networkx
    Proceedings of the 7th Python in Science Conference (SciPy 2008) Exploring Network Structure, Dynamics, and Function using NetworkX Aric A. Hagberg ([email protected])– Los Alamos National Laboratory, Los Alamos, New Mexico USA Daniel A. Schult ([email protected])– Colgate University, Hamilton, NY USA Pieter J. Swart ([email protected])– Los Alamos National Laboratory, Los Alamos, New Mexico USA NetworkX is a Python language package for explo- and algorithms, to rapidly test new hypotheses and ration and analysis of networks and network algo- models, and to teach the theory of networks. rithms. The core package provides data structures The structure of a network, or graph, is encoded in the for representing many types of networks, or graphs, edges (connections, links, ties, arcs, bonds) between including simple graphs, directed graphs, and graphs nodes (vertices, sites, actors). NetworkX provides ba- with parallel edges and self-loops. The nodes in Net- sic network data structures for the representation of workX graphs can be any (hashable) Python object simple graphs, directed graphs, and graphs with self- and edges can contain arbitrary data; this flexibil- loops and parallel edges. It allows (almost) arbitrary ity makes NetworkX ideal for representing networks objects as nodes and can associate arbitrary objects to found in many different scientific fields. edges. This is a powerful advantage; the network struc- In addition to the basic data structures many graph ture can be integrated with custom objects and data algorithms are implemented for calculating network structures, complementing any pre-existing code and properties and structure measures: shortest paths, allowing network analysis in any application setting betweenness centrality, clustering, and degree dis- without significant software development.
    [Show full text]
  • Fastnet: an R Package for Fast Simulation and Analysis of Large-Scale Social Networks
    JSS Journal of Statistical Software November 2020, Volume 96, Issue 7. doi: 10.18637/jss.v096.i07 fastnet: An R Package for Fast Simulation and Analysis of Large-Scale Social Networks Xu Dong Luis Castro Nazrul Shaikh Tamr Inc. World Bank Cecareus Inc. Abstract Traditional tools and software for social network analysis are seldom scalable and/or fast. This paper provides an overview of an R package called fastnet, a tool for scaling and speeding up the simulation and analysis of large-scale social networks. fastnet uses multi-core processing and sub-graph sampling algorithms to achieve the desired scale-up and speed-up. Simple examples, usages, and comparisons of scale-up and speed-up as compared to other R packages, i.e., igraph and statnet, are presented. Keywords: social network analysis, network simulation, network metrics, multi-core process- ing, sampling. 1. Introduction It has been about twenty years since the introduction of social network analysis (SNA) soft- ware such as Pajek (Batagelj and Mrvar 1998) and UCINET (Borgatti, Everett, and Freeman 2002). Though these software packages are still existent, the last ten years have witnessed a significant change in the needs and aspiration of researchers working in the field. The growth of popular online social networks, such as Facebook, Twitter, LinkedIn, Snapchat, and the availability of data from large systems such as the telecommunication system and the inter- net of things (IoT) has ushered in the need to focus on computational and data management issues associated with SNA. During this period, several Python (Van Rossum et al. 2011) and Java based SNA tools, such as NetworkX (Hagberg, Schult, and Swart 2008) and SNAP (Leskovec and Sosič 2016), and R (R Core Team 2020) packages, such as statnet (Hunter, Handcock, Butts, Goodreau, and Morris 2008; Handcock et al.
    [Show full text]