Efficient Enumeration of All Connected Induced Subgraphs of a Large Undirected Graph

by
Sean Maxwell

Submitted in partial fulfillment of the requirements
for the degree of Master of Science

Graduate Program in Systems Biology and Bioinformatics
Case Western Reserve University
January, 2014

CASE WESTERN RESERVE UNIVERSITY
SCHOOL OF GRADUATE STUDIES

We hereby approve the thesis/dissertation of Sean Maxwell, candidate for the Master of Science degree*.

Mark Chance
Harold Connamacher
Mehmet Koyutürk

(date) June 25, 2013

*We also certify that written approval has been obtained for any proprietary material contained therein.

For my wife Lea and our daughter Stella.

Contents

Abstract
1 Introduction
2 Problem Definition and Observations
3 Base Case
3.1 Focusing on Direct Neighbors
3.2 Optimized Local Search Tree
4 General Case
4.1 Joining Local Search Trees
4.2 Optimized Joining of Local Search Trees
4.3 Caching Depth First Search (CDFS)
5 Correctness
6 Experimental Results
6.1 Exhaustive Synthetic Testing
6.2 Integration into CRANE
7 Discussion
8 Conclusion
A Extended Definitions of Complex Notation
B Supporting Lemmas

List of Figures

1 Enumeration of all S using anchor vertices
2 Binomial tree generated from an anchor vertex
3 Optimized local search tree construction
4 Extending a local search tree
5 Enumeration tree T
6 Examples of potential overhead during construction of T
7 Rejection rate analysis for arXiv [27]
8 Runtime comparison for arXiv [27]
9 Adjacency overhead comparison for arXiv [27]
10 Rejection rate analysis for HPRD [15]
11 Runtime comparison for HPRD [15]

List of Algorithms

1 CDFS
2 CHOOSELOWEST

Acknowledgments

I would like to thank my committee for all of their guidance: Dr. Mehmet Koyutürk, my thesis advisor; Dr. Mark Chance, my committee chair; and Dr. Harold Connamacher.
I am truly grateful to have had the opportunity to learn and work with each committee member. This thesis would not have been possible without my family who supported me on my journey. I am forever grateful to my wife Lea for her infinite patience and love, and our daughter Stella who unselfishly sacrificed time with her father on many occasions.

Symbols

The following table provides a very brief definition of all symbols. Detailed definitions of the symbols defining complex concepts are provided in Appendix A.

Graph
  $G$                   An undirected graph / network
  $V$                   Set of all vertices in $G$
  $E$                   Set of all edges in $G$
  $v$                   A vertex $v \in V$
  $S$                   A set of vertices $S \subseteq V$ that induce a connected subgraph in $G$

Tree
  $T$                   Tree enumerating all $S$ that contain a given anchor vertex $v \in V$
  $N$                   The set of all nodes of $T$
  $B$                   The set of all branches of $T$
  $n$                   A node of $T$ labeled with one $v \in V$. Label vertex denoted as $v_n$
  $r$                   The root node of $T$, labeled by the anchor vertex $v_r$

Set/List
  $P(n_k)$              The sequence of nodes along the path from $n_1 = r$ to $n_k$ in $T$
  $D_{n_k}$             Neighbor set of $n_k$ is all vertices adjacent to $v_{n_k}$ in $G$
  $C_{n_k}$             Cull set of $n_k$ is $\{v_r\} \cup \{u : u v_{n_i} \in E,\ 1 \le i < k\}$
  $\chi_{n_k}$          Extension set of $n_k$ is $D_{n_k} \setminus C_{n_k}$
  $\beta(n_k, n_{k+1})$ A list of branch nodes defined by $n_k$ and $n_{k+1}$

f(x)
  $\Upsilon(v)$         Adds a new node to $N$ labeled with $v$
  $(P(n_k))$            Maps $P(n_k)$ to labeling vertices, $(P(n_k)) = \{v_{n_1}, v_{n_2}, \ldots, v_{n_k}\}$
  $\Gamma(n)$           Returns the source node $m \in N$ that $n$ was cloned from
  $f_b(S)$              Evaluates a bounding function on the subset $S$
  $K(n)$                The set of all children of $n \in N$
  $p(n)$                The immediate parent of $n \in N$

Abstract

In this work we investigate the problem of efficiently enumerating all connected induced subgraphs of an undirected graph.
We show that the redundant computations in a depth first branch and bound search for subgraphs that satisfy a hereditary property can be reduced by using an enumeration tree as a "memory" during the depth first search to avoid redundant rejections. This approach reduces the runtime of exhaustive search as well. We provide a proof of correctness and computational results on synthetic and real data sets that demonstrate improved runtime over traditional depth first approaches.

1 Introduction

For many applications in systems biology, the connected induced subgraphs of molecular interaction networks are of particular interest because they represent sets of functionally associated biomolecules. In many applications, scientists are interested in finding groups of functionally associated molecules that together induce coherent patterns in other types of biological data. For example, in the context of the systems biology of complex diseases, one is interested in identifying "dysregulated protein subnetworks", which are sets of proteins that are connected to each other via protein-protein interactions and that exhibit collective differential expression between different phenotypes [5, 6, 7, 10, 29]. Similarly, gene set enrichment analysis aims to evaluate the statistical significance of the aggregate disease association of sets of genes that are defined a priori, and the connected subgraphs of molecular networks provide excellent candidate gene sets because their members are related through physical and functional interactions [24]. At an evolutionary scale, sets of orthologous proteins that induce connected subgraphs have been shown to be useful in gaining insights into the conservation and modularity of biological processes across diverse taxa [17, 22, 25].
While methods designed to tackle these problems implement heuristic algorithms to search the space of connected induced subgraphs of a network, it has been shown that exhaustive search may identify more biologically relevant patterns than simple heuristics do [6, 25, 30].

There are many sources of molecular interaction data available, such as HPRD [15] and BioGrid [13]. These networks are usually modeled as undirected graphs, in which nodes represent proteins and edges represent pairwise interactions between them. These networks are quite large but highly sparse. For this reason, when searching for groups of proteins that together optimize a certain objective function, limiting the search to interacting proteins greatly reduces the search space without compromising the biological relevance of the results. For instance, at the time this document was prepared, HPRD contained 37,080 protein-protein interactions among 9,455 proteins. This is far fewer interactions than would result from assuming that every protein is functionally related to every other protein, and it dramatically reduces the search space for dysregulated groups of functionally related proteins. For example, a search for all groups of size $k$ among $n$ proteins in which all proteins interact with each other yields $\binom{n}{k}$ subgraphs, whereas the number of protein groups that induce a connected subgraph of the interaction network is much smaller. We seek to further reduce the search space by addressing one type of overhead that can arise during depth first branch and bound enumeration of connected induced subgraphs.

Subgraph enumeration is a problem that arises in many applications. Depending on the application, the problem can be formulated in several ways.
While two frequently used approaches overlap, it should be noted that algorithms that enumerate subgraphs by edges (topologically) are fundamentally different from algorithms that enumerate connected induced subgraphs (i.e., the former focus on sets of edges while the latter focus on sets of vertices). Topological enumeration, such as that used for motif mining or querying, focuses on generating all edge configurations, whereas connected induced subgraph enumeration aims to generate vertex sets that satisfy the criterion of connectedness. However, regardless of the task, most enumeration methods rely on an underlying order of vertices (intrinsic or imposed) to perform efficient enumeration, and thus certain themes are common.

The ordering of vertices defines a relation on any two vertices $v_1, v_2$ such that $v_1 \prec v_2$ (read as $v_1$ precedes $v_2$) is either true or false. Throughout this work we refer to the order of vertices as a lexicographic order to establish that a total order exists on all sequences of alphanumeric symbols used to name vertices; for example, the lexicographic order of vertices named A1 and A2 is $A1 \prec A2$, while the lexicographic order of vertices named B1 and A2 is $A2 \prec B1$. We refer to the lexicographic rank of vertex $v_1$ in relation to $v_2$ as higher if $v_2 \prec v_1$ and lower if $v_1 \prec v_2$. Some works from our literature survey use the ordering of vertices (and edges) to define a lexicographic order/rank on canonical forms of subgraphs; for example, if two connected induced subgraphs had canonical forms $A, B, C$ and $C, B, A$, then $A, B, C \prec C, B, A$ is true. The lexicographic rank of vertices and of canonical forms of subgraphs is used in several of the enumeration methods we surveyed, described below.

A good deal of research has focused on identifying motifs or frequent subgraphs in graphs. Methods such as gSpan [34] and FFSN [20] mine all frequent subgraphs from an input graph. gSpan utilizes a branch and bound depth first approach to enumerate candidates.
Bounding is performed using the lexicographic rank of the canonical form of each subgraph such that a subgraph is only extended if its lexicographic rank is theoretically lowest, thus avoiding much redundancy caused by isomorphisms of the same graph being searched separately.
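
The search-space argument and the vertex ordering discussed earlier in this section can be illustrated on a toy graph. The Python sketch below is a brute-force illustration only and is not the method developed in this thesis: it checks every vertex subset of size k for connectedness and compares the number of connected subsets with the binomial coefficient that would apply if every vertex were adjacent to every other. The graph `adj` and the helper names `is_connected` and `connected_induced_subsets` are hypothetical and chosen for this example; the lexicographic order is assumed to be Python's built-in string ordering of the vertex names.

```python
from itertools import combinations
from math import comb


def is_connected(vertices, adj):
    """Return True if `vertices` induces a connected subgraph of `adj`."""
    vertices = set(vertices)
    if not vertices:
        return False
    start = min(vertices)  # lexicographically lowest vertex name
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            # Only follow edges whose endpoints both lie in the subset.
            if w in vertices and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == vertices


def connected_induced_subsets(adj, k):
    """Brute force: test every k-subset of vertices for connectedness."""
    names = sorted(adj)  # imposes the lexicographic order on vertex names
    return [s for s in combinations(names, k) if is_connected(s, adj)]


# Hypothetical toy graph: the path A1 - A2 - B1 - B2 - C1.
adj = {
    "A1": {"A2"},
    "A2": {"A1", "B1"},
    "B1": {"A2", "B2"},
    "B2": {"B1", "C1"},
    "C1": {"B2"},
}

k = 3
connected = connected_induced_subsets(adj, k)
all_subsets = comb(len(adj), k)  # count if every pair of vertices were adjacent

print(connected)
print(len(connected), "connected subsets out of", all_subsets, "subsets of size", k)
```

On this path graph only the three runs of consecutive vertices are connected, against ten subsets of size three overall. The brute-force check still visits every subset, which is what makes it impractical on large networks and is the kind of cost that enumeration methods restricted to connected subsets aim to avoid.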
