Efficient Enumeration of all Connected Induced Subgraphs of a Large Undirected Graph

by

Sean Maxwell

Submitted in partial fulfillment of the requirements

For the degree of Master of Science

Graduate Program in Systems Biology and

Case Western Reserve University

January, 2014

CASE WESTERN RESERVE UNIVERSITY

SCHOOL OF GRADUATE STUDIES

We hereby approve the thesis/dissertation of

Sean Maxwell candidate for the Master of Science degree*.

Mark Chance

Harold Connamacher

Mehmet Koyutürk

(date) June 25, 2013

*We also certify that written approval has been obtained for any proprietary material contained therein.

For my wife Lea and our daughter Stella.

Contents

Abstract

1 Introduction

2 Problem Definition and Observations

3 Base Case
  3.1 Focusing on Direct Neighbors
  3.2 Optimized Local Search Tree

4 General Case
  4.1 Joining Local Search Trees
  4.2 Optimized Joining of Local Search Trees
  4.3 Caching Depth First Search (CDFS)

5 Correctness

6 Experimental Results
  6.1 Exhaustive Synthetic Testing
  6.2 Integration into CRANE

7 Discussion

8 Conclusion

A Extended Definitions of Complex Notation

B Supporting Lemmas

List of Figures

1 Enumeration of all S using anchor vertices
2 Binomial tree generated from an anchor vertex
3 Optimized local search tree construction
4 Extending a local search tree
5 Enumeration tree T
6 Examples of potential overhead during construction of T
7 Rejection rate analysis for arXiv [27]
8 Runtime comparison for arXiv [27]
9 Adjacency overhead comparison for arXiv [27]
10 Rejection rate analysis for HPRD [15]
11 Runtime comparison for HPRD [15]

List of Algorithms

1 CDFS
2 CHOOSELOWEST

Acknowledgments

I would like to thank my committee for all of their guidance: Dr. Mehmet Koyutürk, my thesis advisor; Dr. Mark Chance, my committee chair; and Dr. Harold Connamacher. I am truly grateful to have had the opportunity to learn and work with each committee member. This thesis would not have been possible without my family, who supported me on my journey. I am forever grateful to my wife Lea for her infinite patience and love, and to our daughter Stella, who unselfishly sacrificed time with her father on many occasions.

Symbols

The following table provides a very brief definition of all symbols. Detailed definitions of the symbols defining complex concepts are provided in Appendix A.

Graph
  G              An undirected graph / network
  V              Set of all vertices in G
  E              Set of all edges in G
  v              A vertex v ∈ V
  S              A set of vertices S ⊆ V that induces a connected subgraph in G

Tree
  T              Tree enumerating all S that contain a given anchor vertex v ∈ V
  N              The set of all nodes of T
  B              The set of all branches of T
  n              A node of T labeled with one v ∈ V; its label vertex is denoted vn
  r              The root node of T, labeled by the anchor vertex vr

Set/List
  P(nk)          The nodes along the path from n1 = r to nk in T
  Dnk            Neighbor set of nk: all vertices adjacent to vnk in G
  Cnk            Cull set of nk: {vr} ∪ {u : u vni ∈ E, 1 ≤ i < k}
  χnk            Extension set of nk: Dnk \ Cnk
  β(nk, nk+1)    A list of branch nodes defined by nk and nk+1

f(x)
  Υ(v)           Adds a new node to N labeled with v
  ψ(P(nk))       Maps P(nk) to labeling vertices, ψ(P(nk)) = {vn1, vn2, . . . , vnk}
  Γ(n)           Returns the source node m ∈ N that n was cloned from
  fb(S)          Evaluates a bounding function on the subset S
  K(n)           The set of all children of n ∈ N
  p(n)           The immediate parent of n ∈ N

Efficient Enumeration of all Connected Induced Subgraphs of a Large Undirected Graph

Abstract

by

Sean Maxwell

In this work we investigate the problem of efficiently enumerating all connected induced subgraphs of an undirected graph. We show that the redundant computations in a depth first search for subgraphs that satisfy a hereditary property can be reduced by using an enumeration tree as a “memory” during the depth first search to avoid redundant rejections. This approach reduces the runtime of exhaustive search as well. We provide a proof of correctness for our method, together with computational results on synthetic and real data sets that demonstrate improved runtime over traditional depth first approaches.

1 Introduction

For many applications in systems biology, the connected induced subgraphs of molecular interaction networks are of particular interest since they represent sets of functionally associated biomolecules. In many applications, scientists are interested in finding groups of functionally associated molecules that together induce coherent patterns in other types of biological data. For example, in the context of the systems biology of complex diseases, one is interested in identifying “dysregulated protein subnetworks”, which are sets of proteins connected to each other via protein-protein interactions that exhibit collective differential expression between different phenotypes [5, 6, 7, 10, 29]. Similarly, gene set enrichment analysis aims to evaluate the statistical significance of the aggregate disease association of sets of genes that are defined a priori, and the connected subgraphs of molecular networks provide excellent candidate gene sets since they are functionally related through physical and functional interactions [24]. At an evolutionary scale, sets of orthologous proteins that induce connected subgraphs are shown to be useful in gaining insights into the conservation and of biological processes across diverse taxa [17, 22, 25]. While methods designed to tackle these problems implement heuristic algorithms to search the space of connected induced subgraphs of a network, it was shown that exhaustive search may lead to the identification of more biologically relevant patterns than those identified by simple heuristics [6, 25, 30].

There are many sources of molecular interaction data available, such as HPRD [15] and BioGrid [13]. These networks are usually modeled as undirected graphs, in which nodes represent proteins and edges represent pairwise interactions between them. These networks are quite large, but highly sparse. For this reason, when searching for groups of proteins that together optimize a certain objective function, limiting the search to interacting proteins greatly reduces the search space while not compromising the biological relevance of the results. For instance, at the time this document was prepared, HPRD contained 37,080 protein-protein interactions among 9,455 proteins. This is far fewer than the 44,693,785 pairs that would result from assuming every protein is functionally related to every other protein, and it dramatically reduces the search space for dysregulated groups of functionally related proteins. For example, searching for all groups of size k among n proteins where all proteins interact with each other yields $\binom{n}{k}$ subgraphs, whereas the number of protein groups that induce a connected subgraph of the interaction network is much smaller. We seek to further reduce the search space by addressing one type of overhead that can arise during depth first branch and bound enumeration of connected induced subgraphs.

Subgraph enumeration is a problem that arises in many applications. Depending on the application, the problem can be formulated in several ways. While two frequently used approaches overlap, it should be noted that algorithms which enumerate subgraphs by edges (topologically) are fundamentally different from algorithms which enumerate connected induced subgraphs (i.e., the former focus on sets of edges while the latter focus on sets of vertices). Topological enumeration, such as that used for motif mining or querying, focuses on generating all edge configurations, whereas connected induced subgraph enumeration aims to generate vertex sets that satisfy the criterion of connectedness. However, regardless of the task, most enumeration methods rely on an underlying order of vertices (intrinsic or imposed) to perform efficient enumeration, and thus certain themes are common.

The ordering of vertices defines a relation on any two vertices v1, v2 such that v1 ≺ v2 (read as v1 precedes v2) is either true or false. Throughout this work we refer to the order of vertices as a lexicographic order to establish that a total order exists on all of the alphanumeric symbols used to name vertices, i.e., the lexicographic order of vertices named A1 and A2 is A1 ≺ A2 while the lexicographic order of vertices named B1 and A2 is A2 ≺ B1. We refer to the lexicographic rank of vertex v1 in relation to v2 as either higher if v2 ≺ v1 or lower if v1 ≺ v2. Some works from our literature survey use the ordering of vertices (and edges) to define a lexicographic order/rank on canonical forms of subgraphs, i.e., if two connected induced subgraphs had canonical forms A,B,C and C,B,A then A,B,C ≺ C,B,A is true. The lexicographic rank of vertices and of canonical forms of subgraphs is used in several of the enumeration methods we surveyed.

A good deal of research has focused on identifying motifs or frequent subgraphs in graphs. Methods such as gSpan [34] and FFSM [20] mine all frequent subgraphs from an input graph. gSpan utilizes a branch and bound depth first approach to enumerate candidates. Bounding is performed using the lexicographic rank of the canonical form of each subgraph such that a subgraph is only extended if its lexicographic rank is theoretically lowest, thus avoiding much of the redundancy caused by isomorphisms of the same graph being searched separately. FFSM [20] utilizes a similar lexicographic ranking technique, but it uses a hybrid method of joining candidate subgraphs to form new candidates as well as extending current candidates by a single vertex to perform a breadth first search. A pruning step removes undesirable candidates from the search at each iteration.

Another problem that has been studied extensively is motif counting. In motif counting, rather than unsupervised mining of all frequent subgraphs, one is interested in counting how many instances of a target motif exist in a graph. This can be viewed as finding all isomorphic instances of a given graph as subgraphs of another graph. An early method that was developed for this task is the algorithm for testing subgraph isomorphism proposed by Ullmann [32]. The VF2 algorithm [9] exhibits a performance improvement over Ullmann using a depth first branch and bound approach and an imposed lexicographic ordering of vertices to enumerate isomorphisms of the target motif. A recent work, the ISAM algorithm [11], demonstrates performance improvements over VF2 and Ullmann by implementing a depth first search as an iterative procedure (similar to Ullmann) but using highly optimized data structures and candidate ranking criteria.
A different approach is taken by Afrati et al. [1], who investigate methods for parallelizing motif counting using the well known MapReduce framework to distribute the work of the search across many different processors.

A special case of motif search that is commonly investigated is finding cliques, i.e., subgraphs in which each node is connected to every other node of the subgraph. More specifically, investigators commonly wish to identify all maximal cliques, i.e., cliques that are not contained in any larger clique. Algorithm 457 [4] uses a branch and bound approach to generate maximal cliques, but rather than a vertex ordering strategy to avoid redundancy it employs a “not” set to track vertices which have already been explored. In contrast, the depth first backtracking algorithm by Kreher and Stinson [26] uses an imposed ordering on the vertices to avoid redundancy and find all maximal cliques. This problem has also been studied in the MapReduce framework by Wu et al. [33], who develop a novel depth first clique enumeration algorithm that can be distributed across multiple processors.

Alternatively, investigators may wish to find independent sets of a graph, which in some ways is the dual of the clique problem, i.e., no two nodes of an independent set are joined by an edge. Similar to cliques, maximal independent sets are of interest, where a maximal independent set is not contained in any other independent set. Johnson and Yannakakis [21] perform a theoretical analysis and present an iterative depth first algorithm for general graphs that outputs results in lexicographic order. Eppstein [12] explores an iterative algorithm to perform this task based on the ReverseSearch framework [2], which is also a form of depth first search.

The general task that relates most closely to ours is enumerating all sets of vertices of a graph that induce a connected subgraph. A powerful algorithm well suited to this task is ReverseSearch [2], which is highly efficient in terms of space and can be distributed to run in parallel. ReverseSearch is itself a form of depth first search that utilizes a rank ordering of vertices to eliminate redundancy and imposes a child/parent relationship on all subgraphs. However, the general ReverseSearch algorithm is less efficient than derivatives optimized for specific tasks [12, 28]. There are also methods like Algorithm 447 [19] that use iterative depth first search and label vertices as visited to avoid redundancy. This is less efficient in terms of space than ReverseSearch but is similar to many of the algorithms we surveyed for topological enumeration.

In this study, we place an additional constraint on the problem of connected induced subgraph enumeration that enables the development of efficient branch and bound algorithms for problems that include finding high-scoring connected subgraphs according to a well defined scoring criterion. Namely, we focus on the case where all connected induced subgraphs satisfy a hereditary property. A hereditary property of a graph G is a property such that all induced subgraphs g ⊆ G also satisfy the property [3]. For example, being a clique is a hereditary property because any induced subgraph of a clique is also a clique. A problem closely related to our search for all connected induced subgraphs satisfying a hereditary property was recently studied by Cohen et al. [8], who investigate for which classes of hereditary property P the maximal P-subgraphs problem can be solved in polynomial time. Cohen et al. also point out that the general problem (for any hereditary property P) cannot be solved in polynomial time because the output may be exponential in size. This is closely related to our problem because we do not restrict the class of hereditary property that each subgraph must satisfy, and thus our result set may be exponential in size. This is in fact the case when we relax the property such that any S satisfies it and the search becomes exhaustive.

The two unifying observations we made from our survey of previous work are: (1) methods for subgraph enumeration are generally based on depth first search, with a significant portion using a branch and bound optimization strategy; and (2) an intrinsic or imposed order of vertices enables the use of diverse strategies to reduce the search space and avoid redundant solutions. The first observation motivates our interest in the problem because depth first search exhibits an inherent drawback when applied to branch and bound search for subgraphs that satisfy a hereditary property fb, which we will outline in the following section.

In this work we will first clearly define the problem and the type of overhead we seek to reduce, and we then briefly outline the process through which we have developed our proposed solution, with examples to clarify key points. We will then provide a formal algorithm for our proposed solution with supporting theorems and a set of computational experiments demonstrating the difference in performance between our solution and a conventional depth first branch and bound search.

2 Problem Definition and Observations

Let G = (V, E) be an undirected graph. A set V′ ⊆ V is said to be a connected node set if the subgraph induced by V′ is connected, i.e., if for every pair of nodes u, v ∈ V′, there is a path in G from u to v that goes only through nodes in V′. Throughout this work we refer to connected node sets as S, where it is implied that S ⊆ V and S induces a connected subgraph of G. We are interested in enumerating all connected node sets in G. While enumerating such sets can be useful in the context of many applications, here we are particularly interested in facilitating branch-and-bound algorithms that are designed to solve optimization problems or enumerate all subgraphs that satisfy a hereditary property.

In particular, we assume that we are given a scoring function f : 2^V → ℝ such that, for V′ ⊆ V, f(V′) = −∞ if V′ does not induce a connected subgraph in G. In this setting, branch-and-bound algorithms can be useful in solving two types of problems:

P1: Given a score threshold f*, find all connected node sets S such that f(S) ≥ f*.

P2: Find a connected node set S such that f(S) ≥ f(S′) for any connected node set S′ in G.
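As a point of reference, both problem types can be stated as a brute-force search over all vertex subsets. The following is only a minimal sketch for tiny graphs: the adjacency-dictionary representation, the function names and the toy vertex weights are assumptions made for illustration and do not appear in this work. Disconnected sets are never scored, which plays the role of f(V′) = −∞.

```python
from itertools import combinations

def is_connected_node_set(adj, nodes):
    """True if `nodes` is non-empty and induces a connected subgraph of `adj`."""
    nodes = set(nodes)
    if not nodes:
        return False
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for w in adj[u] & nodes:      # only walk through vertices inside the set
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == nodes

def connected_node_sets(adj):
    """All connected node sets of the graph, by exhaustive subset testing."""
    vertices = sorted(adj)
    for k in range(1, len(vertices) + 1):
        for subset in combinations(vertices, k):
            if is_connected_node_set(adj, subset):
                yield frozenset(subset)

def solve_p1(adj, f, threshold):
    """P1: every connected node set S with f(S) >= threshold."""
    return [S for S in connected_node_sets(adj) if f(S) >= threshold]

def solve_p2(adj, f):
    """P2: one connected node set of maximum score."""
    return max(connected_node_sets(adj), key=f)

# Hypothetical toy input: the path A - B - C with made-up vertex weights.
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
weight = {"A": 2.0, "B": -1.0, "C": 3.0}
f = lambda S: sum(weight[v] for v in S)
```

This exhaustive version tests all 2^|V| subsets, which is exactly the cost that an enumeration restricted to connected node sets, combined with bounding, is meant to avoid.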

Since our focus is on facilitating branch-and-bound algorithms, we assume that a “bounding” function fb : 2^V → ℝ is available such that for a given connected node set S, f(S) ≤ fb(S′) for any connected node set S′ ⊆ S. In other words, the function fb(S) provides a mechanism for bounding the score of any connected node set that can be obtained by adding more nodes to S. In the context of problems of type P1, if fb(S) < f* we say that fb(S) is not satisfied. Alternatively, fb can be defined as a boolean function fb : 2^V → {0, 1} that determines if a subgraph S satisfies a desired property, where if fb(S′) = 0 the property is not satisfied. In the case fb(S′) = 0, all S ⊇ S′ have bound fb(S) = 0, i.e., if S′ does not satisfy the property then no S ⊇ S′ satisfies the property either. Both definitions of fb are hereditary in nature and thus meet our requirement that enumerated S ⊆ V satisfy a hereditary property.

Observe that, if we can solve problems of type P1 using a branch-and-bound algorithm, we can also solve problems of type P2 by adaptively setting the threshold f* to the score of the best subnetwork found so far. For this reason, we focus on the first type of problem in the rest of our discussion. For both types of problems, if we have an efficient way of enumerating all connected node sets, we can develop a branch-and-bound algorithm that prunes chunks of the search space efficiently by bounding the score (f, or satisfiability of a desired property) of larger connected node sets using the bounding function (fb) of their smaller subsets.

To facilitate efficient and effective branch-and-bound algorithms, we need an algorithm to enumerate the solution space (here, the space of all connected node sets of the input graph) correctly and efficiently. We observe that such an enumeration algorithm should satisfy the following criteria to result in efficient and effective branch-and-bound algorithms:

• Completeness: All connected node sets in G satisfying bound fb should be generated and all generated node sets should be connected.

• No redundant subgraph generation: Each connected node set in G should be generated exactly once.

• Optimal order of enumeration: If S′ and S are connected node sets and S′ ⊂ S, then S′ should be generated before S so that if fb(S′) is not satisfied we try to avoid generating S.
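To make the three criteria concrete, a given enumeration order can be checked against them mechanically. The sketch below is illustrative only: the function name, its return convention and the example orders are assumptions and not part of this work.

```python
def check_enumeration(generated, expected):
    """Check an enumeration order against the three criteria.

    generated: list of vertex sets, in the order they were produced
    expected:  collection of all connected node sets that should be produced
    Returns (complete, duplicates, out_of_order).
    """
    generated = [frozenset(S) for S in generated]
    expected = {frozenset(S) for S in expected}
    position = {}          # first position at which each set was generated
    duplicates = []        # sets generated more than once
    for i, S in enumerate(generated):
        if S in position:
            duplicates.append(S)
        else:
            position[S] = i
    # Completeness: exactly the expected sets were generated.
    complete = set(position) == expected
    # Order: a proper subset must never appear after one of its supersets.
    out_of_order = [(small, big)
                    for big in position for small in position
                    if small < big and position[small] > position[big]]
    return complete, duplicates, out_of_order

# A vertex A with neighbors B, C, D: every set containing A is connected.
good_order = ["A", "AD", "AC", "ACD", "AB", "ABD", "ABC", "ABCD"]
result_good = check_enumeration(good_order, good_order)

# A hypothetical order in which B is generated only after its superset AB.
result_bad = check_enumeration(["A", "AB", "B"], ["A", "AB", "B"])
```

Here `result_good` reports no duplicates and no order violations, while `result_bad` reports the pair ({B}, {A, B}) as out of order.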

The “completeness” criterion relates to the correctness of the algorithm, while the “no redundant subgraph generation” and “optimal order of enumeration” criteria relate to efficiency. The “no redundant subgraph generation” criterion asserts that each candidate solution in the solution space should be considered exactly once, since additional considerations will lead to redundant computation.

The “optimal order of enumeration” criterion, on the other hand, facilitates optimal pruning of the search space by ensuring that all subsets of a connected node set are considered before the node set itself is considered. To see why this is useful, consider the definition of the bounding function, which guarantees f(S) ≤ fb(S′) for any S′ ⊆ S. From this it is apparent that evaluating S′ before all S that contain S′ is desirable, because if fb(S′) < f* then any S containing S′ will have f(S) < f*, and in the context of G there may be an exponential number of S that contain S′. Whenever an S ⊃ S′ is evaluated where fb(S′) < f* we call it a redundant rejection. Redundant rejections are a source of overhead, and reducing redundant rejections is the focus of this thesis.

Eliminating all redundant rejections likely requires a breadth first search of G, but in this work we took a more conservative approach to keep the size of the problem manageable. To begin, we observed that satisfying the “completeness” and “no redundant subgraph generation” criteria can be accomplished by selecting a single v ∈ V as an anchor vertex and enumerating all subgraphs containing v before removing v from G. In this way each v ∈ V is chosen as a starting point and all subgraphs containing it are enumerated before v is removed from G. When V = ∅ all subgraphs have been enumerated. An example of enumerating connected induced subgraphs from an anchor vertex is shown in Figure 1.

Figure 1: Example illustrating the enumeration of all connected induced subgraphs of a graph using anchor vertices. On the left is the graph as each vertex becomes the anchor used for enumeration of all connected subgraphs that contain the anchor before it is subsequently removed from G. On the right are the connected subgraphs generated from each anchor vertex.

It is obvious that, for some S′ ⊂ V, this process generates many S ⊃ S′ before S′ itself, and thus it does not satisfy the criterion of “optimal order of enumeration”. For example, subgraphs containing DH are generated 4 times before DH itself is generated, and this can cause redundant rejections if fb(DH) does not satisfy our criteria for score or hereditary property. Rather than going breadth first over the entire network, we took a different approach and focused on eliminating redundant rejections within each search anchored at v. I.e., during extension from A in Figure 1, if fb(ADH) is not satisfied, we do not enumerate ACDH, ABDH or ABCDH.
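The anchor-vertex scheme can be sketched in a few lines. This is a generic illustration under stated assumptions, not the procedure developed in the following sections: the graph is a hypothetical adjacency dictionary without self-loops, and the candidate/excluded bookkeeping is one common way of making sure each set is produced exactly once.

```python
def connected_sets_by_anchor(adj):
    """Yield every connected node set exactly once, anchor by anchor."""
    allowed = set(adj)                       # vertices not yet removed from G

    def extend(S, candidates, excluded):
        # Invariant: candidates are the allowed neighbors of S that are
        # neither in S nor excluded.
        yield frozenset(S)
        candidates = list(candidates)
        excluded = set(excluded)
        while candidates:
            u = candidates.pop()
            fresh = (adj[u] & allowed) - S - excluded - set(candidates) - {u}
            yield from extend(S | {u}, candidates + sorted(fresh), excluded | {u})
            excluded.add(u)                  # later branches must not contain u

    for anchor in sorted(adj):               # lexicographic anchor order
        yield from extend({anchor}, sorted(adj[anchor] & allowed), set())
        allowed.remove(anchor)               # remove the anchor from G

# Hypothetical toy graphs: the path A - B - C and the 4-cycle A - B - C - D.
path = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
cycle = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"A", "C"}}
path_order = ["".join(sorted(S)) for S in connected_sets_by_anchor(path)]
cycle_sets = list(connected_sets_by_anchor(cycle))
```

On the path graph, the set B is produced only after AB and ABC, which is the kind of order violation described above for DH: the scheme is complete and free of duplicates, but it does not guarantee that subsets come before their supersets.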

3 Base Case

3.1 Focusing on Direct Neighbors

The first structure we explore is a variation of the binomial tree. Any child n of the root of a binomial tree has children that are copies of all branches rooted at siblings that precede n in the tree, and it can be used to enumerate the power set of a set [23]. To apply this to a graph, we observe that starting at an anchor vertex v ∈ V, the neighbors of v can be treated as a set because v and any combination of its neighbors induce a connected subgraph of G. This collection of subgraphs can be represented as a binomial tree T with the root node r labeled by v. We construct the tree such that each node n only contains descendants labeled by vertices of greater lexicographic rank than the vertex labeling n.

Since all vertices labeling nodes in T are connected to the vertex labeling the root, the set of vertices that label each path P(nk) from the root r of T to a node nk represents a connected subgraph of G. The binomial tree is a special case of our more general solution. In this special case the input is restricted to a vertex and its direct neighbors and the bound fb is satisfied by any S. Here, we do not prove that the binomial tree satisfies our criteria of “completeness”, “no redundant subgraph generation” and “optimal order of enumeration”, since Theorems 1 and 2 related to the general solution show that depth first search of the binomial tree satisfies these criteria. An example is illustrated in Figure 2.

Figure 2: Graph G and binomial tree T generated from anchor vertex A and its neighbors B, C and D. Performing a depth first search of T generates the sets A, AD, AC, ACD, AB, ABD, ABC, ABCD.
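The depth first order listed in the caption of Figure 2 can be reproduced without building the tree explicitly. The following is a small sketch: the function name is an assumption, and vertex names are assumed to be single characters so that a path can be printed as a string.

```python
def binomial_tree_sets(anchor, neighbors):
    """Yield the root-to-node paths of the binomial tree in depth first order."""
    # Neighbors are attached to the root in reverse lexicographic order.
    order = sorted(neighbors, reverse=True)

    def dfs(prefix, branches):
        for i, v in enumerate(branches):
            path = prefix + [v]
            yield path
            # The children of v are copies of the branches rooted at the
            # siblings that precede v.
            yield from dfs(path, branches[:i])

    yield [anchor]
    yield from dfs([anchor], order)

figure2_order = ["".join(p) for p in binomial_tree_sets("A", "BCD")]
```

For anchor A with neighbors B, C and D this yields the eight sets in the order given in the caption, i.e., all 2^3 combinations of neighbors together with A.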

However, this does not yet provide a performance gain, as we have still generated all possible branches first (each branch representing an S), and if an S′ does not satisfy our bound fb, it is still possible to encounter S ⊃ S′ as searching T continues. Taking advantage of T for branch and bound algorithms requires that the binomial tree be constructed in a specific manner. To motivate this statement, consider the worst case scenario for generating all subsets using a full binomial tree T. If the first subset S′ = AD in Figure 2 does not satisfy fb(S′), then a depth first search of T rejects 3 additional S ⊃ S′, i.e., we would perform three redundant rejections. On the other hand, if S′ = AB does not satisfy fb, no redundant rejections occur. The number of redundant rejections of a subgraph depends on the order in which vertices are considered. In the worst case there are 2^(m−1) − 1 redundant rejections (m being the number of neighbors of v), and in the best case there are none.

Similar to depth first traversal of the binomial tree, performing depth first branch and bound enumeration directly on G exhibits the same type of redundant rejections. This is obvious because a depth first search must explore subgraphs of G equivalent to those represented by the binomial tree, and no depth first search strategy can select nodes to follow a priori that will avoid all redundant rejections. I.e., regardless of the order in which the depth first search of G selects vertices to follow, the number of redundant rejections encountered varies depending on which S′ ⊂ S is the cause of the rejection. That is to say, depth first branch and bound search is inherently unstable with regard to the number of redundant S′ ⊂ S evaluated, as this quantity varies unpredictably.
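The best and worst cases above can be checked by simulating a depth first branch and bound traversal of the full binomial tree with a single failing subset. This is a sketch under assumptions: the function name is illustrative, the bound is modeled as "a set fails if it contains the failing subset", and vertex names are single characters.

```python
def redundant_rejections(anchor, neighbors, failing):
    """Supersets of `failing` that a depth first traversal still evaluates."""
    order = sorted(neighbors, reverse=True)   # reverse lexicographic insertion
    failing = frozenset(failing)
    rejected = []

    def dfs(prefix, branches):
        for i, v in enumerate(branches):
            path = prefix + [v]
            if failing <= set(path):
                rejected.append(path)         # bound not satisfied
                continue                      # prune: do not descend further
            dfs(path, branches[:i])

    dfs([anchor], order)
    # The first rejection of `failing` itself is necessary; the rest are redundant.
    return ["".join(p) for p in rejected if set(p) != failing]

worst = redundant_rejections("A", "BCD", {"A", "D"})
best = redundant_rejections("A", "BCD", {"A", "B"})
```

With m = 3 neighbors, rejecting AD leaves ACD, ABD and ABCD to be rejected again, which matches 2^(3−1) − 1 = 3, while rejecting AB prunes all of its supersets at once.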

3.2 Optimized Local Search Tree

A simple method that reduces redundant rejections using a binomial tree based approach is as follows. The binomial tree can be constructed by adding each neighbor vertex to the root as a new node n, and then appending copies of the branches rooted at each sibling of n as children of n. Neighbor vertices are added to the root in reverse lexicographic order, similar to the construction of the binomial tree in the previous section. However, branches are copied using a method that evaluates each path of the branch while copying it, and copying a path P(nk) terminates whenever the set S represented by P(nk) does not satisfy the bound fb. The resulting local search is similar in spirit to the set enumeration tree (SE-tree) search of Rymon [31]. However, our approach is more closely related to the binomial tree because we construct an explicit tree where the set is defined by the path from the root to a node in the tree. It is important to note that, depending on how rejections occur, T may no longer meet the definition of a binomial tree, so moving forward we will refer to T as a local search tree. An example of constructing a local search tree is shown in Figure 3.

Figure 3: Creating the local search tree T using a branch and bound optimization generates the sets A,AD,AC,AB,ABD,ABC. (A) The input graph G with the anchor vertex A. (B) Exploring D with no previous branches yields the D branch. (C) Exploring C with the previous D branch evaluates ACD which is rejected resulting in branches C and D. (D) Exploring B with previous branches evaluates ABD and ABC (avoiding ABCD which contains the previously rejected ACD)

In this way, copying of branches stops at any point that fb(S) is not satisfied. It is obvious this reduces redundant rejections because when we append the branch rooted at sibling n1 to the new sibling n2 we avoid enumerating S that are rejected while creating the n1 branch. We stress that this is only a potential reduction in redundant rejections because we are avoiding one source of redundant rejections, but others exist that we describe in the Discussion section. The local search tree is a special case of our general solution. In this special case the input is limited to only a vertex and its direct neighbors. Here we do not prove that the local search tree satisfies our criterion of “completeness”, since Theorem 3 related to the general solution shows that after rejecting an S′ all S ⊉ S′ are still enumerated by depth first search of the local search tree.
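The construction described in this section can be sketched with an explicit tree. The code below is a minimal illustration, not the formal algorithm of this work: the class and function names are assumptions, the bound is passed in as a hypothetical predicate `satisfies`, the anchor alone is assumed to satisfy the bound, and vertex names are single characters.

```python
class Node:
    def __init__(self, label):
        self.label = label
        self.children = []

def copy_branch(src, path, satisfies, evaluated):
    """Copy the branch rooted at `src` below `path`, stopping at rejections."""
    new_path = path + [src.label]
    evaluated.append("".join(new_path))
    if not satisfies(frozenset(new_path)):
        return None                           # rejected: nothing below is copied
    clone = Node(src.label)
    for child in src.children:
        copied = copy_branch(child, new_path, satisfies, evaluated)
        if copied is not None:
            clone.children.append(copied)
    return clone

def build_local_search_tree(anchor, neighbors, satisfies):
    """Return the root of the local search tree and the list of evaluated sets."""
    root = Node(anchor)
    evaluated = [anchor]
    for v in sorted(neighbors, reverse=True):     # reverse lexicographic order
        path = [anchor, v]
        evaluated.append("".join(path))
        if not satisfies(frozenset(path)):
            continue
        node = Node(v)
        for sibling in root.children:             # branches created earlier
            copied = copy_branch(sibling, path, satisfies, evaluated)
            if copied is not None:
                node.children.append(copied)
        root.children.append(node)
    return root, evaluated

def tree_paths(node, prefix=""):
    """Depth first listing of the sets represented by the tree."""
    label = prefix + node.label
    yield label
    for child in node.children:
        yield from tree_paths(child, label)

# The setting described in the caption of Figure 3: only sets containing ACD fail.
bound = lambda S: not {"A", "C", "D"} <= S
root, evaluated = build_local_search_tree("A", "BCD", bound)
accepted = list(tree_paths(root))
```

In this run ACD is evaluated and rejected once, and ABCD is never evaluated because the copied C branch no longer contains a D node.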

4 General Case

4.1 Joining Local Search Trees

To demonstrate our reasoning behind the next development, we observe that the local search tree from the previous section contains all subgraphs that satisfy fb(S) using a single anchor vertex v and its direct neighbors. To generate subgraphs beyond this initial seed, we can follow a path P(nk) in T where the vertices that label the nodes of P(nk) represent an S ⊆ V. At this point we can treat S as an anchor set by removing all neighbors of v not in S from G. We can then look at the direct neighbors of the anchor set and create a T′ rooted at S. An example illustrating this idea is shown in Figure 4. The process is repeated for each P(nk) in T until all paths have been used as anchor sets.

Figure 4: Example illustrating the key idea for joining local search trees of direct neighbors to generate all connected subgraphs. Initial tree T1 generated by Algorithm 1 anchored at A is then extended by following path ACD and extending it also using Algorithm 1 to generate T2.

Generating all S ⊆ V in this manner maintains our original optimization during generation of each tree rooted at an anchor set, but the optimization does not extend beyond the individual tree generations. For example, if in Figure 4 we extend AD and during generation find that fb(ADI) is not satisfied, we would reject its supergraphs (e.g., ACDI, ACDHI, ACDEHI and ACDEHI) again when we extend ACD. To utilize the information from local search branches globally, we must modify the construction procedure for the local search tree as described in the next section.

4.2 Optimized Joining of Local Search Trees

We initially investigated creating a local search tree and then following each path to enumerate connected induced subgraphs containing vertices beyond the anchor vertex and its direct neighbors. However, if we instead extend each branch as it is added, generating a depth first search through a neighbor of the anchor v, we can use the depth first branches as the search continues through other neighbors to leverage the information from previous rejections. In the context of the enumeration procedure, this becomes a simultaneous depth-breadth search. An example of this method is shown in Figure 5.

Figure 5: (A) The input graph G with the anchor vertex A highlighted. (B) The enumeration tree generated by exploring depth first through vertex D, where fb(ADIH) is not satisfied and ADIH is rejected. (C) The first step of adding the tree generated by AD to the tree being generated through AC. At this step ACD, ACDI and ACDH are evaluated and fb(ACDH) is not satisfied so ACDH is rejected. The DI branch is passed to the children of AC discovered by continued depth first search. (D) The second step of extending AC to neighbor E. After ACE is evaluated, the DI branch is joined and ACED and ACEDI are evaluated.

However, an additional matter that greatly complicates this process is the occurrence of cycles in G. A cycle is a path in G that originates and terminates at a vertex v without backtracking, i.e., we arrive back at v by only traversing unvisited vertices. While generating a branch from an anchor set S, if a cycle exists in G that originates and terminates in S, T can inadvertently become corrupt such that a path P(nk) contains multiple nodes labeled with the same vertex. As such, the vertices labeling the nodes along P(nk) no longer represent a proper set, as elements are duplicated. In addition, if the branch being generated from an anchor set S is joined with a previously generated branch that contains nodes labeled by vertices adjacent to S, then the joining can disrupt the desired order of generating all S′ ⊂ S before S. If this occurs it can introduce additional overhead into the subgraph enumeration if an S′ ⊂ S that does not satisfy fb(S′) is generated after S.

Furthermore, cycles can also lead to redundant P(nk) in T such that the same S will be enumerated multiple times. Examples of all three types of corruption are shown in Figure 6 (B). Repeated vertex labels in a path are quite obvious. An example of the enumeration order being violated is that ABCEG is evaluated before ABG. Redundancy occurs multiple times, where ABCEG is evaluated on every branch.

In order to resolve these issues, we place an additional restriction on adding each previously generated branch to the branch currently being generated. If we are extending a subgraph S represented by path P(nk), we prune previous branches at nodes labeled with v ∈ χnk before joining, i.e., we remove nodes labeled by vertices adjacent to vnk but not adjacent to any other vertex labeling a node along P(nk−1). We then pass the pruned branch forward as enumeration continues. This method represents our solution to the general case and is formalized in Algorithm 1. Lemma 3 guarantees that all paths in T represent sets, i.e., every vertex labeling a node of a path in T is unique. Theorem 1 guarantees that all S ⊆ V containing the anchor vertex are uniquely enumerated during exhaustive search. Theorem 2 guarantees that no S ⊃ S′ is enumerated before S′ during exhaustive search. Theorem 3 guarantees that when an S′ is rejected all S ⊉ S′ are still enumerated. An example of the tree constructed by Algorithm 1 is shown in Figure 6 (C).

Figure 6: (A) Graph G with anchor vertex A highlighted. (B) The enumeration tree generated by appending an un-pruned branch generated from S = AC to the branch generated through S = AB, which exhibits several redundant instances of G and E. (C) The tree generated by pruning the branch generated through AC as it is added to the branch generated through AB. At AB, G is a neighbor of B, so the branch through AC is pruned at G to CE before being added. The pruned branch is then passed to ABG during extension. At ABG, E is a neighbor of G, so the branch from AC is pruned again at E to C. The C branch is then passed to ABGE, where it is added the final time. In addition, it can be observed that if G were not removed from the CEG branch when first joining it to B, CEBG would be generated before BG, thus violating the optimal generation order.

4.3 Caching Depth First Search (CDFS)

Algorithm 1 is a formal presentation of the full algorithm for generating all connected induced subgraphs of G, optimized for branch and bound algorithms. It makes use of the extension set $\chi_{n_k}$ defined in detail in Appendix A. The entry point is DEPTH(∅, v, [ ]), which performs a CDFS search from anchor vertex v ∈ V. We have used the boolean function definition of $f_b$ that tests S for a hereditary property. We know by Theorems 1 and 3 that when DEPTH returns, all S ⊆ V containing v that induce a connected subgraph of G and satisfy $f_b(S)$ have been enumerated. Enumerating all connected induced subgraphs in G only requires calling DEPTH on each v ∈ V and removing the selected v from G after each call.
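The outer loop described above (one call per anchor vertex, followed by removal of that vertex) can be sketched as follows. This is only an illustration under stated assumptions: the graph is assumed to be a dictionary mapping each vertex to a set of neighbors, and `enumerate_from_anchor` is a hypothetical brute-force stand-in for the call to DEPTH, used here so that the sketch is self-contained.

```python
from itertools import combinations

def is_connected(adj, s):
    """True if the vertex set s induces a connected subgraph of adj."""
    s = set(s)
    start = next(iter(s))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for w in adj[u] & s:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == s

def enumerate_from_anchor(adj, anchor):
    """Hypothetical stand-in for DEPTH: brute-force every connected set containing anchor."""
    others = sorted(set(adj) - {anchor})
    for r in range(len(others) + 1):
        for combo in combinations(others, r):
            s = frozenset(combo) | {anchor}
            if is_connected(adj, s):
                yield s

def enumerate_all(adj):
    """Call the per-anchor enumerator on each vertex, then remove that vertex from G."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    result = []
    for v in sorted(adj):
        result.extend(enumerate_from_anchor(adj, v))
        for w in adj.pop(v):          # remove v so later anchors never revisit it
            adj[w].discard(v)
    return result

# Path graph A - B - C (illustrative input).
path = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
subgraphs = enumerate_all(path)
```

Removing each anchor after its call is what prevents a subgraph from being reported once per vertex it contains.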

Algorithm 1 Enumerate all S that contain anchor vertex v and satisfy f_b(S). Returns the root node of the enumeration tree T. Entry point is DEPTH(∅, v, [ ]).

 1: procedure BREADTH(S, n, U)
 2:     if v_n ∈ U then                        ▷ Prune branch by topology
 3:         return null
 4:     end if
 5:
 6:     S′ ← S ∪ v_n                           ▷ Prune branch by bounding function
 7:     if f_b(S′) = false then
 8:         return null
 9:     end if
10:
11:     n′ ← Υ(v_n)                            ▷ Recursively evaluate/prune/clone branch
12:     for all {n* : nn* ∈ B} do
13:         n″ ← BREADTH(S′, n*, U)
14:         if n″ ≠ null then
15:             B ← B ∪ n′n″
16:         end if
17:     end for
18:     return n′
19: end procedure

20: procedure DEPTH(S, v, β)
21:     S′ ← S ∪ v
22:     if f_b(S′) = false then
23:         return null
24:     end if
25:     n ← Υ(v)
26:     β′ ← [ ]
27:     for i = 1 to |β| do
28:         n′ ← BREADTH(S′, β[i], χ_n)
29:         if n′ ≠ null then
30:             B ← B ∪ nn′
31:             push(β′, n′)
32:         end if
33:     end for
34:     for all v ∈ χ_n do                     ▷ Note: Derive χ_n from S and v
35:         n′ ← DEPTH(S′, v, β′)
36:         if n′ ≠ null then
37:             B ← B ∪ nn′
38:             push(β′, n′)
39:         end if
40:     end for
41:     return n
42: end procedure
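A compact Python transcription of Algorithm 1 may help when reading the pseudocode. It is a sketch under assumptions rather than the thesis implementation: the graph is assumed to be a dictionary of neighbor sets, the `Node` class and the `found` list are illustrative names, χ_n is derived from S and v as the neighbors of v that are neither in S nor adjacent to S, and vertices are visited in sorted order.

```python
class Node:
    """A node of the enumeration tree T, labeled by a vertex of G."""
    def __init__(self, vertex):
        self.vertex = vertex
        self.children = []

def cdfs(adj, anchor, f_b=lambda s: True):
    """Return the vertex sets evaluated and accepted while building T from anchor."""
    found = []

    def extension_set(s, v):
        # chi_n: neighbors of v that are not in S and not adjacent to S.
        cull = set(s)
        for u in s:
            cull |= adj[u]
        return sorted(adj[v] - cull)

    def breadth(s, n, u):
        if n.vertex in u:                 # line 2: prune branch by topology
            return None
        s2 = s | {n.vertex}
        if not f_b(s2):                   # line 7: prune by bounding function
            return None
        found.append(s2)
        clone = Node(n.vertex)
        for child in n.children:          # line 12: evaluate/prune/clone branch
            c = breadth(s2, child, u)
            if c is not None:
                clone.children.append(c)
        return clone

    def depth(s, v, beta):
        s2 = s | {v}
        if not f_b(s2):
            return None
        found.append(s2)
        n = Node(v)
        chi = extension_set(s, v)
        beta2 = []
        for branch in beta:               # line 27: join cached branches
            c = breadth(s2, branch, set(chi))
            if c is not None:
                n.children.append(c)
                beta2.append(c)
        for w in chi:                     # line 34: continue depth first
            c = depth(s2, w, beta2)
            if c is not None:
                n.children.append(c)
                beta2.append(c)
        return n

    depth(frozenset(), anchor, [])
    return found

# Four-cycle A-B-D-C-A (illustrative input) searched from anchor A.
cycle = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A", "D"}, "D": {"B", "C"}}
found = cdfs(cycle, "A")
```

On this four-cycle the branch B→D cached under AB is cloned under AC with D pruned, because D lies in the extension set of C; D is then reached by the depth first step and the pruned branch is joined below it.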

5 Correctness

The following theorems are based on supporting lemmas in Appendix B. Theorem 1 guarantees that our method satisfies the “completeness” and “no redundant subgraph generation” criteria during exhaustive enumeration, i.e., when $f_b$ is satisfied by any S. Theorem 2 guarantees that our method satisfies the “optimal order of enumeration” criterion during exhaustive enumeration. Theorem 3 guarantees that our method satisfies the “completeness” criterion when $f_b$ is selective.

Theorem 1 Given an input graph G, an anchor vertex v ∈ V and a bound fb that is satisfied by any S, Algorithm 1 uniquely enumerates all S ⊆ V containing v that induce a connected subgraph of G.

Proof: By Lemma 4 we know that the set represented by any path P(nk) in T induces a connected subgraph of G and by Lemma 5 we know that every path in T represents a unique set. By Lemma 7 we know that all S ⊆ V containing v are represented by a path P(nk) in T . Therefore, we can conclude that because Algorithm 1 enumerates all paths of T , Algorithm 1 uniquely enumerates all S ⊆ V containing v.



Theorem 2 Given an input graph G, an anchor vertex v ∈ V and a bound $f_b$ that is satisfied by any S, Algorithm 1 enumerates all connected induced subgraphs of G containing v in an order such that all $S' \subset S$ containing v are enumerated before S.

Proof: By Theorem 1 we know that all S ⊆ V containing v that induce a connected subgraph of G are enumerated, and by Lemma 8 we know that any $S' \subset S$ containing v must be generated before S.
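The ordering property of Theorem 2 can be checked mechanically on a recorded enumeration. The helper below is a hypothetical illustration (the name `respects_subset_order` and the representation of each S as a `frozenset` are assumptions): it reports whether any set appears after one of its proper supersets.

```python
def respects_subset_order(sequence):
    """True if no set in the sequence is listed after one of its proper supersets."""
    seen = []
    for s in sequence:
        # s < earlier tests whether s is a proper subset of an earlier set.
        if any(s < earlier for earlier in seen):
            return False
        seen.append(s)
    return True

good = [frozenset("A"), frozenset("AB"), frozenset("AC"), frozenset("ABC")]
bad = [frozenset("A"), frozenset("ABC"), frozenset("AB")]
```

The second list mirrors the kind of violation discussed for un-pruned branches, where a superset is evaluated before one of its subsets.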



Theorem 3 Given an input graph G, an anchor vertex v and a bound $f_b$, if an $S'$ does not satisfy $f_b$, all $S \not\supset S'$ are still enumerated by Algorithm 1.

Proof: We prove the theorem by contradiction.

For a given set S of connected vertices, let $P(n_k)$ be the path that represents S in an exhaustive enumeration tree T. Assume $P(n_k)$ was eliminated because $P(m_k)$ representing $S'$ was rejected, and assume that $S \not\supset S'$. Rejecting $P(m_k)$ can only eliminate $P(n_k)$ if n is a copy of m created by the BREADTH procedure of Algorithm 1. As n is a copy of m created by the BREADTH procedure, all vertices labeling nodes along $P(m_k)$ also label nodes along $P(n_k)$. Thus, $\psi(P(n_k)) \supset \psi(P(m_k)) \implies S \supset S'$ and we have a contradiction. Since we know by Theorem 1 that all S ⊆ V containing v are enumerated by Algorithm 1 when no rejections occur, and that when a rejection of $S'$ occurs it can only eliminate $S \supset S'$, we conclude that if an $S'$ is rejected all $S \not\supset S'$ are still enumerated.



6 Experimental Results

6.1 Exhaustive Synthetic Testing

We compare the performance of CDFS to DFS based approaches by enumerating connected induced subgraphs of real world networks. We use the total weight of a subgraph as our hereditary property and comparison to a threshold t as our bounding function $f_b$, where $f_b(S)$ is false if weight(S) > t and true otherwise. Weights were assigned to vertices from a Gaussian distribution with mean m and standard deviation ρ. In the implementation of Algorithm 1 used for computational tests, we imposed a maximum size k on enumerated subgraphs by adding a test for |S| = k in both the DEPTH and BREADTH procedures. We utilized the Human Protein Reference Database (HPRD) [15], which consists of 9,455 vertices and 37,080 edges, and a citation network generated by Leskovec et al. [27] from the on-line arXiv journal, which consists of 5,241 vertices and 28,958 edges. Enumeration was performed a total of ten times across a range of thresholds and standard deviations, and performance measures such as the number of rejections and execution time were averaged. It should be noted that for the arXiv network enumeration was performed to a size of 5, while for HPRD enumeration was performed to a size of 4. A smaller maximum size of S was imposed on enumeration of HPRD because runtimes were significantly longer than for arXiv, and thus a smaller size was required to run all tests in a tractable amount of time. The results for the arXiv network are displayed in Figures 7, 8 and 9, and the results for the HPRD network are displayed in Figures 10 and 11.
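The weight threshold bound used in these tests can be sketched as below. The function names, the fixed seed, and the folding of the size limit k into $f_b$ are assumptions made for illustration. Note that a total-weight threshold is hereditary only when weights are non-negative, so this sketch clips sampled weights at zero; the thesis does not state how negative samples were handled.

```python
import random

def make_weights(vertices, mean, sd, seed=0):
    """Draw one Gaussian weight per vertex, clipped at zero (an assumption)."""
    rng = random.Random(seed)
    return {v: max(0.0, rng.gauss(mean, sd)) for v in vertices}

def make_bound(weights, t, k):
    """Return f_b: true when weight(S) <= t and |S| <= k, false otherwise."""
    def f_b(s):
        return len(s) <= k and sum(weights[v] for v in s) <= t
    return f_b

# With m = 10 and standard deviation 0 every vertex weighs exactly 10.
weights = make_weights(["A", "B", "C"], mean=10, sd=0)
f_b = make_bound(weights, t=10, k=5)
single_ok = f_b({"A"})
pair_ok = f_b({"A", "B"})
```

With t = 10 this accepts every single vertex and rejects every pair, so all rejections happen at depth two.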

Figures 7 and 10 plot the rejection rate (the number of S enumerated where $f_b(S)$ was false divided by the total number of S enumerated) for each algorithm while enumerating all S ⊆ V that satisfy $f_b$. The rejection rate for CDFS is consistently lower than that of DFS, though the relationship is more obvious at higher values of ρ. The correlation of rejection rate and ρ occurs because at low values of ρ the rejections are strongly correlated with depth, e.g., at t = 10 and ρ = 0 (with m = 10) all single vertex subgraphs would be accepted but all two vertex subgraphs would be rejected, and this correlation aligns well with how branch and bound depth first search prunes areas of the search space. However, as ρ increases the rejections become more likely to happen at any depth, and our strategy for reducing redundancy has more opportunity to prune the search space in ways that DFS cannot.

Figures 8 and 11 show that the two algorithms have similar runtimes for small thresholds t and small standard deviations ρ, but at larger t and ρ the CDFS method consistently completes before the DFS method. The large discrepancy in rejection rate between the two algorithms that occurs for values of ρ > 6 appears to contribute to CDFS outperforming the DFS method for small values of t, but we were unable to establish a correlation between rejection rate and runtime for the values of t that showed the greatest difference in runtime between the two algorithms. For this reason we performed additional experiments to measure what factors were contributing to the performance gain for large values of t.

We determined that the dominating factor of the performance gain was that the DFS based approach expends a significant amount of effort rediscovering relationships that it has already established. Both algorithms require a method ADJ(S) that returns unvisited vertices adjacent to at least one v ∈ S (in Algorithm 1 this returns $\chi_n$ at line 34). However, the CDFS method does not call ADJ(S) in the BREADTH procedure, where it extends the current S using all previous subgraphs that satisfied $f_b$. As the search continues, the amount of search space being explored by BREADTH increases exponentially, and the cache of previously established relationships allows CDFS to outperform the DFS based method on larger search spaces. The reduction in calls to ADJ(S) shows a larger disparity between methods as the search space expands, demonstrating that the CDFS method reduces runtime for both branch and bound and exhaustive searches, i.e., searches where any S satisfies $f_b(S)$. Figure 9 shows a plot of the number of calls to ADJ(S) versus threshold value t. The cache provides CDFS a computational edge over methods that use a strictly branch and bound depth first approach, but it can require exponentially more memory. For this reason CDFS is best suited to problems where G is very large but the maximum size of S to be enumerated is small (|S| < 10), so that memory consumption is reasonable and the runtime improvement is spread across all vertices of G.
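To make the role of ADJ(S) concrete, the following sketch shows one possible form of the method together with a simple call counter of the kind that could produce the counts plotted in Figure 9. The signature, the explicit `visited` argument, and the `counted` wrapper are assumptions for illustration, not the implementation used in the experiments.

```python
def adj_unvisited(graph, s, visited):
    """Return the unvisited vertices adjacent to at least one vertex of S."""
    frontier = set()
    for v in s:
        frontier |= graph[v]
    return frontier - set(s) - set(visited)

def counted(fn):
    """Wrap fn so that the number of calls is recorded on wrapper.calls."""
    def wrapper(*args):
        wrapper.calls += 1
        return fn(*args)
    wrapper.calls = 0
    return wrapper

ADJ = counted(adj_unvisited)

# Four-cycle A-B-D-C-A (illustrative input).
cycle = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A", "D"}, "D": {"B", "C"}}
first = ADJ(cycle, {"A"}, set())
second = ADJ(cycle, {"A", "B"}, {"C"})
```

Each call scans the neighbor lists of every vertex in S, which is the work that a cached branch avoids repeating.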

6.2 Integration into CRANE

The CRANE algorithm by Chowdhury et al. [6] performs a heuristic branch and bound depth first search of protein-protein interaction networks to find combinatorially dysregulated subnetworks with binary expression state patterns that are discriminative between two sample classes. The heuristic property of the algorithm is adjustable in that it extends the best B subgraphs at any point during the depth first search: if B is made large enough the search becomes exhaustive, and B = 1 is a greedy search. The objective function for CRANE is non-trivial as it must compute probabilities based on the contents of the binary expression data for each subgraph explored. Profiling executions of the code confirmed that most processing time is spent in the objective function, which differs from the previous computational tests where the objective function complexity was less than that of the enumeration.

Figure 7: Rejection rate analysis for enumerating all S up to size 5 satisfying $f_b$ in the arXiv [27] citation network. Each pane plots the average rejection rate (ratio of the number of S evaluated that did not satisfy $f_b$ to the total number of S evaluated) of each algorithm versus threshold t, where the node scores were sampled from a Gaussian distribution with m = 10 and standard deviation ρ.

We implemented a version of CRANE that can enumerate connected induced subgraphs using either the CDFS method or the DFS method and then compared the performance of both methods. Expression data sets used for comparison were a synthetic test case comprising 500 genes and 32 samples and a glioblastoma data set comprising 7,419 genes and 86 samples. We received the glioblastoma data set from Patel et al. [30], who had constructed it from a dataset of Verhaak et al. [16] using additional information from the TCGA.

Figure 8: Comparison of the average runtime required for each algorithm to enumerate all S up to size 5 satisfying $f_b$ in the arXiv [27] network. Each pane plots the average execution time versus threshold t, where the node scores were sampled from a Gaussian distribution with m = 10 and standard deviation ρ. At low values of t and ρ the average DFS execution time was sometimes less than the time for CDFS. However, when ρ > 6 or t > 25 the average CDFS time was consistently less than that of the DFS method.

For the synthetic set, a corresponding synthetic network was generated and the optimal solution was known. We enumerated connected induced subgraphs up to size 8 in the synthetic network with both DFS and CDFS methods and compared the results. Due to the heuristic nature of the algorithm and differences between how DFS and CDFS explore the search space, there were discrepancies in the results. However, the known result was identified by both algorithms as part of their overall result sets. The DFS method required 2 minutes to complete whereas the CDFS method required 1 minute. This is not of particular interest from a runtime perspective, but the analysis helped to underline how both methods behave when used in a heuristic algorithm.

Figure 9: Comparison of the effort expended by each algorithm to retrieve unvisited vertices while enumerating all S up to size 5 satisfying $f_b$ in the arXiv [27] network. Each pane plots the average number of calls to a method returning unvisited adjacent vertices versus threshold t, where the node scores were sampled from a Gaussian distribution with m = 10 and standard deviation ρ.

For the TCGA expression data, we used the HPRD [15] protein-protein interaction network to enumerate connected induced subgraphs. We enumerated connected induced subgraphs up to size 8 in HPRD, and both methods identified the same top results. The DFS method completed in 234 minutes compared to 30 minutes for CDFS. The top results consisted of several large subgraphs with state patterns of all genes down regulated and a two gene subgraph with a state pattern of one gene up and the other gene down regulated. The subgraphs with all genes down regulated were not of particular interest biologically. However, the two gene subgraph discriminated roughly half of the short and long term survivors by up regulated MDK and down regulated SDC1. This is interesting because MDK expression is known to promote cell migration and angiogenesis during tumorigenesis and SDC1 is a trans-membrane protein involved in cell migration and cell-matrix interactions [18]. Our result is intuitive in that increased MDK promotes cell migration while loss of SDC1 could potentially weaken the intracellular matrix, enabling cancer cells to migrate more easily. A literature survey uncovered a recent paper that concluded that up regulated MDK plays a pivotal role in promoting human glioma cell resistance to cannabinoid antitumoral activity [14]. However, we were unable to find literature investigating the role of SDC1 in glioblastoma, and this may be an interesting avenue for further research.

Figure 10: Rejection rate analysis for enumerating all S up to size 4 satisfying $f_b$ in the HPRD [15] network. Each pane plots the average rejection rate (ratio of the number of S evaluated that did not satisfy $f_b$ to the total number of S evaluated) of each algorithm versus threshold t, where the node scores were sampled from a Gaussian distribution with m = 10 and standard deviation ρ.

Figure 11: Comparison of the average runtime required for each algorithm to enumerate all S up to size 4 satisfying $f_b$ in the HPRD [15] network. Each pane plots the average execution time versus threshold t, where the node scores were sampled from a Gaussian distribution with m = 10 and standard deviation ρ. At low values of t and ρ the average DFS execution time was sometimes less than the time for CDFS. However, when ρ > 6 or t > 25 the average CDFS time was consistently less than that of the DFS method.

7 Discussion

The computational results support the hypothesis that CDFS can perform enumeration of all connected induced subgraphs that satisfy a bound $f_b$, or exhaustive enumeration of all connected induced subgraphs, with less overhead than DFS based approaches. The CDFS method consistently performed equally well or significantly better during the synthetic tests, and when integrated into the CRANE [6] algorithm it identified the same interesting solutions in dramatically less time (30 minutes versus 234 minutes on the HPRD network).

The CDFS algorithm removes many possible redundant rejections of subgraphs that do not satisfy $f_b(S)$. However, it is only a reduction, and it is possible that an $S' \subset S$ that does not satisfy $f_b(S')$ will be evaluated multiple times during traversal of T. For example, it can be observed in Figure 6 (C) that if CE is rejected, then when branch C is later joined to BGE, the subgraph CE is again contained in BGEC, so a redundant rejection will occur. Another less subtle example is in Figure 3, where if ACD is not rejected but ABD is, then because the CD branch already exists in the β of AB, ABCD will reject ABD again. In practice this does not appear to happen often enough to cause CDFS to encounter more redundant rejections than the DFS method, but it is overhead that we plan to address in future work.

The memory consumption of CDFS also requires consideration. If CDFS will be used for an enumeration task, it is better suited to tasks of evaluating S that are small compared to G because the size of T grows as $2^{|S|}$. The potential to use memory exponential in |S| makes use of CDFS for enumeration of all S in G up to size |V| infeasible, in which case a low memory approach such as ReverseSearch may take longer to complete but is better suited in terms of space requirements.
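To make the exponential growth concrete, consider a star graph whose center is the anchor vertex and which has d leaves. This is an illustrative worst case chosen here, not a case taken from the experiments. Every subset of the leaves together with the center induces a connected subgraph, so an exhaustive enumeration tree holds one node per subset of the leaves:

```latex
\[
  |T| \;=\; \sum_{i=0}^{d} \binom{d}{i} \;=\; 2^{d},
  \qquad d = 9 \;\Rightarrow\; |T| = 512,
  \qquad d = 20 \;\Rightarrow\; |T| = 1{,}048{,}576 .
\]
```

Bounding the maximum size of S therefore bounds the depth of T, which is what keeps the cache manageable on large graphs.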

8 Conclusion

We have investigated the problem of reducing the number of subgraphs evaluated while enumerating all connected induced subgraphs S ⊆ V that satisfy a hereditary bounding criterion $f_b$. Our proposed method displays a significant decrease in runtime compared to a classical depth first branch and bound approach. In addition, Theorems 1, 2 and 3 provide proof of correctness that all connected induced subgraphs S that satisfy $f_b(S)$ are enumerated. Finally, our optimization strategy also improves performance when the search is equivalent to exhaustive enumeration. However, due to the potential for our method to use space exponential in the maximum size of S being enumerated, our method is best suited to enumerating all |S| ≤ k from G, where k is chosen appropriately for the problem and the available memory.

A Extended Definitions of Complex Notation

$P(n_k)$ : The sequence of nodes along a connected path from the root of T to a tree node $n_k$. It represents a sequence of $n \in N$ such that the first element of $P(n_k)$ is always $n_1 = r$, i.e.,

$$P(n_k) = \{n_1, n_2, n_3, \ldots, n_k\} \tag{1}$$

We use the relation $\sqsubset$ to indicate that the path $P(n_k)$ is a sub-path of $P(n_{k+1})$, i.e., $P(n_k) \sqsubset P(n_{k+1})$ means the sequence of nodes defined by $P(n_k)$ matches the sequence of the first k nodes defined by $P(n_{k+1})$.

$D_{n_k}$ : The neighbor set of $n_k$, defined as the set of vertices adjacent to the vertex labeling $n_k$, i.e.,

$$D_{n_k} = \{u \in V : u v_{n_k} \in E\} \tag{2}$$

$C_{n_k}$ : The cull set of $n_k$. If $n_k = r$, the cull set is $\{v_r\}$. Otherwise the cull set is the union of $\{v_r\}$ and all vertices adjacent to vertices labeling nodes along the path $P(n_{k-1})$, i.e.,

$$C_{n_k} = \{v_r\} \cup \{u : u v_{n_i} \in E,\ 1 \le i < k\} \tag{3}$$

$\chi_{n_k}$ : The extension set of $n_k$, defined as the set of vertices adjacent to $v_{n_k}$ but not adjacent to any $v_{n_i}$ with $1 \le i < k$, i.e.,

$$\chi_{n_k} = D_{n_k} \setminus C_{n_k} \tag{4}$$
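Equations 2 through 4 translate directly into set operations. The sketch below assumes that a path is given as the list of vertex labels from the root to $n_k$ and that the graph is a dictionary of neighbor sets; the function names are illustrative only.

```python
def neighbor_set(adj, path):
    """D: vertices adjacent to the vertex labeling the last node of the path (Eq. 2)."""
    return set(adj[path[-1]])

def cull_set(adj, path):
    """C: the root vertex plus all neighbors of vertices before the last node (Eq. 3)."""
    cull = {path[0]}
    for v in path[:-1]:
        cull |= adj[v]
    return cull

def extension_set(adj, path):
    """chi = D \\ C (Eq. 4)."""
    return neighbor_set(adj, path) - cull_set(adj, path)

# Four-cycle A-B-D-C-A (illustrative input) with root A.
cycle = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A", "D"}, "D": {"B", "C"}}
chi_root = extension_set(cycle, ["A"])
chi_ab = extension_set(cycle, ["A", "B"])
chi_acd = extension_set(cycle, ["A", "C", "D"])
```

For the path A, B the vertex A is culled as the root, leaving only D, which is adjacent to B and to no earlier vertex on the path.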

$\Gamma(n)$ : The source of a tree node n. Source nodes are added to N by the DEPTH procedure of Algorithm 1, while the BREADTH procedure only copies nodes already in N. The source of n is n if n was created by DEPTH. Otherwise, the source of n is the source of the node that BREADTH copied to create n. We refer to a node that is not its own source as a clone.

$$\Gamma(n) = \begin{cases} n & \text{if } n \text{ was created by DEPTH} \\ \Gamma(m) & \text{if } n \text{ was cloned by BREADTH from node } m \end{cases} \tag{5}$$

$\beta(n_k, n_{k+1})$ : The branch list passed from node $n_k$ to node $n_{k+1}$.

Let $n_i \in N$ be a node in the enumeration tree and let $n_j \in K(n_i)$ be a child of $n_i$. The branch list $\beta(n_i, n_j)$ denotes an ordered list passed by $n_i$ to $n_j$ by the enumeration algorithm, defined recursively as follows:

• $\beta(n_0, n_i) = [\,n_j : v_{n_j} \in \chi_{n_0},\ v_{n_j} \succ v_{n_i}\,]$ for all $n_i \in K(n_0)$. In other words, the branch list passed by the root to each of its children contains, in lexicographic order, the nodes labeled with vertices in $\chi_{n_0}$ that are of greater lexicographic rank than the vertex labeling the respective child.

• $\beta(n_k, n_{k+1}) = [\,\beta(n_{k-1}, n_k);\ [\,n_j : v_{n_j} \in \chi_{n_k},\ v_{n_j} \succ v_{n_{k+1}}\,]\,]$ for all $n_k \in N$ and $n_{k+1} \in K(n_k)$. In other words, the branch list passed to node $n_{k+1}$ by its parent $n_k$ contains the concatenation of the branch list of $n_k$ and the ordered list of nodes labeled with vertices in $\chi_{n_k}$ that are of greater lexicographic rank than $v_{n_{k+1}}$.

Equation 6 defines the relationship in a more compact form.

$$\beta(n_k, n_{k+1}) = \begin{cases} [\,n_j : v_{n_j} \in \chi_{n_0},\ v_{n_j} \succ v_{n_i}\,] & \text{if } k = 0 \\ [\,\beta(n_{k-1}, n_k);\ [\,n_j : v_{n_j} \in \chi_{n_k},\ v_{n_j} \succ v_{n_{k+1}}\,]\,] & \text{if } k > 0 \end{cases} \tag{6}$$

B Supporting Lemmas

The following lemmas support Theorems 1 and 2 in Section 5. The general conclusions of the lemmas used directly in the theorems are as follows:

• Theorem 1

– Lemma 4 shows that each path in T represents a set of vertices that induce a connected subgraph of G.

– Lemma 5 shows that each path in T represents a unique set of vertices.

– Lemma 7 shows that any connected induced subgraph in G is represented by a path in T.

• Theorem 2

– Lemma 8 shows that the enumeration order matches the desired order where all $S' \subset S$ are enumerated before S.

Lemma 1 Let $n_{k+1} \in N$ be a node in the enumeration tree and let $n_k = p(n_{k+1})$ be its parent. For all $n \in \beta(n_k, n_{k+1})$, there must be a node $n_j \in P(n_{k+1})$ such that $v_n \in \chi_{n_j}$. In other words, any node in $\beta(n_k, n_{k+1})$ must be labeled by a vertex adjacent to a vertex labeling a node along $P(n_{k+1})$.

Proof. We prove the lemma by induction on k.

Base case: In the base case, the node of interest is a child of the root node, i.e., $n_{k+1} \in K(n_0)$. In this case, by Equation 6, all vertices labeling nodes in $\beta(n_0, n_{k+1})$ are in $\chi_{n_0}$ and clearly $n_0 \in P(n_{k+1})$.

Inductive step: Assume that $\forall\, n \in \beta(n_{k-1}, n_k)$, $\exists\, n_j \in P(n_k)$ such that $v_n \in \chi_{n_j}$. Now consider a node $n \in \beta(n_k, n_{k+1})$. By Equation 6 at least one of the following has to be true: (i) $v_n \in \chi_{n_k}$ or (ii) $n \in \beta(n_{k-1}, n_k)$. If (i) is true, then the lemma is proved since $n_k \in P(n_{k+1})$. If (ii) is true, then by the inductive hypothesis, we know that there exists $n_j \in P(n_k)$ such that $v_n \in \chi_{n_j}$. Since $P(n_k) \sqsubset P(n_{k+1})$, we have $n_j \in P(n_{k+1})$, and thus the lemma is proven.

Lemma 2 For any $P(n_{k+1})$ in T, there is at least one node $n_j \in P(n_k)$ labeled by a vertex adjacent to $v_{n_{k+1}}$ in G, i.e., $v_{n_j} v_{n_{k+1}} \in E$.

Proof: We investigate the two possible cases where $n_{k+1}$ is either created by the DEPTH procedure or the BREADTH procedure of Algorithm 1.

Case 2.A: $n_{k+1}$ created by DEPTH

In this case, from the DEPTH procedure, we can see that $v_{n_{k+1}}$ is either in the extension set $\chi_{n_k}$ (line 35), or it labels a node in $\beta(n_{k-1}, n_k)$ (line 28). In the first case ($v_{n_{k+1}} \in \chi_{n_k}$), we have $v_{n_k} v_{n_{k+1}} \in E$ by definition of extension set. In the second case, since $n_{k+1} \in \beta(n_{k-1}, n_k)$, we know by Lemma 1 that $\exists\, n_j \in P(n_{k+1})$ such that $v_{n_{k+1}} \in \chi_{n_j}$. Therefore, by definition of extension set, it immediately follows that $\exists\, n_j \in P(n_k)$ such that $v_{n_j} v_{n_{k+1}} \in E$.

Case 2.B: $n_{k+1}$ created by BREADTH

In this case, $n_{k+1}$ is a clone of another node $m_{k+1} \in N$. Note that cloned nodes can also be cloned by BREADTH, thus there may be multiple clonings between $m_{k+1}$ and $n_{k+1}$, but $m_{k+1}$ is the very first node that is created by DEPTH, which is defined as $\Gamma(n_{k+1})$ in Equation 5. Let $\ell$ be the lowest common ancestor of $n_{k+1}$ and $m_{k+1}$.

Since $m_{k+1}$ is created by DEPTH, we know from Case 2.A that there exists $m_j \in P(m_k)$ such that $v_{m_j} v_{m_{k+1}} \in E$. Now, since $m_j \in P(m_{k+1})$, we have two cases:

• $m_j \in P(\ell)$: In this case, since $P(\ell) \sqsubset P(n_{k+1})$ (because $\ell$ is along the path to $n_{k+1}$) we immediately have $m_j \in P(n_k)$, and thus the lemma is proven.

• $m_j \in P(m_k) - P(\ell)$: Since $\ell$ is the lowest common ancestor of $m_{k+1}$ and $n_{k+1}$, and $n_{k+1}$ is a clone of $m_{k+1}$, the sub-path from $\ell$ to $m_{k+1}$ is cloned such that the vertices labeling $P(m_{k+1}) - P(\ell)$ will label a sub-path of the path from $\ell$ to $n_{k+1}$. Since the set of vertices labeling the nodes on this cloned sub-path is identical to the set of vertices labeling the nodes on $P(m_{k+1}) - P(\ell)$ and $m_j \in P(m_k) - P(\ell)$, there exists a clone $n_j$ of $m_j$ along $P(n_k) - P(\ell)$ labeled by $v_{m_j}$. Therefore, because $n_{k+1}$ is a clone of $m_{k+1}$, we know $\exists\, n_j$ along $P(n_k)$ s.t. $v_{n_j} v_{n_{k+1}} \in E$.



Lemma 3 For any path node $n_k \in T$, no other node on $P(n_k)$ can be labeled by $v_{n_k}$. Stated formally, $\nexists\, n_i \in P(n_{k-1})$ s.t. $v_{n_i} = v_{n_k}$.

Proof: Let $n_i$ be a node on $P(n_{k-1})$. We consider all possible relationships between $n_i$ and $n_k$ to show that $v_{n_i} \neq v_{n_k}$.

Case 3.A: $n_i$ and $n_k$ created by DEPTH

In this case, from the DEPTH procedure of Algorithm 1, we know that $v_{n_k} \in \chi_{n_{k-1}}$ at line 35. But since $v_{n_i} \in C_{n_{k-1}}$, we know that $v_{n_i} \notin \chi_{n_{k-1}}$ (4). It immediately follows that $v_{n_i} \neq v_{n_k}$.

Case 3.B: $n_i$ created by DEPTH, $n_k$ created by BREADTH

In this case, $n_k$ is a clone of another node $n_m = \Gamma(n_k)$, and thus we have $v_{n_m} = v_{n_k}$. Let $n_\ell$ be the lowest common ancestor of $n_k$ and $n_m$. There are five possible relationships between $n_\ell$, $n_m$, and $n_i$.

Case 3.B.1 $n_i$ is an ancestor of $n_\ell$.

If $n_i$ is an ancestor of $n_\ell$ then it is also an ancestor of $n_m$. Thus we can apply the argument in Case 3.A to conclude that $v_{n_i} \neq v_{n_m}$ and hence $v_{n_i} \neq v_{n_k}$.

Case 3.B.2 Both $n_i$ and $n_m$ are children of $n_\ell$.

If $n_i$ and $n_m$ are children of $n_\ell$, then the nodes are labeled by different vertices by (4), i.e., $v_{n_i} \neq v_{n_m}$. Hence $v_{n_i} \neq v_{n_k}$.

Case 3.B.3 $n_i$ is a child of $n_\ell$, $n_m$ is a child of a descendant of $n_\ell$.

If $n_i$ is a child of $n_\ell$, $v_{n_i}$ will be in the cull set of all descendants of $n_\ell$ by (3). By this fact we know that no descendant $n_m$ of $n_\ell$ can be labeled with $v_{n_i}$. It follows that $v_{n_i} \neq v_{n_m}$ and hence $v_{n_i} \neq v_{n_k}$.

Case 3.B.4 $n_m$ is a child of $n_\ell$, $n_i$ is a child of a descendant of $n_\ell$.

If $n_m$ is a child of $n_\ell$, $v_{n_m}$ will be in the cull set of all descendants of $n_\ell$ by (3). By this fact we know that no descendant $n_i$ of $n_\ell$ can be labeled with $v_{n_m}$. It follows that $v_{n_i} \neq v_{n_m}$ and hence $v_{n_i} \neq v_{n_k}$.

Case 3.B.5 $n_i$ and $n_m$ are children of descendants of $n_\ell$.

As each branch in β is cloned, nodes labeled with vertices in the extension set of the node appending the branch are removed by Algorithm 1 at line 2. When $p(n_i)$ is passed the branch containing $n_m$ it will copy it by removing any nodes labeled by $v_{n_i}$. Because the pruned branch is passed to all descendants, we know that if $v_{n_m} = v_{n_i}$ it has been removed and thus for any $n_k$, $v_{n_i} \neq v_{n_k}$.

Case 3.C: $n_i$ and $n_k$ created by BREADTH

In this case $n_i$ and $n_k$ are clones of nodes in a branch cloned by BREADTH. Let $n_z$ be the original node created by DEPTH that is cloned to $n_i$, and let $n_m$ be the descendant of $n_z$ that is cloned to $n_k$. Because $n_z$ is created by DEPTH, we can prove that $v_{n_z} \neq v_{n_m}$ using Case 3.B, and thus $v_{n_i} \neq v_{n_k}$.



Lemma 4 For a given $P(n_k)$, the vertices $\psi(P(n_k))$ induce a connected subgraph of G.

Proof: This follows directly from Lemma 2 in that any vertex labeling $n_k$ of $P(n_k)$ is adjacent to a vertex labeling at least one other node $n_j$ of $P(n_k)$ such that j < k.



Lemma 5 For any pair of distinct nodes $n_k, n_m \in N$, $\psi(P(n_k)) \neq \psi(P(n_m))$. In other words, each path in T represents a unique connected induced subgraph of G.

Proof: Let $n_\ell$ be the lowest common ancestor of $n_k$ and $n_m$, and let x and y be the children of $n_\ell$ such that $x \in P(n_k)$ and $y \in P(n_m)$. Assume without loss of generality that x is to the left of y.

We will consider all possible relationships between x and y to show that $\psi(P(n_k)) \neq \psi(P(n_m))$.

Case 5.A: Both x and y are created by DEPTH

In this case, $v_y$ is in the cull set of x and x is to the left of y, so $\beta(n_\ell, x)$ contains no nodes labeled by $v_y$. Hence no descendant of x can be labeled by $v_y$, and thus $v_y \notin \psi(P(n_k))$. But since $v_y \in \psi(P(n_m))$, it follows that $\psi(P(n_k)) \neq \psi(P(n_m))$.

Case 5.B: x created by BREADTH and y created by DEPTH

In this case, x was passed to $n_\ell$ from its parent and by line 2 of Algorithm 1 we know that all nodes in the branch rooted at x labeled by $v_y$ were removed. Hence no descendant of x can be labeled by $v_y$, and thus $v_y \notin \psi(P(n_k))$. But since $v_y \in \psi(P(n_m))$, it follows that $\psi(P(n_k)) \neq \psi(P(n_m))$.

Case 5.C: x and y created by BREADTH

We will consider two cases, based on whether $n_\ell$ was created by DEPTH or by BREADTH.

Case 5.C.1: $n_\ell$ created by DEPTH

Let w and t denote the “original” (created by DEPTH) nodes that are cloned to create respectively x and y. So we have $v_w = v_x$ and $v_t = v_y$. In this case w and t are in $\beta(n_{\ell-1}, n_\ell)$. If p(w) = p(t), the branch rooted at x cannot contain any nodes labeled by $v_y$ by Case 5.A. Otherwise, we know w is discovered before t because it is to the left of t in T. It follows from line 2 of Algorithm 1 that any node labeled by $v_t = v_y$ is removed from the clone of branch w when t is discovered. Thus no descendant of x can be labeled by $v_y$.

Case 5.C.2: $n_\ell$ created by BREADTH

Let $n_z$ be the “original” node (created by DEPTH) that is cloned to create $n_\ell$. Let $n_i$ and $n_j$ denote the descendants of $n_z$ that are cloned to respectively create $n_k$ and $n_m$.

Using Cases 5.A through 5.C.1 we can prove that $\psi(P(n_i)) \neq \psi(P(n_j))$. As the paths $P(n_i)$ and $P(n_k)$ only differ by their respective prefixes $P(n_z)$ and $P(n_\ell)$, we can establish the relation $P(n_k) - P(n_\ell) = P(n_i) - P(n_z)$, and hence $\psi(P(n_k)) = (\psi(P(n_i)) \setminus \psi(P(n_z))) \cup \psi(P(n_\ell))$. Similarly, we have $\psi(P(n_m)) = (\psi(P(n_j)) \setminus \psi(P(n_z))) \cup \psi(P(n_\ell))$. Therefore, $\psi(P(n_i)) \neq \psi(P(n_j))$ implies $\psi(P(n_k)) \neq \psi(P(n_m))$.



Lemma 6 Given a path P(nk) representing S\v where v was selected from S by

Algorithm 2, then for the source node m = Γ(nk) either v ∈ χm or v labels a node in

β(p(m), m). In other words, Algorithm 2 selects a vertex that labels a child of nk.

Proof. In Algorithm 2, v will be the last vertex removed from θ so the vertex

labeling nk must be removed before v. Because we prove the lemma on Γ(nk) we are

proving that when vnk is discovered it has the ability to add a child labeled with v.

We investigate three possible relationships between vnk and v. Cases B and C utilize the fact that the vertices in θ in Algorithm 2 are sorted first by the order they are discovered in Algorithm 2 at lines 4 and 14, and second by reverse lexicographic order at lines 5 and 16.

Case 6.A v adjacent to vnk

In this case v was added to θ by the iteration that removed vnk , and we know v

is adjacent to vnk and not adjacent to any vertex preceding vnk in P by line 18

of Algorithm 2. This is the definition of extension set (4), therefore v ∈ χΓ(nk).

Case 6.B v adjacent to the same vertex that discovered vnk

If v is discovered at the same time as vnk then we know both vertices are in the extension set of the vertex that discovered them. Furthermore, we know that vnk ≺ v because v is the last removed from θ, where the vertices are sorted in reverse lexicographic order. Because vnk ≺ v, the definition of branch list (6) implies that Γ(nk) receives a branch list from its parent containing a node labeled by v.

Case 6.C v adjacent to an ancestor of the vertex that discovered vnk

If v is discovered before vnk then v is the last removed from θ by the order of discovery. Because v was discovered before vnk, the definition of branch list (6) implies that Γ(nk) receives a branch list from its parent containing a node labeled by v.

Because v must be adjacent to at least one vertex in S\v and vnk labels the last node along P(nk), no other relationships exist between vnk and v, and the lemma is proven.



Lemma 7 For any S ⊆ V such that v0 ∈ S and S induces a connected subgraph of G, there exists a path in T that represents S.

Proof. We prove the lemma by induction on |S|.

Base case: When |S| = 1, we know v0 ∈ S, so S is represented by the root of T.

Inductive step: Assume that for any S such that |S| ≤ k, v0 ∈ S, and S induces a connected subgraph in G, there exists a path in T that represents S. Consider any set S such that |S| = k + 1, v0 ∈ S, and S induces a connected subgraph in G. We will show that S is represented by a path in T.

Let g denote the subgraph of G that is induced by S. Let v be the vertex in S that is selected by Algorithm 2 (where Algorithm 2 selects the v ∈ S that labels the lowest node in T among all vertices in S). Now define S′ = S \ v, and let g′ be the subgraph of g induced by S′. By its definition, v is a leaf in the search tree that results from running Algorithm 2 on g. Thus its removal leaves the tree connected. Since this remaining tree is a subgraph of g that contains all vertices in S′, we can conclude that g′ is connected, i.e., S′ induces a connected subgraph in G. Therefore, by the inductive hypothesis, we know that S′, with |S′| = k, is represented by a path P(nk) in T. Furthermore, we know that v is adjacent to a vertex labeling a node along P(nk) since g is connected. By the inductive hypothesis that P(nk) exists in T representing S \ v and the fact that v is selected by Algorithm 2, we know P(nk+1) representing S exists in T by one of the following two cases:

Case 7.A : nk = Γ(nk)

If nk is created by DEPTH we know by Lemma 6 that either v ∈ χnk or v labels

a node in β(nk−1, nk). If v ∈ χnk then a path P(nk+1) representing S exists by

extending P(nk) with a node labeled by v at line 34 of Algorithm 1. If v labels

a node in β(nk−1, nk) then a path P(nk+1) representing S exists by extending

path P(nk) with a node labeled by v at line 27 of Algorithm 1.

Case 7.B : nk ≠ Γ(nk)

If nk is created by BREADTH, we can use Case 7.A to prove that Γ(nk), created by DEPTH, has a child labeled by v. Because BREADTH clones Γ(nk) and all of its descendants, the node nk also has a child labeled by v. Therefore, a path P(nk+1) exists representing S.
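The inductive step rests on the observation that the last vertex visited by a search from v0 is a leaf of the search tree, so removing it keeps the remaining set connected. The following Python sketch checks that observation by brute force on a small example graph. The adjacency-dictionary representation, the example graph, and the helper names (`is_connected`, `visit_order`, `count_checked_subsets`) are illustrative assumptions, and the plain stack-based search stands in for Algorithm 2 rather than reproducing it.

```python
from itertools import combinations


def is_connected(S, adj):
    """Return True if the non-empty vertex set S induces a connected subgraph."""
    S = set(S)
    start = next(iter(S))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w in S and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == S


def visit_order(S, root, adj):
    """Order in which a stack-based search from root visits the vertices of S."""
    S = set(S)
    order, seen, stack = [], {root}, [root]
    while stack:
        u = stack.pop()
        order.append(u)
        for w in sorted(adj[u], reverse=True):
            if w in S and w not in seen:
                seen.add(w)
                stack.append(w)
    return order


def count_checked_subsets(adj, v0):
    """Check every connected S containing v0 with |S| >= 2; return how many were checked."""
    others = [v for v in sorted(adj) if v != v0]
    checked = 0
    for size in range(1, len(others) + 1):
        for extra in combinations(others, size):
            S = {v0, *extra}
            if not is_connected(S, adj):
                continue
            v = visit_order(S, v0, adj)[-1]    # last visited vertex, a leaf of the search tree
            assert v != v0
            assert is_connected(S - {v}, adj)  # S' = S \ v stays connected
            checked += 1
    return checked


# Hypothetical example graph with edges a-b, a-c, b-c, c-d, d-e.
EXAMPLE = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}

if __name__ == "__main__":
    print(count_checked_subsets(EXAMPLE, "a"))
```

A check of this kind only exercises the connectivity argument on one small graph; it does not replace the proof.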



Lemma 8 No S′ s.t. S′ ⊂ S is generated after S.

Proof. We prove the lemma by contradiction.

Assume that an S′ ⊂ S is generated after S, that S′ is represented by P(nk), and that S is represented by P(mk). We can conclude that P(nk) ⊏̸ P(mk), because otherwise S′ would be generated before S. Therefore, the paths must diverge at a common node p where the search follows mj on P(mk) before nj on P(nk). P(mk) must contain a node labeled with vnj in order for S to contain S′ as a subset. If nj and mj are both created by DEPTH at line 23 of Algorithm 1 then the ordering of siblings prohibits mj from containing a descendant labeled by vnj, and thus we have a contradiction. If mj is created by BREADTH at line 11 of Algorithm 1 and nj is created by DEPTH then any descendants of mj labeled by vnj would be pruned during the BREADTH procedure because vnj ∈ χp (line 2 of Algorithm 1), and thus we have a contradiction. If both nj and mj are created by BREADTH, the same contradiction can be raised by searching backward from p to r along P(p) for the node where Γ(nj) is added to β.
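The ordering property stated by Lemma 8 can be tested on the output of any enumerator. The Python sketch below assumes the generated vertex sets are available as a list in generation order; that interface and the function name `first_order_violation` are assumptions made for illustration, since Algorithm 1 is not reproduced here. The check is quadratic in the number of sets, so it is only suitable for small test graphs.

```python
def first_order_violation(generated):
    """Return (S, S_sub) where S_sub is a proper subset of S that appears after S, else None."""
    seen = []
    for current in map(frozenset, generated):
        for earlier in seen:
            if current < earlier:  # proper-subset test on frozensets
                return set(earlier), set(current)
        seen.append(current)
    return None


if __name__ == "__main__":
    # Hypothetical generation orders over vertices a, b, c.
    good = [{"a"}, {"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
    bad = [{"a"}, {"a", "b", "c"}, {"a", "b"}]
    print(first_order_violation(good))
    print(first_order_violation(bad))
```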



Algorithm 2 Given a set S of size k + 1, selects the vertex v to remove from S such that v will label the lowest node in T among all v ∈ S
 1: procedure CHOOSE(S, r)
 2:     θ ← [], P ← []
 3:     Add all elements of S to P
 4:     Remove r from P and insert at P[0]
 5:     X ← −SORT(x s.t. rx ∈ E)        ▷ Sort B ≺ A
 6:     for all x ∈ X do
 7:         if x ∈ P then
 8:             push(θ, x)
 9:         end if
10:     end for
11:
12:     t ← 1
13:     while |θ| ≠ 0 do
14:         w ← pop(θ)
15:         Remove w from P, and insert at P[t]
16:         X ← −SORT(x s.t. wx ∈ E)
17:         for all x ∈ X do
18:             if x ∉ θ AND x ∈ P[t..end] then
19:                 push(θ, x)
20:             end if
21:         end for
22:         t ← t + 1
23:     end while
24:     return P[end]
25: end procedure
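As an illustration, Algorithm 2 can be transcribed into Python roughly as follows. This is a sketch under several assumptions that the pseudocode does not state explicitly: the graph is given as an adjacency dictionary `adj` mapping each vertex to a set of neighbors, θ behaves as a last-in, first-out stack (suggested by push and pop), −SORT is read as a descending sort (suggested by the comment "Sort B ≺ A"), and P is initially filled in sorted order. The name `choose` and the example graph are hypothetical.

```python
def choose(S, r, adj):
    """Sketch of CHOOSE(S, r): return the last vertex of S visited by the search from r."""
    P = sorted(S)                                # line 3 (initial order is an assumption)
    P.remove(r)                                  # line 4
    P.insert(0, r)
    theta = []                                   # line 2: theta used as a stack
    for x in sorted(adj[r], reverse=True):       # line 5: descending order, B before A
        if x in P:                               # line 7
            theta.append(x)                      # line 8
    t = 1                                        # line 12
    while theta:                                 # line 13
        w = theta.pop()                          # line 14
        P.remove(w)                              # line 15
        P.insert(t, w)
        for x in sorted(adj[w], reverse=True):   # line 16
            if x not in theta and x in P[t:]:    # line 18
                theta.append(x)                  # line 19
        t += 1                                   # line 22
    return P[-1]                                 # line 24


if __name__ == "__main__":
    # Hypothetical graph with edges a-b, a-c, b-d.
    adj = {"a": {"b", "c"}, "b": {"a", "d"}, "c": {"a"}, "d": {"b"}}
    print(choose({"a", "b", "c", "d"}, "a", adj))  # prints c
```

In this example the search from a visits b, then d, then c, so c is returned, and removing c leaves {a, b, d} connected. The sketch assumes S induces a connected subgraph containing r; otherwise unvisited vertices remain at the end of P.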

References

[1] Foto Afrati, Dimitris Fotakis, and Jeffrey Ullman. Enumerating subgraph instances using map-reduce. arXiv, Nov 2012. 1208.0615v2.

[2] David Avis and Komei Fukuda. Reverse search for enumeration. Discrete Applied Mathematics, 1993.

[3] Béla Bollobás. Hereditary properties of graphs: asymptotic enumeration, global structure, and colouring. Documenta Mathematica, pages 333–342, 1998.

[4] Coen Bron and Joep Kerbosch. Finding all cliques of an undirected graph. Communications of the ACM, 16(9):575–577, September 1973.

[5] Salim Chowdhury and Mehmet Koyutürk. Identification of coordinately dysregulated subnetworks in complex phenotypes. In Pacific Symposium on Biocomputing, pages 133–144, 2010.

[6] Salim Chowdhury, Rod Nibbe, Mark Chance, and Mehmet Koyutürk. Subnetwork state functions define dysregulated subnetworks in cancer. 18(3):263–281, 2011.

[7] Han-Yu Chuang, Eunjung Lee, Yu-Tsueng, Doheon Lee, and Trey Ideker. Network-based classification of breast cancer metastasis. October 2007.

[8] Sarah Cohen, Benny Kimelfeld, and Yehoshua Sagiv. Generating all maximal induced subgraphs for hereditary and connected-hereditary graph properties. Journal of Computer and System Sciences, 74:1147–1159, June 2008.

[9] Luigi Cordella, Pasquale Foggia, Carlo Sansone, and Mario Vento. A (sub)graph isomorphism algorithm for matching large graphs. 10:1367–1372, October 2004.

[10] Phuong Dao, Kendric Wang, Colin Collins, Martin Ester, Anna Lapuk, and S. Cenk Sahinalp. Optimally discriminative subnetwork markers predict response to chemotherapy. July 2011.

[11] Sofie Demeyer, Tom Michoel, Jan Fostier, Pieter Audenaert, Mario Pickavet, and Piet Demeester. The index-based subgraph matching algorithm (ISMA): Fast subgraph enumeration in large networks using optimized search trees. PLoS ONE, 8(4):e61183, April 2013. doi:10.1371/journal.pone.0061183.

[12] David Eppstein. All maximal independent sets and dynamic dominance for sparse graphs. arXiv, July 2004. http://arxiv.org/abs/cs/0407036v1.

[13] Chatr-Aryamontri et al. The BioGRID interaction database: 2013 update. 41:816–823, Jan 2013.

[14] Lorente M et al. Stimulation of the midkine/ALK axis renders glioma cells resistant to cannabinoid antitumoral action. January 2011.

[15] Prasad T. S. K. et al. Human Protein Reference Database: 2009 update. 37:767–772, 2009.

[16] RG Verhaak et al. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. January 2010.

[17] Jason Flannick, Antal Novak, Balaji Srinivasan, Harley McAdams, and Serafim Batzoglou. Græmlin: General and robust alignment of multiple large interaction networks. September 2006.

[18] National Center for Biotechnology Information. NCBI Gene, October 2013. http://www.ncbi.nlm.nih.gov/gene.

[19] John Hopcroft and Robert Tarjan. Efficient algorithms for graph manipulation. Communications of the ACM, 16(6), 1973.

[20] Jun Huan, Wei Wang, and Jan Prins. Efficient mining of frequent subgraphs in the presence of isomorphism. In Proceedings of the Third IEEE International Conference on . IEEE, 2003.

[21] David Johnson and Mihalis Yannakakis. On generating all maximal independent sets. Information Processing Letters, 27:119–123, March 1988.

[22] Maxim Kalaev, Mike Smoot, Trey Ideker, and Roded Sharan. NetworkBLAST: comparative analysis of protein networks. January 2008.

[23] Donald Knuth. The Art of Computer Programming, volume 4A of Combinatorial Algorithms Part 1. Addison-Wesley, 2012.

[24] Bin Kong, Tao Yang, Lin Chen, Yong-qin Kuang, Jian-wen Gu, Xun Xia, Lin Cheng, and Jun-hai Zhang. Protein–protein interaction network analysis and gene set enrichment analysis in epilepsy patients with brain cancer. November 2013.

[25] Mehmet Koyutürk, Yohan Kim, Shankar Subramaniam, Wojciech Szpankowski, and Ananth Grama. Detecting conserved interaction patterns in biological networks. October 2006.

[26] Donald Kreher and Douglas Stinson. Combinatorial Algorithms. CRC Press, 1999.

[27] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graph evolution: Densification and shrinking diameters. 2007.

[28] Ambros Marzetta. ZRAM: A Library of Parallel Search Algorithms and Its Use in Enumeration and Combinatorial Optimization. PhD thesis, Swiss Federal Institute of Technology Zurich, 1998.

[29] Rod Nibbe, Sanford Markowitz, Lois Myeroff, Rob Ewing, and Mark Chance. Discovery and scoring of protein interaction subnetworks discriminative of late stage human colon cancer. 8(4):827–845, April 2009.

[30] Vishal Patel, Giridharan Gokulrangan, Salim Chowdhury, Yanwen Chen, Andrew Sloan, Mehmet Koyutürk, Jill Barnholtz-Sloan, and Mark Chance. Network signatures of survival in glioblastoma multiforme. 9, September 2013.

[31] Ron Rymon. Search through systematic set enumeration. Technical report, University of Pennsylvania, August 1992.

[32] J. R. Ullmann. An algorithm for subgraph isomorphism. 23:31–42, January 1976.

[33] Bin Wu, Shengqi Yang, Haizhou Zhao, and Bai Wang. A distributed algorithm to enumerate all maximal cliques in MapReduce. In International Conference on Frontier of and Technology, 2009.

[34] Xifeng Yan and Jiawei Han. gSpan: Graph-based substructure pattern mining. In Proc. 2002 of Int. Conf. on Data Mining (ICDM’02), 2002.
