On Network Tools for Network Motif Finding: A Survey Study

Elisabeth A. Wong1,2, Brittany Baur1,3

1 2010 NSF Bio-Grid REU Research Fellows at Univ of Connecticut; 2 Bowdoin College; 3 Manhattanville College

Abstract. Network motifs have been called the building blocks of networks [1]. Graph theory is used to represent and search networks computationally. Much effort has gone into developing motif discovery tools that find network motifs: patterns, or subgraphs, that occur more frequently in the input network than in randomized networks, where patterns occur by chance [2]. Complications involved with network motif discovery include the graph isomorphism problem, for which no polynomial-time algorithm is known (the related subgraph isomorphism problem is NP-complete). A myriad of tools and algorithms have been developed, both for full enumeration of subgraphs and for avoiding full enumeration in order to lessen runtimes and required computational power. Experimental data from various tools are provided in this paper, including (1) runtimes for different subgraph sizes, network sizes, and numbers of random networks generated, (2) differences in frequencies under different search restrictions, and (3) protein-protein interaction (PPI) network results. The limitations that still exist, especially concerning the size of motifs and networks that can be searched, are also included. This paper presents a survey of current network motif discovery tools; algorithms, experimental data, limitations, and pros and cons of tools are examined and discussed.

Keywords: network, motif, algorithm, isomorphism

1 Introduction

Networks are integral parts of many real systems and thus it has become a priority in many research fields to analyze them. Emphasis has been placed on the importance of studying small aspects of networks in order to gain a better understanding of the entire network. Recently graph theory has been used to allow for computational analysis of networks. Through graph theory, it has been found that numerous networks contain network motifs, small sub-graphs that appear more frequently than expected in randomized networks [1]. Because these motifs are statistically significant, it has been hypothesized that they are also significant to their respective networks and systems; higher frequencies of subgraphs than result by chance suggest that the motifs are present due to factors such as being conserved evolutionarily and having an important function or purpose [1]. Each network has different motifs that are more frequent and thus more important to the system or organism that they are in. For example, gene regulation transcriptional networks and neuronal connectivity networks have been found to have motifs known as ‘feed forward loops’ [2] and ‘bifans’ [2]. This suggests that these two networks are similar in some design aspects.

The feed-forward loop is thought to be used in information processing, where its design helps control connections and signals [1]. In contrast, food web networks, which do not deal with information processing, have motifs distinct from those of networks such as neuronal connectivity and gene regulation networks. This demonstrates how motifs are biologically significant in their ability to help analyze, explain, and classify networks. Due to the significance of network motifs, much effort has been put into developing tools that can detect them.

In order for graph theory to be applied to the study of networks and motifs, networks need to be represented by graphs. Each entity in a network (e.g. a protein, a gene, a person) is represented by a node (or vertex), while the connections between the entities (e.g. an interaction, a regulatory signal, a correspondence) are represented by edges. In some networks the nodes and edges have different characteristics (e.g. different types of genes or different signals passed from gene to gene). In these cases the nodes or edges are ‘colored’, with each color representing a different kind of entity or connection. Another aspect of a network that must be included in the graph representation is whether a connection or edge is ‘directed’ or ‘undirected’. Gene regulatory networks have directed connections, as it is important that the signal travels from one specific gene to another. On the other hand, a network that counts handshaking as a connection is undirected because the shaking of hands is not direction specific. ‘Undirected’ and ‘directed’ edges can be differentiated on a graph with different edge and node colors or by arrows. A graph can be split up into smaller graphs known as subgraphs. As stated before, statistically overrepresented subgraphs are defined as network motifs.
The number of edges that enter and exit a node are summed to determine the node’s ‘degree’, and the number of nodes that make up a motif determines the subgraph’s size. Tools for network motif discovery have proved very difficult to develop so that they are both efficient and able to find motifs of all sizes (not just small motifs). One of the larger obstacles in finding an efficient and thorough algorithm is the graph isomorphism problem. This problem entails determining whether a bijection exists between the nodes of two graphs such that each node is adjacent to the same corresponding nodes in both graphs [3]. Two isomorphic graphs have the same number of nodes and edges and the same degrees for corresponding nodes. The graph isomorphism problem is computationally hard: it is in NP, but it is not known to be solvable in polynomial time or to be NP-complete (the related subgraph isomorphism problem is NP-complete).

The NAUTY algorithm is a well known and powerful algorithm for testing graph isomorphism, and it is used by multiple motif discovery tools. NAUTY uses canonical labeling to tell which graphs are isomorphic to each other [3]: two graphs are isomorphic exactly when their canonical labels are the same. A label for a graph is formed by taking the adjacency matrix of the graph and concatenating it row by row to form a binary number. By leaf partitioning (partitioning the vertices into singleton sets), the automorphisms of the graph can be found by checking the adjacency matrices of different orderings of the vertices and seeing whether the matrix is the same. NAUTY then examines the automorphisms and computes a canonical label, which is the largest or smallest possible concatenated adjacency matrix [4].

Additional difficulties in developing a network motif discovery algorithm include the fact that the number of candidate subgraphs increases exponentially with network size and that there is an absence of the downward closure property in many networks [5]. These difficulties mean that full enumeration of subgraphs can be extremely time consuming and may require large amounts of computational power.

In order to study a network in context, randomized networks are used for comparison. These randomized networks are generated in such a way that their structure is random and thus not a result of any constraints or significant design elements [6]. This allows aspects of the network in question (such as motifs) to be compared to the randomized networks to see whether they are a result of the intrinsic properties of the network or are indicative of real world functional constraints and/or design principles due to selection [2].

Various parameters are used to describe network motif occurrences and to determine whether they are statistically significant. The frequency of a motif is the number of times the motif appears in a network [7]. Different tools use different restrictions for counting frequencies, based on whether or not overlapping of nodes and edges is allowed. Some motif discovery tools ask for a user input that sets a threshold of how many occurrences are required for a motif to be considered ‘frequent’. To determine whether a motif is significant in a specific network and not just present due to intrinsic properties of the network, a uniqueness factor is sometimes applied [7]. If the network in question has a motif with a higher frequency than in a certain number of random networks (threshold set by the user), then the motif is considered to be ‘unique’. In addition, statistics such as the z-score and p-value are often used to determine whether the frequency of a motif is statistically significant.
The z-score is the difference between the frequency of the motif in the target network and the mean frequency of the motif in the randomized networks, divided by the standard deviation of the frequency in the randomized networks [3]. A higher z-score corresponds to a more overrepresented motif. The z-score threshold above which motifs are considered overrepresented is often 2. The p-value is the probability that the number of times a motif appears in a randomized network is equal to or greater than the number of times the motif is present in the network in question [2]. The lower the p-value, the more significant the motif. The threshold below which the p-value is considered significant is commonly 0.01. All of these parameters are important for setting standards that distinguish overrepresented subgraphs from those that are not.

Multiple algorithms and tools, each with different advantages and disadvantages, have been developed to identify network motifs. Network motif discovery is a crucial problem to solve in order to gain further insights into the important characteristics, functions, and inner workings of systems with networks. Therefore it has been the goal of many researchers to develop ways to efficiently identify network motifs. It is our goal in this paper to summarize, collect experimental data about, and analyze the various network motif discovery tools and algorithms that have been created.
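As a concrete illustration of the z-score and p-value described in this section, the following Python sketch computes both from a motif's count in the target network and its counts in a set of randomized networks. The function names are hypothetical, and the use of the population standard deviation and of an empirical (counting-based) p-value are assumptions made for this example; individual tools may differ in these details.

```python
from statistics import mean, pstdev


def z_score(target_count, random_counts):
    """(target frequency - mean random frequency) / std of random frequencies."""
    mu = mean(random_counts)
    sigma = pstdev(random_counts)  # assumption: population standard deviation
    if sigma == 0:
        # Degenerate case: every randomized network gave the same count.
        return float("inf") if target_count > mu else 0.0
    return (target_count - mu) / sigma


def empirical_p_value(target_count, random_counts):
    """Fraction of randomized networks whose count is >= the target count."""
    hits = sum(1 for c in random_counts if c >= target_count)
    return hits / len(random_counts)


if __name__ == "__main__":
    counts_in_random_networks = [10, 12, 8, 10]  # made-up illustrative counts
    print(z_score(20, counts_in_random_networks))
    print(empirical_p_value(20, counts_in_random_networks))
```

With the thresholds mentioned above, a subgraph would be reported as a motif when its z-score exceeds 2 or its p-value falls below 0.01.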

2 Methodology

Major aspects that must be considered when examining motif discovery tools are the methods of determining motif frequencies, the ways of generating randomized networks, the algorithms used for full enumeration, the strategies for identifying motifs without full enumeration, and the data sets the tools can be applied to. Each tool uses a different variation and combination of these factors.

I) Restrictions for Determining Frequencies

An important aspect of each tool and algorithm is the method of determining motif frequencies. Frequency refers to the number of matches of a motif in a network [8]. Methods for determining motif frequency differ in their restrictions on how network elements may be shared between matches [2], and different methods lead to different frequency results. There are three frequency concepts: F1, F2, and F3. F1 allows arbitrary overlapping of nodes and edges. F2 allows only nodes to overlap. F3 does not allow any overlapping of nodes or edges [3]. The method of determining motif frequencies is very important because motif frequency is used in the calculation of statistics such as the z-score and p-value [7]. Numerous tools use these statistics to indicate whether or not motifs are statistically significant. It is important to note which frequency concept is used by which tool, because the different restrictions cause the frequencies calculated under the different concepts to differ significantly [2]. It can also be important to choose tools with certain frequency concepts for specific networks. In some networks the overlapping of edges and nodes may be an important aspect of motifs, whereas in others it may be relevant only to find motifs that do not overlap at all. Thus, paying attention to the frequency concepts is very important when using and designing motif discovery tools.
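The difference between the three concepts can be seen on a small example. The Python sketch below takes a list of motif matches, each given as a set of undirected edges, and reports F1 as the total number of matches, F2 as the largest number of matches that share no edges, and F3 as the largest number of matches that share no nodes. Taking F2 and F3 to be the size of the largest compatible set of matches is an assumption of this example, the function names are hypothetical, and the exhaustive search is only practical for a handful of matches.

```python
from itertools import combinations


def nodes_of(match):
    """All nodes touched by the edges of one match."""
    return {v for edge in match for v in edge}


def share_edge(a, b):
    return bool(a & b)


def share_node(a, b):
    return bool(nodes_of(a) & nodes_of(b))


def largest_compatible_set(matches, conflict):
    """Size of the largest subset of matches with no conflicting pair.

    Exhaustive search from the largest subset size downward (exponential).
    """
    for size in range(len(matches), 0, -1):
        for subset in combinations(matches, size):
            if not any(conflict(a, b) for a, b in combinations(subset, 2)):
                return size
    return 0


def frequencies(matches):
    f1 = len(matches)                                # any overlap allowed
    f2 = largest_compatible_set(matches, share_edge)  # nodes may overlap
    f3 = largest_compatible_set(matches, share_node)  # no overlap at all
    return f1, f2, f3


if __name__ == "__main__":
    # Two-edge path motif matched in the path network a-b-c-d-e.
    matches = [
        frozenset({("a", "b"), ("b", "c")}),
        frozenset({("b", "c"), ("c", "d")}),
        frozenset({("c", "d"), ("d", "e")}),
    ]
    print(frequencies(matches))
```

In this example the three matches give F1 = 3, but only two of them are edge-disjoint and no two are node-disjoint, so the same motif has a different frequency under each concept.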

II) Random network generation

As mentioned previously, random networks are essential in network motif discovery because they are needed for comparison with the input network. Subgraph occurrences in the input network are compared to those in the random networks to see if differences are present which would indicate a significant motif. Multiple methods are used to generate randomized networks. Common randomization techniques include the switching method, the stubs method, and the “go with the winners” algorithm [6].

(1) The switching method is a Markov chain method. It uses the nodes of the input network, preserves their in-degree and out-degree, and switches the edges between the nodes numerous times to obtain randomization. The drawback of the switching method is that the time required for proper mixing of the Markov chain is not known [9].

(2) The stubs method keeps the same in- and out-degrees as the nodes of the input network. Each node has ‘in-stubs’ (one for each incoming edge of the node) and ‘out-stubs’ (one for each outgoing edge of the node). A matching algorithm pairs every in-stub with an out-stub. Theoretically this creates random edges between nodes while still preserving the in- and out-degrees of all nodes. The method discards any self edges or multiple edges. This becomes a problem because numerous real world networks have nodes with degrees such that there will most likely be more than one edge between two nodes [9].

(3) The “go with the winners” algorithm starts with multiple graphs and carries out the stubs method on them. To compensate for the graphs that are eliminated (due to self or multiple edges), the algorithm periodically copies all of its graphs, which keeps the number of graphs constant on average. Once all stubs have been linked the process stops and a random network is chosen from the remaining graphs. This algorithm can be very slow, especially with large scale networks [9].

The switching algorithm has been found to be the ideal method for generation and is often used in network motif discovery tools.
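A minimal Python sketch of the switching method for a directed network is given below. It repeatedly picks two edges a->b and c->d and replaces them with a->d and c->b, skipping any switch that would create a self edge or a duplicate edge, so every node keeps its in-degree and out-degree. The function name, the fixed number of attempted switches, and the assumption that the input has no self edges or duplicate edges are choices made for this example; as noted above, the number of switches needed for proper mixing is not known in general.

```python
import random


def switch_randomize(edges, num_switches, seed=None):
    """Degree-preserving randomization of a directed edge list."""
    rng = random.Random(seed)
    edge_list = list(edges)
    if len(edge_list) < 2:
        return edge_list
    edge_set = set(edge_list)
    for _ in range(num_switches):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if a == d or c == b:
            continue  # the switch would create a self edge
        if (a, d) in edge_set or (c, b) in edge_set:
            continue  # the switch would create a duplicate edge
        edge_set.discard((a, b))
        edge_set.discard((c, d))
        edge_set.add((a, d))
        edge_set.add((c, b))
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_list


if __name__ == "__main__":
    network = [(1, 2), (2, 3), (3, 4), (4, 1), (1, 3), (2, 4)]
    print(switch_randomize(network, num_switches=100, seed=0))
```

Rejected switches are simply skipped rather than retried, so `num_switches` counts attempts, not successful switches.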

III) Classification of tools based on algorithms:

Network centric tools require that the entire network be searched and all subgraphs be enumerated. On the other hand, non-network centric tools (motif centric tools) allow a single specific motif to be examined [10]. Major network motif discovery tools have been classified into these two groups and divided further within each group based on aspects of their algorithms.

NETWORK CENTRIC ALGORITHMS:

Algorithms that use trees:

NeMoFINDER

NeMoFINDER is a motif discovery algorithm used specifically to find motifs in PPI networks [5]. This tool uses trees to partition the network in question. It uses concept F1. By allowing for arbitrary node and edge overlap it ensures uniqueness and is not downward closed. The required inputs for the algorithm include the maximum motif size, the number of randomized networks, the target PPI network, a frequency threshold, and a uniqueness threshold [7]. NeMoFINDER generates randomized networks via the switching method [5], and the Apriori algorithm is used for subgraph frequency determination [8].

The algorithm can be divided into three main steps [5]. The first step entails finding all occurrences of size 2 trees and then of successively larger trees, until trees of all sizes from 2 to k have been found. This ensures all of the repeated subgraphs have been found. If the number of occurrences of a size k tree is larger than a user given frequency threshold, then the subgraph in question is considered statistically significant and is designated as a motif. In step 2 the size k trees are used to partition the graph. Thus each section of the graph contains trees of size 2 through k. In step 3, for each size k tree a subgraph is generated with k-1 edges and k nodes. A new set of subgraphs is then generated by combining each k-1 edge subgraph with a size k tree, resulting in subgraphs with k edges. This new set contains subgraphs that are all candidates for being a motif. The number of occurrences of each candidate subgraph is found in the partition of the network by the size k trees. If the number of occurrences is more than a given threshold, the subgraph is added to a set of repeated subgraphs. These subgraphs are then combined with newly generated subgraphs to find k+1 sized subgraphs. This process continues until all repeated subgraphs of size 2 through k are detected. Because the network is partitioned by trees, the algorithm is scalable [5]. NeMoFINDER also uses the concept of graph cousins to generate possible motif candidates [5]. However, graph cousin generation can be ambiguous, and symmetry breaking is not used in the NeMoFINDER algorithm, resulting in the discovery of redundant subgraphs [8].

Performance studies have been carried out on NeMoFINDER. PPI network motifs of different sizes were ranked by frequency, uniqueness, and individual motif size. Motif strengths were generated and scored from these parameters. The scores were compared by function homogeneity, localization coherence, and gene expression correlation. The reliability of each motif was determined using this scoring method [5].

Kavosh

Kavosh is a network motif discovery tool that uses trees to enable the detection of motifs. It can handle both directed and undirected networks [3]. There are four main parts of the Kavosh algorithm: (1) enumeration, (2) classification, (3) random graph generation, and (4) motif identification [3].

Enumeration examines the network in question and finds all subgraphs of a given size (it is also performed on the random graphs). This is achieved by selecting one node and finding all the combinations of connections with the neighboring nodes via a tree representation. The first level of the tree is the selected node, the second level consists of the neighbors of this node, the third level is made up of the neighbors of the previous neighbors, and so on. If a size k subgraph is being searched for, all compositions of size k-1 are found. The ‘revolving door algorithm’ is used to go through all of the nodes at each level, ascending from the bottom level and labeling each node as ‘visited’. This ensures that no tree or subgraph is constructed more than once. The algorithm finds all of the combinations of the nodes, including subgraphs with nodes in the same level (e.g. a subgraph of size 3 can be made up of an initial node and two neighbors, or of an initial node, a neighbor, and a neighbor of a neighbor). After these subgraphs are found, the node is removed and a new node is selected. This process is also carried out on the randomized networks to find the subgraph frequencies in the randomized cases for comparison. Constraints are placed on the construction of these trees (some explained above) so that each specific tree is generated only once. This avoids redundancy and extra computational time.

Classification involves placing the subgraphs found in the enumeration step into isomorphism classes. This is done using the NAUTY algorithm [3]. Random graphs are generated in Kavosh using the switching method. The frequencies of subgraphs in the input network are compared to their frequencies in the random networks, and subgraphs are designated as motifs if their frequencies are higher in the input network than in the random networks. Parameters often used include p-value, frequency level, and z-score [3].
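The idea of grouping subgraphs by a canonical label can be illustrated with a brute-force Python sketch that tries every ordering of the vertices, concatenates the adjacency matrix row by row, and keeps the largest resulting bit string. This is not the NAUTY algorithm, which avoids trying all orderings; the sketch is only feasible for very small subgraphs, and the function name is hypothetical.

```python
from itertools import permutations


def canonical_label(nodes, edges, directed=False):
    """Largest row-by-row adjacency bit string over all vertex orderings."""
    nodes = list(nodes)
    edge_set = set(edges)
    if not directed:
        # Store both directions so the adjacency matrix is symmetric.
        edge_set |= {(v, u) for u, v in edges}
    best = None
    for order in permutations(nodes):
        bits = "".join(
            "1" if (u, v) in edge_set else "0"
            for u in order   # rows
            for v in order   # columns
        )
        if best is None or bits > best:
            best = bits
    return best


if __name__ == "__main__":
    # Two differently named three-node paths receive the same label.
    print(canonical_label("abc", [("a", "b"), ("b", "c")]))
    print(canonical_label("xyz", [("x", "z"), ("z", "y")]))
    # A triangle receives a different label.
    print(canonical_label("abc", [("a", "b"), ("b", "c"), ("a", "c")]))
```

Enumerated subgraphs can then be counted per class by using the label as a dictionary key.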

MA Visto

MA Visto is able to consider all three frequency concepts when enumerating subgraphs [2]. This allows for an effective visual representation of the frequency concepts. MA Visto finds all of the subgraphs of a certain size and finds the frequency of each subgraph under all three frequency concepts. The flexible pattern finder (FPF) algorithm is used by MA Visto to search for the motifs [13]. The FPF algorithm looks at patterns of the given target size (e.g. all patterns of size 4 when looking for size 4 motifs). As the size of the pattern increases, the number of possible patterns of that size also increases, so finding all of the patterns of one size would be computationally costly. A tree is constructed in which each level is comprised of patterns of a certain size, up to the level where the desired size is reached. In order to avoid generating all the possible patterns of a given size, FPF eliminates a pattern that is not supported by (cannot be mapped to) the input network as soon as it appears in the tree, which allows unnecessary branches to be eliminated [14]. Also, since the frequency of a pattern decreases with increasing pattern size, if an intermediate (and smaller) pattern is found to have a smaller frequency than patterns of the desired (and larger) size, its branch of the tree is discontinued because it will never reach a high enough frequency [13]. MA Visto uses the frequencies of the subgraphs in the input network as well as the frequencies in the randomized networks to find z-scores and p-values for the different motifs [2].

Probabilistic algorithms:

Full enumeration can be computationally costly and require a lot of time. As the size of the subgraphs being searched for increases, the number of possible isomorphism types increases. This makes exhaustive enumeration algorithms extremely time consuming and costly because they need to find the frequencies of each different isomorphic graph of all sizes in both the input network and the randomized networks. Kashtan et al developed a ‘sampling method for subgraph counting’, which is a probabilistic algorithm [11]. This algorithm estimates subgraph frequencies by sampling subgraphs, which is less time consuming than full enumeration. The algorithm is designed so that its runtime is asymptotically independent of network size. With Kashtan’s sampling algorithm, networks larger than those that full enumeration algorithms can handle can be analyzed, and larger motifs can be identified.

The sampling algorithm finds a random n-sized subgraph as follows. An edge is picked randomly and its neighboring edges all become candidates to be the next edge. One of the candidates is picked at random and its neighbors become the new candidates. This process continues, with one edge chosen randomly from the candidates at each step, until a subgraph of size n is created. All of the nodes from these edges and all the edges that connect these nodes make up the sampled subgraph [11]. An ordered set of n-1 edges needs to be picked for an n sized subgraph to be found. The probabilities of picking these ordered sets are used to find the probability that an n sized subgraph will be sampled. From this and a few additional calculations the estimated subgraph concentrations are found [11].

A major problem with Kashtan et al’s method is that its sampling is biased [8]: each subgraph does not have a uniform probability of being sampled [10]. Therefore, occurrences of a subgraph cannot be impartially estimated [8]. The algorithm tries to take this into account by weighting each sampled subgraph with a value of 1/(probability of the subgraph being chosen) [10]. Other tools that use probabilistic sampling algorithms as alternatives to full enumeration are MFinder and FANMOD. MFinder uses a biased algorithm like that of Kashtan et al, while FANMOD uses an improved method that achieves unbiased sampling [10].
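The sampling step described above can be sketched in Python for an undirected network as follows. Starting from a random edge, the sketch repeatedly picks a random edge that connects the current node set to a new node until n nodes are collected, then returns those nodes together with all network edges among them. Restricting the candidates to edges that add a new node is an assumption of this example, the function name is hypothetical, and the weighting step that corrects for the sampling bias is not included.

```python
import random


def sample_subgraph(edges, n, rng=random):
    """Sample one connected n-node induced subgraph by random edge growth.

    Returns (nodes, induced_edges), or None if growth gets stuck because
    the starting edge lies in a component with fewer than n nodes.
    """
    edges = [tuple(e) for e in edges]
    u, v = rng.choice(edges)
    chosen = {u, v}
    while len(chosen) < n:
        # Candidate edges have exactly one endpoint in the current node set.
        candidates = [e for e in edges if (e[0] in chosen) != (e[1] in chosen)]
        if not candidates:
            return None
        chosen.update(rng.choice(candidates))
    induced = [e for e in edges if e[0] in chosen and e[1] in chosen]
    return frozenset(chosen), induced


if __name__ == "__main__":
    network = [(1, 2), (2, 3), (3, 4), (1, 3)]
    print(sample_subgraph(network, 3, random.Random(1)))
```

Because different subgraphs can be reached by different numbers of edge sequences, repeated calls do not sample subgraphs uniformly, which is the bias discussed above.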

MFinder

MFinder is capable of analyzing directed and undirected networks [2]. Concept F1 is used when finding the frequencies of subgraphs, and concept F3 is applied in order to determine a lower bound for the uniqueness value [2]. MFinder fully enumerates subgraphs by starting with an edge and finding all subgraphs of the different sizes that contain this edge [6]. Once a subset of nodes connected to the initial edge is found, the subset is added to a hash table so that it cannot be revisited [10]. When no more subgraphs can be identified, the hash table is cleared and the process begins again with a different edge. This is repeated until all edges have been used. Because a specific subgraph is counted each time one of its edges is examined, there is redundancy, and the number of times the subgraph is counted is a multiple of its number of edges [6]. Therefore, the count for a subgraph must be divided by the number of edges in the motif. Since MFinder examines so many subgraphs and has redundancy, it requires large amounts of memory. This causes long runtimes and makes it hard to search for large motifs [10]. For this reason, MFinder also uses the biased sampling method that Kashtan et al developed.

FANMOD

FANMOD is a tool that can be used to analyze both directed and undirected networks [2]. It is able to identify motifs of sizes 3 – 8. Only induced subgraphs are found by FANMOD. It determines frequencies of subgraphs with concept F1 and uses z-score and p-value to decide whether or not a motif is statistically significant [2].

The full enumeration part of the FANMOD algorithm begins with one node and a list of possible vertices to which this node can be extended (i.e. the node’s neighbors). Once the subgraph is extended to a vertex, that vertex is removed from the list of possible extensions and its neighbors are added to the candidates for the next extension. Different combinations of possible extensions are chosen in order to form subgraphs of different sizes. Since the list of possible extensions is constantly updated, each subgraph is enumerated only once. Like Kavosh, FANMOD uses the NAUTY algorithm to test for graph isomorphism [10].

FANMOD’s alternative method uses probabilistic sampling to reduce runtimes for identifying motifs. It uses a randomized enumeration algorithm known as RAND-ESU. This sampling works by changing the full enumeration algorithm so that it randomly skips subgraphs. The FANMOD sampling algorithm chooses each size k subgraph with a certain probability [12]. This means that all subgraphs have the same probability of being sampled and all samples give different subgraphs. Because of these adjustments relative to the Kashtan et al algorithm, FANMOD is unbiased [10].
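The enumeration just described, usually called ESU in the literature, can be sketched in Python for an undirected network as below. Each subgraph is grown only with vertices that are larger than the starting vertex and that are not already adjacent to the current subgraph, which is what makes every connected k-node subgraph appear exactly once. Passing a list of per-depth probabilities turns the same routine into a RAND-ESU style sampler that randomly skips branches. The function name, the adjacency-dictionary input, and the `probs` parameter are assumptions of this example rather than FANMOD's actual interface.

```python
import random


def esu(adj, k, probs=None, rng=random):
    """Enumerate (or sample) connected k-node subgraphs.

    adj:   dict mapping each node to the set of its neighbors; nodes must
           be mutually comparable (e.g. integers).
    probs: optional list of k probabilities; probs[d] is the chance of
           following a branch that adds node number d + 1.
    """
    found = []

    def extend(sub, ext, root):
        if len(sub) == k:
            found.append(frozenset(sub))
            return
        ext = set(ext)
        while ext:
            w = ext.pop()
            # Nodes already in the subgraph or adjacent to it are excluded.
            closed = set(sub).union(*(adj[s] for s in sub))
            new_ext = ext | {u for u in adj[w] if u > root and u not in closed}
            if probs is None or rng.random() < probs[len(sub)]:
                extend(sub | {w}, new_ext, root)

    for v in adj:
        if probs is None or rng.random() < probs[0]:
            extend({v}, {u for u in adj[v] if u > v}, v)
    return found


if __name__ == "__main__":
    # Triangle 1-2-3 with a pendant node 4 attached to node 3.
    network = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
    print(sorted(sorted(s) for s in esu(network, 3)))
```

With `probs` set, every k-node subgraph is reached with the same probability (the product of the entries of `probs`), so counts from a sampled run can be scaled by the inverse of that product to estimate full counts.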

MOTIF CENTRIC ALGORITHMS:

Mapping algorithms:

Grochow

Grochow is a motif centric tool that can be applied to directed and undirected networks [15]. The algorithm progressively maps a specific target subgraph onto a global network. By doing this, Grochow checks for isomorphism as it maps the query graph onto the network [10]. This eliminates the extra time and memory a separate isomorphism check would take and avoids full enumeration.

The mapping algorithm goes through the query subgraph node by node in order to map the subgraph onto the network. For a specified query node, the tool finds all the “candidate nodes”: nodes in the network that have matching characteristics (e.g. the same degree and neighbors with the correct degrees). As the algorithm goes through each node in the query subgraph, possible matches in the network are found, while candidates that do not match exactly are eliminated as soon as any inconsistency is found. This mapping ensures that only exact isomorphic subgraphs in the network are detected [15].

Grochow uses a method known as symmetry breaking to make sure that each subgraph is mapped to only once, in order to reduce runtime and redundancy [15]. Graphs with non-trivial self-isomorphisms are said to have symmetries. Nodes that can be mapped to one another are defined as equivalent. Therefore, the nodes in a specific subgraph can be separated into equivalence classes. The Grochow algorithm ensures that mapping begins from only one representative of each equivalence class, so that multiple mappings are not carried out beginning with equivalent nodes. Also, restrictions are added to the labeling of each vertex so that symmetric mappings are avoided [10].
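A simplified Python sketch of this kind of node-by-node mapping is shown below for undirected graphs. It assigns query nodes to network nodes one at a time, keeps only candidates whose degree is high enough and whose edges are consistent with the nodes already mapped, and backtracks on any inconsistency. It is not the Grochow implementation: matches here need not be induced, and instead of symmetry-breaking conditions the sketch simply collapses symmetric mappings afterwards by storing each match as a set of network edges. All names are hypothetical.

```python
def find_matches(query_adj, net_adj):
    """Return the distinct images of the query graph in the network.

    Both arguments are dicts mapping a node to the set of its neighbors.
    Each match is a frozenset of undirected network edges.
    """
    q_nodes = list(query_adj)
    matches = set()

    def backtrack(mapping):
        if len(mapping) == len(q_nodes):
            image = frozenset(
                frozenset((mapping[a], mapping[b]))
                for a in query_adj
                for b in query_adj[a]
            )
            matches.add(image)  # symmetric mappings collapse to one entry
            return
        q = q_nodes[len(mapping)]
        used = set(mapping.values())
        for cand in net_adj:
            # Candidate filter: unused and with at least the query degree.
            if cand in used or len(net_adj[cand]) < len(query_adj[q]):
                continue
            # Every already-mapped query neighbor must be adjacent to cand.
            if all(mapping[nb] in net_adj[cand]
                   for nb in query_adj[q] if nb in mapping):
                mapping[q] = cand
                backtrack(mapping)
                del mapping[q]

    backtrack({})
    return matches


if __name__ == "__main__":
    network = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
    path = {"x": {"y"}, "y": {"x", "z"}, "z": {"y"}}
    print(len(find_matches(path, network)))
```

Removing duplicates after the fact still explores every symmetric mapping, which is exactly the redundant work that symmetry breaking is meant to avoid.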

MODA

MODA uses a pattern growth algorithm that takes in a query graph [8]. By maintaining information about previously found mappings of query graphs, it reduces computational time. It uses expansion trees, which are similar to the pattern trees used in MA Visto but applicable to frequency concept F1. The expansion tree starts with a root node at level 0. Level 1 holds all minimally connected size-k graphs, that is, the size-k trees. An edge is then added at each level until a complete graph is obtained. The number of nodes in the first level therefore equals the number of non-isomorphic trees of size k. Each node of the expansion tree can be represented by an adjacency matrix consisting of 0’s and 1’s. For undirected graphs, whose adjacency matrices are symmetric, only the entries below the main diagonal are stored. Expansion trees are stored for every size k. They are a static data structure that can be stored and retrieved and do not have to be recomputed each time [8].

The mapping algorithm takes the query graphs from the first level of the expansion tree, which are themselves trees, and maps them onto the network, recording their frequencies. The frequencies at the second level of the expansion tree can be found with respect to their parent nodes in the first level. MODA uses the symmetry-breaking conditions of the Grochow algorithm, but applies the Grochow algorithm only to the first level of the expansion tree. All the information found about the first level can be exploited to find the frequencies of all the subsequent levels, which are supergraphs of the first level. By exploiting information from previously found mappings, MODA reduces computational costs. Additionally, MODA has a sampling method that can be used to reduce runtimes at the expense of accuracy [8].

3 Experiments and Analysis

Data from experiments on runtimes of various algorithms are presented here along with MA Visto frequency concept data and experimental motif results from PPI networks.

I) Runtimes

Experiments:

Many experimental runs have been carried out to determine runtimes for network motif discovery tools. As shown in Table 1, Omidi et al compared runtimes of MODA, MFinder, Grochow, FPF (the algorithm used in MA Visto), and FANMOD [8]. Searches were carried out for subgraphs of size 3 – 9. Table 2 shows Chen et al’s comparison of the runtimes of NeMoFINDER and FPF (the algorithm used in MA Visto) [5]. Kashani et al compared the runtimes of Kavosh, FANMOD, MA Visto, and MFinder for subgraphs between size 3 and size 10, as shown in Table 3 [3].

Table 1. Data from Figure 7 of Omidi et al [8] showing runtimes (in seconds) for size 3-9 subgraphs in the E. coli transcription network. Tools compared include MODA, MFinder, Grochow, the FPF algorithm, and FANMOD.

Size             3         4         5         6         7         8         9
MFinder          2.0       7.9       7.9x10^1  3.2x10^3  -         -         -
FPF (MA Visto)   1.1       1.6       6.3       5.0x10^1  1.0x10^3  5.6x10^4  -
FANMOD           1.1       1.3       2.0       7.9       5.6x10^1  7.9x10^2  -
MODA             1.1       1.3       3.2       2.0x10^1  1.8x10^2  3.2x10^3  6.3x10^4
Grochow          1.3       2.5       1.6x10^1  2.2x10^2  1.8x10^4  -         -

Table 2. Data from Figure 11 of Chen et al [5] showing runtimes (in seconds) for size 3-13 subgraphs in the Uetz PPI network. Tools compared include NeMoFINDER, FPF algorithm, sampling algorithm, and full enumeration algorithm.

Size         3         4         5         6         7         8         9         10        11        12        13
FPF          2.2x10^1  7.9x10^1  3.2x10^2  3.5x10^3  6.3x10^4  4.0x10^5  3.2x10^6  -         -         -         -
NeMoFINDER   2.2x10^1  7.9x10^1  2.8x10^2  1.6x10^3  6.3x10^3  1.6x10^4  2.0x10^4  3.5x10^4  4.0x10^4  5.6x10^4  7.1x10^4

Table 3. Data compiled from Table 4 of Kashani et al [3]. Runtimes (in seconds) for identifying subgraphs of sizes between 3 and 10 in the yeast S. cerevisiae transcription network are shown. Tools compared include Kavosh, FANMOD, MA Visto, and MFinder.

Size             3          4          5          6          7          8          9          10
Kavosh           3.0x10^-1  1.8        1.5x10^1   1.4x10^2   1.4x10^3   1.3x10^4   1.2x10^5   1.1x10^6
FANMOD           8.1x10^-1  2.5        1.6x10^1   1.3x10^2   1.2x10^3   9.3x10^3   -          -
MA Visto (FPF)   1.4x10^4   -          -          -          -          -          -          -
MFinder          3.1x10^1   3.0x10^2   2.4x10^4   -          -          -          -          -

Kashtan et al compared the time taken by their probabilistic sampling method with the time taken by full enumeration when identifying motifs in networks of different sizes (Figure 1) [11]. The network sizes for which these comparisons were made were between 1000 and 8000 nodes.

Figure 1. Figure 4 from Kashtan et al [11] showing runtimes for different network sizes (on a log-log scale). Kashtan’s probabilistic algorithm and a full enumeration algorithm were compared.

Runtimes were found for MA Visto when finding subgraphs of size 3-4 and 4-5 (Table 4). Networks analyzed included E. coli transcription network and yeast transcription network [16].

Table 4. Examples of runtimes for MA Visto analyzing E. coli and yeast transcription networks [16]. Subgraphs of size 3-4 and 4-5 were searched for. For each run 100 randomized networks were generated.

Network                                               3-4 Nodes   4-5 Nodes
E. coli transcription network (418 nodes, 519 edges)  909.904     8366.359
Yeast transcription network (688 nodes, 1079 edges)   507.574     >25200

Runtimes were found for FANMOD when finding subgraphs of size 3 – 7 (Table 5). A protein structure network [16], PPI network [17], yeast transcription network [16], and E. coli transcription network [16] were used.

Table 5. Runtimes (in seconds) for FANMOD tool finding subgraphs of size 3-7 for networks including protein structure [16], PPI [17], yeast transcription [16], and E. coli transcription [16]. 1000 random networks were generated in all of the runs.

Network                                                           3         4         5         6         7
Protein structure (undirected, 96 nodes, 213 edges)               2.157     17.89     315.79    1306.8    1452.08
Protein-protein interaction (undirected, 4470 nodes, 3886 edges)  150.766   9705.86   -         -         -
Yeast transcription network (directed, 689 nodes, 1078 edges)     14.562    312.025   -         -         -
E. coli transcription network (directed, 418 nodes, 519 edges)    6.485     145.504   4084.25   -         -

Omidi et al [8] and Kashani et al [3] both did experimental runs on FANMOD, MA Visto (or the FPF algorithm), and MFinder. Kashani et al measured the runtimes of the tools to fully enumerate the input network and to generate and enumerate 100 random networks. Omidi et al measured the runtimes only for full enumeration of the input network.

Table 6. Data from Figure 7 of Omidi et al [8] and Table 4 of Kashani et al [3]. Runtimes (in seconds) for motif searches done by FANMOD, MA Visto (FPF algorithm), and MFinder. Rows with 100 random networks give times for runs that fully enumerated the input network and 100 random networks with a 3.2 GHz AMD Opteron processor and 8 GB RAM [3]. Rows with 0 random networks give times for runs that only fully enumerated the input network with an IBM R50e laptop with a 1.8 GHz Intel Pentium processor and 1 GB RAM [8].

Tool          Random networks  3          4         5         6         7         8
FANMOD        100              8.1x10^-1  2.5       1.6x10^1  1.3x10^2  1.2x10^3  9.3x10^3
FANMOD        0                1.1        1.3       2.0       7.9       5.6x10^1  7.9x10^2
MA Visto/FPF  100              1.4x10^4   -         -         -         -         -
MA Visto/FPF  0                1.1        1.6       6.3       5.0x10^1  1.0x10^3  5.6x10^4
MFinder       100              3.1x10^1   3.0x10^2  2.4x10^4  -         -         -
MFinder       0                2.0        7.9       7.9x10^1  3.2x10^3  -         -

Analysis and Limitations:

Experimental runs carried out by Omidi et al [8], Chen et al [5], and Kashani et al [3] allow for comparisons of a variety of network motif discovery tools. A consistent trend in these experiments is the inability of MA Visto and MFinder to handle subgraphs as large as those handled by the other tools. Often they were only able to find subgraphs of size 5 or less, and they usually had longer runtimes than most other tools. FANMOD was able to identify motifs up to size 8, whereas NeMoFINDER, Kavosh, MODA, and Grochow were able to deal with subgraphs larger than 8. Although some tools can handle subgraphs larger than 8, the runtimes for these experiments are very large. Overall, it can be concluded that the current motif discovery tools are very limited in the size of subgraphs that they can handle in reasonable amounts of time. NeMoFINDER shows promise in being able to search for larger motifs.

Another limitation for motif discovery tools is the size of the network that can be analyzed. As seen in Kashtan et al's [11] experiment, the runtimes for motif searches increase exponentially as network sizes increase. All of the tools discussed have difficulty searching larger networks (with thousands of nodes) in reasonable amounts of time. Therefore, networks such as PPI networks and most social networks, which have thousands of nodes, are difficult to fully enumerate in a reasonable time. Kashtan's probabilistic sampling method has been shown to produce a fairly consistent runtime as network size increases, and it takes significantly less time than exhaustive enumeration on larger networks. However, as discussed above, Kashtan's sampling algorithm is biased, which results in a loss of accuracy.

The number of randomly generated networks is also a limitation to be considered. Random networks are used by the tools for comparison with the input network, and many of them are needed for an accurate comparison. However, runtimes increase with the number of random networks that need to be generated and searched for motifs. Omidi et al's [8] experiments only involved full enumeration of the input network, whereas Kashani et al's [3] experiments fully enumerated the input network and generated and enumerated 100 random networks. Although Kashani et al used a computer with greater computational power, the runtimes for their experiments were significantly larger than those for Omidi et al's experiments. Therefore, even with a more powerful computer, the generation and enumeration of many random networks adds significant time to searches.
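The cost of the random networks discussed above can be illustrated with a minimal Python sketch. It is not the code of any surveyed tool: the function names (randomize_directed, count_feed_forward_loops, census) are hypothetical, the randomization is a simple degree-preserving edge-switching scheme chosen for brevity, and only one pattern (the feed-forward loop) is counted. The point is that the whole census is repeated once per random network, so total work grows roughly linearly with the number of random networks.

```python
import random


def randomize_directed(edges, swaps_per_edge=10, seed=0):
    """Degree-preserving randomization of a directed edge list.

    Two edges (a, b) and (c, d) are rewired to (a, d) and (c, b).
    Swaps that would create a self-loop or a duplicate edge are skipped,
    so every node keeps its in-degree and out-degree.
    """
    rng = random.Random(seed)
    edge_list = list(dict.fromkeys(edges))  # drop duplicate edges
    if len(edge_list) < 2:
        return edge_list
    edge_set = set(edge_list)
    for _ in range(swaps_per_edge * len(edge_list)):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if a == d or c == b:
            continue  # would create a self-loop
        if (a, d) in edge_set or (c, b) in edge_set:
            continue  # would create a duplicate edge
        edge_set -= {(a, b), (c, d)}
        edge_set |= {(a, d), (c, b)}
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_list


def count_feed_forward_loops(edges):
    """Count triples (x, y, z) with edges x->y, y->z and x->z (not necessarily induced)."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
    count = 0
    for x, ys in succ.items():
        for y in ys:
            for z in succ.get(y, ()):
                if z != x and z in ys:
                    count += 1
    return count


def census(edges, n_random=100, seed=0):
    """Count the pattern in the input network and in n_random randomized copies."""
    real = count_feed_forward_loops(edges)
    random_counts = [
        count_feed_forward_loops(randomize_directed(edges, seed=seed + i))
        for i in range(n_random)
    ]
    return real, random_counts
```

Calling census(edges, n_random=0) corresponds to the "0 random networks" setting and census(edges, n_random=100) to the "100 random networks" setting; the second performs 101 pattern counts instead of one, in addition to the rewiring itself.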

II) Frequency Concepts

Experiments:

MA Visto was used to compare the frequency results for concepts F1, F2, and F3 for the same subgraph within the same network (Table 7, Table 8).

Table 7. Values for concept F1, F2, and F3 for each size 3 motif found by MA Visto in the E. coli gene transcription network [16]. Each row corresponds to one motif.

F1    F2   F3
4819  207  47
269   131  23
42    18   7
202   39   18

Table 8. Values for concept F1, F2, and F3 for each size 3 motif found by MA Visto in the yeast gene transcription network [16]. Each row corresponds to one motif.

F1   F2  F3
107  31  10
111  33  13
11   6   5
116  32  13
4    1   1
2    1   1
2    1   1
11   2   1
1    1   1
1    1   1
1    1   1
1    1   1

Analysis and Limitations:

MA Visto's ability to calculate the frequency of motifs under all three frequency concepts, F1, F2, and F3, allows the discrepancies between them to be compared. Experimental runs on two different data sets (E. coli transcription and yeast transcription) demonstrate that the frequencies calculated under different concepts can differ. When a subgraph occurs only a few times in the network the discrepancy is small, but for more frequent subgraphs the results of the frequency concepts vary substantially.
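The gap between the three concepts can be reproduced on a toy network. The Python sketch below is an illustration only, under our reading of the concepts in [14]: F1 counts every match, F2 counts matches that share no edges, and F3 counts matches that share neither nodes nor edges. The function names are hypothetical, only feed-forward loops are matched, and the disjoint sets are chosen greedily, which gives a lower bound rather than the true maximum.

```python
def feed_forward_matches(edges):
    """Return each feed-forward loop (x->y, y->z, x->z) as a tuple of its three edges."""
    succ = {}
    for u, v in set(edges):
        if u != v:  # ignore self-loops
            succ.setdefault(u, set()).add(v)
    matches = []
    for x in sorted(succ):
        for y in sorted(succ[x]):
            for z in sorted(succ.get(y, ())):
                if z != x and z in succ[x]:
                    matches.append(((x, y), (y, z), (x, z)))
    return matches


def greedy_disjoint(matches, key):
    """Greedily keep matches whose key sets do not overlap (lower bound on the maximum)."""
    used, kept = set(), 0
    for match in matches:
        items = key(match)
        if used.isdisjoint(items):
            used |= items
            kept += 1
    return kept


def frequencies(edges):
    """Return (F1, F2, F3) for the feed-forward loop in a directed edge list."""
    matches = feed_forward_matches(edges)
    f1 = len(matches)                                   # all matches
    f2 = greedy_disjoint(matches, lambda m: set(m))     # edge-disjoint matches
    f3 = greedy_disjoint(                               # node-disjoint matches
        matches, lambda m: {node for edge in m for node in edge}
    )
    return f1, f2, f3


# Toy network (made up for illustration): two loops share edge 1->2,
# a third shares only node 2, and a fourth is completely separate.
toy = [
    (1, 2), (2, 3), (1, 3),
    (2, 4), (1, 4),
    (2, 9), (9, 10), (2, 10),
    (6, 7), (7, 8), (6, 8),
]
print(frequencies(toy))  # (4, 3, 2)
```

Even in this small example F1 > F2 > F3, and the gap widens when many matches overlap around a hub node, which is consistent with the larger discrepancies seen for the most frequent motifs in the tables above.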

III) Protein-Protein Interaction Network Motifs

Experiments:

An E. coli PPI network [17] was analyzed by FANMOD. Motifs of size 3 and 4 were found (Figure 2).

Figure 2. Motifs of size 3 and 4 from the E. coli PPI network [17] as identified by FANMOD. Z-scores for each motif are shown.

Analysis and Limitations:

Although NeMoFINDER had success identifying larger motifs, other tools struggled to find motifs larger than size 5. All tools had issues with identifying motifs in a reasonable amount of time. Due to their large size, PPI networks are harder to analyze than smaller networks such as E. coli gene transcription networks. Preliminary findings from runs done by FANMOD show the size 3 and size 4 motifs found in an E. coli PPI network. The motifs and their Z-scores are listed in Figure 2. For both the size 3 and size 4 motifs, the most over-represented motif (the motif with the largest Z-score) was the complete graph, a graph with an edge between every pair of nodes. Previous studies have also found complete graphs with high frequencies in PPI networks [18]. Further studies involving PPI networks may provide additional support for these findings. Although some methods have been used to predict PPI network motifs [19], there is still much about the biological significance of PPI motifs to be explored.
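As a reminder of what the Z-scores measure, the sketch below counts triangles (the complete graph of size 3) in an undirected network and compares the count with counts from random networks. It assumes the usual definition, Z = (count in input network - mean count in random networks) / standard deviation of the counts in random networks; the choice of the population standard deviation, the function names, and the numbers in the usage example are assumptions for illustration and are not taken from FANMOD or from the E. coli PPI data.

```python
from statistics import mean, pstdev


def count_triangles(edges):
    """Count triangles in an undirected edge list (direction and duplicates ignored)."""
    adj = {}
    unique_edges = {frozenset((u, v)) for u, v in edges if u != v}
    for edge in unique_edges:
        u, v = tuple(edge)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Every triangle is found once from each of its three edges.
    total = 0
    for edge in unique_edges:
        u, v = tuple(edge)
        total += len(adj[u] & adj[v])
    return total // 3


def z_score(real_count, random_counts):
    """Z-score of a subgraph count relative to counts from random networks."""
    sigma = pstdev(random_counts)
    if sigma == 0:
        raise ValueError("random counts have zero variance; Z-score undefined")
    return (real_count - mean(random_counts)) / sigma


# Made-up counts for illustration only.
print(z_score(10, [2, 4, 4, 4, 5, 5, 7, 9]))  # 2.5
```

A large positive Z-score means the subgraph is far more common in the input network than in the random ensemble, which is a statement about over-representation rather than about raw frequency.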

4 Conclusion

Increasing interest in network motifs and emphasis on motif significance has led to an ongoing process of network motif discovery tool development and continual revision of previous work. As wet lab techniques have become more advanced, increasing amounts of information about different biological systems and organisms have been collected. This has allowed for databases to be developed that provide full information sets concerning networks. The study of networks provides insights into how organisms and systems work as a whole. Network motifs are the building blocks of networks [1] and are often biologically significant which makes the identification of the motifs extremely important in the search for the understanding of networks.

Researchers have struggled to overcome the difficulties in developing network motif discovery tools. The graph isomorphism problem makes exhaustively finding all the motifs in a network computationally impractical [2]. Also, dealing with large networks, discovering large motifs, and generating and searching numerous random networks all make network motif discovery extremely computationally costly. Although these factors are costly, they are also integral parts of the network motif discovery process and of understanding networks. Results of various experimental runs carried out using different network motif discovery tools have helped to determine which tools are more efficient and useful. Furthermore, these comparisons help to highlight which algorithmic methods improve tool performance.

Overall, MA Visto and MFinder were computationally costly and had large runtimes in comparison to other tools searching the same network for the same subgraph sizes. MFinder's algorithm requires full enumeration and exhaustively searches using a technique that counts the same subgraph multiple times [6]. This redundancy contributes to increased computational cost and runtimes. MA Visto calculates the frequencies for all three frequency concepts, which requires more time than searching with only one frequency concept [13]. The other tools performed better than both MFinder and MA Visto; all had better runtimes and were able to search for larger subgraphs.

FANMOD is a well-established and well-known tool which performs relatively well, partially due to its use of the NAUTY algorithm to test for graph isomorphism. This, along with the fact that FANMOD's algorithm ensures that each subgraph is only counted once, makes full enumeration with FANMOD relatively reasonable [10]. FANMOD also uses an unbiased sampling algorithm that helps to reduce runtimes in comparison to full enumeration [12]. FANMOD has been shown to have smaller runtimes than Grochow and MODA, but it can only search for subgraphs up to size 8.

Kavosh, like FANMOD, uses the NAUTY algorithm and has also shown relatively good runtimes in experimental runs. Restrictions placed on the tree structures built while searching for motifs ensure that each subgraph is enumerated only once [3]. This, along with the use of the NAUTY algorithm, gives the Kavosh tool relatively good search efficiency.

Grochow achieves some efficiency through its symmetry-breaking techniques, which reduce redundant counts of subgraphs. The algorithm also discards a subgraph being mapped as soon as it is found not to match any pattern in the input network. This prevents irrelevant subgraphs from being generated, which saves time and computational power. It also ensures that the subgraphs identified are isomorphic to the subgraph in question, which means that an isomorphism test is not required [15].

MODA uses some of the techniques from Grochow, such as symmetry breaking, and uses the Grochow algorithm itself to find the frequencies of some of the patterns in question [8]. MODA's algorithm also uses expansion trees to build the patterns that make up subgraphs. These expansion trees and the mapping information for the patterns are stored so that redundancy does not occur and computational time is saved.

NeMoFinder has been found to be able to identify meso-scale motifs (specifically, up to size 12), although it is limited to analyzing PPI networks and thus only undirected networks [5]. By partitioning networks into sets of graphs with repeated trees, the algorithm is more efficient than some other tools. NeMoFinder differs from many other network motif discovery tools in its use of graph cousins to generate possible subgraphs and to determine subgraph frequencies. Although graph cousins allow for generation of candidate graphs, their use also causes redundancy, which adds more time to the runs [7].

The strengths and weaknesses of each tool are important to note so that algorithmic shortcomings can be avoided in future tools while successful aspects can be capitalized on. Sampling techniques have shown promise in reducing runtimes and should be considered when developing algorithms, along with the accompanying sacrifice in accuracy. The frequency concept that each tool uses is also important to note because, as seen from the experimental runs, the frequencies vary greatly between the different concepts. Network motif discovery has proved to be a very complex task. Although many tools, algorithms, and methods have been created for finding network motifs, further improvements and new developments are necessary in order to increase motif discovery capabilities.
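The idea of enumerating every subgraph exactly once, noted above for FANMOD and Kavosh, can be sketched in a few lines of Python. The sketch follows the general shape of the ESU enumeration scheme described by Wernicke [12]: a subgraph is only ever extended with nodes that have a larger label than its starting node and that are not already adjacent to the current subgraph, so no node set is produced twice. The function name and the adjacency-dictionary input format are assumptions for illustration, and node labels must be mutually comparable; this is not FANMOD's actual source code.

```python
def enumerate_subgraphs(adj, k):
    """Yield the node set of every connected k-node subgraph exactly once.

    adj maps each node to the set of its neighbours in the (undirected)
    network; for a directed network, pass the underlying undirected graph.
    """
    def extend(sub, ext, root):
        if len(sub) == k:
            yield frozenset(sub)
            return
        ext = set(ext)  # local copy so the caller's set is not modified
        while ext:
            w = ext.pop()
            # Nodes already in the subgraph or adjacent to it are excluded,
            # so only the "exclusive" neighbours of w extend the candidate set.
            closed = set(sub)
            for s in sub:
                closed |= adj[s]
            new_ext = ext | {u for u in adj[w] if u > root and u not in closed}
            yield from extend(sub | {w}, new_ext, root)

    for v in adj:
        yield from extend({v}, {u for u in adj[v] if u > v}, v)


# A triangle (1, 2, 3) with a pendant node 4 attached to node 3.
example = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
for nodes in enumerate_subgraphs(example, 3):
    print(sorted(nodes))
```

Each node set found this way would still need to be assigned to an isomorphism class (the step for which FANMOD and Kavosh use NAUTY) before frequencies per subgraph type can be reported.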

Future directions:

Future directions include: (1) the ability to intelligently search networks for possible biologically relevant motifs that have been identified as significant subgraphs from experimental runs and literature review, and (2) the use of modern computing infrastructure to search concurrently for network motifs that are larger than those that presently available tools can find.

Acknowledgements:

We would like to thank the National Science Foundation for providing the funding for the Bio-Grid REU program and making this research possible. We would also like to thank the University of Connecticut for hosting this program and especially Dr. Chun-Hsi Huang for advising and mentoring.

5 References

1. Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D., Alon, U.: Network Motifs: Simple Building Blocks of Complex Networks. Science. 298, 824-827 (2002)
2. Schwobbermeyer, H.: Network Motifs. In Junker, B., Schreiber, F. (eds.) Analysis of Biological Networks. pp. 85-111. NJ: John Wiley & Sons, Inc (2008)
3. Kashani, Z., Ahrabian, H., Elahi, E., Nowzari-Dalini, A., Ansari, E., Asadi, S., Mohammadi, S., Schreiber, F., Masoudi-Nejad, A.: Kavosh: a new algorithm for finding network motifs. BMC Bioinf. 10:318 (2009)
4. Fortin, S.: The Graph Isomorphism Problem. University of Alberta: Dept of Computing Science, Alberta (1996)
5. Chen, J., Hsu, M., Lee, L., Ng, SK.: NeMoFinder: dissecting genome-wide protein-protein interactions with meso-scale network motifs. KDD. 106-115 (2006)
6. Kashtan, N., Itzkovitz, S., Milo, R., Alon, U.: Network Motif Detection Tool: mfinder Tool Guide. Weizmann Institute of Science: Depts of Mol Cell Bio and Comp Sci & Applied Math, Rehovot, Israel (2002-2005)
7. Ciriello, G., Guerra, C.: A review on models and algorithms for motif discovery in protein-protein interaction networks. Briefings in Functional Genomics and Proteomics Advance Access. (2008)
8. Omidi, S., Schreiber, F., Masoudi-Nejad, A.: MODA: An efficient algorithm for network motif discovery in biological networks. Genes Genet. Syst. 84, 385-395 (2009)
9. Milo, R., Kashtan, N., Itzkovitz, S., Newman, M., Alon, U.: Uniform generation of random graphs with arbitrary degree sequences. (2004)
10. Ribeiro, P., Silva, F., Kaiser, M.: Strategies for network motifs discovery. IEEE International Conference. 81-86 (2009)
11. Kashtan, N., Itzkovitz, S., Milo, R., Alon, U.: Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics. 20, 1746-1758 (2004)
12. Wernicke, S.: A Faster Algorithm for Detecting Network Motifs. In Casadio, R., Myers, G. (eds.) Algorithms in Bioinformatics: 5th international workshop. pp. 165-176. Springer (2005)
13. Schreiber, F., Schwobbermeyer, H.: MAVisto: a tool for the exploration of network motifs. Bioinformatics Applications Note. 21, 3572-3574 (2005)
14. Schreiber, F., Schwobbermeyer, H.: Frequency Concepts and Pattern Detection for the Analysis of Motifs in Networks. Trans. On Comput. Syst. Biol. III. 89-104 (2005)
15. Grochow, J., Kellis, M.: Network motif discovery using subgraph enumeration and symmetry-breaking. Recomb. 92-106 (2007)
16. Collections of complex networks, http://www.weizmann.ac.il/mcb/UriAlon/groupNetworksData.html
17. Bacteriome, http://www.compsysbio.org/bacteriome/dataset/core_interactions.txt
18. Przulj, N., Wigle, D., Jurisica, I.: Functional topology in a network of protein interactions. Bioinformatics. 20, 340-348 (2004)
19. Albert, I., Albert, R.: Conserved network motifs allow protein-protein interaction prediction. Bioinformatics. 20, 3346-3352 (2004)