Network Motif Survey

On Network Tools for Network Motif Finding: A Survey Study

Elisabeth A. Wong1,2, Brittany Baur1,3
1 2010 NSF Bio-Grid REU Research Fellows at Univ of Connecticut
2 Bowdoin College
3 Manhattanville College

Abstract. Network motifs have been called the building blocks of networks [1]. Graph theory is used to represent and search networks computationally. Much effort has gone into developing motif discovery tools that find network motifs: patterns, or subgraphs, that occur more frequently in the input network than in randomized networks, where patterns occur by chance [2]. Complications of network motif discovery include the graph isomorphism problem, for which no polynomial-time algorithm is known. A myriad of tools and algorithms have been developed, both for full enumeration of subgraphs and for avoiding full enumeration in order to reduce runtimes and required computational power. Experimental data from various tools are provided in this paper, including (1) runtimes for different subgraph sizes, network sizes, and numbers of random networks generated, (2) differences in frequencies under different search restrictions, and (3) protein-protein interaction (PPI) network results. The limitations that still exist, especially concerning the size of motifs and networks that can be searched, are also included. This paper presents a survey study of current network motif discovery tools: algorithms, experimental data, limitations, and pros and cons of the tools are examined and discussed.

Keywords: network, motif, algorithm, isomorphism

1 Introduction

Networks are integral parts of many real systems, and analyzing them has therefore become a priority in many research fields. Emphasis has been placed on studying small aspects of networks in order to gain a better understanding of the entire network. Recently, graph theory has been used to allow for computational analysis of networks.
Through graph theory, it has been found that numerous networks contain network motifs, small subgraphs that appear more frequently than expected in randomized networks [1]. Because these motifs are statistically significant, it has been hypothesized that they are also significant to their respective networks and systems; subgraph frequencies higher than those that result by chance suggest that the motifs are present due to factors such as being conserved evolutionarily and having an important function or purpose [1]. Different networks have different motifs that are overrepresented and thus presumably more important to the system or organism they belong to. For example, transcriptional gene regulation networks and neuronal connectivity networks have been found to have motifs known as ‘feed-forward loops’ [2] and ‘bifans’ [2]. This suggests that these two networks are similar in some design aspects. The feed-forward loop is thought to be used in information processing, where its design helps with controlling connections and signals [1]. In contrast, food web networks, which do not deal with information processing, have motifs distinct from those found in networks such as neuronal connectivity and gene regulation networks. This demonstrates how motifs are biologically significant in their ability to help analyze, explain, and classify networks. Due to the significance of network motifs, much effort has been put into developing tools that can detect them.

In order for graph theory to be applied to the study of networks and motifs, networks need to be represented by graphs. Each entity in a network (e.g. a protein, a gene, a person) is represented by a node (or vertex), while the connections between the entities (e.g. an interaction, a regulatory signal, a correspondence) are represented by edges. In some networks the nodes and edges have different characteristics (e.g. different types of genes or different signals passed from gene to gene).
In these cases the nodes or edges are ‘colored’, with each color representing a different kind of entity or connection. Another aspect of a network that must be included in the graph representation is whether a connection or edge is ‘directed’ or ‘undirected’. Gene regulatory networks have directed connections, as it is important that the signal travels from one specific gene to another. On the other hand, a network such as a social network that counts handshaking as a connection is undirected, because the shaking of hands is not direction specific. Directed and undirected edges can be distinguished on a graph by different edge or node colors, or by arrows. A graph can be split up into small graphs known as subgraphs. As stated before, statistically overrepresented subgraphs are defined as network motifs. The numbers of edges that enter and exit a node are summed to determine the node's ‘degree’, and the number of nodes that make up a motif determines the subgraph's size.

It has proved very difficult to develop network motif discovery tools that are both efficient and able to find motifs of all sizes (not just small motifs). One of the larger obstacles to finding an efficient and thorough algorithm is the graph isomorphism problem. This problem entails determining whether a bijection exists between the nodes of two graphs such that two nodes are adjacent in one graph exactly when their corresponding nodes are adjacent in the other [3]. Two isomorphic graphs have the same number of nodes and edges, and corresponding nodes have the same degree. The graph isomorphism problem is computationally hard: it is in NP, but it is not known to be either NP-complete or solvable in polynomial time (the related subgraph isomorphism problem is NP-complete). The NAUTY algorithm is a well-known and powerful algorithm that has been developed to test for graph isomorphism. It is used by multiple motif discovery tools. The NAUTY algorithm uses canonical labeling to tell which graphs are isomorphic to each other [3].
If two graphs are isomorphic, their canonical labels are the same. A canonical label for a graph is formed by taking the adjacency matrix of the graph and concatenating it row by row to form a binary number. Through leaf partitioning (partitioning the vertices into singleton sets), the automorphisms of the graph can be found by checking the adjacency matrices produced by different orderings of the vertices and seeing whether the matrix is the same. NAUTY then examines the automorphisms and computes a canonical label, which is the largest or smallest possible concatenated adjacency matrix [4].

Additional difficulties in developing a network motif discovery algorithm include the fact that the number of candidate subgraphs increases exponentially with network size, and that the downward closure property is absent in many networks [5]. These difficulties mean that full enumeration of subgraphs can be extremely time consuming and may require large amounts of computational power.

In order to study a network in context, randomized networks are used for comparison. These randomized networks are generated in such a way that their structure is random and thus not a result of any constraints or significant design elements [6]. This allows aspects of the network in question (such as motifs) to be compared to the randomized networks, to see whether they are a result of the intrinsic properties of the network or are indicative of real-world functional constraints and/or design principles due to selection [2].

Various parameters are used to describe network motif occurrences and to determine whether they are statistically significant. The frequency of a motif is the number of times the motif appears in a network [7]. Different tools use different restrictions for counting frequencies, based on whether or not overlapping of nodes and edges is allowed.
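The canonical-labeling and frequency ideas described above can be illustrated with a small brute-force sketch. This is not the NAUTY algorithm: it simply tries every vertex ordering and keeps the smallest row-by-row concatenation of the adjacency matrix, which is only practical for very small subgraphs. The function names (`canonical_label`, `is_connected`, `subgraph_frequencies`) are hypothetical, and the counting rule is an assumption: every distinct set of k nodes that induces a weakly connected subgraph is counted once, so occurrences may share nodes and edges.

```python
from itertools import combinations, permutations
from collections import Counter

def canonical_label(nodes, edges):
    """Smallest row-by-row concatenation of the adjacency matrix
    over all orderings of the nodes (brute force, k! orderings)."""
    best = None
    for order in permutations(nodes):
        bits = "".join(
            "1" if (u, v) in edges else "0"
            for u in order for v in order
        )
        if best is None or bits < best:
            best = bits
    return best

def is_connected(nodes, edges):
    """Weak connectivity: edge direction is ignored."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in nodes:
            if v not in seen and ((u, v) in edges or (v, u) in edges):
                seen.add(v)
                stack.append(v)
    return seen == nodes

def subgraph_frequencies(edge_list, k):
    """Count induced, weakly connected k-node subgraphs of a directed
    graph, grouped by canonical label (overlaps are allowed)."""
    edges = set(edge_list)
    all_nodes = sorted({n for e in edges for n in e})
    counts = Counter()
    for subset in combinations(all_nodes, k):
        induced = {(u, v) for (u, v) in edges if u in subset and v in subset}
        if is_connected(subset, induced):
            counts[canonical_label(subset, induced)] += 1
    return counts

# Toy directed network: a feed-forward loop A->B->C, A->C, plus C->D.
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
freq = subgraph_frequencies(toy_edges, 3)
```

In this toy network, `freq` holds two labels: one for the feed-forward loop and one for the three-node chain. Isomorphic subgraphs receive the same label regardless of how their nodes are named, which is what lets occurrences be grouped and counted.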
Some motif discovery tools ask the user to set a threshold for how many occurrences are required for a motif to be considered ‘frequent’. To determine whether a motif is significant in a specific network and not just present due to intrinsic properties of the network, a uniqueness factor is sometimes applied [7]. If a motif has a higher frequency in the network in question than in a user-specified number of random networks, the motif is considered ‘unique’. In addition, statistics such as the z-score and p-value are often used to determine whether the frequency of a motif is statistically significant. The z-score is the difference between the frequency of the motif in the target network and its mean frequency in the randomized networks, divided by the standard deviation of its frequency in the randomized networks [3]. A higher z-score indicates a more overrepresented motif. The z-score threshold above which motifs are considered overrepresented is often 2. The p-value is the probability that a motif appears in a randomized network at least as many times as it appears in the network in question [2]. A lower p-value indicates a more significant motif. The threshold below which a p-value is considered significant is commonly 0.01. All of these parameters are important for setting standards that distinguish subgraphs that are overrepresented from those that are not.

Multiple algorithms and tools have been developed to identify network motifs, each with different advantages and disadvantages. Network motif discovery is a crucial problem to solve in order to gain further insight into the important characteristics, functions, and inner workings of systems with networks.
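As a minimal sketch of the z-score and p-value described above, the following estimates both from a motif's frequency in the target network and its frequencies in a set of randomized networks. The function names and the example counts are hypothetical. Two choices here are assumptions rather than something the surveyed tools are known to share: the population standard deviation is used, and the p-value is estimated empirically as the fraction of randomized networks in which the motif appears at least as often as in the target network.

```python
from statistics import mean, pstdev

def z_score(target_freq, random_freqs):
    """(target frequency - mean random frequency) / std of random frequencies.
    Returns None when the standard deviation is zero (z-score undefined)."""
    mu = mean(random_freqs)
    sigma = pstdev(random_freqs)  # assumption: population standard deviation
    if sigma == 0:
        return None
    return (target_freq - mu) / sigma

def empirical_p_value(target_freq, random_freqs):
    """Fraction of randomized networks whose motif frequency is
    greater than or equal to the frequency in the target network."""
    hits = sum(1 for f in random_freqs if f >= target_freq)
    return hits / len(random_freqs)

def is_overrepresented(target_freq, random_freqs, z_min=2.0, p_max=0.01):
    """Apply the commonly used thresholds: z-score above 2, p-value below 0.01."""
    z = z_score(target_freq, random_freqs)
    p = empirical_p_value(target_freq, random_freqs)
    return z is not None and z > z_min and p < p_max

# Hypothetical counts of one motif in 8 randomized networks.
random_counts = [2, 4, 4, 4, 5, 5, 7, 9]
z = z_score(11, random_counts)            # mean 5, std 2, so z = 3.0
p = empirical_p_value(11, random_counts)  # no random network reaches 11, so p = 0.0
```

With only a handful of randomized networks the empirical p-value is coarse (here it can only take multiples of 1/8), which is one reason the number of random networks generated matters in practice.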
Therefore it has been the goal of many researchers to develop ways to identify network motifs efficiently. Our goal in this paper is to summarize, collect experimental data about, and analyze the various network motif discovery tools and algorithms that have been created.

2 Methodology

Major aspects of motif discovery tools that must be considered when examining them are the methods of determining frequencies of motifs, the ways of developing randomized networks, the algorithms used for full enumeration, the strategies for identifying motifs without full enumeration, and the data sets the tools can be applied to.
