Efficient and Scalable Algorithms for Network Motifs Discovery
Total Page:16
File Type:pdf, Size:1020Kb
Pedro Manuel Pinto Ribeiro Efficient and Scalable Algorithms for Network Motifs Discovery Departamento de Ciˆencia de Computadores Faculdade de Ciˆencias da Universidade do Porto 2011 Pedro Manuel Pinto Ribeiro Efficient and Scalable Algorithms for Network Motifs Discovery Tese submetida `aFaculdade de Ciˆencias da Universidade do Porto para obten¸c˜ao do grau de Doutor em Ciˆencia de Computadores Advisors: Prof. Fernando Silva and Prof. Lu´ıs Lopes Departamento de Ciˆencia de Computadores Faculdade de Ciˆencias da Universidade do Porto Junho de 2011 Thesis Committee: Prof. Ant´onio Porto, University of Porto (Chair) Prof. Arlindo Oliveira, IST - Technical University of Lisbon (Examiner) Prof. Marcus Kaiser, Newcastle University (Examiner) Prof. Jos´eCardoso e Cunha, New University of Lisbon Prof. V´ıtor Santos Costa, University of Porto Prof. Fernando Silva, University of Porto (Advisor) Prof. Lu´ıs Lopes, University of Porto (Co-Advisor) 2 To Samuel and Liliana 3 4 Acknowledgments First of all, I would like to thank my two advisors, Fernando Silva and Lu´ıs Lopes. They have always been there when I needed them, providing guidance, support and a lot of patience over the last years. Their collaboration was essential and we always had fruitful discussions on the emerging ideas on this subject. I would like to thank the financial support of FCT that was allowed me to dedicate the initial 4 years pursuing this research (PhD grant SFRH/BD/19753/2004). I would also like to thank CRACS/INESC-Porto L.A. for support on the fifth year, allowing me to further improve my work and finish my PhD. I would like to thank Paul Watson and particularly Marcus Kaiser for receiving me so well in Newcastle University for 3 months. This stay provided me with some of the initial motivation for entering the motif discovery and complex network analysis world. I would also like to thank all persons that helped in providing me the computational infrastructures that I used during my work. In particular Hugo Ribeiro, my brother and local computer science department network administrator, and Enrico Pontelli, for the use of Inter Cluster in the New Mexico State University. I also thank all other researchers from the CRACS research center for providing a very good environment for research and all my office colleagues during these years, which were too many to nominate them all. Finally, I would like to thank all my friends and family, that have always supported me. In particular, my parents, my parents-in-law, my wife, Liliana, and my son, Samuel, which was born when I was beginning my work. 5 6 Abstract Networks are a powerful representation for a multitude of natural and artificial sys- tems. They are ubiquitous in real-world systems, presenting substantial non-trivial topological features. These are called complex networks and have received increasing attention in recent years. In order to understand their design principles, the concept of network motifs emerged. These are recurrent over-represented patterns of inter- connections, conjectured to have some significance, that can be seen as basic building blocks of networks. Algorithmically, discovering network motifs is a hard problem related to graph isomorphism. The needed execution time grows exponentially as the size of networks or motifs increases, thus limiting their applicability. Since motifs are a fundamental concept, increasing the efficiency in its detection can lead to new insights in several areas of knowledge. To develop efficient and scalable algorithms for motifs discovery is precisely the main aim of this thesis. We provide a thorough survey of existing methods, complete with an associated chronology, taxonomy, algorithmic description and empirical evaluation and compari- son. We propose a novel data-structure, g-tries, designed to represent a collection of graphs. Akin to a prefix tree, it takes advantage of common substructures to both reduce the memory needed to store the graphs, and to produce a new more efficient sequential algorithm to compute their frequency as subgraphs of another larger graph. We also introduce a sampling methodology for g-tries that successfully trades accuracy for faster execution times. We identify opportunities for parallelism in motif discovery, creating an associated taxonomy. We expose the whole motif computation as a tree based search and devise a general methodology for parallel execution with dynamic load balancing, including a novel strategy capable of efficiently stopping and dividing computation on the fly. In particular we provide parallel algorithms for ESU and g-tries. Finally, we extensively evaluate our algorithms on a set of diversified complex net- works. We show that we are able to outperform all existing sequential algorithms, and are able to scale our parallel algorithms up to 128 processors almost linearly. By combining the power of g-tries and parallelism, we speedup motif discovery by several orders of magnitude, thus effectively pushing the limits in its applicability. 7 8 Resumo As redes s˜ao uma poderosa representa¸c˜ao para uma grande variedade de sistemas naturais e artificiais, sendo ub´ıquas em sistemas do mundo real. As redes complexas apresentam uma quantidade substancial de caracter´ısticas topol´ogicas n˜ao triviais e tˆem recebido uma aten¸c˜ao crescente nos ´ultimos anos. O conceito de network motifs (padr˜oes de rede) surgiu para ajudar a perceber como s˜ao estas redes desenhadas. Os motifs s˜ao padr˜oes de liga¸c˜oes frequentes sobre-representados que se conjectura que possam ter algum significado e podem ser vistos como blocos b´asicos de constru¸c˜ao de redes. Em termos algor´ıtmicos, descobrir estes padr˜oes ´eum problem dif´ıcil, relacionado com o isomorfismo de grafos. O tempo de execu¸c˜ao necess´ario cresce `amedida que o tamanho das redes e dos padr˜oes aumenta. Como os motifs s˜ao um conceito fundamental, aumentar a eficiˆencia da sua detec¸c˜ao pode levar a novas descobertas em v´arias ´areas do conhecimento. Desenvolver algoritmos eficientes e escal´aveis para a descoberta de network motifs ´eprecisamente o principal objectivo desta tese. Fornecemos um completo estudo do estado da arte das metodologias utilizadas, com a cronologia e taxonomia associadas, uma descri¸c˜ao algor´ıtmica e uma avalia¸c˜ao e compara¸c˜ao emp´ırica. Propomos uma nova estrutura de dados, g-tries, desenhada para representar colec¸c˜oes de grafos. Sendo aparentada a uma ´arvore de prefixos, tira partido de subestruturas comuns para reduzir a mem´oria necess´aria para guardar os grafos e para criar um novo algoritmo sequencial mais eficiente para calcular as suas frequˆencias como subgrafos de um outro grafo maior. Tamb´em introduzimos uma t´ecnica de amostragem para as g-tries que troca com sucesso precis˜ao por tempos de execu¸c˜ao mais r´apidos. Expomos toda a computa¸c˜ao de motifs como uma pesquisa em ´arvore e desenhamos uma metodologia geral para a execu¸c˜ao paralela com balancea- mento dinˆamico de carga, incluindo uma nova estrat´egia capaz de eficientemente parar e dividir a computa¸c˜ao durante a sua execu¸c˜ao. Em particular, fornecemos algoritmos paralelos para o ESU e para as g-tries. Finalmente, fazemos uma avalia¸c˜ao extensiva dos nossos algoritmos num conjunto de redes diversificadas. Mostramos que temos melhor desempenho que todos os algoritmos sequenciais existentes e que somos capazes de escalar quase linearmente 9 os nossos algoritmos paralelos at´e128 processadores. Combinando o potencial das g-tries com o seu paralelismo, conseguimos tempos de execu¸c˜ao para descoberta de motifs v´arias ordens de magnitude mais r´apidos, efectivamente aumentando os limites da sua aplicabilidade. 10 R´esum´e Les r´eseaux sont une puissante repr´esentation pour une vari´et´ede syst`emes naturels et artificiels. Ils sont ubiquit´edans les syst`emes du monde r´eel et pr´esentent une quantit´esubstantielle de caract´eristiques topologiques qui ne sont pas simples. Ces r´eseaux sont pour cela appel´es r´eseaux complexes et ont re¸cu d`es quelques ann´ees une reconnaissance croissante. Le concept de network motifs (patron de r´eseau) est n´ede fa¸con `ales mieux comprendre. Ce sont des patrons de liaisons fr´equentes sur- repr´esent´es qui se conjecture pour avoir un sens quelconque et puisse ˆetre vu comme des bloques basiques de construction de r´eseaux. Du point de vue algorithmique, d´ecouvrir ces patrons est un probl`eme difficile qui a rapport avec l’isomorphisme de graphes. Le temps d’ex´ecution grandie `ala mesure que la grandeur des r´eseaux et des patrons augmentent. Comme les motifs sont un concept essentiel, cela augmente l’efficience de la d´etection qui mener `ade nouvelles d´ecouvertes en plusieurs branches de la connaissance. Le principal objectif de cette th`ese est pr´ecis´ement d´evelopper des algorithmiques efficients et extensibles pour la d´ecouverte de network motifs. On fait une ´etude compl`ete de l’´etat de l’art dans les m´ethodes utilis´ees, avec la chronologie et la taxonomie, une description algorithmique et une ´evaluation et com- paraison empiriques. On propose une nouvelle structure de donn´ees, g-tries, dessin´ee pour repr´esent´edes collections de graphes. Cette structure est similaire `aun arbre pr´efixe et profite de sous-structures communes pour r´eduire la m´emoire n´ecessaire pour garder les graphes, et pour cr´eer un nouveau algorithme s´equentiel plus efficient pour calculer ses fr´equences comme sous-graphes d’un autre graphe plus grand. On introduit une technique d’´echantillonnage pour les g-tries qui ´echange, avec succ`es, la pr´ecision du calcul pour un meilleur temps d’ex´ecution. On montre que la d´ecouverte de motifs avec g-tries peut ˆetre vu comme une recherche en arbre.