Efficient and Scalable Algorithms for Network Motifs Discovery

Pedro Manuel Pinto Ribeiro Efficient and Scalable Algorithms for Network Motifs Discovery Departamento de Ciência de Computadores Faculdade de Ciências da Universidade do Porto 2011 Pedro Manuel Pinto Ribeiro Efficient and Scalable Algorithms for Network Motifs Discovery Tese submetida àFaculdade de Ciências da Universidade do Porto para obten¸cão do grau de Doutor em Ciência de Computadores Advisors: Prof. Fernando Silva and Prof. Lu´ıs Lopes Departamento de Ciência de Computadores Faculdade de Ciências da Universidade do Porto Junho de 2011 Thesis Committee: Prof. António Porto, University of Porto (Chair) Prof. Arlindo Oliveira, IST - Technical University of Lisbon (Examiner) Prof. Marcus Kaiser, Newcastle University (Examiner) Prof. JoséCardoso e Cunha, New University of Lisbon Prof. V´ıtor Santos Costa, University of Porto Prof. Fernando Silva, University of Porto (Advisor) Prof. Lu´ıs Lopes, University of Porto (Co-Advisor) 2 To Samuel and Liliana 3 4 Acknowledgments First of all, I would like to thank my two advisors, Fernando Silva and Lu´ıs Lopes. They have always been there when I needed them, providing guidance, support and a lot of patience over the last years. Their collaboration was essential and we always had fruitful discussions on the emerging ideas on this subject. I would like to thank the financial support of FCT that was allowed me to dedicate the initial 4 years pursuing this research (PhD grant SFRH/BD/19753/2004). I would also like to thank CRACS/INESC-Porto L.A. for support on the fifth year, allowing me to further improve my work and finish my PhD. I would like to thank Paul Watson and particularly Marcus Kaiser for receiving me so well in Newcastle University for 3 months. This stay provided me with some of the initial motivation for entering the motif discovery and complex network analysis world. I would also like to thank all persons that helped in providing me the computational infrastructures that I used during my work. In particular Hugo Ribeiro, my brother and local computer science department network administrator, and Enrico Pontelli, for the use of Inter Cluster in the New Mexico State University. I also thank all other researchers from the CRACS research center for providing a very good environment for research and all my office colleagues during these years, which were too many to nominate them all. Finally, I would like to thank all my friends and family, that have always supported me. In particular, my parents, my parents-in-law, my wife, Liliana, and my son, Samuel, which was born when I was beginning my work. 5 6 Abstract Networks are a powerful representation for a multitude of natural and artificial systems. They are ubiquitous in real-world systems, presenting substantial non-trivial topological features. These are called complex networks and have received increasing attention in recent years. In order to understand their design principles, the concept of network motifs emerged. These are recurrent over-represented patterns of inter- connections, conjectured to have some significance, that can be seen as basic building blocks of networks. Algorithmically, discovering network motifs is a hard problem related to graph isomorphism. The needed execution time grows exponentially as the size of networks or motifs increases, thus limiting their applicability. Since motifs are a fundamental concept, increasing the efficiency in its detection can lead to new insights in several areas of knowledge. To develop efficient and scalable algorithms for motifs discovery is precisely the main aim of this thesis. We provide a thorough survey of existing methods, complete with an associated chronology, taxonomy, algorithmic description and empirical evaluation and compari- son. We propose a novel data-structure, g-tries, designed to represent a collection of graphs. Akin to a prefix tree, it takes advantage of common substructures to both reduce the memory needed to store the graphs, and to produce a new more efficient sequential algorithm to compute their frequency as subgraphs of another larger graph. We also introduce a sampling methodology for g-tries that successfully trades accuracy for faster execution times. We identify opportunities for parallelism in motif discovery, creating an associated taxonomy. We expose the whole motif computation as a tree based search and devise a general methodology for parallel execution with dynamic load balancing, including a novel strategy capable of efficiently stopping and dividing computation on the fly. In particular we provide parallel algorithms for ESU and g-tries. Finally, we extensively evaluate our algorithms on a set of diversified complex networks. We show that we are able to outperform all existing sequential algorithms, and are able to scale our parallel algorithms up to 128 processors almost linearly. By combining the power of g-tries and parallelism, we speedup motif discovery by several orders of magnitude, thus effectively pushing the limits in its applicability. 7 8 Resumo As redes são uma poderosa representa¸cão para uma grande variedade de sistemas naturais e artificiais, sendo ub´ıquas em sistemas do mundo real. As redes complexas apresentam uma quantidade substancial de caracter´ısticas topológicas não triviais e têm recebido uma aten¸cão crescente nos últimos anos. O conceito de network motifs (padrões de rede) surgiu para ajudar a perceber como são estas redes desenhadas. Os motifs são padrões de liga¸cões frequentes sobre-representados que se conjectura que possam ter algum significado e podem ser vistos como blocos básicos de constru¸cão de redes. Em termos algor´ıtmicos, descobrir estes padrões éum problem dif´ıcil, relacionado com o isomorfismo de grafos. O tempo de execu¸cão necessário cresce àmedida que o tamanho das redes e dos padrões aumenta. Como os motifs são um conceito fundamental, aumentar a eficiência da sua deteçcão pode levar a novas descobertas em várias áreas do conhecimento. Desenvolver algoritmos eficientes e escaláveis para a descoberta de network motifs éprecisamente o principal objectivo desta tese. Fornecemos um completo estudo do estado da arte das metodologias utilizadas, com a cronologia e taxonomia associadas, uma descri¸cão algor´ıtmica e uma avalia¸cão e compara¸cão emp´ırica. Propomos uma nova estrutura de dados, g-tries, desenhada para representar coleçcões de grafos. Sendo aparentada a uma árvore de prefixos, tira partido de subestruturas comuns para reduzir a memória necessária para guardar os grafos e para criar um novo algoritmo sequencial mais eficiente para calcular as suas frequências como subgrafos de um outro grafo maior. Também introduzimos uma técnica de amostragem para as g-tries que troca com sucesso precisão por tempos de execu¸cão mais rápidos. Expomos toda a computa¸cão de motifs como uma pesquisa em árvore e desenhamos uma metodologia geral para a execu¸cão paralela com balancea- mento dinâmico de carga, incluindo uma nova estratégia capaz de eficientemente parar e dividir a computa¸cão durante a sua execu¸cão. Em particular, fornecemos algoritmos paralelos para o ESU e para as g-tries. Finalmente, fazemos uma avalia¸cão extensiva dos nossos algoritmos num conjunto de redes diversificadas. Mostramos que temos melhor desempenho que todos os algoritmos sequenciais existentes e que somos capazes de escalar quase linearmente 9 os nossos algoritmos paralelos até128 processadores. Combinando o potencial das g-tries com o seu paralelismo, conseguimos tempos de execu¸cão para descoberta de motifs várias ordens de magnitude mais rápidos, efectivamente aumentando os limites da sua aplicabilidade. 10 Résumé Les réseaux sont une puissante représentation pour une variétéde systèmes naturels et artificiels. Ils sont ubiquitédans les systèmes du monde réel et présentent une quantitésubstantielle de caractéristiques topologiques qui ne sont pas simples. Ces réseaux sont pour cela appelés réseaux complexes et ont re¸cu dès quelques années une reconnaissance croissante. Le concept de network motifs (patron de réseau) est néde fa¸con àles mieux comprendre. Ce sont des patrons de liaisons fréquentes sur- représentés qui se conjecture pour avoir un sens quelconque et puisse être vu comme des bloques basiques de construction de réseaux. Du point de vue algorithmique, découvrir ces patrons est un problème difficile qui a rapport avec l’isomorphisme de graphes. Le temps d’exécution grandie àla mesure que la grandeur des réseaux et des patrons augmentent. Comme les motifs sont un concept essentiel, cela augmente l’efficience de la détection qui mener àde nouvelles découvertes en plusieurs branches de la connaissance. Le principal objectif de cette thèse est précisément développer des algorithmiques efficients et extensibles pour la découverte de network motifs. On fait une étude complète de l’état de l’art dans les méthodes utilisées, avec la chronologie et la taxonomie, une description algorithmique et une évaluation et com- paraison empiriques. On propose une nouvelle structure de données, g-tries, dessinée pour représentédes collections de graphes. Cette structure est similaire àun arbre préfixe et profite de sous-structures communes pour réduire la mémoire nécessaire pour garder les graphes, et pour créer un nouveau algorithme séquentiel plus efficient pour calculer ses fréquences comme sous-graphes d’un autre graphe plus grand. On introduit une technique d’échantillonnage pour les g-tries qui échange, avec succès, la précision du calcul pour un meilleur temps d’exécution. On montre que la découverte de motifs avec g-tries peut être vu comme une recherche en arbre.

Efficient and Scalable Algorithms for Network Motifs Discovery

A Novel Neuronal Network Approach to Express Network Motifs

A Meta-Analysis of Boolean Network Models Reveals Design Principles of Gene Regulatory Networks

Multivariate Relations Aggregation Learning in Social Networks

Motif Aware Node Representation Learning for Heterogeneous Networks

GAMARALLAGE-THESIS-2020.Pdf

Motifs in Temporal Networks

Biological Network Motif Detection: Principles and Practice Elisabeth Wong, Brittany Baur, Saad Quader and Chun-Hsi Huang

Subgraphs and Motifs in a Dynamic Airline Network Marius Agasse-Duval, Steve Lawford

Annotated Hypergraphs: Models and Applications

Snapshot: Network Motifs Oren Shoval and Uri Alon Department of Molecular Cell Biology, Weizmann Institute of Science, Rehovot 76100, Israel

Annotated Hypergraphs: Models and Applications

Classifying Types of Network Communities Using Motifs