Thèse De Doctorat

Total Page:16

File Type:pdf, Size:1020Kb

Thèse De Doctorat THÈSE DE DOCTORAT de l’Université de recherche Paris Sciences et Lettres PSL Research University Préparée à l’École Normale Supérieure Nouvelles approches pour le partitionnement de grands graphes Novel Approaches to the Clustering of Large Graphs Ecole doctorale n°386 SCIENCES MATHÉMATIQUES DE PARIS CENTRE Spécialité MATHÉMATIQUES COMPOSITION DU JURY : M. CHANG Cheng-Shang National Tsing Hua University, Rapporteur M. LAMBIOTTE Renaud University of Oxford, Rapporteur M. VAYATIS Nicolas ENS Paris-Saclay, Président du jury Soutenue par ALEXANDRE HOLLOCOU le 19 décembre 2018 M. BONALD Thomas h Télécom Paristech, Membre du jury M. D’ALCHÉ-BUC Florence Dirigée par Marc LELARGE Télécom Paristech, Membre du jury et Thomas BONALD M. LELARGE Marc INRIA Paris, Membre du jury h M. VIENNOT Laurent INRIA Paris, Membre du jury 2 Abstract Graphs are ubiquitous in many fields of research ranging from sociology to biology. A graph is a very simple mathematical structure that consists of a set of elements, called nodes, connected to each other by edges. It is yet able to represent complex systems such as protein-protein interaction or scientific collaborations. Graph clustering is a central problem in the analysis of graphs whose objective is to identify dense groups of nodes that are sparsely connected to the rest of the graph. These groups of nodes, called clusters, are fundamental to an in-depth understanding of graph structures. There is no universal definition of what a good cluster is, and different approaches might be best suited for different applications. Whereas most of classic methods focus on finding node partitions, i.e. on coloring graph nodes so that each node has one and only one color, more elaborate approaches are often necessary to model the complex structure of real-life graphs and to address sophisticated applications. In particular, in many cases, we must consider that a given node can belong to more than one cluster. Besides, many real-world systems exhibit multi-scale structures and one much seek for hierarchies of clusters rather than flat clusterings. Furthermore, graphs often evolve over time and are too massive to be handled in one batch so that one must be able to process stream of edges. Finally, in many applications, processing entire graphs is irrelevant or expensive, and it can be more appropriate to recover local clusters in the neighborhood of nodes of interest rather than color all graph nodes. In this work, we study alternative approaches and design novel algorithms to tackle these different problems. First, we study soft clustering algorithms that aim at finding a degree of membership for each node to each cluster, allowing a given node to be a member of multiple clusters. Then, we explore the subject of edge clustering whose objective is to find partitions of edges instead of node partitions. Third, we study hierarchical clustering techniques to find multi-scale clusters in graphs. We then examine streaming methods for graph clustering, able to handle graphs with billions of edges. Finally, we focus on local graph clustering approaches, whose objective is to find clusters in the neighborhood of so-called seed nodes. The novel methods that we propose to address these different problems are mostly inspired by variants of modularity, a classic measure that accesses the quality of a node partition, and by random walks, stochastic processes whose properties are closely related to the graph structure. We provide analyses that give theoretical guarantees for the different proposed techniques, and endeavour to evaluate these algorithms on real-world datasets and use cases. 3 4 List of publications Multiple local community detection. Hollocou, A., Bonald, T., and Lelarge, M. (2017). ACM SIGMETRICS Performance Evaluation Review, 45(2):76–83. Reference [Hollocou et al., 2017a]. A streaming algorithm for graph clustering. Hollocou, A., Maudet, J., Bonald, T., and Lelarge, M. (2017). NIPS Workshop. Reference [Hollocou et al., 2017b]. Hierarchical graph clustering by node pair sampling. Bonald, T., Charpentier, B., Galland, A., and Hollocou, A. (2018). KDD Workshop. Reference [Bonald et al., 2018a]. Weighted spectral embedding of graphs. Bonald, T., Hollocou, A., and Lelarge, M. (2018). In Communication, Control, and Computing (Allerton), 2018 56th Annual Allerton Conference on. Reference [Bonald et al., 2018c]. Modularity-based sparse soft graph clustering. Hollocou, A., Bonald, T., and Lelarge, M. (2019). AISTATS. Reference [Hollocou et al., 2019]. A fast agglomerative algorithm for edge clustering. Hollocou, A., Lutz, Q., and Bonald, T. (2018). Preprint. Reference [Hollocou et al., 2018]. Forward and backward embedding of directed graphs. Bonald, T., De Lara, N., and Hollocou, A. (2018). Preprint. Reference [Bonald et al., 2018b]. 5 6 Acknowledgements First, I would like to express my sincere gratitude to my advisors Marc Lelarge and Thomas Bonald for their support during the last three years. I would like to thank you for encouraging my research and for your invaluable contribution to this work. It has been a truly great pleasure to work with you. I hope that we will have the opportunity to continue discussions in the future. I am also extremely grateful to my colleagues and friends at the Ministère des Armées. It was an amazing experience working with you, and I will never forget these moments. Special thanks to Fabrice, without whom my Ph.D. would not have been possible, and Pierre, for his extensive knowledge and his always helpful insights. I would also like to thank my committee members, professor Cheng-Shang Chang, professor Renaud Lambiotte, professor Florence D’Alche-Buc, professor Nicolas Vayatis and professor Laurent Viennot for serving as my committee members even at hardship and busy schedules. I am also grateful to all my friends for their encouragements. Special thanks to the Joko team, Xavier, Nicolas and Arthur, for your patience and your support. Lastly, I would like to thank my family and Nathalie for their unwavering love and for supporting me for everything. I cannot thank you enough for encouraging me throughout this experience. 7 8 Contents 1 Introduction 13 1.1 Motivations . 13 1.1.1 Graphs . 13 1.1.2 Graph clustering . 16 1.2 Contributions . 18 1.3 Notations . 19 1.4 Classic graph clustering methods . 20 1.4.1 Modularity . 20 1.4.2 Spectral methods . 27 1.4.3 Some other graph clustering approaches . 31 1.5 Evaluation of graph clustering algorithms . 32 1.5.1 Synthetic graph models . 32 1.5.2 Real-life graphs and ground-truth clusters . 34 2 Soft clustering 37 2.1 Introduction . 37 2.2 Fuzzy c-means: an illustration of soft clustering . 38 2.2.1 The k-means problem . 38 2.2.2 Fuzzy c-means . 39 2.3 Soft modularity . 41 2.4 Related work . 42 2.4.1 The softmax approach . 42 2.4.2 Other approaches . 44 2.5 Alternating projected gradient descent . 45 2.6 Soft clustering algorithm . 48 2.6.1 Local gradient descent step . 48 2.6.2 Cluster membership representation . 48 2.6.3 Projection onto the probability simplex . 49 2.6.4 MODSOFT . 49 2.7 Link with the Louvain algorithm . 49 2.8 Algorithm pseudo-code . 51 2.9 Experimental results . 53 2.9.1 Synthetic data . 53 2.9.2 Real data . 53 3 Edge clustering 57 3.1 Introduction . 57 3.2 Related work . 58 3.3 A modularity function for edge partitions . 58 3.4 The approach of Evans and Lambiotte . 60 3.4.1 Incidence graph and bipartite graph projections . 60 3.4.2 Line graph . 61 3.4.3 Weighted line graph . 61 3.4.4 Random-walk-based projection . 61 9 3.4.5 Limitation . 62 3.5 An agglomerative algorithm for maximizing edge modularity . 62 3.5.1 Agglomerative step . 63 3.5.2 Greedy maximization step . 64 3.5.3 Algorithm . 64 3.6 Experimental results . 64 3.6.1 Synthetic data . 64 3.6.2 Real-life datasets . 65 4 Hierarchical clustering 69 4.1 Introduction . 69 4.2 Related work . 69 4.3 Agglomerative clustering and nearest-neighbor chain . 70 4.3.1 Agglomerative algorithms . 70 4.3.2 Classic distances . 71 4.3.3 Ward’s method . 72 4.3.4 The Lance-Williams family . 72 4.3.5 Nearest-neighbor chain algorithm . 72 4.4 Node pair sampling . 73 4.5 Clustering algorithm . 75 4.6 Link with modularity . 76 4.7 Experiments . 77 5 Streaming approaches 85 5.1 Introduction . 85 5.2 Related work . 86 5.2.1 Edge streaming models . 86 5.2.2 Connected components and pairwise distances . 86 5.2.3 Minimum cut . 87 5.3 A streaming algorithm for graph clustering . 88 5.3.1 Intuition . ..
Recommended publications
  • Federal Register/Vol. 86, No. 132
    Federal Register / Vol. 86, No. 132 / Wednesday, July 14, 2021 / Proposed Rules 37091 Availability and Summary of navigation, it is certified that this ANE NH E5 Portsmouth, NH [Amended] Documents for Incorporation by proposed rule, when promulgated, will Portsmouth International Airport at Pease, Reference not have a significant economic impact NH (Lat. 43°04′41″ N, long. 70°49′24″ W) This document proposes to amend on a substantial number of small entities FAA Order 7400.11E, Airspace under the criteria of the Regulatory That airspace extending upward from 700 Flexibility Act. feet above the surface within an 8.2-mile Designations and Reporting Points, radius of Portsmouth International Airport at dated July 21, 2020, and effective Environmental Review Pease. September 15, 2020. FAA Order 7400.11E is publicly available as listed This proposal will be subject to an Issued in College Park, Georgia, on July 8, environmental analysis in accordance 2021. in the ADDRESSES section of this document. FAA Order 7400.11E lists with FAA Order 1050.1F, Andreese C. Davis, Class A, B, C, D, and E airspace areas, ‘‘Environmental Impacts: Policies and Manager, Airspace & Procedures Team South, air traffic service routes, and reporting Procedures’’, prior to any FAA final Eastern Service Center, Air Traffic Organization. points. regulatory action. Lists of Subjects in 14 CFR Part 71 [FR Doc. 2021–14932 Filed 7–13–21; 8:45 am] The Proposal BILLING CODE 4910–13–P The FAA proposes an amendment to Airspace, Incorporation by reference, Title 14 CFR part 71
    [Show full text]
  • Docket No. FWS–R4–ES–2020–0059; FF09E22000 FXES11130900000 212]
    This document is scheduled to be published in the Federal Register on 07/14/2021 and available online at Billing Code 4333–15 federalregister.gov/d/2021-14661, and on govinfo.gov DEPARTMENT OF THE INTERIOR Fish and Wildlife Service 50 CFR Part 17 [Docket No. FWS–R4–ES–2020–0059; FF09E22000 FXES11130900000 212] RIN 1018–BE56 Endangered and Threatened Wildlife and Plants; Reclassification of the Palo de Rosa from Endangered to Threatened with Section 4(d) Rule AGENCY: Fish and Wildlife Service, Interior. ACTION: Proposed rule. SUMMARY: We, the U.S. Fish and Wildlife Service (Service), propose to reclassify palo de rosa (Ottoschulzia rhodoxylon) from endangered to threatened (downlist) under the Endangered Species Act of 1973, as amended (Act). The proposed downlisting is based on our evaluation of the best available scientific and commercial information, which indicates that the species’ status has improved such that it is not currently in danger of extinction throughout all or a significant portion of its range, but that it is still likely to become so in the foreseeable future. We also propose a rule under section 4(d) of the Act that provides for the conservation of palo de rosa. DATES: We will accept comments received or postmarked on or before [INSERT DATE 60 DAYS AFTER DATE OF PUBLICATION IN THE FEDERAL REGISTER]. Comments submitted electronically using the Federal eRulemaking Portal (see ADDRESSES, below) must be received by 11:59 p.m. Eastern Time on the closing date. We must receive requests for a public hearing, in writing, at the address shown in FOR FURTHER INFORMATION CONTACT by [INSERT DATE 45 DAYS AFTER DATE OF PUBLICATION IN THE FEDERAL REGISTER].
    [Show full text]
  • Requiem /Etemam the Last Five Hundred Years of Mammalian Species Extinctions
    CHAPTER 13 Requiem /Etemam The Last Five Hundred Years of Mammalian Species Extinctions R. D. E. MACPHEE and CLARE FLEMMING 1. Introduction This contribution provides an analytical accounting of mammal species that have disap­ peared during the past 500 years-the "modem era" of this chapter. Our choice of date was dictated by several considerations, but two are paramount. AD 1500 marks more or less pre­ cisely the beginning of Europe's expansion across the rest of the world, a portentous event in human history by any definition. It is an equally momentous date for natural history be­ cause it marks the point at which empirical knowledge of the planet began to burgeon ex­ ponentially. These two factors, linked for both good and ill for the past half-millennium, have affected every aspect of life on earth, and are thus fitting subjects to commemorate in a record such as this. A number of modem-era extinction lists (hereafter, "compilations") for Mammalia have been published in recent years (e.g., Goodwin and Goodwin, 1973; Williams and Nowak, 1986, 1993; World Conservation Monitoring Centre [WCMC), 1992; International Union for the Conservation of Nature and Natural Resources [IUCN], 1993, 1996; Cole et al., 1994}. One may wonder why yet another recension is needed. Superficially, the empirical problem seems uncomplicated: A species either exists or it doesn't-what is there to dis­ agree about? Regrettably, things are not so simple. A target taxon that has not been collect­ ed or seen for an appreciable interval may indeed be extinct-or it may simply not have been collected or seen.
    [Show full text]
  • Brief Contents
    Brief Contents hn hk io il sy SY ek eh hn hk io il sy SY ek eh hn hk io il sy SY ek eh hn hk io il sy SY ek eh Preface xii Chapter 17 Carnivora 352 hn hk io il sy SY ek eh Chapter 18 Rodentia and Lagomorpha 374 hn hk io il sy SY ek eh P arT 1 Chapter 19 Proboscidea, Hyracoidea, hn hk io il sy SY ek eh Introduction 1 and Sirenia 402 hn hk io il sy SY ek eh Chapter 20 Perissodactyla and Artiodactyla 418 hn hk io il sy SY ek eh Chapter 1 The Study of Mammals 2 Chapter 21 Cetacea 444 hn hk io il sy SY ek eh Chapter 2 History of Mammalogy 8 hn hk io il sy SY ek eh Chapter 3 Methods for Studying Mammals 22 Chapter 4 Phylogeny and Diversification of Mammals 50 P arT 4 Chapter 5 Evolution and Dental Behavior and Ecol ogy 467 Characteristics 62 Chapter 22 Communication, Aggression, Chapter 6 Biogeography 82 and Spatial Relations 468 Chapter 23 Sexual Selection, Parental Care, and Mating Systems 484 P arT 2 Chapter 24 Social Behavior 500 Structure and Function 109 Chapter 25 Dispersal, Habitat Selection, and Migration 514 Chapter 7 Integument, Support, Chapter 26 Populations and Life History 530 and Movement 110 Chapter 27 Community Ecol ogy 548 Chapter 8 Modes of Feeding 132 Chapter 9 Control Systems and Biological Rhythms 162 Chapter 10 Environmental Adaptations 176 P arT 5 Chapter 11 Reproduction 214 Special Topics 569 Chapter 28 Parasites and Diseases 570 Chapter 29 Domestication and P arT 3 Domesticated Mammals 590 Adaptive Radiation and Diversity 237 Chapter 30 Conservation 604 Chapter 12 Monotremes and Marsupials 242 Glossary 624 Chapter
    [Show full text]