Parallelizing Louvain Algorithm: Distributed Memory Challenges
Recommended publications
- Scikit-Network Documentation, Release 0.24.0 (Bertrand Charpentier)
scikit-network documentation, Release 0.24.0, Bertrand Charpentier, Jul 27, 2021. Python package for the analysis of large graphs:

• Memory-efficient representation as sparse matrices in the CSR format of scipy
• Fast algorithms
• Simple API inspired by scikit-learn

Resources: free software (BSD license); GitHub: https://github.com/sknetwork-team/scikit-network; documentation: https://scikit-network.readthedocs.io

Quick start. Install scikit-network:

    $ pip install scikit-network

If you don't have pip installed, the Python installation guide can walk you through the process. Import scikit-network in a Python project:

    import sknetwork as skn

See examples in the tutorials; the notebooks are available here.

Citing. To cite scikit-network, please refer to the publication in the Journal of Machine Learning Research:

    @article{JMLR:v21:20-412,
      author  = {Thomas Bonald and Nathan de Lara and Quentin Lutz and Bertrand Charpentier},
      title   = {Scikit-network: Graph Analysis in Python},
      journal = {Journal of Machine Learning Research},
      year    = {2020},
      volume  = {21},
      number  = {185},
      pages   = {1-6},
      url     = {http://jmlr.org/papers/v21/20-412.html}
    }

scikit-network is an open-source Python package for the analysis of large graphs.
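Given the parent document's focus on the Louvain algorithm, the most relevant entry point in scikit-network is its built-in Louvain clusterer. A minimal sketch under the 0.24-era API (the karate_club loader and the labels-returning fit_transform convention are as documented for that release; treat the exact method names as version-dependent):

    import numpy as np
    from sknetwork.data import karate_club
    from sknetwork.clustering import Louvain

    adjacency = karate_club()                      # sparse CSR adjacency matrix
    labels = Louvain().fit_transform(adjacency)    # one community label per node
    print(np.unique(labels, return_counts=True))   # community count and sizes

The whole pipeline stays in scipy CSR format, which is what makes the package memory-efficient on large graphs.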
- Scalable and Distributed Deep Learning (DL): Co-Design MPI Runtimes and DL Frameworks
OSU Booth Talk (SC '19), Ammar Ahmad Awan ([email protected]), Network Based Computing Laboratory, Dept. of Computer Science and Engineering, The Ohio State University.

Agenda: introduction (deep learning trends; CPUs and GPUs for deep learning; the Message Passing Interface (MPI)); research challenges in exploiting HPC for deep learning; proposed solutions; conclusion.

Understanding the deep learning resurgence. Deep Learning (DL) is a subset of Machine Learning (ML), and perhaps the most revolutionary one: learned feature extraction replaces hand-crafted features. Key success: Deep Neural Networks (DNNs). Everything was there since the late 80s except the "computability of DNNs". (Adapted from http://www.deeplearningbook.org/contents/intro.html)

Deep learning in the many-core era. Modern and efficient hardware finally made DNNs computable, with GPUs at the core of DNN training and CPUs catching up fast; AlexNet training time fell roughly 500x in five years (from 2 GTX 580 cards to a DGX-2). Together with the availability of datasets (MNIST, CIFAR10, ImageNet, and more) and excellent accuracy in many application areas (vision, machine translation, and others), this drove the resurgence.
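At the heart of the MPI/DL co-design is data-parallel training: each rank computes gradients on its own mini-batch, and an allreduce averages them before the weight update. A hedged sketch of that one step, using the mpi4py binding and a made-up 4-element gradient (the talk itself targets C MPI runtimes inside DL frameworks):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD

    # Hypothetical per-rank gradient; a real framework would hand MPI a
    # large flattened parameter tensor instead.
    local_grad = np.random.rand(4)

    # Sum gradients across all ranks, then divide to average them.
    global_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
    global_grad /= comm.Get_size()

Launched as, e.g., mpiexec -n 4 python train_step.py, every rank ends up holding the same averaged gradient.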
- Parallel Programming
Parallel computing hardware.

Shared memory: multiple CPUs are attached to the bus, and all processors share the same primary memory, so the same memory address on different CPUs refers to the same memory location. The CPU-to-memory connection becomes a bottleneck, which is why shared memory computers cannot scale very well.

Distributed memory: each processor has its own private memory, and computational tasks can only operate on local data. Memory can keep growing by adding nodes, but programming becomes more difficult.

OpenMP versus MPI. OpenMP (Open Multi-Processing) is easy to use and excels at loop-level parallelism; non-loop-level parallelism is more difficult, and it is limited to shared memory computers, so it cannot handle very large problems. MPI (Message Passing Interface) requires low-level, more difficult programming, but it scales in cost and size and can handle very large problems.

MPI on distributed memory. Each processor can access only the instructions and data stored in its own memory; the machine has an interconnection network that supports passing messages between processors. A user specifies a number of concurrent processes when the program begins. Every process executes the same program, though the flow of execution may depend on the processor's unique ID number (e.g. "if (my_id == 0) then ..."). Each process performs computations on its local variables, then communicates with other processes, repeating both steps until the result is computed. In this model, processors pass messages both to send and receive information and to synchronize with one another; a minimal sketch of this style follows below.

Communicators and groups: MPI uses objects called communicators and groups to define which collection of processes may communicate with each other.
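A hedged illustration of the single-program, rank-branching pattern described above, written with the mpi4py Python binding (an assumption; MPI itself is language-neutral and most often called from C or Fortran):

    from mpi4py import MPI

    comm = MPI.COMM_WORLD      # default communicator: all processes
    rank = comm.Get_rank()     # this process's unique ID
    size = comm.Get_size()     # number of concurrent processes

    if rank == 0:
        # Coordinator branch, mirroring "if (my_id == 0) then ...".
        for src in range(1, size):
            print(comm.recv(source=src, tag=0))
    else:
        comm.send("hello from rank %d" % rank, dest=0, tag=0)

Every process runs this same file; only the rank test decides who sends and who receives.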
- Parallel Computer Architecture
Introduction to Parallel Computing, CIS 410/510, Department of Computer and Information Science, Lecture 2 – Parallel Architecture (University of Oregon, IPCC).

Outline: parallel architecture types; instruction-level parallelism; vector processing; SIMD; shared memory (memory organization: UMA, NUMA; coherency: CC-UMA, CC-NUMA); interconnection networks; distributed memory; clusters; clusters of SMPs; heterogeneous clusters of SMPs.

Parallel architecture types: a uniprocessor with a scalar or vector processor; a shared memory multiprocessor (SMP) with a single shared memory address space, connected by a bus or an interconnection network; and Single Instruction Multiple Data (SIMD) machines.

Parallel architecture types (2): a distributed memory multiprocessor, with per-node memory and network interfaces and message passing between nodes over an interconnection network; a massively parallel processor (MPP), which is the same organization with many, many processors; and a cluster of SMPs, with shared memory addressing within an SMP node and message passing between SMP nodes, which can also be regarded as an MPP if the processor count is large.
- Parallel Programming Using OpenMP, Feb 2014
OpenMP tutorial by Brent Swartz, February 18, 2014, Room 575 Walter, 1-4 P.M. © 2013 Regents of the University of Minnesota. All rights reserved.

Outline: OpenMP definition; MSI hardware overview; hyper-threading; programming models; OpenMP tutorial; affinity; OpenMP 4.0 support and features; OpenMP optimization tips.

OpenMP definition. The OpenMP Application Program Interface (API) is a multi-platform shared-memory parallel programming model for the C, C++ and Fortran programming languages. Further information can be found at http://www.openmp.org/. Jointly defined by a group of major computer hardware and software vendors and the user community, OpenMP is a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications for platforms ranging from multicore systems and SMPs to embedded systems.

MSI hardware. MSI hardware is described at https://www.msi.umn.edu/hpc. The number of cores per node varies with the type of Xeon on that node. We define MAXCORES = number of cores per node, e.g. Nehalem = 8, Westmere = 12, SandyBridge = 16, IvyBridge = 20. Since MAXCORES will not vary within the PBS job (the itasca queues are split by Xeon type), you can determine it at the start of the job (in bash):

    MAXCORES=`grep "core id" /proc/cpuinfo | wc -l`
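For readers working from Python rather than bash, a rough equivalent of the MAXCORES probe (a sketch; os.sched_getaffinity is Linux-only, which matches the cluster environment the tutorial assumes):

    import os

    # Analogue of: grep "core id" /proc/cpuinfo | wc -l
    maxcores = os.cpu_count()

    # Cores this process may actually use, e.g. under a batch
    # scheduler's CPU set; can be smaller than maxcores.
    available = len(os.sched_getaffinity(0))

    print(maxcores, available)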
- A Survey of Results on Mobile Phone Datasets Analysis
Blondel et al., research survey: A survey of results on mobile phone datasets analysis. Vincent D. Blondel, Adeline Decuyper and Gautier Krings (correspondence: [email protected]). Department of Applied Mathematics, Université catholique de Louvain, Avenue Georges Lemaitre 4, 1348 Louvain-La-Neuve, Belgium.

Abstract. In this paper, we review some advances made recently in the study of mobile phone datasets. This area of research emerged a decade ago, with the increasing availability of large-scale anonymized datasets, and has grown into a stand-alone topic. We survey the contributions made so far on the social networks that can be constructed with such data, the study of personal mobility, geographical partitioning, urban planning, and help towards development, as well as security and privacy issues. Keywords: mobile phone datasets; big data analysis.

1 Introduction. As the Internet was the technological breakthrough of the '90s, mobile phones have changed our communication habits in the first decade of the twenty-first century. In a few years, the world coverage of mobile phone subscriptions has risen from 12% of the world population in 2000 up to 96% in 2014 (6.8 billion subscribers), corresponding to a penetration of 128% in the developed world and 90% in developing countries [1]. Mobile communication has initiated the decline of landline use, decreasing in both the developing and developed world since 2005, and allows people to be connected even in the most remote places of the world. In short, mobile phones are ubiquitous.
- Enabling Efficient Use of UPC and OpenSHMEM PGAS Models on GPU Clusters
Presented at GTC '15 by Dhabaleswar K. (DK) Panda, The Ohio State University. E-mail: [email protected]; http://www.cse.ohio-state.edu/~panda

Accelerator era. Accelerators are becoming common in high-end system architectures: in the Top 100 (Nov 2014), 28% of systems use accelerators, and 57% of those use NVIDIA GPUs. An increasing number of workloads are being ported to take advantage of NVIDIA GPUs. As applications scale to large GPU clusters with high compute density, the synchronization and communication overheads, and with them the penalty, grow; it is critical to minimize these overheads to achieve maximum performance.

Parallel programming models overview. Programming models provide abstract machine models that can be mapped onto different types of systems (e.g. Distributed Shared Memory (DSM), or MPI within a node). The main families are the shared memory model (DSM), the distributed memory model (MPI, the Message Passing Interface), and the Partitioned Global Address Space (PGAS) model (Global Arrays, UPC, Chapel, X10, CAF, ...), in which processes share a logical shared memory on top of physically partitioned memories. Each model has strengths and drawbacks and suits different problems or applications.

Outline: overview of PGAS models (UPC and OpenSHMEM); limitations in PGAS models for GPU computing; proposed designs and alternatives; performance evaluation; exploiting GPUDirect RDMA.

Partitioned Global Address Space (PGAS) models: an attractive alternative to traditional message passing; simple ...
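UPC and OpenSHMEM are C-centric, so as a stand-in illustration of the one-sided, partitioned-address-space style they provide, here is a sketch using MPI's one-sided RMA through mpi4py (an analogy chosen for consistency with the other Python examples on this page, not code from the talk):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Each rank exposes one double as its partition of the global space.
    local = np.array([10.0 * rank])
    win = MPI.Win.Create(local, comm=comm)

    result = np.zeros(1)
    win.Fence()                          # open an access epoch
    if rank == 0 and comm.Get_size() > 1:
        # One-sided read of rank 1's memory; rank 1 makes no matching call.
        win.Get([result, MPI.DOUBLE], 1)
    win.Fence()                          # close the epoch
    win.Free()

The defining PGAS trait shows up in the Get: the target rank does not participate in the transfer, which is exactly the communication pattern the talk seeks to make efficient on GPU clusters.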
- Multilevel Hierarchical Kernel Spectral Clustering for Real-Life Large Scale Complex Networks
Raghvendra Mall, Rocco Langone and Johan A.K. Suykens, Department of Electrical Engineering, KU Leuven, ESAT-STADIUS, Kasteelpark Arenberg 10, B-3001 Leuven, Belgium. {raghvendra.mall,rocco.langone,johan.suykens}@esat.kuleuven.be (arXiv:1411.7640v2 [cs.SI], 2 Dec 2014)

Abstract. Kernel spectral clustering corresponds to a weighted kernel principal component analysis problem in a constrained optimization framework. The primal formulation leads to an eigen-decomposition of a centered Laplacian matrix at the dual level. The dual formulation allows one to build a model on a representative subgraph of the large scale network in the training phase, with the model parameters estimated in the validation stage. The KSC model has a powerful out-of-sample extension property which allows cluster affiliation for the unseen nodes of the big data network. In this paper we exploit the structure of the projections in the eigenspace during the validation stage to automatically determine a set of increasing distance thresholds. We use these distance thresholds in the test phase to obtain multiple levels of hierarchy for the large scale network; the hierarchical structure in the network is determined in a bottom-up fashion. We show empirically that real-world networks have a multilevel hierarchical organization which cannot be detected efficiently by several state-of-the-art large scale hierarchical community detection techniques such as the Louvain, OSLOM and Infomap methods. A major advantage of our proposed approach is the ability to locate good quality clusters at both the coarser and finer levels of hierarchy, demonstrated using internal cluster quality metrics on 7 real-life networks.
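For readers new to the eigen-decomposition step this abstract builds on, a generic spectral clustering sketch follows (the classical normalized Laplacian plus k-means recipe, not the authors' weighted-kernel-PCA model with its out-of-sample extension):

    import numpy as np
    from scipy.sparse import csgraph
    from scipy.sparse.linalg import eigsh
    from sklearn.cluster import KMeans

    def spectral_clusters(adjacency, k):
        """Embed nodes with the k smallest eigenvectors of the
        normalized Laplacian, then group them with k-means."""
        lap = csgraph.laplacian(adjacency, normed=True)
        vals, vecs = eigsh(lap, k=k, which='SM')   # smoothest modes
        return KMeans(n_clusters=k, n_init=10).fit_predict(vecs)

KSC departs from this baseline precisely where the abstract says: the decomposition is done on a small representative subgraph, and the remaining nodes are assigned through the out-of-sample extension instead of being embedded directly.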
- An Overview of the PARADIGM Compiler for Distributed-Memory Multicomputers
Appeared in IEEE Computer, Volume 28, Number 10, October 1995. Prithviraj Banerjee, John A. Chandy, Manish Gupta, Eugene W. Hodges IV, John G. Holm, Antonio Lain, Daniel J. Palermo, Shankar Ramaswamy, Ernesto Su. Center for Reliable and High-Performance Computing, University of Illinois at Urbana-Champaign, W. Main St., Urbana, IL (banerjee@crhc.uiuc.edu).

Abstract. Distributed-memory multicomputers such as the Intel Paragon, the IBM SP, and the Thinking Machines CM-5 offer significant advantages over shared-memory multiprocessors in terms of cost and scalability. Unfortunately, extracting all the computational power from these machines requires users to write efficient software for them, which is a laborious process. The PARADIGM compiler project provides an automated means to parallelize programs written using a serial programming model for efficient execution on distributed-memory multicomputers. In addition to performing traditional compiler optimizations, PARADIGM is unique in that it addresses many other issues within a unified platform: automatic data distribution, communication optimizations, support for irregular computations, exploitation of functional and data parallelism, and multithreaded execution. This paper describes the techniques used and provides experimental evidence of their effectiveness on the Intel Paragon, the IBM SP, and the Thinking Machines CM-5.

Keywords: compilers, distributed-memory multicomputers, parallelizing compilers, parallel processing.

This research was supported ...
Fast Community Detection Using Local Neighbourhood Search
Arnaud Browet, P.-A. Absil, and Paul Van Dooren, Université catholique de Louvain, Department of Mathematical Engineering, Institute of Information and Communication Technologies, Electronics and Applied Mathematics (ICTEAM), Av. Georges Lemaître 4-6, B-1348 Louvain-la-Neuve, Belgium. (Dated: August 30, 2013)

Abstract. Communities play a crucial role in describing and analysing modern networks. However, the size of those networks has grown tremendously with the increase of computational power and data storage. While various methods have been developed to extract community structures, their computational cost, or the difficulty of parallelizing existing algorithms, makes partitioning real networks into communities a challenging problem. In this paper, we propose to alter an efficient algorithm, the Louvain method, such that communities are defined as the connected components of a tree-like assignment graph. Within this framework, we precisely describe the different steps of our algorithm and demonstrate its highly parallelizable nature. We then show that despite its simplicity, our algorithm has a partitioning quality similar to the original method on benchmark graphs and even outperforms other algorithms. We also show that, even on a single processor, our method is much faster and allows the analysis of very large networks. PACS numbers: 89.75.Fb, 05.10.-a, 89.65.-s

Introduction. Over the years, the evolution of information and communication technology in science and industry has had a significant impact on how collected data are being used to understand and analyse complex phenomena [1]. [...] When similarity is encoded in the edge weight, this task is known in graph theory as community detection and has been introduced by Newman [9]. Community detection has already proven to be of great interest in many different research areas, such as epidemiology [10], influence and spread of information over social networks [11, 12], ...
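The Louvain method that this paper modifies greedily optimizes Newman's modularity; as a reference point, a compact sketch of that objective (plain numpy on a dense adjacency matrix for clarity; an illustration, not the authors' parallel implementation):

    import numpy as np

    def modularity(adjacency, labels):
        # Newman's modularity Q for an undirected, possibly weighted graph.
        A = np.asarray(adjacency, dtype=float)
        m = A.sum() / 2.0                          # total edge weight
        k = A.sum(axis=1)                          # (weighted) degrees
        same = labels[:, None] == labels[None, :]  # same-community mask
        return ((A - np.outer(k, k) / (2.0 * m)) * same).sum() / (2.0 * m)

    # Toy check: two triangles joined by one edge, split into two communities.
    A = np.zeros((6, 6))
    for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
        A[i, j] = A[j, i] = 1.0
    print(modularity(A, np.array([0, 0, 0, 1, 1, 1])))  # about 0.357

Louvain repeatedly moves nodes between communities while this quantity increases; the paper's contribution is restructuring those moves so they can run in parallel.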