DGCL: an Efficient Communication Library for Distributed GNN Training

DGCL: An Efficient Communication Library for Distributed GNN Training Zhenkun Cai Xiao Yan∗ Yidi Wu The Chinese University of Hong Kong Southern University of Science and The Chinese University of Hong Kong [email protected] Technology [email protected] [email protected] Kaihao Ma James Cheng Fan Yu The Chinese University of Hong Kong The Chinese University of Hong Kong Huawei Technologies Co. Ltd [email protected] [email protected] [email protected] Abstract ACM Reference Format: Graph neural networks (GNNs) have gained increasing pop- Zhenkun Cai, Xiao Yan, Yidi Wu, Kaihao Ma, James Cheng, and Fan ularity in many areas such as e-commerce, social networks Yu. 2021. DGCL: An Efficient Communication Library for Dis- tributed GNN Training. In Sixteenth European Conference on Com- and bio-informatics. Distributed GNN training is essential for puter Systems (EuroSys ’21), April 26ś29, 2021, Online, United King- handling large graphs and reducing the execution time. How- dom. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/ ever, for distributed GNN training, a peer-to-peer communi- 3447786.3456233 cation strategy suffers from high communication overheads. Also, different GPUs require different remote vertex embeddings, which leads to an irregular communication pattern 1 Introduction and renders existing communication planning solutions un- Graph neural networks (GNNs) are a special kind of deep suitable. We propose the distributed graph communication neural network, in which each vertex aggregates the embed- library (DGCL) for efficient GNN training on multiple GPUs. dings of its neighbors in the graph to compute its own embed- At the heart of DGCL is a communication planning algorithm ding. Many GNN models have been proposed, e.g., GCN [17], tailored for GNN training, which jointly considers fully uti- GAT [33], CommNet [32], GraphSAGE [8], GIN [39] and lizing fast links, fusing communication, avoiding contention GGNN [20]. These GNN models achieve excellent perfor- and balancing loads on different links. DGCL can be easily mance for graph related tasks such as node classification, adopted to extend existing single-GPU GNN systems to dis- link prediction and graph clustering. To support the effi- tributed training. We conducted extensive experiments on cient training of GNN models, a number of GNN systems different datasets and network configurations to compare such as DGL [35], PyG [7], NeuGraph [21], RoC [13] and DGCL with alternative communication schemes. In our ex- AliGraph [40], have been developed for CPU or GPU. As real periments, DGCL reduces the communication time of the graphs can be very large, e.g., containing millions of vertices peer-to-peer communication by 77.5% on average and the and billions of edges, it is essential to conduct distributed training time for an epoch by up to 47%. GNN training using many GPUs for efficiency and scalabil- ity. However, most existing GNN systems are designed for Keywords Graph Neural Networks, Distributed and Paral- a single worker (e.g., DGL and PyG) or a small number of lel Training, Network Communication workers on the same machine (e.g., NeuGraph). To conduct distributed GNN training, a graph is usually ∗Corresponding author. partitioned [4, 14, 15, 30] to assign its vertices to multiple workers, and the workers compute the vertex embedding in parallel. As vertices need to access the embeddings of their Permission to make digital or hard copies of all or part of this work for neighbors that reside on other workers, it is necessary to personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear conduct embedding passing during training. However, using this notice and the full citation on the first page. Copyrights for components direct peer-to-peer communication (i.e., each worker fetches of this work owned by others than ACM must be honored. Abstracting with the required embeddings directly from other workers) [12] credit is permitted. To copy otherwise, or republish, to post on servers or to for embedding passing results in high communication over- redistribute to lists, requires prior specific permission and/or a fee. Request heads. The communication time can easily take up over 50% permissions from [email protected]. of the total training time and even exceeds 90% in some cases EuroSys ’21, April 26ś29, 2021, Online, United Kingdom © 2021 Association for Computing Machinery. according to our measurements in ğ3. ACM ISBN 978-1-4503-8334-9/21/04...$15.00 Our profiling and analysis show that peer-to-peer com- https://doi.org/10.1145/3447786.3456233 munication has poor efficiency for two main reasons (ğ3): (1) It fails to fully utilize fast links. Modern GPU servers (e.g., h e h e NVIDIA DGX systems [22]) contain heterogeneous physical GPU 3 GPU 2 connections such as NVLink [25], PCIe, QPI, IB [36] and 0 Ethernet, and the bandwidth of the fastest link can be an g i f d g i f d order of magnitude of the slowest link. Peer-to-peer communication simply uses the direct links between workers and j a b j a b does not fully exploit the fast links. (2) It does not consider l GPU 4 k c GPU 1 the communications between different worker pairs jointly. l k c With peer-to-peer communication, all workers communicate (a) Graph (b) Partition to 4 GPUs with each other concurrently for embedding passing. This results in severe network contention and load imbalance Figure 1. An example graph and its partitioning given the complex connections among GPUs in a modern GPU cluster. To eliminate embedding passing, we considered an alternative that replicates the remote neighbors of the vertices to the local worker but found that replication results in high memory and computation costs. work in a single-worker mode and does not need to know the To optimize communication efficiency of distributed GNN details about distributed execution. In addition to the SPST training, we formulate a communication planning problem, algorithm, we introduce several system designs in DGCL which is to find a plan that minimizes the time to pass the to improve communication efficiency, including decentral- required vertex embeddings among the workers. The prob- ized communication coordination, automatic communica- lem is challenging for two reasons: (1) The communication tion method selection and non-atomic aggregation. pattern is irregular as different workers require different re- To evaluate the performance of DGCL, we conducted ex- mote embeddings. This is in contrast to training normal deep tensive experiments on four graphs and three GNN mod- neural networks (e.g., VGG [31] and ResNet [9] variants), els under different network configurations. We compared for which each worker sends/receives the same set of model with three alternative communication schemes, i.e., peer- parameters and communication planning has been solved to-peer communication, swap (which uses main memory for by NCCL [24]. (2) The physical connections are complex in embedding exchange among GPUs as in NeuGraph [21]) and modern GPU servers and the communications of all worker replication (which replicates cross-partition vertices to elimi- pairs need to be jointly considered to avoid contention and nate communication as in Medusa [43]). The results show balance the loads among different links. that DGCL consistently achieves shorter communication To solve the communication planning problem, we first time and hence shorter per-epoch time than the baselines show that the optimal strategy to transmit a vertex embed- for distributed GNN training. Compared with peer-to-peer ding from its source worker (i.e., the worker that computes communication, DGCL reduces the communication time by it) to its destination workers (i.e., the workers that use it as at most 85.8% and 77.5% on average. The per-epoch time input) is a tree that has the source worker as its root and of peer-to-peer is reduced by up to 47% in some cases. We contains all the destination workers. Then, a communication also found that replication has poor performance in process- plan is defined as the union of the communication treesof ing dense graphs and runs out of memory for large graphs. all vertices, which provides the maximum flexibility as each Swap often has the worst performance as it needs to dump vertex is allowed to choose its own communication strategy. all vertex embeddings to main memory. We also build a cost model to predict the execution time of Paper outline. We give the background on GNN training a communication plan, which divides the communications in ğ2. We analyze existing strategies for distributed GNN into stages and considers contention, load balancing and training in ğ3 to show their limitations and hence motivate parallelization. Finally, we propose a shortest path spanning our work. We present the architecture and the API of DGCL tree (SPST) algorithm to minimize the cost model in a greedy in ğ4. We discuss the details of communication planning and manner. The vertices are considered sequentially and the the SPST algorithm in ğ5. We discuss the system designs communication tree for each vertex is solved by minimizing and some implementation details in ğ6. We report the per- the cost blow-up considering the processed vertices. formance evaluation results in ğ7. We discuss related work Based on the SPST algorithm, we develop a package called in ğ8 and conclude our work in ğ9. the distributed graph communication library (DGCL). DGCL handles tasks related to distributed GNN training including graph partitioning, communication planning, and ac- 2 Background on GNN Training tual communication execution. Existing GNN systems can Given a graph G = (V, E), GNN models (e.g., GCN [17], easily invoke DGCL with user-friendly APIs for efficient CommNet [32] and GIN [39]) learn an embedding for each distributed communication.

DGCL: an Efficient Communication Library for Distributed GNN Training

20201130 Gcdv V1.0.Pdf

NVIDIA DGX Station the First Personal AI Supercomputer 1.0 Introduction

The NVIDIA DGX-1 Deep Learning System

Computing for the Most Demanding Users

Nvidia Autonomous Driving Platform

Investigations of Various HPC Benchmarks to Determine Supercomputer Performance Efficiency and Balance

Fabric Manager for NVIDIA Nvswitch Systems

NVIDIA DGX-1 System Architecture White Paper

NVIDIA A100 Tensor Core GPU Architecture UNPRECEDENTED ACCELERATION at EVERY SCALE

NVIDIA GPU COMPUTING: a JOURNEY from PC GAMING to DEEP LEARNING Stuart Oberman | October 2017 GAMING PROENTERPRISE VISUALIZATION DATA CENTER AUTO

Nvidia Dgx Os 5.0

NVIDIA Topology-Aware GPU Selection 0.1.0 (Early Access)