ShenTu: Processing Multi-Trillion Edge Graphs on Millions of Cores in Seconds
Heng Lin (1,2), Xiaowei Zhu (1,5), Bowen Yu (1), Xiongchao Tang (1,5), Wei Xue (1), Wenguang Chen (1), Lufei Zhang (3), Torsten Hoefler (4), Xiaosong Ma (5), Xin Liu (6), Weimin Zheng (1), and Jingfang Xu (7)

(1) Tsinghua University. Email: {xuewei,cwg,[email protected]}, {linheng11,zhuxw16,yubw15,[email protected]}
(2) Fma Technology
(3) State Key Laboratory of Mathematical Engineering and Advanced Computing. Email: [email protected]
(4) ETH Zurich. Email: [email protected]
(5) Qatar Computing Research Institute. Email: [email protected]
(6) National Research Centre of Parallel Computer Engineering and Technology. Email: [email protected]
(7) Beijing Sogou Technology Development Co., Ltd. Email: [email protected]
(8) ShenTu means "magic graph" in Chinese.

Abstract—Graphs are an important abstraction used in many scientific fields. With the magnitude of graph-structured data constantly increasing, effective data analytics requires efficient and scalable graph processing systems. Although HPC systems have long been used for scientific computing, people have only recently started to assess their potential for graph processing, a workload with inherent load imbalance, lack of locality, and access irregularity. We propose ShenTu (8), the first general-purpose graph processing framework that can efficiently utilize an entire petascale system to process multi-trillion-edge graphs in seconds. ShenTu embodies four key innovations: hardware specialization, supernode routing, on-chip sorting, and degree-aware messaging, which together enable its unprecedented performance and scalability. It can traverse a record-size 70-trillion-edge graph in seconds. Furthermore, ShenTu enables the processing of a spam-detection problem on a 12-trillion-edge Internet graph, making it possible to identify trustworthy and spam webpages directly at the fine-grained page level.

Index Terms—Application programming interfaces; Big data applications; Data analysis; Graph theory; Supercomputers

I. JUSTIFICATION FOR ACM GORDON BELL PRIZE

ShenTu enables highly efficient general-purpose graph processing with novel use of heterogeneous cores and extremely large networks, scales to the full TaihuLight, and enables graph analytics on 70-trillion-edge graphs. It computes PageRank and TrustRank distributions for an unprecedented 12-trillion-edge real-world web graph in 8.5 seconds per iteration.

II. PERFORMANCE ATTRIBUTES

Performance Attribute       Content
Category of achievement     Scalability, Time-to-solution
Type of method used         Framework for graph algorithms
Results reported based on   Whole application including I/O
Precision reported          Mixed precision (Int and Double)
System scale                Measured on full-scale system
Measurement mechanism       Timers & code instrumentation

III. OVERVIEW OF THE PROBLEM

Graphs are one of the most important tools to model complex systems. Scientific graph structures range from multi-billion-edge graphs (e.g., in protein interactions, genomics, epidemics, and social networks) to trillion-edge ones (e.g., in connectomics and internet connectivity). Timely and efficient processing of such large graphs is not only required to advance scientific progress but also to solve important societal challenges, such as detection of fake content, or to enable complex data analytics tasks, such as personalized medicine.

Improved scientific data acquisition techniques fuel the rapid growth of large graphs. For example, cheap sequencing techniques lead to massive graphs representing millions of human individuals as annotated paths, enabling quick advances in medical data analytics [1]. For each individual, human genome researchers currently assemble de Bruijn graphs with over 5 billion vertices/edges [2]. Similarly, connectomics models the human brain, with over 100 billion neurons and an average of 7,000 synaptic connections each [3].

Meanwhile, researchers face unprecedented challenges in the study of human interaction graphs. Malicious activities such as the distribution of phishing emails or fake content, as well as massive scraping of private data, are posing threats to human society. It is necessary to scale graph analytics with the growing online community to detect and react to such threats. In the past 23 years, the number of Internet users increased by 164× to 4.2 billion, while the number of domains grew by nearly 70,000× to 1.7 billion. Sogou, one of the leading Internet companies in China, crawled more than 271.9 billion Chinese pages with over 12.3 trillion inter-page links in early 2017, and expects a 4× size increase with whole-web crawling.

Graph processing and analytics differ widely from traditional scientific computing: they exhibit significantly more load imbalance, lack of locality, and access irregularity [4]. Modern HPC systems are designed for workloads that have some degree of locality and access regularity. It is thus natural to map regular computations, such as stencil-based and structured-grid-based ones, to complex architectures achieving high performance, as recent Gordon Bell Prize winners demonstrated. However, as witnessed by the set of recent finalists, very few projects tackle the challenge of completely irregular computations at the largest scale. We argue that the science drivers outlined above, as well as the convergence of data science, big data, and HPC, necessitate serious efforts to enable graph computations on leading supercomputing architectures.
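To make the genome-assembly workload mentioned above concrete, a de Bruijn graph links overlapping k-mers observed in sequencing reads. The following is a minimal single-machine sketch (the function name, toy read set, and k value are illustrative only, and unrelated to ShenTu's implementation); real assemblies reach the billions of vertices and edges cited above:

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers; an edge connects
    the prefix and suffix (k-1)-mers of every k-mer seen in the reads."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# Toy example: one short read, k = 3.
g = de_bruijn_edges(["ACGTACG"], 3)
# k-mers ACG, CGT, GTA, TAC yield edges AC->CG, CG->GT, GT->TA, TA->AC.
```

Traversing such a graph (e.g., finding Eulerian paths) is exactly the kind of irregular, pointer-chasing workload that motivates a framework like ShenTu.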
SC18, November 11-16, 2018, Dallas, Texas, USA. 978-1-5386-8384-2/18/$31.00 © 2018 IEEE

TABLE I: SUMMARY OF STATE-OF-THE-ART GRAPH FRAMEWORKS AND THEIR REPORTED PAGERANK RESULTS

Year  System    In-memory or    Synthetic or  Platform                      Max. number   Time for one         Performance
      .         out-of-core     real-world    .                             of edges      PageRank iteration   (GPEPS)
2010  Pregel    In-memory       Synthetic     300 servers                   127 billion   n/a                  n/a
2015  Giraph    In-memory       Real-world    200 servers                   1 trillion    less than 3 minutes  >5.6
2015  GraM      In-memory       Synthetic     64 servers                    1.2 trillion  140 seconds          8.6
2015  Chaos     Out-of-core     Synthetic     32 servers (480 GB SSD each)  1 trillion    4 hours              0.07
2016  G-Store   Out-of-core     Synthetic     1 server with 8x512 GB SSDs   1 trillion    4214 seconds         0.23
2017  Graphene  Out-of-core     Synthetic     1 server with 16x500 GB SSDs  1 trillion    about 1200 seconds   0.83
2017  Mosaic    Out-of-core     Synthetic     1 server with 6 NVMe SSDs     1 trillion    1247 seconds         0.82

In order to execute trillion-edge graph processing on leading supercomputers, one needs to map the imbalanced, irregular, and non-local graph structure to a complex and rather regular machine. In particular, naive domain partitioning would suffer from extreme load imbalance due to the power-law degree distribution of vertices in real-world graphs [5]. For example, in-degrees in the Sogou web graph vary from zero to three billion. This load imbalance can be alleviated by either vertex randomization or edge partitioning; yet both techniques increase inter-process communication in many graph algorithms.

The challenge is further compounded by system features not considered by widely used distributed graph processing frameworks, including the high core counts, complex memory hierarchy, and heterogeneous processor architecture within each node on modern supercomputers.

Data scientists tweak algorithms in agile development frameworks and require quick turnaround times to evolve their experiments with ever-growing data volumes. We squarely address all these challenges with ShenTu, the first general-purpose graph processing framework to utilize machines with ten million cores. It shares a vertex-centric programming model with established small-scale graph processing frameworks [6], [7], allowing data scientists to express arbitrary graph algorithms via familiar APIs. In particular, as a general-purpose graph framework, ShenTu is within a factor of two of the breadth-first search (BFS) performance of the current 2nd entry in the Graph500 list [12]. When disabling optimizations that only apply to the BFS benchmark, ShenTu's performance is equal to this highly-optimized code. ShenTu processes real-world graphs with up to 12 trillion edges in 8.5 seconds running one iteration of PageRank, utilizing all nodes of Sunway TaihuLight.

IV. CURRENT STATE OF THE ART

Driven by the increasing significance of data analytics, numerous graph processing frameworks were introduced in recent years. Here, we focus on related frameworks that either aim at high performance or at large-scale graphs. We adopt a simple metric, processed edges per second (PEPS; see Section VII for detail), as a pendant to the well-known FLOPS metric, to gauge performance in graph processing.

Large-scale graph processing: Table I summarizes published large-scale frameworks and results, based on PageRank, one of the most important graph algorithms [8]. It has been studied and optimized in hundreds of research papers, and it is the gold standard for comparing graph processing frameworks. Among such frameworks, Google's Pregel [9] introduced a vertex-centric abstraction in 2010, computing the PageRank of a synthetic graph with 127 billion edges on 300 machines. In 2015, Facebook scaled Giraph [10] to complete one PageRank iteration on a trillion-edge graph in "less than 3 minutes". In the same year, GraM [11] used an RDMA-based communication stack tuned for multi-core machines to process a similar graph on an InfiniBand cluster, outperforming the Giraph result with less than one third of the nodes. The achieved 8.6 GPEPS remains the fastest trillion-edge PageRank throughput reported to date. Recently, out-of-core systems such as Chaos [13], G-Store [14], Graphene [15], and Mosaic [16] have processed trillion-edge synthetic graphs on a single server or small clusters, albeit at far lower throughput (Table I).
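The vertex-centric model and the PEPS metric discussed above can both be illustrated with a minimal, single-process sketch. This is not ShenTu's API; the function names and the explicit scatter/gather split are our own simplification of the general vertex-centric pattern shared by frameworks like Pregel and Giraph:

```python
from collections import defaultdict

def pagerank_vertex_centric(vertices, out_edges, iters=1, d=0.85):
    """One vertex-centric PageRank pass: each vertex scatters
    rank/out_degree along its out-edges, then every vertex gathers
    its incoming messages into a new rank."""
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iters):
        msgs = defaultdict(float)
        for v in vertices:                       # scatter phase
            if out_edges[v]:
                share = rank[v] / len(out_edges[v])
                for u in out_edges[v]:
                    msgs[u] += share
        rank = {v: (1 - d) / n + d * msgs[v]     # gather phase
                for v in vertices}
    return rank

def gpeps(n_edges, seconds):
    """Processed edges per second, in giga (10^9) units."""
    return n_edges / seconds / 1e9

# Toy 3-vertex cycle: by symmetry every vertex keeps rank 1/3.
r = pagerank_vertex_centric([0, 1, 2], {0: [1], 1: [2], 2: [0]})
# Throughput arithmetic behind Table I, e.g. GraM:
# 1.2 trillion edges / 140 s  ->  about 8.6 GPEPS.
```

By the same arithmetic, ShenTu's 12 trillion edges in 8.5 seconds (Section I) correspond to roughly 1,400 GPEPS, which is why the GPEPS column of Table I is the relevant baseline.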