GraphScope: A Unified Engine For Big Graph Processing

Wenfei Fan§,†, Tao He, Longbin Lai, Xue Li, Yong Li, Zhao Li, Zhengping Qian, Chao Tian, Lei Wang, Jingbo Xu, Youyang Yao, Qiang Yin, Wenyuan Yu, Jingren Zhou, Diwen Zhu, Rong Zhu
Alibaba Group
§ University of Edinburgh
† Shenzhen Institute of Computing Sciences
[email protected]

ABSTRACT
GraphScope is a system and a set of language extensions that enable a new programming interface for large-scale distributed graph computing. It generalizes previous graph processing frameworks (e.g., Pregel, GraphX) and distributed graph databases (e.g., JanusGraph, Neptune) in two important ways: by exposing a unified programming interface to a wide variety of graph computations such as graph traversal, pattern matching, iterative algorithms and graph neural networks within a high-level programming language; and by supporting the seamless integration of a highly optimized graph engine in a general purpose data-parallel computing system. A GraphScope program is a sequential program composed of declarative data-parallel operators, and can be written using standard Python development tools. The system automatically handles the parallelization and distributed execution of programs on a cluster of machines. It outperforms current state-of-the-art systems by enabling a separate optimization (or family of optimizations) for each graph operation in one carefully designed coherent framework. We describe the design and implementation of GraphScope and evaluate system performance using several real-world applications.

PVLDB Reference Format:
Wenfei Fan, Tao He, Longbin Lai, Xue Li, Yong Li, Zhao Li, Zhengping Qian, Chao Tian, Lei Wang, Jingbo Xu, Youyang Yao, Qiang Yin, Wenyuan Yu, Jingren Zhou, Diwen Zhu, Rong Zhu. GraphScope: A Unified Engine For Big Graph Processing. PVLDB, 14(12): 2879 - 2892, 2021.
doi:10.14778/3476311.3476369

PVLDB Artifact Availability:
The source code, data, and/or other artifacts have been made available at https://github.com/alibaba/GraphScope/tree/vldb.

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 14, No. 12 ISSN 2150-8097.
doi:10.14778/3476311.3476369

[Figure 1 labels: Relational Algebra, Linear Algebra, Deep Neural Networks, GraphScope; Koalas, NumPy-like API, TensorFlow API, Unified language (with Graph Library); Graph Traversal, Pattern Matching, Iterative Algorithms, Graph Sampling; Python; Spark, Dask, TensorFlow Runtime, GraphScope Runtime]
Figure 1: The GraphScope system stack, and how it interacts with the PyData ecosystem.

1 INTRODUCTION
Distributed execution engines with high-level language support, such as Koalas [45], Dask [15], and TensorFlow [5], have been widely adopted with great success in the development of modern data-intensive applications. Two factors largely account for the success of these systems. First, they provide developers with easy access to a core subset of domain-specific operators, such as relational join, matrix multiplication, and convolution, and allow further extensions via arbitrary user-defined functions. Second, they adopt the dataflow execution model, which is more scalable than alternative parallel-computing paradigms, such as Parallel Random Access Machines (PRAM) [20], and enables sophisticated optimizations to achieve high performance.

However, the operator semantics supported by these systems are ill-suited to efficiently solving a variety of important problems that require a deeper analysis of heterogeneous data. In such cases, analysis tools involving graph computation are often called for instead. Social network mining, for example, routinely requires ranking and classification algorithms such as PageRank [44], connected components [29], and betweenness centrality [23]. Similarly, identifying fraudulent activities in online payments involves matching complex (subgraph) patterns [48]. And many algorithms [60] commonly used in product and advertisement recommendation boil down to deep learning over graph-structured data.

Given the importance of graph computation, it is reasonable to design a scalable engine for processing large graphs with high-level language support. Ideally, such an engine should allow developers to easily program graph algorithms and naturally exploit graph-specific optimizations, while at the same time maintaining the scalability and efficiency of a dataflow execution model. Unfortunately, it is challenging to efficiently scale diverse graph computation types (Section 2.2) that require different design trade-offs and optimizations tightly coupled with specific programming abstractions. For example, while the popular vertex-centric model [34] works nicely for iterative algorithms, developers still have to derive specializations for random graph walking and pattern matching [24, 61]. This explains why existing graph processing systems [25, 32, 39] are designed for a particular type of graph computation.

In contrast, real-world graph applications are often far more complicated and intertwine many types of graph computation in one single workload. As a result, developers often have to combine multiple systems with potentially very different programming models and runtimes, which gives rise to a number of issues such as managing the complexities of data representation, resource scheduling, and performance tuning across multiple systems. There is thus an urgent need for a unified programming abstraction and runtime that allows developers to write applications in a high-level programming language for a wide range of graph computations.

Furthermore, in web-scale graph analytics, a graph pipeline also includes the construction of an input graph from various sources, as well as the preparation of final results for the downstream tasks to consume. This needs complex data extraction, cleaning, and transformations (such as joins), and often requires excessive data movements and non-trivial interplay among an array of systems (such as Hadoop [37] and Spark [63]). There have been attempts to implement diverse graph computations on top of the dataflow model (GraphX [26]) to ease such inter-operations. However, casting graph computations as a sequence of relational operators can incur significant overheads for problems that require low latency, such as graph traversal [4]. Moreover, as we will show in Section 5, it fails to take advantage of the well-defined semantics of graph computations to enable sophisticated optimizations such as pipelining. As a result, while distributed graph algorithms are already hard to implement efficiently in existing systems, implementing complex graph pipelines becomes even more challenging.

To tackle the aforementioned problems, we propose a unified engine for big graph processing called GraphScope. Figure 1 gives the conceptual overview of the GraphScope system stack. At the bottom is a dataflow runtime that serves as the fabric to compose distributed execution of different graph computations, leveraging all the resources available in a cluster. This execution layer enables a separate optimization (or family of optimizations) for each graph computation in one carefully designed coherent framework, while at the same time it offers a simple and powerful programming interface. Further up the stack, we have developed a graph library with a large fraction of frequently used graph computations. Last but not least, by embedding the language (and hence the graph library) in Python, GraphScope can be integrated with other existing engines to deliver a holistic development experience.

In summary, we make the following contributions:
• A simple and unified programming interface for a wide variety of graph computations, which supports language

GraphScope’s programming interface. Sections 4 and 5 detail our design and implementation of GraphScope. Section 6 describes example applications and Section 7 presents evaluation results. Section 8 discusses related work and we conclude in Section 9.

2 BACKGROUND AND PRELIMINARIES
As mentioned in the introduction, existing large-scale graph computing engines are typically tailored to solve one particular type of graph computation. In this section we provide a categorisation of graph computation and review the limitations of a state-of-the-art framework in unifying graph computation.

2.1 Property Graph Data Model
In graph computation, data is typically represented as a property graph [9], in which vertices and edges can have a set of properties. Every entity (vertex or edge) is identified by a unique identifier (ID), and has a label indicating its type or role. Each property is a key-value pair with the combination of entity ID and property name as the key. Figure 2 shows an example property graph. It contains user, product, and address vertices connected by order, deliver, belongs_to, and home_of edges.

[Figure 2 labels: vertex 1 User {type:buyer, name:Tom}; vertex 2 Product {name:gift, price:$99}; vertex 3 User {type:seller, name:Jack}; vertex 4 Address; edge labels order, deliver, belongs_to, home_of; path]
Figure 2: An example "e-commerce" property graph.

2.2 Categorisation of Graph Computation
While there is no textbook categorisation of graph computation, we propose one based on the long history of research and practice of graph computation. We first
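The property graph model of Section 2.1 can be illustrated with a small in-memory sketch. The code below is not GraphScope's API; the class `PropertyGraph`, its method names, and the string IDs (`v1`, `e1`, ...) are hypothetical names chosen for illustration. It stores each property under the key (entity ID, property name), as the text describes, and populates the vertices of Figure 2. The edge endpoints are assumptions, since the figure's edge directions are not recoverable from the text.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Set, Tuple


@dataclass
class PropertyGraph:
    """Toy property graph (illustrative only, not GraphScope's API)."""
    labels: Dict[str, str] = field(default_factory=dict)      # entity ID -> label
    vertices: Set[str] = field(default_factory=set)           # IDs that are vertices
    endpoints: Dict[str, Tuple[str, str]] = field(default_factory=dict)  # edge ID -> (src, dst)
    properties: Dict[Tuple[str, str], Any] = field(default_factory=dict)  # (entity ID, name) -> value

    def _add_entity(self, entity_id: str, label: str, props: Dict[str, Any]) -> None:
        # Every entity (vertex or edge) has a unique ID and a label.
        if entity_id in self.labels:
            raise ValueError(f"duplicate entity ID: {entity_id}")
        self.labels[entity_id] = label
        # Each property is keyed by (entity ID, property name).
        for name, value in props.items():
            self.properties[(entity_id, name)] = value

    def add_vertex(self, vid: str, label: str, **props: Any) -> None:
        self._add_entity(vid, label, props)
        self.vertices.add(vid)

    def add_edge(self, eid: str, label: str, src: str, dst: str, **props: Any) -> None:
        if src not in self.vertices or dst not in self.vertices:
            raise KeyError("both endpoints must be existing vertices")
        self._add_entity(eid, label, props)
        self.endpoints[eid] = (src, dst)

    def get(self, entity_id: str, name: str) -> Any:
        # Returns None when the entity has no such property.
        return self.properties.get((entity_id, name))


# Vertices and their properties follow Figure 2.
g = PropertyGraph()
g.add_vertex("v1", "User", type="buyer", name="Tom")
g.add_vertex("v2", "Product", name="gift", price="$99")
g.add_vertex("v3", "User", type="seller", name="Jack")
g.add_vertex("v4", "Address")

# Edge labels follow Figure 2; the endpoints below are assumed.
g.add_edge("e1", "order", "v1", "v2")
g.add_edge("e2", "belongs_to", "v2", "v3")
g.add_edge("e3", "deliver", "v2", "v4")
g.add_edge("e4", "home_of", "v4", "v1")
```

Keeping properties in a single map keyed by (entity ID, property name) lets vertices and edges share one lookup path, for example `g.get("v2", "price")`, at the cost of not grouping an entity's properties together.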
