The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing
Siddhartha Sahu, Amine Mhedhbi, Semih Salihoglu, Jimmy Lin, M. Tamer Özsu
David R. Cheriton School of Computer Science
University of Waterloo
{s3sahu,amine.mhedhbi,semih.salihoglu,jimmylin,tamer.ozsu}@uwaterloo.ca

ABSTRACT
Graph processing is becoming increasingly prevalent across many application domains. In spite of this prevalence, there is little research about how graphs are actually used in practice. We conducted an online survey aimed at understanding: (i) the types of graphs users have; (ii) the graph computations users run; (iii) the types of graph software users use; and (iv) the major challenges users face when processing their graphs. We describe the participants' responses to our questions, highlighting common patterns and challenges. We further reviewed user feedback in the mailing lists, bug reports, and feature requests in the source repositories of a large suite of software products for processing graphs. Through our review, we were able to answer some new questions that were raised by participants' responses and identify specific challenges that users face when using different classes of graph software. The participants' responses and data we obtained revealed surprising facts about graph processing in practice. In particular, real-world graphs represent a very diverse range of entities and are often very large, and scalability and visualization are undeniably the most pressing challenges faced by participants. We hope these findings can guide future research.

PVLDB Reference Format:
Siddhartha Sahu, Amine Mhedhbi, Semih Salihoglu, Jimmy Lin, and M. Tamer Özsu. The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing. PVLDB, 11(4): 420-431, 2017.
DOI: https://doi.org/10.1145/3164135.3164139

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 44th International Conference on Very Large Data Bases, August 2018, Rio de Janeiro, Brazil.
Proceedings of the VLDB Endowment, Vol. 11, No. 4
Copyright 2017 VLDB Endowment 2150-8097/17/12... $ 10.00.

1. INTRODUCTION
Graph data representing connected entities and their relationships appear in many application domains, most naturally in social networks, the web, the semantic web, road maps, communication networks, biology, and finance, just to name a few examples. There has been a noticeable increase in the prevalence of work on graph processing both in research and in practice, evidenced by the surge in the number of different commercial and research software products for managing and processing graphs. Examples include graph database systems [3, 8, 14, 35, 48, 53], RDF engines [38, 64, 67], linear algebra software [6, 46], visualization software [13, 16], query languages [28, 52, 55], and distributed graph processing systems [17, 21, 27]. In the academic literature, a large number of publications that study numerous topics related to graph processing regularly appear across a wide spectrum of research venues.

Despite their prevalence, there is little research on how graph data is actually used in practice and the major challenges facing users of graph data, both in industry and research. In April 2017, we conducted an online survey across 89 users of 22 different software products, with the goal of answering 4 high-level questions:
(i) What types of graph data do users have?
(ii) What computations do users run on their graphs?
(iii) Which software do users use to perform their computations?
(iv) What are the major challenges users face when processing their graph data?

Our major findings are as follows:
• Variety: Graphs in practice represent a very wide variety of entities, many of which are not naturally thought of as vertices and edges. Most surprisingly, traditional enterprise data comprising products, orders, and transactions, which is typically seen as the perfect fit for relational systems, appears to be a very common form of data represented in participants' graphs.
• Ubiquity of Very Large Graphs: Many graphs in practice are very large, often containing over a billion edges. These large graphs represent a very wide range of entities and belong to organizations at all scales, from very small enterprises to very large ones. This refutes the sometimes-heard assumption that large graphs are a problem for only a few large organizations such as Google, Facebook, and Twitter.
• Challenge of Scalability: Scalability is unequivocally the most pressing challenge faced by participants. The ability to process very large graphs efficiently seems to be the biggest limitation of existing software.
• Visualization: Visualization is a very popular and central task in participants' graph processing pipelines. After scalability, participants indicated visualization as their second most pressing challenge, tied with challenges in graph query languages.
• Prevalence of RDBMSes: Relational databases still play an important role in managing and processing graphs.
Our survey also highlights other interesting facts, such as the prevalence of machine learning on graph data, e.g., for clustering vertices, predicting links, and finding influential vertices.

We further reviewed user feedback in the mailing lists, bug reports, and feature requests in the source code repositories of 22 software products between January and September of 2017 with two goals: (i) to answer several new questions that the participants' responses raised; and (ii) to identify more specific challenges in different classes of graph technologies than the ones we could identify in participants' responses. For some of the questions in our online survey, we also compared the graph data, computations, and software used by the participants with those studied in academic publications. For this, we reviewed 90 papers from 6 conferences across different academic venues.

In addition to discussing the insights we gained through our study, we discuss several directions about the future of graph processing. We hope our study can inform research about real use cases and important problems in graph processing.

2. METHODOLOGY AND PARTICIPANTS
In this section, we first describe the format of our survey and then how we recruited the participants. Next we describe the demographic information of the participants, including the organizations they come from and their roles in their organizations. Then we describe our methodology of reviewing academic publications. We end this section by describing our methodology for reviewing the user feedback in the mailing lists, bug reports, and feature requests in the source code repositories of the software products.

2.1 Survey Format
The survey was in the format of an online form. All of the questions were optional and participants could skip any number of questions. There were 2 types of questions:
(i) Multiple Choice: There were 3 types of multiple choice questions: (a) yes-no questions; (b) questions that allowed only a single choice as a response; and (c) questions that allowed multiple choices as a response. The participants could use an Other option when their answers required further explanation or did not match any of the provided choices. We randomized the order of choices in questions about the computations participants run and the challenges they face.
(ii) Short Answer: For these questions, the participants entered their responses in a text box.
There were 34 questions grouped into five categories: (i) demographic questions; (ii) graph datasets; (iii) graph and machine learning computations; (iv) graph software; and (v) workload breakdown and major challenges.

2.2 Participant Recruitment
We prepared a list of 22 popular software products for processing graphs (see Table 1) that had public user mailing lists covering 6 types of technologies: graph database systems, RDF engines, distributed graph processing systems (DGPSes), graph libraries

Table 1: Software products used for recruiting participants and the number of active mailing list users from February to April 2017.

Technology                            Software                             # Users   Total
Graph Database System                 ArangoDB [3]                              40     233
                                      Cayley [8]                                14
                                      DGraph [14]                               33
                                      JanusGraph [35]                           32
                                      Neo4j [48]                                69
                                      OrientDB [53]                             45
RDF Engine                            Apache Jena [38]                          87     115
                                      Sparksee [64]                              5
                                      Virtuoso [67]                             23
Distributed Graph Processing Engine   Apache Flink (Gelly) [17]                 24      39
                                      Apache Giraph [21]                         8
                                      Apache Spark (GraphX) [27]                 7
Query Language                        Gremlin [28]                              82      82
Graph Library                         Graph for Scala [22]                       4      97
                                      GraphStream [24]                           8
                                      Graphtool [25]                            28
                                      NetworKit [50]                            10
                                      NetworkX [51]                             27
                                      SNAP [62]                                 20
Graph Visualization                   Cytoscape [13]                            93     116
                                      Elasticsearch (X-Pack Graph) [16]         23
Graph Representation                  Conceptual Graphs [11]                     6       6

• Slack Channels: Two of the software products on our list, Neo4j and Cayley, had Slack channels for their users. We posted the survey to these channels.
• Twitter: A week after posting our survey to the mailing lists and Slack channels and sending private emails, we posted a tweet with a link to our survey to 7 of the 22 software products that had an official Twitter account. Only Neo4j retweeted our tweet.
Participants that we recruited through different methods shared the same online link, so we could not tell the number of participants recruited from each method. In particular, we suspect that there were more users from graph database systems mainly because their mailing lists contained more active users, as can be seen in Table 1. Moreover, 4 of the 5 mailing lists that allowed us to send private emails and the Slack and Twitter channels belonged to graph database systems. We note that after posting the survey on Twitter, we received 12 responses.

In the end, there were 89 participants.