Graph Processing: A Panoramic View and Some Open Problems

M. Tamer Ozsu¨

University of Waterloo David R. Cheriton School of Science https://cs.uwaterloo.ca/~tozsu

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27)1 Graph Theory

Graph Graph Algorithms Systems

Graph Research is Dispersed

Knowledge graphs Semantic Web

Graph Graph Analytics

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27)2 Knowledge graphs Semantic Web

Graph Graph Databases Analytics

Graph Research is Dispersed

Graph Theory

Graph Graph Algorithms Systems

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27)2 Knowledge graphs Graph Theory Semantic Web

Graph Graph AlgorithmsDatabases AnalyticsSystems

Graph Research is Dispersed

Database

Social Data Mining Computing

AI/ML

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27)2 Knowledge graphs Graph Theory Semantic Web

Graph Graph AlgorithmsDatabases AnalyticsSystems

Graph Research is Dispersed

Knowledge graphs Semantic Web

Graph ? Graph Databases Analytics

Database Graph Theory

Social Data Mining Computing Graph Graph Algorithms Systems

AI/ML

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27)2 RDF graph

Property graph

My objectives

Three things... 1 Discuss a way to coherently position work in the various communities; 2 A tour across different communities to provide a panoramic view of the research; 3 Highlight some problems that interest me!

Knowledge graphs Semantic web

Graph Graph Analytics DBMSs Systems

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27)3 My objectives

Three things... 1 Discuss a way to coherently position work in the various communities; 2 A tour across different communities to provide a panoramic view of the research; 3 Highlight some problems that interest me!

RDF graph

Knowledge graphs Semantic web

Graph Graph Analytics DBMSs Systems

Property graph

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27)3 Freud’s Recommendation for a Good Talk...

By Way of Moshe Vardi...

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27)4 Introduction RDF Engines Graph DBMSs Graph Analytics Systems Dynamic & Streaming Graphs Concluding Remarks

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27)5 In the beginning...

Aircraft Engine Right engine ... there was IMS Left engine Body By IBM (along with Fuselage FWD Rockwell & Caterpillar) Section 4 Section 43 For the Apollo program Fuselage AFT First deployed in 1968 Section 46 Empennage Managing Bill of Materials ... (BOM) of Saturn rocket Wings Hierarchical model because Left Wing LW Leading Edge BOM is hierarchical RW Leading Edge Wing Box ...

Right Wing ...

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27)6 In the beginning...

... and IDS By GE To control their manufacturing processes First deployed in 1964 Manufacturing processes (with scheduling constraints) form a graph Led to CODASYL standard

Source: https://dba.stackexchange.com/questions/ 119380/er-vs-database-schema-diagrams

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27)6 D´ej`avu all over again?

Network models were also used in I Object DBMSs I XML

Network (CODASYL)

• 4. The arrow is directed to the member of the corresponding set type.

Consider a more complicated example. Suppose a company produces and sells . In this case all record types are evident: • 1. Computer products that the company sells (PRODUCT);

• 2. Customers who buy the products (CUSTOMER); • 3.Network Representatives who sell the DBMSsproducts (REPRESENT); Network (CODASYL) Data Model • 4. Sales transactions (TRANSACTION);

All set types are also evident: The Data Sets might look as follows: • 1. Product and all transactions which include this company's product (PT); • 1. Products and all transactions which include these company's products (PT); • 2. Customer and all transactionsCODASYL which were done by Languagethis customer (CT); • 3. Representative and all transactions which were done by this representative (RT); • 2. Customer and all transactions which were done by these customers(CT); A current state of the database miIght lookFIND as follows: The withRecords, for keyexample, might be: • 3. Representatives and all transactions which were done by these representatives(RT). • 1. Computer products that the company sells (PRODUCT); 2.3 Data Updating Facilities • 2. Customers who buy theI productsNavigate (CUSTOMER); within the set, within elements of the same We have discussed only the first part of Network Data Model - the data description facilities. • 3. Representatives who sell the products (REPRESENT); record type, etc Each data model also includes particular data manipulation facilities - Data Manipulation • 4. Sales transactions (TRANSACTION); Language (DML). The data manipulation facilities of a concrete DML can be also divided into two parts: • (i) data update functions;

• (ii) data retrieve functions.

The main update functions of a Network data manipulation language are: • 1. To store new occurrences of the record type declared in the current data base schema.

• 2. To modify existing occurrences of the record type declared in the current data base schema.

• 3. To delete existing occurrences of the record type declared in the current data base schema.

Source: Network (CODASYL) Data Model, https://coronet.iicm.tugraz.at/is/scripts/lesson03.pdf• 4. To insert existing occurrences of the record type declared in the current data base schema as a member of a certain data set into the exactly one occurrence of this data set. © M. Tamer Ozsu¨ VLDB 2019 (2019/08/27)7 • 5.To remove existing occurrence of the member of the data set from the occurrence of this data set.

Putting a new record occurrence into a database: D´ej`avu all over again?

Network (CODASYL) Data Model

• 4. The arrow is directed to the member of the corresponding set type.

Consider a more complicated example. Suppose a company produces and sells computers. In this case all record types are evident: • 1. Computer products that the company sells (PRODUCT);

• 2. Customers who buy the products (CUSTOMER); • 3.Network Representatives who sell the DBMSsproducts (REPRESENT); Network (CODASYL) Data Model • 4. Sales transactions (TRANSACTION);

All set types are also evident: The Data Sets might look as follows: • 1. Product and all transactions which include this company's product (PT); • 1. Products and all transactions which include these company's products (PT); • 2. Customer and all transactionsCODASYL which were done by Languagethis customer (CT); • 3. Representative and all transactions which were done by this representative (RT); • 2. Customer and all transactions which were done by these customers(CT); A current state of the database miIght lookFIND as follows: The withRecords, for keyexample, might be: • 3. Representatives and all transactions which were done by these representatives(RT). • 1. Computer products that the company sells (PRODUCT); 2.3 Data Updating Facilities • 2. Customers who buy theI productsNavigate (CUSTOMER); within the set, within elements of the same We have discussed only the first part of Network Data Model - the data description facilities. • 3. Representatives who sell the products (REPRESENT); record type, etc Each data model also includes particular data manipulation facilities - Data Manipulation • 4. Sales transactions (TRANSACTION); Language (DML). The data manipulation facilities of a concrete DML can be also divided into two parts: • (i) data update functions; Network models were also• (ii) used data retrieve in functions. The main update functions of a Network data manipulation language are: I Object DBMSs • 1. To store new occurrences of the record type declared in the current data base schema.

• 2. To modify existing occurrences of the record type declared in the current data base I XML schema.

• 3. To delete existing occurrences of the record type declared in the current data base schema.

Source: Network (CODASYL) Data Model, https://coronet.iicm.tugraz.at/is/scripts/lesson03.pdf• 4. To insert existing occurrences of the record type declared in the current data base schema as a member of a certain data set into the exactly one occurrence of this data set. © M. Tamer Ozsu¨ VLDB 2019 (2019/08/27)7 • 5.To remove existing occurrence of the member of the data set from the occurrence of this data set.

Putting a new record occurrence into a database: Network (CODASYL) Data Model

• 4. The arrow is directed to the member of the corresponding set type.

Consider a more complicated example. Suppose a company produces and sells computers. In this case all record types are evident: • 1. Computer products that the company sells (PRODUCT);

• 2. Customers who buy the products (CUSTOMER); • 3.Network Representatives who sell the DBMSsproducts (REPRESENT); Network (CODASYL) Data Model • 4. Sales transactions (TRANSACTION);

All set types are also evident: The Data Sets might look as follows: • 1. Product and all transactions which include this company's product (PT); • 1. Products and all transactions which include these company's products (PT); • 2. Customer and all transactionsCODASYL which were done by Languagethis customer (CT); • 3. Representative and all transactions which were done by this representative (RT); • 2. Customer and all transactions which were done by these customers(CT); A current state of the database miIght lookFIND as follows: The withRecords, for keyexample, might be: • 3. Representatives and all transactions which were done by these representatives(RT). • 1. Computer products that the company sells (PRODUCT); 2.3 Data Updating Facilities • 2. Customers who buy theI productsNavigateD´ej`avu (CUSTOMER); within the all set, over within elements again? of the same We have discussed only the first part of Network Data Model - the data description facilities. • 3. Representatives who sell the products (REPRESENT); record type, etc Each data model also includes particular data manipulation facilities - Data Manipulation • 4. Sales transactions (TRANSACTION); Language (DML). The data manipulation facilities of a concrete DML can be also divided into two parts: • (i) data update functions; Network models were also• (ii) used data retrieve in functions. The main update functions of a Network data manipulation language are: I Object DBMSs • 1. To store new occurrences of the record type declared in the current data base schema.

• 2. To modify existing occurrences of the record type declared in the current data base I XML schema.

• 3. To delete existing occurrences of the record type declared in the current data base schema.

Source: Network (CODASYL) Data Model, https://coronet.iicm.tugraz.at/is/scripts/lesson03.pdf• 4. To insert existing occurrences of the record type declared in the current data base schema as a member of a certain data set into the exactly one occurrence of this data set. © M. Tamer Ozsu¨ VLDB 2019 (2019/08/27)7 • 5.To remove existing occurrence of the member of the data set from the occurrence of this data set.

Putting a new record occurrence into a database: Modern graphs are different and diverse

Trade volumes and Internet connections Social networks

Linked Audio LOV User Slideshare tags2con Feedback 2RDF delicious Moseley Scrobbler Bricklink Sussex GTAA Folk (DBTune) Reading St. Magna- Klapp- Lists Andrews tune stuhl- Resource NTU DB club Lists Resource Lotico Semantic yovisto Tropes John Music Man- Lists Music Tweet chester Hellenic Peel Brainz NDL Brainz Reading FBD (DBTune) (Data subjects EUTC (zitgist) Lists Open Incubator) Linked t4gm Hellenic Produc- Library Open Surge Crunch- info PD tions RDF Discogs base Library Radio ohloh Ontos Source Code Crime (Data Plymouth (Talis) News Ecosystem Reading RAMEAU LEM Reports business Incubator) Portal Lists Crime data.gov. Music Jamendo Linked Data SH UK (En- uk Brainz (DBtune) LinkedL Ox AKTing) ntnusc FanHubz gnoss SSW CCN Points (DBTune) Last.FM Poké- Thesaur Thesau- Popula- artists pédia Didactal us rus W LIBRIS tion (En- patents (DBTune) Last.FM ia theses. LCSH Rådata reegle AKTing) research MARC data.gov. data.go (rdfize) my fr Codes nå! NHS Good- Ren. uk v.uk Experi- List (En- Classical win flickr Energy (DB Pokedex ment Norwe- AKTing) Mortality Family wrappr Sudoc PSH Genera- BBC Tune) gian tors (En- Program AKTing) MeSH mes semantic IdRef GND CO2 educatio OpenEI BBC web.org Energy SW Sudoc ndlna Emission n.data.g Music (En- Chronic- Linked Dog VIAF EEA (En- ov.uk UB AKTing) ling Event MDB Portu- Food AKTing) guese Mann- Europeana BBC America Media DBpedia Calames heim Wildlife Deutsche Open Ord- Recht- Revyu DDC Openly Finder Bio- Election nance spraak. lobid RDF graphie NSZL Data legislation Survey Local nl data Ulm Resources Swedish EU Tele- New Book Project data.gov.uk Catalog Open Insti- graphis York bnf.fr Open Mashup Cultural tutions Times URI Greek P20 UK Post- Heritage Burner DBpedia Calais codes statistics ECS Wiki lobid data.gov. Taxon South- GovWILD LOIUS iServe BNB Organi- Brazilian uk Concept ECS ampton sations Geo World OS BibBase STW GESIS Poli- ESD South- ECS Names Fact- (RKB ticians stan- reference ampton data.gov.uk book Freebase Explorer) Budapest dards data.gov. EPrints intervals NASA Lichfield uk (Data Project OAI transport DBpedia data Pisa Incu- Guten- Spen- data.gov. dcs RESEX Scholaro- bator) Fishes berg DBLP ISTAT ding uk DBLP meter Scotland Geo Immi- of Texas (FU (L3S) Pupils & Uberblic DBLP gration Species Berlin) IRIT Exams Euro- dbpedia data- TCM (RKB London stat open- ACM lite Gene Explorer) IBM NVD Traffic Gazette (FUB) Geo ac-uk Eurostat DIT Scotland TWC LOGD Linked Daily UN/ Data UMBEL ERA Data Med LOCODE DEPLOY Gov.ie YAGO CORDIS lingvoj Disea- New- (RKB some SIDER RAE2001 castle LOCAH CORDIS Explorer) Linked Eurécom Eurostat Drug CiteSeer Roma (FUB) Sensor Data GovTrack (Ontology (Kno.e.sis) Open Bank Pfam Course- Central) riese Enipedia Cyc Lexvo LinkedCT ware Linked UniProt PDB VIVO EDGAR EURES US SEC Indiana ePrints dotAC (Ontology IEEE (rdfabout) totl.net Central) WordNet RISKS UniProt US Census (VUA) Taxono EUNIS Twarql (Bio2RDF) HGNC Semantic (rdfabout) Cornetto my VIVO PRO- FTS XBRL ProDom STITCH Cornell LAAS SITE Scotland KISTI NSF LODE Geo- GeoWord graphy Net WordNet WordNet JISC (W3C) Climbing Linked (RKB Affy- KEGG SMC Explorer) SISVU Pub Drug VIVO UF Piedmont GeoData metrix ECCO- Finnish Journals PubMed Gene SGD Chem Accomo- TCP Munici- El AGROV Ontology Media dations Alpine bible palities Viajero OC Tourism Ski KEGG ontology Austria Ocean Enzyme PBAC Geographic GEMET ChEMBL Italian Drilling Metoffice KEGG Open OMIM public Codices AEMET Weather Linked MGI Pathway Forecasts Data InterPro GeneID Publications schools Thesau- Open KEGG Turismo EARTh rus Colors Reaction de User-generated content Zaragoza Product Smart KEGG Weather DB Link Medi Glycan Janus Stations Care Product UniParc UniRef KEGG AMP Types Italian UniSTS Com- Government Yahoo! Homolo Airports Ontology Museums Google pound Geo Gene Art National Chem2 Cross-domain Planet wrapper Radio- Bio2RDF activity UniPath JP Sears Open Linked OGOLOD way Life sciences Corpo- Amster- Reactome medu- Open rates dam Museum cator Numbers As of September 2011 Biological networks Linked data Road network

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27)8 Graph Usage Study [Sahu et al., 2017, 2019] Objectives

1 What kind of graph data, computations, software, and major challenges real users have in practice? 2 What types of graph data, computations, software, and major challenges researchers target in publications? Methodology I Online survey I Personal interviews 89 participants: 36 researchers; 4 interviews with survey 53 industry participants 22 graph software products 4 additional in-person interviews: 2 developers and 2 users I Review of academic publications 7 conferences, 3 yrs for each I Applications from white papers 252 papers 4 graph DBMSs + 4 RDF engines I Review of emails, bug reports, and feature requests 12 applications from graph DBMSs + 5 from RDF engines over 6000 emails and issues (with overlap)

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27)9 Q1. Which real world entities do your graphs represent? Q2. Which non-human entities do your graphs represent?

2 Graphs are indeed very large! 3 ML on graphs is very popular! At least 68% of respondents use ML workload 4 Scalability is the most pressing challenge! Followed by visualization & query languages

5 RDBMS still play an important role!

Major Findings [Sahu et al., 2017, 2019]

1 Graphs are indeed everywhere!

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 10 Q2. Which non-human entities do your graphs represent?

2 Graphs are indeed very large! 3 ML on graphs is very popular! At least 68% of respondents use ML workload 4 Scalability is the most pressing challenge! Followed by visualization & query languages

5 RDBMS still play an important role!

Major Findings [Sahu et al., 2017, 2019]

1 Graphs are indeed everywhere! Entity #Part. Humans 44 Q1. Which real world entities do your (e.g., customers, graphs represent? friends) Non-human entities 61 (e.g., web, products, files) RDF 23 Scientific 15 (e.g., proteins, molecules, bonds)

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 10 2 Graphs are indeed very large! 3 ML on graphs is very popular!#Papers At least 68% of respondents30 use ML workload 8 2 4 Scalability is the most pressing11 challenge! 2 Followed by visualization &0 query languages 3

5 RDBMS still play an important role!

Major Findings [Sahu et al., 2017, 2019]

1 Graphs are indeed everywhere! Entity #Part. Humans 44 Q1. Which real world entities do your (e.g., customers, graphs represent? friends) Q2. Which non-human entities do your Non-human entities 61 graphs represent? (e.g., web, products, files) RDF 23 Entity #Part. Scientific 15 Web 4 (e.g., proteins, Bus. & Financial 8 molecules, bonds) Product 13 Geo 7 Infrastructure 9 Digital 5 Knowledge 11

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 10 2 Graphs are indeed very large! 3 ML on graphs is very popular! At least 68% of respondents use ML workload 4 Scalability is the most pressing challenge! Followed by visualization & query languages

5 RDBMS still play an important role!

Major Findings [Sahu et al., 2017, 2019]

1 Graphs are indeed everywhere! Entity #Part. Humans 44 Q1. Which real world entities do your (e.g., customers, graphs represent? friends) Q2. Which non-human entities do your Non-human entities 61 graphs represent? (e.g., web, products, files) RDF 23 Entity #Part. #Papers Scientific 15 Web 4 30 (e.g., proteins, Bus. & Financial 8 8 molecules, bonds) Product 13 2 Geo 7 11 Infrastructure 9 2 Digital 5 0 Knowledge 11 3

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 10 3 ML on graphs is very popular! At least 68% of respondents use ML workload 4 Scalability is the most pressing challenge! Followed by visualization & query languages

5 RDBMS still play an important role!

Major Findings [Sahu et al., 2017, 2019]

1 E #Part. Graphs are indeed everywhere! | | <10K 23 Q1. Which real world entities do your 10K–100K 22 graphs represent? 100K–1M 13 Q2. Which non-human entities do your 1M–10M 9 graphs represent? 10M–100M 21 2 Graphs are indeed very large! 100M–1B 21 >1B 20

#Bytes #Part. <10MB 23 10MB–100MB 22 100MB–1GB 13 1GB–10GB 9 10GB–100GB 21 100GB–1TB 21 >1TB 20

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 10 4 Scalability is the most pressing challenge! Followed by visualization & query languages

5 RDBMS still play an important role!

Major Findings [Sahu et al., 2017, 2019]

1 Graphs are indeed everywhere! Q1. Which real world entities do your graphs represent? Q2. Which non-human entities do your graphs represent?

2 Graphs are indeed very large! 3 ML on graphs is very popular! At least 68% of respondents use ML workload

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 10 5 RDBMS still play an important role!

Major Findings [Sahu et al., 2017, 2019]

1 Graphs are indeed everywhere! Q1. Which real world entities do your graphs represent? Q2. Which non-human entities do your graphs represent?

2 Graphs are indeed very large! 3 ML on graphs is very popular! At least 68% of respondents use ML workload 4 Scalability is the most pressing challenge! Followed by visualization & query languages

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 10 Major Findings [Sahu et al., 2017, 2019]

1 Graphs are indeed everywhere! Q1. Which real world entities do your graphs represent? Q2. Which non-human entities do your graphs represent?

2 Graphs are indeed very large! 3 ML on graphs is very popular! At least 68% of respondents use ML workload 4 Scalability is the most pressing challenge! Followed by visualization & query languages

5 RDBMS still play an important role!

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 10 More Info – Please read the papers

The VLDB Journal https://doi.org/10.1007/s00778-019-00548-x

SPECIAL ISSUE PAPER The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing The ubiquity of large graphs and surprising challenges of graph processing: extended survey Siddhartha Sahu, Amine Mhedhbi, Semih Salihoglu, Jimmy Lin, M. Tamer Özsu David R. Cheriton School of Computer Science Siddhartha Sahu1 · Amine Mhedhbi1 · Semih Salihoglu1 · Jimmy Lin1 · M. Tamer Özsu1 University of Waterloo {s3sahu,amine.mhedhbi,semih.salihoglu,jimmylin,tamer.ozsu}@uwaterloo.ca Received: 21 January 2019 / Revised: 9 May 2019 / Accepted: 13 June 2019 © Springer-Verlag GmbH Germany, part of Springer Nature 2019 ABSTRACT 52, 55], and distributed graph processing systems [17, 21, 27]. In Graph processing is becoming increasingly prevalent across many the academic literature, a large number of publications that study Abstract application domains. In spite of this prevalence, there is little re- numerous topics related to graph processing regularly appear across a wide spectrum of research venues. Graph processing is becoming increasingly prevalent across many application domains. In spite of this prevalence, there is little search about how graphs are actually used in practice. We conducted research about how graphs are actually used in practice. We performed an extensive study that consisted of an online survey an online survey aimed at understanding: (i) the types of graphs Despite their prevalence, there is little research on how graph data users have; (ii) the graph computations users run; (iii) the types is actually used in practice and the major challenges facing users of 89 users, a review of the mailing lists, source repositories, and white papers of a large suite of graph software products, of graph software users use; and (iv) the major challenges users of graph data, both in industry and research. In April 2017, we and in-person interviews with 6 users and 2 developers of these products. Our online survey aimed at understanding: (i) the face when processing their graphs. We describe the participants’ conducted an online survey across 89 users of 22 different software types of graphs users have; (ii) the graph computations users run; (iii) the types of graph software users use; and (iv) the major products, with the goal of answering 4 high-level questions: responses to our questions highlighting common patterns and chal- challenges users face when processing their graphs. We describe the participants’ responses to our questions highlighting lenges. We further reviewed user feedback in the mailing lists, bug (i) What types of graph data do users have? common patterns and challenges. Based on our interviews and survey of the rest of our sources, we were able to answer some reports, and feature requests in the source repositories of a large (ii) What computations do users run on their graphs? suite of software products for processing graphs. Through our re- (iii) Which software do users use to perform their computations? new questions that were raised by participants’ responses to our online survey and understand the specific applications that view, we were able to answer some new questions that were raised (iv) What are the major challenges users face when processing their use graph data and software. Our study revealed surprising facts about graph processing in practice. In particular, real-world by participants’ responses and identify specific challenges that users graph data? graphs represent a very diverse range of entities and are often very large, scalability and visualization are undeniably the face when using different classes of graph software. The partici- Our major findings are as follows: most pressing challenges faced by participants, and data integration, recommendations, and fraud detection are very popular pants’ responses and data we obtained revealed surprising facts Variety: Graphs in practice represent a very wide variety of enti- applications supported by existing graph software. We hope these findings can guide future research. about graph processing in practice. In particular, real-world graphs • represent a very diverse range of entities and are often very large, ties, many of which are not naturally thought of as vertices and and scalability and visualization are undeniably the most pressing edges. Most surprisingly, traditional enterprise data comprised Keywords User survey · Graph processing · Graph databases · RDF systems challenges faced by participants. We hope these findings can guide of products, orders, and transactions, which are typically seen as future research. the perfect fit for relational systems, appear to be a very common form of data represented in participants’ graphs. PVLDB Reference Format: Ubiquity of Very Large Graphs: Many graphs in practice are 1 Introduction prevalence of work on graph processing both in research and Siddhartha Sahu, Amine Mhedhbi, Semih Salihoglu, Jimmy Lin, and M. • very large, often containing over a billion edges. These large Tamer Özsu. The Ubiquity of Large Graphs and Surprising Challenges of in practice, evidenced by the surge in the number of different Graph Processing. PVLDB, 11(4): xxxx-yyyy, 2017. graphs represent a very wide range of entities and belong to Graph data representing connected entities and their relation- commercial and research software for managing and pro- DOI: https://doi.org/10.1145/3164135.3164139 organizations at all scales from very small enterprises to very large ones. This refutes the sometimes heard assumption that ships appear in many application domains, most naturally in cessing graphs. Examples include systems large graphs are a problem for only a few large organizations social networks, the Web, the Semantic Web, road maps, [13,20,26,49,65,73,90], RDF engines [52,96], linear alge- 1. INTRODUCTION such as Google, Facebook, and Twitter. communication networks, biology, and finance, just to name bra software [17,63], visualization software [25,29], query Graph data representing connected entities and their relationships ap- Challenge of Scalability: Scalability is unequivocally the most a few examples. There has been a noticeable increase in the languages [41,72,78], and distributed graph processing sys- pear in many application domains, most naturally in social networks, • pressing challenge faced by participants. The ability to process tems [30,34,40]. In the academic literature, a large number of the web, the semantic web, road maps, communication networks, very large graphs efficiently seems to be the biggest limitation Electronic supplementary material The online version of this article biology, and finance, just to name a few examples. There has been of existing software. publications that study numerous topics related to graph pro- a noticeable increase in the prevalence of work on graph process- (https://doi.org/10.1007/s00778-019-00548-x) contains cessing regularly appear across a wide spectrum of research Visualization: Visualization is a very popular and central task supplementary material, which is available to authorized users. ing both in research and in practice, evidenced by the surge in the • in participants’ graph processing pipelines. After scalability, venues. number of different commercial and research software for man- participants indicated visualization as their second most pressing Siddhartha Sahu Despite their prevalence, there is little research on how aging and processing graphs. Examples include graph database B challenge, tied with challenges in graph query languages. [email protected] systems [3,8,14,35,48,53], RDF engines [38,64,67], linear algebra graph data are actually used in practice and the major chal- Prevalence of RDBMSes: Relational databases still play an software [6,46], visualization software [13,16], query languages [28, • Amine Mhedhbi lenges facing users of graph data, both in industry and in important role in managing and processing graphs. [email protected] research. In April 2017, we conducted an online survey across Permission to make digital or hard copies of all or part of this work for Our survey also highlights other interesting facts, such as the preva- personal or classroom use is granted without fee provided that copies are lence of machine learning on graph data, e.g., for clustering vertices, Semih Salihoglu 89 users of 22 different software products, with the goal of not made or distributed for profit or commercial advantage and that copies [email protected] bear this notice and the full citation on the first page. To copy otherwise, to predicting links, and finding influential vertices. answering 4 high-level questions: republish, to post on servers or to redistribute to lists, requires prior specific We further reviewed user feedback in the mailing lists, bug re- Jimmy Lin permission and/or a fee. Articles from this volume were invited to present ports, and feature requests in the source code repositories of 22 [email protected] their results at The 44th International Conference on Very Large Data Bases, software products between January and September of 2017 with (i) What types of graph data do users have? August 2018, Rio de Janeiro, Brazil. two goals: (i) to answer several new questions that the participants’ M. Tamer Özsu (ii) What computations do users run on their graphs? Proceedings of the VLDB Endowment, Vol. 11, No. 4 [email protected] Copyright 2017 VLDB Endowment 2150-8097/17/12... $ 10.00. responses raised; and (ii) to identify more specific challenges in (iii) Which software do users use to perform their computa- different classes of graph technologies than the ones we could iden- DOI: https://doi.org/10.1145/3164135.3164139 1 University of Waterloo, Waterloo, Canada tions? 123

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 11 Graph Characteristics (Not complete) Degree distribution; max degree Diameter Global/local density Connectedness Directed/undirected Weighted/unweighted Homogeneous/heterogeneous Simple/multi/hyper

How I Think of This Domain!...

Graph Types Graph Dynamism Algorithm Types Workload Types

RDF Property Static Dynamic Streaming Online Analytics Graphs Graphs Graphs Graphs Graphs Queries Workloads

Offline Online Dynamic

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 12 Graph Characteristics (Not complete) Degree distribution; max degree Diameter Global/local density Connectedness Directed/undirected Weighted/unweighted Homogeneous/heterogeneous Simple/multi/hyper

How I Think of This Domain!...

Graph Types Graph Dynamism Algorithm Types Workload Types

RDF Property Static Dynamic Streaming Online Analytics Graphs Graphs Graphs Graphs Graphs Queries Workloads

Offline Online Dynamic

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 12 How I Think of This Domain!...

Graph Types Graph Dynamism Algorithm Types Workload Types

RDF Property Static Dynamic Streaming Online Analytics Graphs Graphs Graphs Graphs Graphs Queries Workloads

Graph Characteristics (Not complete) Degree distribution; maxOffline degree Online Dynamic Diameter Global/local density Connectedness Directed/undirected Weighted/unweighted Homogeneous/heterogeneous Simple/multi/hyper

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 12 Graph Characteristics (Not complete) Degree distribution; max degree Diameter Global/local density Connectedness Directed/undirected Weighted/unweighted Homogeneous/heterogeneous Simple/multi/hyper

How I Think of This Domain!...

Graph Types Graph Dynamism Algorithm Types Workload Types

RDF Property Static Dynamic Streaming Online Analytics Graphs Graphs Graphs Graphs Graphs Queries Workloads

Offline Online Dynamic

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 12 Graph Characteristics (Not complete) Degree distribution; max degree Diameter Global/local density Connectedness Directed/undirected Weighted/unweighted Homogeneous/heterogeneous Simple/multi/hyper

How I Think of This Domain!...

Graph Types Graph Dynamism Algorithm Types Workload Types

RDF Property Static Dynamic Streaming Online Analytics Graphs Graphs Graphs Graphs Graphs Queries Workloads

Offline Online Dynamic

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 12 Graph Characteristics (Not complete) Degree distribution; max degree Diameter Global/local density Connectedness Directed/undirected Weighted/unweighted Homogeneous/heterogeneous Simple/multi/hyper

How I Think of This Domain!...

Graph Types Graph Dynamism Algorithm Types Workload Types

RDF Property Static Dynamic Streaming Online Analytics Graphs Graphs Graphs Graphs Graphs Queries Workloads

Offline Online Dynamic

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 12 Example Design Points

Graph Type Graph Dynamism Algorithm Types Workload Types

RDF Property Static Dynamic Streaming Online Analytics Graphs Graphs Graphs Graphs Graphs Queries Workloads

Offline Online Dynamic

Compute the query result over the graph as it exists.

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 13 Example Design Points

Graph Type Graph Dynamism Algorithm Types Workload Types

RDF Property Static Dynamic Streaming Online Analytics Graphs Graphs Graphs Graphs Graphs Queries Workloads

Offline Online Dynamic

Compute the query result over the graph incrementally.

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 13 Example Design Points

Graph Type Graph Dynamism Algorithm Types Workload Types

RDF Property Static Dynamic Streaming Online Analytics Graphs Graphs Graphs Graphs Graphs Queries Workloads

Offline Online Dynamic

Perform the analytic computation from scratch on each snapshot.

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 13 Example Design Points – Not all alternatives make sense

Graph Type Graph Dynamism Algorithm Types Workload Types

RDF Property Static DynamicStreaming Online Analytics Graphs Graphs Graphs Graphs Graphs Queries Workloads

Offline Online Dynamic

Batch Dynamic

Dynamic (or batch-dynamic) algorithms do not make sense for static graphs.

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 14 Alternative Classification

Inputs Input ingestion Generative model Queued/non-queued ... Input data Graph type Graph characteristics Graph dynamism ... Input workload ... Processing Algorithms ... Output Output generation (or release) time Output type Output interface

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 15 Graph System Architectural Design Decisions

Disk-based vs memory-based Most graph analytics systems are memory-based Others are mixed Scale-up vs scale-out Controversial point discussed next Computing paradigm A number of alternatives exist Discussed separately for each type of system

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 16 Scale-out: Parallel (cluster) execution

DEPARTMENT: Big Data Bites DEPARTMENT: Big Data Bites Dataset |V | |E| Regular size Single Machine∗ Live Journal 4,847,571 68,993,773 1.08GB 6.3GB ScaleUSA Road Up or Scale Out23,947,347 for 58,333,344 Response951MB to “Scale9.09GB Up or Scale-outTwitter is the only41,652,230 way1,468,365,182 to go!... 26GB 128 GB GraphUK0705 Processing?82,240,700 2,829,101,180 Scale48GB Out for Graph247GB ThereWorld is Roadno way to682,496,072 deal with the717,016,716 emergingProcessing” real15GB graph 194GB sizes on single Jimmy Lin This column explores a simple question: scale up or UniversityCommonCrawl2014 of Waterloo 1,727,000,000 64,422,000,000 1.3TB Out of memory (ordinary) machinesscale out for graph processing? Should we simply ∗ throw “beefier” individual multi-core, large-memory Using (PowerLyra) Semih Salihoglu In this article, the authors provide their views on machines at graph processing tasks and focus on developing more efficient multi- University of Waterloo whether organizations should scale up or scale out threaded algorithms, or are investments in distributed graph processing frameworks and M. Tamer Özsu their graph computations. This question was explored accompanying algorithms worthwhile? For rhetorical convenience, I adopt customary University of Waterloo in a previous installment of this column by Jimmy Lin, definitions, referring to the former as scale up and the latter as scale out. Under what Editor: Jimmy Lin where he made a case for scale-up through several circumstances should we prefer one approach over the other? [email protected] examples. In response, the authors discuss three cases for scale-out. Whether we should scale up or out for graph processing is a consequential question from two perspectives: For big data practitioners, it would be desirable to develop a set of best practices that provide guidance to organizations building and deploying graph-processing capabilities. For Our colleague Jimmy Lin in the University of Waterloo’s Data Systems Group wrote an article big data researchers, these best practices translate into priorities for future work and provide a for this department giving his perspective on whether organizations should scale up or scale out roadmap prioritizing real-world pain points. for graph analytics.1 Similarly to that article, for rhetorical convenience, we use “scale up” to re- tl;dr – I advocate scale-up solutions for graph processing as the first thing to try, since they are fer to using software running on multicore large-memory machines and “scale out” to refer to much simpler to design, implement, deploy, and maintain. If you really need distributed scale- using distributed software running on multiple machines. out solutions, it means that your organization has become immensely successful. Not just “suc- It is difficult to disagree with the central message of Jimmy’s article: For many organizations cessful” but on a growth trajectory that is outpacing Moore’s Law (it’ll become clear what this that have large-scale graphs and want to run analytical computations, using a multicore single means below). This is unlikely to be the case for most organizations, and even if you were fortu- machine with a lot of RAM is a better option than a distributed cluster because single-machine nate enough to experience such explosive growth, success would likely bring commensurate re- software, compared to distributed software, is easier to develop in-house or use out of the box, is sources to throw at the problem. So as long as you leave yourself enough headroom, it should be often more efficient, and is easier to maintain. This is indeed true, and for the social-network possible to build a scale-out graph processing solution just in time. The alternative is sunk costs graphs and the computations discussed in that article—e.g., a search for a diamond structure or in distributed graph processing infrastructure, which comes with huge parallelization overheads, an online random-walk computation for recommendations—scale-up is likely the better ap- anticipating a problem that never arrives. proach. However, Jimmy’s article gave the impression that only a handful of applications require It makes sense to begin by more carefully describing the scope of the problem: the type of graph scale-out computing, and it failed to highlight several common scenarios in which scale-out is processing I am referring to involves analytical queries to extract insights or to power data prod- necessary. ucts. An example of the former might be clustering a large social network to infer latent commu- In this response article, we discuss three cases for scale-out: nities of interest. An example of the latter might be building a graph-based recommendation system. Typically, these queries require traversing large portions of the graph where throughput, • Trillion-edge-size graphs. Several application domains, such as finance, retail, e-com- not latency, is the more important performance consideration. merce, telecommunications, and scientific computing, have naturally appearing graphs Gartner defines “operational” databases as “relational and non-relational DBMS products suita- at the trillion-edge-size scale. ble for a broad range of enterprise-level transactional applications, and DBMS products support- ing interactions and observations as alternative types of transactions.” I am specifically not

IEEE Internet Computing Published by the IEEE Computer Society IEEE Internet Computing Published by the IEEE Computer Society May/June 2018 72 1089-7801/18/$33.00 ©2018 IEEE September/October 2018 18 1089-7801/18/$33.00 ©2018 IEEE

Scale-up or Scale-out? Scale-up: Single machine execution Graph datasets are small and can fit in a single machine – even in main memory Single machine avoids parallel execution complexities (multithreading is a different issue)

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 17

DEPARTMENT: Big Data Bites DEPARTMENT: Big Data Bites

Scale-outScale Up isor Scale the onlyOut for way to go!... Response to “Scale Up or Graph Processing? Scale Out for Graph There is no way to deal with the emergingProcessing” real graph sizes on single Jimmy Lin This column explores a simple question: scale up or University of Waterloo (ordinary) machinesscale out for graph processing? Should we simply throw “beefier” individual multi-core, large-memory Semih Salihoglu In this article, the authors provide their views on machines at graph processing tasks and focus on developing more efficient multi- University of Waterloo whether organizations should scale up or scale out threaded algorithms, or are investments in distributed graph processing frameworks and M. Tamer Özsu their graph computations. This question was explored accompanying algorithms worthwhile? For rhetorical convenience, I adopt customary University of Waterloo in a previous installment of this column by Jimmy Lin, definitions, referring to the former as scale up and the latter as scale out. Under what Editor: Jimmy Lin where he made a case for scale-up through several circumstances should we prefer one approach over the other? [email protected] examples. In response, the authors discuss three cases for scale-out. Whether we should scale up or out for graph processing is a consequential question from two perspectives: For big data practitioners, it would be desirable to develop a set of best practices that provide guidance to organizations building and deploying graph-processing capabilities. For Our colleague Jimmy Lin in the University of Waterloo’s Data Systems Group wrote an article big data researchers, these best practices translate into priorities for future work and provide a for this department giving his perspective on whether organizations should scale up or scale out roadmap prioritizing real-world pain points. for graph analytics.1 Similarly to that article, for rhetorical convenience, we use “scale up” to re- tl;dr – I advocate scale-up solutions for graph processing as the first thing to try, since they are fer to using software running on multicore large-memory machines and “scale out” to refer to much simpler to design, implement, deploy, and maintain. If you really need distributed scale- using distributed software running on multiple machines. out solutions, it means that your organization has become immensely successful. Not just “suc- It is difficult to disagree with the central message of Jimmy’s article: For many organizations cessful” but on a growth trajectory that is outpacing Moore’s Law (it’ll become clear what this that have large-scale graphs and want to run analytical computations, using a multicore single means below). This is unlikely to be the case for most organizations, and even if you were fortu- machine with a lot of RAM is a better option than a distributed cluster because single-machine nate enough to experience such explosive growth, success would likely bring commensurate re- software, compared to distributed software, is easier to develop in-house or use out of the box, is sources to throw at the problem. So as long as you leave yourself enough headroom, it should be often more efficient, and is easier to maintain. This is indeed true, and for the social-network possible to build a scale-out graph processing solution just in time. The alternative is sunk costs graphs and the computations discussed in that article—e.g., a search for a diamond structure or in distributed graph processing infrastructure, which comes with huge parallelization overheads, an online random-walk computation for recommendations—scale-up is likely the better ap- anticipating a problem that never arrives. proach. However, Jimmy’s article gave the impression that only a handful of applications require It makes sense to begin by more carefully describing the scope of the problem: the type of graph scale-out computing, and it failed to highlight several common scenarios in which scale-out is processing I am referring to involves analytical queries to extract insights or to power data prod- necessary. ucts. An example of the former might be clustering a large social network to infer latent commu- In this response article, we discuss three cases for scale-out: nities of interest. An example of the latter might be building a graph-based recommendation system. Typically, these queries require traversing large portions of the graph where throughput, • Trillion-edge-size graphs. Several application domains, such as finance, retail, e-com- not latency, is the more important performance consideration. merce, telecommunications, and scientific computing, have naturally appearing graphs Gartner defines “operational” databases as “relational and non-relational DBMS products suita- at the trillion-edge-size scale. ble for a broad range of enterprise-level transactional applications, and DBMS products support- ing interactions and observations as alternative types of transactions.” I am specifically not

IEEE Internet Computing Published by the IEEE Computer Society IEEE Internet Computing Published by the IEEE Computer Society May/June 2018 72 1089-7801/18/$33.00 ©2018 IEEE September/October 2018 18 1089-7801/18/$33.00 ©2018 IEEE

Scale-up or Scale-out? Scale-up: Single machine execution Scale-out: Parallel (cluster) execution Graph data sets grow when they are expanded to their storage formats Workstations of appropriate size are still expensive Some graphs are very large: Billions of vertices, hundreds of billions of edges Dataset size may not be the only factor ⇒ parallelizing computation is important Applications may operate in a distributed environment Downside: graph partitioning is difficult

Dataset |V | |E| Regular size Single Machine∗ Live Journal 4,847,571 68,993,773 1.08GB 6.3GB USA Road 23,947,347 58,333,344 951MB 9.09GB Twitter 41,652,230 1,468,365,182 26GB 128 GB UK0705 82,240,700 2,829,101,180 48GB 247GB World Road 682,496,072 717,016,716 15GB 194GB CommonCrawl2014 1,727,000,000 64,422,000,000 1.3TB Out of memory ∗ Using (PowerLyra)

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 17

DEPARTMENT:Dataset Big Data Bites |V | |E| DEPARTMENT:Regular Big size Data Bites Single Machine∗ Live Journal 4,847,571 68,993,773 1.08GB 6.3GB USA Road 23,947,347 58,333,344 951MB 9.09GB ScaleTwitter Up or Scale Out41,652,230 for 1,468,365,182 Response26GB to “Scale128 GB Up or GraphUK0705 Processing?82,240,700 2,829,101,180 Scale48GB Out for Graph247GB World Road 682,496,072 717,016,716 15GB 194GB CommonCrawl2014 1,727,000,000 64,422,000,000 Processing”1.3TB Out of memory Jimmy Lin This column explores a simple question: scale up or University of Waterloo ∗ scale out for graph processing? Should we simply Using (PowerLyra) throw “beefier” individual multi-core, large-memory Semih Salihoglu In this article, the authors provide their views on machines at graph processing tasks and focus on developing more efficient multi- University of Waterloo whether organizations should scale up or scale out threaded algorithms, or are investments in distributed graph processing frameworks and M. Tamer Özsu their graph computations. This question was explored accompanying algorithms worthwhile? For rhetorical convenience, I adopt customary University of Waterloo in a previous installment of this column by Jimmy Lin, definitions, referring to the former as scale up and the latter as scale out. Under what Editor: Jimmy Lin where he made a case for scale-up through several circumstances should we prefer one approach over the other? [email protected] examples. In response, the authors discuss three cases for scale-out. Whether we should scale up or out for graph processing is a consequential question from two perspectives: For big data practitioners, it would be desirable to develop a set of best practices that provide guidance to organizations building and deploying graph-processing capabilities. For Our colleague Jimmy Lin in the University of Waterloo’s Data Systems Group wrote an article big data researchers, these best practices translate into priorities for future work and provide a for this department giving his perspective on whether organizations should scale up or scale out roadmap prioritizing real-world pain points. for graph analytics.1 Similarly to that article, for rhetorical convenience, we use “scale up” to re- tl;dr – I advocate scale-up solutions for graph processing as the first thing to try, since they are fer to using software running on multicore large-memory machines and “scale out” to refer to much simpler to design, implement, deploy, and maintain. If you really need distributed scale- using distributed software running on multiple machines. out solutions, it means that your organization has become immensely successful. Not just “suc- It is difficult to disagree with the central message of Jimmy’s article: For many organizations cessful” but on a growth trajectory that is outpacing Moore’s Law (it’ll become clear what this that have large-scale graphs and want to run analytical computations, using a multicore single means below). This is unlikely to be the case for most organizations, and even if you were fortu- machine with a lot of RAM is a better option than a distributed cluster because single-machine nate enough to experience such explosive growth, success would likely bring commensurate re- software, compared to distributed software, is easier to develop in-house or use out of the box, is sources to throw at the problem. So as long as you leave yourself enough headroom, it should be often more efficient, and is easier to maintain. This is indeed true, and for the social-network possible to build a scale-out graph processing solution just in time. The alternative is sunk costs graphs and the computations discussed in that article—e.g., a search for a diamond structure or in distributed graph processing infrastructure, which comes with huge parallelization overheads, an online random-walk computation for recommendations—scale-up is likely the better ap- anticipating a problem that never arrives. proach. However, Jimmy’s article gave the impression that only a handful of applications require It makes sense to begin by more carefully describing the scope of the problem: the type of graph scale-out computing, and it failed to highlight several common scenarios in which scale-out is processing I am referring to involves analytical queries to extract insights or to power data prod- necessary. ucts. An example of the former might be clustering a large social network to infer latent commu- In this response article, we discuss three cases for scale-out: nities of interest. An example of the latter might be building a graph-based recommendation system. Typically, these queries require traversing large portions of the graph where throughput, • Trillion-edge-size graphs. Several application domains, such as finance, retail, e-com- not latency, is the more important performance consideration. merce, telecommunications, and scientific computing, have naturally appearing graphs Gartner defines “operational” databases as “relational and non-relational DBMS products suita- at the trillion-edge-size scale. ble for a broad range of enterprise-level transactional applications, and DBMS products support- ing interactions and observations as alternative types of transactions.” I am specifically not

IEEE Internet Computing Published by the IEEE Computer Society IEEE Internet Computing Published by the IEEE Computer Society May/June 2018 72 1089-7801/18/$33.00 ©2018 IEEE September/October 2018 18 1089-7801/18/$33.00 ©2018 IEEE

Scale-up or Scale-out?

Scale-up: Single machine execution Scale-out: Parallel (cluster) execution

Scale-out is the only way to go!... There is no way to deal with the emerging real graph sizes on single (ordinary) machines

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 17 Scale-up: Single machine execution Scale-out: Parallel (cluster) execution

Dataset |V | |E| Regular size Single Machine∗ Live Journal 4,847,571 68,993,773 1.08GB 6.3GB USA Road 23,947,347 58,333,344 951MB 9.09GB Scale-outTwitter is the only41,652,230 way1,468,365,182 to go!... 26GB 128 GB UK0705 82,240,700 2,829,101,180 48GB 247GB ThereWorld is Roadno way to682,496,072 deal with the717,016,716 emerging real15GB graph194GB sizes on single CommonCrawl2014 1,727,000,000 64,422,000,000 1.3TB Out of memory (ordinary) machines ∗ Using (PowerLyra)

Scale-up or Scale-out?

DEPARTMENT: Big Data Bites DEPARTMENT: Big Data Bites

Scale Up or Scale Out for Response to “Scale Up or Graph Processing? Scale Out for Graph Processing” Jimmy Lin This column explores a simple question: scale up or University of Waterloo scale out for graph processing? Should we simply throw “beefier” individual multi-core, large-memory Semih Salihoglu In this article, the authors provide their views on machines at graph processing tasks and focus on developing more efficient multi- University of Waterloo whether organizations should scale up or scale out threaded algorithms, or are investments in distributed graph processing frameworks and M. Tamer Özsu their graph computations. This question was explored accompanying algorithms worthwhile? For rhetorical convenience, I adopt customary University of Waterloo in a previous installment of this column by Jimmy Lin, definitions, referring to the former as scale up and the latter as scale out. Under what Editor: Jimmy Lin where he made a case for scale-up through several circumstances should we prefer one approach over the other? [email protected] examples. In response, the authors discuss three cases for scale-out. Whether we should scale up or out for graph processing is a consequential question from two perspectives: For big data practitioners, it would be desirable to develop a set of best practices that provide guidance to organizations building and deploying graph-processing capabilities. For Our colleague Jimmy Lin in the University of Waterloo’s Data Systems Group wrote an article big data researchers, these best practices translate into priorities for future work and provide a for this department giving his perspective on whether organizations should scale up or scale out roadmap prioritizing real-world pain points. for graph analytics.1 Similarly to that article, for rhetorical convenience, we use “scale up” to re- tl;dr – I advocate scale-up solutions for graph processing as the first thing to try, since they are fer to using software running on multicore large-memory machines and “scale out” to refer to much simpler to design, implement, deploy, and maintain. If you really need distributed scale- using distributed software running on multiple machines. out solutions, it means that your organization has become immensely successful. Not just “suc- It is difficult to disagree with the central message of Jimmy’s article: For many organizations cessful” but on a growth trajectory that is outpacing Moore’s Law (it’ll become clear what this that have large-scale graphs and want to run analytical computations, using a multicore single means below). This is unlikely to be the case for most organizations, and even if you were fortu- machine with a lot of RAM is a better option than a distributed cluster because single-machine nate enough to experience such explosive growth, success would likely bring commensurate re- software, compared to distributed software, is easier to develop in-house or use out of the box, is sources to throw at the problem. So as long as you leave yourself enough headroom, it should be often more efficient, and is easier to maintain. This is indeed true, and for the social-network possible to build a scale-out graph processing solution just in time. The alternative is sunk costs graphs and the computations discussed in that article—e.g., a search for a diamond structure or in distributed graph processing infrastructure, which comes with huge parallelization overheads, an online random-walk computation for recommendations—scale-up is likely the better ap- anticipating a problem that never arrives. proach. However, Jimmy’s article gave the impression that only a handful of applications require It makes sense to begin by more carefully describing the scope of the problem: the type of graph scale-out computing, and it failed to highlight several common scenarios in which scale-out is processing I am referring to involves analytical queries to extract insights or to power data prod- necessary. ucts. An example of the former might be clustering a large social network to infer latent commu- In this response article, we discuss three cases for scale-out: nities of interest. An example of the latter might be building a graph-based recommendation system. Typically, these queries require traversing large portions of the graph where throughput, • Trillion-edge-size graphs. Several application domains, such as finance, retail, e-com- not latency, is the more important performance consideration. merce, telecommunications, and scientific computing, have naturally appearing graphs Gartner defines “operational” databases as “relational and non-relational DBMS products suita- at the trillion-edge-size scale. ble for a broad range of enterprise-level transactional applications, and DBMS products support- ing interactions and observations as alternative types of transactions.” I am specifically not

IEEE Internet Computing Published by the IEEE Computer Society IEEE Internet Computing Published by the IEEE Computer Society May/June 2018 72 1089-7801/18/$33.00 ©2018 IEEE September/October 2018 18 1089-7801/18/$33.00 ©2018 IEEE

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 17 Amazon Neptune Oracle Spatial & Graph

Graph Systems

RDF Engines

Jena DB2-RDF Virtuoso EAGRE RDF-3X gStore Titan GraphLab JanusGraph Haloop Neo4j Sparksee Pregel/Giraph TurboGraph++ OrientDB Trinity GraphChi Blogel TigerGraph Graph DBMSs Graph Analytics GraphFlow GraphX Systems

There are a number of others!...

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 18 Graph Systems

RDF Engines Amazon Neptune Oracle Spatial & Graph Jena DB2-RDF Virtuoso EAGRE RDF-3X gStore Titan GraphLab JanusGraph Haloop Neo4j Sparksee Pregel/Giraph TurboGraph++ OrientDB Trinity GraphChi Blogel TigerGraph Graph DBMSs Graph Analytics GraphFlow GraphX Systems

There are a number of others!...

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 18 RDF Engines

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 19 RDF Example Instance

Prefixes: mdb=http://data.linkedmdb.org/resource/; geo=http://sws.geonames.org/ bm=http://wifo5-03.informatik.uni-mannheim.de/bookmashup/ lexvo=http://lexvo.org/id/;wp=http://en.wikipedia.org/wiki/ Subject Predicate Object mdb: film/2014 rdfs:label “The Shining” mdb:film/2014 movie:initial release date “1980-05-23”’ URI mdb:film/2014 movie:director mdb:director/8476 Literal mdb:film/2014 movie:actor mdb:actor/29704 mdb:film/2014 movie:actor mdb: actor/30013 mdb:film/2014 movie:music contributor mdb: music contributor/4110 mdb:film/2014 foaf:based near geo:2635167 mdb:film/2014 movie:relatedBook bm:0743424425 mdb:film/2014 movie:language lexvo:iso639-3/eng URI mdb:director/8476 movie:director name “Stanley Kubrick” mdb:film/2685 movie:director mdb:director/8476 mdb:film/2685 rdfs:label “A Clockwork Orange” mdb:film/424 movie:director mdb:director/8476 mdb:film/424 rdfs:label “Spartacus” mdb:actor/29704 movie:actor name “Jack Nicholson” mdb:film/1267 movie:actor mdb:actor/29704 mdb:film/1267 rdfs:label “The Last Tycoon” mdb:film/3418 movie:actor mdb:actor/29704 mdb:film/3418 rdfs:label “The Passenger” geo:2635167 gn:name “United Kingdom” geo:2635167 gn:population 62348447 geo:2635167 gn:wikipediaArticle wp:United Kingdom bm:books/0743424425 dc:creator bm:persons/Stephen+King bm:books/0743424425 rev:rating 4.7 bm:books/0743424425 scom:hasOffer bm:offers/0743424425amazonOffer lexvo:iso639-3/eng rdfs:label “English” lexvo:iso639-3/eng lvont:usedIn lexvo:iso3166/CA lexvo:iso639-3/eng lvont:usesScript lexvo:script/Latn

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 20 RDF Graph

“The Passenger” “The Last Tycoon”

wp:UnitedKingdom refs:label refs:label

mdb:film/3418 mdb:film/1267 gn:wikipediaArticle bm:offers/0743424425amazonOffer “United Kingdom” 62348447 scam:hasOffer movie:actor movie:actor gn:name gn:population dc:creator bm:persons/StephenKing bm:books/0743424425 geo:2635167 mdb:actor/29704 rev:rating movie:actor name 4.7 movie:relatedBook foaf:based near movie:actor “Jack Nicholson” refs:label music contributor “Stanley Kubrick” “The Shining” mdb:film/2014 mob:music contributor

movie:director name movie:director language mdb:director/8476 movie:initial release date movie:actor lexvo:iso639 3/eng

movie:director movie:director mdb:actor/30013 rdf:label lvont:usesScript mdb:film/2685 mdb:film/424 “1980-05-23” lvont:usedIn “English” lexvo:script/latin refs:label refs:label movie:actor name lexvo:iso3166/CA “A Clockwork Orange” “Spartacus” “Shelley Duvall”

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 21 SPARQL Queries

SELECT ?name WHERE { ?m rdfs:label ?name. ?m movie:director ?d. ?d movie:director name ”Stanley Kubrick”. ?m movie:relatedBook ?b. ?b rev:rating ?r. FILTER(? r > 4 . 0 ) }

FILTER(?r > 4.0) rev:rating ?b ?r ?name “Stanley Kubrick”

rdfs:label movie:relatedBook movie:director name ?m ?d movie:director

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 22 Direct Relational Mapping

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 23 Direct Relational Mapping

Bad Idea!...

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 23 Easy to implement but too many self-joins!

Direct Relational Mapping

SELECT ?name WHERE { ?m rdfs:label ?name. ?m movie:director ?d. ?d movie:director name ”Stanley Kubrick”. ?m movie:relatedBook ?b. ?b rev:rating ?r. FILTER(? r > 4 . 0 ) } Subject Property Object mdb:film/2014 rdfs:label “The Shining” mdb:film/2014 movie:initial release date “1980-05-23” SELECT T1 . o b j e c t mdb:film/2014 movie:director mdb:director/8476 mdb:film/2014 movie:actor mdb:actor/29704 FROM T as T1 , T as T2 , T as T3 , mdb:film/2014 movie:actor mdb: actor/30013 mdb:film/2014 movie:music contributor mdb: music contributor/4110 T as T4 , T as T5 mdb:film/2014 foaf:based near geo:2635167 mdb:film/2014 movie:relatedBook bm:0743424425 WHERE T1.p=”rdfs : label” mdb:film/2014 movie:language lexvo:iso639-3/eng mdb:director/8476 movie:director name “Stanley Kubrick” AND T2.p=”movie: relatedBook” mdb:film/2685 movie:director mdb:director/8476 mdb:film/2685 rdfs:label “A Clockwork Orange” AND T3.p=”movie: director” mdb:film/424 movie:director mdb:director/8476 mdb:film/424 rdfs:label “Spartacus” AND T4.p=”rev : rating” mdb:actor/29704 movie:actor name “Jack Nicholson” mdb:film/1267 movie:actor mdb:actor/29704 AND T5.p=”movie: director n a m e ” mdb:film/1267 rdfs:label “The Last Tycoon” mdb:film/3418 movie:actor mdb:actor/29704 AND T1 . s=T2 . s mdb:film/3418 rdfs:label “The Passenger” geo:2635167 gn:name “United Kingdom” AND T1 . s=T3 . s geo:2635167 gn:population 62348447 geo:2635167 gn:wikipediaArticle wp:United Kingdom AND T2 . o=T4 . s bm:books/0743424425 dc:creator bm:persons/Stephen+King bm:books/0743424425 rev:rating 4.7 AND T3 . o=T5 . s bm:books/0743424425 scom:hasOffer bm:offers/0743424425amazonOffer lexvo:iso639-3/eng rdfs:label “English” AND T4 . o > 4 . 0 lexvo:iso639-3/eng lvont:usedIn lexvo:iso3166/CA lexvo:iso639-3/eng lvont:usesScript lexvo:script/Latn AND T5.o=”Stanley Kubrick”

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 23 Direct Relational Mapping

SELECT ?name WHERE { ?m rdfs:label ?name. ?m movie:director ?d. ?d movie:director name ”Stanley Kubrick”. ?m movie:relatedBook ?b. ?b rev:rating ?r. Easy to implement FILTER(? r > 4 . 0 ) } but Subject Property Object mdb:film/2014 rdfs:label “The Shining” too many self-joins! mdb:film/2014 movie:initial release date “1980-05-23” SELECT T1 . o b j e c t mdb:film/2014 movie:director mdb:director/8476 mdb:film/2014 movie:actor mdb:actor/29704 FROM T as T1 , T as T2 , T as T3 , mdb:film/2014 movie:actor mdb: actor/30013 mdb:film/2014 movie:music contributor mdb: music contributor/4110 T as T4 , T as T5 mdb:film/2014 foaf:based near geo:2635167 mdb:film/2014 movie:relatedBook bm:0743424425 WHERE T1.p=”rdfs : label” mdb:film/2014 movie:language lexvo:iso639-3/eng mdb:director/8476 movie:director name “Stanley Kubrick” AND T2.p=”movie: relatedBook” mdb:film/2685 movie:director mdb:director/8476 mdb:film/2685 rdfs:label “A Clockwork Orange” AND T3.p=”movie: director” mdb:film/424 movie:director mdb:director/8476 mdb:film/424 rdfs:label “Spartacus” AND T4.p=”rev : rating” mdb:actor/29704 movie:actor name “Jack Nicholson” mdb:film/1267 movie:actor mdb:actor/29704 AND T5.p=”movie: director n a m e ” mdb:film/1267 rdfs:label “The Last Tycoon” mdb:film/3418 movie:actor mdb:actor/29704 AND T1 . s=T2 . s mdb:film/3418 rdfs:label “The Passenger” geo:2635167 gn:name “United Kingdom” AND T1 . s=T3 . s geo:2635167 gn:population 62348447 geo:2635167 gn:wikipediaArticle wp:United Kingdom AND T2 . o=T4 . s bm:books/0743424425 dc:creator bm:persons/Stephen+King bm:books/0743424425 rev:rating 4.7 AND T3 . o=T5 . s bm:books/0743424425 scom:hasOffer bm:offers/0743424425amazonOffer lexvo:iso639-3/eng rdfs:label “English” AND T4 . o > 4 . 0 lexvo:iso639-3/eng lvont:usedIn lexvo:iso3166/CA lexvo:iso639-3/eng lvont:usesScript lexvo:script/Latn AND T5.o=”Stanley Kubrick”

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 23 Approaches 1 Property table Group together the properties that tend to occur in the same (or similar) subjects Examples: Jena[Wilkinson, 2006], DB2-RDF[Bornea et al., 2013] 2 Vertically partitioned tables For each property, build a two-column table, containing both subject and object, ordered by subjects Binary tables[Abadi et al., 2007, 2009] 3 Exhaustive indexing Create indexes for each permutation of the three columns: SPO, SOP, PSO, POS, OPS, OSP RDF-3X[Neumann and Weikum, 2008, 2009], Hexastore[Weiss et al., 2008]

Optimizations to Tabular Representation Objectives 1 Eliminate/Reduce the number of self-joins 2 Use merge-join

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 24 Optimizations to Tabular Representation Objectives 1 Eliminate/Reduce the number of self-joins 2 Use merge-join

Approaches 1 Property table Group together the properties that tend to occur in the same (or similar) subjects Examples: Jena[Wilkinson, 2006], DB2-RDF[Bornea et al., 2013] 2 Vertically partitioned tables For each property, build a two-column table, containing both subject and object, ordered by subjects Binary tables[Abadi et al., 2007, 2009] 3 Exhaustive indexing Create indexes for each permutation of the three columns: SPO, SOP, PSO, POS, OPS, OSP RDF-3X[Neumann and Weikum, 2008, 2009], Hexastore[Weiss et al., 2008]

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 24 Graph-based Approach [Zou and Ozsu,¨ 2017]

Answering SPARQL query ≡ subgraph matching using homomorphism gStore [Zou et al., 2011, 2014], chameleon-db [Alu¸cet al., 2013]

FILTER(?r > 4.0) rev:rating Su “The Passenger” “The Last Tycoon” ?b ?r bg ?name “Stanley Kubrick” ra p wp:UnitedKingdom refs:label refs:label rdfs:label movie:relatedBook movie:director name h M mdb:film/3418 mdb:film/1267 ?m ?d gn:wikipediaArticle movie:director bm:offers/0743424425amazonOffer a tc h“United Kingdom” 62348447 scam:hasOffer in movie:actor movie:actor g gn:name gn:population dc:creator bm:persons/StephenKing bm:books/0743424425 geo:2635167 mdb:actor/29704 rev:rating movie:actor name 4.7 movie:relatedBook foaf:based near movie:actor “Jack Nicholson” refs:label music contributor “Stanley Kubrick” “The Shining” mdb:film/2014 mob:music contributor

movie:director name movie:director language mdb:director/8476 movie:initial release date movie:actor lexvo:iso639 3/eng

movie:director movie:director mdb:actor/30013 rdf:label lvont:usesScript mdb:film/2685 mdb:film/424 “1980-05-23” lvont:usedIn “English” lexvo:script/latin refs:label refs:label movie:actor name lexvo:iso3166/CA “A Clockwork Orange” “Spartacus” “Shelley Duvall”

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 25 Graph-based Approach [Zou and Ozsu,¨ 2017]

Answering SPARQL query ≡ subgraph matching using homomorphism gStore [Zou et al., 2011, 2014], chameleon-db [Alu¸cet al., 2013]

FILTER(?r > 4.0) rev:rating Su “The Passenger” “The Last Tycoon” ?b ?r bg ?name “Stanley Kubrick” ra p wp:UnitedKingdom refs:label refs:label rdfs:label movie:relatedBook movie:director name h M mdb:film/3418 mdb:film/1267 ?m ?d gn:wikipediaArticle Advantagesmovie:director bm:offers/0743424425amazonOffer a tc h“United Kingdom” 62348447 scam:hasOffer in movie:actor movie:actor g gn:name gn:population I Maintains the graph structuredc:creator bm:persons/StephenKing bm:books/0743424425 geo:2635167 mdb:actor/29704 Full set of queries can berev:rating handled I movie:actor name 4.7 movie:relatedBook foaf:based near movie:actor “Jack Nicholson” refs:label music contributor “Stanley Kubrick” “The Shining” mdb:film/2014 mob:music contributor Disadvantages movie:director name movie:director language mdb:director/8476 movie:initial release date I Graph pattern matching is expensive movie:actor lexvo:iso639 3/eng movie:director movie:director mdb:actor/30013 rdf:label lvont:usesScript mdb:film/2685 mdb:film/424 “1980-05-23” lvont:usedIn “English” lexvo:script/latin refs:label refs:label movie:actor name lexvo:iso3166/CA “A Clockwork Orange” “Spartacus” “Shelley Duvall”

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 25 Scaling-out RDF Engines

Cloud-based solutions [Kaoudi and Manolescu, 2015]

RDF dataset D is partitioned into {D1,..., Dn} and placed on cloud platforms (such as HDFS, HBase) SPARQL query is run through MapReduce jobs Data parallel execution Examples: HARD[Rohloff and Schantz, 2010] , HadoopRDF[Husain et al., 2011] , EAGRE[Zhang et al., 2013] and JenaHBase[Khadilkar et al., 2012]

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 26 Scaling-out RDF Engines

Cloud-based solutions [Kaoudi and Manolescu, 2015] Partition-based approaches

Partition an RDF dataset D into fragments {D1,..., Dn} each of which is located at a site SPARQL query Q is decomposed into a set of subqueries {Q1,..., Qk } Distributed execution of {Q1,..., Qk } over {D1,..., Dn} Examples: GraphPartition[Huang et al., 2011], WARP[Hose and Schenkel, 2013] , Partout[Galarraga et al., 2014] , Vertex-block[Lee and Liu, 2013]

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 26 Scaling-out RDF Engines

Cloud-based solutions [Kaoudi and Manolescu, 2015] Partition-based approaches Partial Query Evaluation (PQE)

Partition an RDF dataset D into several fragments {D1,..., Dn} each of which is located at a site SPARQL query is not decomposed; the full query is sent to each site PQE at each site producing partial results Join the results (similar to distributed join processing) to find matching edges that might cross fragments Distributed gStore[Peng et al., 2016]

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 26 How to implement SPARQL fully Support for dynamic and streaming RDF graphs Data quality over RDF datasets is a real issue RDF in IoT

What are Some Open Issues?

These systems are not performant or scalable to large data sets What is the right scale-out architecture and techniques? It is not clear what the best storage format is Optimization of SPARQL queries RDF Engines require more experimentation

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 27 Support for dynamic and streaming RDF graphs Data quality over RDF datasets is a real issue RDF in IoT

What are Some Open Issues?

These systems are not performant or scalable to large data sets How to implement SPARQL fully Current focus on basic graph patterns (sets of triple patterns) Additional constructs, e.g., property paths, OPTIONAL, UNION, FILTER, aggregation, ... Reasoning over RDF needs to be considered ⇒ entailment regimes (see Ontologies and Semantic Web,[Tena Cucala et al., 2019])

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 27 Data quality over RDF datasets is a real issue RDF in IoT

What are Some Open Issues?

These systems are not performant or scalable to large data sets How to implement SPARQL fully Support for dynamic and streaming RDF graphs Few existing systems (e.g., C-SPARQL[Barbieri et al., 2010] and CQUELS[Phuoc et al., 2011]) are early attempts; more work required

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 27 RDF in IoT

What are Some Open Issues?

These systems are not performant or scalable to large data sets How to implement SPARQL fully Support for dynamic and streaming RDF graphs Data quality over RDF datasets is a real issue Some initial work exists, e.g., CLAMS[Farid et al., 2016]

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 27 What are Some Open Issues?

These systems are not performant or scalable to large data sets How to implement SPARQL fully Support for dynamic and streaming RDF graphs Data quality over RDF datasets is a real issue RDF in IoT Both as a common model across devices, and as embedded into devices

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 27 What are Some Open Issues?

These systems are not performant or scalable to large data sets How to implement SPARQL fully Support for dynamic and streaming RDF graphs Data quality over RDF datasets is a real issue RDF in IoT

Most important ... 1 DB community really needs to get engaged to develop performant & scalable engines 2 SPARQL is not easy ⇒ language front-ends are desperately needed 3 Proper implementation of full SPARQL with optimizations for performance & scalability

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 27 Graph DBMSs

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 28 Online workloads Graph query languages

Graph DBMS Properties

Property graph model Vertices and edges have one or more labels and zero or more properties Graph can be directed or undirected

film 3418 film 1267 UnitedKingdom (label, “The Passenger”) (label, “The Last Tycoon”) offers 0743424425amazonOffer (wikipediaArticle) (hasOffer) (actor) (actor) geo 2635167 books 0743424425 (name, “United Kingdom”) (rating, 4.7) (population, 62348447) actor 29704 (actor name, “Jack Nicholson”) (creator) (relatedBook) (based near) (actor) StephenKing film 2014 (initial release date, “1980-05-23”) (label, “The Shining”) (music contributor, music contributor/4110) (language, (iso639 3/eng) (label, “English”) (usedIn, iso3166/CA) (usesScript, script/latn))

(director) (actor)

director 8476 actor 30013 (director name, “Stanley Kubrick”) (actor name, “Shelley Duvall”)

(director) (director)

film 2685 film 424 (label, “A Clockwork Orange”) (label, “Spartacus”)

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 29 Graph query languages

Graph DBMS Properties

Property graph model Vertices and edges have one or more labels and zero or more properties Graph can be directed or undirected Online workloads Each query accesses a portion of the graph Can be assisted by indexes Query latency is important Examples Reachability Single source shortest-path Subgraph matching SPARQL queries

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 29 Graph DBMS Properties

Property graph model Vertices and edges have one or more labels and zero or more properties Graph can be directed or undirected Online workloads Graph query languages Regular Queries[Reutter et al., 2017] Unions of Conjunctive Nested 2-Way Regular Path Queries (UC2NRPQ) Formalism for structural graph queries (Neo4j) Declarative, comparable to UCRPQ (TinkerPop) Imperative, XPath like navigation language G-Core (LDBC)[Angles et al., 2018] Declarative, graphs as first-class citizens PGQL (Oracle PGX) SQL-like language with pattern matching and reachability GSQL (TigerGraph)

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 29 Ternary tables Each edge, vertex and property is a separate record E.g., Neo4j keeps data in separate files, each of which holds data of one type (nodes/relations/properties) Pivoted tables Similar to above, tables are pivoted This is similar to property table approach in RDF engines Each column is a property key and the column value is the property value All properties of a vertex is stored in a single row Examples: SAP Hana Graph, IBM SQLGraph

Property Graph Storage Approaches

Key-value stores Vertices are keys, entire edge and property information is stored in the value Examples: Titan (Datastax Enterprise Graph), JanusGraph, Dgraph

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 30 Pivoted tables Similar to above, tables are pivoted This is similar to property table approach in RDF engines Each column is a property key and the column value is the property value All properties of a vertex is stored in a single row Examples: SAP Hana Graph, IBM SQLGraph

Property Graph Storage Approaches

Key-value stores Vertices are keys, entire edge and property information is stored in the value Examples: Titan (Datastax Enterprise Graph), JanusGraph, Dgraph Ternary tables Each edge, vertex and property is a separate record E.g., Neo4j keeps data in separate files, each of which holds data of one type (nodes/relations/properties)

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 30 Property Graph Storage Approaches

Key-value stores Vertices are keys, entire edge and property information is stored in the value Examples: Titan (Datastax Enterprise Graph), JanusGraph, Dgraph Ternary tables Each edge, vertex and property is a separate record E.g., Neo4j keeps data in separate files, each of which holds data of one type (nodes/relations/properties) Pivoted tables Similar to above, tables are pivoted This is similar to property table approach in RDF engines Each column is a property key and the column value is the property value All properties of a vertex is stored in a single row Examples: SAP Hana Graph, IBM SQLGraph

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 30 Graph Querying [Bonifati et al., 2018]

Querying graph topology and graph properties Data queries are essentially relational queries Querying these are usually treated separately Core graph query functionalities Path navigation (reachability) queries Subgraph pattern queries

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 31 Reachability Queries

film 3418 film 1267 UnitedKingdom (label, “The Passenger”) (label, “The Last Tycoon”) offers 0743424425amazonOffer (wikipediaArticle) (hasOffer) (actor) (actor) geo 2635167 books 0743424425 (name, “United Kingdom”) (rating, 4.7) (population, 62348447) actor 29704 (actor name, “Jack Nicholson”) (creator) (relatedBook) (based near) (actor) StephenKing film 2014 (initial release date, “1980-05-23”) (label, “The Shining”) (music contributor, music contributor/4110) (language, (iso639 3/eng) (label, “English”) (usedIn, iso3166/CA) (usesScript, script/latn))

(director) (actor)

director 8476 actor 30013 (director name, “Stanley Kubrick”) (actor name, “Shelley Duvall”)

(director) (director)

film 2685 film 424 (label, “A Clockwork Orange”) (label, “Spartacus”)

Can you reach film 1267 from film 2014?

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 32 Reachability Queries

film 3418 film 1267 UnitedKingdom (label, “The Passenger”) (label, “The Last Tycoon”) offers 0743424425amazonOffer (wikipediaArticle) (hasOffer) (actor) (actor) geo 2635167 books 0743424425 (name, “United Kingdom”) (rating, 4.7) (population, 62348447) actor 29704 (actor name, “Jack Nicholson”) (creator) (relatedBook) (based near) (actor) StephenKing film 2014 (initial release date, “1980-05-23”) (label, “The Shining”) (music contributor, music contributor/4110) (language, (iso639 3/eng) (label, “English”) (usedIn, iso3166/CA) (usesScript, script/latn))

(director) (actor)

director 8476 actor 30013 (director name, “Stanley Kubrick”) (actor name, “Shelley Duvall”)

(director) (director)

film 2685 film 424 (label, “A Clockwork Orange”) (label, “Spartacus”) Is there a book whose rating is > 4.0 associated with a film that was directed by Stanley Kubrick?

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 32 Reachability Queries

Think of Facebook graph and finding friends of friends.

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 32 Index construction time 2-hop[Cohen et al., 2002] n3|TC|

Full TC[Simon, 1988] n ∗ m

Chain-Cover[Chen and Chen, 2008] n2 + c3/2n GRAIL[Yildirim et al., 2010] IP[Wei et al., 2014] c(n + m) BFL[Su et al., 2017]

n + m

BFS/DFS 1 Desirable Query time 1 log c c m1/2 m − n m + n

Path (Reachability) Query Execution Approaches

This is computing the transitive closure Fully materialized: O(n ∗ m) index time (n vertices, m edges), O(1) query time BFS/DFS: O(1) index time, O(n + m) query time

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 33 Path (Reachability) Query Execution Approaches

This is computing the transitive closure Fully materialized: O(n ∗ m) index time (n vertices, m edges), O(1) query time BFS/DFS: O(1) index time, O(n + m) query time Index construction time 2-hop[Cohen et al., 2002] n3|TC|

Full TC[Simon, 1988] n ∗ m

Chain-Cover[Chen and Chen, 2008] n2 + c3/2n GRAIL[Yildirim et al., 2010] IP[Wei et al., 2014] c(n + m) BFL[Su et al., 2017]

n + m

BFS/DFS 1 Desirable Query time 1 log c c m1/2 m − n m + n

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 33 Path (Reachability) Query Execution Approaches

Q(p, f ) ← knows · worksFor · knows+ knows worksFor knows p x y Regular Path Queries (RPQ) f knows RPQ = path query that defines desired paths using a regular expression ⇒ labels of a path form a word in the language specified by RPQ Generalization of reachability queries ⇒ reachability query = A RPQ that accepts all words α-RA – Relational Algebra extended with Transitive Closure Finite-automata Based RPQ Evaluation Traversal guided by an FA: G+[Cruz et al., 1987; Mendelzon and Wood, 1995] Hybrid α-RA & FA-based traversals: Waveguide [Yakovets et al., 2016]

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 34 Subgraph Matching

FILTER(?r > 4.0) Su rev:rating b ?b ?r g ?name “Stanley Kubrick” ra p h rdfs:label movie:relatedBook movie:director name M ?m ?d a t movie:director c h “The Passenger” “The Last Tycoon” in g refs:label refs:label

mdb:film/3418 mdb:film/1267 bm:offers/0743424425amazonOffer “United Kingdom” 62348447 scam:hasOffer movie:actor movie:actor gn:name gn:population bm:books/0743424425 geo:2635167 mdb:actor/29704 rev:rating movie:actor name 4.7 movie:relatedBook foaf:based near movie:actor “Jack Nicholson” refs:label “Stanley Kubrick” “The Shining” mdb:film/2014

movie:director name movie:director movie:actor mdb:director/8476 movie:initial release date mdb:actor/30013 movie:director movie:director

mdb:film/2685 mdb:film/424 “1980-05-23”

refs:label refs:label

“A Clockwork Orange” “Spartacus”

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 35 Subgraph Pattern Query Execution Approaches

Q(p, c) ← knows(p, x) ∧ knows(p, y)∧ worksFor(x, c) ∧ worksFor(y, c)

Conjunctive Graph Queries (CQ) x worksFor Set of edge predicates to define knows substructures of interest p c Akin to joins in relational query processing knows Worst-case Optimal Join Processing y worksFor Leapfrog Triejoin[Veldhuizen, 2014] EmptyHeaded[Aberger et al., 2017] Generalized hypertree decompositions Graphflow Hybrid[Mhedhbi and Salihoglu, 2019] Adaptive, cost-based planning with WCO and binary joins

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 36 Subgraph Pattern Query Execution Approaches

Q(p, c) ← knows(p, x) ∧ knows(p, y)∧ worksFor(x, c) ∧ worksFor(y, c)

Conjunctive Graph Queries (CQ) x worksFor Set of edge predicates to define knows substructures of interest p c Akin to joins in relational query processing knows Worst-case Optimal Join Processing y worksFor Leapfrog Triejoin[Veldhuizen, 2014Research] session 28 on EmptyHeaded[Aberger et al., 2017Thursday] 16:00 Generalized hypertree decompositions Graphflow Hybrid[Mhedhbi and Salihoglu, 2019] Adaptive, cost-based planning with WCO and binary joins

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 36 There is poor locality in graph workloads Query languages need attention Query processing and optimization Fuzzy querying over uncertain and probabilistic graphs [Yuan et al., 2011, 2012] Too much focus on simple homogeneous graphs ⇒ multigraphs, heterogeneous graphs are important

What are Some Open Issues?

Existing systems generally have performance issues Generally involve joins of intermediate results, which may be quite large There are not extensive performance studies (LDBC is a good start [Erling et al., 2015])

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 37 Query languages need attention Query processing and optimization Fuzzy querying over uncertain and probabilistic graphs [Yuan et al., 2011, 2012] Too much focus on simple homogeneous graphs ⇒ multigraphs, heterogeneous graphs are important

What are Some Open Issues?

Existing systems generally have performance issues There is poor locality in graph workloads Caching does not help much

A commercial system; LDBC-1 1 0.9 0.8 0.7 0.6 0.5

Hit rate 0.4 0.3 0.2 0.1 0

Q6 Q7 Q8 Q9 Q10 LDBC Queries Native LRU

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 37 Query languages need attention Query processing and optimization Fuzzy querying over uncertain and probabilistic graphs [Yuan et al., 2011, 2012] Too much focus on simple homogeneous graphs ⇒ multigraphs, heterogeneous graphs are important

Graph-aware Disk Layout - Locality Experiments

What are Some Open Issues?

Existing systems generally have performance issues There is poor locality in graph workloads Caching does not help much Proper clustering of vertices and edges on pages may reduce page I/O

(Commercial system; LDBC-1)

Native

Clustered

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 37

Query languages need attention Query processing and optimization Fuzzy querying over uncertain and probabilistic graphs [Yuan et al., 2011, 2012] Too much focus on simple homogeneous graphs ⇒ multigraphs, heterogeneous graphs are important

What are Some Open Issues?

Existing systems generally have performance issues There is poor locality in graph workloads Caching does not help much Proper clustering of vertices and edges on pages may reduce page I/O Native graph storage system design requires more work What should graph databases cache? (subgraphs, paths, vertices, query plans, or what)

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 37 Query processing and optimization Fuzzy querying over uncertain and probabilistic graphs [Yuan et al., 2011, 2012] Too much focus on simple homogeneous graphs ⇒ multigraphs, heterogeneous graphs are important

What are Some Open Issues?

Existing systems generally have performance issues There is poor locality in graph workloads Query languages need attention Need to capture both graph topology and properties Most current work is simplistic Promising: Register automata-based execution for RPQ evaluation Query semantics (and syntax) are still not clarified or standardized Are the proposed languages complete? Proof? How to determine a query is safe? G-Core effort is important

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 37 Fuzzy querying over uncertain and probabilistic graphs [Yuan et al., 2011, 2012] Too much focus on simple homogeneous graphs ⇒ multigraphs, heterogeneous graphs are important

What are Some Open Issues?

Existing systems generally have performance issues There is poor locality in graph workloads Query languages need attention Query processing and optimization What are the primary operators? Can we have a closed algebra? (see [Salihoglu and Widom, 2014; Mattson et al., 2013]) Advanced query plan generation issues

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 37 Too much focus on simple homogeneous graphs ⇒ multigraphs, heterogeneous graphs are important

What are Some Open Issues?

Existing systems generally have performance issues There is poor locality in graph workloads Query languages need attention Query processing and optimization Fuzzy querying over uncertain and probabilistic graphs [Yuan et al., 2011, 2012]

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 37 What are Some Open Issues?

Existing systems generally have performance issues There is poor locality in graph workloads Query languages need attention Query processing and optimization Fuzzy querying over uncertain and probabilistic graphs [Yuan et al., 2011, 2012] Too much focus on simple homogeneous graphs ⇒ multigraphs, heterogeneous graphs are important Some work exists – on multigraphs: Constraints on individual edges [Erling et al., 2015] Constraints on a full path [Zhang and Ozsu,¨ 2019]

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 37 What are Some Open Issues?

Existing systems generally have performance issues There is poor locality in graph workloads Query languages need attention Query processing and optimization Fuzzy querying over uncertain and probabilistic graphs [Yuan et al., 2011, 2012] Too much focus on simple homogeneous graphsResearch⇒ sessionmultigraphs, 18 on heterogeneous graphs are important Thursday @ 11:00 Some work exists – on multigraphs: Constraints on individual edges [Erling et al., 2015] Constraints on a full path [Zhang and Ozsu,¨ 2019]

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 37 What are Some Open Issues?

Existing systems generally have performance issues There is poor locality in graph workloads Query languages need attention Query processing and optimization MostFuzzy queryingimportant over ... uncertain and probabilistic graphs [Yuan et al., 2011, 2012] 1 Disk-based systems ⇒ storage system design needs much work Too much focus on simple homogeneous graphs ⇒ multigraphs, & experimentation heterogeneous graphs are important 2 Query languages/semantics are current bottleneck ⇒ optimization work would benefit 3 Non-trivial scale-out architectures and processing requires further study

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 37 Graph Analytics Systems

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 38 Offline workloads Each query accesses the entire graph – indexes may not help Queries are iterative until a fix point is reached Examples PageRank Clustering Connected components Diameter finding Graph colouring All pairs shortest path Graph pattern mining Machine learning algorithms (Belief propagation, Gaussian non-negative matrix factorization) Almost all of the existing systems are scale-out

Graph Analytics System Properties

Property graph model

film 3418 film 1267 UnitedKingdom (label, “The Passenger”) (label, “The Last Tycoon”) offers 0743424425amazonOffer (wikipediaArticle) (hasOffer) (actor) (actor) geo 2635167 books 0743424425 (name, “United Kingdom”) (rating, 4.7) (population, 62348447) actor 29704 (actor name, “Jack Nicholson”) (creator) (relatedBook) (based near) (actor) StephenKing film 2014 (initial release date, “1980-05-23”) (label, “The Shining”) (music contributor, music contributor/4110) (language, (iso639 3/eng) (label, “English”) (usedIn, iso3166/CA) (usesScript, script/latn))

(director) (actor)

director 8476 actor 30013 (director name, “Stanley Kubrick”) (actor name, “Shelley Duvall”)

(director) (director)

film 2685 film 424 (label, “A Clockwork Orange”) (label, “Spartacus”)

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 39 Almost all of the existing systems are scale-out

Graph Analytics System Properties

Property graph model Offline workloads Each query accesses the entire graph – indexes may not help Queries are iterative until a fix point is reached Examples PageRank Clustering Connected components Diameter finding Graph colouring All pairs shortest path Graph pattern mining Machine learning algorithms (Belief propagation, Gaussian non-negative matrix factorization)

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 39 Graph Analytics System Properties

Property graph model Offline workloads Each query accesses the entire graph – indexes may not help Queries are iterative until a fix point is reached Examples PageRank Clustering Connected components Diameter finding Graph colouring All pairs shortest path Graph pattern mining Machine learning algorithms (Belief propagation, Gaussian non-negative matrix factorization) Almost all of the existing systems are scale-out

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 39 Yes, but not a good idea I Immutable data & computation is not guaranteed to be on the same machine in subsequent iterations I High I/O cost due to repeated read/write to/from store between iterations

There are systems that try to address these concerns I HaLoop[Bu et al., 2010, 2012] I GraphX over Spark[Gonzalez et al., 2014]

Can MapReduce be Used for Graph Analytics?

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 40 There are systems that try to address these concerns I HaLoop[Bu et al., 2010, 2012] I GraphX over Spark[Gonzalez et al., 2014]

Can MapReduce be Used for Graph Analytics?

Yes, but not a good idea I Immutable data & computation is not guaranteed to be on the same machine in subsequent iterations I High I/O cost due to repeated read/write to/from store between iterations

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 40 Can MapReduce be Used for Graph Analytics?

Yes, but not a good idea I Immutable data & computation is not guaranteed to be on the same machine in subsequent iterations I High I/O cost due to repeated read/write to/from store between iterations

There are systems that try to address these concerns I HaLoop[Bu et al., 2010, 2012] I GraphX over Spark[Gonzalez et al., 2014]

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 40 Classification of Graph Analytics Systems [Han, 2015]

Programming model Computation model

Gather-Apply Scatter (GAS)

Asynchronous Computation Model

Block Synchronous Parallel (BSP) Vertex-centric Partition-centric Edge-centric Programming Model

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 41 Partition-centric (Block-centric) Edge-centric

Programming Models

Vertex-centric Computation on a vertex is the focus “Think like a vertex” Vertex computation depends on its own state + states of its neighbors ? Compute(vertex v) GetValue(), WriteValue()

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 42 Edge-centric

Programming Models

Vertex-centric Partition-centric (Block-centric) Computation on an entire partition is specified “Think like a block” or “Think like a graph” Aim is to reduce the communication cost among vertices

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 42 Programming Models

Vertex-centric Partition-centric (Block-centric) Edge-centric Computation is specified on each edge rather than on each vertex or block Compute(edge e)

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 42 Asynchronous Parallel Gather-Apply-Scatter (GAS) Similar to BSP, but pull-based Gather: pull state Apply: Compute function Scatter: Update state Updates of states separated from scheduling

Computational Models

Bulk Synchronous Parallel (BSP) [Valiant, 1990]

Computation Superstep 1 Superstep 2 Superstep 3

Machine 1 Machine 1 Machine 1

Machine 2 Machine 2 Machine 2

Machine 3 Machine 3 Machine 3

Communication Communication Barrier Barrier Each machine performs At the end of each superstep computation results are pushed to other on its graph partition workers

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 43 Gather-Apply-Scatter (GAS) Similar to BSP, but pull-based Gather: pull state Apply: Compute function Scatter: Update state Updates of states separated from scheduling

Computational Models

Bulk Synchronous Parallel (BSP) [Valiant, 1990] Asynchronous Parallel Machine 1 Machine 1 No communication barriers Uses the most recent values Machine 2 Machine 2 Implemented via distributed locking Machine 3 Machine 3

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 43 Computational Models

Bulk Synchronous Parallel (BSP) [Valiant, 1990] Asynchronous Parallel Gather-Apply-Scatter (GAS) Similar to BSP, but pull-based Gather: pull state Apply: Compute function Scatter: Update state Updates of states separated from scheduling

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 43 GraphLab [Low et al., 2010]

GiraphCU [Han and Daudjee, 2015]

Blogel X-Stream [Yan et al., 2014] [Roy et al., 2013]

Classification of Graph Analytics Systems

Vertex-centric GAS

Gather-Apply Scatter (GAS) Vertex-centric Asynchronous

Asynchronous

Computation Model Vertex-centric Partition-centric Edge-centric BSP BSP BSP

Bulk Synchronous Parallel (BSP) Vertex-centric Partition-centric Edge-centric Programming Model

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 44 GraphLab [Low et al., 2010]

GiraphCU [Han and Daudjee, 2015]

Blogel X-Stream [Yan et al., 2014] [Roy et al., 2013]

Classification of Graph Analytics Systems

Vertex-centric GAS

Gather-Apply Scatter (GAS) Vertex-centric Asynchronous

Asynchronous

Computation Model Vertex-centric Partition-centric Edge-centric BSP BSP BSP

Bulk Synchronous Parallel (BSP) Vertex-centric Partition-centric Edge-centric Programming Model Pregel [Malewicz et al., 2010], Apache Giraph, GPS [Salihoglu and Widom, 2013], Mizan [Khayyat et al., 2013], Trinity [Shao et al., 2013]

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 44 Classification of Graph Analytics Systems

Vertex-centric GAS GraphLab [Low et al., 2010] Gather-Apply Scatter (GAS) Vertex-centric Asynchronous GiraphCU [Han and Daudjee, 2015]

Asynchronous

Computation Model Vertex-centric Partition-centric Edge-centric BSP BSP BSP Blogel X-Stream [Yan et al., 2014] [Roy et al., 2013]

Bulk Synchronous Parallel (BSP) Vertex-centric Partition-centric Edge-centric Programming Model Pregel [Malewicz et al., 2010], Apache Giraph, GPS [Salihoglu and Widom, 2013], Mizan [Khayyat et al., 2013], Trinity [Shao et al., 2013]

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 44 GraphLab [Low et al., 2010]

GiraphCU [Han and Daudjee, 2015]

Blogel X-Stream [Yan et al., 2014] [Roy et al., 2013]

Classification of Graph Analytics Systems

Vertex-centric ??? ??? GAS ? Gather-Apply Scatter (GAS) Vertex-centric ??? ??? Asynchronous

Asynchronous

Computation Model Vertex-centric Partition-centric Edge-centric BSP BSP BSP

Bulk Synchronous Parallel (BSP) Vertex-centric Partition-centric Edge-centric Programming Model

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 44 OLAP Over Graphs

OLAP in RDBMS Usage: Data Warehousing + Business Intelligence Model: Multidimensional cube Operations: Roll-up, drill-down, and slice and dice Analytics that we discussed over graphs is much different Can we do OLAP-style analytics over graphs? There is some work Graph summarization[Tian et al., 2008] Snapshot-based Aggregation[Chen et al., 2008] Graph Cube[Zhao et al., 2011] Pagrol[Wang et al., 2014] Gagg Model[Maali et al., 2015]

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 45 Integration with data science workflows Focus has been mostly on single computation Analytics as part of a complete workflow: financial analysis, litigation analytics Single algorithms → systems ML workloads over graphs are interesting and requires more attention

Some Open Problems

Current systems would have difficulty scaling to some large graphs Graphs with billions of vertices, hundreds of billions edges are becoming more common Brain network is a trillion edge graph Even the large graphs we play with are small

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 46 ML workloads over graphs are interesting and requires more attention

Some Open Problems

Current systems would have difficulty scaling to some large graphs Graphs with billions of vertices, hundreds of billions edges are becoming more common Brain network is a trillion edge graph Even the large graphs we play with are small Integration with data science workflows Focus has been mostly on single computation Analytics as part of a complete workflow: financial analysis, litigation analytics Single algorithms → systems

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 46 Some Open Problems

Current systems would have difficulty scaling to some large graphs Graphs with billions of vertices, hundreds of billions edges are becoming more common Brain network is a trillion edge graph Even the large graphs we play with are small Integration with data science workflows Focus has been mostly on single computation Analytics as part of a complete workflow: financial analysis, litigation analytics Single algorithms → systems ML workloads over graphs are interesting and requires more attention

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 46 Some Open Problems

Current systems would have difficulty scaling to some large graphs Graphs with billions of vertices, hundreds of billions edges are becoming more common Brain network is a trillion edge graph Even the large graphs we play with are small Integration with data science workflows MostFocus important... has been mostly on single computation Analytics as part of a complete workflow: financial analysis, litigation 1 Areanalytics the types of systems we have been focusing on still relevantSingle algorithms & reasonable?→ systems 2 ML workloadsSerious scaling over graphs⇒ computation are interesting over HPC and requires infrastructures more attention might become important 3 Consider analytics as part of a full workflow

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 46 Dynamic & Streaming Graphs

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 47 Graph sees updates over time Update can be both on topology and properties Existing work predominantly focusing on topology updates Graph is bounded and fully available to the algorithms Computation approaches Batch computation of each snapshot Incremental computation General purpose: Differential dataflow [McSherry et al., 2013] Specialized algorithms for specific workloads: E.g., subgraph matching [Ammar et al., 2018; Fan et al., 2011], shortest path [Nannicini and Liberti, 2008], connected components [McColl et al., 2013]

Dynamic Graphs

C

B F

D

A

E

Time t1

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 48 Graph sees updates over time Update can be both on topology and properties Existing work predominantly focusing on topology updates Graph is bounded and fully available to the algorithms Computation approaches Batch computation of each snapshot Incremental computation General purpose: Differential dataflow [McSherry et al., 2013] Specialized algorithms for specific workloads: E.g., subgraph matching [Ammar et al., 2018; Fan et al., 2011], shortest path [Nannicini and Liberti, 2008], connected components [McColl et al., 2013]

Dynamic Graphs

C C

B F B F

D D A A X

E E

Time t1 t2

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 48 Graph sees updates over time Update can be both on topology and properties Existing work predominantly focusing on topology updates Graph is bounded and fully available to the algorithms Computation approaches Batch computation of each snapshot Incremental computation General purpose: Differential dataflow [McSherry et al., 2013] Specialized algorithms for specific workloads: E.g., subgraph matching [Ammar et al., 2018; Fan et al., 2011], shortest path [Nannicini and Liberti, 2008], connected components [McColl et al., 2013]

Dynamic Graphs

C C C

B F B F B F

D D D

A A A G

E E E

Time t1 t2 t3

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 48 Graph sees updates over time Update can be both on topology and properties Existing work predominantly focusing on topology updates Graph is bounded and fully available to the algorithms Computation approaches Batch computation of each snapshot Incremental computation General purpose: Differential dataflow [McSherry et al., 2013] Specialized algorithms for specific workloads: E.g., subgraph matching [Ammar et al., 2018; Fan et al., 2011], shortest path [Nannicini and Liberti, 2008], connected components [McColl et al., 2013]

Dynamic Graphs

C C C C

B F B F B F B F

D D D D

A A A G A G

E E E E

Time t1 t2 t3 t4

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 48 Dynamic Graphs

C C C C

B F B F B F B F

D D D D

A A A G A G

E E E E

Time t1 t2 t3 t4 Graph sees updates over time Update can be both on topology and properties Existing work predominantly focusing on topology updates Graph is bounded and fully available to the algorithms Computation approaches Batch computation of each snapshot Incremental computation General purpose: Differential dataflow [McSherry et al., 2013] Specialized algorithms for specific workloads: E.g., subgraph matching [Ammar et al., 2018; Fan et al., 2011], shortest path [Nannicini and Liberti, 2008], connected components [McColl et al., 2013] © M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 48 ∆A ∆B 2 1 ∆A1

When changes arrive, each operator is change ⇓ asked if there are any changes ∆op1 If there are, push the changes to next operator change ⇓ If not, stop ⇒ early stop saves work ∆op2 Iterative workloads, e.g., graph change analytics ⇓ ∆op3 Changes come both from input and from previous iteration Timestamped set of changes Uses partial order to optimize “Generalized incremental dataflow maintenance” ∆Output

Differential Dataflow

Applies to any data flow computation Does not depend on the semantics of Input B Input A

the computation op1

op2

op3

. .

filter

Output

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 49 When changes arrive, each operator is change ⇓ asked if there are any changes ∆op1 If there are, push the changes to next operator change ⇓ If not, stop ⇒ early stop saves work ∆op2 Iterative workloads, e.g., graph change analytics ⇓ ∆op3 Changes come both from input and from previous iteration Timestamped set of changes Uses partial order to optimize “Generalized incremental dataflow maintenance”

Differential Dataflow

∆A ∆B 2 Applies to any data flow computation 1 ∆A1 Does not depend on the semantics of Input B Input A

the computation op1

op2

op3

. .

filter

Output ∆Output

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 49 Iterative workloads, e.g., graph analytics Changes come both from input and from previous iteration Timestamped set of changes Uses partial order to optimize “Generalized incremental dataflow maintenance”

Differential Dataflow

∆A ∆B 2 Applies to any data flow computation 1 ∆A1 Does not depend on the semantics of Input B Input A

the computation op1 When changes arrive, each operator is change ⇓ asked if there are any changes ∆op1 If there are, push the changes to next op2 operator change ⇓ If not, stop ⇒ early stop saves work ∆op2 op3 change . ⇓ . ∆op3

filter

Output ∆Output

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 49 Differential Dataflow

∆A ∆B 2 Applies to any data flow computation 1 ∆A1 Does not depend on the semantics of Input B Input A

the computation op1 When changes arrive, each operator is change ⇓ asked if there are any changes ∆op1 If there are, push the changes to next op2 operator change ⇓ If not, stop ⇒ early stop saves work ∆op2 Iterative workloads, e.g., graph op3 change analytics . ⇓ . ∆op3 Changes come both from input and from previous iteration filter Timestamped set of changes Uses partial order to optimize “Generalized incremental dataflow Output maintenance” ∆Output

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 49 Streaming Graphs

Time t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 50 Streaming Graphs

B

A

Time t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13

B

A

t1

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 50 Streaming Graphs

B

A

Time t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13

B

A

t1

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 50 Streaming Graphs

B

A

Time t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13

B

A

t1

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 50 Streaming Graphs

B C

A B

Time t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13

C

B B

A A

t1 t4

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 50 Streaming Graphs

C

B C D

A B A

Time t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13

C C

B B B

D

A A A

t1 t4 t5

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 50 Streaming Graphs

C C

B C D F

A B A D

Time t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13

C C C

B B B B F

D D

A A A A

t1 t4 t5 t7

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 50 Streaming Graphs

B

F

C C A

B C D F E

A B A D D

Time t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13

C C C C

B B B B F B F

D D D

A A A A A

E

t1 t4 t5 t7 t9

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 50 Streaming Graphs

B

F

C C A

B C D F E E X A B A D D F

Time t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13

C C C C C

B B B B F B F B F

D D D D

A A A A A A

E E

t1 t4 t5 t7 t9 t12

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 50 Windowed: for more complex queries

Streaming Graphs

B Combines two difficult problems: F streaming+graphs C C A Unbounded ⇒ don’t see entire graph Streaming rates can be very high B C D F E E ··· X Computational models A B A D D F Continuous: for simple Time transactional operations t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13

C C C C C

B B B B F B F B F

D D D D

A A A A A A

E E

t1 t4 t5 t7 t9 t12

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 50 Streaming Graphs

B Combines two difficult problems: F streaming+graphs C C A Unbounded ⇒ don’t see entire graph Streaming rates can be very high B C D F E E ··· X Computational models A B A D D F Continuous: for simple Time transactional operations t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 Windowed: for more complex queries C C C C C

B B B B F B F B F

D D D D

A A A A A A

E E

t1 t4 t5 t7 t9 t12

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 50 Continuous Computation

Query: Vertices reachable from vertex A

C C C C C

B B B B F B F B F

D D D D

A A A A A A

E E t1 t4 t5 t7 t9 t12 Time Incoming edge Results t1 A,B B h i { } t2 t3 t4 B,C B,C h i { } t5 A,D , D,C B,C,D h i h i { } t6 t7 C,F , D,F B,C,D,F h i h i { } t8 t9 D,E , A,E , B,E , E,F B,C,D,F,E h i h i h i h i { } t10 © M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 51 Windowed Computation Query: Vertices reachable from vertex A Window size=5

C C C C C

B B B B F B F B F

D D D D

A A A A A A

E E t1 t4 t5 t7 t9 t12 Time Incoming edge Expired edges Results t1 A,B B h i { } t2 t3 t4 B,C B,C h i { } t5 A,D , D,C B,C,D h i h i { } t6 A,B B,C,D h i { } t7 C,F , D,F C,D,F h i h i { } t8 t9 D,E , A,E , B,E , E,F B,C C,D,F,E h i h i h i h i h i { } t10 A,D , D,C C,D,F,E h i h i { } © M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 52 Graph Stream Algorithms

Unboundedness brings up space issues Continuous computation (pure streams) model requires linear space ⇒ unrealistic Many graph problems are not solvable (see [McGregor, 2014] for a survey) Semi-streaming model ⇒ sublinear space[Feigenbaum et al., 2005] Sufficient to store vertices but not edges (typically V E ) Approximation for many graph algorithms, spanners| [Elkin,|  | 2011| ], connectivity [Feigenbaum et al., 2005], matching [Kapralov, 2013], etc.

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 53 Querying Graph Streams

Remember graph query functionalities Subgraph matching queries & reachability (path) queries Doing these in the streaming context This is querying beyond simple transactional operations on an incoming edge Edge represents a user purchasing an item do some operation Edge represents events in news send an→ alert → Subgraph pattern matching under stream of updates Windowed join processing Graphflow[Kankanamge et al., 2017], TurboFlux[Kim et al., 2018] These are not designed to deal with unboundedness of the data graph Path queries under stream of updates Windowed RPQ evaluation on unbounded streams

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 54 Analytics on Graph Streams

Many use cases Recommender systems Fraud detection[Qiu et al., 2018] ... Existing relevant work Snapshot-based systems Aspen [Dhulipala et al., 2019], STINGER [Ediger et al., 2012] Consistent graph views across updates Snapshot + Incremental Computations Kineograph [Cheng et al., 2012], GraPu [Sheng et al., 2018], GraphIn [Sengupta et al., 2016], GraphBolt [Mariappan and Vora, 2019] Identify and re-process subgraphs that are effected by updates Designed to handle high velocity updates Cannot handle unbounded streams Similar to dynamic graph processing solutions

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 55 Concluding Remarks

The entire field is pretty much open!... 1 We can do more with dynamic graphs, but efficient systems that incorporate novel techniques are needed 2 Unboundedness in streams raises real challenges 3 Most graph problems are unbounded under edge insert/delete

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 56 So, what is the big story?...

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 57 Take home message!...

Reorient research... 1 A lot of the research has been algorithmic; time to shift focus to systems-aspects 2 Storage system architectures & structures 3 Indexing graph data? 4 Query primitives, processing methodology & optimization techniques

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 58 Take home message!...

Performance & scaling are real problems... 1 There are few independent large-scale performance studies (e.g., [Rusu and Huang, 2019; Han et al., 2014; Ammar and Ozsu,¨ 2018]) 2 Reasonable benchmarks are emerging: LDBC for graph DBMS [Erling et al., 2015], WatDiv for RDF [Alu¸cet al., 2014], Graph500 for very large graphs 3 These are application benchmarks; microbenchmarks for system testing are needed

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 58 Take home message!...

Focus on dynamic & streaming graphs... 1 We paid enough attention to static graphs; many real graphs are not static & many real applications require real-time answers 2 Dynamic 6= streaming 3 Alert: this area is tough and you are not likely to write as many papers

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 58 Take home message!...

Common DBMS for RDF & property graphs? 1 They both deal with online workloads and focus on querying 2 SPARQL only deals with subgraph queries ⇒ how to efficiently do path queries? 3 SPARQL semantics is graph homomorphism; subgraph queries over property graphs use graph isomorphism 4 Some discussion has started: W3C Workshop on Web Standardization for Graph Data: Creating Bridges: RDF, Property Graph and SQL

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 58 Take home message!...

Looking for graph HTAP systems... 1 There are use cases and demand from users/industry 2 We need to decide what type of analytics we are considering: OLAP or offline workloads 3 There is work: TigerGraph, Quegel [Yan et al., 2016]

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 58 Some Issues That I Did Not Talk About

Graphs in AI/ML 1 Graphs in ML models 2 ML for graph analytics

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 59 Some Issues That I Did Not Talk About

Hardware support for graph processing 1 Quite a bit of work in using GPUs for acceleration 2 Mostly focus on managing GPU restrictions 3 Some work on using FPGAs 4 Worthwhile to consider a unified architecture: CPU+GPU+FPGA 5 Use of NVM for both in-memory and on-disk graph systems

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 59 Some Issues That I Did Not Talk About

Security & privacy issues 1 What is appropriate security granularity? Can you get multilevel security as in relational systems? 2 Anonymization of graphs (especially dynamic graphs) is difficult

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 59 Some Issues That I Did Not Talk About

Graphs in related/other fields 1 Network analysis: E.g., “Networks, Crowds, and Markets”, “Information and Influence Propagation in Social Networks” 2 Biological networks 3 Neuroscience: E.g., “Networks of the Brain” 4 ...

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 59 Thank you!

To the DSG group

... and 40+ grad students and post-docs

To my collaborators ...

To the supporters

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 60 © M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 61 ReferencesI

Abadi, D. J., Marcus, A., Madden, S., and Hollenbach, K. (2009). SW-Store: a vertically partitioned DBMS for semantic web data management. VLDB J., 18(2):385–406. Available from: https://doi.org/10.1007/s00778-008-0125-y. Abadi, D. J., Marcus, A., Madden, S. R., and Hollenbach, K. (2007). Scalable semantic web data management using vertical partitioning. In Proc. 33rd Int. Conf. on Very Large Data Bases, pages 411–422. Aberger, C. R., Lamb, A., Tu, S., N¨otzli,A., Olukotun, K., and R´e,C. (2017). EmptyHeaded: A relational engine for graph processing. ACM Trans. Database Syst., 42(4):20:1–20:44. Available from: http://doi.acm.org/10.1145/3129246. Alu¸c,G., Hartig, O., Ozsu,¨ M. T., and Daudjee, K. (2014). Diversified stress testing of RDF data management systems. In Proc. 13th Int. Semantic Web Conf., pages 197–212. Available from: https://doi.org/10.1007/978-3-319-11964-9_13. Alu¸c,G., Ozsu,¨ M. T., Daudjee, K., and Hartig, O. (2013). chameleon-db: a workload-aware robust RDF data management system. Technical Report CS-2013-10, University of Waterloo. Available at https://cs.uwaterloo.ca/sites/ca. computer-science/files/uploads/files/CS-2013-10.pdf.

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 62 ReferencesII

Ammar, K., McSherry, F., Salihoglu, S., and Joglekar, M. (2018). Distributed evaluation of subgraph queries using worst-case optimal and low-memory dataflows. Proc. VLDB Endowment, 11(6):691–704. Available from: https://doi.org/10.14778/3184470.3184473. Ammar, K. and Ozsu,¨ M. T. (2018). Experimental analysis of distributed graph systems. Proc. VLDB Endowment, 11(10):1151–1164. Available from: https://doi.org/10.14778/3231751.3231764. Angles, R., Arenas, M., Barcel´o,P., Boncz, P. A., Fletcher, G. H. L., Gutierrez, C., Lindaaker, T., Paradies, M., Plantikow, S., Sequeda, J. F., van Rest, O., and Voigt, H. (2018). G-CORE: A core for future graph query languages. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 1421–1432. Available from: http://doi.acm.org/10.1145/3183713.3190654. Barbieri, D. F., Braga, D., Ceri, S., Valle, E. D., and Grossniklaus, M. (2010). C-SPARQL: a continuous for RDF data streams. Int. J. Semantic Computing, 4(1):3–25. Available from: https://doi.org/10.1142/S1793351X10000936.

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 63 ReferencesIII

Bonifati, A., Fletcher, G., Voigt, H., and Yakovets, N. (2018). Querying Graphs. Synthesis Lectures on Data Management. Morgan & Claypool. Available from: https://doi.org/10.2200/S00873ED1V01Y201808DTM051. Bornea, M. A., Dolby, J., Kementsietsidis, A., Srinivas, K., Dantressangle, P., Udrea, O., and Bhattacharjee, B. (2013). Building an efficient RDF store over a relational database. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 121–132. Available from: http://doi.acm.org/10.1145/2463676.2463718. Bu, Y., Howe, B., Balazinska, M., and Ernst, M. D. (2010). HaLoop: efficient iterative data processing on large clusters. Proc. VLDB Endowment, 3(1):285–296. Available from: http://dl.acm.org/citation.cfm?id=1920841.1920881. Bu, Y., Howe, B., Balazinska, M., and Ernst, M. D. (2012). The HaLoop approach to large-scale iterative data analysis. VLDB J., 21(2):169–190. Available from: https://doi.org/10.1007/s00778-012-0269-7. Chen, C., Yan, X., Zhu, F., Han, J., and Yu, P. S. (2008). Graph OLAP: Towards online analytical processing on graphs. In Proc. 8th IEEE Int. Conf. on Data Mining, pages 103–112. Available from: https://doi.org/10.1109/ICDM.2008.30.

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 64 ReferencesIV

Chen, Y. and Chen, Y. (2008). An efficient algorithm for answering graph reachability queries. In Proc. 24th Int. Conf. on Data Engineering, pages 893 –902. Cheng, R., Hong, J., Kyrola, A., Miao, Y., Weng, X., Wu, M., Yang, F., Zhou, L., Zhao, F., and Chen, E. (2012). Kineograph: Taking the pulse of a fast-changing and connected world. In Proc. 7th ACM SIGOPS/EuroSys European Conf. on Comp. Syst., pages 85–98. Available from: http://doi.acm.org/10.1145/2168836.2168846. Cohen, E., Halperin, E., Kaplan, H., and Zwick, U. (2002). Reachability and distance queries via 2-hop labels. In Proc. 13th Annual ACM-SIAM Symp. on Discrete Algorithms, pages 937–946. Available from: http://doi.acm.org/10.1145/545381.545503. Cruz, I. F., Mendelzon, A. O., and Wood, P. T. (1987). A graphical query language supporting recursion. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 323–330. Dhulipala, L., Blelloch, G. E., and Shun, J. (2019). Low-latency graph streaming using compressed purely-functional trees. In Proc. ACM SIGPLAN 2019 Conf. on Design and Implementation, pages 918–934. Available from: http://doi.acm.org/10.1145/3314221.3314598.

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 65 ReferencesV

Ediger, D., McColl, R., Riedy, J., and Bader, D. A. (2012). STINGER: High performance data structure for streaming graphs. In Proc. 2012 IEEE Conf. on High Performance Extreme Comp., pages 1–5. Elkin, M. (2011). Streaming and fully dynamic centralized algorithms for constructing and maintaining sparse spanners. ACM Trans. Algorithms, 7(2):20:1–20:17. Available from: http://doi.acm.org/10.1145/1921659.1921666. Erling, O., Averbuch, A., Larriba-Pey, J., Chafi, H., Gubichev, A., Prat, A., Pham, M.-D., and Boncz, P. (2015). The ldbc social network benchmark: Interactive workload. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 619–630. Available from: http://doi.acm.org/10.1145/2723372.2742786. Fan, W., Li, J., Luo, J., Tan, Z., Wang, X., and Wu, Y. (2011). Incremental graph pattern matching. In Proc. ACM International Conference on Management of Data, pages 925–936. Farid, M. H., Roatis, A., Ilyas, I. F., Hoffmann, H., and Chu, X. (2016). CLAMS: bringing quality to data lakes. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 2089–2092. Available from: https://doi.org/10.1145/2882903.2899391.

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 66 ReferencesVI

Feigenbaum, J., Kannan, S., McGregor, A., Suri, S., and Zhang, J. (2005). On graph problems in a semi-streaming model. Theor. Comp. Sci., 348(2):207–216. Available from: http://www.sciencedirect.com/science/article/pii/S0304397505005323. Galarraga, L., Hose, K., and Schenkel, R. (2014). Partout: a distributed engine for efficient RDF processing. In Proc. 26th Int. World Wide Web Conf. (Companion Volume), pages 267–268. Available from: http://doi.acm.org/10.1145/2567948.2577302. Gonzalez, J. E., Xin, R. S., Dave, A., Crankshaw, D., Franklin, M. J., and Stoica, I. (2014). GraphX: graph processing in a distributed dataflow framework graph processing in a distributed dataflow framework. In Proc. 11th USENIX Symp. on Design and Implementation, pages 599–613. Available from: https://www.usenix.org/conference/osdi14/technical-sessions/ presentation/gonzalez. Han, M. (2015). On improving distributed Pregel-like graph processing systems. Master’s thesis, University of Waterloo, David R. Cheriton School of Computer Science. Available from: http://hdl.handle.net/10012/9484.

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 67 ReferencesVII

Han, M. and Daudjee, K. (2015). Giraph unchained: Barrierless asynchronous parallel execution in Pregel-like graph processing systems. Proc. VLDB Endowment, 8(9):950–961. Available from: http://www.vldb.org/pvldb/vol8/p950-han.pdf. Han, M., Daudjee, K., Ammar, K., Ozsu,¨ M. T., Wang, X., and Jin, T. (2014). An experimental comparison of Pregel-like graph processing systems. Proc. VLDB Endowment, 7(12):1047–1058. Available from: http://www.vldb.org/pvldb/vol7/p1047-han.pdf. Hose, K. and Schenkel, R. (2013). WARP: Workload-aware replication and partitioning for RDF. In Proc. Workshops of 29th Int. Conf. on Data Engineering, pages 1–6. Available from: http://dx.doi.org/10.1109/ICDEW.2013.6547414. Huang, J., Abadi, D. J., and Ren, K. (2011). Scalable SPARQL querying of large RDF graphs. Proc. VLDB Endowment, 4(11):1123–1134. Husain, M. F., McGlothlin, J., Masud, M. M., Khan, L. R., and Thuraisingham, B. (2011). Heuristics-based query processing for large RDF graphs using cloud computing. IEEE Trans. Knowl. and Data Eng., 23(9):1312–1327.

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 68 ReferencesVIII

Kankanamge, C., Sahu, S., Mhedbhi, A., Chen, J., and Salihoglu, S. (2017). Graphflow: An active graph database. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 1695–1698. Available from: http://doi.acm.org/10.1145/3035918.3056445. Kaoudi, Z. and Manolescu, I. (2015). RDF in the clouds: A survey. VLDB J., 24:67–91. Kapralov, M. (2013). Better bounds for matchings in the streaming model. In Proc. 24th Annual ACM-SIAM Symp. on Discrete Algorithms, pages 1679–1697. Available from: http://dl.acm.org/citation.cfm?id=2627817.2627938. Khadilkar, V., Kantarcioglu, M., Thuraisingham, B. M., and Castagna, P. (2012). Jena-HBase: A distributed, scalable and efficient RDF triple store. In Proc. International Semantic Web Conference Posters & Demos Track. Khayyat, Z., Awara, K., Alonazi, A., Jamjoom, H., Williams, D., and Kalnis, P. (2013). Mizan: A system for dynamic load balancing in large-scale graph processing. In Proc. 8th ACM SIGOPS/EuroSys European Conf. on Comp. Syst., pages 169–182. Available from: http://doi.acm.org/10.1145/2465351.2465369.

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 69 ReferencesIX

Kim, K., Seo, I., Han, W., Lee, J., Hong, S., Chafi, H., Shin, H., and Jeong, G. (2018). Turboflux: A fast continuous subgraph matching system for streaming graph data. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 411–426. Lee, K. and Liu, L. (2013). Scaling queries over big RDF graphs with semantic hash partitioning. Proc. VLDB Endowment, 6(14):1894–1905. Available from: http://www.vldb.org/pvldb/vol6/p1894-lee.pdf. Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., and Guestrin, C. (2010). GraphLab: new framework for parallel machine learning. In Proc. 26th Conf. on Uncertainty in Artificial Intelligence, pages 340–349. Maali, F., Campinas, S., and Decker, S. (2015). Gagg: A graph aggregation operator. In Proc. 14th Int. Semantic Web Conf., pages 491–504. Malewicz, G., Austern, M. H., Bik, A. J. C., Dehnert, J. C., Horn, I., Leiser, N., and Czajkowski, G. (2010). Pregel: a system for large-scale graph processing. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 135–146. Mariappan, M. and Vora, K. (2019). GraphBolt: Dependency-driven synchronous processing of streaming graphs. In Proc. 14th ACM SIGOPS/EuroSys European Conf. on Comp. Syst., pages 25:1–25:16. Available from: http://doi.acm.org/10.1145/3302424.3303974.

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 70 ReferencesX

Mattson, T., Bader, D., Berry, J., Buluc, A., Dongarra, J., Faloutsos, C., Feo, J., Gilbert, J., Gonzalez, J., Hendrickson, B., Kepner, J., Leiserson, C., Lumsdaine, A., Padua, D., Poole, S., Reinhardt, S., Stonebraker, M., Wallach, S., and Yoo, A. (2013). Standards for graph algorithm primitives. In Proc. 2013 IEEE High Performance Extreme Comp. Conf., pages 1–2. McColl, R., Green, O., and Bader, D. A. (2013). A new parallel algorithm for connected components in dynamic graphs. In 20th Annual Int. Conf. High Performance Computing, pages 246–255. McGregor, A. (2014). Graph stream algorithms: A survey. ACM SIGMOD Rec., 43(1):9–20. Available from: http://doi.acm.org/10.1145/2627692.2627694. McSherry, F., Murray, D. G., Isaacs, R., and Isard, M. (2013). Differential dataflow. In Proc. 6th Biennial Conf. on Innovative Data Systems Research. Mendelzon, A. O. and Wood, P. T. (1995). Finding regular simple paths in graph databases. SIAM J. on Comput., 24(6):1235—1258. Mhedhbi, A. and Salihoglu, S. (2019). Optimizing subgraph queries by combining binary and worst-case optimal joins. Proc. VLDB Endowment, 12.

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 71 ReferencesXI

Nannicini, G. and Liberti, L. (2008). Shortest paths on dynamic graphs. Int. Trans. Operations Research, 15(5):551–563. Available from: https://doi.org/10.1111/j.1475-3995.2008.00649.x. Neumann, T. and Weikum, G. (2008). RDF-3X: a RISC-style engine for RDF. Proc. VLDB Endowment, 1(1):647–659. Neumann, T. and Weikum, G. (2009). The RDF-3X engine for scalable management of RDF data. VLDB J., 19(1):91–113. Peng, P., Zou, L., Ozsu,¨ M. T., Chen, L., and Zhao, D. (2016). Processing SPARQL queries over distributed RDF graphs. VLDB J., 25(2):243–268. Phuoc, D. L., Dao-Tran, M., Parreira, J. X., and Hauswirth, M. (2011). A native and adaptive approach for unified processing of linked streams and linked data. In Proc. 10th Int. Semantic Web Conf., pages 370–388. Available from: https://doi.org/10.1007/978-3-642-25073-6_24. Qiu, X., Cen, W., Qian, Z., Peng, Y., Zhang, Y., Lin, X., and Zhou, J. (2018). Real-time constrained cycle detection in large dynamic graphs. Proc. VLDB Endowment, 11(12):1876–1888.

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 72 ReferencesXII

Reutter, J. L., Romero, M., and Vardi, M. Y. (2017). Regular queries on graph databases. Theory of Computing Systems, 61(1):31–83. Available from: https://doi.org/10.1007/s00224-016-9676-2. Rohloff, K. and Schantz, R. E. (2010). High-performance, massively scalable distributed systems using the mapreduce software framework: the shard triple-store. In Proc. Int. Workshop on Programming Support Innovations for Emerging Distributed Applications. Article No. 4. Roy, A., Mihailovic, I., and Zwaenepoel, W. (2013). X-stream: edge-centric graph processing using streaming partitions. In Proc. 24th ACM Symp. on Operating System Principles, pages 472–488. Available from: http://doi.acm.org/10.1145/2517349.2522740. Rusu, F. and Huang, Z. (2019). In-depth benchmarking of graph database systems with the linked data benchmark council (LDBC) social network benchmark (SNB). CoRR, abs/1907.07405. Available from: http://arxiv.org/abs/1907.07405. Sahu, S., Mhedhbi, A., Salihoglu, S., Lin, J., and Ozsu,¨ M. T. (2017). The ubiquity of large graphs and surprising challenges of graph processing. Proc. VLDB Endowment, 11(4):420–431.

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 73 ReferencesXIII

Sahu, S., Mhedhbi, A., Salihoglu, S., Lin, J., and Ozsu,¨ M. T. (2019). The ubiquity of large graphs and surprising challenges of graph processing. VLDB J. Salihoglu, S. and Widom, J. (2013). GPS: a graph processing system. In Proc. 25th Int. Conf. on Scientific and Statistical Database Management, pages 22:1–22:12. Available from: http://doi.acm.org/10.1145/2484838.2484843. Salihoglu, S. and Widom, J. (2014). HelP: high-level primitives for large-scale graph processing. In Proc. 2nd Int. Workshop on Graph Data Management Experiences and Systems, pages 3:1–3:6. Available from: http://doi.acm.org/10.1145/2621934.2621938. Sengupta, D., Sundaram, N., Zhu, X., Willke, T. L., Young, J., Wolf, M., and Schwan, K. (2016). GraphIn: An online high performance incremental graph processing framework. In Proc. 22nd Int. Euro-Par Conf., pages 319–333. Available from: http://dx.doi.org/10.1007/978-3-319-43659-3_24. Shao, B., Wang, H., and Li, Y. (2013). Trinity: a distributed graph engine on a memory cloud. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 505–516. Available from: http://doi.acm.org/10.1145/2463676.2467799.

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 74 References XIV

Sheng, F., Cao, Q., Cai, H., Yao, J., and Xie, C. (2018). Grapu: Accelerate streaming graph analysis through preprocessing buffered updates. In Proc. 9th ACM Symp. on Cloud Computing, pages 301–312. Available from: http://doi.acm.org/10.1145/3267809.3267811. Simon, K. (1988). An improved algorithm for transitive closure on acyclic digraphs. Theor. Comp. Sci., 58(1-3):325–346. Available from: http://dx.doi.org/10.1016/0304-3975(88)90032-1. Su, J., Zhu, Q., Wei, H., and Yu, J. X. (2017). Reachability querying: Can it be even faster? IEEE Trans. Knowl. and Data Eng., 29(3):683–697. Tena Cucala, D., Cuenca Grau, B., and Horrocks, I. (2019). 15 years of consequence-based reasoning. In Lutz, C., Sattler, U., Tinelli, C., Turhan, A.-Y., and Wolter, F., editors, Description Logic, Theory Combination, and All That: Essays Dedicated to Franz Baader on the Occasion of His 60th Birthday, pages 573–587. Springer International Publishing. Available from: https://doi.org/10.1007/978-3-030-22102-7_27. Tian, Y., Hankins, R. A., and Patel, J. M. (2008). Efficient aggregation for graph summarization. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 567–580.

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 75 ReferencesXV

Valiant, L. G. (1990). A bridging model for parallel computation. Commun. ACM, 33(8):103–111. Veldhuizen, T. L. (2014). Leapfrog triejoin: A simple, worst-case optimal join algorithm. In Proc. 17th Int. Conf. on Database Theory, pages 96–106. Wang, Z., Fan, Q., Wang, H., Tan, K. L., Agrawal, D., and El Abbadi, A. (2014). Pagrol: PArallel GRaph OLap over lege-scale attributed graphs. In Proc. 30th Int. Conf. on Data Engineering, pages 496–507. Wei, H., Yu, J. X., Lu, C., and Jin, R. (2014). Reachability querying: An independent permutation labeling approach. Proc. VLDB Endowment, 7(12):1191–1202. Available from: http://www.vldb.org/pvldb/vol7/p1191-wei.pdf. Weiss, C., Karras, P., and Bernstein, A. (2008). Hexastore: sextuple indexing for semantic web data management. Proc. VLDB Endowment, 1(1):1008–1019. Wilkinson, K. (2006). Jena property table implementation. Technical Report HPL-2006-140, HP Laboratories Palo Alto. Yakovets, N., Godfrey, P., and Gryz, J. (2016). Query planning for evaluating SPARQL property paths. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 1875–1889.

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 76 References XVI

Yan, D., Cheng, J., Lu, Y., and Ng, W. (2014). Blogel: A block-centric framework for distributed computation on real-world graphs. Proc. VLDB Endowment, 7(14):1981–1992. Available from: http://www.vldb.org/pvldb/vol7/p1981-yan.pdf. Yan, D., Cheng, J., Ozsu,¨ M. T., Yang, F., Lu, Y., Liu, J. C., Zhang, Q., and Ng, W. (2016). A general-purpose query-centric framework for querying big graphs. Proc. VLDB Endowment, 9(7):564 – 575. Available from: http://www.vldb.org/pvldb/vol9/p564-yan.pdf. Yildirim, H., Chaoji, V., and Zaki, M. J. (2010). GRAIL: scalable reachability index for large graphs. Proc. VLDB Endowment, 3(1):276–284. Available from: http://dl.acm.org/citation.cfm?id=1920841.1920879. Yuan, Y., Wang, G., Chen, L., and Wang, H. (2012). Efficient subgraph similarity search on large probabilistic graph databases. Proc. VLDB Endowment, 5(9):800–811. Yuan, Y., Wang, G., Wang, H., and Chen, L. (2011). Efficient subgraph search over large uncertain graphs. Proc. VLDB Endowment, 4(11):876–886.

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 77 References XVII

Zhang, X., Chen, L., Tong, Y., and Wang, M. (2013). EAGRE: towards scalable I/O efficient SPARQL query evaluation on the cloud. In Proc. 29th Int. Conf. on Data Engineering, pages 565–576. Zhang, X. and Ozsu,¨ M. T. (2019). Correlation constraint shortest path over large multi-relation graphs. Proc. VLDB Endowment, 12(5):488 – 501. Zhao, P., Li, X., Xin, D., and Han, J. (2011). Graph cube: on warehousing and OLAP multidimensional networks. In Proc. ACM International Conference on Management of Data, pages 853–864. Zou, L., Mo, J., Chen, L., Ozsu,¨ M. T., and Zhao, D. (2011). gStore: answering SPARQL queries via subgraph matching. Proc. VLDB Endowment, 4(8):482–493. Zou, L. and Ozsu,¨ M. T. (2017). Graph-based RDF data management. Data Science and Engineering, 2(1):56–70. Available from: https://dx.doi.org/10.1007/s41019-016-0029-6. Zou, L., Ozsu,¨ M. T., Chen, L., Shen, X., Huang, R., and Zhao, D. (2014). gStore: A graph-based SPARQL query engine. VLDB J., 23(4):565–590.

© M. Tamer Ozsu¨ VLDB 2019 (2019/08/27) 78