Comparison and Performance Evaluation of Processing Platforms – Technical Report

Gala Barquero, Javier Troya, Antonio Vallecillo

July 21, 2020

Abstract

This technical report serves as a complementary study to our paper entitled 'Improving Query Performance on Dynamic Graphs', where we propose an algorithm that reduces the input data in order to improve memory consumption and time performance. To achieve this, it selects only the information that is relevant for a query. In this document we analyse six processing platforms and three languages commonly used to work with graphs, in order to find the most suitable technology for implementing our algorithm. These technologies have been evaluated in two case studies, and we have chosen the combination of processing platform and DSL that best suits our requirements.

1 Introduction

Companies and organizations increasingly demand the processing of data streams generated from different sources, such as social networks and streaming platforms. The main challenge for these applications is to provide sufficiently efficient responses when dealing with large amounts of real-time information. Although this kind of information is sometimes represented as flows of single events without any relation among them, there are cases in which the data items are related to one another. In these cases, the sources generate structured information in the form of graphs, as in the social network and geolocation system domains. Furthermore, data of different natures coming from heterogeneous sources have to be considered. Accordingly, two types of information are taken into account in our approach: persistent and transitory. The former is permanently stored in the system (e.g. users in a social network or coordinates in a geolocation system), whereas the latter is temporarily stored and expires after some period of time (e.g. tweets or routes).
This means that we need to consider the heterogeneous nature and structure of the information, which is indeed a challenge. This technical report is a complementary document to our paper entitled 'Improving Query Performance on Dynamic Graphs'. The approach in that paper is based on the fact that not all the information stored in the system at a given time is useful for obtaining valid conclusions. For this reason, we have developed an algorithm, called the Source DataSet Reduction (SDR) algorithm, that improves the performance of queries over the graphs by selecting only the relevant data to be queried. To achieve this, it was necessary to find a technology and a language that meet three fundamental requirements: (i) they must allow queries and updates to be performed as quickly as possible in order to provide real-time responses, (ii) they must cope with graph-structured information, and (iii) they must provide a clear syntax so that the type of query to be run over the data can be studied. In this technical report, we therefore analyse the benefits and limitations of several processing platforms and query languages when working with high volumes of data. The platforms comprise TinkerGraph [37], Neo4j [28], CrateDB [10], Memgraph [25], JanusGraph [23] and OrientDB [8], whereas the languages comprise Gremlin [3], Cypher [27], and SQL. All these technologies are commonly used in Big Data applications [24]. In order to study their performance, we tested them in two case studies and measured the execution time of the queries in two scenarios. First, queries are run over the data in order to obtain useful information without any side effect on the source information. Second, the same queries are run, but their results modify the source graph by adding, removing or updating existing information.

Figure 1: Twitter and Flickr joint Metamodel.
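To make the two measurement scenarios concrete, the following is a minimal, hypothetical sketch in Python over a toy in-memory graph (the node identifiers, edge labels and data are invented for illustration; the actual study uses the platforms and DSLs of Sections 3 and 4). The first query is side-effect free, while the second one uses its result to update the source graph, in the spirit of query Q1 below.

```python
# Toy in-memory graph: typed nodes plus labelled edges.
# All identifiers and labels here are invented for this example.
graph = {
    "nodes": {
        "t1": {"type": "Tweet"}, "t2": {"type": "Tweet"},
        "p1": {"type": "Photo"},
        "h1": {"type": "Hashtag", "name": "#sunset"},
    },
    "edges": [  # (source, label, target)
        ("t1", "contains", "h1"),
        ("t2", "contains", "h1"),
        ("p1", "contains", "h1"),
    ],
}

def hashtag_uses(g, hashtag_id):
    """Scenario 1: a read-only query -- count the elements that use a
    hashtag, without modifying the source graph."""
    return sum(1 for (_, lbl, tgt) in g["edges"]
               if lbl == "contains" and tgt == hashtag_id)

def mark_hot_topics(g, threshold):
    """Scenario 2: the same kind of query, but its result modifies the
    source graph by adding HotTopic nodes (cf. query Q1)."""
    created = []
    for nid, node in list(g["nodes"].items()):
        if node["type"] == "Hashtag" and hashtag_uses(g, nid) >= threshold:
            hot_id = f"hot_{nid}"
            g["nodes"][hot_id] = {"type": "HotTopic"}
            g["edges"].append((hot_id, "refersTo", nid))
            created.append(hot_id)
    return created
```

With the toy data above, `hashtag_uses(graph, "h1")` returns 3, and `mark_hot_topics(graph, 3)` creates one HotTopic node and returns `["hot_h1"]`, leaving the new node and edge in the source graph.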
In addition, the complexity of the languages is measured and compared by counting the number of characters, operators and internal variables used in each query. Finally, a combination of platform and query language is chosen to implement the SDR algorithm, according to our requirements.

The structure of this document is as follows. First, in Section 2 we present a running example in the domain of social networks to illustrate our proposal. Then, Section 3 presents the six processing platforms that we compare in our study and their main features, whereas Section 4 presents the query languages used with these platforms. Platforms and languages are evaluated and compared in Section 5, in terms of execution time and complexity of their syntax, using two case studies; one platform and one language are chosen as the most suitable combination for our proposal. Finally, in Section 6 we discuss related work and Section 7 concludes the technical report.

2 A running example

In order to illustrate the DSL syntax used for each graph processing platform studied in this technical report, consider a system that works with information provided by Flickr (https://www.flickr.com/) and Twitter (https://twitter.com/), and analyses this information together to identify situations of interest [6]. The metamodel is depicted in Figure 1, where there is one class common to the two domains, namely Hashtag. Note that there is no automated way to implement a completely reliable relation among the remaining elements (e.g. automatically relating users based on their Twitter and Flickr identifiers is impossible). Given such a metamodel, we are interested in identifying the following situations of interest:

Q1. HotTopic: A hashtag has been used by both Twitter and Flickr users at least 100 times in the last hour. We would like to generate a new HotTopic object that refers to this hashtag.

Q2.
PopularTwitterPhoto: The hashtag of a photo is mentioned in a tweet that receives more than 30 likes in the last hour. A PopularTwitterPhoto element is created.

Q3. PopularFlickrPhoto: A photo is favored by more than 50 Flickr users who each have more than 50 followers. A PopularFlickrPhoto element is created.

Q4. NiceTwitterPhoto: A user with an h-index (3) higher than 50 posts three tweets in a row in the last hour containing a hashtag that describes a photo. In this case, we generate a NiceTwitterPhoto object.

Q5. ActiveUserTweeted: Considering the 10,000 most recent tweets, search for a user who has published one of these tweets, has an h-index higher than 30, and follows more than 5K users. In this case, an ActiveUserTweeted object is generated pointing to the user.

Note that we have included five extra objects (shaded in gray) in the metamodel. These objects correspond to the objects created as a result of the queries. In this way, a model of this application contains both the information from the sources (Twitter and Flickr) and the information we generate during its lifetime, as the system evolves.

3 Processing platforms

As mentioned in Section 1, one of our requirements is to provide real-time responses, which requires low-latency processing. We have therefore studied six powerful platforms for processing huge amounts of data. These platforms have been tested in two case studies with graph-structured information (explained in the following sections), in order to choose the technology with the lowest latency. In addition to this feature, other aspects have been taken into account when choosing the most suitable technology for our proposal, such as the expressiveness of the DSL or the possibility of updating the information. In this section, an overview of the six platforms is presented. These technologies include five graph databases (Neo4j, JanusGraph, OrientDB, TinkerGraph and Memgraph) and a distributed SQL database (CrateDB). Their features are summarized in Table 1.
Platform     Distributed  In-memory  Disk  Updatable  Query languages
Neo4j        No           No         Yes   Yes        Cypher
JanusGraph   Yes          Yes        Yes   Yes        Gremlin
OrientDB     Yes          Yes        Yes   Yes        Gremlin, SQL
TinkerGraph  No           Yes        Yes   Yes        Cypher, Gremlin
Memgraph     Yes          Yes        Yes   Yes        Cypher
CrateDB      Yes          No         Yes   Yes        SQL

Table 1: Processing platforms used in the experiments of the performance study.

3.1 Neo4j

Neo4j [28] is an open-source graph database that combines native graph storage (the data are stored as a graph, and pointers are used to navigate and traverse its elements), ACID transaction compliance (Atomicity, Consistency, Isolation and Durability) and a scalable architecture designed to perform queries quickly. Neo4j stores the data on disk, and its features make it suitable for production scenarios. In order to provide an intuitive way to manipulate the database, Neo4j uses Cypher [27] as its query language. This DSL is explained in detail in Section 4.2. Some domains where Neo4j is commonly used are financial services, manufacturing and technology companies.

(3) In Twitter, a user has an h-index of h if she has at least h followers who have more than h followers each.

3.2 JanusGraph

JanusGraph [23] is an open-source, scalable graph database designed to be distributed across a multi-machine cluster. This makes it suitable for storing and querying large graphs with hundreds of billions of edges and nodes; however, it also allows a single-node configuration. JanusGraph can store the data in memory or on disk by making use of a backend such as Apache Cassandra, Apache HBase, Google Cloud Bigtable, Oracle BerkeleyDB or ScyllaDB. Among its most important benefits we find support for global graph analytics through the Hadoop framework, concurrent transactions and operational graph processing, as well as native support for Apache TinkerPop [2] and the Gremlin language [3]. This DSL is explained in detail in Section 4.3.
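As an aside, the Twitter h-index used by queries Q4 and Q5 (and defined in footnote 3 above) can be pinned down with a short, illustrative Python sketch; the follower counts below are invented for the example and do not come from the case studies.

```python
def twitter_h_index(follower_counts):
    """Twitter h-index per footnote 3: the largest h such that the user
    has at least h followers who each have more than h followers.

    follower_counts lists, for each follower of the user, how many
    followers that follower has in turn."""
    counts = sorted(follower_counts, reverse=True)
    h = 0
    # Grow h while the (h+1)-th best-connected follower still has
    # more than h+1 followers.
    while h < len(counts) and counts[h] > h + 1:
        h += 1
    return h
```

For example, a user whose four followers have 10, 9, 3 and 2 followers respectively gets an h-index of 2: two of her followers have more than 2 followers each, but there are not three followers with more than 3 followers each.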
In this technical report, we have deployed JanusGraph with the BerkeleyDB backend (disk storage).

3.3 OrientDB

OrientDB [8] is an open-source, multi-model NoSQL database written in Java that combines the power of graphs with the flexibility and scalability of documents in one high-performance operational database.