MASARYK UNIVERSITY FACULTY OF INFORMATICS

Study materials for Big Data processing tools

BACHELOR'S THESIS

Martin Durkáč

Brno, Spring 2021

This is where a copy of the official signed thesis assignment and a copy of the Statement of an Author is located in the printed version of the document.

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Martin Durkáč

Advisor: RNDr. Martin Macák

Acknowledgements

I would like to thank my supervisor RNDr. Martin Macák for all the support and guidance. His constant feedback helped me to improve and finish this thesis. I would also like to express my gratitude towards my colleagues at Greycortex for letting me use their resources to develop the practical part of the thesis.

Abstract

This thesis focuses on providing study materials for the Big Data seminar. The thesis is divided into six chapters plus a conclusion: the first chapter introduces Big Data in general, four chapters contain information about Big Data processing tools, and the sixth chapter describes the study materials provided in this thesis. For each Big Data processing tool, the thesis contains a practical demonstration and an assignment for the seminar in the attachments. All the assignments are provided both with and without solutions. The four Big Data processing tools covered in this thesis are Apache Hadoop for general batch processing, Apache Hive for SQL systems, Spark Streaming for stream processing, and the GraphFrames package of Apache Spark for graph processing.

Keywords

Big Data, Apache Hadoop, HDFS, YARN, MapReduce, Apache Hive, Apache Spark, Spark Streaming, GraphFrames

Contents

Introduction

1 Big data

2 Batch processing
    2.1 Characteristics
    2.2 Hadoop Distributed File System
    2.3 Yet Another Resource Negotiator
    2.4 Hadoop MapReduce
    2.5 Application structure
        2.5.1 Writable
        2.5.2 Record Reader
        2.5.3 Mapper
        2.5.4 Reducer
        2.5.5 Comparator
        2.5.6 Partitioner
    2.6 Apache Hadoop alternative

3 Structured data processing using SQL
    3.1 Characteristics
    3.2 Architecture
    3.3 HiveServer2
    3.4 Hive Metastore
    3.5 Limitations
    3.6 HiveQL
    3.7 Apache Hive alternative

4 Stream processing
    4.1 Apache Spark
    4.2 Resilient Distributed Datasets
    4.3 Spark Streaming
    4.4 Apache Spark Streaming alternative

5 Graph processing
    5.1 GraphFrames
    5.2 DataFrame API

    5.3 GraphFrames features
        5.3.1 Motif finding
        5.3.2 Breadth-first search
        5.3.3 Subgraphs
        5.3.4 Other built-in graph algorithms
    5.4 GraphFrames alternative

6 Study materials
    6.1 Tasks and demonstrations

    6.2 Usage

7 Conclusion

A Attachments

Bibliography

List of Tables

2.1 Differences between Apache Hadoop and Apache Spark
3.1 Differences between SQL-92 standard and HiveQL
4.1 Examples of actions in Apache Spark
4.2 Examples of transformations in Apache Spark

List of Figures

2.1 Example distribution of HDFS blocks across a cluster with replication factor two
2.2 MapReduce data flow
2.3 Example of the Apache Hadoop application structure
3.1 Apache Hive architecture
4.1 Apache Spark architecture with Spark Standalone cluster manager and HDFS storage
4.2 Spark Streaming usage of D-Streams
5.1 Apache Giraph architecture

Listings

2.1 Writable example
2.2 Record reader initialize method example
2.3 Record reader nextKeyValue method example
2.4 Record reader createRecordReader method example
2.5 Reducer example
3.1 Word count query in Apache Hive

Introduction

Today we are surrounded by Big Data in many forms, and our daily interactions are affected by it. Big Data is used in social media, medicine, IoT (Internet of Things), and many more areas [1]. It is here to stay and is continuously growing. Big Data, similar to other areas of computer science, is continuously evolving, and many IT workers, analysts, and economists do not understand it and cannot work with it. This is mainly because of its properties, such as huge volume, great variety, and velocity [2]. The challenge of Big Data management is coping with all three of these properties in the most efficient way possible. The primary purpose of this thesis is to provide study materials for the most popular Big Data tools for each category of Big Data as described in [3]. These are Apache Hadoop for general-purpose batch processing and Apache Hive for big SQL systems. For both streaming and graph processing, the thesis provides materials about Apache Spark with its Spark Streaming library and GraphFrames package. Each of these chapters also contains a section about one or more alternatives to the given Big Data tool, describing differences, advantages, and disadvantages in comparison. Study materials from the thesis explain and describe all of the Big Data tools mentioned above, including a tutorial for downloading, installation, and a demonstration on a simple example. Furthermore, there are assignments on each Big Data tool with the solution provided. These tutorials are compatible with most Linux distributions. System requirements are mentioned in the setup part of the tutorials. The outcome of the thesis is to provide people with study materials about the previously mentioned Big Data tools. The materials contain all the important information to start using these tools and improve skills in the Big Data field of computer science. The thesis is organized as follows. Chapter 1 introduces Big Data and processing concepts related to it. Apache Hadoop and its components are introduced and described in chapter 2. Chapter 3 contains information about Apache Hive. Chapters 4 and 5 describe Apache Spark Streaming and Apache Spark GraphFrames respectively. In chapter 6, the usage of the thesis and the materials provided are explained. Chapter 7 concludes the thesis.

1 Big data

This thesis is focused on describing options for working with Big Data using current Big Data processing platforms. Big Data is the idea of data exponentially growing in size, coming in unstructured form from digital devices, computation systems, usage of the Internet, and other applications in people's daily lives. As of 2012, the number of Internet users was around 2.27 billion [3]. In 2021, the number of Internet users reached 4.66 billion [4] and continues to grow. More Internet users result in a huge and increasing amount of user-generated content, such as hundreds of hours of video uploaded to YouTube, hundreds of thousands of tweets on Twitter, and millions of Google searches happening every minute. Many companies are continuously gathering massive datasets containing information about customer interactions, product sales, and other information that may help improve their business. This is only a fraction of the new data that is being created. Big Data can be described using multiple properties, whose number ranges from three to seven, depending on which definition is used. The most important properties describing Big Data are [3]:

• Volume,

• Variety,

• Velocity.

The Volume property is linked with the huge amount of data, which can mean billions of rows and millions of columns. The Variety of data is another challenge of Big Data, where no exact format of the data is defined. The data comes in different formats, from different data sources, and in different data structures. Velocity describes the trend where most of the data has been created in the most recent years. The speed at which new data is being generated keeps increasing, and there is now a need to not only analyze already stored data but to do real-time analysis on those enormous volumes of data.


Additionally, there are other definitions adding more properties describing Big Data [5]. These are other properties that can describe Big Data:

• Variability - the problem of the interpretation of data, which changes depending on the context, especially in areas such as language processing,

• Visualization - the data is readable and accessible,

• Veracity - the accuracy and trustworthiness of the data, since incomplete data may be rendered useless,

• Value - the storage and analysis of the data would have no use if they cannot be turned into value.

There are many ways to categorize Big Data tools. This thesis focuses on Big Data tools and categories used for processing the data. One of the ways to classify Big Data processing tools, and the way used in this thesis, is into these four categories [3]:

• Batch processing,

• Structured data processing,

• Stream processing,

• Graph processing.

2 Batch processing

Batch processing is the simplest way to handle unstructured data in very large volumes [6]. The data are collected, stored, and then processed at once, which can take a long time. Since the volumes are so large, the main principles of this type of processing are the ability to run on low-cost, unreliable commodity hardware while being highly fault-tolerant, being highly scalable, and being able to run in parallel in a cluster with multiple working nodes. This chapter contains information about the batch processing tool Apache Hadoop. It starts with basic information about Hadoop components, contains a description of an actual Hadoop application, and ends with a comparison with a Hadoop alternative.

2.1 Characteristics

Hadoop is an open-source framework written in Java, but it extends support to more languages such as Groovy, Ruby, C++, C, Python, and Perl [7]. Apache Hadoop allows distributed processing of large datasets across a cluster of computers using a simple programming model, offering scalability from a single machine to thousands of machines, each with its own storage and computation [8]. Every machine included in a Hadoop cluster uses the Hadoop Distributed File System (HDFS) for storing the data, the MapReduce framework for computation, and possibly "yet another resource negotiator" (YARN). YARN replaces the JobTracker from classic Hadoop, which was used to monitor the running status of each job [9].

2.2 Hadoop Distributed File System

HDFS is the file system component of Hadoop. It is based on the UNIX file system, but it has been modified for better performance. HDFS stores data and metadata separately and allows for storing across many hosts. These hosts can create clusters of servers, scalable by simply adding new ones [10]. It is a file system built for storing large files with streaming data access. HDFS puts a lot of emphasis on a write-once, read-many-times pattern, meaning a dataset is typically copied from the source and various analyses are performed over that dataset. For each analysis, it is more important to read the whole dataset than a single record [11]. The data are split into multiple HDFS blocks, usually 128 MB in size [12]. HDFS blocks are distributed and replicated across the nodes in the cluster. In a single-node setup, all the blocks are located on a single node. The center of HDFS is called the NameNode, which holds the whole directory structure tree and tracks where the data is kept. Figure 2.1 shows an example of an HDFS cluster.

Figure 2.1: Example distribution of HDFS blocks across a cluster with replication factor two.

2.3 Yet Another Resource Negotiator

Introduced in Hadoop 2.0, YARN is used for more advanced resource management [13]. It provides greater scalability, higher efficiency, and more efficient sharing of a cluster between different frameworks [14]. The fundamental idea is to split the functionality of resource management and job scheduling into separate daemons by having a global resource manager and an application master. The application master is a framework-specific library used for resource sharing negotiations with the resource manager and for managing tasks for node managers [15].

2.4 Hadoop MapReduce

Figure 2.2: MapReduce data flow

Hadoop MapReduce is an open-source implementation of the MapReduce framework proposed by Google [16]. MapReduce is the preferred solution among developers [17] because it is simple to use and does not require database knowledge. To make a simple Hadoop application, users only need to define a mapper and a reducer. The input file is parsed into key-value pairs, which are sent to the mapper. The mapper creates intermediate key-value pairs, which are then grouped by key, and the reducer is called for each group. Before being sent to the reducer, the key-value groups can be sorted, which means defining a specific comparator for the MapReduce job. This flow is described in Figure 2.2.


2.5 Application structure

Figure 2.3: Example of the Apache Hadoop application structure


One of the advantages of Hadoop's MapReduce programming model is that simple applications can be created using only a mapper and a reducer, but the MapReduce framework in Java offers much more flexibility in what can be achieved. Figure 2.3 describes an example of a more complex Apache Hadoop MapReduce application, which is described in more depth in the following subsections and in the provided demonstration project from the attachments.
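The driver sketch below shows one way the components of such an application could be wired together in the job configuration. It is a minimal sketch only: the class names mirror those shown in Figure 2.3, and the output types, the input format, and the paths are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class StudentsDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "students");
        job.setJarByClass(StudentsDriver.class);
        // components described in the following subsections
        job.setInputFormatClass(StudentsValueInputFormat.class);
        job.setMapperClass(StudentsMapper.class);
        job.setPartitionerClass(StudentsPartitioner.class);
        job.setSortComparatorClass(CompositeKeyComparator.class);
        job.setGroupingComparatorClass(NaturalKeyComparator.class);
        job.setReducerClass(StudentReducer.class);
        // assumed output key and value types of the job
        job.setOutputKeyClass(StudentsKey.class);
        job.setOutputValueClass(Text.class);
        // input and output locations in HDFS, passed as arguments
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}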

2.5.1 Writable

Listing 2.1: Writable example

public interface Writable {
    // how to write all the data
    // of the writable to data output
    void write(DataOutput dataOutput) throws IOException;

    // how to read data from data input
    // to create writable object
    void readFields(DataInput dataInput) throws IOException;
}

Writable is an interface for the objects MapReduce works with, and it is used for their serialization and deserialization. The Hadoop library already offers wrappers for most Java primitive types, for example, Text, IntWritable, DoubleWritable, and more. When more complex objects are required, using custom objects implementing the Writable interface can simplify the work with mappers and reducers. Implementing the Writable interface, as seen in Listing 2.1, is simple and requires all fields of the object to be written to DataOutput or read from DataInput respectively. If fields of the object do not support the write and readFields methods, they need to be wrapped with one of the writable classes offered by Hadoop, as mentioned before. If writable objects also need to be comparable for sorting purposes, there is the WritableComparable interface, which extends Writable

and Comparable. When implementing the Writable interface, a default constructor is necessary, as Hadoop uses reflection to create objects, and not having a default constructor causes a runtime exception. Overriding the toString method defines the format in which the object will be output by the reducer in the final phase of the MapReduce job. Hadoop also offers NullWritable, which is a special type of writable used as a placeholder when the key or the value is not needed [11].
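A hedged sketch of such a custom writable is shown below. The StudentsKey name comes from Figure 2.3, while its fields (name and year) are only illustrative assumptions.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class StudentsKey implements WritableComparable<StudentsKey> {
    // primitive fields wrapped with the writable classes offered by Hadoop
    private final Text name = new Text();
    private final IntWritable year = new IntWritable();

    // default constructor is required, Hadoop creates writables via reflection
    public StudentsKey() {
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        name.write(dataOutput);
        year.write(dataOutput);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        name.readFields(dataInput);
        year.readFields(dataInput);
    }

    @Override
    public int compareTo(StudentsKey other) {
        int byName = name.compareTo(other.name);
        return byName != 0 ? byName : year.compareTo(other.year);
    }

    @Override
    public String toString() {
        // defines how the key is written to the output file by the reducer
        return name + "\t" + year;
    }
}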

2.5.2 Record Reader

RecordReader, together with FileInputFormat, defines how the input file will be parsed into the writable objects and sent as key-value pairs into the mapper. A simple record reader that reads the input file line by line can be implemented by extending the RecordReader class, where KEYIN and VALUEIN are the types that will be used as input to a mapper. Initializing the record reader is possible with LineRecordReader if the input file uses each line as a single record. Parsing is then done in the nextKeyValue method. It is advised to have the key, the value, and the record reader used to load raw data as private fields. Listings 2.2 and 2.3 show how to create a simple record reader.

Listing 2.2: Record reader initialize method example

@Override
public void initialize(InputSplit inputSplit,
        TaskAttemptContext taskAttemptContext) throws IOException {
    // using record reader, which
    // loads the file line by line
    lineRecordReader = new LineRecordReader();
    lineRecordReader.initialize(inputSplit, taskAttemptContext);
}


Listing 2.3: Record reader nextKeyValue method example

@Override
public boolean nextKeyValue() throws IOException {
    // parse current value as string

    // set key and value which are
    // private fields of the reader

    return true;
}

To create a record reader instance, it is necessary to create an input format by extending FileInputFormat and implementing the createRecordReader method of the custom input format, which is then registered to the job, as described in Listing 2.4.

Listing 2.4: Record reader createRecordReader method example

@Override
public RecordReader createRecordReader(InputSplit split,
        TaskAttemptContext context) throws IOException {
    CustomRecordReader reader = new CustomRecordReader();
    reader.initialize(split, context);
    return reader;
}

2.5.3 Mapper

Mapper creates intermediate key-value pairs from the input of the RecordReader. To create a simple mapper, it is necessary to create a class that extends Mapper with the appropriate input and output key-value types.


To add more functionality to the mapping, for example not writing incorrect input to the context, or logging, it is possible to override the map method.
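A minimal mapper sketch is shown below; it assumes plain text input read line by line and a word-count style job, so the types and the tokenization are illustrative only.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // split the input line into tokens and emit (token, 1) pairs
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // skip blank tokens instead of writing them to the context
            }
            word.set(token);
            context.write(word, ONE);
        }
    }
}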

2.5.4 Reducer

Reducing is the final phase of the MapReduce job, where the reducer takes the output of the mapper and processes it. The output of the reducer is stored in the HDFS. A custom reducer is created by extending Reducer, where KEYIN and VALUEIN depend on the output of the mapper, and KEYOUT and VALUEOUT define the output which will be stored in the HDFS. The format of the output, i.e. how it will be written to the output file, is defined by the toString method of the KEYOUT and VALUEOUT writables. For a simple reducer, it is mostly enough to override the reduce method. The reduce method is called with each key and all values associated with that key after the mapping phase. The output of the reducer is sorted by the SortComparatorClass if set in the job. Listing 2.5 shows a simple reducer.

Listing 2.5: Reducer example

@Override
protected void reduce(KEYIN key, Iterable<VALUEIN> values,
        Context context) throws IOException, InterruptedException {

    // aggregation of the values

    // creating keyOut and valueOut that
    // will be saved to the HDFS based
    // on their toString method

    context.write(keyOut, valueOut);
}


2.5.5 Comparator

Comparators can be used to change the order of records in different stages of the MapReduce job. Two common comparators can be utilized in the MapReduce job. GroupingComparator is used to make sure that all the values for the key get to the reducer in one call [18]. SortComparator is used to sort keys after the mapping phase before getting to the reducer. Both can be implemented by extending WritableComparator and setting them for the MapReduce job.
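A hedged sketch of such a comparator follows. It reuses the CompositeKeyComparator name from Figure 2.3 and the illustrative StudentsKey writable from the sketch above, so both names are assumptions.

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class CompositeKeyComparator extends WritableComparator {

    protected CompositeKeyComparator() {
        // true tells Hadoop to deserialize the keys before comparing them
        super(StudentsKey.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // delegate to the compareTo method of the custom writable
        return ((StudentsKey) a).compareTo((StudentsKey) b);
    }
}

The comparator is then registered in the driver, for example with job.setSortComparatorClass(CompositeKeyComparator.class) or job.setGroupingComparatorClass(...), depending on which stage it should affect.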

2.5.6 Partitioner

The partitioner is used between the mapping and reducing phases to determine to which reducer a key will go. Implementing a partitioner is necessary when using multiple reducers, which means setting the number of reducers in the job with the setNumReduceTasks(int tasks) method. A custom partitioner can be created by extending Partitioner and implementing the getPartition method.
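A hedged sketch of such a partitioner is shown below; the StudentsPartitioner name comes from Figure 2.3, and the key and value types are illustrative assumptions.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class StudentsPartitioner extends Partitioner<StudentsKey, Text> {

    @Override
    public int getPartition(StudentsKey key, Text value, int numPartitions) {
        // mask the sign bit so the partition index is never negative
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

In the driver, the partitioner only takes effect together with multiple reducers, for example job.setNumReduceTasks(4) followed by job.setPartitionerClass(StudentsPartitioner.class).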

2.6 Apache Hadoop alternative

One of the popular alternatives for Big Data batch processing is Apache Spark. Apache Spark is a versatile processing tool used for all kinds of Big Data processing, including general-purpose batch processing. The difference between Apache Spark and Apache Hadoop is that Apache Spark was built to minimize disk usage. Apache Spark avoids using the disk for storing the data between the mapping and reducing phases, which results in better performance. This makes it more memory-demanding than Apache Hadoop. Apache Hadoop is mainly intended for static data operations, while Apache Spark is more often used for applications with multiple operations and real-time data analysis using Spark Streaming, which is described in chapter 4 [19]. Table 2.1 describes the main differences between Apache Hadoop and Apache Spark [20].


Table 2.1: Differences between Apache Hadoop and Apache Spark

                    Apache Hadoop                       Apache Spark
Performance         Slower, many disk I/O operations    Faster, computations more in-memory
Intended for        Batch processing                    Iterative, real-time processing
Fault tolerance     High, replication across the nodes  Can rebuild data across nodes
Scalability         Easy, adding nodes and disks        More complicated, relying on RAM computations
Security            LDAP, ACLs, Kerberos, ...           Not secure, relies on Hadoop
Machine learning    Slower                              Better, MLlib library

3 Structured data processing using SQL

Structured data processing using SQL is a concept of processing Big Data similarly to general-purpose batch processing, but using a simple SQL-like declarative language instead of writing low-level MapReduce programs. SQL in Big Data is just another layer of abstraction, which simulates structure and database-like storage of data with the same principles as in batch processing. A major disadvantage of Apache Hadoop is its complexity for simple analyst tasks. Users need to write tedious and sometimes difficult MapReduce programs for data manipulation [21]. Even for the simplest MapReduce program, the ability to code in some programming language such as Java or Python is needed, along with defining the basic components of the MapReduce program as described in the previous chapter. Apache Hive is one of the solutions to this problem, and this chapter describes the Apache Hive components and its SQL-like query language HiveQL.

3.1 Characteristics

Apache Hive is an SQL engine which translates SQL queries into MapReduce jobs, which are then executed in Hadoop MapReduce. It is another layer of abstraction which eliminates the need to understand and program often complex MapReduce programs [22]. The first version of Hive was developed by Facebook. Facebook found that most of their analysts considered Apache Hadoop too low-level for simple everyday tasks. They decided to use SQL as a way to describe MapReduce jobs, since it was widely spread and most of their analysts knew how to use it [23].

3.2 Architecture

Apache Hive acts as a middleman between the user and the Hadoop cluster, as described in Figure 3.1. It allows users to communicate through two interfaces, which are the Command Line Interface and HiveServer2. The Hive Command Line Interface is a simple command-line shell and the default way to interact with Hive. HiveServer2 is the successor to HiveServer and enables remote clients to interact with Hive, run queries, and retrieve results. The most notable improvements over the original HiveServer are that HiveServer2 allows for multi-client support and better authentication [24]. The interface submits a statement to the driver. The statement is parsed, and then the driver creates an Abstract Syntax Tree, which is sent to the query planner. The query planner analyzes the abstract syntax tree and chooses a specific planner implementation. An executable MapReduce job is generated by the planner and run on the underlying data processing engine, which is currently Hadoop MapReduce. During this process, the driver contacts the Metastore to retrieve the needed metadata [21]. The Metastore runs on a Relational Database Management System (RDBMS), which is by default Apache Derby but can be set up with MySQL or another RDBMS.

Figure 3.1: Apache Hive architecture


3.3 HiveServer2

HiveServer2 is a service that enables client interaction with Hive. It is the successor to the original HiveServer, adding multi-client concurrency and better authentication support. HiveServer2 is built on a Thrift service base. Thrift is a Remote Procedure Call (RPC) framework from Apache for scalable services development [25]. To run HiveServer2, dependencies on the Metastore and a Hadoop cluster are required. A minimal client sketch is shown after the following list. The Thrift-based Hive service consists of four layers [26]:

• Server - establishing a connection,

• Transport - HTTP or TCP mode,

• Protocol - serialization and deserialization,

• Processor - application logic.
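The sketch below shows one common way to talk to HiveServer2, through its JDBC driver; the host, port, credentials, database, and table name are assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 listens on port 10000 by default (binary transport mode)
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection connection = DriverManager.getConnection(url, "hive", "");
             Statement statement = connection.createStatement();
             ResultSet result = statement.executeQuery(
                     "SELECT COUNT(*) FROM students")) {
            while (result.next()) {
                System.out.println(result.getLong(1));
            }
        }
    }
}

The hive-jdbc driver has to be on the classpath for the jdbc:hive2 URL to resolve; the query then passes through the layers listed above and is executed as described in the architecture section.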

3.4 Hive Metastore

The Hive Metastore is a central repository for Apache Hive holding all the metadata [27]. It stores the schema, table locations, partitions, and more. The Metastore provides client access to this information and runs on a Relational Database Management System (RDBMS). The default RDBMS for Apache Hive is Apache Derby. However, Apache Derby is mostly used only for testing purposes, since it allows only one Hive session to run at a time. Currently, Apache Hive supports five database systems to run the Metastore:

• Apache Derby,

• MySQL,

• MS SQL Server,

• Oracle,

• PostgreSQL.


3.5 Limitations

Since Apache Hive is just an additional layer of abstraction on top of Hadoop MapReduce, its features and limitations depend on Hadoop and HDFS. Because HDFS is a write-once, read-many-times file system, it restricts Hive from performing row-level data modifying operations such as insert, update, and delete. Because of this, Hive is more appropriate for data analysis tasks than for online transaction processing. Additionally, Apache Hive has a minimum latency built in for each job [23]. Compared to standard relational database management systems, Hive is considerably slower even on small datasets, since it has to compile the given query statements and generate MapReduce jobs. These jobs then need to be coordinated across the cluster, which takes time. However, since Hadoop, which runs under Hive, is scalable and able to run MapReduce jobs across a cluster of nodes, Hive is reasonably quick for large datasets. Therefore, Hive is a high-latency, high-throughput data processing system.

3.6 HiveQL

HiveQL is an SQL-inspired query language of Apache Hive that interacts with one or more files stored in the local file system or the HDFS. Its main advantage is hiding the complexity of the Hadoop MapReduce framework from the developers. HiveQL does not support all the SQL features, since it is limited by the Hadoop MapReduce framework, which it runs on top of. However, with new versions of Apache Hive, more functions and features are being added to HiveQL [28]. Differences between HiveQL and the SQL-92 standard are described in Table 3.1 [29, 30, 31]. Most of the simple SQL operations, such as filtering, grouping, joining, and aggregation functions, are already implemented in Apache Hive and use the same syntax. However, Apache Hive has some advanced functions that help to do more complex queries in a simpler way, as shown in Listing 3.1 [32].


Table 3.1: Differences between SQL-92 standard and HiveQL

                          SQL-92                              HiveQL
Write operations          Insert, Update, Delete              Insert, Overwrite table or partition
Transactions              Yes                                 Limited transaction support
Table constraints         Primary, foreign key constraints    No primary or foreign key constraints
Sub-queries               Yes                                 Only in FROM, WHERE or as HAVING clause
Indexes                   Yes                                 Not supported since version 3.0
Views                     Yes                                 Read-only
Multitable inserts        No                                  Yes
Create table as select    No                                  Yes

Listing 3.1: Word count query in Apache Hive

SELECT word, count(1)
FROM (
    SELECT explode(split(line, ' ')) AS word
    FROM document
) w
GROUP BY word;

3.7 Apache Hive alternative

Similar to the Apache Hadoop alternative, Apache Spark provides an alternative to Apache Hive with its Spark SQL library. The best way to describe Spark SQL is that it is an interface to execute SQL-like queries on the Apache Spark processing engine. Spark SQL is largely compatible with HiveQL queries through its interface and additionally allows intermixing relational and procedural processing techniques. The differences in resource utilization between Spark SQL and Apache Hive using the BigBench benchmark [33] are the following [34]:

• Spark SQL uses less CPU with a higher I/O wait time than Apache Hive,

• Spark SQL writes less data on disk but reads more data than Apache Hive,

• Apache Hive utilizes more memory than Spark SQL,

• Apache Hive sends more data over the network than Spark SQL.

Another viable alternative to Apache Hive is Apache Pig. Apache Pig is also a layer of abstraction, usually on top of Hadoop MapReduce, which uses an SQL-like language called Pig Latin. There is a big difference between Pig Latin and HiveQL. Pig Latin is a procedural data-flow language that fits the pipeline paradigm, while HiveQL is a declarative query processing language. Nonetheless, both Apache Hive and Apache Pig offer similar features when it comes to joining, aggregating, sorting, or other SQL functions [35]. Apache Hive and Apache Pig are both great options for working with large datasets. The difference between them is that Apache Hive can run simple queries faster than Apache Pig. However, as the queries get more complex with many joins and filters, Apache Pig can process them more efficiently by executing multiple steps at the same time [36].

4 Stream processing

Stream processing in Big Data is handling and providing real-time analysis of the incoming data. The main difference compared to batch processing is that instead of storing large amounts of data and processing them at once, the data are processed as soon as they become available. However, this does not necessarily mean the data must be processed one by one. In certain situations, the data can be stored in micro-batches and handled more similarly to batch processing, usually to improve the throughput. Both previously mentioned Big Data tools, Apache Hadoop and Apache Hive, are used for processing and analyzing data already captured and stored. In the case of stream processing, there is a necessity for low-latency analysis of real-time data, which is something Hadoop and Hive are just not built for. In recent years, new tools have been developed, such as Apache Spark with one of its components being Spark Streaming, to overcome this issue [37].

Figure 4.1: Apache Spark architecture with Spark Standalone cluster manager and HDFS storage


4.1 Apache Spark

The Apache Spark project started in 2009 at the University of California, Berkeley, to design a unified engine for distributed data processing [38]. Apache Spark consists of the Apache Spark Core component and upper-level libraries, each created for a specific type of workload. While having similarities to Hadoop's MapReduce, to accommodate different types of workloads, Apache Spark uses Resilient Distributed Datasets as its core abstraction. RDDs allow for creating efficient and scalable data algorithms, which have been evolving since the initial Spark version [39]. Apache Spark can run in a cluster with a master node and multiple worker nodes. These are managed by a cluster manager, for example Spark Standalone, Hadoop YARN, or Amazon EC2. Storage is handled for example by HDFS, Cassandra, or HBase [40]. Figure 4.1 describes a simple architecture of Apache Spark. Additionally, a big advantage over Apache Hadoop is that Apache Spark is very fast. Apache Spark supports both disk-based and in-memory computing, as opposed to the disk-based Apache Hadoop. Processing datasets in memory helps Spark to be up to 100 times faster than Hadoop MapReduce [41].

4.2 Resilient Distributed Datasets

Resilient Distributed Dataset (RDD) is the most fundamental data structure in Apache Spark. RDDs are resilient, meaning they are immutable and any operation on a single RDD creates a new one. They are distributed because all the computations and analysis can be performed on multiple machines in different logical partitions.


Table 4.1: Examples of actions in Apache Spark

Spark function         Action
take(n)                returns first n elements of RDD
first()                returns first element, similar to take(1)
count()                returns number of elements in the RDD
collect()              returns all elements in the RDD
saveAsTextFile(path)   writes elements as Hadoop SequenceFile in the given path

Table 4.2: Examples of transformations in Apache Spark

Spark function    Transformation
map(f)            returns a new RDD as a result of function f applied to every element of the given RDD
flatMap(f)        similar to map, but function f can return multiple objects
filter(f)         returns a new RDD containing only elements on which f returns true
distinct()        returns a new RDD with distinct elements
sortBy(f)         returns a new RDD with elements sorted by f
reduceByKey(f)    returns a new RDD with elements as (K, V) pairs aggregated by an f which must be of type (V1, V2) => V3

Apache Spark allows two types of operations on RDDs: transformations and actions. Transformations return RDDs, while actions return non-RDD objects. Examples of actions and transformations are shown in Tables 4.1 and 4.2 respectively [42, 43].
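The short sketch below chains several of the transformations and actions from Tables 4.1 and 4.2 in the Java API; the local master, the input path, and the word-count use case are assumptions.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class RddWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-word-count").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs:///data/students.txt");
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator()) // transformation
                    .filter(word -> !word.isEmpty())                               // transformation
                    .mapToPair(word -> new Tuple2<>(word, 1))                      // transformation
                    .reduceByKey(Integer::sum);                                    // transformation
            counts.take(10).forEach(System.out::println);                          // action
        }
    }
}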

4.3 Spark Streaming

Spark Streaming is a library of Apache Spark for processing a continuous stream of data. Instead of processing records one at a time like more traditional stream processing tools, it processes multiple records in so-called micro-batches [44]. These micro-batches arrive as RDDs, which are then passed to the Spark engine to be processed more like traditional batch workloads [45], as described in Figure 4.2. This is achieved by using discretized streams (D-Streams). D-Streams divide the stream into a series of deterministic batch computations, which run across the cluster on RDDs stored in the nodes [46]. Thanks to D-Streams, Spark Streaming is easy to use once the concept of RDDs in Apache Spark is understood.

Figure 4.2: Spark Streaming usage of D-Streams
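A hedged sketch of a Spark Streaming word count over D-Streams is shown below; the socket source on localhost:9999 and the 5-second batch interval are assumptions.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StreamingWordCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("streaming-word-count").setMaster("local[2]");
        // every 5-second micro-batch becomes one RDD inside the D-Stream
        JavaStreamingContext streamingContext =
                new JavaStreamingContext(conf, Durations.seconds(5));
        JavaDStream<String> lines = streamingContext.socketTextStream("localhost", 9999);
        JavaPairDStream<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);
        counts.print();
        streamingContext.start();
        streamingContext.awaitTermination();
    }
}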


4.4 Apache Spark Streaming alternative

Another stream processing tool that is an alternative to Apache Spark Streaming is Apache Storm. Apache Storm is also a real-time distributed processing system. Five key attributes of Apache Storm are high speed, ease of use, fault tolerance, reliability, and scalability. Applications are built on an architecture called topologies. An advantage of using this architecture is that it can be used with almost any programming language that supports stdin/stdout communication via JSON [47]. Compared to Apache Spark Streaming, Apache Storm does not use micro-batching to process data. Rather than using D-Streams, Apache Storm tries to process an event as soon as it becomes available. This results in lower latency than Apache Spark Streaming. However, the use of micro-batching in Apache Spark Streaming means higher throughput in this case [48].

5 Graph processing

Graph processing is a type of processing where large amounts of data are stored as a set of vertices with some sort of relationships between them. This is used for describing complicated structures such as social networks, biological networks, chemical compounds, or other sets of data where interactions between objects are important. The emphasis is on the ability to run graph algorithms on these data and process them efficiently. With increasing relations between people, devices, and other entities in general, graphs are becoming a popular data structure nowadays. They are a great, flexible way to model relationships and interactions between objects [3].

5.1 GraphFrames

GraphFrames is a package for Apache Spark, built on top of Spark SQL, providing graph processing using DataFrames rather than the RDDs mentioned in the previous chapter. GraphFrames combines the functionality of GraphX, the Apache Spark graph processing library, and DataFrames, with added features such as motif finding for pattern searching inside graphs [49]. The usage of DataFrames means that instead of using RDDs, the data output from GraphFrames functions is represented as a group of named columns. These named columns are called DataFrames and they are part of the Apache Spark SQL library [50]. Each GraphFrame is represented by two DataFrames: vertices and edges [51].
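A hedged sketch of building a GraphFrame from these two DataFrames is shown below; the CSV files and their contents are assumptions, while the id, src, and dst column names are the ones GraphFrames expects.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.graphframes.GraphFrame;

public class GraphFrameExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("graphframes-example").master("local[*]").getOrCreate();
        // vertices.csv with columns id,name and edges.csv with columns src,dst,relationship
        Dataset<Row> vertices = spark.read().option("header", "true").csv("vertices.csv");
        Dataset<Row> edges = spark.read().option("header", "true").csv("edges.csv");
        GraphFrame graph = GraphFrame.apply(vertices, edges);
        graph.vertices().show();
        graph.edges().show();
        spark.stop();
    }
}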

5.2 DataFrame API

DataFrame comes from Spark SQL as its main abstraction. It is a distributed collection of rows with the same schema. Compared to RDDs, a DataFrame supports multiple relational SQL-like operations such as join, group by, select, and many more. The best way to imagine a DataFrame is as an SQL table or the output of a select statement in SQL database systems [52]. A DataSet is another way to represent the data in GraphFrames, seemingly the same as a DataFrame. A DataSet is an extension of DataFrame, currently available only in Java and Scala, providing a type-safe, object-oriented programming interface with more optimizations and the ability to be transformed back into an RDD [53]. This came into Spark in later versions and can currently be used to interact with the data.
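The snippet below illustrates these relational operations on a DataFrame loaded from a CSV file; the file name and its columns are assumptions.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DataFrameExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("dataframe-example").master("local[*]").getOrCreate();
        // students.csv with columns name and year is an assumption
        Dataset<Row> students = spark.read()
                .option("header", "true").option("inferSchema", "true").csv("students.csv");
        // relational, SQL-like operations known from database systems
        students.select("name", "year")
                .filter("year >= 2019")
                .groupBy("year")
                .count()
                .show();
        spark.stop();
    }
}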

5.3 GraphFrames features

Although GraphFrames offers high-level APIs for Scala, Java, and Python, the authors are still keeping it as a separate package from the Apache Spark core. As previously mentioned, GraphFrames provides the graph processing functionality of GraphX and the data processing functionality of Spark SQL with some additional features [49]. These are some of the features that can be utilized with GraphFrames [54, 55]:

• motif finding,

• breadth-first search (BFS),

• subgraphs,

• triangle count,

• connected components.

5.3.1 Motif finding

Motif finding is pattern searching in a graph. GraphFrames uses a simple Domain-Specific Language (DSL) to find the vertex and edge combinations inside the graph that fit the pattern. Such a pattern could look like the following example.

graphFrame.find("(a)-[e1]->(b); (b)-[e2]->(c)");


Such an expression would return a DataSet with columns a, e1, b, e2, c, where there exists an edge e1 from vertex a to vertex b and another edge e2 from vertex b to vertex c.

5.3.2 Breadth-first search

GraphFrames supports breadth-first search to look for the shortest path, where the first and the last vertex are defined by an expression that needs to be satisfied by each of them. There are optional edgeFilter and maxPathLength methods to restrict the search. The result is returned in the form of a DataSet.
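A hedged sketch of such a search follows; graph is the GraphFrame from the sketch in Section 5.1, and all filter expressions are illustrative assumptions.

// graph is the GraphFrame built in the Section 5.1 sketch
Dataset<Row> paths = graph.bfs()
        .fromExpr("name = 'Alice'")              // condition on the start vertex
        .toExpr("name = 'Bob'")                  // condition on the end vertex
        .edgeFilter("relationship != 'blocked'") // optional restriction on traversed edges
        .maxPathLength(5)                        // optional bound on the search depth
        .run();
paths.show();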

5.3.3 Subgraphs

GraphFrames offers a flexible way to create subgraphs by filtering vertices and edges. Isolated vertices can be removed by the built-in GraphFrame method dropIsolatedVertices. Since a GraphFrame is created from two datasets, vertices and edges, it is even possible to create a subset of vertices and edges by using motif finding or another built-in algorithm and then selecting the wanted results.
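A hedged sketch of such filtering follows, assuming the filterVertices and filterEdges helpers available in recent GraphFrames releases; graph is again the GraphFrame from Section 5.1 and the filter expressions are assumptions.

// graph is the GraphFrame built in the Section 5.1 sketch
GraphFrame subgraph = graph
        .filterVertices("year >= 2019")
        .filterEdges("relationship = 'friend'")
        .dropIsolatedVertices();
subgraph.vertices().show();
subgraph.edges().show();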

5.3.4 Other built-in graph algorithms

Triangle counting is used in graph theory to detect communities. A triangle is a set of three vertices, where each vertex is connected to the other two. The triangleCount method on a GraphFrame returns the number of triangles.

graphFrame.triangleCount().run();

Another built-in algorithm finds the connected components. Connected components are isolated clusters of vertices, where each vertex is reachable from any other vertex in the cluster. To search for connected components, GraphFrame includes the connectedComponents method.

graphFrame.connectedComponents().run();


5.4 GraphFrames alternative

The first thing that comes to mind when talking about GraphFrames alternatives is the Apache Spark GraphX library, which GraphFrames extends with additional features. When using GraphX alone, DataFrames are replaced with RDDs, similar to those discussed in Chapter 4 for Spark Streaming. GraphX lacks some additional GraphFrames features such as pattern matching and relational queries. GraphFrames is at least as expressive as GraphX because all GraphX operators can be mapped to operators in GraphFrames [51]. A true alternative to GraphFrames and GraphX would be Apache Giraph. Apache Giraph is an iterative graph processing framework, built on top of Apache Hadoop [56], as described in Figure 5.1.

Figure 5.1: Apache Giraph architecture

It is hard to compare performance and usability between GraphFrames and Apache Giraph because there are not many reliable comparisons, but in recent years, Facebook published its comparison between GraphX and Apache Giraph. The comparison helps to draw some conclusions about the differences between GraphFrames and Apache Giraph, given the similarities in functions and underlying processing engine between GraphFrames and GraphX. The comparison found that Apache Giraph was better at handling a large production-scale workload; however, GraphX offered more features and easier development [57]. Since GraphFrames offers even more features than GraphX and works on top of Apache Spark, it is safe to assume the performance and usability differences between GraphFrames and Apache Giraph are quite similar to the ones found in the Facebook comparison.

6 Study materials

This thesis provides multiple attachments in the form of Java Maven projects. These projects are for educational purposes and relate to the categories of Big Data processing tools mentioned in chapters 2 to 5. All the projects are prepared to run on the Linux operating system, which is very common in the industry for real large-scale deployments [58]. Additionally, two datasets were created for this thesis, used for the tasks and demonstrations mentioned in the following section. The datasets used in the demonstrations and tasks are similar in the format of the data but differ in size. The smaller dataset is utilized in the graph task, since searching in the graph created from the bigger dataset would take too long and is unnecessary from the perspective of understanding the GraphX library. The bigger dataset is around 250 MB, which is bigger than the default HDFS block size of 128 MB [12], and therefore the applications created in the tasks have to work with multiple blocks.

6.1 Tasks and demonstrations

For every category of Big Data processing tools, the thesis provides three types of projects:

• demonstration,

• task,

• solution of the task.

The demonstration is a simple showcase of a certain use case for the processing tools that are described in chapters 2 to 5. Each demonstration contains a README.md file where the installation and the initial setup are described, if the given processing tool requires them to run. Some processing tools have a more complex initial setup, such as Apache Hadoop and Apache Hive, because of their dependencies. On the other hand, Apache Spark GraphX is much simpler in this area, and all its dependencies are imported through the Maven pom.xml file located within the project.


Tasks were created to practice the given processing tools after learning about them. The description of each task is in the README.md of the task project. Tasks contain the basic structure of the project with the implementation missing. Solution projects showcase the intended implementation for each task.

6.2 Usage

Study materials provided in this thesis were first and foremost created to be used in the PV177/DataScience seminar; however, they can be used and applied in any seminar where Big Data is part of the course. The demonstrations are expected to be presented in the lectures as example use cases. The tasks are expected to be completed by the students during the semester in the form of assignments. Solutions can be released after the deadlines of the assignments and should explain the intended solution. To complete the tasks, basic knowledge of the Java programming language and SQL is required, since all the tasks are implemented in Java, with one of the tasks using SQL queries for completion. All demonstrations and tasks are made for the Linux platform and in some cases require running scripts or queries provided in the materials through a command-line interface. These steps are described within the materials, so they do not require comprehensive knowledge of the Linux system.

7 Conclusion

This thesis provides an insight into the four categories of Big Data processing tools. For each category, the provided materials include information about the tool's functionality and theoretical background. This means further insight into the architecture of a given processing tool, its standout features and characteristics, and examples of functions and methods in the Java implementation of the processing tool. Additionally, a practical demonstration of each tool and an assignment in the attachments show the practical side of the previously mentioned Big Data processing tools. The Big Data processing tools chosen for each category were selected as popular choices among developers. The use cases demonstrated in the practical part of the thesis show the functions necessary for simple practical use. The combination of the theoretical and practical description of each Big Data tool should act as a starting point for understanding Big Data and its further processing. The text of the thesis and the provided assignments can be used as study and practice materials for the Big Data seminar.

A Attachments

The electronic version of the thesis contains multiple Maven Java projects and datasets. These are all contained within thesis-attachments.zip, which includes the following projects and datasets:

• hadoop-demo: Batch processing demonstration with Apache Hadoop.

• hadoop-task: Task on batch processing with Apache Hadoop.

• hadoop-solution: Solution for the hadoop-task.zip.

• hive-demo: SQL in Big Data demonstration with Apache Hive.

• hive-task: Task on SQL in Big Data with Apache Hive.

• hive-solution: Solution for the hive-task.zip.

• spark-demo: Stream processing demonstration with Apache Spark Streaming.

• spark-task: Task on stream processing with Apache Spark Streaming.

• spark-solution: Solution for the streaming-task.zip.

• graph-demo: Graph processing demonstration with GraphFrames.

• graph-task: Task on graph processing with Apache Spark GraphFrames.

• graph-solution: Solution for the graph-task.zip.

• students-data: Dataset for all the previous demonstrations and tasks except for the graph processing.

• students-data-graph: Reduced dataset for the graph processing task.

Bibliography

1. AGRAHARI, Anurag; RAO, Dharmaji. A review paper on Big Data: technologies, tools and trends. Int Res J Eng Technol. 2017, vol. 4, no. 10, pp. 10.
2. SAGIROGLU, Seref; SINANC, Duygu. Big data: A review. In: 2013 international conference on collaboration technologies and systems (CTS). 2013, pp. 42-47.
3. SAKR, Sherif. Big data 2.0 processing systems: a survey. Springer, 2016.
4. Global digital population as of January 2021 [online]. Statista, 2021 [visited on 2021-05-08]. Available from: https://www.statista.com/statistics/617136/digital-population-worldwide/.
5. Understanding the 7 V's of Big Data [online]. Big Data Path, 2019 [visited on 2021-05-08]. Available from: https://bigdatapath.wordpress.com/2019/11/13/understanding-the-7-vs-of-big-data/.
6. CASADO, Ruben; YOUNAS, Muhammad. Emerging trends and technologies in big data processing. Concurrency and Computation: Practice and Experience. 2015, vol. 27, no. 8, pp. 2078-2091.
7. NAZARI, Elham; SHAHRIARI, Mohammad Hasan; TABESH, Hamed. BigData Analysis in Healthcare: Apache Hadoop, Apache spark and . Frontiers in Health Informatics. 2019, vol. 8, no. 1, pp. 14.
8. BOBADE, Varsha B. Survey paper on big data and Hadoop. International Research Journal of Engineering and Technology (IRJET). 2016, vol. 3, no. 01.
9. YAO, Yi; WANG, Jiayin; SHENG, Bo; LIN, Jason; MI, Ningfang. Haste: Hadoop yarn scheduling based on task-dependency and resource-demand. In: 2014 IEEE 7th International Conference on Cloud Computing. 2014, pp. 184-191.


10. SHVACHKO, Konstantin; KUANG, Hairong; RADIA, Sanjay; CHANSLER, Robert. The hadoop distributed file system. In: 2010 IEEE 26th symposium on mass storage systems and technologies (MSST). 2010, pp. 1-10.
11. WHITE, Tom. Hadoop: The Definitive Guide. 3rd ed. O'Reilly Media, Inc., 2012. ISBN 9781449311520.
12. HDFS Architecture [online]. Apache Hadoop, 2017 [visited on 2021-03-12]. Available from: https://hadoop.apache.org/docs/r3.0.0/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html.
13. Learn The 10 Best Difference Between MapReduce vs Yarn [online]. EDUCBA [visited on 2020-08-25]. Available from: https://www.educba.com/mapreduce-vs-yarn/.
14. VAVILAPALLI, Vinod Kumar et al. Apache hadoop yarn: Yet another resource negotiator. In: Proceedings of the 4th annual Symposium on Cloud Computing. 2013, pp. 1-16.
15. Apache Hadoop YARN [online]. Apache Hadoop, 2020 [visited on 2020-08-25]. Available from: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html.
16. DITTRICH, Jens; QUIANE-RUIZ, Jorge-Arnulfo. Efficient big data processing in Hadoop MapReduce. Proceedings of the VLDB Endowment. 2012, vol. 5, no. 12, pp. 2014-2015.
17. DAS, Sufal; SYIEM, Brandon Victor; KALITA, Hemanta Kumar. Popularity Analysis on Social Network: A Big Data Analysis. International Journal of Computer Applications. 2014, vol. 975, pp. 27-31.
18. Hadoop - Custom Keys, Sorting and Grouping data [online]. My BigData Blog, 2017 [visited on 2020-08-25]. Available from: https://my-bigdata-blog.blogspot.com/2017/08/hadoop-custom-keys-sorting-and-grouping.html.
19. MAVRIDIS, Ilias; KARATZA, Helen. Performance evaluation of cloud-based log file analysis with Apache Hadoop and Apache Spark. Journal of Systems and Software. 2017, vol. 125, pp. 133-151.


20. Hadoop vs Spark - Detailed Comparison [online]. PhoenixNap, 2020 [visited on 2021-04-04]. Available from: https://phoenixnap.com/kb/hadoop-vs-spark.
21. HUAI, Yin et al. Major technical advancements in apache hive. In: Proceedings of the 2014 ACM SIGMOD international conference on Management of data. 2014, pp. 1235-1246.
22. BARBIERATO, Enrico; GRIBAUDO, Marco; IACONO, Mauro. Modeling apache hive based applications in big data architectures. In: Proceedings of the 7th International Conference on Performance Evaluation Methodologies and Tools. 2013, pp. 30-38.
23. HIVE, Apache. Apache Hive. 2013.
24. Setting Up HiveServer2 [online]. Apache, 2017 [visited on 2020-10-06]. Available from: https://cwiki.apache.org/confluence/display/Hive/Setting+up+HiveServer2.
25. Apache Thrift [online]. Apache Thrift, 2017 [visited on 2020-10-13]. Available from: https://thrift.apache.org/.
26. HiveServer2 Overview [online]. Apache Hive, 2016 [visited on 2020-10-13]. Available from: https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Overview.
27. Hive Metastore - Different Ways to Configure Hive Metastore [online]. Data Flair, 2020 [visited on 2020-10-08]. Available from: https://data-flair.training/blogs/apache-hive-metastore/.
28. KUMAR, Rakesh; GUPTA, Neha; CHARU, Shilpi; BANSAL, Somya; YADAV, Kusum. Comparison of SQL with HiveQL. International Journal for Research in Technological Studies. 2014, vol. 1, no. 9, pp. 2348-1439.
29. Differences Between SQL and HiveQL [online]. Tech Differences, 2019 [visited on 2020-10-13]. Available from: https://techdifferences.net/differences-between-sql-and-hiveql/.
30. LanguageManual SubQueries [online]. Apache Hive, 2014 [visited on 2020-10-13]. Available from: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries.


31. Difference between SQL and HiveQL [online]. GeeksforGeeks, 2020 [visited on 2020-10-13]. Available from: https://www.geeksforgeeks.org/difference-between-sql-and-hiveql/.
32. VILLALOBOS, Danilo Acosta; RUIZ, Ricardo Rojas. Distributed databases and Apache Hive.
33. GHAZAL, Ahmad; RABL, Tilmann; HU, Minqing; RAAB, Francois; POESS, Meikel; CROLOTTE, Alain; JACOBSEN, Hans-Arno. Bigbench: Towards an industry standard benchmark for big data analytics. In: Proceedings of the 2013 ACM SIGMOD international conference on Management of data. 2013, pp. 1197-1208.
34. IVANOV, Todor; BEER, Max-Georg. Evaluating hive and spark SQL with BigBench. arXiv preprint arXiv:1512.08417. 2015.
35. Apache Pig - Overview [online]. Tutorialspoint, 2021 [visited on 2021-04-06]. Available from: https://www.tutorialspoint.com/apache_pig/apache_pig_overview.htm.
36. FUAD, Ammar; ERWIN, Alva; IPUNG, Heru Purnomo. Processing performance on apache pig, apache hive and MySQL cluster. In: Proceedings of International Conference on Information, Communication Technology and System (ICTS) 2014. 2014, pp. 297-302.
37. MAARALA, Altti Ilari; RAUTIAINEN, Mika; SALMI, Miikka; PIRTTIKANGAS, Susanna; RIEKKI, Jukka. Low latency analytics for streaming traffic data with Apache Spark. In: 2015 IEEE International Conference on Big Data (Big Data). 2015, pp. 2855-2858.
38. ZAHARIA, Matei et al. Apache spark: a unified engine for big data processing. Communications of the ACM. 2016, vol. 59, no. 11, pp. 56-65.
39. SALLOUM, Salman; DAUTOV, Ruslan; CHEN, Xiaojun; PENG, Patrick Xiaogang; HUANG, Joshua Zhexue. Big data analytics on Apache Spark. International Journal of Data Science and Analytics. 2016, vol. 1, no. 3-4, pp. 145-164.
40. Apache Spark Cluster Managers - YARN, Mesos & Standalone [online]. Data Flair, 2020 [visited on 2020-11-26]. Available from: https://data-flair.training/blogs/apache-spark-cluster-managers-tutorial/.


41. SHORO, Abdul Ghaffar; SOOMRO, Tariq Rahim. Big data analysis: Apache spark perspective. Global Journal of Computer Science and Technology. 2015.
42. SPARK, Apache. Apache spark. Retrieved January. 2018, vol. 17, pp. 2018.
43. RDD Operations [online]. Java T Point, 2010 [visited on 2020-11-19]. Available from: https://www.javatpoint.com/apache-spark-rdd-operations.
44. CHENG, Dazhao; CHEN, Yuan; ZHOU, Xiaobo; GMACH, Daniel; MILOJICIC, Dejan. Adaptive scheduling of parallel jobs in spark streaming. In: IEEE INFOCOM 2017-IEEE Conference on Computer Communications. 2017, pp. 1-9.
45. Spark Streaming Programming Guide [online]. Apache Spark, 2020 [visited on 2020-11-26]. Available from: https://spark.apache.org/docs/latest/streaming-programming-guide.html.
46. ZAHARIA, Matei; DAS, Tathagata; LI, Haoyuan; HUNTER, Timothy; SHENKER, Scott; STOICA, Ion. Discretized streams: Fault-tolerant streaming computation at scale. In: Proceedings of the twenty-fourth ACM symposium on operating systems principles. 2013, pp. 423-438.
47. IQBAL, Muhammad Hussain; SOOMRO, Tariq Rahim. Big data analysis: Apache storm perspective. International journal of computer trends and technology. 2015, vol. 19, no. 1, pp. 9-14.
48. CHINTAPALLI, Sanket et al. Benchmarking streaming computation engines: Storm, flink and spark streaming. In: 2016 IEEE international parallel and distributed processing symposium workshops (IPDPSW). 2016, pp. 1789-1792.
49. GraphFrames Overview [online]. GraphFrames, 2020 [visited on 2020-12-19]. Available from: https://graphframes.github.io/graphframes/docs/_site/index.html.
50. Spark SQL, DataFrames and Datasets Guide [online]. Apache Spark, 2020 [visited on 2020-12-19]. Available from: http://spark.apache.org/docs/latest/sql-programming-guide.html.


51. DAVE, Ankur; JINDAL, Alekh; LI, Li Erran; XIN, Reynold; GONZALEZ, Joseph; ZAHARIA, Matei. Graphframes: an integrated api for mixing graph and relational queries. In: Proceedings of the Fourth International Workshop on Graph Data Management Experiences and Systems. 2016, pp. 1-8.
52. ARMBRUST, Michael et al. Spark sql: Relational data processing in spark. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data. 2015, pp. 1383-1394.
53. Apache Spark RDD vs DataFrame vs DataSet [online]. Data Flair, 2020 [visited on 2020-12-21]. Available from: https://data-flair.training/blogs/apache-spark-rdd-vs-dataframe-vs-dataset/.
54. GraphFrames User Guide [online]. GraphFrames, 2020 [visited on 2020-12-19]. Available from: https://graphframes.github.io/graphframes/docs/_site/user-guide.html.
55. Introduction to Spark Graph Processing with GraphFrames [online]. Baeldung, 2020 [visited on 2020-12-19]. Available from: https://www.baeldung.com/spark-graph-graphframes.
56. Introduction to Apache Giraph [online]. Apache Giraph, 2020 [visited on 2021-04-15]. Available from: http://giraph.apache.org/intro.html.
57. KABILJO, Maja; LOGOTHETIS, Dionysis; EDUNOV, Sergey; CHING, Avery. A comparison of state-of-the-art graph processing systems. Facebook Blog Post, http://tinyurl.com/giraph-vs-graphx. 2016.
58. SHAFER, Jeffrey; RIXNER, Scott; COX, Alan L. The hadoop distributed filesystem: Balancing portability and performance. In: 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS). 2010, pp. 122-133.
