BIGGR: Bringing Gradoop to Applications

Datenbank Spektrum https://doi.org/10.1007/s13222-019-00306-x SCHWERPUNKTBEITRAG BIGGR: Bringing Gradoop to Applications M. Ali Rostami1 · Matthias Kricke1 ·EricPeukert1 · Stefan Kühne1 · Moritz Wilke1 ·SteffenDienst1 · Erhard Rahm1 Received: 23 January 2019 / Accepted: 25 January 2019 © Gesellschaft für Informatik e.V. and Springer-Verlag GmbH Germany, part of Springer Nature 2019 Abstract Analyzing large amounts of graph data, e.g., from social networks or bioinformatics, has recently gained much attention. Unfortunately, tool support for handling and analyzing such graph data is still weak and scalability to large data volumes is often limited. We introduce the BIGGR approach providing a novel tool for the user-friendly and efficient analysis and visualization of Big Graph Data on top of the open-source software KNIME and GRADOOP. Users can visually program graph analytics workflows, execute them on top of the distributed processing framework Apache Flink and visualize large graphs within KNIME. For visualization, we apply visualization-driven data reduction techniques by pushing down sampling and layouting to GRADOOP and Apache Flink. We also discuss an initial application of the tool for the analysis of patent citation graphs. Keywords Graph analysis · Graph visualization · Graph sampling · Gradoop · KNIME 1Introduction graph data including graph databases and distributed graph processing platforms, but they still have substantial limita- The analysis of large network and graph datasets is becom- tions regarding either scalability to large datasets, support ing of increasing interest, for example, to gain insights from for complex graph mining (e.g., clustering, frequent pat- social networks, protein interaction networks in bioinfor- tern mining, etc.) or ease of use (flexible graph data model, matics or in business applications, e.g., in logistics. Graph graph visualization) [1]. For these reasons, we developed analytics aims at analyzing such networks, especially com- the open-source graph analysis platform GRADOOP [2, 3] plex relationships between heterogeneous data entities of at ScaDS1, that aims at overcoming the limitations of pre- interest. There are many approaches to manage and analyze vious graph processing platforms. GRADOOP provides a flexible graph data model based on extended property graphs [5] together with a variety M. Ali Rostami of powerful operators including pattern matching, graph [email protected] grouping and aggregation [6, 7] as well as a library of graph Matthias Kricke mining algorithms, e.g., for frequent subgraph mining [8]. [email protected] The operators and algorithms are implemented on top of Eric Peukert Apache Flink and can therefore be executed on shared- [email protected] nothing clusters to process large amounts of data in parallel. Stefan Kühne Programming graph analysis workflows that consist of [email protected] multiple data transformation and analysis steps is cumber- some and requires high technical expertise from the analyst, Moritz Wilke [email protected] which represents a considerable usability hurdle. Moreover, graph data mining tasks are typically embedded in data min- Steffen Dienst [email protected] ing workflows including operators for data preparation, ma- chine learning and result visualization. However, existing Erhard Rahm tools for constructing such data mining workflows such as [email protected] 1 Institute for Informatics, University of Leipzig, 1 Competence Center for Scalable Data Services and Solutions (ScaDS Augustusplatz 10, 04109 Leipzig, Germany Dresden/Leipzig) [4]. K Datenbank Spektrum KNIME [9], Kepler [10], RapidMiner [11] or Galaxy [12] reused (see [13] for a recent survey and comparison). Un- do not yet allow for handling and visualizing Big Graph fortunately, only some of them have some initial support Data with a huge number of vertices and edges. Therefore, for Big Data, namely KNIME, [9], Kepler [10], Apache we propose a tool which supports (1) the visual modeling of Taverna [14], RapidMiner [11] and Galaxy [12], by sup- graph analytics workflows, (2) executes these workflows in porting distributed execution, e.g., by providing some inte- a distributed fashion using GRADOOP, and (3) efficiently vi- gration with Big Data frameworks. For instance, KNIME sualizes Big Graph Data by pushing complex visualization- and RapidMiner offer workflow operators (called nodes) specific computations down to a distributed server system for loading data from and storing to Big Data stores such for parallel execution. as HDFS 3, HIVE 4,orIMPALA 5 and also provide an inte- We developed this tool called BIGGR on top of GRADOOP gration with APACHE SPARK 6 that maps workflow nodes and the data science platform KNIME2. In this paper, we to Spark operators. Scientific workflow management tools describe the approach and its initial application. In partic- like Kepler [10], Apache Taverna [14] or Galaxy [12]have ular, we make the following contributions: a special focus on executing workflows on HPC computing infastructures. Within another ScaDS project, such HPC We give an overview to the BIGGR tool for the user- support was recently also added for KNIME [15]. friendly and efficient analysis and visualization of large However, all these tools lack support for analyzing Big Big Graph Data. The tool allows users to visually define Graph Data. Initial attempts to support graph analysis have graph analysis workflows involving GRADOOP opera- been made with Galaxy, which is widely used in bioinfor- tors and existing KNIME operators. We sketch some of matics [12]. It supports a so-called cluster-adapter with the the challenges of integrating GRADOOP and Flink into Apache Spark Driver to run Spark jobs and can also utilize KNIME to achieve a distributed execution that is trans- Spark’s distributed graph processing API GraphX via the parent to the user. so-called GraphFlow [16]orSparkGalaxy[17] front-ends. We introduce the BIGGR approach for visualizing large Since Galaxy does not support graph data, results are not graphs. It pushes down visualization-specific compu- visualized as graphs and the integration is done by storing tations such as layouting and sampling as operators to and retrieving inputs and outputs from and to HDFS. The GRADOOP to be efficiently executed in a distributed KNIME tool provides an initial abstraction for graph data, fashion. but it has to be processed sequentially so that big graph min- We present a real-world application of the BIGGR ap- ing tasks are not yet supported. The BIGGR project aims proach to analyze patent data. at improving on this by using the distributed graph analy- After an initial discussion of related work and an in- sis system GRADOOP. Furthermore, we aim at providing troduction to GRADOOP and KNIME in Sect. 3,wede- advanced visualization support for Big Graph Data. scribe how GRADOOP is integrated into KNIME in Sec. 4. Visualizing Big Graph Data is a well researched topic In Sect. 5, we dive into our developed tooling for visual- and many techniques have been proposed to speed up lay- izing large graphs in KNIME. Sect. 6 introduces the real- outing for large graphs [18]. Unfortunately, the problem world use case for patent data and Sect. 7 finally gives is still not sufficiently solved as shown in a more recent conclusions and sketches future work. overview for visualizing linked data graphs [19]. Exist- ing graph databases such as Neo4J7 or an extension by Oracle8 offer support for visualizing graph data. However, 2 Related Work such tools mostly do not scale well for large graphs and the results are often cluttered as was shown for Neo4J in There are several workflow-based data science tools that can a DBpedia case study [20]. The authors point out that the be used to integrate and orchestrate multiple independently dense structure of vertices and edges requires simplifica- built data mining components or operators for data analysis tion approaches to make the visualizations comprehensible. or data manipulation. These tools are often domain-specific In our work, we investigate such simplification approaches for certain kinds of data (e.g., genomic sequence data in and in particular follow a so-called visualisation-driven data bioinformatics or spectrometry data in chemistry) and typically support a user to visually build workflows in the form 3 http://hadoop.apache.org. of operator trees that can be automatically executed and 4 https://hive.apache.org. 5 https://impala.apache.org. 6 https://spark.apache.org. 2 The technology transfer into KNIME is funded within a joint BMBF 7 project between ScaDS / University of Leipzig and Knime. It is https://neo4j.com. planned to make the described extensions freely available within an 8 https://www.oracle.com/technetwork/oracle-labs/parallel-graph- upcoming release of KNIME. analytix. K Datenbank Spektrum reduction (VDDR) that was recently proposed by Jugel et. GRALA (Graph Analytical Language) to implement a pro- al. [21] for relational data. In VDDR, the visualization sys- gram (workflow) for graph analysis. Table 1 shows available tem pushes down data reduction logic to the data source graph operators and graph algorithms categorized by their to reduce workload on a visualization client with the goal input. Besides general operators for graph transformation or of not significantly changing the actual visualization result. aggregation, GRADOOP also provides pattern matching [6] To our knowledge, we are the first to apply this concept for capabilities known from graph database systems and ana- graph visualization and we are pushing sampling, prepro- lytical operators, e.g., for graph grouping [7] and further cessing, and layouting down to Gradoop and Apache Flink structural graph transformations [24]. Moreover, the aux- to speedup graph visualization for the client. iliary operators apply and call can be used to seamlessly integrate user-specific operators in the analysis programs. GRADOOP supports several ways to store graph data. The 3Background GRALA interface DataSource is used to read and DataSink to write graph data. Hence, for each DataSink an appro- 3.1 GRADOOP priate DataSource exists.

BIGGR: Bringing Gradoop to Applications

Communicate the Future PROCEEDINGS HYATT REGENCY ORLANDO 20–23 MAY | ORLANDO, FL and Summit.Stc.Org #STC18

Efficient Support for Data-Intensive Scientific Workflows on Geo

Apache Taverna

A Review of Scalable Bioinformatics Pipelines

Programming Models to Support Data Science Workflows

ADMI Cloud Computing Presentation

Workspace - a Scientific Workflow System for Enabling Research Impact

Methods Enabling Portability of Scientific Workflows

Open Research Online Oro.Open.Ac.Uk

HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack

Arxiv:2007.04939V1 [Cs.DC] 9 Jul 2020

Intelligent Integrated Digital Platform Insilicokdd in Support of Precision Medicine for Scientific Research