LODFlow – a Workflow Management System for Linked Data Processing

Sandro Rautenberg, Midwestern State University (UNICENTRO), Guarapuava, PR, Brazil
Ivan Ermilov, AKSW/BIS, University of Leipzig, Leipzig, Germany
Edgard Marx, AKSW/BIS, University of Leipzig, Leipzig, Germany
Sören Auer, University of Bonn / Fraunhofer IAIS, Germany
Axel-Cyrille N. Ngomo, AKSW/BIS, University of Leipzig, Leipzig, Germany

ABSTRACT
The extraction and maintenance of Linked Data datasets is a cumbersome, time-consuming and resource-intensive activity. The cost of producing Linked Data can be reduced by a workflow management system, which describes plans to systematically support the lifecycle of RDF datasets. We present the LODFlow Linked Data Workflow Management System, which provides an environment for planning, executing, reusing, and documenting Linked Data workflows. The LODFlow approach is based on a comprehensive knowledge model for describing the workflows and a workflow execution engine supporting systematic workflow execution, reporting, and exception handling. The environment was evaluated in a large-scale real-world use case. As a result, LODFlow helps Linked Data engineers to systematically plan, execute and assess Linked Data production and maintenance workflows, thus improving efficiency, ease of use, reproducibility, reusability and provenance.

Categories and Subject Descriptors
H.4.1 [Office Automation]: Workflow management

General Terms
Linked Data, Workflow Management, Linked Data Workflow Management System.

1. INTRODUCTION
The creation and maintenance of Linked Data datasets is currently performed by extensive manual workflows or proprietary scripts, which are neither portable nor reusable and are difficult to maintain. For Linked Data extraction and publication projects such as LinkedGeoData or DBpedia, a large number of activities is required in order to produce a particular version of the datasets. These activities include, for example, downloading and loading dumps of the raw data, running various extraction tools, performing quality analysis and sanity checks, loading the resulting data into a SPARQL endpoint for publication, and generating links to other datasets. Performing such Linked Data production and maintenance activities requires in-depth knowledge of scripting languages, various RDF serialization and publication technologies, tools such as SILK and LIMES for linking, and standards such as R2RML for relational data mapping. As a result, such processes are cumbersome, time-consuming, and error-prone, and they require substantial skills.

In other domains, planning, executing, and orchestrating activities is supported by workflow management techniques and systems. With the Business Process Execution Language (BPEL), for example, a comprehensive language for the planning, execution, and orchestration of Web Services exists. There is also a large number of products supporting BPEL available on the market, from vendors including SAP, IBM, and Apache. In addition, for scientific workflows, a number of languages and tools supporting workflow design and execution exist. Specialized scientific workflow systems, e.g. Discovery Net [3], Apache Taverna [9] and Kepler [12], provide a visual programming front-end enabling users to construct their applications as a visual graph by connecting nodes together, and tools have been developed to execute such workflows in a platform-independent manner [10].
In this article, we present a comprehensive approach for defining, planning, orchestrating, and executing Linked Data production and maintenance workflows, the Linked Open Data workFlow system (LODFlow). LODFlow is based on defining workflows comprising reusable activities and steps using a comprehensive ontological model, the Linked Data Workflow Project Ontology (LDWPO). We implemented a workflow execution environment, which takes workflows defined in LDWPO as input, executes the various steps defined in the ontology, and generates detailed reports about the execution and potential errors in machine-readable and human-readable forms. LODFlow is able to launch a number of specialized tools such as DBpedia DIEF or SparqlMap for extraction and mapping, SILK for linking, and Luzzu for quality analysis. LODFlow preserves provenance and adds comprehensive metadata, such as the version, invocation, and configuration of the tool execution in a concrete workflow instantiation.

The benefits of LODFlow include:

1. Explicitness - the declarative description of Linked Data production and maintenance workflows makes the activities and processes explicit and easy to understand for Linked Data engineers at a high level of abstraction.
2. Reusability - the fine-grained definition of activities and workflow steps facilitates reuse across workflows.
3. Repeatability - workflows can be easily executed over and over again, thus making them repeatable, facilitating testing and reliability.
4. Efficiency - the automatic workflow execution greatly reduces the required manual human intervention and thus improves efficiency.
5. Ease of use - the high-level declarative description of workflows and the generation of comprehensive execution reports simplify the definition, execution and analysis of Linked Data production and maintenance workflows.

As a result, we hope that LODFlow and LDWPO will improve the creation and maintenance of Linked Data.

The rest of the paper is structured as follows. We discuss Linked Data Workflow Management and its requirements in Section 2. In Section 3 we present LODFlow. The evaluation of LODFlow is discussed in Section 4, based on a large-scale, real-world use case. In Section 5 we outline related work. Finally, in Section 6 we conclude the paper and outline directions for future work.

2. LINKED DATA WORKFLOW MANAGEMENT
The creation and maintenance of Linked Data datasets require substantial efforts and resources. These efforts should be planned, follow best practices, and be performed in a systematic way. The concept of Linked Data Workflow Management (LDWFM) described in this section provides a formal grounding for the systematic rendering of Linked Data management processes.

The core concept of LDWFM is a Workflow. The Workflow concept originated in the business management domain, where it is defined by the Workflow Management Coalition as "the automation of a business process, in whole or part, during which documents, information or tasks are passed from one participant to another for action, according to a set of procedural rules" [18]. Essentially, a Workflow orchestrates steps that transform resources in order to reach a desired result. Benefits of systematic workflow management include explicit description of activities, reproducibility, automation and generally increased efficiency.

According to [7], Workflow Management is a technology supporting the reengineering of business and information processes. The main components of Workflow Management are a workflow definition and a technological infrastructure. The workflow definition describes the aspects for coordination, processing and controlling of the resource lifecycle. The technological infrastructure provides means for efficient design and implementation of workflows. To facilitate Workflow Management, the technological infrastructure should fulfill the following requirements:

• be component-oriented, thus enabling integration and interoperability among loosely-coupled system and service components;
• support the implementation of a workflow, facilitating resource provenance and reproducibility;
• ensure the correctness and reliability of applications in the presence of concurrency and failures;
• support the evolution, replacement, and addition of the workflow steps in a reengineering process.

In a Linked Data environment, Workflow Management should cover the specification, execution, registration, and control of procedures for producing and maintaining Linked Data datasets. Therefore, in a Linked Data context, we can define the requirements for LDWFM as follows.

Requirement 1. Linked Data Workflow Planning - the capability for describing a production strategy for Linked Data datasets. Achieved by specifying a list of steps, where each step corresponds to a tool, a tool configuration, as well as input and output datasets.

Requirement 2. Linked Data Workflow Execution - the capability for automating workflows, which involves a controlled environment for plan execution. Achieved by running tools with tool configurations over input datasets.

Requirement 3. Linked Data Workflow Reusability - the capability to reuse the whole or a part of existing workflows.

Requirement 4. Linked Data Workflow Documentation - the capability to represent Linked Data Workflow plans and executions in human-readable formats.

Requirement 5. Linked Data Workflow Repeatability - the capability for reproducing the same result over time.

3. LODFLOW - THE LINKED DATA WORKFLOW MANAGEMENT SYSTEM
In this section, we describe our Linked Data Workflow Man-

LODFlow is designed in compliance with the Linked Data Lifecycle [2]. Therefore, it supports the orchestration of Linked Data tools for
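
Requirements 1 and 2 in Section 2 describe a plan as a list of steps, each pairing a tool with a tool configuration and input and output datasets, and an execution as running those tools in a controlled way. The following Python sketch illustrates that idea only. It is not LODFlow's implementation and does not use the LDWPO vocabulary: every name in it (`Step`, `StepReport`, `run_workflow`, `extract`, `load`) is hypothetical, and tools are modeled as plain Python functions, whereas LODFlow launches external tools.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical names throughout: this is neither the LDWPO vocabulary nor LODFlow's API.
# A "tool" is modeled as a function from (input datasets, configuration) to output datasets.
ToolFn = Callable[[List[str], Dict[str, str]], List[str]]


@dataclass
class Step:
    name: str
    tool: ToolFn                  # stands in for an external tool
    config: Dict[str, str]        # tool configuration
    inputs: List[str]             # input dataset identifiers
    outputs: List[str] = field(default_factory=list)  # filled in on execution


@dataclass
class StepReport:
    step: str
    status: str                   # "ok", "failed" or "skipped"
    message: str = ""


def run_workflow(steps: List[Step]) -> List[StepReport]:
    """Run steps in order; after the first failure, mark the rest as skipped."""
    reports: List[StepReport] = []
    failed = False
    for step in steps:
        if failed:
            reports.append(StepReport(step.name, "skipped"))
            continue
        try:
            step.outputs = step.tool(step.inputs, step.config)
            reports.append(StepReport(step.name, "ok"))
        except Exception as exc:
            failed = True
            reports.append(StepReport(step.name, "failed", str(exc)))
    return reports


def extract(inputs: List[str], config: Dict[str, str]) -> List[str]:
    # Placeholder for an extraction tool: derives one output name per input.
    return [f"{name}.{config['format']}" for name in inputs]


def load(inputs: List[str], config: Dict[str, str]) -> List[str]:
    # Placeholder for loading data into a store; fails if no graph is configured.
    if "graph" not in config:
        raise ValueError("missing 'graph' in configuration")
    return [f"{config['graph']}/{name}" for name in inputs]


plan = [
    Step("extract", extract, {"format": "nt"}, ["dump-a", "dump-b"]),
    Step("load", load, {}, ["dump-a.nt", "dump-b.nt"]),      # misconfigured on purpose
    Step("link", extract, {"format": "links"}, ["dump-a.nt"]),
]
reports = run_workflow(plan)
for report in reports:
    print(report.step, report.status, report.message)
```

Because the plan is ordinary data, the same list of steps can be rerun or partly reused in another plan, and the per-step reports give a simple human-readable record of an execution, which loosely corresponds to Requirements 3 to 5. Skipping every step after a failure is a design choice of this sketch, not something the paper specifies.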
