Common Motifs in Scientific Workflows: an Empirical Analysis Daniel Garijo, Pinar Alper, Khalid Belhajjame, Oscar Corcho, Yolanda Gil, Carole Goble
To cite this version: Daniel Garijo, Pinar Alper, Khalid Belhajjame, Oscar Corcho, Yolanda Gil, et al. Common motifs in scientific workflows: An empirical analysis. Future Generation Computer Systems, Elsevier, 2014, 36, 10.1016/j.future.2013.09.018. hal-01342933
HAL Id: hal-01342933
https://hal.archives-ouvertes.fr/hal-01342933
Submitted on 7 Jul 2016
Daniel Garijo∗, Pinar Alper†, Khalid Belhajjame†, Oscar Corcho∗, Yolanda Gil‡, Carole Goble†
∗Ontology Engineering Group, Universidad Politécnica de Madrid. {dgarijo, ocorcho}@fi.upm.es
†School of Computer Science, University of Manchester. alperp, khalidb, [email protected]
‡Information Sciences Institute, Department of Computer Science, University of Southern California. [email protected]
Abstract: While workflow technology has gained momentum in the last decade as a means for specifying and enacting computational experiments in modern science, reusing and repurposing existing workflows to build new scientific experiments is still a daunting task. This is partly due to the difficulty that scientists experience when attempting to understand existing workflows, which contain several data preparation and adaptation steps in addition to the scientifically significant analysis steps. One way to tackle the understandability problem is through providing abstractions that give a high-level view of activities undertaken within workflows. As a first step towards abstractions, we report in this paper on the results of a manual analysis performed over a set of real-world scientific workflows from the Taverna and Wings systems. Our analysis has resulted in a set of scientific workflow motifs that outline i) the kinds of data-intensive activities that are observed in workflows (data-oriented motifs), and ii) the different manners in which activities are implemented within workflows (workflow-oriented motifs). These motifs can be useful to inform workflow designers on the good and bad practices for workflow development, to inform the design of automated tools for the generation of workflow abstractions, etc.
I. INTRODUCTION
Scientific workflows have been increasingly used in the last decade as an instrument for data-intensive scientific analysis. In these settings, workflows serve a dual function: first as detailed documentation of the method (i.e., the input sources and processing steps taken for the derivation of a certain data item) and second as re-usable, executable artifacts for data-intensive analysis. Workflows stitch together a variety of data manipulation activities such as data movement, data transformation or data visualization to serve the goals of the scientific study. The stitching is realized by the constructs made available by the workflow system used and is largely shaped by the environment in which the system operates and the function undertaken by the workflow.
A variety of workflow systems are in use [10] [3] [7] [2], serving several scientific disciplines. A workflow is a software artifact, and as such, once developed and tested, it can be shared and exchanged between scientists. Other scientists can then reuse existing workflows in their experiments, e.g., as sub-workflows [17]. Workflow reuse presents several advantages [4]. For example, it enables proper data citation and improves quality through shared workflow development by leveraging the expertise of previous users. Users can also re-purpose existing workflows to adapt them to their needs [4]. Emerging workflow repositories such as myExperiment [14] and CrowdLabs [8] have made publishing and finding workflows easier, but scientists still face the challenges of re-use, which amounts to fully understanding and exploiting the available workflows/fragments. One difficulty in understanding workflows is their complex nature. A workflow may contain several scientifically significant analysis steps, combined with various other data preparation activities, and in different implementation styles depending on the environment and context in which the workflow is executed. The difficulty in understanding causes workflow developers to revert to starting from scratch rather than re-using existing fragments.
Through an analysis of the current practices in scientific workflow development, we could gain insights on the creation of understandable and more effectively re-usable workflows. Specifically, we propose an analysis with the following objectives:
1) To reverse-engineer the set of current practices in workflow development through an analysis of empirical evidence.
2) To identify workflow abstractions that would facilitate understandability and therefore effective re-use.
3) To detect potential information sources and heuristics that can be used to inform the development of tools for creating workflow abstractions.
In this paper we present the result of an empirical analysis performed over 177 workflow descriptions from Taverna [10] and Wings [3]. Based on this analysis, we propose a catalogue of scientific workflow motifs. Motifs are provided through i) a characterization of the kinds of data-oriented activities that are carried out within workflows, which we refer to as data-oriented motifs, and ii) a characterization of the different manners in which those activity motifs are realized/implemented within workflows, which we refer to as workflow-oriented motifs. It is worth mentioning that, although important, motifs that have to do with scheduling and mapping of workflows onto distributed resources [12] are out of the scope of this paper.
The paper is structured as follows. We begin by providing related work in Section II, which is followed in Section III by brief background information on scientific workflows and the two systems that were subject to our analysis. Afterwards we describe the dataset and the general approach of our analysis. We present the detected scientific workflow motifs in Section IV and we highlight the main features of their distribution across the analyzed workflows in Section V. Finally, we distill the main findings of our study (in Section VI) and conclude by outlining our future plans in Section VII.
II. RELATED WORK
Our motifs could be seen as higher-level patterns observed over scientific workflow datasets. "Workflow patterns" have been extensively studied in the last two decades [15]. Work in this area is primarily focused on outlining the inventory of workflow development constructs provided by different workflow languages and the ways of combining those constructs. Classification models have also been developed to detect additional patterns in structure, usage and data [13]. Scientific workflows are characterized by the lack of complex control constructs, where the order of execution is determined by the availability of data. As observed by a recent study [9], these systems largely support data-flow patterns¹ and even bring about new ones with their varied handling of data tokens. Data-flow patterns outline ways of managing data resources during workflow execution, such as visibility, data interaction and transfer. These patterns are orthogonal to the scientific data-oriented function undertaken by the processing steps, i.e., our motifs. In this regard our work complements the workflow patterns research with its focus on pinpointing characteristic data-oriented activities, rather than an analysis of workflow languages, token handling or data dependencies. Our work is also based on an analysis of empirical evidence of how data-intensive activities have been implemented against different [...] experiments instead of scientific workflows (in silico data-oriented experiments).
III. PRELIMINARIES
For the purposes of our analysis, we use workflows specified and used in the Taverna [10] and Wings [3] workflow systems. The choice of these two systems is due to:
1) The similarity in the kinds of workflow modeling constructs they provide. When compared to other scientific workflow systems, both Taverna and Wings adhere to the pure data-flow paradigm: they provide no explicit mechanism for even the most basic control constructs such as conditionals and looping, although there are implicit ways of achieving them.
2) The variety of execution environments in which these systems operate. While Taverna provides users with a means to specify workflows that mainly make use of autonomous third-party services, Wings provides an environment in which users have relatively more control over the resources and the analysis operations that scientists use in their experiments. Within Wings, analysis tools and data artifacts are made part of the environment prior to workflow design; workflows are then composed using these encapsulated software tools.
A. Taverna and Wings workflow systems
Taverna² is an open-source workflow system, which was initially devised for in-silico experimentation in the life sciences. Recently, it has been extended