Departamento de Inteligencia Artificial
Escuela Técnica Superior de Ingenieros Informáticos

PhD

Mining Abstractions in Scientific Workflows

Author: Daniel Garijo Verdejo
Supervisors: Prof. Dr. Oscar Corcho and Prof. Dr. Yolanda Gil

December, 2015

Thesis committee appointed by the Rector of the Universidad Politécnica de Madrid on October 30, 2015

Chair: Dr. Asunción Gómez Pérez

Member: Dr. Jose Manuel Gómez Pérez

Member: Dr. Malcolm Atkinson

Member: Dr. Rafael Tolosana

Secretary: Dr. Mark Wilkinson

Substitute: Dr. Mariano Fernández López

Substitute: Dr. Belén Díaz Agudo

The thesis defense and reading took place on December 3, 2015, at the Facultad de Informática.

Grade:

THE CHAIR    MEMBER 1    MEMBER 2

MEMBER 3    THE SECRETARY

To my parents

Acknowledgements

After five years, I can finally say that I see light at the end of the tunnel. Maybe the other side is still a bit cloudy at the moment, but the important thing is to have arrived here. And, honestly, I do not think I would have made it to this point without all the people who have been by my side during these years.

First, I would like to thank my supervisors Oscar Corcho and Yolanda Gil for guiding me whenever I got stuck and for having the patience to answer all my questions. Furthermore, thanks to their help, together with Asunción Gómez Pérez's advice, I was granted the FPU (Formación de Profesorado Universitario) scholarship from the Ministerio de Ciencia e Innovación. This scholarship has funded the internships and the research described in this document, and I am very grateful for having had the opportunity to enjoy it.

I would also like to thank my family, especially my parents (Francisco Javier and María Felisa) and my sister Elisa, for all their support, advice and suggestions during this period. Even from a distance!

Next up are my lab mates, who have helped me with the figures (María Poveda, I really think you could write a thesis just by doing cool figures), logos (Idafen Santana, also responsible for our soccer team), technical support (Miguel Angel García and Raúl Alcázar), advice for the thesis (Andrés García and Esther Lozano) or just cheering me up when hanging out with them (Dani, Freddy, Carlos, Pablo, Julia, Boris, Alejandro, Olga and Victor). In this regard, I am also very grateful to my friends Sergio, Paloma, David, Cristina and Javier for always being available to have a chat over a beer and discuss things totally unrelated to this thesis.

I also owe special thanks to Paolo Missier and Khalid Belhajjame, who provided very valuable feedback despite having very little time for the review. Varun Ratnakar has also been crucial for some of the technical parts described in this thesis; he is one of the best working colleagues one could ever ask for.

And finally, I want to thank all the collaborators and project pals I have interacted with during these years, from the Wf4Ever team (Carole, Jun, Graham, Raúl, Piotr, Stian, Khalid, Kristina, Lourdes, Susana, Pique) to the people I met during my internships at the ISI (Dirk, John, Matheus, Felix, Zori).

Abstract

Scientific workflows have been adopted in the last decade to represent the computational methods used in in silico scientific experiments and their associated research products. Scientific workflows have proven useful for sharing and reproducing scientific experiments, allowing scientists to visualize, debug and save time when re-executing previous work. However, scientific workflows may be difficult to understand and reuse. The large number of available workflows in repositories, together with their heterogeneity and lack of documentation and usage examples, may become an obstacle for a scientist aiming to reuse the work of other scientists. Furthermore, given that it is often possible to implement a method using different algorithms or techniques, seemingly disparate workflows may be related at a higher level of abstraction, based on their common functionality. In this thesis we address the issues of reusability and abstraction by exploring how workflows relate to one another in a workflow repository, mining abstractions that may be helpful for workflow reuse. In order to do so, we propose a simple model for representing and relating workflows and their executions, we analyze the typical common abstractions that can be found in workflow repositories, we explore the current practices of users regarding workflow reuse, and we describe a method for discovering useful workflow abstractions based on existing graph mining techniques. Our results expose the common abstractions and practices of users in terms of workflow reuse, and show how our proposed abstractions have the potential to become useful for users designing new workflows.

Resumen

Scientific workflows have been adopted over the last decade to represent the computational methods used in in silico experiments, as well as to support their associated publications. Such workflows have proven useful for sharing and reproducing scientific experiments, allowing researchers to visualize, debug and save time when re-executing previous work. However, scientific workflows can sometimes be difficult to understand and reuse, owing to obstacles such as the large number of workflows available in repositories, their heterogeneity, or the general lack of documentation and usage examples. Moreover, given that it is usually possible to implement the same method using different algorithms or techniques, apparently different workflows may be related at a certain level of abstraction, based, for example, on their common functionality. This thesis focuses on workflow reuse and abstraction by exploring the relationships among the workflows of a repository and by extracting abstractions that could help when reusing other existing workflows. To this end, we propose a simple model for representing workflows and their executions, we analyze the typical abstractions that can be found in workflow repositories, we explore the common practices of users when reusing existing workflows, and we describe a method for discovering abstractions that are useful for users, based on existing graph theory techniques. The results obtained expose the common abstractions and practices of users in terms of workflow reuse, and show how the automatically extracted abstractions have the potential to be reused by users aiming to design new workflows.

Contents

1 Introduction
1.1 Contributions
1.2 Thesis Structure
1.3 Publications
1.4 External Contributions

2 Related Work
2.1 Scientific Workflow Representation
2.1.1 Scientific Workflow Management Systems
2.1.2 Scientific Workflow Life Cycle
2.1.3 Scientific Workflow Models
2.1.4 Scientific Workflow Publication
2.2 Workflow Abstraction
2.2.1 Types of Abstractions in Scientific Workflows
2.2.2 Workflow Patterns
2.3 Workflow Reuse
2.3.1 Measuring Workflow Reuse
2.3.2 Workflow Mining for Reuse
2.4 Summary

3 Research Objectives
3.1 Research Hypotheses
3.2 Open Research Challenges
3.2.1 Workflow Representation Heterogeneity
3.2.2 Inadequate Level of Workflow Abstraction
3.2.3 Difficulties of Workflow Reuse
3.2.4 Lack of Support for Workflow Annotation
3.3 Research Methodology

4 Scientific Workflow Representation and Publication
4.1 Scientific Workflow Model
4.1.1 Representing the Provenance of Workflow Executions: The Open Provenance Model and W3C PROV
4.1.2 Representing Workflow Templates and Instances: P-Plan
4.1.3 OPMW
4.2 Scientific Workflow Publication
4.2.1 Workflows as Linked Data Resources
4.2.2 A Methodology for Publishing Scientific Workflows as Linked Data
4.2.3 Linked Data Workflows: An Example
4.3 Summary

5 Workflow Abstraction and Reuse
5.1 Workflow Motifs
5.1.1 Experimental Setup
5.1.2 Workflow Corpus Description
5.1.3 Methodology for Workflow Analysis
5.1.4 A Motif Catalogue for Abstracting Scientific Workflows
5.1.5 Workflow Analysis Results
5.1.6 Summary
5.2 Analysis of Workflow and Workflow Fragment Reuse
5.2.1 Experimental Setup
5.2.2 Workflow Reuse Analysis Results
5.3 Workflow and Workflow Fragment Reuse: User Survey
5.3.1 Experimental Setup
5.3.2 User Survey Report
5.4 Summary

6 Workflow Fragment Mining
6.1 Data Preparation
6.2 Common Workflow Fragment Extraction
6.2.1 Frequent Sub-graph Mining
6.2.2 Frequent Sub-graph Mining in FragFlow
6.3 Fragment Filtering and Splitting
6.4 Fragment Linking
6.4.1 Workflow Fragment Representation
6.4.2 Finding Fragments in Workflows
6.5 Fragment Statistics and Visualization
6.6 Summary

7 Evaluation
7.1 Evaluation Metrics
7.1.1 Occurrence and Generalization Evaluation Metrics
7.1.2 Usefulness Evaluation Metrics
7.2 Workflow Motif Detection and Workflow Generalization
7.2.1 Experimental Setup
7.2.2 Evaluation of the Application of Inexact FSM Techniques
7.2.3 Evaluation of the Application of Exact FSM Techniques
7.2.4 Summary
7.3 Workflow Fragment Assessment
7.3.1 Experimental Setup
7.3.2 FragFlow Fragments versus User Defined Groupings
7.3.3 User Evaluation
7.3.4 Summary
7.4 Evaluation Conclusions
7.4.1 Commonly Used Workflow Patterns and Abstractions
7.4.2 Workflow Fragment Usefulness

8 Conclusions and Future Work
8.1 Assumptions and Restrictions
8.2 Contributions
8.2.1 Workflow Representation Heterogeneity
8.2.2 Addressing the Inadequate Level of Workflow Abstraction
8.2.3 Difficulty of Workflow Reuse
8.2.4 Lack of Support for Workflow Annotation
8.3 Impact
8.4 Limitations
8.4.1 Workflow Representation and Publication
8.4.2 Common Workflow Motifs
8.4.3 Workflow Fragment Mining
8.4.4 Workflow Generalization
8.5 Future Work
8.5.1 Towards Workflow Ecosystem Interoperability
8.5.2 Automating Detection of Workflow Abstractions
8.5.3 Improving Workflow Reuse

A Competency Questions for OPMW

B Reuse Questionnaire

C Evaluation Details
C.1 Motifs Found in the Wings Corpus
C.2 Details on the Evaluation of the Application of Inexact FSM Techniques
C.3 Details on the Evaluation of the Application of Exact FSM Techniques
C.4 Usefulness Questionnaire

D Additional FSM Algorithms for Mining Useful Fragments

Glossary

Bibliography

List of Figures

2.1 Two sample workflows from two different workflow systems. The one on the left is from the text analytics domain, while the one on the right is for neuro-image analysis.
2.2 Example of a workflow represented as a Petri Net. The initial state is represented with an “I”. Circles represent states of the workflow, and boxes represent the actions executed between states. The arrows represent data dependencies.
2.3 Example of a workflow represented in UML. The initial state is represented with a single black circle, while the final state has two circles. Workflow steps are represented as ellipses, and their data dependencies with arcs. Vertical bars represent fork and join nodes.
2.4 Example of a workflow represented in BPMN. The initial event is represented with a single circle, while the final event is depicted with a circle with a wider line. Workflow steps (tasks) are represented with rounded boxes. The flow is represented by arrows. Diamond boxes represent gateways, which indicate when two activities are performed in parallel.
2.5 Different types of graph, as introduced in Definition 1.
2.6 A workflow template (left), a workflow instance (center), and a successful workflow execution (right).
2.7 A workflow template (left), a workflow instance (center), and a failed workflow execution (right).
2.8 RDF example: a file is described with its creator.

2.9 Skeletal planning abstraction based on a taxonomy. Two different templates (at the bottom) are two possible specializations of the template on the top, based on the taxonomy of components presented on the top right of the figure. Computational steps are represented with rectangles, while inputs, intermediate results and outputs are represented with ovals.
2.10 Predicate abstraction example: the workflow on the left is an abstraction of the workflow on the right. Two predicates (inputs Dictionary and Clusters) are omitted in creating the abstraction.
2.11 Macro abstraction example: two steps on the right are simplified as a single step in the workflow on the left.
2.12 Layered abstraction example: the workflow on the left is an abstraction of the workflow on the right, which results when a workflow execution engine enacts it. Therefore the user's view of the workflow (left) is different from the system's view of the workflow (right).
2.13 Difference in output: while process mining returns a network of probabilities, a graph mining approach focuses on the possible fragments found in the dataset. In the figure, the three workflows on the top lead to the probability network on the bottom left and to the two fragments on the bottom right. Inputs and outputs of each step have been omitted for simplicity.

3.1 Roadmap of the thesis work, organized by the different problems, the approach followed to tackle each one and its proposed evaluation.

4.1 OPMV overview, extracted from its specification.
4.2 PROV overview, extracted from its specification.
4.3 The commonalities between PROV (left) and OPM (right) facilitate mappings across both representations.
4.4 Overview of P-Plan as an extension of PROV.
4.5 Sub-plan representation in P-Plan: a plan (P2) with two steps is contained as the third step of another plan (P1) with 3 steps.
4.6 OPMW and its relationship to the OPM, PROV, and P-Plan vocabularies.

4.7 Example of OPMW as an extension of PROV, OPM and P-Plan. A workflow execution (bottom right) is linked to its workflow template (top right). Other details like attribution metadata have been omitted to simplify the figure.
4.8 Example of roles: an executed workflow step used two datasets of genes (Dataset A, Dataset B) and produced two datasets of genes (Dataset C, Dataset D). Each dataset played a different role in the process.
4.9 Qualifying a usage relationship in PROV (on the left of the figure) and OPM (on the right). Both models use an n-ary pattern (Usage and Used) to link datasetA with the removeDuplicates step and the knownGenes role.
4.10 Role capture in OPMW: the roles of the inputs and outputs used and generated by the activity removeDuplicates extend the OPM and PROV usage and generation properties.
4.11 Example showing attribution metadata capture in OPMW.
4.12 Different approaches for accessing workflow resources: a user may request to resolve a URI of a workflow by browsing a web page (HTML representation). Instead, a machine would request a machine readable format like RDF/XML or Turtle. For both machines and human users, more complex queries about resources may be issued through the public endpoint.

5.1 Sample motifs in a Wings workflow fragment for drug discovery. A comparison analysis is performed on two different input datasets (SMAPV2). The results are then sorted (SMAPResultSorter) and finally merged (Merger, SMAPAlignementResultMerger).
5.2 Sample motifs in a Taverna workflow for functional genomics. The workflow transfers data files containing proteomics data to a remote server and augments several parameters for the invocation request. Then the workflow waits for job completion and inquires about the state of the submitted job. Once the inquiry call is returned, the results are downloaded from the remote server.
5.3 Distribution of the data-operation and data preparation motifs by domain.
5.4 Distribution of the data preparation motifs in the life sciences workflows.

5.5 Distribution of motifs in the life sciences workflows.
5.6 Distribution of workflow-oriented motifs grouped by domain.
5.7 An example of a workflow in the LONI Pipeline, with workflow steps (components) shown as circles. Outputs are shown as triangles while the input (linearly registered) is a smaller circle. The connections among steps represent the dataflow. Users can select sub-workflows to create groupings of components (shown with dashed lines), which can be reused in the same workflow and in others (shown as rectangular components).
5.8 Distribution of the answers regarding the utility of writing and sharing code and the utility of the workflow system.
5.9 Preferred size of created workflows.
5.10 Utility of workflows and groupings.

6.1 An overview of our approach for common workflow fragment mining. The rectangles represent major steps, while the ellipses are the inputs and results from each step. Arrows represent where an input is used or produced by a step.
6.2 Simplifying workflows: by removing the data dependencies in the input graph on the left, we reduce the overhead for graph mining algorithms (middle and right).
6.3 Example of an inexact graph mining technique applied to three different workflows (on top of the figure). The results can be seen in the lower part of the figure.
6.4 Example of an exact match graph mining technique applied to three different workflows (on top of the figure). The results can be seen in the lower part of the figure.
6.5 Example of a support-based approach versus a frequency-based approach. In a support-based approach, the occurrences of the fragment A → B are two (one occurrence in the first workflow and another one in the second workflow), while in a frequency-based approach the occurrences would be three (two times in the first workflow and one in the second).
6.6 Types of fragments for filtering FSM results.
6.7 Wf-fd overview, extending the P-Plan ontology.

6.8 Wf-fd example. Two fragments (resultF1 and resultF2) are found twice in the workflows Workflow1 and Workflow2. Their respective tied workflow fragments indicate where in each of the workflows the fragments were found. Also, resultF2 is part of resultF1, which is recorded appropriately.
6.9 Three sample fragments. Arrows represent the dependencies among the steps.
6.10 Inconsistent and consistent modeling of the fragments depicted in Figure 6.9.
6.11 Transforming fragments obtained with exact FSM techniques (left) and inexact FSM techniques (right).

7.1 Overlap example: two fragments are compared against the same grouping. Fragment1 is exactly the same as Grouping1, while only two out of three steps of Fragment2 are equal to Grouping1 (hence 66.6% overlap).
7.2 Example of an internal macro, annotated with a dashed line. The internal macro consists of the sequence of steps that are included in each branch, ignoring all the possible 2, 3, 4 and 5 sub-workflows included in it.
7.3 Workflow where the annotated internal macro could not be detected. When transforming the workflow to its reduced form, the internal macro is just a one-step fragment, which is ignored.
7.4 Fragment inclusion: the workflow on the left is included in the workflow in the middle, which itself is included in the one on the right. If maximal patterns are chosen by the applied FSM technique, Workflow 1 may not be detected as a common workflow fragment.
7.5 Fragment overlap in SUBDUE: three workflows overlap without being included in each other (their common parts have been highlighted with a dotted line). If the FSM technique selects the bigger fragment, the smaller one (step D followed by B) would not be detected.
7.6 Number of fragments found and precision results for WC1 with the MDL and Size evaluations. For the precision, only multi-step filtered fragments are shown.

7.7 Number of fragments found and precision results for WC2 with the MDL and Size evaluations. For the precision, only multi-step filtered fragments are shown.
7.8 Number of fragments found and precision results for WC3 with the MDL and Size evaluations. For the precision, only multi-step filtered fragments are shown.
7.9 Number of fragments found and precision results for WC4 with the MDL and Size evaluations. For the precision, only multi-step filtered fragments are shown. The asterisk indicates an execution failure for WC4.
7.10 Groupings defined by a user versus a fragment found. If a user defines connected sub-groupings that co-occur with the same frequency, then the fragment found will merge them.
7.11 Exact FSM results for corpus WC1 to WC4 using the gSpan algorithm.

List of Tables

2.1 Summary of the different approaches for representing WT, WI, WET, their metadata and their connections. The asterisk (*) in the WT column indicates that the model makes no distinction when representing WT and WI.

3.1 Hypotheses and their respective addressed workflow research areas.
3.2 Open research challenges and their related hypotheses.
3.3 Research and technical objectives and their related challenges.

4.1 Comparison between OPMW and other scientific workflow vocabularies for representing workflow templates, instances and execution traces. The asterisk (*) in the WT column indicates that the model makes no distinction when representing WT and WI.

5.1 Summary of the main differences in the features of each workflow system: explicit support for control constructs (i.e., conditionals, loops), whether the user interface is web-based or not, whether the environment is open or controlled, and the execution engine used.
5.2 Number of workflows analyzed from Taverna (T), Wings (W), Galaxy (G) and VisTrails (V).
5.3 Maximum, minimum and average size in terms of the number of steps within workflows per domain.
5.4 Overview of our catalog of workflow motifs.
5.5 Distribution of the workflows from Taverna (T), Wings (W), Galaxy (G) and VisTrails (V) in the life sciences domain.
5.6 Corpus overview.

5.7 Reuse of workflows (wfs) for each corpus.
5.8 Statistics and distribution of groupings in the corpora.
5.9 Reuse of groupings (group) for each corpus.
5.10 Survey results concerning why it is useful to create workflows.
5.11 Survey results with multiple choice answers concerning benefits of sharing workflows.
5.12 Survey results with multiple choice answers concerning benefits of sharing groupings.

6.1 Frequent sub-graph mining algorithms integrated in FragFlow.

7.1 Proposed metrics with their requirements for evaluation.
7.2 Summary of the manual analysis on the Wings corpus.
7.3 Inexact FSM techniques for the detection of internal macro motifs, using the templates without and with generalization of their steps.
7.4 Inexact FSM techniques for the detection of workflows in composite workflow motifs, using the templates as they are and with generalization and the SUBDUE MDL evaluation. The asterisk represents the results when a target workflow is included as part of a bigger detected fragment.
7.5 Inexact FSM techniques for the detection of composite workflow motifs, using the templates as they are and with generalization and the SUBDUE Size evaluation. The asterisk represents the results when a target workflow is included as part of a bigger detected fragment.
7.6 Exact FSM techniques for the detection of workflows included in composite workflow motifs, using the templates with and without generalization.
7.7 Unique workflows, groupings and their reuse. These numbers will later be used to calculate the precision and recall of the proposed fragments.
7.8 Recall result summary: for each corpus (WC1 to WC4) and inexact FSM technique, the table displays the lowest and highest recall obtained, along with the frequency at which it was obtained.

7.9 Number of multi-step filtered fragments found by corpus at different support percentages. The number of workflows that the support corresponds to is indicated in brackets. Due to memory problems, some executions (indicated with an asterisk) at lower frequencies had to be limited to a maximum fragment size (around 10-15).
7.10 Inexact FSM technique results versus exact FSM techniques in terms of precision, considering exact comparison and overlap.
7.11 Recall results summary obtained for the gSpan algorithm on each corpus. The support percentage at which each result was obtained is highlighted in brackets.
7.12 User evaluation of FragFlow fragments.

8.1 Assumptions considered in this thesis.
8.2 Restrictions considered in this thesis.

A.1 Competency questions for OPMW (1).
A.2 Competency questions for OPMW (2).
A.3 Competency questions for OPMW (3).
A.4 Competency questions for OPMW (4).
A.5 Competency questions for OPMW (5).

B.1 List of the questions included in the user survey (1).
B.2 List of the questions included in the user survey (2).
B.3 List of the questions included in the user survey (3).

C.1 Motifs found in the Wings corpus. Templates with one step are omitted.
C.2 Details on precision and recall of inexact FSM techniques on corpora WC1-WC4, used in the LONI Pipeline evaluation. The frequency indicates the minimum number of occurrences for a fragment to appear in the corpus. The precision and recall are shown only for multi-step filtered fragments (Mff) for simplicity.

C.3 Details on precision and recall of exact FSM techniques on corpora WC1-WC4, used in the LONI Pipeline evaluation. The support indicates the minimum number of templates in which a fragment must appear in the corpus. The precision and recall are shown only for multi-step filtered fragments (Mff) for simplicity. The asterisk (*) means that the execution was limited by setting a maximum fragment size, in order to avoid out of memory errors.
C.4 User questionnaire for assessing the usefulness of the proposed fragments.

Chapter 1

Introduction

In the last decade, the research output produced by the scientific community has almost doubled, from 1.3 million articles in 2003 to 2.4 million in 2013 (Ware and Mabe, 2015). The productivity of scientists is heavily influenced by their number of publications in high impact journals and conferences. Scientists are pressured to publish (Fanelli, 2010), but they are not always required to grant access to the digital outputs associated with their scientific publications. As a consequence, there is a general lack of transparency when communicating the computational methods used to obtain publication results.

In order to improve transparency, several initiatives for open science are underway. The European Commission includes funding for the open publication of research results obtained in project grants [1]. The National Science Foundation requires a data management plan to ensure access to and preservation of the outputs of a project grant [2]. Publishers have promoted the notion of data and software journals [3] to store and preserve datasets as publications. Data repositories like FigShare [4] or Dryad [5] provide the means to share and cite any piece of a scientific investigation. Similarly, code repositories like GitHub [6] and Zenodo [7] allow sharing and documenting software for the community.

Although these initiatives are valuable efforts to address access to the resources of an experiment, they do not focus on improving the communication of the

[1] http://ec.europa.eu/research/science-society/open_access
[2] http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/gpg_2.jsp#dmp
[3] http://www.journals.elsevier.com/data-in-brief/
[4] http://figshare.com/
[5] http://datadryad.org/
[6] http://github.com/
[7] https://zenodo.org/

methods used in a scientific publication. Scientific articles usually describe computational methods informally, often requiring a significant effort from others to reproduce and reuse them. The reproducibility process can be very costly, even when the software and datasets used in the publication are available online (Garijo et al., 2013b). Retractions of publications have occurred in several disciplines (Marcus and Oransky, 2014; Rockoff, 2015). Some initiatives have started tracking and documenting the retracted papers of scientific journals [8], in order to alert the community about problematic papers. The repercussions of these bad practices can be seen beyond scientific circles. The public shows neutral to low trust in science on topics like pesticides, depression drugs or flu pandemics (American, 2010). Even publishers themselves have demanded that researchers submit detailed descriptions of the materials and methods used in a publication (Editorial, 2006).

Scientific workflows were proposed in the last decade in order to represent the computational methods used in scientific publications (Taylor et al., 2006). Scientific workflows define the set of computational tasks and dependencies needed to carry out in silico experiments (therefore improving their reproducibility), and have been increasingly adopted in domains like astronomy (Ruiz et al., 2014), brain image analysis (Dinov et al., 2009) and bioinformatics (Wolstencroft et al., 2013), among others. Besides execution and reproducibility, there are several benefits of using scientific workflows (Goderis, 2008) (Garijo et al., 2014b):

• Time savings: Individual users save a lot of time by copying and pasting whole working pipelines from previous experiments into new ones. Other users save time as well when they reuse a workflow created by someone else, since they do not have to re-implement every single step.

• Teaching: Workflows can be used as an effective way to teach students about the sequence of steps and methods involved in processing an input. Breakpoints are often placed throughout the workflow to serve as checkpoints and to make sure that the execution was performed correctly.

• Visualization: With workflows, it is easier to understand how the overall method is structured, as well as the algorithms used in each step.

[8] http://retractionwatch.com/

• Design for modularity: Workflows provide a high-level view of the major steps involved in an analysis, and exposing those major steps drives the design of the code in a modular fashion.

• Design for standardization: Using a common workflow system allows researchers to see how others process certain kinds of data, what software packages they use, and what formats are more common among a group of collaborators. This leads to workflows that effectively capture emerging standards in the ways that data is formatted and processed, based on common practices adopted by a community of users.

• Debugging and inspectability: A workflow execution might fail due to incorrect setup, problems in the underlying code, missing files, incompatible file types, or server-related issues. Researchers use the workflow system’s environment to debug errors in workflows.

Scientific workflows are relevant products of research, and “key enablers for reproducibility of experiments involving large-scope computations” (Gil et al., 2007). However, reusing a workflow or any of its parts may become a daunting task. Scientific workflows can become large and complex heterogeneous structures, and the lack of documentation and examples often increases the difficulty in understanding their main goals (Belhajjame et al., 2012b). In addition, there are large numbers of workflows in existing workflow repositories. A repository may contain hundreds or thousands of different workflows (Roure et al., 2009), and determining which ones are relevant to the problem at hand may become a hurdle for a researcher. In this regard, the creation of abstractions that group different workflows by certain criteria (e.g., common general functionality, shared workflow steps, etc.) is needed to improve workflow understandability.

In this work we address the issue of reusability and abstraction in scientific workflows. We propose to do so by exploring common relationships among groups of workflows in a workflow repository, mining abstractions that are useful for reuse.

1.1 Contributions

The work presented in this thesis makes the following contributions:

• Workflow representation and publication: We propose a model to represent scientific workflows in the different stages of their life cycle, as well as a methodology designed to publish scientific workflows and their resources in an open architecture on the Web.

• A catalog of common workflow abstractions: We define a catalog of workflow motifs, which aims to capture the most common steps in scientific workflows based on their functionality.

• Automatic detection of commonly used workflow fragments: We present an approach to automatically mine commonly used workflow fragments that are helpful for workflow reuse. Our approach exploits domain knowledge to generalize workflows, which leads to the discovery of common abstract workflow fragments. We also define metrics for assessing the usefulness of our results.

• Support for workflow annotation: In addition to our workflow representation model, we define two models for semi-automatically annotating scientific workflows. The first model is a serialization of our catalog of workflow abstractions, along with the means to annotate workflow steps and groups of steps. The second model describes how workflow fragments are represented and linked to the workflows where they appear.

1.2 Thesis Structure

The thesis is structured as follows: Chapter 2 introduces the state of the art, along with the main concepts used throughout the thesis. Based on the gaps identified in the state of the art, Chapter 3 describes the work objectives and research hypotheses of this work. Chapter 4 describes the model and methodology proposed to represent and publish scientific workflows in the different stages of their life cycle. Chapter 5 presents the catalog of common domain-independent workflow abstractions, as well as two analyses of workflow reuse (one from an automatic perspective and another from a user community perspective) that help understand the current practices of workflow users.

Chapter 6 explains the method to mine commonly used workflow fragments in repositories of workflows, including how to filter and link the results. This approach is evaluated in Chapter 7, where the results are explained in detail. Finally, Chapter 8 describes conclusions and future lines of work.

1.3 Publications

The following publications have been accepted (in chronological order) during the research work presented in this thesis:

1. Daniel Garijo and Yolanda Gil. A New Approach for Publishing Workflows: Abstractions, Standards, and Linked Data. Proceedings of the 6th Workshop on Workflows in Support of Large-Scale Science, pages 47-56, Seattle, USA, 2011.

2. Daniel Garijo, Pinar Alper, Khalid Belhajjame, Oscar Corcho, Yolanda Gil, and Carole Goble. Common Motifs in Scientific Workflows: An Empirical Analysis. 8th IEEE International Conference on eScience 2012, pages 1-8, Chicago, USA, 2012.

3. Daniel Garijo and Yolanda Gil. Augmenting PROV with Plans in P-Plan: Scientific processes as Linked Data. Second International Workshop on Linked Science: Tackling Big Data (LISC), held in conjunction with the International Semantic Web Conference (ISWC), Boston, USA, 2012.

4. Daniel Garijo, Oscar Corcho, and Yolanda Gil. Detecting Common Scien- tific Workflow Fragments Using Templates and Execution Provenance. Seventh International Conference on Knowledge Capture (K-CAP), pages 33-40, Banff, Alberta, Canada. 2013.

5. Khalid Belhajjame, Jun Zhao, Daniel Garijo, Aleix Garrido, Stian Soiland-Reyes, Pinar Alper and Oscar Corcho. A Workflow PROV-Corpus Based on Taverna and Wings. Proceedings of the Joint EDBT/ICDT 2013 Workshops, pages 331-332, Genova, Italy, 2013.

6. Daniel Garijo, Pinar Alper, Khalid Belhajjame, Oscar Corcho, Yolanda Gil, Carole Goble. Common Motifs in Scientific Workflows: An Empirical Analysis (extension of the 2012 conference paper for a journal). Future Generation Computer Systems, volume 36, pages 338-351, 2014.

7. Daniel Garijo, Oscar Corcho, Yolanda Gil, Boris A. Gutman, Ivo D. Dinov, Paul Thompson, and Arthur W. Toga. FragFlow: Automated Fragment Detection in Scientific Workflows. 10th IEEE International Conference on eScience 2014, pages 281-289, Guarujá, Brazil, 2014.

8. Daniel Garijo, Oscar Corcho, Yolanda Gil, Meredith N. Braskie, Derrek Hibar, Xue Hua, Neda Jahanshad, Paul Thompson, and Arthur W. Toga. Workflow Reuse in Practice: A Study of Neuroimaging Pipeline Users. 10th IEEE International Conference on eScience 2014, pages 90-99, Guarujá, Brazil, 2014.

9. Daniel Garijo and Yolanda Gil. Towards Workflow Ecosystems Through Standard Representations. 9th Workshop on Workflows in Support of Large-Scale Science (WORKS 14), pages 94-104, New Orleans, USA, 2014.

1.4 External Contributions

The main work presented in this thesis is an original contribution by the author of the manuscript. However, some chapters have been developed in collaboration with other researchers. This section summarizes their names and contributions to this thesis.

1. In Section 5.1, Pinar Alper and Khalid Belhajjame contributed by manually inspecting many of the Taverna and VisTrails workflows. Together with Carole Goble, they were also key to the discussions and development of the workflow motif catalog, the associated ontology and the resulting publications.

2. In Section 5.3, Neda Jahanshad, Xue Hua, Meredith N. Braskie, Derrek Hibar and Yolanda Gil contributed to the discussions used to create the online survey.

3. In Section 7.3, Samuel Hobel, Ivo D. Dinov and Boris A. Gutman participated in evaluating the usefulness of the fragments produced by our approach. In addition, some of the workflows of Boris A. Gutman have been used in the explanations of this thesis.

4. Zhizhong Liu helped to retrieve the multi-user corpus of workflows of the LONI Pipeline (WC3). This corpus was used in Chapters 5 and 7.

Chapter 2

Related Work

A scientific workflow is defined as a template specifying the tasks needed to carry out a computational scientific experiment (Taylor et al., 2006). Each task, also known as a workflow step, uses zero or more inputs and produces one or more outputs. The dataflow among the tasks, which we refer to as workflow step dependencies, is captured by recording the relationship between each output and input of any step of the workflow. A workflow input is an input that has not been produced by any other workflow step. Similarly, a workflow output is an output that is not used by any other workflow step. An intermediate result is an output of a workflow step that has been used as an input by another workflow step in the workflow.
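These definitions can be made operational with a few lines of code. The following Python sketch (a minimal illustration; the step and dataset names are hypothetical, not taken from any particular workflow system) classifies the datasets of a workflow into workflow inputs, workflow outputs and intermediate results from the declared inputs and outputs of each step:

```python
# Each workflow step declares the datasets it uses and produces.
# Step and dataset names are hypothetical, for illustration only.
steps = {
    "Stem": {"inputs": {"Dataset"}, "outputs": {"StemmedWords"}},
    "Sort": {"inputs": {"StemmedWords"}, "outputs": {"FinalResult"}},
}

used = set().union(*(s["inputs"] for s in steps.values()))
produced = set().union(*(s["outputs"] for s in steps.values()))

workflow_inputs = used - produced       # not produced by any step
workflow_outputs = produced - used      # not used by any other step
intermediate_results = used & produced  # produced by one step, used by another

print(workflow_inputs)       # {'Dataset'}
print(workflow_outputs)      # {'FinalResult'}
print(intermediate_results)  # {'StemmedWords'}
```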

Scientific workflows have been used successfully in a wide range of domains, from life sciences to geology and astronomy. Figure 2.1 shows an example of two scientific workflows from two different workflow systems. In both samples we can see the most typical elements of a scientific workflow: their inputs (textFile1, textfile2, threshold1 and threshold2 for the workflow on the left; projected landmarks, average curve landmarks and skulpted for the workflow on the right), intermediate results (only shown in the workflow on the left) and outputs (LikelihoodFile on the left, and the upside-down triangle at the bottom of the MINC obj 2 mesh step on the right). The inputs of a scientific workflow may themselves have been produced by other scientific workflows; thus, an end-to-end experiment can be seen as a puzzle composed of different scientific workflows, each performing a different aspect of it. The main objective of this thesis is to mine repositories of workflows to find common abstractions that are helpful for workflow reuse. There are several areas of ongoing related work.

Figure 2.1: Two sample workflows from two different workflow systems. The one on the left is from the text analytics domain, while the one on the right is for neuro-image analysis.

The first one is workflow representation, key for understanding the functionality of the workflow steps at the different stages of the workflow life cycle. The second area is workflow abstraction, which refers to the ability to generalize different workflows (or workflow steps) by their common functionality. Workflow abstraction is important for simplifying workflows and for finding relationships between them, which in the end makes them easier to reuse and understand. Finally, the third area is workflow reuse, as we aim to allow scientists to complete their experiments by reusing existing workflows (or parts of them) created by other researchers. All three areas are closely related to each other. Having a clear representation of the workflow functionality and its relationships with other workflows at several levels of abstraction may help users decide whether a given workflow (or any of its sub-parts) can be adapted from a previous experiment for reuse or not. In this chapter we analyze the current state of the art and its limitations in workflow representation, abstraction and reuse, introducing the concepts that will be used throughout the rest of the thesis.

2.1 Scientific Workflow Representation

Scientific workflows have often been compared to business workflows [1], which are used to represent business processes in the corporate world (Cerezo et al., 2013). As stated in (Cerezo et al., 2013), “nothing prevents a user from using a business workflow framework to model and perform a scientific experiment or a scientific workflow framework to capture and automate a business process” and, in fact, there are examples of approaches for business workflows being adapted for scientific workflows (Mayer et al., 2012) (Slominski, 2007). However, there are some important differences between scientific workflows and business workflows (Cerezo et al., 2013) (Tan et al., 2009) (Barga and Gannon, 2007). On the one hand, business workflows are generally robust, secure, reliable and control-driven (i.e., they often include control constructs like conditional branches and loops). Business workflows are designed to perform services rather than to explore whether an experiment can be successful or not. They also include policy and privacy concerns for their usage, as is typical in the corporate world. On the other hand, scientific workflows represent in silico experiments that tend to be data driven, more dynamic and evolving, with an aim for shareability and reusability of the exposed scientific methods and with a flexible design for being extended by other researchers. In our work we focus on scientific workflows. There have been several proposals for their representation, all of them using a graph-based approach. We introduce the most relevant approaches in this section. Examples of workflow systems adopting each representation are further described in (Deelman et al., 2009).

Petri Nets (Reisig and Rozenberg, 1998) represent workflow steps as actions, which connect states of the workflow through directed arcs. Each state represents the status of the workflow after the execution of an action. The workflow always starts with an initial neutral state and reaches its final state only after the execution is completed. Figure 2.2 shows an example of a workflow represented as a Petri Net. The workflow retrieves a list of terms from a file (Read), adds annotations in parallel by consulting two different data sources (Annot1 and Annot2) and plots the results (Plot). Two additional steps (Split and Merge) distribute the data and merge it later in order to create the annotations in parallel.

[1] http://www.wfmc.org/what-is-bpm

Figure 2.2: Example of a workflow represented as a Petri Net. The initial state is represented with an “I”. Circles represent states of the workflow, and boxes represent the actions executed between states. The arrows represent data dependencies.

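To make the firing semantics behind Figure 2.2 concrete, the following Python sketch simulates a run of a comparable net (the place names are our own simplification, not the figure's exact labels): a transition fires when all of its input places hold a token, consuming those tokens and marking its output places.

```python
# Transitions of a small Petri net: name -> (input places, output places).
transitions = {
    "Read":   ({"I"},  {"p1"}),
    "Split":  ({"p1"}, {"p2", "p3"}),
    "Annot1": ({"p2"}, {"p4"}),
    "Annot2": ({"p3"}, {"p5"}),
    "Merge":  ({"p4", "p5"}, {"p6"}),
    "Plot":   ({"p6"}, {"F"}),
}
marking = {"I"}  # initial state: a single token in the initial place

fired = True
while fired:
    fired = False
    for name, (pre, post) in transitions.items():
        if pre <= marking:  # enabled: every input place holds a token
            marking = (marking - pre) | post
            print("fired:", name)
            fired = True

print(marking)  # {'F'}: the final state has been reached
```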
The Unified Modelling Language (UML) aims “to provide system architects, software engineers, and software developers with tools for analysis, design, and implementation of software-based systems as well as for modeling business and similar processes” (Bock et al., 2015). UML is one of the most popular languages for designing a system in software engineering, as it provides the means to create structure, integration and behavior diagrams. These include activity diagrams, which can be used to represent scientific workflows. Figure 2.3 shows an example illustrating the workflow depicted in Figure 2.2 (as a Petri Net) in UML.

Figure 2.3: Example of a workflow represented in UML. The initial state is represented with a single black circle, while the final state has two circles. Workflow steps are represented as ellipses, and their data dependencies with arcs. Vertical bars represent fork and join nodes.

The Business Process Model and Notation (BPMN) (Aggarwal et al., 2011) provides a standard notation for implementing, managing and monitoring business processes. BPMN is widely used in business workflows, and it was designed to visualize business process execution languages such as the Business Process Execution Language (BPEL) (Jordan et al., 2007). Figure 2.4 shows how the workflow described in Figure 2.2 would be represented in BPMN.

Figure 2.4: Example of a workflow represented in BPMN. The initial event is represented with a single circle, while the final event is depicted with a circle with a wider line. Workflow steps (tasks) are represented with rounded boxes. The flow is represented by arrows. Diamond boxes represent gateways, which indicate when two activities are performed in parallel.

However, the most typical representation for scientific workflows is the graph-based model, in which the nodes of the graph represent workflow steps and the edges capture their dataflow dependencies (e.g., (Gil et al., 2011) (Wolstencroft et al., 2013) (Goecks et al., 2010) (Fahringer et al., 2007) (Berthold et al., 2008) (Callahan et al., 2006) (Ludäscher et al., 2006)). Figure 2.1 shows two workflow examples from two different systems. It is worth mentioning that, despite sharing the model, some systems represent the inputs, outputs and intermediate results differently. In this case, ellipses are used in the workflow on the left to refer to inputs, outputs and intermediate results, while small circles and a triangle represent inputs and outputs in the workflow on the right. Graph-based models are usually based on the definitions proposed in (Bondy and Murty, 1976). Definition 1 summarizes the main concepts, which we use as a basis for the rest of the thesis.

Definition 1 A graph $G = (V, E)$ is defined as a non-empty set of vertices (nodes) $V = \{v_1, \ldots, v_n\}$ interconnected by an empty or non-empty set of edges (links) $E = \{e_1, \ldots, e_m\}$, with $E \subseteq V \times V$ and $V$ disjoint from $E$ ($V \cap E = \emptyset$). A graph is directed when an edge $e = (u, v) \in E$ goes from $u$ to $v$ (with $u, v \in V$), $u$ being the tail of the edge and $v$ its head. A walk $W$ is a sequence of consecutive vertices $(v_1, \ldots, v_n)$ such that for each pair of vertices $(v_i, v_{i+1})$, with $i \in \{1, \ldots, n-1\}$, there exists an edge $e \in E$ that connects them. If $v_i \neq v_j$ for all $(v_i, v_j) \in W$ with $i \neq j$, then $W$ is denoted a path. Two vertices of a graph $G$ are connected if there is a path between them. Connection is an equivalence relation on the vertex set $V$; hence, a graph $G = (V, E)$ can be divided into sets of connected vertices, called components. When $G = (V, E)$ has only one component, the graph is connected; otherwise the graph is disconnected. A cycle $C$ is a walk $(v_1, \ldots, v_n)$ where $v_1 = v_n$ and where the edges $(e_1, \ldots, e_k) \in C$ are distinct. A graph without cycles is acyclic. Finally, a graph $G = (V, E, L_V, L_E, \varphi_V, \varphi_E)$ is labeled if $L_V$ and $L_E$ are sets of labels for the vertices and edges respectively, and $\varphi_V: V \to L_V$ and $\varphi_E: E \to L_E$ are functions that map each vertex or edge to a label. An illustrative example of each type of graph can be seen in Figure 2.5.

Figure 2.5: Different types of graph, as introduced in Definition 1.
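As a minimal companion to Definition 1, the following Python sketch encodes a small labeled directed graph (the vertex identifiers and labels are hypothetical) and tests whether two vertices are connected by a path:

```python
# Labeled directed graph G = (V, E, L_V, L_E, phi_V, phi_E) from Definition 1.
V = {"v1", "v2", "v3"}
E = {("v1", "v2"), ("v2", "v3")}
phi_V = {"v1": "Read", "v2": "Annotate", "v3": "Plot"}  # phi_V: V -> L_V
phi_E = {e: "dataflow" for e in E}                      # phi_E: E -> L_E

def has_path(edges, source, target):
    """Depth-first search: True if a path connects source to target."""
    stack, visited = [source], set()
    while stack:
        u = stack.pop()
        if u == target:
            return True
        if u not in visited:
            visited.add(u)
            stack.extend(head for (tail, head) in edges if tail == u)
    return False

print(has_path(E, "v1", "v3"))  # True: v1 -> v2 -> v3 is a path
print(has_path(E, "v3", "v1"))  # False: the graph is directed
```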

Scientific workflows are typically data intensive, and they are often represented as directed acyclic graphs (DAGs). However, there are systems which allow for control flow constructs and use, when necessary, directed cyclic graphs (DCGs), like Kepler (Ludäscher et al., 2006), VisTrails (Callahan et al., 2006) or Moteur (Glatard et al., 2007). The sketch below illustrates why this distinction matters for execution; in the next section we introduce the main workflow management systems used in this work.
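A minimal sketch, using Python's standard library (the step names are hypothetical): acyclicity guarantees that a topological sort yields a valid execution order for the workflow steps, while any cycle makes such a static order impossible.

```python
from graphlib import CycleError, TopologicalSorter

# DAG: each workflow step maps to the set of steps it depends on.
dag = {"Read": set(), "Annot1": {"Read"}, "Annot2": {"Read"},
       "Plot": {"Annot1", "Annot2"}}
print(list(TopologicalSorter(dag).static_order()))
# e.g. ['Read', 'Annot1', 'Annot2', 'Plot'] -- a valid execution order

# Adding a backward dependency turns the DAG into a DCG; the sort fails.
dcg = dict(dag, Read={"Plot"})
try:
    list(TopologicalSorter(dcg).static_order())
except CycleError:
    print("cycle detected: no static execution order exists")
```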

2.1.1 Scientific Workflow Management Systems

In order to design, monitor, execute and debug scientific workflows, the community has created workflow management systems for different domains. Some examples are Wings (Gil et al., 2011), Taverna (Wolstencroft et al., 2013), GenePattern (Reich et al., 2006), the LONI Pipeline (Dinov et al., 2009), Galaxy (Goecks et al., 2010), VisTrails (Callahan et al., 2006), ASKALON (Fahringer et al., 2007), Knime (Berthold et al., 2008), Kepler (Ludäscher et al., 2006), Pegasus (Deelman et al., 2005) or Moteur (Glatard et al., 2007). These and other systems have already been described and compared in detail in related work (Goderis, 2008) (Cerezo, 2013) (Deelman et al., 2009) (Yu and Buyya, 2005). Here we introduce the workflow systems that have been used for the experiments of this thesis. The rationale behind using each of these workflow systems is explained in Chapter 5.

2.1.1.1 Taverna

A workflow system that can operate in different execution environments and provides several deployment options (Wolstencroft et al., 2013). Taverna is available as a workbench [2], which embodies a desktop design user interface and an execution engine. Taverna also allows standalone deployments of its engine [3] in order to serve multiple clients. In its default configuration, Taverna does not prescribe that the datasets and tools be integrated into an execution environment. In this sense it adopts an open-world approach, where workflows integrate (typically) remote third party resources and compose them into data-intensive pipelines. In addition, it also allows the development of plug-ins for the access and usage of dedicated computing infrastructures (e.g., grids) or local tools and executables. Taverna has been widely used in bioinformatics and in other domains, including astronomy, chemistry, text mining and image analysis.

2.1.1.2 Wings

A workflow system that uses semantic representations to describe the constraints of the data and computational steps in the workflow (Gil et al., 2011). Wings can reason about these constraints, propagating them through the workflow structure and using them to validate workflows. It has been used in different domains, ranging from life sciences to text analytics and geosciences. Wings provides web-based access and can run workflows locally, or submit them [4] to the Pegasus/Condor (Deelman et al., 2005) or Apache OODT (Mattmann et al., 2006) execution environments, which can handle large-scale distributed data and computations, optimization and data movement.

[2] Taverna Workbench: http://www.taverna.org.uk/download/workbench/
[3] Taverna Server: http://www.taverna.org.uk/download/server/
[4] http://www.wings-workflows.org

2.1.1.3 Galaxy

A web-based platform for data intensive biomedical research (Giardine et al., 2005) (Blankenberg et al., 2007). One of the main features of Galaxy is its cloud backend, which provides support for its extensive catalogue of tools. These tools allow performing different types of analysis on data from widely used existing datasets in the biomedical domain. Galaxy uses its own engine for managing workflow execution, compatible with batch systems or Sun Grid Engine (SGE) [5]. Galaxy workflows can be run online [6] or by setting up a local instance [7].

2.1.1.4 VisTrails

A system that tracks change-based provenance in workflow specifications in order to facilitate reuse (Callahan et al., 2006). It has been used in different domains of the life sciences, like medical informatics and biomedicine, but also in other domains like image processing, climatology and physics. VisTrails uses its own engine to manage execution and allows for the combination of specialized libraries, grid and web services. Its workflows can be run online [8] or locally [9].

2.1.1.5 The LONI Pipeline

A workflow system developed by the Laboratory of Neuro Imaging (LONI) [10], mainly for neuro-imaging applications (Dinov et al., 2011) (Dinov et al., 2009). It provides an efficient distributed computing solution to address common challenges in neuro-imaging research, enabling investigators to share, integrate, collaborate and expand resources including data, computing platforms, and analytic algorithms. The LONI Pipeline is mostly used for complex neuro-imaging analyses, which often require knowledge about the input/output requirements of algorithms, data format conversions, optimal parameter settings, and a unique running environment, since imaging studies tend to produce large amounts of data.

[5] http://star.mit.edu/cluster/docs/0.93.3/guides/sge.html
[6] https://main.g2.bx.psu.edu/root
[7] http://wiki.galaxyproject.org/Admin/Get%20Galaxy
[8] http://www.crowdlabs.org/vistrails/
[9] http://www.vistrails.org/index.php/Downloads
[10] http://loni.usc.edu/

2.1.2 Scientific Workflow Life Cycle

There are four main stages in the life cycle of a workflow (van der Aalst et al., 2003b): workflow design, workflow configuration, workflow enactment and workflow diagnosis. Rather than focusing on the phases themselves, here we describe the main workflow structures that are interchanged within a workflow management system during the workflow life cycle. We distinguish three major types of workflow structures:

1. Workflow Template (WT): A generic reusable workflow specification that indicates the types of steps in the workflow and their dataflow dependencies. Workflow templates have also been referred to as “Workflow Orchestration” (Deelman et al., 2009). A workflow generation tool can take the types of steps specified in the workflow template (e.g., “Sort”) and specialize them to implemented algorithms and codes (e.g., “Quicksort in Python”) to create a workflow instance.

2. Workflow Instance (WI): A workflow that specifies the application algorithms to be executed and the data to be used (Deelman et al., 2009). Workflow instances are created from workflow templates when datasets are identified, and are sometimes called abstract workflows because they do not specify execution resources (Cerezo et al., 2013). Since a type of step can have different implementations, the same workflow template could be used to generate very different workflow instances. A workflow instance can be submitted to a workflow mapping and execution system, which will identify and assign available resources at run-time, submit it for execution, oversee its execution, and return a workflow execution trace. Because different resources may be available at different times or in different execution environments, the same workflow instance may result in very different workflow execution traces.

3. Workflow Execution Trace (WET): Also known as a provenance trace, a workflow execution trace contains the details of what happened during the execution of a workflow, including what resources it was executed on, the execution time of each step, links to the intermediate results, and possibly other execution details such as steps for data movements. When workflow steps fail to execute, a workflow execution trace contains annotations of what failed, so its dataflow structure may differ from the dataflow of the workflow instance.
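To make the three structures concrete, the following minimal sketch (our own illustration, not the API of any particular workflow system; all class and field names are assumptions) shows how a template, an instance and a trace might be modeled, reusing the Stem/Sort example of Figure 2.6:

```python
# A minimal, illustrative sketch of the three workflow structures.
from dataclasses import dataclass, field

@dataclass
class WorkflowTemplate:
    steps: list[str]                     # abstract step types, e.g. "Sort"
    dataflow: list[tuple[str, str]]      # (producer, consumer) dependencies

@dataclass
class WorkflowInstance:
    template: WorkflowTemplate
    bindings: dict[str, str]             # step type -> concrete algorithm/code
    inputs: dict[str, str]               # input name -> dataset identifier

@dataclass
class ExecutionTrace:
    instance: WorkflowInstance
    timings: dict[str, tuple[str, str]] = field(default_factory=dict)  # step -> (start, end)
    results: dict[str, str] = field(default_factory=dict)              # result id -> file path
    failed_steps: list[str] = field(default_factory=list)              # annotated failures

# The template of Figure 2.6 and one possible instance of it
template = WorkflowTemplate(steps=["Stem", "Sort"], dataflow=[("Stem", "Sort")])
instance = WorkflowInstance(template,
                            bindings={"Stem": "LovinsStemmer", "Sort": "Quicksort"},
                            inputs={"Dataset": "Dataset123"})
```

The same instance, executed twice, could yield two different ExecutionTrace objects (e.g., with different timings or failed steps), mirroring the discussion above.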

Figure 2.6: A workflow template (left), a workflow instance (center), and a successful workflow execution (right).

Figure 2.6 illustrates the difference between workflow templates, instances and executions with an example. The left of the figure shows a template with two steps (Stem and Sort), an input (Dataset) and an output (FinalResult). A workflow instance built from this template is shown in the middle of the figure, specifying Dataset123 as the input and specific executable codes for each of the steps (the LovinsStemmer and Quicksort algorithms respectively). The workflow instance also has placeholders for the expected results (Id:resultaa1 and Id:resultaa2). When this workflow instance is executed, the workflow execution engine produces the execution trace shown on the right (each step has its start and end time, and each intermediate result identifier (e.g., Id:resultaa1) has the path of the file it represents associated with it). Figure 2.7 shows a failed execution of the same template and instance shown in Figure 2.6. The execution failed when executing the Quicksort component, which led to a trace with no final output and with a different dataflow from the workflow instance. Another crucial aspect to consider when representing workflows in their different stages is their granularity. Templates may be simplified or abstracted to help in workflow design (Cerezo, 2013). Workflow execution traces and workflow instances may have very little in common after a workflow execution engine processes and optimizes the workflow instance for its execution (Deelman et al., 2004). Workflow traces can be summarized to provide an explanation of a failure. All these representations coexist, and are views of a workflow at the different phases of its life cycle.

Figure 2.7: A workflow template (left), a workflow instance (center), and a failed workflow execution (right).

2.1.3 Scientific Workflow Models

Most scientific workflow models aim at representing specifications of workflows (i.e., workflow instances and templates) and the results obtained after an execution. This section aims to illustrate some of the existing approaches with examples, focusing on those models that link workflow templates, instances and execution traces together with their metadata.

2.1.3.1 Workflow Template and Instance Specification Models

Workflow systems often define their own model for specifying workflow instances or templates. These models usually differ from one system to another, as they depend on the particular system's supported features or domain-specific requirements

(e.g., control-oriented constructs, enabling easy access to external web services, etc.). For example, Taverna has Scufl (Oinn et al., 2004), Pegasus uses DAX11, Kepler uses MOML (Lee and Neuendorffer, 2000), Askalon uses AGWL (Fahringer et al., 2005), etc. Some additional examples can be seen in (Deelman et al., 2009). There have been efforts towards creating a common workflow language. In the business workflow domain, Web Services BPEL (WS-BPEL) (Jordan et al., 2007) (Barga and Gannon, 2007) was designed for specifying business processes using web services. In the scientific workflow domain, the Data-Intensive Systems Process Engineering Language (DISPEL) was created to “describe abstract workflows for distributed data-intensive applications” (Atkinson et al., 2013). Another significant effort to develop a common workflow language was led by the SHIWA project (Krefting et al., 2011), which developed the IWIR language (Plankensteiner et al., 2011) for representing workflows in a system-independent manner. Workflows represented in IWIR can be partitioned so that each partition is executed in a different workflow execution engine.

2.1.3.2 Workflow Execution Trace Models

These are the models designed to represent the provenance trace of a workflow. According to the recent W3C standard, provenance can be defined as “a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing” (Moreau et al., 2013). The notion of provenance has been widely discussed in related work for a variety of domains, ranging from digital humanities12 to e-science (Simmhan et al., 2005). In fact, several models have been proposed throughout the years to manage and represent generic provenance (Moreau, 2010). In an attempt to propose a common provenance standard, the workflow community started the Provenance Challenges13, which focused on discussing different approaches for representing the same provenance problem and testing their interoperability. After the Third Provenance Challenge14, the Open Provenance Model (OPM) (Moreau et al., 2011) consolidated itself as a de facto provenance standard and was adopted by a reasonable part of the community15. The

11 http://pegasus.isi.edu/wms/docs/schemas/dax-3.4/dax-3.4.html
12 http://www.getty.edu/research/tools/provenance/
13 http://twiki.ipaw.info/bin/view/Challenge
14 http://twiki.ipaw.info/bin/view/Challenge/ThirdProvenanceChallenge
15 http://openprovenance.org/

interest in having a standard provenance interchange vocabulary led to the creation of the W3C Provenance Incubator Group16, which gathered the provenance requirements proposed by the scientific community in different domains (Groth et al., 2012) and set the first steps towards the creation of a standard (Gil et al., 2010). It was followed by the Provenance Working Group17, whose effort finally materialized in the family of PROV specifications18, a set of W3C recommendations on how to model and interchange provenance on the Web (Groth and Moreau, 2013). Some workflow systems have started capturing their workflow executions by extending the OPM and PROV models (most of them during the development of this thesis). Swift was perhaps one of the first workflow systems to export to OPM (Gadelha Jr. et al., 2011). The Taverna team followed with an experimental workflow export to OPM19, but more recently they have moved to the Research Object model (Belhajjame et al., 2012a) to represent workflow descriptions and provenance according to PROV. Wings exports workflow executions following both OPM (Garijo and Gil, 2011) and PROV20. Other approaches have used VisTrails to capture provenance traces according to PROV, using a more generic model (Missier et al., 2013). Systems like Pegasus or Kepler have been integrated with provenance management stores like PASOA (Miles et al., 2007) or Prov Manager (Marinho et al., 2012) to generically store provenance traces according to OPM and PROV respectively. To date, there is no standard extension of OPM or PROV for representing scientific workflow execution traces. Finally, other workflow systems like Galaxy, GenePattern or the LONI Pipeline create logs of workflow executions following their own conventions.

2.1.3.3 Linking workflow templates, instances and executions

The models shown in the previous sections tackle the representation of workflow templates, instances and executions separately. However, these structures depend on each other. For example, consider a scientist who wants to create a scientific workflow for the experiment he has been working on. The first step would be to complete a sketch of the workflow, which would then be refined into a workflow template. The workflow

16 http://www.w3.org/2005/Incubator/prov/wiki/W3C_Provenance_Incubator_Group_Wiki
17 http://www.w3.org/2011/prov/wiki/Main_Page
18 https://dvcs.w3.org/hg/prov/raw-file/tip/namespace/landing-page.html
19 http://dev.mygrid.org.uk/wiki/display/tav250/Provenance+export+to+OPM+and+Janus
20 http://www.opmw.org/ontology/

template can then be implemented with specific algorithms using the input datasets selected for the experiment, creating one or several workflow instances. Then, a workflow instance may be executed one or several times, producing workflow execution traces that may or may not be successful. The relationships among workflow templates, instances and executions are often represented by different models with ontologies21. An ontology is defined as a “formal specification of a shared conceptualization”, where conceptualization refers to “an abstract model of some phenomenon in the world by having identified the relevant concepts of that phenomenon”, formal means that “it should be machine-readable” and shared means that “the ontology captures consensual knowledge” (Studer et al., 1998). Ontologies are usually implemented in RDFS (McBride et al., 2014) and OWL (McGuinness and van Harmelen, 2004), which are represented using the Resource Description Framework data model (RDF) (Klyne et al., 2004). RDF is a W3C recommendation that specifies the relationships between resources using a labeled graph of RDF statements. Each RDF statement consists of a triple with a subject node and an object node, connected by a predicate. Figure 2.8 shows an example, in which a file (File1, subject) is linked to its creator (Bob, object) via the “createdBy” predicate.

Figure 2.8: RDF example: a file is described with its creator.
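The triple of Figure 2.8 can also be stated programmatically; the following sketch uses the rdflib Python library, with an illustrative http://example.org/ namespace:

```python
# Building the RDF statement of Figure 2.8: (File1, createdBy, Bob).
# The namespace and resource names are illustrative assumptions.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")

g = Graph()
g.add((EX.File1, EX.createdBy, EX.Bob))  # subject, predicate, object

print(g.serialize(format="turtle"))
```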

Since graphs are a common representation for scientific workflows, several models have adopted RDF to represent their main concepts (Belhajjame et al., 2015; Cerezo, 2013; Missier et al., 2013). In addition, RDF provides a machine-readable representation of the workflow, making it easier to process. During the development of this thesis, several models have been proposed to describe the links between scientific workflow templates, instances or executions:

The Research Object ontologies consist of a family of specifications for describing aggregations of resources related to a scientific experiment (Belhajjame et al.,

21 Ontologies are commonly referred to as vocabularies when they are not complex. See http://www.w3.org/standards/semanticweb/ontology

2015). Scientific workflows play an important role in Research Objects, which provide two vocabularies for describing them. The workflow description vocabulary (wfdesc)22 defines workflow instances and templates, and the workflow provenance vocabulary (wfprov)23 defines workflow execution traces and how they link to the instances or templates. The vocabularies extend the PROV standard and support the definition of sub-workflows, but do not distinguish between workflow instances and templates.

ProvONE is a recent PROV extension for capturing the “most important information concerning scientific workflow computational processes”24. ProvONE is based on D-PROV, another PROV extension that aims at representing the retrospective provenance (workflow execution traces) and prospective provenance (workflow instances) of scientific workflows (Missier et al., 2013). The model is similar to the Research Object ontologies, but tries to be more generic by representing dependencies among processes as links between ports or as channels.

Conceptual workflows have been designed to facilitate workflow abstraction and design (Cerezo, 2013). The model proposes a conceptual level (what we refer to as a workflow template) which is then specified to an abstract level (workflow instance). A similar effort has been described in the Wings system, defining generic and concrete templates (Gil et al., 2009), but conceptual workflows are more generic, defining constructs to represent and link loops and conditional branches in workflow templates and instances. In both cases, the representation of workflow execution traces is out of the scope of the model.

P-Plan is designed to capture “the plans that guided the execution of scientific processes”25 (Garijo and Gil, 2012). Hence, it is a simple high-level model extending the PROV standard. P-Plan captures the main dataflow constructs and links them to the workflow execution. Since it is a general-purpose vocabulary, P-Plan has been used for the description of laboratory protocols (Giraldo et al., 2014), the description of the infrastructure of workflow steps26, scientific workflow invocation27 and social science modeling (Markovic et al., 2014).

22 http://purl.org/wf4ever/wfdesc
23 http://purl.org/wf4ever/wfprov
24 http://purl.org/provone
25 http://purl.org/net/p-plan
26 http://purl.org/net/wicus-stack
27 http://purl.org/net/wf-invocation

Other vocabularies aim to be more generic, in order to provide context to the whole experiment that led to the scientific workflow. These vocabularies do not necessarily attempt to describe any execution details, but rather provide a general-purpose description that serves as documentation of the steps of the workflow. They include:

The Investigation, Study, Assay Model (ISA) differentiates between three concepts to describe biomedical research (Rocca et al., 2008). An assay represents the most specific part of an experiment, i.e., measurements performed on subjects. A study contains the experimental steps or sequence of steps, and the investigation contains the overall goals and means used for an experiment. An investigation may have one or more studies, which themselves may consist of one or several assays. For each of these concepts, the model aims to capture metadata such as references, names, protocols or equipment.

The Ontology for Biomedical Investigations (OBI) defines a common framework for representing experimental descriptions and methods in biomedical research (Brinkman et al., 2010). OBI has been widely adopted and extended in the biomedical domain, and although its main focus is on identifying the different types of inputs, outputs, roles and methods used by the community, it also defines basic relationships for specifying the inputs and outputs of a certain step or protocol.

The EXPO Ontology is an ontology for scientific experiments. Although it is aimed at representing the scientific method (hypothesis, evaluation, interpretation of results, etc.), the ontology also has specific steps for modeling scientific tasks like those happening in a workflow, along with their basic metadata (Soldatova and King, 2006).

Table 2.1 summarizes the scope and main features of the different approaches described above.

2.1.3.4 Scientific Workflow Metadata

The final aspect for representing scientific workflows is their metadata. Workflow metadata often captures basic attribution and other aspects that are typical of scientific workflows and their resources, such as the file sizes of the experiments, restrictions and descriptions of the workflow steps, license, start and end time of the workflow execution, whether an execution was successful or not, etc.

Table 2.1: Summary of the different approaches for representing WT, WI, WET, their metadata and their connections. The asterisk (*) in the WT column indicates that the model makes no distinction when representing WT and WI.

Approach               Purpose                  Control-based  WT          WI          WET         Metadata
                                                constructs     constructs  constructs  constructs  constructs
Research Object        Scientific workflows     No             Yes*        Yes         Yes         No
ProvONE                Scientific workflows     No             Yes*        Yes         Yes         No
Conceptual Workflows   Scientific workflows     Yes            Yes         Yes         No          No
P-Plan                 Scientific processes     No             Yes*        Yes         Yes         No
ISA                    Scientific experiments   No             Yes         No          No          Yes
OBI                    Scientific experiments   No             Yes         No          No          Yes
EXPO                   Scientific experiments   No             Yes         No          No          Yes

As we show in Table 2.1, very few models provide descriptive metadata for workflow templates, instances or executions. Specifically, as models become more generic, they contain fewer common metadata properties. When defining metadata, the models often reuse existing vocabularies instead of defining new terms. Some of the most popular vocabularies reused for this purpose are the Dublin Core vocabulary28, aimed at describing basic attribution and licensing; the SWAN-PAV ontology (Ciccarese et al., 2013), defined to describe detailed attribution, authoring, contributions and versioning of web resources; the FOAF vocabulary29, which describes the agents that participated in the workflow; and the W3C PROV vocabulary, for defining basic attribution and responsibility among the participating agents. A short sketch of how these vocabularies are typically combined is shown below.

28 http://dublincore.org/documents/dcmi-terms/
29 http://xmlns.com/foaf/spec/
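As a hedged illustration of how these vocabularies are typically combined, the following sketch annotates a hypothetical workflow URI with Dublin Core, FOAF and PROV terms using rdflib:

```python
# Reusing existing vocabularies for workflow metadata, as the models above do.
# The workflow URI, title, license and author are illustrative assumptions.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, FOAF, PROV

EX = Namespace("http://example.org/")
g = Graph()

wf = URIRef("http://example.org/workflows/wf1")
g.add((wf, DCTERMS.title, Literal("Text clustering workflow")))
g.add((wf, DCTERMS.license, URIRef("http://creativecommons.org/licenses/by/4.0/")))
g.add((wf, PROV.wasAttributedTo, EX.Bob))   # basic responsibility (PROV)
g.add((EX.Bob, FOAF.name, Literal("Bob")))  # agent description (FOAF)

print(g.serialize(format="turtle"))
```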

2.1.4 Scientific Workflow Publication

The ability to share workflows among scientists is critical for properly repurposing and reusing existing workflows, as well as a means to provide attribution and feedback to other scientists for their work. In this section we review the current approaches for making a scientific workflow public.

2.1.4.1 Data Repositories

The simplest option for sharing scientific workflows is through data repositories and digital archives. In fact, general-purpose data repositories are gaining popularity for sharing the resources associated with a paper: input datasets, specific results, etc. In these repositories, workflows (and optionally, their associated resources) are compressed into a single file and stored or archived. The advantage of this approach is that the stored file becomes citable, as data repositories often provide a Digital Object Identifier (DOI)30 for referring to it. The downside is that workflow bundles can be created in very heterogeneous ways, and without concrete guidelines users may omit information that is important for proper workflow reusability. Some popular general data repositories are FigShare31 and Dryad32.

2.1.4.2 Workflow Repositories

In order to facilitate workflow reuse and to provide usage examples, some workflow systems have created repositories for publishing workflow templates and instances. Galaxy has a catalog of public workflows33 that can be downloaded and imported into one's workspace. VisTrails can push workflow specifications to CrowdLabs (Mates et al., 2011) in order to further document them. The LONI Pipeline also has a curated catalog of public workflows34 and components. The most well known repository of workflows is probably myExperiment (Goderis et al., 2009), initially developed to store Taverna workflows, but currently containing more than 2000 workflows from systems like Kepler, the LONI Pipeline or RapidMiner35.

30 http://www.doi.org/
31 http://figshare.com
32 http://datadryad.org/
33 https://usegalaxy.org/workflow/list_published
34 http://pipeline.loni.usc.edu/explore/library-navigator/
35 https://rapidminer.com/

Most public repositories store only workflow templates, although myExperiment has started moving towards capturing the context of experiments with Research Objects (called “packs”). Packs are collections of aggregated resources relevant to a particular experiment (e.g., executions, conclusions, hypotheses, diagrams, etc.). Workflows are usually uploaded to these repositories through their workflow management systems or by manual submission through their publicly available APIs.

2.1.4.3 Science Gateways

Science gateways are applications, libraries of components, workflows and tools that are shared and integrated through a portal. Many workflow systems choose this option for sharing workflows, as it can usually be combined with domain-specific resources like models and software libraries. Examples of science gateways can be seen in (Filgueira et al., 2014) for the volcanology domain, in (Danovaro et al., 2014) for hydro-meteorology research, and in the LONI Pipeline36 for the neuro-image analysis domain. The main difference between a science gateway and a workflow repository is that the former may provide the infrastructure to execute some of the shared components through community portals.

2.1.4.4 Virtual Machines and Web Services

Finally, another solution adopted by scientists is to encapsulate their workflows as part of virtual machines or web services that can be downloaded (or invoked) and then executed. For example, the approach in (Chirigati et al., 2013) tracks the provenance of experiments and then exposes the results and workflow specifications within a virtual machine. The advantage of this approach is that anyone can easily reproduce the results claimed by the scientist, since all the software dependencies are included as part of the virtual machine. However, there are two main disadvantages. The first is that the workflow becomes a black box, and hence very difficult to repurpose for other goals. The second is that this approach cannot always be adopted, specifically when the datasets used or produced by the experiment are large in size, or when the software uses non-open

36 http://pipeline.loni.usc.edu/explore/library-navigator/

source licenses, etc. Workflow reproducibility is a whole area of research in itself, and is out of the scope of this thesis.

2.2 Workflow Abstraction

There are several approaches for creating abstractions from scientific workflows (i.e., creating generalizations of steps based on some of their common features). As we have shown in the previous section, one possible way is based on the different stages of the workflow: workflow templates are more abstract than workflow instances, which themselves may be seen as representatives of a group of executions. Another possibility is based on the granularity at which each stage can be presented to the user. For example, a workflow execution trace may be partially simplified to facilitate understanding. In this section we describe the most common types of workflow abstractions based on their granularity, and give an overview of research on detecting common workflow patterns.

2.2.1 Types of Abstractions in Scientific Workflows

Workflow templates and instances represent plans executed by a workflow engine. Hence, some of the abstractions defined for planning tasks can also be applied to scientific workflows. In this section we briefly define each of them with examples.

2.2.1.1 Skeletal planning

As stated in (Bergmann, 1992), a skeletal plan can be defined as a “sequence of abstract and only partially specified steps which, when specialized to specific executable operations, will solve a given problem in a specific problem context”. In this type of abstraction, the number of steps in the abstract plan is the same as in the concrete plan. A particular case of this abstraction happens when only some of the steps of the plan are abstract, in which case we refer to it as step abstraction. In step abstraction, some steps are abstract classes of steps. These classes may be organized in a taxonomy designed by domain experts, for example when the steps share similar constraints on their conditions or effects. Figure 2.9 shows an example, where a generic workflow on the top is specialized as two different workflows on the bottom of the figure, based on the taxonomy of components depicted on the top right.

Figure 2.9: Skeletal planning abstraction based on a taxonomy. Two different templates (at the bottom) are two possible specializations of the template on the top, based on the taxonomy of components presented on the top right of the figure. Computational steps are represented with rectangles, while inputs, intermediate results and outputs are represented with ovals.

Several examples of this type of abstraction can be seen in scientific workflows. Wings uses abstract templates to specify its workflows (Gil et al., 2009). The notion of “abstract workflows” has also been discussed (Wood et al., 2012) for systems like SADI (Wilkinson et al., 2009). Other workflow systems like Galaxy and the LONI Pipeline define taxonomies of components to help users build their workflows (implicitly using step abstraction). A minimal sketch of the idea is shown below.
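The following sketch (step names and taxonomy contents are illustrative assumptions) shows how an abstract template plus a component taxonomy yields its possible specializations:

```python
# Step abstraction: specialize each abstract step type of a template using a
# hand-built taxonomy of implementations, as in Figure 2.9.
from itertools import product

taxonomy = {
    "Stem": ["LovinsStemmer", "PorterStemmer"],
    "Sort": ["Quicksort", "Mergesort"],
}

abstract_template = ["Stem", "Sort"]

# Each combination of implementations is one possible concrete template
for combo in product(*(taxonomy[step] for step in abstract_template)):
    print(list(combo))
# ['LovinsStemmer', 'Quicksort'], ['LovinsStemmer', 'Mergesort'], ...
```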

2.2.1.2 Predicate abstraction

In this type of abstraction, entire predicates are dropped at a given abstraction level (Graf and Saidi, 1997). The intuition is that if a predicate requires fewer steps to be accomplished, it can be placed at a lower abstraction level, so that adding actions to accomplish it will cause minimal disruption to the plan at higher levels. Ideally, the abstraction layers are designed to preserve downward monotonicity, which implies that any steps and orderings at higher layers of abstraction are unchanged as the plan is elaborated at lower abstraction levels. In workflows, this would be akin to dropping constraints or even inputs of steps at higher levels of abstraction. An example can be seen in Figure 2.10, where the workflow on the right is abstracted as the workflow on the left, dropping the optional inputs Dictionary (used to train the stemmer component) and Clusters (which sets the number of clusters for the clustering algorithm). A minimal sketch follows the figure.

Figure 2.10: Predicate abstraction example: the workflow on the left is an abstraction of the workflow on the right. Two predicates (inputs Dictionary and Clusters) are omitted in creating the abstraction.
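A minimal sketch of the idea, assuming a simple dictionary-based step description of our own devising, is the following:

```python
# Predicate abstraction on a workflow step: optional inputs (predicates) are
# dropped at the higher abstraction level, as in Figure 2.10. The step format
# is an illustrative assumption.
step = {
    "name": "Stemmer",
    "inputs": [
        {"name": "Dataset", "optional": False},
        {"name": "Dictionary", "optional": True},  # dropped when abstracting
    ],
}

def abstract_step(step):
    """Keep only the required inputs of a step."""
    return {
        "name": step["name"],
        "inputs": [i for i in step["inputs"] if not i["optional"]],
    }

print(abstract_step(step))
# {'name': 'Stemmer', 'inputs': [{'name': 'Dataset', 'optional': False}]}
```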

2.2.1.3 Macro abstraction

These are abstractions where several computation steps are compressed together as one step in the abstract plan. The sub-steps do not disappear from the plan; they are hidden from the user in the abstract plan and can be shown inside the abstracted step. Figure 2.11 shows an example of a workflow in the LONI Pipeline, where two steps (KL MI deform field and Interp Average pairs of 3) are abstracted as a single step in the workflow on the left. Macro abstractions are a common feature in workflow systems, as they help simplify workflows for users.

Figure 2.11: Macro abstraction example: two steps in the workflow on the right are simplified as a single step in the workflow on the left.

Macro abstractions may be created by users as sub-workflows (e.g., in systems like the LONI Pipeline and Taverna), or via algorithms that automatically mine the behavioral profiles of groups of workflow steps (Smirnov et al., 2012). The sketch below illustrates the basic operation.
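The following sketch (using the networkx library; step names mirror Figure 2.11 but are otherwise illustrative) collapses a group of steps into a single macro step:

```python
# Macro abstraction: hide a group of workflow steps behind one macro step.
# Edges crossing the group boundary are re-attached to the macro node; edges
# internal to the group are hidden in the abstract view.
import networkx as nx

def collapse_steps(g, group, macro_name):
    abstract = nx.DiGraph()
    for u, v in g.edges():
        u2 = macro_name if u in group else u
        v2 = macro_name if v in group else v
        if u2 != v2:                    # drop edges internal to the group
            abstract.add_edge(u2, v2)
    return abstract

wf = nx.DiGraph([
    ("AIR_Align", "KL_MI_deform_field"),
    ("KL_MI_deform_field", "Interp_Average_pairs_of_3"),
    ("Interp_Average_pairs_of_3", "Output"),
])
abstract = collapse_steps(wf,
                          {"KL_MI_deform_field", "Interp_Average_pairs_of_3"},
                          "Deform_and_Average")
print(sorted(abstract.edges()))
# [('AIR_Align', 'Deform_and_Average'), ('Deform_and_Average', 'Output')]
```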

2.2.1.4 Layered abstraction

This type of abstraction represents conceptual levels of detail that may have a very loose correspondence with one another. These abstractions can be seen as employing different

lenses to view the workflow, depending on the perspective that users are interested in. This kind of abstraction is very useful for explaining a workflow for different purposes. For example, a scientist may be interested in seeing the main steps of the flow, while a developer wants to see exactly the preprocessing and post-processing done in each step. An example can be seen in Figure 2.12, where a workflow instance (on the left) is enacted by a workflow execution engine, creating a new workflow on the right. The enacted workflow is different from the workflow instance initially submitted, as it executes the first step in a distributed infrastructure. Therefore, the FilterWords step is separated into two parallel processes (FilterWords1 and FilterWords2). A different workflow engine could have created a completely different instantiation from the workflow instance on the left.

Figure 2.12: Layered abstraction example: the workflow on the left is an abstraction of the workflow on the right, which results when a workflow execution engine enacts it. Therefore the user's view of the workflow (left) is different from the system's view of the workflow (right).

Some approaches have explored this type of abstraction. In ZOOM*Userviews

(Biton et al., 2007), the authors proposed to exploit workflow execution traces to show different views to users according to their preferences. Conceptual workflows (Cerezo et al., 2013) define a mapping to convert workflows defined at a conceptual level into an intermediate representation that can itself be mapped to several target workflow systems. Workflow systems such as Pegasus (Deelman et al., 2004) or ASKALON (Wieczorek et al., 2005) often make use of this abstraction when mapping a workflow instance to their distributed infrastructure.

2.2.2 Workflow Patterns

In software development, a pattern is defined as “the abstraction from a concrete form which keeps recurring in specific non-arbitrary contexts” (Riehle and Züllighoven, 1996). This type of abstraction does not belong to any of the categories described in the previous section, but has been studied extensively in related work. Repositories of workflow patterns have been developed based on different properties of workflows. The most popular approach is based on the workflow structure and the constructs that are possible in different languages37 (van der Aalst et al., 2003a). Other workflow patterns have been studied in terms of workflow resources (creation, distribution, allocation, etc.) (Russell et al., 2004b) and workflow data (e.g., visibility, interaction, transference, integration) (Russell et al., 2004a), including generic data-intensive usage patterns (Corcho, 2013). Additional classifications are based on the intrinsic properties of the workflows (size, structure, branching factor, resource usage, etc.) (Ramakrishnan and Plale, 2010) (Ostermann et al., 2008) and their environmental characteristics (makespan, speedup, success rate, etc.) (Ostermann et al., 2008). The viability of some of these workflow patterns for scientific workflows has also been studied and discussed (Yildiz et al., 2009). Workflow patterns represent common practices by the community, but do not always imply good practices for workflow design. For example, in (Cohen-Boulakia et al., 2014) the authors outline anti-patterns in scientific workflows, which basically consist of redundancy and structural conflicts. Similarly, in (Velasco-Elizondo et al., 2013) the authors use architectural abstractions (i.e., workflow patterns) to deal with data mismatch issues between workflow steps (by suggesting the appropriate converters).

37 http://www.workflowpatterns.com/

Generic workflow patterns based on the functionality of workflow steps have only been explored superficially. In (Wassink et al., 2009), the authors perform an automated analysis of workflow scripts from the life sciences, in order to obtain the frequency of different technical ways of instantiating workflow steps (e.g., service invocations, local scientist-developed scripting, local ready-made scripts, etc.). The authors also refine the category of local ready-made scripts with activity categories depending on their functionality (e.g., data access or data transformation). (Starlinger et al., 2012) extends the categories defined in (Wassink et al., 2009), identifying sub-categories at the workflow step level by analyzing 898 Taverna workflows in myExperiment. Problem Solving Methods (PSMs) are another type of generic pattern that explores the functionality of the workflow steps in them. PSMs describe strategies to achieve the goal of a task in an implementation- and domain-independent manner (Gómez-Pérez and Benjamins, 1999). PSMs are generic recipes, and some libraries have adapted them to model the common processes in scientific domains (Gómez-Pérez et al., 2010). Other efforts aim at developing patterns based on workflow step functionality. Their goal is to categorize and group common functionality in a specific domain, and, in this regard, they may be used for generalizing workflows. For example, the Software Ontology (SWO)38 models some of the typical operations in research regarding software functionality, like data annotation, data visualization, image compression, etc. Other systems like Galaxy and the LONI Pipeline follow a similar approach. They have defined generic categories under which they group workflow libraries, such as “get data”, “statistics”, “structural analysis”, etc.

2.3 Workflow Reuse

Scientific contributions are often built on previously existing work, and this also applies to scientists designing and creating scientific workflows. A proper workflow representation and abstraction enables better understanding of workflows created by other researchers, but it is difficult for scientists to explore whole datasets of workflows by themselves. In order to facilitate this task, several approaches have been developed through the years for mining workflows for different purposes, ranging from workflow design to workflow discovery. In this section we first describe previous analyses of scientific

38 http://sourceforge.net/projects/theswo/

workflow reuse, and then we summarize existing techniques that help users reuse workflows.

2.3.1 Measuring Workflow Reuse

Scientific workflows can be considered software artifacts that connect software components (workflow steps) together. The barriers to software reuse have been analyzed in the literature (Boldyreff, 1989). Software engineering technologies have been compared and contrasted in terms of abstraction, selection, specialization and integration (Krueger, 1992), which, as we have shown in this chapter, are common problems in scientific workflows as well. Focusing on the scientific domain, the incentives for software development in terms of reuse, collaboration and productivity (e.g., getting credit for software) have been discussed in (Howison and Herbsleb, 2011). Regarding scientific workflows, some requirements for their reuse are defined in (Goderis, 2008). The main requirements consist of having a community open to workflow reuse, availability of reusable workflows (e.g., through a publishing and sharing infrastructure that manages interoperable workflows) and efficient workflow discovery. The benefits of workflow reuse have been discussed in (Goderis et al., 2005), but (to our knowledge) there are not many studies on how users perceive workflow reuse. One of the few user surveys in the state of the art is (Goderis et al., 2005), which consisted of a questionnaire answered by 24 scientists from different institutions. According to the survey, there are three main types of workflow reuse: “for one's own purposes”, “from a group of collaborators” and “from third parties”. The third type of workflow reuse was not reported by any of the participants in the survey. In (Sethi et al., 2012), the authors further defend the significance of reusing workflow fragments as well, by showing how certain parts of text analytics and image analysis workflows may be used for video activity recognition (and vice versa). Automatic analyses of workflow repositories have also attempted to classify and quantify workflow reuse, with a special focus on the myExperiment repository and Taverna workflows. In (Tan et al., 2010), the authors use 280 Taverna workflows and 118 services of the BioCatalogue portal to find the connections among them (shared invocations and components). Furthermore, they apply social network analysis techniques to derive the centrality and betweenness of the nodes in the network. Unfortunately, the results show that “services are currently reused in an ad hoc style instead of a

federated manner”. The work presented in (Wassink et al., 2009) also performs automatic analyses of Taverna workflows available in myExperiment, but with a corpus of 415 workflows. In this case, the authors divide the corpus into different categories of components (beanshells, scripts, etc.), and analyze their reuse. The work described in (Littauer et al., 2012) builds on (Wassink et al., 2009) to further analyze other characteristics of workflows (e.g., versioning, complexity, downloads) and proposes good practices for workflow reuse. However, the most complete analysis of myExperiment is (Starlinger et al., 2012), which details how workflows, sub-workflows defined by users and steps (divided into different categories) are reused. The corpus consists of 898 workflows, and reuse is differentiated according to the authors who contributed to the creation of a particular workflow. Reuse is common, and may happen in three main ways: a workflow is reused as a new workflow (duplication), a sub-workflow is reused as a sub-workflow in another workflow, and workflows are reused as sub-workflows (or vice versa). Although current approaches have analyzed reuse in a significant number of workflows, the majority of the corpora have been designed in the same workflow system (Taverna), and their analyses leave out of scope groups of commonly occurring workflow steps that users have not explicitly annotated as sub-workflows.

2.3.2 Workflow Mining for Reuse

There are several areas where mining approaches have been applied for different purposes, ranging from planning (Yang et al., 2005) (e.g., using previous plans to optimize a future execution) to database query pattern extraction (Lawrynowicz and Potoniec, 2014) (in order to obtain common queries that might help optimize access to the dataset). In this section we give an overview of the efforts related to mining workflows, in particular those that are data-intensive (scientific workflows). These efforts focus on three main goals: mining for the exploration of workflow repositories, mining for recommending or discovering existing workflows, and mining to extract new workflows from existing ones.

2.3.2.1 Exploration of repositories

Repositories of workflows tend to be large, and it is difficult for users to analyze each entry individually. In order to address this issue, there are several efforts that try to

organize repositories to enable better understanding. These efforts are based on machine learning techniques, namely clustering and topic modeling. Clustering techniques are used to group sets of workflows together according to a given similarity metric. The main differences among the existing approaches are the similarity metric between workflows, the clustering technique being used and the number of clusters to which a workflow may belong. In chronological order, (Santos et al., 2008) first proposed two different metrics to define the similarity between two given workflows: the maximum common sub-graph shared among them and the Euclidean distance that separates both workflows when represented as vectors (considering the union of the names of the workflow steps as the features of the vectors). In (Silva et al., 2011) the authors followed the previous approach by considering the structure of the workflow (e.g., connections between input ports and output ports) in the similarity measure. Both efforts use the K-means algorithm (MacQueen, 1967), commonly used for clustering in the information retrieval domain. Finally, in (Montani and Leonardi, 2012) the authors propose to use an edit distance for business workflows, taking into account the temporal similarity of the workflow steps. In this case, the clustering technique is based on the unweighted pair group method with arithmetic mean, which allows for hierarchical clustering. A minimal sketch of the vector-based approach is shown at the end of this subsection.

Topic modeling techniques are common in text mining to determine the probability of a document belonging to a certain topic. A topic is not a single word, but a set of words which have a high probability of appearing together. (Stoyanovich et al., 2010) applies this paradigm to the workflow domain, using 2075 workflows from VisTrails and 861 from Taverna. The “words” in this case are the tags used to describe the workflow and the names of the workflow steps. As a result, a set of topics is derived, and workflows are organized around them. A key difference between clustering techniques and topic modeling techniques is that clustering techniques are deterministic, while topic modeling techniques are probabilistic. This means that, when re-executed, a probabilistic technique may provide similar results, but not necessarily the same ones obtained in a previous execution. The main disadvantage of both techniques is that a manual adjustment of the algorithm is needed in order to find the optimal number of topics or clusters, depending on the similarity threshold set for the experiments.
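The sketch below illustrates the vector-based clustering idea of (Santos et al., 2008) with scikit-learn; the toy corpus, step names and number of clusters are illustrative assumptions:

```python
# Each workflow is represented by the multiset of its step names; K-means then
# groups workflows whose step vectors are close in Euclidean space.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

workflows = {
    "wf1": ["Stem", "Sort", "Plot"],
    "wf2": ["Stem", "Sort", "Cluster"],
    "wf3": ["AlignImages", "AverageImages"],
}

# One "document" per workflow, whose words are its step names
docs = [" ".join(steps) for steps in workflows.values()]
X = CountVectorizer().fit_transform(docs)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for name, label in zip(workflows, labels):
    print(name, "-> cluster", label)
```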

2.3.2.2 Recommendation and discovery of existing workflows

The second goal of workflow mining focuses on suggesting workflows to users. This can be done in two different phases of the workflow life cycle (van der Aalst et al., 2003b): during workflow design, by suggesting the next workflow steps to the user, or during workflow diagnosis, when aiming to discover workflows similar to a particular one. There are several approaches for next-step suggestion while designing workflows, based on previous executions or templates. The most straightforward approach can be seen in (Oliveira et al., 2008), where the authors propose a probabilistic method that computes the relationships between pairs of components and suggests the next most probable component (a minimal sketch of this idea is shown below). The approach proposed in (Koop, 2008) is more refined, using paths in previous workflows to recommend the most likely next step. The suggestion may be either upstream or downstream of the current fragment. Finally, (Leake and Kendall-Morwick, 2008) relies on case-based reasoning approaches to mine provenance traces in order to suggest the next (single) step to users when editing new workflows. This technique uses a similarity metric based on an edit distance combined with a structural comparison of the workflows. Regarding workflow discovery, most approaches aim to deliver workflows similar to a given one. The similarity metric is once again critical, and there has been work demonstrating that structure-based approaches can outperform annotation-based ones (Starlinger et al., 2014a). In (Goderis, 2008), the authors perform several experiments comparing label-based approaches and sub-graph isomorphism techniques, and perform benchmarks for workflow discovery on that basis (Goderis et al., 2009). In (Bergmann and Gil, 2014) the authors focus on structure according to a set of constraints set by the user (e.g., having a specific input or output type). In this regard, this approach is able to add specific requirements to the workflow the user is looking for. Finally, in (Starlinger et al., 2014b), the similarity metric is based on a direct comparison of the workflows by decomposing them into different layers, which provides fast results. It is worth mentioning that some of the similarity metrics used for clustering repositories may also be used for suggesting new workflows.
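A minimal sketch of the pairwise, probability-based suggestion idea (the corpus below is an illustrative assumption) could look as follows:

```python
# Count how often step B directly follows step A in a corpus of workflows,
# then suggest the most frequent successor, in the spirit of
# (Oliveira et al., 2008).
from collections import Counter, defaultdict

corpus = [
    ["Stem", "Sort", "Plot"],
    ["Stem", "Sort", "Cluster"],
    ["Stem", "Sort", "Plot"],
]

successors = defaultdict(Counter)
for steps in corpus:
    for current, nxt in zip(steps, steps[1:]):
        successors[current][nxt] += 1

def suggest_next(step):
    """Return the most frequent direct successor of `step`, if any."""
    counts = successors.get(step)
    return counts.most_common(1)[0][0] if counts else None

print(suggest_next("Sort"))  # 'Plot' (follows Sort in 2 of the 3 workflows)
```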

2.3.2.3 Deriving new workflows from existing corpora

The third goal of workflow mining aims to extract workflows from previous executions or templates. In prior work this has been explored in two different ways: by deriving an abstract workflow from a set of execution logs, or by proposing new workflows from a repository of workflow templates (based on, for example, commonly used structures). Log mining has been explored in both business and scientific workflows. In the business domain, (Herbst and Karagiannis, 1998) proposed to use hidden Markov models to find an approximation of a workflow from a set of execution traces. Other work aims at obtaining workflow patterns through process mining. In fact, the term process mining “refers to methods for distilling a structured process description from a set of real executions” (van der Aalst et al., 2003b). A detailed analysis of different approaches for process mining and its research challenges can be found in (van der Aalst et al., 2003b). Representative approaches can be found in (van der Aalst et al., 2005), where the resulting workflows are represented as Petri Nets, and (Rozinat and van der Aalst, 2006), which introduces a whole framework for process mining. In the scientific workflow domain, (Gómez-Pérez and Corcho, 2008) mines provenance traces to map them against a problem solving method library. The goal of this work is similar to the aforementioned approaches in business workflows, but in this case the authors aim to generalize the provenance and to help understand the traces. Case-based reasoning is often used to find substitute workflows for a given one. In (Müller and Bergmann, 2014) the authors discuss the adaptability of workflows and look for substitutes according to user needs. Another approach presents an algorithm that combines the control flow and dataflow of workflow traces, reasoning to generalize and approximate a given workflow (Yaman et al., 2009). This is part of the POIROT project (Burstein et al., 2009), which creates an architecture of reasoning and learning components for workflows. Finally, (Smirnov et al., 2012) also generalizes workflows to group common sets of components based on existing behaviors in the business domain. During the development of this thesis, graph mining techniques have started to be used to extract common patterns from the myExperiment repository. An overview of the problem, extracting patterns with different metrics, has been proposed in (Kamgnia Wonkap, 2014). 258 Taverna workflows were analyzed in (Diamantini et al., 2012),

as a first approximation using inexact graph mining techniques. More recently, (García-Jiménez and Wilkinson, 2014a) has followed up by enriching 530 workflows with domain-specific knowledge in the biomedical domain. However, the lack of a gold standard to measure precision and recall makes these approaches difficult to validate at the moment. As with the approaches designed for exploring repositories, the main difference between case-based reasoning, graph mining and process mining is that while process mining is based on a network of probabilities, the other two approaches always produce patterns that are found in the input workflow corpus. Figure 2.13 illustrates the difference between the results of a process mining and a graph mining approach, using as input the three workflows depicted on top. The fragments produced by a graph mining technique can be seen on the right, while the network of probabilities extracted from the workflows can be seen on the left. Although both approaches mine patterns, they are used for different purposes. The sketch below contrasts the two outputs on a toy corpus.
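The contrast of Figure 2.13 can be made concrete with a toy corpus, under the simplifying assumption that a “fragment” is a single dataflow edge (real graph mining considers larger sub-graphs):

```python
# Graph-mining view: keep fragments (here, edges) above a support threshold.
# Process-mining view: normalize the same counts into transition probabilities.
# The corpus and threshold are illustrative assumptions.
from collections import Counter

corpus_edges = [
    [("A", "B"), ("B", "C")],
    [("A", "B"), ("B", "D")],
    [("A", "B"), ("B", "C")],
]

edge_counts = Counter(e for wf in corpus_edges for e in wf)

min_support = 2
fragments = {e: c for e, c in edge_counts.items() if c >= min_support}
print("frequent fragments:", fragments)   # {('A','B'): 3, ('B','C'): 2}

totals = Counter()
for (src, _), c in edge_counts.items():
    totals[src] += c
probabilities = {e: c / totals[e[0]] for e, c in edge_counts.items()}
print("transition probabilities:", probabilities)
# ('A','B'): 1.0, ('B','C') ~ 0.67, ('B','D') ~ 0.33
```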

2.4 Summary

In this chapter we have described the most relevant related work in the areas of workflow representation, abstraction and reuse. In each of these areas, we have also introduced the main concepts that will be used throughout the thesis, along with the general limitations of each approach. Most of the described approaches focus on either representation, abstraction or reuse, but none of them tackles the three areas as part of the same problem. Very few models consider linking abstractions and patterns to the templates or execution traces themselves, even among the approaches developed in parallel to this thesis. For example, (Cerezo, 2013) links conceptual workflows to workflow instances, but does not tackle the representation of execution traces. Models like the Research Object ontologies and ProvONE address the representation and linking of templates and executions, but avoid dealing with abstractions of processes. To date, there are barely any approaches that use domain knowledge to extract abstract patterns that may help reuse and explore the workflows of a given repository. There is also a lack of a common, simple model that can represent the basic functionality of workflows from any workflow system for mining purposes, or that represents workflows based on their step functionality.

Figure 2.13: Difference in output: while process mining returns a network of probabilities, a graph mining approach focuses on the possible fragments found in the dataset. In the figure, the three workflows on the top lead to the probability network on the bottom left and to the two fragments on the bottom right. Inputs and outputs of each step have been omitted for simplicity.

Current mining approaches also present some limitations. Topic modeling, clustering and mining techniques may miss potential links between fragments of two dissimilar workflows. Graph mining and process mining techniques extract common workflow patterns, but they currently ignore whether these patterns are useful for reuse and how they can be used to link different workflow fragments together. Additionally, most of the approaches described for scientific workflows have been tested only on Taverna workflows. Further work is needed to analyze how users would interact in other types

of environments, e.g., smaller communities like a research lab, the students of a university or a company. In the next chapter we present our objectives and hypotheses, relating them to some of the limitations that we have identified in the current state of the art.

Chapter 3

Research Objectives

The main goal of the research presented in this thesis is to facilitate workflow understanding and reuse by analyzing and abstracting an existing corpus of scientific workflows. This translates into three lines of work regarding scientific workflows. The first one is workflow abstraction, which refers to the ability of a system to generalize workflows (or workflow steps) by their common functionality. Workflow abstraction is key for simplifying workflows and for finding relations between them.

The second line, workflow understanding, relates to the ability of users to clearly determine the functionality of each of the steps (or groups of steps) of a workflow. Workflow understanding benefits from workflow abstraction, as abstraction helps simplify the workflow at hand, and can be enhanced by attaching properly structured documentation and examples to the target workflow. Workflow understanding is crucial for disseminating one's work within the scientific community.

The third line is workflow reuse, closely related to the previous two. Having a clear understanding of a workflow's functionality and its relationships with other workflows may help users decide whether a given workflow (or any of its sub-parts) can be adapted from a previous experiment or not.

This chapter introduces our main hypothesis (Section 3.1) and its associated research challenges (Section 3.2), and defines our research methodology and our research and technical goals (Section 3.3).

3.1 Research Hypotheses

We define our main hypothesis as follows:

Hypothesis 1 Scientific workflow repositories can be automatically analyzed to extract commonly occurring patterns and abstractions that are useful for workflow developers aiming to reuse existing workflows.

Our hypothesis can be divided into three different sub-parts, which correspond to the workflow understanding, reuse and abstraction aspects respectively. First, by creating a catalog of the most typical domain-independent scientific workflow patterns based on the functionality of workflow steps, we can help users understand workflows better, independently of the workflow system where they were defined (H1.1). Second, by detecting commonly occurring patterns and abstractions automatically, we can find the implicit relationships between the workflows of a repository (H1.2). Furthermore, for this purpose we hypothesize that the use of graph mining techniques (H1.2.1) and the exploitation of domain-specific metadata (H1.2.2) are good approaches for detecting and generalizing (or abstracting) common fragments automatically. Finally, by highlighting those patterns that are potentially useful for users designing workflows, we believe that we can help users reuse workflows (H1.3). A summary of the hypotheses and how they relate to workflow abstraction, understanding and reuse can be found in Table 3.1.

3.2 Open Research Challenges

Our hypotheses can be related to several research challenges. First, in order to determine a common format to deal with workflow templates and provenance, we need to address the heterogeneity of scientific workflow representations. Second, in order to find common reusable patterns, we must handle the different levels of workflow abstraction and the difficulties of workflow reuse in a given corpus. Finally, by detecting common reusable patterns automatically, we tackle the lack of support for workflow annotation. Each of these research challenges is further described below, and tackled throughout the thesis.

Table 3.1: Hypotheses and their respective addressed workflow research areas.

Hypothesis                                                       Workflow Research Area
H1: Scientific workflow repositories can be automatically        Abstraction, understanding, reuse
analyzed to extract commonly occurring patterns and
abstractions that are useful for workflow developers
aiming to reuse existing workflows
H1.1: It is possible to define a catalog of common domain        Abstraction, understanding
independent patterns based on the common functionality
of workflow steps
H1.2: It is possible to detect commonly occurring                Abstraction, reuse
patterns and abstractions automatically
H1.3: Commonly occurring patterns are potentially useful         Reuse
for users designing workflows

3.2.1 Workflow Representation Heterogeneity

As discussed in Chapter 2, a plethora of scientific workflow systems have been created to design, reuse, explore and execute scientific workflows for different scientific communities. Each community often has its own requirements for its experiments, which has led to the proliferation of heterogeneous formats for representing scientific workflows (e.g., for stream processing or distributed environments), specific built-in catalog support, cloud support, domain-specific metadata capture, provenance exploration, etc. Despite some efforts to come up with a common workflow language, there is currently no standard model for representing scientific workflows and their metadata (RCRepresent-1). Furthermore, besides uploading resources individually to a public repository, there are no general guidelines or methodologies for authors on how to expose a corpus of scientific workflows and their associated contents on the Web (RCRepresent-2).

3.2.2 Inadequate Level of Workflow Abstraction

Workflows may contain several scientifically-significant analysis steps, combined with other data preparation or result delivery steps (e.g., filtering, cleaning, etc.) that are auxiliary and not central to the science data analysis. In fact, studies have shown

that over half of the steps of workflows consist of data preparation steps, which increase their complexity and obfuscate their main functionality (Garijo et al., 2012). Therefore, in many cases it is difficult to determine the main significant processing steps of the workflow (RCAbstract-1), even when a textual description of the workflow functionality exists in a publication or other documentation. In this regard, patterns have been defined in other work to capture typical control flow properties of workflows (van der Aalst et al., 2003a). However, there are no catalogs of the typical abstractions that can be found in scientific workflows based on their basic step functionality (RCAbstract-2).

3.2.3 Difficulties of Workflow Reuse

One of the key aspects of workflows is their potential shareability. Scientists often include workflows as sub-workflows (Starlinger et al., 2012) when the workflow system supports them. Several repositories of workflows, like myExperiment, the LONI Pipeline, CrowdLabs or Galaxy, store templates, metadata and sometimes even executions of past experiments. However, it is difficult to determine the relation between different workflows regarding their functionality (RCReuse-1), and unless commonly reused workflow fragments are explicitly exposed by users as sub-workflows, it is not easy to detect which workflows or workflow fragments are potentially useful for reuse (RCReuse-2).

3.2.4 Lack of Support for Workflow Annotation

Workflow authors have to manually describe the functionality of their workflows, their relationships to other workflows, their metadata and their inputs and outputs. There are currently no approaches for facilitating these annotations in a semi-automatic way (RCAnnot-1), even for the typical operations in workflows (e.g., formatting and merging inputs). Therefore, the textual annotations on workflows (besides basic metadata like creator, date and main purpose) are scarce, and no standard is followed for describing workflow functionality and metadata. A summary of the challenges tackled in this work and how they map to the hypotheses is presented in Table 3.2.

Table 3.2: Open research challenges and their related hypotheses.

Research Challenge                                               Hypotheses
RCRepresent-1: There is no standard model for representing       H1.1, H1.2
scientific workflows and their metadata
RCRepresent-2: There are no standard methodologies for           H1.1
publishing a corpus of scientific workflows and their
associated resources on the Web
RCAbstract-1: It is difficult to determine which steps           H1.1
perform the main significant processing
in a scientific workflow
RCAbstract-2: There are no catalogs of the typical               H1.1
abstractions that can be found in scientific workflows
based on their basic step functionality
RCReuse-1: It is difficult to determine the relationship         H1.2
between different workflows regarding their functionality
RCReuse-2: It is not easy to detect which workflows              H1.3
or workflow fragments are potentially useful for reuse
RCAnnot-1: There are currently no approaches for                 H1.2
assisting scientists with workflow annotation

3.3 Research Methodology

The work presented in this thesis has been performed in an exploratory manner, following a layered approach motivated by the hypotheses. In each layer, the work has been validated experimentally, helping to refine the results obtained in previous layers. Figure 3.1 shows a roadmap of the thesis work, where each layer represented on the left corresponds to a general problem. In the center and on the right of the figure we specify our approach to tackle each of the layers and the evaluation followed to validate such approach. From top left to bottom right, the layers stand for the models we need for providing workflow descriptions reusing existing standards, validated through competency questions (Gruninger and Fox, 1994); the workflow abstraction effort for defining a catalog of typical abstractions in workflows (validated by comparing our proposal with workflow corpora from different systems); and the mechanisms developed to automatically detect some of these abstractions in different workflow corpora (evaluated by defining concrete metrics and through user feedback). We describe each of the layers next, indicating the technical objectives (i.e., objectives related to the development of software or the analysis of existing solutions for a given challenge) and research objectives (i.e., those objectives aiming at finding new solutions for existing challenges) for each.

Figure 3.1: Roadmap of the thesis work, organized by the different problems, the approach followed to tackle each one and its proposed evaluation.

Workflow Representation and Publication: In the first layer, depicted as the top layer in Figure 3.1, we identify the requirements and develop the models to represent workflow templates and their associated executions (RO1). The development is guided by the reuse of existing standards like the W3C Provenance Model (PROV) and commonly used vocabularies like the Open Provenance Model. The developed models are then used to adapt an existing methodology for publishing content on the Web to the scientific workflow domain (RO2). As a result, we produce a corpus of workflows that is accessible and easy to explore and exploit. The corpus also allows validating the requirements established for the models, helping to refine them accordingly.

Workflow Abstraction and Reuse: In our second layer we first perform a manual analysis on the published workflow corpus, expanding it with the workflows available in other online repositories like myExperiment, CrowdLabs and Galaxy. The analyses aim at determining whether workflows in different domains and workflow systems share common functionality among their steps, producing as an outcome a catalog of common workflow abstractions (RO3). The catalog is validated and refined with each new analyzed workflow, until no further abstractions can be obtained. We then analyze the current practices of users regarding the reuse of workflows and their fragments, and we relate the results to the abstractions obtained in the catalog (RO4). This allows us to focus on the automatic detection of those abstractions that are relevant for workflow reuse.

Automatic Detection and Annotation of Workflow Abstractions: In our bottom layer we bridge the gap between a corpus of workflows and the catalog of abstractions defined in our second layer. We apply graph mining techniques to detect potential abstractions automatically (TO1), define filtering techniques for refining the obtained results (RO5) and propose metrics to validate our results against a corpus of annotated workflows (RO6). As a result, we develop a system capable of finding common patterns automatically (TO2), and we validate it with further feedback from users. Additionally, we propose a model for describing workflow fragments and relating them to the workflows where they were found (RO7). Table 3.3 summarizes the research objectives (ROs), technical objectives (TOs) and the research challenges from Table 3.2 that they tackle.

Table 3.3: Research and technical objectives and their related challenges.

Research Objective | Research Challenge
RO1: Define semantic models that represent workflow templates and executions, linking them together and capturing their basic metadata. The models should be based on existing standards | RCRepresent-1
RO2: Define or adapt a methodology for publishing workflows and their resources on the Web | RCRepresent-2
RO3: Define a catalog of common domain-independent abstractions in scientific workflows | RCAbstract-1, RCAbstract-2
RO4: Analyze the current practices of users in terms of workflow reuse, and relate the results to the catalog | RCAbstract-1, RCAbstract-2
RO5: Define methods to filter non-relevant fragment candidates | RCReuse-1, RCReuse-2
RO6: Define metrics for assessing the usability of a workflow fragment | RCReuse-1, RCReuse-2
RO7: Define a model for describing workflow fragments and relate them to the workflows where they were found | RCAnnot-1
TO1: Characterize the different types of existing graph mining algorithms, their features, their limitations and available implementations | RCReuse-2
TO2: Develop a framework for applying existing graph mining approaches on workflows, along with the means to refine and filter the results provided by the different algorithms | RCReuse-1, RCReuse-2, RCAnnot-1

Chapter 4

Scientific Workflow Representation and Publication

This chapter presents the Open Provenance Model for Workflows1 (OPMW), our model for describing scientific workflows by capturing their dataflow and dependencies, together with our approach for their publication. In doing so, we reuse existing standards and methodologies, aiming to make our results reusable and interoperable with other alternative approaches.

4.1 Scientific Workflow Model

We represent scientific workflows as labeled directed acyclic graphs (LDAGs). The nodes of the graph represent the steps, inputs, outputs and intermediate results, and the edges represent the usage and generation dependencies among them. This choice of model is motivated by the number of scientific workflow systems using this or a very similar notation (as described in Chapter 2, some examples are Wings, the LONI Pipeline, Taverna, Galaxy, VisTrails, etc.). In addition, a graph-based model makes it possible to extend Semantic Web standards and reuse existing technologies to model and expose workflow information. In order to create a simple model, we focus on capturing data-intensive workflow representations from a bottom-up perspective, i.e., the simplest common constructs shared among all workflow systems. Therefore,

1http://www.opmw.org/ontology/

control constructs like loops and optional branches have been left out of the scope of the model. Our model requirements, which have guided our development process, can be organized under three main categories:

• Workflow template representation requirements, which tackle the modeling of the workflow plan and its metadata. This includes the workflow template and workflow instance stages, introduced in Section 2.1.2.

• Workflow provenance representation requirements, which refer to the modeling of the execution of the workflow, its inputs, and its intermediate and final results. Provenance is crucial to determine the chain of events that led to a result of a scientific workflow, and the sources that influenced it.

• Workflow attribution representation requirements, which capture who created and collaborated in the creation of the workflow, its execution and documentation. Attribution is key for acknowledging the authors appropriately, because it allows them to get credit for the work they have done. For example, the scientist responsible for designing the workflow (or a fragment of the workflow) may be different from the scientist responsible for its execution in a certain experiment.

The rest of the section describes how we have extended or reused existing vocabularies and standards to create OPMW and address each of the requirement categories described above. An Ontology Requirement Specification Document (ORSD) (Suárez-Figueroa, 2010) has been created for each of our proposed vocabularies, containing the competency questions (Gruninger and Fox, 1994) and the terms used to address them. The ORSD can be seen in Annex A.

4.1.1 Representing the Provenance of Workflow Executions: The Open Provenance Model and W3C PROV

We use the W3C PROV standard and OPM (mentioned in Section 2.1.3.2) as foundational vocabularies and extend them to represent workflow executions as provenance records. In this section we briefly describe the core concepts of OPM and PROV that are relevant to our work, along with their main similarities.

4.1.1.1 The Open Provenance Model

OPM models the resources for which we want to obtain the provenance (e.g., an input of a workflow, an intermediate result, an output, etc.) as artifacts, which represent immutable pieces of state. The steps that use and produce artifacts are known as processes (an action or series of actions performed on artifacts), and the entities that control those processes are agents. The relationships between artifacts, processes and agents are modeled in a provenance graph with five main causal edges: used (a process used some artifact), wasControlledBy (an agent controlled some process), wasGeneratedBy (a process generated an artifact), wasDerivedFrom (an artifact was derived from another artifact) and wasTriggeredBy (a process was triggered by another process). OPM also introduces the concept of roles to assign the type of activity that artifacts, processes or agents played when interacting with one another, and the notions of accounts and provenance graphs. An account represents a particular view on the provenance of an artifact based on what was executed. A provenance graph groups sets of related OPM assertions. OPM does not specify any concept for the modeling of plans, so it can be used to describe workflow executions but not workflow instances or workflow templates. OPM is available as two different ontologies that are built on top of each other. One is the OPM Vocabulary (OPMV)2, a lightweight RDF vocabulary implementation of the OPM model that only has a subset of the concepts in OPM but facilitates modeling and query formulation. Figure 4.1 (from the online specification3) shows an overview of the main OPMV concepts, i.e., agents, processes and artifacts and their joint relationships. The other ontology is the OPM Ontology (OPMO)4, which covers the full functionality of the OPM model and can be used to represent OPM concepts that are not in OPMV, such as Account or Role. OPMO also includes the possibility of adding metadata to the relationships themselves (e.g., the time of usage of an artifact in an activity) through an n-ary relationship pattern (Noy et al., 2006).

2http://purl.org/net/opmv/ns 3http://open-biomed.sourceforge.net/opmv/img/opmv_main_classes_properties_3.png 4http://openprovenance.org/model/opmo

Figure 4.1: OPMV overview, extracted from its specification.
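To make these notions concrete, the following is a minimal sketch (not taken from the thesis) of an OPM trace expressed with OPMV, built with the rdflib Python library; the resource names under the ex namespace are hypothetical.

```python
from rdflib import Graph, Namespace, RDF

OPMV = Namespace("http://purl.org/net/opmv/ns#")
EX = Namespace("http://example.org/trace1#")

g = Graph()
g.bind("opmv", OPMV)

# A process used an input artifact and generated an output artifact,
# under the control of an agent.
g.add((EX.sortProcess, RDF.type, OPMV.Process))
g.add((EX.inputData, RDF.type, OPMV.Artifact))
g.add((EX.sortedData, RDF.type, OPMV.Artifact))
g.add((EX.scientist, RDF.type, OPMV.Agent))
g.add((EX.sortProcess, OPMV.used, EX.inputData))
g.add((EX.sortedData, OPMV.wasGeneratedBy, EX.sortProcess))
g.add((EX.sortProcess, OPMV.wasControlledBy, EX.scientist))

print(g.serialize(format="turtle"))
```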

4.1.1.2 The W3C PROV Standard

The PROV model (with prefix prov) is heavily influenced by OPM. PROV models the resources as entities (which can be mutable or immutable), the steps using and generating entities as activities, and the individuals responsible for those activities as agents. As shown in Figure 4.2 (from the online specification5), the relationships are also modeled in a provenance graph, with seven main types of edges: used (an activity used some entity), wasAssociatedWith (an agent participated in some activity), wasGeneratedBy (an activity generated an entity), wasDerivedFrom (an entity was derived from another entity), wasAttributedTo (an entity was attributed to an agent), actedOnBehalfOf (an agent acted on behalf of another agent) and wasInformedBy (an activity used the entity produced by another activity). PROV also keeps the notion of roles to describe how entities, activities and agents behaved in a particular event (usage, generation, etc.), and provides the means to qualify each of those roles using an n-ary pattern. Unlike OPM, PROV allows stating the plan associated with a certain activity, although the plan definition itself is out of the scope of the model (since it is not something that necessarily happened, for example, if a step of a plan failed).

5http://www.w3.org/TR/prov-o/#starting-points-figure

Figure 4.2: PROV overview, extracted from its specification.

PROV statements can be grouped in sets called bundles, which are entities themselves (thus allowing their provenance to be described). The PROV standard is available as an ontology (PROV-O) (Lebo et al., 2013).
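The PROV counterpart of the previous OPMV sketch follows, again with rdflib and hypothetical resource names; the attribution edge at the end is one of those PROV adds with respect to OPM.

```python
from rdflib import Graph, Namespace, RDF

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/trace1#")

g = Graph()
g.bind("prov", PROV)

# An activity used an input entity and generated an output entity.
g.add((EX.sortActivity, RDF.type, PROV.Activity))
g.add((EX.inputData, RDF.type, PROV.Entity))
g.add((EX.sortedData, RDF.type, PROV.Entity))
g.add((EX.sortActivity, PROV.used, EX.inputData))
g.add((EX.sortedData, PROV.wasGeneratedBy, EX.sortActivity))

# Attribution of the result to an agent, an edge specific to PROV.
g.add((EX.scientist, RDF.type, PROV.Agent))
g.add((EX.sortedData, PROV.wasAttributedTo, EX.scientist))
```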

4.1.1.3 Comparison Between OPM and PROV

There is a very clear correspondence between OPM and PROV, which facilitates the reuse of workflows represented in one language by tools that consume the other. Figure 4.3 illustrates the commonalities between OPM and PROV that are relevant to our work. Entities and artifacts are depicted as ovals, activities and processes as boxes and agents as pentagons, following the W3C notation6. Both models represent resources being used and generated by processes or activities, which are controlled by an agent responsible for their execution. Entities (or artifacts, respectively) can be derived from other entities. Also, in OPM processes might be triggered by other processes, while in PROV activities might receive an input created by another activity (being informed by the other activity). We exploit these commonalities to integrate traces that comply with each of the models.

6http://www.w3.org/2011/prov/wiki/Diagrams

Figure 4.3: The commonalities between PROV (left) and OPM (right) facilitate mappings across both representations.
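As a rough summary of Figure 4.3, the shared core described in this section can be written down as a simple term mapping; a sketch of such a mapping is shown below, covering only the correspondences discussed above.

```python
# OPM term -> closest PROV term, for the shared core only.
OPM_TO_PROV = {
    "Artifact": "Entity",
    "Process": "Activity",
    "Agent": "Agent",
    "used": "used",
    "wasGeneratedBy": "wasGeneratedBy",
    "wasDerivedFrom": "wasDerivedFrom",
    "wasControlledBy": "wasAssociatedWith",
    "wasTriggeredBy": "wasInformedBy",
}
```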

4.1.2 Representing Workflow Templates and Instances: P-Plan

We cannot use a provenance language like OPM or PROV to represent workflow templates and workflow instances: a provenance model describes things that have already happened, while templates and workflow instances are plans that will be executed at some point in time. Therefore we need a language that can represent process models or plans that, when executed, lead to a provenance trace that can be expressed in OPM or PROV. P-Plan7 (Garijo and Gil, 2012) is an extension of PROV for representing scientific processes (e.g., laboratory protocols (Giraldo et al., 2014), workflow infrastructure (Santana-Pérez and Pérez-Hernández, 2015), social computation (Markovic et al., 2014), etc.), and we use it to represent scientific workflows. Figure 4.4 shows an overview of the P-Plan vocabulary. In order to distinguish between the classes and properties of PROV and P-Plan, we have added prefixes to them ("prov:" and "p-plan:" respectively). A plan in P-Plan is a subclass of plan in PROV. The plan steps represent the planned execution activities. Plan steps may be bound to a specific executable step (correspondsToStep relationship) or refer to a class of steps, providing an abstraction layer over the execution. As a result, a plan step could be carried out in different ways in different executions of the same plan.

7http://purl.org/net/p-plan#

A step may not have a corresponding activity in the execution trace (for example, if there is an execution failure). A plan variable represents the data consumed or produced by a step and can have properties (i.e., type, restrictions, metadata, etc.). Plan steps may be preceded by other plan steps (isPrecededBy relationship) and have variables as input (hasInputVar relationship). Variables are outputs of plan steps (isOutputVarOf relationship) and may be bound to the inputs or outputs of executable steps (correspondsToVariable relationship). Both steps and variables are associated to a plan with the isStepOfPlan and isVariableOfPlan relationships respectively. The relation of the plan with the agents involved in its execution is not specified in P-Plan, since it can be modeled with PROV. All the statements involved in the execution of the plan are grouped as part of a plan bundle (subclass of a bundle in PROV), which is derivedFrom the plan.

Figure 4.4: Overview of P-Plan as an extension of PROV.

Finally, plans may be included as part of other plans, e.g., when a set of steps is used several times across different plans. In order to capture this behavior, P-Plan defines the relationship isSubPlanOfPlan, which allows linking a plan to another, and the relationship isDecomposedAsPlan, which specifies how a step of a plan is expanded

itself as another plan. When this happens, the step representing the sub-plan becomes a multiStep. An example is shown in Figure 4.5, where a plan with two steps is included as part of another plan.

Figure 4.5: Sub-plan representation in P-Plan: A plan (P2) with two steps is contained as the third step of another plan (P1) with 3 steps.
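A minimal sketch of these P-Plan constructs with rdflib is shown below, assuming the property names from Figures 4.4 and 4.5; the plan and step URIs are hypothetical.

```python
from rdflib import Graph, Namespace, RDF

PPLAN = Namespace("http://purl.org/net/p-plan#")
EX = Namespace("http://example.org/plans#")

g = Graph()
g.bind("p-plan", PPLAN)

# A plan with one step that consumes one variable and produces another.
g.add((EX.p1, RDF.type, PPLAN.Plan))
g.add((EX.step1, RDF.type, PPLAN.Step))
g.add((EX.step1, PPLAN.isStepOfPlan, EX.p1))
g.add((EX.rawData, PPLAN.isVariableOfPlan, EX.p1))
g.add((EX.step1, PPLAN.hasInputVar, EX.rawData))
g.add((EX.sortedData, PPLAN.isOutputVarOf, EX.step1))

# A sub-plan, as in Figure 4.5: p2 is part of p1, and a step of p1
# is decomposed as p2.
g.add((EX.p2, RDF.type, PPLAN.Plan))
g.add((EX.p2, PPLAN.isSubPlanOfPlan, EX.p1))
g.add((EX.step3, PPLAN.isDecomposedAsPlan, EX.p2))
```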

4.1.3 OPMW

Workflow templates, instances, and executions can be represented with the OPMW model8 (Garijo and Gil, 2011). OPMW extends P-Plan to address the template representation requirements, and PROV and OPM to address the provenance representation requirements. OPMW supports the representation of workflows at a fine granularity, with many workflow-specific details that are not covered by those more generic languages. OPMW also allows the representation of links between a workflow template, a workflow instance created from it, and a workflow execution that resulted from an instance. Finally, OPMW also supports the representation of roles and attribution

8http://www.opmw.org/ontology/

metadata about a workflow. Figure 4.6 shows the relationships between the different vocabularies and how OPMW extends PROV, OPM and P-Plan. Further details on the linking between workflow templates, instances and executions, role modeling and workflow attribution in OPMW are described below.

Figure 4.6: OPMW and its relationship to the OPM, PROV, and P-Plan vocabularies.

4.1.3.1 Linking templates, instances and executions in OPMW

In OPMW, a workflow template is a subclass of P-Plan's plan (since it is a particular type of plan), a workflow template process is a subclass of P-Plan's step, and a workflow template artifact extends P-Plan's variable (since both of them refer to a particular domain). On the execution side, each workflow execution process (subclass of PROV's activity and OPM's process) is bound to a workflow template process via the correspondsToTemplateProcess relationship (subproperty of P-Plan's correspondsToStep). Similarly, each workflow execution artifact (subclass of PROV's entity and OPM's artifact, respectively) is linked to its abstract workflow template artifact with the correspondsToTemplateArtifact relationship (subproperty of P-Plan's correspondsToVariable). Finally, the workflow execution account containing all the provenance statements of the execution is linked to the workflow template that contains all the assertions of the template with the correspondsToTemplate relationship. Figure 4.7 shows an example of the OPMW vocabulary extending OPM, PROV and P-Plan. Each vocabulary concept and relationship is represented with a prefix to help understand its source (OPMW, P-Plan, PROV, OPMV or OPMO). In the

figure, a workflow template with one sorting step, an input and an output (on the top right of the figure, represented using P-Plan) is linked to its provenance trace on the bottom right of the figure (depicted with PROV and OPM). Each activity and artifact is linked to its respective step and variable. Additional metadata of the variables (e.g., constraints), steps (e.g., conditions for execution), activities (e.g., used code), artifacts (e.g., size, encoding), account (e.g., status) and template (e.g., associated dataflow graph) is modeled with OPMW, but has been omitted from the figure for simplicity. As the figure illustrates, there is a clear correspondence between workflow templates (on the top) and workflow execution traces (on the bottom). The workflow instance is implicitly represented, as it can be derived from both the workflow template and the workflow execution trace: the workflow instance would have the same structure as the workflow template, using the workflow inputs and the references to the specific algorithms of the workflow execution trace.

Figure 4.7: Example of OPMW as an extension of PROV, OPM and P-Plan. A workflow execution (bottom right) is linked to its workflow template (top right). Other details like attribution metadata have been omitted to simplify the figure.
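A sketch of the template-to-execution links shown in Figure 4.7 is given below with rdflib, assuming the OPMW class and property names described in this section; all resource URIs are hypothetical.

```python
from rdflib import Graph, Namespace, RDF

OPMW = Namespace("http://www.opmw.org/ontology/")
EX = Namespace("http://example.org/export/resource/")

g = Graph()
g.bind("opmw", OPMW)

# Template side: a sorting step and its input variable.
g.add((EX.tmplSort, RDF.type, OPMW.WorkflowTemplateProcess))
g.add((EX.tmplInput, RDF.type, OPMW.WorkflowTemplateArtifact))

# Execution side: the activity and artifact that actually occurred,
# each linked back to its template counterpart.
g.add((EX.execSort, RDF.type, OPMW.WorkflowExecutionProcess))
g.add((EX.execInput, RDF.type, OPMW.WorkflowExecutionArtifact))
g.add((EX.execSort, OPMW.correspondsToTemplateProcess, EX.tmplSort))
g.add((EX.execInput, OPMW.correspondsToTemplateArtifact, EX.tmplInput))

# The execution account is linked to the template it was derived from.
g.add((EX.execAccount, OPMW.correspondsToTemplate, EX.sortTemplate))
```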

4.1.3.2 Role Capture in OPMW

Workflow steps consume inputs and produce outputs with different roles, which may be crucial to understand how the execution process was carried out. For example, consider the execution of a component for removing duplicate genes, depicted in Figure 4.8. The component had two datasets of genes as inputs and two datasets of genes as outputs. One of the inputs had the role knownGenes, the other had the role foundGenes. One of the outputs had the role discoveredGenes and the other output was discardedGenes. All the inputs and outputs were datasets of genes, and their role labels are the identifiers that describe how each dataset was related to the process carried out. Thus, the knownGenes role qualifies the usage relationship in which the input datasetA and the process removeDuplicates were involved.

Figure 4.8: Example of roles: an executed workflow step used two datasets of genes (Dataset A, Dataset B) and produced two datasets of genes (Dataset C, Dataset D). Each dataset played a different role in the process.

In OPM and PROV this qualification of relationships is captured through an n-ary relationship pattern (Suárez-Figueroa et al., 2007), linking an instance of a role to the usage or generation edges. However, this option adds complexity to the model, as it forces the introduction of an indirection through intermediary resources in order to qualify

the relationship. An example can be seen in Figure 4.9, where we qualify in PROV (on the left) and OPM (on the right) the usage relationship between removeDuplicates and datasetA to introduce the role.

Figure 4.9: Qualifying a usage relationship in PROV (on the left of the figure) and OPM (on the right). Both models use an n-ary pattern (Usage and Used) to link datasetA with the removeDuplicates step and the knownGenes role.

Instead, we extend the usage or generation properties of both vocabularies with the role. The extension is not part of OPMW itself but part of the domain-specific vocabulary used to describe the workflow steps. Figure 4.10 shows an overview of this approach, illustrating the example of Figure 4.8. The prefix ex has been used to identify the relationships extending OPM and PROV with the role (e.g., ex:usedAsKnownGenes). As shown in the figure, all the role subproperties are used to link the respective datasets with the process in a simpler manner than in Figure 4.9. Additional information about the property can be added as metadata of the property itself (e.g., label, description, etc.).
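A sketch of this role-as-subproperty approach with rdflib is shown below; the property name ex:usedAsKnownGenes is hypothetical, following the pattern of Figure 4.10.

```python
from rdflib import Graph, Namespace, Literal, RDFS

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/genomics#")

g = Graph()

# The role becomes a subproperty of the generic usage edge.
g.add((EX.usedAsKnownGenes, RDFS.subPropertyOf, PROV.used))
g.add((EX.usedAsKnownGenes, RDFS.label, Literal("used as known genes")))

# A single triple now replaces the n-ary Usage construct of Figure 4.9.
g.add((EX.removeDuplicates, EX.usedAsKnownGenes, EX.datasetA))
```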

4.1.3.3 Workflow Attribution in OPMW

Attribution is crucial for scientists who design, create and publish workflows, and OPMW can be used to represent such metadata in workflow templates, instances, and executions. For this, OPMW reuses terms from the Dublin Core (DC) Metadata Vocabulary9, namely author, contributor, rights and license. OPMW also defines additional terms for referring to the start and end of the whole execution of the workflow, the size of the produced files, the status of the final execution, the tool used to design the workflow, the tool used to execute the workflow, etc. Figure 4.11 shows an example where a template created by an agent (Phil) and later modified by another agent (Daniel) using a workflow editor (Wings) has been successfully executed using two workflow execution systems (Pegasus and Condor).

9http://dublincore.org/documents/dcmi-terms/

Figure 4.10: Role capture in OPMW: the roles of the inputs and outputs used and generated by the activity removeDuplicates extend the OPM and PROV usage and generation properties.
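A sketch of such attribution metadata in the spirit of Figure 4.11, reusing the Dublin Core terms listed above, could look as follows; the agent and template URIs are hypothetical.

```python
from rdflib import Graph, Namespace, URIRef

DCTERMS = Namespace("http://purl.org/dc/terms/")
EX = Namespace("http://example.org/export/resource/")

g = Graph()

# Creator, contributor and license of a workflow template.
g.add((EX.WaterTemplate, DCTERMS.creator, EX.Phil))
g.add((EX.WaterTemplate, DCTERMS.contributor, EX.Daniel))
g.add((EX.WaterTemplate, DCTERMS.license,
       URIRef("http://creativecommons.org/licenses/by-sa/3.0/")))
```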

4.2 Scientific Workflow Publication

As we have shown in the related work section, scientists make their workflows public using repositories like myExperiment, Galaxy or the LONI Pipeline. This normally follows a manual process, which relies on the users to describe the workflow and provide additional metadata. Furthermore, this only makes the workflow itself accessible: to explore each workflow step it is necessary to download it locally. In this section we describe our approach for publishing workflows and their executions as linked web resources, along with its benefits, and we show a functional example in the Wings workflow system (Gil et al., 2011).

Figure 4.11: Example showing attribution metadata capture in OPMW.

4.2.1 Workflows as Linked Data Resources

Publishing the templates and execution traces of scientific workflows is a crucial step for reuse, reproducibility and understanding. In order to be able to reference all the resources of a workflow template and its associated workflow executions properly, we use Uniform Resource Identifiers (URIs) (Berners-Lee et al., 2005) following the Linked Data principles10 (Bizer et al., 2009). URIs provide the mechanism to refer to the different resources of a workflow unambiguously, while the principles consist of a set of rules or expectations indicating how the URIs should behave when someone (human or machine) accesses them. The Linked Data principles state (Bizer et al., 2009): 1) that we should "use URIs as names for things", 2) that we should "use HTTP URIs so that people can look up those names" (making those URIs resolvable and available in any browser), 3) that we should "provide useful information when someone looks up a URI" (by showing the resources that are related to the URI) and 4) that we should "include links to other URIs", so that whoever is exploring the data can discover other resources. When a URI complies with the four principles, it becomes a Linked Data URI. Using Linked Data URIs provides several important advantages for the publication of scientific workflows:

• Linking to third party web resources, which could be used as inputs of scientific workflows. For instance, referring to proteins in the Protein Data Bank by using their published URIs;

• Being linked from other applications that point to the URIs we publish, which include both the workflows and the data generated by them;

• Producing interoperable results across different systems without having to define particular catalog structures and access interfaces.

However, publishing Linked Data is not straightforward. There are several decisions and naming conventions to take into consideration when making the URIs accessible. In the following section we present a Linked Data publication methodology adapted to publish scientific workflows.

10http://www.w3.org/DesignIssues/LinkedData.html

4.2.2 A Methodology for Publishing Scientific Workflows as Linked Data

Although there are publications that introduce and explain the Linked Data generation and publication process (Heath and Bizer, 2011), there are no methodologies that indicate the steps to follow in the scientific workflow domain. Therefore, we have adapted an existing methodology used in the publication of governmental data (Villazón-Terrazas et al., 2012, 2011) and other domains like energy consumption (Radulovic et al., 2015) to achieve our purpose. The methodology consists of five main steps, described as follows:

• Specification, which consists of identifying the set of data sources to use, deciding on their URI design and agreeing on a license for the resulting dataset.

• Modelling, where the users decide which vocabularies should be used to expose the data.

• Generation, which summarizes the process of transforming the data from their heterogeneous formats to RDF with the help of existing tools, cleaning the data and linking it with other existing sources.

• Publication, where the resulting dataset and its metadata are made available by using a triplestore (i.e., a database for the storage and retrieval of RDF (Villazón-Terrazas et al., 2011)).

• Exploitation, which consists of using the published data, e.g., through browsers, enabling their consumption either manually or automatically.

Many of the steps of the methodology can be easily adapted to scientific workflows. For the specification step, the set of data sources to use is reduced to deciding which workflow management systems we want to integrate. For each workflow management system it is necessary to identify how it specifies workflow templates, workflow instances and execution traces, as those are the files that will have to be transformed and linked to each other when publishing the data. The URI design is almost the same as in the original methodology. The base URI should be a URL under a domain of our control, the URIs for the ontology terms should have the word "ontology" in their path, and the URIs for the assertions or

instances should have the word "resource" followed by the class name of the instance. In the scientific workflow domain we may want to distinguish different releases of the execution traces. Therefore we have included an optional parameter indicating the dataset name under which the data is made public:

Base URI = http://mydomain.org
Ontology URI = http://mydomain.org/ontology
Assertion URI = http://mydomain.org/[dataset/]resource/ClassName/instanceName

Regarding the appropriate license, publishers should first discuss the privacy issues and requirements of the data about to be released. In some domains privacy is a concern (e.g., for some workflows processing genomic data); in those cases an open license would not be appropriate. However, there are many areas of science where privacy is not an issue and that would benefit tremendously from an open license for sharing both data and workflows as Linked Data. Since the goal of publishing a scientific workflow is to make it available for others to reuse and repurpose, we recommend using an open license. Several catalogs of licenses are already available on the Web11 12 13, and can be used for selecting the most appropriate one. For the modeling step, we propose to use and adapt the OPMW vocabulary introduced in Section 4.1.3, although other more recent PROV-based vocabularies like the Research Object ontologies (Belhajjame et al., 2015) or ProvONE14 could be used as well. The generation step may be considered the most complex step in the whole process, as it depends on the features of the workflow system. Normally the templates, instances and execution traces can be exported from the workflow system and have to be mapped to the representation model chosen in the modeling step. Although several tools are proposed in the methodology to help in the conversion (Villazón-Terrazas et al., 2011), it is often necessary to develop a script that performs the transformation. The script should create the appropriate URIs to refer to the resources, instead of relying on the local IDs proposed by the workflow system. A very important aspect of this step is that templates, instances and traces have to be linked to each other according to the

11http://datahub.io/es/dataset/rdflicense 12http://creativecommons.org/ 13http://opendatacommons.org/licenses/ 14http://purl.org/provone

representation model, in order to be able to exploit these relationships later. In our methodology we exclude linking workflow inputs to other datasets of the Linked Open Data cloud, as we focus on making the workflow contents public for their reusability. Finally, the publication and exploitation steps are followed as in the original methodology, using the set of tools proposed in (Villazón-Terrazas et al., 2011). We illustrate the whole methodology with an example in the next section.

4.2.3 Linked Data Workflows: An Example

In order to describe how each of the steps can be applied in practice, we have applied our methodology successfully to the Wings workflow system (Gil et al., 2011). For the specification step, we chose to export and link workflow templates and execution traces out of the logs produced by the system. All the URIs generated by our system are "Cool URIs"15, following the W3C recommendations. This means that they are produced under a domain under our control, they are unique, and they are not going to change. Each URI identifies a different resource that can be individually accessed and dereferenced with content negotiation (i.e., the same URI can handle requests from users (returning HTML) and machines (returning RDF)). By following our methodology, we decided to apply the following URI naming scheme:

Base URI = http://www.opmw.org/
Ontology URI = http://www.opmw.org/ontology/
Assertion URI = http://www.opmw.org/export/resource/ClassName/instanceName

As for the license, we chose the Creative Commons Attribution-ShareAlike 3.016, which allows anyone to reuse the published data if proper attribution is provided. The modeling step is performed with the OPMW vocabulary. Then, in the generation step, Wings execution traces and instances are converted automatically to OPMW with a transformation script17. Camel case notation is used for composing the identifiers of classes and instances, and an MD5 encoding (Rivest, 1992) is used to generate a unique identifier for each resource when necessary (e.g., a used file, an execution step, etc.).

15http://www.w3.org/TR/cooluris/ 16http://creativecommons.org/licenses/by-sa/3.0/ 17https://github.com/dgarijo/WingsProvenanceExport
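As an illustration, the URI-minting logic of such a transformation script might look like the following sketch, which applies the naming scheme above and falls back to an MD5 hash for identifiers that are not URL-safe; mint_uri is hypothetical, not the actual Wings exporter.

```python
import hashlib
import re

BASE = "http://www.opmw.org/export/resource"

def mint_uri(class_name: str, local_id: str) -> str:
    """Build an assertion URI, hashing ids that are not URL-safe."""
    if re.fullmatch(r"[A-Za-z0-9_\-]+", local_id):
        safe = local_id
    else:
        safe = hashlib.md5(local_id.encode("utf-8")).hexdigest()
    return f"{BASE}/{class_name}/{safe}"

print(mint_uri("WorkflowTemplate", "AQUAFLOW_EDM"))
# -> http://www.opmw.org/export/resource/WorkflowTemplate/AQUAFLOW_EDM
```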

For the publication step, the RDF files are loaded into a triple store and made available through a public endpoint (i.e., an access point for both human users and machines). We have selected Openlink Virtuoso18 as our triple store because of its robustness, the support from its community and its mechanisms for creating a public access point. An additional file store is needed to store the files referred to by the links available in the triple store (inputs, intermediate results and outputs of the workflows). The file store is available on our local servers (http://www.opmw.org). The endpoint can now be exploited through generic visualization tools like the Linked Data browser Pubby19 or the workflow explorer WExp20 (which we developed to explore the available data easily), but it can also be accessed programmatically from other applications. The access point for a workflow is simply a URI (of a workflow template or an execution), and all the components and datasets in the workflow can be accessed from it. For example, a template called "Aquaflow edm" would have the URI http://www.opmw.org/export/resource/WorkflowTemplate/AQUAFLOW_EDM (template names are unique in Wings) and type WorkflowTemplate (i.e., http://www.opmw.org/ontology/WorkflowTemplate). As shown in Figure 4.12, by simply resolving the URI in a browser, a user would see an HTML page21 with the descriptions involving that URI, i.e., the provenance of the template (contributor, license, primary source, documentation, etc.), the steps and variables that belong to the template, the executions associated to the template and the rest of its metadata. If we edit the request headers to ask for a machine-readable representation, we would receive the same information in the appropriate syntax (e.g., Turtle22 or RDF/XML23, depending on the request). This approach can be used to access any resource (by resolving its URI) from any template or execution stored in the repository. If a user or program aims to perform a more complex query (e.g., returning all the stored templates with at least two executions), they may issue it to the public SPARQL endpoint24, which will process it and return the results in

18http://virtuoso.openlinksw.com/ 19http://wifo5-03.informatik.uni-mannheim.de/pubby/ 20http://purl.org/net/wexp 21http://www.opmw.org/export/page/resource/WorkflowTemplate/AQUAFLOW_EDM 22http://www.w3.org/TR/turtle/ 23http://www.w3.org/TR/rdf-syntax-grammar/ 24http://opmw.org/sparql

the desired format. Some query examples and pointers to resources can be browsed online25.
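The two access modes just described can be sketched as follows; the template URI is the AQUAFLOW_EDM example from the text, while the SPARQL query (templates with at least two executions) is illustrative and not taken from the thesis.

```python
import requests

uri = ("http://www.opmw.org/export/resource/"
       "WorkflowTemplate/AQUAFLOW_EDM")
# Content negotiation: ask for Turtle instead of the default HTML page.
turtle = requests.get(uri, headers={"Accept": "text/turtle"}).text

# A more complex question goes to the public SPARQL endpoint.
query = """
PREFIX opmw: <http://www.opmw.org/ontology/>
SELECT ?template (COUNT(?account) AS ?executions)
WHERE { ?account opmw:correspondsToTemplate ?template . }
GROUP BY ?template
HAVING (COUNT(?account) >= 2)
"""
answer = requests.get("http://opmw.org/sparql",
                      params={"query": query, "format": "json"}).json()
```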

Figure 4.12: Different approaches for accessing workflow resources: a user may request to resolve the URI of a workflow by browsing a web page (HTML representation), while a machine would request a machine-readable format like RDF/XML or Turtle. For both machines and human users, more complex queries about resources may be issued through the public endpoint.

4.3 Summary

In Section 4.1.3 we described OPMW, the model we developed to represent scientific workflow templates, instances and execution traces, along with a methodology to make them public as Linked Data resources. OPMW is aligned with existing standards like PROV and reuses well-known models like OPM. In addition, we tested and validated OPMW by successfully answering our competency questions and by including the model as part of the WEST workflow ecosystem (Garijo et al., 2014d), where a set of heterogeneous tools for workflow design, analysis and execution use it as an intermediate representation. Thus, OPMW is interoperable with applications consuming both PROV and OPM models and addresses the first research objective of this thesis (RO1, definition of a model based on standards that represents and links workflow templates, executions and their associated resources).

25http://opmw.org/node/6

OPMW was designed to represent the simplest workflow constructs that are common to most scientific workflow systems (i.e., the relation between workflow templates, instances and executions, the usage and generation of inputs, intermediate results and outputs, and their key metadata), leaving more complex constructs like conditional branches or loops out of the scope of the model. The rationale behind this decision is that a simpler representation may enable a better understanding (and hence, adoption) among the community, facilitating the retrieval of key metadata for each input and step of the workflow and helping to understand the dataflow. In this regard, OPMW was designed to expose workflow information on the Web in a simple manner. As described in Section 2.1.3, there are more complex models for workflow interchange between different workflow systems (e.g., BPEL, SHIWA) to which OPMW can be mapped for exploiting this information. Table 4.1 shows a comparison between OPMW and other recent approaches for representing scientific workflows and their relationships. On the one hand, the most similar approaches are the Research Object (Belhajjame et al., 2015) and ProvONE26 models, which aim to provide a simple representation of workflows. In essence, the difference among the models is small, and part of the reason for the overlap is that they have been developed in parallel. Some of the differences are that the RO and ProvONE models have support for explicitly modeling sub-workflows and use data links to specify the connections between the components of workflows. Instead, OPMW connects input data and output data to the workflow steps directly, in order to simplify the overall representation. Additionally, OPMW has explicit relationships for capturing workflow metadata. Since all these vocabularies extend the PROV standard, their representations are compatible at a higher level of abstraction. On the other hand, the Conceptual Workflow model (Cerezo, 2013) is a top-down approach with a different scope than OPMW. Rather than providing a simple representation linking workflow templates and execution traces, the Conceptual Workflow model studies how an abstract process can be transformed into a specific workflow. Its scope is therefore modeling and representing the relationship between abstract workflow templates and instances, which may follow a layered abstraction. In Section 4.2.2 we described a methodology that addresses the second research objective of this thesis (RO2, definition of a methodology for publishing workflows and their resources on the Web).

26http://purl.org/provone

Table 4.1: Comparison between OPMW and other scientific workflow vocabularies for representing workflow templates (WT), workflow instances (WI) and workflow execution traces (WET). The asterisk (*) in the WT column indicates that the model makes no distinction when representing WT and WI.

Approach | Domain | Control-based constructs | WT constructs | WI constructs | WET constructs | Metadata constructs
Research Object | Scientific workflows | No | Yes* | Yes | Yes | No
ProvONE | Scientific workflows | No | Yes* | Yes | Yes | No
Conceptual Workflows | Scientific workflows | Yes | Yes | Yes | No | No
OPMW | Scientific workflows | No | Yes* | Yes | Yes | Yes

The methodology allows the publication of whole sets of workflows and their associated resources automatically, simplifying this task for the user. Following our methodology, we created a repository of online accessible workflows and traces which will be used in the next stages of our work. To our knowledge, this is the first repository of fully resolvable linked scientific workflows and executions27.

27accessible at http://www.opmw.org/sparql

Chapter 5

Workflow Abstraction and Reuse

As stated in Chapter 3, one of our hypotheses aims to determine whether patterns and abstractions can be obtained from a repository of workflows (H1.1). In particular, our goal is to find those abstractions that capture generic features of groups of workflows (or workflow fragments) independently from their contexts. For example, one way of achieving abstraction is by publishing workflow templates and linking them to their provenance execution traces. The templates act as a high level abstraction of a set of executions, grouping them as entities with the same original plan. Other types of abstractions may be obtained with the different types of workflow abstractions introduced in Section 2.2, such as macro abstraction, skeletal planning or layered abstraction. However, even after defining these abstractions, workflows may be difficult to understand due to their complex nature. A workflow may contain several scientifically significant analysis steps, combined with auxiliary data preparation or result delivery activities, written in different implementation styles. This difficulty in understanding stands in the way of workflow reuse by potential adopters. In this chapter we aim to address this issue by performing several analyses, from both the abstraction and reuse perspectives. In Section 5.1, we characterize the domain-independent conceptual abstractions for workflow steps that can be found in scientific workflows upon manual inspection. We refer to these abstractions as workflow motifs, and we have defined them after manually analyzing workflows from several workflow systems. Motifs are provided through (i) a characterization of the kinds of data-operation activities that are carried out within workflows, which we refer to as data-operation motifs, and (ii) a characterization of

the different manners in which those activity motifs are realized/implemented within workflows, which we refer to as workflow-oriented motifs. In Sections 5.2 and 5.3 we explore how users reuse workflows and workflow fragments, analyzing whether those motifs related to workflow reuse are relevant for users or not. The analyses were performed from different perspectives: one consists of automatically processing workflows, while the other surveys users on the usefulness of reusing workflows and workflow fragments.

5.1 Workflow Motifs

In order to define a catalog of common domain-independent abstractions in scientific workflows (objective RO3 of this thesis) we performed a manual analysis of the current practices in scientific workflow development. The objectives of the analysis were the following:

1. To reverse-engineer the set of current practices in workflow development through an empirical analysis.

2. To identify workflow abstractions that would facilitate understandability and therefore effective reuse.

In this section we present the results of the empirical analysis, performed manually over 260 workflow descriptions from four different workflow management systems. Based on this analysis, we propose a catalog of common workflow motifs.

5.1.1 Experimental Setup

We used workflows from Taverna (Wolstencroft et al., 2013), Wings (Gil et al., 2011), Galaxy (Goecks et al., 2010) and VisTrails (Scheidegger et al., 2008). The choice of these systems was due to the availability of workflow corpora through repositories (as introduced in Section 2.1.4.2, myExperiment (Roure et al., 2009) for Taverna, Crowdlabs (Mates et al., 2011) for VisTrails and the public endpoint we created in Section 4.2.2 (Garijo and Gil, 2011) for Wings) and portals (for Galaxy1), but also because of their core similarities:

1https://main.g2.bx.psu.edu/

• They provide similar workflow modeling constructs. The selected workflow systems are dataflow oriented, which is consistent with our models, and they operate on large and heterogeneous data.

• All of them are open-source scientific workflow systems, initially focused on performing in-silico experimentation in the life sciences (Taverna, Galaxy, VisTrails) and geosciences (Wings) domains. Taverna, Wings and VisTrails now also have workflows across other domains like astronomy, machine learning, meteorology, etc.

• All of these systems can interact with third party tools and services, and they include a catalog of components for performing different operations with data.

Despite being similar, we can still find some differences among the selected systems. In particular, VisTrails has explicit mechanisms for the basic control constructs (e.g., conditionals and looping), while Taverna, Wings and Galaxy adhere to the pure dataflow paradigm (although there are implicit ways of implementing such control structures). The variety of environments in which these systems operate highlights some other differences as well. While Taverna allows users to specify workflows that make use of third party services (i.e., an open environment), Wings requires that the resources and the analysis operations are made part of its environment prior to being used in experiments (i.e., a controlled environment). However, the differences between these systems are not a significant differentiating factor, as Taverna allows more control to be added to the environment through the addition of plug-ins, and Wings can establish connections to third party services via custom components. VisTrails and Galaxy may be positioned at an intermediate point, since they provide access to external web services but also build on a comprehensive library of components. A summary of the commonalities and differences among the workflow systems included in the analysis can be seen in Table 5.1.

5.1.2 Workflow Corpus Description

For our analysis, we have chosen 260 heterogeneous workflows in a variety of domains. We analyzed a set of public Wings workflows (89 out of 132 workflows), part of the Taverna set (125 out of 874 Taverna 2 workflows in myExperiment), a set of Galaxy workflows (26 out of 145 workflows) and part of the VisTrails set (20 out of 274 workflows).

Table 5.1: Summary of the main differences in the features of each workflow system: explicit support for control constructs (i.e., conditionals, loops), whether the user interface is web-based or not, whether the environment is open or controlled, and the execution engine used.

Workflow System | Control Constructs | GUI | Environment type | Engine
Taverna | No | Desktop/Web | Open/Controlled | Taverna
Wings | No | Web | Controlled | Pegasus, Apache OODT
Galaxy | No | Web | Open/Controlled | Galaxy
VisTrails | Yes | Desktop/Web | Open/Controlled | VisTrails


• For Wings, we analyzed all workflows from the drug discovery, text mining, domain-independent, genomics and social network analysis domains.

• For Taverna we analyzed workflows that were available in myExperiment (Roure et al., 2009). We determined the groups/domains of workflows by browsing the myExperiment group tags2 and identifying those domains which contained workflows that were publicly accessible at the time of the analysis. For the Taverna dataset we analyzed the cheminformatics, genomics, astronomy, biodiversity, geoinformatics and text mining domains. The distribution of workflows across domains is uneven, as is also the case in myExperiment. Since Taverna is the workflow system with the largest public collection of workflows, we made random selections from each identified domain in order to obtain a manageable subset of workflows for manual analysis.

• For Galaxy we chose all the documented workflows available in the public workflow repository3 (i.e., those workflows with annotations explaining the functionality of their components). Since Galaxy is specialized in the biomedical domain,

2http://www.myexperiment.org/groups 3https://main.g2.bx.psu.edu/workflow/list_published

most of the workflows are from the genomics domain, although three of them perform text analysis operations on files.

• For VisTrails we chose a set of documented workflows available in Crowdlabs and tutorials, covering the medical informatics and genomics domains. It is worth mentioning that some workflows are domain-independent (machine learning workflows, visualization and rendering of datasets, annotation of texts, etc.), so they have been included under a new category.

When selecting workflows for this analysis we made sure to include workflows that were developed with the intention of backing actual data-intensive scientific investigations. We did not include any example or testing workflows, which are used for demonstrating the capabilities of different workflow systems. A summary of the number of workflows analyzed from each domain can be seen in Table 5.2, while Table 5.3 provides additional information on the size of the workflows analyzed in terms of the range and average number of analysis tasks.

Table 5.2: Number of workflows analyzed from Taverna (T), Wings (W), Galaxy (G) and VisTrails (V).

Domain | No. of workflows | T | W | G | V
Drug Discovery | 7 | 0 | 7 | 0 | 0
Astronomy | 51 | 51 | 0 | 0 | 0
Biodiversity | 12 | 12 | 0 | 0 | 0
Cheminformatics | 7 | 7 | 0 | 0 | 0
Genomics | 90 | 38 | 28 | 23 | 1
Geo-Informatics | 6 | 6 | 0 | 0 | 0
Text Analysis | 45 | 11 | 31 | 3 | 0
Social Network Analysis | 5 | 0 | 5 | 0 | 0
Medical Informatics | 7 | 0 | 0 | 0 | 7
Domain Independent | 30 | 0 | 18 | 0 | 12
TOTAL | 260 | 125 | 89 | 26 | 20

Table 5.3: Maximum, minimum and average size in terms of the number of steps within workflows per domain.

Domain | Max. Size | Min. Size | Avg. Size
Drug Discovery | 18 | 1 | 7
Astronomy | 33 | 1 | 7
Biodiversity | 12 | 1 | 4
Cheminformatics | 20 | 1 | 9
Genomics | 53 | 1 | 6
Geo-Informatics | 14 | 3 | 8
Text Analysis | 15 | 1 | 5
Social Network Analysis | 7 | 3 | 6
Medical Informatics | 29 | 8 | 4
Domain Independent | 20 | 1 | 6

5.1.3 Methodology for Workflow Analysis

Our analysis was performed based on the workflow template, documentation, metadata and traces available for each of the workflows within the corpus studied. Each workflow was inspected using the associated workbench/design environment. Documentation on workflows is provided within workflow descriptions and in the repositories in which the workflows are published. We performed a bottom-up manual analysis that aimed to surface two orthogonal dimensions of workflow steps: 1) what kind of data-operation activity is undertaken by a workflow step, and 2) how that activity is implemented. For example, a visualization step (a data-oriented activity) can be implemented in different ways: via a stateful multi-step invocation, through a single stateless invocation (depending on the environmental constraints and the nature of the services), or via a sub-workflow. The only automated part of the data collection was associated with the workflow size statistics for Taverna workflows. The myExperiment repository provides a REST API4 that allows retrieving information on Taverna workflows. Using this facility we were able to automate the collection of partial statistics regarding the number of workflow steps and the number of input/output parameters of Taverna workflows.

4http://wiki.myexperiment.org/index.php/Developer:API
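A sketch of how such statistics could be collected through the myExperiment REST API is shown below; the endpoint path follows the API documentation, but the XML element names should be treated as assumptions.

```python
import requests
import xml.etree.ElementTree as ET

# List public workflows (the API returns XML by default).
listing = requests.get("http://www.myexperiment.org/workflows.xml")
workflows = ET.fromstring(listing.content).findall("workflow")
print(f"{len(workflows)} public workflows found")

# Each entry carries a resource URI that can be dereferenced for details
# (e.g., the number of processors/steps of a Taverna workflow).
for wf in workflows[:5]:
    print(wf.attrib.get("uri"))
```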

Rather than hypothesizing possible motifs up front, we built up a set of motifs as we progressed with the analysis. For each workflow we recorded the number of occurrences of each motif according to the documentation and metadata descriptions provided by the original authors (independently of the workflow system it belonged to). In order to minimize misinterpretation and human error, the motif occurrences identified for each workflow were cross-validated by at least one other colleague. Whenever there was a conflict in the annotation, the metadata and documentation associated with the workflow were reviewed again, discussing the possible interpretations until agreement was reached.

5.1.4 A Motif Catalogue for Abstracting Scientific Workflows

We divided motifs into two categories: data-operation motifs (i.e., the motifs related to the main functionality undertaken by the steps of the workflow) and workflow-oriented motifs (i.e., how the data-operation motifs are undertaken in the workflow). An overview of the motif catalog is provided in Table 5.4. In this section we describe each type of motif in detail.

5.1.4.1 Data-Operation Motifs

Data Preparation

Once retrieved, data may need several transformations before it can be used in a workflow step. The most common activities that we have detected in our analysis are described below.

Combine: Data merging or joining steps are commonly found across workflows. The combine motif refers to the step or group of steps in the workflow that aggregate information from different inputs. An example can be seen in Figure 5.1, where the results of two branches are merged to present a single output result.

Filter: The datasets brought into a pipeline may not be subject to analysis in their entirety. Data may be further distilled, sampled or subject to the extraction of various subsets.

Table 5.4: Overview of our catalog of workflow motifs.

Data-Operation Motifs
• Data Preparation: Combine, Filter, Format Transformation, Input Augmentation, Output Extraction, Group, Sort, Split
• Data Analysis
• Data Cleaning
• Data Movement
• Data Retrieval
• Data Visualization

Workflow-Oriented Motifs
• Inter-Workflow Motifs: Atomic Workflows, Composite Workflows, Workflow Overloading
• Intra-Workflow Motifs: Internal Macros, Human Interactions, Stateful (Asynchronous) Invocations

Format Transformation: Heterogeneity of formats in data representation is a known issue in many scientific disciplines. Workflows that bring together multiple access or analysis activities usually contain steps for format transformations, sometimes called "shims" (Hull et al., 2004), that typically preserve the contents of the data while converting its representation format.

Input Augmentation: Data access and analysis steps that are handled by external services or tools typically require well-formed query strings or structured requests as input parameters. Certain tasks in workflows are dedicated to the generation of these queries through an aggregation of multiple parameters. An example of this is provided in the workflow of Figure 5.2: for each service invocation (e.g., getJobState) there are steps (e.g., getJobState_Input) that are responsible for creating the correctly formatted inputs for the service.

Output Extraction: Outputs of data access or analysis steps could be subject to data extraction to allow the conversion of data from the service format to the workflow's internal data-carrying structures. This motif is similar to the format transformation and filter motifs, and is associated with steps that perform the inverse operation of input augmentation. An example is given in Figure 5.2, where output extraction steps (e.g., getJobState_output) are responsible for parsing the XML message resulting from the service (getJobState) to return a singleton value containing solely the job state.

Group: Some steps of the workflow reorganize the input into different groups or classify the items of a given collection of data. For example, grouping a table by a certain category or executing a SQL GROUP BY clause on an input dataset. Group is different from combine, as the former may reorganize the information in an input without the need to aggregate it from several sources.

Sort: In certain cases datasets containing multiple data items/records are to be sorted (with respect to a parameter) prior to further processing. The sort motif refers to those steps. Figure 5.1 shows an example where the inputs resulting from the data analysis (comparisonResults) are sorted (ComparisonResultsV2 component) before being merged in a subsequent step.

Split: Our analysis has shown that many steps in workflows separate an input (or group of inputs) into different outputs, for example splitting a dataset into different subsets to be processed in parallel.

Data Analysis
This motif refers to a broad category of tasks in diverse domains, and it is relevant because it often represents the main functionality of the workflow. A significant number of workflows are designed with the purpose of analyzing or evaluating different features of the input data, ranging from simple comparisons between datasets to complex protein analyses that determine whether two molecules can be docked successfully. Examples are given in the workflow of Figure 5.2, with a processing step named warp2D, and in Figure 5.1, where the steps named SMAPV2 perform a ligand binding sites comparison of the inputs. These steps carry out the main functionality of the workflow, but they are surrounded by other data preparation and filtering steps.

Data Cleaning/Curation
We treat the steps for cleaning and curating data as a category separate from data preparation and filtering. Typically these steps are undertaken by sophisticated tooling/services, or by specialized users. A cleaning/curation step essentially preserves and enriches the content of the data (e.g., a user annotating a result with additional information, or detecting and removing inconsistencies in the data).

Data Movement
Certain analysis activities performed via external tools or services require the submission of data to a location accessible by the service or tool (i.e., a web location or a local directory, respectively). In such cases the workflow contains dedicated steps for the upload/transfer of data to these locations. The same applies to the outputs, in which case a data download step is used to pass the data to the next steps of the workflow. The deposition of workflow results on a specific server also belongs to this category. In Figure 5.2, the DataUpload step ships data to the server on which the analysis will be done.

Data Retrieval
Workflows exploit heterogeneous data sources, remote databases, repositories and other web resources exposed via SOAP or REST services. Scientific data stored in these repositories are retrieved through query and retrieval steps inside workflows: certain tasks within the workflow are responsible for bringing data from such external sources into the workflow environment. We also observed that certain data integration workflows contain multiple linked retrieval steps, making them essentially parametrized data integration chains. Data retrieval is different from data movement, as it refers to the steps of the workflow that select particular data to be consumed by the workflow (through a query, a REST API, etc.) instead of performing a bulk upload of the workflow resources.

Data Visualization
In some workflows, being able to show the results is as important as producing them. Scientists use visualizations to show the conclusions of their experiments and to make important decisions within the pipeline itself. Therefore, certain steps in workflows are dedicated to the generation of plots, graphs, tables, and XML or CSV file outputs from the input data. This category essentially covers the delivery of the experiment results.

5.1.4.2 Workflow-Oriented Motifs

We divide this category into two groups, depending on whether motifs are observed among workflows, by analyzing the relations of a workflow with other workflows (inter-workflow motifs), or within workflows, by exploring the workflow itself (intra-workflow motifs).

Inter Workflow Motifs

Atomic Workflows: A significant number of workflows perform an atomic unit of functionality, which does not need to be decomposed into smaller workflow fragments. Typically, these workflows are designed to be included in other workflows as sub-workflows, as part of a larger experiment. Atomic workflows are the main mechanism for modularizing functionality within scientific workflows.

Composite Workflows: This motif refers to those workflows that include one or more workflows as sub-workflows. The usage of sub-workflows appears to be a common practice for reusing modular functionality from multiple workflows. It is important to note that there are two ways in which this motif may appear in a collection of workflows: highlighted explicitly by the creator of the workflow (i.e., the user deliberately included other workflows as part of the one being designed) or detected implicitly by the analysis (i.e., a given workflow was created without any sub-workflows, but other existing workflows in the collection are equivalent to fragments of the current workflow). The latter is common in systems where the creation of sub-workflows is not included among the features of the workflow system.

Workflow Overloading: Our analysis has shown that authors tend to deliver multiple workflows with the same functionality, but operating over different input parameter types. Examples include performing an analysis over a string input parameter versus performing it over the contents of a specified file, or generalizing a workflow to work with collections of files instead of single inputs. Overloading is a response to the heterogeneity of environments, and is directly related to workflow reuse (as most of the functionality of the steps in the overloaded workflow remains the same).

Figure 5.1: Sample motifs in a Wings workflow fragment for drug discovery. A comparison analysis is performed on two different input datasets (SMAPV2). The results are then sorted (SMAPResultSorter) and finally merged (Merger, SMAPAlignementResultMerger).

Intra Workflow Motifs

Internal Macros: This category refers to those groups of steps in the workflow that correspond to repetitive patterns of combining tasks. An example can be seen in Figure 5.1, where there are two branches of the workflow performing the same operations in the same order (SMAPV2 plus SMAPResultSorterV2 steps).

Human Interactions: We have observed that some scientific workflow systems increasingly make use of human interactions to undertake certain activities within workflows. These steps are often necessary to achieve some functionality of the workflow that cannot be (fully) automated and requires human computation to complete. Typical examples of such activities are manual data curation and cleaning steps (e.g., annotating a Microsoft Excel file) and manual filtering activities (e.g., selecting a specific data subset to continue the experiment).

Stateful/Asynchronous Invocations: Certain activities such as analyses or visualizations may be performed through interaction with stateful (web) services (i.e., services that allow the creation of jobs over remote grid environments). Such activities require the invocation of multiple operations at a service endpoint using a shared state identifier (e.g., a job ID). An example is given in the workflow of Figure 5.2, where the service invocation warp2D causes the creation of a stateful warping job. The call then returns a job ID, which is used to inquire about the job status (getJobStatus) and to retrieve the results (DownloadResults). The workflow motif catalog is available as an ontology for annotating workflows5. Usage examples can be seen in the ontology documentation6.
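The submit/poll/download pattern that characterizes this motif can be sketched in a few lines of Python. This is not the actual Taverna workflow of Figure 5.2: the endpoint URL, operation names and response format are hypothetical stand-ins for a stateful job service.

import time
import urllib.parse
import urllib.request

BASE = "http://example.org/warp-service"  # hypothetical endpoint

def call(operation: str, **params) -> str:
    """Invoke one operation of the (hypothetical) stateful service."""
    url = BASE + "/" + operation + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")

def run_warp2d(input_file: str) -> str:
    # 1. Submit the job; the service returns a shared state identifier.
    job_id = call("warp2D", input=input_file)
    # 2. Poll the job status using the shared job ID.
    while call("getJobStatus", jobId=job_id) not in ("DONE", "FAILED"):
        time.sleep(10)
    # 3. Retrieve the results once the job has completed.
    return call("downloadResults", jobId=job_id)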

5.1.5 Workflow Analysis Results

In this section we report on the frequencies of the data-operation and workflow-oriented motifs within the workflows selected for our analysis. We also discuss how they correlate with the workflow domains. The data resulting from this analysis is available online7.

5http://purl.org/net/wf-motifs
6http://vocab.linkeddata.es/motifs/version/14062013/

Figure 5.2: Sample motifs in a Taverna workflow for functional genomics. The workflow transfers data files containing proteomics data to a remote server and augments several parameters for the invocation request. The workflow then waits for job completion and inquires about the state of the submitted job. Once the inquiry call returns, the results are downloaded from the remote server.

A detailed explanation of the frequencies for each domain and for each workflow system can be found in (Garijo et al., 2014a). Figure 5.3(a) illustrates the distribution of data-operation motifs across the domains. The figure shows the predominance of the data preparation motif, which constitutes more than 50% of the data-operation motifs in the majority of domains. This is an interesting result, as it implies that data preparation steps are more common than any other activity, including those that perform the main (scientifically-significant) functionality of the workflow.

7Results available at http://dx.doi.org/10.6084/m9.figshare.1598104, http://purl.org/net/ro-motifPaper#inputs

(a) Distribution of data-operation motifs per domain. (b) Distribution of data preparation motifs per domain (the social network analysis domain is not included, as it does not have any data preparation motifs).

Figure 5.3: Distribution of the data-operation and data preparation motifs by domain.

The abundance of these steps is a major obstacle for understandability, since they clutter the documentation of the workflow and create difficulties for scientists aiming to reuse it. The social network analysis domain is an exception, as it consults, analyzes and visualizes queries and statistics over concrete data sources without performing any data preparation steps.

Figure 5.3(a) also demonstrates that within domains such as genomics, astronomy, medical informatics or biodiversity, where curated common scientific databases exist, workflows are used as data retrieval clients against these databases. Drilling down into data preparation, Figure 5.3(b) shows the dominance of the filter, input augmentation and output extraction motifs in most domains. Input augmentation and output extraction are activities that can be seen as adapters that help plug data analysis capabilities into workflows. Their number is higher in workflows relying on third-party services, i.e., most Taverna domains (biodiversity, cheminformatics, geoinformatics), while filtering is higher in Wings, Galaxy and VisTrails workflows. Figure 5.3(b) also demonstrates how the existence of a widely used common data structure for a domain, in this case the VOTable in astronomy8, reduces the need for domain-specific data transformations in workflows. Some of the differences between the systems are reflected in the motif results. As displayed in the comparative Figure 5.4 for the life sciences domain (a general domain, shown in Table 5.5, that includes the genomics, drug discovery, biodiversity, chemical informatics and medical informatics domains), input augmentation and output extraction steps are much less frequent in Wings, Galaxy and VisTrails (around 30%, 20% and 20% respectively, versus almost 50% in Taverna), as the inputs are either controlled (Galaxy, VisTrails) or strongly typed (Wings) and the data analysis steps are pre-designed to operate over specific types of data. In Figure 5.5(a) we observe that Wings workflows do not contain any data retrieval or movement steps, as data is pre-integrated into the workflow environment (data shipping activities are carried out behind the scenes by the Wings execution environment), whereas in Taverna the workflow carries dedicated steps for querying databases and shipping data to the locations needed for analysis. Galaxy and VisTrails also include components to retrieve content from external datasets into the environment (2% and 10% respectively), although we did not find steps for moving the data of intermediate steps to external services among the set of workflows analyzed. In the case of Galaxy this happens because most data retrieval and movement steps are performed transparently to the workflow execution (individual components are used to retrieve the data to the user domain, and that data is then used as input of the workflow), while in VisTrails the main analysis steps of the analyzed workflow set were performed using custom components.

8http://www.ivoa.net/Documents/VOTable/

Figure 5.4: Distribution of the data preparation motifs in the life sciences workflows.

Table 5.5: Distribution of the workflows from Taverna (T), Wings (W), Galaxy (G) and VisTrails (V) in the life sciences domain.

Domain               No. of workflows   Source
Drug Discovery       7                  W
Biodiversity         12                 T
Cheminformatics      7                  T
Genomics             90                 T (38), W (28), G (23), V (1)
Medical Informatics  7                  V
Total                123

Another interesting finding is the number of visualization steps found in the life sciences domain (Figure 5.5(a)). One feature of VisTrails and Galaxy is the tooling they include for the visualization of workflow results. In VisTrails workflows almost 40% of the motifs found are visualization steps, but this percentage is much lower in Galaxy (less than 5%). This is due to a separate visualization tool in Galaxy9, which reduces the need for visualization steps in the workflows. As shown in Figure 5.5(a), the proportions of visualization steps in Taverna and Wings are considerably smaller (around 2% and 10% respectively).

9https://main.g2.bx.psu.edu/visualization/trackster

(a) Distribution of the data-operation motifs in the life sciences workflows.

(b) Distribution of the workflow-oriented motifs in the life sciences workflows.

Figure 5.5: Distribution of motifs in the life sciences workflows.

The impact of the difference in the execution environments of the analyzed workflow systems is also observed in the workflow-oriented motifs, as can be seen in Figure 5.5(b). Stateful invocation motifs are not present in Wings, Galaxy and VisTrails workflows, as all steps are handled by a dedicated workflow scheduling framework/pipeline system and the details are hidden from the workflow developers. In Taverna's default configuration, no execution environment or scheduling framework is prescribed to the users. Therefore the workflow developers either 1) are responsible for catering for the various invocation requirements of external resources, which may include stateful invocations requiring the execution of multiple consecutive steps to undertake a single function, or 2) can develop specific plug-ins that wrap up stateful interactions and boilerplate steps. Regarding workflow-oriented motifs, Figure 5.6 shows that human interaction steps are increasingly used in scientific workflows, especially in the biodiversity and cheminformatics domains. Human interaction steps in Taverna workflows are handled either through external tools (e.g., Google Refine10), facilitated via a human-interaction plug-in, or through local scripts (e.g., selection of configuration values from multi-choice lists). However, we observed that several boilerplate set-up steps are required for the deployment and configuration of the external tooling supporting the human tasks. These boilerplate steps result in very large and complex workflows. Wings and VisTrails workflows do not contain human interaction steps. Galaxy is an environment that is heavily based on user-driven configuration and invocation of analysis tools (some parameters and inputs of the workflows can even be changed after the execution of the workflow has started). However, based on our definition of human interactions, i.e., analytical data processing undertaken by a human, the Galaxy workflows that we analyzed do not contain any human computation steps either. The workflow overloading motif also plays a relevant role, appearing in almost 10% of the analyzed workflows. Workflows containing this motif are considered advanced forms of sub-workflow development: by executing workflows in different settings, authors provide overloaded versions of the same functionality in different workflows to increase the coverage of target uses.

10https://code.google.com/p/google-refine/

Figure 5.6: Distribution of workflow-oriented motifs grouped by domain.

Finally, Figure 5.6 shows a large proportion of the composite workflow and internal macro motifs, up to more than 60% in some domains. This confirms that the use of sub-workflows (or repetitions of sets of steps among a group of workflows) is an established best practice for modularizing functionality and reuse. Sub-workflows can also be used to encapsulate the preparation steps and multi-step service invocations within "components" (Davies, 2011), in order to reduce obfuscation. However, users usually have to identify these sub-workflows with common functionality by hand. Ideally, a workflow system would suggest ways to simplify a workflow based on other published workflows, instead of relying on users to do so. We will address this point in Chapter 6.

5.1.6 Summary

The main findings of our analysis can be summarized as follows:

• We identified 6 major types of data-operation motifs and 6 types of workflow-oriented motifs that are used across different domains. From the former group, the most common motif is data preparation, highlighting that an important part of the effort in the creation of a workflow is dedicated to tool integration and data operation activities. Regarding workflow-oriented motifs, the number of internal macro motifs and composite workflows indicates that reuse is a common practice among scientists publishing their workflows.

• Some differences in the motif distribution are caused by the features of the workflow systems. The frequency with which the motifs appear depends on the differences among the workflow environments and the differences in domains. For instance, data preparation motifs are correlated with the type of execution environment for which the workflow is designed. In a workflow system such as Taverna, many steps in a workflow may be dedicated to the movement and retrieval of heterogeneous datasets, and to stateful resource access protocols. On the contrary, in workflow systems such as Wings, VisTrails and Galaxy we notice that some data preparation motifs, such as data movement, are minimal and in certain domains absent.

According to the analysis, common workflow fragments seem to be crucial for understanding how different workflows relate to each other and for creating different types of abstractions, which should have an impact on workflow reuse. Here we hypothesize that it is possible to detect commonly occurring workflow fragments automatically (H1.2), and that they can be useful for potential reusers (H1.3). Thus, in an effort to understand in depth the relation between workflows, workflow fragments and their reuse among the workflow corpora, the following sections perform two types of analysis. The first, in Section 5.2, is an automatic analysis exploring more than 850 unique workflows in order to confirm the findings obtained manually. The second, in Section 5.3, explores the user perspective on workflow fragment reuse and usefulness, highlighting the common practices of a community of users.

5.2 Analysis of Workflow and Workflow Fragment Reuse

Given the results obtained in the manual analysis presented in Section 5.1, we explore in this section how workflows are reused among a collection of workflows. Furthermore, we analyze how workflow fragments defined by users (as sub-workflows) are reused in other workflows. In order to do so, we first introduce the corpus used in the analysis, and then we analyze how many of the workflow fragments (and workflows) defined by users actually appear in other workflows. The analysis is performed from two perspectives: reuse in a corpus mainly designed by single users, and reuse in a corpus with workflows from many users. It is worth mentioning that the corpus used in this analysis is different from the one used in the motif analysis of Section 5.1, for several reasons. The first is to test whether the high number of composite workflows found in the previous workflow corpus is consistent with a reuse analysis on a corpus from a different workflow system. Similarly, the second reason is that most workflow reuse analyses presented in the state of the art (Chapter 2) are from the Taverna workflow system, and we wanted to confirm whether workflow reuse is also common in other workflow systems. The third reason is the number of available workflows from the workflow system being explored, the LONI Pipeline (described in Chapter 2). Finally, the fourth reason is the lack of support for sub-workflow creation in some of the systems considered in the manual motif analysis, which would considerably reduce the corpus, as we want to assess the reuse of user-defined workflow fragments as well.

5.2.1 Experimental Setup

We performed our analysis on 6815 workflows created with the LONI Pipeline workflow system. Of these workflows, 854 are unique, obtained after applying a simple filter to remove duplicates and single-step workflows. We selected the LONI Pipeline because it has two features that are of particular interest for this work:

• Grouping tools that allow users to define a "grouping" of components in a workflow. A grouping is a user-defined sub-workflow, which can be copied and pasted into different workflows and annotated. Groupings are shown in workflows as single steps that can be explored (and expanded) when necessary, following a macro abstraction.

• A public library of components, each uniquely identified and with well-defined functionality, which allows users to reuse popular components, although they can also create their own.

Figure 5.7 shows an example of a workflow with groupings defined by scientists using the LONI Pipeline. The figure shows a minimal deformation target (MDT) pipeline that serves as an unbiased average brain template for neuro-imaging studies. The grouping, which is expanded in the figure (top right), contains three steps and another grouping (compose vector fields).

Figure 5.7: An example of a workflow in the LONI Pipeline, with workflow steps (components) shown as circles. Outputs are shown as triangles, while the input (linearly registered) is a smaller circle. The connections among steps represent the dataflow. Users can select sub-workflows to create groupings of components (shown with dashed lines), which can be reused in the same workflow and in others (shown as rectangular components).

5.2.1.1 Workflow Corpus

The workflows are distributed among four corpora. Originally, we obtained two corpora from two different users, containing all the workflows created by them or in collaboration with other people. A third corpus contains the runs of 62 unique users submitted to the LONI Pipeline servers during January 2014. Finally, in a second iteration of our analysis, another user contributed more workflows to the experiment. The main features of each corpus are described below:

1. User Corpus 1 (WC1) A set of 790 workflows (441 workflows after filtering duplicated structures and single-step workflows) designed mostly by a single user. Some of the workflows are the result of collaborations with other users, which produced different versions of workflows originally produced by this user. The domain of the workflows is in general medical imaging (brain image understanding, 3D skull imaging, genetic modeling of the face, etc.), and some are still used by the LONI Pipeline community. Other workflows were designed for a specific purpose and are not reused anymore.

2. User Corpus 2 (WC2) A set of 113 well-documented workflows (94 after filtering) created, validated and curated by one user, sometimes in collaboration with others. Most of the workflows have been made public for others to reuse, and range from neuro-imaging to genomics. Some of the workflows were developed as early as 2007, and many of them are used in different institutions.

3. Multi-user Corpus 3 (WC3) A set of 5859 workflows (269 after filtering), submitted to the LONI Pipeline for execution by 62 different users over the span of one month (January 2014). The number of filtered workflows drops drastically from the input corpus because many of the executions are of the same workflow, or are one-step workflows designed for testing.

4. User Corpus 4 (WC4) A set of 53 workflows (50 after filtering), designed mainly by a single user. The workflows also focus on medical imaging (ranging from genomics to neuro-image analysis), and the corpus contains workflows that have been reused or are versions of other workflows in the corpora. Some of these workflows are large (more than 90 steps).

Table 5.6: Corpus overview.

Corpus              Original size   Size after filtering
WC1 (Single User)   790             441
WC2 (Single User)   113             94
WC3 (Multi User)    5859            269
WC4 (Single User)   53              50

In all four corpora, workflows are likely to reuse components from the public library, which allows groupings to be reused across different workflows. Table 5.6 provides an overview of the size of each corpus before and after filtering.

5.2.1.2 Analysis Methodology

The analysis was performed in an automated way for each corpus, once duplicate workflows and single-step workflows had been removed. In order to have a full picture of how workflows and groupings are reused, we performed two independent sub-analyses. In both, each workflow or grouping was transformed into a labeled graph (according to P-Plan) and compared against the others in terms of its steps. Since in the LONI Pipeline each component has a unique identifier, we use these identifiers to label the nodes of the graph (the label becomes the type of the workflow step). Therefore, two nodes are the same if they have the same label (i.e., the same type).

Both analyses follow a similar approach. First, an initial pass over the dataset gathers the different workflows or groupings to compare. Then, a second iteration compares each graph (workflow or grouping) against all the workflows of the corpus, annotating those workflows where the graph is found. The comparison is done by transforming each graph into a SPARQL query (i.e., the standard language for querying RDF data (Harris et al., 2013)) and issuing it against the target workflow. Further details of this approach are described in Chapter 6, where we use it to find workflow fragments in workflows.

It is worth mentioning that with our approach we check whether a workflow or a grouping is included in another workflow regardless of the grouping annotations made by users in the target workflow. That is, if a grouping consists of three steps A, B and C, where C consumes the output of B and B consumes the output of A, and we find these steps in a workflow with the same dependencies among them, we consider that the grouping is found in that workflow (even if the steps A, B and C of the workflow have not been annotated as a grouping by a user). This is a key difference from other related work, where reuse is measured just in terms of the explicit workflows and sub-workflows created by users. As with workflows, if a grouping is duplicated we only consider it once, and single-step and empty groupings are not considered for reuse. Groupings are only compared against workflows, not against other groupings.
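As an illustration of the transformation of a grouping into a SPARQL query, the following Python sketch builds an ASK query from a small grouping. It is a minimal reconstruction, not FragFlow's actual code: the step identifiers and component types are invented, and we assume the P-Plan encoding described above, with rdf:type carrying the component type and p-plan:isPrecededBy carrying the step ordering.

# A grouping as a labeled graph: nodes map step ids to component types;
# each edge states that the target step is preceded by the source step.
nodes = {"s1": "http://example.org/components/AlignLinear",
         "s2": "http://example.org/components/ComposeVectorFields"}
edges = [("s1", "s2")]  # s2 consumes the output of s1

def grouping_to_sparql_ask(nodes: dict, edges: list) -> str:
    """Build a SPARQL ASK query that checks whether the grouping occurs
    in a target workflow graph (assumed to be encoded with P-Plan)."""
    patterns = [f"?{s} a <{t}> ." for s, t in nodes.items()]
    patterns += [f"?{tgt} p-plan:isPrecededBy ?{src} ."
                 for src, tgt in edges]
    body = "\n  ".join(patterns)
    return ("PREFIX p-plan: <http://purl.org/net/p-plan#>\n"
            f"ASK {{\n  {body}\n}}")

print(grouping_to_sparql_ask(nodes, edges))

Issuing the resulting ASK query against the RDF graph of a target workflow returns true whenever the grouping's steps, with the same dependencies, appear in that workflow, regardless of whether they were annotated as a grouping by a user.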

5.2.2 Workflow Reuse Analysis Results

Table 5.7 shows how workflows are reused in the corpora, while Tables 5.8 and 5.9 refer to the distribution and reuse of the groupings. As we suspected, a significant portion of the workflows are reused (on average, 25.5% of the workflows are reused across the four corpora), counting those workflows that are reused at least once elsewhere in the corpus. The more workflows a corpus has (as in the first and third corpora), the more workflows are reused. This makes sense, as the workflows of each corpus are designed for the same domain, and many have overlapping functionality.

Table 5.7: Reuse of workflows (wfs) for each corpus.

Corpus   Unique filtered workflows   Reused wfs. (f >= 2)   Total times wfs. are used in corpus   Most occurrences of a wf.
WC1      441                         175                    1638                                  94
WC2      94                          13                     43                                    9
WC3      269                         106                    619                                   25
WC4      50                          7                      19                                    5

As shown in Table 5.8, the total number of groupings is more than twice the number of unique groupings, indicating that defining groupings is a common practice among the users of the workflow system. Groupings are also rarely found alone: when a workflow uses groupings, it is common to find more than four of them. Table 5.9 illustrates that groupings are reused even more heavily than workflows. Between 44% and 65% of the unique groupings are reused in other workflows across the four corpora, and the number of times they are used in a corpus is much higher than the total number of groupings defined by users. This suggests that some users include in their workflows steps that other users have defined as groupings. The number of workflows with groupings is higher when a single user created the corpus (309 out of 441, 41 out of 94 and 17 out of 50 in corpora WC1, WC2 and WC4, versus 70 out of 269 in corpus WC3). In WC1, WC2 and WC4 the creators of the workflows are experienced users who know their previous workflows and are likely to reuse them, while the high number of users contributing to WC3 makes it difficult for all of them to be aware of the workflows of their colleagues. Another interesting observation concerns the size of the groupings, which goes up to 92 steps in corpus WC4 and down to 0 steps in corpora WC2 and WC4. After exploring several workflows showing this practice, we realized that the high number of steps in some groupings arises because users sometimes declare a whole workflow as a grouping. A possible explanation is that this either facilitates copying and pasting the grouping into other templates, or helps the creator organize the workflow: when workflows are too complex, users often separate their functionality into several smaller workflows. Each smaller workflow is then declared a grouping and copied and linked into a bigger workflow. Regarding the minimum size of groupings, we discovered that workflow creators sometimes group unused inputs or outputs in workflows, leading to groupings of 0 steps. A possible explanation for groupings of size 1 is that workflow creators annotate extra instructions when using a specific component in a workflow (in our analysis only groupings of size bigger than one are considered, shown in the unique multi-step grouping column of Table 5.8).

Table 5.8: Statistics and distribution of groupings in the corpora.

Corpus   Total group.   Unique multi-step group.   Wf. with group.   Avg. group. per wf.   Max n. of steps in group.   Min n. of steps in group.
WC1      1379           146                        309               4                     56                          1
WC2      294            87                         41                7                     39                          0
WC3      277            79                         70                4                     60                          1
WC4      245            40                         17                14                    92                          0

Table 5.9: Reuse of groupings (group) for each corpus.

Corpus   Unique filtered groupings   Reused group. (f >= 2)   Total times group. are used in corpus   Most occurrences of a grouping
WC1      146                         65                       1842                                    176
WC2      87                          40                       227                                     29
WC3      79                          47                       246                                     24
WC4      40                          26                       103                                     7

In summary, our automated analysis confirms what our manual analysis initially highlighted: workflow reuse is a common practice, although it is less frequent than grouping reuse. Groupings not only identify commonly reused fragments of workflows, but also seem to be used to modularize and simplify them for better understanding. Groupings are widely reused independently of the users who have contributed to the corpora, although their number is higher when a single user contributes to the corpus. Given these results, we believe that suggesting some of these commonly used groupings may be beneficial for users designing and visualizing workflows. In order to explore this possibility, we present in the following section a user survey on workflow reuse and discuss its results.

5.3 Workflow and Workflow Fragment Reuse: User Survey

Although some workflow reuse analyses and benefits have been introduced in Section 2.3, there are barely any studies that elaborate on the reasons users give for reusing workflows and workflow fragments. In this section we aim to understand such reasons by discussing the results of a survey on workflow reuse performed in labs that use the LONI Pipeline to create workflows. The LONI Pipeline community of users provides a great opportunity to study how workflow fragments are used in practice, whether they improve reuse or not, and the barriers that users encounter when reusing workflows and workflow fragments.

5.3.1 Experimental Setup

We selected the LONI Pipeline as our system of choice for this analysis for three main reasons. First, the automatic analysis described in Section 5.2 has shown that it is a workflow system with a high rate of reused workflows and groupings. Second, the LONI Pipeline provides the ability to define groupings, as introduced in Section 5.2, allowing users to define and reuse groups of steps that can be copied and pasted into new workflows. Finally, the third reason is the interest of the community in exploring workflows and their fragments. Workflow mining tools like the Workflow Miner11 have been developed to help users browse how different workflow components depend on each other in a catalog of workflows.

11http://pipeline.bmap.ucla.edu/products-services/workflow-miner/

The fact that the LONI Pipeline includes tools to create, mine and view workflow groupings is an indication that users and/or developers have found a need for reusing workflow fragments. In our analysis, we set out to understand the current level of adoption, the perceived benefits, and the barriers regarding the reuse of workflows and workflow fragments. We created a survey and sent it to the mailing list of users of the LONI Pipeline. We included users who use the LONI Pipeline system installed at the University of Southern California (USC), but due to time limitations we did not include the many other users who have downloaded the system and run it on their own servers. The survey was conducted online, and the responses were anonymous. The complete survey can be found in Annex B, while the responses are available online12. The survey included two kinds of questions. Some questions presented a choice of answers using a five-level Likert scale (Likert, 1932). For example, for the question "Is reusing workflows from others useful?" we offered five answers: very often, often, sometimes, occasionally, and never. Other questions offered a list of possible answers and allowed users to provide their own. For example, for the question "Why is reusing previously created workflows useful?" the list of possible answers included "It saves time", "Workflows give a high-level diagram that helps remember what was done", and "Other". If the latter was chosen, respondents could provide text with their own reasons. Respondents could make more than one selection. It is worth mentioning that the survey tackles the issue of workflow reuse gradually: it first asks questions regarding the reusability of software, then about the reusability of methods (i.e., workflows), and it ends with questions regarding the reusability of fragments of those workflows. This way we ensure that we cover all the relevant aspects of workflow reuse.

5.3.2 User Survey Report

We received a total of 21 responses from the users in the lab, ranging from students to experienced users and programmers. In this section we discuss the results of the survey, highlighting our findings in boldface.

12http://dx.doi.org/10.6084/m9.figshare.1572327

5.3.2.1 Writing and Sharing Code

We wanted to have a reference point for comparing the responses about workflow sharing, so the survey included some questions about code sharing. Figure 5.8 shows the responses regarding the importance of writing and reusing code. Writing code is considered very important for this area of research. Sharing code is not considered to be as important. These answers imply that the respondents are well aware of the importance and value of their software.

Figure 5.8: Distribution of the answers regarding the utility of writing and sharing code and the utility of the workflow system.

5.3.2.2 Adopting a Workflow System

Figure 5.8 also shows responses regarding the basic utility of the workflow system for creating workflows for the respondents' work. The overwhelming majority of respondents found the workflow system useful. This perhaps reflects a self-selection bias in the user population that responded, but it is nevertheless useful for putting the survey responses and the conclusions of this study in perspective. Figure 5.9 shows the most usual sizes of workflows according to the respondents. Workflows of fewer than 10 steps seem to be the most common. We also asked for reasons not to use the workflow system, assuming that even users of the workflow system may not use it for all their analyses. We offered one predefined choice and then free-text answers. Two respondents selected the given choice of "It takes time to learn to create workflows".

Free-form answers included "Minor changes to underlying scripts or tuning of parameters may require more work than just editing scripts themselves," and "Sometimes it is easier to run a certain command in loops or batches or to edit the various input/output parameters (file names, paths, options, etc.) on the command line, rather than clicking through the workflow GUI". Overall, all respondents seem to find utility in the workflow system.

Figure 5.9: Preferred size of created workflows.

5.3.2.3 Using Workflows

Figure 5.10(a) shows the survey answers regarding the utility of creating and sharing workflows. Respondents overwhelmingly answered that creating workflows is very useful, while reusing workflows, although useful, was seen as less so. Therefore, reuse is not the only reason why workflows are created. Reusing workflows from a user's prior work is considered as useful as reusing workflows from others. When asked "Why is creating workflows useful?", respondents were given the choices shown in Table 5.10, together with the number of respondents who selected each choice. The benefits of workflows that the majority of respondents agreed with include time savings, organizing and storing code, having a visualization of the overall analysis, and facilitating reproducibility. Many respondents agreed with other benefits, including debugging complex code and encouraging the adoption of standard ways of doing things. Free-form responses included:

"Workflows are mainly used for population studies so that you can run many subjects in the same time, and it is easy to pass around to someone who doesn't know how to code", "The main reason is that it is easy to send a prepared pipeline to another researcher and they can usually figure out how to use it, regardless of their programming knowledge", "It's a really intuitive visualization of the underlying code. Sort of brings the code 'to life'!" and "Parallelizing without having to use the Sun Grid Engine script".

Table 5.10: Survey results concerning why it is useful to create workflows.

Workflows save time                       13
Easier to track and debug complex code    9
Convenient way to organize/store code     11
Help write more organized code            6
Help make code more modular/reusable      4
Help make methods more understandable     8
Visualization of overall analysis         11
Workflows facilitate reproducibility      10

Table 5.11a shows responses to the question of why it is useful to reuse workflows in new analyses. Overwhelmingly, users found that reusing workflows saves them time. They also found the visualization of the workflow useful. Free-form answers included: "We often re-run the exact same or very similar analysis steps on our data (e.g., pre-processing, statistical tests), so often we only need to change the inputs and outputs (and maybe some parameters)". Table 5.11b shows responses on why it is useful to share workflows with others. The overwhelming majority of respondents said workflows are useful both for non-programmers and for teaching new students. Sharing also saves time because others do not need to re-implement code. No free-form answers were provided. Table 5.11c shows answers on why workflows are not shared. Respondents did not offer compelling reasons for not sharing workflows. Free-form answers included "The best pipelines to share are the ones that have all the kinks worked out, so we can explain how to edit the input and output file names and then the person can just run it". Table 5.11d shows responses on why it is not useful to reuse workflows from others.

Respondents did not offer compelling reasons for not reusing workflows from others either. Free-form answers included "Documentation can be easily fixed by adding comments or providing a verbal/written explanation along with the pipeline".

Table 5.11: Survey results with multiple choice answers concerning benefits of sharing workflows.

(a) Is reusing workflows in new analyses useful?

Saves time                          19
Gives a diagram of what was done    13

(b) Why is it useful to share workflows with others?

Non-programmers can use them               20
New students can easily learn              19
No need for others to re-implement code    14
Adoption of standard ways to do things     9

(c) Why are workflows not shared?

Others would not want to use them                                   1
Others ask too many questions to the creators                       2
Workflows from others are difficult to understand                   3
It is difficult to understand how to prepare data for a workflow    3

(d) Why is it not useful to reuse workflows from others?

Workflows from others are difficult to understand                    4
It is difficult to understand how to prepare data for a workflow     2
Workflows created by others are too specific                         1
It is hard to take workflows created by others and make them work    2

5.3.2.4 Using Groupings

Figure 5.10(b) shows the answers regarding the utility of sharing and reusing groupings. As with workflows, reuse is not the only reason why groupings are created. Unlike workflows, reusing groupings from one's own work is considered more useful than reusing groupings from others. Table 5.12 shows the results of the multiple-choice questions about groupings. Most respondents agreed that groupings help simplify workflows. Groupings also make workflows more understandable to others.

(a) Distribution of the answers regarding the utility of sharing and reusing workflows.

(b) Distribution of the answers regarding the utility of sharing and reusing groupings.

Figure 5.10: Utility of workflows and groupings.

Like workflows, groupings save time. Groupings also make code more modular and more understandable, more so than workflows. Groupings are seen as useful to non-programmers and students. Very few respondents gave reasons for not sharing groupings or for not reusing groupings from others. A free-form answer on why groupings are not used was "It is a pain to dissect when debugging to know where things failed". On why groupings are not shared, one respondent selected that it is hard to explain what they do, and a free-form answer was "Others want a finished product, not pieces that they have to put together on their own".

In summary, if we compare the responses in Figures 5.10(a) and 5.10(b) and Tables 5.11 and 5.12, workflows are considered generally more useful than groupings. However, more respondents said that groupings help make their code more modular and understandable.

Table 5.12: Survey results with multiple choice answers concerning benefits of sharing groupings.

(a) Why is creating groupings useful?

Visualization of the analysis                      10
To simplify workflows that are complex overall     12
To make workflows more understandable to others    12

(b) Is reusing groupings in new analyses useful?

Groupings save time                      12
Help make code more modular/reusable     10
Help make methods more understandable    7

(c) Why is it useful to share groupings with others?

Non-programmers can use them               12
New students can easily learn              11
No need for others to re-implement code    9
Adoption of standard ways to do things     6

(d) Why are groupings not shared?

Others would not want to use them                                   0
Others ask too many questions of the creators                       1
Workflows from others are difficult to understand                   4
It is difficult to understand how to prepare data for a grouping    1

(e) Why is it not useful to reuse groupings from others?

Groupings from others are difficult to understand                    2
It is difficult to understand how to prepare data for a grouping     3
Groupings created by others are too specific                         1
It is hard to take groupings created by others and make them work    4

5.4 Summary

In this chapter we defined the scope of the workflow abstractions tackled in our work, focusing on those related to workflow reuse. We defined a catalog of common workflow motifs, which are domain-independent conceptual abstractions of workflow steps, by performing an empirical analysis on a corpus from four different workflow systems. This addresses the third research objective of this thesis (RO3, define a catalog of domain-independent abstractions in scientific workflows) and contributes to the research challenges RCAbstract-1 (difficulty in determining which are the significant steps of a workflow), RCAbstract-2 (absence of a catalog of abstractions based on workflow step functionality) and RCAnnot-1 (facilitate workflow annotation). The results of the analysis highlight that three motifs in particular are widely used among the corpora: atomic workflows, composite workflows and internal macros, which indicates that workflow reuse is a common practice among scientists of different domains.

We have attempted to confirm and understand further two aspects of workflows and workflow fragments: reuse and usefulness. Therefore we performed analyses on another workflow system (the LONI Pipeline), different from the ones explored in the manual analysis, with an active community of users and a significant corpus of workflows. We first performed an automated analysis of workflows to measure reuse, based on the sets of steps users have defined in the corpora (user groupings). We also conducted a survey of 21 users to find out their opinions with respect to workflow and sub-workflow sharing and reuse. The results speak for themselves: the workflow reuse rate is high (with a minimum of 12%, increasing up to 30% in cases where the size of the collection is more than 200 workflows), and the grouping reuse rate is even higher (44% to 65% of the groupings are reused). Interestingly, users consider workflows more useful than groupings from the point of view of reuse, although they recognize that groupings help them modularize and understand their code better.

Given that users tend to reuse groupings and workflows, our hypothesis is that by mining commonly used fragments of workflows we can discover new groupings, or even workflows, that are useful to users when designing workflows. Hence, in the next chapter we describe our approach for detecting these common workflow fragments automatically.

Chapter 6

Workflow Fragment Mining

As we showed in the survey described in Section 5.3, common groupings are helpful mechanisms for simplifying workflows and helping users understand them. However, grouping definition is currently a completely manual process. Unless published online and documented, the only way to explore other people's groupings is to individually download their workflows and explore them locally. As stated in Chapter 3, we hypothesize that by mining an existing workflow corpus we can extract common sub-workflows and suggest them to users as potentially useful groupings. In this chapter we introduce our approach for automatically detecting common sub-workflows in a workflow corpus, which we refer to as common workflow fragments, defined in Definition 2 below using (Jiang et al., 2012) as reference.

Definition 2 A workflow fragment Wf = (N, A, L_N, L_A, φ_N, φ_A) of a workflow W = (V, E, L_V, L_E, φ_V, φ_E) is a connected sub-workflow of W, i.e., N ⊆ V with L_N ⊆ L_V and ∀v ∈ N, φ_N(v) = φ_V(v); and A ⊆ E with L_A ⊆ L_E and ∀a ∈ A, φ_A(a) = φ_E(a).

Given a workflow fragment wf, a corpus of workflows C = (c_1, c_2, ..., c_n) and a function F : ⟨wf, C⟩ → ℕ that calculates the number of times wf appears in C, a common workflow fragment is a workflow fragment that appears at least f times in the workflow corpus (with F(wf) = f and f ≥ 2). There are two ways in which F can count the occurrences of a workflow fragment wf in a corpus. If F considers all the occurrences of wf in C, then we refer to F as frequency. If F counts only one occurrence of a workflow fragment per workflow (i.e., F(wf) = |{c_i | wf ⊆ c_i}|), then we refer to F as support.

An overview of our approach for workflow fragment mining is shown in Figure 6.1. This approach is implemented in FragFlow, an open-source framework, and is available online1.

First, a data preparation step filters and formats the workflow corpora according to the models proposed in Chapter 4 (Section 6.1). Then we apply different graph mining techniques (further described in Section 6.2.1) and filter the results to produce a set of candidate common workflow fragments (Section 6.3). Finally, the fragments are visualized, linked to the workflows of the corpus where they appear, and statistics on their frequency and size are calculated (Sections 6.4 and 6.5).

Figure 6.1: An overview of our approach for common workflow fragment mining. The rectangles represent major steps, while the ellipses are the inputs and results from each step. Arrows represent where an input is used or produced by a step.

1https://github.com/dgarijo/FragFlow, http://dx.doi.org/10.5281/zenodo.11219

6.1 Data Preparation

FragFlow takes as input a workflow corpus, which may integrate workflows from one or more workflow systems represented with OPMW's labeled directed acyclic graphs (LDAGs, see Chapter 4). Other serializations are supported, but OPMW facilitates the conversion of workflows to each of the formats needed by the tools introduced in the following sections. In this step, workflows are converted to a simplified LDAG format. For each LDAG, the label of each vertex corresponds to the type of the corresponding step in the workflow, while the edges capture the dependencies between different steps (using the P-Plan vocabulary). Data dependencies are removed to simplify the graph and reduce the overhead of the graph mining techniques applied in subsequent steps. An example can be seen in Figure 6.2, where the full workflow on the left is represented as the LDAG in the middle, which is in turn represented with P-Plan as shown on the right. By removing the data dependencies, four nodes in the graph are eliminated. Workflows are processed at their lowest granularity, i.e., groupings and other user annotations are ignored.

Duplicated workflows are removed from the input corpus. This is necessary to reduce the noise in the sample and avoid obtaining misleading common workflow fragments. For example, if the source of the corpus is the set of monthly executions submitted to a server, then it is likely to contain workflows that were submitted several times for execution (hence, repeated). If duplicated workflows were not removed, the most common fragments would simply be the most frequently executed workflows, rather than fragments that are genuinely reused. Single-step workflows are also removed, as they are not meaningful workflow fragments; they are often run for testing individual components, which are then plugged into a bigger workflow.
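The preparation step can be sketched in Python with the networkx library. This is a simplified reconstruction, not FragFlow's actual code: workflows are assumed to arrive as dictionaries of typed steps plus step-to-step dependencies, and duplicate detection is approximated by comparing the set of labeled edges of each graph (a real implementation would test graph isomorphism).

import networkx as nx

def to_ldag(steps: dict, dependencies: list) -> nx.DiGraph:
    """Build a simplified LDAG: nodes are steps labeled with their
    component type; edges are direct step-to-step dependencies
    (data nodes are assumed to have been collapsed already)."""
    g = nx.DiGraph()
    for step_id, step_type in steps.items():
        g.add_node(step_id, label=step_type)
    g.add_edges_from(dependencies)
    return g

def canonical_form(g: nx.DiGraph) -> frozenset:
    """A crude canonical form: the set of labeled edges. Two workflows
    with the same form are treated as duplicates (an approximation)."""
    return frozenset((g.nodes[u]["label"], g.nodes[v]["label"])
                     for u, v in g.edges())

def prepare_corpus(workflows: list) -> list:
    """Drop single-step workflows and duplicated structures."""
    seen, corpus = set(), []
    for g in workflows:
        if g.number_of_nodes() < 2:
            continue  # single-step workflows are not meaningful fragments
        form = canonical_form(g)
        if form not in seen:
            seen.add(form)
            corpus.append(g)
    return corpus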

6.2 Common Workflow Fragment Extraction

The second step of FragFlow is the graph mining algorithm. Our approach for detecting common fragments in a corpus of workflows relies on the application of existing graph mining techniques. In this section we provide an overview of the main graph mining approaches related to the problem of sub-graph mining, along with their features and the candidate algorithms adopted in this thesis. For a detailed explanation of these and other similar techniques for graph and tree mining, we refer the reader to (Jiang et al., 2012).

Figure 6.2: Simplifying workflows: by removing the data dependencies of the input graph on the left, we reduce the overhead for graph mining algorithms (middle and right).

6.2.1 Frequent Sub-graph Mining

When representing workflows as graphs, finding the common workflow fragments of an input corpus is an adaptation of the Frequent Sub-graph Mining (FSM) problem, also referred to as Frequent Sub-graph Matching in the literature (Jiang et al., 2012). The objective of FSM is "to extract all the frequent sub-graphs, in a given data set, whose occurrence counts are above a specified threshold" (Jiang et al., 2012). The main difference with our case is that we are not interested in obtaining all the workflow fragments, just those that are useful for potential reuse. Hence, we first mine all the common workflow fragments and then filter them with the techniques explained in Section 6.3. The core of FSM is (sub)graph isomorphism detection, a problem with NP-complete complexity (Cook, 1971).

In our case, we base our work on the definitions introduced in (Jiang et al., 2012), briefly summarized in Definition 3.

Definition 3 A graph G1 = (V1, E1, L_V1, L_E1, φ_V1, φ_E1) is isomorphic to another graph G2 = (V2, E2, L_V2, L_E2, φ_V2, φ_E2) if there is a bijective function f : V1 → V2 which maps all the vertices of G1 to all the vertices of G2 and vice versa, i.e., (i) ∀v ∈ V1, φ_V1(v) = φ_V2(f(v)); (ii) (u, v) ∈ E1 ⇔ (f(u), f(v)) ∈ E2; and (iii) ∀(u, v) ∈ E1, φ_E1(u, v) = φ_E2(f(u), f(v)). We say that G1 is sub-graph isomorphic to G2 if and only if there is a sub-graph g ⊆ G2 that is isomorphic to G1.

There are two main types of FSM approaches that are relevant to our work: inexact FSM techniques (which can find generic workflow fragments, although they may not deliver all the existing frequent sub-graphs in the corpus) and exact FSM techniques (which discover all the possible frequent sub-graphs in the input corpus). These techniques are either frequency-based or support-based depending on the algorithm implementing them. We describe and discuss them further below.

6.2.1.1 Inexact FSM Techniques

These techniques use approximate measures to calculate the similarity between two graphs and detect the commonalities among them (Jiang et al., 2012). Figure 6.3 shows an example with three workflows (on top) to which an inexact mining technique is applied (in this case, an iterative graph clustering approach). The most common patterns detected by the technique can be seen in the lower part of the figure. Fragment 1 has an exact match in the three workflows, but the second fragment found groups two very similar parts of the workflow together (a connection from a step A to Fragment 1). In general, these techniques apply heuristics to detect the most common workflow fragments efficiently. However, the solution provided is not always complete: these techniques may not identify all the possible common fragments in the corpus of graphs. There are many existing inexact graph mining algorithms, developed for different domains (detection of molecules in genomics and chemistry, grammar learning, etc.).

Figure 6.3: Example of an inexact graph mining technique applied to three different workflows (on top of the figure). The results can be seen in the lower part of the figure.

Some common examples are SUBDUE (Cook and Holder, 1994; Holder et al., 2002), which uses a graph clustering approach, applying various metrics to reduce the input graph on each iteration of the algorithm; GREW (Kuramochi and Karypis, 2004b), which follows a similar approach, contracting the edges of the input graph while rewriting it and adding special heuristics to find longer and denser fragments; and SIGMA (Mongiovi et al., 2010), which uses a set-cover-based approach, associating a feature set with each edge of the candidate fragment. An extended list of algorithms can be found in (Jiang et al., 2012).

6.2.1.2 Exact FSM Techniques

These techniques aim to deliver all the possible frequent sub-graphs included in a graph corpus. Since they look for all possible solutions, the computational overhead may be excessive in some cases, especially when the size of the common sub-graphs grows. An example of the results of this kind of technique can be seen in Figure 6.4, where it is applied to three different workflows (single-step fragments are ignored for simplicity). In the results, we can see that the pattern A followed by B (A → B), which is included in A → B → C with the same number of occurrences, is returned as a solution as well.

Exact FSM techniques are more common than inexact FSM approaches. Exact FSM techniques can be based on breadth-first search (BFS) or depth-first search (DFS) of the graph. Well-established examples of BFS are the Apriori Graph Mining algorithm (AGM) (Inokuchi et al., 2000), which uses an adjacency matrix to represent graphs in combination with an efficient level-wise search to discover common sub-graphs; or the FSG algorithm (Kuramochi and Karypis, 2001), which builds on the Apriori algorithm by adding a sparse graph representation to reduce storage and computation and by incorporating optimizations for the generation of candidates. Common DFS examples are gSpan (Yan and Han, 2002), which uses a canonical representation for encoding each sub-graph and sets a lexicographic order for efficiently applying the DFS strategy; or Gaston (Nijssen and Kok, 2004), which detects the common sub-graphs by first detecting common paths, transforming them into trees and evolving them into graphs. Additional exact graph mining techniques are described in (Jiang et al., 2012).

Figure 6.4: Example of an exact match graph mining technique applied to three different workflows (on top of the figure). The results can be seen in the lower part of the figure.

6.2.1.3 Exact FSM versus Inexact FSM Techniques

Both exact and inexact FSM techniques have advantages and disadvantages. On the one hand, inexact FSM techniques tend to find smaller but highly frequent patterns, including some interesting patterns not detected by exact graph mining techniques. However, they might not identify all potentially useful workflow fragments, since the results they provide are incomplete. On the other hand, exact FSM techniques identify all possible fragments in the input corpus. However, when the frequent sub-graph size in the input dataset is high, exact FSM techniques might return too many fragment candidates, while consuming significant computational resources. Adjusting the parameters of the search (e.g., the frequency of the desired workflow fragments) is important in order to obtain the best results.

6.2.1.4 Frequency-based versus Support-based Techniques

Another important aspect is how each technique considers the graphs of the input corpus. Some approaches are frequency-based (see Definition 2), i.e., they count every appearance of a candidate structure, while others are support-based (see Definition 2), i.e., they only consider a candidate structure once per input graph. An example is depicted in Figure 6.5: if we consider our threshold for detecting fragments to be at least two occurrences, the two-step fragment A → D would not be recognized by a support-based approach, since it only occurs in Workflow 3. However, since it appears twice there, a frequency-based approach would consider the sequence of steps as a candidate fragment. Similarly, the fragment A → B would be considered to occur three times by a frequency-based approach (twice in the first workflow and once in the second), while in a support-based approach it would occur only twice. Both techniques produce valuable workflow fragments, and both are worth considering for detecting common workflow fragments.

Figure 6.5: Example of a support-based approach versus a frequency-based approach. In a support-based approach, the occurrences of the fragment A → B are two (one occurrence in the first workflow and another one in the second workflow), while in a frequency-based approach the occurrences would be three (two in the first workflow and one in the second).
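The difference can be illustrated with the following Python sketch (ours; workflows are simplified to lists of labeled edges, matching the example of Figure 6.5):

workflows = [
    [("A", "B"), ("A", "B"), ("B", "C")],  # Workflow 1: A->B twice
    [("A", "B"), ("C", "D")],              # Workflow 2: A->B once
    [("A", "D"), ("A", "D")],              # Workflow 3: A->D twice
]

def frequency(fragment, corpus):
    # Frequency-based: every occurrence counts, even within one workflow.
    return sum(wf.count(fragment) for wf in corpus)

def support(fragment, corpus):
    # Support-based: each workflow contributes at most one occurrence.
    return sum(1 for wf in corpus if fragment in wf)

print(frequency(("A", "B"), workflows))  # 3: kept by both with threshold 2
print(support(("A", "B"), workflows))    # 2
print(frequency(("A", "D"), workflows))  # 2: kept by a frequency-based approach
print(support(("A", "D"), workflows))    # 1: discarded by a support-based one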

6.2.2 Frequent Sub-graph Mining in FragFlow

FragFlow integrates three different existing techniques for extracting common workflow fragments: one inexact graph mining approach with two different heuristics, and two exact graph mining techniques. These techniques were selected for several reasons:


• Technique features: the algorithm represents a type of technique that could provide different results than the other integrated graph mining techniques (i.e., inexact graph mining techniques with different heuristics, exact mining techniques versus inexact mining techniques, etc.). In this work we have tried to use techniques that can retrieve different types of patterns. Table 6.1 includes the different techniques we have taken into account, based on the discussion presented in Section 6.2.1.

• Availability: the algorithm is open source and available for download, as the focus of the work is not to reimplement algorithms described in papers.

• Documentation: the algorithm is documented or has guidelines describing how to use it, providing examples, details on the input format, additional configuration parameters, etc. Some of the algorithms described in the state of the art are domain-specific and not trivial to reuse without proper documentation.

• Input/output representation: the input of the algorithm follows a standard or conforms to a widely used representation.

As shown in Table 6.1, FragFlow integrates two different heuristics of the SUBDUE algorithm (Cook and Holder, 1994) as inexact graph mining techniques, the gSpan algorithm (Yan and Han, 2002) as an exact match depth-first search technique and the FSG algorithm (Kuramochi and Karypis, 2004a) as an exact match breadth-first search technique. All the currently integrated inexact FSM techniques are frequency-based, while the exact FSM techniques are support-based. The rest of this section provides details on each of the integrated approaches.

Table 6.1: Frequent sub-graph mining algorithms integrated in FragFlow.

Algorithm   Mining type   Occurrences   Search strategy
SUBDUE      Inexact       Frequency     N/A
FSG         Exact         Support       BFS
gSpan       Exact         Support       DFS

6.2.2.1 SUBDUE

This algorithm was created to detect the most common and repetitive substructures within structural data (Cook and Holder, 1994). SUBDUE uses a hierarchical graph clustering approach, detecting on each iteration of the algorithm the most adequate candidate structure and compressing the input graph with it. Each candidate structure is selected by applying a heuristic, which determines the benefits of reducing the total graph with that candidate structure. SUBDUE uses two main heuristics, which we have used in our analyses:

1. Minimum description length (MDL): a heuristic based on the minimum description length principle introduced by Rissanen (Rissanen, 1978). At each iteration, the best candidate substructure is chosen by trying to minimize I(S) + I(G | S), where S is the substructure, G is the input graph, I(S) is the number of bits necessary to encode the substructure and I(G | S) is the number of bits necessary to encode G with respect to S (Cook and Holder, 1994). After the candidate has been chosen, the algorithm compresses the graph and continues iterating until no further reduction can be applied.

2. Size: at each iteration, the best fragment is chosen according to how much the overall size of the graph collection is reduced. That is, this heuristic aims to minimize size(S) + size(G | S), where S is the substructure, G is the input graph, size(S) is the number of vertices plus the number of edges of the substructure and size(G | S) is the number of vertices plus the number of edges of G with respect to S. Thus, the more frequent and the bigger a candidate substructure is, the more likely it is to be selected.
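As a minimal sketch of the Size heuristic (ours, not SUBDUE's actual code; it assumes non-overlapping occurrences of the candidate and keeps edges crossing an occurrence boundary), the score of a candidate substructure could be computed as follows:

def size(n_vertices, n_edges):
    # size of a graph: number of vertices plus number of edges
    return n_vertices + n_edges

def size_after_compression(g_vertices, g_edges, s_vertices, s_edges, occurrences):
    # Each occurrence of S is replaced by a single new vertex, so per
    # occurrence we remove |V(S)| - 1 vertices and the |E(S)| internal
    # edges; boundary edges are ignored in this simplification.
    v = g_vertices - occurrences * (s_vertices - 1)
    e = g_edges - occurrences * s_edges
    return v + e

def size_score(g_vertices, g_edges, s_vertices, s_edges, occurrences):
    # SUBDUE minimizes size(S) + size(G | S): frequent and large
    # substructures compress G the most and get the lowest score.
    return size(s_vertices, s_edges) + size_after_compression(
        g_vertices, g_edges, s_vertices, s_edges, occurrences)

# Example: a 10-vertex, 12-edge graph; candidate S has 3 vertices and
# 2 edges and occurs twice. Score = (3 + 2) + (6 + 8) = 19.
print(size_score(10, 12, 3, 2, 2))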

SUBDUE has been updated through the years with different implementations to make it able to operate with incremental data (Coble et al., 2005, 2006) or to allow it to recognize recursive structures (e.g., for concept learning) (Holder et al., 2002). Even though more recent efforts have outperformed it, SUBDUE is recognized in the literature as one of the most common algorithms for inexact sub-graph mining (Jiang et al., 2012). In this work, we have used the implementation available on the SUBDUE project website2, slightly tuned to produce additional metadata with the results3.

6.2.2.2 FSG

FSG (Kuramochi and Karypis, 2004a) uses a breadth-first strategy based on the level-by-level expansion adopted by the Apriori algorithm (Inokuchi et al., 2000). It was designed to work efficiently on undirected labeled graphs and to scale reasonably well to very large graph datasets. FSG tackles the problem of common sub-graph detection by applying the following approach: first, it gathers all the frequent single-edged and double-edged graphs. Then the algorithm iterates, generating candidate sub-graphs (whose size is one edge greater than the previous frequent ones), counting their frequencies and pruning any candidates that do not satisfy the minimum support constraint (i.e., an established minimum number of occurrences). The algorithm stops when no further sub-graphs are generated in an iteration (Kuramochi and Karypis, 2004a).

2http://ailab.wsu.edu/subdue/ 3https://github.com/dgarijo/FragFlow/tree/master/SUBDUE_TOOL

For efficiency, FSG uses a canonical representation of the graphs, i.e., a unique code to represent the graph without depending on the order of its vertices and edges. Unfortunately, this causes some of the mined fragments to have an incorrect edge order when applied to directed graphs (or LDAGs, as in our case). In order to solve this issue, we match the resulting fragments against the workflows where they were found and fix the directionality of the fragment. We use the FSG implementation included in the PAFI framework4.
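The level-wise loop can be sketched as follows (our heavy simplification, not the PAFI implementation: graphs are reduced to sets of labeled edges, connectivity is not enforced, and candidate generation is naive, whereas FSG relies on canonical codes):

from itertools import combinations

def frequent_subgraphs(corpus, min_support):
    # Level 1: frequent single edges (support counted once per graph).
    edges = {e for g in corpus for e in g}
    level = [frozenset([e]) for e in edges
             if sum(e in g for g in corpus) >= min_support]
    result = list(level)
    while level:
        size = len(next(iter(level)))
        # Naive candidate generation: join pairs of frequent sub-graphs
        # that share all but one edge, growing them by one edge.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == size + 1}
        # Prune candidates below the minimum support.
        level = [c for c in candidates
                 if sum(c <= g for g in corpus) >= min_support]
        result.extend(level)
    return result

corpus = [frozenset({("A", "B"), ("B", "C")}),
          frozenset({("A", "B"), ("B", "C"), ("C", "D")}),
          frozenset({("A", "B"), ("X", "Y")})]
print(frequent_subgraphs(corpus, min_support=2))
# [{('A','B')}, {('B','C')}, {('A','B'), ('B','C')}] (order may vary)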

6.2.2.3 gSpan

One of the most popular exact FSM techniques (Jiang et al., 2012), and the first one designed using a DFS strategy. The gSpan algorithm (Yan and Han, 2002) uses a canonical labeling system (a DFS lexicographic order) where each graph is assigned a minimum DFS code. By doing this, gSpan turns the problem of finding common sub-graphs into finding their respective minimum DFS codes. Based on these codes, a hierarchical search space is constructed and traversed, discovering all the most frequent sub-graphs. Finally, gSpan applies pre-pruning, post-pruning and partial count pruning to improve its performance (Yan and Han, 2002). In our work, we use the gSpan implementation available in the ParSeMiS framework5.
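The intuition behind the minimum DFS code can be sketched as follows (our illustration; enumerating the DFS codes of a graph is precisely the expensive part that gSpan organizes efficiently):

# Each DFS code entry: (i, j, label_i, edge_label, label_j), where i and
# j are discovery positions in the DFS traversal of the path A-B-C.
code_from_a = [(0, 1, "A", "-", "B"), (1, 2, "B", "-", "C")]  # DFS starting at A
code_from_c = [(0, 1, "C", "-", "B"), (1, 2, "B", "-", "A")]  # DFS starting at C

# Lists of tuples compare lexicographically, so the canonical (minimum)
# DFS code of the path is the code starting at A.
minimum_dfs_code = min([code_from_a, code_from_c])
print(minimum_dfs_code == code_from_a)  # True

# gSpan only grows patterns whose DFS code is minimal, which avoids
# generating the same sub-graph twice from different traversals.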

6.3 Fragment Filtering and Splitting

The next step in FragFlow is filtering and splitting workflow fragments. The FSM algorithms introduced in the previous section may return many results, including fragments that overlap or that are included as part of other fragments. In order to present to users only those fragments considered to be useful, we filter and refine the initial results by introducing different classes of fragments, as described in Definition 4.

Definition 4 Given a workflow fragment wf, we denote it as a candidate workflow fragment if it was produced by applying different FSM algorithms to a workflow corpus. The order of a workflow fragment is a function o : wf → ℕ which returns the number of vertices in the workflow fragment (o(wf) = |V(wf)|). Fragments where o(wf) > 1 are referred to as multi-step workflow fragments. Finally, if we have a set of workflow fragments Wf = (wf1, wf2, ..., wfn), we denote as multi-step filtered fragments (Mff) those multi-step workflow fragments that are not part of a bigger workflow fragment with the same number of occurrences, i.e., wfi ∈ Mff (1 ≤ i ≤ n) ⇔ (a) o(wfi) > 1, and (b) ∀wfj ∈ Wf (1 ≤ j ≤ n, j ≠ i), wfi ⊆ wfj ⇒ F(wfi) ≠ F(wfj), where F stands for the frequency or support as defined in Definition 2.

4http://glaros.dtc.umn.edu/gkhome/pafi/overview/ 5https://github.com/timtadh/parsemis

Figure 6.6 shows an example, where five fragments are represented (f0, f1, f2, f3 and f4). As all the fragments have been produced by applying FSM techniques, all of them are considered candidate fragments. Candidates f1 to f4 have more than one step, and are thus considered multi-step fragments. From this group, two fragments are part of other fragments (f1 is included in f2, and f3 in f4). However, the frequency of f1 is equal to that of f2, while f3 occurs more times than f4. Therefore, f2, f3 and f4 are considered multi-step filtered fragments, while f1 is not.

Figure 6.6: Types of fragments for filtering FSM results.
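The filtering of Definition 4 can be sketched as follows (our simplification: fragments are reduced to sets of step identifiers, and sub-graph containment is approximated by a subset test), reproducing the example of Figure 6.6:

def multi_step_filtered(fragments):
    # Keep multi-step fragments that are not contained in a bigger
    # fragment with the same number of occurrences (Definition 4).
    mff = []
    for steps, freq in fragments:
        if len(steps) <= 1:
            continue  # single-step fragments are never Mff
        absorbed = any(steps < other_steps and freq == other_freq
                       for other_steps, other_freq in fragments)
        if not absorbed:
            mff.append((steps, freq))
    return mff

fragments = [
    (frozenset({"A"}), 9),            # f0: single step
    (frozenset({"A", "B"}), 4),       # f1: inside f2, same frequency
    (frozenset({"A", "B", "C"}), 4),  # f2
    (frozenset({"D", "E"}), 6),       # f3: inside f4, higher frequency
    (frozenset({"D", "E", "F"}), 3),  # f4
]
print(multi_step_filtered(fragments))
# f2, f3 and f4 are kept; f1 is filtered out, as in Figure 6.6.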

Our interest is to present to the user only the multi-step filtered fragments, that is, fragments that have several steps and occur more than a minimum number of times. Single-step fragments are not relevant, as they are already the minimum pieces of functionality when creating a workflow. By filtering multi-step fragments we maximize the size of the fragments presented to the user without removing those common smaller fragments that are included as part of bigger fragments. After filtering, the candidate fragments are split according to their category (multi-step or multi-step filtered). This helps in selecting the group of fragments to visualize, calculating statistics or linking to the original corpus.

6.4 Fragment Linking

Once the workflow fragments have been filtered, the next step in FragFlow is to bind each fragment to its occurrences in the original workflow corpus, so as to make explicit the connection between workflows and their common fragments. This section first introduces the model used to represent and bind fragments to results by extending the P-Plan ontology; it then describes how the different fragments are bound to the workflows of the original corpus.

6.4.1 Workflow Fragment Representation

We created the Workflow Fragment Description Ontology (Wf-fd6) to represent workflow fragments and link them to their occurrences in a workflow dataset. Wf-fd was built by extending the P-Plan ontology, as workflow fragments are always part of a workflow template (a plan in P-Plan). Figure 6.7 shows an overview of the model, highlighting the extended terms and how the workflow fragment representation benefits from P-Plan. In order to differentiate between the classes and properties of both models, we use the p-plan and wffd prefixes respectively. In Wf-fd, a WorkflowFragment is a subclass of P-Plan's Plan. A workflow fragment has steps (reusing the Step concept in P-Plan), which represent the individual data manipulation steps of a particular fragment. The order among the steps is also captured with the P-Plan property isPrecededBy between fragment steps. In Wf-fd there are two types of workflow fragments. On the one hand, a detected result workflow fragment is a workflow fragment found after applying FSM techniques or manual analyses on a workflow dataset. It identifies a unique fragment that can be found in a workflow dataset (e.g., a workflow fragment result provided by any of the graph mining techniques presented in Section 6.2.1). On the other hand,

6http://purl.org/net/wf-fd

a tied workflow fragment represents how a detected result workflow fragment was found in the workflow dataset, pointing to the particular steps of the workflows that correspond to the fragment.

Figure 6.7: Wf-fd overview, extending the P-Plan ontology.

Figure 6.8: Wf-fd example. Two fragments (resultF1 and resultF2) are found twice in the workflows Workflow1 and Workflow2. Their respective tied workflow fragments indicate where in each of the workflows the fragments were found. Also, resultF2 is part of resultF1, which is recorded appropriately.

An example is represented in Figure 6.8, where two fragments (resultF1 and resultF2) are found twice (linked through their respective tied result workflow fragments) in the workflows Workflow 1 and Workflow 2. As the figure shows, resultF2 is composed of two typed steps, while resultF1 is composed of a step followed by the fragment resultF2. This relationship is represented with the wffd:isPartOfWorkflowFragment property. ResultF2 is represented with a P-Plan multiStep (step2F1) in resultF1. The additional step (step2F1) is necessary because the fragment resultF2 could be part of many other fragments, and including it directly as part of resultF1 would lead to inconsistencies when representing the rest of the results. For example, if we want to represent the three fragments depicted in Figure 6.9 and we don't make use of a multiStep, the modeling would result as shown in Figure 6.10(a), which is inconsistent (the precedence relationship in R2 would not be properly represented) and adds complexity for retrieving fragments based on their isPrecededBy relationship. Instead, if we use the modeling proposed in Figure 6.10(b) (i.e., using the multiStep to model fragment inclusion in other fragments), the fragments are represented properly and their retrieval can be done by using just the isPrecededBy relationship. This representation also allows each fragment to point to the specific steps of the workflow where it was found (foundAs), as well as the workflow itself (foundIn). Thanks to the model and the reuse of P-Plan, we enable queries to retrieve additional metadata for each fragment and workflow step (e.g., the number of times that a fragment has been detected in a workflow, how the fragment was found, etc.).
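To make the representation more concrete, the following Python sketch asserts a detected fragment, its steps and one occurrence with rdflib. It is only an illustration: the CamelCase class names and namespace URIs are our assumed rendering of the Wf-fd terms described above, the example.org URIs are placeholders, and only the property names are taken from the text.

from rdflib import Graph, Namespace
from rdflib.namespace import RDF

WFFD = Namespace("http://purl.org/net/wf-fd#")    # assumed namespace form
PPLAN = Namespace("http://purl.org/net/p-plan#")
EX = Namespace("http://example.org/")             # placeholder URIs

g = Graph()
g.bind("wffd", WFFD)
g.bind("p-plan", PPLAN)

# resultF2: a detected result workflow fragment with two ordered steps.
g.add((EX.resultF2, RDF.type, WFFD.DetectedResultWorkflowFragment))
g.add((EX.step1F2, PPLAN.isStepOfPlan, EX.resultF2))
g.add((EX.step2F2, PPLAN.isStepOfPlan, EX.resultF2))
g.add((EX.step2F2, PPLAN.isPrecededBy, EX.step1F2))

# One occurrence of resultF2 in Workflow1, recorded as a tied fragment
# (our reading of the foundAs/foundIn properties).
g.add((EX.resultF2, WFFD.foundAs, EX.tiedF2a))
g.add((EX.tiedF2a, RDF.type, WFFD.TiedWorkflowFragment))
g.add((EX.tiedF2a, WFFD.foundIn, EX.Workflow1))

print(g.serialize(format="turtle"))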

6.4.2 Finding Fragments in Workflows

Once the final set of fragments has been obtained and refined, we link them to the original workflow corpus, represented with the Wf-fd ontology. Since the FSM approaches do not always indicate where the fragments were found, we use a simple, generic method to link the obtained results of both exact and inexact FSM techniques to the input workflow corpus: we create queries from each fragment.

Figure 6.9: Three sample fragments. Arrows represent the dependencies among the steps.

At this stage the input workflows are already represented in P-Plan from previous steps. Thus, workflow fragments detected by exact FSM techniques are trivial to transform into queries: each node is transformed into a step in P-Plan, and each dependency between two steps is represented with the isPrecededBy relationship. As an example, the fragment depicted in Figure 6.11(a) is transformed into the query below (expressed in SPARQL, where <A>, <B> and <C> stand for the types of the fragment steps):

SELECT ?wf ?stepA ?stepB ?stepC
WHERE {
  ?stepA a <A> .
  ?stepB a <B> .
  ?stepC a <C> .
  ?stepA p-plan:isStepOfPlan ?wf .
  ?stepB p-plan:isStepOfPlan ?wf .
  ?stepC p-plan:isStepOfPlan ?wf .
  ?stepC p-plan:isPrecededBy ?stepA .
  ?stepC p-plan:isPrecededBy ?stepB .
}

(a) Inconsistent model: modeling the fragments of Figure 6.9 without using p-plan:MultiStep. (b) Consistent model: modeling the fragments of Figure 6.9 using p-plan:MultiStep.

Figure 6.10: Inconsistent and consistent modeling of the fragments depicted in Figure 6.9.

(a) A simple fragment example obtained by an exact FSM technique. (b) Transforming inexact FSM fragments to queries. The fragment on top of the figure can be transformed in two different ways, shown in the middle of the figure.

Figure 6.11: Transforming fragments obtained with exact FSM techniques (left) and inexact FSM techniques (right).

Workflow fragments detected with inexact mining techniques are more complex to transform. For example, consider the fragment represented on top of Figure 6.11(b) (Fragment 1), where step A is followed by Fragment 2 (composed of two steps B and C). Fragment 1 represents that step A is followed by Fragment 2, but it does not specify whether step A precedes step B or step C (due to the inexact mining approach). Therefore, we try both possibilities, translating the fragment to the two possible interpretations shown on the bottom of the figure. Both interpretations are issued within the same query as follows (expressed in SPARQL):

SELECT ?wf ?stepA ?stepB ?stepC
WHERE {
  ?stepA a <A> .
  ?stepB a <B> .
  ?stepC a <C> .
  ?stepA p-plan:isStepOfPlan ?wf .
  ?stepB p-plan:isStepOfPlan ?wf .
  ?stepC p-plan:isStepOfPlan ?wf .
  ?stepC p-plan:isPrecededBy ?stepB .
  { ?stepB p-plan:isPrecededBy ?stepA . }
  UNION
  { ?stepC p-plan:isPrecededBy ?stepA . }
}

The answers for each query (or union of queries in the case of inexact mining) are all the possible bindings within all workflows, showing how and where each fragment was found in each of the workflows of the original corpus.

6.5 Fragment Statistics and Visualization

Once the fragments have been filtered and bound to the workflow corpus, the next step in FragFlow is to produce statistics and visualizations of the fragments. For this purpose we created scripts for calculating the minimum frequency of the fragments and the number of times they were found in the corpus, which is particularly useful for the evaluation of our approach (as we describe in Chapter 7). Additionally, we created a graph-based visualization for a collection of fragments, helping users understand the shape and dependencies of each fragment.

6.6 Summary

In this chapter we have introduced our approach for extracting workflow fragments by reusing different graph mining techniques. In doing so, we have proposed a way to clean the input workflow dataset to avoid incorrect results, and we have characterized and selected existing graph mining techniques that broaden the types of fragments found by our approach. This addresses the technical objective TO1 of this thesis (characterize the different types of existing graph mining algorithms, their features, their limitations and available implementations). We have also introduced a set of novel techniques to filter irrelevant fragments and link the results to the parts of the input corpus where they occur, thus addressing the technical objective TO2 (develop a framework for applying existing graph mining approaches on workflow templates and executions, along with the means to refine and filter the results provided by the different algorithms).

It is worth mentioning that our effort has been driven towards usefulness rather than efficiency, as the latter was not within the scope of this thesis. For example, if the number of workflows in the input corpus is high, issuing a SELECT query retrieving how a complex fragment was found in the corpus might be too time-consuming. In those cases, our approach could be divided into smaller sub-problems (e.g., issuing queries against single workflows instead of the whole workflow dataset) or simplified (e.g., issuing ASK queries instead of SELECT queries). Improving the performance and efficiency of our approach is future work. In the next chapter we will assess the usefulness of our results by performing three evaluations on two different workflow systems.

Chapter 7

Evaluation

In this chapter we evaluate our hypotheses by describing the results of our approach for automatically detecting abstractions in scientific workflows. Given that H1.1 (i.e., it is possible to define a catalog of common domain-independent patterns based on the functionality of workflow steps) has already been addressed in Section 5.1 by defining a catalog of common workflow motifs, in this chapter we focus on evaluating H1.2 (it is possible to detect commonly occurring patterns and abstractions automatically) and H1.3 (commonly occurring patterns are potentially useful for users designing workflows). First, Section 7.1 defines the evaluation metrics associated with each hypothesis, and then Sections 7.2 and 7.3 describe the results obtained in each of the evaluations. Finally, Section 7.4 summarizes our contributions with respect to each hypothesis.

7.1 Evaluation Metrics

There are different techniques to identify patterns in a workflow corpus, as we have described in Chapter 6. In order to evaluate our hypotheses H1.2 and H1.3, we need to assess the genericity and usefulness of our proposed frequent workflow fragments. However, to our knowledge there are no standard metrics that define whether a given workflow pattern is generic or useful. Therefore we propose one metric for assessing frequent generic fragments and two for evaluating fragment usefulness, as described in Definition 5. These metrics will be adapted and used later for performing independent evaluations, as they have different requirements on the input corpus.

Definition 5 A workflow fragment is generic if it appears in two or more workflows with different levels of abstraction. We consider a workflow fragment to be useful if it has been reused and annotated by one or several users designing workflows (either as a workflow or as a sub-workflow).

7.1.1 Occurrence and Generalization Evaluation Metrics

Our proposed metric aims to confirm whether a workflow fragment is commonly occurring and generic. In order to do so, we compare our proposed fragments to those motifs from our catalog that are related to workflow reuse, i.e., internal macros and composite workflows. Internal macros capture the parts of a workflow that are reused within the workflow itself, and therefore a fragment matching an internal macro is commonly occurring. Composite workflows indicate that a target workflow is composed of other existing workflows. Hence, any fragment matching those sub-workflows is commonly occurring. Additionally, these motifs may apply a step abstraction to relate workflows with the same abstract functionality, and we can use them to assess the genericity of our proposed fragments. For measuring our results, we use precision (Pmotifs) and recall (Rmotifs), defined as:

Pmotifs = |F ∩ M| / |F|

Rmotifs = |F ∩ M| / |M|

Where F represents the resultant set of fragments and M stands for the set of motifs we are looking for (in this case, M represents either the set of internal macros in the corpus or the sub-workflows included in any composite workflow in the corpus). Precision and recall values range from 0 (low) to 1 (high). In order to be able to apply this metric, a workflow corpus with annotated motifs is necessary.

7.1.2 Usefulness Evaluation Metrics

In this case, the first metric we propose aims at comparing whether the workflow fragments found by our approach match the fragments or workflows designed and reused by users (i.e., user groupings). We also measure our results in terms of precision and recall for this task. On the one hand, the precision measures how many of the common fragments correspond to the reused groupings and workflows identified by users. On the other hand, the recall identifies the percentage of detected reused structures out of the total reused groupings and workflows. Since our focus is to detect useful workflow fragments, we prioritize precision over recall. The recall measure is relevant, but some of the reused groupings or workflows may have a low frequency, and detecting them may prevent the detection of more frequent fragments, which motivates our decision. It is worth mentioning that workflows are considered in the metric as well: some workflows are reused as part of other bigger workflows in the workflow corpus, and our fragments may correspond to them. The precision (Pgroup) and recall (Rgroup) metrics are defined as:

Pgroup(overlap) = |F ∩ (G ∪ W)| / |F|

Rgroup = |F ∩ (G ∪ W)| / |G ∪ W|

Where F represents the set of proposed fragments, G stands for the set of user-defined groupings and W corresponds to the set of reused workflows. The intersection between fragments and the union of all groupings and workflows is calculated by measuring which fragments overlap with a grouping or a workflow. The overlap term stands for the percentage of steps that are equal between our proposed fragments and the groupings or workflows when calculating |F ∩ (G ∪ W)|, with Pgroup(100%) meaning that the fragments found are exactly the same as one of the groupings or workflows defined by users. Figure 7.1 shows an example by comparing the overlap of two fragments against the same grouping. The fragment on the left (Fragment1) is equal to Grouping1 (overlap 100%), while in Fragment2 only two steps out of three are the same (an overlap of 66%). This additional measure determines how similar our fragments are with respect to a user-defined grouping or workflow, and helps identify whether our fragments are close to what users defined or not. As in the previous metric, precision and recall values range from 0 (low) to 1 (high). A requirement for applying this technique is to have a workflow corpus with user-defined groupings specified. Note that the overlap is not introduced for the recall (Rgroup), as it could lead to inconsistent results: if the number of candidate fragments similar to the groupings and workflows designed by users is higher than the number of groupings and workflows, then the recall would be greater than 1.

Figure 7.1: Overlap example: two fragments are compared against the same grouping. Fragment1 is exactly the same as Grouping1, while only two out of three steps of Fragment2 are equal to Grouping1 (hence 66,6% overlap).
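The overlap-based precision can be illustrated with the following Python sketch (ours; fragments, groupings and workflows are simplified to sets of step identifiers, and normalizing the shared steps by the larger structure is our simplifying assumption):

def overlap(fragment, target):
    # Percentage of shared steps; normalizing by the larger structure
    # is an assumption made for this illustration.
    return len(fragment & target) / max(len(fragment), len(target))

def precision_group(fragments, groupings, workflows, min_overlap=1.0):
    # P_group(overlap): fraction of proposed fragments that match some
    # user grouping or reused workflow at the given overlap threshold.
    targets = groupings + workflows
    hits = sum(any(overlap(f, t) >= min_overlap for t in targets)
               for f in fragments)
    return hits / len(fragments)

fragments = [{"s1", "s2", "s3"}, {"s4", "s5"}]  # proposed fragments F
groupings = [{"s1", "s2", "s3"}]                # user groupings G
workflows = [{"s6", "s7", "s8"}]                # reused workflows W
print(precision_group(fragments, groupings, workflows, min_overlap=1.0))  # 0.5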

For the second metric, we rely on user feedback. Thus, we target the accuracy (Acc) of our results, defined as:

Acc = |FA| / |F|

Where |FA| stands for the number of fragments the user found useful and |F| represents the total number of fragments proposed to the user. Table 7.1 shows a summary of our proposed metrics and their requirements for evaluation, which will be applied in the following sections on different user corpora.

Table 7.1: Proposed metrics with their requirements for evaluation.

Metric              Requirement for evaluation
Pmotifs, Rmotifs    Workflow corpus annotated with motifs
Pgroup, Rgroup      Workflow corpus annotated with user-defined groupings
Acc                 Workflow corpora and respective user feedback

7.2 Workflow Motif Detection and Workflow Generalization

In this section we describe the results of applying our first metric (Pmotifs and Rmotifs) to an input corpus of workflows in order to test our hypothesis H1.2 (it is possible to detect commonly occurring patterns and abstractions automatically). Therefore, generalization of workflows is applied (using a step abstraction) to determine whether abstract workflow patterns can be mined successfully. Section 7.2.1 describes the experimental setup of the evaluation, while Sections 7.2.2 and 7.2.3 elaborate on the results obtained by applying inexact and exact graph mining techniques. The evaluation presented here is based on our previous results (Garijo et al., 2013a), which have been further refined.

7.2.1 Experimental Setup

We have selected a dataset that contains 22 workflow templates in the domain of text analytics (17 after removing one-step workflows), specified using the Wings workflow system (Gil et al., 2011). The templates are annotated according to the Open Provenance Model for Workflows (OPMW) and belong to the Wings corpus presented in Section 5.1. This specific domain and dataset have been selected for our experiment for two main reasons:

• They contain abstract and specific templates in the same domain, which provide some alternatives for generalization of workflows. This means that some specific templates may share commonly occurring steps with other templates of the corpus after applying a step abstraction (i.e., after a generalization of their steps).

• They have been manually analyzed, before and after generalizing the workflows (as described in (Garijo et al., 2012)), so as to identify all internal macro motifs and all workflows that are part of a composite workflow motif. The annotations highlight maximal motifs, i.e., if an internal macro consists of 4 steps, it has been annotated only once, without including the possible three-step and two-step internal macros it contains. An example can be seen in Figure 7.2, where a workflow for document classification has an internal macro with six steps (appearing twice, once in each branch of the workflow). The internal macro annotated would be the one highlighted in the figure, and all the possible combinations of its sub-workflows are ignored. Motif annotations are required by the reusability metrics tested in this section, and we use them in our evaluation to assess the workflow fragments found.

Figure 7.2: Example of an internal macro, annotated with a dashed line. The internal macro consists of the sequence of steps that are included in each branch, ignoring all the possible 2-, 3-, 4- and 5-step sub-workflows included in it.

We apply inexact and exact FSM techniques for detecting the desired motifs, as described in Section 6.2.2. For detecting internal macros, we run the FSM algorithms on each template separately, while for detecting the workflows included in composite workflows we run the FSM techniques over the whole corpus. It is important to note that support-based techniques will not be able to find any internal macros, as they only consider one occurrence of a fragment per input graph, and internal macros aim to find the common parts of a single workflow.

Table 7.2: Summary of the manual analysis on the Wings corpus.

                                      No Generalization             Generalization
Templates   Filtered Templates   Int. Macro   Wf in Comp. Wf.   Int. Macro   Wf in Comp. Wf.
22          17                   3            2                 5            7

Table 7.2 summarizes the detected motifs in the input corpus, before and after generalizing all the steps of the workflows (more details can be found in Annex C.1). The table shows that the size of the corpus and the number of motif annotations are small, as the corpus had to be extended from the motif analysis to annotate all internal macros and workflows in composite workflows when generalizing the corpus. However, we consider the corpus significant enough for validating our hypothesis and analyzing our results in detail, because the annotated motifs are based on commonly reused patterns. Furthermore, having a smaller corpus might highlight the limitations of the applied algorithms when some common patterns overlap. The evaluation will assess whether we can obtain a high recall in the proposed results, rather than a high precision: we aim for an automatic detection of all the internal macros and workflows part of composite workflows in the input corpus. More precision in our results might prevent finding all the abstractions we are looking for.

7.2.2 Evaluation of the Application of Inexact FSM techniques

We apply the SUBDUE-MDL and SUBDUE-Size techniques described in Section 6.2.2 to the input corpus. The description of the results for each motif can be found below.

7.2.2.1 Internal Macro Results

Table 7.3 shows the precision and recall of the results obtained for the inexact FSM techniques on the input corpus (both returned the same results). The table shows the results according to the generalization (i.e., using the input corpus as is (no generalization) and after applying step abstraction (generalization)) and the type of fragment: candidate (raw result returned by the inexact FSM technique), multi-step (fragments with more than one step) and multi-step filtered (product of applying our filtering techniques described in Section 6.3). All the fragments are obtained assuming the minimum frequency (i.e., a minimum of 2 occurrences of the analyzed structure). Due to the size of the corpus, we avoid comparing the results at different frequencies, as the results would not differ significantly.

Table 7.3: Inexact FSM techniques for the detection of internal macro motifs, using the templates without and with generalization of their steps.

SUBDUE MDL/Size       No Generalization       Generalization
Fragment type         Pi.macro   Ri.macro     Pi.macro   Ri.macro
Candidate             0,13       1,0          0,19       1,0
Multi-step            0,22       0,67         0,36       0,8
Multi-step filtered   1,0        0,67         1,0        0,8

As the results show, all the internal macros are detected within the candidate fragments for both the non-generalized and the generalized corpus (Ri.macro = 1,0). However, one of the annotated internal macros is not detected when applying our filtering techniques (1 out of 3 in the non-generalized corpus and 1 out of 5 in the generalized corpus). Figure 7.3 shows the explanation of this result. On the left, the workflow where the internal macro has been annotated is depicted. On the right we can see its reduced form, showing just the dependencies among the different steps. It turns out that the internal macro consists of just a one-step fragment (Stemmer), which is filtered by our approach (in the multi-step filtered fragments), since one-step fragments are just single-step components and are already available to a workflow designer as atomic pieces of functionality. Therefore, although the internal macro is not detected, the behavior of our results is the desired one. Note that in our motifs and fragments we don't consider overlapping structures, which is why the Vocabular component is not included as part of the internal macro.

Figure 7.3: Workflow where the annotated internal macro could not be detected. When transforming the workflow to its reduced form, the internal macro is just a one-step fragment, which is ignored.

Regarding the precision, the best results are obtained for the multi-step filtered fragments (Pi.macro = 1,0). In this case, this is due to the nature of the motif we are looking for: we explore every workflow by itself, so the fragments found tend to be few and included in each other. Finally, a key finding of our results is that they show how our approach can handle both generalized and non-generalized workflows. As shown in Table 7.2, when applying generalization, two new internal macros are detected in the corpus. Both of them are found by our approach.

7.2.2.2 Composite Workflow Results

Tables 7.4 and 7.5 show the results obtained after applying the SUBDUE-MDL and SUBDUE-Size heuristics to the corpus, respectively.

Table 7.4: Inexact FSM techniques for the detection of workflows in composite workflow motifs, using the templates as they are and with generalization, with the SUBDUE MDL evaluation. The asterisk represents the results when a target workflow is included as part of a bigger detected fragment.

SUBDUE MDL            No Generalization     Generalization
Fragment type         Pc.wf    Rc.wf        Pc.wf          Rc.wf
Candidate             0,14     0,50         0,67 (1,0)*    0,57 (0,86)*
Multi-step            0,14     0,50         0,67 (1,0)*    0,57 (0,86)*
Multi-step filtered   0,14     0,50         0,67 (1,0)*    0,57 (0,86)*

Table 7.5: Inexact FSM techniques for the detection of composite workflow motifs, using the templates as they are and with generalization, with the SUBDUE Size evaluation. The asterisk represents the results when a target workflow is included as part of a bigger detected fragment.

SUBDUE Size           No Generalization             Generalization
Fragment type         Pc.wf           Rc.wf         Pc.wf           Rc.wf
Candidate             0,03 (0,07)*    0,50 (1,0)*   0,15 (0,22)*    0,57 (0,86)*
Multi-step            0,11 (0,22)*    0,50 (1,0)*   0,57 (0,86)*    0,57 (0,86)*
Multi-step filtered   0,125 (0,25)*   0,50 (1,0)*   0,57 (0,86)*    0,57 (0,86)*

Some of the results in the tables include a second measure in brackets with an asterisk. This measure indicates the precision or recall when the target workflow reused in a composite workflow has not been detected itself, but is included as part of a bigger fragment detected by the applied FSM technique. We decided to include this measure for two main reasons. The first one is that the workflows designed by users may not correspond entirely with how they are reused in other workflows. The second reason is the algorithm used by the graph mining techniques, which tends to maximize the size of the patterns found when they overlap. An example can be seen in Figure 7.4, where three workflows are depicted. The one on the left (Workflow 1) is reused in the other two workflows. The workflow in the middle is also reused in the workflow on the right (Workflow 3). Since Workflow 1 is included in a bigger reused workflow (Workflow 2), the graph mining technique may select the bigger fragment (Workflow 2) instead of Workflow 1, depending on its frequency in the rest of the corpus.

Figure 7.4: Fragment inclusion: the workflow on the left is included in the workflow in the middle, which is itself included in the one on the right. If maximal patterns are chosen by the applied FSM technique, Workflow 1 may not be detected as a common workflow fragment.

Due to this overlapping issue, our results for detecting workflows included in composite workflow motifs are not as good as those obtained for internal macro motifs. The best results are obtained by the SUBDUE-Size technique, with one out of two detections (50%) for non-generalized workflows (the other composite workflow is detected as part of a bigger fragment) and four out of seven (57%) detections in generalized workflows (which increase to six out of seven (86%) if we consider those fragments overlapping with the target workflow). The results demonstrate that, up to a reasonable percentage, inexact FSM fragments are a good technique for detecting which workflows are part of composite workflows, even if not all of them are detected.

It is worth mentioning that, even taking into consideration those results included in bigger fragments, there are target workflows that are not detected by the algorithm. This can be explained by further analyzing the overlapping issue and the way SUBDUE operates. SUBDUE reduces each candidate fragment in the input dataset on each iteration. If a fragment overlaps with another and the algorithm chooses one to reduce, the other candidate might not be detected. Figure 7.5 shows an example where three workflows have two overlapping fragments. Workflow 1 and Workflow 2 share a three-step fragment (highlighted with a dotted line). Workflow 2 and Workflow 3 share a two-step fragment. Due to the overlap between the two fragments, if SUBDUE decides to reduce the bigger one, the smaller fragment will not be detected. Some of these issues disappear when our approach is applied to bigger corpora, as the smaller patterns tend to be more frequent and are detected even if there are overlapping issues.

Figure 7.5: Fragment overlap in SUBDUE: three workflows overlap without being included in each other (their common parts have been highlighted with a dotted line). If the FSM technique selects the bigger fragment, the smaller one (step D followed by B) would not be detected.

Finally, precision is better for the SUBDUE-MDL technique, which produced fewer results than SUBDUE-Size. As was the case with the internal macros, by generalizing the workflows we obtain more workflow fragments, which match more than half of the workflows detected manually.

7.2.3 Evaluation of the Application of Exact FSM Techniques

Table 7.6 summarizes the results of applying the gSpan technique to the input corpus (an additional evaluation with the FSG algorithm was not included, as all exact FSM techniques deliver very similar results). The results are obtained with a minimum support of two (similar to what we established for the frequency of the inexact FSM techniques), and filtering one-step results (which can be set as a parameter of the algorithm). As expected, all the workflows included in composite workflows are found among the provided results, improving the outcome generated by the inexact FSM techniques. This happens because exact mining techniques return all the possible fragments in the input corpora. Thanks to our filtering techniques, we are able to decrease the number of candidate fragments (hence augmenting the precision) from 41 to 11 (4% to 18%) when no generalization is applied and from 70 to 15 (10% to 47%) when it is. Meanwhile, the recall is preserved at 100%, which indicates the validity of our proposed filtering techniques. Even after filtering the fragments, the precision is, in general, still worse than that of the inexact FSM methods, due to the number of candidate fragments produced by the technique.

Table 7.6: Exact FSM techniques for the detection of workflows included in composite workflow motifs, using the templates with and without generalization.

gSpan                 No Generalization     Generalization
Fragment type         Pc.wf    Rc.wf        Pc.wf    Rc.wf
Candidate             0,04     1,0          0,10     1,0
Multi-step            0,04     1,0          0,10     1,0
Multi-step filtered   0,18     1,0          0,47     1,0

7.2.4 Summary

In this section we have described our first evaluation, which addresses whether it is possible to detect commonly occurring patterns and abstractions automatically (H1.2) on a small hand-annotated corpus. In order to do so, we have compared our proposed fragments to the commonly occurring motifs, i.e., internal macro motifs and those workflows included as part of a composite workflow motif. One of the advantages of using a small controlled corpus is that we are able to detect and analyze potential issues of the applied fragment mining techniques. We have explained our experimental setup and analyzed the results in detail. Overall, the results show that inexact FSM techniques are a good approach to detect internal macros, while exact FSM techniques obtain perfect recall for detecting those workflows that are part of composite workflow motifs (since exact FSM techniques detect all commonly occurring fragments, this is an expected result). Additionally, both approaches are able to deal successfully with a scenario in which the workflows are declared at different levels of abstraction. In the following section we describe our second evaluation, aimed at determining whether our proposed fragments are useful for reuse, based on what users have previously defined (H1.3). A detailed discussion of the results with respect to the hypotheses can be seen in Section 7.4.

7.3 Workflow Fragment Assessment

As we have seen in our previous evaluation, our proposed workflow fragments are a feasible means to find the workflows participating in the composite workflow and internal macro motifs annotated by users in a workflow corpus. However, the results in general have low precision (under 50%). This indicates that the fragments returned by our approach suggest additional typical structures in the workflow corpus that could be potentially useful for workflow reuse. Hence, in this section we aim to determine whether our proposed fragments are actually useful or not, thus addressing our hypothesis H1.3 (commonly occurring patterns are potentially useful for users). The section is organized as follows. Section 7.3.1 introduces the experimental setup, which is common to both of the following evaluations. Section 7.3.2 checks whether our proposed fragments are similar to the reused structures manually identified by users (user groupings and workflows) in a workflow system. Then, in Section 7.3.3, we discuss the outcome of a survey that domain experts filled in to determine the usefulness of a set of proposed workflow fragments. The evaluation presented here is based on our previous findings (Garijo et al., 2014c), which have been expanded and refined.

7.3.1 Experimental Setup

Our evaluation has been performed using four workflow corpora from the LONI Pipeline system. The corpora are WC1, WC2, WC3 and WC4, previously introduced in Section 5.3. We have chosen these corpora for two main reasons:

• High reusability: The workflows belong to the same domain, have been created by single and multiple users, and are reused as a whole or included as part of other workflows (as shown in Section 5.3).

• User groupings: users have the means to annotate their sub-workflows as groupings. These can be taken as a reference to compare our candidate fragments against, and are a requirement for applying the metrics proposed for this type of analysis (Pgroup(overlap) and Rgroup).

We first performed a comparison between our proposed fragments and the workflows and groupings in the corpora defined by users, using the Pgroup(overlap) and Rgroup metrics. Different values of the frequency of the common fragments and of the overlap with our solution were considered in order to analyze the quality of the candidate fragments. The evaluation prioritizes a high precision in the proposed candidate fragments rather than a high recall: we aim at obtaining common workflow fragments that are useful for workflow designers, rather than at finding all the possible fragments actually reused by users. Table 7.7 provides a summary of the workflows and groupings that have been reused by users. These numbers will be used to evaluate our approach.

Table 7.7: Unique workflows, groupings and their reuse. These numbers will be later used to calculate the precision and recall of the proposed fragments.

          Unique multi-step workflows   Unique multi-step groupings
Corpus    Created         Reused        Created         Reused
WC1       441             175           146             71
WC2       94              13            87              40
WC3       269             106           79              47
WC4       50              7             40              26

Candidate fragments that did not overlap with user-defined groupings or workflows were collected and used as part of a user survey. The survey aimed at testing whether these fragments were still a useful suggestion to the users. Hence, we used the usefulness metric (Acc) defined in Section 7.1.2; a high accuracy indicates that our suggested fragments were adequate for that user. In this case generalization is not applied, as no domain-specific taxonomy modeling the components of the workflows in the corpora was provided.

7.3.2 FragFlow Fragments versus User Defined Groupings

We applied inexact FSM techniques (SUBDUE-MDL, SUBDUE-Size) and an exact FSM technique (gSpan) to measure the precision and recall of our approach. The results of the evaluation are described below.

7.3.2.1 Evaluation of the Application of Inexact FSM techniques

Figures 7.6, 7.7, 7.8 and 7.9 show the number of fragments found and their precision (Pgroup(overlap)) for every corpus. Each figure shows two evaluations, corresponding to the MDL and Size heuristics of the SUBDUE algorithm. Both of these techniques are frequency-based approaches, that is, the frequency represents the number of times (occurrences) a fragment is found in the corpus (counting several times if the fragment appears several times in one workflow). The fragment frequencies are normalized according to the size of the dataset, in order to show how different frequencies affect the precision and the FragFlow fragments found. The “min” value stands for the minimum frequency for a fragment to be detected, i.e., the fragment appears at least twice in the corpus. Given that the numbers of multi-step fragments and multi-step filtered fragments are similar in both evaluations, we only show the latter (Mff) in the precision graphs.

Figure 7.6: Number of fragments found and precision results for WC1 with the MDL and Size evaluations. For the precision, only multi-step filtered fragments are shown.

Finally, the precision metric shows different values for the overlap, indicating the overlap percentage between the proposed fragments and the groupings and workflows designed by the users. The rationale for having an overlap different from 100% is to consider fragments that are very similar to the workflows or groupings created by the users. The minimum overlap value, 80%, was chosen empirically, as the differences between the proposed fragments and the user-defined structures tend to be too many if this value is decreased. Additional information on the fragments found can be found in Annex C.2.

Figure 7.7: Number of fragments found and precision results for WC2 with the MDL and Size evaluations. For the precision, only multi-step filtered fragments are shown.

The results show that many workflow fragments are found to be commonly reused in all four corpora. However, there are more common workflow fragments with high frequency in the single-user corpora (WC1, WC2 and WC4) than in the multi-user corpus (WC3). This result can be explained by looking at the high number of users contributing to WC3 (over 60, as described in Section 5.2) and the nature of the corpus, which is a collection of executions submitted by those users during a month. Given that all users were unlikely to be part of the same team, they would not have the means of knowing the workflows submitted by other users to the repository, thus decreasing their reuse. Instead, the other three corpora had one main contributor who was aware of the previous work in the respective corpus, facilitating their reuse.

Figure 7.8: Number of fragments found and precision results for WC3 with the MDL and Size evaluations. For the precision, only multi-step filtered fragments are shown.

In general, a good number of the workflow fragments found with inexact FSM techniques match those workflows or groupings designed by users. The maximum precision with 100% overlap ranges from 35% (WC2, WC4) to 60% (WC3), increasing to 60%-80% when considering an overlap of 80% of the workflow steps. An individual analysis of each corpus can explain the differences in the corpora results:

• In WC1 we see high reuse and a high number of workflows, which leads to producing a high number of precise fragments. This finding is confirmed when analyzing the overlap, as a high number of fragments (up to 80%) is similar to the groupings or workflows designed by the user.

• WC2 consists of a set of curated workflows made public for the LONI Pipeline community to reuse. Although they share some common parts, workflow reuse is not that high in the corpus. Hence the number of fragments found at high frequency is lower (no higher than 6), and the precision is higher (60%).

Figure 7.9: Number of fragments found and precision results for WC4 with the MDL and Size evaluations. For the precision, only multi-step filtered fragments are shown. The asterisk indicates an execution failure for WC4.

• WC3 does not produce a high number of frequent patterns (fewer than seven multi-step filtered fragments are found with f>5%), due to the lower reusability issue discussed above. However, when looking at a lower frequency (f>2%), more than half of our fragments are equal to the users' choices. This means that although different users may not be aware of what other users have designed, they still reuse similar substructures in their workflows at the grouping level.

• WC4 is a small corpus of heavyweight workflows, some of which reuse parts of more than 90 steps from other workflows. Due to the nature of the corpus, the MDL evaluation produced only four common workflow fragments before failing, although the Size evaluation completed successfully. In this case the precision of our results is worse than in the previous corpora (reaching no more than 40% for the Size evaluation). This is caused by two main factors: the design decisions made by the user to create the groupings in a workflow, and the way the evaluation metric works. If a user defines a grouping within a workflow that appears several times in the workflow corpus, the associated fragment found could merge it with some other workflow steps. Figure 7.10 shows an example, where a workflow contains two user-defined groupings on the left (Grouping 1 and Grouping 2). Assuming that the workflow is reused elsewhere in the corpus, our proposed fragment (on the right of the figure) would include the union of both groupings, instead of two separate fragments.

Figure 7.10: Groupings defined by a user versus a fragment found. If a user defines connected sub-groupings that co-occur with the same frequency, then the fragment found will merge them.

Regarding the type of evaluation, SUBDUE-Size returns the most precise results both in Pgroup(100) and Pgroup(80) (excluding the MDL evaluation of WC4, which produced errors), according to the evaluation metrics we defined. When looking at the frequency, if a significant number of fragments is returned (i.e., the number of fragments is greater than or equal to 5), then the more frequent those fragments are, the higher the precision of the technique. This indicates that in corpora with high reusability, such as WC1, users tend to annotate common steps of workflows for their reuse.

As shown in Table 7.8, in all four corpora the recall is very low (21% for the best case, and usually under 10%). This is an expected result, as for calculating the recall metric all the user-designed workflows and groupings are taken into account, whether reused or not. Additional noise is introduced by several user practices that we have detected in the corpora. For example, different users tend to design similar workflows and groupings differently (by adding or removing just a few steps), thus increasing their total number. Other users tend to group certain steps of the workflow to simplify the view of their current experiment, while some others group sets of steps, inputs or outputs that are not relevant to their experiment, etc. The results also show that increasing the frequency of the found fragments decreases the recall. This is natural, as highly frequent fragments are fewer than those found at lower frequencies.

Table 7.8: Recall result summary: for each corpus (WC1 to WC4) and inexact FSM technique, the table displays the lowest and highest recall obtained, along with the frequency at which it was obtained.

Corpus   Inexact FSM   Lowest Recall (freq)   Highest Recall (freq)
WC1      MDL           0,010221465 (10%)      0,127768313 (min)
WC1      Size          0,005110733 (10%)      0,209540034 (min)
WC2      MDL           0,011049724 (10%)      0,071823204 (2%)
WC2      Size          0,016574586 (10%)      0,082872928 (2%)
WC3      MDL           0 (5%)                 0,16091954 (min)
WC3      Size          0,002873563 (10%)      0,206896552 (min)
WC4      MDL           0 (10%)                0,011111111 (5%)
WC4      Size          0,011111111 (10%)      0,111111111 (min)

7.3.2.2 Evaluation of the Application of Exact FSM techniques

Table 7.9 shows the number of fragments found on the four corpora by the gSpan ex- act FSM technique. Results are sorted by support, which represents the percentage of workflows in the corpora where a FragFlow fragment was found at least once. The sup- port is normalized to the corpora size, showing in parenthesis the number of templates corresponding to that percentage. As we described in Section 6.2.1.2, exact FSM techniques aim to find all the possible workflow fragments in the corpora. Thus, table 7.9 only contains multi-step filtered fragments (mff), due to the large amount of candidate and multi-step fragments re- turned by the technique. In fact, those rows where an asterisk is shown (in WC1, WC2 and WC4) indicate that the results could only be obtained after fixing the maximum size of a fragment for the gSpan technique. Otherwise the algorithm produces memory errors due to the combinatory explosion produced by the amount of fragments that have to be taken into account for mining results. This is highlighted on WC4, where the size of the reused structures tends to be very large. On the contrary, in corpora where the size of the reused of the patterns is lower, like WC3, all the multi-step filtered

Table 7.8: Recall result summary: for each corpus (WC1 to WC4) and inexact FSM technique, the table displays the lowest and highest recall obtained, along with the frequency at which it was obtained.

Corpus | Inexact FSM | Lowest Recall (freq) | Highest Recall (freq)
WC1 | MDL | 0.010221465 (10%) | 0.127768313 (min)
WC1 | Size | 0.005110733 (10%) | 0.209540034 (min)
WC2 | MDL | 0.011049724 (10%) | 0.071823204 (2%)
WC2 | Size | 0.016574586 (10%) | 0.082872928 (2%)
WC3 | MDL | 0 (5%) | 0.16091954 (min)
WC3 | Size | 0.002873563 (10%) | 0.206896552 (min)
WC4 | MDL | 0 (10%) | 0.011111111 (5%)
WC4 | Size | 0.011111111 (10%) | 0.111111111 (min)

Figure 7.11 shows the precision results for the four corpora. Except for WC4 (for which the obtained results suggest that gSpan did not work properly), the precision of the exact FSM technique is around 40%, increasing to between 47% and 71% when considering an overlap of 80% between the proposed fragments and the workflows or groupings taken into consideration. In general, most of the observations made for the inexact FSM techniques apply to exact FSM. The best precision is obtained in WC1, due to the amount of available workflows and their reusability. WC2 and WC3 do not have many highly frequent patterns (with support over 15%), showing their best precision when selecting a support between 2% and 5%. Finally, WC4 shows limited results due to its nature, as it contains a small set of workflows which reuse large fragments of other workflows.

Table 7.9: Number of multi-step filtered fragments found by corpus at different support percentages. The number of templates that the support corresponds to is indicated in brackets. Due to memory problems, some executions (indicated with an asterisk) at lower support values had to be limited to a maximum fragment size (around 10-15).

Corpus | Support % (no. of templates) | Multi-step filtered fragments
WC1 | 2% (8) | 637*
WC1 | 5% (22) | 1996
WC1 | 10% (44) | 110
WC1 | 15% (66) | 33
WC2 | 2% (2) | 127*
WC2 | 5% (4) | 48
WC2 | 10% (9) | 14
WC2 | 15% (14) | 2
WC3 | 2% (5) | 108
WC3 | 5% (13) | 29
WC3 | 10% (22) | 9
WC3 | 15% (40) | 0
WC4 | 2% (1) | NA
WC4 | 5% (3) | out of memory
WC4 | 10% (5) | 10*
WC4 | 15% (7) | 3
* Execution limited to a maximum fragment size

However, there are some differences between exact and inexact FSM techniques. The first one is the number of multi-step filtered fragments found, which is higher for exact FSM. This is expected, as it follows the behavior established for exact FSM techniques, which is to find all possible fragments, even those contained in larger ones. The disparity in the number of fragments is also relevant for a second difference: as shown in bold in Table 7.10, inexact FSM techniques provide equal or better precision for all the corpora, including overlap. Even if more fragments are found to be equal to templates or groupings, the higher total number of fragments decreases the precision metric. Another key aspect is that, unlike in the inexact FSM results, a higher fragment frequency does not necessarily imply better precision (this only happens in WC1, where workflow reuse and grouping reuse are very frequent).

Figure 7.11: Exact FSM results for corpus WC1 to WC4 using the gSpan algorithm.

Table 7.10: Inexact FSM technique results versus exact FSM techniques in terms of precision, considering exact comparison and overlap.

Corpus | Highest P(100) (inexact FSM) | Highest P(100) (exact FSM) | Highest P(80) (inexact FSM) | Highest P(80) (exact FSM)
WC1 | 0.42 | 0.39 | 0.8 | 0.71
WC2 | 0.6 | 0.36 | 0.6 | 0.47
WC3 | 0.55 | 0.44 | 0.65 | 0.65
WC4 | 0.33 | 0.1 | 0.4 | 0.1

The precision value is influenced by the amount of fragments found by exact FSM techniques and by the design criteria followed by users (such as the preferred size or organization of workflow steps). Therefore, although some users tend to group common workflow steps of their workflows for reusability in

the four corpora, this cannot be considered the only reason for doing so. High-frequency fragments that do not correspond to any workflows or groupings could be suggested to users, as we will discuss in the next section. Finally, Table 7.11 shows a summary of the recall of the obtained results, which resembles the one obtained with inexact FSM techniques. If all the candidates proposed by gSpan were considered, the recall would be higher, as all possible fragments would be mined; however, that would decrease the precision of our results dramatically. Similarly, if only the reused workflows and groupings were compared, the recall would likely increase.

Table 7.11: Recall results summary obtained for the gSpan algorithm on each corpus. The support percentage at which each result was obtained is indicated in brackets.

Corpus | Lowest Recall (support %) | Highest Recall (support %)
WC1 | 0.022146508 (15%) | 0.137989779 (2%)
WC2 | 0 (15%) | 0.254143646 (2%)
WC3 | 0 (15%) | 0.137931034 (2%)
WC4 | 0 (15%) | 0.011 (10%)

7.3.3 User Evaluation

After comparing our results to the user-defined workflows and groupings, we performed a survey with the collaboration of three domain experts responsible for the single-user corpora used in the previous evaluation (WC1, WC2 and WC4). The purpose of the survey was to measure whether the FragFlow fragments that differed from the user-designed workflows or groupings would still be useful for each of the domain experts. In order to avoid overloading the domain experts with fragments, each of them was presented with 10 to 18 fragments randomly selected at different frequencies and support values. The number of fragments presented varied according to the number of fragments found,

and each domain expert assessed the usefulness of the fragments with regard to their own corpus. Similar to what we did in the fragment versus grouping evaluation (Section 7.3.2), we relaxed the accuracy metric by asking domain experts if they would have modified some parts of the proposed fragment before adopting it. Thus, a candidate FragFlow fragment could be used as proposed if the user would use it as it was; used with minor changes if the user would change less than one third of the fragment; used with major changes if the user would change more than one third of the fragment; or not used if the user did not think the proposed fragment would be useful. In this section we describe and discuss the results of the survey. Additional details on the survey can be seen in Annex C.4, while the responses and additional resources are available online1.

7.3.3.1 User Evaluation Results

The responses to our survey are summarized in Table 7.12, with good results on average. The first user would reuse more than half of the proposed fragments (66% accuracy), using 11% of the candidate fragments as proposed, slightly changing 16% of them and doing major changes to 38% of them. When asked about the reasons for not using the remaining third of the proposed FragFlow fragments, the user answered that they were too simple (two or three steps). User 2 would reuse all of the proposed fragments (100% accuracy): almost half of them (43%) as FragFlow detected them, half of them by changing more than one third of the components, and 6% with minor changes. When asked about the complexity of the fragments, user 2 argued that sometimes additional sub-groupings would be necessary, since they help clarify and organize the workflow. Finally, user 3 would also reuse all of the proposed fragments: 40% of them without any change, 30% with minor changes and 30% with major changes. When asked about the complexity of the proposed fragments, this user also said that some of the fragments had too many steps (30%), but that some others were too simple (20%).

1http://dx.doi.org/10.6084/m9.figshare.1577564

Table 7.12: User evaluation of FragFlow fragments.

User | Use as proposed | Use with minor changes | Use with major changes | Not use
User 1 (WC1) | 11% | 16.6% | 38% | 33.3%
User 2 (WC2) | 44% | 6% | 50% | 0%
User 3 (WC4) | 40% | 30% | 30% | 0%

7.3.4 Summary

In this section we introduced our second main evaluation, which aims at addressing whether our automatically detected common workflow fragments are commonly occurring (H1.2) or useful (H1.3). We divided this activity into two main sub-evaluations. The first one performed an automatic assessment of our proposed fragments by comparing them to the workflows and groupings designed by users of four different workflow corpora. The second sub-evaluation consisted of a domain expert survey for assessing the usefulness of our fragments. For each evaluation, we described our experimental setup and thoroughly discussed and compared the results obtained by different FSM techniques on the workflow corpora. In contrast with the evaluation presented in Section 7.2, in this section we have dealt with large workflow corpora, which allowed us to highlight the limitations that FSM techniques might suffer when dealing with a high volume of workflows (e.g., poor performance for WC4). It also allowed us to analyze how the design decisions followed by different users in the workflow corpora affect our results. It is worth mentioning that, despite our efforts, we have not found a single configuration for obtaining the most precise workflow fragments. Inexact FSM techniques are slightly more precise than exact FSM techniques, and frequency plays a role only when there is high reuse in the workflow corpus. Also, inexact FSM techniques are interesting because they tend to cluster workflows in fewer fragments, which facilitates their presentation to the user.

7.4 Evaluation Conclusions

Throughout this chapter, we have introduced the metrics, rationale, experimental settings and results of our evaluations. In this section we discuss whether these results

validate the main hypothesis of the thesis (H1), described in Chapter 3: “Scientific workflow repositories can be automatically analyzed to extract commonly occurring patterns and abstractions that are useful for developers aiming to reuse existing workflows”. The hypothesis has been divided into three different sub-parts: “it is possible to define a catalog of common domain independent patterns based on the functionality of workflow steps” (H1.1), “it is possible to detect commonly occurring patterns and abstractions automatically” (H1.2) and “commonly occurring patterns are potentially useful for users designing workflows” (H1.3). Since H1.1 has already been addressed by the catalog proposed in Section 5.1, in this section we discuss the relationship between our results, H1.2 and H1.3.

7.4.1 Commonly Used Workflow Patterns and Abstractions

H1.2 states that it is possible to detect commonly occurring patterns and abstractions automatically by making use of graph mining techniques (H1.2.1) and the exploitation of domain specific metadata (H1.2.2). Our evaluations have shown that workflow fragments can be mined successfully from a workflow corpus using inexact and exact graph mining techniques. However, the key remaining question to answer is: are these fragments commonly occurring and generic? We have addressed this question in our first evaluation, by comparing our proposed fragments to two groups of motifs related to workflow reuse: internal macros (which identify common sets of tasks within a single workflow) and composite workflows (which detect whether a workflow is a composition of other existing workflows). These motifs help identify commonly occurring workflows and workflow fragments, and our evaluation has shown that inexact mining techniques can find most of the internal macros and workflows participating in composite workflows, while exact graph mining techniques find all the target structures related to both motifs. Therefore, graph mining techniques are adequate for finding commonly occurring patterns automatically (successfully addressing H1.2.1). Internal macros and composite workflows can also be used to assess the generality of our fragments, as the corpus used in the evaluation had them annotated. Therefore, in order to evaluate H1.2.2, we performed the same type of evaluation after generalizing the workflow corpus. Our results show that generic workflow fragments can also be mined successfully (we obtained a high recall), even improving the results produced by the

inexact graph mining techniques. Therefore, we can confirm that our approach is able to detect commonly occurring patterns and abstractions in a scientific workflow corpus, thus confirming H1.2.2 and H1.2. However, our evaluation also showed that, apart from the target motifs, other commonly occurring workflow fragments were included among the results found by our approach (our precision was low in some cases). Therefore, the objective of our second and third evaluations was to measure whether these other common workflow fragments are useful for workflow design or not.

7.4.2 Workflow Fragment Usefulness

Our hypothesis H1.3 states that commonly occurring patterns are potentially useful for users designing workflows. In order to evaluate this hypothesis, we first introduced a definition of usefulness (considering a fragment useful if it had been reused and annotated as a workflow or a grouping by one or several users when designing workflows) and designed two different evaluations. The first evaluation compared our proposed fragments against several corpora of user-annotated workflows and groupings. The results show that, on average, almost half of the proposed fragments matched exactly those groupings (or workflows) reused by users in their workflows. This number increases to more than half of our proposed fragments (around 60%) if we also take into account those fragments that are similar to a user-defined grouping or workflow. Therefore, according to our results we have enough evidence to confirm hypothesis H1.3, as up to 60% of the commonly occurring patterns that we detect automatically have proven useful for users designing workflows. Our second evaluation aimed at determining whether the other 40% of our suggested fragments were useful as well. We performed a user survey with three domain experts, asking them about the usefulness of the fragments found in their respective corpora. On average, the accuracy of our results turned out to be very high. Most of the proposed fragments were considered useful by users (nearly 90%), even if they would change around half of the fragments according to their personal preferences (e.g., size, complexity, frequency, etc.). These preferences can be set up as parameters to filter fragments, thus increasing the performance of our approach.

There is clearly not enough evidence to state that the majority of our proposed commonly occurring workflow fragments are useful, as half of our candidate fragments were found to be useful and only three domain experts participated in the survey. However, the analysis presented in this chapter supports this hypothesis. A more complete analysis could not be carried out due to the difficulty of finding volunteer users with a significant corpus of workflows.

Chapter 8

Conclusions and Future Work

The main goal of this thesis was to facilitate scientific workflow understanding and reuse by automatically mining commonly used workflow fragments from a workflow corpus. To this end, we developed workflow provenance and template representation models that extend existing standards, we created a catalog of workflow abstractions based on workflow step functionality, we developed and implemented workflow fragment mining techniques, and we defined metrics and used them to measure the usefulness of workflow fragments. Based on the evaluation of our results, we draw two main conclusions. The first conclusion is that commonly occurring workflow fragments can be mined automatically from a corpus of workflows using graph mining techniques. Furthermore, commonly occurring workflow fragments are generally useful for users designing workflows. However, an important finding is that not all commonly used workflow fragments are useful, as this depends partly on user preferences and context. The second conclusion is that workflow abstraction plays an important role in understanding how different workflows relate to each other. Therefore, it is possible to find additional commonly occurring workflow fragments using step abstractions. The rest of this chapter discusses our assumptions and restrictions, contributions, impact, current limitations and future lines of work. All the inputs, outputs and intermediate results of the thesis are available online as Research Objects (Belhajjame et al., 2015) that define the datasets, structure and metadata of our experiments1.

1http://purl.org/net/mining-abstractions-in-scientific-wfs

8.1 Assumptions and Restrictions

Throughout the thesis, we have made a series of assumptions and adopted some restrictions in our work. Our first assumption for applying the methods described here is that workflow repositories should be available for exploiting the workflow definitions of their steps and dependencies (A1) or their provenance traces (A2). Also, in order to apply our fragment mining techniques successfully, we convert the workflow corpus to directed labeled acyclic graphs (R1). We restrict ourselves to this type of representation because it is the most common in data intensive scientific workflows. Other types of workflows (e.g., business workflows) may benefit from this work, but have been excluded from the scope of the thesis. Furthermore, we assume that all the steps of a workflow can be assigned a label with their type (A3). The type of a workflow step helps identify its functionality, and it is normally chosen from a library of components by the workflow designer. If two steps perform the same function in a workflow, they are assumed to have the same type (A4). Thus, two workflows are assumed to have equivalent functionality if their dependency graphs are isomorphic and their nodes share the same types (A5). Finally, our last assumption is that, when designing workflows, researchers aim to reuse workflows and workflow fragments in their work, if they find them useful (A6). Tables 8.1 and 8.2 summarize our assumptions and restrictions.
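To make R1 and A3-A5 concrete, the following minimal sketch (not the implementation used in the thesis) encodes two toy workflows as labeled DAGs and tests their functional equivalence as a type-preserving graph isomorphism. It assumes Python with the networkx library; the step identifiers and type names are invented for the example.

```python
# A minimal sketch of R1 (workflows as labeled DAGs) and A5 (equivalence as
# label-preserving isomorphism). Not the thesis implementation.
import networkx as nx
from networkx.algorithms import isomorphism

def make_workflow(steps, dependencies):
    """steps: {step_id: type label}; dependencies: (producer, consumer) pairs."""
    g = nx.DiGraph()
    for step_id, step_type in steps.items():
        g.add_node(step_id, type=step_type)  # A3: every step carries a type label
    g.add_edges_from(dependencies)
    assert nx.is_directed_acyclic_graph(g)  # R1: the dependency graph is acyclic
    return g

wf1 = make_workflow({"s1": "Stemming", "s2": "TermWeighting", "s3": "Classify"},
                    [("s1", "s2"), ("s2", "s3")])
wf2 = make_workflow({"a": "Stemming", "b": "TermWeighting", "c": "Classify"},
                    [("a", "b"), ("b", "c")])

# A5: equivalent functionality = isomorphic graphs whose matched nodes
# share the same type label.
matcher = isomorphism.DiGraphMatcher(
    wf1, wf2, node_match=isomorphism.categorical_node_match("type", None))
print(matcher.is_isomorphic())  # True
```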

8.2 Contributions

In this section we summarize our contributions with regard to the research challenges defined in Chapter 3.

8.2.1 Workflow Representation Heterogeneity

Our main contributions to this research challenge are the OPMW model (partly addressing RCRepresent-1: there is no standard model for representing scientific workflows and their metadata), and a methodology for publishing scientific workflows as Linked Data (addressing RCRepresent-2: there are no methodologies for publishing a corpus of scientific workflows and their associated resources as Linked Data).

Table 8.1: Assumptions considered in this thesis.

Assumption | Description
A1 | Available workflow repositories exist for exploiting the definitions of workflow steps and dependencies
A2 | Available workflow repositories exist for exploiting workflow provenance traces
A3 | All the steps of a workflow can be assigned a label with their type
A4 | If two steps perform the same function in a workflow, they are assumed to have the same type
A5 | Two workflows are assumed to have equivalent functionality if their dependency graphs are isomorphic and their nodes share the same types
A6 | Researchers aim to reuse workflows and workflow fragments in their work, if they find them useful

Table 8.2: Restrictions considered in this thesis.

Restriction | Description
R1 | Workflows are represented as directed acyclic graphs

8.2.1.1 The OPMW Model

In Chapter 4 we described our rationale and requirements for designing the OPMW model, the first contribution of this thesis. OPMW is a lightweight vocabulary created to represent workflow templates, workflow instances and workflow executions, along with their relationships and metadata. The requirements were gathered by looking at the main workflow operations available in data intensive workflow management systems, and selecting the minimum common group among all of them. In this regard, the main scope of the model is to be able to exploit workflow data belonging to several heterogeneous workflow systems. Another key requirement of the model was to be aligned with existing vocabularies and standards. By extending both OPM and PROV, OPMW ensures backwards compatibility with applications using OPM, as well as with those using the W3C PROV standard. Additionally, by extending P-Plan, OPMW can be combined with other vocabularies, as happened in (Santana-Pérez and Pérez-Hernández, 2015) for specifying

workflow execution infrastructure. The OPMW vocabulary has evolved since its creation, and has been tested as part of the WEST ecosystem (Garijo et al., 2014d), where a set of workflow tools for design, analysis and execution use it as a simple intermediate representation during the whole workflow life cycle. We have used this implementation to further evaluate the effectiveness of the OPMW model.
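To make this concrete, the sketch below shows how a workflow template might be described with OPMW terms taken from the competency questions in Annex A (WorkflowTemplate, WorkflowTemplateProcess, DataVariable, uses, isGeneratedBy, isVariableOfTemplate). It is a minimal illustration in Python with rdflib; the namespace URI and all resource names are assumptions for the example, not part of the official specification.

```python
# A hedged sketch of an OPMW workflow template description; resource names
# are invented, and the namespace URI is an assumption for the example.
from rdflib import Graph, Namespace, RDF

OPMW = Namespace("http://www.opmw.org/ontology/")  # assumed namespace URI
EX = Namespace("http://example.org/workflows/")    # toy resources

g = Graph()
g.bind("opmw", OPMW)

# A template with one step that consumes one variable and produces another.
g.add((EX.TextAnalysis, RDF.type, OPMW.WorkflowTemplate))
g.add((EX.StemWords, RDF.type, OPMW.WorkflowTemplateProcess))
g.add((EX.InputText, RDF.type, OPMW.DataVariable))
g.add((EX.StemmedText, RDF.type, OPMW.DataVariable))

g.add((EX.InputText, OPMW.isVariableOfTemplate, EX.TextAnalysis))
g.add((EX.StemWords, OPMW.uses, EX.InputText))             # step input
g.add((EX.StemmedText, OPMW.isGeneratedBy, EX.StemWords))  # step output

print(g.serialize(format="turtle"))
```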

8.2.1.2 A methodology for publishing scientific workflows as Linked Data

Our second contribution consists of a methodology to make the different resources associated with workflow templates, workflow instances and workflow executions openly accessible as web resources. Our proposal is to publish all the workflow related resources following the Linked Data paradigm. This allows easy accessibility and reuse for both humans and machines aiming to consume the workflow related information. Our proposed methodology adapts a generic approach successfully applied to the publication of data in other domains like energy consumption, meteorology or geography. As described in Section 4.2.2, the methodology consists of five main steps (specification, modelling, generation, publication and exploitation), which have been adapted for the publication of scientific workflows. An example of each step is provided in Section 4.2.3.

8.2.2 Addressing the Inadequate Level of Workflow Abstraction

Generalizing workflows by hand is a daunting task, and may require domain specific knowledge from expert users. In this respect, we have made two main contributions in this thesis: a catalog of common workflow motifs based on workflow step functionality, partly addressing RCAbstract-1 (it is difficult to determine which steps perform the main significant processing in a scientific workflow) and RCAbstract-2 (there are no catalogs of the typical abstractions that can be found in scientific workflows based on their basic step functionality); together with an approach to use domain specific knowledge in order to find additional workflow fragments in a workflow corpus (based on step abstraction).

8.2.2.1 A catalog of common workflow motifs

In Section 5.1 we described the result of a manual analysis of over 260 scientific workflows from four different workflow systems. As a result, we obtained a catalog of common domain-independent conceptual abstractions, which we refer to as common workflow motifs. The motif catalog consists of 6 major types of data-operation motifs (related to the main functionality undertaken by the steps of the workflow) and 6 types of workflow-oriented motifs (which indicate how the data-operation motifs are implemented in the workflow). The most common motif is data preparation, highlighting that an important part of the effort in the creation of workflows is dedicated to data integration, filtering and reformatting activities. Regarding workflow-oriented motifs, the number of internal macro motifs and composite workflows indicates that reuse is a common practice among scientists publishing their workflows. Workflow motifs categorize workflow steps by their functionality. Hence, they provide the means to create an adequate level of workflow abstraction.

8.2.2.2 Workflow generalization

Chapter 5 describes how step abstraction may help connect workflows that seemed to be unrelated. In principle, this can be achieved by asking domain experts to create taxonomies of components with the same functionality. As we have demonstrated in Chapter 7, this approach can be used to find additional generic motifs that help create relationships among different scientific workflows.
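As an illustration of this idea, the minimal sketch below generalizes two toy workflows by relabeling their steps with abstract types taken from a hypothetical expert-provided taxonomy; all names are invented for the example.

```python
# A minimal sketch of step abstraction for workflow generalization: each
# concrete step type is replaced by a more abstract type from a
# domain-expert taxonomy. The taxonomy content is invented for illustration.
taxonomy = {
    "PorterStemmer": "Stemming",
    "SnowballStemmer": "Stemming",
    "NaiveBayesClassifier": "Classify",
    "SVMClassifier": "Classify",
}

def generalize(workflow_types, taxonomy):
    """Relabel each step with its abstract type; unknown types are kept."""
    return {step: taxonomy.get(t, t) for step, t in workflow_types.items()}

wf_a = {"s1": "PorterStemmer", "s2": "NaiveBayesClassifier"}
wf_b = {"s1": "SnowballStemmer", "s2": "SVMClassifier"}

# After generalization the two seemingly different workflows share labels,
# so the same abstract fragment can be found in both.
print(generalize(wf_a, taxonomy) == generalize(wf_b, taxonomy))  # True
```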

8.2.3 Difficulty of Workflow Reuse

We presented several contributions related to this open research challenge. First, we analyzed how users reuse workflows in different corpora, addressing RCReuse-1 (it is difficult to determine the relation between different workflows in a corpus). Second, we designed and evaluated an approach to find commonly occurring workflow fragments automatically, along with metrics to assess whether a workflow fragment is useful or not in terms of reuse. This addresses RCReuse-2 (it is not easy to detect which workflows or workflow fragments are potentially useful for reuse), as our approach generates fragments that are potentially useful for reuse.

Finally, we defined a method for exploring how different workflow fragments can be linked to their original workflows, enabling the linking of distinct workflows that share common functionality (and also addressing RCReuse-1).

8.2.3.1 Workflow reuse analysis

In order to dig into the level of workflow reuse exposed in Section 5.1, Sections 5.2 and 5.3 describe two analyses from two different perspectives. The first consists of an automatic analysis of four workflow corpora, counting the number of unique workflows and groupings (i.e., user-defined sub-workflows) and how many of them are ultimately reused. An interesting conclusion of this analysis is that, despite the current difficulty of workflow reuse, a high number of workflows appear in other workflows, especially when the repository belongs to a single contributor. Groupings are also heavily reused, even more than workflows. The second analysis describes a user survey on code, workflow and grouping reuse. The rationale for the analysis was to understand the user perspective on reuse, in order to confirm the findings from the previous analyses. According to the results, workflows are generally considered more useful than groupings (although groupings are considered useful as well). Groupings are not created solely for reuse purposes, but also to organize workflows and make them easier to modularize and simplify. These analyses are a strong contribution, since they are unique in measuring sub-workflow reuse in a workflow corpus.
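The following minimal sketch illustrates the kind of counting performed in the first analysis, under the simplifying assumption that each grouping can be reduced to a hashable signature; the corpus content is invented for the example.

```python
# A minimal sketch of counting unique and reused groupings across a corpus.
# Grouping signatures and corpus content are illustrative only.
from collections import Counter

# Each workflow lists the signatures of the groupings it contains.
corpus = {
    "wf1": ["normalize|align", "plot"],
    "wf2": ["normalize|align"],
    "wf3": ["normalize|align", "plot", "register"],
}

counts = Counter(g for groupings in corpus.values() for g in groupings)
unique_groupings = len(counts)
reused = sum(1 for c in counts.values() if c > 1)

print(unique_groupings, reused)  # 3 unique groupings, 2 of them reused
```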

8.2.3.2 Automatic workflow fragment detection

Another contribution addressing the difficulty of workflow reuse is our approach for the automated detection of workflow fragments using graph mining techniques. This approach, described in Chapter 6, is able to successfully detect commonly occurring workflow fragments using graph mining techniques and filtering the fragments to simplify the results. As shown in the evaluation in Section 7.3, more than half of the resulting fragments are equal or very similar to what users identified as useful in their own designs. Furthermore, those fragments that were not similar to what users indicated in their workflows were considered useful by three different domain experts on their respective corpora. This indicates that our approach is able to find fragments

that are useful for workflow reuse, although further analysis is needed to confirm that the majority of our proposed fragments are useful for that purpose.
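The sketch below illustrates the flavor of the fragment filtering step on toy data; the exact filtering rules implemented in FragFlow may differ, and the fragment representation (a set of typed edges plus a frequency) is a simplification chosen for the example.

```python
# A hedged sketch of post-processing raw mining output: keep only
# multi-step fragments, and drop fragments contained in a larger fragment
# occurring with the same frequency. Fragment data is illustrative.
def filter_fragments(fragments):
    # Multi-step: at least one edge, i.e., at least two connected steps.
    multi_step = [(e, f) for e, f in fragments if len(e) >= 1]
    kept = []
    for edges, freq in multi_step:
        # A proper subset of an equally frequent fragment adds no information.
        subsumed = any(edges < other and freq == other_freq
                       for other, other_freq in multi_step)
        if not subsumed:
            kept.append((edges, freq))
    return kept

fragments = [
    (frozenset({("Stemming", "TermWeighting")}), 12),
    (frozenset({("Stemming", "TermWeighting"),
                ("TermWeighting", "Classify")}), 12),
    (frozenset({("FeatureSelection", "Model")}), 7),
]
# The two-edge fragment subsumes the first one at the same frequency.
print(len(filter_fragments(fragments)))  # 2
```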

8.2.3.3 Workflow fragment assessment metrics

Another contribution regarding the difficulty of workflow reuse is the set of metrics we have defined for assessing whether a workflow fragment is useful or not. These metrics are described in Section 7.1, and are based on precision and recall. The first metric relies on the ability to find fragments corresponding to motifs that are either commonly occurring or generic, like internal macros and composite workflows. The second metric assesses whether our fragments are similar to the structures defined by users when they design workflows.
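The following hedged sketch illustrates the spirit of the precision metric with overlap, in the style of the P(100) and P(80) measures used in Chapter 7. The precise definitions are given in Section 7.1; the containment-based overlap and the step-set representation used here are simplifications for the example.

```python
# A hedged sketch of precision with overlap. Fragments and groupings are
# simplified to sets of step identifiers; data is illustrative.
def overlap(fragment, grouping):
    """Fraction of the fragment's steps covered by the grouping
    (a simplification of the overlap notion in Section 7.1)."""
    return len(fragment & grouping) / len(fragment)

def precision(fragments, groupings, threshold):
    hits = sum(1 for f in fragments
               if any(overlap(f, g) >= threshold for g in groupings))
    return hits / len(fragments)

fragments = [{"s1", "s2"}, {"s3", "s4", "s5", "s6", "s7"}, {"s8"}]
groupings = [{"s1", "s2"}, {"s3", "s4", "s5", "s6"}]

print(precision(fragments, groupings, 1.0))  # P(100)-style: 1/3 exact match
print(precision(fragments, groupings, 0.8))  # P(80)-style: 2/3 with overlap
```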

8.2.3.4 Workflow fragment linking

In addition to obtaining the resulting set of commonly occurring workflow fragments, we have described in Section 6.4 a generic method to link a fragment to the workflow or workflows where it appears. This is a contribution towards easing workflow reuse and discoverability, as the workflows of a repository are related to each other by their commonly occurring workflow fragments.

8.2.4 Lack of Support for Workflow Annotation

We have described in this thesis two contributions for facilitating the annotation of workflows and workflow fragments in a semi-automated way (RCAnnot-1), in the form of two different vocabularies. The first is a vocabulary for common workflow motifs and the second is a vocabulary for workflow fragment description and linking.

8.2.4.1 Workflow motif vocabulary (wf-motifs)

Wf-motifs2 is a vocabulary encoding the catalog of common workflow motifs presented in Section 5.1. This vocabulary is a contribution towards workflow annotation for two reasons. First, it provides a machine-readable description of the catalog of motifs, stating how different motifs are related to or depend on each other. Second, it defines explicit relationships to annotate workflows and workflow steps with any of the motifs described in the catalog.

2http://purl.org/net/wf-motifs
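As an illustration only, the sketch below shows how such a motif annotation could look in Python with rdflib. The motif class and property names used here (DataPreparation, hasMotif) are assumptions for the example; the published vocabulary at http://purl.org/net/wf-motifs defines the authoritative terms.

```python
# A hedged sketch of annotating a workflow step with a motif. The class
# and property local names below are assumed for illustration; consult the
# published wf-motifs vocabulary for the exact terms.
from rdflib import Graph, Namespace, RDF

MOTIFS = Namespace("http://purl.org/net/wf-motifs#")  # assumed fragment URI
EX = Namespace("http://example.org/workflows/")

g = Graph()
g.bind("motifs", MOTIFS)

# Assumed terms: a DataPreparation motif class and a hasMotif property.
g.add((EX.ReformatDataset, MOTIFS.hasMotif, EX.motif1))
g.add((EX.motif1, RDF.type, MOTIFS.DataPreparation))

print(g.serialize(format="turtle"))
```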

8.2.4.2 Workflow fragment description vocabulary (wf-fd)

The Wf-fd vocabulary3 is the means we use to expose the links between fragments and the original workflows in which they were found. In this regard, the vocabulary itself is a contribution, because we use it to automatically annotate workflows with their associated workflow fragments.

8.3 Impact

The contributions of this thesis have already started to be used in recent work by other researchers. The OPMW model has been used for exposing automatically annotated bioinformatics workflows (García-Jiménez and Wilkinson, 2014b) and for documenting the results of scientific research4. Our catalog of motifs has been adopted and extended for analyzing workflows in distributed environments (Olabarriaga et al., 2013) and for summarizing workflows (Alper et al., 2013). Other contributions may impact different workflow research areas, as we describe below:

• Workflow exploration: By knowing how each fragment is found in a workflow, it is possible to derive a network where the nodes represent the different workflows and the edges are the common fragments (dependencies) found among them. Thus, the higher the number of common fragments, the stronger the connection between two workflows. This kind of network would help to see how different workflows are connected to each other in a repository, facilitating the exploration of workflows that share parts of their functionality (see the sketch after this list).

• Workflow abstraction: Approaches like (Cerezo, 2013) use fragment knowledge bases to create abstractions. These approaches can benefit from our commonly occurring workflow fragments.

• Workflow discovery: Commonly occurring workflow fragments, when supported in many workflows, may become valuable workflow candidates themselves. Repository maintainers may offer these potentially reusable fragments to users to include in their workflows.

3http://purl.org/net/wf-fd
4http://linked-research.270a.info/linked-research.html

• Workflow visualization: As stated in the user survey of Section 5.3, one of the reasons why groupings are created is for summarization and organization purposes. Hence, commonly occurring workflow fragments can be used to reduce the complexity of a workflow by collapsing their steps under a single node. If a workflow has overlapping fragments, it would be possible to create different views according to user preferences, simplifying the overall complexity shown to the final user.

• Workflow compression: Similarly, if several workflows share the same fragment, it would be possible to store the common relevant workflow fragments instead of every expanded workflow, for efficiency. This would be particularly useful when dealing with similar or identical provenance traces.

• Suggestions for workflow design: Commonly occurring fragments may be used to suggest how a user may complete a workflow under development. By comparing the current workflow with the commonly occurring fragments, it would be possible to recommend the next step or sets of steps of the workflow.

• Workflow ranking: Once the catalog of fragments is linked to a workflow corpus, it would be possible to order the workflows by different criteria and create rankings. Possible examples of rankings are the degree of reusability of the workflow (i.e., how much of the workflow can be found in other workflows), workflows with the most reused fragments, etc.
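The workflow-exploration idea above can be sketched in a few lines; the fragment occurrence data is invented for the example, and the networkx library is assumed.

```python
# A minimal sketch of the workflow-exploration network: workflows become
# nodes and shared fragments become weighted edges. Data is illustrative.
from itertools import combinations
import networkx as nx

# Which fragments appear in which workflows (e.g., from FragFlow output).
fragment_occurrences = {
    "frag1": {"wfA", "wfB", "wfC"},
    "frag2": {"wfA", "wfB"},
    "frag3": {"wfC"},
}

net = nx.Graph()
for frag, workflows in fragment_occurrences.items():
    for wf1, wf2 in combinations(sorted(workflows), 2):
        if net.has_edge(wf1, wf2):
            net[wf1][wf2]["weight"] += 1  # more shared fragments, stronger link
        else:
            net.add_edge(wf1, wf2, weight=1)

print(net["wfA"]["wfB"]["weight"])  # 2: wfA and wfB share frag1 and frag2
```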

8.4 Limitations

In this section we discuss the main limitations associated with the contributions and methods described in the thesis.

8.4.1 Workflow Representation and Publication

OPMW aims to represent scientific workflow templates, instances and executions and their relationships as labeled directed acyclic graphs (as stated in restriction R1). More complex workflow structures, such as sub-workflows, loops and conditionals (the last two being more typical of business workflows), have been deliberately left out of the model for simplicity. In this regard, OPMW captures the minimum set of workflow

operations common to most workflow systems. In order to represent more complex structures, additional extensions of the model would be necessary. Regarding the publication of workflows, one possible issue is privacy. In some domains privacy is a concern (e.g., if the workflow processes sensitive clinical data), and in those cases the open publication of all the workflow resources as Linked Data would not be appropriate. However, there are many areas of science where privacy is not an issue and that would benefit tremendously from a more open architecture for sharing both data and workflows. Besides, it is possible to publish workflows as Linked Data for local consumption, by restricting access to the Linked Data server.

8.4.2 Common Workflow Motifs

Our catalogue of common workflow motifs is the result of an extensive analysis of workflows from different domains and different platforms. We believe that the sample collected is statistically significant for the life-sciences domain, as for most of the systems we have chosen the majority of the workflows available. However, there may be other motifs in other domains and types of workflows (e.g., business workflows) or in workflows prepared to be executed in other settings like distributed systems (as happened in (Olabarriaga et al., 2013)). Since the catalog is exposed as an online vocabulary, we have provided the means to extend it whenever additional analyses are made. In addition, one of the objectives of this work tackled the automatic detection of abstractions from a repository of workflows. In this regard, we have only addressed the detection of those motifs related to workflow reuse. The rest of the motif catalog has been left out of scope, and it presents a challenging line of future work.

8.4.3 Workflow Fragment Mining

As discussed in Chapter 6, there are two possible limitations introduced by graph mining techniques: the size of the result (i.e., the number of mined workflow fragments) and the time needed to compute the results. Regarding the size of the result, we have developed filtering techniques that prune the response and return only those fragments that are relevant to our purposes. However, our results have shown that there are cases where the size of the commonly occurring fragments leads to memory errors in the exact graph mining techniques

(e.g., when many versions of a big workflow are included as part of the corpus). This limitation can be handled by augmenting the amount of memory of the application, increasing the minimum frequency or support for a fragment to be found in the workflow corpus, limiting the maximum size of the fragments to be found, or by using an inexact graph mining technique (which greatly reduces the number of responses, even if not all the possible fragments are returned). Regarding the time, sub-graph isomorphism is known to be an NP-Complete problem (Cook, 1971). Different techniques optimize their algorithms in order to reduce the execution time as much as possible, being able to handle a large amount of graphs in a reasonable time. Although we state time as a limitation of graph mining techniques, we do not consider it critical in our analyses. Workflow fragment mining can be performed as an automated back-end task on a repository, executed once in a fixed period of time (e.g., every week or month). Time can also be reduced greatly by modifying the parameters of the mining algorithms, such as the maximum size of the fragments to look for in the dataset or the minimum support or frequency of the fragments to find (the bigger the support, the fewer the candidates and therefore the faster we obtain the results). However, adding these kinds of limitations may have an impact on the quality of the results.

8.4.4 Workflow Generalization

In our work, we use taxonomies and domain knowledge to generalize workflows and find abstract fragments in a workflow corpus, if they exist. These abstractions depend on how domain experts model the domain, and thus different workflows may be simplified differently according to how each user models their domain. The main limitation in this regard is that the domain knowledge must come from an expert or a community of experts in the form of a taxonomy or an ontology, in order for us to be able to generalize workflows. The creation of a taxonomy from a repository of workflows is out of the scope of this work.

8.5 Future Work

Our work may be expanded in several ways. In this section we discuss possible future lines of work related to our research challenges.

8.5.1 Towards Workflow Ecosystem Interoperability

Workflow heterogeneity has been an issue in the workflow community for years, and some interchange languages for workflow specification interoperability have been specified by part of the community (e.g., IWIR). However, the workflow life cycle (including workflow instances and workflow executions) is still represented differently by each workflow tool. We believe that a way to consolidate an intermediate representation between different workflow design and execution tools is through a workflow ecosystem (Garijo et al., 2014d). A workflow ecosystem brings together heterogeneous tools that consume and produce workflow templates, instances and executions at different moments of the workflow life cycle. Different tools may also consume workflows at different granularities (e.g., a workflow depiction tool would aim to consume an abstract provenance trace, while a workflow execution tool would require the full dependency graph among the workflow steps and where to find the scripts to run each of them). This forces tools to cooperate in order to communicate with a consistent representation. In this thesis we have proposed and tested a model that captures the minimum set of common workflow operations in order to successfully communicate between different applications in a workflow ecosystem. However, this is only a first step, and a community of adopters and producers is necessary to push forward a robust and consolidated standard model.

8.5.2 Automating Detection of Workflow Abstractions

In this thesis we have discussed two ways of creating abstractions from workflows: workflow generalization (which is performed automatically by using a taxonomy provided by domain experts) and a catalog of workflow motifs (some of which we find automatically by using graph mining techniques). Even though we have contributed towards the automatic detection of workflow abstractions, our work may be expanded, as we describe below.

Workflow Generalization: One of the limitations we have identified for workflow generalization is the need for a domain expert to create the taxonomy used to generalize the workflow steps. The ability to derive a domain specific taxonomy automatically would improve the process in case no domain experts can be consulted. The extraction

of the taxonomy may be performed by exploring workflows and by mining domain specific knowledge bases with descriptions of the workflow steps.

Workflow Motifs: In Section 5.1 we described a catalog of common workflow motifs in scientific workflows. As discussed in the limitations section, a future line of work would be to expand this catalog with other types of workflows (e.g., business workflows) or execution environments (e.g., high performance or distributed workflows). We have left the detection of motifs unrelated to workflow reuse out of the scope of this thesis. A challenging line of work would be, given a workflow as input, to recognize and annotate all its motifs automatically. This would help understand the workflow better, as in many cases there is not enough documentation to determine what a particular step of a workflow does unless we execute it. After extracting and refining the motif catalog, we analyzed some of the features of the motifs. Below we highlight some of the features which might help their automated classification, especially for those under the “data preparation” category:

• Input and output number: by determining whether the number of inputs is bigger than the number of outputs, we can (usually) discard certain types of motifs. For example, a merging component normally has several inputs of the same type and produces a smaller number of outputs of the same type.

• Input and output size: when combined with the number of inputs, their size can be used as a metric to speculate about the type of operation taking place in the component. For example, if a dataset file is used as input of a step and the result is a similar but reduced dataset, the input may have been filtered.

• Input and output type: this feature can be combined with the previous two to speculate about the type of motif being applied. For example, if the type of the input is the same as the type of the output, the step might be applying a data transformation (filter, clean, merge, group) to the same file.

• Name similarity: It is common for the authors to name their components after the function they perform. For example, if a component merges a dataset, it is likely to be named “MergeDataset” (or similar). By measuring the distance of the workflow step name to a controlled vocabulary (merge, filter, concatenate, curate, add, etc.) we could speculate about its function.

These features may be combined into feature vectors used to train classifiers, as sketched below. Additionally, algorithms to differentiate between inputs and outputs could be run, in order to determine the type of transformations happening at each workflow step. Adding these kinds of annotations to a workflow corpus automatically would also allow categorizing workflow steps in order to derive their taxonomy.
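The hedged sketch below shows how the features listed above could be assembled into a feature vector ready for a standard classifier; the feature encoding, the controlled vocabulary and the toy step description are all assumptions for the example.

```python
# A hedged sketch of building a motif-classification feature vector from
# the step features discussed above. All names and data are illustrative.
import difflib

MOTIF_VERBS = ["merge", "filter", "concatenate", "curate", "add"]

def name_similarity(step_name):
    """Similarity of the step name to a controlled vocabulary of verbs."""
    best = difflib.get_close_matches(step_name.lower(), MOTIF_VERBS,
                                     n=1, cutoff=0.0)[0]
    return difflib.SequenceMatcher(None, step_name.lower(), best).ratio()

def features(step):
    return [
        step["n_inputs"],                                   # input number
        step["n_outputs"],                                  # output number
        step["input_size"] / max(step["output_size"], 1),   # size ratio
        int(step["input_type"] == step["output_type"]),     # same-type flag
        name_similarity(step["name"]),                      # lexical hint
    ]

step = {"name": "MergeDataset", "n_inputs": 3, "n_outputs": 1,
        "input_size": 300, "output_size": 120,
        "input_type": "csv", "output_type": "csv"}
print(features(step))  # a vector ready to feed a standard classifier
```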

8.5.3 Improving Workflow Reuse

There are several ways in which our candidate workflow fragments may be improved to be more useful. First, it would be necessary to study the different features of a workflow fragment in order to allow users to customize their results. As we have seen in our evaluations, different users prefer fragments of different sizes and frequencies. Other metadata, like the number of different workflows in which a fragment appears (support) or the technique used to obtain it (inexact or exact FSM), may be relevant for filtering a set of useful fragments for a particular user. Second, it should be possible to rank a set of fragments once they have been automatically characterized and annotated. Rankings make a set of fragments easily comparable in order to recommend workflows similar to a given one (based on their fragment similarity). It would be an interesting line of work to determine whether a recommendation based on similar workflow fragments provides similar or better results than other approaches in the state of the art. Third, a set of workflow fragments may be used for suggestions in workflow design, in combination with process mining approaches. Process mining approaches have demonstrated good results in predicting the next step of a workflow. When using process mining techniques with common workflow fragments (which have been repeatedly used across the dataset), we suspect that better results may be obtained with less processing, as the common workflow fragments filter out those structures that are unlikely to appear. Finally, another area of improvement is the efficiency with which the workflow fragments are obtained. The size of the commonly occurring fragments may cause the fragment detection process to be slow (a consequence of sub-graph isomorphism being an NP-Complete problem). We have deliberately left the issue of efficiency out of the thesis, as our main objective was to assess whether graph mining techniques would be appropriate for obtaining a set of workflow fragments useful for designing workflows. Therefore, we have

tested representative graph mining algorithms of inexact and exact FSM techniques. In Annex D we discuss additional algorithms with the potential to either discover other types of workflow fragments (as inexact FSM techniques), or return the results more efficiently.

ANNEX A

Competency questions for OPMW

Here we introduce the competency questions created for the development of OPMW, described in Section 4.1.3. The competency questions can be seen in Tables A.1, A.2, A.3, A.4 and A.5 below.

Table A.1: Competency questions for OPMW (1).

Competency Question | Terms created/reused to address it
What are the input data variables of a workflow step? | DataVariable, uses, WorkflowTemplateProcess
What are the output data variables generated by a workflow step? | DataVariable, isGeneratedBy, WorkflowTemplateProcess
What are the parameters used for a workflow step (name, type)? | ParameterVariable, uses
What are the files associated to a workflow execution? | WorkflowExecutionArtifact, opmo:account, WorkflowExecutionAccount
What are the intermediate results produced in a workflow execution? | WorkflowExecutionArtifact, opmo:account, WorkflowExecutionAccount

Table A.2: Competency questions for OPMW (2).

Competency Question | Terms created/reused to address it
Which processes were run in a workflow execution? | WorkflowExecutionProcess, opmo:account, WorkflowExecutionAccount
What are the parameters and variables of a workflow? | DataVariable, isVariableOfTemplate, WorkflowTemplate
Where was a particular workflow executed? | WorkflowExecutionAccount, executedInWorkflowSystem
Which codes were used to run a workflow step? | WorkflowExecutionProcess, hasExecutableComponent
This execution step, to which template step does it correspond? | WorkflowExecutionProcess, correspondsToTemplateProcess, WorkflowTemplateProcess
This execution file (intermediate result, input or result), to which part of the workflow template does it correspond? | WorkflowExecutionArtifact, correspondsToTemplateArtifact, WorkflowTemplateArtifact
To which workflow template does this execution correspond? | WorkflowExecutionAccount, correspondsToTemplate, WorkflowTemplate
Which step produces this data variable in the workflow template? | DataVariable, isGeneratedBy, WorkflowTemplateProcess
Which steps use this data variable in the workflow template? | DataVariable, uses, WorkflowTemplateProcess
This parameter, to which template does it correspond? | ParameterVariable, isParameterOfTemplate, WorkflowTemplate

Table A.3: Competency questions for OPMW (3).

Competency Question | Terms created/reused to address it
This variable, to which template does it correspond? | DataVariable, isVariableOfTemplate, WorkflowTemplate
This intermediate result, to which execution does it correspond? | WorkflowExecutionArtifact, opmo:account, WorkflowExecutionAccount
This result, to which execution does it correspond? | WorkflowExecutionArtifact, opmo:account, WorkflowExecutionAccount
Who is responsible for the execution of the workflow (or steps of the workflow)? | WorkflowExecutionProcess, opmv:wasControlledBy, prov:wasAssociatedWith, opmv:Agent
Which steps use a particular dataset in an execution? | WorkflowExecutionProcess, opmv:used, prov:used, WorkflowExecutionArtifact
Which step has generated this output? | WorkflowExecutionArtifact, opmv:wasGeneratedBy, prov:wasGeneratedBy, WorkflowExecutionProcess
Where was this workflow created? | WorkflowTemplate, createdInWorkflowSystem
Does this workflow have a documentation? | WorkflowTemplate, hasDocumentation
Are there any collections on the workflow? | WorkflowTemplateArtifact, hasDimensionality

Table A.4: Competency questions for OPMW (4).

Competency Question | Terms created/reused to address it
How long did an execution last? | WorkflowExecutionAccount, hasOverallStartTime, hasOverallEndTime
Was the execution finished successfully? | WorkflowExecutionAccount, hasStatus
What is the start time and end time of this execution process? | WorkflowExecutionProcess, prov:startedAtTime, prov:endedAtTime
Can I see a diagram of the execution? | WorkflowExecutionAccount, hasExecutionDiagram
Can I see a diagram of the workflow? | WorkflowTemplate, hasTemplateDiagram
What is the original template designed by the workflow system? | WorkflowTemplate, hasNativeSystemTemplate
What is the execution log? | WorkflowExecutionAccount, hasOriginalLogFile
Is a template component abstract? (i.e., does it have multiple possible implementations) | WorkflowTemplateProcess, isConcrete
What is the name of this file? | WorkflowExecutionArtifact, hasFileName
What is the size of this file? | WorkflowExecutionArtifact, hasSize
What is the URL with the location of this file? | WorkflowExecutionArtifact, hasLocation
What is the version number of this workflow? | WorkflowTemplate, versionNumber
What was the value of this parameter? | WorkflowExecutionArtifact, hasValue

Table A.5: Competency questions for OPMW (5).

Competency Question | Terms created/reused to address it
Is there a license associated to this workflow or to any of its outcomes? | WorkflowTemplate, WorkflowExecutionArtifact, WorkflowExecutionProcess, dc:rights, dc:license
Who created this workflow? | WorkflowTemplate, dc:creator, prov:wasAttributedTo
Who contributed to this workflow? | WorkflowTemplate, dc:contributor, prov:wasAttributedTo
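As an illustration of how these competency questions translate into queries, the hedged sketch below answers the first question of Table A.1 with a SPARQL query over a toy OPMW graph, using Python with rdflib. The namespace URI and the instance data are assumptions for the example.

```python
# A hedged sketch of answering "What are the input data variables of a
# workflow step?" (Table A.1) with SPARQL. Instance data is illustrative.
from rdflib import Graph, Namespace, RDF

OPMW = Namespace("http://www.opmw.org/ontology/")  # assumed namespace URI
EX = Namespace("http://example.org/workflows/")

g = Graph()
g.add((EX.StemWords, RDF.type, OPMW.WorkflowTemplateProcess))
g.add((EX.InputText, RDF.type, OPMW.DataVariable))
g.add((EX.StemWords, OPMW.uses, EX.InputText))

QUERY = """
PREFIX opmw: <http://www.opmw.org/ontology/>
SELECT ?step ?variable WHERE {
    ?step a opmw:WorkflowTemplateProcess ;
          opmw:uses ?variable .
    ?variable a opmw:DataVariable .
}
"""
for row in g.query(QUERY):
    print(row.step, "uses", row.variable)
```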

ANNEX B

Reuse Questionnaire

Tables B.1, B.2 and B.3 illustrate the set of questions included in the user survey described in Section 5.3.

Table B.1: List of the questions included in the user survey (1).

Number | Survey Question
Q1 | Is writing code important for being able to do research in a neuro-imaging lab?
Q2 | If it is, is sharing code with other researchers useful?
Q3 | Is the LONI Pipeline workflow system useful?
Q4 | Is the LONI Pipeline useful for creating workflows (pipelines)?
Q5 | If not, why?
     - It takes time to learn how to create workflows
     - Other:
Q6 | Is creating workflows (pipelines) useful?
Q7 | Why is creating workflows (pipelines) useful? Select all that apply:
     - It saves time to reuse components from my previous workflows
     - Workflows make it easier to track and debug complex code
     - Workflows are a convenient way to organize and store my code
     - Workflows make me think of organizing my code better
     - Workflows make me think of making my code modular and more reusable
     - Workflows are a convenient way to make my methods more understandable
     - Workflows are a useful visualization of an overall analysis
     - Workflows facilitate reproducibility
     - Other
Q8 | Is reusing previously created workflows (pipelines) in new analyses useful?

Table B.2: List of the questions included in the user survey (2).

Number | Survey Question
Q9 | Why is reusing previously created workflows (pipelines) useful? Select all that apply:
     - It saves time to reuse components from my previous workflows
     - Workflows give a high-level diagram that helps remember what was done
     - Other:
Q10 | Is it useful to share workflows (pipelines) with other researchers?
Q11 | Why is it useful to share workflows (pipelines) with other researchers? Select all that apply:
     - Non-programmers can use the workflows to run code easily
     - New students and new people in the lab can easily use workflows
     - It saves them time because they do not have to re-implement the code
     - It makes everyone think of standards by adopting the ways others do things
     - Other:
Q12 | If not, why are workflows (pipelines) not shared? Select all that apply:
     - Others would not want to use them
     - Others ask too many questions to the creators of the workflows
     - Workflows created by others are difficult to understand
     - It is difficult to understand how to prepare the data for a workflow
     - Other:
Q13 | What is the most usual size of workflows (pipelines)? Select all that apply:
     - 1-5 components
     - 5-10 components
     - 10-20 components
     - >20 components
Q14 | Is it useful to reuse workflows (pipelines) from other researchers?
Q15 | If not, why? Select all that apply:
     - Workflows created by others are difficult to understand
     - It is difficult to understand how to prepare the data for a workflow
     - Workflows created by others are often too specific
     - It is hard to make them work
     - Other:
Q16 | Is the groupings functionality of the workflow system useful?
Q17 | If not, why?
Q18 | What are the benefits of using groupings? Select all that apply:
     - To have nice visualizations of an analysis
     - To simplify workflows that are complex overall
     - To make workflows more understandable to others
     - Other:
Q19 | Is it useful to reuse previous groupings in new work?

Table B.3: List of the questions included in the user survey (3).

Number | Survey Question
Q20 | Why is reusing previously created groupings useful? Select all that apply:
     - It saves time to reuse groupings instead of whole workflows
     - Groupings are a convenient way to make my code modular
     - Groupings are a convenient way to make my methods more understandable
     - Other:
Q21 | Is it useful to share your own previous groupings with other researchers?
Q22 | Why is it useful to share your own previous groupings with other researchers? Select all that apply:
     - Non-programmers can use groupings more easily than whole workflows
     - New students and new people in the lab can easily use groupings
     - It saves them time because they do not have to re-implement the code
     - It makes everyone think of standards by adopting the ways others do things
     - Other:
Q23 | If not, why are groupings not shared? Select all that apply:
     - Others ask too many questions to the creators of the workflows
     - Others would not want to use them
     - It is difficult to explain what they do
     - It is difficult to explain how to prepare the data
     - Other:
Q24 | Are groupings from other researchers reused?
Q25 | If not, why are they not reused?
     - It is difficult to understand what they do
     - It is difficult to understand how to prepare the data
     - They are too specific
     - It is hard to make them work
     - Other:
Q26 | Are workflows (pipelines) linked to publications?
Q27 | How is this done? Select all that apply:
     - Keeping personal notes about what workflows were used in a paper
     - Posting links publicly in a project or personal Web site
     - Other:
Q28 | When workflows are not linked to publications, why is that? Select all that apply:
     - People do not know how to do it
     - People do not find it useful
     - Other:
Q29 | Would you like to add any other comments on the topics of the questions above?

ANNEX C

Evaluation Details

This annex provides additional details of the results described in Chapter 7.

C.1 Motifs found in the Wings Corpus

Table C.1 provides details on how the internal macro motifs are found in the Wings corpus. Regarding composite workflows, we analyze whether a particular workflow is part of another workflow (i.e., is part of a composite workflow) or not. More details on any of the templates can be browsed online1.

C.2 Details on the evaluation of the Application of Inexact FSM techniques

Table C.2 introduces the details used to create the figures depicted in Section 7.2.2 of the evaluation.

C.3 Details on the evaluation of the Application of Exact FSM techniques

Table C.3 introduces the details used to create the figures depicted in Section 7.3.2.2 of the evaluation.

1http://purl.org/net/wexp

Table C.1: Motifs found in the Wings corpus. Templates with one step are omitted.

Template name | N of Steps | Part of comp. workflows? | Part of comp. workflows? (generalizing) | N of internal macros | N of internal macros (generalizing)
1 Classify | 1 | - | - | - | -
2 CorrelationScore | 1 | - | - | - | -
3 DocumentClassificationMutli | 15 | no | no | 1 (5 steps) | 2 (2 steps and 5 steps)
4 DocumentClassificationSingle | 13 | no | no | 1 (6 steps) | 1 (6 steps)
5 DocumentClustering | 6 | no | no | 0 | 0
6 FeatureGeneration | 4 | no | yes | 0 | 0
7 FeatureSelection | 7 | no | no | 0 | 0
8 GenerateVocabular | 3 | no | no | 1 (1 step) | 1 (1 step)
9 Model | 3 | no | no | 0 | 0
10 ModelThenClassify | 2 | no | no | 0 | 0
11 MultiLabel | 3 | no | yes | 0 | 0
12 PlotCorrelationScore | 3 | no | no | 0 | 0
13 PlotTopics | 1 | - | - | - | -
14 PrepareDataset | 2 | no | no | 0 | 0
15 ReduceDataset | 1 | - | - | - | -
16 Similar | 2 | no | yes | 0 | 0
17 SimilarWords | 6 | no | yes | 0 | 0
18 SimilarWordsTopics | 6 | no | yes | 0 | 0
19 Stemming | 3 | yes | yes | 0 | 0
20 TermWeighting | 1 | - | - | - | -
21 Topic Modeling | 6 | no | no | 0 | 1 (2 steps)
22 Validate | 2 | yes | yes | 0 | 0

C.4 Usefulness Questionnaire

The questionnaire issued to the three different users followed the template described in Table C.4:

Table C.2: Details on the precision of inexact FSM techniques on corpora WC1-WC4, used in the LONI Pipeline evaluation. The frequency indicates the minimum number of occurrences for a fragment to appear in the corpus. The precision values are shown only for multi-step filtered fragments (Mff) for simplicity.

Corpus | Inexact FSM | Freq (num. occurr. (occ)) | Candidate | Multi-Step | Mff | P100 (Mff) | P90 (Mff) | P80 (Mff)
WC1 | MDL | min (2 occ) | 281 | 281 | 278 | 0.269784173 | 0.269784173 | 0.428057554
WC1 | MDL | 2% (8 occ) | 72 | 72 | 71 | 0.225352113 | 0.225352113 | 0.323943662
WC1 | MDL | 5% (22 occ) | 32 | 32 | 31 | 0.258064516 | 0.258064516 | 0.290322581
WC1 | MDL | 10% (44 occ) | 15 | 15 | 14 | 0.428571429 | 0.428571429 | 0.5
WC1 | Size | min (2 occ) | 481 | 371 | 368 | 0.33423913 | 0.461956522 | 0.573369565
WC1 | Size | 2% (8 occ) | 80 | 60 | 60 | 0.383333333 | 0.433333333 | 0.566666667
WC1 | Size | 5% (22 occ) | 24 | 24 | 24 | 0.291666667 | 0.333333333 | 0.583333333
WC1 | Size | 10% (44 occ) | 10 | 10 | 10 | 0.3 | 0.4 | 0.8
WC2 | MDL | min (2 occ) | 93 | 93 | 91 | 0.142857143 | 0.153846154 | 0.208791209
WC2 | MDL | 2% (2 occ) | 93 | 93 | 91 | 0.142857143 | 0.153846154 | 0.208791209
WC2 | MDL | 5% (4 occ) | 20 | 20 | 20 | 0.2 | 0.2 | 0.2
WC2 | MDL | 10% (9 occ) | 6 | 6 | 6 | 0.333333333 | 0.333333333 | 0.333333333
WC2 | Size | min (2 occ) | 135 | 86 | 84 | 0.178571429 | 0.273809524 | 0.380952381
WC2 | Size | 2% (2 occ) | 135 | 86 | 84 | 0.178571429 | 0.273809524 | 0.380952381
WC2 | Size | 5% (4 occ) | 31 | 21 | 21 | 0.238095238 | 0.333333333 | 0.476190476
WC2 | Size | 10% (9 occ) | 5 | 5 | 5 | 0.6 | 0.6 | 0.6
WC3 | MDL | min (2 occ) | 141 | 141 | 141 | 0.397163121 | 0.418439716 | 0.482269504
WC3 | MDL | 2% (5 occ) | 25 | 25 | 25 | 0.36 | 0.4 | 0.56
WC3 | MDL | 5% (13 occ) | 5 | 5 | 5 | 0 | 0 | 0.2
WC3 | MDL | 10% (26 occ) | 1 | 1 | 1 | 0 | 0 | 0
WC3 | Size | min (2 occ) | 227 | 149 | 147 | 0.489795918 | 0.551020408 | 0.591836735
WC3 | Size | 2% (5 occ) | 42 | 29 | 29 | 0.551724138 | 0.620689655 | 0.655172414
WC3 | Size | 5% (13 occ) | 9 | 7 | 7 | 0.285714286 | 0.428571429 | 0.428571429
WC3 | Size | 10% (26 occ) | 3 | 2 | 2 | 0.5 | 0.5 | 0.5
WC4 | MDL | min (2 occ) | 4 | 4 | 4 | 0.25 | 0.5 | 0.75
WC4 | MDL | 2% (1 occ) | NA (failed exec.) | NA | NA | NA | NA | NA
WC4 | MDL | 5% (3 occ) | 4 | 4 | 4 | 0.25 | 0.25 | 0.5
WC4 | MDL | 10% (5 occ) | 2 | 2 | 2 | 0 | 0 | 0.5
WC4 | Size | min (2 occ) | 64 | 42 | 42 | 0.238095238 | 0.357142857 | 0.404761905
WC4 | Size | 2% (1 occ) | NA | NA | NA | NA | NA | NA
WC4 | Size | 5% (3 occ) | 18 | 16 | 16 | 0.25 | 0.375 | 0.375
WC4 | Size | 10% (5 occ) | 5 | 3 | 3 | 0.333333333 | 0.333333333 | 0.333333333

Table C.3: Details on precision and recall of exact FSM techniques on corpora WC1-WC4, used in the LONI Pipeline evaluation. The support indicates the minimum number of templates in which a fragment has to appear in the corpus. The precision and recall are shown only for multi-step filtered fragments (Mff) for simplicity. The asterisk (*) means that the execution was limited by setting a maximum fragment size, in order to avoid out of memory errors.

Corpus | Support (%), (templ (t)) | Multi-Step | Mff | P100 (Mff) | R100 (Mff) | P90 (Mff) | P80 (Mff)
WC1 | 2% (8 t) | out of memory | 637* | 0.127158556 | 0.137989779 | 0.127158556 | 0.334332834
WC1 | 5% (22 t) | out of memory | 1996 | 0.02755511 | 0.093696763 | 0.121743487 | 0.717935872
WC1 | 10% (44 t) | 54714 | 110 | 0.163636364 | 0.030664395 | 0.181818182 | 0.445454545
WC1 | 15% (66 t) | 5819 | 33 | 0.393939394 | 0.022146508 | 0.424242424 | 0.545454545
WC2 | 2% (2 t) | out of memory | 127* | 0.362204724 | 0.254143646 | 0.409448819 | 0.472440945
WC2 | 5% (4 t) | out of memory | 48 | 0.125 | 0.033149171 | 0.1875 | 0.25
WC2 | 10% (9 t) | 368877 | 14 | 0.071428571 | 0.005524862 | 0.071428571 | 0.142857143
WC2 | 15% (14 t) | 25 | 2 | 0 | 0 | 0 | 0
WC3 | 2% (5 t) | out of memory | 108 | 0.444444444 | 0.137931034 | 0.527777778 | 0.611111111
WC3 | 5% (13 t) | 2850 | 29 | 0.413793103 | 0.034482759 | 0.482758621 | 0.655172414
WC3 | 10% (26 t) | 37 | 9 | 0.222222222 | 0.005747126 | 0.222222222 | 0.444444444
WC3 | 15% (40 t) | 0 | 0 | 0 | 0 | 0 | 0
WC4 | 2% (1 t) | NA | NA | NA | NA | NA | NA
WC4 | 5% (3 t) | out of memory | out of memory | NA | NA | NA | NA
WC4 | 10% (5 t) | out of memory | 10* | 0.1 | 0.011111111 | 0.1 | 0.1
WC4 | 15% (7 t) | 268840 | 3 | 0 | 0 | 0 | 0

Table C.4: User questionnaire for assessing the usefulness of the proposed fragments.

Q1: Would you consider the proposed fragment a valuable grouping?
- I would not select it as a grouping (0)
- I would use it as a grouping with major changes (i.e., adding/removing more than 30% of the steps) (1)
- I would use it as a grouping with minor changes (i.e., adding/removing less than 30% of the steps) (2)

Q2: What do you think about the complexity of the fragment?
- The fragment is too simple (0)
- The fragment is fine as it is (1)
- The fragment has too many steps (2)

ANNEX D

Additional FSM Algorithms for Mining Useful Fragments

As described in Chapter 6, there are many existing algorithms for FSM (Jiang et al., 2012). However, finding the implementation associated with each of these algorithms is not always trivial. In this annex we discuss three algorithms with available implementations that may improve some aspects of the current results obtained by our approach. Other algorithms with available implementations exist1,2, but according to their descriptions they may not return results significantly different from what we have already obtained with the ones implemented in our approach (as most of them are exact FSM techniques).

• Sigma (Mongiovi et al., 2010) is an inexact FSM algorithm that operates over a dataset of graphs. Being inexact, Sigma may return commonly occurring fragments that are similar, discovering new levels of abstraction among the workflow corpus. Additionally, according to the authors, it outperforms other state-of-the-art inexact FSM algorithms, which makes it a potential candidate for us to consider when finding fragments in workflows.

• GRAMI (Saeedy and Kalnis, 2011) is another inexact FSM algorithm for efficiently mining large single graphs. GRAMI uses an edit distance to group together similar commonly occurring fragments, instead of using an exact mining approach (the first sketch after this list illustrates this kind of grouping).

1http://hms.liacs.nl/graphs.html 2http://www.borgelt.net/fpm.html

However, GRAMI has been designed for single large graphs, and further exploration would be required to determine whether it is appropriate for finding patterns in multiple graphs, as is our case.

• MIRAGE (Bhuiyan and Hasan, 2013) implements an exact FSM algorithm using a MapReduce approach. In this regard, the results would not differ from those of other exact FSM algorithms already implemented in FragFlow, but by parallelizing the approach with MapReduce, the total amount of time needed to process the input corpus would be significantly reduced (the second sketch after this list illustrates the map/reduce structure).
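For illustration purposes, the following minimal sketch shows the kind of edit-distance-based grouping that GRAMI performs. It is not GRAMI's actual implementation: it relies on the generic graph_edit_distance function of the networkx Python library, and the toy fragments and the distance threshold are illustrative assumptions.

import networkx as nx

def group_similar_fragments(fragments, max_distance=2):
    # Greedy grouping: a fragment joins the first group whose
    # representative is within the given edit distance.
    groups = []
    for fragment in fragments:
        for group in groups:
            if nx.graph_edit_distance(group[0], fragment) <= max_distance:
                group.append(fragment)
                break
        else:
            groups.append([fragment])
    return groups

# Two toy fragments that differ by one step and one dependency.
f1 = nx.DiGraph([("Stemming", "TermWeighting")])
f2 = nx.DiGraph([("Stemming", "TermWeighting"), ("TermWeighting", "Classify")])
print(len(group_similar_fragments([f1, f2])))  # 1: both end up in the same group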
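Similarly, the map/reduce structure that MIRAGE exploits can be sketched with Python's standard multiprocessing module. This is only a schematic illustration of the parallelization idea: the canonical_key function and the corpus layout are simplifying assumptions, not part of MIRAGE.

from collections import Counter
from multiprocessing import Pool

def canonical_key(fragment_edges):
    # Toy canonical form: the sorted edge list of the fragment.
    return tuple(sorted(fragment_edges))

def map_phase(workflow_fragments):
    # Map step: emit one canonical key per candidate fragment.
    return [canonical_key(fragment) for fragment in workflow_fragments]

def count_fragments(corpus, processes=2):
    with Pool(processes) as pool:
        mapped = pool.map(map_phase, corpus)  # one map task per workflow
    counts = Counter()                        # reduce step: aggregate counts
    for keys in mapped:
        counts.update(keys)
    return counts

if __name__ == "__main__":
    corpus = [[[("A", "B")], [("A", "B"), ("B", "C")]],
              [[("A", "B")]]]
    print(count_fragments(corpus))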

Glossary

AGM

The Apriori Graph Mining algorithm (AGM) is an exact FSM technique that uses an adjacency matrix to represent graphs in combination with an efficient levelwise search to discover common sub-graphs (Inokuchi et al., 2000). 115

AGWL

Askalon’s language for specifying scientific workflows (Fahringer et al., 2005). 20

ASKALON

Workflow management system developed by the University of Innsbruck (Austria) (Fahringer et al., 2007). 14

BFS

Breadth first search (BFS) is a technique for exploring a graph by visiting all the neighbours of a node before moving on to the next level of depth. 115
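A minimal sketch in Python over a workflow graph stored as an adjacency list (the graph below is an illustrative example):

from collections import deque

def bfs(graph, start):
    # Visit nodes level by level: all neighbours of a node come
    # before any of their own neighbours.
    visited, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(neighbour)
    return order

workflow = {"Input": ["Stemming", "StopWords"],
            "Stemming": ["Classify"], "StopWords": ["Classify"]}
print(bfs(workflow, "Input"))  # ['Input', 'Stemming', 'StopWords', 'Classify']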

BPMN

The Business Process Model and Notation (BPMN) provides a standard notation for implementing, managing and monitoring business processes (Aggarwal et al., 2011). 12

Business workflow

Templates designed to model, automate and execute a flow of business activities in the corporate world3. 11

3http://www.wfmc.org/what-is-bpm

Candidate workflow fragment

A workflow fragment identified by any of the FSM approaches integrated in FragFlow. 120

Case-based reasoning

Group of techniques that aim to solve a given problem based on past experience with similar cases. 38

Clustering

Set of techniques that aim to group together individuals sharing a particular set of common features. 37

Common workflow fragment

Given a workflow corpus, a workflow fragment is common if it appears at least f times in the corpus, with f bigger than 2. The number f can be calculated in two different ways. If each occurrence is counted once per workflow, then f represents the support of the workflow fragment. If all the occurrences of the fragment are counted (even if it appears several times in the same workflow), f represents the frequency of the workflow fragment. 109
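The difference between support and frequency can be illustrated with a minimal sketch, assuming each workflow is reduced to the list of identifiers of the fragments found in it (an illustrative simplification):

def support_and_frequency(corpus, fragment):
    # Support: count each workflow containing the fragment once.
    support = sum(1 for occurrences in corpus if fragment in occurrences)
    # Frequency: count every occurrence, including repeats per workflow.
    frequency = sum(occurrences.count(fragment) for occurrences in corpus)
    return support, frequency

corpus = [["f1", "f1", "f2"],  # f1 appears twice in this workflow
          ["f1"],
          ["f2"]]
print(support_and_frequency(corpus, "f1"))  # (2, 3)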

Competency question

A requirement to be fulfilled by an ontology (i.e., the data modeled by an ontology must be able to answer the question). 52

Conceptual workflows

Workflow model that aims to facilitate workflow abstraction and design by separating workflows into conceptual and abstract levels (Cerezo, 2013). 23

CrowdLabs

Workflow repository for VisTrails’ workflows, created by New York University (USA) (Mates et al., 2011). 26

D-PROV

Provenance model for representing workflow execution templates, instances and executions (Missier et al., 2013). It extends the PROV standard. 23

DAX

Syntax used by the Pegasus workflow system for specifying scientific workflows4. 20

DFS

Depth first search (DFS) is a technique for exploring a graph by following each branch associated to a node as far as possible before backtracking to the next unvisited neighbour. 115
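A minimal recursive sketch, using the same adjacency-list convention as the BFS example above:

def dfs(graph, node, visited=None, order=None):
    # Follow one branch to its end before backtracking.
    if visited is None:
        visited, order = set(), []
    visited.add(node)
    order.append(node)
    for neighbour in graph.get(node, []):
        if neighbour not in visited:
            dfs(graph, neighbour, visited, order)
    return order

workflow = {"Input": ["Stemming", "StopWords"],
            "Stemming": ["Classify"], "StopWords": ["Classify"]}
print(dfs(workflow, "Input"))  # ['Input', 'Stemming', 'Classify', 'StopWords']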

Directed acyclic graph

A non-empty set of nodes and edges that connect those nodes. All the edges have a direction, i.e., they go from a source node (tail) to a target node (head). This type of graph has no cycles, i.e., starting from a node it is not possible to reach the same node again by following its edges. 14
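Acyclicity can be tested, for instance, with Kahn's algorithm: nodes without incoming edges are removed repeatedly, and the graph is acyclic if and only if every node can be removed (a minimal sketch over an illustrative adjacency-list graph):

def is_dag(graph):
    # Count incoming edges per node.
    indegree = {node: 0 for node in graph}
    for targets in graph.values():
        for target in targets:
            indegree[target] = indegree.get(target, 0) + 1
    # Repeatedly remove nodes whose incoming edges are all gone.
    ready = [node for node, degree in indegree.items() if degree == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for target in graph.get(node, []):
            indegree[target] -= 1
            if indegree[target] == 0:
                ready.append(target)
    return removed == len(indegree)

print(is_dag({"A": ["B"], "B": ["C"], "C": []}))     # True
print(is_dag({"A": ["B"], "B": ["C"], "C": ["A"]}))  # False: contains a cycle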

Directed cyclic graph

A non-empty set of nodes and edges that connect those nodes. All the edges have a direction, i.e., they go from a source node (tail) to a target node (head). This type of graph contains cycles, i.e., starting from a node it may be possible to reach the same node again by following its edges. 14

DISPEL

Data-Intensive Systems Process Engineering Language, defined to “describe abstract workflows for distributed data-intensive applications” (Atkinson et al., 2013). 20

DOI

A Digital Object Identifier (DOI)5 is a standard way for accessing resources on electronic networks. It has been commonly used to refer to electronic documents, like journal articles. 26

4http://pegasus.isi.edu/wms/docs/schemas/dax-3.4/dax-3.4.html 5http://www.doi.org/

Dryad

Data repository for the international scientific and medical literature. Maintained by the North Carolina State University (USA)6. 26

Dublin Core

Popular vocabulary for defining a set of common metadata terms (e.g., attribution, licensing, etc.)7. 25

EXPO

Ontology designed to represent the methodological aspects of a scientific experiment, including the scientific tasks required to carry it out (Soldatova and King, 2006). 24

FigShare

Popular data repository for archiving resources in digital sciences8. 26

FOAF

Popular vocabulary for describing agents on the Web and their interactions9. 25

FragFlow

The approach proposed in this thesis for mining frequent workflow fragments. 109

FSG

Exact FSM technique that builds on the Apriori algorithm by adding a sparse graph representation to reduce storage and computation, while incorporating optimizations for the generation of candidate fragments (Kuramochi and Karypis, 2001). 115

FSM

The Frequent Sub-graph Mining (FSM) problem aims “to extract all the frequent sub-graphs, in a given data set, whose occurrence counts are above a specified

6http://datadryad.org/ 7http://dublincore.org/documents/dcmi-terms/ 8http://figshare.com 9http://xmlns.com/foaf/spec/

threshold” (Jiang et al., 2012). FSM techniques can be exact (if they retrieve all the frequent sub-graphs of an input corpus) or inexact (if they group together similar sub-graphs to create common abstractions). 112

Galaxy

Workflow management system developed by the Johns Hopkins University for intensive biomedical research (Goecks et al., 2010). 14

GenePattern

Workflow management system specialized in the genomics domain and developed by the Broad Institute of MIT and Harvard (USA) (Reich et al., 2006). 14

Graph

A non-empty set of nodes which are interconnected by a set of edges. 11

Graph mining

Term used to refer to the application of graph-based techniques to extract commonalities among the nodes and edges of a dataset expressed as an LDAG. 39

GREW

Inexact FSM technique based on the contraction of the edges of the input graph while rewriting it, adding special heuristics to find longer and denser fragments (Kuramochi and Karypis, 2004b). 114

Grouping

User-defined sub-workflow, which can be copied and pasted in different workflows and annotated as a unit of single functionality. 94

GSpan

Exact FSM technique that uses a canonical labeling system (a DFS lexicographic order) where each graph is assigned a minimum DFS code (Yan and Han, 2002). 120
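The role of such a canonical code can be illustrated with a much simpler stand-in (this is not gSpan's minimum DFS code, which additionally handles isomorphic fragments with different node orderings): if every candidate is mapped to a canonical form, two candidates are duplicates exactly when their forms coincide.

def toy_canonical_code(edges):
    # Toy canonical form: the lexicographically sorted edge list.
    return tuple(sorted(edges))

c1 = [("Stemming", "TermWeighting"), ("TermWeighting", "Classify")]
c2 = [("TermWeighting", "Classify"), ("Stemming", "TermWeighting")]
# Same fragment discovered in two different orders: identical codes.
print(toy_canonical_code(c1) == toy_canonical_code(c2))  # True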

ISA

The Investigation, Study, Assay Model (ISA) (Rocca et al., 2008) aims at describing the main steps and metadata involved in a scientific experiment. 24

IWIR

The Interoperable Workflow Intermediate Representation (IWIR) is a syntax for defining workflows in a system-independent manner (Plankensteiner et al., 2011). 20

Kepler

Workflow management system designed as a collaborative effort between the University of Davis (USA), the University of Santa Barbara (USA) and the University of San Diego (USA) (Ludäscher et al., 2006). 14

Knime

Workflow management system developed by the University of Konstanz (Germany) (Berthold et al., 2008). 14

Layered abstraction

Type of abstraction where conceptual levels of detail may have a very loose correspondence with one another. 31

Linked Data principles

Set of guidelines that indicate how to publish a resource and its metadata on the Web10 (Heath and Bizer, 2011). 65

Macro abstraction

Abstractions where several computation steps can be compressed together as one step. 31

MOML

Kepler’s language for specifying scientific workflows (Lee and Neuendorffer, 2000). 20

10http://www.w3.org/DesignIssues/LinkedData.html

Moteur

Workflow management system designed at the I3S Laboratory (France) (Glatard et al., 2007). 14

Multi-step filtered fragment

Multi-step workflow fragments which are not part of bigger workflow fragments with the same number of occurrences as them. For more details, see Definition 4. 121
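A minimal sketch of this filter, assuming each fragment is summarized by its set of steps and its number of occurrences, and approximating "is part of" by step-set inclusion (a simplification of Definition 4):

def multi_step_filtered(fragments):
    kept = []
    for fragment in fragments:
        if len(fragment["steps"]) < 2:
            continue  # drop single-step fragments
        # Drop fragments subsumed by a strictly bigger fragment
        # with the same number of occurrences.
        subsumed = any(other["occurrences"] == fragment["occurrences"]
                       and fragment["steps"] < other["steps"]
                       for other in fragments)
        if not subsumed:
            kept.append(fragment)
    return kept

fragments = [{"steps": {"A", "B"}, "occurrences": 4},
             {"steps": {"A", "B", "C"}, "occurrences": 4},
             {"steps": {"D"}, "occurrences": 7}]
print(multi_step_filtered(fragments))  # only the {A, B, C} fragment remains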

Multi-step workflow fragment

A workflow fragment with more than one step. 120

OBI

The Ontology for Biomedical Investigations (OBI) defines a common framework for representing experimental descriptions and methods in biomedical research (Brinkman et al., 2010). 24

Ontologies

An ontology is defined as a formal specification of a shared conceptualization. 22

OPM

The Open Provenance Model (OPM) is a generic model for describing the prove- nance of a given resource (Moreau et al., 2011). 20

OPMO

The Open Provenance Model Ontology (OPMO)11 is an ontology that extends OPMV to represent the full functionality of OPM. 53

OPMV

The Open Provenance Model Vocabulary (OPMV)12 is a lightweight implementation of the OPM model that only has a subset of the concepts in OPM but facilitates modeling and query formulation. 53

11http://openprovenance.org/model/opmo 12http://purl.org/net/opmv/ns

OPMW

The Open Provenance Model for Workflows (OPMW) is the model proposed in this thesis to describe scientific workflows by capturing all their dataflow, dependencies and provenance. 51

ORSD

The Ontology Requirement Specification Document (ORSD) (Suárez-Figueroa, 2010) describes the competency questions that an ontology must be able to answer when modeling a particular domain. 52

OWL

The Web Ontology Language (OWL) is designed to “explicitly represent the meaning of terms in vocabularies and the relationships between those terms”. OWL “has more facilities for expressing meaning and semantics than XML, RDF, and RDFS, and thus OWL goes beyond these languages in its ability to represent machine interpretable content on the Web” (McGuinness and van Harmelen, 2004). 22

P-Plan

PROV extension designed to “capture the plans that guided the execution of scientific processes” (Garijo and Gil, 2012). 23

PASOA

Provenance Aware Management System (PASOA), integrated with some workflow management systems to manage their execution traces (Miles et al., 2007). 21

Pegasus

Workflow management system developed at the Information Sciences Institute of the University of Southern California (USA) (Deelman et al., 2005). 14

Petri Net

A model for representing states and local actions in the world of information processing (Reisig and Rozenberg, 1998). When adapted to scientific workflows, the actions correspond to workflow steps, which connect states of the workflow through directed arcs. 11

POIROT

Architecture for integrating reasoning and learning components for workflows (Burstein et al., 2009). 39

Predicate abstraction

Type of abstraction where entire predicates (e.g., constraints, inputs of steps, etc.) are dropped at any given abstraction level (Graf and Saidi, 1997). 30

Problem Solving Methods

Resources that describe strategies to achieve the goal of a task in an implementation and domain-independent manner (Gómez-Pérez and Benjamins, 1999). 34

PROV

W3C standard for specifying provenance on the Web (Groth and Moreau, 2013). 21

Provenance

“A record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing” (Moreau et al., 2013). 20

Prov Manager

Provenance management system designed to store the executions of workflows according to PROV (Marinho et al., 2012). 21

ProvONE

Provenance model extending the PROV standard to represent workflow execution templates, instances and executions13. 23

Pubby

Tool used as a front-end for browsing Linked Data of SPARQL endpoints14. 69

13http://www.purl.org/provone 14http://wifo5-03.informatik.uni-mannheim.de/pubby/

RapidMiner

Workflow management system aimed at creating workflows for predictive analysis15. 26

RDF

The Resource Description Framework (RDF) is a W3C standard that specifies the relationships between resources using a labeled graph of RDF statements. Each RDF statement consists of a source node and an object node, connected by the property that links them (Klyne et al., 2004). 22
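For illustration purposes, a single RDF statement can be created and queried with the rdflib Python library (the namespace and resource names below are illustrative, not taken from any of the models in this thesis):

from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")  # illustrative namespace
g = Graph()
# One statement: source node, property, object node.
g.add((EX.Stemming, EX.precedes, EX.TermWeighting))

# The graph can then be interrogated with SPARQL (see the SPARQL entry).
query = "SELECT ?s ?o WHERE { ?s <http://example.org/precedes> ?o }"
for row in g.query(query):
    print(row.s, row.o)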

RDFS

The RDF Schema (RDFS) is an extension of the RDF vocabulary to provide a data-modeling vocabulary for RDF (McBride et al., 2014). 22

Research Object

Family of specifications for representing the aggregation of resources that are part of a research work. In addition, the RO model contains vocabularies for specifying workflow templates and executions (Belhajjame et al., 2012a). 21

SADI

The Semantic Automated Discovery and Integration (SADI) (Wilkinson et al., 2009) is a framework for discovering and integrating distributed and analytical resources16. 29

Scientific workflow

Template defining the “set of tasks needed to manage a computational science process” (Deelman et al., 2009). 2

Scufl

Taverna’s language for specifying scientific workflows (Oinn et al., 2004). 20

15https://rapidminer.com/ 16http://sadiframework.org/content/about-sadi/

SIGMA

Inexact FSM technique which uses a set-cover-based approach by associating a feature set with each edge of the candidate fragment (Mongiovi et al., 2010). 114

Skeletal plan

“Sequence of abstract and only partially specified steps which, when specialized to specific executable operations, will solve a given problem in a specific problem context” (Bergmann, 1992). 28

Software Ontology

Ontology designed to represent the typical operations in software development17. 34

SPARQL

W3C standard for querying RDF data (Harris et al., 2013). 97

Sub-graph isomorphism

Problem aimed at determining whether a given graph is part of another graph or not. 38
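A minimal sketch of a sub-graph isomorphism test, using the VF2 matcher shipped with the networkx Python library (the two toy graphs are illustrative):

import networkx as nx
from networkx.algorithms.isomorphism import DiGraphMatcher

big = nx.DiGraph([("Input", "Stemming"), ("Stemming", "Classify"),
                  ("Input", "StopWords")])
small = nx.DiGraph([("Input", "Stemming"), ("Stemming", "Classify")])

# True if some sub-graph of `big` is isomorphic to `small`
# (structure only; node labels are ignored by default).
print(DiGraphMatcher(big, small).subgraph_is_isomorphic())  # True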

SUBDUE

Inexact FSM technique based on an iterative clustering approach, applying two heuristics to reduce the input graph in each iteration (Cook and Holder, 1994). The two heuristics used by SUBDUE are the Minimum Description Length (MDL) (Rissanen, 1978) and the current size of the graph. 113

SWAN-PAV

Vocabulary defined to describe detailed attribution, authoring, contributions and versioning of web resources (Ciccarese et al., 2013). 25

Taverna

Workflow management system developed at the University of Manchester (UK) (Wolstencroft et al., 2013). 14

17http://sourceforge.net/projects/theswo/

The LONI Pipeline

Workflow management system designed for neuro-image analysis by the University of California Los Angeles (USA) and the University of Southern California (USA) (Dinov et al., 2009). 14

Topic modeling

Set of techniques that aim at generating the common topics of a set of documents. A topic is not necessarily a single word; it may be a common subset of words among the documents. 37

UML

The Unified Modelling Language (UML) is a common language for describing system architectures and behavior in software engineering (Bock et al., 2015). In the workflow domain, it can be used to represent the data dependencies among the steps of a workflow. 12

URI

A Uniform Resource Identifier (URI) is “a compact sequence of characters that identifies an abstract or physical resource” (Berners-Lee et al., 2005). 65

VisTrails

Workflow management system developed at the University of Utah (USA) and New York University (USA) (Callahan et al., 2006). 14

WExp

The Workflow Explorer (WExp)18 is a simple browser for workflows and their metadata. 69

Wf-fd

The Workflow Fragment Description Ontology (Wf-fd) is our proposed extension of the P-Plan ontology to model the relationship between workflow fragments and the workflows where they were found. 122

18http://purl.org/net/wexp

Wings

Workflow management system developed at the Information Sciences Institute of the University of Southern California (USA) (Gil et al., 2011). 14

Workflow abstraction

The ability to generalize workflows (or workflow steps) by their common functionality. Workflow abstraction is key for simplifying workflows and for finding relations between them. 10

Workflow execution trace

Provenance trace that contains the details of the execution of a workflow: used inputs, intermediate results, final outputs, etc. 17

Workflow fragment

Given a workflow W, a workflow fragment is a connected sub-workflow of W. 109

Workflow instance

A workflow that specifies the application algorithms to be executed and the concrete data to be used. Workflow instances do not specify execution resources. 17

Workflow management system

Systems made to design, monitor, execute and debug scientific workflows. 14

Workflow motif

Domain independent conceptual abstraction for workflow steps that can be found in scientific workflows. 73

Workflow pattern

“The abstraction from a concrete form which keeps recurring in specific non-arbitrary contexts” (Riehle and Züllighoven, 1996). 33

Workflow representation

The models created to serialize a workflow in the different phases of its life cycle. 10

Workflow reuse

The ability to adapt an existing workflow (or any of its sub-parts) for composing a new workflow. 10

Workflow template

Generic reusable workflow specification that indicates the types of steps in the workflow and their dataflow dependencies. 17

Workflow understanding

The ability of users to clearly determine the functionality of each of the steps (or group of steps) of the workflow. 43

WS-BPEL

Web Services Business Process Execution Language (WS-BPEL), a standard language for defining web service composition pipelines (Jordan et al., 2007). 20

ZOOM*Userviews

System designed to represent the different views of the same workflow according to their relevance (Biton et al., 2007). 32

Bibliography

Aggarwal, A., Amend, M., Astier, S., Barros, A., Bartel, R., Benitez, M., Bock, C., Brown, G., Brunt, J., Bulles, J., Chapman, M., Cummins, F., Day, R., Elaasar, M., Frankel, D., Gagn, D., Hall, J., Hille-Doering, R., Ings, D., Irassar, P., Kieselbach, O., Kloppmann, M., Koehler, J., Kraft, M., van Lessen, T., Leymann, F., Lonjon, A., Malhotra, S., Menge, F., Mischkinsky, J., Moberg, D., Moffat, A., Mueller, R., Nijssen, S., Ploesser, K., Rivett, P., Rowley, M., Ruecker, B., Rutt, T., Samoojh, S., Shapiro, R., Saxena, V., Schanel, S., Scheithauer, A., Silver, B., Srinivasan, M., Toulme, A., Trickovic, I., Voelzer, H., Weber, F., Andrea, W., and White, S. A. (2011). Business Process Model and Notation (BPMN).

Alper, P., Belhajjame, K., Goble, C., and Karagoz, P. (2013). Small is beautiful: Summarizing scientific workflows using semantic annotations. In Big Data (BigData Congress), 2013 IEEE International Congress on, pages 318–325.

American, S. (2010). In science we trust: Poll results on how you feel about science. Scientific American.

Atkinson, M., Baxter, R., Brezany, P., Corcho, O., Galea, M., Parsons, M., Snelling, D., and van Hemert, J. (2013). Definition of the dispel language. In The Data Bonanza: Improving Knowledge Discovery in Science, Engineering, and Business, pages 203–236. Wiley-IEEE Press.

Barga, R. and Gannon, D. B. (2007). Scientific versus business workflows. In Workflows for e-Science, pages 9–16.

Belhajjame, K., Corcho, O., Garijo, D., Zhao, J., Missier, P., Newman, D., Palma, R., Bechhofer, S., Garcia-Cuesta, E., Gomez-Perez, J.-M., Klyne, G., Page, K., Roos, M., Ruiz, J. E., Soiland-Reyes, S., Verdes-Montenegro, L., Roure, D. D., and Goble, C. (2012a). Workflow-centric Research Objects: First class citizens in scholarly discourse. In Proceedings of Sepublica2012, pages 1–12.

Belhajjame, K., Roos, M., Garcia-Cuesta, E., Klyne, G., Zhao, J., De Roure, D., Goble, C., Gomez-Perez, J. M., Hettne, K., and Garrido, A. (2012b). Why workflows break — understanding and combating decay in Taverna workflows. In Proceedings of the 2012 IEEE 8th International Conference on E-Science (e-Science), pages 1–9, Washington, DC, USA. IEEE Computer Society.

Belhajjame, K., Zhao, J., Garijo, D., Gamble, M., Hettne, K., Palma, R., Mina, E., Corcho, O., Gómez-Pérez, J. M., Bechhofer, S., Klyne, G., and Goble, C. (2015). Using a suite of ontologies for preserving workflow-centric Research Objects. Web Semantics: Science, Services and Agents on the World Wide Web.

Bergmann, R. (1992). Knowledge acquisition by generating skeletal plans from real world cases. In Schmalhofer, F., Strube, G., and Wetter, T., editors, Contemporary Knowledge Engineering and Cognition, volume 622 of Lecture Notes in Computer Science, pages 125–133. Springer Berlin Heidelberg.

Bergmann, R. and Gil, Y. (2014). Similarity assessment and efficient retrieval of se- mantic workflows. Information Systems, 40:115 – 127.

Berners-Lee, T., Fielding, R. T., and Masinter, L. (2005). Uniform resource identifier (URI): Generic syntax.

Berthold, M. R., Cebron, N., Dill, F., Gabriel, T. R., Kötter, T., Meinl, T., Ohl, P., Sieb, C., Thiel, K., and Wiswedel, B. (2008). Knime: The Konstanz information miner. In Data analysis, machine learning and applications, pages 319–326. Springer.

Bhuiyan, M. and Hasan, M. A. (2013). MIRAGE: an iterative MapReduce based frequent subgraph mining algorithm. CoRR, abs/1307.5894.

Biton, O., Cohen-Boulakia, S., and Davidson, S. B. (2007). Zoom*userviews: Querying relevant provenance in workflow systems. In Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB ’07, pages 1366–1369. VLDB Endow- ment.

Bizer, C., Heath, T., and Berners-Lee, T. (2009). Linked Data - the story so far. International Journal on Semantic Web and Information Systems, 5(3).

Blankenberg, D., Taylor, J., Schenck, I., He, J., Zhang, Y., Ghent, M., Veeraragha- van, N., Albert, I., Miller, W., Makova, K. D., and et al. (2007). A framework for collaborative analysis of encode data: making large-scale analyses biologist-friendly. Genome Research, 17(6):960–964.

Bock, C., Cook, S., Rivett, P., Rutt, T., Seidewitz, E., Selic, B., and Tolbert, D. (2015). OMG Unified Modeling Language (OMG UML).

Boldyreff, C. (1989). Reuse, software concepts, descriptive methods, and the practi- tioner project. SIGSOFT Softw. Eng. Notes, 14(2):25–31.

Bondy, J. A. and Murty, U. S. R. (1976). Graph Theory with Applications. Elsevier, New York.

Brinkman, R. R., Courtot, M., Derom, D., Fostel, J., He, Y., Lord, P. W., Malone, J., Parkinson, H. E., Peters, B., Rocca-Serra, P., et al. (2010). Modeling biomedical experimental processes with OBI. J. Biomedical Semantics, 1(S-1):S7.

Burstein, M. H., Yaman, F., Laddaga, R. M., and Bobrow, R. J. (2009). POIROT: Acquiring workflows by combining models learned from interpreted traces. In Pro- ceedings of the Fifth International Conference on Knowledge Capture, K-CAP ’09, pages 129–136, New York, NY, USA. ACM.

Callahan, S. P., Freire, J., Santos, E., Scheidegger, C. E., Silva, C. T., and Vo, H. T. (2006). Vistrails: Visualization meets data management. In In ACM SIGMOD, pages 745–747. ACM Press.

Cerezo, N. (2013). Workflows conceptuels. PhD thesis, Université Nice Sophia Antipolis.

Cerezo, N., Montagnat, J., and Blay-Fornarino, M. (2013). Computer-assisted scientific workflow design. Journal of Grid Computing, 11(3):585–612.

Chirigati, F., Shasha, D., and Freire, J. (2013). Reprozip: Using provenance to support computational reproducibility. In Proceedings of the 5th USENIX Workshop on the Theory and Practice of Provenance, TaPP ’13, pages 1:1–1:4, Berkeley, CA, USA. USENIX Association.

Ciccarese, P., Soiland-Reyes, S., Belhajjame, K., Gray, A. J., Goble, C. A., and Clark, T. (2013). PAV ontology: provenance, authoring and versioning. Journal of Biomedical Semantics, 4:37.

Coble, J. A., Cook, D. J., and Holder, L. B. (2006). Structure discovery in sequentially connected data. International Journal of Artificial Intelligence Tools, 15(06).

Coble, J. A., Rathi, R., Cook, D. J., and Holder, L. B. (2005). Iterative structure discovery in graph-based data. International Journal of Artificial Intelligence Tools, 14(01).

Cohen-Boulakia, S., Chen, J., Missier, P., Goble, C., Williams, A. R., and Froide- vaux, C. (2014). Distilling structure in Taverna scientific workflows: A refactoring approach. BMC Bioinformatics, 15(Suppl 1).

Cook, D. J. and Holder, L. B. (1994). Substructure discovery using minimum descrip- tion length and background knowledge. Journal of Artifcial Intelligence Research, 1:231–255.

Cook, S. A. (1971). The complexity of theorem-proving procedures. In Proceedings of the third annual ACM symposium on Theory of computing, STOC ’71, pages 151–158, New York, NY, USA. ACM.

Corcho, O. (2013). Data-intensive components and usage patterns. In Atkinson, M., Baxter, R., Brezany, P., Corcho, O., Galea, M., Parsons, M., Snelling, D., and van Hemert, J., editors, THE DATA BONANZA: Improving Knowledge Discovery for Science, Engineering and Business, chapter 7, pages 165–179. John Wiley & Sons Ltd.

Danovaro, E., Roverelli, L., Zereik, G., Galizia, A., D’Agostino, D., Paschina, G., Quarati, A., Clematis, A., Delogu, F., Fiori, E., Parodi, A., Straube, C., Felde, N., Harpham, Q., Jagers, B., Garrote, L., Dekic, L., Ivkovic, M., Caumont, O., and Richard, E. (2014). Setting up an hydro-meteo experiment in minutes: The DRIHM e-infrastructure for HM research. In e-Science (e-Science), 2014 IEEE 10th International Conference on, volume 1, pages 47–54.

Davies, K. (2011). Democratizing informatics for the long tail scientist. Bio-IT World Magazine.

Deelman, E., Blythe, J., Gil, Y., Kesselman, C., Mehta, G., Patil, S., Su, M.-H., Vahi, K., and Livny, M. (2004). Pegasus: Mapping scientific workflows onto the grid. In Dikaiakos, M., editor, Grid Computing, volume 3165 of Lecture Notes in Computer Science, pages 11–20. Springer Berlin / Heidelberg.

Deelman, E., Gannon, D., Shields, M., and Taylor, I. (2009). Workflows and e-science: An overview of workflow system features and capabilities. Future Generation Com- puter Systems, 25(5):528–540.

Deelman, E., Singh, G., Su, M., Blythe, J., Gil, Y., Kesselman, C., Kim, J., Mehta, G., Vahi, K., Berriman, G. B., Good, J., Laity, A., Jacob, J. C., and Katz, D. S. (2005). Pegasus: A framework for mapping complex scientific workflows onto distributed systems. Scientific Programming, 13(3).

Diamantini, C., Potena, D., and Storti, E. (2012). Mining usage patterns from a repository of scientific workflows. In Proceedings of the 27th Annual ACM Symposium on Applied Computing, SAC ’12, pages 152–157, New York, NY, USA. ACM.

Dinov, I. D., Horn, J. D. V., Lozev, K. M., Magsipoc, R., Petrosyan, P., Liu, Z., MacKenzie-Graham, A., Eggert, P., Parker, D. S., and Toga, A. W. (2009). Efficient, distributed and interactive neuroimaging data analysis using the LONI Pipeline. In Frontiers in Neuroinformatics, volume 3.

Dinov, I. D., Torri, F., Macciardi, F., Petrosyan, P., Liu, Z., Zamanyan, A., Eggert, P., Pierce, J., Genco, A., Knowles, J. A., Clark, A. P., Horn, J. D. V., Ames, J., Kesselman, C., and Toga, A. W. (2011). Applications of the pipeline environment for visual informatics and genomics computations. In BMC Bioinformatics, volume 12.

Editorial, N. (2006). Illuminating the black box. Nature, 442(1).

Fahringer, T., Prodan, R., Duan, R., Hofer, J., Nadeem, F., Nerieri, F., Podlipnig, S., Qin, J., Siddiqui, M., Truong, H.-L., Villazón, A., and Wieczorek, M. (2007). ASKALON: A development and Grid computing environment for scientific workflows. In Taylor, I. J., Deelman, E., Gannon, D. B., and Shields, M., editors, Workflows for e-Science, chapter 27, pages 450–471. Springer.

Fahringer, T., Qin, J., and Hainzer, S. (2005). Specification of grid workflow applications with AGWL: an Abstract Grid Workflow Language. volume 2, pages 676–685.

Fanelli, D. (2010). Do pressures to publish increase scientists’ bias? An empirical support from US states data. PLoS ONE, 5(4).

Filgueira, R., Atkinson, M., Bell, A., Main, I., Boon, S., Kilburn, C., and Meredith, P. (2014). eScience gateway stimulating collaboration in rock physics and volcanology. In e-Science (e-Science), 2014 IEEE 10th International Conference on, volume 1, pages 187–195. IEEE.

Gadelha Jr., L. M., Clifford, B., Mattoso, M., Wilde, M., and Foster, I. (2011). Prove- nance management in Swift. Future Generation Computer Systems, 27(6):775 – 780.

García-Jiménez, B. and Wilkinson, M. (2014a). Identifying bioinformatics sub-workflows using automated biomedical ontology annotations. Technical report, Center for Plant Biotechnology and Genomics (CBGP), Universidad Politécnica de Madrid.

García-Jiménez, B. and Wilkinson, M. D. (2014b). Automatic annotation of bioinformatics workflows with biomedical ontologies. In Margaria, T. and Steffen, B., editors, Leveraging Applications of Formal Methods, Verification and Validation. Specialized Techniques and Applications, volume 8803 of Lecture Notes in Computer Science, pages 464–478. Springer Berlin Heidelberg.

Garijo, D., Alper, P., Belhajjame, K., Corcho, O., Gil, Y., and Goble, C. (2012). Common motifs in scientific workflows: An empirical analysis. In 8th IEEE International Conference on eScience 2012, Chicago. IEEE Computer Society Press, USA.

Garijo, D., Alper, P., Belhajjame, K., Corcho, O., Gil, Y., and Goble, C. (2014a). Common motifs in scientific workflows: An empirical analysis. Future Generation Computer Systems, 36:338 – 351.

Garijo, D., Corcho, O., and Gil, Y. (2013a). Detecting common scientific workflow fragments using templates and execution provenance. In Proceedings of the Seventh International Conference on Knowledge Capture, K-CAP ’13, pages 33–40, New York, NY, USA. ACM.

Garijo, D., Corcho, O., Gil, Y., Braskie, M. N., Hibar, D., Hua, X., Jahanshad, N., Thompson, P., and Toga, A. W. (2014b). Workflow reuse in practice: A study of neuroimaging pipeline users. In 10th IEEE International Conference on eScience 2014.

Garijo, D., Corcho, O., Gil, Y., Gutman, B. A., Dinov, I. D., Thompson, P., and Toga, A. W. (2014c). FragFlow: Automated fragment detection in scientific workflows. In Proceedings of the 2014 IEEE 10th International Conference on e-Science - Volume 01, E-SCIENCE ’14, pages 281–289, Washington, DC, USA. IEEE Computer Society.

Garijo, D. and Gil, Y. (2011). A new approach for publishing workflows: Abstractions, standards, and Linked Data. In Proceedings of the 6th workshop on Workflows in support of large-scale science, pages 47–56, Seattle. ACM.

Garijo, D. and Gil, Y. (2012). Augmenting PROV with plans in P-Plan: Scientific processes as Linked Data. In Second International Workshop on Linked Science: Tackling Big Data (LISC), held in conjunction with the International Semantic Web Conference (ISWC), Boston, MA.

Garijo, D., Gil, Y., and Corcho, O. (2014d). Towards workflow ecosystems through semantic and standard representations. In Proceedings of the 9th Workshop on Workflows in Support of Large-Scale Science, WORKS ’14, pages 94–104, Piscataway, NJ, USA. IEEE Press.

Garijo, D., Kinnings, S., Xie, L., Xie, L., Zhang, Y., Bourne, P. E., and Gil, Y. (2013b). Quantifying reproducibility in computational biology: The case of the tuberculosis drugome. PLoS ONE, 8(11):e80278.

Giardine, B., Riemer, C., Hardison, R. C., Burhans, R., Elnitski, L., Shah, P., Zhang, Y., Blankenberg, D., Albert, I., Taylor, J., Miller, W., Kent, W. J., and Nekrutenko, A. (2005). Galaxy: a platform for interactive large-scale genome analysis. Genome Research, 15(10):1451–1455.

Gil, Y., Cheney, J., Groth, P., Hartig, O., Miles, S., Moreau, L., da Silva, P. P., et al. (2010). Provenance XG final report. Final Incubator Group Report.

Gil, Y., Deelman, E., Ellisman, M., Fahringer, T., Fox, G., Gannon, D., Goble, C., Livny, M., Moreau, L., and Myers, J. (2007). Examining the challenges of scientific workflows. Computer, 40(12):24–32.

Gil, Y., Groth, P., Ratnakar, V., and Fritz, C. (2009). Expressive reusable workflow templates. In Proceedings of the Fifth IEEE International Conference on e-Science (e-Science), Oxford, UK.

Gil, Y., Ratnakar, V., Kim, J., González-Calero, P. A., Groth, P. T., Moody, J., and Deelman, E. (2011). Wings: Intelligent workflow-based design of computational experiments. IEEE Intelligent Systems, 26(1):62–72.

Giraldo, O., García, A., and Corcho, O. (2014). SMART Protocols: Semantic representation for experimental protocols. Proceedings of the 4th Workshop on Linked Science 2014 - Making Sense Out of Data (LISC2014), in conjunction with the International Semantic Web Conference (ISWC2014), page 36.

Glatard, T., Sipos, G., Montagnat, J., Farkas, Z., and Kacsuk, P. (2007). Workflow- level parametric study support by MOTEUR and the P-GRADE portal. In Taylor, I., Deelman, E., Gannon, D., and Shields, M., editors, Workflows for e-Science, pages 279–299. Springer London.

Goderis, A. (2008). Workflow re-use and discovery in Bioinformatics. PhD thesis, School of Computer Science, The University of Manchester.

Goderis, A., Fisher, P., Gibson, A., Tanoh, F., Wolstencroft, K., Roure, D. D., and Goble, C. (2009). Benchmarking workflow discovery: A case study from bioinformatics. Concurrency: Practice and Experience.

Goderis, A., Sattler, U., Lord, P., and Goble, C. (2005). Seven bottlenecks to workflow reuse and repurposing. In The Semantic Web ISWC 2005, volume 3729 of Lecture Notes in Computer Science, pages 323–337. Springer Berlin Heidelberg.

Goecks, J., Nekrutenko, A., and Taylor, J. (2010). Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biology, 11(8):R86.

Gómez-Pérez, A. and Benjamins, R. (1999). Applications of ontologies and problem-solving methods. AI Magazine, 20(1).

Gómez-Pérez, J. M. and Corcho, O. (2008). Problem-solving methods for understanding process executions. Computing in Science & Engineering, 10(3):47–52.

Gómez-Pérez, J. M., Erdmann, M., Greaves, M., Corcho, O., and Benjamins, R. (2010). A framework and computer system for knowledge-level acquisition, representation, and reasoning with process knowledge. International Journal of Human-Computer Studies, 68(10).

Graf, S. and Saidi, H. (1997). Construction of abstract state graphs with PVS. In Grumberg, O., editor, Computer Aided Verification, volume 1254 of Lecture Notes in Computer Science, pages 72–83. Springer Berlin Heidelberg.

Groth, P., Gil, Y., Cheney, J., and Miles, S. (2012). Requirements for provenance on the web. International Journal of Digital Curation, 7(1):39–56.

Groth, P. and Moreau, L. (2013). PROV-overview: an overview of the PROV family of documents. Technical report, World Wide Web Consortium.

Gruninger, M. and Fox, M. S. (1994). The role of competency questions in enterprise engineering. In Proceedings of the IFIP WG5.7 workshop on benchmarking. Theory and practice.

Harris, S., Seaborne, A., and Prud’hommeaux, E. (2013). SPARQL 1.1 query language. Technical report, World Wide Web Consortium.

Heath, T. and Bizer, C. (2011). Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on the Semantic Web. Morgan & Claypool Publishers.

Herbst, J. and Karagiannis, D. (1998). Integrating machine learning and workflow man- agement to support acquisition and adaptation of workflow models. In Proceedings of the 9th International Workshop on Database and Expert Systems Applications, DEXA ’98, pages 745–, Washington, DC, USA. IEEE Computer Society.

Holder, L., Cook, D., González, J., and Jonyer, I. (2002). Structural pattern recognition in graphs. Pattern Recognition and String Matching Combinatorial Optimization, 13:255–279.

Howison, J. and Herbsleb, J. D. (2011). Scientific software production: Incentives and collaboration. In Proceedings of the ACM 2011 Conference on Computer Supported Cooperative Work, CSCW ’11, pages 513–522, New York, NY, USA. ACM.

Hull, D., Stevens, R., Lord, P., Wroe, C., and Goble, C. (2004). Treating shimantic web syndrome with ontologies. In AKT Workshop on Semantic Web Services.

Inokuchi, A., Washio, T., and Motoda, H. (2000). An apriori-based algorithm for mining frequent substructures from graph data. Lecture Notes in Computer Science, 1910:13–23.

Jiang, C., Coenen, F., and Zito, M. (2012). A survey of frequent subgraph mining algorithms. The Knowledge Engineering Review, 28(1):75–105.

Jordan, D., Evdemon, J., Alves, A., Arkin, A., Askary, S., Barreto, C., Bloch, B., Curbera, F., Ford, M., Goland, Y., Guízar, A., Kartha, N., Kevin Liu, C., Khalaf, R., König, D., Marin, M., Mehta, V., Thatte, S., van der Rijn, D., Yendluri, P., and Yiu, A. (2007). Web services business process execution language version 2.0.

Kamgnia Wonkap, S. (2014). Extraction de motifs dans les graphes de workflows scientifiques. Master’s thesis, Laboratoire de Recherche en Informatique, Université Paris-Sud, 91405 Orsay Cedex.

Klyne, G., Carroll, J. J., and McBride, B. (2004). Resource description framework (RDF): Concepts and abstract syntax. World Wide Web Consortium, Recommendation REC-rdf-concepts-20040210.

Koop, D. (2008). Viscomplete: Automating suggestions for visualization pipelines. IEEE Transactions on Visualization and Computer Graphics, 14(6):1691–1698.

Krefting, D., Glatard, T., Korkhov, V., Montagnat, J., and Olabarriaga, S. (2011). Enabling grid interoperability at workflow level. In Grid Workflow Workshop, Köln, Germany.

Krueger, C. W. (1992). Software reuse. ACM Comput. Surv., 24(2):131–183.

Kuramochi, M. and Karypis, G. (2001). Frequent subgraph discovery. In Proceedings of the 2001 IEEE International Conference on Data Mining, ICDM ’01, pages 313–320, Washington, DC, USA. IEEE Computer Society.

Kuramochi, M. and Karypis, G. (2004a). An efficient algorithm for discovering frequent subgraphs. IEEE Transactions on Knowledge and Data Engineering, 16(9):1038–1051.

Kuramochi, M. and Karypis, G. (2004b). Grew: A scalable frequent subgraph discovery algorithm. In Proceedings of the Fourth IEEE International Conference on Data Mining, ICDM ’04, pages 439–442, Washington, DC, USA. IEEE Computer Society.

Lawrynowicz, A. and Potoniec, J. (2014). Pattern based feature construction in semantic data mining. Int. J. Semant. Web Inf. Syst., 10(1):27–65.

Leake, D. and Kendall-Morwick, J. (2008). Towards case-based support for e-science workflow generation by mining provenance. In Proceedings of the 9th European Conference on Advances in Case-Based Reasoning, ECCBR ’08, pages 269–283, Berlin, Heidelberg. Springer-Verlag.

Lebo, T., McGuinness, D., Belhajjame, K., Cheney, J., Corsar, D., Garijo, D., Soiland-Reyes, S., Zednik, S., and Zhao, J. (2013). The PROV Ontology. W3C Recommendation, 30 April 2013. Technical report, WWW Consortium.

Lee, E. A. and Neuendorffer, S. (2000). MoML - a modeling markup language in XML. Technical report, University of California Berkeley.

Likert, R. (1932). A technique for the measurement of attitudes. Archives of psychology, 22(140):5–55.

Littauer, R., Ram, K., Ludäscher, B., Michener, W., and Koskela, R. (2012). Trends in use of scientific workflows: Insights from a public repository and recommendations for best practice. IJDC, 7(2):92–100.

Ludäscher, B., Altintas, I., Berkley, C., Higgins, D., Jaeger, E., Jones, M., Lee, E. A., Tao, J., and Zhao, Y. (2006). Scientific workflow management and the Kepler system. Concurrency and Computation: Practice and Experience, 18(10):1039–1065.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate ob- servations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statis- tics and Probability, Volume 1: Statistics, pages 281–297, Berkeley, Calif. University of California Press.

Marcus, A. and Oransky, I. (2014). Top retractions of 2014. The Scientist.

Marinho, A., Murta, L., Werner, C., Braganholo, V., Cruz, S. M. S. d., Ogasawara, E., and Mattoso, M. (2012). Provmanager: A provenance management system for scientific workflows. Concurr. Comput. : Pract. Exper., 24(13):1513–1530.

Markovic, M., Edwards, P., and Corsar, D. (2014). SC-PROV: A provenance vocabulary for social computation. In Proceedings of the 5th International Provenance & Annotation Workshop - IPAW 2014, pages 285–287. Springer.

Mates, P., Santos, E., Freire, J., and Silva, C. T. (2011). Crowdlabs: Social analysis and visualization for the sciences. In 23rd International Conference on Scientific and Statistical Database Management (SSDBM), pages 555–564. Springer.

Mattmann, C. A., Crichton, D. J., Medvidovic, N., and Hughes, S. (2006). A software architecture-based framework for highly distributed and data intensive scientific applications. In Proceedings of the 28th international conference on Software engineering, ICSE ’06, pages 721–730, New York, NY, USA. ACM.

Mayer, R., Rauber, A., Neumann, M. A., Thomson, J., and Antunes, G. (2012). Preserving scientific processes from design to publications. In Zaphiris, P., Buchanan, G., Rasmussen, E., and Loizides, F., editors, Theory and Practice of Digital Libraries, volume 7489 of Lecture Notes in Computer Science, pages 113–124. Springer Berlin Heidelberg.

McBride, B., Brickley, D., and Guha, R. (2014). RDF Schema 1.1. Technical report, WWW Consortium.

McGuinness, D. L. and van Harmelen, F. (2004). OWL web ontology language overview. Technical report, WWW Consortium.

Miles, S., Deelman, E., Groth, P., Vahi, K., Mehta, G., and Moreau, L. (2007). Connecting scientific data to scientific experiments with provenance. In Proceedings of the Third IEEE International Conference on e-Science and Grid Computing, E-SCIENCE ’07, pages 179–186, Washington, DC, USA. IEEE Computer Society.

Missier, P., Dey, S., Belhajjame, K., Cuevas-Vicenttín, V., and Ludäscher, B. (2013). D-PROV: Extending the PROV provenance model with workflow structure. In Proceedings of the 5th USENIX Workshop on the Theory and Practice of Provenance, TaPP ’13, pages 9:1–9:7, Berkeley, CA, USA. USENIX Association.

Mongiovi, M., Di Natale, R., Giugno, R., Pulvirenti, A., Ferro, A., and Sharan, R. (2010). SIGMA: A set-cover-based inexact graph matching algorithm. J Bioinform Comput Biol, 8(2):199–218.

Montani, S. and Leonardi, G. (2012). Retrieval and clustering for business process monitoring: Results and improvements. In Agudo, B. and Watson, I., editors, Case-Based Reasoning Research and Development, volume 7466 of Lecture Notes in Computer Science, pages 269–283. Springer Berlin Heidelberg.

Moreau, L. (2010). The foundations for provenance on the web. Foundations and Trends in Web Science, 2(2–3):99–241.

Moreau, L., Clifford, B., Freire, J., Futrelle, J., Gil, Y., Groth, P., Kwasnikowska, N., Miles, S., Missier, P., Myers, J., Plale, B., Simmhan, Y., Stephan, E., and den Bussche., J. V. (2011). The open provenance model core specification (v1.1). Future Generation Computer Systems, 27(6).

Moreau, L., Missier, P., Belhajjame, K., B’Far, R., Cheney, J., Coppens, S., Cresswell, S., Gil, Y., Groth, P., Klyne, G., Lebo, T., McCusker, J., Miles, S., Myers, J., Sahoo, S., and Tilmes, C. (2013). PROV-DM: The PROV Data Model. W3C Recommendation. Technical report, WWW Consortium.

Müller, G. and Bergmann, R. (2014). Workflow streams: A means for compositional adaptation in process-oriented CBR. In Lamontagne, L. and Plaza, E., editors, Case-Based Reasoning Research and Development, volume 8765 of Lecture Notes in Computer Science, pages 315–329. Springer International Publishing.

Nijssen, S. and Kok, J. N. (2004). A quickstart in frequent structure mining can make a difference. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’04, pages 647–652, New York, NY, USA. ACM.

Noy, N., Rector, A., Hayes, P., and Welty, C. (2006). Defining n-ary relations on the semantic web. Technical report, W3C.

Oinn, T., Addis, M., Ferris, J., Marvin, D., Greenwood, M., Goble, C., Wipat, A., Li, P., and Carver, T. (2004). Delivering web service coordination capability to users. In Proceedings of the 13th International World Wide Web Conference on Alternate Track Papers & Posters, WWW Alt. ’04, pages 438–439, New York, NY, USA. ACM.

Olabarriaga, S. D., Jaghoori, M. M., Korkhov, V., van Schaik, B., and van Kampen, A. (2013). Understanding workflows for distributed computing: Nitty-gritty details. In Proceedings of the 8th Workshop on Workflows in Support of Large-Scale Science, WORKS ’13, pages 68–76, New York, NY, USA. ACM.

Oliveira, F. T., Murta, L., Werner, C., and Mattoso, M. (2008). Provenance and annotation of data and processes. chapter Using Provenance to Improve Workflow Design, pages 136–143. Springer-Verlag, Berlin, Heidelberg.

Ostermann, S., Prodan, R., Fahringer, T., Iosup, R., and Epema, D. (2008). On the characteristics of grid workflows. In Proceedings of CoreGRID Integration Workshop 2008, pages 431–442, Hersonisson, Crete.

Plankensteiner, K., Montagnat, J., and Prodan, R. (2011). IWIR: A language enabling portability across grid workflow systems. In Proceedings of the 6th Workshop on Workflows in Support of Large-scale Science, WORKS ’11, pages 97–106, New York, NY, USA. ACM.

Radulovic, F., Poveda-Villalón, M., Vila-Suero, D., Rodríguez-Doncel, V., García-Castro, R., and Gómez-Pérez, A. (2015). Guidelines for Linked Data generation and publication: An example in building energy consumption. Automation in Construction, 57:178–187.

Ramakrishnan, L. and Plale, B. (2010). A multi-dimensional classification model for scientific workflow characteristics. In Proceedings of the 1st International Workshop on Workflow Approaches to New Data-centric Science, Wands ’10, pages 4:1–4:12, New York, NY, USA. ACM.

Reich, M., Liefeld, T., Gould, J., Lerner, J., Tamayo, P., and Mesirov, J. P. (2006). GenePattern 2.0. Nature genetics, 38(5):500–501.

Reisig, W. and Rozenberg, G. (1998). Informal introduction to Petri Nets. In Lectures on Petri Nets I: Basic Models, Advances in Petri Nets, the Volumes Are Based on the Advanced Course on Petri Nets, pages 1–11, London, UK. Springer-Verlag.

Riehle, D. and Z¨ullighoven, H. (1996). Understanding and using patterns in software development. Theor. Pract. Object Syst., 2(1):3–13.

Rissanen, J. (1978). Modeling by the shortest data description. Automatica, 14:465–471.

Rivest, R. L. (1992). The MD5 message-digest algorithm. Technical report, Massachusetts Institute of Technology.

Rocca, P., Sansone, S.-A., and Brandizi, M. (2008). ISA-TAB 1.0. Technical report.

Rockoff, J. D. (2015). Amgen finds data falsified in obesity-diabetes study featuring grizzly bears. The Wall Street Journal.

Roure, D. D., Goble, C. A., and Stevens, R. (2009). The design and realisation of the myExperiment virtual research environment for social sharing of workflows. Future Generation Comp. Syst., 25(5):561–567.

Rozinat, A. and van der Aalst, W. (2006). Decision mining in ProM. In Dustdar, S., Fiadeiro, J., and Sheth, A., editors, Business Process Management, volume 4102 of Lecture Notes in Computer Science, pages 420–425. Springer Berlin Heidelberg.

Ruiz, J., Garrido, J., Santander-Vela, J., Sánchez-Expósito, S., and Verdes-Montenegro, L. (2014). AstroTaverna: Building workflows with Virtual Observatory services. Astronomy and Computing, 7–8:3–11. Special Issue on The Virtual Observatory: I.

Russell, N., ter Hofstede, A., Edmond, D., and van der Aalst, W. (2004a). Workflow data patterns. Technical report, Queensland University of Technology.

Russell, N., ter Hofstede, A., Edmond, D., and van der Aalst, W. (2004b). Workflow resource patterns. Technical report, Eindhoven University of Technology.

Saeedy, M. E. and Kalnis, P. (2011). GraMi: Generalized frequent pattern mining in a single large graph. Technical report, King Abdullah University of Science and Technology.

Santana-Pérez, I. and Pérez-Hernández, M. (2015). Towards reproducibility in scientific workflows: An infrastructure-based approach. Scientific Programming, 2015:11.

Santos, E., Lins, L., Ahrens, J. P., Freire, J., and Silva, C. (2008). A first study on clustering collections of workflow graphs. In Freire, J., Koop, D., and Moreau, L., editors, Provenance and Annotation of Data and Processes, volume 5272 of Lecture Notes in Computer Science, pages 160–173. Springer Berlin Heidelberg.

Scheidegger, C. E., Vo, H. T., Koop, D., Freire, J., and Silva, C. T. (2008). Querying and re-using workflows with VisTrails. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, SIGMOD ’08, pages 1251–1254, New York, USA. ACM.

Sethi, R. J., Jo, H., and Gil, Y. (2012). Re-using workflow fragments across multiple data domains. In 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, Salt Lake City, UT, USA, November 10-16, 2012, pages 90–99.

Silva, V., Chirigati, F., Maia, K., Ogasawara, E., Oliveira, D., Braganholo, V., Murta, L., and Mattoso, M. (2011). Similarity-based workflow clustering. In JCIS, volume 2, pages 23–35.

Simmhan, Y. L., Plale, B., and Gannon, D. (2005). A survey of data provenance in e-science. SIGMOD Rec., 34(3):31–36.

Slominski, A. (2007). Adapting BPEL to Scientific Workflows. In Taylor, I. J., Deelman, E., Gannon, D. B., and Shields, M., editors, Workflows for e-Science, pages 208–226. Springer London.

Smirnov, S., Weidlich, M., and Mendling, J. (2012). Business process model abstraction based on synthesis from well-structured behavioral profiles. International Journal of Cooperative Information Systems, 21(01):55–83.

Soldatova, L. N. and King, R. D. (2006). An ontology of scientific experiments. Journal of the Royal Society Interface, 3(11):795–803.

Starlinger, J., Brancotte, B., Cohen-Boulakia, S., and Leser, U. (2014a). Similarity search for scientific workflows. Proceedings of the VLDB Endowment, 7(12):1143–1154.

Starlinger, J., Cohen-Boulakia, S., Khanna, S., Davidson, S., and Leser, U. (2014b). Layer decomposition: An effective structure-based approach for scientific workflow similarity. In e-Science (e-Science), 2014 IEEE 10th International Conference on, volume 1, pages 169–176.

Starlinger, J., Cohen-Boulakia, S., and Leser, U. (2012). (Re)use in public scientific workflow repositories. In Ailamaki, A. and Bowers, S., editors, Scientific and Statistical Database Management, volume 7338 of Lecture Notes in Computer Science, pages 361–378. Springer Berlin Heidelberg.

Stoyanovich, J., Taskar, B., and Davidson, S. (2010). Exploring repositories of scientific workflows. In Proceedings of the 1st International Workshop on Workflow Approaches to New Data-centric Science, Wands ’10, pages 7:1–7:10, New York, NY, USA. ACM.

Studer, R., Benjamins, V. R., and Fensel, D. (1998). Knowledge engineering: Principles and methods. Data Knowl. Eng., 25(1-2):161–197.

Suárez-Figueroa, M. C. (2010). NeOn Methodology for building ontology networks: specification, scheduling and reuse. PhD thesis, Facultad de Informática, Universidad Politécnica de Madrid.

Suárez-Figueroa, M. C., Brockmans, S., Gangemi, A., Gómez-Pérez, A., Lehmann, J., Lewen, H., Presutti, V., and Sabou, M. (2007). D 5.1.1 NeOn modelling components. Technical report, UPM.

Tan, W., Missier, P., Madduri, R., and Foster, I. (2009). Building scientific workflow with Taverna and BPEL: A comparative study in caGrid. In Feuerlicht, G. and Lamersdorf, W., editors, Service-Oriented Computing ICSOC 2008 Workshops, volume 5472 of Lecture Notes in Computer Science, pages 118–129. Springer Berlin Heidelberg.

Tan, W., Zhang, J., and Foster, I. (2010). Network analysis of scientific workflows: A gateway to reuse. Computer, 43(9):54–61.

Taylor, I. J., Deelman, E., Gannon, D. B., and Shields, M. (2006). Workflows for e-Science: Scientific Workflows for Grids. Springer-Verlag New York, Inc., Secaucus, NJ, USA.

van der Aalst, W. M. P., de Beer, H. T., and van Dongen, B. F. (2005). Process mining and verification of properties: An approach based on temporal logic. In Proceedings of the 2005 Confederated international conference on On the Move to Meaningful Internet Systems, pages 130–147.

van der Aalst, W. M. P., ter Hofstede, A. H. M., Kiepuszewski, B., and Barros, A. P. (2003a). Workflow patterns. Distributed and Parallel Databases, 14(1):5–51.

van der Aalst, W. M. P., van Dongen, B. F., Herbst, J., Maruster, L., Schimm, G., and Weijters, A. J. M. M. (2003b). Workflow mining: A survey of issues and approaches. Data Knowl. Eng., 47(2):237–267.

Velasco-Elizondo, P., Dwivedi, V., Garlan, D., Schmerl, B., and Fernandes, J. (2013). Resolving data mismatches in end-user compositions. In Dittrich, Y., Burnett, M., Mørch, A., and Redmiles, D., editors, End-User Development, volume 7897 of Lecture Notes in Computer Science, pages 120–136. Springer Berlin Heidelberg.

Villazón-Terrazas, B., Vila-Suero, D., Garijo, D., Vilches-Blázquez, L. M., Poveda-Villalón, M., Mora, J., Corcho, O., and Gómez-Pérez, A. (2012). Publishing Linked Data: there is no one-size-fits-all formula. In Proceedings of the European Data Forum.

Villazón-Terrazas, B., Vilches-Blázquez, L., Corcho, O., and Gómez-Pérez, A. (2011). Methodological guidelines for publishing government Linked Data. In Wood, D., editor, Linking Government Data, pages 27–49. Springer New York.

Ware, M. and Mabe, M. (2015). The STM report: An overview of scientific and scholarly journal publishing. Technical report, STM: International Association of Scientific, Technical and Medical Publishers.

Wassink, I., van der Vet, P. E., Wolstencroft, K., Neerincx, P. B. T., Roos, M., Rauwerda, H., and Breit, T. M. (2009). Analysing scientific workflows: Why workflows not only connect web services. In 2009 IEEE Congress on Services I, pages 314–321.

Wieczorek, M., Prodan, R., and Fahringer, T. (2005). Scheduling of scientific workflows in the ASKALON grid environment. ACM SIGMOD Record, 34:56–62.

Wilkinson, M., Vandervalk, B., and McCarthy, L. (2009). SADI semantic web services - because you can’t always get what you want! In Services Computing Conference, 2009. APSCC 2009. IEEE Asia-Pacific, pages 13–18.

Wolstencroft, K., Haines, R., Fellows, D., Williams, A., Withers, D., Owen, S., Soiland- Reyes, S., Dunlop, I., Nenadic, A., Fisher, P., Bhagat, J., Belhajjame, K., Bacall, F., Hardisty, A., de la Hidalga, A. N., Vargas, M. P. B., Sufi, S., and Goble, C. (2013). The Taverna workflow suite: designing and executing workflows of web services on the desktop, web or in the cloud. Nucleic Acids Research.

Wood, I., Vandervalk, B., McCarthy, L., and Wilkinson, M. (2012). OWL-DL domain-models as abstract workflows. In Margaria, T. and Steffen, B., editors, Leveraging Applications of Formal Methods, Verification and Validation. Applications and Case Studies, volume 7610 of Lecture Notes in Computer Science, pages 56–66. Springer Berlin Heidelberg.

Yaman, F., Oates, T., and Burstein, M. (2009). A context driven approach for workflow mining. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, IJCAI’09, pages 1798–1803, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Yan, X. and Han, J. (2002). gSpan: Graph-based substructure pattern mining. In Proceedings of the International Conference on Data Mining, pages 721–724.

Yang, Q., Wu, K., and Jiang, Y. (2005). Learning action models from plan examples with incomplete knowledge. In Biundo, S., Myers, K. L., and Rajan, K., editors, ICAPS, pages 241–250. AAAI.

Yildiz, U., Guabtni, A., and Ngu, A. (2009). Business versus scientific workflows: A comparative study. In 2009 World Conference on Services - I, pages 340–343.

Yu, J. and Buyya, R. (2005). A taxonomy of workflow management systems for grid computing. Journal of Grid Computing, 3(3-4):171–200.
