Mining Abstractions in Scientific Workflows
Departamento de Inteligencia Artificial
Escuela Técnica Superior de Ingenieros Informáticos

PhD Thesis

Mining Abstractions in Scientific Workflows

Author: Daniel Garijo Verdejo
Supervisors: Prof. Dr. Oscar Corcho, Prof. Dra. Yolanda Gil

December, 2015

Committee appointed by the Rector of the Universidad Politécnica de Madrid on 30 October 2015

Chair: Dra. Asunción Gómez Pérez
Member: Dr. Jose Manuel Gómez Pérez
Member: Dr. Malcolm Atkinson
Member: Dr. Rafael Tolosana
Secretary: Dr. Mark Wilkinson
Substitute: Dr. Mariano Fernández López
Substitute: Dra. Belén Díaz Agudo

The thesis defense and reading took place on 3 December 2015 at the Facultad de Informática.

Grade:

Signed: the Chair, Member 1, Member 2, Member 3, the Secretary

To my parents

Acknowledgements

After five years, I can finally say that I see light at the end of the tunnel. Maybe the other side is still a bit cloudy at the moment, but the important thing is to have arrived here. And, honestly, I think I wouldn't have made it to this point without all the people who have been by my side during these years.

First, I would like to thank my supervisors Oscar Corcho and Yolanda Gil for guiding me whenever I got stuck and for having the patience to answer all my questions. Furthermore, thanks to their help, together with Asunción Gómez Pérez's advice, I was granted the FPU (Formación de Profesorado Universitario) scholarship from the Ministerio de Ciencia e Innovación. This scholarship has funded the internships and the research described in this document, and I am very grateful for having had the opportunity to enjoy it.

I would also like to thank my family, especially my parents (Francisco Javier and María Felisa) and my sister Elisa, for all their support, advice and suggestions during this period, even from a distance!

Next up are my lab mates, who have helped me with the figures (María Poveda, I really think you could write a thesis just by doing cool figures), logos (Idafen Santana, also responsible for our soccer team), technical support (Miguel Angel García and Raúl Alcázar), advice for the thesis (Andrés García and Esther Lozano) or just cheering me up when hanging out with them (Nandana, Dani, Freddy, Carlos, Pablo, Julia, Filip, Boris, Alejandro, Olga and Victor). In this regard, I am also very grateful to my friends Sergio, Paloma, David, Cristina and Javier for always being available to have a chat over a beer and discuss things totally unrelated to this thesis.

I also owe special thanks to Paolo Missier and Khalid Belhajjame, who provided very valuable feedback despite having very little time for the review. Next, Varun Ratnakar has always been crucial for some of the technical parts described in this thesis. Varun is one of the best working colleagues one could ever ask for. And finally, I want to thank all the collaborators and project pals I have interacted with during these years, from the wf4Ever team (with Carole, Jun, Graham, Raúl, Piotr, Stian, Khalid, Kristina, Lourdes, Susana, Pique) to the people I have met during my internships at the ISI (Dirk, John, Matheus, Felix, Zori).

Abstract

Scientific workflows have been adopted in the last decade to represent the computational methods used in in silico scientific experiments and their associated research products. Scientific workflows have proven useful for sharing and reproducing scientific experiments, allowing scientists to visualize, debug and save time when re-executing previous work. However, scientific workflows may be difficult to understand and reuse. The large number of workflows available in repositories, together with their heterogeneity and lack of documentation and usage examples, may become an obstacle for a scientist aiming to reuse the work of other scientists.
Furthermore, given that it is often possible to implement a method using different algorithms or techniques, seemingly disparate workflows may be related at a higher level of abstraction, based on their common functionality. In this thesis we address the issue of reusability and abstraction by exploring how workflows relate to one another in a workflow repository, mining abstractions that may be helpful for workflow reuse. To do so, we propose a simple model for representing and relating workflows and their executions, we analyze the typical common abstractions that can be found in workflow repositories, we explore the current practices of users regarding workflow reuse, and we describe a method for discovering useful abstractions for workflows based on existing graph mining techniques. Our results expose the common abstractions and practices of users in terms of workflow reuse, and show how our proposed abstractions have the potential to become useful for users designing new workflows.

Resumen (Summary)

Scientific workflows have been adopted over the last decade to represent the computational methods used in in silico experiments, as well as to support their associated publications. These workflows have proven useful for sharing and reproducing scientific experiments, allowing researchers to visualize, debug and save time when re-executing previous work. However, scientific workflows can sometimes be difficult to understand and reuse. This is due to obstacles such as the large number of workflows available in repositories, their heterogeneity, and the general lack of documentation and usage examples. Furthermore, since it is usually possible to implement the same method using different algorithms or techniques, seemingly different workflows may be related at a certain level of abstraction, based, for example, on their common functionality. This thesis focuses on workflow reuse and abstraction by exploring the relationships among the workflows in a repository and extracting abstractions that could help when reusing other existing workflows. To this end, we propose a simple model for representing workflows and their executions, analyze the typical abstractions that can be found in workflow repositories, explore the usual practices of users when reusing existing workflows, and describe a method for discovering abstractions that are useful to users, based on existing graph theory techniques. The results expose the common abstractions and practices of users in terms of workflow reuse, and show how the automatically extracted abstractions have the potential to be reused by users seeking to design new workflows.

Contents

1 Introduction
  1.1 Contributions
  1.2 Thesis Structure
  1.3 Publications
  1.4 External Contributions
2 Related Work
  2.1 Scientific Workflow Representation
    2.1.1 Scientific Workflow Management Systems
    2.1.2 Scientific Workflow Life Cycle
    2.1.3 Scientific Workflow Models
    2.1.4 Scientific Workflow Publication
  2.2 Workflow Abstraction
    2.2.1 Types of Abstractions in Scientific Workflows
    2.2.2 Workflow Patterns
  2.3 Workflow Reuse
    2.3.1 Measuring Workflow Reuse
    2.3.2 Workflow Mining for Reuse
  2.4 Summary
3 Research Objectives
  3.1 Research Hypotheses
  3.2 Open Research Challenges
    3.2.1 Workflow Representation Heterogeneity
    3.2.2 Inadequate Level of Workflow Abstraction
    3.2.3 Difficulties of Workflow Reuse
    3.2.4 Lack of Support for Workflow Annotation
  3.3 Research Methodology
4 Scientific Workflow Representation and Publication
  4.1 Scientific Workflow Model
    4.1.1 Representing the Provenance of Workflow Executions: The Open Provenance Model and W3C PROV
    4.1.2 Representing Workflow Templates and Instances: P-Plan
    4.1.3 OPMW
  4.2 Scientific Workflow Publication
    4.2.1 Workflows as Linked Data Resources
    4.2.2 A Methodology for Publishing Scientific Workflows as Linked Data
    4.2.3 Linked Data Workflows: An Example
  4.3 Summary
5 Workflow Abstraction and Reuse
  5.1 Workflow Motifs
    5.1.1 Experimental Setup
    5.1.2 Workflow Corpus Description
    5.1.3 Methodology for Workflow Analysis
    5.1.4 A Motif Catalogue for Abstracting Scientific Workflows
    5.1.5 Workflow Analysis Results
    5.1.6 Summary
  5.2 Analysis of Workflow and Workflow Fragment Reuse
    5.2.1 Experimental Setup
    5.2.2 Workflow Reuse Analysis Results
  5.3 Workflow and Workflow Fragment Reuse: User Survey
    5.3.1 Experimental Setup
    5.3.2 User Survey Report
  5.4 Summary
6 Workflow Fragment Mining
  6.1 Data Preparation
  6.2 Common Workflow Fragment Extraction
    6.2.1 Frequent Sub-graph Mining
    6.2.2 Frequent Sub-graph Mining in FragFlow
  6.3 Fragment Filtering and Splitting
  6.4 Fragment Linking
    6.4.1 Workflow Fragment Representation
    6.4.2 Finding Fragments in Workflows
  6.5 Fragment Statistics and Visualization
  6.6 Summary
7 Evaluation
  7.1 Evaluation Metrics
    7.1.1 Occurrence and Generalization Evaluation Metrics
    7.1.2 Usefulness Evaluation Metrics
  7.2 Workflow Motif Detection and Workflow Generalization
    7.2.1 Experimental Setup
    7.2.2 Evaluation of the Application of Inexact FSM techniques