Future Generation Computer Systems 46 (2015) 17–35


Pegasus, a workflow management system for science automation

Ewa Deelman a,∗, Karan Vahi a, Gideon Juve a, Mats Rynge a, Scott Callaghan b, Philip J. Maechling b, Rajiv Mayani a, Weiwei Chen a, Rafael Ferreira da Silva a, Miron Livny c, Kent Wenger c

a University of Southern California, Information Sciences Institute, Marina del Rey, CA, USA
b University of Southern California, Los Angeles, CA, USA
c University of Wisconsin at Madison, Madison, WI, USA

∗ Corresponding author. Correspondence to: USC Information Sciences Institute, 4676 Admiralty Way, Suite 1001, Marina del Rey, CA 90292, USA.

Highlights

• Comprehensive description of the Pegasus Workflow Management System.
• Detailed explanation of Pegasus workflow transformations.
• Data management in Pegasus.
• Earthquake science application example.

Article history: Received 16 February 2014; received in revised form 30 September 2014; accepted 8 October 2014; available online 29 October 2014.

Keywords: Scientific workflows; Workflow management system; Pegasus

Abstract

Modern science often requires the execution of large-scale, multi-stage simulation and data analysis pipelines to enable the study of complex systems. The amount of computation and data involved in these pipelines requires scalable workflow management systems that are able to reliably and efficiently coordinate and automate data movement and task execution on distributed computational resources: campus clusters, national cyberinfrastructures, and commercial and academic clouds. This paper describes the design, development and evolution of the Pegasus Workflow Management System, which maps abstract workflow descriptions onto distributed computing infrastructures. Pegasus has been used for more than twelve years by scientists in a wide variety of domains, including astronomy, seismology, bioinformatics, physics and others. This paper provides an integrated view of the Pegasus system, showing the capabilities that have been developed over time in response to application needs and to the evolution of the scientific computing platforms. The paper describes how Pegasus achieves reliable, scalable workflow execution across a wide variety of computing infrastructures.

1. Introduction

Modern science often requires the processing and analysis of vast amounts of data in search of postulated phenomena, and the validation of core principles through the simulation of complex system behaviors and interactions. The challenges of such applications are wide-ranging: data may be distributed across a number of repositories; available compute resources may be heterogeneous and include campus clusters, national cyberinfrastructures, and clouds; and results may need to be exchanged with remote colleagues in pursuit of new discoveries. This is the case in fields such as astronomy, bioinformatics, physics, climate, ocean modeling, and many others. To support the computational and data needs of today's science applications, the growing capabilities of the national and international cyberinfrastructure, and more recently commercial and academic clouds, need to be delivered to the scientist's desktop in an accessible, reliable, and scalable way.

Over the past dozen years, our solution has been to develop workflow technologies that can bridge the scientific domain and the available cyberinfrastructure. Our approach has always been to work closely with domain scientists—both in large collaborations such as the LIGO Scientific Collaboration [1], the Southern California Earthquake Center (SCEC) [2], and the National Virtual Observatory [3], among others, and with individual researchers—to understand their computational needs and challenges, and to build software systems that further their research.

The Pegasus Workflow Management System, first developed in 2001, was grounded in that principle and was borne out of the Virtual Data idea explored within the GriPhyN project [4–6]. In this context, a user could ask for a data product, and the system could provide it by accessing the data directly, if it were already computed and easily available, or it could decide to compute it on the fly. In order to produce the data on demand, the system would have to have a recipe, or workflow, describing the necessary computational steps and their data needs. Pegasus was designed to manage such a workflow executing on potentially distributed data and compute resources. In some cases the workflow would consist of simple data access, and in others it could encompass a number of interrelated steps. This paper provides a comprehensive description of the current Pegasus capabilities and explains how we manage the execution of large-scale workflows running in distributed environments. Although there are a number of publications that focused on particular aspects of the workflow management problem and showed quantitative performance or scalability improvements, this paper provides a unique, integrated view of the Pegasus system today and describes the system features derived from research and from work with application partners.

The cornerstone of our approach is the separation of the workflow description from the description of the execution environment. Keeping the workflow description resource-independent, i.e. abstract, provides a number of benefits: (1) workflows can be portable across execution environments, and (2) the workflow management system can perform optimizations at "compile time" and/or at "runtime". These optimizations are geared towards improving the reliability and performance of the workflow's execution. Since some Pegasus users have complex and large-scale workflows (with O(million) tasks), scalability has also been an integral part of the system design. One potential drawback of the abstract workflow representation approach, with its compile time and runtime workflow modifications, is that the workflow being executed looks different to the user than the workflow the user submitted. As a result, we have devoted significant effort towards developing a monitoring and debugging system that can connect the two different workflow representations in a way that makes sense to the user [7,8].

Pegasus workflows are based on the Directed Acyclic Graph (DAG) representation of a scientific computation, in which tasks to be executed are represented as nodes, and the data- and control-flow dependencies between them are represented as edges. This well-known, DAG-based declarative approach has been used extensively to represent computations on single processor systems, multi-CPU and multi-core systems, and databases [9–14]. Through the abstraction of DAGs, we can apply the wealth of research in graph algorithms [15,16] to optimize performance [17,18] and improve reliability and scalability [6,19]. By allowing a node to represent a sub-DAG, we apply recursion to support workflows with a scale of O(million) individual tasks.

This paper describes the Pegasus system components and the restructuring that the system performs, and presents a case study of an important earthquake science application that uses Pegasus for computations running on national cyberinfrastructure. The paper is organized as follows. Section 2 provides a general overview of Pegasus and its subsystems. Section 3 explores issues of execution environment heterogeneity and explains how Pegasus achieves portability across these environments. Section 4 discusses the graph transformations and optimizations that Pegasus performs when mapping a workflow onto the distributed environment ahead of the execution. At runtime, Pegasus can also perform a number of actions geared towards improving scalability and reliability; these capabilities are described in Section 6. As with any user-facing system, usability is important, and our efforts in this area are described in Section 7. We then present a real user application, the CyberShake earthquake science workflow, which utilizes a number of Pegasus capabilities (Section 8). Finally, Sections 9 and 10 give an overview of related work and conclude the paper.

2. System design

We assume that: (1) the user has access to a machine where the workflow management system resides, (2) the input data can be distributed across diverse storage systems connected by wide area or local area networks, and (3) the workflow computations can be conducted across distributed, heterogeneous platforms.

In Pegasus, workflows are described by users as DAGs, where the nodes represent individual computational tasks and the edges represent data and control dependencies between the tasks. Tasks can exchange data between them in the form of files. In our model, the workflow is abstract in that it does not contain resource information or the physical locations of data and executables referred to in the workflow. The workflow is submitted to a workflow management system that resides on a user-facing machine called the submit host. This machine can be a user's laptop or a community resource. The target execution environment can be a local machine, like the submit host, a remote physical cluster or grid [20], or a virtual system such as the cloud [21]. The Pegasus WMS approach is to bridge the scientific domain and the execution environment by mapping a scientist-provided high-level workflow description, an abstract workflow description, to an executable workflow description of the computation. The latter needs enough information to be executed in a potentially distributed execution environment. In our model it is the workflow management system's responsibility not only to translate tasks to jobs and execute them, but also to manage data, monitor the execution, and handle failures. Data management includes tracking, staging, and acting on workflow inputs, intermediate products (files exchanged between tasks in the workflow), and the output products requested by the scientist. These actions are performed by the five major Pegasus subsystems:

Mapper. Generates an executable workflow based on an abstract workflow provided by the user or workflow composition system. It finds the appropriate software, data, and computational resources required for workflow execution. The Mapper can also restructure the workflow to optimize performance, and it adds transformations for data management and provenance information generation.

Local execution engine. Submits the jobs defined by the workflow in order of their dependencies. It manages the jobs by tracking their state and determining when jobs are ready to run. It then submits jobs to the local scheduling queue.

Job scheduler. Manages individual jobs and supervises their execution on local and remote resources.

Remote execution engine. Manages the execution of one or more tasks, possibly structured as a sub-workflow, on one or more remote compute nodes.

Monitoring component. A runtime monitoring daemon launched when the workflow starts executing. It monitors the running workflow, parses the workflow, job, and task logs, and populates them into a workflow database. The database stores both performance and provenance information. The component also sends notifications back to the user about events such as failures, success, and completion of tasks, jobs, and workflows.
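To make the division of labor above concrete, the following is a conceptual sketch, not DAGMan or Pegasus code, of the dependency-driven release loop that the local execution engine description implies: jobs whose parents have all completed are handed to the job scheduler. The submit and wait_for_completion callables are hypothetical stand-ins for the interaction with the scheduler.

```python
# Conceptual sketch of a dependency-driven release loop (not actual DAGMan code).
def run_workflow(jobs, parents, submit, wait_for_completion):
    """jobs: set of job ids; parents: dict mapping a job to its parent jobs."""
    done, queued = set(), set()
    while len(done) < len(jobs):
        # A job is ready when all of its parents have completed.
        ready = {j for j in jobs - done - queued if parents.get(j, set()) <= done}
        for job in sorted(ready):
            submit(job)                      # hand the job to the job scheduler
            queued.add(job)
        finished = wait_for_completion()     # block until one queued job finishes
        queued.discard(finished)
        done.add(finished)
```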

Fig. 1. Pegasus WMS architecture.

Scientists can interact with Pegasus via command line and API interfaces, through portals and infrastructure hubs such as HubZero [22], through higher-level workflow composition tools such as Wings [23] and Airavata [24], or through application-specific composition tools such as Grayson [25]. Independent of how they access it, Pegasus enables scientists to describe their computations in a high-level fashion, independent of the structure of the underlying execution environment and of the particulars of the low-level specifications required by the execution environment middleware (e.g. HTCondor [26], Globus [27], or Amazon Web Services [28]). Fig. 1 provides a high-level schematic illustrating how scientists interface with Pegasus and its subsystems to execute workflows in distributed execution environments. In the remainder of this section, we describe the Pegasus workflow structure and give more details about the main system components.

2.1. Abstract workflow description for Pegasus

The abstract workflow description (DAX, Directed Acyclic graph in XML) provides a resource-independent workflow description. It captures all the tasks that perform computation, the execution order of these tasks represented as edges in a DAG, and, for each task, the required inputs, expected outputs, and the arguments with which the task should be invoked. All input/output datasets and executables are referred to by logical identifiers. Pegasus provides simple, easy-to-use programmatic APIs in Python, Java, and Perl for DAX generation [29]. Fig. 2 shows the Python DAX generator code for a simple "hello world" program that executes two jobs in serial order: hello and world. The corresponding DAX generated by running this program is shown in Fig. 3. In this example, the first job, with ID0000001, refers to an executable called hello identified by the tuple (namespace, name, version). It takes an input file identified by a logical filename (LFN) f.a, and generates a single output file identified by the LFN f.b. Similarly, the second job takes as input file f.b and generates an output file f.c.

Fig. 2. Python DAX generator code for hello world DAX.

The Pegasus workflow can also be defined in a hierarchical way, where a node of a DAX represents another DAX. The ability to define workflows of workflows helps the user construct more complex structures. In addition, it can also provide better scalability. As users started running larger workflows through Pegasus, the size of the DAX file increased correspondingly. For instance, the XML representation for a CyberShake workflow [30] developed in 2009 approached 1 GB and contained approximately 830,000 tasks. Because many of the optimizations that Pegasus applies require the whole graph in memory, Pegasus needs to parse the entire DAX file before it can start mapping the workflow. Even though we employ state-of-the-art XML parsing techniques [31], for large workflows the Mapper can be inefficient and consume large amounts of memory. Structuring the workflow in a hierarchical fashion solves this problem: the higher-level DAX has a smaller workflow to map.
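Since Fig. 2 is not reproduced here, the following is a minimal sketch of what such a "hello world" DAX generator looks like, written against the DAX3 Python API. The class and method names follow that API but may differ slightly between Pegasus versions, and the -i/-o argument flags of the hello and world executables are purely illustrative.

```python
#!/usr/bin/env python
# Minimal sketch of a "hello world" DAX generator (DAX3 Python API).
from Pegasus.DAX3 import ADAG, File, Job, Link

dax = ADAG("hello_world")

# Logical files exchanged between the two jobs
f_a, f_b, f_c = File("f.a"), File("f.b"), File("f.c")

# First job: hello reads f.a and produces f.b
hello = Job(namespace="hello_world", name="hello", version="1.0")
hello.addArguments("-i", f_a, "-o", f_b)
hello.uses(f_a, link=Link.INPUT)
hello.uses(f_b, link=Link.OUTPUT)
dax.addJob(hello)

# Second job: world reads f.b and produces f.c
world = Job(namespace="hello_world", name="world", version="1.0")
world.addArguments("-i", f_b, "-o", f_c)
world.uses(f_b, link=Link.INPUT)
world.uses(f_c, link=Link.OUTPUT)
dax.addJob(world)

# Serial order: world runs after hello
dax.depends(parent=hello, child=world)

with open("hello_world.dax", "w") as out:
    dax.writeXML(out)
```

Running such a script writes an XML DAX similar to the one shown in Fig. 3, which is what the user submits to Pegasus for planning.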

Fig. 4. Translation of a simple hello world DAX to an executable workflow.

The lower-level DAX is managed by another instance of Pegasus at runtime, as explained further in Section 5.1.

Fig. 3. Hello world DAX (Directed Acyclic graph in XML).

2.2. Workflow mapping

In order to automatically map (or plan) the abstract workflow onto the execution environment, Pegasus needs information about the environment, such as the available storage systems, the available compute resources and their schedulers, and the locations of the data and executables. With this information, Pegasus transforms the abstract workflow into an executable workflow, which includes computation invocation on the target resources, the necessary data transfers, and data registration. The executable workflow generation is achieved through a series of graph refinement steps performed on the underlying DAG, successively transforming the user's abstract workflow into the final executable workflow. As part of this process, Pegasus can also optimize the workflow. We describe this process of compile time workflow restructuring in detail in Section 4.

To find the necessary information, Pegasus queries a variety of catalogs. It queries a Replica Catalog to look up the locations of the logical files referred to in the workflow, a Transformation Catalog to look up where the various user executables are installed or available as stageable binaries, and a Site Catalog that describes the candidate computational and storage resources that a user has access to. For example, for the hello world example in Fig. 4, the Mapper will look up the Replica Catalog for files f.a, f.b, and f.c, and the Transformation Catalog for the executables identified by the tuples (hello_world, hello, 1.0) and (hello_world, world, 1.0). For the sake of brevity, we will not focus on the details of the different catalogs in this publication. Pegasus supports various catalog implementations, including file-based and database-based implementations [29].

Once the Mapper decides where to execute the jobs and where to retrieve the data from, it can generate an executable workflow. The abstract task nodes are transformed into nodes containing executable jobs. Data management tasks are added; they represent the staging of required input data, the staging of output products back to user-specified storage systems, and the cataloging of those products for future data discovery. Additionally, the workflow is also transformed and optimized for performance and reliability, as described in the following sections. Fig. 4 shows an executable workflow with the nodes added by the Mapper for a simple hello world DAX.

2.3. Workflow execution

After the mapping is complete, the executable workflow is generated in a way that is specific to the target workflow engine. It is then submitted to the engine and its job scheduler on the submit host. By default, we rely on HTCondor DAGMan [26] as the workflow engine and the HTCondor Schedd as the scheduler. In this case the executable workflow is an HTCondor DAG describing the underlying DAG structure. Each node in the DAG is a job in the executable workflow and is associated with a job submit file that describes how the job is to be executed. It identifies the executable that needs to be invoked, the arguments with which it has to be launched, the environment that needs to be set up before the execution is started, and the mechanism by which the job is to be submitted to local or remote resources for execution. When jobs are ready to run (their parent jobs have completed), DAGMan releases the jobs to a local HTCondor queue that is managed by the Schedd daemon. The Schedd daemon is responsible for managing the jobs through their execution cycle. HTCondor supports a number of different universes, which is the HTCondor term for execution layouts. By default, jobs run in the local HTCondor pool, but with the help of an extendable built-in job translation interface, the Schedd can interface with various remote job management protocols such as Globus GRAM [32] and CREAM CE [33], and can even use SSH to submit jobs directly to remote resource managers such as SLURM [34], PBS [35], LSF [36], and SGE [37].

3. Portability

The abstract format used by Pegasus enables workflow sharing and collaboration between scientists. It also allows users to leverage the ever-changing computing infrastructures for their workflows without having to change the workflow description. It is Pegasus' responsibility to move workflows between different execution environments and to set them up to match the mode that the execution environment supports. This matching also takes into account how the data is managed on the target sites, such as the tools and file servers used to stage data into the site. In this section we focus on the different execution environments that Pegasus supports, the various remote execution engines used to run in these environments, and the supported data management strategies.

3.1. Execution environment

A challenge for workflow management systems is to flexibly support a wide range of execution environments, yet provide optimized execution depending on the environment targeted. This is important as different scientists have access to different execution environments. In addition, over the years that Pegasus WMS has been developed, we have seen that the types of execution environments and how they are accessed change over time. Our users have used local clusters, a variety of distributed grids, and more recently computational clouds.

In the simplest case, the workflow can be executed locally, using a specialized, lightweight execution engine that simply executes the jobs in serial order on the submit host. This lightweight, shell-based engine does not require HTCondor to be installed or configured on the submit host. If users want to leverage the parallelism in the workflow while running locally, they can execute their workflow using HTCondor DAGMan and the Schedd configured for local execution. These two modes enable users to easily port their pipelines to Pegasus WMS and develop them without worrying about remote execution. It is important to note that once a user's workflow can be executed locally using Pegasus WMS, users do not have to change their codes or DAX generators for remote execution. When moving to remote execution, Pegasus relies on HTCondor DAGMan as the workflow execution engine, but the Schedd is configured for remote submissions. The Mapper leverages, extends, and fine-tunes existing HTCondor functionality to make the workflow execute efficiently on remote resources. Currently, the most common execution environments for Pegasus WMS workloads are campus HTCondor pools, High Performance Computing (HPC) clusters, distributed computing infrastructures, and clouds.

3.1.1. HTCondor pool

The most basic execution environment for a Pegasus workflow is an HTCondor pool. A set of local machines, which may or may not be in the same administrative domain, and one or more submit machines, are brought together under a common central HTCondor manager. The advantage of this execution environment is that it is very easy to use, with no extra credentials or data management components required. Authentication and authorization are handled by HTCondor, and data management is either handled directly on top of a shared filesystem or internally with HTCondor doing the necessary file transfers. HTCondor is a very flexible and extensible piece of software, and can be used as a scheduler/manager for the setups described in the following sections on distributed and cloud execution environments.

3.1.2. HPC cluster environment

The HPC cluster execution environment is made up of a remote compute cluster with interfaces for job control and data management. Common interfaces include Globus GRAM/GridFTP, CreamCE, and Bosco [38] if only SSH access is provided to the cluster. The cluster is usually configured for parallel job execution. The cluster usually also has a shared filesystem, which Pegasus WMS can use to store intermediate files during workflow execution. Examples of HPC clusters include campus clusters and HPC infrastructures such as XSEDE [39] in the US and EGI [40] in Europe.

3.1.3. Distributed execution environment

Whereas the HPC cluster provides a fairly homogeneous and centralized infrastructure, distributed computing environments like the Open Science Grid (OSG) [41] provide access to a somewhat heterogeneous set of resources that are geographically distributed and owned and operated by different projects and institutions. Executing a workflow in distributed environments presents a different set of challenges than the HPC cluster environment, as the distributed environment does not typically provide a shared filesystem that can be used to share files between tasks. Instead, the workflow management system has to plan data placement and movements according to the tasks' needs. The preferred job management approach in distributed systems is through HTCondor glideins, pilot jobs used to execute workflow tasks. Systems such as GlideinWMS [42] overlay a virtual HTCondor pool over the distributed resources to allow a seamless distribution of jobs.

3.1.4. Cloud execution environment

Cloud infrastructures are similar to the distributed infrastructures, especially when it comes to data management. When executing workflows in the cloud, whether a commercial cloud such as Amazon Web Services (AWS) [43], a science cloud like FutureGrid [44], or a private cloud infrastructure such as those managed, for example, by OpenStack [45], the workflow will have to handle dynamic changes to the infrastructure at runtime. Examples of these changes could be adding or removing virtual machine instances. Pegasus WMS can be configured to use clouds in many different ways, but a common way is to run an HTCondor pool across the virtual machine instances and use a cloud-provided object storage service, such as Amazon S3, for intermediate file storage. With a strong focus on reliability, HTCondor is particularly good at handling resources being added and removed, and the object storage services provide a centralized, yet scalable, data management point for the workflows.

The Pegasus WMS Mapper can also plan a workflow to execute on a mix of different execution environments. This is particularly useful when the workflow contains jobs with mixed requirements, for example some single core high-throughput jobs and some parallel high-performance jobs, with the former requiring a high-throughput resource like the Open Science Grid and the latter requiring a high-performance resource like XSEDE. Another common job requirement is software availability, where certain software is only available on one set of resources and other software is available on a disjoint set. Software availability is expressed in the Transformation Catalog, and the Mapper will map jobs to the correct execution environment and add data management jobs to make sure data is moved to where it is needed.

3.2. Remote workflow execution engines

The mapping between the abstract workflow tasks and the executable workflow jobs is many to many. As a result, a job that is destined for execution (and that is managed by the local workflow engine and the local job scheduler) can be composed of a number of tasks and their dependencies. These tasks need to be managed on the remote resource: executed in the right order, monitored, and potentially restarted in case of failures. Depending on the execution environment's layout and the complexity of the jobs that need to be handled, we have developed different remote execution engines for managing such jobs.

3.2.1. Single core execution engine

The simplest execution engine is Pegasus Cluster, which is capable of managing a job with a set of tasks running on a remote resource. It can be used in all the supported execution environments. It is particularly useful when managing a set of fine-grained tasks, as described in Section 4.3. Pegasus Cluster is a single-threaded execution engine, which means it runs only one task at a time and that the input to the engine is just an ordered list of tasks. The execution order is determined by the Mapper by doing a topological sort on the associated task graph for the clustered job. The advantage of using the Pegasus Cluster execution engine is twofold: (1) more efficient execution of short running tasks, and (2) it can reduce the number of data transfers if the tasks it manages have common input or intermediate data.
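The ordered task list handed to Pegasus Cluster is produced by a topological sort of the clustered job's task graph, as noted above. The following sketch shows one standard way (Kahn's algorithm) to derive such an order; it illustrates the idea rather than the Mapper's actual implementation, and the task names are hypothetical.

```python
# Topological ordering of a clustered job's tasks (Kahn's algorithm).
from collections import deque

def topological_order(tasks, children):
    """tasks: iterable of task ids; children: dict task -> set of child tasks."""
    indegree = {t: 0 for t in tasks}
    for t in tasks:
        for c in children.get(t, set()):
            indegree[c] += 1
    ready = deque(sorted(t for t, d in indegree.items() if d == 0))
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for c in sorted(children.get(t, set())):
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    if len(order) != len(indegree):
        raise ValueError("task graph contains a cycle")
    return order

# Example: t1 -> t2, t1 -> t3, t3 -> t4 yields ['t1', 't2', 't3', 't4']
print(topological_order(["t1", "t2", "t3", "t4"],
                        {"t1": {"t2", "t3"}, "t3": {"t4"}}))
```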

3.2.2. Non-shared file system execution engine

The PegasusLite [46] workflow engine was developed for the case in which jobs are executed in a non-shared filesystem environment. In such deployments, the worker nodes on a cluster do not share a file system between themselves or between them and the data staging server. Hence, before a user task can be executed on a worker node, the input data needs to be staged in, and similarly the task's outputs need to be staged back to the staging server when the task finishes. In this case, the Mapper adds data transfer nodes to the workflow to move input data to a data staging server associated with the execution site where the jobs execute. PegasusLite is then deployed on demand when the remote cluster scheduler schedules the job onto a worker node. PegasusLite discovers the best directory to run the job in, retrieves the inputs from the data staging server, executes the user's tasks, stages the outputs back to the data staging server, and removes the directory that it created when it started. All the tasks are executed sequentially by PegasusLite. For more complex executions that utilize multiple cores, we provide an MPI-based engine, PMC, described below.

3.2.3. MPI/shared file system execution engine

We primarily developed Pegasus MPI Cluster (PMC) [47] to execute large workflows on petascale systems such as Kraken [48] and Blue Waters [49], where our traditional approach of using Condor glideins [50] did not work. These petascale systems are highly optimized for running large, monolithic, parallel jobs, such as MPI codes, and use specialized environments, such as the Cray XT System Environment. The nodes in the Cray environment have a very minimal kernel with almost no system tools available, no shared libraries, and only limited IPv4/6 networking support. The networking on these nodes is usually limited to system-local connections over a custom, vendor-specific interconnect. These differences significantly limited our ability to rely on HTCondor glideins, as they depend on a basic set of system tools and the ability to make outbound IP network connections; PMC is a good solution for these machines.

The PMC workflow execution engine targets HPC architectures, leveraging MPI for communication and the Master–Worker paradigm for execution. PMC takes in a DAG-based format very similar to HTCondor DAGMan's format and starts a single master process and several worker processes, a standard Master–Worker architecture. The Master distributes tasks to the worker processes for execution when their DAG dependencies are satisfied. The Worker processes execute the assigned tasks and return the results to the Master. Communication between the Master and the Workers is accomplished using a simple text-based protocol implemented using MPI_Send and MPI_Recv. PMC considers cores and memory to be consumable resources, and can thus handle mixes of single core, multi-threaded, and large memory tasks, with the only limitation being that a task has to have lower resource requirements than a single compute node can provide. Similar to DAGMan, PMC generates a rescue log, which can be used to recover the state of a failed workflow when it is resumed.

3.3. Data management during execution

The separation between the workflow description and the execution environment also extends to data. Users indicate the needed inputs and the outputs generated by the workflow using logical file names. The Mapper can then decide where to access the input data from (many scientific collaborations replicate their datasets across the wide area for performance and reliability reasons) and how to manage the results of the computations (where to store the intermediate data products). In the end the results are staged out to the location indicated by the user (either the submit host or another data storage site).

Having the ability to move workflows between execution resources requires a certain degree of flexibility when it comes to data placement and data movement. Most workflows require a data storage resource close to the execution site in order to execute in an efficient manner. In an HPC cluster, the storage resource is usually a parallel filesystem mounted on the compute nodes. In a distributed grid environment, the storage resource can be a GridFTP or SRM server, and in a cloud environment, the storage resource can be an object store, such as Amazon S3 or OpenStack data services.

The early versions of Pegasus WMS only supported execution environments with a shared filesystem for storing intermediate data products that were produced during workflow execution. With distributed and cloud systems emerging as production quality execution environments, the Pegasus WMS data management internals were reconsidered and refactored to provide a more open-ended data management solution that allows scientists to efficiently use the infrastructure available to them. The criteria for the new solution were:

1. Allow for late binding of tasks and data. Tasks are mapped to compute resources at runtime based on resource availability. A task can discover input data at runtime, and possibly choose to stage the data from one of many locations.
2. Allow for an optimized static binding of tasks and data if the scientist has only one compute resource selected and there is an appropriate filesystem to use.
3. Support access to filesystems and object stores using a variety of protocols and security mechanisms.

As a result, Pegasus 4.0 introduced the ability to plan workflows not only against a shared filesystem resource, but also against remote object store resources. Consider a job to be executed on Amazon EC2 with a data dependency on an object existing in Amazon S3. The job will have data transfers added to it during the planning process so that data is staged in/out to/from the working directory of the job. The three user-selectable data placement approaches are:

Shared filesystem. Pegasus uses a shared filesystem on a compute resource for intermediate files. This is an efficient approach, but it limits the execution of the workflow to nodes that have the shared filesystem mounted. It can also have scalability limitations for large-scale workflows.

Non-shared filesystem. Each job is its own small workflow managed by the PegasusLite engine. It manages data staging of inputs and outputs, to and from where the job is executed. This approach provides a flexible solution when executing workflows in a distributed environment, and enables late binding, or decision making at runtime (see Section 3.2.2). Because jobs are not limited to run where the data is already located, these jobs can be started across different execution environments and across different infrastructures, and can still retrieve the necessary input data, execute the tasks, and stage out the output data. Supported remote file systems include iRODS [51], Amazon S3 [52] and compatible services, HTTP [53], GridFTP [54], SRM [55], and SSH [56].

HTCondor I/O. HTCondor jobs are provided a list of inputs and outputs of the job, and HTCondor manages the data staging. The advantages of this approach are that it is simple to set up, as no additional services and/or credentials are required, and that data transfer status/failures are tied directly to job execution. The downside is a more limited number of protocols than the non-shared filesystem case can provide, and less flexibility in the data placement decisions.
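The non-shared filesystem mode above relies on PegasusLite wrapping each job on the worker node. The following is a conceptual sketch, not the actual PegasusLite script, of that lifecycle; the transfer-tool command and the URL layout are hypothetical placeholders for whatever protocol-specific tooling is configured.

```python
# Conceptual sketch of the PegasusLite job lifecycle on a worker node.
import os
import shutil
import subprocess
import tempfile

def run_pegasuslite_job(inputs, outputs, tasks, staging_url):
    workdir = tempfile.mkdtemp(prefix="pegasuslite-")          # choose a working directory
    try:
        for lfn in inputs:                                      # stage in inputs
            subprocess.check_call(["transfer-tool", "get",
                                   staging_url + "/" + lfn,
                                   os.path.join(workdir, lfn)])
        for argv in tasks:                                      # run the tasks sequentially
            subprocess.check_call(argv, cwd=workdir)
        for lfn in outputs:                                     # stage out outputs
            subprocess.check_call(["transfer-tool", "put",
                                   os.path.join(workdir, lfn),
                                   staging_url + "/" + lfn])
    finally:
        shutil.rmtree(workdir)                                  # clean up the working directory
```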

Fig. 5. Data flow for a Pegasus workflow planned with non-shared filesystem mode, running on the Open Science Grid.

Fig. 5 illustrates a Pegasus WMS workflow planned in the non-shared filesystem mode and running on the Open Science Grid. To the left in the figure is the submit host, where Pegasus maps the abstract workflow to an executable workflow. During execution, data is staged into a central object store, such as an SRM, GridFTP, or iRODS server. When a job starts on a remote compute resource, PegasusLite detects an appropriate working directory for the task, pulls in the input data for the task from the storage element, runs the task, stages out the products of the task, and cleans up the working directory. The workflow also contains tasks to clean up the storage element as the workflow progresses and to stage the final output data to a long-term storage site.

The main concept in the new data management approach is the ability to place the data staging element anywhere in the execution environment. In the previous Open Science Grid example, the data storage element was an SRM/GridFTP/iRODS server placed near the compute resources. Let us consider the same workflow, but this time running on an HPC cluster with a shared filesystem (Fig. 6). In this example, the data storage element has moved inside the compute resource and is provided by the shared filesystem. By changing the data staging approach to Shared Filesystem, Pegasus will optimize the data access patterns by eliminating staging to/from the staging element and will set up the workflow tasks to do POSIX-based I/O directly against the shared filesystem.

Fig. 6. Data management on a shared filesystem resource (only the remote execution environment is shown).

Pegasus WMS has a clean separation between the Mapper deciding what files need to be transferred and when, and how those transfers are executed. The execution is handled by a data management component called pegasus-transfer. pegasus-transfer can be executed anywhere in the system, such as on the submit host in the case of transfers to/from remote resources, or on the remote resource to pull/push data from there. The Mapper provides pegasus-transfer with a list of transfers to complete and a set of credentials required to authenticate against remote storage services. pegasus-transfer then detects the availability of command line tools to carry out the transfers. Examples of such standard tools are globus-url-copy for GridFTP transfers and wget for HTTP transfers, as well as Pegasus internal tools such as pegasus-s3 to interact with Amazon S3. Transfers are then executed in a managed manner. In the case where the source and destination protocols are so different that no single tool can transfer from the source directly to the destination, pegasus-transfer will break the transfer up into two steps: one tool transfers from the source to a temporary file, and another tool transfers from the temporary file to the destination. This flexibility allows the Mapper to concentrate on planning for the overall efficiency of the workflow, without worrying about transfer details, which can come up at runtime.
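As a rough illustration of the two-step fallback just described, the sketch below picks a command line tool based on the source and destination URL schemes and, when no single tool covers the pair, stages through a temporary local file. The tool table and argument conventions are simplified placeholders, not pegasus-transfer's actual configuration.

```python
# Simplified sketch of protocol-based tool selection with a two-step fallback.
import os
import subprocess
import tempfile
from urllib.parse import urlparse

# Hypothetical mapping from (source scheme, destination scheme) to a tool.
TOOLS = {
    ("gsiftp", "file"): ["globus-url-copy"],
    ("file", "gsiftp"): ["globus-url-copy"],
    ("s3", "file"): ["pegasus-s3", "get"],
    ("file", "s3"): ["pegasus-s3", "put"],
}

def scheme(url):
    return urlparse(url).scheme or "file"

def transfer(src, dst):
    pair = (scheme(src), scheme(dst))
    if pair in TOOLS:
        subprocess.check_call(TOOLS[pair] + [src, dst])
        return
    # No single tool covers src -> dst (e.g. gsiftp -> s3): stage through a
    # temporary local file using two separate tools.
    tmp = tempfile.NamedTemporaryFile(delete=False)
    tmp.close()
    try:
        transfer(src, "file://" + tmp.name)
        transfer("file://" + tmp.name, dst)
    finally:
        os.unlink(tmp.name)
```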

3.4. Job runtime provenance

Workflow and job monitoring are critical to robust execution and essential for users to be able to follow the progress of their workflows and to decipher failures. Because the execution systems have different job monitoring capabilities and different ways of reporting job status, we developed our own uniform, lightweight job monitoring capability: pegasus-kickstart [57] (kickstart). The Pegasus Mapper wraps all jobs with kickstart, which gathers useful runtime provenance and performance information about the job. The information is sent back to the submit host upon job completion (whether it is successful or not) and is integrated into a larger view of the workflow status and progress, as described in Section 7.2.

4. Compile time workflow restructuring

As mentioned in the previous section, the flexibility and portability in terms of the supported execution environments and data layout techniques are achieved by having a separation between the user's description of the workflow and the workflow that is actually executed. The Mapper achieves this by performing a series of transformations of the underlying DAG, shown in Fig. 7. During this process the Mapper also performs a number of transformations to improve the overall workflow performance. The Mapper has a pluggable architecture that allows Pegasus WMS to support a variety of backends and strategies for each step in the refinement pipeline. In order to better explain the refinement steps, Fig. 8 shows a diamond-shaped six-node input workflow and the corresponding final executable workflow generated by the Mapper. The following subsections detail this process.

Fig. 7. Pegasus compile time workflow restructuring.

4.1. Data reuse/workflow reduction

Pegasus originated from the concept of virtual data. As a result, when Pegasus receives a workflow description, it checks whether the desired data are already computed and available. The Mapper queries a Replica Catalog to discover the locations of these files. If any of these files exist, then the Mapper can remove the jobs that would generate that data from the workflow. The Mapper can also reason about whether the removal of a node in the workflow results in a cascaded removal of other parent nodes, similar to the tool "make" [58] used to compile executables from source code. In the case where the workflow outputs are already present somewhere, the Mapper will remove all the compute nodes in the workflow and just add data stageout and registration nodes to the workflow. These jobs will result in transfers of the outputs to the user-specified location. In the example shown in Fig. 8, the Mapper discovers that the file f.d exists somewhere, and is able to reason that Task D and, by extension, Task B do not need to be executed. Instead, a node is added to the workflow to stage f.d (during the replica selection stage). Task A still needs to execute because file f.a is needed by Task C. Data reuse is optional; in some cases it is actually more efficient to recompute the data than to access it.
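The following is a simplified sketch of this reduction step under the assumption that the user lists the final outputs they want: tasks whose required products are already registered are pruned, and the pruning cascades to parents whose outputs can simply be staged in instead of recomputed. It is an illustration of the idea, not the Mapper's actual algorithm, and the dependency structure used in the example is assumed from the text's description of Fig. 8.

```python
# Simplified data reuse / workflow reduction sketch.
def reduce_workflow(tasks, parents, outputs, requested, available):
    """
    tasks:     set of task ids
    parents:   dict task -> set of parent tasks (data dependencies)
    outputs:   dict task -> set of logical files the task produces
    requested: final outputs the user asked for
    available: logical files already registered in the Replica Catalog
    Returns the set of tasks that still need to run.
    """
    needed = set()

    def mark(task):
        if task in needed:
            return
        needed.add(task)
        # Parents must also run, unless their outputs are already available
        # and can be staged in by a data transfer node instead.
        for p in parents.get(task, set()):
            if not outputs[p] <= available:
                mark(p)

    for t in tasks:
        # A task must run if it produces a requested file that is not available.
        if outputs[t] & (requested - available):
            mark(t)
    return needed

# Fig. 8-style example: f.d already exists, so Task D and, by extension,
# Task B are pruned; Task A still runs because Task C needs f.a.
tasks = {"A", "B", "C", "D"}
parents = {"B": {"A"}, "C": {"A"}, "D": {"B"}}
outputs = {"A": {"f.a"}, "B": {"f.b"}, "C": {"f.c"}, "D": {"f.d"}}
print(reduce_workflow(tasks, parents, outputs, {"f.c", "f.d"}, {"f.d"}))  # {'A', 'C'}
```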

4.2. Site selection

In this phase, the Mapper looks at the reduced workflow and passes it to a Site Selector that maps the jobs in the reduced workflow to the candidate execution sites specified by the user. The Mapper supports a variety of site selection strategies such as Random, Round Robin, and HEFT [59]. It also allows plugging in external site selectors using a standard file-based interface [29], allowing users to add their own algorithms geared towards their specific application domain and execution environment.

During this phase, the Mapper is also able to select sites that do not have user executables pre-installed, as long as the user has stageable binaries cataloged in the Transformation Catalog and the executable is compatible with the execution site's architecture. This is a fairly common use case, as Pegasus users execute workflows on public computing infrastructure [41,60], where user executables may not be preinstalled, or where users want to use the latest version of their codes.

4.3. Task clustering

High communication overhead and queuing times at the compute resource schedulers are common problems in distributed computing platforms, such as cloud infrastructures and grids [61]. The execution of scientific workflows on these platforms significantly increases such latencies, and consequently increases the slowdown of the applications. In Pegasus, the task clustering technique is used to cope with the low performance of short running tasks, i.e. tasks that run for a few minutes or seconds [62,63]. These tasks are grouped into coarse-grained tasks to reduce the cost of data transfers when grouped tasks share input data, and to save queuing time when resources are limited. This technique can improve the workflow completion time by as much as 97% [64]. Pegasus currently implements level-, job runtime-, and label-based clustering. In the following, we give an overview of these techniques.

Level-based horizontal clustering. This strategy merges multiple tasks within the same horizontal level of the workflow, in which the horizontal level of a task is defined as the furthest distance from the root task to this task. For each level of the workflow, tasks are grouped by the site on which they have been scheduled by the Site Selector. The clustering granularity, i.e. the number of tasks per cluster in a clustered job, can be controlled using either of the following parameters: clusters.size, which specifies the maximum number of tasks in a single clustered job; and clusters.num, which specifies the allowed number of clustered jobs per horizontal level and per site of a workflow. Fig. 9 illustrates how the clusters.size parameter affects the horizontal clustering granularity.
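A minimal sketch of the level-based grouping described above is given below: each task's level is its furthest distance from a root, and tasks mapped to the same site and level are packed into clusters of at most clusters.size tasks. It is illustrative only; the real Mapper also supports the clusters.num variant and other refinements.

```python
# Simplified level-based horizontal clustering sketch.
from collections import defaultdict

def task_levels(tasks, parents):
    """Level of a task = furthest distance from a root (a task with no parents)."""
    memo = {}
    def level(t):
        if t not in memo:
            ps = parents.get(t, set())
            memo[t] = 0 if not ps else 1 + max(level(p) for p in ps)
        return memo[t]
    return {t: level(t) for t in tasks}

def level_based_clustering(tasks, parents, site_of, clusters_size):
    levels = task_levels(tasks, parents)
    groups = defaultdict(list)
    for t in sorted(tasks):
        groups[(levels[t], site_of[t])].append(t)
    clustered_jobs = []
    for (lvl, site), members in sorted(groups.items()):
        # Split each (level, site) group into chunks of at most clusters_size tasks.
        for i in range(0, len(members), clusters_size):
            clustered_jobs.append({"level": lvl, "site": site,
                                   "tasks": members[i:i + clusters_size]})
    return clustered_jobs
```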

Fig. 8. Translation from abstract to executable workflow.

Fig. 9. Example of level-based clustering.

Fig. 10. Example of label-based clustering.

Job runtime-based horizontal clustering. Level-based clustering is suitable for grouping tasks with similar runtimes. However, when the runtime variance is high, this strategy may perform poorly, e.g. all smaller tasks get clustered together, while larger tasks are grouped into another clustered job. In such cases, runtime-based task clustering can be used to merge tasks into clustered jobs that have a smaller runtime variance. Two strategies have recently been added for balanced runtime clustering. The first one groups tasks into a job up to an upper bound set by a maxruntime parameter and employs a bin packing strategy to create clustered jobs that have similar expected runtimes [29]. The second strategy allows users to group tasks into a fixed number of clustered jobs and distributes tasks evenly (based on job runtime) across jobs, such that each clustered job takes approximately the same amount of time. The effectiveness of this balanced task clustering method has been evaluated in [65,66]. Task runtime estimates are provided by the user or a runtime prediction function [63].

Label-based clustering. In label-based clustering, users label tasks to be clustered together. Tasks with the same label are merged by the Mapper into a single clustered job. This allows users to create clustered jobs or use a clustering technique that is specific to their workflows. If there is no label associated with a task, the task is not clustered and is executed as is. Thus, any clustering scheme can be implemented using an appropriate labeling program. Fig. 10 illustrates how a workflow is executed using Pegasus MPI Cluster as a remote workflow execution engine. The Mapper clusters the subgraph with the same label into a single clustered job in the executable workflow. This job appears as a single MPI job requesting n nodes and m cores on the remote cluster.

4.4. Replica selection and addition of data transfer and registration nodes

One of the main features of Pegasus is its ability to manage the data for the scientist's workflows. The Mapper in this phase looks up the Replica Catalog to discover the location of input datasets and adds data stage-in jobs. The replica catalog can be a simple file that has the logical-to-physical filename mapping, it can be a database, or it can be a fully distributed data registration system, such as the one used in LIGO [67]. The stage-in jobs transfer input data required by the workflow from the locations specified in the Replica Catalog to a directory on the data staging site associated with the job. Usually the data staging site is the shared filesystem on the execution site, or in some cases it can be a file server close to the execution site, such as an SRM server when executing on OSG, or S3 when executing the workflows on the Amazon cloud. The data stage-in nodes can also symlink against existing datasets on the execution sites.

In the case where multiple locations are specified for the same input dataset in the Replica Catalog, the location from which to stage the data is selected using a Replica Selector. Pegasus supports a variety of replica selection strategies, such as preferring datasets already present on the target execution site, preferring sites that have good bandwidth to the execution site, randomly selecting a replica, or using a user-provided ranking for the URLs that match a certain regular expression. Again, the scientist has the option of using Pegasus-provided algorithms or supplying their own.

Additionally, in this phase Pegasus also adds data stageout nodes to stage intermediate and output datasets to a user-specified storage location. The intermediate and final datasets may be registered back in the Replica Catalog to enable subsequent data discovery and reuse. Data registration is also represented explicitly by the addition of a registration node to the workflow.

It is worthwhile to mention that the data stage-in and stage-out nodes added to the workflow do not have a one-to-one mapping with the compute jobs. The Mapper by default creates a cluster of data stage-in or stage-out nodes per level of the workflow. This is done to ensure that for large workflows the number of transfer jobs executing in parallel does not overwhelm the remote file servers by opening too many concurrent connections. In the Mapper, the clustering of the data transfer nodes is configurable and handled by a Transfer Refiner.

4.5. Addition of directory creation and file removal nodes

Once the Mapper has added the data transfer nodes to the workflow, it proceeds to add directory creation and file removal nodes to the underlying DAG. All the data transfer nodes added by the Mapper are associated with a unique directory on a data staging site. In the case where the workflow is executing on the shared filesystem of an execution site, this directory is created on that filesystem.

After the directory creation nodes are added, the Mapper can optionally invoke the cleanup refiner, which adds data cleanup nodes to the workflow. These nodes remove data from the workflow-specific directory when it is no longer required by the workflow. This is useful in reducing the peak storage requirements of the workflow and is often used in data-intensive workflows, as explained below.

Scientific workflows [30,68–70] often access large amounts of input data that needs to be made available to the individual jobs when they execute. As jobs execute, new output datasets are created. Some of them may be intermediate, i.e. required only to be present during the workflow execution and no longer needed after the workflow finishes, while others are the output files that the scientist is interested in and need to be staged back from the remote clusters to a user-defined location.

In the case of large data-intensive workflows it is possible that the workflows will fail because of a lack of sufficient disk space to store all the inputs, the intermediate files, and the outputs of the workflow. In our experience, this is often the case when scientists are trying to execute large workflows on local campus resources or shared computational grids such as OSG. One strategy to address this is to remove all the data staged and the outputs generated after the workflow finishes executing. However, cleaning up data after the workflow has completed may not be effective for data-intensive workflows, as the peak storage used during the workflow may exceed the space available to the user at the data staging site.

The other strategy that we developed and implemented in Pegasus [71,18] is to clean up data from the data staging site as the workflow executes and the data are no longer required. To achieve this, the Mapper analyzes the underlying graph and file dependencies and adds data cleanup nodes to the workflow to "clean up" data that is not needed downstream. For example, the input data that is initially staged in for the workflow can be safely deleted once the jobs that require this data have successfully completed. Similarly, we can delete the output data once the output has been successfully staged back to the user-specified location. When we initially implemented the algorithm, it was aggressive and tried to clean up data as soon as possible, i.e. in the worst case each compute job could have a cleanup job associated with it. However, this had the unintended side effect of increasing the workflow runtime, as the number of jobs that needed to be executed doubled. In order to address this, we implemented clustering of the cleanup jobs using level-based clustering. This allows us to control the number of cleanup jobs generated per level of the workflow. By default, the Mapper adds 2 cleanup jobs per level of the workflow. The cleanup capabilities of Pegasus as the workflow executes have greatly helped a genomics researcher at UCLA execute large-scale and data-intensive genome sequencing workflows [72] on the local campus resource, where the per-user scratch space available on the cluster was 15 TB. The peak storage requirement for a single hierarchical workflow in this domain was around 50 TB and the final outputs of the workflow were about 5–10 TB. As with other Pegasus components, the user can provide their own cleanup algorithm.

4.6. Remote workflow execution engine selection

At this stage, the Mapper has already identified the requisite data management jobs (stage-in, stage-out and cleanup) for the workflow. The Mapper now selects a remote workflow engine to associate with the jobs. For tasks clustered into a single job, it associates a remote workflow engine capable of executing a collection of tasks on the remote node. For example, it may select Pegasus MPI Cluster or Pegasus Cluster to manage these tasks. Additionally, jobs mapped to a site without a shared filesystem get associated with the PegasusLite execution engine. It is important to note that the Mapper can associate multiple remote workflow engines with a single job. PegasusLite can be used to manage the data movement for the jobs to the worker node, while the actual execution of the clustered tasks can be managed by Pegasus Cluster or Pegasus MPI Cluster.

4.7. Code generation

This is the last step of the refinement process; at this point the Mapper writes out the executable workflow in a form understandable by the local workflow execution engines. Pegasus supports the following code generators:

HTCondor. This is the default code generator for Pegasus. This generator writes the executable workflow as an HTCondor DAG file and associated job submit files. The job submit files include code targeting the remote execution engines, supporting workload delegation to the remote resources.

Shell. This code generator writes the executable workflow as a shell script that can be executed on the submit host.

The Mapper can easily support other DAG-based workflow executors. For example, we have previously integrated Pegasus with the GridLab Resource Management System (GRMS) [73].
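As an illustration of the cleanup placement idea in Section 4.5, the sketch below computes, for each file handled at the data staging site, the set of jobs that must finish before a cleanup node for that file can run; files that are final outputs are excluded because they are staged out instead. This is a simplification of the Mapper's cleanup algorithm, and the helper names are hypothetical.

```python
# Simplified sketch of cleanup-node placement.
from collections import defaultdict

def cleanup_dependencies(jobs, inputs, outputs, final_outputs):
    """
    jobs:            iterable of job ids
    inputs, outputs: dict job -> set of logical files it reads / writes
    final_outputs:   files to be staged out rather than deleted
    Returns a dict file -> set of jobs that must complete before the file
    can be removed from the data staging site.
    """
    users = defaultdict(set)
    for j in jobs:
        for f in inputs.get(j, set()) | outputs.get(j, set()):
            users[f].add(j)
    return {f: js for f, js in users.items() if f not in final_outputs}

# A cleanup job for file f is added as a child of every job in the returned
# set for f; the Mapper then clusters these cleanup jobs per workflow level
# (by default, two cleanup jobs per level) to bound their number.
```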

Fig. 11. Hierarchical workflows. Expansion ends when a DAX with only compute jobs is encountered.

5. Runtime optimizations

The majority of the optimizations of Pegasus workflows take place in the mapping step, i.e. before the workflow has been submitted to HTCondor and started to run. The advantage of this is that the Mapper knows the full structure of the workflow and can make well-informed decisions based on that knowledge. On the other hand, there are times when knowledge is acquired at runtime, and runtime decisions need to be made. The general Pegasus solution for runtime optimizations is to use hierarchical workflows.

5.1. Interleaving mapping and execution

Fig. 11 shows how Pegasus mapping/planning can be interleaved with the workflow execution. The top-level workflow is composed of four tasks, where task A3 is a sub-workflow, which is in turn composed of two other sub-workflows, B3 and B4. Each sub-workflow has an explicit workflow planning phase and a workflow execution phase. When the top-level workflow (DAX A) is mapped by the Mapper, the sub-workflow jobs in the workflow are wrapped by Pegasus Mapper instances. The Pegasus Mapper is invoked for sub-workflows only when the sub-workflow nodes are executed by the DAGMan instance managing the workflow containing these nodes. For example, the mapping for the A3 sub-workflow node (which refers to DAX B) only occurs when DAGMan is executing the workflow corresponding to DAX A and reaches the node A3. As evident in the figure, our implementation covers arbitrary levels of expansion. The majority of workflows we see use a single level of expansion, where only the top level of the workflow has sub-workflow nodes (see the earthquake application workflow in Fig. 16).

5.2. Just in time planning

In distributed environments, resource and data availability can change over time: resources can fail, data can be deleted, the network can go down, or resource policies can change. Even if a particular environment is changing slowly, the duration of the execution of the workflow components can be quite large, and by the time a component finishes execution, the data locations may have changed, as well as the availability of the resources. Choices made ahead of time, even if still feasible, may be poor.

When using hierarchical workflows, Pegasus only maps portions of the workflow at a time. In our example, jobs A1, A2, and A4 will be mapped for execution, but the sub-workflow contained in A3 will be mapped only when A1 finishes execution. Thus, a new instance of Pegasus will make just-in-time decisions for A3. This allows Pegasus to adapt the sub-workflows to other execution environments if, during execution of a long running workflow, additional resources come online or previously available resources disappear.

5.3. Dynamic workflow structures

In some cases, the workflow structure cannot be fully determined ahead of time, and hierarchical workflows can provide a solution where part of the workflow needs to be generated dynamically. A common occurrence of this is in workflows that contain a data discovery step, where the data that is found determines the structure of the remaining parts of the workflow. For example, in order to construct the workflow description for the Galactic Plane astronomy mosaicking workflow [74], a super-workflow of a set of 1001 Montage [69] workflows is created. The top-level workflow queries a datafind service to get a list of input FITS images for the region of the sky for which the mosaic is requested. Once the top-level workflow has the list of input images, a Montage sub-workflow is created to process those images. Fig. 12 shows the structure of the workflow.

Fig. 12. The Galactic Plane workflow, showing the top-level workflow with 1001 data find tasks and sub-workflows.
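To make the hierarchical workflow mechanism concrete, the following is a minimal sketch of a top-level DAX with one sub-workflow node, written against the DAX3 Python API. Class and method names follow that API but may vary between Pegasus versions; the datafind transformation and the file names are illustrative.

```python
# Sketch of a top-level DAX containing a sub-workflow (hierarchical workflow).
from Pegasus.DAX3 import ADAG, DAX, File, Job, Link

top = ADAG("top_level")

# An ordinary compute job, e.g. a data discovery step.
find = Job(name="datafind")
listing = File("image_list.txt")
find.uses(listing, link=Link.OUTPUT)
top.addJob(find)

# A sub-workflow node referring to another DAX file; it is planned and
# executed only when this node becomes runnable, enabling just-in-time mapping.
sub = DAX("montage_sub.dax")
sub.uses(File("montage_sub.dax"), link=Link.INPUT)
top.addDAX(sub)

# The sub-workflow runs after the data discovery job completes.
top.depends(parent=find, child=sub)

with open("top_level.dax", "w") as out:
    top.writeXML(out)
```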
6. Reliability

Even with just-in-time planning, when running on distributed infrastructures failures are to be expected. In addition to infrastructure failures such as network or node failures, the applications themselves can be faulty. Pegasus dynamically handles failures at multiple levels of the workflow management system, building upon the reliability features of DAGMan and HTCondor. The various failure handling strategies employed are described below.

Job retry. If a node in the workflow fails, then the corresponding job is automatically retried/resubmitted by HTCondor DAGMan. This is achieved by associating a job retry count with each job in the DAGMan file for the workflow. This automatic resubmission in case of failure allows us to handle transient errors such as a job being scheduled on a faulty node in the remote cluster, or errors occurring because of job disconnects due to network errors.

Data movement reliability. As described earlier, low-level transfers are handled by the pegasus-transfer subsystem/command line tool. pegasus-transfer implements file transfer retries, and it can vary the grouping of the given transfers and the command line arguments given to protocol-specific command line tools. For example, when using globus-url-copy for a non-third-party transfer (a transfer from a GridFTP server to a local file, for example), in the initial attempt pegasus-transfer will pass -parallel 6 to globus-url-copy to try to use parallel transfers for better performance. The downside to parallel connections is that they can easily fail if there are firewalls in the way. If the initial transfer fails, pegasus-transfer will remove the -parallel 6 argument for subsequent transfers and try the safer, but slower, single connection transfer.
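The retry-with-fallback behavior described above can be pictured with a short sketch. The loop below is illustrative and is not pegasus-transfer's actual implementation; only the globus-url-copy arguments mirror the example in the text.

    import subprocess

    def transfer(src, dst, attempts=3):
        """Illustrative retry loop: try parallel streams first, then fall
        back to a single, firewall-friendly connection."""
        for attempt in range(attempts):
            cmd = ["globus-url-copy"]
            if attempt == 0:
                cmd += ["-parallel", "6"]  # fast path, may be blocked by firewalls
            cmd += [src, dst]
            if subprocess.call(cmd) == 0:
                return True
        return False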
Failed workflow recovery/rescue DAGs. If the number of failures for a job exceeds the set number of retries, then the job is marked as a fatal failure, which eventually leads the workflow to fail. When a DAG fails, DAGMan writes out a rescue DAG that is similar to the original DAG, but in which the nodes that succeeded are marked as done. This allows the user to resubmit the workflow once the source of the original error has been resolved. The workflow will restart from the point of failure. This helps handle errors such as incorrectly compiled codes, faulty application executables, and remote clusters being inaccessible because network links are down or because the compute cluster is unavailable due to maintenance or downtime.

Workflow replanning. In case of workflow failures, users also have an option to replan the workflow and move the remaining computation to another resource. Workflow replanning leverages the data reuse feature explained earlier, where the outputs are registered back in the data catalog as they are generated. The data reuse algorithm prunes the workflow (similar to what ‘‘make’’ does) based on the existing output and intermediate files present in the data catalog. The intermediate files are shipped to the new cluster/location by the data stage-in nodes added by the planner during the replanning phase. Workflow replanning also helps in cases where the user makes an error while generating the input workflow description that leads to the workflow failing, such as incorrect arguments for application codes, or an invalid DAG structure that results in a job being launched when not all of its inputs are present.

In hierarchical workflows, when a sub-workflow fails and is automatically retried, the mapping step is automatically executed and a new mapping is performed.

7. Usability

Over the years, as Pegasus has matured and increased its user base, usability has become a priority for the project. Usability touches on various aspects of the software. In our case we want to be able to provide the user with an easy way of composing, monitoring and debugging workflows. In this section we focus on Pegasus' capabilities in these areas and the software engineering practices applied to reach our usability goals.

7.1. Workflow composition

We have chosen to rely on the expressiveness and power of higher-level programming languages to aid the user in describing their workflows. To achieve this, we have focused on providing simple, easy to use DAX APIs in major programming languages used by scientists such as Java, Python and Perl. We have taken great care to keep the format simple and have resisted the urge to introduce complicated constructs.

While it can be argued that providing a graphical workflow composition interface would make it easier to create workflows in Pegasus, we believe that a GUI would make complex workflows more difficult to compose and maintain. Graphical workflow tools provide constructs such as for-each loops, which can be used to generate a large number of tasks, but if those constructs are not what the scientist needs, new constructs have to be defined, which usually involves writing custom code. Because Pegasus is just an API on top of well-established programming languages, the scientist can leverage the constructs of those programming languages across the whole workflow composition, not only in the complex areas, as is the case with graphical workflow composition interfaces.

The level of difficulty of workflow composition depends on how well the codes and executables of the tasks are defined. Scientific users often develop code that is structured to run in their local environments. They make assumptions about their environment, such as hard-coded paths for input and output datasets and where dependent executables are installed. This results in their code failing when run remotely as part of a scientific workflow. Based on our interactions with users over the years, we developed a series of guidelines for writing portable scientific code [29] that can be easily deployed and used with Pegasus. Although most of the guidelines are simple, such as propagating exit codes from wrapper scripts and providing arguments for input and output file locations, making them available to scientists allows them to write code that can be easily integrated into portable workflows.
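As a concrete illustration of this API-based approach, the sketch below builds a small two-job abstract workflow with the DAX3 Python API. It mirrors the diamond-style examples in the user guide [29]; the job and file names are made up, and exact method signatures may differ between Pegasus releases, so treat it as a sketch rather than reference code.

    from Pegasus.DAX3 import ADAG, Job, File, Link

    dax = ADAG("example")

    raw = File("f.a")
    cleaned = File("f.b")

    preprocess = Job(name="preprocess")
    preprocess.addArguments("-i", raw, "-o", cleaned)
    preprocess.uses(raw, link=Link.INPUT)
    preprocess.uses(cleaned, link=Link.OUTPUT)
    dax.addJob(preprocess)

    analyze = Job(name="analyze")
    analyze.addArguments("-i", cleaned)
    analyze.uses(cleaned, link=Link.INPUT)
    dax.addJob(analyze)

    dax.depends(parent=preprocess, child=analyze)

    with open("example.dax", "w") as out:
        dax.writeXML(out)

Because this is ordinary Python, the scientist can generate thousands of such jobs with loops, functions, or any other language construct, which is the point made above about building on general-purpose languages.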
7.2. Monitoring and debugging

Pegasus provides users with the ability to track and monitor their workflows without logging on to the remote nodes where the jobs ran or manually going through job and workflow logs. It collects and automatically processes information from three distinct types of logs that provide the basis for the runtime provenance of the workflow and of its jobs:

Pegasus Mapper log. Because of the workflow optimizations the Mapper performs, there is often not a one-to-one mapping between the tasks and the jobs [75]. Thus, information needs to be collected to help relate the executable workflow to the original workflow submitted by the user. The Mapper captures information about the input abstract workflow, the jobs in the executable workflow, and how the input tasks map to the jobs in the executable workflow. This information allows us to correlate user-provided tasks in the input workflow with the jobs that are launched to execute them. The Mapper logs this information to a local file in the workflow submit directory. With these detailed logs, Pegasus can generate the provenance of the results, referring back to the original workflow. Pegasus can also provide the provenance of the mapping process itself, in case that needs to be inspected [8,76].

Local workflow execution engine logs (DAGMan log). During the execution of a workflow, HTCondor DAGMan writes its log file (dagman.out) in near real-time. This file contains the status of each job within the workflow, as well as of the pre-execute and post-execute scripts associated with it.

Job logs (kickstart records). Regardless of the remote execution engine, all the jobs that are executed through Pegasus are launched by the lightweight C executable pegasus-kickstart [57], which captures valuable runtime provenance information about the job. This information is logged by pegasus-kickstart as an XML record on its stdout and is brought back to the workflow submit directory by HTCondor as each job in the executable workflow completes. These job log files accumulate in the submit directory on the submit host and capture fine-grained execution statistics for each task that was executed. Some of this information includes:

Fig. 13. Workflow statistics displayed in pegasus-dashboard.

Fig. 14. Workflow gantt chart and job distribution pie charts displayed in pegasus-dashboard.

• the arguments a task was launched with and the exitcode with which it exited;
• the start time and duration of the task;
• the hostname of the host on which the task ran and the directory in which it executed;
• the stdout and stderr of the task;
• the environment that was set up for the job on the remote node;
• the machine information about the node that the job ran on, such as architecture, operating system, number of cores on the node, and available memory.

Together, all this information constitutes the provenance for the derived data products. The data from the DAGMan workflow log and the raw kickstart logs from the running jobs are normalized and populated into a relational database by a monitoring daemon called pegasus-monitord [75]. It is launched automatically whenever a workflow starts execution and by default populates an SQLite database in the workflow submit directory. We support other SQL backends, such as MySQL, but we chose SQLite as our default because we did not want the setup of a fully featured database such as MySQL to be a prerequisite for running workflows through Pegasus. In the future, we plan to provide support for exporting the runtime provenance stored in our relational datastore into a format conformant to the provenance community standard [77] provenance data model.

Automatic population of system logs into a database has enabled us to build a debugging tool called pegasus-analyzer that allows our users to diagnose failures. Its output contains a brief summary section showing how many jobs have succeeded and how many have failed. For each failed job, it prints information showing its last known state and information from the kickstart record, along with the location of its job description, output, and error files. It also displays any application stdout and stderr that was captured for the job.

Pegasus also includes a lightweight web dashboard that enables easy monitoring and online exploration of workflows, based on an embedded web server written in Python. The dashboard can display all the workflows (running and completed) for a particular user and allows users to browse their workflows. For each workflow, users can see a summary of the workflow run, including runtime statistics (snapshot displayed in Fig. 13) such as the number of jobs completed and failed and the workflow walltime, and can browse the data collected for each job by the system. The dashboard can also generate different types of charts, such as workflow gantt charts (snapshot displayed in Fig. 14) illustrating the progress of the workflow through time, and pie charts depicting the distribution of jobs on the basis of count and runtime.
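The kind of summary that pegasus-analyzer and the dashboard present can be sketched from a handful of per-job records like those listed above. The tuples below are invented, simplified stand-ins for the normalized records in the workflow's SQLite database; the sketch only illustrates what the collected fields make possible.

    # (job name, execution host, duration in seconds, exitcode)
    records = [
        ("preprocess_ID001", "node1042.hpcc.usc.edu", 12.4, 0),
        ("analyze_ID002",    "node1042.hpcc.usc.edu", 98.1, 0),
        ("plot_ID003",       "node0871.hpcc.usc.edu",  3.7, 1),
    ]

    succeeded = [r for r in records if r[3] == 0]
    failed = [r for r in records if r[3] != 0]

    print("%d succeeded, %d failed" % (len(succeeded), len(failed)))
    print("cumulative runtime: %.1f s" % sum(r[2] for r in records))
    for name, host, duration, exitcode in failed:
        print("FAILED %s on %s (exitcode %d, %.1f s)" % (name, host, exitcode, duration))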
7.3. Software engineering practices

Initially, Pegasus releases were built as binary distributions for various platforms on the NMI [78] build and test infrastructure and were made available for download on our website. While this worked for some of our users, we realized that even installing a binary distribution was a barrier to adoption for a lot of scientific communities. We needed to provide Pegasus releases as native RPM and DEB packages that could be included in various platform-specific repositories to increase the adoption rate. This allows system administrators to easily deploy new versions of Pegasus on their systems. In order to make Pegasus available as a native package we had to move the software to the standard FHS layout. Pegasus releases are now available in the Debian repository and are shipped with Debian Squeeze.

Pegasus has the ability to execute workflows on a variety of infrastructures. We found that changes made for one particular infrastructure often broke compatibility with others. At other times, newer versions of the underlying grid middleware, such as HTCondor and Globus, broke compatibility at the Pegasus level. It is important for us to find such errors before our users do. In the past few years, we have set up an automated testing infrastructure based on Bamboo [79] that runs nightly unit tests and end-to-end workflow tests, which allow us to regularly test the whole system.

Another area that we have focused on is improving the software documentation. We have worked on a detailed Pegasus user guide [29] that is regularly updated with each major and minor release of the software. The user guide also includes a tutorial that helps users get started with Pegasus. The tutorial is packaged as a virtual machine image with Pegasus and the dependent software pre-installed and configured. This enables new users of Pegasus to explore the capabilities Pegasus provides without worrying about the software setup.

7.4. Incorporation of user feedback

User feedback has been the driving force for many usability improvements. As a result, we have streamlined the configuration process and introduced a single pegasus-transfer client that automatically sets up the correct transfer client to stage the data at runtime. We have also focused on providing easier setups of the various catalogs that Pegasus requires. We also provided support for directory-based replica catalogs, where users can point to the input directory on the submit host containing the input datasets, rather than configuring a replica catalog with the input file mappings. The Pegasus 4.x series has improved support for automatically deploying Pegasus auxiliary executables as part of the executable workflows. This enables our users to easily run workflows in non-shared filesystem deployments such as the Open Science Grid, without worrying about installing Pegasus auxiliary executables on the worker nodes.
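The idea behind a directory-based replica catalog can be shown with a conceptual sketch: every file in a user-supplied input directory is mapped from its logical name to a physical file:// URL on the submit host. This is only an illustration of the concept; Pegasus' own implementation performs this internally and supports additional attributes.

    import os

    def directory_replica_catalog(input_dir, site="local"):
        """Conceptual sketch of a directory-based replica catalog."""
        mappings = {}
        for name in os.listdir(input_dir):
            path = os.path.abspath(os.path.join(input_dir, name))
            if os.path.isfile(path):
                # logical file name -> (physical URL, site handle)
                mappings[name] = ("file://" + path, site)
        return mappings

    # e.g. {'f.a': ('file:///home/user/inputs/f.a', 'local'), ...}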

Fig. 15. A CyberShake hazard map for the Southern California region, representing the ground motion in terms of g (acceleration due to gravity) which is exceeded with 2% probability in 50 years.

8. Application study: CyberShake

Much effort has been put into making workflows executed through Pegasus scale well. Individual workflows can have hundreds of thousands of individual tasks [19]. In this section we describe one of the large-scale scientific applications that uses Pegasus.

As part of its research program of earthquake system science, the Southern California Earthquake Center (SCEC) has developed CyberShake, a high-performance computing software platform that uses 3D waveform modeling to calculate physics-based probabilistic seismic hazard analysis (PSHA) estimates for populated areas of California. PSHA provides a technique for estimating the probability that the earthquake ground motions at a location of interest will exceed some intensity measure, such as peak ground velocity or spectral acceleration, over a given time period. PSHA estimates are useful for civic planners, building engineers, and insurance agencies. The fundamental calculation that the CyberShake software platform can perform is a PSHA hazard curve calculation. A CyberShake calculation produces a PSHA hazard curve for a single location of interest. CyberShake hazard curves from multiple locations can be combined to produce a CyberShake hazard map (see Fig. 15), quantifying the seismic hazard over a geographic region.

A CyberShake hazard curve computation can be divided into two phases. In the first phase, a 3D mesh of approximately 1.2 billion elements is constructed and populated with seismic velocity data. This mesh is then used in a pair of wave propagation simulations that calculate and output strain Green tensors (SGTs). The SGT simulations use parallel wave propagation codes and typically run on 4000 processors. In the second phase, individual contributions from over 400,000 different earthquakes are calculated using the SGTs, and these hazard contributions are then aggregated to determine the overall seismic hazard. These second-phase calculations are loosely coupled, short-running serial tasks. To produce a hazard map for Southern California, over 100 million of these tasks must be executed. The extensive heterogeneous computational requirements and large numbers of high-throughput tasks necessitate a high degree of flexibility and automation; as a result, SCEC utilizes Pegasus-WMS workflows for execution.
CyberShake is represented in Pegasus as a set of hierarchical workflows (Fig. 16). Initially, a job is run which invokes RunManager, an executable that updates a database at SCEC that keeps track of the status of hazard calculations for individual locations. The first-phase calculations, consisting of large, parallel tasks, are represented as a single workflow, and the second-phase tasks are split up into multiple workflows to reduce the memory requirements of planning. Both CyberShake computational phases are typically executed on large, remote, high-performance computing resources. Finally, Pegasus transfers the output data back to SCEC storage, where a database is populated with intensity measure information. This intensity measure database is used to construct the target seismic hazard data products, such as hazard curves or a hazard map. Lastly, RunManager is invoked again to record that the calculation is complete.

Originally, the first and second phases were described as discrete workflows for maximum flexibility, so that the two phases could be executed on different computational resources. With the development of computational systems that can support both types of calculations – large parallel tasks as well as quick, serial ones – CyberShake now includes all phases of the calculation in an outer-level workflow, with the phase one, phase two, and database workflows as sub-workflows. Therefore, all the jobs and workflows required to compute PSHA results for a single location are included in the top-level workflow.

Table 1 shows a comparison of CyberShake studies done over the past six years. The table shows how many wall clock hours of computing a particular set of workflows took, the number of geographic sites in the study, the computing systems used, the number of core hours that the computations took, and the maximum number of cores used at one time. The table also indirectly shows the clustering used: the number of tasks in the original workflow and the number of jobs generated by Pegasus. Finally, the table specifies the number of files, and their corresponding total size, that were produced during the study. In some cases, the data was not fully captured; this is indicated as N/A.

Since the CyberShake 2.2 study, the CyberShake software platform has used PMC to manage the execution of high-throughput tasks on petascale resources, and PMC's I/O forwarding feature to reduce the number of output files by a factor of 60. Using this approach, CyberShake Study 13.4 was performed in Spring 2013. It calculated PSHA results for 1144 locations in Southern California, corresponding to 1144 top-level workflows. Approximately 470 million tasks were executed on TACC's Stampede system over 739 h, averaging a throughput of 636,000 tasks per hour. PMC's clustering resulted in only about 22,000 jobs being submitted to the Stampede batch scheduler. Pegasus managed 45 TB of data, with 12 TB being staged back for archiving on SCEC's servers. The most recent CyberShake study (CySh#14.2) used both CPUs and GPUs for the calculations.

CyberShake workflows are important science calculations that must be re-run frequently with improved input parameters and data. A CyberShake study encompasses hundreds or thousands of independent workflows. Overall, these workflows could contain up to half a billion tasks, making it impractical to consolidate the workload into a single workflow. Our workflow experiences so far suggest that it would simplify the management of CyberShake studies if workflow tools supported the concept of an ensemble, or a group of workflows, that make up an application. If workflows could be executed and monitored as a group, the benefits of automated management of workflows could be extended to the entire ensemble without requiring external tools (e.g. RunManager).
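The ensemble concept discussed above can be pictured as a thin layer that submits and tracks many independent top-level workflows. The loop below is purely illustrative and uses hypothetical submit_workflow and poll_status stand-ins rather than any existing Pegasus or SCEC tool.

    import random

    def submit_workflow(site_name):
        # Hypothetical stand-in for planning and submitting one top-level workflow.
        return "wf-%s" % site_name

    def poll_status(workflow_id):
        # Hypothetical stand-in for querying the monitoring database.
        return random.choice(["running", "succeeded", "failed"])

    sites = ["USC", "PAS", "WNGC"]  # example CyberShake site names
    ensemble = {s: submit_workflow(s) for s in sites}

    status = {s: poll_status(wf) for s, wf in ensemble.items()}
    print(status)
    # A real ensemble manager would throttle submissions, retry failed
    # workflows, and aggregate statistics across the whole group.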

Fig. 16. CyberShake hierarchical workflow.

Table 1
Comparison of CyberShake studies.

Study name  Year       Geographic  Tasks      Machine            Makespan  Max       Core hours  Jobs                   Files produced  Data
                       sites       (million)                     (h)       cores     (million)                          (million)
CySh #0.5   2008       9           7.5        NCSA Mercury       528       800       0.44        22 (glideins)          7.5             90 GB
CySh #1.0   2009       223         192        TACC Ranger        1314      14,544    6.9         3952 (Ranger),         190             10.6 TB
                                                                                                  3.9 million (Condor)
CySh #1.4   2011–2012  90          112        USC HPCC,          4824      N/A       N/A         N/A                    112             4 TB
                                              TACC Ranger
CySh #2.2   2012–2013  217         N/A        USC HPCC,          2976      N/A       N/A         N/A                    N/A             N/A
                                              NICS Kraken
CySh #13.4  2013       1132        470        TACC Stampede      716       17,600    1.4         21,912                 16              12.3 TB
                                   N/A        NCSA Blue Waters             128,000   10.8        N/A                    2264            43 TB
CySh #14.2  2014       1144        99.9       NCSA Blue Waters   342       295,040   15.8        29,796 (Blue Waters),  16              57 TB
                                                                                                  100,707 (Condor)

9. Related work

Many different workflow systems have been developed for executing scientific workflows on distributed infrastructures [60,80]. [68,80] provide an extensive overview of many of the most influential workflow management systems and their characteristics. Here, we mention some of them.

Taverna [81] is a suite of tools used to design and execute scientific workflows. A Taverna workflow specification is compiled into a multi-threaded object model, where processors are represented by objects, and data transfers from the output port of one processor to the input port of a downstream processor are realized using local method invocations between processor objects. One or more activities are associated with each processor. These activities may consist of entire sub-workflows, in addition to executable software components. Galaxy [82] is an open, web-based platform for data-intensive bioinformatics research. It provides a number of built-in tools, including common visualization tools, and supports access to a number of bioinformatics datasets. The system automatically tracks and manages data provenance and provides support for capturing the context and intent of computational methods. Tavaxy [83] is a pattern-based workflow system for the bioinformatics domain. The system integrates existing Taverna and Galaxy workflows in a single environment, and supports the use of cloud computing capabilities. These systems target a specific user community and do not focus on issues of portability, performance, and scalability as Pegasus does.

DIET [84] is a workflow system that can manage workflow-based DAGs in two modes: one in which it defines a complete scheduling of the workflow (ordering and mapping), and one in which it defines only an ordering for the workflow execution. Mapping is then done in the next step by the client, using a master agent to find the server where the workflow services should be run. Askalon [85] provides a suite of middleware services that support the execution of scientific workflows on distributed infrastructures. Askalon workflows are written in an XML-based workflow language that supports many different looping and branching structures. These workflows are interpreted to coordinate data flow and task execution on distributed infrastructures including grids, clusters and clouds.
Moteur [86] is a workflow engine that uses a built-in language to describe workflows that target the coherent integration of: (1) a data-driven approach to achieve transparent parallelism; (2) array manipulation to enable data-parallel applications in an expressive and compact framework; (3) conditional and loop control structures to improve expressiveness; and (4) asynchronous execution to optimize execution on a distributed infrastructure. These systems provide conditional and loop functionalities that are not provided by Pegasus; however, the complexity of their workflow languages can limit scalability and adoption.

Kepler [87] is a graphical system for designing, executing, reusing, evolving, archiving, and sharing scientific workflows. In Kepler, workflows are created by connecting a series of workflow components called Actors, through which data are processed. Each Actor has several Ports through which input and output Tokens containing data and data references are sent and received. Each workflow has a Director that determines the model of computation used by the workflow, and Kepler supports several Directors with different execution semantics, including Synchronous Data Flow and Process Network directors. Triana [88] is a graphical problem solving environment that enables users to develop and execute data flow workflows. It interfaces with a variety of different middleware systems for data transfer and job execution, including grids and web services. Pegasus does not provide a graphical interface; instead it provides flexible APIs that support complex workflow descriptions. It also provides resource-independent workflow definitions.

The Nimrod toolkit [89,90] is a specialized parametric modeling system that provides a declarative parametric modeling language to express parametric experiments, and automates the formulation, execution, monitoring, and verification of the results from multiple individual experiments using many-task computing (MTC). Nimrod/K [91] is a workflow engine for the Nimrod toolkit built on the Kepler workflow engine. The engine has special Kepler actors that enable enumeration, fractional factorial design, and optimization methods through an API. Although the toolkit is flexible and workflows can be defined and controlled in a programmatic way, it does not provide task clustering, nor all the sophisticated data layouts and the portability that Pegasus supports.

Makeflow [92] represents workflows using a file format very similar to the one used by the Make tool [58]. In Makeflow the rules specify what output data is generated when a workflow component is executed with a given set of input data and parameters. Like Make and Pegasus, Makeflow is able to reduce the workflow if output data already exists. It also maintains a transaction log to recover from failures. Makeflow supports many different execution environments, including local execution, HTCondor, SGE, Hadoop and others. Makeflow, unlike Pegasus, provides only simple shared-filesystem data management capabilities.

Wings [23] is a workflow system that uses semantic reasoning to generate Pegasus workflows based on ontologies that describe workflow templates, data types, and components.

Among these workflow systems, Pegasus distinguishes itself with its combination of features: portability across a wide range of infrastructures, scalability (hundreds of millions of tasks), sophisticated data management capabilities, comprehensive monitoring, and complex workflow restructuring.

In addition to workflow systems that target HPC and HTC resources, some systems, such as Tez [93] and Oozie [94], are designed for MapReduce types of models and environments.
10. Conclusions and future directions

To date, Pegasus has been used in a number of scientific domains. We enable SCEC to scale up to almost a million tasks per workflow and provide the bridge to run those workflows on national high-end computing resources. We provide LIGO with technologies to analyze large amounts of gravitational wave data; as a result, LIGO passed a major scientific milestone, the blind injection test [95,96]. We supported astronomers in processing Kepler mission datasets [97,98]. We helped individual researchers advance their science [99–101]. In the coming years, we hope to help automate more of the scientific process, potentially exploring automation from data collection to the final analysis and result sharing.

Pegasus is an ever-evolving system. As scientists develop new, more complex workflows, and as the computational infrastructure changes, Pegasus is challenged to handle larger workflows, manage data more efficiently, recover from failures better, support more platforms, report more statistics, and automate more work. We are currently working on a service that will help users manage ensembles of workflows more efficiently and make it easier to integrate Pegasus into other user-facing infrastructures, such as science gateways. We are also investigating many new data management techniques, including techniques to help Pegasus track and discover data at runtime, support disk usage guarantees in storage-constrained environments, and improve data locality on clusters. Finally, we are improving support for new platforms such as clouds and exascale machines by developing task resource estimation and provisioning capabilities [63], and improving the efficiency and scalability of remote workflow execution.

Acknowledgments [18] G. Singh, K. Vahi, A. Ramakrishnan, G. Mehta, E. Deelman, H. Zhao, R. Sakellariou, K. Blackburn, D. Brown, S. Fairhurst, D. Meyers, G.B. Berriman, J. Good, D.S. Katz, Optimizing workflow data footprint, Sci. Program. 15 (4) Pegasus is funded by The National Science Foundation under (2007) 249–268. the ACI SDCI program grant #0722019 and ACI SI2-SSI program [19] S. Callaghan, E. Deelman, D. Gunter, G. Juve, P. Maechling, C. Brooks, K. Vahi, grant #1148515. Pegasus has been in development since 2001 and K. Milner, R. Graves, E. Field, D. Okaya, T. Jordan, Scaling up workflow-based applications, J. Comput. System Sci. 76 (6) (2010) 428–446. http://dx.doi.org/ has benefited greatly from the expertise and efforts of people who 10.1016/j.jcss.2009.11.005. worked on it over the years. We would like to especially thank [20] I. Foster, C. Kesselman (Eds.), The Grid: Blueprint for a New Computing Gaurang Mehta, Mei-Hui Su, Jens-S. Vöckler, Fabio Silva, Gurmeet Infrastructure, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1999. Singh, Prasanth Thomas and Arun Ramakrishnan for their efforts [21] M. Armbrust, A. Fox, R. Griffith, A.D. Joseph, R. Katz, A. Konwinski, G. Lee, D. and contributions to Pegasus. We would also like to extend our Patterson, A. Rabkin, I. Stoica, M. Zaharia, A view of cloud computing, Com- mun. ACM 53 (4) (2010) 50–58. http://dx.doi.org/10.1145/1721654.1721672. gratitude to all the members of our user community who have used [22] M. McLennan, S. Clark, F. McKenna, E. Deelman, M. Rynge, K. Vahi, D. Kearney, Pegasus over the years and provided valuable feedback, especially C. Song, Bringing scientific workflow to the masses via pegasus and hubzero, Duncan Brown, Scott Koranda, Kent Blackburn, Yu Huang, Nirav in: International Workshop on Science Gateways, 2013. [23] Y. Gil, V. Ratnakar, J. Kim, P.A. González-Calero, P.T. Groth, J. Moody, Merchant, Jonathan Livny, and Bruce Berriman. E. Deelman, Wings: intelligent workflow-based design of computational This work used the Extreme Science and Engineering Discovery experiments, IEEE Intell. Syst. 26 (1) (2011) 62–72. [24] S. Marru, L. Gunathilake, C. Herath, P. Tangchaisin, M. Pierce, C. Mattmann, Environment (XSEDE), which is supported by National Science R. Singh, T. Gunarathne, E. Chinthaka, R. Gardler, A. Slominski, A. Douma, S. Foundation grant number OCI-1053575. Perera, S. Weerawarana, Apache airavata: a framework for distributed appli- This research was done using resources provided by the cations and computational workflows, in: Workshop on Gateway Computing Environments, GCE’11, 2011. http://dx.doi.org/10.1145/2110486.2110490. Open Science Grid, which is supported by the National Science [25] Steven Cox, GRAYSON Git/README, URL: https://github.com/stevencox/ Foundation and the US Department of Energy’s Office of Science. grayson. The Cybershake workflows research was supported by the [26] D. Thain, T. Tannenbaum, M. Livny, Distributed computing in practice: the condor experience, Concurr. Comput.: Pract. Exper. 17 (2–4) (2005) 323–356. Southern California Earthquake Center. SCEC is funded by NSF Co- http://dx.doi.org/10.1002/cpe.v17:2/4. operative Agreement EAR-1033462 and USGS Cooperative Agree- [27] I. Foster, C. Kesselman, S. Tuecke, The anatomy of the grid: enabling scalable ment G12AC20038. The SCEC contribution number for this paper virtual organizations, Int. J. High Perform. Comput. Appl. 15 (3) (2001) 200–222. http://dx.doi.org/10.1177/109434200101500302. is 1911. 
[28] Amazon.com, Inc., Elastic Compute Cloud (EC2), URL: http://aws.amazon. com/ec2. [29] Pegasus 4.3 user guide, URL: http://pegasus.isi.edu/wms/docs/4.3/. References [30] R. Graves, T. Jordan, S. Callaghan, E. Deelman, E. Field, G. Juve, C. Kesselman, P. Maechling, G. Mehta, K. Milner, D. Okaya, P. Small, K. Vahi, [1] LIGO Scientific Collaboration. URL: http://ligo.org. Cybershake: a physics-based seismic hazard model for southern California, [2] Southern California Earthquake Center. URL: http://scec.org. Pure Appl. Geophys. 168 (3–4) (2011) 367–381. http://dx.doi.org/10.1007/ [3] National Virtual Observatory. URL: http://us-vo.org. s00024-010-0161-6. [4] E. Deelman, J. Blythe, Y. Gil, C. Kesselman, Workflow management in GriPhyN, [31] Sax xerces Java parser. URL: http://xerces.apache.org/xerces2-j/. in: J. Nabrzyski, J.M. Schopf, J. Weglarz (Eds.), Grid Resource Management, [32] K. Czajkowski, I.T. Foster, N.T. Karonis, C. Kesselman, S. Martin, W. Smith, S. Kluwer Academic Publishers, Norwell, MA, USA, 2004, pp. 99–116. Tuecke, A resource management architecture for metacomputing systems, [5] V. Nefedova, R. Jacob, I. Foster, Z. Liu, Y. Liu, E. Deelman, G. Mehta, M.- in: Proceedings of the Workshop on Job Scheduling Strategies for Parallel H. Su, K. Vahi, Automating climate science: Large ensemble simulations Processing, 1998. on the teragrid with the griphyn virtual data system, in: 2nd IEEE [33] P. Andreetto, S.A. Borgia, A. Dorigo, A. Gianelle, M. Marzolla, M. Mordacchini, International Conference on e-Science and , E-SCIENCE 06, M. Sgaravatto, S. Andreozzi, M. Cecchi, V. Ciaschini, T. Ferrari, F. Giacomini, 2006. http://dx.doi.org/doi:10.1109/E-SCIENCE.2006.30. R. Lops, E. Ronchieri, G. Fiorentino, V. Martelli, M. Mezzadri, E. Molinari, [6] E. Deelman, C. Kesselman, G. Mehta, L. Meshkat, L. Pearlman, K. Blackburn, F. Prelz, CREAM: a simple, grid-accessible, job management system for P. Ehrens, A. Lazzarini, R. Williams, S. Koranda, GriPhyN and LIGO, building a local computational resources, in: Conference on Computing in High Energy virtual data grid for gravitational wave scientists, in: 11th IEEE International Physics, CHEP, 2006. Symposium on High Performance Distributed Computing, HPDC 02, 2002. [34] Simple utility for resource management. URL: http://slurm.schedmd. com/. [7] D. Gunter, E. Deelman, T. Samak, C. Brooks, M. Goode, G. Juve, G. Mehta, [35] A. Bayucan, R.L. Henderson, C. Lesiak, B. Mann, T. Proett, D. Tweten, Portable P. Moraes, F. Silva, M. Swany, K. Vahi, Online workflow management and batch system: external reference specification, in: Technical Report, 5, MRJ performance analysis with stampede, in: Network and Service Management, Technology Solutions, 1999. CNSM, 2011 7th International Conference on, 2011, pp. 1–10. [36] IBM platform computing template: LSF. URL: http://www.platform.com/ [8] S. Miles, P. Groth, E. Deelman, K. Vahi, G. Mehta, L. Moreau, Provenance: the Products/platform-lsf. bridge between experiments and data, Comput. Sci. Eng. 10 (3) (2008) 38–46. [37] . URL: http://www.oracle.com/us/products/tools/oracle- http://dx.doi.org/10.1109/MCSE.2008.82. grid-engine-075549.html. [9] D.J. Kuck, R.H. Kuhn, D.A. Padua, B. Leasure, M. Wolfe, Dependence graphs [38] D. Weitzel, I. Sfiligoi, B. Bockelman, F. Wuerthwein, D. Fraser, D. Swanson, and compiler optimizations, in: 8th ACM SIGPLAN-SIGACT Symposium on Accessing opportunistic resources with bosco, in: Computing in High Energy Principles of Programming Languages, POPL 81, 1981. http://dx.doi.org/10. 
and Nuclear Physics, 2013. 1145/567532.567555. [39] Extreme science and engineering discovery environment (XSEDE). URL: [10] A. Gerasoulis, T. Yang, On the granularity and clustering of directed acyclic http://www.xsede.org. task graphs, IEEE Trans. Parallel Distrib. Syst. 4 (6) (1993) 686–701. http://dx. [40] European grid infrastructure (EGI). URL: http://www.egi.eu. doi.org/10.1109/71.242154. [41] R. Pordes, D. Petravick, B. Kramer, D. Olson, M. Livny, A. Roy, P. Avery, K. [11] Y.-K. Kwok, I. Ahmad, Static scheduling algorithms for allocating directed Blackburn, T. Wenaus, F. Würthwein, et al., The open science grid, J. Phys. task graphs to multiprocessors, ACM Comput. Surv. 31 (4) (1999) 406–471. Conf. Ser. 78 (2007) 012057. http://dx.doi.org/10.1145/344588.344618. [42] I. Sfiligoi, glideinWMS—a generic pilot-based workload management system, [12] J. Kurzak, J. Dongarra, Fully dynamic scheduler for numerical computing on J. Phys. Conf. Ser. 119 (6) (2008) 062044. URL: http://stacks.iop.org/1742- multicore processors, LAPACK Working Note 220, 2009. 6596/119/i=6/a=062044. [13] J. Gray, A. Reuter, Transaction Processing: Concepts and Techniques, Morgan [43] Amazon.com, Inc., Amazon Web Services (AWS), URL: http://aws.amazon. Kaufmann Pub., 1993. com. [14] P.A. Bernstein, N. Goodman, E. Wong, C.L. Reeve, J.B. Rothnie Jr., Query pro- [44] Futuregrid. URL: https://www.futuregrid.org/. cessing in a system for distributed databases (SDD-1), ACM Trans. Database [45] Openstack. URL: https://www.openstack.org/. Syst. 6 (4) (1981) 602–625. http://dx.doi.org/10.1145/319628.319650. [46] K. Vahi, M. Rynge, G. Juve, R. Mayani, E. Deelman, Rethinking data manage- [15] D. West, Introduction to Graph Theory, Prentice Hall, Upper Saddle River, NJ, ment for big data scientific workflows, in: IEEE International Conference on 2001. Big Data, 2013. http://dx.doi.org/10.1109/BigData.2013.6691724. [16] R. Tarjan, Data Structures and Network Algorithms, Society for industrial and [47] M. Rynge, S. Callaghan, E. Deelman, G. Juve, G. Mehta, K. Vahi, P.J. Maechling, Applied Mathematics, 1998. Enabling large-scale scientific workflows on petascale resources using MPI [17] A. Ramakrishnan, G. Singh, H. Zhao, E. Deelman, R. Sakellariou, K. Vahi, K. master/worker, in: 1st Conference of the Extreme Science and Engineering Blackburn, D. Meyers, M. Samidi, Scheduling data-intensive workflows onto Discovery Environment: Bridging from the eXtreme to the Campus and storage-constrained distributed resources, in: IEEE International Symposium Beyond, XSEDE’12, 2012, pp. 49:1–49:8. http://dx.doi.org/10.1145/2335755. on Cluster Computing and the Grid, CCGrid, 2007. http://dx.doi.org/10.1109/ 2335846. CCGRID.2007.101. [48] Kraken. URL: http://www.nics.tennessee.edu/computing-resources/kraken. 34 E. Deelman et al. / Future Generation Computer Systems 46 (2015) 17–35

[49] Blue waters: sustained petascale computing. URL: https://bluewaters.ncsa. [75] K. Vahi, I. Harvey, T. Samak, D. Gunter, K. Evans, D. Rogers, I. Taylor, M. Goode, illinois.edu/. F. Silva, E. Al-Shakarchi, G. Mehta, E. Deelman, A. Jones, A case study into [50] M. Rynge, G. Juve, G. Mehta, E. Deelman, G. Berriman, K. Larson, B. Holzman, using common real-time workflow monitoring infrastructure for scientific S. Callaghan, I. Sfiligoi, F. Wurthwein, Experiences using glideinWMS and the workflows, J. Grid Comput. 11 (3) (2013) 381–406. http://dx.doi.org/10.1007/ corral frontend across cyberinfrastructures, in: E-Science, e-Science, 2011 s10723-013-9265-4. IEEE 7th International Conference on, 2011, pp. 311–318. http://dx.doi.org/ [76] S. Miles, E. Deelman, P. Groth, K. Vahi, G. Mehta, L. Moreau, Connecting 10.1109/eScience.2011.50. scientific data to scientific experiments with provenance, in: e-Science and [51] A. Rajasekar, R. Moore, C.-Y. Hou, C.A. Lee, R. Marciano, A. de Torcy, M. Wan, Grid Computing, IEEE International Conference on, 2007, pp. 179–186. http:// W. Schroeder, S.-Y. Chen, L. Gilbert, P. Tooby, B. Zhu, iRODS primer: inte- dx.doi.org/10.1109/E-SCIENCE.2007.22. grated rule-oriented data system, Synth. Lect. Inform. Concepts Retr. Serv. 2 [77] L. Moreau, B. Clifford, J. Freire, J. Futrelle, Y. Gil, P.T. Groth, N. Kwasnikowska, (1) (2010) 1–143. http://dx.doi.org/10.2200/S00233ED1V01Y200912ICR012. S. Miles, P. Missier, J. Myers, B. Plale, Y. Simmhan, E.G. Stephan, J.V. den URL: http://dblp.uni-trier.de/rec/bibtex/series/synthesis/2010Rajasekar. Bussche, The open provenance model core specification (v1.1), 2011, pp. xml. 743–756. [52] Amazon simple storage service. URL: http://aws.amazon.com/s3/. [78] A. Pavlo, P. Couvares, R. Gietzel, A. Karp, I.D. Alderman, M. Livny, C. Bacon, [53] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, T. Berners-Lee, The NMI build & test laboratory: continuous integration framework for Hypertext transfer protocol—http/1.1, 1999. distributed computing software, in: 20th Conference on Large Installation [54] W. Allcock, J. Bresnahan, R. Kettimuthu, M. Link, C. Dumitrescu, I. Raicu, I. System Administration, LISA, 2006. [79] Atlassian, Continuous integration & build server—bamboo. URL: https:// Foster, The globus striped GridFTP framework and server, in: SC’05, IEEE www.atlassian.com/software/bamboo. Computer Society, Washington, DC, USA, 2005, p. 54. http://dx.doi.org/10. [80] E. Deelman, D. Gannon, M. Shields, I. Taylor, Workflows and e-science: 1109/SC.2005.72. URL: http://dx.doi.org/10.1109/SC.2005.72. an overview of workflow system features and capabilities, Future Gener. [55] A. Lana, B. Paolo, Storage resource manager version 2.2: design, implementa- Comput. Syst. 25 (5) (2009) 528–540. http://dx.doi.org/10.1016/j.future. tion, and testing experience, in: Proceedings of International Conference on 2008.06.012. Computing in High Energy and Nuclear Physics, CHEP 07. [81] K. Wolstencroft, R. Haines, D. Fellows, A. Williams, D. Withers, S. Owen, S. [56] T. Ylonen, C. Lonvick, Rfc 4254—the secure shell (SSH) connection protocol, Soiland-Reyes, I. Dunlop, A. Nenadic, P. Fisher, J. Bhagat, K. Belhajjame, F. 2006. Bacall, A. Hardisty, A. Nieva de la Hidalga, M.P. Balcazar Vargas, S. Sufi, C. [57] J.-S. Vöckler, G. Mehta, Y. Zhao, E. Deelman, M. 
Wilde, Kickstarting Goble, The taverna workflow suite: designing and executing workflows of remote applications, in: 2nd International Workshop on Grid Computing Web Services on the desktop, web or in the cloud, Nucleic Acids Res. 41 (W1) Environments, 2006. (2013) W557–W561. http://dx.doi.org/10.1093/nar/gkt328. [58] Gnu make. URL: http://www.gnu.org/software/make. [82] J. Goecks, A. Nekrutenko, J. Taylor, et al., Galaxy: a comprehensive approach [59] H. Topcuoglu, S. Hariri, M.-Y. Wu, Performance-effective and low-complexity for supporting accessible, reproducible, and transparent computational task scheduling for heterogeneous computing, IEEE Trans. Parallel Distrib. research in the life sciences, Genome Biol. 11 (8) (2010). Syst. 13 (3) (2002) 260–274. http://dx.doi.org/10.1109/71.993206. [83] M. Abouelhoda, S. Issa, M. Ghanem, Tavaxy: integrating taverna and galaxy [60] I. Taylor, E. Deelman, D. Gannon, M. Shields (Eds.), Workflows for e-Science, workflows with cloud computing support, BMC Bioinformatics 13 (1) (2012) Springer, London, 2007, http://dx.doi.org/10.1007/978-1-84628-757-2_4. 77. http://dx.doi.org/10.1186/1471-2105-13-77. [61] W. Chen, E. Deelman, Workflow overhead analysis and optimizations, in: 6th [84] E. Caron, F. Desprez, F. Lombard, J.-M. Nicod, L. Philippe, M. Quinson, F. Workshop on Workflows in Support of Large-scale Science, WORKS’11, 2011. Suter, A scalable approach to network enabled servers (research note), http://dx.doi.org/10.1145/2110497.2110500. in: Proceedings of the 8th International Euro-Par Conference on Parallel [62] G. Juve, A. Chervenak, E. Deelman, S. Bharathi, G. Mehta, K. Vahi, Processing, 2002. Characterizing and profiling scientific workflows, Future Gener. Comput. [85] T. Fahringer, R. Prodan, R. Duan, F. Nerieri, S. Podlipnig, J. Qin, M. Syst. 29 (3) (2013) 682–692. http://dx.doi.org/10.1016/j.future.2012.08.015. Siddiqui, H.-L. Truong, A. Villazon, M. Wieczorek, Askalon: a grid application [63] R. Ferreira da Silva, G. Juve, E. Deelman, T. Glatard, F. Desprez, D. Thain, B. development and computing environment, in: Proceedings of the 6th Tovar, M. Livny, Toward fine-grained online task characteristics estimation IEEE/ACM International Workshop on Grid Computing, 2005. http://dx.doi. in scientific workflows, in: Proceedings of the 8th Workshop on Workflows org/10.1109/GRID.2005.1542733. in Support of Large-Scale Science, WORKS’13, 2013, pp. 58–67. http://dx.doi. [86] T. Glatard, J. Montagnat, D. Lingrand, X. Pennec, Flexible and efficient org/10.1145/2534248.2534254. workflow deployment of data-intensive applications on grids with MOTEUR, [64] G. Singh, M.-H. Su, K. Vahi, E. Deelman, B. Berriman, J. Good, D.S. Katz, G. Int. J. High Perform. Comput. Appl. (IJHPCA) 22 (3) (2008) 347–360. Mehta, Workflow task clustering for best effort systems with pegasus, in: [87] I. Altintas, C. Berkley, E. Jaeger, M. Jones, B. Ludascher, S. Mock, Kepler: an ex- 15th ACM Mardi Gras Conference, 2008. http://dx.doi.org/10.1145/1341811. tensible system for design and execution of scientific workflows, in: Scientific 1341822. and Statistical Database Management, 2004. Proceedings. 16th International Conference on, 2004. http://dx.doi.org/10.1109/SSDM.2004.1311241. [65] W. Chen, R. Ferreira da Silva, E. Deelman, R. Sakellariou, Balanced task [88] I. Taylor, M. Shields, I. Wang, A. Harrison, The Triana workflow environment: clustering in scientific workflows, in: 2013 IEEE 9th International Conference architecture and applications, in: I. Taylor, E. Deelman, D. 
Gannon, M. Shields on eScience, eScience’13, 2013, pp. 188–195. http://dx.doi.org/10.1109/ (Eds.), Workflows for e-Science, Springer, London, 2007, pp. 320–339. eScience.2013.40. http://dx.doi.org/10.1007/978-1-84628-757-2_20. [66] W. Chen, R. Ferreira da Silva, E. Deelman, R. Sakellariou, Using imbalance [89] D. Abramson, J. Giddy, L. Kotler, High performance parametric modeling metrics to optimize task clustering in scientific workflow executions, Future with nimrod/G: Killer application for the global grid? in: Parallel and Gener. Comput. Syst. 46 (2015) 69–84. Distributed Processing Symposium, International, IEEE Computer Society, [67] A. Chervenak, R. Schuler, C. Kesselman, S. Koranda, B. Moe, Wide area 2000, 520–520. data replication for scientific collaborations, in: Proceedings of the 6th [90] D. Abramson, B. Bethwaite, C. Enticott, S. Garic, T. Peachey, Parameter IEEE/ACM International Workshop on Grid Computing, GRID’05, 2005, pp. exploration in science and engineering using many-task computing, IEEE 1–8. http://dx.doi.org/10.1109/GRID.2005.1542717. Trans. Parallel Distrib. Syst. Many-Task Comput. 22 (2011) 960–973. Special [68] J. Yu, R. Buyya, A taxonomy of workflow management systems for grid issue of. computing, J. Grid Comput. 3 (3–4) (2005) 171–200. http://dx.doi.org/10. [91] D. Abramson, C. Enticott, I. Altinas, Nimrod/K: towards massively parallel 1007/s10723-005-9010-8. dynamic grid workflows, in: Proceedings of the 2008 ACM/IEEE Conference [69] J.C. Jacob, D.S. Katz, G.B. Berriman, J.C. Good, A.C. Laity, E. Deelman, C. Kessel- on Supercomputing, IEEE Press, 2008, p. 24. man, G. Singh, M. Su, T.A. Prince, R. Williams, Montage: a grid portal and soft- [92] M. Albrecht, P. Donnelly, P. Bui, D. Thain, Makeflow: a portable abstraction ware toolkit for science-grade astronomical image mosaicking, Int. J. Comput. for data intensive computing on clusters, clouds, and grids, in: Proceedings Sci. Eng. 4 (2) (2009) 73–87. http://dx.doi.org/10.1504/IJCSE.2009.026999. of the 1st ACM SIGMOD Workshop on Scalable Workflow Execution Engines [70] S. Babak, R. Biswas, P. Brady, D. Brown, K. Cannon, et al., Searching for and Technologies, SWEET, 2012. gravitational waves from binary coalescence, Phys. Rev. D 87 (2) (2013) [93] Apache Tez, http://tez.apache.org. URL: http://tez.apache.org. 024033. http://dx.doi.org/10.1103/PhysRevD.87.024033. [94] Apache Ozzie, http://oozie.apache.org. URL: http://oozie.apache.org. [71] A. Ramakrishnan, G. Singh, H. Zhao, E. Deelman, R. Sakellariou, K. Vahi, [95] D. Brown, E. Deelman, Looking for gravitational waves: A computing K. Blackburn, D. Meyers, M. Samidi, Scheduling data-intensive workflows perspective, in: International Science Grid This Week, 2011. onto storage-constrained distributed resources, in: 7th IEEE International [96] L.S. Collaboration, ‘‘blind injection’’ stress-tests ligo and virgo’s search for Symposium on Cluster Computing and the Grid, 2007. http://dx.doi.org/10. gravitational waves! 2011. URL: http://www.ligo.org/news.php. [97] The kepler mission, 2011. URL: http://kepler.nasa.gov. 1109/CCGRID.2007.101. [98] G.B. Berriman, E. Deelman, G. Juve, M. Regelson, P. Plavchan, The application [72] Y. Huang, S. Service, V. Ramensky, C. Schmitt, N. Tran, A. Jasinska, J. of cloud computing to astronomy: a study of cost and performance, in: Wasserscheid, N. Juretic, B. Pasaniuc, R. Wilson, W. Warren, G. Weinstock, K. Proceedings of the e-Science in Astronomy Conference, Brisbane, Australia, Dewar, N. 
Freimer, A varying-depth sequencing strategy and variant-calling 2010. framework for complex pedigrees (in preparation). [99] J. Livny, H. Teonadi, M. Livny, M.K. Waldor, High-throughput, kingdom-wide [73] GridLab Resource Management System. URL: http://www.gridlab.org/ prediction and annotation of bacterial non-coding RNAs, PLoS One 3 (2008) WorkPackages/wp-9. e3197. Yu-Vervet_Monkeys. [74] M. Rynge, G. Juve, J. Kinney, J. Good, B. Berriman, A. Merrihew, E. Deelman, [100] V. Marx, UCLA team sequences cell line, puts open source software Producing an infrared multiwavelength galactic plane atlas using montage, framework into production newsletter: bioinform, 2010. URL: http://www. pegasus and Amazon Web services, in: 23rd Annual Astronomical Data genomeweb.com/print/933003. Analysis Software and Systems, ADASS, Conference, 2013. [101] Seqware, 2010. URL: http://sourceforge.net/apps/mediawiki/seqware. E. Deelman et al. / Future Generation Computer Systems 46 (2015) 17–35 35

Ewa Deelman is a Research Associate Professor at the USC and real-time earthquake monitoring system. His commercial software develop- Computer Science Department and a Assistant Division ment experience includes firmware development for two-way mobile radios at a Director at the USC Information Sciences Institute. Her commercial radio manufacturer, software design and development for Hughes Air- research interests include the design and exploration of craft and TRW on military command and control systems, and server-side software collaborative, distributed scientific environments, with development on business and healthcare e-commerce web sites. particular emphasis on workflow management as well as the management of large amounts of data and metadata. At ISI, she is leading the Pegasus project, which designs and implements workflow mapping techniques for large-scale Rajiv Mayani is a Senior Programmer Analyst at USC applications running in distributed environments. Pegasus Information Sciences Institute. He primarily works on is being used today in a number of scientific disciplines, National Institutes of Mental Health (NIMH) funded enabling researches to formulate complex computations in a declarative way. She projects such as NIMH Repository and Genomics Resource received her Ph.D. in Computer Science from the Rensselaer Polytechnic Institute in (NRGR). At NRGR, he develops systems to simplify 1997. Her thesis topic was in the area of parallel discrete event simulation, where processes such as data access requests, and automated she applied parallel programming techniques to the simulation of the spread of quality control of data-sets by using technology solutions. Lyme disease in nature. He also works on the Pegasus project where he has developed monitoring, and statistics reporting solutions. He received his Master’s in Computer Science from Karan Vahi is a Computer Scientist at USC Information Sci- University of Southern California in 2010. ences Institute and has been associated with the Pegasus project since its inception, first as a Graduate Research As- sistant and then as a full time lead programmer. He leads the development for Pegasus and main research interests Weiwei Chen is a Ph.D. candidate in the Dept. of Computer are workflow systems, data management and distributed Science, University of Southern California, USA. In 2009, he computing. He has a B.E. in Computer Engineering from completed his bachelor on adaptive query processing in Thapar University, India and a M.S. in Computer Science the Dept. of Automation, Tsinghua University, China and from University of Southern California. he received his Master’s degree in Computer Science from University of Southern California in 2012. He is currently working on task clustering in scientific workflows. His research interests include distributed computing, services computing and data analysis. Gideon Juve is a Computer Scientist at the USC Informa- tion Sciences Institute. He earned B.S., M.S., and Ph.D. de- grees in Computer Science from the University of Southern California. His research focuses on management and opti- mization techniques for scientific workflows on clusters, grids, and clouds. Rafael Ferreira da Silva is a Computer Scientist in the Col- laborative Computing Group at the USC Information Sci- ences Institute. He received his Ph.D. in Computer Science from INSA-Lyon, France, in 2013. In 2010, he received his Master’s degree in Computer Science from Universidade Federal de Campina Grande, Brazil, and his B.S. 
degree in Computer Science from Universidade Federal da Paraiba, in 2007. His research focuses on the execution of scientific Mats Rynge is a computer scientist at the University workflows on heterogeneous distributed systems such as of Southern California’s Information Sciences Institute. clouds and grids. See http://www.rafaelsilva.com for fur- His research interests include distributed and high- ther information. performance computing. He has a B.S. in Computer Science from University of California, Los Angeles. Contact him at [email protected]. Miron Livny received a B.Sc. degree in Physics and Mathe- matics in 1975 from the Hebrew University and M.Sc. and Ph.D. degrees in Computer Science from the Weizmann Institute of Science in 1978 and 1984, respectively. Since 1983 he has been on the Computer Sciences Department faculty at the University of Wisconsin-Madison, where he is currently a Professor of Computer Sciences, the direc- Scott Callaghan is a research programmer at the Southern tor of the Center for High Throughput Computing and is California Earthquake Center (SCEC) and the technical lead leading the Condor project. His research focuses on dis- on the CyberShake computational platform. He received tributed processing and data management systems and his B.S. in Computer Science and his M.S. in Computer Sci- data visualization environments. His recent work includes ence (High Performance Computing and Simulations) from the widely-used Condor distributed resource management the University of Southern California in 2004 and 2007, system, the DEVise data visualization and exploration environment, and the BMRB respectively. He optimizes seismic hazard analysis codes repository for data from NMR spectroscopy. and integrates community software from domain scien- tists into scientific workflows. Additionally, he executes large-scale parallel and high-throughput seismic hazard workflows on high performance computing infrastructure. Kent Wenger received his Bachelor of Science degree in Computer Sciences from the University of Wisconsin- Madison in 1990 after also studying at the University Philip J. Maechling. As the Southern California Earthquake of Wisconsin-Green Bay and Northwestern University. Center (SCEC) Information Technology Architect, he devel- After graduating, he did programming for analytical ops seismic hazard software that runs efficiently on high chemistry instrumentation at Extrel FTMS in Madison. In performance computers, and open-source seismological 1996, he moved to the UW-Madison computer sciences and engineering software used in ground motion model- department, and since 2002, he has worked on the ing and simulations. Before joining SCEC, he was a member HTCondor project, where he is now the main developer of of the Professional Staff at the Caltech Seismological Labo- the DAGMan (Directed Acyclic Graph Manager) workflow ratory in Pasadena, California. While at Caltech, he worked execution engine. He is currently working with the on the TriNet project with scientists and software develop- Pegasus team at the Information Sciences Institute to more ers developing a state-of-the-art seismic data acquisition closely integrate DAGMan with the Pegasus (Planning for Execution in Grids) workflow mapping engine.