A Comprehensive Perspective on Pilot-Job Systems
Total Page:16
File Type:pdf, Size:1020Kb
A Comprehensive Perspective on Pilot-Job Systems ∗ Matteo Turilli Mark Santcroos Shantenu Jha RADICAL Laboratory, ECE RADICAL Laboratory, ECE RADICAL Laboratory, ECE Rutgers University Rutgers University Rutgers University New Brunswick, NJ, USA New Brunswick, NJ, USA New Brunswick, NJ, USA [email protected] [email protected] [email protected] Pilot-Jobs provide a multi-stage mechanism to execute ABSTRACT workloads. Resources are acquired via a placeholder job and subsequently assigned to workloads. Pilot-Jobs are having Pilot-Job systems play an important role in supporting dis- a high impact on scientific and distributed computing [1]. tributed scientific computing. They are used to consume more They are used to consume more than 700 million CPU hours than 700 million CPU hours a year by the Open Science Grid a year [2] by the Open Science Grid (OSG) [3, 4] communi- communities, and by processing up to 1 million jobs a day for ties, and process up to 1 million jobs a day [5] for the ATLAS the ATLAS experiment on the Worldwide LHC Computing experiment [6] on theLarge Hadron Collider (LHC) [7] Com- Grid. With the increasing importance of task-level paral- puting Grid (WLCG) [8, 9]. A variety of Pilot-Job systems lelism in high-performance computing, Pilot-Job systems are used on distributed computing infrastructures (DCI): are also witnessing an adoption beyond traditional domains. Glidein/GlideinWMS [10, 11], the Coaster System [12], DI- Notwithstanding the growing impact on scientific research, ANE [13], DIRAC [14], PanDA [15], GWPilot [16], Nim- there is no agreement upon a definition of Pilot-Job system rod/G [17], Falkon [18], MyCluster [19] to name a few. and no clear understanding of the underlying abstraction A reason for the success and proliferation of Pilot-Job and paradigm. Pilot-Job implementations have proliferated systems is that they provide a simple solution to the with no shared best practices or open interfaces and little rigid resource management model historically found in high- interoperability. Ultimately, this is hindering the realization performance and distributed computing. Pilot-Jobs break of the full impact of Pilot-Jobs by limiting their robustness, free of this model in two ways: (i) by using late binding portability, and maintainability. This paper offers a com- to make the selection of resources easier and more effec- prehensive analysis of Pilot-Job systems critically assessing tive [20{22]; and (ii) by decoupling the workload specifica- their motivations, evolution, properties, and implementation. tion from the management of its execution. Late binding The three main contributions of this paper are: (i) an anal- results in the ability to utilize resources dynamically, i.e., ysis of the motivations and evolution of Pilot-Job systems; the workload is distributed onto resources only when they (ii) an outline of the Pilot abstraction, its distinguishing logi- are effectively available. Decoupling workload specification cal components and functionalities, its terminology, and its and execution simplifies the scheduling of workloads on those architecture pattern; and (iii) the description of core and resources. auxiliary properties of Pilot-Jobs systems and the analysis of In spite of the success and impact of Pilot-Jobs, we perceive seven exemplar Pilot-Job implementations. Together, these a problem: the development of Pilot-Job systems has not contributions illustrate the Pilot paradigm, its generality, been grounded on an analytical understanding of underpin- and how it helps to address some challenges in distributed ning abstractions, architectural patterns, or computational scientific computing. paradigms. The properties and functionalities of Pilot-Jobs have been understood mostly, if not exclusively, in relation arXiv:1508.04180v3 [cs.DC] 5 Mar 2016 Keywords to the needs of the containing software systems or on use Pilot-Job, distributed applications, distributed systems. cases justifying their immediate development. These limitations have also resulted in a fragmented soft- ware landscape, where many Pilot-Job systems lack general- 1. INTRODUCTION ity, interoperability, and robust implementations. This has ∗Corresponding author. led to a proliferation of functionally equivalent systems mo- tivated by similar objectives that often serve particular use cases and target particular resources. Addressing the limitations of Pilot systems while improving our general understanding of Pilot-Job systems is a priority Permission to make digital or hard copies of all or part of this work for due to the role they will play in the next generation of high- personal or classroom use is granted without fee provided that copies are performance computing. Most existing high-performance not made or distributed for profit or commercial advantage and that copies system software and middleware are designed to support bear this notice and the full citation on the first page. To copy otherwise, to the execution and optimization of single tasks. Based on republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. their current utilization, Pilot-Jobs have the potential to sup- Copyright 2015 ACM X-XXXXX-XX-X/XX/XX ...$15.00. 1 port the growing need for scalable task-level parallelism and resources can be traced back to 1922 as a way to reduce the dynamic resource management in high-performance comput- time to solution of differential equations [24]. In his Weather ing [12, 18, 23]. Forecast Factory [25], Lewis Fry Richardson imagined dis- The causes of the current status quo of Pilot-Job systems tributing computing tasks across 64,000 \human computers" are social, economic, and technical. While social and eco- to be processed in parallel. Richardson's goal was exploit- nomic considerations may play a determining role in promot- ing the parallelism of multiple processors to reduce the time ing fragmented solutions, this paper focuses on the technical needed for the computation. Today, task-level parallelism is aspects of Pilot-Jobs. We contribute a critical analysis of commonly adopted in weather forecasting on modern high the current state of the art describing the technical motiva- performance machines1 as computers. Task-level parallelism tions and evolution of Pilot-Job systems, their characterizing is also pervasive in computational science [26] (see Ref. [27] abstraction (the Pilot abstraction), and the properties of and references therein). their most representative and prominent implementations. Master-worker is a coordination pattern commonly used Our analysis will yield the Pilot paradigm, i.e., the way in for distributed computations [28{32]. Submitting tasks to which Pilot-Jobs are used to support and perform distributed multiple computers at the same time requires coordinating computing. the process of sending and receiving tasks; of executing them; The remainder of this paper is divided into four sections. x2 and of retrieving and aggregating their outputs [33]. In offers a description of the technical motivations of Pilot-Job the master-worker pattern, a \master" has a global view systems and of their evolution. of the overall computation and of its progress towards a In x3, the logical components and functionalities consti- solution. The master distributes tasks to multiple \workers", tuting the Pilot abstraction are discussed. We offer a termi- and retrieves and aggregates the results of each worker's nology consistent across Pilot-Job implementations, and an computation. Alternative coordination patterns have been architecture pattern for Pilot-Jobs systems is derived and devised, depending on the characteristics of the computed described. tasks but also on how the system implementing task-level In x4, the focus moves to Pilot-Job implementations and distribution and parallelism has been designed [34]. to their core and auxiliary properties. These properties Multi-tenancy defines how high-performance machines are described and then used alongside the Pilot abstraction are exposed to their users. Job schedulers, often called \batch and the pilot architecture pattern to describe and compare queuing systems" [35] and first used in the time of punched exemplar Pilot-Job implementations. cards [36,37], adopt the batch processing concept to promote In x5, we outline the Pilot paradigm, arguing for its gen- efficient and fair resource sharing. Job schedulers enable erality, and elaborating on how it impacts and relates to users to submit computational tasks called \jobs" to a queue. both other middleware and applications. Insight is offered The execution of these jobs is delayed waiting for the required about the future directions and challenges faced by the Pilot amount of the machine's resources to be available. The extent paradigm and its Pilot-Job systems. of delay depends on the number, size, and duration of the submitted jobs, resource availability, and policies (e.g., fair 2. EVOLUTION OF PILOT-JOB SYSTEMS usage). The resource provisioning of high-performance machines Three aspects of Pilot-Jobs are investigated in this pa- is limited, irregular, and largely unpredictable [38{41]. By per: the Pilot-Job system, the Pilot-Job abstraction, and definition, the resources accessible and available at any given the Pilot-Job paradigm. A Pilot-Job system is a type of soft- time can be fewer than those demanded by all the active ware, the Pilot-Job abstraction is the set of properties of that users. The resource usage patterns are also not stable over