Many-Task Computing for Grids and Supercomputers

Ioan Raicu 1, Ian T. Foster 1,2,3, Yong Zhao 4
1 Department of Computer Science, University of Chicago, Chicago IL, USA
2 Computation Institute, University of Chicago, Chicago IL, USA
3 Mathematics and Computer Science Division, Argonne National Laboratory, Argonne IL, USA
4 Microsoft Corporation, Redmond, WA, USA
[email protected], [email protected], [email protected]

Abstract

Many-task computing aims to bridge the gap between two computing paradigms, high throughput computing and high performance computing. Many-task computing differs from high throughput computing in the emphasis on using large numbers of computing resources over short periods of time to accomplish many computational tasks (i.e. including both dependent and independent tasks), where the primary metrics are measured in seconds (e.g. FLOPS, tasks/sec, MB/s I/O rates), as opposed to operations (e.g. jobs) per month. Many-task computing denotes high-performance computations comprising multiple distinct activities, coupled via file system operations. Tasks may be small or large, uniprocessor or multiprocessor, compute-intensive or data-intensive. The set of tasks may be static or dynamic, homogeneous or heterogeneous, loosely coupled or tightly coupled. The aggregate number of tasks, quantity of computing, and volumes of data may be extremely large. Many-task computing includes loosely coupled applications that are generally communication-intensive but not naturally expressed using the standard message passing interface commonly found in high performance computing, drawing attention to the many computations that are heterogeneous but not “happily” parallel.

Keywords: many-task computing, MTC, high-throughput computing, HTC, high performance computing, HPC

1. Defining Many Task Computing

We want to enable the use of large-scale distributed systems for task-parallel applications, which are linked into useful workflows through the looser task-coupling model of passing data via files between dependent tasks. This potentially larger class of task-parallel applications is precluded from leveraging the increasing power of modern parallel systems such as supercomputers (e.g. IBM Blue Gene/L [1] and Blue Gene/P [2]) because of the lack of efficient support in those systems for the “scripting” programming model [3]. With advances in e-Science and the growing complexity of scientific analyses, more scientists and researchers rely on various forms of scripting to automate end-to-end application processes involving task coordination, provenance tracking, and bookkeeping. Their approaches are typically based on a model of loosely coupled computation, in which data is exchanged among tasks via files, databases or XML documents, or a combination of these. Vast increases in data volume combined with the growing complexity of data analysis procedures and algorithms have rendered traditional manual processing and exploration unfavorable as compared with modern high performance computing processes automated by scientific workflow systems [4].

The problem space can be partitioned into four main categories (Figure 1 and Figure 2). 1) At the low end of the spectrum (low number of tasks and small input size), we have tightly coupled Message Passing Interface (MPI) applications (white). 2) As the data size increases, we move into the analytics category, such as data mining and analysis (blue); MapReduce [5] is an example for this category. 3) Keeping data size modest, but increasing the number of tasks, moves us into the loosely coupled applications involving many tasks (yellow); Swift/Falkon [6, 7] and Pegasus/DAGMan [8] are examples of this category. 4) Finally, the combination of both many tasks and large datasets moves us into the data-intensive many-task computing category (green); examples of this category are Swift/Falkon and data diffusion [9, 10], and Sawzall [11].

Figure 1: Problem types with respect to data size and number of tasks

High performance computing can be considered to be part of the first category (denoted by the white area). High throughput computing [12] can be considered to be a subset of the third category (denoted by the yellow area). Many-Task Computing can be considered as part of categories three and four (denoted by the yellow and green areas). This paper focuses on defining many-task computing, and the challenges that arise as datasets and computing systems grow exponentially.

Figure 2: An incomplete and simplistic view of programming models and tools

Clusters and Grids have been the preferred platform for loosely coupled applications that have traditionally been part of the high throughput computing class of applications, which are managed and executed through workflow systems or parallel programming systems. Various properties of a new, emerging class of applications, such as large numbers of tasks (i.e. millions or more), relatively short per-task execution times (i.e. seconds to minutes long), and data intensive tasks (i.e. tens of MB of I/O per CPU second of compute), have led to the definition of a new class of applications called Many-Task Computing. MTC emphasizes using much larger numbers of computing resources over short periods of time to accomplish many computational tasks, where the primary metrics are in seconds (e.g., FLOPS, tasks/sec, MB/sec I/O rates), while HTC requires large amounts of computing for long periods of time with the primary metrics being operations per month [12]. MTC applications are composed of many tasks (both independent and dependent tasks) that can be individually scheduled on many different computing resources across multiple administrative boundaries to achieve some larger application goal.

MTC denotes high-performance computations comprising multiple distinct activities, coupled via file system operations or message passing. Tasks may be small or large, uniprocessor or multiprocessor, compute-intensive or data-intensive. The set of tasks may be static or dynamic, homogeneous or heterogeneous, loosely coupled or tightly coupled. The aggregate number of tasks, quantity of computing, and volumes of data may be extremely large. Is MTC really different enough to justify coining a new term? There are certainly other choices we could have used instead, such as multiple program multiple data (MPMD), high throughput computing, workflows, capacity computing, or embarrassingly parallel computing.

MPMD is a variant of Flynn’s original taxonomy [13], used to denote computations in which several programs each operate on different data at the same time. MPMD can be contrasted with Single Program Multiple Data (SPMD), in which multiple instances of the same program each execute on different processors, operating on different data. MPMD lacks the emphasis that a set of tasks can vary dynamically. High throughput computing [12] is a term coined by Miron Livny within the Condor project [14] to contrast workloads for which the key metric is not floating-point operations per second (as in high performance computing) but operations “per month or year.” MTC applications are often just as concerned with performance as is the most demanding HPC application; they just don’t happen to be SPMD programs.
The term “workflow” was first used to denote sequences of tasks in business processes, but the term is sometimes used to denote any computation in which control and data pass from one “task” to another. We find it often used to describe many-task computations (or MPMD, HTC, MTC, etc.), making its use too general. “Embarrassingly parallel computing” is used to denote parallel computations in which each individual (often identical) task can execute without any significant communication with other tasks or with a file system. Some MTC applications will be simple and embarrassingly parallel, but others will be extremely complex and communication-intensive, interacting with other tasks and shared file systems.

Is “many task computing” a useful distinction? Perhaps we could simply have said “applications that are communication-intensive but are not naturally expressed in MPI”, but there are also loosely coupled applications composed of many independent tasks. Through the new term MTC, we are drawing attention to the many computations that are heterogeneous but not “happily” parallel.

2. MTC for Clusters, Grids, and Supercomputers

We claim that MTC applies not only to traditional HTC environments such as clusters and Grids, assuming appropriate support in the middleware, but also to supercomputers. Emerging petascale computing systems, such as IBM’s Blue Gene/P [2], incorporate high-speed, low-latency interconnects and other features designed to support tightly coupled parallel computations. Most of the applications run on these systems have an SPMD structure, and are commonly implemented by using MPI to achieve the needed inter-process communication. We believe MTC to be a viable paradigm for supercomputers.
As the computing and storage scale increases, the set of problems that must be overcome to make MTC practical (ensuring good efficiency and utilization at large scale) is exacerbated. The challenges include local resource manager scalability and granularity, efficient utilization of the raw hardware, shared file system contention and scalability, reliability at scale, application scalability, and understanding the limitations of the HPC systems in order to identify promising and scientifically valuable MTC applications.

One could ask, why use petascale systems for problems that might work well on terascale systems? We point out that petascale systems are more than just many processors with large peak petaflop ratings. They normally come well balanced, with proprietary, high-speed, and low-latency network interconnects that give tightly-coupled applications good opportunities to scale well at full system scales. Even IBM has proposed in their internal project Kittyhawk [15] that Blue Gene/P can be used to run non-traditional workloads (e.g. HTC). We identify four factors that motivate the support of MTC applications on petascale HPC systems:

1) The I/O subsystems of petascale systems offer unique capabilities needed by MTC applications. For example, collective I/O operations [16] could be implemented to use the specialized high-bandwidth and low-latency interconnects. MTC applications could be composed of individual tasks that are themselves parallel programs, many tasks operating on the same input data, and tasks that need considerable communication among them. Furthermore, the aggregate shared file system performance of a supercomputer can be potentially larger than that found in a distributed infrastructure (i.e., Grid), with data rates in the 10GB+/s range, rather than the more typical 0.1GB/s to 1GB/s range at most Grid sites.

2) The cost to manage and run on petascale systems like the Blue Gene/P is less than that of conventional clusters or Grids [15]. For example, a single 13.9 TF Blue Gene/P rack draws 40 kilowatts, for 0.35 GF/watt (a short arithmetic check follows this list). Two other systems that get good compute power per watt consumed are the SiCortex with 0.32 GF/watt and the Blue Gene/L with 0.23 GF/watt. In contrast, the average power consumption of the Top500 systems is 0.12 GF/watt [17]. Furthermore, we also argue that it is more cost effective to manage one large system in one physical location, rather than many smaller systems in geographically distributed locations.

3) Large-scale systems inevitably have utilization issues. Hence it is desirable to have a community of users who can leverage traditional back-filling strategies to run loosely coupled applications on idle portions of petascale systems.

4) Perhaps most importantly, some applications are so demanding that only petascale systems have enough compute power to get results in a reasonable timeframe, or to exploit new opportunities in such applications. With petascale processing capabilities applied to ordinary applications, it becomes possible to perform vast computations with quick turn-around, thus answering questions in a timeframe that can make a quantitative difference in addressing significant scientific challenges or responding to emergencies.
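As a quick sanity check of the power-efficiency figure quoted in factor 2 above, the 0.35 GF/watt number follows directly from the rack size and power draw stated in the text; the short Python sketch below simply reproduces that arithmetic (all numbers are taken from the text, none are new measurements).

    # Reproduce the GF/watt figure quoted above for a single Blue Gene/P rack.
    rack_teraflops = 13.9      # 13.9 TF per rack, as stated in the text
    rack_watts = 40_000        # 40 kilowatts per rack, as stated in the text

    gf_per_watt = rack_teraflops * 1000 / rack_watts
    print(f"Blue Gene/P rack: {gf_per_watt:.2f} GF/watt")   # -> 0.35 GF/watt

    # Other efficiencies quoted in the text, for comparison:
    for name, eff in [("SiCortex", 0.32), ("Blue Gene/L", 0.23), ("Top500 average", 0.12)]:
        print(f"{name}: {eff:.2f} GF/watt")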
3. The Data Deluge Challenge and the Growing Storage/Compute Gap

Within the science domain, the data that needs to be processed generally grows faster than computational resources and their speed. The scientific community is facing an imminent flood of data expected from the next generation of experiments, simulations, sensors and satellites. Scientists are now attempting calculations requiring orders of magnitude more computing and communication than was possible only a few years ago. Moreover, in many currently planned and future experiments, they are also planning to generate several orders of magnitude more data than has been collected in the entire human history [18].

For instance, in the astronomy domain the Sloan Digital Sky Survey [19] has datasets that exceed 10 terabytes in size. They can reach up to 100 terabytes or even petabytes if we consider multiple surveys and the time dimension. In physics, the CMS detector being built to run at CERN’s Large Hadron Collider [20] is expected to generate over a petabyte of data per year. In the bioinformatics domain, the rate of growth of DNA databases such as GenBank [21] and the European Molecular Biology Laboratory (EMBL) [22] has been following an exponential trend, with a doubling time estimated to be 9-12 months. A large class of applications in Many-Task Computing will be applications that analyze large quantities of data, which in turn would require that data and computations be distributed over many hundreds and thousands of nodes in order to achieve rapid turnaround times.

Many applications in scientific computing generally use a shared infrastructure such as TeraGrid [23] and Open Science Grid [24], where data movement relies on shared or parallel file systems. The rate of increase in the number of processors per system is outgrowing the rate of performance increase of parallel file systems, which requires rethinking existing data management techniques. For example, a cluster that was placed in service in 2002 with 316 processors has a parallel file system (i.e. GPFS [25]) rated at 1GB/s, yielding 3.2MB/s per processor of bandwidth. The second largest open science supercomputer, the IBM Blue Gene/P from Argonne National Laboratory, has 160K processors, and a parallel file system (i.e. also GPFS) rated at 8GB/s, yielding a mere 0.05MB/s per processor. That is a 65X reduction in bandwidth between a system from 2002 and one from 2008. Unfortunately, this trend is not bound to stop, as advances in multi-core and many-core processors will increase the number of processor cores by one to two orders of magnitude over the next decade [4].
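The storage/compute gap described above reduces to simple division; the sketch below reproduces the per-processor bandwidth figures and the roughly 65X reduction quoted in the text (the system parameters are the ones stated above, not independent measurements).

    # Per-processor parallel file system bandwidth, using the figures quoted above.
    systems = {
        "2002 cluster (316 processors, 1 GB/s GPFS)": (1.0, 316),
        "2008 Blue Gene/P (160K processors, 8 GB/s GPFS)": (8.0, 160_000),
    }

    per_proc = []
    for name, (gb_per_sec, processors) in systems.items():
        mb_per_proc = gb_per_sec * 1000 / processors   # MB/s of bandwidth per processor
        per_proc.append(mb_per_proc)
        print(f"{name}: {mb_per_proc:.2f} MB/s per processor")

    print(f"Reduction: ~{per_proc[0] / per_proc[1]:.0f}X")  # close to the 65X cited above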
We believe that data locality is critical to the successful and efficient use of large distributed systems for data-intensive applications [26, 27] in the face of a growing gap between compute power and storage performance. Large scale data management needs to be a primary objective for any middleware aiming to support MTC workloads, to ensure data movement is minimized by intelligent data-aware scheduling both among sites (assuming that each site has a local area network shared storage infrastructure), and among compute nodes (assuming that data can be stored on compute nodes’ local disk and/or memory).

4. Middleware Support for MTC

As high throughput computing (HTC) is a subset of MTC, it is worth mentioning the various efforts in enabling HTC on large scale systems. Some of these systems are Condor [14], MapReduce [5], Hadoop [28], and BOINC [29]. MapReduce (including Hadoop) is typically applied to a data model consisting of name/value pairs, processed at the programming language level. Its strengths are in its ability to spread the processing of a large dataset to thousands of processors with minimal expertise in distributed systems; however, it often involves the development of custom filtering scripts and does not support “black box” application execution as is commonly found in MTC or HTC applications. BOINC is known to scale well to large numbers of compute resources, but lacks support for data intensive applications, due to the nature of the wide area network deployment BOINC typically has, as well as support for “black box” applications.

On the IBM Blue Gene supercomputer, various works [30, 31] have leveraged the HTC-mode [32] support in the Cobalt [33] system. These works have aimed at integrating their solutions as much as possible in Cobalt; however, it is not clear that the current implementations will be able to support the largest and most demanding MTC applications at full system scales. Furthermore, these works focus on compute resource management, and ignore data management altogether.

Condor and glide-ins [34] are the original tools to enable HTC, but their emphasis on robustness and recoverability limits their efficiency for MTC applications in large-scale systems. We found that relaxing some constraints from the middleware and encouraging the end applications to implement these constraints has enabled significant improvements in middleware performance and efficiency at large scale [7].

For example, some of our work with the Falkon light-weight task execution framework has shown that task granularity can be efficiently handled at scales of seconds on modest size clusters of thousands of processing cores, while prior resource management techniques would have required task granularity to be in minutes, or even hours, to allow good utilization of the raw resources at the same scale. The task granularity issues are exaggerated as system scales are increased to levels of hundreds of thousands of processor cores. Falkon is able to achieve these improvements by streamlining the dispatching process, making the scheduler run in constant time in relation to the number of tasks queued and number of processors managed, multi-threading the resource manager to leverage multi-core nodes, moving recoverability from the resource manager to the application, and leaving out many non-essential features (i.e. accountability, support for multi-node tasks such as MPI, etc.) from production managers that can be offloaded to other systems or the application [7].
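The streamlined dispatching described above can be illustrated with a toy sketch: a single FIFO queue feeding a pool of worker threads keeps the per-task dispatch cost independent of how many tasks are queued or how many workers exist. This is only an illustration of the idea, not Falkon's actual implementation, and the worker count and task durations below are made-up example values.

    import queue
    import threading
    import time

    # Toy sketch of streamlined, queue-based task dispatch (not Falkon itself):
    # one FIFO queue feeds a pool of worker threads, so per-task dispatch cost
    # stays constant regardless of queue length or number of workers.

    def worker(tasks, results, worker_id):
        while True:
            duration = tasks.get()
            if duration is None:          # sentinel: shut this worker down
                tasks.task_done()
                return
            time.sleep(duration)          # stand-in for executing one short task
            results.append((worker_id, duration))
            tasks.task_done()

    def run(num_workers=8, task_durations=(0.01,) * 100):
        tasks, results = queue.Queue(), []
        threads = [threading.Thread(target=worker, args=(tasks, results, i))
                   for i in range(num_workers)]
        for t in threads:
            t.start()
        for d in task_durations:          # heterogeneous durations work equally well
            tasks.put(d)
        for _ in threads:
            tasks.put(None)
        for t in threads:
            t.join()
        return len(results)

    if __name__ == "__main__":
        print(run(), "tasks completed")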
Swift [6, 35] and Falkon [7] have been used to execute MTC applications on clusters, multi-site Grids (e.g., Open Science Grid [24], TeraGrid [36]), specialized large machines (SiCortex [37]), and supercomputers (e.g., Blue Gene/P [2]). Swift enables scientific workflows through a data-flow-based functional parallel programming model. It is a parallel scripting tool for rapid and reliable specification, execution, and management of large-scale science and engineering workflows. The runtime system in Swift relies on the CoG Karajan [38] workflow engine for efficient scheduling and load balancing, and it integrates with the Falkon light-weight task execution dispatcher for optimized task throughput and efficiency, as well as improved data management capabilities to ensure good scalability with increasing compute resources. Large-scale applications from many domains (e.g., astronomy [7, 39], medicine [7, 40, 41], chemistry [42], molecular dynamics [43], and economics [44, 45]) have been run at scales of up to millions of tasks on up to hundreds of thousands of processors. While this paper strives to define MTC, as well as the theory and practice needed to enable MTC on a wide range of systems from the average cluster to the largest supercomputers, the Falkon middleware represents the practical aspects of enabling MTC workloads on these systems.

5. MTC Applications

We have found many applications that are a better fit for MTC than HTC or HPC. Their characteristics include having a large number of small parallel jobs, a common pattern observed in many scientific applications [6]. They also use files (instead of messages, as in MPI) for inter-process communication, which tends to make these applications data intensive.

While we can push hundreds or even thousands of such small jobs via GRAM to a traditional local resource manager (e.g. PBS [46], Condor [34], SGE [47]), the utilization of a modest to large resource set will be poor due to high queuing and dispatching overheads, which ultimately results in low throughput. A common technique to amortize the costs of the local resource management is to “cluster” multiple jobs into a single larger job. Although this lowers the per-job overhead, it is best suited when the set of jobs to be executed is homogeneous in execution times, or accurate execution time information is available prior to job execution; with heterogeneous job execution times, it is hard to maintain good load balancing of the underlying resource, causing low resource utilization. We claim that “clustering” jobs is not enough, and that the middleware that manages jobs must be streamlined and made as light-weight as possible to allow applications with heterogeneous execution times to execute with high efficiency without “clustering”.

In addition to streamlined task dispatching, scalable data management techniques are also required in order to support MTC applications. MTC applications are often data and/or meta-data intensive, as each job requires at least one input file and one output file, and can sometimes involve many files per job. These data management techniques need to make good use of the full network bandwidth of large scale systems, which is a function of the number of nodes and the networking technology employed, as opposed to the relatively small number of storage servers that sit behind a parallel file system or GridFTP server.
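One simple reading of the data-aware scheduling called for above is to prefer dispatching a task to a node that already holds the task's input file in its local cache, falling back to the least-loaded node otherwise. The sketch below is a hypothetical illustration of that policy only; it is not the data diffusion mechanism of [9], and the node and file names are invented.

    # Hypothetical data-aware dispatch policy (illustration only, not data diffusion [9]):
    # prefer a node that already caches the task's input file; otherwise pick the
    # least-loaded node, which will then hold a cached copy for later tasks.

    def choose_node(input_file, node_caches, node_load):
        holders = [n for n, cached in node_caches.items() if input_file in cached]
        candidates = holders if holders else list(node_load)
        return min(candidates, key=lambda n: node_load[n])

    node_caches = {"n1": {"a.dat"}, "n2": {"b.dat"}, "n3": set()}   # invented example state
    node_load = {"n1": 0, "n2": 0, "n3": 0}

    for task_input in ["a.dat", "a.dat", "b.dat", "c.dat"]:
        node = choose_node(task_input, node_caches, node_load)
        node_load[node] += 1
        node_caches[node].add(task_input)    # the chosen node now caches this file
        print(f"task({task_input}) -> {node}")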
The stacking improves the signal to domain, there are applications that screen KEGG [53] noise, and after coadding a large number of images, compounds and drugs against important metabolic there will be a detectable signal to measure the average protein targets using DOCK6 [43] to simulate the brightness/shape etc of these objects. This application “docking” of small molecules, or ligands, to the “active involved the SDSS dataset [19] (currently at 10TB with sites” of large macromolecules of known structure over 300 million objects, but these datasets could be called “receptors”. A compound that interacts strongly petabytes in size if we consider multiple surveys in both with a receptor (such as a protein molecule) associated time and space) [4], many tasks ranging from 10K to with a disease may inhibit its function and thus act as a millions of tasks, each requiring 100ms to seconds of beneficial drug. The economic and health benefits of compute and 100KB to MB of input and output data. speeding drug development by rapidly screening for Another related application in astronomy is promising compounds and eliminating costly dead-ends MONTAGE [49, 50], a national virtual observatory is significant in terms of both resources and human life. project [51] that stitches tiles of images of the sky from The parameter space is quite large, totaling to more various sky surveys (e.g. SDSS [19], etc) into a than one billion computations that have a large variance photorealistic single image. Execution times per task of execution times from seconds to hours, with an range in the 100ms to 10s of seconds, and each task average of 10 minutes. The entire parameter space involves multiple input images and at least one image would require over 22,600 CPU years, or over 50 days output. This application is both compute intensive and on a 160K processor Blue Gene/P supercomputer [2]. data intensive, and has been run as both a HTC (using This application is challenging as there many tasks, Pegasus/DAGMan [8], and Condor [14]) and a HPC each task has a wide range of execution times with little (using MPI) application, but we found its scalability to to no prior knowledge about its execution time, and be limited when run under HTC or HPC. involves significant I/O for each computation as the compounds are typically stored in a database (i.e. 10s to Astrophysics: Another application is from 100s of MB large) and must be read completely per astrophysics, which analyzes the Flash turbulence computation. dataset (simulation data) [52] from various perspectives, using volume rendering and vector Chemistry: Another application in the same domain is visualization. The dataset is composed of 32 million OOPS [54], which aims to predict protein structure and files (1000 time steps times 32K files) taking up about recognize docking partners. In chemistry, specifically in 15TB of storage resource, and contains both temporal molecular dynamics, we have an application MolDyn and spatial locality. In the physics domain, the CMS whose goal is to calculate the solvation free energy of detector being built to run at CERN’s Large Hadron ligands and protein-ligand binding energy, with Collider [20] is expected to generate over a petabyte of structures obtained from the NIST Chemistry WebBook data per year. Supporting applications that can perform database [55]. 
Solvation free energy is an important a wide range of analysis of the LHC data will require quantity in Computational Chemistry with a variety of novel support for data intensive applications. applications, especially in drug discovery and design. These applications have similar characteristics as the Economic Modeling: An application from the DOCK application previously discussed. economic modeling domain that we have investigated as a good MTC candidate is Macro Analysis of Bioinformatics: In bioinformatics, Basic Local Refinery Systems (MARS) [44], which studies Alignment Search Tool (BLAST), is a family of tools economic model sensitivity to various parameters. for comparing primary biological sequence information MARS models the economic and environmental (e.g. amino-acid sequences of proteins, nucleotides of DNA sequences). A BLAST search enables one to involve many tasks, many input files (or many disjoint compare a query sequence with a library or database of sub-sections of few files), and are data intensive. sequences, and identify library sequences that resemble Data Mining: Another set of applications that perform the query sequence above a certain threshold. [56]. data analysis can be classified in the “All-Pairs” class Although the BLAST codes have been implemented in of applications [60]. These applications aim to both HTC and HPC, they are often both data and understand the behavior of a function on two sets, or to compute intensive, requiring multi-GB databases to be learn the covariance of these sets on a standard inner read for each comparison (or kept in memory if product. Two common applications in All-Pairs are data possible), and each comparison can be done within mining and biometrics. Data mining is the study of minutes on a modern processor-core. MTC and its extracting meaning from large data sets; one phase of support for data intensive applications are critical in knowledge discovery is reacting to bias or other noise scaling BLAST on large scale resources with thousands within a set. Different classifiers work better or worse to hundreds of thousands of processors. for varying data, and hence it is important to explore Neuroscience Domain: In the neuroscience domain, many different classifiers in order to be able to we have the Computational Neuroscience Applications determine which classifier is best for that type of noise Research Infrastructure (CNARI), which aims to on a particular distribution of the validation set. manage neuroscience tools and the heterogeneous Biometrics: Biometrics aims to identifying humans compute resources on which they can enable large-scale from measurements of the body (e.g. photos of the face, computational projects in the neuroscience community. recordings of the voice, and measurements of body The analysis includes the aphasia study, structural structure). A recognition algorithm may be thought of equation modeling, and general use of R for various as a function that accepts two images (e.g. face) as data analysis. [57] The application workloads involve input and outputs a number between zero and one many tasks, relatively short on the order of seconds, indicating the similarity between the two input images. and each task containing many small input and output The application would then compare all images of a files making the application meta-data intensive at large database and create a scoring matrix which can later be scale. easily searched to retrieve the most similar images. 
MPI Ensembles: Finally, another class of applications involves managing an ensemble of MPI applications. One example is from the climate modeling domain, which has been studying climate trends and predicting global warming [61], and is already implemented as an HPC MPI application. However, the current climate models could be run as ensemble runs (many concurrent MPI applications) to quantify climate model uncertainty. This is challenging on large scale systems such as supercomputers (a typical resource such models would execute on), as the local resource managers (e.g. Cobalt) favor large jobs and have policies against running many jobs at the same time (i.e. where many is more than a single digit number of jobs per user).

All these applications pose significant challenges to the traditional resource management found in HPC and HTC, from both a job management and a storage management perspective, and are in critical need of MTC support as the scale of these resources grows.

6. Conclusions

We have defined a new paradigm, MTC, which aims to bridge the gap between two computing paradigms, HTC and HPC. MTC applications are typically loosely coupled applications that are communication-intensive but not naturally expressed using the standard message passing interface commonly found in high performance computing, drawing attention to the many computations that are heterogeneous but not “happily” parallel. We believe that today’s existing HPC systems are a viable platform to host MTC applications. We also believe MTC is a broader definition than HTC, allowing for finer grained tasks, independent tasks as well as ones with dependencies, and allowing tightly coupled applications and loosely coupled applications to co-exist on the same system.
Furthermore, having native support for data intensive applications is central to MTC, as there is a growing gap between the storage performance of parallel file systems and the amount of available processing power. As the size of scientific data sets and the resources required for analysis increase, data locality becomes crucial to the efficient use of large scale distributed systems for scientific and data-intensive applications [62]. We believe it is feasible to allocate large-scale computational resources and caching storage resources that are relatively remote from the original data location, co-scheduled together to optimize the performance of entire data analysis workloads which are composed of many loosely coupled tasks.

We identified challenges in running these novel workloads on large-scale systems, which can hamper both efficiency and utilization. These challenges include local resource manager scalability and granularity, efficient utilization of the raw hardware, shared file system contention and scalability, reliability at scale, application scalability, and understanding the limitations of HPC systems in order to identify promising and scientifically valuable MTC applications.

Clusters with 62K processor cores (e.g., the TACC Sun Constellation System, Ranger), Grids (e.g., TeraGrid) with over a dozen sites and 161K processors, and supercomputers with 160K processors (e.g., IBM Blue Gene/P) are now available to the scientific community. These large HPC systems are considered efficient at executing tightly coupled parallel jobs within a particular machine, using MPI to achieve the needed inter-process communication. We proposed using HPC systems for loosely-coupled applications, which involve the execution of independent, sequential jobs that can be individually scheduled, and which use files for inter-process communication.

We have already shown good support for MTC on a variety of resources, from clusters to grids to supercomputers, through our work on Swift [6, 35, 42] and Falkon [7, 63]. Furthermore, we have taken the first steps to address data-intensive MTC by offloading much of the I/O away from parallel file systems and into the network, making full utilization of caches (both on disk and in memory) and the full network bandwidth of commodity networks (e.g. gigabit Ethernet) as well as proprietary and more exotic networks (Torus, Tree, and Infiniband) [9, 16].

We believe that there is more to HPC than tightly coupled MPI, and more to HTC than embarrassingly parallel long running jobs. Like HPC applications, and science itself, applications are becoming increasingly complex, opening new doors for many opportunities to apply HPC in new ways if we broaden our perspective. We hope the definition of Many-Task Computing leads to a stronger appreciation of the fact that applications that are not tightly coupled via MPI are not necessarily embarrassingly parallel: some have just so many simple tasks that managing them is hard, some operate on or produce large amounts of data that need sophisticated data management in order to scale. There also exist applications that involve MPI ensembles, essentially many jobs where each job is composed of tightly coupled MPI tasks, and there are loosely coupled applications that have dependencies among tasks but typically use files for inter-process communication. Efficient support for these sorts of applications on existing large scale systems, including future ones, will involve substantial technical challenges and will have a big impact on science.

To extend the discussion of this paper with the broader community on many-task computing, we held a workshop at IEEE/ACM Supercomputing 2008 titled “IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS08)” [64], as well as a bird-of-a-feather (BOF) session titled “Megajobs08: How to Run a Million Jobs” [65]. The half-day workshop was a success; it attracted over 100 participants who attended the presentations of the six accepted papers [16, 66, 67, 68, 69, 70] and a keynote talk by the renowned Dr. Alan Gara (IBM Blue Gene Chief Architect). The BOF was also a success, with seven presentations and over 60 participants. We plan to have follow-up workshops and special issue publications on many-task computing in the near future.
7. Acknowledgements

This work was supported in part by the NASA Ames Research Center GSRP grant number NNA06CB89H; by the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, Office of Science, U.S. Dept. of Energy, under Contract DE-AC02-06CH11357; and by the National Science Foundation under grant OCI-0721939. We also thank the Argonne Leadership Computing Facility for allowing us to test MTC in practice on the IBM Blue Gene/P. We also thank our many colleagues for their contributions to the implementations and applications that are making many task computing a reality; in no particular order, we would like to thank Michael Wilde, Zhao Zhang, Allan Espinosa, Ben Clifford, Kamil Iskra, Pete Beckman, Mike Kubal, Don Hanson, Matthew Cohoon, Fangfang Xia, Rick Stevens, Douglas Thain, Matei Ripeanu, and Samer Al-Kiswany.

8. References

[1] A. Gara, et al. “Overview of the Blue Gene/L system architecture”, IBM Journal of Research and Development 49(2/3), 2005
[2] IBM BlueGene/P (BG/P), http://www.research.ibm.com/bluegene/, 2008
[3] J. Ousterhout. “Scripting: Higher Level Programming for the 21st Century”, IEEE Computer, March 1998
[4] Y. Zhao, I. Raicu, I. Foster. “Scientific Workflow Systems for 21st Century e-Science, New Bottle or New Wine?”, IEEE Workshop on Scientific Workflows 2008
[5] J. Dean, S. Ghemawat. “MapReduce: Simplified Data Processing on Large Clusters”, OSDI 2004
[6] Y. Zhao, M. Hategan, B. Clifford, I. Foster, G. von Laszewski, I. Raicu, T. Stef-Praun, M. Wilde. “Swift: Fast, Reliable, Loosely Coupled Parallel Computation”, IEEE Workshop on Scientific Workflows 2007
[7] I. Raicu, Y. Zhao, C. Dumitrescu, I. Foster, M. Wilde. “Falkon: A Fast and Lightweight Task Execution Framework”, IEEE/ACM SC 2007
[8] E. Deelman et al. “Pegasus: A Framework for Mapping Complex Scientific Workflows onto Distributed Systems”, Scientific Programming Journal 13(3), pp. 219-237, 2005
[9] I. Raicu, Y. Zhao, I. Foster, A. Szalay. “Accelerating Large-Scale Data Exploration through Data Diffusion”, ACM International Workshop on Data-Aware Distributed Computing 2008
[10] M. Isard, M. Budiu, Y. Yu, A. Birrell, D. Fetterly. “Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks”, European Conference on Computer Systems (EuroSys) 2007
[11] R. Pike, S. Dorward, R. Griesemer, S. Quinlan. “Interpreting the Data: Parallel Analysis with Sawzall”, Scientific Programming Journal, Special Issue on Grids and Worldwide Computing Programming Models and Infrastructure 13(4), pp. 227-298, 2005
[12] M. Livny, J. Basney, R. Raman, T. Tannenbaum. “Mechanisms for High Throughput Computing”, SPEEDUP Journal 1(1), 1997
[13] M. Flynn. “Some Computer Organizations and Their Effectiveness”, IEEE Trans. Comput. C-21, pp. 948, 1972
[14] D. Thain, T. Tannenbaum, M. Livny. “Distributed Computing in Practice: The Condor Experience”, Concurrency and Computation: Practice and Experience 17(2-4), pp. 323-356, 2005
[15] J. Appavoo, V. Uhlig, A. Waterland. “Project Kittyhawk: Building a Global-Scale Computer”, ACM SIGOPS Review, 2008
[16] Z. Zhang, A. Espinosa, K. Iskra, I. Raicu, I. Foster, M. Wilde. “Design and Evaluation of a Collective I/O Model for Loosely-coupled Petascale Programming”, IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS08) 2008
[17] Top500, June 2008, http://www.top500.org/lists/2008/06, 2008
[18] T. Hey, A. Trefethen. “The Data Deluge: An e-Science Perspective”, Grid Computing: Making the Global Infrastructure a Reality, Wiley, 2003
[19] SDSS: Sloan Digital Sky Survey, http://www.sdss.org/, 2008
[20] CERN’s Large Hadron Collider, http://lhc.web.cern.ch/lhc, 2008
[21] GenBank, http://www.psc.edu/general/software/packages/genbank, 2008
[22] European Molecular Biology Laboratory, http://www.embl.org, 2008
[23] C. Catlett, et al. “TeraGrid: Analysis of Organization, System Architecture, and Middleware Enabling New Types of Applications”, HPC 2006
[24] Open Science Grid (OSG), http://www.opensciencegrid.org/, 2008
[25] F. Schmuck, R. Haskin. “GPFS: A Shared-Disk File System for Large Computing Clusters”, FAST 2002
[26] A. Szalay, A. Bunn, J. Gray, I. Foster, I. Raicu. “The Importance of Data Locality in Distributed Computing Applications”, NSF Workflow Workshop 2006
[27] J. Gray. “Distributed Computing Economics”, Technical Report MSR-TR-2003-24, Microsoft Research, Microsoft Corporation, 2003
[28] A. Bialecki, M. Cafarella, D. Cutting, O. O’Malley. “Hadoop: A Framework for Running Applications on Large Clusters Built of Commodity Hardware”, http://lucene.apache.org/hadoop/, 2005
[29] D.P. Anderson. “BOINC: A System for Public-Resource Computing and Storage”, IEEE/ACM Workshop on Grid Computing, 2004
[30] J. Cope, M. Oberg, H.M. Tufo, T. Voran, M. Woitaszek. “High Throughput Grid Computing with an IBM Blue Gene/L”, Cluster 2007
[31] A. Peters, A. King, T. Budnik, P. McCarthy, P. Michaud, M. Mundy, J. Sexton, G. Stewart. “Asynchronous Task Dispatch for High Throughput Computing for the eServer IBM Blue Gene Supercomputer”, Parallel and Distributed Processing (IPDPS), 2008
[32] IBM Corporation. “High-Throughput Computing (HTC) Paradigm”, IBM System Blue Gene Solution: Blue Gene/P Application Development, IBM RedBooks, 2008
[33] N. Desai. “Cobalt: An Open Source Platform for HPC System Software Research”, Edinburgh BG/L System Software Workshop, 2005
[34] J. Frey, T. Tannenbaum, I. Foster, M. Livny, S. Tuecke. “Condor-G: A Computation Management Agent for Multi-Institutional Grids”, Cluster Computing, 2002
[35] Swift Workflow System, www.ci.uchicago.edu/swift, 2008
[36] C. Catlett et al. “TeraGrid: Analysis of Organization, System Architecture, and Middleware Enabling New Types of Applications”, HPC and Grids in Action, ed. Lucio Grandinetti, IOS Press Advances in Parallel Computing series, Amsterdam, 2007
[37] SiCortex, http://www.sicortex.com/, 2008
[38] G. von Laszewski, M. Hategan, D. Kodeboyina. “Java CoG Kit Workflow”, in I.J. Taylor, E. Deelman, D.B. Gannon, M. Shields, eds., Workflows for eScience, pp. 340-356, 2007
[39] J.C. Jacob, et al. “The Montage Architecture for Grid-Enabled Science Processing of Large, Distributed Datasets”, Earth Science Technology Conference 2004
[40] The Functional Magnetic Resonance Imaging Data Center, http://www.fmridc.org/, 2007
[41] T. Stef-Praun, B. Clifford, I. Foster, U. Hasson, M. Hategan, S. Small, M. Wilde, Y. Zhao. “Accelerating Medical Research using the Swift Workflow System”, HealthGrid 2007
[42] Y. Zhao, I. Raicu, I. Foster, M. Hategan, V. Nefedova, M. Wilde. “Realizing Fast, Scalable and Reliable Scientific Computations in Grid Environments”, Grid Computing Research Progress, Nova Pub., 2008
[43] D.T. Moustakas et al. “Development and Validation of a Modular, Extensible Docking Program: DOCK 5”, J. Comput. Aided Mol. Des. 20, pp. 601-619, 2006
[44] D. Hanson. “Enhancing Technology Representations within the Stanford Energy Modeling Forum (EMF) Climate Economic Models”, Energy and Economic Policy Models: A Reexamination of Fundamentals, 2006
[45] T. Stef-Praun, G. Madeira, I. Foster, R. Townsend. “Accelerating Solution of a Moral Hazard Problem with Swift”, e-Social Science 2007
[46] B. Bode, D.M. Halstead, R. Kendall, Z. Lei, W. Hall, D. Jackson. “The Portable Batch Scheduler and the Maui Scheduler on Linux Clusters”, Usenix, 4th Annual Linux Showcase & Conference, 2000
[47] W. Gentzsch. “Sun Grid Engine: Towards Creating a Compute Power Grid”, 1st International Symposium on Cluster Computing and the Grid, 2001
[48] I. Raicu, I. Foster, A. Szalay, G. Turcu. “AstroPortal: A Science Gateway for Large-scale Astronomy Data Analysis”, TeraGrid Conference 2006
[49] G.B. Berriman, et al. “Montage: A Grid Enabled Engine for Delivering Custom Science-Grade Image Mosaics on Demand”, SPIE Conference on Astronomical Telescopes and Instrumentation, 2004
[50] J.C. Jacob, et al. “The Montage Architecture for Grid-Enabled Science Processing of Large, Distributed Datasets”, Earth Science Technology Conference 2004
[51] US National Virtual Observatory (NVO), http://www.us-vo.org/index.cfm, 2008
[52] ASC / Alliances Center for Astrophysical Thermonuclear Flashes, http://www.flash.uchicago.edu/website/home/, 2008
[53] KEGG’s Ligand Database, http://www.genome.ad.jp/kegg/ligand.html, 2008
[54] PL protein library, http://protlib.uchicago.edu/, 2008
[55] NIST Chemistry WebBook database, http://webbook.nist.gov/chemistry/, 2008
[56] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, D.J. Lipman. “Basic Local Alignment Search Tool”, J Mol Biol 215(3), pp. 403-410, 1990
[57] Computational Neuroscience Applications Research Infrastructure, http://www.ci.uchicago.edu/wiki/bin/view/CNARI/WebHome, 2008
[58] The Functional Magnetic Resonance Imaging Data Center, http://www.fmridc.org/, 2007
[59] J. Dean, S. Ghemawat. “MapReduce: Simplified Data Processing on Large Clusters”, Symposium on Operating System Design and Implementation (OSDI’04), 2004
[60] C. Moretti, J. Bulosan, D. Thain, P. Flynn. “All-Pairs: An Abstraction for Data-Intensive Cloud Computing”, IPDPS 2008
[61] D. Bernholdt, S. Bharathi, D. Brown, K. Chanchio, M. Chen, A. Chervenak, L. Cinquini, B. Drach, I. Foster, P. Fox, J. Garcia, C. Kesselman, R. Markel, D. Middleton, V. Nefedova, L. Pouchard, A. Shoshani, A. Sim, G. Strand, D. Williams. “The Earth System Grid: Supporting the Next Generation of Climate Modeling Research”, Proceedings of the IEEE 93(3), pp. 485-495, 2005
[62] A. Szalay, A. Bunn, J. Gray, I. Foster, I. Raicu. “The Importance of Data Locality in Distributed Computing Applications”, NSF Workflow Workshop 2006
[63] I. Raicu, Z. Zhang, M. Wilde, I. Foster, P. Beckman, K. Iskra, B. Clifford. “Towards Loosely-Coupled Programming on Petascale Systems”, IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SuperComputing/SC08), 2008
[64] I. Raicu, Y. Zhao, I. Foster. “IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS08)”, co-located with IEEE/ACM Supercomputing 2008, http://dsl.cs.uchicago.edu/MTAGS08/, 2008
[65] M. Pierce, I. Raicu, R. Pordes, J. McGee, D. Repasky. “How to Run One Million Jobs (Megajobs08)”, Bird-of-a-Feather Session at IEEE/ACM Supercomputing 2008, http://gridfarm007.ucs.indiana.edu/megajobBOF/index.php/Main_Page, 2008
[66] Y. Gu, R. Grossman. “Exploring Data Parallelism and Locality in Wide Area Networks”, IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS08) 2008
[67] E.V. Hensbergen, R. Minnich. “System Support for Many Task Computing”, IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS08) 2008
[68] L. Hui, Y. Huashan, L. Xiaoming. “A Lightweight Execution Framework for Massive Independent Tasks”, IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS08) 2008
[69] A.T. Thor, G.V. Zaruba, D. Levine, K. De, T.J. Wenaus. “ViGs: A Grid Simulation and Monitoring Tool for Grid Workflows”, IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS08) 2008
[70] E. Afgan, P. Bangalore. “Embarrassingly Parallel Jobs Are Not Embarrassingly Easy to Schedule on the Grid”, IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS08) 2008