Making a Case for Distributed File Systems at Exascale
1,2Ioan Raicu, 2,3,4Ian T. Foster, 2,3,5Pete Beckman
1 Department of Computer Science, Illinois Institute of Technology, Chicago, IL, USA
2 Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA
3 Computation Institute, University of Chicago, Chicago, IL, USA
4 Department of Computer Science, University of Chicago, Chicago, IL, USA
5 Exascale Technology and Computing Institute, Argonne National Laboratory, Argonne, IL, USA
[email protected], [email protected], [email protected]

ABSTRACT
Exascale computers will enable the unraveling of significant scientific mysteries. Predictions are that 2019 will be the year of exascale, with millions of compute nodes and billions of threads of execution. The current architecture of high-end computing systems is decades old and has persisted as we scaled from gigascale to petascale. In this architecture, storage is completely segregated from the compute resources and is connected via a network interconnect. This approach will not scale several orders of magnitude in terms of concurrency and throughput, and will thus prevent the move from petascale to exascale. At exascale, basic functionality at high concurrency levels will suffer poor performance, and combined with a system mean-time-to-failure measured in hours, will lead to a performance collapse for large-scale heroic applications. Storage has the potential to be the Achilles heel of exascale systems. We propose that future high-end computing systems be designed with non-volatile memory on every compute node, allowing every compute node to actively participate in metadata and data management, and leveraging many-core processors and the high bisection bandwidth of torus networks. This position paper discusses this revolutionary new distributed storage architecture, which will make exascale computing more tractable, touching virtually all disciplines in high-end computing and fueling scientific discovery.
Categories and Subject Descriptors
C.2.4 [Distributed Systems]; C.5.1 [Large and Medium ("Mainframe") Computers]; D.4.2 [Storage Management]; D.4.3 [File Systems Management]; D.4.5 [Reliability]; D.4.7 [Organization and Design]; D.4.8 [Performance]; H.3.4 [Systems and Software]

General Terms
Management, Performance, Reliability

Keywords
Exascale computing, distributed file systems, storage architecture, many-task computing

1. INTRODUCTION
Today's science is generating datasets that are increasing exponentially in both complexity and volume, making their analysis, archival, and sharing one of the grand challenges of the 21st century. Seymour Cray once said that "a supercomputer is a device for turning compute-bound problems into I/O-bound problems"; the remark captures the fundamental shift in bottlenecks: as supercomputers gain more parallelism at exponential rates [1], storage infrastructure performance is increasing at a significantly lower rate. This implies that data management and the data flow between storage and compute resources are becoming the new bottleneck for large-scale applications. Support for data-intensive computing [2] is critical to advancing modern science, as storage systems have experienced a gap between capacity and bandwidth that has widened more than 10-fold over the last decade. There is an emerging need for advanced techniques to manipulate, visualize, and interpret large datasets. Many domains (e.g. astronomy, medicine, bioinformatics) share these data management challenges, strengthening the potential impact of generic solutions.

Exascale computers (e.g. capable of 10^18 ops/sec) [3], with a processing capability similar to that of the human brain, will enable the unraveling of significant scientific mysteries and present new challenges and opportunities. The US President made the building of exascale systems a top national priority, stating [4] that such systems are aimed at "dramatically increasing our ability to understand the world around us through simulation and slashing the time needed to design complex products such as therapeutics, advanced materials, and highly-efficient autos and aircraft". Major scientific opportunities arise in many fields (such as weather modeling, understanding global warming, national security, drug discovery, and economics) and may rely on revolutionary advances that will enable exascale computing [1]. Many experts predict [3] that exascale computing will be common by 2019: millions of nodes, billions of threads of execution, hundreds of petabytes of memory, and exabytes of persistent storage.

The current architecture of high-end computing (HEC) systems is decades old and has persisted as we scaled from gigascale to petascale. In this architecture, storage is completely segregated from the compute resources and is connected via a network interconnect (e.g. parallel file systems running on network-attached storage, such as GPFS [5], PVFS [6], and Lustre [7]). This approach will not scale several orders of magnitude in terms of concurrency and throughput, and will thus prevent the move from petascale to exascale. Unless significant research is invested to revolutionize the storage hardware architecture and parallel/distributed file system implementations, we will not be able to build capability exascale systems. Supercomputers are generally designed for either capability or capacity computing, as Irving Wladawsky-Berger wrote in 2010:

"Capability supercomputers dedicate the whole machine to solve a very large problem in the shortest amount of time. Capacity supercomputers, on the other hand, support large numbers of users solving different kinds of problems simultaneously. While both kinds of supercomputing are very important, initiatives designed to push the envelope, like DOE's exascale project, tend to focus on the development of capability machines to address Grand Challenge problems that could not be solved in any other way. Capability computing has been primarily applied to what is sometimes referred to as heroic computations, where just about the whole machine is applied to a single task." [8]
2. ACHILLES HEEL
We believe storage systems in future exascale systems will be their Achilles heel [9], unless storage systems are re-architected to ensure scalability to millions of nodes and potentially billions of concurrent I/O requests. We believe that the system mean-time-to-failure (MTTF) at exascale will be of the order of a few hours. However, based on current trends, even basic functionality (e.g. booting, Figure 1; metadata operations, Figure 2; read/write operations, Figures 3 and 4; loading applications, Figure 3; and check-pointing, Figure 5) at significant concurrency levels is expected to take more time than the expected exascale machine's MTTF, which will result in complete performance collapse for grand challenge applications.

Booting: For example, booting the Blue Gene/P (see Figure 1) is an expensive operation at 0.5 petaflops, as measured on the real machine. On 256 processors, it takes 85 seconds to boot the allocation; on the full 160K processors, it takes 1090 seconds. Unfortunately, it appears that the machine boot time grows linearly with the number of nodes, translating to potentially over 25K seconds (7+ hours) of boot time at 1M-node scales. [10]

[Figure 1: Booting a BlueGene/P; 160K-cores [10] (boot time in seconds vs. number of processors)]

Metadata operations on parallel file systems can be inefficient at large scale. Early experiments on the BlueGene/P system at 16K-core scales (see Figure 2) show the various costs (wall-clock time measured at the remote processor) for file and directory create on GPFS. Ideal performance would be a flat line. If care is not taken to avoid lock contention, performance degrades rapidly.

[Figure 2: File/dir create time on GPFS; 16K-cores [10] (time per operation in seconds vs. number of processors, for file create and directory create in a single directory)]

Read/Write: Reading performance of common datasets (e.g. application binaries, read-only databases) is challenging. The experiment in Figure 3 shows the distribution of data from the GPFS file system to the compute nodes with two approaches: 1) pushing out the data over a spanning tree (CIO), and 2) pulling the data from each compute node independently (GPFS). At the relatively modest scale of 4K cores, the CIO approach outperforms GPFS by a factor of five, due to better utilization of the bisection bandwidth of the torus network. Writing from many compute nodes directly to parallel file systems is also challenging; Figure 4 shows the poor efficiency achieved (15%-70%) with 16-second tasks producing 1KB to 1MB of output. [10, 11]

[Figure 3: Data read; GPFS vs. spanning tree (CIO) [11] (throughput in MB/s vs. number of processors)]

[Figure 4: Efficiency of 16-second tasks writing 1KB, 16KB, 128KB, and 1MB output to GPFS (efficiency vs. number of processors)]
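The boot-time argument above is a simple linear extrapolation from the measurement on the full machine (1090 seconds on 160K processors). A minimal back-of-the-envelope sketch of that arithmetic is given below; it assumes boot time grows proportionally with node count, and uses the Blue Gene/P's four cores per node to convert processor counts to nodes (the conversion is our assumption, not stated in the excerpt). The result lands in the same over-25K-second, 7+ hour range quoted in the text.

```python
# Back-of-the-envelope check of the boot-time extrapolation quoted above.
# Measured: 1090 s to boot the full 160K-processor Blue Gene/P. The
# 4-cores-per-node figure is used here to convert processors to nodes.
cores_per_node = 4                              # Blue Gene/P compute node
measured_nodes = 160 * 1024 // cores_per_node   # 40,960 nodes
measured_boot_s = 1090.0

def boot_time(nodes: int) -> float:
    """Assume boot time grows linearly (proportionally) with node count."""
    return measured_boot_s * nodes / measured_nodes

target_nodes = 1_000_000
est = boot_time(target_nodes)
print(f"Estimated boot time at 1M nodes: {est:,.0f} s (~{est / 3600:.1f} h)")
# -> about 26,600 s, i.e. over 25K seconds and more than 7 hours
```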
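The metadata observation above comes down to an access pattern: many clients creating files in one shared directory force the file system to serialize on directory locks, while per-client directories avoid most of that contention. The following is a hypothetical micro-benchmark sketch of those two patterns using mpi4py, not the benchmark used in the paper; the base path and file counts are illustrative only.

```python
# Hypothetical metadata micro-benchmark: N ranks create files either in one
# shared directory (directory-lock contention) or in per-rank directories.
# BASE should point at the parallel file system under test; it is a placeholder.
import os
import time
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
BASE = "/tmp/metadata-bench"        # placeholder; use a GPFS/PVFS/Lustre mount

def create_files(directory: str, count: int = 100) -> float:
    """Create `count` empty files in `directory`; return elapsed seconds."""
    os.makedirs(directory, exist_ok=True)
    comm.Barrier()                  # start all ranks together
    start = time.time()
    for i in range(count):
        open(os.path.join(directory, f"rank{rank}-{i}"), "w").close()
    comm.Barrier()                  # include waiting for the slowest rank
    return time.time() - start

shared_s = create_files(os.path.join(BASE, "shared"))        # one shared dir
private_s = create_files(os.path.join(BASE, f"rank{rank}"))  # one dir per rank

if rank == 0:
    print(f"shared dir: {shared_s:.2f} s, per-rank dirs: {private_s:.2f} s")
```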
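For the read experiment, the excerpt contrasts every node independently pulling the same dataset from GPFS with reading it once and pushing it out over a spanning tree on the torus. CIO's actual implementation is not shown in this excerpt; the sketch below uses MPI broadcast, which most MPI libraries implement as a tree-based collective, purely as a stand-in for the push approach, and the dataset path is hypothetical.

```python
# Hedged sketch: distribute a common input file to all ranks either by having
# every rank read it from the shared file system, or by reading it once on
# rank 0 and fanning it out via MPI's tree-based broadcast. Path is illustrative.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
PATH = "/gpfs/apps/input.dat"       # hypothetical shared dataset

def pull_independently() -> bytes:
    """Baseline: every rank opens and reads the file from the parallel FS."""
    with open(PATH, "rb") as f:
        return f.read()

def push_over_tree() -> bytes:
    """Rank 0 reads once; data fans out over a spanning tree (broadcast)."""
    data = open(PATH, "rb").read() if rank == 0 else None
    return comm.bcast(data, root=0)  # pickle-based bcast, fine for a sketch

USE_SPANNING_TREE = True
data = push_over_tree() if USE_SPANNING_TREE else pull_independently()
if rank == 0:
    print(f"distributed {len(data)} bytes to {comm.Get_size()} ranks")
```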
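The 15%-70% write-efficiency range can be read as the usual ratio of compute time to compute-plus-I/O time for a 16-second task. The sketch below shows that model with hypothetical effective write latencies chosen only to span the reported range; the real per-size numbers come from Figure 4, which is not reproduced in this excerpt.

```python
# Efficiency model for 16-second tasks that each write one small output file.
# The per-size I/O latencies are hypothetical placeholders chosen to bracket
# the 15%-70% range reported in the text, not measured values.
def efficiency(compute_s: float, io_s: float) -> float:
    """Fraction of wall-clock time spent on useful computation."""
    return compute_s / (compute_s + io_s)

TASK_S = 16.0
hypothetical_io_s = {"1KB": 7.0, "16KB": 12.0, "128KB": 30.0, "1MB": 90.0}

for size, io_s in hypothetical_io_s.items():
    print(f"{size:>5}: {efficiency(TASK_S, io_s):5.1%}")
# 1KB ends up near 70% and 1MB near 15%, the two ends of the reported range
```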