Developing Integrated Data Services for Cray Systems with a Gemini Interconnect
Ron A. Oldfield, Todd Kordenbrock, Jay Lofstead

May 1, 2012

Abstract

Over the past several years, there has been increasing interest in injecting a layer of compute resources between a high-performance computing application and the end storage devices. For some projects, the objective is to present the parallel file system with a reduced set of clients, making it easier for file-system vendors to support extreme-scale systems. In other cases, the objective is to use these resources as staging areas to aggregate data or cache bursts of I/O operations. Still others use these staging areas for in-situ analysis on data in-transit between the application and the storage system. To simplify our discussion, we adopt the general term Integrated Data Services to represent these use cases. This paper describes how we provide user-level, integrated data services for Cray systems that use the Gemini Interconnect. In particular, we describe our implementation and performance results on the Cray XE6, Cielo, at Los Alamos National Laboratory.

1 Introduction

In our quest toward exascale systems and applications, one topic that is frequently discussed is the need for more flexible execution models. Current models for capability-class high-performance computing (HPC) systems are essentially static, requiring applications and analysis to execute independently, storing intermediate results on a persistent, globally accessible parallel file system. For example, in fusion science, simulation of the edge of the plasma [17] and the interior of the plasma [39] are currently separate simulations. To have a more complete, accurate model for a fusion reactor, these components need to be tightly coupled to share the effects between the two models. The CESM climate model [24] is similar in that it incorporates atmosphere, ocean, land surface, sea ice, and land ice through a coupling engine that manages the interactions between these different systems, yielding a more accurate model of global climate. In most cases, these and other scientific applications are part of larger offline workflows that process the output written to storage in phases that ultimately yield insights into the phenomena being studied.

For exascale systems, there is a general belief that systems will need more flexible execution models that allow the coupling of simulations and/or analysis. This coupling promises to reduce the I/O burden on the file system and potentially improve the overall efficiency of the application workflow. Current work to enable these coupling and workflow scenarios focuses on the data issues: resolving resolution and mesh mismatches, reconciling time-scale mismatches, and making data available through data staging techniques [20, 34, 21, 11, 2, 40, 10].

In this paper, we describe R&D efforts and challenges of coupling simulation codes and analysis on Cray systems with Gemini networks. In particular, we describe how to create and use Integrated Data Services on the Cray XE6 platform using Sandia's Network Scalable Service Interface (Nessie) [20]. An integrated data service, illustrated in Figure 1, is a separate (possibly parallel) application that performs operations on behalf of an actively running scientific application. Nessie is a framework for developing data services on HPC platforms. It provides portable interfaces for inter-application communication across RDMA-based networks, an RPC-like abstraction for rapidly developing client and server stubs, and a portable model for defining serializable data structures for data transfer.

Figure 1: Data services use additional compute resources to perform operations on behalf of an HPC application.
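To make the client/server model concrete, the following is a minimal sketch of the RPC-style pattern that such frameworks expose: a client marshals a small request structure, ships it to a remote service process, and waits for a result. All type and function names below (request_t, transport_send, client_call, and so on) are hypothetical and are not the actual Nessie API; the in-process "transport" stub merely stands in for the RDMA/RPC transfer a real service would use.

    /*
     * Illustrative sketch of an RPC-style data-service call.
     * Names are hypothetical; this is not the Nessie API.
     */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Request/result structures; a real framework would describe these
     * in a portable, serializable form so client and server agree on
     * the wire layout. */
    typedef struct {
        uint32_t op;        /* operation code, e.g. "aggregate block"   */
        uint64_t offset;    /* offset of the data block being described */
        uint64_t length;    /* length of the data block in bytes        */
    } request_t;

    typedef struct {
        int32_t status;     /* 0 on success */
    } result_t;

    /* Server-side stub: decode the request and perform the operation. */
    static void server_handle_request(const void *buf, size_t len, result_t *res)
    {
        request_t req;
        if (len != sizeof(req)) { res->status = -1; return; }
        memcpy(&req, buf, sizeof(req));
        printf("service: op=%u offset=%llu length=%llu\n",
               (unsigned)req.op, (unsigned long long)req.offset,
               (unsigned long long)req.length);
        res->status = 0;
    }

    /* Transport stub: stands in for the RDMA/RPC transfer to the service. */
    static void transport_send(const void *buf, size_t len, result_t *res)
    {
        server_handle_request(buf, len, res);
    }

    /* Client-side stub: marshal the request and wait for the result. */
    static int client_call(uint32_t op, uint64_t offset, uint64_t length)
    {
        request_t req = { op, offset, length };
        result_t res;
        unsigned char buf[sizeof(req)];

        memcpy(buf, &req, sizeof(req));   /* trivial "serialization" */
        transport_send(buf, sizeof(buf), &res);
        return res.status;
    }

    int main(void)
    {
        /* The client application asks the service to process one block. */
        return client_call(1 /* hypothetical opcode */, 0, 1 << 20);
    }

In a real deployment the client and the service are separate executables, and the transport step moves the marshalled request across the interconnect rather than through a local function call.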
2 Background and Related Work

There are a number of ongoing efforts to integrate or couple simulation, analysis, and visualization. The approaches are generally categorized into two areas: in-situ and in-transit [23]. The term in-situ applies to codes that perform the analysis or visualization of data with the simulation that generates the data. In-situ libraries link to the main code and execute through library calls. This concept was first mentioned in the 1987 National Science Foundation Visualization in Scientific Computing workshop report [22]; however, interest in in-situ has grown significantly in recent years, and it is rapidly becoming one of the most important topics in large-scale visualization [15, 4].

One reason in-situ analysis is attractive for petascale systems is that the cost of dedicated interactive visualization computers for petascale computing is prohibitive [9]. Developing algorithms and techniques that work directly with the simulation code reduces, but does not eliminate, the need for specialized visualization hardware. Other studies show that the I/O cost of writing and reading data from parallel file systems is beginning to dominate the time spent in both the simulation and visualization [36, 33]. In-situ visualization eliminates this I/O cost by performing the visualization in the memory of the scientific code.

One downside of in-situ analysis is that the algorithms to perform analysis may not scale as well as the scientific code, creating a significant bottleneck for the overall runtime. This is partly because the algorithms and codes for visualization were not designed for large-scale HPC systems; we expect some of these issues to be resolved as visualization experts become more accustomed to large-scale parallel programming. In some cases, however, the communication requirements of an analysis algorithm are not appropriate for extreme scale, and thus not for in-situ use in capability-class applications.

In-transit analysis (also known as staged analysis) is similar to in-situ analysis in that the analysis code runs concurrently with the simulation code. The difference is that the analysis takes place on different compute resources than the simulation. Figure 2 illustrates this by comparing in-situ and in-transit analysis for Sandia's CTH shock physics code. A physical partitioning of the simulation and analysis codes allows the analysis to execute in a pipeline-parallel fashion with minimal interference on the parallel simulation. In some cases, the staging nodes simply provide data caching in the network to provide a buffer for bursty I/O operations [29, 34, 2, 25]. In this case, the staging area captures data from the application, then writes it to storage while the application continues to compute, effectively trading the cost of writing to a storage system for the cost of writing memory-to-memory through the high-speed interconnect [29]. There are also a number of examples of using staging areas for statistical analysis, indexing, feature extraction [23], FFTs [32], and data permutations [26].

Figure 2: Comparison of in-situ (a) and in-transit (b) fragment detection for the CTH shock physics code.

Some operations can be performed either in-situ or in-transit based on both the resource availability (memory and computation) and the communication requirements. ADIOS [19] introduced the idea of Data Characteristics [21] as a way to represent a local portion of statistics intended either to be used to gain knowledge about that portion of the data or to be used in aggregate to learn about the data set as a whole. The placement of these operators was examined in PreDatA [40]. Other efforts to accelerate the use of data include FastBit [14], which generates a bitmap-based index for data values; this approach yielded a relatively compact index with fast access to elements. The ADIOS data characteristics have also been extended beyond the minimum and maximum value to include the count, sum, sum of squares, histogram, and whether a value is an infinity [12]. Another use of in-transit processing is data compression, such as is done in ISABELA [18].
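To illustrate the kind of per-block statistics such data characteristics capture, the sketch below computes a minimum, maximum, count, sum, sum of squares, a coarse histogram, and an infinity count over one local block of values. The struct layout and all names are illustrative assumptions for this paper's discussion, not the actual ADIOS data-characteristics format.

    /*
     * Illustrative per-block "data characteristics": lightweight statistics
     * computed over one local portion of a variable, usable alone or
     * aggregated across blocks. Not the ADIOS on-disk format.
     */
    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NBINS 16

    typedef struct {
        double   min, max;         /* range of the finite values in the block */
        uint64_t count;            /* number of finite values                 */
        double   sum;              /* used to derive the mean                 */
        double   sum_sq;           /* used to derive the variance             */
        uint64_t histogram[NBINS]; /* coarse distribution over [lo, hi)       */
        uint64_t inf_count;        /* number of infinite values               */
    } block_stats_t;

    static void compute_block_stats(const double *v, uint64_t n,
                                    double lo, double hi, block_stats_t *s)
    {
        *s = (block_stats_t){ .min = INFINITY, .max = -INFINITY };
        for (uint64_t i = 0; i < n; i++) {
            double x = v[i];
            if (isinf(x)) { s->inf_count++; continue; }
            if (x < s->min) s->min = x;
            if (x > s->max) s->max = x;
            s->count++;
            s->sum    += x;
            s->sum_sq += x * x;
            /* clamp into [lo, hi) and bin */
            int b = (int)((x - lo) / (hi - lo) * NBINS);
            if (b < 0) b = 0;
            if (b >= NBINS) b = NBINS - 1;
            s->histogram[b]++;
        }
    }

    int main(void)
    {
        double data[] = { 0.5, 1.5, 2.0, 3.75, 2.25 };
        block_stats_t s;
        compute_block_stats(data, 5, 0.0, 4.0, &s);
        printf("min=%g max=%g mean=%g\n", s.min, s.max, s.sum / s.count);
        return 0;
    }

Statistics of this form are cheap to compute in-situ, while their aggregation across blocks (or more expensive analyses over them) can be deferred to an in-transit service.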
Another primary difference between in-situ and in-transit analysis is that in-transit analysis requires the ability to transfer data from the scientific code to the staging area for analysis. In techniques such as I/O Delegation [25], the application uses MPI to communicate this data. For I/O Delegation, the user allocates an additional set of staging processors when launching the application. A separate MPI communicator allows the staging processors to perform analysis without interfering with the primary application (a sketch of this communicator split appears at the end of this section). This approach was first demonstrated for high-performance computing in a seismic imaging application called Salvo [32]. In Salvo, the user allocated an "I/O Partition" for staging outgoing data and for performing preprocessing (i.e., FFTs) on incoming data. I/O delegation is perhaps the most portable approach for in-transit computation, but it requires a tight coupling of the analysis with the application, and it is difficult to share the service with multiple applications.

A second approach for in-transit analysis is to create the staging area as a separate application that communicates with the client application. This approach is extremely flexible because it allows for the

Data services, as defined in this paper, are a general form of in-transit computing. They leverage additional resources for the analysis, management, and staging of data. For example, the Nessie service developed for the Lightweight File System (LWFS) provided authentication, authorization, and storage [30]. The DataSpaces work is also more than just in-transit analysis [10]; it leverages memory in unused compute nodes to manage a distributed shared cache for coupled applications.

The two projects most similar to Nessie are DataStager [2] and GLEAN [38].
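The following is a minimal sketch of the communicator partitioning that the I/O Delegation approach described above relies on: the ranks of a single MPI job are split into an application group and a staging group, so that collectives among the staging ranks do not interfere with application collectives. The split rule (the last few ranks become staging ranks) and the names used here are illustrative assumptions, not the actual I/O Delegation or Salvo implementation.

    /*
     * Minimal sketch of I/O-Delegation-style partitioning: one MPI job is
     * split into "application" ranks and "staging" ranks, each with its
     * own communicator so their collectives do not interfere.
     */
    #include <mpi.h>
    #include <stdio.h>

    #define NUM_STAGING_RANKS 4   /* hypothetical size of the staging set */

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Last NUM_STAGING_RANKS ranks act as staging/analysis processors. */
        int is_staging = (rank >= size - NUM_STAGING_RANKS);

        /* Color 0 = application ranks, color 1 = staging ranks. */
        MPI_Comm part_comm;
        MPI_Comm_split(MPI_COMM_WORLD, is_staging ? 1 : 0, rank, &part_comm);

        int part_rank, part_size;
        MPI_Comm_rank(part_comm, &part_rank);
        MPI_Comm_size(part_comm, &part_size);

        if (is_staging) {
            /* Staging ranks would receive data from application ranks
             * (e.g., point-to-point messages on MPI_COMM_WORLD) and run
             * analysis collectives on part_comm only. */
            printf("staging rank %d of %d\n", part_rank, part_size);
        } else {
            /* Application ranks compute and ship data to the staging
             * ranks; their own collectives also stay on part_comm. */
            printf("application rank %d of %d\n", part_rank, part_size);
        }

        MPI_Comm_free(&part_comm);
        MPI_Finalize();
        return 0;
    }

Because both groups live in one MPI job, this pattern couples the analysis tightly to the application launch, which is the limitation that separate-application approaches such as Nessie services, DataStager, and GLEAN avoid.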