Hadoop on HPC: Integrating Hadoop and Pilot-Based Dynamic Resource Management
Andre Luckow (1,2,*), Ioannis Paraskevakos (1,*), George Chantzialexiou (1), Shantenu Jha (1,**)
1 Rutgers University, Piscataway, NJ 08854, USA
2 School of Computing, Clemson University, Clemson, SC 29634, USA
(*) Contributed equally to this work
(**) Contact author: [email protected]

arXiv:1602.00345v1 [cs.DC] 31 Jan 2016

Abstract—High-performance computing platforms such as "supercomputers" have traditionally been designed to meet the compute demands of scientific applications. Consequently, they have been architected as net producers and not consumers of data. The Apache Hadoop ecosystem has evolved to meet the requirements of data processing applications and has addressed many of the traditional limitations of HPC platforms. There exists, however, a class of scientific applications that needs the collective capabilities of traditional high-performance computing environments and the Apache Hadoop ecosystem. For example, the scientific domains of bio-molecular dynamics, genomics and network science need to couple traditional computing with Hadoop/Spark-based analysis. We investigate the critical question of how to present the capabilities of both computing environments to such scientific applications. Whereas this question needs answers at multiple levels, we focus on the design of resource management middleware that might support the needs of both. We propose extensions to the Pilot-Abstraction so as to provide a unifying resource management layer. This is an important step towards the interoperable use of HPC and Hadoop/Spark. It also allows applications to integrate HPC stages (e.g. simulations) with data analytics. Many supercomputing centers have started to officially support Hadoop environments, either in a dedicated environment or in hybrid deployments using tools such as myHadoop. This typically involves many intrinsic, environment-specific details that need to be mastered and that often swamp conceptual issues such as: How best to couple HPC and Hadoop application stages? How to explore runtime trade-offs (data locality vs. data movement)? This paper provides both conceptual understanding and practical solutions to the integrated use of HPC and Hadoop environments. Our experiments are performed on state-of-the-art production HPC environments and provide middleware for multiple domain sciences.

I. INTRODUCTION

The MapReduce [1] abstraction popularized by Apache Hadoop [2] has been successfully used for many data-intensive applications in different domains [3]. One important differentiator of Hadoop compared to HPC is the availability of many higher-level abstractions and tools for data storage, transformations and advanced analytics. These abstractions typically allow high-level reasoning about data parallelism without the need to manually partition data, manage the tasks processing that data and collect the results, as is required in other environments. Within the Hadoop ecosystem, tools like Spark [4] have gained popularity by supporting specific data processing and analytics needs and are increasingly used in the sciences, e.g. for DNA sequencing [5].

Data-intensive applications are associated with a wide variety of characteristics and properties, as summarized by Fox et al. [6], [7]. Their complexity and characteristics are fairly distinct from those of HPC applications. For example, they often comprise multiple stages, such as data ingest, pre-processing, feature extraction and advanced analytics. While some of these stages are I/O-bound, often with different access patterns (random/sequential), other stages are compute- or memory-bound. Not surprisingly, a diverse set of tools for data processing (e.g. MapReduce, Spark RDDs), access to data sources (streaming, filesystems) and data formats (scientific formats such as HDF5, columnar formats such as ORC and Parquet) has emerged, and these often need to be combined in order to support the end-to-end needs of applications.

Some applications, however, defy easy classification as data-intensive or HPC. In fact, there is specific interest in a class of scientific applications, such as bio-molecular dynamics [8], that have strong characteristics of both. Bio-molecular simulations are now highly performant, reach increasing time scales and problem sizes, and thus generate immense amounts of data. The bulk of the data in such simulations is typically trajectory data, i.e. a time-ordered set of coordinates and velocities. Secondary data includes other physical parameters, including different energy components. Often the data generated needs to be analyzed so as to determine the next set of simulation configurations. The type of analysis varies from computing higher-order moments, to principal components, to time-dependent variations.

MDAnalysis [9] and CPPTraj [10] are two tools that evolved to meet the increasing analytics demands of molecular dynamics applications; Ref. [11] represents an attempt to provide MapReduce-based solutions in HPC environments. These tools provide powerful domain-specific analytics; a challenge is the need to scale them to the high data volumes produced by molecular simulations, as well as the coupling between the simulation and analytics parts. This points to the need for environments that support scalable data processing while preserving the ability to run simulations at the scale needed to generate the data.
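As an illustration of the kind of time-ordered, per-frame trajectory analysis such tools support, the following minimal sketch uses the MDAnalysis Python API to compute the radius of gyration along a trajectory. The topology and trajectory file names are placeholders, and a production analysis would typically be more elaborate (e.g. PCA over the resulting time series).

```python
# Minimal sketch of per-frame trajectory analysis with MDAnalysis.
# File names are placeholders for an actual topology/trajectory pair.
import MDAnalysis as mda

# Load a topology and its time-ordered trajectory (coordinates per frame).
u = mda.Universe("protein.psf", "trajectory.dcd")
protein = u.select_atoms("protein")

# Iterate over frames and compute a simple per-frame observable.
radii = []
for ts in u.trajectory:
    radii.append((ts.time, protein.radius_of_gyration()))

# Higher-order moments, PCA, or time-dependent analyses would be
# computed over time series such as this one.
print(radii[:5])
```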
To the best of our knowledge, there does not exist a solution that provides the integrated capabilities of Hadoop and HPC. For example, Cray's analytics platform Urika (http://www.cray.com/products/analytics) has Hadoop and Spark running on an HPC architecture, as opposed to regular clusters, but without the HPC software environment and capabilities. However, several applications, ranging from bio-molecular simulations to epidemiology models [12], require significant simulations interwoven with analysis capabilities such as clustering and graph analytics; in other words, some stages (or parts of the same stage) of an application would ideally utilize Hadoop/Spark environments while other stages (or parts thereof) utilize HPC environments.

Over the past decades, the High Performance Distributed Computing (HPDC) community has made significant advances in addressing resource and workload management on heterogeneous resources. For example, the concept of multi-level scheduling [13], as manifested in the decoupling of workload assignment from resource management using intermediate container jobs (also referred to as Pilot-Jobs [14]), has been adopted for both HPC and Hadoop. Multi-level scheduling is a critical capability for data-intensive applications, as often only application-level schedulers can be aware of the localities of the data sources used by a specific application. This motivated the extension of the Pilot-Abstraction to Pilot-Data [15] to form the central component of a resource management middleware.

In this paper, we explore the integration between Hadoop and HPC resources utilizing the Pilot-Abstraction, allowing applications to manage HPC (e.g. simulations) and data-intensive application stages in a uniform way. We propose two extensions to RADICAL-Pilot: the ability to spawn and manage Hadoop/Spark clusters on HPC infrastructures on demand (Mode I), and the ability to connect to and utilize Hadoop and Spark clusters for HPC applications (Mode II). Both extensions facilitate the complex application and resource management requirements of data-intensive applications that are best met by a best-of-breed mix of Hadoop and HPC. By supporting these two usage modes, RADICAL-Pilot dramatically …
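To make the Pilot-Job pattern underlying both usage modes concrete, the following minimal sketch uses the RADICAL-Pilot API (class names as in the releases contemporary with this work) to acquire a pilot on an HPC resource and schedule a task onto it. The resource label, core count and task are placeholders, and the mechanism by which Mode I would provision a Hadoop/Spark environment inside the pilot is only indicated in a comment, not shown.

```python
# Minimal sketch of the Pilot-Job pattern with RADICAL-Pilot.
# Resource label, sizes and the task are placeholders; the keys for
# requesting an on-demand Hadoop/Spark environment (Mode I) depend on
# the RADICAL-Pilot resource configuration in use and are not shown.
import radical.pilot as rp

session = rp.Session()
try:
    pmgr = rp.PilotManager(session=session)
    umgr = rp.UnitManager(session=session)

    # Describe a pilot (container job) on an HPC resource; Mode I would
    # provision a Hadoop/Spark environment inside this allocation.
    pdesc = rp.ComputePilotDescription()
    pdesc.resource = "xsede.stampede"   # placeholder resource label
    pdesc.cores    = 64
    pdesc.runtime  = 60                 # minutes

    pilot = pmgr.submit_pilots(pdesc)
    umgr.add_pilots(pilot)

    # Application tasks (e.g. simulation or analysis stages) are expressed
    # as compute units and scheduled onto the pilot by the application-level
    # scheduler -- the essence of multi-level scheduling.
    cud = rp.ComputeUnitDescription()
    cud.executable = "/bin/echo"
    cud.arguments  = ["analysis stage placeholder"]

    umgr.submit_units([cud])
    umgr.wait_units()
finally:
    session.close()
```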
HPC and Hadoop originated from the need to support different kinds of applications: compute-intensive applications in the case of HPC, and data-intensive applications in the case of Hadoop. Not surprisingly, they follow different design paradigms: in HPC environments, storage and compute are connected by a high-end network (e.g. InfiniBand) with capabilities such as RDMA, whereas Hadoop co-locates both. HPC infrastructures introduced parallel filesystems, such as Lustre, PVFS or GPFS, to meet the increased I/O demands of data-intensive applications and archival storage, and to address the need to retain large volumes of primary simulation output data. The parallel filesystem model of using large, optimized storage clusters that expose a POSIX-compliant, rich interface and are connected to the compute nodes via fast interconnects works well for compute-bound tasks. It has, however, some limitations for data-intensive, I/O-bound workloads that require high sequential read/write performance. Various approaches for integrating parallel filesystems, such as Lustre and PVFS, with Hadoop have emerged [18], [19], yielding good results in particular for medium-sized workloads.

While Hadoop simplified the processing of vast volumes of data, it has limitations in its expressiveness, as pointed out by various authors [20], [21]. The complexity of creating sophisticated applications such as iterative machine learning algorithms required multiple MapReduce jobs and persistence to HDFS after each iteration. This led to several higher-level abstractions for implementing sophisticated data pipelines. Examples of such higher-level execution management frameworks for Hadoop are Spark [4], Apache Flink [22], Apache Crunch [23] and Cascading [24].

The most well-known emerging processing framework in the Hadoop ecosystem is Spark [4]. In contrast to MapReduce …
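The contrast with chained MapReduce jobs noted above can be seen in the following minimal PySpark sketch, in which an iterative computation reuses an RDD cached in memory instead of re-reading its input from HDFS (and persisting intermediate results) on every iteration. The input path and the update rule are illustrative placeholders.

```python
# Minimal sketch of an iterative computation over a cached RDD in PySpark.
# Input path and update rule are placeholders for illustration only.
from pyspark import SparkContext

sc = SparkContext(appName="iterative-sketch")

# Parse one numeric value per line and keep the partitions in memory;
# a chain of MapReduce jobs would re-read and re-write HDFS each iteration.
data = sc.textFile("hdfs:///data/values.txt") \
         .map(float) \
         .cache()

estimate = 0.0
for i in range(10):
    # Each pass reuses the cached partitions instead of going back to disk.
    error = data.map(lambda x: x - estimate).mean()
    estimate += 0.5 * error

print("final estimate:", estimate)
sc.stop()
```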