Towards Interactive, Reproducible Analytics at Scale on HPC Systems
Shreyas Cholia (Lawrence Berkeley National Laboratory, Berkeley, USA), Lindsey Heagy (University of California, Berkeley, USA), Matthew Henderson (Lawrence Berkeley National Laboratory, Berkeley, USA), Drew Paine (Lawrence Berkeley National Laboratory, Berkeley, USA), Jon Hays (University of California, Berkeley, USA), Ludovico Bianchi (Lawrence Berkeley National Laboratory, Berkeley, USA), Devarshi Ghoshal (Lawrence Berkeley National Laboratory, Berkeley, USA), Fernando Pérez (University of California, Berkeley, USA), Lavanya Ramakrishnan (Lawrence Berkeley National Laboratory, Berkeley, USA)

Abstract—The growth in scientific data volumes has resulted in a need to scale up processing and analysis pipelines using High Performance Computing (HPC) systems. These workflows need interactive, reproducible analytics at scale. The Jupyter platform provides core capabilities for interactivity but was not designed for HPC systems. In this paper, we outline our efforts that bring together core technologies based on the Jupyter platform to create interactive, reproducible analytics at scale on HPC systems. Our work is grounded in a real-world science use case: applying geophysical simulations and inversions for imaging the subsurface. Our core platform addresses three key areas of the scientific analysis workflow: reproducibility, scalability, and interactivity. We describe our implementation of a system using the Binder, Science Capsule, and Dask software. We demonstrate the use of this software to run our use case and interactively visualize real-time streams of HDF5 data.

Index Terms—Interactive HPC, Jupyter, Containers, Reproducible Science

I. INTRODUCTION

Scientific data is increasingly processed and analyzed on HPC systems. These workflows are different from traditional batch-queue workflows and require more support for a unified, interactive, real-time platform that can operate on High Performance Computing (HPC) systems and very large datasets.

In recent years, scientists across a wide variety of disciplines have adopted the Jupyter Notebook as an essential element of their discovery workflows. Jupyter combines documentation, visualization, data analytics, and code into a single unit that scientists can share, modify, and even publish. The Jupyter platform and ecosystem is increasingly being deployed on clusters and HPC systems to support scientific workflows. However, there are gaps in the current Jupyter ecosystem that need to be addressed to enable interactive and reproducible analytics at scale. Our work addresses these gaps. There are three key components driving our work:

• Reproducible Analytics: Reproducibility is a key component of the scientific workflow. We wish to capture the entire software stack that goes into a scientific analysis, along with the code for the analysis itself, so that the analysis can then be re-run anywhere. In particular, it is important to be able to reproduce workflows on HPC systems, against very large data sets and computational resources. Further, we also need to capture any updates made to the analysis on subsequent runs.

• Scalable Analytics: In order to run large-scale interactive workflows on HPC systems, we need to engineer these interactive analyses to utilize multiple nodes in a distributed fashion. We need to be able to interface with HPC batch queues and launch a set of workers that can enable interactive analytics at scale.

• Interactive Analytics: Data analytics tends to be a highly interactive process. While such analyses have traditionally been run on desktops and stand-alone computers, the size of the data requires us to reconsider the modalities of these workflows at scale. There is an increased need to support human-in-the-loop interactive analytics, large data visualizations, and exploration at batch-oriented HPC facilities, with the goal of pulling subsets of the data into interactive visualization tools on demand as needed.

Towards this end, we have developed a system that enables these components for interactive workflows. The Jupyter Notebook provides the core analysis platform, while the JupyterLab extension framework allows us to add custom functionality for interactive visualization. We use Binder [1], [2] and Science Capsule [3] to capture a reproducible software environment. Dask [4] (an open source framework and library for distributed computing in Python) provides us with a backend to scale the analyses on the National Energy Research Scientific Computing Center (NERSC) compute nodes. Finally, we developed a JupyterLab extension to allow users to interactively work with large volumes of data. Our system allows users to capture a project-specific set of Notebooks and software environment as a Git repository, launches this as containers on HPC resources at scale, and enables on-demand visualization to sub-select and view data in real time.
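To make the scalable-analytics component concrete, the following sketch shows one way a notebook could request Dask workers through a Slurm batch queue using the dask-jobqueue package. It is illustrative only, not the authors' exact deployment; the queue name and resource sizes are hypothetical.

# Minimal sketch: launch Dask workers as Slurm batch jobs and drive them from
# a Jupyter notebook. Assumes the dask-jobqueue package is installed; the queue
# name and resource sizes are hypothetical, not NERSC-specific settings.
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

cluster = SLURMCluster(
    queue="regular",        # hypothetical Slurm partition
    cores=32,               # cores per worker job
    memory="64GB",          # memory per worker job
    walltime="00:30:00",    # wall-clock limit per worker job
)
cluster.scale(jobs=4)       # submit four worker jobs to the batch queue

client = Client(cluster)    # the notebook now schedules work on those workers

Because the workers are ordinary batch jobs, the same interactive session can grow or shrink its pool of workers without leaving the notebook.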
II. BACKGROUND

In this section, we describe Jupyter at NERSC, our science use case, and our motivating user study.

A. Jupyter at NERSC

NERSC is the primary scientific computing facility for the Office of Science in the U.S. Department of Energy. More than 7,000 scientists use NERSC to perform basic scientific research across a wide range of disciplines, including climate modeling, research into new materials, simulations of the early universe, analysis of data from high energy physics experiments, investigations of protein structure, and a host of other scientific endeavors.

NERSC runs an instance of JupyterHub, a managed multi-user Jupyter deployment. The hub allows users to authenticate to the service, and provides the ability to spawn Jupyter Notebooks on different types of backend resources.
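As an illustration of this hub-and-spawner arrangement, the fragment below sketches how a JupyterHub deployment can direct user notebook servers onto batch-scheduled compute nodes. It is purely a sketch that assumes the batchspawner package; it is not NERSC's actual configuration, and the resource values are hypothetical.

# jupyterhub_config.py -- illustrative sketch only, not NERSC's configuration.
# Assumes the batchspawner package; resource values below are hypothetical.
c = get_config()  # provided by JupyterHub when it loads this config file

# Spawn each user's notebook server as a Slurm batch job rather than a local process.
c.JupyterHub.spawner_class = "batchspawner.SlurmSpawner"

# Resources requested for each user's notebook job.
c.SlurmSpawner.req_partition = "interactive"
c.SlurmSpawner.req_nprocs = "4"
c.SlurmSpawner.req_runtime = "02:00:00"

In this arrangement the hub handles authentication while the batch scheduler decides where each notebook actually runs, which is what lets a single hub front very different backend resources.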
B. Use case

We motivate our work through a real-world science use case involving geophysical simulations and inversions for imaging the subsurface. Geophysical imaging is used for resource exploration in hydrocarbons, minerals, and groundwater, as well as geotechnical and environmental applications. Subsurface variations in physical properties such as density or electrical conductivity can be diagnostic of geologic units or sought-after natural resources. Geophysical data, such as gravity or electromagnetic measurements, which are sensitive to these physical properties, can be collected in the field and interpreted using computational methods.

The computational building blocks that we require for working with geophysical data include:

• The ability to simulate the governing partial differential equations (PDEs). These are typically large-scale, require parallel computation, and can generate large [...]

[...] pipelines (e.g. playing the role of effective models or accelerated solvers) and as tools capable of integrating multimodal data for prediction, beyond what traditional physical models could do. In the geosciences we have access to both of these use cases, and the current project focuses on using ML tools to develop accelerated models that can accurately represent fine-scale physics that is computationally expensive to solve for directly.

This requires generating a substantial amount of training data from first-principles solvers, and then training the ML pipelines on that data to capture the relevant physical features at the required accuracy. This new ML-based effective model then needs to be used in the complete pipeline, in this case opening the possibility of solving more physically realistic problems in finite time thanks to this accelerated component.

This process is both data- and computationally-intensive (and thus well suited for an HPC environment), but it also requires expert insight and evaluation throughout. It is critical that the scientist with domain knowledge inspect the structure of the solutions learned and produced by the ML system to ensure that, in addition to satisfying the constraints encoded in the underlying optimization problem, they have the right overall physical structure. In the training process, visual and interactive exploration of the complex data is necessary to understand where issues may arise, as summary metrics such as total loss are often insufficient to inform the scientist.

Since the simulation codes tend to be optimized in their output formats and structure for computational efficiency, they do not always automatically produce the variables most convenient for interactive exploration. Furthermore, it would be cumbersome to require such capabilities to be hard-coded in the simulations themselves, as the requirements of interactive exploration can vary from problem to problem based on multiple conditions (questions being asked, bugs/issues found in a particular run, parameter values, etc.). It is thus optimal to separate the low-level capabilities of the code that generates all the proper generic data from the exploratory environment where a scientist can probe the specific combinations of quantities they need at a given time.

In this work, we design the right level of interactive capabilities in the JupyterLab environment to allow the working scientist to flexibly explore and query their data at multiple levels, with a minimal amount of customization required of the underlying simulation. Additionally, we tackle two issues that are part of the entire lifecycle of research and that become particularly acute in HPC contexts: how to improve the experience of interfacing with the HPC system's scheduling