A Case for Integrating Experimental Containers with Notebooks
Total Page:16
File Type:pdf, Size:1020Kb
A Case for Integrating Experimental Containers with Notebooks Jason Anderson∗y and Kate Keaheyy∗ ∗Consortium for Advanced Science and Engineering University of Chicago, Chicago, IL 60637 Email: [email protected] yArgonne National Laboratory Lemont, IL 60439 Email: [email protected] Abstract—Computational notebooks have gained much pop- greatly facilitated by the advent of computational notebooks, ularity as a way of documenting research processes; they which have already proven themselves instrumental in com- allow users to express research narrative by integrating ideas municating scientific discoveries [2]. Their success lies in expressed as text, process expressed as code, and results in one the ability to provide a link between ideas (explained in executable document. However, the environments in which the text), process (captured as code), and results in the form code can run are currently limited, often containing only a of graphs, pictures, or other artifacts. They have however a fraction of the resources of one node, posing a barrier to many significant limitation in that the execution environment of a computations. In this paper, we make the case that integrating notebook is constrained to a small set of possibilities, such complex experimental environments, such as virtual clusters as preconfigured virtual machines or software containers, or complex networking environments that can be provisioned and often only granted a small fraction of a system’s full via infrastructure clouds, into computational notebooks will resources. While sufficient for many generic computations, significantly broaden their reach and at the same time help it falls short of being able to support large-scale scientific realize the potential of clouds as a platform for repeatable experiments that increasingly require complex experimen- research. To support our argument, we describe the integration tal containers composed of diverse and powerful resources of Jupyter notebooks into the Chameleon cloud testbed, which distributed over complex topologies. allows the user to define complex experimental environments The challenge of supporting such environments has been addressed by infrastructure clouds, from commercial and then assign processes to elements of this environment resources such as Amazon Web Services (AWS) [3] to similarly to the way a laptop user may switch between different open testbeds such as Chameleon [4]: they allow users to desktops. We evaluate our approach on an actual experiment create experimental containers implemented using isolation from both the development and replication perspective. vehicles from virtual machines to bare metal and from nodes to networks. Furthermore, the digital artifacts that 1. Introduction represent such containers—such as images and orchestra- tion templates [5] that marshal those images into complex Scientific computations, whether large-scale simulations, environments such as virtual clusters—can be shared and just-in-time analytics, or performance studies, each represent used by others to reestablish the experimental environment a computational experiment, which may have to be repeated within a given platform. This makes the task of replicating or replicated [1] for purposes of validation, variation, or computational experiments significantly easier; but the pro- other forms of evaluating and leveraging scientific con- cess that creates those digital artifacts is still complex and tribution. These computational experiments take place in thus poses a barrier to publishing them in the first place. experimental containers that define the environment of the In this paper, we propose a way to integrate notebooks– computation in terms of its configuration and makeup (i.e., a highly effective tool for the description of experimental compute nodes, networks, or storage resources as well as process–and experimental containers, capable of providing software configuration required by the computation). Such complex experimental environments, to mutual benefit. On experimental containers may be complex, and thus difficult one hand, dynamically and accurately deployed experimen- to create. Changes in their configuration may introduce tal containers based on shared artifacts developed for clouds inconsistencies, lead to wrong results, or even make re- or testbeds can address a deficiency of notebooks in terms of peating a computation impossible. Repeating or replicating documenting the experimental process. On the other hand, computational experiments thus relies on ensuring that its notebooks can provide a convenient and intuitive tool for experimental container is configured correctly every time. utilizing such containers in a way that more closely resem- The task of documenting an experimental process in bles the creative process and provides a way to introduce such a way that it can be shared and repeated has been variation in both the container makeup and the steps of experimentation. To support our argument, we describe the notebooks on hardware they control. This appraoch requires integration of Jupyter [6] notebooks with the Chameleon additional operational knowledge and reduces the likelihood testbed and a design pattern that leverages this integration of replicating the experiment results by requiring that future to allow the user to define arbitrarily complex experimental researchers possess similar hardware. containers and then assign processes to elements of this A different approach is implemented by systems such as container similarly to the way a laptop user may switch Popper [26] and Sumatra [27] which express an experimen- between different desktops. We evaluate our approach on an tal workflow that can be implemented across platforms in actual experiment from both the development and replication a declarative form. However, the “process” of experimenta- perspective to illustrate how it can help more investigators tion, when defined declaratively, requires understanding the structure more complex experiments interactively, and allow syntax and rules of different platforms and also creates chal- for their replication and controlled variation. lenges for an iterative workflow: declarative systems usually This paper is organized as follows. In Section 3 we do not have a mechanism for storing intermediate state, define experimental containers, describe their properties, and so are biased towards re-running the entire workflow and discuss their implementation across a range of testbeds. from start to finish. Besides this limitation, researchers find In Sections 4 and 5 we explain how we integrated this complex scenarios hard to represent unless care and time concept of a container into notebooks and discuss our im- is taken to follow the declarative model from the beginning plementation of integration of Jupyter notebooks with the [28], [29]. If a researcher makes the investment into such a Chameleon testbed. In Section 6, we validate our proposal workflow, their exact process does indeed become reliable, in the context of a demanding distributed experiment and streamlined, and easy to repeat. However, it is still hard to explain its benefits and limitations. interact with and produce variation easily which is necessary for exploring different inputs, configurations, and hypothe- 2. Related Work ses on the fly; in this context an imperative approach usually works better. The value of interweaving ideas (text) and process (code) in one document was first explored in an approach called 3. Experimental Containers literate programming [7]. Interactive notebooks follow the same pattern, but additionally allow the user to pause, Most computation takes place in the context of rewind, replay and resume a computation at any point, a complex configuration of resources including not supporting a more organic workflow. Additionally, the user only compute nodes, but potentially also accelerators, can work with rich, potentially interactive visualizations. networks, and storage. Recording and then replicating this The two predominant implementations of these interactive configuration is often not only complex and thus hard to do, notebooks are Wolfram Mathematica [8] and IPython (later but also needs to be done accurately to ensure consistency named Project Jupyter) [9]. Despite Mathematica’s longer of results. The latter is particularly true of Computer existence, Jupyter has better penetration within the aca- Science experiments in fields such power management or demic community, due to leveraging a large open-source performance variability, but can also affect other types development community and providing a web-based in- of computations. While virtualization techniques, such terface requiring no specialized client tools. The web in- as virtual machines or networks, addressed this problem terface spawned several managed “Notebook-as-a-Service” partially, we often need a generalized “container” that platforms to aid in reproducibility, such as CodeOcean [10], extends the abstraction over a combination of multiple WholeTale [11], Nextjournal [12], Binder [13], Wolfram resources working in concert. We describe below the Cloud [14], Azure Notebooks [15], Amazon EMR [16], properties of such experimental containers and discuss and Google Colab [17]. Pangeo [18] is recent development several implementations that satisfy them to various that applies this pattern to science problems across various degrees. domains. Many researchers have experimented with using Jupyter notebooks specifically to improve reproducibility Isolation: The ability