Beg 2021 CSE Using Jupyter.Pdf (3.511Mb)

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/MCSE.2021.3052101, Computing in Science & Engineering Using Jupyter for reproducible scientific workflows Marijan Beg Faculty of Engineering and Physical Sciences, University of Southampton, University Road, SO17 1BJ Southampton, United Kingdom Juliette Belin Logilab, 104 Boulevard Auguste Blanqui, 75013 Paris, France Thomas Kluyver European XFEL GmbH, Holzkoppel 4, 22869 Schenefeld, Germany Alexander Konovalov School of Computer Science, University of St Andrews, Jack Cole Building, North Haugh, KY16 9SX St Andrews, United Kingdom Min Ragan-Kelley Simula Research Laboratory, Martin Linges vei 25, 1364 Fornebu, Norway Nicolas M. Thiery´ Laboratoire de Recherche en Informatique, Universite´ Paris-Saclay, CNRS, 91405 Orsay, France Hans Fangohr Max Planck Institute for the Structure and Dynamics of Matter, Luruper Chaussee 149, 22761 Hamburg, Germany European XFEL GmbH, Holzkoppel 4, 22869 Schenefeld, Germany Faculty of Engineering and Physical Sciences, University of Southampton, University Road, SO17 1BJ Southampton, United Kingdom Abstract—Literate computing has emerged as an important tool for computational studies and open science, with growing folklore of best practices. In this work, we report two case studies – one in computational magnetism and another in computational mathematics – where a dedicated software was exposed into the Jupyter environment. This enabled interactive and batch computational exploration of data, simulations, data analysis, and workflow documentation and outcome in Jupyter notebooks. In the first study, Ubermag drives existing computational micromagnetics software through a domain-specific language embedded in Python. In the second study, a dedicated Jupyter kernel interfaces with the GAP system for computational discrete algebra and its dedicated programming language. In light of these case studies, we discuss the benefits of this approach, including progress towards more reproducible and re-usable research results and outputs, notably through the use of infrastructure such as JupyterHub and Binder. INTRODUCTION conclusions. For a publication to be scientifically Research usually results in a publication that valid, it must present the methodology rigorously, presents and shares the obtained findings and so that readers can follow the “recipe” and Published by the IEEE Computer Society © IEEE 1 1521-9615 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. Authorized licensed use limited to: ST ANDREWS UNIVERSITY. Downloaded on February 04,2021 at 14:51:24 UTC from IEEE Xplore. Restrictions apply. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/MCSE.2021.3052101, Computing in Science & Engineering reproduce the results. If this criterion is met, In the last decade, literate computing has the publication is considered reproducible. Re- emerged as an important tool for computational producible publications are more easily re-usable studies and open science, with an ever-growing and thus provide a significant opportunity to make set of best practices. In this paper, we review and (often tax-payer funded) research more impactful. expand some of these best practices in the context However, the reproducibility of computational of two case studies: computational magnetism work is usually hindered not only by a lack of and mathematics. This is based on the experience data or meta-data but also by a lack of details on of enabling and applying Jupyter environments the procedure and tools used: in these fields as a part of the OpenDreamKit (https://opendreamkit.org/) project. 1) The source code of the software used is not To be able to run computational studies from available. the Jupyter environment, it is necessary to either 2) Information on the computing environment, have the simulation and/or analysis code exposed such as the hardware, operating system, to a general-purpose programming language sup- supporting libraries, and (if required) code ported by Jupyter, or have a dedicated Jupyter compilation details is not revealed. kernel for the computational libraries. Although 3) The exact procedure which led to the results the main topic of this work is the overview of reported in the publication is not shared. features and capabilities of the Jupyter environ- This should include the set of parameters ment for reproducible workflows, we begin by used, the simulation and data analysis pro- discussing how a computational library can be cedure, and any additional data cleaning, exposed to Jupyter as a necessary prerequisite. processing, and visualization. Ideally, these are shared as open-source code and analysis scripts used to perform the simulation Prerequisite: Exposing computational libraries to the Jupyter environment and to read, analyze, and visualize the re- Computational studies often use existing com- sulting data. This way, the entire process putational (legacy) tools. These could be executa- can be repeated by re-running simulation bles called from the command line or libraries and/or analysis scripts. A human-readable that are used within a programming language. For document detailing the computational steps the approach suggested here, these computational taken, despite being “better-than-nothing”, tools need to be accessible to scientists from is still insufficient to ensure reproducibility, a general-purpose programming language sup- and keeping a detailed log of all steps taken ported by Jupyter (such as Python). For some do- during a computational study is often impos- mains, such as pure mathematics research, there sible. are domain-specific languages with enough power Reproducibility is a challenging question and to be used directly as the programming language spans a range of different topics. In this work, we in notebooks (e.g., Singular and GAP). In other focus on one of them. We describe the features areas, exposing computational tools to a general- and capabilities of the Jupyter environment that, purpose programming language is the key to inte- in our view, make it a highly productive environ- grating them into researchers’ custom code. A key ment for computational science and mathematics, benefit of making computational tools available while facilitating reproducibility. in a general-purpose programming language is The topic of bitwise reproducibility is outside that the computation can be driven flexibly using the scope of this work: even with the same the control structures provided by that language. hardware and same software, it may be diffi- For example, a simulation can conveniently be cult to reproduce computational results to be repeated with a range of parameters through a for- bitwise identical. This can originate from the non- loop, rather than having to change a configuration associativity of floating-point operations com- file for each value and trigger execution of the bined with parallel execution or from compiler simulation manually. optimizations. Bitwise reproducibility is not al- Making the computational capability acces- ways required to be achieved. sible from a general-purpose programming lan- 2 1521-9615 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. Authorized licensed use limited to: ST ANDREWS UNIVERSITY. Downloaded on February 04,2021 at 14:51:24 UTC from IEEE Xplore. Restrictions apply. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/MCSE.2021.3052101, Computing in Science & Engineering guage supported through a Jupyter kernel such and many more) are supported through other as Python may be trivial – for example, if the Jupyter Notebook kernels. In this work, we sug- required code is already a Python library. When gest using a Jupyter research environment from the computational functionality is locked into an which computational studies can be driven and executable, one can create an interface layer so conducted efficiently. In this section, we discuss that functionality can be accessed via a Python the benefits of using the Jupyter environment for function or class [1]: input parameters will then reproducible scientific workflows. be translated into configuration files, the executable called, outputs retrieved, and finally, the 1. One study – one document results returned. The notebook allows us to carry out an entire If the computational tool uses a programming study within a single notebook and provides a language that Jupyter does not support, another complete and executable record of the process. It possibility is to implement a Jupyter kernel for is possible to put the interpretation of the results that language so that the computational library into the same document, immediately below the can be exposed to the Jupyter environment (as graphical, tabular or text-based output that needs done for GAP and SageMath for example). to be described. The “one study – one document” Over time, scientific

Load more