Bachelor Informatica

DisJotter: an interactive containerization tool for enabling FAIRness in scientific code

Wilco Kruijer

June 15, 2020

Informatica | Universiteit van Amsterdam
Supervisor(s): Spiros Koulouzis & Zhiming Zhao

Abstract

Researchers nowadays often rapidly prototype algorithms and workflows for their experiments using notebook environments, such as Jupyter. After experimenting locally, cloud infrastructure is commonly used to scale experiments to larger data sets. We identify a gap in combining these workflows and address it by relating the existing problems to the FAIR principles. We propose and develop DisJotter, a tool that can be integrated into the development life-cycle of scientific applications and can help scientists improve the FAIRness of their code. The tool has been demonstrated in the Jupyter Notebook environment. Using DisJotter, a scientist can interactively create a containerized service from their notebook. In this way, the container can be scaled out to larger workflows across bigger infrastructures.

Contents

1 Introduction
  1.1 Research question
  1.2 Outline

2 Background
  2.1 FAIR principles
  2.2 Environments and tools for scientific research
    2.2.1 Jupyter computational notebooks
    2.2.2 JupyterHub
    2.2.3 Scientific workflow management
  2.3 Software encapsulation
    2.3.1 Docker
    2.3.2 Repo2docker
  2.4 Gap analysis

3 DisJotter
  3.1 Requirements
  3.2 Architecture
  3.3 Implementation
    3.3.1 Front-end and server extensions
    3.3.2 Service helper
    3.3.3 Introspection

4 Results
  4.1 Software prototype: current status & installation
  4.2 Demonstration
    4.2.1 Containerizing code fragments in a notebook environment
    4.2.2 Using the generated Docker image
    4.2.3 Publishing the generated Docker image
  4.3 Process analysis
  4.4 Evaluation of requirements

5 Discussion
  5.1 Alternative implementations
  5.2 Installation of dependencies within the generated image
  5.3 Missing metadata
  5.4 Ethical considerations
  5.5 Alternative containerization platform
  5.6 Implementing code introspection for other programming languages

6 Conclusion
  6.1 Future work

CHAPTER 1

Introduction

Activities in the life-cycle of scientific experiments are often modelled as a sequence of dependent steps. A hypothesis is formulated from observations, which is later tested and evaluated using experiments. Computational tasks and steps in a scientific experiment are commonly realized by scientific software or services, which can be automated by workflow management systems. An effective workflow programming and automation environment, in combination with reusable workflow components, enables quick experimental iterations and thus improves efficiency in scientific research.

In recent years a literate approach to programming has gained popularity in the form of notebooks. Literate programming is a paradigm in which natural language is combined with computer code and its results to represent a program that is easy to understand for both the computer and humans. Advocates of the literate programming approach argue that this style of programming produces higher-quality computer programs, since it enables the programmer to be explicit about the decisions they make during the creation of software [5]. Literate programming found its way into scientific research as computation became an important part of scientific work.
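To make the paradigm concrete, the fragment below sketches how a literate notebook interleaves narrative, executable code, and stored output. The scenario, data, and names are hypothetical and chosen purely for illustration:

    # --- Markdown cell --------------------------------------------------
    # ## Mean sea-surface temperature
    # We summarise one month of (hypothetical) sensor readings, so the
    # reader follows the reasoning right next to the computation.

    # --- Code cell ------------------------------------------------------
    temperatures = [14.2, 14.8, 15.1, 14.6, 15.3]  # degrees Celsius
    mean_temp = sum(temperatures) / len(temperatures)
    print(f"Mean temperature: {mean_temp:.1f} °C")

    # --- Output (stored in the saved notebook) ---------------------------
    # Mean temperature: 14.8 °C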
As computation in the scientific process became more commonplace, it naturally became increasingly important to document the computational steps taken. Notebook environments can be used in most steps of the process: data collection, processing, analysis, and visualization. They allow for rapid iteration and testing across all of these steps.

Reproduction of experimental results is important for multiple reasons. It allows scientists to validate their own experiments, but it also enables other scientists to reproduce the experiment for further studies. Nowadays, scientific computation is often done in distributed cloud environments. These environments have storage and computation capacities that far exceed those of single computers. Using this type of infrastructure, research can be done on a much larger scale. The possibility of scaling experiments makes reproduction of experiments especially important.

1.1 Research question

An important aspect of notebooks is their highly interactive interface, via which users can rapidly develop algorithms and visualize the results. Published notebooks rarely make use of local modules [7], which means that notebooks are made available only as a whole. This lowers the findability and accessibility of individual components within a notebook, e.g. a specific algorithm, and makes it hard to execute components of a notebook as part of a workflow in environments such as cloud platforms, for instance when scaling to large data sets or parallelizing the execution.

In this thesis, we are motivated to answer the question: "how to reuse and share components in scientific code via notebook environments?" To answer this question we will first answer the sub-question: "how to encapsulate a component of a notebook?" We will then look at how to make encapsulated components findable, accessible, and interoperable.

1.2 Outline

In the next chapter we explore the background of our research: the principles we want our product to adhere to, the software used by researchers, and the existing solutions for software encapsulation. We then identify the gaps that exist within these concepts. Based on these gaps, we analyze the requirements for a tool in the first part of the third chapter, propose an architecture based on those requirements, and discuss the technical considerations. In chapter four we discuss the implementation details and demonstrate the system's functionality. The limitations and considerations of our software are discussed in the second-to-last chapter. Finally, we conclude this thesis.

CHAPTER 2

Background

2.1 FAIR principles

As the reliance on computation in the scientific process grew, the dependency on data grew with it. To improve "scientific data management and stewardship" [9], the data science community formulated a set of principles. The four principles are findability, accessibility, interoperability, and reusability. In this context they refer to data specifically, but the same principles can be applied to implementations or code within computer science research [3].

Jiménez et al. describe four recommended best practices to keep in mind when working on research software. They argue that although not all of the FAIR principles directly apply to digital objects other than data, most of the principles can be applied to software. All recommendations given in their work draw parallels to the FAIR principles [3]:
1. Findable: Researchers are encouraged to make software easy to discover in various manners.

2. Accessible: The article pleads for making source code (publicly) available from day one.

3. Interoperable: Although the article does not give a direct recommendation about the interoperability of software, it does recommend publishing software on community registries, which indirectly makes the software easier to integrate into different workflows.

4. Reusable: Jiménez et al. argue that all best practices in software improve reusability.

Reusing and reproducing are important aspects of the scientific method. In scientific studies, reproducibility can give guarantees about the research that has been done, and it makes a study more transparent. Without reproducibility, science would not be able to progress in any meaningful way: to accumulate knowledge, research must build upon earlier research and data.

Findability is similarly important in the context of research software. Software packages are often made discoverable by being deposited in repositories along with metadata. Metadata usually includes version numbers, licenses, and the contact details of the author. A summarized explanation of the capabilities of the software is generally also available with the package. This enables third parties to search for the software they desire. An important tool for findability is the persistent identifier. Such identifiers can be used by software consumers to uniquely refer back to the software that was used. In many registries this is simply a URL to a web page; another example is a namespace in Java.

FAIRness is considered important when applied to data sets; typically, however, the principles are not applied directly to research software.

2.2 Environments and tools for scientific research

2.2.1 Jupyter computational notebooks

Jupyter Notebook is the computational environment most widely used for the literate programming approach [6]. The Jupyter Notebook application can be used to create and share documents that help the user in data analysis, experimentation, and presenting results. It supports many different programming languages, but it is most commonly used in combination with Python and R. It allows the user to create and display plots inline. Documents contain cells, which can be of a variety of types, among them code, text, and images. It is possible to formulate mathematical equations in text cells using LaTeX syntax. Notebooks are meant to improve the shareability and reproducibility of the code they contain [4]. Saved Jupyter documents contain static output from when the code was last executed, which allows third parties to read the output of a notebook without executing it.

Figure 2.1: A typical iterative workflow in a literate programming notebook environment [8].

The Jupyter system contains a set of components that together deliver the notebook user experience. All projects in the Jupyter ecosystem are built on top of the Jupyter Client, a Python library that implements the Jupyter protocol. This protocol is a specification of the messages exchanged with so-called kernels. In the Jupyter context, a kernel is a program that runs and examines the user's code.
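As an illustration of this protocol in action, the sketch below uses the jupyter_client library to start a kernel, submit a code string over the shell channel, and read the kernel's broadcast messages from the iopub channel. It is a minimal sketch that assumes a locally installed python3 kernel; real front-ends add error handling and handle many more message types:

    from queue import Empty

    from jupyter_client.manager import KernelManager

    # Start a Python kernel and open the messaging channels defined by the
    # Jupyter protocol (shell for requests, iopub for published results).
    km = KernelManager(kernel_name="python3")
    km.start_kernel()
    kc = km.client()
    kc.start_channels()
    kc.wait_for_ready(timeout=30)

    # Send an execute_request, exactly as a notebook front-end does when
    # the user runs a code cell.
    msg_id = kc.execute("print(6 * 7)")

    # Drain the kernel's broadcast messages until it reports being idle.
    while True:
        try:
            msg = kc.get_iopub_msg(timeout=5)
        except Empty:
            break
        if msg["parent_header"].get("msg_id") != msg_id:
            continue
        if msg["msg_type"] == "stream":
            print(msg["content"]["text"], end="")  # -> 42
        if (msg["msg_type"] == "status"
                and msg["content"]["execution_state"] == "idle"):
            break

    kc.stop_channels()
    km.shutdown_kernel()

This split between a request channel and a broadcast channel is what allows several front-ends to observe the same running kernel at once.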