Jupyter Tutorial Release 0.8.0
Veit Schiele
Oct 01, 2021

CONTENTS

1 Introduction
    1.1 Status
    1.2 Target group
    1.3 Structure of the Jupyter tutorial
    1.4 Why Jupyter?
    1.5 Jupyter infrastructure
2 First steps
    2.1 Install Jupyter Notebook
    2.2 Create notebook
    2.3 Example
    2.4 Installation
    2.5 Follow us
    2.6 Pull-Requests
3 Workspace
    3.1 IPython
    3.2 Jupyter
4 Read, persist and provide data
    4.1 Open data
    4.2 Serialisation formats
    4.3 Requests
    4.4 BeautifulSoup
    4.5 Intake
    4.6 PostgreSQL
    4.7 NoSQL databases
    4.8 Application Programming Interface (API)
    4.9 Glossary
5 Clean up and validate data
    5.1 Deduplicate data
    5.2 String matching
    5.3 Managing missing data with pandas
    5.4 Scikit Learn preprocessing
    5.5 Dask pipeline
    5.6 Data validation with voluptuous (schema definitions)
    5.7 Pandas DataFrame validation with Engarde
    5.8 Pandas DataFrame validation with Bulwark
    5.9 TDDA: Test-Driven Data Analysis
    5.10 Hypothesis: property based testing
6 Visualise data
7 Refactoring
    7.1 Check and improve code quality and complexity
    7.2 Parameterisation and scheduling
    7.3 Performance measurement and optimisation
8 Create a product
    8.1 Manage code with Git
    8.2 Manage data with DVC
    8.3 Create packages
    8.4 Document
    8.5 Licensing
    8.6 Citing
    8.7 Reproduce environments
    8.8 Testing
    8.9 Logging
9 Create web applications
    9.1 Dashboards
10 Index
Index

Jupyter notebooks are growing in popularity with data scientists and have become the de facto standard for rapid prototyping and exploratory analysis. They strongly encourage experimentation and innovation, and they also make the entire research process faster and more reliable. In addition, many new components are being created that extend the original limits of notebooks and enable new uses.
CHAPTER ONE

INTRODUCTION

1.1 Status

1.2 Target group

The users of Jupyter notebooks are diverse, ranging from data scientists to data engineers and analysts to system engineers. Their skills and workflows differ widely. However, one of the great strengths of Jupyter notebooks is that they allow these different experts to work closely together in cross-functional teams.

• Data scientists explore data with different parameters and summarise the results.
• Data engineers check the quality of the code and make it more robust, efficient and scalable.
• Data analysts use the code provided by data engineers to systematically analyse the data.
• System engineers provide the research platform based on JupyterHub, on which the other roles can perform their work.

In this tutorial, we primarily address system engineers who want to build and operate a platform based on Jupyter notebooks. We then explain how this platform can be used effectively by data scientists, data engineers and analysts.

1.3 Structure of the Jupyter tutorial

From Chapter 3, the Jupyter tutorial follows the prototype of a research project:

3. Set up the workspace with the installation and configuration of IPython and Jupyter with nbextensions and ipywidgets.
4. Collect data, either through a REST API or directly from an HTML page.
5. Cleaning up data is a recurring task that includes removing or modifying redundant, inconsistent or incorrectly formatted data.
6. Analyse data through exploratory analysis and visualisation.
7. Refactoring includes parameterisation, validation and performance optimisation, including through concurrency.
8. Creating a product includes testing, logging and documenting the methods and functions as well as creating packages.
9. Web applications can either generate dashboards from Jupyter notebooks or require more comprehensive application logic, as demonstrated in Embedding Bokeh plots in Flask, or provide data via a RESTful API.

1.4 Why Jupyter?

How can these diverse tasks be simplified? You will hardly find a tool that covers all of these tasks, and several tools are often required even for individual tasks. Therefore, on a more abstract level, we are looking for more general patterns for tools and languages with which data can be analysed and visualised and a project can be documented and presented. This is exactly what we are aiming for with Project Jupyter.

The Jupyter project started in 2014 with the aim of creating a consistent set of open source tools for scientific research, reproducible workflows, computational narratives and data analysis. In 2017, Jupyter received the ACM Software Systems Award, a prestigious award that it shares with, among others, Unix and the web.

To understand why Jupyter notebooks are so successful, let's take a closer look at the core functions:

Jupyter Notebook Format
    Jupyter Notebooks are an open, JSON-based document format with full records of the user's sessions and the code they contain.

Interactive Computing Protocol
    The notebook communicates with the computing kernel via the Interactive Computing Protocol, an open network protocol based on JSON data via ZMQ and WebSockets.

Kernels
    Kernels are processes that execute interactive code in a specific programming language and return the output to the user.

See also:

• Jupyter celebrates 20 years

1.5 Jupyter infrastructure

A platform for the above-mentioned use cases requires an extensive infrastructure that not only allows the provision of kernels and the parameterisation, scheduling and parallelisation of notebooks, but also the uniform provision of resources.

This tutorial provides a platform that enables fast, flexible and comprehensive data analysis beyond Jupyter notebooks.
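
Because notebooks are plain JSON documents (see the Jupyter Notebook Format under 1.4), infrastructure tooling can read and modify them with nothing more than a JSON library. The following is a minimal sketch under stated assumptions: the dictionary follows the publicly documented nbformat 4 layout, while the cell contents and the helper name code_sources are hypothetical and chosen only for illustration.

```python
import json

# Minimal notebook following the nbformat 4 layout.
# The cell contents are made up for illustration.
notebook = {
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Hypothetical example notebook"],
        },
        {
            "cell_type": "code",
            "execution_count": None,  # the cell has not been executed yet
            "metadata": {},
            "outputs": [],
            "source": ["print('Hello, Jupyter')"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}


def code_sources(nb):
    """Return the source text of every code cell in a notebook dict."""
    return [
        "".join(cell["source"])
        for cell in nb["cells"]
        if cell["cell_type"] == "code"
    ]


# Round-trip through JSON text, as it would be stored in an .ipynb file.
serialised = json.dumps(notebook, indent=1)
restored = json.loads(serialised)

print(code_sources(restored))
```

In practice, the nbformat package would normally be used to read and validate notebooks; the sketch only shows that the format itself is ordinary JSON.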
At the moment, however, we do not yet cover how the platform can be expanded to include streaming pipelines and domain-driven data stores.

However, you can also create and run the examples in the Jupyter tutorial locally.

CHAPTER TWO

FIRST STEPS

2.1 Install Jupyter Notebook

2.1.1 Install Pipenv

pipenv is a dependency manager for Python projects. It uses pip to install Python packages, but it simplifies dependency management. Pipenv can be installed with pip; the --user flag should be used so that it is only installed for the current user. This prevents system-wide packages from being accidentally overwritten:

    $ python3 -m pip install --user pipenv
    Downloading pipenv-2018.7.1-py3-none-any.whl (5.0MB): 5.0MB downloaded
    Requirement already satisfied (use --upgrade to upgrade): virtualenv in /usr/lib/python3/dist-packages (from pipenv)
    Installing collected packages: pipenv, certifi, pip, setuptools, virtualenv-clone
    ...
    Successfully installed pipenv certifi pip setuptools virtualenv-clone
    Cleaning up...

Note: If Pipenv is not available in the shell after installation, the USER_BASE/bin directory may have to be added to PATH.

• Under Linux and macOS, USER_BASE can be determined with:

    $ python3 -m site --user-base
    /home/veit/.local

  Then bin has to be appended to this path and the result added to PATH. Alternatively, PATH can be set permanently by changing ~/.profile or ~/.bash_profile, in my case:

    export PATH=/home/veit/.local/bin:$PATH

• Under Windows, the directory can be determined with py -m site --user-site, and then site-packages can be replaced by Scripts. This then results in, for example:

    C:\Users\veit\AppData\Roaming\Python36\Scripts

  In order to be permanently available, this path can be entered under PATH in the control panel.

Further information on user-specific installation can