Release 0.8.0 Veit Schiele
Jupyter Tutorial
Release 0.8.0
Veit Schiele
Oct 27, 2020

CONTENTS

1 Introduction
  1.1 Status
  1.2 Target group
  1.3 Structure of the Jupyter tutorial
  1.4 Why Jupyter?
  1.5 Jupyter infrastructure
2 First steps
  2.1 Install Jupyter Notebook
  2.2 Create notebook
  2.3 Example
  2.4 Installation
  2.5 Follow us
  2.6 Pull-Requests
3 Workspace
  3.1 IPython
  3.2 Jupyter
4 Read and write data
  4.1 Requests
  4.2 BeautifulSoup
  4.3 Intake
  4.4 PostgreSQL
  4.5 NoSQL databases
  4.6 Glossary
5 Clean up and validate data
  5.1 Deduplicate data
  5.2 String matching
  5.3 Managing missing data with pandas
  5.4 Scikit Learn preprocessing
  5.5 Dask pipeline
  5.6 Data validation with voluptuous (schema definitions)
  5.7 Pandas DataFrame validation with Engarde
  5.8 TDDA: Test-Driven Data Analysis
  5.9 Hypothesis: property based testing
6 Visualise data
7 Refactoring
  7.1 Automating code quality
  7.2 Parallelise execution
8 Create a product
  8.1 Manage code with Git
  8.2 Manage data with DVC
  8.3 Create packages
  8.4 Document
  8.5 Licensing
  8.6 Citing
  8.7 Reproduce environments
  8.8 Testing
  8.9 Logging
9 Create web applications
  9.1 Dashboards
10 Index
Index

Jupyter notebooks are growing in popularity with data scientists and have become the de facto standard for rapid prototyping and exploratory analysis. They encourage experimentation and innovation, and they make the entire research process faster and more reliable. In addition, many components are being created that extend the original limits of their use and enable new applications.

CHAPTER ONE
INTRODUCTION

1.1 Status

1.2 Target group

The users of Jupyter notebooks are diverse, from data scientists to data engineers and analysts to system engineers. Their skills and workflows are very different. However, one of the great strengths of Jupyter notebooks is that they allow these different experts to work closely together in cross-functional teams.
• Data scientists conduct experiments with different coefficients and summarise the results.
• Data engineers check the quality of the code and make it more robust, efficient and scalable.
• Data analysts perform systematic studies of the data using code provided by data engineers.
• System engineers create the hub, the kernel, extensions, etc. and ensure that this infrastructure runs as smoothly as possible.

In this tutorial, we primarily address system engineers who want to build and operate a platform based on Jupyter notebooks. Then, we explain how this platform can be used effectively by data scientists, data engineers, and analysts.

1.3 Structure of the Jupyter tutorial

From Chapter 3, the Jupyter tutorial follows the prototype of a research project:

3. Set up the workspace with the installation and configuration of IPython and Jupyter with nbextensions and ipywidgets.
4. Collect data, either through a REST API or directly from an HTML page.
5. Cleaning up data is a recurring task that includes removing or modifying redundant, inconsistent, or incorrectly formatted data.
6. Analyse data through exploratory analysis and visualisation.
7. Refactoring includes parameterisation, validation and performance optimisation, including through parallelisation.
8. Creating a product includes testing, logging and documenting the methods and functions as well as creating packages.
9. Web applications can either generate dashboards from Jupyter notebooks or require more comprehensive application logic, as demonstrated in Embedding Bokeh plots in Flask, or provide data via a RESTful API.

1.4 Why Jupyter?

How can these diverse tasks be simplified? You will hardly find a tool that covers all of these tasks, and several tools are often required even for individual tasks.
Therefore, on a more abstract level, we are looking for more general patterns for tools and languages with which data can be analysed and visualised and a project can be documented and presented. This is exactly what we are aiming for with Project Jupyter.

The Jupyter project started in 2014 with the aim of creating a consistent set of open source tools for scientific research, reproducible workflows, computational narratives and data analysis. In 2017, Jupyter received the ACM Software System Award, a prestigious award that it shares with Unix and the web, among others.

To understand why Jupyter notebooks are so successful, let's take a closer look at the core functions:

Jupyter Notebook Format
    Jupyter Notebooks are an open, JSON-based document format with full records of the user's sessions and the code they contain.

Interactive Computing Protocol
    The notebook communicates with the computing kernel via the Interactive Computing Protocol, an open network protocol based on JSON data sent over ZMQ and WebSockets.

Kernels
    Kernels are processes that execute interactive code in a specific programming language and return the output to the user.

1.5 Jupyter infrastructure

A platform for the above-mentioned use cases requires an extensive infrastructure that not only provides kernels and supports the parameterisation, scheduling and parallelisation of notebooks, but also provides resources in a uniform way.

This tutorial provides a platform that enables fast, flexible and comprehensive data analysis beyond Jupyter notebooks. At the moment, however, we do not yet cover how it can be expanded to include streaming pipelines and domain-driven data stores.

You can, however, also create and run the examples in the Jupyter tutorial locally.

CHAPTER TWO
FIRST STEPS

2.1 Install Jupyter Notebook

2.1.1 Install Pipenv

pipenv is a dependency manager for Python projects.
It uses Pip to install Python packages, but it simplifies dependency management. Pip can be used to install Pipenv, but the --user flag should be used so that it is only available to that user. This prevents system-wide packages from being accidentally overwritten:

    $ python3 -m pip install --user pipenv
    Downloading pipenv-2018.7.1-py3-none-any.whl (5.0MB): 5.0MB downloaded
    Requirement already satisfied (use --upgrade to upgrade): virtualenv in /usr/lib/python3/dist-packages (from pipenv)
    Installing collected packages: pipenv, certifi, pip, setuptools, virtualenv-clone
    ...
    Successfully installed pipenv certifi pip setuptools virtualenv-clone
    Cleaning up...

Note: If Pipenv is not available in the shell after installation, the USER_BASE/bin directory may have to be added to PATH.

• Under Linux and macOS, USER_BASE can be determined with:

      $ python3 -m site --user-base
      /home/veit/.local

  Then bin has to be appended to this directory and the result added to PATH. Alternatively, PATH can be set permanently by changing ~/.profile or ~/.bash_profile, in my case:

      export PATH=/home/veit/.local/bin:$PATH

• Under Windows, the directory can be determined with py -m site --user-site, and then site-packages has to be replaced by Scripts. This results in, for example:

      C:\Users\veit\AppData\Roaming\Python36\Scripts

  To make it permanently available, this path can be entered under PATH in the control panel.

Further information on user-specific installation can be found in User Installs.

2.1.2 Create a virtual environment with Jupyter

Python virtual environments allow Python packages to be installed in an isolated location for a specific application, rather than installing them globally. Each environment thus has its own installation directories and does not share libraries with other virtual environments:

    $ mkdir myproject
    $ cd !$
    cd myproject
    $ pipenv install