UC Berkeley JupyterHubs Documentation

Division of Data Sciences Technical Staff

Sep 30, 2021

CONTENTS

1 Using DataHub
  1.1 Using DataHub

2 Modifying DataHub to fit your needs
  2.1 Contributing to DataHub


This repository contains configuration and documentation for the many JupyterHubs used by various organizations at UC Berkeley.


CHAPTER ONE

USING DATAHUB

1.1 Using DataHub

1.1.1 Services Offered

This page lists the various services we offer as part of DataHub. Not all of these will be available on all hubs, but we can easily enable them as you wish.

User Interfaces

Our diverse user population has diverse needs, so we offer many different user interfaces for instructors to choose from.

Jupyter Notebook (Classic)

What many people mean when they say ‘Jupyter’, this familiar interface is used by default for most of our introductory classes. Document oriented, no-frills, and well known by a lot of people.


RStudio

We want to provide first-class support for teaching with R, which means providing strong support for RStudio. This includes Shiny support. Try without a berkeley.edu account: Try with a berkeley.edu account: R DataHub


JupyterLab

JupyterLab is a more modern version of the classic Jupyter notebook from the Jupyter project. It is more customizable and better supports some advanced use cases. Many of our more advanced classes use it, and we might help all classes move to it once a simpler document-oriented mode is available.


Linux Desktop (Experimental)

Sometimes you just need to use something that requires a full desktop environment to run. Instead of trying to get students to install things locally, we offer a full-fledged Linux Desktop environment they can access from inside their browser! This is just a different 'UI' on the same infrastructure as the notebook environment, so they all use the same libraries and home directories. Try without a berkeley.edu account: Try with a berkeley.edu account: EECS DataHub


Visual Studio Code (Experimental)

Sometimes you just want an IDE, not a notebook environment. We are experimenting with a hosted web version of the popular Visual Studio Code editor, to see if it would be useful for teaching more traditional CS classes. Try without a berkeley.edu account: Try with a berkeley.edu account: EECS DataHub

SSH & SFTP (Experimental)

You can access the same environments and home directories via the terminal, using traditional ssh and sftp programs. See here for more documentation.


More?

If you have a web based environment, we can almost certainly make it run under a hub. Contact us and we’ll see what we can do :)

Services

Sometimes you need something custom to get your class going. Very very interesting things can happen here, so we’re always looking for new services to add.

Postgresql

Some of our classes require using real databases to teach. We now experimentally offer a PostgreSQL server for each user on the Data 100 hub. The data does not persist right now, but we can turn persistence on whenever needed.

Programming languages

We support the usual suspects - Python, R & Julia. However, there are no limits to what languages we can actually support, so if you are planning on using a different (open source) programming language, contact us and we'll set you up.

More?

We want to find solutions to your interesting problems, so please bring us your interesting problems.

1.1.2 Accessing private GitHub repos

GitHub is used to store class materials (lab notebooks, lecture notebooks, etc), and nbgitpuller is used to distribute them to students. By default, nbgitpuller only supports public GitHub repositories. However, Berkeley's JupyterHubs are set up to allow pulling from private repositories as well. Public repositories are still preferred, but if you want to distribute a private repository to your students, you can do so.

1. Go to the GitHub app for the hub you are interested in.
   1. R Hub
   2. DataHub
   3. PublicHealth Hub
   4. Open an issue if you want more hubs supported.
2. Click the 'Install' button.
3. Select the organization / user containing the private repository you want to distribute on the JupyterHub. If you are not the owner or administrator of this organization, you might need extra permissions to do this.
4. Select 'Only select repositories', and below that select the private repositories you want to distribute to this JupyterHub.


5. Click the 'Install' button. The JupyterHub you picked now has access to this private repository. You can revoke this anytime by coming back to this page and removing the repo from the list of allowed repos. You can also uninstall the GitHub app entirely.
6. You can now make a link for your repo at nbgitpuller.link. If you have just created your repo, you might have to specify main instead of master for the branch name, since GitHub recently changed the name of the default branch.

That's it! You're all set. You can distribute these links to your students, and they'll be able to access your materials! You can also use more traditional methods (like the git command line tool, or RStudio's git interface) to access this repo.

Note: Everyone on the selected JupyterHub can clone your private repo if you do this. They won't be able to see that this repo exists, but if they get their hands on your nbgitpuller link they can fetch it too. More fine-grained permissions are coming soon.
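For reference, links generated at nbgitpuller.link follow a pattern like the one below. This is only an illustrative sketch - the organization, repository, and notebook names are hypothetical - so generate your real links through nbgitpuller.link rather than constructing them by hand.

    https://datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fmy-org%2Fmy-private-repo&branch=main&urlpath=tree%2Fmy-private-repo%2Flab01.ipynb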

1.1.3 JupyterHubs in this repository

DataHub

datahub.berkeley.edu is the 'main' JupyterHub for use on the UC Berkeley campus. It's the largest and most active hub, and has many Python & R packages installed. It runs on Google Cloud Platform in the ucb-datahub-2018 project. You can see all config for it under deployments/datahub.

Classes

• The big data8 class
• Active connector courses
• Data Science Modules
• Astro 128/256

This hub is also the 'default' when folks want to use a hub for a short period of time for any reason, without super specific requirements.

Prob140 Hub

A hub specifically for prob140. Some of the admin users on DataHub are students in prob140 - this would allow them to see the work of other prob140 students. Hence, this hub is kept separate until JupyterHub gains features around restricting admin access. It runs on Google Cloud Platform in the ucb-datahub-2018 project. You can see all config for it under deployments/prob140.


Data 100

This hub is for Data 100, which has a unique user and grading environment. It runs on Google Cloud Platform in the ucb-datahub-2018 project. You can see all config for it under deployments/data100. Data 100 also has shared folders between staff (professors and GSIs) and students. Staff, assuming they have been added as admins in config/common.yaml, can see a shared and a shared-readwrite folder. Students can only see the shared folder, which is read-only. Anything that gets put in shared-readwrite is automatically viewable in shared, but as read-only files. The purpose of this is to be able to share large data files, instead of having one per student.

Data 102

Data 102 runs on Google Cloud Platform in the ucb-datahub-2018 project. You can see all config for it under deployments/data102.

Data8X Hub

A hub for the data8x course on edX. This hub is open to use by anyone in the world, using LTI Authentication to provide login capability from inside edX. It runs on Google Cloud Platform in the data8x-scratch project. You can see all config for it under deployments/data8x.

1.1.4 User Authentication

UC Berkeley uses a Canvas instance, called bcourses.berkeley.edu. Almost all our hubs use this for authentication, although not all yet (issue).

Who has access?

Anyone who can log in to bcourses can log in to our JupyterHubs. This includes all Berkeley affiliates. If you have a working berkeley.edu email account, you can most likely log in to bcourses, and hence to our JupyterHubs. Students have access for 9 months after they graduate. If they have an incomplete, they have 13 months of access instead.

Non-berkeley affiliates

If someone who doesn't have a berkeley.edu account wants to use the JupyterHubs, they need to get a CalNet Sponsored Guest account. This gives them access to bcourses, and hence to all the JupyterHubs.

Troubleshooting

If you can log in to bcourses but not to any of the JupyterHubs, please contact us. If you cannot log in to bcourses, please contact bcourses support.


1.1.5 Storage Retention Policy

Policy

Criteria

No non-hidden files in the user's home directory have been modified in the last 12 months.

Archival

1. Zip the whole home directory.
2. Upload it to the Google Drive of a SPA created for this purpose.
3. Share the ZIP file in the Google Drive with the user.
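A minimal sketch of what steps 1 and 2 could look like for a single user; the export path, archive naming, and the use of rclone with a remote called spa-gdrive are assumptions for illustration, not the actual tooling or paths we use.

    #!/bin/bash
    # Sketch: archive one stale home directory and upload it to Google Drive.
    # Assumes an rclone remote named "spa-gdrive" is already configured for the
    # SPA's Google Drive; all names and paths here are hypothetical.
    set -euo pipefail

    USERNAME="$1"
    HOMEDIR="/export/homedirs/datahub/${USERNAME}"
    ARCHIVE="/tmp/${USERNAME}-$(date +%Y%m%d).zip"

    # 1. Zip the whole home directory
    zip -r "${ARCHIVE}" "${HOMEDIR}"

    # 2. Upload it to the SPA's Google Drive
    rclone copy "${ARCHIVE}" "spa-gdrive:datahub-archives/"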

Rationale

Today (6 Feb 2020), we have 18,623 home directories in datahub. Most of these users used datahub in previous semesters, have not logged in for a long time, and will probably never log in again. This costs us a lot of money in disk space - we would have to expand disk space forever. By cleaning up after 12 months of non-usage, we will not affect any current users - just folks who haven't logged in for a long time. Archiving the contents makes sure people still have access to their old work, without leaving the burden of maintaining it forever on us.

Why Google Drive?

For UC Berkeley users, Google Drive offers unlimited free space. We can also perform access control easily with Google Drive.

Alternatives

1. Email it to our users. This will most likely be rejected by most mail servers, as the home directory will be too big an attachment.
2. Put it in Google Cloud Nearline storage, build a token-based access control mechanism on top, and email this link to the users. We would probably need to clean this up every 18 months or so for cost reasons. This is the viable alternative if we decide not to use Google Drive.

1.1.6 SSH & SFTP service

All the hubs offer experimental SSH service for interactive use, and SFTP for file transfer, via a deployment of jupyterhub-ssh. SSH access provides the exact same environment - packages, home directory, etc - as web based access to JupyterHub. This allows for pedagogical uses where web based access & terminal based access are both used, with the same infrastructure (authentication, clusters, cloud resources, etc) backing both. SFTP also lets you copy files in / out of the same home directories, allowing for fast transfer of large amounts of files for web based or ssh based use.


Accessing SSH

1. Create a JupyterHub authentication token, which you can use as the password. The URL for this depends on the hub you are trying to access, and is https://<hub-url>/hub/token. Common hubs include:

   • datahub.berkeley.edu
   • r.datahub.berkeley.edu
   • eecs.datahub.berkeley.edu
   • biology.datahub.berkeley.edu
   • prob140.datahub.berkeley.edu
   • workshop.datahub.berkeley.edu
   • julia.datahub.berkeley.edu
   • highschool.datahub.berkeley.edu

   NOTE: This token is your password, and anyone with it can access all your files, including any assignments you might have. If you are an admin on any of these hubs, they can use it to access the space of anyone else on that hub, so please treat it with extreme care.

2. Open your terminal, and run the following:

       ssh <username>@<hub-name>

   The <username> is the same as your CalNet username in most places - the part before the @ in your berkeley.edu email. For a small minority of users, this is different - you can confirm it in the JupyterHub control panel, on the top right. <hub-name> refers to the hub you are trying to log on to - datahub.berkeley.edu or r.datahub.berkeley.edu, etc.

3. When asked for the password, provide the token generated in step 1.

4. This should give you an interactive terminal! You can do anything you would generally do interactively via ssh - run editors, fully interactive programs, use the command line, etc. Some features - non-interactive command running, tunneling, etc - are currently unavailable.
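As a concrete example with a hypothetical username (oski), connecting to the main hub would look like this; paste the token from step 1 when prompted for the password.

    ssh oski@datahub.berkeley.edu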

Accessing SFTP

SFTP lets you transfer files to and from your home directory on the hubs. The steps are almost exactly the same as for accessing SSH. The one difference is that the port used is 2222, rather than the default port of 22. If you are using a GUI program for SFTP, you will need to specify the port explicitly there. If you are using the command line sftp program, the invocation is something like sftp -oPort=2222 <username>@<hub-name>.
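For example, again with the hypothetical username oski and purely illustrative file names, an interactive session that downloads a notebook and uploads a data file might look like this:

    sftp -oPort=2222 oski@datahub.berkeley.edu
    sftp> get lab01.ipynb
    sftp> put data.csv
    sftp> exit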

CHAPTER TWO

MODIFYING DATAHUB TO FIT YOUR NEEDS

Our infrastructure can serve the diverse needs of our students only if it is built by a diverse array of people.

2.1 Contributing to DataHub

2.1.1 Pre-requisites

Smoothly working with the JupyterHubs maintained in this repository requires a number of pre-requisite skills. The rest of the documentation assumes you have at least a basic level of these skills, and know how to get help related to these technologies when necessary.

Basic

These skills let you interact with the repository in a basic manner. This lets you do most 'self-service' tasks - such as adding admin users, adding libraries, or making changes to resource allocation. This doesn't give you any skills to debug things when they break, however.

1. Basic git & GitHub skills. The Git Book & GitHub Help are good resources for this.
2. Familiarity with YAML syntax.
3. Understanding of how packages are installed in the languages we support.
4. Rights to merge changes into this repository on GitHub.

Full

In addition to the basic skills, you'll need the following skills to 'fully' work with this repository. Primarily, you need these to debug issues when things break - since we strive to never have things break in the same way more than twice.

1. Knowledge of our tech stack:
   1. Kubernetes
   2. Google Cloud
   3. Helm
   4. Docker
   5. repo2docker
   6. Jupyter
   7. Languages we support: Python & R
2. Understanding of our JupyterHub distribution, Zero to JupyterHub.
3. Full access to the various cloud providers we use.

2.1.2 Repository Structure

Hub Configuration

Each hub has a directory under deployments/ where all configuration for that particular hub is stored in a standard format. For example, all the configuration for the primary hub used on campus (datahub) is stored under deployments/datahub/.

User Image (image/)

The contents of the image/ directory determine the environment provided to the user. For example, it controls:

1. Versions of Python / R / Julia available
2. Libraries installed, and which versions of those are installed
3. Specific config for Jupyter Notebook or IPython

repo2docker is used to build the actual user image, so you can use any of its supported config files to customize the image as you wish.
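As a minimal sketch - the packages and versions below are hypothetical, not what any of our hubs actually install - an image/environment.yml for repo2docker could look like:

    # image/environment.yml (hypothetical contents)
    dependencies:
      - python=3.8
      - numpy
      - pandas

Other repo2docker config files (apt.txt for Ubuntu packages, postBuild for arbitrary setup scripts, etc.) work the same way: drop them in image/ and they are picked up the next time the image is built.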

Hub Config (config/ and secrets/)

All our JupyterHubs are based on Zero to JupyterHub (z2jh). z2jh uses configuration files in YAML format to specify exactly how the hub is configured. For example, it controls:

1. RAM available per user
2. Admin user lists
3. User storage information
4. Per-class & per-user RAM overrides (when classes or individuals need more RAM)
5. Authentication secret keys

These files are split between files that are visible to everyone (config/) and files that are visible only to a select few illuminati (secrets/). To get access to the secret files, please consult the illuminati. Files are further split into:

1. common.yaml - Configuration common to staging and production instances of this hub. Most config should be here.
2. staging.yaml - Configuration specific to the staging instance of the hub.
3. prod.yaml - Configuration specific to the production instance of the hub.
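Conceptually, these layered files are applied in order, with later files overriding earlier ones. A sketch of what that amounts to when deploying the staging instance with helm is shown below; in practice hubploy (described next) runs this for us, and the release name and exact secrets file path here are assumptions for illustration.

    # Sketch only - hubploy performs the equivalent of this layering for us.
    helm upgrade --install datahub-staging jupyterhub/jupyterhub \
      -f deployments/datahub/config/common.yaml \
      -f deployments/datahub/config/staging.yaml \
      -f deployments/datahub/secrets/staging.yaml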

hubploy.yaml

We use hubploy to deploy our hubs in a repeatable fashion. hubploy.yaml contains the information required for hubploy to work - such as cluster name, region, provider, etc. Various secret keys used to authenticate to cloud providers are kept under secrets/ and referred to from hubploy.yaml.

Documentation

Documentation is under the docs/ folder, and is generated with the Sphinx project. It is written in the reStructuredText (rst) format. Documentation is automatically published to https://ucb-jupyterhubs.readthedocs.io/.

2.1.3 User home directory storage

All users on all the hubs get a home directory with persistent storage.

Why NFS?

NFS isn't a particularly cloud-native technology. It isn't highly available nor fault tolerant by default, and is a single point of failure. However, it is currently the best of the alternatives available for user home directories, and so we use it.

1. Home directories need to be fully POSIX compliant file systems that work with minimal edge cases, since this is what most instructional code assumes. This rules out object-store backed filesystems such as s3fs.
2. Users don't usually need guaranteed space or IOPS, so providing each of them a persistent cloud disk gets unnecessarily expensive - since we are paying for it whether it is used or not. When we did use one persistent disk per user, the storage cost dwarfed everything else by an order of magnitude, for no apparent benefit. Attaching cloud disks to user pods also takes on average about 30s on Google Cloud, and much longer on Azure. NFS mounts pretty quickly, getting this down to a second or less.

We'll probably be on some form of NFS for the foreseeable future.

NFS Server

We currently have two approaches to running NFS servers.

1. Run a hand-maintained NFS server with ZFS SSD disks. This gives us control over performance, size and, most importantly, server options. We use anonuid=1000, so all reads / writes from the cluster are treated as if they have uid 1000, which is the uid all user processes run as. This prevents us from having to muck about with permissions & chowns - particularly since Kubernetes creates new directories on volumes as root with strict permissions (see issue). A sketch of the export options is shown below.
2. Use a hosted NFS service like Google Cloud Filestore. We do not have to perform any maintenance if we use this - but we have no control over the host machine either. This necessitates some extra work to deal with the permission issues - see jupyterhub.singleuser.initContainers in the common.yaml of a hub that uses this method.


Right now, every hub except data8x is using the first approach - primarily because Google Cloud Filestore was not available when they were first set up. data8x is using the second approach, and if it proves reliable we will switch everything to it the next semester.
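For reference, the relevant line of a hand-maintained server's /etc/exports could look something like the sketch below. The export path and client range are hypothetical; the important part is all_squash together with anonuid/anongid set to 1000, so that everything is read and written as the single uid the notebook processes run as.

    # /etc/exports (sketch - the path and CIDR are hypothetical)
    /export/pool0/homes  10.0.0.0/8(rw,sync,no_subtree_check,all_squash,anonuid=1000,anongid=1000)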

Home directory paths

Each user on each hub gets their own directory on the server that gets treated as their home directory. The staging & prod servers share home directory paths, so users get the same home directories on both. For most hubs, the user's home directory path relative to the exported NFS directory is <hub-name>/home/<username>. Prefixing the path with the name of the hub allows us to use the same NFS share for many hubs.

NFS Client

We currently have two approaches for mounting the user's home directory into each user's pod.

1. Mount the NFS share once per node to a well known location, and use hostPath volumes with a subPath on the user pod to mount the correct directory into the user pod. This lets us get away with one NFS mount per node, rather than one per pod. See hub/templates/nfs-mounter.yaml to see how we mount this on the nodes. It's a bit of a hack, and if we want to keep using this method it should be turned into a CSI driver instead.
2. Use the Kubernetes NFS volume provider. This doesn't require hacks, but leads to at least 2 NFS mounts per user per node, often leading to hundreds of NFS mounts per node. This might or might not be a problem.

Most hubs use the first method, while data8x is trialing the second. If it goes well, we might switch to using the second method for everything.

We also try to mount everything as soft, since we would rather have a write fail than have processes go into uninterruptible sleep mode (D), where they usually cannot be killed, when the NFS server runs into issues.

2.1.4 Kubernetes Cluster Configuration

We use Kubernetes to run our JupyterHubs. It has a healthy open source community, managed offerings from multiple vendors & a fast pace of development. By running on top of Kubernetes we can easily run on many different cloud providers with similar config, so it is also our cloud agnostic abstraction layer. We prefer using a managed Kubernetes service (such as Google Kubernetes Engine). This document lays out our preferred cluster configuration on various cloud providers.

Google Kubernetes Engine

In our experience, Google Kubernetes Engine (GKE) has been the most stable, performant, and reliable managed Kubernetes service. We prefer running on this when possible. A gcloud container clusters create command can succinctly express the configuration of our Kubernetes cluster. The following commands represent the currently favored configuration.


    gcloud container clusters create \
      --enable-ip-alias \
      --enable-autoscaling \
      --max-nodes=20 --min-nodes=1 \
      --region=us-central1 --node-locations=us-central1-b \
      --image-type=cos \
      --disk-size=100 --disk-type=pd-balanced \
      --machine-type=n1-highmem-8 \
      --cluster-version latest \
      --no-enable-autoupgrade \
      --enable-network-policy \
      --create-subnetwork="" \
      --tags=hub-cluster \
      <cluster-name>

    gcloud container node-pools create \
      --machine-type n1-highmem-8 \
      --num-nodes 2 \
      --enable-autoscaling \
      --min-nodes 1 --max-nodes 20 \
      --node-labels hub.jupyter.org/pool-name=beta-pool \
      --node-taints hub.jupyter.org_dedicated=user:NoSchedule \
      --region=us-central1 \
      --image-type=cos \
      --disk-size=200 --disk-type=pd-balanced \
      --no-enable-autoupgrade \
      --tags=hub-cluster \
      --cluster=fall-2019 \
      user-pool-<name>

IP Aliasing

--enable-ip-alias creates VPC-native clusters. This will become the default soon, and the flag can be removed once it is.

Autoscaling

We use the Kubernetes cluster autoscaler to scale our node count up and down based on demand. It waits until the cluster is completely full before triggering creation of a new node - but that's ok, since new node creation on GKE is pretty quick. --enable-autoscaling turns the cluster autoscaler on. --min-nodes sets the minimum number of nodes that will be maintained regardless of demand. This should ideally be 2, to give us some headroom for quick starts without requiring scale-ups when the cluster is completely empty. --max-nodes sets the maximum number of nodes that the cluster autoscaler will use - this sets the maximum number of concurrent users we can support. This should be set to a reasonably high number, but not too high - to protect against runaway creation of hundreds of VMs that might drain all our credits due to an accident or security breach.


Highly available master

The Kubernetes cluster's master nodes are managed by Google Cloud automatically. By default, they are deployed in a non-highly-available configuration - only one node. This means that upgrades and master configuration changes cause a few minutes of downtime for the Kubernetes API, causing new user server starts / stops to fail. We request highly available masters with the --region parameter. This specifies the region whose zones our 3 master nodes will be spread across. It costs us extra, but it is totally worth it. By default, asking for highly available masters also asks for 3x the node count, spread across multiple zones. We don't want that, since all our user pods have in-memory state & can't be relocated. Specifying --node-locations explicitly lets us control how many and which zones the nodes are located in.

Region / Zone selection

We generally use the us-central1 region and a zone in it for our clusters - simply because that is where we have asked for quota. There are regions closer to us, but latency hasn’t really mattered so we are currently still in us-central1. There are also unsubstantiated rumors that us-central1 is their biggest data center and hence less likely to run out of quota.

Disk Size

--disk-size sets the size of the root disk on all the Kubernetes nodes. This isn't used for any persistent storage, such as user home directories. It is only used ephemerally for the operation of the cluster - primarily storing docker images and other temporary storage. We can make this larger if we use a large number of big images, or if we want our image pulls to be faster (since disk performance increases with disk size). --disk-type=pd-standard gives us standard spinning disks, which are cheaper. We can also request SSDs instead with --disk-type=pd-ssd - much faster, but also much more expensive. We compromise with --disk-type=pd-balanced: faster than spinning disks, though not always as fast as SSDs.

Node size

--machine-type lets us select how much RAM and CPU each of our nodes has. For non-trivial hubs, we generally pick n1-highmem-8, with 52G of RAM and 8 cores. This is based on the following heuristics:

1. Students are generally more memory limited than CPU limited. In fact, while we have a hard limit on memory use per user pod, we do not have a CPU limit - it hasn't proven necessary.
2. We try to overprovision clusters by about 2x - so we try to fit about 100G of total RAM use on a node with about 50G of RAM. This is accomplished by setting the memory request to be about half of the memory limit on user pods (see the sketch below). This leads to massive cost savings, and works out ok.
3. There is a Kubernetes limit of 100 pods per node.

Based on these heuristics, n1-highmem-8 seems to be the most bang for the buck currently. We should revisit this for every cluster creation.
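A minimal sketch of how that 2x overprovisioning could be expressed in a hub's z2jh config; the numbers are illustrative, not the values any particular hub uses. In z2jh, singleuser.memory.guarantee becomes the pod's memory request and singleuser.memory.limit its hard cap, so a guarantee of roughly half the limit is what lets ~100G of limits fit on a ~50G node.

    # snippet of a hub's config/common.yaml (illustrative values)
    singleuser:
      memory:
        limit: 2G
        guarantee: 1G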


Cluster version

GKE automatically upgrades cluster masters, so there is generally no harm in being on the latest version available.

Node autoupgrades

When node autoupgrades are enabled, GKE will automatically try to upgrade our nodes whenever needed (our GKE version falling off the support window, security issues, etc). However, since we run stateful workloads, we disable this right now so we can do the upgrades manually.

Network Policy

Kubernetes Network Policy lets you firewall internal access inside a Kubernetes cluster, whitelisting only the flows you want. The JupyterHub chart we use supports setting up the appropriate NetworkPolicy objects it needs, so we should turn it on for an additional layer of security. Note that any extra in-cluster services we run must have a NetworkPolicy set up for them to work reliably.

Subnetwork

We put each cluster in its own subnetwork, since there seems to be a limit on how many clusters you can create in the same network with IP aliasing on - you just run out of addresses. This also gives us some isolation - subnetworks are isolated by default and can't reach other resources. You must add firewall rules to provide access, including access to any manually run NFS servers. We add tags for this.

Tags

To help with firewalling, we add network tags to all our cluster nodes. This lets us add firewall rules to control traffic between subnetworks, as sketched below.
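For example, a rule like the following would let nodes tagged hub-cluster reach a manually run NFS server; the rule name, network name, and nfs-server tag are hypothetical, not the ones actually in use.

    # Sketch: allow NFS traffic from cluster nodes to NFS servers, matched by network tags.
    gcloud compute firewall-rules create allow-hub-to-nfs \
      --network=my-hub-network \
      --source-tags=hub-cluster \
      --target-tags=nfs-server \
      --allow=tcp:2049,tcp:111,udp:111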

Cluster name

We try to use a descriptive name as much as possible.

2.1.5 Cloud Credentials

Google Cloud

Service Accounts

Service accounts are identified by a service key, and help us grant specific access to an automated process. Our CI process needs two service accounts to operate:

1. A gcr-readwrite key. This is used to build and push the user images. Based on the docs, this is assigned the role roles/storage.admin.
2. A gke key. This is used to interact with the Google Kubernetes Engine cluster. The roles roles/container.clusterViewer and roles/container.developer are granted to it.


These are currently copied into the secrets/ dir of every deployment, and explicitly referenced from hubploy.yaml in each deployment. They should be rotated every few months. You can create service accounts through the web console or the command line, as sketched below. Remember not to leave copies of the private key lying around on your local computer!
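A minimal command line sketch of creating such a key - the service account name and key file name are placeholders for illustration, not the exact ones this repository uses:

    # Sketch: create a service account, grant it the GKE roles mentioned above,
    # and download a JSON key for use from hubploy.yaml. Names are placeholders.
    PROJECT=ucb-datahub-2018
    SA=my-hub-deployer

    gcloud iam service-accounts create "${SA}" --project "${PROJECT}"

    gcloud projects add-iam-policy-binding "${PROJECT}" \
      --member "serviceAccount:${SA}@${PROJECT}.iam.gserviceaccount.com" \
      --role roles/container.clusterViewer
    gcloud projects add-iam-policy-binding "${PROJECT}" \
      --member "serviceAccount:${SA}@${PROJECT}.iam.gserviceaccount.com" \
      --role roles/container.developer

    gcloud iam service-accounts keys create gke-key.json \
      --iam-account "${SA}@${PROJECT}.iam.gserviceaccount.com"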

2.1.6 Incident reports

Blameless incident reports are very important for long term sustainability of resilient infrastructure. We publish them here for transparency, and so we may learn from them for future incidents.

2017-02-09 - JupyterHub db manual overwrite

Summary

Datahub was reported down at 1 AM. Users attempting to log in to datahub were greeted with a proxy error. The hub pod was up, but its log was full of sqlite errors. After the hub pod was deleted and a new one came up, students logging in to datahub found their notebooks missing and their home directories empty. Once this was fixed, some students were still being logged in as one particular other user. Finally, students with a '.' in their username were still having issues after everyone else was fine. This was all fixed and an all-clear signalled at about 11:35 AM on 2017-02-09.

Timeline

2017-02-09 00:25 - 00:29 AM

While attempting to debug some earlier 400 errors, we tried setting base_url and ip to something incorrect, to see if it would cause a problem:

    kubectl exec hub-deployment-something --namespace=datahub -it bash
    apt-get install sqlite3
    sqlite3

    ATTACH 'jupyterhub.sqlite' AS my_db;
    SELECT name FROM my_db.sqlite_master WHERE type='table';
    SELECT * FROM servers;
    SELECT * FROM servers WHERE base_url LIKE '%%';
    UPDATE servers SET ip='' WHERE base_url LIKE '%%';
    UPDATE servers SET base_url='/' WHERE base_url LIKE '%%';

Then Ctrl+D (exit back into the bash shell). We checked datahub.berkeley.edu, and nothing had happened to the account. Seeing that the sql db was not updated, we attempted to run .save:

    sqlite3
    .save jupyterhub.sqlite

This replaced the db with an empty one, since ATTACH was not run beforehand.


0:25:59 AM

The following exception shows up in the hub logs:

    sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such table: proxies
    [SQL: 'SELECT proxies.id AS proxies_id, proxies._public_server_id AS proxies__public_server_id,
    proxies._api_server_id AS proxies__api_server_id \nFROM proxies \nWHERE proxies.id = ?']
    [parameters: (1,)]

This continues for the hub table as well, since those two seem to be the most frequently used.

1:12 AM

Sam's roommate notices that he can log in to datahub but all his notebooks are gone. We notice that there are only ~50 users on the JHub admin panel when there used to be ~1000, so we believe the JHub sqlite user database got wiped/corrupted; JupyterHub then created a new account for his roommate when he logged in, and created a new persistent disk since it had lost track of his old one. This is confirmed soon after:

    $ kubectl --namespace=datahub get pvc | grep <username>
    claim-<username>-257   Bound   pvc-3b405e13-ddb4-11e6-98ef-42010af000c3   10Gi   RWO   21d
    claim-<username>-51    Bound   pvc-643dd900-eea7-11e6-a291-42010af000c3   10Gi   RWO   5m

1:28 AM

We shut down the hub pod by scaling the replicas to 0. We then begin recreating the JHub sqlite database by taking the Kubernetes PVCs and matching them back with the user ids. We could do this because the name of the PVC contains a sanitized form of the username and the userid. Here's the notebook that was used to recreate the db from the PVCs: 2017-02-09-datahub-db-outage-pvc-recreate-script.ipynb

2:34 AM

We recreate the sqlite3 database. Initially each user’s cookie_id was set to a dummy cookie value.

2:42 AM

User cookie_id values are changed to null rather than the dummy value. The sqlite file is then attached back to datahub. The number of users shown on the admin page is back to ~1000. The hub was up, and a spot check of starting other users' servers seemed to work. Some users get redirected to one particular user, but deleting and recreating the affected user seems to fix this.


10:11 AM

Attempt to log everyone out by changing the cookie secret in the hub pod at /srv/jupyterhub/jupyterhub_cookie_secret. Just one character near the end was changed, and the pod restarted. No effect. One character at the beginning of the secret was changed next, and the pod restarted - this caused an actual change, and logged all users out. People are still being redirected to one particular user's account when they log in. More looking around required.

10:17 AM

John Denero advises students to use ds8.berkeley.edu right now. ds8.berkeley.edu promptly starts crashing because it does not have resources for a data8 level class.

10:29 AM

All user pods are deleted, which finally properly logs everyone out. However, people logging in are still all getting the same user’s pods.

10:36 AM

Notice that the cookie_id column in the user database table is empty for many users, and the user that everyone is being logged in as has an empty cookie_id too and is the 'first' in the table when sorted ascending by id. Looking at the JupyterHub code, cookie_id is always supposed to be set to a uuid, and never supposed to be empty. Setting cookie_id for users fixes their issues, and seems to spawn them into their own notebooks.

10:45 AM

A script is run that populates cookie_id for all users, and restarts the hub to make sure there’s no stale cache in RAM. All user pods are deleted again. Most users are back online now! More users start testing and confirming things are working for them.

10:53 AM

A user with a '.' in their name reports that they're getting an empty home directory. More investigation shows two users - one with a '.' in their name that is newer, and one with a '-' in their name instead of a '.' that is older. The hypothesis is that one of them is the 'original', but these users are all attaching to a new one that is empty. Looking at PVCs confirms this - there are two PVCs for each user with a '.' in their name who has tried to log in, and they differ only by id. There is some confusion about users ending up on prob140, because the data8.org homework link was changed to use that temporarily.


11:05 AM

Directly modifying the user table to rename the user with the ‘-‘ in the name to have a ‘.’ seems to work for people.

11:15 AM

A script is run that modifies the database user table for all users with a '-' in their name, replacing the '-' with a '.'. The new users created with the '.' in their name are dropped before this.

11:17 AM

All clear given for datahub.berkeley.edu

11:19 AM

Locally verified that running .save on sqlite3 will overwrite the db file without any confirmation, and this is most likely the cause of the issue.

Conclusion

Accidental overwriting of the sqlite file during a routine debugging operation led to all tables being deleted. Users were getting new user ids when they logged in, causing them to get new disks provisioned - and these disks were empty. During reconstruction of the db, cookie_id was missing for several users, causing them all to log in to one particular user's notebook. Users with a '.' in their name were also set up slightly incorrectly - their pods have '-' in them, but the user name should have a '.'.

Action items

Upstream bug reports for JupyterHub

1. JupyterHub only uses a certain length of the cookie secret, and discards the rest. This causes confusion when trying to change it to log people out. Issue
2. The cookie_id column in the users table should have UNIQUE and NOT NULL constraints. Issue

Upstream bug reports for KubeSpawner

1. Support using username hashes in PVC and Pod Names rather than user ids, so that pod and PVC names remain constant even when DB is deleted. Issue

Upstream bug reports for OAuthenticator

1. Support setting id of user in user table to be same as ‘id’ provided by Google authenticator, thus providing a stable userid regardless of when the user first logged in. Issue


DataHub deployment changes

1. Switch to using Google Cloud SQL, which provides a hosted and managed MySQL database
2. Perform regular and tested backups of the database
3. Start writing an operational FAQ for things to do and not do
4. Set up better monitoring and paging systems
5. Document escalation procedures explicitly

2017-02-24 - Custom Autoscaler gone haywire

Summary

On the evening of February 24, 2017, a premature version of the autoscaler script for the Datahub deployment was mistakenly run against the prod cluster, resulting in a large number of nodes (roughly 30-40) being set as unschedulable for about 20 minutes. Though no information was lost and no service was critically disturbed, it was necessary to manually re-enable these nodes for scheduling.

Timeline

As of this commit in the Autoscaler branch history, there exists a scale.py file that would, based on the utilization of the cluster, mark a certain number of nodes unschedulable before attempting to shut down nodes with no pods on them. Unfortunately, this script was executed prematurely and without configuration, so it executed against whatever context was currently specified in .kube/config, which ended up being the production cluster rather than the dev cluster.

2017-02-24 11:14 PM

Script is mistakenly executed. A bug in the calculations for the utilization of the cluster leads to about 40 nodes being marked as unschedulable. The mistake is noted immediately.

2017-02-24 11:26 PM

The unschedulability of these nodes is reverted. All nodes in the cluster were first set back to schedulable, to ensure that no students, current or future, would be disturbed. Immediately after, 10 of the most idle nodes on the cluster were manually set to be unschedulable using kubectl cordon <node-name>, to facilitate later manually descaling them (to deal with https://github.com/data-8/infrastructure/issues/6).

Conclusion

A cluster autoscaler script was accidentally run against the production cluster instead of the dev cluster, reducing capacity for new user logins for about 12 minutes. There was still enough capacity so we had no adverse effects.


Action Items

Datahub Deployment Changes

1. The Autoscaler should not be run unless the context is explicitly set via environment variables or command line arguments. This is noted in the comments of the pull request for the Autoscaler.
2. The idea of the 'current context' should be abolished in all the tools we build / read.

Future organizational change

1. Use a separate billing account for production vs development clusters. This makes it harder to accidentally run things on the wrong cluster.

2017-02-24 - Proxy eviction strands user

Summary

On the evening of Feb 23, several students started experiencing 500 errors when trying to access datahub. The proxy had died because of a known issue, and it took a while for the hub to re-add all the user routes to the proxy. Some students needed their servers to be manually restarted, due to a JupyterHub spawner bug that is showing up at scale. Everything was fixed in about 40 minutes.

Timeline

All times in PST

21:10:57

The proxy pod is evicted, due to a known issue that is currently being worked on. Users start running into issues now, with connection failures.

21:11:04

A new proxy pod is started by Kubernetes, and starts accepting connections. However, the JupyterHub model currently has the proxy starting with no state about user routes, and so users' requests aren't being routed to their notebook pods. This manifests as errors for users. The hub process is supposed to poll the proxy every 300s, and repopulate the route table when it notices it is empty. The hub does this at some point in the next 300s (we do not know exactly when), and starts repopulating the route table. As routes get added for current users, their notebooks start working again.


21:11:52

The repopulate process starts running into issues - it is making far too many HTTP requests (to the Kubernetes and proxy APIs), so many that it starts hitting client side limits on the HTTP client (which is what we use to make these requests). This causes requests to time out on the request queue. We were running into https://github.com/tornadoweb/tornado/issues/1400. Not all requests fail - for those that succeed, the students are able to access their notebooks. The repopulate process takes a while, and errors out for a lot of students, who are left with notebooks in an inconsistent state - JupyterHub thinks their notebook is running but it isn't, or vice versa. Lots of 500s for users.

21:14

Reports of errors start reaching the Slack channel + Piazza. The repopulate process keeps being retried, and notebooks for users slowly come back. Some users are ‘stuck’ in a bad state, however - their notebook isn’t running, but JupyterHub thinks it is (or vice versa).

21:34

Most users are fine by now. For those still with problems, a forced delete from the admin interface + a start works, since this forces JupyterHub to really check whether their server is there or not.

22:03

Last reported user with 500 error is fixed, and datahub is fully operational again.

Conclusion

This was almost a 'perfect storm' event. Three things combined to make this outage happen:

1. The inodes issue, which causes containers to fail randomly.
2. The fact that the proxy is a single point of failure with a longish recovery time in the current JupyterHub architecture.
3. KubeSpawner's current design is inefficient at very high user volumes, and its request timeouts & other performance characteristics had not been tuned (because we have not needed to before).

We have both long term (~1-2 months) architectural fixes as well as short term tuning in place for all three of these issues.

Action items

Upstream JupyterHub

1. Work on abstracting the proxy interface, so the proxy is no longer a single point of failure. Issue


Upstream KubeSpawner

1. Re-architect the spawner to make a much smaller number of HTTP requests. DataHub has become big enough that this is a problem. Issue
2. Tune the HTTP client kubespawner uses. This would be an interim solution until (1) gets fixed. Issue

DataHub configuration

1. Set resource requests explicitly for the hub and proxy, so they have less chance of getting evicted. Issue
2. Reduce the interval at which the hub checks to see if the proxy is running. PR
3. Speed up the fix for the inodes issue, which is what triggered this whole incident.

2017-03-06 - Non-matching hub image tags cause downtime

Summary

On the evening of Mar 6, the hub on prod would not come up after an upgrade. The upgrade was to accommodate a new disk for cogneuro that had been tested on dev. After some investigation it was determined that the helm config did not match the hub's image. The hub image was rebuilt and pushed, tested on dev, and then pushed out to prod. The problem was fixed in about 40 minutes. A few days later (March 12), a similar near-outage was avoided when -dev broke and deployment was put on hold. More debugging showed the underlying cause to be that git submodules are hard to use. More documentation was provided, and downtime was averted!

Timeline

All times in PST

March 6 2017 22:59

dev changes are deployed but the hub does not start correctly. The describe output for the hub shows repeated instances of:

    Error syncing pod, skipping: failed to "StartContainer" for "hub-container" with CrashLoopBackOff:
    "Back-off 10s restarting failed container=hub-container
    pod=hub-deployment-3498421336-91gp3_datahub-dev(bfe7d8bd-0303-11e7-ade6-42010a80001a)"

The helm chart for -dev is deleted and reinstalled.


23:11

dev changes are deployed successfully and tested. cogneuro's latest data is available.

23:21

Changes are deployed to prod. The hub does not start properly. get pod -o=yaml on the hub pod shows that the hub container has terminated. The hub log shows that it failed due to a bad configuration parameter.

21:31

While the helm chart had been updated from git recently, the latest tag for the hub did not correspond with the one in either prod.yaml or dev.yaml.

21:41

The hub image is rebuilt and pushed out.

21:45

The hub is deployed on -dev.

21:46

The hub is tested on -dev then deployed on -prod.

21:50

The hub is tested on -prod. Students are reporting that the hub had been down.

March 12 19:57

A new deploy is attempted on -dev, but runs into the same error. Deployments are halted for more debugging this time, and more people are called on.

23:21

More debugging reveals that the commit update looked like this:

    diff --git a/chart b/chart
    index e38aba2..c590340 160000
    --- a/chart
    +++ b/chart
    @@ -1 +1 @@
    -Subproject commit e38aba2c5601de30c01c6f3c5cad61a4bf0a1778
    +Subproject commit c59034032f8870d16daba7599407db7e6eb53e04
    diff --git a/data8/dev.yaml b/data8/dev.yaml
    index 2bda156..ee5987b 100644
    --- a/data8/dev.yaml
    +++ b/data8/dev.yaml
    @@ -13,7 +13,7 @@
     publicIP: "104.197.166.226"
     singleuser:
       image:
    -    tag: "e4af695"
    +    tag: "1a6c6d8"
       mounts:
         shared:
           cogneuro88: "cogneuro88-20170307-063643"

Only the tag should have been updated. The chart submodule was also updated, to c59034032f8870d16daba7599407db7e6eb53e04, which is from February 25 (almost two weeks old). This is the cause of the hub failing, since it is using a really old chart commit with a new hub image.

23:27

It is determined that incomplete documentation about deployment processes caused git submodule update to not be run after a git pull, and so the chart was being accidentally moved back to older commits. Looking at the commit that caused the outage on March 6 showed the exact same root cause.

Conclusion

Git submodules are hard to use, and break most people's mental model of how git works. Since our deployment requires that the submodule be in sync with the images used, this caused an outage.

Action items

Process

1. Make sure we treat any errors in -dev exactly like we would in prod. Any deployment error in prod should immediately halt future deployments & require a rollback or resolution before proceeding.
2. Write down actual deployment documentation & a checklist.
3. Move away from git submodules to a separate versioned chart repository.


2017-03-20 - Too many volumes per disk leave students stuck

Summary

From sometime early on March 20 2017 until about 13:00, some new student servers were stuck in Pending forever, giving those students 500 errors. This was an unintended side-effect of reducing the student memory limit to 1G while keeping the size of our nodes constant, causing us to hit a Google Cloud limit on the number of disks per node. This was fixed by spawning more, smaller nodes.

Timeline

March 18, 16:30

RAM per student is reduced from 2G to 1G, as a resource optimization measure. The size of our nodes remains the same (26G RAM), and many are cordoned off and slowly decommissioned over the coming few days. Life seems fine, given the circumstances.

March 20, 12:44

New student servers report a 500 error preventing them from logging on. This is deemed widespread & not an isolated incident.

12:53

A kubectl describe pod on an affected student's pod shows it's stuck in the Pending state, with an error message:

    pod failed to fit in any node
    fit failure on node (XX): MaxVolumeCount

This seems to be a common problem for all the new student servers, which are all stuck in the Pending state. Googling leads to https://github.com/kubernetes/kubernetes/issues/24317 - even though Google Compute Engine can handle more than 16 disks per node (we had checked this before deploying), Kubernetes itself still can not. This wasn't foreseen, and seemed to be the direct cause of the incident.

13:03

A copy of the instance template that is used by Google Container Engine is made and then modified to spawn smaller nodes (n1-highmem-2 rather than n1-highmem-4). The managed instance group used by Google Container Engine is then modified to use the new template. This was the easiest way to not disrupt students for whom things were working, while also allowing new students to be able to log in. This new instance group was then set to expand by 30 new nodes, which will provide capacity for about 12 students each. populate.bash was also run to make sure that student pods start up in time on the new nodes.


13:04

The simple autoscaler is stopped, for fear that it'll be confused by the unusual mixed state of the nodes and do something wonky.

13:11

All the new nodes are online, and populate.bash has completed. Pods start leaving the Pending state. However, since it's been more than the specified timeout that JupyterHub will wait before giving up on a pod (5 minutes), JupyterHub doesn't know the pods exist. This causes the state of the cluster + the state in JupyterHub to go out of sync, causing the dreaded 'redirected too many times' error. Admins need to manually stop and start user pods in the control panel as users report this, to fix the issue.

14:23

The hub and proxy pods are restarted, since there were plenty of 'redirected too many times' errors. This seems to catch most users' state, although some requests still failed with a 599 timeout (similar to an earlier incident, but much less frequent). A long tail of manual user restarts is performed by admins over the next few days.

Action Items

Upstream: Kubernetes

1. Keep an eye on the status of the bug we ran into

Upstream: JupyterHub

1. Track down and fix the ‘too many redirects’ issue at source. Issue

Cleanup

1. Delete all the older larger nodes that are no longer in use. (Done!)

Monitoring

1. Have alerting for when there are any number of pods in Pending state for a non-negligible amount of time. There is always something wrong when this happens.


2017-03-23 - Weird upstream ipython bug kills kernels

Summary

A seemingly unrelated change caused user kernels to die on start (making notebook execution impossible) for newly started user servers from about Mar 22 19:30 to Mar 23 09:45. Most users didn’t see any errors until start of class at about 9AM, since they were running servers that were previously started.

Timeline

March 22, around 19:30

A deployment is performed, finally deploying https://github.com/data-8/jupyterhub-k8s/pull/146 to production. It seemed to work fine on -dev, and on prod as well. However, the testing regimen was only to see if a notebook server would show up - not if a kernel would spawn.

Mar 23, 09:08

Students report that their kernels keep dying. This is confirmed to be a problem for all newly launched notebooks, in both prod and dev.

09:16

The last change to the repo (an update of the single-user image) is reverted, to check if that was causing the problem. This does not improve the situation. Debugging continues, but with no obvious angles of attack.

09:41

After debugging produces no obvious culprits, the state of the entire infrastructure for prod is reverted to a known good state from a few days ago. This was done with: ./deploy.py prod data8 25abea764121953538713134e8a08e0291813834

25abea764121953538713134e8a08e0291813834 is the commit hash of a known good commit from March 19. Our disciplined adherence to immutable & reproducible deployment paid off, and we were able to restore new servers to working order with this! Students are now able to resume working after a server restart. A mass restart is also performed to aid this. Dev is left in a broken state in an attempt to debug.


09:48

A core Jupyter Notebook dev at BIDS attempts to debug the problem, since it seems to be with the notebook itself and not with JupyterHub.

11:08

Core Jupyter Notebook dev confirms that this makes no sense.

14:55

Attempts to isolate the bug start again, mostly by using git bisect to deploy different versions of our infrastructure to dev until we find what broke.

15:30

https://github.com/data-8/jupyterhub-k8s/pull/146 is identified as the culprit. It continues to not make sense.

17:25

A very involved and laborious revert of the offending part of the patch is done in https://github.com/jupyterhub/kubespawner/pull/37. The core Jupyter Notebook dev continues to confirm this makes no sense. https://github.com/data-8/jupyterhub-k8s/pull/152 is also merged, and deployed shortly after verifying that everything (including starting kernels & executing code) works fine on dev. It is deployed to prod and everything is fine.

Conclusion

Insufficient testing procedures caused a new kind of outage (kernels dying) that we had not seen before. However, since our infrastructure was immutable & reproducible, our outage really only lasted about 40 minutes (from the start of lab, when students were starting containers, until the revert). Deeper debugging produced a fix, but attempts to understand why the fix works are ongoing. Update: We have found and fixed the underlying issue.

Action items

Process

1. Document and formalize the testing process for post-deployment checks.
2. Set a short timeout (maybe ten minutes?) after which investigation temporarily stops and we revert our deployment to a known good state.


Upstream KubeSpawner

1. Continue investigating https://github.com/jupyterhub/kubespawner/issues/31, which was the core issue that prompted the changes that eventually led to the outage.

2017-04-03 - Custom autoscaler does not scale up when it should

Summary

On April 3, 2017, as students were returning from spring break, the cluster wasn’t big enough in time and several students had errors spawning. This was because the simple-autoscaler was ‘stuck’ on a populate call. More capacity was manually added, the pending pods were deleted & this seemed to fix the outage.

Timeline

Over spring break week

The cluster is scaled down to a much smaller size (7 machines), and the simple scaler is left running.

2017-04-03 11:32

Students report on Piazza that datahub isn't working, and there are lots of pods in the Pending state. Doing a kubectl --namespace=datahub describe pod said the pod was unschedulable because there wasn't enough RAM in the cluster. This clearly implied the cluster wasn't big enough. Looking at the simple scaler showed it was 'stuck' on a populate.bash call, and wasn't scaling up fast enough.

11:35

The cluster is manually scaled up to 30 nodes:

    gcloud compute instance-groups managed resize gke-prod-highmem-pool-0df1a536-grp --size=30

At the same time, pods stuck in the Pending state are deleted so they don't become ghost pods, with:

    kubectl --namespace=datahub get pod | grep -v Running | grep -P '$' | awk '{print $1;}' | xargs -L1 kubectl --namespace=datahub delete pod

11:40

The nodes have come up, so a populate.bash call is performed to pre-populate all user container images on the new nodes. User pods stuck in the Pending state are deleted again.


11:46

The populate.bash call is complete, and everything is back online!

Conclusion

Our simple scaler didn't scale up fast enough when a large number of students came back online quickly after a period of quiet (spring break). It took a while for this to get noticed, and manual scaling fixed everything.

Action items

Process

1. When coming back from breaks, pre-scale the cluster back up. 2. Consider cancelling spring break.

Monitoring

1. Have monitoring for pods stuck in non-Running states

2017-05-09 - Oops we forgot to pay the bill

Summary

On May 9, 2017, the compute resources associated with the data-8 project at GCE were suspended. All hubs including datahub, stat28, and prob140 were not reachable. This happened because the grant that backed the project’s billing account ran out of funds. The project was moved to a different funding source and the resources gradually came back online.

Timeline

2017-05-09 16:51

A report in the Data 8 Spring 2017 Staff slack, #jupyter channel, says that datahub is down. This is confirmed. Attempting to access the provisioner via gcloud compute ssh provisioner-01 fails with: ERROR: (gcloud.compute.ssh) Instance [provisioner-01] in zone [us-central1-a] has not been allocated an external IP address yet. Try rerunning this command later.


17:01

The Google Cloud console shows that the billing account has run out of the grant that supported the data-8 project. The project account is moved to another billing account which has resources left. The billing state is confirmed by gcloud messages:

Google Compute Engine: Project data-8 cannot accept requests to setMetadata while in an inactive billing state. Billing state may take several minutes to update.

17:09

provisioner-01 is manually started. All pods in the datahub namespace are deleted.

17:15

datahub is back online. stat28 and prob140 hub pods are manually killed. After a few moments the hubs are back online. The autoscaler is started.

17:19

The slack duplicator is started.

2017-05-10 10:48

A report in uc-jupyter #jupyterhub says that try.datahub is down. This is confirmed and the hub in the tmp namespace is killed. The hub comes online a couple of minutes later.

Conclusion

There was insufficient monitoring of the billing status.

Action items

Process

1. Identify channels for billing alerts.
2. Identify billing threshold functions that predict when funds will run out.
3. Establish off-cloud backups. The plan is to do this via nbgdrive.
4. Start autoscaler automatically. It is manually started at the moment.


Monitoring

1. Set up scheduled billing reports and threshold alarms.
2. Set up hub monitoring!
3. The slack duplicator runs in one of the GCP clusters. When the clusters go down, slack messages aren't forwarded from the data8-sp17-staff slack to uc-jupyter.

2017-10-10 - Docker dies on a few Azure nodes

Summary

On Oct 10, 2017, some user pods were not starting or terminating correctly. After checking node status, it was found that all affected pods were running on two specific nodes. The docker daemon wasn't responsive on these nodes, so they were cordoned off. User pods were then able to start correctly.

Timeline

2017-10-10 10:45a

A report in the course Piazza said that two students couldn’t start their servers. The /hub/admin interface was not able to start them either. It was reported that the students may have run out of memory.

12:29p

The user pods were stuck in Terminating state and would not respond to explicit delete. The pods were forcefully deleted with kubectl --namespace=prod delete pod jupyter- --grace-period=0 --force. The user pods started correctly via /hub/admin.

13:27

It was reported in the course slack that another student’s server wasn’t starting correctly. After checking one of the pod logs, it was observed that the node hosting the pods, k8s-pool1-19522833-13, was also hosting many more pods stuck in a Terminating state. docker ps was hanging on that node. The node was cordoned.

13:42

It was reported in slack that the student's server was able to start. By this time, the cluster was checked to see if any other nodes were hosting an unusual number of pods stuck in Terminating. It was found that k8s-pool2-19522833-9 was in a similar state. All stuck pods on that node were forcefully deleted and the node was also cordoned. docker ps was hung on that node too. pool2-...-9 had a load of 530 while pool1-...-13 had a load of 476. On the latter, hypercube was at 766% CPU utilization, while it was nominal on the former. Node pool1-...-13 was rebooted from the shell; however, it did not come back online. The node was manually restarted from the Azure portal but it still didn't come back. A node previously cordoned on another day, pool1-...-14, was rebooted. It came back online and was uncordoned.


13:51

Some relevant systemctl status docker logs were captured from pool2-...-9:

Oct 10 20:55:30 k8s-pool2-19522833-9 docker[1237]: time="2017-10-10T20:55:30.790401257Z" level=error msg="containerd: start container" error="containerd: container did not start before the specified timeout" id=abd267ef08b4a4184e19307be784d62470f9a713b59e406249c6cdf0bb333260
Oct 10 20:55:30 k8s-pool2-19522833-9 docker[1237]: time="2017-10-10T20:55:30.790923460Z" level=error msg="Create container failed with error: containerd: container did not start before the specified timeout"
Oct 10 20:55:30 k8s-pool2-19522833-9 docker[1237]: time="2017-10-10T20:55:30.810309575Z" level=error msg="Handler for POST /v1.24/containers/abd267ef08b4a4184e19307be784d62470f9a713b59e406249c6cdf0bb333260/start returned error: containerd: container did not start before the specified timeout"
Oct 10 20:55:36 k8s-pool2-19522833-9 docker[1237]: time="2017-10-10T20:55:36.146453953Z" level=error msg="containerd: start container" error="containerd: container did not start before the specified timeout" id=2ba6787503ab6123b509811fa44c7e42986de0b800cc4226e2ab9484f54e8741
Oct 10 20:55:36 k8s-pool2-19522833-9 docker[1237]: time="2017-10-10T20:55:36.147565759Z" level=error msg="Create container failed with error: containerd: container did not start before the specified timeout"
Oct 10 20:55:36 k8s-pool2-19522833-9 docker[1237]: time="2017-10-10T20:55:36.166295370Z" level=error msg="Handler for POST /v1.24/containers/2ba6787503ab6123b509811fa44c7e42986de0b800cc4226e2ab9484f54e8741/start returned error: containerd: container did not start before the specified timeout"
Oct 10 20:55:36 k8s-pool2-19522833-9 docker[1237]: time="2017-10-10T20:55:36.169360588Z" level=error msg="Handler for GET /v1.24/containers/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
Oct 10 20:55:36 k8s-pool2-19522833-9 dockerd[1237]: http: multiple response.WriteHeader calls
Oct 10 20:55:36 k8s-pool2-19522833-9 docker[1237]: time="2017-10-10T20:55:36.280209444Z" level=error msg="Handler for GET /v1.24/containers/610451d9d86a58117830ea7c0189f6157ba9a9602739ee23723e923de8c7e23e/json returned error: No such container: 610451d9d86a58117830ea7c0189f6157ba9a9602739ee23723e923de8c7e23e"
Oct 10 20:55:39 k8s-pool2-19522833-9 docker[1237]: time="2017-10-10T20:55:39.095888009Z" level=error msg="Handler for GET /v1.24/containers/54b64ca3c1e7ef4a04192ccdaf1cb9309d73acebd7a08e13301f3263de3d376a/json returned error: No such container: 54b64ca3c1e7ef4a04192ccdaf1cb9309d73acebd7a08e13301f3263de3d376a"

14:00

datahub@k8s-pool2-19522833-9:~$ ps aux | grep exe | wc -l
520
datahub@k8s-pool2-19522833-9:~$ ps aux | grep exe | head -5
root       329  0.0  0.0 126772 9812 ?  Dsl  00:36  0:00 /proc/self/exe init
root       405  0.0  0.0  61492 8036 ?  Dsl  00:36  0:00 /proc/self/exe init
root       530  0.0  0.0 127028 8120 ?  Dsl  00:36  0:00 /proc/self/exe init
root       647  0.0  0.0 127028 8124 ?  Dsl  13:07  0:00 /proc/self/exe init
root       973  0.0  0.0  77884 8036 ?  Dsl  13:10  0:00 /proc/self/exe init


14:30

pool1-...-13 was manually stopped in the Azure portal, then manually started. It came back online afterwards and docker was responsive. It was uncordoned. pool2-...-9 was manually stopped in the Azure portal.

14:45

pool2-...-9 completed stopping and was manually started in the Azure portal.

17:25

It was observed that /var/lib/docker on pool1-19522833-13/10.240.0.7 was on / (sda) and not on /mnt (sdb).

Conclusion

Docker was hung on two nodes, preventing pods from starting or stopping correctly.

Action items

Process

1. When there are multiple reports of student servers not starting or stopping correctly, check to see if the user pods were run on the same node(s).
2. Determine how many nodes are not mounting /var/lib/docker on sdb1.

Monitoring

1. Look for elevated counts of pods stuck in Terminating state. For example: kubectl --namespace=prod get pod -o wide | grep Terminating. A sketch of an automated check follows.
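A minimal sketch of such a check, assuming a cron-driven script, an operator-chosen threshold, and a hypothetical notify-slack alerting hook (none of these are existing tooling):

#!/usr/bin/env bash
# Count pods stuck in Terminating state and complain when there are too many.
# THRESHOLD and the notify-slack command are placeholders, not part of our tooling.
THRESHOLD=5
count=$(kubectl --namespace=prod get pod -o wide --no-headers | grep -c Terminating)
if [ "$count" -gt "$THRESHOLD" ]; then
    echo "WARNING: ${count} pods stuck in Terminating state in namespace prod"
    # notify-slack "datahub: ${count} pods stuck in Terminating"   # hypothetical hook
fi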

2017-10-19 - Billing confusion with Azure portal causes summer hub to be lost

Summary

On October 10, 2017, the cloud vendor notified ds-instr that the data8r-17s subscription was canceled due to its end date, and that we had 90 days to reactivate it using the educator portal. A support ticket was created to reverse the cancellation since the educator portal did not permit reactivation. On October 18 we were notified that the subscription's resources had been deleted. Coincidentally, a script had been written on Oct. 9 to back up data to ds-instr's Google Drive, and it had been run for the instructor as a test. Unfortunately it wasn't run for all users before the resources were taken offline.


Timeline

2017-10-10 9:06a

ds-instr received an email from the cloud vendor: The following subscriptions under your [cloud vendor] sponsorships for [email protected] have recently become canceled. Because these subscription(s) are canceled, all services have been suspended but no data has been lost. You have 90 days from the date of cancellation before [the cloud vendor] will delete the subscription and all attached data. Please use the Educator Portal to reactivate the subscription(s).

Subscription Name    Subscription Id    Canceled Reason
data8r-17s           omitted here       Subscription End Date

9:30

The instructor was notified. The educator portal did not provide a way to view or alter the subscription end date of a canceled subscription, so a support request was filed with the cloud vendor.

11:14

The cloud vendor asks that a payment instrument be added to the ds-instr account. We respond that the account is funded by a sponsorship.

17:22

The cloud vendor contacts their sponsorship team.

2017-10-11 15:00

The cloud vendor calls to discuss the situation. Screenshots of the educator and cloud portal were sent to the cloud vendor.

2017-10-12 16:19

The cloud vendor offers to enable the subscription for a 60 minute period from the backend so that the End Date may be extended from the portal. Though the subscription is re-enabled for an hour, the portal still does not permit the subscription parameters to be changed.


2017-10-18 15:29

The cloud vendor says that the subscription was actually disabled because it had exhausted the allocated funds, and the data was deleted within 24 hours despite the stated 90 day grace period. Later the following was provided by the cloud vendor:

I worked with our backend engineering team last night and I am afraid to say that we could not retrieve the storage account after all our sincere efforts. I understand how frustrating it would be for you and I do not have the words to express the same, I just wish if I could be of some help to you. Having said that we did dig into the reasons behind this situation, the subscription was initially suspended by an internal engineering job occurred that auto-suspended all Academic Account Sponsorship subscriptions with an end date that was part of the previous fiscal year. Usually this suspension does not delete the subscription. There are a few [cloud vendor] accounts which are on legacy commerce platform which are affected and these accounts are in the process of modern platform. Your account was in the transition mode when the subscription got suspended and your account was partially converted to the modern platform. The billing & subscription part was converted to the modern platform but the not at the service level. Hence you got the message that your data would be retained for 90 days, at the same stated at the service level it was not converted to the modern hence the data got deleted. I had a detailed discussion with our product group team on this and how we can avoid this in future. First of all, your account is now completely migrated/transitioned completely to the modern platform. Also, to ensure that our other Academic Account Sponsorships customers do not face the same issue they have agreed to complete the migration manually on those accounts.

2017-10-19 10:53

The cloud vendor compensates ds-instr with an additional $10k for the experience.

Conclusion

There were insufficient funds on the subscription to persist its resources. The resources were deleted by the cloud vendor before the grace period ran out.

Action items

Process

1. Until there is a per-user backup implemented hub-side, set a schedule for backing up user data for every course.
2. Always set a billing alert at some conservative amount less than the subscription allotment.
3. If a subscription is ever canceled, back up user data within 24 hours, regardless of the stated grace period.


2018-01-25 - Accidental merge to prod brings things down

Summary

On January 25, 2018, a new version of the helm chart was installed on the staging hub. It was not immediately merged to production because there were active labs throughout the day. While preparing another course’s hub via Travis CI, the Data 8 change was accidentally merged from staging to production. This production hub went down because the new helm chart’s jupyterhub image was broken.

Timeline

2018-01-25 14:30

The helm chart for datahub was upgraded to a beta of v0.6 to make use of a new image puller. This was merged into the staging branch. After some initial debugging, the helm chart was installed successfully and the image puller worked correctly. However, the staging hub was not tested. Since labs were scheduled throughout the day until 7p, it was decided to delay the upgrade of the production hub until after 7p.

15:30

While a different hub was being managed in Travis CI, the production hub for Data 8 was accidentally upgraded. This upgrade brought with it the faulty hub image from staging which wasn’t working.

16:11

GSIs report in slack that the hub is down for lab users. It is confirmed that the hub process has crashed due to a shared C library included from a python library. It is decided that the quickest way to bring the hub back up is to downgrade the helm chart back to v0.5.0.

16:35

The chart is installed into the staging repo, merged to staging, and checked on the staging hub. It is then merged into production and brought online there.

Conclusion

A relatively large change was made to the hub configuration with insufficient testing on the staging server. This was compounded when the change was accidentally merged to production.


Action items

Process

1. Admins should refamiliarize themselves with the deployment policy to check the staging hub before changes are merged to production.
2. Determine if there is a way to block merges to production if the staging hub is not online.
3. Determine if there is a way to contextualize the Travis CI interface so that it is obvious which deployment is being managed.

2018-01-26 - Hub starts up very slow, causing outage for users

Summary

On January 26, 2018, a new version of the helm chart was being installed on the production hub. Though the pod prepuller worked fine on the staging cluster, the prepuller never successfully finished on prod. This caused the CI to error because helm ran for too long. Additionally, the hub was taking a very long time to check user routes. After users were deleted in the hub's orm and the hub was restarted, it came back up fairly quickly.

Timeline

2018-01-26 15:00

The helm chart for datahub was upgraded to a beta of v0.6 to make use of a new image puller. This was merged into the staging branch, successfully tested on the staging hub, and passed CI checks on prod. It was then merged to prod.

15:15

helm times out because the prepuller never completes. It is determined that the master node on staging is cordoned while the master node on prod is not, and has:

taints:
- effect: NoSchedule
  key: node-role.kubernetes.io/master
  timeAdded: null
  value: "true"

The master is cordoned on prod and a new build is started in CI.


15:33

After CI times out again due to the prepuller, it is discovered that the master node has been uncordoned. Though the hub and proxy pods restart, the hub is taking a very long time to check user routes. It is slower than the most recent hub restart which was itself slow enough to warrant a new issue on jupyterhub, https://github.com/jupyterhub/jupyterhub/issues/1633.

13:40

It is decided that the most expedient way to get the hub up is to delete users from the orm.

13:50

The following command is run after the database is backed up, deleting 4902 records:

delete from users where users.id in
  (select users.id from users
   join spawners on spawners.user_id = users.id
   where server_id is null);

The hub pod is deleted and the hub comes up shortly after.

Conclusion

With roughly 5000 users in its database, the hub takes long enough to restart that the users active at any one time are noticeably inconvenienced.

Action items

Process

1. The prepuller should be fixed so that helm does not time out.
2. The hub route checking should be parallelized so that startup is not slow.
3. The staging hub should be seeded with users so that scaling issues can be exposed prior to reaching production.

2018-02-06 - Azure PD refuses to detach, causing downtime for data100

Summary

On February 5, 2018, a PR was merged into the production cluster for Data 100. The CI got as far as running helm upgrade but the hub’s persistent volume would not detach from the old hub. The new hub pod had to wait on the hub-db-dir volume and so would not start. The persistent volume claim was ultimately deleted. The subsequent helm upgrade created a new volume and a new hub pod was able to start.


Timeline

2018-02-05 20:57

A PR for data100 is merged.

21:20

Towards the end of the build, the upgrade fails because the new hub pod does not start up. The hub-db-dir volume remains bound to the old hub pod which is stuck in a Terminating state. The hub pod only completes termination when delete is passed a grace period of 0. The hub volume remains bound however.

21:30

CI is restarted but by the time helm is run, the hub-db-dir volume remains bound and cannot be attached to the new hub pod. Additionally, helm errors because the jupyterhub-internal ingress object cannot be found even though it does exist.

21:45

Since it cannot be determined what node the volume is bound to, the volume is deleted. The jupyterhub-internal ingress object is also deleted prior to restarting the CI build.

22:05

The hub comes up with a new hub-db-dir volume. CI fails due to the same jupyterhub-internal object error.

Conclusion

Azure was not able to detach the hub-db-dir azure disk from the hub pod. The PVC was deleted and the hub came up on the next CI run.

Action items

Process

1. Store the hub db in a cloud database to eliminate reliance on the hub volume.
2. Downgrade helm to 2.6.x to see if this fixes the helm upgrades.


2018-02-28 - A node hangs, causing a subset of users to report issues

Summary

On February 28, 2018, a handful of users reported on piazza that their servers wouldn't start. It was determined that all of the problematic servers were running on the same node. After the node was cordoned and rebooted, the student servers were able to start properly.

Timeline

2018-02-28 21:21

Three students report problems starting their server on piazza and a GSI links to the reports on slack. More reports come in by 21:27.

21:30

The infrastructure team is alerted to the problem. The command kubectl --namespace=prod get pod -o wide | egrep -v -e prepull -e Running shows that all non-running pods were scheduled on the same node. Most of the pods have an “Unknown” status while the rest are in “Terminating”. The oldest problematic pod is 29m.

21:34

The node k8s-pool1-19522833-9 is cordoned. It has a load of about 90 with no processes consuming much CPU. The node is rebooted via sysrq trigger. The hung pods remain stuck.

21:39

When the node comes back online, kubectl reports no more hung pods. Students are able to start their servers.

Conclusion

A problematic VM prevented pods from launching on it. Once the VM was cordoned and rebooted, pods launched without trouble.

Action items

Process

1. Monitor the cluster for non-running pods and send an alert if the count exceeds a threshold or if the non-running pods are clustered on the same node(s).
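A minimal sketch of such a monitor, assuming an operator-chosen per-node threshold and a hypothetical notify-slack alerting hook (neither exists in our tooling today):

#!/usr/bin/env bash
# Group non-running user pods by node; a node hosting many of them is suspect.
# PER_NODE_THRESHOLD and notify-slack are placeholders, not existing tooling.
NAMESPACE=prod
PER_NODE_THRESHOLD=3
# With -o wide, column 7 of kubectl's output is the node name.
kubectl --namespace="$NAMESPACE" get pod -o wide --no-headers \
  | grep -v -e Running -e prepull \
  | awk '{print $7}' | sort | uniq -c | sort -rn \
  | while read -r count node; do
      if [ "$count" -ge "$PER_NODE_THRESHOLD" ]; then
        echo "ALERT: ${count} non-running pods on node ${node}"
        # notify-slack "datahub: ${count} stuck pods on ${node}"
      fi
    done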


2018-06-11 - Azure billing issue causes downtime

Summary

On June 11, 2018, the cloud vendor notified ds-instr that the data8-17f-prod subscription was canceled due to its usage cap. The educator portal confirmed that the spend had surpassed the budget. After additional funds were allocated to the subscription, a portion of the VMs were manually started. The hub came back online after pods were forcibly deleted and nodes were cordoned.

Timeline

2018-06-11 9:02a

ds-instr received an email from the cloud vendor: The following subscriptions under your Microsoft Azure sponsorships for [email protected] have recently become canceled. Because these subscription(s) are canceled, all services have been suspended but no data has been lost. You have 90 days from the date of cancellation before Microsoft will delete the subscription and all attached data. Please use the Educator Portal to reactivate the subscription(s).

Subscription Name    Subscription Id    Canceled Reason
data8-17f-prod       omitted here       Subscription Cap

9:29

The subscription status was confirmed at https://www.microsoftazuresponsorships.com/Manage. In order to allocate additional budget to data8-17f-prod, budget for other subscriptions had to be reduced.

9:40

VMs were turned on at https://portal.azure.com: 3 nodes in each node pool, the nfs server, the kubernetes master, and the database server.

9:45

The hub was unreachable even though the VMs were online. The hub and proxy pods were shown as Running and all nodes were shown as online even though some nodes had not been started. The offline cluster nodes were manually cordoned. All pods had to be forcibly deleted before they would start.


10:14

The Billing Alert Service was checked at https://account.azure.com/Subscriptions/alert?subscriptionId=06f94ac5-b029-411f-8896-411f3c6778b4 and it was discovered that alerts were no longer registered.

Conclusion

There were insufficient funds on the subscription to persist its resources. The subscription budget was increased and the hub was brought back online. The billing alert service that was configured to prevent such incidents did not function properly.

Action items

Process

1. Do not use subscription portal billing alerts.
2. Check subscription usage ourselves via an unattended process.

2019-02-25 - Azure Kubernetes API Server outage causes downtime

Summary

On February 25, 2019, the kubernetes API server for data100 became unreachable, causing new resource creation requests to fail. When the hub pod was stopped, a new one did not get created leading users to see a proxy error message. The hub came back online after a new cluster was created, storage was migrated to the new cluster, and then DNS was updated.

Timeline

2019-02-25 11:21a

The kubernetes API server became unavailable. The time of this event was determined post mortem via the cloud provider's monitoring metrics.

11:34

Infrastructure staff is notified in slack. It is determined that the hub proxy is up, but kubectl fails for all operations. The API server is unreachable.


11:57

A C ticket is created via the cloud provider's portal. There are no other reports on the cloud provider's status page. Infrastructure staff consider creating a new cluster and attaching storage to it.

12:28p

An email is sent to contacts with the cloud provider asking for the ability to escalate the issue. C tickets have 8 hour response times.

12:40

It is decided that rather than moving the nfs server from one cluster to another, the ZFS pool should be migrated to a new nfs server in the new cluster. The new cluster is requested.

12:43 - 12:49

Cloud provider responds and calls infrastructure staff.

13:00

The cluster is created and a new nfs server is requested in the cluster’s resource group.

13:10

Data volumes are detached from the old server and moved from the old cluster’s resource group to the new one.

13:20

The ZFS pool is imported into the new nfs server. helm is run to create the staging hub.

13:34

helm completes and the staging hub is up. DNS is updated. helm is run to create the prod hub.

13:41

prod hub is up and DNS is updated.


13:46

Cloud provider asks their upstream why the API server went down.

14:48

letsencrypt on prod can successfully retrieve an SSL certificate, enabling students to connect.

Conclusion

The managed kubernetes service went down for as yet unknown reasons. A new cluster was created and existing storage was attached to it.

Action items

Monitoring

1. Remotely monitor the API server endpoint and send an alert when it is down.
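A minimal sketch of such a remote probe, assuming the API server address and the alerting hook are filled in by the operator (both are placeholders below); kube-apiserver exposes a /healthz endpoint, though depending on cluster configuration it may require credentials:

#!/usr/bin/env bash
# Probe the kubernetes API server health endpoint from outside the cluster.
# API_SERVER and notify-slack are placeholders; add credentials if /healthz
# is not accessible anonymously on your cluster.
API_SERVER="https://<api-server-address>"
if ! curl --fail --silent --max-time 10 --insecure "${API_SERVER}/healthz" > /dev/null; then
    echo "ALERT: kubernetes API server at ${API_SERVER} is not responding"
    # notify-slack "data100: API server down"
fi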

Update

Cloud provider's response on 3/15/2019: After reviewing all the logs we have, our backend advised below. We've identified that there were problems with the infrastructure hosting your cluster which caused the kubelet on the master stopped responding. There were alerts regarding this issue which were addressed by our teams. We're working to reduce the impact of these events as much as possible. Please be advised this is not related with region stability. Feel free to let me know if any further questions and thanks for your patience.

2019-05-01 - Service Account key leak incident

Summary

Service account keys that granted restricted access to some of our cloud services were inadvertently leaked on GitHub. Google notified us within seconds, and the credentials were revoked within the next few minutes.


Impact

Deployments were paused until this was fixed.

Timeline

May 1 2019, 3:18 PM

A template + documentation for creating new hubs easily is pushed to GitHub as a pull request. This inadvertently contained live credentials for pushing & pulling our (already public) docker images, and for access to our kubernetes clusters. Google notified us via email within seconds that this might be a breach.

3:19 PM

Discussion and notification starts in slack about dealing with the issue.

3:27 PM

Both keys are revoked so they are no longer valid credentials.

3:36 PM

All in-use resources are checked, and verified to not be compromised by automated bots looking for leaked accounts.

3:40 PM

An email is sent out to all owners of the compromised project (ucb-datahub-2018) giving an all-clear.

Action items

1. Don't duplicate service key credentials across multiple hubs. Issue
2. Switch to a different secret management strategy than what we have now. Issue

2.1.7 Common Administrator Tasks

Add an admin user

What can admin users do?

JupyterHub has admin users who have the following capabilities:

1. Access & modify all other users' home directories (where all their work is kept)
2. Mark other users as admin users
3. Start / Stop other users' servers

These are all powerful & disruptive capabilities, so be careful who gets admin access!

Adding / removing an admin user

1. Pick the hub you want to make a user an admin of.
2. Find the config directory for the hub, and open common.yaml in there.
3. Add / remove the admin user name from the list jupyterhub.auth.admin.users (see the example below). Make sure there is an explanatory comment nearby that lists why this user is an admin. This helps us remove admins when they no longer need admin access.
4. Follow the steps to make a deployment.
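A minimal sketch of what that part of common.yaml could look like; the usernames and comments below are placeholders, not real admins:

jupyterhub:
  auth:
    admin:
      users:
        # placeholder: instructor for an example course; remove after the term ends
        - example-instructor
        # placeholder: infrastructure staff
        - example-staff-member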

Update DNS

Some staff have access to make and update DNS entries in the .datahub.berkeley.edu and .data8x.berkeley.edu subdomains.

Authorization

Request access to make changes by creating an issue in this repository. Authorization is granted via membership in the edu:berkeley:org:nos:DDI:datahub CalGroup. @yuvipanda and @ryanlovett are group admins and can update membership.

Making Changes

1. Log into Infoblox from a campus network or through the campus VPN. Use your CalNet credentials.
2. Navigate to Data Management > DNS > Zones and click berkeley.edu.
3. Navigate to Subzones and choose either data8x or datahub, then click Records.

Tip: For quicker access, click the star next to the zone name to make a bookmark in the Finder pane on the left side.

Create a new record

1. Click the down arrow next to + Add in the right-side Toolbar. Then choose Record > A Record.
2. Enter the name and IP of the A record, and uncheck Create associated PTR record.
3. Consider adding a comment with a timestamp, your ID, and the nature of the change.
4. Click Save & Close.


Edit an existing record

1. Click the gear icon to the left of the record's name and choose Edit.
2. Make a change.
3. Consider adding a comment with a timestamp, your ID, and the nature of the change.
4. Click Save & Close.

Delete a record

1. Click the gear icon to the left of the record’s name and choose Delete.

Create a new Hub

Why create a new hub?

The major reasons for making a new hub are:

1. You want to use a different kind of authenticator.
2. Some of your students are admins on another hub, so they can see other students' work there.
3. You are running in a different cloud, or using a different billing account.
4. Your environment is different enough and specialized enough that a different hub is a good idea. By default, everyone uses the same image as datahub.berkeley.edu.
5. You want a different URL (X.datahub.berkeley.edu vs just datahub.berkeley.edu).

If your reason is something else, it probably needs some justification :)

Setting up a new hub structure

There's a simple cookiecutter we provide that sets up a blank hub that can be customized.

1. Make sure you have the following python packages installed: cookiecutter
2. In the deployments directory, run cookiecutter: cookiecutter template/

3. Answer the questions it asks. Should be fairly basic. It should generate a directory with the name of the hub you provided, containing a skeleton configuration. It'll also generate all the secrets necessary.
4. You need to log into the NFS server, and create a directory owned by 1000:1000 under /export/homedirs-other-2020-07-29/. The path might differ if your hub has special home directory storage needs. Consult admins if that's the case.
5. Set up authentication via bcourses. We have two canvas OAuth2 clients set up in bcourses for us - one for all production hubs and one for all staging hubs. The secret keys for these are already in the generated secrets config. However, you need to add the new hubs to the authorized callback list maintained in bcourses.
   1. <hub-name>-staging.datahub.berkeley.edu/hub/oauth_callback added to the staging hub client (id 10720000000000471)


   2. <hub-name>.datahub.berkeley.edu/hub/oauth_callback added to the production hub client (id 10720000000000472)

   Please reach out to Jonathan Felder (or [email protected] if he is not available) to set this up.

6. Add an entry in .circleci/config.yml to deploy the hub via CI. It should be under the deploy job, and look something like this:

   - run:
       name: Deploy
       command: |
         hubploy deploy <hub-name> hub ${CIRCLE_BRANCH}

There will be a bunch of other stanzas very similar to this one, helping you find it.

7. Commit the hub directory, and make a PR to the staging branch in the GitHub repo. Once tests pass, merge the PR to get a working staging hub! It might take a few minutes for HTTPS to work, but after that you can log into it at https://<hub-name>-staging.datahub.berkeley.edu. Test it out and make sure things work as you think they should.
8. Make a PR from the staging branch to the prod branch. When this PR is merged, it'll deploy the production hub. It might take a few minutes for HTTPS to work, but after that you can log into it at https://<hub-name>.datahub.berkeley.edu. Test it out and make sure things work as you think they should.
9. All done!

Rebuild the custom hub image

We use a customized JupyterHub image so we can control the versions of hub packages (such as authenticators) and install additional packages required by any custom config we might have. The image is located in images/hub. It must inherit from the JupyterHub image used in Zero to JupyterHub. chartpress is used to build the image and update hub/values.yaml with the new image version.

1. Modify the image in images/hub and make a git commit.
2. Run chartpress --push. This will build and push the hub image, and modify hub/values.yaml appropriately.
3. Make a commit with the hub/values.yaml file, so the new hub image name and tag are committed.
4. Proceed to deployment as normal.
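A condensed sketch of that workflow on the command line; the commit messages are placeholders:

# 1. Edit the hub image and commit the change
git add images/hub
git commit -m "hub image: describe your change here"
# 2. Build and push the image; chartpress rewrites hub/values.yaml with the new tag
chartpress --push
# 3. Commit the updated image tag
git add hub/values.yaml
git commit -m "hub image: bump tag after chartpress build"
# 4. Proceed to deployment as normal (PR to staging, then prod)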

Rebuild the custom postgres image

For data100, we provide a postgresql server per user. We want the python extension installed, so we inherit from the upstream postgresql docker image and add the appropriate package (a sketch follows the list). This image is in images/postgres. If you update it, you need to rebuild and push it.

1. Modify the image in images/postgres and make a git commit.
2. Run chartpress --push. This will build and push the image, but not put anything in YAML. There is no place we can put this in values.yaml, since this is only used for data100.
3. Note the image name + tag from the chartpress --push command, and put it in the appropriate place (under extraContainers) in data100/config/common.yaml.
4. Make a commit with the new tag in data100/config/common.yaml.
5. Proceed to deploy as normal.
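A minimal sketch of what such a Dockerfile could look like; the base image tag and the exact plpython package name depend on the postgres version actually in use, so treat both as assumptions:

# Sketch only: upstream postgres image plus the python (plpython3) extension.
# The version numbers here are assumptions, not the deployed versions.
FROM postgres:11
RUN apt-get update \
 && apt-get install -y --no-install-recommends postgresql-plpython3-11 \
 && rm -rf /var/lib/apt/lists/*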

Testing and Upgrading New Packages

It is helpful to test package additions and upgrades for yourself before they are installed for all users. You can make sure the change behaves as you think it should, and does not break anything else. Once tested, request that the change be installed for all users by creating a new issue in github, contacting curriculum support staff, or creating a new pull request. Ultimately, thoroughly testing changes locally and submitting a pull request will result in the software being rolled out to everyone much faster.

Install a package in your notebook

When testing a notebook with a new version of a package, add the following line to a cell at the beginning of your notebook:

!pip install --upgrade packagename==version

You can then execute this cell every time you run the notebook. This will ensure you have the version you think you have when running your code. To avoid complicated errors, make sure you always specify a version. You can find the latest version by searching on pypi.org.

Find current version of package

To find the current version of a particular installed package, you can run the following in a notebook:

!pip list | grep <packagename>

This should show you the particular package you are interested in and its current version.

Submitting a pull request

Familiarize yourself with pull requests and repo2docker, and create a fork of the datahub staging branch.

1. Find the correct environment.yml file for your class. This should be under \deployments\class\image\
2. In environment.yml, packages listed under dependencies are installed using conda, while packages under pip are installed using pip (see the sketch after this list). Any packages that need to be installed via apt must be added to \deployments\class\image\Dockerfile.
3. Add any packages necessary. Pip will almost always have the latest version of a package, but conda may only contain older versions.

• Note that package versions for conda are specified using =, while in pip they are specified using ==.
4. Test the changes locally using repo2docker, then submit a PR to staging.
• To use repo2docker, you have to point it at the right Dockerfile for your class. For example, to test the data100 datahub, you would run repo2docker deployments/data100/image from the base datahub directory.
5. Once the PR is pulled, test it out on class-staging.datahub.berkeley.edu.
6. Finally, submit a pull request to merge from staging into master.


• Double check what commits are pulled in. Creating this pull request will pull in all new commits to master.
• If other commits are pulled into your pull request, ask the authors of those commits if they are okay with this.
• The pull request title should be "Merge [List of commits] to prod". For example, the PR title might be "Merge #1136, #1278, #1277, #1280 to prod".
7. Changes are only deployed to datahub once the relevant CI job is completed. See https://circleci.com/gh/berkeley-dsep-infra/datahub to view CI job statuses.
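A minimal sketch of the environment.yml layout described in the list above; the package names and version pins are illustrative only:

# Illustrative only: conda dependencies pin with '=', pip dependencies pin with '=='.
name: example-class
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pandas=1.1.5
  - pip
  - pip:
      - otter-grader==1.1.5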

Tips for Upgrading Packages

• Conda can take an extremely long time to resolve version dependency conflicts, if they are resolvable at all. When upgrading Python versions or a core package that is used by many other packages, such as requests, clean out or upgrade old packages to minimize the number of dependency conflicts.

Configuring course profiles

We fetch per-course enrollment from the Student Information System when we need to configure user servers based on course affiliations. We periodically use this to set per-user resource limits, attach extra volumes to user servers, and automatically add or remove admin roles. This is implemented with a hub sidecar container which fetches enrollment data and shares it with the hub. The sidecar container's image is located in images/fetch-course-emails and the hub reads these rosters in our custom KubeSpawner in hub/values.yaml. The rosters are saved into the files:

/srv/jupyterhub/profiles.d/{year}-{term}-{class_section_id}-students.txt
/srv/jupyterhub/profiles.d/{year}-{term}-{class_section_id}-instructors.txt

Defining course profiles

We indicate which courses we're interested in by defining them as profiles in a given deployment's hub configuration at jupyterhub.hub.extraConfigMap.profiles. Courses are specified as keys of the form {year}-{term}-{class_section_id} in the helm config. For example:

profiles:
  2019-summer-15798: {}
  2019-spring-25622:
    mem_limit: 4096M
    mem_guarantee: 2048M
  2019-fall-23970:
    extraVolumeMounts:
      - mountPath: /home/rstudio/.ssh
        name: home
        subPath: _stat131a/_ssh
        readOnly: true

See https://classes.berkeley.edu for class section IDs. Specifying empty profiles is sufficient to ensure that any student enrolled in a course cannot also be an admin. This is important if an enrolled student is a member of the course staff of another course on the same hub and they've been given admin access.


Memory limits and extra volume mounts are specified as in the example above.

Cleaning Up

Remember to remove a course profile after the course is over. This prevents the sidecar container from fetching unnecessary enrollment data. Two semesters is probably a sufficient amount of time to retain the profiles in case students want to revisit assignments or instructors want to re-evaluate them. For example, if a profile was specified for 2019 Fall, consider removing it by 2020 Summer. If a student is in a course with a specified profile, and they become a member of course staff the next semester, the old course profile will need to be removed to ensure the new GSI/UGSI has sufficient admin access.
