UC Berkeley JupyterHubs Documentation
Division of Data Sciences Technical Staff
Sep 30, 2021

CONTENTS
1 Using DataHub
1.1 Using DataHub
2 Modifying DataHub to fit your needs
2.1 Contributing to DataHub

This repository contains configuration and documentation for the many JupyterHubs used by various organizations at UC Berkeley.

CHAPTER ONE
USING DATAHUB

1.1 Using DataHub

1.1.1 Services Offered

This page lists the various services we offer as part of DataHub. Not all of these will be available on all hubs, but we can easily enable them as you wish.

User Interfaces

Our diverse user population has diverse needs, so we offer many different user interfaces for instructors to choose from.

Jupyter Notebook (Classic)

What many people mean when they say ‘Jupyter’, this familiar interface is used by default for most of our introductory classes. Document oriented, no-frills, and well known by a lot of people.

RStudio

We want to provide first class support for teaching with R, which means providing strong support for RStudio. This includes Shiny support.

Try without berkeley.edu account:
Try with berkeley.edu account: R DataHub

JupyterLab

JupyterLab is a more modern version of the classic Jupyter notebook from the Jupyter project. It is more customizable and better supports some advanced use cases. Many of our more advanced classes use this, and we might help all classes move to this once there is a simpler document oriented mode available.

Linux Desktop (Experimental)

Sometimes, you just need to use something that requires a full desktop environment to run.
Instead of trying to get students to install things locally, we offer a full-fledged Linux Desktop environment they can access from inside their browser! This is just a different ‘UI’ on the same infrastructure as the notebook environment, so they all use the same libraries and home directories.

Try without Berkeley.edu account:
Try with Berkeley.edu account: EECS DataHub

Visual Studio Code (Experimental)

Sometimes you just want an IDE, not a notebook environment. We are experimenting with a hosted, web version of the popular Visual Studio Code editor, to see if it would be useful for teaching more traditional CS classes.

Try without Berkeley.edu account:
Try with Berkeley.edu account: EECS DataHub

SSH & SFTP (Experimental)

You can access the same environments and home directories via the terminal, using traditional ssh and sftp programs. See here for more documentation.

More?

If you have a web based environment, we can almost certainly make it run under a hub. Contact us and we’ll see what we can do :)

Services

Sometimes you need something custom to get your class going. Very interesting things can happen here, so we’re always looking for new services to add.

PostgreSQL

Some of our classes require using real databases to teach. We now experimentally offer a PostgreSQL server for each user on the data100 hub. The data does not persist right now, but we can turn that on whenever needed.

Programming languages

We support the usual suspects - Python, R & Julia. However, there are no limits to what languages we can actually support, so if you are planning on using a different (open source) programming language, contact us and we’ll set you up.

More?
We want to find solutions to your interesting problems, so please bring them to us.

1.1.2 Accessing private GitHub repos

GitHub is used to store class materials (lab notebooks, lecture notebooks, etc), and nbgitpuller is used to distribute them to students. By default, nbgitpuller only supports public GitHub repositories. However, Berkeley’s JupyterHubs are set up to allow pulling from private repositories as well.

Public repositories are still preferred, but if you want to distribute a private repository to your students, you can do so.

1. Go to the GitHub app for the hub you are interested in.
   1. R Hub
   2. DataHub
   3. PublicHealth Hub
   4. Open an issue if you want more hubs supported.
2. Click the ‘Install’ button.
3. Select the organization / user containing the private repository you want to distribute on the JupyterHub. If you are not the owner or administrator of this organization, you might need extra permissions to do this.
4. Select ‘Only select repositories’, and below that select the private repositories you want to distribute to this JupyterHub.
5. Click the ‘Install’ button. The JupyterHub you picked now has access to this private repository. You can revoke this anytime by coming back to this page, and removing the repo from the list of allowed repos. You can also totally uninstall the GitHub app.
6. You can now make a link for your repo at nbgitpuller.link. If you have just created your repo, you might have to specify main instead of master for the branch name, since GitHub changed the name of the default branch recently.

That’s it! You’re all set. You can distribute these links to your students, and they’ll be able to access your materials! You can also use more traditional methods (like the git commandline tool, or RStudio’s git interface) to access this repo.

Note: Everyone on the selected JupyterHub can clone your private repo if you do this.
They won’t be able to see that this repo exists, but if they get their hands on your nbgitpuller link they can fetch that too. More fine-grained permissions coming soon.

1.1.3 JupyterHubs in this repository

DataHub

datahub.berkeley.edu is the ‘main’ JupyterHub for use on UC Berkeley campus. It’s the largest and most active hub. It has many Python & R packages installed.

It runs on Google Cloud Platform in the ucb-datahub-2018 project. You can see all config for it under deployments/datahub.

Classes

• The big data8 class.
• Active connector courses
• Data Science Modules
• Astro 128/256

This hub is also the ‘default’ when folks wanna use a hub for a short period of time for any reason without super specific requirements.

Prob140 Hub

A hub specifically for prob140. Some of the admin users on DataHub are students in prob140; on a shared hub, this would allow them to see the work of other prob140 students. Hence, this hub is separate until JupyterHub gains features around restricting admin use.

It runs on Google Cloud Platform in the ucb-datahub-2018 project. You can see all config for it under deployments/prob140.

Data 100

This hub is for Data 100, which has a unique user and grading environment. It runs on Google Cloud Platform in the ucb-datahub-2018 project. You can see all config for it under deployments/data100.

Data100 also has shared folders between staff (professors and GSIs) and students. Staff, assuming they have been added as admins in config/common.yaml, can see a shared and a shared-readwrite folder. Students can only see the shared folder, which is read-only. Anything that gets put in shared-readwrite is automatically viewable in shared, but as read-only files. The purpose of this is to be able to share large data files instead of having one copy per student.

Data 102

Data 102 runs on Google Cloud Platform in the ucb-datahub-2018 project. You can see all config for it under deployments/data102.

Data8X Hub

A hub for the data8x course on EdX. This hub is open to use by anyone in the world, using LTI Authentication to provide login capability from inside EdX.

It runs on Google Cloud Platform in the data8x-scratch project. You can see all config for it under deployments/data8x.

1.1.4 User Authentication

UC Berkeley uses a Canvas instance, called bcourses.berkeley.edu. Almost all our hubs use this for authentication, although not all yet (issue).

Who has access?

Anyone who can log in to bcourses can log in to our JupyterHubs. This includes all Berkeley affiliates. If you have a working berkeley.edu email account, you can most likely log in to bcourses, and hence to our JupyterHubs.

Students have access for 9 months after they graduate. If they have an incomplete, they have 13 months of access instead.

Non-Berkeley affiliates

If someone who doesn’t have a berkeley.edu account wants to use the JupyterHubs, they need to get a CalNet Sponsored Guest account. This gives people access to bcourses, and hence to all the JupyterHubs.

Troubleshooting

If you can log in to bcourses but not to any of the JupyterHubs, please contact us.

If you cannot log in to bcourses, please contact bcourses support.

1.1.5 Storage Retention Policy

Policy

Criteria

No non-hidden files in the user’s home directory have been modified in the last 12 months.

Archival

1. Zip the whole home directory.
2. Upload it to the Google Drive of a SPA created for this purpose.
3. Share the ZIP file in the Google Drive with the user.

Rationale

Today (6 Feb 2020), we have 18,623 home directories in datahub. Most of these users used datahub in previous semesters, have not logged in for a long time, and will probably never log in again. This costs us a lot of money in disk space - we will have to forever expand disk space.
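
The criteria and the first archival step of the retention policy above could be sketched roughly as follows. This is a minimal illustration, not the script the hubs actually run: the function names is_inactive and archive_home are hypothetical, a month is approximated as 30 days, and files inside hidden directories are assumed to count as hidden. Uploading to Google Drive and sharing the ZIP with the user are left out.

```python
import os
import shutil
import time

# Assumption: a month is approximated as 30 days.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60


def is_inactive(home_dir, months=12, now=None):
    """Return True if no non-hidden file under home_dir was modified
    within the last `months` months (hypothetical helper)."""
    now = time.time() if now is None else now
    cutoff = now - months * SECONDS_PER_MONTH
    for root, dirs, files in os.walk(home_dir):
        # Assumption: anything inside a dot-directory counts as hidden,
        # so prune those directories from the walk.
        dirs[:] = [d for d in dirs if not d.startswith(".")]
        for name in files:
            if name.startswith("."):
                continue
            # lstat avoids errors on broken symlinks.
            if os.lstat(os.path.join(root, name)).st_mtime >= cutoff:
                return False
    return True


def archive_home(home_dir, dest_dir):
    """Zip the whole home directory (hidden files included) into
    dest_dir and return the path of the ZIP file (hypothetical helper).
    dest_dir should live outside home_dir."""
    user = os.path.basename(os.path.normpath(home_dir))
    return shutil.make_archive(os.path.join(dest_dir, user), "zip", root_dir=home_dir)
```

A caller would loop over the home directories, call is_inactive on each, and pass the matching ones to archive_home before handing the resulting ZIP to whatever performs the upload.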