Introduction to High-Performance Computing: Using Clusters to Speed up Your Research


Alex Razoumov ([email protected])
WestGrid 2021

Slides and data files at http://bit.ly/introhpc
- the link will download a file introHPC.zip (~3 MB)
- unpack it to find codes/ and slides.pdf (a fetch-and-unpack sketch follows the outline below)
- a copy is also on our training cluster in ~/projects/def-sponsor00/shared

Workshop outline

To work on the remote cluster today, you will need:
1. an SSH (secure shell) client
   - often pre-installed on Linux/Mac
   - on Windows, e.g. MobaXterm Home Edition (http://mobaxterm.mobatek.net)
2. one of:
   - a guest account on our training cluster, or
   - a Compute Canada (CC) account, plus being added to the Slurm reservation (if one is provisioned) on a production cluster

Topics:
- cluster hardware overview: CC national systems
- basic tools for cluster computing: logging in, transferring files, the Linux command line, editing remote files
- programming languages and tools: overview of languages from an HPC standpoint, parallel programming environments, available compilers, a quick look at OpenMP, MPI, Chapel, Dask, and make
- working with the Slurm scheduler: software environment and modules; serial, shared-memory, distributed-memory, GPU, hybrid, and interactive jobs
- best practices (common mistakes)
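As referenced above, here is a minimal sketch of fetching and unpacking the workshop files on Linux/Mac. It assumes wget and unzip are installed and that the short link redirects to the archive; on the training cluster the files are already available under ~/projects/def-sponsor00/shared.

  $ wget -O introHPC.zip http://bit.ly/introhpc   # save the ~3 MB archive behind the short link
  $ unzip introHPC.zip                            # unpacks codes/ and slides.pdf
  $ ls codes/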
Hardware overview

Added to CC's national infrastructure in the past several years:
- Arbutus @UVic is an extension to West Cloud (in production since September 2016)
- general-purpose clusters: Cedar @SFU and Graham @UofWaterloo (both in production since June 2017), Béluga @McGill (in production since April 2019)
- large parallel cluster: Niagara @UofToronto (in production since April 2018)

HPC cluster overview:
- mostly off-the-shelf components for individual nodes, everything rack-mounted
- typically hundreds of nodes, wired by a fast interconnect
- shared vs. distributed memory
- login vs. compute nodes
- compute nodes: CPU-only nodes and GPU (accelerator) nodes
- job scheduler
- development/visualization nodes

General-purpose clusters for a variety of workloads: cedar.computecanada.ca, graham.computecanada.ca, beluga.computecanada.ca
(specs at https://docs.computecanada.ca/wiki/{Cedar,Graham,Beluga})

Cedar:
- processor count: 101,424 CPUs and 1,352 GPUs
- interconnect: 100Gb/s Intel OmniPath, non-blocking to 1024 cores
- base nodes: 576 nodes (125 GB, 32 cores); 640 SL nodes (187 GB, 48 cores); 768 CL nodes (187 GB, 48 cores)
- large-memory nodes: 128 nodes (250 GB, 32 cores); 24 nodes (502 GB, 32 cores); 24 nodes (1510 GB, 32 cores); 4 nodes (3022 GB, 32 cores)
- GPU nodes: 114 nodes (125 GB, 24 cores, 4 NVIDIA P100 Pascal 12 GB GPUs); 32 nodes (250 GB, 24 cores, 4 NVIDIA P100 Pascal 16 GB GPUs); 192 CL nodes (187 GB, 32 cores, 4 NVIDIA V100 Volta 32 GB GPUs, NVLink connection within each node)

Graham:
- processor count: 41,548 CPUs and 520 GPUs
- interconnect: Mellanox FDR 56Gb/s and EDR 100Gb/s InfiniBand, non-blocking to 1024 cores
- base nodes: 903 nodes (125 GB, 32 cores); 72 CL nodes (192 GB, 44 cores)
- large-memory nodes: 56 nodes (250 GB, 32 cores); 24 nodes (502 GB, 32 cores); 3 nodes (3022 GB, 64 cores)
- GPU nodes: 160 nodes (124 GB, 32 cores, 2 NVIDIA P100 Pascal 12 GB GPUs); 7 SL nodes (178 GB, 28 cores, 8 NVIDIA V100 Volta 16 GB GPUs); 6 SL nodes (192 GB, 16 cores, 4 NVIDIA T4 Turing 16 GB GPUs); 30 CL nodes (192 GB, 44 cores, 4 NVIDIA T4 Turing 16 GB GPUs)

Béluga:
- processor count: 34,880 CPUs and 688 GPUs
- interconnect: Mellanox EDR 100 Gb/s InfiniBand
- base nodes: 172 SL nodes (92 GB, 40 cores); 516 SL nodes (186 GB, 40 cores)
- large-memory nodes: 12 SL nodes (752 GB, 40 cores)
- GPU nodes: 172 nodes (186 GB, 40 cores, 4 NVIDIA V100 Volta 16 GB GPUs)

Notes:
- all nodes have on-node SSD storage
- BW = Broadwell, SL = Skylake, CL = Cascade Lake
- the CPU type can be specified with --constraint=broadwell, --constraint=skylake, or --constraint=cascade, depending on the cluster
- GPUs can be requested with --gres=gpu:{p100,p100l,v100l}:count (on Cedar) and --gres=gpu:{v100,t4}:count (on Graham); see the job-script sketch after the Niagara summary below

Large parallel cluster: niagara.computecanada.ca
- purpose: large parallel jobs, ideally ≥1,000 cores, with an allocation
- specs: https://docs.computecanada.ca/wiki/Niagara and https://docs.scinet.utoronto.ca/index.php/Niagara_Quickstart
- processor count: 80,640 CPUs and no GPUs
- interconnect: EDR InfiniBand (Dragonfly+, completely connected topology, dynamic routing), 1:1 up to 432 nodes, effectively 2:1 beyond that
- base nodes: 2,016 SL or CL nodes (202 GB, 40 cores)

Niagara notes:
- no local disk; nodes boot off the network; small RAM filesystem
- SL = Skylake, CL = Cascade Lake
- authentication via CC accounts; you need to request access
- scheduling is by node (in multiples of 40 cores)
- users with an allocation: job sizes up to 1000 nodes and a 24h maximum runtime
- users without an allocation: job sizes up to 20 nodes and a 12h maximum runtime
- maximum number of jobs per user: 50 running, 150 queued
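The --gres and --constraint options listed above go into a Slurm job script. The following is only a minimal sketch: the walltime, core count, memory request, and program name are placeholders, and the available GPU type strings differ between clusters.

  #!/bin/bash
  #SBATCH --time=0:30:0            # placeholder walltime
  #SBATCH --cpus-per-task=6        # placeholder CPU core count
  #SBATCH --mem=32000M             # placeholder memory request
  #SBATCH --gres=gpu:v100l:1       # one V100 (32 GB) GPU on Cedar; on Graham use v100 or t4
  #SBATCH --constraint=cascade     # optionally pin the CPU generation
  ./my_gpu_code                    # placeholder for your own executable

You would submit such a script with sbatch from a login node; job submission itself is covered in the scheduling part of the workshop.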
Accessing resources: RAS vs. RAC

~20% of compute cycles are available via the Rapid Access Service (RAS):
- available to all CC users via the default queues
- you can start using it as soon as you have a CC account
- a shared pool, with resources allocated via a "fair share" mechanism
- sufficient to meet the computing needs of many research groups

~80% of compute cycles are allocated via the annual Resource Allocation Competitions (RAC):
- apply if you need more than 50 CPU-years or 10 GPU-years
- only PIs can apply; the allocation is per research group
- announced in the fall of each year via email to all users
- 2020 RAC: 590 applications, with a success rate (awarded/requested) of 40% for CPUs, 26% for GPUs, 86% for storage, 99% for vCPUs

File systems: a wide array of options for different uses
Details at https://docs.computecanada.ca/wiki/Storage_and_file_management

$HOME
- quotas: ~50 GB and 500k files per user (100 GB and 250k files on Niagara)
- backed up nightly (latest snapshot); not purged
- medium performance; mounted on compute nodes
- intended use: source code, parameter files, scripts

$SCRATCH
- quotas: 20 TB¹ and 1000k files per user (25 TB¹ and 6M files on Niagara)
- not backed up; purged
- high performance for large files; mounted on compute nodes
- intended use: intensive I/O on large (100+ MB) files

/project
- quotas: 1 TB and 500k files per project (could be increased via RAC); on Niagara only via RAC
- backed up nightly; not purged
- medium performance; mounted on compute nodes
- intended use: static data (long-term disk storage)

/nearline (tape archive)
- quotas: only via RAC
- not backed up; not purged
- medium-to-low performance; not mounted on compute nodes
- intended use: archival storage

/localscratch
- no quotas
- not backed up; purged
- very high performance, local to each compute node
- intended use: temporary storage, e.g. for many small files

¹ except when full

Additional notes:
- on Cedar you are no longer allowed to submit jobs from $HOME
- for frequent I/O, use the on-node SSD ($SLURM_TMPDIR → /localscratch/${USER}.${SLURM_JOBID}.0) or the RAM disk ($TMPDIR → /tmp); don't forget to move files out before your job terminates!
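The linked page above has the recommended script examples; the lines below are only a minimal sketch of the staging pattern through $SLURM_TMPDIR, with a hypothetical input archive, output directory, executable, and resource requests.

  #!/bin/bash
  #SBATCH --time=3:00:0                    # placeholder walltime
  #SBATCH --mem-per-cpu=3600M              # placeholder memory request
  cp $SCRATCH/input.tar $SLURM_TMPDIR/     # hypothetical input archive: copy it onto the on-node SSD
  cd $SLURM_TMPDIR
  tar xf input.tar                         # unpack on the fast local disk
  ./my_analysis                            # hypothetical executable that does its intensive I/O here
  tar cf results.tar output/               # hypothetical output directory
  cp results.tar $SCRATCH/                 # copy results back before the job ends: $SLURM_TMPDIR is wiped afterwards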
- script examples at https://bit.ly/3feXwfC (a minimal staging sketch follows this slide)
- to check disk usage: the quota command (aliased to diskusage_report)
- to request more storage: [email protected] for small increases, RAC for large requests

Basic tools for working with a cluster

Logging into the systems

On Mac or Linux, in a terminal:
  $ ssh [email protected]   # training cluster guest account (where XXX=001..118)
  $ ssh [email protected]   # CC account on Cedar/Graham/Béluga/Niagara

On Windows there are many options:
- MobaXterm: https://docs.computecanada.ca/wiki/Connecting_with_MobaXTerm
- PuTTY: https://docs.computecanada.ca/wiki/Connecting_with_PuTTY
- the Secure Shell extension (https://bit.ly/2PxEQww) in the Chrome browser
- bash from the Windows Subsystem for Linux (WSL) in Windows 10 (you need to enable developer mode and then WSL)

SSH key pairs are very handy and save you from typing passwords:
- they imply secure handling of private keys and non-empty passphrases; see https://docs.computecanada.ca/wiki/Securing_your_account
- https://docs.computecanada.ca/wiki/SSH_Keys (scroll to the 2-min videos at the bottom)
- https://docs.computecanada.ca/wiki/Using_SSH_keys_in_Linux
- https://docs.computecanada.ca/wiki/Generating_SSH_keys_in_Windows

GUI connections: X11 forwarding (through ssh), VNC, x2go
Client-server workflows in selected applications, both on login and compute nodes

Linux command line

All our systems run Linux (CentOS 7), so you need to know the basic command line:
- there is a separate "Bash command line" session in our schools
- attend a Software Carpentry bash session: https://software-carpentry.org/workshops
- lots of tutorials online, e.g. tutorials 1-4 at http://bit.ly/2vH3j8v

Much typing can be avoided by using bash aliases, functions, ~/.bashrc, and hitting TAB (a small ~/.bashrc sketch follows the command summary below).

FILE COMMANDS
  ls                        directory listing
  ls -alF                   example of passing arguments (flags) to a command
  cd dir                    change directory to dir
  command > file            redirect command output to file
  command >> file           append command output to file
  more file                 page through the contents of file

SEARCHING
  grep pattern files        search for pattern in files
  command | grep pattern    example of a pipe
  find .
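As referenced above, a few lines in ~/.bashrc can save a lot of typing. These entries are purely illustrative: the alias names and the helper function are hypothetical, not part of the course material.

  # in ~/.bashrc on the cluster
  alias lt='ls -lhtr'                      # hypothetical alias: long listing, newest files last
  alias scr='cd $SCRATCH'                  # hypothetical shortcut to jump to your scratch space
  function ij() {                          # hypothetical helper: ask Slurm for a short interactive job
      salloc --time=1:00:00 --cpus-per-task=$1 --mem-per-cpu=3600M
  }

After editing ~/.bashrc, run source ~/.bashrc (or log in again) for the changes to take effect; TAB completion works without any setup.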
