Introductory Supercomputing Course Objectives

Course Objectives

• Understand basic parallel computing concepts and workflows
• Understand the high-level architecture of a supercomputer
• Use a basic workflow to submit a job and monitor it
• Understand a resource request and know what to expect
• Check basic usage and accounting information

WORKFLOWS

eResearch Workflows

Research enabled by computers is called eResearch. eResearch examples:
– Simulations (e.g. prediction, testing theories)
– Data analysis (e.g. pattern matching, statistical analysis)
– Visualisation (e.g. interpretation, quality assurance)
– Combining datasets (e.g. data mashups, data mining)

A supercomputer is just one part of an eResearch workflow.

eResearch Workflow

A typical eResearch workflow has five stages:
• Collect – kilobytes of parameters, or petabytes of data from a telescope or atomic collider
• Stage – data transfer, from manual transfer of small files to custom data management frameworks for large projects
• Process – a single large parallel job with checkpointing, or many smaller parallel or serial jobs
• Visualise – examining values, generating images or movies, or interactive 3D projections
• Archive – storage of published results, or publicly accessible research datasets

The Pawsey Centre

• Provides multiple components of the workflow in one location
• Capacity to support significant computational and data requirements

[Diagram: instruments and databases feed scratch storage and the supercomputer(s); a visualisation cluster, displays, and archive storage complete the Pawsey Centre workflow.]

Parallelism in Workflows

Exploiting parallelism in a workflow allows us to
• get results faster, and
• break up big problems into manageable sizes.

A modern supercomputer is not a fast processor. It is many processors working together in parallel.

Workflow Example – Cake Baking

Baking a cake involves a sequence of tasks that follow a recipe – just like a computer program!

Some tasks are independent:
• Can be done in any order.
• Can be done in parallel.

Some tasks have prerequisites:
• Must be done sequentially.

Baking a Cake – Staging

• "Staging" the ingredients improves access time.
• Ingredients are "prerequisites".
• Each ingredient does not depend on the others, so they can be moved in parallel, or at different times.
• Supermarket → kitchen pantry → kitchen bench.

Baking a Cake – Process

[Flowchart: stage ingredients → measure ingredients → mix ingredients → bake and cool cake → ice cake are serial steps; preheating the oven and mixing the icing can run in parallel with them.]

Baking a Cake – Finished

Either:
• Sample it for quality.
• Put it in a shop for others to browse and buy.
• Store it for later.
• Eat it!

Clean up! Then make the next cake.

Levels of Parallelism

Coarse-grained parallelism (high level):
• Different people baking cakes in their own kitchens.
• Preheating the oven while mixing ingredients.
Greater autonomy; can scale to large problems and many helpers.

Fine-grained parallelism (low level):
• Spooning mixture into a cupcake tray.
• Measuring ingredients.
Higher coordination requirement; difficult to get many people to help on a single cupcake tray.

How many helpers?

What is your goal – high throughput, to do one job fast, or to solve a grand-challenge problem?

High throughput:
• For many cakes, get many people to bake independently in their own kitchens – minimal coordination.
• Turn it into a production line. Use specialists and teams in some parts.

Doing one job fast:
• Experience, as well as trial and error, will find the optimal number of helpers.
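In computing terms, the same distinction applies to independent tasks in a shell workflow. The following is a minimal sketch of coarse-grained, high-throughput parallelism; the scripts prepare_input.sh and combine_results.sh are hypothetical stand-ins for the independent and dependent steps:

```bash
#!/bin/bash
# Serial: each independent task finishes before the next one starts.
./prepare_input.sh run1
./prepare_input.sh run2
./prepare_input.sh run3

# Parallel: the tasks have no prerequisites on each other,
# so they can run at the same time (coarse-grained parallelism).
./prepare_input.sh run1 &
./prepare_input.sh run2 &
./prepare_input.sh run3 &
wait                      # the combining step has prerequisites: wait for all three

./combine_results.sh run1 run2 run3
```

On a supercomputer the same pattern is usually expressed as many independent jobs (or a job array) rather than background shell processes.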
When to use supercomputing

Workflows need supercomputing when the resources of a single laptop or workstation are not sufficient:
• The program takes too long to run
• There is not enough memory for the program
• The dataset is too large to fit on the computer

If you are unsure whether moving to a supercomputer will help, ask for help from Pawsey staff.

SUPERCOMPUTERS

Abstract Supercomputer

At a high level, a supercomputer consists of login nodes, data mover nodes, a scheduler, compute nodes, and high performance storage.

Login Nodes

• Remote access to the supercomputer
• Where users should manage workflows
• Many people (~100) share a login node at the same time. Do not run your programs on the login nodes!
• Use the login nodes to submit jobs to the queue to be executed on the compute nodes
• Login nodes can have different hardware to the compute nodes.
• Some build tests may fail if you try to compile on login nodes.

Compute Nodes

• Programs should run on the compute nodes.
• Access is provided via the scheduler.
• Compute nodes have a fast interconnect that allows them to communicate with each other.
• Jobs can span multiple compute nodes.
• Individual nodes are not that different in performance from a workstation.
• Parallelism across compute nodes is how significant performance improvements are achieved.

Inside a Compute Node

Each compute node has one or more CPUs:
• Each CPU has multiple cores
• Each CPU has memory attached to it
Each node has an external network connection. Some systems have accelerators (e.g. GPUs).

High Performance Storage

Fast storage "inside" the supercomputer:
• Temporary working area
• Might have local node storage
  – Not shared with other users
• Usually have global storage
  – All nodes can access the filesystems
  – Either directly connected to the interconnect, or via router nodes
  – The storage is shared: multiple simultaneous users on different nodes will reduce performance

Data Mover Nodes

Externally connected servers that are dedicated to moving data to and from the high performance filesystems.
• Data movers are shared, but most users will not notice.
• Performance depends on the other end, distance, encryption algorithms, and other concurrent transfers.
• Data movers see all the global filesystems.
• At Pawsey, the data mover address is hpc-data.pawsey.org.au.

Scheduler

The scheduler feeds jobs into the compute nodes. It has a queue of jobs and constantly optimises how to get them all done. As a user, you interact with the queues; the scheduler runs jobs on the compute nodes on your behalf. (A sketch of a minimal batch job script is given at the end of this section.)

PAWSEY SYSTEMS

Pawsey Supercomputers

• Magnus – Cray XC40: 1488 nodes, 24 cores/node, 64 GB RAM/node; Aries interconnect.
• Galaxy – Cray XC30: 472 nodes, 20 cores/node, 64 GB RAM/node; Aries interconnect, plus 64 K20X GPU nodes.
• Zeus – HPE cluster: 90 nodes, 28 cores/node, 98–128 GB RAM/node; plus 20 vis nodes, 15 K20/K40/K20X GPU nodes, 11 P100 GPU nodes, 80 KNL nodes, and one 6 TB node.

• Magnus should be used for large, parallel workflows.
• Galaxy is reserved for the operation of the ASKAP and MWA radio telescopes.
• Zeus should be used for small parallel jobs, serial or single-node workflows, GPU-accelerated jobs, and remote visualisation.

Filesystems

Filesystems are areas of storage with very different purposes.
• Some are large, some are small.
• Some are semi-permanent, others temporary.
• Some are fast, some are slow.
High performance storage should not be used for long-term storage.
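As a concrete illustration of the submit-and-monitor workflow, here is a minimal sketch of a batch job, assuming a SLURM-style scheduler as used at Pawsey; the job name, node and task counts, partition name, and program name are placeholders to adapt to your own resource request:

```bash
#!/bin/bash --login
#SBATCH --job-name=myjob        # name shown in the queue
#SBATCH --nodes=2               # jobs can span multiple compute nodes
#SBATCH --ntasks-per-node=24    # e.g. one task per core on a 24-core node
#SBATCH --time=01:00:00         # wall-time limit of the resource request
#SBATCH --partition=workq       # queue/partition name is site-specific

# The scheduler runs this script on the compute nodes on your behalf;
# srun launches the parallel program across the allocated nodes.
srun ./my_parallel_program
```

Submit the script from a login node with `sbatch job.sh`, monitor it with `squeue -u $USER`, and review usage of completed jobs with `sacct`.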
Scratch Filesystem

High performance Lustre filesystem for large data transfers.
• Large capacity: 3 PB, or 1.98 billion files
• Shared by all users, with no quotas
• Temporary space for queued, running, or recently finished jobs

Your directory is /scratch/projectname/username
• The path is available in the $MYSCRATCH environment variable
• Files on /scratch may be purged after 30 days
• Touching files to avoid purging violates Pawsey's conditions of use and will result in suspension of accounts
• Please delete your files before they are purged

Group Filesystem

High performance Lustre filesystem for shared project files, such as data sets and executables.
• Not as fast as /scratch – copy data to /scratch before accessing it from the compute nodes (see the staging sketch at the end of this section).

Your directory is /group/projectname
• The path is available in the $MYGROUP environment variable
• Default 1 TB quota: `lfs quota -g projectname /group`
• Make sure files are set to your project group, not your username group.
• Use `fix.group.permission.sh projectname` if needed

Home Filesystem

You can use your home directory to store environment configurations, scripts, and small input files.
• Small quota (about 10 GB): `quota -s -f /home`
• Uses NFS, which is very slow compared to Lustre
• Where possible, use /group instead

Your directory is /home/username
• The path is available in the $HOME environment variable

LOGGING IN

Command line SSH

Within a terminal window, type:
ssh username@hostname
where hostname is the login node address of the system you have been granted access to.

Common terminal programs:
• Windows: use MobaXterm (download)
• Linux: use xterm (preinstalled)
• OS X: use Terminal (preinstalled) or xterm (download)

Account Security

• SSH uses fingerprints to identify computers, so you don't give your password to someone pretending to be the remote host.
• We recommend that you set up SSH key authentication to increase the security of your account: https://support.pawsey.org.au/documentation/display/US/Logging+in+with+SSH+keys
• Do not share your account with others; this violates the conditions of use (have the project leader add them to the project).
• Please do not provide your password in help desk tickets. We never ask for your password via email, as it is not secure.

Graphical Interfaces

• Remote graphical interfaces are available for some tasks: graphical debuggers, text editors, remote visualisation.
• Low performance over long distances.
• Very easy to set up for Linux, Mac, and MobaXterm clients – add the -X flag to ssh:
ssh -X username@hostname
• For higher-performance remote GUIs, use VNC: https://support.pawsey.org.au/documentation/display/US/Visualisation+documentation

Firewalls

• Firewalls protect computers on the local network by only allowing particular traffic in or out to the internet.
• Most computers
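Putting the filesystem paths and the data mover nodes together, a typical staging step might look like the following minimal sketch; the project name, username, and local directory are placeholders, and hpc-data.pawsey.org.au is the data mover address mentioned earlier:

```bash
# From your own machine: stage input data onto /scratch through the
# data mover nodes (not the login nodes).
rsync -av --progress input_data/ \
    username@hpc-data.pawsey.org.au:/scratch/projectname/username/input_data/

# On the supercomputer: confirm where your directories are and check quotas.
echo "$MYSCRATCH"
echo "$MYGROUP"
lfs quota -g projectname /group   # group quota (default 1 TB)
quota -s -f /home                 # home quota (about 10 GB)
```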