
Supercomputers: Queue and Job management
HORT 59000, Lecture 6
Instructor: Kranthi Varala

Client/server architecture

[Diagram: User1–User4 connect over the network to a central server (file, web, etc.).]

When to use

• Need to run hundreds to thousands of similar jobs.

• Need to run a few large jobs quickly.

• Tasks can be divided into smaller portions and run in parallel.

Parallelization

• Refers to dividing a large task into smaller parts that can all be run in parallel.

• E.g., Correlation matrix of 10,000 genes.

• Can be divided into 10,000 jobs where each job works on one gene.
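The per-gene split described above can be sketched as a submission loop. This is only a sketch: the script name correlate_gene.sh and the GENE variable are hypothetical, and the qsub command is echoed rather than executed so the sketch can run anywhere:

```shell
#!/bin/sh
# Sketch: split the 10,000-gene correlation into one job per gene.
# "correlate_gene.sh" is a hypothetical per-gene job script; on a real
# cluster the echo below would be replaced by the qsub command itself.
for i in $(seq 1 3)   # in practice: seq 1 10000
do
    echo "qsub -v GENE=$i correlate_gene.sh"
done
```

Each iteration submits an independent job, so the scheduler can spread the 10,000 pieces across whatever nodes are free.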

• Need to balance the increase in computational power against the increased need for communication between parts.

Parallelization: Ideal vs. Real

By Raul654 (Own work) [CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons

Supercomputer architecture

[Diagram: head node connected to compute nodes 1–8; storage servers provide home and scratch space.]

Supercomputer architecture

[Diagram: head node connected to compute nodes 1–8; storage servers provide home, scratch, and archive space.]

Terminology

• Cluster: Complex of multiple “nodes” + connecting hardware + storage

• Node: Individual computer in the cluster with its own processors and memory.

• Core: Individual processor on a node.

Head Node

• Specialized node that handles user accounts, logins, and network storage.

• Built to handle lots of user interactions and to run the job management software.

• NOT optimized to run compute intensive jobs.

• Typically interacts with archival storage.

Compute Node

• Multiple identical nodes that handle the heavy computational load and large-memory jobs.

• NOT optimized to support user interaction.

• Typically has access to home and scratch space only.

Homogeneous vs. Heterogeneous nodes

• Homogeneous: All compute nodes have identical software and hardware.

• Heterogeneous: Different kinds of compute nodes with the same software but large differences in hardware.

• Heterogeneous clusters support multiple job types, such as compute-intensive vs. memory-intensive vs. I/O-intensive jobs.

Job Management

• User submits “jobs” to the cluster with specific requests for cores, memory, walltime, etc.

• Once submitted, jobs are allocated to nodes by the job management software.

• Each job creates its own login shell and runs within it.

• E.g., PBS, SGE, SLURM, etc.

PBS Job Management

• Portable Batch System (PBS) is a very common job management system, used on all Purdue RCAC clusters.

• Each “job” is essentially a shell script that has a series of commands.
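A minimal sketch of such a job script, with illustrative resource values (the directive numbers, the cd target, and the echoed message are made up for this example):

```shell
#!/bin/sh -l
# Minimal PBS job script sketch; the resource values below are illustrative.
#PBS -l nodes=1:ppn=4        # 1 node, 4 cores
#PBS -l walltime=01:00:00    # expected run time: 1 hour
#PBS -l mem=8gb              # total memory estimate

# The commands below run on the allocated compute node, top to bottom:
cd "$PBS_O_WORKDIR"          # PBS sets this to the submission directory
echo "Hello from job $PBS_JOBID"
```

Submitted with, e.g., qsub myjob.sh (filename illustrative); qsub prints the ID assigned to the job.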

• Jobs are allocated based on resources available to the user.

PBS resources

• Nodes: No. of nodes requested for the job.

• Cores: No. of cores requested for the job. Total number cannot exceed the sum of cores available on all nodes requested.

• Memory: Total memory required for the job (estimate).

• Walltime: Expected run time for the job.

• NOTE: A job will be terminated if it exceeds the time requested and/or runs out of memory.

PBS queues

• Queues are how PBS manages job submission.

• Each queue has a set of properties: no. and/or types of nodes available to it, max. run time, user access permissions, priorities, etc.

• Each job is submitted by the user to a specific queue.

• When a node that fits the user's requirements becomes available, the job in the queue is run.
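Submitting to a particular queue can be sketched as below; the queue name standby and the script name myjob.sh are illustrative, and these commands require a PBS cluster to actually run:

```shell
qsub -q standby myjob.sh   # submit the script to the (illustrative) "standby" queue
qstat -a                   # check the status of queued and running jobs
```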

• Multiple queues may be available to you.

• Get a list of available queues by using the PBS command qlist.

• qstat -a will list the status of the queue.

Scholar queues

• Three queues are available.

• Our jobs go to the scholar queue by default.

PBS environment variables

• Job environment variables are specific to the job and exist only while the job is running.

• PBS_JOBID : ID assigned to the current job.

• PBS_O_WORKDIR : The directory from which the job was submitted.

Example PBS job

#!/bin/sh -l
#PBS -l nodes=1:ppn=24
#PBS -l walltime=24:00:00
#PBS -l naccesspolicy=singleuser
#PBS -q kvarala

cd /scratch/brown/kvarala/EvoNet
mpirun -np 24 /home/kvarala/bin/examl-AVX -t RAxML_parsimonyTree.StartingTree -m PSR -s EvoNet.m10.binary -n T1 > ExaML.log

Example SGE job

#!/bin/bash
#$ -N run_blastp
#$ -cwd
#$ -pe smp 8
#$ -l h_vmem=4G

blastp -a 8 -i infile.fasta -d database.db -o outfile

Example SLURM job

#!/bin/bash
#SBATCH --nodes 1
#SBATCH --tasks-per-node 12
#SBATCH -t 4:00:00
#SBATCH --mem 8GB

cd /scratch/kv15/subCluster
makeblastdb -dbtype 'prot' -in Gymno.Family1 -title GYMFAM1 -out GYMFAM1 -parse_seqids
blastp -db GYMFAM1 -query Gymno.Family1 -outfmt 6 -evalue 0.01 -num_threads 12 -out GYMFAM1.blastout

Modules

• Pre-installed software that is loaded when needed.

• E.g., module load blastall

• module avail shows the list of all modules available; module list shows the modules currently loaded.

• Only load the modules you need to run the current job.
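A typical sequence looks like the following; the blastall module name comes from the example above, and module availability varies by cluster, so these commands assume an Environment Modules setup:

```shell
module avail            # list every module installed on the cluster
module load blastall    # put the BLAST binaries on your PATH
module list             # confirm which modules are currently loaded
module unload blastall  # drop the module when it is no longer needed
```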