Parallel Processing and Load Balancing Robert Bukowski Institute of Biotechnology Bioinformatics Facility (aka Computational Biology Service Unit - CBSU)

Workshop website: https://biohpc.cornell.edu/ww/1/Default.aspx?wid=136

Contact: [email protected]

Motivation

 Can you solve a ‘big task’ on your laptop? Not really…

. too small: not enough memory, not enough disk to store big data . too slow: analysis would take forever

 You need a more powerful resource

. bigger: more memory, more disk . FASTER!!!

 What does FASTER mean?

. faster processor (and other hardware) – yes, but first of all….
. MORE processors !!!
. knowledge of how to use it all

BioHPC rental resources

Server type    #servers  #cores  RAM [GB]
interactive           4       4        24
general              32       8        16    (your workshop machines)
medium gen1           1      16        64
medium gen1          16      24       128
medium gen2          12      40       256
large gen1            8      64       512
large gen2            2      96       512
large gen2            4     112       512
large gen2            2      80       512
large gen2            3      88       512
extra-large           1      64     1,024
extra-large           1     112     1,024
extra-large           1      88     1,024
GPU gen2              2      32       256
Total                89   3,056    19,104

Big picture

Given:

 ‘big task’ at hand

 multiple CPUs, RAM, and disk storage, possibly scattered across multiple networked computers

Objective:

 Parallelize: solve the ‘big task’ in time shorter than it would take using a single CPU on a single computer

 Balance load: keep resources busy, but not overloaded

Synopsis

 Some basic hardware facts

 Some basic software facts

 Parallelization: problems and tools

 Monitoring and timing processes

 Multiple independent tasks

. Load balancing

Next week:  Advanced load balancing using job scheduler (SLURM = Simple Linux Utility for Resource Management)

Hands-on exercises: will introduce some tools and techniques (although quantitative conclusions are doubtful in a shared environment…)

Resources on a single machine (here: cbsumm12)

[Diagram: a dual-socket server. CPU 1 and CPU 2 (each in a 'slot' or 'socket') sit on the motherboard; each CPU contains multiple cores and a 15 MB cache, and each has 64 GB of RAM attached. Local disk storage (/workdir) is fast; network disk storage (/home) is slow.]

CPU: an integrated circuit (a “chip”) containing computational hardware. May be more than one per server, typically 2-4.

Core: a subunit of CPU capable of executing an independent sequence of instructions (a thread). Shares communication infrastructure and internal memory with other cores on the CPU. # threads possible to run at the same time = # cores

Hyperthreading (HT): technology to simultaneously run several (typically two) independent sequences of instructions (threads) on each core, sharing the core's hardware; may be disabled or enabled. If HT is enabled, 'core' is understood as a hyperthreaded core. In this example, with HT enabled, # cores = 24.

RAM: memory. All accessible to all cores (but easier to access the CPU's 'own' portion); Cache: fast-access (but small) memory 'close' to the CPU

Check on CPU configuration with lscpu

[root@cbsumm12 ~]# lscpu

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24          <-- hyperthreaded cores
On-line CPU(s) list:   0-23
Thread(s) per core:    2           <-- hyperthreading ON
Core(s) per socket:    6
Socket(s):             2           <-- 2 CPUs
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 45
Model name:            Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
Stepping:              7
CPU MHz:               2500.122
CPU max MHz:           2500.0000
CPU min MHz:           1200.0000
BogoMIPS:              4000.35
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              15360K
NUMA node0 CPU(s):     0-5,12-17
NUMA node1 CPU(s):     6-11,18-23
Flags:                 fpu vme de pse tsc …
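A minimal sketch (standard util-linux/coreutils tools; not from the original slides) for telling logical (hyperthreaded) cores apart from physical cores on the machine you are logged into:

nproc                                                   # logical CPUs the OS can schedule threads on
lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l   # unique physical (core,socket) pairs

If the first number is twice the second, hyperthreading is enabled on this machine.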

Check memory using free

The -m flag shows sizes in Mbytes (other options: -k, -g):

[bukowski@cbsumm12 ~]$ free -m
              total        used        free      shared  buff/cache   available
Mem:         128738        1772      123220        1550        3746      124575
Swap:          4095         836        3259

Not always correct, unfortunately…

Check disk storage using df

[root@cbsumm12 ~]# df -h
Filesystem                                        Size  Used  Avail  Use%  Mounted on
/dev/mapper/rhel_cbsumm12-root                    300G   32G   269G   11%  /
devtmpfs                                           63G     0    63G    0%  /dev
tmpfs                                              63G     0    63G    0%  /dev/shm
tmpfs                                              63G  2.1G    61G    4%  /run
tmpfs                                              63G     0    63G    0%  /sys/fs/cgroup
/dev/md126                                        871G   72M   827G    1%  /SSD
/dev/mapper/rhel_cbsumm12-home                    3.4T  130G   3.3T    4%  /local
/dev/sda2                                         494M  144M   351M   30%  /boot
tmpfs                                              13G   16K    13G    1%  /run/user/42
tmpfs                                              13G     0    13G    0%  /run/user/0
128.84.180.177@tcp1:128.84.180.176@tcp1:/lustre1  1.3P 1003T   300T   78%  /home
cbsugfs1:/home                                    233T  135T    99T   58%  /glusterfs/home
tmpfs                                              13G     0    13G    0%  /run/user/4857

Network-mounted storage (e.g., /home, /glusterfs/home): slow, permanent – NO I/O-intensive computations there!
Local scratch space (e.g., /SSD, /local): fast, temporary

Check other hardware using lspci

PCI = Peripheral Component Interconnect

Most devices are attached this way

lspci produces long output, better paginate or filter, e.g.,

[root@cbsumm12 ~]# lspci | grep -i raid
00:1f.2 RAID bus controller: Intel Corporation C600/X79 series chipset SATA RAID Controller (rev 06)

[cryosparc_user@cbsugpu03 ~]$ lspci | grep -i nvidia
02:00.0 3D controller: NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB] (rev a1)
83:00.0 3D controller: NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB] (rev a1)

What's running on a machine

 Everything 'running on a machine' (apps run by users, OS tasks) does this by means of processes

 Process: instruction sequence loaded into memory, to be executed by CPU cores, using some memory to store code (text) and data, and communicating with peripherals (disk storage, network, …)

 Templates for processes are stored on disk as executable files

 A process may contain one or more threads (multithreading), all with access to the same data (but not to data of other processes)

 Each process has a unique process ID (and so do individual threads)

 Each process is created by another process - its parent process (thus, there is a process tree)

 Each process (with all its threads) runs on a single machine

[Diagram: Process 1 with one thread, Process 2 with two threads, Process 3 with four threads; each process has its own text and data]

Cores and processes: mixing it all together

 At any given time, a core can be . executing one thread . idle

 At any given time, a thread can be . running on one of the cores . waiting off-core (for input or data from memory or disk, or for an available core) . stopped on purpose

 Load: number of threads running + waiting for a core to run on (should not exceed number of cores!)

 Context switches (a quick way to watch load and context switches is sketched after this list)
. a core executes one thread for a while, then switches to another (the state of the previous one is saved so it can be resumed later)
. some threads have higher priority (like quick house-keeping tasks by the OS)
. threads are only allowed to run for some time without being switched out
. frequent context switches are not good for performance (they occur at high load)

 Scheduler (part of the Linux kernel) takes care of distributing threads over cores
. (not to be confused with the SLURM job scheduler discussed in Part 2)
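A minimal sketch (standard procps tools; illustrative only) for watching the load and context-switch rate described above:

uptime          # load average over the last 1, 5, and 15 minutes
vmstat 2 5      # 5 samples, 2 s apart: 'r' = runnable threads, 'cs' = context switches per second

A 1-minute load average persistently above the core count, or a very high 'cs', suggests the machine is overloaded.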

Software structure

User mode (processes):
. User applications (python, bwa, FireFox, ssh, VNC, blast+, …)
. Low-level system components (init, services, logind, networkd, X11, …)
. C standard library (used by processes to communicate with the kernel)

Kernel mode:
. Linux kernel system call interface (SCI) – used by processes
. process scheduling, inter-process communication tools (IPC), memory management, interface to hardware (drivers)

Hardware: CPUs, memory, disk storage, other peripherals

Cores and processes: mixing it all together

 Typically, there are many more threads than cores:

Example: empty (i.e., no users) machine cbsumm12 (24 cores), some time last Saturday:

ps -ef  | wc -l : 596 (all processes)
ps -efL | wc -l : 919 (all threads)

These are mostly processes that keep the OS running: they wait for stuff to help with or clean up, run only when needed, and consume very few CPU cycles and little RAM.

 Despite large number of threads, the load on the machine was very low, and most memory was available:

uptime 10:37:32 up 265 days, 14:57, 3 users, load average: 0.08, 0.21, 1.11

free -m
              total        used        free      shared  buff/cache   available
Mem:         128738        1772      123220        1550        3746      124575

Almost all CPU and memory resources up for grabs by users' programs

Big picture

 Given a ‘big task’ at hand, make multiple CPU cores work in parallel to achieve the solution in time shorter than what would be needed if only a single core were used

 Constraints:

 CPUs and memory possibly scattered over multiple networked machines
 Core number and memory limits on individual machines

 A process (with all its threads and memory) can only run on one machine

 No direct data sharing between processes

Parallelize the problem!

Parallelizing a problem: a silly (but complex) example

Sum up a bunch of numbers (here: from 1 to 8) and calculate the Exp of the sum using 4 threads

[Diagram: pairwise reduction using 4 threads. Threads 1-4 first form the partial sums 3, 7, 11, 15 from the pairs (1,2), (3,4), (5,6), (7,8); after a synchronize/communicate step two threads combine these into 10 and 26 while the others sit idle; another synchronize/communicate step leaves a single thread to compute 36; finally one thread evaluates Exp(36) – the sequential (i.e., non-parallelizable) part, during which the other threads are idle.]

Programmer's perspective: planning complex parallelization

Algorithm design
. Identify parallelizable portions of the problem
. Minimize the sequential (non-parallelizable) part
. Consider/minimize synchronization and inter-thread communication
. Avoid race conditions
. Avoid simultaneous I/O by multiple threads
. Threads organization:
  - Single process with multiple threads
  - Multiple single-threaded processes
  - Multiple multi-threaded processes

Constraints
. CPUs and memory possibly scattered over multiple networked machines
. #threads <= #cores (on each machine)
. Combined memory taken up by all processes not to exceed the machine's total memory
. Storage capacity and access
. A process (with all its threads and memory) can only run on one machine
. No direct data sharing between processes

Programmer's perspective: tools

For complicated algorithms with varying levels of parallelism and communication, programs are typically written using appropriate parallelization tools (libraries of functions). By design, these programs fall into one of the following categories:

 Single multi-threaded process (by far the largest class)
. Sometimes called the shared memory model
. Tools: pthreads, OpenMP
. Advantage: all threads have access to the same memory – no or easy communication
. Disadvantage: can only run on one machine (but really no problem if the machine is huge)

 Multiple single-thread processes
. Sometimes called the distributed memory model
. Tools: Message-Passing Interface (MPI) (implementations: OpenMPI, mpich2)
. Advantage: can run on a single machine and/or across multiple machines
. Disadvantage: no direct access to process memory by other processes – data must be passed using messages – costly, especially between machines

 Multiple multi-threaded processes
. Tools: combination of OpenMP, pthreads, MPI
. Advantage: optimized, high-level parallelism possible
. Advantage: can run on a single machine and/or across multiple machines

Find out how a program is parallelized (easy only for executables using shared libraries)

[root@cbsuxm01 ~]# ldd /programs/bin/blast+/blastx

linux-vdso.so.1 => (0x00007ffd0f79b000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007facee470000)
libz.so.1 => /lib64/libz.so.1 (0x00007facee25a000)
libbz2.so.1 => /lib64/libbz2.so.1 (0x00007facee04a000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007facede46000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x00007facedc2c000)
libm.so.6 => /lib64/libm.so.6 (0x00007faced92a000)
libc.so.6 => /lib64/libc.so.6 (0x00007faced55c000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007faced346000)
/lib64/ld-linux-x86-64.so.2 (0x00007facee68c000)

(libpthread.so.0 indicates multithreading via pthreads)

[root@cbsuxm01 ~]# ldd /programs/discovar/bin/Discovar

linux-vdso.so.1 => (0x00007fff1cd8a000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007fc9ba791000)
libm.so.6 => /lib64/libm.so.6 (0x00007fc9ba48f000)
libgomp.so.1 => /lib64/libgomp.so.1 (0x00007fc9ba269000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fc9ba053000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fc9b9e37000)
libc.so.6 => /lib64/libc.so.6 (0x00007fc9b9a6a000)
/lib64/ld-linux-x86-64.so.2 (0x00007fc9baa98000)

(libgomp.so.1 indicates parallelization via OpenMP)

Find out how a program is parallelized (easy only for executables using shared libraries)

[root@cbsuxm01 ~]# ldd /programs/ima2p/bin/IMa2p

linux-vdso.so.1 => (0x00007ffd843b9000)
libm.so.6 => /lib64/libm.so.6 (0x00007f5100e44000)
libmpi_cxx.so.1 => /usr/lib64/openmpi/lib/libmpi_cxx.so.1 (0x00007f5100c29000)
libmpi.so.12 => /usr/lib64/openmpi/lib/libmpi.so.12 (0x00007f5100945000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f510063e000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f5100428000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f510020c000)
libc.so.6 => /lib64/libc.so.6 (0x00007f50ffe3f000)
/lib64/ld-linux-x86-64.so.2 (0x00007f5101146000)
libopen-rte.so.12 => /usr/lib64/openmpi/lib/libopen-rte.so.12 (0x00007f50ffbc3000)
libopen-pal.so.13 => /usr/lib64/openmpi/lib/libopen-pal.so.13 (0x00007f50ff91f000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f50ff71b000)
librt.so.1 => /lib64/librt.so.1 (0x00007f50ff513000)
libutil.so.1 => /lib64/libutil.so.1 (0x00007f50ff310000)
libhwloc.so.5 => /lib64/libhwloc.so.5 (0x00007f50ff0d3000)
libnuma.so.1 => /lib64/libnuma.so.1 (0x00007f50feec7000)
libltdl.so.7 => /lib64/libltdl.so.7 (0x00007f50fecbd000)

(libmpi.so.12 indicates parallelization via MPI)

Amdahl's Law: More threads not always better

Suppose the total execution time of a program consists of a non-parallelizable part t_seq and a part that can be parallelized, t_par. Then for N threads we have:

Time on a single thread:  T_1 = t_seq + t_par

Time on N threads:  T_N = t_seq + t_par / N   (assuming no communication or other delays)

Speedup on N threads:  S_N = T_1 / T_N  →  1 + t_par / t_seq   for large N
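A worked example with hypothetical numbers (not from the slides): if t_seq = 1 min and t_par = 9 min, then T_1 = 10 min, T_8 = 1 + 9/8 ≈ 2.1 min, so S_8 ≈ 4.7 rather than 8; no number of threads can ever push the speedup past 1 + 9/1 = 10.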

Ideal: linear scaling. Performance deterioration is possible due to sync/communication/I/O.

Example: speedup in BLAST

Using BLAST to search the swissprot database for matches of 10,000 randomly chosen human cDNA sequences (swissprot is a good example of a small memory footprint).

machine        CPUs available  cores available  cores used  time (hrs)  speedup (in machine)
cbsulm10                    4               64          64       0.931                27.506
cbsulm10                    4               64          16       1.962                13.056
cbsulm10                    4               64           1      25.619                 1.000
cbsumm15                    2               24          24       2.058                12.117
cbsumm15                    2               24          12       2.593                 9.616
cbsumm15                    2               24           1      24.930                 1.000
cbsum1c2b008                2                8           8       4.193                 6.717
cbsum1c2b008                2                8           1      28.161                 1.000

Using BLAST to search the nr database for matches of 2,000 randomly chosen human cDNA sequences (nr is a good example of a large memory footprint).

machine        CPUs available  cores available  cores used  time (hrs)  speedup (in machine)
cbsulm10                    4               64          64       10.97                 2.222
cbsulm10                    4               64          16       24.37                 1.000
cbsumm15                    2               24          24       26.10                 2.140
cbsumm15                    2               24          12       55.85                 1.000

Parallelizing a problem: 'embarrassingly parallel' case

File1, File2, File3, File4  →  (gzip, one per file)  →  File1.gz, File2.gz, File3.gz, File4.gz

Single-threaded processes run simultaneously:

gzip File1 &
gzip File2 &
gzip File3 &
gzip File4 &

Simple! No communication, no sync, no sequential part – least susceptible to Amdahl's law

BUT: threads compete for disk access if too many!

File1.fastq, File2.fastq, File3.fastq, File4.fastq  →  (tophat, one per file)  →  File1.bam, File2.bam, File3.bam, File4.bam

Complex multi-threaded processes, independent from one another

Simple! No communication between processes, no sync

BUT

Processes compete for disk access if too many!

Mixed parallelization: running several simultaneous multi-threaded tasks (each processing different data) on a large machine (here: 64-core)

tophat -p 7 -o B_L1-1 --transcriptome-index genome/transcriptome/ZmB73_5a_WGS --no-novel-juncs genome/maize fastq/2284_6063_7073_C3AR7ACXX_B_L1-1_ATCACG_R1.fastq.gz fastq/2284_6063_7073_C3AR7ACXX_B_L1-1_ATCACG_R2.fastq.gz >& B_L1-1.log &
tophat -p 7 -o B_L1-2 --transcriptome-index genome/transcriptome/ZmB73_5a_WGS --no-novel-juncs genome/maize fastq/2284_6063_7076_C3AR7ACXX_B_L1-2_TGACCA_R1.fastq.gz fastq/2284_6063_7076_C3AR7ACXX_B_L1-2_TGACCA_R2.fastq.gz >& B_L1-2.log &
tophat -p 7 -o B_L1-3 --transcriptome-index genome/transcriptome/ZmB73_5a_WGS --no-novel-juncs genome/maize fastq/2284_6063_7079_C3AR7ACXX_B_L1-3_CAGATC_R1.fastq.gz fastq/2284_6063_7079_C3AR7ACXX_B_L1-3_CAGATC_R2.fastq.gz >& B_L1-3.log &
tophat -p 7 -o L_L1-1 --transcriptome-index genome/transcriptome/ZmB73_5a_WGS --no-novel-juncs genome/maize fastq/2284_6063_7074_C3AR7ACXX_L_L1-1_CGATGT_R1.fastq.gz fastq/2284_6063_7074_C3AR7ACXX_L_L1-1_CGATGT_R2.fastq.gz >& L_L1-1.log &
tophat -p 7 -o L_L1-2 --transcriptome-index genome/transcriptome/ZmB73_5a_WGS --no-novel-juncs genome/maize fastq/2284_6063_7077_C3AR7ACXX_L_L1-2_ACAGTG_R1.fastq.gz fastq/2284_6063_7077_C3AR7ACXX_L_L1-2_ACAGTG_R2.fastq.gz >& L_L1-2.log &
tophat -p 7 -o L_L1-3 --transcriptome-index genome/transcriptome/ZmB73_5a_WGS --no-novel-juncs genome/maize fastq/2284_6063_7080_C3AR7ACXX_L_L1-3_ACTTGA_R1.fastq.gz fastq/2284_6063_7080_C3AR7ACXX_L_L1-3_ACTTGA_R2.fastq.gz >& L_L1-3.log &
tophat -p 7 -o S_L1-1 --transcriptome-index genome/transcriptome/ZmB73_5a_WGS --no-novel-juncs genome/maize fastq/2284_6063_7075_C3AR7ACXX_S_L1-1_TTAGGC_R1.fastq.gz fastq/2284_6063_7075_C3AR7ACXX_S_L1-1_TTAGGC_R2.fastq.gz >& S_L1-1.log &
tophat -p 7 -o S_L1-2 --transcriptome-index genome/transcriptome/ZmB73_5a_WGS --no-novel-juncs genome/maize fastq/2284_6063_7078_C3AR7ACXX_S_L1-2_GCCAAT_R1.fastq.gz fastq/2284_6063_7078_C3AR7ACXX_S_L1-2_GCCAAT_R2.fastq.gz >& S_L1-2.log &
tophat -p 7 -o S_L1-3 --transcriptome-index genome/transcriptome/ZmB73_5a_WGS --no-novel-juncs genome/maize fastq/2284_6063_7081_C3AR7ACXX_S_L1-3_GATCAG_R1.fastq.gz fastq/2284_6063_7081_C3AR7ACXX_S_L1-3_GATCAG_R2.fastq.gz >& S_L1-3.log &

(9 instances, 7 threads each – faster than a single tophat -p 63 !)

Common situation in 'end user' bioinformatics

 Instances of complex, multi-threaded applications run concurrently on distinct sets of input data

. Examples: BLAST, bwa, tophat, STAR, Trinity, ….. . applications ‘pre-programmed’ for us by software developers

 What we need to know about each instance of the application

. how to run the application, know/control the number of threads it uses
. memory, disk, disk I/O, time requirements of the application (may depend on the number of threads)
. optimal number of threads for given input data and machine

Run, monitor, observe, extrapolate…

 Load balancing: How to manage multiple instances subject to resource constraints

. (#instances) X (#threads_per_instance) < #cores on each machine
. (memory_per_instance) X (#instances) < total_machine_memory
. competition for I/O bandwidth
. sufficient scratch disk storage

(a small arithmetic sketch of this bookkeeping follows below)
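A minimal bash sketch of the constraint check above, with made-up numbers (64 cores, 512 GB RAM, 8 threads and 40 GB per instance):

CORES=64; MEM_GB=512            # machine totals
THREADS_PER=8; MEM_PER_GB=40    # per-instance needs, estimated from a monitored test run
BY_CPU=$(( CORES / THREADS_PER ))
BY_MEM=$(( MEM_GB / MEM_PER_GB ))
echo "run at most $(( BY_CPU < BY_MEM ? BY_CPU : BY_MEM )) instances at a time"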

Running multi-threaded applications

Parallelism is typically controlled by a program option
. read the documentation to find out if your program has this feature
. look for keywords like "multithreading", "parallel execution", "multiple processors", etc.

A few examples:

blastall -a 8 [other options]

blastx -num_threads 8 [other options]

tophat -p 8 [other options]

cuffdiff -p 8 [other options]

bwa -t 8 [other options]

Remember: speedup is not perfect, so the optimal number of threads needs to be found by trial and error using a subset of the input data.

bowtie -p 8 [other options]

Running multi-threaded applications

blastx -num_threads 2 -db ./databases/swissprot -query seq_tst.fa

 >100% CPU indicates the program is multithreaded
. Multiple threads within a single process rather than multiple processes

What if the number of threads is not specified?

Default number of threads for multi-threaded programs

• Depends on the program's author(s)
• Sometimes 1
• Sometimes equal to the number of cores found on the machine (rather nasty in a shared environment)
• Programs parallelized with OpenMP 'obey' the environment variable OMP_NUM_THREADS

export OMP_NUM_THREADS=10

will make such a program use up to 10 threads (BioHPC default: 1)

• Programs parallelized with Intel's Math Kernel Library (MKL) require the variable MKL_NUM_THREADS (BioHPC default: 1) in addition to OMP_NUM_THREADS (see the sketch after this list)

• Programs parallelized with pthreads: you are at the developer's mercy….
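A hedged example (my_numeric_app is a placeholder name): cap an OpenMP/MKL-parallelized program at 8 threads for the current session before launching it in the background:

export OMP_NUM_THREADS=8    # honored by OpenMP-parallelized code
export MKL_NUM_THREADS=8    # honored by MKL-parallelized code
./my_numeric_app input.dat >& my_numeric_app.log &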

Running MPI applications

Message-Passing Interface (MPI)

 Used to create programs running as multiple interacting processes

 May run across multiple machines (Distributed Memory) – may use huge number of cores (in principle)

 Interaction between processes by sending/receiving messages

. mechanism dependent on where processes are running (one or multiple machines), but generally costly…

 Each MPI process may be multithreaded (i.e., use pthreads and/or OpenMP)

 Various implementations (OpenMPI and mpich2 most popular – both available on the BioHPC cloud)

Running MPI programs

Programs using MPI are started using a launcher program mpirun (some variations on that name are possible, depending on MPI implementation)

Run using 10 processes on the local machine (the one the command is run on)

mpirun -np 10 myprogram >& somefile.log &

To run on multiple machines, construct a file with a list of machines, mymachines, possibly specifying some limits on number of processes to be allowed

cbsum1c1b001 slots=4 max_slots=4
cbsum1c2b003 max_slots=4
cbsum1c2b002 slots=4

NOTE: each MPI process may be multi-threaded!

Then, for example, the command

mpirun -hostfile mymachines -np 14 myprogram >& somefile.log & will launch 4 processes on cbsum1c1b001, 4 more on cbsum1c2b003, and 6 on cbsum1c2b002 (oversubscription possible on this node)

Plenty of other options exist for distributing processes over nodes.

Killing parallel tasks may be tricky

 If the application is running in the background (i.e., with “&”), it can be stopped with the kill command

kill -9 <PID>

where <PID> is the process ID obtained from the ps command. For example,

kill -9 18817

 To kill a parallel application consisting of multiple processes, use the PID of the top parent process, preceded by a dash

kill -9 -18817 (technically, this kills all processes in the process group 18817)

 If some processes are still left over, you may have to track them down (with ps) and kill them individually (a hedged sketch for locating and killing a whole process group follows below)
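A hedged sketch (program name and PIDs are just examples) for locating the process group of a multi-process run and killing it in one shot:

ps -o pid,ppid,pgid,cmd -u $USER | grep tophat    # note the PGID column
kill -9 -- -18817                                 # "--" ensures -18817 is read as a (negated) PGID, not an option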

Monitoring a running task using top

blastx -num_threads 2 -db ./databases/swissprot -query seq_tst.fa

 >100% CPU indicates the program is multithreaded
. Multiple threads within a single process rather than multiple processes

Monitoring a running task using htop

Monitoring a single task using the /usr/bin/time tool

/usr/bin/time -v blastx -db ./databases/swissprot -num_alignments 1 -num_threads 3 -query seq_tst.fa -out seq_tst.fa.hits.txt >& run.log

Command being timed: "blastx -db ./databases/swissprot -num_alignments 1 -num_threads 3 -query seq_tst.fa -out seq_tst.fa.hits.txt"
User time (seconds): 35.86                          <-- 'user' time combined over all threads
System time (seconds): 0.15
Percent of CPU this job got: 292%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:12.31
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 208488          <-- max memory the process used in its lifetime
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 59067
Voluntary context switches: 51
Involuntary context switches: 147
Swaps: 0
File system inputs: 0
File system outputs: 312
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

(the above report is the content of run.log)

Assess I/O activity using iostat

[screenshot: iostat output showing no significant I/O]

Monitoring I/O using the iotop tool

/programs/bin/labutils/iotop -o -u bukowski

No I/O-intensive processes running

Three gzip processes running

[screenshots: iotop columns show the read/write rate and the % of time each process spends waiting for I/O]

Monitoring I/O using the htop tool

Monitoring I/O

 Notoriously hard, because

. most I/O operations are buffered and cached, i.e., go through memory if enough available

. I/O behavior of a single task not always representative of that of concurrent tasks

. performance dependent on disk hardware

. slow on cbsum1c* machines . very fast on the newest machines with NVMEs (SSDs with fast connect)

. performance dependent on data structure (a lot of small files vs few large files)

 Indications of heavy I/O problem: . small %CPU compared to number of threads in top or htop report . large %IO in iotop output (% of time the process spends waiting for I/O operation) . continuously high Read-Write rates in iotop or htop report

Ultimate test: monitor performance as a function of the number of concurrent tasks

Balancing the load: multiple independent tasks

 Suppose we monitored/profiled our application and we already know

. memory needed per instance . optimal number of threads per instance . at least a vague idea about I/O needs per instance

. N - number of instances to be run concurrently

What if the total number of tasks we have is >> N ?

Example: compress 9 files, running at most 3 instances of gzip at a time

Balancing the load: pedestrian way

Example: 9 tasks, 3 at a time

#!/bin/bash

gzip [options] file1 &
gzip [options] file2 &
gzip [options] file3 &

wait

gzip [options] file4 &
gzip [options] file5 &
gzip [options] file6 &

wait

gzip [options] file7 &
gzip [options] file8 &
gzip [options] file9 &

(NOTE: wait makes the script pause until everything started before it has finished)

Not too efficient if compressing different file* takes different amounts of time: each wait has to wait for the slowest of its three instances.

Load balancing using GNU parallel
https://www.gnu.org/software/parallel/

Using a text editor, create a file called (for example) TaskFile (this is NOT a script, just a list of commands to run):

gzip file1
gzip file2
gzip file3
gzip file4
gzip file5
gzip file6
gzip file7
gzip file8
gzip file9

A longer file could be created, for example, using a shell script similar to:

#!/bin/bash
rm -f TaskFile
for i in {1..3000}
do
  echo gzip file${i} >> TaskFile
done

Load balancing using the GNU parallel tool

Then run the command (assuming the TaskFile and all file* files are in the current directory)

parallel -j NP < TaskFile >& log &

where NP is the number of instances to use (e.g., 3)

 parallel will execute tasks listed in TaskFile using up to NP instances at a time
. The first NP tasks will be launched simultaneously
. The (NP+1)th task will be launched right after one of the initial ones completes and a core becomes available
. The (NP+2)nd task will be launched right after another core becomes available
. …… etc., until all tasks are distributed

 Only up to NP tasks are running at a time (less at the end)

 All NP cores always kept (on average) busy (except near the end of the task list) – Load Balancing

GNU parallel: general idea and syntax

Suppose someprog is a program taking one argument, and we want to run it N times with N values of that argument: someprog a1 someprog a2 someprog a3 … someprog aN

GNU parallel can help:

parallel [options] someprog ::: a1 a2 a3 … aN will start these commands running concurrently

[options] are there to control things (examples later)

(so, in essence, parallel just concatenates someprog with each of the ai and treats those as commands to run)

GNU parallel: general idea and syntax

Instead of listing arguments, we can put them in a file, say argfile, listing one argument per line like this: a1 a2 a3 … aN

Then run parallel like this (note the four colons ::::)

parallel [options] someprog :::: argfile

Equivalent forms:

parallel [options] -a argfile someprog

cat argfile | parallel [options] someprog

parallel [options] someprog < argfile GNU parallel: general idea and syntax

Remember the ‘original’ command we introduced parallel with?

parallel -j 10 < TaskFile

where TaskFile was

gzip file1
gzip file2
…
gzip file3000

This is like running

parallel -j 10 someprog :::: TaskFile

with an empty someprog and 'arguments' of the form gzip file1

GNU parallel: general idea and syntax

What if the someprog command needs more than 1 argument?

parallel -N2 someprog ::: a1 a2 a3 a4 a5 a6 will produce the following commands:

someprog a1 a2
someprog a3 a4
someprog a5 a6

GNU parallel: general idea and syntax

What if we need to run not one, but a few commands?

parallel someprog1 {}\; someprog2 {} ::: a1 a2 a3

( {} represents the argument, if only one )

will result in

someprog1 a1; someprog2 a1     # run one after the other, but concurrently with other such pairs
someprog1 a2; someprog2 a2
someprog1 a3; someprog2 a3
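A hedged, concrete variant of this pattern (file names are hypothetical): sort and then index each BAM file, running up to 4 such command pairs at a time:

parallel -j 4 samtools sort -o {}.sorted.bam {}\; samtools index {}.sorted.bam ::: sample_A.bam sample_B.bam sample_C.bam sample_D.bam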

Another example: someprog1 and someprog2 run on different arguments

parallel -N2 someprog1 {1}\; someprog2 {2} ::: a1 a2 a3 a4

( {1},{2} represent individual arguments, if multiple ) will result in

someprog1 a1; someprog2 a2
someprog1 a3; someprog2 a4

GNU parallel: more control through options

parallel -j 4 --delay 5 --load 200% --memfree 2G someprog :::: argfile

-j 4           run up to 4 commands concurrently
--delay 5      start each command 5 seconds after the previous one
--load 200%    start a command only if the load on the machine is not more than 2 threads
--memfree 2G   start a command only if there is at least 2G of RAM available

Caution:

If someprog is multi-threaded, it will 'occupy' not 4, but (4 x number_threads_per_task) CPU cores !!!

GNU parallel: remote execution (and more options)

parallel -j 2 \
  -S machine1 -S machine2 \
  --transferfile BBB_{} \
  --return BBB_{}.gz \
  --workdir /workdir/bukowski \
  --cleanup \
  --joblog run.log \
  gzip ::: 1 2 3

What will happen here:

• Commands gzip BBB_1, gzip BBB_2, and gzip BBB_3 will be run, at most 2 at a time, using machines machine1, machine2, accessed via ssh • Files BBB_1, BBB_2, and BBB_3 will be transferred from the current directory to the relevant machine to directory /workdir/bukowski, and the ‘gzipping’ will take place there. • Upon completion, compressed files BBB_1.gz, BBB_2.gz, and BBB_3.gz will be transferred back to the current directory. • Files on the remote machines will be cleaned up • Log of the entire operation, with some useful timing information, will be saved in file run.log (in the current directory on the current machine, from which parallel was submitted)

NOTE: the user should have passwordless ssh access set up between the machines to avoid being asked for a password (a one-time setup sketch is given below)
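A hedged one-time setup sketch (machine names are placeholders); on BioHPC, where home directories are shared, appending your own public key to ~/.ssh/authorized_keys may be all that is needed:

ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa    # skip if ~/.ssh/id_rsa already exists
ssh-copy-id machine1
ssh-copy-id machine2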

GNU parallel: killing tasks

Find the process ID (PID) of the parallel process

ps -ef | grep parallel
bukowski 28310  1710  1 13:50 pts/13 00:00:00 /programs/parallel/bin/parallel -j 2 gzip BBB_{} ::: 1 2 3
bukowski 28558  1710  0 13:50 pts/13 00:00:00 grep --color=auto parallel

Now send the SIGTERM signal to the process – this will 'drain the queue' (allow tasks already running to finish)

kill -15 28310
parallel: SIGTERM received. No new jobs will be started.
parallel: Waiting for these 2 jobs to finish. Send SIGTERM again to stop now.

Send the SIGTERM signal again to kill off the remaining running processes

kill -15 28310

xargs – 'older brother' of GNU parallel

The functionality of xargs is similar to (but more limited than) that of parallel; some options of parallel are designed to mimic those of xargs

Example:

Let TaskFile contain a list of files:

file1
file2
file3

cat TaskFile | xargs gzip
will construct (and run) the following, using a single process (i.e., 3 gzip operations one after another):

gzip file1 file2 file3

cat TaskFile | xargs -n 1 -P 2 gzip
will construct (and run) the following, using up to 2 processes at a time:

gzip file1
gzip file2
gzip file3

Exercise 3: timing bwa mem alignment

[Workflow: example_1.fastq.gz + example_2.fastq.gz are aligned to a reference genome with bwa mem (producing example.sam), which samtools converts to example.bam]

Pipe to avoid explicit creation of example.sam:

bwa mem -M -t 8 genome example_1.fastq.gz example_2.fastq.gz | samtools view -Sb - -o example.bam

(bwa mem is the parallel part – here: 8 threads; samtools view runs sequentially)

Objective: run this pipeline with varying numbers of threads and measure time, memory, and I/O as functions of that number (a hedged timing loop is sketched below)
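A hedged timing sketch along these lines (paths and thread counts are placeholders); /usr/bin/time wraps the whole pipeline via bash -c so that the report in each log covers both bwa and samtools:

for t in 1 2 4 8 16; do
  /usr/bin/time -v bash -c \
    "bwa mem -M -t $t genome example_1.fastq.gz example_2.fastq.gz | samtools view -Sb - -o example.${t}t.bam" \
    >& time.${t}threads.log
done
grep -E 'Elapsed|Maximum resident' time.*threads.log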

GNU parallel limitations

 Parallel is a clever tool for submitting multiple commands in the background, possibly on multiple machines

 Very useful extra options, such as (and there are many more):

-j 4           limit the number of commands run concurrently (here: 4)
--delay 5      start each command some time (here: 5 seconds) after the previous one
--load 200%    start a command only if the load on the machine is not more than some number (here: 2) of threads
--memfree 2G   start a command only if at least some amount (here: 2G) of RAM is available
--timeout 60   impose a timeout (here: 60 s) on each command

 But there are limitations to what parallel can do for you:

• Once a command is running, no control over how many cores or how much memory it uses (may overwhelm machine)

• Can’t control individual commands

• No way to enforce fair sharing of resources among multiple users and/or user groups

Need a SCHEDULER to deal with these!

GNU parallel vs a full scheduler

Functionality                                      Parallel          Scheduler
Start multiple jobs on limited resources           yes               yes
Terminate individual jobs                          no (or hard)      yes
Control #cores and memory of running jobs          no                yes
Prioritize jobs of different users, groups         no                yes
Control job timeout                                yes               yes
Streamline submission based on job requirements    no                yes
File pre-staging                                   yes               yes (sort of)
Job staggering                                     yes               yes(?)
Job accounting                                     no (or limited)   yes

Some popular schedulers

 Obsolete or commercialized

. PBS: Portable Batch System

. SGE: Sun Grid Engine

. LSF: Load Sharing Facility

. Lava: light version of LSF

. TORQUE: a version of PBS

. UNIVA: commercial fork of SGE

 Free, modern, and actively developed

. SLURM: Simple Linux Utility for Resource Management

Structure of SLURM cluster(s)

Services (daemons):
. slurmd: starts and manages jobs on a compute node
. slurmctld: decides where and when to dispatch jobs, tracks jobs
. slurmdbd: logs job information into a database, maintains user and account information

[Diagram: a cluster of execution nodes (machine1-machine4), each running slurmd; a control node running slurmctld (it may serve as an exec node, too); and an accounting node (machine5) running slurmdbd and mysqld (SQL database), which may serve multiple clusters.]

SLURM setup is an admin task

 Non-trivial setup and maintenance

. Requires extra pieces of software installed and running on all machines involved

. To be started, configured, and maintained by an administrator (users generally cannot do it)

. Takes significant know-how and work to set up and configure

. Configuration typically tailored specifically to a particular cluster/lab/group/institution

 Users need to follow usage guidelines for the specific scheduler configuration

. Learning curve involved – different for each cluster

Configuration of resources in SLURM

 Nodes (machines) grouped into partitions (queues)

. typically collect similar nodes, or nodes with similar function
. each node may belong to multiple partitions
. a partition may have per-job limits and defaults (run time, memory, max #cores, etc.)
. the user needs to specify which partition their job is to be submitted to
. one partition is the 'default'

 Cluster may be configured to grant jobs either whole nodes, or node ‘slices’ (i.e., some #cores + some memory)

. jobs are restricted to the #cores and RAM requested at submission – they will not use more (and may crash on the attempt!)
. #cores and RAM allocated to a running job are subtracted from the node's totals – only available resources are offered to new jobs

 Users organized in (trees of) accounts (e.g., lab groups), with defined shares, determining usage priorities

 Per-user and/or per-group limits or privileges may be defined (QOS – quality of service)

SLURM at BioHPC

 Permanent clusters, made up of Lab- or Department-owned machines, customized to serve those Labs or Departments. Access for lab members only.

. BSCB cluster (cbsubscb): 15 nodes, 1136 CPU cores, 5.8 TB RAM . cbsuxu . cbsuorm . cbsugaurav . others welcome – contact us to discuss/set up

 ‘SLURM on demand’ clusters:

. possible to spin up by any user on their reserved or Lab-owned machine(s) . access for all users with reservations on these machines . temporary – will disappear upon the end of reservation . not configurable (at present, only a single configuration is offered)

. What are they good for:
  - load balancing of single or multiple users' jobs (like parallel, but with more control)
  - re-using SLURM scripts brought from elsewhere (some customization typically required)
  - running pipelines which require SLURM for load balancing (and some do)

'SLURM on demand' at BioHPC

 Reserve one or more machines

 Log in to one of the reserved machines

 Use manage_slurm tool to spin up and control (some aspects of) the cluster

[bukowski@cbsum1c2b005 ~]$ manage_slurm
Usage: manage_slurm [args]

manage_slurm new machine1,machine2,...
• to create a SLURM cluster on the named machines (need an active reservation on all machines). The first node will be the "master node". All users with active reservations on the full set of machines will be given access to the cluster, and will automatically be added or removed as their reservation status changes.

manage_slurm kill masterNode
• to end the SLURM cluster identified by the master node

manage_slurm addNode masterNode machine
• adds the machine to the cluster identified by the masterNode. Need an active reservation on this machine for all current cluster users.

manage_slurm addNode-force masterNode machine
• like addNode, but will remove cluster access for any users necessary to add the node (including deleting any of their submitted jobs). Try addNode first to get a list of users that will be removed.

manage_slurm removeNode masterNode machine
• removes a node from the cluster identified by masterNode. The machine should not be the masterNode; removing the masterNode will kill the entire cluster, so use the "kill" command instead.

manage_slurm list
• lists and describes all clusters that you have access to. Reports the list of machines, number of CPUs/memory, and list of authorized users.

More info: https://biohpc.cornell.edu/lab/userguide.aspx?a=software&i=689#c

'SLURM on demand' at BioHPC: configuration

At present, only one configuration offered:

 One partition (‘regular’)

. contains all nodes . no per job time limit . no per job CPU core limits . 4 GB RAM per job default

 One ‘account’, containing all users having reservations on all machines of the ‘cluster’

. Fairshare scheduling policy with all users 'equal' (more details later)
. No per-user limits

SLURM: know your cluster – partitions, nodes summary

Info on ‘current cluster’, which the node belongs to

[bukowski@cbsum1c2b003 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
regular*     up   infinite      4   idle cbsum1c2b[003-004,006-007]

Info on other clusters

[bukowski@cbsum1c2b003 slurm]$ sinfo --cluster=cbsubscb
CLUSTER: cbsubscb
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
short*       up    4:00:00     16    mix cbsubscb[01-15],cbsubscbgpu01
regular      up 1-00:00:00     16    mix cbsubscb[01-15],cbsubscbgpu01
long7        up 7-00:00:00     15    mix cbsubscb[01-15]
long30       up 30-00:00:0     15    mix cbsubscb[01-15]
gpu          up 3-00:00:00      1    mix cbsubscbgpu01

SLURM: know your cluster – node details

[bukowski@cbsum1c2b003 slurm]$ scontrol show nodes=cbsum1c2b006
NodeName=cbsum1c2b006 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUTot=8 CPULoad=0.01
   AvailableFeatures=(null) ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=128.84.181.157 NodeHostName=cbsum1c2b006
   OS=Linux 3.10.0-957.10.1.el7.x86_64 #1 SMP Mon Mar 18 15:06:45 UTC 2019
   RealMemory=15883 AllocMem=0 FreeMem=3254 Sockets=8 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=regular
   BootTime=2020-04-02T12:56:07 SlurmdStartTime=2020-05-10T08:50:09
   CfgTRES=cpu=8,mem=15883M,billing=8
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

SLURM: know your cluster – partition details

[bukowski@cbsum1c2b003 slurm]$ scontrol show partitions
PartitionName=regular
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=cbsum1c2b003,cbsum1c2b004,cbsum1c2b006,cbsum1c2b007
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=32 TotalNodes=4 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=4096 MaxMemPerNode=UNLIMITED

SLURM and you – typical scenario

 Determine the job's CPU-core and RAM requirements

 Write a shell script that will
. create a job directory on the local scratch file system
. prepare (copy) input files to the job's scratch
. launch the application (output to be written to the job's scratch)
. copy output files back to permanent storage (e.g., home directory)

 Submit the script using the sbatch command with desired options (#cores, RAM, partition, nodes, …)
. may embed SLURM options in the script header
. an interactive session may be requested using the srun command
. submit as many jobs as you need

 Jobs are queued up and wait for resources and their turn to start on some node (competing with other jobs)

 Check on your jobs using squeue, state of cluster using sinfo, scontrol

 Control/cancel your jobs (scontrol update, scancel)

 Get information about finished jobs using sacct

 Handy summary of SLURM commands: http://slurm.schedmd.com/pdfs/summary.pdf

SLURM: typical shell script

Typical shell script, call it my_script.sh (the -l flag makes the job inherit your login environment; $SLURM_JOB_ID is an integer unique for every SLURM job)

#!/bin/bash -l

# Create a scratch directory for this job
WDIR=/workdir/bukowski/$SLURM_JOB_ID
mkdir -p $WDIR
cd $WDIR

# Copy input to the scratch directory
cp /home/bukowski/inputs/input.file .

# Run the computation
/home/bukowski/programs/my_program < input.file >& output.file.$SLURM_JOB_ID

# Copy results back to permanent storage
cp output.file.$SLURM_JOB_ID /home/bukowski/outputs

# Clean up after the job
cd /workdir/bukowski
rm -Rf $SLURM_JOB_ID

SLURM: submitting a script

sbatch [options] my_script.sh [arguments]

Here are some of the more important [options], with examples:

--nodes=1                    (number of nodes; must be 1 for all non-MPI jobs)
--ntasks=8                   (number of tasks; task = 1 slot = 1 thread; default: 1)
--mem=8000                   (request 8 GB of memory for this job; default: 4 GB)
--time=1-20:00:00            (wall-time limit for the job; here: 1 day and 20 hours)
--partition=regular,long30   (partition(s) the job can run in; here: regular and long30)
--account=bscb09             (project to charge the job to)
--chdir=/home/bukowski/slurm (start the job in the specified directory; default is the directory in which sbatch was invoked)
--job-name=jobname           (name of the job)
--output=jobname.out.%j      (write stdout+stderr to this file; %j will be replaced by the job ID)
[email protected]   (set your email address)
--mail-type=ALL              (send email at job start, end, or crash - do not use if this is going to generate thousands of e-mails!)

So, the submission command could look like

sbatch --nodes=1 --ntasks=6 --mem=4000 my_script.sh

Many options have shorthand notation, e.g.,

sbatch -N 1 -n 6 --mem=4000 my_script.sh

SLURM: specifying options in the script header

Typical script header may specify submission options after #SBATCH keyword

#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --mem=8000
#SBATCH --time=1-20:00:00
#SBATCH --partition=regular,long30
#SBATCH --account=bscb09
#SBATCH --chdir=/home/bukowski/slurm
#SBATCH --job-name=jobname
#SBATCH --output=jobname.out.%j
#SBATCH [email protected]
#SBATCH --mail-type=ALL

# Rest of the script goes here…

Options given directly on the sbatch command line supersede the ones in the header

SLURM: other interesting options/comments

--ntasks   used to request the number of threads (cores) for a job (also sets OMP_NUM_THREADS accordingly)

-N 1       non-MPI jobs must run on a single machine (otherwise some cores may be allocated on other nodes and won't be useful)

--nodelist=cbsubscb11 run on a specific node (otherwise it will start on some node within requested partitions – unknown in advance!)

--exclude=cbsubscb10,cbsubscbgpu01 exclude specified nodes

--chdir and --output directories specified in these options must be present on all nodes where a job can start (e.g., $HOME)

Alternative core and memory specifications:

sbatch -N 1 -n 8 --cpus-per-task=3 --mem-per-cpu=2G my_script.sh
sbatch -n 24 --mem=48G my_script.sh

SLURM: where to submit jobs from?

This is configurable, but at BioHPC

 Any node of the cluster (if you have ssh login access to it)

sbatch [options] my_script.sh [arguments]

 Any machine configured to use the same slurmdbd (database) service as the cluster in question, e.g., the login node cbsulogin.biohpc.cornell.edu. Use the --clusters option to indicate the cluster to submit to:

sbatch --clusters=cbsum1c2b003 [options] my_script.sh [arguments]

 Directory a job is submitted from becomes the job’s ‘startup directory’ and so it must exist on all nodes the job may start on. $HOME is a good choice

. the job's 'startup directory' may be changed using the option --chdir
. jobs should use local scratch storage (rather than the 'startup directory') for I/O-intensive computations

(some of) Environment Variables available within a SLURM job

SLURM_JOB_CPUS_PER_NODE : number of CPU cores (threads) allocated to this job

SLURM_NTASKS : number of tasks, or slots, for this job (as given by --ntasks option)

SLURM_MEM_PER_NODE : memory requested with --mem option

SLURM_CPUS_ON_NODE : total number of CPUs on the node (not only the allocated ones)

SLURM_JOB_ID : job ID of this job; may be used, for example, to name a scratch directory (subdirectory of /workdir, or output files) for the job. For array jobs, each array element will have a separate SLURM_JOB_ID

SLURM_ARRAY_JOB_ID : job ID of the array master job

SLURM_ARRAY_TASK_ID : task index of a task within a job array

SLURM_ARRAY_TASK_MIN, SLURM_ARRAY_TASK_MAX : minimum and maximum index of jobs within the array
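A hedged snippet for use inside a job script (bwa is just an example): make the application use exactly the number of cores SLURM allocated to the job:

export OMP_NUM_THREADS=$SLURM_JOB_CPUS_PER_NODE    # for OpenMP-based tools
bwa mem -t $SLURM_JOB_CPUS_PER_NODE genome reads_1.fastq.gz reads_2.fastq.gz > aligned.sam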

Complete list – in section 'OUTPUT ENVIRONMENT VARIABLES' of https://slurm.schedmd.com/sbatch.html

SLURM: Job arrays (array jobs?)

Consider the script my_array_script.sh (objective: compress files BBB_1, BBB_2, BBB_3)

#!/bin/bash

# Prepare the scratch directory for the job and 'cd' to it
WDIR=/workdir/$SLURM_JOB_ID
mkdir -p $WDIR
cd $WDIR

# Copy the file to gzip from the network-mounted directory
cp /shared_data/Parallel_workshop/BBB_$SLURM_ARRAY_TASK_ID .

# Run the compression
gzip BBB_$SLURM_ARRAY_TASK_ID

# Copy the result back into the result directory (here: same as the submission dir)
cp BBB_${SLURM_ARRAY_TASK_ID}.gz $SLURM_SUBMIT_DIR

# Clean up
cd $SLURM_SUBMIT_DIR; rm -Rf $WDIR

Now submit this script as job array:

sbatch --array=1-3 [other_options] my_array_script.sh

This single command will submit three separate jobs, each running my_array_script.sh, but in each the value of SLURM_ARRAY_TASK_ID will be different (one of 1, 2, or 3). This variable is provided as a result of the --array option. (A sketch for non-numeric input files follows below.)
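A hedged variation (files.txt is a made-up name): when inputs are not numbered 1..N, the array index can be mapped to a file name listed in a plain text file, one name per line:

FILE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" files.txt)   # line number = array task ID
gzip "$FILE"

Such a script could then be submitted with, e.g., sbatch --array=1-$(wc -l < files.txt) my_array_script.sh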

SLURM: interactive jobs

srun -n 2 -N 1 --mem 2G --pty --preserve-env --cpu-bind=no /bin/bash

This will create a job (can check with squeue) and open an interactive bash shell on a machine picked by SLURM

This shell will be constrained to the number of cores and the memory requested

After interactive work finished, exit the shell (Ctrl-D or ‘exit’)

On some clusters (cbsubscb), salloc may be configured to automatically execute the srun command as above

srun: very much like sbatch, except it works in real time rather than in batch mode

SLURM: checking on jobs using squeue

[bukowski@cbsum1c2b003 slurm]$ squeue
  JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
     19   regular blast_te bukowski PD  0:00     1 (Resources)
     18   regular bwa_test bukowski  R  0:48     1 cbsum1c2b003
   20_1   regular array_te bukowski  R  0:10     1 cbsum1c2b004
   20_2   regular array_te bukowski  R  0:10     1 cbsum1c2b004
   20_3   regular array_te bukowski  R  0:10     1 cbsum1c2b004

SLURM: checking on jobs using squeue

Default output from squeue is not that informative. It is better to use a more detailed format

squeue -o '%.18i %.10n %.20x %m %.30P %.8j %.8u %.8T %.10M %.9l %.6D %C %R %p %S'

or simply use the wrapper script (which we wrote)

squeue_l

[bukowski@cbsum1c2b003 slurm]$ ~bukowski/SLURM*/scr*/squeue_l --cluster=cbsum1c2b003
CLUSTER: cbsum1c2b003
JOBID REQ_NODES     EXC_NODES MIN_MEMORY PARTITION NAME     USER     STATE   TIME TIME_LIMI NODES CPUS NODELIST(REASON) PRIORITY         START_TIME
24    cbsum1c2b003            2G         regular   blast_te bukowski PENDING 0:00 UNLIMITED 1     8    (Resources)      0.00000145519152 2021-05-10T11:51:37
23                            4G         regular   bwa_test bukowski RUNNING 0:57 UNLIMITED 1     8    cbsum1c2b003     0.00000145519152 2020-05-10T11:51:11
25_1                          4G         regular   array_te bukowski RUNNING 0:31 UNLIMITED 1     1    cbsum1c2b004     0.00000142958015 2020-05-10T11:51:37
25_2                          4G         regular   array_te bukowski RUNNING 0:31 UNLIMITED 1     1    cbsum1c2b004     0.00000142958015 2020-05-10T11:51:37
25_3                          4G         regular   array_te bukowski RUNNING 0:31 UNLIMITED 1     1    cbsum1c2b004     0.00000142958015 2020-05-10T11:51:37

SLURM: retrieving job accounting information using sacct

[bukowski@cbsum1c2b003 slurm]$ sacct
JobID      JobName   Partition  Account     AllocCPUS  State      ExitCode
---------  --------  ---------  ----------  ---------  ---------  --------
3          bwa_test  regular    acc_cbsum+          4  COMPLETED       0:0
3.batch    batch                acc_cbsum+          4  COMPLETED       0:0
4          bwa_test  regular    acc_cbsum+          1  COMPLETED       0:0
4.batch    batch                acc_cbsum+          1  COMPLETED       0:0
5          bwa_test  regular    acc_cbsum+          1  COMPLETED       0:0
5.batch    batch                acc_cbsum+          1  COMPLETED       0:0

For each job, all job steps are listed (the job itself + the batch step) – hence two lines per job.

SLURM: more job accounting information using sacct_l

sacct_l: a simple wrapper around sacct, giving more useful information

Produces rather long lines. In the example below, the lines are broken in half

Left-most columns of sacct_l output for two jobs:

JobID       JobIDRaw   JobName     User      NodeList      ReqMem
15          15         bwa_test    bukowski  cbsum1c2b004  4Gn
15.batch    15.batch   batch                 cbsum1c2b004  4Gn
16_1        17         array_test  bukowski  cbsum1c2b005  4Gn
16_1.batch  17.batch   batch                 cbsum1c2b005  4Gn

Right-most columns of sacct_l output for the same two jobs:

MaxRSS    MaxVMSize  NCPUS  Start                CPUTime   TotalCPU   UserCPU    Elapsed   State
                         8  2020-05-10T11:29:53  00:19:36  12:26.591  12:13.738  00:02:27  COMPLETED
1402336K  2055932K       8  2020-05-10T11:29:53  00:19:36  12:26.591  12:13.738  00:02:27  COMPLETED
                         1  2020-05-10T11:30:12  00:01:08  01:03.112  01:02.051  00:01:08  COMPLETED
1980K     132900K        1  2020-05-10T11:30:12  00:01:08  01:03.112  01:02.051  00:01:08  COMPLETED

SLURM: use accounting info to evaluate performance

Signature of an efficient job:

CPUTime (= NCPUS*Elapsed) ≈ TotalCPU ≈ UserCPU
MaxRSS < ReqMem

Signs of inefficiency:

TotalCPU < CPUTime   inefficient parallelization (e.g., due to a sequential part of the job, like file copying)
UserCPU < TotalCPU   a lot of time spent in system calls (typically due to inefficient I/O)
MaxRSS ≈ ReqMem      the job may not have enough memory and may crash or swap; increase --mem

MaxRSS: use this to estimate memory needs of your job

. Run a test requesting a lot of memory, then for production runs set --mem to a value slightly larger (1.2 times?) than the MaxRSS obtained from the test run (a one-line sacct query for these fields is sketched below)
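A hedged one-liner pulling exactly the fields used in the checks above for a finished job (12345 is a placeholder job ID):

sacct -j 12345 --format=JobID,NCPUS,Elapsed,CPUTime,TotalCPU,UserCPU,MaxRSS,ReqMem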

SLURM scheduler

What happens to queued jobs?

 Scheduler (part of slurmctld service daemon) runs periodically (once in about 1 minute)

. keep track of running jobs and their allocated resources

. keep track of available resources

. compute job priorities

. examine each waiting job, check if requested resources available

. if multiple jobs compete, submit the ones with highest priorities

. backfilling: ‘small’ jobs with lower priority may get ahead of ‘big’ jobs with higher priority if it does not affect the start time of the latter

 Accurate run-time requests are necessary for effective backfilling

Multi-factor job priority

Job_priority =
  site_factor +
  (PriorityWeightAge) * (age_factor) +
  (PriorityWeightAssoc) * (assoc_factor) +
  (PriorityWeightFairshare) * (fair-share_factor) +
  (PriorityWeightJobSize) * (job_size_factor) +
  (PriorityWeightPartition) * (partition_factor) +
  (PriorityWeightQOS) * (QOS_factor) +
  SUM(TRES_weight_cpu * TRES_factor_cpu,
      TRES_weight_<type> * TRES_factor_<type>, ...) -
  nice_factor

Association = (user, account, cluster, partition)
QOS: Quality of Service (set of limits or privileges)
TRES: trackable resource
JobSize: related to #cores requested
Fairshare: reflects the proportion of resources consumed by the user relative to the user's 'share' in the cluster

Fairshare is probably the most important factor in multi-user, multi-group clusters

Weights: large integer numbers

Factors: numbers between 0 and 1

Fair Tree Fairshare example

AccountA (10 shares):
  User1 (40 shares), usage=20
  User2 (30 shares), usage=10
AccountB (20 shares):
  User3 (50 shares), usage=30
  User4 (60 shares), usage=40
  User5 (10 shares), usage=0

Relative share S:
  AccountA: 10/(10+20) = 1/3
  AccountB: 20/(10+20) = 2/3
  User1: 40/(30+40) = 4/7
  User2: 30/(30+40) = 3/7
  User3: 50/(50+60+10) = 5/12
  User4: 60/(50+60+10) = 6/12
  User5: 10/(50+60+10) = 1/12

Relative usage U:
  AccountA: 30/(30+70) = 3/10
  AccountB: 70/(30+70) = 7/10
  User1: 20/(10+20) = 2/3
  User2: 10/(10+20) = 1/3
  User3: 30/(30+40) = 3/7
  User4: 40/(30+40) = 4/7
  User5: 0/(30+40) = 0

Level Fairshare LF = S/U:
  AccountA: 10/9
  AccountB: 20/21
  User1: 6/7
  User2: 8/7
  User3: 35/36
  User4: 42/48
  User5: Infinity

First sort accounts according to their LF, then sort users within each account; fair-share_factor ~ rank:

AccountA (LF=10/9)
  User2 (LF=8/7)        Priority: 5/5
  User1 (LF=6/7)        Priority: 4/5
AccountB (LF=20/21)
  User5 (LF=Infinity)   Priority: 3/5
  User3 (LF=35/36)      Priority: 2/5
  User4 (LF=42/48)      Priority: 1/5

What is 'Usage' anyway?

Usage for a running job during time period Dt

U_job = [#cores*core_weight + RAM*RAM_weight] * Dt

(defaults: core_weight=1, RAM_weight=0, Dt=5 min)

Total usage U_user for a user: sum of U_job over all user’s jobs

Taking into account past usage:

U_user = U_now + d*U_now_1 + d^2*U_now_2 + d^3*U_now_3 + …

Where

U_now_N is the user’s usage U_user from time period N*Dt before the present one

d is set based on the assumed usage half-life time, T_half (e.g., 1 week), i.e.,

d = (1/2)^(Dt/T_half) < 1

SLURM documentation

 Version installed on BioHPC: https://slurm.schedmd.com/archive/slurm-19.05.2/

. upgrade needed soon…

 Man pages for individual commands: https://slurm.schedmd.com/archive/slurm-19.05.2/man_index.html

 SLURM command summary handout: http://slurm.schedmd.com/pdfs/summary.pdf

 Official documentation is very thorough, but rather formal, with few specific examples

. often ‘googling’ a specific subject or command will yield more clear info from ‘SLURM practitioners’