Best Practices HPC and Introduction to the New System FRAM
Ole W. Saastad, Dr.Scient
USIT / UAV / ITF / FI
Sept 6th 2016, Sigma2 - HPC seminar
Universitetets senter for informasjonsteknologi

Introduction
• FRAM overview
• Best practice for the new system, requirements for applications
• Developing code for the new system
– Vectorization

Strategy for FRAM and B1
• FRAM with Broadwell processors
– Takes over for Hexagon and Vilje
– Runs parallel workloads for 18-24 months
– Then switches role to take over for Abel & Stallo
– Runs throughput production when B1 comes online

Strategy for FRAM and B1
• B1 massive parallel system
– Processor undecided: AMD ZEN, OpenPOWER, ARM, KNL, Skylake
• Strong focus on vectors for floating point
• Performance is essential
– Will run the few large core-count codes
• A typical job will use 10k+ cores
• Interconnect: InfiniBand or Omni-Path

FRAM overall specs
• Broadwell processors
– AVX2, just like Haswell, including FMA
• Island topology
– 4 islands, approx. 8k cores per island
• Close to full bisection bandwidth within an island
• Runs jobs of up to 8k cores nicely
• Struggles with jobs requiring > 8k cores

FRAM overall specs
• 1000 compute nodes, 32000 cores
• 8 large memory nodes, 256 cores
• 2 very large memory nodes, 112 cores
• 8 accelerated nodes, 256 cores + CUDA cores
• A total of 32624 cores for computation ≈ 1 Pflops/s
• 18 water-cooled racks

FRAM overall specs
• 2.4 PiB of local scratch storage, DDN EXAScaler
• LUSTRE file system /work (and local software)
• /home and /project mounted from Norstore
• Expected performance is ~40 GiB/s read/write

Lenovo nx360m5, direct water cooled

Compute nodes – processor
• Dual-socket motherboard
• Intel Broadwell E5-2683v4 processors
• 2.1 GHz, 16 cores, 32 threads
• Cache: L1 64 kiB, L2 256 kiB, L3/last level 40 MiB
• AVX2 (256-bit vectors), FMA3, RDSEED, ADX

Compute nodes – memory
• Single memory controller per socket
• 4 memory channels per socket
• 8 x 8 GiB dual-ranked 2400 MHz DDR4 memory
• Total of 64 GiB cache-coherent memory in 2 NUMA nodes
• Approx. 120 GiB/s bandwidth (7.5 GiB/s per core)

Compute nodes – interconnect
• Mellanox ConnectX-4 EDR (100 Gbit/s) InfiniBand
• PCIe3 x16, 15.75 GiB/s per direction
• Close to 10 GiB/s bandwidth (0.625 GiB/s per core)
• Island topology

Large memory nodes
• Dual Intel E5-2683v4 processors
• 8 x 32 GiB dual-ranked 2400 MHz DDR4 memory
• Total of 512 GiB cache-coherent memory in 2 NUMA nodes
• 1920 GiB solid state disk as local scratch storage

Very large memory nodes
• Quad Intel E7-4850v3 processors, 14 cores, 2.2 GHz
• 56 cores in total
• 96 x 64 GiB LRDIMM memory
• Total of 6 TiB cache-coherent memory in 4 NUMA nodes
• 14 TiB disk as local scratch storage

Accelerated nodes
• Dual Intel E5-2683v4 processors
• 8 x 16 GiB dual-ranked 2400 MHz DDR4 memory
• Total of 128 GiB cache-coherent memory in 2 NUMA nodes
• 2 x NVIDIA K80 cards (most probably Pascal-based cards will be installed, TBD), full CUDA support
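The quoted memory bandwidth figures can be sanity-checked against the channel configuration above. A minimal calculation, assuming the standard DDR4-2400 transfer rate and 64-bit channels (the ~120 GiB/s quoted in the slides is the achievable figure, below this theoretical peak):

```python
# Theoretical peak DDR4 bandwidth for one Broadwell socket on FRAM:
# 4 channels x 2400 MT/s x 8 bytes per transfer.
channels = 4
transfers_per_s = 2400e6     # DDR4-2400: 2.4e9 transfers per second
bytes_per_transfer = 8       # one 64-bit wide channel

per_socket = channels * transfers_per_s * bytes_per_transfer  # bytes/s
per_node = 2 * per_socket                                     # dual socket

print(per_socket / 1e9)  # 76.8 GB/s per socket
print(per_node / 1e9)    # 153.6 GB/s per node, theoretical peak
```

Sustained bandwidth measured with benchmarks such as STREAM is always a fraction of this peak, which is consistent with the "approx. 120 GiB/s" figure on the memory slide.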
Local IO
• 2 x DDN EXAScalers
• LUSTRE file system with native InfiniBand support
• InfiniBand RDMA
• 2.45 PiB of storage
• 49.9 GiB/s sequential write
• 52.9 GiB/s sequential read
• Provides /work and cluster-wide software
• NO backup

Remote IO
• Provided by Norstore
• Expected to be LUSTRE
• Performance estimate is about 40 GiB/s read/write
• Provides
– /home
– /projects
• Full backup, snapshots, redundancy
• A range of file access services
– POSIX, object storage, etc.

Software
• Intel Parallel Studio
– Compilers and libraries
– Tuning and analysis tools
– Intel MPI
• GNU tools
– Compilers, libraries and debuggers
– Common GNU utilities

Software
• Message Passing Interface, MPI
– Intel MPI
– OpenMPI
• Mellanox InfiniBand stack
– Mellanox services for collectives
– OpenMPI ok, Intel MPI ??

Software
• SLURM queue system
– Minor changes from today's user experience
• Modules
– Expect changes, work in progress
• User environment
– Could be a mode for scientists and a mode for developers; undecided
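Since the SLURM setup on FRAM was still being settled at the time of this talk, the following is only an illustration of what a minimal batch script for a generic SLURM installation looks like. The job name, task count and command are placeholders, not FRAM-specific settings:

```python
def render_slurm_script(job_name, ntasks, walltime, command):
    """Render a minimal generic SLURM batch script.

    Only standard sbatch options are used; account and partition
    names on FRAM were undecided when this talk was given.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --ntasks={ntasks}",   # number of MPI ranks to allocate
        f"#SBATCH --time={walltime}",   # wall-clock limit, HH:MM:SS
        "",
        f"srun {command}",              # launch under the allocation
    ]
    return "\n".join(lines)

print(render_slurm_script("myjob", 64, "01:00:00", "./prog.x"))
```

The rendered script would be submitted with `sbatch`; the "minor changes from today's user experience" promised above suggest this workflow carries over largely unchanged from Abel and Stallo.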
Support
• RT tickets as before
• Tickets handled by Sigma2 and Metacenter members
• Advanced user support as before
• Training will be arranged by Sigma2

Best Practice for HPC
• How to utilize the system in the best possible way
• How to schedule your jobs
• How to allocate resources
• How to do input and output
• How to understand your application

Understand your application
• The following is valid for any application
• These tools can be used on any application
• Optimal use of the system is important when resources are limited
• Sigma2 might require a performance review

Understand your application
• Memory footprint
• Memory access
• Scaling – threads, shared memory
• Scaling – MPI, interconnect, collectives
• Vectorization
• Usage of storage during the run
• Efficiency as a fraction of theoretical performance

Step by step application insight
• Starting with simple tools
– time
– top
– iostat
– strace
– MPI snapshot
– Intel SDE
• Progressing with tools like Performance Reports
• Finishing with Intel Vector Advisor, Amplifier and Tracer

Time spent and memory
• Timing is key in performance tuning
• /usr/bin/time
• man time
• Very simple syntax: /usr/bin/time ./prog.x

Time spent and memory
• Default format gives:
%Uuser %Ssystem %Eelapsed %PCPU (%Xtext+%Ddata %Mmax)k
%Iinputs+%Ooutputs (%Fmajor+%Rminor)pagefaults %Wswaps

Time spent and memory
• %Uuser – Total number of CPU-seconds that the process spent in user mode.
• %Ssystem – Total number of CPU-seconds that the process spent in kernel mode.
• %Eelapsed – Elapsed real time (in [hours:]minutes:seconds).
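The user/system split that /usr/bin/time reports can also be read from inside a running program. A small sketch using the standard `resource` module (Unix/Linux only); `busy` is just a stand-in workload, not part of any real application:

```python
import resource
import time

def busy(n=200_000):
    # Stand-in workload: burn some user-mode CPU time.
    s = 0
    for i in range(n):
        s += i * i
    return s

t0 = time.perf_counter()
busy()
elapsed = time.perf_counter() - t0

usage = resource.getrusage(resource.RUSAGE_SELF)
# ru_utime and ru_stime correspond to the %U and %S fields
# reported by /usr/bin/time.
print(f"user    {usage.ru_utime:.3f} s")
print(f"system  {usage.ru_stime:.3f} s")
print(f"elapsed {elapsed:.3f} s")
```

This is the same accounting data /usr/bin/time collects for its child process via `wait4()`, so the two views should agree.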
Time spent and memory
• %PCPU – Percentage of the CPU that this job got, computed as (%U + %S) / %E.
• %Iinputs+%Ooutputs – Number of file system inputs and outputs performed by the process.

Time spent and memory
• %Fmajor pagefaults – Number of major page faults that occurred while the process was running. These are faults where the page has to be read in from disk.

Time spent and memory
• %Rminor pagefaults – Number of minor, or recoverable, page faults. These are faults for pages that are not valid but which have not yet been claimed by other virtual pages. The data in the page is still valid, but the system tables must be updated.

Time spent and memory
• %Wswaps – Number of times the process was swapped out of main memory.

Time spent and memory
• A Gaussian 09 job:
2597.79user 42.46system 44:06.57elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2957369minor)pagefaults 0swaps
– 2598 s user time
– 43 s system time
– 2647 s wall clock time
– 2957369 minor page faults
– Bug in older kernels: 0 reported for memory

Time spent and memory
• An evolution run:
13.89user 1.63system 1:27:26elapsed 0%CPU (0avgtext+0avgdata 220352maxresident)k
1768848inputs+17336outputs (462major+87335minor)pagefaults 0swaps
– Maximum resident memory: 220 MB
– Kernel: 3.10.0-327.28.3.el7.x86_64

Memory, cache and data
• While you can address and manipulate a single byte,
• the smallest amount of data transferred is a cache line of 32 or 64 bytes
• Do not waste data in cache lines
• Do all the work on data while it is in the cache
• Running from memory is equal to a clock of 10 MHz

Memory and page faults
• Major and minor page faults
– Remember the output from /usr/bin/time?
2597.79user 42.46system 44:06.57elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2957369minor)pagefaults 0swaps

Minor page faults
• Minor page faults are unavoidable
• The page is in memory; just memory administration by the OS
• Little cost
• A huge number of these is often a sign of inefficient programming

Memory and page faults
• Major page faults mean too little memory for your working set!
• Showstopper!
• Avoid at all cost; they slow your system to a halt
• Submit the job on a node/system with more memory!

Memory footprint
• "top" is your friend
• man top will give help
• top is the most used tool for memory monitoring
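The minor-fault behaviour described above can be observed directly: on Linux, touching freshly allocated memory for the first time triggers one minor fault per page. A minimal sketch using the standard `resource` module (the same counters /usr/bin/time prints as pagefaults); the 64 MiB buffer size is arbitrary:

```python
import resource

# ru_minflt / ru_majflt are the counters that /usr/bin/time reports
# as (%Fmajor+%Rminor)pagefaults.
before = resource.getrusage(resource.RUSAGE_SELF)

buf = bytearray(64 * 1024 * 1024)   # 64 MiB of fresh, zeroed memory
page = 4096                          # typical x86-64 page size
for off in range(0, len(buf), page):
    buf[off] = 1                     # write one byte per page

after = resource.getrusage(resource.RUSAGE_SELF)
print("minor faults during touch:", after.ru_minflt - before.ru_minflt)
print("major faults during touch:", after.ru_majflt - before.ru_majflt)
```

On a machine with enough free memory the major-fault delta stays at zero, matching the slide's point: minor faults are cheap bookkeeping, while major faults mean pages are coming from disk and the working set no longer fits.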