SUSE Linux Enterprise High Performance Computing 15 SP3 Administration Guide Administration Guide SUSE Linux Enterprise High Performance Computing 15 SP3

SUSE Linux Enterprise High Performance Computing 15 SP3 Administration Guide Administration Guide SUSE Linux Enterprise High Performance Computing 15 SP3 SUSE Linux Enterprise High Performance Computing Publication Date: September 24, 2021 SUSE LLC 1800 South Novell Place Provo, UT 84606 USA https://documentation.suse.com Copyright © 2020–2021 SUSE LLC and contributors. All rights reserved. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or (at your option) version 1.3; with the Invariant Section being this copyright notice and license. A copy of the license version 1.2 is included in the section entitled “GNU Free Documentation License”. For SUSE trademarks, see http://www.suse.com/company/legal/ . All third-party trademarks are the property of their respective owners. Trademark symbols (®, ™ etc.) denote trademarks of SUSE and its aliates. Asterisks (*) denote third-party trademarks. All information found in this book has been compiled with utmost attention to detail. However, this does not guarantee complete accuracy. Neither SUSE LLC, its aliates, the authors nor the translators shall be held liable for possible errors or the consequences thereof. Contents Preface vii 1 Available documentation vii 2 Giving feedback vii 3 Documentation conventions viii 4 Support x Support statement for SUSE Linux Enterprise High Performance Computing x • Technology previews xi 1 Introduction 1 1.1 Components provided 1 1.2 Hardware platform support 2 1.3 Support and life cycle 2 1.4 Documentation and other information 2 2 Installation and upgrade 3 2.1 System roles for SUSE Linux Enterprise High Performance Computing 15 SP3 3 2.2 Upgrading to SUSE Linux Enterprise High Performance Computing 15 SP3 4 3 Remote administration 5 3.1 Genders — static cluster configuration database 5 Genders database format 5 • Nodeattr usage 6 3.2 pdsh — parallel remote shell program 6 3.3 PowerMan — centralized power control for clusters 7 iii Administration Guide 3.4 MUNGE authentication 8 Setting up MUNGE authentication 8 • Enabling and starting MUNGE 9 3.5 mrsh/mrlogin — remote login using MUNGE authentication 10 4 Hardware 11 4.1 cpuid 11 4.2 hwloc — portable abstraction of hierarchical architectures for high- performance computing 11 5 Slurm — utility for HPC workload management 12 5.1 Installing Slurm 12 Minimal installation 12 • Installing the Slurm database 15 5.2 Slurm administration commands 17 scontrol 17 5.3 Upgrading Slurm 21 Upgrade workflow 22 5.4 Additional resources 29 Frequently asked questions 29 • External documentation 30 6 Monitoring and logging 31 6.1 ConMan — the console manager 31 6.2 Monitoring High Performance Computing clusters 32 Installing Prometheus and Grafana 33 • Monitoring cluster workloads 34 • Monitoring compute node performance 36 6.3 Ganglia — system monitoring 38 Using Ganglia 38 • Ganglia on Btrfs 38 • Using the Ganglia Web interface 39 6.4 rasdaemon — utility to log RAS error tracings 39 iv Administration Guide 7 HPC user libraries 41 7.1 Lmod — Lua-based environment modules 41 Installation and basic usage 41 • Listing available modules 42 • Listing loaded modules 42 • Gathering information about a module 42 • Loading modules 42 • Environment variables 42 • For more information 43 7.2 GNU Compiler Toolchain Collection for HPC 43 Environment module 43 • Building High Performance Computing software with GNU Compiler Suite 43 • Later versions 44 7.3 High Performance Computing libraries 44 boost — Boost C++ Libraries 45 • FFTW HPC library — discrete Fourier transforms 46 • NumPy Python library 46 • SciPy Python Library 47 • HYPRE — scalable linear solvers and multigrid methods 47 • METIS — serial graph partitioning and fill-reducing matrix ordering library 48 • GSL — GNU Scientific Library 49 • OCR — Open Community Runtime (OCR) for shared memory 49 • memkind — heap manager for heterogeneous memory platforms and mixed memory policies 50 • MUMPS — MUltifrontal Massively Parallel sparse direct Solver 51 • Support for PMIx in Slurm and MPI libraries 51 • OpenBLAS library — optimized BLAS library 52 • PETSc HPC library — solver for partial differential equations 53 • ScaLAPACK HPC library — LAPACK routines 54 • SCOTCH — static mapping and sparse matrix reordering algorithms 54 • SuperLU — supernodal LU decomposition of sparse matrices 55 • Trilinos — object-oriented software framework 55 7.4 File format libraries 56 Adaptable IO System (ADIOS) 56 • HDF5 HPC library — model, library, and file format for storing and managing data 56 • NetCDF HPC library — implementation of self-describing data formats 58 • HPC flavor of pnetcdf 60 7.5 MPI libraries 60 v Administration Guide 7.6 Profiling and benchmarking libraries and tools 62 IMB — Intel* MPI benchmarks 62 • PAPI HPC library — consistent interface for hardware performance counters 63 • mpiP — lightweight MPI profiling library 63 A GNU licenses 65 vi Administration Guide Preface 1 Available documentation Online documentation The online documentation for this product is available at https://documentation.suse.com/ #sle-hpc . Browse or download the documentation in various formats. Find the online documentation for other products at https://documentation.suse.com/ . Note: Latest updates The latest documentation updates are usually available in the English version of the documentation. Release notes For release notes, see https://www.suse.com/releasenotes/ . In your system For oine use, nd documentation in your installed system under /usr/share/doc . Many commands are also described in detail in their manual pages. To view them, run man , followed by a specic command name. If the man command is not installed on your system, install it with sudo zypper install man . 2 Giving feedback We welcome feedback on, and contributions to, this documentation. There are several channels for this: Service requests and support For services and support options available for your product, see http://www.suse.com/ support/ . To open a service request, you need a SUSE subscription registered at SUSE Customer Center. Go to https://scc.suse.com/support/requests , log in, and click Create New. Bug reports vii Available documentation SUSE Linux Enterp… 15 SP3 Report issues with the documentation at https://bugzilla.suse.com/ . Reporting issues requires a Bugzilla account. To simplify this process, you can use the Report Documentation Bug links next to headlines in the HTML version of this document. These preselect the right product and category in Bugzilla and add a link to the current section. You can start typing your bug report right away. Contributions To contribute to this documentation, use the Edit Source links next to headlines in the HTML version of this document. They take you to the source code on GitHub, where you can open a pull request. Contributing requires a GitHub account. For more information about the documentation environment used for this documentation, see the repository's README at https://github.com/SUSE/doc-hpc . Mail You can also report errors and send feedback concerning the documentation to doc- [email protected] . Include the document title, the product version, and the publication date of the document. Additionally, include the relevant section number and title (or provide the URL) and provide a concise description of the problem. 3 Documentation conventions The following notices and typographic conventions are used in this document: /etc/passwd : Directory names and le names PLACEHOLDER : Replace PLACEHOLDER with the actual value PATH : An environment variable ls , --help : Commands, options, and parameters user : The name of user or group package_name : The name of a software package Alt , Alt – F1 : A key to press or a key combination. Keys are shown in uppercase as on a keyboard. File, File Save As: menu items, buttons viii Documentation conventions SUSE Linux Enterp… 15 SP3 AMD/Intel This paragraph is only relevant for the Intel 64/AMD64 architectures. The arrows mark the beginning and the end of the text block. IBM Z, POWER This paragraph is only relevant for the architectures IBM Z and POWER . The arrows mark the beginning and the end of the text block. Chapter 1, “Example chapter”: A cross-reference to another chapter in this guide. Commands that must be run with root privileges. Often you can also prex these commands with the sudo command to run them as non-privileged user. root # command tux > sudo command Commands that can be run by non-privileged users. tux > command Notices Warning: Warning notice Vital information you must be aware of before proceeding. Warns you about security issues, potential loss of data, damage to hardware, or physical hazards. Important: Important notice Important information you should be aware of before proceeding. Note: Note notice Additional information, for example about dierences in software versions. Tip: Tip notice Helpful information, like a guideline or a piece of practical advice. Compact Notices Additional information, for example about dierences in software versions. ix Documentation conventions SUSE Linux Enterp… 15 SP3 Helpful information, like a guideline or a piece of practical advice. 4 Support Find the support statement for SUSE Linux Enterprise High Performance Computing and general information about technology previews below. For details about the product lifecycle, see https://www.suse.com/lifecycle . If you are entitled to support, nd details on how to collect information for a support ticket at https://documentation.suse.com/sles-15/html/SLES-all/cha-adm-support.html

SUSE Linux Enterprise High Performance Computing 15 SP3 Administration Guide Administration Guide SUSE Linux Enterprise High Performance Computing 15 SP3

Slurm Overview and Elasticsearch Plugin

Intel® SSF Reference Design: Intel® Xeon Phi™ Processor, Intel® OPA

Version Control Graphical Interface for Open Ondemand

Extending SLURM for Dynamic Resource-Aware Adaptive Batch

Distributed HPC Applications with Unprivileged Containers Felix Abecassis, Jonathan Calmels GPUNVIDIA Computing Beyond Video Games

Cluster Deployment

Sansa - the Supercomputer and Node State Architecture

Intelligent Colocation of HPC Workloads?,??

Winter 2016 Vol

Virtual Machine Checkpoint Storage and Distribution for Simuboost

The Crossroads of Cloud And

Lic-Sec: an Enhanced Apparmor Docker Security Profile Generator