Front cover

IBM High-Performance Computing Insights with IBM Power System AC922 Clustered Solution

Dino Quintero
Miguel Gomez Gonzalez
Ahmad Y Hussein
Jan-Frode Myklebust

Redbooks

International Technical Support Organization

IBM High-Performance Computing Insights with IBM Power System AC922 Clustered Solution

May 2019

SG24-8422-00

Note: Before using this information and the product it supports, read the information in “Notices” on page ix.

First Edition (May 2019)

This edition applies to:

- IBM Power System AC922 Firmware OP9-V2.0-2.14
- Extreme Cloud Administration Toolkit (xCAT) V2.14.3
- Red Hat Enterprise Linux Server 7.5 ALT (RHEL-ALT-7.5-20180315.0-Server-ppc64le-dvd1.iso)
- NVIDIA Compute Unified Device Architecture (CUDA) V9.2 (cuda-repo-rhel7-9-2-local-9.2.148-1.ppc64le)
- Mellanox MLNX_OFED_LINUX-4.3-4.0.5.1-rhel7.5alternate-ppc64le.iso
- IBM XLC xlc-16.1.0-1-ppc64le.NEED_PRODUCT_PKGS.tar.bz2
- IBM XLF xlf-16.1.0-1-ppc64le.NEED_PRODUCT_PKGS.tar.bz2
- Advance Toolchain (AT) at12.0
- IBM SMPI ibm_smpi-10.02.00.09rtm5-rh7_20181010.ppc64le.rpm
- IBM Parallel Performance Toolkit ppt-2.4.0-2.tar.bz2
- IBM Engineering and Scientific Subroutine Library (IBM ESSL) essl-6.1.0-1-ppc64le
- IBM Parallel Engineering and Scientific Subroutine Library (IBM Parallel ESSL) pessl-5.4.0-0-ppc64le
- IBM Spectrum Scale 5.0.1-2
- The Portland Group, Inc. (PGI) pgilinux-2018-187-ppc64le.tar.gz
- IBM Spectrum Load Sharing Facility (IBM Spectrum LSF) lsf-10.1-build_number.ppc64le_csm.bin
- IBM Cluster Systems Management V1.3.0

© Copyright International Business Machines Corporation 2019. All rights reserved. Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents

Notices ...... ix
Trademarks ...... x

Preface ...... xi
Authors ...... xi
Now you can become a published author, too! ...... xii
Comments welcome ...... xiii
Stay connected to IBM Redbooks ...... xiii

Part 1. Planning...... 1

Chapter 1. Introduction to IBM high-performance computing ...... 1
1.1 Overview of HPC ...... 2
1.2 Reasons for implementing HPC on IBM POWER9 processor-based systems ...... 2
1.3 Overview of the POWER9 processor-based CORAL Project ...... 3

Chapter 2. IBM Power System AC922 server for HPC overview ...... 7
2.1 Power AC922 server for HPC ...... 8
2.2 Functional component description ...... 8
2.2.1 Power AC922 models ...... 9
2.2.2 POWER9 processor ...... 10
2.2.3 Memory subsystem ...... 11
2.2.4 PCI adapters ...... 12
2.2.5 IBM CAPI2 ...... 15
2.2.6 NVLink 2.0 ...... 17
2.2.7 Baseboard management controller ...... 18

Chapter 3. Software stack ...... 19
3.1 Red Hat Enterprise Linux ...... 20
3.2 Device drivers ...... 20
3.2.1 Mellanox OpenFabrics Enterprise Distribution ...... 20
3.2.2 NVIDIA CUDA ...... 21
3.3 Development environment software ...... 23
3.3.1 IBM XL compilers ...... 23
3.3.2 LLVM ...... 23
3.3.3 IBM Engineering and Scientific Subroutine Library ...... 24
3.3.4 IBM Parallel Engineering and Scientific Subroutine Library ...... 25
3.3.5 IBM Spectrum MPI ...... 26
3.3.6 IBM Parallel Performance Toolkit ...... 27
3.4 Workload management ...... 28
3.4.1 IBM Spectrum Load Sharing Facility ...... 28
3.5 Cluster management software ...... 28
3.5.1 Extreme Cluster and Cloud Administration Toolkit ...... 28
3.5.2 Cluster Administration and Storage Tools ...... 29
3.5.3 Mellanox Unified Fabric Manager ...... 30
3.6 IBM Storage and file systems ...... 30
3.6.1 IBM Spectrum Scale ...... 30
3.6.2 IBM Elastic Storage Server ...... 31

Chapter 4. Reference architecture ...... 33
4.1 Large HPC cluster architecture ...... 34
4.2 Medium HPC cluster architecture ...... 40
4.3 Software stack mapping ...... 41
4.4 Generic sizing of nodes ...... 42
4.5 Ethernet network layout ...... 43
4.6 InfiniBand network ...... 44
4.6.1 InfiniBand network topologies ...... 44
4.6.2 Fat tree topology ...... 44
4.7 Job launch overview ...... 46

Part 2. Deployment ...... 49

Chapter 5. Nodes and software deployment ...... 51
5.1 Deployment overview ...... 52
5.2 System management ...... 52
5.2.1 The Intelligent Platform Management Interface tool ...... 52
5.2.2 OpenBMC ...... 54
5.2.3 Firmware update ...... 55
5.2.4 Boot order configuration ...... 57
5.3 xCAT deployment overview ...... 59
5.3.1 xCAT database: Objects and tables ...... 60
5.3.2 xCAT node booting ...... 60
5.3.3 xCAT node discovery ...... 61
5.3.4 xCAT baseboard management controller discovery ...... 62
5.3.5 xCAT installation types: Disks and state ...... 62
5.3.6 xCAT network interfaces: Primary and additional ...... 62
5.3.7 xCAT software kits ...... 63
5.3.8 xCAT synchronizing files ...... 63
5.3.9 xCAT version ...... 64
5.3.10 xCAT scenario for high-performance computing ...... 64
5.4 Initial xCAT management node installation on an IBM Power System LC922 server ...... 65
5.4.1 Red Hat Enterprise Linux server ...... 65
5.4.2 xCAT packages and installation ...... 72
5.4.3 Configuring more network interfaces ...... 75
5.4.4 Host name and alias ...... 76
5.4.5 xCAT networks ...... 76
5.4.6 DNS server ...... 78
5.4.7 DHCP server ...... 80
5.4.8 Intelligent Platform Management Interface credentials ...... 81
5.5 xCAT node discovery ...... 82
5.5.1 Verification of network boot configuration and Genesis image files ...... 83
5.5.2 Configuring the DHCP dynamic range ...... 84
5.5.3 Configuring BMCs to DHCP mode ...... 85
5.5.4 Definition of temporary BMC objects ...... 87
5.5.5 Defining node objects ...... 88
5.5.6 Configuring the host table, DNS, and DHCP servers ...... 90
5.5.7 Booting into node discovery ...... 91
5.6 xCAT compute nodes (stateless) ...... 94
5.6.1 Network interfaces ...... 94
5.6.2 Red Hat Enterprise Linux images ...... 101
5.6.3 NVIDIA CUDA Toolkit ...... 104
5.6.4 Mellanox OpenFabrics Enterprise Distribution ...... 107

5.6.5 IBM XL C/C++ runtime libraries ...... 108
5.6.6 IBM XL Fortran runtime libraries ...... 110
5.6.7 Advance Toolchain runtime libraries ...... 111
5.6.8 IBM Spectrum MPI ...... 113
5.6.9 IBM Parallel Performance Toolkit ...... 114
5.6.10 IBM Engineering and Scientific Subroutine Library ...... 115
5.6.11 IBM Parallel Engineering and Scientific Subroutine Library ...... 116
5.6.12 IBM Spectrum Scale (formerly IBM GPFS) ...... 117
5.6.13 PGI runtime libraries ...... 121
5.6.14 IBM Spectrum LSF integration with Cluster Systems Management ...... 122
5.6.15 Synchronizing the configuration files ...... 127
5.6.16 Generating and packing the image ...... 128
5.6.17 Node provisioning ...... 129
5.6.18 Postinstallation verification ...... 130
5.7 xCAT login nodes (stateful) ...... 133

Chapter 6. Cluster Administration and Storage Tools ...... 137
6.1 Cluster Systems Management ...... 138
6.2 Preparing CSM ...... 138
6.2.1 Software dependencies ...... 138
6.2.2 Installation ...... 140
6.2.3 CSM RPMs overview ...... 140
6.2.4 Installing CSM on to the management node ...... 141
6.2.5 Installing CSM on to the service node ...... 141
6.2.6 Installing CSM in the login, launch, and workload manager nodes ...... 142
6.2.7 Installing CSM in the compute nodes ...... 143
6.2.8 Configuration ...... 143
6.2.9 Configuring the CSM database ...... 143
6.2.10 Default configuration files ...... 144
6.2.11 Configuring SSL ...... 145
6.2.12 Heartbeat interval ...... 145
6.2.13 Environmental buckets ...... 146
6.2.14 Prolog and epilog scripts ...... 146
6.2.15 CSM Pluggable Authentication Module ...... 146
6.2.16 Starting the CSM daemons ...... 147
6.2.17 Running the infrastructure health check ...... 147
6.2.18 Setting up the environment for job launch ...... 149
6.2.19 Installing and configuring the CSM REST daemon ...... 149
6.2.20 Uninstalling the CSM daemons ...... 150
6.2.21 Diskless images ...... 150
6.3 Burst Buffer ...... 151
6.3.1 Installing Burst Buffer ...... 151
6.3.2 Using the Burst Buffer ...... 156
6.3.3 Troubleshooting ...... 157

Part 3. Application development ...... 159

Chapter 7. Compilation, execution, and application development ...... 161
7.1 Compiler options ...... 162
7.1.1 IBM XL compiler options ...... 162
7.1.2 GNU Compiler Collection compiler options ...... 165
7.2 Porting applications to IBM Power Systems servers ...... 167
7.3 IBM Engineering and Scientific Subroutine Library ...... 173
7.3.1 IBM ESSL Compilation in Fortran, IBM XL C/C++, and GCC/G++ ...... 174

7.3.2 IBM ESSL example ...... 177
7.4 IBM Parallel Engineering and Scientific Subroutine Library ...... 179
7.4.1 Program development ...... 180
7.4.2 Using GPUs with the IBM Parallel ESSL ...... 182
7.4.3 Compilation ...... 186
7.5 Using POWER9 vectorization ...... 189
7.5.1 AltiVec operations with GCC ...... 189
7.5.2 AltiVec operations with IBM XL ...... 191
7.6 Development models ...... 193
7.6.1 OpenMP programs with IBM Parallel Environment ...... 194
7.6.2 CUDA C programs with the NVIDIA CUDA Toolkit ...... 196
7.6.3 OpenACC ...... 200
7.6.4 IBM XL C/C++ and Fortran offloading ...... 204
7.6.5 MPI programs with IBM Parallel Environment V2.3 ...... 209
7.6.6 Hybrid MPI and CUDA programs with IBM Parallel Environment ...... 215
7.6.7 OpenSHMEM programs in IBM Parallel Environment ...... 218
7.6.8 Parallel Active Messaging Interface programs ...... 219
7.6.9 MPI programs that use IBM Spectrum MPI ...... 220
7.6.10 Migrating from an IBM Parallel Environment Runtime Edition environment to IBM Spectrum MPI ...... 221
7.6.11 Using IBM Spectrum MPI ...... 222

Chapter 8. Running parallel software, performance enhancement, and scalability testing ...... 227
8.1 Controlling the running of multithreaded applications ...... 228
8.1.1 Running OpenMP applications ...... 228
8.1.2 Setting and retrieving process affinity at run time ...... 231
8.1.3 Controlling the NUMA policy for processes and shared memory ...... 231
8.2 Performance enhancements and scalability tests ...... 233
8.2.1 IBM ESSL execution in multiple CPUs and GPUs ...... 233
8.2.2 OpenACC execution and scalability ...... 240
8.3 Using IBM Parallel Environment V2.3 ...... 241
8.3.1 Running applications ...... 241
8.3.2 Managing applications ...... 246
8.3.3 Running OpenSHMEM programs ...... 246
8.4 Using IBM Spectrum LSF ...... 247
8.4.1 Submit jobs ...... 247
8.4.2 Managing jobs ...... 252
8.5 IBM Spectrum LSF Job Step Manager ...... 254
8.6 Running tasks with IBM Spectrum MPI ...... 256

Chapter 9. Measuring and tuning applications ...... 259
9.1 Effects of basic performance tuning techniques ...... 260
9.1.1 Performance effect of a rational choice of SMT mode ...... 262
9.1.2 Effect of optimization options on performance ...... 270
9.1.3 Favorable modes and options for applications from the NPB suite ...... 274
9.1.4 Importance of binding threads to logical processors ...... 274
9.2 General methodology of performance benchmarking ...... 275
9.2.1 Defining the purpose of performance benchmarking ...... 276
9.2.2 Benchmarking plans ...... 277
9.2.3 Defining the performance metric and constraints ...... 277
9.2.4 Defining the success criteria ...... 278
9.2.5 Correctness and determinacy ...... 278
9.2.6 Keeping the log of benchmarking ...... 279
9.2.7 Probing scalability ...... 280
9.2.8 Evaluation of performance on a favorable number of cores ...... 281
9.2.9 Evaluation of scalability ...... 282
9.2.10 Conclusions ...... 283
9.2.11 Summary ...... 283
9.3 Sample code for the construction of thread affinity strings ...... 283
9.4 IBM Engineering and Scientific Subroutine Library performance results ...... 287
9.5 GPU tuning ...... 292
9.5.1 GPU processing modes ...... 292
9.5.2 CUDA Multi-Process Service ...... 292
9.6 Application development and tuning tools ...... 295
9.6.1 Parallel Performance Toolkit ...... 295
9.6.2 Parallel application debuggers ...... 310
9.6.3 Eclipse for Parallel Application Developers ...... 310
9.6.4 NVIDIA Nsight Eclipse Edition for CUDA C/C++ ...... 312
9.6.5 Command-line tools for CUDA C/C++ ...... 318

Appendix A. Additional material ...... 327
Locating the web material ...... 327
Using the web material ...... 327
System requirements for downloading the web material ...... 327
Downloading and extracting the web material ...... 328

Related publications ...... 329
IBM Redbooks ...... 329
Online resources ...... 329
Help from IBM ...... 329

Notices

This information was developed for products and services offered in the US. This material might be available from IBM in other languages. However, you may be required to own a copy of the product or product version in that language in order to access it.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user’s responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing, IBM Corporation, North Castle Drive, MD-NC119, Armonk, NY 10504-1785, US

INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some jurisdictions do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this information to non-IBM websites are provided for convenience only and do not in any manner serve as an endorsement of those websites. The materials at those websites are not part of the materials for this IBM product and use of those websites is at your own risk.

IBM may use or distribute any of the information you provide in any way it believes appropriate without incurring any obligation to you.

The performance data and client examples cited are presented for illustrative purposes only. Actual performance results may vary depending on specific configurations and operating conditions.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

Statements regarding IBM’s future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to actual people or business enterprises is entirely coincidental.

COPYRIGHT LICENSE:

This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs. The sample programs are provided “AS IS”, without warranty of any kind. IBM shall not be liable for any damages arising out of your use of the sample programs.

Trademarks

IBM, the IBM logo, and .com are trademarks or registered trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at “Copyright and trademark information” at http://www.ibm.com/legal/copytrade.shtml

The following terms are trademarks or registered trademarks of International Business Machines Corporation, and might also be trademarks or registered trademarks in other countries.

Redbooks (logo)®, AIX®, EnergyScale™, FDPR®, GPFS™, IBM®, IBM Elastic Storage™, IBM Spectrum™, IBM Spectrum Scale™, LSF®, POWER®, Power Architecture®, Power Systems™, POWER8®, POWER9™, PowerLinux™, PowerPC®, and Redbooks®

The following terms are trademarks of other companies:

Linux is a trademark of Linus Torvalds in the United States, other countries, or both.

Microsoft, Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Other company, product, or service names may be trademarks or service marks of others.

Preface

This IBM® Redbooks® publication documents how to set up a complete infrastructure environment and tune applications to use an IBM POWER9™ hardware architecture with the technical computing software stack.

This publication is driven by a CORAL project solution. It explores, tests, and documents how to implement an IBM High-Performance Computing (HPC) solution on a POWER9 processor-based system by using IBM technical innovations to help solve challenging scientific, technical, and business problems.

This book documents the HPC clustering solution with InfiniBand on IBM Power Systems™ AC922 8335-GTH and 8335-GTX servers with V100 SXM2 graphics processing units (GPUs) with NVLink, software components, and the IBM Spectrum™ Scale parallel file system.

This solution includes recommendations about the components that are used to provide a cohesive clustering environment that includes job scheduling, parallel application tools, scalable file systems, administration tools, and a high-speed interconnect.

This book is divided into three parts:
- “Planning” on page 1 focuses on the planners of the solution.
- “Deployment” on page 49 focuses on the administrators.
- “Application development” on page 159 focuses on the developers.

This book targets technical professionals (consultants, technical support staff, IT architects, and IT specialists) who are responsible for delivering cost-effective HPC solutions that help uncover insights among clients’ data so that they can act to optimize business results, product development, and scientific discoveries.

Authors

This book was produced by a team of specialists from around the world working at the International Technical Support Organization, Poughkeepsie Center.

Dino Quintero is a Solutions Enablement Project Leader and an IBM Level 3 Certified Senior IT Specialist with the International Technical Support Organization in Poughkeepsie, New York. Dino shares his technical computing passion and expertise by leading teams developing content in the areas of enterprise continuous availability, enterprise systems management, HPC, cloud computing, and analytics solutions. He also is an Open Group Distinguished IT Specialist. Dino holds a Master of Computing Information Systems degree and a Bachelor of Science degree in Computer Science from Marist College.

Miguel Gomez Gonzalez is an IBM Power Systems Integration engineer based in Mexico with over 11 years of experience in Linux and IBM Power Systems technologies. He has extensive experience in implementing several IBM Linux and AIX® clustering solutions. His areas of expertise are Power Systems performance boost, virtualization, high availability, system administration, and test design. Miguel holds a Master's degree in Information Technology Management from ITESM.

Ahmad Y Hussein is an IBM Systems Lab Services lead consultant who is based in Saudi Arabia and covers the Middle East and Africa. He is an IBM Certified Technical Expert with more than 14 years of experience in designing, leading, and implementing solutions for IBM POWER®, HPC, AIX, and Linux. Ahmad has worked for IBM since 2011. His areas of expertise include availability and performance optimization for IBM Power Systems and HPC solutions. Ahmad holds an M.Sc. in Industrial Engineering and a B.Sc. in Electrical and Computing Engineering.

Jan-Frode Myklebust is an IBM Systems Lab Services consultant who is based in Norway and covers the Nordic countries. He has 20 years of experience in HPC, initially as a system administrator at one of Norway’s national HPC sites, working with systems ranging from large nonuniform memory access (NUMA) installations spanning whole machine rooms and composed of racks of 32-CPU symmetric multiprocessor (SMP) machines, to smaller dual-CPU Linux-based clusters. His areas of expertise include IBM Spectrum Scale™ and Linux servers.

Thanks to the following people for their contributions to this project:

Wade Wallace International Technical Support Organization, Austin Center

Robert Blackmore, John Dunham, Robin Hanrahan, Victor Hu, Joan McComb, Serban Maerean, Mark Perez, Fernando Pizzano, Cassandra Qui
IBM Poughkeepsie

Thomas Gooding IBM Rochester

David Appelhans and Jaime H Moreno IBM Yorktown Heights

Stanley Graves IBM Dallas

Majid Ouassir IBM France

Sean McCombe IBM Oregon

Shelly Hu IBM Virginia

Jose Ivan Bobadilla IBM Mexico

Now you can become a published author, too!

Here’s an opportunity to spotlight your skills, grow your career, and become a published author—all at the same time! Join an ITSO residency project and help write a book in your area of expertise, while honing your experience using leading-edge technologies. Your efforts will help to increase product acceptance and customer satisfaction, as you expand your network of technical contacts and relationships. Residencies run from two to six weeks in length, and you can participate either in person or as a remote resident working from your home base.

Find out more about the residency program, browse the residency index, and apply online at: ibm.com/redbooks/residencies.html

Comments welcome

Your comments are important to us!

We want our books to be as helpful as possible. Send us your comments about this book or other IBM Redbooks publications in one of the following ways:
- Use the online Contact us review Redbooks form found at: ibm.com/redbooks
- Send your comments in an email to: [email protected]
- Mail your comments to: IBM Corporation, International Technical Support Organization, Dept. HYTD Mail Station P099, 2455 South Road, Poughkeepsie, NY 12601-5400

Stay connected to IBM Redbooks

- Find us on Facebook: http://www.facebook.com/IBMRedbooks
- Follow us on Twitter: http://twitter.com/ibmredbooks
- Look for us on LinkedIn: http://www.linkedin.com/groups?home=&gid=2130806
- Explore new Redbooks publications, residencies, and workshops with the IBM Redbooks weekly newsletter: https://www.redbooks.ibm.com/Redbooks.nsf/subscribe?OpenForm
- Stay current on recent Redbooks publications with RSS Feeds: http://www.redbooks.ibm.com/rss.html


Part 1 Planning

This part introduces the solution concept for high-performance computing (HPC), provides a description of the IBM Power System AC922 systems, describes the software stack that can be used in the solution, and provides details about the reference architecture.

The following chapters are included in this part:
- Introduction to IBM high-performance computing
- IBM Power System AC922 server for HPC overview
- Software stack
- Reference architecture


Chapter 1. Introduction to IBM high-performance computing

The following topics are described in this chapter:
- Overview of HPC
- Reasons for implementing HPC on IBM POWER9 processor-based systems
- Overview of the POWER9 processor-based CORAL Project

1.1 Overview of HPC

What is high-performance computing (HPC)? One simple definition is:

“High-performance computing most generally refers to the practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop computer or workstation in order to solve large problems in science, engineering, or business”.1

An HPC cluster is a combination of high-performance compute nodes, a low-latency interconnect fabric with high bandwidth, high-performance parallel storage, and system software, which addresses the most challenging requirements for HPC and high-performance data analytics (HPDA).

A typical HPC solution stack is composed of:
- Infrastructure:
  – High-performing hardware servers
  – High-speed storage
  – High-speed interconnect
- Management software
- Scheduling software
- Software development environment
- Applications
- Health Insight solution

These solutions can be applied in a wide range of industries. For example, automotive, aerospace, and electronics companies that are focused on electronic design automation (EDA), computer-aided engineering (CAE), big data analytics, and artificial intelligence (AI) benefit greatly from highly optimized computing environments.

Details for these components are explained throughout this book.

1.2 Reasons for implementing HPC on IBM POWER9 processor-based systems

HPC environments have been evolving rapidly for the past few years and expanding to include diverse workloads like big data and AI.

Traditional HPC clusters cannot deliver adequate performance and scalability because of I/O bottlenecks and network latency when moving large data sets. These bottlenecks slow down real-time insights.

The approach of the exascale era, where systems perform an exaFLOP (one billion billion calculations per second), requires innovation, such as POWER9 processor-based systems. IBM introduces high-performance systems with incredible performance and efficiency at affordable acquisition and maintenance prices.

1 https://insidehpc.com/hpc-basic-training/what-is-hpc/

POWER9 processors were conceived to have new and unique features for technical computing to offer the following advantages:
- Help deliver the highest performance per core.
- Achieve energy efficiency within processor technologies.
- Deliver scalability.
- Provide reliability.

IBM and NVIDIA worked together to integrate IBM Power Systems server with NVIDIA graphics processing units (GPUs) to enable GPU-accelerated applications and workloads.

The computational capability that is provided by the combination of NVIDIA Tesla GPUs and IBM Power Systems servers enables workloads for scientific, technical, and HPC jobs to run on hardware. In most cases, these workloads run on supercomputing hardware. This computational capability is built on top of massively parallel and multithreaded cores with NVIDIA Tesla GPUs and IBM Power Architecture® processors, where processor-intensive operations are offloaded to GPUs and coupled with the system’s high memory-hierarchy bandwidth and I/O throughput.

IBM also partnered with Mellanox to integrate Mellanox InfiniBand networking solutions into IBM solutions. This enhanced interconnection of different HPC components eliminated I/O bottlenecks.

Another advantage of implementing HPC on POWER9 processor-based systems is the rich infrastructure of software that is built around it to take advantage of its strength. The software stack that is contributed by IBM, OpenPOWER partners, and open source entities satisfies the needs of technical computing users, administrators, and developers.

In summary, the HPC solutions from IBM have the following attributes:
- They are developed specifically to support compute-intensive big data analytics and AI computing environments.
- They are powered by industry-leading hardware: POWER9 processors, NVIDIA GPUs, and Mellanox connectivity solutions.
- They offer a robust and integrated solution with ease of administration.

1.3 Overview of the POWER9 processor-based CORAL Project

Because IBM was leading the definition of the new POWER processor-based architectures within the OpenPOWER Foundation, IBM learned in 2013 about two contracts to build the next generation of supercomputers for the US Department of Energy’s Oak Ridge National Laboratory (ORNL) and Lawrence Livermore National Laboratory (LLNL), which are collectively known as the CORAL program.

Both CORAL systems (Summit at ORNL and Sierra at LLNL) were the number one and number two top-performing supercomputers on the Top500 list.

For example, Summit has a hybrid architecture, and each node contains multiple POWER9 processors and NVIDIA Volta GPUs connected together by NVIDIA NVLink. Each node has over half a terabyte of coherent memory (high-bandwidth memory (HBM2) + DDR4) addressable by all processors and GPUs plus the non-volatile RAM that can be used as a burst buffer (BB) or as extended memory. To provide a high rate of I/O throughput, the nodes are connected in a non-blocking fat-tree by using a dual-rail Mellanox EDR InfiniBand interconnect.

By using Summit, researchers in all fields of science have unprecedented access to solve some of the world’s most pressing challenges. For more information about this solution, see Summit.

CORAL was designed as an architecture of three interconnected pillars:
- Computing cluster
- High-speed storage cluster
- Cluster-insights cluster

In this architecture:
1. One pillar does not affect the functioning of the other pillars; for example, shutting down or maintaining a pillar does not require the other pillars to shut down.
2. The timeline requirements for booting computing clusters are met.
3. There is an effective central solution that handles all cluster events and provides real-time analysis and decisions.

Figure 1-1 shows the main pillars of the POWER9 processor-based CORAL supercomputer solutions.

Figure 1-1 Pillars of the HPC cluster

Figure 1-2 shows the details of the components of the CORAL system.

(The figure groups the cluster building blocks into columns: Infrastructure, Management Software, Scheduling Software, Development Environment, Applications, and Health Insights Software.)

Figure 1-2 Building components of a CORAL HPC cluster

Important: This architecture provides the best performance and cluster utilization efficiency for jobs that require high parallelism and high processor and GPU resources.

Here are the major components that are used to provide a cohesive clustering environment that includes job scheduling, parallel application tools, scalable file systems, administration tools, and a high-speed interconnect:
- Red Hat Enterprise Linux 7.5 for POWER9 processor Little Endian
- Extreme Cluster and Cloud Administration Toolkit (xCAT)
- Cluster Administration and Storage Tools (CAST) (IBM Cluster Systems Management (CSM) and IBM Burst Buffer)
- Big data storage (BDS) (Elasticsearch, Logstash, and Kibana (ELK) stack-based)
- IBM Spectrum Load Sharing Facility (IBM Spectrum LSF®)
- IBM Spectrum Message Passing Interface (IBM Spectrum MPI)
- IBM Engineering and Scientific Subroutine Library (IBM ESSL) and IBM Parallel ESSL
- IBM Xpertise Library (XL) C/C++
- IBM XL Fortran
- LLVM compilers
- The Portland Group, Inc. (PGI) compilers
- IBM Parallel Performance Toolkit
- IBM Spectrum Scale
- IBM Elastic Storage™ Server

- Power AC922 server
- Peripheral Component Interconnect Express (PCIe) Gen3 1.6 TB NVMe flash adapters
- NVIDIA Compute Unified Device Architecture (CUDA) Toolkit
- NVIDIA Data Center GPU Manager (DCGM)
- Tesla V100 SXM2 GPU
- Mellanox OpenFabrics Enterprise Distribution (OFED)
- Mellanox Unified Fabric Manager (UFM)
- Mellanox InfiniBand switches
- Mellanox CX5 adapter

This book focuses on the computing cluster pillar and provides reference links for the other pillars.

The next chapters describe each of the components in terms of planning, deploying, maintaining, and developing.


Chapter 2. IBM Power System AC922 server for HPC overview

This chapter introduces the IBM Power System AC922 (Power AC922) server for high-performance computing (HPC).

The following topics are described in this chapter:
- Power AC922 server for HPC
- Functional component description

For more information about IBM Systems hardware and software for HPC, see IBM Power Systems.

2.1 Power AC922 server for HPC

The Power AC922 server is the next generation of IBM POWER processor-based systems, which are designed for deep learning (DL) and artificial intelligence (AI), high-performance data analytics (HPDA), and HPC.

The system is co-designed by IBM and other OpenPOWER Foundation members (NVIDIA, Mellanox, and others). Two of these systems are deployed at DOE/SC Oak Ridge National Laboratory (Summit) and DOE/NNSA Lawrence Livermore National Laboratory (Sierra), and they are the most powerful supercomputers on the planet.1

The system delivers the latest technologies that are available for HPC and improves the movement of data from memory to graphics processing units (GPUs) and vice versa, which enables faster, lower-latency data processing. This massive computing capacity is packed into 2U of rack space. Two cooling options support the largest configurations: the air-cooled model 8335-GTH and the water-cooled model 8335-GTX.

2.2 Functional component description

Among the new technologies in the Power AC922 servers, the most significant are:
- Two IBM POWER9 processors with up to 40 cores for model 8335-GTH or 44 cores for model 8335-GTX, and improved buses
- Up to 2 TB of DDR4 memory with improved speed
- Up to six NVIDIA Tesla V100 (Volta) GPUs with NVLink, with 16 GB or 32 GB HBM2 memory
- Second-generation NVLink technology with 2x throughput compared to the first generation
- Four Peripheral Component Interconnect Express (PCIe) 4.0 x16 low-profile (LP) slots; three are Coherent Accelerator Processor Interface (CAPI)-enabled for future CAPI-enabled devices (a maximum of three CAPI devices can be used concurrently)
- Two 2.5-inch SATA drives for a maximum of 4 TB (HDD) or 7.68 TB of solid-state drive (SSD) storage
- Two integrated USB 3.0 ports
- Two hot-swap and redundant power supplies (maximum 4-GPU configuration): 2200 W, 200 - 240 V and 277 V AC

Note: At the time of writing, the POWER9 processor-based family of servers are the only ones in the market with PCIe Gen 4 adapters, which have double the performance of PCIe Gen 3 adapters.

1 https://www.top500.org/system/179397

2.2.1 Power AC922 models

The Power AC922 server is available as two models that can adapt to the different client needs in terms of infrastructure or processing power, as shown in Table 2-1.

Table 2-1 Power AC922 server models

Server model   POWER9 processors   Maximum memory   Maximum GPU cards   Cooling
8335-GTH       2                   2 TB             4                   Air-cooled
8335-GTX       2                   2 TB             6                   Water-cooled

Power AC922 8335-GTH model

This summary describes the standard features of the Power AC922 model 8335-GTH server:
- A 19-in. rack-mount (2U) chassis
- Two POWER9 processor modules:
  – 16-core 2.7 GHz (3.3 GHz turbo) processor module
  – 20-core 2.4 GHz (3.0 GHz turbo) processor module
  – Up to 2048 GB of 2666 MHz DDR4 error-correcting code (ECC) memory
- Two small form factor (SFF) bays for HDDs or SSDs that support:
  – Two 1 TB 7200 RPM NL SATA disk drives
  – Two 2 TB 7200 RPM NL SATA disk drives
  – Two 960 GB SATA SSDs
  – Two 1.92 TB SATA SSDs
  – Two 3.84 TB SATA SSDs
- Integrated SATA controller
- Four PCIe Gen4 slots:
  – Two PCIe x16 Gen4 LP CAPI-enabled slots
  – One PCIe x16 Gen4 LP CAPI-enabled slot or one PCIe x8 shared Gen4 CAPI-enabled slot
  – One PCIe x4 Gen4 LP slot
- NVIDIA Tesla V100 GPU options:
  – Zero, two, or four 16 GB SXM2 NVIDIA Tesla V100 GPUs with NVLink Air-Cooled
  – Zero, two, or four 32 GB SXM2 NVIDIA Tesla V100 GPUs with NVLink Air-Cooled
- Integrated features:
  – IBM EnergyScale™ technology
  – Hot-swap and redundant cooling
  – Two 1 Gb RJ45 ports
  – One front USB 3.0 port for general use
  – One rear USB 3.0 port for general use
  – One system port with RJ45 connector
- Two power supplies (both are required)

Power AC922 8335-GTX model

This summary describes the standard features of the Power AC922 model 8335-GTX server:
- A 19-in. rack-mount (2U) chassis
- Two POWER9 processor modules:
  – 18-core 3.15 GHz (3.45 GHz turbo) processor module
  – 22-core 2.80 GHz (3.10 GHz turbo) processor module
  – Up to 2048 GB of 2666 MHz DDR4 ECC memory
- Two SFF bays for HDDs or SSDs that support:
  – Two 1 TB 7200 RPM NL SATA disk drives
  – Two 2 TB 7200 RPM NL SATA disk drives
  – Two 960 GB SATA SSDs
  – Two 1.92 TB SATA SSDs
  – Two 3.84 TB SATA SSDs
- Integrated SATA controller
- Four PCIe Gen4 slots:
  – Two PCIe x16 Gen4 LP CAPI-enabled slots
  – One PCIe x8 shared Gen4 LP CAPI-enabled slot
  – One PCIe x4 Gen4 LP slot
- NVIDIA Tesla V100 GPU options:
  – Four or six 16 GB SXM2 NVIDIA Tesla V100 GPUs with NVLink Water-Cooled
  – Four or six 32 GB SXM2 NVIDIA Tesla V100 GPUs with NVLink Water-Cooled
- Integrated features:
  – IBM EnergyScale technology
  – Hot-swap and redundant cooling
  – Two 1 Gb RJ45 ports
  – One rear USB 3.0 port for general use
  – One system port with RJ45 connector
- Two power supplies (both are required)

2.2.2 POWER9 processor

The POWER9 processor is a superscalar symmetric multiprocessor (SMP) that is designed and optimized for AI, HPC, and data-intensive workloads. The integrated NVLink technology provides enhanced memory, I/O connectivity, and improved processor-to-processor communication.

The Power AC922 server is a 2-socket system that can have 16 - 22 simultaneous multithreading (SMT)-4 cores.

The POWER9 processor is a single-chip-module-based (SCM) processor that is based on complementary metal–oxide–semiconductor (CMOS) 14-nm technology with 17 metal layers. It is optimized for cloud and data center applications. Within its 68.5 mm × 68.5 mm footprint, it has eight direct-attached memory channels for scale-out configurations. Each DDR4 channel supports up to 2666 Mbps for one DIMM per channel or 2400 Mbps for two DIMMs per channel. Two processors are tightly coupled through two 4-byte 16 Gbps elastic differential interfaces (EDIs) for SMP. There are 34 lanes of PCIe – PEC2 slots: One x16 lane mode or two x8 lanes (bifurcation), or one x8 lane and two x4 lanes.
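As a rough sanity check (this calculation is ours and assumes the standard 64-bit, that is, 8-byte, DDR4 data path per channel; it is not a figure from this chapter), the eight memory channels per socket give a theoretical peak of:

\[ 8 \times 2666\,\mathrm{MT/s} \times 8\,\mathrm{B} \approx 170.6\,\mathrm{GBps\ per\ socket} \;(\approx 341\,\mathrm{GBps\ for\ a\ two\text{-}socket\ server}) \]

The 280 GBps total memory bandwidth that is listed later in Table 2-4 sits below this value, which is consistent with it being a sustained rather than a theoretical peak figure.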

2.2.3 Memory subsystem

The Power AC922 server is a 2-socket system that supports two POWER9 SCM processors. The server supports a maximum of 16 DDR4 RDIMMs slots in the main system board that is directly connected to the POWER9 processor.

Memory features equate to a single memory DIMM. All memory DIMMs must be populated, and mixing different memory feature codes is not supported.

Table 2-2 shows the available memory options for the Power AC922 server.

Table 2-2 Power AC922 server memory options

Feature code   Description          Minimum/Maximum
#EM60          8 GB DDR4 memory     16/16
#EM61          16 GB DDR4 memory    16/16
#EM63          32 GB DDR4 memory    16/16
#EM64          64 GB DDR4 memory    16/16
#EM65          128 GB DDR4 memory   16/16

The supported maximum memory is 2048 GB, which you can achieve by installing 16 memory DIMMs (#EM65). For the Power AC922 server (models 8335-GTH and 8335-GTX), the following requirements apply:
- All the memory DIMMs must be populated.
- Memory features cannot be mixed.

Memory bandwidth

The POWER9 processor has exceptional cache, memory, and interconnect bandwidths. Table 2-3 shows the maximum bandwidth estimates for a single core on the server.

Table 2-3 Bandwidth estimates for a single core on the server

                   8335-GTH                                     8335-GTX
Single core        16 cores @ 3.3 GHz    20 cores @ 3.0 GHz     18 cores @ 3.26 GHz    22 cores @ 3.07 GHz
L1 (data) cache    158.4 GBps            144 GBps               156.48 GBps            147.36 GBps
L2 cache           158.4 GBps            144 GBps               156.48 GBps            147.36 GBps
L3 cache           211.2 GBps            192 GBps               208.64 GBps            196.48 GBps

Table 2-4 shows the overall bandwidths for the entire Power AC922 server that is populated with two processor modules.

Table 2-4 Power AC922 bandwidth with two processor modules populated

                    8335-GTH                                     8335-GTX
Total bandwidth     16 cores @ 3.3 GHz    20 cores @ 3.0 GHz     18 cores @ 3.26 GHz    22 cores @ 3.07 GHz
L1 (data) cache     2534.4 GBps           2880 GBps              2816.64 GBps           3241.92 GBps
L2 cache            2534.4 GBps           2880 GBps              2816.64 GBps           3241.92 GBps
L3 cache            3379.2 GBps           3840 GBps              3755.52 GBps           4322.56 GBps
Total memory        280 GBps              280 GBps               280 GBps               280 GBps
SMP interconnect    64 GBps               64 GBps                64 GBps                64 GBps
PCIe interconnect   272 GBps              272 GBps               272 GBps               272 GBps
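Although the per-cycle cache widths are not stated in the tables (treat them as an assumption of ours), the figures are mutually consistent: each per-core value in Table 2-3 equals the core frequency multiplied by 48 bytes per cycle for the L1 and L2 caches (64 bytes per cycle for the L3 cache), and each cache total in Table 2-4 is the per-core value multiplied by the core count. For example:

\[ 3.3\,\mathrm{GHz} \times 48\,\mathrm{B/cycle} = 158.4\,\mathrm{GBps}, \qquad 3.07\,\mathrm{GHz} \times 64\,\mathrm{B/cycle} = 196.48\,\mathrm{GBps} \]
\[ 22 \times 147.36\,\mathrm{GBps} = 3241.92\,\mathrm{GBps}, \qquad 22 \times 196.48\,\mathrm{GBps} = 4322.56\,\mathrm{GBps} \]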

2.2.4 PCI adapters

The Power AC922 server has four PCIe Gen 4 adapter slots, which double the capacity of the PCIe Gen 3 slots, as shown in Figure 2-1. PCIe Gen 4 provides 32 GBps unidirectional and 64 GBps bidirectional bandwidth, and it is compatible with PCIe Gen 3 modules.

(The figure shows the two POWER9 processors driving four PCIe Gen 4 slots, 48 lanes at 4 GBps each for an aggregated bandwidth of 192 GBps: an x4 Gen 4 slot (x8 connector), two x16 CAPI-enabled slots, and an x8 slot for an IB-EDR NIC/CAPI.)

Figure 2-1 PCI characteristics

Table 2-5 describes the capabilities of each of the four slots.

Table 2-5 PCI Gen 4 slot capabilities

Slot     Description           Card size                  CAPI-capable
Slot 1   PCI Gen 4 x4          Half-height, half-length   No
Slot 2   PCI Gen 4 x8 shared   Half-height, half-length   Yes
Slot 3   PCI Gen 4 x16         Half-height, half-length   Yes
Slot 4   PCI Gen 4 x16         Half-height, half-length   Yes

The PCIe adapters that are shown in “Local area network adapters” and “CAPI-enabled adapters” on page 13 are required as part of the HPC configuration. Table 2-6 and Table 2-7 show different options that are supported on the Power AC922 servers.

Local area network adapters

Table 2-6 shows the supported adapters for the PCIe slots to connect the AC922 server to a local area network (LAN).

Table 2-6 LAN adapters

Feature code   Description                                             Maximum   OS support
#EC2R          PCIe3 LP 2-port 10 Gb NIC & ROCE SR/Cu adapter          3         Linux
#EC2T          PCIe3 LP 2-port 25/10 Gb NIC & ROCE SR/Cu adapter       3         Linux
#EL3Z          PCIe2 LP 2-port 10/1 GbE BaseT RJ45 adapter             4         Linux
#EL4M          PCIe2 LP 4-port 1 GbE adapter                           4         Linux
#EN0T          PCIe2 LP 4-port (10 Gb+1 GbE) SR+RJ45 adapter           4         Linux
#EN0V          PCIe2 LP 4-port (10 Gb+1 GbE) Copper SFP+RJ45 adapter   4         Linux

CAPI-enabled adapters

CAPI defines a coherent accelerator interface structure for attaching special processing devices to the POWER9 processor bus. Table 2-7 shows the available CAPI adapters.

Table 2-7 CAPI adapters

Feature code   Description                                               Maximum   OS support
#EC3L          PCIe3 LP 2-port 100 GbE (NIC & RoCE) QSFP28 adapter x16   2         Linux
#EC62          PCIe4 LP 1-port 100 Gb EDR InfiniBand CAPI adapter        3         Linux
#EC64          PCIe4 LP 2-port 100 Gb EDR InfiniBand CAPI adapter        3         Linux

Compute-intensive accelerators

Compute-intensive accelerators are GPUs that are developed by NVIDIA to offload processor-intensive operations to a GPU accelerator and boost performance. The Power AC922 server can be configured with GPUs that are air-cooled or water-cooled based on the model.

NVIDIA Tesla V100 is the most advanced data center GPU built to accelerate AI, HPC, and graphics. Powered by NVIDIA Volta, the current GPU architecture, Tesla V100 offers the performance of 100 CPUs in a single GPU.

The Tesla V100 includes the following key features:
- Volta architecture: By pairing Compute Unified Device Architecture (CUDA) cores and Tensor cores within a unified architecture, a single server with Tesla V100 GPUs can replace hundreds of commodity CPU servers for traditional HPC and DL.
- Tensor cores: Equipped with 640 Tensor cores, Tesla V100 delivers 125 teraflops (TFLOPS) of DL performance, that is, 12x the Tensor FLOPS for DL training and 6x the Tensor FLOPS for DL inference compared to NVIDIA Pascal GPUs.
- Next-generation NVLink: NVIDIA NVLink in Tesla V100 delivers 2x higher throughput compared to the previous generation. Up to eight Tesla V100 accelerators can be interconnected at up to 300 GBps to unleash the highest application performance possible on a single server.
- Maximum efficiency mode: The new maximum efficiency mode enables data centers to achieve up to 40% higher compute capacity per rack within the existing power budget. In this mode, Tesla V100 runs at peak processing efficiency, providing up to 80% of the performance at half the power consumption.
- High-bandwidth memory (HBM2): With a combination of an improved raw bandwidth of 900 GBps and higher DRAM usage efficiency at 95%, Tesla V100 delivers 1.5x higher memory bandwidth over Pascal GPUs, as measured on STREAM.
- Programmability: The Tesla V100 is designed to simplify programmability. Its new independent thread scheduling enables finer-grain synchronization and improves GPU usage by sharing resources among small jobs.

The Tesla V100 delivers exceptional performance for the most demanding compute applications. It delivers the following performance benefits:
- 7.8 TFLOPS of double-precision floating point (FP64) performance
- 15.7 TFLOPS of single-precision floating point (FP32) performance
- 125 Tensor TFLOPS of mixed-precision performance
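To confirm what a particular node actually reports for its GPUs (device names, memory size, and streaming multiprocessor count), a short CUDA runtime query can be used. The following is a generic sketch, not an example from this book; it assumes that the CUDA Toolkit is installed and that the file is compiled with nvcc (for example, nvcc -o gpuinfo gpuinfo.cu).

// gpuinfo.cu - minimal sketch: list the GPUs that the CUDA runtime reports on this node.
#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("No CUDA-capable device detected.\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, %.1f GB memory, %d SMs, compute capability %d.%d\n",
               i, prop.name,
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0),
               prop.multiProcessorCount, prop.major, prop.minor);
    }
    return 0;
}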

Table 2-8 lists the compute-intensive accelerators for the 8335-GTH air-cooled version of the Power AC922 server.

Table 2-8 Power AC922 8335-GTH model air-cooled accelerators

Feature code   Description                                            Maximum   OS support
#EC4J          NVIDIA Tesla V100 GPU with NVLink Air-Cooled (16 GB)   4         Linux
#EC4L          NVIDIA Tesla V100 GPU with NVLink Air-Cooled (32 GB)   4         Linux

Table 2-9 lists the compute-intensive accelerators for the 8335-GTX water-cooled version of the Power AC922 server.

Table 2-9 Power AC922 8335-GTX model water-cooled accelerators

Feature code   Description                                              Maximum   OS support
#EC4H          NVIDIA Tesla V100 GPU with NVLink Water-Cooled (16 GB)   6         Linux
#EC4K          NVIDIA Tesla V100 GPU with NVLink Water-Cooled (32 GB)   6         Linux

Note: The Power AC922 model 8335-GTX server is a water-cooled server and cannot be configured without GPUs. The minimum order is four GPUs.

2.2.5 IBM CAPI2

IBM CAPI2 is the evolution of CAPI that defines a coherent accelerator interface structure for attaching special processing devices to the POWER9 processor bus. As with the original CAPI, CAPI2 can attach accelerators that have coherent shared memory access with the processors in the server and full virtual address translation with these processors by using standard PCIe Gen4 buses with twice the bandwidth compared to the previous generation.

Applications can have customized functions in Field Programmable Gate Arrays (FPGAs) and queue work requests directly in shared memory queues to the FPGA. Applications can also have customized functions by using the same effective addresses (pointers) that they use for any threads running on a host processor. From a practical perspective, CAPI2 enables a specialized hardware accelerator to be seen as an extra processor in the system with access to the main system memory and coherent communication with other processors in the system, as shown in Figure 2-2.

(The figure contrasts traditional PCIe acceleration, where the accelerator is reached through a device driver, with CAPI, where the accelerator attaches coherently to the processor and memory.)

Figure 2-2 CAPI2

The benefits of using CAPI2 include access to shared memory blocks directly from the accelerator, performing memory transfers directly between the accelerator and processor cache, and reducing the code path length between the adapter and the processors. This reduction in the code path length might occur because the adapter is not operating as a traditional I/O device, and there is no device driver layer to perform processing. CAPI2 also presents a simpler programming model.

CAPI2 implements the POWER Service Layer (PSL), which provides address translation and system memory cache for the accelerator functions. The custom processors on the system board, which consist of an FPGA or an ASIC, use this layer to access shared memory regions and cache areas as though they were a processor in the system. This ability enhances the performance of the data access for the device and simplifies the programming effort to use the device. Instead of treating the hardware accelerator as an I/O device, it is treated as a processor, which eliminates the requirement of a device driver to perform communication. It also eliminates the need for direct memory access (DMA) that requires system calls to the OS kernel. By removing these layers, the data transfer operation requires fewer clock cycles in the processor, improving the I/O performance.

The implementation of CAPI2 on the POWER9 processor enables hardware companies to develop solutions for specific application demands. Companies use the performance of the POWER9 processor for general applications and the custom acceleration of specific functions by using a hardware accelerator with a simplified programming model and efficient communication with the processor and memory resources.

For a list of supported CAPI2 adapters, see 2.2.4, “PCI adapters” on page 12.

2.2.6 NVLink 2.0

NVLink 2.0 is NVIDIA's new-generation high-speed interconnect technology for GPU-accelerated computing. Supported on SXM2-based Tesla V100 accelerator system boards, NVLink increases performance both for GPU-to-GPU communications and for GPU access to system memory.

Support for the GPU instruction set architecture (ISA) enables programs running on NVLink-connected GPUs to run directly on data in the memory of another GPU and on local memory. GPUs can also perform atomic memory operations on remote GPU memory addresses, enabling much tighter data sharing and improved application scaling.

NVLink 2.0 uses the NVIDIA High-Speed Signaling (NVHS) interconnect. NVHS transmits data over a link that is called an NVLink Brick, which connects two processors (GPU-to-GPU or GPU-to-CPU). A single NVLink Brick supports up to 50 GBps of bidirectional bandwidth between the end points. Multiple links can be combined to form Gangs for even higher-bandwidth connectivity between processors. The NVLink implementation in Tesla V100 supports up to six links so that a Gang has an aggregate maximum theoretical bidirectional bandwidth of 300 GBps.

Although a traditional NVLink implementation primarily focuses on interconnecting multiple NVIDIA Tesla V100 GPUs, under POWER9 it also connects Tesla V100 GPUs with IBM POWER9 processors to enable direct system memory access and provide GPUs with an extended memory that is orders of magnitude larger than the internal 16 GB memory.

On a Power Systems implementation, NVLink Bricks are always combined to provide the highest bandwidth possible.
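The gang bandwidths that are quoted in this chapter follow directly from the 50 GBps per-Brick rate:

\[ 2 \times 50\,\mathrm{GBps} = 100\,\mathrm{GBps}, \qquad 3 \times 50\,\mathrm{GBps} = 150\,\mathrm{GBps}, \qquad 6 \times 50\,\mathrm{GBps} = 300\,\mathrm{GBps} \]

So, with two GPUs per CPU the links can be ganged as three Bricks (150 GBps per connection), and with three GPUs per CPU as two Bricks (100 GBps per connection), as the next figure illustrates.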

Figure 2-3 compares the bandwidth of a POWER9 processor that is connected to two GPUs and three GPUs.

(The figure compares the NVLink gang configurations: with two GPUs per CPU, the POWER9 processor and the GPUs are connected by 3-Brick gangs at 150 GBps; with three GPUs per CPU, they are connected by 2-Brick gangs at 100 GBps.)

Figure 2-3 Bandwidth comparison with two GPUs and three GPUs

The added value of GPUs that use the NVLink technology is the flexibility in using the Power AC922 server; in addition to traditional HPC simulation, AI and big data workloads are possible by leveraging the current hardware and software stack.

Figure 2-4 shows the added throughput value.

Figure 2-4 Added throughput value

2.2.7 Baseboard management controller

The Power AC922 family of servers features a baseboard management controller (BMC) for out-of-band system management. The BMC is an efficient way to perform monitoring, management, control, and maintenance. It enables users to collect system event logs through sensors and to interact either through the command-line interface (CLI) by using the Intelligent Platform Management Interface (IPMI) or OpenBMC tools, or through the OpenBMC GUI tool.

For more information about BMC, see Managing OpenBMC Systems.

Among the BMC functions are:
- Power management
- Console (or terminal) sessions
- Network and boot device configuration
- Sensors information (for example, temperature, fan speed, and voltage)
- Firmware and vital product data (VPD) information (for example, firmware components)
- Version, serial number, and machine-type model
- Virtual HDDs and optical drives (for example, for installing an OS)
- System firmware upgrade


Chapter 3. Software stack

This chapter describes the software stack that is used in the implementation of an IBM high-performance computing (HPC) solution with the IBM Power System AC922 (Power AC922) (8335-GTH and 8335-GTX) servers.

The following topics are described in this chapter:
- Red Hat Enterprise Linux
- Device drivers
- Development environment software
- Workload management
- Cluster management software
- IBM Storage and file systems

3.1 Red Hat Enterprise Linux

Red Hat Enterprise Linux is one of the world’s leading enterprise Linux platforms. It runs on highly scalable, multi-core systems that support the most demanding workloads. Collaboration between Red Hat and engineers from major hardware vendors ensures that the operating system (OS) uses the newest hardware innovations that are available in processor design, system architecture, and device drivers to improve performance and reduce power utilization.

For more information, see Red Hat Customer Portal.

3.2 Device drivers

This section provides information about the device drivers that are used in this solution.

3.2.1 Mellanox OpenFabrics Enterprise Distribution

This is a single Virtual Protocol Interconnect (VPI) software stack that operates across all Mellanox network adapter solutions. The stack supports specific uplinks and Mellanox InfiniBand Host Channel Adapters (HCAs) to servers.

Mellanox OpenFabrics Enterprise Distribution (OFED) contains many important components, for example:
- Mellanox HCA drivers:
  – Includes both InfiniBand and Ethernet drivers.
- Upper layer protocols (ULPs):
  – For example, IP over InfiniBand.
- Utilities:
  – Diagnostic tools.
  – Performance tests.
- OpenSM: InfiniBand Subnet Manager
- Mellanox Firmware Tools (MFT)
- Message Passing Interface (MPI):
  – The OpenMPI stack supports the InfiniBand, RoCE, and Ethernet interfaces.
  – MPI benchmark tests.

Mellanox OFED supports adaptive routing (AR) management. The AR manager supports the following advanced InfiniBand features:
 Adaptive Routing (AR)
 Fast Link Fault Recovery (FLFR) and Fast Link Fault Notification (FLFN)

AR enables the switch to select the output port based on the port's load. AR supports two routing modes:
 Free AR: No constraints on the output port selection.
 Bounded AR: The switch does not change the output port during the same transmission burst. This mode minimizes the appearance of out-of-order packets.

The AR manager enables and configures the AR mechanism on fabric switches. It scans all the fabric switches, identifies which switches support AR, and configures the AR functions on these switches.

FLFR enables the switch to select the alternative output port if the output port that is provided in the Linear Forwarding Table is not in the Armed or Active state. This mode enables the fastest traffic recovery in case of port failures or disconnects without the intervention of the Subnet Manager.

FLFN enables the switch to report to neighbor switches an alternative output port for the traffic to a specific destination.

The AR Manager is a Subnet Manager plug-in and is installed as a part of the Mellanox OFED installation, and includes AR and FLFR features.

Mellanox OFED also provides support to storage protocols that require high throughput; low latency; and low CPU utilization, for example, NVM Express over Fabrics (NVME-oF).

NVME-oF is a technology specification for networking storage that enables NVMe message-based commands to transfer data between a host computer and a target solid-state disk (SSD) or system over a network such as Ethernet, Fibre Channel, and InfiniBand.

Tunneling NVMe commands through a Remote Direct Memory Access (RDMA) fabric provides high throughput and low latency. It is an alternative to the SCSI-based storage networking protocols.

NVME-oF Target Offload is a hardware implementation of the target (server) side of the new NVME-oF standard. Starting with the ConnectX-5 family of cards, all regular I/O requests can be processed by the HCA, which sends them directly to a real NVMe PCI device by using peer-to-peer PCI communications. Except for connection management and error flows, no CPU utilization is observed during NVME-oF traffic.
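As an initiator-side sketch (hedged; the target IP address, port, and subsystem NVMe Qualified Name (NQN) are placeholders, and the standard nvme-cli package is assumed), an NVME-oF target can be discovered and connected to over an RDMA fabric as follows:

$ modprobe nvme-rdma                              # load the NVMe over RDMA transport
$ nvme discover -t rdma -a <target_ip> -s 4420    # list subsystems that the target exports
$ nvme connect -t rdma -n <subsystem_nqn> -a <target_ip> -s 4420
$ nvme list                                       # the remote namespace now appears as a local /dev/nvmeXnY device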

For more information, see Mellanox OFED for Linux User Manual.

Mellanox OFED is required for Mellanox Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) to function. SHARP improves the performance of MPI operations by offloading collective operations from the processor to the switch network and eliminating the need to send data multiple times between end points.

The SHARP innovative approach decreases the amount of data traversing the network as aggregation nodes are reached, and dramatically reduces the MPI operations time. Implementing collective communication algorithms in the network also has extra benefits, such as freeing valuable CPU resources for computation rather than using them to process communication.

For deployment and configuration details of SHARP, see Mellanox Scalable Hierarchical Aggregator and Reduction Protocol.

3.2.2 NVIDIA CUDA

Compute Unified Device Architecture (CUDA) is a parallel computing platform and programming model that was invented by NVIDIA. It enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU).

CUDA was developed with two design goals in mind:
 Provide a small set of extensions to standard programming languages like C that enable a straightforward implementation of parallel algorithms. With CUDA C/C++, programmers can focus on the task of parallelization of the algorithms rather than spending time on their implementation.
 Support heterogeneous computation where applications use both the CPU and GPU. Serial portions of applications are run on the CPU, and parallel portions are offloaded to the GPU. As such, CUDA can be incrementally applied to existing applications. The CPU and GPU are treated as separate devices that have their own memory spaces. This configuration also allows simultaneous computation on the CPU and GPU without contention for memory resources.

At the time of writing, if you want to use CUDA 10 on a GPU (Tesla V100) -enabled POWER9 processor-based system, install the following software:
 A supported version of Linux with a gcc compiler and toolchain:
– Red Hat Enterprise Linux 7.5 IBM Power Little Endian, Kernel 4.14.0, GNU Compiler Collection (GCC) 4.8.5, GLIBC 2.17, The Portland Group, Inc. (PGI) 18.x, IBM XL C/C++ 13.1.x and 16.1.x, and Clang 6.0.0
– Ubuntu 18.04.1, Kernel 4.15.0, GCC 7.3.0, GLIBC 2.27, PGI 18.x, IBM XL C/C++ 13.1.x and 16.1.x, and Clang 6.0.0
 NVIDIA CUDA Toolkit

For Linux, the NVIDIA CUDA Toolkit major components are:
 Compilers: The CUDA-C and CUDA-C++ compilers.
 Development tools:
– IDEs: nsight
– Debuggers: cuda-memcheck and cuda-gdb
– Profilers: nvprof and nvvp
– Utilities: cuobjdump, nvdisasm, and gpu-library-advisor
 Scientific and utility libraries. For more information, see Toolkit release notes.
 CUDA driver.
 CUDA samples.
 Documentation.
 CUDA-GNU Debugger (GDB) sources.
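As a quick sanity check of a CUDA installation on a POWER9 node (a hedged sketch; the samples path and the compute capability flag sm_70 for Tesla V100 are assumptions that can vary with the CUDA version), verify GPU visibility and build one of the bundled samples:

$ nvidia-smi                                       # list the GPUs that the driver can see
$ cd /usr/local/cuda/samples/0_Simple/vectorAdd    # assumed default samples location
$ nvcc -arch=sm_70 vectorAdd.cu -o vectorAdd
$ ./vectorAdd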

Note: The NVIDIA driver is installed as part of the CUDA Toolkit installation. This driver is for development purposes and is not recommended for use in production with Tesla GPUs. For running CUDA applications in production with Tesla GPUs, download the latest driver for Tesla GPUs from NVIDIA driver downloads.

For more information about CUDA; NVIDIA GPUs; and documentation and installation guides, see NVIDIA.

3.3 Development environment software

This section describes the development environment software for the high-performance cluster solution.

3.3.1 IBM XL compilers

IBM Xpertise Library (XL) C/C++ and IBM XL Fortran compilers are advanced, high-performance compilers that can be used for developing complex, computationally intensive programs, including interlanguage calls with C and Fortran programs. These compilers are derived from a common code base that shares compiler function and optimization technologies for various platforms and programming languages. Programming environments include selected Linux distributions, IBM AIX, and other platforms. The common code base, along with compliance with international programming language standards, helps support consistent compiler performance and ease of code portability across multiple OSs and hardware platforms.

IBM XL compilers help you with:
 Debugging your applications
 Optimizing and tuning application performance
 Selecting language levels and extensions for compatibility with nonstandard features and behaviors that are supported by other Fortran compilers
 Performing many other common tasks that otherwise require changing the source code

Releases of IBM XL compilers for Linux provide full exploitation of the POWER9 processor-based technology.
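A hedged sketch of typical XL compiler invocations that target the POWER9 processor follows (the source file names are placeholders; the options shown are common XL optimization, architecture, and OpenMP flags, and the best combination is application dependent):

$ xlc_r -O3 -qarch=pwr9 -qtune=pwr9 -qsmp=omp -o mycode mycode.c
$ xlf_r -O3 -qarch=pwr9 -qtune=pwr9 -qsmp=omp -o mysim mysim.f90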

For more information about IBM XL Fortran, see IBM Knowledge Center.

For more information about IBM XL C/C++, see IBM Knowledge Center.

3.3.2 LLVM

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies. The name LLVM itself is not an acronym; it is the full name of the project.

LLVM has many subprojects, which are mentioned at the LLVM website, but one primary subproject is Clang.

Clang
This is an LLVM native C, C++, and Objective-C compiler. Clang delivers fast compiles (for example, about 3x faster than GCC when compiling Objective-C code in a debug configuration) and useful error and warning messages, and provides a platform for building source-level tools.

The Clang Static Analyzer is a tool that automatically finds bugs in your code, and is a great example of the sort of tool that can be built by using the Clang front end as a library to parse C and C++ code.

3.3.3 IBM Engineering and Scientific Subroutine Library

IBM Engineering and Scientific Subroutine Library (IBM ESSL) is a state-of-the-art collection of high-performance subroutines that provide a wide range of mathematical functions for many different scientific and engineering applications.

IBM ESSL is tuned for outstanding performance on selected clusters. It is supported on IBM POWER8® and POWER9 processor-based servers and can be used with Fortran, C, and C++ programs operating under the Linux OS.

All IBM ESSL mathematical subroutines are performance-optimized so that utilities can perform general-purpose functions. They are also optimized for nine computational areas, which are:
 Linear algebra subprograms
 Matrix operations
 Linear algebraic equations
 Eigensystem analysis
 Fourier transforms, convolutions and correlations, and related computations
 Sorting and searching
 Interpolation
 Numerical quadrature
 Random number generation

IBM ESSL has the following usability features:
 It has an easy-to-use call interface.
 If your existing application programs use the serial libraries, you only need to relink your program to take advantage of the increased performance of the IBM ESSL Symmetric Multiprocessor (SMP) Libraries (see the sketch after this list).
 It has informative error-handling capabilities, enabling you to calculate auxiliary storage sizes and transform lengths.
 It has online documentation that you can view by using an HTML document browser.
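A hedged sketch of relinking a Fortran program against the IBM ESSL SMP library on a POWER9 node (the program name and thread count are placeholders; verify the library and option names against the ESSL documentation for your level):

$ xlf_r -O3 -qarch=pwr9 -qtune=pwr9 myprog.f -o myprog -lesslsmp
$ export OMP_NUM_THREADS=<threads>   # sized per the SMT guidance in the note that follows
$ ./myprog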

Note: To obtain the preferred performance from POWER9 processor-based servers, most HPC applications that use matrix multiplication extensively must be run in simultaneous multithreading (SMT)-4 mode with four threads per core on POWER9 processor-based SMT-4 servers and in SMT-8 mode with eight threads per core on POWER9 processor-based SMT-8 servers.

For example, in most cases the performance of double precision general matrix multiplication (DGEMM) that is obtained from SMT-4 mode that uses the IBM ESSL SMP Library with four threads that are bound to each core is better than the performance that is obtained by using one thread that is bound to each core or by running the applications in ST mode or SMT-2 mode.

Here is an example environment for a 44-core system:
export OMP_PROC_BIND=TRUE
export OMP_WAIT_POLICY=active
export OMP_NUM_THREADS=176
export OMP_PLACES={0:176}

To obtain the preferred performance for the IBM POWER8 processor-based servers, most HPC applications that use matrix multiplication extensively must use one thread that is bound to each core.

For more information about IBM ESSL, see IBM Knowledge Center.

3.3.4 IBM Parallel Engineering and Scientific Subroutine Library

IBM Parallel Engineering and Scientific Subroutine Library (IBM Parallel ESSL) is a scalable mathematical subroutine library that supports parallel processing applications on clusters of processor nodes that are connected by a high-performance switch. IBM Parallel ESSL supports the Single Program Multiple Data (SPMD) programming model that uses the MPI library.

IBM Parallel ESSL provides subroutines in the following computational areas:
 Level 2 Parallel Basic Linear Algebra Subprograms (PBLAS)
 Level 3 PBLAS
 Linear algebraic equations
 Eigensystem analysis and singular value analysis
 Fourier transforms
 Random number generation

For communication, IBM Parallel ESSL includes the Basic Linear Algebra Communications Subprograms (BLACS), which use MPI. For computations, IBM Parallel ESSL uses the IBM ESSL subroutines.

The IBM Parallel ESSL subroutines can be called from 64-bit–environment application programs that are written in Fortran, C, and C++.

IBM Parallel ESSL best performs on clusters of Power AC922 servers that run with or without Tesla V100 with NVLink GPUs and dual-port InfiniBand EDR (ConnectX-5 and Peripheral Component Interconnect Express (PCIe) Gen4) adapters that are interconnected by using EDR InfiniBand switches.

IBM Parallel ESSL supports bare-metal installations of Red Hat Enterprise Linux 7.5 for Power Systems Little Endian.

For more information about IBM Parallel ESSL, see IBM Knowledge Center.

3.3.5 IBM Spectrum MPI

IBM Spectrum Message Passing Interface (IBM Spectrum MPI) is a high-performance, production-quality implementation of the MPI that supports the broadest range of industry standard platforms; interconnects; and OSs to help ensure that parallel applications can run on any platform with accelerated performance. IBM Spectrum MPI is one of the standards for developing scalable and parallel applications.

At the time of writing, IBM Spectrum MPI V10.2 was announced. It is based on OpenMPI V3.0 and implements the full MPI V3.1 standard. IBM Spectrum MPI incorporates advanced CPU affinity features, dynamic selection of network interface libraries, superior workload manager (WLM) integrations, improved performance, and an improved RDMA-capable Parallel Active Messaging Interface (PAMI) by using Mellanox OFED on POWER9 processor-based servers in Little Endian mode.

IBM Spectrum MPI also offers an improved collective MPI library that supports the seamless use of GPU memory buffers for the application developer. The library provides advanced logic to select the fastest algorithm of many implementations for each MPI collective operation.

IBM Spectrum MPI supports the following GPU-acceleration technologies:
 NVIDIA GPU Direct RDMA on POWER9 processor-based servers
 CUDA-aware MPI on IBM POWER8 and POWER9 processor-based servers

The IBM Spectrum MPI components are:
 Message passing and collective communication application programming interface (API) subroutine libraries
Libraries that contain subroutines that help application developers parallelize their code.
 Optimized Collectives Library (libcollectives)
An optimized collectives library with both one-sided and active messages collectives.
 Parallel Active Messaging Interface
A messaging API that supports point-to-point communications.
 IBM Spectrum MPI examples
A set of example programs are included in the product package.
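A hedged sketch of building and launching a small MPI job with IBM Spectrum MPI follows (the program names and rank count are placeholders; the -gpu option, which enables CUDA-aware transfers, applies only when the application passes GPU buffers to MPI calls):

$ mpicc -o hello hello.c        # Spectrum MPI compiler wrapper
$ mpirun -np 4 ./hello          # launch four ranks
$ mpirun -gpu -np 4 ./gpu_app   # enable CUDA-aware MPI for GPU buffers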

IBM Spectrum MPI is a member of the IBM Spectrum Computing family. For more information, see IBM Spectrum Computing.

To access the most recent IBM Spectrum MPI documentation, see IBM Knowledge Center.

3.3.6 IBM Parallel Performance Toolkit

IBM Parallel Performance Toolkit is a suite of performance-related tools and libraries that help with application tuning. This toolkit is an integrated environment for performance analysis of sequential and parallel applications that use the MPI and OpenMP paradigms.

The IBM Parallel Performance Toolkit run time is the server (back-end) component that provides the performance analysis tools. These tools perform the following functions:
 Provide access to hardware performance counters for performing low-level analysis of an application, including analyzing cache usage and floating-point performance.
 Profile and trace an MPI application for analyzing MPI communication patterns and performance problems.
 Profile an application for analyzing performance problems and to help you determine whether an application correctly structures its processing for best performance.
 Profile application I/O for analyzing an application’s I/O patterns and whether you can improve the application’s I/O performance.
 Profile and trace an Open Multi-Processing (OpenMP) application for analyzing OpenMP thread and OpenMP synchronization behavior to improve OpenMP performance.
 Profile an application’s execution for identifying hotspots in the application, and for locating relationships between functions in your application to help you better understand the application’s performance.

The IBM Parallel Performance Toolkit provides three types of interfaces:
 An API and shared libraries that are intended to be used by users to insert performance analysis calls into their applications.
 A command-line interface (CLI) that is intended to be used on the login nodes to instrument the user application (hpctInst).
 A GUI that is intended to be used on a client machine (desktop or notebook) to instrument and run the user application and visualize the trace files that are produced by back-end instrumentation components (hpctView and the plug-in for Eclipse Parallel Tools Platform (Eclipse PTP)).

With the IBM Parallel Performance Toolkit interfaces, users can do the following tasks:
 Collect performance data from an instrumented application by setting environment variables for the specific analysis tool and then starting the application from the command line.
 Modify the application to add calls to the hardware performance counter tool, rebuild your application linking with the hardware performance counter library, and then run the application to get hardware performance counter data.
 Link the application directly with the MPI profiling and trace library and then run the application to get MPI trace and profiling data.
 Link the application with the I/O profiling and trace library to get I/O profiling and trace information.
 Use a preinstalled OpenMP instrumentation library to get OpenMP profiling and trace information.

For more information about IBM Parallel Performance Toolkit, see IBM Knowledge Center.

3.4 Workload management

This section describes the workload management tools.

3.4.1 IBM Spectrum Load Sharing Facility

This is a powerful workload management platform for demanding and distributed HPC environments. IBM Spectrum Load Sharing Facility (IBM Spectrum LSF) provides a comprehensive set of intelligent and policy-driven scheduling features that help you use all of your compute infrastructure resources and ensure optimal application performance.

With IBM Spectrum LSF, you can run direct data staging jobs, which use a burst buffer (for example, an IBM Cluster Administration and Storage Tools (CAST) IBM Burst Buffer (BB)) instead of the cache to stage in and stage out data for data jobs.
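As a hedged sketch of a basic job submission with IBM Spectrum LSF (the queue name, slot count, and script name are placeholders; in the CSM-integrated setup that is described in this book, the submitted script runs on a launch node and typically starts its parallel job steps with the jsrun command, as described in 4.7, “Job launch overview”):

$ bsub -q normal -n 4 -o job.%J.out ./myjob.sh   # submit a script to the 'normal' queue with four slots
$ bjobs                                          # check job status
$ bkill <job_id>                                 # cancel a job if needed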

Important: A special version of IBM Spectrum LSF is used for the HPC type of clusters in this book, and it is different from the typical IBM Spectrum LSF product and IBM Spectrum LSF suite binary files. It is an RPM installation that provides IBM Spectrum LSF functions that are integrated with IBM Cluster Systems Management (CSM) APIs. Therefore, the typical IBM Spectrum LSF daemons do not run on compute nodes.

3.5 Cluster management software

This section describes the cluster management software that is in the high-performance cluster solution.

3.5.1 Extreme Cluster and Cloud Administration Toolkit

Extreme Cluster and Cloud Administration Toolkit (xCAT) offers a complete systems management solution to manage easily many servers for any type of technical computing workload. xCAT is known for exceptional scaling, a wide variety of supported hardware and OSs, virtualization platforms, and complete setup capabilities.

With xCAT, you can do the following tasks:
 Discover the hardware servers.
 Do remote system management.
 Provision OSs on physical (bare metal) or virtual machines (VMs).
 Provision machines in diskful (stateful) and diskless (stateless) states.
 Install and configure user applications.
 Manage parallel systems.
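A few representative xCAT commands illustrate these tasks (a hedged sketch; the node and OS image names are placeholders, and the available object definitions depend on your cluster configuration):

$ lsdef -t node                         # list the node objects that xCAT knows about
$ rpower <node> status                  # query power state out of band through the BMC
$ rcons <node>                          # open a remote console session
$ rinstall <node> osimage=<image_name>  # provision the specified OS image on the node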

For more information about xCAT, see xCAT or xCAT documentation.

3.5.2 Cluster Administration and Storage Tools

CAST can enhance the system management of cluster-wide resources. It consists of the open source tools CSM and BB.

Cluster Systems Management CSM is a systems management tool that is designed for simple and low-cost management of distributed and clustered IBM Power Systems servers.

CSM provides an integrated view of a large cluster and includes discovery and management of system resources; database integration; support of job launches; diagnostics; reliability, availability, and serviceability (RAS) events and actions; and health check.

CSM consists of several major components:
 An SQL database that is shared with xCAT that contains RAS events, job and job step information, system configuration and inventory information, and system and node health information.
 CSM daemons that run on the different node types and the communication infrastructure between these daemons.
 A set of commands that support the RAS query, diagnostics, and system query functions.
 A set of APIs that support the WLM, job step launcher, and BB API.

Integral to CSM is big data storage (BDS), which is required to store the environmental data and provides the capability to correlate RAS events, system environmental data, and job data.

Burst buffer A BB is a cost-effective mechanism that can improve I/O performance for many HPC applications without requiring intermediary hardware. A BB provides a fast storage tier between compute nodes and the traditional parallel file system so that overlapping jobs can stage in and stage out data for various checkpoint and restart, scratch volume, and extended memory I/O workloads.

Here is a summary of the services that are provided by the BB:
 Support for the creation and modifications of job-specific solid-state drive (SSD) file systems.
 Support for moving data between SSD file systems and permanent IBM Spectrum Scale storage.
 A command-line tool that supports the manipulation of BB file systems on all or a subset of the compute nodes in a job that is run from the launch or login node. These commands include normal file system operations, data movement, and control actions.
 An API that can be used by programs that do not have the privileges to control and monitor data movement between the SSD and IBM Spectrum Scale storage. It is intended to support operations like programmatic checkpoint and restart.
 A set of scripts that support the BB stage-in and stage-out phases of an IBM Spectrum LSF job.
 Support for a virtual parallel SSD file system by using IBM Burst Buffer Shared Checkpoint File System (BSCFS).

3.5.3 Mellanox Unified Fabric Manager

Mellanox Unified Fabric Manager (UFM) is a powerful platform for managing large scale-out InfiniBand and Ethernet computing environments. UFM enables data center operators to monitor, efficiently provision, and operate the modern data center fabric. UFM eliminates the complexity of fabric management, provides deep visibility into fabric health and traffic, and optimizes fabric performance.

UFM provides the following features:
 Eliminates fabric congestion and hot spots.
 Simplifies the management of large or complex environments.
 Automates service provisioning on the fabric layer.
 Seamlessly manages workload migration scenarios.
 Tunes your fabric interconnect for highest performance.
 Provides preventive maintenance and “soft degradation” alerts.
 Quickly troubleshoots any connectivity problem.
 Integrates and streamlines fabric information into your IT systems.

For more information about UFM, see Unified Fabric Manager.

3.6 IBM Storage and file systems

This section describes the storage solution and file system solution that is implemented for the clustered solution.

3.6.1 IBM Spectrum Scale

IBM Spectrum Scale is a cluster file system that provides concurrent access to a single file system or set of file systems from multiple nodes. The nodes can be SAN-attached; network-attached; a mixture of SAN-attached and network-attached; or in a shared-nothing cluster configuration, which enables high-performance access to this common set of data to support a scale-out solution or to provide a high availability platform.

IBM Spectrum Scale is software-defined storage (SDS) for high-performance, large-scale workloads on-premises or in the cloud. This scale-out storage solution provides file, object, and integrated data analytics for the following items:
 Compute clusters (technical computing and HPC)
 Big data and analytics
 Hadoop Distributed File System (HDFS)
 Private cloud
 Content repositories
 File Placement Optimizer (FPO)
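On a node that is part of an IBM Spectrum Scale cluster, a few administration commands give a quick view of the cluster and its file systems (a hedged sketch; the file system device name is a placeholder and the commands require root authority):

$ mmlscluster         # show the cluster name, nodes, and quorum configuration
$ mmlsfs all          # list the attributes of all configured file systems
$ mmdf <fs_device>    # report free and used space for one file system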

For more information about IBM Spectrum Scale, see IBM Spectrum Scale - Overview and IBM Knowledge Center.

3.6.2 IBM Elastic Storage Server

IBM Elastic Storage Server is an optimized disk storage solution that is bundled with IBM hardware and innovative IBM Spectrum Scale RAID technology (based on erasure coding). IBM Elastic Storage Server can be used to protect against hardware failure instead of using data replication, and it offers better storage efficiency than local storage.

It performs fast background disk rebuilds in minutes without affecting application performance. HDFS Transparency V2.7.0-1 and later allows Hadoop or Spark applications to access the data that is stored in IBM Elastic Storage Server.

For more information, see IBM Knowledge Center.


Chapter 4. Reference architecture

High-performance computing (HPC) systems are available in various sizes. These solutions can start with a few nodes and scale out to hundreds or even thousands of nodes.

This chapter describes CORAL-like driven HPC systems in two different sizes to illustrate the process of their design and scaling. The cases describe solutions for large clusters, and then describe medium-sized ones by highlighting features, differences, and functions consolidation.

The following topics are described in this chapter:
 Large HPC cluster architecture
 Medium HPC cluster architecture
 Software stack mapping
 Generic sizing of nodes
 Ethernet network layout
 InfiniBand network
 Job launch overview

4.1 Large HPC cluster architecture

HPC clusters of more than 180 nodes are considered large clusters. Figure 4-1 shows the different components with their quantities within a large HPC cluster. These quantities can vary based on the cluster’s size. The quantities that are shown can be applied to clusters with more than 4600 compute nodes.

Figure 4-1 Large HPC cluster components

This setup adds many enhancements to support better system utilization, including:
 Supports fast restart of a failed job.
 Supports better problem determination:
– A diagnostic infrastructure is shipped with the system.
– There is a reliability, availability, and serviceability (RAS) subsystem to help with fault isolation.
– Supports collecting environmental data and correlating it back to specific jobs.
 Jitter mitigation:
– Low-impact access to the file system.
– Overlap checkpoint data movement with execution with the burst buffer (BB) support.
– Daemon usage.
 Enhanced job reporting for correlation of the job with events that occur on the system.
 Better resource management:
– Interface for Power Systems management.
– Interface for the graphics processing units (GPUs).
 Local solid-state drive (SSD) access for faster scratch I/O.

These enhancements are supported by a collection of more than 60 IBM Cluster Systems Management (CSM) application programming interface (API) calls that support job allocation and tracking, RAS features, system environmental collection, and diagnostics support. These APIs support both privileged and unprivileged interfaces and are intended to enable the functions that are described in previous sections. Major functions include the following ones:
 Interface API:
– Allocation request.
– Job request:
• Nodes, and allocation ID.
– Job query request.
– RAS DB API interface.
– DIAG DB API interface.
– Job control:
• Send signal.
• Store job start information.
 Service agent requests:
– Collect job completion data.

One important thing is that this setup does not dictate a specific job scheduler. For example, the job scheduling system can be IBM Spectrum Load Sharing Facility (IBM Spectrum LSF), SLURM, or any other scheduler because CSM provides a scheduler API that any scheduler can use to manage jobs on the system and retrieve job and job step information.

Large clusters also impose several challenges regarding data storage and RAS handling, for example:
 The quantity of data that is stored in a central database can negatively affect system performance.
 The quality and type of the data that is collected might not be sufficient to manage the system effectively and provide the necessary diagnostic tools.

To overcome these challenges, the solution uses both an SQL database and a big data storage (BDS) to collect all system and RAS information.

The SQL database, which is managed by CSM, contains only information that is required to run the system, which includes the following information:
 Currently active job allocations, jobs, and job steps
 Historical job allocations, jobs, and job steps:
– Including summary statistics for jobs and job steps
 Node configuration and state for all nodes on the system:
– Solid-state drive (SSD) partition state and usage
– GPU state and usage
– Host Channel Adapter (HCA) state and usage
– Historical configuration data for each node
 InfiniBand switch configuration and state for all switches on the system and history
 InfiniBand cable information
 RAS events
 Diagnostic run status

The BDS keeps the data that is stored in the SQL database in near-real time. In addition to this data, the BDS stores the following information:
 Environment data on all nodes:
– Temperature
– Power
– GPU usage
– CPU usage
– Active memory
 Job details for each task in a job or job step
 Syslogs from all nodes on the system
 IBM GPFS™ RAS data
 Unified Fabric Manager (UFM) RAS data

Section 4.3, “Software stack mapping” on page 41 shows the software components of the HPC cluster and maps them to the hardware components on which they are running.

There are several distinguishing features of the CORAL-like HPC clusters, such as the I/O shipping capability; the SSD support for staging in and staging out data to the node; and the support structure for RAS, data collection, and job support.

Job support includes restricting access to the compute nodes to the user who has a currently active job. It also includes controlling the non-job related processes on the compute node.

To support these capabilities, there are a set of daemons that interoperate among themselves and support the user commands. These daemons use encrypted communication by using x.509 certificates. These daemons, along with Extreme Cluster and Cloud Administration Toolkit (xCAT) support, can provide the sole access to the system by unprivileged users.

Large HPC clusters are composed of the following node types:
 Management nodes
Up to two highly available management nodes, with xCAT as the core software building block for managing the cluster. The management nodes are responsible for the following functions:
– Administration, management, and monitoring of the system
– Installation of compute nodes
– Operating system (OS) distribution management and updates
– System configuration management
– Kit management
– Provisioning templates
– Stateless and stateful management
– User logon, compilation, and submission of jobs to the system

– Acting as a firewall to shield the system from external nodes and networks
– Acting as a server for many important services, such as DHCP, NFS, DNS, NTP (in small to medium clusters), and HTTP
Most of these functions are delegated to service nodes so that the management nodes are used by CSM. CSM and its APIs provide the following functions:
– CSM integrator:
• Integrates individual hardware components (POWER9 processors, NVIDIA, Mellanox, NVMe, and so on) into a cluster view to facilitate debugging and troubleshooting activities.
• Maintains a health list of compute nodes for the scheduler to use.
– Scalable cluster manager:
• Design and build thousands of compute nodes with a hierarchy architecture.
– Jitter-aware design by using the following functions:
• Lightweight daemons.
• Core isolation techniques.
– Data correlation:
• Manage the PostgreSQL database.
• Job accounting; job resources; metrics; hardware and software configurations; and logs.
 Service (aggregator) nodes
These are the subordinate servers operating under the management nodes to help system management reduce the load (CPU and network bandwidth). Service nodes are necessary when managing large clusters. There are up to 16 service nodes with diskful installation. Each compute node is connected to primary and backup service nodes, as shown in Figure 4-2 on page 39. These service nodes have connections to the baseboard management controller (BMC) for out-of-band control. In addition, there is also an in-band connection to daemons on the compute node.
 Workload management nodes
There are up to eight nodes with diskful installation. These nodes mainly have the job scheduler (IBM Spectrum LSF in this case) that supports the scheduler CSM APIs. The scheduler must support:
– Restart after soft failure
– Absolute priority scheduling
– Power management
– Job steps
 Login nodes
There are up to eight nodes with diskful installation. This is the place to write, edit, and compile the code; manage data; and submit jobs. Users do not start parallel jobs from a login node or run threaded jobs on a login node. Login nodes are shared resources that are in use by many users simultaneously.

 Launch nodes (LNs)
There are up to eight nodes with diskful installation, which are the home nodes for job launch. When a batch script (or interactive batch job) starts running, it runs on a LN. All commands within the job script (or the commands run in an interactive job) run on a LN. Like login nodes, these nodes are shared resources, so users must not run multiprocessor or multithreaded programs on LNs. You may run the jsrun command from LNs.
 Compute nodes
Most of the HPC cluster nodes are compute nodes with diskless installation, which are where users’ parallel jobs run. To access them, run the jsrun command.
 Time nodes
There are up to two highly available nodes with diskful installation. Time nodes synchronize the system clocks on all of the compute nodes by using the Precision Time Protocol (PTP), based on the IEEE 1588 standard1, to keep synchronization accuracy in the submicrosecond range with minimal network and local clock computing resources. Synchronization is done over an InfiniBand network for each component in the HPC cluster. InfiniBand network devices must be configured to provide the PTP protocol with the highest quality of service (QoS).
 Gateway nodes
There are up to two nodes with diskful installation for mounting, on an on-demand basis, cluster-external file systems to the compute nodes to copy data from or to the cluster environment.
 BDS nodes
There are up to eight nodes with diskful installation that form an Elasticsearch, Logstash, and Kibana (ELK) stack as a BDS that is integrated with Cluster Administration and Storage Tools (CAST) nodes by way of the csm-big-data RPM in the form of suggested configurations and support scripts. The recommended installation order of this ELK stack-based solution is:
a. Elasticsearch
A distributed analytics and search engine and the core component of the ELK stack. It ingests structured data (typically JSON or key value pairs) and stores the data in distributed index shards.
b. Kibana
An open source data visualization tool that is used in the ELK stack.
c. Logstash
An open source data processing pipeline that is used in the ELK stack. The core function of this service is to process unstructured data, typically syslogs, and then pass the newly structured text to the Elasticsearch service.
This installation order minimizes the likelihood of improperly ingested data being stored in Elasticsearch. A quick health check of the resulting Elasticsearch cluster is sketched after this list.
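A hedged sketch of that health check follows (the BDS node host name is a placeholder, and port 9200 is the Elasticsearch default, which a site can change):

$ curl -XGET 'http://<bds_node>:9200/_cluster/health?pretty'   # expect "status" : "green" when all shards are allocated
$ curl -XGET 'http://<bds_node>:9200/_cat/indices?v'           # list the indexes into which CAST data is ingested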

Note: The BB server is assumed to be a virtual machine (VM) on each IBM Elastic Storage Server I/O node. An alternative is to have it on a bare metal server.

1 https://standards.ieee.org/standard/1588-2008.html

Figure 4-2 demonstrates the logical and hierarchy infrastructure connectivity of CSM and xCAT. The management nodes manage the CSM PostgreSQL DB, and all compute nodes are connected to primary and backup service nodes.

Figure 4-2 The logical and hierarchy infrastructure connectivity of CSM and xCAT

Figure 4-3 demonstrates how CSM uses the lightweight daemons running on every node to integrate and correlate different logs, events, and information from cluster entities that are ingested into the BDS cluster and the CSM DB to get in-depth insights about the cluster.

Figure 4-3 Cluster Systems Management architecture

4.2 Medium HPC cluster architecture

Medium clusters are usually considered for clusters with 19 - 180 compute nodes. Depending on the HPC cluster size, different component functions can be consolidated. For example, a cluster with fewer than 80 nodes can easily work without service nodes. Also, gateway nodes are not necessary if there is no need to copy data from or to external computing environments or data lakes. Time server functions can also be implemented on the management nodes if there are less strict time synchronization requirements, which also means that you can use NTP instead of the PTP IEEE protocol.

Figure 4-4 illustrates generic component quantities of a medium-sized cluster, with highlighting of optional components based on a specific client case.

Figure 4-4 Medium HPC cluster components (system management nodes (2); optional service nodes (2); login nodes (up to 2); launch nodes (up to 2); WLM nodes (up to 2); optional time nodes (up to 2); optional gateway nodes (up to 2); compute nodes (n); optional BDS nodes (up to 2); external IBM ESS shared storage; and the access, management, service/BMC, and InfiniBand storage networks that connect to the existing client environment)

4.3 Software stack mapping

Figure 4-5 shows different software components within an HPC cluster and how they map to different logical components.

Notice the integration of the different BDS components among other computing node types. For example, the management nodes also run the Filebeat component and the login nodes run Logstash, while most of the BDS nodes form the Elasticsearch layer. This functional distribution can be consolidated by dedicating two nodes for Kibana, Logstash, and Filebeat while keeping the remainder of the BDS nodes in the Elasticsearch layer.

Figure 4-5 Software: node types mapping

4.4 Generic sizing of nodes

Figure 4-6 provides generic sizing of different node types that make up large clusters. The actual quantity and sizing of the nodes are client-scenario dependent. Workload type, job resource requirements, and the number of users are among the factors that contribute to proper sizing.

Figure 4-6 Generic sizing of different node types

4.5 Ethernet network layout

Figure 4-7 shows a typical Ethernet network layout of the different node types that participate in a large HPC cluster.

Figure 4-7 Typical large HPC cluster network connectivity

Figure 4-7 highlights the three layers of network switches:
1. Core layer
2. Aggregation layer
3. Edge layer, which includes the management level

Depending on the scale of the cluster, a switch that can be used for both core and aggregation levels is the IBM Switch (8831-S48), which is a Mellanox sourced Ethernet switch with the Mellanox part number MSX1410-BB2F2. This switch provides 48x 10 GbE + 12x 40 GbE ports.

For the edge layer, you can use the IBM Switch (8831-S52), which is a Mellanox sourced Ethernet switch. It provides 48x 1 GbE RJ45 + 4x 1/10 GbE SFP+ ports.

For more information about IBM and Mellanox joint solutions, see IBM and Mellanox Solutions.

4.6 InfiniBand network

InfiniBand is the predominant interconnect technology in the HPC market. InfiniBand has many characteristics that make it ideal for HPC:
 Low latency and high throughput
 Remote Direct Memory Access (RDMA)
 A flat Layer 2 that scales out to thousands of end points
 QoS
 Centralized management
 Multi-pathing
 Support for multiple topologies
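A hedged sketch of verifying an InfiniBand fabric from any node with the standard OFED diagnostic utilities (the output depends on the fabric size and the installed OFED version):

$ ibnetdiscover   # walk the subnet and print the discovered topology
$ iblinkinfo      # show the state, width, and speed of every link
$ ibdiagnet       # run a fabric-wide diagnostic and report errors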

The following section illustrates how to design an HPC cluster by using a Mellanox InfiniBand interconnect solution.

4.6.1 InfiniBand network topologies

There are several common topologies for an InfiniBand fabric. Here are some of those topologies:
 Fat tree
A multi-root tree. This is the most popular topology.
 2D/3D mesh
Each node is connected to four or six other nodes: positive, negative, X axis, and Y axis.
 2D/3D torus
The X, Y, and Z ends of the 2D or 3D meshes are wrapped around and connected to the first node.

Other topologies also exist, such as dragonfly and hypercube. This section focuses on fat tree, which has optimized performance and scalability.

4.6.2 Fat tree topology

The most widely used topology in HPC clusters is the fat-tree topology. This topology typically enables the best performance at a large scale when configured as a non-blocking network. Where over-subscription of the network is tolerable, it is possible to configure the cluster in a blocking configuration.

A fat-tree cluster typically uses the same bandwidth for all links, and in most cases it uses the same number of ports in all of the switches.

The switches on the upper levels are called spine switches. The switches on the lowest level are called leaf switches, and the nodes are always connected to them. The smallest fat-tree topology has at least two levels.

Figure 4-8 illustrates a two-level, non-blocking fat-tree topology cluster with 36 port switches as a leaf (L1) layer.

Figure 4-8 Two levels non-blocking fat-tree topology cluster

Fat tree cluster design rules
The following rules must be adhered to while building a fat-tree cluster2:
 Non-blocking clusters must be balanced. The same number of links must connect a Level 2 (L2) switch to every Level 1 (L1) switch. Whether over-subscription is possible depends on the HPC application and the network requirements.
 If the L2 switch is a director switch (that is, a switch with leaf and spine cards), all L1 switch links to an L2 switch must be evenly distributed among leaf cards. For example, if six links run between an L1 and L2 switch, they can be distributed to leaf cards as 1:1:1:1:1:1, 2:2:2, 3:3, or 6. They must never be mixed, for example, 4:2 or 5:1.
 Do not create routes that must traverse up, back down, and then up the tree again. This creates a situation that is called credit loops, which manifests itself as traffic deadlocks in the cluster. In general, there is no way to avoid credit loops. Any fat tree with multiple directors plus edge switches has physical loops that are avoided by using a routing algorithm such as up-down.
 Try to always use 36-port switches for L1 and director class switches for L2.

2 Designing an HPC Cluster with Mellanox InfiniBand Solutions

Consider using the Mellanox InfiniBand Cluster Configurator because it helps configure clusters based on a fat tree topology with two levels of switch systems.

Note: For fat trees larger than 648 nodes, the HPC cluster must at least have three levels of hierarchy.

At the time of writing, the following documents provide the most recent Mellanox InfiniBand interconnect guides:
 Designing an HPC Cluster with Mellanox InfiniBand Solutions
 Understanding Up/Down InfiniBand Routing Algorithm
 How to Prevent InfiniBand Credit Loops
 Recommendations for Inter-Switch Cabling
 Cabling Considerations for Clos-5 Networks

4.7 Job launch overview

Note: For more information about IBM Spectrum LSF Job Step Manager (JSM) Version 10.2, see IBM Knowledge Center.

Figure 4-9 on page 47 provides an overview of the users’ job launch and the sequence of events in relationship to different software and hardware stacks.

Figure 4-9 Users’ job launch overview and events sequence

There are three components that are involved in the job launch:
1. The workload manager (WLM) (IBM Spectrum LSF)
2. The CSM
3. The Message Passing Interface (MPI) Runtime Environment (RTE)

The dashed lines in Figure 4-9 indicate optional operations.

After IBM Spectrum LSF decides to start a job by using the CSM infrastructure, it calls csm_allocation_create to reserve nodes for a job, run the job prolog, create the Job Step Manager (JSM) (PMIx) server, and launch a user script or executable file on the LN. As part of the CSM allocation, the CSM database is updated to include the job start information. Then, the compute nodes are put into job run mode and are accessible to the job owner.

The jsm daemon creates a job step launch tree on all the nodes that are determined by a csm_allocation_query call. This job step tree is created on behalf of the job that is allocated with csm_allocation_create. Then, user job steps can be launched by the user with the jsrun command. This command communicates with the jsm daemon on the LN through a UNIX socket by passing job step requests. As part of the job step start and end, IBM JSM calls CSM to record step-specific information in the CSM database. After the last job step completes, the user script or executable file that is launched by IBM Spectrum LSF exits and IBM Spectrum LSF stops the jsm daemon. IBM Spectrum LSF then releases resources after calling the job epilog with a call to csm_allocation_delete. Then, the job resource usage is stored in the CSM database; the job is complete; and the node resources can be reused.
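A hedged sketch of launching a job step with jsrun from inside an allocated job script (the executable name and resource-set sizing are placeholders; the option letters shown are the common JSM short forms for resource sets, tasks, CPUs, and GPUs, and the right values depend on the node configuration and the application):

$ jsrun -n 4 -a 1 -c 7 -g 1 ./my_gpu_app   # four resource sets, each with one MPI task, seven CPU cores, and one GPU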

Important: This architecture provides the best performance and cluster utilization efficiency for jobs that require high parallelism and high CPU and GPU resources, such as jobs that require at least the full resources of the Power AC922 node.

Important: Users can include multiple independent or interrelated job steps within the same user submitted job script.

Part 2

Part 2 Deployment

This part describes how to deploy the nodes and software in a cluster. It also describes the Cluster Administration and Storage Tools (CAST) to manage the solution.

The following chapters are included in this part:
 Nodes and software deployment
 Cluster Administration and Storage Tools


Chapter 5. Nodes and software deployment

This chapter provides a comprehensive description about how to deploy the nodes that integrate into the high-performance computing (HPC) solution.

This chapter also introduces the Extreme Cluster and Cloud Administration Toolkit (xCAT) installation in the Power System AC922 server and the images that are created that contain the HPC software stack. The cluster runs Red Hat Enterprise Linux 7.5 ppc64 Little Endian diskful image on the management node and diskless image on the compute nodes. Both are bare-metal, non-virtualized installations.

The following topics are described in this chapter:
 Deployment overview
 System management
 xCAT deployment overview
 Initial xCAT management node installation on an IBM Power System LC922 server
 xCAT node discovery
 xCAT compute nodes (stateless)
 xCAT login nodes (stateful)

5.1 Deployment overview

The HPC clustered solution that is described in this book is based on the CORAL reference architecture and the fastest supercomputer as of the time of writing1. The base components of this cluster are the Power AC922 server, Red Hat Enterprise Linux 7.5 Little Endian, Mellanox high-speed InfiniBand networking, and the NVIDIA graphics processing unit (GPU) and NVLink technology that is described in Chapter 2, “IBM Power System AC922 server for HPC overview” on page 7 and in Chapter 3, “Software stack” on page 19.

On top of the base foundation, this chapter focuses on the installation of xCAT, a tool that is known for node deployment and administration. This chapter explains in detail how the images for the compute nodes and the other type of nodes are created and installed.

5.2 System management

The Power AC922 system is based on the OpenBMC platform, which supports out-of-band management through the Intelligent Platform Management Interface (IPMI) tool and the OpenBMC tool. Both of these utilities provide similar management capabilities, and this chapter describes each of them with examples.

5.2.1 The Intelligent Platform Management Interface tool

The IPMI tool is an open source command-line interface (CLI) that you use for managing and controlling OpenBMC based systems. The examples in this chapter are based on Version 1.8.18 and later.

Here are the most common options and commands of the ipmitool command. The format of the command is:
ipmitool [options...] <command>
 Options:
– -H <bmc_ip>: The host name or IP address of the baseboard management controller (BMC) (only one is needed).
– -P <password>: The remote BMC password. (The default in POWER9 processor-based servers is 0penBMC.)
– -I <interface>: The interface type of the connection. (As a preferred practice, use lanplus as the interface for all the ipmitool commands against the BMC.)
 Commands:
– chassis: Retrieves the chassis status and sets the power states.
– sol: Facilitates Serial-over-LAN (SOL) connections.
– mc: Interacts with the management controller.
– lan: Configures the network interface.

1 https://www.top500.org/

In Example 5-1, the ipmitool command is used to get the power status of the chassis. The command is then followed by a command that powers on the system and checks the power status again.

Example 5-1 The ipmitool command: example commands
$ ipmitool -H <bmc_ip> -P 0penBmc -I lanplus chassis status
System Power : off
Power Overload : false
Power Interlock : inactive
Main Power Fault : false
Power Control Fault : false
Power Restore Policy : always-off
Last Power Event :
Chassis Intrusion : inactive
Front-Panel Lockout : inactive
Drive Fault : false
Cooling/Fan Fault : false
Front Panel Control : none

$ ipmitool -H <bmc_ip> -P 0penBmc -I lanplus chassis power on
Chassis Power Control: Up/On

$ ipmitool -H <bmc_ip> -P 0penBmc -I lanplus chassis power status
Chassis Power is on

With the sol command, you can establish a connection to the system serial port to view a node, as shown in Example 5-2. It is a useful option when you install or debug a node, or there is not a network configuration on the node to establish an SSH connection.

Example 5-2 The sol command options
$ ipmitool -H <bmc_ip> -P 0penBmc -I lanplus sol activate
Info: SOL payload already active on another session
$ ipmitool -H <bmc_ip> -U ADMIN -P admin -I lanplus sol deactivate
$ ipmitool -H <bmc_ip> -U ADMIN -P admin -I lanplus sol activate
[SOL Session operational. Use ~? for help]
p9r1n1 login:

Use the mc command to perform management console functions. Typically, you use this command only when the BMC is no longer responsive to commands (such as chassis).

Example 5-3 shows the most common use of this command to reset the BMC.

Example 5-3 The mc reset command
$ ipmitool -H <bmc_ip> -U ADMIN -P admin mc reset cold
Sent cold reset command to MC

You can use the lan command to configure the BMC network interface (although it is assumed that the current IP address is known and is passed as part of the -H option).

The following important options can be included:
 Display the BMC Ethernet Port network configuration:
$ ipmitool lan print 1
 Set the BMC Ethernet Port for Dynamic Host Configuration Protocol (DHCP) IP address:
$ ipmitool lan set 1 ipsrc dhcp
 Set the BMC Ethernet Port for static IP address:
$ ipmitool lan set 1 ipsrc static
$ ipmitool lan set 1 ipaddr a.b.c.d
$ ipmitool lan set 1 netmask e.f.g.h
$ ipmitool lan set 1 defgw i.j.k.l

5.2.2 OpenBMC

OpenBMC is an alternative to the ipmitool command. OpenBMC offers a common structure and set of options to manage and administer the Power AC922 server. OpenBMC also offers a GUI version of the tool, which you can access by going to https://<BMC_ip_address> in a web browser.

For more information about downloading and installing OpenBMC, see OpenBMC.

Note: This book does not cover installation of the OpenBMC tool; it focuses only on the CLI instance because the HPC cluster general administration is based on only the CLI.

Here are the OpenBMC top-level options:
-H: Host name or IP address of the BMC.
-U: User name that is used to log in.
-A: Provides a prompt to ask for the password.
-P: Password for the user name.
-j: Changes the output format to JSON.
-t: Location of the policy table to use.
-T: Provides time statistics for logging in, running the command, and logging out.
-V: Displays the current version of the OpenBMC tool.

Here are some examples of commands. For a thorough list of options, go to OpenBMC.
 To check the power status of the system, run the following command:
$ openbmctool -U <username> -P <password> -H <bmc_ip> chassis power status
 To power on the system, run the following command:
$ openbmctool -U <username> -P <password> -H <bmc_ip> chassis power on
 To power off the system normally, run the following command:
$ openbmctool -U <username> -P <password> -H <bmc_ip> chassis power softoff
 To power off the system immediately, run the following command:
$ openbmctool -U <username> -P <password> -H <bmc_ip> chassis power hardoff

 To do a warm reset of the BMC remotely and without an AC cycle, run the following command:
$ openbmctool -U <username> -P <password> -H <bmc_ip> bmc reset warm
 To do a cold reset of the BMC remotely and without an AC cycle, run the following command:
$ openbmctool -U <username> -P <password> -H <bmc_ip> bmc reset cold

5.2.3 Firmware update

The OpenBMC tool can perform firmware updates through both the GUI and CLI. To perform a CLI BMC firmware update on the Power AC922 server, complete the steps in Example 5-4.

Note: The commands that are listed in Example 5-4 run on a Linux client node with connectivity to the cluster nodes; OpenBMC is installed on the Power AC922 server, and all openbmctool commands are run from this client node. The cat commands at the beginning and ending of the example are run from the cluster node being upgraded by using ssh to its BMC IP address. The $ sign denotes commands that run in the client node, and the # sign denotes commands that run inside the BMC operating system (OS).

The BMC password is 0penBMC with a “zero” instead of the letter “O” in all the examples in this book.

Example 5-4 OpenBMC firmware update CLI sequence $ ssh [email protected] # cat /var/lib/phosphor-software-manager/pnor/ro/VERSION IBM-witherspoon-ibm-OP9-v2.0-2.14-prod op-build-v2.0-11-gb248194 buildroot-2018.02.1-6-ga8d1126 skiboot-v6.0.1 hostboot-8ab6717d-p110cb65 occ-77bb5e6 linux-4.16.8-openpower2-p6a14c7f petitboot-v1.7.1-p50a5645 machine-xml-7cd20a6 hostboot-binaries-276bb70 capp-ucode-p9-dd2-v4 sbe-a596975 hcode-b8173e8 # exit $ openbmctool -H 9.108.97.153 -U root -P 0penBmc chassis power softoff Attempting login... Attempting to Power off gracefully...: { "data": null, "message": "200 OK", "status": "ok" } User root has been logged out

$ openbmctool -H 9.108.97.153 -U root -P 0penBmc chassis power status

Chapter 5. Nodes and software deployment 55 Attempting login... Chassis Power State: Off Host Power State: Off BMC Power State: Ready User root has been logged out

$ openbmctool -H 9.108.97.153 -U root -P 0penBmc firmware flash bmc -f obmc-witherspoon-ibm-v2.1.ubi.mtd.tar Attempting login... This image has been found on the bmc. Activating image: 376af621 Firmware flash and activation completed. Please reboot the bmc and then boot the host OS for the changes to take effect. User root has been logged out

$ openbmctool -H 9.108.97.153 -U root -P 0penBmc fru print | grep Activ "Activation": "xyz.openbmc_project.Software.Activation.Activations.Active", "RequestedActivation": "xyz.openbmc_project.Software.Activation.RequestedActivations.None", "Activation": "xyz.openbmc_project.Software.Activation.Activations.Active", "RequestedActivation": "xyz.openbmc_project.Software.Activation.RequestedActivations.None", "Activation": "xyz.openbmc_project.Software.Activation.Activations.Active", "RequestedActivation": "xyz.openbmc_project.Software.Activation.RequestedActivations.None", "Activation": "xyz.openbmc_project.Software.Activation.Activations.Active", "RequestedActivation": "xyz.openbmc_project.Software.Activation.RequestedActivations.None",

$ openbmctool -H 9.108.97.153 -U root -P 0penBmc firmware flash pnor -f witherspoon-IBM-OP9-v2.0.8-2.7_prod.pnor.squashfs.tar Attempting login... This image has been found on the bmc. Activating image: 841024c1 Firmware flash and activation completed. Please reboot the bmc and then boot the host OS for the changes to take effect. User root has been logged out

$ openbmctool -H 9.108.97.153 -U root -P 0penBmc fru print | grep Activ "Activation": "xyz.openbmc_project.Software.Activation.Activations.Active", "RequestedActivation": "xyz.openbmc_project.Software.Activation.RequestedActivations.None", "Activation": "xyz.openbmc_project.Software.Activation.Activations.Active", "RequestedActivation": "xyz.openbmc_project.Software.Activation.RequestedActivations.None", "Activation": "xyz.openbmc_project.Software.Activation.Activations.Active", "RequestedActivation": "xyz.openbmc_project.Software.Activation.RequestedActivations.None", "Activation": "xyz.openbmc_project.Software.Activation.Activations.Active", "RequestedActivation": "xyz.openbmc_project.Software.Activation.RequestedActivations.None",

$ openbmctool -H 9.108.97.153 -U root -P 0penBmc mc reset cold Attempting login... Attempting to reboot the BMC...: {

"data": null, "message": "200 OK", "status": "ok" } User root has been logged out

$ openbmctool -H 9.108.97.153 -U root -P 0penBmc chassis power status Attempting login... Chassis Power State: Off Host Power State: Off BMC Power State: Ready User root has been logged out

$ openbmctool -H 9.108.97.153 -U root -P 0penBmc chassis power on Attempting login... Attempting to Power on...: { "data": null, "message": "200 OK", "status": "ok" } User root has been logged out $ ssh [email protected] # cat /var/lib/phosphor-software-manager/pnor/ro/VERSION IBM-witherspoon-ibm-OP9-v2.0.8-2.7-prod op-build-v2.0.8-2-gf54aed9 buildroot-2018.02.1-6-ga8d1126 skiboot-v6.0.8 hostboot-d033213-pf8eafb0 occ-084756c linux-4.16.13-openpower1-pcc4b089 petitboot-v1.7.2-p26e7ade machine-xml-7cd20a6 hostboot-binaries-hw080418a.op920 capp-ucode-p9-dd2-v4 sbe-55d6eb2 hcode-hw082318a.op920

5.2.4 Boot order configuration

The Petitboot bootloader can automatically boot (or autoboot) from several types of devices in a certain order (that is, falling back to later devices if earlier devices cannot be used to perform a boot). This scenario requires a specific configuration of the device boot order for Genesis-based node discovery and diskful installation. These tasks are described in 5.3, “xCAT deployment overview” on page 59.

Petitboot must first attempt to boot from the network devices by using DHCP. If the network devices are not available, Petitboot attempts to boot from the disk devices. By following this order, the node can obtain its network and boot configuration from the xCAT management node (for example, the Genesis image or diskless installation) or fall back to boot an OS from disk if the network boot is not specified (for example, diskful installation).

To configure the boot order in the Petitboot bootloader, complete the following steps:
1. Power on the system by running the following command:
   $ ipmitool -I lanplus -H <bmc_ip> -P <password> chassis power on
2. Open the console by running the following command:
   $ ipmitool -I lanplus -H <bmc_ip> -P <password> sol activate
3. Wait for the Petitboot panel, as shown in Example 5-5.

Example 5-5 Petitboot panel
Petitboot (v1.2.6-8fa93f2)                          8335-GTH 0000000000000000
  System information
  System configuration
  Language
  Rescan devices
  Retrieve config from URL
  *Exit to shell
Enter=accept, e=edit, n=new, x=exit, l=language, h=help
Welcome to Petitboot

4. Configure the boot order for Petitboot. In the Petitboot panel, select System configuration. 5. In the Petitboot System Configuration panel (see Example 5-6), complete the following steps.

Example 5-6 Petitboot System Configuration panel: DHCP on a specific interface setting Petitboot System Configuration Boot Order: (None) [ Add Device ] [ Clear & Boot Any ] [ Clear ]

Network: ( ) DHCP on all active interfaces (*) DHCP on a specific interface ( ) Static IP configuration

Device: (*) enP5p1s0f0 [70:e2:84:14:d6:81, link up] ( ) enP5p1s0f1 [70:e2:84:14:d6:82, link up]

DNS Server(s): ______(eg. 192.168.0.2) (if not provided by DHCP server) Disk R/W: ( ) Prevent all writes to disk (*) Allow bootloader scripts to modify disks [ OK ] [ Help ] [ Cancel ] tab=next, shift+tab=previous, x=exit, h=help

a. In the Boot order section, complete the following steps: i. Select Clear. ii. Select Add Device.

iii. In the Select a boot device to add panel (see Example 5-7), select Any Network device and click OK.

Example 5-7 Select a boot device to add panel Select a boot device to add ( ) net: enP1p3s0f0 [mac: 98:be:94:59:fa:24] ( ) net: enP1p3s0f1 [mac: 98:be:94:59:fa:25] (*) Any Network device ( ) Any Disk device ( ) Any USB device ( ) Any CD/DVD device ( ) Any Device [ OK ] [ Cancel ] tab=next, shift+tab=previous, x=exit, h=help

iv. Select Add Device again. v. In the Select a boot device to add panel, select Any Disk device, and click OK. vi. Verify that the Boot order section is identical to Example 5-8.

Example 5-8 Petitboot system configuration panel: Boot order configuration Petitboot System Configuration Boot Order: (0) Any Network device (1) Any Disk device [ Add Device ] [ Clear & Boot Any ] [ Clear ] <...>

b. In the Network section, select the DHCP on a specific interface option.

Note: Selecting the DHCP on all active interfaces option might slow the boot process unnecessarily if multiple network ports are cabled and active.

c. In the Device section, select the network interface for accessing the xCAT management node (if you selected DHCP on a specific interface in the network section). d. Click OK.

On the next boot, the Petitboot bootloader automatically attempts to boot from the network and disk devices in the specified order. On the current boot, no automatic boot attempt is made because of the user intervention.

5.3 xCAT deployment overview

This section provides an overview of the xCAT deployment and installation. It describes the management node and deploying the cluster nodes. xCAT (described in 3.5.1, “Extreme Cluster and Cloud Administration Toolkit” on page 28) is an open source tool to deploy and administer an HPC cluster. xCAT is continuously updated with new features and enhancements. For more information, see xCAT download and xCAT documentation.

5.3.1 xCAT database: Objects and tables

The xCAT database contains all the information that relates to the cluster. This information is defined by an administrator or automatically obtained from the cluster, such as nodes, networks, services, and configurations for any services that are used by xCAT (for example, DNS, DHCP, HTTP, and NFS).

All data is stored in tables and can be manipulated directly as tables (for example, nodelist, networks, nodegroup, osimage (operating system image), or site) or logically as objects (or object definitions) of certain types (for example, node, network, group, osimage, or site) by running the following commands:
- Objects:
  – View: lsdef
  – Create: mkdef
  – Change: chdef
  – Remove: rmdef
- Tables:
  – View: tabdump
  – Edit: tabedit or chtab

On certain tables or object attributes (typically on per-node attributes), you can use regular expressions to set an attribute’s value according to the name of the respective object (for example, the IP address of compute nodes).

The xCAT database is stored in a SQLite database by default. Open source database engines, such as MySQL, MariaDB, and PostgreSQL, are also supported for large clusters.

For more information, see the xCAT database manual page by running the following command: $ man 5 xcatdb
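
As a brief illustration of the two views of the same data, the following sketch lists a node as an object, changes one of its per-node attributes, and then dumps one of the underlying tables. The node name p9r1n1 and the IP address are placeholders for this example:

$ lsdef p9r1n1
$ chdef p9r1n1 ip=10.1.1.1
$ tabdump hosts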

5.3.2 xCAT node booting

The xCAT management node can control the boot method (or device) of the compute nodes by using the following methods:
- Change the temporary boot device configuration of the bootloader by using IPMI.
- Change the network boot configuration that is provided to the bootloader by using DHCP.
- Do not provide a network boot configuration to the bootloader so that it boots from devices other than network adapters.

The methods that are based in the network boot configuration require the correct automatic boot (or autoboot) configuration in the bootloader, and dynamic configuration of DHCP and Trivial File Transfer Protocol (TFTP) servers with general and per-node settings.

New or undiscovered nodes can be booted into node discovery, and known or discovered nodes into arbitrary stages (for example, OS provisioning, installed OS, basic shell environment, or node discovery again). For that purpose, the nodes’ bootloader must be configured to boot automatically with the following device order:
- From the network interface on the xCAT management network (to obtain the network and boot configuration by using DHCP/TFTP from the xCAT management node)

- From local disks

For more information, see 5.2.4, “Boot order configuration” on page 57.

On that foundation, any boot image can be specified by the xCAT management node through DHCP to a node, retrieved by the node through TFTP, and booted. If no boot image is specified, the node proceeds to boot from disk, which can contain an OS installation that is provisioned.
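
For example, assuming a node object named p9r1n1 is already defined, the standard xCAT hardware-control commands below illustrate the first method (changing the temporary boot device through IPMI); the node name is a placeholder for this sketch:

$ rsetboot p9r1n1 net
$ rpower p9r1n1 boot
$ rsetboot p9r1n1 stat

The first command requests a network boot on the next start, the second restarts (or powers on) the node, and the third shows the temporary boot device that is currently set.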

5.3.3 xCAT node discovery

The xCAT node discovery (or node discovery process) consists of booting a new (or undiscovered) node so that the management node can identify (or discover) the node by using a specific method.

For example, during boot, the node can be offered a special-purpose boot image by the management node. This process is called Genesis (basically, a kernel and a custom initial RAM file system, or initramfs). Genesis collects identification and hardware information about the node, informs the management node about it, and waits for instructions. The management node can then configure and control the discovered node (for example, based on object definitions).

The following node discovery methods are available in xCAT, ranging from more manual to more automatic:
- Manual node definition (not really “discovery”)
- Machine Type and Model and Serial number-based (MTMS) discovery (adopted in this chapter)
- Sequential discovery
- Switch-based discovery

For more information about node discovery methods, see xCAT Hardware Discovery & Define Node.

The MTMS-based discovery can be summarized as the following sequence:
1. The new or undiscovered node is powered on, and the bootloader requests a network address and boot configuration.
2. The xCAT management node does not recognize the node, that is, the Media Access Control (MAC) address in the request. It provides the node with a temporary network address and pointers to the node discovery boot image.
3. The node applies the network configuration, downloads the node discovery boot image, and boots it.
4. The boot image collects hardware information (for example, the system’s machine type and model, and serial number; the network interfaces’ MAC address; and processor and memory information) and reports it back to the xCAT management node.
5. The xCAT management node attempts to match part of the reported hardware information to a node object definition (currently, the system’s machine type and model, and serial number).
6. If a match is identified, the xCAT management node then configures that node according to its object definition (for example, network configuration and next stages to perform) and updates that object definition with any new hardware information.

The node can then respond to in-band or OS-level management operations through SSH on its IP address (which is available on the running Genesis image).

After node discovery, the new or undiscovered node becomes a known or discovered node and supports in-band operations (for example, OS provisioning, software installation, and configuration) from the xCAT management node.
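
Before the MTMS-based discovery is started, the target node needs only a minimal object definition that carries the matching and configuration data. The following sketch is one possible definition; the node name, machine type and model, serial number, and IP addresses are assumptions for this example:

$ mkdef -t node p9r1n1 groups=all,ac922 \
  mtm=8335-GTH serial=ABC1234 \
  ip=10.1.1.1 bmc=10.2.1.1 bmcusername=root bmcpassword=0penBmc \
  mgt=ipmi cons=ipmi netboot=petitboot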

5.3.4 xCAT baseboard management controller discovery

The xCAT BMC discovery occurs automatically after node discovery. It can be summarized by the following sequence:
1. The xCAT management node attempts to match the node object definition to a temporary BMC object definition (with MTMS information) for new or undiscovered BMCs.
2. If a match is identified, the xCAT management node configures the BMC according to the BMC attributes in the node object definition (for example, network configuration) and removes the temporary BMC object.
3. The node then can respond to out-of-band management operations through IPMI or OpenBMC on its BMC IP address.
4. The BMC becomes a discovered BMC, and the corresponding node supports out-of-band operations (for example, power control and monitoring) by using its object definition in the xCAT management node.
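
After both discoveries complete, the out-of-band path can be verified from the management node. This is a minimal check, assuming the same hypothetical node name p9r1n1:

$ lsdef p9r1n1 -i bmc,bmcusername,mgt
$ rpower p9r1n1 status
$ rcons p9r1n1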

5.3.5 xCAT installation types: Disks and state

The xCAT management node can support the following methods for provisioning an OS and providing persistent state (data) to the compute nodes according to the availability of disks and persistency requirements:
- Diskful and stateful: The OS is installed to disk and loaded from disk. Changes are written to disk (persistent).
- Diskless and stateless: The OS is installed by using a different and contained method in the management node and loaded from the network. Changes are written to memory and discarded (not persistent).
- Diskless and statelite: An intermediary type between stateless and stateful. The OS is installed by using a different and contained method in the management node and loaded from the network. Changes are written to memory and can be stored (persistent) or discarded (not persistent).

For more information about OS provisioning and state persistency, see xCAT IBM Power LE / OpenPOWER.
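
The provisioning type is ultimately selected through the osimage definition that is assigned to a node; the provmethod attribute of each osimage records whether it is an install (diskful), netboot (diskless), or statelite image. The following sketch only lists the osimage objects and that attribute on the management node:

$ lsdef -t osimage
$ lsdef -t osimage -i provmethod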

5.3.6 xCAT network interfaces: Primary and additional

In xCAT terminology, a network adapter or interface in a node can be primary or additional (also known as secondary). Only one primary network adapter exists, and zero or more additional network adapters can exist.

The primary network interface of a node connects to the xCAT management network (that is, to the management node), and is used to provision, boot, and manage that node.

An additional network interface connects to an xCAT network other than the xCAT management network and xCAT service network. Therefore, it is used by xCAT application networks, xCAT site networks or public networks, or for other purposes.

For more information, see xCAT Configure Additional Network Interfaces - confignics.

Note: The Power AC922 server that is used for HPC shares the physical port for the BMC and the primary network interface card (NIC) of the node. The BMC and primary NIC on each node have independent IP addresses on the same (management) network but in different subnets.

5.3.7 xCAT software kits

xCAT supports a software package format that is called xCAT Software Kits (xCAT kits or kits) that is tailored for installing software in xCAT clusters. Kits are used with some products of the IBM HPC software stack.

An xCAT software kit can include a software product’s distribution packages, configuration information, scripts, and other files. It also includes xCAT-specific information for installing the appropriate product pieces, which are referred to as the kit components, to a particular node according to its environment and role in the xCAT cluster.

The kits are distributed as tarballs (files with the .tar extension). A kit can be either complete or incomplete (also known as partial), which indicates whether it contains the product’s distribution packages (complete) or not (incomplete or partial).

Incomplete or partial kits are indicated by file names that contain the NEED_PRODUCT_PKGS string. They can be converted to complete kits when combined with the product’s distribution packages, and they are distributed separately from the product’s distribution media.

After a complete kit is added to the xCAT management node, its kit components can be assigned or installed to new and existing OS installations.
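
As a sketch of that workflow (the file names and directory are placeholders, not a specific product), a partial kit is typically completed with the buildkit command, added to the management node with addkit, and its components are then assigned to an OS image with addkitcomp:

$ buildkit addpkgs <product_kit>.NEED_PRODUCT_PKGS.tar.bz2 --pkgdir <product_rpm_directory>
$ addkit <product_kit>.tar.bz2
$ addkitcomp -i <osimage_name> <kit_component_name>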

For more information about xCAT software kits, see xCAT Software Kits.

5.3.8 xCAT synchronizing files

Synchronizing (sync) files to the nodes is a feature of xCAT that is used to distribute specific files from the management node to the new deploying or deployed nodes.

This function is supported for diskful and diskless nodes. The specific files are usually system configuration files for the nodes in the /etc directory, such as /etc/hosts or /etc/resolv.conf, but they can also be configuration files for application programs on the nodes.

The advantages of this function are that it can sync files in parallel to the installed nodes or node groups, and that it can automatically sync files to a newly installed node after the installation. Additionally, this feature supports a flexible format for defining the synced files in a configuration file that is called a synclist.

The synclist file can be a common file for a group of nodes that use the same profile or OS image, or a special one for a particular node. Because its location is used to find it, the common synclist is put in a specific location for Linux nodes or is specified by the OS image.
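
A synclist is a plain-text file in which each line maps a source file on the management node to a destination path on the nodes. The following sketch shows the basic source -> destination form; the paths are examples only, and the full syntax (including clauses such as APPEND and EXECUTE) is covered in the documentation that is referenced below:

/etc/hosts -> /etc/hosts
/install/custom/ntp.conf -> /etc/ntp.conf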

For more information about xCAT file synchronization, see xCAT Synchronizing Files.

5.3.9 xCAT version

This section describes xCAT V2.14.3, which includes Red Hat Enterprise Linux-Alt Server 7.5 support for IBM PowerPC® 64-bit Little Endian (ppc64le) for the following functions:
- Provisioning types: Diskful and stateful, and diskless and stateless
- Support for the Compute Unified Device Architecture (CUDA) Toolkit for NVIDIA GPUs
- Support for Mellanox OpenFabrics Enterprise Distribution (OFED) for Linux for Mellanox InfiniBand adapters
- Support for kits with the IBM HPC software stack
- Support for non-virtualized (or bare-metal) mode
- Power AC922 server:
  – Hardware discovery for BMCs
  – Hardware management with IPMI

For more information, see xCAT 2.14.3 release.

5.3.10 xCAT scenario for high-performance computing

This section describes the following xCAT networks and network addresses (it follows the network setup that is described in Figure 4-1 on page 34):
- Network address scheme: 10.network-number.0.0/16 (16-bit network mask)
- Management (OS) network (10 Gb Ethernet): 10.1.0.0/16
- Service (BMC) network (1 Gb Ethernet): 10.2.0.0/16
- High-performance interconnect (InfiniBand): 10.10.0.0/16
- Site network (1 Gb Ethernet): 9.x.x.x/24

Note: Depending on the adapters in the systems, the number and type of network interfaces can differ.

The network switch VLAN configuration can ensure that the management node can access all nodes’ in-band and out-of-band (BMC) network interfaces, and that the non-management nodes can access only in-band network interfaces (but not out-of-band network interfaces).

The IP addresses are assigned to the nodes according to the following scheme:
- IP address: 10.network-number.rack-number.node-number-in-rack
- xCAT management node: 10.network-number.0.1
- Temporary IP addresses (the dynamic range): 10.network-number.254.sequential-number

The host names (or node names) are assigned to the POWER9 (thus, p9) processor-based compute nodes according to the following naming scheme:
- Node naming scheme: p9r<rack-number>n<node-number-in-rack>
- For example, for five racks and six systems per rack: p9r1n1, p9r1n2, ..., p9r2n1, p9r2n2, ..., p9r5n6
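
This naming scheme works well with the regular-expression support that is described in 5.3.1, because a single table entry can derive each node’s management IP address from its name instead of defining every address individually. The following sketch is one possible hosts table entry for a node group that follows this scheme; the group name and the column layout are assumptions that must be checked against the hosts table on your management node:

"ac922","|p9r(\d+)n(\d+)|10.1.($1+0).($2+0)|",,,,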

5.4 Initial xCAT management node installation on an IBM Power System LC922 server

This section describes the deployment of the xCAT management node with Red Hat Enterprise Linux Server 7.5 Alt for PowerPC 64-bit Little Endian (ppc64le) in non-virtualized (or bare-metal) mode on the Power LC922 server. The Power LC922 is a POWER9 processor-based server with two rear 3.5-inch disks and up to 12 front 3.5-inch disks, which makes it well suited as the management server for a Power AC922 HPC cluster.

After you complete the steps that are described in this section, the management node is ready for the configuration and running of the xCAT node discovery process on other nodes in the cluster.

For more information, see Extreme Cluster and Cloud Administration Toolkit and the following resources:
- Installation Guide for Red Hat Enterprise Linux (click Install Guides → Installation Guide for Red Hat Enterprise Linux → Prepare the Management Node)
- Configure xCAT section (click Admin Guide → Manage Clusters → IBM Power LE / OpenPOWER → Configure xCAT)

Note: The Power LC922 server is recommended for a management node. Although this specific type of node does not require the compute power that is found in the Power AC922 servers, it offers better storage capability.

5.4.1 Red Hat Enterprise Linux server

You can install Red Hat Enterprise Linux server by using one of the following methods:
- Virtual media (with the BMC GUI interface)
- USB device
- Network installation (for example, by using HTTP)

Note: It is not possible to use optical media because optical drives are not supported.

This section describes the installation process that uses two USB flash drives. This method is preferred because it is the easiest method and has the least dependencies. The OS is installed on the two rear disks by using this method.

For more information about other installation methods, see the following resources:
- Red Hat Enterprise Linux 7 Installation Guide
- Installing Linux on OPAL POWER9 systems

Prerequisites
The following components are required for the USB flash drive installation:
- A computer that is running Linux (preferred), macOS, or Windows
- Access to the Red Hat Enterprise Linux Server 7.5 Alt ISO
- Two USB flash drives (one drive must be at least 4 GB)
- One of the following means to navigate in Petitboot:
  – Network connection to the BMC and ipmitool (preferred)

  – Serial connection
  – VGA monitor and USB keyboard

Configuring the baseboard management controller IP
Before installing the system, configure the BMC so that the installation can be performed remotely from your desk:
1. Initialize the BMC configuration by using one of the following methods:
   a. Connect a USB keyboard and a VGA monitor to the server.
   b. Use a notebook with a serial connection to the server. For more information, see the Connecting to your server with a serial console connection page.
2. Power on your server by pressing the power button on the front of your system. Your system powers on and shows the Petitboot panel. This process takes approximately 1 - 2 minutes to complete.

Note: Do not walk away from your system. When Petitboot loads, your monitor becomes active, and you must press any key to interrupt the boot process.

3. At the Petitboot bootloader main menu, select Exit to shell.
4. Set the BMC network settings for the first channel:
   a. Set the mode to static:
      $ ipmitool lan set 1 ipsrc static
   b. Set the BMC IP address:
      $ ipmitool lan set 1 ipaddr <bmc_ip>
   c. Set the netmask:
      $ ipmitool lan set 1 netmask <netmask>
   d. Set the default gateway server:
      $ ipmitool lan set 1 defgw ipaddr <gateway_ip>
5. Verify the new configuration by running the command that is shown in Example 5-9.

Example 5-9 The ipmitool lan print command
$ ipmitool lan print 1
<...>
IP Address Source       : Static Address
IP Address              : 9.x.x.x
Subnet Mask             : 255.255.255.0
MAC Address             : 70:e2:84:14:09:ad
<...>
Default Gateway IP      : 9.x.x.x
Default Gateway MAC     : 98:be:94:00:47:84
Backup Gateway IP       : 0.0.0.0
Backup Gateway MAC      : 00:00:00:00:00:00
802.1q VLAN ID          : Disabled
802.1q VLAN Priority    : 0
<...>

6. To activate the new configuration, you must reset the BMC by running the following command: $ ipmitool mc reset cold

7. Wait approximately 3 minutes. Then, attempt to ping the BMC IP address to test the new configuration.

Note: If your ping does not return successfully, complete the following steps:
1. Power off your system by running the ipmitool power off command.
2. Unplug the power cords from the back of the system. Wait 30 seconds, and then apply power to boot the BMC.

Acquiring the network device name
To acquire the network device name for the Red Hat Enterprise Linux installation, complete the following steps:
1. Open the SOL console to the previously configured BMC (the default BMC user name and password are root:0penBmc; note the zero as the first character of the password):
   $ ipmitool -I lanplus -H <bmc_ip> -P <password> sol activate

Note: Alternatively, you can use the serial connection at the node. For more information, see “Configuring the baseboard management controller IP” on page 66.

2. Check that the system is turned on:
   $ ipmitool -I lanplus -H <bmc_ip> -P <password> chassis power on
3. Wait for the Petitboot panel and choose System information, as shown in Example 5-10.

Example 5-10 Petitboot system information panel
Petitboot System Information
Network interfaces
enp5p1s0f0: MAC: 98:be:94:68:ab:e8 link: up

enp5p1s0f1: MAC: 98:be:94:68:ab:e9 link: down

enp1s0f2: MAC: 98:be:94:68:ab:ea link: down

enp1s0f3: MAC: 98:be:94:68:ab:eb link: down

tunl0: MAC: 00:00:00:00:08:00
x=exit, h=help

Look for link: up to locate which adapters have cables connected. In Example 5-10, enP5p1s0f0 is the adapter that is connected to the site local area network (LAN); all other cables are not connected yet. The network device name is needed as described in “Editing the kickstarter file for the second USB flash drive” on page 68.

Leave the serial console open because it is needed during the process to prepare the USB flash drives. To close the console, type the ~. (tilde and period) escape sequence.

Preparing the USB flash drives
We created a kickstart file that helps you complete the basic installation and configuration process (see “Editing the kickstarter file for the second USB flash drive” on page 68). This kickstart file creates a software RAID 1 (mdraid) array on the two rear disks, disables the firewall and SELinux (as required by xCAT), and configures the initial network interface.

Prepare the following USB flash drives:
- A USB flash drive with the main Red Hat Enterprise Linux Server 7.5 Alt installation ISO
- A USB flash drive with the kickstarter configuration file

You must edit the kickstarter file and adapt it for your environment.

Note: You can add another partition to the first USB stick and use this partition for the kickstarter configuration. We want to keep the example configuration simple, so we use two separate USB flash drives.

Copying Red Hat Enterprise Linux ISO to the first USB flash drive
The following procedure assumes that you are using a Linux or a macOS system. Complete the following steps:
1. Download the Red Hat Enterprise Linux Server 7.5-ALT installation ISO from Red Hat.
2. Run the dd command to copy the ISO file to the first USB flash drive, as shown in the following example:
   $ dd if=/home/user/Downloads/RHEL-ALT-7.5-20180315.0-Server-ppc64le-dvd1.iso of=/dev/<usb_device> bs=512k
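
Before running dd, confirm that <usb_device> really is the USB flash drive and not a system disk because dd overwrites the target without any confirmation. A minimal check on a Linux system is to list the block devices, identify the drive by its size, and flush the buffers after the copy completes:

$ lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
$ sync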

Note: For more information about other OSs such as Windows, see Making Installation USB Media.

Editing the kickstarter file for the second USB flash drive Complete the following steps: 1. To get the smn_md.cfg kickstarter file, see Appendix A, “Additional material” on page 327. 2. Open the smn_md.cfg file by using an editor of your choice and edit the first parameters, such as language, time zone, root password, and network for your environment, as shown in Example 5-11.

Example 5-11 Editing the section of the smn_md.cfg kickstarter file ########################## # Kickstart file for # # IBM LC922 server # ##########################

###################### # Edit the following # ######################

#System language lang en_US

68 IBM High-Performance Computing Insights with IBM Power System AC922 Clustered Solution #System keyboard keyboard us

#System timezone timezone US/Eastern

#Root password. Default is cluster rootpw --iscrypted $1$mzPFDiLI$spnL9ck0pRVD7LnxHbrVt0

#Network information network --bootproto=static --ip=192.168.0.1 --netmask=255.255.255.0 --gateway=192.168.0.254 --nameserver=192.168.0.100,192.168.0.200 --device=enp5p1s0f0 network --hostname=xcat-mn.domain.com

#################################### # Do not touch anything below this # #################################### <...>

You also might have to change the network device name. Match the device name with the ones that are shown in Example 5-10 on page 67.
3. Generate a root password hash by running the following command:
   $ openssl passwd -1
   $1$mzPFDiLI$spnL9ck0pRVD7LnxHbrVt0
   The resulting string is added to the rootpw --iscrypted line in the .cfg file. You also can use the preconfigured hash, which sets cluster as the root password.
4. Copy the edited kickstarter file to the second FAT32 formatted USB flash drive. Set the label of your USB drive to something informative, such as “KICKSTART”.
5. Record the second USB flash drive UUID by running the following command:
   $ blkid
   /dev/sdc2: SEC_TYPE="msdos" UUID="BAE0-6F47" TYPE="vfat"

Starting the installation
To start the installation process, complete the following steps:
1. Insert both USB flash drives into the blue USB ports only.
2. Open your serial console window or reconnect to the console.
3. Restart the server by running the following command:
   $ ipmitool -I lanplus -H <bmc_ip> -P <password> chassis power cycle

4. Wait for the Petitboot panel to load. Choose Install Red Hat Enterprise Linux 7.5 (64-bit kernel) and press E to edit the boot arguments, as shown in Example 5-12.

Example 5-12 Editing the boot arguments <...> [USB: sdn / 2018-10-19-18-33-07-00] Rescue a Red Hat Enterprise Linux system (64-bit kernel) Test this media & install Red Hat Enterprise Linux 7.5 (64-bit kernel) * Install Red Hat Enterprise Linux 7.5 (64-bit kernel) <...>

Note: Select Rescan devices if the USB device does not appear.

5. Change the boot arguments as shown in the following example:
   ro inst.ks=hd:UUID=<kickstart_usb_uuid>:/smn_md.cfg inst.stage2=hd:UUID=<install_usb_uuid>

Note: <kickstart_usb_uuid> is the UUID of your second USB flash drive with the kickstarter file that is described in “Editing the kickstarter file for the second USB flash drive” on page 68. <install_usb_uuid> is the UUID of your first USB flash drive with the Red Hat Enterprise Linux 7 ISO on it; it is the string in the “[ ]” brackets after the device name in the edit menu that is shown in Example 5-13.

Example 5-13 shows the full string for our example.

Example 5-13 Boot argument string Device: (*) sdm [2018-10-19-18-33-07-00] ( ) Specify paths/URLs manually

Kernel: /ppc/ppc64/vmlinuz Initrd: /ppc/ppc64/initrd.img Device tree: Boot arguments: ro inst.ks=hd:UUID=BAE0-6F47:/smn_md.cfg inst.stage2=hd:UUID=2018-10-19-18-33-07-00

6. Click OK. Then, press Enter and Enter again to start the installation. The installation proceeds automatically, and the system restarts when it finishes. If you do not unplug the USB flash drive during the restart, you must choose the correct boot target (the one starting with Disk) in Petitboot, as shown in Example 5-14.

Example 5-14 Choosing the correct boot target [Disk: md126 / 62c0a482-733c-4e74-ac7c-22ee6483ab3d] Red Hat Enterprise Linux Server (0-rescue-eca7fb0809fa47b3b3f76eaa44869cd4) * Red Hat Enterprise Linux Server (4.14.0-49.el7a.ppc64le) 7.5 (Maipo)

7. After the final disk boot, log in by using SSH with your root credentials that are configured in your kickstarter file.

Configuring the Red Hat Enterprise Linux Server 7.5 package repository
To install more packages and satisfy package dependencies for xCAT, configure a yum package repository for the Red Hat Enterprise Linux Server 7.5 packages.

You can configure the system for the regular Red Hat Enterprise Linux package update channels or, at a minimum, for the Red Hat Enterprise Linux Server 7.5 DVD 1 ISO. For simplicity, this section describes the Red Hat Enterprise Linux Server 7.5 DVD 1 ISO.

To configure the package repository for Red Hat Enterprise Linux Server 7.5 DVD 1, complete the following steps: 1. Copy the Red Hat Enterprise Linux Server 7.5 ISO content from the USB flash drive to the server, as shown in Example 5-15.

Example 5-15 Copying the ISO content from the USB flash drive $ mkdir /mnt/usb $ mount -o ro /dev/disk/by-uuid/2018-10-19-18-33-07-00 /mnt/usb $ cp -r /mnt/usb /mnt/rhel7.5-dvd1 $ chmod 500 /mnt/rhel7.5-dvd1 $ umount /mnt/usb

2. Configure the repository: a. Install the RPM GPG key: $ rpm --import /mnt/rhel7.5-dvd1/RPM-GPG-KEY-redhat-release b. Create the yum package repository file, as shown in Example 5-16.

Example 5-16 Creating the yum package repository file
$ cat <<EOF >/etc/yum.repos.d/rhel7.5-dvd1.repo
[rhel-7.5-dvd1]
name=RHEL 7.5 Server DVD1
baseurl=file:///mnt/rhel7.5-dvd1
enabled=1
gpgcheck=1
EOF

You can verify that the package repository is configured correctly by running the command that is shown in Example 5-17.

Example 5-17 Verifying that the package repository is configured correctly
$ yum repolist
Loaded plugins: product-id, search-disabled-repos, subscription-manager
This system is not registered to Red Hat Subscription Management. You can use subscription-manager to register.
repo id            repo name                 status
rhel-7.5-dvd1      RHEL 7.5 Server DVD1      3.454
repolist: 3.454

Front disk formatting
Consider the following points as you configure the front drives for data:
- The /install directory for xCAT must be at least 500 GB because it contains all your images and software packages.
- Keep track of your logs and databases.
- Use LVM for flexible volume management.

The front drives can be formatted by using one of the following methods:
- Use the specific built-in hardware RAID controller. For more information, see the manufacturer’s manual.
- Use LVM mirroring. For more information, see Red Hat Creating Mirrored Volumes.

5.4.2 xCAT packages and installation

xCAT is a collection of packages that are available for download at the xCAT project page.

The xCAT packages are organized into the following package repositories:
- xCAT core packages: Packages with the xCAT software itself. This package repository is available in the following streams (or types):
  – Release (or stable) builds: The latest, officially released (general availability) version of xCAT.
  – Snapshot builds: Unreleased changes for the next refresh of the current version of xCAT.
  – Development builds: Unreleased changes for the next version of xCAT.
- xCAT dependency packages: Required packages that are not provided by the Linux distribution.

xCAT is installed by using one of the following methods:
- Manually, with online or offline RPM package repositories for Red Hat Enterprise Linux and SUSE Linux Enterprise Server, and Debian packages (for Ubuntu).
- With the go-xcat installation tool, which was added in xCAT V2.12.1. This tool simplifies the xCAT installation and update procedure and configures all the required repositories automatically.

This section describes the installation by using the new go-xcat tool. Complete the following steps: 1. Download the go-xcat tool, as shown in Example 5-18.

Example 5-18 Downloading the go-xcat tool
$ wget https://raw.githubusercontent.com/xcat2/xcat-core/master/xCAT-server/share/xcat/tools/go-xcat -O - >/tmp/go-xcat
$ chmod +x /tmp/go-xcat

72 IBM High-Performance Computing Insights with IBM Power System AC922 Clustered Solution 2. Run the tool without parameters to configure both xCAT repositories and install the latest version of xCAT, as shown in Example 5-19.

Note: You can specify a specific version of xCAT to install by using the -x option.

Example 5-19 Running the go-xcat tool $ /tmp/go-xcat install Operating system: linux Architecture: ppc64le Linux Distribution: rhel Version: 7.5

Reading repositories ...... done

xCAT is going to be installed. Continue? [y/n] y <... long yum output ...> Complete!

xCAT has been installed! ======

#If this is the first time xCAT has been installed, run the following #commands to set environment variables into your PATH:

for sh, `source /etc/profile.d/xcat.sh' or csh, `source /etc/profile.d/xcat.csh'

For an offline installation, you can specify the path to download the xcat-core and xcat-dep packages, as shown in the go-xcat help output in Example 5-20.

Example 5-20 The go-xcat help output $ /tmp/go-xcat -h go-xcat version 1.0.29

Usage: go-xcat [OPTION]... [ACTION] Install xCAT automatically

Options: Mandatory arguments to long options are mandatory for short options too. -h, --help display this help and exit --xcat-core=[URL] use a different URL or path for the xcat-core repository --xcat-dep=[URL] use a different URL or path for the xcat-dep repository -x, --xcat-version=[VERSION] specify the version of xCAT; cannot use with --xcat-core -y, --yes answer yes for all questions

Actions: install installs all the latest versions of xcat-core

and xcat-dep packages from the repository
update          updates installed xcat-core packages to the latest version from the repository

Examples: go-xcat go-xcat install go-xcat update go-xcat --yes install go-xcat -x 2.12 -y install go-xcat --xcat-version=devel install go-xcat --xcat-core=/path/to/xcat-core.tar.bz2 \ --xcat-dep=/path/to/xcat-dep.tar.bz2 install go-xcat --xcat-core=http://xcat.org/path/to/xcat-core.tar.bz2 \ --xcat-dep=http://xcat.org/path/to/xcat-dep.tar.bz2 install

xCAT (Extreme Cluster and Cloud Administration Toolkit): Full documentation at:

3. Verify that the package repositories are configured correctly, as shown in Example 5-21.

Example 5-21 Verifying that the repositories are configured correctly $ yum repolist <...> repo id repo name status rhel-7.5-dvd1 RHEL 7.5 Server DVD1 3706+1 xcat-core xCAT 2 Core packages 20 xcat-dep xCAT 2 depedencies 32 repolist: 3.706

4. Verify that the xCAT service is running, as shown in Example 5-22.

Example 5-22 Verifying that the xCAT service is running $ systemctl status xcatd xcatd.service - xCAT management service Loaded: loaded (/usr/lib/systemd/system/xcatd.service; enabled; vendor preset: disabled) Active: active (running) since Mon 2018-10-15 14:03:22 EDT; 2 days ago <...>

5. Verify the version information and node type, as shown in Example 5-23.

Example 5-23 Verifying the version information and node type $ source /etc/profile.d/xcat.sh $ lsxcatd -a Version 2.14.3 (git commit 7b7d9ab67589afec1ba58cd5138154290c529c0e, built Tue Aug 21 07:16:00 EDT 2018) This is a Management Node dbengine=SQLite

5.4.3 Configuring more network interfaces

xCAT requires a static IP network configuration for the management node. The site network interface is configured during the kickstarter process. For all other networks, follow the instructions that are presented in this section.

Note: This requirement applies to any xCAT networks with communication between the management node and other nodes (for example, management and service networks).

You can configure an Ethernet network interface with a static IP address in one of many ways. This section describes the method that uses sysconfig or ifcfg files, and the nmcli command (Network Manager Command-Line Interface).

For more information, see Red Hat Enterprise Linux 7 Networking Guide at Red Hat documentation.

This section uses content from the following sections of the Red Hat Enterprise Linux 7 Networking Guide:
- Section 1.9, “Network configuration by using sysconfig files”
- Section 2.4.1, “Configuring a network interface with ifcfg files”

To configure a network interface with a static IP address, complete the following steps for the management (OS) network:
1. Create the /etc/sysconfig/network-scripts/ifcfg-<interface> file. For the scenario that is described in this chapter, the file resembles the following example:
   $ cat <<EOF >/etc/sysconfig/network-scripts/ifcfg-enP5p1s0f0
   DEVICE=enP5p1s0f0
   ONBOOT=yes
   BOOTPROTO=none
   IPADDR=10.1.0.1
   PREFIX=16
   IPV6INIT=yes
   EOF
2. Verify that the new configuration is not yet in effect (no IPv4 or IPv6 address is assigned):
   $ ip addr show enP5p1s0f0
   4: enP5p1s0f0: mtu 1500 qdisc mq state UP qlen 1000
      link/ether 98:be:94:59:fa:26 brd ff:ff:ff:ff:ff:ff
3. Reload the network configuration for that network interface by running the nmcli command. The network configuration is loaded automatically on system boot.
   $ nmcli connection load /etc/sysconfig/network-scripts/ifcfg-enP5p1s0f0
4. Verify that the network configuration is in effect (including an IPv6 link-local address):
   $ ip addr show enP5p1s0f0
   4: enP5p1s0f0: mtu 1500 qdisc mq state UP qlen 1000
      link/ether 98:be:94:59:fa:26 brd ff:ff:ff:ff:ff:ff
      inet 10.1.0.1/16 brd 10.1.255.255 scope global enP5p1s0f0
         valid_lft forever preferred_lft forever
      inet6 fe80::9abe:94ff:fe59:fa26/64 scope link
         valid_lft forever preferred_lft forever
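
The same procedure applies to any other xCAT network that needs a static address on the management node. As a sketch, the service (BMC) network in this scenario (10.2.0.0/16) can be configured with an analogous ifcfg file; the interface name enp1s0f3 is taken from the tabdump output later in this chapter and can differ on your system:

$ cat <<EOF >/etc/sysconfig/network-scripts/ifcfg-enp1s0f3
DEVICE=enp1s0f3
ONBOOT=yes
BOOTPROTO=none
IPADDR=10.2.0.1
PREFIX=16
EOF
$ nmcli connection load /etc/sysconfig/network-scripts/ifcfg-enp1s0f3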

5.4.4 Host name and alias

Although the kickstarter file configured the host name, you must verify that everything is set up correctly.

Verify that the short and long (fully qualified domain name) host names are detected: $ hostname --short xcat-mn

$ hostname --long xcat-mn.xcat-cluster

Configure the host name to be resolved to the IP address in the management (OS) network by completing the following steps:
1. Add the host aliases in the /etc/hosts file:
   $ echo "10.1.0.1 $(hostname -s) $(hostname -f)" >> /etc/hosts
   $ cat /etc/hosts
   127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
   ::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
   10.1.0.1 xcat-mn xcat-mn.xcat-cluster
2. Verify that the short host name resolves to the long host name, and that the ping test works:
   $ ping -c1 xcat-mn
   PING xcat-mn.xcat-cluster (10.1.0.1) 56(84) bytes of data.
   64 bytes from xcat-mn.xcat-cluster (10.1.0.1): icmp_seq=1 ttl=64 time=0.025 ms

--- xcat-mn.xcat-cluster ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 0.025/0.025/0.025/0.000 ms

5.4.5 xCAT networks

The xCAT networks configuration is stored in the networks table and is available as objects of the network type.

Running the makenetworks command populates the networks table based on the current configuration of the network interfaces. It is run automatically during the installation of xCAT packages.

You can use the following commands to create, list, modify, and remove network objects, and to list and edit the networks table:
- mkdef -t network
- lsdef -t network
- chdef -t network
- rmdef -t network
- tabdump networks
- tabedit networks

To configure the xCAT networks, populate the networks table by running the makenetworks command, remove any non-xCAT networks, rename the xCAT networks (optional), and create any other xCAT networks.

For this scenario, complete the following steps:
1. Populate the networks table by running the makenetworks command:
   $ makenetworks

$ lsdef -t network 10_1_0_0-255_255_0_0 (network) 10_2_0_0-255_255_0_0 (network) 9_x_x_x-255_255_255_0 (network) fd55:xxxx:xxxx:33e::/64 (network) 2. Remove any non-xCAT networks: $ rmdef -t network 9_x_x_x-255_255_255_0 1 object definitions have been removed.

$ rmdef -t network fd55:xxxx:xxxx:33e::/64 1 object definitions have been removed.

$ lsdef -t network 10_1_0_0-255_255_0_0 (network) 10_2_0_0-255_255_0_0 (network) 3. (Optional) Rename the xCAT networks: $ chdef -t network 10_1_0_0-255_255_0_0 -n net-mgmt Changed the object name from 10_1_0_0-255_255_0_0 to net-mgmt.

$ chdef -t network 10_2_0_0-255_255_0_0 -n net-svc Changed the object name from 10_2_0_0-255_255_0_0 to net-svc.

$ lsdef -t network net-mgmt (network) net-svc (network) 4. Create any other xCAT networks: $ mkdef -t network net-app-ib net=10.10.0.0 mask=255.255.0.0 1 object definitions have been created or modified.

$ lsdef -t network net-app-ib (network) net-mgmt (network) net-svc (network)

Note: The application networks lack the mgtifname (management interface name) attribute because the management node is not connected to them, that is, only some other nodes are, such as the compute nodes and login nodes.

However, application networks must be defined in the management node for it to perform their network configuration on other nodes (by way of the management network).

5. Verify the xCAT networks configuration. You can run the lsdef command to list the configuration of a specific xCAT network or networks. The following example lists the management network and application network for InfiniBand port 1: $ lsdef -t network net-mgmt,net-app-ib Object name: net-mgmt gateway= mask=255.255.0.0 mgtifname=enP5p1s0f0 net=10.1.0.0 tftpserver=10.1.0.1 Object name: net-app-ib mask=255.255.0.0 net=10.10.0.0 You can run the tabdump command to list the configuration of all xCAT networks: $ tabdump networks #netname,net,mask,mgtifname,<...> "net-mgmt","10.1.0.0","255.255.0.0","enP5p1s0f0",<...> "net-svc","10.2.0.0","255.255.0.0","enp1s0f3",<...> "net-app-ib","10.10.0.0","255.255.0.0",,,,,,,,,,,,,,,,

5.4.6 DNS server

The xCAT configures the DNS server based on the attributes of the site table, the /etc/hosts file, and node definitions. The makedns command applies the configuration.

The following attributes of the site table are used to configure the DNS server:
- dnsinterfaces: Host name (optional) and network interfaces for the DNS server to listen on.
- domain: DNS domain name for the cluster.
- forwarders: DNS servers for resolving non-cluster names, that is, the site’s or external DNS servers.
- master: IP address of the xCAT management node on the management (OS) network, as known by the nodes.
- nameservers: DNS servers that are used by the compute nodes (usually, the IP address of the management node). The value <xcatmaster> indicates the management node or service node that is managing a node (it is automatically resolved to the correct IP address in the respective xCAT network), which is more portable.

For more information, see the manual page of the site table by running the following command: $ man 5 site
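
For the scenario in this chapter, those attributes can be set with a single chdef command and verified with lsdef; the forwarder addresses below are placeholders for the site DNS servers:

$ chdef -t site domain=xcat-cluster master=10.1.0.1 nameservers=10.1.0.1 forwarders=<site_dns_1>,<site_dns_2>
$ lsdef -t site -i domain,forwarders,master,nameservers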

Running the makedns command generates the following configuration files for the DNS server and reloads it:
- /etc/named.conf: Main configuration file (generated by makedns -n)
- /var/named/*: Zone files for network names and addresses (generated by makedns -n)

To configure the DNS server, complete the following steps:
1. Check the following basic configuration parameters in your site table:
   $ lsdef -t site -i dnsinterfaces,domain,forwarders,master,nameservers
   Object name: clustersite
   domain=xcat-cluster
   forwarders=9.x.x.x,9.x.x.x
   master=10.1.0.1
   nameservers=10.1.0.1

Note: To get a detailed description for each parameter, run the tabdump -d site command.

2. The dnsinterfaces attribute of the site table is not configured yet. Set the attribute by running the following chdef command: $ chdef -t site dnsinterfaces='xcat-mn|enP5p1s0f0' 1 object definitions have been created or modified. 3. Generate new configuration files for the DNS server by running the makedns -n command. The DNS server is automatically started or restarted as follows. $ makedns -n Handling xcat-mn.xcat-cluster in /etc/hosts. Handling localhost in /etc/hosts. Handling localhost in /etc/hosts. Getting reverse zones, this take several minutes for a large cluster. Completed getting reverse zones. Updating zones. Completed updating zones. Restarting named Restarting named complete Updating DNS records, this take several minutes for a large cluster. Completed updating DNS records.

Note: It is important that no errors are reported in the output of the makedns command. The proper functioning of the DNS server is essential to several features in xCAT (for example, node discovery).

If any errors are reported, check the messages, the contents of the /etc/hosts file, and any node definitions (by running the lsdef command) for errors or inconsistencies.

4. Verify that the DNS server is resolving internal and external names with the host command by completing the following steps: a. Install the bind-utils package (not installed with the minimal installation package set): $ yum install bind-utils b. Verify the name resolution of internal names. For example, the management node (that is, its short and long host names are associated and are resolved to its IP address in the management network): $ host xcat-mn 10.1.0.1 Using domain server: Name: 10.1.0.1 Address: 10.1.0.1#53 Aliases:

xcat-mn.xcat-cluster has address 10.1.0.1 c. Verify the name resolution of external names: $ host example.com 10.1.0.1 Using domain server: Name: 10.1.0.1 Address: 10.1.0.1#53 Aliases:

example.com has address 93.184.216.34 example.com has IPv6 address 2606:2800:220:1:248:1893:25c8:1946

5.4.7 DHCP server

xCAT configures the DHCP server based on the attributes of the site and networks tables and the node definitions (for the reservation of IP address leases that are based on a MAC address). The makedhcp command applies the configuration.

The following attributes of the site and networks tables are used to configure the DHCP server:
- dhcpinterfaces (site table): Host name (optional) and network interfaces on which the DHCP server listens.
- dynamicrange (networks table): Range of IP addresses that are temporarily assigned during the node discovery process, which is required in the xCAT management and service networks.

Running the makedhcp command generates the following configuration files for the DHCP server and reloads it:
- /etc/dhcp/dhcpd.conf: Main configuration file (generated by makedhcp -n)
- /var/lib/dhcpd/dhcpd.leases: IP address leases (generated by makedhcp -a)

Note: The :noboot flag in the networks table prevents the distribution of the Genesis image in the network.

To configure the DHCP server, complete the following steps:
1. Set the dhcpinterfaces attribute of the site table by running the chdef command (note the :noboot flag), as shown in the following example for the scenario in this chapter:
   $ chdef -t site dhcpinterfaces='xcat-mn|enP5p1s0f0,xcat-mn|enp1s0fx:noboot'
   1 object definitions have been created or modified.
   You can verify the attributes by running the lsdef command:
   $ lsdef -t site -i dhcpinterfaces
   Object name: clustersite
   dhcpinterfaces=xcat-mn|enP5p1s0f0,xcat-mn|enp1s0fx:noboot

2. Remove the Genesis configuration file from the :noboot tagged network by running the mknb command.

Note: The mknb (make network boot) command generates the network boot configuration file (specified in the /etc/dhcp/dhcpd.conf file) and Genesis image files. It is run automatically when the xCAT packages are installed.

$ mknb ppc64 Creating genesis.fs.ppc64.gz in /tftpboot/xcat 3. Generate the configuration file for the DHCP server by running the makedhcp -n command. The DHCP server is automatically started or restarted. $ makedhcp -n Renamed existing dhcp configuration file to /etc/dhcp/dhcpd.conf.xcatbak

The dhcp server must be restarted for OMAPI function to work
Warning: No dynamic range specified for 10.1.0.0. If hardware discovery is being used, a dynamic range is required.
Warning: No dynamic range specified for 10.2.0.0. If hardware discovery is being used, a dynamic range is required.
Despite the message about the need to restart the DHCP server, it is restarted automatically, as shown in the /var/log/messages file:
$ tail /var/log/messages
<...>
<...> xcat[...]: xCAT: Allowing makedhcp -n for root from localhost
<...> systemd: Starting DHCPv4 Server Daemon...
<...> dhcpd: Internet Systems Consortium DHCP Server 4.2.5
<...>
4. Generate the leases file for the DHCP server by running the makedhcp -a command. This step is required only if any node definitions exist (they do not exist yet in the scenario in this chapter).
$ makedhcp -a
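
To confirm the result, the generated configuration and the service state can also be inspected directly. This is a minimal check that assumes the subnet values of this scenario:

$ grep -A3 'subnet 10.1.0.0' /etc/dhcp/dhcpd.conf
$ systemctl status dhcpd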

5.4.8 Intelligent Platform Management Interface credentials

xCAT configures the authentication credentials for IPMI commands (for example, power management, console sessions, BMC discovery, and network configuration) based on attributes from (in this order) node definitions, node groups definitions, and the passwd table.

To configure IPMI authentication credentials on individual nodes or on node groups, set the bmcusername and bmcpassword attributes on the node or node group object by running the chdef command:
$ chdef <node_or_group> bmcusername=<username> bmcpassword=<password>

If the IPMI authentication credentials are common across some or all of the systems’ BMCs, you can set the common credentials in the passwd table. Any different credentials can be set in the respective node or node group objects.

For more information, see Configure passwords page of the xCAT documentation website.

To configure the IPMI authentication credentials, complete the following steps:
1. Set the username and password attributes of the ipmi key or row in the passwd table by running the chtab or tabedit commands:
   $ chtab key=ipmi passwd.password=0penBmc
2. Verify the setting by running the tabdump command:
   $ tabdump -w key==ipmi passwd
   #key,username,password,cryptmethod,authdomain,comments,disable
   "ipmi","","0penBmc",,,,
   You can run the tabdump command without filter arguments to list the configuration of all entries in the passwd table:
   $ tabdump passwd
   #key,username,password,cryptmethod,authdomain,comments,disable
   "omapi","xcat_key","<...>=",,,,
   "ipmi","","0penBmc",,,,

Note: If the IPMI authentication credentials are not set or invalid, some IPMI-based commands can show errors: $ rpower node status node: Error: Unauthorized name

5.5 xCAT node discovery

This section describes the xCAT node discovery (or hardware discovery) process. It covers the configuration steps that are required in the management node and instructions for performing the discovery of nodes in the cluster. xCAT provides the following methods for node discovery:
- Manual definition
  This method includes manual hardware information collection and node object definition. This example includes required node-specific and xCAT and platform-generic attributes:
  $ mkdef node1 \
    groups=all,ac922 \
    ip=10.1.1.1 mac=6c:ae:8b:6a:d4:e installnic=mac primarynic=mac \
    bmc=10.2.1.1 bmcpassword=0penBmc \
    mgt=ipmi cons=ipmi netboot=petitboot
- MTMS-based discovery
  This process automatically collects MTM and S/N information from the node’s BMC and OS (Genesis), and matches it with a minimal manually defined node object (with the mtm and serial attributes). This process automatically stores the hardware information in the matched node object.

- Sequential discovery
  This method automatically stores hardware information in a list of minimal node objects (with no attributes) in a sequential manner in the order that the nodes are booted.
- Switch-based discovery
  This method automatically identifies the network switch and port for the node (with the SNMPv3 protocol) and matches it with a minimal manually defined node object (with the switch and switchport attributes). This process automatically stores the hardware information in the matched node object.

This section describes the MTMS-based discovery method.
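For reference, a minimal node object for MTMS-based discovery needs only enough information for xCAT to match the machine type and model (MTM) and serial number that are reported by the BMC and Genesis. The following command is a sketch that reuses the MTM and serial values that appear later in this chapter; the node name, group, and values are placeholders to adapt to your environment:
$ mkdef p9r1n1 groups=all,ac922 mgt=ipmi netboot=petitboot mtm=8335-GTH serial=7882AAA
The node group and node objects that are used in this scenario are defined in detail in 5.5.5, “Defining node objects”.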

5.5.1 Verification of network boot configuration and Genesis image files

xCAT node discovery requires the node to boot the Genesis image from the network. It is important to verify that the files for network boot configuration and Genesis image are correctly in place.

To verify and generate the files for network boot configuration and the Genesis image, complete the following steps: 1. Verify the location of the platform-specific network boot configuration file in the /etc/dhcp/dhcpd.conf file. The scenario that is adopted in this chapter uses OPAL firmware. $ grep -A1 OPAL /etc/dhcp/dhcpd.conf } else if option client-architecture = 00:0e { #OPAL-v3 option conf-file = "http://10.1.0.1/tftpboot/pxelinux.cfg/p/10.1.0.0_16";

Note: The network boot configuration and Genesis image files are required only for the xCAT management network.

2. Verify that the file exists: $ ls /tftpboot/pxelinux.cfg/p/10.1.0.0_16 /tftpboot/pxelinux.cfg/p/10.1.0.0_16

Note: The specified file might not exist in some cases, such as when the mknb command is not run after a change in the xCAT networks configuration. If this issue occurs, run mknb ppc64 again.

3. Verify the content of the file. It must contain pointers to the platform-specific Genesis image files. $ cat /tftpboot/pxelinux.cfg/p/10.1.0.0_16 default "xCAT Genesis (10.1.0.1)" label "xCAT Genesis (10.1.0.1)" kernel http://10.1.0.1:80//tftpboot/xcat/genesis.kernel.ppc64 initrd http://10.1.0.1:80//tftpboot/xcat/genesis.fs.ppc64.gz append "quiet xcatd=10.1.0.1:3001 " 4. Verify that the files of the Genesis image exist: $ ls -lh /tftpboot/xcat/genesis.{kernel,fs}.ppc64* -rw-r--r--. 1 root root 123M Oct 11 15:15 /tftpboot/xcat/genesis.fs.ppc64.gz -rwxr-xr-x. 1 root root 23M Apr 4 2018 /tftpboot/xcat/genesis.kernel.ppc64

Tip: To increase the verbosity of the node discovery in the nodes (which is useful for educational and debugging purposes), remove the quiet argument from the append line in the network boot configuration file by running the following command: $ sed 's/quiet//' -i /tftpboot/pxelinux.cfg/p/10.1.0.0_16

5.5.2 Configuring the DHCP dynamic range

xCAT node discovery requires temporary IP addresses for nodes and BMCs until the association with the respective node objects and permanent network configuration occur. The IP address range that is reserved for that purpose is known as dynamic range, which is an attribute of network objects and is reflected in the configuration file of the DHCP server.

You must provide temporary IP addresses for the Management and Service Networks to handle the in-band network interface (used by the Petitboot bootloader and Genesis image) and out-of-band network interface (used by the BMC).

Depending on your network environment and configuration, the management node can use different or shared network interfaces for the Management and Service Networks. This consideration is important because the dynamic range is defined per network interface in the configuration file of the DHCP server. As described in 5.3.10, “xCAT scenario for high-performance computing” on page 64, the following settings are recommended:  For different network interfaces, set the dynamicrange attribute on both the Management and Service Networks.  For a shared network interface, set the dynamicrange attribute on either the Management or the Service Network.
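For example, in the scenario in this chapter (different network interfaces for the Management and Service Networks), the Service Network can be given its own dynamic range with a command of the following form. This command is a sketch; the exact range is an assumption that is consistent with the temporary BMC addresses that are shown later in this chapter:
$ chdef -t network net-svc dynamicrange=10.2.254.1-10.2.254.254
The equivalent command for the management network is shown in step 1 of the following procedure.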

To set the dynamic range, complete the following steps: 1. Set the dynamicrange attribute for the management network by running the chdef command, as shown in the following example for the scenario in this chapter: $ chdef -t network net-mgmt dynamicrange=10.1.254.1-10.1.254.254 1 object definitions have been created or modified. You can verify it by running the lsdef command: $ lsdef -t network net-mgmt -i dynamicrange Object name: net-mgmt dynamicrange=10.1.254.1-10.1.254.254 2. Generate the configuration file for the DHCP server by running the makedhcp -n command: $ makedhcp -n Renamed existing dhcp configuration file to /etc/dhcp/dhcpd.conf.xcatbak 3. Verify the dynamic range in the configuration of the DHCP server: $ grep 'network\|subnet\|_end\|range' /etc/dhcp/dhcpd.conf

5.5.3 Configuring BMCs to DHCP mode

xCAT node discovery requires the BMCs’ network configuration to occur in DHCP mode (until the association with the respective node objects and permanent network configuration occur).

Note: If the BMCs’ network configuration cannot be changed (for example, because of network restrictions or maintenance requirements), skip the required steps for the BMC network configuration in the node discovery process. For more information, see 5.5.4, “Definition of temporary BMC objects” on page 87. Then, manually set the bmc attribute of one or more nodes to the respective BMC IP address.
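For example, to point a node object at a BMC that keeps its existing static address, set the bmc attribute directly by running the chdef command. The following command is a sketch that reuses the addressing scheme of this chapter; substitute the real BMC IP address:
$ chdef p9r1n1 bmc=10.2.1.1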

To set the network configuration of the BMCs to DHCP mode, complete the following tasks:  For BMCs that are in DHCP mode, no action is required.  For BMCs with Static (IP) Address mode and a known IP address, complete the following steps: a. Set the IP Address Source attribute to DHCP by running the ipmitool command. Notice that the IP Address Source attribute is set to Static Address. $ ipmitool -I lanplus -H <BMC IP address> -P <password> lan print 1 <...> IP Address Source : Static Address IP Address : 192.168.101.29 Subnet Mask : 255.255.255.0 MAC Address : 70:e2:84:14:02:54 <...> Set it to DHCP: $ ipmitool -I lanplus -H <BMC IP address> -P <password> lan set 1 ipsrc dhcp Notice that the network configuration changes do not take effect immediately: $ ipmitool -I lanplus -H <BMC IP address> -P <password> lan print 1 <...> IP Address Source : DHCP Address IP Address : 192.168.101.29 Subnet Mask : 255.255.255.0 MAC Address : 70:e2:84:14:02:54 <...> b. Restart the BMC by running the ipmitool command, which is required for the network configuration changes to take effect: $ ipmitool -I lanplus -H <BMC IP address> -P <password> mc reset cold c. Wait for the BMC to perform initialization and network configuration. To determine when the BMC is back online and has acquired an IP address through DHCP, you can watch the /var/log/messages file for DHCP server log messages, as shown in the following example: $ tail -f /var/log/messages <...> <...> dhcpd: DHCPDISCOVER from 70:e2:84:14:02:54 via enP5p1s0f0 <...> dhcpd: DHCPOFFER on 10.2.254.1 to 70:e2:84:14:02:54 via enP5p1s0f0 <...> dhcpd: DHCPREQUEST for 10.2.254.1 (10.1.0.1) from 70:e2:84:14:02:54 via enP5p1s0f0 <...> dhcpd: DHCPACK on 10.2.254.1 to 70:e2:84:14:02:54 via enp1s0f3 <...>

It is also possible to determine when the BMC is back online with an approach that is based on its IPv6 link-local address, which does not change across power cycles, network configuration steps, and so on (see Example 5-24). To identify the IPv6 link-local address of each BMC in a local network, see the “For BMCs with unknown IP address” bullet.

Example 5-24 Waiting for the BMC with the ping6 command and IPv6 link-local address $ while ! ping6 -c 1 <IPv6 link-local address>%<network interface>; do echo Waiting; done; echo Finished PING fe80::72e2:84ff:fe14:254%enp1s0f3(fe80::72e2:84ff:fe14:254) ... <...> Waiting PING fe80::72e2:84ff:fe14:254%enp1s0f3(fe80::72e2:84ff:fe14:254)... <...> Waiting <...> PING fe80::72e2:84ff:fe14:254%enp1s0f3(fe80::72e2:84ff:fe14:254) ... 64 bytes from fe80::72e2:84ff:fe14:254: icmp_seq=1 ttl=64 time=0.669 ms <...> Finished

d. Verify that the network configuration changes are in effect by running the ipmitool command: $ ipmitool -I lanplus -H <BMC IP address> -P <password> lan print 1 <...> IP Address Source : DHCP Address IP Address : 10.2.254.1 Subnet Mask : 255.255.0.0 MAC Address : 70:e2:84:14:02:54 <...>  For BMCs with unknown IP address (in DHCP or Static Address mode), complete the following steps: a. Install the nmap package: $ yum install nmap b. Discover one or more IPv6 link-local addresses of one or more BMCs by running the nmap command (see Example 5-25).

Example 5-25 Discovering the IPv6 link-local address of BMCs with the nmap command $ nmap -6 --script=targets-ipv6-multicast-echo -e enP5p1s0f0

Starting Nmap 6.40 ( http://nmap.org ) at <...> Pre-scan script results: | targets-ipv6-multicast-echo: | IP: fe80::9abe:94ff:fe59:f0f2 MAC: 98:be:94:59:f0:f2 IFACE: enP5p1s0f0 | IP: fe80::280:e5ff:fe1b:fc99 MAC: 00:80:e5:1b:fc:99 IFACE: enP5p1s0f0 | IP: fe80::280:e5ff:fe1c:9c3 MAC: 00:80:e5:1c:09:c3 IFACE: enP5p1s0f0 | IP: fe80::72e2:84ff:fe14:259 MAC: 70:e2:84:14:02:59 IFACE: enP5p1s0f0 | IP: fe80::72e2:84ff:fe14:254 MAC: 70:e2:84:14:02:54 IFACE: enP5p1s0f0 |_ Use --script-args=newtargets to add the results as targets WARNING: No targets were specified, so 0 hosts scanned. Nmap done: 0 IP addresses (0 hosts up) scanned in 2.37 seconds

c. Perform the steps that are described in the “Static (IP) Address mode and known IP address” case. In the ipmitool commands, replace the BMC’s IPv4 address with its IPv6 link-local address and the network interface as the zone index, which is separated by the percent (%) symbol, as shown in the following example: <IPv6 link-local address>%<network interface>

5.5.4 Definition of temporary BMC objects

xCAT node discovery requires temporary node objects for BMCs until the association with the respective node objects and permanent network configuration occurs.

The temporary BMC objects are created by running the bmcdiscover command, which scans an IP address range for BMCs and collects information, such as machine type and model, serial number, and IP address. It can provide that information as objects in the xCAT database or the respective stanzas (a text-based description format with object name, type, and attributes). The objects are named after their MTMS information (which are obtained through IPMI).

The temporary BMC objects are automatically removed during the node discovery process after the respective node objects are matched and ready to refer to the BMCs. It is a simple task to have xCAT objects refer to the BMCs that are using temporary IP addresses (not yet associated with the respective node objects). This configuration allows for running xCAT commands (for example, power management and console sessions) before the network configuration of the BMC occurs.

The bmcdiscover command has the following requirements:  The BMCs’ IP addresses must be within a known range (the dynamic range configuration, as described in 5.5.2, “Configuring the DHCP dynamic range” on page 84 and BMCs in DHCP mode, as described in 5.5.3, “Configuring BMCs to DHCP mode” on page 85).  The IPMI authentication credentials must be defined in the passwd table (see “Intelligent Platform Management Interface credentials” on page 81) or by using command arguments.

The bmcdiscover command can be used with the following arguments: -z: Provides an object definition stanza. -w: Writes objects to the xCAT database.
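For example, to review the discovered BMCs before writing anything to the xCAT database, you can capture the stanzas to a file first. The following command is a sketch; the range matches the dynamic range that is used in this chapter and the output file name is arbitrary:
$ bmcdiscover --range 10.2.254.1-254 -z > /tmp/bmc-stanzas.txt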

To define temporary objects for the BMCs, complete the following steps: 1. Run the bmcdiscover command on the dynamic range of IP addresses: $ bmcdiscover --range 10.2.254.1-254 -w node-8335-gth-7882AAA: objtype=node groups=all bmc=10.2.254.1 cons=ipmi mgt=ipmi mtm=8335-GTH serial=7882AAA nodetype=mp hwtype=bmc

2. Verify that the BMC objects are listed as node objects by running the lsdef command. Note the attributes or values nodetype=mp and hwtype=bmc. $ lsdef node-8335-gth-7882AAA (node)

$ lsdef node-8335-gth-7882AAA Object name: node-8335-gth-7882AAA bmc=10.2.254.1 cons=ipmi groups=all hwtype=bmc mgt=ipmi mtm=8335-GTH nodetype=mp postbootscripts=otherpkgs postscripts=syslog,remoteshell,syncfiles serial=7882AAA

When BMC objects and IPMI authentication credentials are defined, you can run xCAT commands on the BMC objects, as shown in the following examples:  The rpower command for power management  The rcons command for console sessions (requires the makeconservercf command first)  The rsetboot command for the boot-method selection

Note: It is always possible to run ipmitool commands to the BMC IP address as well.
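For example, with the temporary BMC object that was discovered in the previous step, the following commands are a sketch of typical checks (the object name matches the bmcdiscover output that is shown earlier):
$ rpower node-8335-gth-7882AAA status
$ rsetboot node-8335-gth-7882AAA net
$ makeconservercf
$ rcons node-8335-gth-7882AAA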

5.5.5 Defining node objects

xCAT node discovery requires minimal node objects that can match the MTMS information, which can be collected either automatically by the Genesis image (by using the mtm and serial attributes) or manually. For more information, see 5.2.1, “The Intelligent Platform Management Interface tool” on page 52.

The node discovery also attempts to match the information with the temporary BMC objects to associate the node objects with their respective BMCs and perform the network configuration of the BMCs.

This section describes creating a node group to set the attributes that are common among nodes or based on regular expressions. For more information about xCAT regular expressions, see Groups and Regular Expressions in Tables: Using Regular Expressions in the xCAT Tables.

To define a node group, complete the following steps: 1. Create the ac922 node group by running the mkdef command: $ mkdef -t group ac922\ ip='|p9r(\d+)n(\d+)|10.1.($1+0).($2+0)|' \ bmc='|p9r(\d+)n(\d+)|10.2.($1+0).($2+0)|' \ mgt=ipmi \ cons=ipmi Warning: Cannot determine a member list for group 'ac922'. 1 object definitions have been created or modified.

2. Verify the node group by running the lsdef command: $ lsdef -t group ac922 Object name: ac922 bmc=|p9r(\d+)n(\d+)|10.2.($1+0).($2+0)| cons=ipmi grouptype=static ip=|p9r(\d+)n(\d+)|10.1.($1+0).($2+0)| members= mgt=ipmi

To create a node that is part of the node group, complete the following steps: 1. Create a node by running the mkdef command and include the node group in the groups attribute. You can also modify a node and make it part of a group by running the chdef command. $ mkdef p9r1n1 groups=all,ac922 1 object definitions have been created or modified. To create many nodes at the same time, you can use a node range: $ mkdef p9r[1-5]n[1-6] groups=all,ac922 30 object definitions have been created or modified. 2. Verify that the node inherits the node group’s attributes by running the lsdef command. Notice that the attributes that are based on regular expressions are evaluated according to the name of the node. Some other attributes are set by xCAT by default. $ lsdef p9r1n1 Object name: p9r1n1 bmc=10.2.1.1 cons=ipmi groups=all,ac922 ip=10.1.1.1 mgt=ipmi postbootscripts=otherpkgs postscripts=syslog,remoteshell,syncfile,confignics

To set the mtm and serial attributes to match the BMC object, complete the following steps: 1. Set the attributes by running the chdef command. You can also set the attributes when creating the node object by running the mkdef command. $ chdef p9r1n1 mtm=8335-GTH serial=7882AAA 1 object definitions have been created or modified.

Note: The mtm and serial attributes are case-sensitive.

2. Verify the attributes by running the lsdef command: $ lsdef p9r1n1 Object name: p9r1n1 bmc=10.2.1.1 cons=ipmi groups=all,ac922 ip=10.1.1.1 mgt=ipmi mtm=8335-GTH postbootscripts=otherpkgs

postscripts=syslog,remoteshell,syncfile,confignics serial=7882AAA 3. Enable the bmcsetup script for all nodes in the ac922 group to set the BMC IP to static on first boot: $ chdef -t group ac922 chain="runcmd=bmcsetup"

5.5.6 Configuring the host table, DNS, and DHCP servers

xCAT requires the node objects to be present and up-to-date in configuration files for the host table, DNS server, and DHCP server.

Note: This step is important for the node discovery process, which otherwise can show errors that are difficult to trace to specific misconfiguration steps.

The configuration files must be updated after the following changes are made, which are reflected in the configuration files:  Adding or removing node objects  Adding, modifying, or removing host names, IP addresses, or aliases for network interfaces

The order of the commands is relevant because some commands depend on changes that are performed by other commands. For more information, see the man pages of the makehosts, makedns, and makedhcp commands:  $ man makehosts  $ man makedns  $ man makedhcp

To update the configuration files with the node objects, complete the following steps: 1. Update the host table by running the makehosts command: $ makehosts ac922 2. Verify that the node objects are present on the host table: $ cat /etc/hosts 127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4 ::1 localhost localhost.localdomain localhost6 localhost6.localdomain6 10.1.0.1 xcat-mn.xcat-cluster xcat-mn 10.1.1.1 p9r1n1 p9r1n1.xcat-cluster 3. Update the DNS server configuration by running the makedns command: $ makedns -n ac922 Handling p9r1n1 in /etc/hosts. Getting reverse zones, this take several minutes for a large cluster. Completed getting reverse zones. Updating zones. Completed updating zones. Restarting named Restarting named complete Updating DNS records, this take several minutes for a large cluster. Completed updating DNS records.

4. Verify that the DNS server can resolve the name of the node by running the host command: $ host p9r1n1 10.1.0.1 Using domain server: Name: 10.1.0.1 Address: 10.1.0.1#53 Aliases:

p9r1n1.xcat-cluster has address 10.1.1.1 5. Update the DHCP server’s configuration file by running the makedhcp command: $ makedhcp -n 6. Update the DHCP server’s leases file by running the makedhcp command: $ makedhcp -a

5.5.7 Booting into node discovery

You can boot the nodes into node discovery during power on (or power cycle) if the boot order configuration is correct. For more information, see 5.2.4, “Boot order configuration” on page 57.
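If the boot order was not preconfigured, the following commands are a sketch (based on the commands that were introduced earlier in this chapter) of selecting network boot and power cycling the node through its temporary BMC object:
$ rsetboot node-8335-gth-7882AAA net
$ rpower node-8335-gth-7882AAA boot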

You can watch the progress of the node discovery process in the /var/log/messages file, which is described with comments in Example 5-26.

Example 5-26 Contents and comments for the /var/log/messages file during node discovery $ tail -f /var/log/messages ...

Petitboot (DHCP client acquires an IP address, and releases it before booting):

... dhcpd: DHCPDISCOVER from 98:be:94:59:f0:f2 via enP5p1s0f0 ... dhcpd: DHCPOFFER on 10.1.254.4 to 98:be:94:59:f0:f2 via enP5p1s0f0 ... dhcpd: DHCPREQUEST for 10.1.254.4 (10.1.0.1) from 98:be:94:59:f0:f2 via enP5p1s0f0 ... dhcpd: DHCPACK on 10.1.254.4 to 98:be:94:59:f0:f2 via enP5p1s0f0 ... dhcpd: DHCPRELEASE of 10.1.254.4 from 98:be:94:59:f0:f2 via enP5p1s0f0 (found)

Genesis (DHCP client acquires an IP address):

... dhcpd: DHCPDISCOVER from 98:be:94:59:f0:f2 via enP5p1s0f0 ... dhcpd: DHCPOFFER on 10.1.254.5 to 98:be:94:59:f0:f2 via enP5p1s0f0 ... dhcpd: DHCPREQUEST for 10.1.254.5 (10.1.0.1) from 98:be:94:59:f0:f2 via enP5p1s0f0 ... dhcpd: DHCPACK on 10.1.254.5 to 98:be:94:59:f0:f2 via enP5p1s0f0

Genesis (Communication with the xCAT Management Node; some error messages and duplicated steps are apparently OK):

... xcat[30280]: xCAT: Allowing getcredentials x509cert ... xcat[17646]: xcatd: Processing discovery request from 10.1.254.5 ... xcat[17646]: Discovery Error: Could not find any node. ... xcat[17646]: Discovery Error: Could not find any node. ... xcat[17646]: xcatd: Processing discovery request from 10.1.254.5

Genesis (The respective BMC object is identified, used for configuring the BMC according to the node object, and then removed):

... xcat[17646]: Discovery info: configure password for pbmc_node:node-8335-gth-0000000000000000. ... xcat[39159]: xCAT: Allowing rspconfig to node-8335-gth-0000000000000000 password= for root from localhost ... xcat[39168]: xCAT: Allowing chdef node-8335-gth-0000000000000000 bmcusername= bmcpassword= for root from localhost ... xcat[17646]: Discover info: configure ip:10.2.1.1 for pbmc_node:node-8335-gth-0000000000000000. ... xcat[39175]: xCAT: Allowing rspconfig to node-8335-gth-0000000000000000 ip=10.2.1.1 for root from localhost ... xcat[17646]: Discovery info: remove pbmc_node:node-8335-gth-0000000000000000. ... xcat[39184]: xCAT: Allowing rmdef node-8335-gtb-0000000000000000 for root from localhost ... xcatd: Discovery worker: fsp instance: nodediscover instance: p9r1n1 has been discovered ... xcat[17646]: Discovery info: configure password for pbmc_node:node-8335-gth-0000000000000000. ... xcat[39217]: xCAT: Allowing chdef node-8335-gth-0000000000000000 bmcusername= bmcpassword= for root from localhost ... xcat[17646]: Discover info: configure ip:10.2.1.1 for pbmc_node:node-8335-gth-0000000000000000. ... xcat[17646]: Discovery info: remove pbmc_node:node-8335-gth-0000000000000000.

Genesis (DHCP client releases the temporary IP address, and acquires the permanent IP address):

... dhcpd: DHCPRELEASE of 10.1.254.5 from 98:be:94:59:f0:f2 via enP5p1s0f0 (found) ... dhcpd: DHCPDISCOVER from 98:be:94:59:f0:f2 via enP5p1s0f0 ... dhcpd: DHCPOFFER on 10.1.1.1 to 98:be:94:59:f0:f2 via enP5p1s0f0 ... dhcpd: DHCPREQUEST for 10.1.1.1 (10.1.0.1) from 98:be:94:59:f0:f2 via enP5p1s0f0 ... dhcpd: DHCPACK on 10.1.1.1 to 98:be:94:59:f0:f2 via enP5p1s0f0

Genesis (Cleanup of the BMC discovery):

... xcat[39230]: xCAT: Allowing rmdef node-8335-gth-0000000000000000 for root from localhost ... xcatd: Discovery worker: fsp instance: nodediscover instance: Failed to notify 10.1.254.5 that it's actually p9r1n1.

Genesis (Further communication with the xCAT Management Node):

... xcat[39233]: xCAT: Allowing getcredentials x509cert from p9r1n1 ... xcat: credentials: sending x509cert

The Genesis image remains waiting for further instructions from the xCAT management node and is accessible through SSH.
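For example, you can confirm that the discovered node is reachable and still running the Genesis environment with a quick SSH check (a sketch; any simple command serves the purpose):
$ ssh p9r1n1 uname -r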

Note: During the BMC network configuration steps, network connectivity can be lost (including the IPv6 link-local address). In this case, reset the connection by using in-band IPMI with the ipmitool command that is included in the Genesis image. This limitation might be addressed in a future xCAT or firmware version.

The steps to reset the BMC by using in-band IPMI on the Genesis image and wait for the BMC to come back online are shown in Example 5-27.

Example 5-27 Resetting the BMC by using in-band IPMI on the Genesis image Reset the BMC via in-band IPMI on the Genesis image:

$ ssh p9r1n1 'ipmitool mc reset cold'

Wait for the BMC to come back online: (This example is based on the link-local IPv6 address; you can also use the IPv4 address assigned on node discovery; e.g., ping 10.2.1.1)

$ while ! ping6 -c 1 -q fe80::72e2:84ff:fe14:254%enP5p1s0f0; do echo Waiting; done; echo; echo Finished PING fe80::72e2:84ff:fe14:254%enP5p1s0f0(fe80::72e2:84ff:fe14:254) 56 data bytes

--- fe80::72e2:84ff:fe14:254%enP5p1s0f0 ping statistics --- 1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

Waiting PING fe80::72e2:84ff:fe14:254%enp1s0f3(fe80::72e2:84ff:fe14:254) 56 data bytes

--- fe80::72e2:84ff:fe14:254%enP5p1s0f0 ping statistics --- 1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

<...> Waiting PING fe80::72e2:84ff:fe14:254%enP5p1s0f0(fe80::72e2:84ff:fe14:254) 56 data bytes

--- fe80::72e2:84ff:fe14:254%enP5p1s0f0 ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 0.606/0.606/0.606/0.000 ms

Finished

You can watch the node discovery on the node console through IPMI. The BMC IP address can change as a result of the process. Therefore, use the BMC IPv6 link-local address (which does not change) for that purpose (see Example 5-28).

Example 5-28 Node discovery on the console by using IPMI with the rcons or ipmitool commands With the rcons command (requires initial configuration with the makeconservercf command): $ makeconservercf $ rcons node-8335-gth-0000000000000000

With the ipmitool command (via IPv6 link-local address): $ ipmitool -I lanplus -H fe80::72e2:84ff:fe14:254%enp1s0f3 -P 0penBmc sol activate

You can verify that the node object now contains attributes that are obtained during the node discovery process (for example, hardware characteristics) and other xCAT attributes by running the lsdef command (see Example 5-29).

Example 5-29 Node object with attributes that are obtained during node discovery $ lsdef p9r1n1 Object name: p9r1n1 arch=ppc64 bmc=10.2.1.1 cons=ipmi cpucount=192 cputype=POWER9 (raw), altivec supported disksize=sda:1000GB,sdb:1000GB groups=all,ac922 ip=10.1.1.1 mac=98:be:94:59:f0:f2 memory=261482MB mgt=ipmi mtm=8335-GTH netboot=petitboot nodetype=mp postbootscripts=otherpkgs postscripts=syslog,remoteshell,syncfiles serial=0000000000000000 status=standingby statustime=11-25-2016 23:13:50 supportedarchs=ppc64

5.6 xCAT compute nodes (stateless)

This section describes the deployment of an xCAT compute node with the IBM HPC software that is running on Red Hat Enterprise Linux Server 7.5 for PowerPC 64-bit Little Endian (ppc64le) in non-virtualized (or bare-metal) mode on the Power AC922 server.

The steps that are described in this section cover the stateless (diskless) installation. The image includes only runtime libraries to keep the image size as small as possible. For more information about a stateful image with full compiler support, see 5.7, “xCAT login nodes (stateful)” on page 133.

5.6.1 Network interfaces

The network interface that is associated with the management network is known as a primary network interface (or adapter). The network interfaces that are associated with other networks are known as secondary (or additional) network interfaces (or adapters).

For more information, see Configure Additional Network Interfaces - confignics.

Primary network interface The primary network interface is the network interface that is connected to the management network. The following attributes are important for this network interface:  primarynic This attribute identifies the primary network interface. Set it to mac to use the network interface with the MAC address that is specified by the mac attribute (collected during node discovery).  installnic This attribute identifies the network interface that is used for OS installation, which is usually the primary network interface. Set it to mac.

To set the attributes for the primary network interface to mac, complete the following steps: 1. Set the primarynic and installnic by running the chdef command: $ chdef -t group ac922 \ installnic=mac \ primarynic=mac 1 object definitions have been created or modified. 2. Verify the attributes by running the lsdef command: $ lsdef -t group ac922 -i installnic,primarynic Object name: ac922 installnic=mac primarynic=mac

Secondary network interfaces xCAT uses the information from the nics table to configure the network interfaces in the nodes to be part of the xCAT networks that are defined in the networks table. For example, the following attributes are used:  The nicips attribute for IP addresses  The nicnetworks attribute for xCAT networks (defined in the networks table)  The nictypes attribute for the type of networks (for example, Ethernet or InfiniBand)  (Optional) nichostnamesuffixes for appending per-network suffixes to host names

The attribute format for the nics table uses several types of field separators, which allows for each field to relate to multiple network interfaces with multiple values per network interface (for example, IP addresses and host names). The format is a comma-separated list of interface!values pairs (that is, one pair per network interface), where values is a pipe-separated list of values (that is, all values that are assigned to that network interface).

For values with regular expressions, include the xCAT regular expression delimiters or pattern around the value (that is, leading pipe, regular expression pattern, separator pipe, value, and trailing pipe).

Consider the following example:  Two network interfaces: eth0 and eth1  Two IP addresses each (eth0 with 192.168.0.1 and 192.168.0.2; eth1 with 192.168.0.3 and 192.168.0.4): nicips='eth0!192.168.0.1|192.168.0.2,eth1!192.168.0.3|192.168.0.4'

Two host name suffixes each (eth0 with -port0ip1 and -port0ip2; eth1 with -port1ip1 and -port1ip2): nichostnamesuffixes='eth0!-port0ip1|-port0ip2,eth1!-port1ip1|-port1ip2'

To configure the InfiniBand network interfaces of the compute nodes, complete the following steps: 1. Set the attributes of the nics table in the node group by running the chdef command: $ chdef -t group ac922\ nictypes='ib0!InfiniBand' \ nicnetworks='ib0!net-app-ib' \ nichostnamesuffixes='ib0!-ib' \ nicips='|p9r(\d+)n(\d+)|ib0!10.10.($1+0).($2+0)|' 1 object definitions have been created or modified.

Note: The following definition results in the same IP: nicips='|ib0!10.10.($2+0).($3+0)|'

2. Verify the attributes by running the lsdef command. They are represented with per-interface subattributes (that is, .=). $ lsdef --nics -t group ac922 Object name: ac922 nichostnamesuffixes.ib0=-ib nicips.ib0=10.10.($1+0).($2+0) nicnetworks.ib0=net-app-ib nictypes.ib0=InfiniBand

Note: To view the full nicips value, add --verbose to the lsdef command.

3. Verify that the attributes that are based on regular expressions have the correct values on a particular node by running the lsdef command: lsdef --nics p9r1n1 Object name: p9r1n1 nichostnamesuffixes.ib0=-ib nicips.ib0=10.10.1.1 nicnetworks.ib0=net-app-ib nictypes.ib0=InfiniBand 4. Update the configuration files for the host table, DNS server, and DHCP server by running the makehosts, makedns, and makedhcp commands: $ makehosts ac922

$ cat /etc/hosts 127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4 ::1 localhost localhost.localdomain localhost6 localhost6.localdomain6 10.1.0.1 xcat-mn.xcat-cluster xcat-mn 10.1.1.1 p9r1n1 p9r1n1.xcat-cluster 10.10.1.1 p9r1n1-ib p9r1n1-ib.xcat-cluster

Note: If a network interface prefix or suffix is renamed or removed, you can update the host table by removing the node entry (or nodes or group entries) and adding them back by running the following commands: $ makehosts -d nodes $ makehosts nodes

$ makedns -n ac922 Handling p9r1n1 in /etc/hosts. Handling p9r1n1-ib in /etc/hosts. Getting reverse zones, this takes several minutes for a large cluster. Completed getting reverse zones. Updating zones. Completed updating zones. Restarting named Restarting named complete Updating DNS records, this takes several minutes for a large cluster. Completed updating DNS records.

$ makedhcp -n $ makedhcp -a 5. Add the confignics script to the list of postscripts for configuring the network interfaces. The InfiniBand network interfaces (2-port adapter) require the argument --ibaports=2. $ chdef -t group ac922 --plus postscripts='confignics --ibaports=2' 1 object definitions have been created or modified.

$ lsdef -t group ac922 -i postscripts Object name: ac922 postscripts=confignics --ibaports=2 For more information about the confignics script and the configuration of InfiniBand adapters, see the following xCAT documentation pages: – Configure Additional Network Interfaces - confignics – InfiniBand Network Configuration

Public or site network connectivity (optional) The connectivity to the public or site networks for the compute nodes can be provided by using one of the following methods:  By using the management node with network address translation (NAT)  By using compute nodes with another xCAT network

To perform the required network configuration, follow the steps of either method.

Management node method This method requires that the gateway attribute of the xCAT management network is set to the management node (the default setting is <xcatmaster>), and that the NAT rules are configured in the firewall. To connect, complete the following steps: 1. Verify that the gateway attribute of the management network is set to <xcatmaster> by running the lsdef command: $ lsdef -t network net-mgmt -i gateway Object name: net-mgmt gateway=<xcatmaster>

If not, set it by running the chdef command: $ chdef -t network net-mgmt gateway='<xcatmaster>' 2. Verify that the routers option in the DHCP server configuration file (which is based on the gateway attribute) for the management network reflects the IP address of the management node: $ grep 'subnet\|routers' /etc/dhcp/dhcpd.conf subnet 10.1.0.0 netmask 255.255.0.0 { option routers 10.1.0.1; } # 10.1.0.0/255.255.0.0 subnet_end subnet 10.2.0.0 netmask 255.255.0.0 { option routers 10.2.0.1; } # 10.2.0.0/255.255.0.0 subnet_end If not, regenerate the DHCP server configuration files by running the makedhcp command: $ makedhcp -n $ makedhcp -a 3. Verify that the default route on the compute nodes is set to the IP address of the management node by running the ip command: $ xdsh p9r1n1 'ip route show | grep default' p9r1n1: default via 10.1.0.1 dev enP5p1s0f0 If not, restart the network service by running the systemctl command: $ xdsh p9r1n1 'systemctl restart network' 4. Configure the iptables firewall rules for NAT in the rc.local script. Configure it to start automatically on boot and start the service manually this time. Consider the following points regarding the network scenario in this chapter: – The management network on network interface enP5p1s0f0 is in the management node. – The public or site network on network interface enp1s0f2 is in the management node. $ cat <<EOF >>/etc/rc.d/rc.local iptables -t nat --append POSTROUTING --out-interface enp1s0f2 -j MASQUERADE iptables --append FORWARD --in-interface enP5p1s0f0 -j ACCEPT EOF

$ chmod +x /etc/rc.d/rc.local $ systemctl start rc-local

Note: The default firewall in Red Hat Enterprise Linux Server 7.5 is firewalld, which is disabled by xCAT.

For more information about firewalld, see the Using Firewalls Red Hat Enterprise Linux documentation page.
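If you want to confirm the firewall state on the management node before adding the NAT rules, the following check is a sketch (firewalld is expected to be reported as inactive or disabled, as noted above):
$ systemctl is-active firewalld
$ systemctl is-enabled firewalld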

5. You can verify the network connectivity on a running node (if any) by running the ping command: $ xdsh p9r1n1 'ping -c1 example.com' p9r1n1: PING example.com (93.184.216.34) 56(84) bytes of data. p9r1n1: 64 bytes from 93.184.216.34: icmp_seq=1 ttl=46 time=4.07 ms <...>

Compute nodes method This method requires the gateway attribute of the xCAT management network not to be set, and another xCAT network (for the public or site network) to define the network interface, address, and gateway to be used. Complete the following steps: 1. Reset the gateway attribute of the management network by running the chdef command: $ chdef -t network net-mgmt gateway=''

$ lsdef -t network net-mgmt -i gateway Object name: net-mgmt gateway= 2. Define an xCAT network for the public or site network (for example, net-site): $ mkdef -t network net-site net=9.x.x.x mask=255.255.255.0 1 object definitions have been created or modified.

# lsdef -t network net-app-ib (network) net-mgmt (network) net-site (network) net-svc (network)

$ lsdef -t network net-site Object name: net-site mask=255.255.255.0 net=9.x.x.x 3. Modify the nics table to include the attributes of the new xCAT network by running the tabedit command or the chdef command. The network scenario that is described in this chapter uses a public or site network on network interface enp1s0f1 in the compute nodes. $ tabedit nics

OR:

$ chdef -t group ac922\ nictypes='enp1s0f1!Ethernet,ib0!InfiniBand' \ nicnetworks='enp1s0f1!net-site,ib0!net-app-ib' \ nichostnamesuffixes='enp1s0f1!-site,ib0!-ib' \ nicips='|p9r(\d+)n(\d+)|enp1s0f1!9.x.x.(181-$2),ib0!10.10.($1+0).($2+0)|' \ nicextraparams='enp1s0f1!GATEWAY=9.x.x.254' 1 object definitions have been created or modified. 4. Verify the attributes by running the tabdump command or lsdef command: $ tabdump nics #node,nicips,nichostnamesuffixes,nichostnameprefixes,nictypes,niccustomscripts, nicnetworks,nicaliases,nicextraparams,comments,disable "ac922","|p9r(\d+)n(\d+)|enp1s0f1!9.x.x.(181-$2),ib0!10.10.($1+0).($2+0)|","enp 1s0f1!-site,ib0!-ib",,"enp1s0f1!Ethernet,ib0!InfiniBand",,"enp1s0f1!net-site,ib 0!net-app-ib",,"enp1s0f1!GATEWAY=9.x.x.254",,

$ lsdef --nics -t group ac922 Object name: ac922 <...> nicextraparams.enp1s0f1=GATEWAY=9.x.x.254 <...>

nichostnamesuffixes.enp1s0f1=-site <...> nicips.enp1s0f1=9.x.x.(181-$2) <...> nicnetworks.enp1s0f1=net-site <...> nictypes.enp1s0f1=Ethernet

$ lsdef --nics p9r1n1 Object name: p9r1n1 <...> nicextraparams.enp1s0f1=GATEWAY=9.x.x.254 <...> nichostnamesuffixes.enp1s0f1=-site <...> nicips.enp1s0f1=9.x.x.180 <...> nicnetworks.enp1s0f1=net-site <...> nictypes.enp1s0f1=Ethernet 5. Update the host table by running the makehosts command: $ makehosts -d ac922 $ makehosts ac922

$ cat /etc/hosts <...> 10.1.1.1 p9r1n1 p9r1n1.xcat-cluster 10.10.1.1 p9r1n1-ib p9r1n1-ib.xcat-cluster 9.x.x.180 p9r1n1-site p9r1n1-site.xcat-cluster <...> 6. Update the DNS server configuration by running the makedns command: $ makedns -n ac922 <...> Handling p9r1n1-site in /etc/hosts. <...> Completed updating DNS records. 7. Update the DHCP server configuration by running the makedhcp command: $ makedhcp -n $ makedhcp -a 8. You can update the network configuration on a running node (if any) by running the confignics script of the updatenode command: $ updatenode p9r1n1 -P confignics 9. You can verify the network connectivity on a running node (if any) by running the ping command: $ xdsh p9r1n1 'ping -c1 example.com' p9r1n1: PING example.com (93.184.216.34) 56(84) bytes of data. p9r1n1: 64 bytes from 93.184.216.34: icmp_seq=1 ttl=46 time=3.94 ms <...>

5.6.2 Red Hat Enterprise Linux operating system images

xCAT stores the configuration for installing OS in objects of type osimage (operating system image). To create osimage objects that are based on an OS installation disk image (for example, an ISO file), run the copycds command.

Setting the password for the root user To set the root password, complete the following steps: 1. Set the username and password attributes of the system key or row in the passwd table by running the chtab command: $ chtab key=system passwd.username=root passwd.password=cluster 2. Verify the attributes by running the tabdump command: $ tabdump -w key==system passwd #key,username,password,cryptmethod,authdomain,comments,disable "system","root","cluster",,,,

Note: If the attributes are not correctly set, the nodeset command shows the following error message: # nodeset p9r1n1 osimage=rh75-compute-stateless p9r1n1: Error: Unable to find requested filed from table , with key Error: Some nodes failed to set up install resources on server xcat-mn.xcat-cluster, aborting

Creating an osimage object Initially, no osimage objects are available, as shown by the output of the following command: $ lsdef -t osimage Could not find any object definitions to display.

To create osimage objects for the Red Hat Enterprise Linux Server 7.5 installation disk image, complete the following steps: 1. Run the copycds command on the Red Hat Enterprise Linux Server 7.5 ISO file: $ copycds /mnt/RHEL-ALT-7.5-20180315.0-Server-ppc64le-dvd1.iso Copying media to /install/rhels7.5/ppc64le Media copy operation successful 2. Verify that the osimage objects are present by running the lsdef command. $ lsdef -t osimage rhels7.5-alternate-ppc64le-install-compute (osimage) rhels7.5-alternate-ppc64le-install-service (osimage) rhels7.5-alternate-ppc64le-netboot-compute (osimage) rhels7.5-alternate-ppc64le-statelite-compute (osimage) For more information about the osimage objects, see 5.3.5, “xCAT installation types: Disks and state” on page 62.

3. Create a copy of the original osimage object and name it rh75-compute-stateless. This step is optional but useful in case multiple osimage objects are maintained (for example, for multiple or different configurations of the same OS). You can accomplish this task by running the lsdef -z command, which provides the object stanza. Modify the stanza (by running the sed command), and create an object that is based on it (by running the mkdef -z command). $ osimage=rh75-compute-stateless

$ lsdef -t osimage rhels7.5-alternate-ppc64le-netboot-compute -z \ | sed "s/^[^# ].*:/$osimage:/" \ | mkdef -z 1 object definitions have been created or modified. 4. Verify the copied osimage object by running the lsdef command: $ lsdef -t osimage rh75-compute-stateless (osimage) rhels7.5-alternate-ppc64le-install-compute (osimage) rhels7.5-alternate-ppc64le-install-service (osimage) rhels7.5-alternate-ppc64le-netboot-compute (osimage) rhels7.5-alternate-ppc64le-statelite-compute(osimage) 5. To view more information about the new osimage, append the name of the object to the command in step 4, as shown in Example 5-30.

Example 5-30 Detailed osimage information lsdef -t osimage $osimage Object name: rh75-compute-stateless exlist=/opt/xcat/share/xcat/netboot/rh/compute.rhels7.ppc64le.exlist imagetype=linux osarch=ppc64le osdistroname=rhels7.5-ppc64le osname=Linux osvers=rhels7.5-alternate otherpkgdir=/install/post/otherpkgs/rhels7.5/ppc64le pkgdir=/install/rhels7.5-alternate/ppc64le pkglist=/opt/xcat/share/xcat/netboot/rh/compute.rhels7.ppc64le.pkglist

postinstall=/opt/xcat/share/xcat/netboot/rh/compute.rhels7.ppc64le.postinstall profile=compute provmethod=netboot rootimgdir=/install/netboot/rhels7.5-alternate/ppc64le/compute

Note: The OS can be installed (without other components of the software stack) by running the nodeset command: $ nodeset p9r1n1 osimage=rh75-compute-stateless

Changing the pkglist attribute The use of multiple package lists is convenient for independently organizing the required packages for each component of the software stack. Although xCAT currently does not support multiple package lists in the pkglist attribute, it does support a package list to reference the contents of other package lists. This feature provides a way to achieve multiple package lists.

The following types of entries can be used in the pkglist file:  RPM name without version numbers.  Group or pattern name that is marked with a “@” (for stateful installation only).  RPMs to be removed after the installation by marking them with a “-” (for stateful installation only).  #INCLUDE:<file># to include other pkglist files.  #NEW_INSTALL_LIST# to signify that the listed RPMs will be installed with a new RPM install command (zypper, yum, or rpm as determined by the function that uses this file).
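For illustration, a hypothetical pkglist file that combines these entry types might look like the following sketch. The package and group names are examples only and are not part of the image that is built in this chapter:
# Base packages
numactl
vim-enhanced
# Include another pkglist
#INCLUDE:/opt/xcat/share/xcat/netboot/rh/compute.rhels7.ppc64le.pkglist#
# Stateful installation only: add a group and remove a package
@base
-abrt
# Install the following RPMs with a separate install command
#NEW_INSTALL_LIST#
screen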

To change the pkglist attribute for a custom package list, complete the following steps: 1. Verify the current pkglist attribute and assign it to the old_pkglist variable: $ lsdef -t osimage rh75-compute-stateless -i pkglist Object name: rh75-compute-stateless pkglist=/opt/xcat/share/xcat/netboot/rh/compute.rhels7.ppc64le.pkglist

$ old_pkglist=/opt/xcat/share/xcat/netboot/rh/compute.rhels7.ppc64le.pkglist 2. In this example, we include the default Red Hat Enterprise Linux Server 7.5 pkglist and add the numactl package to it. Create the list that includes the old list. Then, add numactl. $ new_pkglist=/install/custom/netboot/rh/rh75-compute.pkglist

$ mkdir -p $(dirname $new_pkglist) $ echo "# RHEL Server 7.5 (original pkglist)" >> $new_pkglist $ echo "#INCLUDE:${old_pkglist}#" >> $new_pkglist $ echo "numactl" >> $new_pkglist

Note: All of your custom files must be in the /install/custom directory for easier distinction.

3. Verify the contents of the new list. The trailing “#” character at the end of the line is important for correct pkglist parsing. $ cat $new_pkglist # RHEL Server 7.5 (original pkglist) #INCLUDE:/opt/xcat/share/xcat/netboot/rh/compute.rhels7.ppc64le.pkglist# numactl 4. Change the pkglist attribute to the new list by running the chdef command: $ chdef -t osimage rh75-compute-stateless pkglist=$new_pkglist 1 object definitions have been created or modified. 5. Verify the pkglist attribute by running the lsdef command: $ lsdef -t osimage rh75-compute-stateless -i pkglist Object name: rh75-compute-stateless pkglist=/install/custom/netboot/rh/rh75-compute.pkglist

Changing the rootimgdir attribute Each stateless image must have its own rootimgdir. The image is stored in the rootimgdir=/install/netboot/rhels7.5-alternate/ppc64le/rh75-compute-stateless directory. As a best practice, match the last level directory name with the name of the osimage.

Change the rootimgdir to rh75-compute-stateless: $ lsdef -t osimage rh75-compute-stateless -i rootimgdir

Object name: rh75-compute-stateless rootimgdir=/install/netboot/rhels7.5-alternate/ppc64le/compute

$ chdef -t osimage rh75-compute-stateless rootimgdir=/install/netboot/rhels7.5-alternate/ppc64le/rh75-compute-stateless 1 object definitions have been created or modified.

Note: In the next sections, drivers and libraries are added to the stateless image. As a best practice, match the management and stateless image kernel level. If a new kernel level is installed in the management node, complete the procedure at Installing a new kernel in the Diskless image to update the stateless image kernel level to the same one before installing any driver.

5.6.3 NVIDIA CUDA Toolkit

The CUDA Toolkit can be installed with the RPM packages that are contained in the local repository package that is available for download in the NVIDIA website. The packages are marked for installation with the otherpkgdir and otherpkglist mechanisms to be installed during the Linux distribution installation.

Note: For a stateless installation, installing the CUDA packages must be done in the otherpkglist and not the pkglist as with a stateful installation.

For more information, see the xCAT Red Hat Enterprise Linux 7.5 LE documentation page.

To install the CUDA Toolkit, complete the following steps: 1. Install the createrepo package for creating package repositories: $ yum install createrepo 2. Download and extract the CUDA Toolkit RPM package.

Note: The CUDA Toolkit is available for download at the NVIDIA CUDA downloads website.

At the website, click (Operating Systems) Linux → (Architecture) ppc64le → (Distribution) RHEL → (Version) 7 → (Installer Type) rpm (local) → Download.

$ dir=/tmp/cuda $ mkdir -p $dir $ cd $dir

$ curl -sOL https://.../cuda-repo-rhel7-9-2-local-9.2.148-1.ppc64le.rpm $ rpm2cpio cuda-repo-rhel7-9-2-local-9.2.148-1.ppc64le.rpm | cpio -id 2430295 blocks 3. Copy the extracted package repository to a new /install/software directory.

Note: All other software packages are in the central /install/software directory.

$ dir_cuda=/install/software/cuda/9.2 $ mkdir -p $dir_cuda

$ ls -1d $dir_cuda/*

/install/software/cuda/9.2/cuda-9-2-9.2.148-1.ppc64le.rpm <...> /install/software/cuda/9.2/repodata <...> 4. Download the CUDA 9.2 patch and copy the RPM files to $dir_cuda: $ dir_patch=/tmp/cuda/patch $ mkdir -p $dir_patch $ cd $dir_patch $ curl -sOL https://.../patches/1/cuda-repo-rhel7-9-2-148-local-patch-1-1.0-1.ppc64le.rpm $ rpm2cpio cuda-repo-rhel7-9-2-148-local-patch-1-1.0-1.ppc64le.rpm |cpio -id $ cp $dir_patch/var/cuda-repo-9-2-148-local-patch-1/*.rpm $dir_cuda/ 5. Download the NVIDIA driver 396.44 and copy all the RPM files to $dir_cuda: $ dir_driver=/tmp/cuda/driver $ mkdir -p $dir_driver $ cd $dir_driver $ curl -sOL http://us.download.nvidia.com/tesla/396.44/nvidia-driver-local-repo-rhel7-396.44-1.0-1.ppc64le.rpm $ rpm2cpio nvidia-driver-local-repo-rhel7-396.44-1.0-1.ppc64le.rpm |cpio -id $ cp $dir_driver/var/nvidia-driver-local-repo-396.44/*.rpm $dir_cuda/ $ cd $ rm -r /tmp/cuda 6. The CUDA Toolkit contains RPMs that depend on the dkms package. This dkms package is provided by Extra Packages for Enterprise Linux (EPEL). For more information about EPEL, see Fedora’s EPEL wiki page. Download the dkms RPM package from EPEL: $ dir_deps=/install/software/cuda/deps $ mkdir -p $dir_deps $ cd $dir_deps

$ epel_url="https://dl.fedoraproject.org/pub/epel/7/ppc64le" $ dkms_path="${epel_url}/Packages/d/$(curl -sL ${epel_url}/Packages/d | grep -o 'href="dkms-.*\.noarch\.rpm"' | cut -d '"' -f 2)" $ curl -sOL "$dkms_path"

$ ls -1 dkms-2.6.0.1-1.el7.noarch.rpm 7. Create a package repository for the dkms package by running the createrepo command: $ createrepo . <...>

$ ls -1 dkms-2.6.0.1-1.el7.noarch.rpm repodata 8. Create symlinks from the otherpkglist directory to the software directory: $ lsdef -t osimage rh75-compute-stateless -i otherpkgdir Object name: rh75-compute-stateless otherpkgdir=/install/post/otherpkgs/rhels7.5-alternate/ppc64le

$ otherpkgdir=/install/post/otherpkgs/rhels7.5-alternate/ppc64le $ cd $otherpkgdir

$ ln -s $dir_cuda cuda-9.2 $ ln -s $dir_deps cuda-deps

$ ls -l lrwxrwxrwx 1 root root 26 22. Nov 15:55 cuda-9.2 -> /install/software/cuda/9.2 lrwxrwxrwx 1 root root 27 22. Nov 15:55 cuda-deps -> /install/software/cuda/deps 9. Create the otherpkglist file with relative paths from otherpkgdir to cuda-runtime and the dkms package: $ lsdef -t osimage rh75-compute-stateless -i pkglist,otherpkgdir,otherpkglist Object name: rh75-compute-stateless otherpkgdir=/install/post/otherpkgs/rhels7.5-alternate/ppc64le otherpkglist= pkglist=/install/custom/netboot/rh/rh75-compute.pkglist

$ otherpkgdir=/install/post/otherpkgs/rhels7.5-alternate/ppc64le $ otherpkglist=/install/custom/netboot/rh/rh75-compute.otherpkglist

$ cd $otherpkgdir; find -L * -type f \( -name "cuda-runtime*" -o -name "dkms*" \) cuda-9.2/cuda-runtime-9-2-9.2.148-1.ppc64le.rpm cuda-deps/dkms-2.6.0.1-1.el7.noarch.rpm $ cat <<EOF >>$otherpkglist # CUDA cuda-deps/dkms cuda-9.2/cuda-runtime-9-2 EOF 10. Set the otherpkglist attribute for the osimage: $ chdef -t osimage rh75-compute-stateless --plus otherpkglist=$otherpkglist 1 object definitions have been created or modified.

$ lsdef -t osimage rh75-compute-stateless -i otherpkgdir,otherpkglist Object name: rh75-compute-stateless otherpkgdir=/install/post/otherpkgs/rhels7.5-alternate/ppc64le otherpkglist=/install/custom/netboot/rh/rh75-compute.otherpkglist 11.Create a custom postinstall script that runs cuda_power9_setup and include it in the osimage definition: $ lsdef -t osimage rh75-compute-stateless -i postinstall Object name: rh75-compute-stateless postinstall=/opt/xcat/share/xcat/netboot/rh/compute.rhels7.ppc64le.postinstall

$ postinstall=/opt/xcat/share/xcat/netboot/rh/compute.rhels7.ppc64le.postinstall $ new_postinstall=/install/custom/netboot/rh/rh75-compute.postinstall $ cp $postinstall $new_postinstall $ cat >>$new_postinstall <<-EOF #install CUDA driver script on stateless /install/postscripts/cuda_power9_setup EOF $ chdef -t osimage rh75-compute-stateless postinstall=$new_postinstall $ lsdef -t osimage rh75-compute-stateless -i postinstall Object name: rh75-compute-stateless postinstall=/install/custom/netboot/rh/rh75-compute.postinstall

5.6.4 Mellanox OpenFabrics Enterprise Distribution

To install the Mellanox OFED for Linux, complete the following steps.

For more information, see the following xCAT documentation resources:  Mellanox OFED Installation Script - Preparation page  InfiniBand Network Configuration page 1. Download and copy the ISO file to the xCAT installation directory by running the following command.

Note: The Mellanox OFED is available for download from the Mellanox downloads website.

At the website, find the IBM HPC and Technical Clusters section and download the ISO file MLNX_OFED_LINUX-4.3-4.0.5.1-rhel7.5alternate-ppc64le.iso from the IBM MTM 8335-GT[W,C,G,H,X] row.

Because you must accept a user agreement, the package must be downloaded by using a browser.

$ dir=/install/software/mofed $ mkdir -p $dir

# Download via browser, then scp to the xCAT management server $ scp MLNX_OFED_LINUX-4.3-4.0.5.1-rhel7.5alternate-ppc64le.iso xcat-mn:/install/software/mofed

$ lsdef -t osimage rh75-compute-stateless -i postinstall Object name: rh75-compute-stateless postinstall=/install/custom/netboot/rh/rh75-compute.postinstall

$ echo "# Install InfiniBand MOFED" >> $new_postinstall $ echo "/install/postscripts/mlnxofed_ib_install -p $dir/$file -m $args -end- -i \$installroot -n genimage" >> $new_postinstall $ echo "rm -rf \$installroot/tmp/MLNX_OFED_LINUX*" >> $new_postinstall

$ cat $new_postinstall

<...> # Install InfiniBand MOFED /install/postscripts/mlnxofed_ib_install -p /install/software/mofed/MLNX_OFED_LINUX-4.3-4.0.5.1-rhel7.5alternate-ppc64le.iso -m --add-kernel-support --without-32bit --without-fw-update --force -end- -i $installroot -n genimage rm -rf $installroot/tmp/MLNX_OFED_LINUX*

Note: The -i $installroot argument is the installation root directory and is passed to the postinstallation script at run time as $1.

4. Certain dependency packages are required to install Mellanox OFED correctly. Add the InfiniBand package list that is provided by xCAT to the package list: $ ib_list=/opt/xcat/share/xcat/ib/netboot/rh/ib.rhels7.ppc64le.pkglist

$ lsdef -t osimage rh75-compute-stateless -i pkglist Object name: rh75-compute-stateless pkglist=/install/custom/netboot/rh/rh75-compute.pkglist $ pkglist=/install/custom/netboot/rh/rh75-compute.pkglist $ cat $pkglist # RHEL Server 7.5 (original pkglist) #INCLUDE:/opt/xcat/share/xcat/netboot/rh/compute.rhels7.ppc64le.pkglist# numactl $ echo "# InfiniBand MOFED Dependencies (sample pkglist)" >> $pkglist $ echo "#INCLUDE:${ib_list}#" >> $pkglist

$ cat $pkglist # RHEL Server 7.5 (original pkglist) #INCLUDE:/opt/xcat/share/xcat/netboot/rh/compute.rhels7.ppc64le.pkglist# numactl # InfiniBand MOFED Dependencies (sample pkglist) #INCLUDE:/opt/xcat/share/xcat/ib/netboot/rh/ib.rhels7.ppc64le.pkglist#

5.6.5 IBM XL C/C++ runtime libraries

To install the IBM Xpertise Library (XL) C/C++ Compiler xCAT software kit, complete the following steps. For more information, see the xCAT IBM Xpertise Library Compilers documentation page. 1. Download the partial kit (distributed by the xCAT project): $ mkdir /tmp/xlc $ cd /tmp/xlc

$ curl -sOL https://xcat.org/files/kits/hpckits/2.14/rhels7.5/xlc-16.1.0-1-ppc64le.NEED_PRODUCT_PKGS.tar.bz2 2. Build the complete kit by combining the partial kit and the product packages by running the buildkit command: $ dir=/install/software/compilers/xlc

$ ls -1 $dir libxlc-16.1.0.0-180413g.ppc64le.rpm libxlc-devel.16.1.0-16.1.0.0-180413g.ppc64le.rpm

libxlmass-devel.9.1.0-9.1.0.0-180412a.ppc64le.rpm libxlsmp-5.1.0.0-180412a.ppc64le.rpm libxlsmp-devel.5.1.0-5.1.0.0-180412a.ppc64le.rpm xlc-license.16.1.0-16.1.0.0-180413g.ppc64le.rpm xlc.16.1.0-16.1.0.0-180413g.ppc64le.rpm $ buildkit addpkgs xlc-16.1.0-1-ppc64le.NEED_PRODUCT_PKGS.tar.bz2 --pkgdir $dir Extracting tar file /tmp/xlc/xlc-16.1.0-1-ppc64le.NEED_PRODUCT_PKGS.tar.bz2. Please wait. <...> Kit tar file /tmp/xlc/xlc-16.1.0-1-ppc64le.tar.bz2 successfully built.

$ ls -1 xlc-16.1.0-1-ppc64le.NEED_PRODUCT_PKGS.tar.bz2 xlc-16.1.0-1-ppc64le.tar.bz2 3. Add the kit to xCAT by running the addkit command: $ lsdef -t kit Could not find any object definitions to display.

$ addkit xlc-16.1.0-1-ppc64le.tar.bz2 Adding Kit xlc-16.1.0-1-ppc64le Kit xlc-16.1.0-1-ppc64le was successfully added.

$ lsdef -t kit xlc-16.1.0-1-ppc64le (kit)

$ cd .. $ rm -r /tmp/xlc 4. Verify the kitcomponent objects (and description fields) by running the lsdef command: $ lsdef -t kitcomponent -w kitname==xlc-16.1.0-1-ppc64le -i description Object name: xlc.compiler-compute-16.1.0-1-rhels-7-ppc64le description=XLC16 for compiler kitcomponent Object name: xlc.license-compute-16.1.0-1-rhels-7-ppc64le description=XLC16 license kitcomponent Object name: xlc.rte-compute-16.1.0-1-rhels-7-ppc64le description=XLC16 for runtime kitcomponent 5. Add the following kitcomponent (and dependencies) to the osimage object by running the addkitcomp command: $ lsdef -t osimage rh75-compute-stateless -i kitcomponents,otherpkglist Object name: rh75-compute-stateless kitcomponents= otherpkglist=/install/custom/netboot/rhel/rh75-compute.otherpkglist

$ addkitcomp --adddeps -i rh75-compute-stateless \ xlc.rte-compute-16.1.0-1-rhels-7-ppc64le Assigning kit component xlc.rte-compute-16.1.0-1-rhels-7-ppc64le to osimage rh75-compute-stateless Kit components xlc.rte-compute-16.1.0-1-rhels-7-ppc64le were added to osimage rh75-compute-stateless successfully

$ lsdef -t osimage rh75-compute-stateless -i kitcomponents,otherpkglist Object name: rh75-compute-stateless


kitcomponents=xlc.license-compute-16.1.0-1-rhels-7-ppc64le,xlc.rte-compute-16.1 .0-1-rhels-7-ppc64le

otherpkglist=/install/osimages/rh75-compute-stateless/kits/KIT_DEPLOY_PARAMS.ot herpkgs.pkglist,/install/custom/netboot/rhel/rh75-compute.otherpkglist,/install /osimages/rh75-compute-stateless/kits/KIT_COMPONENTS.otherpkgs.pkglist

Note: For more information about the use of a subset of compute nodes for compiling applications, see 5.7, “xCAT login nodes (stateful)” on page 133.

5.6.6 IBM XL Fortran runtime libraries

To install the IBM XL Fortran Compiler xCAT kit, complete the following steps. For more information, see the xCAT IBM Xpertise Library Compilers documentation page. 1. Download the partial kit (distributed by the xCAT project): $ mkdir /tmp/xlf $ cd /tmp/xlf

$ curl -sOL https://xcat.org/files/kits/hpckits/2.14/rhels7.5/xlf-16.1.0-1-ppc64le.NEED_PRO DUCT_PKGS.tar.bz2 2. Build the complete kit by combining the partial kit and the product packages that are distributed in the installation media by running the buildkit command: $ dir=/install/software/compilers/xlf

$ ls -1 $dir libxlf-16.1.0.0-180413g.ppc64le.rpm libxlf-devel.16.1.0-16.1.0.0-180413g.ppc64le.rpm libxlmass-devel.9.1.0-9.1.0.0-180412a.ppc64le.rpm libxlsmp-5.1.0.0-180412a.ppc64le.rpm libxlsmp-devel.5.1.0-5.1.0.0-180412a.ppc64le.rpm xlf-license.16.1.0-16.1.0.0-180413g.ppc64le.rpm xlf.16.1.0-16.1.0.0-180413g.ppc64le.rpm $ buildkit addpkgs xlf-16.1.0-1-ppc64le.NEED_PRODUCT_PKGS.tar.bz2 --pkgdir $dir Extracting tar file /tmp/xlf/xlf-16.1.0-1-ppc64le.NEED_PRODUCT_PKGS.tar.bz2. Please wait. <...> Kit tar file /tmp/xlf/xlf-16.1.0-1-ppc64le.tar.bz2 successfully built.

$ ls -1 xlf-16.1.0-1-ppc64le.NEED_PRODUCT_PKGS.tar.bz2 xlf-16.1.0-1-ppc64le.tar.bz2 3. Add the kit to xCAT by running the addkit command: $ addkit xlf-16.1.0-1-ppc64le.tar.bz2 Adding Kit xlf-16.1.0-1-ppc64le Kit xlf-16.1.0-1-ppc64le was successfully added.

$ lsdef -t kit xlc-16.1.0-1-ppc64le (kit) xlf-16.1.0-1-ppc64le (kit) $ cd .. $ rm -r /tmp/xlf

110 IBM High-Performance Computing Insights with IBM Power System AC922 Clustered Solution 4. Verify the kitcomponent objects (and description fields) by running the lsdef command: $ lsdef -t kitcomponent -w kitname==xlf-16.1.0-1-ppc64le -i description Object name: xlf.compiler-compute-16.1.0-1-rhels-7-ppc64le description=XLF16 for compiler kitcomponent Object name: xlf.license-compute-16.1.0-1-rhels-7-ppc64le description=XLF16 license kitcomponent Object name: xlf.rte-compute-16.1.0-1-rhels-7-ppc64le description=XLF16 for runtime kitcomponent 5. Add the following kitcomponent (and dependencies) to the osimage object by running the addkitcomp command: $ addkitcomp --adddeps -i rh75-compute-stateless \ xlf.rte-compute-16.1.0-1-rhels-7-ppc64le Assigning kit component xlf.rte-compute-16.1.0-1-rhels-7-ppc64le to osimage rh75-compute-stateless Kit components xlf.rte-compute-16.1.0-1-rhels-7-ppc64le were added to osimage rh75-compute-stateless successfully

$ lsdef -t osimage rh75-compute-stateless -i kitcomponents Object name: rh75-compute-stateless

kitcomponents=xlc.license-compute-16.1.0-1-rhels-7-ppc64le,xlc.rte-compute-16.1 .0-1-rhels-7-ppc64le,xlf.license-compute-16.1.0-1-rhels-7-ppc64le,xlf.rte-compu te-16.1.0-1-rhels-7-ppc64le

5.6.7 Advance Toolchain runtime libraries

To install the Advance Toolchain, complete the following steps.

Note: For more information about the latest download links, see the Supported Linux Distributions section of the IBM Advance Toolchain for PowerLinux™ Documentation website.

1. Download the product packages: $ dir=/install/software/compilers/at $ mkdir -p $dir $ cd $dir $ wget 'ftp://ftp.unicamp.br/pub/linuxpatch/toolchain/at/redhat/RHEL7/at12.0/*-12.0-0. ppc64le.rpm' <...>

$ ls -1
advance-toolchain-at12.0-devel-12.0-0.ppc64le.rpm
advance-toolchain-at12.0-devel-debuginfo-12.0-0.ppc64le.rpm
advance-toolchain-at12.0-mcore-libs-12.0-0.ppc64le.rpm
advance-toolchain-at12.0-mcore-libs-debuginfo-12.0-0.ppc64le.rpm
advance-toolchain-at12.0-perf-12.0-0.ppc64le.rpm
advance-toolchain-at12.0-perf-debuginfo-12.0-0.ppc64le.rpm
advance-toolchain-at12.0-runtime-12.0-0.ppc64le.rpm
advance-toolchain-at12.0-runtime-at11.0-compat-12.0-0.ppc64le.rpm
advance-toolchain-at12.0-runtime-debuginfo-12.0-0.ppc64le.rpm
advance-toolchain-at12.0-selinux-12.0-0.ppc64le.rpm
advance-toolchain-golang-at-12.0-0.ppc64le.rpm

2. Create a package repository by running the createrepo command: $ createrepo . <...> 3. Add the packages to the otherpkglist of the osimage: $ lsdef -t osimage rh75-compute-stateless -i otherpkglist,otherpkgdir Object name: rh75-compute-stateless otherpkgdir=/install/post/otherpkgs/rhels7.5-alternate/ppc64le

otherpkglist=/install/osimages/rh75-compute-stateless/kits/KIT_DEPLOY_PARAMS.ot herpkgs.pkglist,/install/custom/netboot/rhel/rh75-compute.otherpkglist,/install /osimages/rh75-compute-stateless/kits/KIT_COMPONENTS.otherpkgs.pkglist

$ otherpkgdir=/install/post/otherpkgs/rhels7.5-alternate/ppc64le $ otherpkglist=/install/custom/netboot/rh/rh75-compute.otherpkglist

$ cd $otherpkgdir $ ln -s $dir at

$ ls -l lrwxrwxrwx 1 root root 30 Oct 9 14:10 at -> /install/software/compilers/at lrwxrwxrwx 1 root root 26 Oct 9 11:19 cuda-9.2 -> /install/software/cuda/9.2 lrwxrwxrwx 1 root root 27 Oct 9 11:19 cuda-deps -> /install/software/cuda/deps lrwxrwxrwx 1 root root 69 Oct 10 12:04 xlc-16.1.0-1-rhels-7-ppc64le -> /install/kits/xlc-16.1.0-1-ppc64le/repos/xlc-16.1.0-1-rhels-7-ppc64le lrwxrwxrwx 1 root root 69 Oct 10 12:11 xlf-16.1.0-1-rhels-7-ppc64le -> /install/kits/xlf-16.1.0-1-ppc64le/repos/xlf-16.1.0-1-rhels-7-ppc64le

$ echo "# Advance Toolchain" >> $otherpkglist $ echo "at/advance-toolchain-at12.0-runtime" >> $otherpkglist $ echo "at/advance-toolchain-at12.0-mcore-libs" >> $otherpkglist

$ cat $otherpkglist # CUDA cuda-deps/dkms cuda-9.2/cuda-runtime-9-2 # Advance Toolchain at/advance-toolchain-at12.0-runtime at/advance-toolchain-at12.0-mcore-libs

Note: Advance Toolchain is approximately 2 GB after installation. For stateless nodes, it is recommended to install this package in a shared file system to save main memory on the compute node.

5.6.8 IBM Spectrum MPI

To install IBM Spectrum Message Passing Interface (IBM Spectrum MPI), complete the following steps:

1. Create a directory for the IBM Spectrum MPI and place all the RPMs: $ dir=/install/software/smpi $ mkdir -p $dir $ cd $dir //copy files or download from passport advantage to this directory.// $ ls $dir ibm_smpi-10.02.00.09rtm5-rh7_20181010.ppc64le.rpm ibm_smpi_lic_s-10.02.09rtm5-rh7_20181010.ppc64le.rpm ibm_smpi-libgpump-10.02.00.09rtm5-rh7_20181010.ppc64le.rpm ibm_smpi-devel-10.02.00.09rtm5-rh7_20181010.ppc64le.rpm ibm_spindle-10.02.00.09rtm5-rh7_20181010.ppc64le.rpm ibm_smpi_gpusupport-10.02.00.09rtm5-rh7_20181010.ppc64le.rpm ibm_smpi-pami_devel-10.02.00.09rtm5-rh7_20181010.ppc64le.rpm

2. Create a package repository by running the createrepo command: $ createrepo . <...> 3. Add the packages to the otherpkglist of the osimage: $ lsdef -t osimage rh75-compute-stateless -i otherpkglist,otherpkgdir Object name: rh75-compute-stateless otherpkgdir=/install/post/otherpkgs/rhels7.5-alternate/ppc64le

otherpkglist=/install/osimages/rh75-compute-stateless/kits/KIT_DEPLOY_PARAMS.ot herpkgs.pkglist,/install/custom/netboot/rhel/rh75-compute.otherpkglist,/install /osimages/rh75-compute-stateless/kits/KIT_COMPONENTS.otherpkgs.pkglist

$ otherpkgdir=/install/post/otherpkgs/rhels7.5-alternate/ppc64le $ otherpkglist=/install/custom/netboot/rh/rh75-compute.otherpkglist

$ cd $otherpkgdir $ ln -s $dir smpi

$ ls -l lrwxrwxrwx 1 root root 30 Oct 9 14:10 at -> /install/software/compilers/at lrwxrwxrwx 1 root root 26 Oct 9 11:19 cuda-9.2 -> /install/software/cuda/9.2 lrwxrwxrwx 1 root root 27 Oct 9 11:19 cuda-deps -> /install/software/cuda/deps lrwxrwxrwx 1 root root 93 Oct 10 14:48 smpi -> /install/software/smpi lrwxrwxrwx 1 root root 69 Oct 10 12:04 xlc-16.1.0-1-rhels-7-ppc64le -> /install/kits/xlc-16.1.0-1-ppc64le/repos/xlc-16.1.0-1-rhels-7-ppc64le lrwxrwxrwx 1 root root 69 Oct 10 12:11 xlf-16.1.0-1-rhels-7-ppc64le -> /install/kits/xlf-16.1.0-1-ppc64le/repos/xlf-16.1.0-1-rhels-7-ppc64le

$ echo "# SMPI" >> $otherpkglist echo "#ENV:IBM_SPECTRUM_MPI_LICENSE_ACCEPT=yes#" >> $otherpkglist $ echo "smpi/ibm_smpi" >> $otherpkglist $ echo "smpi/ibm_smpi_lic_s" >> $otherpkglist $ echo "smpi/ibm_smpi-libgpump" >> $otherpkglist $ echo "smpi/ibm_smpi-devel" >> $otherpkglist

$ echo "smpi/ibm_spindle" >> $otherpkglist $ echo "smpi/ibm_smpi_gpusupport" >> $otherpkglist $ echo "smpi/ibm_smpi-pami_devel" >> $otherpkglist

$ cat $otherpkglist # CUDA cuda-deps/dkms cuda-9.2/cuda-runtime-9-2 # Advance Toolchain at/advance-toolchain-at12.0-runtime # SMPI #ENV:IBM_SPECTRUM_MPI_LICENSE_ACCEPT=yes# smpi/ibm_smpi smpi/ibm_smpi_lic_s smpi/ibm_smpi-libgpump smpi/ibm_smpi-devel smpi/ibm_spindle smpi/ibm_smpi_gpusupport smpi/ibm_smpi-pami_devel
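Before the image is generated, it can help to confirm that the repository metadata exists and that the symlink in otherpkgdir resolves. The following quick check is only a sketch and reuses the $otherpkgdir variable from the commands above:

$ ls -ld $otherpkgdir/smpi
$ ls $otherpkgdir/smpi/repodata/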

5.6.9 IBM Parallel Performance Toolkit

To install IBM Parallel Performance Toolkit, complete the following steps:

1. Add the kit (distributed with the product media) to xCAT by running the addkit command: $ addkit /path/to/ppt-files/ppt-2.4.0-2.tar.bz2 Adding Kit ppedev-2.4.0-2 Kit ppt-2.4.0-2 was successfully added.

$ lsdef -t kit ibm_smpi_kt-10.2.0.6-rh7-ppc64le (kit) ppt-2.4.0-2 (kit) xlc-16.1.0-1-ppc64le (kit) xlf-16.1.0-1-ppc64le (kit) 2. Verify its kitcomponent objects (and description fields) by running the lsdef command: $ lsdef -t kitcomponent -w kitname==ppt-2.4.0-2 -i description Object name: ppt.compute-2.4.0-2-rhels-7.4-ppc64le description=IBM Parallel Performance Toolkit for compute nodes Object name: ppt.license-2.4.0-2-rhels-7.4-ppc64le description=IBM Parallel Performance Toolkit license package Object name: ppt.login-2.4.0-2-rhels-7.4-ppc64le description=IBM Parallel Performance Toolkit for login nodes 3. Add the following kitcomponent (and dependencies) to the osimage object by running the addkitcomp command: $ addkitcomp --adddeps -i rh75-compute-stateless \ ppt.compute-2.4.0-2-rhels-7.4-ppc64le Assigning kit component ppt.compute-2.4.0-2-rhels-7.4-ppc64le to osimage rh75-compute-stateless Kit components ppt.compute-2.4.0-2-rhels-7.4-ppc64le were added to osimage rh75-compute-stateless successfully

$ lsdef -t osimage rh75-compute-stateless -i kitcomponents Object name: rh75-compute-stateless


kitcomponents=xlc.license-compute-16.1.0-1-rhels-7-ppc64le,xlc.rte-compute-16.1.0-1-rhels-7-ppc64le,xlf.license-compute-16.1.0-1-rhels-7-ppc64le,xlf.rte-compute-16.1.0-1-rhels-7-ppc64le,ibm_spectrum_mpi_license-10.2.0.6-rh7-rhels-7-ppc64le,ibm_spectrum_mpi-full-10.2.0.6-rh7-rhels-7-ppc64le,essl-license-6.1.0-1-rhels-7.5-ppc64le,ppt.license-2.4.0-2-rhels-7.4-ppc64le,ppt.compute-2.4.0-2-rhels-7.4-ppc64le

Note: To perform development and profiling tasks on a compute node, install the following kitcomponent: ppt.login-2.4.0-2-rhels-7.4-ppc64le

5.6.10 IBM Engineering and Scientific Subroutine Library

To install IBM Engineering and Scientific Subroutine Library (IBM ESSL), complete the following steps.

Note: IBM ESSL V6.1.0-1 needs CUDA V9.2 or later.

1. Add the kit (distributed with the product media) to xCAT by running the addkit command: $ addkit /path/to/essl-files/essl-6.1.0-1-ppc64le.tar.bz2 Adding Kit essl-6.1.0-1-ppc64le Kit essl-6.1.0-1-ppc64le was successfully added.

$ lsdef -t kit essl-6.1.0-1-ppc64le (kit) ibm_smpi_kt-10.2.0.6-rh7-ppc64le (kit) ppt-2.4.0-2 (kit) xlc-16.1.0-1-ppc64le (kit) xlf-16.1.0-1-ppc64le (kit) 2. Verify the kitcomponent objects (and description fields) by running the lsdef command: $ lsdef -t kitcomponent -w kitname==essl-6.1.0-1-ppc64le -i description

Object name: essl-computenode-3264rte-6.1.0-1-rhels-7.5-ppc64le description=essl for compute nodes with 3264 rte only Object name: essl-computenode-3264rtecuda-6.1.0-1-rhels-7.5-ppc64le description=essl for compute nodes with 3264 rte cuda only Object name: essl-computenode-6.1.0-1-rhels-7.5-ppc64le description=essl for compute nodes Object name: essl-computenode-6464rte-6.1.0-1-rhels-7.5-ppc64le description=essl for compute nodes with 6464 rte only Object name: essl-computenode-nocuda-6.1.0-1-rhels-7.5-ppc64le description=essl for compute nodes Object name: essl-license-6.1.0-1-rhels-7.5-ppc64le description=essl license for compute nodes Object name: essl-loginnode-6.1.0-1-rhels-7.5-ppc64le description=essl for login nodes Object name: essl-loginnode-nocuda-6.1.0-1-rhels-7.5-ppc64le description=essl for login nodes

3. Add the following kitcomponent (and dependencies) to the osimage object by running the addkitcomp command: $ addkitcomp --adddeps -i rh75-compute-stateless \ essl-computenode-6.1.0-1-rhels-7.5-ppc64le Assigning kit component essl-computenode-6.1.0-1-rhels-7.5-ppc64le to osimage rh75-compute-stateless Kit components essl-computenode-6.1.0-1-rhels-7.5-ppc64le were added to osimage rh75-compute-stateless successfully

$ lsdef -t osimage rh75-compute-stateless -i kitcomponents Object name: rh75-compute-stateless

kitcomponents=xlc.license-compute-16.1.0-1-rhels-7-ppc64le,xlc.rte-compute-16.1 .0-1-rhels-7-ppc64le,xlf.license-compute-16.1.0-1-rhels-7-ppc64le,xlf.rte-compu te-16.1.0-1-rhels-7-ppc64le,ibm_spectrum_mpi_license-10.2.0.6-rh7-rhels-7-ppc64 le,ibm_spectrum_mpi-full-10.2.0.6-rh7-rhels-7-ppc64le,essl-license-6.1.0-1-rhel s-7.5-ppc64le,essl-computenode-6.1.0-1-rhels-7.5-ppc64le,ppt.license-2.4.0-2-rh els-7.4-ppc64le,ppt.compute-2.4.0-2-rhels-7.4-ppc64le

5.6.11 IBM Parallel Engineering and Scientific Subroutine Library

To install IBM Parallel ESSL, complete the following steps: 1. Add the kit (distributed with the product media) to xCAT by running the addkit command: $ addkit /path/to/pessl-files/pessl-5.4.0-0-ppc64le.tar.bz2 Adding Kit pessl-5.4.0-0-ppc64le.tar.bz2 Kit pessl-5.4.0-0-ppc64le was successfully added.

$ lsdef -t kit essl-6.1.0-1-ppc64le (kit) ibm_smpi_kt-10.2.0.6-rh7-ppc64le (kit) pessl-5.4.0-0-ppc64le (kit) ppt-2.4.0-2 (kit) xlc-16.1.0-1-ppc64le (kit) xlf-16.1.0-1-ppc64le (kit 2. Verify the kitcomponent objects (and description fields) by running the lsdef command: $ lsdef -t kitcomponent -w kitname==pessl-5.4.0-0-ppc64le -i description Object name: pessl-computenode-3264rte-5.4.0-0-rhels-7.5-ppc64le description=pessl for compute nodes with ESSL non-cuda runtime Object name: pessl-computenode-5.4.0-0-rhels-7.5-ppc64le description=pessl for compute nodes Object name: pessl-computenode-nocuda-5.4.0-0-rhels-7.5-ppc64le description=pessl for compute nodes Object name: pessl-license-5.4.0-0-rhels-7.5-ppc64le description=pessl license for compute nodes Object name: pessl-loginnode-5.4.0-0-rhels-7.5-ppc64le description=pessl for login nodes Object name: pessl-loginnode-nocuda-5.4.0-0-rhels-7.5-ppc64le description=pessl for login nodes 3. Add the following kitcomponent (and dependencies) to the osimage object by running the addkitcomp command: $ addkitcomp --adddeps -i rh75-compute-stateless \ pessl-computenode-5.4.0-0-rhels-7.5-ppc64le

Assigning kit component pessl-computenode-5.4.0-0-rhels-7.5-ppc64le to osimage rh75-compute-stateless Kit components pessl-computenode-5.4.0-0-rhels-7.5-ppc64le were added to osimage rh75-compute-stateless successfully

$ lsdef -t osimage rh75-compute-stateless -i kitcomponents Object name: rh75-compute-stateless

kitcomponents=xlc.license-compute-16.1.0-1-rhels-7-ppc64le,xlc.rte-compute-16.1 .0-1-rhels-7-ppc64le,xlf.license-compute-16.1.0-1-rhels-7-ppc64le,xlf.rte-compu te-16.1.0-1-rhels-7-ppc64le,ibm_spectrum_mpi_license-10.2.0.6-rh7-rhels-7-ppc64 le,ibm_spectrum_mpi-full-10.2.0.6-rh7-rhels-7-ppc64le,essl-license-6.1.0-1-rhel s-7.5-ppc64le,essl-computenode-6.1.0-1-rhels-7.5-ppc64le,pessl-license-5.4.0-0- rhels-7.5-ppc64le,pessl-computenode-5.4.0-0-rhels-7.5-ppc64le,ppt.license-2.4.0 -2-rhels-7.4-ppc64le,ppt.compute-2.4.0-2-rhels-7.4-ppc64le

5.6.12 IBM Spectrum Scale (formerly IBM GPFS)

This section describes how to install the packages for IBM Spectrum Scale V5.0.1-2 and build and install the IBM GPFS Portability Layer packages.

The process to build the IBM GPFS Portability Layer requires an already provisioned node (for example, a compute, login, or management node) with the IBM Spectrum Scale packages installed (specifically, gpfs.gpl). Therefore, if the management node is not used for the build, at least one node must first be provisioned without the portability layer so that it can be used to build it; that node cannot receive IBM Spectrum Scale plus the portability layer in a single provisioning stage. All other nodes can be installed in one stage afterward, when the built portability layer package is made available for download from the management node (like the other IBM Spectrum Scale packages).

For more information about the installation process, see the Installing GPFS on Linux nodes page. To install IBM Spectrum Scale V5.0.1-2, complete the following steps: 1. Install a script for the environment configuration. 2. Install the packages for IBM Spectrum Scale V5.0.1-2. 3. Build the IBM GPFS Portability Layer.

Installing a script for the environment configuration To install the script for the environment configuration, complete the following steps: 1. Create a script to set the PATH environment variable: $ script=/install/postscripts/gpfs-setup

$ cat <<EOF > $script
#!/bin/bash

profile='/etc/profile.d/gpfs.sh'
echo 'export PATH=\$PATH:/usr/lpp/mmfs/bin' >\$profile
# REPLACE GPFSNODE with a GPFS management node in the cluster, which
# compute nodes are allowed to ssh to as root:
# /usr/lpp/mmfs/bin/mmsdrrestore -p GPFSNODE
/usr/lpp/mmfs/bin/mmstartup
EOF

$ chmod +x $script

$ ls -l $script -rwxr-xr-x 1 root root 99 Nov 30 17:24 /install/postscripts/gpfs-setup 2. Include it in the postscripts list of the osimage object by running the chdef command: $ lsdef -t osimage rh75-compute-stateless -i postbootscripts Object name: rh75-compute-stateless postbootscripts=

$ chdef -t osimage rh75-compute-stateless --plus postbootscripts='gpfs-setup' 1 object definitions have been created or modified.

$ lsdef -t osimage rh75-compute-stateless -i postbootscripts Object name: rh75-compute-stateless postbootscripts=gpfs-setup

Installing the packages for IBM Spectrum Scale V5.0.1-2 To install the packages, complete the following steps: 1. Extract the product packages of IBM Spectrum Scale V5.0.1-2: $ dir=/install/post/otherpkgs/rhels7.5-alternate/ppc64le/gpfs $ mkdir -p $dir

$ /path/to/Spectrum_Scale_Data_Management-5.0.1.2-ppc64LE-Linux-install --text-only --silent --dir $dir <...> Extracting Product RPMs to /install/post/otherpkgs/rhels7.5-alternate/ppc64le/gpfs <...> 2. Create the respective package repository by running the createrepo command: $ createrepo $dir <...>

$ ls -1 $dir gpfs.adv-5.0.1-2.ppc64le.rpm gpfs.base-5.0.1-2.ppc64le.rpm gpfs.callhome-ecc-client-5.0.1-2.noarch.rpm gpfs.compression-5.0.1-2.ppc64le.rpm gpfs.crypto-5.0.1-2.ppc64le.rpm gpfs.docs-5.0.1-2.noarch.rpm gpfs.ext-5.0.1-2.ppc64le.rpm gpfs.gpl-5.0.1-2.noarch.rpm gpfs.gskit-8.0.50-86.ppc64le.rpm gpfs.gui-5.0.1-2.noarch.rpm gpfs.hdfs-protocol-2.7.3-3.ppc64le.rpm gpfs.java-5.0.1-2.ppc64le.rpm gpfs.license.dm-5.0.1-2.ppc64le.rpm gpfs.msg.en_US-5.0.1-2.noarch.rpm gpfs.tct.client-1.1.5.2.ppc64le.rpm gpfs.tct.server-1.1.5.2.ppc64le.rpm repodata

3. Create the respective package list with a combination of commands: $ list=/install/custom/netboot/rh/rh75-compute.otherpkglist $ echo "####GPFS####" >> $list $ rpm -qip *.rpm | awk '/^Name/ { print $3 }' | sed "s:^:$(basename $dir)/:" | grep -v -E 'msg|java|callhome|crypto|gui|hdfs|client|server' >>$list

$ cat $list gpfs/gpfs.adv gpfs/gpfs.base gpfs/gpfs.compression gpfs/gpfs.docs gpfs/gpfs.ext gpfs/gpfs.gpl gpfs/gpfs.gskit gpfs/gpfs.license.dm

Building the IBM GPFS Portability Layer The process to build the IBM GPFS Portability Layer requires a working node (for example, a management, login, or compute node) with IBM Spectrum Scale packages (specifically, gpfs.gpl) installed.

The node must be capable of building out-of-tree kernel modules (that is, with development packages such as gcc, make, and kernel-devel installed), and run the same kernel packages version and processor architecture as the target compute nodes for the IBM GPFS Portability Layer produced binary files.

For example, you can build the IBM GPFS Portability Layer in the management node (as in the example below) or an installed compute node (if any) if it is a system that is on a POWER9 processor-based server that is running Red Hat Enterprise Linux Server 7.5 alternate for ppc64le and matches the kernel packages version of the target compute nodes. This section describes the build process with an installed compute node.
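Before starting the build, a quick comparison of the kernel on the build node with the kernel packages inside the stateless image can avoid a mismatched portability layer. This is only a sketch; the node name p9r1n1 and the rootimg path follow the examples that are used elsewhere in this chapter:

# Kernel that is running on the build node (an installed compute node in this example)
$ xdsh p9r1n1 'uname -r'
# Kernel packages that are installed in the stateless image on the management node
$ rpm --root /install/netboot/rhels7.5-alternate/ppc64le/rh75-compute-stateless/rootimg -qa 'kernel*'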

Note: The IBM GPFS Portability Layer must be rebuilt and reinstalled if the kernel packages are updated. This process requires the kernel-devel package for the respective update version.

To perform the IBM GPFS Portability Layer build, complete the following steps: 1. Verify the IBM Spectrum Scale installation (PATH environment variable and RPM packages): $ echo $PATH | grep mmfs /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/lpp/mmfs/bin:/opt/ibutils/bin:/root/.local/bin:/root/bin

$ rpm -qa | grep ^gpfs gpfs.gpl-5.0.1-2.noarch gpfs.base-5.0.1-2.ppc64le gpfs.adv-5.0.1-2.ppc64le gpfs.license.dm-5.0.1-2.ppc64le gpfs.gskit-8.0.50-86.ppc64le gpfs.ext-5.0.1-2.ppc64le gpfs.compression-5.0.1-2.ppc64le

2. Build the IBM GPFS Portability Layer. Verify that the build process writes the resulting IBM GPFS Portability Layer binary RPM package (gpfs.gplbin-<kernel_version>-<gpfs_version>.rpm) and finishes with an exit code of zero (success): $ /usr/lpp/mmfs/bin/mmbuildgpl --build-package ------ mmbuildgpl: Building GPL module begins at Tue Oct 23 18:42:52 EDT 2018. ------ <...> Wrote: /root/rpmbuild/RPMS/ppc64le/gpfs.gplbin-4.14.0-49.13.1.el7a.ppc64le-5.0.1-2.ppc64le.rpm ------ mmbuildgpl: Building GPL module completed successfully at Tue Oct 23 18:43:07 EDT 2018. ------ 3. Copy the package and create the respective package repository with the createrepo command: $ dir=/install/post/otherpkgs/rhels7.5-alternate/ppc64le/gpfs-gpl $ mkdir $dir

$ cp /root/rpmbuild/RPMS/ppc64le/gpfs.gplbin-*.rpm $dir gpfs.gplbin-4.14.0-49.13.1.el7a.ppc64le-5.0.1-2.ppc64le.rpm <...>

$ createrepo $dir <...>

$ ls -1 $dir gpfs.gplbin-4.14.0-49.13.1.el7a.ppc64le-5.0.1-2.ppc64le.rpm repodata 4. Create the respective package list by running the following commands: $ echo $list /install/custom/netboot/rh/rh75-compute.otherpkglist

$ echo '#GPFS_GPL#' >>$list $ rpm -qip $dir/*.rpm | awk '/^Name/ { print $3 }' | sed "s:^:$(basename $dir)/:" >>$list

$ cat $list # CUDA cuda-deps/dkms cuda-9.2/cuda-runtime-9-2 # Advance Toolchain at/advance-toolchain-at12.0-runtime at/advance-toolchain-at12.0-mcore-libs ####GPFS#### gpfs/gpfs.license.dm gpfs/gpfs.ext gpfs/gpfs.gpl gpfs/gpfs.gskit gpfs/gpfs.compression gpfs/gpfs.adv

gpfs/gpfs.base #GPFS_GPL# gpfs-gpl/gpfs.gplbin-4.14.0-49.13.1.el7a.ppc64le

Note: You can install the IBM GPFS Portability Layer in an installed and running compute node by using the otherpkgs script of the updatenode command: $ updatenode p9r1n1 -P otherpkgs <...> p9r1n1: pkgsarray: gpfs/gpfs.gplbin-4.14.0-49.13.1.el7a.ppc64le , 2 <...>

$ xdsh p9r1n1 'rpm -qa | grep ^gpfs.gpl'

p9r1n1: gpfs.gpl-5.0.1-2.noarch p9r1n1: gpfs.gplbin-4.14.0-49.13.el7.ppc64le

For more information about configuration and usage, see the Steps to establishing and starting your GPFS cluster page.

5.6.13 PGI runtime libraries

To install The Portland Group, Inc. (PGI) runtime libraries for OpenACC, complete the following steps. The scenario that is described in this section has an IBM Spectrum Scale file system available at /gpfs/gpfs0/; replace GPFSNODE with the GPFS management node name that fits your scenario.

Note: PGI Compilers are available for download at the PGI Community Edition page.

At the website, click Linux OpenPOWER to begin the download process.

Because you must accept a user agreement, download this package with a browser.

No runtime-libraries-only package is available for the PGI compilers. Therefore, the libraries must be copied manually.

1. Create the pgi directory in the GPFSNODE GPFS: $ dir=/gpfs/gpfs0/pgi $ mkdir -p $dir

# Download via browser and scp to GPFSNODE pgi directory $ scp pgilinux-2018-187-ppc64le.tar.gz gpfs-mn:/$dir 2. Extract the package: $ cd $dir $ tar -xf pgilinux-2018-187-ppc64le.tar.gz

$ ls -1 documentation.html install install_components pgilinux-2018-187-ppc64le.tar.gz

3. Install the package by running the ./install script and pointing the installation to the /gpfs/gpfs0/pgi directory: $ ./install <...> Do you accept these terms? (accept,decline) accept

<...> Installation directory? [/opt/pgi] /gpfs/gpfs0/pgi local directory on all hosts for shared objects: [/opt/pgi/18.7/share_objects] /gpfs/gpfs0/pgi/18.7/share_objects

************************************************************************ CUDA Toolkit ************************************************************************ <...> Do you accept these terms? (accept,decline) accept Installing PGI version 18.7 into /gpfs/gpfs0/pgi ##################### <...> Do you wish to update/create links in the 2018 directory? (y/n) y <...> Do you want to install OpenMPI onto your system? (y/n)n <...> Do you want the files in the install directory to be read-only? (y/n) y

Installation complete.

Note: Run add_network_host on all the other hosts on which you want to run the PGI compilers to create the host-specific localrc.$host files under /gpfs/gpfs0/pgi/linuxpower*/bin/.

For 64-bit PGI compilers on 64-bit Linux systems, run the following command: $ /gpfs/gpfs0/pgi/linuxpower/18.7/bin/add_network_host

5.6.14 IBM Spectrum LSF integration with Cluster Systems Management

As mentioned in 3.4.1, “IBM Spectrum Load Sharing Facility” on page 28, a special version of IBM Spectrum Load Sharing Facility (IBM Spectrum LSF) is used for the HPC clusters in this book. It differs from the typical IBM Spectrum LSF product and IBM Spectrum LSF Suite binary files: it is an RPM installation that mainly provides IBM Spectrum LSF functions that are integrated with the IBM Cluster Systems Management (CSM) APIs.

This section describes how to install and configure this specific IBM Spectrum LSF version and integrate it with CSM.

Downloading and installing the CSM integration package Complete the following steps: 1. Download the installation package for the IBM Spectrum LSF integration with CSM (see the following note). The installation package is lsf-10.1-build_number.ppc64le_csm.bin.

Note: For more information about the location of the RPM files, contact your IBM Sales Representative.

2. Run the installation package to extract the RPM files: $ ./lsf-10.1-build_number.ppc64le_csm.bin 3. Accept the license agreement when prompted to continue with the file extraction. The extracted installation package contains the following RPM files: – lsf-common-10.1-yyyyddmm.ppc64le.rpm – lsf-master-10.1-yyyyddmm.ppc64le.rpm – lsf-misc-10.1-yyyyddmm.ppc64le.rpm – lsf-server-10.1-yyyyddmm.ppc64le.rpm In the RPM files, yyyyddmm is the build date. 4. Deploy the common, server, and master RPM packages on to the CSM hosts: – On the workload manager (WLM) node, run the rpm -ivh or yum install command to deploy the common, server, and master packages. By default, the package is installed in the /opt/ibm/spectrumcomputing/lsf/ directory. You can use the --prefix option to change this installation directory. $rpm -ivh lsf-common-10.1-yyyyddmm.ppc64le.rpm lsf-server-10.1-build_number.ppc64le.rpm lsf-master-10.1-build_number.ppc64le.rpm – On the launch node (LN), run the rpm -ivh or yum install command to deploy the common and server packages. By default, the package is installed to the /opt/ibm/spectrumcomputing/lsf/ directory. You can use the --prefix option to change this installation directory. rpm -ivh lsf-common-10.1-yyyyddmm.ppc64le.rpm lsf-server-10.1-build_number.ppc64le.rpm\ 5. Verify that the installation is successful: – Verify that the IBM Spectrum LSF packages are installed: rpm -qa | grep lsf lsf-server-10.1-yyyyddmm.ppc64le lsf-common-10.1-yyyyddmm.ppc64le lsf-master-10.1-yyyyddmm.ppc64le –Run the rpm -qi command to find the build number in the package description. 6. Copy the entitlement file to the conf/lsf.entitlement subdirectory of the installation directory: – If you installed the packages with the default installation directory, copy the entitlement file to the /opt/ibm/spectrumcomputing/lsf/conf/lsf.entitlement directory. – If you specified a custom installation directory, copy the entitlement file to the conf/lsf.entitlement subdirectory of the directory that you specified with the --prefix installation option.
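As a quick verification of step 5, the following sketch lists the installed packages and queries one of them; the exact file name depends on your build, and the build number appears in the package description:

$ rpm -qa | grep ^lsf
$ rpm -qi lsf-server-10.1-yyyyddmm.ppc64le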

Chapter 5. Nodes and software deployment 123 Configuring the Cluster Systems Management integration To integrate CSM, complete the following steps: 1. Edit the lsf.shared file to add the name of the cluster with the CSM hosts and add the name and properties of the CSM resources: Begin Cluster ClusterName # Keyword ... cluster1 # Example End Cluster ... Begin Resource RESOURCENAME TYPE INTERVAL INCREASING DESCRIPTION # Keywords ... LN Boolean () () (CSM launch nodes) CN Boolean () () (CSM compute nodes) gpus Numeric () Y (number of gpus on csm cn) vg_total_size Numeric () Y (total ssd on csm cn) vg_available_size Numeric 15 N (available ssd on csm cn) ssd0_wear Numeric 15 Y (ssd wear on csm cn) ... End Resource 2. Edit the lsf.cluster.cluster1 file and add the relevant WLM and LN hosts and resources. For example, edit the lsf.cluster.cluster1 file and add the following lines: Begin ClusterAdmins Administrators = lsfadmin End ClusterAdmins

Begin Host HOSTNAME model type server RESOURCES #Keywords hostWLM ! ! 1 (mg) #Example only. Use the real host name hostLN ! ! 1 (LN) #Example only. Use the real host name End Host

Begin ResourceMap RESOURCENAME LOCATION gpus (0@[wlm ln]) #none on WLM and LN vg_total_size (0@[wm ln]) #none on WLM and LN vg_available_size ([default]) ssd0_wear ([default]) End ResourceMap 3. Edit the lsf.conf file on the CSM hosts and define the CSM parameters: – On the WLM host, define the following parameters in the lsf.conf file: LSB_MIG2PEND=1 EGO_DEFINE_NCPUS=cores LSF_CSM_LIB_PATH=/opt/ibm/csm/lib/libcsmi.so LSF_JSM_PATH=/opt/ibm/spectrum_mpi/jsm_pmix/bin/jsm LSB_BJOBS_DISPLAY_ENH=Y LSB_SHORT_HOSTLIST=1 LSF_HPC_EXTENSIONS="CUMULATIVE_RUSAGE LSB_HCLOSE_BY_RES SHORT_EVENTFILE" LSB_BJOBS_PENDREASON_LEVEL=1 LSF_ADD_HOST_PACK_FROM="30 6883 " #Parameters for easy mode job submission

124 IBM High-Performance Computing Insights with IBM Power System AC922 Clustered Solution LSB_CSM_JOBS=Y LSB_EASY_PTILE=x #x is the total number of cores on each CN, but if #core isolation is enforced cluster-wide, x is the #total number of cores on each CN minus the #core_isolation value – On the LN host, define the following parameters in the lsf.conf file: LSF_LIM_PORT=6809 LSB_MIG2PEND=1 LSB_RLA_PORT=6883 LSF_SERVER_HOSTS="hostWLM hostWLM_candidates" # job run time parameters LSF_CSM_LIB_PATH=/opt/ibm/csm/lib/libcsmi.so LSF_JSM_PATH=/opt/ibm/spectrum_mpi/jsm_pmix/bin/jsm # job run time environment variables LSB_CSM_JOBS=Y LSB_EASY_PTILE=x #x is the total number of cores on each CN, but if #core isolation is enforced cluster-wide, x is the #total number of cores on each CN minus the #core_isolation value # reporting CN to master lim LSF_ADD_HOST_PACK_FROM="30 6883 " # replace rla_port with the port number. Replace WLM_candidate LN WLMN with real host names 4. On the WLM host, edit the lsb.modules file and configure the CSM modules: Begin PluginModule SCH_PLUGIN RB_PLUGIN SCH_DISABLE_PHASES schmod_default () () schmod_fcfs () () schmod_fairshare () () schmod_limit () () schmod_parallel () () schmod_reserve () () schmod_mc () () schmod_preemption () () schmod_advrsv () () schmod_ps () () schmod_affinity () () schmod_aps () () schmod_datamgr () () End PluginModule 5. On the WLM host, edit the IBM Spectrum LSF hosts file and add the names of all CSM hosts as needed. Add placeholder IP addresses because specific IP addresses are not needed. For example: 10.0.0.1 hostWLM 10.0.0.2 hostLN

Chapter 5. Nodes and software deployment 125 6. On the WLM host, edit the lsb.hosts file and add the WLM and LN hosts. For example: Begin Host HOST_NAME MXJ r1m pg ls tmp DISPATCH_WINDOW # Keywords default ! () () () () () # Use the same value as the LSB_EASY_PTILE parameter hostWLM ! () () () () () hostLN ! () () () () () End Host 7. On the IBM Spectrum LSF master host, define the parameter values for CSM job submission: – Define the CSM parameters for all user job queues in the lsb.queues file. For IBM Spectrum LSF job queues to work with CSM, add the following parameter values to all user job queues in the lsb.queues file: JOB_CONTROLS = SUSPEND[bmig $LSB_BATCH_JID] RERUNNABLE = Y – Optional: Define any default values for the CSM job submission options in the lsf.conf file. Define the LSB_JSM_DEFAULT parameter as the default value for the bsub -jsm option and the LSB_STEP_CGROUP_DEFAULT parameter as the default value for the bsub -step_cgroup option: LSB_JSM_DEFAULT=y | n

LSB_STEP_CGROUP_DEFAULT=y | n For example: LSB_JSM_DEFAULT=y LSB_STEP_CGROUP_DEFAULT=y – Optional: Define any required values for the CSM job submission options at the queue level in the lsb.queues file. These settings override any job-level CSM options and append system level allocation flags to the job-level allocation flags. Define the CSM_REQ parameter with the CSM option settings to override the job-level CSM options: CSM_REQ= [jsm=y | n | d] [:step_cgroup=y | n] [: [-core_isolation 0 | 1 | 2 | 3 | 4 | 5] [:cn_mem=mem_value] ] [:alloc_flags flag_str]

Note: The options can appear in any order and none are required. The flag_str string value for the alloc_flags keyword cannot contain a colon (:).

For example: CSM_REQ=jsm=n:step_cgroup=y:-core_isolation 3:cn_mem=1024 8. Restart the IBM Spectrum LSF daemons. On each WLM and LN host, run the following commands as root: . /opt/ibm/spectrumcomputing/lsf/conf/profile.lsf lsf_daemons stop lsf_daemons start
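After the daemons restart, a basic sanity check with standard IBM Spectrum LSF commands (a sketch only; the output depends on your cluster configuration) confirms that the cluster is up and that the WLM and LN hosts report in:

$ . /opt/ibm/spectrumcomputing/lsf/conf/profile.lsf
$ lsid
$ bhosts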

Note: For more information about the IBM Spectrum LSF and CSM integration, see IBM Spectrum LSF with IBM Cluster Systems Management.

Chapter 6, “Cluster Administration and Storage Tools” on page 137 details the deployment of Cluster Administration and Storage Tools (CAST), which comprises CSM and IBM Burst Buffer (BB).

5.6.15 Synchronizing the configuration files

Add all configuration files to your osimage synclists definition so that all the files are up-to-date and in sync across all nodes and images. When you generate or regenerate a stateless image or install a stateful image, the syncfiles are updated. You also can force the syncfile process by running the updatenode -F command.
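For example, to force a file synchronization to nodes that are already running (assuming the compute node group that is used elsewhere in this chapter), run the following command:

$ updatenode compute -F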

For more information, see Synchronizing Files.

The scenario in this section shows the synchronization of /etc/hosts to reduce host name and IP resolution time.

Note: If service nodes are used to manage your nodes (hierarchy), you must ensure that the service nodes are synchronized with the latest files from the management node before installing.

If a group of compute nodes (compute) that is serviced by a service node is going to be installed, run the updatenode compute -f command before the installation to synchronize the current files to the service node. The node range contains the compute node names, and updatenode determines which service nodes need updating.

Complete the following steps: 1. Create required folders: $ syncdir=/install/custom/syncfiles/common $ mkdir -p $syncdir

$ cd $syncdir $ mkdir etc $ cd etc 2. Create the hosts file (link to /etc/hosts in this case): $ ln -s /etc/hosts hosts $ pwd /install/custom/syncfiles/common/etc $ ls -l lrwxrwxrwx 1 root root 10 1. Dez 15:24 hosts -> /etc/hosts 3. Create the synclist file.

Note: For more information about the synclist file syntax, see The Format of synclist file page.

$ synclist=/install/custom/netboot/rh/rh75-compute.synclist

$ echo "/install/custom/syncfiles/common/etc/* -> /etc/" >> $synclist

$ cat $synclist /install/custom/syncfiles/common/etc/* -> /etc/ 4. Add the synclist to the osimage definition: $ lsdef -t osimage rh75-compute-stateless -i synclists Object name: rh75-compute-stateless synclists=

$ chdef -t osimage rh75-compute-stateless synclists=$synclist 1 object definitions have been created or modified.

$ lsdef -t osimage rh75-compute-stateless -i synclists Object name: rh75-compute-stateless synclists=/install/custom/netboot/rh/rh75-compute.synclist

For more configuration files for a specific node type, create a directory in /install/custom/syncfiles and add the directory to the corresponding synclist file, as shown in the following example: /install/custom/syncfiles/compute/opt

As of xCAT 2.14, nodes and node groups can be added to the synclist file. For more information, see the Support nodes in synclist file page.

5.6.16 Generating and packing the image

Because this image is a stateless image, you must generate and pack the image before providing it. The following steps are needed for a stateless image only: 1. Generate the stateless image by running the genimage command. It uses the osimage definition information to generate the image. $ genimage rh75-compute-stateless Generating image: <...> the initial ramdisk for statelite is generated successfully. This process can take some time. The installation process runs in a chroot environment on the management node. Monitor the building process carefully and watch for possible errors. 2. Check the content of the generated image: $ lsdef -t osimage rh75-compute-stateless -i rootimgdir Object name: rh75-compute-stateless rootimgdir=/install/netboot/rhels7.5-alternate/ppc64le/rh75-compute-stateless $ rootimgdir=/install/netboot/rhels7.5-alternate/ppc64le/rh75-compute-stateless

$ ls -l $rootimgdir/rootimg lrwxrwxrwx 1 root root 7 Oct 10 16:07 bin -> usr/bin dr-xr-xr-x 4 root root 4096 Oct 18 19:00 boot drwxr-xr-x 2 root root 4096 Oct 22 16:23 dev drwxr-xr-x 75 root root 4096 Oct 22 16:27 etc drwxr-xr-x 2 root root 4096 Dec 14 2017 home lrwxrwxrwx 1 root root 7 Oct 10 16:07 lib -> usr/lib lrwxrwxrwx 1 root root 9 Oct 10 16:07 lib64 -> usr/lib64 drwxr-xr-x 2 root root 4096 Dec 14 2017 media drwxr-xr-x 2 root root 4096 Dec 14 2017 mnt drwxr-xr-x 10 root root 4096 Oct 22 16:23 opt drwxr-xr-x 2 root root 4096 Oct 10 16:07 proc dr-xr-x--- 2 root root 4096 Oct 11 19:06 root drwxr-xr-x 13 root root 4096 Oct 22 16:22 run lrwxrwxrwx 1 root root 8 Oct 10 16:07 sbin -> usr/sbin drwxr-xr-x 2 root root 4096 Dec 14 2017 srv drwxr-xr-x 2 root root 4096 Oct 10 16:07 sys drwxrwxrwt 8 root root 4096 Oct 22 16:24 tmp

drwxr-xr-x 15 root root 4096 Oct 11 16:57 usr drwxr-xr-x 20 root root 4096 Oct 11 16:57 var drwxr-xr-x 7 root root 4096 Oct 22 16:27 xcatpost

Note: After your image is generated, you can chroot to the image, install any other software, or modify the files manually. As a best practice, do the installation steps by using all of the mechanisms that are described in 5.3.5, “xCAT installation types: Disks and state” on page 62 because they ensure that a reproducible image is produced.

3. Pack the generated image with the packimage command: $ packimage rh75-compute-stateless Packing contents of /install/netboot/rhels7.5-alternate/ppc64le/rh75-compute-stateless/rootimg archive method:cpio compress method:gzip

Note: The packimage process can be accelerated by installing the parallel compression tool pigz on the management node. For more information, see Accelerating the diskless initrd and rootimg generating.
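For example, if a repository that provides pigz (such as EPEL) is configured on the management node, the following sketch installs it so that subsequent packimage runs can use parallel compression:

$ yum install -y pigz
$ packimage rh75-compute-stateless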

5.6.17 Node provisioning

To start provisioning a compute node (or node group) and load the OS and software stack according to the osimage and node group objects, complete the following steps: 1. Define the osimage attribute of a node (or node group) by running the nodeset command: $ nodeset p9r1n1 osimage=rh75-compute-stateless p9r1n1: netboot rhels7.5-alternate-ppc64le-compute 2. You can verify the changes to the nodes attributes by running the lsdef command: $ lsdef p9r1n1 Object name: p9r1n1 <...> initrd=xcat/osimage/rh75-compute-stateless/initrd-stateless.gz <...> kcmdline=imgurl=http://!myipfn!:80//install/netboot/rhels7.5-alternate/ppc64le/ rh75-compute-stateless/rootimg.cpio.gz XCAT=!myipfn!:3001 NODE=p9r1n1 FC=0 BOOTIF=70:e2:84:14:09:ae kernel=xcat/osimage/rh75-compute-stateless/kernel <...> os=rhels7.5-alternate <...> profile=compute provmethod=rh75-compute-stateless <...> 3. Set the boot method to network by running the rsetboot command (optional): $ rsetboot p9r1n1 net

Note: This process is performed automatically by the nodeset command and is not required if the bootloader configuration for automatic boot is correct. For more information, see 5.2.4, “Boot order configuration” on page 57.

4. (Optional) Restart the node (or node group) by running the rpower command: $ rpower p9r1n1 reset

Note: This process is performed automatically (within some time) if the Genesis image for node discovery is still running in the compute node (waiting for instructions from the management node).

5. You can watch the node’s console by running the rcons command: $ rcons p9r1n1 6. You can monitor the node boot progress in the node object and the /var/log/messages log file by running the following command: $ lsdef p9r1n1 -i status Object name: p9r1n1 status=powering-on

$ lsdef p9r1n1 -i status Object name: p9r1n1 status=netbooting

lsdef p9r1n1 -i status Object name: p9r1n1 status=booted

$ tail -f /var/log/messages | grep p9r1n1 <...>
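Instead of rerunning lsdef manually, you can poll the status attribute, as in the following sketch (it assumes that the watch utility is available on the management node):

$ watch -n 30 "lsdef p9r1n1 -i status"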

5.6.18 Postinstallation verification

This section describes the steps that are used to verify that the software stack is correctly provisioned.

Verifying the CUDA Toolkit Verify that all GPUs are listed by running the nvidia-smi command: $ xdsh p9r1n1 'nvidia-smi --list-gpus'

p9r1n1: GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-3f48806d-2753-dce5-ee83-8ced3c835875) p9r1n1: GPU 1: Tesla V100-SXM2-16GB (UUID: GPU-261ba904-ad76-7d82-6d8f-d2a3d80744b4)
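To run the same check across an entire node group and collate identical results, you can pipe xdsh through the xCAT xcoll utility. This sketch assumes the compute node group that is used elsewhere in this chapter:

$ xdsh compute 'nvidia-smi --list-gpus | wc -l' | xcoll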

Verifying the Mellanox OpenFabrics Enterprise Distribution To verify the installation of Mellanox OFED, complete the following steps: 1. Verify that the openibd service is correctly loaded and active by running the systemctl command: xdsh p9r1n1 'systemctl status openibd' p9r1n1: * openibd.service - openibd - configure Mellanox devices p9r1n1: Loaded: loaded (/usr/lib/systemd/system/openibd.service; enabled; vendor preset: disabled) p9r1n1: Active: active (exited) since Tue 2018-10-23 19:25:46 GMT; 20h ago p9r1n1: Docs: file:/etc//openib.conf

130 IBM High-Performance Computing Insights with IBM Power System AC922 Clustered Solution p9r1n1: Main PID: 2470 (code=exited, status=0/SUCCESS) p9r1n1: CGroup: /system.slice/openibd.service p9r1n1: p9r1n1: Oct 23 19:25:46 p9r1n1.cluster systemd[1]: Starting openibd - configure Mellanox devices... p9r1n1: Oct 23 19:25:46 p9r1n1.cluster openibd[2470]: Loading HCA driver and Access Layer:[ OK ] p9r1n1: Oct 23 19:25:46 p9r1n1.cluster systemd[1]: Started openibd - configure Mellanox devices. 2. Verify the information and status of the InfiniBand adapter and ports by running the ibstat command: xdsh p9r1n1 ibstat p9r1n1: CA 'mlx5_0' p9r1n1: CA type: MT4121 p9r1n1: Number of ports: 1 p9r1n1: Firmware version: 16.21.0138 p9r1n1: Hardware version: 0 p9r1n1: Node GUID: 0xec0d9a0300cab918 p9r1n1: System image GUID: 0xec0d9a0300cab918 p9r1n1: Port 1: p9r1n1: State: Active p9r1n1: Physical state: LinkUp p9r1n1: Rate: 100 p9r1n1: Base lid: 0 p9r1n1: LMC: 0 p9r1n1: SM lid: 0 p9r1n1: Capability mask: 0x04010000 p9r1n1: Port GUID: 0xee0d9afffecab918 p9r1n1: Link layer: InfiniBand p9r1n1: CA 'mlx5_1' p9r1n1: CA type: MT4121 p9r1n1: Number of ports: 1 p9r1n1: Firmware version: 16.21.0138 p9r1n1: Hardware version: 0 p9r1n1: Node GUID: 0xec0d9a0300cab919 p9r1n1: System image GUID: 0xec0d9a0300cab918 p9r1n1: Port 1: p9r1n1: State: Active p9r1n1: Physical state: LinkUp p9r1n1: Rate: 100 p9r1n1: Base lid: 0 p9r1n1: LMC: 0 p9r1n1: SM lid: 0 p9r1n1: Capability mask: 0x04010000 p9r1n1: Port GUID: 0xee0d9afffecab919 p9r1n1: Link layer: InfiniBand
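A similar group-wide check of the InfiniBand port state and rate (again only a sketch by using xcoll) quickly highlights nodes whose links did not come up:

$ xdsh compute 'ibstat | grep -E "State|Rate"' | xcoll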

Verifying the installed runtime libraries and Message Passing Interface Verify the installed version of all runtime libraries and MPI by running the rpm command: $ xdsh p9r1n1 'rpm -qa | \ grep -E "xlc|xlf|advance-toolchain|ppt|smpi|essl|pessl"' p9r1n1: essl.license-6.1.0-0.ppc64le p9r1n1: essl.3264.rte-6.1.0-1.ppc64le p9r1n1: pessl.common-5.4.0-0.ppc64le

Chapter 5. Nodes and software deployment 131 p9r1n1: essl.rte-6.1.0-0.ppc64le p9r1n1: xlf.license-compute-16.1.0-1.noarch p9r1n1: libxlc-16.1.0.0-180413g.ppc64le p9r1n1: essl-computenode-6.1.0-1.noarch p9r1n1: ppt.compute-2.4.0-2.noarch p9r1n1: advance-toolchain-at12.0-mcore-libs-12.0-0.ppc64le p9r1n1: ibm_smpi_lic_s-10.02-p9_20180611.ppc64le p9r1n1: libxlf-16.1.0.0-180413g.ppc64le p9r1n1: essl.rte.common-6.1.0-0.ppc64le p9r1n1: pessl.rte.common-5.4.0-0.ppc64le p9r1n1: pessl.msg-5.4.0-0.ppc64le p9r1n1: essl.6464.rte-6.1.0-1.ppc64le p9r1n1: essl.3264.rtecuda-6.1.0-1.ppc64le p9r1n1: essl.common-6.1.0-0.ppc64le p9r1n1: xlf-license.16.1.0-16.1.0.0-180413g.ppc64le p9r1n1: xlc.license-compute-16.1.0-1.noarch p9r1n1: essl-license-6.1.0-1.noarch p9r1n1: pessl-license-5.4.0-0.noarch p9r1n1: xlc.rte-compute-16.1.0-1.noarch p9r1n1: ppt_license-2.4.0-2.ppc64le p9r1n1: ppt_mrnet-2.4.0-2.ppc64le p9r1n1: advance-toolchain-at12.0-runtime-12.0-0.ppc64le p9r1n1: ibm_smpi-10.02.00.06rtm1-rh7_20180811.ppc64le p9r1n1: pessl.license-5.4.0-0.ppc64le p9r1n1: pessl.3264.rte-5.4.0-0.ppc64le p9r1n1: essl.msg-6.1.0-0.ppc64le p9r1n1: xlc-license.16.1.0-16.1.0.0-180413g.ppc64le p9r1n1: xlf.rte-compute-16.1.0-1.noarch p9r1n1: pessl-computenode-5.4.0-0.noarch p9r1n1: ppt_runtime-2.4.0-2.ppc64le

Verifying IBM Spectrum Scale Complete the following steps: 1. Verify that the IBM Spectrum Scale packages are installed by running the rpm command: $ xdsh p9r1n1 'rpm -qa | grep ^gpfs' p9r1n1: gpfs.gpl-5.0.1-2.noarch p9r1n1: gpfs.gplbin-4.14.0-49.el7a.ppc64le-5.0.1-2.ppc64le p9r1n1: gpfs.base-5.0.1-2.ppc64le p9r1n1: gpfs.adv-5.0.1-2.ppc64le p9r1n1: gpfs.license.dm-5.0.1-2.ppc64le p9r1n1: gpfs.gskit-8.0.50-86.ppc64le p9r1n1: gpfs.gplbin-4.14.0-49.13.1.el7a.ppc64le-5.0.1-2.ppc64le p9r1n1: gpfs.ext-5.0.1-2.ppc64le p9r1n1: gpfs.compression-5.0.1-2.ppc64le 2. Verify that the node does not yet belong to any cluster by running the mmlscluster command: $ xdsh p9r1n1 'mmlscluster' p9r1n1: mmlscluster: This node does not belong to a GPFS cluster. p9r1n1: mmlscluster: Command failed. Examine previous error messages to determine cause.

(Optional) Checking public and site network connectivity Verify the connectivity to external and non-cluster nodes by running the ping command: $ xdsh p9r1n1 'ping -c1 example.com' p9r1n1: PING example.com (93.184.216.34) 56(84) bytes of data. p9r1n1: 64 bytes from 93.184.216.34: icmp_seq=1 ttl=46 time=3.94 ms <...>

5.7 xCAT login nodes (stateful)

The process that is used to deploy an xCAT login node is similar to the xCAT compute node deployment. The main difference is the particular software stack components that are used on each node type. Some installation methods also change slightly because login nodes typically use a stateful (diskful) image.

Therefore, this section does not describe the deployment instructions for login nodes. Instead, it describes some examples of the differences in the software stack components and methods between login and compute nodes.

The following differences are typically considered for the login nodes:  Use the rhels7.5-alternate-ppc64le-install-compute osimage (available images can be seen by running lsdef -t osimage) as an initial template.  Consider the configuration of RAID and disk partitions. For more information, see the following websites: – Configure RAID before deploying the OS – Configure Disk Partition  CUDA Toolkit often is not required because the extra compute-related hardware is not present on login nodes.  The Mellanox OFED installation process is slightly different. Use the postscripts node attribute instead of the postinstall osimage attribute to start the mlnxofed_ib_install script. For more information, see the Diskful Installation page.  The IBM XL C/C++ and Fortran compilers use the login nodes to compile applications. Therefore, the compiler package must be installed. Add the following appropriate xCAT software kits: Object name: xlc.compiler-compute-16.1.0-1-rhels-7-ppc64le description=XLC16 for compiler kitcomponent Object name: xlf.compiler-compute-16.1.0-1-rhels-7-ppc64le description=XLF16 for compiler kitcomponen  Advance Toolchain needs at least the advance-toolchain-at12.0-devel-12.0-0.ppc64le package for compiler support. As a best practice, install advance-toolchain-at12.0-perf-12.0-0.ppc64le.rpm as well.  The PGI compiler installation is different because the entire PGI compiler stack is needed. Complete the following steps to install it: a. Ensure that pkglist includes the following packages because they are a requirement for PGI compilers: • gcc • gcc-c++

b. Create a postscript to perform an unattended installation: $ script=/install/postscripts/install_pgi_login $ cat /install/postscripts/install_pgi_login #!/bin/bash

INSTALL_TAR="/install/software/compilers/pgi/pgilinux-2016-1610-ppc64le.t ar.gz" INSTALL_PATH="/opt/pgi" PGI_PATH="/opt/pgi/linuxpower/16.10" PROFILE_FILENAME="/etc/profile.d/xcat-pgi.sh"

# environment variables for unattended install export PGI_SILENT="true" export PGI_ACCEPT_EULA="accept" export PGI_INSTALL_DIR="$INSTALL_PATH" export PGI_INSTALL_TYPE="single"

# get tar and extract it dir=/tmp/pgi_install mkdir $dir cd $dir wget -l inf -N --waitretry=10 --random-wait --retry-connrefused -t 10 -T 60 -nH --no-parent "http://$MASTER/$INSTALL_TAR" 2> wget.log tar -xf "${INSTALL_TAR##*/}"

# do installation ./install

# set the paths required for pgi echo "export PGI=$INSTALL_PATH" > "${PROFILE_FILENAME}" echo "export PATH=$PGI_PATH/bin:\$PATH" >> "${PROFILE_FILENAME}"

cd .. # comment for debugging rm -rf $dir

$ chmod +x $script $ chdef -t --plus postscripts=install_pgi_login  The kits for PPT, IBM ESSL, and IBM Parallel ESSL provide kit components that are specifically for login nodes, for example: Object name: ppt.login-2.4.0-2-rhels-7.4-ppc64le description=IBM Parallel Performance Toolkit for login nodes Object name: essl-loginnode-6.1.0-1-rhels-7.5-ppc64le description=essl for login nodes Object name: essl-loginnode-nocuda-6.1.0-1-rhels-7.5-ppc64le description=essl for login nodes

Object name: pessl-loginnode-5.4.0-0-rhels-7.5-ppc64le description=pessl for login nodes Object name: pessl-loginnode-nocuda-5.4.0-0-rhels-7.5-ppc64le description=pessl for login nodes

 With the parallel file system (IBM Spectrum Scale), the login node can access the data that is provided to or produced by the applications that are running in the compute nodes.  The job scheduler (IBM Spectrum LSF) is typically required to submit jobs from the login nodes to the compute nodes.


Chapter 6. Cluster Administration and Storage Tools

This chapter introduces Cluster Administration and Storage Tools (CAST), which is composed of two main components: IBM Cluster Systems Management (CSM) and IBM Burst Buffer (BB). The CSM and BB tools are developed for use on POWER processor-based systems only.

This chapter describes the installation and configuration of CSM and BB.

The following topics are described in this chapter:  Cluster Systems Management  Preparing CSM  Burst Buffer

6.1 Cluster Systems Management

CSM is a C application programming interface (API) for managing a large cluster. CSM is developed for use with POWER processor-based systems only. This chapter is based on Version 1.3.0, which is available as of October 2018.

CSM offers a suite of tools for maintaining the cluster:  Discovery and management of system resources  Database integration (PostgreSQL)  Job launch support (workload management APIs)  Node diagnostics (diag APIs and scripts)  Reliability, availability, and serviceability (RAS) events and actions  Infrastructure health checks  Python bindings for C APIs

6.2 Preparing CSM

All nodes participating in CSM must be running Red Hat Enterprise Linux for PPC64LE.

6.2.1 Software dependencies

CSM has software dependencies. Verify that the following packages are installed.

Hard dependencies Table 6-1 lists the mandatory dependencies for configuring CSM.

Table 6-1 Mandatory software dependencies for CSM

Software | Version | Comments
Extreme Cluster and Cloud Administration Toolkit (xCAT) | 2.13.3 or later | None.
PostgreSQL | 9.2.18 or later | See the xCAT documentation for migration.
openssl-libs | 1.0.1e-60 or later | None.
perl-YAML | 0.84-5 or later | Required by the diagnostic tests.
perl-JSON | 2.59 or later | Required by the diagnostic tests that obtain information from the Unified Fabric Manager (UFM).
cast-boost | 1.60.0-4 | None.
POWER9 firmware level | Baseboard management controller (BMC): ibm-v2.0-0-r46-0-gbed584c or later. Host: IBM-witherspoon-ibm-OP9_v1.19_1.185 or later. | None.

Soft dependencies Table 6-2 lists more dependencies that provide extra functions.

Table 6-2 Soft software dependencies

Software | Version | Comments
NVIDIA Data Center GPU Manager (DCGM) | datacenter-gpu-manager-1.4.2-1.ppc64le | Needed by diagnostics, health check, and CSM graphics processing unit (GPU) inventory. All nodes with GPUs.
NVIDIA Compute Unified Device Architecture (CUDA) Toolkit | cuda-9.2.148-1.ppc64le | All nodes with GPUs.
NVIDIA Driver | cuda-drivers-396.47-1.ppc64le | Needed by NVIDIA DCGM. All nodes with GPUs.
IBM Hardware Test Executive (HTX) | htxrhel72le-491-LE.ppc64le | Needed by the diagnostics and health check. All nodes. HTX requires the net-tools package (ifconfig command) and the mesa-libGLU-devel and mesa-libGLU packages.
IBM Spectrum Message Passing Interface (IBM Spectrum MPI) | 10.02.00.05 or later | Needed by the diagnostic tests dgemm, dgemm-gpu, jlink, and daxpy. IBM Spectrum MPI requires IBM Engineering and Scientific Subroutine Library (IBM ESSL) and IBM Xpertise Library (XL) Fortran.
sudo | sudo-1.8.19p2-13.el7.ppc64le | Required by the diagnostic tests that must run as root.
lm-sensors | 3.4.0 | Required by the diagnostic temperature tests.

Setting up CSM is essential and must be done at the early stages of the HPC solution setup.

Table 6-3 lists other software dependencies.

Table 6-3 Other software dependencies

Software | Version | Comments
IBM Spectrum Load Sharing Facility (IBM Spectrum LSF) | 10.1 | None.
IBM Burst Buffer | 1.3.0 | None.
IBM POWER Hardware Monitor | ibm-crassd-0.8-15 or later | If installed on the service nodes, you can configure ibmpowerhwmon to create CSM RAS events from BMC sources by using csmrestd.

6.2.2 Installation

A set of RPMs is required to install CSM. To complete a clean CSM installation, follow the steps in this section.

6.2.3 CSM RPMs overview

The following RPMs are required for CSM and can be downloaded from Box. Verify that you have all of them before you begin.

All CSM RPMs follow the naming pattern ibm-csm-component-1.3.0-commit_sequence.ppc64le.rpm, for example, ibm-csm-core-1.3.0-1401.ppc64le.rpm. The commit_sequence portion is an increasing value that changes every time that a new commit is added to the repository. Because the commit_sequence changes so frequently, this document uses a wildcard for this portion of the RPM name.

Here are the CSM RPMs and their descriptions:
 ibm-csm-core holds the CSM infrastructure daemon and configuration file examples.
 ibm-csm-api holds all CSM APIs and the corresponding command-line interfaces (CLIs).
 ibm-csm-hcdiag holds the diagnostics and health check framework with all the available tests.
 ibm-csm-db holds CSM DB configuration files and scripts.
 ibm-csm-bds holds big data storage (BDS) configuration files, scripts, and documentation.
 ibm-csm-bds-logstash holds Logstash plug-ins, configuration files, and scripts.
 ibm-csm-bds-kibana holds Kibana plug-ins, configuration files, and scripts.
 ibm-csm-restd holds the CSM REST daemon, configuration file example, and a sample script.

Table 6-4 shows which CSM RPMs to install on each server (node) type, based on its function.

Table 6-4 RPMs by node type
CSM RPM | Management node | Service node | Login, launch, or compute node | BDS node
ibm-csm-core | X | X | X |
ibm-csm-api | X | X | X |
ibm-csm-hcdiag | X | X | X |
ibm-csm-db | X | | |
ibm-csm-bds | X | | |
ibm-csm-bds-logstash | | X | | X
ibm-csm-bds-kibana | | | | X
ibm-csm-restd | X | X | |

6.2.4 Installing CSM on to the management node

You start by installing cast-boost, and then the flightlog and CSM RPMs, by running the rpm or yum command as follows:
$ rpm -ivh cast-boost-*.rpm
$ rpm -ivh ibm-flightlog-1.3.0-*.ppc64le.rpm \
  ibm-csm-core-1.3.0-*.ppc64le.rpm \
  ibm-csm-api-1.3.0-*.ppc64le.rpm \
  ibm-csm-db-1.3.0-*.noarch.rpm \
  ibm-csm-bds-1.3.0-*.noarch.rpm \
  ibm-csm-hcdiag-1.3.0-*.noarch.rpm \
  ibm-csm-restd-1.3.0-*.ppc64le.rpm

6.2.5 Installing CSM on to the service node

For service nodes, CSM supports diskless and diskful installation.

Diskless
To do a diskless installation, clone the existing images, add the following packages to the otherpkgs directory and list, and then run the genimage command. For more information, see 6.2.21, “Diskless images” on page 150.
cast-boost-*
ibm-flightlog-1.3.0-*.ppc64le
ibm-csm-core-1.3.0-*.ppc64le
ibm-csm-api-1.3.0-*.ppc64le
ibm-csm-bds-1.3.0-*.noarch
ibm-csm-bds-logstash-1.3.0-*.noarch
ibm-csm-hcdiag-1.3.0-*.noarch
ibm-csm-restd-1.3.0-*.ppc64le

Diskful
To do a diskful installation, first install the cast-boost RPMs by running the following command:
$ rpm -ivh cast-boost-*.rpm

Then, install the flightlog RPM and the following CSM RPMs by running the rpm command:
 ibm-flightlog
 ibm-csm-core
 ibm-csm-api
 ibm-csm-bds
 ibm-csm-bds-logstash
 ibm-csm-hcdiag
 ibm-csm-restd
$ rpm -ivh ibm-flightlog-1.3.0-*.ppc64le.rpm \
  ibm-csm-core-1.3.0-*.ppc64le.rpm \
  ibm-csm-api-1.3.0-*.ppc64le.rpm \
  ibm-csm-bds-1.3.0-*.noarch.rpm \
  ibm-csm-bds-logstash-1.3.0-*.noarch.rpm \
  ibm-csm-hcdiag-1.3.0-*.noarch.rpm \
  ibm-csm-restd-1.3.0-*.ppc64le.rpm

6.2.6 Installing CSM in the login, launch, and workload manager nodes

For the login, launch, and workload manager (WLM) nodes, CSM supports diskless and diskful installation.

Diskless
To do a diskless installation, clone the existing images, add the following packages to the otherpkgs directory and list, and then run the genimage command. For more information, see 6.2.21, “Diskless images” on page 150.
cast-boost-*
ibm-flightlog-1.3.0-*.ppc64le
ibm-csm-core-1.3.0-*.ppc64le
ibm-csm-api-1.3.0-*.ppc64le
ibm-csm-hcdiag-1.3.0-*.noarch

Diskful
To do a diskful installation, install the cast-boost RPMs by running the following command:
$ rpm -ivh cast-boost-*.rpm

Install the flightlog RPM and the following CSM RPMs by running the rpm command:
 ibm-flightlog
 ibm-csm-core
 ibm-csm-api
 ibm-csm-hcdiag
$ rpm -ivh ibm-flightlog-1.3.0-*.ppc64le.rpm \
  ibm-csm-core-1.3.0-*.ppc64le.rpm \
  ibm-csm-api-1.3.0-*.ppc64le.rpm \
  ibm-csm-hcdiag-1.3.0-*.noarch.rpm

6.2.7 Installing CSM in the compute nodes

For compute nodes, CSM supports diskless and diskful installation.

Diskless
To do a diskless installation, clone the existing images, add the following packages to the otherpkgs directory and list, and then run the genimage command. For more information, see 6.2.21, “Diskless images” on page 150.
cast-boost-*
ibm-flightlog-1.3.0-*.ppc64le
ibm-csm-core-1.3.0-*.ppc64le
ibm-csm-api-1.3.0-*.ppc64le
ibm-csm-hcdiag-1.3.0-*.noarch

Diskful
To do a diskful installation, install the cast-boost RPMs by running the following command. Replace "/path/to/rpms" with the appropriate location for your system.
$ xdsh compute "cd /path/to/rpms; rpm -ivh cast-boost-*.rpm"

Install the flightlog RPM and the following CSM RPMs.

Note: Replace "/path/to/rpms" with the appropriate location for your system.

 ibm-flightlog
 ibm-csm-core
 ibm-csm-api
 ibm-csm-hcdiag
$ xdsh compute "cd /path/to/rpms; \
  rpm -ivh ibm-flightlog-1.3.0-*.ppc64le.rpm \
  ibm-csm-core-1.3.0-*.ppc64le.rpm \
  ibm-csm-api-1.3.0-*.ppc64le.rpm \
  ibm-csm-hcdiag-1.3.0-*.noarch.rpm"

6.2.8 Configuration

After the CSM components are installed, start the configuration by raising the open files limit by running the following command:
$ ulimit -n 50000
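The ulimit command changes the limit only for the current shell. A minimal sketch of making the limit persistent across logins, assuming the standard pam_limits mechanism is enabled on the node (this approach is not part of the CSM documentation; the value mirrors the command above and should follow your site policy):
# Append per-user open-files limits to /etc/security/limits.conf (assumes pam_limits is in use).
$ cat >> /etc/security/limits.conf << 'EOF'
*    soft    nofile    50000
*    hard    nofile    50000
EOF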

6.2.9 Configuring the CSM database

The CSM database (CSM DB) holds information about the system hardware configuration, hardware inventory, RAS, diagnostics, job steps, job allocations, and CSM configuration. This information is essential for the HPC system to run properly and for resource accounting.

On the management node, create the csmdb schema by running csm_db_script.sh, which is shown in Example 6-1. This script assumes that xCAT has been migrated to PostgreSQL. For more information about this migration process, see Configure PostgreSQL - xCAT 2.14.5 documentation.

Example 6-1 The csm_db_script.sh script
$ /opt/ibm/csm/db/csm_db_script.sh
------------------------------------------------------------------------------
[Start] Welcome to CSM database automation script
[Info] PostgreSQL is installed
[Info] csmdb database user: csmdb already exists
[Complete] csmdb database created
[Complete] csmdb database tables created
[Complete] csmdb database functions and triggers created
[Complete] csmdb table data loaded successfully into csm_db_schema_version
[Complete] csmdb table data loaded successfully into csm_ras_type
[Info] csmdb DB schema version (16.1)
------------------------------------------------------------------------------

This script creates the csmdb database and configures it with the default settings.

6.2.10 Default configuration files

To change the default configuration, complete the following steps:
1. On the management node, copy the default configuration and access control list (ACL) files from /opt/ibm/csm/share/etc to /etc/ibm/csm by running the following commands:
   $ cp /opt/ibm/csm/share/etc/*.cfg /etc/ibm/csm/
   $ cp /opt/ibm/csm/share/etc/csm_api.acl /etc/ibm/csm/
2. Review the configuration and ACL files. Make the following suggested updates (the host names can also be IP addresses, especially if a particular network interface is wanted for CSM communication). A scripted sketch of these substitutions is shown after these steps.
   – Substitute all "__MASTER__" occurrences in the configuration files with the management node host name.
   – On aggregator configurations, substitute "__AGGREGATOR__" with the corresponding service node host name.
   – On compute configurations, substitute "__AGGREGATOR_A__" with the assigned primary aggregator.
   – On compute configurations, substitute "__AGGREGATOR_B__" with the secondary aggregator, or leave it untouched if you set up a system without failover.
   – If an aggregator also runs on the management node, make sure to provide a unique entry for csm.net.local_client_listen.socket to avoid name collisions and strange behavior.
   – Create a Linux group for privileged access:
     • Add users to this group.
     • Make this group privileged in the ACL file.

Note: For more information, see 6.3.1, “Configuring user control, security, and access level”, in the CAST/CSM 1.0.0 Installation and Configuration Guide.

   – From the management node, copy the configuration files to the proper nodes by running the following commands:
     $ xdcp compute /etc/ibm/csm/csm_compute.cfg /etc/ibm/csm/csm_compute.cfg
     $ xdcp login,launch /etc/ibm/csm/csm_utility.cfg /etc/ibm/csm/csm_utility.cfg
     $ xdcp compute,login,launch /etc/ibm/csm/csm_api.acl /etc/ibm/csm/csm_api.acl
     $ xdcp compute,login,launch /etc/ibm/csm/csm_api.cfg /etc/ibm/csm/csm_api.cfg
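Because the placeholders are plain strings, the substitutions in step 2 can be scripted. The following is a minimal sketch; mgmtnode, servicenode, and servicenode2 are example host names only (not values from this cluster), and only the configuration files that are named in this section are touched:
# Example host names; adjust for your cluster before running.
$ sed -i 's/__MASTER__/mgmtnode/g' /etc/ibm/csm/*.cfg
$ sed -i -e 's/__AGGREGATOR_A__/servicenode/g' \
         -e 's/__AGGREGATOR_B__/servicenode2/g' /etc/ibm/csm/csm_compute.cfg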

6.2.11 Configuring SSL

If you want an SSL setup, configure the csm.net.ssl section of the configuration file by editing it as follows:
"ssl" : {
    "ca_file"  : "",
    "cred_pem" : ""
}

If the strings are non-empty, the daemon assumes that SSL is requested. If the SSL setup fails, it does not fall back to non-SSL communication.

6.2.12 Heartbeat interval

You can tune the CSM daemons' heartbeat interval by configuring the csm.net section of the configuration files as follows:
"net" : {
    "heartbeat_interval" : 15,
    "local_client_listen" : {
        "socket" : "/run/csmd.sock",
        "permissions" : 777,
        "group" : ""
    },

The heartbeat interval setting defines the time between two subsequent unidirectional heartbeat messages in case there is no other network traffic on a connection. The default is 15 seconds. If two daemons are configured with a different interval, they use the minimum of the two settings for the heartbeat on the connection, which means that you can configure a short interval between aggregator and master daemons and a longer interval between aggregator and compute daemons to reduce network interference on compute nodes.

It might take up to three intervals to detect a dead connection because of the following heartbeat process: after receiving a message, the daemon waits one interval before sending its own heartbeat. If it does not get a heartbeat after one more interval, it retries and waits for another interval before declaring the connection broken. For example, with the default 15-second interval, a broken link might not be declared until approximately 45 seconds have elapsed.

This setting must balance the requirements between fast detection of dead connections and network traffic impact. If a daemon fails or is shut down, the closed socket is detected immediately in many cases.

The heartbeat-based detection is mostly needed only for errors in the network hardware itself (for example, a broken or disconnected cable, switch, or port).

6.2.13 Environmental buckets

The daemons are enabled to run predefined environmental data collection. The execution is controlled by the csm.data_collection section in the configuration files. This section can list a number of buckets, each with a list of the predefined items. Currently, CSM supports two types of data collection items: gpu and environmental. The recommended collection interval for these items is every 10 minutes. Data that is collected by using this function is sent from the CSM compute daemons through the CSM aggregator daemons to a Logstash instance for ingestion into the BDS. An illustrative configuration sketch follows.
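The key names in the following snippet are illustrative only and are not copied from a shipped configuration file; check the default configuration files in /opt/ibm/csm/share/etc for the exact syntax on your release. The sketch shows a csm.data_collection section with one bucket that runs the predefined gpu and environmental items every 10 minutes:
"data_collection" : {
    "buckets" : [
        {
            "execution_interval" : "00:10:00",
            "item_list" : [ "gpu", "environmental" ]
        }
    ]
}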

6.2.14 Prolog and epilog scripts

To create allocations, both privileged_prolog and privileged_epilog scripts must be present in /opt/ibm/csm/prologs/ on the compute nodes. Review the sample scripts and make any customizations that are required.

To use the packaged sample scripts (/opt/ibm/csm/share/prologs/), run the following command on the management node:
$ xdcp compute /opt/ibm/csm/share/prologs/* /opt/ibm/csm/prologs/

The command copies the following files to the compute nodes:
 privileged_prolog (access: 700)
 privileged_epilog (access: 700)
 privileged.ini (access: 600)

The privileged_prolog and privileged_epilog files are Python scripts that take command-line arguments for type, user flags, and system flags. The type is either allocation or step, and the flags are space-delimited alphanumeric strings. If a new version of one of these scripts is written, it must implement the following options (an example invocation follows this list):
 --type [allocation|step]
 --user_flags "[string of flags, spaces allowed]"
 --sys_flags "[string of flags, spaces allowed]"
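As an illustration of the required interface only, a replacement prolog script must accept an invocation such as the following sketch; the flag values shown are made-up examples, not flags that CSM defines:
# Hypothetical invocation; the flag values are examples only.
$ /opt/ibm/csm/prologs/privileged_prolog --type allocation \
    --user_flags "trace_on" --sys_flags "isolate_cores"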

The privileged.ini file configures the logging levels for the scripts. It is needed only if you are extending or using the packaged scripts. For more information, see the comments in the bundled scripts, the packaged POST_README.md file, or section 5.3.4, “Prolog/Epilog Scripts Compute”, in the CAST/CSM 1.0.0 Installation and Configuration Guide.

6.2.15 CSM Pluggable Authentication Module

The ibm-csm-core RPM installs a Pluggable Authentication Module (PAM) module that can be enabled by a system administrator.

This module performs two operations:
 Prevents unauthorized users from obtaining access to a compute node.
 Places users who have active allocations into the correct cgroup.

To enable the CSM PAM module for SSH sessions, complete the following steps:
1. Add always-authorized users to /etc/pam.d/csm/whitelist (newline-delimited).

2. Uncomment the following lines in /etc/pam.d/sshd:
   account required libcsmpam.so
   session required libcsmpam.so
3. Restart the ssh daemon by running the following command:
   $ systemctl restart sshd.service
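A minimal sketch of step 1, assuming that the administrative users admin1 and admin2 (example names only) must always be allowed to log in:
# Example user names; the whitelist takes one user per line.
$ cat >> /etc/pam.d/csm/whitelist << 'EOF'
admin1
admin2
EOF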

Non-root users who do not have an active allocation on the node and are not whitelisted are now prevented from logging in to the node by using SSH. Users who have an active allocation are placed into the appropriate cgroup when logging in by using SSH.

For more information about the behavior and configuration of PAM, see /etc/pam.d/csm/README.md or the “Configuring the CSM PAM module” section in the CAST/CSM 1.0.0 Installation and Configuration Guide.

Note: The CSM PAM module must be enabled only on nodes that run the CSM compute daemon because SSH logins are restricted to root and users that are specified in the whitelist.

6.2.16 Starting the CSM daemons

First, start the CSM daemon dependencies, and then start the CSM daemons.

Login, launch, and compute nodes
Start NVIDIA persistence and DCGM by running the following commands:
$ xdsh compute,service,utility "systemctl start nvidia-persistenced"
$ xdsh compute,service,utility "systemctl start dcgm"

Management node
Start the master daemon by running the following command:
$ systemctl start csmd-master

Start the aggregator daemon by running the following command:
$ systemctl start csmd-aggregator

Login and launch nodes
Start the utility daemon by running the following command:
$ xdsh login,launch "systemctl start csmd-utility"

Compute node
Start the compute daemon by running the following command:
$ xdsh compute "systemctl start csmd-compute"

6.2.17 Running the infrastructure health check

Run the CSM infrastructure health check on the login or launch node (LN) to verify the infrastructure status by running the following command:
# /opt/ibm/csm/bin/csm_infrastructure_health_check -v
Starting. Contacting local daemon...

Connected. Checking infrastructure... (this may take a moment. Please be patient...)

###### RESPONSE FROM THE LOCAL DAEMON #######
MASTER: c650mnp06 (bounced=0; version=1.3.0)
  DB_free_conn_size: 10
  DB_locked_conn_pool_size: 0
  Timer_test: success
  DB_sql_query_test: success
  Multicast_test: success
  Network_vchannel_test: success
  User_permission_test: success
  UniqueID_test: success
  Aggregators: 1
    AGGREGATOR: c650f02p05 (bounced=0; version=1.3.0)
      Active_primary: 4
      Unresponsive_primary: 0
      Active_secondary: 0
      Unresponsive_secondary: 0

      Primary Nodes: Active: 4
        COMPUTE: c650f02p17 (bounced=5; version=1.3.0; link=PRIMARY)
        COMPUTE: c650f02p13 (bounced=5; version=1.3.0; link=PRIMARY)
        COMPUTE: c650f02p11 (bounced=5; version=1.3.0; link=PRIMARY)
        COMPUTE: c650f02p09 (bounced=5; version=1.3.0; link=PRIMARY)
      Unresponsive: 0

      Secondary Nodes: Active: 0
      Unresponsive: 0

  Unresponsive Aggregators: 0
  Utility Nodes: 1
    UTILITY: c650f02p07 (bounced=0; version=1.3.0)

  Unresponsive Utility Nodes: 0

Local_daemon: MASTER: c650mnp06 (bounced=0; version=1.3.0)
Status:
#############################################

Finished. Cleaning up...
Test complete: rc=0
#

Note: In some cases, the list and status of nodes might not be 100% accurate if there have been infrastructure changes immediately before or during the test. These changes usually result in timeout warnings and a rerun of the test returns an updated status.

If you have any unresponsive compute nodes, consider the following items:
 Unresponsive nodes do not show a daemon build version and do not list the connection type as primary or secondary.
 The unresponsive nodes cannot provide information about their configured primary or secondary aggregator. Instead, the aggregators report the last known connection status of those compute nodes. For example, if the compute node uses a connection as the primary link even if the compute configuration defines the connection as secondary, the aggregator shows this compute node as an unresponsive primary node.

6.2.18 Setting up the environment for job launch

To update the compute node state to IN_SERVICE, run the CSM API csm_node_attributes_update command:
$ /opt/ibm/csm/bin/csm_node_attributes_update -s IN_SERVICE -n c650f02p09
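To confirm the change, you can query the node attributes again. The following sketch assumes the csm_node_attributes_query CLI that is packaged with ibm-csm-api; check the command help on your installation for the exact options:
$ /opt/ibm/csm/bin/csm_node_attributes_query -n c650f02p09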

6.2.19 Installing and configuring the CSM REST daemon

The CSM REST daemon must be installed and configured on the management node and service nodes. By using the CSM REST daemon, you can use ibm-crassd to create CSM RAS events for events that are detected from compute node BMCs. You also can use CSM Event Correlator to create CSM RAS events from console logs. For example, GPU XID errors are monitored by the CSM Event Correlator mechanism.

On the service nodes
Complete the following steps:
1. Install the ibm-csm-restd RPM if it is not already installed by running the following command:
   $ rpm -ivh ibm-csm-restd-1.3.0-*.ppc64le.rpm
2. Copy the default configuration file from /opt/ibm/csm/share/etc to /etc/ibm/csm:
   $ cp /opt/ibm/csm/share/etc/csmrestd.cfg /etc/ibm/csm/
3. Edit /etc/ibm/csm/csmrestd.cfg and replace "__CSMRESTD_IP__" with "127.0.0.1".

The CSM REST daemon requires that the local CSM daemon is running before it is started. In addition, csmrestd must be restarted whenever the local CSM daemon is restarted.

Start csmrestd by running the systemctl command:
$ systemctl start csmrestd

On the management node (optional)
If the CSM DB on your management node is not re-created as part of the CSM installation and you intend to enable IBM POWER HW Monitor collection of BMC RAS events from your service nodes, manually update the CSM RAS types in the CSM DB.

With all daemons stopped, run the following command:
$ /opt/ibm/csm/db/csm_db_ras_type_script.sh -l csmdb csm_ras_type_data.csv

This command imports into the CSM DB any CSM RAS types that were added in later releases, and it is a no-op for any events that already exist in the CSM DB.

6.2.20 Uninstalling the CSM daemons

To uninstall the CSM daemons, complete the following steps:
1. Stop all CSM daemons by running the following commands:
   $ xdsh compute "systemctl stop csmd-compute"
   $ xdsh utility "systemctl stop csmd-utility"
   $ systemctl stop csmd-aggregator
   $ systemctl stop csmd-master
2. Delete the CSM DB by running the following command:
   $ /opt/ibm/csm/db/csm_db_script.sh -d csmdb
3. Remove the RPMs by running the following commands:
   $ xdsh compute,utility "rpm -e ibm-csm-core-1.3.0-*.ppc64le ibm-csm-api-1.3.0-*.ppc64le ibm-flightlog-1.3.0-*.ppc64le ibm-csm-hcdiag-1.3.0-*.noarch"

   $ rpm -e ibm-csm-core-1.3.0-*.ppc64le ibm-csm-hcdiag-1.3.0-*.noarch ibm-csm-db-1.3.0-*.noarch ibm-csm-api-1.3.0-*.ppc64le ibm-csm-restd-1.3.0-*.ppc64le ibm-flightlog-1.3.0-*.ppc64le
4. Clean up the logs and configuration files by running the following commands:
   $ xdsh compute,utility "rm -rf /etc/ibm /var/log/ibm"
   $ rm -rf /etc/ibm/csm /var/log/ibm/csm
5. Stop the NVIDIA persistence daemon and the NVIDIA host engine (DCGM) by running the following commands:
   $ xdsh compute,service,utility "systemctl stop nvidia-persistenced"
   $ xdsh compute,utility /usr/bin/nv-hostengine -t

6.2.21 Diskless images

Ensure that the following steps are completed on the xCAT management node:
1. xCAT is installed and the basic configuration is complete.
2. The steps in 6.2.4, “Installing CSM on to the management node” on page 141 are complete and the cast-boost RPMs are accessible in the ${HOME}/rpmbuild/RPMS/ppc64le directory.
3. The ibm-flightlog-1.3.0-*.ppc64le file is present on the xCAT management node.
4. You installed createrepo to build the other packages directory.

After verifying that these steps are complete, perform the following steps:
1. Generate the operating system images (osimages) for your version of Red Hat Enterprise Linux by running the following commands:
   $ copycds RHEL-7.5-Server-ppc64le-dvd1.iso
   $ lsdef -t osimage
2. Copy the netboot image of osimages and rename it by running the following commands:
   $ image_input="rhels7.5-ppc64le-netboot-compute"
   $ image_output="rhels7.5-ppc64le-diskless-compute"
   $ lsdef -t osimage -z $image_input | sed -e "s/$image_input/$image_output/g" | mkdef -z

3. Move the cast-boost RPMs to the otherpkgdir directory for the generation of the diskless image by running the following commands:
   $ lsdef -t osimage rhels7.5-ppc64le-diskless-compute
   $ cp cast-boost* /install/post/otherpkgs/rhels7.5/ppc64le/cast-boost
   $ createrepo /install/post/otherpkgs/rhels7.5/ppc64le/cast-boost
4. Move the CSM RPMs to the otherpkgdir directory by running the following commands:
   $ cp csm_rpms/* /install/post/otherpkgs/rhels7.5/ppc64le/csm
   $ createrepo /install/post/otherpkgs/rhels7.5/ppc64le/csm
5. Run the createrepo command:
   $ createrepo /install/post/otherpkgs/rhels7.5/ppc64le
6. Add the following packages to a package list in otherpkgdir and then add the package list to the osimage by running the following commands:
   $ vi /install/post/otherpkgs/rhels7.5/ppc64le/csm.pkglist
   cast-boost/cast-boost-*
   csm/ibm-flightlog-1.3.0-*.ppc64le
   csm/ibm-csm-core-1.3.0-*.ppc64le
   csm/ibm-csm-api-1.3.0-*.ppc64le

   $ chdef -t osimage rhels7.5-ppc64le-diskless-compute otherpkglist=/install/post/otherpkgs/rhels7.5/ppc64le/csm.pkglist
7. Generate and package the diskless image by running the following commands:
   $ genimage rhels7.5-ppc64le-diskless-compute
   $ packimage rhels7.5-ppc64le-diskless-compute

6.3 Burst Buffer

BB is a cost-effective mechanism that can improve I/O performance for a large class of high-performance computing (HPC) applications without requiring intermediary hardware. BB provides the following functions:
 A fast storage tier between compute nodes and the traditional parallel file system
 Overlapping job stage-in and stage-out of data for checkpoint and restart
 Scratch volumes
 Extended memory I/O workloads

6.3.1 Installing Burst Buffer

This section describes the installation of BB.

Prerequisites
Red Hat Enterprise Linux 7.5 or later for POWER9 processor-based servers must be installed on the nodes. Use the following RPMs:
 ibm-burstbuffer-1.3.0-x.ppc64le.rpm
 ibm-burstbuffer-lsf-1.3.0-x.ppc64le.rpm
 ibm-burstbuffer-mn-1.3.0-x.ppc64le.rpm
 ibm-csm-core-1.3.0-1347.ppc64le.rpm

 ibm-export_layout-1.3.0-x.ppc64le.rpm
 ibm-flightlog-1.3.0-x.ppc64le.rpm

Security certificates
The connection between bbProxy and bbServer is secured by an X.509 certificate. To create a certificate, use the OpenSSL tool. For convenience, you can use the provided bash script:
$ /opt/ibm/bb/scripts/mkcertificate.sh

This command generates two files and only needs to be run on a single node:
-rw-r--r-- 1 root root cert.pem
-rw------- 1 root root key.pem

The key.pem file is the private key and must be kept secret. As a best practice, copy this file to /etc/ibm/key.pem on each of the bbServer nodes. The cert.pem file must be copied to all bbServer and compute nodes. The cert.pem file can be deployed to the compute nodes in various ways (a distribution sketch follows this list):
 cert.pem can be placed in the shared IBM Spectrum Scale storage.
 cert.pem can be placed in the xCAT compute image.
 cert.pem can be copied with rsync or scp to each compute node after boot.
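As an example of the third option, the following sketch uses the xCAT xdcp command for the certificate and plain scp for the private key; essio1 is a made-up bbServer host name, and the files are assumed to be in the current directory:
# Distribute the public certificate to all compute nodes.
$ xdcp compute cert.pem /etc/ibm/cert.pem
# Copy the private key only to a bbServer (ESS I/O) node; essio1 is an example host name.
$ scp key.pem essio1:/etc/ibm/key.pem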

IBM Elastic Storage Server I/O node virtual machine setup
On each IBM Elastic Storage Server I/O node virtual machine (VM) (or equivalent), install these RPMs:
 ibm-burstbuffer-1.3.0-*.ppc64le.rpm
 ibm-flightlog-1.3.0-*.ppc64le.rpm
 ibm-csm-core-1.3.0-*.ppc64le.rpm

The NVMe driver must be built with Remote Direct Memory Access (RDMA) enabled. The Mellanox OpenFabrics Enterprise Distribution (OFED) driver can accomplish this task if you run the following command:
% mlnxofedinstall --with-nvmf

Security files
The cert.pem and key.pem files must be placed in the /etc/ibm directory on each of the bbServer nodes. This task can be done after the VM is started or during image creation.

Starting BB services
The BB server can be started by running the following command on each IBM Elastic Storage Server I/O node VM (or equivalent):
$ /opt/ibm/bb/scripts/bbactivate --server

This command uses the default BB configuration file in the RPM (unless it is overridden by the --config flag) and starts the BB server. It also adds the NVMe over Fabrics block device pattern to global_filter in /etc/lvm/lvm.conf (unless the global filter line has been modified).

Setting up the compute node
This section describes the compute node setup.

Installing the RPMs
On each compute node, install these RPMs:
 ibm-burstbuffer-1.3.0-*.ppc64le.rpm
 ibm-flightlog-1.3.0-*.ppc64le.rpm
 ibm-export_layout-1.3.0-*.ppc64le.rpm
 ibm-csm-core-1.3.0-*.ppc64le.rpm

The NVMe driver must be built with RDMA enabled. The Mellanox OFED driver can accomplish this task by running the following command:
% mlnxofedinstall --with-nvmf

Security files
The cert.pem file must be placed in the /etc/ibm directory on each of the compute nodes. This task can be done after the node is started or during image creation. The private key (key.pem) must not be placed on the compute nodes.

Generating the compute node and IBM Elastic Storage Server list
The BB has a static assignment of compute nodes to bbServers. This relationship is defined by two files that are specified by the bbactivate tool.

The first file (nodelist) is a list of the xCAT names for all compute nodes, with one compute node per line:
c650f07p23
c650f07p25
c650f07p27

This nodelist can be generated by running the following xCAT command:
$ lsdef all | grep "Object name:" | cut -f 3 -d ' '

The second file (esslist) contains a list of IP addresses and ports for each bbServer. In the planned configuration, this is the IBM Elastic Storage Server I/O node VM IP address plus a well-known port (for example, 9001). For IBM Elastic Storage Server redundancy, place the two IBM Elastic Storage Server I/O nodes within the same IBM Elastic Storage Server:
20.7.5.1:9001
20.7.5.2:9001
20.7.5.3:9001
20.7.5.4:9001

Starting the Burst Buffer services
On each compute node, run the bbactivate script:
$ /opt/ibm/bb/scripts/bbactivate

Note: The bbactivate script attempts to assign the same number of compute nodes per IBM Elastic Storage Server I/O node. For n compute nodes and m IBM Elastic Storage Server I/O nodes, the first n/m compute nodes in the list are assigned to the first IBM Elastic Storage Server I/O node in the list, the second group gets the second IBM Elastic Storage Server I/O node that is listed, and so on. For example, with 12 compute nodes and 4 I/O nodes, the first three compute nodes in the nodelist are served by the first I/O node in the esslist.

Running the bbServer on a different node than bbProxy requires a configured networked block device. If no block device is configured, the bbactivate script attempts to establish an NVMe over Fabrics connection between the two nodes when bbProxy is started.

Whenever a compute node is restarted or a solid-state drive (SSD) is replaced, rerun the bbactivate script.

Note: When bbServer establishes the NVMe over Fabrics connection, it uses the /opt/ibm/bb/scripts/nvmfConnect.sh script.

Setting up the launch and login node
This section describes how to set up the launch and login node.

Installing the RPMs
On each launch and login node, install these RPMs:
 ibm-burstbuffer-1.3.0-*.ppc64le.rpm
 ibm-flightlog-1.3.0-*.ppc64le.rpm
 ibm-csm-core-1.3.0-*.ppc64le.rpm
 ibm-burstbuffer-lsf-1.3.0-*.ppc64le.rpm

The burstbuffer-lsf RPM is relocatable; for example, its scripts can be relocated into $LSF_SERVERDIR at installation time by running the following command:
$ rpm -ivh --relocate /opt/ibm/bb/scripts=$LSF_SERVERDIR ibm-burstbuffer-lsf-1.3.0-*.ppc64le.rpm

Setting up IBM Spectrum LSF
You must do further IBM Spectrum LSF configuration to set up the data transfer queues. For more information, see IBM Knowledge Center.

As a best practice, add the following parameter to the lsf.conf file so that the BB esub.bb and epsub.bb scripts run when a job is submitted to set up key environment variables for $BBPATH and Burst Buffer Shared Checkpoint File System (BSCFS):
LSB_ESUB_METHOD=bb

Configuring Burst Buffer
A directory is used to store job-specific BSCFS metadata between job execution and job stageout. Create a path in the parallel file system for BSCFS temporary files. The workpath must be accessible to all users.

A path is also needed to specify temporary storage for job-related metadata from job submission through job stageout. It must be a location that can be written by the user and read by root, and it must be accessible by the nodes that are used for job submission and launch. It does not need to be accessible by the compute nodes. If the user home directories are readable by root, you can use --envdir=HOME.

For IBM Spectrum LSF configuration, several scripts must be copied into $LSF_SERVERDIR. The files that are copied from /opt/ibm/bb/scripts are esub.bb, epsub.bb, esub.bscfs, epsub.bscfs, bb_pre_exec.sh, and bb_post_exec.sh. The bbactivate script can copy these files automatically. Alternatively, the burstbuffer-lsf RPM is relocatable.
$ /opt/ibm/bb/scripts/bbactivate --ln --bscfswork=$BSCFSWORK --envdir=HOME --lsfdir=$LSF_SERVERDIR

Setting up the management node (optional)
This section describes the optional management node setup.

Installing the RPMs
On the CSM management node, install the following RPM:
ibm-burstbuffer-mn-1.3.0-*.ppc64le.rpm

Adding Burst Buffer RAS into the CSM database
RAS definitions for the BB can be added to the CSM PostgreSQL tables by running the following script:
$ /opt/ibm/csm/db/csm_db_ras_type_script.sh -l csmdb /opt/ibm/bb/scripts/bbras.csv

The command must be run on the CSM management node. The ibm-burstbuffer-mn RPM must also be installed on the management node.

If the RAS definitions are not added, the bbProxy log shows errors when posting any RAS messages, but these errors are benign.

Stopping the services
To stop the BB processes, run the bbactivate script:
$ /opt/ibm/bb/scripts/bbactivate --shutdown

To disconnect all NVMe over Fabrics connections, run the following command from each I/O node:
$ nvme disconnect -n burstbuffer

Using BB administrator failover
There might be times when the node that is running bbServer must be taken down for scheduled maintenance. The BB provides a mechanism to migrate transfers to a backup bbServer. The backup bbServer is defined in the configuration file under backupcfg.

To switch to the backup server on three compute nodes cn1, cn2, and cn3, run the following command:
$ sudo /opt/ibm/bb/scripts/setServer --server=backup --hosts=cn1,cn2,cn3

To switch back to the primary server on three compute nodes cn1, cn2, and cn3, run the following command:
$ sudo /opt/ibm/bb/scripts/setServer --server=primary --hosts=cn1,cn2,cn3

If submitting the switchover by using an IBM Spectrum LSF job that runs as root, the --hosts parameter may be removed because setServer uses the compute nodes that are assigned by IBM Spectrum LSF.

Optional configurations
This section contains optional configurations.

Single node loopback
The bbProxy and bbServer can run on the same node, although this is a development or bring-up configuration (for example, for developing applications that use the bbAPI). In this configuration, the IBM Elastic Storage Server I/O node list contains a line specifying a loopback address (127.0.0.1:9001) for each compute node. Both lists must have the same number of lines.

Configuring bbProxy without Cluster Systems Management
The bbProxy can update CSM on the state of the logical volumes and post RAS events by using CSM interfaces. This behavior is configured by the bbactivate script:
$ /opt/ibm/bb/scripts/bbactivate --csm
$ /opt/ibm/bb/scripts/bbactivate --nocsm

The default is to enable CSM.

Configuring without Health Monitor
The BB has an external process that can monitor the bbProxy to bbServer connection. If the connection goes offline, the health monitor either attempts to reestablish the connection or (if defined) establishes a connection with the backup server.

By default, bbactivate starts the BB health monitor. This behavior can be changed by specifying the --nohealth option.

6.3.2 Using the Burst Buffer

This section describes how to use the BB.

Using bbcmd
The bbcmd command runs on the LNs and interacts with the bbProxy on the targeted compute nodes. For more information, run the following command:
$ bbcmd --help

When bbcmd is run on an LN, the following argument must be specified:
--target=0-3

The --target argument can be loosely interpreted as the indexes to the list of hosts for the job. The default host list can be overridden by using the --hostlist option. If that option is not specified, bbcmd generates it based on either the CSM allocation or IBM Spectrum LSF environment variables.
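The following sketch shows only the argument pattern; <subcommand> is a placeholder for whichever bbcmd operation you intend to run (consult bbcmd --help for the real subcommands), and the host names are examples:
# Pattern only: <subcommand> stands for a real bbcmd operation; cn1-cn4 are example hosts.
$ bbcmd <subcommand> --target=0-3 --hostlist=cn1,cn2,cn3,cn4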

bbcmd documentation
The bbcmd documentation is available in two forms: a man page and help text.

To access the man page, run the following command:
$ man bbcmd

To access the help text, run the following command:
$ bbcmd --help

Using the bbAPI library
To include the bbAPI headers and library and to set the rpath to point to the bbAPI library, add the following options to the compile command:
-I/opt/ibm -Wl,--rpath /opt/ibm/bb/lib -L/opt/ibm/bb/lib -lbbAPI

An example test case can be built by running the following command:
$ mpicc /opt/ibm/bb/tests/test_basic_xfer.c -o test_basic_xfer -I/opt/ibm -L/opt/ibm/bb/lib -Wl,--rpath=/opt/ibm/bb/lib -lbbAPI

Using BSCFS
To set up the BSCFS environment, the stage-in and stage-out scripts need a special indicator from the user to avoid incurring a setup impact on every job.

When the user specifies an IBM Spectrum LSF application (-a) of bscfs, the stage-in and stage-out scripts manage the configuration of the BSCFS mount point:
$ bsub -q bbq -a bscfs -stage storage=100 jsrun /gpfs/joeuser/myexecutable

After the command starts running, the BSCFS FUSE mount point is created.

Environment variables Table 6-5 shows the BB environment variables.

Table 6-5 BB environment variables
Variable name | Description
BBPATH | Path to the XFS file system root directory. This is set by the IBM Spectrum LSF esub.bb script.
LSF_STAGE_JOBID or LSF_JOBID | LSF job ID. LSF_STAGE_JOBID refers to the job ID that is submitted by the user during stage-in and stage-out scripts (in this context, LSB_JOBID refers to the data transfer job).
PMIX_NAMESPACE | Job step ID. This is set by jsrun.
BSCFS_MNT_PATH | Mount point of BSCFS (usually /bscfs). This is set by the IBM Spectrum LSF esub.bb script.
BSCFS_PFS_PATH | The path into the parallel file system that BSCFS mirrors. This variable must be set by the user for BSCFS to function.
BSCFS_WORK_DIR | The directory that BSCFS uses to place control files. This variable must be set by the user for BSCFS to function.
BSCFS_STGIN_LISTFILE | A file that specifies the BSCFS files that the user wants pre-staged. This variable must be set by the user to trigger BSCFS pre-staging.

6.3.3 Troubleshooting

This section provides problem determination information.

Capturing debug information
If there is a failure, as a best practice preserve the state of the following files:
 /var/log/bbserver/*
 /var/log/bbproxy/*
 Any core files

The /var/log path contains the proxy and server console logs and flightlogs.

The JSON detailed error status is useful too. This status can be obtained either from the output of bbcmd or by calling the BB_GetLastErrorDetails() function.

To inspect the flightlogs, run either of the following commands:
$ decoder /var/log/bbserver/FL*
$ decoder /var/log/bbproxy/FL*

Recovering from abnormal termination
If a logical volume that was previously allocated by the BB services is not mounted when bbProxy is restarted, bbProxy removes such logical volumes as part of the restart. However, if such logical volumes are still mounted, they are not removed during the restart of bbProxy. In that case, it might be necessary to remove them manually after the services are restarted. Before you remove a logical volume, it must first be unmounted.

If job IDs are reused or recycled and the metadata on one or more of the bbServers already has existing information about that job ID, a best practice is to remove that metadata by running the following command:
$ bbcmd removejobinfo --jobid=<jobid>

If bbProxy does not start because of an existing UNIX socket file, remove the socket file at the location that is specified by the bb.unixpath key in the configuration file. If you are using the provided configuration file or if the bb.unixpath key is not found in the configuration file, this location is /run/bb_socket.

If bbServer’s metadata becomes stale or corrupted, you can recover it by stopping all instances of bbProxy and bbServer and then removing the directory that is provided by the bb.bbserverMetadataPath key in the configuration file. All bbServers on the system must use the same copy of the metadata as given by this key in the configuration file. If you use the provided configuration file or if the bb.bbserverMetadataPath key is not found in the configuration file, this location is /gpfs/gpfs0/bbmetadata.

If bbServer or bbProxy fail when attempting to bind to an address, double-check that a bbServer or bbProxy process is not already running on the system. If no processes are found, it is possible that Linux needs a couple of minutes before reclaiming the port.

Part 3

Part 3 Application development

This part describes how to compile and run your application. The idea behind the chapters in this part is to show you how to use the clustered high-performance computing (HPC) environment to run your workload and take advantage of the computing power of the solution. This part also includes a chapter describing how to measure and tune the applications.

The following chapters are included in this part:  Compilation, execution, and application development  Running parallel software, performance enhancement, and scalability testing  Measuring and tuning applications


Chapter 7. Compilation, execution, and application development

This chapter provides information about software, compilers, and tools that can be used for application development and tuning on the IBM Power System AC922 server. A few development models also are described.

The following topics are described in this chapter:
 Compiler options
 Porting applications to IBM Power Systems servers
 IBM Engineering and Scientific Subroutine Library
 IBM Parallel Engineering and Scientific Subroutine Library
 Using POWER9 vectorization
 Development models

7.1 Compiler options

Compiler options are one of the main tools that are used to debug and optimize the performance of your code during development. Several compilers, including IBM Xpertise Library (XL); GNU Compiler Collection (GCC); The Portland Group, Inc. (PGI); and NVIDIA compilers, support the latest IBM POWER9 processor-based features and enhancements.

7.1.1 IBM XL compiler options

IBM XL C/C++ for Linux V16.1.0 and IBM XL Fortran for Linux V16.1.0 support POWER9 processors with new features and enhancements, such as Little Endian distributions support, compiler optimizations, and built-in functions for POWER9 processors. In addition, many Open Multi-Processing (OpenMP) 4.5 features were introduced1, such as the constructs that enable the loading and offloading of applications and data to NVIDIA graphics processing unit (GPU) technology.

By default, these compilers generate code that runs on various IBM Power Systems servers. Options can be added to exclude older processors that are not supported by the target application. The following major IBM XL compiler options control this support:
-mcpu    Specifies the family of the processor architecture for which the instruction code is generated.
-mtune   Indicates the processor generation of most interest for performance. It tunes instruction selection, scheduling, and other architecture-dependent performance enhancements to run optimally on a specific hardware architecture.
-qcache  Defines a specific cache or memory geometry.

The -mcpu=pwr9 suboption produces object code that contains instructions that run on the POWER9 processor-based hardware platforms. With the -mtune=pwr9 suboption, optimizations are tuned for the POWER9 processor-based hardware platforms. This configuration can enable better code generation because the compiler uses capabilities that were not available on those older systems.

For all production codes, it is imperative to enable a minimum level of compiler optimization by adding the -O option for the IBM XL compilers. Without optimization, the focus of the compiler is on faster compilation and debug ability, and it generates code that performs poorly at run time.

For projects with increased focus on runtime performance, use a more advanced compiler optimization. For numerical or compute-intensive codes, the IBM XL compiler options -O3 or -qhot -O3 enable loop transformations, which improve program performance by restructuring loops to make their execution more efficient by the target system. These options perform aggressive transformations that can sometimes cause minor differences in the precision of floating point computations. If the minor differences are a concern, the original program semantics can be fully recovered with the -qstrict option.
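For example, a typical compile line for a numerically intensive C source file on the AC922, built only from the options that are described in this section (the source and binary names are placeholders):
# Placeholder file names; the options are the POWER9 tuning and optimization flags described above.
$ xlc -O3 -qhot -mcpu=pwr9 -mtune=pwr9 -o myapp myapp.c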

For more information about IBM XL C/C++ support for POWER9 processor-based servers, see IBM Knowledge Center.

For more information about IBM XL Fortran, see the IBM Knowledge Center.

1 OpenMP is an application programming interface (API) that facilitates the development of parallel applications for shared memory systems.

Optimization parameters
The strength of the IBM XL compilers is in their optimization and ability to improve code generation. Optimized code runs with greater speed, uses fewer machine resources, and increases your productivity.

For IBM XL C/C++ V16.1.0, the available compiler options to maximize application development performance are listed in Table 7-1.

Table 7-1 Optimization levels and options
Base optimization level | Other options that are implied by level | Other suggested options | Other options with possible benefits
-O0 | None | -mcpu | None
-O2 | -qmaxmem=8192 | -mcpu, -mtune | -qmaxmem=-1, -qhot=level=0
-O3 | -qnostrict, -qmaxmem=-1, -qhot=level=0 | -mcpu, -mtune | -qpdf
-O4 | -qnostrict, -qmaxmem=-1, -qhot, -qipa, -qarch=auto, -qtune=auto, -qcache=auto | -mcpu, -mtune, -qcache | -qpdf, -qsmp=auto
-O5 | All of -O4, -qipa=level=2 | -mcpu, -mtune, -qcache | -qpdf, -qsmp=auto

Several options are used to control the optimization and tuning process so that users can improve the performance of their application at run time.

When you compile programs with any of the following sets of options, the compiler automatically attempts to vectorize calls to system math functions. It does so by calling the equivalent vector functions in the Mathematical Acceleration Subsystem (MASS) libraries, with the exceptions of vdnint, vdint, vcosisin, vscosisin, vqdrt, vsqdrt, vrqdrt, vsrqdrt, vpopcnt4, and vpopcnt8:
 -qhot -qignerrno -qnostrict
 -O3 -qhot
 -O4
 -O5

If the compiler cannot vectorize, it automatically tries to call the equivalent MASS scalar functions. For automatic vectorization or scaling, the compiler uses versions of the MASS functions that are contained in the system library libxlopt.a.

In addition to any of the sets of options, if the compiler cannot vectorize when the -qipa option is in effect, it tries to inline the MASS scalar functions before it decides to call them.

Not all options benefit all applications. Tradeoffs sometimes occur between an increase in compile time, a reduction in debugging capability, and the improvements that optimization can provide. For more information about the optimization and tuning process and writing optimization-friendly source code, see the options that are listed in Table 7-2 and Optimization and Programming Guide - XL C/C++ for Linux, V16.1.0, for Little Endian distributions, SC27-8046-00.

Table 7-2 Optimization and tuning options
Option name | Description
-qhot | Performs aggressive and high-order loop analysis and transformations during optimization (implies -O2 optimization). Look for the suboptions that are available for -qhot to enable even more loop transformation, cache reuse, and loop parallelization when it is used with -qsmp.
-qipa | Enables or customizes a class of optimizations that is known as interprocedural analysis (IPA). These optimizations enable a whole-program analysis at one time rather than a file-by-file basis. The IBM XL compiler receives the power to restructure your application and perform optimizations, such as inlining for all functions and disambiguation of pointer references and calls. Look at the -qipa suboptions for more features.
-qmaxmem | Limits the amount of memory that the compiler allocates during the time it performs specific memory-intensive optimizations to the specified number of kilobytes.
-qignerrno | Allows the compiler to perform optimizations as though system calls will not modify errno.
-qpdf1, -qpdf2 | Tunes optimizations through profile-directed feedback, where results from a sample program execution that was compiled with -qpdf1 are used to improve optimization near conditional branches and in frequently run code sections afterward with -qpdf2, which creates a profiled optimized executable file.
-p, -pg, -qprofile | The compiler prepares the object code for profiling by attaching monitoring code to the executable file. This code counts the number of times each routine is called and creates a gmon.out file if the executable file can run successfully. Afterward, a tool, such as gprof, can be used to generate a runtime profile of the program.
-qinline | Attempts to inline functions instead of generating calls to those functions for improved performance.
-qstrict | Ensures that optimizations that are performed by default at the -O3 and higher optimization levels, and optionally at -O2, do not alter the semantics of a program.
-qsimd | Controls whether the compiler can automatically use vector instructions for processors that support them.
-qsmp | Enables automatic parallelization of program code by the compiler. Options such as schedulers and chunk sizes can be enabled to the code by this option.
-qoffload | Enables support for offloading OpenMP target regions to an NVIDIA GPU. Use the OpenMP #pragma omp target directive to define a target region. Moreover, for -qoffload to take effect, you must specify the -qsmp option during compilation.
-qunroll | Controls loop unrolling for improved performance.

For more information about the IBM XL C/C++ and IBM XL Fortran compiler options, see the following IBM Knowledge Center pages:
 IBM Knowledge Center (IBM XL C/C++)
 IBM Knowledge Center (IBM XL Fortran)

7.1.2 GNU Compiler Collection compiler options

For GCC, a minimum level of compiler optimization is -O2, and the suggested level of optimization is -O3. The GCC default is a strict mode, but the -ffast-math option disables strict mode. The -Ofast option combines -O3 with -ffast-math in a single option. Other important options include -fpeel-loops, -funroll-loops, -ftree-vectorize, -fvect-cost-model, and -mcmodel=medium.

Support for the POWER9 processor is now available on GCC-8.2.1 (which is available from Advanced Toolchain for Linux on Power V12) by using the -mcpu=power9 and -mtune=power9 options. The -mcpu=power9 option automatically enables the options that are shown in Table 7-3.

Table 7-3 POWER9 processor support with GCC-8.2.1
Option name | Description
-mfloat128-hardware | Enables IEEE 754 128-bit floating point precision, which is implemented in hardware on POWER9 processor-based servers.
-misel | Enables the generation of the integer select (ISEL) instruction, which uses the condition register set and allows conditional execution while eliminating branch instructions.
-mmodulo | Generates code by using the Power ISA V3.0 integer instructions (modulus, count trailing zeros, array index support, and integer multiply/add).
-mpower9-fusion | Fuses certain operations together for better performance on POWER9 processor-based servers.
-mpower9-minmax | Enables Power ISA V3.0 MIN/MAX instructions that do not require -ffast-math or -fhonor-nans.
-mpower9-misc | Uses certain scalar instructions that were added in Power ISA V3.0.
-mpower9-vector | Generates code that uses the vector and scalar instructions that were added in Power ISA V3.0. Also enables using built-in functions that allow more direct access to the vector instructions.

The application binary interface (ABI) type to use for intrinsic vectorizing can be set by specifying the -mveclibabi=mass option and linking to the MASS libraries, which enables more loops with -ftree-vectorize. The MASS libraries support only static archives for linking. Therefore, the following explicit naming and library search order is required for each platform or mode:
 POWER9 32-bit: -L/lib -lmassvp9 -lmass_simdp9 -lmass -lm
 POWER9 64-bit: -L/lib64 -lmassvp9_64 -lmass_simdp9_64 -lmass_64 -lm
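As an illustrative sketch, the following command builds a 64-bit binary tuned for POWER9 with MASS-based auto-vectorization of math calls; the file names are placeholders, and <mass-install-dir> stands for wherever the MASS product is installed on your system:
# Placeholder file names; <mass-install-dir> is the MASS installation path on your system.
$ gcc -O3 -mcpu=power9 -mtune=power9 -funroll-loops -ftree-vectorize \
    -mveclibabi=mass -o myapp myapp.c \
    -L<mass-install-dir>/lib64 -lmassvp9_64 -lmass_simdp9_64 -lmass_64 -lm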

For more information about GCC support on POWER9 processor-based servers, see the GNU Manual website.

Optimization parameters
The most commonly used optimization options are listed in Table 7-4.

Table 7-4 Optimization options for GNU Compiler Collection
Option name | Description
-O, -O1 | With the -O option, the compiler tries to reduce code size and execution time without performing any optimizations that add significant compilation time.
-O2 | The -O2 option turns on all optional optimizations except for loop unrolling, function inlining, and register renaming. It also turns on the -fforce-mem option on all machines and frame pointer elimination on machines where frame pointer elimination does not interfere with debugging.
-O3 | The -O3 option turns on all optimizations that are specified by -O2 and turns on the -finline-functions, -frename-registers, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload, -ftree-vectorize, -fvect-cost-model, -ftree-partial-pre, and -fipa-cp-clone options.
-O0 | Reduces compilation time and grants debuggers full access to the code. Do not optimize.
-Os | Optimizes for size. The -Os option enables all -O2 optimizations that do not typically increase code size. It also performs further optimizations that are designed to reduce code size.
-ffast-math | Sets -fno-math-errno, -funsafe-math-optimizations, -fno-trapping-math, -ffinite-math-only, -fno-rounding-math, -fno-signaling-nans, and -fcx-limited-range. This option causes the preprocessor macro __FAST_MATH__ to be defined. This option must never be turned on by any -O option besides the -Ofast option because it can result in incorrect output for programs that depend on an exact implementation of IEEE or ISO rules, or specifications for math functions.
-funroll-loops | Unrolls loops whose number of iterations can be determined at compile time or upon entry to the loop. The -funroll-loops option implies the -fstrength-reduce, -frerun-cse-after-loop, -fweb, and -frename-registers options.
-fno-inline | Do not expand any functions inline apart from those functions that are marked with the always_inline attribute. This setting is the default setting when not optimizing. Single functions can be exempted from inlining by marking them with the noinline attribute.
-fno-math-errno | Do not set ERRNO after math functions are called that are run with a single instruction, for example, sqrt. A program that relies on IEEE exceptions for math error handling can use this flag for speed while maintaining IEEE arithmetic compatibility.
-finline-functions | Considers all functions for inlining even if they are not declared inline. The compiler heuristically decides which functions are worth integrating in this way. If all calls to a specific function are integrated and the function is declared static, the function is normally not output as assembly language code in its own right. This option is enabled at level -O3.

7.2 Porting applications to IBM Power Systems servers

The first challenge that often is encountered when working on a Power Systems server is the fact that software that is written in C, C++, or Fortran includes some significant differences from applications that were developed for x86 and x86_64 systems. Therefore, considering that many applications exist on these architectures, the work of writing and porting applications is assumed to involve a large amount of time, effort, and investment. If done manually, this task can be burdensome.

For example, consider the snippet that is shown in Example 7-1.

Example 7-1 x86 code that does not work properly on Power Systems servers
#include <stdio.h>

int main()
{
    char c;

    for( c=0 ; c>=0 ; c++)
        printf("%d: %c\n", (int) c, c);

    return 0;
}

The code that is shown in Example 7-1 on page 167 runs 256 times and displays the expected characters on a x86_64 system, but turns into an infinite loop in a Power Systems server. Considering that characters on a Power Systems server are mapped by default to an unsigned character and x86 characters are mapped to a signed character, the port that is shown in Example 7-2 is necessary to fix the infinite loop challenge.

Example 7-2 Power Systems port
--- char_x86.c	2018-10-11 11:10:04.820296533 -0400
+++ char_ppc.c	2018-10-11 11:12:13.785381982 -0400
@@ -1,7 +1,7 @@
 #include <stdio.h>
 int main()
 {
-    char c;
+    signed char c;
     for( c=0 ; c>=0 ; c++)
         printf("%d: %c\n", (int) c, c);
     return 0;

If you are coming from a 32-bit environment, several problems can occur because of the sizes of long and pointer data types and the alignment of your variables, as shown in Table 7-5.

Table 7-5 Data types
Data type | 32-bit mode size | 32-bit mode alignment | 64-bit mode size | 64-bit mode alignment
long, signed long, and unsigned long | 4 bytes | 4-byte boundaries | 8 bytes | 8-byte boundaries
Pointer | 4 bytes | 4-byte boundaries | 8 bytes | 8-byte boundaries
size_t (defined in the header file <stddef.h>) | 4 bytes | 4-byte boundaries | 8 bytes | 8-byte boundaries
ptrdiff_t (defined in the header file <stddef.h>) | 4 bytes | 4-byte boundaries | 8 bytes | 8-byte boundaries

Moreover, architecture-specific built-in functions can also be a problem. Consider an example of single-instruction multiple-data (SIMD) vectors for the following code that is designed for an x86 system, as shown in Example 7-3.

Example 7-3 Code with x86 built-in functions that need porting
#include <stdio.h>
#include <stdlib.h>

typedef float v4sf __attribute__ ((mode(V4SF)));

int main() {
    int i;
    v4sf v1, v2, v3;

    //Vector 1
    v1[0] = 20;
    v1[1] = 2;
    v1[2] = 9;
    v1[3] = 100;

    //Vector 2
    v2[0] = 2;
    v2[1] = 10;
    v2[2] = 3;
    v2[3] = 2;

    // Sum 2 Vectors
    v3 = __builtin_ia32_addps(v1, v2);

    printf("SIMD addition\nResult\nv3 = < ");
    for (i = 0; i < 4; i++)
        printf("%.1f ", v3[i]);
    printf(">\n");

    // Compare element by element from both vectors and get the higher value from each position
    v3 = __builtin_ia32_maxps(v1, v2);

    printf("SIMD maxps\nResult\nv3 = < ");
    for (i = 0; i < 4; i++)
        printf("%.1f ", v3[i]);
    printf(">\n");

    return 0;
}

This code uses x86 built-in functions, such as __builtin_ia32_addps and __builtin_ia32_maxps. You can review documentation to find the corresponding functions (if they exist on the target architecture), but this strategy can be burdensome.

To mitigate all of these challenges, IBM created an all-in-one solution for developing and porting applications on Power Systems servers. This solution is called IBM Software Development Kit for Linux on Power, which is described at the IBM Software Development Kit for Linux on Power (SDK) website.

Among other features, this software development kit (SDK) helps developers code, fix, simulate, port, and run software locally on an x86_64 machine, virtually in an x86_64 simulation, or remotely on a Power Systems server. Currently, this SDK integrates the Eclipse integrated development environment (IDE) with IBM XL C/C++, Advance Toolchain, and open source tools, such as OProfile, Valgrind, and Autotools.

Figure 7-1 shows the detection of such a function in the code that is shown in Example 7-3 on page 168.

Figure 7-1 Porting problems in Example 7-3 on page 168 that are detected by ibm-sdk-lop

In addition, this tool integrates the IBM Feedback Directed Program Restructuring (IBM FDPR®) and pthread monitoring tools, which are designed to analyze Power Systems servers. These tools include powerful porting and analytic tools, such as Migration Advisor (MA), Source Code Advisor, and CPI Breakdown.

By starting the Migration Advisor Wizard for the code that is shown in Example 7-3 on page 168, two options become available, as shown in Figure 7-2.

Figure 7-2 Migration Wizard enables porting code from different architectures and endianness

The code that is shown in Example 7-3 on page 168 leads to the port report, as shown in Figure 7-3.

Figure 7-3 Porting log showing changes that were performed on Example 7-3 on page 168

Consider the changes that the MA performed, which are shown in Example 7-4.

Example 7-4 Porting changes that were performed by the MA on Example 7-3 on page 168
--- vector_before_migration	2018-10-15 12:39:24.435114426 -0400
+++ vector_after_migration	2018-10-15 12:39:56.342548064 -0400
@@ -3,6 +3,13 @@
 typedef float v4sf __attribute__ ((mode(V4SF)));
+
+#ifdef __PPC__
+__vector float vec__builtin_ia32_maxps(__vector float a, __vector float b) {
+    //TODO: You need to provide a PPC implementation.
+}
+#endif
+
 int main() {
     int i;
@@ -21,7 +28,12 @@
     v2[3] = 2;
 
     // Sum 2 Vectors
+
+#ifdef __PPC__
+    v3 = vec_add(v1, v2);
+#else
     v3 = __builtin_ia32_addps(v1, v2);
+#endif
 
     printf("SIMD addition\nResult\nv3 = < ");
     for (i = 0; i < 4; i++)
@@ -29,8 +41,12 @@
     printf(">\n");
 
     // Compare element by element from both vectors and get the higher value from each position
-    v3 = __builtin_ia32_maxps(v1, v2);
+
+#ifdef __PPC__
+    v3 = vec__builtin_ia32_maxps(v1, v2);
+#else
+    v3 = __builtin_ia32_maxps(v1, v2);
+#endif
 
     printf("SIMD maxps\nResult\nv3 = < ");
     for (i = 0; i < 4; i++)

The wizard found a corresponding function for __builtin_ia32_addps(), and informed us that we must implement our own version of __builtin_ia32_maxps() in vec__builtin_ia32_maxps(), which is now declared before main(). Moreover, all of the PPC wrappers were created.
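
A possible PPC body for the generated stub is a single built-in call. The following hedged sketch assumes the AltiVec vector support that is described in 7.5, where vec_max performs an element-wise maximum, which matches the semantics of maxps:

#ifdef __PPC__
#include <altivec.h>

__vector float vec__builtin_ia32_maxps(__vector float a, __vector float b) {
    return vec_max(a, b);   /* element-wise maximum of the two vectors */
}
#endif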

By using this tool, you can automatically solve many porting errors from x86 to Power Systems architectures by clicking a button on the ibm-sdk-lop interface. For more information about the SDK features, see Developing software using the IBM Software Development Kit for Linux on Power.

7.3 IBM Engineering and Scientific Subroutine Library

The IBM Engineering and Scientific Subroutine Library (IBM ESSL) includes the following runtime libraries:  IBM ESSL Serial Libraries and IBM ESSL Symmetric Multiprocessor (SMP) Libraries  IBM ESSL SMP Compute Unified Device Architecture (CUDA) Library

The following mathematical subroutines, in nine computational areas, are tuned for performance:
 Linear algebra subprograms
 Matrix operations
 Linear algebraic equations
 Eigensystem analysis
 Fourier transforms, convolutions and correlations, and related computations
 Sorting and searching
 Interpolation
 Numerical quadrature
 Random number generation

7.3.1 IBM ESSL Compilation in Fortran, IBM XL C/C++, and GCC/G++

The IBM ESSL subroutines are callable from programs that are written in Fortran, C, and C++. Table 7-6, Table 7-8 on page 175, and Table 7-10 on page 176 list how the compilation on Linux depends on the type of IBM ESSL library (serial, SMP, or SMP CUDA), the environment (32-bit integer and 64-bit pointer, or 64-bit integer and 64-bit pointer), and the compiler (XLF, IBM XL C/C++, or gcc/g++) that is used.

Table 7-6 Fortran compilation commands for IBM ESSL

Serial, 32-bit integer and 64-bit pointer:
  xlf_r -O -qnosave program.f -lessl
Serial, 64-bit integer and 64-bit pointer:
  xlf_r -O -qnosave program.f -lessl6464
SMP, 32-bit integer and 64-bit pointer:
  xlf_r -O -qnosave -qsmp program.f -lesslsmp
  xlf_r -O -qnosave program.f -lesslsmp -lxlsmp
SMP, 64-bit integer and 64-bit pointer:
  xlf_r -O -qnosave -qsmp program.f -lesslsmp6464
  xlf_r -O -qnosave program.f -lesslsmp6464 -lxlsmp
SMP CUDA, 32-bit integer and 64-bit pointer:
  xlf_r -O -qnosave -qsmp program.f -lesslsmpcuda -lcublas -lcudart
    -L/usr/local/cuda/lib64 -R/usr/local/cuda/lib64
  xlf_r -O -qnosave program.f -lesslsmpcuda -lxlsmp -lcublas -lcudart
    -L/usr/local/cuda/lib64 -R/usr/local/cuda/lib64

To use the Fastest Fourier Transform in the West (FFTW) wrapper libraries, include the header file fftw3.f, which contains the constant definitions. To use these definitions, complete one of the following tasks:
 Add the following line to your Fortran application: include "fftw3.f"
 Add the fftw3.f header file in your application.

You also can compile and link with the FFTW wrapper libraries by running the commands that are listed in Table 7-7.

Table 7-7 Fortran compilation guidelines (assume that the FFTW wrappers are in /usr/local/include)

Serial, 32-bit integer and 64-bit pointer:
  xlf_r -O -qnosave program.f -lessl -lfftw3_essl -I/usr/local/include -L/usr/local/lib64
SMP, 32-bit integer and 64-bit pointer:
  xlf_r -O -qnosave program.f -lesslsmp -lfftw3_essl -I/usr/local/include -L/usr/local/lib64

Table 7-8 lists the compilation guidelines for IBM ESSL by using IBM XL C/C++.

Table 7-8 IBM XL C/C++ compilation commands (cc_r for XLC, xlC_r for XLC++) for IBM ESSL

Serial, 32-bit integer and 64-bit pointer:
  cc_r (xlC_r) -O program.c -lessl -lxlf90_r -lxlfmath
    -L/opt/ibm/xlsmp//lib -L/opt/ibm/xlf//lib -R/opt/ibm/lib
Serial, 64-bit integer and 64-bit pointer:
  cc_r (xlC_r) -O -D_ESV6464 program.c -lessl6464 -lxlf90_r -lxlfmath
    -L/opt/ibm/xlsmp//lib -L/opt/ibm/xlf//lib -R/opt/ibm/lib
SMP, 32-bit integer and 64-bit pointer:
  cc_r (xlC_r) -O program.c -lesslsmp -lxlf90_r -lxlsmp -lxlfmath
    -L/opt/ibm/xlsmp//lib -L/opt/ibm/xlf//lib -R/opt/ibm/lib
SMP, 64-bit integer and 64-bit pointer:
  cc_r (xlC_r) -O -D_ESV6464 program.c -lesslsmp6464 -lxlf90_r -lxlsmp -lxlfmath
    -L/opt/ibm/xlsmp//lib -L/opt/ibm/xlf//lib -R/opt/ibm/lib
SMP CUDA, 32-bit integer and 64-bit pointer:
  cc_r (xlC_r) -O program.c -lesslsmpcuda -lxlf90_r -lxlsmp -lxlfmath -lcublas -lcudart
    -L/usr/local/cuda/lib64 -R/usr/local/cuda/lib64
    -L/opt/ibm/xlsmp//lib -L/opt/ibm/xlf//lib -R/opt/ibm/lib

To compile and link the FFTW wrapper libraries, you must use the commands that are listed in Table 7-9. Also, use the header file fftw3_essl.h instead of fftw3.h.

Table 7-9 IBM XL C compilation guidelines to use the FFTW wrapper libraries

Serial, 32-bit integer and 64-bit pointer:
  cc_r (xlC_r) -O program.c -lessl -lxlf90_r -lxlfmath -lfftw3_essl -lm
    -L/opt/ibm/xlsmp//lib -L/opt/ibm/xlf//lib -R/opt/ibm/lib
    -I/usr/local/include -L/usr/local/lib64
SMP, 32-bit integer and 64-bit pointer:
  cc_r (xlC_r) -O program.c -lesslsmp -lxlf90_r -lxlsmp -lxlfmath -lfftw3_essl -lm
    -L/opt/ibm/xlsmp//lib -L/opt/ibm/xlf//lib -R/opt/ibm/lib
    -I/usr/local/include -L/usr/local/lib64

Table 7-10 lists the compilation commands for gcc and g++.

Table 7-10 C/C++ compilation commands (gcc for C and g++ for C++) for IBM ESSL (see the note that follows the table)

Serial, 32-bit integer and 64-bit pointer:
  gcc (g++) program.c -lessl -lxlf90_r -lxlfmath -lm
    -L/opt/ibm/xlsmp//lib -L/opt/ibm/xlf//lib -R/opt/ibm/lib
Serial, 64-bit integer and 64-bit pointer:
  gcc (g++) -D_ESV6464 program.c -lessl6464 -lxlf90_r -lxlfmath -lm
    -L/opt/ibm/xlsmp//lib -L/opt/ibm/xlf//lib -R/opt/ibm/lib
SMP, 32-bit integer and 64-bit pointer:
  gcc (g++) program.c -lesslsmp -lxlf90_r -lxlsmp -lxlfmath -lm
    -L/opt/ibm/xlsmp//lib -L/opt/ibm/xlf//lib -R/opt/ibm/lib
SMP, 64-bit integer and 64-bit pointer:
  gcc (g++) -D_ESV6464 program.c -lesslsmp6464 -lxlf90_r -lxlsmp -lxlfmath -lm
    -L/opt/ibm/xlsmp//lib -L/opt/ibm/xlf//lib -R/opt/ibm/lib
SMP CUDA, 32-bit integer and 64-bit pointer:
  gcc (g++) program.c -lesslsmpcuda -lxlf90_r -lxlsmp -lxlfmath -lm -lcublas -lcudart
    -L/usr/local/cuda/lib64 -R/usr/local/cuda/lib64
    -L/opt/ibm/xlsmp//lib -L/opt/ibm/xlf//lib -R/opt/ibm/lib

Note: The IBM ESSL SMP libraries require the IBM XL OpenMP run time. The gcc OpenMP run time is not compatible with the IBM XL OpenMP run time.

To compile and link the FFTW wrapper libraries with gcc or g++, you must use the commands that are listed in Table 7-11. Also, use the header file fftw3_essl.h instead of fftw3.h.

Table 7-11 lists the gcc and g++ compilation guidelines to use for the FFTW wrapper libraries, assuming that the FFTW wrapper files are installed in /usr/local/include.

Table 7-11 gcc and g++ compilation guidelines to use the FFTW wrapper libraries

Serial, 32-bit integer and 64-bit pointer:
  gcc (g++) program.c -lessl -lxlf90_r -lxlfmath -lfftw3_essl -lm
    -L/opt/ibm/xlsmp//lib -L/opt/ibm/xlf//lib -R/opt/ibm/lib
    -I/usr/local/include -L/usr/local/lib64
SMP, 32-bit integer and 64-bit pointer:
  gcc (g++) program.c -lesslsmp -lxlf90_r -lxlsmp -lxlfmath -lfftw3_essl -lm
    -L/opt/ibm/xlsmp//lib -L/opt/ibm/xlf//lib -R/opt/ibm/lib
    -I/usr/local/include -L/usr/local/lib64

7.3.2 IBM ESSL example

To demonstrate the power of the IBM ESSL API, Example 7-5 shows a C sample source code that uses IBM ESSL to perform the well-known dgemm example with matrixes of dimension [20000, 20000] while measuring the execution time and performance of this calculation.

Example 7-5 IBM ESSL C dgemm_sample.c source code for a [20,000x20,000] calculation of dgemm
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <essl.h> //ESSL header for C/C++

//Function to calculate time in milliseconds
long timevaldiff(struct timeval *starttime, struct timeval *finishtime) {
    long msec;
    msec=(finishtime->tv_sec-starttime->tv_sec)*1000;
    msec+=(finishtime->tv_usec-starttime->tv_usec)/1000;
    return msec;
}

int main() {
    struct timeval start, end;
    double diff;
    long n, m, k;
    double *a, *b, *c;
    double rmax, rmin;
    double seed1, seed2, seed3;
    double flop;

    //Seeds for matrix generation
    seed1 = 5.0;
    seed2 = 7.0;
    seed3 = 9.0;

    //Maximum and minimum value elements of matrixes
    rmax = 0.5;
    rmin = -0.5;

    //Size of matrixes
    n = 20000;
    m = n;
    k = n;

    //Number of additions and multiplications
    flop = (double)(m*n*(2*(k-1)));

    //Memory allocation
    a = (double*)malloc(n*k*sizeof(double));
    b = (double*)malloc(k*m*sizeof(double));
    c = (double*)malloc(n*m*sizeof(double));

    //Matrix generation
    durand(&seed1,n*k,a); //DURAND is included in ESSL, not in CBLAS
    cblas_dscal(n*k,rmax-rmin,a,1);
    cblas_daxpy(n*k,1.0,&rmin,0,a,1);

    durand(&seed2,k*m,b);
    cblas_dscal(k*m,rmax-rmin,b,1);
    cblas_daxpy(k*m,1.0,&rmin,0,b,1);

    durand(&seed3,n*m,c);
    cblas_dscal(n*m,rmax-rmin,c,1);
    cblas_daxpy(n*m,1.0,&rmin,0,c,1);

    //Matrix multiplication (DGEMM)
    gettimeofday(&start,NULL);
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m,n,k,1.0,a,n,b,k,1.0,c,n);
    gettimeofday(&end,NULL);

    //Print results
    printf("%.3lf seconds, ",(double)timevaldiff(&start,&end)/1000);
    printf("%.3lf MFlops\n",flop/(double)timevaldiff(&start,&end)/1000.0);

    //Memory deallocation
    free(a);
    free(b);
    free(c);

    return 0;
}

The code uses IBM ESSL routines that are called by C Basic Linear Algebra Subprograms (CBLAS) interfaces, which enable more options to specify matrix order (for example, column-major or row-major). The calculation that is performed by dgemm is done by using the following formula:

C = α[A] x [B] + β[C]

Alpha and beta are real scalar values, and A, B, and C are matrixes of conforming shape.

The algorithm is shown in Example 7-5 on page 177.

Example 7-6 shows how to compile and run the algorithm that is shown in Example 7-5 on page 177 by using the serial version of IBM ESSL and the XLC compiler instructions that are listed in Table 7-8 on page 175.

Example 7-6 Compilation and execution of dgemm_sample.c for serial IBM ESSL
$ xlc_r -O3 dgemm_sample.c -lessl -lxlf90_r -lxlfmath -L/opt/ibm/xlsmp/5.1.0/lib -L/opt/ibm/xlf/16.1.0/lib -R/opt/ibm/lib -o dgemm_serial

$ ./dgemm_serial
670.141 seconds, 23874.379 MFlops
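
The same source file can also be linked against the IBM ESSL SMP library by following the SMP row of Table 7-8 on page 175. The following commands are a hedged sketch that reuses the installation paths from Example 7-6; the thread count is only illustrative:

$ xlc_r -O3 dgemm_sample.c -lesslsmp -lxlf90_r -lxlsmp -lxlfmath -L/opt/ibm/xlsmp/5.1.0/lib -L/opt/ibm/xlf/16.1.0/lib -R/opt/ibm/lib -o dgemm_smp
$ OMP_NUM_THREADS=40 ./dgemm_smp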

For more information about how to use and tune this example to gain performance, see 8.2.1, “IBM ESSL execution in multiple CPUs and GPUs” on page 233.

For more information about the new IBM ESSL features, see IBM Knowledge Center.

For more information about the IBM ESSL SMP CUDA library, see IBM Knowledge Center.

7.4 IBM Parallel Engineering and Scientific Subroutine Library

IBM Parallel ESSL V5.4.0 is a highly optimized mathematical subroutine library for clusters of IBM POWER8 and POWER9 processor-based nodes. IBM Parallel ESSL supports the Single Program Multiple Data (SPMD) programming model that uses the Message Passing Interface (MPI) library, which means that all parallel tasks are identical and work on different sets of data.

IBM Parallel ESSL supports only 32-bit integer and 64-bit pointer environment libraries, which must be used with the IBM Parallel Environment Runtime Edition Message Passing Interface Chameleon (MPICH) library.

IBM Parallel ESSL subroutines cover the following computational areas:
 Level 2 Parallel Basic Linear Algebra Subprograms (PBLAS)
 Level 3 PBLAS
 Linear algebraic equations
 Eigensystem analysis and singular value analysis
 Fourier transforms
 Random number generation

IBM Parallel ESSL uses calls of the IBM ESSL subroutines for computational purposes.

For communication, Basic Linear Algebra Communications Subprograms (BLACS) is included, which is based on MPI.

7.4.1 Program development

During the development process of your program, the BLACS subroutines must be used. To include IBM Parallel ESSL calls into the code of the program, complete the following steps:
1. Start the process grid by using the BLACS subroutines (BLACS_GET call and BLACS_GRIDINIT or BLACS_GRIDMAP after it).
2. Distribute data across the process grid. Try to use different block sizes to improve the performance of the program. For example, some IBM Parallel ESSL subroutines use a GPU for large sizes of blocks (about 2000 by 2000).
3. Call the IBM Parallel ESSL subroutine on all processes of the BLACS process grid.
4. Aggregate the results of IBM Parallel ESSL runs from all processes.
5. Call the BLACS subroutines to clean the process grid and exit, such as BLACS_GRIDEXIT or BLACS_EXIT.

Example 7-7 shows a sample Fortran dgemm implementation that uses the IBM Parallel ESSL version of the PDGEMM subroutine for a 20000 by 20000 matrix size. PDGEMM works by using the following formula:

C = α[A] x [B] + β[C]

Alpha and beta are real scalar values, and A, B, and C are matrixes of conforming shape.

The algorithm is shown in Example 7-7.

Example 7-7 IBM Parallel ESSL Fortran example source code pdgemm_sample
program pdgemm_sample
  implicit none
  real*8, allocatable, dimension(:) :: a,b,c
  integer, dimension(9) :: desca,descb,descc
  integer m,n,k
  integer np,nr,nc,mr,mc,icontxt,numroc,iam,nnodes
  integer acol, bcol, ccol
  real*8, parameter :: alpha = 2.d0
  real*8, parameter :: beta = 3.d0
  integer, parameter :: ia = 1, ib = 1, ic = 1
  integer, parameter :: ja = 1, jb = 1, jc = 1

  integer, parameter :: nb = 200
  ! Bigger size for CUDA runs
  ! integer, parameter :: nb = 3000

  ! Initialization of process grid
  call blacs_pinfo(iam,np)
  if (np.ne.20) then
    print *, 'Test expects 20 nodes'
    stop 1
  else
    nr = 5
    nc = 4
  endif
  call blacs_get(0,0,icontxt)
  call blacs_gridinit(icontxt,'r',nr,nc)
  call blacs_gridinfo(icontxt,nr,nc,mr,mc)

  ! Size of matrixes
  m = 20000
  n = 20000
  k = 20000

  ! Fill parameters for PDGEMM call
  desca(1) = 1
  desca(2) = icontxt
  desca(3) = m
  desca(4) = k
  desca(5:6) = nb
  desca(7:8) = 0
  desca(9) = numroc(m,nb,mr,0,nr)
  acol = numroc(k,nb,mc,0,nc)

  descb(1) = 1
  descb(2) = icontxt
  descb(3) = k
  descb(4) = n
  descb(5:6) = nb
  descb(7:8) = 0
  descb(9) = numroc(k,nb,mr,0,nr)
  bcol = numroc(n,nb,mc,0,nc)

  descc(1) = 1
  descc(2) = icontxt
  descc(3) = m
  descc(4) = n
  descc(5:6) = nb
  descc(7:8) = 0
  descc(9) = numroc(m,nb,mr,0,nr)
  ccol = numroc(n,nb,mc,0,nc)

  allocate(a(desca(9)*acol))
  allocate(b(descb(9)*bcol))
  allocate(c(descc(9)*ccol))

  ! PDGEMM call
  a = 1.d0
  b = 2.d0
  c = 3.d0
  call pdgemm('N','N',m,n,k,alpha,a,ia,ja,desca,b,ib,jb,descb, &
              beta,c,ic,jc,descc)

  ! Deallocation of arrays and exit from BLACS
  deallocate(a)
  deallocate(b)
  deallocate(c)
  call blacs_gridexit(icontxt)
  call blacs_exit(0)
end

For more information about using BLACS with the IBM Parallel ESSL library, see IBM Knowledge Center.

For more information about the concept of development by using the IBM Parallel ESSL library, see this IBM Knowledge Center.

7.4.2 Using GPUs with the IBM Parallel ESSL

GPUs can be used by the local MPI tasks in the following ways within the IBM Parallel ESSL programs:
 GPUs are not shared. This setting means that each MPI task on a node uses unique GPUs. For this case, the local rank of the MPI tasks can be used. Example 7-8 shows how to work with the local rank by using the MP_COMM_WORLD_LOCAL_RANK variable. It is created from Example 7-7 on page 180 with another section that gets the local rank of the MPI task and assigns this task to a respective GPU by rank. The example changes the size of the process grid to 4 by 2 to fit a cluster that has two nodes with four GPUs on each node, and it uses a block size of 3000 by 3000.

Example 7-8 IBM Parallel ESSL Fortran example source code for non-shared GPUs
program pdgemm_sample_local_rank
  implicit none
  real*8, allocatable, dimension(:) :: a,b,c
  integer, dimension(9) :: desca,descb,descc
  integer m,n,k
  integer ids(1)
  integer ngpus
  integer np,nr,nc,mr,mc,icontxt,numroc,iam,nnodes
  integer acol, bcol, ccol
  real*8, parameter :: alpha = 2.d0
  real*8, parameter :: beta = 3.d0
  integer, parameter :: ia = 1, ib = 1, ic = 1
  integer, parameter :: ja = 1, jb = 1, jc = 1

  integer, parameter :: nb = 3000

  character*8 rank
  integer lrank,istat

  ! Initialization of process grid
  call blacs_pinfo(iam,np)
  if (np.ne.8) then
    print *, 'Test expects 8 nodes'
    stop 1
  else
    nr = 4
    nc = 2
  endif
  call blacs_get(0,0,icontxt)
  call blacs_gridinit(icontxt,'r',nr,nc)
  call blacs_gridinfo(icontxt,nr,nc,mr,mc)

  ! Get local rank and assign respective GPU
  call getenv('MP_COMM_WORLD_LOCAL_RANK',value=rank)
  read(rank,*,iostat=istat) lrank
  ngpus = 1
  ids(1) = lrank
  call setgpus(1,ids)

  ! Size of matrixes
  m = 20000
  n = 20000
  k = 20000

  ! Fill parameters for PDGEMM call
  desca(1) = 1
  desca(2) = icontxt
  desca(3) = m
  desca(4) = k
  desca(5:6) = nb
  desca(7:8) = 0
  desca(9) = numroc(m,nb,mr,0,nr)
  acol = numroc(k,nb,mc,0,nc)

  descb(1) = 1
  descb(2) = icontxt
  descb(3) = k
  descb(4) = n
  descb(5:6) = nb
  descb(7:8) = 0
  descb(9) = numroc(k,nb,mr,0,nr)
  bcol = numroc(n,nb,mc,0,nc)

  descc(1) = 1
  descc(2) = icontxt
  descc(3) = m
  descc(4) = n
  descc(5:6) = nb
  descc(7:8) = 0
  descc(9) = numroc(m,nb,mr,0,nr)
  ccol = numroc(n,nb,mc,0,nc)

  allocate(a(desca(9)*acol))
  allocate(b(descb(9)*bcol))
  allocate(c(descc(9)*ccol))

  ! PDGEMM call
  a = 1.d0
  b = 2.d0
  c = 3.d0
  call pdgemm('N','N',m,n,k,alpha,a,ia,ja,desca,b,ib,jb,descb, &
              beta,c,ic,jc,descc)

  ! Deallocation of arrays and exit from BLACS
  deallocate(a)
  deallocate(b)
  deallocate(c)
  call blacs_gridexit(icontxt)
  call blacs_exit(0)
end

 GPUs are shared. This setting is used when the number of MPI tasks per node oversubscribes the GPUs. IBM Parallel ESSL recommends using the NVIDIA Multi-Process Service (MPS), as described in 9.5.2, “CUDA Multi-Process Service” on page 292. MPS allows multiple MPI tasks to use the GPUs concurrently.

Note: It is possible that IBM Parallel ESSL cannot allocate memory on the GPU. In this case, you can reduce the number of MPI tasks per node.

To inform IBM Parallel ESSL which GPUs to use for the MPI tasks, use the SETGPUS subroutine. Example 7-9 shows an updated version of Example 7-7 on page 180, where IBM Parallel ESSL uses only one GPU for each MPI task, and the GPU is assigned in round-robin order. The algorithm is shown in Example 7-9.

Example 7-9 Calling the SETGPUS subroutine for one GPU
program pdgemm_sample
  implicit none
  real*8, allocatable, dimension(:) :: a,b,c
  integer, dimension(9) :: desca,descb,descc
  integer m,n,k
  integer ids(1)
  integer ngpus
  integer np,nr,nc,mr,mc,icontxt,numroc,iam,nnodes
  integer acol, bcol, ccol
  real*8, parameter :: alpha = 2.d0
  real*8, parameter :: beta = 3.d0
  integer, parameter :: ia = 1, ib = 1, ic = 1
  integer, parameter :: ja = 1, jb = 1, jc = 1

  integer, parameter :: nb = 3000

  ! Initialization of process grid
  call blacs_pinfo(iam,np)
  if (np.ne.20) then
    print *, 'Test expects 20 nodes'
    stop 1
  else
    nr = 5
    nc = 4
  endif
  ngpus = 1
  ids(1) = mod(iam,4)
  call setgpus(ngpus,ids)
  call blacs_get(0,0,icontxt)
  call blacs_gridinit(icontxt,'r',nr,nc)
  call blacs_gridinfo(icontxt,nr,nc,mr,mc)

  ! Size of matrixes
  m = 20000
  n = 20000
  k = 20000

  ! Fill parameters for PDGEMM call
  desca(1) = 1
  desca(2) = icontxt
  desca(3) = m
  desca(4) = k
  desca(5:6) = nb
  desca(7:8) = 0
  desca(9) = numroc(m,nb,mr,0,nr)
  acol = numroc(k,nb,mc,0,nc)

  descb(1) = 1
  descb(2) = icontxt
  descb(3) = k
  descb(4) = n
  descb(5:6) = nb
  descb(7:8) = 0
  descb(9) = numroc(k,nb,mr,0,nr)
  bcol = numroc(n,nb,mc,0,nc)

  descc(1) = 1
  descc(2) = icontxt
  descc(3) = m
  descc(4) = n
  descc(5:6) = nb
  descc(7:8) = 0
  descc(9) = numroc(m,nb,mr,0,nr)
  ccol = numroc(n,nb,mc,0,nc)

  allocate(a(desca(9)*acol))
  allocate(b(descb(9)*bcol))
  allocate(c(descc(9)*ccol))

  ! PDGEMM call
  a = 1.d0
  b = 2.d0
  c = 3.d0
  call pdgemm('N','N',m,n,k,alpha,a,ia,ja,desca,b,ib,jb,descb, &
              beta,c,ic,jc,descc)

  ! Deallocation of arrays and exit from BLACS
  deallocate(a)
  deallocate(b)
  deallocate(c)
  call blacs_gridexit(icontxt)
  call blacs_exit(0)
end

Example 7-9 explicitly assigns one GPU to each MPI task. To use two GPUs per MPI task instead, change the following lines:
ngpus = 1
ids(1) = mod(iam,4)
to the following lines:
ngpus = 2
ids(1) = mod(iam,2)*2
ids(2) = mod(iam,2)*2 + 1

7.4.3 Compilation

The IBM Parallel ESSL subroutines can be called from 64-bit-environment application programs, which are written in Fortran, C, and C++. Compilation commands of source code for different types of IBM Parallel ESSL libraries (SMP and SMP CUDA) that use MPICH libraries are described in Table 7-12, Table 7-13 on page 187, Table 7-14 on page 187, Table 7-15 on page 188, and Table 7-16 on page 188.

You do not need to modify your Fortran compilation procedures when IBM Parallel ESSL for Linux is used. When linking and running your program, you must modify your job procedures to set up the necessary libraries. If you are accessing IBM Parallel ESSL for Linux from a Fortran program, you can compile and link by running the commands that are listed in Table 7-12.

Table 7-12 Fortran compilation commands for IBM Parallel ESSL by using IBM Spectrum MPI libraries

64-bit SMP:
  mpifort -O program.f -lesslsmp -lpesslsmp -lblacssmp -lxlsmp
64-bit SMP CUDA:
  mpifort -O xyz.f -lesslsmpcuda -lpesslsmp -lblacssmp -lxlsmp -lcublas -lcudart
    -L/usr/local/cuda/lib64 -R/usr/local/cuda/lib64

The IBM Parallel ESSL for Linux header files pessl.h and Cblacs.h, which are used for C and C++ programs, are installed in the /usr/include directory. When linking and running your program, you must modify your job procedures to set up the necessary libraries. If you are accessing IBM Parallel ESSL for Linux from a C program, you can compile and link by running the commands that are listed in Table 7-13.

Table 7-13 C program compile and link commands for use with IBM Spectrum MPI

64-bit SMP:
  mpicc -O program.c -lesslsmp -lpesslsmp -lblacssmp -lxlf90_r -lxlsmp -lxlfmath -lmpi_mpifh
    -L/opt/ibm/xlsmp/xlsmp_version.release/lib -L/opt/ibm/xlf/xlf_version.release/lib -R/opt/ibm/lib
64-bit SMP CUDA:
  mpicc -O program.c -lesslsmpcuda -lpesslsmp -lblacssmp -lxlf90_r -lxlsmp -lxlfmath -lmpi_mpifh
    -lcublas -lcudart -L/usr/local/cuda/lib64 -R/usr/local/cuda/lib64
    -L/opt/ibm/xlsmp/xlsmp_version.release/lib -L/opt/ibm/xlf/xlf_version.release/lib -R/opt/ibm/lib

To use gcc as your C compiler, you can compile and link by running the commands that are listed in Table 7-14.

Table 7-14 C program gcc compile and link commands for use with IBM Spectrum MPI

64-bit SMP:
  export OMPI_CC=gcc
  mpicc -O xyz.c -lesslsmp -lpesslsmp -lblacssmp -lxlf90_r -lxl -lxlsmp -lxlfmath -lm -lmpi_mpifh
    -L/opt/ibm/xlsmp/xlsmp_version.release/lib -L/opt/ibm/xlf/xlf_version.release/lib -R/opt/ibm/lib
64-bit SMP CUDA:
  export OMPI_CC=gcc
  mpicc -O program.c -lesslsmpcuda -lpesslsmp -lblacssmp -lxlf90_r -lxl -lxlsmp -lxlfmath -lm -lmpi_mpifh
    -lcublas -lcudart -L/usr/local/cuda/lib64 -R/usr/local/cuda/lib64
    -L/opt/ibm/xlsmp/xlsmp_version.release/lib -L/opt/ibm/xlf/xlf_version.release/lib -R/opt/ibm/lib

To develop C++ programs with IBM Parallel ESSL, use the commands that are listed in Table 7-15.

Table 7-15 C++ program compile and link commands for use with IBM Spectrum MPI

64-bit SMP:
  mpiCC -O program.C -lesslsmp -lpesslsmp -lblacssmp -lxlf90_r -lxlsmp -lxlfmath -lmpi_mpifh
    -L/opt/ibm/xlsmp/xlsmp_version.release/lib -L/opt/ibm/xlf/xlf_version.release/lib -R/opt/ibm/lib
64-bit SMP CUDA:
  mpiCC -O program.C -lesslsmpcuda -lpesslsmp -lblacssmp -lxlf90_r -lxlsmp -lxlfmath -lmpi_mpifh
    -lcublas -lcudart -L/usr/local/cuda/lib64 -R/usr/local/cuda/lib64
    -L/opt/ibm/xlsmp/xlsmp_version.release/lib -L/opt/ibm/xlf/xlf_version.release/lib -R/opt/ibm/lib

To use g++ as your C++ compiler, you can compile and link by running the commands that are listed in Table 7-16.

Table 7-16 C++ program g++ compile and link commands for use with IBM Spectrum MPI

64-bit SMP:
  export OMPI_CXX=g++
  mpiCC -O program.C -lesslsmp -lpesslsmp -lblacssmp -lxlf90_r -lxl -lxlsmp -lxlfmath -lm -lmpi_mpifh
    -L/opt/ibm/xlsmp/xlsmp_version.release/lib -L/opt/ibm/xlf/xlf_version.release/lib -R/opt/ibm/lib
64-bit SMP CUDA:
  export OMPI_CXX=g++
  mpiCC -O program.C -lesslsmpcuda -lpesslsmp -lblacssmp -lxlf90_r -lxl -lxlsmp -lxlfmath -lm -lmpi_mpifh
    -lcublas -lcudart -L/usr/local/cuda/lib64 -R/usr/local/cuda/lib64
    -L/opt/ibm/xlsmp/xlsmp_version.release/lib -L/opt/ibm/xlf/xlf_version.release/lib -R/opt/ibm/lib

For more information about all of the features of the IBM Parallel ESSL library, see IBM Knowledge Center.

7.5 Using POWER9 vectorization

SIMD instructions are building blocks that are used to achieve parallelism at the CPU level. The Power Architecture implements SIMD through the VMX and VSX technologies, which are specified in Power Instruction Set Architecture (Power ISA) V3.0 for POWER9 processors.

Techniques to use data parallelism through SIMD instructions are called vectorization. They can be used by the compiler in the form of auto-vectorization transformations or made available to applications as APIs.

GCC and IBM XL compilers provide vector APIs that are based on the AltiVec specification for C, C++, and Fortran. Their API is composed of built-in functions (also known as intrinsics), defined vector types, and extensions to the programming language semantics. Each compiler vector implementation is explained in the following sections.

For more information about the auto-vectorization features of the GCC and IBM XL compilers, see 7.1, “Compiler options” on page 162.

7.5.1 AltiVec operations with GCC

GCC provides a modified API that uses AltiVec operations for the POWER family of processors. This interface is made available by including altivec.h in the source code and by using the -maltivec and -mabi=altivec compiler options. This feature is also made available if the code is compiled with -mvsx because this option enables -maltivec with more features that use VSX instructions.

For more information about the AltiVec API specification, see the “PowerPC AltiVec Built-in Functions” section of Using the GNU Compiler Collection (GCC).

The following features are implemented:
 Add the keywords __vector, vector, __pixel, pixel, __bool, and bool. The vector, pixel, and bool keywords are implemented as context-sensitive and predefined macros that are recognized only when used in C-type declaration statements. In C++ applications, they can be undefined for compatibility.
 Unlike the AltiVec specification, the GCC implementation does not allow a typedef name as a type identifier. You must use the actual __vector keyword, for example, typedef signed short int16; __vector int16 myvector.
 Vector data types are aligned on a 16-byte boundary.
 Aggregates (structures and arrays) and unions that contain vector types must be aligned on 16-byte boundaries.
 A load or store to unaligned memory must be carried out explicitly by one of the vec_ld, vec_ldl, vec_st, or vec_stl operations. However, the load of an array of components does not need to be aligned, but it must be accessed with attention to its alignment, which is often carried out with a combination of vec_lvsr, vec_lvsl, and vec_perm operations.
 Using sizeof() for vector data types (or pointers) returns 16 (for 16 bytes).
 An assignment operation (a = b) is allowed only if both sides have the same vector types.

 The address operation &p is valid if p is a vector type.
 The usual pointer arithmetic can be performed on a vector type pointer p, in particular:
   – p+1 is a pointer to the next vector after p.
   – Pointer dereference (*p) implies either a 128-bit vector load from or store to the address that is obtained by clearing the low-order bits of p.

C arithmetic and logical operators (+, -, *, /, unary minus, ^, |, &, ~, and %), shifting operators (<< and >>), and comparison operators (==, !=, <, <=, >, and >=) can be used on these types. The compiler generates the correct SIMD instructions for the hardware.

Table 7-17 shows the vector data type extensions as implemented by GCC. Vector types are signed by default unless the unsigned keyword is specified. The only exception is vector char, which is unsigned by default. The hardware does not have instructions for supporting the vector long long and vector bool long long types, but they can be used for floating-point/integer conversions.

Table 7-17 Vector types as implemented by GCC

Vector types        Description
vector char         Vector of sixteen 8-bit characters.
vector bool         Vector of sixteen 8-bit unsigned characters.
vector short        Vector of eight 16-bit short integers.
vector bool short   Vector of eight 16-bit unsigned short integers.
vector pixel        Vector of eight 16-bit unsigned short integers.
vector int          Vector of four 32-bit integers.
vector bool int     Vector of four 32-bit integers.
vector float        Vector of four 32-bit floats.
vector double       Vector of two 64-bit doubles. Requires compiling with -mvsx.
vector long         Vector of two 64-bit signed integers. Implemented in 64-bit mode only. Requires compiling with -mvsx.
vector long long    Vector of two 64-bit signed integers.
vector bool long    Vector of two 64-bit signed integers.

In addition to vector operations, GCC has built-in functions for cryptographic instructions that operate on vector types. For more information about the implementation and a comprehensive list of built-in functions, see the PowerPC AltiVec documentation.
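
The following minimal sketch (the file name vec_add.c is assumed; compile with gcc -O2 -maltivec -mvsx vec_add.c) exercises both the C operators and an AltiVec built-in on GCC vector types:

#include <altivec.h>
#include <stdio.h>

int main(void) {
    vector int a = {1, 2, 3, 4};
    vector int b = {10, 20, 30, 40};

    vector int c = a + b;          /* the compiler generates a SIMD add */
    vector int d = vec_add(a, b);  /* the equivalent AltiVec built-in */

    for (int i = 0; i < 4; i++)
        printf("%d %d\n", c[i], d[i]);
    return 0;
}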

7.5.2 AltiVec operations with IBM XL

The IBM XL compiler family provides an implementation of AltiVec APIs through feature extensions for C and C++. Fortran extensions for vector support are also available.

IBM XL C/C++
To use vector extensions, the application must be compiled with -mcpu=pwr9, and -qaltivec must be in effect.

The IBM XL C/C++ implementation defines the vector (or alternatively, __vector) keywords that are used in the declaration of vector types.

Similar to GCC implementation, IBM XL C/C++ allows unary, binary, and relational operators to be applied to vector types. It implements all data types that are listed in Table 7-17 on page 190.

The indirection operator, asterisk (*), is extended to handle pointer to vector types. Pointer arithmetic is also defined so that a pointer to the following vector can be expressed as v+ 1.

Vector types can be cast to other vector types (but not to a scalar type). The cast does not perform a conversion: the same bits are reinterpreted, so element values can change. Casting between vector and scalar pointers is also allowed if the memory is maintained on a 16-byte alignment.
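
The following minimal sketch shows these extensions with IBM XL C (the compile command xlc_r -O2 -qarch=pwr9 -qaltivec vec_xl_add.c is assumed); the final cast relies on the 16-byte alignment rule that is described above:

#include <altivec.h>
#include <stdio.h>

int main(void) {
    __vector float a = {1.0f, 2.0f, 3.0f, 4.0f};
    __vector float b = {10.0f, 20.0f, 30.0f, 40.0f};
    __vector float c = vec_add(a, b);   /* element-wise addition built-in */

    float *p = (float *)&c;             /* vector-to-scalar pointer cast; c is 16-byte aligned */
    for (int i = 0; i < 4; i++)
        printf("%.1f\n", p[i]);
    return 0;
}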

For more information about IBM XL C/C++ 16.1.0 vector support, vector initialization, and the vec_step operator, see IBM Knowledge Center.

For more information about built-in functions for vector operations, see IBM Knowledge Center.

IBM XL Fortran
To use vector extensions, the application must be compiled with -qarch=pwr9.

The IBM XL Fortran language extension defines the VECTOR keyword, which is used to declare 16-byte vector entities that can hold PIXEL, UNSIGNED, INTEGER, and REAL type elements. PIXEL (2 bytes) and UNSIGNED (unsigned integer) types also are extensions to the language. They must be used within vectors only.

Vectors are automatically aligned to 16 bytes, but exceptions apply. For more information about vector types on IBM XL Fortran 16.1.0, see IBM Knowledge Center.

For the list of vector intrinsic procedures that are available with IBM XL Fortran V16.1.0, see IBM Knowledge Center.

Example 7-10 uses the IBM XL Fortran vector library to perform the following calculation:

C = α[A] + [B]

Alpha is a real scalar value, and A, B, and C are matrixes of conforming shape.

Example 7-10 Fortran program that demonstrates use of IBM XL compiler vectors
 1 SUBROUTINE VSX_TEST
 2 implicit none
 3
 4 real*8, allocatable :: A(:), B(:), C(:), CT(:)
 5 real*8 alpha
 6 integer*8 max_size, ierr
 7 integer*8 i, j, it
 8 integer*8 ia, ialign
 9 integer n, nalign
10
11 vector(real*8) va1, va2, va3
12 vector(real*8) vb1, vb2
13 vector(real*8) vc1, vc2
14 vector(real*8) valpha
15
16 max_size = 2000
17 alpha = 2.0d0
18
19 ierr = 0
20 allocate(A(max_size),stat=ierr)
21 allocate(B(max_size),stat=ierr)
22 allocate(C(max_size),stat=ierr)
23 allocate(CT(max_size),stat=ierr)
24 if (ierr .ne. 0) then
25   write(*,*) 'Allocation failed'
26   stop 1
27 endif
28
29 do i = 1, max_size
30   a(i) = 1.0d0*i
31   b(i) = -1.0d0*i
32   ct(i) = alpha*a(i) + b(i)
33 enddo
34
35 ia = LOC(A)
36 ialign = IAND(ia, 15_8)
37 nalign = MOD(RSHIFT(16_8-ialign,3_8),7_8) + 2
38
39 ! Compute Head
40 j = 1
41 do i = 1, nalign
42   C(j) = B(j) + alpha * A(j)
43   j = j + 1
44 enddo
45
46 n = max_size - nalign - 4
47 it = rshift(n, 2)
48
49 va1 = vec_xld2( -8, A(j))
50 va2 = vec_xld2(  8, A(j))
51 va3 = vec_xld2( 24, A(j))
52
53 va1 = vec_permi( va1, va2, 2)
54 va2 = vec_permi( va2, va3, 2)
55
56 vb1 = vec_xld2(  0, B(j))
57 vb2 = vec_xld2( 16, B(j))
58
59 do i = 1, it-1
60   vc1 = vec_madd( valpha, va1, vb1)
61   vc2 = vec_madd( valpha, va2, vb2)
62
63   va1 = va3
64   va2 = vec_xld2( 40, A(j))
65   va3 = vec_xld2( 56, A(j))
66
67   va1 = vec_permi( va1, va2, 2)
68   va2 = vec_permi( va2, va3, 2)
69
70   call vec_xstd2( va1, 0, C(j))
71   call vec_xstd2( va2, 16, C(j))
72
73   vb1 = vec_xld2( 32, B(j))
74   vb1 = vec_xld2( 48, B(j))
75
76   j = j + 4
77 enddo
78
79 vc1 = vec_madd( valpha, va1, vb1)
80 vc2 = vec_madd( valpha, va2, vb2)
81
82 call vec_xstd2( va1, 0, C(j))
83 call vec_xstd2( va2, 16, C(j))
84
85 ! Compute Tail
86 do i = j, max_size
87   C(i) = B(i) + alpha * A(i)
88 enddo
89
90 do i = 1, 10
91   write(*,*) C(i), CT(i)
92 enddo
93
94 END SUBROUTINE VSX_TEST

Lines 11 - 14 in Example 7-10 show the declaration of 16-byte vectors that hold 8-byte real elements. The called methods vec_xld2 (load), vec_permi (permute), and vec_madd (fused multiply-add) are SIMD operations that are applied to the vector types.

7.6 Development models

The high-performance computing (HPC) solution that is proposed in this book contains a software stack that allows for the development of C, C++, and Fortran applications by using different parallel programming models. In this context, applications can be implemented by using pure models, such as MPI, OpenMP, CUDA, OpenACC, Parallel Active Messaging Interface (PAMI), or OpenSHMEM, or by using some combinations of these models (also known as hybrid models).

This section describes aspects of IBM Parallel Environment, compilers (GNU and IBM XL families), libraries, and toolkits that developers can use to implement applications on pure or hybrid parallel programming models. It does not describe how those applications can be run with IBM Parallel Environment (for more information, see 8.3, “Using IBM Parallel Environment V2.3” on page 241).

7.6.1 OpenMP programs with IBM Parallel Environment

OpenMP applies a shared memory parallel programming model of development. It is a multi-platform and directive-based API that is available to many languages, including C, C++, and Fortran.

The development and execution of OpenMP applications are fully supported by IBM Parallel Environment Runtime Edition. OpenMP parallelism uses directives that apply shared memory parallelism by defining various types of parallel regions. The parallel regions can include iterative and non-iterative segments of code.

The uses of the #pragma omp directives can be put into the following general categories:
 Defines parallel regions in which work is done by threads in parallel (#pragma omp parallel). Most of the OpenMP directives statically or dynamically bind to an enclosing parallel region.
 Defines how work is distributed or shared across the threads in a parallel region (#pragma omp sections, #pragma omp for, #pragma omp single, and #pragma omp task).
 Controls synchronization among threads (#pragma omp atomic, #pragma omp master, #pragma omp barrier, #pragma omp critical, #pragma omp flush, and #pragma omp ordered).
 Defines the scope of data visibility across parallel regions within the same thread (#pragma omp threadprivate).
 Controls synchronization (#pragma omp taskwait and #pragma omp barrier).
 Controls data or computation that is on another computing device.

You can specify the visibility context for selected data variables by using some OpenMP clauses. Scope attribute clauses are listed in Table 7-18.

Table 7-18 OpenMP variable scope on IBM Parallel Environment

Data scope attribute clause   Description
private        The private clause declares the variables in the list to be private to each thread in a team.
firstprivate   The firstprivate clause declares the variables in the list to be private to each thread in a team, with the value that the variable had before the parallel section.
lastprivate    Similar to firstprivate, but lastprivate updates the variable in the outer scope of the parallel region with the last value that it had in the parallel region.
shared         The shared clause declares the variables in the list to be shared among all the threads in a team. All threads within a team access the same storage area for shared variables.
reduction      The reduction clause performs a reduction on the scalar variables that appear in the list by using a specified operation.
default        The default clause allows the user to affect the data-sharing attribute of the variables that appear in the parallel construct.

194 IBM High-Performance Computing Insights with IBM Power System AC922 Clustered Solution To demonstrate some of the clauses that are listed in Table 7-18 on page 194, Example 7-11 shows the following simple parallel algorithm that can be used to calculate the dot product of two vectors:

N ⋅ = AB  aibi i = 1 , (),,,… (),,,… A ==a1 a2 aN ;B b1 b2 bN

A and B are vectors of conforming shape. The algorithm is shown in Example 7-11.

Example 7-11 OpenMP C source code that calculates the dot product of two vectors
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void print_v(double*,int);

int main(int argc, char* argv[]) {
    if(argc!=4) {
        printf("No verbose || size of vectors || or number of threads provided\nAborting...\n");
        return -1;
    }

    int i, verbose, N, n_threads;
    double *a, *b, sum;

    sum = 0.0;
    verbose = atoi(argv[1]);
    N = atoi(argv[2]);
    n_threads = atoi(argv[3]);
    a = (double*) malloc(sizeof(double)*N);
    b = (double*) malloc(sizeof(double)*N);

    for(i=0;i<N;i++) {      /* assumed initialization values; the original loop body was lost in extraction */
        a[i] = 1.0;
        b[i] = 2.0;
    }
    if(verbose) {           /* assumed use of the verbose flag to print the input vectors */
        print_v(a,N);
        print_v(b,N);
    }

    #pragma omp parallel for default(none) firstprivate(N) private(i) shared(a,b) reduction(+:sum) num_threads(n_threads)
    for(i=0; i<N; i++)
        sum += a[i]*b[i];

    printf("Dot product: %f\n", sum);   /* assumed output of the result */

    return 0;
}

void print_v(double *v, int N) {
    int i;
    printf("[");
    for(i=0;i<N;i++)
        printf(" %f", v[i]);
    printf(" ]\n");
}

The #pragma omp line of the source code that is shown in Example 7-11 on page 195 is of interest in this section:

#pragma omp parallel for default(none) firstprivate(N) private(i) shared(a,b) reduction(+:sum) num_threads(n_threads)

 The default(none) clause is used to make the compiler remind you that it must know the scope of each variable in the parallel section.
 The firstprivate(N) clause is used because you want each thread to have its own copy of N with its previous value. In contrast, private(i) is used because you want the loop variable to be thread-independent and you do not need its previous value. The shared(a,b) variables are the data that is accessed by each thread independently because of the private i index.
 The reduction(+:sum) clause is there because you want to add all of the independent products into the sum variable after the loop.
 The num_threads(n_threads) clause defines the number of threads that run the #pragma omp parallel section.

Example 7-12 shows how to compile the source code.

Example 7-12 Compiling an OpenMP program with IBM XL C
$ xlc_r -O3 -Wall -o dot_xlc dot_xlc.c -qsmp=omp -mcpu=pwr9 -mtune=pwr9

From a compiler perspective, IBM XL C/C++ 16.1.0 and Fortran 16.1.0 support OpenMP API V4.5. Similarly, the GCC compiler that is provided by the IBM Advance Toolchain for Linux on Power is based on OpenMP API V4.5.

For more information about IBM XL 16.1.0 and OpenMP, see Optimization and Programming Guide for Little Endian Distributions.

For more information about OpenMP examples and the directives and their applications, see OpenMP Application Programming Interface: Examples.

For more information about offloading code to GPU, see the Offloading Support in GCC page.

7.6.2 CUDA C programs with the NVIDIA CUDA Toolkit

The development of parallel programs that use the General Purpose GPU (GPGPU) model is provided in the Power AC922 server with support for the NVIDIA CUDA Toolkit 9.2 for Linux on POWER9 Little Endian.

This section describes relevant aspects of the CUDA development in the proposed solution. This section also describes characteristics of the NVIDIA V100 GPU, integration between compilers, and the availability of libraries. For more information about the NVIDIA CUDA development model and resources, see the NVIDIA CUDA Zone website.

Understanding the NVIDIA V100 GPU CUDA capabilities
The Power AC922 server supports two, four, or six NVIDIA Tesla V100 GPUs, each implementing the Tesla V100/SXM2 CUDA compute architecture. The Tesla V100 architecture delivers CUDA compute capability V7.0.

Table 7-19 lists the available resources (per GPU) and the limitations for the CUDA C++ applications that are found with the deviceQuery script that is available with the sample codes from NVIDIA.

Table 7-19 CUDA available resources per GPU

GPU resources                                    Value
Total global memory                              15360 MB (16,106,127,360 bytes)
Total amount of constant memory                  64 KB
Shared memory per block                          48 KB
Stream Multiprocessors (SMs)                     80
Maximum warps per SM                             64
Threads per warp                                 32
Maximum threads per SM                           2048
Maximum thread blocks per SM                     32
Maximum threads per block                        1024
Maximum grid size                                (x, y, z) = (2147483647, 65535, 65535)
Maximum dimension size of a thread block         (x, y, z) = (1024, 1024, 64)
Maximum texture dimension size (x, y, z)         1D = (131072), 2D = (131072, 65536), 3D = (16384, 16384, 16384)
Maximum layered texture size and number of layers  1D = (32768), 2048 layers; 2D = (32768, 32768), 2048 layers

The GV100 GPU that is included in the Tesla V100 has the following innovative features:
 Extreme performance: Powering HPC, deep learning (DL), and many more GPU computing areas.
 Second-generation NVLink: The NVIDIA new high-speed, high-bandwidth interconnect for maximum application scalability includes cache coherence capabilities with IBM POWER9 processor-based servers.
 High-bandwidth memory (HBM2): The fastest, high-capacity, efficient, and stacked GPU memory architecture.
 Unified memory and compute preemption: Significantly improved programming model.
 12 nm FFN: Enables more features, higher performance, and improved power efficiency.

NVIDIA nvcc compiler
The NVIDIA nvcc compiler driver is responsible for generating the final executable file, which is a combination of host (CPU) and device (GPU) codes.

The IBM XL and GCC compilers can be used to generate the host code with their own optimization flags. As described in 7.1, “Compiler options” on page 162, this process produces optimized code to run on the POWER9 processor. By default, nvcc uses GCC unless the -ccbin flag is passed to set another back-end compiler.

Note: The nvcc compiler uses the GCC C++ compiler by default to generate the host code. If wanted, the IBM XL C++ compiler can be used instead by passing the -ccbin xlC option.

The NVIDIA Tesla V100 GPUs are built on Tesla GV100 architecture, which provides CUDA compute capability V7.0. Use the -gencode compiler option to set the virtual architecture and binary compatibility. For example, -gencode arch=compute_70,code=sm_70 generates code that is compatible with the Tesla GV100 architecture (virtual architecture V7.0).

Example 7-13 shows a CUDA program that prints to standard output the index values for 2D thread blocks. The executable file is built with nvcc, which uses xlC as the host compiler (by using the -ccbin xlC option). Parameters to xlC are passed by the -Xcompiler option.

Example 7-13 CUDA C program and nvcc compilation
$ cat -n hello.cu
#include <stdio.h>
#include <stdlib.h>

__global__ void helloKernel() {
    int tidx = threadIdx.x;
    int tidy = threadIdx.y;
    int blockidx = blockIdx.x;
    int blockidy = blockIdx.y;

    printf("I am CUDA thread (%d, %d) in block (%d, %d)\n", tidx, tidy, blockidx, blockidy);
};

int main(int argc, char* argv[]) {
    dim3 block(16,16);
    dim3 grid(2,2);
    helloKernel<<<grid, block>>>();
    cudaDeviceSynchronize();
    return 0;
}

$ nvcc -ccbin xlC -Xcompiler="-qarch=pwr9 -qtune=pwr9 -qhot -O3" -gencode arch=compute_70,code=sm_70 hello.cu -o helloCUDA
$ ./helloCUDA
I am CUDA thread (0, 10) in block (1, 0)
I am CUDA thread (1, 10) in block (1, 0)
I am CUDA thread (2, 10) in block (1, 0)
I am CUDA thread (3, 10) in block (1, 0)
I am CUDA thread (4, 10) in block (1, 0)
I am CUDA thread (5, 10) in block (1, 0)
<... Output omitted ...>

nvcc also supports cross-compilation of CUDA C and C++ code to PowerPC 64-bit Little Endian (ppc64le).

Note: The support for POWER9 Little Endian was introduced in CUDA Toolkit 9.1.

For more information about the nvcc compilation process and options, see the NVIDIA CUDA Compiler NVCC page.

CUDA libraries
Accompanying the CUDA Toolkit for Linux are the following mathematical, utility, and scientific libraries that use the GPU to improve performance and scalability:
 cuBLAS: Basic linear algebra subroutines
 cuFFT: Fast Fourier transforms
 cuRAND: Random number generation
 cuSOLVER: Dense and sparse direct linear solvers and Eigen solvers
 cuSPARSE: Sparse matrix routines
 Thrust: Parallel algorithms and data structures

Example 7-14 shows the use of the following cuBLAS API to calculate the dot product of two vectors.

N ⋅ = AB  aibi i = 1 , (),,,… (),,,… A ==a1 a2 aN ;B b1 b2 bN

A and B are vectors of conforming shape.

This algorithm uses IBM XL C++ (xlC) to generate the host code and is compiled by using the -lcublas flag, which links dynamically with the cuBLAS library.

Example 7-14 Sample code for CUDA C by using cuBLAS library
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

#define N 10000

int main(int argc, char* argv[]) {
    int const VEC_SIZE = N*sizeof(double);
    double* h_vec_A = (double*) malloc(VEC_SIZE);
    double* h_vec_B = (double*) malloc(VEC_SIZE);

    double *d_vec_A, *d_vec_B, result;
    cublasStatus_t status;
    cublasHandle_t handler;
    cudaError_t error;

    // Initialize with random numbers between 0-1
    int i;
    for(i=0; i<N; i++) {    /* loop body reconstructed (assumption): the original was lost in extraction */
        h_vec_A[i] = (double) rand() / RAND_MAX;
        h_vec_B[i] = (double) rand() / RAND_MAX;
    }

    // Allocate device memory and copy the host vectors to the GPU (reconstructed, assumption)
    error = cudaMalloc((void**)&d_vec_A, VEC_SIZE);
    error = cudaMalloc((void**)&d_vec_B, VEC_SIZE);
    error = cudaMemcpy(d_vec_A, h_vec_A, VEC_SIZE, cudaMemcpyHostToDevice);
    error = cudaMemcpy(d_vec_B, h_vec_B, VEC_SIZE, cudaMemcpyHostToDevice);

    // Initialize cuBLAS
    status = cublasCreate(&handler);

    // Calculate DOT product
    status = cublasDdot(handler, N, d_vec_A, 1, d_vec_B, 1, &result);
    if(status != CUBLAS_STATUS_SUCCESS) {
        printf("Program failed to calculate DOT product\n");
        return EXIT_FAILURE;
    }
    printf("The DOT product is: %G\n", result);

    // Tear down cuBLAS
    status = cublasDestroy(handler);
    return EXIT_SUCCESS;
}

The source code that is shown in Example 7-14 on page 199 is built and run as shown in Example 7-15.

Example 7-15 Building and running source code and cuBLAS sample code
$ make
/usr/local/cuda-9.2/bin/nvcc -ccbin xlC -Xcompiler -qtune=pwr9 -Xcompiler -qhot -Xcompiler -O3 -Xcompiler -qarch=pwr9 -gencode arch=compute_70,code=compute_70 -o cublas-example.o -c
/usr/local/cuda-9.2/bin/nvcc -ccbin xlC -Xcompiler -qtune=pwr9 -Xcompiler -qarch=pwr9 -gencode arch=compute_30,code=compute_30 -o cublas-example cublas-example.o -lcublas

$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-9.2/lib64/
$ ./cublas-example
The DOT product is: 2500.22

7.6.3 OpenACC

As an option that is provided by the OpenPOWER Foundation, IBM works with NVIDIA to enable the use of the PGI compilers on POWER8 and POWER9 processor-based systems. As with IBM XL, this compiler can also use most of the features that NVLink technology provides to offload code to the Tesla GPUs.

The offload through the PGI compiler is performed by the OpenACC model, which, similar to OpenMP, was developed to ease the programming of heterogeneous CPU and GPU hybrid software. By using simple clauses, such as #pragma acc kernels{}, the programmer can indicate loops where parallelism might be found while the compiler creates CUDA kernels under the hood for the programmer.

In this section, consider the integral of the following curve, which leads to the value of pi when evaluated over the closed interval [0,1]:

∫ (from 0 to 1) 4 / (1 + x²) dx = π ≈ 3.1416

The simple addition of the #pragma acc kernels{} (the bolded line in Example 7-16) results in a significant performance increase.

Example 7-16 Parallel code for integration by using openacc
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/time.h>

int32_t timer(struct timeval *, struct timeval *, struct timeval *);

int32_t main(int32_t argc, char* argv[]) {
    if(argc!=4) {
        printf("No number of steps OR number of blocks OR number of threads were provided\nAborting...\n");
        return -1;
    }

    uint32_t i, N, n_blocks, n_threads;
    double x, pi, step, sum;
    struct timeval start_princ, finish_princ, diff_princ;

    sum = 0.0;
    N = atol(argv[1]);
    n_blocks = atol(argv[2]);
    n_threads = atol(argv[3]);
    step = 1.0 / ((double) N);

    gettimeofday(&start_princ,NULL);
    #pragma acc kernels
    for(i=0; i<N; i++) {    /* loop body reconstructed (assumption): midpoint rule for 4/(1+x*x) on [0,1] */
        x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x*x);
    }
    pi = step * sum;
    gettimeofday(&finish_princ,NULL);

    timer(&diff_princ, &finish_princ, &start_princ);
    printf("PI: %.15f\nExecution Time = %ld.%06ld s\n", pi, diff_princ.tv_sec, diff_princ.tv_usec);
    fprintf(stderr,"%ld.%06ld\n", diff_princ.tv_sec, diff_princ.tv_usec);

    return 0;
}

int32_t timer(struct timeval *result, struct timeval *t2, struct timeval *t1) {
    int32_t diff = (t2->tv_usec + 1000000 * t2->tv_sec) - (t1->tv_usec + 1000000 * t1->tv_sec);
    result->tv_sec = diff / 1000000;
    result->tv_usec = diff % 1000000;
    return (diff<0);
}

Example 7-16 on page 201 is an excellent starting point to parallelize software through OpenACC. However, to achieve better performance, the programmer must fine-tune the #pragma acc calls. Therefore, consider the effect of sharing the integral loop among multiple blocks and threads, as shown in Example 7-17.

Example 7-17 OpenACC parallelization in n_blocks with n_threads each
--- acc_pi_simple.c	2018-10-15 18:12:44.158132792 -0400
+++ acc_pi.c	2018-10-15 18:13:08.187224187 -0400
@@ -26,7 +26,7 @@
     step = 1.0 / ((double) N);
 
     gettimeofday(&start_princ,NULL);
-    #pragma acc kernels
+    #pragma acc parallel loop reduction(+:sum) device_type(nvidia) vector_length(n_threads) gang worker num_workers(n_blocks)
     for(i=0; i<N; i++) {

In Example 7-17, you divide the for loop between blocks of several threads. You also must add a reduction clause to add all of the individual computations across the blocks at the end of the execution.

The compilation command for our configuration that uses a Tesla V100 is shown in Example 7-18. The -acc option tells the compiler to process the source while recognizing the #pragma acc directives. The -Minfo option tells the compiler to share information about the optimization during the compilation process: all stands for accel,inline,ipa,loop,lre,mp,opt,par,unified,vect, and intensity asks for the compute intensity information of the loops.

Example 7-18   OpenACC compilation example

$ pgcc -acc -Minfo=all,intensity -ta=tesla:cc70 -o acc_pi acc_pi.c -lm
main:
     29, Accelerator kernel generated
         Generating Tesla code
         30, #pragma acc loop gang, worker(n_blocks), vector(n_threads) /* blockIdx.x threadIdx.y threadIdx.x */
             Generating reduction(+:sum)
     29, Generating implicit copy(sum)

In addition, the -ta flag chooses the target accelerator, which can also be set to the host CPU by using the multicore option so that the code runs on the host's logical CPUs instead. However, in our example, it is set to our Tesla V100 GPU, which has compute capability 7.0. You can find more information by running the pgaccelinfo command. To find the correct cc GPU flag, run the command that is shown in Example 7-19.

Example 7-19   Finding more information about the NVIDIA board

$ pgaccelinfo
CUDA Driver Version:           9020
NVRM version:                  NVIDIA UNIX ppc64le Kernel Module  396.37  Tue Jun 12 17:21:36 PDT 2018

Device Number:                 0
Device Name:                   Tesla V100-SXM2-16GB
Device Revision Number:        7.0
Global Memory Size:            16106127360
Number of Multiprocessors:     80
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1530 MHz
... Output Omitted ...
PGI Default Target:            -ta=tesla:cc70

You can also use the command that is shown in Example 7-20.

Example 7-20   Finding the compute capability of your board

$ pgaccelinfo | grep -m 1 "PGI Default Target"
PGI Default Target:            -ta=tesla:cc70

For more information about running OpenACC parallelization and the performance that is achieved in Example 7-17 on page 202, see 8.2.2, “OpenACC execution and scalability” on page 240.

For more information about how to use and install OpenACC, see the following resources:
 Resources page of the OpenACC website
 OpenACC Courses page

7.6.4 IBM XL C/C++ and Fortran offloading

Another useful feature of the Power AC922 server is the ability to use OpenMP directives to offload work to the GPUs from IBM XL C/C++ and Fortran programs. The combination of the POWER processors with the NVIDIA GPUs provides a robust platform for heterogeneous HPC that can run several scalable technical computing workloads efficiently. The computational capability is built on top of massively parallel and multithreaded cores within the NVIDIA GPUs and the IBM POWER processors.

For more information about the power of this feature, see Optimization and Programming Guide for Little Endian Distributions.

IBM XL Fortran and IBM XL C/C++ V16.1.0 support the OpenMP API V4.5 specification. Compute-intensive parts of an application and their associated data can be offloaded to the NVIDIA GPUs by using the supported OpenMP device constructs (preprocessor directives) to control parallel processing.

Note: The pragma directives take effect only when parallelization is enabled with the -qsmp compiler option. The compilation must also include the -qoffload option to enable offloading to the GPUs.

For example, you can use the omp target directive to define a target region, which is a block of computation that operates within a distinct data environment and is intended to be offloaded into a parallel computation device during execution.
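The following minimal sketch (a generic example that is not part of the original text; the array names, sizes, and values are arbitrary) shows a target region that offloads a simple vector update when it is compiled with -qsmp=omp -qoffload:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int n = 1 << 20;
    double *x = (double *) malloc(n * sizeof(double));
    double *y = (double *) malloc(n * sizeof(double));

    for (int i = 0; i < n; i++) {
        x[i] = 1.0;
        y[i] = 2.0;
    }

    /* Map x and y to the device, run the loop there, and copy y back. */
    #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = 2.0 * x[i] + y[i];

    printf("y[0] = %f (expected 4.0)\n", y[0]);

    free(x);
    free(y);
    return 0;
}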

Table 7-20 lists some of the supported OpenMP 4.5 pragma clauses.

Table 7-20   Supported OpenMP 4.5 pragma clauses (IBM XL Fortran directive, IBM XL C/C++ pragma, and description)

TARGET DATA omp target data The omp target data directive maps variables to a device data environment and defines the lexical scope of the data environment that is created. The omp target data directive can reduce data copies to and from the offloading device when multiple target regions are using the same data.

TARGET ENTER DATA omp target enter data The omp target enter data directive maps variables to a device data environment. The omp target enter data directive can reduce data copies to and from the offloading device when multiple target regions are using the same data and when the lexical scope requirement of the omp target data construct is not appropriate for the application.


TARGET EXIT DATA omp target exit data The omp target exit data directive unmaps variables from a device data environment. The omp target exit data directive can limit the amount of device memory when you use the omp target enter data construct to map items to the device data environment.

TARGET omp target The omp target directive instructs the compiler to generate a target task, that is, to map variables to a device data environment and to run the enclosed block of code on that device. Use the omp target directive to define a target region, which is a block of computation that operates within a distinct data environment and is intended to be offloaded onto a parallel computation device during execution.

TARGET UPDATE omp target update The omp target update directive makes the list items in the device data environment consistent with the original list items by copying data between the host and the device. The direction of data copying is specified by motion-type.

DECLARE TARGET omp declare target The omp declare target directive specifies that variables and functions are mapped to a device so that these variables and functions can be accessed or run on the device.

TEAMS omp teams The omp teams directive creates a collection of thread teams. The master thread of each team runs the teams region.

DISTRIBUTE omp distribute The omp distribute directive distributes the iterations of one or more loops across the master threads of the teams that run the enclosing teams region.

DISTRIBUTE PARALLEL DO omp distribute parallel for The omp distribute parallel for directive runs a loop by using multiple teams, where each team typically consists of several threads. The loop iterations are distributed across the teams in chunks in round-robin fashion.

For more information about all supported IBM XL OpenMP clauses, see Compiler Reference for Little Endian Distributions.

To test the scalability of this feature, use the following dgemm example:

C = \alpha A B + \beta C

Alpha and beta are real scalar values, and A, B, and C are matrixes of conforming shape.

Example 7-21 presents a column major order implementation of the dgemm algorithm. This code was created to present the compilation and execution of a real example of an algorithm that uses IBM XL C offloading on NVIDIA GPUs. In addition, it aims to compare a hand-coded implementation against the advantages of using the vast API of IBM Parallel ESSL, such as the implementations that are described in 7.4.2, “Using GPUs with the IBM Parallel ESSL” on page 182. Finally, this implementation was designed to ease the creation of a scalability curve, which gives a broader perspective of the performance gain that is achieved by offloading the code to the GPU.

Example 7-21   Column major order implementation of dgemm to test scalability of IBM XL C offloading

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <sys/time.h>

#define interval 0.5
#define verbose 0

double float_rand();
void timer(struct timeval *, struct timeval *, struct timeval *);
void print_mat(double *mat, long N);

int main(int argc, char* argv[])
{
    if(argc != 4) {
        printf("No matrix dimension OR number of blocks OR number of threads were provided\nAborting...\n");
        return -1;
    }

    struct timeval diff, start, end;
    long i, j, k, N, n_blocks, n_threads;
    double *a, *b, *c, *tmp, alpha, beta, sum;
    double flop, exec_time;

    srand((unsigned int)time(NULL));

    N = atol(argv[1]);
    n_blocks = atol(argv[2]);
    n_threads = atol(argv[3]);

    alpha = ((double)1.3);
    beta = ((double)2.4);
    flop = (double)(N*N*(2*(N-1)));

    a = (double*) malloc(N * N * sizeof(double));
    b = (double*) malloc(N * N * sizeof(double));
    c = (double*) malloc(N * N * sizeof(double));
    tmp = (double*) malloc(N * N * sizeof(double));

    for(i = 0; i < N*N; i++) {
        a[i] = float_rand();
        b[i] = float_rand();
        c[i] = float_rand();
    }

    if(verbose) {
        printf("A\n"); print_mat(a,N);
        printf("\nB\n"); print_mat(b,N);
        printf("\nC\n"); print_mat(c,N);
    }

    gettimeofday(&start,NULL);
    #pragma omp target map(to: a[0:N*N], b[0:N*N], c[0:N*N]) map(from: tmp[0:N*N])
    #pragma omp teams num_teams(n_blocks) thread_limit(n_threads)
    #pragma omp distribute parallel for private(sum, j, k)   /* j and k must be private to each thread */
    for(i = 0; i < N; i++) {
        for(j = 0; j < N; j++) {
            sum = 0.0;
            /* column major product: C(i,j) = alpha * sum_k A(i,k)*B(k,j) + beta * C(i,j) */
            for(k = 0; k < N; k++)
                sum += a[k*N + i] * b[j*N + k];
            tmp[j*N + i] = alpha * sum + beta * c[j*N + i];
        }
    }
    gettimeofday(&end,NULL);

    memcpy(c, tmp, N * N * sizeof(double));

    if(verbose) {
        printf("\nC = alpha * A * B + beta * C\n");
        print_mat(c,N);
    }

    timer(&diff, &end, &start);

    exec_time = diff.tv_sec + 0.000001 * diff.tv_usec;
    printf("\nExecution time = %lf s, %lf MFlops\n\n", exec_time, (flop / exec_time));
    fprintf(stderr, "%.6lf,%.6lf\n", exec_time, (flop / exec_time));

    free(a); free(b); free(c);

    return 0;
}

double float_rand()
{
    double a = ((double)-1.0) * interval;
    double b = interval;
    return b + ((double)rand() / ( RAND_MAX / (a-b) ) );
}

void timer(struct timeval *result, struct timeval *t2, struct timeval *t1)
{
    long diff = (t2->tv_usec + 1000000 * t2->tv_sec) - (t1->tv_usec + 1000000 * t1->tv_sec);
    result->tv_sec = diff / 1000000;
    result->tv_usec = diff % 1000000;
}

void print_mat(double *mat, long N)
{
    long i, j;

    for(i = 0; i < N; i++) {
        for(j = 0; j < N; j++) {
            (mat[ (i)*N + (j) ] >= 0.0) ? printf(" %f ",mat[ (i)*N + (j) ]) : printf("%f ",mat[ (i)*N + (j) ]);
        }
        printf("\n");
    }
}

Example 7-22 shows how to compile the source code.

Example 7-22 How to compile a program that uses IBM XL offloading $ xlc_r -O3 -Wall -o dgemm_xlc dgemm_xlc.c -qsmp=omp -qoffload -mcpu=pwr9 -mtune=pwr9

Depending on the computation logic, some algorithms perform better with column major order than with row major order, and vice versa, as illustrated in the sketch that follows. This behavior usually occurs because of an increase in the cache hit ratio, but one of these configurations can also mask a false sharing situation.
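As an illustration (a generic sketch, not code from Example 7-21), the following two helper functions show how the same element (i,j) of an N x N matrix that is stored in a one-dimensional array is addressed under each storage order; whichever order matches the direction of the innermost loop keeps the accesses at stride one and improves the cache hit ratio:

/* Row major layout: element (i,j) of an N x N matrix is stored at mat[i*N + j],
 * so scanning along j walks contiguous memory (stride 1). */
double row_major_get(const double *mat, long N, long i, long j)
{
    return mat[i * N + j];
}

/* Column major layout: element (i,j) is stored at mat[j*N + i],
 * so scanning along i walks contiguous memory instead. */
double col_major_get(const double *mat, long N, long i, long j)
{
    return mat[j * N + i];
}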

Although Example 7-21 on page 206 dynamically allocates the matrixes and calculates the matrix indexes explicitly throughout the code, the IBM ESSL version in Example 7-5 on page 177 performs this function at line 60 by selecting only one parameter of the dgemm function.

Therefore, it is important to study and use the highly optimized IBM ESSL API as much as possible.

Example 7-21 on page 206 also shows the syntax for copying dynamically allocated arrays to the GPU (a statically allocated array can be mapped by using just the name of the variable). The data copy is performed by the #pragma omp target directive in Example 7-21 on page 206, where the map() clauses copy A, B, and C into the GPU and copy tmp out of the GPU at the end of the calculation. This process avoids multiple copies at the end of each iteration of the outer for loop.

We then use the #pragma omp teams directive, nested with the thread_limit clause, to run our code in groups of threads in a way that resembles warps in CUDA code. Finally, we use #pragma omp distribute to divide the work among our teams, each with a specific number of threads.

For more information about all the features of IBM XL offloading, see Product documentation for XL C/C++ for Linux, V16.1.0, for Little Endian distributions.

For more information about OpenMP examples, see the OpenMP 4.5 documentation.

7.6.5 MPI programs with IBM Parallel Environment V2.3

Note: IBM Parallel Environment is being deprecated in favor of IBM Spectrum Message Passing Interface (IBM Spectrum MPI), which is a lighter, high-performance implementation of the MPI standard.

For more information, see 7.6.9, “MPI programs that use IBM Spectrum MPI” on page 220.

The following sections remain in this book for completeness and migration purposes. These sections will be removed in future releases of this publication:
 7.6.5, “MPI programs with IBM Parallel Environment V2.3” on page 209
 7.6.6, “Hybrid MPI and CUDA programs with IBM Parallel Environment” on page 215
 7.6.7, “OpenSHMEM programs in IBM Parallel Environment” on page 218
 7.6.8, “Parallel Active Messaging Interface programs” on page 219

The MPI development and runtime environment that is provided by IBM Parallel Environment V2.3 includes the following general characteristics:
 Provides the implementation of the MPI V3.0 standard based on the open source MPICH project.
 The MPI library uses the PAMI protocol as a common transport layer.
 Supports MPI applications in C, C++, and Fortran.
 Supports 64-bit applications only.
 Supports GNU and IBM XL compilers.
 MPI operations can be carried out on main or user-space threads.
 The I/O component (also known as Message Passing Interface I/O (MPI-IO)) is an implementation of ROMIO3 that is provided by MPICH V3.1.2.
 Provides a CUDA-aware MPI implementation.

3 For more information about ROMIO, see ROMIO: A High-Performance Portable MPI-IO Implementation.

 Uses a shared memory mechanism for message transport between tasks on the same compute node. In contrast, the User Space communication subsystem, which provides direct access to a high-performance communication network by way of an InfiniBand adapter, is used for internode tasks.
 Allows message striping, failover, and recovery on multiple or single (with some limitations) network configurations.
 Allows for dynamic process management.

This section introduces some MPI implementation aspects of IBM Parallel Environment Runtime Edition and general guidance about how to build parallel applications. For more information, see IBM Parallel Environment Runtime Edition for Linux: MPI Programming Guide.

Message Passing Interface API
The MPI implementation of IBM Parallel Environment is based on MPICH. For more information about the MPI API, see the MPICH website.

Provided compilers
IBM Parallel Environment provides a set of compilation scripts that is used to build parallel applications, with support for the GNU and IBM XL family compilers for C, C++, and Fortran.

C, C++, and Fortran applications are built by using the mpcc (C), mpCC (C++), and mpfort (Fortran) compilation scripts, which link with the threaded version of the MPI and poe libraries by default. They also apply some instrumentation to the binary file so that poe is indirectly started to manage the parallel execution.

The mpicc (C), mpicxx (C++), mpif77 (Fortran77), and mpif90 (Fortran 90) compilation scripts are designed to build MPICH-based parallel applications. A program that is compiled with those scripts can be run through poe.

Example 7-23 shows the mpicc command compiling an MPI C application by using the GCC compiler.

Example 7-23   IBM Parallel Environment Runtime Edition mpicc command to compile an MPI C program

$ export PATH=/opt/ibmhpc/pecurrent/base/bin:$PATH
$ mpicc -compiler gnu -O3 -mcpu=power9 -mtune=power9 -o myApp main.c

To display the command that is run to compile the application, use the -show option. For the command in Example 7-23, mpicc -show produces the following output:

$ mpicc -show -compiler gnu -O3 -mcpu=power9 -mtune=power9 -o myApp main.c
/usr/bin/gcc -Xlinker --no-as-needed -O3 -mcpu=power9 -mtune=power9 -o myApp main.c -m64 -D__64BIT__ -Xlinker --allow-shlib-undefined -Xlinker --enable-new-dtags -Xlinker -rpath -Xlinker /opt/ibmhpc/pecurrent/mpich/gnu/lib64 -I/opt/ibmhpc/pecurrent/mpich/gnu/include64 -I/opt/ibmhpc/pecurrent/base/include -L/opt/ibmhpc/pecurrent/mpich/gnu/lib64 -lmpi

All of these compilation scripts use the IBM XL compilers unless the MP_COMPILER variable or -compiler option is set, which instructs the compilers to use another compiler. You can use the gnu or xl option values to use GNU or IBM XL compilers. For third-party compilers, use the fully qualified path (for example, MP_COMPILER=/opt/at12.0/bin/gcc).

Note: The compilation scripts that are provided by the latest version of the IBM Parallel Environment Runtime Edition are installed in the /opt/ibmhpc/pecurrent/base/bin directory.

Details of the MPI-IO implementation
The ROMIO implementation of IBM Parallel Environment Runtime Edition is configured to use the IBM Spectrum Scale file system, which delivers high-performance I/O operations. Some environment variables are also introduced so that users can control the behavior of some operations, such as collective aggregations.

Note: Although the configuration also supports NFS and POSIX-compliant file systems, some limitations can apply.

The file system detection mechanism uses system calls unless the parallel file system is set by using the ROMIO_FSTYPE_FORCE environment variable. By passing hints to ROMIO, you can change many of the default MPI-IO settings; the hints file is specified in the ROMIO_HINTS environment variable. Hints can also be passed programmatically, as shown in the sketch that follows.
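The following fragment is a generic sketch (not taken from the original text) of attaching hints through an MPI_Info object when the file is opened; the hint keys that are shown (such as cb_buffer_size) are common ROMIO hints, and the exact set that is honored is documented in the IBM Parallel Environment Runtime Edition MPI Programming Guide.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_File fh;
    MPI_Info info;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Info_create(&info);
    /* Illustrative ROMIO hints: enable collective buffering for reads and
     * writes and set the collective buffer size to 16 MiB. */
    MPI_Info_set(info, "romio_cb_read", "enable");
    MPI_Info_set(info, "romio_cb_write", "enable");
    MPI_Info_set(info, "cb_buffer_size", "16777216");

    MPI_File_open(MPI_COMM_WORLD, "hints_demo.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    /* Each task writes its rank at its own offset as a minimal example. */
    MPI_File_write_at(fh, (MPI_Offset)(rank * sizeof(int)), &rank, 1,
                      MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
    MPI_Info_free(&info);

    MPI_Finalize();
    return 0;
}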

Local rank property
The MPI implementation that IBM Parallel Environment provides has a mechanism to determine the rank of a task among the other tasks that are running on the same machine. This ranking is also known as the task local rank.

You can use the read-only MP_COMM_WORLD_LOCAL_RANK variable to obtain the local rank. The local rank is attributed to each task and is made available through the runtime environment.

Example 7-24 shows an MPI C program that reads the MP_COMM_WORLD_LOCAL_RANK variable and prints its value to standard output.

Example 7-24   Simple MPI C program that prints the task local rank

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    int world_rank, world_size, local_rank;
    char hostname[255];

    MPI_Init(&argc, &argv);

    gethostname(hostname, 255);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    local_rank = atoi(getenv("MP_COMM_WORLD_LOCAL_RANK"));
    printf("Task %d: running on node %s and local rank %d\n", world_rank, hostname, local_rank);

    MPI_Finalize();
    return EXIT_SUCCESS;
}

The output of the program in Example 7-24 on page 211 is shown in Example 7-25.

Example 7-25   Output of a simple MPI C program that prints the local rank

$ mpcc main.c
$ MP_RESD=poe MP_PROCS=5 ./a.out
Task 0: running on node xcat-mn.xcat-cluster and local rank 0
Task 1: running on node xcat-mn.xcat-cluster and local rank 1
Task 4: running on node xcat-mn.xcat-cluster and local rank 4
Task 2: running on node xcat-mn.xcat-cluster and local rank 2
Task 3: running on node xcat-mn.xcat-cluster and local rank 3

Switching MPI configurations by using environment modules
The process of compiling and running a parallel application with IBM Parallel Environment requires setting several environment variables. To ease this task, there are some available predefined profiles that use environment modules4 to change the system’s variables. Since IBM Parallel Environment V2.3, the following development profiles for MPI applications are provided:
 perf: Compile the MPI application with IBM XL and set it to run in development mode with minimal error checking.
 debug: Compile the MPI application with IBM XL and set it to run in development mode with the debug versions of the libraries.
 trace: Compile the MPI application with IBM XL and set it to run in development mode with the trace libraries.

The environment module command (module) can be installed in the system. For Red Hat Enterprise Linux 7.3, it can be installed by running the following command: # yum install environment-modules

After the module command is installed in the system, you can list all the available modules, add the modules that are provided by IBM Parallel Environment, and load modules, as shown in Example 7-26.

Example 7-26   Using modules to set an environment to build and run MPI applications

$ module avail

-------------------------- /usr/share/Modules/modulefiles --------------------------
dot          module-git   module-info  modules      null         use.own
$ module use -a /opt/ibmhpc/pecurrent/base/module
$ echo $MODULEPATH
/usr/share/Modules/modulefiles:/etc/modulefiles:/opt/ibmhpc/pecurrent/base/module
$ module avail

-------------------------- /usr/share/Modules/modulefiles --------------------------
dot          module-git   module-info  modules      null         use.own

4 For more information about Linux environment modules, see Modules - Software Environment Management.

-------------------------- /opt/ibmhpc/pecurrent/base/module --------------------------
pe2300.xl.debug  pe2300.xl.perf   pe2300.xl.trace
$ module whatis pe2300.xl.debug
pe2300.xl.debug      : Adds PE environment variables for xl compiler and debug develop mode to user environment.

$ module whatis pe2300.xl.perf
pe2300.xl.perf       : Adds PE environment variables for xl compiler and performance develop mode to user environment.

$ module whatis pe2300.xl.trace
pe2300.xl.trace      : Adds PE environment variables for xl compiler and trace develop mode to user environment.

$ env | grep MP
$ module load pe2300.xl.perf
Adds these PE settings into your environment:

MP_COMPILER=xl
MP_EUIDEVELOP=min
MP_MPILIB=mpich
MP_MSG_API=mpi
MP_CONFIG=2300
$ env | grep MP
MP_EUIDEVELOP=min
MP_CONFIG=2300
MP_MSG_API=mpi
MP_MPILIB=mpich
MP_COMPILER=xl
$ module load pe2300.xl.debug
Adds these PE settings into your environment:

MP_COMPILER=xl
MP_EUIDEVELOP=debug
MP_MPILIB=mpich
MP_MSG_API=mpi
MP_CONFIG=2300
$ env | grep MP
MP_EUIDEVELOP=debug
MP_CONFIG=2300
MP_MSG_API=mpi
MP_MPILIB=mpich
MP_COMPILER=xl

Simulating different SMT modes
The example cluster that is shown in Figure 7-4 contains two nodes (Power AC922 systems), each of which has two sockets. One nonuniform memory access (NUMA) node corresponds to one socket and has 22 physical POWER9 cores inside it. The systems are configured with simultaneous multithreading (SMT)-4, which means that a physical core can split a job between four logical CPUs (threads).

Figure 7-4   Structure of a node in a cluster (two sockets with NUMA nodes 0 and 8, POWER9 physical cores 0 - 43, and 176 logical CPUs numbered 0 - 175)

IBM Parallel Environment MPI provides run options or environment variables that help to simulate runs with different SMT modes. The following cases describe two ways where a job uses only one logical CPU from each POWER9 core (SMT-1 mode) and fully uses all 176 logical CPUs by using the code in Example 7-15 on page 200 with IBM Parallel ESSL calls:
 To configure the job to use only one logical CPU per POWER9 core, run the following command:
MP_TASK_AFFINITY=core:1 MP_RESD=poe MP_PROCS=44 ./test_pdgemm
MP_TASK_AFFINITY=core:1 directs the IBM Parallel Environment MPI so that each MPI task can use only one POWER9 core. Overall, the full cluster has 88 POWER9 cores (each node contains 44 cores), so a maximum of 88 MPI tasks (MP_PROCS=88) can be used for such a run. Another possible way to simulate SMT-1 mode is to use 44 MPI tasks with two POWER9 cores for each of them by running the following command:
MP_TASK_AFFINITY=core:2 MP_RESD=poe MP_PROCS=44 ./test_pdgemm
 The following command shows how to use the cluster and all 176 logical CPUs per node:
MP_TASK_AFFINITY=cpu:16 MP_RESD=poe MP_PROCS=22 ./test_pdgemm
MP_TASK_AFFINITY=cpu:16 means that each MPI task can use only 16 logical CPUs. Each node in the cluster contains 176 logical CPUs, so 11 MPI tasks per node and 22 for the overall cluster can be used in this run.

If you want to change the CPU affinity variable and still use all logical CPUs, the number of MPI tasks must be changed. The following commands describe the alternative calls:
MP_TASK_AFFINITY=cpu:8 MP_RESD=poe MP_PROCS=44 ./test_pdgemm
MP_TASK_AFFINITY=cpu:4 MP_RESD=poe MP_PROCS=88 ./test_pdgemm
MP_TASK_AFFINITY=cpu:2 MP_RESD=poe MP_PROCS=176 ./test_pdgemm

7.6.6 Hybrid MPI and CUDA programs with IBM Parallel Environment

The IBM Parallel Environment compilers and runtime environment support building and running hybrid MPI and CUDA programs.

Building the program
In this use case, you organize the sources in separate files for the MPI and CUDA code, use different compilers to build the objects, and then link them by using the MPI compiler. Example 7-27 shows this procedure.

Example 7-27   Building hybrid MPI and CUDA programs

$ mpCC -o helloMPICuda_mpi.o -c helloMPICuda.cpp
$ nvcc -ccbin g++ -m64 -gencode arch=compute_70,code=sm_70 -o helloMPICuda.o -c helloMPICuda.cu
$ mpCC -o helloMPICuda helloMPICuda_mpi.o helloMPICuda.o -L/usr/local/cuda-9.2/lib64 -lcudart

You build the MPI source by using the mpCC script, and you use the nvcc compiler to build the CUDA code. Then, the object files are linked to the executable file and the CUDA runtime library. Implicitly, mpCC links the executable file to libmpi (MPI), libpami (PAMI), and libpoe (Parallel Operating Environment (POE)).

Source files that contain both MPI and CUDA code (the spaghetti programming style) can be compiled, but this style is not a good programming practice. In this case, you can compile the code by using nvcc (CUDA C compiler) and setting the MPI library and the headers as shown in the following example: $ nvcc -I/opt/ibmhpc/pecurrent/mpich/gnu/include64/ -L/opt/ibmhpc/pecurrent/mpich/gnu/lib64/ -L/opt/ibmhpc/pecurrent/base/gnu/lib64/ -lmpi -lpami main.cu

CUDA-aware MPI support
The CUDA-aware MPI feature of IBM Parallel Environment grants tasks direct access to the GPU memory buffers for message passing operations, which can significantly improve application performance.

Note: The CUDA-aware MPI API was introduced in IBM Parallel Environment V2.3.

The CUDA development model has the following general steps:
1. Allocate data on the host (CPU) memory.
2. Allocate data on the device (GPU) memory.
3. Move the data from host to device memory.
4. Perform some computation (kernel) on that data on the device.
5. Move processed data from the device back to the host memory.

In the context of send-receive communication without the CUDA-aware MPI capability, the tasks cannot access the GPU memory. Therefore, step 5 on page 215 is required because the data must be on the host memory before it is sent. However, with CUDA-aware MPI the task accesses the portion of memory that is allocated in step 2 on page 215, which means that the data does not need to be staged into host memory (step 5 on page 215 is then optional).

Note: By default, CUDA-aware MPI is disabled on IBM Parallel Environment Runtime Edition. You can enable or disable it by setting the MP_CUDA_AWARE environment variable to yes (enable) or no (disable).

The code that is shown in Example 7-28 shows how the CUDA-aware feature can be used within a hybrid CUDA and MPI program. It is a simple MPI program that is meant to run two tasks, where task 0 initializes an array, increments its values by one by using GPU computation, and then sends the result to task 1. In line 56, the call to the MPI_Send function uses the device buffer (allocated in line 40) directly.

Example 7-28   Simple CUDA-aware MPI program
1  #include <mpi.h>
2  #include <stdio.h>
3  #include <stdlib.h>
4  #include <string.h>
5  #include <assert.h>
6
7  __global__ void vecIncKernel(int* vec, int size) {
8    int tid = blockDim.x * blockIdx.x + threadIdx.x;
9    if(tid < size) {
10     vec[tid] += 1;
11   }
12 }
13
14 #define ARRAY_ELEM 1024
15 #define ARRAY_ELEM_INIT 555
16
17 int main(int argc, char* argv[]) {
18   int taskid, numtasks, tag=0;
19   MPI_Status status;
20   int array_size = sizeof(int) * ARRAY_ELEM;
21   int threadsPerBlock = 32;
22   int blockSize = ceil(ARRAY_ELEM/threadsPerBlock);
23
24   MPI_Init(&argc, &argv);
25   MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
26   MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
27
28   if(numtasks != 2) {
29     printf("This program must run only 2 tasks\n");
30   }
31   /*
32    * Task 0: initialize an array, increment its values by 1,
33    * and send it to Task 1.
34    */
35   if(taskid == 0) {
36     int *vSend = (int*) malloc(array_size);
37     for(int i=0; i < ARRAY_ELEM; i++)
38       vSend[i]=ARRAY_ELEM_INIT;
39     int *vDev;
40     if(cudaMalloc((void **)&vDev, array_size) == cudaSuccess) {
41       cudaMemcpy(vDev,vSend, array_size,cudaMemcpyHostToDevice);
42       vecIncKernel<<< blockSize, threadsPerBlock>>>(vDev, ARRAY_ELEM);
43       cudaDeviceSynchronize();
44     } else {
45       printf("Failed to allocate memory on GPU device\n");
46       MPI_Abort(MPI_COMM_WORLD, MPI_ERR_OTHER);
47       exit(0);
48     }
49     if(strcmp(getenv("MP_CUDA_AWARE"),"yes") != 0) {
50       printf("Cuda-aware MPI is disabled, MPI_Send will fail.\n");
51     }
52     /*
53      * CUDA-AWARE MPI Send: using the buffer allocated in GPU device.
54      * Do not need to transfer data back to host memory.
55      */
56     MPI_Send(vDev, ARRAY_ELEM, MPI_INT, 1, tag, MPI_COMM_WORLD);
57   } else {
58     /*
59      * Task 1: receive array from Task 0 and verify its values
60      * are incremented by 1.
61      */
62     int *vRecv = (int*) malloc(array_size);
63     MPI_Recv(vRecv, ARRAY_ELEM, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
64     int expected = ARRAY_ELEM_INIT+1;
65     for(int i=0; i < ARRAY_ELEM_INIT; i++) {
66       assert(vRecv[i]==expected);
67     }
68   }
69
70   MPI_Finalize();
71   return 0;
72 }

The program that is shown in Example 7-28 on page 216 was compiled and run twice, as shown in Example 7-29. The first run enables CUDA-aware MPI (see the line starting with MP_CUDA_AWARE=yes). Then, it runs with the feature disabled (the line starting with MP_CUDA_AWARE=no), which forces it to exit with a segmentation fault. In this situation, you can implement remediation by copying the buffer back to the host memory and by using that host buffer in the MPI message passing call.

Example 7-29   Compiling and running the simple CUDA-aware MPI program

$ LD_LIBRARY_PATH=/opt/ibmhpc/pecurrent/mpich/gnu/lib64/:/opt/ibmhpc/pecurrent/base/gnu/lib64/:$LD_LIBRARY_PATH
$ nvcc -I/opt/ibmhpc/pecurrent/mpich/gnu/include64/ -L/opt/ibmhpc/pecurrent/mpich/gnu/lib64/ -L/opt/ibmhpc/pecurrent/base/gnu/lib64/ -lmpi -lpami main.cu
$ MP_CUDA_AWARE=yes MP_RESD=poe MP_PROCS=2 poe ./a.out
$ MP_CUDA_AWARE=no MP_RESD=poe MP_PROCS=2 poe ./a.out
5 Cuda-aware MPI is disabled, MPI_Send will fail.
6 ERROR: 0031-250  task 0: Segmentation fault

As of IBM Parallel Environment V2.3, CUDA-aware MPI is implemented for the following features:
 All two-sided (point-to-point) communication (blocking and non-blocking, and intra- and intercommunication)
 All blocking intracommunicator collectives

However, the following features are not supported in the context of CUDA-aware MPI:
 One-sided communications
 One-sided synchronization calls
 Fabric Collective Accelerator (FCA)
 Non-blocking collective API calls
 Intercommunicator collective API
 The collective selection with the pami_tune command

Using MPI local rank to balance the use of GPU devices
You can implement a policy to grant shared access and load balancing of GPU devices among local tasks (running on the same compute node) by using the local rank feature that is available in the IBM Parallel Environment MPI. For example, one algorithm can limit the number of tasks that use GPUs to avoid oversubscription.
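A minimal sketch of such a policy follows (not taken from the original text); it reads the MP_COMM_WORLD_LOCAL_RANK variable that is described in 7.6.5 and assigns each local task a GPU in round-robin fashion:

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int world_rank, local_rank, num_devices;
    char *env;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Local rank provided by the IBM Parallel Environment runtime. */
    env = getenv("MP_COMM_WORLD_LOCAL_RANK");
    local_rank = env ? atoi(env) : 0;

    cudaGetDeviceCount(&num_devices);
    /* Round-robin assignment: local task i uses device i modulo the device count. */
    cudaSetDevice(local_rank % num_devices);

    printf("Task %d (local rank %d) bound to GPU %d of %d\n",
           world_rank, local_rank, local_rank % num_devices, num_devices);

    MPI_Finalize();
    return 0;
}

Tasks that share a device in this way can still oversubscribe it, which is where the shared-GPU considerations that are referenced in the next subsection become relevant.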

For more information about the IBM Parallel Environment local rank, see 7.6.5, “MPI programs with IBM Parallel Environment V2.3” on page 209.

Starting concurrent CUDA kernels from multiple tasks
There are some considerations about the shared use of GPUs with multiple MPI tasks. For more information, see 9.5.2, “CUDA Multi-Process Service” on page 292.

7.6.7 OpenSHMEM programs in IBM Parallel Environment

OpenSHMEM5 provides a specification API and reference implementation for the Partitioned Global Address Space (PGAS) parallel programming model, which abstracts the concept of global shared memory on processes that are distributed across different address spaces. The OpenSHMEM implementation uses a communication library that often uses Remote Direct Memory Access (RDMA) techniques.

IBM Parallel Environment provides an OpenSHMEM runtime library (libshmem.so), compiler scripts, and tools that enable developing and running OpenSHMEM programs. The PAMI library is used for communication among processing elements of the PGAS program, and it can use the RDMA capabilities of the InfiniBand interconnect to improve performance.

As of IBM Parallel Environment V2.3, support is available only for parallel programs that are written in C and C++. The IBM Parallel Environment implementation is based on the OpenSHMEM API specification V1.26 with some minor deviations. For more information about supported and unsupported routines, see the OpenSHMEM API support page.

The OpenSHMEM tools are installed in the /opt/ibmhpc/pecurrent/base/bin directory. The compile scripts are oshcc (C) and oshCC (C++). The oshrun script runs programs, although you can use poe as an alternative.

5 For more information about OpenSHMEM, see OpenSHMEM.
6 For the OpenSHMEM V1.2 specification document, see OpenSHMEM Application Programming Interface.

Example 7-30 shows a simple OpenSHMEM program. The API can be used only after a shared memory section is initialized (line 8). A symmetric memory area (the basket variable that is declared in line 6) is written by the processing elements (the call to shmem_int_put in line 14) on processing element zero. Then, all of the processing elements read the variable from processing element zero (the call to shmem_int_get in line 15) to print its value.

Example 7-30   Simple OpenSHMEM program
1  #include <stdio.h>
2  #include <stdlib.h>
3  #include <shmem.h>
4
5  int main() {
6    static int basket;           // Global shared (symmetric) variable
7    int cents_p, cents_g;        // Processing element's local variable
8    shmem_init();                // Initialize SHMEM
9    int my_pe = shmem_my_pe();   // Processing element's number
10   //printf("Hello, I am PE %d\n", my_pe);
11   cents_p = rand()%10;
12   shmem_barrier_all();
13   if(my_pe != 0)
14     shmem_int_put(&basket, &cents_p, 1, 0);
15   shmem_int_get(&cents_g, &basket, 1, 0);
16   printf("Hello, I am PE %d. I put %d cents but I get %d\n", my_pe, cents_p, cents_g);
17   shmem_finalize();            // Finalize SHMEM
18
19   return 0;
20 }

The source code that is shown in Example 7-30 can be compiled by using the oshcc script, as shown in Example 7-31. It is a common practice that the name of an OpenSHMEM executable file includes the suffix .x.

Example 7-31   Compiling an OpenSHMEM program

$ oshcc -O3 hellosh.c -o hellosh.x
$ ls
hellosh.c  hellosh.x

By default, oshcc or oshCC compiles the application with xlc or xlC unless it is not installed or the MP_COMPILER environment variable is set to use gcc or g++.

Start the program by using the oshrun script. Alternatively, you can use poe directly. For more information about running OpenSHMEM programs, see 8.3.3, “Running OpenSHMEM programs” on page 246.

7.6.8 Parallel Active Messaging Interface programs

PAMI is a low-level protocol that is the foundation of communications on MPI and OpenSHMEM implementations of the IBM Parallel Environment Runtime Edition. It provides an API for programs to access PAMI capabilities, such as collective communications and RDMA that use InfiniBand Host Channel Adapters (HCAs).

Many pieces of sample code that can be used to demonstrate PAMI subroutines are included with IBM Parallel Environment Runtime Edition and are available in the /opt/ibmhpc/pecurrent/ppe.samples/pami folder. Example 7-32 shows how to build the PAMI samples and run a program (alltoall.exe) that uses the all-to-all communication functions.

Example 7-32   Building and running PAMI sample code

$ cp -r /opt/ibmhpc/pecurrent/ppe.samples/pami .
$ cd pami/
$ make
$ cd pami_samples
$ vi host.list
$ MP_RESD=poe MP_PROCS=4 ./coll/alltoall.exe
# Context: 0
#  Alltoall Bandwidth Test(size:4) 0x1000e400, protocol: I1:Alltoall:P2P:P2P
#  Size(bytes)  iterations   bytes/sec      usec
#  -----------  ----------   -----------    ----
           1        100        413223.1     2.42
           2        100        840336.1     2.38
           4        100       1659751.0     2.41
           8        100       3347280.3     2.39
          16        100       6722689.1     2.38
          32        100      13502109.7     2.37
          64        100      26778242.7     2.39
         128        100      53112033.2     2.41
         256        100      81012658.2     3.16
         512        100     171812080.5     2.98
        1024        100     320000000.0     3.20
        2048        100     552021563.3     3.71
        4096        100     869639065.8     4.71

For more information, see IBM Parallel Environment Runtime Edition for Linux: PAMI Programming Guide.

7.6.9 MPI programs that use IBM Spectrum MPI

IBM Spectrum MPI V10.2.0.6 is a high-performance implementation of the MPI standard. It is widely used in the HPC industry for developing scalable, parallel applications on IBM Power Systems servers. IBM Spectrum MPI supports a broad range of industry-standard platforms, interconnects, and operating systems (OSs), which helps ensure that parallel applications can run almost anywhere.

IBM Spectrum MPI includes the following features:
 Portability: A parallel software developer can create one program in a small cluster and scale it through the interconnects in a large cluster. This feature reduces development time and effort.
 Network optimization: IBM Spectrum MPI supports various networks and interconnects.
 Collective optimization: Through the libcollectives library, IBM Spectrum MPI can enable GPU buffers and enhance performance and scalability. It provides advanced logic to determine the fastest algorithm for any given collective operation.
 GPU support: Provides built-in support for NVIDIA GPUs.

IBM Spectrum MPI is not ABI compatible with other MPI implementations. For example, it does not interoperate with Open MPI, Platform MPI, or IBM Parallel Environment Runtime Edition. Multithreaded I/O also is not supported by IBM Spectrum MPI.

The IBM Spectrum MPI collectives component (libcollectives) does not support intercommunicators. For intercommunicator collective support, IBM Spectrum MPI relies on OpenMPI collective components.

For more information about the usage, installation, and limitations of IBM Spectrum MPI V10.2.0.6, see the following resources:
 IBM Spectrum MPI Version 10 Release 2.0.6 User’s Guide
 IBM Spectrum MPI Version 10 Release 2.0.6 Installation
 IBM Spectrum MPI Version 10 Release 2.0.6 Admin Guide

7.6.10 Migrating from an IBM Parallel Environment Runtime Edition environment to IBM Spectrum MPI

Table 7-21 lists the instructions for porting code from IBM Parallel Environment Runtime Edition to IBM Spectrum MPI.

Table 7-21   IBM Parallel Environment Runtime Edition tasks and IBM Spectrum MPI equivalents

Running programs
  IBM Parallel Environment Runtime Edition method: poe program [args]
  IBM Spectrum MPI method: mpirun [options] program [options] [args]

Compiling programs
  IBM Parallel Environment Runtime Edition method: the mpfort, mpif77, or mpif90 (Fortran), mpcc or mpicc (C), and mpCC, mpic++, or mpicxx (C++) compiler commands. Alternatively, you can use the environment variable setting MP_COMPILER = xl | gcc | nvcc.
  IBM Spectrum MPI method: the mpfort (Fortran), mpicc (C), and mpiCC, mpic++, or mpicxx (C++) compiler commands. Alternatively, you can use the environment variable settings OMPI_CC = xl | gcc, OMPI_FC = xlf | gfortran, and OMPI_CXX = xlC | g++.

Determining a rank before MPI_Init
  IBM Parallel Environment Runtime Edition method: the MP_CHILD environment variable
  IBM Spectrum MPI method: the OMPI_COMM_WORLD_RANK environment variable

Specifying the local rank
  IBM Parallel Environment Runtime Edition method: the MP_COMM_WORLD_LOCAL_RANK environment variable
  IBM Spectrum MPI method: the OMPI_COMM_WORLD_LOCAL_RANK environment variable

Setting affinity
  IBM Parallel Environment Runtime Edition method: the MP_TASK_AFFINITY environment variable set to cpu, core, mcm, cpu:n, core:n, or -1
  IBM Spectrum MPI method: the mpirun options -aff width:hwthread, -aff width:core, -aff width:numa, --map-by ppr:$MP_TASKS_PER_NODE:node:pe=N --bind-to hwthread, --map-by ppr:$MP_TASKS_PER_NODE:node:pe=N --bind-to core, or -aff none

Setting CUDA-aware MPI
  IBM Parallel Environment Runtime Edition method: the MP_CUDA_AWARE environment variable
  IBM Spectrum MPI method: the mpirun -gpu option

Setting FCA
  IBM Parallel Environment Runtime Edition method: the MP_COLLECTIVE_OFFLOAD environment variable
  IBM Spectrum MPI method: the mpirun -FCA and -fca options

Setting RDMA
  IBM Parallel Environment Runtime Edition method: the MP_USE_BULK_XFER and MP_BULK_MIN_MSG_SIZE environment variables
  IBM Spectrum MPI method: use the RDMA default when MSG_SIZE is greater than 64

Controlling the level of the debug messages
  IBM Parallel Environment Runtime Edition method: the MP_INFOLEVEL environment variable
  IBM Spectrum MPI method: the mpirun -d option

Setting STDIO
  IBM Parallel Environment Runtime Edition method: the MP_STDINMODE, MP_STDOUTMODE, and MP_LABELIO environment variables
  IBM Spectrum MPI method: the mpirun -stdio * options

Specifying the number of tasks
  IBM Parallel Environment Runtime Edition method: the MP_PROCS environment variable
  IBM Spectrum MPI method: the mpirun -np * option

Specifying a host list file
  IBM Parallel Environment Runtime Edition method: the MP_HOSTFILE environment variable
  IBM Spectrum MPI method: the mpirun -hostfile * option

For more information about migration, see IBM Spectrum MPI Version 10 Release 1.0.2 User’s Guide.

7.6.11 Using IBM Spectrum MPI

IBM Spectrum MPI is based on Open MPI, which makes its source code structure similar to what you find in Open MPI. This section describes the six most-used procedures of IBM Spectrum MPI, as listed in Table 7-22.

Table 7-22 IBM Spectrum most used procedures IBM Spectrum MPI procedure Function

MPI_Init() Starts the MPI function in a program.

MPI_Comm_rank() The ID of the MPI process that is running.

MPI_Comm_size() The total number of MPI processes that were created.

MPI_Send() Sends data to another MPI process.

MPI_Recv() Receives data from another MPI process.

MPI_Finalize() Ends MPI functions in a program.

To demonstrate these procedures, Example 7-33 shows the implementation of a simple parallel algorithm that calculates the well-known integration of a curve by following the trapezoidal rule, which is mathematically described as the discretization of an integral, as shown in the following formulas:

b fx()dx ≈ gx() a

ba– h = ------N N – 1 h gx()≈ --- ()fx()+ fx() 2  k k + 1 k = 0

N – 1 h h gx()≈≈---[]fx()++++2fx()…2fx()fx() ---()fx()+ fx()+ hfx() 2 0 1 N – 1 N 2 0 N  k k = 1 , ,,… , xk ==ahkk+ 01 N

The trapezoidal rule creates infinitesimal trapezoids under a curve, and through their area summation, the integral value of the curve in a close interval can be estimated. The bigger the number of trapezoids, the more accurate the result of the estimation is.

Why is this interesting in this section? Because of the commutative and associative properties of addition, you can split this integral evenly among p processes (for simplicity, assume an even division of the trapezoids among the processes). Consider the IBM Spectrum MPI code that is shown in Example 7-33.

Example 7-33   IBM Spectrum MPI code for a trapezoidal integration
1. #include <stdio.h>
2. #include <mpi.h>
3.
4. float g(float x)
5. {
6.   return x*x;
7. }
8.
9. float integral(float a, float b, int n, float h)
10.{
11.  int i;
12.
13.  float x;
14.  float local_sum;
15.
16.  x = a;
17.  local_sum = ( (g(a) + g(b)) / 2.0 );
18.
19.  for (i = 1; i <= n-1; i++)
20.  {
21.    x += h;
22.    local_sum += g(x);
23.  }
24.
25.  return local_sum * h;
26.}
27.
28.int main(int argc, char *argv[])
29.{
30.  int src;          // Process id of local integral calculations
31.  int dst = 0;      // Process id of father that sums local calcs
32.  int tag = 0;      // MPI tag value for messages. Not used here.
33.  int np;           // Number of processes
34.  int p_rank;       // Process rank
35.
36.  float a = -3.0;   // Left interval limit
37.  float b = 3.0;    // Right interval limit
38.  int n = 10000;    // Number of trapezoids
39.
40.  float local_a;    // Local Left interval limit
41.  float local_b;    // Local Right interval limit
42.  int local_n;      // Local Number of trapezoids
43.
44.  float h;          // Trapezoid length of base
45.  float local_sum;  // Local Integral per process
46.  float total_sum;  // Total local_sum of whole interval
47.
48.  MPI_Status status;
49.  MPI_Init(&argc, &argv);
50.  MPI_Comm_rank(MPI_COMM_WORLD, &p_rank);
51.  MPI_Comm_size(MPI_COMM_WORLD, &np);
52.
53.  h = (b-a) / n;    // Base length of trapezoids in all interval
54.  local_n = n / np; // Chunk size of trapezoids for each process
55.
56.  // Division of full interval using the rank of each process
57.  local_a = a + p_rank * local_n * h;
58.  local_b = local_a + local_n * h;
59.  local_sum = integral(local_a, local_b, local_n, h);
60.
61.  if (p_rank == 0)
62.  {
63.    total_sum = local_sum;
64.    for(src = 1; src < np; src++)
65.    {
66.      MPI_Recv(&local_sum, 1, MPI_FLOAT, src, tag, MPI_COMM_WORLD, &status);
67.      printf("MPI_Recv {%d <- %d} = %f\n", p_rank, src, local_sum);
68.      total_sum = total_sum + local_sum;
69.    }
70.  }
71.  else
72.  {
73.    printf("MPI_Send {%d -> %d} = %f\n", p_rank, dst, local_sum);
74.    MPI_Send(&local_sum, 1, MPI_FLOAT, dst, tag, MPI_COMM_WORLD);
75.  }
76.
77.  if(p_rank == 0)
78.    printf("\nEstimate of local_sum of x^2 = %f\nInterval [%.3f,%.3f]\nUsing %d trapezoids\n\n", total_sum, a, b, n);
79.
80.  MPI_Finalize();
81.  return 0;
82.}

In line 4 in Example 7-33 on page 223, notice that the function of interest is g(x) = x^2. In lines 36 - 37, notice that the integration is performed over the closed interval [-3, +3]. Also, in the main function, observe that the MPI_Init() function is used to start the MPI program. Immediately afterward, MPI_Comm_rank() and MPI_Comm_size() are called to fetch the ID of each process that is created by MPI_Init and the total number of processes, respectively.

At run time, IBM Spectrum MPI fetches the number of processes from the environment variable (or the command line) and creates processes for multiple instances of this program. Each instance has its own unique rank ID, which is of interest in this code.

The integral() function sequentially calculates the trapezoidal integral over a closed local interval by following the formulas that were described in 7.6.11, “Using IBM Spectrum MPI” on page 222. In lines 57 - 59 in Example 7-33 on page 223, notice that the local limits, which are computed by combining the interval limits with the number of trapezoids and the rank of each MPI process, split the main interval into separate local intervals. Therefore, each MPI process can calculate the trapezoidal integral of its own section independently.

What is still necessary is to join the results of each local integral. In line 61 in Example 7-33 on page 223, check whether you are running the main process (which has p_rank equal to 0). If so, iterate over all the other MPI process instances by using the MPI_Recv function to fetch each partial calculation and add it to total_sum. Otherwise, you are in a child process, and you use MPI_Send to send the partial calculation to the main process.

The following points describe what happens in the MPI_Send() and MPI_Recv() calls:
 MPI_Send(buf, count, type, dest, tag, comm) is a blocking MPI operation that uses the following arguments:
 – buf is the address of the buffer that is being sent.
 – count is the number of elements in the send buffer.
 – type is the MPI_Datatype of the elements in the buffer.
 – dest is the node rank ID of the destination process.
 – tag is the tag that the message can receive.
 – comm is the communicator of this message.
 MPI_Recv(buf, count, type, src, tag, comm, status) is a blocking MPI operation that uses the following arguments:
 – buf is the address of the buffer where the process receives data.
 – count is the number of elements in the receiving buffer.
 – type is the MPI_Datatype of the elements in the buffer.
 – src is the node rank ID of the process from where data is coming.
 – tag is the tag that the message can receive.
 – comm is the communicator of this message.
 – status is the object status.

To use the MPI_Datatype, see Table 7-23.

Table 7-23 Using MPI_Datatype MPI_Datatype C type equivalent

MPI_C_BOOL _Bool

MPI_CHAR char (text)

MPI_UNSIGNED_CHAR unsigned char (integer)

MPI_SIGNED_CHAR signed char (integer)

MPI_INT signed int

MPI_DOUBLE double

MPI_LONG_DOUBLE long double

MPI_C_DOUBLE_COMPLEX double _Complex

Finally, analyze the compilation and running of Example 7-33 on page 223. You use mpicc to compile and mpirun to run the parallel program, as shown in Example 7-34. On the mpirun call, you provide the number of processes that you want to use to run the code. The -pami_noib option is used because the processes are created on a single compute node, so no InfiniBand communication between nodes is needed.

Note: Although the receive loop iterates in what looks like a sequential order, the output has no guaranteed order.

Example 7-34 Compiling with IBM Spectrum MPI $ /opt/ibm/spectrum_mpi/bin/mpicc -O2 trap.c -o trap.mpi

$ /opt/ibm/spectrum_mpi/bin/mpirun -np 5 -pami_noib ./trap.mpi

MPI_Send {3 -> 0} = 1.871934 MPI_Send {1 -> 0} = 1.872049 MPI_Recv {0 <- 1} = 1.872049 MPI_Recv {0 <- 2} = 0.144001 MPI_Recv {0 <- 3} = 1.871934 MPI_Recv {0 <- 4} = 7.056407

Estimate of local_sum of x^2 = 17.999882 Interval [-3.000,3.000] Using 10000 trapezoids

MPI_Send {2 -> 0} = 0.144001 MPI_Send {4 -> 0} = 7.056407

If you perform an analytic integration of g(x) = x^2 in [-3;3], you find the value of 18.
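As a quick check, the analytic value is:

\int_{-3}^{3} x^{2}\,dx = \left[ \frac{x^{3}}{3} \right]_{-3}^{3} = 9 - (-9) = 18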


Chapter 8. Running parallel software, performance enhancement, and scalability testing

This chapter describes some techniques that are used to gain performance out of CPUs and graphics processing units (GPUs) in POWER9 processor-based systems.

This chapter provides the results of running some of the source code that was presented in Chapter 7, “Compilation, execution, and application development” on page 161. This code is run to demonstrate a few tips to you about how to enhance, test, and scale parallel programs in POWER9 processor-based systems.

The following topics are described in this chapter:
 Controlling the running of multithreaded applications
 Performance enhancements and scalability tests
 Using IBM Parallel Environment V2.3
 Using IBM Spectrum LSF
 Running tasks with IBM Spectrum MPI

8.1 Controlling the running of multithreaded applications

To gain performance for computing applications that run on a computing node, you typically use multiple threads, cores, or GPUs. The runtime environment has several options to support runtime fine-tuning of multithreaded programs.

This chapter describes how to control the Open Multi-Processing (OpenMP) workload by setting certain environment variables in IBM Xpertise Library (XL) C. Then, the chapter shows how to control OpenMP by integrating IBM Engineering and Scientific Subroutine Library (IBM ESSL).

This chapter also shows the tools that you can use to retrieve or set the affinity of a process at run time, and how to control the nonuniform memory access (NUMA) policy for processes and shared memory1.

8.1.1 Running OpenMP applications

A user can control the running of OpenMP applications by setting environment variables. This section describes only a subset of the environment variables that are especially important for technical computing workloads.

Note: The environment variables that are prefixed with OMP_ are defined by the OpenMP standard. Other environment variables that are described in this section are specific to a particular compiler.

Distribution of a workload
The OMP_SCHEDULE environment variable controls the distribution of a workload among threads. If the workload items have uniform computational complexity, the static distribution fits well in most cases. If an application does not specify the scheduling policy internally, a user can set it to static at run time by exporting the environment variable, as shown in the following example:
export OMP_SCHEDULE="static"

However, the application can control the scheduling policy from the source code, as shown in the following example: #pragma omp parallel for schedule(kind [,chunk size])

Table 8-1 shows the options for the OpenMP scheduler of a parallel section.

Table 8-1 Scheduling options for an OpenMP parallel section Type of scheduling Description

Static Divides the loop as evenly as possible into chunks of work to the threads. By default, the chunk size is calculated by the loop_count divided by number_of_threads.

Dynamic By using the internal work queue, this scheduling provides blocks of different chunk sizes to each thread at cost of extra processing. By default, the chunk size is equal to 1.

1 This section is based on the content that originally appeared in Chapter 7, “Tuning and debugging applications”, of Implementing an IBM High-Performance Computing Solution on IBM POWER8, SG24-8263.


Guided Works similar to dynamic scheduling, but it orders the chunks of work from the largest to the smallest, which can handle imbalances more properly. By default, the chunk size is calculated by loop_count divided by number_of_threads.

Auto The compiler decides between static, dynamic, and guided.

Runtime Use the OMP_SCHEDULE environment variable. The advantage of this scheduler is that rewriting or recompiling the source code is unnecessary.
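The following short example (a generic sketch that is not part of the original text) uses schedule(runtime) so that any of the policies in Table 8-1 can be selected through OMP_SCHEDULE without recompiling:

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N];

    /* The scheduling policy is taken from OMP_SCHEDULE at run time,
     * for example: export OMP_SCHEDULE="guided,1000" */
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;

    printf("a[N-1] = %f (computed with up to %d threads)\n",
           a[N - 1], omp_get_max_threads());
    return 0;
}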

Specifying the number of OpenMP threads
OpenMP applications are often written so that they can create an arbitrary number of threads. In this case, the user sets the OMP_NUM_THREADS environment variable to specify the number of threads to create. For example, to run the application with 20 threads, use this command:
export OMP_NUM_THREADS=20

Again, the application can control the scheduling policy from the source code, as shown in the following example: #pragma omp parallel ... num_threads(4)

Showing OpenMP data
The OpenMP 4.5 standard introduced the OMP_DISPLAY_ENV environment variable to display the OpenMP version and list the internal control variables (ICVs). The OpenMP run time prints data to the stderr output stream. If the user sets the value to TRUE, the OpenMP version number and initial values of ICVs are printed.

The VERBOSE value instructs the OpenMP run time to augment the output with the values of vendor-specific variables. If the value of this environment variable is set to FALSE or undefined, no information is printed. This variable is useful when you must be certain that the runtime environment is configured as expected at the moment that the program loads.

Placement of OpenMP threads
The POWER9 processor can handle multiple hardware threads simultaneously. With the increasing number of logical processors, the operating system (OS) kernel scheduler has more possibilities for automatic load balancing. For technical computing workloads, a fixed position of threads within the server is typically preferred.

The IDs of the logical processors are zero-based. Each logical processor has the same index, regardless of the simultaneous multithreading (SMT) mode. Logical processors are counted starting from the first core of the first socket, so the first logical processor of the second core has the index 4. The logical processors of the second socket are numbered only after all the logical processors of the first socket.
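To verify where threads actually land under a particular binding configuration, a small diagnostic such as the following sketch (which uses the glibc sched_getcpu() call; it is not part of the original text) prints the logical processor that runs each OpenMP thread:

#define _GNU_SOURCE
#include <sched.h>
#include <omp.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel
    {
        /* sched_getcpu() returns the logical processor that the calling
         * thread is currently running on. */
        printf("OpenMP thread %d is running on logical CPU %d\n",
               omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}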

Chapter 8. Running parallel software, performance enhancement, and scalability testing 229 IBM XL compilers For the OpenMP application that is compiled with IBM XL compilers, you must use the XLSMPOPTS environment variable to control thread placement. This environment variable includes many suboptions, and only a few of these options control thread binding. You can use the combination of startproc and stride suboptions, or the procs suboption. Consider the following points:  The startproc suboption is used to specify the starting logical processor number for binding the first thread of an application. The stride suboption specifies the increment for the subsequent threads. For example, the following value of XLSMPOPTS instructs the OpenMP runtime environment to bind OpenMP threads to logical processors 80, 84, 88, and so on, up to the last available processor: export XLSMPOPTS=startproc=80:stride=4  A user can also explicitly specify a list of logical processors to use for thread binding with the procs suboption. For example, to use only even-numbered logical processors of a processor’s second core, specify the following value of XLSMPOPTS: export XLSMPOPTS=procs=8,10,12,14

Note: The startproc, stride, and procs suboptions were deprecated in favor of the OMP_PLACES environment variable. IBM intends to remove these suboptions in upcoming releases of the IBM XL compiler runtime environment.

For more information about the XLSMPOPTS environment variable, see the XLSMPOPTS section of the online manuals at the following websites:
- XL C/C++ for Linux
- XL Fortran for Linux

GNU Compiler Collection compilers For the OpenMP application that is compiled with GNU Compiler Collection (GCC) compilers, use the GOMP_CPU_AFFINITY environment variable. Assign a list of the logical processors that you want to use to the GOMP_CPU_AFFINITY environment variable. The syntax and semantics are the same as with the procs suboption of the IBM XL compilers XLSMPOPTS environment variable. For more information about the GOMP_CPU_AFFINITY environment variable, see the corresponding section of the GCC manual.

Support for thread binding in recent versions of the OpenMP standard The OpenMP 3.1 revision introduced the OMP_PROC_BIND environment variable, and the OpenMP 4.0 revision introduced the OMP_PLACES environment variable. These variables control thread binding and affinity in a similar manner to XLSMPOPTS and GOMP_CPU_AFFINITY, although their syntax differs slightly.
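
As a hedged sketch of these standard variables, the following settings are a rough equivalent of the XLSMPOPTS=startproc=80:stride=4 example for a four-thread run; the explicit place list is an assumption and must match the numbering of the node:
$ export OMP_PLACES="{80},{84},{88},{92}"   # one single-processor place per thread
$ export OMP_PROC_BIND=close                # bind the threads to the listed places
$ OMP_NUM_THREADS=4 ./a.out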

Performance effects For more information about the effect of thread binding on performance, see 9.1, “Effects of basic performance tuning techniques” on page 260. Easy-to-use code to generate your own binding map is also available. For more information, see 9.3, “Sample code for the construction of thread affinity strings” on page 283.

8.1.2 Setting and retrieving process affinity at run time

By running the Linux taskset command, you can manipulate the affinity of any multithreaded program even if you do not have access to the source code. You can use the taskset command to start a new application with a certain affinity by specifying a mask or a list of logical processors. The Linux scheduler restricts the application threads to a certain set of logical processors only.

You can also use the taskset command when an application creates many threads and you want to set the affinity for highly loaded threads only. In this circumstance, identify the process identifiers (PIDs) of highly loaded running threads and perform binding only on those threads. You can discover these threads by examining the output of the top -H command.

Knowing the PID, you can also use the taskset command to retrieve the affinity of the corresponding entity (a thread or a process).
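
The following commands are a brief sketch of both use cases; the PID 12345 and the processor lists are placeholders:
$ taskset -c 0-3 ./a.out      # start a program that is restricted to logical processors 0 - 3
$ taskset -p 12345            # retrieve the affinity mask of the thread or process with PID 12345
$ taskset -cp 8,12,16 12345   # rebind PID 12345 to logical processors 8, 12, and 16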

8.1.3 Controlling the NUMA policy for processes and shared memory

By running the numactl command, you can specify the set of nodes and logical processors on which you want your application to run. In the current context, you can assume that this tool defines a node as a group of logical processors that are associated with a particular memory controller. For POWER9 processors, such a node is a whole processor or a GPU. Processors have CPUs and memory, whereas GPUs have only memory.

To discover the indexes of nodes and estimate the memory access penalty, run the numactl command with the -H argument. Example 8-1 shows the corresponding output for a 44-core IBM Power System AC922 (Model 8335-GTW) server with six GPUs.

Example 8-1 The numactl -H command output in a 44-core Power AC922 server
$ numactl -H
available: 8 nodes (0,8,250-255)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87
node 0 size: 261692 MB
node 0 free: 231086 MB
node 8 cpus: 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175
node 8 size: 261748 MB
node 8 free: 232644 MB
node 250 cpus:
node 250 size: 16128 MB
node 250 free: 16126 MB
node 251 cpus:
node 251 size: 16128 MB
node 251 free: 16126 MB
node 252 cpus:
node 252 size: 16128 MB
node 252 free: 16126 MB
node 253 cpus:
node 253 size: 16128 MB
node 253 free: 16125 MB
node 254 cpus:
node 254 size: 16128 MB
node 254 free: 16126 MB
node 255 cpus:
node 255 size: 16128 MB
node 255 free: 16126 MB
node distances:
node   0   8 250 251 252 253 254 255
  0:  10  40  80  80  80  80  80  80
  8:  40  10  80  80  80  80  80  80
250:  80  80  10  80  80  80  80  80
251:  80  80  80  10  80  80  80  80
252:  80  80  80  80  10  80  80  80
253:  80  80  80  80  80  10  80  80
254:  80  80  80  80  80  80  10  80
255:  80  80  80  80  80  80  80  10

You can pass the indexes of these nodes to numactl as an argument for the -N option to bind the process to specific nodes. To bind the process to specific logical processors, use the -C option. In the latter case, the indexes follow the same conventions as the OpenMP environment variables and a taskset command.

The memory placement policy significantly affects the performance of technical computing applications. You can enforce a certain policy by running the numactl command. The -l option instructs the OS to always allocate memory pages on the current node. Use the -m option to specify a list of nodes that the OS can use for memory allocation. Use the -i option to request a round-robin allocation policy on the specified nodes.
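
For example, the following sketches combine the binding and memory policy options; the node and processor numbers are placeholders:
$ numactl -N 0 -m 0 ./a.out    # run on the CPUs of node 0 and allocate memory only on node 0
$ numactl -C 0-21 -l ./a.out   # bind to logical processors 0 - 21 and allocate on the local node
$ numactl -i 0,8 ./a.out       # interleave memory allocations across nodes 0 and 8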

To exemplify the use of numactl and the effects of running on remote memory, create a simple application that allocates 10 GB memory, as shown in Example 8-2.

Example 8-2 A Python program that allocates 10 GB of memory, run against local versus GPU memory
$ cat > 10GB.py << 'EOF'
GB = 1024*1024*1024
a = "a" * (10 * GB)
EOF

$ time numactl --cpunodebind 0 --membind=0 python 10GB.py
real	0m0.776s
user	0m0.258s
sys	0m0.516s

$ time numactl --cpunodebind 0 --membind=255 python 10GB.py
real	0m2.950s
user	0m0.240s
sys	0m2.708s

When the program runs on CPU-local memory (distance 10), it completes in less than 1 second. When it runs against GPU memory (distance 80), it takes almost 3 seconds.

Note: Both the onboard DDR4 memory and the high-bandwidth (HBM2) GPU memory are available for user processes. CPU processes prefer the closer onboard memory, but if a process uses more memory than is available there, it automatically starts using the slower GPU memory, which can have negative effects on your application.

To avoid accidentally binding interactive processes to GPU memory, bind sshd to use memory only from the CPU nodes, as shown in Example 8-3.

Example 8-3 Systemd configuration for binding sshd to CPU memory
# mkdir /etc/systemd/system/sshd.service.d/
# cat > /etc/systemd/system/sshd.service.d/numactl.conf << 'EOF'
[Service]
ExecStart=
ExecStart=/usr/bin/numactl -m 0,8 /usr/sbin/sshd -D $OPTIONS
EOF
# systemctl daemon-reload
# systemctl restart sshd
# systemctl status sshd

8.2 Performance enhancements and scalability tests

This section provides details about performance enhancements and scalability tests.

8.2.1 IBM ESSL execution in multiple CPUs and GPUs

One of the many features of IBM ESSL is its ability to run the same source code automatically on multiple CPUs without any parallel programming by the developer. This feature is demonstrated in Example 8-4, which describes how to compile and run Example 7-6 on page 179 by using the symmetric multiprocessor (SMP) version of IBM ESSL and the XLC compiler, as described in 8.1.1, “Running OpenMP applications” on page 228.

Example 8-4 Compilation and running of dgemm_sample.c for SMP IBM ESSL
$ export XLSMPOPTS=parthds=44

$ xlc_r -O3 dgemm_sample.c -lesslsmp -lxlf90_r -lxlsmp -lxlfmath -L/opt/ibm/xlsmp/5.1.0/lib -L/opt/ibm/xlf/16.1.0/lib -R/opt/ibm/lib -o dgemm_smp

$ ./dgemm_smp
29.638 seconds, 539820.501 MFlops

The number of SMP threads is set with the XLSMPOPTS environment variable, so all 44 CPUs run at maximum throughput when this feature is used, as shown in Figure 8-1.

Figure 8-1 Screen capture of htop during the run of a dgemm_smp calculation of order [20,000x20,000]

This result is a gain of approximately 18x for the SMP execution when compared to the serial run.

Performance can be enhanced further by using the SMP Compute Unified Device Architecture (CUDA) version of IBM ESSL, which enables the use of four GPUs to run the code, as shown in Example 8-5 and in Figure 8-2 on page 235.

Example 8-5 Compilation and execution of dgemm_sample.c for SMP GPU IBM ESSL
$ export XLSMPOPTS=parthds=20

$ xlc_r -O3 dgemm_sample.c -lesslsmpcuda -lxlf90_r -lxlsmp -lxlfmath -lcublas -lcudart -L/usr/local/cuda/lib64 -R/usr/local/cuda/lib64 -L/opt/ibm/xlsmp/5.1.0/lib -L/opt/ibm/xlf/16.1.0/lib -R/opt/ibm/lib -o dgemm_cuda

$ ./dgemm_cuda
7.201 seconds, 2221802.527 MFlops

Figure 8-2 Screen capture of nvidia-smi during the run of dgemm_cuda

A 95x speedup gain can be observed when GPUs are used for this calculation. In addition, the CUDA run reaches over 3.6 teraflops (TFLOPS), as shown in Figure 8-3.

Figure 8-3 Running and performance of dgemm

Another strategy that was used on older GPUs was adjusting the Power Cap Limit, which can improve the performance of parallel software on GPU cards. However, Volta GPUs already run with the Power Cap at maximum capacity (compare Example 8-6 against Figure 8-3 on page 235). Therefore, this strategy is obsolete for performance improvement.

Example 8-6 Compilation and run of dgemm_sample.c for SMP IBM ESSL
$ export XLSMPOPTS=parthds=20
$ xlc_r -O3 dgemm_sample.c -lesslsmp -lxlf90_r -lxlsmp -lxlfmath -L/opt/ibm/xlsmp/5.1.0/lib -L/opt/ibm/xlf/16.1.0/lib -R/opt/ibm/lib -o dgemm_smp

$ ./dgemm_smp
38.130 seconds, 419596.119 MFlops

For more information about the new IBM ESSL features, see the Engineering and Scientific Subroutine Library page.

Running on different SMT modes Example 8-4 on page 233 shows how to request several CPUs from the OS to run a job. You can also use the XLSMPOPTS environment variable to control the distribution.

The sample system has 44 physical POWER9 cores and 176 logical CPUs. Every four logical CPUs belong to one physical core. By default, the system is set to SMT-4 mode, which means that all four CPUs per core can be used. The first run that is shown in Example 8-7 uses CPUs 0 - 43. In the second run, only even-numbered CPUs are used, which simulates SMT-2 mode. The third run simulates SMT-1 mode.

Example 8-7 Different CPU binding for running with 44 symmetric multiprocessor threads
$ export XLSMPOPTS=parthds=44:PROCS=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43
$ ./dgemm_smp
63.517 seconds, 251888.471 MFlops

$ export XLSMPOPTS=parthds=22:PROCS=0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42
$ ./dgemm_smp
70.810 seconds, 225945.488 MFlops

$ export XLSMPOPTS=parthds=44:PROCS=0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76,80,84,88,92,96,100,104,108,112,116,120,124,128,132,136,140,144,148,152,156,160,164,168,172
$ ./dgemm_smp
20.729 seconds, 771826.909 MFlops

Figure 8-4 shows the result of dgemm_smp with the first export configuration.

Figure 8-4 Result of dgemm_smp with the first export configuration

For the third run, which simulates SMT-1 mode, running one thread on each POWER9 core gives approximately a 50% performance improvement compared to the run without CPU binding.

IBM ESSL SMP CUDA library options The IBM ESSL SMP CUDA library has the following options that are controlled by the user:
- Control which GPUs IBM ESSL uses. By default, IBM ESSL uses all available devices, but you can change this behavior by using the CUDA_VISIBLE_DEVICES environment variable or the SETGPUS subroutine. GPUs are numbered in the OS starting with 0. For example, if you want to use only the second and third GPUs in your run, set the environment variable as shown in the following example:
  $ export CUDA_VISIBLE_DEVICES=1,2
  You can also place the call into the code, as shown in Example 8-8.

Example 8-8 Binding IBM ESSL code to GPUs
int ids[2] = {1, 2}; // GPU IDs
int ngpus = 2;       // Number of GPUs
...
setgpus(ngpus, ids);
/* your ESSL SMP CUDA call */

You can also specify a different order of devices, which can be useful in cases when you want to reduce latency between specific CPUs and GPUs.

- Disable or enable hybrid mode. By default, IBM ESSL runs in hybrid mode, so IBM ESSL routines use the POWER9 CPUs and the NVIDIA GPUs. To disable this mode and use only the GPUs, specify the following environment variable:
  export ESSL_CUDA_HYBRID=no
  To re-enable it, unset this variable or set it to yes.
- Pin host memory buffers. The following options are provided by IBM ESSL:
  – Do not pin host memory buffers (default).
  – Allow IBM ESSL to pin the buffers on its own. To use this option, set the following environment variable:
    export ESSL_CUDA_PIN=yes
  – Inform IBM ESSL that you pin the buffers yourself before the IBM ESSL routines are called:
    export ESSL_CUDA_PIN=pinned

Example 8-9 shows runs of source code from Example 7-5 on page 177 that are scaled up to a 10000x10000 matrix with different IBM ESSL SMP CUDA library options.

Example 8-9 The dgemm_sample.c runs with different symmetric multiprocessor CUDA options
#! /bin/sh -
export XLSMPOPTS=startproc=0:stride=4:parthds=44
xlc_r -O3 dgemm_sample.c -lesslsmpcuda -lxlf90_r -lxlsmp -lxlfmath -lcublas -lcudart -L/usr/local/cuda/lib64 -R/usr/local/cuda/lib64 -L/opt/ibm/xlsmp/5.1.0/lib -L/opt/ibm/xlf/16.1.0/lib -R/opt/ibm/lib -o dgemm_cuda

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5
export ESSL_CUDA_HYBRID=yes
VAR=`./dgemm_cuda`
echo "SMP CUDA run with 6 GPUs hybrid mode: $VAR"

export ESSL_CUDA_HYBRID=no
VAR=`./dgemm_cuda`
echo "SMP CUDA run with 6 GPUs non-hybrid mode: $VAR"

export CUDA_VISIBLE_DEVICES=0,1,2,3,4
export ESSL_CUDA_HYBRID=yes
VAR=`./dgemm_cuda`
echo "SMP CUDA run with 5 GPUs (1st, 2nd, 3rd, 4th, 5th) hybrid mode: $VAR"

export ESSL_CUDA_HYBRID=no
VAR=`./dgemm_cuda`
echo "SMP CUDA run with 5 GPUs (1st, 2nd, 3rd, 4th, 5th) non-hybrid mode: $VAR"

export CUDA_VISIBLE_DEVICES=0,1,3,4
export ESSL_CUDA_HYBRID=yes
VAR=`./dgemm_cuda`
echo "SMP CUDA run with 4 GPUs (1st, 2nd, 4th, 5th) hybrid mode: $VAR"

export ESSL_CUDA_HYBRID=no
VAR=`./dgemm_cuda`
echo "SMP CUDA run with 4 GPUs (1st, 2nd, 4th, 5th) non-hybrid mode: $VAR"

export CUDA_VISIBLE_DEVICES=0,1,3
export ESSL_CUDA_HYBRID=yes
VAR=`./dgemm_cuda`
echo "SMP CUDA run with 3 GPUs (1st, 2nd, 4th) hybrid mode: $VAR"

export ESSL_CUDA_HYBRID=no
VAR=`./dgemm_cuda`
echo "SMP CUDA run with 3 GPUs (1st, 2nd, 4th) non-hybrid mode: $VAR"

export ESSL_CUDA_HYBRID=yes
export CUDA_VISIBLE_DEVICES=0,3
VAR=`./dgemm_cuda`
echo "SMP CUDA run with 2 GPUs (1st, 4th) hybrid mode: $VAR"

export ESSL_CUDA_HYBRID=no
VAR=`./dgemm_cuda`
echo "SMP CUDA run with 2 GPUs (1st, 4th) non-hybrid mode: $VAR"

export ESSL_CUDA_HYBRID=yes
export CUDA_VISIBLE_DEVICES=0
VAR=`./dgemm_cuda`
echo "SMP CUDA run with 1 GPU (1st) hybrid mode: $VAR"

export ESSL_CUDA_HYBRID=no
VAR=`./dgemm_cuda`
echo "SMP CUDA run with 1 GPU (1st) non-hybrid mode: $VAR"

export ESSL_CUDA_HYBRID=yes
export CUDA_VISIBLE_DEVICES=3
VAR=`./dgemm_cuda`
echo "SMP CUDA run with 1 GPU (4th) hybrid mode: $VAR"

export ESSL_CUDA_HYBRID=no
VAR=`./dgemm_cuda`
echo "SMP CUDA run with 1 GPU (4th) non-hybrid mode: $VAR"

The results of the runs are shown in Example 8-10.

Example 8-10 Result of different IBM ESSL symmetric multiprocessor CUDA runs
SMP CUDA run with 6 GPUs hybrid mode: 89.127 seconds, 22439664.748 MFlops
SMP CUDA run with 6 GPUs non-hybrid mode: 89.527 seconds, 22339405.989 MFlops
SMP CUDA run with 5 GPUs (1st, 2nd, 3rd, 4th, 5th) hybrid mode: 101.032 seconds, 19795510.333 MFlops
SMP CUDA run with 5 GPUs (1st, 2nd, 3rd, 4th, 5th) non-hybrid mode: 101.505 seconds, 19703265.849 MFlops
SMP CUDA run with 4 GPUs (1st, 2nd, 4th, 5th) hybrid mode: 121.712 seconds, 16432069.147 MFlops
SMP CUDA run with 4 GPUs (1st, 2nd, 4th, 5th) non-hybrid mode: 124.152 seconds, 16109124.299 MFlops
SMP CUDA run with 3 GPUs (1st, 2nd, 4th) hybrid mode: 155.352 seconds, 12873860.652 MFlops
SMP CUDA run with 3 GPUs (1st, 2nd, 4th) non-hybrid mode: 157.673 seconds, 12684353.060 MFlops
SMP CUDA run with 2 GPUs (1st, 4th) hybrid mode: 229.629 seconds, 8709614.204 MFlops
SMP CUDA run with 2 GPUs (1st, 4th) non-hybrid mode: 229.331 seconds, 8720931.754 MFlops
SMP CUDA run with 1 GPU (1st) hybrid mode: 418.384 seconds, 4780249.723 MFlops
SMP CUDA run with 1 GPU (1st) non-hybrid mode: 423.952 seconds, 4717468.015 MFlops
SMP CUDA run with 1 GPU (4th) hybrid mode: 422.105 seconds, 4738110.186 MFlops
SMP CUDA run with 1 GPU (4th) non-hybrid mode: 426.181 seconds, 4692794.845 MFlops

In this set of examples, we see that not much extra performance is gained by using the hybrid mode (0.5 - 2%). It is probably better to use only GPUs in this example and save the CPUs for other tasks.

For more information about the IBM ESSL SMP CUDA library, see the Using the ESSL SMP CUDA Library page.

8.2.2 OpenACC execution and scalability

This section describes testing the performance of the OpenACC model on our system. Example 7-17 on page 202 is run three times and the results are averaged to create the speedup2 per block analysis that is shown in Figure 8-5.

Figure 8-5 Scalability curve from the pi integration in OpenACC in Example 7-17 on page 202

2 Speedup is the improvement in the run time of a fixed task that is achieved by increasing the number of computing elements that process that task. It is the ratio of the sequential run time of a task to the time that is taken to run that task in parallel with a number of processing elements.

As shown in Figure 8-5 on page 240, an instant gain with all block configurations is observed, which results in an approximately 30x speedup for the pi parallelization example by using 10,000,000,000,000 steps. This result likely indicates that the run time of the parallel part of this piece of code was reduced to its minimum, which created the speedup threshold that is stated in Amdahl's law3.

Considering that parallelization occurs only in a FOR loop where the measurement is performed, it is a fair conclusion that a threshold of maximum performance is achieved.

8.3 Using IBM Parallel Environment V2.3

This section describes the running of a parallel application through the IBM Parallel Environment Runtime.

Note: IBM Parallel Environment is being deprecated in favor of IBM Spectrum Message Passing Interface (IBM Spectrum MPI), which is a lighter and higher-performance implementation of the Message Passing Interface (MPI) standard.

For more information about some differences, porting, and other related topics, see 7.6.9, “MPI programs that use IBM Spectrum MPI” on page 220.

8.3.1 Running applications

The IBM Parallel Environment provides an environment to manage the execution of parallel applications. To start the Parallel Operating Environment (POE), run the poe command.

Running an application with POE spreads processes across the cluster nodes. Consider the following points:
- Parallel tasks are created on the compute nodes.
- There is one instance of the partition manager daemon (PMD) per compute node, which hosts the tasks of the running application.
- The POE process runs on the submitting node (home node) machine where it was started.

The PMD controls communication between the home poe process and the created tasks in a specific compute node. The PMD is used to pass the standard input, output, and error streams of the home POE to the tasks.

However, if the PMD process exits abnormally, such as with kill signals or by the bkill command of the IBM Spectrum Load Sharing Facility (IBM Spectrum LSF), the shared memory that is used for intranode message passing cannot get cleaned up properly. In this case, use the ipcrm command to reclaim that memory.

The environment works in two modes: interactive and batch. It also allows Single Program Multiple Data (SPMD) and Multiple Program, Multiple Data (MPMD) programs.

Many environment variables are in place to control the behavior of POE. Some of the variables are described next.

3 Ahmdahl’s Law is a known calculation in parallel computing of the theoretical speedup that can be achieved when computing a fixed task by using multiple resources. It states that the speedup is limited by the serial part of code. Therefore, the programmer seeks to grow the parallel part of the code as much as possible.

The number of tasks in the parallel application is specified by the MP_PROCS variable, which is equivalent to the -procs option of the poe command. The application is set with 10 tasks in the following example:
$ MP_PROCS=10 poe ./myApp

The poe command manages the running of applications that are implemented with different, and eventually mixed, models. To set the messaging application programming interface (API), set the MP_MSG_API variable (equivalent to the -msg_api option). The accepted values are MPI, PAMI, and shmem. It does not need to be set for MPI programs because MPI is the default. In the following command, poe is called to run a 10-task OpenSHMEM program:
$ MP_PROCS=10 MP_MSG_API=shmem poe ./myApp

Compute node allocation The partition manager can connect to an external resource manager that determines the allocation of the compute nodes. For example, the configuration that is described in 8.4, “Using IBM Spectrum LSF” on page 247 can be set up to integrate seamlessly with IBM Parallel Environment so that hosts are selected by the IBM Spectrum LSF bsub command for batch job submission.

By default, the allocation uses any resource manager that is configured. However, it can be disabled by setting MP_RESD (or the -resd option). Also, the native partition manager node allocation mechanism is enabled by MP_RESD=poe and requires a host list file.

Example 8-11 shows what a host list file looks like when a resource manager is not used with POE. MP_RESD=poe is exported to enable the internal host allocation mechanism.

Example 8-11 Running a parallel application without a resource manager through IBM Parallel Environment
$ cat host.list
! Host list - One host per line
! Tasks 0 and 2 run on p9r1n1 host
! Tasks 1 and 3 run on p9r2n2 host
p9r1n1
p9r2n2
p9r1n1
p9r2n2
$ MP_PROCS=4 MP_RESD=poe poe ./myApp

The format and content of the host list file change depending on whether a resource manager is used. For more information, see the IBM Parallel Environment documentation.

To point POE to the host file location, use the MP_HOSTFILE variable (same as the -hostfile option of poe). If it is not set, poe looks for a file that is named host.list in the local directory.

When a resource manager is in place, the system administrators often configure pools of hosts. Therefore, consider using the MP_RMPOOL variable (or -rmpool option) to select which of the pools of machines (if any) that were configured by the administrators to use.

Other variables are available to configure the resource manager behavior. The MP_NODES (-nodes option) environment variable sets the number of physical nodes, and the MP_TASKS_PER_NODE (-tasks_per_node option) variable sets the number of tasks per node.
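
As a hedged sketch (the pool number and the binary name are assumptions), the following command asks the resource manager for two nodes from pool 1 and runs four tasks on each of them:
$ MP_RMPOOL=1 MP_NODES=2 MP_TASKS_PER_NODE=4 poe ./myApp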

Considerations about network configuration The IBM Parallel Environment provides variables to control and configure the network adapters that are used by the parallel applications. Some variables might be implicitly set depending on the combination of other settings or by the resource manager.

To specify the network adapters to use for message passing, set the MP_EUIDEVICE variable (or -euidevice option). The variable accepts the values sn_all (one or more windows are on each network) or sn_single (all windows are on a single network). The sn_all value is frequently used to enable protocol striping, failover, and recovery.

Network adapters can be shared or dedicated. That behavior is defined by using the MP_ADAPTER_USE variable (or -adapter_use option). It accepts the shared and dedicated values.
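
A minimal sketch that combines these network settings for an eight-task run (the binary name is a placeholder) looks like the following commands:
$ export MP_EUIDEVICE=sn_all     # reserve windows on all networks for striping and failover
$ export MP_ADAPTER_USE=shared   # share the adapters with other jobs
$ MP_PROCS=8 poe ./myApp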

Considerations about Remote Direct Memory Access IBM Parallel Environment implements message passing by using Remote Direct Memory Access (RDMA) through the InfiniBand interconnect. In this mechanism, memory pages are automatically pinned, and buffer transfers are handled directly by the InfiniBand adapter without host CPU involvement.

RDMA for message passing is disabled by default. Set MP_USE_BULK_XFER=yes to enable the bulk data transfer mechanism. Also, set the MP_BULK_MIN_MSG_SIZE variable to set the minimum message length for bulk transfer.
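
For example, the following hedged settings enable bulk transfers for messages of 64 KB and larger; the threshold value is an assumption that must be tuned for the application:
$ export MP_USE_BULK_XFER=yes
$ export MP_BULK_MIN_MSG_SIZE=65536   # messages of at least 64 KB use the RDMA bulk transfer path
$ MP_PROCS=4 poe ./myApp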

Considerations about affinity Several levels of affinity are available for parallel applications through the poe command. These levels are also controlled by environment variables or poe options. Because resource managers usually employ their own affinity mechanisms, those variables can be overwritten or ignored.

The primary variable to control placement of MPI tasks is MP_TASK_AFFINITY (or the -task_affinity option) when a resource manager is not used. You can bind tasks at the physical processor (core value), logical CPU (cpu value), and multi-chip module (mcm value) levels.

For example, the following command allocates one core per task: $ poe -task_affinity core -procs 2 ./myApp

For more examples of MP_TASK_AFFINITY being used to control task affinity, see 7.6.5, “MPI programs with IBM Parallel Environment V2.3” on page 209.

The MP_CPU_BIND_LIST (or -cpu_bind_list option) and MP_BIND_MODE (or -bind_mode spread option) environment variables can be used with MP_TASK_AFFINITY to further control the placement of tasks. MP_CPU_BIND_LIST specifies a list of processor units for establishing task affinity. In the following command, affinity is restricted to only the second core of each socket: $ poe -task_affinity core -cpu_bind_list 0/16,8/1040 -procs 2

When a resource manager is used, the MP_PE_AFFINITY variable can be set to yes so that poe assumes control over affinity. However, if IBM Spectrum LSF is used and it sets the affinity, poe acknowledges the allocated CPUs. If MP_PE_AFFINITY=yes is enabled in IBM Spectrum LSF batch jobs, it enables the InfiniBand adapter affinity.

To help define affinity, IBM Parallel Environment Runtime Edition provides the cpuset_query command, which shows information about the current assignments of a running program. It also shows the topology of any specific compute node.

As shown in Example 8-12, cpuset_query -t shows the topology of the 44-core Power AC922 server, which is running in SMT-4 mode with two sockets of 22 cores each and four hardware threads per core.

Example 8-12 Running the cpuset_query command to show the node topology
$ cpuset_query -t
MCM(Socket): 251
MCM(Socket): 0
CORE: 12
  CPU: 13  CPU: 14  CPU: 12  CPU: 15
CORE: 40
  CPU: 41  CPU: 42  CPU: 40  CPU: 43
CORE: 28
  CPU: 31  CPU: 28  CPU: 30  CPU: 29
  CPU: 15
<... Output Omitted ...>
MCM(Socket): 8
CORE: 2116
  CPU: 149  CPU: 151  CPU: 148  CPU: 150
CORE: 2076
  CPU: 110  CPU: 109  CPU: 111  CPU: 108
CORE: 2104
  CPU: 139  CPU: 137  CPU: 138  CPU: 136
<... Output Omitted ...>

You can check the affinity by running the cpuset_query command through poe, as shown in Example 8-13. The command shows a two-task program that is started with affinity at the core level. Each task is allocated a full core (see the CPUs with value 1), which has four hardware threads on an SMT-4 mode node.

Example 8-13 Running the cpuset_query command to show task affinity
MCM/QUAD(251) contains:
[Total cpus for MCM/QUAD(251)=0]
MCM/QUAD(0) contains:
cpu13, cpu41, cpu31, cpu9, cpu21, cpu78, cpu11, cpu68, cpu7, cpu58, cpu86, cpu48, cpu76, cpu38, cpu66,
cpu28, cpu5, cpu56, cpu84, cpu18, cpu46, cpu74, cpu36, cpu64, cpu26, cpu3, cpu54, cpu82, cpu16, cpu44,
cpu72, cpu34, cpu62, cpu24, cpu1, cpu52, cpu80, cpu14, cpu42, cpu70, cpu32, cpu60, cpu22, cpu50, cpu79,
cpu12, cpu40, cpu69, cpu30, cpu8, cpu59, cpu87, cpu20, cpu49, cpu77, cpu10, cpu39, cpu67, cpu29, cpu6,
cpu57, cpu85, cpu19, cpu47, cpu75, cpu37, cpu65, cpu27, cpu4, cpu55, cpu83, cpu17, cpu45, cpu73, cpu35,
cpu63, cpu25, cpu2, cpu53, cpu81, cpu15, cpu43, cpu71, cpu33, cpu61, cpu23, cpu0, cpu51,
[Total cpus for MCM/QUAD(0)=88]
MCM/QUAD(254) contains:
[Total cpus for MCM/QUAD(254)=0]
MCM/QUAD(252) contains:
[Total cpus for MCM/QUAD(252)=0]
MCM/QUAD(250) contains:
[Total cpus for MCM/QUAD(250)=0]
MCM/QUAD(255) contains:
[Total cpus for MCM/QUAD(255)=0]
MCM/QUAD(253) contains:
[Total cpus for MCM/QUAD(253)=0]
MCM/QUAD(8) contains:
cpu149, cpu110, cpu139, cpu98, cpu167, cpu100, cpu129, cpu88, cpu157, cpu119, cpu147, cpu175, cpu109, cpu137, cpu96,
cpu165, cpu127, cpu155, cpu117, cpu145, cpu173, cpu107, cpu135, cpu94, cpu163, cpu125, cpu153, cpu115, cpu143, cpu171,
cpu105, cpu133, cpu92, cpu161, cpu123, cpu151, cpu113, cpu141, cpu103, cpu131, cpu90, cpu121, cpu111, cpu99, cpu168,
cpu101, cpu89, cpu158, cpu148, cpu138, cpu97, cpu166, cpu128, cpu156, cpu118, cpu146, cpu174, cpu108, cpu136, cpu95,
cpu164, cpu126, cpu154, cpu116, cpu144, cpu172, cpu106, cpu134, cpu93, cpu162, cpu124, cpu152, cpu114, cpu142, cpu170,
cpu104, cpu132, cpu91, cpu160, cpu122, cpu150, cpu112, cpu140, cpu169, cpu102, cpu130, cpu159, cpu120,
[Total cpus for MCM/QUAD(8)=88]
Total number of MCMs/QUADs found = 8
Total number of COREs found = 44
Total number of CPUs found = 176
cpuset for process 130902 (1 = in the set, 0 = not included)
cpu0 = 1
cpu1 = 1
cpu2 = 1

cpu3 = 1
<... Output Omitted ...>

Considerations about the CUDA-aware Message Passing Interface IBM Parallel Environment Runtime Edition implements a CUDA-aware MPI mechanism, but it is disabled by default. To enable it, set the MP_CUDA_AWARE variable to yes. For more information, see 7.6.6, “Hybrid MPI and CUDA programs with IBM Parallel Environment” on page 215.

8.3.2 Managing applications

IBM Parallel Environment Runtime Edition provides a set of commands to manage stand-alone poe jobs. This section introduces the most common commands.

Canceling an application Because the poe command handles the signals of all tasks in the partition, sending an interrupt (SIGINT) or terminate (SIGTERM) signal to poe triggers the signal in all remote processes. If poe runs with process ID 100413, you can end the program by running the following command:
$ kill -s SIGINT 100413

However, if some remote processes are orphans, run poekill program_name to end all remaining tasks. In fact, poekill can send any signal to all the remote processes.

Suspending and resuming a job As with canceling a poe process, you can suspend and resume applications by using signals. Run the poekill or kill command to send a SIGTSTP signal to suspend the poe process (which propagates the signal to all tasks).

The application can be resumed by sending a SIGCONT to continue poe. Run a poekill, kill, fg, or bg command to deliver the signal.

8.3.3 Running OpenSHMEM programs

You can set the following environment variables to run an OpenSHMEM program with IBM Parallel Environment Runtime Edition by running a poe command:
- MP_MSG_API=shmem: Instructs poe to use the OpenSHMEM message passing API.
- MP_USE_BULK_XFER=yes: Enables the use of RDMA in the Parallel Active Messaging Interface (PAMI).

You can use MP_PROCS to set the number of processing elements.
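
Putting these variables together, a four-element OpenSHMEM run can be started as shown in the following sketch (the binary name is a placeholder):
$ export MP_MSG_API=shmem
$ export MP_USE_BULK_XFER=yes
$ MP_PROCS=4 poe ./shmem_app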

NAS Parallel Benchmarks with OpenSHMEM The NAS Parallel Benchmarks4 (NPBs) are a well-known suite of parallel applications that are often used to evaluate high-performance computers. The benchmarks provide programs, kernels, and problem solvers that simulate aspects such as computation, data movement, and I/O of real scientific applications. You can select the size of the workload that each benchmark processes from a list of classes (A, B, C, and so on).

4 For more information about NPBs, see http://www.nas.nasa.gov/publications/npb.html.

A version of NPB that was rewritten by using OpenSHMEM in C and Fortran was released by the OpenSHMEM group. NPB3.2-SHMEM is the implementation of NPB 3.2. It provides some benchmarks in Fortran and only one in C, as shown in Example 8-14.

Example 8-14 The openshmem-npbs implementation
$ git clone https://github.com/openshmem-org/openshmem-npbs
$ cd openshmem-npbs/C
$ cd config
$ cp suite.def.template suite.def
$ cp make.def.template make.def
# Set CC in make.def
$ cd ../
$ make is NPROCS=2 CLASS=A
make suite
$ cd bin
$ ls
host.list is.A.2
[developer@redbook01 bin]$ MP_RESD=poe oshrun -np 2 ./is.A.2

This section uses the Integer Sort kernel implementation of NPB3.2-SHMEM to demonstrate the use of OpenSHMEM with IBM Parallel Environment, and the effect of some configurations on the performance of the application.

The Integer Sort benchmark was compiled by using the oshcc compiler script of the IBM Parallel Environment and the IBM XL C compiler.

The workload class C of the Integer Sort benchmark was run as shown in Example 8-14.

8.4 Using IBM Spectrum LSF

IBM Spectrum LSF is a load and resources manager that enables shared access to cluster hosts while maximizing occupancy and the efficient use of resources.

IBM Spectrum LSF provides a queue-based and policy-driven scheduling system for a user’s batch jobs that employs mechanisms to optimize resource selection and allocation based on the requirements of the application.

All development models that are described in Chapter 7, “Compilation, execution, and application development” on page 161 are fully supported by IBM Spectrum LSF. The preferred way to run applications in a production cluster is by using the IBM Spectrum LSF job submission mechanism. Also, you can manage any job. This section describes how to submit and manage jobs by using IBM Spectrum LSF commands.

8.4.1 Submit jobs

This section describes the job submission process. To submit a job to IBM Spectrum LSF, run the bsub command. By using IBM Spectrum LSF, you can submit the job by using command-line options, interactive command-line mode, or a control file. The tool provides the following options for fine-grained job management:
- Control input and output parameters.
- Define limits.
- Specify submission properties.
- Notify users.
- Control scheduling and dispatch.
- Specify resources and requirements.

The simplest way to submit a job is to use the command-line options, as shown in Example 8-15.

Example 8-15 IBM Spectrum LSF bsub command to submit a job by using command-line options
$ bsub -o %J.out -e %J.err -J 'omp_affinity' -q short './command'
Job <212> is submitted to queue <short>.

Example 8-15 shows some basic options of the bsub command. The flags specify the standard output (-o), input (-i), and error (-e) files. As shown in Example 8-15, -J sets the job name, but it can also be used to submit multiple jobs (also known as an array of jobs), and -q sets the queue to which the job is submitted. If the -q flag is not specified, the default queue is used (usually the normal queue). The last option that is shown in Example 8-15 is the application to be run.

Using a shell script is convenient when you submit jobs regularly or when jobs require many parameters. The file content that is shown in Example 8-16 is a regular shell script that contains special comments (lines that start with #BSUB) to control the behavior of the bsub command, and it runs the noop application by using the /bin/sh interpreter.

Example 8-16 Shell script to run IBM Spectrum LSF batch job
#!/bin/sh

#BSUB -o %J.out -e %J.err
#BSUB -J serial

./noop

You can run the bsub myscript or bsub < myscript commands to submit a script to IBM Spectrum LSF. In the first case, myscript is not spooled, which means that changes to the script take effect if the job is still running. Conversely, bsub < myscript (see Example 8-17) spools the script.

Example 8-17 IBM Spectrum LSF bsub command to submit a script job
$ bsub < noop_lsf.sh
Job <220> is submitted to default queue <normal>.

Considerations for OpenMP programs You can use environment variables to control the running of OpenMP applications as described in 8.1.1, “Running OpenMP applications” on page 228. By default, the bsub command propagates all variables on the submitting host to the environment on the target machine.

Example 8-18 shows some OpenMP control variables (OMP_DISPLAY_ENV, OMP_NUM_THREADS, and OMP_SCHEDULE), which are exported to the environment before a job is scheduled to run on the p9r2n2 host. The content of the 230.err file, where errors are logged, shows that those variables are propagated on the remote host.

Example 8-18 Exporting OpenMP variables to IBM Spectrum LSF
$ export OMP_DISPLAY_ENV=true
$ export OMP_NUM_THREADS=20
$ export OMP_SCHEDULE="static"
$ bsub -m "p9r2n2" -o %J.out -e %J.err ./affinity
Job <230> is submitted to default queue .
$ cat 230.err

OPENMP DISPLAY ENVIRONMENT BEGIN
  _OPENMP = '201307'
  OMP_DYNAMIC = 'FALSE'
  OMP_NESTED = 'FALSE'
  OMP_NUM_THREADS = '20'
  OMP_SCHEDULE = 'STATIC'
  OMP_PROC_BIND = 'FALSE'
  OMP_PLACES = ''
  OMP_STACKSIZE = '70368222510890'
  OMP_WAIT_POLICY = 'PASSIVE'
  OMP_THREAD_LIMIT = '4294967295'
  OMP_MAX_ACTIVE_LEVELS = '2147483647'
  OMP_CANCELLATION = 'FALSE'
  OMP_DEFAULT_DEVICE = '0'
OPENMP DISPLAY ENVIRONMENT END

If you do not want to export OpenMP environment variables, you can use the -env option to control how bsub propagates them. For example, the same results that are shown in Example 8-18 can be achieved by running the following command, but without exporting any variables: $ bsub -m "p9r2n2" -o %J.out -e %J.err -env "all, OMP_DISPLAY_ENV=true, OMP_NUM_THREADS=20, OMP_SCHEDULE='static'" ./affinity

Example 8-19 shows how the environment variables can be set in a job script to control the OpenMP behavior.

Example 8-19 Exporting OpenMP variables to the IBM Spectrum LSF job script
#!/bin/bash

#BSUB -J "openMP example"
#BSUB -o job_%J.out -e job_%J.err
#BSUB -q short
#BSUB -m p9r2n2
export OMP_NUM_THREADS=20
export OMP_SCHEDULE=static
export OMP_DISPLAY_ENV=true

./affinity

Considerations for Message Passing Interface programs Use the -n option of the bsub command to allocate the number of tasks (or job slots) for the parallel application. Depending on the configuration of IBM Spectrum LSF, job slots can be expressed in terms of CPUs in the cluster. For example, the following command submits an MPI job with six tasks:
$ bsub -n 6 -o %J.out -e %J.err poe ./helloMPI

You can select a set of hosts for a parallel job by using the following bsub command options:
- Use the -m option to select hosts or groups of hosts.
- Use the -R option to perform resource-based selection with requirements expressions.
- Use the -hostfile option to indicate a host file. Do not use this option with the -m or -R options.

The following examples show the use of -m and -R to allocate hosts:
- Run two tasks of an MPI application on hosts p9r1n1 and p9r2n2. The "!" symbol indicates that poe is first run on p9r1n1:
  $ bsub -n 2 -m "p9r1n1! p9r2n2" -o %J.out -e %J.err poe ./myAPP
- Run four tasks of an MPI application on hosts with 20 CPU cores (select[ncores==20]) if they have the same CPU factor (same[cpuf]):
  $ bsub -n 4 -R "select[ncores==20] same[cpuf]" -o %J.out -e %J.err poe ./myApp

The job locality can be specified with the span string in a resources requirement expression (-R option). One of the following formats can be used to specify the locality:
- span[hosts=1]: Set to run all tasks on the same host.
- span[ptile=n]: Where n is an integer that sets the number of tasks per host.
- span[block=size]: Where size is an integer that sets the block size.

Enable job task affinity by using the affinity string in the resources requirement expression (-R option). The affinity applies to CPU or memory, and is defined in terms of processor units that are assigned per task (core, numa, socket, and task).

The following example features a 10-task MPI application where five tasks are allocated per host and each has four designated cores with binding by threads: $ bsub -n 10 -R "select[ncores >= 20] span[ptile=5] affinity[core(4):cpubind=thread]" -o %J.out -e %J.err

Further processor unit specification makes the affinity expression powerful. For more information about affinity expressions in IBM Spectrum LSF, see the Affinity string page.

IBM Spectrum LSF integrates well with IBM Parallel Environment Runtime Edition and supports the running of parallel applications through poe. Some configurations in IBM Spectrum LSF are required. For more information, see 5.6.14, “IBM Spectrum LSF integration with Cluster Systems Management” on page 122.

Example 8-20 includes an IBM Spectrum LSF job script to run an MPI program with poe. (The IBM Parallel Environment environment variables take effect.)

Example 8-20 IBM Spectrum LSF job script to submit a simple IBM Parallel Environment application #!/bin/bash

#BSUB -J "MPILocalRank" # Set job name #BSUB -o lsf_job-%J.out # Set output file

250 IBM High-Performance Computing Insights with IBM Power System AC922 Clustered Solution #BSUB -e lsf_job-%J.err # Set error file #BSUB -q short # Set queue #BSUB -n 5 # Set number of tasks export MP_INFOLEVEL=1 export LANG=en_US export MP_RESD=POE poe ./a.out

IBM Spectrum LSF provides native network-aware scheduling of IBM Parallel Environment parallel applications through the -network option of the bsub command. That option accepts the attributes that are listed in Table 8-2.

Table 8-2 IBM Spectrum LSF bsub network attributes for IBM Parallel Environment

Attribute   Description                                        Values
type        Manage network windows reservation.                sn_single (Reserves windows from one network for each task.)
                                                               sn_all (Reserves windows from all networks for each task.)
protocol    Set the messaging API in use.                      mpi, shmem, pami
mode        The network type.                                  US, IP
usage       The network adapter usage among processes.         shared, dedicated
instance    Number of instances for the reservation window.    Positive integer number

The following command submits a two-task (-n 2) MPI (protocol=mpi) job. It shares the network adapters (usage=shared) and reserves windows on all of them.
$ bsub -n 2 -R "span[ptile=1]" -network "protocol=mpi:type=sn_all:instances=2:usage=shared" poe ./myApp

Considerations for CUDA programs You can use graphics processing unit (GPU) resource mapping when requirement expressions are used to allocate hosts and express usage reservation. If IBM Spectrum LSF is configured accordingly (see 5.6.14, “IBM Spectrum LSF integration with Cluster Systems Management” on page 122), the following resource fields are available:
- ngpus: Total number of GPUs
- ngpus_shared: Number of GPUs in share mode
- ngpus_excl_t: Number of GPUs in exclusive thread mode
- ngpus_excl_p: Number of GPUs in exclusive process mode

The ngpus resource field can be used in requirement expressions (-R option) to specify the number of GPUs that are needed to run a CUDA application. The following command submits a job to any host with one or more GPUs:
$ bsub -R "select [ngpus > 0]" ./cudaApp

In terms of usage, GPUs can be reserved by setting the ngpus_shared (number of GPUs in shared mode), ngpus_excl_t (number of GPUs in exclusive thread mode), or ngpus_excl_p (number of GPUs in exclusive process mode) resources. Use the rusage section of the -R option of the bsub command to reserve GPU resources.

In Example 8-21, the job script sets one GPU in shared mode to be used by a CUDA application.

Example 8-21 Script to set one GPU in shared mode to be used by a CUDA application #!/bin/bash

#BSUB -J "HelloCUDA" #BSUB -o helloCUDA_%J.out -e helloCUDA_%J.err #BSUB -R "select [ngpus > 0] rusage [ngpus_shared=1]"

./helloCUDA

Regarding exclusive use of GPUs by parallel applications, set the value of ngpus_excl_t or ngpus_excl_p to change the run mode properly. The following example runs a parallel two-task (-n 2) application on one host (span[hosts=1]), where each host reserves two GPUs in exclusive process mode (rusage[ngpus_excl_p=2]):
$ bsub -n 2 -R "select[ngpus > 0] rusage[ngpus_excl_p=2] span[hosts=1]" poe ./mpi-GPU_app

Considerations for OpenSHMEM You can use the -network option of the bsub command to set the parallel communication protocol of the application. For an OpenSHMEM application, you can set it by running the following command:
$ bsub -n 10 -network "protocol=shmem" poe ./shmemApp

8.4.2 Managing jobs

IBM Spectrum LSF provides a set of commands to manage batch jobs. This section introduces the most common commands.

Modifying a job The batch job can assume several statuses throughout its lifecycle. The following statuses are the most common ones:
- PEND: Waiting to be scheduled
- RUN: Running
- PSUSP, USUSP, or SSUSP: Suspended
- DONE or EXIT: Terminated

You can modify the submission parameters of a job during the pending or running statuses. To change them, use the bmod command.

You can change most submission parameters by running the bmod command. For each parameter, the command can include an option to cancel it, reset it to its default value, or override it.

To override a submission parameter, use the same option as in bsub. In the following example, -o "%J_noop.out" changes the output file of the job with identifier 209:
$ bmod -o "%J_noop.out" 209
Parameters of job <209> are being changed

To cancel a submission parameter, append n to the option. For example, -Mn removes the memory limits.

For more information about the bmod command, see the bmod page.

Canceling a job To cancel one or more jobs, use the bkill command. By default, this command sends the SIGINT, SIGTERM, and SIGKILL signals in sequence on Linux. The time interval between the signals can be configured in the lsb.params configuration file. The bkill -s command sends a specific signal to the job.

The user can cancel only their own jobs. The root and IBM Spectrum LSF administrators can end any job.

In the following example, the job with identifier 217 is ended: $ bkill 217 Job <217> is being terminated

Suspending and resuming a job To suspend one or more unfinished jobs, run the bstop command. In Linux, the command sends the SIGSTOP signal to serial jobs and the SIGTSTP signal to parallel jobs. As an alternative, the bkill -s SIGSTOP or the bkill -s SIGTSTP commands send the stop signal to a job.

In Example 8-22, the job with identifier 220 did not finish (status RUN) when it is stopped (bstop 220). As a result, bjobs 220 shows that it is now in a suspended status (USUSP).

Example 8-22 IBM Spectrum LSF bstop command to suspend a job
$ bjobs 220
JOBID USER    STAT  QUEUE  FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
220   wainers RUN   normal p8r1n1    p9r2n2    serial   Apr 17 16:06
$ bstop 220
Job <220> is being stopped
$ bjobs 220
JOBID USER    STAT  QUEUE  FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
220   wainers USUSP normal p9r1n1    p9r2n2    serial   Apr 17 16:06

To resume a job, use the bresume command. In Linux, the command sends a SIGCONT signal to the suspended job. Alternatively, run bkill -s CONT to send the continue signal. The following command shows how to resume the job that was stopped in Example 8-22:
$ bresume 220
Job <220> is being resumed

8.5 IBM Spectrum LSF Job Step Manager

IBM Spectrum LSF Job Step Manager (JSM) is a lightweight job step scheduler that can help you manage the available compute resources that are provided through an IBM Spectrum LSF allocation. IBM Spectrum LSF is the high-level resource manager, while JSM manages the user's IBM Spectrum LSF allocated resources.

JSM separates the concepts of resource allocation and tasks. This separation might be a new concept for users who are familiar with launchers that often have complicated methods of determining resources that are based, at least in part, on the number of tasks to be created. Under JSM, each launch is run on a number of resource sets. The resource sets are defined as follows:
- A resource set contains 1 or more tasks.
- A resource set contains 1 or more cores.
- A resource set contains 0 or more GPUs.
- A resource set contains 1 or more bytes of memory.
- All resources in a resource set must be on the same physical node.
- All resource sets that are used by a job step must contain an identical number of each resource. For example, one resource set cannot contain three CPUs while another resource set contains only one.

The JSM utilities are:
- jsrun: Create a job step or reservation.
- jslist: List running, completed, or killed job steps.
- jskill: Signal a running job step.
- jswait: Wait for completion of a job step.

Our Power AC922 nodes are composed of two sockets of 22 CPU cores, six GPUs, and 512 GB of main memory. We submit an interactive IBM Spectrum LSF job asking for two nodes, as shown in Example 8-23.

Example 8-23 LSF job asking for two compute nodes
$ bsub -q excl_int -nnodes 2 -Is /bin/bash
Job <200743> is submitted to queue <excl_int>.
<>
<>

The job is launched with two compute nodes assigned. Some example job steps are shown in Table 8-3.

Table 8-3 Sample job steps

Description: One task per core
Command:
  jsrun ./a.out
  jsrun --cpu_per_rs 1 --nrs 88 --np 88 ./a.out
Layout: One task per core from all assigned nodes.

Description: One task, one GPU, and four cores
Command:
  jsrun --cpu_per_rs 4 --gpu_per_rs 1 --nrs 1 --np 1 ./a.out
Layout: One task, four CPUs, and one GPU on one node.

Description: One task per core, three GPUs per task, and one task per node
Command:
  jsrun --cpu_per_rs 1 --gpu_per_rs 3 --rs_per_host 1 hostname ./a.out
Layout: Two tasks, one per node, one CPU per task, and three GPUs per task.

Description: Sixty-four MPI tasks and no GPUs
Command:
  jsrun --nrs 64 ./a.out
  jsrun --nrs 64 --rs_per_host 32 ./a.out
  jsrun --nrs 4 --tasks_per_rs 16 --cpu_per_rs 16 --rs_per_host 2 ./a.out
Layout:
  - Two nodes, 44 tasks on the first node, and 22 tasks on the second node. The first node has all its cores occupied; the second node has all its cores on the first socket occupied.
  - Two nodes, 32 tasks on the first node, and 32 tasks on the second node. Both nodes have all the cores on the first socket occupied, and 10 cores on the second socket occupied.
  - Two nodes, 1 resource set on each socket, and 16 tasks on each socket. Fully balanced between nodes and sockets. Six free cores on each socket.

Example 8-24 shows the batch submission for the job.

Example 8-24 Full batch submission by using Example 8-23 on page 254
$ cat > jobscript.bsub << 'EOF'
#! /bin/bash -
### Begin BSUB Options
#BSUB -J job-name
#BSUB -W 10:00
#BSUB -nnodes 2
#BSUB -alloc_flags "smt4"
#BSUB -o "output.%J"
### End BSUB Options and begin shell commands
export OMP_NUM_THREADS=4
jsrun --nrs 4 --tasks_per_rs 16 --cpu_per_rs 16 --rs_per_host 2 ./a.out
EOF

$ bsub < jobscript.bsub
Job <202078> is submitted to default queue .

A single IBM Spectrum LSF job can contain multiple job steps. For example, if you have 10 GPU jobs that are each using three GPUs, submit them as a single IBM Spectrum LSF job.

As a more complicated set of job steps, Example 8-25 shows how to run a batch of job steps in one job.

Example 8-25 Full batch submission by using Example 8-24
$ cat > jobscript.bsub << 'EOF'
#! /bin/bash -
### Begin BSUB Options
#BSUB -J job-name
#BSUB -W 10:00
#BSUB -nnodes 2
#BSUB -alloc_flags "smt4"
#BSUB -o "output.%J"
### End BSUB Options and begin shell commands

jsrun --cpu_per_rs 1 --gpu_per_rs 3 --rs_per_host 1 ./a.out-1 &
jsrun --cpu_per_rs 1 --gpu_per_rs 3 --rs_per_host 1 ./a.out-2 &
jsrun --cpu_per_rs 1 --gpu_per_rs 3 --rs_per_host 1 ./a.out-3 &
jsrun --cpu_per_rs 1 --gpu_per_rs 3 --rs_per_host 1 ./a.out-4 &
jsrun --cpu_per_rs 1 --gpu_per_rs 3 --rs_per_host 1 ./a.out-5 &
jsrun --cpu_per_rs 1 --gpu_per_rs 3 --rs_per_host 1 ./a.out-6 &
jsrun --cpu_per_rs 1 --gpu_per_rs 3 --rs_per_host 1 ./a.out-7 &
jsrun --cpu_per_rs 1 --gpu_per_rs 3 --rs_per_host 1 ./a.out-8 &
jsrun --cpu_per_rs 1 --gpu_per_rs 3 --rs_per_host 1 ./a.out-9 &

# List all submitted job steps
jslist

# Wait for all job steps to complete before exiting:
jswait all
EOF

$ bsub < jobscript.bsub
Job <202094> is submitted to default queue .

8.6 Running tasks with IBM Spectrum MPI

Section 7.6.11, “Using IBM Spectrum MPI” on page 222 describes using IBM Spectrum MPI to run jobs in a compute node by running the mpirun command. This section shows several other features of running IBM Spectrum MPI jobs by running mpirun.

Portable Hardware Locality (hwloc) A useful API to discover the physical and logical architecture of your POWER processor-based server is hwloc (Portable Hardware Locality). This API helps identify and display information about NUMA memory nodes, sockets, shared caches, cores, SMT, system attributes, and the locality of I/O.

For example, by using hwloc, you can pass the --report-bindings flag to the mpirun command with any IBM Spectrum MPI program to obtain a hardware binding description, as shown in Example 8-26.

Example 8-26 Using the --report-bindings flag with the mpirun command
[desnesn@c712f7n03 examples]$ /opt/ibm/spectrum_mpi/bin/mpirun -np 1 --report-bindings -pami_noib ./trap.mpi
[c712f7n03:66785] MCW rank 0 bound to socket 0[core 0[hwt 0-7]], socket 0[core 1[hwt 0-7]], socket 0[core 2[hwt 0-7]], socket 0[core 3[hwt 0-7]], socket 0[core 4[hwt 0-7]], socket 0[core 5[hwt 0-7]], socket 0[core 6[hwt 0-7]], socket 0[core 7[hwt 0-7]], socket 0[core 8[hwt 0-7]], socket 0[core 9[hwt 0-7]]: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]

Estimate of local_sum of x^2 = 18.000229 Interval [-3.000,3.000] Using 10000 trapezoids

Look at the last line that is presented:
[BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]

This line shows that two sockets are available, each with 10 cores, and each core with eight hardware threads. This configuration is consistent with the output of the nproc command, which reports a value of 160. Also, the set of B letters means that the MPI process is bound to the first socket.

GPU support Another valuable feature of IBM Spectrum MPI is that it supports running GPU-accelerated applications over CUDA-aware MPI by using the CUDA Toolkit V9.2.

Note: By default, GPU support is turned off. To turn it on, use the -gpu flag with the mpirun command.
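
For example, the following hedged command line enables GPU support for a two-rank CUDA-aware run; the binary name is a placeholder:
$ /opt/ibm/spectrum_mpi/bin/mpirun -gpu -np 2 ./cuda_aware_app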

Sharing work with different hosts in your network To distribute instances of an IBM Spectrum MPI program by running mpirun, run the following command: $ mpirun -host h1,h2 prog1

This command runs one instance of prog1 on host h1 and host h2.

However, you can also use mpirun to run jobs in an SPMD manner. For example, if you want host h2 to receive three instances of prog1, run mpirun as follows: $ mpirun -host h1,h2,h2,h2 prog1

You also can perform SPMD by using a host file, as shown in Example 8-27.

Example 8-27 Performing SPMD by using a host file
$ cat hosts
c712f7n02.somedomain.ibm.com
c712f7n03.somedomain.ibm.com
$ mpirun -np 2 --hostfile hosts prog2

Also, you can use mpirun to run jobs in an MPMD manner by specifying several programs that are separated by a colon, as shown in the following command:

$ mpirun -np 2 prog3 : -np 3 prog4

This command includes two instances of prog3 and three instances of prog4.

For more information about how to use mpirun, use the -h (help) option with the command, or see the IBM Spectrum MPI Version 10 Release 1.0.2 User’s Guide.


Chapter 9. Measuring and tuning applications

This chapter describes how to measure and tune applications.

The following topics are described in this chapter:
- Effects of basic performance tuning techniques
- General methodology of performance benchmarking
- Sample code for the construction of thread affinity strings
- IBM Engineering and Scientific Subroutine Library performance results
- GPU tuning
- Application development and tuning tools

9.1 Effects of basic performance tuning techniques

This section evaluates the effects of basic performance tuning techniques. It uses the NAS Parallel Benchmarks (NPBs)1 suite of applications.

The NPB suite was originally used for complex performance evaluation of supercomputers. The developers of the NPB programs distilled the typical computational physics workloads and put the most widespread numerical kernels into their product.

This section uses the Open Multi-Processing (OpenMP) types of NPB benchmarks to achieve the following goals:
- Show the performance variation for the different simultaneous multithreading (SMT) modes.
- Provide guidance for the choice of compilation parameters.
- Demonstrate the importance of binding threads to logical processors.

The benchmarking used a 44-core IBM Power System AC922 (model 8335-GTW) server based on the POWER9 processor. The processor base frequency was 2.3 GHz. The Linux scaling governors were set to performance so that the effective frequency increased to 3.8 GHz.

Each of 16 memory slots was populated with 16 GB DDR4 modules for a total of 512 GB. The server was running Red Hat Enterprise Linux for IBM Power Little Endian V7.5. The operating system (OS) was installed in a non-virtualized mode. The Linux kernel version was 4.14.0-49.15.1.el7a. Version 16.1.0 of the IBM Xpertise Library (XL) Fortran compiler was used to compile the sources.

Note: The performance numbers that are shown in this section must not be treated as the official results. They are provided to demonstrate the possible effect of various performance tuning techniques on application performance. The results that are obtained in different hardware and software environments or with other compilation parameters can vary widely from the numbers that are shown here.

The size of the problems in the NPB benchmarking suite is predefined by the developers. The example uses the benchmarks of class C. The source code of the NPB suite was not changed. The code generation was controlled by setting compilation parameters in a make.def file, as shown in Example 9-1.

Example 9-1 NPB: A sample make.def file for the -O3 parameter set
F77 = xlf_r -qsmp=noauto:omp -qnosave
FLINK = $(F77)
FFLAGS = -O3 -qmaxmem=-1 -qarch=auto -qtune=auto:balanced
FLINKFLAGS = $(FFLAGS)
CC = xlc_r -qsmp=noauto:omp
CLINK = $(CC)
C_LIB = -lm
CFLAGS = $(FFLAGS)
CLINKFLAGS = $(CFLAGS)
UCC = xlc_r
BINDIR = ../O3
RAND = randi8
WTIME = wtime.c
MACHINE = -DIBM

1 The NPB suite was developed by the NASA Advanced Supercomputing (NAS) Division. For more information, see NAS Parallel Benchmarks.

It is not feasible to cover all the compilation parameters, so the tests varied only the level of optimization. The following parameters were common for all builds:
- -qsmp=noauto:omp
- -qnosave
- -qmaxmem=-1
- -qarch=auto
- -qtune=auto:balanced

The example considers four sets of compilation parameters. In addition to the previous compilation parameters, the remaining parameters are listed in Table 9-1. The column “Option set name” lists shortcuts that were used to reference each set of compilation parameters that are presented in the “Compilation parameters” columns.

Table 9-1 NPB: Compilation parameters that were used to build the NPB suite executable files

Option set name   Varying parameter   Common parameters
-O2               -O2                 -qsmp=noauto:omp -qnosave -qmaxmem=-1 -qarch=auto -qtune=auto:balanced
-O3               -O3                 (same common parameters)
-O4               -O4                 (same common parameters)
-O5               -O5                 (same common parameters)

All of the runs were performed with the following environment variables:
- export OMP_DYNAMIC="FALSE"
- export OMP_SCHEDULE="static"

The benchmarks were submitted through IBM Spectrum Load Sharing Facility (IBM Spectrum LSF) by using jsrun as the job launcher. IBM Spectrum LSF configured the node for the needed SMT level, and jsrun made sure to pack the requested OpenMP threads into all available CPUs in each core.

Example 9-2 shows how to submit a job to run on 22 CPU cores with SMT-2 by using 44 virtual CPUs.

Example 9-2 IBM Spectrum LSF job submission for running benchmarks with 22 cores in SMT-2 mode
$ cat jobscript
#! /bin/bash
#BSUB -q excl
#BSUB -J NPB-22cpu-smt2
#BSUB -nnodes 1
#BSUB -alloc_flags "smt2"
#BSUB -o NPB-22cpu-smt2.%J
### End BSUB Options and begin shell commands
export OMP_DYNAMIC="FALSE"
export OMP_SCHEDULE="static"

# 22 cpus with SMT2 = 44 available cpus.
export OMP_NUM_THREADS=44

cd /home/janfrode/code/NPB3.3.1/NPB3.3-OMP
for O in O2 O3 O4 O5 ; do
  for i in bt cg ep ft is lu mg sp ua ; do
    echo ====== ./${O}/${i}.C.x ======
    jsrun -n 1 -a 1 -c 22 -r 1 -d cyclic -b rs ./${O}/${i}.C.x
  done
done

$ bsub < jobscript
Job <195755> is submitted to queue .

Note: In several cases, the performance results that are presented deviate significantly from the general behavior within the same plot. The plot bars with deviations can be ignored because those tests might have experienced unusual conditions (OS jitter, parasitic external workload, and so on).

9.1.1 Performance effect of a rational choice of SMT mode

The POWER9 processor-based core can run instructions from up to four application threads simultaneously. This capability is known as SMT. The POWER9 processor-based architecture supports the following three multithreading levels:
- ST (single-thread)2
- SMT-2 (two-way multithreading)
- SMT-4 (four-way multithreading)

The SMT mode, which can be used to obtain the optimal performance, depends on the characteristics of the application. Compilation parameters and mapping between application threads and logical processors can also affect the timing.

The reasons behind a conscious choice of an SMT mode

The execution units of a core are shared by all logical processors of the core (two logical processors in SMT-2 mode and four logical processors in SMT-4 mode). Some execution units exist in multiple instances (for example, load and store units and fixed-point units), and others in a single instance (for example, the branch execution unit).

In ideal conditions, application threads do not compete for execution units. This configuration results in each of the four logical processors of a core that is running in SMT-4 mode being almost as fast as a core that is running in ST mode.

Depending on the application threads instruction flow, some execution units become fully saturated with instructions that come from different threads. As a result, the progress of the dependent instructions is postponed. This postponement limits the overall performance of the four logical processors of a core that is running in SMT-4 to the performance of a core that is running in ST mode.

It is also possible for even a single thread to saturate fully the resources of a whole core. Therefore, adding threads to a core can result in performance degradation. For example, see the performance of the mg.C benchmark that is shown in Figure 9-9 on page 268.

2 Single-thread mode is sometimes referred to as SMT-1.

Note: Generally, the performance of each logical processor of a core that is running in SMT-2 or SMT-4 mode is not equivalent to the performance of a core that is running in ST mode. The performance of each logical processor of a core is influenced by all other logical processors in the core. The influence comes from the application threads' instruction flow.

Performance effect of SMT mode on NPB benchmarks

The bar charts that are shown in Figure 9-3 on page 265 through Figure 9-11 on page 269 show the performance benefits that come from a rational choice of SMT mode for applications from the NPB suite (bt.C, cg.C, ep.C, ft.C, is.C, lu.C, mg.C, sp.C, and ua.C). The performance of the applications was measured for each combination of the following options:
- Four sets of compiler parameters, as listed in Table 9-1 on page 261.
- Six core layouts:
  – 1, 2, 5, 10, and 22 cores from one socket
  – A total of 44 cores from both sockets

The plots are organized according to the following scheme:
- Each figure presents results for a particular benchmark from the NPB suite.
- Each subplot is devoted to a particular set of compiler options.
- The horizontal axis lists core layouts. We packed cores into as few sockets as possible so that for 1 - 22 cores we used only one socket, and for 44 cores we used two sockets.
- The vertical axis shows the performance gain as measured in percentage relative to a baseline. As a baseline, we choose the slowest SMT mode for that particular application run.

To explore the results of the benchmarks, we log the settings and results of every run to a CSV file, and use a Jupyter Notebook to work with the data, as shown in Figure 9-1.

Figure 9-1 Exploring the benchmark results

Now, we can add a “speedup” column with the calculation of the relative speedup compared to the slowest SMT mode, as shown in Figure 9-2.

Figure 9-2 Adding the speedup column

Figure 9-3 on page 265 through Figure 9-11 on page 269 show how the choice of SMT mode affects the performance. The complete Jupyter Notebook source code that creates these plots is listed in Example 9-3.

Example 9-3 Source code for Jupyter Notebook data exploration
%matplotlib inline
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import seaborn
seaborn.set()
from itertools import cycle, islice

df = pd.read_csv('smt-bench.csv', sep=',')
df.head()

my_colors = list(islice(cycle(['#003ac7', '#00b89f', '#0063ff', '#008471']),
                        None, len(df)))

# Next cell

for opt in df.optimization.unique():
    for bench in df.benchmark.unique():
        df.loc[(df['benchmark'] == bench) & \
               (df['optimization'] == opt), 'speedup'] = \
            df.loc[(df['benchmark'] == bench) & \
                   (df['optimization'] == opt)].groupby('cores')['time'].transform('max') / \
            (df.loc[(df['benchmark'] == bench) & (df['optimization'] == opt)].time)*100

df.head()

# Next cell

for bench in df.benchmark.unique():
    fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10,5))
    fig.suptitle("Benchmark: " + bench + ".C")

    for opt in df.optimization.unique():
        if opt == "O2":
            i=0
            j=0
        if opt == "O3":
            i=0
            j=1
        if opt == "O4":
            i=1
            j=0
        if opt == "O5":
            i=1
            j=1
        df2 = df.loc[(df['benchmark'] == bench) & (df['optimization'] == opt)]
        plot = df2.pivot(index="cores", columns="smt").plot(kind="bar",
                color=my_colors, sharex='col', width=0.8, ax=axes[i,j],
                ylim=(0.95*min(df2.speedup), 1.05*max(df2.speedup)),
                y="speedup", title=opt)
        plot.legend(loc=1, prop={'size': 8})
    fig.tight_layout()
    fig.show()

Figure 9-3 Performance benefits from a rational choice of SMT mode for the bt.C benchmark

Figure 9-4 Performance benefits from a rational choice of SMT mode for the cg.C benchmark

Figure 9-5 Performance benefits from a rational choice of SMT mode for the ep.C benchmark

Figure 9-6 Performance benefits from a rational choice of SMT mode for the ft.C benchmark

Figure 9-7 Performance benefits from a rational choice of SMT mode for the is.C benchmark

Figure 9-8 Performance benefits from a rational choice of SMT mode for the lu.C benchmark

Figure 9-9 Performance benefits from a rational choice of SMT mode for the mg.C benchmark

Figure 9-10 Performance benefits from a rational choice of SMT mode for the sp.C benchmark

Figure 9-11 Performance benefits from a rational choice of SMT mode for the ua.C benchmark

Choosing the SMT mode for computing nodes

Many NPB applications benefit from SMT-4 mode. Therefore, from the general system management perspective, it can be unwise to restrict users to lower SMT values. The administrator should consider the following recommendations:
- Run the system in SMT-4 mode by default.
- Let the batch scheduler turn on SMT mode on a per job basis.
- Use a processor core as a unit of hardware resource allocation.3

3 This recommendation is targeted at applications that are limited only by the computing power of a processor. It does not account for interaction with the memory subsystem and other devices. At some supercomputing sites and user environments, it can be reasonable to use a socket or even a server as a unit of hardware resource allocation.

By using the physical processor core as a unit of hardware resource allocation, you ensure that no other applications use the idle logical processors of a physical core that is assigned to the user. Before the users run a production workload, they must run several benchmarks for an application of their choice. The benchmarks are necessary to determine a favorable number of software threads to run at each core. After the value is identified, that number must be accounted for when users arrange production workloads.

If possible, recompile the application with the -qtune=pwr9:smtX option of the IBM XL compiler (where X is 1, 2, or 4, depending on the SMT mode), and repeat the benchmark.
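For example, a hedged sketch of such a rebuild that targets SMT-2 (the source file, binary name, and the other options are illustrative):

# Rebuild the application tuned for POWER9 cores running in SMT-2 mode, then repeat the benchmark
xlf_r -O3 -qarch=pwr9 -qtune=pwr9:smt2 -qsmp=omp -o app_smt2 app.f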

Reasoning for Message Passing Interface applications

The same logic holds for applications that use the Message Passing Interface (MPI). For programs that are based on OpenMP, seek a favorable number of threads to run on each core. Similarly, for an application that is created with MPI, find a favorable number of MPI processes to run on each core.

9.1.2 Effect of optimization options on performance

Various compiler options affect the performance characteristics of produced binary code. However, not all of the optimization options are equally suited for all types of workloads. Compilation parameters that result in good performance for one type of application might not perform equally well for other types of computations. Choose a favorable set of compiler options that is based on the timing of a particular application.

The bar charts that are shown in Figure 9-12 on page 271 through Figure 9-20 on page 273 compare the performance benefits that come from a rational choice of compiler options for the same applications from the NPB suite (bt.C, cg.C, ep.C, ft.C, is.C, lu.C, mg.C, sp.C, and ua.C) that are described in 9.1.1, “Performance effect of a rational choice of SMT mode” on page 262.

Again, the four sets of compiler options that are considered are listed in Table 9-1 on page 261. The application threads are bound to 1, 2, 5, 10, or 22 cores of one socket or 44 cores of two sockets.

The organization of the plots is similar to the scheme that was described in 9.1.1, “Performance effect of a rational choice of SMT mode” on page 262:
- Each figure presents results for a particular benchmark from the NPB suite.
- Each subplot is devoted to a particular SMT mode.
- The horizontal axis lists the number of cores that is used.
- The vertical axis shows the performance gain as measured in percentage relative to a baseline. As a baseline, we chose a compiler option set that is less favorable for that particular application run.

Again, we must add a column to our table (optSpeedup) that shows the relative speedup between the various optimizations, as shown in Example 9-4, and generate plots showing the effects of compiler optimization levels for the NPB applications.

Example 9-4 Source code for adding the optSpeedup column and generating new plots
for smt in df.smt.unique():
    for bench in df.benchmark.unique():
        df.loc[(df['benchmark'] == bench) & (df['smt'] == smt), 'optSpeedup'] = \
            df.loc[(df['benchmark'] == bench) & (df['smt'] == smt)].groupby('cores')['time'].transform('max') / \
            (df.loc[(df['benchmark'] == bench) & (df['smt'] == smt)].time)*100

for bench in df.benchmark.unique():
    fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(10,3))
    fig.suptitle("Benchmark: " + bench + ".C")

    for smt in df.smt.unique():
        if smt == "smt1":
            i=0
        if smt == "smt2":
            i=1
        if smt == "smt4":
            i=2
        df2 = df.loc[(df['benchmark'] == bench) & (df['smt'] == smt)]
        plot = df2.pivot(index="cores", columns="optimization").plot(kind="bar",
                color=my_colors, width=0.8, ax=axes[i],
                ylim=(0.95*min(df2.optSpeedup), 1.05*max(df2.optSpeedup)),
                y="optSpeedup", title=smt)
        plot.legend(loc=1, prop={'size': 8})
    fig.tight_layout()
    fig.show()

Figure 9-12 Performance gain from a rational choice of compiler options for the bt.C benchmark

Figure 9-13 Performance gain from a rational choice of compiler options for the cg.C benchmark

Figure 9-14 Performance gain from a rational choice of compiler options for the ep.C benchmark

Figure 9-15 Performance gain from a rational choice of compiler options for the ft.C benchmark

Figure 9-16 Performance gain from a rational choice of compiler options for the is.C benchmark

Figure 9-17 Performance gain from a rational choice of compiler options for the lu.C benchmark

Figure 9-18 Performance gain from a rational choice of compiler options for the mg.C benchmark

Figure 9-19 Performance gain from a rational choice of compiler options for the sp.C benchmark

Figure 9-20 Performance gain from a rational choice of compiler options for the ua.C benchmark

9.1.3 Favorable modes and options for applications from the NPB suite

Table 9-2 lists SMT modes and compiler optimization options that are favorable for most of the runs that are described in 9.1.1, “Performance effect of a rational choice of SMT mode” on page 262 and 9.1.2, “Effect of optimization options on performance” on page 270. The row headers (ST, SMT-2, and SMT-4) designate SMT modes. The column headers (-O2, -O3, -O4, and -O5) refer to sets of compiler options that are listed in Table 9-1 on page 261.

Table 9-2 Favorable modes and options for applications from the NPB suite

Mode    -O2     -O3             -O4     -O5
ST      —       is.C and mg.C   is.C    —
SMT-2   is.C    —               —       —
SMT-4   —       ep.C            —       —

Favorable SMT modes can vary from ST to SMT-4 and favorable compilation options can vary from -O2 to -O5.

It is difficult to know before experimenting which SMT mode and which set of compilation options are favorable for a user application. Therefore, establish benchmarks before conducting production runs. For more information, see 9.2, “General methodology of performance benchmarking” on page 275.

9.1.4 Importance of binding threads to logical processors

The OS can migrate application threads between logical processors if a user does not explicitly specify thread affinity. As described in 8.1, “Controlling the running of multithreaded applications” on page 228, a user can specify the binding of application threads to a specific group of logical processors. One option for binding threads is to use system calls in the source code. The other option is to set environment variables that control thread affinity.
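As a minimal sketch of the environment-variable approach, the portable OpenMP affinity variables can be used (this is an alternative to the compiler-specific variables that are described in 9.3, “Sample code for the construction of thread affinity strings” on page 283; the application name and thread count are illustrative):

# Pin one OpenMP thread per core and prevent the OS from migrating them
export OMP_PROC_BIND=true
export OMP_PLACES=cores
export OMP_NUM_THREADS=22
./my_openmp_app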

For technical computing workloads, you want to ensure that the application threads are bound to logical processors. Thread affinity often helps to leverage the POWER Architecture memory hierarchy and reduce the impact that is related to the migration of threads by an OS.

To demonstrate the importance of this technique, we chose the mg.C test from the NPB suite as an example. The performance of the mg.C application peaks at SMT-1 (see Table 9-2). For this benchmark, you can expect a penalty if an OS puts more than one thread on each core.

Figure 9-21 shows the performance improvement that was obtained by binding application threads to logical processors. The baseline corresponds to runs without affinity. The bars show the relative gain that was obtained after the assignment of software threads to hardware threads. As the SMT mode increases, the OS has more freedom in scheduling threads. As a result, the effect of thread binding becomes more pronounced for higher numbers of threads and higher SMT modes.

Figure 9-21 Performance improvement for an application when thread binding is used

When running with 44 threads, we get twice the performance of an unbound run in SMT-1 mode, and 6x the performance in SMT-4 mode. The reason for the extreme speedup in this case is that with thread binding we can ensure that each core runs only one thread, whereas in unbound mode some cores run multiple threads that compete for CPU resources while other cores sit idle.

9.2 General methodology of performance benchmarking

This section describes how to evaluate the performance of an application on a system with massively multithreaded processors. It also provides some hints about how to take advantage of the performance of an application without access to the source code.

This example assumes that the application is thread-parallel and does not use MPI4 or General Purpose GPUs (GPGPUs)5. However, the generalization of the methodology to a broader set of applications is relatively straightforward.

For the simplicity of table structure throughout this section, the examples make the following assumptions about the configuration of a computing server:
- The server is a two-socket system with fully populated memory slots.
- The server has 22 cores per socket (44 cores in total).

The computing server reflects the architecture of the Power AC922 (8335-GTW) server.

4 MPI is a popular message-passing application programmer interface that is used for parallel programming. 5 GPGPU is the use of GPUs for solving compute-intensive data parallel problems.

This section also describes performance benchmarking and summarizes the methodology in a step-by-step instruction form. For more information, see 9.2.11, “Summary” on page 283.

9.2.1 Defining the purpose of performance benchmarking

Before starting performance benchmarking, clarify the purpose of this activity. A clearly defined purpose of benchmarking is helpful in creating a plan of benchmarking and defining the success criteria.

The purpose of performance benchmarking looks different from each of the following two points of view:
- Performance benchmarking that is carried out by an application developer.
- Performance benchmarking that is done by an application user or a system administrator.

Benchmarking by application developers

From the perspective of an application developer, performance benchmarking is part of the software development process. An application developer uses performance benchmarking to pursue the following goals:
- Comparing the code’s performance with the performance model.
- Identifying bottlenecks in the code that limit its performance (this process is typically done by using profiling).
- Comparing the code’s scalability with the scalability model.
- Identifying parts of the code that prevent scaling.
- Tuning the code to specific hardware.

A software developer performs a performance benchmarking of an application to understand how to improve application performance by changing an algorithm. Performance benchmarking viewed from an application developer perspective has complex methodologies. These methodologies are not covered in this publication.

Benchmarking by application users and system administrators

From the perspective of an application user or a system administrator, performance benchmarking is a necessary step before using an application for production runs on a computing system that is available to them. Application users and system administrators make the following assumptions:
- An application will be used on a computing system many times in the future.
- It makes sense to invest time into performance benchmarking because the time will be made up by faster completion of production runs in the future.
- Modification of the application source code is out of their scope. This statement implies that application users and system administrators are limited to the following choices to tune the performance:
  – Different compilers
  – Compiler optimization options
  – SMT mode
  – Number of computing cores
  – Runtime system parameters and environment variables
  – OS parameters

Essentially, applications users and system administrators are interested in how to make an application solve problems as fast as possible without changing the source code.

System administrators also need performance benchmarking results to determine the unit of hardware resource allocation when configuring a computing system. For more information, see “Choosing the SMT mode for computing nodes” on page 269.

This book considers performance benchmarking from the perspective of an application user or a system administrator.

9.2.2 Benchmarking plans

Complete the following steps for your benchmarking project:
1. A kick-off planning session with the experts
2. Workshop with the experts
3. Benchmarking
4. Session to discuss the intermediate results with the experts
5. Benchmarking
6. Preparation of the benchmarking report
7. Session to discuss the final results with the experts

You also must prepare the following lists:
- Applications to be benchmarked
- Persons who are responsible for specific applications

Planning and discussion sessions are ideally face-to-face meetings between the benchmarking team and Power Architecture professionals. A workshop with the Power Architecture experts is a form of knowledge transfer and educational activity with hands-on sessions.

The specific technical steps that you perform when benchmarking individual applications are described next.

For an example of the technical part of a plan, see 9.2.11, “Summary” on page 283.

9.2.3 Defining the performance metric and constraints

The performance metric provides a measurable indicator of application performance. Essentially, a performance metric is a number that can be obtained in a fully automated manner. The most commonly used performance metrics include the following examples:
- Time of an application run.
- Number of operations that are performed by an application in a unit of time.
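For example, the elapsed time of an application run can be captured in a fully automated manner with standard tools (a sketch; ./app is a placeholder for your application):

# Record wall-clock, user, and system time of one run in a machine-readable form
/usr/bin/time -p ./app > app.out 2> app.time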

Some applications impose constraints in addition to the performance metric. Typically, the violation of a constraint means that running the application in such conditions is unacceptable. Constraints include the following examples:
- Latency of individual operations (for example, a specified fraction of operations must complete within a specified time threshold).
- A quality metric of the results that are produced by the application must not fall under a specified threshold (for example, a video surveillance system drops no more than a specified number of video frames in a unit of time).

9.2.4 Defining the success criteria

Satisfying the success criteria is a formal reason to finalize performance benchmarking. Usually, the success criteria are based on the performance results. Typically, the success criteria are a numeric threshold that provides a definition of acceptable performance (see 9.2.3, “Defining the performance metric and constraints” on page 277).

The performance benchmarking can result in the following outcomes:
- The success criteria are satisfied. This result means that the process of performance tuning can be finalized.
- The application does not scale as expected (see “Probing scalability” on page 280). This result indicates that you must reconsider the success criteria.
- The success criteria are not satisfied. Use the following techniques to solve the problem:
  – Discuss the performance tuning options that you tried and the results that you obtained with the experts. For this purpose, keep a verbose benchmarking log. For more information, see “Keeping the log of benchmarking” on page 279.
  – Engage the experts to perform deep performance analysis.
  – Seek the help of software developers to modify the source code.

Performance of a logical processor versus performance of a core

In most cases, there is no sense in defining success criteria based on the performance of a logical processor or a core taken in isolation. As described in 9.1.1, “Performance effect of a rational choice of SMT mode” on page 262, the POWER9 core can run instructions from multiple application threads simultaneously. Therefore, the whole core is the minimal entity for discussing performance. For more information, see “Choosing the SMT mode for computing nodes” on page 269.

Aggregated performance statistics for poorly scalable applications

Similarly, some applications are not designed to scale up to a whole computing server. For example, an application can violate constraints under a heavy load (see 9.2.3, “Defining the performance metric and constraints” on page 277). In such a situation, it makes sense to run several instances of an application on a computing node and collect aggregated performance statistics. With this technique, you can evaluate the performance of a whole server that is running a particular workload.

9.2.5 Correctness and determinacy

Before you start performance benchmarking, check whether the application works correctly and produces deterministic results. If the application produces nondeterministic results by design (for example, the application implements the Monte-Carlo method or another stochastic approach), you have at least two options:
- Modify the application by making its output deterministic (for example, fix the seed of a random number generator).
- Develop a reliable approach for measuring the performance of a nondeterministic application (this process can require many more runs than for a deterministic application).

During each stage of the performance tuning, verify that the application still works correctly and produces deterministic results (for example, by using regression testing).

9.2.6 Keeping the log of benchmarking

Preserve the history and output of all commands that are used in the preparation of the benchmark and during the benchmark. This log includes the following files:
- Makefiles
- Files with the output of the make command
- Benchmarking scripts
- Files with the output of benchmarking scripts
- Files with the output of individual benchmarking runs

A benchmarking script is a shell script that includes the following commands:
- A series of commands that capture the information about the environment
- A sequence of commands that run applications under various conditions

Example 9-5 shows several commands that you might want to include in the beginning of a benchmarking script to capture the information about the environment.

Example 9-5 Example of the beginning of a benchmarking script
#!/bin/bash -x
uname -a
tail /proc/cpuinfo
numactl -H
ppc64_cpu --smt
ppc64_cpu --cores-present
ppc64_cpu --cores-on
gcc --version
xlf -qversion
env
export OMP_DISPLAY_ENV="VERBOSE"

Typically, a benchmarking script combines multiple benchmarking runs in a form of loops over a number of sockets, cores, and SMT modes. By embedding the commands that you use to run an application into a script, you keep the list of commands along with their outputs.
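The following sketch shows the loop structure only (the application name, the core counts, and the way the SMT mode is changed are illustrative assumptions; on a cluster, the SMT mode is normally requested through the job scheduler instead):

# Loop over SMT modes and core counts; each iteration runs the application once
for smt in 1 2 4; do
  ppc64_cpu --smt=$smt            # changing the SMT mode requires root privileges
  for cores in 1 2 5 10 22 44; do
    OMP_NUM_THREADS=$((cores * smt)) ./app.x
  done
done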

Running a benchmarking script

If you do not use a job scheduler to submit a benchmarking script, run the benchmarking script inside a session that is created by running the screen command. Doing so gives protection against network connection issues and helps to keep applications running, even during a network failure.

One option to run a benchmarking script is to run the following command:

./benchmarking_script_name.sh 2>&1 | tee specific_benchmarking_scenario.log

This command combines the standard output and the standard error streams, and the tee command writes the combined stream both to your terminal and to the specified file.

Choosing names for log files

It is a best practice to write the output of individual benchmarking runs to separate files with well-defined descriptive names. For example, we found the following format useful for the purposes of the benchmarking runs that are described in 9.1, “Effects of basic performance tuning techniques” on page 260:

$APP_NAME.t$NUM_SMT_THREADS.c$NUM_CORES.s$NUM_SOCKETS.log

This format facilitated the postprocessing of the benchmarking logs by running the sed command.
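A hedged sketch of how such names can be generated inside a benchmarking script (run_one_case.sh and the variable values are placeholders):

# Write the output of every run to a descriptively named log file
APP_NAME=bt.C.x
NUM_SOCKETS=1
for NUM_SMT_THREADS in 1 2 4; do
  for NUM_CORES in 1 2 5 10 22; do
    LOG=$APP_NAME.t$NUM_SMT_THREADS.c$NUM_CORES.s$NUM_SOCKETS.log
    ./run_one_case.sh $NUM_SMT_THREADS $NUM_CORES > $LOG 2>&1
  done
done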

9.2.7 Probing scalability

Probing scalability is a necessary step of the performance benchmarking for the following reasons:
- It evaluates the behavior of an application when the number of computing resources increases.
- It helps to determine the favorable number of processor cores to use for production runs.

To be concise, the ultimate purpose of probing the scalability is to check whether an application is scalable.

Note: In the performance benchmarking within a one-node system, the scalability is probed only in a strong sense, also known as a speed-up metric. This metric is obtained by fixing the size of a problem and varying the number of processor cores. For a multi-node system, the complementary metric is scalability in a weak sense. It is obtained by measuring performance when the amount of computing work per node is fixed.

Before performing runs, the user answers the following questions:
- Does the application have specific limits on the number of logical processors it can use? (For example, some applications are designed to run only with a number of logical processors that is a power of two: 1, 2, 4, 8, and so on.)
- Is a scalability model of the application available? The model can be purely analytical (for example, the application vendor can claim linear scalability), or the model can be based on the results of previous runs.
- What are the scalability criteria for the application? That is, what is the numeric threshold that defines the “yes” or “no” answer to the following formal question: “Given N > Nbase logical processors, are you ready to agree that the application is scalable on N logical processors in respect to Nbase logical processors?” If available, the scalability model helps you to choose the scalability criteria.

The next step is to compile the application with a basic optimization level. For example, you can use the following options:
- -O3 -qstrict (with IBM XL Compilers)
- -O3 (with GNU Compiler Collection (GCC)6 and IBM Advance Toolchain)
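As a sketch (the source file and binary names are illustrative), the corresponding baseline builds might look as follows:

# Baseline build with IBM XL Fortran
xlf_r -O3 -qstrict -qsmp=omp -o app_xl app.f
# Baseline build with GCC or the IBM Advance Toolchain
gfortran -O3 -fopenmp -o app_gcc app.f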

For more information about IBM XL and GCC, see 7.1, “Compiler options” on page 162.

For scalability probing, choose the size of a problem that is large enough to fit the memory of a server. In contrast, solving a problem of a small size often is a waste of computing cores. If you must process multiple small problems, run multiple one-core tasks simultaneously on one node.

6 GCC contains various compilers, including C, C++, and Fortran.

When you decide on a problem size, run the application in ST mode with different numbers of logical processors. Complete the “Performance”, “Meets the scalability model? (yes/no)”, and “Meets the scalability criteria? (yes/no)” rows, as shown in Table 9-3.

Table 9-3 Scalability probing measurements table

Number of cores                            1   2   3   …   22   ...   32   36   40   44
Number of sockets                          1   1   1   …   1    ...   2    2    2    2
Number of cores per socket                 1   2   3   …   22   ...   16   18   20   22
Performance
Meets the scalability model? (yes/no)      —
Meets the scalability criteria? (yes/no)   —

Note: Bind application threads to logical processors. Failing to do so typically produces worse results (see 9.1.4, “Importance of binding threads to logical processors” on page 274).

When probing scalability, vary the number of cores and run each core in ST mode. Therefore, the total number of threads that is used is equal to the number of cores.

Depending on the results of the benchmarking, you receive one of the following outcomes:
- The application is scalable up to 44 cores.
- The application is scalable within the socket (up to 22 cores).
- The application is not scalable within the socket.

If the scalability of the application meets your expectations, proceed to the next step and evaluate the performance by using the maximum number of cores that you determined in this section.

If the application does not meet your scalability expectations, clarify why and repeat probing scalability until you are satisfied. Only after that process can you proceed to the next step.

9.2.8 Evaluation of performance on a favorable number of cores

The performance of an application heavily depends on the choice of SMT mode and compilation options. For more information, see Figure 9-12 on page 271. It is difficult to know beforehand which set of compiler options and number of hardware threads per core is a favorable choice for an application. Therefore, you must run the application with several combinations of compilation options and SMT modes and select the most appropriate combination.

In 9.2.7, “Probing scalability” on page 280, the maximum reasonable number of cores to be used by an application was determined. The next step is to use that number of cores to run the application with different sets of compiler options and SMT modes.

You must try each of the SMT modes for several sets of compiler options. If you are limited on time, try fewer sets of compiler options, but go through all SMT modes. Enter the results of your performance measurements in Table 9-4 (the table is similar to Table 9-2 on page 274). Your actual version of the table can include other columns and different column headings, depending on the sets of compiler options that you choose.

Table 9-4 Performance measurements results table

Mode    -O2   -O3   -O4   -O5
ST
SMT-2
SMT-4

Optionally, you can repeat the step that is described in 9.2.7, “Probing scalability” on page 280 with the compiler options that give the highest performance.

9.2.9 Evaluation of scalability

The core of the IBM POWER9 processor is multithreaded. Therefore, measuring the performance of a single application thread is not recommended. Application performance is more meaningful when a whole core is used. The application performance depends on the SMT mode of a core, as described in “The reasons behind a conscious choice of an SMT mode” on page 262.

Before the scalability of an application is evaluated, select a favorable SMT mode and compilation options, as described in 9.2.8, “Evaluation of performance on a favorable number of cores” on page 281. Then, use that SMT mode and vary the number of cores to get the scalability characteristics. Enter the results of the measurements in the “Performance” row of Table 9-5 (the table is similar to Table 9-3 on page 281). Compute values for the “Speed-up” row based on the performance results. The number of columns in your version of the table depends on the maximum number of cores that you obtained when you probed the scalability. For more information, see 9.2.7, “Probing scalability” on page 280.
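One common way to compute the “Speed-up” row (assuming that the performance metric grows with throughput, or equivalently using run times) is:

Speed-up(N) = Performance(N) / Performance(1) = T(1) / T(N)

where N is the number of cores and T(N) is the run time on N cores.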

Table 9-5 Scalability evaluation results table

Number of cores              1   2   3   …   22   ...   32   36   40   44
Number of sockets            1   1   1   …   1    ...   2    2    2    2
Number of cores per socket   1   2   3   …   22   ...   16   18   20   22
Performance
Speed-up                     —

Note: When we probed the scalability, we were running the cores in ST mode. In the scalability evaluation step, we ran the cores in an SMT mode that we determined in the performance evaluation step. The total number of threads in each run of the scalability evaluation is equal to the number of used cores that are multiplied by the number of threads per core in a chosen SMT mode.

9.2.10 Conclusions

As a result of the benchmark, you have the following information for each application:
- The level of the application scalability in ST mode.
- A favorable SMT mode and compiler options that deliver the highest performance.
- Scalability characteristics in the most favorable SMT mode.

9.2.11 Summary

The process that is used to facilitate your benchmarking includes the following steps:
1. Define the purpose of the benchmarking (see 9.2.1, “Defining the purpose of performance benchmarking” on page 276) and choose which of the following options you will use for performance tuning:
   – Choice between different compilers.
   – Choice of compiler optimization options.
   – Choice of an SMT mode.
   – Choice of the number of computing cores.
   – Choice of runtime system parameters and environment variables.
   – Choice of OS parameters.
2. Create the benchmarking plan (see 9.2.2, “Benchmarking plans” on page 277).
3. Define the performance metric (see 9.2.3, “Defining the performance metric and constraints” on page 277).
4. Define the success criteria (see 9.2.4, “Defining the success criteria” on page 278).
5. Verify that the application works correctly (see 9.2.5, “Correctness and determinacy” on page 278).
6. Probe the scalability (see 9.2.7, “Probing scalability” on page 280) to determine the limits of the application scalability:
   a. Complete the questionnaire on a scalability model and scalability criteria (see the bulleted list on page 280).
   b. Complete a results table (see Table 9-3 on page 281).
7. Discover the favorable SMT mode and compilation options (see 9.2.8, “Evaluation of performance on a favorable number of cores” on page 281) by completing Table 9-4 on page 282.
8. Evaluate the scalability (see 9.2.9, “Evaluation of scalability” on page 282) by completing Table 9-5 on page 282.

9.3 Sample code for the construction of thread affinity strings

Example 9-6 provides the source code t_map.c for the program t_map. You can use this small utility to construct a text string that describes the mapping of OpenMP threads of an application to logical processors. Text strings of this kind are intended to be assigned to OpenMP thread affinity environment variables.

Example 9-6 Source code t_map.c (in C programming language) for the program t_map
#include <stdio.h>
#include <stdlib.h>

#define MAX_TPC 4   // Threads per core (max SMT mode is SMT4)
#define MAX_CPS 22  // Cores per socket
#define MAX_SPS 2   // Sockets per system
#define MAX_THR (MAX_TPC * MAX_CPS * MAX_SPS)

void Print_Map(int sps, int cps, int tpc, int base)
{
    const int maps[MAX_TPC][MAX_TPC] = {
        { 0 },
        { 0, 4 },
        { 0, 2, 4 },
        { 0, 2, 4, 6 },
        { 0, 1, 2, 4, 6 },
        { 0, 1, 2, 4, 5, 6 },
        { 0, 1, 2, 3, 4, 5, 6 },
        { 0, 1, 2, 3, 4, 5, 6, 7 }
    };

    const int sep = ',';

    int thread, core, socket;

    int tot = sps * cps * tpc;
    int cur = 0;

    for (socket = 0; socket < sps; ++socket) {
        for (core = 0; core < cps; ++core) {
            for (thread = 0; thread < tpc; ++thread) {
                int shift = socket * MAX_CPS * MAX_TPC + core * MAX_TPC;
                shift += base;
                ++cur;
                int c = (cur != tot) ? sep : '\n';
                printf("%d%c", shift + maps[tpc-1][thread], c);
            }
        }
    }
}

void Print_Usage(char **argv)
{
    fprintf(stderr, "Usage: %s "
        "threads_per_core=[1-%d] "
        "cores_per_socket=[1-%d] "
        "sockets_per_system=[1-%d] "
        "base_thread=[0-%d]\n",
        argv[0], MAX_TPC, MAX_CPS, MAX_SPS, MAX_THR-1);
}

int main(int argc, char **argv)
{
    const int num_args = 4;

    if (argc != num_args+1) {
        fprintf(stderr, "Invalid number of arguments (%d). Expecting %d "
            "arguments.\n", argc-1, num_args);
        Print_Usage(argv);
        exit(EXIT_FAILURE);
    }

    int tpc = atoi(argv[1]);
    int cps = atoi(argv[2]);
    int sps = atoi(argv[3]);
    int base = atoi(argv[4]);

    if (tpc < 1 || tpc > MAX_TPC ||
        cps < 1 || cps > MAX_CPS ||
        sps < 1 || sps > MAX_SPS) {
        fprintf(stderr, "Invalid value(s) specified in the command line\n");
        Print_Usage(argv);
        exit(EXIT_FAILURE);
    }

    int tot = sps * cps * tpc;

    if (base < 0 || base+tot-1 >= MAX_THR) {
        fprintf(stderr, "Invalid value specified for the base thread (%d). "
            "Expected [0, %d]\n", base, MAX_THR-tot);
        Print_Usage(argv);
        exit(EXIT_FAILURE);
    }

    Print_Map(sps, cps, tpc, base);

    return EXIT_SUCCESS;
}

Note: Example 9-6 on page 283 lists a revised version of the code that originally appeared in Appendix D, “Applications and performance”, of Implementing an IBM High-Performance Computing Solution on IBM POWER8, SG24-8263. The code was adjusted to better fit the architecture of Power AC922 servers.

To use the tool, first you must compile the code by using the C compiler of your choice. For example, run the following command with the gcc compiler:

$ gcc -o t_map t_map.c

If you run the tool without any arguments, you are provided a brief hint about the usage, as shown in the following example:

$ ./t_map
Invalid number of arguments (0). Expecting 4 arguments.
Usage: ./t_map threads_per_core=[1-4] cores_per_socket=[1-22] sockets_per_system=[1-2] base_thread=[0-176]

The utility needs you to specify the following amounts and locations of resources that you want to use:
- Number of threads per core
- Number of cores per socket
- Number of sockets
- The initial logical processor number (counting from zero)

Example 9-7 shows how to generate a thread mapping string for an OpenMP application that uses the following resources of a 22-core Power AC922 server:
- Twenty OpenMP threads in total
- Two threads on each core
- Ten cores on each socket
- Only the second socket

Example 9-7 Sample command for the generation of a thread mapping string
$ ./t_map 2 10 1 80
80,84,84,88,88,92,92,96,96,100,100,104,104,108,108,112,112,116,116,120

The runtime system of an OpenMP application obtains the thread mapping string from an environment variable. You must use different OpenMP thread affinity environment variables, depending on the compiler you use for your OpenMP application. Table 9-6 lists references for OpenMP thread affinity environment variables.

Table 9-6 OpenMP thread affinity environment variables

Compiler family     Some compilers from the compiler family   OpenMP thread affinity environment variable
GCC                 gcc, g++, gfortran                         GOMP_CPU_AFFINITY
IBM XL compilers    xlc_r, xlc++_r, xlf2008_r                  XLSMPOPTS, suboption procs

The information in Table 9-6 also applies to the OpenMP applications that are built with the derivatives of GCC and IBM XL compilers (for example, MPI wrappers).

Example 9-8 shows how to assign values to OpenMP thread environment variables to implement the scenario in Example 9-7.

Example 9-8 Assigning values to the OpenMP thread affinity environment variables
$ export XLSMPOPTS=procs="`t_map 2 10 1 80`"
$ echo $XLSMPOPTS
procs=80,84,84,88,88,92,92,96,96,100,100,104,104,108,108,112,112,116,116,120
$ export GOMP_CPU_AFFINITY="`t_map 2 10 1 80`"
$ echo $GOMP_CPU_AFFINITY
80,84,84,88,88,92,92,96,96,100,100,104,104,108,108,112,112,116,116,120

Note: Example 9-8 implies that the t_map program is in your PATH. You must specify the full path to the tool if the OS does not find it in your PATH.

Example 9-8 shows the value of the environment variables with the echo command to demonstrate the result of the assignment. This command does not affect the affinity.

9.4 IBM Engineering and Scientific Subroutine Library performance results

The IBM Engineering and Scientific Subroutine Library (IBM ESSL) library includes the implementation of the famous double precision general matrix multiplication (DGEMM) routine, which is used in a large spectrum of libraries, benchmarks, and other IBM ESSL routines. Therefore, its performance is significant.

DGEMM implements the following formula:

C = αAB + βC

Alpha and beta are real scalar values, and A, B, and C are matrixes of conforming shape.
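For reference, the sample program in Example 9-9 estimates the amount of floating point work per call (its flop variable) as:

flop(m, n, k) = m × n × 2 × (k - 1)

which for square matrixes (m = n = k) is approximately 2n^3 operations; the MFLOPS value in the program output is this count divided by the measured run time.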

Example 9-9 features a sample Fortran program with multiple calls of DGEMM for different sizes of input matrixes.

Example 9-9 IBM ESSL Fortran example source code dgemm_sample.f
      program dgemm_sample
      implicit none

      real*8 diff
      integer n, m, k
      integer maxn
      integer step
      integer i
      real*8,allocatable :: a(:,:), b(:,:), c(:,:)
      real*8,allocatable :: at(:,:), bt(:,:), ct(:,:)
      real*8 rmin
      real*8 seed1, seed2, seed3
      real*8 dtime, mflop
      real*8 flop
      integer tdummy, tnull, tstart, tend, trate, tmax

      maxn = 20000
      step = 1000

      seed1 = 5.0d0
      seed2 = 7.0d0
      seed3 = 9.0d0
      rmin = -0.5d0

      call system_clock(tdummy,trate,tmax)
      call system_clock(tstart,trate,tmax)
      call system_clock(tend,trate,tmax)
      tnull = tend - tstart

      allocate( at(maxn, maxn) )
      allocate( bt(maxn, maxn) )
      allocate( ct(maxn, maxn) )
      allocate( a(maxn, maxn) )
      allocate( b(maxn, maxn) )
      allocate( c(maxn, maxn) )

      call durand(seed1, maxn*maxn, at)
      call dscal(maxn*maxn, 1.0d0, at, 1)
      call daxpy(maxn*maxn, 1.0d0, rmin, 0, at, 1)

      call durand(seed2, maxn*maxn, bt)
      call dscal(maxn*maxn, 1.0d0, bt, 1)
      call daxpy(maxn*maxn, 1.0d0, rmin, 0, bt, 1)

      call durand(seed3, maxn*maxn, ct)
      call dscal(maxn*maxn, 1.0d0, ct, 1)
      call daxpy(maxn*maxn, 1.0d0, rmin, 0, ct, 1)

      do i = 1, maxn/step
        n = i*step
        m = n
        k = n

        flop = dfloat(n)*dfloat(m)*(2.0d0*(dfloat(k)-1.0d0))

        call dcopy(n*k, at, 1, a, 1)
        call dcopy(k*m, bt, 1, b, 1)
        call dcopy(n*m, ct, 1, c, 1)

        call system_clock(tstart,trate,tmax)
        call dgemm('N','N',m,n,k,1.0d0,a,n,b,k,1.0d0,c,n)
        call system_clock(tend,trate,tmax)

        dtime = dfloat(tend-tstart)/dfloat(trate)
        mflop = flop/dtime/1000000.0d0

        write(*,1000) n, dtime, mflop

      enddo

 1000 format(I6,1X,F10.4,1X,F14.2)

      end program dgemm_sample

The commands that are shown in Example 9-10 compile and run this program by using the different types of IBM ESSL libraries (serial, symmetric multiprocessor (SMP), and SMP Compute Unified Device Architecture (CUDA)). For the SMP runs, the program uses 20 SMP threads with each thread bound to a different POWER9 physical core.

Example 9-10 Compilation and running of dgemm_sample.f
echo "Serial run"
xlf_r -O3 -qnosave dgemm_sample.f -lessl -o dgemm_fserial
./dgemm_fserial

echo "SMP run" export XLSMPOPTS=startproc=4:stride=4:parthds=20:spins=0:yields=0

xlf_r -O3 -qnosave -qsmp dgemm_sample.f -lesslsmp -o dgemm_fsmp
./dgemm_fsmp

export ESSL_CUDA_HYBRID=yes
xlf_r -O3 -qnosave -qsmp dgemm_sample.f -lesslsmp -o dgemm_fsmp
./dgemm_fsmp
export ESSL_CUDA_HYBRID=yes
echo "SMP CUDA run with 6 GPUs hybrid mode"
xlf_r -O3 -qnosave -qsmp dgemm_sample.f -lesslsmpcuda -lcublas -lcudart -L/usr/local/cuda/lib64 -R/usr/local/cuda/lib64 -o dgemm_fcuda
./dgemm_fcuda
echo "SMP CUDA run with 6 GPUs non-hybrid mode"
export ESSL_CUDA_HYBRID=no
./dgemm_fcuda
export CUDA_VISIBLE_DEVICES=0,1,2,3,4
export ESSL_CUDA_HYBRID=yes
echo "SMP CUDA run with 5 GPUs hybrid mode (1st, 2nd, 3rd, 4th, 5th)"
./dgemm_fcuda
echo "SMP CUDA run with 5 GPUs non-hybrid mode (1st, 2nd, 3rd, 4th, 5th)"
export ESSL_CUDA_HYBRID=no
./dgemm_fcuda
export CUDA_VISIBLE_DEVICES=0,1,2,3
export ESSL_CUDA_HYBRID=yes
echo "SMP CUDA run with 4 GPUs hybrid mode (1st, 2nd, 3rd, 4th)"
./dgemm_fcuda
echo "SMP CUDA run with 4 GPUs non-hybrid mode (1st, 2nd, 3rd, 4th)"
export ESSL_CUDA_HYBRID=no
./dgemm_fcuda
echo "SMP CUDA run with 3 GPUs (1st, 2nd, 3rd) hybrid mode"
export ESSL_CUDA_HYBRID=yes
export CUDA_VISIBLE_DEVICES=0,1,2
./dgemm_fcuda
echo "SMP CUDA run with 3 GPUs (1st, 2nd, 3rd) non-hybrid mode"
export ESSL_CUDA_HYBRID=no
./dgemm_fcuda
echo "SMP CUDA run with 2 GPUs (1st, 2nd) hybrid mode"
export ESSL_CUDA_HYBRID=yes
export CUDA_VISIBLE_DEVICES=0,1
./dgemm_fcuda
echo "SMP CUDA run with 2 GPUs (1st, 2nd) non-hybrid mode"
export ESSL_CUDA_HYBRID=no
./dgemm_fcuda
echo "SMP CUDA run with 2 GPUs (2nd, 3rd) hybrid mode"
export ESSL_CUDA_HYBRID=yes
export CUDA_VISIBLE_DEVICES=1,2
./dgemm_fcuda
echo "SMP CUDA run with 2 GPUs (2nd, 3rd) non-hybrid mode"
export ESSL_CUDA_HYBRID=no
./dgemm_fcuda
echo "SMP CUDA run with 1 GPU (1st) hybrid mode"
export ESSL_CUDA_HYBRID=yes
export CUDA_VISIBLE_DEVICES=0
./dgemm_fcuda
echo "SMP CUDA run with 1 GPU (1st) non-hybrid mode"
export ESSL_CUDA_HYBRID=no
./dgemm_fcuda

echo "SMP CUDA run with 1 GPU (2nd) hybrid mode"
export ESSL_CUDA_HYBRID=yes
export CUDA_VISIBLE_DEVICES=1
./dgemm_fcuda
echo "SMP CUDA run with 1 GPU (2nd) non-hybrid mode"
export ESSL_CUDA_HYBRID=no
./dgemm_fcuda
echo "SMP CUDA run with 1 GPU (3rd) hybrid mode"
export ESSL_CUDA_HYBRID=yes
export CUDA_VISIBLE_DEVICES=2
./dgemm_fcuda
echo "SMP CUDA run with 1 GPU (3rd) non-hybrid mode"
export ESSL_CUDA_HYBRID=no
./dgemm_fcuda
echo "SMP CUDA run with 1 GPU (4th) hybrid mode"
export ESSL_CUDA_HYBRID=yes
export CUDA_VISIBLE_DEVICES=3
./dgemm_fcuda
echo "SMP CUDA run with 1 GPU (4th) non-hybrid mode"
export ESSL_CUDA_HYBRID=no
./dgemm_fcuda
echo "SMP CUDA run with 1 GPU (5th) hybrid mode"
export ESSL_CUDA_HYBRID=yes
export CUDA_VISIBLE_DEVICES=4
./dgemm_fcuda
echo "SMP CUDA run with 1 GPU (5th) non-hybrid mode"
export ESSL_CUDA_HYBRID=no
./dgemm_fcuda
echo "SMP CUDA run with 1 GPU (6th) hybrid mode"
export ESSL_CUDA_HYBRID=yes
export CUDA_VISIBLE_DEVICES=4
./dgemm_fcuda
echo "SMP CUDA run with 1 GPU (6th) non-hybrid mode"
export ESSL_CUDA_HYBRID=no
./dgemm_fcuda

Figure 9-22 shows the performance in floating point operations per second for the different matrix sizes that are used in the calls.

Figure 9-22 DGEMM performance results for different types of IBM ESSL libraries

The chart shows that the GPUs provide a performance advantage starting at problem sizes of around 3000 - 5000. Smaller sizes do not provide enough work to benefit from running the computation on the GPU, and it is better to run the computation on the CPU by using the IBM ESSL SMP library.

We can determine what effects occur when we select different GPUs. In Figure 9-23, we plot the performance of running on any one GPU.

Figure 9-23 IBM ESSL SMP CUDA running with one GPU

For small matrix dimensions, there seems to be an extra cost for running on the GPUs that are attached to the second socket7. For larger dimensions, this extra impact is no longer noticeable, and any GPU gives the same performance within +/- 3%. Run your program on different GPUs to find the environment with the best performance results.

9.5 GPU tuning

This section explains the different GPU computing modes and how to manage shared access to GPUs by multi-process applications through the Multi-Process Service (MPS).

9.5.1 GPU processing modes

The NVIDIA Volta GPU has different computing modes, as shown in Table 9-7.

Table 9-7 NVIDIA Volta GPU modes

Mode   Name                Description
0      DEFAULT             Multiple threads can use cudaSetDevice() with this device.
1      EXCLUSIVE THREAD    Only one thread may use cudaSetDevice() with this device.
2      PROHIBITED          No threads can access the GPU.
3      EXCLUSIVE PROCESS   Only one context is allowed per device and is usable from multiple threads concurrently.

Configure these modes by running the following command:
nvidia-smi -c $mode
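For example, you can query the current mode of each GPU and then switch a device to another mode (changing the mode requires root authority). The following commands are a sketch that uses standard nvidia-smi options:

$ nvidia-smi --query-gpu=index,compute_mode --format=csv   # Show the current mode of every GPU
# nvidia-smi -i 0 -c EXCLUSIVE_PROCESS                     # Set mode 3 on the first GPU
# nvidia-smi -i 0 -c DEFAULT                               # Return the first GPU to mode 0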

9.5.2 CUDA Multi-Process Service

MPS is a client/server runtime implementation of the CUDA application programming interface (API) that helps multiple CUDA processes share the GPUs. MPS takes advantage of the parallelism between MPI tasks to improve GPU utilization.

MPS contains the following components:
- Control daemon process: Coordinates connections between servers and clients, and starts and stops the server.
- Client run time: Any CUDA application can use the MPS client run time from the CUDA driver library.
- Server process: Provides the clients' shared connection to the GPU and provides concurrency between clients.

7 The CPU threads were in this case bound to the first socket.

MPS is recommended for jobs that do not generate enough work to keep the GPUs busy. If you run multiple jobs concurrently, MPS helps to balance the workload of the GPU and increases its utilization.

Also, MPI programs whose tasks share a GPU can see performance improvements by using MPS to manage access to the GPU between tasks.

Note: The Volta MPS server supports up to 48 client CUDA contexts per device.

Use MPS together with the EXCLUSIVE_PROCESS mode to ensure that only the MPS server can run on the GPUs.

A CUDA program uses MPS transparently if the MPS control daemon is running on the system. At startup, the program attempts to connect to the MPS control daemon, which creates an MPS server or reuses an existing server if it runs under the same user ID as the user who started the job. Therefore, each user has their own MPS server.

The MPS server creates the shared GPU context for the different jobs in the system, manages its clients, and issues calls to the GPU on their behalf. An overview of this process is shown in Figure 9-24.

Figure 9-24 NVIDIA MPS

To run MPS, run the following commands as root:
1. (This step is optional.) Set the CUDA_VISIBLE_DEVICES environment variable to inform the system which GPUs are used. To use all GPUs, skip this step and start from step 2; otherwise, issue a command like the following one:
   export CUDA_VISIBLE_DEVICES=0,1   # Use only the first and second GPUs

2. Change the compute mode for all GPUs, or for the specific GPUs that were chosen in the first step:
   nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
   nvidia-smi -i 1 -c EXCLUSIVE_PROCESS
3. Start the daemon:
   nvidia-cuda-mps-control -d

To stop the MPS daemon, run the following command as root: echo quit | nvidia-cuda-mps-control

Example 9-11 shows the output for starting and stopping MPS for two GPUs.

Example 9-11 Start and stop commands of the MPS for two GPUs
# nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
Set compute mode to EXCLUSIVE_PROCESS for GPU 00000004:04:00.0.
All done.
# nvidia-smi -i 1 -c EXCLUSIVE_PROCESS
Set compute mode to EXCLUSIVE_PROCESS for GPU 00000004:05:00.0.
All done.
# nvidia-cuda-mps-control -d
# echo quit | nvidia-cuda-mps-control

If the system has running jobs, running the nvidia-smi command provides output similar to the output that is shown in Example 9-12.

Example 9-12 The nvidia-smi output for a system with MPS
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     23715      C   nvidia-cuda-mps-server                        89MiB |
|    1     23715      C   nvidia-cuda-mps-server                        89MiB |
|    2     23715      C   nvidia-cuda-mps-server                        89MiB |
|    3     23715      C   nvidia-cuda-mps-server                        89MiB |
+-----------------------------------------------------------------------------+

For more information about the CUDA MPS, see the Multi-Process Service documentation from NVIDIA.
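The individual steps can be combined into a small wrapper that brackets a job with the MPS start and stop commands. The following sketch assumes a hypothetical executable that is named my_mps_app, and that the nvidia-smi and MPS control commands are run with root authority, as in the previous steps:

# nvidia-smi -c EXCLUSIVE_PROCESS        # Apply EXCLUSIVE_PROCESS mode to all GPUs
# nvidia-cuda-mps-control -d             # Start the MPS control daemon
$ mpirun -np 4 ./my_mps_app              # All four tasks share the GPUs through MPS
# echo quit | nvidia-cuda-mps-control    # Stop the MPS daemon after the job completes
# nvidia-smi -c DEFAULT                  # Restore the default compute mode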

9.6 Application development and tuning tools

Many tools for parallel program development and tuning are available for the cluster environment that is described in this book. Because different development models are allowed, more than one tool is often needed to develop the application.

This section describes some tools that were developed by IBM, and other tools that are useful but are maintained by third-party companies or open source communities.

Table 9-8 lists the tools and development models.

Table 9-8 Tools for development and tuning of parallel applications

Development model                Provider                           Tool
MPI                              IBM Parallel Performance Toolkit   Call graph analysis
                                                                    I/O Profiling (MPI-IO)
                                                                    MPI Profiling
                                                                    Hardware Performance Monitor (HPM)
                                 RogueWave Software                 TotalView
                                 Allinea                            DDT
Hybrid of MPI and OpenMP         Parallel Performance Toolkit       OpenMP profiling
CUDA                             CUDA Toolkit                       CUDA-MEMCHECK
                                                                    CUDA Debugger
                                                                    nvprof
                                                                    Nsight Eclipse Edition
Hybrid of MPI and CUDA           Parallel Performance Toolkit       GPU Performance Monitor
MPI, OpenMP, Parallel Active     Eclipse Parallel Tools Platform    Integrated development
Messaging Interface (PAMI),      (Eclipse PTP)                      environment (IDE)
and OpenSHMEM

9.6.1 Parallel Performance Toolkit

IBM Parallel Performance Toolkit V2.4 provides a set of tools (see Table 9-8) and libraries to help with the development of applications that are written in C, C++, or Fortran by using pure MPI or hybrid with OpenMP and CUDA.

Note: IBM Parallel Environment Developer Edition was rebranded as Parallel Performance Toolkit starting with Version 2.3. Although the product was renamed, most of the components and usage documentation that is available in other IBM Redbooks publications and in IBM Knowledge Center remain applicable to this new version.

This toolkit has the following main components:
- Parallel Performance Toolkit: Provides back-end runtime and development libraries and command-line tools.
- hpctView: Provides a GUI front end to the Parallel Performance Toolkit. Also, hpctView plug-ins to Eclipse PTP are available.

The flowchart in Figure 9-25 shows the analysis cycle of a parallel application by using the toolkit. As a general rule, it involves the following phases:
- Instrument: The application executable file is prepared to use a specific tool of the HPC Toolkit.
- Profile: Run the application to record execution data.
- Visualize: Visualize the resulting profile by using the command-line or GUI (hpctView) tools.

Figure 9-25 Flowchart that shows how to use IBM Parallel Performance Toolkit

Instrumentation is an optional step for some tools. If you need a fine-grained collection of runtime information, use any of the following instrumentation mechanisms that are provided:
- Binary instrumentation: Use the hpctInst command or hpctView (GUI) to place probes on the executable file. It does not require you to rebuild the application.
- Profiling library API: Place calls to library methods to control profiling at the source level. It does require you to recompile the application and link it to the profiling library.

In the profile phase, you run an instrumented executable file with a relevant workload so that useful data is collected. For situations where the binary file is not instrumented, you must preload the corresponding profiling tool library at run time and use environment variables to control the behavior of the library. Then, the library takes over some method calls to produce the profile data.

To visualize the data that is produced by the profiling tools, use hpctView for the graphical visualization or Linux commands, such as cat, to inspect the report files.

Using hpctView to perform the instrument-profile-visualize analysis cycle is convenient because it permits fine-grained instrumentation at file, function, MPI call, or region levels. It also provides other means to profile the parallel application, and easy visualization of profile data and graphical tracing.

The Parallel Performance Toolkit tools are briefly described in the next sections, which show the many ways to profile and trace a parallel application. For more information about those topics, go to IBM Knowledge Center and click Select a product → Parallel Performance Toolkit → Parallel Performance Toolkit 2.4.0.

Application profiling and tracing tool
Often, the first task in performance tuning an application involves some sort of call graph analysis. For that purpose, the toolkit includes the application profiling and tracing tool, which is used to visualize GNU profile (gprof) files at a large scale.

To use the tool, you must compile the application with the -pg compiler flag, and then run the application to produce the profile data. Use a workload that is as close as possible to the production scenario so that good profile data can be generated. As an alternative, the application can be profiled by using Oprofile and converting the collected data by using the opgprof tool. However, this approach does not scale well for Single Program Multiple Data (SPMD) programs.

The following example shows how to prepare a Fortran 77 application to generate gprof data:
$ mpif77 -g -pg -c compute_pi.f
$ mpif77 -g -pg -o compute_pi compute_pi.o

After the application is run, one gmon.out profile file is built per task.

Because MPI applications generate several performance files (one per rank), hpctView provides a visualization mode specifically to combine the various files and ease data navigation.

To load the gmon.out files into hpctView, complete the following steps:
1. Import the profiled application executable file by clicking File → Open Executable.
2. On the menu bar, select File → Load Call Graph (gmon.out).
3. In the "Select gmon.out Files" window, browse the remote file system to select the gmon.out files that you want to load. Click OK.

The hpctView callgraph visualizer opens, as shown in Figure 9-26.

Figure 9-26 The hpctView callgraph visualizer with example gprof data that is loaded on the callgraph tool

Information about the number of calls to each method, the average amount of time that is spent in each call, and the total percentage of time is presented in the gprof tab, as shown in Figure 9-27.

Figure 9-27 The hpctView callgraph visualizer with an example of gprof loaded

Message Passing Interface profiling and tracing
Performance analysis of MPI applications usually involves understanding many aspects of the messages that are exchanged between the tasks, such as communication patterns, synchronization, and the data that is moved. To assist with application analysis, use the Parallel Performance Toolkit, which provides tools for profiling, tracing, and visualization of the produced data.

The following profiles and reporting information (per task) are available:
- Number of times that MPI routines are run.
- Total wall time that is spent in each MPI routine.
- Average transferred bytes on message passing routines.
- Highlight of the tasks with minimal, median, and maximum communication time. This information is provided only in the report of task 0.

You can use the profile and trace data to perform many types of analysis, such as communication patterns, hot spots, and data movements.

Example 9-13 shows the textual report for task 0 of an MPI program. The last section of the report (Communication summary for all tasks) shows the minimum (task 7), median (task 4), and maximum (task 5) communication time that was spent on tasks.

Example 9-13 HPC Toolkit: Message Passing Interface profiling report for task 0
$ cat hpct_0_0.mpi.txt
-----------------------------------------------------------------
MPI Routine                  #calls     avg. bytes      time(sec)
-----------------------------------------------------------------
MPI_Comm_size                     1            0.0          0.000
MPI_Comm_rank                     1            0.0          0.000
MPI_Bcast                         1            4.0          0.000
MPI_Barrier                       1            0.0          0.000
MPI_Allreduce                     4           26.0          0.000
-----------------------------------------------------------------
total communication time = 0.000 seconds.
total elapsed time       = 2.725 seconds.
-----------------------------------------------------------------
Message size distributions:

MPI_Bcast                    #calls     avg. bytes      time(sec)
                                  1            4.0          0.000

MPI_Allreduce                #calls     avg. bytes      time(sec)
                                  3            8.0          0.000
                                  1           80.0          0.000

-----------------------------------------------------------------

Communication summary for all tasks:

  minimum communication time = 0.000 sec for task 7
  median  communication time = 0.001 sec for task 4
  maximum communication time = 0.002 sec for task 5

Example 9-14 lists the files that are generated by the tool. Although the parallel job that is used in this example has eight tasks, only the reports of tasks 0, 4, 5, and 7 are available. The MPI profiling tool by default generates reports for the four most significant tasks according to the communication time criteria: Task 0 (an aggregate of all tasks) and tasks with minimum, median, and maximum communication time.

Example 9-14 Listing the files that are generated by the tool
$ ls
hpct_0_0.mpi.mpt  hpct_0_0.mpi.txt  hpct_0_0.mpi.viz
hpct_0_4.mpi.txt  hpct_0_4.mpi.viz
hpct_0_5.mpi.txt  hpct_0_5.mpi.viz
hpct_0_7.mpi.txt  hpct_0_7.mpi.viz

Using the same parallel job as an example, the task with maximum communication time is 5, as shown in Example 9-15.

Example 9-15 Contents of the profiling tasks
$ cat hpct_0_5.mpi.txt
-----------------------------------------------------------------
MPI Routine                  #calls     avg. bytes      time(sec)
-----------------------------------------------------------------
MPI_Comm_size                     1            0.0          0.000
MPI_Comm_rank                     1            0.0          0.000
MPI_Bcast                         1            4.0          0.000
MPI_Barrier                       1            0.0          0.000
MPI_Allreduce                     4           26.0          0.001
-----------------------------------------------------------------
total communication time = 0.002 seconds.
total elapsed time       = 2.725 seconds.
-----------------------------------------------------------------
Message size distributions:

MPI_Bcast                    #calls     avg. bytes      time(sec)
                                  1            4.0          0.000

MPI_Allreduce                #calls     avg. bytes      time(sec)
                                  3            8.0          0.001
                                  1           80.0          0.000

Along with the profile data, the tool records MPI routines calls over time that can be used within the hpctView trace visualizer. This information is useful for analyzing the communication patterns within the parallel program.

The easiest way to profile your MPI application is to load the trace library before the run by exporting the LD_PRELOAD environment variable. However, this approach can produce too much data for large programs.
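For example, with a hypothetical binary that is named my_mpi_app, the trace library can be preloaded for every task through the launcher. This sketch assumes the IBM Spectrum MPI mpirun command, whose -x option exports an environment variable to the tasks:

$ mpirun -np 8 -x LD_PRELOAD=/opt/ibmhpc/ppedev.hpct/lib64/preload/libmpitrace.so ./my_mpi_app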

Using the hpctView instrumentation-run-visualize analysis cycle from within hpctView is also a convenient way to profile the application because it permits fine-grained instrumentation at levels of files, functions, MPI routines, or code regions.

The examples that are shown in this section were generated by using the script that is shown in Example 9-16. To use the preloaded MPI profiling library, compile the parallel program by using the -g -Wl,--hash-style=sysv -Wl,--emit-stub-syms flags.

Example 9-16 Script to profile the Message Passing Interface with HPC Toolkit
#!/bin/bash
#
# Use it as: $ MP_PROCS= poe ./hpct_mpiprofile.sh
#

. /opt/ibmhpc/ppedev.hpct/env_sh

#
# HPCT MPI Trace control variables
#

## Uncomment to set maximum limit of events traced.
## Default is 30000.
#MAX_TRACE_EVENTS=

## Uncomment to generate traces for all ranks.
## Default are 4 tasks: task 0 and tasks with maximum, minimum and median communication time.
#OUTPUT_ALL_RANKS=yes

## Uncomment to enable tracing of all tasks.
## Default are tasks from 0 to 255 ranks.
#TRACE_ALL_TASKS=yes

## Uncomment to set maximum limit of rank traced.
## Default are 0-255 ranks or all if TRACE_ALL_TASKS is set.
#MAX_TRACE_RANK=

## Uncomment to set desired trace back level. Zero is level where MPI function is called.
## Default is the immediate MPI function's caller.
#TRACEBACK_LEVEL=

# Preload MPI Profile library (exported so that it applies to the launched program)
export LD_PRELOAD=/opt/ibmhpc/ppedev.hpct/lib64/preload/libmpitrace.so

$@

Message Passing Interface I/O profiling
Characterization and analysis of I/O activities is an important topic of performance improvement, especially for parallel applications that make critical use of storage. The Message Passing Interface I/O (MPI-IO) tool consists of a profiler and trace recorder for collecting information about I/O system calls that are carried out by the parallel program to access and manipulate files.

The tool records information about I/O events that are triggered when the parallel application accesses files. General statistics per file are calculated out of the events that are probed; for example, the number of times each operation occurred and the total time spent. Also, depending on the event, it collects the following information:
- Total bytes requested
- Total bytes delivered
- Minimum requested size in bytes
- Maximum requested size in bytes
- Rate in bytes
- Suspend wait count
- Suspend wait time
- Forward seeks average
- Backward seeks average

In particular, for read and write events, the tool can trace the operations; for example, it records the start offset, the number of bytes, and the end offset of each operation.

To use MPI-IO, compile the application by using the -g -Wl,--hash-style=sysv -Wl,--emit-stub-syms flags to prepare the binary file for instrumentation.
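For example, a hypothetical MPI program in my_io_app.c can be built with those flags by using the compiler wrapper before it is instrumented with the MPI-IO tool (the file and binary names are illustrative):

$ mpicc -g -Wl,--hash-style=sysv -Wl,--emit-stub-syms -o my_io_app my_io_app.c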

You can also use hpctView to easily cycle through the instrumentation, execution, and visualization of the information that is produced by the MPI-IO tool. The tool abstracts the many environment variables that can be used to control the type and amount of information that is gathered.

Figure 9-28 shows the hpctView visualization mode for data that is generated by using the MPI-IO tool. More information about each I/O system call is provided on a per-task basis.

Figure 9-28 hpctView: Message Passing Interface I/O tool profiling visualization

Another view of hpctView displays the I/O trace data, as shown in Figure 9-29.

Figure 9-29 hpctView: Message Passing Interface I/O tool trace visualization

Hardware Performance Monitor
The HPM tool is an analogous tool that provides easy profiling of SPMD programs on Linux on Power Systems. By using HPM, you can profile MPI programs regarding any of the available hardware events or obtain any of the predefined metrics that are most commonly used in performance analysis.

The Performance Monitor Unit (PMU) of the POWER8 processor is a part of the processor that is dedicated to recording hardware events. It has six Performance Monitor Counter (PMC) registers: Registers 0 - 3 can be programmed to count any of the more than 1000 events that are available, register 4 can count run instructions that are completed, and register 5 can count run cycles.

These events are important because they can reveal performance issues, such as pipeline bubbles, inefficient use of caches, and the high ratio of branch misprediction from the perspective of the processor. Metrics for performance measurements can also be calculated from hardware events, such as instructions per cycle, millions of instructions per second (MIPS), and memory bandwidth.

For single programs, tools such as Oprofile and Perf provide access to system-wide or application profiling of those hardware events on POWER8 processors. Parallel SPMD programs are often difficult to profile with these traditional tools because they are designed to deal with a single process (whether multi-threaded or not).
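As a quick illustration for a single (non-MPI) process, the perf tool can count a few PMU events, from which metrics such as instructions per cycle can be derived. The event names and the my_serial_app binary in this sketch are illustrative:

$ perf stat -e cycles,instructions,cache-misses ./my_serial_app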

For more information about Oprofile and Perf, see the following resources:
- Oprofile tool website
- Perf tool Wiki webpage

For more information about predefined metrics, see the Derived metrics defined for POWER8 architecture page.

Figure 9-30 shows an HPM report that is opened by using hpctView.

Figure 9-30 hpctView: HPM tool visualization

GPU Performance Monitoring The GPU Performance Monitoring (GPM) tool profiles and traces hardware events that are triggered in the GPU on hybrid MPI with CUDA C programs. Similar to HPM, GPM profiles raw hardware events and provides a set of predefined metrics that are calculated out of those events.

The GPM tool can profile raw hardware events and obtain metrics from a predefined list. The gpmlist command is the interface that is used to fetch the list of events and supported metrics. Example 9-17 shows a sample of the list of events that are available for the NVIDIA P100 GPU.

Example 9-17 How to list events that are available to the GPM profile tool
$ /opt/ibmhpc/ppedev.hpct/bin/gpmlist -d p100 -e -l
Domain 0
  83886081 tex0_cache_sector_queries - queries
  83886082 tex1_cache_sector_queries - queries
  83886083 tex0_cache_sector_misses - misses
  83886084 tex1_cache_sector_misses - misses
Domain 1
  83886182 active_cycles - active_cycles
  83886183 elapsed_cycles_sm - elapsed_cycles_sm
Domain 2
  83886085 fb_subp0_read_sectors - sectors
  83886086 fb_subp1_read_sectors - sectors
  83886087 fb_subp0_write_sectors - sectors
  83886088 fb_subp1_write_sectors - sectors
<... Output Omitted ...>
Domain 3
  83886134 gld_inst_8bit - gld_inst_8bit
  83886135 gld_inst_16bit - gld_inst_16bit
  83886136 gld_inst_32bit - gld_inst_32bit
  83886137 gld_inst_64bit - gld_inst_64bit
  83886138 gld_inst_128bit - gld_inst_128bit
<... Output Omitted ...>
Domain 4
  83886166 active_cycles_in_trap - active_cycles_in_trap
Domain 5
  83886144 prof_trigger_00 - pmtrig0
  83886145 prof_trigger_01 - pmtrig1
<... Output Omitted ...>

Example 9-18 shows a sample of the metrics that are available for the NVIDIA P100 GPU.

Example 9-18 How to list metrics that are available to the GPM profile tool
$ /opt/ibmhpc/ppedev.hpct/bin/gpmlist -d p100 -m -l
 19922945 inst_per_warp - Instructions per warp
   83886156 inst_executed - inst_executed
   83886152 warps_launched - warps_launched
 19922946 branch_efficiency - Branch Efficiency
   83886177 branch - branch
   83886176 divergent_branch - branch
 19922947 warp_execution_efficiency - Warp Execution Efficiency
   83886157 thread_inst_executed - thread_inst_executed
   83886156 inst_executed - inst_executed
<... Output Omitted ...>

As with other HPC Toolkit tools, GPM is also flexible regarding the instrument-profile-visualize cycle possibilities. It can also be used with the HPM tool.

The following environment variables are used to configure the profiler:
- GPM_METRIC_SET: Sets the metrics to be profiled.
- GPM_EVENT_SET: Sets the hardware events to be profiled. Cannot be exported with GPM_METRIC_SET.
- GPM_VIZ_OUTPUT=y: Enables the creation of visualization files that are used by the hpctView visualizer (disabled by default).
- GPM_STDOUT=n: Suppresses the profiler messages to standard output (stdout). Sends messages to stdout by default.
- GPM_ENABLE_TRACE=y: Turns on trace mode (is off by default).

Example 9-19 shows a profiling session of a hybrid MPI and CUDA C application that is named a.out. Lines 1 - 26 are the content of the gpm.sh script, which exports the GPM control variables (lines 13 - 15) that instruct the system to calculate the sm_efficiency metric, and then invokes the wrapper that is named gpm_wrap.sh. In turn, the wrapper script (lines 29 - 35) preloads (line 33) the GPM library. The remaining lines are messages that are printed by GPM on standard output.

Example 9-19 Profiling with the GPM tool
 1 $ cat gpm.sh
 3 #!/bin/bash
 7 export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-7.5/lib64:/opt/ibmhpc/pecurrent/mpich/gnu/lib64:/opt/ibmhpc/pecurrent/gnu/lib64/
11 # GPU Performance Monitor - variables
13 export GPM_METRIC_SET=sm_efficiency
15 export GPM_VIZ_OUTPUT=y
19 export MP_CUDA_AWARE=yes
21 export MP_RESD=poe
23 export MP_PROCS=2
25 poe ./gpm_wrap.sh
27 $ cat gpm_wrap.sh
29 #!/bin/bash
33 export LD_PRELOAD=/opt/ibmhpc/ppedev.hpct/lib64/preload/libgpm.so:$LD_PRELOAD
35 ./a.out
37 $ ./gpm.sh
41 GPM (IBM HPC Toolkit for PE Developer Edition) results
45 Device 0 (Tesla K80):
47 --------------------------------------
49 Symbol Name: cudaMalloc
51 --------------------------------------
53 Memory Kind: cudaMalloc
55 Bytes Copied: 4096
59 Symbol Name: cudaMemcpy
61 --------------------------------------
63 Memory Kind: cudaMemcpyHostToDevice
65 Bytes Copied: 4096
69 Kernel Name: _Z12vecIncKernelPii
71 --------------------------------------
73 GPU Kernel Time: 0.000005 seconds
75 W/clock Time: 0.012338 seconds
77 sm_efficiency: 31.155739%
81 Symbol Name: cudaMemcpy
83 --------------------------------------
85 Memory Kind: cudaMemcpyDeviceToHost
87 Bytes Copied: 4096
91 Totals for device 0
93 --------------------------------------
95 Total Memory Copied: 8192 bytes
97 Total GPU Kernel Time: 0.000005 seconds
99 Total W/clock Time: 0.012338 seconds
101 sm_efficiency: 31.155739%

As a result, the tool reports the calculated metric or counted event on per-task files. Example 9-20 shows the report that is generated for execution in Example 9-19 on page 305.

Example 9-20 Report that is generated by the GPM tool
$ cat hpct_0_0.gpm.a.out.txt

GPM (IBM HPC Toolkit for PE Developer Edition) results

Totals for device 0
-------------------
Total Memory Copied: 8192 bytes
Total GPU Kernel Time: 0.000005 seconds
Total W/clock Time: 0.012338 seconds
sm_efficiency: 31.155739%

For more information about the GPM tool, see the Using GPU hardware counter profile page.

IBM HPC Toolkit hpctView
hpctView is a GUI front end for the Parallel Performance Toolkit runtime and command-line tools. It provides a GUI to instrument the parallel executable file, run it to collect data, and visualize the information from your development workstation (desktop or notebook).

The tool can be run from the developer workstation. Version 2.3 is supported on Mac OS 10.9 (Mavericks) or later, Microsoft Windows 64-bit, and any 64-bit Linux distribution.

You can find hpctView as a .tar file in the toolkit's distribution media. Then, proceed as shown in the following example to install it and run it on a Linux workstation:
$ tar xvf hpctView-2.4.0-1-linux-gtk-x86_64.tar.gz
$ cd hpctView
$ ./hpctView &

The hpctView application implements the complete cycle of instrument-run-visualize (see 9.6.1, “Parallel Performance Toolkit” on page 295) that is required to use the Parallel Performance Toolkit tools. As an alternative, profile and trace data files can be generated by using the command-line tools and libraries and then loaded into hpctView for visualization and analysis.

In addition to the hpctView stand-alone application, the toolkit includes plug-ins to enable hpctView within Eclipse IDE. For more information, see 9.6.3, “Eclipse for Parallel Application Developers” on page 310.

The first step to profile and trace an application is to prepare the program execution. To instrument the binary file, complete the following steps in hpctView:
1. In the Instrumentation pane (left side pane), select the profiler for which you want to instrument the binary. The options are HPM, MPI, OpenMP, and MPI-IO.

2. Click File → Open Executable. A window opens that includes a connection to the remote machine. Select the file to be instrumented, as shown in Figure 9-31.

Figure 9-31 hpctView: Selecting the binary file for instrumentation

3. Click New to create a connection to the remote system where the binary file is hosted if the connection does not exist.
4. Select the binary file. Click OK. The binary file content is analyzed, and sections that correspond to the instrumentation portions of the application code are listed in tree format in the Instrumentation pane. Different instrumentation options are displayed for each type of profiling tool. For example, Figure 9-32 shows the options (Function body, Function Call, and HPM Region) for the instrumentation of a binary file that will be used with the HPM tool. Select the instrumentation points according to your needs.

Figure 9-32 hpctView: Binary file instrumentation

5. Click the instrument executable icon in the Instrumentation pane. If the operation succeeds, a file that is named .inst is saved in the same folder as the original binary file.

From within hpctView, you can then start the instrumented binary file by using IBM Spectrum Message Passing Interface (IBM Spectrum MPI) or submitting a batch job to IBM Spectrum LSF. The Parallel Performance Toolkit provides many environment variables to control the tools behavior (for example, the amount of data to be collected).

After the instrumented binary file is run, a view of the corresponding tool automatically opens. Now, you can use the many options that are available to analyze the data that is collected, for example, the interval trace viewer.

For more information about using hpctView, see the Using the hpctView application page.

9.6.2 Parallel application debuggers

The GNU Debugger (GDB) is a debugger for serial and multi-threaded applications. It also can be used to debug parallel SPMD programs. With IBM Spectrum MPI, GDB can be attached to individual MPI processes, or many debug instances can be started by running the mpirun command. Both techniques require a complex setup, do not scale well, and often do not deliver suitable results.
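One common workaround is to start one debugger instance per task under the control of the launcher, for example by opening a terminal window that runs GDB for each rank. This sketch assumes a graphical session (X forwarding) and a hypothetical binary that is named my_mpi_app:

$ mpirun -np 4 xterm -e gdb ./my_mpi_app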

Using specialized debuggers for parallel applications is recommended. The following products support debugging of C, C++, and Fortran applications in Power Systems servers:
- Allinea DDT debugger
- RogueWave TotalView

These parallel debuggers are tightly integrated with IBM Spectrum MPI. Using the mpirun --debug command starts DDT or TotalView if they are reachable in the system’s PATH.

For more information about DDT and TotalView integration with IBM Spectrum MPI, see the following pages:
- Debugging applications with the Allinea DDT debugger and IBM Spectrum MPI
- Debugging applications with the TotalView debugger and IBM Spectrum MPI

9.6.3 Eclipse for Parallel Application Developers

Eclipse for Parallel Application Developers provides an entire IDE. It bundles some open source projects under the Eclipse umbrella, such as C/C++ Development Tools (CDT) and Eclipse PTP. It provides a complete environment for coding, compiling, starting, analyzing, and debugging applications that are written in C, C++, and Fortran, and it supports MPI, OpenMP, PAMI, and OpenSHMEM.

The specialized editors feature syntax highlighting, code assistance, place markers for compiler errors, auto-completion, and many other features to improve productivity. They also include static analysis tools that are used to identify common coding mistakes, such as mismatched MPI barriers.

You can build programs by using Eclipse managed makefiles (integrated with IBM XL and GNU compilers), Makefile, Autotools, and custom scripts and commands. The integration with IBM XL and GNU compilers also includes build output parsers that can correlate errors to source code files.

Remotely started applications on IBM Spectrum MPI (OpenMPI) or IBM Spectrum LSF are supported.

The Eclipse for Parallel Application Developers parallel debugger provides specific debugging features for parallel applications that distinguish it from the Eclipse debugger for serial applications. Its parallel debugger can also start and remotely debug parallel applications through OpenMPI and IBM Spectrum LSF.

For more information about the topics that are introduced in this section, see Parallel Tools Platform (PTP) User Guide.

To install Eclipse, download its .tar file and then decompress it on your development workstation. For more information, see Eclipse for Parallel Application Developers. At the time of writing, the latest Eclipse version was 4.6.2 (also known as Eclipse Neon.2).

Remote synchronized development model
Eclipse for Parallel Application Developers implements a remote development working model in which the code editing is done locally within a workbench (GUI). However, other tasks that must be performed on the server side, such as build, start, debug, and profile, are carried out remotely.

The model uses a synchronized project type to keep local and remote copies of the working directory updated so that code programming is minimally affected by network latency or slow response time in the editor. The C, C++, and Fortran editors appear as though you are developing locally on the Power Systems machine even though all of the resources are on the server side. By using this approach, you do not need to rely on a cross-compilation environment, which is often difficult to set up. This approach provides an efficient method for remote development.

Parallel debugger
The Eclipse PTP parallel debugger extends the default C/C++ Eclipse debugger for serial applications to support parallel programs at scale. It combines the information that is obtained from the many processes and threads that compose a parallel job into a single viewer. You can still browse the entities separately, if needed.

Debugging operations, such as stepping in and out and stopping at a breakpoint, can be carried out on any arbitrary subset of processes within a parallel job. A different type of breakpoint, which is called a parallel breakpoint, can be applied to these collections of processes.

The debugger can start and remotely debug parallel programs by using several types of workload managers (WLMs), including the OpenMPI and IBM Spectrum LSF.

Using IBM hpctView within Eclipse
The IBM Parallel Performance Toolkit provides plug-ins for hpctView (see 9.6.1, "Parallel Performance Toolkit" on page 295) that you can install on top of Eclipse.

Find the plug-ins in the Parallel Performance Toolkit distribution media. Then, decompress the file by running the following commands:
$ mkdir ppedev_update-2.3.0
$ unzip ppedev_update-2.3.0-0.zip -d ppedev_update-2.3.0

Next, install the plug-ins by using the Eclipse installation software by completing the following steps:
1. From the tool bar menu, select Help → Install New Software.
2. Click Add to create a repository location from where Eclipse can find plug-ins to be installed.
3. Enter the location where the hpctView update site files were decompressed. Click OK to add the repository configuration.
4. In the selection tree that appears, select its root element to install all of the plug-ins. Click Next and then Next again.
5. Read and accept the license agreement to continue the installation.
6. Click Finish to install the plug-ins.

For more information about installing the plug-ins, see Updating and installing software. In the left side navigation tree at the website, click Workbench User Guide → Tasks → Updating and installing software.

To use hpctView, open its perspective within Eclipse by completing the following steps:
1. Click Window → Perspective → Open Perspective → Other.
2. In the perspective window that opens, select HPCT and click OK.

9.6.4 NVIDIA Nsight Eclipse Edition for CUDA C/C++

The NVIDIA Nsight Eclipse Edition is a complete IDE for development of CUDA C and C++ applications that run on NVIDIA GPUs. It provides developers with an Eclipse-based GUI that includes tools for a complete cycle of development: Code writing, compiling, debugging, and profiling and tuning.

The following development models are available:
- Local: The complete development cycle is carried out on the machine where Nsight is running. The program that is produced is targeted for the local host architecture and GPU.
- Remote: Nsight runs on the local workstation, and the build, run, debug, and profile tasks are carried out on a remote host.

In the next sections, we describe the basic use of Nsight for remote development of CUDA C applications for Power Systems servers. The examples that are described in this section use a workstation with Linux Fedora 23 64-bit that is running the NVIDIA Nsight Eclipse Edition that is included with the CUDA Toolkit 8.0.

Not all of the Nsight functions are described in this book. Instead, we show how to create, compile, and debug projects. For more information, see the Nsight Eclipse Edition website.

Running NVIDIA Nsight Eclipse Edition
To start the Nsight GUI, run the following commands:
$ export PATH=/usr/local/cuda-8.0/bin:$PATH
$ nsight &

Creating a project for remote development
Before you begin, check the requirements and complete the following steps:
1. Ensure that the machine (usually the cluster login node) that you are going to use for remote development has Git installed. Nsight uses git commands to synchronize the local and remote copies of the project files.
2. Connect to the remote machine to configure the user name and email address that are used by Git by running the following commands:
   $ git config --global user.name "Jan-Frode Myklebust"
   $ git config --global user.email "[email protected]"
3. Ensure that the directory that is going to hold the project files on the remote machine is created by running the following command:
   $ mkdir -p ~/cudaBLAS

In the Nsight GUI, to create a CUDA C/C++ project, complete the following steps: 1. On the main menu toolbar, click File → New → CUDA C/C++ Project to open the window that is shown in Figure 9-33.

Figure 9-33 Nsight: New CUDA C/C++ project wizard

2. Click Next and then click Next again.

3. Select 3.7 for the Generate PTX code and Generate GPU code options, as shown in Figure 9-34. Click Next.

Figure 9-34 Nsight: New CUDA C/C++ project basic settings

4. Click Manage (as shown in Figure 9-35) to configure a new connection to the remote machine.

Figure 9-35 Nsight: New CUDA C/C++ project remote connection setup

5. As shown in Figure 9-35, enter the connection information fields (host name and user name). Click OK.

6. Enter the Project path and Toolkit information and complete the CPU Architecture fields for the newly created connection (see Figure 9-36). You must remove the Local System configuration. Click Next and then Finish.

Figure 9-36 Nsight: New CUDA C/C++ project further remote configuration setup

After you successfully create the project, complete the following steps to configure Nsight to automatically synchronize the local and remote project files:
1. Create a source code file (for example, main.c).
2. Right-click the project in the Project Explorer pane and select Synchronize → Set active. Then, choose the option that matches the name of the connection configuration to the remote machine.
3. Right-click the project again and select Synchronize → Sync Active now to perform the first synchronization between the local and remote folders. (A synchronization can be performed at any time.) Nsight is configured to perform a synchronization after or before some tasks (for example, before compiling the project).

4. As an optional step, set the host compiler that is invoked by nvcc. Right-click the project name and select Properties. Expand Build, and then choose Settings. Click the Tool Settings tab and select Build Stages. Enter the necessary information in the Compiler Path and Preprocessor options fields (see Figure 9-37). Change the content of the Compiler Path under the Miscellaneous section of the NVCC Linker window.

Figure 9-37 Nsight: New CUDA C/C++ project build settings

After completing these steps, the new project is ready for CUDA C/C++ programming. You can use many of the features that are provided by the Nsight C and C++ code editor, such as syntax highlighting, code completion, static analysis, and error markers.

Compiling the application
To build the application, complete the following steps:
1. Right-click the project in the Project Explorer pane and select Run As → Remote C/C++ Application.
2. Check that all the information is correct and change any information, if needed. Then, click Run.

Debugging the application
The steps to start the Nsight debugger are similar to the steps that are used to run the application. However, the executable file is started by running the cuda-gdb command, which in turn is backed by the GNU GDB debugger tool.

For more information about the cuda-gdb command, see 9.6.5, "Command-line tools for CUDA C/C++" on page 318.

To debug the application, complete the following steps:
1. Right-click the project name in the Project Explorer pane and select Debug As → Remote C/C++ Application.
2. A window opens in which you are prompted for permission to open the Eclipse Debug perspective. Click Yes.

By default, the debugger stops at the first instruction on the main method. Figure 9-38 shows the Nsight debugger.

Figure 9-38 Nsight: CUDA C/C++ debugger

9.6.5 Command-line tools for CUDA C/C++

The NVIDIA CUDA Toolkit 8.0 provides the following tools for debugging problems in CUDA C and C++ applications.

CUDA-MEMCHECK
The CUDA-MEMCHECK tool detects memory-related flaws in programs. It dynamically instruments the program executable file at run time and can check allocations, deallocations, and accesses on global and shared memory.

In addition to memory checking, CUDA-MEMCHECK also can inspect the application to detect the following types of errors:
- Race condition: Checks for race conditions in accesses to shared and local variables. Uses the --tool racecheck option.
- initcheck: Checks for use of non-initialized memory. Uses the --tool initcheck option.
- synccheck: Checks for synchronization errors. Uses the --tool synccheck option.

For complete visualization of the source code file and the line number where the detected problems occur, the binary file must be compiled by using the nvcc -g -lineinfo options.
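For example, a hypothetical kernel in vector_add.cu can be compiled with those options and then checked for memory errors and race conditions; the file and binary names are illustrative:

$ nvcc -g -lineinfo -o vector_add vector_add.cu
$ cuda-memcheck ./vector_add                      # Default memory access checking
$ cuda-memcheck --tool racecheck ./vector_add     # Shared memory race detection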

For more information about the CUDA-MEMCHECK tool, see the CUDA-MEMCHECK user’s manual.

CUDA GDB
A CUDA application runs portions of its code on the CPU (host) and on the GPUs (devices), and programs often have thousands of CUDA threads that are spread across many GPUs. Although traditional debuggers work effectively with CPU code, they are not built to deal with both.

The cuda-gdb tool is a debugger extension of GNU GDB that is designed to deal with the scenario that the CUDA hybrid model imposes.
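A minimal debugging session, assuming a hypothetical vector_add.cu that is compiled with device debug information (the -G option generates device-side debug symbols), looks like the following sketch:

$ nvcc -g -G -o vector_add vector_add.cu
$ cuda-gdb --args ./vector_add
(cuda-gdb) break main
(cuda-gdb) run
(cuda-gdb) info cuda kernels     # List the kernels that are currently running on the GPUs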

The nvprof tool
The nvprof tool is an extensive command-line profiling tool that is distributed with the CUDA Toolkit. It is flexible: it can profile all processes in the entire system, only a specific application, or only some elements of it (for example, kernels, streams, and contexts). Also, you can select which GPU device must be profiled (the default is all the GPUs on the system).

By default, nvprof profiles the time that is spent in calls to kernel functions and the CUDA API, as shown in Example 9-21, "Time profiling kernels and CUDA API" on page 319.

Example 9-21 Time profiling kernels and CUDA API $ nvprof --log-file profile.out ./MonteCarloMultiGPU 2&> /dev/null $ cat profile.out ==73495== NVPROF is profiling process 73495, command: ./MonteCarloMultiGPU 2 ==73495== Profiling application: ./MonteCarloMultiGPU 2 ==73495== Profiling result: Time(%) Time Calls Avg Min Max Name 79.29% 97.066ms 4 24.267ms 24.169ms 24.396ms MonteCarloOneBlockPerOption(curandStateXORWOW*, __TOptionData const *, __TOptionValue*, int, int) 20.67% 25.306ms 4 6.3264ms 6.2893ms 6.3711ms rngSetupStates(curandStateXORWOW*, int) 0.03% 37.344us 4 9.3360us 8.9600us 10.048us [CUDA memcpy HtoD] 0.01% 12.928us 4 3.2320us 2.8480us 3.5840us [CUDA memcpy DtoH]

==73495== API calls: Time(%) Time Calls Avg Min Max Name 85.47% 547.62ms 4 136.90ms 131.25ms 143.30ms cudaStreamCreate 3.77% 24.160ms 4 6.0401ms 195.00us 22.889ms cudaEventSynchronize 2.03% 12.999ms 12 1.0832ms 1.0394ms 1.1578ms cudaMalloc 1.84% 11.818ms 16 738.65us 714.22us 818.61us cudaGetDeviceProperties 1.19% 7.6318ms 4 1.9079ms 1.8529ms 1.9930ms cudaMallocHost 1.18% 7.5904ms 4 1.8976ms 1.8906ms 1.9076ms cudaHostAlloc 1.04% 6.6817ms 8 835.22us 17.319us 1.6565ms cudaLaunch 0.98% 6.2936ms 4 1.5734ms 14.881us 6.2399ms cudaDeviceSynchronize 0.86% 5.5070ms 8 688.37us 673.41us 748.69us cudaFreeHost 0.66% 4.2604ms 4 1.0651ms 1.0615ms 1.0672ms cuDeviceTotalMem 0.48% 3.0439ms 364 8.3620us 394ns 303.12us cuDeviceGetAttribute 0.38% 2.4584ms 12 204.87us 194.19us 230.03us cudaFree 0.04% 251.96us 4 62.989us 61.403us 64.379us cuDeviceGetName 0.02% 132.76us 8 16.594us 11.935us 31.873us cudaMemcpyAsync

0.01% 77.114us 28 2.7540us 1.6600us 15.832us cudaSetDevice 0.01% 53.252us 4 13.313us 12.214us 16.493us cudaStreamDestroy 0.00% 23.942us 4 5.9850us 5.1930us 7.5630us cudaEventRecord 0.00% 21.598us 28 771ns 662ns 1.2580us cudaSetupArgument 0.00% 19.164us 4 4.7910us 4.4390us 5.1540us cudaEventCreate 0.00% 17.214us 4 4.3030us 2.7130us 5.3020us cudaEventDestroy 0.00% 8.7520us 8 1.0940us 744ns 1.6830us cudaConfigureCall 0.00% 7.1430us 12 595ns 422ns 795ns cuDeviceGet 0.00% 7.1380us 8 892ns 707ns 1.1000us cudaGetLastError 0.00% 6.2230us 3 2.0740us 719ns 4.6540us cuDeviceGetCount 0.00% 4.3670us 1 4.3670us 4.3670us 4.3670us cudaGetDeviceCount

Time profiling is a good starting point towards improving an application. As shown in Example 9-21, "Time profiling kernels and CUDA API" on page 319, it reveals that the MonteCarloOneBlockPerOption kernel ran in a total of 97.066 ms (79.29%), while the cudaStreamCreate API calls used 547.62 ms (85.47%).

Another way to view the application behavior that is shown in Example 9-21, "Time profiling kernels and CUDA API" on page 319 is to determine the execution critical path, which combines the information that is gathered from the CPU (CUDA API calls) and GPU (kernel invocations) executions. The nvprof dependency analysis mode8 produces a report for the critical path when the --dependency-analysis option is used (see Example 9-22).

Example 9-22 Reporting the execution critical path $ nvprof --log-file profile.out --dependency-analysis ./MonteCarloMultiGPU 2&> /dev/null $ cat profile.out ==73532== NVPROF is profiling process 73532, command: ./MonteCarloMultiGPU 2 ==73532== Profiling application: ./MonteCarloMultiGPU 2 ==73532== Profiling result: Time(%) Time Calls Avg Min Max Name 79.32% 97.266ms 4 24.316ms 24.010ms 24.473ms MonteCarloOneBlockPerOption(curandStateXORWOW*, __TOptionData const *, __TOptionValue*, int, int) 20.64% 25.310ms 4 6.3274ms 6.2963ms 6.3661ms rngSetupStates(curandStateXORWOW*, int) 0.03% 36.544us 4 9.1360us 8.7680us 9.6000us [CUDA memcpy HtoD] 0.01% 12.992us 4 3.2480us 2.8160us 3.6800us [CUDA memcpy DtoH]

==73532== API calls: Time(%) Time Calls Avg Min Max Name 86.40% 590.17ms 4 147.54ms 138.74ms 162.78ms cudaStreamCreate 3.51% 23.995ms 4 5.9988ms 38.611us 23.186ms cudaEventSynchronize 1.92% 13.088ms 12 1.0907ms 1.0670ms 1.1258ms cudaMalloc 1.70% 11.595ms 16 724.69us 713.95us 737.43us cudaGetDeviceProperties 1.13% 7.6917ms 4 1.9229ms 1.9016ms 1.9583ms cudaMallocHost 1.12% 7.6606ms 4 1.9152ms 1.8882ms 1.9460ms cudaHostAlloc 0.97% 6.6572ms 8 832.15us 17.940us 1.6505ms cudaLaunch 0.92% 6.2873ms 4 1.5718ms 13.461us 6.2349ms cudaDeviceSynchronize 0.81% 5.5394ms 8 692.43us 675.34us 759.93us cudaFreeHost 0.62% 4.2598ms 4 1.0650ms 1.0618ms 1.0691ms cuDeviceTotalMem 0.45% 3.0713ms 364 8.4370us 390ns 308.84us cuDeviceGetAttribute 0.36% 2.4474ms 12 203.95us 193.04us 230.29us cudaFree

8 New in nvprof in NVIDIA CUDA Toolkit 8.0.

0.04% 254.18us 4 63.544us 61.981us 65.150us cuDeviceGetName 0.02% 135.26us 8 16.907us 11.896us 31.121us cudaMemcpyAsync 0.01% 73.833us 28 2.6360us 1.5100us 15.121us cudaSetDevice 0.01% 52.310us 4 13.077us 11.761us 16.365us cudaStreamDestroy 0.00% 24.634us 4 6.1580us 5.3050us 7.7970us cudaEventRecord 0.00% 21.074us 28 752ns 666ns 1.1720us cudaSetupArgument 0.00% 18.341us 4 4.5850us 4.2260us 5.2930us cudaEventCreate 0.00% 15.729us 4 3.9320us 2.6210us 5.1060us cudaEventDestroy 0.00% 8.5760us 8 1.0720us 801ns 1.9340us cudaConfigureCall 0.00% 7.3530us 12 612ns 441ns 844ns cuDeviceGet 0.00% 6.8210us 8 852ns 689ns 1.0040us cudaGetLastError 0.00% 6.4790us 3 2.1590us 681ns 4.9340us cuDeviceGetCount 0.00% 4.4320us 1 4.4320us 4.4320us 4.4320us cudaGetDeviceCount

==73532== Dependency Analysis: Critical path(%) Critical path Waiting time Name 83.03% 590.169908ms 0ns cudaStreamCreate 4.14% 29.393820ms 0ns 3.38% 24.010192ms 0ns MonteCarloOneBlockPerOption(curandStateXORWOW*, __TOptionData const *, __TOptionValue*, int, int) 1.84% 13.088064ms 0ns cudaMalloc 1.63% 11.595084ms 0ns cudaGetDeviceProperties 1.08% 7.691710ms 0ns cudaMallocHost 1.08% 7.660636ms 0ns cudaHostAlloc 0.89% 6.296326ms 0ns rngSetupStates(curandStateXORWOW*, int) 0.78% 5.539428ms 0ns cudaFreeHost 0.70% 4.998013ms 0ns cudaLaunch 0.60% 4.259809ms 0ns cuDeviceTotalMem_v2 0.43% 3.071325ms 0ns cuDeviceGetAttribute 0.34% 2.447406ms 0ns cudaFree 0.04% 254.177000us 0ns cuDeviceGetName 0.02% 108.747000us 0ns cudaMemcpyAsync 0.01% 58.902000us 0ns cudaSetDevice 0.01% 52.310000us 0ns cudaStreamDestroy_v5050 0.00% 19.329000us 0ns cudaEventRecord 0.00% 18.341000us 0ns cudaEventCreate 0.00% 17.468000us 0ns cudaSetupArgument 0.00% 15.729000us 0ns cudaEventDestroy 0.00% 9.600000us 0ns [CUDA memcpy HtoD] 0.00% 7.747000us 0ns cudaConfigureCall 0.00% 7.353000us 0ns cuDeviceGet 0.00% 6.479000us 0ns cuDeviceGetCount 0.00% 5.051000us 0ns cudaGetLastError 0.00% 4.432000us 0ns cudaGetDeviceCount 0.00% 3.680000us 0ns [CUDA memcpy DtoH] 0.00% 0ns 23.940847ms cudaEventSynchronize 0.00% 0ns 6.228596ms cudaDeviceSynchronize

Example 9-23 shows application information by using the nvprof --cpu-profiling flag, which enables the capture of operations that run on the CPU.

Example 9-23 Collecting information about operations that run in the CPU $ nvprof --log-file profile.out --cpu-profiling on --cpu-profiling-thread-mode separated ./MonteCarloMultiGPU 2&> /dev/null $ cat profile.out

======CPU profiling result (bottom up): Time(%) Time Name ======Thread 17592186332928 65.66% 650ms cuDevicePrimaryCtxRetain 65.66% 650ms | cudart::contextStateManager::initPrimaryContext(cudart::device*) 65.66% 650ms | cudart::contextStateManager::initDriverContext(void) 65.66% 650ms | cudart::contextStateManager::getRuntimeContextState(cudart::contextState**, bool) 65.66% 650ms | cudart::doLazyInitContextState(void) 65.66% 650ms | cudart::cudaApiStreamCreate(CUstream_st**) 65.66% 650ms | cudaStreamCreate 65.66% 650ms | _ZL11multiSolverP11TOptionPlani 17.17% 170ms cuInit 17.17% 170ms | cudart::__loadDriverInternalUtil(void) 17.17% 170ms | __pthread_once 17.17% 170ms | cudart::cuosOnce(int*, void (*) (void)) 17.17% 170ms | cudart::globalState::initializeDriver(void) 17.17% 170ms | cudaGetDeviceCount 17.17% 170ms | main 3.03% 30ms ??? 3.03% 30ms | getcontext@@GLIBC_2.17 2.02% 20ms cuMemFreeHost 2.02% 20ms | cudart::driverHelper::freeHost(void*) 2.02% 20ms | cudart::cudaApiFreeHost(void*) 2.02% 20ms | cudaFreeHost 2.02% 20ms | closeMonteCarloGPU 2.02% 20ms cuLaunchKernel 2.02% 20ms | _ZN6cudartL19cudaApiLaunchCommonEPKvb 2.02% 20ms | cudaLaunch 2.02% 20ms | cudaError cudaLaunch(char*) 2.02% 20ms cuDeviceGetAttribute 2.02% 20ms | cudart::device::updateDeviceProperties(void) 2.02% 20ms | cudart::cudaApiGetDeviceProperties(cudaDeviceProp*, int) 2.02% 20ms | cudaGetDeviceProperties 1.01% 10ms | _ZL11multiSolverP11TOptionPlani 1.01% 10ms | adjustGridSize(int, int) 1.01% 10ms random 1.01% 10ms | rand 1.01% 10ms | randFloat(float, float) 1.01% 10ms | main 1.01% 10ms | generic_start_main.isra.0 1.01% 10ms cuMemHostAlloc 1.01% 10ms | cudart::driverHelper::mallocHost(unsigned long, void**, unsigned int) 1.01% 10ms | cudart::cudaApiHostAlloc(void**, unsigned long, unsigned int)

322 IBM High-Performance Computing Insights with IBM Power System AC922 Clustered Solution 1.01% 10ms | cudaHostAlloc 1.01% 10ms | _ZL14cudaMallocHostPPvmj 1.01% 10ms __exp_finite 1.01% 10ms | exp 1.01% 10ms | BlackScholesCall 1.01% 10ms | main 1.01% 10ms | generic_start_main.isra.0 1.01% 10ms cuDeviceTotalMem_v2 1.01% 10ms | cudart::deviceMgr::enumerateDevices(void) 1.01% 10ms | cudart::globalState::initializeDriverInternal(void) 1.01% 10ms | cudart::globalState::initializeDriver(void) 1.01% 10ms | cudaGetDeviceCount 1.01% 10ms | main 1.01% 10ms ??? 1.01% 10ms | cudart::contextState::loadCubin(bool*, void**) 1.01% 10ms | | cudart::globalModule::loadIntoContext(cudart::contextState*) 1.01% 10ms | | cudart::contextState::applyChanges(void) 1.01% 10ms | | cudart::contextStateManager::getRuntimeContextState(cudart::contextState**, bool) 1.01% 10ms | | cudart::doLazyInitContextState(void) 1.01% 10ms | | cudart::cudaApiStreamCreate(CUstream_st**) 1.01% 10ms | | cudaStreamCreate 1.01% 10ms | | _ZL11multiSolverP11TOptionPlani 1.01% 10ms cuMemAlloc_v2 1.01% 10ms | cudart::driverHelper::mallocPtr(unsigned long, void**) 1.01% 10ms | cudart::cudaApiMalloc(void**, unsigned long) 1.01% 10ms | cudaMalloc 1.01% 10ms | initMonteCarloGPU 1.01% 10ms __log_finite 1.01% 10ms | log 1.01% 10ms | BlackScholesCall 1.01% 10ms | main 1.01% 10ms | generic_start_main.isra.0 1.01% 10ms cuEventSynchronize 1.01% 10ms cudart::cudaApiEventSynchronize(CUevent_st*) 1.01% 10ms cudaEventSynchronize 1.01% 10ms _ZL11multiSolverP11TOptionPlani ======Thread 21990246248880 86.87% 860ms ??? 86.87% 860ms | start_thread 86.87% 860ms | | clone ======Thread 23089793855920 68.69% 680ms ??? 68.69% 680ms | start_thread 68.69% 680ms | | clone ======Thread 24189306532272 52.53% 520ms ??? 52.53% 520ms | start_thread 52.53% 520ms | | clone ======Thread 25288817111472 36.36% 360ms ??? 36.36% 360ms | start_thread 36.36% 360ms | | clone ======Thread 17592299352496

90.91% 900ms ??? 90.91% 900ms | start_thread 90.91% 900ms | | clone

======Data collected at 100 Hz frequency

Some tools, such as oprofile and perf, can obtain data out of the PMU in the CPU. Also, nvprof can profile hardware events that are triggered by the GPU and calculate metrics. Use the --events or --metrics flags to pass a list of events or metrics to the profiler.

Although some available metrics calculate aspects of the GPU efficiency, others can reveal bottlenecks in specific units (for example, the case of metrics from events involving NVLink communication). Example 9-24 shows the receive (nvlink_receive_throughput) and transmit (nvlink_transmit_throughput) throughput metrics for each GPU in the system.

Example 9-24 NVIDIA nvprof: Report metrics for NVLink $ nvprof --log-file profile.out --metrics nvlink_transmit_throughput,nvlink_receive_throughput ./MonteCarloMultiGPU 2&> /dev/null $ cat profile.out ==73977== NVPROF is profiling process 73977, command: ./MonteCarloMultiGPU 2 ==73977== Profiling application: ./MonteCarloMultiGPU 2 ==73977== Profiling result: ==73977== Metric result: Invocations Metric Name Metric Description Min Max Avg Device "Tesla P100-SXM2-16GB (0)" Kernel: rngSetupStates(curandStateXORWOW*, int) 1 nvlink_transmit_throughput NVLink Transmit Throughput 1.1570MB/s 1.1570MB/s 976.56KB/s 1 nvlink_receive_throughput NVLink Receive Throughput 681.24KB/s 681.24KB/s 0.00000B/s Kernel: MonteCarloOneBlockPerOption(curandStateXORWOW*, __TOptionData const *, __TOptionValue*, int, int) 1 nvlink_transmit_throughput NVLink Transmit Throughput 567.64KB/s 567.64KB/s 0.00000B/s 1 nvlink_receive_throughput NVLink Receive Throughput 285.74KB/s 285.74KB/s 0.00000B/s Device "Tesla P100-SXM2-16GB (1)" Kernel: rngSetupStates(curandStateXORWOW*, int) 1 nvlink_transmit_throughput NVLink Transmit Throughput 177.71KB/s 177.71KB/s 0.00000B/s 1 nvlink_receive_throughput NVLink Receive Throughput 244.35KB/s 244.35KB/s 0.00000B/s Kernel: MonteCarloOneBlockPerOption(curandStateXORWOW*, __TOptionData const *, __TOptionValue*, int, int) 1 nvlink_transmit_throughput NVLink Transmit Throughput 308.39KB/s 308.39KB/s 0.00000B/s 1 nvlink_receive_throughput NVLink Receive Throughput 190.42KB/s 190.42KB/s 0.00000B/s Device "Tesla P100-SXM2-16GB (2)" Kernel: rngSetupStates(curandStateXORWOW*, int) 1 nvlink_transmit_throughput NVLink Transmit Throughput 177.16KB/s 177.16KB/s 0.00000B/s

324 IBM High-Performance Computing Insights with IBM Power System AC922 Clustered Solution 1 nvlink_receive_throughput NVLink Receive Throughput 231.30KB/s 231.30KB/s 0.00000B/s Kernel: MonteCarloOneBlockPerOption(curandStateXORWOW*, __TOptionData const *, __TOptionValue*, int, int) 1 nvlink_transmit_throughput NVLink Transmit Throughput 568.43KB/s 568.43KB/s 0.00000B/s 1 nvlink_receive_throughput NVLink Receive Throughput 284.86KB/s 284.86KB/s 0.00000B/s Device "Tesla P100-SXM2-16GB (3)" Kernel: rngSetupStates(curandStateXORWOW*, int) 1 nvlink_transmit_throughput NVLink Transmit Throughput 1.1607MB/s 1.1607MB/s 976.56KB/s 1 nvlink_receive_throughput NVLink Receive Throughput 703.26KB/s 703.26KB/s 0.00000B/s Kernel: MonteCarloOneBlockPerOption(curandStateXORWOW*, __TOptionData const *, __TOptionValue*, int, int) 1 nvlink_transmit_throughput NVLink Transmit Throughput 309.12KB/s 309.12KB/s 0.00000B/s 1 nvlink_receive_throughput NVLink Receive Throughput 186.25KB/s 186.25KB/s 0.00000B/s

Use the --query-metrics or --query-events flags to list all of the metrics and events that are available for profiling.

Another useful function of nvprof is to generate traces of the application run time. Those traces can then be imported into the NVIDIA Visual Profiler tool. For more information, see the NVIDIA Visual Profiler website.
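For example, the run of the sample application that is used in this section can be exported to a profile file and opened later in the NVIDIA Visual Profiler; the output file name in this sketch is illustrative:

$ nvprof --export-profile timeline.nvvp ./MonteCarloMultiGPU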

For more information about the nvprof tool, see the Profiler User’s Guide.


Appendix A. Additional material

This book refers to additional material that can be downloaded from the internet as described in the following sections.

Locating the web material

The web material that is associated with this book is available in softcopy on the internet from the IBM Redbooks web server: ftp://www.redbooks.ibm.com/redbooks/SG248422

Alternatively, you can go to the IBM Redbooks website: ibm.com/redbooks

Search for SG248422, select the title, and then click Additional materials to open the directory that corresponds with the IBM Redbooks form number, SG248422.

Using the web material

The additional web material that accompanies this book includes the following file:

File name         Description
smn_md.cfg.zip    Kickstart file

System requirements for downloading the web material

The web material requires the following system configuration:
Hard disk space:   100 MB minimum
Operating system:  Windows, Linux, or macOS
Processor:         i3 minimum
Memory:            1024 MB minimum

Downloading and extracting the web material

Create a subdirectory (folder) on your workstation, and extract the contents of the web material compressed file into this folder.

Related publications

The publications that are listed in this section are considered suitable for a more detailed description of the topics that are covered in this book.

IBM Redbooks

The following IBM Redbooks publication provides more information about the topics in this document:
- Implementing an IBM High-Performance Computing Solution on IBM POWER8, SG24-8263

You can search for, view, download, or order these documents and other Redbooks, Redpapers, web docs, drafts, and additional materials at the following website: ibm.com/redbooks

Online resources

These websites are also relevant as further information sources:
- Extreme Cluster and Cloud Administration Toolkit (xCAT)
  http://xcat.org
- Mellanox Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)
  http://www.mellanox.com/page/products_dyn?product_family=261&mtag=sharp
- NVIDIA Compute Unified Device Architecture (CUDA) Toolkit
  http://developer.nvidia.com/cuda-downloads
- SUMMIT
  http://www.olcf.ornl.gov/olcf-resources/compute-systems/summit/
- Tesla graphics processing units (GPUs) NVIDIA drivers
  http://www.nvidia.com/drivers

Help from IBM

IBM Support and downloads ibm.com/support

IBM Global Services ibm.com/services
