
IBM High Performance Computing Cluster Health Check

Learn how to verify cluster functionality
See sample scenarios
Understand best practices

Dino Quintero, Jie Gong, Ross Aiken, Markus Hilger, Shivendra Ashish, Herbert Mehlhose, Manmohan Brahma, Justin I. Morosi, Murali Dhandapani, Thorsten Nitsch, Rico Franke, Fernando Pizzano

ibm.com/redbooks

International Technical Support Organization
February 2014
SG24-8168-00

Note: Before using this information and the product it supports, read the information in “Notices”.

First Edition (February 2014)

This edition applies to Red Hat Enterprise Linux 6.3, IBM Parallel Environment (PE) (Versions 1.3.0.5, 1.3.0.6, and 1.3.0.7), GPFS 3.5.0-9, xCAT 2.8.1 and 2.8.2, Mellanox OFED 2.0-2.0.5 and 1.5.3-4.0.22.3, Mellanox UFM 4.0.0 build 19, iperf 2.0.5, IOR version 2.10.3 and 3.0.1, mdtest version 1.9.1, stream version 5.10.

© Copyright International Business Machines Corporation 2014. All rights reserved.
Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents

Notices
  Trademarks
Preface
  Authors
  Now you can become a published author, too!
  Comments welcome
  Stay connected to IBM Redbooks

Chapter 1. Introduction
  1.1 Overview of the IBM HPC solution
  1.2 Why we need a methodical approach for cluster consistency checking
  1.3 Tools and interpreting their results for HW and SW states
    1.3.1 General Parallel File System
    1.3.2 Extreme Cloud Administration Toolkit
    1.3.3 The OpenFabrics Enterprise Distribution
    1.3.4 Red Hat Package Manager
  1.4 Tools and interpreting their results for identifying performance inconsistencies
  1.5 Template of diagnostics steps that can be used (checklists)

Chapter 2. Key concepts and interdependencies
  2.1 Introduction to High Performance Computing
  2.2 Rationale for clusters
  2.3 Definition of an HPC Cluster
  2.4 Definition of a “healthy cluster”
  2.5 HPC preferred practices

Chapter 3. The health lifecycle methodology
  3.1 Why a methodology is necessary
  3.2 The health lifecycle methodology
  3.3 Practical application of the health lifecycle methodology
    3.3.1 Deployment phase
    3.3.2 Verification or pre-production readiness phase
    3.3.3 Production phase (monitoring)

Chapter 4. Cluster components reference model
  4.1 Overview of installed cluster systems
  4.2 ClusterA nodes hardware description
  4.3 ClusterA software description
  4.4 ClusterB nodes hardware description
  4.5 ClusterB software description
  4.6 ClusterC nodes hardware description
  4.7 ClusterC software description
  4.8 Interconnect infrastructure
    4.8.1 InfiniBand
    4.8.2 Ethernet Infrastructure
    4.8.3 IP Infrastructure
  4.9 GPFS cluster

Chapter 5. Toolkits for verifying health (individual diagnostics)
  5.1 Introduction to CHC
    5.1.1 Requirements
    5.1.2 Installation
    5.1.3 Configuration
    5.1.4 Usage
  5.2 Tool output processing methods
    5.2.1 The plain mode
    5.2.2 The xcoll mode
    5.2.3 The compare (config_check) mode
  5.3 Compute node
    5.3.1 The leds check
    5.3.2 The cpu check
    5.3.3 The memory check
    5.3.4 The os check
    5.3.5 The firmware check
    5.3.6 The temp check
    5.3.7 The run_daxpy check
    5.3.8 The run_dgemm check
  5.4 Ethernet network: Port status, speed, bandwidth, and port errors
    5.4.1 Ethernet firmware and drivers
    5.4.2 Ethernet port state
    5.4.3 Network settings
    5.4.4 Bonding
  5.5 InfiniBand: Port status, speed, bandwidth, port errors, and subnet manager
    5.5.1 The hca_basic check
    5.5.2 The ipoib check
    5.5.3 The switch_module check
...