IBM High Performance Computing Cluster Health Check

  • Learn how to verify cluster functionality
  • See sample scenarios
  • Understand best practices

Dino Quintero, Jie Gong, Ross Aiken, Markus Hilger, Shivendra Ashish, Herbert Mehlhose, Manmohan Brahma, Justin I. Morosi, Murali Dhandapani, Thorsten Nitsch, Rico Franke, Fernando Pizzano

ibm.com/redbooks

International Technical Support Organization
February 2014
SG24-8168-00

Note: Before using this information and the product it supports, read the information in “Notices” on page vii.

First Edition (February 2014)

This edition applies to Red Hat Enterprise Linux 6.3, IBM Parallel Environment (PE) (Versions 1.3.0.5, 1.3.0.6, and 1.3.0.7), GPFS 3.5.0-9, xCAT 2.8.1 and 2.8.2, Mellanox OFED 2.0-2.0.5 and 1.5.3-4.0.22.3, Mellanox UFM 4.0.0 build 19, iperf 2.0.5, IOR version 2.10.3 and 3.0.1, mdtest version 1.9.1, stream version 5.10.

© Copyright International Business Machines Corporation 2014. All rights reserved.
Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents

Notices
  Trademarks
Preface
  Authors
  Now you can become a published author, too!
  Comments welcome
  Stay connected to IBM Redbooks

Chapter 1. Introduction
  1.1 Overview of the IBM HPC solution
  1.2 Why we need a methodical approach for cluster consistency checking
  1.3 Tools and interpreting their results for HW and SW states
    1.3.1 General Parallel File System
    1.3.2 Extreme Cloud Administration Toolkit
    1.3.3 The OpenFabrics Enterprise Distribution
    1.3.4 Red Hat Package Manager
  1.4 Tools and interpreting their results for identifying performance inconsistencies
  1.5 Template of diagnostics steps that can be used (checklists)

Chapter 2. Key concepts and interdependencies
  2.1 Introduction to High Performance Computing
  2.2 Rationale for clusters
  2.3 Definition of an HPC Cluster
  2.4 Definition of a “healthy cluster”
  2.5 HPC preferred practices

Chapter 3. The health lifecycle methodology
  3.1 Why a methodology is necessary
  3.2 The health lifecycle methodology
  3.3 Practical application of the health lifecycle methodology
    3.3.1 Deployment phase
    3.3.2 Verification or pre-production readiness phase
    3.3.3 Production phase (monitoring)

Chapter 4. Cluster components reference model
  4.1 Overview of installed cluster systems
  4.2 ClusterA nodes hardware description
  4.3 ClusterA software description
  4.4 ClusterB nodes hardware description
  4.5 ClusterB software description
  4.6 ClusterC nodes hardware description
  4.7 ClusterC software description
  4.8 Interconnect infrastructure
    4.8.1 InfiniBand
    4.8.2 Ethernet Infrastructure
    4.8.3 IP Infrastructure
  4.9 GPFS cluster

Chapter 5. Toolkits for verifying health (individual diagnostics)
  5.1 Introduction to CHC
    5.1.1 Requirements
    5.1.2 Installation
    5.1.3 Configuration
    5.1.4 Usage
  5.2 Tool output processing methods
    5.2.1 The plain mode
    5.2.2 The xcoll mode
    5.2.3 The compare (config_check) mode
  5.3 Compute node
    5.3.1 The leds check
    5.3.2 The cpu check
    5.3.3 The memory check
    5.3.4 The os check
    5.3.5 The firmware check
    5.3.6 The temp check
    5.3.7 The run_daxpy check
    5.3.8 The run_dgemm check
  5.4 Ethernet network: Port status, speed, bandwidth, and port errors
    5.4.1 Ethernet firmware and drivers
    5.4.2 Ethernet port state
    5.4.3 Network settings
    5.4.4 Bonding
  5.5 InfiniBand: Port status, speed, bandwidth, port errors, and subnet manager
    5.5.1 The hca_basic check
    5.5.2 The ipoib check
    5.5.3 The switch_module check ..
