

Deployment and Usage Guide for Running AI Workloads on Red Hat OpenShift and NVIDIA DGX Systems with IBM Spectrum Scale

Simon Lorenz Gero Schmidt Thomas Schoenemeyer

Redpaper


IBM Redbooks

Deployment and Usage Guide for Running AI Workloads on Red Hat OpenShift and NVIDIA DGX Systems with IBM Spectrum Scale

September 2020

REDP-5610-00

Note: Before using this information and the product it supports, read the information in “Notices” on page v.

First Edition (September 2020)

This edition applies to Version 4, Release 4, Modification 3 of Red Hat OpenShift and Version 5, Release 0, Modification 4.3 of IBM Spectrum Scale.

This document was created or updated on September 17, 2020.

© Copyright International Business Machines Corporation 2020. All rights reserved. Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents

Notices  v
  Trademarks  vi

Preface  vii
  Authors  vii
  Now you can become a published author, too!  viii
  Comments welcome  viii
  Stay connected to IBM Redbooks  ix

Chapter 1. Overview  1
  1.1 Abstract  2
  1.2 Proof of Concept Background  2

Chapter 2. PoC Environment  5
  2.1 Prerequisites  6

Chapter 3. Installation  9
  3.1 Configuring the NVIDIA Mellanox EDR InfiniBand network  10
  3.2 Integrating DGX-1 Systems as Worker Nodes into the Red Hat OpenShift 4.4.3 Cluster  11
    3.2.1 Installing Red Hat Enterprise Linux 7.6 and DGX Software  11
    3.2.2 Installing NVIDIA Mellanox InfiniBand Drivers (MLNX_OFED)  12
    3.2.3 Installing GPUDirect RDMA Kernel Module  13
    3.2.4 Installing NVIDIA Mellanox SELinux Module  13
    3.2.5 Adding DGX-1 systems as Worker Nodes to the Red Hat OpenShift Cluster  14
  3.3 Adding DGX-1 Systems as Client Nodes to the IBM Spectrum Scale Cluster  14
  3.4 Red Hat OpenShift Stack  15
    3.4.1 Special Resource Operator (SRO)  15
    3.4.2 NVIDIA Mellanox RDMA Shared Device Plugin  17
    3.4.3 Enabling IPC_LOCK in User Namespace for RDMA Shared Device Plugin  19
    3.4.4 MPI Operator  22
    3.4.5 IBM Spectrum Scale CSI  22

Chapter 4. Preparation and Functional Testing  25
  4.1 Test RDMA via the InfiniBand Network  25
  4.2 Prepare Persistent Volumes with IBM Spectrum Scale CSI  29
  4.3 MPI Job Definition  31
  4.4 Connectivity Tests with NVIDIA Collective Communications Library  33
  4.5 Multi-GPU, Multi-Node GPU Scaling with TensorFlow ResNet-50 Benchmark  44

Chapter 5. Deep neural network (DNN) Training on A2D2 Semantic Segmentation Dataset  51
  5.1 Description A2D2 Dataset  51
  5.2 Multi-GPU, Multi-Node GPU Scaling Results for DNN Training Jobs  52
  5.3 Application  55
  5.4 Integrating IBM Spectrum Discover and IBM Spectrum LSF to find the right data based on labels  58

Related publications  63
  Online resources  63
  Help from IBM  64


Notices

This information was developed for products and services offered in the US. This material might be available from IBM in other languages. However, you may be required to own a copy of the product or product version in that language in order to access it.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user’s responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing, IBM Corporation, North Castle Drive, MD-NC119, Armonk, NY 10504-1785, US

INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some jurisdictions do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this information to non-IBM websites are provided for convenience only and do not in any manner serve as an endorsement of those websites. The materials at those websites are not part of the materials for this IBM product and use of those websites is at your own risk.

IBM may use or distribute any of the information you provide in any way it believes appropriate without incurring any obligation to you.

The performance data and client examples cited are presented for illustrative purposes only. Actual performance results may vary depending on specific configurations and operating conditions.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

Statements regarding IBM’s future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to actual people or business enterprises is entirely coincidental.

COPYRIGHT LICENSE:

This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs. The sample programs are provided “AS IS”, without warranty of any kind. IBM shall not be liable for any damages arising out of your use of the sample programs.


Trademarks

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at “Copyright and trademark information” at http://www.ibm.com/legal/copytrade.shtml

The following terms are trademarks or registered trademarks of International Business Machines Corporation, and might also be trademarks or registered trademarks in other countries.

DS6000™
DS8000®
IBM®
IBM Elastic Storage®
IBM Research®
IBM Spectrum®
LSF®
POWER®
Redbooks®
Redbooks (logo)®
System Storage™

The following terms are trademarks of other companies:

The registered trademark Linux® is used pursuant to a sublicense from the Linux Foundation, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis.

OpenShift and Red Hat are trademarks or registered trademarks of Red Hat, Inc. or its subsidiaries in the United States and other countries.

Other company, product, or service names may be trademarks or service marks of others.


Preface

This IBM® Redpaper publication describes the architecture, the installation procedure, and the results of running a typical training application on an automotive dataset in an orchestrated and secured environment that provides horizontal scalability of GPU resources across physical node boundaries for deep neural network (DNN) workloads.

This Redpaper is most relevant for system engineers, system administrators, or system architects responsible for data center infrastructure management and typical day-to-day operations such as system monitoring, operational control, asset management, and security audits.

In a later stage of the Redpaper we added IBM Spectrum® LSF® as a workload manager and IBM Spectrum Discover as a metadata search engine to find the right data for our inference job and to automate the data science workflow. With this solution, the data location (which can even span different storage systems) and the time of data availability for the AI job can be fully abstracted, which is valuable for data scientists.

Authors

This paper was produced by a team of specialists from around the world working with IBM Redbooks, Tucson Center.

Simon Lorenz is an IT Architect at IBM Research® and Development in Frankfurt, Germany. He joined IBM Germany in 1993 and has held various positions within IBM and IBM Research and Development. During international assignments, he helped to improve fully automated chip factories in Asia and the US. Simon joined the IBM Spectrum Scale development team in 2014 and has since worked on OpenStack Swift integration and on designing and building System Health and Proactive Services solutions. Since March 2019 he has been the worldwide leader of the IBM Spectrum Scale Big Data and Analytics team. His role includes designing and building solutions in the area of AI, such as the Data Accelerator for AI and Analytics.

Gero Schmidt is a Software Engineer at IBM Germany Research and Development GmbH in the IBM Spectrum Scale development group. He joined IBM in 2001 working at the European Storage Competence Center in Mainz, Germany, providing technical presales support for a broad range of IBM storage products with a primary focus on enterprise storage solutions, system performance, and IBM POWER® systems. Gero participated in the product rollout of the IBM System Storage™ DS6000/DS8000 series, co-authored several IBM Redbooks® and was a frequent speaker at IBM international conferences. In 2015 he joined the storage research group at the IBM Almaden Research Center in California, USA, where he worked on IBM Spectrum Scale, compression of genomic data in next generation sequencing pipelines, and the development of a cloud-native backup solution for containerized applications in Kubernetes/Red Hat OpenShift. In 2018 he joined the Big Data & Analytics team in the IBM Spectrum Scale development group. He holds a degree in Physics (Dipl.-Phys.) from the Braunschweig University of Technology in Germany.

Thomas Schoenemeyer is a Senior Solution Architect and has worked with the NVIDIA® EMEA Automotive Enterprise Team since March 2019. His area of expertise covers defining the computing architecture and infrastructure for the development of autonomous vehicles, including validation and verification using NVIDIA enterprise products such as the DGX family, the Drive Constellation simulation platform, and T4 GPUs.


Thanks to the following people for their contributions to this project:

Larry Coyne IBM Redbooks, Tucson Center

Juergen Reis Olaf Weiser John Lewars Joseph Dain Gregory Kishi Piyush Chaudhary Przemyslaw Podfigurny David Wohlford Pallavi Galgali IBM Systems

Zvonko Kaiser Nischal Muchakani Red Hat

We gratefully acknowledge the contributions to the study resulting in this Redpaper. We could not have completed this project without the support and guidance of our partners from NVIDIA.

Hilmar Beck Sebastian Kalcher Adolf Hohl Benedikt Keyzers Vitaliy Razinkov Dai Yang NVIDIA

Our sincere appreciation and thanks to all these individuals, who provided insight and expertise that greatly assisted in the research for this paper.

Now you can become a published author, too!

Here’s an opportunity to spotlight your skills, grow your career, and become a published author—all at the same time! Join an IBM Redbooks residency project and help write a book in your area of expertise, while honing your experience using leading-edge technologies. Your efforts will help to increase product acceptance and customer satisfaction, as you expand your network of technical contacts and relationships. Residencies run from two to six weeks in length, and you can participate either in person or as a remote resident working from your home base.

Find out more about the residency program, browse the residency index, and apply online at: ibm.com/redbooks/residencies.html

Comments welcome

Your comments are important to us!


We want our papers to be as helpful as possible. Send us your comments about this paper or other IBM Redbooks publications in one of the following ways:

- Use the online Contact us review Redbooks form found at:
  ibm.com/redbooks
- Send your comments in an email to:
  [email protected]
- Mail your comments to:
  IBM Corporation, IBM Redbooks
  Dept. HYTD Mail Station P099
  2455 South Road
  Poughkeepsie, NY 12601-5400

Stay connected to IBM Redbooks

- Find us on Facebook:
  http://www.facebook.com/IBMRedbooks
- Follow us on Twitter:
  http://twitter.com/ibmredbooks
- Look for us on LinkedIn:
  http://www.linkedin.com/groups?home=&gid=2130806
- Explore new Redbooks publications, residencies, and workshops with the IBM Redbooks weekly newsletter:
  https://www.redbooks.ibm.com/Redbooks.nsf/subscribe?OpenForm
- Stay current on recent Redbooks publications with RSS Feeds:
  http://www.redbooks.ibm.com/rss.html




Chapter 1. Overview

This Redpaper focuses on helping companies address the challenges of running large-scale workloads on orchestration platforms for containerized applications, which is essential to guarantee performance, high availability, and efficient horizontal scaling across compute resources. The proof of concept (PoC) described in this chapter and explained in detail in the Redpaper provides guidance on how to configure a Red Hat OpenShift 4.4.3 cluster for multi-GPU, multi-node deep learning (DL) workloads. As an example, it describes how to run a real-world automotive industry training workload on a public dataset provided by Audi.

The PoC architecture uses Red Hat OpenShift V4.4 on NVIDIA DGX™ systems with IBM Spectrum Scale storage.


1.1 Abstract

This IBM Redpaper publication describes the architecture, the installation procedure, and the results of running a typical training application on an automotive dataset in an orchestrated and secured environment that provides horizontal scalability of GPU resources across physical node boundaries for deep neural network (DNN) workloads.

This Redpaper is most relevant for system engineers, system administrators, or system architects responsible for data center infrastructure management and typical day-to-day operations such as system monitoring, operational control, asset management, and security audits.

In a later stage of the Redpaper we added IBM Spectrum LSF as a workload manager and IBM Spectrum Discover as a metadata search engine to find the right data for our inference job and to automate the data science workflow. With this solution, the data location (which can even span different storage systems) and the time of data availability for the AI job can be fully abstracted, which is valuable for data scientists.

1.2 Proof of Concept Background

For many companies running large-scale infrastructures, orchestration of containerized applications is essential to guarantee performance and high availability. Among the top orchestration tools are Kubernetes and Red Hat OpenShift. At the heart of Red Hat OpenShift is Kubernetes: it is 100% certified Kubernetes, fully open source, and non-proprietary. That means the API to the Red Hat OpenShift cluster is the same as native Kubernetes; nothing changes between a container running on any other Kubernetes environment and running on Red Hat OpenShift, and the applications require no changes.

Red Hat OpenShift brings several value-added capabilities in addition to the fundamental container orchestration capabilities provided by Kubernetes. The capabilities of Red Hat OpenShift with Kubernetes shown in Figure 1-1 make it a complete, enterprise-ready, hybrid cloud platform to build, deploy, and manage cloud-native and traditional applications across the hybrid cloud. Automated deployment and lifecycle management for hundreds of ISV and custom application workloads and infrastructure services using Kubernetes Operators and Helm charts is one of many other appealing features. Red Hat OpenShift has over 1,700 customer deployments worldwide across many industry verticals. Companies in the automotive industry also take advantage of these capabilities of an orchestration platform fully supported by Red Hat.


Figure 1-1 Red Hat OpenShift Container Platform

For more information refer to Red Hat OpenShift Container Platform: Kubernetes for rapid innovation.

Important challenges remain when it comes to large-scale deep learning workloads, such as the development of deep neural networks that are trained to be used in the perception software stack for autonomous cars. These large-scale deep learning workloads are typically associated with applications that make use of NVIDIA GPUs. NVIDIA provides a hub of GPU-accelerated, optimized containers that can easily scale to hundreds of GPUs spread over multiple GPU-accelerated servers.

Scalability requires a balanced system in which all components (compute, network, and storage) work hand in hand and avoid performance bottlenecks. This PoC describes the architecture as well as the installation procedure and the results for running a typical training application working on an automotive dataset.

Deep neural network training on huge datasets is a computationally expensive task and can take several days on a single server, even with multiple GPUs. The only way to reduce the training time from days to hours or even minutes is to run DNN training on multiple accelerated servers using concepts like Message Passing Interface (MPI) and Horovod.

However, studies on multi-node training workloads with NVIDIA GPUs using Red Hat OpenShift 4.4 are rarely found in the literature. The best match found is a public study created by NVIDIA, based on Red Hat OpenShift 4.1.

This Redpaper provides guidance on how to configure a Red Hat OpenShift 4.4.3 cluster for DL workloads and describes how to run a training workload on a public dataset provided by Audi. It also takes a closer look at the horizontal scalability of GPU resources across physical node boundaries for DNN workloads, with Red Hat OpenShift 4 as the container orchestration platform and IBM Spectrum Scale as a highly scalable "data lake" that conveniently provides access to the data in a global namespace for containerized AI workloads, without the need to duplicate or copy any data.


For this PoC the following key components were deployed and used:

- IBM Elastic Storage® System (ESS) 3000 and IBM Spectrum Scale
- IBM Spectrum Scale CSI driver
- NVIDIA DGX-1™ systems
- NVIDIA® Mellanox® InfiniBand EDR/HDR interconnect
- NVIDIA GPU Cloud (NGC) container catalog

The main goal of this PoC is to demonstrate the successful integration of these components and to provide performance benchmarks for multi-GPU/multi-node training workloads with a real dataset, such as the Audi autonomous electronic vehicle (AEV) A2D2 dataset used for the development of autonomous vehicles.

We also deployed Security Context Constraints (SCCs) to allow granular control of the permissions required for pods running AI workloads, and to handle access requirements to remote direct memory access (RDMA) resources for best performance. SCCs are a concept, similar to the way that Role-Based Access Control (RBAC) resources control user access, that administrators can apply to manage security in Red Hat OpenShift. These permissions include the actions that a pod, a collection of containers, can perform and the resources it can access. SCCs are used to define a set of conditions that a pod must run with in order to be accepted into the system.



Chapter 2. PoC Environment

Figure 2-1 shows the environment used for this proof of concept.

Figure 2-1 The environment used for this proof of concept

The solution described in this paper comprises the following major components for the installation:

- DGX-1 systems running with Red Hat 7.6 as worker nodes (40 cores, 512 GB memory, four single-port NVIDIA Mellanox ConnectX-4 EDR InfiniBand cards)
- NVIDIA Mellanox Quantum HDR QM8700 managed switch to connect the worker nodes and the storage backend
- InfiniBand: four NVIDIA Mellanox dual-port ConnectX-5 HCAs in the ESS3000, four single-port NVIDIA Mellanox ConnectX-4 HCAs in each DGX-1 system, connected with 16 NVIDIA Mellanox EDR cables
- NVIDIA Mellanox 100 Gbps EDR InfiniBand network
- Standard 1 Gbps (or higher) Ethernet admin network for all components


- Orchestrator: Red Hat OpenShift 4.4.3 with access to Red Hat subscriptions and other external resources on the Internet (e.g., GitHub, container image registries)
- IBM Spectrum Scale cluster 5.0.4.3 with IBM Spectrum Scale GUI/REST
- IBM ESS3000 with four dual-port NVIDIA Mellanox ConnectX-5 InfiniBand cards running IBM Spectrum Scale 5.0.4.3

2.1 Prerequisites

Figure 2-2 shows the software release levels used for Red Hat OpenShift and IBM Spectrum Scale and the role of each node.

Figure 2-2 Red Hat OpenShift and IBM Spectrum Scale node roles and software releases

The following clusters were created for this PoC:

- Red Hat OpenShift Cluster 4.4.3

  Red Hat OpenShift is an open source container orchestration platform based on the Kubernetes container orchestrator. It is designed for enterprise application development and deployment. As operating systems, we deployed Red Hat Enterprise Linux CoreOS (RHCOS) and Red Hat Enterprise Linux (RHEL). For more details on RHCOS refer to "Related publications" on page 63. RHCOS is the only supported operating system for Red Hat OpenShift Container Platform master node hosts. RHCOS and RHEL are both supported operating systems for Red Hat OpenShift Container Platform x86-based worker nodes. As IBM Spectrum Scale and DGX-1 systems are currently supported only on RHEL, we used RHEL 7.6 for those systems.

  The compute cluster consists of:
  – Three master nodes running RHCOS on Lenovo SR650 servers
  – Two worker nodes running RHCOS in VMs on a Lenovo SR650 (used to have a minimal healthy OCP cluster up and running as a base before adding the DGX-1 systems and testing with different configurations)
  – Two DGX-1 worker nodes

- IBM Spectrum Scale 5.0.4.3 Storage Cluster

  IBM Spectrum Scale is a high-performance, highly available, clustered file system and associated management software, available on a variety of platforms.


  IBM Spectrum Scale can scale in several dimensions, including performance (bandwidth and IOPS), capacity, and the number of nodes or instances that can mount the file system.

  The storage client cluster consists of:
  – IBM Spectrum Scale clients running on every DGX-1 system, based on Red Hat 7.6
  – An IBM Spectrum Scale client running in a VM (providing GUI/REST, quorum, and management functionality), based on Red Hat 7.6
  – A remotely mounted IBM Spectrum Scale file system called ess3000_4M from an IBM ESS3000 storage system, configured with a 4 MiB blocksize (a good fit for the average image size of 3-4 MiB of the A2D2 dataset used)

- IBM ESS Storage Cluster

  IBM ESS 3000 combines IBM Spectrum Scale file management software with NVMe flash storage for the ultimate scale-out performance and unmatched simplicity, delivering 40 GB/s of data throughput per 2U system.

  The storage cluster consists of:
  – An IBM ESS3000 consisting of two canisters, each running IBM Spectrum Scale 5.0.4.3 on Red Hat 8.1
  – The Autonomous Driving Dataset (A2D2) downloaded and extracted into the IBM ESS3000 storage system
  – A Lenovo SR650 server running IBM Spectrum Scale 5.0.4.3 on Red Hat 7.6 (providing GUI/REST, quorum, and management functionality)

To review more details about IBM Spectrum Scale, IBM ESS 3000, DGX-1 systems, and Red Hat OpenShift, refer to "Related publications" on page 63.




Chapter 3. Installation

The installation procedure comprises the following steps:

- Configuring the NVIDIA Mellanox EDR InfiniBand network
  – Enable Subnet Manager (SM) on the NVIDIA Mellanox HDR switch
  – Configure and connect IBM ESS3000 with 8 EDR ports total (8x 100Gbps)
  – Configure and connect DGX-1 nodes with 4 EDR ports each (4x 100Gbps / DGX-1)
- Installing and adding the DGX-1 systems as worker nodes to the Red Hat OpenShift 4.4.3 cluster
  – Install Red Hat Enterprise Linux 7.6 and NVIDIA RPM packages
  – Install NVIDIA Mellanox InfiniBand drivers (MLNX_OFED)
  – Install the NVIDIA® GPUDirect® RDMA Kernel Module (nv_peer_mem)
  – Install NVIDIA Mellanox SELinux Module
  – Add DGX-1 systems as RHEL7 based worker nodes to the Red Hat OpenShift 4.4.3 cluster
- Installing and configuring additional components in the Red Hat OpenShift 4.4.3 stack
  – Special Resource Operator (SRO) for GPU support
  – NVIDIA Mellanox RDMA Shared Device Plugin for InfiniBand RDMA support
  – Security Context Constraints (SCC) with IPC_LOCK capability
  – Service account, role and role binding per user namespace to grant access to RDMA resources for pods using the RDMA Shared Device Plugin
  – MPI Operator to conveniently schedule multi-GPU and multi-node training jobs
  – IBM Spectrum Scale CSI to provide access to data in IBM Spectrum Scale, which offers parallel access to data in a global namespace across worker nodes in the Red Hat OpenShift cluster, without the need to duplicate (copy or move) huge amounts of data for model training, model validation or inference


3.1 Configuring the NVIDIA Mellanox EDR InfiniBand network

In this deployment scenario, each of the two DGX-1 systems and each of the two IBM ESS3000 I/O server nodes is connected with four EDR ports to an NVIDIA Mellanox Quantum HDR 200Gb/s QM8700 InfiniBand Smart Switch, as shown in Chapter 2, “PoC Environment” on page 5, providing a total of eight 100Gbps EDR InfiniBand connections between the IBM ESS3000 storage system and the two DGX-1 systems.

The InfiniBand Subnet Manager (SM) is launched on the InfiniBand switch with:

[standalone: master] > enable
[standalone: master] # configure terminal
[standalone: master] (config) # ib smnode DGX-IB-switch1.kelsterbach.de.ibm.com enable
[standalone: master] (config) # ib smnode DGX-IB-switch1.kelsterbach.de.ibm.com sm-priority 0
[standalone: master] (config) # ib sm virt enable
[standalone: master] (config) # write memory
[standalone: master] (config) # reload

The active switch config used for this paper is:

[standalone: master] (config) # show running-config
##
## Running database "initial"
## Generated at 2020/05/13 15:45:50 +0000
## Hostname: DGX-IB-switch1.kelsterbach.de.ibm.com
## Product release: 3.9.0300
##
##
## Running-config temporary prefix mode setting
##
no cli default prefix-modes enable
##
## Subnet Manager configuration
##
ib sm virt enable
##
## Network interface configuration
##
no interface mgmt0 dhcp
interface mgmt0 ip address 9.155.106.250 /24
##
## Other IP configuration
##
hostname DGX-IB-switch1.kelsterbach.de.ibm.com
ip domain-list kelsterbach.de.ibm.com
ip name-server 9.155.106.9
ip route vrf default 0.0.0.0/0 9.155.106.1
##
## Other IPv6 configuration
##
no ipv6 enable
##
## Local user account configuration
##
username admin password 7 $6$dHjW/Juo$LYlJA...9qZgAVyLylNDAkzyVTXyCbWzc0
username monitor password 7 $6$26O9wpNF$E1G35T.9eakkl...ShK6NIn2r4zMIsdIn8M1
##
## AAA remote server configuration
##
# ldap bind-password ********


# radius-server key ********
# tacacs-server key ********
##
## Network management configuration
##
# web proxy auth basic password ********
##
## X.509 certificates configuration
##
#
# Certificate name system-self-signed, ID 8a57f97b8877d86c46cc3bf2c3b9a1c15a6259d9
# (public-cert config omitted since private-key config is hidden)
##
## IB nodename to GUID mapping
##
ib smnode DGX-IB-switch1.kelsterbach.de.ibm.com create
ib smnode DGX-IB-switch1.kelsterbach.de.ibm.com enable
ib smnode DGX-IB-switch1.kelsterbach.de.ibm.com sm-priority 0
##
## Persistent prefix mode setting
##
cli default prefix-modes enable

[standalone: master] (config) # show ib smnode DGX-IB-switch1.kelsterbach.de.ibm.com sm-state
enabled
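As an additional sanity check (not part of the original switch procedure; it assumes the MLNX_OFED drivers from 3.2.2, "Installing NVIDIA Mellanox InfiniBand Drivers (MLNX_OFED)" are already installed on the hosts), the fabric and link rates can be verified from each DGX-1 system:

# ibstat | grep -E "State:|Rate:"   # every EDR port should report State: Active and Rate: 100
# ibhosts                           # list the hosts visible on the InfiniBand fabric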

3.2 Integrating DGX-1 Systems as Worker Nodes into the Red Hat OpenShift 4.4.3 Cluster

On x86 platforms, Red Hat OpenShift 4 can be extended with RHEL7 based worker nodes (see Adding RHEL compute machines to a Red Hat OpenShift Container Platform cluster). In this paper, we use this concept to integrate the DGX-1 systems as worker nodes into the existing OpenShift 4.4.3 cluster. The steps involved are as follows:

- Installing RHEL 7.6 as base OS on the DGX-1 systems
- Installing DGX Software for Red Hat Enterprise Linux
- Installing the NVIDIA Mellanox OFED (MLNX_OFED)
- Installing the GPUDirect RDMA Kernel Module
- Adding the DGX-1 systems as worker nodes to the Red Hat OpenShift 4.4.3 cluster

3.2.1 Installing Red Hat Enterprise Linux 7.6 and DGX Software

The RHEL 7.6 base OS can be installed according to the DGX-1 instructions in Installing Red Hat Enterprise Linux followed by installing the DGX Software as described in Installing the DGX Software.


Important: When following the steps in these instructions, you need to stop right before installing the NVIDIA® CUDA® driver (Installing and Loading the NVIDIA CUDA Drivers)! The CUDA driver should not be installed on the host OS when the node is integrated as a worker node into a Red Hat OpenShift cluster as the CUDA functionality will be provided by the Special Resource Operator (SRO) running in Red Hat OpenShift.

DGX worker nodes for Red Hat OpenShift must not be pre-configured with NVIDIA components (CUDA driver, container runtime, device plugin). Refer to the documentation for the NVIDIA GPU-Operator for more details.

1. Install the NVIDIA repo for RHEL7:

# yum install -y https://international.download.nvidia.com/dgx/repos/rhel-files/dgx-repo-setup-19.07-2.el7.x86_64.rpm

2. Enable the NVIDIA updates repo in /etc/yum.repos.d/nvidia-dgx-7.repo:

[nvidia-dgx-7-updates]
name=NVIDIA DGX EL7 Updates
baseurl=https://international.download.nvidia.com/dgx/repos/rhel7-updates/
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-dgx-cosmos-support

3. Install the NVIDIA software packages for DGX-1 systems:

# yum groupinstall -y 'DGX-1 Configurations'

4. Update the RHEL 7.6 packages and kernel.

Before running the yum update command we set the minor release to RHEL 7.6 in the Red Hat subscription manager in order to gain more control over the package and kernel updates. By setting it to RHEL 7.6 we ensure that we stay within this minor release for the kernel and package updates. The Linux kernel used in this paper is 3.10.0-957.27.2.el7.x86_64 on the DGX-1 systems. Red Hat OpenShift 4.4 generally supports RHEL7 minor versions RHEL 7.6 up to RHEL 7.8 according to System requirements for RHEL compute nodes and Red Hat OpenShift Container Platform 4.x Tested Integrations (for x86_64), but not RHEL 8 at this time.

# subscription-manager release --set=7.6
# subscription-manager release --show
Release: 7.6
# yum clean all
# yum update

Reboot the system to load the drivers and to update system configurations.

3.2.2 Installing NVIDIA Mellanox InfiniBand Drivers (MLNX_OFED)

Follow the instructions at Installing NVIDIA Mellanox InfiniBand Drivers to install the NVIDIA Mellanox OpenFabrics Enterprise Distribution (MLNX_OFED) for Linux for Red Hat operating systems. We install NVIDIA Mellanox OFED MLNX_OFED_LINUX-4.7-3.2.9.0 on both DGX-1 systems using the same release version as on the IBM ESS3000 system.

After installing the MLNX_OFED drivers we skip the step to install the NVIDIA peer memory kernel module (nv_peer_mem) from the pre-built RPM package as it will not install properly without a full CUDA installation (which we don't have installed on the host OS).


Instead, we follow the instructions in the next section, Installing GPUDirect RDMA Kernel Module, to build and install the NVIDIA peer memory kernel module (nv_peer_mem) manually.

3.2.3 Installing GPUDirect RDMA Kernel Module

The NVIDIA peer memory (nv_peer_mem) kernel module is required for GPUDirect RDMA support. As the installation of the provided RPM package requires a CUDA installation, which is not present in our setup, we install the kernel module manually. GPUDirect RDMA enables multiple GPUs and network adapters to directly read and write CUDA host and device memory, eliminating unnecessary memory copies and dramatically lowering CPU overhead and latency, which results in significant performance improvements in data transfer times.

To install the GPUDirect RDMA Kernel Module without CUDA follow the instructions below.

First, compile and install the nv_peer_mem source code as follows on one DGX system:

# export NVIDIA=/run/nvidia/driver
# export KERNEL_VERSION=$(uname -r)
# ln -sf ${NVIDIA}/usr/src/nvidia-* /usr/src/.
# yum -y group install "Development Tools"
# yum -y install kernel-devel-${KERNEL_VERSION} kernel-headers-${KERNEL_VERSION} kmod binutils perl elfutils-libelf-devel
# git clone https://github.com/Mellanox/nv_peer_memory.git
# cd /root/nv_peer_memory
# sed -i 's/updates\/dkms/kernel\/drivers\/video/g' create_nv.symvers.sh
# ./build_module.sh
# ln -sf \
  /run/nvidia/driver/lib/modules/3.10.0-957.27.2.el7.x86_64/kernel/drivers/video/nvidia* \
  /lib/modules/3.10.0-957.27.2.el7.x86_64/kernel/drivers/video/.
# rpmbuild --rebuild /tmp/nvidia_peer_memory-*

# rpm -ivh /root/rpmbuild/RPMS/x86_64/nvidia_peer_memory-1.0-9.x86_64.rpm

Then distribute the built RPM and install it on the other DGX systems.

Check that the module is loaded, otherwise load it with modprobe nv_peer_mem:

# lsmod | grep peer
nv_peer_mem            13163  0
nvidia              19893032  3261 nv_peer_mem,nvidia_modeset,nvidia_uvm
ib_core               368807  11 rdma_cm,ib_cm,iw_cm,nv_peer_mem,mlx4_ib,mlx5_ib,ib_ucm,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib

Make sure that the module is loaded automatically on every system reboot!
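One way to do this (a sketch based on common practice, not taken from the original instructions; the file name is our choice) is a systemd-modules-load entry on each DGX-1 worker node:

# echo nv_peer_mem > /etc/modules-load.d/nv_peer_mem.conf
# systemctl restart systemd-modules-load.service
# lsmod | grep nv_peer_mem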

3.2.4 Installing NVIDIA Mellanox SELinux Module

Finally download and unzip the NVIDIA Mellanox SELinux module from:

https://docs.mellanox.com/download/attachments/19804150/infiniband.zip?version=1&m odificationDate=1575464686823&api=v2&download=true

and install it on the DGX worker nodes as follows:

# semodule -i infiniband.pp
# semodule -l | grep -i infi
infiniband	1.0


3.2.5 Adding DGX-1 systems as Worker Nodes to the Red Hat OpenShift Cluster

Finally, the DGX-1 systems are ready to be added to the Red Hat OpenShift cluster as RHEL7 based worker nodes following the instructions at Adding RHEL compute machines to a Red Hat OpenShift Container Platform cluster.

Ensure that you have an active Red Hat OpenShift Container Platform subscription on your Red Hat account. You must register each host with Red Hat Subscription Manager (RHSM), attach an active Red Hat OpenShift Container Platform subscription, and enable the required repositories.

Remember to add your new worker nodes to your active Red Hat OpenShift load balancer configuration (e.g., HAproxy).
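After the nodes have joined, any pending certificate signing requests must be approved and the new workers should report Ready. The following generic check is not part of the referenced instructions and its output is illustrative only:

# oc get csr -o name | xargs oc adm certificate approve
# oc get nodes -l node-role.kubernetes.io/worker
NAME                       STATUS   ROLES    AGE   VERSION
dgx01.ocp4.scale.ibm.com   Ready    worker   ...   ...
dgx02.ocp4.scale.ibm.com   Ready    worker   ...   ...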

3.3 Adding DGX-1 Systems as Client Nodes to the IBM Spectrum Scale Cluster

Both DGX-1 systems are now part of the Red Hat OpenShift 4.4.3 cluster as RHEL7 based worker nodes. In the next step, we add these DGX worker nodes to the local IBM Spectrum Scale 5.0.4.3 cluster as IBM Spectrum Scale client nodes, either using the IBM Spectrum Scale installation toolkit as described in Adding nodes, NSDs, or file systems to an existing installation or manually following the instructions in Adding nodes to a GPFS cluster in the IBM Spectrum Scale Knowledge Center.

In our setup the local IBM Spectrum Scale cluster remotely mounts an IBM Spectrum Scale file system called ess3000_4M from an IBM ESS3000 storage system. For more information about managing access to a remote IBM Spectrum Scale file system refer to Accessing a remote GPFS file system. The file system is configured with a 4 MiB block size, which is a good fit for the average image size of 3-4 MiB used for the autonomous vehicle training data in this paper.

The IBM ESS3000 and DGX-1 systems are configured to allow all ports on the EDR InfiniBand network to be used for storage I/O, that is, four NVIDIA Mellanox ConnectX-5 ports on each of the two IBM ESS3000 I/O server nodes and four NVIDIA Mellanox ConnectX-4 ports on each DGX-1 system. This provides a total bandwidth of eight InfiniBand EDR 100Gbps links to the IBM ESS3000 and four to each DGX-1 system.

In addition to the InfiniBand daemon network for data transfers, we configure an additional IP address on all mlx5_1 interfaces (using IPoIB, 10.10.11.0/24) on the IBM ESS3000 nodes and the DGX-1 nodes as an additional subnet for IBM Spectrum Scale, which is used for mmfsd daemon TCP/IP communication over the 100Gbps link instead of the 1Gbps admin network. With verbsRdmaSend enabled, the amount of communication over this TCP/IP link is negligible. The 1Gbps admin network is used as the default admin and daemon network connecting all IBM Spectrum Scale, DGX, and IBM ESS nodes.

The DGX-1 nodes in the local IBM Spectrum Scale cluster are configured as follows:

[dgx]
pagepool 128G
workerThreads 1024
ignorePrefetchLUNCount yes
maxFilesToCache 1M
maxStatCache 1M


subnets 10.10.11.0/ess3000.bda.scale.ibm.com;SpectrumScale.ocp4.scale.ibm.com
verbsRdma enable
verbsRdmaSend yes
verbsPorts mlx5_0/1 mlx5_1/1 mlx5_2/1 mlx5_3/1
[common]
maxMBpS 20000
enforceFilesetQuotaOnRoot yes
controlSetxattrImmutableSELinux yes
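The stanza above corresponds to standard mmchconfig parameters. A sketch of how such settings could be applied and verified follows (the node class name dgx is an assumption matching the stanza above, and several of these parameters only take effect after the mmfsd daemon is restarted):

# mmchconfig pagepool=128G,workerThreads=1024,maxFilesToCache=1M,maxStatCache=1M -N dgx
# mmchconfig ignorePrefetchLUNCount=yes,verbsRdma=enable,verbsRdmaSend=yes -N dgx
# mmchconfig verbsPorts="mlx5_0/1 mlx5_1/1 mlx5_2/1 mlx5_3/1" -N dgx
# mmchconfig maxMBpS=20000,enforceFilesetQuotaOnRoot=yes,controlSetxattrImmutableSELinux=yes
# mmlsconfig | grep -E "pagepool|verbs|subnets|maxMBpS"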

3.4 Red Hat OpenShift Stack

To fully utilize the GPU and high performance InfiniBand RDMA capabilities of the DGX-1 systems as well as shared access to the data in the global namespace of the parallel IBM Spectrum Scale file system we install additional components into the Red Hat OpenShift stack, namely:

- Special Resource Operator (SRO) for GPU resources
- NVIDIA Mellanox RDMA Shared Device Plugin for InfiniBand RDMA resources
- MPI Operator to run orchestrated MPI jobs in Red Hat OpenShift
- IBM Spectrum Scale CSI plugin to access data in the IBM Spectrum Scale file system

3.4.1 Special Resource Operator (SRO)

The Special Resource Operator (SRO) extends Red Hat OpenShift to provide support for special resources that need extra management, like the NVIDIA GPUs in the DGX-1 systems. Without the SRO, the cluster would not know about any GPU resource, pods relying on that resource could not be scheduled appropriately, and the correct drivers (like CUDA) would not be provided. The SRO (or a similar extension to Red Hat OpenShift) is therefore required to effectively schedule workloads that rely on GPUs in a Red Hat OpenShift cluster.

The SRO relies on the Node Feature Discovery (NFD) operator and its node feature discovery capabilities to label the worker nodes in the Red Hat OpenShift cluster with node-specific attributes, such as PCI cards, kernel or OS version, and so on.

The SRO is also available as an operator in the OperatorHub and can be installed and deployed directly from the Red Hat OpenShift 4 GUI using the Operator Lifecycle Management framework, as shown in Figure 3-1.


Figure 3-1 Special Resource Operator can be installed and deployed directly from the Red Hat OperatorHub

For this paper we used the latest version (as of June 25, 17:23 CEST) on GitHub at https://github.com/openshift-psap/special-resource-operator, which can be installed from the command line by the system admin of the Red Hat OpenShift cluster as follows:

The necessary steps involve:

1. Installation of the NFD operator:

# git clone https://github.com/openshift/cluster-nfd-operator
# cd cluster-nfd-operator
# make deploy

2. Installation of the SRO operator:

# git clone https://github.com/openshift-psap/special-resource-operator
# cd special-resource-operator
# PULLPOLICY=Always make deploy

It might take a while for the SRO operator to build the required container images and start all of the required pods successfully - so please be patient. During that time you will notice that SRO pods are failing and restarting until all required build processes have completed successfully and the SRO pods finally enter a steady running state.

Make sure the nouveau kernel module is disabled - this should already be the case when installing the NVIDIA packages for the DGX-1 system - otherwise just disable the nouveau module manually.

The SRO adds the following allocatable resources to each DGX-1 worker node (8 GPUs in this case):

Allocatable:
  nvidia.com/gpu:  8
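A quick way to confirm the new resource (a generic check, not from the original text; output shortened and illustrative) is to inspect the node:

# oc describe node dgx01.ocp4.scale.ibm.com | grep nvidia.com/gpu
  nvidia.com/gpu:  8
  nvidia.com/gpu:  8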

With the SRO version that we use for this setup we have to set SELinux on the DGX-1 worker nodes to "permissive" mode so that the SRO can successfully build the required container images and finally start all the necessary pods.


3.4.2 NVIDIA Mellanox RDMA Shared Device Plugin

In addition to the NFD and SRO operators, which add NVIDIA GPUs as a new resource in the Red Hat OpenShift cluster, we also need to install the NVIDIA Mellanox RDMA Shared Device Plugin to provide InfiniBand resources to Red Hat OpenShift. This gives non-privileged pods access to GPU and InfiniBand resources and allows Red Hat OpenShift to schedule pods requesting these resources accordingly. InfiniBand enables high performance communication between GPUs with the NVIDIA Collective Communications Library (NCCL) so that multi-node workloads can scale out seamlessly across worker nodes. The NVIDIA Mellanox RDMA device plugin runs as a daemonset on all worker nodes and allows a granular assignment of individual NVIDIA Mellanox InfiniBand resources to pods.

Cloning the GitHub repository

# git clone https://github.com/Mellanox/k8s-rdma-shared-dev-plugin.git

and following the instructions at https://github.com/Mellanox/k8s-rdma-shared-dev-plugin, the RDMA device plugin can be installed in two steps:

1. Create and deploy the config map

The DGX-1 systems have four single-port NVIDIA Mellanox ConnectX-4 EDR InfiniBand cards (NVIDIA MT27700 family) associated with two NUMA nodes:
- mlx5_0, mlx5_1 (numa0)
- mlx5_2, mlx5_3 (numa1)

In our setup, we configured the config map for the NVIDIA Mellanox device driver plugin as follows:

# cat k8s-rdma-shared-dev-plugin-config-map.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: rdma-devices
  namespace: kube-system
data:
  config.json: |
    {
      "configList": [{
          "resourceName": "shared_ib0",
          "rdmaHcaMax": 100,
          "devices": ["ib0"]
        },
        {
          "resourceName": "shared_ib1",
          "rdmaHcaMax": 100,
          "devices": ["ib1"]
        },
        {
          "resourceName": "shared_ib2",
          "rdmaHcaMax": 100,
          "devices": ["ib2"]
        },
        {
          "resourceName": "shared_ib3",
          "rdmaHcaMax": 100,


"devices": ["ib3"] } ] } to make all four IB ports available in Red Hat OpenShift as selectable resources, named shared_ib0, shared_ib1, shared_ib2, shared_ib3 respectively. Finally, deploy the config map using the following command: # oc apply -f k8s-rdma-shared-dev-plugin-config-map.yaml 2. Deploy the device plugin Now deploy the NVIDIA Mellanox RDMA device plugin: # oc apply -f images/k8s-rdma-shared-dev-plugin-ds.yaml

The device plugin adds the following allocatable resources to the DGX-1 worker nodes in Red Hat OpenShift:

Allocatable:
  rdma/shared_ib0:  100
  rdma/shared_ib1:  100
  rdma/shared_ib2:  100
  rdma/shared_ib3:  100

Unlike GPUs, a container can request only one quantity of a specific RDMA resource, e.g., rdma/shared_ib0: 1, which enables access to the requested RDMA resource. Higher quantities, such as rdma/shared_ib0: 2, are not possible. Instead, a container can request access to multiple different RDMA resources at the same time, e.g., rdma/shared_ib0: 1, rdma/shared_ib1: 1, and so on.

The GitHub repository provides a mofed-test-pod that can be adapted to the local configuration and used by the system admin to check proper RDMA functionality, in particular:

# cat ../k8s-rdma-shared-dev-plugin/example/test-hca-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: mofed-test-pod
spec:
  restartPolicy: OnFailure
  containers:
  - image: mellanox/centos_7_4_mofed_4_2_1_2_0_0_60
    name: mofed-test-ctr
    securityContext:
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        rdma/shared_ib0: 1
        rdma/shared_ib1: 1
        rdma/shared_ib2: 1
        rdma/shared_ib3: 1
    command:
    - sh
    - -c
    - |
      ls -l /dev/infiniband /sys/class/net


      sleep 1000000

You can log into the running pod with oc rsh mofed-test-pod and execute regular ibstat, ibhosts, ib_write_bw, etc. commands to verify the connectivity across your IB network interactively.

3.4.3 Enabling IPC_LOCK in User Namespace for RDMA Shared Device Plugin

The use of the RDMA device plugin requires the IPC_LOCK capability from the Red Hat OpenShift security context. This is not generally available to a regular user who is normally running under the “restricted” Security Context Constraints (SCC) in Red Hat OpenShift 4. The system admin, however, has access to the “privileged” Security Context Constraints (SCC) and can immediately run the mofed-test-pod above.

To allow a regular user to run jobs requesting the IPC_LOCK capability (e.g. for jobs requesting RDMA resources for optimal multi-GPU usage across nodes) the system admin can add, for example, a new Security Context Constraint (SCC) derived from the "restricted" SCC, and extend it by the IPC_LOCK capability:

defaultAddCapabilities:
- IPC_LOCK

In order to make the new SCC available to a user's namespace, the system admin needs to create a service account, a role, and a role binding referencing this new SCC. The detailed steps are as follows.

1. Create a new SCC which includes the IPC_LOCK capability:

We create and apply a new SCC derived from the "restricted" SCC, adding the IPC_LOCK capability and naming it scc-for-mpi because we intend to use it when running MPI jobs requesting multiple GPUs:

# oc apply -f mpi-scc.yaml
# cat mpi-scc.yaml
allowHostDirVolumePlugin: false
allowHostIPC: false
allowHostNetwork: false
allowHostPID: false
allowHostPorts: false
allowPrivilegeEscalation: true
allowPrivilegedContainer: false
allowedCapabilities: null
apiVersion: security.openshift.io/v1
defaultAddCapabilities:
- IPC_LOCK
fsGroup:
  type: MustRunAs
groups:
- system:authenticated
kind: SecurityContextConstraints
metadata:
  annotations:
    kubernetes.io/description: cloned from restricted SCC adds IPC_LOCK capabilities
  name: scc-for-mpi
priority: null
readOnlyRootFilesystem: false


requiredDropCapabilities:
- KILL
- MKNOD
- SETUID
- SETGID
runAsUser:
  type: MustRunAsRange
seLinuxContext:
  type: MustRunAs
supplementalGroups:
  type: RunAsAny
users:
- system:serviceaccount:name-of-user-namespace:mpi
volumes:
- configMap
- downwardAPI
- emptyDir
- persistentVolumeClaim
- projected
- secret

2. Create a new service account in the user's namespace

We create a new service account with name mpi in the user's namespace:

# oc apply -f mpi-sa.yaml
# cat mpi-sa.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mpi
  namespace: name-of-user-namespace

3. Create a new role in the user's namespace

We create a new role with name mpi in the user's namespace that refers to the newly created security context constraints (SCC) named scc-for-mpi, which includes the IPC_LOCK capability:

# oc apply -f mpi-role.yaml
# cat mpi-role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: mpi
  namespace: name-of-user-namespace
rules:
- apiGroups:
  - security.openshift.io
  resources:
  - securitycontextconstraints
  verbs:
  - use
  resourceNames:
  - scc-for-mpi

4. Create a new role binding in the user's namespace


Finally, we create a role binding with name mpi in the user's namespace that connects the service account mpi with the role mpi:

# oc apply -f mpi-rolebinding.yaml
# cat mpi-rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: mpi
  namespace: name-of-user-namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: mpi
  namespace: name-of-user-namespace
subjects:
- kind: ServiceAccount
  name: mpi
  namespace: name-of-user-namespace
userNames:
- system:serviceaccount:name-of-user-namespace:mpi

A regular user in this namespace (also called a project in Red Hat OpenShift) can now run a pod under the created service account mpi and fully utilize RDMA resources by adding the service account and the IPC_LOCK capability to the pod's spec section in the YAML:

spec:
  serviceAccount: mpi
  serviceAccountName: mpi
  containers:
  - name: your-container-name
    image: your-container-image:tag
    securityContext:
      capabilities:
        add: [ "IPC_LOCK" ]

Note that this service account is bound to a given namespace. The mpi service account, role, and role binding need to be applied to every namespace that runs jobs requiring the IPC_LOCK capability, and the corresponding service account must be added to the users section of the SCC.
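To confirm which SCC was actually applied to a running pod (a generic verification step, not part of the original text), the openshift.io/scc annotation can be inspected:

# oc get pod <pod-name> -o jsonpath='{.metadata.annotations.openshift\.io/scc}{"\n"}'
scc-for-mpi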


Note: In Red Hat OpenShift 4.4.3 we also need to add LimitMEMLOCK=infinity to the default system settings of cri-o under [Service] in /usr/lib/systemd/system/cri-o.service on all DGX-1 worker nodes to propagate an unlimited ulimit setting for max locked memory to the pods. Otherwise, RDMA transfers (e.g. with ib_send_bw) in a non-privileged pod as a non-root user fail with:

sh-4.2$ ib_write_bw dgx02 -d mlx5_1
Couldn't allocate MR
failed to create mrd
Failed to create MR
Couldn't create IB resources

If the attribute is not present then simply add it and restart cri-o with:

# systemctl daemon-reload
# systemctl restart cri-o

This step might no longer be necessary in future Red Hat OpenShift 4 releases beyond 4.4.3.
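As an alternative to editing the unit file in /usr/lib/systemd/system directly, a systemd drop-in keeps the override separate from the packaged file. This is a sketch under our own assumptions and not part of the original procedure; the unit name (cri-o.service versus crio.service) may differ between releases:

# mkdir -p /etc/systemd/system/cri-o.service.d
# cat > /etc/systemd/system/cri-o.service.d/10-memlock.conf <<EOF
[Service]
LimitMEMLOCK=infinity
EOF
# systemctl daemon-reload
# systemctl restart cri-o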

3.4.4 MPI Operator

We use the Message Passing Interface (MPI) for distributed AI workloads to scale out across GPUs and worker nodes. To achieve that, we make use of the MPI Operator project on GitHub, which makes it easy to run distributed AI workloads as MPI jobs on Kubernetes and Red Hat OpenShift.

The MPI Operator can be installed as follows:

# git clone https://github.com/kubeflow/mpi-operator.git
# cd mpi-operator/
# oc apply -f deploy/v1alpha2/mpi-operator.yaml

The MPI Operator allows users to define and run MPIJob resources as shown in section 4.3, "MPI Job Definition" on page 31.
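To verify that the operator is running and that the MPIJob custom resource definition is registered (a generic check; the mpi-operator namespace follows the upstream manifest and is an assumption, and the output is illustrative):

# oc get crd mpijobs.kubeflow.org
# oc get pods -n mpi-operator
NAME                  READY   STATUS    RESTARTS   AGE
mpi-operator-...      1/1     Running   0          ...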

3.4.5 IBM Spectrum Scale CSI

IBM Spectrum Scale is a proven and scalable software-defined storage solution from IBM for today's enterprise AI workloads, ensuring security, reliability, and high performance at scale. As a distributed parallel file system, it provides a global namespace for your data. The data can easily be made accessible to workloads running in Red Hat OpenShift or Kubernetes using the IBM Spectrum Scale CSI plugin. CSI stands for Container Storage Interface and was introduced as alpha in Kubernetes v1.9 and promoted to GA in Kubernetes v1.13.

IBM Spectrum Scale CSI makes it possible to conveniently provision persistent volumes (PVs) in Kubernetes and Red Hat OpenShift with IBM Spectrum Scale as the storage backend. These persistent volumes can either be dynamically provisioned on a user's request (dynamic provisioning) using a persistent volume claim (PVC) and a storage class, or statically provisioned by a system admin when a specific path with existing data in IBM Spectrum Scale should be made directly available to users, so that their AI training or inference workloads share direct access to huge amounts of training or inference data and models. Users can then request these statically provisioned volumes in a similar way using persistent volume claims (PVCs), additionally making use of labels to specify exactly which persistent volumes (i.e. which data) they are interested in.
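For illustration, a dynamically provisioned volume request could look like the following sketch. The storage class parameters (volBackendFs, clusterId) and all names are assumptions loosely based on the file system and cluster IDs used later in this chapter, not a configuration taken from the PoC:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ibm-spectrum-scale-csi-fileset
provisioner: spectrumscale.csi.ibm.com
parameters:
  volBackendFs: "ess3000_4M"
  clusterId: "215057217487177715"
reclaimPolicy: Delete
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data-pvc
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  storageClassName: ibm-spectrum-scale-csi-fileset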


The IBM Spectrum Scale CSI Operator provides a means to deploy and manage the CSI plugin for IBM Spectrum Scale. The IBM Spectrum Scale CSI plugin requires a running IBM Spectrum Scale cluster with access to the IBM Spectrum Scale GUI. In this paper we are using IBM Spectrum Scale CSI v2.0.0.

You can easily deploy the IBM Spectrum Scale CSI plugin by installing the IBM Spectrum Scale CSI operator from the OperatorHub in the Red Hat OpenShift GUI, as shown in Figure 3-2.

Figure 3-2 IBM Spectrum Scale CSI plugin can be installed and deployed directly from the Red Hat OperatorHub

For configuring the IBM Spectrum Scale CSI operator, follow the documentation provided in the operator. For more details and the full documentation of IBM Spectrum Scale CSI, refer to the IBM Knowledge Center or take a look at the IBM Redpaper publication IBM Spectrum Scale CSI Driver for Container Persistent Storage, REDP-5589.

Follow the instructions at Performing pre-installation tasks to prepare your environment accordingly. In our setup we label the Red Hat OpenShift DGX-1 worker nodes to specify where the IBM Spectrum Scale client is installed and where the IBM Spectrum Scale Container Storage Interface driver should run:
# oc label node dgx01.ocp4.scale.ibm.com scale=true --overwrite=true
# oc label node dgx02.ocp4.scale.ibm.com scale=true --overwrite=true
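As a quick check, the labeled nodes can be listed by their label selector:
# oc get nodes -l scale=true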

Ensure that the controlSetxattrImmutableSELinux parameter is set to “yes” by issuing the following command (this is especially important for Red Hat OpenShift):
# mmchconfig controlSetxattrImmutableSELinux=yes -i
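The current value can be verified afterwards, for example with:
# mmlsconfig controlSetxattrImmutableSELinux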

The following IBM Spectrum Scale CSI driver configuration in the IBM Spectrum Scale CSI Operator custom resource YAML is used in this setup comprising a local IBM Spectrum Scale cluster (ID 16217308676014575381 with GUI on 192.168.1.30 with the primary file system fs0 that is being used for hosting IBM Spectrum Scale configuration data) and a remote IBM ESS3000 cluster (ID 215057217487177715 with GUI on 192.168.1.52 hosting the remote file system ess3000_4M):

apiVersion: csi.ibm.com/v1
kind: CSIScaleOperator
metadata:
  labels:
    app.kubernetes.io/instance: ibm-spectrum-scale-csi-operator
    app.kubernetes.io/managed-by: ibm-spectrum-scale-csi-operator
    app.kubernetes.io/name: ibm-spectrum-scale-csi-operator
    release: ibm-spectrum-scale-csi-operator
  name: ibm-spectrum-scale-csi
  namespace: ibm-spectrum-scale-csi-driver
spec:
  # cluster definitions
  clusters:
    - id: '16217308676014575381'
      primary:
        primaryFs: fs0
        primaryFset: csifset
      restApi:
        - guiHost: 192.168.1.30
      secrets: csisecret-local
      secureSslMode: false
    - id: '215057217487177715'
      restApi:
        - guiHost: 192.168.1.52
      secrets: csisecret-remote
      secureSslMode: false
  scaleHostpath: /gpfs/fs0
  # node selector
  attacherNodeSelector:
    - key: scale
      value: "true"
  provisionerNodeSelector:
    - key: scale
      value: "true"
  pluginNodeSelector:
    - key: scale
      value: "true"
status: {}
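The secrets csisecret-local and csisecret-remote referenced above hold the credentials of the respective IBM Spectrum Scale GUI users. The following is a minimal sketch of how such a secret could be created, assuming a local GUI user named csiadmin (the user name and password are placeholders); refer to the IBM Spectrum Scale CSI documentation for the exact requirements of your release:
# oc create secret generic csisecret-local -n ibm-spectrum-scale-csi-driver \
    --from-literal=username='csiadmin' --from-literal=password='<GUI user password>'
# oc label secret csisecret-local -n ibm-spectrum-scale-csi-driver product=ibm-spectrum-scale-csi
Create csisecret-remote in the same way using the credentials of the remote (ESS3000) GUI.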

For more details about the IBM Spectrum Scale CSI Operator custom resource YAML refer to the documentation in the IBM Knowledge Center in IBM Spectrum Scale Container Storage Interface driver configurations.


Chapter 4. Preparation and Functional Testing

The following sections provide more details on how to prepare the environment, that is, configuring persistent volumes and defining the YAML for running MPI jobs. They also provide baseline tests to ensure proper network performance as well as multi-node, multi-GPU communication tests with the NVIDIA Collective Communications Library (NCCL). The chapter concludes with initial GPU scaling tests running the ResNet-50 benchmark on synthetic data.

4.1 Test RDMA via the InfiniBand Network

If we have not done so already during the installation step, we should first check that RDMA over InfiniBand is working properly. A quick test to access resources over an InfiniBand connection and to verify the link throughput can be done as shown in the following paragraphs.

Start ib_write_bw in listening mode on one DGX-1 system; here we are using dgx02:
[root@dgx02 ~]# ib_write_bw

************************************
* Waiting for client to connect... *
************************************

Then start the following test pod using nodeName: dgx01.ocp4.scale.ibm.com to schedule the pod on the other DGX-1 system:
# oc apply -f nv-rdma-batch-job-run.yaml
# cat nv-rdma-batch-job-run.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: mofed-test-job
spec:
  template:
    spec:
      restartPolicy: OnFailure
      serviceAccount: mpi
      serviceAccountName: mpi
      nodeName: dgx01.ocp4.scale.ibm.com
      containers:
      - image: mellanox/centos_7_4_mofed_4_2_1_2_0_0_60
        name: mofed-test-ctr
        resources:
          limits:
            rdma/shared_ib0: 1
            rdma/shared_ib1: 1
            rdma/shared_ib2: 1
            rdma/shared_ib3: 1
        securityContext:
          capabilities:
            add: [ "IPC_LOCK" ]
        command: ["/bin/sh","-c"]
        args: ["whoami; ls -l /dev/infiniband /sys/class/net; ibstatus; set -x; for i in 0 1 2 3; do ibhosts -C mlx5_$i; done; ib_write_bw dgx02 -d mlx5_1"]

Wait for the job to complete and look at the job's log:

# oc logs job.batch/mofed-test-job 1075480000 /dev/infiniband: total 0 crw-rw-rw-. 1 root root 231, 64 Jul 6 15:52 issm0 crw-rw-rw-. 1 root root 231, 65 Jul 6 15:52 issm1 crw-rw-rw-. 1 root root 231, 66 Jul 6 15:52 issm2 crw-rw-rw-. 1 root root 231, 67 Jul 6 15:52 issm3 crw-rw-rw-. 1 root root 10, 57 Jul 6 15:52 rdma_cm crw-rw-rw-. 1 root root 231, 224 Jul 6 15:52 ucm0 crw-rw-rw-. 1 root root 231, 225 Jul 6 15:52 ucm1 crw-rw-rw-. 1 root root 231, 226 Jul 6 15:52 ucm2 crw-rw-rw-. 1 root root 231, 227 Jul 6 15:52 ucm3 crw-rw-rw-. 1 root root 231, 0 Jul 6 15:52 umad0 crw-rw-rw-. 1 root root 231, 1 Jul 6 15:52 umad1 crw-rw-rw-. 1 root root 231, 2 Jul 6 15:52 umad2 crw-rw-rw-. 1 root root 231, 3 Jul 6 15:52 umad3 crw-rw-rw-. 1 root root 231, 192 Jul 6 15:52 uverbs0 crw-rw-rw-. 1 root root 231, 193 Jul 6 15:52 uverbs1 crw-rw-rw-. 1 root root 231, 194 Jul 6 15:52 uverbs2 crw-rw-rw-. 1 root root 231, 195 Jul 6 15:52 uverbs3

/sys/class/net: total 0 lrwxrwxrwx. 1 root root 0 Jul 6 15:52 eth0 -> ../../devices/virtual/net/eth0 lrwxrwxrwx. 1 root root 0 Jul 6 15:52 lo -> ../../devices/virtual/net/lo Infiniband device 'mlx5_0' port 1 status: default gid: fe80:0000:0000:0000:ec0d:9a03:0044:d2d0 base lid: 0xb sm lid: 0x9 state: 4: ACTIVE phys state: 5: LinkUp rate: 100 Gb/sec (4X EDR) link_layer: InfiniBand

Infiniband device 'mlx5_1' port 1 status: default gid: fe80:0000:0000:0000:ec0d:9a03:0044:d608 base lid: 0xd sm lid: 0x9 state: 4: ACTIVE phys state: 5: LinkUp rate: 100 Gb/sec (4X EDR) link_layer: InfiniBand


Infiniband device 'mlx5_2' port 1 status: default gid: fe80:0000:0000:0000:ec0d:9a03:006e:fef2 base lid: 0xc sm lid: 0x9 state: 4: ACTIVE phys state: 5: LinkUp rate: 100 Gb/sec (4X EDR) link_layer: InfiniBand

Infiniband device 'mlx5_3' port 1 status: default gid: fe80:0000:0000:0000:ec0d:9a03:0044:bda4 base lid: 0xe sm lid: 0x9 state: 4: ACTIVE phys state: 5: LinkUp rate: 100 Gb/sec (4X EDR) link_layer: InfiniBand

+ for i in 0 1 2 3 + ibhosts -C mlx5_0 Ca : 0x248a0703001f0426 ports 1 "dgx02 HCA-3" Ca : 0x98039b0300a8b706 ports 1 "Mellanox Technologies Aggregation Node" Ca : 0x248a0703001f0786 ports 1 "dgx02 HCA-4" Ca : 0xec0d9a03006efef2 ports 1 "dgx01 HCA-3" Ca : 0xec0d9a030044bda4 ports 1 "dgx01 HCA-4" Ca : 0x248a0703001ef46e ports 1 "dgx02 HCA-2" Ca : 0x248a0703001ef3a2 ports 1 "dgx02 HCA-1" Ca : 0xec0d9a030044d608 ports 1 "dgx01 HCA-2" Ca : 0x98039b03005cd071 ports 1 "fscc-fab3-3-a HCA-3" Ca : 0x98039b03005cbe79 ports 1 "fscc-fab3-3-b HCA-5" Ca : 0x98039b03005cbe78 ports 1 "fscc-fab3-3-b HCA-4" Ca : 0x98039b03005cbdd0 ports 1 "fscc-fab3-3-b HCA-2" Ca : 0x98039b03005cd081 ports 1 "fscc-fab3-3-a HCA-5" Ca : 0x98039b03005cd070 ports 1 "fscc-fab3-3-a HCA-2" Ca : 0x98039b03005cd080 ports 1 "fscc-fab3-3-a HCA-4" Ca : 0x98039b03005cbdd1 ports 1 "fscc-fab3-3-b HCA-3" Ca : 0xec0d9a030044d2d0 ports 1 "dgx01 HCA-1" + for i in 0 1 2 3 + ibhosts -C mlx5_1 Ca : 0x248a0703001f0426 ports 1 "dgx02 HCA-3" Ca : 0x98039b0300a8b706 ports 1 "Mellanox Technologies Aggregation Node" Ca : 0xec0d9a03006efef2 ports 1 "dgx01 HCA-3" Ca : 0x248a0703001f0786 ports 1 "dgx02 HCA-4" Ca : 0xec0d9a030044bda4 ports 1 "dgx01 HCA-4" Ca : 0x248a0703001ef46e ports 1 "dgx02 HCA-2" Ca : 0x248a0703001ef3a2 ports 1 "dgx02 HCA-1" Ca : 0xec0d9a030044d2d0 ports 1 "dgx01 HCA-1" Ca : 0x98039b03005cd071 ports 1 "fscc-fab3-3-a HCA-3" Ca : 0x98039b03005cbe79 ports 1 "fscc-fab3-3-b HCA-5" Ca : 0x98039b03005cbe78 ports 1 "fscc-fab3-3-b HCA-4" Ca : 0x98039b03005cbdd0 ports 1 "fscc-fab3-3-b HCA-2" Ca : 0x98039b03005cd081 ports 1 "fscc-fab3-3-a HCA-5" Ca : 0x98039b03005cd070 ports 1 "fscc-fab3-3-a HCA-2" Ca : 0x98039b03005cd080 ports 1 "fscc-fab3-3-a HCA-4" Ca : 0x98039b03005cbdd1 ports 1 "fscc-fab3-3-b HCA-3" Ca : 0xec0d9a030044d608 ports 1 "dgx01 HCA-2" + for i in 0 1 2 3 + ibhosts -C mlx5_2 Ca : 0x248a0703001f0426 ports 1 "dgx02 HCA-3" Ca : 0x98039b0300a8b706 ports 1 "Mellanox Technologies Aggregation Node" Ca : 0x248a0703001f0786 ports 1 "dgx02 HCA-4" Ca : 0xec0d9a030044bda4 ports 1 "dgx01 HCA-4"


Ca : 0x248a0703001ef3a2 ports 1 "dgx02 HCA-1" Ca : 0x248a0703001ef46e ports 1 "dgx02 HCA-2" Ca : 0xec0d9a030044d608 ports 1 "dgx01 HCA-2" Ca : 0xec0d9a030044d2d0 ports 1 "dgx01 HCA-1" Ca : 0x98039b03005cd071 ports 1 "fscc-fab3-3-a HCA-3" Ca : 0x98039b03005cbe79 ports 1 "fscc-fab3-3-b HCA-5" Ca : 0x98039b03005cbe78 ports 1 "fscc-fab3-3-b HCA-4" Ca : 0x98039b03005cbdd0 ports 1 "fscc-fab3-3-b HCA-2" Ca : 0x98039b03005cd081 ports 1 "fscc-fab3-3-a HCA-5" Ca : 0x98039b03005cd070 ports 1 "fscc-fab3-3-a HCA-2" Ca : 0x98039b03005cd080 ports 1 "fscc-fab3-3-a HCA-4" Ca : 0x98039b03005cbdd1 ports 1 "fscc-fab3-3-b HCA-3" Ca : 0xec0d9a03006efef2 ports 1 "dgx01 HCA-3" + for i in 0 1 2 3 + ibhosts -C mlx5_3 Ca : 0x248a0703001f0426 ports 1 "dgx02 HCA-3" Ca : 0x98039b0300a8b706 ports 1 "Mellanox Technologies Aggregation Node" Ca : 0x248a0703001f0786 ports 1 "dgx02 HCA-4" Ca : 0xec0d9a03006efef2 ports 1 "dgx01 HCA-3" Ca : 0x248a0703001ef46e ports 1 "dgx02 HCA-2" Ca : 0x248a0703001ef3a2 ports 1 "dgx02 HCA-1" Ca : 0xec0d9a030044d608 ports 1 "dgx01 HCA-2" Ca : 0xec0d9a030044d2d0 ports 1 "dgx01 HCA-1" Ca : 0x98039b03005cd071 ports 1 "fscc-fab3-3-a HCA-3" Ca : 0x98039b03005cbe79 ports 1 "fscc-fab3-3-b HCA-5" Ca : 0x98039b03005cbe78 ports 1 "fscc-fab3-3-b HCA-4" Ca : 0x98039b03005cbdd0 ports 1 "fscc-fab3-3-b HCA-2" Ca : 0x98039b03005cd081 ports 1 "fscc-fab3-3-a HCA-5" Ca : 0x98039b03005cd070 ports 1 "fscc-fab3-3-a HCA-2" Ca : 0x98039b03005cd080 ports 1 "fscc-fab3-3-a HCA-4" Ca : 0x98039b03005cbdd1 ports 1 "fscc-fab3-3-b HCA-3" Ca : 0xec0d9a030044bda4 ports 1 "dgx01 HCA-4" + ib_write_bw dgx02 -d mlx5_1 Conflicting CPU frequency values detected: 3403.930000 != 3029.296000. CPU Frequency is not max. ------RDMA_Write BW Test Dual-port : OFF Device : mlx5_1 Number of qps : 1 Transport type : IB Connection type : RC Using SRQ : OFF TX depth : 128 CQ Moderation : 100 Mtu : 4096[B] Link type : IB Max inline data : 0[B] rdma_cm QPs : OFF Data ex. method : Ethernet ------local address: LID 0x0d QPN 0x16cb PSN 0xa2ff90 RKey 0x038f5c VAddr 0x007fa65e670000 remote address: LID 0x12 QPN 0x2251 PSN 0x914a1d RKey 0x0a82f8 VAddr 0x007f451bb10000 ------#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps] 65536 5000 11300.40 11264.11 0.180226 ------

You should see the proper information about all the InfiniBand interfaces and also observe a throughput of at least 11,000 MB/s for a single point-to-point InfiniBand EDR connection on a DGX-1 system.


4.2 Prepare Persistent Volumes with IBM Spectrum Scale CSI

The overall amount of data used for model building, model evaluation, or inference in enterprise AI environments is typically huge, and multiple users (e.g. data scientists) require access to the shared data in parallel with their containerized applications running on multiple worker nodes in the Red Hat OpenShift cluster. Only a parallel file system like IBM Spectrum Scale can meet all these requirements while at the same time providing extreme scalability and ensuring security, reliability, and high performance.

IBM Spectrum Scale offers parallel access to data in a global namespace from every worker node in the cluster without the need to duplicate (copy or move) the huge amounts of data needed for model training, model validation, or inference. With additional features like Active File Management (AFM), which extends the global namespace across globally dispersed IBM Spectrum Scale clusters, and with data access and data ingestion through the SMB, NFS, and S3 object protocols, it truly meets the expectations of a universal “data lake”.

In this section, we introduce a basic example of how to provision a persistent volume (PV) with IBM Spectrum Scale CSI to provide access to existing training data in IBM Spectrum Scale for containerized applications running AI workloads.

The IBM Spectrum Scale CSI driver provides:
- dynamic provisioning of new persistent volumes (PVs) for containers in a self-service manner based on storage classes (see Dynamic Provisioning; a minimal storage class sketch follows this list), as well as
- static provisioning (see Static Provisioning) for exposing existing paths and data in IBM Spectrum Scale to containerized workloads in Red Hat OpenShift.
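In this paper we rely on static provisioning, but for completeness the following is a minimal sketch of a storage class for dynamic provisioning with IBM Spectrum Scale CSI. The parameter names follow the IBM Spectrum Scale CSI documentation, the file system name and cluster ID are taken from our setup, and the storage class name is only an example; adjust these values to your environment:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ibm-spectrum-scale-csi-fileset
provisioner: spectrumscale.csi.ibm.com
parameters:
  # file system on the remote ESS3000 cluster that backs the dynamically created volumes
  volBackendFs: "ess3000_4M"
  clusterId: "215057217487177715"
reclaimPolicy: Delete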

In this Redpaper, we are using static provisioning to share access to a given directory named adas which is located in the ess3000_4M IBM Spectrum Scale file system on the IBM ESS3000. This directory holds all our training data, models and training scripts. You can also think of using different directories and persistent volumes for training data (e.g. shared input data with read only permissions), an individual user's workspace for training scripts (individual read/write access), and models (shared output data with read/write access).

Once claimed, persistent volumes (PVs) are bound to a given namespace and cannot be used across different namespaces. So in order to make this directory available to users with different namespaces, the system admin needs to create one or more persistent volumes referencing this specific path in IBM Spectrum Scale (just adjust the name of the persistent volume accordingly in the following YAML example, e.g. pv01, pv02, pv03, etc., depending on how many namespaces require access to the data):
# oc apply -f nv-pv01.yaml
# cat nv-pv01.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: "adas-data-pv01"
  labels:
    type: data
    dept: adas
spec:
  storageClassName: static
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  csi:
    driver: spectrumscale.csi.ibm.com
    volumeHandle: "16217308676014575381;099B6A7A:5EB99743;path=/gpfs/ess3000_4M/adas"

Here the following needs to be specified in the csi: volumeHandle stanza:
- the local IBM Spectrum Scale cluster ID (e.g., 16217308676014575381, as shown by mmlscluster),
- the file system UID (e.g., 099B6A7A:5EB99743, as shown by mmlsfs ess3000_4M --uid), and
- the directory path (e.g., /gpfs/ess3000_4M/adas) to the directory that we want to make accessible for the containerized AI workloads.

In addition, we annotate the volume with a storageClassName “static” (make sure there is no real storage class with this name) and use labels like type: data and dept: adas to allow a user to bind a persistent volume (PV) with these specific attributes to a persistent volume claim (PVC) referencing these very labels to match.

Once the persistent volume (or a set of these volumes) is created by the system admin, a user can request this volume to be bound to his or her namespace by issuing a persistent volume claim (PVC) using the storageClassName: static and a selector to match the labels type: data and dept: adas.
# oc apply -f nv-data-pvc.yaml
# cat nv-data-pvc.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: "adas-data-pvc"
spec:
  storageClassName: static
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  selector:
    matchLabels:
      type: data
      dept: adas

Once a user successfully binds the persistent volume to his or her namespace it stays bound to the user's namespace until the user deletes the persistent volume claim.
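A quick way to check the binding is to list the persistent volume claim and the persistent volume; the claim should show the status Bound and reference the matching volume:
# oc get pvc adas-data-pvc
# oc get pv adas-data-pv01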

ReadWriteMany (RWX) as access mode for the volume ensures that it can be used in multiple pods in the user's namespace in parallel and across physical worker nodes at the same time. This is convenient for running MPI jobs with multiple worker pods across multiple nodes that all need access to the same data.

In the pod's YAML the persistent volume claim can be applied as follows to mount the data in IBM Spectrum Scale under /gpfs/ess3000_4M/adas/ to the directory /workspace within the container:
spec:
  containers:
  - name: your-container-name
    image: nvcr.io/nvidia/tensorflow:20.03-tf2-py3
    volumeMounts:
    - name: a2d2-data
      mountPath: /workspace
  volumes:
  - name: a2d2-data
    persistentVolumeClaim:
      claimName: adas-data-pvc
      readOnly: false
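Once such a pod is running, a simple way to confirm that the data is visible inside the container is to list the mount path (replace <pod-name> with the name of your running pod):
# oc exec <pod-name> -- ls -l /workspace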

4.3 MPI Job Definition

We use the Message Passing Interface (MPI) to scale distributed AI workloads out across GPUs and worker nodes. For that, we make use of the MPI Operator project on GitHub, which makes it easy to run distributed AI workloads as MPI jobs on Kubernetes and Red Hat OpenShift.

With the MPI Operator installed, we can use the MPIJob resource to run a multi-GPU, multi-node AI workload with a single MPI job on a Red Hat OpenShift cluster. In this paper, we are launching deep neural network (DNN) training jobs using the NVIDIA GPU Cloud (NGC) TensorFlow 2 image (nvcr.io/nvidia/tensorflow:20.03-tf2-py3) in the user's namespace with the following MPIJob YAML:
# cat nv-tf2-job-a2d2.yaml
apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
  name: tf2-a2d2-16x01x02-gpu
spec:
  slotsPerWorker: 1
  cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: tf2-a2d2-16x01x02-gpu
            image: nvcr.io/nvidia/tensorflow:20.03-tf2-py3
            imagePullPolicy: IfNotPresent
            command:
            - mpirun
            - -np
            - "16"
            - -wdir
            - "/workspace/scripts/tf2_comparison/hvd"
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - NCCL_IB_DISABLE=0
            - -x
            - NCCL_NET_GDR_LEVEL=1
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - main.py
            - --model_dir=checkpt
            - --batch_size=16
            - --exec_mode=train
            - --max_steps=16000
    Worker:
      replicas: 16
      template:
        spec:
          serviceAccount: mpi
          serviceAccountName: mpi
          containers:
          - name: tf2-a2d2-16x01x02-gpu
            image: nvcr.io/nvidia/tensorflow:20.03-tf2-py3
            imagePullPolicy: IfNotPresent
            env:
            - name: IBV_DRIVERS
              value: "/usr/lib/libibverbs/libmlx5"
            securityContext:
              runAsUser: 1075481000
              runAsGroup: 1075481000
              capabilities:
                add: [ "IPC_LOCK" ]
            resources:
              limits:
                nvidia.com/gpu: 1
                rdma/shared_ib0: 1
                rdma/shared_ib1: 1
                rdma/shared_ib2: 1
                rdma/shared_ib3: 1
            volumeMounts:
            - name: a2d2-data
              mountPath: /workspace
            - name: dshm
              mountPath: /dev/shm
          volumes:
          - name: a2d2-data
            persistentVolumeClaim:
              claimName: adas-data-pvc
              readOnly: false
          - name: dshm
            emptyDir:
              medium: Memory


This MPI job has one launcher pod that executes the mpirun command across 16 worker pods, each running with a single GPU (nvidia.com/gpu: 1) and having access to all four InfiniBand ports (rdma/shared_ibX: 1; X=0,1,2,3) on the DGX-1 systems. The worker pods run under the previously created service account mpi in the user's namespace and utilize the additional IPC_LOCK capability of the scc-for-mpi security context constraints (SCC). In addition, we also specify a user and group ID for the worker pods matching the UID/GID of the data directory.

The NCCL environment variables NCCL_IB_DISABLE=0 and NCCL_NET_GDR_LEVEL=1 explicitly enable InfiniBand RDMA and GPUDirect RDMA support on the worker pods, which provides the best performance in our setup. These environment variables can be used to switch between the following NCCL modes:
- TCP mode: NCCL_IB_DISABLE=1, NCCL_NET_GDR_LEVEL=0
- InfiniBand without GPUDirect RDMA: NCCL_IB_DISABLE=0, NCCL_NET_GDR_LEVEL=0
- InfiniBand with GPUDirect RDMA: NCCL_IB_DISABLE=0, NCCL_NET_GDR_LEVEL=1

In addition to the persistent volume claim (PVC) adas-data-pvc, which mounts the /gpfs/ess3000_4M/adas directory of the IBM Spectrum Scale file system to the /workspace directory in the worker pod's NGC TensorFlow v2 container, we also mount a POSIX shared memory volume at /dev/shm (backed by an emptyDir volume with medium: Memory) into the container, which provides optimized communication path options for NCCL under certain conditions.

The /workspace directory in each worker pod provides access to both the training data under /workspace/dataset, and to the Python scripts at /workspace/scripts, where the code of the DNN training is located. The checkpoints of each training run are written to the same directory that provides the Python scripts.

The mpirun command changes the active working directory in the worker pods to /workspace/scripts/tf2_comparison/hvd and starts 16 tasks, one in each of the 16 single-GPU worker pods, executing the Python script main.py with the options --model_dir=checkpt --batch_size=16 --exec_mode=train --max_steps=16000. We reduced max_steps to 16,000 so that we can run multiple training runs in an acceptable time. The training would typically run with many more steps and last for days rather than just hours.

4.4 Connectivity Tests with NVIDIA Collective Communications Library

The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node collective communication primitives that are performance optimized for NVIDIA GPUs. NCCL provides routines such as all-gather, all-reduce, broadcast, reduce, and reduce-scatter, that are optimized to achieve high bandwidth and low latency over PCIe, NVLink, and other high-speed interconnects.


NCCL supports an arbitrary number of GPUs installed in a single node or across multiple nodes and can be used in either single- or multi-process (e.g., MPI) applications. The code with examples is provided through GitHub.

The system admin can quickly check GPU communication with NCCL on each DGX-1 worker node (8 GPUs per DGX-1) by running a Kubernetes job as follows:
# oc apply -f nv-nccl-batch-job.yaml
# cat nv-nccl-batch-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nv-nccl
spec:
  template:
    spec:
      nodeName: dgx01.ocp4.scale.ibm.com
      containers:
      - name: nv-nccl
        image: nvcr.io/nvidia/tensorflow:20.03-tf2-py3
        imagePullPolicy: IfNotPresent
        command: ["/bin/sh","-c"]
        args: ["nvidia-smi && nvidia-smi topo -m && git clone https://github.com/NVIDIA/nccl-tests.git && cd nccl-tests && make && ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 8"]
        resources:
          limits:
            nvidia.com/gpu: 8
      restartPolicy: Never

Under nodeName you can specify the DGX-1 worker node on which you want to run this job. The output of the completed job (Example 4-1, retrieved with oc logs job.batch/nv-nccl) provides information about the GPUs (nvidia-smi), the GPU topology (nvidia-smi topo -m), and a table with the measurement results of the NCCL all_reduce_perf test, here running on 8 GPUs (-g 8) and scanning message sizes from 8 bytes (-b 8) to 256 MB (-e 256M).

Example 4-1 GPU information, GPU topology, and NCCL measurement results for single-node batch job using 8 GPUs [root@fscc-sr650-6 nccl]# oc logs job.batch/nv-nccl Mon Jul 6 12:00:05 2020 +------+ | NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 | |------+------+------+ | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | |======+======+======| | 0 Tesla V100-SXM2... On | 00000000:06:00.0 Off | 0 | | N/A 32C P0 58W / 300W | 0MiB / 16160MiB | 0% Default | +------+------+------+ | 1 Tesla V100-SXM2... On | 00000000:07:00.0 Off | 0 | | N/A 34C P0 44W / 300W | 0MiB / 16160MiB | 0% Default | +------+------+------+ | 2 Tesla V100-SXM2... On | 00000000:0A:00.0 Off | 0 | | N/A 34C P0 42W / 300W | 0MiB / 16160MiB | 0% Default | +------+------+------+ | 3 Tesla V100-SXM2... On | 00000000:0B:00.0 Off | 0 | | N/A 34C P0 43W / 300W | 0MiB / 16160MiB | 0% Default | +------+------+------+


| 4 Tesla V100-SXM2... On | 00000000:85:00.0 Off | 0 | | N/A 33C P0 42W / 300W | 0MiB / 16160MiB | 0% Default | +------+------+------+ | 5 Tesla V100-SXM2... On | 00000000:86:00.0 Off | 0 | | N/A 36C P0 41W / 300W | 0MiB / 16160MiB | 1% Default | +------+------+------+ | 6 Tesla V100-SXM2... On | 00000000:89:00.0 Off | 0 | | N/A 36C P0 42W / 300W | 0MiB / 16160MiB | 0% Default | +------+------+------+ | 7 Tesla V100-SXM2... On | 00000000:8A:00.0 Off | 0 | | N/A 34C P0 42W / 300W | 0MiB / 16160MiB | 0% Default | +------+------+------+

+------+ | Processes: GPU Memory | | GPU PID Type Process name Usage | |======| | No running processes found | +------+ GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 mlx5_0 mlx5_1 mlx5_2 mlx5_3 CPU Affinity GPU0 X NV1 NV1 NV2 NV2 SYS SYS SYS PIX PHB SYS SYS 0-19,40-59 GPU1 NV1 X NV2 NV1 SYS NV2 SYS SYS PIX PHB SYS SYS 0-19,40-59 GPU2 NV1 NV2 X NV2 SYS SYS NV1 SYS PHB PIX SYS SYS 0-19,40-59 GPU3 NV2 NV1 NV2 X SYS SYS SYS NV1 PHB PIX SYS SYS 0-19,40-59 GPU4 NV2 SYS SYS SYS X NV1 NV1 NV2 SYS SYS PIX PHB 20-39,60-79 GPU5 SYS NV2 SYS SYS NV1 X NV2 NV1 SYS SYS PIX PHB 20-39,60-79 GPU6 SYS SYS NV1 SYS NV1 NV2 X NV2 SYS SYS PHB PIX 20-39,60-79 GPU7 SYS SYS SYS NV1 NV2 NV1 NV2 X SYS SYS PHB PIX 20-39,60-79 mlx5_0 PIX PIX PHB PHB SYS SYS SYS SYS X PHB SYS SYS mlx5_1 PHB PHB PIX PIX SYS SYS SYS SYS PHB X SYS SYS mlx5_2 SYS SYS SYS SYS PIX PIX PHB PHB SYS SYS X PHB mlx5_3 SYS SYS SYS SYS PHB PHB PIX PIX SYS SYS PHB X

Legend:

X = Self SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) PIX = Connection traversing at most a single PCIe bridge [...] # nThread 1 nGpus 8 minBytes 8 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 # # Using devices # Rank 0 Pid 1 on nv-nccl-7f9ns device 0 [0x06] Tesla V100-SXM2-16GB # Rank 1 Pid 1 on nv-nccl-7f9ns device 1 [0x07] Tesla V100-SXM2-16GB # Rank 2 Pid 1 on nv-nccl-7f9ns device 2 [0x0a] Tesla V100-SXM2-16GB # Rank 3 Pid 1 on nv-nccl-7f9ns device 3 [0x0b] Tesla V100-SXM2-16GB # Rank 4 Pid 1 on nv-nccl-7f9ns device 4 [0x85] Tesla V100-SXM2-16GB # Rank 5 Pid 1 on nv-nccl-7f9ns device 5 [0x86] Tesla V100-SXM2-16GB # Rank 6 Pid 1 on nv-nccl-7f9ns device 6 [0x89] Tesla V100-SXM2-16GB # Rank 7 Pid 1 on nv-nccl-7f9ns device 7 [0x8a] Tesla V100-SXM2-16GB # # out-of-place in-place # size count type redop time algbw busbw error time algbw busbw error # (B)(elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s) 8 2 float sum 40.92 0.00 0.00 2e-07 44.50 0.00 0.00 1e-07 16 4 float sum 38.08 0.00 0.00 6e-08 40.94 0.00 0.00 6e-08 32 8 float sum 41.30 0.00 0.00 6e-08 39.27 0.00 0.00 6e-08 64 16 float sum 43.86 0.00 0.00 6e-08 38.48 0.00 0.00 6e-08 128 32 float sum 42.10 0.00 0.01 6e-08 43.15 0.00 0.01 6e-08 256 64 float sum 41.70 0.01 0.01 6e-08 43.80 0.01 0.01 6e-08 512 128 float sum 38.83 0.01 0.02 6e-08 44.29 0.01 0.02 6e-08


1024 256 float sum 44.58 0.02 0.04 2e-07 45.88 0.02 0.04 2e-07 2048 512 float sum 39.82 0.05 0.09 2e-07 46.95 0.04 0.08 2e-07 4096 1024 float sum 39.32 0.10 0.18 2e-07 42.10 0.10 0.17 2e-07 8192 2048 float sum 41.99 0.20 0.34 2e-07 39.79 0.21 0.36 2e-07 16384 4096 float sum 45.90 0.36 0.62 2e-07 43.01 0.38 0.67 2e-07 32768 8192 float sum 40.47 0.81 1.42 2e-07 44.78 0.73 1.28 2e-07 65536 16384 float sum 43.01 1.52 2.67 2e-07 41.39 1.58 2.77 2e-07 131072 32768 float sum 51.51 2.54 4.45 2e-07 47.67 2.75 4.81 2e-07 262144 65536 float sum 48.17 5.44 9.52 5e-07 47.02 5.58 9.76 5e-07 524288 131072 float sum 59.04 8.88 15.54 5e-07 60.19 8.71 15.24 5e-07 1048576 262144 float sum 100.4 10.45 18.28 5e-07 99.56 10.53 18.43 5e-07 2097152 524288 float sum 118.9 17.64 30.88 5e-07 119.6 17.54 30.69 5e-07 4194304 1048576 float sum 187.8 22.33 39.08 5e-07 183.7 22.83 39.95 5e-07 8388608 2097152 float sum 291.0 28.83 50.45 5e-07 287.4 29.18 51.07 5e-07 16777216 4194304 float sum 355.2 47.23 82.66 5e-07 352.5 47.60 83.30 5e-07 33554432 8388608 float sum 557.2 60.22 105.38 5e-07 553.3 60.65 106.13 5e-07 67108864 16777216 float sum 994.8 67.46 118.05 5e-07 987.4 67.96 118.94 5e-07 134217728 33554432 float sum 1862.1 72.08 126.14 5e-07 1863.6 72.02 126.03 5e-07 268435456 67108864 float sum 3671.2 73.12 127.96 5e-07 3637.3 73.80 129.15 5e-07

The NCCL table provides information about NCCL communication latencies and bandwidths per scan size. The time is useful with small sizes, to measure the constant overhead (or latency) associated with operations, while bandwidth is typically of interest for large scan sizes. It is important that the latency for small scan sizes is in the range of two-digit microseconds (us) in a setup like ours with 100 Gbps EDR InfiniBand, as shown here. Should you see latencies in the range of thousands of microseconds, then your configuration is probably not using InfiniBand but Ethernet instead. In this case, you need to investigate your RDMA setup and InfiniBand network configuration.

We can also run the same task as a single-node MPI job with just one worker pod using 8 GPUs on a single DGX-1 worker node as follows: # oc apply -f nv-nccl-mpi-job-01x08.yaml # cat nv-nccl-mpi-job-01x08.yaml apiVersion: kubeflow.org/v1alpha2 kind: MPIJob metadata: name: nccl-test-mpi spec: slotsPerWorker: 1 cleanPodPolicy: Running mpiReplicaSpecs: Launcher: replicas: 1 template: spec: containers: - image: nvcr.io/nvidia/tensorflow:20.03-tf2-py3 name: nccl-test-mpi env: - name: IBV_DRIVERS value: "/usr/lib/libibverbs/libmlx5" command: - mpirun - -np - "1" - -bind-to - none


- -map-by - slot - -x - NCCL_DEBUG=INFO - -x - NCCL_IB_DISABLE=0 - -x - NCCL_NET_GDR_LEVEL=1 - -x - LD_LIBRARY_PATH - -x - PATH - -mca - pml - ob1 - -mca - btl - ^openib - /workspace/other/nccl/all_reduce_perf - -b - "8" - -e - 256M - -f - "2" - -g - "8" Worker: replicas: 1 template: spec: serviceAccount: mpi serviceAccountName: mpi containers: - name: nccl-test-mpi image: nvcr.io/nvidia/tensorflow:20.03-tf2-py3 imagePullPolicy: IfNotPresent securityContext: runAsUser: 1075481000 runAsGroup: 1075481000 capabilities: add: [ "IPC_LOCK" ] env: - name: IBV_DRIVERS value: "/usr/lib/libibverbs/libmlx5" resources: limits: nvidia.com/gpu: 8 rdma/shared_ib0: 1 rdma/shared_ib1: 1 rdma/shared_ib2: 1 rdma/shared_ib3: 1 volumeMounts: - name: a2d2-data mountPath: /workspace


- mountPath: /dev/shm name: dshm volumes: - name: a2d2-data persistentVolumeClaim: claimName: adas-data-pvc readOnly: false - name: dshm emptyDir: medium: Memory

The results should be similar to the ones seen from the batch job above (e.g. nv-nccl-batch-job.yaml).

Next we can run a multi-node MPI job with 2 pods on two DGX-1 worker nodes each of them using 8 GPUs (so we have 16 GPUs in total) as follows: # oc apply -f nv-nccl-mpi-job.yaml # cat nv-nccl-mpi-job.yaml apiVersion: kubeflow.org/v1alpha2 kind: MPIJob metadata: name: nccl-test-mpi spec: slotsPerWorker: 1 cleanPodPolicy: Running mpiReplicaSpecs: Launcher: replicas: 1 template: spec: containers: - image: nvcr.io/nvidia/tensorflow:20.03-tf2-py3 name: nccl-test-mpi env: - name: IBV_DRIVERS value: "/usr/lib/libibverbs/libmlx5" command: - mpirun - -np - "2" - -bind-to - none - -map-by - slot - -x - NCCL_DEBUG=INFO - -x - NCCL_IB_DISABLE=0 - -x - NCCL_NET_GDR_LEVEL=1 - -x - LD_LIBRARY_PATH - -x - PATH - -mca


- pml - ob1 - -mca - btl - ^openib - /workspace/other/nccl/all_reduce_perf - -b - "8" - -e - 256M - -f - "2" - -g - "8" Worker: replicas: 2 template: spec: serviceAccount: mpi serviceAccountName: mpi containers: - name: nccl-test-mpi image: nvcr.io/nvidia/tensorflow:20.03-tf2-py3 imagePullPolicy: IfNotPresent securityContext: runAsUser: 1075481000 runAsGroup: 1075481000 capabilities: add: [ "IPC_LOCK" ] env: - name: IBV_DRIVERS value: "/usr/lib/libibverbs/libmlx5" resources: limits: nvidia.com/gpu: 8 rdma/shared_ib0: 1 rdma/shared_ib1: 1 rdma/shared_ib2: 1 rdma/shared_ib3: 1 volumeMounts: - name: a2d2-data mountPath: /workspace - mountPath: /dev/shm name: dshm volumes: - name: a2d2-data persistentVolumeClaim: claimName: adas-data-pvc readOnly: false - name: dshm emptyDir: medium: Memory

and verify that we still see acceptable latencies for small scan sizes in the two-digit microseconds (us) range, as shown in Example 4-2.


Example 4-2 NCCL measurement results of a multi-node MPI job with two 8-GPU worker pods distributed across two DGX-1 worker nodes (16 GPUs total) # Using devices # Rank 0 Pid 30 on nccl-test-mpi-worker-0 device 0 [0x06] Tesla V100-SXM2-16GB # Rank 1 Pid 30 on nccl-test-mpi-worker-0 device 1 [0x07] Tesla V100-SXM2-16GB # Rank 2 Pid 30 on nccl-test-mpi-worker-0 device 2 [0x0a] Tesla V100-SXM2-16GB # Rank 3 Pid 30 on nccl-test-mpi-worker-0 device 3 [0x0b] Tesla V100-SXM2-16GB # Rank 4 Pid 30 on nccl-test-mpi-worker-0 device 4 [0x85] Tesla V100-SXM2-16GB # Rank 5 Pid 30 on nccl-test-mpi-worker-0 device 5 [0x86] Tesla V100-SXM2-16GB # Rank 6 Pid 30 on nccl-test-mpi-worker-0 device 6 [0x89] Tesla V100-SXM2-16GB # Rank 7 Pid 30 on nccl-test-mpi-worker-0 device 7 [0x8a] Tesla V100-SXM2-16GB # Rank 8 Pid 30 on nccl-test-mpi-worker-1 device 0 [0x06] Tesla V100-SXM2-16GB # Rank 9 Pid 30 on nccl-test-mpi-worker-1 device 1 [0x07] Tesla V100-SXM2-16GB # Rank 10 Pid 30 on nccl-test-mpi-worker-1 device 2 [0x0a] Tesla V100-SXM2-16GB # Rank 11 Pid 30 on nccl-test-mpi-worker-1 device 3 [0x0b] Tesla V100-SXM2-16GB # Rank 12 Pid 30 on nccl-test-mpi-worker-1 device 4 [0x85] Tesla V100-SXM2-16GB # Rank 13 Pid 30 on nccl-test-mpi-worker-1 device 5 [0x86] Tesla V100-SXM2-16GB # Rank 14 Pid 30 on nccl-test-mpi-worker-1 device 6 [0x89] Tesla V100-SXM2-16GB # Rank 15 Pid 30 on nccl-test-mpi-worker-1 device 7 [0x8a] Tesla V100-SXM2-16GB # # out-of-place in-place # size count type redop time algbw busbw error time algbw busbw error # (B)(elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s) nccl-test-mpi-worker-0:30:30 [0] NCCL INFO Launch mode Group/CGMD 8 2 float sum 52.72 0.00 0.00 4e-07 50.09 0.00 0.00 4e-07 16 4 float sum 58.60 0.00 0.00 2e-07 49.80 0.00 0.00 2e-07 32 8 float sum 58.46 0.00 0.00 2e-07 61.82 0.00 0.00 1e-07 64 16 float sum 52.95 0.00 0.00 1e-07 50.00 0.00 0.00 1e-07 128 32 float sum 54.79 0.00 0.00 1e-07 60.03 0.00 0.00 1e-07 256 64 float sum 60.63 0.00 0.01 1e-07 51.02 0.01 0.01 1e-07 512 128 float sum 51.87 0.01 0.02 1e-07 51.36 0.01 0.02 1e-07 1024 256 float sum 52.68 0.02 0.04 4e-07 51.93 0.02 0.04 4e-07 2048 512 float sum 62.35 0.03 0.06 4e-07 49.45 0.04 0.08 4e-07 4096 1024 float sum 55.25 0.07 0.14 4e-07 58.66 0.07 0.13 4e-07 8192 2048 float sum 63.85 0.13 0.24 5e-07 58.16 0.14 0.26 5e-07 16384 4096 float sum 87.96 0.19 0.35 5e-07 78.23 0.21 0.39 5e-07 32768 8192 float sum 104.6 0.31 0.59 5e-07 100.1 0.33 0.61 5e-07 65536 16384 float sum 103.7 0.63 1.19 5e-07 107.4 0.61 1.14 5e-07 131072 32768 float sum 110.2 1.19 2.23 5e-07 110.4 1.19 2.23 5e-07 262144 65536 float sum 118.5 2.21 4.15 5e-07 119.4 2.20 4.12 5e-07 524288 131072 float sum 123.7 4.24 7.95 5e-07 144.1 3.64 6.82 5e-07 1048576 262144 float sum 167.5 6.26 11.74 5e-07 159.4 6.58 12.34 5e-07 2097152 524288 float sum 247.7 8.47 15.88 5e-07 241.3 8.69 16.29 5e-07 4194304 1048576 float sum 367.9 11.40 21.37 5e-07 376.4 11.14 20.90 5e-07 8388608 2097152 float sum 556.2 15.08 28.28 5e-07 512.2 16.38 30.71 5e-07 16777216 4194304 float sum 918.7 18.26 34.24 5e-07 907.6 18.48 34.66 5e-07 33554432 8388608 float sum 1594.9 21.04 39.45 5e-07 1557.2 21.55 40.40 5e-07 67108864 16777216 float sum 2950.1 22.75 42.65 5e-07 2954.5 22.71 42.59 5e-07 134217728 33554432 float sum 4777.7 28.09 52.67 5e-07 4781.1 28.07 52.64 5e-07 268435456 67108864 float sum 9115.8 29.45 55.21 5e-07 9115.3 29.45 55.22 5e-07 # Out of bounds values : 0 OK # Avg bus bandwidth : 12.3088

Again, should you observe latencies in the range of thousands of microseconds, then your configuration is probably not using InfiniBand RDMA, and you need to look into the job's log to check whether InfiniBand connections are properly established and used for the communication queues. Watch for NCCL INFO NET/IB messages like:


nccl-test-mpi-worker-1:30:30 [0] NCCL INFO NET/IB : Using [0]mlx5_3:1/IB [1]mlx5_2:1/IB [2]mlx5_1:1/IB [3]mlx5_0:1/IB ; OOB eth0:10.128.7.148<0>

to ensure that the RDMA ports (e.g., [0]mlx5_3:1/IB, ...) are discovered and used by NCCL as available communication paths.
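A quick way to check this after the job has finished is to search the launcher pod's log (which collects the mpirun output of all workers) for these messages; the pod name below is a placeholder for the launcher pod of your MPIJob as shown by oc get pods:
# oc logs <mpi-launcher-pod-name> | grep "NET/IB"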

Instead of running two pods with 8 GPUs, it is often more convenient to assign only 1 GPU to a single pod (the smallest compute unit that can be scheduled on Red Hat OpenShift or Kubernetes) and scale the total number of GPUs for your DNN training job simply by the number of worker pods assigned to the job.

Note: Here we assume that we will only have one container per pod requesting GPUs to run the AI workload. In general, you can have more than one container per pod although you would not scale your application using multiple containers in the same pod.

This offers the highest granularity for Red Hat OpenShift resource allocation when scheduling pods for DNN workload. This way, a much better resource utilization and scheduling can be achieved. The following example schedules 16 pods across both DGX-1 systems with each pod using a single GPU (giving 16 GPUs in total for the job): # oc apply -f nv-nccl-mpi-job-16x01.yaml # cat nv-nccl-mpi-job-16x01.yaml apiVersion: kubeflow.org/v1alpha2 kind: MPIJob metadata: name: nccl-test-mpi spec: slotsPerWorker: 1 cleanPodPolicy: Running mpiReplicaSpecs: Launcher: replicas: 1 template: spec: containers: - image: nvcr.io/nvidia/tensorflow:20.03-tf2-py3 name: nccl-test-mpi env: - name: IBV_DRIVERS value: "/usr/lib/libibverbs/libmlx5" command: - mpirun - -np - "16" - -bind-to - none - -map-by - slot - -x - NCCL_DEBUG=INFO - -x - NCCL_IB_DISABLE=0 - -x - NCCL_NET_GDR_LEVEL=1 - -x


- LD_LIBRARY_PATH - -x - PATH - -mca - pml - ob1 - -mca - btl - ^openib - /workspace/other/nccl/all_reduce_perf - -b - "8" - -e - 256M - -f - "2" - -g - "1" Worker: replicas: 16 template: spec: serviceAccount: mpi serviceAccountName: mpi containers: - name: nccl-test-mpi image: nvcr.io/nvidia/tensorflow:20.03-tf2-py3 imagePullPolicy: IfNotPresent securityContext: runAsUser: 1075481000 runAsGroup: 1075481000 capabilities: add: [ "IPC_LOCK" ] env: - name: IBV_DRIVERS value: "/usr/lib/libibverbs/libmlx5" resources: limits: nvidia.com/gpu: 1 rdma/shared_ib0: 1 rdma/shared_ib1: 1 rdma/shared_ib2: 1 rdma/shared_ib3: 1 volumeMounts: - name: a2d2-data mountPath: /workspace - mountPath: /dev/shm name: dshm volumes: - name: a2d2-data persistentVolumeClaim: claimName: adas-data-pvc readOnly: false - name: dshm emptyDir:


medium: Memory

The log of the successfully completed job will show NCCL messages like:
nccl-test-mpi-worker-2:30:30 [0] NCCL INFO NET/IB : Using [0]mlx5_3:1/IB [1]mlx5_2:1/IB [2]mlx5_1:1/IB [3]mlx5_0:1/IB ; OOB eth0:10.131.4.63<0>

nccl-test-mpi-worker-11:30:39 [0] NCCL INFO Ring 00 : 10[89000] -> 11[89000] [receive] via NET/IB/0/GDRDMA

indicating that InfiniBand connections (e.g., mlx5_3, mlx5_2, mlx5_1, mlx5_0) are successfully used for the NCCL communication paths and that GPUDirect RDMA is enabled (GDRDMA). The NCCL results of this configuration (Example 4-3) show slightly higher latencies due to the increased parallelization costs, but this approach allows for a more granular and robust way to schedule pods and DNN workloads across all available GPU resources in a Red Hat OpenShift cluster.

Example 4-3 NCCL measurement results of a multi-node MPI job with sixteen 1-GPU worker pods evenly distributed across two DGX-1 worker nodes (16 GPUs total) # Using devices # Rank 0 Pid 30 on nccl-test-mpi-worker-0 device 0 [0x85] Tesla V100-SXM2-16GB # Rank 1 Pid 30 on nccl-test-mpi-worker-1 device 0 [0x0b] Tesla V100-SXM2-16GB # Rank 2 Pid 30 on nccl-test-mpi-worker-2 device 0 [0x07] Tesla V100-SXM2-16GB # Rank 3 Pid 30 on nccl-test-mpi-worker-3 device 0 [0x0a] Tesla V100-SXM2-16GB # Rank 4 Pid 30 on nccl-test-mpi-worker-4 device 0 [0x06] Tesla V100-SXM2-16GB # Rank 5 Pid 30 on nccl-test-mpi-worker-5 device 0 [0x86] Tesla V100-SXM2-16GB # Rank 6 Pid 31 on nccl-test-mpi-worker-6 device 0 [0x0a] Tesla V100-SXM2-16GB # Rank 7 Pid 30 on nccl-test-mpi-worker-7 device 0 [0x06] Tesla V100-SXM2-16GB # Rank 8 Pid 30 on nccl-test-mpi-worker-8 device 0 [0x8a] Tesla V100-SXM2-16GB # Rank 9 Pid 30 on nccl-test-mpi-worker-9 device 0 [0x8a] Tesla V100-SXM2-16GB # Rank 10 Pid 30 on nccl-test-mpi-worker-10 device 0 [0x89] Tesla V100-SXM2-16GB # Rank 11 Pid 30 on nccl-test-mpi-worker-11 device 0 [0x89] Tesla V100-SXM2-16GB # Rank 12 Pid 30 on nccl-test-mpi-worker-12 device 0 [0x0b] Tesla V100-SXM2-16GB # Rank 13 Pid 30 on nccl-test-mpi-worker-13 device 0 [0x85] Tesla V100-SXM2-16GB # Rank 14 Pid 30 on nccl-test-mpi-worker-14 device 0 [0x86] Tesla V100-SXM2-16GB # Rank 15 Pid 30 on nccl-test-mpi-worker-15 device 0 [0x07] Tesla V100-SXM2-16GB # # out-of-place in-place # size count type redop time algbw busbw error time algbw busbw error # (B)(elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s) 8 2 float sum 78.92 0.00 0.00 2e-07 72.73 0.00 0.00 1e-07 16 4 float sum 72.27 0.00 0.00 1e-07 77.27 0.00 0.00 1e-07 32 8 float sum 73.74 0.00 0.00 1e-07 72.11 0.00 0.00 1e-07 64 16 float sum 82.74 0.00 0.00 1e-07 81.37 0.00 0.00 6e-08 128 32 float sum 83.59 0.00 0.00 6e-08 79.80 0.00 0.00 6e-08 256 64 float sum 78.04 0.00 0.01 6e-08 78.35 0.00 0.01 6e-08 512 128 float sum 78.49 0.01 0.01 6e-08 83.42 0.01 0.01 6e-08 1024 256 float sum 88.29 0.01 0.02 2e-07 91.61 0.01 0.02 2e-07 2048 512 float sum 107.5 0.02 0.04 5e-07 109.6 0.02 0.04 5e-07 4096 1024 float sum 121.4 0.03 0.06 5e-07 116.4 0.04 0.07 5e-07 8192 2048 float sum 122.8 0.07 0.13 5e-07 121.9 0.07 0.13 5e-07 16384 4096 float sum 143.3 0.11 0.21 5e-07 135.9 0.12 0.23 5e-07 32768 8192 float sum 152.4 0.22 0.40 5e-07 158.5 0.21 0.39 5e-07 65536 16384 float sum 193.0 0.34 0.64 5e-07 189.7 0.35 0.65 5e-07 131072 32768 float sum 233.7 0.56 1.05 5e-07 220.0 0.60 1.12 5e-07 262144 65536 float sum 313.7 0.84 1.57 5e-07 326.3 0.80 1.51 5e-07 524288 131072 float sum 594.7 0.88 1.65 5e-07 598.3 0.88 1.64 5e-07 1048576 262144 float sum 754.3 1.39 2.61 5e-07 769.7 1.36 2.55 5e-07 2097152 524288 float sum 1068.4 1.96 3.68 5e-07 1072.8 1.95 3.67 5e-07


4194304 1048576 float sum 1748.3 2.40 4.50 5e-07 1742.8 2.41 4.51 5e-07 8388608 2097152 float sum 2966.9 2.83 5.30 5e-07 2955.6 2.84 5.32 5e-07 16777216 4194304 float sum 5637.3 2.98 5.58 5e-07 5617.9 2.99 5.60 5e-07 33554432 8388608 float sum 11270 2.98 5.58 5e-07 11231 2.99 5.60 5e-07 67108864 16777216 float sum 22060 3.04 5.70 5e-07 22009 3.05 5.72 5e-07 134217728 33554432 float sum 44146 3.04 5.70 5e-07 43998 3.05 5.72 5e-07 268435456 67108864 float sum 88376 3.04 5.70 5e-07 88034 3.05 5.72 5e-07

4.5 Multi-GPU, Multi-Node GPU Scaling with TensorFlow ResNet-50 Benchmark

We run a first test to verify how a DNN training workload will scale out with additional GPUs in an orchestrated fashion using a single MPI job in Red Hat OpenShift. Each DGX-1 system has eight NVIDIA V100 SXM2 GPUs with 16GB GPU memory.

AI workloads may or may not scale equally well with the number of GPUs used in parallel. This depends heavily on the infrastructure, the model, and the implementation. The TensorFlow ResNet-50 benchmark with synthetic data (in the following referred to as the ResNet-50 benchmark) typically scales well with the number of GPUs. Therefore, we use the ResNet-50 benchmark with synthetic data from the TensorFlow benchmarks GitHub repository to explore the GPU scaling behavior of MPI jobs in Red Hat OpenShift 4. We clone the repository into our adas directory in IBM Spectrum Scale that we have made available in Red Hat OpenShift through a statically provisioned persistent volume (PV) that can be mounted into each MPI worker pod as described in section 4.2, “Prepare Persistent Volumes with IBM Spectrum Scale CSI” on page 29.
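For reference, cloning the benchmark repository into the shared directory can be done on any node that has the IBM Spectrum Scale file system mounted; the target path below is an assumption that matches the /workspace/other/benchmarks path used in the MPIJob YAML that follows:
# cd /gpfs/ess3000_4M/adas/other
# git clone https://github.com/tensorflow/benchmarks.git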

First, we schedule MPI jobs with multi-GPU pods using 1, 2, 4, 8 GPUs in a single worker pod on a single DGX-1 system and finally 16 GPUs in two 8-GPU worker pods on two DGX-1 systems to scale beyond physical node boundaries (horizontal scaling).

The MPI Job YAML looks as follows for the 02x08 GPU case (16 GPUs total): # cat nv-tf2-job-resnet50-02x08x02.yaml apiVersion: kubeflow.org/v1alpha2 kind: MPIJob metadata: name: tf2-resnet50-02x08x02 spec: slotsPerWorker: 8 cleanPodPolicy: Running mpiReplicaSpecs: Launcher: replicas: 1 template: spec: containers: - name: tf2-resnet50-02x08x02 image: nvcr.io/nvidia/tensorflow:20.03-tf2-py3 imagePullPolicy: IfNotPresent command: - mpirun - -np - "16" - -bind-to


- none - -map-by - slot - -x - NCCL_DEBUG=INFO - -x - NCCL_IB_DISABLE=0 - -x - NCCL_NET_GDR_LEVEL=1 - -x - LD_LIBRARY_PATH - -x - PATH - -mca - pml - ob1 - -mca - btl - ^openib - python - /workspace/other/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py - --model=resnet50 - --batch_size=64 - --use_fp16=true - --variable_update=horovod Worker: replicas: 2 template: spec: serviceAccount: mpi serviceAccountName: mpi containers: - name: tf2-resnet50-02x08x02 image: nvcr.io/nvidia/tensorflow:20.03-tf2-py3 imagePullPolicy: IfNotPresent securityContext: runAsUser: 1075481000 runAsGroup: 1075481000 capabilities: add: [ "IPC_LOCK" ] env: - name: IBV_DRIVERS value: "/usr/lib/libibverbs/libmlx5" resources: limits: nvidia.com/gpu: 8 rdma/shared_ib0: 1 rdma/shared_ib1: 1 rdma/shared_ib2: 1 rdma/shared_ib3: 1 volumeMounts: - name: a2d2-data mountPath: /workspace - name: dshm


mountPath: /dev/shm volumes: - name: a2d2-data persistentVolumeClaim: claimName: adas-data-pvc readOnly: false - name: dshm emptyDir: medium: Memory

In Figure 4-1 we can see that the number of images/second increases accordingly (with some scaling costs) when we scale multi-GPU pods from 1 to 8 GPUs, reaching 7096 images/second on a single DGX-1 node, and almost doubles when seamlessly scaling out to 16 GPUs on two DGX-1 systems with 13,745 images/second (about 97% of a perfect doubling of the single-node result).

Figure 4-1 ResNet-50 results when scaling multi-GPU worker pods using 1 to 8 GPUs per pod on a single system and finally 16 GPUs on two DGX-1 systems

Each column in Figure 4-1 (and in Figure 4-2) represents the median value of 5 measurements for the given configuration.

Not every AI workload can make efficient use of multiple GPUs in a single pod. Furthermore, orchestrating jobs with multi-GPU pods requesting 4 or 8 GPUs in a single worker pod might also lead to less optimal scheduling and sub-optimal resource utilization in large clusters, because such a pod will only be scheduled and started when the requested 4 or 8 GPUs are actually available on a single worker node.

Scheduling jobs with only single-GPU pods allows a more granular scheduling policy. A single-GPU worker pod can be scheduled on any worker node in the cluster that has a GPU resource available and can contribute to the overall AI workload of a larger multi-GPU and multi-node MPI job. This will lead to a more balanced and optimal overall resource utilization in a cluster. In addition single-GPU worker pods might also tend to provide a more robust scheduling and runtime behavior.

In the second test case, we schedule MPI jobs with single-GPU pods using 1, 2, 4, and 8 GPUs and the same number of single-GPU worker pods, respectively, on a single DGX-1 system (enforced by using nodeName in the worker pod YAML template). Finally, 16 GPUs in 16 worker pods on two DGX-1 systems scale beyond physical node boundaries (horizontal scaling).

The MPI Job YAML looks as follows for the 16x01 GPU case (16 GPUs total): # cat nv-tf2-job-resnet50-16x01x02.yaml apiVersion: kubeflow.org/v1alpha2 kind: MPIJob metadata: name: tf2-resnet50-16x01x02 spec: slotsPerWorker: 1 cleanPodPolicy: Running mpiReplicaSpecs: Launcher: replicas: 1 template: spec: containers: - name: tf2-resnet50-16x01x02 image: nvcr.io/nvidia/tensorflow:20.03-tf2-py3 imagePullPolicy: IfNotPresent command: - mpirun - -np - "16" - -bind-to - none - -map-by - slot - -x - NCCL_DEBUG=INFO - -x - NCCL_IB_DISABLE=0 - -x - NCCL_NET_GDR_LEVEL=1 - -x - LD_LIBRARY_PATH - -x - PATH - -mca - pml - ob1 - -mca - btl - ^openib - python - /workspace/other/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py - --model=resnet50 - --batch_size=64 - --use_fp16=true - --variable_update=horovod Worker: replicas: 16


template: spec: serviceAccount: mpi serviceAccountName: mpi containers: - name: tf2-resnet50-16x01x02 image: nvcr.io/nvidia/tensorflow:20.03-tf2-py3 imagePullPolicy: IfNotPresent securityContext: runAsUser: 1075481000 runAsGroup: 1075481000 capabilities: add: [ "IPC_LOCK" ] env: - name: IBV_DRIVERS value: "/usr/lib/libibverbs/libmlx5" resources: limits: nvidia.com/gpu: 1 rdma/shared_ib0: 1 rdma/shared_ib1: 1 rdma/shared_ib2: 1 rdma/shared_ib3: 1 volumeMounts: - name: a2d2-data mountPath: /workspace - name: dshm mountPath: /dev/shm volumes: - name: a2d2-data persistentVolumeClaim: claimName: adas-data-pvc readOnly: false - name: dshm emptyDir: medium: Memory

In Figure 4-2 we can see that the number of images/second increases steadily to 5918 images/second when we scale with single-GPU worker pods from 1 to 8 GPUs on a single DGX-1 node, and it almost doubles when seamlessly scaling out to 16 GPUs on two DGX-1 systems with 11,632 images/second (about 98% of a perfect doubling of the 8-pod result). When going to the limit of a single DGX-1 system with 8 GPUs (i.e. the 8-GPU and 16-GPU cases, with 8 single-GPU pods per DGX-1 system), we experience slightly higher costs in overall performance using single-GPU pods compared to multi-GPU pods (11,632 versus 13,745 images/second at 16 GPUs). The 1, 2, and 4 GPU cases, however, show identical performance results for single-GPU and multi-GPU pods.


Figure 4-2 ResNet-50 results when scaling single-GPU worker pods using 1 to 8 GPUs on a single system and finally 16 GPUs on two DGX-1 systems

These results illustrate that we can efficiently and seamlessly scale out the GPU resources beyond physical node boundaries in Red Hat OpenShift for a given AI workload using MPI jobs. Single-GPU pods provide a finer granularity when scheduling jobs and allocating resources in large clusters and thus will lead to a better overall resource utilization across all compute nodes which may well justify the slightly higher performance costs that come with single-GPU worker pods at full utilization compared to multi-GPU worker pods.



Chapter 5. Deep neural network (DNN) Training on A2D2 Semantic Segmentation Dataset

The following sections provide details about the multi-GPU, multi-node training that we performed with the A2D2 dataset.

5.1 Description A2D2 Dataset

Audi recently published the Autonomous Driving Dataset (A2D2) to support academic institutions and commercial start-ups working on autonomous driving research (A2D2, Jacob Geyer et al., 2020; for A2D2 dataset license details see “Related publications” on page 63). The dataset consists of recorded images and labels such as bounding boxes, semantic segmentation, instance segmentation, and data extracted from the automotive bus. The sensor suite consists of six cameras and five LIDAR units, providing full 360° coverage. The recorded data is time-synchronized and mutually registered. There are 41,277 frames with semantic segmentation and point cloud labels. Out of these, 12,497 frames also have 3D bounding box annotations for objects within the field of view of the front camera.

The semantic segmentation dataset features 38 categories. Each pixel in an image is given a label describing the type of object it represents, e.g. pedestrian, car, vegetation, etc.

Figure 5-1 shows an example of a real picture compared to the segmentation picture.


Figure 5-1 Real picture compared to the segmentation picture, showing the color to label match

5.2 Multi-GPU, Multi-Node GPU Scaling Results for DNN Training Jobs

We perform a multi-GPU, multi-node training with the A2D2 dataset and train a segmentation network. Not all classes are considered; the training runs on a reduced set of classes. The frames and labels are from the front camera. A random subset of the original set is used, with a size of ~27,300 image and label pairs. This is roughly one tenth of a representative dataset, which usually contains more than 300,000 well-selected frames per network and task. The subset used for the training in this paper has a total size of 92 GB.

The training MPI job is utilizing 16 GPUs on both DGX-1 systems and is executed as follows:
# oc apply -f nv-tf2-job-a2d2.yaml
# cat nv-tf2-job-a2d2.yaml
apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
  name: tf2-a2d2-16x01x02-gpu
spec:
  slotsPerWorker: 1
  cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: tf2-a2d2-16x01x02-gpu
            image: nvcr.io/nvidia/tensorflow:20.03-tf2-py3
            imagePullPolicy: IfNotPresent
            command:
            - mpirun
            - -np
            - "16"
            - -wdir
            - "/workspace/scripts/tf2_comparison/hvd"
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - NCCL_IB_DISABLE=0
            - -x
            - NCCL_NET_GDR_LEVEL=1
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - main.py
            - --model_dir=checkpt
            - --batch_size=16
            - --exec_mode=train
            - --max_steps=16000
    Worker:
      replicas: 16
      template:
        spec:
          serviceAccount: mpi
          serviceAccountName: mpi
          containers:
          - name: tf2-a2d2-16x01x02-gpu
            image: nvcr.io/nvidia/tensorflow:20.03-tf2-py3
            imagePullPolicy: IfNotPresent
            env:
            - name: IBV_DRIVERS
              value: "/usr/lib/libibverbs/libmlx5"
            securityContext:
              runAsUser: 1075481000
              runAsGroup: 1075481000
              capabilities:
                add: [ "IPC_LOCK" ]
            resources:
              limits:
                nvidia.com/gpu: 1
                rdma/shared_ib0: 1
                rdma/shared_ib1: 1
                rdma/shared_ib2: 1
                rdma/shared_ib3: 1
            volumeMounts:
            - name: a2d2-data
              mountPath: /workspace
            - name: dshm
              mountPath: /dev/shm
          volumes:
          - name: a2d2-data
            persistentVolumeClaim:
              claimName: adas-data-pvc
              readOnly: false
          - name: dshm
            emptyDir:
              medium: Memory

The job is based on the MPI job YAML described in 4.3, “MPI Job Definition” on page 31. The job schedules 16 single-GPU pods for the training and uses the NGC TensorFlow v2 image (nvcr.io/nvidia/tensorflow:20.03-tf2-py3).
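The mpirun command starts one Python process per GPU, so a script such as main.py must initialize Horovod and pin each process to its local GPU. The actual training script is not reproduced here; the following is a generic, hedged Horovod/TensorFlow 2 sketch of the pattern such a script typically follows (the model, learning rate, and data pipeline are placeholders, not the values used in this PoC).

# Generic Horovod + TensorFlow 2 training pattern (sketch only, not the actual main.py).
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each MPI rank to its local GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.applications.ResNet50(weights=None, classes=38)  # placeholder; the real job trains a segmentation network

# Scale the learning rate by the number of workers and wrap the optimizer.
opt = tf.keras.optimizers.SGD(learning_rate=0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

model.compile(optimizer=opt, loss="sparse_categorical_crossentropy")

callbacks = [
    # Broadcast the initial weights from rank 0 so all workers start identically.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# dataset = ...  # tf.data pipeline reading the image/label pairs under /workspace
# model.fit(dataset, steps_per_epoch=1000, epochs=16, callbacks=callbacks,
#           verbose=1 if hvd.rank() == 0 else 0)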

The training data and TensorFlow Python scripts are stored in the IBM Spectrum Scale file system at the following locations:

Training data: /gpfs/ess3000_4M/adas/dataset/a2d2_8channel
Python scripts: /gpfs/ess3000_4M/adas/scripts/tf2_comparison/hvd/

Using IBM Spectrum Scale CSI with a statically provisioned volume as created in 4.2, “Prepare Persistent Volumes with IBM Spectrum Scale CSI” on page 29, the directory /gpfs/ess3000_4M/adas is mounted under /workspace in each TensorFlow container in all worker pods of the MPI job. With IBM Spectrum Scale as a distributed parallel file system, all worker pods have shared read/write access to the same data across physical node boundaries.

We compare a multi-GPU training run using a single MPI job with 8 single-GPU pods on a single DGX-1 node (setting Launcher: np=8 and Worker: replicas=8 in the above YAML) with a training job using 16 single-GPU pods on two DGX-1 nodes, utilizing Red Hat OpenShift 4.4.3 as the container orchestration and scheduling platform.

Figure 5-2 shows the median value of the elapsed time of five training runs for each configuration.


Figure 5-2 Median value of the elapsed time of five training runs on a single DGX-1 node (8 GPUs) and on two DGX-1 nodes (16 GPUs) in Red Hat OpenShift

By scaling the training job from a single DGX-1 node (8 GPUs) to two DGX-1 nodes (16 GPUs) in Red Hat OpenShift, we observe that we can seamlessly scale out across physical node boundaries and reduce the job runtime from 1328 seconds to 740 seconds (to 55.7% of the single-node runtime). This corresponds to a speedup of 1.8x, that is, only a 10% performance cost compared to the ideal speedup of 2.0x.
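The speedup and scaling efficiency quoted above follow directly from the measured median runtimes:

# Scaling arithmetic for the measured median runtimes.
t_8gpu = 1328.0   # seconds, 1x DGX-1 (8 GPUs)
t_16gpu = 740.0   # seconds, 2x DGX-1 (16 GPUs)

speedup = t_8gpu / t_16gpu          # ~1.79x (vs. ideal 2.0x)
runtime_ratio = t_16gpu / t_8gpu    # ~0.557 -> 55.7% of the single-node runtime
efficiency = speedup / 2.0          # ~0.90 -> ~90% scaling efficiency

print(f"speedup: {speedup:.2f}x, runtime ratio: {runtime_ratio:.1%}, efficiency: {efficiency:.1%}")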

Repeated and frequent training and validation cycles of neural networks are common in the autonomous vehicle industry. Being able to conveniently and efficiently scale out these workloads across GPU resources is essential to reduce run times and improve time to new insights and better models.

The results in this paper show that multi-GPU and multi-node (horizontal) scaling of GPU resources for DNN training jobs works seamlessly and efficiently with Red Hat OpenShift 4.4.3 as the container orchestration platform, DGX worker nodes, NVIDIA Mellanox InfiniBand RDMA network infrastructure, and IBM Spectrum Scale as backend storage for a scalable high-performance “data lake”.

5.3 Application

A high rate of innovation is crucial when developing deep learning based solutions. One of the most challenging engineering tasks is building the right training and validation datasets. Many DL practitioners agree that they are essentially building datasets rather than networks (see “Revisiting the Unreasonable Effectiveness of Data”).

In particular, validating a trained neural network in a variety of relevant situations provides confidence in its performance as well as valuable insights into its weaknesses. Multiple validation datasets might be built to validate certain aspects or the model's performance in a subset of the operational domain.


A rich set of metadata for the dataset is crucial to build the necessary datasets and to understand a neural network's strengths and weaknesses.

In the context of this paper we use IBM Spectrum Discover as a metadata store. The metadata available in the A2D2 dataset was maintained in IBM Spectrum Discover, which allows us to run queries in SQL. We are aware that many more metadata tags are necessary in a production environment.

In the following we give a few examples of how a metadata store helps to derive next steps for developing the dataset or the network:

 For example, the DNN under test performs well on large objects and worse on smaller ones, provided the object occurrence is balanced. This also happened with the randomly chosen subset of our training dataset. As shown in Figure 5-3, we evaluated a training run after just 20,000 steps using a confusion matrix. For a perfect predictor, the diagram would show only a diagonal line from top left to bottom right, meaning that the network classified all pixels of every class correctly. Figure 5-3 shows the confusion matrix for this training evaluation. Note that a logarithmic scale is used for the color coding.

Figure 5-3 Confusion matrix presenting the training evaluation
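A pixel-wise confusion matrix such as the one in Figure 5-3 can be accumulated from ground-truth and predicted class masks. The following is a minimal sketch (the class count is taken from the dataset description; the evaluation loop and the log-scale plot are only indicated in comments):

# Sketch: accumulate a pixel-wise confusion matrix over an evaluation set.
# conf[i, j] counts pixels of true class i that were predicted as class j.
import numpy as np

NUM_CLASSES = 38

def update_confusion(conf: np.ndarray, true_mask: np.ndarray, pred_mask: np.ndarray) -> None:
    """Add the pixel counts of one frame to the running confusion matrix."""
    idx = true_mask.astype(np.int64).reshape(-1) * NUM_CLASSES + pred_mask.astype(np.int64).reshape(-1)
    conf += np.bincount(idx, minlength=NUM_CLASSES ** 2).reshape(NUM_CLASSES, NUM_CLASSES)

conf = np.zeros((NUM_CLASSES, NUM_CLASSES), dtype=np.int64)
# for true_mask, pred_mask in evaluation_pairs:      # (H, W) arrays of class indices
#     update_confusion(conf, true_mask, pred_mask)

# Figure 5-3 uses a logarithmic color scale, which could be reproduced with matplotlib:
# import matplotlib.pyplot as plt
# from matplotlib.colors import LogNorm
# plt.imshow(conf + 1, norm=LogNorm()); plt.xlabel("predicted class"); plt.ylabel("true class")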

There are two obvious observations:
– There are classes (the very bright ones on the diagonal, such as class 1, RD normal street; class 9, Sky; and class 11, Nature object) that can be detected very well. With the metadata in IBM Spectrum Discover, we can see that this correlates with the frequency at which those pixels occur. A query like the following can help us to understand potential reasons:

https://localhost/db2whrest/v1/sql_query -X POST -dselect sum(int(t_rd_normal_street.value)) / ((count(t_rd_normal_street.value)*1920*1208)/100) from t_rd_normal_street with ur;

Result: 22.22743915175

In contrast, other classes such as “Road Blocks” appear in only 0.85 percent of the pixels, and this frequency of object occurrence is not sufficient.
– Furthermore, it is obvious that the data used for the evaluation has no representatives of the bicycle class (19). Thus the validation dataset needs to be extended. A query like the following can help to select frames that are known to contain the required class:

https://localhost/db2whrest/v1/sql_query -X POST -dselect platform,datasource,filename from metaocean,t_front,t_center,t_bicycle where t_front.fkey=metaocean.fkey and int(t_front.value)=1 and t_center.fkey=metaocean.fkey and int(t_center.value)=1 and t_bicycle.fkey=metaocean.fkey and int(t_bicycle.value)>0

Result:
0,"IBM COS","a2d2","camera_lidar_semantic/20181107_133445/camera/cam_front_center/20181107133445_camera_frontcenter_000021093.png"
1,"IBM COS","a2d2","camera_lidar_semantic/20181107_132300/camera/cam_front_center/20181107132300_camera_frontcenter_000001645.png"
2,"IBM COS","a2d2","camera_lidar_semantic/20181107_132300/camera/cam_front_center/20181107132300_camera_frontcenter_000004159.png"
3,"IBM COS","a2d2","camera_lidar_semantic/20181107_132300/camera/cam_front_center/20181107132300_camera_frontcenter_000002707.png"
4,"IBM COS","a2d2","camera_lidar_semantic/20181107_132300/camera/cam_front_center/20181107132300_camera_frontcenter_000004093.png"
...
 The DNN performs poorly in a certain situation, for example close to bridges, and the lighting could be a source of such a weakness. A join by location with a map database, together with the used training data, would allow such a query.
 The DNN does not perform as expected on a specific class even though it has plenty of training data. Here the training data might be too simple and not as complicated as the validation dataset. Selecting those frames with an active learning uncertainty estimate would allow reducing the dataset to the most informative training samples.

It is common in the automotive industry that each training frame carries several hundred metadata tags. Data from the recording ego vehicle and the subdomains of the operational design domain (ODD) are typical sources of metadata. In our experiments, we rely solely on the A2D2 dataset without the help of other data sources. When the dataset was uploaded, the available information was extracted, such as the presence of certain classes and their pixel counts.
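Deriving such per-frame tags (class presence and pixel count) from the label masks is straightforward. The sketch below reuses the class-index masks from 5.1; the tag names are illustrative and do not reflect the actual IBM Spectrum Discover schema used in this setup.

# Sketch: derive per-frame metadata tags (class presence and pixel count) from a label mask.
# Tag/field names are illustrative and do not reflect the actual Spectrum Discover schema.
import json
import numpy as np

CLASS_NAMES = {1: "rd_normal_street", 9: "sky", 11: "nature_object", 19: "bicycle"}  # illustrative subset

def frame_tags(mask: np.ndarray) -> dict:
    """Return {tag_name: pixel_count} for one class-index mask."""
    values, counts = np.unique(mask, return_counts=True)
    pixel_counts = dict(zip(values.tolist(), counts.tolist()))
    return {f"t_{name}": pixel_counts.get(idx, 0) for idx, name in CLASS_NAMES.items()}

# Example: tags for one frame, ready to be ingested as custom metadata
# mask = label_image_to_mask(".../20181107133445_camera_frontcenter_000021093.png")
# print(json.dumps(frame_tags(mask), indent=2))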

Most of this metadata is derived by joining information from multiple sources, e.g. high-definition maps, with the recorded data.

Usually this data is pre-joined for practical reasons and to avoid joins at query time. Pre-joining easily leads to a large number of attributes, and sooner or later each of those attributes is likely to be needed. Data minimalism on the metadata is therefore not encouraged: it is more practical to keep all attributes than to repeatedly suffer from missing ones and add them later with considerable effort.


Here we leverage IBM Spectrum Discover as the metadata store. It works together with IBM storage solutions, and metadata for each frame of the A2D2 dataset was created in IBM Spectrum Discover to show how it can be used in this context.

Building a representative validation dataset is a challenge as well. For autonomous vehicles, validation datasets are usually much larger than the training datasets.

Figure 5-4 shows the process of building a representative validation dataset.

Figure 5-4 Building a representative validation dataset

Training and validation of the neural networks are done on a very frequent basis. Validating the trained DNN against a large validation dataset is critical to understand its weaknesses. For representative results, the validation dataset is significantly larger than the training dataset and reflects the distribution of the target operational domain. Large-scale DNN validation is an inference job that is rolled out to multiple workers in the cluster in a fan-out manner. It follows a map-reduce pattern where each worker takes care of a certain partition of the validation data; the results are then collected and aggregated.
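In its simplest form, such a fan-out validation job partitions the frame list across worker ranks, runs inference on each partition, and reduces the per-worker confusion matrices into a single result. The following is a hedged sketch using mpi4py; the manifest file name and the inference step are placeholders, not part of the actual pipeline.

# Sketch: map-reduce style DNN validation with MPI (mpi4py).
# Each rank validates its own partition of the frame list, then the per-rank
# confusion matrices are summed on rank 0.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

NUM_CLASSES = 38
with open("validation_frames.txt") as f:       # assumed manifest, one frame path per line
    frames = [line.strip() for line in f]

my_frames = frames[rank::size]                 # simple round-robin partition

local_conf = np.zeros((NUM_CLASSES, NUM_CLASSES), dtype=np.int64)
# for frame in my_frames:
#     true_mask, pred_mask = run_inference(frame)        # placeholder inference step
#     update_confusion(local_conf, true_mask, pred_mask) # see the confusion-matrix sketch in 5.3

global_conf = np.zeros_like(local_conf)
comm.Reduce(local_conf, global_conf, op=MPI.SUM, root=0)

if rank == 0:
    per_class_recall = np.diag(global_conf) / np.maximum(global_conf.sum(axis=1), 1)
    print("per-class recall:", per_class_recall)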

5.4 Integrating IBM Spectrum Discover and IBM Spectrum LSF to find the right data based on labels

For demonstration purposes, we add IBM Spectrum LSF as a workload manager and IBM Spectrum Discover as a metadata search engine to find the right data for our inference job and to automate the workflow.

IBM Spectrum Discover is a modern metadata management software that provides data insights for exabyte-scale heterogeneous file, object, backup, and archive storage on premises and in the cloud. The software easily connects to these data sources to rapidly ingest, consolidate, and index metadata for billions of files and objects.

IBM Spectrum Discover provides a rich metadata layer that enables storage administrators, data stewards, and data scientists to efficiently manage, classify, and gain insights from massive amounts of data. It improves storage economics, helps mitigate risk, and accelerates large-scale analytics to create competitive advantage and speed critical research.

IBM Spectrum LSF is a complete workload management solution for demanding HPC environments. Featuring intelligent, policy-driven scheduling and easy-to-use interfaces for job and workflow management, it helps organizations improve competitiveness by accelerating research and design while controlling costs through superior resource utilization.


For this use case we connected IBM Spectrum Discover to the data source and let it scan the content. The scan was needed because we added IBM Spectrum Discover after the data was stored in the storage system. IBM Spectrum Discover also comes with built-in functionality that reacts to new incoming data, automatically detects its metadata, and catalogs it.

To find the data needed for model training or inference based on label details, a user can run a simple REST query in SQL format against IBM Spectrum Discover, as in the following example:

https://localhost/db2whrest/v1/sql_query -X POST -dselect platform,datasource,filename
  from metaocean,t_front,t_center,t_car,t_bicycle,t_pedestrian
  where t_front.fkey=metaocean.fkey and int(t_front.value)=1
  and t_center.fkey=metaocean.fkey and int(t_center.value)=1
  and t_car.fkey=metaocean.fkey and int(t_car.value)>=10000 and int(t_car.value)<=20000
  and t_bicycle.fkey=metaocean.fkey and int(t_bicycle.value)>=3000 and int(t_bicycle.value)<=5000
  and t_pedestrian.fkey=metaocean.fkey and int(t_pedestrian.value)>=500 and int(t_pedestrian.value)<=6000
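The same query can also be issued programmatically, for example from a job submission script. The following Python sketch posts the SQL statement to the REST endpoint shown above; authentication, certificate handling, and the exact response format are deployment-specific assumptions.

# Sketch: submit an SQL query to the IBM Spectrum Discover REST endpoint and
# collect the matching file names. Auth and response format are assumptions.
import csv
import io
import requests

DISCOVER_URL = "https://localhost/db2whrest/v1/sql_query"
SQL = ("select platform,datasource,filename "
       "from metaocean,t_front,t_center,t_car,t_bicycle,t_pedestrian "
       "where t_front.fkey=metaocean.fkey and int(t_front.value)=1 "
       "and t_center.fkey=metaocean.fkey and int(t_center.value)=1 "
       "and t_car.fkey=metaocean.fkey and int(t_car.value)>=10000 and int(t_car.value)<=20000 "
       "and t_bicycle.fkey=metaocean.fkey and int(t_bicycle.value)>=3000 and int(t_bicycle.value)<=5000 "
       "and t_pedestrian.fkey=metaocean.fkey and int(t_pedestrian.value)>=500 and int(t_pedestrian.value)<=6000")

resp = requests.post(DISCOVER_URL, data=SQL, verify=False)  # adjust auth/verify for your deployment
resp.raise_for_status()

# Assume a CSV-like response: index,"platform","datasource","filename"
files = [row[-1] for row in csv.reader(io.StringIO(resp.text)) if row]
print(f"{len(files)} matching frames, first: {files[0] if files else 'none'}")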

A user could also work with a customized job user interface panel created in IBM Spectrum LSF.

Figure 5-5 shows an example of an IBM Spectrum LSF AV job submission template.

Figure 5-5 IBM Spectrum LSF AV job submission template example

As a user, you start a job by selecting the needed labels as shown in Figure 5-5. IBM Spectrum LSF forwards the request to IBM Spectrum Discover, which answers with a list of files that match the requested label details, as shown in the following example.


0,"IBM COS","a2d2", "camera_lidar_semantic/20181008_095521/camera/cam_front_center/20181008095521_came ra_frontcenter_000038997.png" 1,"IBM COS","a2d2", "camera_lidar_semantic/20180925_135056/camera/cam_front_center/20180925135056_came ra_frontcenter_000062196.png" 2,"IBM COS","a2d2", "camera_lidar_semantic/20181016_125231/camera/cam_front_center/20181016125231_came ra_frontcenter_000005145.png" 3,"IBM COS","a2d2", "camera_lidar_semantic/20180807_145028/camera/cam_front_center/20180807145028_came ra_frontcenter_000010529.png" 4,"IBM COS","a2d2", "camera_lidar_semantic/20181008_095521/camera/cam_front_center/20181008095521_came ra_frontcenter_000031069.png" 5,"IBM COS","a2d2", "camera_lidar_semantic/20181016_095036/camera/cam_front_center/20181016095036_came ra_frontcenter_000041815.png" 6,"IBM COS","a2d2", "camera_lidar_semantic/20181016_095036/camera/cam_front_center/20181016095036_came ra_frontcenter_000020381.png" 7,"IBM COS","a2d2", "camera_lidar_semantic/20180925_124435/camera/cam_front_center/20180925124435_came ra_frontcenter_000045908.png" 8,"IBM COS","a2d2", "camera_lidar_semantic/20180925_124435/camera/cam_front_center/20180925124435_came ra_frontcenter_000043968.png" ...

IBM Spectrum Discover can catalog multiple different storage systems, and it returns the platform and data source together with the file details.

With the help of IBM Spectrum Scale's Active File Management (AFM), the data can be pre-cached close to the AI workloads. Pre-caching helps to keep the accelerators busy because the data to be analyzed is present at the right time, even if relatively slow storage systems or data lakes host the data. It also helps to ensure that the high-performing storage is not over-utilized and does not run out of space.

With the help of this solution, the data location and time of availability for the AI job can be fully abstracted.




Related publications

The publications listed in this section are considered particularly suitable for a more detailed discussion of the topics covered in this paper.

Online resources

These websites are also relevant as further information sources:

 How IBM ESS3000 makes your GPUs fly. A field report based on the Deep Thought project
https://www.spectrumscaleug.org/wp-content/uploads/2020/03/SSSD20DE-How-ESS3000-makes-your-GPUs-fly-A-field-report-based-on-the-Deep-Thought-project-SVA.pdf
 How Volkswagen Tests Autonomous Cars with GPUs and Red Hat OpenShift
https://www.openshift.com/blog/how-volkswagen-tests-autonomous-cars-with-gpus-and-openshift
 NVIDIA Mellanox
https://docs.mellanox.com/pages/releaseview.action?pageId=19804150
 IBM Automotive 2030
https://www.ibm.com/downloads/cas/NWDQPK5B
 IBM Spectrum Scale
https://www.ibm.com/products/scale-out-file-and-object-storage
 IBM Spectrum Scale 5.0.4 Knowledge Center
https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.4/ibmspectrumscale504_welcome.html
 IBM ESS3000
https://www.ibm.com/us-en/marketplace/elastic-storage-system-3000
 IBM Elastic Storage System (ESS) 3000 Version 6.0.0 Knowledge Center
https://www.ibm.com/support/knowledgecenter/SSZL24_6.0.0/ess3000_600_welcome.html
 IBM Spectrum Storage for AI with DGX Systems
https://www.ibm.com/downloads/cas/MNEQGQVP
 Designing and Building End-to-End Data Pipeline Using IBM Spectrum Storage for AI with NVIDIA DGX-2™ Systems
https://www.ibm.com/downloads/cas/GGWQ40KE
 Red Hat Enterprise Linux CoreOS (RHCOS)
https://access.redhat.com/documentation/en-us/openshift_container_platform/4.1/html/architecture/architecture-rhcos
 A2D2: Audi Autonomous Driving Dataset citation

@article{geyer2020a2d2,
  title={{A2D2: Audi Autonomous Driving Dataset}},
  author={Jakob Geyer and Yohannes Kassahun and Mentar Mahmudi and Xavier Ricou and Rupesh Durgesh and Andrew S. Chung and Lorenz Hauswald and Viet Hoang Pham and Maximilian M{\"u}hlegg and Sebastian Dorn and Tiffany Fernandez and Martin J{\"a}nicke and Sudesh Mirashi and Chiragkumar Savani and Martin Sturm and Oleksandr Vorobiov and Martin Oelker and Sebastian Garreis and Peter Schuberth},
  year={2020},
  eprint={2004.06320},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://www.a2d2.audi}
}

– Public License:
https://aev-autonomous-driving-dataset.s3.eu-central-1.amazonaws.com/LICENSE.txt
– This dataset is released under the CC BY-ND 4.0 license. See:
https://creativecommons.org/licenses/by-nd/4.0/
– Liability and Copyright (the licensed material was not modified for this study):
https://www.a2d2.audi/a2d2/en/legal.html
– Driving Dataset Downloads and Citation:
https://www.a2d2.audi/a2d2/en/download.html
– A2D2: Audi Autonomous Driving Dataset paper:
https://arxiv.org/pdf/2004.06320.pdf
 NVIDIA DGX Family
When this PoC was started, the new NVIDIA DGX A100 system had not yet been announced; instead, two DGX-1 systems were deployed. The DGX-1 is a purpose-built system for deep learning with fully integrated hardware and software that can be deployed quickly and easily. DGX-1 features 8 NVIDIA V100 GPU accelerators with Tensor Core architecture connected through NVIDIA NVLink, the NVIDIA high-performance GPU interconnect, in a hybrid cube-mesh network. Together with dual-socket Intel Xeon CPUs and four 100 Gb InfiniBand network interface cards, DGX-1 provides unprecedented performance for deep learning training.
The successor system, the NVIDIA DGX A100, was announced on June 8th, 2020. It is planned to replicate this PoC with the next-generation NVIDIA DGX platform. The NVIDIA DGX A100 is the universal system for all AI workloads, offering unprecedented compute density, performance, and flexibility in the world's first 5 petaFLOPS AI system. The NVIDIA DGX A100 system features the world's most advanced accelerator, the NVIDIA A100 Tensor Core GPU, enabling enterprises to consolidate training, inference, and analytics into a unified, easy-to-deploy AI infrastructure that includes direct access to NVIDIA AI experts. For more details, see the NVIDIA DGX A100 system whitepaper. Architecture details on Ampere and on the new MIG mode are described in NVIDIA's developer blogs.

Help from IBM

IBM Support and downloads
ibm.com/support

IBM Global Services
ibm.com/services



Back cover

REDP-5610-00

ISBN 0738459097

Printed in U.S.A.

ibm.com/redbooks