Powering Artificial Intelligence with High-Performance Computing

White Paper Mission Critical Computing Powering Artificial Intelligence with High-Performance Computing September 2020 SUSE.com Powering Artificial Intelligence with 2020 High-Performance Computing Your business might be starting to recognize that an Today’s HPC systems are based on Linux clusters HPC infrastructure is vital to supporting the AI and running on industry-standard x86 hardware—the analytics applications you need to keep up with the same kind of Linux clusters enterprises are likely al- huge volumes of data and processing to compete in ready using for their big data and cloud architectures. today’s world. Updating your current IT infrastructure to enable HPC to run AI workloads is vital to The HPC community is making it easier than ever to staying competitive in the current fast paced and implement a software-defined HPC infrastructure on changing environment. This paper will discuss the the hardware you already have. The OpenHPC group seven steps necessary to enable an HPC environ- and its members work to simplify the task of adopt- ment in your data center. ing HPC techniques and technologies and applying them to enterprise tasks. For this discussion, we’re broadly defining AI as any system that is trying to solve the kind of problem— How to Bring HPC to Your Organization such as recognizing patterns—that human thinking Here are seven things to keep in mind when con- does. AI encompasses machine learning (ML) and sidering how your organization is going to do AI with Deep Learning (DL) techniques. ML is a process in HPC. which curated examples are used to train a com- puter to recognize patterns. Deep learning is a PLAN YOUR WORKLOAD branch of ML that uses digital neural networks to Your obvious first step is to decide what you are going create systems that learn on their own. to run on your HPC cluster. Consider not only what you will start with, but also how you expect to scale AI applications are leveraged to accelerate competi- the application or applications. tive advantage in many industries: MPI vs Throughput Workloads - What type of work- • Management by exception load will you run? HPC workloads can be broadly • Marketing divided into two categories: parallel distributed (MPI) • Banking workloads and throughput workloads. • Finance • Agriculture Message Passing Interface (MPI) applications are par- • Healthcare allel distributed applications that consist of multiple • Gaming concurrently running processes that communicate • Space Exploration with each other with intensity. • Autonomous Vehicles • Personal assistants (Chat bots) Throughput workloads often require many tasks to • Social Media be run to complete a specific job, with each task running on its own without communication between Optimizing Your Environment to Enable HPC for tasks. One example of a throughput workload would AI Workloads be the rendering the frames of digital content. Each AI systems require scalable computing power. As frame can be computed on its own as well as in the volumes of data grow, so do training times and parallel. computational requirements. There is a lot of information out there on what type HPC environments today no longer imply bespoke of workload fits each environment and application. supercomputers built for government and academic research institutions. In fact, enterprises can build Virtualized vs Bare Metal – Historically HPC work- high-performance infrastructures while dealing with loads have been run on bare metal systems, but this far fewer vendors and far less customization than is starting to change as virtualized environments are most realize. becoming more advanced and offer more value to the enterprise and HPC environments. 2 SUSE.com Powering Artificial Intelligence with 2020 High-Performance Computing There are a number of benefits to running in a vir- and predictably with a wide range of configuration tual environment these days around heterogeneity, and connectivity options. For AI applications, Dell of- security, and flexibility. Many workloads have been fers several purpose-built platforms that feature high successfully virtualized to match the performance numbers of GPUs and dense compute environments: of bare-metal environments. • The PowerEdge C4140 Server is a two-proces- These are just some of the points that need con- sor server with up to four NVIDIA V100 GPUs in sideration when starting to think about your HPC just 1U! This server has a patented interleaved environment. Some others include resiliency, re- GPU design to optimize both space and air- dundancy, and performance. flow for maximum compute performance. The PowerEdge C4140 is available with NVIDIA UPDATING YOUR HARDWARE NVLINK™ direct GPU-to-GPU interconnect de- Your next step is to determine your hardware needs. signed to speed communication between GPUs As mentioned above, HPC can be implemented to- an order of magnitude faster than PCIe. day as a software-defined infrastructure on top of • The PowerEdge C6420 Server has up to four a standard x86 hardware cluster. independent twoprocessor servers in 2U with up to 16 DIMMs, up to 24x 2.5" hard drives and Some of the base components you will need when M.2 boot storage. It’s also available with direct assembling your HPC cluster are the following contact liquid cooling (DCLC) to support higher- wattage processors for increased performance, • Management nodes energy efficiency and racklevel density. • Compute nodes • The DSS 8440 machine learning server is a two- • Network Interconnects processor 4U server with up to 16 GPU accel- • Storage eratros and up to 10 local NVMe and SATA drives for optimized access to data. The DSS 8440 has Networking and Storage will be covered a little later, an open architecture, based on industry stan- but now we want to focus on the Management and dard PCIe fabric, allowing for customization of compute nodes and what options you have. Your accelerators, storage options and network cards. basic HPC cluster consists of at least one management or login node connected to a network of SEGMENTING YOUR NETWORK multiple compute nodes. There may be multiple If you already have a Linux cluster, your existing management nodes used to run cluster-wide ser- top-of rack switches, and Ethernet network may vices, such as monitoring, workloads and storage suffice for your HPC infrastructure. More advanced usage. This depends on the size of the cluster. The HPC interconnects can be utilized to solve specific login nodes are a key component to allow users throughput issues. The diagram below shows how to access the system. Users submit their jobs us- you might segment your network traffic based on ing these login nodes to the worker nodes via a the network capacity of each type of interconnect. workload scheduler. As technology has improved over the years, the components to building a clus- User Login Data transfer Head Compute ter have changed. These days multicore Intel x86- system nodes nodes nodes nodes architecture systems with varying amounts of cores and memory are commonly used for worker nodes. 1 Gb Ethernet Hand ware mangement Dell provides several great options for both man- 1 Gb Ethernet LAN Access 1 Gb Ethernet Internodal traffic agement and compute nodes. Dell Technologies Non-compute storage 10 Gb Ethernet Application storage Database storage PowerEdge servers deliver high performance com- Infiniband Compute storage puting with the latest processors, accelerators, 40 Gb Ethernet Science DMZ storage memory and NVMe storage. You can scale efficiently 3 SUSE.com Powering Artificial Intelligence with 2020 High-Performance Computing Dell EMC switches an exascale future. It packs up to 90 hot service- Dell EMC PowerSwitch S5200 ON Series Switches able 3.5 inch drives in 4U. Available with either one provide state of the art, high density open net- or two server nodes, the DSS 7000 can deliver up working 25GbE top-of-rack and 100GbE spine/leaf to 1.26PB of storage to tackle demanding storage switches to meet the growing demands of today’s environments. HPC/AI compute and storage traffic. These combined with SUSE Enterprise Storage Dell EMC Networking Z9100 ON Series Switches are product will provide the most flexible and robust 10/25/40/50/100GbE fixed switches for high perfor- storage solution for your HPC environment running mance computing environments. With 32 ports of AI workloads. SUSE Enterprise Storage is a soft- 100GbE, 64 ports of 50GbE, 32 ports of 40GbE, 128 ware-defined storage solution powered by Ceph ports of 25GbE or 128 ports 10GbE and two SFP+ designed to help enterprises manage ever-growing ports of 10GbE/1GbE/100MbE, you can conserve data sets. rack space and simplifying migration to 100Gbps. SELECT THE RIGHT MIDDLEWARE Mellanox InfiniBand Much of the magic in HPC happens at the middle- Mellanox® SB7800 series Switch IB 2™ InfiniBand® ware layer, where you enable parallel computing EDR 100Gb/s Switches deliver high bandwidth with through tools such as workload schedulers and low latency for the highest server efficiency and ap- Message Passing Interfaces (MPIs). Luckily, many of plication productivity — ideal for HPC applications. the tools that once came from various vendors are You can get 36 ports at 100Gb/s per port, and can now available as open source. scale out to hundreds of nodes. All the tools you need are packaged together with Gen Z Consortium SUSE® Linux Enterprise Server for High Performance Dell is a founding member of the Gen Z Consortium, Computing - a highly scalable, high performance dedicated to creating a next generation intercon- open-source operating system designed to utilize nect that will bridge existing solutions while en- the power of parallel computing. abling unbounded innovation. SUSE Linux Enterprise for High-Performance Utilize Innovative Storage Solutions Computing comes with several preconfigured sys- Unprecedented growth in the amount of data col- tem roles for HPC. These roles provide a set of lected for analytics, artificial intelligence and other preselected packages as well as an installation high-performance computing makes fast, scalable workflow that will configure the systems to make and resilient storage an imperative.

Powering Artificial Intelligence with High-Performance Computing

Slurm Overview and Elasticsearch Plugin

Intel® SSF Reference Design: Intel® Xeon Phi™ Processor, Intel® OPA

SUSE Linux Enterprise High Performance Computing 15 SP3 Administration Guide Administration Guide SUSE Linux Enterprise High Performance Computing 15 SP3

Version Control Graphical Interface for Open Ondemand

Extending SLURM for Dynamic Resource-Aware Adaptive Batch

Distributed HPC Applications with Unprivileged Containers Felix Abecassis, Jonathan Calmels GPUNVIDIA Computing Beyond Video Games

Cluster Deployment

Sansa - the Supercomputer and Node State Architecture

Intelligent Colocation of HPC Workloads?,??

Winter 2016 Vol

Virtual Machine Checkpoint Storage and Distribution for Simuboost

The Crossroads of Cloud And