Optimizing Performance of HPC Storage Systems

Optimizing performance for reads and writes.

Torben Kling Petersen, PhD
HPC Storage Solutions, Xyratex Ltd
Langstone Road, Havant, Hampshire PO9 1SA, UK
[email protected]

John Fragalla
HPC Storage Solutions, Xyratex International Inc.
46831 Lakeview Blvd, Fremont, California 94538, USA
[email protected]

Abstract — The performance of HPC storage systems depends upon a variety of factors. The results of any of the standard benchmark suites for storage depend not only on the storage architecture, but also on the type of disk drives, the type and design of the interconnect, and the type and number of clients. In addition, each of the benchmark suites has a number of different parameters and test methodologies that require careful analysis to determine the optimal settings for a successful benchmark run. To reliably benchmark a storage solution, every stage of the solution needs to be analyzed, including block and file performance of the RAID, and network and client throughput to the entire filesystem and metadata servers. For a filesystem to perform at peak performance, there needs to be a balance between the actual performance of the disk drives, the SAS chain supporting the RAID sets, the RAID code used (whether hardware RAID controllers or software MD-RAID), the interconnect and finally the clients. This paper describes these issues with respect to the Lustre filesystem. The dependence of benchmark results on various parameters is shown. Using a single storage enclosure consisting of 8 RAID sets (8+2 drives each), it is possible to achieve both read and write performance in excess of 6 GB/s, which translates to more than 36 GB/s per rack of measured client based throughput. This paper will focus on using the performance tools obdfilter-survey and IOR to measure different levels of filesystem performance using Lustre.

Keywords—HPC Storage, Benchmarking

I. INTRODUCTION

Benchmarking system performance has always been an important factor in the HPC industry. This is especially true during the procurement phase of a new system. Although benchmarks seldom, if ever, provide any indication of the real life performance of an application (unless the application itself is benchmarked at the intended scale, which is usually hard before the final system is delivered), a benchmark allows an end user to compare an offered solution to others. In a number of RFPs, customers have written their own benchmarks, which are distributed to vendors with specific instructions on how they should be run. While this is a prudent approach, these benchmark codes rarely show the peak performance of a new solution, as they often were written on older, less capable solutions. For compute systems on the Top500 list, the benchmark of choice is Linpack [1], which tests a large system's floating point performance as a result of the CPU architecture and system interconnect. For storage, however, there is no similar standard and unfortunately no equivalent list. This has led some vendors to state the performance of their products in different ways, such as aggregated theoretical disk bandwidth, which obviously says nothing about the system's actual performance.

In the storage arena, a number of different benchmark suites exist. These can be used to measure a system's I/O performance regardless of whether the solution is based on direct attached storage, network attached storage (such as NFS based solutions) or a parallel distributed filesystem (such as GPFS or Lustre) [2, 3].

For general filesystem I/O, there are benchmark tools such as IOR [4], IOzone [5] and Bonnie++ [6], as well as tools for the metadata performance of said filesystem, such as MDtest [7]. For the purpose of this paper, we focused on obdfilter-survey and IOR.

II. LUSTRE PARALLEL FILESYSTEM

While there are a number of parallel filesystems used in HPC solutions today, Lustre currently dominates with 5 of the top 10 systems and more than 70% of the Top100 [8]. This is also why Xyratex is basing all solutions on this filesystem, and hence this paper will focus on Lustre performance benchmarking.

Lustre is based on the concept of separating metadata from block I/O using a pair of metadata servers (MDS) and a number of object store servers (OSSes), all interconnected with a network fabric (fig. 1).
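To make the roles in fig. 1 concrete: a client attaches to the filesystem by contacting the management server, after which metadata operations go to the MDS and bulk I/O flows directly between the client and the OSSes. A minimal sketch of a client mount, assuming hypothetical node and filesystem names:

# Mount a Lustre filesystem on a client over an InfiniBand (o2ib) fabric;
# "mgs01" and "lustre" are placeholder MGS node and filesystem names.
mount -t lustre mgs01@o2ib:/lustre /mnt/lustre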

Fig. 1 Schematic overview of active components making up the Lustre filesystem.

III. STORAGE SYSTEM ARCHITECTURES

While the choice of filesystem probably has the greatest impact on I/O performance, it is important to remember that the specific architecture of the underlying storage solution must also be understood to achieve good throughput. Factors that affect storage performance range from the choice of RAID system (hardware based, or software such as MD-RAID), disk protocols such as S-ATA and SAS, disk type (even within a range of, say, 2TB NL-SAS drives from the 3 major vendors, I/O characteristics differ considerably), firmware levels of all components, the components of the software stack and tuning parameters (important even on the client side), to the type of interconnect, to mention a few.

It is therefore important to benchmark the underlying hardware before starting client based tests, to set a baseline for the disk backend subsystem. The most commonly used tool for this (at the filesystem level) is obdfilter-survey [9].

IV. SYSTEM ARCHITECTURE

A. Storage Subsystem

The ClusterStor 6000 Lustre based engineered solution consists of a pair of active-passive failover servers with shared storage acting as the system's metadata server (MDS) and metadata management server (MGS). Each object store server consists of a 5 RU enclosure, known as a Scalable Storage Unit (SSU), hosting 84 3.5" disk drive slots and two embedded OSS server modules in the back of the enclosure [10]. The following storage components were used:

• 82 Hitachi Mars-K near line SAS drives (3.5" form factor) with a total capacity of 2 TB each.
• 2 Hitachi Ralston Peak SSDs (2.5" form factor in a 3.5" carrier) with a total capacity of 100 GB each.

The disk drives are configured as 8 RAID 6 (8+2) volumes formatted with ldiskfs, rendering them into 8 object store targets (OSTs), with the remaining 2 drives acting as roaming spares. The 2 SSDs (formatted with RAID 1) were exclusively used for write intent bitmaps (WIBs) and for Linux journals.

The embedded server modules (2 per enclosure) are based on a single socket Intel Xeon CPU, 32 GB RAM and an onboard Mellanox CX3 FDR capable HCA. The ClusterStor architecture uses Lustre version 2.1.3 + Xyratex extensions. The ClusterStor solution benchmarked consists of a total of 2 SSUs, 4 OSS servers, and 16 OSTs.

The storage subsystem is configured with the following Lustre parameters: writethrough cache enabled, read cache enabled and the read cache max filesize = 1M. The MDT, MGS and OSTs leverage software RAID and no data caching. All data is stored directly to disk and committed before an ACK is sent back to the client.

B. Clients and interconnect

The test clients were made up of 64 compute nodes with the following configuration:

• 2x Intel Xeon Westmere processors
• 24 GB RAM
• QDR InfiniBand interface

All clients as well as the ClusterStor SSUs were connected to a Mellanox based FDR capable core switch fabric.

The clients connected to the storage were configured with Lustre client version 1.8.8. Each client is also tuned according to industry best practice, with LRU and checksums disabled, and the maximum number of RPCs in flight set to 32.
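This client tuning is typically applied with lctl. A minimal sketch, assuming Lustre 1.8.x-style tunable names (the paper does not list the literal commands used):

# Disable data checksums on all OSC devices (reduces client CPU overhead):
lctl set_param osc.*.checksums=0
# Pin the LDLM lock LRU size; a fixed value disables dynamic LRU resizing
# (400 is an arbitrary illustrative value):
lctl set_param ldlm.namespaces.*.lru_size=400
# Allow up to 32 concurrent RPCs in flight per OSC:
lctl set_param osc.*.max_rpcs_in_flight=32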

V. TEST METHODOLOGY

To define the optimal benchmark parameters, a number of settings were tested. This paper will focus on the following subset of all user modifiable test parameters:

# pdsh -g oss "TERM=linux thrlo=256 thrhi=256 nobjlo=1 nobjhi=1 rsz=1024K size=32768 obdfilter-survey"
cstor01n04: ost 4 sz 134217728K rsz 1024K obj 4 thr 1024 write 3032.89 [ 713.86, 926.89] rewrite 3064.15 [ 722.83, 848.93] read 3944.49 [ 912.83,1112.82]
cstor01n05: ost 4 sz 134217728K rsz 1024K obj 4 thr 1024 write 3022.43 [ 697.83, 819.86] rewrite 3019.55 [ 705.15, 827.87] read 3959.50 [ 945.20,1125.76]

Fig. 2 Results of an optimized obdfilter-survey run on a single SSU. The individual OSS results are highlighted in bold.
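For reference, the obdfilter-survey invocation shown in fig. 2 breaks down as follows (a commented restatement; parameter descriptions follow the Lustre I/O kit documentation):

# Run obdfilter-survey on every OSS in the pdsh group "oss":
#   thrlo/thrhi   - minimum/maximum number of threads per OST (fixed at 256)
#   nobjlo/nobjhi - minimum/maximum number of objects per OST (fixed at 1)
#   rsz           - record (transfer) size: 1 MB
#   size          - data written per OST, in MB: 32768 (32 GB)
pdsh -g oss "TERM=linux thrlo=256 thrhi=256 nobjlo=1 nobjhi=1 \
  rsz=1024K size=32768 obdfilter-survey"

In the output, "ost 4" shows the 4 OSTs per OSS, "sz 134217728K" the 128 GB aggregate (32 GB x 4 OSTs), and "thr 1024" the 256 threads x 4 OSTs.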

A. Obdfilter-survey

Part of the Lustre I/O kit, the obdfilter-survey script generates sequential I/O from varying numbers of threads and objects (files) to simulate the I/O patterns of a Lustre client. While it can run on a network attached client, it is commonly used as a server side tool.

B. I/O mode

IOR supports a number of I/O modes, such as buffered I/O and direct I/O. In addition, the benchmark suite also supports simulation of several application characteristics, such as file per process (FPP) and single shared file (SSF). In this paper we used both buffered and direct I/O with the file per process methodology. For striping, we used the default Lustre stripe size of 1M and a stripe count of 1.

C. I/O slots per client

IOR supports a number of client side settings defining how the application uses the hardware and I/O subsystem of the client node. When launching a benchmark run, the node from which the job is issued refers to a "machinefile" listing the available client IP addresses or hostnames and their individual settings. Each line in the machinefile was defined as:

'hostname' slots=4 max_slots=12

For the 64 clients, we define each host with a minimum of 4 slots and a maximum number of slots equal to the number of CPU cores, which in this benchmark setup was 12. Slots can be read as the number of IOR tasks per client. (An illustrative machinefile excerpt is shown after the benchmark definitions below.)

The execution of IOR is done via MPI using the --byslot option. One can use --bynode, but --byslot fills the 4 slots in the first node, then the 4 slots in the second node, and so forth. With --bynode, slots are filled one per node, round-robining back to the first node to fill the second slot on each. The benchmarks revealed that --byslot distributed the slots executing IOR evenly across all nodes and yielded better per client results than --bynode.

Under IOR, we defined the block size per node to be 2x the memory size to ensure no caching effect occurs on the clients, in order to truly measure ClusterStor 6000 performance.

D. IOR transfer size

To assess the importance of the IOR transfer size, we ran a number of benchmark runs using different transfer sizes. Numerous tests have been done on other systems where transfer size was used as a variable, but most range from below 1 MB to a maximum of 4 MB. As our previous performance experience with the ClusterStor solution indicated that the system architecture requires significant load to perform well, we decided to vary the transfer size (-t) between 1 and 64 megabytes.

E. Number of client threads

As IOR uses MPI for its execution on multiple nodes, it is possible to control the number of processes each node will use. This is set with the mpirun argument -np. To assess how much load each client needs to produce to achieve maximum throughput on the storage solution, a number of runs were done varying the total number of processes between 4 and 1024.

VI. RESULTS

As this paper is limited in size, we cannot report all iterations performed to find the optimal performance of the ClusterStor solution. We have therefore chosen to discuss the parameters that produce the best results.

A. Obdfilter-survey

As obdfilter-survey is run on an OSS basis (each Xyratex ClusterStor SSU consists of 2 OSSes), the results (fig. 2) show the performance of two OSSes.

The results reveal that a single SSU has a write performance of 6,055 MB/s (75.9 MB/s per disk) and a read performance of 7,904 MB/s (98.8 MB/s per disk).

B. IOR – Buffered I/O mode

For maximum write performance, we found a few interesting results with IOR using the buffered I/O option and file-per-process. Different IOR transfer sizes provided different but favorable results for write performance. As can be seen from the results in figure 3, write performance using buffered I/O is very good and peaks just above 12 GB/s over 2 SSUs.

To assess the maximum performance capabilities, we used the results above to define a few specific benchmark runs with optimized parameters for write performance.

Write optimized benchmark 1:
• Number of Clients: 32
• Number of Slots: 128
• Block size (-b): 12g
• Transfer size (-t): 1m
• Buffered IO
• --byslot option for mpirun
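As referenced in section V.C, an illustrative machinefile excerpt and the resulting rank placement (hostnames are placeholders, not the actual node names used):

client001 slots=4 max_slots=12
client002 slots=4 max_slots=12
# ... one line per client, 64 in total.
# With "mpirun --byslot -np 128", ranks 0-3 land on client001, ranks 4-7 on
# client002, and so on; "--bynode" would instead place one rank per node on
# each round-robin pass.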

Fig. 3 Write performance using IOR buffered I/O mode.
Fig. 4 Read performance using IOR Direct I/O mode.

Benchmark command line:
mpirun -machinefile machinefile.txt -np 128 --byslot ./IOR/src/C/IOR -v -F -t 1m -b 12g -o /xyratex/test.0
Write Results: 12,369.32 MB/s

Write optimized benchmark 2: In the second benchmark run, we changed the transfer size to 2 megabytes using the following command line:
mpirun -machinefile machinefile.txt -np 128 --byslot ./IOR/src/C/IOR -v -F -t 2m -b 12g -o /xyratex/test.0
Write Results: 12,579.55 MB/s
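These parameters satisfy the caching rule from section V.C; our arithmetic for benchmark 1:

# 128 processes / 32 clients = 4 processes per client
# 4 processes x 12 GB (-b 12g) = 48 GB written per client
# 48 GB = 2 x the 24 GB of client RAM, so the client page cache
# cannot mask the true storage write performance.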

Write optimized benchmark 3: In this run, we changed the number of clients and the number of slots, and increased the transfer size to 8 megabytes:
• Number of Clients: 16
• Number of Slots: 64
• Block size (-b): 12g
• Transfer size (-t): 8m
• Buffered IO
• --byslot option for mpirun

Benchmark command line:
mpirun -machinefile machinefile.txt -np 64 --byslot ./IOR/src/C/IOR -v -F -t 8m -b 24g -o /xyratex/test.0
Write Results: 13,192.72 MB/s

C. IOR – Direct I/O mode

For maximum read performance, we found a few interesting results with IOR using the direct I/O option and file-per-process. As the results indicate that a larger IOR transfer size provides the most favorable results for read performance, we used the runs to define the best set of parameters to achieve peak read performance. Below are the parameters and the read results for each of the IOR transfer sizes that yielded the best results.

Read optimized benchmark 1:
• Number of Clients: 64
• Number of Slots: 256
• Block size (-b): 12g
• Transfer size (-t): 32m
• Direct IO (-B)
• --byslot option for mpirun

Benchmark command line:
mpirun -machinefile machinefile.txt -np 256 --byslot ./IOR/src/C/IOR -v -F -B -t 32m -b 12g -o /xyratex/test.0
Read Results: 11,889.93 MB/s

Read optimized benchmark 2:
• Number of Clients: 64
• Number of Slots: 512
• Block size (-b): 6g
• Transfer size (-t): 32m
• Direct IO (-B)
• --byslot option for mpirun

Benchmark command line:
mpirun -machinefile machinefile.txt -np 512 --byslot ./IOR/src/C/IOR -v -F -B -t 32m -b 6g -o /xyratex/test.0
Read Results: 11,996.17 MB/s

Read optimized benchmark 3:
• Number of Clients: 32
• Number of Slots: 128
• Block size (-b): 12g
• Transfer size (-t): 64m
• Direct IO (-B)
• --byslot option for mpirun

Benchmark command line:
mpirun -machinefile machinefile.txt -np 128 --byslot ./IOR/src/C/IOR -v -F -B -t 64m -b 12g -o /xyratex/test.0
Read Results: 11,754.37 MB/s
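A brief note on the -B flag used in all read benchmarks above (our summary of standard IOR behavior, not stated in the runs themselves):

# In IOR's POSIX backend, -B opens files with O_DIRECT, bypassing the client
# page cache entirely, so these read numbers reflect storage and network
# throughput rather than cache hits.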
D. IOR transfer size

This parameter showed some significant differences between the two I/O modes. When using buffered I/O mode, changing to a smaller transfer size had some impact on maximizing write performance, while with direct I/O there was an increase in read performance when increasing the transfer size, as illustrated in fig. 3 and fig. 4.

E. Number of client threads

Increasing the number of client processes clearly had an impact on performance regardless of the mode used. Peak performance, however, is reached with different parameters for each type of I/O and I/O mode. Buffered writes peak using 64 processes whereas reads peak using 16 (fig. 3). For direct I/O, both read and write seem to peak using 512 processes (fig. 4).

VII. DISCUSSION

The results in this paper clearly show that the Xyratex ClusterStor 6000 solution is capable of delivering very high throughput with regards to both read and write performance. In fact, the throughput calculated on a per disk drive basis is very close to the maximum performance of a single formatted disk. The throughput of a single HA-pair of object store servers (each using 4 OSTs) peaks in excess of 6 GB/s, translating to about 750 MB/s per OST, which compared to other published results [11, 12] is significantly higher. While there are several explanations for the differences (different disk drives, older CPU architecture, QDR vs. FDR and PCI-e gen 2 vs. gen 3), the performance gap is remarkable and a result of the system architecture of the ClusterStor system. The use of a balanced solution, pushing the bottleneck to the disk backend rather than limiting performance closer to the clients, means using fewer disks while maximizing the disk subsystem. Current high end systems suffer from having to use 3-5x the number of disk drives needed from a capacity point of view to achieve the required performance. With the ClusterStor 6000 solution, this overhead is significantly reduced due to the integration and balanced design of the architecture. This has a number of very important benefits, such as reduced floor space, less power and cooling, and increased reliability. It also has positive effects on supportability: as the number of drives in the solution is fewer, the need to replace drives and suffer RAID set rebuilds is lessened (assuming similar disk failure rates regardless of vendor and solution).

With the ClusterStor 6000, we found that a combination of the buffered and direct I/O options for IOR was needed to maximize results for writes and reads. We found that IOR writes and reads data differently, and reading data is not as sequential as writing it. While there might be several reasons for this behavior, Xyratex believes that it is because buffered I/O uses the pagecache and has all the tools to align pages before sending them to LNET, whereas direct I/O does not: direct I/O creates an RPC directly and runs it in synchronized mode. This means that the software is not able to send new chunks of direct I/O data until the previous RPC is processed on the client side and a reply is returned.

For single shared file benchmarks with IOR, a lot more investigation still needs to be done to maximize read and write results and reduce lock contention, both using standard IOR parameters with large block sizes equal to 2x the client memory, and using smaller block sizes with the segment option, which reduces locks at much smaller blocks; the balance between clients and OSTs still needs further investigation. Based on various results for single shared file, we found the number of clients should be a multiple of the number of OSTs to maximize performance, regardless of whether the number of clients is more or less than the number of OSTs. Balancing the block and transfer sizes, using the "segment" flag in IOR as a multiple of the block size, matching stripe sizes to the block size, and having the block size be a multiple of the Lustre transfer size are all factors in maximizing performance, and a separate paper is needed to address these types of parameters and results for single shared file. (A hypothetical invocation is sketched at the end of this section.)

Based on the initial assessment of the backend disk performance, as illustrated by the obdfilter-survey benchmark, one could expect the read performance to be higher. However, by the same token, the IOR based write performance is actually greater than the obdfilter-survey write results. In essence, while the obdfilter-survey benchmark delivers a good approximation of the disk backend performance, it does not utilize the entire software stack or the network component. The obdfilter-survey tool measures the performance of the OSTs without the client and LNET, which has its own benefit of isolating outside factors to provide a baseline performance test.

The analysis of the number of client processes needed to deliver peak performance clearly indicates that storage solutions based on modern architectures (including FDR IB, PCI-E gen 3, and CPU architectures with 8 or more cores and hyper-threading capabilities) require a significant load to perform optimally. The current rule of thumb requires a minimum of 8 clients to saturate a single SSU.

Something that sets IOR apart from other benchmark suites, and the reason it has been used to benchmark many of the largest parallel filesystems in the world, is that it mimics many commonly used applications in that it uses MPI I/O for cluster execution and is bound by most of the limitations normally seen with HPC codes running in large clusters. Within the open source community, efforts are currently underway to define a new and more modern benchmark suite better suited for the storage solutions being planned for the near future. For now, we are required to use what is available, and while a tool like IOR does require a lot of experimentation and time consuming benchmark runs, with a bit of experience and a good understanding of the parameters it is possible to achieve relevant and repeatable performance numbers.

This paper focused on two major modes of I/O (buffered and direct) as well as modulation of two basic parameters (number of client processes and IOR transfer size) with clear and significant impact on the measured results. While not part of the current study, it is clear to the authors that different solutions require drastically different setups to produce optimal results and that "one size fits all" does not always work. It is therefore very important when writing an RFP that the benchmark is not artificially restricted by requiring that IOR be run with pre-defined settings and arguments. Most likely, such settings are derived from systems previously acquired by the customer and are therefore based on older technology and software stacks. Forcing a vendor to use suboptimal settings will only result in over provisioning of a solution or the choice of an inferior one. In these cases, a real application using customer supplied input data is a much better approach, but if that is not doable, traditional benchmarking will have to suffice. As most application suites can be tuned with regards to their data input and output parameters, properly performed benchmarking can provide good insight into the right I/O tuning.

ACKNOWLEDGMENT

The following people are acknowledged for their contribution to this paper: Alan Poston, Eugene Birkine, Karl Merritts, Colin Faber, and Nathan Rutman.

REFERENCES
[1] "Linpack," http://www.top500.org/project/linpack/.
[2] "GPFS," http://www-03.ibm.com/systems/software/gpfs/.
[3] "Lustre," http://lustre.org.
[4] "IOR," http://sourceforge.net/projects/ior-sio/.
[5] "IOzone," http://www.iozone.org.
[6] "Bonnie++," http://sourceforge.net/projects/bonnie/.
[7] "MDtest," http://sourceforge.net/projects/mdtest/.
[8] "Top500," http://www.top500.org.
[9] "obdfilter-survey," http://wiki.lustre.org/manual/LustreManual20_HTML/BenchmarkingTests.html.
[10] K. Claffey, A. Poston, and T. Kling Petersen, "Xyratex ClusterStor - World Record Performance at Massive Scale," http://www.xyratex.com/sites/default/files/files/field_inline_files/Xyratex_white_paper_ClusterStor_The_Future_of_HPC_Storage_1-0_0.pdf, 2012.
[11] "HP - DDN quickspecs," http://h18000.www1.hp.com/products/quickspecs/13648_div/13648_div.PDF.
[12] W. Turek and P. Calleja, "High Performance, Open Source, Dell Lustre Storage System: Dell PowerVault MD3200 storage platform and QDR Mellanox InfiniBand," http://i.dell.com/sites/content/business/solutions/hpcc/en/Documents/Lustre-HPC-Whitepaper-10082011.pdf, 2011.