
Lustre 2.5 Performance Evaluation: Performance Improvements with Large I/O Patches, Metadata Improvements, and Metadata Scaling with DNE

Hitoshi Sato*1, Shuichi Ihara*2, Satoshi Matsuoka*1 *1 Tokyo Institute of Technology *2 Data Direct Networks Japan

Tokyo Tech's TSUBAME2.0, Nov. 1, 2010: "The Greenest Production Supercomputer in the World"

TSUBAME 2.0 New Development

[System diagram: memory bandwidth figures of >600 TB/s, >12 TB/s, >1.6 TB/s, and >400 GB/s; 220 Tbps network bisection bandwidth; 80 Gbps network bandwidth; power envelopes from ~1 kW to 1.4 MW max; 32 nm / 40 nm process technology]

TSUBAME2 System Overview

Storage: 11 PB (7 PB HDD, 4 PB tape, 200 TB SSD). Compute nodes: 17.1 PFlops (SFP), 5.76 PFlops (DFP), 224.69 TFlops (CPU), ~100 TB memory, ~200 TB SSD.

Compute nodes:
• Thin nodes: HP ProLiant SL390s G7 × 1408 (32 nodes × 44 racks)
  – CPU: Intel Westmere-EP 2.93 GHz, 6 cores × 2 = 12 cores/node
  – GPU: NVIDIA Tesla K20X, 3 GPUs/node
  – Memory: 54 GB (96 GB); local SSD: 60 GB × 2 = 120 GB (120 GB × 2 = 240 GB)
• Medium nodes: HP ProLiant DL580 G7 × 24
  – CPU: Intel Nehalem-EX 2.0 GHz, 8 cores × 4 = 32 cores/node
  – GPU: NVIDIA Tesla S1070 (NextIO vCORE Express 2070)
  – Memory: 128 GB; local SSD: 120 GB × 4 = 480 GB
• Fat nodes: HP ProLiant DL580 G7 × 10
  – CPU: Intel Nehalem-EX 2.0 GHz, 8 cores × 4 = 32 cores/node
  – GPU: NVIDIA Tesla S1070
  – Memory: 256 GB (512 GB); local SSD: 120 GB × 4 = 480 GB

Interconnect: full-bisection optical QDR InfiniBand network
• Edge switches: Voltaire Grid Director 4036 × 179 (36 IB QDR ports each)
• Edge switches w/ 10GbE ports: Voltaire Grid Director 4036E × 6 (34 IB QDR ports + 2 × 10GbE ports each)
• Core switches: Voltaire Grid Director 4700 × 12 (324 IB QDR ports each)

Storage, built on DDN SFA10k #1–#6 (attached via QDR IB (×4) × 20, QDR IB (×4) × 8, and 10GbE × 2 links):
• Lustre parallel file system volumes ("Global Work Space" #1–#3: /work0, /work1, /gscr): 3.6 PB
• GPFS + tape (GPFS#1–#4: /data0, /data1): 2.4 PB HDD + ~4 PB tape
• Home volumes (home system; cNFS/Clustered Samba w/ GPFS; NFS/CIFS/iSCSI by BlueARC; iSCSI for applications): 1.2 PB

The same system view, annotated with I/O workload characteristics:
• Local SSDs on the compute nodes: fine-grained R/W I/O (checkpoints, temporary files)
• Parallel file system volumes: mostly read I/O (data-intensive applications, parallel workflows, parameter surveys)
• Home volumes: home storage for the compute nodes, cloud-based campus storage services, backup

Towards TSUBAME 3.0

• Interim upgrade, TSUBAME2.0 to 2.5 (Sept. 10th, 2013): upgraded the TSUBAME2.0 GPUs from NVIDIA Fermi M2050 to Kepler K20X
• TSUBAME-KFC: a TSUBAME3.0 prototype system with advanced cooling for the next generation
  – 40 compute nodes are oil-submerged
  – 160 NVIDIA Kepler K20X GPUs, 80 Ivy Bridge Xeon CPUs, FDR InfiniBand

[Upgrade diagram: total 210.61 TFlops / 5.21 TFlops; TSUBAME2.0 compute node system with Fermi GPUs, 3 × 1408 = 4224 GPUs]

SFP/DFP peak: from 4.8 PF / 2.4 PF to 17 PF / 5.7 PF. Storage design also matters!!

Existing Storage Problems

Small I/O:
• PFS
  – Throughput-oriented design
  – Master-worker configuration
• Performance bottlenecks on metadata references
• Low IOPS (1k – 10k IOPS)

I/O contention:
• Shared across compute nodes
  – I/O contention from various applications
  – I/O performance degradation

[Lustre I/O monitoring chart: IOPS over time — small I/O ops from a user's application drove PFS operation down; it later recovered]

Requirements for Next Generation I/O Sub-System 1

• Basic requirements
  – TB/sec sequential bandwidth for traditional HPC applications
  – Extremely high IOPS to cope with applications that generate massive amounts of small I/O (e.g. graph processing)
  – Some application workflows have different I/O requirements for each workflow component
• Reduction/consolidation of PFS I/O resources is needed while achieving TB/sec performance
  – We cannot simply use a larger number of I/O servers and drives to achieve ~TB/s throughput!
  – Many constraints: space, power, budget, etc.
  – Performance per watt and performance per rack unit (RU) is crucial

Requirements for Next Generation I/O Sub-System 2

• New approaches
  – TSUBAME2.0 has pioneered the use of local flash storage as a high-IOPS alternative to an external PFS
  – Tiered and hybrid storage environments, combining (node-)local flash with an external PFS
• Industry status
  – High-performance, high-capacity flash (and other new semiconductor devices) is becoming available at reasonable cost
  – New approaches/interfaces to use high-performance devices (e.g. NVM Express)

Key Requirements of the I/O System

• Electrical power and space are still challenging
  – Reduce the number of OSSs/OSTs while keeping high I/O performance and large storage capacity
  – How to maximize Lustre performance
• Understand new types of flash storage devices
  – Any benefits with Lustre? How to use them?
  – Does an MDT on SSD help?

Lustre 2.x Performance Evaluation

• Maximize OSS/OST performance for large sequential I/O
  – Single OSS and OST performance
  – 1 MB vs. 4 MB RPC
  – Not only peak performance, but also sustained performance
• Small I/O to a shared file
  – 4K random read access to Lustre
• Metadata performance
  – CPU impact?
  – Do SSDs help?

Throughput per OSS Server

[Charts: single OSS throughput for write and read, comparing no bonding, AA-bonding, and AA-bonding with CPU binding. X-axis: number of clients (1, 2, 4, 8, 16); Y-axis: throughput (MB/sec), up to 14,000. Reference lines mark the dual IB FDR LNET bandwidth (12.0 GB/sec) and the single FDR LNET bandwidth (6.0 GB/sec).]
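For context, runs of this kind are typically driven with IOR in file-per-process mode from multiple clients. The sketch below is a hypothetical Python driver, not the actual scripts used for these measurements; the IOR binary path, target directory, and block sizes are assumptions.

```python
# Hypothetical driver for file-per-process IOR runs like the ones above.
# The IOR binary path, Lustre target directory, and sizes are assumptions.
import subprocess

IOR = "/usr/local/bin/ior"             # assumed IOR install location
TARGET = "/lustre/work0/ior_testfile"  # assumed Lustre work directory

def run_ior(nprocs, mode="-w", xfer="1m", block="8g"):
    """Run one IOR pass (write or read) and return its stdout report."""
    cmd = [
        "mpirun", "-np", str(nprocs),
        IOR,
        "-a", "POSIX",   # POSIX backend
        mode,            # "-w" write phase, "-r" read phase
        "-F",            # file per process
        "-t", xfer,      # transfer size per I/O call
        "-b", block,     # amount of data per process
        "-e",            # fsync at the end of the write phase
        "-o", TARGET,
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

if __name__ == "__main__":
    for nprocs in (1, 2, 4, 8, 16):  # client counts swept in the plots above
        print(run_ior(nprocs))
```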

Performance Comparison: 1 MB vs. 4 MB RPC (IOR FPP, asyncIO)

• 4 × OSS, 280 × NL-SAS
• 14 clients and up to 224 processes

[Charts: IOR write and read throughput (FPP, xfersize=4 MB), 1 MB RPC vs. 4 MB RPC. X-axis: number of processes (28, 56, 112, 224); Y-axis: throughput (MB/sec), up to ~30,000 for write and ~25,000 for read.]
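On the client side, the RPC size in these comparisons is governed by the osc.*.max_pages_per_rpc tunable (256 × 4 KiB pages = 1 MB, 1024 pages = 4 MB); 4 MB RPCs additionally require the server-side large-I/O patches referenced in the title. Below is a minimal, hypothetical Python helper for switching the setting between runs; the exact values, and the choice to raise max_dirty_mb alongside it, are assumptions.

```python
# Hypothetical helper to toggle between 1 MB and 4 MB RPCs on a Lustre client.
# Values are illustrative; 4 MB RPCs also need server-side large-I/O support.
import subprocess

def set_rpc_size(rpc_mb):
    pages = rpc_mb * 256  # number of 4 KiB pages per RPC
    settings = {
        "osc.*.max_pages_per_rpc": str(pages),
        # Allow enough dirty data per OSC to keep large RPCs filled (assumption).
        "osc.*.max_dirty_mb": str(64 * rpc_mb),
    }
    for name, value in settings.items():
        subprocess.run(["lctl", "set_param", f"{name}={value}"], check=True)

def show_rpc_settings():
    out = subprocess.run(
        ["lctl", "get_param",
         "osc.*.max_pages_per_rpc", "osc.*.max_rpcs_in_flight"],
        capture_output=True, text=True, check=True)
    return out.stdout

if __name__ == "__main__":
    set_rpc_size(4)            # switch clients to 4 MB RPCs before a run
    print(show_rpc_settings())
```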

Lustre Performance Degradation with Large Number of Threads: OSS Standpoint

• 1 × OSS, 80 × NL-SAS
• 20 clients and up to 960 processes

[Charts: IOR (FPP, 1 MB transfers) write and read throughput, RPC=4 MB vs. RPC=1 MB. X-axis: number of processes (up to 1,200); Y-axis: throughput (MB/sec), up to ~12,000 for write and ~8,000 for read. Annotations mark gaps of 30%, 4.5x, and 55% between the configurations.]

Lustre Performance Degradation with Large Number of Threads: SSDs vs. Nearline Disks

• IOR (FPP, 1 MB) to a single OST
• 20 clients and up to 640 processes

[Charts: write and read performance degradation of the Lustre backend, normalized so each dataset's maximum = 100%. Series: 7.2K SAS (1 MB RPC), 7.2K SAS (4 MB RPC), SSD (1 MB RPC). X-axis: number of processes (up to ~700); Y-axis: performance per OST as % of peak.]

Lustre 2.4 4K Random Read Performance (with 10 SSDs)

• 2 × OSS, 10 × SSD (2 × RAID5), 16 clients
• Create large random files (FPP and SSF), then run 4 KB random read accesses against Lustre

[Chart: Lustre 4K random read IOPS for FPP and SSF, POSIX and MPI-IO (series: FPP(POSIX), FPP(MPIIO), SSF(POSIX), SSF(MPIIO)). X-axis: nodes × processes (1n32p, 2n64p, 4n128p, 8n256p, 12n384p, 16n512p); Y-axis: IOPS, up to ~600,000.]
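To illustrate the access pattern behind these IOPS numbers (many processes issuing 4 KiB reads at random offsets into one large shared file), here is a minimal single-node Python sketch; the file path, run time, and process count are assumptions, and the actual measurements used an MPI-based benchmark across 16 clients.

```python
# Minimal sketch of the 4K random-read workload against a shared file.
# Path, duration, and process count are assumptions for illustration only.
# Without O_DIRECT, the client page cache can serve repeated reads, so the
# target file should be much larger than client memory (or caches dropped).
import multiprocessing as mp
import os
import random
import time

SHARED_FILE = "/lustre/work0/random_read_target"  # assumed large, pre-created file
BLOCK = 4096          # 4 KiB per read
DURATION = 10.0       # seconds per worker

def worker(_):
    fd = os.open(SHARED_FILE, os.O_RDONLY)
    nblocks = os.fstat(fd).st_size // BLOCK
    ops = 0
    deadline = time.time() + DURATION
    while time.time() < deadline:
        offset = random.randrange(nblocks) * BLOCK  # 4 KiB-aligned random offset
        os.pread(fd, BLOCK, offset)
        ops += 1
    os.close(fd)
    return ops

if __name__ == "__main__":
    nprocs = 32  # processes on this client node (assumption)
    with mp.Pool(nprocs) as pool:
        total_ops = sum(pool.map(worker, range(nprocs)))
    print(f"~{total_ops / DURATION:.0f} 4K read IOPS from {nprocs} local processes")
```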

Conclusions: Lustre with SSDs

• SSD pools in Lustre, or SSD-based OSTs
  – SSDs change the behavior of pools, as "seek" latency is no longer an issue
• Application scenarios
  – Very consistent performance, independent of the I/O size or the number of concurrent I/Os
  – Very high random access to a single shared file (millions of IOPS in a large file system with sufficient clients)
• With SSDs, the bottleneck is no longer the device, but the RAID array (RAID stack) and the file system
  – Very high random read IOPS in Lustre is possible, but only if the metadata workload is limited (i.e. random I/O to a single shared file)
• We will investigate more benchmarks of Lustre on SSDs

Metadata Performance Improvements

• Very significant improvements (since Lustre 2.3)
  – Increased performance for both unique- and shared-directory metadata
  – Almost linear scaling for most metadata operations with DNE

[Charts: metadata performance comparison for directory creation/stat/removal and file creation/stat/removal. Left: Lustre 1.8.8 vs. Lustre 2.4.2, metadata operations to a shared directory, up to ~140,000 ops/sec. Right: single MDS vs. dual MDS (DNE), metadata operations to unique directories — same storage resources, just an added MDS — up to ~600,000 ops/sec.]
* 1 × MDS, 16 clients, 768 Lustre exports (multiple mount points)
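For the flavor of these tests, the sketch below times the create/stat/remove pattern from a single client in plain Python; the mount point, file count, and the `lfs mkdir -i` call for placing a directory on a specific MDT under DNE are illustrative assumptions, whereas the results above came from an mdtest-style run over 16 clients.

```python
# Single-client sketch of the create/stat/remove metadata pattern.
# Mount point, file count, and MDT placement are illustrative assumptions;
# the numbers above were measured with a multi-client mdtest-style benchmark.
import os
import subprocess
import time

LUSTRE_DIR = "/lustre/work0/md_sketch"  # assumed test directory on Lustre
NFILES = 10000

def place_dir_on_mdt(path, mdt_index):
    """With DNE, create a directory on a chosen MDT (assumed lfs syntax)."""
    subprocess.run(["lfs", "mkdir", "-i", str(mdt_index), path], check=True)

def timed(label, fn):
    start = time.time()
    fn()
    print(f"{label}: {NFILES / (time.time() - start):.0f} ops/sec")

if __name__ == "__main__":
    # If DNE is configured, the directory could instead be pinned to MDT1:
    #   place_dir_on_mdt(LUSTRE_DIR, 1)
    os.makedirs(LUSTRE_DIR, exist_ok=True)
    paths = [os.path.join(LUSTRE_DIR, f"f{i}") for i in range(NFILES)]
    timed("file creation", lambda: [open(p, "w").close() for p in paths])
    timed("file stat",     lambda: [os.stat(p) for p in paths])
    timed("file removal",  lambda: [os.unlink(p) for p in paths])
```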

Metadata Performance CPU Impact

• Metadata performance is highly dependent on CPU performance
• Even limited variation in CPU frequency or memory speed can have a significant impact on metadata performance

[Charts: file creation rate in a unique directory (up to ~90,000 ops/sec) and in a shared directory (up to ~70,000 ops/sec), 3.6 GHz vs. 2.4 GHz CPUs, versus number of processes (up to ~700).]

Conclusions: Metadata Performance

• Significant improvements for shared- and unique-directory metadata performance since Lustre 2.3
  – Improvements across all metadata workloads, especially removals
  – SSDs add to performance, but the impact is limited (and probably mostly due to latency, not the IOPS capability of the device)
• Important considerations for metadata benchmarks
  – Metadata benchmarks are very sensitive to the underlying hardware
  – Client limitations matter when assessing metadata performance and scaling
  – Only directory/file stats scale with the number of processes per node; other metadata workloads do not appear to scale with the number of processes on a single node

Summary

• Lustre today
  – Significant performance increases in both I/O and metadata performance
  – Much better spindle efficiency (which is now reaching the physical drive limit)
  – Client performance, not the server side, is quickly becoming the limitation for small-file and random-I/O performance!!
• Consider alternative scenarios
  – PFS combined with fast local devices (or even local instances of the PFS)
  – PFS combined with a global acceleration/caching layer
