HPC Storage: Current Status and Futures
Torben Kling Petersen, PhD – Principal Architect, HPC

1 Agenda
• Where are we today?
• File systems
• Interconnects
• Disk technologies
• Solid state devices
• Recipes and Solutions – Pan Galactic Gargle Blaster ("Like having your brains smashed out by a slice of lemon wrapped around a large gold brick.")
• Final thoughts …

2 Current Top10 …..

Rank | Name | Computer | Site | Cores | Rmax (GFlop/s) | Rpeak (GFlop/s) | Power (kW) | File system | Size | FS Perf
1 | Tianhe-2 | TH-IVB-FEP Cluster, Xeon E5-2692 12C 2.2 GHz, TH Express-2, Intel Xeon Phi | National Super Computer Center in Guangzhou | 3,120,000 | 33,862,700 | 54,902,400 | 17,808 | H2FS | 12.4 PB | ~750 GB/s
2 | Titan | Cray XK7, Opteron 6274 16C 2.2 GHz, Cray Gemini interconnect, NVIDIA K20x | DOE/SC/Oak Ridge National Laboratory | 560,640 | 17,590,000 | 27,112,550 | 8,209 | Lustre | 10.5 PB | 240 GB/s
3 | Sequoia | BlueGene/Q, Power BQC 16C 1.60 GHz, Custom Interconnect | DOE/NNSA/LLNL | 1,572,864 | 17,173,224 | 20,132,659 | 7,890 | Lustre | 55 PB | 850 GB/s
4 | K computer | Fujitsu, SPARC64 VIIIfx 2.0 GHz, Tofu interconnect | RIKEN AICS | 705,024 | 10,510,000 | 11,280,384 | 12,659 | Lustre | 40 PB | 965 GB/s
5 | Mira | BlueGene/Q, Power BQC 16C 1.60 GHz, Custom | DOE/SC/Argonne National Laboratory | 786,432 | 8,586,612 | 10,066,330 | 3,945 | GPFS | 7.6 PB | 88 GB/s
6 | Piz Daint | Cray XC30, Xeon E5-2670 8C 2.600 GHz, Aries interconnect, NVIDIA K20x | Swiss National Supercomputing Centre (CSCS) | 115,984 | 6,271,000 | 7,788,853 | 2,325 | Lustre | 2.5 PB | 138 GB/s
7 | Stampede | PowerEdge C8220, Xeon E5-2680 8C 2.7 GHz, IB FDR, Intel Xeon Phi | TACC/Univ. of Texas | 462,462 | 5,168,110 | 8,520,112 | 4,510 | Lustre | 14 PB | 150 GB/s
8 | JUQUEEN | BlueGene/Q, Power BQC 16C 1.600 GHz, Custom Interconnect | Forschungszentrum Juelich (FZJ) | 458,752 | 5,008,857 | 5,872,025 | 2,301 | GPFS | 5.6 PB | 33 GB/s
9 | Vulcan | BlueGene/Q, Power BQC 16C 1.600 GHz, Custom Interconnect | DOE/NNSA/LLNL | 393,216 | 4,293,306 | 5,033,165 | 1,972 | Lustre | 55 PB | 850 GB/s
10 | SuperMUC | iDataPlex DX360M4, Xeon E5-2680 8C 2.70 GHz, Infiniband FDR | Leibniz Rechenzentrum | 147,456 | 2,897,000 | 3,185,050 | 3,423 | GPFS | 10 PB | 200 GB/s

n.b. NCSA Blue Waters: 24 PB @ 1,100 GB/s (Lustre 2.1.3)
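One quick way to compare these systems is file-system bandwidth per unit of compute. The small sketch below (plain Python, values copied from the table above; the ratio itself is my derivation, not a figure from the slide) converts Rmax from GFlop/s to PFlop/s and divides the quoted file-system bandwidth by it.

```python
# Rough ratio of file-system bandwidth to compute, from the table above.
# Rmax is given in GFlop/s; the second value is the quoted FS bandwidth in GB/s.
systems = {
    "Tianhe-2":   (33_862_700, 750),
    "Titan":      (17_590_000, 240),
    "Sequoia":    (17_173_224, 850),
    "K computer": (10_510_000, 965),
    "Mira":       (8_586_612,   88),
}

for name, (rmax_gflops, fs_gb_per_s) in systems.items():
    pflops = rmax_gflops / 1e6            # GFlop/s -> PFlop/s
    ratio = fs_gb_per_s / pflops          # GB/s of storage per PFlop/s of compute
    print(f"{name:>10}: {ratio:6.1f} GB/s per PFlop/s")
```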

3 Lustre® Roadmap

4 Other parallel file systems
• GPFS
  – Running out of steam? Let me qualify!! (and he then rambles on …)
• Fraunhofer (now BeeGFS)
  – Excellent metadata performance
  – Many modern features
  – No real HA
• …
  – New, interesting, and with a LOT of good features
  – Immature and with a limited track record
• Panasas
  – Still RAID 5 and running out of steam …

5 Object based storage

• A traditional file system includes a hierarchy of files and directories
  o Accessed via a file system driver in the OS
• An object store is “flat”: objects are located by direct reference
  o Accessed via custom APIs: Swift, S3, librados, etc.
• The difference boils down to two questions:
  o How do you find files?
  o Where do you store metadata?
• Object store + metadata + driver is a file system
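To make the "object store + metadata + driver" point concrete, here is a minimal, hypothetical sketch in Python: a flat key-value object store plus a separate path-to-object-ID metadata map gives a file-system-like lookup. The class and method names are illustrative only and do not come from any real object store.

```python
import uuid

class FlatObjectStore:
    """Objects live in a flat namespace, addressed only by an opaque ID."""
    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        oid = uuid.uuid4().hex          # direct reference, no path hierarchy
        self._objects[oid] = data
        return oid

    def get(self, oid: str) -> bytes:
        return self._objects[oid]

class ToyFileSystem:
    """A separate metadata layer maps human-readable paths to object IDs."""
    def __init__(self, store: FlatObjectStore):
        self._store = store
        self._namespace = {}            # path -> object ID (the "metadata")

    def write(self, path: str, data: bytes) -> None:
        self._namespace[path] = self._store.put(data)

    def read(self, path: str) -> bytes:
        return self._store.get(self._namespace[path])

fs = ToyFileSystem(FlatObjectStore())
fs.write("/home/user/hello.txt", b"hello, exascale")
print(fs.read("/home/user/hello.txt"))
```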

6 Object Storage Backend: Why?

• It’s more flexible. Interfaces can be presented in other ways, without the FS overhead: a generalized storage architecture vs. a file system.
• It’s more scalable. POSIX was never intended for clusters, concurrent access, multi-level caching, ILM, usage hints, striping control, etc.
• It’s simpler. With the file-system-”isms” removed, an elegant (= scalable, flexible, reliable) foundation can be laid.

7 Elephants all the way down...

Most clustered file systems (FS) and object stores (OS) are built on local file systems …
… and inherit their problems.
• Native FS
  o XFS, ZFS, …
• OS on FS
  o Ceph on btrfs
  o Swift on XFS
• FS on OS on FS
  o CephFS on Ceph on btrfs
  o Lustre on OSS on ext4
  o Lustre on OSS on ZFS

8 The way forward …
• ObjectStore-based solutions offer a lot of flexibility:
  – Next-generation design for exascale-level size, performance, and robustness
  – Implemented from scratch: "If we could design the perfect exascale storage system…"
  – Not limited to POSIX
  – Non-blocking availability of data
  – Multi-core aware: non-blocking execution model with thread-per-core
  – Support for non-uniform hardware: flash, non-volatile memory, NUMA
  – Using abstractions, guided interfaces can be implemented, e.g. for burst buffer management (pre-staging and de-staging); see the sketch below
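As a hedged illustration of pre-staging and de-staging, the hypothetical Python sketch below copies a job's input set into a fast burst-buffer tier before the job runs and drains results back to the parallel file system afterwards. The paths and the BURST_BUFFER location are invented for illustration and do not refer to any specific product or API.

```python
import shutil
from pathlib import Path

# Hypothetical tiers: a slow parallel file system and a fast burst buffer.
PFS = Path("/lustre/project/run42")
BURST_BUFFER = Path("/bb/job42")

def pre_stage(inputs):
    """Copy input files from the PFS into the burst buffer before the job starts."""
    BURST_BUFFER.mkdir(parents=True, exist_ok=True)
    for name in inputs:
        shutil.copy2(PFS / name, BURST_BUFFER / name)

def de_stage(outputs):
    """Drain results from the burst buffer back to the PFS after the job ends."""
    for name in outputs:
        shutil.copy2(BURST_BUFFER / name, PFS / name)

# Typical flow around a job:
#   pre_stage(["mesh.h5", "params.yaml"])
#   ... job reads and writes only in BURST_BUFFER ...
#   de_stage(["checkpoint_0001.h5", "results.h5"])
```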

9 Interconnects (Disk and Fabrics)

Disk interconnects:
• S-ATA 6 Gbit
• FC-AL 8 Gbit
• SAS 6 Gbit
• SAS 12 Gbit
• PCI-E direct attach

Fabric interconnects:
• Ethernet
• Infiniband
• Next-gen interconnects …

10 12 Gbit SAS

• Doubles bandwidth compared to SAS 6 Gbit
• Triples the IOPS!!
• Same connectors and cables
• 4.8 GB/s in each direction, with 9.6 GB/s to follow (2 streams moving to 4 streams); see the bandwidth sketch below
• 24 Gb/s SAS is on the drawing board …
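A minimal back-of-the-envelope check of the 4.8 GB/s figure, assuming a 4-lane (x4) wide port and 8b/10b line encoding and ignoring protocol overhead; the lane counts are my assumptions about how the slide's numbers were derived.

```python
def sas_bandwidth_gb_per_s(line_rate_gbps: float, lanes: int,
                           encoding_efficiency: float = 8 / 10) -> float:
    """Rough usable bandwidth (GB/s, one direction) of a wide SAS port.

    Assumes 8b/10b line encoding; protocol overhead is ignored.
    """
    return line_rate_gbps * encoding_efficiency * lanes / 8  # bits -> bytes

print(sas_bandwidth_gb_per_s(6, lanes=4))    # ~2.4 GB/s  (6 Gbit SAS, x4 port)
print(sas_bandwidth_gb_per_s(12, lanes=4))   # ~4.8 GB/s  (12 Gbit SAS, x4 port)
print(sas_bandwidth_gb_per_s(12, lanes=8))   # ~9.6 GB/s  (the "following" step)
```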

11 PCI-E direct attach storage

• M.2 Solid State Storage Modules
• Lowest latency / highest bandwidth
• Limitations in the number of PCI-E lanes available
  – Ivy Bridge has up to 40 lanes per chip

PCIe Arch | Raw Bit Rate | Data Encoding | Interconnect Bandwidth | BW per Lane per Direction | Total BW for x16 Link
PCIe 1.x | 2.5 GT/s | 8b/10b | 2 Gb/s | ~250 MB/s | ~8 GB/s
PCIe 2.0 | 5.0 GT/s | 8b/10b | 4 Gb/s | ~500 MB/s | ~16 GB/s
PCIe 3.0 | 8.0 GT/s | 128b/130b | 8 Gb/s | ~1 GB/s | ~32 GB/s
PCIe 4.0 | ?? | ?? | ?? | ?? | ??
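The per-lane numbers in the table fall out of raw bit rate times encoding efficiency; the short sketch below reproduces that arithmetic. Counting both directions for the x16 totals is an assumption about how the slide's ~8/16/32 GB/s figures were derived.

```python
def pcie_lane_gb_per_s(raw_gt_per_s: float, encoding_efficiency: float) -> float:
    """Usable payload bandwidth of one PCIe lane, one direction, in GB/s."""
    return raw_gt_per_s * encoding_efficiency / 8  # bits -> bytes

generations = {
    "PCIe 1.x": (2.5, 8 / 10),
    "PCIe 2.0": (5.0, 8 / 10),
    "PCIe 3.0": (8.0, 128 / 130),
}

for gen, (rate, eff) in generations.items():
    per_lane = pcie_lane_gb_per_s(rate, eff)
    x16_both_dirs = per_lane * 16 * 2   # assumption: table totals count both directions
    print(f"{gen}: {per_lane:.2f} GB/s per lane, ~{x16_both_dirs:.0f} GB/s for an x16 link")
```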

12 Ethernet – Still going strong ??

• Ethernet has now been around for 40 years!!!
• Currently used by around 41% of Top500 systems (28% 1 GbE, 13% 10 GbE)
• 40 GbE shipping in volume
• 100 GbE being demonstrated; volume shipments expected in 2015
• 400 GbE and 1 TbE are on the drawing board; 400 GbE planned for 2017

13 Infiniband …

14 Next gen interconnects …

• Intel acquired QLogic’s InfiniBand team …
• Intel acquired Cray’s interconnect technologies …
• Intel has published on and demonstrated silicon photonics …

• And this means WHAT ????

15 Disk drive technologies NEW Seagate 1 PB drive !!!

16 Areal density futures

[Chart: areal density (Gbit/in², log scale from 10⁰ to 10⁵) vs. year, 1992–2020. Successive recording technologies: inductive read/write → inductive writing with MR/GMR reading → perpendicular writing with GMR reading → HAMR → HAMR + BPM, approaching the superparamagnetic (single-particle) limit.]

17 Disk write technology ….

18 Hard drive futures …

• Sealed helium drives (Hitachi)
  – Higher density: 6 platters / 12 heads
  – Less power (~1.6 W idle)
  – Less heat (~4 °C lower temperature)
• SMR drives (Seagate)
  – Denser packaging on current technology
  – Aimed at read-intensive application areas (see the write-constraint sketch below)
• Hybrid drives (SSHD)
  – Enterprise edition
  – Transparent SSD/HDD combination (aka Fusion drives)
  – eMLC + SAS
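As a hedged illustration of why SMR suits read-intensive workloads: shingled zones can only be written sequentially at a write pointer, so an in-place update means resetting and rewriting the zone. The toy Python model below is purely illustrative and does not reflect any particular drive's firmware or the ZBC/ZAC command sets.

```python
class SMRZone:
    """Toy model of one shingled zone: sequential writes only."""
    def __init__(self, num_blocks: int):
        self.blocks = [None] * num_blocks
        self.write_pointer = 0          # next block that may be written

    def append(self, data) -> int:
        """Sequential write at the write pointer (the cheap, supported path)."""
        lba = self.write_pointer
        self.blocks[lba] = data
        self.write_pointer += 1
        return lba

    def rewrite_in_place(self, lba: int, data) -> None:
        """An in-place update forces a read-modify-write of the whole zone."""
        survivors = self.blocks[: self.write_pointer]
        survivors[lba] = data
        self.__init__(len(self.blocks))  # reset the zone (write pointer back to 0)
        for block in survivors:          # ...then rewrite everything sequentially
            self.append(block)

zone = SMRZone(num_blocks=1024)
first = zone.append(b"checkpoint-0")
zone.append(b"checkpoint-1")
zone.rewrite_in_place(first, b"checkpoint-0-fixed")   # expensive: whole zone rewritten
```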

19 SMR drive deep-dive

20 Hard drive futures …

• HAMR drives (Seagate)
  – Use a laser to heat the magnetic substrate (iron/platinum alloy)
  – Projected capacity: 30–60 TB per 3.5-inch drive …
  – 2016 timeframe …
• BPM (bit-patterned media recording)
  – Stores one bit per cell, as opposed to regular hard-drive technology, where each bit is stored across a few hundred magnetic grains
  – Projected capacity: 100+ TB per 3.5-inch drive …

22 What about RAID?
• RAID 5 – No longer viable
• RAID 6 – Still OK, but rebuild times are becoming prohibitive (see the back-of-the-envelope sketch below)
• RAID 10 – OK for SSDs and small arrays
• RAID Z/Z2 etc. – A choice, but with limited functionality on …
• Parity-declustered RAID – Gaining a foothold everywhere, but … PD-RAID ≠ PD-RAID ≠ PD-RAID …
• No RAID??? – Using multiple distributed copies works, but …
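A rough, hypothetical calculation of why RAID 6 rebuild times hurt as drives grow, and why declustered RAID helps by spreading the rebuild across many drives. The sustained rebuild rates and drive counts are assumptions for illustration, not measurements of any product.

```python
def rebuild_hours(capacity_tb: float, rebuild_mb_per_s: float) -> float:
    """Time to reconstruct one failed drive at a sustained rebuild rate."""
    capacity_mb = capacity_tb * 1e6
    return capacity_mb / rebuild_mb_per_s / 3600

# Classic RAID 6: one spare drive absorbs the whole rebuild (assumed ~50 MB/s
# sustained, since the array keeps serving I/O during the rebuild).
print(f"4 TB drive:  {rebuild_hours(4, 50):5.1f} h")
print(f"8 TB drive:  {rebuild_hours(8, 50):5.1f} h")

# Parity-declustered RAID: the rebuild is spread over many drives, so the
# effective rate scales roughly with the number of participating drives.
print(f"8 TB drive, declustered over ~40 drives: {rebuild_hours(8, 50 * 40):4.1f} h")
```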

22 Flash (NAND)

• Supposed to “Take over the World” [cf. Pinky and the Brain]
• But for high-performance storage there are issues …
  – Price and density not following the predicted evolution
  – Reliability (even on SLC) not as expected
• Latency issues
  – SLC access ~25 µs, MLC ~50 µs …
  – Larger chips increase contention:
    • Once a flash die is accessed, other dies on the same bus must wait
    • Up to 8 flash dies share a bus (a rough contention model follows below)
  – Address translation, garbage collection, and especially wear leveling add significant latency
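A toy model of the shared-bus contention, using the access latencies from the slide; the page size and bus speed are assumptions, and the model ignores FTL, garbage collection, and wear-leveling overheads entirely.

```python
# Toy model (assumption, not a datasheet): a read = cell access + bus transfer,
# and a new read may have to wait for transfers already queued on its shared bus.
CELL_ACCESS_US = {"SLC": 25, "MLC": 50}   # access latencies quoted on the slide
PAGE_KB = 8                                # assumed page size
BUS_MB_PER_S = 400                         # assumed bus speed

transfer_us = PAGE_KB / (BUS_MB_PER_S / 1000)   # KB / (KB per microsecond)

for cell, access_us in CELL_ACCESS_US.items():
    best = access_us + transfer_us
    # Worst case with 8 dies sharing the bus: wait for 7 other transfers,
    # then do our own, so 8 transfer times in total.
    worst = access_us + 8 * transfer_us
    print(f"{cell}: ~{best:.0f} us uncontended, ~{worst:.0f} us behind a busy 8-die bus")
```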

23 Flash (NAND)

• MLC – 3–4 bits per cell @ ~10K program/erase cycles
• SLC – 1 bit per cell @ ~100K program/erase cycles
• eMLC – 2 bits per cell @ ~30K program/erase cycles
  (a rough endurance calculation from these cycle counts follows below)
• Disk drive formats – S-ATA / SAS bandwidth limitations
• PCI-E accelerators
• PCI-E direct attach
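To show what those cycle counts mean in practice, here is a hedged lifetime-writes estimate (capacity × P/E cycles ÷ write amplification). The 400 GB device size and the write-amplification factor are assumptions chosen for illustration.

```python
def lifetime_pb_written(capacity_gb: float, pe_cycles: int,
                        write_amplification: float = 3.0) -> float:
    """Very rough endurance estimate: capacity * P/E cycles / write amplification."""
    return capacity_gb * pe_cycles / write_amplification / 1e6  # GB -> PB

for cell, cycles in (("SLC", 100_000), ("eMLC", 30_000), ("MLC", 10_000)):
    pbw = lifetime_pb_written(400, cycles)
    print(f"400 GB {cell}: ~{pbw:.1f} PB of host writes over its life")
```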

24 NV-RAM

• Flash is essentially NV-RAM, but …
• Phase Change Memory (PCM)
  – Significantly faster and more dense than NAND
  – Based on chalcogenide glass (a thermal rather than an electronic process)
  – More resistant to external factors
• Currently the expected solution for burst buffers etc. … but there’s always Hybrid Memory Cubes …

25 3D RAM (aka Hybrid Memory Cube)

• Comes in many flavours:
  – Stacked DDR3 + logic
  – Stacked NAND
• Demonstrated performance: 480 GB/s from a single memory device

26 Solutions …
• Size does matter … 2014–2016: >20 proposals for 40+ PB file systems, running at 1–4 TB/s!!!!
• Heterogeneity is the new buzzword – burst buffers, data capacitors, cache off-loaders …
• Mixed workloads are now taken seriously …
• Data integrity is paramount – T10-PI/DIX is a decent start, but … (a guard-tag sketch follows below)
• Storage system resiliency is equally important – PD-RAID needs to evolve and become system-wide
• Multi-tier storage as standard configs …
• Geographically distributed solutions commonplace
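For context on T10-PI: each 512-byte sector carries 8 extra bytes of protection information, including a CRC-16 guard tag (polynomial 0x8BB7). The sketch below is an illustrative reconstruction, not a reference implementation; the big-endian field packing in particular is an assumption on my part.

```python
def crc16_t10_dif(data: bytes) -> int:
    """CRC-16 used for the T10 PI guard tag (polynomial 0x8BB7, no reflection)."""
    crc = 0x0000
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x8BB7) if (crc & 0x8000) else (crc << 1)
            crc &= 0xFFFF
    return crc

def protection_info(sector: bytes, ref_tag: int, app_tag: int = 0) -> bytes:
    """Build the 8-byte PI field: 2-byte guard tag, 2-byte app tag, 4-byte reference tag."""
    assert len(sector) == 512
    guard = crc16_t10_dif(sector)
    return (guard.to_bytes(2, "big")
            + app_tag.to_bytes(2, "big")
            + ref_tag.to_bytes(4, "big"))

sector = bytes(512)                      # a blank 512-byte sector
pi = protection_info(sector, ref_tag=0x12345678)
print(pi.hex())                          # guard + app tag + reference tag
```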

27 Final thoughts???
• Object-based storage
  – Not an IF but a WHEN …
  – Flavor(s) still TBD – DAOS, Exascale10, XEOS, …
• Data management core to any solution
  – Self-aware data, real-time analytics, resource management ranging from the job scheduler down to the disk block …
  – HPC storage = Big Data
• Live data
  – Cache ↔ Near line ↔ Tier 2 ↔ Tape? ↔ Cloud ↔ Ice
• Compute with storage ➜ Storage with compute

Storage is no longer a second-class citizen

28 Thank You [email protected]
