BeeGFS Data Integrity Improvements without impacting Performance

Presenter: Dr. M. K. Jibbe, Technical Director, NetApp ESG
Presenter: Joey Parnell, SW Architect, NetApp ESG

Current Problem

• Lustre has been a common choice in the HPC market for years, but the acquisition of the Lustre intellectual property by a storage provider in 2018 has created a challenge for users who want to avoid vendor lock-in.

• Parallel file system needs are expanding beyond HPC, but most parallel file systems lack enterprise reliability

• Any BeeGFS configuration lacks data protection at several locations:
  • Clients and storage servers do not have cache protection in the event of kernel panics, software crashes, or power loss. Potential issues are:
    • Data cached in the BeeGFS client itself
    • Data cached at the underlying filesystem layer in the storage server (XFS, ZFS, etc.)
    • Data cached by the OS kernel itself underneath the filesystem in the storage server

Goal of the Team (Remove Competitor and Expand Market Reach of BeeGFS)

• Improve solution resiliency to ensure data integrity by identifying potential component faults that could lead to BeeGFS data loss, and change the system design to handle the fault without any data integrity challenges
• Address known market opportunity
• Simple solution and support

[Swim-lane diagram: Parallel File System, ML, BeeGFS, IB and FC]

Data Integrity & Availability Expectations in HPC (Provide Enterprise Reliability with E-Series Storage – BeeGFS)

• However, the market we serve goes beyond the HPC scratch file system use cases
• The need for parallel file systems is growing beyond HPC
• We can solve data integrity problems while meeting the performance requirements of these customers
• Extend the reach of BeeGFS to further leverage our R&D investment

Areas of the Problem & Solutions / Enhancements

• Data loss protection
  • HPC customers using BeeGFS have the expectation of treating data as scratch space
  • Mitigation: shared storage cluster, STONITH
• Risk of data loss
  • Mitigation: buddy mirroring built into BeeGFS
    • Overhead of storage capacity
    • Write penalty for synchronous mirroring from primary to secondary
    • Requires an enterprise license support contract
• Storage servers do not have cache protection in the event of kernel panics, software crashes, or power loss
  • Possible problem solutions:
    • Eliminate caching at the client and/or storage server, mindful of performance impacts,
    • Find ways to protect/mirror and recover the data, or
    • Find ways such that BeeGFS or the underlying filesystem can tolerate the data loss and re-write data as needed after a fault (journaling, redo logs, etc.)

Data Loss Protection and Integrity

1. Prevent Data Loss
   a) Mount the file systems on the storage servers with sync enabled so that the storage server writes all the way to the E-Series storage array before returning a response.
      • E-Series provides the cache protection, with power-loss recovery
      • A crash of the storage server cannot lose any write cache (because there is none)
   b) Provide the ability to dynamically set the cache mode on a per-storage-pool basis.
      • Customers can migrate data from high-performance scratch space to data-protected space, or vice versa
      • Remount the file system on the fly (adding or removing sync) and update the fstab (see the sketch following this list)
      • Cache mode can be set on the targets associated with some pools to provide higher resiliency than others
2. Protect Data Integrity
   a) Configure the data file system with ZFS instead of XFS.
      • For HW RAID systems, BeeGFS recommends XFS/ext4 for data/metadata; for SW RAID, BeeGFS recommends ZFS
      • However, this puts data integrity at risk; ZFS provides better data consistency guarantees and recovery
      • Provide data integrity to protect the entire solution from data corruption or bit rot
      • Do not stripe at the storage server layer, since E-Series is providing RAID
      • Measure the performance impact of using ZFS for this additional protection (in progress)
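The remount-and-fstab flow in item 1b can be illustrated with a short sketch. This is not the modified beegfs-ctl code; it only shows the bookkeeping for a set of hypothetical storage-target mount points (the paths, target list, and fstab handling are assumptions for illustration).

```python
#!/usr/bin/env python3
"""Minimal sketch of the per-pool cache control described above: remount each
storage target with or without 'sync' and keep /etc/fstab in step.
The target list and paths are hypothetical; the real change lives in beegfs-ctl."""

import subprocess

# Hypothetical storage-target mount points belonging to one BeeGFS storage pool.
POOL_TARGETS = ["/data/beegfs/target01", "/data/beegfs/target02"]
FSTAB = "/etc/fstab"


def remount(target: str, sync: bool) -> None:
    """Remount the target filesystem with or without synchronous writes."""
    opts = "remount,sync" if sync else "remount,async"
    subprocess.run(["mount", "-o", opts, target], check=True)


def update_fstab(target: str, sync: bool) -> None:
    """Rewrite the options column for the target so the setting survives a reboot."""
    with open(FSTAB) as f:
        lines = f.readlines()
    rewritten = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 4 and fields[1] == target:
            opts = [o for o in fields[3].split(",") if o not in ("sync", "async")]
            opts.append("sync" if sync else "async")
            fields[3] = ",".join(opts)
            line = "\t".join(fields) + "\n"
        rewritten.append(line)
    with open(FSTAB, "w") as f:
        f.writelines(rewritten)


def set_pool_cache_mode(sync: bool) -> None:
    """Apply the cache mode to every target in the (hypothetical) pool."""
    for target in POOL_TARGETS:
        remount(target, sync)
        update_fstab(target, sync)


if __name__ == "__main__":
    # Switch the pool from high-performance scratch to data-protected mode.
    set_pool_cache_mode(sync=True)
```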

E-Series Data Corruption Statistics with and without PI: Why add data protection to BeeGFS?

• Parallel accesses introduce race conditions; caching then introduces inconsistency
• Race conditions need 2+ 'simultaneous' events, so their probability is proportional to the square of the event frequency (a back-of-the-envelope model follows below)
  • One event lasting 0.003 seconds every 5 minutes with 2 clients involved ⇒ a couple of race conditions a year
  • One event lasting 0.03 seconds every 30 seconds with 20 clients involved ⇒ a race condition every few minutes; a 100,000-fold increase
• Reasonable protection against filesystem corruption; application-level inconsistencies and races remain
• Even then, filesystem corruption does occur, quite often when one system crashes or hangs

[Chart: disruptions per week by month (Jan–Dec), code releases pre-2012 vs. all releases 2012–2019]
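The two scenarios above can be reproduced with a back-of-the-envelope model. This is an illustration only: it assumes events arrive independently at each client and counts how often events from two different clients overlap.

```python
"""Rough estimate of how often two clients' events overlap (a 'race'),
illustrating that the collision rate grows with the square of the event
frequency and with the number of client pairs."""


def collisions_per_second(clients: int, event_len_s: float, period_s: float) -> float:
    """Expected overlapping-event rate summed over all client pairs."""
    rate = 1.0 / period_s                # events per second per client
    pairs = clients * (clients - 1) / 2  # distinct client pairs
    # Two events of length d overlap when their start times fall within +/- d
    # of each other, so the per-pair collision rate is rate * (2 * d * rate).
    return pairs * rate * (2 * event_len_s * rate)


YEAR = 365 * 24 * 3600
slow = collisions_per_second(2, 0.003, 300)  # 0.003 s event every 5 min, 2 clients
fast = collisions_per_second(20, 0.03, 30)   # 0.03 s event every 30 s, 20 clients

print(f"slow case: ~{slow * YEAR:.1f} races per year")  # a couple per year
print(f"fast case: one race every ~{1 / fast:.0f} s")   # on the order of minutes
print(f"increase:  ~{fast / slow:,.0f}x")               # roughly the 100,000-fold jump
```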

POC BeeGFS Architecture for Our Solution

BeeGFS Test Client 1: Dell R730, 2.3 GHz, 24 cores, 128 GB RAM. The IOR and vdBench I/O tools measure filesystem performance on the clients (an example IOR invocation follows the configuration below).

100 Gb/s InfiniBand network

BeeGFS Storage Server #1: Dell R720, 3.0 GHz, 20 cores, 128 GB RAM
BeeGFS Storage Server #2: Dell R720, 2.9 GHz, 16 cores, 128 GB RAM
BeeGFS Metadata Server: Dell R730, 2.3 GHz, 24 cores, 128 GB RAM

Storage arrays: E5760, E5760
• Data: 18 × (8+2) RAID 6 volume groups of 7.2K HDDs, 18 volumes
• Metadata: 1 × (6+6) RAID 10 SSD volume group, 2 volumes
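As an example of how such a client is driven, the sketch below launches an IOR run over MPI and reports write/read bandwidth. The mount point, rank count, and block/transfer sizes are illustrative assumptions, not the exact parameters used in the POC.

```python
"""Illustrative IOR run for measuring BeeGFS bandwidth from a client.
Mount point, rank count, and sizes are assumptions for this sketch."""

import subprocess

BEEGFS_MOUNT = "/mnt/beegfs"  # hypothetical client mount point

cmd = [
    "mpirun", "-np", "24",    # one MPI rank per client core
    "ior",
    "-a", "POSIX",            # POSIX I/O API
    "-w", "-r",               # write phase, then read phase
    "-e",                     # fsync after the write phase
    "-F",                     # one file per process
    "-b", "64g",              # data written per rank
    "-t", "1m",               # transfer size per I/O call
    "-o", f"{BEEGFS_MOUNT}/ior_testfile",
]

subprocess.run(cmd, check=True)
```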

Base Performance for BeeGFS with Data Protection

Typical Parallel File System Configuration

POC BeeGFS Architecture:
• 1 client server, IB network, 3 BeeGFS servers, two E5760, 2 enclosures with 120 HDD, and 1 5724 with 12 SSDs
• 12 files (1.2 TB each)
• With cache: max write 8225.54 MiB/sec (8625.11 MB/sec), max read 11250.45 MiB/sec (11796.95 MB/sec)
• Cost with cache: $196,208 (less without cache)
• Data protection (example: without data protection, 7.5 GB out of 175 GB were lost when injecting a fault at the storage servers)

Tested Lustre Architecture at a vendor:
• 8 client servers, IB network, 2 storage servers, one E5760, 3 drive enclosures with 180 HDD (10 TB)
• 36 files (64 GB each)
• With cache: max write 7640 MB/sec, max read 11282 MB/sec
• Without cache: max write 7150 MB/sec, max read 11261 MB/sec
• Cost with cache: $234,563
• No data protection from the storage server

Cost Comparison between a Typical Lustre Configuration and This Solution's BeeGFS Configuration

BeeGFS Configuration:
• 3 storage servers (Dell 720 with 120 GB memory): $24,342.00
• IB network: $12,000.00
• System: FC HBA, IB HCA, cables, MSW, and E5760 with 2 Trafford "60 drives" EBODs and 1 Camden "24 drives", 4 TB drives: $129,066.00
• BeeGFS support with 3-year lease: $30,800.00
• Total: $196,208.00

Lustre Configuration:
• 2 storage servers (Dell 720 with 120 GB memory): $16,228.00
• IB network: $28,000.00
• System: FC HBA, IB HCA, cables, MSW, and E5760 with 2 Trafford "120 drives" EBODs, 10 TB drives: $148,566.00
• Lustre support with 3-year lease: $41,769.00
• Total: $234,563.00

The NetApp ESG BeeGFS configuration is 17% cheaper (without considering the client servers), is simple and easy to manage, and delivers similar throughput with data protection (without tuning, and with headroom available to improve performance via MAX Data).

What is the solution?

• Implemented a data integrity caching scheme and control through beegfs-ctl (CLI)
• Modified BeeGFS source code and recompiled binaries for internal use
• For each storage pool, the tool locates the storage targets, remounts the file systems with the new cache setting, and updates the fstab
• Provide new cache options at the storage pool level, so some data can be used as high-performance scratch and some with data integrity protection
• Observed minimal performance disruption on target workloads; the setting can be modified at runtime
• Achieving this performance with data integrity would not be possible with a BeeGFS JBOD configuration
• Achieving this level of data integrity would not be this affordable with buddy mirroring
• Wrote our own I/O integrity test tool (bgbench), since we could not detect the data loss with vdbench (a minimal sketch of this kind of check follows below)
• Installed and configured ZFS on the BeeGFS storage servers
• Non-tangible benefits:
  • The team learned a significant amount about BeeGFS and how it operates, which will aid future E-Series projects
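The listing below is a minimal sketch of the kind of write-then-verify check such an integrity tool performs; it is not the actual bgbench. It writes a deterministic, block-unique pattern, makes it durable with fsync, and later re-reads every block to find anything lost or stale after a fault. The file path and sizes are assumptions.

```python
"""Minimal write/verify data-integrity check in the spirit of bgbench
(not the actual tool). Path and sizes are illustrative assumptions."""

import hashlib
import os

PATH = "/mnt/beegfs/integrity_test.dat"  # hypothetical BeeGFS client path
BLOCK = 1 << 20                          # 1 MiB blocks
BLOCKS = 256


def block_pattern(i: int) -> bytes:
    """Deterministic, block-unique pattern so lost or stale blocks are detectable."""
    seed = hashlib.sha256(f"block-{i}".encode()).digest()
    return (seed * (BLOCK // len(seed) + 1))[:BLOCK]


def write_phase() -> None:
    with open(PATH, "wb") as f:
        for i in range(BLOCKS):
            f.write(block_pattern(i))
        f.flush()
        os.fsync(f.fileno())  # data must be durable before any fault is injected


def verify_phase() -> list:
    """Return the indices of blocks that did not read back as written."""
    bad = []
    with open(PATH, "rb") as f:
        for i in range(BLOCKS):
            if f.read(BLOCK) != block_pattern(i):
                bad.append(i)
    return bad


if __name__ == "__main__":
    write_phase()
    # ... inject the fault here (e.g. power-cycle a storage server), then verify:
    print("lost or corrupt blocks:", verify_phase() or "none")
```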

Key Takeaways and Closing Thoughts

• Implemented a flexible data integrity caching scheme for the BeeGFS solution (CLI execution from client server)

• Used ZFS as the storage server file system to provide greater data protection (see the ZFS configuration sketch below)

• Measured performance with the data integrity improvements and observed minimal impact on bandwidth and IOPS with XFS

• Wrote an alternative I/O generation tool to check for data loss

• Gained understanding of workload characteristics to vet BeeGFS for various use cases

Future Enhancements:
• Further testing with ZFS
• Further testing with the cache setting on the metadata server
• Further scale-out testing
• Evaluate the feasibility of BeeGFS for M&E (media and entertainment) environments
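For context on the ZFS items above, the sketch below shows one way to put ZFS on a storage server on top of an E-Series volume without adding another RAID layer (no raidz or mirror vdevs, since the array already provides RAID), along with a few commonly tuned properties. The device name, pool name, and property values are assumptions, not the POC's exact configuration.

```python
"""Illustrative ZFS layout for a BeeGFS storage server backed by an E-Series
RAID 6 volume. Device name, pool name, and property values are assumptions."""

import subprocess


def run(*cmd: str) -> None:
    subprocess.run(cmd, check=True)


DEVICE = "/dev/mapper/eseries_vol01"  # hypothetical multipath device (one E-Series volume)
POOL = "beegfs_data01"

# A single plain vdev: no raidz or mirror at the ZFS layer, because the
# E-Series array already provides RAID protection underneath.
run("zpool", "create", "-o", "ashift=12", POOL, DEVICE)

# Properties commonly tuned for large streaming BeeGFS storage targets.
run("zfs", "set", "recordsize=1M", POOL)
run("zfs", "set", "atime=off", POOL)
run("zfs", "set", "xattr=sa", POOL)  # store extended attributes efficiently on Linux
```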

Questions

Thank You
