How BeeGFS excels in extreme HPC scale-out environments
HPC Knowledge Meeting '19

www.beegfs.io | 2019 | Alexander Eekhoff, Manager System Engineering

About ThinkParQ

• Established in 2014 as a spin-off from the Fraunhofer Center for High-Performance Computing, with a strong focus on R&D
• 5 rankings in the top 20 on the IO-500 list
• Awarded the HPCwire 2018 Best Storage Product or Technology Award
• Together with partners, ThinkParQ provides fast, flexible, and solid storage solutions around BeeGFS for users' needs

HPC Knowledge Meeting ‘19 Delivering solutions for

• HPC
• AI / Deep Learning
• Life Sciences
• Oil and Gas

HPC Knowledge Meeting ‘19 Technology Partners

HPC Knowledge Meeting ‘19 Partners

[Partner logos by tier (Platinum, Gold, Partners) and region (APAC, EMEA, NA)]


HPC Knowledge Meeting ‘19 BeeGFS – The Leading Parallel Cluster File System

[Architecture diagram: Client Service, Metadata Service, Storage Service; direct parallel file access]

• Performance: well balanced from small to large files
• Ease of Use: easy to deploy and integrate with existing infrastructure
• Scalability: increase file system performance and capacity, seamlessly and nondisruptively
• Robustness: high availability design enabling continuous operations

HPC Knowledge Meeting ‘19 Quick Facts: BeeGFS

• A hardware-independent parallel file system (aka Software-defined Parallel Storage)
• Runs on various platforms: x86, ARM, OpenPOWER, …
• Supports multiple networks (InfiniBand, Omni-Path, Ethernet, …)
• Open source
• Runs on various distros: RHEL, SLES, Ubuntu, …
• NFS, CIFS, Hadoop enabled

[Diagram: /mnt/beegfs/dir with file chunks striped across Storage Servers #1-#5 and metadata on Metadata Server #1]
Simply grow capacity and performance to the level that you need

HPC Knowledge Meeting ‘19 Enterprise Features

BeeGFS Enterprise Features (under support contract):
• High Availability
• Quota Enforcement
• Access Control Lists (ACLs)
• Storage Pools

Support Benefits:
• Professional Support
• Customer Portal (training videos, additional documentation)
• Special repositories with early updates and hotfixes
• Guaranteed next-business-day response

End User License Agreement https://www.beegfs.io/docs/BeeGFS_EULA.txt

HPC Knowledge Meeting ‘19 How BeeGFS Works

beegfs.io What is BeeGFS

[Diagram: a file in /mnt/beegfs/dir is split into chunks 1, 2, 3 and striped across Storage Servers #1-#5; metadata (M) is kept on Metadata Server #1]

Simply grow capacity and performance to the level that you need

HPC Knowledge Meeting ‘19 BeeGFS Architecture

• Client Service
  • Native Linux module to mount the file system
• Management Service
  • Service registry and watchdog
• Metadata Service
  • Maintains striping information for files
  • Not involved in data access between file open/close
• Storage Service
  • Stores the (distributed) file contents
• Graphical Administration and Monitoring Service
  • GUI to perform administrative tasks and monitor system information
  • Can be used for a "Windows-style installation"
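To make the mapping concrete, each service ships as its own package and runs as its own daemon; a minimal sketch, assuming systemd-based BeeGFS 7.x packaging (unit names may differ per version and distribution):

# Management host: service registry / watchdog
$ systemctl start beegfs-mgmtd

# Metadata server(s)
$ systemctl start beegfs-meta

# Storage server(s)
$ systemctl start beegfs-storage

# Client nodes: helper daemon plus the kernel-module client
$ systemctl start beegfs-helperd
$ systemctl start beegfs-client    # mounts file systems listed in /etc/beegfs/beegfs-mounts.conf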

HPC Knowledge Meeting ‘19 BeeGFS Architecture

[Diagram: Clients with direct, parallel file access to Metadata Servers and Storage Servers; Management Host and Graphical Administration & Monitoring system alongside]

• Management Service
  • Meeting point for servers and clients
  • Watches registered services and checks their state
  • Not critical for performance, stores no user data
  • Typically not running on a dedicated machine

HPC Knowledge Meeting ‘19 BeeGFS Architecture

• Metadata Service
  • Stores information about the data:
    • Directory information
    • File and directory ownership
    • Location of user data files on storage targets
  • Not involved in data access between file open/close
  • Faster CPU cores improve latency
  • Manages one metadata target
    • In general, any directory on an existing local file system
    • Typically a RAID1 or RAID10 on SSD or NVMe devices
  • Stores complete metadata, including file size
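For illustration, the striping and ownership information held by the metadata service can be queried and set from any client with beegfs-ctl (the path and values below are examples; output details vary by release):

# Show which metadata node owns an entry and how the file is striped
$ beegfs-ctl --getentryinfo /mnt/beegfs/dir/somefile

# Define the stripe pattern for new files in a directory (e.g. 4 targets, 1m chunks)
$ beegfs-ctl --setpattern --numtargets=4 --chunksize=1m /mnt/beegfs/dir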

HPC Knowledge Meeting ‘19 BeeGFS Architecture

• Storage Service
  • Stores striped user file contents (data chunk files)
  • One or multiple storage services per BeeGFS instance
  • Manages one or more storage targets
    • In general, any directory on an existing local file system
    • Typically a RAID6 (8+2 or 10+2) or RAIDz2 volume, either internal or externally attached
    • Can also be a single HDD, NVMe, or SSD device
  • Multiple RDMA interfaces per server possible
    • Different storage service instances bind to different interfaces
    • Different IP subnets per interface so that routing works correctly
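A quick way to inspect the registered storage targets and their capacity is the standard client-side tooling; an illustrative sketch (exact output and options depend on the BeeGFS version):

# Free space and inode usage per metadata and storage target
$ beegfs-df

# List storage targets together with their reachability/consistency state
$ beegfs-ctl --listtargets --nodetype=storage --state

# Show which interfaces and protocol (RDMA vs. TCP) the client connections actually use
$ beegfs-net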

HPC Knowledge Meeting ‘19 Live per-Client and per-User Statistics

HPC Knowledge Meeting ‘19 BeeGFS - Design Philosophy

• Designed for Performance, Scalability, Robustness and Ease of Use
• Distributed metadata
• No Linux kernel patches; runs on top of ext4, XFS, ZFS, …
• Scalable multithreaded architecture
• Supports RDMA/RoCE & TCP (InfiniBand, Omni-Path, 100/40/10/1 GbE, …)
• Easy to install and maintain (user-space servers)
• Robust and flexible (all services can be placed independently)
• Hardware agnostic
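As a hedged example of the multi-network support, services and clients can be given an ordered list of interfaces to use via the connInterfacesFile setting in the BeeGFS config files (the file location and interface names below are examples):

# /etc/beegfs/beegfs-client.conf (excerpt)
# connInterfacesFile = /etc/beegfs/connInterfacesFile

# The referenced file lists one interface per line, in order of preference:
$ cat /etc/beegfs/connInterfacesFile
ib0
eth0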

HPC Knowledge Meeting ‘19 Key Features

beegfs.io High Availability I – Buddy Mirroring

• Built-in replication for high availability
• Flexible setting per directory
• Individual for metadata and/or storage
• Buddies can be in different racks or different fire zones

[Diagram: Storage Servers #1-#4 with Targets #101, #201, #301, #401 arranged into Buddy Group #1 and Buddy Group #2]
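A sketch of how buddy groups are typically defined with beegfs-ctl (target IDs and the directory are examples; options can differ slightly between releases):

# Pair two storage targets into buddy mirror group 1
$ beegfs-ctl --addmirrorgroup --nodetype=storage --primary=101 --secondary=201 --groupid=1

# Enable metadata mirroring (may require a quiesced file system; see the documentation)
$ beegfs-ctl --mirrormd

# Enable data mirroring for new files created under a directory
$ beegfs-ctl --setpattern --pattern=buddymirror /mnt/beegfs/projects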

HPC Knowledge Meeting ‘19 High Availability II – Shared storage

• Shared storage together with Pacemaker/Corosync
• No extra storage space needed
• Works in an active/active layout
• BeeGFS ha-utils simplify setup and administration

[Diagram: Storage Servers #1-#4 attached to shared Targets #101, #201, #301, #401]

HPC Knowledge Meeting ‘19 Storage Pool

• Support for different types of storage
• Single namespace across all tiers

[Diagram: storage services grouped into a Performance Pool (current projects) and a Capacity Pool (finished projects)]
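As a hedged illustration, pools are listed and assigned per directory through beegfs-ctl (the pool ID and path are examples):

# List the existing storage pools and their targets
$ beegfs-ctl --liststoragepools

# Place new files created under a directory into a specific pool
$ beegfs-ctl --setpattern --storagepoolid=2 /mnt/beegfs/finished_projects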

HPC Knowledge Meeting ‘19 BeeOND – BeeGFS On Demand

• Create a parallel file system instance on-the-fly
• Start/stop with one simple command
• Use cases: cloud computing, test systems, cluster compute nodes, …
• Can be integrated into the cluster batch system
• Common use case: per-job parallel file system
  • Aggregate the performance and capacity of local SSDs/disks in the compute nodes of a job
  • Take load off the global storage
  • Speed up "nasty" I/O patterns

[Diagram: Compute Nodes #1 to #n with user-controlled data staging to/from the global storage]

HPC Knowledge Meeting ‘19 The easiest way to set up a parallel filesystem…

# GENERAL USAGE
$ beeond start -n <nodefile> -d <storage path> -c <client mountpoint>

------

# EXAMPLE
$ beeond start -n $NODEFILE -d /local_disk/beeond -c /my_scratch

Starting BeeOND Services…
Mounting BeeOND at /my_scratch…
Done.

HPC Knowledge Meeting ‘19 BeeGFS Additional Features

• HA support
• Quota (user/group)
• ACLs
• Support for different types of storage
• Modification event logging
• Statistics in a time series database
• Cluster manager integration, e.g. Bright Cluster Manager, Univa
• Cloud readiness for AWS / Azure
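For the quota feature, usage and limits are handled via beegfs-ctl; a minimal sketch assuming quota enforcement is enabled in the server configuration (user/group names and limits are examples):

# Show current usage and limits for a user or a group
$ beegfs-ctl --getquota --uid jdoe
$ beegfs-ctl --getquota --gid researchers

# Set size and inode limits for a user
$ beegfs-ctl --setquota --uid jdoe --sizelimit=10T --inodelimit=5M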

HPC Knowledge Meeting ‘19 Bright Cluster Manager Integration

HPC Knowledge Meeting ‘19 BeeGFS and BeeOND

beegfs.io Scale from small

Converged Setup

HPC Knowledge Meeting ‘19 Into Enterprise

[Diagram: multiple dedicated storage services accessed in parallel by the clients (direct parallel file access)]

HPC Knowledge Meeting ‘19 to BeeOND

[Diagram: BeeOND storage services running on NVMe devices in the compute nodes]

HPC Knowledge Meeting ‘19 BeeGFS Use Cases

beegfs.io Alfred Wegener Institute for Polar and Marine Research

• The institute was founded in 1980 and is named after the meteorologist, climatologist and geologist Alfred Wegener
• Government funded
• Conducts research in the Arctic, in the Antarctic and in the high and mid latitude oceans
• Additional research topics:
  • North Sea research
  • Marine biological monitoring
  • Technical marine developments

• Current mission: In September 2019 the icebreaker Polarstern will drift through the Arctic Ocean for one year with 600 team members from 17 countries and use the gathered data to take climate and ecosystem research to the next level.

HPC Knowledge Meeting ‘19 Day to day HPC operations @AWI

• CS400
  • 11,548 cores
  • 316 nodes:
    • 2x Intel Xeon Broadwell 18-core CPUs
    • 64GB RAM (DDR4 2400MHz)
    • 400GB SSD
  • 4 fat compute nodes, as above, but 512GB RAM
  • 1 very fat node, 2x Intel Broadwell 14-core CPUs, 1.5TB RAM
  • Intel Omni-Path network
  • 1024TB fast parallel file system (BeeGFS)
  • 128TB home and software file system

HPC Knowledge Meeting ‘19 Do you remember BeeOND?

• Global BeeGFS storage on spinning disks
  • 1PB of scratch_fs providing 80GB/s
• 316 compute nodes
  • Each equipped with a 400GB SSD
  • 316 x 500MB/s per SSD equals roughly 150GB/s aggregate BeeOND burst "for free"

"Robust and stable, even in a case of unexpected power failure."
Dr. Malte Thoma, Alfred Wegener Institute, Helmholtz Centre for Polar and Marine Research (Bremerhaven, Germany)

HPC Knowledge Meeting ‘19 Tokyo Institute of Technology: Tsubame 3

• Top national university for science and technology in Japan
• 130-year history
• Over 10,000 students, located in the Tokyo area

Tsubame 3
• Latest Tsubame supercomputer
• #1 on the Green500 in November 2017
• 14.110 GFLOPS per watt
• BeeOND uses 1PB of available NVMe

HPC Knowledge Meeting ‘19 Tsubame 3 Configuration

• 540 nodes
  • Four NVIDIA Tesla P100 GPUs per node (2,160 total)
  • Two 14-core Intel Xeon E5-2680 v4 processors (15,120 cores total)
  • Two dual-port Intel Omni-Path Architecture HFIs (2,160 ports total)
  • 2TB of Intel SSD DC Product Family NVMe storage devices
• Simple integration with Univa Grid Engine

HPC Knowledge Meeting ‘19 AIST (National Institute of Advanced Industrial Science and Technology)

• Japanese research institute located in the Greater Tokyo Area
• Over 2,000 researchers
• Part of the Ministry of Economy, Trade and Industry

ABCI (AI Bridging Cloud Infrastructure)
• Japanese supercomputer in production since July 2018
• Theoretical performance of 130 petaflops, one of the fastest in the world
• Will make its resources available through the cloud to various private and public entities in Japan
• #7 on the TOP500 list

HPC Knowledge Meeting ‘19 Largest Machine Learning Environment in Japan uses BeeOND

• 1,088 servers
  • Two Intel Xeon Gold CPUs (2,176 CPUs in total)
  • Four NVIDIA Tesla V100 GPU computing cards (4,352 GPUs in total)
  • Intel SSD DC P4600 series (NVMe) as local storage, 1.6TB per node (about 1.6PB in total)
• InfiniBand EDR
• Simple integration with Univa Grid Engine

HPC Knowledge Meeting ‘19 Spookfish

• Aerial survey system based in Western Australia
• High-resolution images are provided to customers who need up-to-date information on terrain they plan to utilize
• Information can be fed into Geographical Information System (GIS) and CAD applications

HPC Knowledge Meeting ‘19 Spookfish System Architecture

• Metadata server x 6
  • Supermicro chassis with 4 x Intel Xeon X7560 and 256GB RAM
  • Only performs MDS services
• Metadata target x 6 with buddy mirroring
• Converged storage server x 40
  • Dell R730 with 2 x Intel Xeon E5-2650 v4 CPUs and 128GB of RAM
  • Storage servers also perform processing for applications
  • Uses Linux cgroups to avoid out-of-memory events
  • cgroups not used for CPU usage; so far no issues with CPU shortage
• Storage target x 160 with buddy mirroring
• 10Gb Ethernet
• Performance exceeded expectations with 10GB/s read and 5-6GB/s write after tuning

"The result [of switching to BeeGFS] is that we're now able to process about 3 times faster with BeeGFS than with our old NFS server. We're seeing speeds of up to 10GB/s read and 5-6GB/s write."
–Spookfish

HPC Knowledge Meeting ‘19 CSIRO

• The Commonwealth Scientific and Industrial Research Organisation (CSIRO) has adopted the BeeGFS file system for their 2PB all-NVMe storage in Australia, making it one of the largest NVMe storage systems in the world.
• Overview:
  • 4 x metadata server
  • 32 x storage server (Dell, all NVMe; 24 x 3.2TB NVMe per server)
  • 2 PiB usable capacity
• Look forward to ISC to see what the beast can do!
• Further details: http://www.pacificteck.com/?p=437

HPC Knowledge Meeting ‘19 Follow BeeGFS:

HPC Knowledge Meeting ‘19 Sales engagement

beegfs.io What do we need?

• Full Address:
• Name:
• Email:
• Phone #:
• Business (university, research institute, life science, HPC, etc.):

• Quantity of Single Target (MDS) servers:
• Quantity of Multi Target (OSS) servers:
• RAID settings:
• Server type:
• Hardware platform (e.g. Intel Xeon, AMD, ARM):
• Capacity requirement:
• Performance requirement:
• Quantity of clients (rough number):
• BeeOND: up to 100, up to 500, > 500 nodes
• Support duration (3 years, 5 years):
• Expected support start date:
• System nickname (to distinguish multiple systems):
• Interconnect (EDR, FDR, QDR, OPA, 40/10 GigE):
• Linux distribution (e.g. Red Hat):

HPC Knowledge Meeting ‘19 BeeGFS terms for sizing

• The term “target” refers to a storage device exported through BeeGFS. Typically, a target is a RAID6 volume (for BeeGFS storage servers) or a RAID10 volume (for BeeGFS metadata servers) consisting of several disks, but it can optionally also be a single HDD or SSD.
• A “Single Target Server” exports exactly one target, either for storage or metadata.
• A “Multi Target Server” exports up to six targets for storage and optionally one target for metadata.
• An “Unlimited Target Server” exports an unlimited number of targets for storage and/or metadata.

HPC Knowledge Meeting ‘19 Pricing structure

HPC Knowledge Meeting ‘19

• 1st Level Support (Partner)
  • 1st level support is done by the reseller or a qualified partner (e.g. sub-contractor / sub-reseller) of the reseller. 1st level support staff have the knowledge level of general system administrators to perform the following tasks:
    • Definition of the problem, steps to reproduce and expected behavior
    • Description of the customer hardware setup
    • Description of the customer software setup (e.g. software, firmware and driver versions)
    • Gathering of other potentially relevant information such as log files
    • Attempts to solve problems based on previously known similar cases (e.g. recommendation of software updates or configuration changes)

• 2nd Level Support (Partner)
  • 2nd level support is done by a Gold Reseller or a qualified partner (e.g. sub-contractor / sub-reseller) of the Gold Reseller. 2nd level support staff have knowledge of BeeGFS concepts and tools as well as of the general storage system stack (e.g. storage devices and network tools / testing) to perform the following tasks:
    • Problem and root cause analysis (e.g. based on log file analysis)
    • Hardware check (e.g. network, storage devices, cables) and software check, including attempts to reproduce issues on a different test system to verify whether problems are caused by hardware malfunction at the customer site
    • Issue discussion and potential solution or work-around discussion with the customer
    • Definition of a minimal setup to reproduce problems before escalation to the higher support level

• 3rd Level Support (ThinkParQ)
  • 3rd level support is provided by ThinkParQ. The BeeGFS support team has detailed knowledge of BeeGFS internals. Incoming support tickets are prioritized based on severity.
    • Full problem and root cause analysis, optionally including remote login to the customer system via ssh
    • Code inspection for detailed internal analysis
    • Patch development with early update releases for supported customers
    • Recommendation of performance tuning methods and HPC consulting
    • Reaction time is next business day, German working hours

HPC Knowledge Meeting ‘19 Installation / Training

• Installation and training can be done remotely.

• Remote ssh installation: 1,200 USD per day

• BeeGFS Remote Training (agenda & time): 10-hour session
  • Free of charge for partners; for end customers we charge 1,200 USD
  • Introduction: BeeGFS basic concepts, architecture and features
  • How do I ... with BeeGFS: typical administrative tasks
  • Sizing and tuning
  • Designing and implementing storage solutions with BeeGFS
  • Projects with BeeGFS
  • Reference installations and best practices

• Sales & Presales Training:
  • BeeGFS sales and pre-sales training mapped to your customer focus, with file system comparison and use cases, to get your sales force up to speed. Please let us know when you are ready to schedule it.

HPC Knowledge Meeting ‘19 Pricing Model

BeeGFS sizing for your information:

The term “target” refers to a storage device exported through BeeGFS. Typically, a target is a RAID6 volume (for BeeGFS storage servers) or a RAID10 volume (for BeeGFS metadata servers) consisting of several disks, but it can optionally also be a single HDD or SSD.

• A “Single Target Server” exports exactly one target, either for storage or metadata.
• A “Multi Target Server” exports up to six targets for storage and optionally one target for metadata.
• An “Unlimited Target Server” exports an unlimited number of targets for storage and/or metadata.

HPC Knowledge Meeting ‘19 BeeGFS Storage Engine under the Hood

HPC Knowledge Meeting ‘19