How BeeGFS excels in extreme HPC scale-out environments
HPC Knowledge Meeting '19
Alexander Eekhoff, Manager System Engineering, www.beegfs.io, 2019

About ThinkParQ
• Established in 2014 as a spinoff from the Fraunhofer Center for High-Performance Computing, with a strong focus on R&D
• 5 rankings in the top 20 on the IO-500 list
• Awarded the HPCwire 2018 Best Storage Product or Technology Award
• Together with partners, ThinkParQ provides fast, flexible, and solid storage solutions around BeeGFS for the users’ needs
Delivering solutions for: HPC, AI / Deep Learning, Life Sciences, Oil and Gas
Technology Partners

Partners (Platinum, Gold, and regional partners across APAC, EMEA, and NA)
BeeGFS – The Leading Parallel Cluster File System
• Performance: well balanced from small to large files
• Ease of Use: easy to deploy and integrate with existing infrastructure
• Scalability: increase file system performance and capacity, seamlessly and non-disruptively
• Robustness: high availability design enabling continuous operations

(Diagram: client service with direct parallel file access to the metadata and storage services.)
Quick Facts: BeeGFS
• A hardware-independent parallel file system (aka software-defined parallel storage)
• Runs on various platforms: x86, ARM, OpenPOWER, …
• Supports multiple networks (InfiniBand, Omni-Path, Ethernet, …)
• Open source
• Runs on various Linux distros: RHEL, SLES, Ubuntu, …
• NFS, CIFS, Hadoop enabled
Enterprise Features
BeeGFS Enterprise Features (under support contract):
• High Availability
• Quota Enforcement
• Access Control Lists (ACLs)
• Storage Pools
Support Benefits:
• Professional support
• Customer portal (training videos, additional documentation)
• Special repositories with early updates and hotfixes
• Guaranteed next-business-day response
End User License Agreement https://www.beegfs.io/docs/BeeGFS_EULA.txt
How BeeGFS Works
What is BeeGFS
(Diagram: a file in /mnt/beegfs/dir is split into chunks 1, 2, 3, … striped across Storage Servers #1-#5, with metadata on Metadata Server #1.)
Simply grow capacity and performance to the level that you need
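The striping idea in the figure can be sketched as follows; the chunk size, target count, and file size here are illustrative assumptions (512 KiB is the BeeGFS default chunk size), not BeeGFS code:

```shell
# Toy sketch of round-robin striping: the chunks of a file are distributed
# across the storage targets, so adding servers adds both capacity and
# aggregate bandwidth.
chunksize=$(( 512 * 1024 ))         # bytes per chunk (BeeGFS default)
targets=4                           # storage targets in this toy setup
filesize=$(( 3 * 1024 * 1024 ))     # a 3 MiB example file

chunks=$(( (filesize + chunksize - 1) / chunksize ))
for i in $(seq 0 $(( chunks - 1 ))); do
    echo "chunk $i -> storage target $(( i % targets + 1 ))"
done
```

With 6 chunks over 4 targets, targets 1 and 2 each hold two chunks, so a sequential read touches all four targets in parallel.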
BeeGFS Architecture
• Client Service: native Linux kernel module to mount the file system
• Management Service: service registry and watchdog
• Metadata Service: maintains striping information for files; not involved in data access between file open/close
• Storage Service: stores the (distributed) file contents
• Graphical Administration and Monitoring Service: GUI to perform administrative tasks and monitor system information; can be used for a “Windows-style installation“
• Management Service: meeting point for servers and clients
• Watches registered services and checks their state
• Not critical for performance, stores no user data
• Typically not running on a dedicated machine
• Metadata Service: stores information about the data
  • Directory information
  • File and directory ownership
  • Location of user data files on storage targets
• Not involved in data access between file open/close
• Faster CPU cores improve latency
• Manages one metadata target: in general, any directory on an existing local file system; typically a RAID1 or RAID10 on SSD or NVMe devices
• Stores complete metadata including file size
• Storage Service: stores striped user file contents (data chunk files)
• One or multiple storage services per BeeGFS instance
• Manages one or more storage targets
  • In general, any directory on an existing local file system
  • Typically a RAID6 (8+2 or 10+2) or ZFS RAIDZ2 volume, either internal or externally attached
  • Can also be a single HDD, NVMe, or SSD device
• Multiple RDMA interfaces per server possible
  • Different storage service instances bind to different interfaces
  • Different IP subnets for the interfaces, for the routing to work correctly

Live per-Client and per-User Statistics
BeeGFS - Design Philosophy
• Designed for performance, scalability, robustness, and ease of use
• Distributed metadata
• No Linux kernel patches; runs on top of ext4, XFS, ZFS, Btrfs, …
• Scalable multithreaded architecture
• Supports RDMA / RoCE and TCP (InfiniBand, Omni-Path, 100/40/10/1 GbE, …)
• Easy to install and maintain (user-space servers)
• Robust and flexible (all services can be placed independently)
• Hardware agnostic
Key Features
High Availability I – Buddy Mirroring
• Built-in replication for high availability
• Flexible setting per directory
• Individual for metadata and/or storage
• Buddies can be in different racks or different fire zones

(Diagram: Storage Servers #1-#4 with Targets #101, #201, #301, #401 paired into Buddy Groups #1 and #2.)
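Since each buddy group keeps a full copy of every chunk on both buddies, storage buddy mirroring trades half of the raw capacity for redundancy; a tiny sketch with an assumed raw capacity:

```shell
# With storage buddy mirroring, every chunk exists on both buddies of a
# group, so usable capacity is half the raw capacity of the mirrored targets.
raw_tb=400                          # assumed raw capacity across the buddies
usable_tb=$(( raw_tb / 2 ))
echo "usable with buddy mirroring: ${usable_tb} TB"
```

This is why mirroring is configurable per directory: only the data that needs the redundancy has to pay the capacity cost.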
High Availability II – Shared Storage
• Shared storage together with Pacemaker/Corosync
• No extra storage space needed
• Works in an active/active layout
• BeeGFS ha-utils simplify setup and administration

(Diagram: Storage Servers #1-#4 with shared access to Targets #101, #201, #301, #401.)
Storage Pools
• Support for different types of storage
• Single namespace across all tiers

(Diagram: performance pool for current projects, capacity pool for finished projects, served by the same storage services.)
BeeOND – BeeGFS On Demand
• Create a parallel file system instance on-the-fly
• Start/stop with one simple command
• Use cases: cloud computing, test systems, cluster compute nodes, …
• Can be integrated into the cluster batch system
• Common use case: per-job parallel file system
  • Aggregate the performance and capacity of local SSDs/disks in the compute nodes of a job
  • Take load off the global storage
  • Speed up "nasty" I/O patterns

(Diagram: BeeOND spanning Compute Nodes #1-#n, with user-controlled data staging to the global storage.)
The easiest way to set up a parallel file system…
# GENERAL USAGE
$ beeond start -n <nodefile> -d <storage-dir> -c <mountpoint>

------

# EXAMPLE
$ beeond start -n $NODEFILE -d /local_disk/beeond -c /my_scratch
Starting BeeOND Services…
Mounting BeeOND at /my_scratch…
Done.
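For the per-job use case, a batch prologue/epilogue typically wraps this same command; the nodefile variable, the paths, and the availability guard below are assumptions for illustration, not fixed BeeOND names:

```shell
# Hypothetical per-job BeeOND wrapper: start an instance over the job's
# nodes, run the job's I/O against the scratch mount, then stop it again.
NODEFILE=${NODEFILE:-/tmp/job_nodefile}   # list of the job's compute nodes
SCRATCH=/my_scratch

if command -v beeond >/dev/null 2>&1; then
    beeond start -n "$NODEFILE" -d /local_disk/beeond -c "$SCRATCH"
    # ... the job reads and writes under $SCRATCH here ...
    beeond stop -n "$NODEFILE"
else
    echo "beeond not installed; skipping BeeOND setup"
fi
```

Staging results from $SCRATCH back to the global file system before `beeond stop` is the user's responsibility, matching the "user-controlled data staging" note above.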
BeeGFS Additional Features
• HA support
• User/group quotas
• ACLs
• Support for different types of storage
• Modification event logging
• Statistics in a time series database
• Cluster manager integration, e.g. Bright Cluster Manager, Univa
• Cloud readiness for AWS / Azure
Bright Cluster Manager Integration
BeeGFS and BeeOND
Scale from small
Converged Setup
Into Enterprise
(Diagram: multiple dedicated storage services with direct parallel file access from the clients.)
to BeeOND
(Diagram: BeeOND storage services running on the compute nodes' NVMe drives.)
BeeGFS Use Cases
Alfred Wegener Institute for Polar and Marine Research
• The institute was founded in 1980 and is named after the meteorologist, climatologist, and geologist Alfred Wegener.
• Government funded
• Conducts research in the Arctic, the Antarctic, and the high- and mid-latitude oceans
• Additional research topics:
  • North Sea research
  • Marine biological monitoring
  • Technical marine developments
• Current mission: in September 2019 the icebreaker Polarstern will drift through the Arctic Ocean for one year with 600 team members from 17 countries, using the gathered data to take climate and ecosystem research to the next level.
Day-to-day HPC operations @ AWI
• CS400
• 11,548 cores
• 316 nodes:
  • 2x Intel Xeon Broadwell 18-core CPUs
  • 64 GB RAM (DDR4-2400)
  • 400 GB SSD
• 4 fat compute nodes, as above but with 512 GB RAM
• 1 very fat node: 2x Intel Broadwell 14-core CPUs, 1.5 TB RAM
• Intel Omni-Path network
• 1,024 TB fast parallel file system (BeeGFS)
• 128 TB home and software file system
Do you remember BeeOND?
• Global BeeGFS storage on spinning disks
• 1 PB of scratch_fs providing 80 GB/s
• 316 compute nodes, each equipped with a 400 GB SSD
• 316 x 500 MB/s per SSD equals roughly 150 GB/s of aggregate BeeOND burst "for free"

"Robust and stable, even in a case of unexpected power failure."
Dr. Malte Thoma, Alfred Wegener Institute, Helmholtz Centre for Polar and Marine Research (Bremerhaven, Germany)
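The aggregate burst figure follows directly from the node count and an assumed per-SSD sequential speed:

```shell
# Aggregate BeeOND burst estimate from the AWI numbers: 316 nodes with one
# local SSD each, assuming roughly 500 MB/s sequential throughput per SSD.
nodes=316
mb_per_ssd=500                      # MB/s per local SSD (assumed)
echo "aggregate burst: $(( nodes * mb_per_ssd / 1000 )) GB/s"
```

That works out to 158 GB/s of raw SSD bandwidth, nearly double the 80 GB/s of the global scratch file system.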
Tokyo Institute of Technology: Tsubame 3
• Top national university for science and technology in Japan
• 130-year history
• Over 10,000 students located in the Tokyo area
Tsubame 3
• Latest Tsubame supercomputer
• #1 on the Green500 in November 2017
• 14.110 GFLOPS per watt
• BeeOND uses 1 PB of available NVMe
Tsubame 3 Configuration
• 540 nodes
• Four NVIDIA Tesla P100 GPUs per node (2,160 total)
• Two 14-core Intel Xeon E5-2680 v4 processors per node (15,120 cores total)
• Two dual-port Intel Omni-Path Architecture HFIs per node (2,160 ports total)
• 2 TB of Intel SSD DC Product Family NVMe storage devices per node
• Simple integration with Univa Grid Engine
AIST (National Institute of Advanced Industrial Science and Technology)
• Japanese research institute located in the Greater Tokyo Area
• Over 2,000 researchers
• Part of the Ministry of Economy, Trade and Industry
ABCI (AI Bridging Cloud Infrastructure)
• Japanese supercomputer in production since July 2018
• Theoretical performance of 130 petaflops, one of the fastest in the world
• Will make its resources available through the cloud to various private and public entities in Japan
• #7 on the TOP500 list
Largest Machine Learning Environment in Japan uses BeeOND
• 1,088 servers
• Two Intel Xeon Gold processors per server (2,176 CPUs total)
• Four NVIDIA Tesla V100 GPU computing cards per server (4,352 GPUs total)
• Intel SSD DC P4600 series (NVMe) as local storage, 1.6 TB per node (about 1.6 PB total)
• InfiniBand EDR
• Simple integration with Univa Grid Engine
Spookfish
• Aerial survey system based in Western Australia
• High-resolution images are provided to customers who need up-to-date information on terrain they plan to utilize
• Information can be fed into Geographic Information System (GIS) and CAD applications
Spookfish System Architecture
• Metadata server x6
  • Supermicro chassis with 4x Intel Xeon X7560 and 256 GB RAM
  • Only performs MDS services
• Metadata target x6 with buddy mirroring
• Converged storage server x40
  • Dell R730 with 2x Intel Xeon E5-2650 v4 CPUs and 128 GB RAM
  • Storage servers also perform processing for applications
  • Linux cgroups used to avoid out-of-memory events
  • cgroups not used for CPU usage; so far no issues with CPU shortage
• Storage target x160 with buddy mirroring
• 10 Gbit/s Ethernet
• Performance exceeded expectations, with 10 GB/s read and 5-6 GB/s write after tuning

"The result [of switching to BeeGFS] is that we’re now able to process about 3 times faster with BeeGFS than with our old NFS server. We’re seeing speeds of up to 10GB/s read and 5-6GB/s write." – Spookfish
CSIRO
• The Commonwealth Scientific and Industrial Research Organisation (CSIRO) has adopted the BeeGFS file system for their 2 PB all-NVMe storage in Australia, making it one of the largest NVMe storage systems in the world.
• Overview:
  • 4x metadata server
  • 32x Dell all-NVMe storage server, with 24x 3.2 TB NVMe drives per server
  • 2 PiB usable capacity
• Look forward to ISC to see what the beast can do!
• Further details: http://www.pacificteck.com/?p=437
Sales engagement
What do we need?
• Full address:
• Name:
• Email:
• Phone #:
• Business (university, research institute, life science, HPC, etc.):
• Quantity of Single Target (MDS) servers:
• Quantity of Multi Target (OSS) servers:
• RAID settings:
• Server type:
• Hardware platform (e.g. Intel Xeon, AMD, ARM):
• Capacity requirement:
• Performance requirement:
• Quantity of clients (rough number):
• BeeOND: up to 100, up to 500, or > 500 nodes
• Support duration (3 years, 5 years):
• Expected support start date:
• System nickname (to distinguish multiple systems):
• Interconnect (EDR, FDR, QDR, OPA, 40/10 GigE):
• Linux distribution (e.g. Red Hat):
BeeGFS terms for sizing
• The term “target” refers to a storage device exported through BeeGFS. Typically, a target is a RAID6 volume (for BeeGFS storage servers) or a RAID10 volume (for BeeGFS metadata servers) consisting of several disks, but it can optionally also be a single HDD or SSD.
• A “Single Target Server” exports exactly one target, either for storage or metadata.
• A “Multi Target Server” exports up to six targets for storage and optionally one target for metadata.
• An “Unlimited Target Server” exports an unlimited number of targets for storage and/or metadata.
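As a back-of-the-envelope example of how these server classes translate into usable capacity (the disk counts and sizes below are assumptions for illustration, not pricing figures):

```shell
# Sizing sketch: 4 Multi Target Servers, each exporting the maximum of six
# storage targets, each target a RAID6 (8+2) volume of 12 TB disks.
servers=4
targets_per_server=6
data_disks_per_target=8             # RAID6 8+2: 8 data disks per target
tb_per_disk=12
echo "usable: $(( servers * targets_per_server * data_disks_per_target * tb_per_disk )) TB"
```

A system needing more than six storage targets per server would fall into the "Unlimited Target Server" class instead.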
Pricing structure

1st Level Support (Partner)
• 1st level support is done by the reseller or a qualified partner (e.g. sub-contractor / sub-reseller) of the reseller. 1st level support staff has the knowledge level of general system administrators to perform the following tasks:
  • Definition of the problem, steps to reproduce, and expected behavior
  • Description of the customer hardware setup
  • Description of the customer software setup (e.g. operating system, software, firmware, and driver versions)
  • Gathering of other potentially relevant information such as log files
  • Attempts to solve problems based on previously known similar cases (e.g. recommendation of software updates or configuration changes)

2nd Level Support (Partner)
• 2nd level support is done by a Gold Reseller or a qualified partner (e.g. sub-contractor / sub-reseller) of the Gold Reseller. 2nd level support staff has knowledge of BeeGFS concepts and tools as well as of the general storage system stack (e.g. storage devices and network tools / testing) to perform the following tasks:
  • Problem and root cause analysis (e.g. based on log file analysis)
  • Hardware check (e.g. network, storage devices, cables) and software check, including attempts to reproduce issues on a different test system to determine whether problems are caused by a hardware malfunction at the customer site
  • Discussion of the issue and of potential solutions or workarounds with the customer
  • Definition of a minimal setup to reproduce problems before escalation to the next support level

3rd Level Support (ThinkParQ)
• 3rd level support is provided by ThinkParQ. BeeGFS support staff has detailed knowledge of BeeGFS internals. Incoming support tickets are prioritized based on severity.
  • Full problem and root cause analysis, optionally including remote login to the customer system via SSH
  • Code inspection for detailed internal analysis
  • Patch development with early update releases for supported customers
  • Recommendation of performance tuning methods and HPC consulting
  • Reaction time is next business day, German working hours
Installation / Training
• Installation and training can be done remotely.
• Remote SSH installation: 1,200 USD per day
• BeeGFS remote training (agenda & time): a 10-hour session; free of charge for partners, 1,200 USD for end customers
  • Introduction: BeeGFS basic concepts, architecture, and features
  • How do I … with BeeGFS: typical administrative tasks
  • Sizing and tuning
  • Designing and implementing storage solutions with BeeGFS
  • Projects with BeeGFS
  • Reference installations and best practices
• Sales & presales training: BeeGFS sales and presales training mapped to your customer focus, with file system comparison and use cases, to get your sales force up to speed.
Pricing Model
BeeGFS sizing for your information: see "BeeGFS terms for sizing" above.
BeeGFS Storage Engine under the Hood