Modernising HPC cluster provisioning

Jordi Blasco (HPCNow!)

Agenda

1 Motivation

2 Deployment Provisioning

3 Image Provisioning
• Stateful
• NFSROOT
• Stateless
• Semi State-less / State-lite

4 Improvements in diskless provisioning
• BeeGFS-root
• Lustre-root

5 Introducing OverlayFS root
• SquashFS
• Cluster Client

6 Conclusions

Motivation

Main Objectives

• Flexibility
• Resilience
• Scalability
• Online changes
• Fast provisioning
• DevOps & CI friendly

Deployment Based Provisioning

Standard Pipeline

• The system boots from PXE
• Unattended installation based on an OS-dependent template (Kickstart | AutoYaST | Preseed)
• The OS is installed on local hard disks
• The system reboots in order to boot from the local disk
• (optional) A configuration manager (Puppet, CFEngine, Ansible) completes the setup and ensures consistency
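The first two steps of this pipeline can be sketched as a PXE menu entry that hands a Kickstart file to the installer. This is only an illustration, assuming a pxelinux boot loader and a CentOS/RHEL 7 installer (`inst.ks=` is its boot option); the function name and the URL in the usage note are hypothetical.

```shell
#!/bin/sh
# Print a pxelinux menu entry that starts an unattended Kickstart install.
# $1 = URL of the Kickstart file (hypothetical; any reachable HTTP server).
pxe_kickstart_entry() {
    ks_url=$1
    cat <<EOF
DEFAULT install
LABEL install
  KERNEL vmlinuz
  APPEND initrd=initrd.img inst.ks=${ks_url} ip=dhcp
EOF
}
```

For example, `pxe_kickstart_entry http://deploy.example/ks.cfg > pxelinux.cfg/default` would write the entry into the TFTP tree; AutoYaST and Preseed use different boot options.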

Pros

• Flexible management
• DevOps & CI friendly

Cons

• Local disks are not reliable
• Long time required to deploy a reasonably large cluster
• Not easy to apply changes online
• Limited scalability
• Risk of inconsistency if a configuration manager is not used

Image Based Provisioning

Local Disk (stateful)

Standard Pipeline

• The image is generated on a "golden" node
• The image is uploaded to an NFS server
• (optional) A torrent file is generated to propagate the image faster
• The system boots from PXE
• A rigid and limited OS with the cloning software (e.g. SystemImager) is loaded
• The OS image is cloned onto the local disk
• The system reboots from the local disk

Pros

• Easy to manage
• Consistent configuration across the cluster
• Still room for configuration managers

Cons

• Local disk required, but local disks are either unreliable or expensive
• Not easy to apply changes to the disk image
• Not easy to apply changes online
• Updating and testing the image takes a significant amount of time
• Not suitable for CI
• Not DevOps friendly

NFSROOT based

Standard Pipeline

• The image is usually generated from a "golden" node or by bootstrap
• The image is shared via NFS
• The OS boots from PXE
• The root file system is mounted from an NFS server

File System Management Options

• Read-Write: Each node needs its own OS file system
• Read-Only: Potentially high memory footprint
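The two options differ mainly in what each node mounts as its root. A minimal sketch, assuming the kernel's built-in NFS root parameters (`root=/dev/nfs nfsroot=server:path,options`); the function name, the per-node directory layout and the addresses in the usage note are hypothetical, and dracut-based initramfs images use a different `root=nfs:...` syntax.

```shell
#!/bin/sh
# Build the kernel command line for an NFS root.
# $1 = ro|rw, $2 = NFS server, $3 = exported directory, $4 = node name (rw only).
nfsroot_cmdline() {
    mode=$1 server=$2 export_dir=$3 node=$4
    case $mode in
        # Read-only: every node shares the same exported image.
        ro) printf 'root=/dev/nfs nfsroot=%s:%s,ro ip=dhcp ro\n' \
                "$server" "$export_dir" ;;
        # Read-write: each node gets its own copy under the export.
        rw) printf 'root=/dev/nfs nfsroot=%s:%s/%s,rw ip=dhcp rw\n' \
                "$server" "$export_dir" "$node" ;;
        *)  echo "usage: nfsroot_cmdline ro|rw server export [node]" >&2
            return 1 ;;
    esac
}
```

`nfsroot_cmdline ro 10.0.0.1 /srv/images/centos7` yields one shared root for all nodes, while `nfsroot_cmdline rw 10.0.0.1 /srv/nodes n001` points node `n001` at its private tree.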

Pros

• Easy to apply changes online
• Easy to manage
• Consistent configuration across the cluster
• Still room for configuration managers

Cons

• The NFS server becomes a single point of failure (SPOF)
• Limited scalability
• Potentially large memory footprint and a hard-to-maintain rwtab file (read-only option)
• Complex data structure and ongoing communication (read-write option)

Stateless with ramdisk

Standard Pipeline

• The image is usually generated from a "golden" node
• The image is uploaded to a TFTP/HTTP/FTP/NFS server
• The OS boots from PXE
• The root file system is loaded into the memory of the system
• Mutable files will use additional memory

Pros

• Resilient to TFTP/HTTP/FTP/NFS failures
• Resilient to local disk failures
• Easy to manage
• Consistent configuration across the cluster

Cons

• Usually generates a massive ramdisk (500 MB to 4 GB)
• Limited scalability with a single server
• Long boot process if the image is distributed over 1 Gb Ethernet
• Large (sometimes huge) memory footprint
• ramdisk supports only 16 bits

Semi State-less / State-lite with ramdisk

Standard Pipeline

• The image is usually generated from a "golden" node
• The image is uploaded to a TFTP/HTTP/FTP/NFS server
• The OS boots from PXE
• The root file system is loaded into the RAM of the system
• Part of the rootfs is located on an NFS server or a cluster file system

Pros

• Resilient to TFTP/HTTP/FTP/NFS failures
• Resilient to local disk failures
• Smaller memory footprint
• Allows applying (some) changes online in the portion of the file system located on the shared FS

Cons

• NFS is a SPOF. A cluster FS without HA is a SPOF
• Limited scalability with a single TFTP/HTTP/FTP/NFS server
• ramdisk supports only 16 bits
• Complex maintenance and administration
• Not suitable for CI
• Not DevOps friendly

Improvements in diskless provisioning

NFS-root and state-lite solutions suffer from limited scalability and a single point of failure (SPOF). By booting the OS with a native cluster file system client we achieved:

Pros

• Great scalability
• RDMA support
• Fully resilient solution
• Consistent configuration across the cluster
• Small memory footprint

Cons

• Requires maintaining /etc/rwtab
• Changes which require updating the rwtab file also require a reboot
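To give an idea of what maintaining rwtab involves, the sketch below prints a small fragment in the format used by the CentOS/RHEL read-only root support, where each line is one of `empty`, `dirs` or `files` followed by a path. The three paths are illustrative picks, not the sNow! configuration, and the function name is hypothetical.

```shell
#!/bin/sh
# Print an example rwtab fragment. Each entry tells the read-only root
# scripts how to make one path writable in tmpfs:
#   empty <path>  mount an empty writable directory at the path
#   dirs  <path>  recreate the directory tree, without its files
#   files <path>  copy the file or tree intact
rwtab_fragment() {
    cat <<'EOF'
empty /tmp
dirs /var/cache/man
files /etc/adjtime
EOF
}
```

Every path a service needs to write must appear in such a list before boot, which is why a missed entry means editing the file and rebooting the node.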

Read-only BeeGFS root file system

DEMO

Read-only Lustre root file system

Comparison

Provisioning Mechanism    Memory (MB)    Data Transf. (MB/node)
Deployment                285            400
read-only NFS root        300            42
read-only BeeGFS root     320            44
read-only Lustre root     330            44

Table: Statistics based on sNow! CentOS 7.3 minimal template (611 packages, 1638MB of file system).
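The per-node transfer figures matter most once they are multiplied by the cluster size. A small worked calculation, assuming a hypothetical 1000-node cluster (the node count is not from the measurements) and a hypothetical helper name:

```shell
#!/bin/sh
# Aggregate data served when every node boots.
# $1 = MB transferred per node (from the table), $2 = number of nodes.
total_transfer_mb() {
    echo $(( $1 * $2 ))
}
```

With the table values, `total_transfer_mb 400 1000` gives 400000 MB for a full deployment, against 44000 MB from `total_transfer_mb 44 1000` for a read-only BeeGFS or Lustre root.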

Introducing OverlayFS root

About OverlayFS

• Developed by Miklos Szeredi
• Implements a union mount for other file systems
• Merged into the kernel mainline in 2014, in kernel version 3.18
• Supports whiteouts and opaque directories in the upper file system to allow file and directory deletion
• Improved in version 4.0 (fixed issues related to inode utilization and memory footprint)

Architecture of sNow! OverlayFS root

Overlay mount options:

    mount -t overlay overlay \
        -o lowerdir=/lower,upperdir=/upper,workdir=/work /merged

The lower directory can actually be a list of directories separated by ":". All changes in the merged directory are still reflected in upper.
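The colon-separated list of lower directories can be assembled with a small helper. This is a sketch with a hypothetical function name and hypothetical paths; note that OverlayFS stacks the lower directories from right to left, so the leftmost one ends up on top, and that workdir must be an empty directory on the same file system as upperdir.

```shell
#!/bin/sh
# Build the -o option string for an overlay mount.
# $1 = upperdir, $2 = workdir, remaining arguments = lower directories,
# listed from topmost to bottommost layer.
overlay_opts() {
    upper=$1 work=$2
    shift 2
    lower=""
    for dir in "$@"; do
        # Append with ":" only when the list is not empty yet.
        lower="${lower:+$lower:}$dir"
    done
    printf 'lowerdir=%s,upperdir=%s,workdir=%s\n' "$lower" "$upper" "$work"
}
```

As root, `mount -t overlay overlay -o "$(overlay_opts /upper /work /site /base)" /merged` would then present `/site` over `/base`, with all writes landing in `/upper`.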

✓ The OS can be loaded from a SquashFS image or through a native cluster file system client
✓ To keep the image read-only, mutable content (e.g. /var/run, /etc) goes to the writable upper layer
✓ Painless administration
✓ No need to maintain /etc/rwtab any longer
✓ No need to reboot the OS to apply certain changes

sNow! stateless based on SquashFS

Standard Pipeline

• The image is generated from an already deployed node
• The OS boots from PXE
• The image can be fetched into memory using different protocols, including cluster file systems
• The image doesn't need to be fetched if a cluster file system is used (smaller memory footprint)
• Mutable files will use additional memory
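The node-side part of this pipeline could look roughly like the dry-run sketch below, which only prints the mounts an initramfs might run. It is not the sNow! implementation: the function name and every path are hypothetical, a tmpfs upper layer is assumed (consistent with mutable files using memory), and `/sysroot` follows the dracut convention for the future root.

```shell
#!/bin/sh
# Dry run: print the commands that would assemble an OverlayFS root from a
# read-only SquashFS image (lower) and a tmpfs-backed writable layer (upper).
# $1 = SquashFS image, either fetched into RAM or read from a cluster FS.
squashfs_overlay_plan() {
    image=$1
    base=/run/rootfs   # hypothetical scratch location inside the initramfs
    cat <<EOF
mkdir -p ${base}/lower ${base}/rw
mount -t squashfs -o loop,ro ${image} ${base}/lower
mount -t tmpfs tmpfs ${base}/rw
mkdir -p ${base}/rw/upper ${base}/rw/work
mount -t overlay overlay -o lowerdir=${base}/lower,upperdir=${base}/rw/upper,workdir=${base}/rw/work /sysroot
EOF
}
```

The image itself could be produced on the deployed node with something like `mksquashfs /path/to/rootfs centos7.squashfs`; upper and work share one tmpfs because OverlayFS requires them on the same file system.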

sNow! diskless based on Cluster File System Client

Standard Pipeline

• The image is generated from an already deployed node
• The image is uploaded to a resilient cluster file system
• The system boots via PXE with an image that includes the cluster file system client
• The system pivots to a read-only rootfs image located in the cluster file system
• Mutable files will use additional memory
• Still under development

Comparison

Provisioning Mechanism            Memory (MB)    Data (MB/node)
Deployment                        285            400
read-only NFS root                300            42
read-only BeeGFS root             320            44
read-only Lustre root             330            44
ramdisk                           1940           710
OverlayFS: SquashFS fetch         750            535
OverlayFS: SquashFS no fetch      350            58
OverlayFS: read-only BeeGFS       -              -
OverlayFS: read-only Lustre       -              -

Table: Statistics based on sNow! CentOS 7.3 minimal template (611 packages, 1638MB of file system).
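The gap between the ramdisk and OverlayFS rows is easier to judge as a percentage. A short worked calculation on the table values, using integer arithmetic (results are truncated) and a hypothetical helper name:

```shell
#!/bin/sh
# Percentage reduction when going from $1 MB down to $2 MB,
# truncated to an integer.
reduction_pct() {
    echo $(( ($1 - $2) * 100 / $1 ))
}
```

Relative to the ramdisk's 1940 MB, `reduction_pct 1940 750` gives 61 for the fetched SquashFS image and `reduction_pct 1940 350` gives 81 when the image is not fetched; for data transferred per node, `reduction_pct 710 58` gives 91.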

Conclusions

Improvements in diskless provisioning

✓ Simplified administration
✓ Enabled resilience at a small memory footprint cost (≈ 35 MB)
✓ Improved scalability

Improvements achieved with OverlayFS

✓ Significant memory footprint reduction compared to the ramdisk solution
✓ Enabled scalability by including a native cluster file system client
✓ Further reduced the memory footprint when not fetching the image
✓ Opens new opportunities to expose different file systems on demand

Questions

Get ready for the coming challenges, get involved with the HPC community!

www.hpcnow.com
Almogàvers, 165 - 08018 Barcelona (Spain)
34 Fernly Rise, 1019 Auckland (New Zealand)