Modernising HPC cluster provisioning
Jordi Blasco (HPCNow!)

Agenda
1 Motivation
2 Deployment Provisioning
3 Image Provisioning
  • Stateful
  • NFSROOT
  • Stateless
  • Semi State-less / State-lite
4 Improvements in diskless provisioning
  • BeeGFS-root
  • Lustre-root
5 Introducing OverlayFS root
  • SquashFS
  • Cluster File System Client
6 Conclusions
Motivation
Main Objectives
• Flexibility
• Resilience
• Scalability
• Online changes
• Fast provisioning
• DevOps & CI friendly
Deployment Based Provisioning
Standard Pipeline
• The system boots from PXE.
• An unattended installation runs from an OS-dependent template (Kickstart, AutoYaST or Preseed).
• The OS is installed on the local hard disks.
• The system reboots in order to boot from the local disk.
• (Optional) A configuration manager (Puppet, CFEngine, Ansible) completes the setup and ensures consistency.
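The first two steps of this pipeline can be sketched as a PXE menu entry that hands a Kickstart template to the installer. This is only an illustration: the function name, label, host name and URLs are hypothetical, and it assumes a pxelinux-style boot loader with the CentOS/RHEL 7 installer options `inst.repo` and `inst.ks`.

```shell
#!/bin/sh
# Sketch: print a pxelinux menu entry that starts an unattended Kickstart
# installation. Label, host name and URLs are hypothetical placeholders.
pxe_kickstart_entry() {
    label=$1      # menu label; also used as the TFTP sub-directory here
    repo_url=$2   # installation tree served over HTTP
    ks_url=$3     # Kickstart template served over HTTP
    cat <<EOF
LABEL ${label}
  KERNEL ${label}/vmlinuz
  APPEND initrd=${label}/initrd.img inst.repo=${repo_url} inst.ks=${ks_url} ip=dhcp
EOF
}

# Example: entry for a compute node template.
pxe_kickstart_entry centos7-install \
    http://deploy.example/centos/7/os/x86_64 \
    http://deploy.example/ks/compute.cfg
```

AutoYaST and Preseed installers take different boot options, so the APPEND line would need to change for those distributions.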
Pros
• Flexible management
• DevOps & CI friendly
Cons
• Local disks are not reliable
• Long time required to deploy a reasonably large cluster
• Not easy to apply changes online
• Limited scalability
• Risk of inconsistency if a configuration manager is not used
Image Based Provisioning
Local Disk (stateful)
Standard Pipeline
• The image is generated on a "golden" node.
• The image is uploaded to an NFS server.
• (Optional) A torrent file is generated to propagate the image faster.
• The system boots from PXE.
• A rigid and limited OS with the cloning software (e.g. SystemImager) is loaded.
• The OS image is cloned to the local disk.
• The system reboots from the local disk.
Pros
• Easy to manage
• Consistent configuration across the cluster
• Still room for configuration managers
Cons
• Local disks are required, and they are either unreliable or expensive
• Not easy to apply changes to the disk image
• Not easy to apply changes online
• Updating and testing the image takes a considerable amount of time
• Not suitable for CI
• Not DevOps friendly
NFSROOT based
Standard Pipeline
• The image is usually generated from a "golden" node or by bootstrap.
• The image is shared via NFS.
• The OS boots from PXE.
• The root file system is mounted from an NFS server.
File System Management Options
• Read-Write: each node needs its own OS file system
• Read-Only: potentially high memory footprint
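The two management options differ mainly in what each node mounts as its root. A minimal sketch of the kernel command line for both cases, assuming the in-kernel NFS root parameters (`root=/dev/nfs`, `nfsroot=`, `ip=dhcp`); initramfs-based boots may use a different syntax, and the server address and export paths are hypothetical:

```shell
#!/bin/sh
# Sketch: build the kernel command line for an NFS root.
# Server address and export paths are hypothetical placeholders.
nfsroot_cmdline() {
    server=$1       # NFS server address
    export_dir=$2   # exported root tree
    mode=$3         # "ro" (one shared tree) or "rw" (one tree per node)
    case $mode in
        ro|rw) ;;
        *) echo "mode must be ro or rw" >&2; return 1 ;;
    esac
    printf 'root=/dev/nfs nfsroot=%s:%s,%s ip=dhcp %s\n' \
        "$server" "$export_dir" "$mode" "$mode"
}

# Read-only: every node mounts the same shared image.
nfsroot_cmdline 10.0.0.1 /srv/nfsroot/centos7 ro
# Read-write: each node needs its own copy of the tree.
nfsroot_cmdline 10.0.0.1 /srv/nfsroot/nodes/node01 rw
```

The read-write case makes the per-node storage cost visible: the export path has to be unique for every node.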
Pros
• Easy to apply changes online
• Easy to manage
• Consistent configuration across the cluster
• Still room for configuration managers
Cons
• The NFS server becomes a SPOF
• Limited scalability
• Potentially large memory footprint and a hard-to-maintain rwtab file (read-only option)
• Complex data structure and ongoing communication (read-write option)
Stateless with ramdisk
Standard Pipeline
• The image is usually generated from a "golden" node.
• The image is uploaded to a TFTP/HTTP/FTP/NFS server.
• The OS boots from PXE.
• The root file system is loaded into the memory of the system.
• Mutable files will use additional memory.
Pros
• Resilient to TFTP/HTTP/FTP/NFS failures
• Resilient to local disk failures
• Easy to manage
• Consistent configuration across the cluster
Cons
• It usually generates a massive ramdisk (500 MB to 4 GB)
• Limited scalability with a single server
• Slow boot process if the image is distributed over 1 Gb Ethernet
• Large (sometimes huge) memory footprint
• ramdisk supports only 16 bits
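To put the boot-time concern into numbers, the sketch below estimates how long one server link needs to push the ramdisk image to a set of nodes. It is an idealised lower bound that assumes the server link is the only bottleneck and reads the Ethernet figure above as 1 Gbit/s; the node count of 100 is a hypothetical example.

```shell
#!/bin/sh
# Sketch: seconds needed to send a ramdisk image to N nodes through a
# single server link, ignoring protocol overhead (idealised estimate).
transfer_seconds() {
    image_mb=$1    # image size in MB
    nodes=$2       # nodes booting at the same time (hypothetical)
    link_mbit=$3   # server link speed in Mbit/s (1000 for 1 Gbit/s)
    # 1 MB = 8 Mbit; shell integer arithmetic rounds down.
    echo $(( image_mb * 8 * nodes / link_mbit ))
}

transfer_seconds 500 100 1000    # smallest image in the range above: 400
transfer_seconds 4096 100 1000   # largest image in the range above: 3276
```

With the largest image, 100 nodes would wait close to an hour for the transfer alone, which is why the distribution path matters as much as the image size.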
Semi State-less / State-lite with ramdisk
Standard Pipeline
• The image is usually generated from a "golden" node.
• The image is uploaded to a TFTP/HTTP/FTP/NFS server.
• The OS boots from PXE.
• The root file system is loaded into the RAM of the system.
• Part of the rootfs is located on an NFS server or cluster file system.
Pros
• Resilient to TFTP/HTTP/FTP/NFS failures
• Resilient to local disk failures
• Smaller memory footprint
• Allows applying (some) changes online in the portion of the file system located in the shared FS
Cons
• NFS is a SPOF; a cluster FS without HA is also a SPOF
• Limited scalability with a single TFTP/HTTP/FTP/NFS server
• ramdisk supports only 16 bits
• Complex maintenance and administration
• Not suitable for CI
• Not DevOps friendly
Improvements in diskless provisioning
NFS-root and state-lite solutions have limited scalability and a SPOF. By booting the OS with a native cluster file system client we achieved:
Pros
• Great scalability
• RDMA support
• Fully resilient solution
• Consistent configuration across the cluster
• Small memory footprint
Cons
• Requires maintaining /etc/rwtab
• Changes that require updating the rwtab file also require a reboot
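To illustrate the maintenance burden, the sketch below adds an entry to an rwtab-style file. It assumes the format used by the read-only root support in RHEL/CentOS (one `empty`, `dirs` or `files` keyword followed by a path per line); the helper name is hypothetical and the demo writes to a temporary file rather than the real /etc/rwtab.

```shell
#!/bin/sh
# Sketch: append an entry to an rwtab-style file unless it already exists.
# Assumed entry kinds: empty, dirs, files (keyword, tab, path).
rwtab_add() {
    file=$1     # rwtab file to edit
    kind=$2     # empty | dirs | files
    target=$3   # path that must stay writable on the read-only root
    case $kind in
        empty|dirs|files) ;;
        *) echo "unknown rwtab kind: $kind" >&2; return 1 ;;
    esac
    entry=$(printf '%s\t%s' "$kind" "$target")
    # Only append when the exact line is not present yet.
    grep -qxF -- "$entry" "$file" 2>/dev/null || printf '%s\n' "$entry" >> "$file"
}

demo=$(mktemp)
rwtab_add "$demo" files /etc/resolv.conf
rwtab_add "$demo" files /etc/resolv.conf   # second call is a no-op
rwtab_add "$demo" empty /tmp
cat "$demo"
rm -f "$demo"
```

Every newly discovered writable path needs such an entry in the image, and, as noted above, a reboot before it takes effect.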
Read-only BeeGFS root file system
DEMO
Read-only Lustre root file system
Comparison
Provisioning Mechanism    Memory (MB)    Data Transf. (MB/node)
Deployment                285            400
read-only NFS root        300            42
read-only BeeGFS root     320            44
read-only Lustre root     330            44
Table: Statistics based on sNow! CentOS 7.3 minimal template (611 packages, 1638MB of file system).
Introducing OverlayFS root
About OverlayFS
• Developed by Miklos Szeredi
• Implements a union mount for other file systems
• Merged into the Linux kernel mainline in 2014, in kernel version 3.18
• Supports whiteouts and opaque directories in the upper file system to allow file and directory deletion
• Improved in version 4.0 (fixed issues related to inode utilization and memory footprint)
Architecture of sNow! OverlayFS root
Overlay mount options:
    mount -t overlay overlay \
        -o lowerdir=/lower,upperdir=/upper,workdir=/work /merged

The lower directory can actually be a list of directories separated by ":". All changes in the merged directory are still reflected in upper.
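The multi-directory form of lowerdir can be sketched with a small helper that only prints the mount command, so it can be inspected without root privileges. The helper name and the directory names are hypothetical.

```shell
#!/bin/sh
# Sketch: print (not run) an overlay mount command with one or more
# lower directories joined by ":".
overlay_mount_cmd() {
    upper=$1    # writable layer
    work=$2     # work directory, on the same file system as the upper layer
    merged=$3   # mount point for the merged view
    shift 3
    lower=
    for dir in "$@"; do
        lower=${lower:+$lower:}$dir
    done
    printf 'mount -t overlay overlay -o lowerdir=%s,upperdir=%s,workdir=%s %s\n' \
        "$lower" "$upper" "$work" "$merged"
}

overlay_mount_cmd /upper /work /merged /lower1 /lower2
```

When several lower directories are listed, the leftmost one is the top-most read-only layer and the rightmost one is the bottom.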
✓ The OS can be loaded from a SquashFS image or through a native cluster file system client
✓ To keep the image read-only, mutable content goes to TMPFS (e.g. /var/run, /etc)
✓ Painless administration
✓ No longer required to maintain /etc/rwtab
✓ No longer required to reboot the OS to apply certain changes
sNow! stateless based on SquashFS
Standard Pipeline
• The image is generated from an already deployed node.
• The OS boots from PXE.
• The image can be fetched into memory using different protocols, including cluster file systems.
• The image doesn't need to be fetched if a cluster file system is used (smaller memory footprint).
• Mutable files will use additional memory.
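One way to picture how the read-only SquashFS image and the in-memory mutable layer are combined is the mount sequence below. It is a sketch under stated assumptions, not the actual sNow! implementation: the staging paths under /run/rootfs, the image path and the /sysroot target are hypothetical, and the helper only prints the commands that an initramfs script would run.

```shell
#!/bin/sh
# Sketch: print the mount plan for a root file system made of a read-only
# SquashFS lower layer and a tmpfs upper layer. Paths are hypothetical.
overlay_root_plan() {
    image=$1     # SquashFS image, already fetched or on a cluster FS
    newroot=$2   # where the merged root is assembled
    cat <<EOF
mkdir -p /run/rootfs/lower /run/rootfs/rw ${newroot}
mount -t squashfs -o loop,ro ${image} /run/rootfs/lower
mount -t tmpfs tmpfs /run/rootfs/rw
mkdir -p /run/rootfs/rw/upper /run/rootfs/rw/work
mount -t overlay overlay -o lowerdir=/run/rootfs/lower,upperdir=/run/rootfs/rw/upper,workdir=/run/rootfs/rw/work ${newroot}
EOF
}

overlay_root_plan /images/centos7.squashfs /sysroot
```

Because the upper and work directories live in tmpfs, every modified file consumes RAM, which matches the last pipeline step above.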
sNow! diskless based on Cluster File System Client
Standard Pipeline
• The image is generated from an already deployed node.
• The image is uploaded to a resilient cluster file system.
• The system boots via PXE with an image that includes the cluster file system client.
• The system pivots to a read-only rootfs image located in the cluster file system.
• Mutable files will use additional memory.
• Still under development.
Comparison
Provisioning Mechanism          Memory (MB)    Data (MB/node)
Deployment                      285            400
read-only NFS root              300            42
read-only BeeGFS root           320            44
read-only Lustre root           330            44
ramdisk                         1940           710
OverlayFS: SquashFS fetch       750            535
OverlayFS: SquashFS no fetch    350            58
OverlayFS: read-only BeeGFS     -              -
OverlayFS: read-only Lustre     -              -
Table: Statistics based on sNow! CentOS 7.3 minimal template (611 packages, 1638MB of file system).
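The memory figures in the table can be turned into relative savings with a few lines of integer arithmetic. The helper name is hypothetical; the inputs are the table values for ramdisk (1940 MB) and the two SquashFS variants.

```shell
#!/bin/sh
# Sketch: percentage reduction of a value relative to a baseline,
# using shell integer arithmetic (result is rounded down).
reduction_pct() {
    baseline=$1   # reference value, e.g. ramdisk memory in MB
    value=$2      # value being compared
    echo $(( (baseline - value) * 100 / baseline ))
}

reduction_pct 1940 750   # OverlayFS, SquashFS fetched into memory: 61
reduction_pct 1940 350   # OverlayFS, SquashFS not fetched: 81
```

Relative to the ramdisk approach, memory use drops by roughly 61% when the image is fetched and by roughly 81% when it is read directly from a cluster file system.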
Conclusions
Improvements in diskless provisioning
✓ Simplified administration
✓ Enabled resilience with a small memory footprint cost (≈ 35 MB)
✓ Improved scalability
Improvements achieved with OverlayFS
✓ Significant memory footprint reduction compared to the ramdisk solution
✓ Enabled scalability by including a native cluster file system client
✓ Further reduced the memory footprint when not fetching the image
✓ Opens new opportunities to expose different file systems on demand
Questions
Get ready for the coming challenges, get involved with the HPC community!
[email protected]
www.hpcnow.com
Almogàvers, 165 - 08018 Barcelona (Spain)
34 Fernly Rise, 1019 Auckland (New Zealand)