Computing and Software for Big Science (2021) 5:9
https://doi.org/10.1007/s41781-020-00050-y

ORIGINAL ARTICLE

Operating an HPC/HTC Cluster with Fully Containerized Jobs Using HTCondor, Singularity, CephFS and CVMFS

Oliver Freyermuth · Peter Wienemann (corresponding author) · Philip Bechtle · Klaus Desch
Physikalisches Institut, Universität Bonn, Nußallee 12, 53115 Bonn, Germany

Received: 28 May 2020 / Accepted: 16 December 2020
© The Author(s) 2021

Abstract
High performance and high throughput computing (HPC/HTC) is challenged by ever increasing demands on the software stacks and more and more diverging requirements by different research communities. This led to a reassessment of the operational concept of HPC/HTC clusters at the Physikalisches Institut at the University of Bonn. As a result, the present HPC/HTC cluster (named BAF2) introduced various conceptual changes compared to conventional clusters. All jobs are now run in containers and a container-aware resource management system is used, which allowed us to switch to a model without login/head nodes. Furthermore, a modern, feature-rich storage system with powerful interfaces has been deployed. We describe the design considerations, the implemented functionality and the operational experience gained with this new-generation setup, which turned out to be very successful and well-accepted by its users.

Keywords Scientific computing · Containers · Distributed file systems · Batch processing · Host management

Introduction

High performance and high throughput computing (HPC/HTC) is an integral part of scientific progress in many areas of science. Researchers require more and more computing and storage resources allowing them to solve ever more complex problems. But just scaling up existing resources is not sufficient to handle the ever increasing amount of data. New tools and technologies keep showing up and users are asking for them. As a result of these continuously emerging new tools, software stacks on which jobs rely become increasingly complex, data management is done with more and more sophisticated tools, and the demands of users on HPC/HTC clusters evolve with breathtaking speed [1]. Therefore, the diversity and complexity of services run by HPC/HTC cluster operators has increased significantly over time. To cope with the increased demands, an ongoing trend to consolidate different communities and fulfil their requirements with larger, commonly operated systems is observed [2].

This work describes the commissioning and first operational experience gained with an HPC/HTC cluster at the Physikalisches Institut at the University of Bonn. We call this cluster the second-generation Bonn Analysis Facility (BAF2) in the following. Occasionally we compare the setup of this cluster with that of its predecessor (BAF1). Both BAF1 and BAF2 were purchased to perform fundamental research in physics fields ranging from high energy physics (HEP), hadron physics, theoretical particle physics and theoretical condensed matter physics to mathematical physics. BAF1 was a rather conventional cluster whose commissioning started in 2009. It used a TORQUE/Maui-based resource management system [3] whose jobs were run directly on its worker nodes, a Lustre distributed file system [4] without any redundancy (except for RAID 5 disk arrays) for data storage, and an OpenAFS file system [5] to distribute software, which was later supplemented by CVMFS [6] clients (see "CVMFS" for more information on CVMFS). The latter provides software maintained by CERN and HEP collaborations. BAF1 maintenance was characterized by a large collection of home-brewed shell scripts and other makeshift solutions.

In contrast, BAF2 is adapted to the increasingly varying demands of the different communities. These requirements, and which solutions we chose to tackle them, will be discussed in "BAF2 Requirements", followed by a short overview of the new concepts of this cluster in "Cluster Concept". After this introduction, we will present the key components of the cluster in "CVMFS", "Containerization", "CephFS", "XRootD", "Cluster Management" and "HTCondor" in depth. The paper will conclude with a presentation of experiences and observations collected in the first two years of operation in "Operational Experience". On purpose, benchmarks of the system are not presented. While we performed some benchmarking of the components before putting the system into operation, we believe that our results cannot be easily transferred one-to-one to setups using different hardware or operating at different scales. Another aspect which comes into play is that even refined synthetic benchmarks are quite different from the constantly evolving load submitted by users. A realistic simulation of the load caused by a diverse mix of jobs is very difficult, and even after two years of operation, new use cases and workloads appear on a regular basis.

For this reason, we consider the presentation of experiences with the operational system in its entirety under realistic production workloads to be more useful, and we will present the observed effects in "Operational Experience" in detail before concluding in "Conclusion".

BAF2 Requirements

The requirements on BAF2 are as broad as the range of research fields it serves. The most challenging constraints are imposed by running analysis jobs of the ATLAS high energy physics experiment [7]. This experiment uses a huge software stack which is provided and maintained centrally by a dedicated team on a specific platform (at the time of writing, the migration from Scientific Linux 6 to CentOS 7 is still not fully completed). Given the rapid development of software in a collaboration of O(3000) members, it would be unfeasible for each collaborating institute to maintain and validate its own software installation. Therefore, the software is distributed to all participating institutes via the CVMFS file system [6, 8–10]. We will describe this file system, for which we now also operate our own server infrastructure, in more detail in "CVMFS". The collaboratively maintained software also puts constraints on the platform of computing resources used for ATLAS purposes. At present, the software framework only runs on x86_64 Scientific Linux 6 [11] and CentOS 7 [12] machines (with Scientific Linux 6 slowly dying out). Given the age of Scientific Linux 6 and CentOS 7, there is an increasing tension between ATLAS requirements and other applications which require a more up-to-date software stack. The same is true for support of modern hardware components. Luckily, modern virtualization technology provides an attractive solution to this constraint. We set up the cluster in such a way that each user can choose the operating system (OS) in which her/his jobs should run using modern container technology [13], and the actual bare-metal operating system is never exposed to the user. The container setup is described in "Containerization". Thanks to the container awareness of the HTCondor resource management system [14–21], the implementation of such a setup is straightforward. More details on how we use HTCondor are given in "HTCondor".
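To make the per-job operating system choice concrete: from a submitter's point of view it amounts to a single additional attribute in the job description. The following sketch uses the HTCondor Python bindings; the attribute name ContainerOS, its value and all file names are assumptions for illustration only, not the actual BAF2 interface.

import htcondor  # HTCondor Python bindings

# Minimal job description. "+ContainerOS" stands in for a site-defined
# ClassAd attribute that tells the worker node which container image
# (and therefore which operating system) to start for this job; the
# attribute name and value are hypothetical.
job = htcondor.Submit({
    "executable": "run_analysis.sh",
    "arguments": "analysis.cfg",
    "output": "job.out",
    "error": "job.err",
    "log": "job.log",
    "request_cpus": "1",
    "request_memory": "2GB",
    "+ContainerOS": '"CentOS7"',
})

schedd = htcondor.Schedd()              # scheduler the desktop submits to
result = schedd.submit(job, count=1)    # requires reasonably recent bindings
print("Submitted job cluster", result.cluster())

An equivalent plain submit file would carry the same key-value pairs; the point is only that the OS choice is one attribute among the usual resource requests.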
Another particular requirement for ATLAS data analysis is providing interfaces for the distributed data management (DDM) tools deployed in ATLAS [22, 23]. Most importantly, we are running an XRootD service [24]. This allows automatic transfers and subscriptions of datasets with high throughput to the BAF2 cluster. More information on XRootD is given in "XRootD". The datasets shipped to the BAF2 cluster are finally analyzed by reading the corresponding data files from a POSIX file system. As POSIX file system, we have chosen CephFS [25] for the reasons explained in "CephFS".
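The resulting data flow can be sketched as follows. In production the transfers are triggered by the ATLAS DDM tools through the XRootD service; the snippet below only mimics that flow by hand, and the host name and paths are placeholders rather than the actual BAF2 configuration.

import pathlib
import subprocess

# A remote file served via an XRootD endpoint (placeholder URL) ...
source = "root://xrootd.example.uni-bonn.de//atlas/datasets/sample/ntuple.root"
# ... is copied onto the cluster file system (placeholder CephFS path).
target = pathlib.Path("/cephfs/atlas/datasets/sample/ntuple.root")
target.parent.mkdir(parents=True, exist_ok=True)

# xrdcp is the standard copy client shipped with the XRootD suite.
subprocess.run(["xrdcp", source, str(target)], check=True)

# Jobs on the worker nodes then read the data with ordinary POSIX I/O.
with target.open("rb") as data_file:
    magic = data_file.read(4)  # ROOT files start with the bytes b"root"
print("first bytes:", magic)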
All mentioned resources are deployed and orchestrated using Foreman [26] and Puppet [27]. We describe this setup in "Cluster Management".

As a general objective, free and open source software (FOSS) tools are used for cluster management and operation wherever possible. The reasons are not only financial ones but also our appreciation of the possibility to easily debug, patch and extend the available software. When patching software, we always try to feed the modifications back into the respective developer community to avoid accumulating large patch sets over time, which could diverge from the upstream project and make future upgrades more difficult.

Cluster Concept

While the previously operated BAF1 cluster was conventionally set up using login nodes and without redundancy of the cluster services, a different approach was chosen for BAF2 due to the increasing diversity of requirements.

The new workflow from the users' perspective is that jobs are submitted directly from their desktop machines. On these very machines, not only is access to kerberized home directories, maintained and backed up by the central university computing centre and mounted via NFS v4.2 [28], provided, but the cluster file system CephFS can also be accessed directly (via NFS v4.2). The users' jobs can then either access CephFS with high bandwidth or use HTCondor's file transfer mechanism to store smaller input and output files in the home directories.

For each job, users can choose the operating system which should be used, and the cluster workload manager HTCondor takes care to instantiate the corresponding container […] to the users, since an interactive cluster job emulates an ssh connection to a worker node, which HTCondor realizes via the Connection Broker service as explained in "HTCondor". Using this approach, full flexibility is exposed to the users. The scheduling is based on a fair-share algorithm, so any user can submit as many jobs as needed and may at times use the full cluster resources.
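As a sketch of the data handling described above (file and path names are again placeholders): small configuration and result files can ride along via HTCondor's file transfer mechanism and end up back in the user's home directory, while the job reads its large input directly from the CephFS mount with high bandwidth.

import htcondor

# Small files are transferred by HTCondor; the large input is opened by the
# job straight from CephFS. All names and paths below are placeholders.
job = htcondor.Submit({
    "executable": "fit.py",
    "arguments": "--config fit_config.yaml "
                 "--data /cephfs/atlas/datasets/sample/ntuple.root",
    "transfer_input_files": "fit_config.yaml",    # small input via HTCondor
    "transfer_output_files": "fit_result.json",   # small output via HTCondor
    "should_transfer_files": "YES",
    "when_to_transfer_output": "ON_EXIT",
    "output": "fit.out",
    "error": "fit.err",
    "log": "fit.log",
})

htcondor.Schedd().submit(job)

Interactive work fits the same model: an interactive job (e.g. submitted with condor_submit -interactive) is scheduled like any other job and drops the user into a shell inside the chosen container on a worker node.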