Operating an HPC/HTC Cluster with Fully Containerized Jobs Using HTCondor, Singularity, CephFS and CVMFS

Computing and Software for Big Science (2021) 5:9
https://doi.org/10.1007/s41781-020-00050-y

ORIGINAL ARTICLE

Operating an HPC/HTC Cluster with Fully Containerized Jobs Using HTCondor, Singularity, CephFS and CVMFS

Oliver Freyermuth · Peter Wienemann · Philip Bechtle · Klaus Desch
Physikalisches Institut, Universität Bonn, Nußallee 12, 53115 Bonn, Germany

Received: 28 May 2020 / Accepted: 16 December 2020
© The Author(s) 2021

Abstract

High performance and high throughput computing (HPC/HTC) is challenged by ever increasing demands on the software stacks and more and more diverging requirements by different research communities. This led to a reassessment of the operational concept of HPC/HTC clusters at the Physikalisches Institut at the University of Bonn. As a result, the present HPC/HTC cluster (named BAF2) introduced various conceptual changes compared to conventional clusters. All jobs are now run in containers and a container-aware resource management system is used, which allowed us to switch to a model without login/head nodes. Furthermore, a modern, feature-rich storage system with powerful interfaces has been deployed. We describe the design considerations, the implemented functionality and the operational experience gained with this new-generation setup, which turned out to be very successful and well-accepted by its users.

Keywords: Scientific computing · Containers · Distributed file systems · Batch processing · Host management

Introduction

High performance and high throughput computing (HPC/HTC) is an integral part of scientific progress in many areas of science. Researchers require more and more computing and storage resources allowing them to solve ever more complex problems. But just scaling up existing resources is not sufficient to handle the ever increasing amount of data. New tools and technologies keep showing up and users are asking for them. As a result of these continuously emerging new tools, software stacks on which jobs rely become increasingly complex, data management is done with more and more sophisticated tools and the demands of users on HPC/HTC clusters evolve with breathtaking speed [1]. Therefore, the diversity and complexity of services run by HPC/HTC cluster operators has increased significantly over time. To cope with the increased demands, an ongoing trend to consolidate different communities and fulfil their requirements with larger, commonly operated systems is observed [2].

This work describes the commissioning and first operational experience gained with an HPC/HTC cluster at the Physikalisches Institut at the University of Bonn. We call this cluster second generation Bonn Analysis Facility (BAF2) in the following. Occasionally we compare the setup of this cluster with the one of its predecessor (BAF1). Both BAF1 and BAF2 were purchased to perform fundamental research work in all kinds of physics fields ranging from high energy physics (HEP), hadron physics, theoretical particle physics, theoretical condensed matter physics to mathematical physics. BAF1 was a rather conventional cluster whose commissioning started in 2009.
It used a TORQUE/Maui-based resource management system [3] whose jobs were run directly on its worker nodes, a Lustre distributed file system [4] without any redundancy (except for RAID 5 disk arrays) for data storage and an OpenAFS file system [5] to distribute software which was later supplemented by CVMFS [6] clients (see "CVMFS" for more information on CVMFS). The latter provides software maintained by CERN and HEP collaborations. BAF1 maintenance was characterized by a large collection of home-brewed shell scripts and other makeshift solutions.

In contrast, BAF2 is adapted to the increasingly varying demands of the different communities. These requirements and which solutions we chose to tackle them will be discussed in "BAF2 Requirements" followed by a short overview of the new concepts of this cluster in "Cluster Concept". After this introduction, we will present the key components of the cluster in "CVMFS", "Containerization", "CephFS", "XRootD", "Cluster Management" and "HTCondor" in-depth. The paper will conclude with a presentation of experiences and observations collected in the first two years of operation in "Operational Experience". On purpose, benchmarks of the system are not presented. While we performed some benchmarking of the components before putting the system into operation, we believe that our results cannot be easily transferred one-to-one to setups using different hardware or which operate at different scales. Another aspect which comes into play is that even refined synthetic benchmarks are quite different from the constantly evolving load submitted by users. A realistic simulation of the load caused by a diverse mix of jobs is very difficult, and even after the two years of operations, new use cases and workloads appear on a regular basis.

For this reason, we consider the presentation of experiences with the operational system in its entirety under realistic production workloads of users to be more useful and will present the observed effects in "Operational Experience" in detail before concluding in "Conclusion".

BAF2 Requirements

The requirements on BAF2 are as broad as the range of research fields it serves. The most challenging constraints are imposed by running analysis jobs of the ATLAS high energy physics experiment [7]. This experiment uses a huge software stack which is provided and maintained centrally by a dedicated team on a specific platform (at the time of writing the migration from Scientific Linux 6 to CentOS 7 is still not fully completed). Given the rapid development of software of a collaboration of O(3000) members, it would be unfeasible for each collaborating institute to maintain and validate its own software installation. Therefore, the software is distributed to all participating institutes via the CVMFS file system [6, 8–10]. We will describe this file system, for which we now also operate our own server infrastructure, in more detail in "CVMFS".

The collaboratively maintained software also puts constraints on the platform of computing resources used for ATLAS purposes. At present, the software framework only runs on x86_64 Scientific Linux 6 [11] and CentOS 7 [12] machines (with Scientific Linux 6 slowly dying out). Given the age of Scientific Linux 6 and CentOS 7, there is an increasing tension between ATLAS requirements and other applications which require a more up-to-date software stack. The same is true for support of modern hardware components. Luckily, modern virtualization technology provides an attractive solution to this constraint. We set up the cluster in such a way that each user can choose the operating system (OS) in which her/his jobs should run using modern container technology [13], and the actual bare metal operating system is never exposed to the user. The container setup is described in "Containerization". Thanks to the container awareness of the HTCondor resource management system [14–21] the implementation of such a setup is straightforward. More details on how we use HTCondor are given in "HTCondor".
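A minimal sketch of what such a per-job OS selection could look like in an HTCondor submit description is shown below. It is not taken from the paper: the job attribute ContainerOS, the file names and the resource requests are illustrative placeholders rather than the actual interface defined in "Containerization" and "HTCondor".

    # Illustrative sketch only: +ContainerOS is a hypothetical custom job ClassAd
    # attribute that a container-aware pool could map to a Singularity image.
    cat > analysis.sub <<'EOF'
    universe              = vanilla
    # User payload; it runs inside the container selected below.
    executable            = analysis.sh
    arguments             = input.root
    request_cpus          = 1
    request_memory        = 2 GB
    # Hypothetical per-job OS choice, e.g. a CentOS 7 image for ATLAS software.
    +ContainerOS          = "CentOS7"
    should_transfer_files = YES
    transfer_input_files  = input.root
    output                = job.out
    error                 = job.err
    log                   = job.log
    queue
    EOF
    # Jobs are submitted directly from the user's desktop machine.
    condor_submit analysis.sub

On the execute side, such a job attribute would typically be translated by the worker-node configuration into the concrete Singularity invocation, so the bare-metal operating system stays hidden from the job.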
Another particular requirement for ATLAS data analysis is providing interfaces for the distributed data management (DDM) tools deployed in ATLAS [22, 23]. Most importantly, we are running an XRootD service [24]. This allows automatic transfers and subscription of datasets with high throughput to the BAF2 cluster. More information on XRootD is given in "XRootD". The datasets shipped to the BAF2 cluster are finally analyzed by reading the corresponding data files from a POSIX file system. As POSIX file system, we have chosen CephFS [25] due to the reasons explained in "CephFS".

All mentioned resources are deployed and orchestrated using Foreman [26] and Puppet [27]. We describe this setup in "Cluster Management".

As a general objective, free and open source software (FOSS) tools are used for cluster management and operation wherever possible. Reasons are not only financial ones but also our appreciation of the possibility to easily debug, patch and extend available software. When patching software we always try to feed the modifications back into the respective developer community to avoid accumulating large patch sets over time which could diverge from the upstream project and make future upgrades more difficult.

Cluster Concept

While the previously operated BAF1 cluster was conventionally set up using login nodes and without redundancy of the cluster services, a different approach was chosen for BAF2 due to the increasing diversity of requirements.

The new workflow from the users' perspective is that the jobs are submitted directly from their desktop machines. On these very machines, not only access to kerberized home directories maintained and backed up by the central university computing centre and mounted via NFS v4.2 [28] is provided, but the cluster file system CephFS can also be accessed directly (via NFS v4.2). The users' jobs can then either access CephFS with high bandwidth or use HTCondor's file transfer mechanism to store smaller input and output files in the home directories.

For each job, users can choose the operating system which should be used, and the cluster workload manager HTCondor takes care to instantiate the corresponding container. No dedicated login nodes need to be offered to the users, since an interactive cluster job emulates an ssh connection to a worker node which HTCondor realizes via the Connection Broker service as explained in "HTCondor". Using this approach, full flexibility is exposed to the users. The scheduling is based on a fair-share algorithm, so any user can submit as many jobs as needed and may at times use the full cluster resources.
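To make the absence of login nodes concrete, the following sketch shows how interactive work could be requested with standard HTCondor client tools; the commands are generic HTCondor tooling, and the submit description referenced here is the illustrative one from the earlier sketch, not a prescription from the paper.

    # Interactive work is scheduled like any other job: HTCondor opens a shell on
    # a worker node (inside the requested container) via its connection-brokering
    # mechanism, so no dedicated login node is involved.
    condor_submit -interactive analysis.sub

    # Attach to an already running batch job (here with job id 1234.0), e.g. for debugging.
    condor_ssh_to_job 1234.0

Combined with fair-share scheduling, this keeps interactive and batch use on the same footing from the resource manager's point of view.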