A Lightweight Kernel for the Harness Metacomputing Framework ∗

C. Engelmann and G. A. Geist Computer Science and Mathematics Division Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA {engelmannc,gst}@ornl.gov http://www.csm.ornl.gov

∗ This research is sponsored by the Mathematical, Information, and Computational Sciences Division; Office of Advanced Scientific Computing Research; U.S. Department of Energy. The work was performed at the Oak Ridge National Laboratory, which is managed by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725.

Abstract

Harness is a pluggable heterogeneous Distributed Virtual Machine (DVM) environment for parallel and distributed scientific computing. This paper describes recent improvements in the Harness kernel design. By using a lightweight approach and moving previously integrated system services into software modules, the software becomes more versatile and adaptable. This paper outlines these changes and explains the major Harness kernel components in more detail. A short overview is given of ongoing efforts in integrating RMIX, a dynamic heterogeneous reconfigurable communication framework, into the Harness environment as a new plug-in software module. We describe the overall impact of these changes and how they relate to other ongoing work.

1. Introduction

The heterogeneous adaptable reconfigurable networked systems (Harness [9]) research project is an ongoing collaborative effort among Oak Ridge National Laboratory, the University of Tennessee, Knoxville, and Emory University. It focuses on the design and development of a pluggable lightweight heterogeneous Distributed Virtual Machine (DVM) environment, where clusters of PCs, workstations, and "big iron" supercomputers can be aggregated to form one giant DVM (in the spirit of its widely-used predecessor, "Parallel Virtual Machine" (PVM) [10]).

As part of the Harness project, a variety of experiments and system prototypes were developed to explore lightweight pluggable frameworks, adaptive reconfigurable runtime environments, assembly of scientific applications from software modules, parallel plug-in paradigms, highly available DVMs, fault-tolerant message passing, fine-grain security mechanisms and heterogeneous reconfigurable communication frameworks.

Currently, there are three different Harness system prototypes, each concentrating on different research issues. The teams at Oak Ridge National Laboratory [3, 5, 12] and at the University of Tennessee [6, 7, 13] provide different C variants, while the team at Emory University [11, 14, 15, 18] maintains a Java-based alternative.

Conceptually, the Harness software architecture consists of two major parts: a runtime environment (kernel) and a set of plug-in software modules. The multi-threaded user-space kernel manages the set of dynamically loadable plug-ins. While the kernel provides only basic functions, plug-ins may provide a wide variety of services needed in fault-tolerant parallel and distributed scientific computing, such as messaging, scientific algorithms and resource management. Multiple kernels can be aggregated into a Distributed Virtual Machine.

This paper describes recent improvements in the kernel design, which significantly enhance versatility and adaptability by introducing a lightweight design approach. First, we outline the changes in the design and then explain the major kernel components in more detail. We continue with a short overview of recent efforts in integrating RMIX, a dynamic heterogeneous reconfigurable communication framework, into the Harness environment. We describe the overall impact of these changes and how they relate to other ongoing work. This paper concludes with a brief summary of the presented research.

2. Kernel Architecture

Earlier designs of a multi-threaded kernel [3] for Harness were derived from an integrated approach similar to PVM [10], since the Harness software was initially developed as a follow-on to PVM, based on the distributed virtual machine (DVM) model.


Figure 1. Previous Kernel Design

Figure 3. Harness Architecture

In PVM, every node runs a local virtual machine and the overall set of virtual machines is controlled by a single master. This master is a single point of control and failure. In the DVM model, all nodes together form a distributed virtual machine, which they control equally in virtual synchrony. Symmetric state replication among all nodes, or a subset, assures high availability. Since there is no single point of control or failure, the DVM survives as long as at least one node is still alive.

The original Harness kernel design (Figure 1) featured all components needed to manage plug-ins and external processes, to send messages between components inside and between kernels, and to control the DVM. Recent improvements in the design (Figure 2) are based on a lightweight approach. By moving previously integrated services into plug-ins, the kernel becomes more versatile and adaptable.

Figure 2. New Lightweight Kernel Design

The new design features a lightweight kernel with plug-in, thread and process management only. All other basic components, such as communication, distributed control and event notification, are moved into plug-ins in order to make them reconfigurable and exchangeable at runtime. The kernel provides a container for the user to load and arrange all software components based on actual needs, without any preconditions imposed by the managing framework (Figure 3). The user can choose the programming model (peer-to-peer, client/server, PVM, MPI, Harness DVM) by loading the appropriate plug-in(s).

The kernel itself may run as a daemon process and accepts various command line and configuration file options, such as loading default plug-ins and spawning certain programs on startup, for bootstrapping with an initial set of services. A debug variant with more extensive error logging and memory tracing is also provided, using the '.debug' extension or the "--debug" command line flag.

In the following sections, we describe the three major kernel components (process manager, thread pool and plug-in loader) in more detail.
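
To make the bootstrap behavior above more concrete, the following is a minimal sketch of a kernel entry point that loads default plug-ins and spawns programs named on the command line before entering its service loop. The hk_* functions and the --plugin and --spawn option names are invented for illustration and do not come from the Harness source; only the "--debug" flag is mentioned in the text above.

/* Hypothetical bootstrap sketch; the hk_* functions and the option names are
 * invented placeholders, not the actual Harness kernel interface. The stubs
 * below only print what a real kernel would do. */
#include <stdio.h>
#include <string.h>

static int hk_load_plugin(const char *name)      /* load a default plug-in */
{
    printf("loading plug-in: %s\n", name);
    return 0;
}

static int hk_spawn_program(const char *path)    /* spawn a program at startup */
{
    printf("spawning program: %s\n", path);
    return 0;
}

static void hk_run(void)                         /* daemon-style service loop */
{
    printf("kernel running\n");
}

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; i++) {
        if (strcmp(argv[i], "--debug") == 0)
            printf("debug variant behavior: extended logging enabled\n");
        else if (strcmp(argv[i], "--plugin") == 0 && i + 1 < argc)
            hk_load_plugin(argv[++i]);
        else if (strcmp(argv[i], "--spawn") == 0 && i + 1 < argc)
            hk_spawn_program(argv[++i]);
    }
    hk_run();
    return 0;
}
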

2.1. Process Manager

Harness is a pluggable multi-threaded environment. However, in heterogeneous distributed scientific computing scenarios it is sometimes necessary to fork a child process and execute a different program, e.g. for remote logins using the ssh program. The process manager allows just that. In addition, it also provides access to the standard input and output channels of spawned programs, allowing plug-ins to initiate and control them.

Only programs that reside in a specific preset program directory, or in a respective sub-directory, may be executed. Links to system programs, such as ssh, may be created by the system administrator.

Forking a process in a multi-threaded environment is not very well defined. On startup, the Harness kernel immediately creates a separate process that is responsible for launching child processes and relaying standard input and output between spawned programs and the kernel. This avoids any problems associated with duplicating threads and file descriptors in the kernel.

2.2. Thread Pool

Due to the potentially unpredictable runtime configuration of the Harness environment, certain issues regarding thread creation, termination and synchronization need to be addressed. They range from dealing with the limited number of threads controllable by a single process to stalling the system by using too many simultaneous threads. The thread pool allows the execution of jobs in threads, while staying within a preset range of running threads by utilizing a job queue.

Jobs are submitted to the thread pool in the form of a job function pointer and a pointer to a job function argument. The thread pool always maintains a minimum number of threads to offer quick response times. The thread count is increased for every newly submitted job if no idle thread is available and a preset maximum is not reached. Once this maximum is reached, jobs are queued into a fixed-length job queue. Job submissions are finally blocked when the job queue is full. Idle threads time out and exit until the minimum number of threads is reached.

Most thread pool properties, such as the maximum number of threads, are configured during kernel startup and may be modified via command line options and a kernel configuration file. They may also be modified at runtime via the thread pool configuration functions. This is essential for long-running jobs which permanently occupy a thread, such as network servers. The maximum number of threads can be easily adjusted to avoid thread pool starvation.
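
As an illustration of this calling convention, the sketch below submits a job as a function pointer plus an argument pointer. The hk_tp_submit() name and the job structure are assumptions made for this example; to keep it self-contained, the stand-in submission function simply runs the job in a detached POSIX thread rather than through a real pool.

/* Sketch of the job-submission convention: a job is a function pointer plus
 * an argument pointer. hk_tp_submit() is a hypothetical placeholder; the
 * stand-in below simply runs the job in a detached thread. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

typedef void (*hk_job_fn)(void *);

struct job { hk_job_fn fn; void *arg; };

static void *job_runner(void *p)
{
    struct job *j = p;
    j->fn(j->arg);
    return NULL;
}

/* stand-in for the real thread pool submission function */
static int hk_tp_submit(struct job *j)
{
    pthread_t t;
    if (pthread_create(&t, NULL, job_runner, j) != 0)
        return -1;
    return pthread_detach(t);
}

/* example job: could be a network server loop or a one-shot computation */
static void hello_job(void *arg)
{
    printf("job running with argument: %s\n", (const char *)arg);
}

int main(void)
{
    static struct job j = { hello_job, "plug-in work item" };
    hk_tp_submit(&j);
    sleep(1);   /* give the detached job time to finish in this toy example */
    return 0;
}
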
2.3. Plug-in Loader

The plug-in loader handles loading and unloading of runtime plug-ins, which exist in the form of shared libraries. Such plug-ins are automatically unloaded on kernel shutdown. Plug-ins may load and unload other plug-ins on demand. Only plug-in modules that reside in a specific preset plug-in directory, or in a respective sub-directory, may be loaded. The kernel itself may load plug-ins on startup, in a separate thread that applies command line options.

Every plug-in has the opportunity to initialize itself during loading and to finalize itself during unloading via init() and fini() functions that are automatically resolved and called by the plug-in loader if they exist. Plug-ins are able to load and unload other plug-ins they depend on during initialization and finalization, since the plug-in loading and unloading functions are reentrant and deadlock free.

A plug-in (shared library) is loaded by the plug-in loader via dlopen() without exporting any of its symbols (functions and data) to the global name space of the kernel. Plug-in symbols are not made automatically available to the kernel or to other plug-ins, to prevent naming conflicts and to allow multiple plug-ins to offer the same interface with different implementations. In fact, only the module (kernel or another plug-in) that loads a plug-in may access it, using a unique and private handle.

The plug-in loading and unloading policy is based on ownership. A module that loads a specific plug-in owns it and is also responsible for unloading it. The ownership of orphaned plug-ins is automatically transferred to the kernel, and they are unloaded on kernel shutdown. The plug-in loader supports multiple loading to allow different modules to load the same plug-in. Plug-ins that maintain state may use reference counting and separate state instances to support multiple loading.

Plug-in symbols are resolved by the plug-in loader via dlsym(), requiring the plug-in handle and a symbol name. Plug-ins may provide tables with function and data pointers to improve performance by offering a one-time lookup. In this case, plug-in functions are called and plug-in data is accessed virtually, similar to C++ objects.

The plug-in loader also supports a specific versioning scheme for plug-ins, similar to the library versioning of GNU Libtool [16]. A single number describes the plug-in API version and a single number describes the revision of this specific plug-in API implementation. The version number is increased and the revision number is set to 0 if the API was changed. The revision number is increased if the implementation of the API was improved, e.g. in case of bug fixes and performance enhancements.

A plug-in may also support past API versions if functions were only added, so that there is always a continuous range of versions specified by two numbers, the first version and the number of follow-up versions (age). Together with the revision, the triplet <first>.<age>.<revision> is formed, which is used to differentiate and order files and folders by appending it to or inserting it into file or folder names, e.g. libfoo.1.2.3.so.

The plug-in loader allows one to specify the desired version in addition to the plug-in name when loading a plug-in. It chooses the most recent implementation of this version. Furthermore, age and revision may also be specified.

In order to provide more extensive debugging support, plug-in variants with built-in extended error logging and memory tracing are supported as well, using the '.debug' extension, e.g. libfoo.debug.so or libfoo.1.2.3.debug.so. Debug plug-in variants are automatically preferred when loading into the kernel debug variant.
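
The sketch below illustrates the loading behavior described above using the standard POSIX dlopen()/dlsym() interface: the plug-in is opened with RTLD_LOCAL so its symbols stay out of the kernel's global name space, and the optional init()/fini() entry points are resolved and called only if present. The file name and error handling are simplified placeholders, not the actual Harness loader code.

/* Simplified loader sketch using POSIX dlopen()/dlsym(); not the actual
 * Harness implementation. RTLD_LOCAL keeps the plug-in's symbols out of the
 * kernel's global name space, so access goes only through the private handle. */
#include <dlfcn.h>
#include <stdio.h>

typedef int (*plugin_entry)(void);

int main(void)
{
    /* hypothetical versioned plug-in file in the preset plug-in directory */
    void *handle = dlopen("./libfoo.1.2.3.so", RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    /* init() and fini() are optional; call them only if they resolve */
    plugin_entry init = (plugin_entry)dlsym(handle, "init");
    plugin_entry fini = (plugin_entry)dlsym(handle, "fini");

    if (init != NULL && init() != 0)
        fprintf(stderr, "plug-in initialization failed\n");

    /* ... use the plug-in through its function table here ... */

    if (fini != NULL)
        fini();

    dlclose(handle);   /* unload on shutdown or when the owner releases it */
    return 0;
}
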
3. RMIX Framework

RMIX [15] is a dynamic heterogeneous reconfigurable communication framework, initially developed by the Harness team at Emory University, that allows Java applications to communicate via TCP/IP using various RMI/RPC protocols, such as Sun RPC, Java RMI and SOAP. In addition to standard synchronous RMI/RPC mechanisms, RMIX also supports asynchronous and one-way invocations, which suits well the messaging needs of the peer-to-peer distributed control in Harness [5]. Ongoing research at Oak Ridge National Laboratory targets the development of a stand-alone C variant of RMIX and its integration into the lightweight Harness framework in order to replace the inflexible HCom communicator plug-in, thus improving adaptability and heterogeneity.

Figure 4. RMIX Framework Design

The C-based RMIX framework (Figure 4) consists of a base library and a set of runtime provider plug-ins (shared libraries). The base library offers the same plug-in loader and thread pool as the Harness kernel. The provider plug-ins are responsible for the protocol as well as for the TCP/IP communication. In the future, provider plug-ins may also support different inter-process communication methods, e.g. pipes and shared memory.

The RMIX framework offers access to remote network services via client stubs, which provide the illusion of local invocation of remote methods. Client stub methods have the same signature as the remote methods they represent. They translate the method arguments into, and the method return value from, a generic representation that is used by the provider plug-ins.

Network services are implemented with the RMIX framework via server stubs, which translate the method arguments from, and the method return value into, the generic representation. The same provider plug-ins are used on the server side.

Figure 5. RMIX Plug-in for Harness Kernel

The RMIX framework is integrated into Harness as a base plug-in and a stub plug-in (see Figure 5). The base plug-in is linked directly to the RMIX base library and provides its functions to other Harness plug-ins. The stub plug-in contains all client and server stubs necessary to invoke local or remote kernel functions.

On loading into the kernel, the RMIX stub plug-in automatically loads the RMIX base plug-in and optionally starts one or more RMI/RPC servers depending on an RMIX stub plug-in configuration file. Using the command line bootstrap mechanism, a kernel may provide all of its essential services via RMI/RPC servers at startup.
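
As a purely illustrative sketch of the client-stub idea, synchronous case only, the example below gives a stub with the same signature as a hypothetical remote add() method, marshals its arguments into a generic representation, and hands them to a stand-in provider call. All rmix_* names, types and the endpoint string are invented for this example and do not reflect the actual RMIX C API.

/* Illustrative client-stub sketch only: the rmix_* names, the generic value
 * type and the endpoint string are invented for this example and are not the
 * actual RMIX C API. The stand-in rmix_invoke() just echoes its input. */
#include <stdio.h>

typedef enum { RV_INT } rv_type;                /* generic representation */
typedef struct { rv_type type; int i; } rvalue;

/* stand-in for a provider plug-in performing the remote call */
static int rmix_invoke(const char *endpoint, const char *method,
                       const rvalue *args, int nargs, rvalue *result)
{
    printf("invoking %s at %s with %d argument(s)\n", method, endpoint, nargs);
    result->type = RV_INT;
    result->i = args[0].i + args[1].i;          /* pretend the server added them */
    return 0;
}

/* client stub: same signature as the (hypothetical) remote method */
static int remote_add(int a, int b)
{
    rvalue args[2] = { { RV_INT, a }, { RV_INT, b } };
    rvalue result;
    rmix_invoke("tcp://node1:9000", "add", args, 2, &result);  /* marshal + call */
    return result.i;                                           /* unmarshal */
}

int main(void)
{
    printf("remote_add(2, 3) = %d\n", remote_add(2, 3));
    return 0;
}
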
4. Impact

The described recent changes and improvements to the Harness lightweight kernel design were motivated by a lack of variety in programming paradigms. The previously embedded DVM programming paradigm forced a Harness user to run its application or service in the DVM context. However, more loosely coupled peer-to-peer paradigms that do not maintain global knowledge (state) among the participating processes were only supported through the DVM, thus introducing an additional overhead.

The main reasons for embedding the DVM into the kernel were the original PVM-like design approach and the requirement for high availability of the kernel itself. Everything that depends on the kernel can be highly available, since its managing framework is highly available. However, managing global state among a large number of nodes has scalability limits. Even with caching and partitioning hierarchies, the DVM approach is not suited for systems with tens or hundreds of thousands of processors, such as are on the horizon, e.g. the IBM BlueGene/L. Localized peer-to-peer concepts [4] can be much more efficient on such extreme-scale systems.

With the new design, the DVM management, i.e. the distributed control [5], has been moved into a plug-in. The DVM plug-in now handles all DVM requests, such as loading a plug-in or replicating plug-in state for high availability. It also implements failure recovery and event notification. However, high availability is limited to plug-ins that rely on the DVM plug-in. The kernel itself is no longer highly available.

The lightweight design now allows plug-ins to exist outside of the DVM context. Furthermore, highly available plug-ins inside the DVM are able to use plug-ins outside of the DVM for simple tasks, which can be easily restarted during recovery. For example, a communication plug-in, such as HCom or RMIX, does not need to be highly available, since its service (providing communication endpoints) is bound to a specific location.

An early prototype of the new design showed significant improvements in usability, versatility and adaptability. Especially the capability for plug-in loading at kernel startup, without having a DVM running, proved to be very useful. For example, a worker task for a scientific calculation can be easily started by simply spawning the kernel on a remote machine using ssh, with the appropriate application plug-in name on the kernel command line. The application plug-in has full access to its local kernel and may load the DVM plug-in for high availability and/or the HCom or RMIX plug-in to communicate with other kernels. It may also just perform some local computation and relay the result back to a repository.

In contrast, the previous Harness design required a user to start the kernel via a PVM-like command line tool on the local and subsequently remote machines, forming a DVM. The application plug-in was then loaded and accessed via the DVM management, replicating some or all of its state throughout the DVM server process group.

With our new design, we have experienced performance improvements in some cases, e.g. where kernel functions were used without going through the DVM management. The DVM-related overhead depends on the number of participating nodes inside the DVM and the current load (pending requests) of the DVM management. Request processing times can vary from less than a second to a few seconds under normal conditions, and up to several minutes in large systems under heavy load. This time can be saved by bypassing the DVM management for tasks that do not need to be highly available.

For example, applications that follow the bag-of-tasks programming paradigm can easily be made fault-tolerant by farming out tasks and restarting them at different locations on failure. Task farming with on-the-fly fault tolerance by task replacement is a widely used technique today; examples are SETI@HOME [17] and Condor [1, 2]. The new kernel design efficiently supports this programming paradigm without the DVM overhead.

The kernel design changes also made it easier to program plug-ins. For example, the new thread pool gives the kernel sole control over thread management, rather than each individual plug-in. Plug-in programmers need only write their job functions and submit them to the thread pool. Thread cancellation clean-up on kernel shutdown is supported via the pthread cleanup interface.
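
To make the last point concrete, a plug-in job function can register a clean-up handler through the standard pthread cleanup interface so that its resources are released if the kernel cancels the thread on shutdown. The job below is invented for illustration and is not taken from a Harness plug-in.

/* Minimal sketch of a plug-in job function using the standard pthread
 * cleanup interface; the job itself is invented for illustration. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static void release_buffer(void *arg)
{
    free(arg);                      /* runs on cancellation or normal exit */
    printf("job buffer released\n");
}

/* job function as it would be submitted to the kernel thread pool */
static void *example_job(void *arg)
{
    char *buffer = malloc(4096);
    pthread_cleanup_push(release_buffer, buffer);

    /* ... long-running work that may be cancelled on kernel shutdown ... */
    printf("job running: %s\n", (const char *)arg);

    pthread_cleanup_pop(1);         /* 1 = run the handler on normal exit too */
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, example_job, "impact example");
    pthread_join(t, NULL);
    return 0;
}
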
5. Related Work

This paper describes the recent changes and improvements to the C-based variant of the Harness kernel maintained at Oak Ridge National Laboratory. The Harness team at Emory University developed a Java-based version, H2O [18], with a similar design.

H2O is a secure, scalable, stateless, lightweight, flexible resource sharing platform. It allows resource owners, any authorized third parties, or clients themselves to deploy services (pluglets) into the H2O container (kernel). H2O has been designed to support a wide range of distributed programming paradigms, including self-organizing applications, widely distributed applications, massively parallel applications, task farms and component composition frameworks. H2O is founded on the RMIX communication framework. Common usage scenarios involve clients deploying computational services just prior to using them, thus availing of the raw computational power shared by the container owner. Hence, resource sharing and grid computing systems can naturally be formed using H2O.

Our Harness kernel and H2O have similar lightweight designs. However, H2O is far more advanced in terms of security mechanisms, while it does not fully support native plug-ins (shared libraries). This is due to the fact that the Java Native Interface does not allow multiple Java objects to simultaneously load the same native plug-in.

The Harness team at the University of Tennessee (UTK) also maintains a C-based variant which has a lightweight design. However, it uses a ring-based replicated database for high availability, whereas our solution is based on more advanced distributed control [5]. The Harness variant from UTK also provides a PVM-like console control program to manage DVMs. Furthermore, UTK developed an FT-MPI [7, 6, 8] plug-in for fault-tolerant MPI messaging in Harness.

The ongoing research effort in Harness includes the cooperative integration of technology between members of the overall Harness team. Currently, FT-MPI is being combined with H2O for support across administrative boundaries. RMIX is being implemented in C as a stand-alone solution, as well as a native Harness plug-in, to provide flexible heterogeneous communication for all Harness frameworks and for any other heterogeneous distributed computing platforms.

6. Conclusions

We have described recent changes and improvements to the Harness lightweight kernel design. By using a more efficient and flexible approach and moving previously integrated services into distinct plug-in modules, the software becomes more versatile and adaptable. The kernel provides a container for the user to load and arrange all software components based on actual needs, without any preconditions imposed by the managing framework. The user chooses the programming model (DVM, PVM, MPI, etc.) by loading the appropriate plug-in(s).

Recent efforts were described in which RMIX, a dynamic heterogeneous reconfigurable communication framework, was integrated into the Harness environment as a new pluggable software module.

Our experience with an early prototype has shown that the new kernel design is able to efficiently support multiple programming models without additional DVM overhead, simply by bypassing the DVM management for tasks that do not need high availability. Furthermore, the design changes have also made it easier to program plug-ins.

Future work will concentrate on providing service-level high availability features to applications, as well as to typical components, such as schedulers.
lution, as well as a native Harness plug-in, to provide flex- Lecture Notes in Computer Science: Proceedings of ICCS ible heterogeneous communication for all Harness frame- 2001, 2073:355–366, 2001. works, and for any other heterogeneous distributed comput- [8] FT-MPI at University of Tennessee, Knoxville, TN, USA. At ing platforms. http://icl.cs.utk.edu/ftmpi. [9] G. Geist, J. Kohl, S. Scott, and P. Papadopoulos. HAR- 6. Conclusions NESS: Adaptable virtual machine environment for hetero- geneous clusters. Parallel Processing Letters, 9(2):253–273, 1999. We have described recent changes and improvements to [10] G. A. Geist, A. Beguelin, J. J. Dongarra, W. Jiang, the Harness lightweight kernel design. By using a more R. Manchek, and V. S. Sunderam. PVM: Parallel Virtual efficient and flexible approach and moving previously in- Machine: A Users' Guide and Tutorial for Networked Paral- tegrated services into distinct plug-in modules, the soft- lel Computing. MIT Press, Cambridge, MA, USA, 1994. ware becomes more versatile and adaptable. The kernel pro- [11] Harness at Emory University, Atlanta, GA, USA. At vides a container for the user to load and arrange all soft- http://www.mathcs.emory.edu/harness. ware components based on actual needs, without any pre- [12] Harness at Oak Ridge National Laboratory, TN, USA. At conditions imposed by the managing framework. The user http://www.csm.ornl.gov/harness. chooses the programming model (DVM, PVM, MPI, etc.) [13] Harness at University of Tennessee, Knoxville, TN, USA. At by loading the appropriate plug-in(s). http://icl.cs.utk.edu/harness. Recent efforts were described in which RMIX, a dy- [14] D. Kurzyniec, V. S. Sunderam, and M. Migliardi. PVM em- namic heterogeneous reconfigurable communication frame- ulation in the Harness metacomputing framework - Design work, was integrated into the Harness environment as a new and performance evaluation. Proceedings of CCGRID 2002, pluggable software module. pages 282–283, 2002. [15] D. Kurzyniec, T. Wrzosek, V. Sunderam, and A. Slominski. In his 20 years at ORNL, he has published two books and RMIX: A multiprotocol RMI framework for Java. Proceed- over 190 papers in areas ranging from heterogeneous dis- ings of IPDPS 2003, pages 140–145, 2003. tributed computing, numerical linear algebra, parallel com- [16] GNU Libtool - The GNU portable library tool. At puting, collaboration technologies, solar energy, materials http://www.gnu.org/software/libtool. science, biology, and solid state physics. You can find out [17] SETI@HOME at University of California, Berkeley, CA, more about Al and a complete list of his publications on his USA. At http://setiathome.ssl.berkeley.edu. web page: http://www.csm.ornl.gov/∼geist [18] V. Sunderam and D. Kurzyniec. Lightweight self-organizing frameworks for metacomputing. Proceedings of HPDC 2002, pages 113–124, 2002.

Biographies

Christian Engelmann is a Research and Development staff member in the Network and Cluster Computing Group at Oak Ridge National Laboratory (ORNL). Christian is also a postgraduate research student at the Department of Computer Science of the University of Reading. He received his Advanced European M.Sc. degree in parallel and scientific computation from the University of Reading and his German Certified Engineer diploma in computer systems engineering from the College for Engineering and Economics (FHTW) Berlin in 2001. Christian joined ORNL in 2000 as a Master student.

He is a contributor to the Harness project, which focuses on developing a pluggable, lightweight, heterogeneous Distributed Virtual Machine environment. He is also a member of the MOLAR research team, which concentrates on developing adaptive, reliable, and efficient operating and runtime system solutions for ultra-scale high-end scientific computing. Furthermore, he is involved with a project developing self-adapting, fault-tolerant algorithms for systems with tens or even hundreds of thousands of processors. You can find out more about Christian on his web page: http://www.csm.ornl.gov/~engelman

Al Geist is a Corporate Research Fellow at Oak Ridge National Laboratory (ORNL), where he leads the 35-member Computer Science Research Group. He is one of the original developers of PVM (Parallel Virtual Machine), which became a world-wide de facto standard for heterogeneous distributed computing. Al was actively involved in both the MPI-1 and MPI-2 design teams and, more recently, the development of FT-MPI, a fault-tolerant MPI implementation. Today he leads a national Scalable Systems Software effort, involving all the DOE and NSF supercomputer sites, with the goal of defining standardized interfaces between system software components. Al is co-PI of a national Genomes to Life (GTL) center. The goal of his GTL center is to develop new algorithms and computational infrastructure for understanding protein machines and regulatory pathways in cells. He heads up a project developing self-adapting, fault-tolerant algorithms for 100,000-processor systems.

In his 20 years at ORNL, he has published two books and over 190 papers in areas ranging from heterogeneous distributed computing, numerical linear algebra, parallel computing, and collaboration technologies to solar energy, materials science, biology, and solid state physics. You can find out more about Al and a complete list of his publications on his web page: http://www.csm.ornl.gov/~geist