First International Workshop on Operating Systems, Programming Environments and Management Tools for High-Performance Computing on Clusters (COSET-1)

June 26th, 2004 Saint-Malo, France


Held in conjunction with 2004 ACM International Conference on Supercomputing (ICS ’04)

WORKSHOP PROCEEDINGS First International Workshop on Operating Systems, Programming Environments and Management Tools for High-Performance Computing on Clusters

COSET-1

Clusters are not only the most widely used general-purpose platform for scientific computing; according to recent results on the top500.org site, they have also become the dominant platform for high-performance computing today.

While the cluster architecture is attractive with respect to price/performance, there is still great potential for efficiency improvements at the software level. System software requires improvements to better exploit cluster hardware resources. Programming environments need to be developed with both cluster efficiency and programmer productivity in mind. Administrative processes need refinement for both efficiency and effectiveness when dealing with numerous cluster nodes.

The goal of this one-day workshop is to bring together a diverse community of researchers and developers from industry and academia to facilitate the exchange of ideas, to discuss the difficulties and successes in this area, and to present recent innovative results in the development of cluster operating systems and programming environments, as well as management tools for the administration of high-performance computing clusters.

COSET-1

Workshop co-chairs

Stephen L. Scott Oak Ridge National Laboratory P. O. Box 2008, Bldg. 5600, MS-6016 Oak Ridge, TN 37831-6016 email: [email protected]

Christine A. Morin IRISA/INRIA Campus universitaire de Beaulieu 35042 Rennes cedex, France email: [email protected]

Program Committee

Ramamurthy Badrinath, HP, India
Amnon Barak, Hebrew University, Israel
Jean-Yves Berthou, EDF R&D, France
Brett Bode, Ames Lab, USA
Ron Brightwell, SNL, USA
Emmanuel Cecchet, INRIA, France
Toni Cortès, UPC, Spain
Narayan Desai, ANL, USA
Christian Engelmann, ORNL, USA
Graham Fagg, University of Tennessee, USA
Paul Farrell, Kent State University, USA
Andrzej Goscinski, Deakin University, Australia
Liviu Iftode, Rutgers University, USA
Chokchai Leangsuksun, Louisiana Tech University, USA
Laurent Lefèvre, INRIA, France
John Mugler, ORNL, USA
Raymond Namyst, Université de Bordeaux 1, France
Thomas Naughton, ORNL, USA
Hong Ong, University of Portsmouth, UK
Rolf Riesen, SNL, USA
Michael Schoettner, University of Ulm, Germany
Assaf Schuster, Technion, Israel

COSET-1 Program

9:00-9:05 Opening

9:05-10:05 Session 1: Cluster Services Session chair: Christine Morin, INRIA

Parallel File System for Networks of Windows Workstations Jose Maria Perez, Jesus Carretero, Felix Garcia, Jose Daniel Garcia, Alejandro Calderon, Universidad Carlos III de Madrid, Spain

An application-oriented Communication System for Clusters of Workstations Thiago Robert C. Santos and Antonio Augusto Frohlich, LISHA, Federal University of Santa Catarina (UFSC), Brazil

10:05-10:35 Session 2: Application Management Session chair: Christine Morin, INRIA

A first step toward autonomous clustered J2EE applications management Slim Ben Atallah, Daniel Hagimont, Sébastien Jean and Noël de Palma, INRIA Rhône-Alpes, France

10:35-11:00 Coffee break

11:00-12:30 Session 3: Highly Available Systems for Clusters Session chair: Stephen Scott, ORNL

Highly Configurable Operating Systems for Ultrascale Systems Arthur B. Maccabe and Patrick G. Bridges, The University of New Mexico, USA Ron Brightwell and Rolf Riesen, Sandia National Laboratories, USA Trammell Hudson, Operating Systems Research, Inc., USA

Cluster Operating System Support for Parallel Autonomic Computing A. Goscinski, J. Silcock, M. Hobbs, Deakin University, Australia

Type-Safe Object Exchange Between Applications and a DSM kernel R. Goeckelmann, M. Schoettner, S. Frenz and P. Schulthess, University of Ulm, Germany

12:30-14:30 Lunch

14:30-16:00 Session 4: Cluster Single System Image Operating Systems Session chair: Christine Morin, INRIA

SGI's Altix 3700, a 512p SSI system. Architecture and Software environment. Jean-Pierre Panziera, SGI

OpenSSI speaker TBA

SSI-OSCAR Geoffroy Vallée, INRIA

16:00-16:20 Coffee break

16:20-17:20 Session 4 (continued): Cluster Single System Image Operating Systems Session chair: Geoffroy Vallée, INRIA

Millipede Virtual Parallel Machine for NT/PC clusters Assaf Schuster, Technion

Genesis cluster operating system Andrzej Goscinski, Deakin University

17:20-18:00 Panel: SSI: Software versus Hardware Approaches Moderator: Stephen Scott, ORNL

A Parallel File System for Networks of Windows Workstations

José María Pérez, Computer Science Department, Universidad Carlos III de Madrid, Av. de la Universidad, 30, Leganes 28911, Madrid, Spain, +34 91 624 91 04, [email protected]

Jesús Carretero, Computer Science Department, Universidad Carlos III de Madrid, Av. de la Universidad, 30, Leganes 28911, Madrid, Spain, +34 91 624 94 58, [email protected]

José Daniel García, Computer Science Department, Universidad Carlos III de Madrid, Av. de la Universidad Carlos III, 22, Colmenarejo 28270, Madrid, Spain, +34 91 856 13 16, [email protected]

ABSTRACT

The usage of parallelism in file systems allows high-performance I/O to be achieved in clusters and networks of workstations. Traditionally, this kind of solution was only available for UNIX systems, required special servers, and required special APIs, which led to the modification and/or recompilation of existing applications. This paper presents the first prototype of a parallel file system, called WinPFS, for the Windows platform. It is implemented as a new Windows file system and is integrated within the Windows kernel components, which means that no modification or recompilation of applications is needed to take advantage of parallel I/O. WinPFS uses shared folders (through the CIFS/SMB protocol) to access remote data in parallel. The proposed prototype has been developed under the Windows XP platform and has been tested with a cluster of Windows XP nodes and a Windows 2003 Server node.

Categories and Subject Descriptors
D.4.3 [Operating Systems]: File Systems Management - distributed file systems. C.2.4 [Computer-Communication Networks]: Distributed Systems - network operating systems.

Keywords
Parallel I/O, Cluster, Windows.

1. INTRODUCTION

In recent years, the need for high-performance data storage has grown along with disk capacity and application needs [1][2][3]. One approach to overcoming the bottleneck that characterizes typical I/O systems is the parallel I/O approach [1]. This technique allows the creation of large storage systems, by joining several storage resources, to increase the scalability and performance of the I/O system and to provide load balancing.

The usage of parallelism in file systems relies on the fact that a distributed and parallel system consists of several nodes with storage devices. Performance and bandwidth can be increased if data accesses are exploited in parallel. Parallelism in file systems is obtained by using several independent server nodes, each one supporting one or more secondary storage devices. Data are striped among those nodes and devices to allow parallel accesses to different files, and parallel accesses to the same file. Initially, this idea was used in RAID [4] (Redundant Array of Inexpensive Disks). However, when a RAID is used in a traditional file server, the I/O bandwidth is limited by the server memory bandwidth. If several servers are used in parallel, performance can be increased in two ways:

1. Allowing parallel access to different files by using several disks and servers.
2. Striping data using distributed partitions, allowing parallel access to the data of the same file.

However, current parallel file systems and parallel I/O libraries lack generality and flexibility for general-purpose distributed environments. Furthermore, most parallel file systems do not use standard servers, which makes it very difficult to integrate those systems into existing networks of workstations, due to the need to install new servers that are not easy to use and that are only available for specific platforms (usually some UNIX flavour). Moreover, those systems are implemented outside the operating system, so a new I/O API is needed to take advantage of parallel I/O, requiring the modification of existing applications.

Most of the software related to high-performance I/O is only available for UNIX environments, or has been created as UNIX middleware. The work presented in this paper tries to fill the lack of this kind of system in the Windows environment, presenting a way to achieve parallel I/O on Windows platforms.

In this paper, we present a parallel file system for Windows clusters and/or networks of workstations, called WinPFS. This system integrates the existing servers in an organization, using protocols like NFS, CIFS or WebDAV in order to obtain parallel I/O, without needing complex installations, while providing support to existing applications, high performance, and low overhead thanks to the integration with the Windows kernel.

In section 2, some works related to parallel I/O are presented. Section 3 presents the design of WinPFS. Section 4 describes some evaluations of the first WinPFS prototype. Finally, section 5 presents our conclusions and future work.

2. RELATED WORK

Three different parallel I/O software architectures can be distinguished:

• Application libraries basically consist of a set of highly specialized I/O functions. Those functions provide a powerful development environment for experts with specific knowledge of the problem to be modeled using this solution. A representative example is MPI-IO [5], an I/O extension of the standardized message passing interface MPI.

• Parallel file systems operate independently from applications, thus allowing more flexibility and generality. Examples of parallel file systems are Vesta [6], PIOUS [7], Galley [8], and ParFiSys [9].

• Intelligent I/O systems hide physical disk access from the application developer by providing a transparent logical I/O environment. The user describes what she wants and the system tries to optimize the I/O requests by applying optimization techniques. This approach is used in ViPIOS [10].

The main problem with parallel I/O software architectures and parallel I/O techniques is that they often lack generality and flexibility, because they create only tailor-made software for specific problems. On the other hand, parallel file systems are specially conceived for multiprocessors and multicomputers, and do not integrate appropriately in general-purpose distributed environments such as clusters of workstations.

In the last years some file systems have emerged, such as PVFS [11], which can be used in clusters, but they need the installation of special servers. Other solutions, such as Expand [12], can use existing standard NFS servers to accomplish parallel I/O, which implies that no new servers are needed in a cluster beyond the standard Linux NFS. Usually the client side is implemented as a user-level library and designed with UNIX in mind.

Another important way to accomplish high-performance I/O is the usage of MPI-IO [5]. Some implementations of MPI have been adapted to Windows: MPICH, MPI-PRO, and WMPI. But usually the Windows I/O part is not optimized using parallel I/O techniques.

3. WINPFS DESIGN

The main motivation for the WinPFS design is to build a parallel file system for networks of Windows workstations using standard data servers. To satisfy this goal, the authors designed and implemented a parallel file system using CIFS/SMB servers. This paper describes the first prototype of WinPFS.

The goals of the proposed architecture are:

• To integrate existing storage resources using shared folders (CIFS, WebDAV, NFS, etc.) rather than installing new servers. This is accomplished by using Windows redirectors.
• To simplify setup. Only a Windows driver is needed to make use of the system.
• To be easy to use. Existing applications must work without modification and without recompilation.
• To enhance the performance, scalability and capacity of the I/O system, through the usage of parallel and distributed file system mechanisms: request splitting, balanced data allocation, load balancing, etc.

To accomplish most of the former goals, the proposed design is based on a new Windows kernel component that implements the basis of the file system, isolating users from the parallel file system, and on the use of protocols to connect to different network file systems. Figure 1 shows how a client application, using any Windows interface (for instance, Win32), can access the distributed partitions in a cluster or network of workstations using WinPFS. Communication with the data servers can be performed with any available protocol through kernel components, called redirectors, which redirect requests to remote servers with a specific protocol. In our first prototype, we have only considered CIFS/SMB servers and the issues related to coordinating several of them.

Figure 1. WinPFS installed in an Intranet (clients reach distributed partitions on several sites through NFS, CIFS and HTTP-WebDAV redirectors).

The position of WinPFS in the Windows kernel can be observed in Figure 2. A user application uses an available interface (Win32, POSIX, etc.), whose calls are converted into system calls (Windows system services). The requests are received in kernel mode by the I/O Manager, which is in charge of identifying the driver that is going to deal with each request. WinPFS registers itself as a virtual remote file system, which allows the driver to receive the requests.

Figure 2. WinPFS in the Windows I/O subsystem (below the Native NT API and the I/O Manager, next to the local file system and the CIFS, WebDAV, Netware and NFS redirectors).

The next sections show the remote data access and file striping techniques used in WinPFS, the I/O request management, and the usage of WinPFS.

3.1 Remote Data Access and File Striping

From the user point of view, the Windows operating system provides access to remote data storage nodes through shared folders (local folders exported to remote computers). WinPFS creates a new shared folder, called \\PFS, on the client side. Therefore, users can use parallel files through the shared folder mechanism.

From the kernel point of view, accesses to remote data are performed based on several mechanisms: CIFS (Common Internet File System), also known as the SMB protocol; UNC (Universal Naming Convention); and a special class of drivers, called redirectors, which receive I/O requests from the users and send them to the appropriate servers.

The parallel file system implemented identifies all requests trying to access a virtual path (\\PFS) and processes them. For example, if we want to create a file in WinPFS, we must do something like CreateFile(\\PFS\file.txt) instead of CreateFile(C:\tmp\file.txt) or CreateFile(\\server1\tmp\file.txt).

In order to achieve high performance, load balancing and higher storage capacity, a file is striped across several nodes. Our file system must coordinate the access to several of those remote folders in order to achieve load balancing and data distribution. Striping leads to the creation of one or more requests to access data, based on the buffer size and the current offset. The requests are then sent to one or several redirectors in order to access remote storage nodes (see Figure 3).

3.2 I/O Request Management

The Windows NT family has a layered I/O model, which allows the definition of several layers to process a request in the I/O subsystem. Each of those layers is a driver that can receive a request and pass it (or additional requests) to lower layers (drivers) in the I/O stack. This model allows the insertion of new layers (drivers) in the path of an I/O request, for example for encryption or compression. WinPFS takes advantage of this mechanism in order to provide parallel I/O functionality.

To support this model, the Windows I/O subsystem presents two major features: the I/O Manager and the I/O request packets (IRPs). The I/O Manager is in charge of receiving requests from the user in the form of NT services (system calls), creating an IRP to describe each request, and delivering it to the appropriate device, which has an associated driver (in our case WinPFS). All I/O requests are delivered to drivers as I/O request packets (IRPs). That way, the I/O subsystem presents a consistent interface to all kernel-mode drivers. This interface includes the typical I/O operations: open, close, read, write, etc. [13].

Apart from creating the IRP, the I/O Manager must identify the device and kernel component which is going to complete a request. In the case of remote storage, this work is supported by the MUP (Multiple UNC Provider), which identifies the kernel component (network redirector, or WinPFS) in charge of a specific network name (Figure 4, steps 2-3). Once the driver is identified (in our case WinPFS), the IRP is passed to it (Figure 4, step 4). Then WinPFS creates one or several subrequests to be sent to remote servers and/or to access local information (cache, cached metadata, etc.).

The way in which the subrequests are created depends on the kind of request. Requests can be classified in the following categories:

• Create: the requests (IRPs) are replicated and sent to each server on which the file is going to be distributed.
• Read, write: the main request is split into smaller subrequests that are sent to the corresponding servers. For example, if we want to read 256 Kbytes and the striping unit is 64 Kbytes, four subrequests of 64 Kbytes are created (one for each of four shared folders).
• Create directory: the requests are replicated in all the shared folders, in order to keep the directory tree consistent across all the shared folders. This means that if we want to create a directory \\PFS\tmp, this directory is created in every shared folder: \\server1\share1\tmp, \\server1\share2\tmp, \\server2\share1\tmp, ...
• Metadata management, control, security: this kind of request needs a different approach. Some of them do not require splitting requests and/or accessing the remote servers.

Figure 3. Serving a request to several remote servers.

As an example, the basic steps to create/open a file are presented below (Figure 4):

0. The I/O Manager receives a request and creates an IRP. This request contains the file name in the form \\PFS\...
1. The I/O Manager sends the IRP packet to the MUP.
2. The MUP module has to look for a redirector (network file system) that recognizes the \\PFS string as a network name. The MUP asks all the redirectors until one answers affirmatively.
3. The MUP module indicates to the I/O Manager that the request must go to the redirector that recognized the network name. For this reason, our driver is created as a redirector that recognizes all the requests with the prefix \\PFS.
4. The I/O Manager sends the request to WinPFS.
5. The request packet received is split into several parallel subrequests, also in the form of IRPs, which are sent to redirectors (CIFS, NFS, etc.). In order to create the subrequests, the driver has to know where the data of the parallel file are stored, what servers are available, and which protocols can be used for each request.
6. The redirectors send the subrequests to the remote servers, which send/receive the data stripes. In the example, data are striped using a round-robin policy; however, several policies can be used to allocate parallel file data on the remote servers.
7. Once the subrequests are served, the driver must join the results, waiting on a kernel event for the completion of all the subrequests.

Figure 4. WinPFS steps to serve a request.
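To make the read/write splitting rule above concrete, the following is a small illustrative sketch of round-robin request splitting; the structure and function names are hypothetical, not WinPFS internals, and real subrequests would be IRPs rather than plain structures.

    #include <cstddef>
    #include <vector>

    // Illustrative sketch of round-robin request splitting; all names
    // here are hypothetical.
    struct SubRequest {
        std::size_t server;        // index of the shared folder/server to contact
        std::size_t server_offset; // offset within that server's stripe file
        std::size_t length;        // bytes covered by this subrequest
    };

    // Split a file-level request [offset, offset+length) into per-server
    // subrequests, assuming data are striped round-robin in fixed-size units.
    std::vector<SubRequest> split_request(std::size_t offset, std::size_t length,
                                          std::size_t stripe_size, std::size_t servers)
    {
        std::vector<SubRequest> subs;
        while (length > 0) {
            std::size_t stripe    = offset / stripe_size;     // global stripe index
            std::size_t in_stripe = offset % stripe_size;     // position inside it
            std::size_t chunk     = stripe_size - in_stripe;  // bytes left in this stripe
            if (chunk > length) chunk = length;

            SubRequest s;
            s.server        = stripe % servers;               // round-robin placement
            s.server_offset = (stripe / servers) * stripe_size + in_stripe;
            s.length        = chunk;
            subs.push_back(s);

            offset += chunk;
            length -= chunk;
        }
        return subs;
    }

With stripe_size = 64 Kbytes and servers = 4, a 256 Kbyte read at offset 0 yields four 64 Kbyte subrequests, one per shared folder, matching the example above.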

Steps 0 to 3 are only needed when a file is created or opened. Once those requests have finished, the user receives a file handle (associated with a File Object in kernel space) that allows direct access to WinPFS.

The other two important requests in a file system are reading and writing. Those requests come from the user with a buffer where the data are to be sent or received. That buffer must be split so that a part of it can be sent to each server. To keep the file system efficient, the implementation must avoid copying any buffer. The Windows kernel provides the mechanisms needed to perform these operations without copies. The buffers that come from the users are received in kernel structures called MDLs (Memory Descriptor Lists). WinPFS splits the request into subrequests by creating a new MDL, called a partial MDL, for each of them; the buffers from the original MDL are mapped into the partial MDLs, so no copy is needed. Each new buffer (MDL) is then sent to the appropriate redirector, which sends/receives the data to/from the remote servers.
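As a rough sketch of the zero-copy splitting just described, the fragment below builds one partial MDL per stripe-sized chunk of a caller's buffer using the standard WDM routines IoAllocateMdl and IoBuildPartialMdl. The stripe size, the function name and the simplified lifetime handling are assumptions for illustration; real code would attach each partial MDL to a subrequest IRP and free it on completion.

    // Sketch only: builds one partial MDL per stripe-sized chunk of the
    // caller's buffer, so each subrequest reuses the original pages
    // without copying. Error handling and the dispatch to redirectors
    // are omitted.
    #include <ntddk.h>

    static const ULONG STRIPE_SIZE = 64 * 1024;  // illustrative striping unit

    VOID SplitBufferIntoPartialMdls(PMDL SourceMdl)
    {
        PUCHAR base   = (PUCHAR) MmGetMdlVirtualAddress(SourceMdl);
        ULONG  length = MmGetMdlByteCount(SourceMdl);

        for (ULONG off = 0; off < length; off += STRIPE_SIZE) {
            ULONG chunk = (length - off < STRIPE_SIZE) ? (length - off)
                                                       : STRIPE_SIZE;

            // Allocate an MDL descriptor and map the sub-range of the
            // source MDL into it; no data are copied.
            PMDL partial = IoAllocateMdl(base + off, chunk, FALSE, FALSE, NULL);
            if (partial == NULL)
                break;                   // real code would fail the IRP
            IoBuildPartialMdl(SourceMdl, partial, base + off, chunk);

            // ... attach 'partial' to a subrequest IRP and send it to the
            // appropriate redirector (CIFS, NFS, ...) here ...

            IoFreeMdl(partial);          // freed here only to keep the sketch self-contained
        }
    }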

3.3 Using WinPFS

From the administration point of view, the installation of WinPFS only requires three steps:

• To install a driver on the client nodes.
• To share folders on the servers; this can be accomplished through the Explorer or the Windows administrative tools.
• To indicate, in the registry of the client nodes, the shared folders to be used by WinPFS.

From the user point of view, using the parallel file system only requires that all paths are prefixed with \\PFS, but we plan to implement the necessary mechanisms to map the remote name (\\PFS) to common drive letters such as D:.

Apart from the detail of using the naming convention \\PFS\file, nothing more is needed; a short illustrative fragment follows the feature list below. WinPFS can be used with the Win32 API, POSIX, Cygwin or any other I/O API that ultimately uses the Windows system services.

Other important features to take into account are:

• Caching: WinPFS caching is supported by the caching mechanisms of the redirectors. Therefore, caching is currently limited to the Windows caching model, but in the future more advanced caching algorithms could be implemented. In addition, the caching mechanisms can be disabled through the Win32 API.

• Security and authentication: security and authentication issues are solved by the operating system. If we work in a Windows "Domain" there are no authentication problems, because the domain users can access all the resources. Of course, the access to shared folders (including WinPFS parallel partitions) can be controlled by indicating which users can access the resource; this is accomplished with the Windows security model. If we want to use servers across several untrusted domains, or in workgroups, some changes must be made to our prototype to incorporate the authentication mechanisms.

• Data consistency between clients: another important feature is the consistency between several clients accessing a parallel/distributed file. At the moment, this is solved using the default mechanism of the CIFS redirector. The CIFS protocol uses a mechanism called oplocks (opportunistic locks) that enables the protocol to maintain consistency between clients [14].
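The following Win32 fragment illustrates this usage model; the share and file names are made up, and error handling is minimal. The only WinPFS-specific detail is the \\PFS prefix.

    #include <windows.h>
    #include <stdio.h>

    int main()
    {
        // Opening a file under \\PFS goes through WinPFS; the same call
        // with C:\tmp\file.txt or \\server1\tmp\file.txt would bypass it.
        HANDLE h = CreateFileA("\\\\PFS\\tmp\\file.txt",
                               GENERIC_WRITE, 0, NULL, CREATE_ALWAYS,
                               FILE_ATTRIBUTE_NORMAL, NULL);
        if (h == INVALID_HANDLE_VALUE) {
            printf("CreateFile failed: %lu\n", GetLastError());
            return 1;
        }

        char buffer[64 * 1024] = {0};   // one striping unit, per the text
        DWORD written = 0;
        WriteFile(h, buffer, sizeof(buffer), &written, NULL);  // split across servers

        CloseHandle(h);
        return 0;
    }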

4. Evaluation

In order to measure the performance of the first prototype of WinPFS, we have made some evaluation tests. The test creates a file of 100 Mbytes that is written sequentially and then read sequentially using a static buffer size (several executions are performed with different buffer sizes). We have disabled the client cache with the option FILE_FLAG_NO_BUFFERING in the creation of the file (CreateFile). We do not need to use special features, such as IOCTLs, of any kind; the Windows API provides this feature and others.

With this test, we want to measure the access performance through a network to remote disks (write) and to the servers' cache (read). We write the file with WinPFS, so the file is striped and sent to several servers that write it to disk, and then we read it back from the servers, where the data are kept in caches.

We have joined two clusters of four nodes (see Figure 5). Each node is a biprocessor Pentium III at 1 GHz with 1 GByte of main memory, 200 GByte disks and a GigaEthernet network, with two 3Com GigaEthernet switches and four nodes connected to each of them.

Figure 5. Evaluation infrastructure.

From the operating system point of view, a Windows Domain was created, with one computer running Windows 2003 Server and the other seven computers running Windows XP Professional.

The test was executed with 8 clients running the application simultaneously and with different configurations of the I/O system. First, the evaluation was done with the simple shared folder mechanism available by default in Windows (CIFS). This allowed us to evaluate the performance provided by one server with different numbers of clients. Then, we tested WinPFS with different numbers of servers in parallel: PFS88 (8 servers used in parallel), PFS44 (4 servers used in parallel), and PFS84 (4 servers used in parallel, selected from a set of 8 servers).

Figure 6 shows the results obtained in the write part of the test with 8 clients (one application running on each node). As can be seen, the performance obtained by WinPFS is higher than that of a single server attending 8 clients. With one single CIFS server, 40 Mbits/s are achieved with 64 KB and 128 KB application buffers. WinPFS obtains 150 Mbits/s with four servers (PFS44), 200 Mbits/s with four servers selected from eight (PFS84), and almost 250 Mbits/s using eight servers (PFS88). The reason is that the I/O requests are distributed between 8 servers instead of one. Therefore, in the single server case, the I/O throughput is the one provided by one disk.

Figure 7 shows the performance obtained in the read part of the test. In this case, the performance is higher with WinPFS (1200 Mbits/s). The CIFS server obtains good results (600 Mbits/s) because all the data are in main memory, so no disk access is required.

6

300 5

4 CIFS 250 Speedup PFS88 3 PFS44 200 2 PFS84

1 150 0 8 Clients 1K

Bandwidth Bandwidth (Mbits/S) 4 Clients 2K 4K 8K 2 Clients 100 16K 32K 64K 128K 1 Client Buffer Size 256K 512K 1M 50 Figure 8. Write Speedup (CIFS vs. WinPFS 88) 0 1K 2K 4K 8K 16K 32K 64K 128K 256K 512K 1M Buffer Size (Bytes) Figure 6. Write Results for 8 clients (Write to

remote disks) 1

0,8

1400 0,6 CIFS

1200 PFS88 Speedup 0,4 PFS44

1000 PFS84 0,2

800 0

-0,2 600 8 Clients Bandwidth Bandwidth (Mbits/S) 1K 2K 4 Clients 4K 8K 2 Clients 16K 32K 400 64K 128K 1 Client Buffer Size 256K 512K 1M

200

0 Figure 9. Read Speedup (CIFS vs. WinPFS 88) 1K 2K 4K 8K 16K 32K 64K 128K 256K 512K 1M Buffer Size (Bytes)

Figure 7. Read Results for 8 clients (Read from The write test takes into account the accesses to Servers Cache) disks, and the results show that under those One thing that must be remarked is that the network circumstances a speedup factor of 5 (500% of connecting the two clusters imposed a limit of improvement) is achievable, reaching almost a 1Gbit/s (120 Mbytes/s) to the system. If we use one 700% of improvement with buffers of big sizes. We server in one subcluster and the client in the other think that the read can achieve this level of subcluster, we never would overpass this limit. improvement if the data were flushed to disks. However, as can be see in Figure 7, WinPFS can As can be seen with one client, the speedup is about overpass this limit due to the use of several servers a factor of one in the writes and not exists for reads, in parallel (some of one cluster and other from the or even WinPFS works worst with bigger second cluster). application buffers. The last is because WinPFS To clarify the results obtained, Figure 8 and Figure strips the buffer in 64Kbytes buffers, so we are 9 show the speedup obtained with the PFS88 (8 limited to the performance obtained with 64Kbytes servers in parallel) configuration with respect to the buffers. This may be solved with a bigger stripping simple CIFS server. unit, but this may impose a limit in the system As can be seen, the speedup in the read part is parallelism. smaller (less than a 100% of improvement). As 5. CONCLUSIONS AND FUTURE WORK commented before, this is because data were served In this paper, we have described the design of a from server cache without disks accesses. parallel file system for clusters of Windows workstations. This system provides parallel I/O 8. REFERENCES features that allow the integration of existing storage [1] Peter M. Chen, David A Patterson. Maximizing resources by sharing folders and using a driver Performance in a Striped Disk Array. (WinPFS). Proceedings of the 17 th Annual International Our approach proposes to complement the Windows Symposium. On Computer Architecture, ACM kernel routing all the requests for the network name SIGARCH Computer Architecture News. 1990. \\PFS to a driver that splits the requests and uses [2] J. Gray. Data Management: Past Present, and several data servers in parallel. The integration of Future. IEEE Computer, Vol. 29, Nº 10, 1996, the file system into the kernel provides higher pp. 38-46. performance that other solutions that use libraries, [3] A. Chervenak, I. Foster, C. Kesselman, C. and it provides no differences from the user point of Salisbury, S. Tuecke. The Data Grid: Towards view, so that the user can execute its applications an Architecture for the Distributed Management without rewriting or recompiling them. and Analysis of Large Scientific Datasets. WinPFS achieves high scalability, availability and Journal of Network and Computer Applications, performance by using several servers in parallel. 23: 187-200, 2001 WinPFS also allows us to obtain a high capacity [4] D. A. Patterson, G. Gibson, and R. H. Katz. A storage node with a set of workstations. In the test, a Case for Redundant Array of Inexpensive Disks 1.6 Terabytes system has been built using 200 (RAID). In Proceedings of ACM SIGMOD, GBytes disks. pages 109-116, Chicago, IL, June 1988. In the test, the performance limits of the systems are [5] MPI Forum. 1997. MPI-2: Extensions to the two: disks in the write operations, and network Message-Passing Interface. 
http://www.mpi- bandwidth in the read operations. forum.org With the usage of redirectors, a client can stripe [6] P. Corbett, S. Johnson, and D. Feitelson. files over CIFS, NFS and WebDAV servers Overview of the Vesta Parallel File System. independently that those servers reside in Windows ACM Computer Architecture News, vol. 21, no. or UNIX servers, NAS, or whatever other storage 5, pp. 7--15, Dec. 1993. protocol that is supported through redirectors. [7] S. A. Moyer and V. S. Sunderam. PIOUS: A Future work is going on to dynamically add and Scalable Parallel I/O System for Distributed remove storage nodes to the cluster, on data Computing Environments. Proceedings of the allocation and load balancing for heterogeneous Scalable High-Performance Computing distributed systems, and on parallel usage of Conference, 1994, pp. 71--78. heterogeneous resources and protocols (CIFS, NFS, WebDAV, etc) in a network of workstations, [8] N. Nieuwejaar and D. Kotz. The Galley Parallel addressing the implications to performance, File System. Proceedings of the 10th ACM management and security. In addition, we will use International Conference on Supercomputing, the Active Directory Service provided by Windows May 1996. to create a metadata repository, so that all clients [9] J. Carretero, F. Perez, P. de Miguel, F.Garcia, can obtain a consistent image of the parallel files. and L.Alonso, Performance Increase Mechanisms for Parallel and Distributed File 6. ACKNOWLEDGMENTS Systems. Parallel Computing: Special Issue on This work has been partially support by Microsoft Parallel I/O Systems. Elsevier, no. 3, pp. 525-- Research Europe, by the Community of Madrid 542, Apr. 1997. under the 07T/0020/2003 1 contract, and by the [10] Fuerle, T., O., Schikuta, E., and Wanek, H. Spanish Ministry of Science and Technology under 1999. Meta-ViPIOS: Harness distributed I/O the TIC2003-01730 contract. resources with ViPIOS. Journal of Research Computing and Systems, 4(2):124-142 7. ADITIONAL AUTHORS Additional authors: Felix Garcia [11] P.H. Carns, W.B. Ligon III, R.B. Ross, and R. ([email protected] ) and Alejandro Takhur, PVFS: A Parallel File System for Linux Calderón ( [email protected] ). Clusters.Tech. Rep. ANL/MCS-P804-0400, 2000. [12] F. Garcia, A. Calderon, J. Carretero, J.M. Perez, J. Fernandez, The Design of the Expand Parallel File System. International Journal of High [14] SNIA (Storage Networking Industry Performance Computing Applications, 2003. Association). Common Internet File System [13] Rajeev Nagar. Windows NT File System (CIFS). Technical Reference. Revision: 1.0, Internals. A Developer’s Guide. O’Reilly, 1997. 2002. pp. 6-10 Pp. 158

An Application-Oriented Communication System for Clusters of Workstations

Thiago Robert C. Santos and Antonio Augusto Frohlich

Laboratory for Software/Hardware Integration (LISHA) Federal University of Santa Catarina (UFSC) 88049-900 Florianopolis - SC - Brazil PO Box 476 Phone: +55 48 331-7552 Fax: +55 48 331-9770 {robert | guto}@lisha.ufsc.br

http://www.lisha.ufsc.br/ {robert | guto}

Abstract

The present paper proposes an application-oriented communication sub-system to be used in SNOW, a high-performance, application-oriented parallel-programming environment for dedicated clusters. The proposed communication sub-system is composed of a baseline architecture and a family of lightweight network interface protocols. Each of these protocols is built on top of the baseline architecture and can be tailored to satisfy the needs of specific classes of parallel applications. The family of lightweight protocols, along with the baseline architecture that supports it, constitutes a customizable component in EPOS, an application-oriented, component-based operating system that is the core of SNOW. The idea behind providing a set of low-level protocol implementations instead of a single monolithic protocol is that parallel applications running on clusters can improve their communication performance by using the most appropriate protocol for their needs.

Keywords: lightweight communication protocols, application-oriented operating systems, user- level communication.

1 Introduction

Clusters of commodity workstations are now commonplace in high-performance computing. In fact, commercial off-the-shelf processors and high-speed networks have evolved so much in recent years that most of the hardware features once used to characterize massively parallel processors (MPPs) are now available in clusters as well. Nonetheless, the majority of the clusters in use today rely on commodity run-time support systems (run-time libraries, operating systems, compilers, etc.) that have usually been designed in disregard of both parallel applications and hardware. In such systems, delivering a parallel API like MPI is usually achieved through a series of patches or middleware layers that invariably add overhead for applications. Therefore, it seems logical to suppose that a run-time support system specially designed to support parallel applications on clusters of workstations could considerably improve performance and also other software quality metrics (e.g. usability, correctness, adaptability).

Our supposition that ordinary run-time support systems are inadequate to support high-performance computing is sustained by a number of research projects in the field focusing on the implementation of message passing [8, 9] and shared memory [10, 11, 12] middlewares and on user-level communication [4, 5, 6]. If ordinary operating systems matched parallel applications' needs, delivering adequate run-time support for the most traditional programming paradigms with minimum overhead, much of this research would be hard to justify outside the realm of operating systems. Indeed, the way ordinary operating systems handle I/O is largely based on multitasking concepts such as domain protection and resource sharing. This impacts the way recurring operations like system calls, CPU scheduling, and application data management are implemented, leaving little room for novel technological features [2]. Not surprisingly, system designers often have to push the operating system out of the way in order to implement efficient schedulers and communication systems for clusters.

In addition to that, commodity operating systems usually target reconfigurability at standard conformance and device support, failing to comply with applications' requirements. Clusters have been quenching the industry's thirst for low-end supercomputers for years, as HPC service providers deploy cost-effective solutions based on cluster systems. There are all kinds of applications running on clusters today, ranging from communication-intensive distributed databases to CPU-hungry scientific applications. Having the chance to customize a cluster's run-time support system to satisfy particular applications' needs could improve the system's overall performance. Indeed, systems such as PEACE [13] and CHOICES [14] already confirmed this hypothesis in the 90s.

In this paper, we discuss the use of dedicated run-time support systems, or, more specifically, of dedicated communication systems, as effective alternatives to support communication-intensive parallel applications in clusters of workstations. The research behind this discussion was carried out in the scope of the SNOW project [19], which aims at developing a high-performance, application-oriented parallel-programming environment for dedicated clusters. Actually, SNOW's run-time system comes from another project, namely EPOS, which provides a repository of software components, an adaptive component framework, and a set of tools to build application-oriented operating systems on demand [16].

The remainder of this paper is structured as follows. Section 2 gives an overview of EPOS. Section 3 presents a redesign of EPOS's communication system aimed at enhancing support for network interface protocols and describes the baseline architecture that supports these protocols. Section 4 elaborates on related works. Conclusions are presented in Section 5, along with directions for future work.

2 An Overview of EPOS

EPOS, the Embedded Parallel Operating System, is a highly customizable operating system developed using state-of-the-art software engineering techniques. EPOS consists of a collection of reusable and adaptable software components and a set of tools that support parallel application developers in "plugging" these components into an adaptive framework in order to produce a variety of run-time systems, including complete operating systems. Being a product of Application-Oriented System Design [15], a method that covers the development of application-oriented operating systems from domain analysis to implementation, EPOS can be customized to match the requirements of particular parallel applications.

EPOS components, or scenario-independent system abstractions as they are called, are grouped in families and kept independent of the execution scenario by deploying aspect separation and other factorization techniques during the domain engineering process, illustrated in Figure 1. EPOS components can thus be adapted for reuse in a variety of execution scenarios. Usability is largely improved by hiding the details of a family of abstractions behind a hypothetical interface, called the family's inflated interface, and delegating the selection of proper family members to automatic configuration tools.

Figure 1: An overview of Application-Oriented System Design (families of abstractions with members, adapters, scenario aspects and configurable features plugged into frameworks).

An application written against inflated interfaces can be submitted to a tool that scans it searching for references to the interfaces, thus rendering the features of each family that are necessary to support the application at run-time. This tool, the analyzer, outputs a specification of requirements in the form of partial component interface declarations, including the methods, types and constants that were used by the application. The primary specification produced by the analyzer is subsequently fed into a second tool, the configurator, which consults a build-up database to further refine the specification. This database holds information about each component in the repository, as well as dependencies and composition rules that are used by the configurator to build a dependency tree. Additionally, each component in the repository is tagged with a "cost" estimation, so that the configurator will choose the "cheapest" option whenever two or more components satisfy a dependency. The output of the configurator consists of a set of keys that define the binding of inflated interfaces to abstractions and activate the scenario aspects and configurable features identified as necessary to satisfy the constraints dictated by the target application or by the configured execution scenario.
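To make this mechanism concrete, the following is a minimal C++ sketch of how a configuration key can bind an inflated interface to one family member at compile time. All names (the thread family members, the traits structure, the binding template) are hypothetical illustrations, not EPOS's actual declarations, and the key would normally be emitted by the configurator rather than written by hand.

    // Two hypothetical members of a "Thread" family.
    class Cooperative_Thread { /* yield-based scheduling */ };
    class Preemptive_Thread  { /* timer-driven scheduling */ };

    // Configuration key: in an EPOS-like system this value would be
    // generated by the configurator from the dependency analysis.
    struct Traits_Thread {
        static const bool preemptive = true;
    };

    // The inflated interface binds to one member according to the key.
    template <bool Preemptive>
    struct Thread_Binding { typedef Cooperative_Thread Type; };

    template <>
    struct Thread_Binding<true> { typedef Preemptive_Thread Type; };

    typedef Thread_Binding<Traits_Thread::preemptive>::Type Thread;

    // Applications are written against "Thread" only; changing the key
    // re-binds the whole system without touching application code.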

Figure 2: An overview of EPOS generation tools (an application written against inflated interfaces is processed by the analyzer, configurator and generator, drawing on component, aspect and adapter repositories to produce a system instance).

The last step in the run-time system generation process is accomplished by the generator. This tool translates the keys produced by the configurator into parameters for a statically metaprogrammed component framework and triggers the compilation of a tailored system instance. Figure 2 gives an overview of the whole procedure.

3 EPOS Communication System

EPOS's communication system is designed around three major families of abstractions: communicator, channel, and network. The communicator family encompasses communication end-points such as link, port, and mailbox, thus acting as the main interface between the communication system and application programs^1. The second family, channel, features communication protocols, so that application data fed into the communication system via a communicator gets delivered at the destination communicator accordingly. A channel implements a communication protocol that would be classified at level four (transport) of the ISO/OSI reference model. The third family in EPOS's communication system, network, is responsible for abstracting distinct network technologies through a common interface^2, thus keeping the communication system itself architecture-independent and allowing for flexible combinations of protocols and network architectures.

^1 The component nature of EPOS enables individual elements of the communication system to be reused in isolation, even directly by applications. Therefore, the communicator is not the only visible interface in the communication system.
^2 Each member of the network family is allowed to extend this interface to account for advanced features.

Figure 3: An overview of EPOS communication system (communicator, channel, network).
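As a rough illustration of how the three families compose, the sketch below parameterizes a communicator by a channel, which is in turn parameterized by a network. Every class and method name here is a hypothetical stand-in for the corresponding EPOS abstraction, and the bodies are stubs.

    #include <cstddef>

    // Hypothetical member of the network family: one network technology.
    class Myrinet_Network {
    public:
        void send_frame(int node, const void * data, std::size_t size) {
            (void) node; (void) data; (void) size;   // stub: would drive the NIC
        }
        std::size_t receive_frame(int node, void * data, std::size_t max) {
            (void) node; (void) data; (void) max;    // stub
            return 0;
        }
    };

    // Hypothetical channel: a transport-level protocol over some network.
    template <class Network>
    class Datagram_Channel {
    public:
        void send(int node, const void * data, std::size_t size) {
            _net.send_frame(node, data, size);       // no segmentation in this sketch
        }
        std::size_t receive(int node, void * data, std::size_t max) {
            return _net.receive_frame(node, data, max);
        }
    private:
        Network _net;
    };

    // Hypothetical communicator: the end-point applications talk to.
    template <class Channel>
    class Port {
    public:
        explicit Port(int peer) : _peer(peer) {}
        void send(const void * data, std::size_t size) {
            _channel.send(_peer, data, size);
        }
        std::size_t receive(void * data, std::size_t max) {
            return _channel.receive(_peer, data, max);
        }
    private:
        int _peer;
        Channel _channel;
    };

    // A concrete stack is then a pure type composition:
    typedef Port< Datagram_Channel<Myrinet_Network> > Myrinet_Port;

This keeps the communication system architecture-independent: swapping the network type re-targets the whole stack without touching the communicator or channel code.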

Previous partial implementations of EPOS's communication system for the Myrinet high-speed network architecture confirmed the lightness of its core, delivering unprecedented bandwidth and latency to parallel applications running on SNOW [17]. Nonetheless, the original design of EPOS's communication system makes it hard to split the implementation of network interface protocols [1] between the processors in the host machine and in the network adapter. Besides, it is very difficult to specify a single network interface protocol that is optimal for all parallel applications, since different applications impose different traffic patterns on the underlying network. Instead of developing a single, highly complex, all-encompassing protocol, it appears more feasible to construct an architecture that permits fine-grain selection and dynamic configuration of precisely specified low-level lightweight protocol mechanisms. In an application-oriented environment, this set of low-level protocols can be used to customize communication according to applications' needs.

EPOS's design allows the several network interface protocols that arise from design decisions related to network features to be grouped into a software component with a defined interface, a family, that can be easily accessed by the communication system of the OS. The EPOS framework implements mechanisms for fine-grain selection of modules according to applications' needs. These same mechanisms can be used to select the low-level lightweight protocols that best satisfy the applications' communication requirements. Besides, an important step towards an efficient, application-oriented communication system for clusters is to better understand the relation between the design decisions in low-level communication software and the performance of high-level applications. Grouping the different low-level implementations of communication protocols in a component coupled with the communication system has the additional advantage of allowing an application to experiment with different communication schemes, collecting metrics in order to identify the best scheme for its needs. In addition, structuring communication in such a modular fashion enhances maintainability and extensibility.

3.1 Myrinet baseline architecture

The baseline communication architecture that supports the low-level lightweight protocols for the Myrinet networking technology must be simple and flexible enough not to hinder the design and implementation of specific protocols. The highest possible bandwidth and lowest possible latency are desired, since complex protocol implementations will definitely affect both of them. User-level communication was academia's best answer to the lack of efficient communication protocols for modern, high-performance networks. The baseline architecture described in this section follows the general concepts behind successful user-level communication systems for Myrinet. Figure 4 exhibits this architecture, highlighting the data flow during communication as well as the host and NIC memory layout.

Figure 4: The Myrinet Family baseline architecture (host memory holds the non-swappable Unsolicited Ring in a flat address space; NIC memory holds the Send and Receive Rings, Tx/Rx DMA Requests and Tx/Rx FIFO Queues; numbered arrows show the data flow between hosts).

The NIC memory holds the six buffers that are used during communication. The Send Ring and Receive Ring are circular buffers that hold frames before they are accessed by the Network-DMA engine, which is responsible for sending/receiving frames to/from the Myrinet network. Rx DMA Requests and Tx DMA Requests are circular chains of DMA control blocks, used by the Host-DMA engine for transferring frames between host and NIC memory. The Rx FIFO Queue and Tx FIFO Queue are circular FIFO queues used by the host processor and the LANai, Myrinet's network interface processor, to signal to each other the arrival of a new frame. The size of these buffers affects communication performance and reliability, and the choice of sizes is influenced by the host's run-time system, memory space considerations, and hardware restrictions.

Much of the overhead observed in traditional protocol implementations is due to memory copies during communication. Some network technologies provide direct host-to-network and network-to-host data transfers, but Myrinet requires that all network traffic go through NIC memory. Therefore, at least three copies are required for each message: from host memory to NIC memory on the sending side, from NIC to NIC, and from NIC memory to host memory on the receiving side. Write combining, the start-up overhead of DMA transfers, and the fact that a DMA control block has to be written in NIC memory for each Host-DMA transaction make write PIO more efficient than DMA for small frames. The baseline architecture copies data from host to NIC using programmed I/O for small frames (less than 512 bytes) and Host-NIC DMA for large frames. Since reads over the I/O bus are much slower than writes, the baseline architecture uses DMA for all NIC-to-host transfers.

During communication, messages are split into frames of fixed size that are pushed into the communication pipeline. The frame size that minimizes the transmission time of the entire message in the pipelined data transfer is calculated [18], and the baseline architecture uses this value to fragment messages. Besides, the maximum frame size (MTU) is dynamically configurable. For each frame, the sender

host processor uses write PIO to fill an entry in the Tx DMA Requests chain (for large frames) or to copy the frame directly to the Send Ring in NIC memory (for small frames) (1). It then triggers a doorbell, creating a new entry in the Tx FIFO Queue and signaling to the LANai processor that a new frame must be sent. For large frames, the transfer of frames between host and NIC memory is carried out asynchronously by the Host/NIC DMA engine (1), and the frame is sent as soon as possible by the LANai after the corresponding DMA finishes (2). Small frames are sent as soon as the doorbell is rung, since at this point the frame is already in NIC memory. A similar operation occurs on the receiving side: when a frame arrives from the network, the LANai receives it and fills an entry in the Rx DMA Requests chain. The message is assembled asynchronously in the Unsolicited Ring circular buffer in host memory (3). The receiving side is responsible for copying the whole message from the Unsolicited Ring before it is overwritten by other messages (4). Note that specific protocol implementations can avoid this last copy and achieve the optimal three copies by using rendezvous-style communication, where the receiver posts a receive request and provides a buffer before the message is sent, a credit scheme, where the sender is required to hold credits for the receiver before it sends a packet, or some other technique.

The host memory layout is defined by the operating system being used. Besides, the Myrinet NIC imposes some constraints on the usage of its resources that must be addressed by the OS. The most critical one relates to the Host/NIC DMAs: the Host-DMA engine can only access contiguous pages pinned in physical memory. Most implementations of communication systems for Myrinet address this issue by letting applications pin/unpin the pages that contain their buffers on-the-fly during communication, or by using a pinned copy block. The problem with these approaches is that they add extra overhead, since pinning/unpinning memory pages requires system calls, which implies context saving and switching, and using a pinned copy block adds extra data copies in host memory. In EPOS, where swapping can be left out of the run-time system by mapping logical address spaces contiguously in physical memory, this issue does not affect the overall communication.
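The sending logic just described can be condensed into a short sketch: fragment the message into MTU-sized frames, use programmed I/O below the 512-byte threshold mentioned above and DMA otherwise, then ring a doorbell per frame. The helper functions and the Frame structure are illustrative assumptions, not the actual EPOS implementation.

    #include <cstddef>
    #include <cstdint>

    const std::size_t PIO_THRESHOLD = 512;   // frames below this use write PIO

    struct Frame { const std::uint8_t * data; std::size_t size; };

    // Illustrative stubs for the mechanisms described in the text.
    void pio_copy_to_send_ring(const Frame &) { /* write PIO into the NIC Send Ring */ }
    void post_tx_dma_request(const Frame &)   { /* fill a Tx DMA control block */ }
    void ring_doorbell()                      { /* enqueue an entry in the Tx FIFO Queue */ }

    // Send one message: fragment into MTU-sized frames, then choose PIO
    // or DMA per frame, as in the baseline architecture.
    void send_message(const std::uint8_t * msg, std::size_t size, std::size_t mtu)
    {
        for (std::size_t off = 0; off < size; off += mtu) {
            Frame f = { msg + off, (size - off < mtu) ? (size - off) : mtu };
            if (f.size < PIO_THRESHOLD)
                pio_copy_to_send_ring(f);   // frame is in NIC memory when the doorbell rings
            else
                post_tx_dma_request(f);     // Host-DMA engine moves it asynchronously
            ring_doorbell();                // signal the LANai that a new frame is pending
        }
    }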

Figure 5: The Myrinet Family baseline architecture in a GNU/Linux host (as in Figure 4, but the application address space is interfaced to the NIC rings through a non-swappable copy block in kernel memory).

Figure 5 shows the memory layout and dynamic data flow of an implementation of the baseline architecture in a Myrinet GNU/Linux cluster. Issues such as address translation, kernel memory allocation and memory pinning had to be addressed in this implementation. Besides, a pinned copy block in kernel memory is used to interface Host/NIC DMA transfers, which adds one extra copy for each message on the sending side.

Figure 6 exhibits a performance comparison between the GNU/Linux baseline architecture and the low-level driver GM (version 1.6.5), provided by the manufacturer of Myrinet. A round-trip time test was performed in order to compare the two systems' latency.

Figure 6: Comparison between the baseline architecture's and GM's latency (in microseconds) for different frame sizes (in bytes).

Many Myrinet protocols assume that the Myrinet network is reliable and that, for that reason, no retransmission or time-out mechanism is needed. Indeed, the risk of a packet being lost or corrupted in a Myrinet network is so small that reliability mechanisms can safely be left out of the baseline architecture. Alternative implementations that assume unreliable network hardware and recover from lost, corrupted, and dropped frames by means of time-outs, retransmissions, and hardware-supported CRC checks are addressed by specific protocol implementations, since different application domains may need different trade-offs between reliability and performance.

The presented architecture may drop frames because of insufficient buffer space. The baseline architecture rests on a NIC hardware mechanism to partially solve this problem: backpressure, Myrinet's hardware link-level flow control mechanism, is used to prevent overflow of the network interface buffers, stalling the sender until the receiver is able to drain frames from the network. More sophisticated flow-control mechanisms must be provided by specific protocol implementations, since specialized applications may only require limited flow control from the network, performing some kind of control on their own. Besides, the architecture supports only point-to-point messages. Multicast and broadcast are desirable, since they are fundamental components of collective communication operations; lightweight protocols that provide these features could easily be implemented on top of point-to-point messages or using more efficient techniques [3]. Finally, the proposed baseline architecture provides no protection scheme, since it targets parallel applications running on dedicated environments.

3.2 Myrinet low-level lightweight protocols

While the baseline architecture is closely related to the underlying networking technology, low-level lightweight protocols are designed according to the communication requirements of specific classes of parallel applications. The lightweight protocols in the Myrinet family are divided into two categories: infrastructure protocols and high-performance protocols.

Infrastructure protocols provide communication services that were left out of the baseline architecture: transparent multicasting, QoS, connection management, protection schemes, reliable delivery and flow control mechanisms, among others. In order to keep latencies low, it would be desirable to execute the entire protocol stack, up to the transport layer, efficiently in hardware. Programmable network interfaces can be used to achieve that goal. Infrastructure protocols exploit the network processor to the maximum, using more elaborate Myrinet control programs (MCPs) in order to offload a broader range of communication tasks to the LANai. Communication performance is affected by the trade-off between performance and MCP complexity, but for some specific classes of applications this is a small price to pay for the communication services provided.

High-performance protocols deliver minimum latency and maximum bandwidth to the run-time system. They usually consist of minimal modifications to the baseline architecture that are required by applications, or of protocols that monitor traffic patterns and dynamically adjust the baseline architecture's customization points in order to address dynamic changes in application requirements.

4 Related Works

There are several implementations of communication systems for Myrinet, such as AM, BIP, PM, and VMMC, to name a few. Although these communication systems share some common goals, performance being one of them, they have made very different decisions in both the communication model and the implementation, consequently offering different levels of functionality and performance. From the several published comparisons between these implementations, one can conclude that there is no single best low-level communication protocol, since the communication patterns of the whole run-time system (application and run-time support) influence the impact that the low-level implementation decisions of a given communication system have on applications' performance. Besides, run-time system specifics greatly influence the functionality of communication system implementations.

While the Myrinet communication systems mentioned above try to deliver a generic, all-purpose solution for low-level communication, the main goal of the presented research is the customization of low-level communication software. The architecture we propose should be flexible enough to allow a broad range of the implementation decisions behind each of the several Myrinet communication systems to be supported as lightweight protocols.

Although our work has focused on Myrinet, there are other networks to which the same concepts can be applied. Cluster interconnection technologies that are also implemented with a programmable NIC that can execute a variety of protocol tasks include DEC's Memory Channel, the interconnection network in the IBM SP series, and Quadrics' QsNet.

5 Conclusions

The widespread adoption of cluster systems brings up the necessity of improvements in the software environment used in cluster computing. Cluster system software must be redesigned to better exploit clusters' hardware resources and to keep up with applications' requirements. Parallel-programming environments need to be developed with both cluster and application efficiency in mind.

In this paper we outlined the design of a communication sub-system based on low-level lightweight protocols, along with the design decisions related to this sub-system's baseline architecture for the Myrinet networking technology. Experiments are being carried out to determine the best values for the architecture's customization points under different traffic-pattern conditions.

We intend to create an efficient, application-oriented communication system for clusters, and the redesign of the EPOS communication system was one more step towards that goal. We believe that it is necessary to better understand the relation between the design decisions in low-level communication software and the performance of high-level applications. The proposed lightweight communication protocols, along with the application-oriented run-time system provided by EPOS, will be used to evaluate how different low-level communication schemes impact parallel applications' performance.

References

[1] Raoul A. F. Bhoedjang, Tim Ruhl, and Henri E. Bal. User-level Network Interface Protocols. IEEE Computer, 31(11):53–60, November 1998.

[2] IEEE Task Force on Cluster Computing. Cluster Computing White Paper, Mark Baker, editor, online edition, December 2000. [http://www.dcs.port.ac.uk/~mab/tfcc/WhitePaper].

[3] M. Gerla, P. Palnati, and S. Walton. Multicasting protocols for high-speed, wormhole-routing local area networks. In Proceedings of SIGCOMM, pages 184–193, 1996.

[4] Loic Prylli and Bernard Tourancheau. BIP: a New Protocol Designed for High Performance Networking on Myrinet. In Proceedings of the International Workshop on Personal Computer based Networks of Workstations, Orlando, USA, April 1998.

[5] Steven S. Lumetta, Alan M. Mainwaring, and David E. Culler. Multi-Protocol Active Messages on a Cluster of SMP's. In Proceedings of Supercomputing'97, San Jose, USA, November 1997.

[6] Hiroshi Tezuka, Atsushi Hori, Yutaka Ishikawa, and Mitsuhisa Sato. PM: An Operating System Coordinated High Performance Communication Library. In High-Performance Computing and Networking, volume 1225 of Lecture Notes in Computer Science, pages 708–717. Springer, April 1997.

[7] T. von Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: A User-Level Network Interface for Parallel and Distributed Computing. In Proceedings of the 15th ACM SOSP, pages 40–53, Copper Mountain, Colorado, December 1995.

[8] W. Gropp, E. Lusk, N. Doss, and A. Skjellum. A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing, 22(6):789–828, September 1996.

[9] Greg Burns, Raja Daoud, and James Vaigl. LAM: An Open Cluster Environment for MPI. In Proceedings of the Supercomputing Symposium, pages 379–386, 1994.

[10] Brian N. Bershad, Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy. User-level Interprocess Communication for Shared Memory Multiprocessors. ACM Transactions on Computer Systems, 9(2):175–198, May 1991.

[11] Jorg Cordsen. Virtuell Gemeinsamer Speicher. PhD thesis, Technical University of Berlin, Berlin, Germany, 1996.

[12] H. Hellwagner, W. Karl, M. Leberecht, and H. Richter. SCI-Based Local-Area Shared-Memory Multiprocessor. In Proceedings of the International Workshop on Advanced Parallel Processing Technologies – APPT'95, Beijing, China, September 1995.

[13] Wolfgang Schroder-Preikschat. The Logical Design of Parallel Operating Systems. Prentice-Hall, Englewood Cliffs, U.S.A., 1994.

[14] Roy H. Campbell, Nayeem Islam, and Peter Madany. Choices, Frameworks and Refinement. Computing Systems, 5(3):217–257, 1992.

[15] Antonio Augusto Frohlich. Application-Oriented Operating Systems. Number 17 in GMD Research Series. GMD - Forschungszentrum Informationstechnik, Sankt Augustin, August 2001.

[16] Antonio Augusto Frohlich and Wolfgang Schroder-Preikschat. High Performance Application-oriented Operating Systems – the EPOS Approach. In Proceedings of the 11th Symposium on Computer Architecture and High Performance Computing, pages 3–9, Natal, Brazil, September 1999.

[17] Antonio Augusto Frohlich and Wolfgang Schroder-Preikschat. On Component-Based Communication Systems for Clusters of Workstations. In Proceedings of the First IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID 2001), pages 640–645, Brisbane, Australia, May 2001.

[18] Antonio Augusto Frohlich, Gilles Pokam Tientcheu, and Wolfgang Schroder-Preikschat. EPOS and Myrinet: Effective Communication Support for Parallel Applications Running on Clusters of Commodity Workstations. In Proceedings of the 8th International Conference on High Performance Computing and Networking, pages 417–426, Amsterdam, The Netherlands, May 2000.

[19] Antonio Augusto Frohlich, Philippe Olivier Alexander Navaux, Sergio Takeo Kofuji, and Wolfgang Schroder-Preikschat. Snow: a parallel programming environment for clusters of workstations. In Proceedings of the 7th German-Brazilian Workshop on Information Technology, Maria Farinha, Brazil, September 2000.

A First Step towards Autonomous Clustered J2EE Applications Management

Slim Ben Atallah 1, Daniel Hagimont 2, Sébastien Jean 1, Noël de Palma 1

1 Assistant professor, 2 Senior researcher
INRIA Rhône-Alpes – Sardes project
655 avenue de l'Europe, Montbonnot Saint Martin
38334 Saint Ismier Cedex, France
Tel: 33 4 76 61 52 00, Fax: 33 4 76 61 52 52

[email protected]

ABSTRACT
A J2EE application server is composed of four tiers: a web front-end, a servlet engine, an EJB server and a database. Clusters allow for replication of each tier instance, thus providing an appropriate infrastructure for high availability and scalability. Clustered J2EE application servers are built from clusters of each tier and provide the J2EE applications with a transparent view of a single server. However, such applications are complex to administrate and often lack deployment and reconfiguration tools.

This paper presents JADE, a Java-based environment for clustered J2EE application deployment. JADE is a first attempt at providing a global environment that allows deploying J2EE applications on clusters. Beyond JADE, we aim to define an infrastructure that allows managing, as autonomously as possible, a wide range of clustered systems at different levels (from operating system to applications).

General Terms
Management, Experimentation.

Keywords
Clustered J2EE Applications, Deployment, Configuration.

1. INTRODUCTION
J2EE-driven architectures are now a more and more convenient way to build efficient web-based e-commerce applications. Although this multi-tier model, as is, suffers from a lack of scalability, it nevertheless benefits from clustering techniques that allow, by means of replication and consistency mechanisms, to increase application bandwidth and availability.

However, J2EE applications are not really easy and comfortable to manage. Their deployment process (installation and configuration) is as complex as it is tricky, no execution monitoring mechanism really exists, and dynamic reconfiguration remains a goal to achieve. This lack of manageability makes it very difficult to take full advantage of clustering capabilities, i.e. to allow expanding/collapsing replica sets as needed, and so on.

This paper presents the first results of an ongoing project that aims to provide system administrators with a management environment that is as automated as possible. Managing a system means being able to deploy, monitor and dynamically reconfigure such a system. Our first experiments target the deployment (i.e. installation/configuration) of a clustered J2EE application. The contribution in this field is JADE, a Java-based application deployment environment that eases the administrator's job. We show how JADE allows deploying a real benchmark application called RUBiS.

The outline of the rest of this paper is as follows. Section 2 recalls clustered J2EE application architecture and life cycle and shows the limits of existing deployment and configuration tools. Then, Section 3 presents JADE, a contribution to ease such application management by providing automatic scripting-based deployment and configuration tools. Section 4 resets this work in a wider project that consists of defining a component-based framework for autonomous systems management. Finally, Section 5 concludes and presents future work.

2. ADMINISTRATION OF J2EE CLUSTERS: STATE-OF-THE-ART AND CHALLENGES
This introductory section recalls clustered J2EE application architecture and life cycle before showing the limits of associated management tools.

2.1 Clustered J2EE Applications and their Lifecycle
[Figure 1. J2EE Applications Architecture: the end-user reaches the web tier (Apache with the mod_jk plugin) over HTTP; requests flow via AJP13 to the presentation tier (Tomcat), via RMI to the business logic tier, and via SQL/JDBC to the database tier.]

J2EE application servers [1], as depicted in Figure 1, are usually composed of four different tiers, either running on a single machine or on up to four ones:

- A web tier, such as a web server (e.g. Apache [2]), that manages incoming client requests and, depending on whether those relate to static or dynamic content, serves them or routes them to the presentation tier using the appropriate protocol (e.g. AJP13 for Tomcat).
- A presentation tier, such as a web container (e.g. Tomcat [3]), that receives forwarded requests from the web tier, interacts with the business logic tier (using the RMI protocol) to get related data, and finally dynamically generates a web document presenting the results to the end-user.
- A business logic tier, such as an Enterprise JavaBeans server (e.g. JoNAS [4]), that embodies application logic components (providing them with non-functional properties), which mainly interact with the database storing application data by sending SQL requests by way of the JDBC framework.
- A database tier, such as a database management system (e.g. a MySQL server [5]), that manages application data.

The main motivations for clustering are scalability and fault-tolerance. Scalability is a key issue for web applications that must serve billions of requests a day. Fault-tolerance does not apply only to popular sites, even if it is also required in that case, but to all applications where information delivery is critical (commercial web sites for example). Both scalability and fault-tolerance are obtained through replication (and consistency management for the latter). In the case of J2EE applications, database replication provides the application with service availability when machine failures occur, as well as efficiency, by load balancing incoming requests between replicas.

The global architecture of clustered J2EE applications is depicted in Figure 2 and detailed below in the case of an {Apache, Tomcat, JoNAS, MySQL} cluster.

Apache clustering is managed through HTTP load balancing mechanisms that can involve hardware and/or software helpers. We cite below some well-known general-purpose techniques [6] that apply to any kind of web server:

• Level-4 switching, where a high-cost dedicated router can distribute up to 700000 simultaneous TCP connections over the different servers
• RR-DNS (Round-Robin DNS), where a DNS server periodically changes the IP address associated to the web site hostname
• Microsoft's Network Load Balancing or Linux Virtual Server, which use modified TCP/IP stacks allowing a set of hosts to share a same IP address and cooperatively serve requests
• TCP handoff, where a front-end server establishes TCP connections and lets a chosen host directly handle the related communication.

Tomcat clustering is achieved by using the load balancing feature of Apache's mod_jk plugin. Each mod_jk can be configured to balance requests over all or a subset of the Tomcat instances, according to a weighted round-robin policy (sketched below).

No common mechanism exists to manage business logic tier replicas, but ad hoc techniques have been defined. For example, JoNAS clustering can be achieved by using a dedicated "cluster" stub instead of the standard RMI stub in Tomcat in order to interact with EJBs. This stub can be seen as a collection stub that manages load balancing, assuming that, whatever the JoNAS instance where a bean has been created, its reference is bound in all JNDI registries.
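The weighted round-robin policy that mod_jk applies to its workers can be reduced to a few lines; the variant below is the classic smooth weighted round-robin, with worker names and weights invented for the example (real mod_jk additionally handles error states, session stickiness, and so on):

#include <stdio.h>

struct worker { const char *name; int weight; int credit; };

static struct worker workers[] = {
    { "tomcat1", 100, 0 },
    { "tomcat2",  50, 0 },
};
#define NWORKERS (sizeof workers / sizeof workers[0])

/* Grant each worker credit in proportion to its weight, pick the
 * richest, then charge it the total credit granted this round. */
static struct worker *pick(void)
{
    int total = 0;
    struct worker *best = &workers[0];
    for (size_t i = 0; i < NWORKERS; i++) {
        workers[i].credit += workers[i].weight;
        total += workers[i].weight;
        if (workers[i].credit > best->credit) best = &workers[i];
    }
    best->credit -= total;
    return best;
}

int main(void)
{
    for (int req = 0; req < 6; req++)   /* tomcat1 gets 2 of every 3 */
        printf("request %d -> %s\n", req, pick()->name);
    return 0;
}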
Database clustering solutions often remain commercial, like Oracle RAC (Real Application Cluster) or DB2 cluster, and require using a set of homogeneous full replicas. We can however cite C-JDBC [7], an open source JDBC clustering middleware that allows using heterogeneous partial replicas while providing consistency, caching and load balancing.

The J2EE application life cycle consists of three main steps, detailed below: deployment, monitoring and reconfiguration.

[Figure 2. Clustered J2EE Applications Architecture: the end-user reaches an HTTP load balancer in front of replicated web tiers (mod_jk), presentation tiers (AJP13, JNDI + load-balancing replicas), business logic tiers (RMI, cluster stub, JNDI replicas) and database tiers (SQL via JDBC clustering middleware and JDBC replicas).]

Deployment. At the deployment step, tiers must first be installed on hosts and be configured to be correctly bound to each other. Then, application logic and data can be initialized. Application tiers are often delivered through installable packages (e.g. rpms) and the configuration is statically expressed in configuration files that statically map components to resources.

Monitoring. Once the application has been deployed on the J2EE cluster, one needs to know both the system and the application states to be aware of problems that may arise. The most common issues are due either to hardware faults, such as a node or network link failure, or to inappropriate resource usage, when a node or a tier of the application server becomes a bottleneck.

Reconfiguration. Once a decision has been taken (e.g., extension of a J2EE tier on new nodes to handle increased load), one must be able to perform the appropriate reconfiguration, avoiding as much as possible stopping the associated component.

2.2 Deployment, Monitoring and Reconfiguration Challenges
Currently, no integrated deployment environment exists for clustered J2EE applications. Each tier must be installed manually and independently. Likewise, the whole assembly, including the clustering middleware, must be configured manually, mainly through static configuration files (and there is also no configuration consistency verification mechanism). Consequently, the deployment and configuration process is a complex task to perform.

J2EE cluster monitoring is also a weakly offered feature. It is obviously possible to see host load or to use SNMP to track failures, but this is not enough to get pertinent information about application components.
There is no way to monitor an Apache web server and, even if JoNAS offers JMX interfaces to see what applications are running, the cluster administrator cannot gather load evaluations at the application level (but only the amount of memory used by the JVM). Finally, database servers usually do not offer monitoring features, except in a few commercial products.

In terms of reconfiguration, no dynamic mechanism is really offered. Only the Apache server is able to dynamically take into account configuration file changes; other tiers need to be stopped and restarted in order to apply low-level modifications.

In this context, in order to alleviate the burden of the application administrator, to take advantage of clustering and thus to be able to optimize performance and resource consumption, there is a crucial need for a set of tools:

- an automated deployment and configuration tool, that allows an entire J2EE application to be deployed and configured easily and in a user-friendly way,
- an efficient application monitoring service that automatically gathers, filters, and notifies events that are pertinent to the administrator,
- a framework for dynamic reconfiguration.

Research work that is directly related to this topic is provided by the Software Dock [8]. The Software Dock is a distributed, agent-based framework for supporting the entire software deployment life cycle. One major aspect of the Software Dock research is the creation of a standard schema for the deployment process. The current prototype of the Software Dock system includes an evolving software description schema definition. Abstractly, the Software Dock provides infrastructure for housing software releases and their semantic descriptions at release sites, and provides infrastructure to deploy or "dock" software releases at consumer sites. Mobile agents are provided to interpret the semantic descriptions provided by the release site in order to perform various software deployment life cycle processes. Initial implementations of generic agents for performing configurable content install, update, adapt, reconfigure, and remove have been created. However, the Software Dock deals neither with J2EE deployment nor with clustered environments.

3. JADE: J2EE APPLICATIONS DEPLOYMENT ENVIRONMENT
In this section, we present JADE, a deployment environment for clustered J2EE applications. We first give an overview of the architecture and follow with the example of the deployment of a benchmark application called RUBiS.

3.1 Architecture Overview
JADE is a component-based infrastructure which allows the deployment of J2EE applications in a cluster environment. As depicted in Figure 3, JADE is mainly composed of three levels, defined as follows:

[Figure 3. JADE Architecture Overview.]

Konsole level
JADE provides a GUI konsole which allows deploying software components on cluster nodes. As shown in Figure 3, each started component is managed through its own GUI konsole. The GUI konsole also allows managing existing configuration shells. In order to deploy software components, JADE provides a configuration shell language. The language introduces a set of deployment commands described as follows:

• "start daemon": starts a JADE daemon on a node
• "create": creates a new component manager
• "set": sets a component property
• "install": installs component directories
• "installApp": installs application code and data
• "start": starts a component
• "stop": stops a component.

The use of the configuration commands is illustrated in the RUBiS deployment use case in the Appendix. The shell commands are interpreted by the command invoker, which builds deployment requests and submits them to the deployer engine.

Deployment level
This level comprises the component repository, the deployment engine and the component managers:

- The repository provides access to several software releases (Apache, Tomcat, …) and associated component managers. It provides a set of interfaces for instantiating the deployment engine and component managers.
- The deployment engine is the software responsible for performing the specific tasks of the deployment process on the cluster nodes. The deployment process is driven by the deployment tools using interfaces provided by the deployment engine.
- The component manager allows setting component properties at launch time and also at run time.

Cluster level
The cluster level illustrates the components deployed and started on cluster nodes. At this stage, deployed components are able to be managed.

The JADE deployment engine is a component-based infrastructure. It provides the interface required to deploy the application on the required nodes. It is composed of a component factory and of component deployers on each node involved. When a deployment shell runs a script, it begins with the installation of component factories on the required nodes and then interacts with the factories to create component deployers. The shell can then execute the script, invoking the component deployers. The component factory exposes an interface to remotely create and destroy component managers. Component deployers are wrappers that encapsulate legacy code and expose an interface that allows installing tiers from the repository onto the local node, configuring the local installation, loading the application from the repository on the tiers, configuring the application, and starting/stopping the tiers and the application.

The JADE command invoker submits deployment and configuration requests to the deployment engine. Even if currently the requests are implemented as synchronous RMI calls to the deployment engine interface, other connectors (such as MOM) could easily be plugged in the future.
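JADE itself is written in Java; purely to picture the contract that each deployer wrapper exposes, the interface can be rendered as a table of entry points. Everything below (names, signatures, the stub wrapper) is invented for illustration:

#include <stdio.h>

struct deployer {
    const char *type;                       /* "apache", "tomcat", ... */
    int (*install)(const char *release);    /* fetch directories from repo */
    int (*set_prop)(const char *k, const char *v);
    int (*install_app)(const char *path, const char *app);
    int (*start)(void);
    int (*stop)(void);
};

/* Stub wrapper standing in for a wrapped legacy tier. */
static int w_install(const char *r)            { printf("install %s\n", r); return 0; }
static int w_set(const char *k, const char *v) { printf("set %s=%s\n", k, v); return 0; }
static int w_app(const char *p, const char *a) { printf("load %s from %s\n", a, p); return 0; }
static int w_start(void)                       { puts("start"); return 0; }
static int w_stop(void)                        { puts("stop");  return 0; }

int main(void)
{
    struct deployer tomcat = { "tomcat", w_install, w_set, w_app, w_start, w_stop };
    tomcat.install("tomcat-release");                        /* mirrors "install" */
    tomcat.set_prop("AJP13_PORT", "8009");                   /* mirrors "set" */
    tomcat.install_app("/repository/appli/tomcat", "rubis"); /* "installApp" */
    tomcat.start();                                          /* mirrors "start" */
    return 0;
}

The shell commands of the konsole level map onto such entry points, which is what lets one script drive heterogeneous tiers uniformly.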

A standard deployment script can perform the following actions: install the tiers, configure a tier instance, load the application on the tiers, configure the application, and start the tiers. An example deployment script is given in the Appendix. A standard undeployment script should stop the application and tiers and should uninstall all the artefacts previously installed.

3.2 RUBiS deployment scenario
RUBiS [9] provides a real-world example of the need for improved deployment support. This example is used to design a first basic deployment infrastructure. RUBiS is an auction site prototype modelled after eBay.com that is used to evaluate application design patterns and application server performance and scalability.

RUBiS offers different application logic implementations. It may take various forms, including scripting languages such as PHP that execute as a module in a web server such as Apache, Microsoft Active Server Pages that are integrated with Microsoft's IIS server, Java servlets that execute in a separate Java virtual machine, and full application servers such as an Enterprise Java Beans (EJB) server [22]. This study focuses on the Java servlets implementation.

Since we take the use case of RUBiS in a cluster environment, we depict a load balancing scenario. The Appendix presents a configuration involving two Tomcat servers and two MySQL servers. In this configuration, the Apache server is deployed on a node called sci40, the Tomcat servers are on nodes called sci41 and sci42, and finally the two MySQL servers are on nodes called sci43 and sci44.

4. TOWARDS A COMPONENT-BASED INFRASTRUCTURE FOR AUTONOMOUS SYSTEMS MANAGEMENT
The environment presented in the previous section is suitable for J2EE application deployment but, more generally, it can easily be adapted to apply to system management.

4.1 Overview of System Management
Managing a computer system can be understood in terms of the construction of system control loops as stated in control theory [10]. These loops are responsible for the regulation and optimization of the behavior of the managed system. They are typically closed loops, in that the behavior of the managed system is influenced both by operational inputs (provided by clients of the system) and by control inputs (inputs provided by the management system in reaction to observations of the system behavior). Figure 4 depicts a general view of control loops that can be divided into multi-tier structures including: sensors, actuators, notification transport, analysis, decision, and command transport subsystems.

[Figure 4. Overview of a Supervision Loop: sensors and actuators attached to the managed system are connected, through the notification transport and command transport subsystems, to the analysis and decision subsystems.]

Sensors locally observe relevant state changes and event occurrences. These observations are then gathered and transported by notification transport subsystems to appropriate observers, i.e., analyzers. The analysis assesses and diagnoses the current state of the system. The diagnosis information is then exploited by the decision subsystem, and appropriate command plans are built, if necessary, to bring the managed system behavior within the required regime. Finally, command transport subsystems orchestrate the execution of the commands required by the command plan, while actuators are the implementation of local commands.

Therefore, building an infrastructure for system management can be understood as providing support for implementing the lowest tiers of a system control loop, namely the sensors/actuators and notification/command transport subsystems. We consider that such an infrastructure should not be sensitive as to how the loop is closed at the top (analysis and decision tiers), be it by a human being or by a machine (in the case of autonomous systems).

In practice, a control loop structure can merge different tiers, or have trivial implementations for some of them (e.g., a reflex arc to respond in a predefined way to the occurrence of an event). Also, in complex distributed systems, multiple control loops are bound to coexist. For instance, we need to consider horizontal coupling, whereby different control loops at the same level in a system hierarchy cooperate to achieve correlated regulation and optimization of overall system behavior by controlling separate but interacting subsystems [11]. We also need to consider vertical coupling, whereby several loops participate, at different time granularities and system levels, in the control of a system (e.g., multi-level scheduling).
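Reduced to code, the loop of Figure 4 is only a few lines; every function below is an invented stand-in for a whole subsystem, and the notification and command transport tiers are elided:

#include <stdio.h>

static int  sense(void)       { return 87; }         /* e.g. CPU load, in % */
static int  analyze(int load) { return load > 80; }  /* diagnose overload */
static void actuate(int overloaded)
{
    if (overloaded) puts("decision: extend the tier on a new node");
}

int main(void)
{
    for (int tick = 0; tick < 3; tick++)  /* a real loop runs continuously */
        actuate(analyze(sense()));
    return 0;
}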

4.2 Beyond JADE
JADE is a first tool that has to be completed by others that provide the administration process with monitoring and reconfiguration.

Before this, a prerequisite is a cartography service that builds a comprehensive system model, encompassing all hardware and software resources available in the system. Instead of relying on a "manual" selection of the resources eligible for hosting tiers, the deployment process should dynamically map application components onto available resources by querying the cartography service, which maintains a coherent view of the system state. A component-based model is well suited to represent a system model. Each resource (node, software …) can be represented by a component. Composition manifests hierarchical and containment dependencies.

With such an infrastructure, a deployment description no longer needs to bind static resources but only needs to define the set of required resources. The architecture description might include an exact set of resources or just define a minimal set of constraints to satisfy. The cartography service can then inspect the system representation to find the components that correspond to the resources needed by the application. The deployment process itself consists of inserting the application components into the node component that contains the required resources. Finally, the application components are bound to the resources via bindings, to reflect the resource usage in the cartography. The effective deployment of a component is then performed consecutively, as the component model allows some processing to be associated with the insertion or removal of a sub-component.

Then, there is also a need for a monitoring service reporting the current cluster state so that appropriate actions can be taken. Such a monitoring infrastructure requires sensors and a notification transport subsystem.

Sensors can be implemented as components and component controllers that are dynamically deployed to reify the state of a particular resource (hardware or software). Some sensors can be generic and interact with resources through common protocols such as SNMP or JMX/RMI, but other probes are specific to a resource (e.g., a processor sensor). Deploying sensors optimally for a given set of observations is an issue. Sensors monitoring physical resources may have to be deployed where the resource is located (e.g., to monitor resource usage) or on remote nodes (e.g., for detecting node failures). Another direct concern about sensors is their intrusiveness on the system. For instance, the frequency of probing must not significantly alter the system behavior. In the case of a J2EE cluster, we have to deal with different legacy software for each tier. Some software, such as web or database servers, does not provide monitoring interfaces, in which case we have to rely on wrapping and indirect observations using operating system or physical resource sensors. However, J2EE containers usually provide JMX interfaces that offer a way to instrument the application server. Additionally, the application programmer can provide user-level sensors (e.g., in the form of JMX MBeans).

Notification transport is in charge of event and reaction dispatching. Once the appropriate sensors are deployed, they generate notifications to report the state of the resource they monitor. The notifications must be collected and transported to the observers and analyzers that have expressed interest in them. An observer can for instance be a monitoring console that will display the state of the system in a human-readable form. Different observers and analyzers may require different properties from the channel used to transport the notifications. An observer in charge of detecting a node failure may require a reliable channel providing a given QoS, while these properties are not required by a simple observer of the CPU load of a node. Therefore the channels used to transport the notifications should be configured according to the requirements of the concerned observers and analyzers. Typically, it should be possible to dynamically add, remove, or configure a channel between sensors and observers/analyzers.

To this effect, we have implemented DREAM (Dynamic REflective Asynchronous Middleware) [12], a Fractal-based framework to build configurable and adaptable communication subsystems, and in particular asynchronous ones. DREAM components can be changed at runtime to accommodate new needs such as reconfiguring communication paths, adding reliability or ordering, inserting new filters and so on. We are currently integrating various mechanisms and protocols in the DREAM framework to implement scalable and adaptable notification channels, drawing from recent results on publish-subscribe routing and epidemic protocols.

5. CONCLUSION AND FUTURE WORK
As the popularity of dynamic-content web sites increases rapidly, there is a need for maintainable, reliable and, above all, scalable platforms to host these sites. Clustered J2EE servers are a common solution used to provide reliability and performance. J2EE clusters may consist of several thousands of nodes; they are large and complex distributed systems and they are challenging to administer and to deploy. Hence there is a crucial need for tools that ease the administration and the deployment of these distributed systems. Our ultimate goal is to provide a reactive management system.

We propose JADE, a framework to ease J2EE application deployment. JADE provides automatic scripting-based deployment and configuration tools for clustered J2EE applications. We experimented with a simple configuration scenario based on a servlet version of an auction site (RUBiS). This experiment provides us with the necessary feedback and a basic component to develop a reactive management system, and it shows the feasibility of the approach. JADE is a first tool that provides a deployment facility, but it has to be completed to provide a full administration process with monitoring and reconfiguration.

We are currently working on several open issues for the implementation of our architecture: the system model and instrumentation for resource deployment, scalability and coordination in the presence of failures in the transport subsystem, and automating the analysis and decision processes for our J2EE use cases. We plan to experiment with JADE in other J2EE scenarios, including EJB (the EJB version of RUBiS). Our deployment service is a basic block for an administration system. It will be integrated in the future system management service.

6. References
[1] S. Allamaraju et al. Professional Java Server Programming J2EE Edition. Wrox Press, ISBN 1-861004-65-6, 2000.
[2] http://www.apache.org
[3] http://jakarta.apache.org/tomcat/index.html
[4] http://jonas.objectweb.org/
[5] http://www.mysql.com/
[6] http://www.onjava.com/pub/a/onjava/2001/09/26/load.html
[7] Emmanuel Cecchet and Julie Marguerite. C-JDBC: Scalability and High Availability of the Database Tier in J2EE environments. In the 4th ACM/IFIP/USENIX International Middleware Conference (Middleware), Poster session, Rio de Janeiro, Brazil, June 2003.
[8] R.S. Hall et al. An architecture for Post-Development Configuration Management in a Wide-Area Network. In the 1997 International Conference on Distributed Computing Systems.
[9] Emmanuel Cecchet, Anupam Chanda, Sameh Elnikety, Julie Marguerite and Willy Zwaenepoel. Performance Comparison of Middleware Architectures for Generating Dynamic Web Content. In Proceedings of the 4th ACM/USENIX International Middleware Conference (Middleware), Rio de Janeiro, Brazil, June 16-20, 2003.
[10] K. Ogata. Modern Control Engineering, 3rd ed. Prentice-Hall, 1997.
[11] Y. Fu et al. SHARP: An architecture for secure resource peering. In Proceedings of SOSP'03.
[12] Vivien Quéma, Roland Balter, Luc Bellissard, David Féliot, André Freyssinet and Serge Lacourte. Asynchronous, Hierarchical and Scalable Deployment of Component-Based Applications. In Proceedings of the 2nd International Working Conference on Component Deployment (CD'2004), Edinburgh, Scotland, May 2004.

7. APPENDIX

// start the daemons (i.e. the factories)
start daemon sci40
start daemon sci41
start daemon sci44
start daemon sci45

// create the managed components:
//     type   name    host
create apache apache1 sci40
create tomcat tomcat1 sci41
create tomcat tomcat2 sci42
create mysql  mysql1  sci43
create mysql  mysql2  sci44

// Configure the apache part
set apache1 DIR_INSTALL /users/hagimont/apache_install
set apache1 DIR_LOCAL /tmp/hagimont_apache_local
set apache1 USER hagimont
set apache1 GROUP sardes
set apache1 SERVER_ADMIN [email protected]
set apache1 PORT 8081
set apache1 HOST_NAME sci40
// bind to tomcat1
set apache1 WORKER tomcat1 8009 sci41 100
// bind to tomcat2
set apache1 WORKER tomcat2 8009 sci42 100
set apache1 JKMOUNT servlet

// Configure the two tomcat
set tomcat1 JAVA_HOME /cluster/java/j2sdk1.4.2_01
set tomcat1 DIR_INSTALL /users/hagimont/tomcat_install
set tomcat1 DIR_LOCAL /tmp/hagimont_tomcat_local
// provides worker port
set tomcat1 WORKER tomcat1 8009 sci41 100
set tomcat1 AJP13_PORT 8009
set tomcat2 DataSource mysql2
set tomcat2 JAVA_HOME /cluster/java/j2sdk1.4.2_01
set tomcat2 DIR_INSTALL /users/hagimont/tomcat_install
set tomcat2 DIR_LOCAL /tmp/hagimont_tomcat_local
// provides worker port
set tomcat2 WORKER tomcat2 8009 sci42 100
set tomcat2 AJP13_PORT 8009
set tomcat2 DataSource mysql2 ""

// Configure the two mysql
set mysql1 DIR_INSTALL /users/hagimont/mysql_install
set mysql1 DIR_LOCAL /tmp/hagimont_mysql_local
set mysql1 USER root
set mysql1 DIR_INSTALL_DATABASE /tmp/hagimont_database
set mysql2 DIR_INSTALL /users/hagimont/mysql_install
set mysql2 DIR_LOCAL /tmp/hagimont_mysql_local
set mysql2 USER root
set mysql2 DIR_INSTALL_DATABASE /tmp/hagimont_database

// Install the components
install tomcat1 {conf, doc, logs, webapps}
install tomcat2 {conf, doc, logs, webapps}
install apache1 {icons, bin, htdocs, cgi-bin, conf, logs}
install mysql1 {}
install mysql2 {}

// Load the application part in the middleware
installApp mysql1 /tmp/hagimont_mysql_local ""
installApp mysql2 /tmp/hagimont_mysql_local ""
installApp tomcat1 /users/hagimont/appli/tomcat rubis
installApp tomcat2 /users/hagimont/appli/tomcat rubis
installApp apache1 /users/hagimont/appli/apache Servlet_HTML

// Start all the components
start mysql1
start mysql2
start tomcat1
start tomcat2
start apache1

Highly Configurable Operating Systems for Ultrascale Systems∗

Arthur B. Maccabe and Patrick G. Bridges
Department of Computer Science, MSC01-1130
1 University of New Mexico
Albuquerque, NM 87131-0001
[email protected] [email protected]

Ron Brightwell and Rolf Riesen
Sandia National Laboratories
PO Box 5800; MS 1110
Albuquerque, NM 87185-1110
[email protected] [email protected]

Trammell Hudson
Operating Systems Research, Inc.
1729 Wells Drive NE
Albuquerque, NM 87112
[email protected]

ABSTRACT
Modern ultrascale machines have a diverse range of usage models, programming models, architectures, and shared services that place a wide range of demands on operating and runtime systems. Full-featured operating systems can support a broad range of these requirements, but sacrifice optimal solutions for general ones. Lightweight operating systems, in contrast, can provide optimal solutions at specific design points, but only for a limited set of requirements. In this paper, we present preliminary numbers quantifying the penalty paid by general-purpose operating systems and propose an approach to overcome the limitations of previous designs. The proposed approach focuses on the implementation and composition of fine-grained composable micro-services, portions of operating and runtime system functionality that can be combined based on the needs of the hardware and software. We also motivate our approach by presenting concrete examples of the changing demands placed on operating systems and runtimes in ultrascale environments.

1. INTRODUCTION
Due largely to the ASCI program within the United States Department of Energy, we have recently seen the deployment of several production-level terascale computing systems. These systems, for example ASCI Red, ASCI Blue Mountain, and ASCI White, include a variety of hardware architectures and node configurations. In addition to differing hardware approaches, a range of usage models (e.g., dedicated vs. space-shared vs. time-shared) and programming models (e.g. message-passing vs. shared-memory vs. global shared address space) have also been used for programming these systems.

In spite of these differences and other evolving demands, operating and runtime systems are expected to keep pace. Full-featured operating systems can support a broad range of these requirements, but sacrifice optimal solutions for general ones. Lightweight operating systems, in contrast, can provide optimal solutions at specific design points, but only for a limited set of requirements.

In this paper, we present an approach that overcomes the limitations of previous approaches by providing a framework for configuring operating and runtime systems tailored to the specific needs of the application and environment. Our approach focuses on the implementation and composition of micro-services, portions of operating and runtime system functionality that can be composed together in a variety of ways. By choosing appropriate micro-services, runtime and operating system functionality can be customized at build time or runtime to the specific needs of the hardware, system usage model, programming model, and application.

The rest of this paper is organized as follows: section 2 describes the motivation for our proposed system, including the hardware and software architectures of current terascale computing systems and the challenges faced by operating systems on these machines, and presents preliminary numbers and experiences to outline the scale of this problem. It also presents several motivating examples that are driving our design efforts. Section 3 describes the specific challenges faced by operating systems in ultrascale environments, and section 4 presents our approach to addressing these challenges. Section 5 describes related operating system work, and section 6 concludes.

∗This work was supported in part by Sandia National Laboratories. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract DE-AC04-94AL85000.

2. MOTIVATION
2.1 Current and Future System Demands
Modern ultrascale systems, for example the various ASCI machines and the Earth Simulator, have widely varying system-level and node-level hardware architectures. The first terascale system, ASCI Red, is a traditional distributed-memory massively parallel processing machine – thousands of nodes, each with a small number of processors (2). In contrast, the ASCI Blue Mountain machine was composed of 128-processor nodes, while ASCI White employs 16-way SMP nodes. We also expect additional hardware advances such as multi-core chips and processor-in-memory chips to be available in similar systems in the near future.

In addition to hardware, the approach from a programming model standpoint has varied as well. The lightweight compute node operating system on ASCI Red does not support a shared-memory programming model on individual compute nodes, while the other platforms support a variety of shared memory programming constructs, such as threads and semaphores. This has led to the development of mixed-mode applications that combine MPI and OpenMP (or pthreads) to fully utilize the capabilities of systems with large numbers of processors per node. Applications have also been developed for these platforms that extend the boundaries of a traditional programming model. The distributed implementation of the Python scripting language is one such example [14]. Advanced programming models, such as the Global Address Space model, are also gaining support within the parallel computing community.

Even within the context of a specific programming model such as MPI, applications can have wide variations in the number and type of system services they require and can also have varying requirements for the environment in which they run. For example, the Common Component Architecture assists in the development of MPI applications, but it requires dynamic library services to be available to the individual processes within the parallel job. Environmental services, such as system-level checkpoint/restart, are also becoming an expected part of the standard parallel application development environment.

The usage model of these large machines has also expanded. The utility of capacity computing, largely driven by the ubiquity of commodity clusters, has led to changes in the way in which large machines are partitioned and scheduled. Machines that were originally intended to run a single, large parallel simulation are being used more frequently for parameter studies that require thousands of small jobs.

2.2 Problems with Current Approaches
General-purpose operating systems such as Linux provide a wide range of services. These services and their associated kernel structures enable sophisticated applications with capabilities for visualization and inter-networking. This generality unfortunately comes at the cost of performance for all applications that use the operating system, because of the overheads of unnecessary services.

In an initial attempt to measure this performance difference, we compared the performance of the mg and cg NAS class B benchmarks on ASCI Red hardware [21] when running two different operating systems. We use Cougar, the productized version of the Puma operating system [26], as the specialized operating system, and Linux as the general-purpose operating system. To make the comparison as fair to Linux as possible, we have ported the Cplant version of the Portals high-performance messaging layer [1] to the ASCI Red hardware. Cougar already utilizes Portals for message transmission.

[Figure 1: CG Performance on Linux and Cougar on ASCI/Red Hardware – millions of operations per second versus number of processors (4 to 128) for Cougar and Linux/Portals.]

[Figure 2: MG Performance on Linux and Cougar on ASCI/Red Hardware – millions of operations per second versus number of processors (4 to 128) for Cougar and Linux/Portals.]

Figures 1 and 2 show the performance of these benchmarks when running on the two operating systems. Linux outperforms Cougar on the cg benchmark with small numbers of nodes because Cougar uses older, less optimized compilers and libraries, but as the number of nodes used increases, application performance on Linux falls off. Similar effects occur on the mg benchmark, though mg on Cougar outperforms mg on Linux even on small numbers of nodes despite using older compilers and libraries. A variety of different overheads cause Linux's performance problems on larger-scale systems, including the lack of a contiguous memory layout and the associated TLB overheads, and suboptimal node allocations due to limitations with Linux job launch on ASCI Red.

Such operating system problems have also been seen in other systems. Researchers at Los Alamos, for example, have shown that excess services can cause dramatic performance degradations [17]. Similarly, researchers at Lawrence Livermore National Laboratory have shown that operating system scheduling problems can have a large impact on application performance in large machines [13].

2.3 Motivating Examples
The changing nature of demands on large scale systems presents some of the largest challenges to operating system design in this environment. We consider changing demands in several areas, along with specific examples from each area, to motivate our work.

2.3.1 Changing Usage Models.
As large-scale systems age, they frequently transition from specialized capability-oriented usage for a handful of applications to capacity usage for a wide range of applications. Operating systems for capability-oriented systems often provide a restricted usage model (dedicated or space-shared mode) and need to provide only minimal services, allowing more operating system optimizations. Operating systems for capacity-oriented systems, in contrast, generally support much more flexible usage models, such as timesharing, and must provide additional services including TCP/IP inter-networking and dynamic process creation.

2.3.2 Changing Application Demands.
Applications have varying demands for similar operating system services depending on their needs. Correctly customizing these services can have a large impact on application performance. As a concrete example, consider four different ways for a signal to be delivered to an application indicating the receipt of a network packet (the first two are contrasted in the sketch following this list):

• Immediate delivery using interrupts (e.g., UNIX signals) for real-time or event-driven applications
• Coalescing of multiple signals and waiting until some other activity (e.g., an explicit poll or quantum expiration) causes an entry into the kernel, thereby minimizing signal handling overhead
• Extending the kernel with application-specific handler code for performance-critical signals
• Forking a new process to handle each new signal/packet (e.g., inetd in UNIX)

2.3.3 Changing Hardware Architecture.
Operating system structure can present barriers to hardware innovation for ultrascale systems. Operating systems must be customized to present novel architectural features to applications and to make effective use of new hardware features themselves. Existing operating systems such as Linux assume that each machine is similar to a standard architecture, the Intel x86 architecture in the case of Linux, and in doing so limit their ability to expose innovative architectural features to the application or to use such features to optimize operating system performance. The inability of current operating systems to do so presents a significant impediment to hardware innovation.

Consider, for example, operating system support for parcel-based processor-in-memory (PIM) systems [22]. Operating systems for such architectures must be flexible enough to perform scheduling and resource allocation on these architectures and to make effective use of this hardware for their own purposes. We specifically consider the use of a PIM as a dedicated OS file cache that makes its own prefetching, replacement, and I/O coalescing decisions. Processes that access files would send parcels to this PIM, which could immediately satisfy them from a local cache, coalesce small writes together before sending the request on to the main I/O system, or aggressively prefetch data based on observed access patterns. Doing such work in a dedicated PIM built for handling latency-sensitive operations would free the system's heavyweight (e.g. vector) processors from having to perform the latency-oriented services common in operating systems.

2.3.4 Changing Environmental Services.
Finally, consider the variety of shared environmental services that operating systems must support, such as file systems and checkpointing functionality. New implementations of these services are continually being developed, and these implementations require changing operating system support. As just one example, the Lustre file system [2] is currently being developed to replace NFS in ultrascale systems. Lustre requires a specific message-passing layer from the operating system (i.e., Portals), in contrast to the general networking necessary to support NFS, but in return provides much better performance and scalability. Similarly, checkpointing services require a means to determine operating system state and network quiescence. Finally, these services are often implemented at user level in lightweight operating systems; in these cases, the operating system must provide a way to authenticate trusted shared services to applications and other system nodes.

3. CHALLENGES
The processing resources for ultrascale systems will likely be partitioned based on functional needs [7]. These partitions will most likely include: a service partition to provide general services including application launch and compilation; an I/O partition providing shared file systems; a network partition that provides communication with other systems; and a compute partition that provides the primary computational resources for an application. In this work, we are primarily interested in the operating system used in the compute partition, the compute node operating system.

Like any operating system, the compute node operating system provides a bridge between the application and the architecture used to run the application. That is, the operating system presents the application with abstractions of the resources provided by the computing system. The form of these abstractions will depend on the nature of the physical resources and the way in which these resources are used by the application.

The compute node operating system will also arbitrate access to shared resources on the compute nodes, resolving conflicts in the use of these resources as needed. The need for this mediation will depend on the way in which the compute nodes are used, the system usage model. It may also need to provide special services (e.g., authentication) to support access to shared services (e.g., file systems or network services) that reside in other partitions.

Figure 3 presents a graphical interpretation of the five primary factors that influence the design of compute node operating systems for ultrascale computing systems. We include history in addition to the four factors identified in the preceding paragraphs: application needs, system usage models, architectural models (both system-level and node-level architectures), and shared services.

Shared Services Advanced programming models strive to provide a high- level abstraction of the resources provided by the comput- ing system. Describing computations in terms of abstract Operating History resources enhances portability and can reduce the amount System of effort needed to develop an application. While high-level abstractions offer significant benefits, application develop- System Usage ers frequently need to bypass the implementations of these abstractions for the small parts of the code that are time Architecture critical. For example, while the vast majority of the code in an application may be written in a high-level language (e.g., FORTRAN or C), it is not uncommon for application devel- opers to write a core calculation, such as a BLAS routine, Figure 3: Factors Influencing the Design of Operat- in assembly language to ensure an optimal implementation. ing Systems The crucial point is that the abstractions implemented to support advanced programming methodologies must allow application developers to drop through the layers of abstrac- 3.1 History tion as needed to ensure adequate performance. Because we Every operating system has a history and this history may are interested in supporting resource constrained applica- impact the feasibility of using the OS in new contexts. For tions, providing variable layers of abstraction is especially example, as a Unix-like operating system Linux assumes important. that all OS requests come from processes running on the local system. As the network has become a source of re- Finally, because the development of new programming mod- quests, Unix systems have adopted a daemon approach to els is an ongoing activity, the operating and runtime system handle these requests. In this approach, a daemon listens for must be designed so that it is relatively easy to develop incoming requests and passes them to the operating system. high-performance implementations of the features needed to In this context, inetd is a particularly interesting example. support a variety of existing programming models as well as Inetd listens for connection requests. When it receives a new new models that may be developed. connection request, inetd examines the request and, based on the request, creates a new process to handle the request. 3.3 System Usage Models That is, the request is passed through the operating system The system usage model defines the places where the princi- to inetd which calls the operating system to create a process pal computational resources can be shared by different users. to handle the request. While it might make more sense to Example usage models include: dedicated systems in which modify Unix to handle network requests directly, this would these resources are not shared; batch dedicated system in represent a substantial overhaul of the basic Unix request which the resources are not shared while the system is be- model. ing used, but may be used by different users at different times; space-shared systems in which parts of the system 3.2 Application Needs (e.g., compute nodes) are not shared, but multiple users Applications present challenges at two levels. First, applica- may be using different parts of the system at the same time; tions are developed in the context of a particular program- and time-shared systems in which the resources are being ming model. Programming models typically require a basic used by multiple users at the same time. set of services. 
For example, in the explicit message pass- ing model, it is necessary to allow for data to be moved Sharing requires that the operating system take on the role efficiently between local memory and the network. Second, of arbiter, ensuring that all parties are given the appropri- applications themselves may require extended functionality ate degree of access to the shared resources – in terms of beyond the minimal set needed to support the programming time, space, and privilege. The example usage models pre- model. For example, an application developed using a com- sented earlier are listed in roughly the order of operating ponent architecture may require system services to enable system complexity needed to arbitrate the sharing: dedi- the use of dynamic libraries. cated systems require almost no support for arbitrating ac- cess to resources, batch dedicated systems require that the While lightweight operating systems have been shown to usage schedule be enforced, space sharing requires that ap- support the development of scalable applications, this ap- plications running on different parts of the system not be proach places an undue burden on the application devel- able to interfere with one another, and timesharing requires oper. Given any feature typically associated with modern constant arbitration of the resources. In considering system operating systems (e.g., Unix sockets), there is at least one usage models the challenge is to provide mechanisms that application that could benefit from having the feature read- can support a wide variety of sharing policies, while ensur- ily available. In the lightweight operating system approach, ing that these mechanisms do not have any adverse impact the application developer is required to either implement on performance when they are not needed. the feature or do without. In fact, this is the reason that many of the terascale operating systems today are based on 3.4 Architectures full-featured operating systems. The real challenge is to pro- Architectural models present challenges at two levels: the vide features needed by a majority of applications without node level and the overall systems level. An individual com- pute node may exhibit a wide variety of architectural fea- 4.1 Micro-Services tures, including: multiple processors, support for PIM, mul- At a minimum, each application will need micro-services for tiple network interfaces, programmable network interfaces, managing the primary resources: memory, processor, com- access to local storage, etc. The key challenge presented munication, and file system. We can imagine several imple- by different architectures is the need to build abstractions mentations for each of these micro-services. One memory al- of the physical resources that match the resource abstrac- location service might perform simple contiguous allocation; tions defined by the programming model. If this is not done another might map physical page frames to arbitrary loca- efficiently, it could easily inhibit application scaling. tions in the logical address space of a process; another might provide demand page replacement; yet another may provide Beyond the need to provide abstractions of physical resources, predictive page replacement. A processor management ser- variations in systems-level architectures may require differ- vice may simply run a single process whenever a processor ent levels of operating system functionality on the compute is available, and another might include thread scheduling. nodes. 
In most cases, specialized hardware (e.g., PIMs) will require specialized OS functionality. However, hardware features may simplify OS functionality. As an example, Blue Gene/L supports the partitioning of the high speed communication network: compute nodes in different partitions cannot communicate with one another using the high speed network. If the partitions correspond to different applications in a space shared usage model, there is no need for the OS to arbitrate access to the high speed network. As this example illustrates, the interactions between architecture and usage models may not be trivial.

3.5 Shared Services
Finally, applications will need access to shared services, e.g., file systems. Unlike the resource sharing that is arbitrated by the node operating system, access to the shared resources (e.g., a disk drive) provided by a shared server is arbitrated by the server. In addition to arbitration, these servers may also require support for authentication and, using this authentication, provide access control to the logical resources (e.g., files) that they provide.

Here, the challenge is to provide the support required by the shared service. In some cases, this may be negligible. In other cases, the server may require that it be able to reliably determine the source of a message. In still other cases, the shared server may rely on the operating system to maintain user credentials in a secure fashion while an application is running, so that these credentials can be trusted by the shared file system.

4. APPROACH
In the context of the challenges described in the previous section, a "lightweight operating system" reflects a minimal set of services that meet the requirements presented by a small set of applications, a single usage model, a single architecture, and a single set of shared services. The Puma operating system [27], for example, represents a lightweight operating system with the following bindings: the application needs are limited to MPI and access to a shared file system, the system usage model is space sharing, the system architecture consists of thousands of simple compute nodes connected by a high performance network, and the shared services include a parallel file system which relies on Cougar to protect user identification.

Our goal is to develop a framework for building operating and runtime systems that are tailored to the specific requirements presented by an application, the system usage model, the system architecture, and the shared services. Our approach is to build a collection of micro-services and tools that support the automatic construction of a lightweight operating system for a specific set of circumstances.

4.1 Micro-Services
At a minimum, each application will need micro-services for managing the primary resources: memory, processor, communication, and file system. We can imagine several implementations for each of these micro-services. One memory allocation service might perform simple contiguous allocation; another might map physical page frames to arbitrary locations in the logical address space of a process; another might provide demand page replacement; yet another may provide predictive page replacement. A processor management service may simply run a single process whenever a processor is available, while another might include thread scheduling.

There may be dependencies and incompatibilities within the micro-services. As an example, a communication micro-service that assumes that logically contiguous addresses are physically contiguous (thus reducing the size of a memory descriptor) would depend on a memory allocation service that provides this type of address mapping. There will also be dependencies between micro-services and system usage models. For example, a communication service that provides direct access to a network interface would not be compatible with a usage model that supports time sharing on a node.

In addition to micro-services that provide access to primary resources, there will be higher-level services layered on top of the basic micro-services. As an example, one micro-service might provide garbage collected dynamic allocation, while another might provide first fit, explicit allocation and de-allocation (malloc and free) for dynamic memory allocation. Other examples include an RDMA service or a two-sided message service layered on top of a basic communication service.

Finally, we will need "glue" services: micro-services that enable combinations of other services. As an example, consider a usage model that supports general time-sharing among the applications on a node. Further, suppose that one of the applications to be run on a node requires a memory allocator that supports demand page replacement and another application requires a simple contiguous memory allocator. A memory compactor service would make it possible to run both applications on the same node.
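To make the dependency and incompatibility constraints concrete, the following C fragment encodes a tiny hypothetical catalog of micro-services with declared dependencies and usage-model conflicts, of the sort a construction tool might check when assembling a system. The service names and the representation are our own invention for illustration; the framework itself does not prescribe them.

```c
/* Hypothetical encoding of micro-service dependencies and conflicts.
 * Names are invented for illustration only. */
#include <stdio.h>
#include <string.h>

struct microservice {
    const char *name;
    const char *depends_on;   /* service that must also be selected */
    const char *conflicts;    /* usage model it cannot be combined with */
};

static const struct microservice catalog[] = {
    { "mem_contiguous",  NULL,              NULL },
    { "comm_flat_desc",  "mem_contiguous",  NULL },           /* assumes contiguity */
    { "comm_direct_nic", NULL,              "time_shared" },  /* direct NIC access */
    { "mem_demand_page", NULL,              NULL },
};

/* Reject a selection whose dependencies are unmet or that conflicts
 * with the chosen usage model. */
static int check(const char *selected[], int n, const char *usage_model) {
    for (int i = 0; i < n; i++) {
        const struct microservice *m = NULL;
        for (size_t j = 0; j < sizeof catalog / sizeof catalog[0]; j++)
            if (strcmp(catalog[j].name, selected[i]) == 0) m = &catalog[j];
        if (!m) return 0;
        if (m->conflicts && strcmp(m->conflicts, usage_model) == 0) return 0;
        if (m->depends_on) {
            int found = 0;
            for (int k = 0; k < n; k++)
                if (strcmp(selected[k], m->depends_on) == 0) found = 1;
            if (!found) return 0;
        }
    }
    return 1;
}

int main(void) {
    const char *sel[] = { "comm_direct_nic" };
    /* Rejected: direct NIC access conflicts with node-level time sharing. */
    printf("%s\n", check(sel, 1, "time_shared") ? "ok" : "rejected");
    return 0;
}
```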
4.2 Signal Delivery Example
To illustrate how our micro-services approach can be used to address the challenges presented by ultrascale systems, we consider the signal delivery example presented toward the end of Section 2. Because signal delivery may not be needed by all applications, the micro-services associated with signal delivery would be optional and, as such, would not have any performance impact on applications that did not need signal delivery.

For applications that do require signal delivery, we would need a collection of "signal detector" micro-services that are capable of observing the events of interest to the application (e.g., the reception of a message). These micro-services would most likely run as part of the operating system kernel. To ensure that they are run with sufficient frequency, the signal detector micro-services may place requirements on the micro-service used to schedule the processor.

The signal detector micro-services would then be tied to one of several specialized "signal delivery" micro-services. The specific signal delivery micro-service will depend on the needs of the application. An immediate delivery service would modify the control block for the target process so that the signal handler for the process is run the next time the process is scheduled for execution. A coalescing signal delivery service would simply record the relevant information and make it available to another micro-service that would respond to explicit polling operations in the application. A user defined signal delivery service could take a user defined action whenever an event is detected. Finally, a message delivery service could convert the signal information into data and pass it to the micro-service that is responsible for delivering messages to application processes. The runtime level could then include a micro-service that would read these messages and fork the appropriate process.
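As one concrete illustration, a coalescing signal delivery micro-service of the kind described above might record events in a per-process record that the application drains with an explicit polling operation. The sketch below is our own rendering of that idea in C11; all names are hypothetical.

```c
/* Sketch of a coalescing signal-delivery micro-service: a kernel-level
 * detector records events; the application polls for them explicitly.
 * All names are hypothetical. */
#include <stdatomic.h>
#include <stdio.h>

struct event_record {
    atomic_uint msg_arrivals;   /* coalesced count of message arrivals */
};

static struct event_record per_process_record;

/* Called by a "signal detector" micro-service when a message arrives. */
void detector_note_message(void) {
    atomic_fetch_add(&per_process_record.msg_arrivals, 1);
}

/* Explicit polling operation invoked by the application; returns the
 * number of events coalesced since the last poll and resets the count. */
unsigned poll_message_events(void) {
    return atomic_exchange(&per_process_record.msg_arrivals, 0);
}

int main(void) {
    detector_note_message();
    detector_note_message();
    printf("coalesced events: %u\n", poll_message_events());
    return 0;
}
```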

4.3 Tools
We cannot burden application programmers with all of the micro-services that provide the runtime environment for their applications. Application programmers should only be concerned with the highest-level services that they need (e.g., MPI) and the general goals for lower-level services. We envision the need to develop a small collection of tools to analyze application needs and to combine and analyze combinations of micro-services. Figure 4 presents our current thinking regarding the required tools.

[Figure 4: Building an Application Specific OS/Runtime — an application analysis tool processes the application, together with the available shared resources, into application requirements and shared resource requirements; these, with the micro-services, the system usage model, and the architecture, feed an OS/Runtime constructor that produces the OS/Runtime.]

As shown in Figure 4, the tool chain takes several inputs and produces an application specific OS/Runtime system. If the system usage model is timesharing, this OS/Runtime will be merged with the OS/Runtime needed by other applications that share the computing resources (this merging will most likely be done on a node-by-node basis). For other usage models, the resulting OS/Runtime will be loaded with the application when the application is launched.

The application analysis tool extracts the application specific requirements from an application. This tool will need to be cognizant of potential programming models and the optional features of these programming models. In addition, this tool will need to match the application's needs for shared services to the shared services that are available. The application analysis tool will produce two intermediate outputs: the application requirements and the requirements associated with the shared resources that are used by the application.

In a second step, these intermediate outputs will be combined with a specification of the system usage model, a specification of the underlying architecture, and the collection of micro-services to build an OS/Runtime that is tailored to the specific needs of the application. Here, we envision a tool that will take as input a set of the top-level services used by an application and produce a directed graph of the permissible lower-level services for the required runtime environment. Nodes of this graph will be weighted by the degree to which the micro-service represented by the node meets the goals of the application developer. We plan to base some of our work on tools for composing micro-services on existing tools, such as the Knit composition tool developed at the University of Utah in the context of the Flux project [19]. Other tools will be needed to select particular services in the context of a system usage model. These tools will also need to ensure that the services selected meet the sharing requirements of the system.
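The weighted service graph described above can be illustrated with a toy selection step: for each required top-level service, choose the permissible lower-level implementation with the greatest weight. The catalog, names, and weights below are invented for illustration and stand in for the output of the analysis tool.

```c
/* Toy illustration of the second step described above: given a required
 * service, choose among permissible lower-level implementations by
 * weight. The catalog and weights are invented for illustration. */
#include <stdio.h>
#include <string.h>

struct impl { const char *provides; const char *name; int weight; };

static const struct impl graph[] = {
    { "memory", "contiguous_alloc", 2 },  /* small descriptors, no paging */
    { "memory", "demand_paging",    5 },  /* suits this application best */
    { "comm",   "rdma_service",     4 },
    { "comm",   "two_sided_msgs",   3 },
};

static const char *select_impl(const char *service) {
    const char *best = NULL;
    int best_w = -1;
    for (size_t i = 0; i < sizeof graph / sizeof graph[0]; i++)
        if (strcmp(graph[i].provides, service) == 0 && graph[i].weight > best_w) {
            best = graph[i].name;
            best_w = graph[i].weight;
        }
    return best;
}

int main(void) {
    const char *needed[] = { "memory", "comm" };
    for (int i = 0; i < 2; i++)
        printf("%s -> %s\n", needed[i], select_impl(needed[i]));
    return 0;
}
```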

5. RELATED WORK
A number of other configurable operating systems have been designed, including microkernel systems, library operating systems, extensible operating systems, and component-based operating systems. In addition, configurability has been designed into a variety of different runtime systems and system software subsystems, including middleware for distributed computing, network protocol stacks, and file systems.

5.1 Configurable Operating Systems
Most standard operating systems such as Linux include a limited amount of configuration that can be used to add or remove subsystems and device drivers from the kernel. However, this configurability does not generally extend to core operating system functions, such as the scheduler or virtual memory system. In addition, the configuration available in many subsystems such as the network stack and the file system is coarse-grained and limited; entire networking stacks and file systems can be added or removed, but these subsystems cannot generally be composed and configured at a much finer granularity. In Linux, for example, the entire TCP/IP or Bluetooth stack can be optionally included in the kernel, but more fine-grained information about exactly which protocols will be used cannot easily be used to customize system configuration.

Other operating systems allow more fine-grained configuration. Component-based operating systems such as the Flux OSKit [5], Scout [15], THINK [4], eCos [18], and TinyOS [10] allow kernels to be built from a set of composable modules. Scout, for example, is built from a set of routers that can be composed together into custom kernels. The THINK framework is very similar to the framework we propose here. The primary differences are that we expect to build operating systems that are far more tailored to the needs of specific applications, and we do not expect to do much in the way of dynamic binding of services. eCos and TinyOS provide similar functionality in the context of embedded systems and sensor networks, respectively. The Flux OSKit provided a foundation for component-based OS development based on code from the Linux and BSD kernels, focusing particularly on allowing device drivers from these systems to be used in developing new kernels. Unlike our proposal, however, none of these systems has concentrated on customizing system functionality at the fine granularity necessary to take full advantage of new hardware environments or to optimize for the different usage models of ultrascale systems.
Microkernel and library operating systems such as L4 [8], exokernels [3], and Pebble [6], for example, allow operating system semantics to be customized at compile-time, boot-time, or run-time by changing the server or library that provides a service, though this composability is even more coarse-grained than in the systems described above. Such flexibility generally comes at a price, however; these operating systems may have to use more system calls and up-calls to implement a given service than a monolithic operating system, resulting in higher overheads. It can also result in a loss of cross-subsystem optimization opportunities. In contrast, our approach seeks to decompose functionality using more fine-grained structures and to preserve cross-subsystem optimization opportunities through tools designed explicitly for composing system functionality.

5.2 Configurable Runtimes and Subsystems
A variety of different systems have also been built that enable fine-grained configuration of system services, generally in the realm of protocol stacks and file systems. In contrast to our approach, none of these systems seeks to use configuration pervasively across an entire operating system.

Coarse-grained configuration of network protocol stacks has been explored in System V STREAMS [20], the x-kernel [12], and CORDS [23]. Composition in these systems is layer-based, with each component defining one protocol layer. Similar approaches have been used for building stackable file systems [9, 28].

More fine-grained composition of protocol semantics has been explored in the context of Cactus [11], Horus [24], Ensemble [25], and Rwanda [16]. Cactus's event-based composition model, in particular, has influenced our approach; in fact, we are using portions of the Cactus event framework to implement our system. To date the Cactus project has focused primarily on using event-based composition in network protocols, not on the more general operating system structures described in this paper.

6. CONCLUSIONS
In this paper, we have presented an argument for a framework for customizing an operating system and runtime environment for parallel computing. Based on the results of preliminary experiments, we conclude that the demands of current and future ultrascale systems cannot be addressed by a general-purpose operating system if high levels of performance and scalability are to be maintained and achieved. The current methods of using specialized lightweight approaches and generalized heavyweight approaches will not be sufficient given the challenges presented by current and future hardware platforms, programming models, usage models and application requirements. To address this problem, we presented a design for a framework that uses micro-services and supporting tools to construct an operating system and associated runtime environment for a specific set of requirements. This approach minimizes the overhead of unneeded features, allows for carefully tailored implementations of required features, and enables the construction of new operating and runtime systems to adapt to evolving demands and requirements.
7. REFERENCES
[1] R. Brightwell, T. Hudson, R. Riesen, and A. B. Maccabe. The Portals 3.0 message passing interface. Technical Report SAND99-2959, Sandia National Laboratories, December 1999.
[2] Cluster File Systems, Inc. Lustre: A Scalable, High-Performance File System, November 2002. http://www.lustre.org/docs/whitepaper.pdf.
[3] D. Engler, M. Kaashoek, and J. O'Toole. Exokernel: An operating system architecture for application-level resource management. In Proceedings of the 15th ACM Symposium on Operating Systems Principles, pages 251–266, Copper Mountain Resort, CO, 1995.
[4] J.-P. Fassino, J.-B. Stefani, J. Lawall, and G. Muller. THINK: A software framework for component-based operating system kernels. In Proceedings of the 2002 USENIX Annual Technical Conference, June 2002.
[5] B. Ford, G. Back, G. Benson, J. Lepreau, A. Lin, and O. Shivers. The Flux OSKit: A substrate for kernel and language research. In Proceedings of the 16th ACM Symposium on Operating Systems Principles, pages 38–51, Saint-Malo, France, 1997.
[6] E. Gabber, C. Small, J. Bruno, J. Brustoloni, and A. Silberschatz. The Pebble component-based operating system. In Proceedings of the 1999 USENIX Annual Technical Conference, pages 267–282, Monterey, CA, 1999.
[7] D. S. Greenberg, R. Brightwell, L. A. Fisk, A. B. Maccabe, and R. Riesen. A system software architecture for high-end computing. In SC'97: High Performance Networking and Computing, San Jose, CA, November 1997. ACM Press and IEEE Computer Society Press.
[8] H. Härtig, M. Hohmuth, J. Liedtke, S. Schönberg, and J. Wolter. The performance of µ-kernel-based systems. In Proceedings of the 16th ACM Symposium on Operating Systems Principles, 1997.
[9] J. Heidemann and G. Popek. File-system development with stackable layers. ACM Transactions on Computer Systems, 12(1):58–89, 1994.
[10] J. Hill, R. Szewczyk, A. Woo, S. Hollar, D. E. Culler, and K. S. J. Pister. System architecture directions for networked sensors. In Architectural Support for Programming Languages and Operating Systems, pages 93–104, 2000.
[11] M. A. Hiltunen, R. D. Schlichting, X. Han, M. Cardozo, and R. Das. Real-time dependable channels: Customizing QoS attributes for distributed systems. IEEE Transactions on Parallel and Distributed Systems, 10(6):600–612, 1999.
[12] N. Hutchinson and L. L. Peterson. The x-kernel: An architecture for implementing network protocols. IEEE Transactions on Software Engineering, 17(1):64–76, 1991.
[13] T. Jones, W. Tuel, L. Brenner, J. Fier, P. Caffrey, S. Dawson, R. Neely, R. Blackmore, B. Maskell, P. Tomlinson, and M. Roberts. Improving the scalability of parallel jobs by adding parallel awareness to the operating system. In Proceedings of SC'03, 2003.
[14] P. Miller. Parallel, distributed scripting with Python. In Third Linux Clusters Institute Conference, October 2002.
[15] D. Mosberger and L. L. Peterson. Making paths explicit in the Scout operating system. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 153–168, 1996.
[16] G. Parr and K. Curran. A paradigm shift in the distribution of multimedia. Communications of the ACM, 43(6):103–109, 2000.
[17] F. Petrini, D. Kerbyson, and S. Pakin. The case of the missing supercomputer performance: Achieving optimal performance on the 8,192 processors of ASCI Q. In Proceedings of SC'03, 2003.
[18] Red Hat. eCos. http://sources.redhat.com/ecos/.
[19] A. Reid, M. Flatt, L. Stoller, J. Lepreau, and E. Eide. Knit: Component composition for systems software. In Proceedings of the 4th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 347–360, 2000.
[20] D. M. Ritchie. A stream input-output system. AT&T Bell Laboratories Technical Journal, 63(8):311–324, 1984.
[21] Sandia National Laboratories. ASCI Red, 1996. http://www.sandia.gov/ASCI/TFLOP.
[22] T. L. Sterling and H. P. Zima. The Gilgamesh MIND processor-in-memory architecture for petaflops-scale computing. In International Symposium on High Performance Computing (ISHPC 2002), volume 2327 of Lecture Notes in Computer Science, pages 1–5. Springer, 2002.
[23] F. Travostino, E. Menze III, and F. Reynolds. Paths: Programming with system resources in support of real-time distributed applications. In Proceedings of the IEEE Workshop on Object-Oriented Real-Time Dependable Systems, 1996.
[24] R. van Renesse, K. P. Birman, R. Friedman, M. Hayden, and D. A. Karr. A framework for protocol composition in Horus. In Proceedings of the 14th ACM Principles of Distributed Computing Conference, pages 80–89, 1995.
[25] R. van Renesse, K. P. Birman, M. Hayden, A. Vaysburd, and D. A. Karr. Building adaptive systems using Ensemble. Software Practice and Experience, 28(9):963–979, 1998.
[26] S. R. Wheat, A. B. Maccabe, R. Riesen, D. W. van Dresser, and T. M. Stallcup. PUMA: An operating system for massively parallel systems. In Proceedings of the Twenty-Seventh Annual Hawaii International Conference on System Sciences, pages 56–65. IEEE Computer Society Press, 1994.
[27] S. R. Wheat, A. B. Maccabe, R. Riesen, D. W. van Dresser, and T. M. Stallcup. PUMA: An operating system for massively parallel systems. Scientific Programming, 3:275–288, 1994.
[28] E. Zadok and I. Badulescu. A stackable file system interface for Linux. In Proceedings of the 5th Annual Linux Expo, pages 141–151, Raleigh, NC, 1999.
Cluster Operating System Support for Parallel Autonomic Computing

A. Goscinski, J. Silcock and M. Hobbs
School of Information Technology
Deakin University, Geelong
Victoria, 3217, Australia
+61 3 5227 2088 / +61 3 5227 1378 / +61 3 5227 3342
[email protected] / [email protected] / [email protected]

ABSTRACT
The aim of this paper is to show a general design of autonomic elements and an initial implementation of a cluster operating system that moves parallel processing on clusters to the computing mainstream using the autonomic computing vision. The significance of this solution is as follows. Autonomic Computing was identified by IBM as one of computing's Grand Challenges. The human body was used to illustrate an Autonomic Computing system that possesses self-knowledge, self-configuration, self-optimization, self-healing and self-protection, knowledge of its environment, and user friendliness properties. One of the areas that could benefit from the comprehensive approach created by the autonomic computing vision is parallel processing on non-dedicated clusters. Many researchers and research groups have responded positively to the challenge by initiating research around one or two of the characteristics identified by IBM as the requirements for autonomic computing. We demonstrate here that it is possible to satisfy all Autonomic Computing characteristics.

Categories and Subject Descriptors
D.4.7 [Operating Systems]: Organization and Design – Distributed systems.

General Terms
Management, Design and Reliability.

Keywords
Cluster Operating Systems, Parallel Processing, Autonomic Computing.
1. INTRODUCTION
There is a strong trend in parallel computing to move to cheaper, general-purpose distributed systems, called non-dedicated clusters, that consist of commodity off-the-shelf components such as PCs connected by fast networks. Many companies, businesses and research organizations already have such "ready made parallel computers", which are often idle and/or lightly loaded not only during nights and weekends but also during working hours.

A review by Goscinski [9] shows that none of the research performed thus far has looked at the problem of developing a technology that goes beyond high performance execution and allows clusters and grids to be built that support unpredictable changes, provide services reliably to all users, and offer ease of use and ease of programming. Computer clusters, including non-dedicated clusters that allow the execution of both parallel and sequential applications concurrently, are seen as being user unfriendly, due to their complexity. Parallel processing on clusters is not broadly accessible and it is not used on a daily basis – parallel processing on clusters has not yet become a part of the computing mainstream. Many activities, e.g., the selection of computers, the allocation of computations to computers, and dealing with faults and with changes caused by adding and removing computers to/from clusters, must be handled (programmed) manually by programmers. Ordinary engineers, managers, etc. do not have, and should not have, the specialized knowledge needed to program operating system oriented activities. The deficiencies of current research in parallel processing, in particular on clusters, have also been identified in [19,6,33,2,31]. A similar situation exists in the area of Distributed Shared Memory (DSM). A need for an integrated approach to building DSM systems was advocated in [16].

We decided to demonstrate a possibility to address not only high performance but also ease of programming/use, reliability, and availability through proper reaction to unpredictable changes and through transparency, and developed the GENESIS cluster operating system that provides a SSI and offers services that satisfy these requirements [12]. However, to the end of 2001 there was no wider response to satisfy these requirements.

A comprehensive program to re-examine the "obsession with faster, smaller, and more powerful" and "to look at the evolution of computing from a more holistic perspective" was launched by IBM in 2001 [16,15]. Autonomic computing is seen by IBM [16] as "the development of intelligent, open systems capable of running themselves, adapting to varying circumstances in accordance with business policies and objectives, and preparing their resources to most efficiently handle the workloads we put upon them".

As has been stated above, we have been carrying out research in the area of building new generation non-dedicated clusters through the study of cluster operating systems supporting parallel processing. However, in order to achieve a truly effective solution we decided to synthesize and develop an operating system for clusters rather than to exploit a middleware approach. Our experience with GENESIS [12], which is a predecessor of Holos, demonstrated that incorporating many services (currently provided by middleware) into a single comprehensive operating system that exploits the concept of a microkernel has made using the system easy and has improved the overall performance of application execution. We are strongly convinced that the client-server and microkernel approaches lead to a better design of operating systems, which are not bloated, can be easily tailored to applications, and improve security and reliability. An identical line of thought was presented just recently in [23]. As a natural progression of our work we have decided to move toward autonomic computing on non-dedicated clusters.

The aim of this paper is to present the outcome of our work in the form of the designed services underlying autonomic non-dedicated clusters, and to show the Holos ('whole' in Greek) cluster operating system (the implementation of these services) that is built to offer autonomic parallel computing on non-dedicated clusters. The problem we faced was whether to present this new cluster operating system by showing its architectural vision or to introduce the system from the perspective of its matching the characteristics of autonomic computing systems. We decided to use the latter because it could "say" more to the wider audience. This approach also allows us to better convey a message of the novelty and contribution of the proposed system through individual elements of the grid and clustering technologies.

This paper is organized as follows. Section 2 shows related work and, in particular, demonstrates that there is no project/system that addresses all characteristics of autonomic computing. Section 3 presents the logical design of the autonomic elements and their services that must be created to provide parallel autonomic computing on non-dedicated clusters. Section 4 introduces the autonomic elements, presented in the previous section, implemented or being implemented as cooperating servers of the Holos cluster operating system. Section 5 concludes the paper and shows the future work.
2. RELATED WORK
IBM's Grand Challenge identifying Autonomic Computing as a priority research area has brought research carried out for many years on self-regulating computers into focus. We have long identified the lack of user friendliness as a major obstacle to the widespread use of parallel processing in distributed systems [10]. In 1993 Joseph Barrera discussed a framework for the design of self-tuning systems [3]. While IBM is advocating a "holistic" approach to the design of computer systems, much of the focus of researchers is upon failure recovery rather than uninterrupted, continuous, adaptable execution. The latter includes execution under varying loads as well as recovery from hardware and software failure.

A number of projects related to Autonomic Computing are mentioned by IBM in [16]. OceanStore (University of California, Berkeley) [29] is a persistent data store which has been designed to provide continuous access to persistent information for an enormous number of users. The infrastructure is made up of untrusted servers, hence the data is protected using redundancy and cryptography. Any computer can join the infrastructure by subscribing to one OceanStore service provider. Data can be cached anywhere, anytime, to improve the performance of the system. Information gained and analysed by internal event monitors allows OceanStore to adapt to changes in its environment such as regional outages and denial of service attacks.

The Recovery-Oriented Computing (ROC) project [30] is a joint Berkeley/Stanford research project that is investigating novel techniques for building highly-dependable Internet services. ROC focuses on the recovery of the system from failures rather than on their avoidance.

Anthill (University of Bologna, Italy) [1] is a framework to support the design, implementation and evaluation of peer-to-peer applications. Anthill exploits the analogy between Complex Adaptive Systems (CAS), such as biological systems, and the decentralized control and large-scale dynamism of P2P systems. An Anthill system consists of a dynamic network of peer nodes; societies of adaptive agents (ants) travel through this network, interacting with nodes and cooperating with other agents in order to solve complex problems. The types of P2P services constructed using Anthill show the properties of resilience, adaptation and self-organization.

Neuromation [25], Edinburgh University's information structuring project, involves the structuring of information based on human memory. The structure used would be suited for organizing information in an autonomic architecture; it is simple, homogeneous and self-referential.

The University of Freiburg's Multiagent Systems Project [24] revolves around the self-organized coordination of multiagent systems. This topic has some connections with Grid computing, especially economic coordination issues as in Darwin, Radar or Globus.

The Immunocomputing project [18] (International Solvay Institutes for Physics and Chemistry, Belgium) aims to use the principles of information processing by proteins and immune networks in order to solve complex problems while at the same time being protected from viruses, noise, errors and intrusions.

A Grid scheduling system developed at Monash University, called Nimrod-G [28], has been built to provide tools and services for solving coarse-grain task farming. Its resource broker/Grid scheduler has the ability to lease resources at runtime depending on their capability, cost, and availability.

In the Bio-inspired Approaches to Autonomous Configuration of Distributed Systems project [4] at University College London, bio-inspired approaches to the autonomous configuration of distributed systems (including a bacteria inspired approach) are being explored.

While many of these systems engage in some aspects of Autonomic Computing, none engages in research to develop a system which has all eight of the required characteristics. Furthermore, none of the projects addresses parallel processing, in particular parallel processing on non-dedicated clusters.

3. THE LOGICAL DESIGN OF AUTONOMIC ELEMENTS PROVIDING AUTONOMIC COMPUTING ON NON-DEDICATED CLUSTERS
According to Horn [15], an autonomic computing system could be described as one that possesses at least the following characteristics: knows itself; configures and reconfigures itself under varying and unpredictable conditions; optimizes its working; performs something akin to healing; provides self-protection; knows its surrounding environment; exists in an open (non-hermetic) environment; and anticipates the optimized resources needed while keeping its complexity hidden.

An autonomic computing system is a collection of autonomic elements, which can function at many levels: computing components and services, clusters within companies, and grids within entire enterprises. Each autonomic element is responsible for its own state, behavior and management, which satisfy the user objectives. These elements interact among themselves and with the surrounding environments. The objectives of individual components must be consistent with the objective of the whole set of cooperating elements [19].

We proposed and designed a set of autonomic elements that must be provided to develop an autonomic computing system supporting parallel processing on non-dedicated clusters. These elements are described in the following subsections.


3.1 Cluster knows itself
To allow a system to know itself there is a need for resource discovery. This autonomic element (service) is designed to run on each computer of the cluster and:

• Identifies its components, in particular computers, and their state;
• Acquires knowledge of static parameters of the whole cluster, in particular of computers, such as processor type, memory size, and available software; and
• Acquires knowledge of dynamic parameters of cluster components, e.g., data about computers' load, available memory, and communication pattern and volume.

Figure 1 shows an illustration of the outcome of the general design of this autonomic element (service). It depicts the Resource Discovery Service on each computer of the cluster obtaining information from the various local resources, such as processor loads, memory usage and communication statistics between both local and remote Computational Elements (CEs, or processes).

[Figure 1. Resource Discovery Service Design (CE – computation element)]
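The static and dynamic parameters enumerated above suggest a per-computer record that the resource discovery element refreshes periodically. The following C sketch is our own illustration — the field names, units, and the "lightly loaded" threshold are assumptions, not the system's actual data structures.

```c
/* Illustrative record of the static and dynamic parameters a Resource
 * Discovery element might collect per computer (field names invented). */
#include <stdio.h>

struct resource_record {
    /* static parameters */
    char     processor_type[32];
    unsigned memory_mb;
    /* dynamic parameters */
    unsigned ready_processes;      /* computation load */
    unsigned blocked_processes;
    unsigned free_memory_mb;
    unsigned msgs_sent_per_s;      /* communication pattern and volume */
    unsigned msgs_recv_per_s;
};

/* A computer is a candidate for a virtual cluster when it is idle or
 * lightly loaded; the threshold here is an arbitrary example. */
int lightly_loaded(const struct resource_record *r) {
    return r->ready_processes + r->blocked_processes < 4;
}

int main(void) {
    struct resource_record r = { "x86", 512, 1, 0, 300, 20, 15 };
    printf("candidate: %s\n", lightly_loaded(&r) ? "yes" : "no");
    return 0;
}
```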
It depicts the • Changing performance indices, which reflect user Resource Discovery Service on each Computer of the cluster objectives, among computation-oriented, and obtaining information from the various local resources such as communication-oriented applications should be provided; processor loads, memory usage and communication statistics • Computation element migration, creation and duplication between both local and remote Computational Elements (CEs or is exploited; processes). • Dynamic setting of computation priorities of parallel 3.2 Cluster configures and reconfigures itself applications, which reflect user objectives, is provided. In a non-dedicated cluster computers could become heavily The outcome of the general design of this autonomic element loaded. On the other hand there are time periods when some is shown in Figure 3. In this example the static allocation computers of a cluster are lightly loaded or even idle. Some component instantiates computational elements on selected computers cannot be used to support parallel processing of other computers, and the load balancing component migrates

3.4 Cluster should perform something akin to healing
Despite the fact that PCs and networks are becoming more reliable, hardware and software faults can occur in non-dedicated clusters. Failures in the system currently lead to the termination of computations. Many hours or even days of work can be lost if these computations have to be restarted from scratch. Thus, the system should be able to provide something akin to healing:

• Faults and their occurrence are identified and reported;
• Checkpointing of parallel applications is provided;
• Recovery from failures is employed;
• Migration of application computation elements from faulty computers to other, healthy computers that are located automatically is carried out; and
• Redundant/replicated autonomic elements are provided.

An illustration of the outcome of the general design of this autonomic element is given in Figure 4. (Fault detection is not shown in this figure.) Checkpoints are stored in the main memories of other computers of the virtual cluster for performance, and on disk for high reliability. A process is recovered after a fault by using one of the checkpoint copies on a selected computer or from disk.

[Figure 4. Self-Healing Service Design]
3.5 Cluster should provide self-protection
Computation elements of parallel and distributed applications run on a number of computers of a cluster. They communicate using messages. As such, they are subject to passive and active attacks. Thus, resources must be protected; applications/users must be authenticated and authorized, in particular when computation element migration is used; and communication security countermeasures must be applied. The design of an autonomic element providing self-protection includes:

• Virus detection and recovery;
• Resource protection based on access control lists and/or capabilities;
• Encryption, as a countermeasure against passive attacks; and
• Authentication, as a countermeasure against active attacks.

This autonomic element is the subject of our current design and will be addressed in another report.

3.6 To allow a system to know and work with its surrounding environment
There are applications that require more computation power, specialized software, unique peripheral devices, etc. Many owners of clusters cannot afford such resources. On the other hand, owners of other clusters and systems would be happy to offer their services and resources to appropriate users. Thus, to allow a system to know its surrounding environment, to prevent a system from existing in a hermetic environment, and to benefit from existing unique resources and services:

• Resource discovery of other similar clusters is provided;
• Advertising services to make a user's own services available to others is in place;
• The system is able to communicate/cooperate with other systems;
• Negotiation with service providers is provided;
• Brokerage of resources and services is exploited; and
• Resources should be made available/shared in a distributed/grid-like manner.

An example of a set of cooperating brokerage autonomic elements running on different clusters that illustrates some aspects of the designed autonomic element is shown in Figure 5.

[Figure 5. Grid-like Service Design]

3.7 A cluster should anticipate the optimized resources needed while keeping its complexity hidden
Until now the single factor limiting the harnessing of the computing power of non-dedicated clusters for parallel computing has been the scarcity of software to assist non-expert programmers. This implies a need for at least the following:

• A Single System Image, in particular where transparency is offered;
• A programming environment that is simple to use and does not require the user to see distributed resources; and
• Message passing and DSM programming that is supported transparently.

When these features are provided the complexity of a cluster is greatly reduced from the perspective of both the programmer and the user, hiding the complexities of managing the resources of a non-dedicated cluster and relieving the programmer from many of the system related functions.



4. THE HOLOS AUTONOMIC ELEMENTS FOR AUTONOMIC COMPUTING CLUSTERS
To demonstrate that it is possible to develop an easy to use autonomic non-dedicated cluster, we decided to implement the autonomic elements presented in Section 3 and build a new autonomic cluster operating system, called Holos. We decided to implement the autonomic elements as servers. Each computer of a cluster is a multi-process system with its objectives set up by its owners, and the whole cluster is a set of multi-process systems with its objectives set up by a super-user.

4.1 Holos architecture
Holos is being built as an extension to the GENESIS system [12]. Holos exploits the P2P paradigm and an object-based approach (where each entity has a name) supported by a microkernel [8]. The general architecture is shown in Figure 6. Holos uses a three level hierarchy for naming: user names, system names, and physical locations. The system name is a data structure which allows objects in the cluster to be identified uniquely and serves as a capability for object protection [11].

The microkernel provides services such as local inter-process communication (IPC), basic paging operations, interrupt handling and context switching. Other operating system services are provided by a set of cooperating processes. There are three groups of processes: kernel managers, system servers, and application processes. Whereas the kernel and system servers are stationary, application processes are mobile. All processes communicate using messages. Kernel managers are responsible for managing the resources of the operating system. The Process Manager, the Space (Memory) Manager, and the IPC Manager manage the Process Control Blocks (PCBs), memory regions, and IPC of processes, respectively. The Network Manager provides access to the underlying network and supports communication among remote processes. All the kernel managers support the system servers.

[Figure 6. The Holos operating system — the Space, Process, IPC and Network Managers (kernel managers) sit on the GENESIS microkernel.]
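Since every Holos service is reached by message passing rather than by system calls, a request to a kernel manager can be pictured as a small typed message and a matching reply. The sketch below illustrates that style of invocation; the message layout, the operation names, and the stand-in call_process_manager function are hypothetical, not the actual Holos protocol.

```c
/* Sketch of message-based service invocation: a request to a kernel
 * manager is a typed message rather than a system call. The layout and
 * names are hypothetical. */
#include <stdio.h>
#include <string.h>

enum op { OP_CREATE_PROCESS, OP_ALLOC_REGION };

struct request {
    enum op  op;
    char     image_path[64];  /* for OP_CREATE_PROCESS */
    unsigned region_pages;    /* for OP_ALLOC_REGION */
};

struct reply { int status; unsigned id; };

/* Stand-in for send()/receive() on the Process Manager's mailbox. */
static struct reply call_process_manager(const struct request *rq) {
    struct reply rp = { 0, 42 };          /* pretend the new PCB got id 42 */
    if (rq->op == OP_CREATE_PROCESS)
        printf("PM: create process from %s\n", rq->image_path);
    return rp;
}

int main(void) {
    struct request rq = { OP_CREATE_PROCESS, "", 0 };
    strcpy(rq.image_path, "/bin/worker");  /* hypothetical process image */
    struct reply rp = call_process_manager(&rq);
    printf("status=%d id=%u\n", rp.status, rp.id);
    return 0;
}
```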
The servers, which form the basis of an autonomic operating system for non-dedicated clusters, are as follows:

• Resource Discovery Server – collects data about computation and communication load, and supports the establishment of a virtual cluster;
• Availability Server – dynamically and adaptively forms a virtual cluster for the application;
• Global Scheduling Server – maps application processes onto the computers that make up the virtual cluster for the application;
• Execution Server – coordinates the single, multiple and group creation and duplication of application processes on both local and remote computers;
• Migration Server – coordinates moving an application process (or a set of application processes) from one computer to another computer or to a set of computers, respectively;
• DSM Server – hides the distributed nature of the cluster's memory and allows programmers to write their code as though using physically shared memory;
• Checkpoint Server – coordinates the creation of checkpoints for an executing application;
• Inter-Process Communication (IPC) Manager – supports remote inter-process communication and group communication within sets of application processes;
• File Server – supports both system and user level processes in accessing secondary storage, particularly the Execution Manager in the creation of processes, the Checkpoint Server in the storage of checkpoint data, and the Space Manager in the provision of paging; and
• Brokerage Server – supports resource advertising and sharing through the services of exporting, importing and revoking.

4.2 Holos possesses the autonomic computing characteristics
The sets of Holos servers that individually and in cooperation provide services satisfying IBM's Autonomic Computing requirements are specified in Table 1.

Table 1. Servers working together to carry out services of autonomic computing (Autonomic Computing requirement – cooperating Holos servers, i.e., the relationships among autonomic elements):

• To allow a system to know itself – Resource Discovery Server.
• A system must configure and reconfigure itself under varying and unpredictable conditions – Resource Discovery, Global Scheduling, Migration, Execution and Availability Servers.
• A system must optimize its working – Global Scheduling, Migration and Execution Servers.
• A system must perform something akin to healing – Checkpoint, Migration and Global Scheduling Servers.
• A system must provide self-protection – capabilities in the form of System Names.
• A system must know its surrounding environment – Resource Discovery and Brokerage Servers.
• A system cannot exist in a hermetic environment – Inter-Process Communication Manager and Brokerage Server.
• A system must anticipate the optimized resources needed while keeping its complexity hidden (most critical for the user) – DSM and Execution Servers; DSM, Message Passing and PVM/MPI Programming Environments.

The following subsections present the servers which provide the services that allow the operating system to be autonomic and to support autonomic parallel computing on non-dedicated clusters. As inter-process communication in Holos is the basis of all services, and in particular the basis of transparency, it is also presented.

4.3 Communication among parallel processes
To hide distribution and make remote inter-process communication look identical to communication between local application processes, we decided to build the whole of the operating system services of Holos around the inter-process communication facility. To programmers of standard and parallel applications, local and remote communication is indistinguishable, which forms a basis for complete transparency.

The IPC Manager is also responsible for both local and remote address resolution for group communication. Messages that are sent to a group require the IPC Manager to resolve the destination process locations and to provide the mechanism for the transport of the message to the requested group members. To support programmers, the Holos group communication facility allows processes to create, join, leave and kill a group, and supports different message delivery, response and message ordering semantics [31].
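An interface with the operations and semantics named above (create/join/leave/kill plus selectable delivery and ordering) might take roughly the following shape. The signatures are our own illustration, not the actual Holos API; the stubs only print what a real implementation would transport.

```c
/* Hypothetical shape of a group-communication interface with the
 * operations and semantics named above; not the actual Holos API. */
#include <stddef.h>
#include <stdio.h>

typedef int group_t;

enum delivery { DELIVER_ALL, DELIVER_K };   /* k-delivery, as used by checkpointing */
enum ordering { UNORDERED, FIFO_ORDER, TOTAL_ORDER };

group_t group_create(const char *name) { (void)name; return 1; }
int     group_join(group_t g)          { (void)g; return 0; }
int     group_leave(group_t g)         { (void)g; return 0; }
int     group_kill(group_t g)          { (void)g; return 0; }

/* Send to all members (or to at least k of them) with a chosen ordering. */
int group_send(group_t g, const void *buf, size_t len,
               enum delivery d, int k, enum ordering o) {
    (void)g; (void)buf; (void)o;
    printf("sent %zu bytes, k=%d\n", len, d == DELIVER_K ? k : -1);
    return 0;
}

int main(void) {
    group_t g = group_create("workers");   /* hypothetical group name */
    group_join(g);
    group_send(g, "hello", 5, DELIVER_K, 2, FIFO_ORDER);
    group_leave(g);
    return 0;
}
```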
4.4 Establishment of a virtual cluster for cluster self awareness
The Resource Discovery Server [12,26] and the Availability Server play a key role in the establishment of virtual clusters upon a cluster. The Resource Discovery Server identifies idle and/or lightly loaded computers and their resources (processor model, memory size, etc.); it collects both the computational load and the communication patterns for each process executing on a given computer, and provides this information to the Availability Server, which uses it to establish a virtual cluster.

The virtual cluster changes dynamically in time as some computers are removed or become overloaded and cannot be used as a part of the execution environment for a given parallel application, and some computers are added or become idle/lightly loaded and can become a component of the virtual cluster. The dynamic nature of the virtual cluster creates an environment which can address the requirements of an application that expands or shrinks as it executes.

The current Resource Discovery Server collects (using specially designed hooks installed in the microkernel and the Process, Space and IPC Servers) static parameters such as processor type and memory size, and dynamic parameters such as computation load (the number of processes in the ready and blocked states), available memory, and communication pattern and volume. We are enhancing this server through a study of how the data are collected and processed. We are also concentrating our efforts on the Availability Server: we study the identification of events that report computer and software faults; the addition and removal of computers to/from the cluster by an administrator and/or user; changes in computation load (the completion of a process, the creation of a new process) and in communication load (processes/computers communicating intensively or completing); and new requests to allocate/release computers for an application. This information is used in the development of adaptive algorithms for forming and reconfiguring virtual clusters.

4.5 Mapping parallel processes to computers for cluster self optimization
Mapping parallel processes to the computers of a virtual cluster is performed by the Global Scheduling Server. This server combines static allocation and dynamic load balancing components, which allow the system to provide the mapping by finding the best locations for the parallel processes of an application to be created remotely, and to react to large fluctuations in system load. The decision to switch between the static allocation and dynamic load balancing policies is dictated by the scheduling policy, which uses the information gathered by the Resource Discovery Server.
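The switch between the two policies could be driven by a simple rule over the load figures gathered by the Resource Discovery Server; the rule and threshold below are an invented example, not the scheduling policy the paper describes.

```c
/* Invented example of a rule for switching between static allocation
 * (placement at creation time) and dynamic load balancing (migration). */
#include <stdio.h>

enum policy { STATIC_ALLOCATION, DYNAMIC_LOAD_BALANCING };

/* Switch to migration-based balancing only when the measured load
 * spread across the virtual cluster exceeds a threshold (arbitrary). */
enum policy choose_policy(unsigned max_load, unsigned min_load) {
    return (max_load - min_load > 3) ? DYNAMIC_LOAD_BALANCING
                                     : STATIC_ALLOCATION;
}

int main(void) {
    printf("%d\n", choose_policy(9, 2));  /* prints 1: rebalance */
    return 0;
}
```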

Currently, the global scheduler is a centralized server. Our initial performance study of MPI computation-bound and communication-bound parallel applications on a 16-computer cluster shows that their concurrent execution with sequential applications (computation-bound, I/O-bound, and in between) leads to improved execution performance and makes the utilization of the whole cluster better. Sequential applications have demonstrated a very small slow-down, which in some cases could even be unnoticed and in other cases could be neglected [37,11]. In both cases of parallel applications, static allocation was employed to initially place parallel processes on cluster computers, together with dynamic load balancing.

4.6 Process creation
In Holos, each computer is provided with an EXecution (EX) Server, which is responsible for local process creation [13]. A local EX Server is capable of contacting a remote EX Server to create a remote process on its behalf. Currently, the remote process creation service employs multiple process creation, which concurrently creates n parallel processes on a single computer, and group process creation, which is able to concurrently create processes on m selected computers. These mechanisms are of great importance, for instance, for master-slave based applications, where a number of identical child processes is mapped to remote computers.

When a new process is to be created, the Global Scheduler instructs the EX Server to create the process locally or on a remote computer. In both instances, i.e., group remote process creation and individual creation, a list of destination computers is provided by the Global Scheduler, and the requests are forwarded to the EX Servers on the respective destination computers. A process is created from an image stored in a file. This implies a need to employ the File Server to support this operation. To achieve high performance of the group process creation operation, a copy of the file that contains the child image is distributed to the selected computers by the group communication facility.

4.7 Process duplication and migration
Parallel processes of an application can also be instantiated on selected computers of the virtual cluster by duplicating a process locally by the EX Server and, if necessary, migrating it to the selected computer(s) [14].

Migrating an application process involves moving the process state, address space, communication state, and any other associated resources. This implies that a number of kernel managers, such as the Process, Space, and IPC Servers, are involved in process migration. The Migration Server only plays a coordinating role [8]. Group process migration is also performed, i.e., a process can be concurrently migrated to n computers selected by the Global Scheduling Server.

4.8 Computation co-ordination for cluster self optimization
When a parallel application is processed on a virtual cluster, and where parallel processes are executing remotely, the application semantics require the operating system to transparently maintain: input and output to/from the user, the parent/child relationship, and any communication with remote processes. As all communication in Holos is transparent, input and output to/from a user and communication with remotely executing processes are transparent.

In Holos, the parent's origin computer manages all process "exits" and "waits" issued by the parent and its children. Furthermore, child processes in a parallel section of the program must co-ordinate their execution by waiting both for data allocation at the beginning of their execution and for the completion of the slowest process in the group, in order to preserve the correctness of the application as implied by a data consistency requirement. In the Holos system, barriers are employed for this purpose.

4.9 Checkpointing for cluster self healing
Checkpointing and fault recovery have been selected to provide fault tolerance. Holos uses coordinated checkpointing, which requires that non-deterministic events, such as processes interacting with each other, the operating system or the end user, be prevented during the creation of checkpoints. Under a microkernel-based architecture, operating system services are accessed by sending requests to operating system servers rather than directly through system calls. This makes it possible to prevent non-deterministic events by stopping processes from communicating with each other or with operating system servers during the creation of checkpoints. These messages are then included in the checkpoints of the sending processes to maintain the consistency of the checkpoints. Messages are dispatched to their destinations after all checkpoints are created.

To improve the performance of checkpointing, an approach is used that employs the main memory of another computer of the cluster rather than a centralized disk. However, as in this case the back-up computers can also fail, a checkpoint is stored on at least k computers, and k-delivery of group communication is used to support this operation.
Disk based checkpointing is also used, but the frequency of storing checkpoints on disk is much lower.

To control the creation of checkpoints, another process of Holos, the Checkpoint Server, is employed. This process is placed on each computer and invokes the kernel managers to create a checkpoint of the processes on the same computer [32]. The coordinating Checkpoint Server (on the computer where the application was originally created) directs the creation of checkpoints for a parallel application by sending requests to the remote Checkpoint Servers to perform the operations that are relevant to the current stage of checkpointing. To create a checkpoint of a process, each of the kernel managers must be invoked to copy the resources under its control.

Currently, fault detection and fault recovery are the subject of our research. The bases of this research are liveness checking and process migration, which detect failures and move a selected checkpoint to a specified computer, respectively. We are also developing and studying methods of recording and using information about the location of checkpoints within the relevant virtual cluster.
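In outline, the coordinator's side of the protocol described in this subsection could proceed as follows. The sketch compresses the roles of the kernel managers and Checkpoint Servers into print statements; the function names, the node count, and k are assumptions for illustration.

```c
/* Outline of the coordinating Checkpoint Server's role as described
 * above: quiesce communication, have every computer checkpoint, place
 * copies of each checkpoint in the memory of at least k other computers,
 * then release the buffered messages. All names are hypothetical. */
#include <stdio.h>

#define NODES 4
#define K     2   /* each checkpoint kept on at least k backup computers */

static void block_ipc(int node)        { printf("node %d: IPC held\n", node); }
static void take_checkpoint(int node)  { printf("node %d: checkpoint\n", node); }
static void replicate(int node, int k) { printf("node %d: %d replicas\n", node, k); }
static void release_ipc(int node)      { printf("node %d: IPC released\n", node); }

int main(void) {
    /* 1. prevent non-deterministic events while checkpoints are taken */
    for (int n = 0; n < NODES; n++) block_ipc(n);
    /* 2. each kernel manager copies the resources under its control */
    for (int n = 0; n < NODES; n++) take_checkpoint(n);
    /* 3. k-delivery group send places copies in remote main memories */
    for (int n = 0; n < NODES; n++) replicate(n, K);
    /* 4. buffered messages are dispatched after all checkpoints exist */
    for (int n = 0; n < NODES; n++) release_ipc(n);
    return 0;
}
```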

4.10 Brokerage (toward grids) for cluster self and surroundings' awareness

Brokerage and resource discovery have been studied to build basic autonomic elements allowing Holos services and applications to be offered both to other users working with Holos and to users of other systems [26].

A copy of a brokerage process runs on each computer of the cluster. Each Holos broker is a process that preserves user autonomy as in a centralized environment, and supports sharing by advertising services to make a user's own services available to other users, by allowing objects to be exported to other clusters or to be withdrawn from service, and by allowing objects that have been exported by users from other clusters to be imported.

The Holos broker supports object sharing among homogeneous clusters [26] and grids [27]. This implies that resources should be made available/shared in a distributed/grid-like manner. The test version of the broker was developed based on attribute names, in order to allow users to access objects without knowing their precise names.

4.11 Programming interface for user friendliness

Holos provides transparent communication services of standard message passing (MP) and DSM as its integral components. In this sub-section we present details of these common parallel programming communication mechanisms and how they are integrated transparently into the Holos system. The logical design of the communication services and how they interface to applications using MP and DSM is shown in Figure 7. This figure also shows the hierarchical relationship between the communication services and the system and kernel services.

[Figure 7. Easy Programming Service Design: PVM/MPI message passing and shared memory programming environments built on message-passing-based and DSM-based communication primitives, which in turn rest on the system services and kernel services of the operating system.]

4.11.1 Holos message passing

The standard MP service within the Holos parallel execution environment is provided by the Local IPC component of the microkernel and the IPC Manager, which is supported by the Network Manager. These combine to provide a transparent local and remote MP service, which supports both various qualities of service and group communication mechanisms. The standard MP and RPC primitives, such as send and receive, and call, receive and reply, respectively, are provided to programmers.

4.11.2 Holos PVM and MPI

PVM and MPI have been ported to Holos as they allow an advanced message-passing based parallel environment to be exploited [30,22]. Three modifications to PVM applications running on UNIX have been identified to improve performance: avoiding the use of XDR encoding where possible, using the direct IPC model instead of the default model, and balancing the load. The PVM communication is transparently provided by a service that is a mapping of the standard PVM services onto the Holos communication services, and it benefits from additional services (for example group process creation and process migration) that are not provided by operating systems such as Unix or Windows.

The move of MPI from UNIX to Holos has been achieved by replacing the two lower layers of MPICH with the services that Holos provides. These services are group communications, group process creation, process migration, and global scheduling, including static allocation and dynamic load balancing. Incorporating these services into MPI has shown promising results and provided a better solution for implementing parallel programming tools.

4.11.3 Distributed Shared Memory

Holos DSM exploits the conventional memory sharing approach (writing shared memory code using concurrent programming skills) by using the basic concepts and mechanisms of memory sharing management to provide DSM support [35].

One of the unique features of Holos DSM is that it is integrated into the memory management of the operating system. We decided to embody the DSM within the operating system in order to create a transparent, easy to use and easy to program environment and to achieve high execution performance of parallel applications. The options for placing the DSM system within the operating system were either to build it as a separate server or to incorporate it into one of the existing servers. The first option was rejected because of a possible conflict between two servers (the Space Manager and the DSM system) both managing the same object type, i.e., memory; synchronised access to the memory to maintain its consistency would become a serious issue. Since DSM is essentially a memory management function, the Space Manager is the server into which we decided the DSM system should be integrated. This implies that programmers are able to use the shared memory as though it were physically shared; hence, the transparency requirement is met. Furthermore, because the DSM system is in the operating system itself and is able to use low-level operating system functions, the efficiency requirement can be met.

The granularity of a shared memory object is a critical issue in the design of a DSM system. As the DSM system is placed within the Space Manager and the memory unit of the Holos Space is a page, the most appropriate unit of sharing for the DSM system is a page. The Holos DSM system employs the release consistency model (the memory is made consistent only when a critical region is exited), which is implemented using the write-update model [34].

In Holos DSM, synchronisation of processes that share memory takes the form of semaphore-type synchronisation for mutual exclusion. The semaphore is owned by the Space Manager on a particular computer. Because the ownership of the semaphore is controlled by the Space Manager on each computer, gaining ownership of the semaphore is still mutually exclusive when more than one DSM process exists on the same computer. Barriers are used in GENESIS to co-ordinate executing processes.

One of the most serious problems of current DSM systems is that they have to be initialised manually by programmers [5], [19]; transparency of this operation is not provided. In Holos, DSM is initialised automatically and transparently: machines are selected, processes created and data distributed automatically.
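To make the programming model concrete, the fragment below sketches how an application process might use the semaphore-style mutual exclusion and the barriers described above. The HolosDsm interface is hypothetical; the point is only that acquire/release delimit the critical region (release consistency) and that a barrier holds the group until its slowest member arrives.

    // Illustrative only: semaphore-type mutual exclusion and barriers over DSM.
    class DsmUsageSketch {
        interface HolosDsm {
            void acquire(int semaphore);  // ownership mediated by the Space Manager
            void release(int semaphore);  // memory made consistent on region exit
            void barrier(int group);      // wait for the slowest process in the group
        }

        static void updateShared(HolosDsm dsm, int[] shared, int myRank) {
            dsm.acquire(0);               // enter critical region
            shared[myRank] += 1;          // write to a transparently shared page
            dsm.release(0);               // write-update propagates the change
            dsm.barrier(1);               // co-ordinate the whole process group
        }
    }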

5. CONCLUSION

In this paper, autonomic computing has been shown to be feasible and able to move parallel computing on non-dedicated clusters into the computing mainstream. The autonomic elements have been designed and implemented as dedicated servers or as parts of other system servers. Together, the cooperating processes that employ these mechanisms offer self and surroundings discovery, the ability to reconfigure, self-protection, self-healing, sharing and ease of programming. The Holos autonomic operating system has been built as an enhancement of the Genesis system to offer an autonomic non-dedicated cluster. This system relieves developers from programming operating-system-oriented activities, and provides developers of parallel applications with both message passing and DSM. In summary, the development of the Holos cluster operating system demonstrates that it is possible to build an autonomic non-dedicated cluster.

This paper contributes to the area of autonomic computing, in particular parallel autonomic computing on non-dedicated clusters, by harnessing many technologies developed by the authors, and to the area of cluster operating systems by the development of a comprehensive cluster operating system that supports parallel computing and solves the problem of building virtual clusters that change dynamically and adaptively according to load and changing resources, in particular as computers are added to and removed from the cluster.

6. ACKNOWLEDGEMENTS

We would like to thank the anonymous COSET reviewers for the valuable feedback and comments they provided. These suggestions helped us greatly to improve this paper.

7. REFERENCES

[1] Anthill (University of Bologna, Italy). http://www.cs.unibo.it/projects/anthill/, (accessed 26 May 2004).

[2] Auban, J.M.B. and Khalidi, Y.A. (1997): Operating System Support for General Purpose Single System Image Cluster. Proc. Int'l Conf. on Parallel and Distributed Processing Techniques and Applications, PDPTA'97, Las Vegas.

[3] Barrera, J. (1993): Self-tuning systems software. Proc. Fourth Workshop on Workstation Operating Systems, 194-197.

[4] Bio-inspired Approaches to Autonomous Configuration of Distributed Systems (University College London). http://www.btexact.com, (accessed 6 May 2003).

[5] Carter, J. (1993): Efficient Distributed Shared Memory Based on Multi-Protocol Release Consistency. Ph.D. Thesis, Rice University.

[6] Cluster (2000): Cluster Computing White Paper, Version 2.0. M. Baker (Editor).

[7] De Paoli, D. and Goscinski, A. (1998): The RHODOS Migration Facility. Journal of Systems and Software 40:51-65.

[8] De Paoli, D. et al. (1995): The RHODOS Microkernel, Kernel Servers and Their Cooperation. Proc. First IEEE Int'l Conf. on Algorithms and Architectures for Parallel Processing, ICA3PP'95.

[9] Goscinski, A. (2000): Towards an Operating System Managing Parallelism of Computing on Clusters of Workstations. Future Generation Computer Systems: 293-314.

[10] Goscinski, A. and Haddock, A. (1994): A Naming and Trading Facility for a Distributed System. The Australian Computer Journal, No. 1.

[11] Goscinski, A. and Wong, A. (2004): The Performance of a Parallel Communication-Bound Application Executing Concurrently with Sequential Applications on a Cluster - Case Study. (To be submitted to) The 2nd Intl. Symposium on Parallel and Distributed Processing and Applications (ISPA-2004), Dec. 2004, Hong Kong, China.

[12] Goscinski, A., Hobbs, M. and Silcock, J. (2002): GENESIS: An Efficient, Transparent and Easy to Use Cluster Operating System. Parallel Computing.

[13] Hobbs, M. and Goscinski, A. (1999a): A Concurrent Process Creation Service to Support SPMD Based Parallel Processing on COWs. Concurrency: Practice and Experience, 11(13).

[14] Hobbs, M. and Goscinski, A. (1999b): Remote and Concurrent Process Duplication for SPMD Based Parallel Processing on COWs. Proc. Int'l Conf. on High Performance Computing and Networking, HPCN Europe'99, Amsterdam.

[15] Horn, P. (2001): Autonomic Computing: IBM's Perspective on the State of Information Technology.

[16] IBM (2001): IBM Corporation. http://www.research.ibm.com/autonomic/research, (accessed 26 May 2004).

[17] Iftode, L. and Singh, J. P. (1997): Shared Virtual Memory: Progress and Challenges. Technical Report TR-552-97, Department of Computer Science, Princeton University, October.

[18] Immunocomputing (International Solvay Institutes for Physics and Chemistry, Belgium). http://solvayins.ulb.ac.be/fixed/ProjImmune.html, (accessed 6 May 2003).

[19] Keleher, P. (1994): Lazy Release Consistency for Distributed Shared Memory. PhD Thesis, Rice University.

[20] Kephart, J. and Chess, D. (2003): The Vision of Autonomic Computing. Computer, Jan.

[21] Lottiaux, R. and Morin, C. (2001): Containers: A Sound Basis for a True Single System Image. Proc. First IEEE/ACM Int'l Symp. on Cluster Computing and the Grid, Brisbane.

[22] Maloney, A., Goscinski, A. and Hobbs, M.: An MPI Implementation Supported by Process Migration and Load Balancing. Recent Advances in Parallel Virtual Machine and Message Passing Interface: Proc. of the 10th European PVM/MPI User's Group Meeting, pp. 414-423, Springer-Verlag.

[23] McGraw, G. and Hoglund, G. (2004): Dire Straits: The Evolution of Software Opens New Vistas for Business and the Bad Guys. http://infosecuritymag.techtarget.com/ss/0,295796,sid6_iss366_art684,00.html, (accessed 26 May 2004).

[24] Multiagent Systems (Freiburg University). http://www.iig.uni-freiburg.de/~eymann/publications/, (accessed 26 May 2004).

[25] Neuromation (Edinburgh University). http://www.neuromation.com/, (accessed 26 May 2004).

[26] Ni, Y. and Goscinski, A. (1994): Trader Cooperation to Enable Object Sharing Among Users of Homogeneous Distributed Systems. Computer Communications, 17(3): 218-229.

[27] Ni, Y. and Goscinski, A. (1993): Resource and Service Trading in a Heterogeneous Distributed System. Proc. IEE Workshop on Advances in Parallel and Distributed Systems, Princeton.

[28] Nimrod-G (Monash University). http://www.gridbus.org/, (accessed 26 May 2004).

[29] OceanStore (University of California, Berkeley). http://oceanstore.cs.berkeley.edu/, (accessed 26 May 2004).

[30] Recovery-Oriented Computing (Berkeley/Stanford). http://roc.cs.berkeley.edu/, (accessed 26 May 2004).

[31] Rough, J. and Goscinski, A. (1999): Comparison Between PVM on RHODOS and Unix. Proc. Fourth Int'l Symp. on Parallel Architectures, Algorithms and Networks (I-SPAN'99), A. Zomaya et al. (Eds), Fremantle.

[32] Rough, J. and Goscinski, A. (2004): The Development of an Efficient Checkpointing Facility Exploiting Operating Systems Services of the GENESIS Cluster Operating System. Future Generation Computer Systems, 20(4):523-538.

[33] Shirriff, K. et al. (1997): Single-System Image: The Solaris MC Approach. Proc. Int'l Conference on Parallel and Distributed Processing Techniques and Applications, PDPTA'97, Las Vegas.

[34] Silcock, J. and Goscinski, A. (1997): Update-Based Distributed Shared Memory Integrated into RHODOS' Memory Management. Proc. Third Intl. Conference on Algorithms and Architectures for Parallel Processing, ICA3PP'97, Melbourne, Dec. 1997, 239-252.

[35] Silcock, J. and Goscinski, A. (1999): A Comprehensive DSM System That Provides Ease of Programming and Parallelism Management. Distributed Systems Engineering, 6: 121-128.

[36] Walker, B. (1999): Implementing a Full Single System Image UnixWare Cluster: Middleware vs. Underware. Proc. Int'l Conf. on Parallel and Distributed Processing Techniques and Applications, PDPTA'99.

[37] Wong, A. and Goscinski, A. (2004): Scheduling of a Parallel Computation-Bound and Sequential Applications Executing Concurrently on a Cluster - Case Study. (Submitted to) IEEE Intl. Conference on Cluster Computing, Sept. 2004, San Diego, California.

Type-Safe Object Exchange Between Applications and a DSM Kernel

R. Goeckelmann, M. Schoettner, S. Frenz and P. Schulthess

Department of Distributed Systems, University of Ulm, 89075 Ulm, Germany

[email protected]

Phone: ++49 731 50 24238; Fax: ++49 731 50 24142

Abstract: The Plurix project implements an object-oriented Operating System (OS) for PC clusters. Communication is achieved via shared objects in a Distributed Shared Memory (DSM), using restartable transactions and an optimistic synchronization scheme to guarantee memory consistency. We contend that coupling object orientation with the DSM property allows type-consistent system bootstrapping, quick system startup and simplified development of distributed applications. It also facilitates checkpointing of the system state. The OS (including kernel and drivers) is written in Java using our proprietary Plurix Java Compiler (PJC), which translates Java source code directly into Intel machine instructions. PJC is an integral part of the language-based OS and is tailor-made for compiling in our persistent DSM environment. In this paper we briefly illustrate the architecture of our OS kernel, which runs entirely in the DSM, and the resulting opportunities for checkpointing and for communication between applications and the OS. We also present issues of memory management related to the DSM kernel and strategies to avoid false sharing.

Keywords: Distributed Shared Memory, Object-Orientation, Reliability, Single System Image

1 Introduction

Typical cluster systems are built on top of traditional operating systems (OSs) such as Linux or Microsoft Windows, and data is exchanged using message passing (e.g. MPI) or remote invocation (e.g. RPC, RMI). As each node in a cluster runs its own OS with a different configuration, the migration of processes is difficult, since it is unknown which libraries and resources will be available on the next node. Additionally, if a process is migrated to another node, the entire context including relevant parts of the kernel state must be saved and transferred. Because these OSs are not designed for cluster operation it is difficult to migrate kernel contexts [Smile], and as a consequence cluster systems typically redirect calls of migrated processes back to the home node, e.g. Mosix [Mosix]. Plurix is an OS specifically tailored for cluster operation and avoids these difficulties.

Distributed Shared Memory (DSM) offers an elegant solution for distributing and sharing data among loosely coupled nodes [Keedy], [Li]. Applications running on top of the Plurix DSM are unaware of the physical location of objects: a reference can point either to a local or to a remote memory block. During program execution the OS detects a remote memory access and automatically fetches the desired memory block. Plurix extends the DSM to a distributed heap storage, with the benefit that not only data but also the code segments of programs are available on each node, as they are shared in the DSM.

One of our major research goals is to simplify the development of distributed applications. Typically, DSM systems use weak consistency models to guarantee the integrity of shared data. This makes the development of applications hard, as each programmer must explicitly manage the consistency of the data using the offered synchronization mechanisms [TreadMarks]. Plurix uses a strong consistency model, called transactional consistency [Wende02], relieving the programmer of explicit consistency management.

Single-System-Image (SSI) computing architectures have been a mainstay of high-performance computing for many years. In a system implementing the SSI concept, each user gains a global and uniform view of the available resources and programs. The system provides the same libraries and services on each node in the cluster, which is very important for load balancing and for the migration of processes. We extend the SSI concept by storing the OS, kernel, and all drivers in the DSM. As a consequence we can implement a type-safe kernel interface and at the same time simplify checkpointing and recovery.

In 1990 Wu and Fuchs introduced checkpointing and recovery for DSM systems [Fuchs90]. Numerous subsequent papers discuss the adaptation of checkpointing strategies designed for message-passing systems, ranging from globally coordinated solutions to independent checkpointing with and without logging [Morin97]. However, the more sophisticated solutions have not been evaluated in real implementations, because checkpointing is difficult to achieve in PC clusters even under global coordination: it is not sufficient to save the DSM context, the local kernel context needs to be saved as well, which is not trivial. Plurix avoids these drawbacks by storing the OS and the applications in the DSM.

The remainder of the paper is organized as follows. The design of Plurix is briefly presented in Section 2. We then describe the advantages of a type-safe kernel interface, followed by the benefits of running the kernel within the DSM. Extending the SSI concept provides additional advantages for checkpointing, which are described in Section 5. Finally, we present measurements and give an outlook on future work.

2 Design of Plurix

Plurix implements SSI properties at the operating system level, using a page-based distributed shared memory. According to the SSI concept, all programs and libraries must be available on all nodes in the cluster. Therefore Plurix uses a global address space shared by all nodes and organized as a distributed heap storage (DHS) containing both data and code. Sharing programs in the DHS reduces redundancy in code segments and makes the administration of the system easier.

2.1 Java-based Kernel and Operating System

Plurix is entirely written in Java and works in a fully object-oriented fashion. The development of an operating system requires access to device registers, which is not possible in standard Java. For this reason we have developed our own Plurix Java Compiler (PJC) with language extensions to support hardware-level programming. The compiler directly generates Intel machine instructions and initializes runtime structures and code segments in the heap. Traditional object, symbol, library and executable files are avoided. Each new program is compiled directly into the DHS and is thereby immediately available on each node.

Plurix is designed as a lean and high-speed OS and is therefore able to start quickly. The start time of the primary node, which creates a new heap (installation of Plurix) or restarts a preexisting heap from the PageServer (see Section 5.1), is less than one second. Additional nodes, which only have to join the existing heap, can be started in approximately 250 ms. This quick-boot capability guarantees fast node and cluster start-up times, which helps avoid long downtimes in case of critical errors.

2.2 Distributed Shared Memory

The transfer of DHS objects from one cluster node to another is managed within the page-based distributed shared memory (DSM) and takes advantage of the Memory Management Unit (MMU) hardware. The MMU detects page faults, which are raised when a node requests an object on a page that is not locally present. Each page fault results in a separate network packet containing the address of the missing page (a PageRequest). This packet is broadcast to all nodes in the cluster (Fast Ethernet LAN) and only the current owner of the page sends it to the requesting node.

An important topic in distributed systems is the consistency of shared and replicated objects. In Plurix this is synonymous with the consistency of the DSM. Plurix offers a new consistency model, called transactional consistency, which is described in the following section.
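The page-fault path can be pictured as follows; this is a simplified model with invented names, not Plurix source code.

    // Sketch of the DSM page-fault path: the fault handler broadcasts a
    // PageRequest and the current owner of the page answers.
    import java.util.Optional;

    class PageFaultSketch {
        interface Network {
            void broadcast(PageRequest req);          // to all nodes on the Fast Ethernet LAN
            Optional<byte[]> awaitPage(long address); // reply arrives from the page's owner
        }

        record PageRequest(long pageAddress) {}

        private final Network net;
        PageFaultSketch(Network net) { this.net = net; }

        byte[] handleFault(long faultingAddress) {
            long page = faultingAddress & ~0xFFFL;    // 4 KB page granularity
            net.broadcast(new PageRequest(page));
            return net.awaitPage(page).orElseThrow(); // map the page locally once it arrives
        }
    }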

2.3 Consistency and Restartability

Unlike traditional systems, Plurix does not burden the programmer with the consistency of the DHS objects. All actions in Plurix are encapsulated in transactions. At the start of a transaction, write access to pages is prohibited. If a page is written, the system creates a shadow image of it and then enables write access. Additionally, the system logs the pages for which shadow images were created. At the end of a transaction (the commit phase) the addresses of all modified pages are broadcast, and all partner nodes in the cluster invalidate these pages. If there is a collision with a running transaction on another node, that transaction is aborted and eventually restarted. In case of an abort, all modified pages are discarded; since there is a shadow image for each modified page, the system can reconstruct the state of the node just before the transaction was started. A token mechanism is used to ensure that only one node is in the commit phase at a time. The token is passed using a first-wins strategy; to improve fairness, further commit strategies will be developed.
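The following simplified model (assumed names, not the Plurix implementation) captures the essence of this scheme: the first write to a page creates a shadow image, commit broadcasts the invalidations, and abort restores the shadows so the transaction can be restarted.

    // Simplified shadow-paging transaction, modelled at page granularity.
    import java.util.HashMap;
    import java.util.Map;
    import java.util.function.Consumer;

    class TransactionSketch {
        private final Map<Long, byte[]> pages = new HashMap<>();   // page address -> contents
        private final Map<Long, byte[]> shadows = new HashMap<>(); // shadows of written pages

        void write(long page, byte[] data) {
            // First write to the page: keep a shadow copy (page assumed mapped).
            shadows.computeIfAbsent(page, p -> pages.get(p).clone());
            pages.put(page, data);
        }

        void commit(Consumer<Long> invalidateBroadcast) {
            shadows.keySet().forEach(invalidateBroadcast); // partners invalidate these pages
            shadows.clear();
        }

        void abort() {
            pages.putAll(shadows); // discard modifications: restore pre-transaction state
            shadows.clear();       // the transaction can now be restarted
        }
    }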

2.4 False Sharing and Backchain

All page-based DSM systems suffer from the notorious false-sharing syndrome. False sharing occurs when two or more nodes access separate objects which nevertheless reside on the same page. If a node writes to such an object, all other nodes are forced to abort their current transactions and restart them later. As these objects are not actually shared, such an abort is semantically unjustified and unnecessarily slows down the entire cluster. To handle this problem, relocation of DSM objects from one physical page to another is required. When an object is relocated, all pointers to this object must be adjusted. Due to the substantial network latency in the cluster environment, it is not feasible to inspect each object to see whether it contains a pointer to the relocated object. To adjust the affected references, Plurix uses the Backchain [Traub]. This concept links together all references to an object by recording the addresses of these pointers (see fig. 1); all references to a relocated DSM object are found in its Backchain. To reduce invalidations of remote objects when a new Backchain entry is inserted, references on the stack are not tracked in the Backchain.

[Figure 1. The Backchain Concept: the Backchain records the addresses of all pointers to a DSM object.]
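A minimal sketch of the Backchain idea, under assumed data structures: every memory slot holding a reference to an object is recorded, so relocation patches exactly the affected pointers without scanning the heap.

    // Sketch: relocation via the Backchain (invented types, not Plurix code).
    import java.util.ArrayList;
    import java.util.List;

    class BackchainedObject {
        long address;                                           // current DSM address
        final List<PointerSlot> backchain = new ArrayList<>();  // slots holding pointers to this object

        static class PointerSlot { long target; }               // one memory slot containing a reference

        void relocate(long newAddress) {
            for (PointerSlot slot : backchain) {                // only the recorded slots are touched
                slot.target = newAddress;
            }
            address = newAddress;
        }
    }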

2.5 Garbage Collection

The previously described Backchain concept can also be used to simplify distributed garbage collection (GC). A mark-and-sweep algorithm should not be used in a DHS environment, because it is either very difficult to implement (incremental mark-and-sweep) or it would stop the entire cluster while collecting garbage. Copying GC algorithms would unduly reduce the available address space; only reference-counting algorithms appear feasible. The Backchain can be used as a reference counter: if an object has an empty Backchain, no references to this object remain. This is equivalent to a reference count of 0, so in this case the object is garbage and can be deleted. Because stack references are not included in the Backchain, the GC may only run when the stack is empty. Between two transactions this condition is always true, so the GC task can be run as a regular Plurix transaction.
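Building on the sketch above, the garbage test then reduces to an emptiness check on the Backchain, guarded by the empty-stack condition:

    // Sketch: the Backchain doubles as a reference count for the GC.
    class BackchainGcSketch {
        static boolean isGarbage(BackchainedObject obj, boolean stackEmpty) {
            // Stack references are not tracked, so collection is only safe
            // when no stack frames are live, i.e. between two transactions.
            return stackEmpty && obj.backchain.isEmpty();
        }
    }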

3 A Type-Safe Interface for a DSM-Kernel

The SSI concept requires that all nodes in the cluster have the same programs installed. In a distributed environment the easiest way to achieve this goal is to share not only the data but also the code of the programs; for this reason Plurix extends the DSM to the DHS. In this case it is mandatory to protect the code segments from unwanted modification, whether by corrupted pointers or by malicious attacks. This can be achieved by using a type-safe language like Java; language-based OS development has been successfully demonstrated by the Oberon system [Wirth]. The requirement for type safety in the DSM also affects the interface to the OS. As data in the DSM is represented by objects and this data must be passed to the kernel, either the objects must be serialized before they are used as parameters or the kernel must be able to accept objects.

3.1 Traditional Kernel Interfaces

Traditionally, distributed systems are implemented as a middleware layer on top of a local OS, such as Linux or Mach, which is mostly written in C and therefore does not provide objects. The communication between the distributed system and the local OS takes place using primitive data types or structures. If the kernel cannot handle objects as such, they are serialized (and data items are copied) before being passed to the kernel. This kind of raw communication does not provide type checks for parameters and signatures by the runtime environment. Hence no type-safe calls of kernel methods are possible, and each kernel method has to explicitly check its parameters to avoid runtime errors.

3.2 Benefits of a Type-Safe Kernel Interface

To reduce programming complexity and to increase system performance, we recommend passing typed objects to the kernel. This was part of the motivation for creating Plurix as a stand-alone OS rather than as a middleware layer. Since the kernel of Plurix is written in Java and handles objects naturally, type-safe communication between the DSM applications and the OS comes naturally: all Java types and objects can be handed to kernel methods. The programmer need not pay attention to the type of the passed object, because this is checked by the compiler and, in some cases, by the runtime environment. Furthermore, there is no need to serialize objects used as parameters for kernel methods, which increases the performance of the entire system. Another benefit of using objects as parameters is that the object, and the data included in it, need not be copied: the kernel method obtains a reference and accesses the object directly, increasing system performance again.
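The contrast with a serializing interface can be illustrated with plain Java; the kernel class and method below are invented for exposition, but the mechanism, compile-time checked signatures and pass-by-reference, is exactly what the text describes.

    // Illustration: a typed kernel call with no serialization or copying.
    class FileBuffer { byte[] data = new byte[4096]; int length; }

    class KernelSketch {
        // The signature is checked at compile time; 'buf' is a direct
        // reference into the heap, so the kernel reads the object in place.
        static int write(FileBuffer buf) {
            return buf.length;
        }
    }

    class AppSketch {
        static void run() {
            FileBuffer buf = new FileBuffer();
            buf.length = 42;
            KernelSketch.write(buf);        // type-safe kernel call
            // KernelSketch.write("wrong"); // rejected by the compiler
        }
    }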

3.3 Inter Address Space Pointers

In traditional systems there are at least two different address spaces: one for the kernel and at least one for user applications. As the kernel methods are always needed on each node, the straightforward way of implementing the system would be to place the kernel in the local address space. These local addresses are not shared with other nodes, and each node in the cluster can use them in different ways. In this setting, the separation between kernel and user address space corresponds to the distinction between the local (Non-DSM) and the DSM address space. If objects are used as parameters in such an environment, some references will point from the Non-DSM into the DSM address space.

References which point from the Non-DSM into the DSM reduce the performance of the cluster, as they inhibit the relocation of objects and thereby prevent the avoidance of false sharing and memory fragmentation. The reason is that Backchain entries are no longer unambiguous when an object migrates to another node and is then relocated from one DSM address to another. If an object is referenced by a Non-DSM object, its Backchain leads into the local memory of a node. As addresses in local memory are not unique, the pointer cannot be adjusted, because it is not possible to detect which local memory area is specified by this Backchain entry. The correct reference to the object cannot be found, and adjusting the memory location specified by the Backchain would lead to invalid pointers or even destroyed code segments (see figure 2).


[Figure 2. Migration and subsequent relocation of a DSM object: a DSM object referenced by a Non-DSM object migrates to another node and is then relocated to another address, leaving the Backchain entry pointing into the wrong node's local memory.]

As long as DSM objects are relocatable, references from the Non-DSM into the DSM address space are not possible, as they could lead to dangling pointers or destroyed code segments. Alternatively, one could prevent the relocation of DSM objects that are referenced by Non-DSM objects. However, as it is not possible to specify which objects are used as parameters for kernel methods, nearly all objects in the DSM could then not be relocated, and the performance of the cluster would be impaired because false sharing and fragmentation of the memory could not be handled. Therefore direct pointers from the Non-DSM into the DSM address space must be avoided.

Another interesting question is how kernel methods can be called from DSM applications. Two alternative methods are conceivable:

1. Software interrupt: As in most traditional systems, kernel methods may be called using kernel or system calls. These are software interrupts which request a specific function from the kernel. If kernel calls are used to communicate between the DSM applications and the operating system there are no "address space spanning" pointers, but the question arises how to pass parameters from the DSM to the kernel, as the software interrupt itself cannot accept parameters. One possible solution is to pass data to the kernel through a fixed address. If an object is used as a parameter, this address would contain the pointer to the object. As each kernel method requires different parameters, this object must be of a generic type so that any object can be passed. Each kernel method then has to check whether the given object is type-compatible with the expected one, as this cannot be handled by the runtime environment. This raises the complexity for system programmers and makes the system vulnerable to faults, while simultaneously reducing the performance and the flexibility of parameter passing.

2. Object-oriented invocation: Kernel methods are invoked in an object-oriented fashion via direct pointers to the requested kernel class. This implies that all kernel classes and their methods have to reside at the same addresses on each node in the cluster, as each application can only have one pointer to a kernel class. Should they reside at different addresses, these references would point to invalid addresses and the corresponding kernel methods could not be called correctly on some nodes (see figure 3). If direct pointers are used, each node in the cluster must run the same kernel, and such a kernel can never be changed at runtime; otherwise all pointers in the applications which reference kernel methods would require adjustment. To achieve this, all kernel methods would need a Backchain pointing from the Non-DSM into the DSM, and thereby the problems described above would occur.

[Figure 3. Invalid reference to kernel methods: if kernel classes reside at different local addresses on different nodes, a shared pointer resolves correctly on one node but points to an invalid address on another.]

Both techniques give rise to an additional problem. The compiler runs in the DSM and any new program is automatically created in the DSM. If the new program is a device driver (which typically resides in kernel space), the code segments must be transferred from the DSM into the Non-DSM address space, and this must occur simultaneously on each node. Our implemented solution, which solves all the challenges above, is to remove the kernel from the local memory address space and move it into the DSM. Further benefits of this approach are described in the following section.

4 Extending the Single System Image Concept

We extend the SSI concept by moving the OS and the kernel into the DSM. The local memory is only used for a few state variables of the network device drivers and the network protocol, and for the so-called Smart-Buffers, which help to bridge the gap between non-restartable interrupts and transactions [Bindhammer].

4.1 Benefits of a kernel running in the DSM

If the kernel runs in the DSM, parameter passing between applications and the kernel is elegant and all objects can be used as parameters. Kernel methods are called directly, as described in Section 3.3, and there are no references pointing from one address space to the other. Since all device drivers now reside in the DSM, even the problem of transferring newly compiled drivers from the DSM into the kernel space vanishes. Because the code segments of the kernel methods are in the DSM, redundancy is avoided. Further benefits of this concept, especially for system checkpointing, are described in Section 5.

Some interesting questions surfaced when moving the kernel into the DSM, but before we describe these topics and our corresponding solutions we describe the memory management of Plurix and the allocation mechanism for new objects, as these are important for our solution.

4.2 Distributed Heap Management

A basic design decision of the Plurix system is the page-based DSM, which raises the false-sharing problem. The allocation strategy of the memory management must try to avoid false sharing wherever possible. Furthermore, collisions during the allocation of objects in the DHS must be avoided, as such a collision would abort other transactions and thereby serialize all allocations in the cluster. To achieve these goals, Plurix uses a two-stage allocation concept consisting of allocator objects and a central memory manager. The latter is needed because the memory has to be portioned out to the different nodes in the cluster. This division must not be static, as that would limit the maximum size of objects.

The memory manager is used to create allocators and large objects. As an allocator must be at least as large as the new object to be created, using allocators for large objects would lead to large allocators and thereby to a static fragmentation of the heap. The alternative, limiting the size of the allocators and thereby the maximum size of DHS objects, is unacceptable. Allocator objects represent a portion of empty memory. The size of an allocator is reduced by each allocated object; if it is exhausted, the allocator is discarded and a new one is requested from the central memory manager.

4.2.1 Allocation of Objects

If a new object is requested, the memory management first decides whether the object is created by the corresponding allocator or by the memory manager. This decision depends on the size of the object. Each object larger than 4 KB is directly allocated by the memory manager. To avoid false sharing on these objects, their size is increased to a multiple of 4 KB (the page granularity of the 32-bit Intel architecture). As all objects allocated by the memory manager have a size that is a multiple of 4 KB, each such object starts at a page border and consumes N pages; therefore these objects do not co-reside with other objects on the same page. Objects smaller than 4 KB are created by an allocator. As each node has its own allocator, collisions can only occur if a large object is allocated or if an allocator is exhausted and a new one must be created. The measurements in Section 6 show that most objects in Plurix are smaller than 4 KB, so large objects are rarely allocated and the collisions occurring during these allocations are tolerable most of the time.
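The decision logic reads roughly as follows; types and thresholds are simplified (we route objects of 4 KB and larger to the memory manager, following Figure 4), and this is not the actual Plurix allocator.

    // Sketch of the two-stage allocation decision.
    class AllocationSketch {
        static final int PAGE = 4096;

        interface MemoryManager { long allocatePages(int numPages); } // central, cluster-wide
        static class Allocator {                                      // node-private chunk of heap
            long next; long remaining;
            Long tryAllocate(int size) {
                if (size > remaining) return null; // exhausted: a new allocator must be requested
                long addr = next; next += size; remaining -= size;
                return addr;
            }
        }

        static long allocate(int size, Allocator local, MemoryManager mm) {
            if (size >= PAGE) {
                int pages = (size + PAGE - 1) / PAGE; // round up: object starts at a page border
                return mm.allocatePages(pages);       // never co-resides with other objects
            }
            Long addr = local.tryAllocate(size);      // no cluster-wide collision possible
            return addr != null ? addr : mm.allocatePages(1); // simplified refill path
        }
    }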


[Figure 4. Allocation of objects: node-private allocators serve objects smaller than 4 KB, while the central memory manager creates allocators and objects of 4 KB and larger.]

The benefit of the two-level allocation of objects is that small objects from one node are clustered in memory. As a consequence, collisions do not occur during the allocation of small objects and are rare when large objects are allocated. As large objects are not allocated within an allocator, the allocator's size can be limited without limiting the maximum size of objects. No static division of the memory is needed and therefore no static fragmentation is created.

4.2.2 Reduction of False Sharing

Generally speaking, objects can be divided into two categories: read-only (RO) and read-write (RW) objects. False sharing on RW objects is reduced by the mechanism described above. To further reduce false sharing, it is reasonable to make sure that RO objects, such as code segments and class descriptors without static variables, do not co-reside with RW objects on the same page, as this would lead to unnecessary invalidations of the RO objects due to false sharing. Code segments are only written during compilation by the compiler. If these objects were indiscriminately allocated, they could reside on the same pages as the RW objects of the node which is currently running the compiler. To avoid this, Plurix provides additional allocators for RO objects.

4.3 Protection of SysObjects

If the entire system is running in the DSM, some code segments and class instances must be protected against invalidation, as these objects are vital for the system. The objects which must always be present on a node are called SysObjects; these notably include all classes and instances concerning the page-fault handler, the DSM protocol and the network device drivers. As these objects reside in the DSM they might be affected by the transaction mechanism, and in case of a collision on such a page, the page would be discarded and the system would hang, as the node would no longer be able to request missing pages.

The protection of SysObjects against invalidation is easy to achieve, simply by defining two additional allocators. SysObjects are either code segments or instances of SysClasses. As described above, code segments are written only during compilation and are otherwise read-only. Additionally, code segments should not co-reside on the same pages as RW objects, as this would lead to false sharing; therefore a special allocator is used. The compiler creates new kernel classes in a different memory area; afterwards, update messages are sent to all nodes in the cluster to replace old classes and instances by new ones. It is sufficient to make sure that such an allocator is only used by the current compilation and that afterwards the remaining part of the last used page is consumed by a Dummy-SysObject.

RW-SysObjects are instances of SysClasses which are meaningful only to the node that created the instance. For this reason RW-SysObjects are not published through the global name service, so no other node can access a RW-SysObject. The only case where a RW-SysObject could be invalidated is as a result of false sharing. To prevent this, each node acquires a SysRW-Allocator during the boot phase. All instances of SysClasses are allocated in this private allocator, so that only SysObjects from one node reside on the same page. These two additional allocators and the described techniques for using them are sufficient to protect all SysObjects against invalidation at run time.

4.4 Local memory for State Variables

State variables of the DSM protocol and the network device drivers must outlast the abort mechanism, as they are needed to handle it. If they were reset, the current state of the protocol and of the network adapter would be lost; the network device driver would never be able to receive the next packet, as the receive-buffer pointer would also be reset. The protocol also contains a sequence number for messages, to make sure that no vitally important message is lost. If the state variables were reset, the protocol would receive messages "from the future"; in this case it would not be possible to decide whether a sequence number is invalid due to an abort or because the node has missed important network packets. As the protocol is not a device driver, its current state variables cannot be read from hardware registers, as is possible for normal (non-network) device drivers. Hence these variables must be stored outside the DSM address space.

For device drivers and the protocol, the kernel provides special memory areas in the local memory in which state variables are stored. To access these areas, Plurix provides "structs", which allow raw memory to be addressed much like the variables in an object. Structs are also used to access the memory-mapped registers of devices. As structs may not contain pointers and are not referenced by pointers, no problems with address space spanning pointers arise.

4.5 Restart of device drivers

In case of an abort, the state of the entire node is reset to the state just before the current transaction was started. Devices cannot be reset automatically, so the device driver programmer must implement an Undo-method, which is called by the system in case of an abort. This method has to ensure that both the state of the hardware and the state variables in the device driver object are reset. To make this possible, the state of all devices before the transaction must be conserved.

An example of such an Undo-method is the one for graphics controllers. Here, the current On- and Off-screen memory areas on the display adapter must be reset. Since between two transactions the On- and Off-screen areas contain the same data, it is sufficient to reset the Off-screen memory and afterwards copy its contents to the On-screen area. This is easy to implement, as most graphics controllers contain substantial amounts of memory for textures and vertices; a small part of this memory can be used to save the committed state of the graphics controller. After the commit phase, the current On-screen area is copied into this separate memory area, from where it can be restored if necessary.

The serial-line controller is more difficult to handle. This controller sends data as soon as it receives it from the system, and in case of an abort it is not possible to "undo" the sent data. For this problem there are two possible solutions: either the affected application must be able to handle duplicated data, or the driver has to use Smart-Buffers. Data in this special buffer type is invisible to the device until the commit phase, so the device can only access committed, and therefore valid, data.
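An Undo-method for the graphics case might be sketched as follows, with invented method names; the committed screen contents are parked in spare video memory and copied back on abort.

    // Sketch of a graphics-driver Undo-method (names are illustrative).
    class GraphicsDriverSketch {
        interface VideoMemory { void copy(int fromArea, int toArea); } // block move in video RAM
        static final int ON_SCREEN = 0, OFF_SCREEN = 1, COMMITTED = 2; // COMMITTED: spare texture memory

        private final VideoMemory vram;
        GraphicsDriverSketch(VideoMemory vram) { this.vram = vram; }

        void afterCommit() {
            vram.copy(ON_SCREEN, COMMITTED);  // conserve the committed state
        }

        void undo() {                         // called by the system on abort
            vram.copy(COMMITTED, OFF_SCREEN); // reset the Off-screen area...
            vram.copy(OFF_SCREEN, ON_SCREEN); // ...and make it visible again
        }
    }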

5 Checkpointing and Recovery

State-of-the-art PC clusters are built using Linux or Microsoft Windows, but implementing checkpointing and recovery on these operating systems is difficult, because it is not sufficient to save the process context: the local kernel context must be captured as well. The latter includes internal data structures, open files, used sockets, pending page requests, etc., which can be read only at kernel level. Resetting the kernel and process context in case of a rollback is also challenging because of the complex OS architectures. As a consequence, taking a checkpoint is time consuming and checkpointing intervals are quite large, e.g. 15-120 min for the IBM LoadLeveler. By extending the Single System Image concept we avoid these drawbacks: storing the kernel and its contexts in the DSM makes it easy to save this data. Rollback in case of an error is no problem in Plurix, because the OS and all applications are designed to be restartable anyway.

5.1 Current Implementation

A central fault-tolerant PageServer stores consistent heap images in an incremental fashion on disk. Between two checkpoints the PageServer uses a bus-snooping protocol to intercept transmitted and invalidated memory pages, to reduce the amount of data to be retrieved from the cluster at the next checkpoint. When a checkpoint must be saved, the cluster is stopped and the PageServer collects the invalidated pages that have not been transmitted since the last checkpoint. All memory pages are written to disk synchronously; we have implemented a highly optimized disk driver that is able to write about 45 MB/s. An early performance evaluation of our PageServer can be found in Section 6.

Because the kernel and its context reside in the DSM, we need not save node-local data. Furthermore, we have no long-running processes or threads with preemptive multitasking that need to be checkpointed. Currently, we use a cooperative multitasking scheme for executing short transactions: a transaction is executed by a command or called periodically from the scheduler. Long-running computations have to be divided into sub-transactions manually. In case of an error, the node can reboot and fetch the required memory pages from the DSM again, starting from the last checkpoint.

5.2 Fault-Tolerance

We support clusters running within a single Fast Ethernet LAN and assume fail-stop behavior of nodes. Most DSM systems use a reliable multicast or broadcast facility to avoid inconsistencies caused by lost network packets. Because of the low error probability of a LAN we are not willing to impose the overhead of reliable communication during normal operation. Instead we rely on fast error detection, fast recovery, and the quick-boot option of our cluster OS.

As described in Section 2.3, our DSM implements transactional consistency, and committing transactions are serialized using a token. We introduce a logical global time (a 64-bit value) incremented each time a transaction commits. On a commit, the new time is broadcast to the cluster and each node updates its time variable. A node can thus immediately detect that it missed a commit and ask for recovery. If the commit message cannot be sent within one Ethernet frame, the commit number is incremented for each commit packet; thus we avoid inconsistencies if a node misses one packet of a multi-packet commit. Furthermore, any page or token request always includes the global time value of the requesting node. If such a request contains an out-of-date commit number, it is not processed; instead, recovery is started. Thus a node that missed a commit is not able to commit a transaction, because it is not granted the token.

If a single node fails temporarily, it can reboot and join the DSM again. If the PageServer detects during the next checkpoint that pages are missing because of the node failure, the cluster is reset to the last checkpoint. If multiple nodes fail temporarily or permanently, the same error detection scheme works, too.

The network might be partitioned temporarily into two or more segments. The token and the PageServer are each available in only one of these segments. Nodes within the segments send page and token requests; if such a request cannot be satisfied, the segment tries to recover by contacting the PageServer. Only the segment with the PageServer can recover; the others have to wait until the PageServer becomes available again.

We plan to implement a distributed version of our PageServer to avoid a bottleneck, and to replicate the data stored on the PageServers to tolerate failures of PageServers as well.

We also plan to introduce an asynchronous checkpointing scheme to avoid stopping the cluster during checkpointing. Dependency tracking will also be investigated, to restart only the affected nodes in case of a failure.
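The error-detection rule based on the logical global time can be condensed into a few lines (invented names, for illustration only):

    // Sketch: stale commit numbers on page/token requests trigger recovery.
    class CommitTimeCheckSketch {
        private long globalTime; // 64-bit logical time, incremented per commit

        void onCommitBroadcast(long newTime) {
            globalTime = Math.max(globalTime, newTime);
        }

        boolean handleRequest(long requesterTime, Runnable service, Runnable startRecovery) {
            if (requesterTime < globalTime) { // requester missed a commit
                startRecovery.run();          // it must recover; it is not granted the token
                return false;
            }
            service.run();                    // up to date: serve the page or token request
            return true;
        }
    }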

6 Measurements

The performance evaluation is carried out on three PCs interconnected by a Fast Ethernet hub. Each node is equipped with an RTL8139 network card and an ATI Radeon graphics adapter. Only the first machine (with a Celeron CPU) has a hard disk (IBM 120 GB, maximum disk write throughput without network 45 MB/s) and acts as PageServer.

Table 1. Node configuration

Node  CPU                        RAM
1     Celeron 1.8 GHz            256 MB DDR RAM at 266 MHz
2     Athlon XP2.2+ at 1.8 GHz   256 MB DDR RAM at 333 MHz
3     Athlon XP2.0+ at 1.66 GHz  256 MB DDR RAM at 333 MHz

6.1 General System Measurements

We have measured the startup time of the cluster nodes described above. The results are split into the time the kernel needs and the time needed to detect and start hardware such as the hard disk, mouse and keyboard. The nodes were started with and without a hard disk; the difference of about 540 ms is the time spent waiting for the hard disk to answer.

Table 2. Startup times (in ms)

Node  Startup as Master  Kernel time  Startup as Slave  Kernel time
1     791                254          240               234
2     780                248          238               233
3     792                254          239               234

The kernel allocates 2787 objects when running as master and 518 objects when running as slave. It takes approximately 3 microseconds to allocate an object and an additional 0.5 microseconds to assign a pointer to an object. To get the kernel from the DHS, a slave node must request 284 pages.

To show the correlation between changed heap size, heap spreading and the time to save a checkpoint, ten measurements were made. Comparing several measurements is needed for predictions about the speed of the hard disk, the performance of the implemented software and the latency caused by the network. In the following table, the configuration (single station or cluster) and the heap spreading are given for each measurement. The PageServer creates consistent images of the complete heap containing both user data (node 1 - node 3) and the operating system; the latter is included in "saved data".

Table 3. Checkpoint measurements

#   Nodes    Node 1  Node 2  Node 3  Saved data  Time to save  Throughput (disc write bandwidth)
1   1        20 MB   -       -       21.4 MB     1639 ms       13.7 MB/s
2   1        40 MB   -       -       42.5 MB     2491 ms       17.5 MB/s
3   1        60 MB   -       -       63.0 MB     3371 ms       19.1 MB/s
4   1        80 MB   -       -       83.4 MB     4321 ms       19.7 MB/s
5   1, 2, 3  60 MB   0 MB    0 MB    63.1 MB     3422 ms       18.9 MB/s
6   1, 2, 3  20 MB   20 MB   20 MB   63.1 MB     4476 ms       14.4 MB/s
7   1, 2, 3  0 MB    28 MB   32 MB   63.1 MB     4971 ms       13.0 MB/s
8   1, 2, 3  40 MB   40 MB   40 MB   124.6 MB    8049 ms       15.8 MB/s
9   1, 2, 3  48 MB   48 MB   48 MB   149.1 MB    9540 ms       16.0 MB/s
10  1, 2, 3  60 MB   60 MB   60 MB   186.0 MB    11707 ms      16.3 MB/s

Comparing measurements 1-4, throughput increases with the amount of data saved. Measurements 3 and 5-7 save the same amount of data, so the decreased throughput is caused by network latency. Measurements 6 and 8-10 show nearly constant throughput; the slight improvement with increasing data size is due to the faster saving of local data.

7 Experiences and Future Work

Moving the kernel into the DHS, and thereby extending the SSI concept, made it possible to create a type-safe kernel interface and to solve the problem of address space spanning pointers. Additionally, checkpointing is made much easier and the question of how kernel methods should be called is answered. The current version of Plurix runs stably in the cluster environment, without collisions during allocation. The allocator strategy inhibits false sharing as long as no applications that share objects are running. As soon as objects are created by an application and shared with other nodes, the allocation mechanism cannot prevent false sharing, but we are working on a monitoring tool to detect it. Relocation of objects to dissolve false sharing is already available.

Plurix uses a distributed garbage collection algorithm which is able to detect and collect garbage (including cyclic garbage) without stopping the cluster. The detection algorithm for cyclic garbage works without error, but currently there is no information about which objects could be cyclic garbage, so each object in the DHS must be checked.

The consistency of the DHS is ensured by the PageServer, which uses a linear segment technique to save all changed pages. This includes the data and code objects of user applications as well as of the OS. In the current implementation the speed of saving the complete heap is limited by network throughput, not by the OS or the hard disk. This makes it possible to save the state of the cluster continuously, which could be achieved with some minor changes to the mechanism for detecting missing pages.

8 References

[Mosix] Barak, A. and La'adan, O.: The MOSIX Multicomputer Operating System for High Performance Cluster Computing. Journal of Future Generation Computer Systems, Vol. 13, No. 4-5, pp. 361-372, March 1998.

[Wirth] Wirth, N. and Gutknecht, J.: Project Oberon. Addison-Wesley, 1992.

[Traub] Traub, S.: Speicherverwaltung und Kollisionsbehandlung in transaktionsbasierten verteilten Betriebssystemen. PhD thesis, University of Ulm, 1996.

[TreadMarks] Amza, C., Cox, A.L., Dwarkadas, S. and Keleher, P.: TreadMarks: Shared Memory Computing on Networks of Workstations. Proceedings of the Winter 94 Usenix Conference, pp. 115-131, January 1994.

[Fuchs90] Wu, K.-L. and Fuchs, W.K.: Recoverable Distributed Shared Virtual Memory. IEEE Transactions on Computers, 39(4):460-469, April 1990.

[Morin97] Morin, C. and Puaut, I.: A Survey of Recoverable Distributed Shared Virtual Memory Systems. IEEE Transactions on Parallel and Distributed Systems, Vol. 8, No. 9, September 1997.

[Keedy] Keedy, J.L. and Abramson, D.A.: Implementing a Large Virtual Memory in a Distributed Computing System. Proc. of the 18th Annual Hawaii International Conference on System Sciences, 1985.

[Li] Li, K.: IVY: A Shared Virtual Memory System for Parallel Computing. Proceedings of the International Conference on Parallel Processing, 1988.

[Wende02] Wende, M., Schoettner, M., Goeckelmann, R., Bindhammer, T. and Schulthess, P.: Optimistic Synchronization and Transactional Consistency. Proceedings of the 4th International Workshop on Software Distributed Shared Memory, Berlin, Germany, 2002.

[Bindhammer] Bindhammer, T., Göckelmann, R., Marquardt, O., Schöttner, M., Wende, M. and Schulthess, P.: Device Programming in a Transactional DSM Operating System. Proceedings of the Asia-Pacific Computer Systems Architecture Conference, Melbourne, Australia, 2002.

[Smile] SMiLE project. http://os.inf.tu-dresden.de/SMiLE/
