Programming models, environments and languages

Seminar Cluster Computing, SS2005

Gordana Stojceska

Technische Universität München, Institut für Informatik, Boltzmannstraße 3, 85748 Garching bei München
[email protected]

Abstract. While low-cost hardware has accelerated the use of clusters for industrial, scientific and commercial high performance computing, it is the software that has both enabled their utility and restrained their usability. The rapid advances in hardware performance have propelled clusters to the forefront of next-generation high performance computer systems, but equally important has been the evolving capability and maturity of the supporting software systems and tools. The result is a global system environment that is converging on that of the previous generation of supercomputers and MPPs (Massively Parallel Processors). Like these predecessors, clusters offer opportunities for future research and development in programming tools and resource management software, in order to enhance their applicability, availability, scalability and usability. This paper takes a software point of view on clusters. The main discussion concerns Software Distributed Shared Memory systems and one of their best-known problems: the atomic page update problem.

Keywords: Operating systems, programming models, distributed shared memory, atomic page update problem

1 Introduction

One can see that the development of software components is essential for the success of any type of parallel system, and that includes clusters as well. Improvements arrive at an accelerated rate, quickly approaching the stable and sophisticated levels of functionality that will establish clusters as the long-term solution to high-end computation. The main goal in building a cluster of computers is speed-up, that is, reducing the computation time, in general or wherever it is strongly needed, for example in large scientific computations. Therefore, from the very beginning one needs to take care of parallelization in every sense. Analyzing Amdahl's Law, which, shortly explained, says that the total execution time on a parallel machine is the sum of the parallelized part and the serial part, one can draw some conclusions. Although Amdahl's equation looks quite simple, it has a very interesting implication: the most one can speed up a computation using a cluster of processors, no matter how many of them are used, is a factor of a relatively small number! For example, when the number of processors gets extremely large (theoretically: infinitely large) and 5% of a task is irreducibly serial, the highest speed-up one can achieve is a factor of 20! (A worked form of the law is given at the end of this section.) If one wants to achieve the best speed-up using clusters, one should be aware of the general point of Amdahl's Law, sometimes called the "game of elimination":

- The hardware can be adequately parallel, but that will not do a bit of good if the operating system is not.
- Both the hardware and the operating system can be configured in the most parallelisable way, but that will not do any good if the middleware (databases, communications, etc.) is not.
- All the abovementioned parts can be perfectly configured to act in parallel, but if the application is not, one loses again.
- If the hardware, the OS, the middleware and the application all have low serial content, then one has a good chance of getting the best out of clusters.

Thus, this paper discusses, more or less, all of these points. First, there is an overview of the best-known "state-of-the-art" operating systems used nowadays, as well as their abilities, performance, advantages and/or disadvantages, some possible improvements and so on. The next very important topic is the programming model: how programs are going to be parallelized and which mechanisms should be used, depending on the hardware architecture. The main discussion is about Distributed Shared Memory systems and one of their best-known problems: the atomic page update problem. Some solutions to this problem are presented, as well as a comparison between them. At the end, a short introduction to OpenMP is given.
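For reference, here is a compact form of the law described above, in standard notation:

```latex
% Amdahl's Law: s = irreducibly serial fraction of the task,
%               N = number of processors.
S(N) = \frac{1}{s + \frac{1 - s}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{s}.
% Example from the text: s = 0.05 gives at most S = 1/0.05 = 20,
% regardless of how many processors the cluster has.
```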

2 Operating Systems for Clusters

2.1 Basic characteristics

The ideal operating system, for clusters as in general, should always help as much as possible and never hinder the user. That means it should help the user (who in this case is almost always an application or middleware designer) to set up the system for optimal program execution. This can be done by supplying a consistent and well-targeted set of functions exposing as many system resources as possible. After the environment has been configured, the operating system should stay out of the user's way, avoiding any time-consuming context switches. How different one chooses to make the operating system depends on one's view of clustering. On the one hand, there are those who argue that each node of a cluster must contain a full-featured operating system such as Unix, with all the positives and negatives that implies. At the other extreme, there are researchers asking the question: "Just how much can I remove from the OS and have it still be useful?" It is a common question exactly which attributes a cluster operating system should have. Those mentioned most often in the research community are the following: manageability, stability, performance, extensibility, scalability, support and heterogeneity. It should be said that experience shows that these attributes may be mutually exclusive. For example, supplying a Single System Image (SSI) at the operating system level, while a definite boon in terms of manageability, drastically inhibits scalability. Another example is the availability of the source code in conjunction with the possibility to extend (and thus modify) the operating system on this basis. This property has a negative influence on the stability and manageability of the system: over time, many variants of the operating system will develop, and the different extensions may conflict when there is no single supplier.

2.2 Current State-of-the-art

State-of-the-art can be interpreted to mean the most common solution as well as the best solution using today's technology. By far the most common solution for current clusters is running a conventional operating system, with little or no special modification. This operating system is usually a Unix derivative, although NT clusters are becoming more common.

2.2.1 Linux

One of the most popular operating systems used for clusters in the public research community is Linux [1], because it is released as true open source and is cheap in terms of both the required software and hardware platforms. It offers all the functionality expected from a standard UNIX OS, and it develops fast, as missing functionality can be implemented by anyone who needs it. However, these solutions are usually not as thoroughly tested as releases of commercial UNIX variants. This requires frequent updates, which do not ease the administrators' job of creating a stable system, as is required in commercial environments. For scientific and research applications, however, the Linux concept is a great success, as can be seen from the well-known Beowulf [2] project, which is actually a collection of tools, middleware libraries, network drivers and kernel extensions that enable a cluster for specific HPC purposes. This has led to a big and growing user community developing a great number of tools and environments to control and manage clusters, which are mostly available for free. However, it is often hard to configure and integrate these solutions to suit one's own special setup, as not all Linux OSs are created equal: the different distributions are diverging more and more in the way the complete system is designed. Even the common base, the kernel, is in danger, as solution providers like TurboLinux [1] start to integrate advanced features into the kernel. In general, it has to be noted that the support for Linux by commercial hardware and software vendors is continuously improving, so it can no longer be ignored as a relevant computing platform.

2.2.2 Windows NT

Although NT contains consistent solutions for local system administration in a workstation or small-server configuration (consistent API, intuitive GUI-based management, registry database), it lacks several characteristics that are required for efficient use in large or clustered servers. The most prominent are standardized remote access and administration, SMP scaling in terms of resource limits and performance, dynamic reconfiguration, high availability, and clustering with more than a few nodes. Through Microsoft's omnipresence and market power, the support for NT by the vendors of interconnect hardware and by tools developers is good, and therefore a number of research projects in clustering for scientific computing are based on Windows NT and Windows 2000 [4]. Microsoft itself is working hard to extend the required functionality, and the situation will surely improve.

2.2.3 AIX

IBM's AIX operating system, running on the SP series of clusters, is surely one of the most advanced solutions for commercial and scientific cluster computing. It has proven its scalability and stability over several years and across a broad range of applications in both areas. However, it is a closed system, and as such its development lies almost entirely in the hands of IBM, which, however, has enough resources to create solutions in hardware and software, such as HA (High Availability). Research from outside the IBM laboratories is very limited, though.

2.2.4 Solaris

Somehow, Solaris can be seen as a compromise or merge of the three systems described above: it is still not an open system, which ensures stability on a commercial level and a truly identical interface for administrators and programmers on every installation of the supported platforms. With Sun's recent decision to make Solaris an open source project [20], the interest in easy kernel extensions or modifications may grow over time. Solaris offers a lot of the functionality required for commercial, enterprise-scale cluster-based computing, such as excellent dynamic reconfiguration and fail-over, and also offers leading inter- and intra-node scalability for both scientific and commercial clustering. Its support by commercial vendors is better for high-end equipment, and the available software solutions are also directed towards a commercial clientele. However, Solaris can be run on the same low-cost off-the-shelf hardware as Linux, as well as on original Sun SPARC-based equipment, and support for relevant clustering hardware like interconnects (Myrinet [21], SCI) is given. Software solutions generated by the Linux community are generally portable to the Solaris platform with little or no effort.

2.2.5 Puma OS

The Puma operating system [19], from Sandia National Labs and the University of New Mexico, represents the ideological opposite of Solaris OS. Puma takes a true minimalist approach: there is no sharing between nodes, and there is not even a file system or demand paged virtual memory. This is because Puma runs on the “compute partition” of the Intel Paragon and Tflops/s machines, while a full-featured OS (e.g. Intel’s TflopsOS or Linux) runs on the Service and I/O partitions. The compute partition is focused on high-speed computation, and Puma supplies low-latency, high-bandwidth communication through its Portals mechanism.

2.2.6 Mosix

MOSIX [22] is a set of kernel extensions for Linux that provides support for seamless process migration. Under MOSIX, a user can launch jobs on their home node, and the system will automatically load-balance the cluster and migrate the jobs to lightly loaded nodes. MOSIX maintains a single process space, so the user can still track the status of their migrated jobs. MOSIX offers a number of different modes in which available nodes form a cluster and jobs are submitted and migrated, ranging from a closed, batch-controlled system to an open network-of-workstations-like configuration. MOSIX is a mature system, growing out of the MOS project and having been implemented for seven different operating systems/architectures. openMosix [23] is a Linux kernel extension for single-system-image clustering. This kernel extension turns a network of ordinary computers into a supercomputer for Linux applications.

3 Programming Models

A programming model is the architecture of a computer system, both hardware and system software, above the level that is hidden by traditional high-level languages. It is an application's high-level view of the system on which it is running [3]. Because there exists only one serial (non-parallel) programming model, it is commonly understood that the term "parallel programming model" actually means "programming model". Various programming models are known today, but only very few of them are really used. Here, the following programming models should be mentioned:

1. the uniprocessor model, otherwise known as the Von Neumann model,
2. the symmetric multiprocessor model, also known as the shared-memory model,
3. the message passing model (mainly used in clusters),
4. the cc-NUMA (cache-coherent Non-Uniform Memory Access) model.

There are numerous discussions in the scientific community about which of the programming models is the best to use. Thus the question of the importance of programming models should be answered. They are actually extremely important for the following reasons: first, since some are easier to use than others, they have a very big influence on how difficult it is to write a program; second, there is the issue of portability: a program written using one model can be essentially difficult (or extremely easy) to move to a computer system implementing another model. The most commercially used programming models nowadays are the shared-memory (or distributed shared memory) models as well as the message passing programming model. A minimal message-passing sketch follows below.
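The following sketch illustrates the message passing model using MPI as the concrete API; the paper itself does not name a specific library, so MPI is assumed here as the de facto standard. Two processes exchange one integer:

```c
/* Minimal message-passing sketch (MPI assumed as the concrete API). */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Explicit send: data moves only when the program says so. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

Compiled with mpicc and started with, e.g., mpirun -np 2, this shows the essential contrast with the shared-memory model: every data movement between processes is explicit.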

4 Distributed Shared Memory on Clusters

4.1 Distributed Shared Memory (DSM) System

A DSM system logically implements the shared-memory model on a physically distributed-memory system. System designers can implement the specific mechanism for achieving the shared-memory abstraction either in hardware or in software, in different ways. The DSM system hides the remote communication mechanism from the application writer, preserving the programming ease and portability typical of shared-memory systems. DSM systems allow relatively easy modification and efficient execution of existing shared-memory applications, which preserves software investments while optimizing performance. In addition, the scalability and cost-effectiveness of the underlying distributed-memory systems are inherited as well. Consequently, DSM systems offer a viable choice for building efficient, large-scale multiprocessors. The DSM model's ability to provide a transparent interface and a convenient programming environment for distributed and parallel applications has made it the focus of numerous research efforts in recent years. Current DSM system research [24] focuses on the development of general approaches that minimize the average access time to shared data while maintaining data consistency. Some solutions implement a specific software layer on top of existing message-passing systems. Others extend strategies applied in shared-memory multiprocessors with private caches to multilevel memory systems [14]. Although the shared memory abstraction is gaining ground as a programming abstraction for parallel computing, the main platforms that support it, small-scale symmetric multiprocessors (SMPs) and hardware cache-coherent distributed shared memory systems (DSMs), seem to lie inherently at the extremes of the cost-performance spectrum for parallel systems. For this reason, there also exist shared virtual memory (SVM) clusters [16] that try to bridge this gap, and one can examine how application performance scales on a state-of-the-art shared virtual memory cluster [15]. The results are:

- The level of application restructuring needed is quite high compared to applications that perform well on a DSM system of the same scale, and larger problem sizes are needed for good performance.
- However, surprisingly, SVM performs quite well for a fairly wide range of applications, achieving at least half the parallel efficiency of a high-end DSM system at the same scale, and often much more.

Nevertheless, SMPs do not scale to larger numbers of processors, and DSM systems, although they can be scaled down to smaller numbers of processors, are still very costly compared to SMPs and clusters.

4.2 Software Distributed Shared Memory (SDSM) System

A Software Distributed Shared Memory (SDSM) system provides the shared memory abstraction over the physically distributed memory of clusters. Programming with SDSM is considered to be easier than with a message-passing model. The goal of much research is to investigate how to build efficient SDSM on clusters. One such example is KAIST DSM [25]. The team designed and implemented several versions of SDSM, focusing on communication layers, consistency models and high availability. The SDSM approach has also been attractive for large clusters due to its performance and scalability. However, the probability of failures increases as the system size grows. Thus, they also designed and implemented an SDSM with fault-tolerance capabilities. It guarantees bounded recovery time and ensures that there is no domino effect.

4.3 Memory management for multithread software DSM systems

Software distributed shared memory (SDSM) systems have been powerful platforms for providing a shared address space on distributed memory architectures. The first generation of SDSM systems, like IVY [5], Midway [6], Munin [7], and TreadMarks [8], assumed uniprocessor nodes and thus allowed only one thread per process on a node. Currently, commodity off-the-shelf microprocessors and network components are widely used as building blocks for parallel computers. This trend has made cluster systems consisting of symmetric multiprocessors (SMPs) attractive platforms for high performance computing. However, the early single-threaded SDSM systems are too restricted to exploit the multiprocessors in SMP clusters. The next-generation SDSM systems, like Quarks [9], Brazos [10], DSM-Threads [11], and Murks [12], are aware of multiprocessors and exploit them by means of multiple processes or multiple threads. In general, naive process-based systems experience high context-switching overhead and additional inter-process communication delay within a node.

5 The Atomic Page Update Problem

5.1 Definition of the problem

The conventional fault-handling process used in the implementation of single-node DSM systems is no longer useful as a concept in multithreaded environments, because other threads may try to access the same page during the update period. The SDSM system faces a dilemma when multiple threads try to access an invalid page within a short interval. On the first access to an invalid page, the system must make the page writable in order to replace its contents with valid ones. Unfortunately, this change also allows other application threads to access the same page freely. This phenomenon is known as the atomic page update and change right problem [11], or the mmap() race condition [12]; in short, the "atomic page update problem". A sketch of the race is given below.
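The following minimal C sketch shows where the race arises, assuming a page-based SDSM that traps accesses with mprotect() and a SIGSEGV handler; the helper fetch_page_from_home() is a hypothetical placeholder, not a function from the paper or from any real system:

```c
/* Sketch of the conventional single-node fault-handling scheme and why
 * it races in a multithreaded SDSM (illustrative assumptions only).    */
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096

extern void fetch_page_from_home(void *page); /* hypothetical helper */

static void fault_handler(int sig, siginfo_t *si, void *ctx)
{
    void *page =
        (void *)((uintptr_t)si->si_addr & ~((uintptr_t)PAGE_SIZE - 1));

    /* Step 1: the page must be made writable before it can be filled. */
    mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);

    /* RACE: from here on, other application threads no longer fault on
     * this page and may freely read its still-stale contents ...       */

    /* Step 2: ... while the handler is still copying in valid data. */
    fetch_page_from_home(page);
}

static void install_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = fault_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
}
```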

5.2 Solutions

A well-known solution to this problem, adopted by major multithreaded SDSM systems (like TreadMarks [8] and Brazos [10]), is to map a file to two different virtual addresses. Even though the file mapping method achieves good performance on some systems, file mapping is not always the best solution; this observation suggests that other solutions to the atomic page update problem have to be found. Moreover, file mapping has a high initialization cost and reduces the available address space, because the SDSM and the application share the same address space. A general solution to this problem is the following: separate the application address space from the system address space for the same physical memory, and assign different access permissions to each address space. Since the virtual memory protection mechanism is implemented in the per-process page table, different virtual addresses (pages) can have different access permissions even though they refer to the same physical page. The system can then guarantee an atomic page update by changing the access permission of a virtual page in the application address space only after it has completed the page update through the system address space. Besides this general solution, other concrete solutions have been proposed. The following three are presented here [17]:

1. System V shared memory: mapping a physical page to different virtual addresses using System V shared memory IPC. The shmget() system call enables a process to create a shared memory object in the kernel, and the shmat() system call enables the process to attach the object to its address space. A process can attach the shared memory object to its address space more than once, and a different virtual address is assigned to each attachment. This is very cheap compared to file mapping. (A minimal sketch of this method follows after this list.)
2. The mdup() system call method: the basic mechanism of mdup() is to allocate new page table entries and to copy the page table entries of the anonymous memory into the new ones. The reasons for using anonymous memory are the following: (1) no initialization step is required, (2) there is no size limit, and (3) the memory region is released automatically at program termination.
3. The fork() system call: when a process forks a child process, the child process inherits the execution image of the parent process. In particular, the content of the child process's page table is copied from that of the parent process. The parent process creates shared memory regions and forks a child process. The two processes have independent access paths even though they use the same virtual address to access the same physical page. The parent process can execute the application, and the child process can perform the memory consistency mechanisms. Hence, the SDSM system can successfully update the shared memory region in a thread-safe way through the child process's address space.

Experiments on a Linux-based cluster and on an IBM SP2 [13] machine showed that the three proposed methods overcome the drawbacks of the file mapping method, such as high initialization cost and buffer cache flushing overhead. In particular, the method using the fork() system call is portable and preserves the whole address space for the application, whereas the others can use only half of the virtual address space. The System V shared memory method shows low initialization cost and runtime overhead, and the new mdup() system call method has the least coding overhead in the application code. Not all of the methods can be implemented on a given SMP cluster system, due to limitations of the operating system, as observed on the IBM SP system.
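To make the double-mapping idea of solution 1 concrete, here is a minimal, self-contained sketch on a POSIX system; error handling is omitted, and the example is illustrative rather than the implementation from [17]:

```c
/* One System V shared memory segment attached at two virtual addresses
 * with different permissions: a read-only "application" view and a
 * writable "system" view of the same physical page.                    */
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
    /* Create one physical segment ... */
    int id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);

    /* ... and attach it twice; each attachment gets its own address. */
    char *app_view = shmat(id, NULL, SHM_RDONLY); /* application side */
    char *sys_view = shmat(id, NULL, 0);          /* SDSM system side */

    /* The SDSM layer updates the page through sys_view; the write is
     * visible through app_view without that view ever being writable. */
    strcpy(sys_view, "page contents updated via the system mapping");
    printf("application view sees: %s\n", app_view);

    shmdt(app_view);
    shmdt(sys_view);
    shmctl(id, IPC_RMID, NULL);
    return 0;
}
```

A real SDSM would keep the application view invalid or read-only until the copy through the system view is complete, and only then raise the application-side permission, which is exactly what makes the update atomic with respect to the other threads.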

6 OpenMP

OpenMP [18] is an industry-standard API for programming shared memory computers. It is available on most, if not all, commercially available shared memory computers. With OpenMP, one can direct the compiler to create multithreaded blocks of code by adding compiler directives to a program. It is easy to use and, in many cases, supports the incremental addition of parallelism to a program (a minimal example is sketched below). So why should a shared memory API such as OpenMP be of interest to the cluster computing community? There are two reasons. First, many clusters are built from shared memory nodes; OpenMP can be used to exploit parallelism on a node while a distributed memory API is used between nodes. Second, OpenMP is evolving to address non-uniform memory architecture (NUMA) computers [19]. A cluster running some sort of distributed shared memory (DSM) is an extreme case of NUMA. Hence, one can program a cluster using OpenMP.
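As a small illustration of the incremental style mentioned above (a minimal sketch of standard OpenMP usage, not an example taken from [18]), a single directive parallelizes an ordinary loop; compile with an OpenMP-capable compiler, e.g. gcc -fopenmp:

```c
/* Minimal OpenMP sketch: one directive turns a serial loop parallel. */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    double sum = 0.0;
    int i;

    /* The directive asks the compiler to split the iterations among
     * threads; the reduction clause combines the partial sums safely. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 1; i <= 1000000; i++)
        sum += 1.0 / (double)i;

    printf("harmonic sum: %f (up to %d threads)\n",
           sum, omp_get_max_threads());
    return 0;
}
```

Removing the directive leaves a valid serial program, which is precisely what makes the incremental addition of parallelism possible.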

7 Future Work

This paper is just one drop in the ocean called software for cluster computing. There are many interesting research areas in this field, and they grow with each day. Open source projects are becoming more and more popular in the scientific community in general, meaning that the people involved in such projects have the freedom to realize their research visions. There is a lot of work and progress still to be done on system software and operating systems, as well as on the different ways of solving the problems of DSM and SDSM systems.

8 Bibliography

[1] Information about Linux, http://www.linux.org/, http://www.linux.org/info/index.html
[2] Beowulf Project, http://www.beowulf.org
[3] Gregory Pfister, "In Search of Clusters".
[4] http://www.microsoft.com/windows2000/techinfo/howitworks/cluster/introcluster.asp

[5] K. Li, "IVY: a shared virtual memory system for parallel computing", International Conference on Parallel Processing (1988), pp. 94–101.
[6] B.N. Bershad, M.J. Zekauskas, W.A. Sawdon, "The Midway distributed shared memory system", IEEE International Computer Conference, February 1993, pp. 528–537.
[7] J.K. Bennett, J.B. Carter, W. Zwaenepoel, "Munin: distributed shared memory based on type-specific memory coherence", Principles and Practice of Parallel Programming (1990), pp. 168–176.
[8] C. Amza, A.L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu, W. Zwaenepoel, "TreadMarks: shared memory computing on networks of workstations", IEEE Computer 29 (2) (1996), pp. 18–28.
[9] D.R. Khandekar, "Quarks: distributed shared memory as a basic building block for complex parallel and distributed systems", Master's Thesis, University of Utah, 1996.
[10] E. Speight, J.K. Bennett, "Brazos: a third generation DSM system", USENIX Windows NT Workshop, August 1997, pp. 95–106.
[11] F. Mueller, "Distributed shared-memory threads: DSM-Threads", Workshop on Run-Time Systems for Parallel Programming, April 1997, pp. 31–40.
[12] M. Pizka, C. Rehn, "Murks: a POSIX threads based DSM system", in: Proceedings of the International Conference on Parallel and Distributed Computing Systems, 2001.
[13] http://www.tc.cornell.edu/~slantz/what_is_sp2.html
[14] J. Protic, M. Tomasevic, V. Milutinovic, "Distributed Shared Memory: Concepts and Systems", IEEE Parallel and Distributed Technology, vol. 4, pp. 63–79, Summer 1996.
[15] Angelos Bilas, Dongming Jiang, Jaswinder Pal Singh, "Shared virtual memory clusters: bridging the cost-performance gap between SMPs and hardware DSM systems".
[16] "Home-based SVM Protocols for SVM Clusters: Design, Implementation and Performance", http://www.cs.princeton.edu/research/techreps/TR-548-97
[17] Yang-Suk Kee, Jin-Soo Kim, Soonhoi Ha, "Memory management for multi-threaded software DSM systems", www.elsevier.com/locate/parco
[18] Online information about OpenMP, http://www.openmp.org/
[19] OpenMP on NUMA, http://www2.cs.uh.edu/~hpctools/openmpOngoing.shtml
[20] Open Solaris, http://www.adtmag.com/article.asp?id=10519
[21] Myrinet, http://www.myricom.com/myrinet/overview/
[22] MOSIX OS, http://www.mosix.org/, http://www.ospueblo.com/mosix.shtml
[23] Information about openMosix OS, http://openmosix.sourceforge.net
[24] International Workshop on DSM, http://perso.ens-lyon.fr/laurent.lefevre/dsm2005/
[25] KAIST Project, http://camars.kaist.ac.kr/~nrl/team/dsm.html