Scalable Cluster Computing with MOSIX for Linux

Amnon Barak, Oren La'adan, Amnon Shiloh
Institute of Computer Science
The Hebrew University of Jerusalem
Jerusalem 91904, Israel
http://www.mosix.cs.huji.ac.il

ABSTRACT

Mosix is a software tool for supporting cluster computing. It consists of kernel-level, adaptive resource sharing algorithms that are geared for high performance, overhead-free scalability and ease-of-use of a scalable computing cluster. The core of the Mosix technology is the capability of multiple workstations and servers (nodes) to work cooperatively as if part of a single system. The algorithms of Mosix are designed to respond to variations in the resource usage among the nodes by migrating processes from one node to another, preemptively and transparently, for load-balancing and to prevent memory depletion at any node. Mosix is scalable and it attempts to improve the overall performance by dynamic distribution and redistribution of the workload and the resources among the nodes of a computing cluster of any size. Mosix conveniently supports a multi-user time-sharing environment for the execution of both sequential and parallel tasks.

So far Mosix has been developed seven times, for different versions of Unix and BSD, and most recently for Linux. This paper describes the seventh version of Mosix, for Linux.

1 Introduction

This paper describes the Mosix technology for Cluster Computing (CC). Mosix [4, 5] is a set of adaptive resource sharing algorithms that are geared for performance scalability in a CC of any size, where the only shared component is the network. The core of the Mosix technology is the capability of multiple nodes (workstations and servers, including SMP's) to work cooperatively as if part of a single system.

In order to understand what Mosix does, let us compare a Shared Memory (SMP) multicomputer and a CC. In an SMP system, several processors share the memory. The main advantages are increased processing volume and fast communication between the processes (via the shared memory). SMP's can handle many simultaneously running processes, with efficient resource allocation and sharing. Any time a process is started, finished, or changes its computational profile, the system adapts instantaneously to the resulting execution environment. The user is not involved and in most cases does not even know about such activities.

Unlike SMP's, Computing Clusters (CC) are made of collections of share-nothing workstations and (even SMP) servers (nodes), with different speeds and memory sizes, possibly from different generations. Most often, CC's are geared for multi-user, time-sharing environments. In CC systems the user is responsible for allocating the processes to the nodes and for managing the cluster resources. In many CC systems, even though all the nodes run the same operating system, cooperation between the nodes is rather limited because most of the operating system's services are locally confined to each node. The main software packages for process allocation in CC's are PVM [8] and MPI [9]. LSF [7] and Extreme Linux [10] provide similar services. These packages provide an execution environment that requires an adaptation of the application and the user's awareness. They include tools for initial (fixed) assignment of processes to nodes, which sometimes use load considerations, while ignoring the availability of other resources, e.g. free memory and I/O overheads. These packages run at the user level, just like ordinary applications, and thus are incapable of responding to fluctuations of the load or of other resources, or of redistributing the workload adaptively.

In practice, the resource allocation problem is much more complex because there are many (different) kinds of resources, e.g., CPU, memory, I/O, Inter Process Communication (IPC), etc., where each resource is used in a different manner and in most cases its usage is unpredictable. Further complexity results from the fact that different users do not coordinate their activities. Thus even if one knows how to optimize the allocation of resources to processes, the activities of other users are most likely to interfere with this optimization.

For the user, SMP systems guarantee efficient, balanced use of the resources among all the running processes, regardless of the resource requirements. SMP's are easy to use because they employ adaptive resource management that is completely transparent to the user. Current CC's lack such capabilities. They rely on user-controlled static allocation, which is inconvenient and may lead to significant performance penalties due to load imbalances.

Mosix is a set of algorithms that support adaptive resource sharing in a scalable CC by dynamic process migration. It can be viewed as a tool that takes CC platforms one step closer towards SMP environments. By being able to allocate resources globally, and to distribute the workload dynamically and efficiently, it simplifies the use of CC's by relieving the user from the burden of managing the cluster-wide resources. This is particularly evident in a multi-user, time-sharing environment and in non-uniform CC's.

2 What is Mosix

Mosix [4, 5] is a tool for a Unix-like kernel, such as Linux, consisting of adaptive resource sharing algorithms. It allows multiple Uni-processors (UP) and SMP's (nodes) running the same kernel to work in close cooperation. The resource sharing algorithms of Mosix are designed to respond on-line to variations in the resource usage among the nodes. This is achieved by migrating processes from one node to another, preemptively and transparently, for load-balancing and to prevent thrashing due to memory swapping. The goal is to improve the overall (cluster-wide) performance and to create a convenient multi-user, time-sharing environment for the execution of both sequential and parallel applications. The standard run-time environment of Mosix is a CC, in which the cluster-wide resources are available to each node. By disabling the automatic process migration, the user can switch the configuration into a plain CC, or even into an MPP (single-user) mode.

The current implementation of Mosix is designed to run on clusters of X86/Pentium based workstations, both UP's and SMP's, that are connected by standard LANs. Possible configurations may range from a small cluster of PC's connected by Ethernet, to a high-performance system with a large number of high-end, Pentium based SMP servers that are connected by a Gigabit LAN, e.g. Myrinet [6].

2.1 The technology

The Mosix technology consists of two parts: a Preemptive Process Migration (PPM) mechanism and a set of algorithms for adaptive resource sharing. Both parts are implemented at the kernel level, using a loadable module, such that the kernel interface remains unmodified. Thus they are completely transparent to the application level.

The PPM can migrate any process, at any time, to any available node. Usually, migrations are based on information provided by one of the resource sharing algorithms, but users may override any automatic system decisions and migrate their processes manually. Such a manual migration can either be initiated by the process synchronously or by an explicit request from another process of the same user (or the super-user). Manual process migration can be useful to implement a particular policy or to test different scheduling algorithms. We note that the super-user has additional privileges regarding the PPM, such as defining general policies, as well as which nodes are available for migration.

Each process has a Unique Home-Node (UHN) where it was created. Normally this is the node to which the user has logged in. In PVM this is the node where the task was spawned by the PVM daemon. The system image model of Mosix is a CC, in which every process seems to run at its UHN, and all the processes of a user's session share the execution environment of the UHN. Processes that migrate to other (remote) nodes use local (in the remote node) resources whenever possible, but interact with the user's environment through the UHN. For example, assume that a user launches several processes, some of which migrate away from the UHN. If the user executes "ps", it will report the status of all the processes, including processes that are executing on remote nodes. If one of the migrated processes reads the current time, i.e. invokes gettimeofday(), it will get the current time at the UHN.

The PPM is the main tool for the resource management algorithms. As long as the requirements for resources, such as the CPU or main memory, are below certain thresholds, the user's processes are confined to the UHN. When the requirements for resources exceed some threshold levels, then some processes may be migrated to other nodes, to take advantage of available remote resources. The overall goal is to maximize the performance by efficient utilization of the network-wide resources.

The granularity of the work distribution in Mosix is the process. Users can run parallel applications by initiating multiple processes in one node, then allow the system to assign these processes to the best available nodes at that time. If during the execution of the processes new resources become available, then the resource sharing algorithms are designed to utilize these new resources by possible reassignment of the processes among the nodes. The ability to assign and reassign processes is particularly important for "ease-of-use" and for providing an efficient multi-user, time-sharing execution environment.
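
Because the unit of distribution is the whole process, a parallel job needs no special coding to benefit from this scheme. The user-level sketch below simply forks several ordinary Unix worker processes; under Mosix each worker then becomes a candidate for transparent, preemptive migration to the best available node. The worker body and the number of processes are arbitrary placeholders, and the program uses no Mosix-specific interface.

/* Minimal sketch: a parallel job expressed as ordinary Unix processes.
 * Under Mosix, each forked worker may be migrated transparently to the
 * best available node; the program contains no cluster-specific code. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

static void worker(int id)
{
    double x = 0.0;                     /* placeholder CPU-bound work    */
    for (long i = 1; i < 100000000L; i++)
        x += 1.0 / (double)i;
    printf("worker %d done (sum=%f)\n", id, x);
    exit(0);
}

int main(void)
{
    const int nworkers = 8;             /* e.g. about one per node       */

    for (int i = 0; i < nworkers; i++) {
        pid_t pid = fork();
        if (pid == 0)
            worker(i);                  /* child: may migrate under Mosix */
        else if (pid < 0) {
            perror("fork");
            return 1;
        }
    }
    while (wait(NULL) > 0)              /* parent: collect all workers   */
        ;
    return 0;
}
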
Mosix has no central control or master-slave relationship between nodes: each node can operate as an autonomous system, and it makes all its control decisions independently. This design allows a dynamic configuration, where nodes may join or leave the network with minimal disruptions.

Algorithms for scalability ensure that the system runs as well on large configurations as it does on small configurations. Scalability is achieved by incorporating randomness in the system control algorithms, where each node bases its decisions on partial knowledge about the state of the other nodes, and does not even attempt to determine the overall state of the cluster or of any particular node. For example, in the probabilistic information dissemination algorithm [4], each node sends, at regular intervals, information about its available resources to a randomly chosen subset of other nodes. At the same time it maintains a small "window", with the most recently arrived information. This scheme supports scaling, even information dissemination, and dynamic configurations.

2.2 The resource sharing algorithms

The main resource sharing algorithms of Mosix are the load-balancing and the memory ushering. The dynamic load-balancing algorithm continuously attempts to reduce the load differences between pairs of nodes, by migrating processes from higher loaded to less loaded nodes. This scheme is decentralized: all the nodes execute the same algorithms, and the reduction of the load differences is performed independently by pairs of nodes. The number of processors at each node and their speed are important factors for the load-balancing algorithm. The algorithm responds to changes in the loads of the nodes or in the runtime characteristics of the processes. It prevails as long as there is no extreme shortage of other resources, e.g., free memory or empty process slots.

The memory ushering (depletion prevention) algorithm is geared to place the maximal number of processes in the cluster-wide RAM, to avoid as much as possible thrashing or the swapping out of processes [2]. The algorithm is triggered when a node starts excessive paging due to a shortage of free memory. In this case the algorithm overrides the load-balancing algorithm and attempts to migrate a process to a node which has sufficient free memory, even if this migration would result in an uneven load distribution.
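
As a rough illustration of how these pieces fit together, the following user-level C sketch keeps a small window of recently received resource reports, in the spirit of the probabilistic information dissemination scheme, and selects a migration target by letting a memory shortage override the load criterion. All names, fields and thresholds here are assumptions made for exposition; the actual Mosix kernel algorithms use richer state and cost functions.

/* Illustrative sketch of the target-selection logic described above.
 * Structure names, fields and thresholds are assumptions, not the
 * Mosix sources. Empty window slots are marked with node_id = -1. */
#define WINDOW 8                        /* small window of recent reports */

struct node_info {                      /* one dissemination report       */
    int  node_id;
    long load;                          /* normalized CPU load            */
    long free_mem;                      /* free memory, in pages          */
};

static struct node_info window[WINDOW]; /* most recently arrived info     */

/* Pick a node to receive a migrating process. When the local node is
 * short of memory, memory ushering overrides load balancing: choose the
 * node with the most free memory. Otherwise choose a less loaded node. */
static int choose_target(long local_load, long local_free_mem,
                         long mem_threshold)
{
    int best = -1;

    for (int i = 0; i < WINDOW; i++) {
        const struct node_info *n = &window[i];
        if (n->node_id < 0)
            continue;                   /* empty slot                     */
        if (local_free_mem < mem_threshold) {
            /* memory ushering: maximize free memory at the target        */
            if (best < 0 || n->free_mem > window[best].free_mem)
                best = i;
        } else {
            /* load balancing: reduce the load difference                 */
            if (n->load < local_load &&
                (best < 0 || n->load < window[best].load))
                best = i;
        }
    }
    return best < 0 ? -1 : window[best].node_id;
}
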
3 Process migration

Mosix supports preemptive (completely transparent) process migration (PPM). After a migration, a process continues to interact with its environment regardless of its location. To implement the PPM, the migrating process is divided into two contexts: the user context, which can be migrated, and the system context, which is UHN-dependent and may not be migrated.

The user context, called the remote, contains the program code, stack, data, memory-maps and registers of the process. The remote encapsulates the process when it is running at the user level. The system context, called the deputy, contains a description of the resources to which the process is attached, and a kernel-stack for the execution of system code on behalf of the process. The deputy encapsulates the process when it is running in the kernel. It holds the site-dependent part of the system context of the process, hence it must remain in the UHN of the process. While the process can migrate many times between different nodes, the deputy is never migrated.

The interface between the user context and the system context is well defined. Therefore it is possible to intercept every interaction between these contexts, and to forward this interaction across the network. This is implemented at the link layer, with a special communication channel for interaction. Figure 1 shows two processes that share a UHN. In the figure, the left process is a regular Linux process while the right process is split, with its remote part migrated to another node.

[Figure 1 diagram: two nodes, each divided into a user level and a kernel level and connected through the link layer of their kernels. The migrated process' remote part runs at the user level of one node, while its deputy and a regular local process reside at the UHN.]

Figure 1: A local process and a migrated process
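
To make the two halves of a split process concrete, the fragment below names them as plain C structures. This is an illustrative sketch only, not the actual Mosix kernel data structures; it merely restates the division described above between the migratable user context (the remote) and the site-dependent system context (the deputy) that stays at the UHN.

/* Illustrative only: not the real Mosix kernel structures. */
#define NREGS 32                        /* assumed register-save area size */

struct remote_ctx {                     /* user context: migratable        */
    void          *code;                /* program code                    */
    void          *data;                /* data segment                    */
    void          *stack;               /* user stack                      */
    void          *memory_maps;         /* memory-mapped regions           */
    unsigned long  regs[NREGS];         /* user registers                  */
};

struct deputy_ctx {                     /* system context: stays at UHN    */
    int            uhn;                 /* unique home-node identifier     */
    void          *kernel_stack;        /* runs system code on behalf of
                                           the process                     */
    void          *resources;           /* files, sockets, ... at the UHN  */
    int            link;                /* channel to the remote, if any   */
};
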

The migration time has a fixed component, for establishing a new process frame at the new (remote) site, and a linear component, proportional to the number of memory pages to be transferred. To minimize the migration overhead, only the page tables and the process' dirty pages are transferred.

In the execution of a process in Mosix, location transparency is achieved by forwarding site-dependent system calls to the deputy at the UHN. System calls are a synchronous form of interaction between the two process contexts. All system calls that are executed by the process are intercepted by the remote site's link layer. If the system call is site-independent, it is executed by the remote locally (at the remote site). Otherwise, the system call is forwarded to the deputy, which executes the system call on behalf of the process in the UHN. The deputy returns the result(s) back to the remote site, which then continues to execute the user's code.

Other forms of interaction between the two process contexts are signal delivery and process wakeup events, e.g. when network data arrives. These events require that the deputy asynchronously locate and interact with the remote. This location requirement is met by the communication channel between them. In a typical scenario, the kernel at the UHN informs the deputy of the event. The deputy checks whether any action needs to be taken, and if so, informs the remote. The remote monitors the communication channel for reports of asynchronous events, e.g., signals, just before resuming user-level execution. We note that this approach is robust, and is not affected even by major modifications of the kernel. It relies on almost no machine-dependent features of the kernel, and thus does not hinder porting to different architectures.

One drawback of the deputy approach is the extra overhead in the execution of system calls. Additional overhead is incurred on file and network access operations. For example, all network links (sockets) are created in the UHN, thus imposing communication overhead if the processes migrate away from the UHN. To overcome this problem we are developing "migratable sockets", which will move with the process and thus allow a direct link between migrated processes. Currently, this overhead can be significantly reduced by an initial distribution of communicating processes to different nodes, e.g. using PVM/MPI. Should the system become imbalanced, the Mosix algorithms will reassign the processes to improve the performance [3].

4 The implementation

The porting of Mosix to Linux started with a feasibility study. We also developed an interactive kernel debugger, a prerequisite for any project of this scope. The debugger is invoked either by a user request or when the kernel crashes. It allows the developer to examine kernel memory, processes, stack contents, etc. It also allows the developer to trace system calls and processes from within the kernel, and even to insert break-points in the kernel code.

In the main part of the project, we implemented the code to support the transparent operation of split processes, with the user context running on a remote node, supported by the deputy, which runs in the UHN. At the same time, we wrote the communication layer that connects the two process contexts and designed their interaction protocol. The link between the two contexts was implemented on top of a simple, but exclusive, TCP/IP connection. After that, we implemented the process migration mechanism, including migration away from the UHN, back to the UHN and between two remote sites. Then the information dissemination module was ported, enabling the exchange of status information among the nodes. Using this facility, the algorithms for process assessment and automatic migration were also ported. Finally, we designed and implemented the Mosix application programming interface (API) via the /proc file system.

4.1 Deputy / Remote mechanisms

The deputy is the representative of the remote process at the UHN. Since the entire user space memory resides at the remote node, the deputy does not hold a memory map of its own. Instead, it shares the main kernel map, similarly to a kernel thread.

In many kernel activities, such as the execution of system calls, it is necessary to transfer data between the user space and the kernel. This is normally done by the copy_to_user() and copy_from_user() kernel primitives. In Mosix, any kernel memory operation that involves access to user space requires the deputy to communicate with its remote to transfer the necessary data.

The overhead of the communication due to remote copy operations, which may be repeated several times within a single system call, could be quite substantial, mainly due to the network latency. In order to eliminate excessive remote copies, which are very common, we implemented a special cache that reduces the number of required interactions by prefetching as much data as possible during the initial system call request, while buffering partial data at the deputy to be returned to the remote at the end of the system call.

To prevent the deletion or overriding of memory-mapped files (used for demand-paging) in the absence of a memory map, the deputy holds a special table of such files that are mapped to the remote memory.

The user registers of migrated processes are normally under the responsibility of the remote context. However, each register, or combination of registers, may become temporarily owned for manipulation by the deputy.

Remote (guest) processes are not accessible to the other processes that run at the same node (whether local or originating from other nodes), and vice versa. They do not belong to any particular user (on the remote node, where they run), nor can they be sent signals or otherwise be manipulated by local processes. Their memory cannot be accessed and they can only be forced, by the local system administrator, to migrate out.

A process may need to perform some Mosix functions while logically stopped or sleeping. Such processes would run the Mosix functions "in their sleep" and then resume sleeping, unless the event they were waiting for has meanwhile occurred. An example is process migration, possibly done while the process is sleeping. For this purpose, Mosix maintains a logical state, describing how other processes should see the process, as opposed to its immediate state.
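
Returning to the remote copy mechanism above, the following sketch illustrates the kind of forwarding it requires: when kernel code running in the deputy needs the user memory of a migrated process, the request is shipped over the deputy-remote channel instead of being satisfied locally. The message layout and the helpers link_send() and link_recv() are assumptions introduced for illustration; they are not the Mosix interfaces.

/* Illustrative sketch of forwarding a user-space copy to the remote.
 * Names and message layout are assumptions, not the Mosix sources. */
enum dep_op { DEP_COPY_FROM_USER, DEP_COPY_TO_USER };

struct dep_req {
    enum dep_op    op;
    unsigned long  uaddr;               /* user address at the remote node */
    unsigned long  len;
};

/* Assumed helpers: one send/receive on the exclusive per-process
 * deputy<->remote channel. */
int link_send(int link, const void *buf, unsigned long len);
int link_recv(int link, void *buf, unsigned long len);

/* Deputy-side counterpart of copy_from_user(): fetch 'len' bytes of the
 * migrated process' user memory from the remote node into 'to'. */
long deputy_copy_from_user(int link, void *to,
                           unsigned long uaddr, unsigned long len)
{
    struct dep_req req = { DEP_COPY_FROM_USER, uaddr, len };

    if (link_send(link, &req, sizeof(req)) < 0)
        return -1;
    return link_recv(link, to, len);    /* remote replies with the data   */
}

The prefetching cache mentioned above would sit in front of such requests, batching the data needed by a whole system call into as few exchanges as possible.
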
4.2 Migration constraints

Certain functions of the Linux kernel are not compatible with process context division. Some obvious examples are direct manipulations of I/O devices, e.g., direct access to privileged bus-I/O instructions, or direct access to device memory. Other examples include writable shared memory and real-time scheduling. The last case is not allowed because one cannot guarantee it while migrating, and it would also be unfair towards processes of other nodes.

A process that uses any of the above is automatically confined to its UHN. If the process has already been migrated, it is first migrated back to the UHN.

4.3 Information collection

Statistics about a process' behavior are collected regularly, such as at every system call and every time the process accesses user data. This information is used to assess whether the process should be migrated from the UHN. These statistics decay in time, to adjust for processes that change their execution profile. They are also cleared completely on the "execve()" system call, since the process is likely to change its nature.

Each process has some control over the collection and decay of its statistics. For instance, a process may complete a stage knowing that its characteristics are about to change, or it may cyclically alternate between a combination of computation and I/O.

4.4 The Mosix API

The Mosix API has traditionally been implemented via a set of reserved system calls that were used to configure, query and operate Mosix. In line with the Linux convention, we modified the API to be interfaced via the /proc file system. This also prevents possible binary incompatibilities of user programs between different Linux versions.

The API was implemented by extending the Linux /proc file system tree with a new directory, /proc/mosix. The calls to Mosix via /proc include: synchronous and asynchronous migration requests; locking a process against automatic migrations; finding where the process currently runs; finding about migration constraints; system setup and administration; controlling statistics collection and decay; information about available resources on all configured nodes; and information about remote processes.
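
As a usage illustration, the fragment below shows how a user-level tool might drive such a /proc interface, first asking where a process currently runs and then requesting a migration. The entry names ("where", "goto") and the value formats are hypothetical placeholders, not the documented /proc/mosix layout.

/* Hypothetical /proc/mosix usage: entry names and formats are
 * illustrative assumptions, not the documented Mosix interface. */
#include <stdio.h>

int main(int argc, char **argv)
{
    const char *pid  = argc > 1 ? argv[1] : "1234";
    const char *node = argc > 2 ? argv[2] : "3";
    char path[256];
    FILE *f;

    /* Query: on which node does the process currently run? */
    snprintf(path, sizeof(path), "/proc/mosix/%s/where", pid);
    f = fopen(path, "r");
    if (f) {
        int where = -1;
        if (fscanf(f, "%d", &where) == 1)
            printf("process %s runs on node %d\n", pid, where);
        fclose(f);
    }

    /* Request: migrate the process to the given node. */
    snprintf(path, sizeof(path), "/proc/mosix/%s/goto", pid);
    f = fopen(path, "w");
    if (f) {
        fprintf(f, "%s\n", node);
        fclose(f);
    }
    return 0;
}
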

5 Conclusions

Mosix brings the new dimension of scaling to cluster computing with Linux. It allows the construction of a high-performance, scalable CC from commodity components, where scaling does not introduce any performance overhead. The main advantage of Mosix over other CC systems is its ability to respond at run-time to unpredictable and irregular resource requirements by many users.

The most noticeable properties of executing applications on Mosix are its adaptive resource distribution policy and the symmetry and flexibility of its configuration. The combined effect of these properties implies that users do not have to know the current state of the resource usage of the various nodes, or even their number. Parallel applications can be executed by allowing Mosix to assign and reassign the processes to the best possible nodes, almost like an SMP.

The Mosix R&D project is expanding in several directions. We have already completed the design of migratable sockets, which will reduce the inter-process communication overhead. A similar optimization is migratable temporary files, which will allow a remote process, e.g. a compiler, to create temporary files in the remote node. The general concept of these optimizations is to migrate more resources with the process, to reduce the remote access overhead.

In another project, we are developing new competitive algorithms for adaptive resource management that can handle different kinds of resources, e.g., CPU, memory, IPC and I/O [1]. We are also researching algorithms for network RAM, in which a large process can utilize available memory in several nodes. The idea is to spread the process's data among many nodes, and rather to migrate the (usually small) process to the data than to bring the data to the process.

In the future, we consider extending Mosix to other platforms, e.g., DEC's Alpha or SUN's Sparc. Details about the current state of Mosix are available at URL http://www.mosix.cs.huji.ac.il.

References

[1] Y. Amir, B. Awerbuch, A. Barak, R.S. Borgstrom, and A. Keren. An Opportunity Cost Approach for Job Assignment and Reassignment in a Scalable Computing Cluster. In Proc. PDCS '98, Oct. 1998.

[2] A. Barak and A. Braverman. Memory Ushering in a Scalable Computing Cluster. Journal of Microprocessors and Microsystems, 22(3-4), Aug. 1998.

[3] A. Barak, A. Braverman, I. Gilderman, and O. Laden. Performance of PVM with the MOSIX Preemptive Process Migration. In Proc. Seventh Israeli Conf. on Computer Systems and Software Engineering, pages 38–45, June 1996.

[4] A. Barak, S. Guday, and R.G. Wheeler. The MOSIX Distributed Operating System, Load Balancing for UNIX. Lecture Notes in Computer Science, Vol. 672, Springer-Verlag, 1993.

[5] A. Barak and O. La'adan. The MOSIX Multicomputer Operating System for High Performance Cluster Computing. Journal of Future Generation Computer Systems, 13(4-5):361–372, March 1998.

[6] N.J. Boden, D. Cohen, R.E. Felderman, A.K. Kulawik, C.L. Seitz, J.N. Seizovic, and W-K. Su. Myrinet: A Gigabit-per-Second Local Area Network. IEEE Micro, 15(1):29–36, Feb. 1995.

[7] Platform Computing Corp. LSF Suite 3.2, 1998.

[8] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. PVM - Parallel Virtual Machine. MIT Press, Cambridge, MA, 1994.

[9] W. Gropp, E. Lusk, and A. Skjellum. Using MPI. MIT Press, Cambridge, MA, 1994.

[10] Red Hat. Extreme Linux, 1998.