Infrastructure for Load Balancing on Mosix Cluster

MadhuSudhan Reddy Tera and Sadanand Kota, Computing and Information Science, Kansas State University. Under the guidance of Dr. Daniel Andresen.

Abstract

The complexity and size of software are increasing at a rapid rate, which results in increased build and execution times. Cluster computing is proving to be an effective and economical way to reduce these times. Currently, most available cluster computing software tools that achieve load balancing by process migration schemes do not consider all characteristics, such as CPU load, memory usage and network bandwidth usage, during migration. For instance, Mosix, a cluster computing software tool for Linux, does not consider network bandwidth usage, nor does it consider CPU usage and memory characteristics together. In this paper we present an infrastructure for efficient load balancing on a Mosix cluster through intelligent scheduling techniques.

Introduction

As computers increase their processing power, software complexity grows at an even larger rate in order to consume all of those new CPU cycles. Not only does running the new software require more CPU cycles, but the time required to compile and link the software also increases. The basic idea behind the clustering approach is to make a large number of individual machines act like a single, very powerful machine. With the power and low prices of today's PCs and the availability of high performance Ethernet connections, it makes sense to combine them to build a High Performance Computing and Parallel Computing environment. This is the concept behind any typical clustering environment, such as the Beowulf parallel computing system, which comes with free versions of UNIX and public domain software packages.

Mosix is a software package that was specifically designed to enhance the kernel with cluster computing capabilities. It is a tool consisting of kernel-level resource sharing algorithms that are geared for performance scalability in a cluster computer. Mosix supports resource sharing by dynamic process migration. It relieves the user from the responsibility of allocating processes to nodes by distributing the workload dynamically. In this project we are concentrating on homogeneous systems, wherein we have machines with the same family of processors running the same kernel.

The resource sharing algorithm of Mosix attempts to reduce the load differences between pairs of nodes (systems in the cluster) by migrating processes from higher loaded nodes to lesser loaded nodes. This is done in a decentralized manner, i.e. all nodes execute the same algorithms and each node performs the reduction of loads independently. Also, Mosix considers only the balancing of loads on processors, and responds to changes in those loads as long as there is no extreme shortage of other resources such as free memory and empty process slots. Mosix does not consider certain parameters such as the network bandwidth used by a process running on a node. In addition, Mosix distributes the load evenly and does not give the user control of the load distribution, i.e. if the user wants only a few of the machines to be evenly loaded and a few others to be heavily or lightly loaded, he will not be able to do this. Our project aims to overcome these shortcomings. Our initial scheduling technique is decentralized and tries to give the user control of balancing the load on the various machines.

The scheduling algorithms we use try to achieve balance in load, memory and network bandwidth by collecting performance metrics of a process through Performance Co-Pilot (PCP), a framework and services from SGI that support system-level performance monitoring and performance management. We also propose an implementation based on a centralized scheduler, which tries to eliminate the problems of decentralized scheduling, such as every node trying to move its CPU intensive processes to the same lightly loaded node. The centralized scheduler also takes care of the network bandwidth usage of a process and tries to reduce the overall network bandwidth consumption by migrating communicating processes to a single node, in addition to balancing the load on the individual machines as required by the user.

Load Balancing

The notion in a Mosix cluster is that whenever a system in the cluster becomes heavily loaded, the load is redistributed within the cluster. The dispatching of tasks from a heavily loaded system and scheduling them onto a lightly loaded system in the cluster is called load balancing. Load balancing can be divided into the following phases:

a. Load Evaluation phase: "The usefulness of any load balancing scheme is directly dependent on the quality of load measurement and prediction." [Watts98] Any good load balancing technique not only has a good measurement of load, but also sees that it does not affect the actual load on the system.

b. Profitability Determination phase: We should perform load balancing only when the cost of imbalance is greater than the cost of load balancing. This comparison of the cost of imbalance vs. the cost of load balancing is profitability determination. Generally, if cost is not considered during actual migration, an excessive number of tasks can be migrated, and this will have a negative influence on system performance.

c. Task Selection phase: Now we must select a set of tasks to be dispatched from the system so that the imbalance is removed. This is done in the task selection phase. A task should be selected in such a way that moving it off the system removes the imbalance to a large extent. For instance, we can look at the proportion of CPU usage of the task on the system. We should also consider the cost of moving the task over the link in the cluster and the size of the transfer, since larger tasks will take longer to move than smaller ones.

d. Task Migration phase: This is the final phase of load balancing in the cluster. This step must be done carefully and correctly to ensure continued communication integrity.

In the following sections, we will explore two of the most popular cluster computing technologies, namely Mosix and Condor, and also explain why we chose Mosix. We continue the paper with our implementation and provide sample test results.

Mosix

Mosix is a cluster-computing enhancement of Linux which allows multiple uniprocessors and SMPs running the same version of the kernel to share resources by preemptive process migration and dynamic load balancing. Mosix implements resource-sharing algorithms which respond to load variations on individual computer systems by migrating processes from one workstation to another, preemptively.

The goal is to improve the overall performance and to create a convenient multi-user, time-sharing environment for the execution of applications. The unique features of Mosix are:

a. Network transparency: For all network related operations, application-level programs are provided with a virtual machine that looks like a single machine, i.e. the application programs do not need to know the current state of the system configuration.

b. Preemptive process migration: Mosix can migrate any user's process, transparently, to any available node at any time. Transparency in migration means that the functional aspects of the system's behavior should not be altered as a result of the migration.

c. Dynamic load balancing: As explained earlier, Mosix has resource sharing algorithms which work in a decentralized manner.

The granularity of work distribution in Mosix is the process. Each process has a Unique Home Node (UHN) where it was created. Processes that migrate to other nodes (called 'remote') use local (in the remote node) resources whenever possible, but interact with the user's environment through the UHN; for example, gettimeofday() would get the time from the UHN. Preemptive process migration in Mosix is implemented by dividing the migrating process into two contexts: the user context ('remote'), which can be migrated, and the system context ('deputy'), which is UHN dependent and cannot be migrated (see figure 1). The 'remote' consists of the stack, data, program code, memory maps and registers. The 'deputy' consists of a description of the resources to which the process is attached, and a kernel stack for the execution of system code on behalf of the process. The interaction of the 'deputy' and the 'remote' is implemented at the link layer as shown in figure 1, which also shows two processes sharing a UHN, one local and one deputy.

Remote processes are not accessible to other processes that run at the same node, and vice versa. They do not belong to any particular user, nor can they be sent signals or otherwise manipulated by any local process. They can only be forced to migrate out by the system administrator. The deputy does not have a memory map of its own. Instead, it shares the main kernel map, similar to a kernel thread. The system calls executed by the remote process are intercepted by the remote site's link layer. If a system call is site independent, it is executed by the 'remote' locally; otherwise, the system call is forwarded to the 'deputy'. The 'deputy' then executes the call and returns the result back to the remote site.

Condor

Condor is a High Throughput Computing environment that can manage very large collections of distributively owned workstations. The environment is based on a novel layered architecture that enables it to provide a powerful and flexible suite of Resource Management services to sequential and parallel applications. The following are the features of Condor:
a. Checkpoint and Migration: Where programs can be linked with Condor libraries, users of Condor may be assured that their jobs will eventually complete, even in the ever-changing environment that Condor utilizes. As a machine running a job submitted to Condor becomes unavailable, the job can be checkpointed. The job may continue after migrating to another machine. Condor's periodic checkpoint feature periodically checkpoints a job, even in lieu of migration, in order to safeguard the accumulated computation time of a job from being lost in the event of a system failure such as the machine being shut down or a crash.

b. Remote System Calls: Despite running jobs on remote machines, the Condor standard universe execution mode preserves the local execution environment via remote system calls. Users do not have to worry about making data files available to remote workstations or even obtaining a login account on remote workstations before Condor executes their programs there. The program behaves under Condor as if it were running as the user that submitted the job on the workstation where it was originally submitted, no matter on which machine it really ends up executing.

c. Jobs can be ordered: The ordering of job execution required by dependencies among jobs in a set is easily handled. The set of jobs is specified using a directed acyclic graph, where each job is a node in the graph. Jobs are submitted to Condor following the dependencies given by the graph.

d. Condor enables Grid Computing: As grid computing becomes a reality, Condor is already there. The technique of glide-in allows jobs submitted to Condor to be executed on grid machines in various locations worldwide. As the details of grid computing evolve, so does Condor's ability, starting with Globus-controlled resources.

e. Sensitive to the Desires of Machine Owners: The owner of a machine has complete priority over the use of the machine. An owner is generally happy to let others compute on the machine while it is idle, but wants it back promptly upon returning. The owner does not want to take special action to regain control. Condor handles this automatically.

f. ClassAds: The ClassAd mechanism in Condor provides an extremely flexible, expressive framework for matchmaking resource requests with resource offers. Users can easily state both job requirements and job desires. For example, a user can require that a job run on a machine with 64 Mbytes of RAM, but state a preference for 128 Mbytes, if available. A workstation owner can state a preference that the workstation run jobs from a specified set of users. The owner can also require that there be no interactive workstation activity detectable at certain hours before Condor could start a job. Job requirements/preferences and resource availability constraints can be described in terms of powerful expressions, resulting in Condor's adaptation to nearly any desired policy.

Condor has some limitations on the jobs that it can transparently checkpoint and migrate, which are the following:

a. Multi-process jobs are not allowed. This includes system calls such as fork(), exec(), and system().

b. Inter-process communication is not allowed. This includes pipes, semaphores, and shared memory.

c. Network communication must be brief. A job may make network connections using system calls such as socket(), but a network connection left open for long periods will delay checkpointing and migration.

d. Sending or receiving the SIGUSR2 or SIGTSTP signals is not allowed. Condor reserves these signals for its own use. Sending or receiving all other signals is allowed.
e. Alarms, timers, and sleeping are not allowed. This includes system calls such as alarm(), getitimer(), and sleep().

f. Multiple kernel-level threads are not allowed. However, multiple user-level threads are allowed.

g. Memory mapped files are not allowed. This includes system calls such as mmap() and munmap().

h. All files must be opened read-only or write-only. A file opened for both reading and writing will cause trouble if a job must be rolled back to an old checkpoint image. For compatibility reasons, a file opened for both reading and writing will result in a warning but not an error.

i. A fair amount of disk space must be available on the submitting machine for storing a job's checkpoint images. A checkpoint image is approximately equal to the virtual memory consumed by a job while it runs. If disk space is short, a special checkpoint server can be designated for storing all the checkpoint images for a pool.

The following are the reasons for selecting Mosix over Condor for our implementation:

a. Condor has too many limitations on the type of process it can migrate, as we have seen above.

b. Condor is not an open source project, and it does not provide any API for migrating processes. The Mosix API and its source code are available free of cost under the GPL for programmers to extend the existing product or to utilize it in upper layers. Mosix also has an option to restrict its resource sharing algorithms, so that users can develop their own resource sharing/scheduling algorithms in order to gain better control over load distribution.

Performance Co-Pilot (PCP)

PCP is a framework and set of services to support system-level performance monitoring and performance management. Performance data may be collected and exported from multiple sources, most notably the hardware platform, the IRIX kernel, layered services and end-user applications. The diagram below shows the architecture of PCP.

Figure 2: PCP architecture

PCP consists of several monitoring and collecting tools. The monitoring tools consume and process performance data using a public interface, the Performance Metrics Application Programming Interface (PMAPI). Below the PMAPI level is the Performance Metric Collector Daemon (PMCD) process, which acts in a coordinating role: accepting requests from clients, routing requests to one or more Performance Metrics Domain Agents (PMDAs), aggregating responses from the PMDAs, and responding to the requesting client. Each performance metric domain (such as IRIX, some database management system, or netstat in our case) has a well-defined name space for referring to the specific performance metrics it knows how to collect. Each PMDA encapsulates domain-specific knowledge and methods about performance metrics that implement the uniform access protocols and functional semantics of PCP. There is one PMDA for the operating system, another for process-specific statistics, one each for common DBMS products, and so on. Connections between the PMDAs and the PMCD are managed by PMDA functions. There can be multiple monitor clients and multiple PMDAs on one host, but there may be only one PMCD process. PCP also allows its functionality to be extended by writing agents to collect performance metrics from uncharted domains, or by programming new analysis or visualization tools using the PMAPI.
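To make the PMAPI flow concrete, the following is a minimal, illustrative C sketch of a monitor client that connects to a PMCD and fetches one metric. The standard kernel.all.load metric is used purely as an example, and error handling is kept to a bare minimum; this is an illustration of the API, not code taken from our implementation.

    #include <stdio.h>
    #include <pcp/pmapi.h>

    int main(void)
    {
        char     *names[] = { "kernel.all.load" };   /* metric of interest */
        pmID      pmids[1];
        pmDesc    desc;
        pmResult *result;
        int       sts, i;

        /* connect to the PMCD on the local host */
        if ((sts = pmNewContext(PM_CONTEXT_HOST, "localhost")) < 0) {
            fprintf(stderr, "pmNewContext: %s\n", pmErrStr(sts));
            return 1;
        }

        /* map the metric name to a PMID and get its descriptor */
        if ((sts = pmLookupName(1, names, pmids)) < 0 ||
            (sts = pmLookupDesc(pmids[0], &desc)) < 0) {
            fprintf(stderr, "lookup: %s\n", pmErrStr(sts));
            return 1;
        }

        /* fetch the current values (one per instance, e.g. 1, 5, 15 min) */
        if ((sts = pmFetch(1, pmids, &result)) < 0) {
            fprintf(stderr, "pmFetch: %s\n", pmErrStr(sts));
            return 1;
        }
        for (i = 0; i < result->vset[0]->numval; i++) {
            pmPrintValue(stdout, result->vset[0]->valfmt, desc.type,
                         &result->vset[0]->vlist[i], 8);
            putchar('\n');
        }
        pmFreeResult(result);
        return 0;
    }

A program of this shape, linked against libpcp, is roughly what the PMClient module described later amounts to, with the process and netstat metrics substituted for kernel.all.load.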

Design and Implementation

The figure below shows how the various modules interact in the final application (on every machine).

[Figure 3 diagram: on each machine, the process and netstat PMDAs feed a PMClient; the Scheduler and Migration Module act on the local processes P1 ... Pn.]

Figure 3: Interaction of Modules

The following are the modules in our implementation:

a. PMDA: PCP provides PMDAs for collecting process metrics such as the number of processes running, user time, system time, memory used and so on. We have used these PMDAs through the PMClient in order to get the process metrics. As PCP does not have any PMDA that gives the network characteristics of a process, we have written a PMDA, based on the common netstat utility, that gives metrics concerning the network characteristics of each process, such as the intensity of communication (in bytes/sec), the address of the machine on which the peer process is running, the source port number and the destination port number (the port number of the process with which it is communicating). The name of this PMDA is 'netstat' and users can access its functionality through that name; for example, pminfo -f netstat gives the metrics of the communicating processes.

Precisely, the metrics of a process presently being considered are the system time and the netstat metrics. A CPU intensive process will have a high value for the ratio of system time to the total time taken. Whenever there is a load imbalance, we consider the process with the maximum CPU intensity for migration. This step corresponds to the 'Load Evaluation Phase' mentioned earlier.

b. PMClient: It is responsible for talking with the PMDAs through the PMAPI. It fetches the metric values, which are used for making the decision about process migration.

c. Scheduler: It acquires all the performance metric values of all processes and then determines whether any scheduling is required, based upon the scheduling algorithm. This corresponds to the 'Task Selection Phase'. (We do not consider the 'Profitability Determination Phase', as the cost of load balancing, such as the time and load taken for running the scheduler and the time taken for the actual migration of a process, is very small compared to the load of a process.) A sketch of the selection step is given below.
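The following minimal C sketch illustrates that selection step: it picks the migration candidate with the highest CPU intensity, defined as the ratio of system time to total elapsed time, as described above. The proc_metrics structure, its fields and the is_migratable flag are assumptions made for the example, standing in for the values the PMClient actually fetches; this is not the literal code of our scheduler.

    #include <stddef.h>

    /* simplified per-process record, filled in from the PMDA metrics */
    struct proc_metrics {
        int    pid;
        double system_time;     /* CPU time consumed (seconds)          */
        double total_time;      /* elapsed time since start (seconds)   */
        double net_bytes_sec;   /* communication intensity, netstat PMDA */
        int    is_migratable;   /* e.g. not locked to its home node     */
    };

    /* CPU intensity = system time / total time taken */
    static double cpu_intensity(const struct proc_metrics *p)
    {
        return (p->total_time > 0.0) ? p->system_time / p->total_time : 0.0;
    }

    /* Load Evaluation / Task Selection: return the index of the most
     * CPU intensive migratable process, or -1 if none qualifies.     */
    int select_victim(const struct proc_metrics *procs, size_t n)
    {
        int    best = -1;
        double best_intensity = 0.0;
        size_t i;

        for (i = 0; i < n; i++) {
            double intensity = cpu_intensity(&procs[i]);
            if (procs[i].is_migratable && intensity > best_intensity) {
                best_intensity = intensity;
                best = (int)i;
            }
        }
        return best;
    }

The thresholds that decide whether to act on the selected candidate belong to the scheduling algorithms given below.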

We have implemented two different scheduling techniques for balancing CPU loads. (All the algorithm implementations were done on a cluster of two machines.)

Algorithm 1:
1: fetch metrics for all processes.
2: fetch the load on the two machines.
3: check whether the loads differ by more than a threshold value.
3a: if the load on the current machine is lower, sleep for a few seconds and jump to step 1.
3b: if the load on the current machine is much higher than the load on the other machine:
3b1: select the process that is causing the maximum load and is also running on the current machine. If the process is migratable, move it to the other machine. Then sleep for a few seconds and go to step 1.
3b2: if that process has already migrated, repeat step 3b1 for the remaining processes.

Algorithm 2 differs from Algorithm 1 in the handling of processes that have already been migrated: Algorithm 1 waits for a migrated process to return by itself (when its execution ends), whereas Algorithm 2 brings the migrated process back if it sees that the load on the machine to which the process migrated has grown higher than that of the current machine by a certain threshold.

Algorithm 2:
1: fetch metrics for all processes.
2: fetch the load on the two machines.
3: check whether the loads differ by more than a threshold value.
3a: if the load on the current machine is much lower:
3a1: check whether any process has migrated to the other machine. If so, bring the process with the highest load back to the current machine. Wait for a few seconds and jump to step 1.
3b: if the load on the current machine is much higher than the load on the other machine:
3b1: select the process that is causing the maximum load (by its system time value) and is also running on the current machine. If the process is migratable, move it to the other machine. Then sleep for a few seconds and go to step 1.
3b2: if that process has already migrated, repeat step 3b1 for the remaining processes.
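The control flow shared by the two algorithms can be sketched as follows. This is an illustrative skeleton only: get_local_load(), get_remote_load(), pick_victim_pid(), pull_back_heaviest_migrated() and request_migration() are assumed placeholder names for the PMClient queries and for the Migration Module (which in our implementation performs the actual move through the Mosix API), and LOAD_THRESHOLD stands for the user-specified threshold.

    #include <unistd.h>   /* sleep() */

    #define LOAD_THRESHOLD 0.25   /* user-specified imbalance threshold */
    #define SLEEP_SECS     5      /* pause between scheduling decisions */

    /* Placeholders for the PMClient and Migration Module interfaces;
     * these names are assumptions made for the sketch.               */
    extern double get_local_load(void);                 /* via PMAPI   */
    extern double get_remote_load(void);                /* via PMAPI   */
    extern int    pick_victim_pid(void);                /* e.g. backed by select_victim() above */
    extern int    request_migration(int pid, int node); /* Migration Module */
    extern int    pull_back_heaviest_migrated(void);    /* Algorithm 2 only */

    void scheduler_loop(int remote_node, int use_algorithm2)
    {
        for (;;) {
            double local  = get_local_load();   /* steps 1-2: fetch metrics and loads */
            double remote = get_remote_load();

            if (local - remote > LOAD_THRESHOLD) {          /* step 3b  */
                int pid = pick_victim_pid();                /* step 3b1 */
                if (pid > 0)
                    request_migration(pid, remote_node);
            } else if (use_algorithm2 &&
                       remote - local > LOAD_THRESHOLD) {   /* step 3a  */
                pull_back_heaviest_migrated();              /* step 3a1 */
            }
            sleep(SLEEP_SECS);                              /* then start over */
        }
    }

Algorithm 2 corresponds to running this loop with use_algorithm2 set; everything else, including the actual migration, is delegated to the Migration Module.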

These algorithms correspond to the 'Task Selection and Migration Phase'. The 'Migration Module' handles the actual migration by using the Mosix API. Both algorithms give better performance than static process assignment without any scheduling, as they try to balance the load on both machines. Algorithm 2 is more efficient than Algorithm 1, as the latter does not migrate processes back home even when the load on the other machine becomes high. The graphs later in the paper show the improvement in the completion times of processes scheduled with the algorithms over the ones run without them.

The advantage of the above scheduling algorithms over Mosix's resource sharing algorithm is that they give the user control of load balancing: the user can specify the load threshold on each machine, allowing the scheduler to make better decisions. The above methodologies have certain shortcomings, however. 1) They are decentralized, so every machine runs the scheduling algorithms, and it might happen that all highly loaded machines try to migrate their processes onto the same lightly loaded machine. This causes the lightly loaded machine (say 'X') to become heavily loaded within a short span of time. Now 'X' tries to migrate its own processes, or all the other machines which migrated their processes might take them back. This series of actions leads to frequent increases and decreases of load and does not help load balancing in any way. 2) Every machine might also try to migrate a communicating process to the machine with which it is communicating. This might lead to processes merely swapping machines, which does not reduce the communication intensity. Later we propose a solution for all of the above issues, along with providing the flexibility of scheduling control to the user, by means of a central scheduler.

Results

The diagrams below show the performance of the machines with sample test programs which are either CPU bound or I/O bound. The CPU bound processes benefited from our scheduling algorithms, which is evident from their faster execution, as seen in the graphs. Figures 4 and 5 show the performance of algorithm1 and algorithm2 respectively. The results show that algorithm2 is more efficient than algorithm1, as expected. The I/O intensive processes do not benefit much from the schedulers, as the system calls for files are always redirected towards the 'deputy' in the home node (see figure 6).

Figure 4: Performance of scheduling algorithm1 for CPU bound jobs. Graph of program size (X axis) vs. execution time (Y axis); series: with scheduling algorithm1, without scheduling algorithm.

Figure 5: Performance of scheduling algorithm2 for CPU bound jobs. Graph of program size (X axis) vs. execution time (Y axis); series: with scheduling algorithm2, without scheduling algorithm.

Figure 6: Performance of the scheduling algorithm for I/O bound jobs. Graph of program size (X axis) vs. execution time (Y axis); series: with scheduler algorithm2, without scheduler algorithm.

Centralized Scheduler

This is our next step in scheduling; we present the planned design and implementation method here.

[Figure 7 diagram: on each machine, the process and netstat PMDAs feed a PMClient, and a Local Scheduler with its Migration Module manages the local processes P1 ... Pn; the local schedulers on all machines report to a single Centralized Scheduler.]

Figure 7: Interaction of Modules with Central Scheduler

In this model, all the scheduling decisions are made by the centralized scheduler. An RPC communication mechanism is the best way to pass information from all the machines to the centralized scheduler. All individual machines would behave as RPC servers and the centralized scheduler behaves as the RPC client, asking for the process information from the individual machines (servers) at regular intervals of time through RPC function calls. The centralized scheduler would schedule the jobs in the following manner:

a. For all CPU bound jobs (with zero or very little network intensity), the centralized scheduler tries to balance the load (subject to the user's constraints, if any) on all machines. This would eliminate the earlier problem of frequently changing loads.

b. For all jobs with considerable network intensity, the centralized scheduler tries to create a graph/table which determines the communicating pairs (or sets) of processes. The scheduler then tries to reduce the network intensity by placing the communicating processes on the same machine. The central scheduler also tries to balance the CPU load caused by the network intensive processes. The phases of getting the metrics, constructing the graph of communicating processes, and getting the loads of the various machines are separated from the actual scheduling mechanism. This will help in changing the scheduling mechanism based on the user's requirements as and when needed. The centralized scheduler will then send a message to the individual machines (servers) to actually migrate the processes.

However, there are certain issues to be considered in the centralized scheduler implementation. The amount of information passed from the individual machines to the central scheduler might be enormous. For example, suppose a machine passes the following structure on every request for metrics (the value in parentheses indicates the size of each field in bytes):

    struct info {
        char  pid[2];          // (2) pid
        float CPU_Load;        // (4) the CPU load of this process
        int   sport;           // (2) source port
        int   dport;           // (2) destination port
        char  daddr[8];        // (8) destination IP address
        float net_intensity;   // (4) network intensity
    };

The total size of the structure is 22 bytes. The values of sport, dport and net_intensity will be zero for non-communicating processes. If a machine is loaded with many processes, then sending the above information for all processes frequently (say every 5-10 seconds) will be an overhead on the network, in addition to the actual communication of the processes themselves. We can reduce this overhead to a large extent by making the local schedulers more intelligent. All processes with only CPU load can be handled for migration by the local schedulers themselves. The local scheduler then passes on only the information about the processes which have some communication. The total amount of information passed would be much less than before, provided the network intensive processes are a small subset of the total processes running in the system. This scheme makes the whole process efficient by dividing the scheduling overhead among the different machines, and it also reduces the network load when scheduling is done frequently. If scheduling is done only occasionally (say, once every 15 minutes), then making the central scheduler handle all the work (scheduling both CPU intensive and network intensive processes) will be the more efficient way.
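As a rough illustration (the process and machine counts here are assumptions, not measurements): if each machine runs about 100 processes and reports the 22-byte record for every process once every 5 seconds, each machine sends roughly 100 x 22 = 2200 bytes per report, i.e. about 440 bytes/sec, to the central scheduler, and a cluster of 50 such machines would impose around 22 KB/sec of reporting traffic on the scheduler's link before any actual process communication is counted. If only, say, 10% of the processes are communicating and the local schedulers filter out the rest, this reporting traffic drops by roughly a factor of ten.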
Conclusions and Future Work

In the work done so far, we have prepared the infrastructure for load balancing on a Mosix cluster. We have tested it with our (local) schedulers, based on the simple load balancing algorithms discussed in the earlier sections. As mentioned earlier, the local schedulers have their own disadvantages, such as frequent variation of loads and frequent swapping of communicating processes. We plan to overcome these disadvantages by implementing the central scheduler. Also, extensive scripts will facilitate large scale testing with varying parameters for more complex scheduling algorithms.

References

[1] http://www.mosix.org/
[2] http://oss.sgi.com/projects/pcp
[3] http://www.cs.wisc.edu/condor/manual
[4] Watts, Jerrell and Taylor, Stephen, "A Practical Approach to Dynamic Load Balancing," IEEE Transactions on Parallel and Distributed Systems, Vol. 9, No. 3, March 1998.
[5] Amnon Barak, Oren La'adan and Amnon Shiloh, "Scalable Cluster Computing with MOSIX for LINUX."
[6] Amnon Barak, Avner Braveman, Ilia Gilderman and Oren Laden, "Performance of PVM with the MOSIX Preemptive Process Migration Scheme."
[7] Steve McClure and Richard Wheeler, "Mosix: How Linux Clusters Solve Real World Problems."