A Data Manager using Condor

Nitin Bahadur [email protected]

Abstract

The complexity and size of applications today are increasing by leaps and bounds. For fast computation, parallel operation is a must, but writing parallel code is not easy, especially when the resources available at runtime are not known in advance. The data manager is a paradigm for managing problems that have large data sets. Applications that use these large data sets can be run as stand-alone applications on one or more machines, or on Condor. The data manager abstracts the parallelism obtained by using multiple machines on the data set. This report describes the design and implementation of the data manager and the way it handles the data. We then show how two applications (viz. hash-join and semi-join) were built on top of the data manager. Performance numbers for semi-join show that in-memory semi-joins can be done extremely fast and that the data manager can be used to manage both the results and the data to be joined.

1. Introduction

Applications today operate on large sets of data. As the data size increases, so does the computation time. A natural way to increase throughput is to write parallel code and run the application on multiple machines. But writing parallel code is not easy, and having multiple machines available for execution is not always possible. The Condor distributed system provides users with idle machines to run their jobs on, so users can exploit idle CPU cycles and get their jobs done. The problem with writing parallel code for Condor is that one does not know the type and number of machines on which the user's job will run. Moreover, the number and identity of machines can keep changing over time. This presents a whole domain of problems that discourages writing parallel code to be run on Condor. One paradigm, Master-Worker (MW) [5], aims at solving some of these problems. But the MW-File approach does not work well for large data sets, since MW-File operates using files, which are very slow and require large scratch space. The MW-PVM version requires PVM to be installed on all machines. Another problem with MW is that it is blocking and event-driven, so data transfer cannot take place until the MW driver has control of execution. A single thread of control implies that either the MW driver executes or the application code does. This raises some issues, which are discussed in the next sections.

Goals

The goals of this project were to create an infrastructure that would:
1. Help an application manage its huge data set.
2. Distribute data to the various machines and collect results without the application being aware of how many machines are currently operating on the data and who has what data.
3. Provide a simple API for the application to plug its code in and get going.
4. Be non-blocking and provide two threads of control, one for the application and one for the data manager, so both can execute in parallel and communication is done only when required.

The rest of the report discusses how these goals were accomplished. Section 2 describes the design of the data manager while Section 3 covers the implementation. Two sample applications using the data manager are presented in Section 4 and some experimental results in Section 5. In Section 6 we discuss the limitations and known problems. Related work is presented in Section 7 and we finally conclude in Section 8.

2 Design

The data manager consists of two main components, as shown in Figure 1. The data manager itself is responsible for managing the data set, while the workers are responsible for operating on the data sets allocated to them. Section 2.1 covers the design of the data manager and Section 2.2 covers the worker design.


Figure 1: Data Manager and its Workers

2.1 Data Manager

The Data Manager is responsible for making sure that the application's entire data set has been operated upon. The Data Manager (DM) consists of two processes, the DM itself and the application process, as shown in Figure 2. Two processes are used rather than two threads because Condor does not support multi-threaded processes but does support forking.

Figure 2: Data Manager design overview

The application process informs the DM of the data size it wants the DM to manage. This data size can be in terms of bytes, records, hundreds of bytes, or any arbitrary demarcation made by the application. The user submits multiple worker jobs to the Condor system (or starts the jobs manually or via a script when Condor is not available). After initialization, the worker jobs contact the DM. The DM accepts JOINs from workers and keeps track of how much memory each worker has.

The DM keeps a record of how much data has been processed, how much is allocated, and how much is unprocessed. On a JOIN, the data manager checks whether some data is unallocated; if so, it tries to allocate a data set to the new worker. The actual allocation of data is done by the application, and the application informs the data manager of the actual size of the data allocated. (For example, data set 10-20 might be unallocated but only data set 10-15 fits into the worker's memory, so the application will allocate only 10-15 and inform the data manager accordingly.) Since the unit of data size is application dependent and only the application knows where to split the data (for example, some applications may require that data be sent only in multiples of 5 records), the decision of how much data to send is made by the application. The data selected by the application is then sent to the worker.

2.2 Worker

Figure 3: Worker design overview

Like the data manager, the worker also consists of 2 processes, a Worker process and an application process. The worker process sends regular "Hello" messages to the data manager to inform it that it is alive and kicking. It also receives data on behalf of the application and passes it on to the application process. The application process performs computation on the data allotted to it and sends back the results to the data manager.

3 Implementation

3.1 Data Manager

The Data Manager provides an interface to be used by the application. The application inherits from the data manager classes and implements some virtual functions (declared by the Data Manager classes). The main() is in control of the application. The Data Manager paradigm consists of two entities, a Manager and a Worker. The workers are the ones that actually compute on the data, while the Manager coordinates data transfer to the workers and receipt of results from them. The number of workers under the control of the data manager can vary over time, and this is transparent to the application.

3.1.1 Data Communication

The Data Manager consists of two processes. The parent process is the data manager, while the child process is the application. The application (child) process keeps doing its work and is interrupted by the data manager whenever some data has to be sent to a worker or some result or other data is received from a worker. The data manager interrupts the application process with a SIGUSR1 signal, which transparently calls the appln_packet() function that is to be implemented by the application. Data received by the data manager on behalf of the application is handed to the application in this function, and the application can then process the data packet. If the application wants to receive some data at that point, it calls the recv_data function provided by the data manager. Data transfer between the data manager and the application process takes place through pipes. Since a pipe limits the amount of data that can be transferred between processes, a shared memory region called "results" is created at init time. It can be used by the application process for receiving large data sets, so large receives take place through the shared memory region for efficiency. The same shared memory region can likewise be used for large sends.
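As an illustration of this callback path, here is a minimal C++ sketch of an application implementing appln_packet(). Only the function name and its role come from the report; the base-class shape, the parameter list and everything else below are assumptions made for the sketch.

```cpp
// Sketch only: class name and signatures are assumptions, not the real API.
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for the Data Manager application base class.
struct DMApplication {
    // Called transparently when the DM parent delivers data via SIGUSR1 + pipe.
    virtual void appln_packet(const char* sender_ip, int sender_port,
                              const void* data, std::size_t len) = 0;
    virtual ~DMApplication() {}
};

// Example application: accumulate result packets arriving from workers.
struct MyApplication : DMApplication {
    std::vector<char> results;

    void appln_packet(const char* sender_ip, int sender_port,
                      const void* data, std::size_t len) override {
        // Small packets arrive through the pipe; large ones come through the
        // shared "results" region (not shown here).
        std::printf("packet from %s:%d, %zu bytes\n", sender_ip, sender_port, len);
        const char* p = static_cast<const char*>(data);
        results.insert(results.end(), p, p + len);  // copy before the buffer is freed
    }
};

int main() {
    MyApplication app;
    const char sample[] = "partial result from a worker";
    app.appln_packet("10.0.0.7", 4321, sample, sizeof sample);  // simulated delivery
}
```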

When the data manager wants to transfer some data to a worker, it calls the appln_send_data() function, which is to be written by the application. The application decides how much data can be sent to a worker with available memory M and calls send_data to send the data over to the worker. Note that the appln_send_data function operates in the process space of the data manager, so no shared memory is required for this purpose. After the data has been sent to the worker, the application process (the child process on the data manager) is informed of this. Thus the application process can continue its operation while the data manager is transferring data (which can be huge) to the worker.
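A hedged sketch of what an appln_send_data() implementation might look like: it decides how many whole records fit in the worker's free memory and hands them to send_data. The record-based splitting, all signatures and the bookkeeping globals are assumptions, not the actual API.

```cpp
// Sketch: the real appln_send_data()/send_data() signatures are not documented
// here, so every name and parameter below is an assumption.
#include <algorithm>
#include <cstdio>

const long RECORD_SIZE = 128;         // bytes per record (illustrative unit of data)
long g_next_unallocated = 0;          // first record not yet handed to any worker
const long g_total_records = 1000000; // size of the data set

// Stand-in for the framework call that actually ships bytes to a worker.
void send_data(int worker_id, long first_record, long count) {
    std::printf("sending records [%ld, %ld) to worker %d\n",
                first_record, first_record + count, worker_id);
}

// Called in the DM's address space when `worker_id` has `free_memory` bytes free.
// Returns how many records were allocated so the DM can update its bookkeeping.
long appln_send_data(int worker_id, long free_memory) {
    long fit = free_memory / RECORD_SIZE;               // whole records only
    long remaining = g_total_records - g_next_unallocated;
    long count = std::min(fit, remaining);
    if (count <= 0) return 0;
    send_data(worker_id, g_next_unallocated, count);
    g_next_unallocated += count;
    return count;
}

int main() {
    appln_send_data(/*worker_id=*/1, /*free_memory=*/32 * 1024 * 1024);
}
```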

3.1.2 Check-pointing

Once every few minutes, the data manager checkpoints its state information (the worker list and the hash table of data allocation) to disk. This is done to recover from data manager crashes: if the data manager crashes, on restart it reads the old checkpoint file and resumes processing from there. Thus on restart it knows which data portions have and have not been allocated, which have been processed and which remain unprocessed, and the memory utilization of all workers.

3.1.3 Indication that data has been processed (mark_processed)

When some set of data has been processed, it is the responsibility of the application to call the function mark_processed() to indicate this to the data manager. Also, when a worker has finished computing on all sets of data allotted to it (a worker can be allocated multiple sets, e.g. 0-10 and 20-25), the application should again call mark_processed() to inform the data manager, so that the data manager can allocate more data to that worker. The reason the data manager should be informed for every completed data set, rather than only when a worker has completed all its data sets, is that a worker may "disappear" before it completes computing on all the data allocated to it. By keeping track of which data sets have been completed, we do not have to reallocate the whole data to another worker if a worker disappears; only the uncomputed data sets need to be reallocated. For example, suppose worker W was allocated sets 0-10 and 20-25. After computing on 0-10, it informed the application, which in turn informed the data manager. If the worker suddenly disappears while computing on 20-25, then when the data manager detects this it will allocate only 20-25 to another worker W2, not 0-10.
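A small sketch of this reporting discipline; mark_processed() is named in the report, but its signature and the Range type below are assumptions.

```cpp
// Sketch: report every completed data set individually so that only uncomputed
// sets need reallocation if the worker disappears mid-way.
#include <cstdio>
#include <utility>
#include <vector>

typedef std::pair<long, long> Range;   // inclusive data-set range, e.g. {0, 10}

// Stand-in for the DM call that records a range as processed.
void mark_processed(const Range& r) {
    std::printf("marking [%ld, %ld] processed\n", r.first, r.second);
}

int main() {
    // Worker W was allotted two sets: 0-10 and 20-25.
    std::vector<Range> allotted = { {0, 10}, {20, 25} };

    // Report each set as soon as it finishes, not only when all sets are done.
    for (const Range& r : allotted) {
        // ... compute on r ...
        mark_processed(r);
    }
}
```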

The protocol for coordinating the sending of data to a worker and the receipt of results from a worker has to be developed by the application. A simple protocol for sending data to a worker could be the following (sketched below):
1. Send an information packet informing the worker of the data set range and data size that is going to be sent to it.
2. Send the actual data.
Note that the worker must know how much data to receive so that it can post a corresponding recv_data.
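The following sketch shows one way such a two-step protocol could look. The InfoPacket layout and the send_data/recv_data signatures are assumptions; the point is only that the header travels first so the receiver can size its recv.

```cpp
// Sketch of a header-then-payload protocol between DM and worker.
#include <cstdio>
#include <cstring>
#include <vector>

struct InfoPacket {      // step 1: tell the worker what is coming
    long first_record;   // start of the data-set range
    long last_record;    // end of the data-set range
    long data_bytes;     // exact size of the payload that follows
};

// Stand-ins for the framework transfer calls (placeholders for the sketch).
void send_data(int /*worker_id*/, const void* /*buf*/, long /*len*/) { /* ... */ }
void recv_data(const void* /*from*/, void* buf, long len) {
    if (len > 0) std::memset(buf, 0, (size_t)len);   // placeholder "receive"
}

// Data-manager side: announce, then ship.
void send_to_worker(int worker_id, long first, long last,
                    const std::vector<char>& payload) {
    InfoPacket info = { first, last, (long)payload.size() };
    send_data(worker_id, &info, (long)sizeof info);          // step 1
    send_data(worker_id, payload.data(), info.data_bytes);   // step 2
}

// Worker side: read the header first so the recv size is known up front.
void receive_from_dm(const void* dm) {
    InfoPacket info;
    recv_data(dm, &info, (long)sizeof info);
    std::vector<char> payload(info.data_bytes);
    recv_data(dm, payload.data(), info.data_bytes);
    std::printf("received records %ld-%ld (%ld bytes)\n",
                info.first_record, info.last_record, info.data_bytes);
}

int main() {
    std::vector<char> payload(1000, 'x');
    send_to_worker(1, 0, 99, payload);   // announce records 0-99, then ship them
    receive_from_dm(nullptr);
}
```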

3.1.4 Other communication features (appln_send_default_data, broadcast_data and send_old_query)

The data manager also provides functionality for sending some default data to a worker once, when the worker starts. This data could be something like constants that the worker will need and that are not required to be sent with every data set transferred to the worker. Also, every time some data set(s) is/are transferred to the worker, the application can specify a callback function called appln_done_sending() to take some action. If the application wants to broadcast some data to all workers, it can call broadcast_data(). This broadcast data could be something like a SQL query or a message telling the worker to perform some particular computation on the data allotted to it. Whenever data is sent to a worker, this broadcast data is also sent to it at the end. Thus, using appln_done_sending() the application can customize the messages sent to a worker on data allocation, while using broadcast_data() the application can send a common message to all workers.
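For illustration, a sketch of how an application might use broadcast_data() and appln_done_sending(); both names come from the report, but the signatures, the SQL-query example and the stub bodies are assumptions.

```cpp
// Sketch: broadcast a query to all workers, react per allocation via a callback.
#include <cstdio>
#include <string>

// Stand-in for the DM call that queues data to be appended to every transfer.
void broadcast_data(const void* /*buf*/, long /*len*/) { /* ... */ }

// Application callback invoked after a data set has been shipped to a worker.
void appln_done_sending(int worker_id, long first_record, long last_record) {
    std::printf("worker %d now holds records %ld-%ld\n",
                worker_id, first_record, last_record);
}

int main() {
    // e.g. tell every worker which selection to run on the data allotted to it.
    std::string query = "SELECT * FROM sales WHERE item_no IN (dvd_players)";
    broadcast_data(query.c_str(), (long)query.size() + 1);

    appln_done_sending(3, 0, 499);   // simulated "done sending" notification
}
```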

3.1.5 Communication between the two processes running on the Data Manager (appln_specific_data)

Since the data manager and the worker each consist of two processes, with some application code executing in the address space of the data manager/worker, it may be necessary to update the data structures used by the application code that executes in this address space. The application process can do a send with null IP address and port number parameters to indicate that this data is to be sent to the application code on the data manager/worker. The appln_specific_data function is then called by the data manager/worker on such a send call, and data transfer between the two processes takes place, updating the required information. Since this data transfer takes place through pipes, the pipe limit (4 KB per transfer) must be taken into account when making such a send.
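A sketch of respecting the pipe limit when sending a local update; the null-IP/port convention is from the report, while the send_data signature and the handling of the 4 KB constant are assumptions.

```cpp
// Sketch: split a local (application-process to DM-process) update into
// pipe-sized chunks before handing it to send_data.
#include <algorithm>
#include <cstddef>

const std::size_t PIPE_LIMIT = 4096;   // pipes carry at most ~4 KB per transfer

// Stand-in: ip == nullptr and port == 0 mean "deliver to the application code
// running inside the DM/worker process" (the appln_specific_data path).
void send_data(const char* /*ip*/, int /*port*/, const void* /*buf*/,
               std::size_t /*len*/) { /* ... */ }

void send_local_update(const void* buf, std::size_t len) {
    const char* p = static_cast<const char*>(buf);
    while (len > 0) {
        std::size_t chunk = std::min(len, PIPE_LIMIT);
        send_data(nullptr, 0, p, chunk);   // null IP/port selects the local path
        p += chunk;
        len -= chunk;
    }
}

int main() {
    char update[10000] = {};               // larger than one pipe transfer
    send_local_update(update, sizeof update);
}
```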

3.1.6 Information kept about each worker, Hot-Stand-By list and reallocation policy

The data manager keeps the following information about each worker:
1) IP address
2) Port number on which it is listening
3) Total memory available
4) Currently free (unallocated) memory
5) TTL, i.e. whether it sent a Hello during the current refresh interval

This list is currently organized as a linked list, traversing which becomes cumbersome for a large number of workers; a hash table would be a better alternative for the next design. Workers that have free (unallocated) memory available are also added to a Hot-Stand-By (HSB) list, which is basically a list of pointers to workers with free memory. This list is scanned whenever there is more data to be processed and some worker has free memory.
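A sketch of this per-worker bookkeeping and the HSB list; the field names and container choices are assumptions, not the actual implementation.

```cpp
// Sketch of the per-worker record and the Hot-Stand-By (HSB) list.
#include <list>
#include <string>

struct WorkerInfo {
    std::string ip;          // worker's IP address
    int         port;        // port the worker listens on
    long        total_mem;   // total memory the worker offered at JOIN time
    long        free_mem;    // memory not yet covered by an allocation
    bool        ttl_hello;   // did a Hello arrive in the current refresh interval?
};

std::list<WorkerInfo>   workers;        // master list (a hash table would scale better)
std::list<WorkerInfo*>  hot_stand_by;   // workers that still have free memory

// Keep the HSB list consistent whenever an allocation changes free memory.
void note_allocation(WorkerInfo* w, long bytes) {
    w->free_mem -= bytes;
    if (w->free_mem <= 0)
        hot_stand_by.remove(w);   // fully loaded: no longer a hot stand-by
}

int main() {
    workers.push_back({"192.168.1.5", 5000, 32L << 20, 32L << 20, true});
    hot_stand_by.push_back(&workers.back());
    note_allocation(&workers.back(), 32L << 20);   // now fully allocated
}
```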

3.1.7 Hash table of data to be processed and sorting of IPs

Information about the data to be processed is kept in a chained hash table for quick access. Each hash-table chain entry keeps information about a data range (whether it has been processed or not and, if allocated, to whom). Each hash-table bucket also keeps the largest and smallest IP address among the IP addresses of workers that have some data allocated in the chains of that bucket. By keeping the largest and smallest IP, when a worker reports that it has finished computing all data allotted to it, we can easily narrow down the buckets from which data was allotted to it by comparing its IP with the smallest and largest IP of each bucket. Entries of the same type (e.g. PROCESSED, UNPROCESSED) within a bucket chain are regularly compressed to reduce the length of the chain. Initially the chain has only one entry, corresponding to the whole data range for that bucket. Then, as data is allotted, the single entry splits into two entries (a part allocated and a part unprocessed), and so on. When processing for chain entries is completed, the reverse process (chain compaction) ensues, so that when all data has been processed the chain again contains just one entry.
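To make the split/compaction behaviour concrete, here is a minimal sketch of one bucket chain; the entry layout, the use of std::list and all names are assumptions rather than the actual implementation.

```cpp
// Sketch of one bucket chain: range entries split when part of the range is
// allocated, and adjacent entries in the same state merge back (compaction).
#include <list>
#include <string>

enum State { UNPROCESSED, ALLOCATED, PROCESSED };

struct RangeEntry {
    long  lo, hi;        // data range covered by this entry (inclusive)
    State state;
    std::string worker;  // owner IP when state == ALLOCATED
};

typedef std::list<RangeEntry> Chain;   // one chain per hash bucket

// Allocate the first `n` units of the first UNPROCESSED entry to `worker`,
// splitting the entry if it is larger than the allocation.
void allocate(Chain& c, long n, const std::string& worker) {
    for (Chain::iterator it = c.begin(); it != c.end(); ++it) {
        if (it->state != UNPROCESSED) continue;
        long len = it->hi - it->lo + 1;
        if (n < len) {   // split: front part allocated, back part stays unprocessed
            RangeEntry alloc = { it->lo, it->lo + n - 1, ALLOCATED, worker };
            it->lo += n;
            c.insert(it, alloc);
        } else {         // whole entry fits
            it->state = ALLOCATED;
            it->worker = worker;
        }
        return;
    }
}

// Compaction: merge adjacent entries in the same state to keep the chain short.
void compact(Chain& c) {
    for (Chain::iterator it = c.begin(); it != c.end(); ) {
        Chain::iterator next = it; ++next;
        if (next != c.end() && next->state == it->state && next->lo == it->hi + 1) {
            it->hi = next->hi;
            c.erase(next);       // `it` stays valid for std::list
        } else {
            ++it;
        }
    }
}

int main() {
    Chain c = { {0, 99, UNPROCESSED, ""} };
    allocate(c, 40, "10.0.0.7");   // chain becomes [0-39 ALLOCATED][40-99 UNPROCESSED]
    c.front().state = PROCESSED;   // pretend the worker finished and reported back
    compact(c);                    // nothing merges yet: the two states still differ
}
```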

3.2 Worker

3.2.1 Initialization (worker_init)

The worker init function is responsible for contacting the data manager and informing it of the worker's existence, along with the memory available at the worker. The initialization also creates two shared memory regions, one for data sent by the data manager and the other for the results that will be generated by the worker. The application can specify the fractions of the total memory to be kept for data and for results. It is not possible to create a shared memory region equal to the worker's total memory: on some machines we were able to use up to half the available memory, while on others only about 35%. By creating a shared memory region, we avoid having to create a disk file to store the data and results, since the worker machine might not have sufficient disk space or the use of disk might be inadvisable.
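A sketch of how the two shared regions might be carved out before the fork; the fraction parameter and the use of anonymous shared mmap (which stays visible to the child after fork) are assumptions about how worker_init was actually implemented.

```cpp
// Sketch: split the worker's usable memory into a "data" and a "results"
// shared region before forking the application process.
#include <sys/mman.h>
#include <cstdio>
#include <cstdlib>

struct SharedRegions {
    void*  data;          // incoming data sets land here
    size_t data_bytes;
    void*  results;       // results produced by the application accumulate here
    size_t results_bytes;
};

SharedRegions worker_init_regions(size_t usable_bytes, double data_fraction) {
    SharedRegions r;
    r.data_bytes    = static_cast<size_t>(usable_bytes * data_fraction);
    r.results_bytes = usable_bytes - r.data_bytes;

    // MAP_SHARED | MAP_ANONYMOUS memory remains shared with the child after
    // fork(), so both the worker and the application process can touch it.
    r.data = mmap(nullptr, r.data_bytes, PROT_READ | PROT_WRITE,
                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    r.results = mmap(nullptr, r.results_bytes, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (r.data == MAP_FAILED || r.results == MAP_FAILED) {
        std::perror("mmap");
        std::exit(1);
    }
    return r;
}

int main() {
    // Only ~35-50% of physical memory could be obtained this way in practice,
    // so usable_bytes is what the worker actually reports to the DM.
    SharedRegions r = worker_init_regions(64u * 1024 * 1024, /*data_fraction=*/0.75);
    std::printf("data=%zu bytes, results=%zu bytes\n", r.data_bytes, r.results_bytes);
}
```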

3.2.2 Startup (worker_run)

This function forks off the child (application process) after creating the pipes for communication between the child and the parent. The child goes off to execute the application code while the parent waits for messages from the data manager. The data manager can, for instance, send a message saying that it is sending more data. On such a message the parent simply calls the new_appln_data function written by the application to store the incoming data in memory, and then informs the application (child) process of the newly received data. On receipt of any application-specific messages (messages meant for the application only), the parent simply passes them on to the child process. The worker also receives requests from the child to send and receive data on its behalf and satisfies those requests.
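A minimal sketch of this process structure: create the pipes, fork, run the application code in the child and the message loop in the parent. All function bodies are placeholders.

```cpp
// Sketch of worker_run(): pipes + fork, child = application, parent = worker.
#include <sys/types.h>
#include <unistd.h>
#include <cstdio>

int to_child[2], to_parent[2];   // one pipe per direction

void run_application_code() { /* application's main loop (child) */ }
void wait_for_dm_messages()  { /* parent: recv from DM, call new_appln_data, ... */ }

int main() {
    if (pipe(to_child) < 0 || pipe(to_parent) < 0) {
        std::perror("pipe");
        return 1;
    }
    pid_t pid = fork();
    if (pid < 0) { std::perror("fork"); return 1; }

    if (pid == 0) {
        run_application_code();   // child: the application process
    } else {
        wait_for_dm_messages();   // parent: the worker process proper
    }
    return 0;
}
```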

3.2.3 Bookkeeping features for data stored at the worker end (worker_add_to_data_list and list management functions)

The Worker provides an API for the application to keep account of what data sets have been allocated to it; the application can add, delete and query information about data sets using simple API calls.

3.2.4 Receipt of new data by the worker (new_appln_data)

The application process calls the usual send_data and recv_data functions to send and receive data. Since the data to be computed and the results generated can be large, these are stored in a shared memory region accessible to both the worker and the application process. The receipt of new data is done in the worker's address space by the function new_appln_data(), which is to be coded by the application. This function is called whenever the data manager wants to send new data to the worker. Note that it is possible that, while a worker is computing on some data, the data manager decides to send it more data because the worker has some unutilized free memory. Also, the worker does not receive data from the data manager when it first sends a JOIN message; it is the data manager that contacts the worker when it wants to send some data to it.

3.2.5 Periodic Keep-Alive information (send_refresh_timer)

The worker periodically sends "Hello" messages to the data manager. This keeps the data manager informed of the worker's existence.

3.2.6 Soft-kill of a worker (catching check-pointing and job-eviction signals)

When Condor sends a worker the checkpoint or job-evict signal, the worker catches it and sends a leave message to the data manager, so that the data manager can update its structures immediately instead of timing out on the worker's Hello messages before realizing that the worker has disappeared.

3.3 Other Common Implementation Modules

3.3.1 RPC mechanism for data transfer between the parent (data manager/worker) and the application process

An RPC mechanism is used for data transfer between the two processes running at the data manager (worker) end. When data meant for the application is received by the data manager (worker), it has to be transferred to the application process. To do this (see Figure 4), first a header packet is created containing information about the data's sender, the data size and some other fields. This header packet is written into the pipe from the data manager (worker) to the application, followed by the actual data. Finally, the data manager (worker) sends a signal (SIGUSR1) to the application process. All of this is done in the function extern_recv_data. On receipt of the signal, the application process enters the signal handler and reads the header packet. Based on the size mentioned in the header packet, the application reads the actual data from the pipe. The signal handler then calls the application code, giving it the data packet and information about the sender. The data packet is malloced by the signal handler at the application end, so it is the application's duty to free it. This RPC mechanism is hidden from the application, which need not even know of it. This complex RPC mechanism can be done away with once Condor supports multi-threaded applications.

Figure 4: Send-Recv RPC between the application and the DM (sender: write header fields to the pipe, write the actual data, send a signal; receiver: execute the signal handler, read and unmarshal the header fields, read the actual data)

A similar mechanism is used when the application wants to send (receive) data. The application simply posts a send_data (recv_data) call, but behind the scenes different things take place. The send_data (recv_data) call marshals the parameters of the call into a header packet, writes the header packet into a pipe, and signals the data manager (worker) using SIGVTALRM. On receipt of the signal, the data manager (worker) unmarshals the header packet, determines to/from whom and how much data is to be sent/received, and actually performs the send/recv. On completion of the send/recv, the data manager passes back the status code of the data send/recv along with the received data, if a receive was posted.
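The sketch below shows the forward (parent-to-application) direction of this hand-off for a single message; the header layout and all names are assumptions, and short-read/short-write handling is omitted. The reverse direction works analogously with SIGVTALRM.

```cpp
// Sketch of the pipe-plus-signal hand-off of Figure 4: the parent writes a
// header and the payload into the pipe and raises SIGUSR1; the child's handler
// reads the header, sizes a buffer, and reads the payload.
#include <signal.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

struct RpcHeader {
    char sender_ip[16];   // worker that originated the data
    int  sender_port;
    long data_bytes;      // payload size that follows the header in the pipe
};

int dm_to_app[2];         // [0] read end (application), [1] write end (DM)

// Application side: SIGUSR1 handler unmarshals the header, then the payload.
void on_sigusr1(int) {
    RpcHeader h;
    read(dm_to_app[0], &h, sizeof h);
    char* payload = static_cast<char*>(std::malloc(h.data_bytes));
    read(dm_to_app[0], payload, h.data_bytes);
    std::printf("appln_packet: %ld bytes from %s:%d\n",
                h.data_bytes, h.sender_ip, h.sender_port);
    std::free(payload);   // the application is responsible for freeing it
}

int main() {
    pipe(dm_to_app);
    pid_t pid = fork();
    if (pid == 0) {                     // child = application process
        signal(SIGUSR1, on_sigusr1);
        pause();                        // wait until the parent pushes something
        return 0;
    }
    // Parent = data manager: equivalent in spirit to extern_recv_data().
    const char data[] = "result bytes from some worker";
    RpcHeader h = { "10.0.0.7", 4321, (long)sizeof data };
    write(dm_to_app[1], &h, sizeof h);
    write(dm_to_app[1], data, h.data_bytes);
    sleep(1);                           // crude: let the child install its handler
    kill(pid, SIGUSR1);                 // wake the application process
    return 0;
}
```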

Data transfer of any type between the data manager and a worker takes place using TCP. Since opening and closing TCP sockets takes time, the API provides a keepalive option. If the application expects to post another send or receive call in the near future, it can turn on the keepalive flag during the send/recv call. As a result, the TCP connection (socket) is not closed after the send/receive completes, and the application can reuse the same connection (socket) for further data transfers until there are none.
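A sketch of the keepalive idea: cache open sockets keyed by (IP, port) and skip the close when the caller expects another transfer soon. The map-based cache and the send_data signature are assumptions about the actual implementation.

```cpp
// Sketch: reuse an open TCP connection across send/recv calls when keepalive=true.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstring>
#include <map>
#include <string>
#include <utility>

std::map<std::pair<std::string,int>, int> open_sockets;   // (ip, port) -> fd

int connect_tcp(const std::string& ip, int port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr;
    std::memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, ip.c_str(), &addr.sin_addr);
    connect(fd, (sockaddr*)&addr, sizeof addr);   // error handling omitted
    return fd;
}

int get_socket(const std::string& ip, int port) {
    std::pair<std::string,int> key(ip, port);
    std::map<std::pair<std::string,int>, int>::iterator it = open_sockets.find(key);
    if (it != open_sockets.end()) return it->second;     // reuse cached connection
    int fd = connect_tcp(ip, port);
    open_sockets[key] = fd;
    return fd;
}

void send_data(const std::string& ip, int port, const void* buf, long len,
               bool keepalive) {
    int fd = get_socket(ip, port);
    write(fd, buf, len);                       // short-write handling omitted
    if (!keepalive) {                          // caller expects no follow-up soon
        close(fd);
        open_sockets.erase(std::make_pair(ip, port));
    }
}

int main() {
    const char msg[] = "hello";
    send_data("127.0.0.1", 9999, msg, sizeof msg, /*keepalive=*/true);
    send_data("127.0.0.1", 9999, msg, sizeof msg, /*keepalive=*/false);
}
```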

3.3.2 Hello Mechanism and Timers

The data manager and the worker both use periodic timers. Since these timers need to be asynchronous (i.e., one should not block waiting for a timer to expire), they were implemented using the setitimer system call on Unix. The setitimer call periodically delivers a signal to the process that set the timer. It is very important that these timer signals not obstruct any ongoing critical activity such as sending or receiving data. It is all right if a timer is delivered late or missed for one periodic interval, but interruption of a send call requires coordination between the sender and the receiver to restart the send-recv process from where it was interrupted, which can get messy, since it becomes difficult to recognize whether an error is due to the other party's disappearance or due to a timer on either side. Hence we use periodic timers with restartable system calls. Unfortunately, such a timer comes in different flavors under Linux and Solaris: the sub-type of setitimer that specifies our required timer differs between the two platforms and may not be present on others. But this seemed the only way to implement asynchronous timers. An important point to note is that the signal handler that handles the timer signal must not contain any non-reentrant code.
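A sketch of such a timer: setitimer() delivers SIGALRM periodically, SA_RESTART keeps an interrupted send or receive from failing with EINTR, and the handler only sets a flag so it stays re-entrant. The ITIMER_REAL choice and the interval are illustrative; as noted above, the exact flavour differs between Linux and Solaris.

```cpp
// Sketch of a periodic, restartable Hello timer.
#include <signal.h>
#include <sys/time.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

volatile sig_atomic_t hello_due = 0;

// Handler stays re-entrant: just set a flag, send the Hello from the main loop.
void on_timer(int) { hello_due = 1; }

int main() {
    struct sigaction sa;
    std::memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_timer;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;            // restart interrupted sends/recvs
    sigaction(SIGALRM, &sa, nullptr);

    struct itimerval every = {};
    every.it_interval.tv_sec = 2;        // fire every 2 seconds...
    every.it_value.tv_sec = 2;           // ...starting 2 seconds from now
    setitimer(ITIMER_REAL, &every, nullptr);

    for (int hellos = 0; hellos < 3; ++hellos) {
        pause();                         // real code would be doing send/recv work
        if (hello_due) {
            hello_due = 0;
            std::printf("send Hello to the data manager\n");
        }
    }
    return 0;
}
```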

3.3.3 File Size Limitations

When using the data manager over the current Linux file system, the 2 GB file size limit can easily be insufficient for large data sets. The solution is to use multiple data files and inform the data manager that more files are to be processed. Whenever a file from the set has been processed, it is the duty of the application to inform the data manager that there is one more file to be processed, along with the number of records/dataset size in that file. The data manager resets its data structures and continues processing with the new file. It is the application's job to mmap this file to the same data region where the previous file was mmaped, so that the data manager can access it with ease. The data manager thus processes the files one by one and makes sure all the data has been processed.
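One way the file cycling could be done is sketched below: each new file is mmap()ed over the region used by the previous one (MAP_FIXED), so the data stays at the same address. This particular use of MAP_FIXED is an assumption, not necessarily what the implementation does.

```cpp
// Sketch: map the next data file over the region used by the previous file.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

void* map_next_file(const char* path, void* region, size_t region_bytes) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) { std::perror("open"); return nullptr; }

    struct stat st;
    fstat(fd, &st);
    size_t len = (size_t)st.st_size;
    if (len > region_bytes) len = region_bytes;   // never map past the region

    // Map the new file at the same address as the previous one.
    void* p = mmap(region, len, PROT_READ, MAP_SHARED | MAP_FIXED, fd, 0);
    close(fd);                       // the mapping stays valid after close
    if (p == MAP_FAILED) { std::perror("mmap"); return nullptr; }

    // At this point the application would reset the DM's bookkeeping, e.g.
    // "one more file, N records in it", before processing resumes.
    return p;
}
```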

Similarly, when the entire application data set has been processed, the data manager informs the application of this. It then resets its data structures to reflect that no data has been processed but that some data is already allocated to some workers (the data last processed by each worker). The application can then request the data manager to start processing the same data set again. This time processing starts instantly and no transfer is required, since the data was already allocated to the workers during the previous processing cycle. If the application wants to process a new data set (e.g. hash new data), the data manager has to be stopped and started again; currently there is no way to do this without stopping. A simple function call informing the data manager of "new processing" could be introduced to add this feature.

4 Sample Applications

Two sample applications were built using the data manager paradigm, viz. Hash-Join and Semi-Join. In this section we explain the working and implementation details of the two applications.

4.1 Hash-Join

4.1.1 Introduction

A join takes the Cartesian product of two relations and then applies a select condition on the result to filter out unwanted tuples. For example, consider two relations, one with information about customers and another with information about products purchased by customers at a retail store. A simple join query would take the Cartesian product of the two relations such that the customer name in both relations is John. One can then apply a projection on this result to filter out unwanted attributes (columns), i.e., find the addresses of all customers named John who have shopped at the retail store. The join is thus a powerful way of extracting related information from two relations.

Hash-Join is one of the techniques used to implement the join efficiently. It operates on two relations that have some attribute(s) in common. Both relations are hashed into buckets using the same hash function. The buckets have the property that tuples in bucket_1 of relation_1 can match (join) only with tuples in bucket_1 of relation_2. Then, for each bucket number i of both relations, we check whether a record in relation_1 matches a record in relation_2 based on the common attribute. The Cartesian product of the matching records is then output into the result vector. A minimal sketch follows.
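A minimal, self-contained hash-join sketch (build on relation 1, probe with relation 2); the tuple layout and the use of std::unordered_multimap are illustrative only, not the report's data format.

```cpp
// Minimal in-memory hash join on one join attribute (customer id).
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

struct Customer { int id; std::string name; };
struct Sale     { int customer_id; std::string item; };

int main() {
    std::vector<Customer> customers = { {1, "John"}, {2, "Mary"} };
    std::vector<Sale> sales = { {1, "laptop"}, {2, "mouse"}, {1, "monitor"} };

    // Build phase: hash relation 1 on the join attribute.
    std::unordered_multimap<int, const Customer*> buckets;
    for (const Customer& c : customers)
        buckets.insert({c.id, &c});

    // Probe phase: for each tuple of relation 2, look only in the matching
    // bucket; only tuples that hash to the same bucket can possibly join.
    for (const Sale& s : sales) {
        auto range = buckets.equal_range(s.customer_id);
        for (auto it = range.first; it != range.second; ++it)
            std::printf("%s bought %s\n", it->second->name.c_str(), s.item.c_str());
    }
}
```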

Consider the sales records of a computer store for a month, along with the details of that store's valued customers. If we wish to determine the preferences of those valued customers, we do a join of the two relations on the common attribute, the valued-customer id. This gives us the details of items purchased by our valued customers, and using the customer details (e.g. address, email id) we can send them special offers catered to their needs.

4.1.2 Implementation

In the data manager paradigm, both relations (customer and retail-store sales) were stored at the data manager end. As explained above, hash-join has two aspects: hashing the relations and performing the join. Both were done using the data-manager paradigm. For hashing, the raw data is sent to the workers; the worker application has the requisite code for hashing the data. The hashed data buckets are then either stored back on some storage device or sent back to the Data Manager for storage. Hashing is done for both relations. For the join on the hashed data, bucket(s) of hashed data of relation_1 are distributed among the workers, and then the corresponding bucket(s) of relation_2 are distributed to the workers for the join. There are different intelligent ways for an application to do the join with respect to space utilization and the storage of the two hashed relations. In our implementation, we tried to keep most of the hashed data of relation_1 in the workers' memory and kept piping in hashed data of relation_2 to the workers while getting back the processed results. By keeping most of relation_1 in memory, we avoid reading it again and again. Also, while one thread of the worker is processing some data, another could be receiving new data and a third could be sending back the results. Thus we can achieve parallelism among the operations within a join.

4.2 Semi-Join

4.2.1 Introduction

A semi-join operates on two relations that have some attribute(s) in common, called key attributes. It works well for cases where relation 2 is small in size. Relation 1 is hashed into buckets. The key attributes of the second relation are projected (i.e., only those attributes are selected from the set of attributes) for all records. The key attributes of relation 2 are then compared with the key attributes of the records of relation 1, and the records of relation_1 that match are output as the result. Thus the result of a semi-join is the matching records of relation_1.

Consider a set of records representing the sales of a computer store in a month. If we need information on the sales of all DVD players, we join the relation containing the item numbers of all DVD players in the store with the relation containing the sales records for that month. The output is the list of sales records for DVD players in that month.

A parallel select on an in-memory database is a special case of Semi-Join.

4.2.2 Implementation

The implementation of Semi-Join was similar to that of Hash-Join. The bigger relation, relation 1, is distributed among all the workers. The workers maintain an in-memory hash of the relation, so the relation need not be hashed before distribution. The key attributes of the second relation are then sent to all the workers; either all records of relation 2 can be sent, or just a few at a time. On the worker end, the hash function is applied to the key attributes of relation 2, and they are probed against the records of relation 1 in the resulting bucket. Those records of relation 1 that satisfy the criterion are then sent back as results. A worker-side sketch follows.
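A minimal worker-side semi-join sketch: relation 1 hashed in memory, the projected keys of relation 2 probed against it, and matching relation-1 records emitted as results. Types and values are illustrative only.

```cpp
// Minimal semi-join: probe projected keys of relation 2 against hashed relation 1.
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

struct SaleRecord { int item_no; std::string details; };

int main() {
    // Relation 1 (large): the month's sales, hashed in worker memory by item_no.
    std::vector<SaleRecord> sales = {
        {101, "DVD player, $89"}, {205, "keyboard, $15"}, {101, "DVD player, $95"} };
    std::unordered_multimap<int, const SaleRecord*> hashed;
    for (const SaleRecord& s : sales)
        hashed.insert({s.item_no, &s});

    // Relation 2 (small): only its key attribute (item numbers of DVD players)
    // is projected and shipped to the worker.
    std::vector<int> dvd_item_numbers = {101, 102};

    // Probe: every relation-1 record whose key matches is part of the result.
    for (int key : dvd_item_numbers) {
        auto range = hashed.equal_range(key);
        for (auto it = range.first; it != range.second; ++it)
            std::printf("result: %s\n", it->second->details.c_str());
    }
}
```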

5 Experimental Results

This section contains experimental results for the semi-join operation, showing the operation and efficiency of the data manager paradigm. Figures 5 and 6 show that, with time, the data manager gets more workers under its control and can distribute more data to them. At night, as people leave for home, their computers become free for acquisition and use by the Data Manager.

Within a period of 8 minutes, around 50 computers were acquired. Taking an average of 32 MB of memory per computer, this comes to 1.6 GB of memory.

Figure 5: Number of workers vs. time — workers acquired over time (includes time for data transfer)

Figure 6 shows that for a 1 GB database we can easily use the memory of the workers for data processing. If we have the nodes acquired beforehand, we can divide the data sent to each worker at a smaller granularity, thereby making parallel data transfer possible. This also increases parallelism in data processing. A positive side effect of this is ease of handoff: if a worker decides to leave, we need to allocate only a smaller chunk of memory on another worker to compensate for the leaving worker. Thus, if we have an idea of the average number of workers that might be available, we might be able to tune our applications for better performance. This is useful in big companies, where data crunching can be done at night when all the employees have gone home and we know that their computers can be used at that time.

Figure 6: Available vs. allocated memory (1 GB database) — worker memory utilization

Figure 7: Available vs. allocated memory (1 GB database) — bandwidth excess

Figure 7 illustrates that, as the number of workers increases over time, we have more memory available for use. These workers can be used as hot-stand-by workers in case a working worker has to leave because of the arrival of a red-eye employee.

Figure 8: Data processed vs. time (1 GB) — first-time semi-join

Figure 8 shows the progress of semi-join processing over time. The steep curve at the beginning corresponds to the smaller number of results being returned by some workers. As the result set grows, the data transfer time for the results becomes significant, indicated by a more gradual increase in overall processing time.

Figure 9 shows the progress of another semi-join using a different selection criterion. This criterion returns 10 times more results than the one used for Figure 8. As a result, the data processing time at the worker end is significant for this test case, and the results start pouring in after 1 second. A small dip in the middle may be attributed to slower workers. The curve remains flat for some time until the results arrive from one errant worker that took a long time to process its data. This graph shows that one could put some intelligence into the data manager to detect slow workers and consequently allocate less work to them. This is left as an optimization for further research and study.

Figure 9: Data processed vs. time (1 GB) — first-time semi-join

Figure 10 depicts semi-join processing over a 2 GB data set. The graph is similar to the one for 1 GB. The difference here is that, since the data set is larger, data distribution takes longer, which delays receipt of the results. After that we see a jump in the amount of results received. The results for the last 500 MB of data come in gradually, on account of slower workers, variable result set sizes and other unknowns.

Figure 10: Data processed vs. time (2 GB) — first-time semi-join for a 2 GB data set

Figure 11 shows the repeat processing time for the same query on the same data set. After the query was executed (for Figure 10), it was re-executed a second time, which resulted in a much faster response time from the workers. The linear graph shows that the only bottleneck was the data manager, since it could receive results from only one worker at a time. The faster response time is due to the fact that the first-time query processing also includes the time for hashing and rearranging the first relation in memory. Once the first relation has been hashed at the worker, it does not need to be hashed again, and that time vanishes from the query processing time.

Figure 11: Data processed vs. time (2 GB, repeat) — repeat processing time for the 2 GB data set

Figure 12 shows that as we repeat the same query over the same data again and again, its processing time remains almost the same. The high processing time for the first iteration is due to the reasons explained above.

Figure 12: Time for semi-join on 2 GB data — semi-join query processing over multiple iterations on the same data set

6 Proposed Extensions and Known Limitations

The Data Manager paradigm is very general and leaves room for improvements and optimizations. Some possible optimizations are listed below.

1. When a worker sends a LEAVE message, if another worker is available with free memory, a protocol could be established for direct transfer of data between the two workers, so that the data manager does not have to go to disk again to fetch the data.

2. The data manager and the application (child) could be made a single process if it is acceptable to block the application during data transfers. This would greatly simplify the design of the data manager.

3. Socket migration between processes could be used to let the application process (at both the data manager and the worker side) control all data sends and receives, with the data manager/worker acting only as a conduit that receives the first message and hands it (message and socket) over to the application to deal with. This would simplify the passing of messages between the application process and its parent (data manager or worker).

There are some interesting issues that need to be resolved for wider acceptance of the paradigm.

1. The Data Manager paradigm is not very useful for non-compute-intensive jobs, i.e. jobs where the data transfer time is not insignificant compared to the data-computation time. In such cases the DM becomes a bottleneck.

2. The Data Manager uses signals for communication between the processes on the Data Manager and those on the Worker. Applications currently need to write re-entrant code for some of the virtual function implementations.

7 Related Work

The most commonly known effort to tap CPU cycles and memory over a WAN is the SETI@HOME project [10], which aims to discover intelligent interstellar life by processing signals recorded from space. Any user can install a SETI client on their machine, and the client operates on the interstellar data when the user's machine is idle. The data and results are communicated via the SETI servers. Unlike the Data Manager, the SETI project was built with a particular application in mind and is not generic. Moreover, the Condor paradigm allows intelligent interaction among communities of users and machines, allowing a user to permit only trusted machines on the Internet to share resources. Various other projects have concentrated on utilizing idle CPU cycles, but none effectively deals with the intelligent utilization of both CPU cycles and memory.

8 Conclusions

We have presented a Data Manager for the efficient utilization of idle memory and idle CPU cycles. The Data Manager is middleware that can be plugged into any data-intensive application. It can be run with or without Condor; if running without Condor, the worker processes have to be started manually or by some cron daemon. The APIs are platform independent, and workers have been run on Linux x86, Sun SPARC and Solaris x86. We presented sample applications to show the viability and use of the paradigm.

Acknowledgements

I am indebted to Prof. Miron Livny, whose brainchild this project was. His guidance throughout the project was invaluable. I wish to acknowledge the members of the Condor team who have helped me with technical discussions and brainstorming sessions on the design and implementation. I am especially thankful to Doug Thain, Sanjeev Kulkarni and Derek Wright in this regard. I also thank my other colleagues at the University of Wisconsin for their help and support.

References

[1] The Condor Distributed System, www.cs.wisc.edu/condor

[2] D. H. J. Epema, Miron Livny, R. van Dantzig, X. Evers, and Jim Pruyne, "A Worldwide Flock of Condors: Load Sharing among Workstation Clusters", Future Generation Computer Systems, Volume 12, 1996

[3] Michael Litzkow and Marvin Solomon, Supporting Checkpointing and Process Migration Outside the UNIX Kernel, Usenix Conference Proceedings, San Francisco, CA, January 1992, pages 283-290

[4] Michael Litzkow, Todd Tannenbaum, Jim Basney, and Miron Livny, Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System, University of Wisconsin-Madison Computer Sciences Technical Report #1346, April 1997

[5] Jeff Linderoth, Sanjeev Kulkarni, Jean-Pierre Goux, and Michael Yoder, An Enabling Framework for Master-Worker Applications on the Computational Grid, Proceedings of the Ninth IEEE Symposium on High Performance Distributed Computing (HPDC9), August 2000, pp 43-50.

[6] Shapiro, Join Processing in Database Systems with Large Main Memories, Readings in Database Systems, pp. 128-140

[7] Patrick Valduriez, Georges Gardarin, Join and Semijoin Algorithms for a Multiprocessor Database Machine, ACM Transactions on Database Systems (TODS), January 1984

[8] Dennis Shasha, Tsong-Li Wang, Optimizing equijoin queries in distributed databases where relations are hash partitioned, ACM Transactions on Database Systems (TODS), May 1991

[9] Philip A. Bernstein, Dah-Ming W. Chiu, Using Semi-Joins to Solve Relational Queries, Journal of the ACM (JACM), January 1981

[10] SETI@HOME, http://setiathome.ssl.berkeley.edu/

[11] W. T. Sullivan, III, D. Werthimer, S. Bowyer, J. Cobb, D. Gedye, D. Anderson, Astronomical and Biochemical Origins and the Search for Life in the Universe, Proc. of the Fifth Intl. Conf. on Bioastronomy, 1997.
