A PARALLEL COMMUNICATION ARCHITECTURE

FOR THE LANGUAGE SEQUENCEL

by

JULIAN B. RUSSBACH, B.S.

A THESIS

IN

COMPUTER SCIENCE

Submitted to the Graduate Faculty of Texas Tech University in Partial Fulfillment of the Requirements for the Degree of

MASTER OF SCIENCE

Approved

Chairperson of the Committee

Accepted

Dean of the Graduate School

December, 2004


ACKNOWLEDGEMENTS

I would like to thank my committee chair Dr. Per Andersen for his wisdom, patience, and insight as a person and computer scientist; Dr. Nelson Rushton for his numerous ideas and contributions to SequenceL grammar and semantics, the inception of the token ring for dynamic load balancing SequenceL execution, and his contribution to the cluster; and Dr. Daniel E. Cooke for his enthusiasm, keen eye, cluster provisions, and the opportunity to work on a great research team.

Thanks also to Chris G. Fielder for help with cluster assembly and troubleshooting; Chris McClimmans for cluster maintenance suggestions; and Dr. Philip Smith for use of the Texas Tech HPCC computers and valuable MPI lessons.

A special thanks goes to my girlfriend Radoslava for her tolerance of me through a year of work.


TABLE OF CONTENTS

ACKNOWLEDGEMENTS ii

ABSTRACT v

LIST OF FIGURES vi

CHAPTER

I. INTRODUCTION 1

1.1 Document Overview 3

1.2 Introduction to the Language SequenceL 4

1.2.1 What is SequenceL? 4

1.2.2 Consume-Simplify-Produce 8

1.2.3 Normalize-Transpose-Distribute 10

1.3 Other Parallel Languages 12

1.4 SequenceL Implementations 14

II. LITERATURE REVIEW 19

2.1 Load Balancing 19

2.2 Token rings 23

2.2.1 IBM's Token Ring 23

2.2.2 "token rings" 26

III. METHODOLOGY 28

3.1 Inspiration 28

3.2 Cluster Implementation 31

3.3 Communication Model 33

3.3.1 Communication Model Problems 39

3.3.2 Communication Model Solutions and Changes 41

3.4 The SequenceL 44

3.5 Proof of Concept: Interpreter and Architecture 48

IV. RESULTS 56

4.1 Interpreter Testing 56

4.2 Distributing Parallelisms 60

4.3 Performance Analysis 66

4.3.1 Time metrics 66

4.3.2 Explanations 73

4.3.3 Other Considerations 77

4.4 Intermediate Code and Persistent Data 78

V. CONCLUSIONS 80

5.1 Suggestions and Improvements 80

5.2 Future Work 82

5.3 Closing Remarks 82

REFERENCES 83

A. GRAMMARS 86

B. CLUSTER SPECIFICATIONS 88

ABSTRACT

SequenceL is a language that discovers all parallelisms in a program from the nature of its execution cycle. This suggests the language is a good candidate for execution in a distributed memory, high performance computing environment. However, unrestricted execution of SequenceL parallelisms in distributed memory can lead to problems of granularity and load imbalance associated with the distribution of fully parallelized programs and data. This thesis is a proof of concept investigation into a token ring communication architecture to load balance SequenceL execution in a distributed memory environment. The thesis provides a background on previous work on SequenceL in distributed memory, research into dynamic load balancing and token rings, and a methodology for construction of the communication architecture. This thesis has achieved the following results:

• A non-recursive SequenceL interpreter written and tested on a set of SequenceL programs.

• A working distributed memory communication architecture and parallelized SequenceL execution

• A distributed memory representation of SequenceL

• A method of enforcing persistent SequenceL data and programs

• Performance measurements of SequenceL programs executed on the communication architecture


LIST OF FIGURES

1.1 TSpace Communication 15

2.1 Bisection Tree 22

2.2 Token Ring token 24

2.3 Token Ring frame 24

3.1 Single node concurrency in token-based operations 37

3.2 Token ring 38

3.3 Communication table after tuple offload 54

4.1 Dynamics of parallelisms trace output 59

4.2 Communication trace output 61

4.3a Illustration of communications occurring in lines 1-6 62

4.3b Illustration of communications occurring in lines 7-25 63

4.4 Communication trace output 64

4.5 Communication tree representation 64

4.6 Communication hierarchy 65

4.7a Time proportion for 2 processors 67

4.7b Time proportion for 3 processors 68

4.7c Time proportion for 4 processors 68

4.7d Time proportion for 5 processors 69

4.8a # of processors versus execution time, data size = 1000 70

4.8b # of processors versus execution time, data size = 2000 71

4.8c # of processors versus execution time, data size = 3000 71

4.9 Time results from variations in the upper bound profitability threshold 73

4.10 Time steps during distribution and aggregation of tuples 75

CHAPTER I

INTRODUCTION

SequenceL is a language that discovers all parallelisms in a program from the nature of its execution cycle. This suggests the language is a good candidate for execution in a distributed memory, high performance computing environment. However, unrestricted execution of SequenceL parallelisms in distributed memory can lead to problems of granularity and load imbalance associated with the distribution of fully parallelized programs and data. This thesis is a proof of concept investigation into a token ring communication architecture to load balance SequenceL execution in a distributed memory environment. The thesis provides a background on previous work on SequenceL in distributed memory, research into dynamic load balancing and token rings, and a methodology for construction of the communication architecture. This thesis has achieved the following results:

• A non-recursive C SequenceL interpreter written and tested on a set of SequenceL programs.

• A working distributed memory communication architecture and parallelized SequenceL execution

• A distributed memory representation of SequenceL

• A method of enforcing persistent SequenceL data and programs

• Performance measurements of SequenceL programs executed on the communication architecture

• Insight into the dynamics of distributed SequenceL execution

SequenceL frees the programmer of the burden of finding or explicitly marking parallelisms in code - all parallelisms are generated for the programmer. This is potentially a revolutionary innovation in computer science language theory, as currently we are unaware of any other language that has been recognized to do so. The advantage of this feature has implications in the scientific and high performance computing communities for researchers who seek quicker solutions to mathematical problems through parallel execution. Removing this difficulty will allow programming ease, save programming time, and potentially reduce cost. One line of parallel code has been estimated to cost an average of $800 [And].

However, with the burden comes control of execution. If the programmer is not required to be aware of the parallelisms found, then the execution of SequenceL cannot require the programmer to explicitly state how the parallelisms are executed. On a single processor this is of little concern. By default, a time-shared threaded process could execute code and emulate concurrent execution, or it could execute all of the code serially. In either scenario the code is executed on a single processor with little overhead.

When in a distributed memory environment, one must approach execution differently. An automatic solution is needed for balancing the automatically generated parallelisms of arbitrary programs. This raises the questions of where, how much, and when the parallel code should be executed and distributed.

A synchronous token ring network is proposed as a solution. SequenceL parallel tasks are passed between and executed on nodes in this network. Load information is stored in a token and communicated by default circularly through the nodes. When a node receives the token it updates the token with its current load estimation and, as in traditional token rings, is given the temporary exclusive right to communicate with other nodes before passing the token on. In this case node-to-node communication refers to offloading or sending parallel tasks to another node if a load imbalance is detected.

When a node is not engaged in sending load to another machine or token-based operations, it will compute parallel tasks. This allows serial execution of parallelisms by default and periodic dynamic load balancing when needed on the network. The nodes of the network are considered autonomous - decisions are made independently and all communication is peer-to-peer. There is no active arbitrator or server, thus bottlenecking is not an issue, and the token passing provides synchronicity between machines. The overall design aims to eliminate these common concerns and others found in previous SequenceL distributed research.

Two Beowulf clusters, a SequenceL interpreter, and the underlying communication architecture were built from scratch to test specific problem sets in SequenceL. Results of this study may draw insight into future implementations, modifications/revisions, and other dynamics of distributed SequenceL execution.

1.1 Document Overview

Chapter I continues with a brief history, introduction, and description of

SequenceL with examples. The chapter also touches on implementations of other parallel languages and concludes with a detailed look at previous SequenceL development. Chapter II provides a literature review covering both considerations of dynamic load balancing in distributed memory and token ring communication. The two-fold investigation provides an academic background for the basis of the methodology covered in Chapter III. Chapter IV presents results of this study and Chapter V states conclusions and suggestions for future work.

1.2 Introduction to the Language SequenceL

The SequenceL language was originally created in 1991 (by Dr. Daniel E. Cooke) and has since undergone a series of maturations continuing through to 2004. In 1995 the language was proven Turing complete. In 1999 it was discovered that SequenceL execution lent itself to the natural unfolding of data and control parallelisms [CoAn]. Work continued on the language and a SequenceL research team was organized. In 2002 a shared memory SequenceL implementation was completed, resulting in a series of semantic revelations that led to a drastic reduction in the grammar and a leaner, simplified execution cycle. In 2003, a version of SequenceL, written in the revised grammar, was executed in a distributed environment. As of 2004, a typing scheme is being applied along with research into SequenceL intermediate code.

1.2.1 What is SequenceL?

SequenceL is a functional language that operates on one data type - a sequence. The original idea of SequenceL was to provide a unique language where all data and control operations were contained in sequences. All SequenceL functions are expressed as operations on sequences or in terms of other functions that operate on sequences. In turn all computations yield sequences or functions that operate on sequences, and all terminating SequenceL programs return a sequence. There are a few other languages akin to this concept; NESL and SetL specifically operate on sequences and sets respectively, and both were developed around the same time period as SequenceL. Here are a few examples of small SequenceL operations:

[ ( [1, 2, 3] + [4, 5, 6] ) ^ 2 ] = [25, 49, 81]

A simple example of a SequenceL operation that squares the addition of two sequences.

abs([-1, 6, 9, -20, 8, 7, -3]) = [1, 6, 9, 20, 8, 7, 3]

Another that takes the absolute value of all elements in a sequence.

Aside from its orthogonal approach as a language SequenceL exhibits many other qualities. It has been described as a language for processing nonscalars [Coo], expressing solutions as abstract data products, and finding all implicit control and data parallelisms within a program [CoAn]. These aspects distinguish the language from all other languages and will be discussed in more detail throughout this chapter.

Nonscalars refer to sequences of length greater than 1, e.g. [1, 2, 3, 4, 5], whereas scalars refer to sequences of length 1, e.g. [45] (sometimes called "singletons"). When presented with operations between scalars and nonscalars SequenceL behaves as follows:

Scalar operator Scalar → Scalar
Scalar operator Nonscalar → Nonscalar
Nonscalar operator Nonscalar → Nonscalar

Most languages do not support nonscalar operations, while others that do will only support operations on nonscalars of the same length, e.g. NESL [BLE02]. SequenceL handles operations on nonscalars of different length by its normalize function. For example, given the sequences [1, 2, 3, 4, 5], [3] SequenceL normalizes the length of the smaller sequence to the length of the larger, i.e. [1, 2, 3, 4, 5], [3, 3, 3, 3, 3]. In this fashion SequenceL can compute results of nonscalars of different lengths:

[1, 2, 3, 4, 5] * [3] =[1, 2, 3, 4, 5] * [3, 3, 3, 3, 3] =[3, 6, 9, 12, 15]

Here is another example of normalization on multiple sequences:

[[8, 9, 10], [1, 2, 3, 4], [5, 6]] normalizes to: [[8, 9, 10, 8], [1, 2, 3, 4], [5, 6, 5, 6]]

Notice the repetition of each sequence to the normalized length of the longest sequence.

SequenceL can also normalize nested sequences within nonscalars:

[4, [5, 6]], [7, 8, 9, 10, 11] normalizes to: [4, [5, 6], 4, [5,6], 4], [7, 8, 9, 10, 11]

Since the sequence [5, 6] is considered an individual element of the left sequence it is repeated as a single element when normalized. The flexibility of normalization allows SequenceL to solve operations between nonscalars of any length without returning an error.
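The cyclic repetition that normalize performs on flat sequences can also be sketched in C. The sketch below is illustrative only; the function name normalize_flat and the plain integer-array representation are assumptions and are not taken from the thesis interpreter.

#include <stdlib.h>

/* Sketch only: extend a flat integer sequence to a target length by
   cyclic repetition, e.g. [5, 6] normalized to length 4 yields
   [5, 6, 5, 6], and [3] normalized to length 5 yields [3, 3, 3, 3, 3]. */
int *normalize_flat(const int *seq, int len, int target_len)
{
    if (len <= 0 || target_len < len)
        return NULL;
    int *out = malloc(target_len * sizeof(int));
    if (out == NULL)
        return NULL;
    for (int i = 0; i < target_len; i++)
        out[i] = seq[i % len];    /* repeat elements cyclically */
    return out;
}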

One can also view SequenceL as expressing high-level solutions as abstract data products where the underlying sequence operations are implied. This allows the programmer to declare intuitively the composition of the solution as one would think of or say the problem. Other languages lack this feature and require the programmer to explicitly code all operations on data. This is usually done through iterative or looping constructs, e.g. for each or while do - the presence of which is non-existent in SequenceL.

Looping in SequenceL can be forced through recursion or a generative function (explained later); however, most iterative operations can simply be implied. For instance, to write a program that multiplies two matrices in a traditional procedural language like C, one would need to allocate the memory and provide the following operations:

for (i = 0; i <= m.rows; i++) {
    for (j = 0; j <= m.columns; j++) {
        s = 0;
        for (k = 0; k <= m.length-1; k++)
            s += m[j][k] * m[k][i];
        mr[i][j] = s;
    }
}

Whereas in SequenceL, matrix multiply can be written as:

dotProd: [s]*[s] → s
dotProd(x,y) ::= sum(x * y)

matmul: [s]*[[s]] → [[s]]
matmul(a,b) ::= dotProd(a, transpose(b))

This can be read as - matrix multiply of two matrices "a" and "b" is the dot product of a matrix "a" with the transpose of another matrix "b", where the dot product of two matrices is the sum of the product of each of their rows. Notice the lack of a looping construct in the SequenceL code and the declarative method in which the code is presented. Here is another example of how SequenceL allows for programming intuition - an instantiation function:

instantiate: s*s*s → s
instantiate(var,val,char) ::= val when (char = var) else char

This function is used to instantiate a variable labeled as parameter "var" in some character string labeled as parameter "char" with a value "val" when a character in the character string is equal to the variable.

The line above the function definition is called the function signature. The astute functional programmer will notice this is akin to Haskell. Current SequenceL signatures specify the level of nesting each parameter of the function must be, i.e. [[s]] represents a doubly nested sequence (this is equivalent to a two dimensional array). The left side of the → refers strictly to the parameters the function takes and the right side refers to the return sequence of the function. A single s denotes zero levels of nesting. When a user-defined function such as "instantiate" above receives parameters in conflict with its predefined level of nesting, a normalize operation is invoked to reconcile the lengths and levels of nesting. For a more complete look at the SequenceL grammar see Appendix A.
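A signature check of the kind described above only needs the nesting level of an argument. The following C sketch is hypothetical; the tagged struct value representation and the function name are illustrative assumptions, not the data structures used by the thesis interpreter.

/* Hypothetical tagged representation of a SequenceL value: either a
   scalar or a sequence of nested values. */
struct value {
    int is_scalar;
    double scalar;          /* valid when is_scalar != 0 */
    struct value *items;    /* valid when is_scalar == 0 */
    int count;
};

/* Nesting level: a scalar is 0 (an "s"), a sequence of scalars is 1
   ("[s]"), a sequence of sequences of scalars is 2 ("[[s]]"), etc. */
int nesting_level(const struct value *v)
{
    if (v->is_scalar)
        return 0;
    int deepest = 0;
    for (int i = 0; i < v->count; i++) {
        int d = nesting_level(&v->items[i]);
        if (d > deepest)
            deepest = d;
    }
    return 1 + deepest;
}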

1.2.2 Consume-Simplify-Produce

This still leaves the unfolding of implicit data and control parallelisms, which is accomplished through the SequenceL execution cycle. The original cycle was termed "Consume-Simplify-Produce" but is now called "Normalize-Transpose-Distribute" after revisions to the grammar and semantics. Older versions of SequenceL using the "Consume-Simplify-Produce" (referred to as "CSP" from here on) cycle contained a collection of 3 constructs. Their usage and terminology is obsolete in the newer versions, but nonetheless some of the underlying semantics survive and are worth mentioning. The constructs are: regular, irregular, and generative.

The regular construct applies an operator to elements of corresponding cardinality in two normalized sequences (sequences of the same length). For example, the operation:

[1, 2, 3] + [4, 5, 6]

the regular construct now applies the "+"

[1]+[4] [2]+[5] [3]+[6]

= [5, 7, 9]

If two sequences are not normalized then normalization must take place.

[1, 2, 3] * [2]

normalize([1, 2, 3], [2]) = [1, 2, 3], [2, 2, 2]

[1, 2, 3] * [2, 2, 2]

the regular construct now applies the "*"

[1]*[2] [2]*[2] [3]*[2]

=[2, 4, 6]

The irregular construct differs from the regular construct only by applying operators to operands when a condition is true. The use of a when clause determines the outcome, shown here:

[1..5] when ([1..5] % 2) = 1 = [1, 3, 5]

Finds the odd numbers in the sequence [1, 2, 3, 4, 5].

The generative construct is the predecessor to SequenceL's current "gen" function (denoted "..") shown earlier, and has the same functionality. As mentioned earlier, SequenceL contains no traditional looping construct, thus all looping must be forced through recursion or the SequenceL generative construct. The generative construct or function simply produces the sequence beginning with the left hand operand and ending with the right hand operand, containing the incrementally ascending or descending elements between the two values respectively. Here is a look at the old versus the new:

Old             New
gen(1,...,5)    1..5

These 3 constructs, making up the backbone of the CSP cycle, work as follows:

Each element of a sequence is consumed and examined in parallel. A placeholder is used to mark the consumed element's location within the sequence. The element is then simplified, in which operations on the element are performed using the 3 constructs mentioned above. The resultant element is produced from the operations and is reinserted back into the sequence. Repetition of the CSP process occurs until all elements are at their simplest form.

1.2.3 Normalize-Transpose-Distribute

CSP, however, has been deprecated in favor of what is now considered the far more obvious "Normalize-Transpose-Distribute" cycle (referred to as "NTD").

The steps of the NTD cycle form a more explicit procedure for parallel execution. The subtleties of CSP were methodically dissected and described such that the newly formed NTD leaves nothing to the imagination as to how to execute parallelisms. NTD works as follows:

When SequenceL encounters an operation on a set of sequences the length of each sequence is checked. If the sequences are of different length, the length of the smaller sequences is normalized to the length of the longer sequence. The sequences are then transposed such that the ith element of each sequence is paired with the ith element of every other sequence. There are now i unique subsequences in the sequence. The operator is then normalized to length i and transposed among all i subsequences. Following this, each pair can be distributed in parallel and evaluated. Below shows the steps of the process; prefix notation is used for ease of demonstration:

  +([2,4,6,8,10,12], [1,3,5])
N +([2,4,6,8,10,12], [1,3,5,1,3,5])
T +([[2,1], [4,3], [6,5], [8,1], [10,3], [12,5]])
N ++++++([[2,1], [4,3], [6,5], [8,1], [10,3], [12,5]])
T [+([2,1]), +([4,3]), +([6,5]), +([8,1]), +([10,3]), +([12,5])]
D [+([2,1]), +([4,3]), +([6,5]), +([8,1]), +([10,3]), +([12,5])]

The resultant sequences are computationally independent and can be executed in parallel. NTD is repeated on nested sequences until all sequences are ground, i.e. completely computed. Here is another example of NTD on nested sequences:

~[false,true,[false,true],false] & true

1  N  ~~~~[false,true,[false,true],false] & true
2  T  [~false,~true,~[false,true],~false] & true
3  D  [true,false,~[false,true],true] & true
4  N  [true,false,~~[false,true],true] & true
5  T  [true,false,[~false,~true],true] & true
6  D  [true,false,[true,false],true] & true
7  N  [true,false,[true,false],true] & [true,true,true,true]
8  T  &[[true,true],[false,true],[true,[true,false]],[true,true]]
9  N  &&&&[[true,true],[false,true],[true,[true,false]],[true,true]]
10 T  [&[true,true],&[false,true],&[true,[true,false]],&[true,true]]
11 D  [true,false,&[true,[true,false]],true]
12 N  [true,false,&[[true,true],[true,false]],true]
13 T  [true,false,&[[true,true],[true,false]],true]
14 N  [true,false,&&[[true,true],[true,false]],true]
15 T  [true,false,[&[true,true],&[true,false]],true]
16 D  [true,false,[true,false],true]

Initially in the problem above we must first apply the '~' or logical not since it has higher precedence than the '&' (logical and). However, the sequence it is to be applied to is nonscalar. To remedy this, normalize and transpose place the function operator through the body of the sequence and allow for a distribute operation in parallel (lines 1, 2, 3). The same steps happen for the subsequence we encounter after the first distribute (lines 4, 5, 6). After all not operations are complete we move to the logical and. The left and right sequences are of different length so we apply normalize to the right to extend it to match the length of the left sequence (line 7). Then transpose is applied to the two sequences (line 8), followed by an NTD on the operator.
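For flat sequences, the N and T steps amount to cyclic repetition followed by element-wise pairing, after which each pair can be distributed and evaluated independently. The following C sketch of those two steps reproduces the integer example above; the names are illustrative and this is not the interpreter's code.

#include <stdio.h>

/* Sketch of Normalize-Transpose for two flat integer sequences: the
   shorter operand is repeated cyclically (normalize), then the ith
   elements are paired (transpose). Each pair could then be distributed
   and evaluated in parallel. */
void normalize_transpose(const int *a, int alen,
                         const int *b, int blen,
                         int pairs[][2])
{
    int n = alen > blen ? alen : blen;      /* normalized length */
    for (int i = 0; i < n; i++) {
        pairs[i][0] = a[i % alen];
        pairs[i][1] = b[i % blen];
    }
}

int main(void)
{
    int a[] = {2, 4, 6, 8, 10, 12};
    int b[] = {1, 3, 5};
    int pairs[6][2];

    normalize_transpose(a, 6, b, 3, pairs);
    for (int i = 0; i < 6; i++)             /* the distributable pairs */
        printf("(%d,%d) ", pairs[i][0], pairs[i][1]);
    printf("\n");
    return 0;
}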

1.3 Other Parallel Languages

Although we are not aware of any other language that has been written to find all data and control parallelisms by default, SequenceL is not alone as a parallel language. Other languages do allow for declaration of sequential versus parallel code on nested data structures. NESL and GpH are two definitive languages of this genre. NESL is described as the first nested data-parallel language [BHCS], where functions are applied concurrently over elements of a sequence, including nested subsequences. Parallel operations on sequences must be explicitly marked with bracket delimiters, highlighting sequential versus parallel code. At compile time, the marked parallel NESL sequences are flattened into vectors and operations are converted to VCODE (vector code). The resultant VCODE can be run on an assortment of vector processing computers including the Cray C90, Cray J90, IBM SP2, Intel Paragon and Connection Machine CM-5.

GpH (Glasgow Parallel Haskell) is an extension to Haskell98 with an integrated suite of sequential and parallel software tools. GpH defines two primitive constructs, "par" and "seq", to define parallel and sequential operations on code respectively. Upon execution a threaded approach is used to exploit parallelisms. Evaluation is guided by the use of "evaluation strategies" defined separately from the program. Strategies are used to guide parallel execution and prevent unstructured use of the "par" and "seq" primitives [LRSH]. GpH uses abstract C as an intermediate language and PVM as a set of communication libraries for parallelism [GPH]. GUM is the name of the runtime system providing realtime support for GpH parallel execution and has been ported to a range of UNIX based shared and distributed memory platforms.

Other parallel languages developed as extensions to Haskell include Eden and NEPAL. Eden focuses on explicit process management and communication in the form of threads and communication channels to expedite parallelism distribution at the back end. NEPAL (an extension to GpH) was developed as a successor to NESL, in which control parallelism is introduced by arbitrary evaluation of subexpressions on nested sequences called "parallel arrays" [CKLP]. For further reading see [And].

1.4 SequenceL Implementations

Since its inception several SequenceL interpreters have been written in Prolog as a means of rapid prototyping and testing. The Prolog interpreters can find and label parallelisms but do not exploit them. Although highly useful as a development tool, the interpreters limit research into parallel execution. Other implementations are necessary.

A C/Pthread shared memory compiler was written as a solution to achieve true concurrency. The compiler works by threaded evaluation of parallelism in the CSP cycle and has been tested on the SGI Origin 2000.

The first implementation of SequenceL in a distributed memory environment was done using IBM's TSpaces (also referred to as tuple spaces). TSpaces are a set of communication libraries that facilitate a producer-consumer work distribution model on a network of workstations. They were originally designed for distribution and load balancing of a multitude of tasks (e.g. web and printer services). The SequenceL TSpace model worked as follows. A producer, also thought of as a server, holds a collection of tasks called "tuples" and awaits requests for tuples by consumers (clients). When a request for tuples is received the producer sends arbitrary tuples to the awaiting consumer. Consumers receive the tuple and begin interpreting it as a task. Once completed a consumer will again request more tuples. By this method, only ready or "hungry" consumers will request work. Load balancing is achieved by servicing the hungry consumers with an asynchronous method of tuple distribution and consumption.

TSpaces, when applied to SequenceL execution, communicated computationally parallel tasks in the form of tuples, i.e. SequenceL program and data. The client machines requested, received, and ground these tasks and returned the ground tuples back to the server. Jython (an implementation of Python written in Java) was chosen as the interpreting language. The server did not participate in computation but was strictly dedicated to tuple maintenance and communication. Execution halted when all tuples were completely ground and reassembled on the server. The following is an illustration of the tuple space communication process servicing SequenceL computation:

[Figure: a tuple server (producer) exchanging tuples with tuple clients (consumers); tuples ready for computation flow to the clients and computed tuples flow back to the server.]

Figure 1.1 TSpace Communication

The previous TSpace implementation of a distributed SequenceL interpreter revealed fundamental issues of:

• Serialization

• Portability

• Scalability

• Slow execution

• Granularity

Although a reasonable approach to a distributed SequenceL environment, these problems emerged from the choice of design, language, execution environment, and communication libraries. The results were extremely influential in the current method of distributing SequenceL execution.

Serialization in this case refers to the flattening of an N-dimensional object into a 1-dimensional object so as to transmit the object over a network as a 1-dimensional bit stream. The previous implementation utilized one of Jython's built-in data types called a tuple (the tuple refers to a collection of two or more objects not necessarily of the same type). All serialization was handled automatically as a feature of the Jython language. The programmer had no control over the format of what was being distributed. This caused slow-downs and was considered an inefficiency.

Scalability was another major consideration for the client-server architecture. How many clients should be paired with a server? How many servers are needed? What kind of communication hierarchy should be enforced with multiple servers? One must also beware of problems with bottlenecking. When a server is overloaded with servicing too many communication requests, communication can slow to a halt. In a high performance computing setting this is undesirable and must be avoided. The tuple space implementation is by default a client-server architecture. It works well with a handful of machines; however, as more machines are added it becomes encumbered.

Consideration for portability of the current execution environment stemmed from previous work on the tuple space implementation. The previous environment executed on any machine on a network with an accessible IP address, IBM's TSpace communication libraries, and a Jython interpreter installed. Although these packages can be installed on a wide range of platforms, they are not common in high performance computing configurations and therefore are very rarely (if at all) used in conjunction as a high performance computing solution.

Slow execution can be attributed to a multitude of factors, many mentioned above. Jython as a language was probably the largest contributor. Jython is interpreted by Java, which itself is an interpreted language. Java is interpreted to Java byte code, which is then translated to machine-specific code. Altogether, to interpret a SequenceL program, one must interpret SequenceL code, then interpret Jython code, then interpret Java code, then translate the Java byte code to machine language, resulting in 4 levels of interpretation. This resulted in execution times orders of magnitude slower than C code. Other slowdowns might be attributed to the communication libraries. IBM's TSpaces (written in Java) were never geared toward high performance computing use, but more toward a multipurpose producer-consumer communication model.

Another problem overshadowing parallel implementations of SequenceL (including TSpaces) has been the issue of granularity in parallelism. How many and how minute of a parallelism does a system evaluate? The problem is inherent to SequenceL's discovery of all parallelisms and is one of the oldest and largest problems facing parallel execution. One cannot take a fully parallel course of execution without seeing a significant increase in communication or context switching overhead. Ideally parallelisms should be "throttled" to reduce the execution of excess parallelism. Unfortunately there has been no previous evaluation scheme introduced to perform such moderation in SequenceL. All previous and future implementations will have to contend with or solve this problem by a correct choice of parallel execution model.

These specific problems presented by the tuple space implementation, along with its communication-intense, fully parallel distribution and execution model, outlined the path to future work and development for SequenceL in a distributed environment. Much of the interest fueling this research was sparked by the SequenceL tuple space implementation.

CHAPTER II

LITERATURE REVIEW

This chapter introduces the topics of load balancing, and token ring communication architectures. It is designed to provide a skeleton of knowledge for the proof of concept of this thesis.

2.1 Load Balancing

Load balancing techniques can be split into two categories, static load balancing and dynamic load balancing. Static load balancing refers to the partitioning of a computational domain into tasks over a collection of processors such that each task is mapped to a processor and each processor has a relatively proportional amount of work before evaluation begins. Ideally the programmer should not be required to explicitly map tasks to processors. For applications with simple data, static load balancing can usually be automated by the application. Data can be more or less divided equally or proportionally among all processors. However, for more complex or irregular data, schemes, tools, or algorithms must be used [DHBS]. The process is called domain decomposition and can be represented as a weighted graph partitioning problem, in which vertices represent computation on data, and edges represent communication cost.

Profiling tools are used to establish the characteristics of the problem domains and weight the graphs accordingly [ToD].

Dynamic load balancing can be stated as the partitioning of a computational domain into tasks over a collection of processors, where each task is dynamically mapped to a processor during execution so as to provide an optimal distribution of work in real time. Watts and Taylor in [WaT] divide dynamic load balancing into five considerations paraphrased here:

• Load Evaluation - an estimate of a processor's load to determine if a load imbalance exists.

• Profitability Determination - a decision on when to initiate load balance. If the cost of the imbalance exceeds the cost of load balancing, then load balancing should be initiated.

• Work Transfer Vector Calculation - how much work should be transferred to what processor.

• Task Selection - tasks selected for transfer or exchanged to best fulfill the transfer vector.

• Task Migration - moving tasks from one processor to another; state and communication integrity must be maintained to ensure algorithmic correctness.

Load evaluation is extremely important to the success of a load-balancing scheme. Incorrectly estimating or ignoring a load imbalance can cause severe breakdown in overall effectiveness. Use of metrics including execution time, CPU cycles, memory usage, cache usage and others are system level methods of estimating load. Statistical analysis of load from other nodes in a network as in [AGM], or utilizing records of past execution behavior, are also used to determine load imbalance. Application level analysis, given some knowledge of the dynamics of a program, is yet another method. With a combination of these techniques one hopes to gauge the current or future load of a processor with respect to others in the system. Predictability of load imbalance through past behavior, profiling, or specific application insight is a highly sought quality of load evaluation.

Profitability determination refers to whether or not load balancing is opportune. Even if a load imbalance is detected, if the cost to balance exceeds the benefit of balancing then load balancing should not be initiated. Cost effectiveness of load balancing is usually predetermined by thresholds, or limits. If a load imbalance exceeds some maximum or minimum threshold then it might be perceived to be profitable to balance, otherwise it is not. Accuracy of threshold selection is another crucial factor in load balancing effectiveness.
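As a concrete illustration of threshold-based profitability determination, the C sketch below compares the local load estimate against the lightest reported remote load. The threshold value and the names are assumptions made for illustration; they are not values prescribed by this thesis or by the literature cited above.

/* Sketch of threshold-based profitability determination: balancing is
   judged profitable only when the local load exceeds the lightest
   remote load by more than a fixed threshold. */
#define PROFIT_THRESHOLD 100   /* illustrative minimum imbalance worth moving */

int profitable_to_offload(const int *loads, int nnodes, int my_rank,
                          int *target)
{
    int lightest = -1;
    for (int i = 0; i < nnodes; i++) {
        if (i == my_rank)
            continue;
        if (lightest < 0 || loads[i] < loads[lightest])
            lightest = i;
    }
    if (lightest >= 0 && loads[my_rank] - loads[lightest] > PROFIT_THRESHOLD) {
        *target = lightest;    /* node that should receive work */
        return 1;
    }
    return 0;
}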

The work transfer vector represents the transaction of load between machines in the form of a direction and a length. The direction refers to whether tasks are "pushed," "pulled," or exchanged from one processor to another, and the length represents the size or amount of tasks in the transaction. Pushing is a term used for busy-initiated load balances, where processors with greater amounts of load send to processors with less. Pulling, on the other hand, refers to idle initiation where load is requested and received by idle processors (or those with less work).

Task selection is the choice of which tasks will best fit the work transfer vector. Given tasks of different computational weight, i.e. duration or space, one must choose which local tasks can be gathered to distribute. This can involve the choice of a good bisector. A bisection is no more than the division of a problem into two smaller sub-problems [BEE]. The bisector represents the dividing point in a task at which part of the work can be broken off and sent to another processor. The following is an example of a bisection tree:

Figure 2.1 Bisection Tree

Each circle with a number can be thought of as a task with a computational weight. The trick of task selection is finding the correct branches of the tree to break off, pack, and offload to satisfy the work transfer vector and achieve an optimal balance. With homogeneous data the choice is much easier due to a more uniform division of work, resulting in a more balanced tree with each bisection. However, heterogeneous data requires a more meticulous approach.
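One simple policy for task selection, assumed here purely for illustration and not prescribed by the sources above, is a greedy walk over the local tasks until their combined weight covers the length of the work transfer vector:

/* Greedy task selection sketch: pick local tasks until their accumulated
   weight covers the amount of work the transfer vector asks to move.
   Returns how many tasks were selected. */
int select_tasks(const int *weights, int ntasks, int transfer_amount,
                 int *selected)
{
    int chosen = 0, moved = 0;
    for (int i = 0; i < ntasks && moved < transfer_amount; i++) {
        selected[chosen++] = i;    /* index of a task to migrate */
        moved += weights[i];
    }
    return chosen;
}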

Task migration involves marshalling selected data and tasks, and packing the data objects with any task state information into a transmittable form. The packed data is then sent to another processor, where data is unpacked and inserted back into memory in its original representation. Task state information is unpacked and tasks resume in their preserved state. Once the transaction is complete, old tasks and data must be removed from the sending machine. Task migration can occur on tasks in any varying state of evaluation i.e. complete, new, and mid-evaluation.

2.2 Token Rings

The term "token ring" refers to a wide range of token passing network architectures. These architectures are classified as such by their characteristic of passing of a token circularly through a network of machines, usually connected as a star.

However "Token Ring" with capital letters T and R refers to the mature IEEE 802.5 standards and protocols associated with the network architecture developed by IBM. To alleviate any confusion this spelling convention will be used for the rest ofthe document.

2.2.1 IBM's Token Ring

Token Ring was originally developed at IBM's Zurich laboratories in the late 1970's to service 10 to 100 bank tellers. The initial configuration physically wired machines in a loop to conserve the amount of cable otherwise used in a star network. This proved troublesome when the failure of one cable caused the system to fail. The result was to migrate to a star network and/or star networks connected as a ring.

The initial Token Rings operated at a ring speed of 1 Mbit/s and were later sold at 4 Mbits/s. Through the 1980's and 1990's Token Ring was in competition with the development of Ethernet. Fruits of the race introduced ring speeds of 16 and 100 Mbits/s and 1 Gbit/s, along with a migration from twisted pair copper cable to fiber optic cabling. Advances in Token Ring hardware adapters and bandwidth management were also a result. However, Token Ring usage and standardization has declined in recent years due to the overwhelming popularity of switched Ethernet [WIQ].

Token Rings pass a 24-bit token (shown in Figure 2.2) in a circular unidirectional fashion through a network of connected stations. Possession of the token grants the right of data transmission to the holding station. If transmission (to another station) is desired, a station changes the token to a data type called a "frame", inserts the addressing information and data for the target station, and transmits the frame to its successor in the ring. Frames are the fundamental unit of transmission in the token rings and consist of extra fields appended to the token for use in transmission (Figure 2.3). Each station in the ring will look at the addressing information of the frame and determine if it is the targeted receiver. If a station is not targeted for receiving the frame it will simply repeat the frame to its successor. When a target station identifies the frame as intended for itself, the station will read the frame information and again repeat the frame on to the other stations. Once a frame completes a full revolution through the ring and arrives at the originating station, the station destroys the frame and releases a new token.
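The per-station behavior just described reduces to an address comparison. The following C sketch of that decision is illustrative only; the struct layout and names are assumptions and do not reproduce the actual 802.5 wire format.

#include <stdint.h>
#include <string.h>

/* Illustrative frame with only the fields needed for the decision. */
struct frame {
    uint8_t dest[6];      /* destination address  */
    uint8_t source[6];    /* originating station  */
    /* remaining fields omitted in this sketch */
};

/* A station repeats every frame to its successor; it additionally reads
   the data when it is the target, and the originating station strips the
   frame and releases a new token once the frame has circled the ring. */
enum action { REPEAT, READ_AND_REPEAT, STRIP_AND_RELEASE_TOKEN };

enum action station_action(const struct frame *f, const uint8_t my_addr[6])
{
    if (memcmp(f->source, my_addr, 6) == 0)
        return STRIP_AND_RELEASE_TOKEN;   /* frame completed a revolution */
    if (memcmp(f->dest, my_addr, 6) == 0)
        return READ_AND_REPEAT;           /* targeted receiver            */
    return REPEAT;                        /* just forward downstream      */
}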

[Figure: 24-bit token layout - Starting Delimiter (8 bits) | Access Control (8 bits) | Ending Delimiter (8 bits)]

Figure 2.2 Token Ring token

[Figure: frame layout - Starting Delimiter (8 bits) | Access Control (8 bits) | Frame Control (8 bits) | Destination Address | Source Address | Routing Information | Data | Frame Check Sequence | Ending Delimiter (8 bits) | Frame Status (8 bits)]

Figure 2.3 Token Ring frame

The grey areas of the frame buffer represent the fields from a token. The white areas are the fields specific to a frame buffer. The starting delimiter of the token has a bit format as follows:

JK0JK000

Where the J's and K's represent hardware detectable code violations of the Differential Manchester Encoding scheme explained in [WiM]. The code violations are used to distinguish tokens from any other arbitrary bit patterns being transmitted.

The Access Control byte contains priority bits, reservation bits, a monitor bit, and a token bit in the following format:

PPPTMRRR

P - priority bit
T - token bit
M - monitor bit
R - reservation bit

Priority bits are used to define the priority of a token frame. The token bit is set to 1 when a frame follows the token and 0 when just a token is present. The monitor bit is used by a station called an "active monitor" (discussed later) to determine if a frame or token has circulated through the ring more than once. Reservation bits are used by stations of different priorities to reserve the token on a future pass if it is currently not available.
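Because the access control byte packs its fields into individual bits, extracting them is a matter of masking and shifting. The small C sketch below follows the field layout described above; it is illustrative and not taken from an actual 802.5 implementation.

#include <stdint.h>

/* Sketch of unpacking the access control byte (PPP T M RRR): 3 priority
   bits, a token bit, a monitor bit, and 3 reservation bits. */
struct access_control {
    unsigned priority;     /* bits 7..5 */
    unsigned token;        /* bit 4: 1 when a frame follows, 0 for a token */
    unsigned monitor;      /* bit 3: used by the active monitor            */
    unsigned reservation;  /* bits 2..0 */
};

struct access_control parse_access_control(uint8_t byte)
{
    struct access_control ac;
    ac.priority    = (byte >> 5) & 0x7;
    ac.token       = (byte >> 4) & 0x1;
    ac.monitor     = (byte >> 3) & 0x1;
    ac.reservation =  byte       & 0x7;
    return ac;
}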

The frame control bits are used by receiving stations to determine the frame type and frame management methods. The source and destination fields refer to the addresses of the sending and receiving stations respectively. Routing information is used only when a frame leaves a source ring. Data stands for the actual user-appended transmission information. The frame check sequence is used for storing and checking bits in a cyclic redundancy check to ensure validity of the data. The end delimiter is again a byte with code violations, and the frame status contains communication status of the current frame.

Designated stations called "active monitors" monitor the health of token revolutions and detect the presence of lost tokens with timers. They also remove stale tokens and frames (mentioned previously). The timer is set to a time just greater than the length of time it takes the longest frame to circulate. If a token is not received within the time limit, a new token is released. By this method, the active monitor always ensures that a token is on the ring. Should an active monitor fail, a station designated as a passive monitor will detect the lapse of the active monitor, broadcast the failure to the other nodes on the ring and initiate an election process for a new active monitor.

2.2.2 "token rings"

Token rings in general are more flexible than IBM's Token Ring and refer to any variety of implementations of token passing rings. Token rings can circulate tokens in either a bidirectional or unidirectional fashion, using synchronous or asynchronous communication, on a multitude of hardware platforms, with one or several tokens in transit. Implementations involving the use of one circulating token are commonly used for mutual exclusion to solve the critical section problem [Ros], [HiM]. This allows one machine the right to perform some atomic operation when in possession of the token that would otherwise be interfered with by concurrent operations of other nodes on the ring. Implementations as in [Jain] present a bounded token delay time called a "Token Holding Time" (THT) which further mimics this round-robin model with the element of time-sharing.

Fault tolerance, as hinted at earlier, is a prime tenet of most token ring implementations. A heavy focus has been reliability and stability of the system given the occurrence of transient faults. Several investigations have been made into the ability of a token ring to self-stabilize [Ros], [HiM], a concept first introduced by the renowned E. W. Dijkstra in 1974 [Dij]. Leader election schemes, excess token removal, and detection of failed nodes are all part of the fault tolerant schemes built into token rings. The unidirectional ring itself is conducive to locating faults:

"The ring topology with its unidirectional flow provides a robust environment for error isolation and recovery. Since each station knows its upstream neighbor, the first station (in order around the ring) to note an error will then know the fault domain." - Michael Willet

Monitoring nodes and enforcement of recovery algorithms are also integral to the reliability and stability of token ring architectures.

CHAPTER III

METHODOLOGY

3.1 Inspiration

Working with different people, languages, and computing tools provided inspiration for the methodology of this thesis. The following paragraphs in this section provide an insight into the background work leading up to the current methodology.

In order to tackle the language, speed, communication, and portability issues of the SequenceL Jython interpreter, the C language and MPI (Message Passing Interface) communication libraries were originally investigated as replacements for Jython and TSpaces respectively. C is a mature, heavily optimized compiled language for use in high performance computation and MPI is a standard means of communication in high performance networks. Both are native to Beowulf clusters, the much more economical and available alternative to shared memory supercomputers. The idea was to recreate the TSpace client-server execution model on a Beowulf cluster so as to make available a version of SequenceL to the research community that was highly portable, fast, and of potential research quality use to the scientific community. Implementations of MPI contain a wide array of asynchronous and synchronous communication calls that match and surpass the functionality of calls in TSpaces. All packing and unpacking of data, i.e. serialization, can be directly manipulated by MPI and C functions as well. MPI is also executable on shared memory machines, allowing distributed SequenceL code to run in both shared memory and distributed memory environments - a flexibility not afforded by other implementations.

The use of a Beowulf cluster meant that the execution environment would be limited to a geographically restricted network of computers dedicated to computation, a limitation not enforced with TSpaces. However, Beowulf clusters are increasingly popular with the scientific community. They make up 83 of the top 500 supercomputers in the world according to http://www.top500.org and are much more attainable than shared memory computers. A small cluster of, say, 16-32 nodes of the latest processors can cost several thousand dollars, whereas the equivalent number of processors in a shared memory machine can range from hundreds of thousands to millions. Rather than competing for execution time on expensive shared memory machines, a researcher could potentially have access to the more affordable, available, and mainstream Beowulf clusters utilized as supercomputers.

Initial testing of C/MPI applied to the TSpace asynchronous client-server model was done solving stochastic differential equations on a 32 processor SGI shared memory computer "Pleione" and 3 Beowulf clusters "Mathwulf", "Antaeus", and "Weland" (see Appendix B for statistics or visit http://www.hpcc.ttu.edu). The server node divided an integral of a function into subintervals and delegated integration of the subintervals to the client nodes. If the integration of the client's subinterval was not of satisfactory precision according to a predefined error value, the client would then subdivide its integral further, send half back to the server and begin integrating a smaller subinterval. When a client integrated an interval to a satisfactory precision it would return the value to the server and ask for more work. The server kept a collection of intervals and would continuously supply "hungry" clients with work as they returned completed intervals. This mimicked the producer-consumer model used in the TSpace implementation and proved that asynchronous, i.e. non-blocking, calls in MPI were a functional replacement for tuple space communication calls. It also provided a dynamically load balanced environment as desired.

Test results indicated the new approach was viable on both shared memory and distributed memory machines; however, the issues of bottlenecking and scalability still lingered. C and MPI might have provided a faster client-server architecture, but the issues inherent to the architecture remained. Another idea stemming from SequenceL research group discussion was to apply a token ring communication architecture to SequenceL execution. Token rings allow for synchronous communication among nodes in a cluster by circulating a token through the nodes and granting exclusive temporary rights of communication to the node holding the token. This method eliminates the need for a server and alleviates bottlenecking by placing small burdens of communication briefly on token holding nodes. It also allows for dynamic load balancing by giving nodes the opportunity to distribute work to other nodes when in possession of the token.

Furthermore, nodes would execute sequentially (not in parallel) by default since distribution could not occur without the token. Overall this would seemingly resolve many major concerns facing SequenceL distribution.

Replacing a client-server architecture with a token ring is a difficult task and presents new problems that need to be addressed. One must now consider new solutions to the 5 aspects of dynamic load balancing: load evaluation, profitability determination, work transfer vector, task selection, and task migration, along with schemes for initialization/static load balancing, and persistent data. Work toward the solution of these issues and implementation of such a token ring as an underlying communication architecture using MPI and a C SequenceL interpreter on a Beowulf cluster led to the methodology of this thesis.

3.2 Cluster Implementation

For testing purposes and exclusive access to a cluster, a heterogeneous cluster of Pentium processors was constructed from scratch. Spare computers were borrowed, donated, and commandeered while others were assembled from extra parts. A total of 9 Pentium processors were combined with a 3Com SuperStack II 12-port 10Base-T/100Base-TX switch in a star network configuration forming the "Dungeon" cluster. All CAT 5 network cables were hand made. Processor speeds ranged from 400 MHz to 2.4 GHz, with 2 computers being dual processor machines. For a more complete list of specifications of the "Dungeon" cluster see Appendix B.

Slackware 9.1 (available at http://www.slackware.org) was the operating system of choice for the cluster because of its slim utilitarian approach as a Linux distribution. The 2.4.24 version of the Linux kernel was installed and custom compiled on each machine to allow for a minimal amount of kernel processes and memory consumption during runtime. Custom compilations were also necessary due to extremely varied hardware on each machine. The 2.4.24 was chosen because it is a mature 2.4 series kernel. A 2.6 series kernel was available at the time but was ruled out due to risk of instability as a burgeoning kernel.

NFS (Network File System) daemons and kernel modules were installed on each machine. The frontend of the cluster was designated as the NFS server while the rest were NFS clients. The /usr/local and /home directories were shared from the server and mounted on each of the clients to allow seamless file system access to installed programs and files residing on the server.

MPICH version 1.2.5.2, an implementation of MPI available at http://www-unix.mcs.anl.gov/mpi/mpich/download.html, was installed with shared memory computation enabled (for the two dual processor machines) and ssh (secure shell) as the means of remote access and communication. Public RSA keys were exchanged for each machine so as to allow unprompted secure access to all other machines in the cluster. MPICH was chosen as an MPI distribution because of familiarity and use on the "Mathwulf", "Antaeus", and "Weland" clusters.

Specific problems resulted from the "Dungeon" cluster. One of the machines, either due to bad memory or bad hardware, crashed during execution several times. This machine was removed, only to find that the front-end (a 1.8 GHz Pentium 4) of the cluster was computing at a grueling pace in comparison to the rest of the machines. As a result the front-end was then swapped with another computer. The problem persisted, so the former front-end was also removed from the cluster.

Not only was hardware a battle, but dynamic load balancing schemes differ for heterogeneous clusters. This would prove difficult when creating an initial design for SequenceL and is outside of the scope of this thesis. After several months of testing on the heterogeneous cluster, the idea was abandoned for a faster homogeneous cluster. Nine Pentium 4 1.7 GHz machines were made available for research and utilized in constructing a new cluster named the "Beergarden" (for more complete hardware specifications see Appendix B). This cluster was originally tested with Rocks 3.2.0 (Shasta), a RedHat based distribution of Linux containing precompiled binaries and tools specifically for installation on clusters (Rocks is available at http://www.rocksclusters.org/Rocks/). However, due to the excessive configuration and time constraints, Rocks was replaced by Slackware 9.1.

Now that each machine was uniform in hardware, software installation and kernel compilation were easier. UDPCast (available at http://udpcast.linux.lu/) was used to create and distribute uniform hard drive images to all machines in the cluster. The 2.4.24 kernel and the same NFS configuration were maintained. The same version of MPICH was used except that shared memory execution was disabled since all of the machines had single processors.

3.3 Communication Model

A test token ring communication architecture was coded using MPI. The architecture utilized a few primitive MPI calls to pass an array of integers as a token circularly around the cluster. The array of integers represented the load of each machine, i.e. the ith position of the array contained a numeric value representing the load of the ith machine in the cluster. Dummy tuples consisting of random character data were produced and stored on each machine in a stack to represent the storage of SequenceL tuples in distributed memory. The load estimate of each node was calculated by simply taking the local stack size.

When a node gained access to the token, it examined the load estimates of all of the other machines and decided whether or not to offload or push its jobs to another machine. If an offload was in order then tuples residing locally on the machine's stack were popped and packed into a contiguous array in memory. The array was streamed over via MPI to a receiving node in need of more work, where it was unpacked and pushed as tuples onto the receiving node's stack. Whether or not an offload occurred, the node would update the token with its current stack size and send the token on to its successor node in the ring. This kept the load information in the token reasonably up to date.
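The offload step can be sketched with the MPI calls listed later in this section. The sketch below is a simplification and not the thesis code: tuples are assumed to be fixed-length character strings, and the blocking size message mirrors the buffer-size exchange mentioned below.

#include <mpi.h>

#define TUPLE_LEN 64        /* illustrative fixed tuple size         */
#define MAX_SEND  65536     /* illustrative outgoing buffer capacity */

/* Sketch of offloading tuples popped from the local stack: pack them
   into contiguous memory, tell the receiver how many bytes to expect
   (blocking), then stream the packed buffer with a non-blocking send. */
void offload_tuples(char tuples[][TUPLE_LEN], int ntuples, int dest,
                    char *packed, MPI_Request *req)
{
    int position = 0;
    for (int i = 0; i < ntuples; i++)
        MPI_Pack(tuples[i], TUPLE_LEN, MPI_CHAR,
                 packed, MAX_SEND, &position, MPI_COMM_WORLD);

    /* blocking size message so the receiver can post a matching receive */
    MPI_Send(&position, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);

    /* non-blocking send of the packed tuples themselves */
    MPI_Isend(packed, position, MPI_PACKED, dest, 1, MPI_COMM_WORLD, req);
}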

The program halted when all tuples on all machines were ground and the stack size for each was zero. The architecture designated one node as the token ring "monitor"; this node produced the original token and kept track of the number of token revolutions transpiring through the ring. When the monitor received a token indicating all tuples were ground it sent a termination message to all other nodes.

Almost all communication was asynchronous (i.e. non-blocking). Blocking calls were only used briefly when sending buffer size information between nodes so as to anticipate the size of large streamed data. For the majority of runtime a node could send and receive the token, and send tuples concurrently with computation. Atomicity between communication and computation was only enforced when packing, unpacking, pushing, and popping to and from the stack. The following MPI primitive communication calls were used:

MPI_Send - blocking send
MPI_Recv - blocking receive
MPI_Isend - non-blocking send
MPI_Irecv - non-blocking receive
MPI_Barrier - blocks until all processes reach the barrier
MPI_Test - tests if an asynchronous receive has completed
MPI_Pack - packs basic data types into contiguous memory
MPI_Unpack - unpacks basic data types from contiguous memory

MPI uses the SPMD (single program multiple data) convention in which all nodes execute the same program and the unique role of each processor is determined by conditional branching statements based on processor ranks assigned by MPI at runtime.

An example of basic MPI code with role delegation is as follows:

if (my_rank == 0)
    MPI_Send(buffer, count, MPI_INT, 1, tag, MPI_COMM_WORLD);

if (my_rank == 1)
    MPI_Recv(buffer, count, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);

The fourth parameter represents the destination/source of the message. By use of the NFS server and mounted directories all processors receive an identical copy of the executable. In client-server architectures roles are more definitive; however, in the token ring all of the nodes perform similar tasks, thus there is less conditional branching based on rank.

Figure 3.1 is an illustration of concurrency and the execution steps taken to receive a token, pack/send tuples, and send the token on a single node in the cluster. The rounded rectangular objects represent data. "Token buffer" is a buffer in which a token is sent or received from. "Packed buffer" is a buffer in which packed tuples are sent or received from. "Stack" is the main stack containing locally based tuples. The circle, labeled as "Main," represents the main thread of execution, while the semicircles attached to the data represent concurrent non-blocking calls. Those colored in grey are executing concurrently with the main thread. Red arrows represent a transfer of data to or from the storage devices. Step 1 illustrates concurrent access to the main stack (in which tuples are popped and ground), along with the reception of an incoming token to a receiving token buffer, i.e., an MPI_Irecv. In step 2, the non-blocking receive expires upon completing reception of the token, and load estimates from the token are compared to the main stack. Step 3 shows the packing of elements from the main stack to packed buffers ready to be sent to other processors. In step 4, non-blocking sends (MPI_Isend) are initiated on any packed buffers awaiting shipment to another processor, while an outgoing token buffer is created and updated with the local stack size. Step 5 shows full concurrency between offloading packed tuples, offloading the current token, awaiting a new token, and grinding elements from the main stack.

Figure 3.1 Single node concurrency in token-based operations

Figure 3.2 is a global view of the token circulation and packed tuple offload.

Figure 3.2 Token ring

When not engaged in token-based operations, a node ground tuples by default. This directly combated execution of fine-grain parallelism by forcing each node in the cluster to execute sequentially. Parallel execution only occurred when a node possessed the token and deemed load balancing profitable. For the majority of the time, any given node would not hold the token. Thus the combination of the synchronicity of token passing with a parallel-restrictive profitability determination (explained in 3.5) gave way to a removal of excess parallel communication and computation.

"By packing several parallel objects into a single indivisible computing grain i.e., by performing a dynamic grain packing all intra-grain operations can be faster if their execution is sequential (i.e. if the operations are serialized)"

"Dynamic removal of excess parallelism should encourage the programmer and/or the pre-processor to detect and specify all the parallelism opportunities and let the RTS to remove parallelism in excess on the fly."

Joao Luis Sobral, Alberto Jose Proença

Floating-point operations were used to simulate the time delay in computation or "grinding" of tuples. The stacks of each node were populated with hundreds of thousands of dummy tuples, making an aggregate total of a few million. More tuples were introduced randomly when grinding to simulate SequenceL's dynamic flux in parallelisms. Initialization and the specific concerns of dynamic load balancing will be addressed later in the proof of concept.

3.3.1 Communication Model Problems

When using MPI non-blocking calls, the programmer is required to test their completion with a simple call to MPI_Test. Thus, to determine if an MPI_Irecv has completed reception of a token, the programmer must explicitly halt grinding, test the buffer, and then resume. This raises the questions: When is it a good time to check the buffer? How often should one check the buffer? What ratio of time should one spend checking and performing buffer operations versus grinding? These are questions the MPI programmer should not have to answer. If the programmer overstates the necessity to check the buffer, then too much time is wasted in token checking and maintenance. If the programmer understates the necessity, then load imbalance may result due to a node's ignorance of possession of the token. If the programmer does resolve an optimal ratio of checking versus grinding, it may not be applicable to other clusters since processor speeds may differ from cluster to cluster. The problem is even more apparent on a heterogeneous cluster. If a check ratio is established and applied to all machines, fast and slow, the faster machines will wind up checking more often than slower ones, thus wasting more computation time.
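The dilemma can be seen in a minimal sketch of the grind/check loop. CHECK_INTERVAL is exactly the tuning knob in question; grind_one_tuple(), process_token(), and work_remaining() are illustrative placeholders, not the thesis code.

#include <mpi.h>

#define CHECK_INTERVAL 1000                  /* the contentious ratio knob   */
#define TOKEN_TAG 3

extern void grind_one_tuple(void);           /* evaluate one local tuple     */
extern void process_token(int *token);       /* act on a received token      */
extern int  work_remaining(void);

void grind_and_poll(int *token, int nodes, MPI_Comm comm)
{
    MPI_Request req;
    MPI_Status  st;
    int flag, ground = 0;

    MPI_Irecv(token, nodes, MPI_INT, MPI_ANY_SOURCE, TOKEN_TAG, comm, &req);

    while (work_remaining()) {
        grind_one_tuple();                   /* sequential computation       */
        if (++ground % CHECK_INTERVAL == 0) {/* how often should we look?    */
            MPI_Test(&req, &flag, &st);      /* poll the pending receive     */
            if (flag) {                      /* a token has arrived          */
                process_token(token);
                MPI_Irecv(token, nodes, MPI_INT, MPI_ANY_SOURCE,
                          TOKEN_TAG, comm, &req);
            }
        }
    }
}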

Other negative side effects result when testing the token buffers too often; the result is unhindered revolutions of the token through the ring, meaning the token will arrive and depart from nodes more frequently if checked and sent more frequently. So not only will the ratio of check time to grinding cause computational delays, but each check also has a higher probability of detecting the successful reception of a token due to the faster revolution speed of the token. Thus more token maintenance and more communication result.

Original ideas to combat the "hot-potato" style of token revolution included switching away from MPI calls to a serial network and socket programming architecture for token passing. By the use of a serial network, MPI_Test calls could be replaced with operating system interrupts signaling the successful arrival of a token. This would eliminate the guessing factor associated with MPI_Test. However, with more thought, the idea was thrown out. The reason is that token revolutions would still increase, since grinding would be interrupted immediately to service communication. This gave token passing an even higher priority over grinding, despite the fact that the check/grind ratio would be eliminated with interrupts.

Another idea was to introduce a bounded delay for token reception. The delay would be based on some time value ultimately independent of the processing speed of each node in the cluster. This would assure that the token revolutions were moderated at a certain speed, reducing any excess communication. However, this is akin to the first problem of the check/grind ratio. The bounded delay can simply be seen as a fixed ratio of time between computation and communication. This would not port well to other clusters, since a fixed delay on one cluster may be less than optimal on another.

3.3.2 Communication Model Solutions and Changes

The methodology finally adopted frees the MPI programmer from over-checking or under-checking the MPI_Test call. In order to do this, two thresholds were introduced, one for the condition of an overburdened machine (i.e., one with too much load) and one for an underburdened machine (one with little or no load). Again, load estimation was strictly based on stack size, since each tuple in the stack represents the equivalent of one parallel task. CPU and memory usage were not considered, although as the stack size increases so does the amount of memory consumed. A machine operating with load within the thresholds does not check for, or accept, tokens. When load exceeds or falls below either threshold respectively, token reception is once again enabled. This method empowers machines with ample load with the right not to engage in unnecessary communication. Only machines in need of communication are granted the right to communicate by possession of the token.

However, when allowing for unresponsive nodes, the circular convention of token passing becomes disrupted. If, for instance, 3 of 10 nodes were outside of the load thresholds, then only these 3 nodes would be able to communicate. Attempts to send the token to the other 7 would fail. A node possessing the token would by default attempt to pass the token to its assigned successor in the ring (e.g., if the current node's rank is 1, it will attempt to pass to the node with rank 2). If the successor does not respond (i.e., it is in an optimal load range), the node will spin and attempt to pass the token to the next node after its successor. This process of skipping unresponsive nodes repeats until a node is found that will accept the token. If no other nodes are accepting token communication, then the node in possession will attempt to grind its tuples for a brief period of time and then try sending the token again. Consequently, this could leave a few nodes with more communication when spinning through a ring of mostly balanced nodes. The effect is considered a trade-off for the elimination of the check/grind ratio.

Changes made in the communication model to enable this new style of load balancing resulted in unreliable message passing. This occurred when a node attempted to pass the token to an unresponsive, load-balanced node. The node in possession of the token initiated a non-blocking MPI_Isend to the target node and waited for a specific amount of time. If a corresponding MPI_Irecv was not initiated on the target node within the set time limit, a timeout was declared, the message was cancelled with MPI_Cancel, and the node would spin to pass the token to the next node. However, in most cases MPI_Cancel resulted in a "no-op" call in which no operation actually cancelled the MPI_Isend. It followed that some of the tokens were still being sent after the send call had been cancelled. Other experiments with different MPI send calls, namely MPI_Issend (the MPI non-blocking synchronous send), failed as well. When investigated further, it was found that canceling non-blocking MPI send calls is a common problem. The cancel feature is in fact implementation specific. Given that MPI is a standard and a specification for message passing, MPI vendors have the liberty of choosing their own methods of implementation, some of which vary. MPICH does not support reliable cancellation of non-blocking MPI send calls, whereas other versions of MPI, e.g., LAM MPI, do. This directly affects the portability of an MPI communication based token ring; as a resolution, MPI was eliminated as the method for sending and receiving the token. All other MPI calls remained, for instance those used to pack and unpack, and to send and receive packed tuple buffers.

In order to materialize a working model, socket programming was used in lieu of MPI calls. Non-blocking socket "connect" calls were used to attempt to connect to target nodes. If a connection could not be made with a target machine, connection attempts spun to another node. By failing to connect, the passing of the token was effectively cancelled and reliability reestablished.

On nodes willing to receive the token a listening socket was opened. If a connection attempt was detected, the listening socket would return another socket opened for reception of the token. When a node achieved optimal load, the listening socket was temporarily closed. An example of the spin cycle is as follows:

attempting to connect to 192.168.0.2... no connection established. spinning.
attempting to connect to 192.168.0.3... no connection established. spinning.
attempting to connect to 192.168.0.4... connection established. sending token.
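The connect-and-spin cycle can be sketched roughly as follows, assuming each node listens on a fixed token port and node_addr[] holds the ring members' addresses. The port number, the send_token() helper, and the timeout are illustrative, and error handling is abbreviated.

#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <fcntl.h>
#include <unistd.h>

#define TOKEN_PORT 5000                            /* illustrative port      */

extern void send_token(int fd);                    /* stream the token out   */

int pass_token(int my_rank, int nodes, const char *node_addr[])
{
    int i;
    for (i = 1; i < nodes; i++) {                  /* spin around the ring   */
        int target = (my_rank + i) % nodes;
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in sa;
        struct timeval tv = { 0, 200000 };         /* 200 ms connect window  */
        fd_set wfds;
        int err = 0;
        socklen_t len = sizeof err;

        memset(&sa, 0, sizeof sa);
        sa.sin_family = AF_INET;
        sa.sin_port   = htons(TOKEN_PORT);
        inet_pton(AF_INET, node_addr[target], &sa.sin_addr);

        fcntl(fd, F_SETFL, O_NONBLOCK);            /* non-blocking connect   */
        connect(fd, (struct sockaddr *)&sa, sizeof sa);

        FD_ZERO(&wfds);
        FD_SET(fd, &wfds);
        if (select(fd + 1, NULL, &wfds, NULL, &tv) > 0 &&
            getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) == 0 && err == 0) {
            send_token(fd);                        /* connection established */
            close(fd);
            return target;
        }
        close(fd);                                 /* no connection: spin    */
    }
    return -1;                                     /* nobody would accept    */
}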

The hybrid application of MPI and socket programming was successful in constructing the new communication model. It should be mentioned, though, that socket programming using internet protocols limits the model to a strictly distributed environment. MPI allowed for use on shared memory machines; however, those machines do not have multiple IP addresses as required by this implementation with internet socket programming.

3.4 The SequenceL Interpreter

A C SequenceL interpreter was written for execution on the nodes of the token ring. This allows for replacement of dummy tuples with actual SequenceL tuples. When placed in conjunction with the communication model on each node, each SequenceL interpreter would tackle a portion of the distributed load by producing and grinding all encountered parallelisms.

The "flex" the GNU equivalent to AT&T lex was used as a lexical analyzer generation tool for the interpreter. Regular expressions representing the lexemes of

SequenceL were translated into a C lexical analyzer by use of flex. Parsing was done by recursive decent and with parser constmction based on the SequenceL grammar in the

Appendix A. Conect parses resulted in the creation of an abstract syntax tree (AST). A prefix traversal of the AST was used in the translation of SequenceL code to an intermediate form stored in a hash table. Each hash of the hash table represented a

44 SequenceL tuple and could be ground or distributed in parallel with other hashes. The

idea of the hash table was to unfold the SequenceL code into hashes that could be

arbitrarily examined and evaluated in parallel. Each hash pointed to an object called a

sequence representing what has previously been referred to as a tuple. Sequences were

implemented using a linked list data type containing 13 types of list nodes. The types are

as follows:

1.) Numeral
2.) Boolean
3.) Identifier
4.) User defined function hash
5.) Nested hash
6.) Non-nested hash
7.) User defined parameter hash
8.) Resolved infix operator
9.) Resolved prefix operator
10.) Non-transposeable unresolved prefix1 operator
11.) Transposeable unresolved prefix1 operator
12.) Unresolved prefix2 operator
13.) Unresolved infix operator
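A structural sketch of one list node, under the assumption of a tagged union, is shown below; the field and enumerator names are invented for exposition and the thesis code may differ.

typedef enum {
    NODE_NUMERAL, NODE_BOOLEAN, NODE_IDENTIFIER,
    NODE_USER_FUNC_HASH, NODE_NESTED_HASH, NODE_NONNESTED_HASH,
    NODE_PARAM_HASH, NODE_RESOLVED_INFIX, NODE_RESOLVED_PREFIX,
    NODE_UNRES_PREFIX1_NT, NODE_UNRES_PREFIX1_T,
    NODE_UNRES_PREFIX2, NODE_UNRES_INFIX
} node_type;                                  /* the 13 node types above      */

typedef struct seq_node {
    node_type type;
    union {
        double  numeral;                      /* numerals and booleans        */
        char   *name;                         /* identifiers, unresolved ops  */
        int     hash_ref;                     /* types 4-7: index into table  */
        double (*op)(double, double);         /* resolved operators           */
    } value;
    struct seq_node *next;                    /* singly linked sequence       */
} seq_node;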

Resolved and unresolved operators refer to whether the character representation of an operator has been translated to a local function pointer. Types 4, 5, 6, and 7 are all integer values referencing other hashes. For example, a nested sequence 1,[2] would look like:

Hash[0] : 1 *1*
Hash[1] : 2

Where *1* is an integer value of type 4 representing the nested sequence at hash[1]. Non-nested hash pointers are used for evaluating expressions with a higher precedence before those with lower precedence. For example, o1o is a non-nested hash pointer in the hash slice below, representing the higher precedence of infix multiplication over addition in the following operations: 4*5+6

Hash[0] : + o1o 6
Hash[1] : * 4 5

Here is a full example of the intermediate hash representation of a parsed SequenceL program:

func: [s]*[s]->s
func(a,b) ::= sum(a,b)/2

main: ?->?
main(void) ::= func([1,2,3],[4])

func
status: 3 Hash[0]  | o1o
status: 3 Hash[1]  | / sum o2o 2.000000
status: 3 Hash[2]  | o3o o4o
status: 3 Hash[3]  | a
status: 3 Hash[4]  | b
main
status: 1 Hash[5]  | o6o
status: 1 Hash[6]  | (U)0(U) (P)7(P)
status: 1 Hash[7]  | o8o o13o
status: 1 Hash[8]  | *9*
status: 1 Hash[9]  | o10o o11o o12o
status: 1 Hash[10] | 1.000000
status: 1 Hash[11] | 2.000000
status: 1 Hash[12] | 3.000000
status: 1 Hash[13] | *14*
status: 1 Hash[14] | o15o
status: 1 Hash[15] | 4.000000

The values in between the "(U)" and "(P)" symbols are integers referring to user defined function hashes and parameter list hashes, respectively. That is, (U)0(U) refers to hash[0] as a user defined function definition and (P)7(P) refers to hash[7] as the supplied parameter list to the function. One notices the status values left of each hash reference. Each hash has 1 of 4 status values assigned to it: 0 - not fully ground; 1 - new; 2 - completely ground; 3 - function hash.

One may notice the presence of redundant levels of hash nesting. For example:

Hash[7]  | o8o o13o
Hash[8]  | *9*
Hash[13] | *14*

A compression technique reduces all redundancy in nesting during initial evaluation. The resultant reduction of the above hash table slice becomes:

Hash[7] | *9* *14*

Also apparent is the existence of function definitions (denoted with a status of 3) integrated seamlessly with computable intermediate code. Function initialization occurs by copying the function definition's intermediate sequence representation into a new hash and changing the status to 1. The sequence node referencing the function is then updated to refer to the new copy of the function definition. The process preserves the original function definitions in the hash table. For example, when initializing the reference to the function located at Hash[0] (i.e., "func") located below:

status: 1 Hash[6] | (U)0(U) (P)7(P)

The following transformation results:

status: 1 Hash[6]  | o16o (P)7(P)
status: 1 Hash[16] | o17o
status: 1 Hash[17] | / sum o18o 2.000000
status: 1 Hash[18] | o19o o20o
status: 1 Hash[19] | a
status: 1 Hash[20] | b

A dequeue (the combination of a stack and a queue) was used to keep track of which hash spaces were occupied and ready for evaluation versus those that were not. The evaluation cycle popped a hash value off of the dequeue, evaluated the hash, and either discarded the hash number or placed the number back on the dequeue, depending on the resulting status of the sequence. Sequences were evaluated right to left until completely ground (status 2).
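The evaluation cycle just described can be sketched as follows; dequeue_pop(), dequeue_push(), and evaluate_hash() are illustrative names, and the status codes follow the 0-3 scheme given earlier.

extern int  dequeue_pop(int *hash);    /* next occupied hash slot, 0 if empty   */
extern void dequeue_push(int hash);    /* revisit this hash later               */
extern int  evaluate_hash(int hash);   /* one right-to-left pass; returns status */

void evaluation_cycle(void)
{
    int h;
    while (dequeue_pop(&h)) {
        int status = evaluate_hash(h);
        if (status != 2)               /* 2 = completely ground                 */
            dequeue_push(h);           /* otherwise evaluate it again later     */
        /* a fully ground hash number is simply discarded */
    }
}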

3.5 Proof of Concept: Interpreter and Architecture

The amalgam of the C interpreter with the communication architecture is designed to yield a proof of concept for the token ring as a viable load-balancing model for SequenceL execution. In order to do this, one must address the fundamental issues of dynamic load balancing, initialization, and persistent data. An approach to demonstrating solutions to these issues in real time execution must also be devised. The remainder of this chapter discusses solutions to these problems and provides the proof of concept for the coupled token ring and C interpreter.

Initialization is the first stage of execution in distributed computing. This refers to how one goes about breaking a problem apart into sub-problems and distributing those sub-problems to nodes before evaluation actually starts. This can be referred to as static load balancing and is addressed in Chapter 2. For this thesis, root initialization was used, meaning all SequenceL tasks remained on a single machine until evaluation began. If one were to initialize a SequenceL program onto different nodes in a cluster, it would require some knowledge of how the program will execute. However, SequenceL programs are arbitrary, and without the use of an intelligent load prediction algorithm or a parallelism profiler, there is no current way to safely and accurately predict the nature of execution so as to be able to statically load balance. The development of such a scheme is outside of the scope of this thesis.

Originally, thought was placed on distributed methods of initialization. The idea was to allow each node to parse and build an initial intermediate representation of a SequenceL program. The hash table representation on each node would be identical to the next. Then, by some scheme based on rank, each node would remove sections of the hash table so as to partition a complete global representation of the SequenceL program across the cluster. Each machine would then contain a unique section of the hash. When execution began, each node would begin to parallelize its portion of the hash. By using root initialization, one node takes responsibility for parsing and initial evaluation.

Next, one must address the five considerations of dynamic load balancing outlined by [WaT]. These are load evaluation, profitability determination, work transfer vector, task selection, and task migration. Load evaluation refers to how the load of a current node is estimated or measured. Several simple ways mentioned earlier might be to measure CPU usage or memory consumption. Another way may be to estimate the duration or quantity of tasks on a given node or surrounding nodes. For this thesis, the number of hashes that were not completely ground was used as an estimate of load. Since each hash points to a sequence (or tuple) that can be evaluated in parallel, taking the hash size of a given node gives one an idea of how many parallelisms exist at a given time. Also consider that as parallelisms increase, evaluation time and memory consumption also increase. Thus, taking hash size measurements for non-ground tuples indirectly reveals estimates for duration and memory consumption as well. This was considered to be a solely system based approach for load estimation. SequenceL currently lacks the tools necessary to profile arbitrary programs or utilize insight from the dynamics of a program's execution.

The use of the number of sequence nodes (i.e., linked list nodes) was also considered as a value for load estimation. This could be calculated by taking the sum of the lengths of all of the non-ground tuples. The reason behind this is that tuple length can vary greatly from hash to hash. A tuple of length 2 may take much less time to evaluate than one of length 100. By taking the number of tuples alone, one may overlook some finer measurements of load estimation. Without a robust method of load prediction, however, this method does not hold. Sequences of smaller length can be evaluated into larger sequences by normalization or the gen function, or can simply point to hashes with larger sequences. As a compromise, the sequence node count was thrown out in favor of tuple count.

Profitability determination of offloading parallelisms is handled mostly by the load threshold conditions mentioned earlier. A node will only deem offloading profitable if the load estimation has exceeded the upper threshold condition. For this thesis, thresholds were varied in an attempt to find the values for which distribution and reception of tuples was optimal.

The work transfer vector refers to both where and how many tuples are distributed, if distribution is considered profitable. For this thesis, 1/2 to 1/4 of the load of an overburdened node was transferred to an under-burdened node. This was determined by examination of the token on the overburdened node, and all offloading was initiated from the overburdened node. This is equivalent to a "push" of tuples instead of a "pull".

Task selection (i.e., what to distribute) for SequenceL requires a custom approach. Since all tuples can be evaluated in parallel, the naive approach is to ship tuples off and evaluate them without full consideration of the effects. One problem immediately encountered is the distribution of tuples that are already waiting on other tuples to return from another node. This is particularly hard to recover from, given that incomplete tuples (i.e., those actually containing null references) would then circulate through the cluster. To combat this, waiting tuples remained stationary until all sub-sequences were fully ground and returned.

Another problem intrinsic to task selection is identification of good bisectors. A bisector, as defined by [BEE], is the division of a problem into two smaller sub-problems. SequenceL does not choose its bisectors but essentially finds all of them. It is up to the system to choose which bisectors should be used to parallelize versus which ones should be ignored. The current system attempts to select and offload randomly. Take for instance the work-depth tree resulting from the sequence [[1*2]+[3*4], [5*6]+[7*8], [9*10]+[11*12]]:

        +               +               +
      /   \           /   \           /   \
     *     *         *     *         *     *
    / \   / \       / \   / \       / \   / \
   1   2 3   4     5   6 7   8     9  10 11  12

Consider a desired offload of 3 tuples. The system may randomly look at the *12, *56, and + *56 *78 sub-trees. The end result would be to parallelize *12 and the sub-tree + *56 *78. Since *56 is encountered twice, once directly and once by examination of its super-tree, it is only included and packed as a member of the super-tree. The super-tree + *56 *78 actually contains 3 tuples: *56, *78, and the sum of their result. Once a tree (or trees) is analyzed that matches or exceeds the amount of desired offload, the tree is packed. Packing of smaller sub-trees is always overridden if they are encountered when packing a super-tree. In this manner, bisectors delimiting larger offloads will be selected until the desired number of tuples is packed or exceeded.

Task migration in SequenceL is a simple task given an intermediate representation of the language. Actually moving a task from one node to another in SequenceL requires no more than packing tuples into contiguous memory, streaming the data to another node, and unpacking the data into the exact representation it was in before it was packed. One must keep track of the status of each tuple (i.e., 0-3), the node types, the number of tuples, and where tuples start and end in contiguous memory. Once correctly unpacked, a node can evaluate the tuples exactly as they would have been evaluated on the sending node.
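A rough sketch of the packing half of task migration is shown below, reusing the illustrative seq_node structure sketched in Section 3.4. Only the numeral payload is shown; the header fields (status, node count) and the handling of other node types are assumptions about layout, not the thesis format.

#include <mpi.h>

/* reuses the illustrative seq_node and node_type sketched in Section 3.4 */
int pack_tuple(const struct seq_node *head, int status, char *buf,
               int bufsize, MPI_Comm comm)
{
    int pos = 0, count = 0;
    const struct seq_node *n;

    for (n = head; n != NULL; n = n->next)
        count++;                                          /* node count       */

    MPI_Pack(&status, 1, MPI_INT, buf, bufsize, &pos, comm);
    MPI_Pack(&count,  1, MPI_INT, buf, bufsize, &pos, comm);

    for (n = head; n != NULL; n = n->next) {
        int type = n->type;
        MPI_Pack(&type, 1, MPI_INT, buf, bufsize, &pos, comm);
        if (type == NODE_NUMERAL) {                       /* simplest payload */
            double num = n->value.numeral;
            MPI_Pack(&num, 1, MPI_DOUBLE, buf, bufsize, &pos, comm);
        }
        /* other node types pack their payload analogously */
    }
    return pos;                                           /* bytes used       */
}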

After dynamic load balancing schemes have been formulated, one arrives at the last issue facing the SequenceL distributed memory programmer: persistent data. By dividing SequenceL execution between several distributed hash tables on nodes in the cluster, it is imperative to have a way of aggregating back to a coherent product. If tuples were ignorantly distributed each time a load balance occurred, there would be no information available about where to send completed tuples. This was not a problem with the client-server architectures used in the older tuple space implementation, because the serving node had complete knowledge and control of where tuples were distributed, similar to a global directory. In the peer-to-peer structured token ring, another scheme must be devised.

The problem is directly akin to distributed hashing methods, distributed database management, and task mapping; see [Dev]. Distributed SequenceL hash tables can be thought of as "buckets," and tuples residing in the hashes can be thought of as being mapped to a hash within the bucket. When sending a tuple to another machine, its original mapping must be recorded so that upon completion it can be returned and further ground or aggregated to a coherent product.

Two approaches to this problem were considered. The first was to append mapping information in the form of tags to tuples when communicated. Each time a tuple was communicated to another node, the sending node would append its own tag to the tuple. When grounding was completed, retracing would occur by examining the tag information. This approach was discarded because of the maintenance and communication implied in producing, packing, unpacking, and complying with tag mapping information.

The second approach involves construction of a communication table to hold mapping information. The number of rows in the table would be equivalent to the size of the hash table, and every row would correspond to the communication status of its equivalent hash. Each row would contain 3 fields: processor, hash, and communication status. Processor and hash refer to the processor and hash from which the tuple came. Communication status represents whether a tuple is away (has been distributed), local (currently resides in the hash), or is slated for packing. The following are examples of slices of the hash and the communication table with tuples of varied communication status and mapping (an illustrative struct for one table row follows the examples); "cp" stands for communication processor, "ch" stands for communication hash, and "cs" for communication status:

Node 0 with a tuple currently residing in its local hash:

cp: 0 ch: 20 cs: 0 status: 1 Hash[20] | + 1 2

Node 1 with a tuple from processor 2 at hash 15:

cp: 2 ch: 15 cs: 0 status: 1 Hash[31] | + * 4 5 6

Node 1 after redistributing Hash[31] to another node:

cp: 2 ch: 15 cs: -1 status: 1 Hash[31] |
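An illustrative layout for one communication table row, matching the cp/ch/cs fields in the slices above, might look like the following; the table size and field names are invented for exposition.

#define HASH_TABLE_SIZE 65536                 /* illustrative table size          */

typedef struct {
    int cp;   /* communication processor: rank the tuple came from               */
    int ch;   /* communication hash: hash index on that processor                */
    int cs;   /* communication status: 0 = local, -1 = away (as in the examples);
                 a further value marks tuples slated for packing                 */
} comm_entry;

/* one row per hash slot, so comm_table[i] describes Hash[i] */
comm_entry comm_table[HASH_TABLE_SIZE];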

Figure 3.3 below illustrates the communication table values after the passing of a tuple between 3 processors.

Original commtable & hash on proc 0:

cp: 0 ch: 20 cs: 0 status: 1 Hash[20] j + 1 2

Hash and commtable after token pass for proc 0:

cp: 0 ch: 20 cs: -1 status: 1 Hash[20] I

Hash and commtable after token pass for proc 1:

1 cp: 0 ch: 20 cs: -1 status: 1 Hash[31] I

Hash and commtable after token reception for proc 2:

- ^ cp: 1 ch: 31 cs: 0 status: 1 Hash[15] j + 1 2

Figure 3.3 Communication table after tuple offload

By this method, the hash values on processors awaiting the return of tuples are left NULL and can be viewed as placeholders. Other tuples cannot occupy a hash that awaits the return of another tuple. Consequently this consumes more hashes, but it is a fair trade-off against generating and communicating tag information.

CHAPTER IV

RESULTS

The C SequenceL interpreter and MPI/socket token ring communication architecture were tested against a specific problem domain with varying data sizes, thresholds, and numbers of processors. This thesis achieved the following results:

• A non-recursive C SequenceL interpreter written and tested on a set of SequenceL programs.

• A working distributed memory communication architecture and parallelized SequenceL execution.

• A distributed memory representation of SequenceL.

• A method of enforcing persistent SequenceL data and programs.

• Performance measurements of SequenceL programs executed on the communication architecture.

• Insight into the dynamics of distributed SequenceL execution.

4.1 Interpreter Testing

The SequenceL interpreter written for this thesis did not support recursion. This, in turn, limited the choice of easily written test programs by excluding intuitively recursive programs (like quicksort). However, this was not essential for testing purposes. Several non-recursive functions were chosen to test the functionality of the interpreter, e.g., primes and the trapezoid rule. The interpreter yielded correct lexical analysis, syntax analysis, and a sound evaluation of these and other non-recursive programs. The implementations of primes and the trapezoid rule are shown below:

Primes program:

//function: prime
//description: a brute force method of determining if a number is prime
//returns a positive integer if the number is prime else returns 0
prime: s->s
prime(a) ::= product(a%[2..ceiling(sqrt(a))])

//function: primeall
//description: returns all prime numbers in a sequence generated from
//the parameters b and c
primeall: s*s->[s]
primeall(b,c) ::= [b..c] when [prime([b..c])!=0]

//function: main
main: ?->?
main() ::= primeall(5,50)

Trapezoid Rule program:

//function: func
//description: function to be approximated by the trapezoid rule
func: s->s
func(x) ::= x*3

//function: trapquad
//description: quadrature evaluation of a function "func" over the
//interval from "left" to "right" with "subint" divisions
trapquad: s*s*s->s
trapquad(left,right,subint) ::=
    sum( [ [ [func([[[[right-left]/subint] * [0..subint-1]] + left])] +
             [func([[[[right-left]/subint] * [1..subint]] + left])] ]/2 ]
         * [[right-left]/subint])

//function: main
main: ?->?
main() ::= trapquad(1,5,1000)

SequenceL trace output can be used as a primitive method of visualizing a program's level of parallelism over time. The growth and recession of the length of sequences through the NTD cycle directly correlates to the amount of parallelism in a program. Rotated trace output from the primes and trapezoid rule programs is shown in Figure 4.1 and demonstrates the algorithms' inherent parallelism with respect to time.

Figure 4.1 Dynamics of parallelisms trace output (trapezoid rule and primes)

The local maxima or peaks of the trace output represent high levels of parallelism. The local minima or valleys are points of low parallelism. When juxtaposed, one can see the differences in the dynamics of parallelisms of the programs.

4.2 Distributing Parallelisms

The SequenceL trapezoid rule program was chosen to distribute parallelisms over the communication architecture by virtue of its highly parallel nature and its common use in high performance computing. The trapezoid rule is used for quadrature approximations of definite integrals. It works on the principle that the area of a trapezoid can approximate the area under the curve of a function as follows:

\int_a^b f(x)\,dx \approx (b - a)\,\frac{f(a) + f(b)}{2}

By breaking a definite integral into n subintervals, one can better approximate the integral by taking the sum of the trapezoid approximations of each subinterval, i.e.,

\int_a^b f(x)\,dx \approx \sum_{i=0}^{n-1} h\,\frac{f(a + ih) + f(a + (i+1)h)}{2}

where h is the length of a subinterval, i.e., h = (b - a)/n.

For this thesis, simple polynomial functions were chosen for f(x). The purpose is to exploit the parallelisms of the trapezoid rule, not to perform difficult integration. Figure 4.2 is a trace of the communication between 4 processors when evaluating the trapezoid rule for a simple polynomial integral with 2000 subintervals. Thresholds for profitability determination were not present during this execution, meaning as soon as a node detected another node needed work it would offload, regardless of whether it was profitable or not. The work transfer vector consisted of half of a node's tuples. Root initialization occurred on processor 0.

1.)  0 is offloading 8404 tuples to 1
2.)  1 is returning 601 tuples to 0
3.)  1 is offloading 2418 tuples to 2
4.)  1 is returning 10 tuples to 0
5.)  2 is returning 388 tuples to 1
6.)  1 is returning 389 tuples to 0
7.)  0 is offloading 14678 tuples to 1
8.)  1 is returning 18 tuples to 0
9.)  1 is offloading 3109 tuples to 2
10.) 1 is returning 7 tuples to 0
11.) 2 is returning 368 tuples to 1
12.) 2 is offloading 2244 tuples to 3
13.) 1 is returning 369 tuples to 0
14.) 3 is returning 283 tuples to 2
15.) 3 is offloading 1587 tuples to 1
16.) 1 is returning 266 tuples to 3
17.) 2 is returning 284 tuples to 1
18.) 3 is returning 267 tuples to 2
19.) 1 is returning 284 tuples to 0
20.) 1 is returning 55 tuples to 3
21.) 2 is returning 267 tuples to 1
22.) 3 is returning 55 tuples to 2
23.) 1 is returning 267 tuples to 0
24.) 2 is returning 55 tuples to 1
25.) 1 is returning 55 tuples to 0
= [156.000043]

Figure 4.2 Communication trace output

One immediately notices the pattern of distribution emanating from processor 0 (lines 1, 7). Processor 0 detects that processor 1 has no tuples and offloads half of its current workload to processor 1 (line 1). Processor 1 then begins processing its tuples. On the next token revolution, processor 1 returns its completed tuples to 0 (line 2) and detects that processor 2 has no tuples. It then offloads half of its workload to processor 2 (line 3). Both 1 and 2 finish computing their tuples and return their completed tuples to their respective sender (lines 4, 5, 6). The tuples are aggregated at processor 0, and new tuples are then distributed and computed (lines 7-25).

Zooming out, one can see the entire distribution cycle as two steps: lines 1-6 and lines 7-25. These two steps represent the complete distribution and aggregation of a set of parallelisms and are directly related to the local maxima or peaks of the trapezoid rule trace output in Figure 4.1. Notice that the quantity of tuples offloaded corresponds directly to the height of the peaks in the graph. Lines 1-6 represent the distribution of the left peak of parallelisms with 8,404 tuples, whereas lines 7-25 represent the longer center peak with 14,678 tuples. Figures 4.3a and 4.3b below further illustrate the two steps; the red arrows indicate offload of tuples, whereas the blue arrows represent return of tuples. All arrows are numbered to correspond to the output in Figure 4.2.

Figure 4.3a Illustration of communications occurring in lines 1-6

Figure 4.3b Illustration of communications occurring in lines 7-25

A small number of processors were used in this example; however, with larger data sets and more processors the trend is still apparent. Test results show that the distribution method and communication architecture can adaptively respond to the dynamics of a program's parallelism.

One notices the snake-like offload pattern in Figures 4.3a and 4.3b, in which nodes offload in a circular fashion. Each node distributes a portion of its tuples to the first node it sees as being free, marked by the red arrows. In this case distribution among processors can be visualized in a linear fashion as follows:

0 -> 1 -> 2 -> 3 -> 1

and aggregation as:

0 <- 1 <- 2 <- 3 <- 1

However, in other traces of SequenceL distribution, offloading forked to multiple processors. One can view the process of forked offloading and aggregating tuples as a tree. Take for instance the sample trace output and corresponding tree representation in Figures 4.4 and 4.5:

0 is offloading 14714 tuples to 2
2 is returning 192 tuples to 0
2 is offloading 4079 tuples to 3
2 is returning 704 tuples to 0
2 is offloading 2031 tuples to 4
3 is returning 1318 tuples to 2
3 is offloading 1052 tuples to 1
4 is returning 370 tuples to 2
1 is returning 42 tuples to 3
2 is returning 1688 tuples to 0
3 is returning 43 tuples to 2
4 is returning 167 tuples to 2
2 is returning 210 tuples to 0

Figure 4.4 Communication trace output

Figure 4.5 Communication tree representation

The structure is highly reminiscent of a client-server model in which 2 is serving tuples to clients 3 and 4. This characteristic occurred intermittently in tests on the token ring and was usually more prevalent when testing larger data sets. If one extrapolates the model to several more processors and extremely large data sets, it would resemble a hierarchical client-server model, perhaps similar to Figure 4.6 below.

Figure 4.6 Communication hierarchy

One may now see that the token ring model allows for dynamic expansion of a hierarchical client-server distribution model when responding to increased parallelism. Client nodes that receive tuples from a serving node may in turn produce high amounts of parallelism themselves and will be forced to offload in the hierarchical fashion. Dynamic expansion and recession of the model is directly dependent on the fluctuation in parallelism of the program. Snake-like output as shown earlier can simply be thought of as a two-node client-server model. The model also resembles the bisection graph in Figure 2.1 and is a good representation of how the token ring model uses choice of bisectors to resolve imbalance and granularity issues. The number of forks in the tree would be implementation specific and is dependent on profitability determination and the work transfer vector.

Root initialization is partially responsible for this tree-like behavior. Given that all parallelisms are initially found and distributed from the root node (i.e., processor 0), all results emanate from, and must eventually be aggregated back to, the root node. However, even with an initialization scheme to partition parallelisms onto different nodes before evaluation, the behavior will still persist. By eliminating root initialization one only removes the top level of the tree. Nodes evaluating pre-partitioned parallelisms will still distribute in the tree-like fashion.

4.3 Performance Analysis

Performance analysis was conducted by taking serialization time, task selection time, computation time, communication time, and total execution time versus the number of nodes involved in the computation and the data size in the trapezoid program. Threshold analysis was also conducted to investigate the profitability determination of offloading tuples. The majority of the time measurements were conducted on processor 0 (the node performing root initialization). Data sizes were 1000, 1500, and 2000 subintervals of the trapezoid rule.

4.3.1 Time Metrics

Figures 4.7a, 4.7b, 4.7c, and 4.7d are pie charts representing the proportion of time processor 0 spent on communication time, computation time, task selection time, serialization time, and computation control and wait time with respect to total execution time. Computation control refers to the amount of time taken checking MPI asynchronous communication buffers, opening and closing sockets, and atomically switching between communication and computation. Wait time refers to the amount of time a processor spends waiting on return tuples from another processor. Waiting occurs when a node needs tuples to continue computation but the tuples have previously been distributed to another node. Tests were taken on a data size of 1000 subintervals for the trapezoid program. The charts show the average results of 9 executions on 2, 3, 4, and 5 processors. Each pie slice represents the measured time, in seconds, relative to the total execution time; the sum of the times of all of the slices is the total execution time. Work transfer consisted of 1/4 of the sending machine's tuples.

[Pie chart. Slice values in seconds: serialization time 4.581111; task selection time 3.3855556; computation time 0.607778; computation control/wait time 4.79774; communication time 0.011111]

Figure 4.7a Time proportion for 2 processors

[Pie chart. Slice values in seconds: serialization time 6.931111111; task selection time 4.675555556; computation time 0.574444444; computation control/wait time 13.6278; communication time 0.012222222]

Figure 4.7b Time proportion for 3 processors

[Pie chart. Slice values in seconds: serialization time 7.806666667; task selection time 6.341111111; computation time 0.61; computation control/wait time 25.54444; communication time 0.008888889]

Figure 4.7c Time proportion for 4 processors

[Pie chart. Slice values in seconds: serialization time 5.849; task selection time 7.273; computation control/wait time 43.08; communication time 0.008]

Figure 4.7d Time proportion for 5 processors

One immediately notices that serialization, task selection, and computation control/wait time dwarf the amount of time spent on actual computation. This is highly undesirable; the opposite would be preferred. All cases show processor 0 spending an exorbitant amount of time on work that does not benefit load balancing or execution. The amount of time also increases with the number of processors; note the difference in the time values of computation control/wait time from 2 processors to 5. The same results are better shown in graph format below in Figure 4.8a.

Figure 4.8a # of processors versus execution time, data size=1000

The graph displays a seemingly exponential curve for total execution time that appears to be directly based on computation control/wait time. Further measurements were taken for data sizes of 1500 and 2000 subintervals and are shown in the graphs in Figures 4.8b and 4.8c.

Figure 4.8b # of processors versus execution time, data size=1500

# of processors versus execution time, data size=2000

-•—communication time -•—computation time task selection time -X—serialization time -*—comp control/wait time -•—total execution time

2 3 4 # of processors

Figure 4.8c # of processors versus execution time, data size=2000

Tests were successfully conducted on up to all 9 processors in the cluster with varying data sizes. However, with a larger number of processors, complete computation of programs took several minutes. Thus, testing for higher numbers of processors is not presented in these results.

Tests for threshold values were also performed in an effort to determine if adjusting the profitability determination would affect execution time. After several tests on 2 to 5 processors, using upper and lower bound thresholds at low levels (ranging from 200 to 2000), it was determined that the thresholds had no real effect on execution time. Nodes were distributing roughly the same number of tuples and exhibiting similar distribution patterns each run. For further testing, the lower bound threshold remained fixed at a small 100 tuples while the upper bound threshold was explored. The threshold ranged from 5000 to 20000. Testing was done on 5 processors; lower numbers of processors showed little effect. Data sizes of 2000, 2500, and 3000 subintervals were used.

Figure 4.9 Time results from variations in the upper bound profitability threshold (execution time versus threshold for data sizes of 2000, 2500, and 3000)

A slight decrease in execution time (around 25 seconds) can be detected for the data sizes of 3000 and 2000.

4.3.2 Explanations

Although computation control and wait time seem to be the underlying factors causing slow execution, they are ultimately based on the serialization and task selection of all other nodes participating in the parallelism. Consider the scenario between two nodes in Figure 4.10 as a series of time steps illustrating the problem. In the first time step, node 0 distributes half of its tuples to node 1. Node 0 then resumes grinding its tuples while node 1 begins to unpack its tuples (unpacking is considered in the time measurements as a form of serialization and is time costly). By step 3, node 0 has completed grinding while node 1 is still unpacking. Remember that computation time only takes a fraction of the time that serialization does, as shown in Figures 4.7a, 4.7b, 4.7c, and 4.7d. In step 4, node 0 begins awaiting the return of tuples from node 1 so as to complete computation or aggregate results to a complete product. However, node 1 has just begun computation. Node 0 continues to wait through steps 4, 5, and 6 while node 1 finishes grinding and begins to pack (serialize) its tuples. In step 7 the tuples are finally returned to processor 0. The time node 0 waits for returned tuples is large in comparison to the actual computation it achieves in the 7 steps. It becomes even longer if node 1 decides to offload parallelisms. In that case, node 0 must wait for node 1 to perform task selection (which is time costly), serialize, offload, and aggregate its jobs before node 1 will even begin to return its tuples. This builds on top of the serialization time already needed to return tuples back to node 0. The process could be recursively repeated in a chain of offloads, resulting in extremely long periods of wait time for node 0. The result of this cascading accumulation of serialization time is directly reflected in the exponential time trends in Figures 4.8a, 4.8b, and 4.8c.

Figure 4.10 Time steps during distribution and aggregation of tuples

Serialization is costly because of the linked list representation of sequences used in the design and implementation of this thesis. The serialization process used is recursive. It traverses a sequence and all of its subsequences, packing a representation of each node into contiguous memory along the way. The packed sequence is then freed by another recursive traversal in which memory is deallocated. Allocating and deallocating large amounts of memory recursively each time tuples are communicated is detrimental to execution time.

Task selection is akin to serialization in its recursive characteristics, and it too has been previously shown to be costly. Task selection uses a handful of rules to identify sequences that are worthy of offloading, but it must perform heavy recursive sequence traversals to determine the length, depth, and contents of a sequence. The heavy recursive sequence traversals occurring in task selection alone outweigh the parallel computational benefit. It too contributes to the exponential growth of wait time, as explained earlier.

The combination of the inefficiencies in serialization and task selection makes decisions to offload tuples undesirable. In fact, it is never profitable to offload tuples. Results taken from high upper-bound threshold values in Figure 4.9 show a reduction in execution time simply because the number of processors involved in communication is reduced. A high threshold value limits the rights of nodes to offload tuples so much so that it simply excludes some would-be receiving nodes. Altogether, since the number of processors is reduced, the amount of communication is reduced, which reduces the amount of serialization, which reduces the execution time. In most cases the high thresholds limited communication to only 2 or 3 processors. This would be optimal if those two or three processors were constantly grinding tuples and did not need to parallelize; however, this is not the case. Each processor is spending over 95% of its time processing instructions other than SequenceL code.

4.3.3 Other Considerations

Dynamic load balancing was a central focus for this thesis, however issues in serialization and task selection complicated measurements and fine tuned solutions. This thesis produced functional solutions for serialization, task selection, profitability determination, work transfer, and task migration although perhaps inefficient. The token ring model itself worked as a proof of concept for parallel distribution of SequenceL, however it has not achieved a dynamic means of load balancing execution, given that load balancing was never profitable.

Looking back at the problems with the Jython TSpace implementation, the token ring model shares scalability, serialization, and slow execution issues - all of which are tied to serialization and task selection discussed earlier. However, some key improvements were made in communication, granularity, and portability.

Communication on the Beowulf cluster proved effective even when sending large streams of packed tuples. Results gathered showed communication times taking only fractions of a second. This caused little computational hindrance or drop in efficiency. The combination of asynchronous MPI functions and low-level socket programming gave the token ring a definite advantage over TSpace communication.

Although the success of threshold values was limited, granularity issues were addressed and solved by allowing nodes to compute tuples by default. If offloading did occur, up to tens of thousands of tuples could be packed and shipped to another node. This method was much more effective than the fully parallel, piecemeal distribution of tuples in the TSpace model.

One could also argue that the token ring model does outdo the TSpace model in terms of overall performance. Though both models are slow distributed implementations of SequenceL, the C interpreter coupled with MPI and socket communication outcomputed the Jython TSpaces model by an order of magnitude or more. The C/MPI/socket implementation could compute programs with data sets of several thousand in a minute or two, whereas the TSpace/Jython implementation could only do several hundred in the same amount of time. This is largely due to the choice of languages, interpreters, and communication libraries.

4.4 Intermediate Representation and Persistent Data

SequenceL data persisted through the course of execution. The communication table accurately maintained historical communication information so as to allow nodes to return tuples to the node that distributed them. This often resulted in a snake-like trace of tuples back to their originating node, as shown earlier in Figures 4.3a and 4.3b. The trace back is necessary so that correct aggregation and simplification of tuples can result along their path. Figure 4.2 shows the offloading and aggregation of a combined total of 35,855 tuples to form a coherent final data product. Persistent data has been tested and achieved over larger data sets, all 9 nodes of the cluster, and several hundred thousand tuples communicated.

The hash table, sequence representation, and sequence node typing scheme proved viable as an intermediate representation of SequenceL in distributed memory. One should note, however, that each processor will not share the same memory locations for built-in SequenceL functions and function operators (e.g., +, *, /, abs, sin, product). This requires that operators and functions remain in character form or some other universal data form when being distributed. This allows each node to resolve the function or operator representation to a pointer in its local memory.

When executing user-defined functions, a representation of the function must reside on each node. This requires either that the parsing node send every other node a copy of the user-defined function representation, or that each node parse the SequenceL program itself. For this thesis, all nodes in the cluster were allowed to parse the SequenceL program by using MPI's SPMD convention.

CHAPTER V

CONCLUSIONS

SequenceL has been introduced as a language that finds all data and control parallelisms in a program. An overview of the language was given, and a background on other implementations of the language was covered. Considerations in dynamic load balancing, IBM's Token Ring, and token rings in general were researched and presented to provide a foundation of knowledge for the methodology of this thesis. Two clusters (one heterogeneous, one homogeneous) were constructed from scratch to provide a dedicated distributed memory execution environment for SequenceL. A C SequenceL interpreter and accompanying MPI/socket token ring communication architecture have been implemented and investigated as a solution to dynamically load balancing parallel SequenceL execution, along with addressing specific concerns of the SequenceL TSpace implementation. The results proved the implementation to be a functional means of distributing SequenceL parallelisms; however, load-balanced execution was not achieved. The dynamics of distributing parallelisms have been outlined, and specific problems of serialization and task selection have been identified as factors hampering load balancing.

This chapter draws conclusions from the results of the research, suggests improvements, proposes possible future work, and ends with closing remarks.

5.1 Suggestions and Improvements

The current C SequenceL interpreter needs recursion. Testing was limited to non-recursive algorithms and perhaps would have been much easier with the use of recursive parallel algorithms. The modification to the interpreter would cause no large fundamental change in the intermediate representation of the language.

It is highly recommended that linked list representations of sequences not be used in distributed memory implementations of SequenceL. The cost of memory management associated with serialization is too great. The SequenceL Jython/TSpace implementation perhaps suffered for the same reason: the built-in tuple data type that was used in Jython to represent sequences was most likely implemented using a linked list. A wiser approach would be to replace the linked list implementation with an array-based implementation. The array-based implementation would require a more clever approach to SequenceL intermediate code representation and would constrain the programmer to buffers of a set length. However, it would be a more desirable format for packing and parallelizing, as sketched below.
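A minimal sketch of such an array-based alternative is given below, assuming fixed-size cells in one flat buffer so that serialization reduces to copying the occupied region; the capacity and field names are illustrative only, not a proposed final design.

#define SEQ_CAPACITY 1024

typedef struct {
    int    type;                       /* node type tag (same 13-type scheme) */
    double numeral;                    /* payload for numeric nodes           */
    int    hash_ref;                   /* payload for hash-reference nodes    */
} seq_cell;

typedef struct {
    int      status;                   /* 0-3 status as before                */
    int      length;                   /* number of occupied cells            */
    seq_cell cells[SEQ_CAPACITY];      /* contiguous, no per-node malloc      */
} array_sequence;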

Task selection should be given a closer look. There is a real need for an algorithm or method to efficiently control what gets parallelized. Basic heuristics were used in this thesis, e.g., never parallelize a list of numerals, always try to parallelize sequences with operators. It was also observed that distributing user defined function operators was almost always advantageous, given that once instantiated on another node they can possibly produce many more parallelisms. These rules help but are not a robust solution. Any future distributed memory implementation of SequenceL will face this problem, and would greatly benefit from a well-devised solution.

MPI as a set of communication libraries might be replaced. The libraries work well for communication, but they are not as versatile as was needed at times. In fact, several weeks were spent battling MPI functionality alone. Socket programming would probably be a suitable replacement for most MPI calls. If sockets are not chosen, then some other high performance communication library should be. When communicating tuples and tokens, one needs a fast, reliable, and controllable means of communication.

5.2 Future Work

A compiled array-based version of SequenceL should be investigated for distributed memory use. All memory should be pre-allocated or should remain a static size to avoid excessive memory management and allocation problems. The token ring architecture should be explored further with solutions to the problems mentioned above. The architecture proved viable and should not be abandoned in future research.

5.3 Closing Remarks

By working on SequenceL in distributed memory, one tastes many of the flavors of computer science. At the language level, one is immersed in the theory of computation, compiler theory, and the exciting environment of language development. At the distributed computing level, one must learn protocols, network configurations, communication models, and tools, and face issues as they emerge from parallel SequenceL evaluation. SequenceL even touches the system administration level: operating system knowledge, kernel knowledge, systems programming, process management, hardware/software troubleshooting, cluster assembly, and even the basics of cable making are fundamental to maintaining a working system. Altogether, working with SequenceL has been a great project and a very rich experience.

REFERENCES

[And] Per Andersen, "A Parallel SequenceL Compiler," Ph.D. Thesis, Texas Tech University, 2002.

[AGM] I. Ahmad, A. Ghafoor, K. Mehrotra, "Performance prediction of distributed load balancing on multicomputer systems," Proceedings of the 1991 ACM/IEEE Conference on Supercomputing, pp. 830-839, Albuquerque, NM, 1991.

[BEE] Stefan Bischof, Ralf Ebner, Thomas Erlebach, "Load Balancing for Problems with Good Bisectors, and Applications in Finite Element Simulations," Euro-Par 1998, pp. 383-389.

[BHCS] G. Blelloch, J. Hardwick, S. Chatterjee, J. Sipelstein, M. Zagha, "Implementation of a Portable Nested Data-Parallel Language," Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Volume 28, Issue 7, July 1993.

[BLE01] G. Blelloch, NESL: A Nested Data-Parallel Language (Version 3.1), CMU-CS-95-170, September 1995.

[BLE02] G. Blelloch et al., The SCANDAL Project, http://www-2.cs.cmu.edu/~scandal/nesl.html

[CKLP] M. Chakravarty, G. Keller, R. Lechtchinsky, W. Pfannenstiel, "Nepal - Nested Data-Parallelism in Haskell," Parallel Processing, 7th International Euro-Par Conference, September 2001, Springer-Verlag, Berlin, pp. 524-534.

[Coo] Daniel E. Cooke, "An Introduction to SequenceL: A Language to Experiment with Nonscalar Constructs," Software Practice and Experience, Vol. 26(11), November 1996, pp. 1205-1246.

[CoAn] Daniel E. Cooke, Per Andersen, "Automatic Parallel Control Structures in SequenceL," Software - Practice & Experience, Volume 30, Issue 14, pp. 1541-1570, 2000.

[CPT] C/Python Tuplespace. The official name of the software is 'LinuxTuples'. http://sourceforge.net/projects/linuxtuples/

[Dev] R. Devine, "Design and Implementation of DDH: A Distributed Dynamic Hashing Algorithm," Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms, pp. 101-114, 1993.

[DHBS] K. Devine, B. Hendrickson, E. Bowman, M. St. John, C. Vaughn, "Design of Dynamic Load Balancing Tools for Parallel Applications," Proceedings of the 14th International Conference on Supercomputing, pp. 110-118, Santa Fe, NM, 2000.

[Dij] E. W. Dijkstra, "Self-Stabilizing Systems in Spite of Distributed Control," Communications of the ACM, 17, pp. 643-644, 1974.

[GPH] Glasgow Parallel Haskell Homepage. http://www.macs.hw.ac.uk/~dsg/gph/

[HiM] L. Higham and S. Myers, "Self-Stabilizing Token Circulation on Anonymous Message Passing Rings," OPODIS'98, Second International Conference on Principles of Distributed Systems, pp. 115-128, 1998.

[Jain] R. Jain, "Performance Analysis of FDDI Token Ring Networks: Effect of Parameters and Guidelines for Setting TTRT," Proceedings of the ACM Symposium on Communications Architectures & Protocols, pp. 264-274, Philadelphia, PA, 1990.

[LRSH] H-W. Loidl, F. Rubio, N.R. Scaife, K. Hammond, S. Horiguchi, U. Klusik, R. Loogen, G.J. Michaelson, R. Pena, S.M. Priebe, A.J. Rebon, and P.W. Trinder, "Comparing Parallel Functional Languages: Programming and Performance," Higher-Order and Symbolic Computation, 2001.

[LoT] H-W. Loidl, P. Trinder, A Gentle Introduction to GPH. http://www.macs.hw.ac.uk/~dsg/gph/docs/Gentle-GPH/gph-gentle-intro.html

[Ros] L. Rosaz, "Self-Stabilizing Token Circulation on Asynchronous Uniform Unidirectional Rings," Proceedings of the Nineteenth Annual ACM Symposium on Principles of Distributed Computing, pp. 249-258, Portland, OR, 2000.

[SoP] Joao Luis Sobral, Alberto Jose Proenca, "Overheads on the dynamic removal of excess parallelism on OO irregular applications," presented at the First Workshop on Parallel Computing for Irregular Applications, France, January 1999.

[SunI] Sriram Sundararajan, "A Sequence Interpreter Using Tuplespaces," M.S. Thesis, Texas Tech University, 2003.

[ToD] K. Tomko, E. Davidson, "Profile Driven Weight Decomposition," Proceedings of the 10th International Conference on Supercomputing, pp. 165-172, Philadelphia, PA, 1996.

[TSP] TSpaces Homepage. http://www.alphaworks.ibm.com/tech/TSpaces.

[WaT] Jerell Watts and Stephen Taylor, "A Practical Approach to Dynamic Load Balancing," IEEE Transactions on Parallel and Distributed Systems, 9(3), pp. 235-248, 1998.

[WIQ] WordIQ definition of Token Ring, http://www.wordiq.com/definition/Token_ring

[WiM] J. Winkler, J. Munn, "Standards and Architecture for Token-Ring Local Area Networks," Proceedings of the 1986 ACM Fall Joint Computer Conference, pp. 479-488, Dallas, TX, 1986.

APPENDIX A

GRAMMARS

SequenceL Grammar 7/2003

A ::= integer | real | string
L ::= A,L | E,L | A
E ::= [] | [L]
V ::= id
O ::= + | - | / | * | abs | sqrt | cos | sin | tan | log | mod | reverse | transpose | rotateright | rotateleft | cartesianproduct
M ::= all,M | T,M | all | T
T ::= E | [Tstar] | V | O(T) | V(map(M)) | T(M) | gen([T,...,T])
RO ::= < | > | = | >= | <= | <> | integer | real | var | operator
R ::= RO(T) | and(Rplus) | or(Rplus) | not(Rplus)
B ::= Tplus | Tplus when R else B
C ::= [] | taking [Vplus] from T
F ::= V(Vplus) where next = B C
U ::= F U | E U

SequenceL Grammar used for this thesis

Scalar ::= true | false | e | pi | Numeral
ConstIdent ::= nil | Scalar | Identifier

Prefix1 ::= abs | sqrt | sum | product | transpose | floor | ceiling | length | sin | cos | tan | sec | csc | cot | log

Prefix2 ::= ~ | -

Term ::= Term1 , Term | Term1
Term1 ::= Term2 .. Term1 | Term2
Term2 ::= Term3 else Term2 | Term3
Term3 ::= Term4 when Term3 | Term4
Term4 ::= Term5 ++ Term4 | Term5
Term5 ::= Term6 || Term5 | Term6
Term6 ::= Term7 & Term6 | Term7
Term7 ::= Term8 = Term7 | Term8 != Term7 | Term8 < Term7 | Term8 > Term7 | Term8 <= Term7 | Term8 >= Term7 | Term8
Term8 ::= Term9 + Term8 | Term9 - Term8 | Term9
Term9 ::= Term10 * Term9 | Term10 / Term9 | Term10 % Term9 | Term10
Term10 ::= Term11 ^ Term10 | Term11
Term11 ::= [Term] | (Term) | Prefix1 Term | Prefix2 Term | ConstIdent | U(Term)

Simpletype ::= s | ? | [Simpletype]
Type ::= nil | Simpletype | Simpletype * Type
Signature(u) ::= u : Type → Type, where u ∈ U

Arglist ::= nil | Identifier | Identifier Argtail
Argtail ::= , Identifier Argtail | nil

Definition(u) ::= u(Arglist) ::= Term, where u ∈ U
Function ::= Signature(u) Definition(u), where u ∈ U

Program ::= Function Function1
Function1 ::= Function Function1 | nil

APPENDIX B

CLUSTER SPECIFICATIONS

Dungeon Cluster Specifications

Sriram:    vendor_id GenuineIntel, model Intel(R) Pentium(R) 4 CPU 1.80GHz, cpu MHz 1794.203, cache 256 KB, RAM 256 MB, bogomips 3578.26

Geek:      vendor_id GenuineIntel, model Intel(R) Pentium(R) 4 CPU 2.40GHz, cpu MHz 2391.183, cache 512 KB, RAM 512 MB, bogomips 4771.02

Rizzo:     vendor_id GenuineIntel, model Celeron (Mendocino), cpu MHz 534.558, cache 128 KB, RAM 128 MB, bogomips 1064.96

Gorilla:   vendor_id GenuineIntel, model Celeron (Mendocino), cpu MHz 534.554, cache 128 KB, RAM 128 MB, bogomips 1064.96

Pelumpus:  vendor_id GenuineIntel, model Celeron (Mendocino), cpu MHz 534.553, cache 128 KB, RAM 128 MB, bogomips 1064.96

Noodle:    vendor_id GenuineIntel, model Pentium II (Deschutes), cpu MHz 400.912, cache 512 KB, RAM 128 MB, bogomips 799.53

Dualbox:   processor 0: vendor_id GenuineIntel, model Pentium III (Katmai), cpu MHz 448.882, cache 512 KB, RAM 512 MB, bogomips 894.56
           processor 1: vendor_id GenuineIntel, model Pentium III (Katmai), cpu MHz 448.882, cache 512 KB, bogomips 894.56

Dosequis:  processor 0: vendor_id GenuineIntel, model Pentium II (Deschutes), cpu MHz 400.914, cache 512 KB, RAM 512 MB, bogomips 799.53
           processor 1: vendor_id GenuineIntel, model Pentium II (Deschutes), cpu MHz 400.914, cache 512 KB, bogomips 801.17

Beergarden Cluster Specifications

9 processors, 1 processor per node: vendor_id GenuineIntel, model Intel(R) Pentium(R) 4 CPU, cpu MHz 1694.877, cache 256 KB, RAM 256 MB, bogomips 3381.65

Texas Tech HPCC computers

Antaeus (cluster):   32 processors, 2 per node, vendor_id AuthenticAMD, model AMD Athlon(tm) processor, cpu MHz 1200.077, cache 256 KB, RAM 512 MB, bogomips 2383.03

Mathwulf (cluster):  18 processors, 1 per node, vendor_id AuthenticAMD, model AMD Athlon(tm) processor, cpu MHz 1000.030, cache 256 KB, RAM 512 MB, bogomips 1985.67

Weland (cluster):    66 processors, 2 per node, vendor_id AuthenticAMD, model AMD Athlon(tm) MP 2000+, cpu MHz 1666.702, cache 256 KB, RAM 512 MB, bogomips 3298.17

Pleione (shared memory): 56 processors (300 MHz IP27), main memory size 56832 Mbytes, instruction cache size 32 Kbytes, data cache size 32 Kbytes, secondary unified instruction/data cache size 8 Mbytes
