Investigating the Feasibility of an MPI-like Library Implemented in .Net Using Only Fully Managed Code

Daniel Holmes

MSc in High Performance Computing

The University of Edinburgh

Year of Presentation: 2007

Abstract

The .Net development platform and the C# language, in particular, offer many benefits to programmers including increased productivity, security, reliability and robustness, as well as standards-based application portability and cross-language inter-operation. The Message Passing Interface (MPI) is a standardised high performance computing paradigm with efficient, frequently-used implementations in many popular languages. A partial implementation of McMPI, the first MPI-like library to be targeted at .Net and written in pure C#, is presented. It is sufficiently complete to demonstrate typical application code and to evaluate relative performance. Although the effective bandwidth for large messages (over 100 Kbytes) using 100Mbit/s Ethernet is good, the overheads introduced by .Net remoting and object serialisation are shown to result in high latency and to limit bandwidth to 166Mbit/s when using a 1Gbit/s Ethernet interconnection. A possible resolution that still uses pure C#, i.e. using .Net sockets, is proposed but not implemented.

Contents

Chapter 1. Introduction ...... 1

Chapter 2. Background ...... 2

2.1 Object-Oriented HPC ...... 2
2.2 Comparing .Net and Java ...... 3
2.2.1 Portability ...... 5
2.2.2 Connectivity ...... 5
2.2.3 Software Engineering ...... 6
2.2.4 Security ...... 6
2.2.5 Other Benefits ...... 8
2.2.6 Numerical Issues ...... 8
2.2.7 Performance Issues ...... 8
2.3 Programming in .Net ...... 9
2.4 HPC and .Net Programming ...... 11
2.5 Implementation Paradigms for MPI in C# ...... 12
2.5.1 Platform Invoke – MPI.NET ...... 13
2.5.2 Sockets vs. Remoting ...... 14

Chapter 3. Requirements Analysis for a Solution in C# ...... 16

3.1 Headers and Message Data ...... 16
3.2 Queues and Progress ...... 17
3.3 Threads and Locks ...... 19
3.4 External Interface ...... 20
3.4.1 Data types ...... 20
3.4.2 Memory References ...... 21
3.4.3 Layout ...... 23
3.4.4 Process Identifiers ...... 23

Chapter 4. Designing and Building a Solution in C# ...... 24

4.1 Object Interaction ...... 24
4.1.1 Non-Blocking Send ...... 26
4.1.2 Non-Blocking Receive ...... 27
4.2 Class Structure ...... 28
4.2.1 The Internal Storage Structures ...... 29
4.2.2 The RemoteController Class ...... 30
4.2.3 The Process Class ...... 32
4.2.4 The Communicator Class ...... 33
4.2.5 The Request and Status Classes ...... 34
4.3 Configuration, Start up and Shutdown ...... 35
4.3.1 The Node Manager Utility ...... 35
4.3.2 The Initialise Method ...... 36
4.3.3 The Finalise Method ...... 36
4.4 Summary ...... 36

Chapter 5. Testing...... 37

5.1 Proving Correctness using Ping-Pong Duplex ...... 37
5.2 Investigating Performance using Ping-Pong Simplex ...... 38
5.3 Discussion of Ping Pong Simplex Results ...... 42
5.4 Demonstrating Real-World Code using Image Processor ...... 43
5.5 Discussion of Image Processor Results ...... 49

Chapter 6. Further Work ...... 50

6.1 Reducing Latency ...... 50
6.2 Testing Support for InfiniBand and Myrinet ...... 51
6.3 Increasing Portability ...... 52
6.4 Scaling to Large Parallel Systems ...... 53
6.5 Completing McMPI ...... 53
6.6 Building Remoting from MPI ...... 53

Chapter 7. Conclusion ...... 54

Appendix A. Project Plan ...... A

Appendix B. Data tables ...... C

Appendix C. References & Bibliography ...... N

Appendix D. Glossary ...... R


List of Figures

Figure 1: Building and Executing Java Programs ...... 3

Figure 2: Building and Executing .Net Programs ...... 4

Figure 3: Definition of Type-safe and Verifiably Type-safe in .Net (25) ...... 6

Figure 4: Simplified View of Code Access Security in .Net (26) ...... 7

Figure 5: Remoting infrastructure overview (33) ...... 15

Figure 6: Source code for using the IsSerializable property ...... 21

Figure 7: UML Sequence diagram for non-blocking send and receive ...... 25

Figure 8: UML Static Structure for the public interface of the McMPI classes ...... 28

Figure 9: Source code for the internal storage structures ...... 29

Figure 10: Source code for the PullMessageData method ...... 30

Figure 11: Source code for the PushSendHeader method ...... 31

Figure 12: Source code for the CombineMatched and PullData functions ...... 32

Figure 13: Ping Pong round-trip time for small messages – 100Mbit/s ...... 39

Figure 14: Ping Pong round-trip time for small messages – 1Gbit/s ...... 39

Figure 15: Ping Pong round-trip time for all message sizes – 100Mbit/s ...... 40

Figure 16: Ping Pong round-trip times for all message sizes – 1Gbit/s ...... 40

Figure 17: Ping Pong effective bandwidth – 100Mbit/s ...... 41

Figure 18: Ping Pong effective bandwidth – 1Gbit/s ...... 41

Figure 19: Image Processor times – small image, single precision ...... 45

Figure 20: Image Processor times – small image, double precision ...... 45

Figure 21: Image Processor times – medium image, single precision...... 46

Figure 22: Image Processor times – medium image, double precision ...... 46


Figure 23: Image Processor times – large image, single precision ...... 47

Figure 24: Image Processor times – large image, double precision ...... 47

Figure 25: Communication overheads in the Image Processor code ...... 48

Figure 26: Winsock Direct topology in Windows Compute Cluster Edition (39) ... 51

Figure 27: Typical network topology for a CCS cluster (40) ...... 52

Figure 28: Truly portable classes are a subset of all the .Net runtimes ...... 52

Figure 29: The original project plan ...... B


List of Tables

Table 1: Communication methods implemented in McMPI so far ...... 36

Table 2: Results from Ping Pong Simplex for 100Mbit/s Ethernet connection ...... G

Table 3: Results from Ping Pong Simplex for 1Gbit/s Ethernet connection ...... L

Table 4: Execution times for minimal and eager protocol remoted methods ...... M

Table 5 : Results from the Image Processor code ...... M

Table 6: Configuration of test hardware ...... M


Acknowledgements

I would like to thank both my supervisors, Dr Judy Hardy and Dr Stephen Booth, for their help and guidance throughout this project. I would also like to thank the EPCC systems department for providing the testing equipment used during this project.


Chapter 1.

Introduction

High performance computing (HPC) exploits parallelism to make use of the computing power of multiple hardware resources. The physical architectures of such systems can be split into two broad categories, those with shared memory and those with distributed memory. There are two programming paradigms that are generally accepted and widely used in HPC: OpenMP, which uses multiple software threads within one operating system process, and MPI, which uses multiple operating system processes. Whilst OpenMP requires a physical architecture with shared memory, most implementations of MPI can make efficient use of either type of system.

Both OpenMP and MPI have been implemented in very many different ways and using a variety of popular programming languages. For a new programming language to be used for high performance computing it is essential that support for one or both of these standard approaches is extended to that new language. Generally, this involves accessing an existing implementation from the new language, porting an existing implementation to it or creating an entirely new implementation in the new language.

The advent of .Net, with its support for multiple languages that target a standardised virtual machine, presents the opportunity to extend support for HPC to all .Net languages on all .Net capable hardware architectures with a single .Net implementation. However, there is only limited support for HPC programming techniques in .Net. The only access that C# (one of the .Net languages) has to MPI is via MPI.NET (1), which delegates to an existing non-.Net library, e.g. MPICH (2). This reduces many of the benefits of the .Net system, including portability, security, reliability and robustness as well as introducing a potentially avoidable overhead, i.e. the call to non-.Net code. Whilst the Microsoft Corporation (3) has added support for OpenMP directives into their managed code (binary code that targets the .Net virtual machine) compiler for C++, no such support has yet been added to their C# compiler or to other .Net compilers.

The main aim of this feasibility study is to investigate creating a new implementation of MPI using pure C#. The evaluation criteria for the resulting library are its performance relative to the only existing solution, i.e. MPI.NET, and portability to other .Net languages and between the different .Net virtual machines. To achieve this aim only a subset of the full MPI library specification will be implemented, specifically to include non-blocking point-to-point messaging.

The following chapters are structured as follows. Chapter 2 provides some background to the project and to .Net itself, Chapter 3 considers the main requirements for MPI, Chapter 4 sets out the design and implementation of the new .Net library, called McMPI (Managed code MPI), Chapter 5 presents the results of performance testing, Chapter 6 discusses possible future work and Chapter 7 contains concluding remarks. Note the glossary in Appendix D defines some of the more frequently used .Net terms.


Chapter 2.

Background

This chapter begins with a brief survey of existing object-oriented libraries for high performance computing, then introduces the .Net platform and runtime by comparing it to Java in section 2.2 and, in sections 2.3 and 2.4, discussing its evolution as well as reasons for and against its use in academic research and scientific programming. Section 2.5 enumerates the options for implementing MPI in .Net including the advantages and disadvantages of each approach.

2.1 Object-Oriented HPC

Combining object-orientation and high performance computing is not a new concept.

Both Java and .Net include good support for threads and therefore for shared memory programming. The OpenMP API (4) has been implemented using Java threads, resulting in JOMP (5); although it is possible to do the same using .Net threads, this has not yet been attempted.

The MPI standard (6) includes bindings for C++. Both mpiJava (7) and MPJ (8) define a Java API and there are various implementations for Java. Some delegate to an existing MPI library, e.g. JavaMPI (9) and mpiJava (10). Some pure Java versions follow the MPI standard closely, e.g. JMPI (11) and jmpi (12), while others deviate from it in order to provide more object-oriented functionality, e.g. JOPI (13) and CCJ (14).

There are many other object-oriented languages, though some are research projects or intended for rapid prototyping, such as Ruby. MPI Ruby (15) is a useful reference implementation of a fully object-oriented interface for the functionality in the MPI standard. It takes the novel approach of running a modified Ruby interpreter as an MPI program using MPICH. Further work in MPI Ruby (16) adds remote memory access functionality from the MPI-2 standard (17). There is also an MPI implementation for Python (18), an object-oriented language that claims (19) to be source code portable across many operating systems and even to both the Java and .Net virtual machines.

Like Java, .Net includes sockets for network communication and the ability to invoke non-local methods and code written in other languages. However, there are no implementations of MPI in C#, other than MPI.NET.

No HPC language is complete without extensive library support for advanced mathematical operations. Visual Numerics (20) provide their mathematics and data visualisation products in both pure Java and pure C#. NIST publish information on a wide range of Java numerical libraries (21). There is no equivalent survey of .Net numerical library products, although products such as Extreme Optimisation (22) and NMath (23) do exist.


2.2 Comparing .Net and Java

Java and .Net are, in many respects, very similar both in terms of programming language syntax and semantics and as software development tools. This section details some of the common aspects but also highlights some of the important differences between them. The overall architecture of the development process is almost identical.

The process for building and executing Java programs is shown in Figure 1.

Figure 1: Building and Executing Java Programs


Once the programmer has written Java source code, a static compiler produces Java byte-code, which is an intermediate representation between human-readable source code and machine-readable machine code. This byte-code can be converted to machine code for execution in one of two ways, either by static compilation using a native code generator or via a Java Virtual Machine (JVM). Most JVM implementations use a Just-In-Time compilation process to produce machine code but some still use an interpreter, which is generally much slower.

The process for building and executing .Net programs is shown in Figure 2.

Figure 2: Building and Executing .Net Programs


The programmer can write source code in any .Net language, including C#, which conforms to an internationally recognised standard, as well as many others. A static compiler produces Common Intermediate Language code (CIL or IL for short, or MSIL historically), which is, like Java byte-code, an intermediate representation between human-readable source code and machine-readable machine code. Again this CIL can be converted to machine code for execution either by static compilation using a native code generator or via a Virtual Execution System (VES). Similarly to JVM implementations, most VES implementations use Just-In-Time compilation to produce machine code but some still use an interpreter, which is generally much slower.

Strictly speaking, as the JVM includes a garbage collection service, the .Net analogue is a Common Language Runtime (CLR), usually referred to as the .Net runtime, which includes both the VES and the garbage collection service.

The following sections (2.2.1 – 2.2.7) compare Java and .Net/C#. Note that the headings and most of the information about Java (but not .Net or C#) are based on the course notes for Object-Oriented Programming for High Performance Computing (24).

2.2.1 Portability

Both Java and .Net are platform neutral: Java creates byte-code for the Java Virtual Machine and .Net creates Common Intermediate Language for the Virtual Execution System. Java byte-code and CIL are platform independent as they are targeted at standard abstract machines and will run on any platform with a JVM or a CLR, respectively. The list of compatible platforms currently includes most combinations of popular hardware and operating system. Both Java and C# have a standardised language specification that is also platform independent, for example the size of primitive data types is specified by the language not the platform.

Portability is important because applications typically have a longer lifetime than a particular generation of hardware. Also publishing portable applications can be achieved without distributing the source code; the Java byte-code or the .Net CIL is sufficient. The Grid relies on portability to enable the abstraction of heterogeneous hardware into a single meta-resource.

2.2.2 Connectivity

Both have considerable built-in support for distributed applications, including:

- the ability to invoke methods of non-local objects as if they were local, which is called Remote Method Invocation, or RMI, in Java and simply Remoting in .Net
- stream-based network connections via sockets
- dynamic loading and execution of code at runtime, which allows programs to download and execute code from the intranet or internet

Connectivity features are important for visualisation and computational steering. Section 2.5.2 discusses .Net remoting and sockets in more detail.


2.2.3 Software Engineering

Both Java and C# are object-oriented languages. This is a well-established paradigm that aims to substantially reduce development time by encouraging domain-level abstraction and code re-use with concepts such as encapsulation, inheritance and polymorphism. Both languages have features that allow the programmer to concentrate on defining the application's tasks rather than how the hardware will achieve those tasks: no pointers, garbage collection, type checking, array and string bounds checking, exception handling, plus a debugger and extensive class library included in the Software Development Kit (SDK). Both languages can be used for rapid prototyping and they can produce better quality code because the compiler and programmer aids, such as the Integrated Development Environment (IDE) supplied in the SDK, can find and highlight semantic errors as well as syntax errors before the code is executed.

2.2.4 Security

Both Java and .Net have several important security features. Java does not allow direct access to memory. In .Net, direct memory operations are possible but are deemed to be unsafe code. The compiler will not compile unsafe code unless specifically told to do so and even then it will produce CIL code that is marked as unsafe. The .Net runtime, or CLR, will not load or execute unsafe CIL code unless specific Code Access Security (CAS) privileges have been granted for that code by an administrator user.

Both Java and .Net include a verification process before code is executed. Java byte-code verification is only performed on untrusted code and prevents corrupted byte-code from executing. In .Net, the runtime examines every assembly (equivalent to a .jar file in Java) as it is loaded to make sure that, not only is it well-formed, but also that it is verifiably type-safe, as defined in Figure 3. It is possible, although inadvisable in general, for an administrator user to define a security policy that bypasses verification on a per assembly basis, but only for locally generated code, i.e. downloaded, mobile or dynamically generated code must be verifiably type-safe in order to execute. The .Net Framework SDK from Microsoft includes a utility to verify the type-safety of any assembly without attempting to execute the code it contains.

"Type-safe code is code that accesses types only in well-defined, allowable ways. For example, given a valid object reference, type-safe code can access memory at fixed offsets corresponding to actual field members. However, if the code accesses memory at arbitrary offsets outside the range of memory that belongs to that object's publicly exposed fields, it is not type-safe. Code that is proven during verification to be type-safe is called verifiably type-safe code. Code can be type-safe, yet not be verifiably type-safe, due to the limitations of the verification process or of the compiler."

Figure 3: Definition of Type-safe and Verifiably Type-safe in .Net (25)


In Java, untrusted code runs in a sandbox and it has restrictions on what it is allowed to do, for example it may not be allowed to access the local file-system. In .Net, all code is restricted by the combination of the effective security policy and the declarative and imperative code access security demands, as illustrated in Figure 4. Each assembly file contains a manifest that includes a list of security requirements that must be granted for that code to execute. Before the .Net runtime loads the assembly, it checks the manifest against the security policy to make sure the necessary permissions have been granted. If not, then the assembly is not loaded or executed. Additionally, during execution, the code can make imperative security demands for reduced or elevated permissions. For elevation demands, the .Net runtime re-checks the security policy and either grants the permission and continues execution or raises an appropriate security exception. Permission reductions can be specified before calling third-party code to limit the permissions that are granted to it. They are always accepted by the .Net runtime and can be reverted after the third-party code has executed so that the full permission set is only applied to trusted code.

Figure 4: Simplified View of Code Access Security in .Net (26)

In both Java and .Net a digital signature can be attached to the byte-code or CIL. This is called trusted code in Java and strongly named code in .Net. In Java, trusted code can be executed without sandbox restrictions. In .Net, a strongly named assembly cannot be called by code that has not been granted full trust. The code developer can (but should not) turn off this safe-guard by allowing partially trusted callers in the manifest before digitally signing the assembly. In both Java and .Net, once the code is digitally signed it is tamper-evident as it cannot be changed without invalidating the signature.


2.2.5 Other Benefits

Both Java and .Net can access libraries written in other languages, e.g. C or FORTRAN. In Java this is called Java Native Interface (JNI) whereas in .Net it is called Platform Invoke (P/Invoke). This allows access to existing libraries, such as MPICH and LAPACK, but impairs the security, portability and robustness of the resulting 'hybrid' applications. There is also a small performance overhead incurred per invocation.

Both languages provide access to portable and simple GUI functionality. Traditionally, .Net only included support for the proprietary Windows Forms GUI but free portable packages such as GTK# have brought .Net GUI compatibility to many more platforms. In future, the advent of Windows Presentation Foundation in .Net 3.0 promises to extend the portability of GUI functionality using XAML (originally eXtensible Avalon Markup Language, when Avalon was the codename for Windows Presentation Foundation, but now it is not an acronym), an XML-based description of the GUI that can be interpreted and displayed by any .Net runtime implementation.

Both Java and .Net are gaining popularity in commercial businesses whereas C, C++ and FORTRAN programmers are becoming rarer. In scientific programming Java is starting to gain acceptance but .Net appears to have less traction. Both Java and .Net are freely available on almost all platforms, although Java is more mature on non-Windows operating systems. Most operating system vendors produce a reliable JVM that performs well and the Open Source community, in the form of Mono and Portable.Net, have taken on the challenge of producing .Net runtime implementations for a wide range of non-Windows platforms including Linux, FreeBSD, Mac OS and Solaris.

2.2.6 Numerical Issues

Neither Java nor .Net provides any native mechanism for handling IEEE 754 floating-point exceptions, other than divide-by-zero and overflow, and only round-to-nearest is supported. There is no native support for complex numbers in either language, although C# allows the programmer to create value-types that are treated like built-in primitives and to overload operators so at least it is possible to get a presentable syntax, i.e. similar to FORTRAN. However, performance relies heavily on in-lining and other optimisations made by the JIT compiler.
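
As an illustration of this language feature, a minimal sketch of a user-defined complex value type is shown below; the type and member names are invented for this example and are not part of any standard library.

    using System;

    // Illustrative sketch only: a user-defined value type for complex numbers.
    // The struct keyword creates a value type and operator overloading gives
    // a FORTRAN-like syntax for arithmetic on it.
    public struct Complex
    {
        public double Re;
        public double Im;

        public Complex(double re, double im)
        {
            Re = re;
            Im = im;
        }

        public static Complex operator +(Complex a, Complex b)
        {
            return new Complex(a.Re + b.Re, a.Im + b.Im);
        }

        public static Complex operator *(Complex a, Complex b)
        {
            return new Complex(a.Re * b.Re - a.Im * b.Im,
                               a.Re * b.Im + a.Im * b.Re);
        }
    }

Application code can then write c = a + b * b exactly as for built-in numeric types, but the cost of each operation depends on the JIT compiler in-lining the operator methods.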

2.2.7 Performance Issues

Performance in both languages depends on how the code is written, how well the static compiler can optimise the byte-code or CIL and how efficient the JVM or CLR is at executing the code.

Multidimensional arrays in Java are always jagged arrays, i.e. arrays of arrays. This involves extra memory for row-pointers and multiple de-references for each element access. It also potentially distributes the memory storing the rows in a non-contiguous manner, which means that the memory access pattern is likely to be non-linear. In C#, arrays can be single-dimensional, jagged or rectangular as in C. The .Net runtime includes specific, optimised CIL instructions for single-dimension and jagged array access but not for rectangular array access. This allows jagged arrays to out-perform rectangular ones, but only for small arrays with a row-by-row element access pattern. The best performance for both Java and .Net is achieved using a single-dimension array and calculating the offset from the index values, which is untidy.
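
The difference between the three styles can be seen directly in source code; the following fragment is an illustrative sketch only, with arbitrary sizes and indices.

    using System;

    class ArrayStyles
    {
        static void Main()
        {
            const int rows = 4, cols = 8;

            double[,] rect = new double[rows, cols];     // rectangular array
            double[][] jagged = new double[rows][];      // jagged array (array of arrays)
            for (int i = 0; i < rows; i++)
            {
                jagged[i] = new double[cols];
            }
            double[] flat = new double[rows * cols];     // single-dimension array

            rect[2, 3] = 1.0;                            // no specialised CIL instruction
            jagged[2][3] = 1.0;                          // two de-references, optimised CIL
            flat[2 * cols + 3] = 1.0;                    // offset calculated by hand

            Console.WriteLine(flat[2 * cols + 3]);
        }
    }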

2.3 Programming in .Net

All computer languages are a compromise between efficiency of execution and ease of development, i.e. between how well the computer can interpret and execute what is required and how easily the programmer can create, read and modify those requirements. In general, the closer to machine code the language is, the better the performance that can be achieved but the more human effort that must be invested. In contrast, as more high-level concepts are introduced into a language, it becomes easier and faster to write large, complex programs but the runtime performance of those programs is usually noticeably lower.

At one extreme, assembly language instructions form the lowest-level human-readable computer languages. They are specific to a single processor type and, whilst this allows the programmer to take maximum advantage of every feature of the processor, it does so only at the expense of large amounts of development time. It also requires that multiple versions of the same program be written for portability to each target machine- architecture.

Languages such as C and FORTRAN, and compilers for them, were adopted to achieve very good program performance and greater source code portability without expending a large amount of programmer effort. Many low-level details of the hardware are abstracted into a single coherent platform. For example, variables in C are stored in 'memory', which may, in fact, be virtual memory on disk, main physical memory, L2 or L1 cache or a register in the CPU. To operate on the value of the variable it may need to be moved into a register but this is no longer the responsibility of the programmer. The onus is moved to the compiler to make sure that the correct load instructions are scheduled at the correct time. As the runtime machine-code is generated by the compiler, good performance can only be achieved with a cooperative effort between the compiler and the programmer. However, this approach allows a single source code to be compiled for multiple hardware architectures, which reduces human effort at the expense of a little extra computer time.

The continuing evolution of C has produced languages such as C++, which adds syntax constructs necessary to support the object-oriented paradigm, and Java, which additionally takes the abstraction of the hardware to the extreme by targeting a Java Virtual Machine. The productivity benefits of these high-level concepts have performance consequences. For example, objects in C++ and Java may require a lookup table for the methods in addition to the data for the properties. This makes efficient usage of memory much harder than using a C struct. With the introduction of the JVM the onus for performance optimisation is shared between the programmer (who writes the source code, once), the static compiler (that generates Java byte-code from the source code, once) and the just-in-time compiler or JIT (that generates machine-code from the Java byte-code at runtime, each time the executable is run).

The Microsoft Corporation created the .Net Framework and accompanying languages, initially C# and VB.Net, both as a natural evolution of previous offerings and as a rival to Java. The .Net approach is very similar to Java (as discussed in more detail in section 2.2), with source code being compiled to CIL and executed by the VES within the .Net runtime. Although used extensively in businesses world-wide due to the substantial productivity gains over languages such as C and C++, it has seen little adoption in academia or research, for a number of reasons. There is a substantial quantity of legacy code, written in C and FORTRAN, which performs well and is still widely used. The existing skills base, and future trend, in universities and other research institutions favours Java as the de facto standard object-oriented language. Possibly the most important reason, however, is that .Net was a proprietary product, which required the Windows operating system, and so carried licensing implications that made it potentially very expensive in HPC contexts because they involve very many processors and lots of concurrent users. However, Microsoft has submitted the specifications for both the .Net runtime infrastructure (27) and the C# language (28) to ECMA (29) for standardisation. This move has allowed the development of other .Net runtime products, such as Mono and Portable.Net, as well as alternative compilers for the C# language. These are open-source projects that do not require royalties or license fees for use or deployment. Also, whereas the .Net Framework from Microsoft only runs on Windows operating systems, the Mono and Portable.Net runtimes both run on a variety of platforms, including Windows. With these major impediments removed, cross-platform .Net development has become possible (30) and it is worth investigating the feasibility of using C# as a high-performance language within computational science disciplines and in scientific research.

In scientific computing, and other true high-performance computing arenas, any performance degradation due to high-level languages is generally seen as unacceptable. When a program may need months to complete on the fastest hardware in the world then even quite a small increase in the performance will outweigh the necessary extra development time and reduce the total time-to-completion. On the other hand, commercial businesses are tending to move towards high-productivity languages to reduce development time, which in turn reduces project costs. The consequential loss of performance is either considered to be unimportant or is offset by purchasing more powerful computers, which are becoming cheaper as demand for them grows. With an ever growing need for compute power, and the advent of multi-core and multi-processor machines, businesses are increasingly looking at high-performance hardware to keep ahead of their competitors. The desire to continue to leverage the high-productivity high-level languages and avoid the cost of retraining their work force will lead to situations where C# is used as a high-performance language, even if it is not the ideal tool for the job. Research into the suitability of C# and the maturity of Mono and Portable.Net is therefore timely.


2.4 HPC and .Net Programming

There is a comprehensive threading model within .Net that can take advantage of a shared memory system. A System.Threading.Thread object can be created for each processor by passing a delegate that matches the ThreadStart signature to its constructor. Each thread is then started by calling the Start method, which allows multiple functions to execute simultaneously. By default all variables are shared, that is, visible to and accessible by all threads, although the normal variable scope rules still apply. This means that the threads can communicate using shared variables, e.g. class level variables. Access to these shared variables can be protected using the various locking mechanisms implemented in the System.Threading namespace. There are no built-in instructions for work-sharing as there are within OpenMP, so sharing the workload between the threads must be done manually by the programmer.
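
A minimal sketch of this shared-memory model is shown below; the cyclic work decomposition and all identifiers are invented for illustration and are not taken from McMPI.

    using System;
    using System.Threading;

    class SharedMemoryExample
    {
        // Class-level variable shared by all threads; protected by lockObject.
        static double total = 0.0;
        static readonly object lockObject = new object();

        static void Main()
        {
            int threadCount = Environment.ProcessorCount;
            Thread[] workers = new Thread[threadCount];

            for (int i = 0; i < threadCount; i++)
            {
                int myId = i;                           // manual work-sharing by thread id
                workers[i] = new Thread(delegate()
                {
                    double partial = Work(myId, threadCount);
                    lock (lockObject)                   // mutual exclusion for the shared total
                    {
                        total += partial;
                    }
                });
                workers[i].Start();
            }

            foreach (Thread t in workers)
            {
                t.Join();                               // wait for all workers to finish
            }
            Console.WriteLine(total);
        }

        static double Work(int id, int count)
        {
            double sum = 0.0;
            for (int i = id; i < 1000000; i += count)   // cyclic distribution of iterations
            {
                sum += 1.0 / (i + 1);
            }
            return sum;
        }
    }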

One approach to high-performance computing using .Net, therefore, would be to introduce work-sharing directives, similar to OpenMP directives, into the C# language. The model set out by JOMP would work equally well in C#. The form of the directives could be special comments (as in FORTRAN), which are converted into source code in a pre-compilation step. A new OpenMP.Net compiler could be created by writing a program that performed the pre-compilation tasks and then called the usual C# compiler to compile the modified source code. The programmer would write a single serial version of the program that includes some special 'OpenMP directive' comments, and then invoke the OpenMP.Net compiler. Multi-threading and work-sharing code would be added, using the directives as a guide. If the intermediate source code, i.e. after interpretation of the directives but before actual compilation, is to be output by the new compiler, then the complex segments of added code, e.g. to calculate the loop bounds for the next chunk of iterations under the dynamic or guided loop schedules, would be placed into a shared library and referenced as function calls.

Extending C# by adding support for OpenMP would make coding a high-performance program for a shared memory system much easier and more controlled. However, it does not address distributed memory systems. For this reason, this project will investigate implementing MPI rather than OpenMP.

When considering writing a program for a distributed memory system, the program must involve multiple processes, one per processor within the system. Note that, in a cluster of SMP nodes, mixed-mode programming (using one process per node and one thread per processor) is possible but far more complicated to create and maintain with debatable performance benefits (31)(32). Even in a task farm, the distributed processes must communicate to synchronise, get the next task and to return the results of the previous task. When the processes are more closely-coupled, communication is more important and more frequent, e.g. halo-swapping in a molecular dynamics code requires each process to exchange small messages with its nearest neighbours in every time-step of the simulation.

In .Net, there are two obvious choices for inter-process communication: sockets and remoting. These two technologies are described briefly in the following paragraphs and then compared and contrasted, from a .Net point of view, in section 2.5.2.


Sockets are a relatively low-level abstraction of network hardware that enable communication but also expose a lot of the detail of the connections to the main program. To achieve consistency of implementation, and of performance, a library containing a higher-level object structure would be needed that embeds the sockets classes in a more programmer-friendly format and provides a simple interface suitable for HPC coding, i.e. targeted at peer to peer message exchange. Also a shared-variables implementation should be built to deal with the non-distributed case.

Remoting is already most of the way towards this goal as it provides a higher-level, programmer-friendly and simple interface that provides inter-process communication using network functionality when needed, i.e. when the processes are physically distributed, but also short-circuits this overhead, when the processes are not distributed, by automatically and transparently switching to a shared-variables implementation. However, remoting is designed for multi-threaded client-server (implying many threads, especially in the server process), which is generally good for business applications, rather than for single-threaded (or at most few-threaded) peer-to-peer, which is essential for scalability and load balancing in massively parallel hardware setups. The client-server idea relies on the server being able to cope with many clients, whereas the clients connect to only one server at a time. This means the server becomes a bottle-neck and ultimately limits scalability. For load-balance, which requires that all processes have the same amount of work, each process must be both client and server. In an N-process situation, having N servers each with N client connections is overkill because all-to-all connectivity is not needed, even for all-to-all communication. Of the many known and useful topologies, the hypercube has the highest number of connections, at log(N) per node. So adding scalability and load-balancing in a peer-to-peer system adds significant complexity to standard remoting and again a library containing a higher-level object structure is needed.

In either case, the restrictions specified by MPI are good for high-performance programming. For example, MPI includes a requirement that there is a limit to the resources needed to send and receive each message. Also, in MPI each process chooses when to send and receive messages, rather than being interrupted by an event as in event-driven programming. The implications of these restrictions are discussed in detail in Chapter 3.

2.5 Implementation Paradigms for MPI in C#

The C# language is very versatile and the class libraries supplied as part of the major .Net development products contain all the functionality that is needed to implement MPI in a number of ways. The most obvious is to make direct use of a well-established, proven MPI implementation built in a different language, such as MPICH. This has already been done and is briefly discussed in section 2.5.1. This approach is likely to be very attractive because the best underlying MPI library for the platform can be chosen and the overhead in the platform invoke calls is acceptably small. However, there are inter-process communication mechanisms within the .Net specification and therefore in C#. The two choices, sockets and remoting, are discussed and compared in section 2.5.2. Web services are excellent for interoperation but the overheads, e.g. in producing XML representations of methods and parameters and the use of the HTTP communication protocol, are far too great to consider using them as a basis for a high-performance communication layer. Windows Communication Foundation (WCF), part of .Net 3.0, is not yet part of the .Net specification; it is included in the Microsoft .Net Framework but not in Mono or Portable.Net. It is built on top of remoting and so will have higher overheads in the same manner as web services.

2.5.1 Platform Invoke – MPI.NET

The MPI.NET library contains C# bindings for MPICH version 1.2.5 for the Microsoft Windows operating system. The bindings make use of the P/Invoke feature of the .Net system that allows .Net code to call functions from non-.Net libraries. A P/Invoke declaration (similar in concept to a forward declaration in languages such as C or an interface block in FORTRAN) is included in the .Net code. It specifies the location of the binary file and the name of the function (the entry point) as well as the types of the parameters for the call. This external function can then be called using the same syntax as a native .Net function. The MPICH bindings for C# in MPI.NET therefore exactly follow the layout and requirements of the C library. In particular, message parameters must be memory addresses and the supported data types are limited to simple value types such as integers, doubles or arrays of these types.
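
For illustration only, a hypothetical P/Invoke declaration for the C function MPI_Send might look like the following; the DLL name and the use of int for the MPI_Datatype and MPI_Comm handles are assumptions for this sketch, not the actual MPI.NET bindings.

    using System;
    using System.Runtime.InteropServices;

    static class NativeMpi
    {
        // Hypothetical P/Invoke declaration for illustration only. The DLL name,
        // the entry-point and the representation of the handle types as int are
        // assumptions and do not reproduce the MPI.NET source code.
        [DllImport("mpich.dll", EntryPoint = "MPI_Send")]
        public static extern int MPI_Send(IntPtr buf, int count, int datatype,
                                          int dest, int tag, int comm);
    }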

The garbage collector in .Net may move the memory that stores the value of a variable to improve overall usage efficiency. A safe memory address for a variable can be obtained by instructing the garbage collector to pin the memory for the variable and to return its handle for that block of memory. The variable must be unpinned after the MPICH library has finished accessing it.
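
A sketch of this pinning pattern, with invented identifiers, is:

    using System;
    using System.Runtime.InteropServices;

    class PinningExample
    {
        static void Send(double[] buffer)
        {
            // Pin the array so the garbage collector cannot move it while
            // native code (e.g. MPICH via P/Invoke) holds its address.
            GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
            try
            {
                IntPtr address = handle.AddrOfPinnedObject();
                // ... pass 'address' to the native library here ...
            }
            finally
            {
                handle.Free();   // unpin as soon as the native call has finished
            }
        }
    }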

Constructing customised data types for structures is possible by calling the appropriate creation functions in MPICH via MPI.NET but this will not work for all types in C#. A struct type in C may include padding added by the compiler so the memory locations of its elements are specified using byte offsets from the beginning of the structure. This information is much harder to obtain in C# because all direct memory operations are deemed to be unsafe and forbidden. Unsafe code can be enabled but doing so impairs the ability of the garbage collector to efficiently manage memory because several of its basic assumptions are invalidated. Static analysis of safe .Net code can yield enough information to be sure when memory resources can be re-located or reclaimed. This is not true of unsafe code. Also any program that enables unsafe code requires that full trust security privileges be granted to it at runtime; whereas safe code can demand the specific privileges it needs and refuse all others.

A much easier way to send and receive a complex object (or object graph) is to serialise it into a byte array and send and receive that instead. The serialisation step introduces extra overheads, i.e. processing time, memory for the temporary storage and to describe the layout of the fields in the object graph. It also avoids the overheads of packing the struct data elements into a contiguous memory block. Serialisation is substantially easier for the programmer than attempting to get the detailed information about the memory layout needed for a struct data type.
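
A minimal sketch of this serialisation approach, assuming the object graph is marked [Serializable], is:

    using System.IO;
    using System.Runtime.Serialization.Formatters.Binary;

    class SerialisationExample
    {
        // Illustrative sketch: turn any serialisable object graph into a byte
        // array that can be sent as a message, and recover it again afterwards.
        static byte[] ToBytes(object graph)
        {
            BinaryFormatter formatter = new BinaryFormatter();
            MemoryStream stream = new MemoryStream();
            formatter.Serialize(stream, graph);
            return stream.ToArray();
        }

        static object FromBytes(byte[] data)
        {
            BinaryFormatter formatter = new BinaryFormatter();
            return formatter.Deserialize(new MemoryStream(data));
        }
    }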


The combination of object serialisation and platform invoke bindings to an existing, optimised, high performance implementation of MPI will provide a good benchmark for testing the feasibility and relative performance of a new pure .Net implementation.

2.5.2 Sockets vs. Remoting

In pure .Net, there are two obvious choices for inter-process communication: sockets and remoting. In theory, either of these two methods could be used effectively; however, in practice, remoting is far more frequently chosen because it provides a much higher level of abstraction, making it easy to use and very flexible whilst remaining extensively customisable. This is also the major drawback of remoting because it introduces overheads that negatively affect performance.

There are two main disadvantages to using sockets: complexity of code (in both creation and maintenance) and inflexibility (requiring further coding).

Remoting syntax is identical to a simple method call but code using sockets involves many interacting objects. The System.Net.Sockets namespace contains classes that implement the TCP network protocol stack. A TcpListener object attaches to a particular IPEndPoint (a combination of an IP address and a port number) and waits for incoming connection requests. A TcpClient object represents an active network connection between two different end points. It may be returned by the TcpListener when an incoming connection request is accepted or created manually for an outgoing connection. These TCP classes wrap the lower-level Socket class, which provides raw access to the network hardware. Connections must be managed manually, that is, a TcpListener must be started on one side; then a TcpClient must be created on the other side and a connection attempt made; the TcpListener must then accept the request and produce a matching TcpClient. Once data has been sent and received, as required, then both the TcpClient objects must be closed and destroyed to conserve system resources. Accessing the TcpClient objects can be abstracted one step further by retrieving a NetworkStream object. The NetworkStream object represents the network connection as a two-way stream of bytes but can be chained with other stream objects, such as System.IO.StreamReader and System.IO.StreamWriter, which present one-way streams of characters.
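
The following sketch, using an invented port number and a single line of text as the payload, illustrates the manual connection management described above:

    using System;
    using System.IO;
    using System.Net;
    using System.Net.Sockets;

    class SocketSketch
    {
        // Illustrative sketch only; port 9000 and the "ping" payload are arbitrary.
        static void Server()
        {
            TcpListener listener = new TcpListener(IPAddress.Any, 9000);
            listener.Start();                               // wait for connection requests
            TcpClient connection = listener.AcceptTcpClient();
            StreamReader reader = new StreamReader(connection.GetStream());
            Console.WriteLine(reader.ReadLine());           // receive one line of text
            connection.Close();                             // release the connection
            listener.Stop();
        }

        static void Client()
        {
            TcpClient connection = new TcpClient("localhost", 9000);
            StreamWriter writer = new StreamWriter(connection.GetStream());
            writer.WriteLine("ping");                       // send one line of text
            writer.Flush();
            connection.Close();
        }
    }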

The other major disadvantage of using sockets for inter-process communication is that it is inflexible because it forces the use of high-latency network communication even when it is not needed, e.g. in a shared-memory machine or inside a node in a cluster. This can be overcome by implementing a second communication method, using shared variables, and switching between the two depending on the physical locations of the end points. This makes the connection layer code more complex but follows directly in the footsteps of various implementations of MPI such as MPICH, which achieves very good performance on a variety of hardware platforms and memory architectures.

Remoting is primarily a mechanism for distributing code in an n-tier program design. It is an abstract model of inter-process communication with a modular design that facilitates extensions and modifications to the default behaviours. Fundamentally, remoting needs two channels, a server channel that connects the server code to the transport medium and a client channel that connects the client code to the transport medium. A channel is made up of a chain of channel sinks, which must include at least one formatter sink and a single transport sink, as in Figure 5, which shows the default TCP channel configuration for a cross-machine scenario.

In a single-machine, single-process scenario, the client object reference would simply be a pointer directly to the server object instance, exactly as in normal object creation without remoting. If the server object is hosted by another process, on the same machine or on a different machine, then remoting will return a reference to a transparent proxy to the client on object creation. To the client, this proxy is almost indistinguishable from the actual server object. When the actual object is needed, i.e. when the client calls a method or accesses a property, the proxy captures the details of the request (called marshalling) and passes the information to the client channel sink chain. The formatter sink packages the information into a form that can be transported to another process and the transport sink sends the packaged information to the other process. If the process is local then no network communication is necessary and the transport occurs using a shared variables implementation. For non-local processes the transport sink will communicate using a particular protocol. Once received in the server process by the transport sink in the server channel, the information for the method call is unpackaged (called unmarshalling) by the formatter sink. The Stack Builder uses the method call information to select or create an appropriate server object instance to service the call and collects the parameters returned by (or the exception raised by) that call. The whole process is then reversed to get the results back to the client code.

Figure 5: Remoting infrastructure overview (33)

The major benefit of remoting, from a programmer's perspective, is ease of use; the minimum requirements are that the server object inherits System.MarshalByRefObject (directly or indirectly) and the channel setup data is specified in a configuration file on both the client and server machines. The disadvantage is the overhead introduced by the software stack. This feasibility study will make use of the rapid development possible with remoting and determine if acceptable performance can be achieved.
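
A minimal sketch of these requirements is shown below; the type name, port number and object URI are invented for illustration, and the channel could equally be configured in a configuration file rather than in code.

    using System;
    using System.Runtime.Remoting;
    using System.Runtime.Remoting.Channels;
    using System.Runtime.Remoting.Channels.Tcp;

    // Illustrative sketch: a remotable type must inherit MarshalByRefObject.
    public class MessageServer : MarshalByRefObject
    {
        public string Echo(string text)
        {
            return text;                       // executes in the server process
        }
    }

    class ServerHost
    {
        static void Main()
        {
            // Register a TCP channel and publish the server type at a well-known URI.
            ChannelServices.RegisterChannel(new TcpChannel(8085), false);
            RemotingConfiguration.RegisterWellKnownServiceType(
                typeof(MessageServer), "Echo", WellKnownObjectMode.Singleton);
            Console.ReadLine();                // keep the server process alive
        }
    }

A client in another process would then obtain a transparent proxy with Activator.GetObject(typeof(MessageServer), "tcp://hostname:8085/Echo") and call Echo exactly as if the object were local.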


Chapter 3.

Requirements Analysis for a Solution in C#

The requirements for a strict MPI library are well specified and fully documented in the MPI Standard (6)(17). However, the standard was targeted at pre-object-oriented languages. Most implementations of MPI that use object-oriented languages modify the interface and are labelled MPI-like as a result.

It is a design goal in the object-oriented paradigm to re-use existing code wherever possible. Solution architects are actively encouraged to make use of the resources in vast libraries of re-usable classes. The Microsoft .Net Framework includes not only the CLR but also very many useful classes and programming aids. The SSCLI (34), the Mono project (35) and the Portable.Net (36) product all supply their own implementation for most of the non-platform-specific classes from the .Net Framework, as well as their equivalent of the CLR.

It is therefore essential to undertake an analysis of the key requirements of MPI in a .Net context. The following sections examine some of the opportunities for, and challenges of, making use of the functionality available in C#.

3.1 Headers and Message Data

One of the central design principles of the MPI standard is a restriction on the resources used for each message, in particular in the receiving process. The restriction states that the resources for each message should never exceed a known limit, independent of the size of the message, until the receiving process has matched the incoming message. When memory management is done manually, this restriction is essential so that the program can allocate enough memory for the message data before it arrives. In .Net, when using languages such as C#, the memory needed by the incoming object graph will be allocated automatically during the de-serialisation inherent in the remoting call, so this restriction is not as important. However, in application codes that reference and modify large objects, or very many small objects, it is important to be able to predict the memory usage of included libraries. For this reason a request-acknowledge-transfer (or rendezvous) protocol is adopted by all MPI and MPI-like libraries, at least for large messages. An eager protocol (where the actual data is sent during the first protocol message) would be faster because it avoids at least one round trip between source and target. To switch between these two protocols, the sending process must be able to determine the size of the message data.

In remoting, the data will be serialised from an object (or object graph) to a stream of bytes, transmitted from the source to the target, then de-serialised back into an object (or object graph). Therefore, the message size is the size of the serialisation stream. There is no easy way to determine the size of a serialisation stream for a general object graph. It could be argued that the stream is just temporary storage and the size that really matters is that of the object itself. However, even determining that is also problematic for many types of object.

Some objects do have a well-defined size and the serialisation overhead can be accurately predicted, for example all built-in value types, such as System.Int32, System.Double and arrays of these types. However, for user-defined classes, determining the object size is best achieved by serialising the object into a System.IO.MemoryStream and examining the Length property, which involves making a complete copy, requiring CPU time and extra memory. This extra resource usage amounts to a doubling of the serialisation overhead because the syntax of .Net remoting requires that the MemoryStream, or its underlying byte array, be serialised again during the remote method. In theory, the size could also be calculated by traversing the object graph and adding up the contributions of all of the fields, but accessing type information for private fields requires dangerous code access security permissions that are very rarely granted in practice.
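
A sketch of this measurement, assuming a [Serializable] object graph, is:

    using System.IO;
    using System.Runtime.Serialization.Formatters.Binary;

    class SizeProbe
    {
        // Illustrative sketch: the only general way to learn the serialised size
        // of an object graph is to serialise it and read MemoryStream.Length,
        // which costs CPU time and a complete extra copy of the data.
        static long SerialisedSize(object graph)
        {
            MemoryStream stream = new MemoryStream();
            new BinaryFormatter().Serialize(stream, graph);
            return stream.Length;
        }
    }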

Without knowledge of the message size, no determination can be made of which messages are large (and require the rendezvous protocol) and which are small (and could be sent using an eager protocol). Using an eager protocol for all message sizes breaks the resource restriction, so it cannot be used without limiting the types of object that can be sent and received. If this MPI-like library is to support sending and receiving general objects of all sizes, then it must use a rendezvous protocol.

The size of messages is determined in traditional MPI libraries by the data type and count parameters passed to the send method. In McMPI, the data type is not necessary, because the type of the object can be determined from the object itself. Also the count parameter is no longer relevant because only one object is ever sent. However, these two parameters could be replaced with a single new parameter that allows the programmer to specify the (approximate) size of the object. This size does not have to be accurate but simply indicates to the McMPI library whether the object is small enough for an eager protocol.

3.2 Queues and Progress

Another of the central design principles of the MPI standard is that messages must be ordered. That is, when two messages are sent by the same source to the same target and both could potentially be matched with a particular receive at the target, the first one sent must be the one that is actually matched. The converse is also true, i.e. if two receives could potentially match a single send, the first receive that was issued must be the one that actually matches the send.

Within the sending process, there are two stages: sending the header object and responding to the call for the message data. In terms of the protocol, sending the header is the request, the call for the data is the acknowledgement and sending the data is the transfer. The sending of the header to the target process can be decoupled from the external interface method call using a send queue but that requires an independent progress engine to consume the items in the send queue, sending a header for each one and moving the send task on to the next stage. Polling is quite often a non-optimal and inelegant way of solving a problem. It can be avoided here by sending the header during the non-blocking send method. This does make the non-blocking method non-local, as it communicates with the target process, but the overhead is not dependent on message size and is likely to be very small. The direct option also eliminates the delay before sending the header that is implied by the en-queuing, waiting for the next pass of the progress engine and de-queuing steps.

The timing of the second sending stage (responding to the call for the message data) depends on the target process and so it must be decoupled by storing the message data in a 'pending' data structure until needed. The target process will send the header it was given by the source to identify which message is being requested. This suggests that the pending structure should be a dictionary object, such as the strongly-typed System.Collections.Generic.Dictionary (similar to a Map in Java). This type is backed by two arrays, one for the keys and another for the values, so inserting an object can become an O(n) operation if the arrays need resizing. The initial size of the arrays in the dictionary can be specified on creation, so an initial size such as 1000 can mitigate this problem. However, if there will be a very large number of messages outstanding at any one time, then another type of dictionary, such as System.Collections.Hashtable, may be more suitable even though it is not strongly-typed.
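
A sketch of such a pending store is given below; the Header and MessageData types, the locking policy and the initial capacity of 1000 are invented for illustration and are not the McMPI classes.

    using System.Collections.Generic;

    // Illustrative sketch: pending send data keyed by the message header.
    class PendingSendStore
    {
        private readonly Dictionary<Header, MessageData> pending =
            new Dictionary<Header, MessageData>(1000);   // pre-size to avoid early resizing

        public void Add(Header header, MessageData data)
        {
            lock (pending)
            {
                pending.Add(header, data);
            }
        }

        public MessageData Take(Header header)
        {
            lock (pending)
            {
                MessageData data = pending[header];
                pending.Remove(header);                  // release the reference once pulled
                return data;
            }
        }
    }

    // Placeholder types for the sketch; a real header would override
    // Equals and GetHashCode so that lookups compare its fields.
    class Header { }
    class MessageData { }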

Receiving is more complicated than sending because it involves matching incoming headers with receive method calls. A header that cannot be immediately matched on arrival must be stored until a matching request is issued. Similarly, attempting to receive a message for which no matching send header has yet arrived means that the method call must be queued. To allow correct matching, both of these storage structures must maintain the order of their contents and it must be possible to enumerate them in that order. This requirement means that a System.Collections.Generic.Queue will not suffice because it is strictly a first-in-first-out queue and the IEnumerable interface does not guarantee to honour the ordering of the objects. A System.Collections.Generic.LinkedList (which represents a double-ended queue, backed by a doubly-linked list) is the most appropriate structure as all the required operations are implemented efficiently, i.e. AddLast, First, Next and Remove are all O(1).
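
An illustrative sketch of such an ordered store follows; the Header type and its matching rule are invented for this example and are simpler than real MPI matching, which also handles wildcards and communicator context.

    using System.Collections.Generic;

    // Illustrative sketch: unexpected headers stored in arrival order, searched
    // in that order, with O(1) removal of an arbitrary entry once matched.
    class UnexpectedHeaderQueue
    {
        private readonly LinkedList<Header> headers = new LinkedList<Header>();

        public void Add(Header header)
        {
            headers.AddLast(header);                      // preserve arrival order
        }

        public Header Match(int source, int tag)
        {
            for (LinkedListNode<Header> node = headers.First;
                 node != null; node = node.Next)
            {
                if (node.Value.Matches(source, tag))
                {
                    headers.Remove(node);                 // O(1) removal given the node
                    return node.Value;
                }
            }
            return null;                                  // no matching header yet
        }
    }

    class Header
    {
        public int Source;
        public int Tag;

        public bool Matches(int source, int tag)
        {
            return Source == source && Tag == tag;        // simplified matching rule
        }
    }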

Once a match is made, the message data must be retrieved from the source process, which could be decoupled by placing the matched pair into a queue but that would, as with the send queue, require an independent progress engine to consume the items in it. The amount of resources used in sending the header does not depend on the message size but the amount used for pulling the message data obviously does. This time the decoupling is crucial because the matching is performed either by a non-blocking receive method or a remoted push header method. For correct adherence to the intent of the MPI non-blocking receive method and efficiency of the remote method call from the source, neither of these should be delayed for a long time, even if the message is large and would require a long time for the data to transfer. Fortunately, this can be achieved by using worker threads, as discussed in section 3.3.


3.3 Threads and Locks

A third requirement of the MPI standard is that all messages are progressed whenever possible, primarily to prevent unnecessary deadlocks. In a single-threaded environment each method in the external interface of an MPI implementation must invoke a progress engine method. This would attempt to perform outstanding actions for some, or all, incomplete messages irrespective of the purpose of the external method from which it is called. Also all wait methods must repeatedly call the progress engine method and test for the appropriate condition. This polling consumes a large amount of CPU time, especially when no progress can be made, e.g. when one process is attempting to receive a message but a matching send has not yet been issued by the appropriate remote process.

The CLI standard specifies that System.Threading is a core namespace and must be contained in all configurations including the kernel or compact versions. Multi-threading is natively supported on all C#-compatible platforms, i.e. within all operating systems on which a CLI implementation can currently be installed. The remoting infrastructure in .Net creates an internal pool of threads to service remote requests. Taken together, these three observations indicate that attempting to produce a single-threaded progress engine for this MPI-like library would be a challenging but ultimately futile exercise.

The System.Threading.ThreadPool class supplies a ready-made set of worker threads with creation, management and control all transparently built-in and fully automatic. The interface to this class includes a method to enqueue a delegate and a single object parameter as a task. If a thread is available within the pool, then it is allocated to the task and it executes immediately. Otherwise the task is queued until a thread becomes available. The tasks are always supplied to threads in first-in-first-out order.

Instead of creating a ‗receive pending‘ queue to contain matched messages awaiting the data transfer stage, it is possible to enqueue each of these as a task into the thread pool. This achieves the desired decoupling from the matching step but also does not require a progress engine to poll and process a pending queue. Several incoming transfers can be in progress at the same time. It becomes the responsibility of the operating system and the remoting infrastructure to intelligently time-slice between these threads to achieve maximum throughput.
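
For illustration only (the method and parameter names here are hypothetical rather than the McMPI source), handing a matched receive to the thread pool amounts to a single call:

using System.Threading;

class ThreadPoolSketch
{
    // Hypothetical worker: pull the message data for one matched receive.
    static void PullData(object matchedRequest)
    {
        // ... perform the potentially long-running data transfer here ...
    }

    static void EnqueueMatchedReceive(object matchedRequest)
    {
        // The caller returns immediately; a pool thread runs PullData later.
        ThreadPool.QueueUserWorkItem(new WaitCallback(PullData), matchedRequest);
    }
}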

The final piece of the puzzle is the wait methods. Polling the status of the request(s) consumes too much CPU time. This remains true, although to a slightly lesser degree, even if a call to System.Threading.Thread.Sleep is included in the polling loop. The Sleep method either yields the remainder of the current time-slice or prevents the thread being considered for scheduling for a specified time period, depending on the parameter value. Either is better than pre-emptive thread-switching alone but neither is ideal as unnecessary work is still done repeatedly.

A more efficient approach to synchronising threads is to use locks. In this case, a System.Threading.ManualResetEvent lock can be created for each request (send or receive). This type of lock allows a thread to signal other waiting threads that some
interesting point has been reached using the Set method. It exposes a System.Threading.WaitHandle, which contains a WaitOne method and can also be used by the WaitAll and WaitAny static methods. Once a ManualResetEvent is set all waiting threads are released and future waits complete immediately. Threads blocked by a call to one of these wait methods are not scheduled for execution until the appropriate events have happened. This completely eliminates the busy-wait intrinsic to polling but places even more responsibility for efficient high performance operation onto the .Net platform, i.e. the integration and maturity of the virtual execution system (VES), the operating system and the hardware.
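
A minimal sketch of this pattern, assuming one ManualResetEvent per request (the member names are illustrative):

using System.Threading;

class RequestLockSketch
{
    // Created unsignalled; one lock object per send or receive request.
    private readonly ManualResetEvent completion = new ManualResetEvent(false);

    // Called by a worker thread when the message transfer has finished.
    public void MarkComplete()
    {
        completion.Set();   // releases all threads blocked in Wait()
    }

    // Called by user code; blocks without polling until Set() is called.
    public void Wait()
    {
        completion.WaitOne();
    }
}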

3.4 External Interface

The MPI standard lays out the external application programming interface for C and C++ very explicitly. However, the C# language was developed after the MPI standard and it provides some opportunities to enhance that interface. The general principle used here is to stay as true to the MPI standard as possible (e.g. keep the naming convention) whilst incorporating the good work done in MPI.NET and extending it in line with other well-known object-oriented interfaces and standard design patterns.

The most obvious differences in the function calls themselves concern data types and memory references. In addition, the API is laid out as methods and properties on interconnected objects rather than as a collection of functions in a library.

3.4.1 Data types

One of the advantages of object-oriented programming is polymorphism through inheritance; the ability to treat objects of different sub-classes using the interface of a common super-class. In a message-passing context this points towards being able to send and receive any derivative of System.Object, i.e. any data type as all other classes derive from this class. This turns out to be too general. Some objects do not make sense outside of their original context, for example a System.IO.File object represents a file in the local file system, which may not be accessible from a remote system. Some objects cannot be duplicated because they do not supply a way of duplicating themselves and do not release their full state to any external duplication mechanism.

The mechanism used by the remoting infrastructure is serialisation, and any object that is to be transmitted by remoting must be serialisable. This requirement is not overly restrictive: all simple types (such as System.Int32 and System.Double) are inherently serialisable, as are arrays (of any dimension) of these types. Any user-defined type can be made serialisable by adding the System.SerializableAttribute to its definition and, optionally, implementing the System.Runtime.Serialization.ISerializable interface. The attribute on its own tells the serialisation infrastructure that, when serialisation of this class is requested, the values of all fields in the class (including private fields) should be added to the serialisation stream. The attribute plus an explicit implementation of the interface gives the programmer full control over the information that is used to copy the object.
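
For example, a user-defined type (this class is purely illustrative and not part of McMPI) can be made serialisable by the attribute alone:

using System;

[Serializable]
public class HaloRegion
{
    // Because of the attribute, all fields, including private ones, are
    // written to the serialisation stream automatically.
    private int width;
    public double[] Values;

    public HaloRegion(int width)
    {
        this.width = width;
        Values = new double[width];
    }
}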


The requirement that the object to be sent must be serialisable can be checked by obtaining its type and examining the IsSerializable property of that type, as demonstrated in Figure 6.

if (message != null && message.GetType().IsSerializable)
{
    // send header
}
else
{
    throw new ArgumentOutOfRangeException();
}

Figure 6: Source code for using the IsSerializable property

This checking code allows complete freedom in the choice of input parameter type; the method will throw an exception if the parameter is null or not serialisable, which protects it from the exception that remoting would throw if it attempted to send a null or non-serialisable object.

Although the message parameter could be exposed as a System.Object, it would be preferable, as described in the paper on MPI.NET (1), to use generics. A generic method works in the same way as one that accepts (or returns) a System.Object but it also maintains type-safety. The type of the return value of a blocking receive call is an important example. Without generics, the return type must be System.Object, so that a message of any type may be returned to the user code that is receiving the message. However, the user code will store the reference to this object in a properly typed variable, e.g. a one-dimensional array of double-precision floating point numbers, System.Double[], and must therefore cast the returned object to the expected receive type. A generic method returns the correct type directly with no cast in the user code.
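
A sketch of the difference, using illustrative method names rather than the actual McMPI signatures:

class ReceiveSketch
{
    // Without generics: the return type must be System.Object.
    public object RecvObject(int source, int tag)
    {
        return new double[1000];   // placeholder for the real receive
    }

    // With generics: the declared type flows through to the caller.
    public T Recv<T>(int source, int tag)
    {
        return (T)RecvObject(source, tag);
    }

    public void Example()
    {
        double[] a = (double[])RecvObject(0, 99);  // explicit cast required
        double[] b = Recv<double[]>(0, 99);        // type-safe, no cast
    }
}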

The tag parameter is not just a simple integer because the MPI standard describes the concept of a wildcard that can represent any tag. Traditionally in C#, a null value is used to represent such a special non-data value. The syntax Nullable&lt;Int32&gt; (or its shorthand, int?) is used to declare a type that can hold any value of the underlying value type, e.g. System.Int32, or null.
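
For instance, with an illustrative matching helper, a null tag can play the role of the MPI wildcard:

class TagSketch
{
    // requestedTag == null corresponds to MPI_ANY_TAG in this sketch.
    public static bool TagMatches(int headerTag, int? requestedTag)
    {
        return !requestedTag.HasValue || requestedTag.Value == headerTag;
    }
}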

3.4.2 Memory References

Implementations of MPI in languages such as C and FORTRAN take memory addresses of variables as parameters to the send and receive functions. The other parameters specify the size of the block of memory that contains (or will contain) the data of the variable.

In C#, memory management is done automatically. This means it is not normal practice to do any sort of direct memory operation manually, including simply obtaining a memory address for a variable. The garbage collection part of the .Net runtime is responsible for releasing resources used by objects that are no longer referenced by active code. It also attempts to increase the usage efficiency of virtual memory and heap memory by moving data for variables that are still referenced by active code. Any memory operations must take this behind-the-scenes activity into account. A safe
memory address can be obtained by telling the garbage collector to 'pin' the variable in memory before getting the handle to the variable, which contains the memory address. This is essential for the C# bindings to MPICH in MPI.NET but is not necessary for a pure .Net library. The information contained in an object (or an object graph) can be extracted by serialising that object on the sender. The resulting byte-stream can be transmitted to the receiver and de-serialised back into a copy of the original object.

On the sender process, the memory address, size and type parameters can be replaced with a single object reference. This is not true on the receiver though. The de-serialisation is a special object constructor, i.e. it will convert the byte-stream into a new object and return the reference to that new object. This presents a particular problem for the non-blocking receive method because it will return to user code before the new object is created. A naïve first attempt to port the C interface to C# will result in an object reference being passed from user code to the library function. The default mode is 'by value', which means that the content of the object pointed to can be changed but the reference itself cannot. So the variable passed in from user code cannot point to the new object created by de-serialisation. Changing the mode to 'by reference' allows the actual value of the variable (i.e. which object it points to) to be changed, but only within the scope of that method call. In a blocking receive method the new object is created before the method returns but in a non-blocking receive method this is, by definition, not the case.

A solution to this problem that avoids using unsafe code, such as pointers, would be to wrap the object reference as a field in another object. The user code would create an instance of the wrapper object and set the field to a null reference. This wrapper object can then be passed into the non-blocking receive method and the library can take a copy of the reference to it. When the new object is created by de-serialising the data from the sender, the field of the wrapper object can be set to point to it. As both the user code and the library code reference the same wrapper object, both see the change to the field. The wrapper object acts like a 'safe pointer'.
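
A minimal sketch of such a wrapper (the class and field names are illustrative; the McMPI type may differ):

// Acts as a 'safe pointer': user code and library code share a reference
// to the wrapper, so the library can publish the de-serialised object by
// assigning to the Message field.
public class MessageHolder<T>
{
    public T Message;   // remains at its default (null) until the data arrives
}

// Illustrative usage (hypothetical method names):
//   MessageHolder<double[]> holder = new MessageHolder<double[]>();
//   Request request = communicator.IRecv(holder, source, tag);
//   request.Wait();
//   double[] data = holder.Message;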

A more object-oriented solution to this problem is to remove the incoming message parameter from the receive method and add a message field to the Status object instead. However, in addition to deviating even further from the MPI specification, this approach also loses the symmetry with the blocking receive method and forces the Wait methods to differentiate between a send request and a receive request. The intention of this project is to follow the MPI specification, rather than the standard .Net design patterns, so this possibility is deprecated in favour of the safe pointer.

Both of these problems are introduced by the .Net remoting syntax and, as mentioned in section 6.1, they would be easily resolved if .Net sockets were used instead.


3.4.3 Layout

All current object-oriented implementations of MPI deviate from the MPI standard in the organisation of the functionality because the basic design pattern is different. However, most are similar to each other and follow a logical arrangement of the functionality into a few objects with well-defined responsibilities. These objects expose methods and properties appropriate to the duties they must perform. One effect of this is to reduce the number of parameters needed for some operations. For example, in coding a send operation using a non-object-oriented library function the program must specify the communicator in addition to the other parameters. An object-oriented library would have a send method exposed by a communicator object instead. Thus the program does not need to specify the communicator as a parameter but it must make sure that it calls the send method on the correct communicator object.
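
As a sketch (the class below is illustrative, not the actual McMPI Communicator), the communicator is selected by choosing which object to call rather than by passing a parameter:

class CommunicatorSketch
{
    public static readonly CommunicatorSketch World = new CommunicatorSketch();

    // The communicator no longer appears in the parameter list; it is the
    // object on which the method is invoked.
    public void Send(object message, int targetRank, int tag)
    {
        // ... transmit message to targetRank within this communicator ...
    }
}

class HaloSender
{
    static void SendHalo(CommunicatorSketch comm, double[] halo, int neighbour)
    {
        comm.Send(halo, neighbour, 0);   // cf. MPI_Send(buf, n, type, dest, tag, comm)
    }
}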

3.4.4 Process Identifiers

In the MPI specification, all process identifiers are integers. In McMPI, all Process parameters are opaque handle objects. There should be no need for publicly accessible integer identifiers in a full implementation because sub-classed Communicator variants, such as TorusCommunicator and HypercubeCommunicator, can supply the topological relationships between processes that the integers are usually used to calculate. However, as this is only a feasibility study, methods to expose the integer identifiers are still needed.


Chapter 4.

Designing and Building a Solution in C#

This chapter describes the implementation of the McMPI library in detail, including UML (Unified Modelling Language) diagrams, discussions of the key design decisions and explanations of selected source code fragments.

4.1 Object Interaction

The discussion of the MPI specification from a .Net perspective in Chapter 3, introduces several conceptual objects and some of the interactions between them. The UML sequence diagram in Figure 7 sets out a more formal and complete picture of the two central operations, sending and receiving a message. The diagram shows cooperating objects from two MPI processes, with the solid bar between the two RemoteController objects representing the network that connects them. All the objects and methods are within the MPI-like library; the arrows entering and leaving to left and right are method calls from user code. Two separate message transfers are shown, one for a late receiver and one for a late sender.

The two processes execute independently and will reach the send and receive methods at different times. It is almost certain that the Header from the sender will not arrive at the receiver at exactly the same time as the RecvRequest is created. The top half of Figure 7 represents the case where the send is issued first, known as a late receiver. The bottom half represents the other case, known as a late sender. All interactions are identical for both of these cases, apart from the matching step on the receiver process.

The Communicator class is the centre of the public interface for the MPI-like library, containing all the send and receive methods including barrier and all other collectives. The RemoteController class is the interface that each process presents to the other processes. The classes in between are the queues and other structures that store the requests at various stages. The Header class provides an identity for each send request that can be transmitted to another machine and matched with a RecvRequest from that process. The Request class is an opaque handle for a non-blocking method. It exposes a variety of wait methods and is sub-classed to produce both the SendRequest and RecvRequest classes. The Status class describes the result of a non-blocking method, including the values actually matched for wildcards.


Figure 7: UML Sequence diagram for non-blocking send and receive


The rendezvous communication protocol chosen for this MPI-like library involves sending a Header object that contains enough information for the matching stage but no message data. The Header object is programmer-defined but it is a value type with a fixed size and serialisation overhead so that sending it to the receiver process does not violate the MPI resource limit restriction. When matching takes place, the target process sends the header object back to the source process as a request for the message data, which is then sent to the target. Only two remotable methods are required, one to push the header from the source to the target and another for the target to request the message data. The first of these performs the request step of the rendezvous protocol. The second performs both the acknowledge step and transfer step; the input Header object parameter is the acknowledgement and the return type of the function transfers the message data back to the target. These methods are contained in the RemoteController class.

The eager communication protocol attaches the message data to the Header object that is sent to the target process. The send does not store the outgoing message in the SendPending queue and does not wait for the data request. When matching takes place, the target process does not need to request the data from the source process. Both the send and receive methods complete much earlier than in the rendezvous protocol.

4.1.1 Non-Blocking Send

The ISend method creates a SendRequest to encapsulate the details of the method call, including the object that the user wishes to send and the identity of the target process. It then generates a Header object from the SendRequest and passes it to the other process by calling a remoted method, i.e. the PushSendHeader method exposed by the RemoteController object, which resides in the other process and is connected by the remoting infrastructure. The SendRequest is placed into the SendPending dictionary with the Header as the key. Finally, the SendRequest is returned to the user code as a Request reference.

The Wait method simply waits for the release of the lock that is built into every Request object when it is created. After the lock is released it creates a Status object from private information inside the Request object and returns it to the user code. The Wait method treats each Request object in the same manner, whether it is a SendRequest object or a RecvRequest object.

The intermediate step on the sender is initiated by the receiver process. Once the matching step has been completed, the receiver pulls the data from the sender by calling the PullMessageData method in the sender's RemoteController via remoting. It passes the Header object it received in the earlier PushSendHeader method call so that PullMessageData can use it as a key in the SendPending dictionary. The key-value pair is removed from the dictionary and the SendRequest is then returned to the receiver process via the remoting infrastructure, which uses a callback event to release the completion lock once the data is no longer needed.


4.1.2 Non-Blocking Receive

The IRecv method is more complicated than the ISend method because its actions depend on whether or not a matching Header object has already arrived. In all cases the first task is to create a RecvRequest object that encapsulates the details of the method call, including any wildcard parameters.

In the late receiver scenario, the Header is already in the RecvBuffer queue. The IRecv method must search that queue for a Header that matches the RecvRequest it has just created. If it finds one then it removes it from the RecvBuffer queue, combines it with the RecvRequest and gives the result to a RecvPending thread.

In the late sender scenario, no matching Header will be found in the RecvBuffer queue and the RecvRequest must be stored in the RecvQueue queue. In both scenarios the IRecv method returns the RecvRequest to the user code as a Request reference.

The PushSendHeader method is similar to the IRecv method. It has a Header object from the sender process and must search the RecvQueue queue to determine if there is already a matching RecvRequest object. If one is found (late sender scenario) then the Header and RecvRequest objects are combined and given to a RecvPending thread. If no matching RecvRequest object exists yet (late receiver scenario) then the Header object is stored in the RecvBuffer queue.

A RecvPending thread gets the data for a matched RecvRequest by calling the PullMessageData remoted method, sending the Header object to the sender process, which returns the full SendRequest, including the message data. The message data is an identical copy, in the receiver process, of the object passed into the ISend method on the sender process. The variable passed into the IRecv is set to point to this newly received message object. Finally, the RecvPending thread releases the lock embedded in the RecvRequest to signal that receiving is complete.


4.2 Class Structure

Whereas section 4.1 concentrated on the object interactions necessary for the two key tasks, this section focuses on the overall structure of the MPI-like library. The UML static structure diagram in Figure 8 illustrates all the classes and their relationships. The important classes are described in detail in the following sub-sections.

Figure 8: UML Static Structure for the public interface of the McMPI classes


4.2.1 The Internal Storage Structures

There are three structures needed for temporary internal storage. The SendPending dictionary stores outgoing messages until the target process requests the data from the source process. The other two, RecvBuffer and RecvQueue, are both for the receiver. They use a linked list to store the unmatched incoming Header objects and the unmatched IRecv method calls, respectively. The C# source code for these structures is shown in Figure 9.

// temporary storage for messages awaiting target's pull request
static private Dictionary<SendRequest.Header, SendRequest> SendPending =
    new Dictionary<SendRequest.Header, SendRequest>();

// temporary storage for unmatched received message headers
static private LinkedList<SendRequest.Header> RecvBuffer =
    new LinkedList<SendRequest.Header>();

// temporary storage for unmatched IRecv function calls
static private LinkedList<RecvRequest> RecvQueue =
    new LinkedList<RecvRequest>();

Figure 9: Source code for the internal storage structures

The SendRequest, Header and RecvRequest objects are described in section 4.2.5 and form part of the McMPI library. The other classes mentioned in Figure 9 (Dictionary and LinkedList) are contained in the System.Collections.Generic namespace, which is part of the .Net class library. Generic types are denoted with angle brackets that contain type parameters. This syntax allows the compiler and programmer aids to check and maintain type safety without the need for duplicating common functionality or explicit type-casting.

The outgoing messages are stored in a Dictionary because the pull message data method needs a fast mechanism to retrieve a particular value using a key. The Dictionary is implemented as a hash table and so the access time is very close to O(1), depending on the quality of the hashing algorithm used for the key class. The Header class is a simple structure and the values of its fields define its identity, so the default, built-in hashing algorithm is very efficient.

The two receive structures use LinkedList because they must maintain the order of the elements and respect this order during searches. The LinkedList is implemented as a doubly linked list, which produces a double-ended queue. Both insertion and removal are O(1) operations and each element contains a reference to the next element, which facilitates an element-by-element search in the correct order.


4.2.2 The RemoteController Class

The minimum coding requirement for remoting (i.e. excluding configuration settings) is for the server program to contain at least one remotable object. The processes in an MPI library share a relationship that is much closer to peer-to-peer than client-server. So the requirement for an MPI-like library based on remoting is that every process must contain at least one remotable object. The RemoteController class is the template for all those remotable objects.

To be remotable, an object must be marshalled by reference, i.e. its class must inherit, directly or indirectly, from the System.MarshalByRefObject class. Also, each method in the class that is to be accessed remotely must be declared with the public access modifier. These restrictions tell the remoting infrastructure to transmit a reference for the object to remote callers rather than transmit the entire object. The reference is used to automatically create a proxy for the object at runtime in the caller's process. Accessing methods and properties on the proxy results in network communication to and from the actual remote object. The marshalling, networking and unmarshalling are syntactically transparent to the programmer, i.e. a call to a remote method is coded in exactly the same way as a local method.
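
A minimal remotable class, sketched here for illustration (channel registration and other configuration are omitted), therefore looks like this:

using System;

// Inheriting from MarshalByRefObject makes instances remotable: callers in
// other processes are given a proxy rather than a serialised copy.
public class RemoteControllerSketch : MarshalByRefObject
{
    // Only public members are reachable through the proxy.
    public void Ping()
    {
        Console.WriteLine("remote call received");
    }
}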

In accordance with the object interaction pattern described in section 4.1, the RemoteController class needs two remotable methods, PushSendHeader and PullMessageData. The C# source code for these methods is given in Figure 11 and Figure 10, respectively.

public SendRequest PullMessageData(SendRequest.Header header)
{
    // return the full SendRequest that matches the given header
    // NB: this will serialise the request and so call OnSerialised

    SendRequest request = SendPending[header];
    SendPending.Remove(header);
    return request;
}

Figure 10: Source code for the PullMessageData method

The PullMessageData method is the simpler of the two. It uses the Header object as a key in the SendPending dictionary to retrieve the SendRequest value, then removes that key-value pair from the dictionary and returns the SendRequest to the calling process. The complication is hinted at in the comments: the remoting system will serialise the SendRequest in order to transport it to the calling process. The SendRequest class has a method designated as a callback for the OnSerialised event; see section 4.2.5 for an explanation of this event.


The PushSendHeader method searches the RecvQueue, element-by-element, in the order that items were added to it, to determine if there is a RecvRequest (representing an IRecv method call) that can be matched with the incoming Header object. Enumerating the RecvQueue is not thread-safe, so a lock must be taken to avoid a race condition with the code that adds elements to this structure (see the discussion of the IRecv method in section 4.2.4). If no matching RecvRequest is found then the Header is stored in the RecvBuffer. Again, a lock is needed to avoid a race condition with the IRecv method.

public void PushSendHeader(SendRequest.Header header)
{
    LinkedListNode<RecvRequest> matchingIRecv;

    lock (RecvQueue)
    {
        matchingIRecv = RecvQueue.First;

        while (matchingIRecv != null)
            if (matchingIRecv.Value.Matches(header))
                break;
            else
                matchingIRecv = matchingIRecv.Next;

        if (matchingIRecv == null) // unmatched due to late receiver
            RecvBuffer.AddLast(header);
        else // matched with an existing IRecv
            RecvQueue.Remove(matchingIRecv);
    }

    if (matchingIRecv != null) // matched with an existing IRecv
        CombineMatched(header, matchingIRecv.Value);
}

Figure 11: Source code for the PushSendHeader method

If a matching RecvRequest is found then the Header and RecvRequest are combined by a helper function, called CombineMatched, which is shown in Figure 12. This function submits the task of pulling the data from the sender process to a separate background thread by adding it to the work item queue provided by the built-in ThreadPool. When a thread in the ThreadPool becomes available it will execute the PullData function, also in Figure 12, with the RecvRequest object as a parameter.


private static void CombineMatched(
    SendRequest.Header header, RecvRequest request)
{
    request.header = header;
    request.stage = StageConstants.Matched;
    ThreadPool.QueueUserWorkItem(
        new WaitCallback(PullData), request);
}

private static void PullData(Object state)
{
    RecvRequest request = (RecvRequest)state;
    SendRequest.Header header = request.header;
    Communicator comm = Communicator.GetByID(header.communicator);
    Process sourceProcess = comm.GetProcessByRank(header.source);
    RemoteController rController = sourceProcess.remoteController;
    SendRequest send = rController.PullMessageData(header);
    request.message.MessageData = send.MessageData;
    request.stage = StageConstants.Completed;
    request.completion.Set();
}

Figure 12: Source code for the CombineMatched and PullData functions

The PullData function entails an I/O operation that could receive a very large quantity of data. This is the reason for executing it on a separate background thread; the rest of the program in the receiving process can continue while the I/O operation takes place. There is an overhead involved with using threads, which will adversely affect the performance for small messages. This has been solved by introducing an eager protocol for small messages. The last line of the PullData function flags the ManualResetEvent lock referenced by the completion field, which releases any threads that were waiting for this receive to complete. Refer to the Wait method in section 4.2.5, for a more detailed discussion.

4.2.3 The Process Class

The Process class is an opaque wrapper for the RemoteController class. It represents any McMPI process, either local or remote. Whereas the integer identifier for a process may change from one communicator to another, each process is always represented by the same Process object. The integer identity within a particular Communicator for a known Process object can be found using the ProcessRank method on that Communicator object. The reverse is also possible using the GetProcessByRank method on the appropriate Communicator object.


4.2.4 The Communicator Class

The Communicator class is the centre of the public interface for the whole library. Its content can be divided into three categories: information, point-to-point and collectives.

The two key pieces of information provided by all communicators are size and rank. The Size is simply a property that returns the number of processes that belong to the communicator. The Rank property returns an integer identifier for the local process within the communicator. If, as discussed in section 3.4.4, the integer identifiers can be deprecated in a fully object-oriented implementation, then this property should be replaced with one that returns a reference to the Process object that represents the local process. Such a property has been implemented in the Communicator class but is currently hidden from the public interface.

The MPI specification includes a wide variety of point-to-point message functions. The functions can be blocking or non-blocking, the operation mode can be synchronous, buffered, ready or standard and the protocol is usually eager or rendezvous depending on message size.

Any of the blocking functions can be thought of as the corresponding non-blocking function combined with a call to a wait function for the single request returned by that non-blocking function. The minimal scope for this feasibility study therefore includes both non-blocking and blocking functions.

The 'standard send' operation mode actually means the MPI library is free to choose either buffered or non-buffered (called synchronous) mode, for example depending on the size of the message and the amount of buffer space attached. The 'ready send' mode is, in practice, only used when the physical architecture supports it by supplying hardware synchronisation or active network hardware that processes incoming messages without interrupting the CPU. The 'buffered send' takes a copy of the message into some buffer space, then sends that copy using a synchronous send. The minimal scope for this study only includes a non-buffered synchronous send mode.

An eager protocol includes all the data for the send in the initial protocol message, which breaks the resource limit in the MPI specification for arbitrary sized messages. A rendezvous protocol must be employed, at least for larger messages, and is legal for all message sizes. The minimal requirement therefore only includes a rendezvous protocol, despite the performance degradation this will inevitably entail for small messages. An eager protocol would be able to avoid one of the two network round trips and the step of enqueuing a task into the thread pool and may reduce the small message latency by half with a corresponding increase in perceived bandwidth. As this potential optimisation effect can be approximately predicted, it is sufficient to only include the rendezvous protocol, initially, and add the eager protocol in a later development cycle.

The ISend method creates a SendRequest from the supplied parameters, adds it to the SendPending dictionary, generates a Header object, which it sends to the target process and then returns the SendRequest as a Request reference to the calling code. It locks the SendPending dictionary during the add operation to avoid interference from the remove operation in the PullMessageData method.
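
A simplified sketch of this sequence (the member names follow those used elsewhere in this chapter, but this is not the exact McMPI source):

public Request ISend(object message, Process target, int tag)
{
    // Illustrative constructor and helper; the real signatures may differ.
    SendRequest request = new SendRequest(message, target, tag);
    SendRequest.Header header = request.CreateHeader();

    // Guard against the concurrent Remove in PullMessageData.
    lock (SendPending)
    {
        SendPending.Add(header, request);
    }

    // Remoted call: deliver the header to the target process.
    target.remoteController.PushSendHeader(header);

    return request;
}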


The code for the IRecv method is identical to the code presented for the PushSendHeader method in Figure 11, except that the roles of the two LinkedList objects are exchanged. That is, the IRecv method searches the RecvBuffer list and then either adds to the RecvQueue list or calls the CombineMatched function.

All the collective operations can be built out of point-to-point messages. Some savings can usually be achieved by short-circuiting parameter checks for internal method calls. One of the intentions of this feasibility study is to ascertain whether completing and optimising an MPI-like library, following the pure .Net route, is worth the development effort it will involve. This aim can be achieved without coding any collective operations at all, but a synchronisation barrier is useful for timing and the image processing demonstration code requires an AllReduce operation to test for convergence to the solution. Both of these have been coded using a hypercube algorithm that exchanges point-to-point messages with a nearest neighbour in each successive dimension. This algorithm assumes the physical interconnect between processes has a high degree of direct connectivity but, if that requirement is met, it has the potential to scale well, i.e. it completes in O(log p) communication steps for p processes. Both the Barrier and AllReduce operations typically send very small messages and would benefit from an eager protocol but again that is out of scope for this feasibility study.
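
The exchange pattern can be sketched as follows (illustrative only; it assumes the number of processes is a power of two and uses a delegate in place of the real point-to-point calls):

using System;

static class HypercubeSketch
{
    // In each of the log2(size) dimensions, exchange a small message with
    // the neighbour whose rank differs in exactly one bit, giving O(log p)
    // communication steps in total.
    public static void Barrier(int rank, int size, Action<int> exchangeWith)
    {
        for (int bit = 1; bit < size; bit <<= 1)
        {
            int partner = rank ^ bit;   // nearest neighbour in this dimension
            exchangeWith(partner);      // e.g. a Sendrecv of an empty message
        }
    }
}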

4.2.5 The Request and Status Classes

The Request class is an abstract super-class for the SendRequest and RecvRequest classes. It contains the fields, methods and properties that are common to all types of Request object. The SendRequest class encapsulates the information supplied in a call to the non-blocking, point-to-point send method of the Communicator class. The RecvRequest class holds the information supplied in a call to the non-blocking, point-to- point receive method of the Communicator class but also captures the Header object information from the matching send.

The Request super-class implements the various 'wait' functions specified in MPI by delegating to the lock wait operations from the built-in System.Threading.WaitHandle class, i.e. WaitOne, WaitAny and WaitAll. The WaitSome function is not built-in but it is legal in MPI for it to use WaitAny internally. The MPI 'test' functions are also included in the Request super-class and simply check a private information flag held inside each Request being tested. The Request class must be marked with the Serializable attribute because one of its sub-classes, SendRequest, is serialised by remoting when sending the message data from the source process to the target process. However, all of the fields in the Request class are marked with the NonSerialized attribute to prevent them being included in the data stream, which reduces the serialisation overhead and will fractionally decrease the transfer time.

The SendRequest sub-class is complicated by the serialisation that allows remoting. It incorporates the definition of the Header structure, which is also serialisable. However, the Header structure is very simple, containing only a small number of well-defined fields that specify integer identifiers for the communicator, source process, target process, and tag as well as which collective operation it is part of, if any, and a unique identifier for the request itself. This structure only requires the Serializable attribute to enable fully automatic handling by the remoting infrastructure. On the other hand, the
sending process must be able to determine when the remoting infrastructure has finished with the SendRequest, i.e. when it has been serialised. This is problematic because the method that causes the serialisation does so when it returns the SendRequest to the receiving process that is pulling the message data, so there is no opportunity in the source process to do necessary tasks such as set the 'complete' flag and release the completion lock. This can be solved by getting the object that does the serialisation to raise an event in the source process when it has finished serialising. If a method is included in the SendRequest class and marked with the OnSerialized attribute, then it will be treated as a callback and invoked once all the data in the object has been read and copied. Note that this timing ensures that the lock remains in place while the data is needed but it may result in the completion lock being released before the data has arrived at the target process. This is the perfect timing for an MPI send function because the semantics of the send dictate that it is considered complete as soon as the message data can be modified without affecting the result.
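
A sketch of the relevant parts of such a class (simplified, with illustrative member names) is:

using System;
using System.Runtime.Serialization;
using System.Threading;

[Serializable]
public class SendRequestSketch
{
    // Excluded from the data stream; only meaningful in the source process.
    [NonSerialized]
    private ManualResetEvent completion = new ManualResetEvent(false);

    public byte[] MessageData;   // the payload that is actually serialised

    // Invoked by the serialisation infrastructure once all the data in this
    // object has been read, i.e. once the message can be modified safely.
    [OnSerialized]
    private void OnSerialisedCallback(StreamingContext context)
    {
        completion.Set();
    }
}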

The RecvRequest sub-class includes the matching function that determines if the RecvRequest object can be satisfied by a particular incoming Header object. It takes account of wildcards for both the tag and source process fields but is otherwise quite straightforward. This sub-class is not marked with the Serializable attribute and so cannot be serialised. This means that, if the user program attempts to serialise the Request references returned by this library, then those that refer to RecvRequest objects will fail and those that refer to SendRequest objects will appear to succeed but will actually set the complete flag and release the completion lock, possibly before the target has asked for the data. Attempting to serialise any Request reference returned by this library is therefore declared to be illegal and the result of such an attempt is unspecified.

The Status class provides a summary of the status of a request in the same way as the status structure in the MPI specification. It is useful after a wait method completes, especially one that waits for multiple requests. Each Request object provides its own Status object as a property, rather than the wait methods taking a list of Request objects as an input parameter and creating a separate list of Status objects as its output.

4.3 Configuration, Start up and Shutdown

As this project is a feasibility study, only a basic mechanism for configuration, start up and shutdown has been implemented, as briefly described in the following sub-sections.

4.3.1 The Node Manager Utility

The Node Manager utility runs on all machines involved in the cluster. Each instance reads a configuration file, which holds details of all the machines that will host McMPI processes, then takes instructions from the command line, sends those instructions to the Node Manager on each remote machine and starts local processes.


4.3.2 The Initialise Method

The initialisation of McMPI starts by obtaining the values of three environment variables set by the Node Manager utility and reading the configuration file. Together these provide enough information for the process to set up remoting connections to all the other processes. Each connection is wrapped in a Process object and a default communicator object, called World, is created and populated with all the Process object references. The final step is to wait one second to give all processes a chance to start and then perform a synchronisation barrier. This is not an ideal solution because if one or more processes are delayed in starting then the barrier will cause a fatal exception in other processes when the connection to the late process cannot be made.

4.3.3 The Finalise Method

The finalisation of McMPI simply performs a collective synchronisation barrier so that all processes must get to the finalisation stage before any of them exit. There are no single-sided communications in this version, so this is sufficient if the user code is legal according to the MPI standard and well-behaved.

4.4 Summary

In addition to the initialisation and finalisation methods and the informational properties, the communication methods implemented in this initial version of McMPI are summarised in Table 1. Both an eager and a rendezvous protocol are included.

Method Name   Description
Initialise    Connects all processes and creates Communicator.World
Finalise      Tears down the McMPI library
Rank          Returns the rank of the current process in the communicator
Size          Returns the number of processes in the communicator
IRecv         Non-blocking point-to-point receive
ISend         Non-blocking point-to-point standard mode send
ISsend        Non-blocking point-to-point synchronous mode send
Wait          Wait for one request to complete
WaitAll       Wait for all requests in a supplied list to complete
WaitAny       Wait for any request in a supplied list to complete
WaitSome      Wait for at least one request in a supplied list to complete
Recv          Blocking point-to-point receive
Send          Blocking point-to-point standard mode send
Sendrecv      Blocking point-to-point send and receive combined
Ssend         Blocking point-to-point synchronous mode send
AllReduce     Collective reduction where all processes get the result
Barrier       Collective synchronisation primitive
Bcast         Collective broadcast from one process to all others

Table 1: Communication methods implemented in McMPI so far


Chapter 5.

Testing

In order to test the MPI-like library described in the previous chapters, two sample applications were built using both the new library and MPI.NET: the ever-useful HPC utility 'Ping Pong' and the ubiquitous HPC first application 'Image Processor'.

5.1 Proving Correctness using Ping-Pong Duplex

The Ping Pong Duplex utility is designed to test the functionality of the McMPI communication layer. It sends simple messages from a source to a target with replies being sent back from target to source. In particular, this version of the utility can operate in three modes.

Ordered – the tags for the sends and receives are incremented with each method call on both processes. The messages should be received in the same order as they were sent because a receive that specifies a particular tag value cannot match a send with a different tag value. All the non-blocking sends in the batch are issued before any receives are issued. Then each receive is issued and waited for in turn. As each ping receive completes (i.e. when the wait method returns) the message data is parsed and a response message is generated and sent back to the source, using a non-blocking send. Once all pings have been received and responded to, receives for the pong messages are issued, again with incrementing tags to force the ordering.

Reversed – similar to Ordered but the tags for ping receives are decremented to force the messages to be received in reverse order. This represents the pathological worst case for the matching algorithm as all outstanding incoming pings (except the last one to arrive) must be considered and rejected when matching each receive. Note that, whilst the ping receives are issued with tags in the 'wrong' order, the pong receives are not, i.e. they are incremented with each call exactly as in Ordered mode. As the pong messages are sent in the same order as the pings are received, this means that the pong messages are also matched in reversed order.

Interleaved – each ping-pong message pair is dealt with separately and the whole cycle is repeated as many times as required by the batch count. Only one send per process is outstanding at any one time. The utility actually performs a ping-ping followed by a pong-pong, i.e. both processes send their ping message and then wait to receive the other's ping message. They both then parse the received ping and send an appropriate pong message. Once the wait to receive the pong message has completed, the cycle is repeated up to the batch count.

The Ordered and Reversed modes are designed primarily to check for correct operation of the library functions. The Interleaved mode times a series of messages but is largely superseded by the Ping Pong Simplex utility described in section 5.2.


5.2 Investigating Performance using Ping-Pong Simplex

The Ping Pong Simplex utility is designed to test the basic performance of the McMPI library in comparison to the MPICH library accessed via the C# bindings of MPI.NET.

For the McMPI tests, a parameter was set to force the use of a particular protocol, so that times for both eager and rendezvous could be obtained for all message sizes.

For the MPI.NET tests, serialisation of the message object and memory pinning and unpinning of the serialisation buffer are performed in user code for each message. This approach for testing MPI.NET represents the same level of generality available in McMPI, i.e. a different, possibly complex, object between heterogeneous system architectures in each message. In reality, most scientific codes would only send a small number of very simple objects between homogeneous system architectures during the code hot-spot, e.g. a halo region consisting of a 1-dimensional array of double-precision values from one big-endian machine to another. This limitation allows the serialisation to be avoided altogether, as demonstrated in the Image Processor code described in section 5.4. It also means the memory pinning and unpinning could be performed once only, and outside the hot-spot, although this is not done in the Image Processor code.

The application sends messages of a particular size, i.e. a System.String object containing a particular number of 1 byte characters, around a ring of all processes and measures the average round-trip time for a batch of 10 round-trips. This test is repeated 100 times for each message size.

The minimum, maximum, mean and standard deviation of the average batch times, as well as the effective bandwidth, at each message size were calculated and are in Appendix B as Table 2 for the 100Mbit/s Ethernet connection and Table 3 for the 1Gbit/s Ethernet connection. The results are plotted in Figure 13 and Figure 14, which show the round-trip times for small messages only, in Figure 15 and Figure 16, which include the round-trip times for all tested message sizes, and in Figure 17 and Figure 18, which show the effective bandwidth for all the tested message sizes. The data points are the mean values. There are no error bars because the standard deviation is an order of magnitude smaller than the mean values for almost all of the data points.

Note that the MPI.NET values are labelled MPICH in the plots for this section and for section 5.4 for two reasons. The first is that it avoids confusion between a pure .Net library without ".Net" in its name and a non-.Net library accessed via a .Net interface. Secondly, the C# bindings, used in both the Ping Pong Simplex code and the Image Processor code, are closer to MPICH than the full MPI.NET library implementation, which wraps the C# bindings with an object layer. The C# bindings are used because they are readily available (37) whereas the MPI.NET object layer is not.

All timed tests were done using the hardware described in Appendix B by Table 6.


Figure 13: Ping Pong round-trip time for small messages – 100Mbit/s

Figure 14: Ping Pong round-trip time for small messages – 1Gbit/s


Figure 15: Ping Pong round-trip time for all message sizes – 100Mbit/s

Figure 16: Ping Pong round-trip times for all message sizes – 1Gbit/s


Figure 17: Ping Pong effective bandwidth – 100Mbit/s

Figure 18: Ping Pong effective bandwidth – 1Gbit/s


5.3 Discussion of Ping Pong Simplex Results

The Ping Pong Simplex results show that the latency of McMPI point-to-point messaging is poor when compared to MPI.NET and restricts the benefit that can be achieved by using a fast network connection. For the 100Mbit/s Ethernet network, the average round-trip time for a 100 byte message using the McMPI rendezvous protocol is 5.31±0.22ms but the MPI.NET time is over an order of magnitude better at 0.492±0.077ms, although the McMPI eager protocol fares a little better at 2.01±0.090ms, i.e. only four times slower. For the 1Gbit/s Ethernet network, the difference is even greater with the MPI.NET time of 0.328±0.111ms easily beating both the McMPI eager time of 1.83±0.18ms and the rendezvous time of 4.75±0.15ms. MPI.NET shows approximately a 33% improvement in latency when using the faster network but latency in McMPI only improves by approximately 9% for the eager protocol and 11% for the rendezvous protocol.

The main reasons for the large difference in latency are the software stack of the .Net remoting infrastructure and the fact that MPICH actively polls when waiting for a request whereas McMPI uses synchronisation locks to notify waiting threads. The time taken by a minimal remoting method, i.e. one that takes no parameters, contains no code and does not return a value, was measured on both networks and compared to the remoting method used in the McMPI eager protocol. The results are shown in Appendix B as Table 4. On the 100Mbit/s network, the minimal method takes 610µs whilst the eager protocol method takes 1202µs (both are an average of 1000 calls in a tight loop with the timing code outside that loop). On the 1Gbit/s network, both are slightly faster at 510µs for the minimal method and 1018µs for the eager protocol method. Note that the time for the rest of the eager Send method was measured to be about 38µs with creation of the SendRequest object accounting for the majority of that time, at 29µs. This reveals that improvements to the eager protocol, e.g. only creating a Header object instead of the full SendRequest object, cannot exceed a two-fold decrease in the small message times without replacing the remoting method with another inter-process communication method, such as using sockets directly. Active polling may allow the MPICH library to respond more quickly because it avoids the context switch to an idle thread, which is implied in the McMPI design. However, this effect was not investigated because the remoting paradigm does not support active polling for incoming messages.

For the 100Mbit/s network, all three protocols tested asymptotically approach a high percentage of the available bandwidth. The MPI.NET library achieves significantly higher effective bandwidth for all message sizes up to 100,000 bytes. For larger messages McMPI boasts marginally better figures. The difference is likely to be due to an extra memory copy, although this has not been confirmed. The process of pinning an object into memory is not well-documented and may involve copying the data for the object from memory managed by the .Net garbage collector to unmanaged memory. As this pin operation is only performed for the external calls to MPICH, this would explain the small disparity in the upper bandwidth limit. As already stated earlier in this chapter, in some codes it would be possible to move the pin and unpin operations outside the main processing loop and thereby reduce, or even eliminate, any disadvantage it entails.


For the 1Gbit/s network, the MPI.NET library noticeably out-performs both protocols from the McMPI library, although it only peaks at 259Mbit/s for messages of size 2,000,000 bytes, i.e. about one quarter of the available bandwidth. Both of the McMPI protocols asymptotically approach 166Mbit/s, which is one sixth of the available bandwidth. Whereas the McMPI bandwidth increases fairly smoothly, the MPI.NET bandwidth shows much greater sensitivity to message size and the last few data points appear to indicate an interesting downward trend. Previously published results for MPI.NET (1) do not show either the variability or the downward trend. The cause of these features is not known.

The McMPI eager protocol is consistently faster than the McMPI rendezvous protocol but its advantage decreases for larger message sizes as the data transfer time begins to dominate the difference in latencies. The MPICH library chooses to switch from its eager protocol to its rendezvous protocol at 100,000 bytes, by default. The effective bandwidth plots suggest that this switch should be at 1,000,000 bytes for McMPI because below this size the two protocols begin to rapidly diverge.

5.4 Demonstrating Real-World Code using Image Processor

The Image Processor code essentially reverses a simple edge-detection algorithm. The series of steps necessary to retrieve the original image from the edge-detection data is an iterative process with a strong resemblance to the solution of a partial differential equation using the Jacobi algorithm. The time taken to complete the number of iterations needed to converge to a solution within a given tolerance can be reduced by sharing the workload amongst multiple processes. This is achieved by a simple domain decomposition technique, i.e. the image data is divided into chunks that are distributed to the processes and each performs the required steps on the data it 'owns'. The five-point stencil requires data from the edge of the neighbouring chunks (halo regions) to be exchanged at each time step.

The application works in one of three modes, chosen by specifying the MPI library to use on the command line: NONE, McMPI or MPICH. The serial version, i.e. when NONE is specified, uses Gustafson scaling rather than Amdahl scaling, i.e. it assumes that the parallel versions will be executed using two processors that share the workload equally, and so it only processes half of the image. Making the workload per processor per iteration the same in all versions ensures that any differences in the measured time per iteration are due to communication overheads only. This removes effects that are due to serial scaling issues, such as when the whole image does not fit into cache but half the image does, which would otherwise skew the results.

As the MPI I/O section of the MPI standard is out of scope for McMPI and the key part of the Image Processor code is the main loop, it was deemed expedient to use a very naïve I/O strategy. The master process (the one with a rank of zero) simply loads the image from disk, performs the edge-detection algorithm and then broadcasts the entire edge-detected data array to all other processes. Each process then decomposes the very large array and copies its chunk into a smaller array, ready for processing. Once the processing has been completed, each process sends its chunk back to the master process by copying back into a very large array and then participating in a reduction operation.


This is very inefficient and it restricts the size of image that can be processed because both one large and one small array must fit in the memory available to each process. However, it is sufficient for this demonstration application code.

The testing was done with only two processes, so the decomposition technique is very simple. The aim of any decomposition is to balance the computational load and minimise the communication overhead. This application therefore uses 1-D decomposition along the longest image dimension, so that the chunks are of equal size and the halo region is only as long as the short image dimension. The code also transposes the image data, if necessary, to ensure that the halo regions are contiguous in memory, which increases the efficiency of packing and unpacking those data elements.

The main processing loop consists of an exchange of halo regions with nearest neighbour processes, an iteration of the de-edge algorithm over the entire local chunk and a check for convergence to a solution and other stopping criteria, i.e. the maximum number of iterations allowed. The C# language facilitates combining code from several source code files by allowing each piece to be labelled as a partial class. This allows the parallelisation code to be easily separated from the main application code, which increases the readability and maintainability of both parts of the application.
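
For illustration, the class might be split across two files along these lines (the file and member names are hypothetical):

// File: ImageProcessor.Compute.cs -- the serial numerical kernel
public partial class ImageProcessor
{
    private double[,] local;   // this process's chunk of the image

    private void IterateChunk()
    {
        // ... apply the five-point stencil to every element of local ...
    }
}

// File: ImageProcessor.Parallel.cs -- the parallelisation code kept separate
public partial class ImageProcessor
{
    private void ExchangeHalos()
    {
        // ... pack, send, receive and unpack the halo rows ...
    }
}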

The halo exchange is achieved by copying the edge of the local chunk into a 1-D array, sending it to the neighbour process, receiving a similar 1-D array from that process and copying the contents into the local chunk. For the McMPI version, the Sendrecv method combines the two operations in a manner that ensures no deadlock occurs. In the MPI.NET version, the Sendrecv method causes a MemoryAccessViolation exception, even with valid parameters, so the operations are performed separately using Send then Receive on processes with an even rank and Receive then Send on processes with an odd rank.
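
A sketch of that even/odd ordering (using delegates to stand in for the actual library send and receive calls):

using System;

static class HaloExchangeSketch
{
    // Deadlock-free replacement for Sendrecv: even ranks send first and
    // then receive; odd ranks receive first and then send.
    public static double[] Exchange(
        int rank,
        double[] sendHalo,
        Action<double[]> sendToNeighbour,
        Func<double[]> receiveFromNeighbour)
    {
        if (rank % 2 == 0)
        {
            sendToNeighbour(sendHalo);
            return receiveFromNeighbour();
        }
        else
        {
            double[] received = receiveFromNeighbour();
            sendToNeighbour(sendHalo);
            return received;
        }
    }
}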

The check for convergence involves computing a delta value that quantifies the amount of change between the previous iteration and the one just calculated. This is achieved by computing the partial delta locally in each process and then combining these partial delta values using an AllReduce operation that returns the result to all processes. The global delta is then compared to a target tolerance value and the main loop exits if the desired tolerance has been achieved.
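A sketch of the convergence test is given below. The maximum-difference measure is one possible choice of delta, and the AllReduce call and Operation.Max name are illustrative signatures, since the exact form used in the demonstration code is not reproduced here.

    // Local change since the previous iteration (one possible measure of delta).
    double localDelta = 0.0;
    for (int i = 0; i < newChunk.Length; i++)
    {
        localDelta = System.Math.Max(localDelta, System.Math.Abs(newChunk[i] - oldChunk[i]));
    }

    // Combine the per-process values so every rank sees the same global delta.
    double globalDelta = comm.AllReduce(localDelta, Operation.Max);

    // Exit the main loop when converged or when the iteration limit is reached.
    bool stop = (globalDelta < tolerance) || (iteration >= maxIterations);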

Three sizes of image (small: 75x50 pixels, medium: 600x352 pixels and large: 2000x1000 pixels) and two levels of precision (single: 4 bytes and double: 8 bytes) were investigated to explore the overheads of both communication-dominated and computation-dominated codes. The tests were also run using the 100Mbit/s and 1Gbit/s Ethernet connections to investigate the advantage, if any, of a faster network.

The results for all three images and for both single and double precision are tabulated in Appendix B as Table 5 and displayed in Figure 19 and Figure 20 for the small image, Figure 21 and Figure 22 for the medium image and Figure 23 and Figure 24 for the large image. Figure 25 also illustrates the communication overhead for each of the parallel tests, obtained by subtracting the appropriate serial computation time.

All timed tests were done using the hardware described in Table 6 of Appendix B.


Figure 19: Image Processor times – small image, single precision

Figure 20: Image Processor times – small image, double precision


Figure 21: Image Processor times – medium image, single precision

Figure 22: Image Processor times – medium image, double precision


Figure 23: Image Processor times – large image, single precision

Figure 24: Image Processor times – large image, double precision


Figure 25: Communication overheads in the Image Processor code


5.5 Discussion of Image Processor Results

The first observation that can be made from the Image Processor timings is that the overheads for the McMPI library are much larger than those for the MPI.NET library. The smallest measured communication overhead for McMPI is 2.85ms, for the small image using double precision on the 1Gbit/s network. This is larger than the largest measured overhead for the MPI.NET library, which is 2.34ms, for the large image using double precision on the 100Mbit/s network.

A sizeable communication cost becomes relatively less important for a computation dominated code, as can be seen by comparing the results for the large image using single precision, in Figure 23, and the small image using single precision, in Figure 19. This suggests that McMPI is not suited to closely-coupled, communication intense algorithms but may be more acceptable for loosely-coupled codes with large blocks of computation between message transfers.

The McMPI overheads for the large image using double precision are 13.69ms and 10.11ms for the 100Mbit/s and 1Gbit/s networks, respectively. These figures are almost exactly twice those for the same image using single precision, i.e. 6.34ms and 5.08ms for the 100Mbit/s and 1Gbit/s networks, respectively. This indicates that McMPI has already reached its maximum effective bandwidth for this code at this image size. In comparison, the MPI.NET overheads increase by just 13%, from 2.07ms to 2.34ms, for the 100Mbit/s network and by just 5.2%, from 1.68ms to 1.76ms, for the 1Gbit/s network. This is initially unexpected because the Ping Pong Simplex results point to approximately a 17% increase in transfer time from a 4,000 byte message (halo region of 1,000 single precision numbers) to an 8,000 byte message (halo region of 1,000 double precision numbers) for McMPI.

However, there is a significant difference between the two codes: serialisation. In the Ping Pong Simplex code, the message data, a System.String object, must be serialised before being sent by MPI.NET and de-serialised after being received. In the Image Processor code, serialisation does not happen because MPI.NET supports sending the 1-D array of floating-point numbers containing the halo data without conversion. This is an optimisation compromise that is only applicable to homogeneous processes that store data in memory using exactly the same format. Serialisation is performed by .Net remoting irrespective of data type, so the McMPI code suffers an additional overhead, which becomes the limiting factor in the maximum achievable bandwidth. The difference between the 17% prediction, based on Ping Pong Simplex, and the measured 100% increase, is due to the differing serialisation efficiency of the data types used in the two codes.

As noted in Chapter 6, further work to replace .Net remoting with .Net sockets is highly desirable. It would also allow this additional serialisation overhead to be removed for specific simple data types, such as those supported by MPICH and MPI.NET, whilst keeping the serialisation for the general case of complex objects or processes with heterogeneous memory storage formats.


Chapter 6.

Further Work

6.1 Reducing Latency

Due to its very high latency overheads, in its current form the McMPI library is only useful for loosely-coupled, coarse-grained parallel applications or those that only transfer large messages containing complex objects. If any further work is to be worthwhile then the latency must be reduced. This could be achieved by using .Net socket objects directly, i.e. not via .Net remoting, in the same manner as the C source code for MPICH.

The proof of concept code, created in preparing for this project¹, showed a 6-fold decrease in latency when using sockets compared to remoting, which indicates that significant performance gains should be possible.

Exposing the socket object inside the MPI library also has several other useful benefits:

It allows the serialisation overhead to be eliminated for simple data types when exchanged between similar machine architectures but retained for other cases. This is the same optimisation made in Pure Java MPI (38), where the DataOutStream is modified to replace serialisation with a single-byte tag indicating the byte-order.

It allows the size of the message object to be determined without increasing the overhead. For simple data types the size can be determined easily, e.g. using the built-in sizeof operator for value types and the Length property for arrays. For complex objects, the approach is to serialise and examine the Length property of the resulting memory stream. In contrast to .Net remoting, the memory stream can be passed to the NetworkStream interface of a .Net socket without being re-serialised.

It allows active polling for incoming messages using background threads, which may decrease the response time of the receiving process and further reduce latency.
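The second of these benefits can be illustrated with standard .Net classes. The sketch below is not McMPI code; the host name and port are placeholders and the class is only an example of the technique.

    using System;
    using System.IO;
    using System.Net.Sockets;
    using System.Runtime.Serialization.Formatters.Binary;

    static class MessageSizeSketch
    {
        static void Send()
        {
            // For a simple array the size is known without any serialisation.
            double[] halo = new double[1000];
            int haloBytes = halo.Length * sizeof(double);            // 8,000 bytes
            Console.WriteLine("halo payload: {0} bytes", haloBytes);

            // A complex object is serialised once into a MemoryStream; Length gives the
            // size for the message header, and the same buffer can then be written
            // straight to the socket's NetworkStream without being re-serialised.
            object message = new string[] { "complex", "payload" };
            MemoryStream buffer = new MemoryStream();
            new BinaryFormatter().Serialize(buffer, message);
            Console.WriteLine("serialised payload: {0} bytes", buffer.Length);

            using (TcpClient client = new TcpClient("remotehost", 9000))  // placeholder endpoint
            {
                buffer.WriteTo(client.GetStream());
            }
        }
    }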

Unfortunately, replacing .Net remoting with .Net sockets would also require the creation of a shared-memory implementation for all the communication methods. The C source code for MPICH could be used as a reference implementation to inform the design and speed up the development.

¹ That is, during the Project Preparation module of this MSc. A copy of the source code with the accompanying report that describes the results is available on request to: [email protected].


6.2 Testing Support for InfiniBand and Myrinet

One of the advantages of MPICH is its support for InfiniBand and Myrinet interconnects through the use of an abstract device interface. The recently released Microsoft Windows Compute Cluster Server (CCS) introduces support for these interconnect technologies using a new operating-system component model, as shown in Figure 26.

Figure 26: Winsock Direct topology in Windows Compute Cluster Edition (39)

Both Microsoft MPI (MS MPI) and MPICH, on which it is based, make direct calls to winsock.dll, and the .Net socket object is a managed-code version of that same winsock.dll. This is true on all Windows machines, but on CCS winsock.dll delegates to winswitch.dll (called Winsock Direct), which switches traffic to RDMA drivers, where available, depending on the subnet address.

Comparative testing of MS MPI, MPICH and McMPI (in its current form or, preferably, modified to use .Net socket objects directly) on CCS with a dedicated MPI network, as in Figure 27, would verify whether these libraries can take full advantage of high performance interconnect technologies.


Figure 27: Typical network topology for a CCS cluster (40)

6.3 Increasing Portability

One important and expected benefit of using .Net, namely portability, could not be adequately verified during this project. The cross-language portability is assured by the common nature of the Intermediate Language binary code. However, in order to execute on non-Windows operating systems the McMPI library must be binary-compatible with other .Net runtime environments. This is not a matter of porting the code, i.e. of producing a separate version of the source code that is compatible with Mono or Portable.Net, but rather of restricting the McMPI library to use only classes from within the subset of the built-in classes that are provided by all .Net runtimes. All the current .Net runtimes include some classes that the others do not; truly portable classes are those provided by all implementations of .Net, as shown in Figure 28.


Figure 28: Truly portable classes are a subset of all the .Net runtimes


6.4 Scaling to Large Parallel Systems

As the development and testing of McMPI has so far been restricted to a pair of single-core, single-processor machines, there has been no opportunity to test scaling beyond two processors, either in a distributed memory system (e.g. a cluster) or on an SMP machine. This could be combined with testing non-Ethernet network technologies on a Microsoft Windows Compute Cluster, or with testing a modified version that is compatible with Mono or P.Net and, therefore, with non-Windows operating systems.

6.5 Completing McMPI

If attempts to solve the latency issues with the current version of McMPI prove favourable, then the other parts of the MPI standard, such as buffered and persistent sends, MPI I/O and single-sided communications, must be implemented before McMPI can be considered complete.

6.6 Building Remoting from MPI

The largest contributing factor to the high latency of McMPI is the time taken by the remoting infrastructure. Even a minimal remote method, which takes no parameters, contains no code and returns no value, takes 0.5ms to execute, which is twice the time for a round-trip using 100-byte messages in MPICH. It would be interesting to look at re-engineering remoting using MPICH techniques. Remoting abstracts remote method calls as messages, transmitted by a transport sink class. It may be possible to create a new transport sink class that reduces the latency, hopefully not at the expense of the high bandwidth exhibited by the built-in implementation. This could benefit all applications that use remoting directly and those that use technologies based on it, such as WCF (Windows Communication Foundation).


Chapter 7.

Conclusion

The main focus of this project was to create an MPI-like library, written in C# and using .Net remoting, and to assess its suitability for high performance programming. This has been successfully achieved and leads to the conclusion that it is possible to create an MPI-like library in this manner. The re-use of many built-in .Net features, such as queues, lists, locks, threads and delegate functions, significantly aided the development and achieved very good performance. However, overall performance of McMPI, in its current form, is restricted by the remoting infrastructure and by the serialisation process that it requires.

The over-arching goal of assessing the feasibility of a pure .Net MPI library has been partially achieved. This project has proven that .Net remoting introduces too great an overhead to realise the low-latency communications demanded by HPC applications. Further work is needed to investigate the other major implementation possibility within .Net, namely using sockets directly.

This project has shown that, despite the latency overheads, it is possible for McMPI to compete favourably with MPI.NET in certain specific circumstances although in most real-world scenarios this is unlikely to be the case. The high bandwidth achieved with the 100Mbit/s Ethernet link, and the obvious impediment of the remoting infrastructure, together indicate that there is great potential for McMPI, if that impediment is removed. The speed and ease of development during this project, which has allowed a significant portion of the MPI standard to be implemented in a short period of time, demonstrates the significant productivity advantage of modern object-oriented languages with integrated, extensive, general purpose, re-usable class libraries.


Appendix A.

Project Plan

The plan used for estimating the project duration and informing decisions regarding scope is reproduced in Figure 29. Although the plan shows a strict 'waterfall' methodology (where each task is completed before moving on to the next), the work was, as intended, actually done iteratively (where tasks may overlap or be restarted or revisited). The final design is heavily influenced by the experience gained in implementing previous design iterations.

An example of an earlier task being revisited later in the project is the locking of the RecvQueue to prevent deadlocks caused by a race condition between the incoming Header and the matching IRecv method call. The initial design did not contain any protective locks. Deadlocks were noticed in the code built according to this initial design and investigated. An intermediate version introduced two locks (one on RecvQueue and one on RecvBuffer) in each of the pieces of code involved in the race condition, which prevented one thread adding a new object to a list while another was searching that list. This reduced the frequency of the deadlocks but did not eliminate them altogether. Further investigation revealed that one thread could search its list and then, before it had added its object to the other list, the second thread could search that list. Neither thread would find the matching object, so both would add their objects and a deadlock became inevitable. This was solved by locking only one of the lists, RecvQueue, during both the search and the add operation in each thread, which both protects the lists from simultaneous access and also ensures that the deadlock condition cannot occur.
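The final scheme can be sketched as follows. The type names and helper methods are simplified stand-ins for the McMPI internals; the essential point is that both threads search and then add under the same lock on the receive queue.

    // Simplified stand-ins for the McMPI internal structures.
    // Requires: using System.Collections.Generic;
    private readonly List<PostedRecv> recvQueue = new List<PostedRecv>();
    private readonly List<Header> recvBuffer = new List<Header>();

    // Thread handling an incoming Header.
    void OnHeaderArrived(Header header)
    {
        lock (recvQueue)
        {
            PostedRecv match = recvQueue.Find(delegate(PostedRecv r) { return r.Matches(header); });
            if (match != null) { recvQueue.Remove(match); Complete(match, header); }
            else { recvBuffer.Add(header); }            // no matching IRecv posted yet
        }
    }

    // Thread executing an IRecv call.
    void OnReceivePosted(PostedRecv recv)
    {
        lock (recvQueue)                                // same lock as above
        {
            Header match = recvBuffer.Find(delegate(Header h) { return recv.Matches(h); });
            if (match != null) { recvBuffer.Remove(match); Complete(recv, match); }
            else { recvQueue.Add(recv); }               // no matching Header arrived yet
        }
    }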

Some changes were made to the original scope of the project, the most significant of which is testing on non-Windows operating systems. It was hoped that testing of portability to Mono on Windows and then Mono on Linux would be possible within the limited project time-frame. However, the initial work to run the test programs using Mono on Windows XP revealed a number of issues. The current version of McMPI makes extensive use of classes provided in .Net Framework 2.0 and it appears that some of these are not yet included or supported in Mono. There are .Net 1.1 alternatives to all the 2.0 classes used, so a portable version is achievable; however, it was decided that this is an open-ended, hard-to-estimate and potentially lengthy task that could not easily be accommodated in the project plan. The time previously allocated to testing using Mono was used instead to create the eager protocol and thereby significantly reduce the latency, especially for small messages.

Although the content of some of the individual stages was altered during the project, all the major milestones were met and the deliverables were finished on time. The original project plan was an invaluable tool for managing the project scope and monitoring progress throughout the project lifecycle.


Figure 29: The original project plan


Appendix B.

Data tables

Columns: MPI library, message size (bytes), minimum time (ms), maximum time (ms), mean time (ms), standard deviation (ms), effective bandwidth (Mbit/s), protocol.

McMPI 100 1.776E+00 3.307E+00 2.006E+00 9.041E-02 7.605E-01 Eager
McMPI 150 1.859E+00 2.494E+00 2.038E+00 7.625E-02 1.123E+00 Eager
McMPI 200 1.811E+00 2.521E+00 2.031E+00 7.792E-02 1.502E+00 Eager
McMPI 300 1.856E+00 2.481E+00 2.053E+00 8.207E-02 2.229E+00 Eager
McMPI 400 1.870E+00 2.522E+00 2.068E+00 8.013E-02 2.952E+00 Eager
McMPI 500 1.873E+00 2.674E+00 2.082E+00 8.722E-02 3.664E+00 Eager
McMPI 750 1.944E+00 3.769E+00 2.221E+00 1.138E-01 5.153E+00 Eager
McMPI 1000 1.952E+00 2.806E+00 2.267E+00 8.652E-02 6.730E+00 Eager
McMPI 1500 2.074E+00 2.708E+00 2.352E+00 8.910E-02 9.732E+00 Eager
McMPI 2000 2.170E+00 2.966E+00 2.422E+00 1.193E-01 1.260E+01 Eager
McMPI 3000 2.449E+00 2.945E+00 2.630E+00 9.890E-02 1.741E+01 Eager
McMPI 4000 2.651E+00 3.307E+00 2.784E+00 9.496E-02 2.192E+01 Eager
McMPI 5000 2.719E+00 3.502E+00 2.930E+00 1.063E-01 2.604E+01 Eager
McMPI 7500 3.084E+00 3.598E+00 3.261E+00 7.443E-02 3.509E+01 Eager
McMPI 10000 3.478E+00 4.159E+00 3.724E+00 7.836E-02 4.098E+01 Eager
McMPI 15000 4.375E+00 4.805E+00 4.620E+00 1.397E-01 4.955E+01 Eager
McMPI 20000 5.202E+00 6.032E+00 5.564E+00 2.174E-01 5.485E+01 Eager
McMPI 30000 7.296E+00 7.587E+00 7.424E+00 8.750E-02 6.166E+01 Eager
McMPI 40000 9.169E+00 9.326E+00 9.248E+00 4.651E-02 6.600E+01 Eager
McMPI 50000 1.100E+01 1.170E+01 1.126E+01 1.738E-01 6.774E+01 Eager
McMPI 75000 1.583E+01 1.620E+01 1.600E+01 1.111E-01 7.154E+01 Eager
McMPI 100000 2.065E+01 2.111E+01 2.080E+01 1.200E-01 7.337E+01 Eager
McMPI 150000 3.008E+01 3.057E+01 3.030E+01 1.554E-01 7.555E+01 Eager
McMPI 200000 3.959E+01 4.023E+01 3.990E+01 2.191E-01 7.649E+01 Eager
McMPI 300000 5.827E+01 5.898E+01 5.854E+01 2.300E-01 7.820E+01 Eager
McMPI 400000 7.654E+01 7.726E+01 7.691E+01 1.963E-01 7.935E+01 Eager
McMPI 500000 9.497E+01 9.538E+01 9.517E+01 1.528E-01 8.016E+01 Eager
McMPI 750000 1.409E+02 1.420E+02 1.415E+02 2.980E-01 8.090E+01 Eager
McMPI 1000000 1.867E+02 1.892E+02 1.877E+02 6.180E-01 8.130E+01 Eager
McMPI 1500000 2.807E+02 2.816E+02 2.812E+02 2.589E-01 8.140E+01 Eager
McMPI 2000000 3.727E+02 3.736E+02 3.732E+02 3.264E-01 8.178E+01 Eager
McMPI 3000000 5.576E+02 5.631E+02 5.609E+02 1.506E+00 8.162E+01 Eager
McMPI 4000000 7.437E+02 7.484E+02 7.466E+02 1.275E+00 8.175E+01 Eager
McMPI 5000000 9.334E+02 9.369E+02 9.349E+02 1.139E+00 8.160E+01 Eager
McMPI 7500000 1.394E+03 1.398E+03 1.397E+03 1.477E+00 8.191E+01 Eager
McMPI 10000000 1.860E+03 1.873E+03 1.864E+03 3.053E+00 8.185E+01 Eager
McMPI 100 5.010E+00 8.401E+00 5.307E+00 2.206E-01 2.875E-01 Rendezvous
McMPI 150 5.004E+00 7.842E+00 5.332E+00 1.848E-01 4.292E-01 Rendezvous
McMPI 200 5.004E+00 6.712E+00 5.347E+00 1.576E-01 5.708E-01 Rendezvous
McMPI 300 5.053E+00 6.472E+00 5.377E+00 1.484E-01 8.514E-01 Rendezvous
McMPI 400 5.126E+00 6.982E+00 5.418E+00 1.532E-01 1.127E+00 Rendezvous
McMPI 500 5.162E+00 7.309E+00 5.473E+00 1.897E-01 1.394E+00 Rendezvous
McMPI 750 5.241E+00 6.812E+00 5.601E+00 1.435E-01 2.043E+00 Rendezvous
McMPI 1000 5.410E+00 6.990E+00 5.653E+00 1.661E-01 2.699E+00 Rendezvous
McMPI 1500 5.552E+00 7.230E+00 5.774E+00 2.410E-01 3.964E+00 Rendezvous
McMPI 2000 5.568E+00 6.238E+00 5.784E+00 1.178E-01 5.276E+00 Rendezvous
McMPI 3000 5.810E+00 6.371E+00 6.032E+00 1.093E-01 7.589E+00 Rendezvous
McMPI 4000 5.935E+00 6.511E+00 6.164E+00 1.204E-01 9.902E+00 Rendezvous
McMPI 5000 5.984E+00 6.655E+00 6.275E+00 9.773E-02 1.216E+01 Rendezvous
McMPI 7500 6.309E+00 7.959E+00 6.698E+00 3.335E-01 1.709E+01 Rendezvous
McMPI 10000 6.726E+00 7.420E+00 7.026E+00 1.176E-01 2.172E+01 Rendezvous
McMPI 15000 7.663E+00 8.328E+00 7.932E+00 1.320E-01 2.885E+01 Rendezvous
McMPI 20000 8.604E+00 9.183E+00 8.886E+00 1.146E-01 3.434E+01 Rendezvous
McMPI 30000 1.046E+01 1.107E+01 1.075E+01 1.035E-01 4.257E+01 Rendezvous
McMPI 40000 1.240E+01 1.299E+01 1.266E+01 1.309E-01 4.819E+01 Rendezvous
McMPI 50000 1.426E+01 1.528E+01 1.475E+01 1.792E-01 5.174E+01 Rendezvous
McMPI 75000 1.910E+01 1.965E+01 1.937E+01 1.114E-01 5.908E+01 Rendezvous
McMPI 100000 2.388E+01 2.450E+01 2.414E+01 1.227E-01 6.321E+01 Rendezvous
McMPI 150000 3.246E+01 3.400E+01 3.363E+01 4.190E-01 6.807E+01 Rendezvous
McMPI 200000 4.258E+01 4.313E+01 4.297E+01 1.538E-01 7.103E+01 Rendezvous
McMPI 300000 6.131E+01 6.226E+01 6.194E+01 2.599E-01 7.390E+01 Rendezvous
McMPI 400000 7.955E+01 8.048E+01 8.011E+01 2.422E-01 7.619E+01 Rendezvous
McMPI 500000 9.763E+01 9.861E+01 9.839E+01 2.846E-01 7.755E+01 Rendezvous
McMPI 750000 1.429E+02 1.452E+02 1.446E+02 6.267E-01 7.912E+01 Rendezvous
McMPI 1000000 1.891E+02 1.914E+02 1.907E+02 5.804E-01 8.001E+01 Rendezvous
McMPI 1500000 2.718E+02 2.839E+02 2.823E+02 3.504E+00 8.107E+01 Rendezvous
McMPI 2000000 3.733E+02 3.783E+02 3.770E+02 1.290E+00 8.095E+01 Rendezvous
McMPI 3000000 5.517E+02 5.601E+02 5.589E+02 2.410E+00 8.191E+01 Rendezvous
McMPI 4000000 7.381E+02 7.486E+02 7.458E+02 2.875E+00 8.184E+01 Rendezvous
McMPI 5000000 9.301E+02 9.383E+02 9.369E+02 2.329E+00 8.144E+01 Rendezvous
McMPI 7500000 1.381E+03 1.404E+03 1.397E+03 5.871E+00 8.191E+01 Rendezvous
McMPI 10000000 1.848E+03 1.869E+03 1.864E+03 5.622E+00 8.184E+01 Rendezvous
MPICH 100 2.426E-01 1.405E+00 4.919E-01 7.711E-02 3.102E+00 Automatic
MPICH 150 3.885E-01 1.848E+00 5.113E-01 9.639E-02 4.476E+00 Automatic
MPICH 200 4.816E-01 1.410E+00 5.355E-01 9.613E-02 5.698E+00 Automatic
MPICH 300 4.622E-01 1.409E+00 5.755E-01 1.182E-01 7.954E+00 Automatic
MPICH 400 4.831E-01 1.384E+00 5.679E-01 1.161E-01 1.075E+01 Automatic
MPICH 500 4.776E-01 1.459E+00 6.173E-01 1.285E-01 1.236E+01 Automatic
MPICH 750 6.070E-01 1.435E+00 7.469E-01 7.373E-02 1.532E+01 Automatic
MPICH 1000 7.262E-01 1.655E+00 8.200E-01 1.260E-01 1.861E+01 Automatic
MPICH 1500 9.691E-01 1.631E+00 9.947E-01 7.732E-02 2.301E+01 Automatic
MPICH 2000 9.708E-01 1.679E+00 1.001E+00 8.097E-02 3.049E+01 Automatic
MPICH 3000 1.216E+00 2.166E+00 1.426E+00 1.442E-01 3.211E+01 Automatic
MPICH 4000 1.457E+00 2.189E+00 1.502E+00 8.596E-02 4.063E+01 Automatic
MPICH 5000 1.723E+00 2.383E+00 1.757E+00 1.043E-01 4.342E+01 Automatic
MPICH 7500 2.017E+00 3.162E+00 2.131E+00 1.794E-01 5.371E+01 Automatic
MPICH 10000 2.430E+00 3.258E+00 2.615E+00 1.374E-01 5.835E+01 Automatic
MPICH 15000 3.429E+00 4.279E+00 3.697E+00 1.344E-01 6.191E+01 Automatic
MPICH 20000 4.498E+00 5.158E+00 4.678E+00 1.287E-01 6.524E+01 Automatic
MPICH 30000 6.316E+00 7.033E+00 6.491E+00 1.492E-01 7.053E+01 Automatic
MPICH 40000 8.345E+00 9.169E+00 8.530E+00 1.719E-01 7.155E+01 Automatic
MPICH 50000 1.016E+01 1.109E+01 1.044E+01 1.794E-01 7.307E+01 Automatic
MPICH 75000 1.500E+01 1.596E+01 1.528E+01 2.123E-01 7.489E+01 Automatic
MPICH 100000 1.984E+01 2.065E+01 2.005E+01 2.062E-01 7.611E+01 Automatic
MPICH 150000 3.056E+01 3.147E+01 3.103E+01 2.666E-01 7.376E+01 Automatic
MPICH 200000 4.003E+01 4.090E+01 4.041E+01 2.716E-01 7.553E+01 Automatic
MPICH 300000 6.002E+01 6.148E+01 6.067E+01 4.666E-01 7.545E+01 Automatic
MPICH 400000 7.916E+01 8.013E+01 7.963E+01 2.704E-01 7.665E+01 Automatic
MPICH 500000 9.777E+01 9.896E+01 9.830E+01 3.465E-01 7.762E+01 Automatic
MPICH 750000 1.469E+02 1.482E+02 1.474E+02 3.665E-01 7.763E+01 Automatic
MPICH 1000000 1.931E+02 1.953E+02 1.942E+02 7.409E-01 7.856E+01 Automatic
MPICH 1500000 2.908E+02 2.922E+02 2.915E+02 4.447E-01 7.853E+01 Automatic
MPICH 2000000 3.858E+02 3.871E+02 3.863E+02 4.072E-01 7.899E+01 Automatic
MPICH 3000000 5.728E+02 5.785E+02 5.744E+02 1.520E+00 7.969E+01 Automatic
MPICH 4000000 7.683E+02 7.705E+02 7.691E+02 6.976E-01 7.936E+01 Automatic
MPICH 5000000 9.710E+02 9.799E+02 9.755E+02 3.130E+00 7.821E+01 Automatic
MPICH 7500000 1.442E+03 1.447E+03 1.444E+03 1.044E+00 7.924E+01 Automatic
MPICH 10000000 1.943E+03 1.949E+03 1.946E+03 1.706E+00 7.842E+01 Automatic

Table 2: Results from Ping Pong Simplex for 100Mbit/s Ethernet connection


Columns: MPI library, message size (bytes), minimum time (ms), maximum time (ms), mean time (ms), standard deviation (ms), effective bandwidth (Mbit/s), protocol.

McMPI 100 1.651E+00 6.834E+00 1.826E+00 1.791E-01 8.357E-01 Eager
McMPI 150 1.677E+00 2.015E+00 1.814E+00 5.120E-02 1.262E+00 Eager
McMPI 200 1.653E+00 2.113E+00 1.819E+00 5.335E-02 1.677E+00 Eager
McMPI 300 1.698E+00 2.065E+00 1.827E+00 5.856E-02 2.505E+00 Eager
McMPI 400 1.629E+00 2.069E+00 1.829E+00 7.148E-02 3.338E+00 Eager
McMPI 500 1.651E+00 3.259E+00 1.846E+00 8.787E-02 4.133E+00 Eager
McMPI 750 1.701E+00 2.140E+00 1.873E+00 7.974E-02 6.109E+00 Eager
McMPI 1000 1.700E+00 2.164E+00 1.912E+00 9.482E-02 7.980E+00 Eager
McMPI 1500 1.699E+00 2.070E+00 1.851E+00 5.911E-02 1.236E+01 Eager
McMPI 2000 1.896E+00 2.213E+00 2.031E+00 5.066E-02 1.503E+01 Eager
McMPI 3000 1.921E+00 2.164E+00 2.042E+00 4.954E-02 2.241E+01 Eager
McMPI 4000 1.921E+00 2.317E+00 2.064E+00 6.029E-02 2.957E+01 Eager
McMPI 5000 2.067E+00 2.384E+00 2.163E+00 5.753E-02 3.527E+01 Eager
McMPI 7500 2.129E+00 2.457E+00 2.277E+00 5.959E-02 5.027E+01 Eager
McMPI 10000 2.184E+00 2.481E+00 2.333E+00 6.651E-02 6.539E+01 Eager
McMPI 15000 2.594E+00 3.137E+00 2.869E+00 1.227E-01 7.977E+01 Eager
McMPI 20000 3.040E+00 3.435E+00 3.252E+00 7.198E-02 9.383E+01 Eager
McMPI 30000 3.734E+00 4.413E+00 4.054E+00 1.443E-01 1.129E+02 Eager
McMPI 40000 4.473E+00 5.374E+00 5.043E+00 1.918E-01 1.210E+02 Eager
McMPI 50000 5.577E+00 6.711E+00 6.325E+00 2.615E-01 1.206E+02 Eager
McMPI 75000 8.025E+00 9.194E+00 8.626E+00 3.688E-01 1.327E+02 Eager
McMPI 100000 9.942E+00 1.180E+01 1.111E+01 4.853E-01 1.373E+02 Eager
McMPI 150000 1.508E+01 1.610E+01 1.545E+01 3.259E-01 1.481E+02 Eager
McMPI 200000 2.079E+01 2.187E+01 2.146E+01 2.977E-01 1.422E+02 Eager
McMPI 300000 2.807E+01 3.113E+01 2.920E+01 1.216E+00 1.567E+02 Eager
McMPI 400000 3.701E+01 4.147E+01 3.935E+01 1.537E+00 1.551E+02 Eager
McMPI 500000 4.469E+01 4.982E+01 4.656E+01 1.666E+00 1.638E+02 Eager
McMPI 750000 6.518E+01 7.448E+01 7.013E+01 3.590E+00 1.632E+02 Eager
McMPI 1000000 8.738E+01 9.813E+01 9.371E+01 4.124E+00 1.628E+02 Eager
McMPI 1500000 1.310E+02 1.474E+02 1.384E+02 6.645E+00 1.654E+02 Eager
McMPI 2000000 1.740E+02 1.940E+02 1.841E+02 7.764E+00 1.658E+02 Eager
McMPI 3000000 2.610E+02 2.927E+02 2.748E+02 1.119E+01 1.666E+02 Eager
McMPI 4000000 3.501E+02 3.818E+02 3.652E+02 1.142E+01 1.671E+02 Eager
McMPI 5000000 4.450E+02 4.729E+02 4.608E+02 9.000E+00 1.656E+02 Eager
McMPI 7500000 6.861E+02 7.061E+02 6.936E+02 5.716E+00 1.650E+02 Eager
McMPI 10000000 9.071E+02 9.346E+02 9.195E+02 8.829E+00 1.659E+02 Eager
McMPI 100 4.442E+00 5.956E+00 4.745E+00 1.484E-01 3.216E-01 Rendezvous
McMPI 150 4.395E+00 5.556E+00 4.758E+00 1.503E-01 4.810E-01 Rendezvous
McMPI 200 4.488E+00 6.153E+00 4.768E+00 1.590E-01 6.400E-01 Rendezvous
McMPI 300 4.508E+00 5.396E+00 4.782E+00 1.434E-01 9.573E-01 Rendezvous
McMPI 400 4.526E+00 6.107E+00 4.790E+00 1.560E-01 1.274E+00 Rendezvous
McMPI 500 4.540E+00 5.907E+00 4.801E+00 1.464E-01 1.589E+00 Rendezvous
McMPI 750 4.548E+00 5.994E+00 4.828E+00 1.481E-01 2.370E+00 Rendezvous
McMPI 1000 4.639E+00 6.089E+00 4.865E+00 1.366E-01 3.136E+00 Rendezvous
McMPI 1500 4.668E+00 5.374E+00 4.904E+00 1.266E-01 4.667E+00 Rendezvous
McMPI 2000 4.787E+00 5.349E+00 4.892E+00 1.161E-01 6.238E+00 Rendezvous
McMPI 3000 4.789E+00 5.365E+00 4.943E+00 1.260E-01 9.261E+00 Rendezvous
McMPI 4000 4.792E+00 5.481E+00 4.988E+00 1.299E-01 1.224E+01 Rendezvous
McMPI 5000 4.826E+00 5.625E+00 5.068E+00 1.302E-01 1.505E+01 Rendezvous
McMPI 7500 5.012E+00 5.717E+00 5.235E+00 1.433E-01 2.186E+01 Rendezvous
McMPI 10000 5.033E+00 5.755E+00 5.301E+00 1.253E-01 2.879E+01 Rendezvous
McMPI 15000 5.422E+00 6.420E+00 5.762E+00 1.697E-01 3.972E+01 Rendezvous
McMPI 20000 5.952E+00 6.787E+00 6.242E+00 1.534E-01 4.889E+01 Rendezvous
McMPI 30000 6.771E+00 7.686E+00 7.149E+00 1.662E-01 6.403E+01 Rendezvous
McMPI 40000 7.667E+00 8.563E+00 8.070E+00 1.970E-01 7.563E+01 Rendezvous
McMPI 50000 8.719E+00 9.924E+00 9.386E+00 2.470E-01 8.129E+01 Rendezvous
McMPI 75000 1.102E+01 1.246E+01 1.173E+01 3.080E-01 9.754E+01 Rendezvous
McMPI 100000 1.333E+01 1.495E+01 1.410E+01 4.173E-01 1.082E+02 Rendezvous
McMPI 150000 1.922E+01 2.043E+01 1.975E+01 3.161E-01 1.159E+02 Rendezvous
McMPI 200000 2.279E+01 2.445E+01 2.359E+01 6.734E-01 1.294E+02 Rendezvous
McMPI 300000 3.110E+01 3.511E+01 3.298E+01 1.375E+00 1.388E+02 Rendezvous
McMPI 400000 4.018E+01 4.433E+01 4.221E+01 1.681E+00 1.446E+02 Rendezvous
McMPI 500000 4.872E+01 5.410E+01 5.200E+01 1.833E+00 1.467E+02 Rendezvous
McMPI 750000 6.865E+01 7.790E+01 7.325E+01 3.660E+00 1.562E+02 Rendezvous
McMPI 1000000 9.021E+01 1.011E+02 9.456E+01 4.166E+00 1.614E+02 Rendezvous
McMPI 1500000 1.321E+02 1.482E+02 1.404E+02 6.294E+00 1.630E+02 Rendezvous
McMPI 2000000 1.747E+02 1.964E+02 1.854E+02 8.438E+00 1.646E+02 Rendezvous
McMPI 3000000 2.620E+02 2.913E+02 2.780E+02 1.037E+01 1.647E+02 Rendezvous
McMPI 4000000 3.503E+02 3.845E+02 3.686E+02 1.087E+01 1.656E+02 Rendezvous
McMPI 5000000 4.480E+02 4.774E+02 4.623E+02 1.099E+01 1.650E+02 Rendezvous
McMPI 7500000 6.812E+02 6.993E+02 6.909E+02 5.034E+00 1.656E+02 Rendezvous
McMPI 10000000 8.976E+02 9.356E+02 9.211E+02 1.336E+01 1.657E+02 Rendezvous
MPICH 100 2.367E-01 9.493E-01 3.281E-01 1.110E-01 4.650E+00 Automatic
MPICH 150 2.393E-01 1.411E+00 4.524E-01 1.076E-01 5.060E+00 Automatic
MPICH 200 2.410E-01 9.239E-01 3.002E-01 8.588E-02 1.017E+01 Automatic
MPICH 300 2.424E-01 9.980E-01 4.798E-01 7.198E-02 9.540E+00 Automatic
MPICH 400 2.423E-01 9.475E-01 3.502E-01 1.118E-01 1.743E+01 Automatic
MPICH 500 2.417E-01 9.732E-01 4.258E-01 1.163E-01 1.792E+01 Automatic
MPICH 750 2.420E-01 9.727E-01 4.677E-01 8.512E-02 2.447E+01 Automatic
MPICH 1000 2.690E-01 9.978E-01 4.925E-01 6.273E-02 3.098E+01 Automatic
MPICH 1500 4.857E-01 9.948E-01 5.261E-01 5.457E-02 4.351E+01 Automatic
MPICH 2000 6.311E-01 1.145E+00 7.283E-01 5.286E-02 4.190E+01 Automatic
MPICH 3000 7.224E-01 1.240E+00 7.625E-01 6.917E-02 6.004E+01 Automatic
MPICH 4000 7.257E-01 1.216E+00 7.626E-01 5.319E-02 8.003E+01 Automatic
MPICH 5000 7.532E-01 1.263E+00 8.049E-01 6.677E-02 9.479E+01 Automatic
MPICH 7500 9.491E-01 1.412E+00 9.971E-01 6.615E-02 1.148E+02 Automatic
MPICH 10000 9.963E-01 1.551E+00 1.110E+00 7.838E-02 1.375E+02 Automatic
MPICH 15000 1.211E+00 1.728E+00 1.289E+00 8.099E-02 1.775E+02 Automatic
MPICH 20000 1.455E+00 2.160E+00 1.627E+00 1.050E-01 1.876E+02 Automatic
MPICH 30000 1.934E+00 2.527E+00 2.039E+00 9.862E-02 2.245E+02 Automatic
MPICH 40000 2.551E+00 3.380E+00 2.755E+00 1.430E-01 2.216E+02 Automatic
MPICH 50000 3.014E+00 4.465E+00 3.325E+00 2.196E-01 2.295E+02 Automatic
MPICH 75000 4.452E+00 5.447E+00 4.886E+00 2.247E-01 2.342E+02 Automatic
MPICH 100000 5.812E+00 6.951E+00 6.326E+00 2.235E-01 2.412E+02 Automatic
MPICH 150000 1.046E+01 1.130E+01 1.070E+01 2.578E-01 2.138E+02 Automatic
MPICH 200000 1.318E+01 1.391E+01 1.340E+01 2.308E-01 2.278E+02 Automatic
MPICH 300000 1.944E+01 2.055E+01 1.993E+01 3.344E-01 2.296E+02 Automatic
MPICH 400000 2.324E+01 2.603E+01 2.472E+01 1.027E+00 2.469E+02 Automatic
MPICH 500000 3.078E+01 3.246E+01 3.124E+01 4.673E-01 2.442E+02 Automatic
MPICH 750000 4.318E+01 4.872E+01 4.645E+01 1.624E+00 2.464E+02 Automatic
MPICH 1000000 5.639E+01 6.308E+01 5.992E+01 1.792E+00 2.547E+02 Automatic
MPICH 1500000 8.440E+01 9.308E+01 8.968E+01 2.680E+00 2.552E+02 Automatic
MPICH 2000000 1.111E+02 1.207E+02 1.177E+02 2.396E+00 2.593E+02 Automatic
MPICH 3000000 1.706E+02 1.828E+02 1.772E+02 4.342E+00 2.583E+02 Automatic
MPICH 4000000 2.352E+02 2.425E+02 2.395E+02 2.763E+00 2.549E+02 Automatic
MPICH 5000000 3.183E+02 3.259E+02 3.226E+02 2.280E+00 2.365E+02 Automatic
MPICH 7500000 5.189E+02 5.220E+02 5.207E+02 8.610E-01 2.198E+02 Automatic
MPICH 10000000 8.107E+02 8.140E+02 8.121E+02 1.248E+00 1.879E+02 Automatic

Table 3: Results from Ping Pong Simplex for 1Gbit/s Ethernet connection


Remoted Method | 100Mbit/s Time (µs) | 1Gbit/s Time (µs)
Minimal method | 610 | 510
Eager protocol method | 1,202 | 1,018

Table 4: Execution times for minimal and eager protocol remoted methods

Columns: image size, data type precision, serial time (ms), MPICH 100Mbit/s time (ms), MPICH 1Gbit/s time (ms), McMPI 100Mbit/s time (ms), McMPI 1Gbit/s time (ms).

Small Single 1.284E-01 1.027E+00 1.071E+00 3.397E+00 3.113E+00

Small Double 1.203E-01 1.005E+00 8.645E-01 3.287E+00 2.967E+00

Medium Single 5.568E+00 6.665E+00 6.322E+00 9.342E+00 8.912E+00

Medium Double 5.467E+00 6.710E+00 6.696E+00 9.577E+00 8.940E+00

Large Single 5.087E+01 5.295E+01 5.255E+01 5.721E+01 5.595E+01

Large Double 4.932E+01 5.167E+01 5.109E+01 6.302E+01 5.943E+01

Table 5: Results from the Image Processor code

Item | Machine A | Machine B
Processor | Pentium 4 – 2.4GHz | Pentium 4 – 2.6GHz
Memory | 2GB SDRAM | 2GB SDRAM
Network | Gigabit Ethernet NIC | Gigabit Ethernet NIC
Interconnect | Netgear FS108 100Mbit/s Switch or Netgear GS108 1Gbit/s Switch (shared by both machines)
Operating System | Windows XP – SP2 | Windows XP – SP2
MPICH | MPICH 1.2.5 NT | MPICH 1.2.5 NT
.Net | MS Framework 2.0.757 | MS Framework 2.0.757
Notes | Clean OS install plus all critical updates, as at 1st August 2007; Windows Firewall disabled during timed tests

Table 6: Configuration of test hardware


Appendix C.

References & Bibliography

1. Using MPI with C# and the Common Language Infrastructure. Willcock, Jeremiah, Lumsdaine, Andrew and Robison, Arch. 7-8, Bloomington, IN : Indiana University Open Systems Laboratory, 2005, Concurrency and Computation: Practice and Experience, Vol. 17, pp. 895-917.

2. A high-performance, portable implementation of the MPI message passing interface standard. Gropp, W., Lusk, E., Doss, N. and Skjellum, A. 6, 1996, Parallel Computing, Vol. 22, pp. 789-828.

3. Gatlin, Kang Su and Isensee, Pete. Reap the Benefits of Multithreading without All the Work. MSDN Magazine. [Online] Microsoft Corporation and CMP Media, LLC, October 2005. [Cited: 8 March 2007.] http://msdn.microsoft.com/msdnmag/issues/05/10/OpenMP/.

4. OpenMP Architecture Review Board. Official OpenMP Specifications version 2.5. OpenMP.org. [Online] March 2005. [Cited: 13 August 2007.] http://www.openmp.org/drupal/mp-documents/spec25.pdf.

5. Kambites, M E, Obdrzalek, J and Bull, J M. An OpenMP-like Interface for Parallel Programming in Java. Concurrency and Computation: Practice and Experience. 2001, Vol. 13, 8-9, pp. 793-814.

6. MPI: A message passing interface. Message Passing Interface Forum. New York : ACM Press, 1993. Proceedings of SuperComputing 1993. pp. 878-883.

7. Carpenter, B, Fox, G, Ko, S-H, Lim, S. mpiJava 1.2: API specification. Center for Research on Parallel Computation, Rice University. 1999. Technical Report. CRPC-TR99804.

8. Carpenter B, Getov V, Judd G, Skjellum A, Fox G. MPJ: MPI-like message passing for Java. Concurrency: Practice and Experience. 2000, Vol. 12, 11.

9. Mintchev, S. Writing programs in JavaMPI. School of Computer Science, University of Westminster. 1997. Technical Report. MAN-CSPE-02.

10. Baker, Mark, et al. mpiJava: A Java interface to MPI. First UK Workshop on Java for High Performance Network Computing. September 1998. mpiJava Home Page: http://www.npac.syr.edu/projects/pcrc/HPJava/mpiJava.html.


11. JMPI: Implementing the message passing standard in Java. Morin S, Koren I, Krishna CM. Piscataway, NJ : IEEE Press, 2002. International Parallel and Distributed Processing Symposium: IPDPS 2002 Workshops. Vol. 118b.

12. Ubiquitous message passing interface implementation in Java: jmpi. Dincer, K. Piscataway, NJ : IEEE Press, 1999. Proceedings of the 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing. Vol. 203.

13. JOPI: A Java object-passing interface. Mohamed N, Al-Jaroodi J, Jiang H, Swanson D. New York : ACM Press, 2002. Proceedings of the ACM 2002 Java Grande-ISCOPE Conference. pp. 37-45.

14. CCJ: Object-based message passing and collective communication in Java. Nelisse A, Maassen J, Kielmann T, Bal HE. New York : ACM Press, 2001. Proceedings of the Joint ACM 2001 Java Grande-ISCOPE Conference.

15. MPI Ruby: Scripting in a Parallel Environment. Ong, Emil. Los Alamitos, CA, USA : IEEE Computer Society, 2002, Computing in Science and Engineering, Vol. 04, pp. 78-82. ISSN 1521-9615.

16. MPI Ruby with Remote Memory Access. Aycock, Christopher C. Los Alamitos, CA, USA : IEEE Computer Society, 2004, hpdc, Vol. 00, pp. 280-281. ISSN 1082-8907.

17. Message Passing Interface Forum. MPI-2: Extensions to the Message-Passing Interface. [Online] 1997. [Cited: 24 January 2007.] http://www.mpi-forum.org/.

18. Miller P, Casado M. MPI Python. [Online] [Cited: 24 January 2007.] http://sourceforge.net/projects/pympi/.

19. Python Software Foundation. About Python. Python.org. [Online] [Cited: 13 August 2007.] http://www.python.org/about/.

20. Visual Numerics. About Visual Numerics. Visual Numerics. [Online] [Cited: 17 August 2007.] http://www.vni.com/company/corp.php.

21. National Institute of Standards and Technology. JavaNumerics. Math, Statistics and Computational Science. [Online] [Cited: 17 August 2007.] http://math.nist.gov/javanumerics/.

22. Extreme Optimization. Introducing the Extreme Optimization Numerical Libraries for .NET. Extreme Optimization. [Online] [Cited: 17 August 2007.] http://www.extremeoptimization.com/.

23. CenterSpace Software. CenterSpace Software: Products. CenterSpace Software Website. [Online] [Cited: 17 August 2007.] http://www.centerspace.net/products.php.


24. Bull, Mark and Smith, Lorna. Java for High Performance Computing. Object-Oriented Programming for High Performance Computing: Course Slides. Edinburgh : Edinburgh Parallel Computing Centre, 2007.

25. Microsoft Corporation. .Net Framework Developer's Guide: Writing Verifiably Type-Safe Code. MSDN: .Net Framework Developer Center. [Online] [Cited: 17 August 2007.] http://msdn2.microsoft.com/en-us/library/01k04eaf.aspx.

26. —. Figure 8.1: Code Access Security - a simplified view. Improving Web Application Security: Threats and Countermeasures. Paperback. s.l. : Microsoft Press, 2003, 8, p. 186.

27. ECMA. Standard ECMA-335: Common Language Infrastructure (CLI). ECMA International Standards. [Online] 4th Edition (June 2006), 2006. [Cited: 8 March 2007.] http://www.ecma-international.org/publications/standards/Ecma-335.htm.

28. —. Standard ECMA-334: C# Language Specification. ECMA International Standards. [Online] 4th Edition (June 2006), 2006. [Cited: 8 March 2007.] http://www.ecma-international.org/publications/standards/Ecma-334.htm.

29. ECMA International. Home page: ECMA International. ECMA International. [Online] [Cited: 17 August 2007.] http://www.ecma-international.org/.

30. Easton, M J and King, Jason. Cross-Platform .NET Development: Using Mono, Portable.NET, and Microsoft .NET. Berkeley, CA : Apress, 2004. ISBN 1-59059-330-8.

31. Smith, L and Bull, M. Development of mixed mode MPI / OpenMP applications. Scientific Programming. 2001, Vol. 9, 2-3, pp. 83-98.

32. Hybrid Parallel Programming on HPC Platforms. Rabenseifner, Rolf. Aachen, Germany : s.n., Sept 22-26, 2003. In proceedings of the Fifth European Workshop on OpenMP, EWOMP '03. pp. 185-194.

33. Microsoft Corporation. Sinks and Sink Chains. MSDN .NET Framework Developer Center. [Online] [Cited: 17 August 2007.] http://msdn2.microsoft.com/en-us/library/tdzwhfy3(VS.71).aspx.

34. —. SSCLI: Shared Source Common Language Infrastructure. Microsoft Research. [Online] [Cited: 17 August 2007.] http://research.microsoft.com/sscli/.

35. Mono Team. Main Page - Mono. mono-project.com. [Online] Novell. [Cited: 17 August 2007.] http://www.mono-project.com/Main_Page.

36. DotGNU Project - GNU Freedom for the Net. GNU's Not Unix. [Online] Free Software Foundation. [Cited: 17 August 2007.] http://www.gnu.org/software/dotgnu/pnet.html.


37. Open Systems Lab. MPI.net research at the Open Systems Lab. Open Systems Lab: Pervasive Technology Labs at Indiana University. [Online] The Trustees of Indiana University. [Cited: 17 August 2007.] http://www.osl.iu.edu/research/mpi.net/.

38. PJMPI: pure Java implementation of MPI. WeiQin, Tong, Hua, Ye and WenSheng, Yao. Beijing, China : School of Computer Engineering and Science, Shanghai University, 2000. The Fourth International Conference/Exhibition on High Performance Computing in the Asia-Pacific Region, 2000. Proceedings. Vol. 1, pp. 533-535. ISBN: 0-7695-0589-2.

39. Microsoft Corporation. Using Microsoft Message Passing Interface. Microsoft TechNet. [Online] 6 June 2006. [Cited: 17 August 2007.] http://technet2.microsoft.com/windowsserver/en/library/4cb68e33-024b-4677-af36-28a1ebe9368f1033.mspx?mfr=true.

40. —. Windows Compute Cluster Server 2003 Product Overview. Microsoft TechNet. [Online] [Cited: 17 August 2007.] http://www.microsoft.com/technet/ccs/overview.mspx.

41. Stutz, David. The Microsoft Shared Source CLI Implementation. MSDN Library. [Online] March 2002. [Cited: 8 March 2007.] http://msdn2.microsoft.com/en-us/library/ms973879.aspx.

42. Whittington, Jason. Rotor: Shared Source CLI Provides Source Code for a FreeBSD Implementation of .NET. MSDN Magazine. [Online] July 2002. [Cited: 8 March 2007.] http://msdn.microsoft.com/msdnmag/issues/02/07/sharedsourcecli/.

43. Object oriented MPI: A class library for the message passing interface. Squyres, JM, McCandless, BC and Lumsdaine, A. s.l. : Los Alamos National Laboratory, 1996, Parallel Object-Oriented Methods and Applications (POOMA).

44. Design issues for efficient implementation of MPI in Java. Judd, G, et al. New York : ACM Press, 1998. Proceedings of the ACM 1999 Java Grande Conference. pp. 58-65.

45. ParC#: Parallel Computing with C# in .Net. Ferreira, J. F. and Sobral, J. L. Heidelberg, Berlin : Springer, 2005, Lecture Notes In Computer Science, pp. 239-248. ISSN 0302-9743.


Appendix D.

Glossary

Attributes; a way to attach meta-data to classes, properties, fields or methods written in managed code so that it is preserved into the binary code by the compiler.

C# – pronounced "C sharp"; a language derived from the C/C++ family (originally developed by Microsoft Corporation but now recognised by ECMA standardisation) that can be compiled to CIL.

CIL – Common Intermediate Language; conceptually similar to Java byte code, .Net source code is compiled to this lower-level intermediate form ready for a Virtual Execution System such as the Microsoft .Net Framework, Mono or Portable.Net to execute, using a Just-In-Time compilation process to produce machine code for the particular machine.

CLI – Common Language Infrastructure; the combination of the Common Type System (CTS), the Common Language Specification (CLS), the Common Intermediate Language (CIL) and a Virtual Execution System (VES).

CLR – Common Language Runtime; the combination of the Virtual Execution System and the Garbage Collection service but also used to reference the Microsoft .Net Framework, which historically was the first and only CLR available.

CLS – Common Language Specification; defines the rules that any language targeting the CLI should obey in order to interoperate with other CLS-compliant languages.

CTS – Common Type System; defines the basic types and the operations on those types that are shared by all CTS-compliant languages.

ECMA – formerly the European Computer Manufacturers Association, now ECMA International; facilitates the timely creation of a wide range of global Information and Communication Technology (ICT) and Consumer Electronics (CE) standards.

ISA – Instruction Set Architecture; the particular set of machine code instructions available on a particular hardware machine.

JIT – Just-In-Time compilation; in contrast to ordinary compilation, JIT delays all or part of the compilation of code until runtime. This provides benefits such as portability and allows the JIT compiler to take advantage of its knowledge of the target ISA and the actual usage patterns for the code such as the most frequently taken route for a branch instruction.

JVM – Java Virtual Machine; a sandbox environment in which all Java code executes.


Managed Code; a generic term for source code in any language that is compiled to CIL.

MSIL – Microsoft Intermediate Language; an outdated synonym for CIL from before the CLI and C# specifications were standardised by ECMA.

P/Invoke – Platform Invoke; a mechanism whereby C# code can call code not written in managed code such as C, C++ or FORTRAN.

Remoted Object; an object created locally that handles remote method calls using automatically generated proxy objects.

Remoting; conceptually similar to Remote Method Invocation (RMI) in Java, remoting is a method of inter-process communication based on the idea that a method call can be viewed as a message between two locations in code. These messages are transported by proxy classes that de-couple the source and target. In particular, the proxy classes may transmit the message across a network, allowing code on one machine to call code in a process on another machine. The syntax of the method call is identical to that for a local (non-remoted) method call; only the configuration files and physical deployment location must change.

VES – Virtual Execution System; a part of a CLR which is conceptually similar to a JVM. The CIL code is targeted at the VES rather than any specific ISA, which achieves binary portability across a wide variety of hardware.
