EVALUATION AND TUNING OF GIGABIT ETHERNET PERFORMANCE ON CLUSTERS

A thesis submitted to Kent State University in partial fulfillment of the requirements for the Degree of Master of Science

by

Harit Desai

August, 2007

Thesis Written By

Harit Desai

B.E., Nagpur University, India, 2000

M.S., Kent State University, OH, 2007

Approved by

Dr. Paul A. Farrell, Advisor

Dr. Robert A. Walker, Chair, Dept. of Computer Science

Dr. Jerry Feezel, Dean, College of Arts and Sciences


TABLE OF CONTENTS

ACKNOWLEDGEMENTS …..………………………………………………………….vi

CHAPTER 1 INTRODUCTION ....…………………………….…………………….. 1

1.1 Clusters for Scientific Computing ……………………………………….…….... 2

1.2 Thesis Organization .………………………………………………………...... 8

CHAPTER 2 OVERVIEW OF GIGABIT ETHERNET TECHNOLOGY ...... 9

2.1 Operating Modes ………………………………………………………………... 9

2.2 Enhanced CSMA/CD…………………………………………………………… 12

2.3 Issues affecting Gigabit Ethernet performance…………………………………. 15

CHAPTER 3 VI ARCHITECTURE OVERVIEW ………………………………… 19

3.1 VI Architecture…………………………………………………………………..20

3.1.1. Virtual Interfaces……………………………………………………………….. 21

3.1.2. VI Provider …..…………………………………………………………...……. 23

3.1.3 VI Consumer……………………………………………………………………. 23

3.1.4. Completion Queues………………………………………………..……………. 24

3.2. Data Transfer Models………………………………………………..………….. 25

3.2.1 Send/Receive……………………………………………………………..………26

3.3. Managing VI Components……………………………………………….………27


3.3.1 Accessing a VI NIC……………………………………………………………...27

3.3.2 Registering and De-registering Memory …..………………...…………………28

3.3.3 Creating and Destroying VIs …………………………………………………. 28

3.3.4 Creating and Destroying Completion Queues …...………………………….…. 29

3.4. VI Connection and Disconnection………………....…………………………..31

3.4.1. VI Connection…………………………………………………………………31

3.4.2. VI Disconnection……………………………………………………………...34

3.4.3. VI Address Format…………………………………………………………… 35

3.5. VI States…………………………………...…………………………………. 36

CHAPTER 4 NETPIPE……………………………………………………………. 37

4.1. Introduction……………………………………………………………………37

4.2. NetPIPE Design……………………………………………………………….38

4.3. NetPIPE Results……………………………………………………………….40

4.4. VIA driver for NetPIPE……………………………………………………….42

CHAPTER 5 PERFORMANCE COMPARISON……………………………………….45

5.1. Testing Environment and Network Parameters……………………………….45

5.2 TCP Comparisons………………...... 46

5.2.1 Varying MTU size ……………………………………………………………48


5.2.2. Varying Socket buffer size ……………………………………………………50

5.2.3. Varying TX queue length ……………………….…..………………………...51

5.2.4. Varying processor speed …………………………………………………...…54

5.2.5. Different gigabit network interfaces ……………………….………………....57

5.2.6. Performance of Xeon processors ……………………….…………………….59

5.2.7. Performance of Opteron processors …………………….……………………..60

5.3 VIA Comparisons ………………………………………………………..…...63

5.4 TCP and VIA comparison…………………………………………..……...…67

5.5 MVIA Latency comparisons ………………………………………………….72

CHAPTER 6 MPI COMPARISONS …………………………………….………74

6.1 Introduction…………………………………………………………………...74

6.2 Testing environment ………………………………………………………….75

6.3 LAM and MPICH performance comparisons………………………………...76

CHAPTER 7 CONCLUSION ……………………………...……………………..81

References ……………………………………………………….………………...….84


ACKNOWLEDGEMENT

This thesis would not have been possible without the help and encouragement of many people. First and foremost I wish to thank my advisor, Dr. Paul Farrell. Without his encouragement, patience and constant guidance, I could not have completed my thesis.

Besides my advisor, I would also like to thank Roy Heath for his help and support in providing me a platform to do various tests. I also want to thank Dr. Ruttan and Dr. Nesterenko for serving on my thesis committee.

Last but not least, I thank my family: my Mom and my wife, for their unconditional support and encouragement to pursue my interests. My friend Darshan has always advised me when I needed him and has always given me inspiration. I also want to thank my friends Parag, Deepak, Jalpesh, Kiran, Mahesh and Siddharath for their support.


Chapter 1

Introduction

Abstract

Cluster computing imposes heavy demands on the communication network. Gigabit Ethernet technology can provide the required bandwidth to meet these demands. However, it has also shifted the communication bottleneck from network media to protocol processing. In this thesis, we present an overview of Gigabit Ethernet technology and study the end-to-end Gigabit Ethernet communication bandwidth and latency. Performance graphs, collected using NetPIPE, clearly show the performance characteristics of TCP/IP and VIA over Gigabit Ethernet.

Here we discuss the communication performance attainable with a PC cluster connected by a Gigabit Ethernet network. Gigabit Ethernet is the third generation of Ethernet technology and offers raw bandwidth of 1 Gbps. The focus of this work is to discuss the Gigabit Ethernet technology, to evaluate and analyze the end-to-end communication latency and achievable bandwidth, and to monitor the effects of software and hardware components on the overall network performance.


1.1. Clusters for Scientific Computing

Cluster computing offers great potential for increasing the amount of computing power and communication resources available to large scale applications. The combined computational power of a cluster of powerful PCs connected to a high speed network may exceed that achievable by the previous generation of stand-alone high performance supercomputers.

Running large scale parallel applications on a cluster imposes heavy demands on the communication network. Therefore, in early distributed computing, one of the design goals was to limit the amount of communication between hosts. However, due to the features of some applications, a certain degree of communication between hosts may be required. As a result, the performance bottleneck of the network severely limited the potential of cluster computing. Recent high speed networks such as Asynchronous Transfer Mode (ATM), Fibre Channel (FC), Gigabit Ethernet and 10 Gigabit Ethernet [8] change the situation. These high speed networks offer raw bandwidths ranging from 100 megabits per second (Mbps) to 10 gigabits per second (Gbps), satisfying the communication needs of many parallel applications.

Due to the increase in network hardware speed and the availability of low cost high performance workstations, cluster computing has become increasingly popular. Many research institutes, universities, and industrial sites around the world have started to purchase or build low cost clusters, such as Beowulf-class clusters, for their parallel processing needs at a fraction of the price of mainframes or supercomputers.

Beowulf (PC) clusters represent a cost-effective platform for many large scale scientific computations. They are scalable performance clusters based on commodity hardware, such as PCs and general purpose or third-party network equipment, on a private system area network. By general purpose network equipment, we mean network interface cards (NICs) and switches which have been developed for use in general local area networks (LANs), as opposed to those which are designed by third party vendors specifically for use in clusters or parallel machines, such as Myrinet [22], Giganet, or Quadrics.

Trends among the most powerful computational machines in the world are tracked on the TOP500 site (www.top500.org) [28], which was started in 1993 to provide a reliable basis for tracking and detecting trends in high-performance computing. The site also includes summary information on the architectures, operating systems and interconnects of these computers. Some of the summary figures are included below.

These clearly illustrate the transition from the early days when proprietary custom built supercomputers were dominant to the current situation where clusters are predominant.


Figure 1.1: Processor Family Evolution over time

Figure 1.1 clearly indicates that Intel (EM64T, IA-64, i686, IA-32) and AMD processors now predominate in the TOP 500. The only other processors still appearing are the Power PC, Cray and Hewlett-Packard PA-RISC, and of these only the Power PC is a significant percentage of the whole.


Figure 1.2: Architecture Evolution over time

Figure 1.2 illustrates a similar consolidation in architecture, with clusters now representing approximately two-thirds of the machines in the TOP500. Figure 1.3 also illustrates a similar trend in interconnects, with Gigabit Ethernet, Myrinet and Infiniband [16] being the dominant interconnects in recent years. Of all interconnects, Gigabit Ethernet is used by approximately 50% of these high performance computational machines. Figure 1.4 shows an even clearer dominance for Linux, as it is used in over two thirds of the clusters in the TOP500. Berkeley Systems Distribution (BSD) and other Unix variants comprise most of the remainder.


Figure 1.3: Interconnect Evolution over time


Figure 1.4: Operating System Evolution over time

However, in many cases the maximum achievable bandwidth at the application level is still far from the theoretical peak bandwidth of the interconnection networks. This major roadblock to achieving high speed cluster communication is caused by the overhead resulting from the time required for the interaction between software and hardware components. To provide a faster path between applications and the network, most researchers have advocated removing the operating system kernel and its centralized networking stack from the critical path and creating a user-level network interface. With these interfaces, designers can tailor the communication layers each process uses to the demands of that process. Consequently, applications can send and receive network packets without operating system intervention, which greatly decreases communication latency and increases network throughput.

Intel, Microsoft, and Compaq introduced the Virtual Interface Architecture (VIA) as a standard for cluster or system-area networks. VIA defines mechanisms that bypass layers of protocol stacks and avoid intermediate copies of data during sending and receiving messages. Elimination of this overhead is intended not only to enable significant communication performance increases but also to result in a significant decrease in processor utilization by the communication subsystem.

1.2. Thesis Organization

Chapter 2 provides details about Gigabit Ethernet and issues affecting performance on gigabit Ethernet clusters. Virtual Interface Architecture (VIA) is explained in detail in Chapter 3. Chapter 4 explains about NetPIPE, the software used to run network tests and collect data. In Chapter 5, we show the performance comparison between TCP and VIA with varying MTU, TX queue length and Socket buffer size.

Tests also show the performance comparisons of TCP and VIA on different processor speeds. End-to-end communication latency and throughput of LAM and MPICH is presented in Chapter 6. Finally, we present conclusions and a summary of similar work performed elsewhere in Chapter 7.

Chapter 2

Overview of Gigabit Ethernet Technology

Gigabit Ethernet [2] is the third generation of Ethernet technology, also known as IEEE Standard 802.3z. Like Ethernet, Gigabit Ethernet is a media access control (MAC) and physical layer (PHY) technology. It offers a raw bandwidth of 1 gigabit per second (1 Gbps). In order to achieve 1 Gbps, the original Gigabit Ethernet over fiber uses a modified version of the ANSI X3T11 Fibre Channel standard physical layer (FC-0). To remain backward compatible with existing Ethernet technologies, Gigabit Ethernet uses the same IEEE 802.3 Ethernet frame format, and a compatible full or half duplex carrier sense multiple access/collision detection (CSMA/CD) scheme scaled to gigabit speeds.

2.1 Operating Modes

The Gigabit Ethernet standard provides for either half-duplex or full-duplex mode. In full-duplex mode, frames travel in both directions simultaneously over two channels on the same connection for an aggregate bandwidth of twice that of half-duplex mode. Full duplex networks are very efficient since data can be sent and received simultaneously.

However, full-duplex transmission can be used for point-to-point connections only. Since full-duplex connections cannot be shared, collisions are eliminated. This setup eliminates most of the need for the CSMA/CD access control mechanism because there is no need to determine whether the connection is already being used.

When Gigabit Ethernet operates in full duplex mode, it uses buffers to store incoming and outgoing data frames until the MAC layer has time to pass them higher up the legacy protocol stacks. During heavy traffic transmissions, the buffers may fill up with data faster than the MAC can process them. When this occurs, the MAC layer prevents the upper layers from sending until the buffer has room to store more frames; otherwise, frames would be lost due to insufficient buffer space.

In the event that the receive buffers approach their maximum capacity, a high water mark interrupts the MAC control of the receiving node and sends a signal to the sending node instructing it to halt packet transmission for a specified period of time until the buffer can catch up. The sending node stops packet transmission until the time interval is past or until it receives a new packet from the receiving node with a time interval of zero. It then resumes packet transmission. The high water mark ensures that enough buffer capacity remains to give the MAC time to inform the other devices to shut down the flow of data before the buffer capacity overflows. Similarly, there is a low water mark to notify the MAC control when there is enough open capacity in the buffer to restart the flow of incoming data.

Full-duplex transmission can be deployed between ports on two switches, a workstation and a switch port, or between two workstations. Full-duplex connections cannot be used for shared-port connections, such as a repeater or hub port that connects multiple workstations. Gigabit Ethernet is most effective when running in the full-duplex, point-to-point mode where full bandwidth is dedicated between the two end-nodes. This is the normal mode used in switch based clusters, where each node is connected to a separate port on a Gigabit Ethernet switch. Full-duplex operation is also ideal for backbones and high-speed server or router links.

For half-duplex operation, Gigabit Ethernet will use the enhanced CSMA/CD access method. With CSMA/CD, the same channel can only transmit or receive at one time. A collision results when a frame sent from one end of the network collides with another frame. Timing becomes critical if and when a collision occurs. If a collision occurs during the transmission of a frame, the MAC will stop transmitting and retransmit the frame when the transmission medium is clear. If the collision occurs after a packet has been sent, then the packet is lost since the MAC has already discarded the frame and started to prepare the next frame for transmission. In all cases, the rest of the network must wait for the collision to dissipate before any other devices can transmit.

In half-duplex mode, Gigabit Ethernet's performance is degraded. This is because Gigabit Ethernet uses the CSMA/CD protocol, which is sensitive to frame length. The standard slot time for Ethernet frames is not long enough to traverse a 200-meter cable when passing 64-byte frames at gigabit speed. In order to accommodate the timing problems experienced with CSMA/CD when scaling half-duplex Ethernet to gigabit speed, the slot time has been extended to guarantee at least a 512-byte slot time using a technique called carrier extension. The frame size is not changed; only the interframe timing is extended.


Half-duplex operation is intended for shared multistation LANs, where two or more end nodes share a single port. Most switches enable users to select half-duplex or full-duplex operation on a port-by-port basis, allowing users to migrate from shared links to point-to-point, full duplex links when they are ready. Half-duplex operation is not recommended for cluster installations, since the predominant programming model for these is Single Program Multiple Data (SPMD), where all nodes proceed in loose lockstep. This means that most nodes transmit at approximately the same time, leading to a high probability of collisions.

Gigabit Ethernet now operates over a variety of cabling types. Initially, the Gigabit Ethernet specification supported multi-mode and single-mode optical fiber, and short haul copper cabling. Fiber is ideal for connectivity between switches and between a switch and a high speed server because it can be extended to greater lengths than copper before signal attenuation becomes unacceptable, and it is more reliable than copper. In June 1999, the Gigabit Ethernet standard was extended to incorporate category 5 unshielded twisted-pair (UTP) copper media [4]. The fianna cluster in Computer Science at Kent State was one of the earliest to be implemented using UTP network cards and switches.

2.2. Enhanced CSMA/CD

The MAC layer of Gigabit Ethernet uses the same CSMA/CD protocol as defined in IEEE 802.3. As a result, the maximum network diameter used to connect nodes is limited by the CSMA/CD protocol.


IEEE 802.3 (10BaseT) defined the original CSMA/CD mechanism. This scheme ensures that all nodes are granted access to the physical medium on a first-come, first-served basis. The maximum network diameter in 10BaseT is limited to 2000 m. This distance limitation is due to the relationship between the time (also known as the slot time) required to transmit a minimum frame of 64 bytes and the ability to detect a collision (a limit known as the propagation delay). When a collision occurs, the MAC layer detects it and sends a halt signal to cause the transmitting nodes to stop transmitting and enter a backoff phase prior to retrying transmission.

When the IEEE defined 802.3u (100BaseT) in 1994, it maintained the Ethernet framing format and raised the speed limit to 100Mbit/s. As the bit rate increases, the time needed to transmit a frame is reduced by a factor of 10. This implies that the network diameter is decreased from 2000 m for 10BaseT to 200 m for 100BaseT.

Since IEEE 802.3z represents another tenfold increase in bit rate as compared to 100BaseT, the network diameter is further reduced by another factor of 10. But the resulting network diameter of 20 m is clearly too short for most general purpose network configurations and is thus impractical. In addition, this distance is even less if delays in active components such as repeaters are considered. Moreover, with the existing silicon technology at the time, it was not feasible for vendors of repeater chips operating with a 25MHz clock to scale up to operate with a 250 MHz clock. As a consequence, the IEEE 802.3z working committee redefined the MAC layer for Gigabit Ethernet and introduced a mechanism that preserved the 200 m collision domain of 100BaseT. This is necessary because two nodes, which are 200 m apart, will not be able to detect a collision when both simultaneously transmit a 64-byte frame at gigabit speed. This inability to detect collisions will eventually lead to network instability.

The mechanism to preserve the 200 m network diameter is known as carrier extension. Carrier extension, developed by Sun Microsystems Inc. (Mountain View, California), is a way of maintaining the IEEE 802.3 standard minimum and maximum frame size while enabling a meaningful network diameter. The resultant mechanism leaves the CSMA/CD algorithm unchanged. Carrier extension increases the slot time to 512 bytes rather than the 64 bytes defined in IEEE 802.3. If the frame is shorter than 512 bytes, then it is transmitted in a 512 byte window and the transmitted frame is padded with carrier extension symbols.

Upon receipt of a frame carrying carrier extension symbols, the entire extended frame is considered for collision and dropped if necessary. However, the Frame Check Sequence (FCS) is calculated only on the original frame (without the extension symbols). The extension symbols are removed before the FCS is checked by the receiver. So the logical link control layer is not even aware of the carrier extension.

Carrier extension wastes bandwidth. For example, a small packet of 64 bytes will have 448 padding bytes of carrier extension symbols added. This results in an additional overhead of 700%, or a degradation in throughput to approximately 12.5% of the theoretical maximum. In addition, carrier extension increases the collision rate, which may increase the number of lost frames. In fact, for a large number of small packets, the Gigabit Ethernet throughput is only marginally better than 100BaseT.
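
The 700% and 12.5% figures follow directly from the 512-byte slot. The short, purely illustrative program below reproduces them for the minimum 64-byte frame.

/*
 * Worked example of the carrier extension cost: a frame shorter than the
 * 512-byte slot is padded up to 512 bytes with extension symbols.
 */
#include <stdio.h>

int main(void)
{
    int slot  = 512;            /* extended slot time, in bytes           */
    int frame = 64;             /* minimum Ethernet frame                 */
    int pad   = slot - frame;   /* 448 bytes of carrier extension symbols */

    printf("padding overhead: %d%%\n", 100 * pad / frame);       /* 700%  */
    printf("slot utilisation: %.1f%%\n", 100.0 * frame / slot);  /* 12.5% */
    return 0;
}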


To gain back some of the performance lost due to carrier extension, NBase Communication proposed a solution known as packet bursting. It is essentially a modification to the carrier extension procedure. The idea is to transmit a burst of frames every time the first frame has successfully passed the collision window of 512 bytes. Carrier extension is only applied to the first frame in a burst. This essentially averages the time wasted in carrier extension symbols over the few frames that are transmitted. Packet bursting substantially increases the throughput and does not change the dynamics of the CSMA/CD algorithm. It only slightly modifies the existing MAC definition.

2.3. Issues affecting Gigabit Ethernet performance

Communication performance is affected by a number of factors, including CPU speed, I/O speed, bus architecture, network adaptors, device drivers, and protocol stack processing. While most of these factors do not contribute significantly to the performance of slower networks, they begin to become significant factors in high speed networks.

Gigabit Ethernet provides the bandwidth required to meet the demands of current and future applications. However, it has also shifted the communication bottleneck from network media to hardware and software components. It was thus critical to improve or tune these components in order to achieve high speed transmission. Since the introduction of Gigabit Ethernet, vendors have made significant improvements to Gigabit Ethernet network interface cards to reduce latency and improve performance, including such features as adding TCP checksum calculation.


Since TCP was originally engineered to provide a robust general transport protocol, it is not by default optimized for streams of data coming in and out of the system at high transmission rates (such as 1Gbps).

Some of the major issues which affect Gigabit Ethernet performance on clusters include:

• Different versions of the Linux kernel

• Maximum Transmission Unit (MTU)

• Transmit queue length

• Processor speed

• Different device drivers and NICs

• Socket buffer size

MTU, short for Maximum Transmission Unit, is the largest physical packet size that can be transmitted across a network. Any messages larger than the MTU are divided into smaller packets before being transmitted in Ethernet frames. The MTU determines the size of packets being transmitted, and it is a well established fact that the MTU can be a limiting factor in determining throughput. To preserve compatibility with 10 Mbps and 100 Mbps Ethernet, the Gigabit Ethernet standard still limits the MTU to 1500 bytes. Standards bodies are reluctant to change this since, among other issues, they wish to avoid the complications in specifying how larger frames transitioning from networks with MTU greater than 1500 to ones with MTU of 1500 should be handled. This would be a fairly widespread transition if Gigabit Ethernet supported MTUs greater than 1500, since the slower Ethernet standards do not. One of the common uses for Gigabit Ethernet was expected to be in aggregating switches which take multiple 100 Mbps input streams from workstations and output to other Gigabit Ethernet switches on a gigabit link. An efficient implementation of Gigabit Ethernet with MTU greater than 1500 bytes would probably require switches to resegment Ethernet frames greater than 1500 bytes and recompute the checksums. This would add to the cost of switches. It is to be expected that high speed networks such as Gigabit Ethernet would benefit from an MTU larger than 1500. In addition to improving the throughput, one would expect that a larger MTU would also reduce the load on the CPU by reducing the number of frames which would need to be processed for large message sizes. As a result of these factors, some companies, notably Alteon, have enhanced the Gigabit Ethernet functionality by adding a facility to support MTUs, and hence frame sizes, greater than 1500 bytes. Alteon coined the name Jumbo Frames for this functionality, and their network interface cards (NICs) and switches support Jumbo Frames of up to 9000 bytes.
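
On the Linux systems used in this work the interface MTU is normally raised with ifconfig, but the same change can be made from a program through the standard SIOCSIFMTU ioctl, as in the minimal sketch below. The interface name eth1 and the 9000-byte Jumbo Frame size are assumptions for illustration; the NIC driver and any switch in the path must both support the larger frame size.

/*
 * Minimal sketch: set the MTU of a Gigabit Ethernet interface from C,
 * equivalent to "ifconfig eth1 mtu 9000". Requires root privileges.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>

int set_mtu(const char *ifname, int mtu)
{
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);  /* any socket will do for the ioctl */
    if (fd < 0) { perror("socket"); return -1; }

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    ifr.ifr_mtu = mtu;

    if (ioctl(fd, SIOCSIFMTU, &ifr) < 0) {
        perror("SIOCSIFMTU");
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}

int main(void)
{
    return set_mtu("eth1", 9000) == 0 ? 0 : 1;   /* assumed interface name */
}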

Processor speed also becomes an important factor in achieving higher throughput and lower latency on Gigabit Ethernet. Faster processors can attain higher throughput for large transfer block sizes. This is largely due to the fact that faster processors can process the protocol stacks and calculate TCP checksums faster than the slower processors.

An increase in the transmit queue length (txqueuelen) parameter also improves performance, especially for high-speed connections that perform large, homogeneous data transfers. Increasing the transmit queue length, however, consumes memory, which is then not available for user programs. Different brands of network interface cards (NICs) and different versions of device drivers also affect throughput.


The socket buffer size determines the size of the TCP sliding window and thus the number of packets which can be sent without an acknowledgment (ACK) being received from the receiver. The increase in socket buffer size means that additional memory is used for buffering in the socket software implementation of TCP.
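
For reference, the sketch below shows how an application requests larger socket buffers with setsockopt(). The 256KB value mirrors the larger buffer size used in the experiments in Chapter 5; the kernel may clamp the request to its configured maximum, so the value actually granted is read back with getsockopt().

/*
 * Minimal sketch: request larger TCP send and receive buffers.
 */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    int size = 256 * 1024;                 /* requested buffer size in bytes */
    socklen_t len = sizeof(size);

    setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size));
    setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size));

    /* Read back what the kernel actually granted. */
    getsockopt(sock, SOL_SOCKET, SO_SNDBUF, &size, &len);
    printf("effective send buffer: %d bytes\n", size);
    return 0;
}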

Chapter 3

VI Architecture Overview

The VI Architecture [7] is a user-level memory-mapped communication architecture that is designed to achieve low latency and high bandwidth across a cluster of computers. It attempts to reduce the amount of software overhead imposed by traditional communication models, by avoiding the kernel involvement in each communication operation.

In traditional models, the operating system (OS) virtualizes the network hardware into a set of logical communication endpoints available to network consumers. The operating system multiplexes access to the hardware between communication endpoints and therefore all communication operations require a call or trap into the operating system kernel, which can be quite expensive.

The VI Architecture eliminates the system-processing overhead of the traditional model by providing each consumer process with a directly accessible interface to the network hardware - a Virtual Interface (VI). Each VI represents a communication endpoint, and a pair of VIs can be connected to form communication channels for bidirectional point-to-point data transfer. A process may own multiple VIs exported by one or more network adapters. A network adapter performs the endpoint virtualization directly and handles the tasks of multiplexing, de-multiplexing, and data transfer scheduling, normally performed by an OS kernel and device driver. An adapter may completely ensure the reliability of communication between connected VIs.

Each VI has a pair of work queues: one for send and one for receive. VI Consumers send and receive messages by posting requests, in the form of descriptors, to these queues. These requests are processed asynchronously, directly by the network interface controller (the VI Provider), and marked with a status value when completed. VI Consumers can then remove these descriptors from the queue and reuse them if necessary. Completion queues allow the VI Consumer to combine the descriptor completion events of multiple VIs into a single queue.

3.1 VI Architecture

The VI Architecture comprises four basic components:

• Virtual Interfaces

• VI Providers

• VI Consumers

• Completion queues

The VI Provider is composed of a physical network adapter and a Kernel Agent. The VI Consumer is generally composed of an application program and an operating system communication facility. The organization of these components is illustrated in Figure 3.1.


Figure 3.1 VI Architecture

3.1.1. Virtual Interfaces

A Virtual Interface is the mechanism that allows a VI Consumer to directly access a VI Provider to perform data transfer operations. A VI consists of a pair of Work Queues: a send queue and a receive queue. VI Consumers post requests, in the form of Descriptors, on the Work Queues to send or receive data. A Descriptor is a memory structure that contains all of the information that the VI Provider needs to process the request, such as pointers to data buffers. A VI Provider processes the posted Descriptors and marks them with a status value when completed. A VI Consumer removes completed Descriptors from the Work Queues and uses them for subsequent requests. Each Work Queue has an associated Doorbell that is used to notify the VI network adapter that a new Descriptor has been posted to a Work Queue. The Doorbell is directly implemented by the adapter and requires no OS intervention to operate.

A Completion Queue allows a VI Consumer to coalesce notification of Descriptor completions from the Work Queues of multiple VIs in a single location.

Figure 3.2 Virtual Interface


3.1.2. VI Provider

The VI Provider is the set of hardware and software components responsible for instantiating a Virtual Interface. The VI Provider consists of a network interface controller (NIC) and a Kernel Agent. The VI NIC implements the Virtual Interfaces and Completion Queues and directly performs data transfer functions. The Kernel Agent is a privileged part of the operating system, usually a driver supplied by the VI NIC vendor, which performs the setup and resource management functions needed to maintain a Virtual Interface between VI Consumers and VI NICs. These functions include the creation and destruction of VIs, VI connection setup/teardown, interrupt management and processing, management of system memory used by the VI NIC, and error handling. VI Consumers access the Kernel Agent using standard operating system mechanisms such as system calls. Kernel Agents interact with VI NICs through standard operating system device management mechanisms.

3.1.3. VI Consumer

The VI Consumer represents the user of a Virtual Interface. While an application program is the ultimate consumer of communication services, applications access these services through standard operating system programming interfaces such as Sockets or MPI.

The OS communication facility is generally implemented as a library that is loaded into the application process. The OS facility makes system calls to the Kernel Agent to create a VI on the local system and connect it to a VI on a remote system. Once a connection is established, the OS facility posts the application's send and receive requests directly to the local VI.

The OS communication facility often loads a library that abstracts the details of the underlying communication provider, in this case the VI and Kernel Agent. This component is shown as the VI User Agent in Figure 3.1. It is supplied by the VI hardware vendor, and conforms to an interface defined by the OS communication facility.

3.1.4. Completion Queues

Notification of completed requests can be directed to a Completion Queue on a per-VI Work Queue basis. This association is established when a VI is created. Once a VI Work Queue is associated with a Completion Queue, all completion synchronization must take place on that Completion Queue.

As with VI Work Queues, notification status can be placed into the Completion Queue by the VI NIC without an interrupt, and a VI Consumer can synchronize on a completion without a kernel transition. Thus the usual overhead of a trap to the kernel is avoided.


Figure 3.3 VI Architecture Completion Queue Model
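
A minimal sketch of this usage with the VIPL calls implemented by M-VIA is shown below. The prototypes and constant names are approximate; the authoritative definitions are in vipl.h.

/*
 * Sketch: create a Completion Queue and wait on it. A single CQ collects
 * completions from the Work Queues of several VIs (each VI is associated
 * with the CQ when it is created with VipCreateVi). Prototypes may differ
 * slightly between VIPL implementations.
 */
#include <vipl.h>

void drain_completions(VIP_NIC_HANDLE nic)
{
    VIP_CQ_HANDLE cq;
    VIP_VI_HANDLE vi;
    VIP_BOOLEAN   on_recv_queue;

    VipCreateCQ(nic, 1024, &cq);   /* the NIC must support at least 1024 entries */

    for (;;) {
        /* Reports which VI, and which of its Work Queues, completed a
         * Descriptor; no kernel trap is needed on this path. */
        if (VipCQWait(cq, VIP_INFINITE, &vi, &on_recv_queue) != VIP_SUCCESS)
            break;

        /* The completed Descriptor is then dequeued from that Work Queue,
         * e.g. with VipSendDone() or VipRecvDone() on "vi". */
    }
}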

3.2. Data Transfer Models

There are two types of data transfer facilities provided by the Virtual Interface Architecture: Send/Receive and Remote Direct Memory Access (RDMA). Since Send/Receive is used in the VIA implementation used here, we shall omit a description of the RDMA method.


3.2.1 Send/Receive

The Send/Receive model of the VI Architecture follows a well known and well understood model of transferring data between two endpoints. On the sending side, the sending process specifies the memory regions that contain the data to be sent. On the receiving side, the receiving process specifies the memory regions where the data will be placed. Given a single connection, there is a one-to-one correspondence between send Descriptors on the transmitting side and receive Descriptors on the receiving side.

The VI Consumer at the receiving end pre-posts a Descriptor to the receive queue of a VI. The VI Consumer at the sending end can then post the message to the corresponding VI’s send queue. The Send/Receive model of data transfer requires that the VI Consumers be notified of Descriptor completion at both ends of the transfer, for synchronization purposes.

VI Consumers are responsible for managing flow control on a connection. The VI Consumer on the receiving side must post a Receive Descriptor of sufficient size before the sender's data arrives. If the Receive Descriptor at the head of the queue is not large enough to handle the incoming message, or the Receive Queue is empty, an error will occur.

The VI Architecture differs from some existing models in that all Send/Receive operations complete asynchronously.
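
The exchange described above maps onto a short sequence of VIPL calls. The sketch below assumes an already connected VI, a buffer inside a registered Memory Region, and a prepared Descriptor; the descriptor layout and exact prototypes come from vipl.h and may differ slightly between VIPL implementations.

/*
 * Sketch of a Send/Receive exchange with VIPL. "vi" is a connected VI,
 * "desc" is a VIP_DESCRIPTOR whose data segment already points at a
 * registered buffer, and "mh" is that buffer's Memory Handle.
 */
#include <vipl.h>

void receive_one(VIP_VI_HANDLE vi, VIP_DESCRIPTOR *desc, VIP_MEM_HANDLE mh)
{
    VIP_DESCRIPTOR *done;

    /* A Receive Descriptor must be posted before the matching send arrives. */
    VipPostRecv(vi, desc, mh);

    /* Block until the NIC marks the Descriptor complete. */
    VipRecvWait(vi, VIP_INFINITE, &done);
}

void send_one(VIP_VI_HANDLE vi, VIP_DESCRIPTOR *desc, VIP_MEM_HANDLE mh)
{
    VIP_DESCRIPTOR *done;

    /* Post the send Descriptor; completion is reported asynchronously. */
    VipPostSend(vi, desc, mh);
    VipSendWait(vi, VIP_INFINITE, &done);
}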


3.3. Managing VI Components

This section discusses how the components of a Virtual Interface are created, destroyed, and managed.

3.3.1. Accessing a VI NIC

A VI Consumer gains access to the Kernel Agent of a VI Provider using standard operating system mechanisms. Normally, this involves opening a handle to the Kernel Agent that represents the target VI NIC. The VI Consumer uses this handle to perform general management operations such as registering Memory Regions, creating Completion Queues, and creating VIs. This mechanism would also be used to retrieve information about the VI NIC, such as the reliability levels it supports and its maximum transfer size limits.

VI hardware resources cannot be shared across multiple VI NICs, even if they are managed by the same Kernel Agent. Hardware resources may include Completion Queues, mapped memory and other resources that are associated with an instance of the hardware.

A Kernel Agent must use standard operating system mechanisms to detect when a VI Consumer process exits so that it can clean up any resources used by the process. The Kernel Agent must keep track of all resources associated with a VI Consumer's use of a VI NIC.


3.3.2 Registering and De-registering Memory

The VI Architecture requires that memory used for data transfers, both buffers and Descriptors, be registered with the VI Provider. The memory registration process defines one or more virtually contiguous physical pages as a Memory Region. A VI Consumer registers a Memory Region with the Kernel Agent, which returns a Memory Handle that, along with its virtual address, uniquely identifies the registered region. The VI Consumer must qualify any virtual address used in an operation on a VI with the corresponding Memory Handle. A VI Consumer must de-register a Memory Region when the region is no longer in use.

When a Memory Region is registered, every page within the region is locked down in physical memory. This guarantees to the VI NIC that the memory region is physically resident (not paged out) and that the virtual-to-physical memory translation remains fixed while the NIC is processing requests. The VI Kernel Agent manages the VI NIC's Page Table. The Page Table contains the mapping and protection information for registered Memory Regions.
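
In VIPL terms, registration is a single call against the NIC handle. The helper below is a sketch that leaves the attribute structure at defaults; the exact prototype and attribute fields are defined in vipl.h and may differ between implementations.

/*
 * Sketch: register a buffer so the NIC may use it for data transfers.
 * Pins the pages and enters them into the NIC's Page Table.
 */
#include <string.h>
#include <vipl.h>

VIP_RETURN register_buffer(VIP_NIC_HANDLE nic, void *buf, VIP_ULONG len,
                           VIP_MEM_HANDLE *mh)
{
    VIP_MEM_ATTRIBUTES attrs;
    memset(&attrs, 0, sizeof(attrs));   /* default protection attributes */

    return VipRegisterMem(nic, buf, len, &attrs, mh);
}

/* The region must be de-registered when it is no longer needed:
 *     VipDeregisterMem(nic, buf, handle);
 */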

3.3.3 Creating and Destroying VIs

A VI is created by a VI Provider at the request of a VI Consumer. A VI consists of a pair of Work Queues and a pair of Doorbells, one for each Work Queue. Work Queues are structures that are allocated from a VI Consumer process' virtual memory. The VI Provider maps and locks this memory and informs the VI NIC of its location. A Doorbell is a hardware resource located on the VI NIC and mapped by the Kernel Agent into the virtual address space of a VI Consumer process using standard operating system facilities. The VI Provider supplies the VI Consumer with the information needed to directly access these structures when a VI is created. If these resources cannot be allocated and mapped, an error will result and the VI will not be created.

There is no connection established upon creation of a VI. No data will flow until the VI is connected to another VI. See Section 3.4 for more information on connecting VIs.
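
A sketch of NIC access and VI creation with VIPL follows. The device name and the attribute values are assumptions for illustration; the real device name comes from the M-VIA installation, and the attribute field and constant names are defined in vipl.h.

/*
 * Sketch: open a VI NIC and create an (as yet unconnected) VI.
 * Prototypes and attribute fields are approximate.
 */
#include <string.h>
#include <vipl.h>

VIP_RETURN create_endpoint(VIP_NIC_HANDLE *nic, VIP_VI_HANDLE *vi)
{
    VIP_VI_ATTRIBUTES attrs;
    VIP_RETURN rc;

    rc = VipOpenNic("/dev/via_eth0", nic);   /* assumed M-VIA device name */
    if (rc != VIP_SUCCESS)
        return rc;

    memset(&attrs, 0, sizeof(attrs));
    /* Both ends of a connection must agree on the reliability level. */
    attrs.ReliabilityLevel = VIP_SERVICE_RELIABLE_DELIVERY;
    attrs.MaxTransferSize  = 32 * 1024;      /* must not exceed the NIC limit */

    /* No Completion Queues are attached here (NULL send/recv CQ handles). */
    return VipCreateVi(*nic, &attrs, NULL, NULL, vi);
}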

A VI Consumer should instruct a VI Provider to destroy a VI that is no longer in use. A VI cannot be destroyed if any packets remain on its Work Queues. It may only be destroyed if it is in the Idle state. See Section 3.5 for a discussion of VI states. The Work Queue pair and Doorbell are de-allocated when the associated VI is destroyed. In order to avoid consuming large parts of a VI Consumer's virtual address space, it is recommended that the VI Provider map multiple Doorbells into a single page if a VI Consumer opens multiple VIs. Doorbells that belong to different processes must be mapped in different pages.

3.3.4 Creating and Destroying Completion Queues

A Completion Queue can be used to direct notification of Descriptor completions from multiple Work Queues to a single location. The Work Queues associated with a Completion Queue may span multiple VIs on the same VI NIC. A Completion Queue is created by a VI Provider at the request of a VI Consumer and must be created before any of its associated VI Work Queues are created. Each VI Work Queue is optionally associated with a Completion Queue when the VI is created. Work Queues on the same VI may be associated with different Completion Queues, if desired.

The maximum number of Descriptors that can be outstanding at any given time in a Completion Queue is defined by the VI Consumer when the Completion Queue is created. The VI Consumer is responsible for ensuring that this number is large enough to prevent overflow of the queue. The VI NIC must be able to support Completion Queues with at least 1024 entries.

In order to create a Completion Queue, the VI Provider allocates memory for the queue in the VI Consumer’s virtual address space. It then maps and locks this memory and informs the VI NIC of its location. If enough memory cannot be allocated, or it cannot be mapped and locked, an error will result and the Completion Queue will not be created.

Completion Queues may be resized dynamically through the VI Provider. It is important to understand that while this operation is taking place, all I/O to the Completion Queue may cease, depending on the VI Provider's implementation of this function. Incoming requests should still be satisfied, and no incoming data should be rejected unless there is an insufficient number of Descriptors.

A VI Consumer should instruct a VI Provider to destroy a Completion Queue that is no longer in use. A Completion Queue cannot be destroyed until all VIs associated with it have been destroyed. VI Providers are responsible for destroying any Completion Queues still associated with a process when the process is destroyed by the operating system.


A Disconnect request will transition the VI into the Idle state. Descriptors pending or posted to either the Receive Queue or the Send Queue when the VI is in this state will result in the Descriptor being completed in error.

Inbound traffic sent to this VI is refused. There is no outbound traffic, since requests posted to the Send Queue are completed in error. Any outbound traffic left on a queue when the VI transitions into this state is aborted, and the corresponding Descriptors are completed in error.

3.4. VI Connection and Disconnection

The VI Architecture provides connection-oriented data transfer service. When a VI is initially created, it is not associated with any other VI. A VI must be connected with another VI through a deliberate process in order to transfer data. When data transfer is completed, the associated VIs must be disconnected.

3.4.1. VI Connection

A VI Consumer issues a request to its VI Provider in order to connect its VI to a remote VI. VI Providers must implement robust and reliable connection protocols. In particular, VI Providers must prevent interference with current connections and the creation of stale or duplicate connections by delayed or duplicate packets from extinct connections.

The endpoint association model is a client-server model. The server side waits for incoming connection requests and then either accepts them or rejects them based on the attributes associated with the remote VI. A state diagram depicting this process is shown in Figure 3.4.

Figure 3.4 VI Connection Process

The server's VI Consumer issues a ConnectWait request to its VI Provider. This request contains the discrimination values that are acceptable to the VI Consumer. A VI Consumer should be able to accept a connection from any remote endpoint or a specific remote endpoint, based on the discriminator supplied. The request also contains a data structure used to receive information about the remote VI that is requesting a connection, and may indicate a timeout value.

Sometime after the server VI Consumer begins waiting for a connection, the client VI Consumer issues a ConnectRequest request to its VI Provider. This request specifies the local VI that is to be connected, an address structure that indicates the remote VI to which to connect, and a timeout value. It also specifies a data structure used to receive information about the corresponding server VI, if the connect operation completes successfully.

The client’s ConnectRequest request results in one of two actions. If the specified remote VI does not exist, is not reachable, is in the wrong state, or its discriminator doesn’t match, then the VI Provider will return an error to the VI Consumer’s request. If the specified remote VI is available then the server VI Consumer’s ConnectWait request completes, and information about the client VI is returned to the server VI Consumer. A unique identifier for the incoming connection request is also returned.

The server VI Consumer then decides whether to accept this incoming request or to reject it. If the server intends to accept the connection, it must prepare a VI for the connection. The server VI Consumer may either choose a VI from a pool that it has previously created or it may create a new VI with attributes it considers appropriate for this connection request. The reliability level of the new VI must match that of the remote VI. The VI Providers must also agree on the MTU to be used on the connection.

If the server VI Consumer intends to accept the connection, it issues a ConnectAccept request to the VI Provider, specifying the incoming connection ID as well as the local VI to be used. If the local VI's reliability attributes match those required by the remote VI, the connection is established and the VI state transitions according to the state diagram in Figure 3.5. If the local VI's attributes do not meet the requirements, then the ConnectAccept will complete in error; however, the incoming connection request remains valid. The server VI Consumer must either issue a valid ConnectAccept request, or reject the connection.

If the server intends to reject the connection, it issues a ConnectReject request to the VI Provider, specifying the incoming connection ID. If the connection request was rejected, the client VI Consumer's ConnectRequest request returns with a status indicating that fact. If the connection request was accepted, the client VI Consumer's ConnectRequest request returns successfully.

The VI connection model does not attempt to apply either authorization or authentication to a VI connection. It is recommended that connected VI Consumers perform an authentication process, especially before providing RDMA access to registered memory.

The VI connection model must also allow a connection to be established between two endpoints on the same node.
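
The client-server exchange described above corresponds roughly to the VIPL sequence sketched below. Building the VIP_NET_ADDRESS structures (host address plus discriminator) is NIC-specific and omitted here, error handling is elided, and the prototypes are approximate.

/*
 * Sketch of the VI connection handshake with VIPL.
 * "local" and "remote" are VIP_NET_ADDRESS structures assumed to be
 * already filled in with host addresses and discriminators.
 */
#include <vipl.h>

/* Server side: wait for a matching request, then accept it on a prepared VI. */
void server_accept(VIP_NIC_HANDLE nic, VIP_VI_HANDLE vi, VIP_NET_ADDRESS *local)
{
    VIP_NET_ADDRESS   remote;
    VIP_VI_ATTRIBUTES remote_attrs;
    VIP_CONN_HANDLE   conn;

    /* Blocks until a client whose discriminator matches arrives. */
    VipConnectWait(nic, local, VIP_INFINITE, &remote, &remote_attrs, &conn);

    /* The reliability level of "vi" must match remote_attrs; otherwise the
     * accept completes in error and the request remains pending. */
    VipConnectAccept(conn, vi);
}

/* Client side: actively request a connection to the server's VI. */
void client_connect(VIP_VI_HANDLE vi, VIP_NET_ADDRESS *local, VIP_NET_ADDRESS *remote)
{
    VIP_VI_ATTRIBUTES remote_attrs;

    VipConnectRequest(vi, local, remote, VIP_INFINITE, &remote_attrs);
}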

3.4.2. VI Disconnection

A VI Consumer issues a Disconnect request to a VI Provider in order to disconnect a connected VI. This unilaterally aborts the connection and will result in the completion of all outstanding Descriptors on that VI endpoint. The Descriptors are completed with the appropriate error bit set. Implementers must ensure that stale connections cannot be reused.

A VI Provider may issue an asynchronous notification to the VI Consumer of a VI that has been disconnected by the remote end, but this feature is not a requirement. A VI Provider is required to detect that a VI is no longer connected and notify the VI Consumer. Minimally, the consumer must be notified upon the first data transfer operation that follows the disconnect. When a VI Consumer issues a Disconnect request for a VI, the VI will transition to a new state as illustrated in Figure 3.5.

3.4.3. VI Address Format

Each VI Provider must define an address format that uniquely identifies all possible VIs on a SAN. A VI Consumer must be aware of the address format used by a VI Provider.

The address format must allow VI discrimination across systems as well as on the same node and must also permit distinguishing between connection requests from only a specific VI or from any VI. The VI Address Format does not require support for multicast or broadcast capability.


Figure 3.5 VI State Diagram

3.5. VI States

A VI may be in one of four states throughout its life. The four states are Idle, Pending Connect, Connected, and Error. Transitions between states are driven by requests issued by the VI Consumer and by network events. Requests that are not valid while a VI is in a given state, such as submitting a connect request while in the Pending Connect state, must be returned with an error by the VI Provider. The states and transitions are illustrated in Figure 3.5.

Chapter 4

NetPIPE

4.1. Introduction

In recent years, much research has been directed towards evaluating the performance of high speed networks. The design of NetPIPE has been motivated by the need to assess the performance of communication bound applications. NetPIPE helps to answer questions that surround network communications inherent to these applications. The two most popular tools, ttcp [30] and netperf [29], are based on the TCP/IP communications protocol. While netperf has the ability to map network performance, comparing network protocols with these tools is difficult if not impossible. Finding the effective maximum bandwidth using ttcp is an exercise in delving into protocol internals. Knowledge of the appropriate buffer size, alignment address, and protocol settings is required to achieve data transfer at the effective maximum bandwidth.

Network Protocol Independent Performance Evaluator (NetPIPE) was developed by Snell, Mikler, and Gustafson [1] at Ames Laboratory. It encapsulates features of ttcp and netperf and provides a visual representation of network performance under a variety of conditions. By taking the end-to-end application view of a network, NetPIPE attempts to show the overhead associated with different protocol layers.


4.2. NetPIPE Design

The design of NetPIPE consists of two parts.

i) A protocol independent driver

ii) Protocol specific communication APIs.

Figure 4.1 Network Protocol Independent Performance Evaluator

The communication APIs contain the necessary functions to establish a connection, send and receive data, and close a connection. This part is different for each protocol. However, the interface between the driver and protocol module remains the same. Therefore, the driver does not have to be altered in order to change communication protocols. Currently, NetPIPE supports TCP, PVM, and MPI communication protocols.

The device independent driver implements a ping-pong like program which keeps increasing the transfer block size from a single byte to large blocks until transmission time exceeds 1 second. This means that NetPIPE includes a variable time benchmark and will scale to all network speeds. Unlike fixed size benchmark tests, NetPIPE will not become outdated and inaccurate with the increasing speeds of upcoming technology advances.

Let the block size be c. For each block size c, three measurements are taken: c − p bytes, c bytes, and c + p bytes, where p is a perturbation factor with a default value of 3. This allows examination of block sizes that are possibly slightly smaller or larger than an internal network buffer. For each measurement, NetPIPE uses the following algorithm:

/* First set T to a very large time. */
T = MAXTIME
For i = 1 to NTRIALS
    t0 = Time()
    For j = 1 to nrepeat
        if I am transmitter
            Send data block of size c
            Recv data block of size c
        else
            Recv data block of size c
            Send data block of size c
        endif
    endFor
    t1 = Time()
    /* Insure we keep the shortest trial time. */
    T = MIN(T, t1 - t0)
endFor
T = T / (2 * nrepeat)


The variable nrepeat is calculated based on the time of the last data transfer. The intent is to repeat the experiment enough times such that the total time for the experiment is far greater than the timer resolution. The default target time is 0.5 seconds. For most modern computers, this provides a sufficiently precise data transfer time. Given that the last transfer time was tlast seconds for a block size bsz1, the value of nrepeat for block size bsz2 is approximated as:

    nrepeat = TARGET / ((bsz2 / bsz1) * tlast)
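
A direct transcription of this estimate is shown below; the identifiers are illustrative and are not taken from the NetPIPE source.

/*
 * Sketch of the nrepeat estimate above.
 */
#define TARGET_TIME 0.5     /* desired duration of one trial, in seconds */

static int estimate_nrepeat(double t_last, int bsz_prev, int bsz_next)
{
    /* Scale the last measured transfer time to the new block size and
     * choose enough repetitions to fill roughly TARGET_TIME seconds. */
    double predicted = ((double)bsz_next / bsz_prev) * t_last;
    int nrepeat = (int)(TARGET_TIME / predicted);

    return nrepeat > 1 ? nrepeat : 1;   /* always run at least once */
}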

4.3. NetPIPE Results

NetPIPE produces a file that contains the transfer time, throughput, block size, and transfer time variance for each data point and is easily plotted by any graphing package.

For instance, Figure 4.2 presents the throughput versus the transfer block size for a typical Ethernet link.


Figure 4.2: Ethernet Throughput

This graph is referred to as the throughput graph . From this graph, it is easy to see the maximum throughput for any network. However, it is difficult to analyze the latency, an equally important statistic. A graph that is easier to read and analyze is the network signature graph . One such graph is shown in Figure 4.3. It depicts the transfer speed versus the elapsed time; hence it represents a network “acceleration” graph. It is very similar to the way computer performance is presented by the HINT performance metric.


Figure 4.3: Network Signature Graph

Although unconventional, this graph represents perhaps a better approach to visualizing network performance. All the necessary data are clearly visible and easy to extrapolate. The network latency coincides with the time of the first data point on the graph. The maximum attainable throughput is clearly shown as the maximum point on the graph.

4.4. VIA driver for NetPIPE

For our performance evaluation, we modified a research implementation of a VIA communication protocol module for NetPIPE originally written by Dr. Hong Ong [6]. The set of NetPIPE communication APIs needed for the protocol specific module includes those for establishing a connection, closing a connection, sending and receiving data, and performing synchronization. The implementation is based on the VIPL library. To keep the implementation simple, NetPIPE-VIA creates a pair of VI endpoints per connection.

A fixed number of send and receive packet descriptors are pre-allocated and each descriptor has a fixed size of registered (pinned) memory which is equal to the maximum data buffer size supported by the VI Provider. The descriptors are chained together to form a ring. To send a message, NetPIPE-VIA gets a descriptor from the send ring and posts the descriptor to the send queue. After the completion of a send operation, the descriptor is inserted back into the ring. VIA requires packet descriptors to be posted on the receive queue before any message arrives, otherwise the message will be lost.

Therefore, NetPIPE-VIA pre-posts all the receive descriptors before the reception of messages occurs. Whenever a packet arrives, it gets a descriptor out of the receive queue, the packet is processed and the descriptor is posted back to the receive queue again. For each measurement, the protocol independent driver determines the size of the data block either linearly or exponentially depending on a user specified command line option.

Hence, the memory buffer for a data block of size c is dynamically allocated at run time. In order to achieve zero-copying and avoid the extra overhead of pinning and unpinning the memory buffer for each data block, NetPIPE-VIA pre-allocates and pre-registers a pool of memory buffers. All memory requirements of the protocol independent driver are satisfied from this memory pool. This also keeps the memory management in NetPIPE-VIA relatively simple.


When transmitting a large data block, the message will be fragmented in order to fit into a descriptor's data segment. This implies that multiple descriptors are needed to either transmit or receive large messages. Consequently, flow control is required to prevent the sender from overflowing the receiver's pre-posted receive descriptors.

NetPIPE-VIA implements a simple flow control scheme.

On the sender side, it continues to transmit until either the entire data block c is sent or the number of sends reaches the maximum number of pre-posted descriptors of the receiver. For the latter case, the sender waits for a "continue" message from the receiver before sending more packets. On the receiver side, it continues to receive packets until either the entire data block c is received or it reaches the maximum number of pre-posted descriptors. If the receiver runs out of pre-posted descriptors, it stops receiving and waits for all receive requests to complete. Then, it sends a "continue" message to inform the sender to continue to send more packets.
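
The sender side of this scheme can be sketched as follows. The helper functions and the window constant are hypothetical stand-ins for the actual VIPL descriptor handling in NetPIPE-VIA; they are declared here only so the sketch is self-contained.

/*
 * Sketch of the sender-side flow control described above.
 * post_send_fragment() and wait_for_continue() are hypothetical helpers;
 * MAX_PREPOSTED mirrors the number of receive descriptors pre-posted by
 * the peer, and FRAG_SIZE the maximum data segment of one descriptor.
 */
#define MAX_PREPOSTED 32
#define FRAG_SIZE     (32 * 1024)

void post_send_fragment(const char *data, long len);   /* VipPostSend + VipSendWait */
void wait_for_continue(void);                           /* blocks on a control message */

void send_block(const char *buf, long total)
{
    long sent = 0;
    int  in_flight = 0;

    while (sent < total) {
        long chunk = (total - sent < FRAG_SIZE) ? (total - sent) : FRAG_SIZE;

        post_send_fragment(buf + sent, chunk);
        sent += chunk;
        in_flight++;

        /* The peer's pre-posted descriptors may now be exhausted: wait for
         * its "continue" message before sending more fragments. */
        if (in_flight == MAX_PREPOSTED && sent < total) {
            wait_for_continue();
            in_flight = 0;
        }
    }
}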

Chapter 5

Performance Comparison

This chapter describes the testing environment for the communication benchmarking that we conducted. First, we will present the hardware and software environment. Next, we present how some TCP and Ethernet parameters are tuned to accommodate gigabit speeds, and discuss some performance comparisons related to TCP and VIA.

5.1. Testing Environment and Network Parameters

The initial experiments conducted to evaluate the communication latency and throughput were performed on three different sets of machines.

• f31 and f32, nodes of the fianna Kent State Computer Science Beowulf cluster, consisting of dual processor Pentium III PCs running at 450MHz with a 100 MHz bus, and 256MB of SD-RAM. The cluster operating system was the Red Hat 8.0 Linux distribution with kernel version 2.4.18.

• Frodo and Legolas, consisting of single Intel Pentium IV processors running at 1500MHz in a Supermicro P4STA Pentium 4 motherboard with the Intel 850 controller chip. The memory consists of dual 256MB 800MHz RDRAM modules for 512MB of memory. The Intel 850 supports a 400MHz system data bus and five 33MHz/32bit PCI slots.


• C1 and C2, Dell Optiplex GX260 mini-towers with single Intel Pentium IV processors running at 2.4GHz. The motherboard has four 33MHz/32bit PCI slots and 512MB of DDR SD-RAM.

The experiments conducted to evaluate the communication latency and throughput were performed on these three different sets of machines with the same software. Each machine had the Linux 2.4.18 operating system kernel and M-VIA version 1.2 [13] installed on it. Each node had one 100 Mbps Ethernet card and one SysKonnect SK-NET Gigabit Ethernet adapter (SK-9843 SX). Each 100 Mbps Ethernet card was used for external communication, which helps to isolate the test environment from other network traffic. This is important for the accuracy of the tests. The Gigabit Ethernet cards for each set of machines were connected back to back. All the tests were done using the same version of NetPIPE.

5.2 TCP Comparisons

Here we compare the throughput results of Gigabit Ethernet using different MTU sizes, different socket buffer sizes, and different transmit queue lengths (txqueuelen).

As described in Chapter 2, MTU stands for Maximum Transmission Unit, the largest physical frame size that can be transmitted across a network. The Ethernet standard IEEE 802.3 sets the maximum MTU to 1500 bytes. However, some vendors have permitted MTUs larger than 1500 bytes; these are non-standard and are sometimes called Jumbo frames. The first vendor to provide these as a non-standard feature was Alteon, whose ACEnic Gigabit NICs and switches permitted setting the MTU to 9000 bytes. Note that to use an MTU greater than 1500 bytes, both the NIC and the switch must handle the larger MTU. The SysKonnect cards used in these tests also support Jumbo frames of up to 9000 bytes.

The socket buffer size determines the size of the TCP sliding window and thus the number of packets which can be sent without an acknowledgment (ACK) being received from the receiver. Increasing the socket buffer size means that additional memory is used for buffering in the socket software implementation of TCP. A sketch of how these parameters can be set is shown below.
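The MTU and transmit queue length are interface attributes, while the socket buffer sizes are per-connection settings. The following hedged C sketch shows one programmatic way these parameters could be set on Linux; the interface name "eth1" and all sizes are assumptions for the example, and the actual tests would normally adjust the interface settings with standard system tools and the socket buffers through NetPIPE's options.

    /* Illustrative sketch: set the interface MTU and transmit queue length
     * via ioctl, and the per-socket buffer sizes via setsockopt.  Changing
     * interface settings requires root privileges; "eth1" and all values
     * are assumptions, chosen to match the parameters varied in this chapter. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct ifreq ifr;

        /* Jumbo-frame MTU of 9000 bytes on the gigabit interface. */
        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, "eth1", IFNAMSIZ - 1);
        ifr.ifr_mtu = 9000;
        if (ioctl(fd, SIOCSIFMTU, &ifr) < 0)
            perror("SIOCSIFMTU");

        /* Transmit queue length of 1000 packets.
         * (SIOCSIFTXQLEN may require <linux/sockios.h> on some systems.) */
        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, "eth1", IFNAMSIZ - 1);
        ifr.ifr_qlen = 1000;
        if (ioctl(fd, SIOCSIFTXQLEN, &ifr) < 0)
            perror("SIOCSIFTXQLEN");

        /* 256KB socket buffers for the benchmark connection. */
        int bufsize = 256 * 1024;
        if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize)) < 0)
            perror("SO_SNDBUF");
        if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof(bufsize)) < 0)
            perror("SO_RCVBUF");

        close(fd);
        return 0;
    }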

Figure 5.1: Socket size = 64KB


Except where indicated otherwise, all the tests were performed using f31 and f32, the 450MHz Pentium IIIs.

5.2.1. Varying MTU size

Figures 5.1 and 5.2 show the effects of changing the MTU size, with the default transmit queue length of 100 packets and socket buffer sizes of 64KB and 256KB respectively. There are a number of interesting observations from these figures. For a socket buffer size of 64KB, increasing the MTU from 1500 to 3000 bytes results in a significant increase in the maximum attainable throughput. Increasing the MTU size from 3000 to 4500 or 6000 bytes increases the throughput, but not as significantly. However, continuing to increase the MTU size beyond 6000 bytes results in a decrease in throughput. Thus, the maximum achievable throughput at a socket buffer size of 64KB occurs at MTU sizes of 3000 to 6000 bytes, and the corresponding optimum throughput is approximately 467 Mbps.


Figure 5.2: Socket size = 256KB

By contrast, for a 256KB socket buffer size, the maximum achievable throughput increases monotonically to 677Mbps at an MTU of 9000 bytes. For any fixed MTU, using a socket buffer size of 256KB increases the maximum achievable throughput.


5.2.2. Varying Socket Buffer size

Figure 5.3: MTU=1500

Figures 5.3 and 5.4 show the throughput achievable for varying socket buffer sizes at MTU sizes of 1500 bytes and 9000 bytes respectively. We observe that peak TCP performance is better with socket buffer sizes of 128KB and 256KB than with the default 64KB. Also note that the effect of increasing the socket buffer size is more noticeable for the larger MTU size. One point to note is that, when the socket buffer size is 128KB or 256KB, the throughput for larger messages drops off from the peak.


Figure 5.4: MTU=9000

5.2.3 Varying TX Queue Length

Figures 5.5 and 5.6 show the throughput achievable for varying TX queue lengths, with MTU sizes of 1500 and 9000 bytes respectively. We see that the TX queue length does not have a significant effect on throughput.


Figure 5.5: Change in TX queue length with MTU = 1500 bytes


Figure 5.6: Change in TX queue length with MTU = 9000 bytes


5.2.4. Varying Processor Speed

Figure 5.7: Processor speed = 450MHz

In Figures 5.7, 5.8 and 5.9, we see that faster processors can attain higher throughput for large transfer block sizes (greater than 1MB). The maximum throughput achievable for f31 and f32, the Pentium III 450MHz machines, is 564Mbps, whereas the maximum attainable throughput is approximately 645Mbps for Frodo and Legolas, the Pentium IV 1500MHz processors, and 787Mbps for C1 and C2, the Pentium IV 2.4GHz processors. This is largely due to the fact that faster processors can process the protocol stack and calculate TCP checksums faster than slower processors. However, note that an increase in processor speed of 53% resulted in only a 22% increase in throughput. This suggests that a substantial part of the bottleneck lies elsewhere.

Figure 5.8: Processor speed = 1500MHz


Figure 5.9: Processor speed = 2400MHz


5.2.5 Different Gigabit Network Interfaces

Figure 5.10: Throughput comparison on Dell Optiplex GX260 with built-in NIC card

We tested the throughput on a pair of Dell Optiplex GX260 machines connected back-to-back. These have an Intel Pentium IV processor running at 2.53GHz, with 256MB of DDR SDRAM and a 32-bit, 33MHz PCI card slot. Figures 5.10 and 5.11 show the effect of different network interfaces on the performance of the Gigabit Ethernet network. Figure 5.10 shows throughput for MTU sizes of 1500, 3000 and 9000 bytes with the built-in network interface, and Figure 5.11 shows the throughput for MTU sizes of 1500, 4500 and 9000 bytes with the SysKonnect NIC card. The maximum throughput achievable with the built-in NIC card is 494 Mbps at an MTU size of 3000 bytes, whereas it is 795 Mbps with the SysKonnect NIC at an MTU size of 9000 bytes.

Figure 5.11: Throughput comparison with SysKonnect NIC card

Similar results were seen by Hong and Farrell [17], [25] for the Packet Engines GNIC-II (hamachi v0.07), Alteon ACEnic (acenic v0.45), and SysKonnect SK-NET (sk98lin v3.01). A comprehensive evaluation of the varying performance of different network interface cards was undertaken by Anthony Betz at the University of Northern Iowa [24]. The performance varied dramatically between the NICs they tested, similar to the differences observed in our experiments.


5.2.6. Performance of Xeon Processors

Figure 5.12: Throughput for newer Xeons

Figure 5.12 shows the throughput between Xeons in a RocketCalc Titan cluster. The Xeons were 2.4GHz processors with 2GB of memory, running the Linux 2.4.21 kernel. The figure shows the performance for an MTU of 1500 bytes with different buffer sizes. For the Xeons, the throughput increases considerably with increasing buffer size, from 417 Mbps with the default buffer size to approximately 894 Mbps with a 1MB buffer.


5.2.7. Performance of Opteron Processors

Figure 5.13: Throughput for newer Opterons

We evaluated the performance between the nodes of two different Opteron clusters.

Figure 5.13 shows the throughput between two Opteron 244 processors running at 1.8 GHz. Each machine is a dual processor Tyan Thunder K8W with an onboard Broadcom NetXtreme BCM5703 Gigabit Ethernet LAN. They ran SuSE Linux Professional 9.0 with Linux kernel 2.6.2, and the MTU was 1500 bytes. The Broadcom BCM5703 is a fully integrated 64-bit 10/100/1000BASE-T Gigabit Ethernet Media Access Control and Physical Layer transceiver with CPU task offloads. Figure 5.13 also shows the effects of different buffer sizes. The maximum throughput is around 897 Mbps at an MTU of 1500 bytes, with the buffer size having no noticeable effect.

We also tested the performance of a dual processor AMD Opteron 250 system with a processor speed of 2.4 GHz and 2GB of RAM per processor. The operating system is Fedora Core 5 Linux with kernel version 2.6.18. The motherboard is a Supermicro H8DCi EATX board with a built-in Gigabit Ethernet network interface based on the nVidia nForce Pro 2200 (CKIO4) and 2050 (IO4) dual single-port Gigabit Ethernet controllers. The layout of the nVidia nForce Pro 2200/2050 chipset is shown in Figure 5.14 below. As can be seen from the chipset diagram, these controllers are connected directly to the HyperTransport bus and also have TCP/IP offload capability built into the Gigabit Ethernet controller.

Figure 5.14: Block Diagram of nVidia 2050/2200 chipset


Figure 5.15: TCP throughput with newer Opterons

From Figure 5.15, we see that the maximum throughput achieved is approximately 822 Mbps with an MTU size of 1500 bytes. The throughput did not vary significantly with socket buffer size. One point of note here is that the throughput for the earlier 1.8GHz Opteron was 897Mbps, whereas for the newer 2.4GHz Opteron it was only 822 Mbps. This indicates that the motherboard and Gigabit interface were more significant factors in the performance.

The throughput attained by the Opterons with the default configuration is much higher than the maximum throughput achievable on the earlier processors tested, even with tuned configurations of TCP and VIA. However, we have to note that a number of factors have changed, as outlined below:


• The Gigabit Ethernet interface used on the Opterons is built-in, instead of the SysKonnect SK9821 card in a 32-bit/33MHz PCI slot.

• The processors are AMD Opterons, which are 64-bit processors, with processor speeds of 1.8 and 2.4 GHz.

• The Linux kernel versions are 2.6.2 and 2.6.18 for the Opteron tests, whereas all our earlier tests were done using Linux kernel version 2.4.18. It is possible that the device drivers are more efficient in the later kernels.

Note that the newer processors and newer network interface cards (NICs) provide very high throughput without any configuration changes. We believe that the network configuration parameters which were dominant in increasing the performance of Gigabit Ethernet networks may, to a certain extent, also be helpful in configuring future networks, such as 10 Gbps networks.

5.3 VIA Comparisons

Here we compare the point-to-point latency and throughput performance of the software implementation of VIA (M-VIA). We also compare TCP performance with M-VIA. Latency is measured by taking half the average round-trip time for a 1-byte transfer. The throughput rate is calculated from half the round-trip time for a data block of size c bytes, as sketched below.
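The following is a minimal sketch of how the reported numbers follow from the half round-trip times just defined; rtt_seconds() is a hypothetical timing helper standing in for NetPIPE's timed ping-pong loop.

    /* Latency is half the average RTT of a 1-byte exchange; throughput for a
     * block of c bytes is the block size divided by half its RTT.
     * rtt_seconds() is a stand-in for NetPIPE's measurement loop. */
    #include <stddef.h>

    double rtt_seconds(size_t block_size);   /* hypothetical timing helper */

    double latency_usecs(void)
    {
        return (rtt_seconds(1) / 2.0) * 1e6;          /* one-way time in microseconds */
    }

    double throughput_mbps(size_t c)
    {
        double one_way = rtt_seconds(c) / 2.0;        /* seconds for c bytes one way  */
        return (8.0 * (double)c) / (one_way * 1e6);   /* megabits per second          */
    }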


Figure 5.16: Processor speed = 450 MHz

In M-VIA, the SysKonnect NIC's hardware MTU defaults to 1500 bytes, because the Gigabit Ethernet standard still limits the MTU to 1500 bytes. We observed above that a larger MTU improves TCP throughput. To confirm that a larger hardware MTU will also improve VIA performance, we tested the SK-9821 NIC using an MTU greater than 1500 bytes. Since we did not have a Gigabit Ethernet switch which supported Jumbo frames, we connected the two PCs back to back using the SK-9821 NICs.

In Figure 5.16 above, we can observe the M-VIA performance for MTU sizes of 1500 and 3000 bytes with the default socket buffer size and txqueuelen values of 100 and 1000. As in the TCP case, VIA communication performance also improves with an increase in MTU size. We can also observe that increasing the txqueuelen has no effect on performance.

Figure 5.17: Processor speed = 1500 MHz


Figure 5.18: Processor speed = 2400 MHz

Figures 5.17 and 5.18 above show the M-VIA performance of the faster processors, with processor speeds of 1500MHz and 2400MHz, for varying MTU sizes. For the Pentium IV 1500MHz, M-VIA attains a maximum throughput of roughly 407 Mbps at an MTU of 1500 bytes and 597 Mbps at an MTU of 9000 bytes. For a processor speed of 2400MHz, the M-VIA throughput is 497 Mbps for 1500 bytes and 739 Mbps for 9000 bytes. The surprising result here is that M-VIA throughput is 544Mbps for the 450MHz processor, but drops to 407Mbps for the 1500MHz processor, and rises again only to 497Mbps for the 2400MHz processor.


Figure 5.19: M-VIA throughput at different processor speeds

5.4 TCP and VIA comparison

As we have seen above, increasing the hardware MTU improves both TCP and M-VIA performance. Here we compare the throughput and latency of TCP and M-VIA at different processor speeds. Note that all tests were done with the Linux 2.4.18 kernel as the operating system, using SysKonnect NIC cards with the sk98lin driver, the default socket buffer size, and txqueuelen = 100.


Figure 5.20: TCP and VIA at Processor speed = 450MHz

For the Pentium III 450MHz processor, the maximum throughput achievable is 291 Mbps for TCP and 544Mbps for M-VIA; that is, the M-VIA throughput is 86% higher than the corresponding TCP throughput. For faster processors such as the Pentium IV 1500MHz and 2400MHz, we can see that TCP performance is equal to or better than VIA performance.


Figure 5.21: TCP and VIA at Processor speed = 1500MHz


Figure 5.22: TCP and VIA at Processor speed = 1500MHz

We have also noticed that at slower processor speeds the difference between M-VIA and TCP throughput is larger, that is, M-VIA performance is much better. As the processor speed increases, the TCP performance improves, and at processor speeds around 2.4 GHz the TCP throughput is equal to or greater than the M-VIA throughput. This effect is partly due to the fact that faster processors can process the TCP/IP stack and calculate TCP checksums faster, so this overhead is not as significant as on slower processors. No cause has been identified for the decrease in M-VIA performance.


Figure 5.23: TCP and VIA at Processor speed = 2400MHz


Figure 5.24: TCP and VIA at Processor speed = 2400MHz

5.5 MVIA latency comparisons

The table below summarizes the latency for TCP and M-VIA on systems with different processor speeds. For the Pentium III 450MHz, the latency is 61 usecs for TCP and 26 usecs for M-VIA. Notice that the M-VIA latency is at least 50% less than the TCP latency. This highlights the fact that VIA can deliver the low latency needed for communication intensive applications even on slow processors.


Latency for a 1-byte transfer (usecs):

Processor speed    TCP    VIA
450MHz              61     26
1500MHz             26     24
2400MHz             21     21

However, for faster processors with processor speeds of 1500MHz and 2400MHz, we can observe that the TCP latency is only slightly greater than or equal to the M-VIA latency. Again, this is mainly because the overhead of network stack processing and checksum computation is less significant than on slower processors.

Chapter 6

MPI Comparisons

Here we evaluate and compare the performance of two implementations of the MPI standard, LAM and MPICH, on a Linux cluster connected by a Gigabit Ethernet network.

Performance statistics are collected using the NetPIPE MPI module.

6.1 Introduction

On cluster systems, parallel processing is usually accomplished through parallel programming libraries such as MPI, PVM and BSP. These environments provide well-defined, portable mechanisms with which concurrent applications can be developed easily. In particular, MPI has been widely accepted for computational science applications, and its use has broadened over time. Two of the most extensively used MPI implementations are MPICH [9], [10], [27], from Mississippi State University and Argonne National Laboratory, and LAM [26], originally from the Ohio Supercomputing Center and now maintained by Indiana University. The modular design adopted by MPICH, by Gropp and Lusk [10], and by LAM has allowed research organizations and commercial vendors to port the software to a great variety of multiprocessor and multicomputer platforms and distributed environments.


Naturally, there has been great interest in the performance of LAM and MPICH for enabling high performance computing on clusters. Large scale distributed applications using MPI as the communication transport on a cluster of computers impose heavy demands on communication networks. Gigabit Ethernet technology, among other high-speed networks, can in principle provide the required bandwidth to meet these demands. Moreover, as Gigabit-over-copper devices became more available and more widely used, their price decreased to commodity level. However, this has also shifted the communication bottleneck from the network media to protocol processing. Since LAM and MPICH use TCP/UDP socket interfaces to communicate messages between nodes, there have been great efforts to reduce the overhead incurred in processing the TCP/IP stack. Many systems, such as U-Net and Active Messages, have been proposed to provide low latency and high bandwidth message passing between clusters of workstations and I/O devices connected by a network. The Virtual Interface Architecture (VIA) was developed to standardize these ideas. Since the introduction of VIA, there have been several software and hardware implementations; Berkeley VIA, Giganet VIA, M-VIA, and FirmVIA are among them. This has also led to the recent development of VIA-based MPI communication libraries such as MVICH.

6.2 Testing Environment

The initial testing environment for collecting the performance results consists of two Intel Pentium IV PCs running at 1500MHz with 1GB of RAM. The PCs are connected back to back via a Gigabit Ethernet network card, the SysKonnect SK-NET Gigabit Ethernet adapter (SK-9843 SX), installed in a 32-bit/33MHz PCI slot. Each node also has one 100Mbps Ethernet card used for external communication, to ensure that the cluster is isolated from the rest of the traffic for the accuracy of the tests. The cluster ran the RedHat 7.1 Linux distribution with kernel version 2.4.18. In addition, LAM v6.5.7 and MPICH v1.2.4 were installed. All the tests were done using NetPIPE; a minimal example of the kind of MPI ping-pong measurement that the NetPIPE MPI module performs is sketched below.
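The sketch below is a self-contained MPI ping-pong in the spirit of the NetPIPE MPI module: rank 0 sends a block to rank 1 and waits for it to come back, then reports the one-way time and throughput. The block size and repetition count are arbitrary choices for the example; this is not the actual NetPIPE driver code.

    /* Minimal MPI ping-pong in the spirit of the NetPIPE MPI module. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int nbytes = 64 * 1024, reps = 100;   /* arbitrary example values */
        char *buf = malloc(nbytes);
        MPI_Status st;
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &st);
            } else if (rank == 1) {
                MPI_Recv(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        double one_way = (MPI_Wtime() - t0) / (2.0 * reps);

        if (rank == 0)
            printf("%d bytes: %.1f usec one-way, %.1f Mbps\n",
                   nbytes, one_way * 1e6, 8.0 * nbytes / (one_way * 1e6));

        MPI_Finalize();
        free(buf);
        return 0;
    }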

6.3 LAM and MPICH performance comparisons

In this section, we present and compare the performance of LAM and MPICH on a Gigabit Ethernet network. Before moving on to the performance results, it is useful to first briefly describe the data exchange protocol used in these two MPI implementations. The choices made in implementing the protocol can influence the performance, as we will see later in the performance graphs.

Generally, LAM and MPICH use a short/long message protocol for communication; however, their implementations are quite different. In LAM, a short message consisting of a header and the message data is sent to the destination node in one message. A long message is segmented into packets, with the first packet consisting of a header and possibly some message data sent to the destination node. The sending node then waits for an acknowledgment from the receiving node before sending the rest of the data. The receiving node sends the acknowledgment when a matching receive is posted.

MPICH implements three protocols for data exchange. For short messages, it uses the Eager protocol to send the message data to the destination node immediately, with the possibility of buffering the data at the receiving node when the receiving node is not expecting it. For long messages, two protocols are implemented: the Rendezvous protocol and the Get protocol. In the Rendezvous protocol, data is sent to the destination only when the receiving node requests it. In the Get protocol, data is read directly by the receiver; this choice requires a method to transfer data directly from one process's memory to another, such as exists on parallel machines. A simplified sketch of the short/long switchover is shown below.
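The following hedged sketch illustrates the short/long (eager versus rendezvous-style) switchover logic just described. The 64KB threshold and the helper names are placeholders chosen for illustration; they are not the actual LAM or MPICH internals.

    /* Illustrative short/long message switchover, in the spirit of the
     * protocols described above; the threshold and helpers are placeholders. */
    #include <stddef.h>

    #define SHORT_MSG_LIMIT (64 * 1024)   /* switchover point, configurable at compile time */

    /* Hypothetical transport helpers (declared only, for the sketch). */
    void send_packet(int dest, const void *hdr, size_t hdrlen,
                     const void *data, size_t datalen);
    void wait_for_ack(int dest);

    void mpi_style_send(int dest, const void *hdr, size_t hdrlen,
                        const char *data, size_t len)
    {
        if (len <= SHORT_MSG_LIMIT) {
            /* Short/eager: header and data go out immediately; the receiver
             * may have to buffer the data if no matching receive is posted. */
            send_packet(dest, hdr, hdrlen, data, len);
            return;
        }

        /* Long: send the header (and possibly a first fragment), then wait
         * until the receiver has posted a matching receive and acknowledged,
         * before streaming the remaining data. */
        send_packet(dest, hdr, hdrlen, data, SHORT_MSG_LIMIT);
        wait_for_ack(dest);
        send_packet(dest, NULL, 0, data + SHORT_MSG_LIMIT, len - SHORT_MSG_LIMIT);
    }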

Figure 6.1: LAM, compiled at 64KB switchover, socket buffer = 64KB, all MTU sizes

All the LAM tests are conducted using the LAM client-to-client (C2C) protocol, which bypasses the LAM daemon to reduce latency. In LAM and MPICH, the maximum length of a short message can be configured at compile time by setting an appropriate constant. The LAM short/long message switchover point occurs at 64KB by default. We also tried a configuration with a LAM short/long message switchover of 128KB. For MPICH, we used the default settings.

Figure 6.2: MPICH, socket buffer = 64KB, all MTU sizes

For LAM compiled with the 64KB long/short message switchover and an MTU size of 1500 bytes, the maximum attainable throughput is about 426 Mbps with a latency of 32 usecs. For LAM compiled with the 128KB switchover and an MTU of 1500 bytes, the maximum throughput is about 372 Mbps with a latency of 34 usecs. For MPICH using an MTU size of 1500 bytes, the maximum attainable throughput is about 391 Mbps with a latency of 38 usecs. The table below shows the throughput in Mbps at different MTUs for LAM compiled at 64KB and for MPICH.

Throughput in Mbps:

MTU (bytes)    LAM (compiled at 64KB)    MPICH
1500                 426.58               391.69
3000                 551.90               492.48
4500                 595.89               500.28
6000                 628.91               453.88
9000                 640.07               350.25

From the table above, we can see that changing the MTU to a larger size improves LAM performance considerably. For LAM, the maximum attainable throughput increases by approximately 50% with an MTU of 9000 bytes as compared to 1500 bytes. This is expected, as TCP/IP performs better on a Gigabit Ethernet network with a larger MTU and a larger socket buffer size. However, increasingly larger MTU sizes initially increase but eventually decrease MPICH performance; for MPICH, the maximum attainable throughput drops by approximately 10% with an MTU of 9000 bytes. This is because, during initialization, MPICH sets the SOCK_SNDBUF and SOCK_RCVBUF sizes equal to 4096 bytes, thus limiting the effect of increasing the MTU. As we might expect, the throughput peaks when the MTU permits the frame to include the full 4096 bytes of the TCP payload; in our tests this occurs when the MTU is 4500 bytes. Hence, a larger MTU does not help to improve MPICH performance. On the other hand, LAM sets the send and receive socket buffers, SOCK_SNDBUF and SOCK_RCVBUF, to a size equal to the switchover point plus the size of the C2C envelope data structure. This explains why LAM performance improved when we made the MTU greater than 1500 bytes.

                           Throughput (Mbps)    Latency (usecs)
LAM (compiled at 64KB)           426                  32
LAM (compiled at 128KB)          372                  34
MPICH                            391                  38
TCP                              428                  26

One surprising aspect is that despite the extra overhead of MPI protocol processing and headers, LAM attains performance of 426Mbps close to that of TCP at 428Mbps.

Chapter 7

Conclusion

Communication performance is affected by a number of factors, including CPU speed, I/O speed, bus architecture, network adaptors, device drivers, and protocol stack processing. While most of these factors do not contribute significantly to performance in the case of slower networks, they begin to become significant factors in high speed networks.

Gigabit Ethernet provides the bandwidth required to meet the demands of current and future applications. However, it has also shifted the communication bottleneck from the network media to the hardware and software components, and it is critical to improve these components in order to achieve high speed transmission. A detailed study of the communication latency and throughput of Gigabit Ethernet was presented, with emphasis on TCP/IP protocol stack processing. In order to tune the TCP/IP protocol for high speed networks, we analyzed the effects of a number of TCP parameters and indicated the optimal values for our test environment. The effect of processor speed on Gigabit Ethernet throughput was also analyzed, and it was shown that, although improved performance was obtained with faster processors, the improvement was not proportional to the increase in processor speed. This leads to the conclusion that processor speed is not the major bottleneck in obtaining improved throughput. We also showed the effects of MTU size on Gigabit Ethernet performance. The ability to increase the MTU size beyond 1500 bytes, that is, to transmit Jumbo Frames, can significantly enhance the attainable throughput. Moreover, we showed that the maximum attainable throughput was not always at the largest MTU size of 9000 bytes. In order to make the largest selectable MTU the optimal choice, one also has to ensure that the TCP socket buffer size is large enough.

We also compared the throughput and latency of TCP and M-VIA at different processor speeds. We have seen that an increase in hardware MTU improves TCP as well as M-VIA performance. We have also noticed that at slower processor speeds, M-VIA performance is much better than TCP performance; as the processor speed increases, TCP performance improves, and at higher processor speeds the TCP throughput is equal to or greater than the M-VIA throughput. This effect may be due to the fact that faster processors can process the TCP/IP protocol stack and TCP checksums faster than slower processors. We have also seen that on slower processors the latency is much better for M-VIA, whereas for faster processors the latency is more or less equal for TCP and M-VIA. Based on our investigation, M-VIA was a very useful alternative on less powerful processors, but on newer and more powerful processors, such as Opterons and Xeons, TCP performs much better.

We also compared the performance of LAM and MPICH at different MTU and socket buffer sizes. LAM throughput increases with increasing MTU size, while MPICH performance decreases when the MTU size is increased beyond 4096 bytes. This is because, during initialization, MPICH initializes the send and receive socket buffers to 4096 bytes. On the other hand, LAM sets the send and receive socket buffers to a size equal to the switchover point plus the size of the C2C envelope data structure. We have also seen that the communication latency of LAM is better than that of MPICH.

For older processors many factors impacted the attainable performance, and significant testing and tuning were required to optimize it. In contrast, for newer processors such as the Opteron, acceptable performance is attained with the default configurations.

References

[1] Q.O. Snell, A.R. Mikler, and J.L. Gustafson, “NetPIPE: Network Protocol Independent Performance Evaluator”. http://www.scl.ameslab.gov/netpipe/paper/full.html

[2] Stephen Saunders, “Data Communication Gigabit Ethernet Handbook”, McGraw Hill, ISBN 0-07-057971-7, 1998.

[3] S. Elbert, C. Csanady, et al., “Gigabit Ethernet and Low-Cost Supercomputing”. http://www.scl.ameslab.gov/Publications/Gigabit/tr5126.html

[4] Gigabit Ethernet Alliance, “Gigabit Ethernet Over Copper”. http://www.gigabit-ethernet.org/

[5] Joe Skorupa and George Prodan, “Battle of the Backbones: ATM vs. Gigabit Ethernet”, Data Communications, April 1997, http://www.data.com/tutorials/backbones.html

[6] Mark Baker, Paul A. Farrell, Hong Ong, Stephen L. Scott, VIA Communication Performance on a Gigabit Ethernet Cluster, Proceedings of EuroPar2001.

[7] Compaq Computer Corp. , Intel Corporation, Microsoft Corporation, Virtual Interface Architecture Specification version 1.0. http://www.viarch.org

[8] 10 Gigabit Ethernet Alliance, 10 Gigabit Ethernet Technology Overview White Paper. (2001) http://www.10gea.org/

[9] W. Gropp and E. Lusk and N. Doss and A. Skjellum, A high-performance, portable implementation of the MPI message passing interface standard, Parall. Comp. 22 (1996).

[10] William D. Gropp and Ewing Lusk, User’s Guide for mpich, a Portable Implementation of MPI, Argonne National Laboratory (1996), ANL-96/6.

[11] HINT - Hierarchical INTegration benchmark web site, http://www.scl.ameslab.gov/Projects/HINT/


[12] Intel Corporation, Virtual Interface (VI) Architecture: Defining the Path to Low Cost High Performance Scalable Clusters. (1997)

[13] M-VIA: A High Performance Modular VIA for Linux. http://www.nersc.gov/research/FTG/via/

[14] MPI for Virtual Interface Architecture. http://www.nersc.gov/research/FTG/mvich/

[15] Hong Ong, Paul A. Farrell, Performance Comparison of LAM/MPI, MPICH, and MVICH on a Linux Cluster connected by a Gigabit Ethernet Network, Proceedings of Linux 2000, 4th Annual Linux Showcase and Conference, Extreme Linux Track, Atlanta, 2000.

[16] InfiniBand Trade Association, http://www.infinibandta.org

[17] Paul Farrell and Hong Ong, Communication Performance over a Gigabit Ethernet Network, IEEE Proceedings of 19th IPCCC. 2000

[18] M. Banikazemi, V. Moorthy, L. Hereger, D. K. Panda, and B. Abali. Efficient Virtual Interface Architecture Support for IBM SP Switch-Connected NT Clusters. International Parallel and Distributed Processing Symposium. (2000)

[19] Richard P. Martin, Amin M. Vahdat, David E. Culler, Thomas E. Anderson: Effects of Communication Latency, Overhead, and Bandwidth in a Cluster Architecture. ISCA 24. (1997)

[20] P.A. Farrell, H. Ong, A. Ruttan, “Modeling liquid crystal structures using MPI on a workstation cluster”, Proceedings of MWPP99.

[21] E. Speight, H. Abdel-Shafi , J. K. Bennett: Realizing the Performance Potential of the Virtual Interface Architecture. Proc. of Supercomputing'99. (1999)

[22] N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, W. Su: Myrinet - A Gigabit per second Local Area Network. IEEE Micro. (1995)

[23] M. Mathis, J. Mahdavi, S. Floyd, A. Romanow, “TCP Selective Acknowledgment Options”, RFC 2018, October 1996.

[24] Paul Gray, Anthony Betz, Performance Evaluation of Copper-based Gigabit Ethernet Interfaces, 679-690, Electronic Edition, IEEE Computer Society DL.


[25] Paul A. Farrell and Hong Ong, Factors involved in the Performance of Computations on Beowulf clusters, Electronic Transactions on Numerical Analysis.Volume 15, pp. 211- 224, 2003.

[26] LAM, http://www.lam-mpi.org/

[27] MPICH, http://www.mcs.anl.gov/mpi/mpich/

[28] TOP 500, http://www.top500.org

[29] Netperf, http://www.freebsd.org/projects/netperf/index.html

[30] ttcp, http://www.cisco.com/warp/public/471/ttcp.html