Operating system and network support for high-performance computing

Item Type text; Dissertation-Reproduction (electronic)

Authors Guedes Neto, Dorgival Olavo

Publisher The University of Arizona.

Rights Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.

Download date 30/09/2021 20:24:35

Link to Item http://hdl.handle.net/10150/298757

INFORMATION TO USERS

This manuscript has been reproduced from the microfilm master. UMI films the text directly from the original or copy submitted. Thus, some thesis and dissertation copies are in typewriter face, while others may be from any type of computer printer.

The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleedthrough, substandard margins, and improper alignment can adversely affect reproduction.

In the unlikely event that the author did not send UMI a complete manuscript and there are missing pages, these will be noted. Also, if unauthorized copyright material had to be removed, a note will indicate the deletion.

Oversize materials (e.g., maps, drawings, charts) are reproduced by sectioning the original, beginning at the upper left-hand corner and continuing from left to right in equal sections with small overlaps. Each original is also photographed in one exposure and is included in reduced form at the back of the book.

Photographs included in the original manuscript have been reproduced xerographically in this copy. Higher quality 6" x 9" black and white photographic prints are available for any photographs or illustrations appearing in this copy for an additional charge. Contact UMI directly to order.

Bell & Howell Information and Learning 300 North Zeeb Road, Ann Arbor, MI 48106-1346 USA 800-521-0600

OPERATING SYSTEM AND NETWORK SUPPORT FOR HIGH-PERFORMANCE COMPUTING

by

Dorgival Olavo Guedes Neto

Copyright © Dorgival Olavo Guedes Neto 1999

A Dissertation Submitted to the Faculty of the

DEPARTMENT OF COMPUTER SCIENCE

In Partial Fulfillment of the Requirements For the Degree of

DOCTOR OF PHILOSOPHY

In the Graduate College

THE UNIVERSITY OF ARIZONA

1999

UMI Number: 9946820

Copyright 1999 by Guedes Neto, Dorgival Olavo

All rights reserved.

UMI Microform 9946820 Copyright 1999, by UMI Company. All rights reserved.

This microform edition is protected against unauthorized copying under Title 17, United States Code.

UMI 300 North Zeeb Road Ann Arbor, MI 48103

THE UNIVERSITY OF ARIZONA GRADUATE COLLEGE

As members of the Final Examination Committee, we certify that we have read the dissertation prepared by Dorgival Olavo Guedes Neto entitled OPERATING SYSTEM AND NETWORK SUPPORT FOR HIGH-PERFORMANCE COMPUTING and recommend that it be accepted as fulfilling the dissertation requirement for the Degree of Doctor of Philosophy.

Larry L. Peterson Date

John H. Hartman Date

Richard D. Schlichting Date

Date

Date

Final approval and acceptance of this dissertation is contingent upon the candidate's submission of the final copy of the dissertation to the Graduate College.

I hereby certify that I have read this dissertation prepared under my direction and recommend that it be accepted as fulfilling the dissertation requirement.

Dissertation director Larry L. Peterson Date

STATEMENT BY AUTHOR

This dissertation has been submitted in partial fulfillment of requirements for an advanced degree at The University of Arizona and is deposited in the University Library to be made available to borrowers under rules of the Library. Brief quotations from this dissertation are allowable without special permission, provided that accurate acknowledgment of source is made. Requests for permission for extended quotation from or reproduction of this manuscript in whole or in part may be granted by the copyright holder.

SIGNED:

ACKNOWLEDGEMENTS

The road to a Ph.D. is always an adventurous one, and I have been greatly honored to have Larry Peterson as my advisor along the way. Many were the times when I came to his office feeling like I was at a dead end, and left with new confidence and new paths to explore. He was always there to hear any ideas I had, to help turn bad ones into good, and to make good ones even better, with insights often beyond my own. Working with him has been a reward in itself. I also would like to thank the other members of my committee, John Hartman, Rick Schlichting, and Greg Andrews (who served in it for a while), for their support and encouragement.

I could not have made it without the help of the lab staff of the Computer Science department. In particular, I would like to thank John Cropper, for his patience answering tons of questions and going over old backups to help me fix many of my mistakes, and Phil Kaslo, for being always ready to help me with the Paragon, managing to keep it running long enough for me to get all the numbers I needed. Also, the administrative staff has always been there to help, especially Margaret Newman, Cara Wallace and Wendy Swartz.

Many fellow students also helped, in different ways. Among them, the following deserve special mention: David Mosberger provided the Scout thread package he wrote for the Alpha, which I ported to the Paragon; Lawrence Brakmo helped me with x-sim and offered valuable comments on rate-based traffic shaping; and Robert Muth helped a lot during the preparation of the presentation, providing invaluable comments on earlier versions of the slides.

Sometimes it is hard to identify the moment when we started a long journey such as a doctorate. In my case, I remember it exactly. For that I thank my M.S. advisor at the UFMG, Osvaldo S. F. de Carvalho. Neither of us had any idea about it then, but when he asked me to lead a discussion about one of the first x-kernel papers ([HP91]), that was the start.

Above all else, my wife, Daniella, deserves special thanks for her understanding, love and support in all moments. She unselfishly put her career on hold to keep me company during all this time, and for this I will be forever grateful. I could never have completed this work without her by my side, and I share this achievement with her. Finally, I would like to thank my parents for their education and support.

During my doctorate I was partially supported by the Brazilian Science and Technology Council (Conselho Brasileiro de Desenvolvimento Científico e Tecnológico, CNPq), scholarship no. 200861/93-0. This work was also supported in part by Darpa (California Institute of Technology subcontract PC159228).

To Daniella, for her love and support, always. And to my grandfather, the first Dorgival Guedes (in memoriam). ... "Bença, Vô!"

TABLE OF CONTENTS

LIST OF FIGURES 10

LIST OF TABLES 12

ABSTRACT 13

CHAPTER 1: INTRODUCTION 14
1.1 The Parallel I/O Bottleneck 16
1.2 Moving Data Out of the MPP 17
1.2.1 The Need to Share Information 17
1.2.2 Meta-Computing 17
1.2.3 New Network Technologies 18
1.2.4 Network Attached Peripherals 19
1.2.5 Workstation Clusters 19
1.2.6 Standard Parallel I/O Interfaces 19
1.2.7 The Web 20
1.3 Problems With MPP Network Subsystems 20
1.4 Thesis Statement and Contributions 21
1.5 Overview of This Dissertation 22

CHAPTER 2: PARALLEL I/O ISSUES 23
2.1 The Parallel I/O Problem 23
2.2 I/O Nodes in Multiprocessor Architectures 24
2.3 Parallel File Systems 27
2.3.1 Operating System Organization 27
2.3.2 File Layout 30
2.3.3 Access Interfaces 30
2.3.4 Strided Accesses and Collective I/O 31

2.3.5 Current Parallel File Systems 33
2.3.6 Communication Patterns 34
2.4 Impact on Network I/O 35
2.4.1 Network I/O Nodes 36
2.4.2 Network OS Subsystems 37
2.4.3 Communication Patterns Over the External Network 39
2.4.4 Protocol Requirements 41
2.5 Concluding Remarks 42

CHAPTER 3: DISTRIBUTED PROTOCOL PROCESSING 44
3.1 A Case Study: The Intel Paragon 44
3.1.1 Architecture 45
3.1.2 Operating System Structure 46
3.1.3 Accessible Interfaces 47
3.1.4 The Network Subsystem 47
3.1.5 The Target Machines 50
3.2 Performance of the Current System 51
3.2.1 OSF/1 Protocol Server 51
3.2.2 Inter-node Communication 53
3.2.3 HIPPI Interface 54
3.3 Distributed Protocol Stacks 56
3.3.1 User-Level Protocols 56
3.3.2 ULP in the Paragon 57
3.3.3 Raw HIPPI Implementation 59
3.3.4 Raw HIPPI Performance Results 60
3.3.5 NX Implementation 61
3.3.6 NX Performance Results 62
3.4 Comparison of the Different Solutions 63
3.5 Concluding Remarks 64

CHAPTER 4: PROTOCOL ISSUES AFFECTING PARALLEL I/O 66
4.1 Window-Based Flow Control 66
4.2 Congestion 67

4.3 Connection Start-up 68
4.3.1 Setup Handshake 68
4.3.2 TCP Slow Start 69
4.3.3 Proposals to Improve Slow Start 70
4.4 Congestion Avoidance Phase 71
4.4.1 Original TCP Congestion Control 71
4.4.2 Proposals to Improve TCP Congestion Control 72
4.5 Problems of TCP for High Performance Computing 74
4.5.1 Explicit Delays 74
4.5.2 Timer Granularity 76
4.5.3 Packet Trains 77
4.6 Cooperative Sessions 81
4.7 Concluding Remarks 84

CHAPTER 5: COOPERATIVE RATE-BASED TRAFFIC SHAPING 85
5.1 Implementation 86
5.1.1 Rate Estimation 87
5.1.2 Combination of Multiple Connections 89
5.1.3 Connection Scheduling 91
5.1.4 Fine-Grained Timers 91
5.1.5 Operation of the Rate Controller 92
5.2 Simulation Models 93
5.3 Performance Results 96
5.3.1 Single Switch Case: Read 97
5.3.2 Single Switch Case: Write 101
5.3.3 Two-Switch Case: Read 103
5.3.4 Multiple Clients 103
5.4 Rate Control and Distributed Protocol Stacks 105
5.5 Concluding Remarks 109

CHAPTER 6: CONCLUSION 110
6.1 Limitations of Current Systems 110
6.2 Main Contributions 111

6.3 Suggestions for Future Work 112

APPENDIX A: DETAILED SIMULATION RESULTS 114
A.1 No Congestion 115
A.2 Central Switch and Related Central Link Cases 115
A.3 Reading from Ethernet Servers Through Ethernet Bottlenecks 116
A.4 Reading from ATM Servers Through Ethernet Bottlenecks 118
A.5 ATM Clients Writing Through Fast Ethernet Central Link 119
A.6 ATM Clients Writing to Fast Ethernet Servers 120
A.7 ATM Networks 122

REFERENCES 124

LIST OF FIGURES

1.1 Organization of a Massively Parallel Processor (MPP) 14

2.1 Distribution of I/O nodes in MPPs 26
2.2 Examples of different MPP OS organizations 28
2.3 File striping using multiple I/O nodes 30
2.4 PFS communication patterns inside an MPP mesh 34
2.5 PFS communication patterns over an external network 40

3.1 Organization of a Paragon Node 45
3.2 Standard Paragon network subsystem 48
3.3 Paragon TCP/IP performance 52
3.4 Paragon TCP/IP performance for multiple connections 52
3.5 NX IPC performance 54
3.6 Raw HIPPI performance 55
3.7 Proposed change to the network subsystem 58
3.8 User-level protocol stack using NORMA 59
3.9 Performance of NORMA-based protocol stack 60
3.10 User-level protocol stack using NX 61
3.11 User-level protocol stack results 63
3.12 Comparison of the various stack implementations 64

4.1 TCP connection setup 69
4.2 TCP slow start 70
4.3 Proper packet spacing causes no losses 77
4.4 Spurious drops caused by packet trains 78
4.5 Packet trains in a TCP connection 80
4.6 Poor link utilization due to one stalled connection 82

5.1 Rate controller added to the protocol stack 86
5.2 Rate controller internal structure 87
5.3 Rate controller operation 93
5.4 LAN with a single switch 94
5.5 LAN with two switches 96
5.6 Application performance: 100baseT client reading from OC-12 servers 98
5.7 Transfer trace for plain TCP: single switch, read 99
5.8 Transfer trace with rate control: single switch, read 100
5.9 State of the rate controller: single switch, read 101
5.10 Application performance: OC-12 client writing to 100baseT servers 102
5.11 Application performance: 100baseT hosts behind OC-12 link 104
5.12 Effect of added clients on rate: two clients writing, 100baseT network 105
5.13 Effect of added clients: drops for two clients writing, 100baseT network 106
5.14 Steps for a distributed protocol stack implementation 107
5.15 Steps for vrate implementation 107
5.16 Distributed protocol stacks and rate control combined together 108

A.1 LAN with a single switch 114
A.2 LAN with two switches 114
A.3 No congestion: two 100baseT clients writing to OC-12 servers in a central-switch network 115
A.4 100baseT client reading from OC-12 servers 116
A.5 Client reading from 100baseT servers 117
A.6 Two-way traffic with 100baseT clients 117
A.7 Clients reading from OC-12 servers 118
A.8 Two-way traffic with OC-12 servers 119
A.9 Central 100baseT configuration with 100baseT servers 120
A.10 Two-way traffic: OC-12 clients, 100baseT central link, 100baseT servers 120
A.11 Two OC-12 clients writing to 100baseT servers 121
A.12 Two-way traffic between OC-12 clients and 100baseT servers 122
A.13 Performance in all-ATM networks 122
A.14 Two-way traffic in all-ATM networks 123

LIST OF TABLES

1.1 Some application I/O requirements 16

2.1 Different multiprocessor architectures 24
2.2 Performance of Collective Buffering 32
2.3 Various PFS implementations 33
2.4 Bandwidth for different mediums 36
2.5 Network subsystem in some MPPs 38

3.1 Bandwidth limits in the Intel Paragon 56

ABSTRACT

High-performance computing applications were once limited to isolated supercomputers. In the past few years, however, there has been an increasing need to share data between different machines. This, combined with new network technologies which provide higher bandwidths, has led high-performance computing systems to adapt so that they can move data over the local network. There are some problems in doing this. Current high-performance systems often use centralized protocol servers, thereby creating bottlenecks to network connections. In addition, the lack of a more appropriate protocol leads to the use of TCP by applications using parallel connections. TCP is not perfectly tuned to such applications.

This dissertation presents a detailed analysis of the problems caused by centralized protocol servers and the use of TCP in high-performance computing environments. It shows why the network servers currently available in some systems do not provide good performance. It also presents simulation results that illustrate how TCP connection performance can degrade rapidly when multiple cooperative connections are used.

The main contributions in this work are the development of distributed protocol stacks and cooperative rate-based traffic shaping. Distributed stacks use a user-level protocol implementation to replicate the TCP/IP protocol stack in all the nodes of a multicomputer, removing the protocol server from the data path and avoiding the associated bottleneck. Cooperative rate shaping uses bandwidth estimates to pace data packets, avoiding most of the problems that cause performance degradation in parallel cooperative connections. It also provides a way for cooperating connections to share their bandwidth estimates, improving performance by making good use of their combined knowledge.

CHAPTER 1

INTRODUCTION

The past few years have seen technology achieve new limits by the use of Massively Parallel Processing architectures (MPP). In such systems, high-performance processing nodes (usually off-the-shelf microprocessors) are linked by a high-performance proprietary communication interconnect. The interconnect provides the nodes with high bandwidth, low latency paths to each other. Performance increases are achieved by expanding the interconnect and adding new nodes. MPP nodes may be divided into compute nodes, dedicated exclusively to running the application code, and I/O nodes, responsible for controlling I/O devices, such as disks and interfaces to external networks. Such an organization is illustrated in Figure 1.1.

Figure 1.1: Organization of a Massively Parallel Processor (MPP)

Using this technology, vendors have broken the "TeraFLOPS barrier": at least one computer system is capable of maintaining more than one trillion Floating Point Operations per Second. At least five others have maximum performance above that threshold, and the 500 fastest computer systems perform well above 10 GigaFLOPS [top98].

All this available capacity has allowed researchers to consider problems that would have been impossible to solve a few years ago. These include problems in global weather models, particle interaction in cosmological models, complex data visualization, planetary data imaging, and computational biology, among others. In almost all cases, however, the need for high-performance machines is not the only issue: there is also a very large amount of data movement involved. This has brought into evidence the input/output (I/O) bottleneck problem associated with MPP systems.

At first, the I/O problem was seen as limited to a single machine (a supercomputer). There was no need to move data out of the system, and almost all research efforts were focused on improving the performance of the disk sub-system. The problem was to improve I/O performance when disks were not capable of matching the data transfer speeds of the rest of the machine. Research in this area led to the development of high-performance parallel file systems, which use multiple disk nodes in parallel to handle each data transfer, assigning fractions of the information to each disk unit [C+95, CPD+96, JWB96].

Although most supercomputers were also connected to external networks, there was no motivation to make data readily accessible to other machines by extending a parallel file system over the external network [FCBH95]. Such networks were intended to be just a connection path to submit jobs and manage the machine, and the technologies used had capacities much lower than that of the internal interconnect. But network technology has experienced rapid improvement in recent years, and now technologies like Asynchronous Transfer Mode (ATM) [Vet95], High Performance Parallel Interface (HIPPI) [TR93], and others, allow local networks to perform at levels much higher than before. In addition, the need for data sharing has increased, and researchers now find themselves having to move data across different networks to make use of different machines.

The network subsystems in current MPPs were not designed for such use. Although the external network can now in some cases offer performance similar to that of the internal interconnect, in most cases MPPs can only use a fraction of that bandwidth. This is due to multiple factors, including poor operating system organization and use of inadequate transport protocols. New solutions in these areas are still necessary.

This dissertation focuses on improving the performance of the network subsystem for massively parallel processors. It starts by identifying the driving forces that have exposed the problem. It then discusses the situation in terms of current operating system (OS) infrastructure and transport protocols. Finally, it suggests changes in these two areas to improve network subsystem performance.

1.1 The Parallel I/O Bottleneck

Consider a planetary data rendering system used to visualize the surface of Venus. The Magellan probe has already sent more than 3 Tbytes of radar reflection data to map the relief under the planet's permanent cloud cover. To render a 30 frame per second animated image of the surface, about 200 Mbytes must be handled for each frame, requiring I/O rates in excess of 5 Gbytes per second [dRC94]. Computational power is not a problem in this case.

Many other applications have been recognized as extremely I/O intensive, and significant effort has been put into improving performance in such cases. Table 1.1 lists the requirements of some of these applications [dRC94, AWG94]:

Application                        I/O rates
Astrophysics particle dynamics     20-200 MB/s
Radio telescope imaging            >1 GB/s
Computational quantum materials    40-100 MB/s
3D atomic structure of viruses     >1 Gbps
Seismic data processing            >100 MB/s
3D turbulence simulations          ≈1 GB/s

Table 1.1: Some application I/O requirements

At first, the I/O problem in supercomputers was only local. The question was how to achieve high transfer rates when the disks used for storage inside the machine were clearly not capable of sustaining transfer rates as high as needed. The solution is to extend the MPP model to differentiate compute nodes and I/O nodes. Compute nodes are responsible for the actual execution of application code, while I/O nodes handle disk accesses.^ The task of distributing the data and maintaining the sequential file abstraction over a set of disks in different nodes is performed by the parallel file system (PFS) [JWB96]. With such an organization, high-speed data transfers can be achieved by accessing data from different disks in parallel. If different compute nodes access different parts of a file, there is a good chance those parts will lie on different I/O nodes. Multiple disk transfers can then occur in parallel, thereby improving performance.

^Interfaces to external networks, if they exist, are also assigned to I/O nodes.

1.2 Moving Data Out of the MPP

In the last few years researchers have started to consider moving the data used by their high-performance applications out of the supercomputers. This is happening for several reasons. New solutions are being designed in order to have an application's data available over the network, instead of just inside the MPP. Driving this technology change are new needs previously overlooked, changes in network and I/O technology, and new computing paradigms.

1.2.1 The Need to Share Information

The first parallel file systems were all designed with specific architectures in mind. That made it difficult to transfer a data file from machine A to machine B if their architectures were different. In many cases, files had to be individually processed to fit in a new PFS. When researchers from different facilities worked together on one project, it was necessary to maintain copies of their data sets in each facility.

Some techniques have been proposed to reduce the problem of transferring large data sets between cooperating research facilities. Such solutions usually start by retrieving a complete data set from its remote location. Only when all the information has been received is the application launched, and results are stored locally. If necessary, updates to the original data set are performed afterwards [FKKM97]. As processing power increases, these sets grow in size to a point where in some cases replication is not always possible. This is a problem, especially considering that in many applications just small portions of the data sets are used.

1.2.2 Meta-Computing

There are problems that are better solved in a vector supercomputer, like compute-intensive matrix calculations, while others are better suited for an MPP, like general bag-of-tasks problems, and yet others are ideally solved with data parallel machines. Many important problems have different stages that fit better in different machines. One good example of such a system is a global climate model system [GFLH98]. It must include simulation models for atmospheric variations, full-depth ocean currents, land changes, sea-ice models and even chemistry models, not to mention the visualization algorithms. Each separate element may require a different space and time granularity, may be modeled with algorithms fitting different architectures, and in some cases may be so large that it will not fit any individual machine.

This kind of requirement has led to the development of meta-computers, large systems resulting from the combination of machines with different architectures in order to solve complex problems [GFLH98]. For the most part, there are no standard programming interfaces or unified access and control facilities. Instead, each application has to define its own communication mechanism, usually developed in an ad-hoc way [VDK92].

1.2.3 New Network Technologies

The needs mentioned in the previous sections would not be enough to cause changes if network performance was still a limiting factor, but developments in different areas have contributed to reduce this problem.

• In the traditional network realm, new technologies like Gigabit Ethernet and ATM have created high-performance networks capable of Gigabit speeds. Such networks are already available for use even in personal computers.

• Some of the protocols defined for I/O buses are becoming flexible and powerful enough to be used as network protocols themselves [SLS94]. For example, HIPPI was developed originally as a bus protocol for high-performance I/O systems (disks), but with the development of switches with support for routing and the definition of how to encapsulate TCP/IP traffic in HIPPI packets, it is now available for network applications [Ren97].

• For a long time, work has been done on supercomputer interconnect technologies. Despite their specialized use and limited dimensions when compared to a LAN, these interconnects provide routing and other features required from a general network architecture. There are cases of supercomputer interconnect technologies being adapted to create network fabrics for individual computers; for example, the ATOMIC network [FDCF94] and Myrinet [BCF+95].

All these developments bring network performance to a level that allows networks to be used to access external data storage, outside the specialized supercomputers.

1.2.4 Network Attached Peripherals

The convergence of network and I/O technologies, along with the reduction in disk costs, has led to new research concerning the use of peripherals connected to a network instead of being restricted to a single computer. Such devices are based on a dedicated processing node serving I/O requests through the network [SIO95]. The goal is to take advantage of low disk prices and high parallelism to deliver high performance and high reliability at low costs. Although the interfaces offered to client nodes vary widely, systems based on network-attached devices are intrinsically dependent on network I/O, since the network is a key element in the design [G+98, Har99].

1.2.5 Workstation Clusters

Improvements in network and microprocessor technology allow high-performance systems to be designed by combining individual workstations with off-the-shelf network interfaces. Projects like Beowulf [BSS+95] and NOW [C+97] achieve supercomputing performance for some applications with widely available hardware. In such systems, each node can be a complete computer, with its own disk and network interface. At first inspection, there is no I/O bottleneck to worry about: each node handles its own. But this is almost never possible. Although each node may use its own disk for temporary storage, input data and final results must be laid out in a consistent way, and must be easily accessed. Many times during the execution of an application, data produced in one node is needed by another one. When this happens, transfers are unavoidable [Ste96]. In general, each node may be capable of storing data, which may eventually be requested by some other node. In a way, every I/O operation may potentially become a network access.

1.2.6 Standard Parallel I/O Interfaces

All the initial PFS implementations had proprietary interfaces, and application programs were highly dependent on the data placement techniques used in each case. Even if data could be moved from one architecture to another, programs had to be completely redesigned to take the new data organization into account. Only recently have standard interfaces been proposed, allowing different machines to share a common access method despite their different internal organizations [CPD+96, NK96]. On a higher level, portable parallel programming environments like MPI and PVM have also been expanded to include a set of parallel I/O primitives [CFF+96, MS96].

1.2.7 The Web

The World Wide Web has gained such widespread use that some Web servers have to be implemented using extremely powerful machines, usually MPPs, but also sometimes workstation clusters [KBM94, AYHI97]. For such servers, any request implies a network connection with the client, and sometimes may also include access to a high-performance PFS. If such a machine is to perform as a server, it must be able to handle a very large number of concurrent data transfers over the network continuously, which requires new solutions for reducing the network subsystem overhead.

1.3 Problems With MPP Network Subsystems

The development of high-performance applications that access data over the network has not been without problems. Despite the advances in network and hardware technology, network subsystems in massively parallel processors have not been able to deliver performance at the expected levels. There are two causes for this problem: (1) the operating system overhead due to the network subsystem, and (2) the inadequacy of the network protocols used for the actual transfers.

For many years all efforts in the area of massively parallel computers have focused on improving performance of compute-bound applications. Such applications perform few I/O operations compared to their CPU utilization, and most of their inter-processor communications are to exchange intermediate results among the compute nodes. Research in this area has therefore focused on improving processor performance and reducing the latency of inter-process communication primitives. Since the bottleneck started to move to the I/O subsystem, research has been mostly concerned with improving communication rates to the internal disks, not the external network. The latter has been regarded as the interface to the supercomputer front-end machine, used just for machine administration and job submission. Now that high-performance interfaces are becoming available for supercomputers and data is being moved over the external network, their network subsystems are beginning to limit overall performance.

While in the supercomputer interconnect highly specialized protocols can be used to take the best advantage of the hardware features, when applications start to cross the external network for data they have to rely on standard network protocols. Considering the need for reliable data transfers and for flow and congestion control, the protocol of choice has been TCP.^ Although it has been updated to include features needed for operation in high-performance networks [JBB92], there are still problems in its use for PFS. First, being originally designed for wide area networks (WANs), some of its features limit its performance in local area networks (LANs). Second, it is designed to handle individual connections. In an MPP, applications will usually start multiple concurrent connections, with behavior and semantics affected by their overall combination. Using a standard protocol does not allow the system to express and make use of this combination.

^Distributed file systems for networks of workstations often use other protocols (e.g., NFS uses UDP), but that is not the case with parallel file systems so far.

This dissertation addresses both of these problems and proposes solutions that improve the overall network I/O performance in massively parallel processors by changing the organization of the network subsystem and by adding functionality to TCP to improve its performance when used in multiple cooperating sessions.

1.4 Thesis Statement and Contributions

This dissertation starts with the hypothesis that the current organization of the network subsystem in MPPs and the use of TCP for concurrent connections are causes of the poor network performance in those systems. For better results both elements must be reviewed so that better solutions can be found.

The goal is to improve the performance of the network subsystem in parallel processors. To achieve this goal, we analyze in detail the problems outlined in the previous sections, with results from an actual system and network simulations. Based on these observations, the contributions of this work are new techniques developed to improve the network utilization in those systems:

• Moving most of the network subsystem data path from the operating system to each compute node improves performance by increasing the parallelism in processing multiple connections. Instead of using the OS centralized network server, each node handles its own connections.

• Adding new rate control features to the TCP/IP protocol stack improves the congestion control algorithms for the local network case. It also allows the protocol stack to take advantage of the cooperating behavior of a set of concurrent connections (a minimal pacing sketch follows this list).
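To make the idea of rate-based pacing concrete, the sketch below spaces packet transmissions according to a bandwidth estimate shared by a set of cooperating connections. It is a generic illustration only, not the controller developed in Chapter 5; the structure and names are invented for this example.

/* Generic rate-based pacing sketch (illustrative only, not the
 * dissertation's rate controller). */
typedef struct {
    double shared_rate_bps;   /* bandwidth estimate shared by cooperating connections */
    double next_send_time;    /* earliest time (seconds) the next packet may leave    */
} pacer;

/* Return the time at which a packet of `len` bytes may be sent, and
 * advance the pacer so successive packets stay 8*len/rate seconds apart. */
static double pace_packet(pacer *p, int len, double now)
{
    double interval = (8.0 * len) / p->shared_rate_bps;
    double when = (now > p->next_send_time) ? now : p->next_send_time;
    p->next_send_time = when + interval;
    return when;
}

Because the estimate is shared, any connection in the cooperating set benefits from what the others have learned about the available bandwidth, which is the intuition behind the cooperative shaping developed later in this dissertation.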

The solutions proposed can be combined to create efficient PFS implementations operating over high-performance networks with little operating system overhead and improved protocol behavior.

1.5 Overview of This Dissertation

Chapters 2 and 3 address the network subsystem in MPPs. Chapter 2 explains the usual OS organization in current machines, focusing mainly on the network subsystem. The different possible organizations are described, and their problems discussed. Next, Chapter 3 examines the multiple performance overheads found in an actual MPP architecture, the Intel Paragon. It then describes the performance improvements achieved by moving the network subsystem into the application space in that architecture.

Chapters 4 and 5 then address the protocols used in parallel file systems. Chapter 4 presents a discussion of the TCP features relevant for its use as the PFS's transport protocol. The original implementation of the protocol is described, as well as a set of techniques designed to improve its performance. It then discusses which features are a problem in this case, and explains why the cooperative behavior of sessions is important and why it cannot be used by TCP. Chapter 5 presents performance results of adding cooperative rate-based traffic shaping to the TCP/IP stack. The new technique is explained with an emphasis on its ability to take advantage of the cooperative nature of the connections and its positive effect on congestion avoidance.

Closing this dissertation, Chapter 6 presents the conclusions, discusses possible future directions, and offers some final thoughts.

CHAPTER 2

PARALLEL I/O ISSUES

To understand the problems related to the network subsystem in massively parallel processing systems, we must understand the problem of parallel I/O and how it is handled. This chapter starts by explaining the importance of parallel I/O in current MPPs and proceeds to discuss how the hardware and the operating system have been adjusted to better handle I/O-bound applications, including the requirements on file systems designed to operate in MPPs. Finally, it discusses how current systems deal with network connections.

2.1 The Parallel I/O Problem

To illustrate the problems that data input and output pose on multiprocessors, consider the problem of composite image generation in remote sensing data processing [WM97]. A satellite builds images of a region, one line at a time, using different sets of sensors to detect different wavelengths (e.g. near infrared, red and green). The lines from each sensor are concatenated and stored. The resulting image files have the three readings for each line, in order, instead of grouping the three colors for each pixel, as in usual graphical representations. Images of any region are acquired multiple times (say, once a day). On some days the cloud coverage may be too thick to yield a good image of the terrain, while on others the same region may be clearly visible.

This problem makes these raw data files unsuitable for higher level purposes. They must be combined into composite images, which use the best pixels possible for each area along an image time series. The first step for composite construction is to derive a set of image indices from the three wavelength values for each point. Such indices may be compared later to determine automatically the quality of each pixel based on physical properties of the ground atmosphere. The exact expressions and their meanings are not relevant in this discussion; it suffices to say that each index for a pixel is a function of just the three wavelength values of that pixel; there are no neighborhood effects (see the sketch at the end of this section). There is very little processing involved, but a lot of data must be input, and sometimes ten times that amount must be output as different indices.

Assume we have a multiprocessor with twelve compute nodes to process this information, and images are stored one per file, with interleaved lines for each sensor. Each node is supposed to read a line from the file, compute the various indices and save the resulting new lines in other files (ignore the synchronization problem to determine which line each node must process next). The simple solution is to attach a single I/O processor to the system and to keep all the data in it. The problem is that such a solution does not scale. The I/O processor is quickly swamped by the requests from all nodes. The I/O problem in massively parallel processors can only be solved efficiently if parallelism is used at the disk-access level. The solution is to add disks, not to one point in the MPP, but to a set of nodes: the so-called I/O nodes.
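The text deliberately omits the actual index expressions, so the sketch below uses NDVI, a standard remote-sensing index, purely as a stand-in; the line length and sensor layout are also assumed. It only illustrates why each scan line can be processed by any compute node independently.

#include <stddef.h>

#define LINE_LEN 4096   /* pixels per scan line (assumed value) */

/* One scan line per sensor: near-infrared, red, green. */
typedef struct {
    float nir[LINE_LEN];
    float red[LINE_LEN];
    float green[LINE_LEN];
} scan_line;

/* Each output pixel depends only on the three readings for that pixel;
 * there are no neighborhood effects, so any compute node can process
 * any line without communicating with the others. NDVI is used here
 * only as an example index. */
static void compute_index_line(const scan_line *in, float *out)
{
    for (size_t i = 0; i < LINE_LEN; i++) {
        float sum = in->nir[i] + in->red[i];
        out[i] = (sum != 0.0f) ? (in->nir[i] - in->red[i]) / sum : 0.0f;
    }
}

The computation per line is tiny compared to the volume of data that must be read and written, which is exactly why the workload is I/O bound rather than compute bound.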

2.2 I/O Nodes in Multiprocessor Architectures

Parallelism is a key factor to guaranteeing good performance in current high-end machines. Almost all share the architectural principle of using computing nodes with some local memory linked by high-bandwidth, low-latency interconnections. They differ in the type of interconnection network (interconnect), and the inter-process communication (IPC) model used by computing nodes to interact. This is usually some kind of distributed shared memory (DSM) or direct message passing. Table 2.1 lists a few of the different combinations used.

Machine                    Interconnect        IPC Model
Cray T3E                   3D toroidal         hardware-assisted DSM
Intel Paragon              2D mesh             message passing
IBM SP                     Multistage switch   message passing
Connection Machine CM-5    Fat-tree            message passing
Beowulf, NOW               Switch (LAN)        message passing, software DSM
Tera MTA-1                 Switch              Shared memory

Table 2.1: Different multiprocessor architectures

The Tera machine is the exception in this list, being developed to hide latency delays in the access of a large shared memory by using a variety of sophisticated designs like deep pipelines and per-instruction task switching. Since it does not fit the general model for massively parallel machines adopted here, it is not considered in this work. In all the distributed memory architectures, message processing, routing, and delivery

Figure 2.1: Distribution of I/O nodes in MPPs. (Panels: (a) Centralized, (b) Partially distributed, (c) Fully distributed.)

The most common solution, illustrated in Figure 2.1.b, is to add I/O capabilities to a subset of the compute nodes and to distribute them along the topology in a way that takes advantage of the machine's communication patterns. In the CM-5, for example, I/O nodes are grouped together in complete sub-trees, so that I/O traffic is kept as separate from inter-compute-node messages as possible.

There is a lot of debate about whether I/O nodes must be used solely for disk access or if they can perform some computation as well. The argument for the latter is that disk bandwidths are low enough that there is some capacity available in those nodes that can be used to perform some other computations. Others argue that the increased complexity due to the combination of different tasks in one node ends up hurting performance. Except in a few cases where the I/O node is implemented with special hardware, the decision is left to the user, who in most cases prefers to use only the compute nodes, due to the greater simplicity of implementation. Only when the problem has some special feature that can clearly benefit from using the I/O nodes do users choose to include them in the computation.

2.3 Parallel File Systems

Once I/O nodes are added to the architecture, the operating system must be adapted to handle input and output to the special nodes. This dictates what kind of services are available in each node. Defining the structure of the operating system is just the first part of the problem, however. There are two other questions that must be answered: how are files laid out over the multiple disks, and what kind of interface is available to the programmer? The answers to these questions define the parallel file system (PFS).

2.3.1 Operating System Organization

Despite the variety of node architectures, interconnect technologies, I/O node organization, and programming interfaces, MPP operating systems fit into just a few main categories. This section considers the kernel structure in each node and the services available in them, as shown in Figure 2.2.

Fully differentiated: There are different kernels for different parts of the machine. Since services in each node depend on the local kernel, services are also differentiated. I/O nodes have larger kernels with support for specific devices, while compute nodes have kernels limited to basic process management tasks. Some nodes may be specialized as front-end interface processors or offer specific services.

Fully replicated: All nodes have the same kernel architecture, and there are no limitations on the operations available in any of them, except for those dictated by the hardware available (I/O nodes have local disk access, compute nodes do not). Any node in the system can start a network connection to the outside or control user access, for example.

Replicated kernel, differentiated services: Although all nodes have the same kernel architecture, configuration parameters limit the service classes available to processes in each node. This option tries to simplify configuration and administration, using a single kernel organization for all nodes, with administration files describing which services are available in each node.

Figure 2.2: Examples of different MPP OS organizations. (Panels: Fully differentiated, Fully replicated.)

There are obviously advantages and disadvantages for each model. Differentiation can potentially reduce the OS overhead in each node, limiting the amount of resources (memory, CPU cycles, etc.) allocated for kernel processing to its minimum. The downside is that it requires specialized nodes for certain tasks, and those nodes may become bottlenecks.

One of the best examples of an MPP with a fully differentiated operating system is the Intel Paragon [Int91]. It offers a Unix-like programming interface, but compute nodes run just a basic kernel, which offers little more than process and memory management support [RBF+89]. No complex system service defined by a Unix system call can be performed locally. It must be routed to a service node, a differentiated node that runs a Unix server on top of Mach. The differentiated approach is even more noticeable if the Paragon is configured to run the special SUNMOS kernel [SS94]. It achieves higher performance by simplifying the application even further, hiding most Unix calls completely.

The fully replicated approach, on the other hand, has configuration simplicity as one of its main qualities, not requiring any node differentiation. This makes all nodes potentially capable of performing any OS task. It does not abolish the differentiation between compute and I/O nodes; it just means that the kernel is basically the same, although some parts of it may be of no use for compute nodes. This is exactly the main disadvantage of this approach: in many cases a node's kernel may carry a high overhead due to features that might never be necessary. The IBM SP series is a good example of an MPP with replicated kernels. Each node has a complete AIX kernel, and behaves as an individual machine, controlling user access, providing kernel services, etc. A special interface on top of AIX allows programmers to treat the system as a multiprocessor. Each machine has its own version of system time, local file system (if available), and so on.

Although the fully replicated model is easier to configure, it has the problem that it may be hard to manage, since all machines can potentially perform any function by themselves. The replicated kernel, differentiated services approach, exemplified by Beowulf, for example, tries to solve this problem. In Beowulf, all nodes execute the same OS kernel, Linux. Despite the fact that each kernel contains a complete network subsystem, they have no path to the external world. Their subsystems are used solely to support internode communication. If multiple nodes require data from the outside, they must request it from the front-end node, which becomes a bottleneck in such cases [BSS+95].

In systems with differentiated services, whenever a certain service needed by the application is not explicitly defined for all nodes, the user may use application gateways to solve the problem. An application process running in the Beowulf front-end node serving as a relay between compute nodes and the external network is an example of such a gateway.

2.3.2 File Layout

The simplest solution to the file layout problem, illustrated in Figure 2.3, is usually called striping. Each file is divided into chunks of some size (stripes), which are distributed cyclically over the available I/O nodes. Chunk size may be fixed for the file system or decided based on file characteristics, but the main idea is that the file system exports what is conventionally called the canonical file representation: a one-dimensional byte stream formed by concatenating one chunk from each I/O node. Any compute node can address any portion of the file by defining an offset and length of access (the mapping is sketched after Figure 2.3). The main advantage of such a format is that applications based on conventional file interfaces can be used without modifications. The downside is that it cannot use any information about how the application accesses the data.

Figure 2.3: File striping using multiple I/O nodes
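The following sketch shows the arithmetic implied by cyclic striping of the canonical byte stream. The chunk size and the number of I/O nodes are illustrative values only, not parameters of any particular parallel file system.

#include <stdio.h>

#define CHUNK_SIZE   65536L   /* bytes per stripe chunk (assumed)  */
#define NUM_IO_NODES 4        /* number of I/O nodes (assumed)     */

/* Translate a byte offset in the canonical file into the I/O node that
 * stores it, the chunk index local to that node, and the offset inside
 * the chunk. */
static void locate(long offset, int *io_node, long *local_chunk, long *chunk_off)
{
    long chunk   = offset / CHUNK_SIZE;            /* global chunk number   */
    *io_node     = (int)(chunk % NUM_IO_NODES);    /* round-robin placement */
    *local_chunk = chunk / NUM_IO_NODES;           /* chunk index on that node */
    *chunk_off   = offset % CHUNK_SIZE;
}

int main(void)
{
    int node; long lchunk, coff;
    locate(200000L, &node, &lchunk, &coff);
    printf("offset 200000 -> I/O node %d, local chunk %ld, offset %ld\n",
           node, lchunk, coff);
    return 0;
}

A request of arbitrary offset and length simply decomposes into one such mapping per chunk it touches, which is why conventional applications can use a striped file without modification.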

In order to make some of that information available to the file system, some implementations use structured file views, where each node has access to a different portion of the file, and that portion carries information about the way data is to be accessed. This extra information, along with the restriction on which nodes can access which files, makes it easier to achieve higher performance, although applications have to be rewritten to make use of the interface.

2.3.3 Access Interfaces

The first parallel file system implementations were vendor-based, and included operating system level changes to extend the original Unix file system calls to work over a proprietary PFS designed for a given machine. They offered a familiar interface, which allowed existing applications to use them, but were restricted to their original systems, and lacked flexibility to handle complex data access modes. Although the general Unix interface was maintained, extensions had to be added to handle new problems, like how conflicting write commands issued by different nodes are resolved and how consistency issues in access sequences are handled.

Most of the newer solutions are library-based implementations focused on portability. Besides the original Unix file abstraction, they provide data representation primitives that allow programmers to express complex distributions and access patterns. These feature-rich interfaces make more information about the application available to the file system, so that some optimizations can be performed that would be impossible in a canonical Unix file system.

2.3.4 Strided Accesses and Collective I/O

In general, MPP applications tend to access data in small units laid over the file with some stride. For example, imagine a node reading a row of a matrix stored in column major order in a file: it will have to read one element, then skip over a whole column to get the next element in the same position of the next one. This means that it will access an element of data with a certain stride over the file, which would lead to a large number of isolated messages and disk accesses. (This is sometimes referred to as the naive algorithm for parallel I/O.) The first optimization allowed by special PFS interfaces is to express such periodic accesses in a single system call defining a strided access (a sketch contrasting the two appears after Table 2.2). That way, there are many fewer messages generated and traffic inside the MPP is reduced noticeably.

Even if all small accesses are compiled into a single request message, if each node is left to read/write from disk whenever it deems necessary, there is still no order in the way disks are accessed. I/O nodes are still forced to perform multiple random accesses to serve each request, resulting in seek delays and cache conflicts. It was to solve this problem that a new optimization was created, usually referred to as collective I/O [NF95]. In most cases in PFS-based applications, I/O operations tend to occur in waves, with a set of compute nodes accessing the disks right before or after synchronization barriers. When this is the case, at the beginning of a wave the application in each compute node sends a collective I/O message to each I/O node specifying all data that node requires. When an I/O node receives a message indicating that a collective I/O operation is in progress, it buffers all requests until it has one message from each compute node. At that point it can sort all requests using some criteria designed to reduce service times and serve them accordingly. Since the disk nodes decide the order in which requests are served, this technique is sometimes called disk-directed I/O [NF95].

Another form of collective I/O, called collective buffering, performs a shuffling of requests among compute nodes first, so that all requests to a given I/O node are concentrated in a selected compute node. After that, communication happens only for pairs of compute and I/O nodes. When the compute node receives all data it distributes it again to those taking part in the request. This only makes sense when interconnect latencies are much lower than disk access times, since it requires intense communication between compute nodes before and after the disk accesses.

These optimizations are critical to guaranteeing scalable I/O performance. They have the important effects of reducing the number of connections over the mesh and the number of messages exchanged, so that connections can transfer larger amounts of data in each message. Disk-directed I/O, for instance, has been shown to be 16 times faster than the naive algorithm in some cases [NF95]. A variation of double buffering called the extended two-phase method has been documented to perform more than 25 times faster than the naive algorithm when accessing strided data [TC96]. Table 2.2 lists some execution times (in seconds) when reading elements from large arrays with different strides.

Stride pattern               Naive    Two-Phase   Gain
Stride = no. of processors   210.8    9.33        ≈ 23x
Diagonal: prog. 1            53.1     2.84        ≈ 19x
Diagonal: prog. 2            87.2     4.39        ≈ 20x
Along columns                96.2     3.85        ≈ 25x
Along rows                   130.7    2.34        ≈ 56x
(Times in seconds)

Table 2.2: Performance of Collective Buffering
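The sketch below contrasts the naive access pattern described above with a single strided request. The calls pfs_read() and pfs_read_strided() are hypothetical, invented for illustration; they are not the interface of any of the file systems discussed in this chapter.

/* Hypothetical PFS calls: one request per element vs. one strided request. */
long pfs_read(int fd, void *buf, long nbytes, long offset);            /* assumed */
long pfs_read_strided(int fd, void *buf, long elem_size,
                      long stride, long count, long offset);           /* assumed */

/* Read row `row` of an n x n matrix of doubles stored in column-major
 * order, so consecutive row elements lie n*sizeof(double) bytes apart. */
void read_row_naive(int fd, double *row_buf, long n, long row)
{
    for (long col = 0; col < n; col++)          /* one small request per element */
        pfs_read(fd, &row_buf[col], (long)sizeof(double),
                 (col * n + row) * (long)sizeof(double));
}

void read_row_strided(int fd, double *row_buf, long n, long row)
{
    /* A single request describes the whole periodic pattern, so far fewer
     * messages cross the interconnect and the I/O nodes can plan their
     * disk accesses. */
    pfs_read_strided(fd, row_buf, (long)sizeof(double),
                     n * (long)sizeof(double), n,
                     row * (long)sizeof(double));
}

Collective I/O goes one step further: instead of each node issuing its strided request independently, the requests of a whole wave are gathered so that the I/O (or compute) nodes can reorder and merge them.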

2.3.5 Current Parallel File Systems

Table 2.3 lists some of the various PFS implementations available today, with some of their most important features.

Product                Platform   Base      File View               Collective
IBM PIOFS (Vesta FS)   IBM SP     OS        Canonical, Structured   No
Intel PFS              Paragon    OS        Canonical, Structured   Yes
sfs                    CM-5       OS        Canonical               No
PASSION                Paragon    Library   Structured              Yes
Panda                  iPSC/860   Library   Structured              No
PIOUS                  PVM        Library   Canonical, Structured   No
MPI-IO                 MPI        Library   Canonical, Structured   Yes

Table 2.3: Various PFS implementations

Most systems mentioned here offer at least a Unix-like I/O interface, but in most cases they are expanded with options that allow the definition of special access modes. Structured file view means the application may specify elements of the file's internal structure, making the expected access pattern explicit to the file system. The main structuring technique is explicit support for strided accesses, but it may also specify other features, like dividing a matrix in sub-matrices, instead of laying it in row (or column) major order.

IBM PIOFS: This is the production system based on the research prototype Vesta file system [CF96]. It offers a lot of flexibility in terms of controlling the file layout and access modes, but it does not have support for collective I/O in any form [C+95].

Intel PFS: Although data placement is restricted to standard striping, extensions to the Unix interface allow many different access modes, providing great flexibility, including collective I/O [Bor96].

CM-5 sfs: The oldest system in this comparison, it has little more than a striped file layout with a Unix-like programming interface, except that some extensions allow independent access by each node [LIN+93].

PVM PIOUS: Being implemented as a library in the PVM system, it is highly portable and offers most of the features expected from a good PFS, except for the lack of collective I/O [MS96].

MPI-IO: Like PIOUS, it is a library-based implementation in the highly portable MPI environment [GLS95], making it available in a great variety of architectures. It offers a wide range of services, with flexibility to express a wide range of disk layouts and access modes [CFF+96].

Panda [SCJ+95] and Passion [CBM+95] follow a different path from the others. Although they are library implementations, they are highly dependent on the underlying architecture, not being as portable as PIOUS or MPI-IO. They also limit file layout to partitioned solutions, where each node has access to a different portion of the file.

2.3.6 Communication Patterns

The final topic in parallel file systems relevant to this discussion is to identify the main communication patterns occurring during a data access phase in an MPP. In this case, we assume just internal I/O nodes. Figure 2.4 illustrates the four main modes.

Figure 2.4: PFS communication patterns inside an MPP mesh. (Panels: Independent, Collective, Scatter/Gather, Broadcast/Reduce.)

When all nodes are allowed to perform I/O on their own, with no support for collective I/O, we have independent accesses, which means that each compute node contacts all I/O nodes containing the data it needs. If there are M compute nodes and N I/O nodes, there may be up to M x N connections working in parallel (a small counting sketch follows below). The complex traffic patterns may lead to multiple congestion points, and managing all transfers may result in a high overhead. If the system supports collective I/O with request shuffling among the compute nodes, we end up with at most N connections from compute nodes to I/O nodes (assuming M > N), and what is more, each node has to handle at most one connection each time.

When all nodes require the same information from disk, like a problem description record from an input file, that data must be read by all compute nodes before they can proceed. This is called a broadcast, although it is usually started by the receiving nodes, as opposed to the usual definition of a broadcast operation. The opposite operation, although not very common, can also occur when all compute nodes are programmed to store a certain result, but which one gets stored is of no importance. This is usually referred to as a reduction, and it creates a set of connections with different sources, but a single end-point.

The use of I/O gateways, as discussed in Section 2.3.1, leads to the scatter/gather scenario. This can also happen if the operating system tends to concentrate all data paths in special servers. One important point to notice is that if we focus on the view of a single compute node in the independent communication pattern, we have a scatter/gather situation, just as when we focus on the I/O node we notice a broadcast/reduce pattern. For our purposes, it does not matter whether such operations address the same position in a file, just that they share the same combination of end-points among compute nodes and I/O nodes.
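As a back-of-the-envelope illustration of the connection counts above, the snippet below compares the worst case for independent accesses with the shuffled collective case; the node counts are arbitrary example values, not measurements from any system.

#include <stdio.h>

#define M 8   /* compute nodes (example value) */
#define N 4   /* I/O nodes (example value)     */

int main(void)
{
    /* Independent accesses: in the worst case every compute node needs
     * data held by every I/O node, giving M x N simultaneous connections. */
    int independent = M * N;

    /* Collective I/O with request shuffling: requests destined to each
     * I/O node are funneled through one chosen compute node, so at most
     * N compute/I-O pairs are active (assuming M >= N). */
    int collective = (M < N) ? M : N;

    printf("independent: up to %d connections\n", independent);
    printf("collective : at most %d connections\n", collective);
    return 0;
}

With 8 compute nodes and 4 I/O nodes this is 32 connections versus 4, which is why collective I/O also simplifies the traffic the network subsystem has to manage.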

2.4 Impact on Network I/O

Some of the characteristics of the problem change when the external network becomes a part of the parallel file system. Network interfaces differ from disks in some aspects, and these differences pose different requirements on network nodes. The network subsystem has different implementations in different systems, and it becomes an important part of the system not found in a local PFS. The existence of centralization points, whether due to the limited number of high-speed network interfaces or the use of a network switch, causes the communication patterns to change noticeably.

2.4.1 Network I/O Nodes

Although nodes connecting to external networks can be seen as very similar to disk nodes, there is one difference between the two media that must be considered: transfer speed, as illustrated in Table 2.4.

Technology          Disk/Net       Bandwidth
Ethernet            Network        10 Mbps
SCSI-1              Disk           40 Mbps
SCSI-2              Disk           80 Mbps
Fast Ethernet       Network        100 Mbps
ATM OC-3            Network        155 Mbps
SCSI-3              Disk           160 Mbps
ATM OC-12           Network        622 Mbps
HIPPI               Network/Disk   800 Mbps
Gigabit Ethernet    Network        1000 Mbps

Table 2.4: Bandwidth for different media

It is easy to see that newer network technologies have reached performance levels above those of local disks, especially considering that single-disk performance is usually much lower than SCSI limits. Network protocols are also more complex than I/O bus protocols, meaning that an I/O node cannot handle as many network interfaces as disks. Taking advantage of the higher bandwidths offered by network interfaces, massively parallel systems can use many fewer network interfaces than disk nodes to move the same data. This fact, and the higher overhead required for network protocol processing, usually force network interface nodes to be completely dedicated to that task. In some cases, general-purpose architectures like those used in the compute nodes are not even capable of handling data at network speeds, and dedicated hardware may be used, as in the case of the Intel Paragon ATM interface [FGP95].

On the other hand, sometimes the capacity of individual links in the MPP mesh may not be enough to handle all the throughput of new network technologies. To solve this problem, it may be necessary to coalesce several links into a single network I/O node, replacing several nodes. This solution is adopted in the Connection Machine CM-5, for example, where each HIPPI interface element is not a common leaf node in the FAT-tree structure; instead it replaces a complete sub-tree, to use all the capacity of the group of links connecting it to the rest of the tree.

2.4.2 Network OS Subsystems

Considering that external networking has received less attention than most other operating system functions in multiprocessors so far, it is not surprising that current systems have rather simplified solutions to that problem. Although multiprocessor operating systems in general vary widely in their implementation choices, their network subsystems usually fall into one of two broad categories:

Independent network processing: This is the usual solution in MPPs built using individual machines, like workstation clusters. Each machine, having a complete standalone operating system, has its own protocol stack. This means that the nodes are already capable of handling network traffic, as long as they have their own network interfaces. The problem is that connections from each node behave as though they are coming from independent hosts, and the cooperative nature that usually guides them in an MPP is lost. The IBM SP series and the NOW cluster are in this class.

Centralized network server: In order to guarantee that all connections appear as belonging to the same system, all network traffic is routed through a common node inside the MPP. Usually, this implies that a single interface is used to reach the outside network. The protocol processing may be centralized in a protocol server, a process responsible for processing all network traffic, as in the Intel Paragon, or the application may be forced to implement its own gateway, as in the Beowulf.

Table 2.5 summarizes the solutions adopted in each main MPP system available today.

Intel Paragon: Usually there are a few nodes with HIPPI or ATM interfaces. Protocol stack processing is the task of the Unix server, a user-level process executing outside of the Mach kernel. It may be co-located in the node with the network interface, but usually it is located in a selected compute node, the service node.

System                     OS structure              Network subsystem
Intel Paragon              Fully differentiated      Unix server
IBM SP                     Fully replicated          Independent nodes
Connection Machine CM-5    Differentiated services   Network server
Cray T3E                   Fully differentiated      Unix server and front-end
Beowulf                    Differentiated services   Front-end only
NOW                        Fully replicated          Independent nodes

Table 2.5: Network subsystem in some MPPs

IBM SP series: Each node has two network connections: one limited to the other nodes over the multi-stage switch, and the other, with lower capacity, to the external network. Network I/O may be performed directly by each node over that external interface as an individual machine.

Connection Machine CM-5: High-performance network interfaces replace small subtrees in the FAT-tree interconnect, to guarantee that the bandwidth from the interconnect to the interface can handle the capacity of the external network. Protocol processing is performed by a central network server, as in the Paragon.

Beowulf: Although each compute node has a network interface, their routing tables are restricted to accessing each other and the front-end node. Network I/O to machines outside the cluster is performed only by the front-end node.

NOW: Every node is capable of performing network I/O to machines outside the system over the same network interfaces used to connect to the rest of the cluster. If the cluster has a single central switch, that element is the only point where connections are combined in the system.

Even if an operating system offers only fully replicated network subsystems, some shared knowledge is always desirable. This allows connections from an MPP that cross an external network to be identified as sharing a common end point, which may be important in making better decisions about traffic management, congestion control, and so on. If there is some way to make all nodes label their connections in a consistent way, the network is able to identify them as cooperating with each other, and traffic decisions may use that information.

Centralization is one way to provide this shared knowledge. It happens often in current MPPs, even those using fully independent network subsystems. This is due to the fact that these multiprocessors are based on workstation clusters: the replication is a direct result from the use of nodes that have full operating systems, but the network technologies available to put such clusters together are all based on central switches. All connections to the outside must be routed through such a switch, and it becomes a (mostly passive) central element.

In addition, it is often the case that applications define their own centralizers to handle network traffic. Although in some cases (like the Beowulf) such centralization is imposed by the hardware, in many others it is used to simplify the implementation and to guarantee that the external network identifies all connections from the MPP as originating from a single point. In other cases, given that some network technologies may be powerful enough to carry the external traffic of all MPP nodes through a single interface, centralization happens at the interface level. This is the case, for example, with the Intel Paragon: all nodes may end up having their connections to the external network routed through a single HIPPI or ATM interface. This may or may not imply a centralized protocol server, although all systems in production today tend to adopt that solution.

The problem with centralized servers is that they put too much pressure on the nodes responsible for protocol processing, creating serious bottlenecks. A solution for this problem is discussed in the next chapter.

2.4.3 Communication Patterns Over the External Network

The centralization identified in the implementation of the network subsystem, either in the form of a protocol server or an application gateway, leads to noticeable changes in the communication pattern of PFS connections, as shown in Figure 2.5. With the centralization point, the traffic in the external network always shares some common element, and for the I/O nodes, it always appears to originate from a single point.

The same is true for the behavior inside the MPP, if we observe just the interior of the gray boxes in Figure 2.5. To a first approximation, the interior nodes communicate with the centralizing element over a high-speed interconnect that is in turn addressing the external network. This is an important point, because it allows us to consider the MPP with all its internal nodes as a high-capacity host when we study connection behavior in certain cases.

Figure 2.5: PFS communication patterns over an external network (panels include Independent, Collective, and Broadcast/Reduce).

This brings together the centralized protocol server and the application gateway solutions as variations of the same case.

2.4.3.1 The Workstation Cluster Case

The realization of the MPP/node equivalence due to the centralization element gives us a way to bring the case of fully independent network subsystems, such as those in workstation clusters, into the same framework. In that case, the notion of the enclosing MPP with its internal interconnect is gone, and all compute nodes are directly connected to the same switch.

In a sense, the global system fits into our model if we consider the switch to be the centralization point, as proposed before, but this is not of much practical use by itself. The switch is a preprogrammed, inflexible element, and we cannot hope to use it to improve the traffic behavior of the system by feeding it information about applications and system organization. We can, however, focus on the problem of a single node in the cluster communicating with the I/O nodes. In that sense, the communication pattern is similar to the original scatter/gather case, or the same as the one achieved by collapsing the MPP model with its internal nodes into a single host in the network. Such abstractions will be useful when we discuss what features are expected from the protocols in the system, as well as when we simulate MPP protocol behavior in Chapter 5.

2.4.4 Protocol Requirements

When parallel file systems were limited to a single MPP, little thought was given to the protocols used in the implementation. In most cases, they were custom protocols implemented by the interconnection fabric itself, usually with strong guarantees about packet delivery, ordering, error protection, and so on. When data starts to move out of the MPP and over the external network, however, we must turn to protocols that can provide reasonable behavior over more general configurations.

First, data loss and corruption become much more widespread than in internal interconnects, so the protocols used must provide error detection and even error correction, if possible. Second, they must be able to handle multiple concurrent connections efficiently and, if possible, make their cooperative behavior explicit to the network elements. Third, since the exterior network may be used for other, unknown traffic, the protocols must be able to handle congestion problems gracefully. Other requirements include the ability to operate across networks of distinct technologies, and to not incur too much processing overhead. Unfortunately, there is no single protocol that satisfies all these properties.

Most transport protocols provide error detection by applying a checksum algorithm to the transmitted data. Error correction is usually achieved by retransmitting packets received with errors. A different approach has been proposed that uses a redundant encoding that allows corrupted data to be reconstructed at the receiver. This technique, called forward error correction, avoids delays related to data retransmissions by using a different data encoding [Bie93]. Although it has been used in specific applications, the lack of a general protocol implementation and the added complexity make it hard to use in most cases.

The ability to handle multiple connections efficiently is a basic feature of most transport protocols. It is achieved by combining data streams through multiplexing [Ten89], and there are a few different levels in the protocol stack at which this can happen. A protocol may define independent connections for each data stream it handles, or it may multiplex streams with the same end-point hosts into a single connection.

The use of a single connection for a group of data streams helps to carry the information about their related nature over the network. Two protocols that have been proposed to provide such functionality for high-performance machines are F-channels [Ahu93] and the MultiStream protocol [PS93]. There is one problem with their use in a network PFS, however: they can only be used to combine data streams that share both end points. They could be used to multiplex all data streams between nodes in two separate multicomputers, but not to carry data streams from multiple nodes in an MPP to a set of different independent storage servers in a pool, for example.

Other protocols take the simpler approach of defining one connection for each data stream, using the connection identification to demultiplex the data for delivery. It is the task of the application (or a higher-level protocol) to open and handle the multiple connections. The network sees each data stream as an independent connection, and cannot identify any relationships between connections sharing some common behavior at the application level. This is the case with TCP, for example.
Although TCP lacks the ability to take advantage of the cooperating behavior of PFS connections, it has become the standard protocol for such systems since it provides all the other required features with relative efficiency: reliable connections with in-order delivery, error detection and correction (by retransmission), and flow and congestion control (although at the individual connection level) [Com95].

A last protocol that deserves mention is the Parallel Transport Protocol, developed at Sandia Labs [Ber95]. It is specifically designed to handle parallel I/O across networks. Its purpose is not to oversee the actual data transfer, but to provide an efficient and flexible way to describe the communication patterns needed at any time and to set up the actual connections to transfer the data using TCP.

2.5 Concluding Remarks

This chapter introduces the Parallel I/O problem and explains its importance for applications in massively parallel processing systems today. It explains how I/O nodes can be added to an MPP to solve the problem by exploiting parallelism at the disk level in different nodes of the machine. In order to make such a parallel structure available to applications, operating systems must be designed to handle the different tasks of compute and I/O nodes, and an interface must be defined to permit applications to specify how data is to be accessed. Such an interface, combined with the algorithms used to lay out files over the multiple disks, defines a Parallel File System and the communication patterns inside the MPP during an I/O phase.

This chapter shows that when the PFS is moved outside the MPP, the external network affects the design and behavior of the I/O system. Network nodes are able to handle higher data rates than disk nodes, reducing the number of I/O nodes required. Some systems have centralized protocol processing servers designed to guarantee that all connections are seen as belonging to a single machine. Others completely replicate the protocol stack in each node, but require additional application gateways if their connections are to be identified as belonging to a single entity.

Specific details about the improvements proposed for MPP network subsystems are discussed in Chapter 3. The actual protocols used, with their main characteristics and problems, are discussed in Chapter 4, and suggested improvements are presented in Chapter 5.

CHAPTER 3

DISTRIBUTED PROTOCOL PROCESSING

As discussed in the previous chapter, it is desirable that an MPP offer a unified identity when contacting other hosts in a network, no matter which node starts a connection.¹ In order to do that it is necessary for the machine to have some level of centralization for the protocol processing functions, so that all connections use the same IP address and share TCP/IP state (e.g., port numbers) properly. A problem occurs, however, when all protocol processing is centralized in a single node; such centralization can easily become a performance bottleneck.

This chapter presents one case study that exemplifies how performance can be improved if the protocol processing task is distributed among all compute nodes, removing the protocol server from the data path without destroying the unified identity of the connections. It begins by describing the Intel Paragon, the architecture used in this work, and the organization of its operating system (OS), including the network subsystem. Next, the performance of the Paragon's original TCP/IP implementation is measured and the various factors that limit performance are identified. Finally, a proposed solution is described in detail, and an analysis of the improvements achieved is given.

3.1 A Case Study: The Intel Paragon

The Intel Paragon [Int91] is a good representative of current MPP architectures. It is still in use in most supercomputer centers, and the most powerful machine in use today, the Intel ASCI Red, adopts most of its design principles [Int97]. The Paragon has one of the best developed network servers in an MPP so far, offering transparent access to all network interfaces in the machine from any node, and presenting a unified identity to outside machines. Nevertheless, such features are not achieved without a high price: the centralized network server used to achieve these goals is a performance bottleneck.

¹ If necessary, node identification can be implemented as a higher-level function.

3.1.1 Architecture

Each node in the Paragon, as illustrated in Figure 3.1, uses two or three Intel i860 processors in a shared-memory configuration. In addition to these general-purpose processors, each node has a specially designed communications circuit, the Mesh Router Chip (MRC). The MRC is responsible for the control of the interconnect links attached to that node, including routing tasks. One of the i860 processors operates as the communication processor, controlling all message traffic from that node to others through the MRC chip. The remaining processor(s) are available to user applications.

Figure 3.1: Organization of a Paragon node (general-purpose processors, local memory, communication processor, and the MRC 5x5 crossbar switch connecting to the neighboring nodes).

The nodes are organized in a two-dimensional toroidal mesh. The MRC is actually a five-way crossbar switch connected to the node's communication processor and to the node's four neighbors. Each link can transfer data at 175 MB/s, although the communication processor may not be able to handle that capacity in some cases.

The network interface considered in this study is the Paragon HIPPI interface, with 100 MB/s bandwidth. It is implemented as a daughter board connected to a general-purpose node. Although most of the time it is configured as a dedicated I/O node, nothing in the architecture forbids using the same node for application processing. These nodes are commonly referred to as HIPPI nodes.

3.1.2 Operating System Structure

The Paragon's OS is based on the OSF/1 AD operating system [RBG+93]. It uses a Mach 3.0 micro-kernel [RBF+89, BKLL93] in each node to control basic functions like execution scheduling and memory management, with specialized user-level servers to provide specific services. For example, disk I/O nodes have their kernels configured to perform all disk-related functions, including both the actual disk accesses and the management of in-memory disk caches, parallel file system consistency checks, and so on. All external accesses (user logins, standard Unix daemon services, etc.) are performed by a service node. This node is responsible for executing the Unix server process, which performs user access control, job scheduling, and other Unix services. Parallel programs are started in the service node and distributed to the compute nodes by the operating system, as specified by the application.

Unix system calls are implemented by a Unix library in each compute node, which implements some of the simple tasks, like getpid(). Most of the time, however, the kernel must be contacted to complete the task. This is done using Mach inter-process communication (IPC) to send a message to the kernel specifying the requested system call. Each kernel serves the requests that depend solely on local state, like simple process and memory management, immediately. It must, however, redirect those requests affecting the shared state of the Unix machine to the Unix server running on the service node. This redirection is done by remapping local Mach messages into NORMA IPC² [Bar91], which provides node identification and authentication, remote service addressing, and other tasks required for communication between independent nodes.

Not all inter-node communication happens over NORMA: user-level communication between compute nodes in the same application is possible without the heavy overhead of kernel authentication and marshaling, by means of an application library usually referred to as the NX messaging interface.

² An in-kernel implementation of Mach inter-process communication facilities for distributed-memory (no remote memory access) architectures.
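To make the system-call redirection described above concrete, the sketch below shows roughly how a Unix library stub on a compute node might package a system call as a message to its local Mach kernel. It is only an illustration of the mechanism: the message layout, the request codes, and the mach_call() helper are hypothetical stand-ins, not the actual OSF/1 AD interfaces.

    #include <stdio.h>
    #include <stddef.h>
    #include <sys/types.h>

    /* Hypothetical request codes; the real OSF/1 AD library defines its own. */
    enum syscall_req { REQ_GETPID = 1, REQ_WRITE = 2 };

    struct syscall_msg {        /* simplified stand-in for a Mach message */
        int request;            /* which system call is being requested   */
        int fd;                 /* file descriptor, when applicable       */
        const void *buf;        /* user buffer                            */
        size_t len;
    };

    /* Stand-in for the library/kernel boundary.  In the real system this
     * would build a Mach message and call mach_msg(); the local kernel
     * then answers from local state or forwards the request over NORMA
     * IPC to the Unix server on the service node. */
    static ssize_t mach_call(struct syscall_msg *m)
    {
        printf("forwarding request %d to the local Mach kernel\n", m->request);
        return (ssize_t)m->len;  /* pretend the whole request succeeded */
    }

    /* A write() stub such as the compute-node Unix library might provide. */
    static ssize_t my_write(int fd, const void *buf, size_t len)
    {
        struct syscall_msg m = { REQ_WRITE, fd, buf, len };
        return mach_call(&m);
    }

    int main(void)
    {
        return my_write(1, "hello\n", 6) == 6 ? 0 : 1;
    }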

3.1.3 Accessible Interfaces

The standard BSD socket interface is available to applications requiring access to the external network. It is available to compute nodes through the standard Unix library, which in OSF/1 AD is implemented using Mach messages to the kernel. When the requested service is located in a remote node, NORMA IPC is used, as mentioned previously. It is important to notice that NORMA is not an interface defined for user-level applications. All application requests are made to the local kernel, using local Mach messages. It is that kernel's responsibility to decide which services are not local, to identify which nodes must be contacted to fulfill a request, and to route messages over the internal mesh accordingly.

Another interface that is available in the case of HIPPI is what is called Raw HIPPI. This interface is accessible through the libhippi library: processes can build their own HIPPI packets without using higher-layer protocols [Int95]. It is used, for example, by special applications to access external disks connected to a HIPPI interface. In this case, an application may register a packet filter with the HIPPI node identifying the kind of packets it is willing to accept. The filter may operate on a HIPPI session identifier or on a combination of fields in the message body. If a process executing in a HIPPI node makes a libhippi call, it is handled locally by the Mach kernel. On the other hand, if the caller process is not local to the HIPPI node, the call is routed through the remote device driver interface to the correct node using NORMA.

The NX facility is also available in all nodes, so it may be used for any inter-node communication, as long as the two end-point processes belong to the same application. Being extremely light-weight, the performance available through NX is essentially the same as that available between the communication processor and the MRC chip in each node. This means that it is limited only by the OS interface between the application and the communication processor.

3.1.4 The Network Subsystem

All protocol processing in the Paragon is performed by the Unix server process running in a service node [LR94]. The first reason for this is that all protocol code in Mach was derived from the in-kernel BSD implementations [WS95], and it was easier to keep the centralized organization as it was. Another reason is that there are some tasks in protocol processing that require global knowledge of the state of all connections in one host, for example to allocate a TCP port number [RH91].

The service node usually has an Ethernet interface that allows users to access the system remotely. However, high-performance interfaces may be attached to other nodes, independent of the location of the protocol stack (Unix server). In such cases, the protocol server uses the Mach remote device driver interface [FGB91] (which in turn uses NORMA IPC) to access the network card in an I/O node. The I/O node holding the network interface does not have any protocol processing capabilities itself. The Mach kernel in that node has the appropriate drivers to control the device, but at first it does not know what to do with any arriving packets. It is the task of the protocol server to register itself with the kernel in that node to receive packets. This is achieved by programming the Mach packet filter to route all packets to the Unix server process [YBMM94].
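The filter installed by the protocol server is essentially a predicate over packet headers. The fragment below only sketches the kind of test such a filter encodes, written as a plain C function rather than in the actual Mach packet filter language; the header layout shown assumes IP carried directly in the incoming frame, which is a simplification.

    #include <stdio.h>
    #include <stdint.h>

    /* Minimal view of an IPv4 header; only the fields the test needs. */
    struct ipv4_hdr {
        uint8_t  ver_ihl;
        uint8_t  tos;
        uint16_t total_len;
        uint16_t id;
        uint16_t frag_off;
        uint8_t  ttl;
        uint8_t  protocol;        /* 6 = TCP */
        uint16_t checksum;
        uint32_t src, dst;
    };

    /* Sketch of the predicate the Unix server registers: accept every
     * IP/TCP packet so that it is shipped (over NORMA) to the protocol
     * server for full TCP/IP processing.  A real Mach packet filter
     * expresses this as a small filter program, not as C code. */
    static int for_protocol_server(const uint8_t *frame, unsigned len)
    {
        const struct ipv4_hdr *ip = (const struct ipv4_hdr *)frame;

        if (len < sizeof(*ip))
            return 0;                     /* too short to be an IP packet */
        if ((ip->ver_ihl >> 4) != 4)
            return 0;                     /* not IPv4                     */
        return ip->protocol == 6;         /* TCP: hand to the Unix server */
    }

    int main(void)
    {
        struct ipv4_hdr pkt = { 0 };
        pkt.ver_ihl  = 0x45;              /* IPv4, header length 5 words */
        pkt.protocol = 6;                 /* TCP */
        printf("deliver to protocol server: %d\n",
               for_protocol_server((const uint8_t *)&pkt, sizeof pkt));
        return 0;
    }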

Figure 3.2: Standard Paragon network subsystem (application on the compute nodes, Unix server on the service node, and HIPPI node, connected over the mesh).

The organization of the network subsystem and the path taken by messages sent by an application in a compute node are shown in Figure 3.2. The numbers identify the main steps in the processing of each packet:

1. The application running in a compute node starts a write operation with some data.

2. The write is implemented as a call to a library implementing Unix services. One or more Mach system calls to the local kernel are issued (Mach messages).

3. The local kernel decides that the request cannot be serviced locally and marshals the message through NORMA to the service node.

4. The Mach kernel transfers the data to the communication processor, which in turn delivers it to the routing chip.

5. The request traverses the interconnect and is delivered to the node running the Unix server.

6. The message is processed by the NORMA IPC module to determine its validity and to extract the parameters.

7. The request is identified as addressed to the Unix server and so it is delivered; the server pushes it down the stack.

8. Assuming that the TCP code decides that a new segment must be sent at this time, a new message is pushed down the stack, until it reaches the raw HIPPI interface in libhippi. This adds the lower-layer headers and starts a new system call to deliver the data to the device.

9. The kernel in the service node uses the remote device driver interface to build a request to a device driver in another node.

10. Again, an inter-node message is marshaled through NORMA and delivered to the MRC routing chip.

11. The message is delivered to the HIPPI node, where the local kernel is scheduled to process the request.

12. The remote device driver access request is delivered to the kernel, which in turn checks the validity of the request.

13. The actual device driver is accessed.

14. The message is finally injected into the external HIPPI network.

The path for incoming messages is basically the reverse of this sequence. The most important details for that case not discussed previously are:

13. The device driver passes the incoming packet through the Mach packet filter to decide how to handle it. If the packet matches the pattern registered by the Unix server to identify a valid TCP/IP packet, it is shipped to that process.

7. Processing proceeds up the TCP/IP stack as in any machine. The steps from there to the application are only executed if a connection is found and the incoming packet made some data available for delivery.

There are a few points at which the path may be temporarily interrupted due to blocking operations. The most important one is certainly the TCP/IP stack itself. Packet transmission will only proceed down the stack if the flow and congestion control algorithms allow it. Otherwise, data will be stored there for later processing. A transmission request may also require that a destination address be identified and that routes be chosen, so data may sit in the stack while address resolution is performed by ARP, for example. Finally, on the receiving side, data may be stored or discarded by the Unix server if an end point for a connection is not ready to receive at a certain point in time or if no connection is established.

It is important to notice that the NX interface is not used at any point of this process. The interface between the Mach kernel (NORMA IPC module) and the communication processor, however, has essentially the same functionality (and overhead) as that of NX.

All steps between the service node and the interface node are shared by all connections, independent of their end points, and almost all protocol processing is performed in the service node. This concentration creates a bottleneck when the number of connections increases: all data streams have to go twice through the NORMA module in the service node, and all network traffic has to be processed in the Unix server.

3.1.5 The Target Machines

Most of the work reported in this chapter was performed on a Paragon configured with one service node for handling user accesses, one I/O node with a RAID unit, 16 compute nodes, and two I/O nodes with HIPPI interfaces. For experimental purposes, these two HIPPI nodes were connected to one another. For most tests, the machine was partitioned, with eight compute nodes and one HIPPI node in each of two partitions. This allowed the machine to behave as two independent HIPPI hosts. Nodes were positioned in the mesh so that all communication over the interconnect during the execution of applications was limited to the containing partition, to guarantee their independence.

The machine as a whole had just one protocol server, as discussed in Section 2.3.1. When tests required two machines to communicate using the original protocol implementation, that protocol server would be traversed by traffic from both partitions, doubling its load artificially and causing other problems, like increased cache conflicts. To avoid these problems, whenever the original Unix server protocol stack was the target of measurements, two production Paragon systems at the Caltech Center for Advanced Computer Research (CACR) were used.

3.2 Performance of the Current System

The first step in analyzing the proposed changes is to determine the performance of the current implementation. Once this is done, and the bottleneck problems are confirmed, all the possible limitations faced must be identified, especially the maximum bandwidth available for the Unix server protocol code, HIPPI connections, and NX inter-process communication.

3.2.1 OSF/1 Protocol Server

In order to identify the maximum performance offered by the standard implementation, we measured the performance of data transfers between two Intel Paragons connected using TCP/IP through a HIPPI switch located at the Caltech Advanced Computing Research facility (CACR). The transfers were implemented as one-way connections using the standard sockets interface to send a large amount of data. After the connection was started, 1 megabyte of data was transferred to bring the connection to a steady state, and then the time to transfer 16 MB of data was measured. Different message sizes were used at the socket interface. Measurements were repeated at least ten times for each message size, during times when the machines and their HIPPI interfaces were not otherwise loaded (as observed through programs like netstat and uptime). The results are shown in Figure 3.3.

The standard for IP over HIPPI links limits IP packets to 32 KB [Ren97], so there is no sense in increasing the message size above that value, except for a small reduction in the overhead in the socket layer. Throughput levels off at 1.97 MB/s for larger messages. This is obviously a very poor result, less than 2 percent of the link's capacity.
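The sender side of this measurement can be outlined as follows. This is only a sketch of the procedure described above, using the standard BSD socket calls; the peer address, port, and message size are placeholders, and error handling is minimal.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <unistd.h>

    #define MSG_SIZE   (64 * 1024)           /* message size at the socket layer */
    #define WARMUP     (1  * 1024 * 1024)    /* 1 MB to reach steady state       */
    #define MEASURED   (16 * 1024 * 1024)    /* 16 MB timed transfer             */

    static void send_bytes(int s, char *buf, long total)
    {
        for (long sent = 0; sent < total; sent += MSG_SIZE)
            if (send(s, buf, MSG_SIZE, 0) < 0) { perror("send"); exit(1); }
    }

    int main(void)
    {
        char *buf = malloc(MSG_SIZE);
        memset(buf, 'x', MSG_SIZE);                 /* touch the data before sending */

        struct sockaddr_in peer = { 0 };
        peer.sin_family = AF_INET;
        peer.sin_port   = htons(5000);              /* placeholder port     */
        peer.sin_addr.s_addr = inet_addr("10.0.0.2"); /* placeholder receiver */

        int s = socket(AF_INET, SOCK_STREAM, 0);
        if (connect(s, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
            perror("connect");
            return 1;
        }

        send_bytes(s, buf, WARMUP);                 /* bring the connection to steady state */

        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        send_bytes(s, buf, MEASURED);               /* timed 16 MB transfer */
        gettimeofday(&t1, NULL);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("throughput: %.2f MB/s\n", MEASURED / secs / 1e6);
        close(s);
        return 0;
    }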

Figure 3.3: Paragon TCP/IP performance (TCP throughput as a function of message size).

Fixing the message size at 64 KB to achieve the best possible performance, Figure 3.4 shows the measured throughput when a variable number of nodes open concurrent connections through the HIPPI interface. In this case we used a program that defined a partition with one to eight nodes in each machine, and then one connection was established concurrently for each node pair. Each connection started by transferring 1 MB of data, then all the sender nodes in one machine synchronized (by means of a barrier) before sending 16 MB of data. After each node completed its transfer it hit a barrier again, and time was measured when all processes reached the barrier. All data was touched at each end point: it was written before being sent and read after being received, to emulate the behavior of an actual application.

Figure 3.4: Paragon TCP/IP performance for multiple connections (aggregate throughput and throughput per node versus number of nodes).

The aggregate throughput shows the transfer rates at the HIPPI interface, measured as the number of nodes times the amount of data in each transfer, divided by the total time. The throughput per node is just the size of one data transfer divided by the measured time. It is clear that the system saturates with just two connections, at a value just a little above the throughput for a single connection. Extra connections just reduce the bandwidth per node.

These results are nevertheless not completely unexpected. From the analysis in Section 3.1.4 it is clear that there are many points that add overhead to the data path, especially the Unix server. Although it has been shown to be possible to implement flow control in the TCP/IP stack with great efficiency [CJRS89], data touching operations like checksumming can be time consuming [KP93]. Buffer handling, parameter marshaling between multiple interfaces, and protocol layering are other factors that cannot be ignored [KC94]. If we can transfer as much of this problem as possible to the connection end points in the compute nodes, performance is likely to improve due to the increased parallelism.

Before we address this issue, however, other communication costs should be clearly identified. These values will allow us to set reasonable goals based on the overall system performance.

3.2.2 Inter-node Communication

Although the standard network data path does not use NX communication, a test of maximum NX performance may help identify the maximum performance that can be expected from an ideal protocol implementation. Although the links of the interconnect are capable of speeds on the order of 175 MB/s, individual nodes may not be able to transfer data at those speeds continuously. To determine the maximum bandwidth available to a single node, a test program was designed to send data between two nodes using NX primitives. The size of the messages was varied from 1 byte (just a control transfer operation) to 1 Mbyte in a number of steps. Each run sent approximately 32 Mbytes and the process was repeated at least ten times for each message size. Figure 3.5 shows the average results.

The maximum rate achieved in this test, 67 MB/s, is well below the link capacity, so we find the first point where performance is lost.

Figure 3.5: NX IPC performance (throughput versus message size).

The problem is the operating system overhead when messages are exchanged between the application running in the general-purpose processor(s) and the communication processor, and between the latter and the MRC. It seems that the interface defined in the Paragon OS for that task has a lot of bookkeeping to do, and has to copy data multiple times to transfer information from the MRC to the application address space. All this limits the maximum throughput. The limit for the Unix server may be even lower, considering that it may be limited to using messages no larger than 32 Kbyte due to the IP over HIPPI standard previously mentioned.

It is easy to show that the Paragon OS (or its Mach micro-kernel) is the culprit here by comparing these numbers with measurements of the same transfer between nodes using the SUNMOS kernel [Dun94]. Under SUNMOS the transfer reaches up to 150 MB/s, much closer to the link capacity. Obviously, the Mach overhead limits the bandwidth between the application processor and the message processor.

It is worth mentioning that although the OS limits the maximum bandwidth of a single connection, it does not limit link utilization by connections with different end-point nodes which happen to be routed through a common link. In that case, all the capacity of the link is available (175 MB/s), and multiple transfers may take place concurrently, sharing the link.
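A minimal version of the NX bandwidth test could look like the fragment below. It assumes the iPSC-style NX primitives csend(), crecv(), mynode(), and dclock() with their usual signatures (normally declared by the Paragon's NX header); the message type and sizes are placeholders, and the real test swept message sizes and repeated each run several times.

    #include <stdio.h>

    /* Assumed NX declarations (normally provided by the NX header file). */
    extern void   csend(long type, char *buf, long len, long node, long ptype);
    extern void   crecv(long typesel, char *buf, long len);
    extern long   mynode(void);
    extern double dclock(void);

    #define MSG_TYPE  42L                  /* arbitrary message type             */
    #define MSG_SIZE  (64L * 1024L)        /* one of the sizes swept in the test */
    #define TOTAL     (32L * 1024L * 1024L)

    static char buf[MSG_SIZE];

    int main(void)
    {
        long peer  = (mynode() == 0) ? 1 : 0;    /* nodes 0 and 1 exchange data */
        long iters = TOTAL / MSG_SIZE;
        double t0  = dclock();

        for (long i = 0; i < iters; i++) {
            if (mynode() == 0)
                csend(MSG_TYPE, buf, MSG_SIZE, peer, 0);   /* sender   */
            else
                crecv(MSG_TYPE, buf, MSG_SIZE);            /* receiver */
        }

        if (mynode() == 0)
            printf("NX throughput: %.1f MB/s\n",
                   TOTAL / (dclock() - t0) / 1e6);
        return 0;
    }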

3.2.3 HIPPI Interface

It is clear now that a compute node cannot use all the bandwidth of a HIPPI channel if the message has to travel the mesh to reach the network interface, which is usually the case. But can it get even close to 67 MB/s? An experiment was designed to measure the maximum performance possible through the Paragon HIPPI interface. Data was sent (again, with varying message sizes and multiple measurements) between two nodes at opposite sides of a HIPPI channel using the raw HIPPI primitives offered by the libhippi library. Two cases were investigated. First, end points were located on the HIPPI nodes themselves, so that NORMA IPC and the remote driver interface were not used at all. Second, end points were placed on compute nodes. The results are shown in Figure 3.6.

Figure 3.6: Raw HIPPI performance (libhippi throughput versus message size, with end points on I/O nodes and on compute nodes).

In the first case, all overhead is due solely to the system call interface, the kernel, and device driver processing. This should give us an estimate of the maximum capacity of the interface when accessed by an application process in the HIPPI node, for example, if an application gateway were to be used. In the second case, the test included the extra overheads due to NORMA and the cost of accessing a remote device driver. This is the maximum performance one would expect in most cases for libhippi, since applications are usually run in compute nodes, not in the I/O node.

Again we find that the maximum performance we can expect from the system is much lower than HIPPI's peak numbers. The maximum performance from a compute node is around 10.8 MB/s, while the interface node itself cannot handle more than 24 MB/s. But the numbers we can expect when using TCP/IP are in fact much lower.

Remember that the standard limits IP messages to under 64 Kbyte, so that is the maximum message size one can expect to transfer in such cases. From the graph, this gives us a limit of approximately 9 MB/s when using libhippi from a compute node (which includes NORMA) and 13 MB/s when it is used in the I/O node itself. The fact that in general there are multiple connections in parallel may lead to slightly higher numbers, since there is a lot of concurrency available.

Component                      1 MB msgs    64 KB msgs
TCP stack                      2 MB/s       2 MB/s
libhippi from compute nodes    11 MB/s      9 MB/s
libhippi from HIPPI nodes      24 MB/s      13 MB/s
NX IPC                         67 MB/s      63 MB/s

Table 3.1: Bandwidth limits in the Intel Paragon

Table 3.1 summarizes the results so far. Even though the maximum performance possible through the HIPPI nodes is much lower than that of the standard, the performance of the operating system TCP/IP stack is still too low.

3.3 Distributed Protocol Stacks

Since the protocol stack implementation in the Unix server is too much of a bottleneck, the idea in this work is to apply the concept of user-level protocols (ULP) to a multiprocessor. First I explain the principles behind ULP and how they can be applied to an MPP; after that, two slightly different solutions are discussed.

3.3.1 User-Level Protocols

User-level protocols are implementations of standard protocol code in user space instead of inside the operating system kernel [MB93, EWL+94, EM95]. The goal is to provide applications with a streamlined data path for the protocol stack code, avoiding the operating system related overheads as much as possible. There are three important aspects that must be handled in order to use ULP efficiently: the handling of shared state, the implementation of efficient buffer mechanisms, and the proper handling of incoming packets.

Efficient buffer mechanisms are needed to guarantee that data is handled efficiently by each protocol layer and that overhead is avoided when it must be transferred across the kernel boundary (e.g., when accessing the device driver). A lot of work has been done in this area, including fbufs [DP93], I/O-Lite [PDZ97], and new buffer sharing techniques [BS98].

The problem of handling incoming packets is how to decide, at the device driver level, to which application to deliver each incoming packet. This is considered a solved problem, thanks to packet filters [MJ93]. These are essentially pattern matching engines that can be programmed to recognize a certain combination of values in some fields of an incoming packet and to associate those patterns with a given process. The Paragon OS uses the Mach packet filter [YBMM94]; this filter is also available to applications using the HIPPI interfaces through the libhippi library.

The most complex task is the control of shared state. Although each process may handle its own data connections, each time such a connection is established or torn down, information relevant to other processes and other connections must be updated and made available. For example, when a generic connection is created, it must allocate a new TCP port number, and that number must be unique in the machine. When a connection ends, the port number just released must be kept under control for a timeout period, to avoid it being reused before any stray packets destined to the old connection are guaranteed to have been removed from the network. Although some ULP implementations mostly ignore such problems [TNML93], a complete implementation must address them. A sketch of this port bookkeeping is given below.

The solution is to keep a protocol stack in the kernel to hold shared state and to handle arriving packets that have not yet been associated with any specific connection. All connections are set up by the protocol stack on behalf of applications. When a connection has been properly initialized, its state is transferred to the application, which then handles the data path. In the same way, when a connection terminates, information is returned to the kernel so that it can supervise any cleanup operations that may be needed.
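The following fragment sketches the kind of bookkeeping the shared-state component has to perform for TCP port numbers: ports must be unique across the machine, and a port released by a closing connection is quarantined for a timeout before it can be reused. The structure, names, and quarantine value are illustrative only; they are not the OSF/1 AD data structures.

    #include <stdio.h>
    #include <time.h>

    #define FIRST_PORT  1024
    #define LAST_PORT   65535
    #define QUARANTINE  120          /* seconds a released port stays reserved */

    enum port_state { PORT_FREE, PORT_IN_USE, PORT_QUARANTINED };

    static enum port_state state[LAST_PORT + 1];
    static time_t released_at[LAST_PORT + 1];

    /* Allocate a machine-wide unique port, skipping ports still quarantined. */
    static int alloc_port(void)
    {
        time_t now = time(NULL);
        for (int p = FIRST_PORT; p <= LAST_PORT; p++) {
            if (state[p] == PORT_QUARANTINED && now - released_at[p] > QUARANTINE)
                state[p] = PORT_FREE;             /* quarantine expired */
            if (state[p] == PORT_FREE) {
                state[p] = PORT_IN_USE;
                return p;
            }
        }
        return -1;                                /* no port available */
    }

    /* Called when a connection is torn down: hold the port for a while. */
    static void release_port(int p)
    {
        state[p] = PORT_QUARANTINED;
        released_at[p] = time(NULL);
    }

    int main(void)
    {
        int p = alloc_port();
        printf("allocated port %d\n", p);
        release_port(p);
        printf("port %d quarantined; next allocation gives %d\n", p, alloc_port());
        return 0;
    }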

3.3.2 ULP in the Paragon

The intended approach is illustrated in Figure 3.7. The protocol stack implementation in the Unix server is replaced with a user-level protocol implementation on each node requiring network connections. An x-kernel [HP91] protocol stack was used.

Figure 3.7: Proposed change to the network subsystem (an x-kernel TCP/IP stack linked into each application replaces the OSF/1 Unix server stack on the path to the device).

The use of the x-kernel provides a simple solution to the problem of providing the application with an efficient protocol stack. All elements of a TCP/IP stack are already defined in the x-kernel in a way that guarantees portability to different environments. Its internal structure is designed to guarantee efficient execution, including good buffer handling primitives.

Porting the x-kernel to the Paragon in user space required re-implementing its run-time support elements for the new architecture. The most important element was the implementation of x-kernel threads and events. These were implemented at first using a small core of assembly language routines for thread creation, suspension, and scheduling [Mos96]. This was later replaced by an implementation of Posix threads (P-threads) for the Intel Paragon. Although the assembly core had slightly better performance, it did not work well with other elements of the system, especially when Mach messages were used. All results in the following sections are based on the P-threads implementation.

Once the run-time system was ported, all protocols were readily available. The two remaining problems were how to integrate the x-kernel stacks with the Unix server, so that they maintained the shared TCP/IP state properly, and how to interface with the actual HIPPI devices.

To have the shared TCP/IP state updates implemented between the x-kernel and the Unix server would require altering the Unix server so that it would send a description of the connection to the application, where the protocol stack would be traversed to create the x-kernel sessions in its protocol graph.

The Unix server would then program the packet filter in the HIPPI node to route packets belonging to that connection directly to the compute node that created it. The fact that some of the Paragon machines available during this work (at Caltech CACR) were not open to changes to their operating systems made it impossible to work on this. Since TCP/IP shared state is only necessary during connection setup and teardown (it has no effect on data transfer performance), and since the ULP implementations could still be made interoperable with the standard Unix server stack, management of shared state was not implemented in this work.

The interface of the ULP stacks with the actual devices required the development of appropriate x-kernel device drivers. There were two possible solutions to this problem, which yielded different results. Both are discussed in the sections that follow.

3.3.3 Raw HIPPI Implementation

The simplest way to interface the ULP stack with the HIPPI nodes was to use the raw HIPPI mode of the libhippi library to access the Mach remote device driver. The resulting organization for this case is shown in Figure 3.8.

Figure 3.8: User-level protocol stack using NORMA (the x-kernel stack in the application's address space reaches the HIPPI node through the Mach remote device driver interface; the Unix server is bypassed).

As desired, this solution completely bypasses the Unix server protocol stack. All TCP/IP processing is performed at the compute node, in the application's address space, until a complete HIPPI message is built for a TCP packet. Then a call to a libhippi function accesses the Mach remote device driver interface, building the NORMA message. This message is shipped over the interconnect directly to the kernel at the HIPPI node, where it gets processed by the NORMA module and directed to the device driver.

The incoming path is essentially the same. At connection setup, the ULP stack uses libhippi calls to program the packet filter at the HIPPI node so that packets for that connection (identified by the IP addresses and TCP port numbers of sender and receiver) are associated with that process. Upon detecting such a packet, NORMA ships it to the remote device driver interface.
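The per-connection filter programmed at setup is keyed on the usual TCP/IP four-tuple. The fragment below sketches that registration and the corresponding match as plain C; the structure layout and the register_filter() helper are hypothetical, standing in for the actual libhippi packet filter calls.

    #include <stdio.h>
    #include <stdint.h>

    /* Four-tuple identifying one TCP connection (hypothetical layout). */
    struct conn_key {
        uint32_t local_ip, remote_ip;
        uint16_t local_port, remote_port;
    };

    /* Stand-in for the libhippi call that installs a packet filter on the
     * HIPPI node: packets matching the key are delivered to this process. */
    static void register_filter(const struct conn_key *k)
    {
        printf("install filter: remote port %u -> local port %u\n",
               k->remote_port, k->local_port);
    }

    /* The test the filter performs on the HIPPI node for each incoming
     * packet's IP and TCP header fields. */
    static int matches(const struct conn_key *k,
                       uint32_t src_ip, uint32_t dst_ip,
                       uint16_t src_port, uint16_t dst_port)
    {
        return src_ip == k->remote_ip && dst_ip == k->local_ip &&
               src_port == k->remote_port && dst_port == k->local_port;
    }

    int main(void)
    {
        struct conn_key k = { 0x0a000001, 0x0a000002, 5001, 6001 };
        register_filter(&k);
        printf("match: %d\n", matches(&k, 0x0a000002, 0x0a000001, 6001, 5001));
        return 0;
    }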

3.3.4 Raw HIPPI Performance Results

The main advantage of the libhippi device driver is its simplicity, given that most of the complexity of the remote access (packet filter programming, routing to the proper device) is hidden in the library routines. There is a lot of overhead in the NORMA communication layer, as discussed before. Even then, the performance of this ULP implementation surpasses that of the Unix server network subsystem, as can be seen in Figure 3.9 when compared to Figure 3.4. For example, for 3 nodes, the Unix server implementation yielded 2.2 MB/s aggregate performance (0.72 MB/s per node), while the NORMA-based stack achieves 7.4 MB/s aggregate (2.47 MB/s per node).

Figure 3.9: Performance of the NORMA-based protocol stack (aggregate throughput and throughput per node versus number of nodes).

Performance for a single connection is approximately 3 MB/s. It is still low compared to the available bandwidth, but it is almost 50% higher than that of the original implementation, which was 1.9 MB/s. As more connections are added the combined performance continues to improve, until it saturates at approximately 9.2 MB/s. That is still less than 10% of the channel capacity, however. Nevertheless, if we consider the libhippi tests (Figure 3.6), we can see that we are using all the bandwidth that it provides. By moving the protocol stack from the centralized Unix server to each node's individual address space, all protocol processing (except for the packet filter) can be performed in parallel by the compute nodes, and the system was able to use all the performance available, being limited only by the NORMA overheads.

3.3.5 NX Implementation

In order to avoid the NORMA overhead, a new solution was designed to use only NX communication between compute nodes and the HIPPI node. Since the Paragon NX library is defined among application processes, this requires an application gateway to be run in the I/O node, as depicted in Figure 3.10. The user-level process in the HIPPI node is responsible for all device accesses. For each connection opened by the application in a compute node, the gateway receives an NX message and registers itself with the packet filter as the receiver of packets for that connection.

Figure 3.10: User-level protocol stack using NX (compute nodes reach an application-level gateway on the HIPPI node through NX; the gateway accesses the device and the packet filter).

Although this implementation may serve as a proof of concept, it is not as efficient or as well organized as one would expect. There are many incompatibilities among the various software components necessary in this case, which dictate a certain organization:

• The Paragon User's Manual [Int91] states that NX communication primitives should not be used simultaneously with Mach messages. The NX/HIPPI gateway has to use NX to communicate with other hosts, and it has to use Mach messages to access the device, since all system calls are implemented over such messages.

• The libhippi library has no non-blocking read operation: a process must block until data is available. This requires a separate P-thread to keep polling the device driver at all times.

• The manual also states that NX calls are not thread safe and must be issued only from the process's main thread. A serious performance penalty occurs when this is not the case. To avoid this situation, all NX messages incur a P-thread switch to transfer control to the main thread.

• There are problems combining P-threads and global synchronization primitives (needed for accurate timing of test programs) that may cause applications to halt.

Due to all these limitations, the implementation is severely restricted. Although libhippi allows the use of different P-threads to block at different filter outputs (avoiding the need for the gateway to re-inspect the packet to find which connection it belongs to), the packet must be relayed to the main P-thread to be sent using NX. The same happens in the other direction, when the thread running on behalf of a certain connection must switch the packet to the main P-thread of that process for shipping over the interconnect. The resulting thread structure is sketched below.
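The gateway's organization, as constrained by the restrictions above, might look roughly like the following pthreads fragment. The hippi_blocking_read() and nx_send_to_node() helpers are hypothetical stand-ins for the libhippi and NX calls, and the hand-off queue is reduced to a single slot; the point is only the division of labor between a polling worker thread and the main thread, which is the only one allowed to issue NX calls.

    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical stand-ins for the libhippi read and NX send operations. */
    static int  hippi_blocking_read(char *buf, int max) { return snprintf(buf, max, "packet"); }
    static void nx_send_to_node(int node, const char *buf, int len) {
        printf("NX -> node %d: %.*s\n", node, len, buf);
    }

    static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
    static char pending[2048];
    static int  pending_len = 0;          /* single-slot hand-off queue */

    /* Worker thread: blocks on the HIPPI filter output and hands packets
     * to the main thread, since only the main thread may call NX. */
    static void *hippi_reader(void *arg)
    {
        (void)arg;
        for (;;) {
            char buf[2048];
            int n = hippi_blocking_read(buf, sizeof buf);
            pthread_mutex_lock(&lock);
            while (pending_len != 0)              /* wait for the slot to empty */
                pthread_cond_wait(&ready, &lock);
            memcpy(pending, buf, n);
            pending_len = n;
            pthread_cond_signal(&ready);
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, hippi_reader, NULL);

        for (int i = 0; i < 3; i++) {             /* main thread: relay over NX */
            pthread_mutex_lock(&lock);
            while (pending_len == 0)
                pthread_cond_wait(&ready, &lock);
            nx_send_to_node(0, pending, pending_len);  /* destination is illustrative */
            pending_len = 0;
            pthread_cond_signal(&ready);
            pthread_mutex_unlock(&lock);
        }
        return 0;
    }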

3.3.6 NX Performance Results

Despite all the limitations in the implementation of the x-kernel device driver using NX, the results are still better than those for the Unix server, as can be seen when comparing Figure 3.11 to Figure 3.4. With 3 nodes, aggregate performance is 6.7 MB/s (2.23 MB/s per node); the Unix server reached just 2.2 MB/s aggregate (0.72 MB/s per node). The NORMA-based stack had better performance in this case: 7.4 MB/s aggregate.

Although the performance for a single connection is below that achieved using the NORMA-based x-kernel device driver, the aggregate performance in this case scales well with the number of connections, at least up to six nodes. Instabilities in the interactions between NX, P-threads, libhippi, and the synchronization functions limit reliable results to six compute nodes, however. After six nodes the application hangs at random times.

The good result here is that combined throughput grows almost linearly until it reaches 13.2 MB/s, approximately the limit observed for data transfers for a single connection running in the HIPPI node itself (Figure 3.6). Again, using a ULP implementation allows aggregate performance to utilize all the available bandwidth in this case.

Figure 3.11: User-level protocol stack results (aggregate throughput and throughput per node versus number of nodes).

The instabilities with seven or more nodes may even be related to the fact that data is being moved at rates close to the HIPPI node's maximum capacity.

3.4 Comparison of the Different Solutions

To better analyze the results, the performance of the original network subsystem and of the two variations of the ULP stack are plotted together in Figure 3.12, for up to six nodes. It is clear that both x-kernel solutions have better performance than the original network subsystem. The limits that dominate both implementations are not intrinsic to them, but related to the operating system structure, as discussed previously. It might be possible to improve the overall performance by improving the organization of the micro-kernel structure, avoiding some of the overheads currently involved [Lie95].

It seems that the lower performance of an individual connection under the NX-based stack, when compared to the NORMA-based version, may be due to the higher complexity of the implementation, dictated by the limitations in module interactions. It might be possible to improve performance by re-implementing some of the NX and libhippi functions with more reliable versions, as well as by using a more reliable threads package (the current Paragon OS P-threads implementation is considered "beta" and does not match the latest versions of the standard).

The use of user-level protocol stacks in a massively parallel processor is a clear solution to the problem of providing a unified network identity to the external network.

Figure 3.12: Comparison of the various stack implementations (throughput per node versus number of nodes for the original Unix server stack and the two ULP stacks).

This technique achieves that while maintaining adequate performance by allowing connections to be processed in parallel. Machines which lack a unified identity, like the IBM SP series, could benefit from this approach by combining their individual protocol stacks with a virtual device driver and a special gateway at the node holding the high-performance external network interface, in a way very similar to the implementation of the NX-based ULP stack. One system that has experimented with such a solution is the Solaris MP research system [KBM+96].

3.5 Concluding Remarks

This chapter explains how protocol stacks can be implemented efficiently in massively parallel processors, in a way that provides a unified identity. The approach extends the concept of user-level protocols to distributed systems, transferring most of the protocol processing tasks to the connection end points in the compute nodes. An application of this concept to the Intel Paragon shows how performance increases when the Unix server, originally responsible for all protocol processing in that system, is removed from the data path. Two solutions using different inter-process communication techniques improve individual and combined session performance.

Once the network subsystem has been reorganized to better serve the needs of high-performance computing systems, the protocols used in such systems must be analyzed. Chapter 4 discusses the current protocols and their problems, and Chapter 5 proposes some techniques to improve their performance.

CHAPTER 4

PROTOCOL ISSUES AFFECTING PARALLEL I/O

When parallel file systems (PFS) are implemented inside supercomputers, applications use the inter-process communication primitives of the specific machine to reach and control the I/O nodes. Issues like routing, congestion control, and error detection and correction are all handled by the interconnection fabric. When the PFS is placed on the external network, this is no longer possible; the application must rely on standard network protocols to communicate with the storage servers. Considering that data transfers must be reliable in any file system, and that in most cases message order is also important, this means a connection-oriented, reliable service must be used. As discussed in Section 2.4.4, the only widely available protocol that provides such a service is TCP, and it has been used in many applications and PFS for high-performance computing.

This chapter discusses TCP's basic principles, the latest changes proposed to improve its performance for high-performance networks, and the deficiencies that it still has when PFS are considered.

4.1 Window-Based Flow Control

In order to offer guaranteed, ordered delivery, TCP uses a sliding window to control the flow of packets from sender to receiver. This window represents the amount of data that may be in flight at any time between the connection end points. The sender can never transmit more than one window of data without receiving acknowledgments from the receiver, and the receiver in turn must be able to receive a window's worth of data at any time. For each packet that is acknowledged by the receiver, the sender can move its window up by the same amount and transmit another packet. This has the effect of allowing one new data packet into the network for each acknowledgment (ACK) received. This technique is called self-clocking (or self-pacing), since TCP uses the ACKs for previous

packets to decide when to send each new packet. In cases where the sender always has data to transmit, window-based flow control leads to a steady state in which there is a full window of data in transit during every round-trip time. If an ACK is not received after a certain interval, the sender must assume the packet was lost, and retransmit it. The important element here is a reasonable estimate of the round-trip time (RTT) for the connection, so that the sender can know when to expect the ACK for a packet. TCP keeps a running average of the measured RTT, inflated to account for the observed variation in measurements. When an ACK takes longer than twice that estimated RTT value, the original packet is considered to be lost and retransmission occurs. One important aspect of self-clocking is that the connection may stall due to packet losses. When the receiver detects a hole in the sequence of received packets due to a loss, it cannot acknowledge any other packets received after the hole, although it will keep sending duplicate ACKs for the last packet received before the missing one. The sender window limit cannot be moved past the lost packet, so the sender can send at most a window's worth of data after it before it is forced to stop due to the lack of new ACKs. If the loss is not detected and fixed before the last packet in that window reaches the receiver, the connection will stall until a retransmit timeout occurs.
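As a concrete illustration of the running-average estimate described above, the sketch below (in Python, not the dissertation's code) folds RTT samples into a smoothed mean and deviation and derives a retransmission timeout from them. The specific gains (1/8, 1/4) and the four-times-deviation inflation follow common BSD-style implementations and are assumptions here, not values taken from the text.

```python
# A minimal sketch of the smoothed round-trip estimator described above:
# a running average of measured RTTs, inflated by the observed variation,
# yields the retransmission timeout (RTO).

class RttEstimator:
    def __init__(self):
        self.srtt = None      # smoothed RTT (seconds)
        self.rttvar = None    # smoothed mean deviation (seconds)

    def update(self, sample):
        """Fold one RTT measurement (seconds) into the running estimate."""
        if self.srtt is None:
            self.srtt = sample
            self.rttvar = sample / 2
        else:
            err = sample - self.srtt
            self.srtt += err / 8                          # low-pass filter on the mean
            self.rttvar += (abs(err) - self.rttvar) / 4   # and on the deviation

    def rto(self):
        """Retransmit timeout: the inflated RTT estimate."""
        return self.srtt + 4 * self.rttvar

est = RttEstimator()
for sample in (0.010, 0.012, 0.011, 0.030):   # hypothetical measurements
    est.update(sample)
print(f"srtt={est.srtt*1000:.1f} ms, rto={est.rto()*1000:.1f} ms")
```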

4.2 Congestion

Data in transit over the network take up buffer space in the routers (switches) along the path. As the number of connections in the network fluctuates, so does the amount of data in transit. There are times when buffer requirements exceed the buffer space available in the network, and some packets have to be dropped. At this point the network becomes congested, and if connections do not reduce their bandwidth requirements, this may lead to serious network collapses [Jac88]. To avoid that, connections must be able to detect the onset of congestion, and to react to it, reducing their bandwidth usage. To detect congestion, TCP relies on loss information. Packet losses are assumed to be always due to drops in congested switches, so a lost packet indicates the path is congested. TCP's window size determines the amount of data in transit, and therefore the bandwidth used by a given connection. Based on that observation, the protocol's standard congestion control mechanism relies on the control of a connection's window size to limit the bandwidth used. There are two different parts to congestion control in TCP. First, during connection start-up, it must decide how much network bandwidth to assign to a new connection. Second, after the connection is set, it must react to changes in the network due to new connections being established or connections being terminated. These issues are discussed in the sections to follow.

4.3 Connection Start-up

For all connection-oriented protocols, connection start-up is a crucial operation. During this phase, sender and receiver protocol stacks are initialized, hosts are identified and authenticated, and access permissions are verified. These are time-consuming tasks, and therefore should not have to be repeated on every access. The solution is for clients to establish connections to the servers once and keep them open, reusing them for each new request.

4.3.1 Setup Handshake

The first operation during connection start-up is to initialize the protocol stacks. For TCP, this means a three-way handshake must take place, during which both end points recognize each other, exchange and agree upon optional configuration parameters, and initialize the sequence numbers to start each connection, as illustrated in Figure 4.1. In TCP, three messages are necessary to guarantee that both hosts agree on which special options should be active for the data transfer, and to provide each host with the random value used as initial sequence number by the other end-point. The use of randomly chosen initial sequence numbers is necessary to ensure that any packets from a previous connection that might still exist in the network are not accepted as belonging to the new data stream. The problem with this technique is that a complete round-trip is necessary before any data packets can be transmitted, causing noticeable delays if the amount of data to be sent is small. Some new protocols have been proposed with simpler setup procedures, allowing data to be sent at once, but they lead to increased complexity in the handling of terminated connections [DDK+90]. For cases where the connection lasts much longer

Figure 4.1: TCP connection setup (three-way handshake: Seq=x with offered options; Seq=y with accepted options, Ack=x+1; Ack=y+1; connection established)

than a round-trip, as in the case of the connections used in a parallel file system, TCP is a reasonable solution.

4.3.2 TCP Slow Start

Hosts starting a new TCP connection have no way to know the available bandwidth. The two end-points may be located on the same high-capacity network, or they may be on opposite sides of a low-bandwidth link. If the sender starts transmitting as fast as it can, sending a large window's worth of data, congestion is likely to occur. The multiple losses in this case can lead to congestion collapses if multiple connections happen to be competing for links in the network and keep retransmitting lost packets. This actually happened in the early days of the Internet, when TCP behaved exactly that way. The solution to this problem is to force each TCP connection to increase its window (and therefore the used bandwidth) progressively, using the slow start technique. With slow start, every TCP connection begins with a window equal to a single packet, allowing only one data packet to be transmitted. Each time the sender transmits a window's worth of data successfully (as attested by the receipt of ACKs), it doubles the window size. That is, the window is doubled after each round trip interval during slow start. In practice, this doubling is achieved by increasing the window size by one packet for each ACK received, opening the window incrementally during the RTT interval instead of in one large step, as illustrated in Figure 4.2. During the first round-trip time, just one packet is transmitted. When its ACK is received, two packets are sent in the second

round-trip interval. Their ACKs in turn cause four packets to be sent during the third round-trip, and so on.
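The following short sketch, under the idealized assumption of a lossless network where every packet is acknowledged, reproduces the slow-start schedule just described: the window starts at one packet and doubles every round trip.

```python
# A minimal sketch of slow-start growth: the congestion window (in packets)
# starts at one and is increased by one packet per ACK, which doubles it
# every round trip on a lossless path.

def slow_start_schedule(rtts, initial_window=1):
    """Return the number of packets sent in each of the first `rtts` round trips."""
    cwnd = initial_window
    sent_per_rtt = []
    for _ in range(rtts):
        sent_per_rtt.append(cwnd)
        # one ACK returns per packet sent; each ACK opens the window by one
        cwnd += cwnd
    return sent_per_rtt

print(slow_start_schedule(5))   # [1, 2, 4, 8, 16]
```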

Figure 4.2: TCP slow start (data packets sent per round-trip interval: 1, 2, 4, ...)

When the connection detects that a packet has been lost, TCP concludes that the bandwidth used during the latest round-trip was above the capacity of the links. At this point, the connection reduces its window size by half (it assumes that the window during the previous interval was safe to use) and starts the congestion avoidance phase, discussed in Section 4.4.1. The reduced window size and the last measured round-trip time provide an estimate of the available bandwidth for that connection.

4.3.3 Proposals to Improve Slow Start

Although slow-start has been shown to work well in many cases, it has some problems when used in high-speed networks. First, if the link capacity is large, it will take many round trips before the window grows enough to use the full capacity. If the data transfer is short, the link will not be well utilized. Second, if the delay-bandwidth product of the path is high, when the connection window is doubled during the last RTT in slow start, there may be a lot of packets transmitted beyond the link capacity, causing multiple losses. In many cases, half of the packets in the last window size may be lost, requiring multiple retransmissions [Hoe96]. To solve the problem of low utilization for short transfers and to reduce the time taken to complete the slow start phase, new proposals suggest changing the initial window size from one to up to four packets [AFP98, AH098]. Although increasing the initial window seems to improve the performance of short-lived connections, it does not address the

problem of multiple losses at the end of slow start. Other research has addressed the problem of multiple losses by trying to provide TCP with better algorithms to estimate the available rate. If a connection can get a reasonable estimate for the available bandwidth, it can stop slow start as soon as that bandwidth is reached. This would avoid the losses that happen when slow start goes beyond the available bandwidth [Hoe96]. If such an estimate is found after a few round-trip times with good confidence, the connection may even bypass slow start altogether and proceed to transmit at the estimated rate as soon as it is identified, as is done in TCP Vegas* [BP95].

4.4 Congestion Avoidance Phase

Once the connection reaches steady state, the congestion avoidance phase begins. As link utilization varies continuously, it is not enough for a connection to choose a window size (as the one obtained by the end of slow start) and use it all the time. Other connections may end, and the remaining ones should be able to expand their windows to make use of the bandwidth just released. Similarly, new connections may be created, and the previously existing ones should reduce their windows to make room for the new ones. For the best utilization of the available bandwidth, it is crucial that TCP adjust connections based on the amount of congestion in the network.

4.4.1 Original TCP Congestion Control

Standard TCP congestion avoidance uses an additive increase/multiplicative decrease technique to change window size over time [Jac88]. As long as a connection does not experience any losses, it periodically increases its window by a fixed amount for each ACK it receives. This increase is usually chosen so that the window is opened by one packet during each round-trip time interval. When a loss is detected, the window is halved, and that value is assumed to be the new safe window size (i.e., the window is multiplied by 0.5 after each loss). Research shows that additive increase/multiplicative decrease schemes are stable in the presence of congestion [MSMO97]. One of the first TCP implementations to have congestion avoidance, TCP Tahoe, relies exclusively on slow start and the simple window change algorithm described above. In TCP Tahoe, every time a loss is detected the window is temporarily reduced to one packet and the connection initiates slow start again, restarting transmission with the lost packet. The halved window, computed when the loss was detected, is used as a threshold for slow start; as soon as the window reaches the threshold, additive increase resumes [SZC90]. Loss detection can happen by the activation of a retransmit timer, or by fast retransmit. Fast retransmit recognizes that the receiving end of a TCP connection sends duplicate ACKs for the last byte received when an out-of-order packet arrives. This means that if just one packet in a data stream is lost, the arrival of packets in the window following the one that was lost will generate repeated ACKs for the packet immediately preceding the missing one. Based on this observation, the sender interprets some number of duplicate ACKs (three, in typical implementations) as indicating a dropped packet. TCP Reno, the implementation in wide-spread use in the Internet today, adopts a new technique to improve fast retransmit: fast recovery. This technique tries to keep the pipe full after the loss, avoiding new slow start phases [Ste97]. The goal of fast recovery is to avoid draining the pipe (and disrupting self-clocking) when a loss occurs. Instead of resorting to slow start after fast retransmit, fast recovery works by immediately retransmitting the packet found to be missing. Window flow control is resumed right after that. This does not mean that the connection ignores the indication of congestion, however. Recall that right after a loss is detected the window is halved. Let W be the original window size. This means that right after retransmission there are approximately W bytes in transit. But now the window is just W/2, and any new ACKs arriving will only decrease the amount of data in transit (approximately W) by one packet. Only when W/2 bytes have been acknowledged does the amount of data in transit fall below the window size. From this point on, incoming ACKs are used to keep the self-clock working, and transmission resumes before packet flow halts completely.
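A minimal sketch of the additive increase/multiplicative decrease rule described above is given below; window sizes are counted in packets and the loss pattern is hypothetical.

```python
# Additive increase / multiplicative decrease over a sequence of round trips:
# the window grows by one packet per round trip while no losses occur and is
# halved when a loss is detected. `losses` marks round trips with a detected loss.

def aimd_trace(rounds, losses, initial_window=10):
    cwnd = initial_window
    trace = []
    for r in range(rounds):
        if r in losses:
            cwnd = max(1, cwnd // 2)   # multiplicative decrease on loss
        else:
            cwnd += 1                  # additive increase per round trip
        trace.append(cwnd)
    return trace

print(aimd_trace(12, losses={5, 9}))
```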

4.4.2 Proposals to improve TCP Congestion Control

Although TCP Reno has proved to be efficient in most cases, a lot of changes have been proposed in the recent past to address some deficiencies which became evident with the advent of high speed networks. These changes address links with large delay-bandwidth products, better solutions to packet retransmission, smoothing of packet rates over time, and completely new congestion control techniques.

The most important change to TCP is not a change to the congestion control algorithm itself, but to TCP headers, allowing it to handle networks with larger delay-bandwidth products. The original TCP header fields used by the end-points to represent sequence numbers (needed to convey information about packet ordering) and window sizes (exchanged between end-points to achieve congestion control) were not large enough to handle high-capacity links. The problem was solved by adding two options to TCP: window scaling, and timestamps [JBB92]. The overall effect is that the number space available to represent sequence numbers and window sizes has been increased by several orders of magnitude, with no ill effects on the overall protocol behavior. The changes have been added to TCP Reno, and are standard in the 4.4BSD distribution, sometimes referred to as Big Windows TCP [WS95]. Fast retransmit and fast recovery work well when only single losses occur, but there are problems when multiple, related losses happen close together. In such cases two problems may occur: the window may be halved multiple times, reducing the used bandwidth drastically, even though the losses might be all related and a single reduction might have sufficed [Flo95b]; and the duplicate ACKs may not be enough for a second or third lost packet to be detected and retransmitted. Both problems cause the connection to stall. One change proposed to fast recovery to avoid such stalls is to make better use of the information from duplicate ACKs: when fast retransmit occurs the connection enters a fast retransmit phase, which lasts approximately one RTT. During that phase, if other duplicate ACKs arrive after the retransmitted packet is acknowledged, they trigger other retransmissions, but without further halving of the congestion window [Hoe96, FF96]. A later proposal suggests the use of special packet markers (which might be implemented using TCP timestamp options) to better identify and limit a fast retransmit phase [LK98]. Another solution for the multiple losses problem is to replace duplicate ACKs with a new TCP option to use selective acknowledgments (SACK) [MMFR96]. This allows a receiver to explicitly indicate which packets were received out of order and exactly which packets are missing. SACK is a promising solution to this problem, but its use has not yet been added to implementations in the Internet since there are still some aspects of its use that have not been completely defined [BHZ98]. Some authors point out that SACK is not a new solution to congestion control in itself, but just a way to provide more specific loss information back to the sender. This information can, in turn, be used to implement better congestion control. For example, Forward Acknowledgment (FACK) TCP uses the detailed information from SACK to decide exactly when a fast retransmit phase ends, instead of always using one RTT for it [MM96]. All the solutions discussed so far improve TCP congestion control without changing its basic additive increase/multiplicative decrease aspect. All of them depend on creating congestion as a way of discovering the available network capacity. TCP Vegas takes a different approach. It attempts to use better round trip estimates to detect changes in the routers before losses occur, adapting continuously to maintain a load considered acceptable on router queues [BOP94].
There has also been some work on improving congestion control by changing router behavior, or by combining information from routers into TCP's basic congestion control. Random early detection is a new drop policy proposed for Internet routers to replace the basic FIFO queueing used in most cases. In this solution, drops are not applied just at the end of the queue; they are randomly distributed over its contents, instead. The goal is to distribute losses with increased fairness and to avoid multiple losses to a single connection in a short interval [LM97]. Another proposed solution combines TCP congestion control with information from the routers about when drops become likely to occur, replacing the implicit information obtained from lost packets [Flo95a]. Since these solutions require changes to and cooperation from routers, they are not of much use in high-performance computing systems. In these systems, traffic is mostly limited to the local area network, and the switches used offer simple FIFO queueing only.

4.5 Problems of TCP for High Performance Computing

Despite all the recent work on TCP congestion control, there are still problems for its use in applications for high-performance computing. Some have been partially addressed by the new techniques discussed in Sections 4.3.3 and 4.4.2, but others are particular to the new applications and have received little attention.

4.5.1 Explicit Delays

Although not directly related to congestion control, there are some cases in which TCP connections are forced to delay transmission of a packet even though there is space in the transmission window and it would not cause congestion. Although that is not a congestion problem, it leads to poor network utilization if such delays occur frequently. Delayed acknowledgments (delayed ACKs) are by far the most important cause of unexpected delays in TCP. When using delayed ACKs, instead of acknowledging a packet as soon as it arrives, the receiver postpones that action for a while. If the application in that host happens to have some information to send as a reply to the original sender, the ACK information can be piggybacked on the reply, avoiding the transmission of a control packet. To avoid delaying ACKs indefinitely, a timer is used; if no application data is sent back after 100 milliseconds, an ACK is sent anyway. This technique works fine in general, but it poses problems in applications in local area networks with high-capacity links and low delays. A departmental network using Fast Ethernet links, for example, has round-trip times on the order of hundreds of microseconds. When an ACK gets delayed, that round trip becomes more than 100 milliseconds: one thousand times longer! This can drastically limit the bandwidth available to an application. Other events may lead to an ACK being sent before the timeout even when there is no data to be sent. To avoid a problem known as silly window syndrome, TCP must send an ACK every time the receiver window gets extended by more than the size of the maximum packet length allowed for that connection (also referred to as the maximum transfer unit, MTU). If a connection has a large window and data is transmitted continuously, every packet will likely be as long as possible (an MTU), and every two packets will extend the receiving window by more than one MTU. In this case, an ACK is sent back for every other packet^. This alleviates the problem of delayed ACKs, but only if the connection always has packets to send. If application data transfers occur in blocks of multiple packets, and the next to last packet in a block causes an ACK to be sent, the ACK for the last packet will be delayed. It will take at least an extra 100 millisecond interval before the application can proceed with a guarantee that the data transfer was complete. In cases where applications use short (one packet) request messages before transfers start, delays may happen after every request. In one reported case, disabling delayed ACKs changed

the throughput of the Swarm distributed storage servers from 150 Kilobytes per second to about 5 Megabytes per second on a Fast Ethernet network [Mur99].

^This has an effect on slow start. Since ACKs are sent only for every other packet, and the sending window is increased by one packet for every ACK, the window is opened more slowly. Instead of doubling after each round-trip interval, it increases by 50 percent, opening by a factor of 1.5 instead of 2.
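As a present-day illustration of the "disable delayed ACKs" fix mentioned above (the original discussion assumes a kernel-level change), the sketch below uses the Linux-specific TCP_QUICKACK socket option, which asks the kernel to acknowledge incoming segments immediately; the option is not sticky and must be re-armed around receives. This is an assumption about one way the fix can be applied from user space today, not a mechanism taken from this dissertation.

```python
# Hedged illustration: request immediate ACKs on a Linux TCP socket.
# TCP_QUICKACK is re-armed before every read because the kernel may clear it.

import socket

def recv_with_quick_acks(sock, nbytes):
    """Receive nbytes, re-arming TCP_QUICKACK around each read (Linux only)."""
    data = bytearray()
    while len(data) < nbytes:
        if hasattr(socket, "TCP_QUICKACK"):
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)
        chunk = sock.recv(min(65536, nbytes - len(data)))
        if not chunk:
            break
        data.extend(chunk)
    return bytes(data)
```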

4.5.2 Timer Granularity

In most implementations of TCP, timers are not started and controlled on a per-packet basis. Instead, global timers are programmed to fire periodically, with all timing requests for individual packets then handled together. Delayed ACKs (Section 4.5.1), for example, are handled by a recurring 100 millisecond timer. This means that the actual moment when the timing begins is the next occurrence of the periodic timer, which may happen up to 100 milliseconds in the future. The delayed ACK will then be sent on the following timer activation, if the timing request is still active. This means that a delayed ACK timeout will not happen exactly 100 milliseconds after the packet is received, but anytime after 100 milliseconds and before 200 milliseconds. More important in this discussion than the delayed ACK timer, however, is the retransmit timer. Section 4.1 discusses how the round-trip time is used to determine when an ACK should be expected, and therefore when it can be assumed that a packet has been lost. The retransmit timer is not only responsible for handling the retransmission timeouts, but for actually measuring round-trip times. TCP/IP implementations in use today have their origins in the first days of the Internet, when computers connected to the Internet were based on processors operating at only a fraction of the speed of current machines. Many of the machines at that time had no reliable time reference beyond the millisecond range. Since most of the connections happened over low-speed, high-delay links, 500 milliseconds was enough granularity in most cases, so that was the time-base chosen for the retransmit timer. Round-trip estimates were achieved by counting retransmit timer ticks, and that is still

the way it is done today. Although the actual value of the RTT is computed as a running average with some digits of precision, in essence it means that no RTT timeout will be shorter than 500 milliseconds. Considering the huge variability of RTT measurements in the Internet, this is still a reasonable solution for most cases today, but this is not the case for high-performance computing within local networks. For such situations, half-second timer intervals are too long. Actual round trips are usually on the order of hundreds of microseconds; transmission rates are extremely high, and there are not many queueing points in the network. Even the maximum RTT bounds one sees in local networks are below one second. For applications in high-speed local networks, a packet loss that causes a connection to stall until a timeout occurs is a devastating event that drastically reduces protocol bandwidth utilization. The limitations of coarse grained timers in TCP implementations have been recognized for some time, but for applications over the Internet the argument is that the large variations defeat the purpose of finer-grained implementations. Nevertheless, at least one work has shown that better timers can be implemented for TCP, and that they may help improve performance in some cases [AD98]. TCP Vegas [BOP94, BP95] is one of the few solutions proposed so far that make use of the fine grained timers in current machines to implement a novel congestion avoidance mechanism. By making use of fine-tuned RTT measurements, TCP Vegas is capable of identifying small variations in round trips and of interpreting them as router queueing delay changes before they cause any losses.
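The quantization effect of the coarse timer described in this section can be seen in a few lines: the sketch below compares hypothetical local-area RTTs against what a 500 millisecond tick-counting implementation would report.

```python
# Coarse tick-counting versus fine-grained measurement: any RTT shorter than
# one tick is indistinguishable from a full tick, so local-area round trips of
# a few hundred microseconds appear as 500 ms. The RTT values are hypothetical.

import math

TICK = 0.5   # seconds per retransmit-timer tick (classic BSD-style granularity)

def rtt_in_ticks(rtt_seconds):
    """RTT as seen by a tick-counting implementation (never less than one tick)."""
    ticks = max(1, math.ceil(rtt_seconds / TICK))
    return ticks * TICK

for rtt in (0.0003, 0.002, 0.040):          # 300 us, 2 ms, 40 ms
    print(f"actual {rtt*1000:7.1f} ms -> coarse estimate {rtt_in_ticks(rtt)*1000:6.0f} ms")
```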

4.5.3 Packet Trains

Packet trains are multiple packets belonging to one connection that are sent back to back, with little or no inter-packet gaps. Theoretically, if a connection operates below the speed of the sender output link, packets should be equally spaced, maintaining the same sending rate over time. Each packet reaches the bottleneck router as another packet leaves, as shown in Figure 4.3.

Figure 4.3: Proper packet spacing causes no losses (packets from the fast link arrive at the router queue just as others leave over the slow link)

This is not the case when there are packet trains. When packets arrive close together, as shown in Figure 4.4, one or more packets in the train may be lost. This is not a loss caused by an excessive transmission rate, but a spurious drop. The connection should not reduce its overall sending rate at this point, although that is what will happen when the loss is detected. If two such losses happen close enough together, the sender may not be able to recover based on the information gathered from duplicate ACKs, and the connection may stall.

Figure 4.4: Spurious drops caused by packet trains (closely spaced packets from the fast link overflow the router queue before the slow link drains it)

Unfortunately, packet trains are a common event in practice. They occur for several reasons:

• Consider two independent connections operating in opposite directions between two hosts separated by a low speed link. The queues at each end of the link will almost always have some backlog of data packets. When a connection's ACKs reach the routers they may get queued after a number of data packets from the connection operating in the opposite direction. If some ACKs arrive without any intervening data packets, the end effect is that they are compressed in time and end up being transmitted one right after the other over the bottleneck link. Since they are short packets, when they arrive back at the sender they are so close together that the new packets transmitted (due to TCP self-clocking) form a train, following one right after the other. These trains tend to make the compression of ACKs behind them even more likely, exacerbating the problem [ZSC91].

• When some intervening ACKs get lost in a connection, any ACK for a byte later in the sequence is used to signify that all packets up to that point have been received successfully. This causes the window to open abruptly, with many packets being sent in quick succession.

• When a single packet in a window is lost, and then gets retransmitted by fast retransmit, many other packets that followed the lost packet in the data sequence may have been already received when the retransmitted packet reaches the connection end point. The receiver does not send one ACK for each out-of-order packet already received; it just sends a cumulative ACK for the last byte in the now complete sequence, which may be acknowledging multiple packets. A train will likely be created when that ACK reaches the sender.

• When the window is halved during fast recovery, as described in Section 4.4.1, transmission halts until the amount of data in transit is reduced to less than the new, reduced window. At that point the window starts to slide again. But ACKs are still arriving at the rate at which the corresponding packets were sent, before any congestion was detected. Although the new window guarantees that over the whole fast recovery interval the send rate has been reduced by half, new packets are grouped closely together, forming a train [Hoe96].

• Delayed ACKs in a steady stream cause ACKs to be sent back for every other packet, as previously discussed. This means that each ACK will likely open the window by two packets, causing two packets to be sent together^.

• If a connection stays idle for an interval longer than one RTT, slow start is used to restart the self-clocking scheme. But if the inactivity interval is short, packets are sent one after the other. This is a problem especially for fast networks, when the actual RTT (and the time it takes the pipe to drain) may be much shorter than that measured by the 500 millisecond timer. If data transfers occur in isolated blocks, each block may start with a long packet train.

Figure 4.5 shows a piece of a simulation run illustrating the effect of packet trains. It shows the times when packets are sent and when ACKs (including duplicate ACKs) are received in one connection between two hosts separated by a low speed link. During the simulation, another connection with similar characteristics is opened in the opposite direction.

^During slow start, three-packet trains occur.

Figure 4.5: Packet trains in a TCP connection (packet trains and duplicate ACKs plotted against time in seconds)

There are two groups of compressed ACKs arriving at 1.82 seconds, and another two arriving at 1.84 seconds. Each group causes a packet train to be sent almost immediately. If there was no ACK compression, ACKs should arrive equally spaced along the interval from 1.82 seconds to 1.84 seconds, for example, instead of in groups. Because of the trains, one of the first packets sent around 1.84 seconds is lost, and all packets after it cause duplicate ACKs to be sent. When the packet is retransmitted during fast retransmit, other duplicate ACKs trigger fast recovery and smaller trains are sent, at the same rate of the previous trains, which is too high for the congested network. These trains cause other losses, with more duplicate ACKs and new (shorter) packet trains, until the connection stalls. As long as TCP flow control operates based solely on self-clocking, breaking up trains in the TCP protocol stack is not possible. Most solutions to the packet train problem have dealt with changes to router drop policies: random early drop routers, for example, drop packets picked randomly from anywhere in the queue [LM97]. This tends to spread drops caused by a train over multiple connections, instead of resulting in multiple drops to a single one, a situation that is more likely to cause a connection stall. One proposal addresses packet trains created when transmission resumes in a connection that has been idle for a short while. By adding a simple pacing technique when restarting such a connection, packets are evenly spaced based on the current bandwidth estimate (window/RTT) [VH97]. In this case, pacing stops as soon as a window of data has been sent, however, so it does not avoid trains due to other problems.
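The restart-pacing idea just described amounts to spreading one window of packets evenly over one round trip. A small sketch of that spacing computation, with hypothetical numbers, is shown below.

```python
# Pacing by the bandwidth estimate window / RTT: packets are separated by a
# gap that spreads one window of data over one round trip instead of sending
# them back to back. All values are hypothetical.

def pacing_gap(window_bytes, rtt_seconds, packet_bytes):
    """Seconds to wait between packets so one window is spread over one RTT."""
    rate = window_bytes / rtt_seconds            # bytes per second
    return packet_bytes / rate

# e.g. a 64 KB window over a 1 ms round trip, 1500-byte packets
gap = pacing_gap(64 * 1024, 0.001, 1500)
print(f"inter-packet gap: {gap*1e6:.1f} microseconds")
```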

4.6 Cooperative Sessions

TCP was designed to handle individual (possibly bidirectional) data streams between two hosts, and it does a good job when connections are indeed isolated entities. The congestion avoidance algorithm guarantees a good distribution of bandwidth between competing connections that happen to share a common link in their path. Problems happen when connections with different path lengths happen to share a congestion bottleneck [HSMK98] and when the number of connections grows too large [Mor97]. Neither situation is relevant in high-performance computing, however. The problem for high-performance computing is that most of the time applications must use multiple parallel connections to reach a desired goal. As discussed in Section 2.4.3, those connections work together, and that should be considered when their individual bandwidth utilizations are considered. Unfortunately, TCP has no way to convey or handle this kind of information. It treats each connection individually, trying its best to give each one the maximum possible utilization of the link. One connection, for example, may end up utilizing a larger share of the link capacity to the detriment of others, for example because those others may have suffered losses caused by packet trains. For TCP, there is nothing wrong with that: each connection gets the bandwidth based on the losses it notices, and that is fine. But for a parallel file system application, this is not enough. The application as a whole cannot make any progress until data from all storage servers is received, for example. Considering that all connections carry equal amounts of data most of the time, one connection ending before the others is of no use to the supercomputer; it will have to wait for all of them to complete before it can continue. By the time only a few severely delayed connections remain, their aggregate bandwidth requirements may not be enough to fully utilize the network links. This will certainly lead to a sub-optimal network utilization. Consider the simplified connection trace^ in Figure 4.6, which represents the evolution of the sequence numbers over time for two connections having a common end point.

Figure 4.6: Poor link utilization due to one stalled connection (sequence numbers over time, 0.50 to 2.50 seconds, for the connections to Server A and Server B)

In this scenario, a client on a high-performance link (OC-3) is receiving data from two servers, each connected through 100baseT links to a central switch. In theory, the application would be able to achieve throughput equal to its link capacity, 155 Mbps, but the final result in this case is just around 70 Mbps. There are several causes for the under-utilization in this case:

• When one of the connections stalls at 0.8 seconds, there is just one server left capable of transmitting, and it is limited to a rate of 100 Mbps (100baseT). It cannot utilize all the bandwidth of the link to the client (155 Mbps). At least for some time, the client is receiving at just 100 Mbps, and that reduces the maximum rate possible.

• The rates are high enough that the server that did not stall manages to complete its transfer in less time than it takes the other server to recover with a timeout (in this case, that takes almost a whole second). After one connection finishes and before the other one resumes, the client link is completely idle, although there is data to be sent.

^This and other traces in this dissertation have been simplified by reducing the number of data samples plotted to improve their readability.

• Finally, when the remaining server resumes its transmission, the maximum speed will be limited to 100 Mbps, once again.

If TCP could identify the two connections as having a cooperative nature, sharing the same bottleneck link, it might have used that information to control both simultaneously. This might have allowed it to detect the multiple packet loss and the subsequent connection stall based on the information from packets that continued to arrive for the other connection. The fact that TCP control state carries information that might be useful for a group of connections sharing a common link or belonging to a single application has been identified by the Internet community [Tou97]. However, there is no consensus on how this information can be used. One work in this area altered the TCP stack in a Web server to combine information from connections originating in a same network. The operating system decides which originating addresses belong to the same network (and therefore share the same path) and then it groups such connections so that their states can be inspected by each other. Optimizations include special handling of the congestion windows and improvements to duplicate ACK detection during fast retransmit [BPS+97]. A loss in one connection causes all connections to reduce their bandwidth accordingly. This reduces the aggregate losses for all connections, since not all have to incur a packet loss to detect congestion. Connections originating in a same host in fact shared a single window. This allows the server to counteract excessive bandwidth pressure caused by browsers greedily opening multiple connections in parallel. The original TCP implementation has to wait for at least three duplicates before it can assume a packet was lost and retransmit it. If windows are small, there may not be enough duplicates to detect multiple losses. In the altered stack, one connection does not rely only on its own duplicate ACKs: if one connection receives a duplicate ACK while another with the same path keeps receiving proper ACKs, a single duplicate ACK may be enough to indicate a problem, since other connections continue without problems. In this way a retransmission can occur sooner and more losses can be detected in a given window size. Some work has also been done to allow a host that has multiple connections to judiciously divide bandwidth among them by limiting the advertised window it presents to the other end point of each connection [CO98]. Although this may provide some control on maximum bandwidths, it is not of much use to handle dynamic variations in network capacity.

4.7 Concluding Remarks

The conclusion from this discussion is that when TCP is used as the transport protocol for high performance computing it meets the basic requirements of providing guaranteed ordered data delivery and some congestion control. Nevertheless, since it was not specifically designed to handle high-speed links with very short round trip times, and multiple connections with combined requirements, TCP still has problems, which usually appear in the form of performance-damaging connection stalls. Explicit delays, packet trains, coarse grained timers and the inability to combine information from cooperating sessions are all important problems that must be addressed when we consider the networking needs of high-performance computing. Among those, only explicit delays have a simple solution from the implementation point of view: disabling delayed ACKs is enough to avoid most of the problems caused by unexpected delays in the data flow. Unfortunately, the other problems are all closely interrelated, and no clear solutions exist at this point. Packet trains occur frequently, and connections may have problems recovering efficiently from the multiple losses caused by them. This in turn causes connections to stall when TCP self-clocking fails for lack of returning ACKs. Finally, the granularity of the timers used to detect such stalls is too coarse to guarantee reasonable reaction times, so links sometimes become extremely under-utilized. Chapter 5 presents a technique that helps solve most of these problems: cooperative rate-based traffic shaping.

CHAPTER 5 COOPERATIVE RATE-BASED TRAFFIC SHAPING

In high-performance computing applications like the parallel file systems (PFS) discussed in Chapter 2, the application is interested in the aggregate throughput of a set of connections, since each connection (server node) is delivering/receiving one fragment of the larger file object [G+98]. In other words, the application's request is not satisfied until the slowest transfer completes. As discussed in Chapter 4, TCP is the protocol of choice for such systems, and it often happens that one or two of the TCP connections stall due to multiple losses, while the others proceed unaffected. This leads to reduced throughput for the application if the remaining sessions are not capable of utilizing the total bandwidth available. In the worst case, which typically happens on high-bandwidth/low-latency networks, the link is actually idle due to the TCP timeout interval being longer than the required transfer time. This chapter describes how to improve the performance of applications using concurrent connections by adding rate control mechanisms to TCP. Instead of using just self-clocking to decide when packets are transmitted, rate pacing adds a new module to the stack, the rate controller. When a packet is sent, the rate controller uses its length and the available bandwidth to decide the time when the next packet can be transmitted without exceeding the connection's allocated rate. The controller then blocks any new outgoing packets until that time arrives. This module is not associated with individual connections directly, but instead it is applied to a group of connections that share some network resource. Such aggregation improves the overall system performance in two ways:

• Combining concurrent connections that share a critical resource allows information about each connection's use of that resource to be shared with the other connections. As a result, when one connection detects a loss and reduces its congestion window (and its associated gauge of the available bandwidth), the other connections sharing the same controller see a partial reduction of the overall available rate, and start to slow down too. That avoids unnecessary losses.

• Even when applied to a single connection, a rate controller potentially improves performance since it automatically avoids packet bursts. Whenever TCP would inject packets into the network without spacing, as when an acknowledgment causes the window to be opened, the rate controller adds spacing between any two packets based on the current allowed rate for that connection. This spacing also serves to reduce losses.

Section 5.1 describes how rate control was implemented and added to a TCP/IP stack, discussing the main design points affecting the behavior of the new system. After that, Section 5.2 presents the models and configuration parameters used to analyze the resulting protocol through simulations, and Section 5.3 discusses the main results from that analysis.

5.1 Implementation

The rate controller was implemented as the vrate virtual protocol in the x-kernel [OP92], as shown in Figure 5.1(a). In that framework, a virtual protocol is an isolated module that controls the flow of messages up and down the stack without adding new headers or creating new messages. An optional implementation strategy, better suited for monolithic systems, is the one illustrated in Figure 5.1(b), where the rate controller is added to TCP itself. This might be necessary to guarantee good performance in systems where the added modularity has a cost.

Figure 5.1: Rate controller added to the protocol stack: (a) x-kernel, with vrate as a virtual protocol between TCP and IP; (b) monolithic, with the rate control module inside TCP

Each host interface in the system has a rate controller responsible for its outgoing

traffic. This means that when a group of sessions has a common source, all flows are controlled by the same entity, and information obtained by one session can be shared with all others. In the case of multiple sources and a common destination, each session is controlled independently by the rate controller at its origin. Should it be necessary to have these sessions share a common controller, at least part of the controller functionality would have to be located at the first switch shared by all sessions, so that it could manage the common information. Section 5.4 discusses how this could be done, for example, in a system like the one discussed in Chapter 3. The internal structure of vrate is illustrated in Figure 5.2, which shows some of the main elements in the design. Each session to be controlled must be assigned a queue in the protocol. Queues are served according to a chosen scheduling policy, and packets are spaced out using a token bucket to keep track of the rate available to the controller. The different design options available to implement each of these elements are discussed in the sections that follow, which also explain exactly what mechanisms were implemented in vrate.

Figure 5.2: Rate controller internal structure (individual session queues feed a scheduler and a token bucket, limited by the controlling rate, before packets are passed down to IP)

5.1.1 Rate Estimation

There are two ways to set the rate to be used in the controller: explicitly (i.e., some external entity determines the rate) or implicitly (i.e., the controller infers the rate). An application may want to limit the maximum rate assigned to a group of connections, for example, based on a priori knowledge of the network conditions, or due to some bandwidth allocation agreement. This is what a server in an ATM network would do in case it requested a guaranteed rate service. Another kind of explicit control would be the one that exists in connections over an ATM network using the ABR service class. In this case, the network continually provides feedback to the sender to inform it about variations on the maximum available rate [CFKS96]. That feedback could be used to propagate the information about limiting rates up the stack to TCP, and let it perform the queueing needed to shape its traffic accordingly. This approach would solve a problem mentioned in the literature, where the ATM endpoints providing ABR services to a TCP connection must buffer the equivalent of a whole delay-bandwidth product to avoid losses [SS96]. If the information were fed back to TCP, very little additional buffer space would be needed, since messages would be queued at the source and TCP has to keep them around until an acknowledgment is received anyway. Another way the rate can be determined is by using a variation of the packet-pair technique, as is done in TCP Vegas*, for example [BP95]. A pair of packets is sent and the size of each packet is recorded. When the acknowledgments arrive, the system measures the exact interval between them. This interval and the size of the packets can be used to estimate the link rate, and new rate estimates may be obtained as often as necessary. The problem with this technique is that it is not immune to the effect of ACK compression: if the acknowledgments are queued behind larger packets, they may arrive at the sender back-to-back, causing the rate to be overestimated. The simplest way to implement implicit rate determination is to use the basic information already available in TCP's internal state. As explained in Section 4.1, TCP's view of the available rate is kept as the size of the congestion window and the round trip time (RTT). To a first approximation, the rate is already known: rate = window/RTT. The problem is that in most TCP implementations, RTTs are measured by a very coarse-grain timer (500 milliseconds). To actually use the congestion window and RTT as an indication of the available rate, a more precise measurement of round trip times is required. Another limitation of this approach is that TCP usually measures one round trip time for each window's worth of data, so new estimates may take a round trip to react to changes.

Solution adopted in vrate: Although the implementation allows the controlling rate to be explicitly set by the application, which could also be combined with explicit notifications from the network, this is not used. Instead, this work focuses on how TCP can use rate control to improve its performance based on indirect information about the available bandwidth, assuming a general network with no particular rate monitoring capability. In this implementation, the rate available to each connection is determined by the passive method of using the congestion window size and the measured round-trip time in the equation rate = window/RTT. The problem of the poor resolution of TCP's RTT measurement is solved by adding a precise timer in parallel to the one used to control actual retransmission timeouts. Implementing more precise timers requires the addition of two variables to each connection's control block, one to hold the time (accurate to within microseconds) when a packet is sent, and another to hold the value of the congestion window at that point. In the BSD code, these variables are updated and checked at the same points where the original round trip time measurement is performed. The performance penalty is essentially that of checking the system clock once or twice every round trip time. When an acknowledgment arrives for the packet being timed, the time of transmission is subtracted from the current system time to determine the RTT. Once a new precise RTT has been computed, TCP notifies vrate of the new estimate by providing the measurement and the value of the congestion window that was active when the packet was sent, using an x-kernel control operation. From that point on it is up to vrate to decide how to use this information.
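A compact sketch of this passive estimation is shown below; the names are illustrative and the structure only mimics the two added control-block variables and the rate = window/RTT computation described above, not the BSD code itself.

```python
# Passive rate estimation: record a fine-grained timestamp and the congestion
# window when the timed packet is sent, and compute rate = window / RTT when
# its acknowledgment returns.

import time

class PreciseRateEstimator:
    def __init__(self):
        self.t_sent = None
        self.cwnd_at_send = None
        self.rate = None              # bytes per second

    def on_send(self, cwnd_bytes):
        """Called when the packet chosen for timing is transmitted."""
        self.t_sent = time.monotonic()
        self.cwnd_at_send = cwnd_bytes

    def on_ack(self):
        """Called when the ACK for the timed packet arrives; returns the new rate."""
        if self.t_sent is None:
            return None
        rtt = time.monotonic() - self.t_sent
        if rtt > 0:
            self.rate = self.cwnd_at_send / rtt
        self.t_sent = None
        return self.rate
```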

5.1.2 Combination of Multiple Connections

Combining multiple connections has the effect that a change detected by any one of them is immediately shared with all others in the group. For example, if one of the connections detects a loss and reduces its congestion window, its contribution to the aggregate limiting rate is reduced, and all packets from that bucket will be sent at a lower rate from then on, even if they did not originate in that connection. Conversely, if a connection decides to open its congestion window, indicating that it suspects there is more bandwidth available, packets from all connections will benefit from the increased rate.

When the rate controller is supposed to treat connections in groups, two questions must be answered: (1) which connections should be grouped together, and (2) how should their information be combined. There has been discussion on how TCP could identify connections sharing resources by itself, but this is an open issue at this point [Tou97]. In some cases, source and destination addresses can be used (together with netmask values) to identify connections with end-points in a same network. However, techniques like IP masquerading, site-specific subnet masks, and others make even this observation unreliable in general. Once some technique has been devised to identify the related connections, another question is how the information from them (their available rates, in this case) should be combined. During the early phases of development of vrate a few different techniques to compute the limiting rate were tested, such as a sum of individual running averages, a running average of the sum of all rates computed each time a new rate estimate was available, and a simple sum.

Solution adopted in vrate: Since vrate was created to handle traffic from connections used in a parallel I/O environment, it is reasonable to assume that the application opening connections knows which ones share some common element. The task of identifying connections is therefore left to the application, and it is performed by means of x-kernel control operations. When the application decides to apply rate control to a group of connections, it first creates a new scheduler (a bucket) in vrate, which it associates with a certain group. When a connection belonging to a group is created, the application uses a second control operation to tell vrate to associate that connection with a certain bucket. The method selected to combine the rate information from the multiple connections in the bucket was a simple instantaneous sum: each time a new rate estimate is available for a connection, the previous estimate for that connection is subtracted from the bucket's limiting rate and the new value is added. This simple scheme avoids the need for multiple state variables and complex computations and works well when the physical networks are not too complex. If the network under consideration may cause large spurious variations in round-trip times, a more robust method should be used.
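The instantaneous-sum rule can be captured in a few lines; the sketch below uses illustrative names rather than the vrate source.

```python
# Instantaneous sum of per-connection rate estimates: the bucket's limiting
# rate is updated by replacing only the reporting connection's contribution.

class RateBucket:
    def __init__(self):
        self.limit = 0.0          # aggregate limiting rate (bytes/second)
        self.per_conn = {}        # last estimate reported by each connection

    def update(self, conn_id, new_rate):
        old = self.per_conn.get(conn_id, 0.0)
        self.limit += new_rate - old      # subtract old estimate, add new one
        self.per_conn[conn_id] = new_rate
        return self.limit

bucket = RateBucket()
bucket.update("c1", 40e6)
bucket.update("c2", 35e6)
print(bucket.update("c1", 20e6))   # c1 halves its estimate; the limit drops to 55e6
```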

5.1.3 Connection Scheduling

Each time the rate controller decides that the limiting rate allows a new packet to be sent, it must choose, from the queues in the bucket, which packet to send next. The policy used for queueing and dequeuing packets may change the behavior of the controller slightly. The simplest solution is to use a single queue for all sessions in a bucket. This simplifies the implementation, but the final performance may be poor. Usual TCP implementations push all the packets they can transmit after an ACK in quick succession, leaving it to the lower layer to decide how to handle them based on the transmission rate of the device. If a single queue is used, it would be filled with long sequences of packets belonging to each connection. A connection would send no packets for a long time, and then it would send a group of packets at the limiting rate. This would lead to poor RTT estimates due to the alternation of intervals. Once the decision to use individual queues is made, there are different ways to serve them. Policies like weighted fair queueing [DKS90, SV98] and virtual clock [Zha91] provide advanced control features, guaranteeing fair behavior even between connections with different characteristics. The problem is their added complexity. On the other hand, round-robin scheduling of the queues is simple to implement, although it may lead to great unfairness in the general case.

Solution adopted in vrate: Again we can use information from the applications in the parallel I/O environment to decide which solution to use. Connections used by a parallel file system interface to access the servers over the network are extremely regular and have similar profiles: they should always transfer similar amounts of data, and their periods of activity should be overlapping. Considering this, a simple round-robin solution was selected, since it works well when connections have similar traffic patterns.
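A minimal sketch of round-robin service over per-connection queues is shown below; the names and structure are illustrative, not taken from vrate.

```python
# Round-robin service of per-connection queues: each connection bound to a
# bucket gets its own FIFO queue; when the token bucket allows another
# transmission, the queues are polled in a fixed circular order.

from collections import deque

class RoundRobinScheduler:
    def __init__(self):
        self.queues = {}          # connection id -> deque of packets
        self.order = []           # circular service order
        self.next_index = 0

    def add_connection(self, conn_id):
        self.queues[conn_id] = deque()
        self.order.append(conn_id)

    def enqueue(self, conn_id, packet):
        self.queues[conn_id].append(packet)

    def dequeue(self):
        """Return the next packet in round-robin order, or None if all queues are empty."""
        for _ in range(len(self.order)):
            conn_id = self.order[self.next_index]
            self.next_index = (self.next_index + 1) % len(self.order)
            if self.queues[conn_id]:
                return self.queues[conn_id].popleft()
        return None
```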

5.1.4 Fine-Grained Timers

The last element in vrate's design is the token bucket. It provides the means for packet pacing: packets can only be sent when the bucket has as many tokens as the packet size in bytes. If there are not enough tokens, the packet must wait. Tokens are fed into the bucket at the limiting rate established by the system, as discussed in Section 5.1.1, with a maximum capacity equivalent to one MTU. If the connection has been idle for some time and the bucket is full, the first packet can be transmitted immediately, draining the bucket. The next packet will be delayed until the bucket contains enough tokens again. This guarantees that packets will not exceed the limiting rate over the connection. The token bucket abstraction is implemented using a fine-grained timer. The hardest issue in implementing vrate is the same faced by any system using rate-based access control: how to implement timers precise enough to guarantee that packets will be sent exactly at the times needed to achieve the expected rates, without overloading the system with the run time overhead of handling a multitude of fine-grained timers. This problem is discussed in the specification of NETBLT [CLZ87], for example.

Solution adopted in vrate: To keep overheads to a minimum, vrate uses a timing wheel with a very short period to approximate precise timers. All events are grouped by the resolution of the wheel, which is approximately 50 microseconds. Considering Ethernet-sized packets (approximately 1500 bytes), this gives a maximum controlled rate of 240 Mbps. If rates are too high to be approximated well by that resolution, vrate just leaves the connection uncontrolled for as long as that situation persists. The reasoning is that if the rate is actually that high, the system as a whole will have such a hard time keeping up with it that no other rate limitations are necessary.
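The pacing decision made by the token bucket can be sketched as follows; the timing-wheel machinery that actually schedules the wakeups (at roughly 50 microsecond resolution) is omitted, and the code only computes how long a packet would have to wait.

```python
# Token bucket pacing: tokens accumulate at the limiting rate up to one MTU,
# and a packet may leave only when the bucket holds its size in tokens.

class TokenBucket:
    def __init__(self, rate_bytes_per_sec, mtu=1500):
        self.rate = rate_bytes_per_sec
        self.capacity = mtu          # at most one MTU worth of tokens
        self.tokens = mtu            # start full: an idle connection sends at once
        self.last_time = 0.0

    def delay_for(self, packet_len, now):
        """Seconds the packet must wait before it may be transmitted."""
        # refill tokens for the time elapsed since the last call
        self.tokens = min(self.capacity, self.tokens + (now - self.last_time) * self.rate)
        self.last_time = now
        if self.tokens >= packet_len:
            self.tokens -= packet_len
            return 0.0
        deficit = packet_len - self.tokens
        self.tokens = 0.0            # the wait itself earns the missing tokens
        return deficit / self.rate

tb = TokenBucket(rate_bytes_per_sec=12.5e6)      # roughly 100 Mbps
print(tb.delay_for(1500, now=0.0))               # 0.0: bucket starts full
print(tb.delay_for(1500, now=0.0))               # about 120 microseconds of wait
```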

5.1.5 Operation of the Rate Controller

The general operation is shown in Figure 5.3. Although an application can use vrate to control a group of connections (C1-3), connections that require no control (I) can continue directly to IP. When the application decides to use rate control, it first allocates one bucket in the rate controller for each group of connections it intends to use, either stating an initial rate limit or defining it as dynamically computed. It then specifies which connections must be bound to each bucket. This causes vrate to allocate a queue for each of those connections and to bind them into a scheduler associated with the bucket, returning an identifier for each connection's queue. This identifier is kept by each TCP session to mark it as rate controlled and to identify itself to vrate during RTT/window updates.

Figure 5.3: Rate controller operation

From TCP's point of view, every time a connection has a segment to send, it just sends it as usual. If the connection is not associated with vrate, the segment goes down to IP and the rest of the stack immediately. Otherwise, vrate enqueues the message and leaves it to the scheduler and token bucket to determine when it is to be sent. This may be immediately, if there are no other messages already queued and there are enough tokens for the packet.
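The calling sequence described in this section might look like the following C sketch. The function and field names (vrate_alloc_bucket, vrate_bind, vrate_enqueue, vrate_qid) are hypothetical placeholders, not the dissertation's actual interface; the sketch only shows the order of the operations: allocate a bucket per group, bind each connection to obtain a queue identifier, then hand segments to vrate instead of IP.

```c
/* Hypothetical interface, declared here only to show the call order. */
#define VRATE_DYNAMIC_RATE (-1)       /* let vrate infer the rate from RTT/window */

struct packet;                        /* opaque here */
struct conn { int vrate_qid; };       /* queue id, or -1 if uncontrolled */

int  vrate_alloc_bucket(long rate_bps);          /* one bucket per group   */
int  vrate_bind(int bucket, struct conn *c);     /* returns a queue id     */
void vrate_enqueue(int qid, struct packet *p);   /* paced by the scheduler */
void ip_send(struct packet *p);                  /* normal path to IP      */

static void setup_group(struct conn **conns, int nconns)
{
    /* One bucket for the whole group of cooperating connections. */
    int bucket = vrate_alloc_bucket(VRATE_DYNAMIC_RATE);
    for (int i = 0; i < nconns; i++)
        conns[i]->vrate_qid = vrate_bind(bucket, conns[i]);
}

static void tcp_send_segment(struct conn *c, struct packet *p)
{
    if (c->vrate_qid < 0)
        ip_send(p);                    /* uncontrolled: straight down the stack */
    else
        vrate_enqueue(c->vrate_qid, p);   /* scheduler and token bucket decide
                                             when p actually goes out */
}
```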

5.2 Simulation Models

The rate controller described in the previous sections was simulated using the x-sim simulator [BP96]. Instead of using abstract specifications, x-sim executes the actual x-kernel protocols, which allows us to use exactly the same code that would be used in a real system. The vrate implementation was developed on top of Big Windows TCP, as defined in the BSD 4.4 distribution. This was also the version of TCP used for comparison in all simulation scenarios. It was chosen because it is the only widely available implementation of the protocol that has been altered to include the TCP extensions for high-speed networks [JBB92]. A group of different scenarios was selected to simulate some of the configurations that might occur in a high-performance computing environment. The main case corresponds to a distributed storage server in a local network. Figure 5.4 shows the configuration used to represent a local area network with a single shared switch. The links were configured with speeds corresponding to both Fast Ethernet (100baseT, 100 Mbps) and OC-12 ATM (622 Mbps).

Delays were fixed at 1 millisecond per link, although slight variations were added in a few runs to make sure there were no phase problems [FJ92]. The switch was modeled as a packet switch, even when ATM speeds were involved (a reasonable approximation, assuming the ATM hardware includes optimizations proposed for handling TCP connections, such as early packet discard [RF94, PR95]). Unless otherwise noted, switch buffer capacities were set to 30 packets per link.

Figure 5.4: LAN with a single switch. (a) Single client (rd, wr); (b) Two clients (rd, wr, 2way).

Simulation runs were configured with one or two clients. A single client can perform either a read from or a write to the servers, while a two-client scenario can have both clients reading or writing at the same time, as well as performing opposite operations, thereby creating two-way traffic. A client is the active element that runs the application; it may be a workstation in a cluster, a compute node inside a multiprocessor, or a complete multiprocessor using a distributed protocol implementation like the one proposed in Chapter 3. It starts by setting up connections to all servers to be used and by transferring enough data on each to complete the slow start phase (this work deals with long-running connections). In each simulation, a client concurrently reads from or writes to all the servers. The amount of data in each individual transfer is 8 megabytes per server, and the number of servers was varied from one to eight. Time is measured by the client(s) from the time a read request is sent to the servers until the time all data is received (read case), or from the time the client starts to send data until the arrival of a message from each server confirming reception (write case). This is done to achieve the effect of an application that cannot proceed until it has heard from all servers. When two clients are involved, each measures its own performance and the simulation reports the lowest value.

With the different rates available, three cases could occur:

• Client and server links have the same capacity. This represents a general-purpose network with equal-capacity machines acting as clients and servers. "Fast Ethernet" and "ATM" networks could have different behavior (even after the different speeds are taken into account) due to the different amounts of data in transit in each case.

• 100baseT clients and ATM servers. This represents the standard server model in use in our networks today, where special-purpose, high-capacity servers are connected at higher speeds, while clients are general-purpose machines.

• ATM clients and 100baseT servers. This should be the usual case for distributed storage servers in the future: high-performance clients (e.g., massively parallel processors) with high-speed interfaces connected to cheaper, ubiquitous storage servers. Performance is achieved not by individual server speeds, but by aggregating a large number of them in parallel.

All possible combinations were simulated to make sure there were no anomalous cases, both for reads and writes. Scenarios where two clients communicated with the same set of servers were also used. The goal in this case was to verify how the system behaved with competing clients reading or writing at the same time, and to see how the protocols behaved in the presence of two-way traffic, which causes the phenomenon of ACK compression [ZSC91]. In the case of a local area network with a single switch, transfers from the client to multiple servers share no queues outside the client, since each server corresponds to a different switch port, but this is not always the case. The model shown in Figure 5.5 was added to address situations where a common link (and its associated queues) is shared by all connections. Such a configuration might happen in practice when clients and servers have to be physically separated for some reason. This might be done for administrative purposes (placing different machines in different pools to simplify their management), or due to physical limitations (e.g., when clients are placed in laboratories, while servers are kept in an isolated machine room).

Figure 5.5: LAN with two switches.

The same options of link speeds were considered, and all combinations were simulated for completeness, even though one might argue that in such a case any configuration with an inter-switch link of smaller capacity than an end-machine link would not be a practical system. Although not illustrated in Figure 5.5, the models with two switches also included a configuration with multiple clients. Note that all simulation scenarios described in this section represent high-speed networks, where even the large data transfers discussed here complete in a few seconds. Due to the short duration of the transfers and the difficulty of defining a reasonable model for additional load in such a system, no additional background traffic is used. Tools like the traffic model offered by tcplib [DJ91], for example, are designed for wide area networks and for longer intervals. There is no study yet of the traffic characteristics of scalable storage servers, so that kind of analysis is left for future work.

5.3 Performance Results

For all the cases discussed in Section 5.2, simulations were executed, varying the number of servers (and hence the number of concurrent connections) from one to eight. Cases used both the standard protocol stack and our modified stack, which includes vrate. Each case was executed at least ten times to account for timing variations. In virtually all cases, vrate performed at least as well as standard Big Windows TCP (btcp). In those cases that experienced congestion, the application achieved higher performance with the new technique. In the following discussion, a few illustrative cases were chosen to show the major gains. To help visualize the impact of rate control, the performance graphs show the bandwidth perceived by the application with a varying number of connections (servers). For each case the graphs present the bandwidth achieved by both vrate and btcp as a percentage of the theoretical maximum bandwidth possible in that situation. This presentation factors out differences in the overall bandwidth available for different numbers of servers. For each configuration that follows, let n_cli be the number of clients (usually one in our tests) and n_srv be the number of servers. Since each client accesses all servers concurrently, there will be a total of n_cli × n_srv connections: n_srv of them through each client link, n_cli of them through each server link, and all of them sharing the link between switches, if one is present. Let bw_cli, bw_srv, and bw_sw represent the bandwidths of each client link, each server link, and the inter-switch link (if applicable), respectively. The maximum available bandwidth for a client is then given by:

bw_max = min( bw_cli, (n_srv × bw_srv) / n_cli, bw_sw / n_cli )

Obviously, the last term is not used in configurations with a single switch. For example, when a client on an OC-12 link reads from a group of servers on 100baseT links, the maximum bandwidth possible per session will be 200 Mbps when accessing two servers, and 622 Mbps when accessing more than six servers.
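As a quick check of the formula, the following small C program reproduces the example above (one OC-12 client reading from 100baseT servers through a single switch); the helper names are illustrative.

```c
#include <stdio.h>

static double min3(double a, double b, double c)
{
    double m = a < b ? a : b;
    return m < c ? m : c;
}

/* bw_max = min(bw_cli, n_srv*bw_srv/n_cli, bw_sw/n_cli); pass a very
 * large bw_sw for single-switch configurations (no inter-switch link). */
static double bw_max(double bw_cli, double bw_srv, double bw_sw,
                     int n_cli, int n_srv)
{
    return min3(bw_cli,
                (double)n_srv * bw_srv / n_cli,
                bw_sw / n_cli);
}

int main(void)
{
    /* OC-12 client (622 Mbps) reading from 100 Mbps servers, single switch. */
    printf("%g Mbps\n", bw_max(622, 100, 1e9, 1, 2));  /* prints 200 */
    printf("%g Mbps\n", bw_max(622, 100, 1e9, 1, 7));  /* prints 622 */
    return 0;
}
```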

5.3.1 Single Switch Case: Read

For the model under study, the chances of congestion are always greater when the client reads from the servers. The reason is that even if the client has a link of higher bandwidth than the servers, as the application tries to read from a larger number of machines, the aggregate throughput can easily exceed the client's link capacity. Figure 5.6 shows our results for the case with higher contention, when server links are individually faster than the client link. In this case, vrate always outperforms btcp, although in some cases just by a narrow margin. The results for a single connection, as might be expected, are statistically identical. As the number of servers grows, btcp first suffers a drastic drop in performance (due to stalls), later recovering some of that as the number of servers continues to increase. This recovery is due mostly to the increase in the amount of data to be transferred.

Figure 5.6: Application performance: 100baseT client reading from OC-12 servers

With more servers, when one connection stalls there are many others that can make use of its bandwidth. On the other hand, vrate keeps all connections at a higher level all the time. This is the sort of behavior found in a large subset of the simulated cases. To better understand what happens in such cases, this section takes a closer look at the traces for the two-server case. In this scenario, vrate achieves rates approximately 80% better than those of btcp, on average.

Figure 5.7 shows a trace for a transfer using standard TCP after the initial slow start phase is over. The poor performance in this case is clearly due to the fact that one of the connections suffers two timeouts during its lifetime. This increases the time the application is held back waiting for data. The first loss happens as soon as the servers receive the request from the client and start to transmit. At this point, their connections have been idle for a short interval and they are allowed to send a whole window's worth of data at once. This causes extremely long packet trains to reach the switch, creating the sharp spike in queue length right before the 1.2 second mark. The switch has to drop multiple packets from each connection, causing both to stall. Later, when both connections resume transmission and the switch queues start to change size as dictated by the standard TCP congestion control algorithm, a multiple-loss event happens right before the 2.8 second mark, forcing one of the connections to stall again. The remaining connection finishes before the other one resumes, and the link is left idle for approximately half a second. (With more servers, transfers would continue for a longer time and the idle period would be shorter, causing the recovery noticed when more servers are present.)

Figure 5.7: Transfer trace for plain TCP: single switch, read

The behavior of a transfer using vrate is quite different, as can be seen in Figure 5.8. The two connections proceed close together all the time, never suffering any multiple-loss events. During each congestion window period, each connection loses exactly one packet, and they finish close together, yielding a throughput close to the maximum achievable. The results from other cases with more servers confirm this analysis. The sudden drop in performance for two servers (for btcp) is due to the multiple timeouts that are likely to occur. Adding more servers tends to improve the scenario, since at any moment there are more sessions that may be able to use the link while one (or a few) of them is blocked.

Figure 5.8: Transfer trace with rate control: single switch, read

Vrate, on the other hand, reduces the chances of blocked connections in any case. From these graphs it is possible to see that vrate not only improves application performance and network utilization, but also improves the conditions of the network as a whole. By avoiding packet trains it reduces the pressure on switch buffers, thereby reducing contention. An important point, clear from the switch queue plot, is that vrate does not preclude TCP's usual congestion control algorithm. It works only by delaying packets to meet certain bandwidth limitations: no packet is sent that would not also be sent by usual TCP, given that no multiple-loss events occur.

In particular, there is no sudden surge of packets arriving at the switch at the start of the connection, an event that in the previous case caused both connections to stall.

Figure 5.9: State of the rate controller: single switch, read

The rate control guarantees that even though the connection had been idle for a short time, packets are not injected at a rate higher than the rate detected up to that moment. This can be verified by inspecting the internal state of the rate controller, as shown in Figure 5.9. The first graph traces each connection's congestion window, the second shows the measured values for the round-trip time, and the last one gives the inferred available rate. From these graphs it is easy to see that although the congestion windows do not evolve in perfect synchrony, their combined effect is still very periodic. Although the rate estimates show some noticeable differences around 1.4 seconds, they slowly converge to a common value, always maintaining a combined total close to the actual capacity of the system.

5.3.2 Single Switch Case: Write

The same kind of behavior is observed when there is a clear cause of contention in the system.

When a client sends data to a group of servers in the central-switch configuration, the only case where bandwidth is not limited by the client link is when the client has a higher-capacity link than the servers. Figure 5.10 shows the results for the case where a client on an OC-12 link writes data to servers with 100baseT connections.

Figure 5.10: Application performance: OC-12 client writing to 100baseT servers

The main reason for vrate performing so much better than btcp in this case is that it again reduces packet trains drastically. For seven and eight servers, the throughput is limited by the client capacity, and rate control is not relevant. When all connections share the same source, even though they do not have a common bottleneck in this case, the improvement due to rate control is even more noticeable. This is because processing for each connection tends to get naturally interleaved in the operating system. For example, there is no case in which two packets from two different connections arrive at the switch at virtually the same time, which could force one of them to be dropped. All data packets in this case arrive at the switch through the same link, so obviously no more than one data packet has to be routed at any time. Traces for these cases are not included here, since they add little new information to the read case, although the write case shows more pronounced results. Connections using vrate proceed smoothly, even more so than before, due to the natural packet scheduling caused by the operating system processing mentioned above. Connections using btcp, as before, are subject to the same problems with stalls due to multiple losses. These stalls lead to intervals when the link is left idle, hurting performance. One new element in this case is that when more than one connection happens to stall at approximately the same time (due to both experiencing multiple losses), they are usually restarted at the same time. This is a result of their being managed by the same TCP machine and having similar round-trip times: with the RTT measurements and retransmit activations controlled by the same coarse-grained timer, they end up having exactly the same effects. The fact that more than one connection restarts at almost the same time (first with a short slow-start phase and then by trying to fill their pipes again) tends to increase the occurrence of packet trains and therefore lead to more losses. Although such simultaneous stall events are not very frequent, when they happen they almost always force an extra stall soon after restart, which contributes to btcp numbers being worse in the write case than in the read case.

5.3.3 Two-Switch Case: Read

Performance also improves in the case of connections crossing a common link between two switches. The cases that benefit the most, however, are those where the link connecting the switches has bandwidth equal to or lower than that of the sender(s), since they suffer the highest congestion and benefit the most from the improvements provided by vrate. Nevertheless, such cases are not of practical interest: one would expect the inter-switch link to have higher capacity than the links connecting individual hosts in such a case. Considering this, this section presents one result for a case with greater practical value: individual hosts connected to switches by 100baseT links, while the switches communicate with each other through an OC-12 channel. The read case is presented in Figure 5.11. As mentioned earlier, these results are very similar to the read case with a central switch described in Section 5.3.1. Compared to that case, the biggest difference is for two servers. Although vrate still outperforms btcp by a wide margin, its average throughput is lower than in the central-switch case, and the opposite is true for btcp. The reason is probably related to the longer round-trip time of this configuration and the existence of two contention points.

5.3.4 Multiple Clients

When considering cases with two clients, both of them may be transferring data in the same direction (which accentuates congestion directly), or they may be transferring data in opposite directions.

Figure 5.11: Application performance: 100baseT hosts behind an OC-12 link

In the latter case, the main effect is the occurrence of ACK compression in both connections, contributing to losses due to the formation of packet trains. Most of the time, two-way traffic does not affect the performance of vrate very much. On the other hand, in some cases btcp is seriously hurt by packet trains, mostly because of the added ACK compression effects. In those cases that already have high congestion with even just one connection, both systems were adversely affected. The results nevertheless do not differ significantly from the single-client results in terms of general conditions. One important aspect of the two-client case is that while many of the cases considered so far are trivial for a single client (when its link is the limiting factor), they may become contention prone if multiple applications are running concurrently in the network (a likely scenario). To illustrate, consider the results for the single-switch, all-100baseT network, with clients writing to servers, as shown in Figure 5.12. The results with a single client are trivial: no matter how many servers are considered, the performance is always close to maximum for both vrate and btcp. This is because the limiting factor is the client link. When a second client is added, however, the clients compete for bandwidth to access each server. Since their operation is not synchronized, there is no way for them to control the timing of their transfers to each server so that only one of them talks to a given server at any time.

Figure 5.12: Effect of added clients on rate: two clients writing, 100baseT network

Vrate's ability to shape traffic to reduce impact on the switches guarantees that its connections will keep link utilization close to maximum at all times. Figure 5.13 shows the total number of packet drops registered by the simulator during the transfers. The results confirm the good performance of vrate: very few packets were lost. An interesting result is that losses in btcp decrease as the number of connections continues to increase. In these cases, the client bandwidth begins to be divided among an increasing number of servers, up to the point where each individual connection has access to just a small fraction of the overall capacity. As a consequence, their trains become smaller and less frequent. Under these conditions, the individual connections become more stable, and multiple-loss events become less likely.

5.4 Rate Control and Distributed Protocol Stacks

Chapter 3 discusses how user-level protocol techniques can be used to implement distributed protocol processing in a massively parallel processor. As illustrated in Figure 5.14, the solution starts by adding access to the HiPPI device to the standard TCP/IP stack. This is done using libhippi over the Mach remote device access interface, with Mach NORMA as the inter-process communication interface (A). This was then altered to avoid the high costs of Mach NORMA IPC by using the more efficient NX IPC to interface the TCP/IP code to a streamlined interface process located in the I/O node, which used libhippi with local system calls to access the device (B).

Figure 5.13: Effect of added clients: packets dropped per connection for two clients writing, 100baseT network

This chapter, on the other hand, shows how TCP performance can be improved by adding rate control to the protocol stack, as illustrated in Figure 5.15. In this case, vrate is implemented directly under TCP (C) for simplicity: this way vrate can identify flows by checking just the port numbers in the message header, which is already guaranteed to belong to TCP. However, nothing in the solution forces this organization. It can also be implemented closer to the device interface if desired (D). In general, such a solution would require more elaborate packet parsing code to differentiate TCP packets from others that need no rate pacing, and then to identify the individual flows. A complete solution to the network performance problem in high-performance computing requires that the two techniques be combined. Distributed protocol processing reduces the bottleneck created by centralized protocol servers and increases parallelism by moving protocol processing to the compute nodes, but it provides no solution to the problem of combining multiple cooperating TCP connections. Cooperative rate control makes it possible to reduce the number of stalls in TCP connections and to group related flows so that they may share information about available rates, but it must be possible to implement it in a distributed protocol stack. The combination of the techniques is illustrated in Figure 5.16.

Figure 5.14: Steps for a distributed protocol stack implementation

Figure 5.15: Steps for vrate implementation

If vrate is moved down the stack, close to the device driver access, it can be implemented in the interfacing process executing right at the I/O node that contains the actual device. This allows connections from the different compute nodes to be easily grouped together to share information. If vrate is left close to TCP, it can provide rate pacing for individual connections, but it becomes impossible to combine information from connections initiated by different compute nodes.
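If the rate controller is placed near the device interface (option D), it can no longer assume that every message belongs to TCP. The following C sketch shows what the extra parsing could look like: it examines an IPv4 header, skips non-TCP packets, and extracts the addresses and port numbers that identify the flow. This is an illustrative assumption about such classification code, not code from the dissertation.

```c
#include <stdint.h>
#include <stddef.h>

struct flow_key {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

/* Returns 1 and fills *key if the packet is a TCP/IPv4 packet long enough
 * to carry port numbers, 0 otherwise (non-TCP traffic needs no pacing). */
static int classify(const uint8_t *pkt, size_t len, struct flow_key *key)
{
    if (len < 20 || (pkt[0] >> 4) != 4)
        return 0;                            /* not IPv4 */
    size_t ihl = (size_t)(pkt[0] & 0x0f) * 4;   /* IP header length in bytes */
    if (ihl < 20 || pkt[9] != 6 || len < ihl + 4)
        return 0;                            /* bad header, not TCP, or truncated */

    key->src_ip   = (uint32_t)pkt[12] << 24 | (uint32_t)pkt[13] << 16 |
                    (uint32_t)pkt[14] << 8  | pkt[15];
    key->dst_ip   = (uint32_t)pkt[16] << 24 | (uint32_t)pkt[17] << 16 |
                    (uint32_t)pkt[18] << 8  | pkt[19];
    key->src_port = (uint16_t)(pkt[ihl]     << 8 | pkt[ihl + 1]);
    key->dst_port = (uint16_t)(pkt[ihl + 2] << 8 | pkt[ihl + 3]);
    return 1;
}
```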

Figure 5.16: Distributed protocol stacks and rate control combined

The combined solution simplifies vrate, since flow identification becomes much easier when the NX interface sits between the rate controller and the TCP/IP code. In this scenario, each flow uses its own NX endpoint, and the controller can associate messages with individual flows simply by checking the IPC handle of the message. This yields a very efficient implementation of vrate, with almost no cost to demultiplex packets into individual queues, since such demultiplexing is already provided by the inter-process communication facility. A preliminary comparison of the structure of the NX/HiPPI interface process used in Chapter 3 and of the rate controller module used in this chapter suggests that their internal structures match well. The interface process has individual threads to receive messages for each flow, and the NX communication primitives automatically implement individual queues for each thread, just as vrate queues flows separately. The rate controller's scheduler, which determines when the next packet can be transmitted and which queue is served next, fits well with the thread in the NX/HiPPI interface process that receives messages from the multiple NX threads and ships them to the device. This shipping thread would have to be changed to fetch messages from the flow-associated threads as required by the scheduler, instead of accepting messages as they become available. It would also have to deal with timing constraints to provide rate pacing, but the basic organization is already there.
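A rough sketch of the change described above, assuming a single shipping loop driven by the rate controller's scheduler; sched_next_ready, queue_pop, and hippi_ship are hypothetical placeholders for the interface-process internals of Chapter 3, not actual code.

```c
#include <unistd.h>

/* Placeholders for the interface-process internals (assumptions). */
int   sched_next_ready(long *wait_us);   /* queue id to serve, or -1 if idle */
void *queue_pop(int qid);                /* next message queued for that flow */
void  hippi_ship(void *msg);             /* hand the message to the device */

/* Instead of forwarding messages in arrival order, the shipping loop asks
 * the rate controller's scheduler which flow queue to serve and when. */
static void shipping_loop(void)
{
    for (;;) {
        long wait_us = 0;
        int qid = sched_next_ready(&wait_us);
        if (qid < 0) {
            usleep(1000);                      /* nothing queued: idle briefly */
            continue;
        }
        if (wait_us > 0)
            usleep((useconds_t)wait_us);       /* token bucket says the packet must wait */
        hippi_ship(queue_pop(qid));
    }
}
```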

5.5 Concluding Remarks

This chapter proposes the use of rate-based traffic shaping to improve performance without changing TCP's basic congestion control algorithm. The results show that performance improves markedly in cases where packet trains are an important factor acting against the transfer. In those cases where congestion occurs, vrate outperforms plain TCP by as much as 100 percent. In virtually all cases, it performs at least as well as plain TCP. This technique is useful for high-performance computing applications that require multiple parallel TCP connections, such as scalable storage servers and distributed parallel file systems. It can also be combined easily with a distributed protocol stack implementation.

CHAPTER 6

CONCLUSION

The goal of the work presented in this dissertation is to improve the performance of the network subsystem in massively parallel processors (MPPs) and other high-performance computing systems. Many of these systems lack a system-wide network subsystem providing the same network identity to all nodes. Those that do have such a subsystem often have performance problems due to poor implementation. The use of TCP in applications handling multiple concurrent cooperating connections also contributes to less-than-optimal network performance in those machines. Considering these problems, the main contributions of this dissertation are the development of a distributed protocol processing organization, using a user-level protocol implementation, that allows each compute node in a multiprocessor to handle its own connections while maintaining a unified network identity for all nodes, and a technique to improve TCP performance by combining cooperating connections using rate-based flow control.

6.1 Limitations of Current Systems

Parallel applications in high-performance computing that make use of the external network have become more common in the past few years, for a multitude of reasons. The increasing need to share data among different computer organizations and the development of architecture-independent, high-performance parallel file systems (PFS) that distribute file data over a set of independent hosts across the network are the leading factors in this new trend. As this dissertation has shown, the move of parallel file systems from inside supercomputers to the local network is not without its problems. The new PFS implementations must use network protocols to access the storage servers, and therefore the final performance becomes dependent on the network subsystem implementation. This is an area that has not received much attention in the past, to the point that many current MPPs lack any notion of a unified identity when their nodes access services over the external network. In other cases, all protocol processing is centralized in specific processors executing special network servers, as in the case of the Intel Paragon. Such solutions greatly limit overall system performance, since they create bottlenecks for all network traffic. Besides the network subsystem implementation itself, another source of problems for high-performance computing applications using external networks is the lack of an appropriate protocol developed with cooperating connections in mind. The only widely available protocol that provides the basic services expected by such applications (such as guaranteed ordered delivery) is TCP. Although it works well in other scenarios, one problem for TCP in this case is that it lacks any facility to identify connections with a cooperative nature. This makes it hard to make the best use of the information from individual connections to improve overall performance, often leading to poor performance.

6.2 Main Contributions

Considering the problems just mentioned, the main contributions of this dissertation are aimed at improving the performance of high-performance computing systems over external networks by improving both the implementation of system-wide network subsystems and the behavior of TCP itself.

Distributed protocol processing: By using a user-level protocol implementation, this dissertation shows how protocol stacks can be implemented efficiently by dividing protocol layers between the I/O node containing the network interface and the compute nodes executing the applications that require network connections. By implementing most of the protocol processing in the compute nodes (the actual connection endpoints), the bottleneck of a centralized protocol processor is eliminated. The I/O node executes just the bottom portion of the protocol stack, containing the device access services and a simple demultiplexer to identify the proper node to which each incoming packet must be delivered.

Cooperative rate-based traffic shaping: This technique uses information from each TCP connection to measure the actual bandwidth available to the application as a whole and applies rate pacing to guarantee that packets are properly spaced out over time.

This avoids packet trains, which can lead to multiple losses in applications with multiple connections. Such losses often cause some connections to stall, reducing application performance dramatically in many cases. Not only do individual connections benefit from rate pacing, but connections identified as belonging to the same application and sharing part of the same path may be grouped and paced by their combined rate.

The two techniques may be combined by implementing rate shaping at each compute node as part of the distributed protocol stack, by implementing the rate controller at the I/O node, or by placing the rate controller at the compute node and the scheduler at the I/O node. Each solution presents a different level of complexity and different scheduling behaviors that must be considered together.

6.3 Suggestions for Future Work

Distributed protocol processing allows massively parallel processors and other distributed systems, like workstation clusters, to have a unified identity without paying the price of increased overhead and the bottleneck of a centralized protocol server. This idea may help the development of new organization techniques that make use of the unified identity to improve the performance of networked applications. The solution could be used, for example, to simplify the design of scalable Web servers using workstation clusters. The work described in this dissertation started with a multiprocessor with a centralized server and proceeded to create a distributed protocol stack. It would be interesting to see work on creating a unified network identity for a cluster or some other machine that offers only independent protocol stacks for each compute node, like the IBM SP series.

Sharing connection information is another technique that promises to improve TCP in cases where hosts have to handle large volumes of information, like busy database and Web servers. One question that has to be addressed, however, is how to identify which connections to group together. In this work this decision is delegated to the application, assuming that it can tell where the multiple hosts are, whether they share a common bottleneck, and so on. This is certainly a reasonable solution, but not the only one, and it may not even be the best one. It would be interesting to investigate whether a host can identify connections with similar characteristics, like a common bottleneck, from observing their traffic and other simple network information available. Such a possibility might have great impact on all work on sharing connection state, as well as on other areas, such as network management. Besides these new research areas, there is obviously some work that can be done that is more directly related to the results presented in this dissertation. The first way in which this work could be extended is by integrating cooperative rate shaping with distributed protocol processing in a working system. There are some performance aspects of the implementation of the rate controller that might lead to interesting problems, as well as questions about how best to implement a distributed scheduler for combined connections. Finally, another possible avenue of research is to study the effect of rate-based traffic shaping on wide area networks (WANs). In the WAN case, research might consider how rate shaping works over lower-bandwidth networks with higher delays, how rate-controlled TCP connections would affect those using standard TCP, and how cooperative rate shaping would behave in the presence of WAN background traffic. Of particular interest is the application of the technique to busy Web servers.

APPENDIX A

DETAILED SIMULATION RESULTS

During the study of the effect of cooperative rate-based traffic shaping, as discussed in Section 5.2, all combinations of link speeds and client roles defined in the simulation model were observed. Chapter 5 discusses some of the cases in detail. The other relevant cases are presented here. The reader is referred to Section 5.2 for a detailed description of the simulation model.

Figure A.1: LAN with a single switch. (a) Single client (rd, wr); (b) Two clients (rd, wr, 2way).

Figures A.1 and A.2 illustrate the possible cases. Cases can have one or two clients, which can read from or write to servers. Link speeds are varied to represent different network configurations.

Figure A.2: LAN with two switches.

A.1 No Congestion

The detailed simulation analysis performed in this work ignores the cases that have no network congestion, that is, those cases which are sender-limited. In such cases, the network connection has no other limiting factor than the capacity of the sender interface itself. That happens, for example, when two clients using Fast Ethernet interfaces write to servers behind OC-12 links in a network with a central switch. Figure A.3 shows the performance results for such a case.

Figure A.3: No congestion: two 100baseT clients writing to OC-12 servers in a central-switch network

Both standard TCP and vrate behave well in this case. The maximum performance achieved in Figure A.3 is in fact the maximum available to application connections in this simulation model, which is 87 percent of the link speed. The difference from the raw link capacity is due to headers and the synchronization overhead needed to control the transfers. All sender-limited cases behave in a similar fashion, with both plain TCP and vrate achieving performance near the maximum. For this reason, in the sections that follow, cases with no network congestion are not considered.

A.2 Central Switch and Related Central Link Cases

Whenever the link connecting two switches in a two-switch network has the same capacity as the client link in that network, the two-switch case behavior is almost identical to that of the central-switch case (for a single client).

Figure A.4 illustrates this fact for the case where servers with OC-12 links send data to clients on Fast Ethernet links, in both (a) a network with a central switch and (b) a network with two switches connected by a Fast Ethernet link.

Figure A.4: 100baseT client reading from OC-12 servers. (a) Central switch; (b) Central 100baseT link.

The addition of the inter-switch link does not affect the overall protocol behavior, but it does reduce the maximum throughput achieved in each case. For example, for 6 servers performance drops 15 percent for vrate and 6 percent for TCP. This is an effect of the longer path: there is more data in transit at any time, so more data may have to be retransmitted after losses, and the pipe takes longer to fill. This similarity of behavior between central-switch networks and networks with a central link whose capacity equals that of the client link was common in the simulation results. In the sections that follow, such cases are often discussed together.

A.3 Reading from Ethernet Servers Through Ethernet Bottlenecks

When multiple Fast Ethernet servers are sending data and the bottleneck is the first link after the switch to which they are connected, both vrate and TCP have similar behavior, with a small advantage for the new protocol. That happens both with a central switch and with networks with a Fast Ethernet link connecting two switches, although the results are not exactly the same in both cases. Figure A.5 shows the results for one client reading from Fast Ethernet servers for two different cases: (a) a Fast Ethernet client in a single-switch network, and (b) an ATM client in a network with two switches connected by a Fast Ethernet link.

Figure A.5: Client reading from 100baseT servers. (a) Central switch, 100baseT client; (b) Central 100baseT link, OC-12 client.

In the case of the single-switch network, performance drops for both protocols as the number of servers increases. With more servers sending data to the client, more packets can arrive at the client's switch at the same time, and drops may occur if the queue is full. As the number of servers continues to grow, there comes a point where the bandwidth for each connection gets so low that packets become spaced out enough to make collisions unlikely. In the two-switch network case, vrate maintains high throughput as the number of servers increases.

Figure A.6: Two-way traffic with 100baseT clients. (a) Central switch, 100baseT clients; (b) Central 100baseT link, OC-12 clients.

When two-way traffic is considered for the same cases (Figure A.6), vrate still has better performance. In the central-switch case, for example, it achieves a throughput more than 5 percent higher than that of TCP in all cases below 8 servers.

In the two-switch case, neither protocol is able to maintain the same performance level as the number of servers increases, but vrate is still better.

A.4 Reading from ATM Servers Through Ethernet Bottlenecks

When clients read from ATM servers, the behavior is the same as that illustrated in Figure A.4: vrate keeps performance close to maximum most of the time, oscillating between 75 and 83 percent of the raw link capacity, while TCP degrades as the number of connections increases, going from 75 percent of the capacity with one server to just 45 percent with 7 servers.

Figure A.7: Clients reading from OC-12 servers. (a) Central 100baseT link, OC-12 client; (b) Central OC-12 link, 100baseT client.

Figure A.7 shows two other cases with exactly the same behavior. As the number of servers increases, the chances of multiple packets from ATM links arriving together at the switch to be queued for transmission on the Fast Ethernet link become higher. That may lead to the creation of packet trains and to drops due to packets from different links arriving too close together. As vrate spaces packets over time, such events become less likely and performance increases. With two-way traffic, changes are noticeable if the bottleneck is the link connecting the two switches (Figure A.8.a). In this case, ACK compression aggravates the problem of packet trains. For example, TCP performance for 3 servers drops from 67 percent of the link capacity to 51 percent, while vrate drops from 80 percent to 64 percent. Even then, vrate still outperforms TCP by at least 17 percent.

Figure A.8: Two-way traffic with OC-12 servers. (a) Central 100baseT link, OC-12 clients; (b) Central OC-12 link, 100baseT clients.

When the client link is the bottleneck, however, there is almost no change in throughput (Figure A.8.b). This is because in that case the bottleneck is the client's own link, which is never crossed by the other client's traffic.

A.5 ATM Clients Writing Through Fast Ethernet Central Link

When the client has a high-capacity link, the scenario is similar to that described in Section A.4, with the difference that more connections travel through the client link here than through any server link there. That leads to better behavior for TCP, which does not drop as much for intermediate numbers of servers, as can be seen in Figure A.9. For one client, the difference in performance between vrate and TCP oscillates between 5 and 20 percent, for 2 and 7 servers respectively. Although the new protocol behaves better in most cases, it is 3 percent lower than TCP for 8 servers.

When two clients are used, however, packet trains become more frequent on the switch closer to the clients and multiple drops happen, reducing performance. In this case vrate is capable of avoiding losses due to packet trains, and it performs better than TCP. It is just 5 percent better than TCP for a single connection, but its performance becomes 30 percent higher than TCP's for 8 servers. Nevertheless, it is also affected by the increased congestion, and is not able to use more than 60 percent of the total capacity.

Figure A.9: Central 100baseT configuration with 100baseT servers. (a) One OC-12 client writing to 100baseT servers; (b) Two OC-12 clients writing to 100baseT servers.

Figure A.10: Two-way traffic: OC-12 clients, 100baseT central link, 100baseT servers

As in Figures A.7.a and A.8.a, performance decreases with two-way traffic due to ACK compression. Again, although its maximum performance is not as high as in the single-client case, vrate performs better by avoiding packet trains, giving more stable results (always around 60 percent of the system capacity) and outperforming TCP by more than 80 percent when 8 servers are active.

A.6 ATM Clients Writing to Fast Ethernet Servers

This case is slightly different from the previous one, because multiple drops are less likely to occur. Packets coming through the high-capacity link (either the client's or an ATM central link) get distributed over multiple server links. Under normal conditions TCP congestion control should work correctly and no multiple losses should take place.

However, if two clients try to access the same servers, serious congestion can occur.

Figure A.11: Two OC-12 clients writing to 100baseT servers. (a) Single-switch network; (b) Two-switch network, OC-12 central link.

Figure A.11 shows the results for both network architectures when two clients write to the servers. In the two-switch case, drops can occur at both switches: traffic from the two ATM client links must first be merged onto the same outgoing ATM link, and the traffic from that ATM link must then be routed onto one of the Fast Ethernet links. In the single-switch network, TCP performance starts at just 40 percent of the total capacity with one server, growing as the number of connections increases, up to approximately 50 percent for 8 servers. For the same case vrate starts at 58 percent of the total capacity and ends with 83 percent for 8 servers. In the case of the network with two switches, due to the increased congestion, both protocols show lower numbers, with TCP's oscillating around 45 percent and vrate's going from 55 to 74 percent. Figure A.12 shows the results for two-way traffic. TCP's performance is further reduced in both architectures due to ACK compression, never exceeding 36 percent of the total capacity in either case. On the other hand, vrate's performance suffers more when only one or two servers are used. As the number of servers increases its performance grows, exceeding TCP's by more than 100 percent in some cases. In the central-switch case it remains stable at around 70 percent of the link capacity. Nevertheless, that is lower than the maximum of 80 percent it achieved when only one client was active.

Figure A.12: Two-way traffic between OC-12 clients and 100baseT servers. (a) Single-switch network; (b) Two-switch network with OC-12 central link.

A.7 ATM Networks

The simulations show poor results for all cases where congestion occurs in an all-ATM network. Figure A.13 shows two such cases, where two clients (a) write to servers on a network with a single switch, and (b) read on a network with a central link. Except for the case with 8 servers in the two-switch network, performance never goes above 65 percent of the maximum capacity for vrate, and 50 percent for TCP.

Figure A.13: Performance in all-ATM networks. (a) Single-switch network, two clients writing; (b) Two-switch network, two clients reading.

Although vrate performs better than TCP in some cases, overall performance is low. This is due to the large delay-bandwidth product of ATM networks. If we consider round-trip times on the order of 1 millisecond, there may be more than 300 kilobytes in transit at any time.

In the case of the two-switch network, the fact that all connections share a single link seems to help vrate keep performance at least 25 percent higher than TCP's from one to seven servers.

Figure A.14: Two-way traffic in all-ATM networks. (a) Single-switch network; (b) Two-switch network.

Results for two-way traffic are shown in Figure A.14. Compressed ACKs reduce performance even further than in the previous results. Although vrate outperforms TCP, overall performance is low for both protocols, staying under 45 percent of the link capacity for vrate and 35 percent for TCP.

REFERENCES

[AD98] Mohit Aron and Peter Druschel. TCP: Improving startup dynamics by adaptive timers and congestion control. Technical Report TR98-318, Rice University, June 1998.

[AFP98] M. Allman, S. Floyd, and C. Partridge. Increasing TCP's initial window. Request for Comments RFC 2414, IETF, 1998.

[AH098] M. Allman, C. Hayes, and S. Ostermann. An evaluation of TCP with larger initial windows. Computer Communication Review, 28(3), July 1998.

[Ahu93] Mohan Ahuja. An implementation of f-channels. IEEE Transactions on Parallel and Distributed Systems, 4(6):658-667, June 1993.

[AWG94] Preliminary survey of I/O intensive applications. Working paper no. 1, Applications Working Group of the Scalable I/O Initiative, 1994.

[AYHI97] D. Andresen, T. Yang, V. Holmedahl, and O. Ibarra. SWEB: Towards a scalable WWW server on multicomputers. Journal of Parallel and Distributed Computing, 1997.

[Bar91] Joseph S. Barrera. A fast Mach network IPC implementation. In Proceedings of the Usenix Mach Symposium. Usenix Association, November 1991.

[BCF+95] N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, and Wen-King Su. Myrinet: A gigabit-per-second local area network. IEEE Micro, 15(1):29-36, February 1995.

[Ber95] L. Berdahl. Parallel transport protocol proposal. Technical report, Lawrence Livermore National Labs, January 1995. Draft, available at ftp://svr4.nersc.gov/pub/Pio-l-3-95.ps.

[BHZ98] R. Bruyeron, B. Hermon, and L. Zhang. Experimentations with TCP selective acknowledgement. Computer Communication Review, 28(2), April 1998.

[Bie93] Ernst W. Biersack. Performance evaluation of forward error correction in an ATM environment. IEEE Journal on Selected Areas in Communications, 11(4):631-640, May 1993.

[BKLL93] Joseph Boykin, David Kirschen, Alan Langerman, and Susan LoVerso. Programming under Mach. Addison-Wesley, 1993.

[BOP94] Lawrence S. Brakmo, Sean W. O'Malley, and Larry L. Peterson. TCP Vegas: New techniques for congestion detection and avoidance. Computer Communication Review, 24(4):24-35, October 1994.

[Bor96] Rajesh Bordawekar. Implementation and evaluation of collective I/O in the Intel Paragon Parallel File System. Technical Report CACR TR-128, Center for Advanced Computing Research, California Institute of Technology, November 1996.

[BP95] Lawrence Brakmo and Larry Peterson. TCP Vegas: End-to-end congestion avoidance on a global Internet. IEEE Journal on Selected Areas in Communications, 13(8), October 1995.

[BP96] Lawrence S. Brakmo and Larry L. Peterson. Experiences with network simulation. In SIGMETRICS'96. ACM, June 1996.

[BPS+97] H. Balakrishnan, V. Padmanabhan, S. Seshan, M. Stemm, and R. Katz. TCP behavior of a busy Internet server. Technical Report CSD-97-966, University of California at Berkeley, 1997.

[BS98] Jose Carlos Brustoloni and Peter Steenkiste. User-level protocol servers with kernel-level performance. In Proceedings of the Infocom Conference, San Francisco, CA, April 1998. IEEE.

[BSS+95] Donald J. Becker, Thomas Sterling, Daniel Savarese, John E. Dorband, Udaya A. Ranawake, and Charles V. Packer. Beowulf: A parallel workstation for scientific computation. In Proceedings of the International Conference on Parallel Processing, 1995.

[C+95] P. F. Corbett et al. Parallel file systems for the IBM SP computers. IBM Systems Journal, 34(2), 1995.

[C+97] David E. Culler et al. Parallel computing on the Berkeley NOW. In Proceedings of the 9th Joint Symposium on Parallel Processing, Kobe, Japan, 1997.

[CBM+95] Alok Choudhary, Rajesh Bordawekar, Sachin More, K. Sivaram, and Rajeev Thakur. PASSION runtime library for the Intel Paragon. In Proceedings of the Intel Supercomputer User's Group Conference, June 1995.

[CF96] Peter F. Corbett and Dror G. Feitelson. The Vesta parallel file system. ACM Transactions on Computer Systems, 14(3), August 1996.

[CFF+96] P. Corbett, D. Feitelson, S. Fineberg, Y. Hsu, B. Nitzberg, J.P. Prost, M. Snir, B. Traversat, and P. Wong. Input/Output in parallel and distributed computer systems, chapter Overview of the MPI-IO parallel I/O interface, pages 127-146. Kluwer Academic Publishers, 1996.

[CFKS96] Prashant Chandra, Allan Fisher, Corey Kosak, and Peter Steenkiste. Implementation of ATM endpoint congestion control protocols. In Proceedings of Hot Interconnects, Stanford, CA, August 1996.

[CJRS89] David D. Clark, Van Jacobson, John Romkey, and Howard Salwen. An analysis of TCP processing overhead. IEEE Communications Magazine, 27(6):23-29, June 1989.

[CLZ87] David D. Clark, Mark L. Lambert, and Lixia Zhang. NETBLT: A bulk data transfer protocol. Request for Comments RFC 998, IETF, SRI International, March 1987.

[CO98] J. Crowcroft and P. Oechslin. Differentiated end-to-end Internet services using a weighted proportional fair sharing TCP. Computer Communication Review, 28(3), July 1998.

[Com95] Douglas E. Comer. Internetworking with TCP/IP, volume I: Principles, Protocols and Architecture. Prentice-Hall, 3rd edition, 1995.

[CPD+96] P. Corbett, J.P. Prost, C. Demetriou, G. Gibson, E. Riedel, J. Zelenka, Y. Chen, E. Felten, K. Li, J. Hartman, L. Peterson, B. Bershad, A. Wolman, and R. Aydt. Proposal for a common parallel file system programming interface version 1.0. Technical Report CACR-130, Center for Advanced Computing Research of the California Institute of Technology, Pasadena, CA, November 1996.

[DDK+90] W. A. Doeringer, D. Dykeman, M. Kaiserswerth, B. W. Meister, H. Rudin, and R. Williamson. A survey of light-weight transport protocols for high-speed networks. IEEE Transactions on Communications, 38(11):2025-2038, November 1990.

[DJ91] Peter B. Danzig and Sugih Jamin. tcplib: A library of TCP/IP traffic characteristics. Technical Report TR CS-SYS-91-01, USC Networking and Distributed Systems Laboratory, October 1991.

[DKS90] A. Demers, S. Keshav, and S. Shenker. Analysis and simulation of a fair queueing algorithm. Journal of Internetworking Research and Experience, pages 3-26, September 1990.

[DP93] Peter Druschel and Larry L. Peterson. Fbufs: A high-bandwidth cross-domain transfer facility. In Proceedings of the Fourteenth Symposium on Operating Systems Principles, December 1993.

[dRC94] Juan Miguel del Rosario and Alok N. Choudhary. High-performance I/O for massively parallel computers: Problems and prospects. IEEE Computer, 1994. 127

[Dun94] Thomas H. Dunigan. Early experiences and performance of the Intel Paragon. Technical Report ORNL/TM-12194, Oak Ridge National Laboratory, October 1994.

[EM95] Aled Edwards and Steve Muir. Experiences implementing a high performance TCP in user-space. Computer Communication Review, 25(4), October 1995.

[EWL+94] A. Edwards, G. Watson, J. Lumley, D. Banks, C. Calamvokis, and C. Dalton. User-space protocols deliver high performance to applications on a low-cost Gb/s LAN. Computer Communication Review, 24(4), October 1994.

[FCBH95] Dror G. Feitelson, Peter F. Corbett, Sandra Johnson Baylor, and Yarsun Hsu. Parallel I/O subsystems in massively parallel supercomputers. IEEE Parallel and Distributed Technology, Fall 1995.

[FDCF94] Robert Felderman, Annette DeSchon, Danny Cohen, and Gregory Finn. ATOMIC: A high-speed local communication architecture. Journal of High Speed Networking, 3(1), 1994.

[FF96] Kevin Fall and Sally Floyd. Simulation-based comparisons of Tahoe, Reno and SACK TCP. Computer Communication Review, 26(3), July 1996.

[FGB91] Alessandro Forin, David Golub, and Brian Bershad. An I/O system for Mach 3.0. In Proceedings of the USENIX Mach Symposium, November 1991.

[FGP95] David R. Follet, Maria C. Gutierrez, and Richard F. Prohaska. A high performance ATM protocol engine for the Intel Paragon. White Paper by GigaNet, Inc., 1995.

[FJ92] Sally Floyd and Van Jacobson. Traffic phase effects in packet-switched gateways. Journal of Internetworking: Practice and Experience, 3(3):115-156, September 1992.

[FKKM97] Ian Foster, David Kohr, Jr., Rakesh Krishnaiyer, and Jace Mogill. Remote I/O: Fast access to distant storage. In Proceedings of the Fifth Workshop on Input/Output in Parallel and Distributed Systems, pages 14-25, San Jose, CA, November 1997. ACM Press.

[Flo95a] Sally Floyd. TCP and explicit congestion notification. Computer Communication Review, September 1995.

[Flo95b] Sally Floyd. TCP and successive fast retransmits. Technical report, available via ftp://ftp.ee.lbl.gov/papers/fastretrans.ps, May 1995.

[G+98] G. Gibson et al. A cost-effective, high-bandwidth storage architecture. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), pages 92-103, October 1998.

[GFLH98] Andrew Grimshaw, Adam Ferrari, Greg Lindahl, and Katherine Holcomb. Metasystems. Communications of the ACM, 41(11), November 1998.

[GLS95] W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming with the Message Passing Interface. MIT Press, 1995.

[Har99] John Hartman. The Swarm scalable file system. Project Web page available at http://www.cs.arizona.edu/swarm, 1999.

[Hoe96] J. C. Hoe. Improving the start-up behavior of a congestion control scheme for TCP. In Proceedings of the ACM SIGCOMM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, volume 26(4) of Computer Communication Review, pages 270-280, New York, August 1996. ACM Press.

[HP91] Norman C. Hutchinson and Larry L. Peterson. The x-kernel: An architecture for implementing network protocols. IEEE Transactions on Software Engineering, 17(1):64-76, January 1991.

[HSMK98] T. Henderson, E. Sahouria, S. McCanne, and R. H. Katz. Improving fairness of TCP congestion avoidance. In Proceedings of the Globecom Conference, Sydney, Australia, November 1998. IEEE.

[Int91] Paragon XP/S product overview. Intel Corporation, 1991.

[Int95] Paragon high performance parallel interface manual. Intel Corporation, 1995.

[Int97] Intel TeraFLOPs supercomputer project home page. Intel Corporation, 1997.

[Jac88] Van Jacobson. Congestion avoidance and control. In Proceedings of the SIGCOMM'88 Workshop, pages 314-329. ACM SIGCOMM, August 1988.

[JBB92] Van Jacobson, R. Braden, and D. Borman. TCP extensions for high performance. Request for Comments RFC 1323, IETF, May 1992.

[JWB96] Ravi Jain, John Werth, and James C. Browne, editors. Input/Output in Parallel and Distributed Computer Systems. Kluwer Academic Publishers, 1996.

[KBM94] Eric Dean Katz, Michelle Butler, and Robert McGrath. A scalable HTTP server: The NCSA prototype. Computer Networks and ISDN Systems, 27:155-164, 1994.

[KBM+96] Yousef A. Khalidi, Jose M. Bernabeu, Vlada Matena, Ken Shirriff, and Moti Thadani. Solaris MC: A multi-computer OS. In Proceedings of the USENIX Annual Technical Conference, San Diego, CA, January 1996.

[KC94] Vijay Karamcheti and Andrew A. Chien. Software overhead in messaging layers: Where does the time go? ACM SIGPLAN Notices, 29(11):51-60, November 1994.

[KP93] Jonathan Kay and Joseph Pasquale. The importance of non-data touching processing overheads in TCP/IP. Computer Communication Review, 23(4):259-268, October 1993.

[Lie95] Jochen Liedtke. On micro-kernel construction. In Proceedings of the Fifteenth ACM Symposium on Operating System Principles, December 1995.

[LIN+93] Susan J. LoVerso, Marshall Isman, Andy Nanopoulos, William Nesheim, Ewan D. Milne, and Richard Wheeler. sfs: A parallel file system for the CM-5. In Proceedings of the USENIX Summer 1993 Technical Conference, pages 291-306, Berkeley, CA, USA, June 1993. USENIX Association.

[LK98] Dong Lin and H. T. Kung. TCP fast recovery strategies: Analysis and improvements. In Proceedings of the Infocom Conference, San Francisco, CA, April 1998. IEEE.

[LM97] Dong Lin and Robert Morris. Dynamics of random early detection. In Proceedings of SIGCOMM'97. ACM, 1997.

[LR94] John LoVerso and Paul Roy. The network architecture of OSF/1 AD version 2. In OSF/RI Operating Systems Collected Papers Vol. 3. OSF Research Institute, February 1994.

[MB93] Chris Maeda and Brian N. Bershad. Protocol service decomposition for high-performance networks. In Proceedings of the Fourteenth ACM Symposium on Operating System Principles, December 1993.

[MJ93] Steven McCanne and Van Jacobson. The BSD packet filter: A new architecture for user-level packet capture. In Proceedings of the Winter 1993 USENIX Conference, pages 259-269, San Diego, CA, January 1993. USENIX.

[MM96] Matthew Mathis and Jamshid Mahdavi. Forward acknowledgement: Refining TCP congestion control. Computer Communication Review, 26(4), October 1996.

[MMFR96] Matt Mathis, Jamshid Mahdavi, Sally Floyd, and Allyn Romanow. TCP selective acknowledgement options. Request for Comments RFC 2018, IETF, October 1996.

[Mor97] Robert Morris. TCP behavior with many flows. In Proceedings of the IEEE International Conference on Network Protocols, Atlanta, GA, October 1997.

[Mos96] David Mosberger. Personal communication, 1996.

[MS96] Steven Moyer and V. S. Sunderam. Input/Output in Parallel and Distributed Computer Systems, chapter Scalable Concurrency Control for Parallel File Systems, pages 225-243. Kluwer Academic Publishers, 1996.

[MSMO97] Matthew Mathis, Jeffrey Semke, Jamshid Mahdavi, and Teunis Ott. The macroscopic behavior of the TCP congestion avoidance algorithm. Computer Communication Review, 27(3), July 1997.

[Mur99] Ian Murdock. Personal communication, 1999.

[NF95] Bill Nitzberg and Samuel A. Fineberg. Parallel I/O on highly parallel systems — Supercomputing '95 tutorial M6 notes. Technical Report NAS-95-022, Numerical Aerodynamic Simulation Facility at NASA Ames Research Center, December 1995.

[Nic94] John R. Nickolls. The MasPar scalable Unix I/O system. In Proceedings of the 8th International Parallel Processing Symposium, pages 390-395, Cancun, Mexico, April 1994. IEEE.

[NK96] Nils Nieuwejaar and David Kotz. Input/Output in Parallel and Distributed Computer Systems, chapter Low-Level Interfaces for High-Level Parallel I/O, pages 205-224. Kluwer Academic Publishers, 1996.

[OP92] Sean W. O'Malley and Larry L. Peterson. A dynamic network architecture. ACM Transactions on Computer Systems, 10(2):110-143, May 1992.

[PDZ97] Vivek S. Pai, Peter Druschel, and Willy Zwaenepoel. IO-Lite: A unified I/O buffering and caching system. Technical Report TR97-294, Rice University Computer Science Department, 1997.

[PR95] Michael Perloff and Kurt Reiss. Improvements to TCP performance in high-speed ATM networks. Communications of the ACM, 38(2), February 1995.

[PS93] Thomas F. La Porta and Mischa Schwartz. The MultiStream protocol: A highly flexible high-speed transport protocol. IEEE Journal on Selected Areas in Communications, 11(4), May 1993.

[RBF"'"89] Richard Rashid, Robert Baron, Alessandro Forin, David Golub, Michael Jones, Daniel Julin, Douglas Orr, and Richard Sanzi. Meich: A foundation for open systems. In Proceedings of the Second Workshop on Workstation Operating Systems (WW0S2), September 1989.

[RBG+93] Paul Roy, David Black, Paulo Guedes, John LoVerso, Durriya Netterwala, Faramarz Rabii, Michael Barnett, Bradford Kemp, Michael Leibensperger, Chris Pejik, and Roman Zajcew. An OSF/1 UNIX for massively parallel multicomputers. In OSF/RI Operating Systems Collected Papers Vol. 2. OSF Research Institute, Cambridge, MA, October 1993.

[Ren97] J. Renwick. IP over HIPPI. Request for Comments RFC 2067, IETF, January 1997.

[RF94] Allyn Romanow and Sally Floyd. Dynamics of TCP traffic over ATM networks. In Proceedings of SIGCOMM'94. ACM, August 1994.

[RH91] Franklin Reynolds and Jeffrey Heller. Kernel support for network protocol servers. In Proceedings of the USENIX Mach Symposium, November 1991.

[SCJ+95] K. E. Seamons, Y. Chen, P. Jones, J. Jozwiak, and M. Winslett. Server-directed collective I/O in Panda. In Proceedings of Supercomputing '95, San Diego, CA, December 1995. IEEE Computer Society Press.

[SIO95] Network-attached peripherals (NAP) for HPSS/SIOF. Available at http://www.llnl.gov/liv_comp/siof/siof_nap.html, 1995.

[SLS94] Martin W. Sachs, Avraham Leff, and Denise Sevigny. LAN and I/O convergence: A survey of the issues. IEEE Computer, 27(12):24-33, 1994.

[SS94] Subhash Saini and Horst D. Simon. Applications performance under OSF/1 AD and SUNMOS on Intel Paragon XP/S-15. In Proceedings of Supercomputing'94, Washington, DC, November 1994.

[SS96] T. Liu Shen and V. Samalam. The available bit rate service for data in ATM networks. IEEE Communications Magazine, May 1996.

[Ste96] Peter Steenkiste. Network-based multicomputers: A practical supercomputer architecture. IEEE Transactions on Parallel and Distributed Systems, 7(8), August 1996.

[Ste97] W. Stevens. TCP slow start, congestion avoidance, fast retransmit, and fast recovery algorithms. Request for Comments RFC 2001, IETF, January 1997.

[SV98] Dimitrios Stiliadis and Anujan Varma. Efficient fair-queueing algorithms for packet-switching networks. IEEE/ACM Transactions on Networking, April 1998.

[SZC90] Scott Shenker, Lixia Zhang, and David D. Clark. Some observations on the dynamics of a congestion control algorithm. Computer Communication Review, 20(4), October 1990.

[TC96] Rajeev Thakur and Alok Choudhary. Input/Output in Parallel and Distributed Computer Systems, chapter Runtime Support for Out-of-Core Parallel Programs, pages 147-165. Kluwer Academic Publishers, 1996.

[Ten89] David L. Tennenhouse. Layered multiplexing considered harmful. In Protocols for High-Speed Networks. Elsevier Science Publishers, 1989.

[TNML93] Chandramohan A. Thekkath, Thu D. Nguyen, Evelyn Moy, and Edward D. Lazowska. Implementing network protocols at user level. IEEE/ACM Transactions on Networking, 1(5):554-565, December 1993.

[top98] TOP 500 supercomputing sites, http://www.top500.org/, 1998.

[Tou97] J. Touch. TCP control block interdependence. Request for Comments RFC 2140, IETF, April 1997.

[TR93] Don Tolmie and John Renwick. HiPPI: Simplicity yields success. IEEE Network, January 1993.

[VDK92] Ronald J. Vetter, David H. C. Du, and Alan E. Klietz. Network supercom­ puting. IEEE Network, 6(3):38-44, May 1992.

[Vet95] Ronald J. Vetter. ATM concepts, architectures, and protocols. Communications of the ACM, 38(2), February 1995.

[VH97] Vikram Visweswaraiah and John Heidemann. Improving restart of idle TCP connections. Technical report, ISI, August 1997.

[WM97] Stephen W. Wharton and Monica Faeth Myers. MTPE EOS data products handbook. Technical report, NASA Goddard Space Flight Center, 1997.

[WS95] Gary R. Wright and W. Richard Stevens. TCP/IP Illustrated, volume 2: The Implementation. Addison-Wesley Publishing, 1995.

[YBMM94] Masanobu Yuhara, Brian N. Bershad, Chris Maeda, and J. Eliot B. Moss. Efficient packet demultiplexing for multiple endpoints and large messages. In USENIX Conference Proceedings, pages 153-165, Winter 1994.

[Zha91] Lixia Zhang. VirtualClock: A new traffic control algorithm for packet-switched networks. ACM Transactions on Computer Systems, 9(2):101-124, May 1991.

[ZSC91] Lixia Zhang, Scott Shenker, and David D. Clark. Observations on the dynamics of a congestion control algorithm: The effects of two-way traffic. In Proceedings of SIGCOMM'91. ACM, 1991.