• Availability in VAXcluster Systems • Network Performance and Adapters

Digital Technical Journal Digital Equipment Corporation

Volume 3 Number 3

Summer 1991

Editorial
Jane C. Blake, Editor
Kathleen M. Stetson, Associate Editor
Leon Descoteaux, Associate Editor

Circulation
Catherine M. Phillips, Administrator
Sherry L. Gonzalez

Production
Mildred R. Rosenzweig, Production Editor
Margaret Burdine, Typographer
Peter Woodbury, Illustrator

Advisory Board
Samuel H. Fuller, Chairman
Richard W. Beane
Robert M. Glorioso
Richard J. Hollingsworth
John W. McCredie
Alan G. Nemeth
Mahendra R. Patel
F. Grant Saviers
Victor A. Vyssotsky
Gayn B. Winters

The Digital Technical Journal is published quarterly by Digital Equipment Corporation, 146 Main Street ML01-3/B68, Maynard, Massachusetts 01754-2571. Subscriptions to the Journal are $40.00 for four issues and must be prepaid in U.S. funds. University and college professors and Ph.D. students in the electrical engineering and computer science fields receive complimentary subscriptions upon request. Orders, inquiries, and address changes should be sent to the Digital Technical Journal at the published-by address. Inquiries can also be sent electronically to DTJ@CRL.DEC.COM. Single copies and back issues are available for $16.00 each from Digital Press of Digital Equipment Corporation, 12 Crosby Drive, Bedford, MA 01730-1493.

Digital employees may send subscription orders on the ENET to RDVAX::JOURNAL or by interoffice mail to mailstop ML01-3/B68. Orders should include badge number, site location code, and address. All employees must advise of changes of address.

Comments on the content of any paper are welcomed and may be sent to the editor at the published-by or network address.

Copyright © 1991 Digital Equipment Corporation. Copying without fee is permitted provided that such copies are made for use in educational institutions by faculty members and are not distributed for commercial advantage. Abstracting with credit of Digital Equipment Corporation's authorship is permitted. All rights reserved.

The information in the journal is subject to change without notice and should not be construed as a commitment by Digital Equipment Corporation. Digital Equipment Corporation assumes no responsibility for any errors that may appear in the journal.

ISSN 0898-901X

Documentation Number EY-H890E-DP

The following are trademarks of Digital Equipment Corporation: BI, CI, DEC, DECconcentrator, DECdtm, DEC FDDIcontroller, DECnet, DELNI, DSA, DSSI, Digital, the Digital logo, HSC, LAT, Local Area VAXcluster, MSCP, RA, Rdb/VMS, TA, ULTRIX, UNIBUS, VAX, VAX DBMS, VAX MACRO, VAX RMS, VAX 6000, VAX 9000, VAXcluster, VMS, VMS Volume Shadowing.

Motorola and 68000 are registered trademarks of Motorola, Inc.

Book production was done by Digital's Database Publishing Group in Northboro, MA.

Cover Design: Our cover graphic represents the shadowing, or replication, of data on multiple physical disks in a cluster environment. VAX VMS host-based volume shadowing provides the high data availability required for applications such as transaction processing and is the subject of one of the papers in this issue. The cover was designed by Sandra Calef of Calef Associates.

Contents

5 Foreword  Howard H. Hayakawa and George S. Hoff

Availability in VAXcluster Systems Network Performance and Adapters

7 Design of VMS Volume Shadowing Phase II-Host-based Shadowing Scott H. Davis

16 Application Design in a VAXcluster System  William E. Snaman, Jr.

27 New Availability Features of Local Area VAXcluster Systems Lee Leahy

36 Design of the DEC LANcontroller 400 Adapter  Richard E. Stockdale and Judy B. Weiss

48 The Architecture and Implementation of a High-performance FDDI Adapter  Satish L. Rege

64 Performance Analysis of a High-speed FDDI Adapter  Ram S. Kalkunte

78 Performance Analysis of FDDI  Raj Jain

Editor's Introduction

This issue of the Digital Technical Journal contains a collection of papers on two general topics: VAXcluster systems, and network adapters and performance. The first set of three papers describes new VMS VAXcluster developments and features; the second set addresses the topics of LAN adapter design and performance measurement techniques. A common theme across these papers is the development of technologies for interconnecting systems that offer high data availability without sacrificing performance.

VMS Volume Shadowing, described by Scott Davis, is a means of ensuring data availability and integrity in VMS VAXcluster systems. By maintaining multiple copies of data on separate devices, the volume shadowing technique protects data from being lost as the result of media deterioration or device failures. Scott discusses the advantages of the new design over controller-based shadowing and explains how this fully distributed software makes a broad range of topologies suitable for shadowing.

The growth capabilities and availability of VMS VAXcluster systems are characteristics well suited to applications with high-availability requirements. Sandy Snaman first presents an overview of the VAXcluster system architecture, including explanations of the layers, their purpose and function. He then gives practical insights into how the system implementation affects application design and reviews the choices available to application designers in the areas of client-server computing and data sharing.

The availability of applications and cluster configurations is also enhanced by developments in a new release of the VMS operating system. Lee Leahy describes a VMS feature that enables fail-over between multiple LAN adapters and compares this approach to a single-adapter implementation. He then discusses and gives examples of VMS features for network delay detection and reduction, and failure analysis in local area VAXcluster systems.

The focus then moves from VMS-level concerns to the design of network adapters and performance measurement. The adapter described by Dick Stockdale and Judy Weiss is the DEC LANcontroller 400, which connects systems based on Digital's XMI bus to a LAN. This particular design improves on previous designs by transforming the adapter from a dumb to an intelligent adapter which can off-load the host. Consequently, the adapter supports systems that utilize the full bandwidth of Ethernet. The authors provide a system overview, performance metrics, and a critical examination of firmware-based design.

Like the LANcontroller 400, the FDDIcontroller 400 is an adapter that interfaces XMI-based systems to a LAN. However, as Satish Rege relates, this adapter was required to transmit data 30 times faster than Ethernet adapters. Satish discusses the architecture and the choices designers made to address the problem of interfacing a parallel high-bandwidth CPU bus (XMI) to a serial fiber-optic network bus (FDDI). Their design choices included a three-stage pipeline approach to buffering that enables these stages to proceed in an asynchronous fashion.

To ensure that the performance goals for the FDDIcontroller would be met, a simulation model was created. In his paper, Ram Kalkunte details the modeling methodology, reviews components, and presents simulation results. Ram describes how, in addition to performance projections, the model provided designers with buffer sufficiency analysis and helped engineers analyze the functional correctness of the adapter design.

The high level of performance achieved by the FDDIcontroller was driven by the high performance of the FDDI LAN itself: 100 megabits per second. Raj Jain's subject is performance measurement at the level of the FDDI LAN. Raj describes the performance analysis of Digital's implementation of FDDI and how various parameters affect system performance. As part of his presentation of the modeling and simulation methods used, he shares guidelines for setting the value of one of the key parameters, target token rotation time, to optimize performance. Raj has recently published a book on computer systems performance analysis, which is reviewed in the Further Readings section of this issue.

Jane C. Blake
Editor

Biographies

Scott H. Davis  Consultant software engineer Scott Davis is the VMS Cluster Technical Director. He is involved in future VMS I/O subsystem, VAXcluster, and storage strategies and led the VMS Volume Shadowing Phase II and VMS mixed-interconnect VAXcluster I/O projects. Since joining Digital in 1979, Scott worked on RT-11 development and led various real-time operating systems projects. He holds a B.S. (1978, cum laude) in computer science and applied mathematics from the State University of New York at Albany. Scott is a coinventor for four patents on the shadowing design.

Raj Jain  Raj Jain is a senior consulting engineer involved in the performance modeling and analysis of computer systems and networks, including VAXcluster, DECnet, Ethernet, and FDDI products. He holds a Ph.D. (1978) from Harvard University and has taught courses in performance analysis at M.I.T. A member of the Authors' Guild and senior member of IEEE, Raj has written over 25 papers and is listed in Who's Who in the Computer Industry, 1989. He is the author of The Art of Computer Systems Performance Analysis, published recently by Wiley.

Ram Kalkunte  As a member of the VAX VMS Systems and Servers Group, senior engineer Ram Kalkunte worked on the development of the high-performance XMI-to-FDDI adapter. Since joining Digital in 1987, he has also worked on various performance analysis and modeling projects. Ram received a B.E. (1984) in instrumentation engineering from Mysore University in India and an M.S. (1987) in electrical engineering from Worcester Polytechnic Institute. He has a pending patent related to dynamically arbitrating conflicts in a multiport memory controller.

Lee Leahy  Lee Leahy is a principal software engineer in the VAXcluster group in ALPHA/EVMS Development. He is currently the project leader of the Local Area VAXcluster development effort and was responsible for the final design and implementation of the multiple-adapter version of PEDRIVER. Lee joined Digital in 1988 from ECAD. He is coauthor of the book VMS Advanced Device Driver Techniques and has been writing VMS device drivers since 1980. Lee received a B.S. degree in electrical engineering from Lehigh University in 1977.


Satish L. Rege  Satish Rege joined Digital in 1977 as a member of the Mass Storage Group. He wrote the first proposal for the MSCP protocol and evaluated disk controller design alternatives (e.g., seek algorithms and disk caching) for the HSC50 by implementing architectural and performance simulation. He was instrumental in architecting the low-end controllers used in RF-series disks. His latest project was the high-performance FDDI adapter. Satish is a consulting engineer and received his Ph.D. from Carnegie-Mellon University, Pittsburgh.

William E. Snaman, Jr.  Principal software engineer Sandy Snaman joined Digital in 1980. He is the technical supervisor for the VAXcluster executive services area and a project leader for VAXcluster advanced development in the VMS Development Group. His group is responsible for ongoing development of the VMS lock manager, connection manager, and clusterwide services. Sandy teaches computer science at Daniel Webster College and developed and taught VA…

Richard E. Stockdale  As a member of the Midrange Systems Engineering Group, Dick Stockdale was firmware project leader for the DEMNA project and prior to that for the DEBNA and DEBNI Ethernet adapters. He is currently a software consulting engineer working on LAN drivers in the VMS Development Group. Dick joined Digital in 1978 and performed diagnostic testing for 36-bit systems. He graduated from Worcester Polytechnic Institute in 1973 with a B.S. (magna cum laude) in computer science and a minor in electrical engineering. He is a member of Tau Beta Pi.

Judy B. Weiss  Judy Weiss contributed to the design, implementation, debug, and performance analysis of the DEMNA Ethernet adapter as a member of the firmware team. She is a senior engineer working in the Data Center Systems and Servers Group as a gate array designer. Concurrently with the DEMNA, she worked on the firmware for the DEBNI adapter. Judy joined Digital in 1986 after receiving her B.S. (magna cum laude) in computer engineering from Boston University. She is a member of Tau Beta Pi and the Society of Women Engineers.

Foreword

Howard H. Hayakawa, Manager, VMS I/O and Cluster Development
George S. Hoff, Group Manager, High-availability Systems

Beginning as a vision for a highly available and expandable computing environment, Digital's VAXcluster system is today recognized across the industry as the premiere foundation for creating high-availability applications. The large number of VAXcluster sites and the range of their use testifies to the wide appeal of the capabilities of VAXcluster systems. Over 11,000 VAXcluster sites based on Digital's Computer Interconnect (CI) are being used in such diverse applications as manufacturing operations, banking, and telephone information systems. Sites based on the Digital Storage System Interconnect (DSSI) and Ethernet are even more numerous. A scan of software licenses shows an amazing acceptance of VAXcluster technology: more than 200,000 VAXcluster licenses have been sold to date.

Built from standard processors and a general-purpose operating system, a VAXcluster system is a loosely coupled, highly integrated configuration of VAX VMS processors and storage systems that operates as a single system. Significantly, VAXcluster systems are so well integrated that users are often not aware they are using a distributed system. In addition to the benefits of tight integration, these configurations provide Digital's customers with the flexibility to easily expand and with the features needed for high-availability applications.

Started in 1984, VAXcluster systems were limited to specialized, proprietary interconnects and storage servers, which restricted them to the confines of a single computer room. In 1989, the cluster system was extended to support both industry-standard SCSI (small computer systems interface) storage and Digital's DSSI storage interconnect. Today, VAXcluster systems support a wide range of communication interconnects, including CI and DSSI, and industry-standard local area networks such as Ethernet and FDDI. Storage systems now supported cover the spectrum from standard, economical SCSI peripherals to high-performance RA-series drives for large configurations. This well-architected system has allowed for expansion across an ever wider geography: from room to building to multiple buildings. Moreover, the entire range of VAX processors, from VAXstation workstations to VAX 9000 mainframes, is supported. The tight integration, flexibility, and power of today's VAXcluster systems is unparalleled.

The VAXcluster architecture which Digital initiated in the 1980s continues to encompass new advances and innovative technologies that ensure data availability and integrity. This issue of the Digital Technical Journal presents several new VMS VAXcluster products and features, and complementary developments in the areas of network adapters and performance. One of the products described is VMS Volume Shadowing Phase II, which permits users to place redundant data on separate storage devices where most appropriate within the system, thus dramatically increasing the availability potential of VAXcluster systems. A paper on multi-rail local area VAXclusters shows how customers are now able to add parallel LAN connections to increase network capacity and to survive failure of a network connection. With shadowing and multiple communication paths, recovery from site failure need no longer incur the delays associated with restoration from archives.

Just as the VAXcluster software was able to exploit the Ethernet to extend capabilities throughout a building, it is now able to exploit the high performance and extent of an FDDI LAN.


The new industry-standard FDDI LAN allows the VAXcluster software to extend the system's range by a factor of 1,000. Papers on both an Ethernet adapter and an FDDI adapter describe the care taken to ensure that adapter performance matches that of the target processor, which is one of the keys to achieving maximum performance in the overall VAXcluster system. Performance of the FDDI LAN itself is also one of the topics included here. FDDI's performance and range permit for the first time the ability to create integrated, high-availability solutions that span multiple buildings. With combined FDDI and VMS VAXcluster technology, a bank's VAXcluster system can extend from a computer center in Manhattan to a standby center in New Jersey. Should Manhattan lose power, a disaster team can bring the bank's application into operation in New Jersey after only minutes. The days of waiting for archives or driving tapes and disks across the river are over. Digital's VAX VMS, clusters, FDDI, and networking products continue to evolve; the process of integrating new technologies is ongoing. The papers in this issue describe the latest steps we have taken to extend the range and availability of VAXcluster systems. Future issues of the Journal will keep you apprised of the latest stages in this evolutionary process.

Scott H. Davis

Design of VMS Volume Shadowing Phase II-Host-based Shadowing

VMS Volume Shadowing Phase II is a fully distributed, clusterwide data availability product designed to replace the obsolete controller-based shadowing implementation. Phase II is intended to service current and future generations of storage architectures. In these architectures, there is no intelligent, multiunit controller that functions as a centralized gateway to the multiple drives in the shadow set. The new software makes many additional topologies suitable for shadowing, including DSSI drives, DSA drives, and shadowing across VMS MSCP servers. This last configuration allows shadow set members to be separated by any supported cluster interconnect, including FDDI. All essential shadowing functions are performed within the VMS operating system. New MSCP controllers and drives can optionally implement a set of shadowing performance assists, which Digital intends to support in a future release of the shadowing product.

Overview

Volume shadowing is a technique that provides data availability to computer systems by protecting against data loss from media deterioration, communication path failures, and controller or device failures. The process of volume shadowing entails maintaining multiple copies of the same data on two or more physical volumes. Up to three physical devices are bound together by the volume shadowing software and present a virtual device to the system. This device is referred to as a shadow set or a virtual unit. The volume shadowing software replicates data across the physical devices. All shadowing mechanisms are hidden from the users of the system, i.e., applications access the virtual unit as if it were a standard, physical disk. Figure 1 shows a VMS Volume Shadowing Phase II set for a Digital Storage Systems Interconnect (DSSI) configuration of two VAX host computers.

Product Goals

The VMS host-based shadowing project was undertaken because the original controller shadowing product is architecturally incompatible with many prospective storage devices and their connectivity requirements. Controller shadowing requires an intelligent, common controller to access all physical devices in a shadow set. Devices such as the RF-series integrated storage elements (ISEs) with DSSI adapters and the RZ-series small computer systems interface (SCSI) disks present configurations that conflict with this method of access. To support the range of configurations required by our customers, the new product had to be capable of shadowing physical devices located anywhere within a VAXcluster system and of doing so in a controller-independent fashion. The VAXcluster I/O system provides parallel access to storage devices from all nodes in a cluster simultaneously. In order to meet its performance goals, our shadowing product had to preserve this semantic also. Figure 2 shows clusterwide shadow sets for a hierarchical storage controller (HSC) configuration with multiple computer interconnect (CI) buses. When compared to Figure 1, this figure shows a larger cluster containing several clusterwide shadow sets. Note that multiple nodes in the cluster have direct, writable access to the disks comprising the shadow sets.

In addition to providing highly available access to shadow sets from anywhere in a cluster, the new shadowing implementation had other requirements. Phase II had to deliver performance comparable to that of controller-based shadowing, maximize application I/O availability, and ensure data integrity for critical applications.
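The shadow set abstraction described in the Overview can be pictured as a small data structure. The following C sketch is illustrative only; the names (shadow_set, member, MAX_MEMBERS) are invented here and are not VMS definitions.

    #define MAX_MEMBERS 3                /* "up to three physical devices" */

    struct member {
        int device_id;                   /* a physical disk, e.g., $7$DIA7 */
        int is_current;                  /* nonzero once fully caught up */
    };

    struct shadow_set {
        int           virtual_unit;     /* the device applications see, e.g., DSA1 */
        int           nmembers;         /* 1..MAX_MEMBERS */
        struct member members[MAX_MEMBERS];
    };

Applications address only the virtual unit; the replication across the members array is invisible to them.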


Figure 1  Phase II Shadow Set for a Dual-host DSSI Configuration

In designing the new product, we benefited from customer feedback about the existing implementation. This feedback had a positive impact on the design of the host-based shadowing implementation. Our goals to maximize application I/O availability during transient states, to provide customizable, event-driven design and fail-over, to enable all cluster nodes to manage the shadow sets, and to enhance system disk capabilities were all affected by customer feedback.

Technical Challenges

To provide volume shadowing in a VAXcluster environment running under the VMS operating system required that we solve complex, distributed systems problems.¹ This section describes the most significant technical challenges we encountered and the solutions we arrived at during the design and development of the product.

Membership Consistency  To ensure the level of integrity required for high availability systems, the shadowing design must guarantee that a shadow set has the same membership and states on all nodes in the cluster. A simple way to guarantee this property would have been a strict client-server implementation, where one VAX computer serves the shadow set to the remainder of the cluster. This approach, however, would have violated several design goals; the intermediate hop required by data transfers would decrease system performance, and any failure of the serving CPU would require a lengthy fail-over and rebuild operation, thus negatively impacting system availability.

To solve the problem of membership consistency, we used the VMS distributed lock manager through a new executive thread-level interface.²,³ We designed a set of event-driven protocols that shadowing uses to guarantee membership consistency. These protocols allowed us to make the shadow set virtual unit a local device on all nodes in the cluster. Membership and state information about the shadow set is stored on all physical members in an on-disk data structure called the storage control block (SCB). One way that shadowing uses this SCB information is to automatically determine the most up-to-date shadow set member(s) when the set is created. In addition to distributed synchronization primitives, the VMS lock manager provides a capability for managing a distributed state variable called a lock value block. Shadowing uses the lock value block to define a disk that is guaranteed to be a current member of the shadow set. Whenever a membership change is made, all nodes take part in a protocol of lock operations; the value block and the on-disk SCB are the final arbiters of set constituency.
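A sketch of this membership-resolution idea follows. All functions and types here (read_value_block, read_scb, struct scb) are hypothetical stand-ins for the lock manager and driver interfaces; only the flow (the value block names a trusted disk, and that disk's SCB decides membership) comes from the text.

    struct scb {                      /* on-disk storage control block */
        unsigned generation;          /* advanced on each membership change */
        int      members[3];
        int      nmembers;
    };

    extern int read_value_block(int *trusted_disk);   /* hypothetical */
    extern int read_scb(int disk, struct scb *out);   /* hypothetical */

    /* Resolve the authoritative shadow set membership on this node. */
    int resolve_membership(struct scb *out)
    {
        int trusted_disk;

        /* The lock value block names a disk guaranteed to be a
         * current member of the shadow set. */
        if (read_value_block(&trusted_disk) != 0)
            return -1;

        /* That member's on-disk SCB is the final arbiter; adopt its
         * view of the set in the local data structures. */
        return read_scb(trusted_disk, out);
    }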


Sequential Commands  A sequential I/O command, i.e., a Mass Storage Control Protocol (MSCP) concept, forces all commands in progress to complete before the sequential command begins execution. While a sequential command is pending, all new I/O requests are stalled until that sequential command completes execution. Shadowing requires the capability to execute a clusterwide, sequential command during certain operations. This capability, although a simple design goal for a client-server implementation, is a complex one for a distributed access model. We chose an event-driven, request/response protocol to create the sequential command capability.

Since sequential commands have a negative impact on performance, we limited the use of these commands to performing membership changes, mount/dismount operations, and bad block and merge difference repairs. Steady state processing never requires using sequential commands.
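The stall-and-flush behavior just described can be outlined as follows. Every routine named here is a hypothetical placeholder rather than a VMS entry point, and the real protocol is driven by lock operations rather than direct calls; this is only a sketch of the sequence.

    extern void stall_new_io(void);          /* queue newly arriving I/O */
    extern void drain_io_in_progress(void);  /* wait for active I/O to finish */
    extern void request_remote_stall(void);  /* lock-based request to peers */
    extern void wait_for_peer_responses(void);
    extern void release_locks_and_resume(void);

    void execute_sequential_command(void (*cmd)(void))
    {
        stall_new_io();              /* no new I/O may start locally */
        drain_io_in_progress();      /* flush everything already active */

        request_remote_stall();      /* peers stall and flush, then respond */
        wait_for_peer_responses();   /* or note that a peer left the cluster */

        cmd();                       /* e.g., a membership change or repair */

        release_locks_and_resume();  /* peers resume I/O asynchronously */
    }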

Figure 2  Clusterwide Shadow Sets for an HSC Configuration with Multiple CI Buses

Full Copy  A full copy is the means by which a new member of the shadow set is made current with the rest of the set. The challenge is to make copy operations unintrusive; application I/Os must proceed with minimal impact so that the level of service provided by the system is both acceptable and predictable. VMS file I/O provides record-level sharing through the application-transparent locking provided by the VAX RMS software, Digital's record management services. Shadowing operates at the physical device level to handle a variety of low-level errors. Because shadowing has no knowledge of the higher-layer record locking, a copy operation must guarantee that the application I/Os and the copy operation itself generate the correct results and do so with minimal impact on application I/O performance.

Merge Operations  Merge operations are triggered when a CPU with write access to a shadow set fails. (Note that with controller shadowing, merge operations are copy operations that are triggered when an HSC fails.) Devices may still be valid members of the shadow set but may no longer be identical, due to outstanding writes in progress when the host CPU failed.


The merge operation must detect and correct these differences, so that successive application reads for the same data produce consistent results. As for full copy operations, the challenge with merge processing is to generate consistent results with minimal impact on application I/O performance.

Booting and Crashing  System disk shadowing presents some special problems because the shadow set must be accessible to CPUs in the cluster when locking protocols and inter-CPU communication are disabled. In addition, crashing must ensure appropriate behavior for writing crash dumps through the primitive bootstrap driver, including how to propagate the dump to the shadow set. It was not practical to modify the bootstrap drivers because they are stored in read-only memory (ROM) on various CPU platforms that shadowing would support.

Error Processing  One major function of volume shadowing is to perform appropriate error processing for members of the shadow set, while maximizing data availability. To carry out this function, the software must prevent deadlocks between nodes and decide when to remove devices from the shadow set. We adopted a simple recovery ethic: a node that detects an error is responsible for fixing that error. Membership changes are serialized in the cluster, and a node only makes a membership change if the change is accompanied by improved access to the shadow set. A node never makes a change in membership without having access to some source members of the set.

Architecture

Phase II shadowing provides a local virtual unit on each node in the cluster with distributed control of that unit. Although the virtual unit is not served to the cluster, the underlying physical units that constitute a shadow set are served to the cluster using standard VMS mechanisms. This scheme has many data availability advantages. The Phase II design:

• Allows shadowing to use all the VMS controller fail-over mechanisms for physical devices. As a result, member fail-over approaches hardware speeds. Controller shadowing does not provide this capability.

• Allows each node in the cluster to perform error recovery based on access to physical data source members. The shadowing software treats communication failures between any cluster node and shadow set members as normal shadowing events with customer-definable recovery metrics.

Major Components

VMS Volume Shadowing Phase II consists of two major components: SHDRIVER and SHADOW_SERVER. SHDRIVER is the shadowing virtual unit driver. As a client of disk class drivers, SHDRIVER is responsible for handling all I/O operations that are directed to the virtual unit. SHDRIVER issues physical I/O operations to the disk class driver to satisfy the shadow set virtual unit I/O requests. SHDRIVER is also responsible for performing all distributed locking and for driving error recovery.

SHADOW_SERVER is a VMS ancillary control process (ACP) responsible for driving copy and merge operations performed on the local node. Only one optimal node is responsible for driving a copy or merge operation on a given shadow set, but when a failure occurs the operation will fail over and resume on another CPU. Several factors determine this optimal node, including the types of access paths and controllers for the members, and user-settable, per-node copy quotas.
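As an illustration of the copy-server choice just described, the sketch below ranks candidate nodes by the listed factors. The weighting is invented for the example; the paper does not publish SHADOW_SERVER's actual policy.

    struct node_info {
        int local_paths;     /* members reachable without MSCP serving */
        int copies_active;   /* copy/merge threads already running here */
        int copy_quota;      /* user-settable per-node limit */
    };

    /* Lower score is better; a negative result marks an ineligible node. */
    int copy_server_score(const struct node_info *n)
    {
        if (n->copies_active >= n->copy_quota)
            return -1;                          /* over quota: ineligible */
        /* Prefer nodes with more direct access paths, then the least
         * loaded node. The 10x weighting is arbitrary, for illustration. */
        return (3 - n->local_paths) * 10 + n->copies_active;
    }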


Primitives

This section describes the locking protocols and error recovery processing functions that are used by the shadowing software. These primitives provide basic synchronization and recovery mechanisms for shadow sets in a VAXcluster system.

Locking Protocols  The shadowing software uses event-driven locking protocols to coordinate clusterwide activity. These request/response protocols provide maximum application I/O performance. A VMS executive interface to the distributed lock manager allows shadowing to make efficient use of locking directly from SHDRIVER.

One example of this use of locking protocols in VMS Volume Shadowing Phase II is the sequential command protocol. As mentioned in the Technical Challenges section, shadowing requires the sequential command capability but minimizes the use of this primitive. Phase II implements the capability by using several locks, as described in the following series of events.

A node that needs to execute a sequential command first stalls I/O locally and flushes operations in progress. The node then performs lock operations that ensure serialization and sends sequential requests to other nodes that have the shadow set mounted. This initiating thread waits until all other nodes in the cluster have flushed their I/Os and responded to the node requesting the sequential operation. Once all nodes have responded or left the cluster, the operations that compose the sequential command execute. When this process is complete, the locks are released, allowing asynchronous threads on the other nodes to proceed and automatically resume I/O operations. The local node resumes I/O as well.

Error Recovery Processing  Error recovery processing is triggered by either asynchronous notification of a communication failure or a failing I/O operation directed towards a physical member of the shadow set. Two major functions of error recovery are built into the virtual unit driver: active and passive volume processing.

Active volume processing is triggered directly by events that occur on a local node in the cluster. This type of volume processing uses a simple, localized ethic for error recovery from communication or controller failures. Shadow set membership decisions are made locally, based on accessibility. If no members of a shadow set are currently accessible from a node, then the membership does not change. If some but not all members of the set are accessible, the local node, after attempting fail-over, removes some members to allow application I/O to proceed. The system manager sets the time period during which members may attempt fail-over. The actual removal operation is a sequential command. The design allows for maximum flexibility and quick error recovery and implicitly avoids deadlock scenarios.

Passive volume processing responds to events that occur elsewhere in the cluster; messages from nodes other than the local one trigger the processing by means of the shadowing distributed locking protocols. This volume processing function is responsible for verifying the shadow set membership and state on the local node and for modifying this membership to reflect any changes made to the set by the cluster. To accomplish these operations, the shadowing software first reads the lock value block to find a disk guaranteed to still be in the shadow set. Then the recovery process retrieves the physical member's on-disk SCB data and uses this information to perform the relevant data structure updates on the local node.

Application I/O requests to the virtual unit are always stalled during volume processing. In the case of active volume processing, the stalling is necessary because many I/Os would fail until the error was corrected. In passive volume processing, the I/O requests are stalled because the membership of the set is in doubt, and correct processing of the request cannot be performed until the situation is corrected.

Steady State Processing

The shadowing virtual unit driver receives application read and write requests and must direct the I/O appropriately. This section describes these steady state operations.

Read Algorithms

The shadowing virtual unit driver receives application read requests and directs a physical I/O to an appropriate member of the set. SHDRIVER attempts to direct the I/O to the optimum device based on locally available data. This decision is based on (1) the access path, i.e., local or served by the VMS operating system, (2) the service queue lengths at the candidate controller, and (3) a round-robin algorithm among equal paths. Figure 3 shows a shadow set read operation. An application read to the shadow set causes a single physical read to be sent to an optimal member of the set. In Figure 3, there is one local and one remote member, so the read is sent to the local member.

Figure 3  Shadow Set Read Operation

Data repair operations caused by media defects are triggered by a read operation failing with an appropriate error, such as forced error or parity. The shadowing driver attempts this repair using another member of the shadow set.
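The three-factor read decision described above can be expressed compactly. The structures below are illustrative rather than SHDRIVER's, and the final rotation is a simple stand-in for the round-robin step.

    struct path {
        int is_local;    /* 1 = direct path, 0 = MSCP-served */
        int queue_len;   /* service queue length at the controller */
    };

    /* Returns the index of the member to read from. *rr is a per-set
     * counter used to rotate among paths that compare equal. */
    int pick_read_member(const struct path *p, int n, unsigned *rr)
    {
        int best = 0;
        int ties = 1;
        for (int i = 1; i < n; i++) {
            if (p[i].is_local != p[best].is_local) {
                if (p[i].is_local) { best = i; ties = 1; }   /* local wins */
            } else if (p[i].queue_len < p[best].queue_len) {
                best = i; ties = 1;                          /* shorter queue */
            } else if (p[i].queue_len == p[best].queue_len) {
                ties++;                                      /* equal paths:  */
                if ((*rr)++ % ties == 0) best = i;           /* rotate choice */
            }
        }
        return best;
    }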


This repair operation is performed with the synchronization of a sequential command. Sequential protection is required because a read operation is being converted into a write operation without explicit, RMS-layer synchronization.

Write Algorithms

The shadowing virtual unit driver receives application write requests and then issues, in parallel, write requests to the physical members of the set. The virtual unit write operation does not complete until all physical writes complete. A shadow set write operation is shown in Figure 4. Physical write operations to member units can fail or be timed out; either condition triggers the shadowing error recovery logic and can cause a fail-over or the removal of the erring device from the shadow set.

Figure 4  Shadow Set Write Operation

Transient State Processing

Shadowing performs a variety of operations in order to maintain consistency among the members of the set. These operations include full copy, merge, and data repair and recovery. This section describes these transient state operations.

Full Copy

Full copy operations are performed under direct system manager control. When a disk is added to the shadow set, copy operations take place to make the contents of this new set member identical to that of the other members. Copy operations are transparent to application processing. The new member of the shadow set does not provide any data availability protection until the copy completes.

There is no explicit gatekeeping during the copy operation. Thus, application read and write operations occur in parallel with copy thread reads and writes. As shown in Figure 5, correct results are accomplished by the following algorithm. During the full copy, the shadowing driver processes application write operations in two groups: first, those directed to all source members and second, writes to all full copy targets. The copy thread performs a sequence of read source, compare target, and write target operations on each logical block number (LBN) range until the compare operation succeeds. If an LBN range has such frequent activity that the compare fails many times, SHDRIVER performs a synchronized update. A distributed fence provides a clusterwide boundary between the copied and the uncopied areas of the new member. This fence is used to avoid performing the special full copy mechanisms on application writes to that area of the disk already processed by the copy thread.

This algorithm meets the goal of operational correctness (both the application and the copy thread achieve the proper results with regard to the contents of the shadow set members) and requires no synchronization with the copy thread. Thus, the algorithm achieves maximum application I/O availability during the transient state. Crucial to achieving this goal is the fact that, by design, the copy thread does not perform I/O optimization techniques such as double buffering. The copy operations receive equal service as application I/Os.
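A sketch of the copy thread's cycle for one LBN range appears below. The I/O helpers and the retry bound MAX_TRIES are hypothetical; the read source, compare target, write target sequence and the synchronized-update fallback follow the text.

    extern int  read_blocks (int disk, long lbn, int n, void *buf);
    extern int  write_blocks(int disk, long lbn, int n, const void *buf);
    extern int  blocks_equal(int disk, long lbn, int n, const void *buf);
    extern void synchronized_update(long lbn, int n);   /* rare slow path */

    #define MAX_TRIES 8    /* illustrative bound before forcing a sync */

    void copy_range(int source, int target, long lbn, int n, void *buf)
    {
        for (int tries = 0; tries < MAX_TRIES; tries++) {
            read_blocks(source, lbn, n, buf);        /* read source    */
            if (blocks_equal(target, lbn, n, buf))   /* compare target */
                return;                              /* range is identical */
            write_blocks(target, lbn, n, buf);       /* write target   */
        }
        /* A hot range kept changing underneath the copy; fall back to
         * the synchronized update mentioned in the text. */
        synchronized_update(lbn, n);
    }

After each range completes, the distributed fence is advanced so that application writes behind it no longer need the special full-copy handling.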


Figure 5  Full Copy Algorithm
Note: No synchronization exists between the application and copy operations. I/Os can occur in parallel on different nodes in the cluster. Regardless of how the operations overlap, the correct data is copied to the target.

Merge Operations

The VMS Volume Shadowing Phase II merge algorithm meets the product goals of operational correctness, while maintaining high application I/O availability and minimal synchronization. A merge operation is required when a CPU crashes with the shadow set mounted for write operations. A merge is needed to correct for the possibility of partially completed writes that may have been outstanding to the physical set members when the failure occurred. The merge operation ensures that all members contain identical data, and thus the shadow set virtual unit behaves like a single, highly available disk. It does not matter which data is more recent, only that the members are the same. This satisfies the purpose of shadowing, which is to provide data availability. But since the failure occurred while a write operation was in progress, this consistent shadow set can contain either old or new data. To make sure that the shadow set contains the most recent data, a data integrity technique such as journaling must be employed.

In Phase II shadowing, merge processing is distinctly different from copy processing. The shadow set provides full availability protection during the merge. As a result, merge processing is intentionally designed to be a background activity and to maximize application I/O throughput while the merge is progressing. The merge thread carefully monitors I/O rates and inserts a delay between its I/Os if it detects contention for shared system resources, such as adapters and interconnects.

In addition to maximizing I/O availability, the merge algorithm is designed to minimize synchronization with application I/Os and to identify and correct data inconsistencies. Synchronization takes place only when a rare difference is found. When an application read operation is issued to a shadow set in the merge state, the set executes the read with merge semantics. Thus, a read to a source and a parallel compare with the other members of the set are performed. Usually the compare matches and the operation is complete. If a mismatch is detected, a sequential repair operation begins. The merge thread scans the entire disk in the same manner as the read, looking for differences. A distributed fence is used to avoid performing merge mechanisms for application reads to that area of the disk already processed by the merge thread. Figure 6 illustrates the merge algorithm.

Note that controller shadowing performs an operation called a merge copy. Although this HSC merge copy operation is designed for the same purpose as the Phase II operation, the approaches differ greatly. An HSC merge copy is triggered when an HSC, not a shadow set, fails and performs a copy operation; the HSC merge copy does not detect differences.
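A read executed with merge semantics might look like the following sketch, with hypothetical I/O helpers standing in for the driver interfaces:

    extern int  read_blocks (int disk, long lbn, int n, void *buf);
    extern int  blocks_equal(int disk, long lbn, int n, const void *buf);
    extern void sequential_repair(long lbn, int n);

    int merged_read(const int *members, int nmembers,
                    long lbn, int n, void *buf)
    {
        read_blocks(members[0], lbn, n, buf);      /* read a source copy */

        /* Compare against the remaining members; a mismatch is the
         * rare case that triggers a synchronized repair. */
        for (int i = 1; i < nmembers; i++) {
            if (!blocks_equal(members[i], lbn, n, buf)) {
                sequential_repair(lbn, n);
                break;
            }
        }
        return 0;                                  /* data already in buf */
    }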


Figure 6  Merge Algorithm
Note: Infrequent synchronization exists between the application and merge operations. I/Os can occur in parallel on different nodes in the cluster. Regardless of how the operations overlap, data integrity is preserved.

Performance Assists

A future version of the shadowing product is intended to utilize controller performance assists to improve copy and merge operations. These assists will be used automatically, if supported by the controllers involved in accessing the physical members of a shadow set.

COPY_DATA is the ability of a host to control a direct disk-to-disk transfer without the data entering or leaving the host CPU I/O adapters and memory. This capability will be used by full copy processing to decrease the system impact, the bandwidth, and the time required for a full copy. The members of the set and/or their controllers must share a common interconnect in order to use this capability. The COPY_DATA operation performs specific shadowing around the active copy LBN range to ensure correctness. This operation involves LBN range-based gatekeeping in the copy target device controller.

Controller write logging is a future capability in controllers, such as HSCs, that will allow more efficient merge processing. Shadowing write operation messages will include information for the controller to log I/Os in its memory. These logs will then be used by the remaining host CPUs during merge processing to determine exactly which blocks contain outstanding write operations from a failed CPU. With such a performance assist, merge operations will take less time and will have less impact on the system.

Data Repair and Recovery

As discussed in the Primitives section, data repair operations are triggered by failing reads and are repaired as sequential commands. Digital Storage Architecture (DSA) devices support two primitive capabilities that are key to this repair mechanism. When a DSA controller detects a media error, the block in question is sometimes repaired automatically, thus requiring no shadowing intervention. When the controller cannot repair the data, a spare block is revectored to this LBN, and the contents of the block are marked with a forced error. This causes subsequent read operations to fail, since the contents of the block are lost.

The forced error returned on a read operation is the signal to the shadowing software to execute a repair operation. SHDRIVER attempts to read usable data from another source device. If such data is available, the software writes the data to the revectored block and then returns the data to the application. If no usable data source is available, the software performs write operations with a forced error to all set members and signals the application that this error condition has occurred. Note that a protected system buffer is used for this operation because the application reading the data may not have write access.

A future shadowing product is intended to support SCSI peripherals, which do not have the DSA primitives outlined above. There is no forced error indicator in the SCSI architecture, and the revector operation is nonatomic. To perform shadowing data repair on such devices, we will use the READL/WRITEL capability optionally supported by SCSI devices. These I/O functions allow blocks to be read and written with error correction code (ECC) data. Shadowing emulates forced error by writing data with an intentionally incorrect ECC. To circumvent the lack of atomicity on the revector operation, a device being repaired is temporarily marked as a full copy target until the conclusion of the repair operation. If the CPU fails in the middle of a repair operation, the repair target is now a full copy target, which preserves correctness in the presence of these nonatomic operations.
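The repair sequence triggered by a forced error can be sketched as follows; the helpers are invented names, not SHDRIVER routines, and only the try-each-source, write-back, otherwise-propagate flow comes from the text.

    extern int  read_blocks (int disk, long lbn, int n, void *buf);
    extern int  write_blocks(int disk, long lbn, int n, const void *buf);
    extern void write_forced_error(int disk, long lbn, int n);

    int repair_forced_error(const int *members, int nmembers,
                            int bad, long lbn, int n, void *buf)
    {
        for (int i = 0; i < nmembers; i++) {
            if (i == bad)
                continue;
            if (read_blocks(members[i], lbn, n, buf) == 0) {
                write_blocks(members[bad], lbn, n, buf); /* heal the block */
                return 0;                                /* good data in buf */
            }
        }
        /* No usable source: mark every member and fail the read. */
        for (int i = 0; i < nmembers; i++)
            write_forced_error(members[i], lbn, n);
        return -1;
    }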


System Disk

System disk shadow sets presented some unique design problems. The system disk must be accessed through a single bootstrap driver and hence, a single controller type. This access takes place when multihost synchronization is not possible. These two access modes occur during system bootstrap and during a crash dump write.

Shadowed Booting

The system disk must be accessed by the system initialization code executing on the booting node prior to any host-to-host communication. Since the boot drivers on many processors reside in ROM, it was impractical to make boot driver modifications to support system disk processing. To solve this problem, the system disk operations performed prior to the controller initialization routine of the system device driver are read-only. It is safe to read data from a clusterwide, shared device without synchronization when there is little or no risk of the data being modified by another node in the cluster. At controller initialization time, shadowing builds a read-only shadow set that contains only the boot member. Once locking is enabled, shadowing performs a variety of checks on the system disk shadow set to determine whether or not the boot is valid. If the boot is valid, shadowing turns the single-member, read-only set into a multimember, writable set with preserved copy states. If this node is joining an existing cluster, the system disk shadow set uses the same set as the rest of the cluster.

Crash Dumps

The primitive boot driver uses the system disk to write crash dumps when a system failure occurs. This driver only knows how to access a single physical disk in the shadow set. But since a failing node automatically triggers a merge operation on shadow sets mounted for write, we can use the merge thread to process the dump file. The merge occurs either when the node leaves the cluster (if there are other nodes present) or later, when the set is reformed. As the source for merge difference repairs, the merge process attempts to use the member to which the dump file was written and propagates the dump file to the remainder of the set. The mechanism here for dump file propagation is best-effort, not guaranteed; but since writing the dump is always best-effort, this solution is considered acceptable.

Conclusion

VMS Volume Shadowing Phase II is a state-of-the-art implementation of distributed data availability. The project team arrived at innovative solutions to problems attributable to a set of complex, conflicting goals. Digital has applied for four patents on various aspects of this technology.

Acknowledgments

I would like to acknowledge the efforts and contributions of the other members of the VMS shadowing engineering team: Renee Culver, William Goleman, and Wai Yim. In addition, I would also like to acknowledge Sandy Snaman for Fork Thread Locking, Ravindran Jagannathan for performance analysis, and David Thiel for general consulting.

References

1. N. Kronenberg, H. Levy, W. Strecker, and R. Merewood, "The VAXcluster Concept: An Overview of a Distributed System," Digital Technical Journal (September 1987): 7-21.

2. W. Snaman, Jr. and D. Thiel, "The VAX/VMS Distributed Lock Manager," Digital Technical Journal (September 1987): 29-44.

3. W. Snaman, Jr., "Application Design in a VAXcluster System," Digital Technical Journal, vol. 3, no. 3 (Summer 1991, this issue): 16-26.

William E. Snaman, Jr.

Application Design in a VAXcluster System

VAXcluster systems provide a flexible way to configure a computing system that can survive the failure of any component. In addition, these systems can grow with an organization and can be serviced without disruption to applications. These features make VAXcluster systems an ideal base for developing high-availability applications such as transaction processing systems, servers for network client-server applications, and data sharing applications. Understanding the basic design of VAXcluster systems and the possible configuration options can help application designers take advantage of the availability and growth characteristics of these systems.

Many organizations depend on near constant access to data and computing resources; interruption of these services results in the interruption of primary business functions. In addition, growing organizations face the need to increase the amount of computing power available to them over an extended period of time. VAXcluster systems provide solutions to these data availability and growth problems that modern organizations face.¹

This paper begins with an overview of VAXcluster systems and application design in such systems and proceeds with a detailed discussion of VAXcluster design and implementation. The paper then focuses on how this information affects the design of applications that take advantage of the availability and growth characteristics of a VAXcluster system.

Overview of VAXcluster Systems

VAXcluster systems are loosely coupled multiprocessor configurations that allow the system designer to configure redundant hardware that can survive most types of equipment failures. These systems provide a way to add new processors and storage resources as required by the organization. This feature eliminates the need either to buy nonessential equipment initially or to experience painful upgrades and application conversions as the systems are outgrown.

The VMS operating system, which runs on each processor node in a VAXcluster system, provides a high level of transparent data sharing and independent failure characteristics. The processors interact to form a cooperating distributed operating system. In this system, all disks and their stored files are accessible from any processor as if those files were connected to a single processor. Files can be shared transparently at the record level by application software.

To provide the features of a VAXcluster system, the VMS operating system was enhanced to facilitate this data sharing and the dynamic adjustment to changes in the underlying hardware configuration. These enhancements make it possible to dynamically add multiple processors, storage controllers, disks, and tapes to a VAXcluster system configuration. Thus, an organization can purchase a small system initially and expand as needed. The addition of computing and storage resources to the existing configuration requires no software modifications or application conversions and can be accomplished without shutting down the system. The ability to use redundant devices virtually eliminates single points of failure.

Application Design in a VAXcluster Environment

Application design in a VAXcluster environment involves making some basic choices. These choices concern the type of application to be designed and the method used to synchronize the events that occur during the execution of the application. The designer must also consider application communication within a VAXcluster system. A discussion of these issues follows.

General Choices for Application Design

This section briefly describes the general choices available to application designers in the areas of client-server computing and data access.


Client-server Computing

The VAXcluster environment provides a fine base for client-server computing. Application designers can construct server applications that run on each node and accept requests from clients running on nodes in the VAXcluster system or elsewhere in a wider network. If the node running a server application fails, the clients of that server can switch to another server running on a surviving node. The new server can access the same data on disk or tape that was being accessed by the server that failed. In addition, the redundancy offered by the VMS Volume Shadowing Phase II software eliminates data unavailability in the event of a disk controller or media failure.[2] The system is thus very available from the perspective of the client applications.

Access to Storage Devices

Many application design questions involve how best to access the data stored on disk. One major advantage of the VAXcluster system design is that disk storage devices can be accessed from all nodes in an identical manner. The application designer can choose whether the access is simultaneous from multiple nodes or from one node at a time. Consequently, applications can be designed using either partitioned data access or shared data access.

Using a partitioned data model, the application designer can construct an application that limits data access to a single node or subset of the nodes. The application runs as a server on a single node and accepts requests from other nodes in the cluster and in the network. And because the application runs on a single node, there is no need to synchronize data access with other nodes. Eliminating this source of communication latencies can improve performance in many applications. Also, if synchronization is not required, the designer can make the best use of local buffer caches and can aggregate larger amounts of data for write operations, thus minimizing I/O activity.

An application that uses partitioned data access lends itself to many types of high-performance database and transaction processing environments. VAXcluster systems provide such an application with the advantage of having a storage medium that is available to all nodes even when they are not actively accessing the data files. Thus, if the server node fails, another server running on a surviving node can assume the work and be able to access the same files. For this type of application design, VAXcluster systems offer the performance advantages of a partitioned data model without the problems associated with the failure of a single server.

Using a shared data model, the application designer can create an application that runs simultaneously on multiple VAXcluster nodes, which naturally share data in a file. This type of application can prevent the bottlenecks associated with a single server and take advantage of opportunities for parallelism on multiple processors. The VAX RMS software can transparently share files between multiple nodes in a VAXcluster system. Also, Digital's database products, such as Rdb/VMS and VAX DBMS software, provide the same data-sharing capabilities. Servers running on multiple nodes of a VAXcluster system can accept requests from clients in the network and access the same files or databases. Because there are multiple servers, the application continues to function in the event that a single server node fails.

Application Synchronization Methods

The application designer must also consider how to synchronize events that take place on multiple nodes of a VAXcluster system. Two main methods can be used to accomplish this: the VMS lock manager and the DECdtm services that provide VMS transaction processing support.

VMS Lock Manager

The VMS lock manager provides services that are flexible enough to be used by cooperating processes for mutual exclusion, synchronization, and event notification.[3] An application uses these services either directly or indirectly through components of the system such as the VAX RMS software.

DECdtm Services

The VMS operating system provides a set of services to facilitate transaction processing.[4] These DECdtm services enable the application designer to implement atomic transactions either directly or indirectly. The services use a two-phase commit protocol. A transaction may span multiple nodes of a cluster or network. The support provided allows multiple resource managers, such as the VAX DBMS, Rdb/VMS, and VAX RMS software products, to be combined in a single transaction. The DECdtm transaction processing services take advantage of the guarantees against partitioning, the distributed lock manager, and the data availability features, all provided by VAXcluster systems.
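To make the lock manager option concrete, the following fragment sketches how a cooperating process might obtain clusterwide mutual exclusion through the $ENQW and $DEQ system services. It is an illustrative sketch only, not taken from this paper: the resource name MYAPP$MASTER_FILE is invented, and the header and argument details are recalled from the VMS C interface and should be checked against the system services documentation.

    /* Sketch: clusterwide mutual exclusion via the VMS distributed lock
       manager.  The resource name is hypothetical; all cooperating
       processes simply need to agree on it. */
    #include <descrip.h>            /* $DESCRIPTOR                       */
    #include <lckdef.h>             /* LCK$K_EXMODE lock mode            */
    #include <ssdef.h>              /* SS$_NORMAL status                 */
    #include <starlet.h>            /* sys$enqw, sys$deq                 */

    /* Lock status block: completion status plus the granted lock id.   */
    struct lksb { unsigned short status, reserved; unsigned int lock_id; };

    static int update_shared_resource(void)
    {
        struct lksb lksb = { 0 };
        unsigned int status;
        $DESCRIPTOR(resnam, "MYAPP$MASTER_FILE");

        /* Queue an exclusive-mode lock request and wait until granted.
           The name space is clusterwide, so this excludes processes on
           every node, not just the local one. */
        status = sys$enqw(0, LCK$K_EXMODE, &lksb, 0, &resnam,
                          0, 0, 0, 0, 0, 0, 0);
        if (status != SS$_NORMAL || lksb.status != SS$_NORMAL)
            return 0;               /* request failed or was not granted */

        /* ... read, modify, and write the shared data here ... */

        (void)sys$deq(lksb.lock_id, 0, 0, 0);   /* release the lock */
        return 1;
    }

    int main(void) { return update_shared_resource() ? 0 : 1; }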


VAXcluster and Networkwide Communication Services

Application communication between different processors in a VAXcluster system is generally accomplished using DECnet task-to-task communication services or other networking software such as the transmission control protocol (TCP) and the internet protocol (IP). Client-server applications or peer-to-peer applications are easy to develop with these services. The services allow processes to locate or start remote servers and then to exchange messages.

Since the individual nodes of a VAXcluster system exist as separate entities in a wider communication network, application communication inside a VAXcluster system can rely on general network interfaces. Thus, no special-purpose communication services were developed. Applications are simpler to design when they can communicate within the cluster in the same manner in which they communicate with nodes located outside the VAXcluster system.

A DECnet feature known as cluster alias provides a collective name for the nodes in a VAXcluster system. Application software can connect to a node in the cluster using the cluster alias name rather than a specific node name. This feature frees the application from keeping track of individual nodes in the VAXcluster system and results in design simplification and configuration flexibility.

VAXcluster Design and Implementation Details

To understand how the design and implementation of a VAXcluster system affects application design, one must be familiar with the basic architecture of such a system, as shown in Figure 1. This section describes the layers, which range from the communication mechanisms to the users of the system.

Port Layer

The port layer consists of the lowest levels of the architecture, including a choice of communication ports and physical paths (buses). The VAXcluster software requires at least one logical communication pathway between each pair of processor nodes in the VAXcluster system. Several of the ports utilize multiple physical communication paths, which appear as a single logical path to the VAXcluster software. This redundancy provides better communication throughput and higher availability. If multiple logical paths exist between a pair of nodes, the VAXcluster software generally selects one for active use and relies on the remaining paths for backup in the event of failure.

The port layer can contain any of the following interconnects:

• Computer Interconnect (CI) bus

• Ethernet

• Fiber distributed data interface (FDDI)

• Digital Storage Systems Interconnect (DSSI) bus

Each bus is accessed by a port (also called an adapter) that connects to the processor node. For example, the CI bus is accessed by way of a CI port. The various buses provide a wide spectrum of choices in terms of wire and adapter capacity, number of nodes that can be attached, distance between nodes, and cost.[5]

The CI bus was designed for access to storage and for reliable host-to-host communications. Each CI port connects to two redundant, high-speed physical paths. The CI port dynamically selects one of the two paths for each transmitted message. Messages are received on either path. Thus, two nodes can communicate on one path at the same time that two other nodes communicate on the other. If one physical path fails, the port simply uses the remaining path. The existence of the two physical paths is hidden from the software that uses the CI port services. From the standpoint of the cluster software, each port represents a single logical path to a remote node. Multiple CI ports can be used to provide multiple logical paths between pairs of nodes. An automatic load-sharing feature distributes the load between pairs of ports.

The DSSI bus was primarily designed for access to disk and tape storage. However, the bus has proven an excellent way to connect small numbers of processors using the VAXcluster protocols. Each DSSI port connects to a single high-speed physical path. As in the case of the CI bus, several DSSI ports may be connected to a node to provide redundant paths. (Note that the KFQSA DSSI port is for storage access only and provides no general communication service between nodes.)

Ethernet and FDDI are open local area networks, generally shared by a wide variety of consumers. Consequently, the VAXcluster software was designed to use the Ethernet and FDDI ports and buses simultaneously with the DECnet or TCP/IP protocols. This is accomplished by allowing the Ethernet data link software to control the hardware port.


This software provides a multiplexing function such that the cluster protocols are simply another user of a shared hardware resource.

Each Ethernet and FDDI port connects to a single physical path. There may be more than one port on each processor node. This means that there may be many separate paths between any pair of nodes when multiple ports are used. The port driver software combines the multiple Ethernet and FDDI paths into a single logical path between any pair of nodes. The load is automatically distributed among the various possible physical paths by an algorithm that chooses the best path in terms of adapter capacity and path latency.[6]

[Figure 1 is a block diagram of the VAXcluster system architecture. From top to bottom it shows the system applications (connection manager, message service, and the disk and tape class drivers), the systems communication services layer, the port layer software (FDDI, Ethernet, CI, and DSSI port drivers), and the hardware (FDDI, Ethernet, CI, and DSSI port interfaces, with local disks attached).]

Figure 1 VAXcluster System Architecture


System Communications Services Layer

The system communications services (SCS) layer of the VAXcluster architecture is implemented in a combination of hardware and software, or software only, depending upon the type of port. The SCS layer manages a logical path between each pair of nodes in the VAXcluster system. This logical path consists of a virtual circuit (VC) between each pair of SCS ports and a set of SCS connections that are multiplexed on that virtual circuit. The SCS provides basic connection management and communication services in the form of datagrams, messages, and block transfers over each logical path.

The datagram is a best-effort delivery service which offers no guarantees regarding loss, duplication, or ordering of datagram packets. This service requires no connection between the communicating nodes. In general, the VAXcluster software makes minimal use of the datagram service.

The message and block transfer services take place over an SCS connection. Consumers of SCS services communicate with their counterparts on remote nodes using these connections. Multiple connections are multiplexed on the logical path provided between each pair of nodes in the VAXcluster system.

The message service is reliable and guarantees that there will be no loss, duplication, or permutation of message sequence on a given connection. The connection will break rather than allow the consumer of the service to perceive such errors.

The block transfer service provides a way to transfer quantities of data directly from the memory of one node to that of another. For CI ports, the port hardware accomplishes the block transfer, thus freeing the host processor to perform other tasks. Some DSSI ports use hardware to copy data and others rely on software to perform this function. Depending on the exact model of an Ethernet or FDDI port, the port software, rather than the hardware, moves the data.

System Applications

The next higher layer in the VAXcluster architecture consists of multiple system applications (SYSAPs). These applications provide, for example, access to disks and tapes and cluster membership control. The following sections describe some SYSAPs.

Connection Manager

The connection manager serves three major functions. First, the connection manager knows which processor nodes are active members of the VAXcluster system and which are not. This is accomplished through a concept of cluster "membership." Nodes are explicitly added to and removed from the active set of nodes by a distributed software algorithm. In a VAXcluster system, every processor node must have an open SCS connection to all other processor nodes. Once a booting node establishes connections to all other nodes currently in the VAXcluster system, this node can request admission to the system. When one node is no longer able to communicate with another node, one of the two nodes must be removed from the VAXcluster system.

In a VAXcluster system, all nodes have a consistent view of the cluster membership in the presence of permanent and temporary communication failures. This consistency is accomplished by using a two-phase commit protocol to form the cluster, add new nodes, and remove failed nodes.

The second function provided by the connection manager is an extension of the SCS message service. This extension guarantees that the service will (1) deliver a message to a remote node or (2) remove either the sending node or the receiving node from the cluster. The strong notion of cluster membership provided by the connection manager makes this guarantee possible. The service attempts to deliver the queued messages to remote nodes. If a connection breaks, the service attempts to reestablish communication to the remote node and resend the message. After a period of time specified by the system manager, the service declares the connection irrevocably broken and removes either the sending or the receiving node from the VAXcluster membership. Thus, the service hides all temporary communication failures from its client.

This message service allows users to construct efficient protocols that do not require acknowledgment of messages. The service proved to be a very powerful tool in the design of the VMS lock manager. The delivery guarantees inherent in the service minimize the number of messages required to perform any given locking function, resulting in a corresponding increase in performance. The ability to hide failures by updating cluster membership further simplified the lock manager design and increased performance; this capability enabled the removal of logic used to handle changes in VAXcluster configurations and communication errors from all main lock manager code paths.

The third function of the connection manager is to prevent partitioning of the possible cluster members.
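The deliver-or-remove guarantee can be pictured as a bounded retry loop. The toy program below simulates only that shape; the names, the failure model, and the retry bound are invented for illustration and stand in for the real distributed algorithm inside the connection manager.

    /* Toy simulation of "deliver the message or remove a node."
       All names and the failure model are invented. */
    #include <stdbool.h>
    #include <stdio.h>

    static int failures_left = 2;            /* pretend the link drops twice */

    static bool scs_send(int node, const char *msg)
    {
        if (failures_left > 0) { failures_left--; return false; }
        printf("node %d <- \"%s\"\n", node, msg);
        return true;
    }

    static void remove_from_cluster(int node)
    {
        printf("node %d removed from cluster membership\n", node);
    }

    /* Retry for a manager-settable period (standing in for the reconnect
       interval), then force a membership change so no client ever sees a
       half-delivered message. */
    static void guaranteed_send(int node, const char *msg, int retry_limit)
    {
        for (int tries = 0; tries <= retry_limit; tries++)
            if (scs_send(node, msg))
                return;                      /* delivered */
        remove_from_cluster(node);           /* deliver-or-remove guarantee */
    }

    int main(void)
    {
        guaranteed_send(2, "lock grant", 5);
        return 0;
    }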


Partitioning of a system exists when separate processing elements function independently. If a system allows data sharing, completely independent processing can result in uncoordinated access to shared resources and lead to data corruption.

In a VAXcluster system, processors communicate and coordinate access to resources by means of a voting algorithm. The system manager assigns a number of votes to each processor node based on the importance of that node. The system manager also informs each node of the total number of possible votes. The algorithm requires that more than half of these votes be present in a VAXcluster system for nodes to function. When the sum of all votes contributed by the members of a VAXcluster system falls below this quorum, the VMS software blocks I/O to mounted devices and prevents the scheduling of processes. As nodes join the cluster, votes are added. Activity resumes once a quorum is reached.

In practice, the connection manager uses two measurements of the number of votes: static and dynamic. The static count of votes is the globally agreed on number of votes contributed by cluster members. This count is created ignoring the state of connections between nodes. The value of the static quorum changes only at the completion of two-phase commit operations, which accomplish a user-requested quorum adjustment in addition to performing the other activities mentioned earlier in this Connection Manager section.

Each node independently maintains the dynamic count. This count represents the sum of all votes contributed by VAXcluster members with which the tallying node has a functional connection. Changes in the dynamic quorum, and not the static quorum, initiate the blockage of process and I/O activity.

To provide configurations with a small number of nodes, e.g., two nodes, the concept of a quorum disk was invented. The system manager assigns a disk to contribute votes to the cluster. A node must be able to access a file on the disk in order to include the votes assigned to that disk in the node's own total. Consequently, a special algorithm is used to access the file. This algorithm ensures that two unrelated nodes cannot both count the quorum disk votes. Doing so could result in partitioned operation.

Mass Storage Control Protocol Server

The Mass Storage Control Protocol (MSCP) server allows disks that are attached to one or more VAX processors to be accessed by other processors in the VAXcluster system. Thus, a VAXcluster processor may emulate a multihost disk controller by accepting and processing I/O requests from other nodes and accessing the disk indicated by the request. The server can process multiple commands simultaneously and also performs fragmentation of commands if there is not enough system buffer space to accommodate the entire amount of data at one time.

Hierarchical Storage Controllers, Local Controllers, and RF-series Integrated Storage Elements

Hierarchical storage controller (HSC) servers are specialized devices that perform MSCP serving of RA-series disk drives and TA-series tape drives in a VAXcluster system. HSC servers connect directly to the CI bus. In addition to providing the host with access to the storage media, HSC servers accomplish performance optimizations such as seek-ordering and request fragmentation based on real-time head position information. The local disk controllers attached to the RA- and TA-series storage devices perform the same function for a single host processor. The RF-series integrated storage elements (ISEs) attach to a DSSI bus. Each of these disk storage devices performs its own command queuing and optimization without using a dedicated controller.

Disk Class Driver

The disk class driver allows access to disks served by an MSCP server, an HSC controller, a local Digital Storage Architecture (DSA) controller, or attached to a DSSI bus. This driver provides a command queuing function that allows a disk controller to have multiple outstanding commands, which can be used to provide seek, rotation, and other performance optimizations. To handle temporary communication interruptions, the driver restarts commands as needed.

VAXcluster systems can be configured so that all disks are accessed by way of redundant paths for increased availability. The way in which this is accomplished depends on the type of disk and the disk controller.

RF-series disks contain integrated controllers that connect to a single DSSI storage bus. This bus can be accessed by up to two VAX processors. Each VAX processor can then serve the disks to all other nodes in the VAXcluster system. Thus, two paths are provided to each disk.


RA-series disks connect to up to two storage controllers. These controllers can be either (1) local adapters attached directly to a single processor node or (2) HSC controllers located on the CI bus. Disks connected to local adapters can be served to other nodes of the VAXcluster system. Disks located on an HSC controller can also be accessed, through MSCP serving, by processors that are not on that bus. Thus, the use of multiple controllers, when combined with disk serving, provides at least two paths to a disk from every node in the VAXcluster system.

Since many paths exist to gain access to a disk, the disk class driver chooses which path to use when a disk is initially mounted by a node. If the path to the disk becomes inoperative, the disk class driver locates another path and begins to use it. Server load and type of path, i.e., local or remote, are considered when selecting the new path. This reconfiguration is totally transparent to the end user of the disk I/O service.

Tape Class Driver

The tape class driver performs functions in a VAXcluster system similar to those of the disk class driver by providing access to tapes located on HSC controllers, local controllers, and DSSI buses.

VMS Components Layered on Top of SYSAPs

The SYSAPs provide basic services that other VMS components use to provide a wide range of VAXcluster features.

Volume Shadowing

The volume shadowing product allows multiple disks to be utilized as a single, highly available disk. Volume shadowing provides transparent access to the data in the event of disk media or controller failures, media degradation, and communication failures.[2] The shadowing layer works in conjunction with the disk class driver to accomplish this task. With the advent of VMS Volume Shadowing Phase II, disk shadowing is extended to many new configurations.

Lock Manager

The VMS lock manager is a system service that provides a distributed synchronization function used by many components of the VMS operating system, including volume shadowing, the file system, VAX RMS software, and the batch/print system. Application programs can also use the lock manager directly.

The lock manager provides a name space that is truly clusterwide. Cooperating processes can request locks on a specific resource name. The lock manager either grants or denies these requests. Processes can also queue requests. The lock manager services allow processes to coordinate the means of access to physical resources or simply provide a communication pathway between processes. Processes can use the service for such tasks as mutual exclusion, event notification, and server failure detection.[7] The lock manager uses the communication service provided by the connection manager to minimize the message count for a given operation and to simplify the design by eliminating the need to consider changes in cluster membership from all main paths of operation.

Process Control Services

The VMS process control system services take advantage of VAXcluster systems. Applications can use these services to alter process states on remote nodes and to collect information about those processes. In the future, it is likely that other services will be extended to make optimal use of VAXcluster capabilities.

File System

The VMS file system (XQP) allows disk devices to be accessed by multiple nodes in a VAXcluster system. The file system uses the lock manager to coordinate disk space allocation, buffer caches, modification of file headers, and changes to the directory structure.[8]

Record Management Services

The VAX RMS software allows the sharing of file data by processes running on the same or multiple nodes. The software uses the lock manager to coordinate access to files, to record data within files, and to global buffers.

Batch/Print System

The batch/print system allows users to submit batch or print jobs on one node and run them on another. This system provides a form of load distribution, i.e., generic batch queues can feed executor queues on each node. Jobs running on a failed node can be restarted automatically on another node in the VAXcluster system.

An Application Constructed Using VAXcluster Mechanisms

The VMS software build process is an example of how these mechanisms can be used to benefit application design.


The VMS software build is broken down into various phases such as fetch sources, compile, and link. The phases must execute in a given order but are otherwise independent. Each phase can be restarted from the beginning if there is an error. Each major component of the VMS operating system is processed separately during each of the phases. All sources reside on a shared disk to which all nodes of the VAXcluster system have access; the output disk is shared by all nodes also. A master data file describes the phases and the components. For a given phase, the actions required for each component are fed into a generic batch queue. This queue feeds the jobs into work queues on multiple nodes, resulting in the execution of many jobs in parallel. When all jobs of a phase have completed, the next phase starts. If a node fails during the execution of a job, that job is restarted automatically on another node either from the beginning or from a checkpoint in the job. This use of shared disks and batch queues provides great parallelism and reliability in the VMS build process.

The Impact of VAXcluster Design and Implementation on Applications

This section discusses how multiple communication paths, membership changes, disk location and availability, controller selection, disk and tape path changes, and disk failure impact application design.

Multiple Communication Paths

VAXcluster software components are able to take advantage of multiple communication paths between nodes. For greatest availability, there should be at least two physical paths between each pair of nodes in a VAXcluster system.[5]

Membership Changes

VAXcluster membership changes involve several distinct phases with slight variations depending upon whether a node is being added or removed. Adding a node to a VAXcluster system is the simplest case because it involves only reconfiguration; there is a further simplification in that nodes are only added one at a time. A booting node petitions a member of an existing cluster for membership. This member then describes the booting node to all other member nodes and vice versa. In this way, it is determined that the booting node is in communication with all members of the cluster. The connection manager then adds the new node to the cluster using a two-phase commit protocol to ensure a consistent membership view from all nodes.

Removing a node is more complicated because both failure detection and reconfiguration must take place. In many cases, there may be multiple simultaneous failures of nodes and communication paths. The view of what nodes are members and which paths are functional may be very different from each node. Additionally, new failures may occur while the cluster is being reconfigured.

The initial phase involves the detection of a node failure. A node may cease processing, but other cluster members may not be aware of this fact. The communication components generally exchange messages periodically to determine whether other nodes are functioning. The first indication of a failure may be the lack of response to these messages. However, a minimum period of time must elapse before the connection is declared inoperative. This set delay prevents breaking connections when the network or remote system is unable to respond due to a heavy load. Once the communication failure is detected, the connection manager is notified by the SCS communication layer. The connection manager attempts to restore the connection for a time interval defined by the system manager using a system control parameter known as RECNXINTERVAL. Once this interval has expired, the connection, and hence the remote node, is declared inoperative. The connection manager then begins a reconfiguration.

Multiple nodes may attempt the reconfiguration at the same time. A distributed election algorithm is used to select a node to propose the new configuration. The elected node proposes to all other nodes that it can communicate with a new cluster configuration that consists of the "best" set of nodes that have connections between each other. "Best" is determined by the greatest number of possible votes. If multiple configurations are possible with the same number of votes, the configuration with the most nodes is selected.

Any node that receives the proposal and can describe a better cluster rejects the proposal. The proposing node then withdraws the proposal and the election process begins again. This cycle continues until all nodes accept the proposal. The cluster membership is then altered using a two-phase commit protocol, removing nodes as required.
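The "best" rule used during the election is easy to state in code. In this sketch, with invented names, a candidate configuration beats another if it carries more votes, or equal votes and more nodes; a node rejects any proposal for which it can describe something better.

    /* Toy comparison rule for proposed cluster configurations. */
    #include <stdbool.h>
    #include <stdio.h>

    struct config { int votes; int nodes; };

    /* "Best" is the greatest number of possible votes; ties are broken
       by the greater number of nodes. */
    static bool better(struct config a, struct config b)
    {
        if (a.votes != b.votes) return a.votes > b.votes;
        return a.nodes > b.nodes;
    }

    int main(void)
    {
        struct config proposal = { .votes = 3, .nodes = 3 };
        struct config mine     = { .votes = 3, .nodes = 4 };

        /* A receiving node that can describe a better cluster rejects
           the proposal, forcing a new election round. */
        printf("%s\n", better(mine, proposal) ? "reject proposal"
                                              : "accept proposal");
        return 0;
    }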


Even when one considers the worst case of a continual failure situation, convergence on a solution is guaranteed because the connection manager does not add new nodes during a reconfiguration, and connections that fail are never used again. Thus, conditions cannot oscillate between good and bad during the reconfiguration because of nodes rebooting or because failed connections are restored. Conditions can only get worse, i.e., simpler, until failures cease to happen long enough for the reconfiguration to complete.

However, this worst-case condition is atypical; most reconfigurations are very simple. A node that is removed, as a result of a planned shutdown or because it fails, attempts to send a "last gasp" datagram to all VAXcluster members. This datagram indicates that the node is about to cease functioning. The delay present during the failure detection phase is bypassed completely, and the connection manager configures a new VAXcluster system in considerably less than one second.

Normally, the impact on an application of a node joining a VAXcluster system is minimal. For some configurations, there is no blockage of locking. In other cases, the distributed directory portion of the lock database must be rebuilt. This process may block locking for up to a small number of seconds, depending on the number of nodes, number of directory entries, and type of communication buses in use.

Application delays can result when an improperly dismounted disk is mounted by a booting node. Failure to properly dismount the disk, e.g., because of a node failure, results in the temporary loss of some preallocated resources such as disk blocks and header blocks. An application can recover these resources when the disk is mounted, but I/O to the disk is blocked during the mounting operation. This I/O blocking has a potentially detrimental impact on applications that are attempting to allocate space on the disk. The answer to this problem is to mount disks so that the recovery of the preallocated resources is deferred. For all disks except the system disk, disk mounting is accomplished with the MOUNT/NOREBUILD command. Because a system disk is implicitly mounted during a system boot, the system parameter ACP_REBLDSYSD must be set to the value 0 to defer rebuilds. The application can recover the resources at a more opportune time by issuing a SET VOLUME/REBUILD command.

The impact on a VAXcluster system of removing a node varies depending on what resources the application needs. During the failure detection phase, messages to a failed node may be queued pending discovery that there actually is a failure. If the application needs a response based on one of these messages, the application is blocked. Otherwise, the failure does not affect the application. Once the reconfiguration starts, locking is blocked. An application using the lock manager may experience a delay, but as long as there are sufficient votes present in the cluster to constitute a quorum, I/O is not blocked during the reconfiguration. If the number of votes drops below a quorum, I/O and process activity are blocked to prevent partitioning and possible data corruption.

Another aspect of node removal is the need to ensure that all I/O requests initiated by the removed node complete prior to the initiation of new I/O requests to the same disks. To enhance disk performance, many disk controllers can reduce head movements by altering the order of simultaneously outstanding commands. This command reordering is not a problem during normal operation; applications initiating I/O requests coordinate with each other using the lock manager, for instance, so that multiple writes, or multiple reads and writes, to the same disk location are never outstanding at the same time. However, when a node fails, all locks held by processes running on that node are released. Releasing these locks allows the granting of locks that are waiting and the initiation of new I/O requests. If new locks are granted, a disk controller may move the new I/O requests (issued under the new locks) in front of old I/O requests. To prevent this reordering, a special MSCP command is issued by the connection manager to each disk before new locks are granted. This command creates a barrier for each disk that ensures that all old commands complete prior to the initiation of new commands.

Physical Location and Availability of Disks

The application designer does not generally have to be concerned with the physical location of a disk in a VAXcluster system. Disks located on HSC storage controllers are directly available to VAX processors on the same CI bus. These disks can then be MSCP-served to any VAX processor that is not connected to that bus. Similarly, disks accessed by way of a local disk controller on a VAX processor can be MSCP-served to all other nodes. This flexibility allows an application to access a disk regardless of physical location. The only differences that the application can detect are varying transfer rates and latencies, which depend on the exact path to the disk and the type of controllers involved.
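The ordering hazard described above (old I/O from a failed node versus new I/O issued under newly granted locks) and its fix can be shown schematically. In the invented sketch below, the release of a failed node's locks is separated from the granting of new locks by a per-disk barrier step, mirroring the special MSCP command.

    /* Toy ordering sketch: a barrier between a failed node's old I/O and
       new I/O issued under newly granted locks.  All names are invented. */
    #include <stdio.h>

    static void release_locks_of_failed_node(int node)
    {
        printf("node %d: locks released, waiters not yet granted\n", node);
    }

    static void issue_barrier(int disk)
    {
        /* Stands in for the special MSCP command: every command queued
           before this point must complete before any queued after it. */
        printf("disk %d: barrier - old commands drained\n", disk);
    }

    static void grant_waiting_locks(void)
    {
        printf("waiting locks granted; new I/O may start\n");
    }

    int main(void)
    {
        int failed_node = 3;
        int disks[] = { 0, 1 };

        release_locks_of_failed_node(failed_node);
        for (int i = 0; i < 2; i++)       /* one barrier per disk      */
            issue_barrier(disks[i]);
        grant_waiting_locks();            /* only after all barriers   */
        return 0;
    }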


To provide the best application availability, the following guidelines should be considered:

1. VMS Volume Shadowing Phase II should be used to shadow disks, thus allowing operations to continue transparently in the event that a single disk fails.

2. Multiple paths should exist to any given disk. A disk should be dual-pathed between multiple controllers. Dual pathing allows the disk to survive controller failures.

3. Members of the same shadow set should be connected to different controllers or buses as determined by the type of disk.

4. Multiple servers should be used whenever serving disks to a cluster in order to provide continued disk access in the event of a server failure.

Selection of Controllers

Using static load balancing, the VMS software attempts to select the optimal MSCP server for a disk unit when that unit is initially brought on line. The load information provided by the MSCP server is considered in this decision. The HSC controllers do not participate in this algorithm. In addition, the VMS software selects a local controller in preference to a remote MSCP server, where possible. If a remote server is in use and the disk becomes available by way of a local controller, the software begins to access the disk through the local controller. This feature is known as local fail-back.

An advanced development effort in the VMS operating system is demonstrating the viability of dynamic load balancing across MSCP servers. Load balancing considers server loading dynamically and moves disk paths between servers to balance the load among the servers.

Disk and Tape Path Changes

Path failures are initially detected by the low-level communication software, i.e., the SCS or port layers. The communication software then notifies the disk or tape class driver of the failure. The driver then transparently blocks the initiation of new I/O requests to the device, prepares to restart outstanding I/O operations, and begins a search for a new path to the device. Static load balancing information is considered when attempting to find a new path. The path search is accomplished by sending an MSCP GET UNIT STATUS command to any known disk controller or MSCP server capable of serving the device. Some consideration is given to selecting the optimal controller; for example, the driver interrogates local controllers before remote controllers.

Once a new path is discovered or the old path is reestablished, the VMS system checks the volume label to ensure that the disk or tape volume has not been changed on the device. This verification prevents data corruption in the event that someone substitutes the storage medium without dismounting and remounting the device. After a successful check, the software restarts incomplete I/O requests and allows stalled I/O requests to proceed. In the case of tapes, the tape must be repositioned to the correct location before restarting I/O requests.

If the label check determines that the original medium is no longer on the disk or tape unit, then I/O requests continue to be stalled and a message is sent to the operator requesting manual intervention to correct the problem. Attempts to reestablish the correct operation of a disk or tape continue for an interval determined by the system parameter MVTIMOUT (mount verification time-out). Once the time-out period expires, further attempts to restore operation are abandoned and pending requests are returned to the application with an error status. Thus, the software handles temporary disk path failures in such a transparent fashion that the application program, e.g., the user application, VAX RMS software, or the VMS file system, is unaware that an interruption occurred.

Disk Failures

If a disk fails completely when VMS Volume Shadowing Phase II software is used, the software removes the failed disk from the shadow set and satisfies all further I/O requests using a surviving disk. If a block of data cannot be recovered from a disk in a shadow set, the software recovers the data from the corresponding block on another disk, returns the data to the user, and places the data on the bad disk so that subsequent reads will obtain the good data.[2]

Summary

VAXcluster systems continue to provide a unique base for building highly available distributed systems that span a wide range of configurations and usages. In addition, VAXcluster computer systems can grow with an organization. The availability, flexibility, and growth potential of VAXcluster systems result from the ability to add or remove storage and processing components without affecting normal operations.
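Returning to the mount verification sequence described above: it behaves like a bounded retry loop. The sketch below, with invented names and greatly simplified logic, shows the shape only; I/O stalls while the software retries, a successful path search is followed by the volume label check, and the operation fails only after the MVTIMOUT interval expires.

    /* Toy mount-verification loop; names and structure are invented. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    static bool find_path(int attempt)  { return attempt >= 2; }  /* path returns on 3rd try */
    static const char *read_label(void) { return "USERDISK1"; }

    /* Returns true if I/O can be restarted, false if pending requests
       should be returned to the application with an error status. */
    static bool mount_verify(const char *expected_label, int mvtimeout)
    {
        for (int t = 0; t < mvtimeout; t++) {          /* one attempt per second */
            if (!find_path(t))
                continue;                              /* keep I/O stalled       */
            if (strcmp(read_label(), expected_label) == 0)
                return true;                           /* same volume: restart   */
            printf("operator: wrong volume mounted, please correct\n");
        }
        return false;                                  /* MVTIMOUT expired       */
    }

    int main(void)
    {
        printf(mount_verify("USERDISK1", 3600) ? "I/O restarted\n"
                                               : "I/O failed\n");
        return 0;
    }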


References

1. N. Kronenberg, H. Levy, and W. Strecker, "VAXclusters: A Closely-coupled Distributed System," ACM Transactions on Computer Systems, vol. 4, no. 2 (May 1986): 130-146.

2. S. Davis, "Design of VMS Volume Shadowing Phase II-Host-based Shadowing," Digital Technical Journal, vol. 3, no. 3 (Summer 1991, this issue): 7-15.

3. W. Snaman, Jr. and D. Thiel, "The VAX/VMS Distributed Lock Manager," Digital Technical Journal, no. 5 (September 1987): 29-44.

4. W. Laing, J. Johnson, and R. Landau, "Transaction Management Support in the VMS Operating System Kernel," Digital Technical Journal, vol. 3, no. 1 (Winter 1991): 33-44.

5. Guidelines for VAXcluster System Configurations (Maynard: Digital Equipment Corporation, Order No. EK-VAXCS-CG-004, 1990).

6. L. Leahy, "New Availability Features of Local Area VAXcluster Systems," Digital Technical Journal, vol. 3, no. 3 (Summer 1991, this issue): 27-35.

7. T. K. Rengarajan, P. Spiro, and W. Wright, "High Availability Mechanisms of VAX DBMS Software," Digital Technical Journal, no. 8 (February 1989): 88-98.

8. A. Goldstein, "The Design and Implementation of a Distributed File System," Digital Technical Journal, no. 5 (September 1987): 45-55.

Lee Leahy

New Availability Features of Local Area VAXcluster Systems

VMS version 5.4-3 increases the availability of local area VAXcluster (LAVc) configurations by allowing the use of multiple local area network (LAN) adapters in the VAXcluster system. Availability is increased by enabling fail-over between LAN adapters, reducing channel failure detection time, and providing better network troubleshooting. Combined, these changes significantly increase the availability of LAN-based VAXcluster configurations by allowing the VAXcluster system to tolerate and work around network failures.

This paper describes the availability features added to local area VAXcluster (LAVc) support in VMS version 5.4-3. These features support multiple local area network (LAN) adapters, reduce the time required to detect network path (channel) failures, and provide additional support for network troubleshooting. (Table 1 presents definitions for terms used throughout the paper.)

We begin the paper with an overview of the added LAVc availability features of VMS version 5.4-3. We then present the multiple-adapter support features of the new release, with comparisons to the previous single-adapter implementation. The detection of network delays is discussed, along with how the system selects alternate paths around these delays after detection. Finally, we discuss the analysis of network failures and the physical descriptions needed to achieve the proper level of failure reporting.

Added Availability Features

VMS version 5.4-3 supports LAVc use of up to four LAN adapters for each VAX system. Availability and performance are increased by connecting each LAN adapter to a different LAN segment. Maximum availability is achieved by redundantly bridging these LAN segments together to form a single extended LAN. This configuration maximizes availability and reduces single points of failure by increasing the number of possible network paths between the different systems within the VAXcluster system.

Availability has also been increased at the applications level by reducing the time required to detect channel failures. The LAVc protocol (NISCA) sends sequenced datagrams to the remote system. If not acknowledged within 2 seconds, a datagram is retransmitted. Retransmission continues until the connection between the two systems is declared broken. However, applications can be stalled during this error recovery process. Therefore, reducing the time for detecting channel failures and retransmitting datagrams reduces the amount of application delay introduced by network problems.

VMS version 5.4-3 also increases availability by reducing the delays introduced by network congestion. This latest release measures the network delays on a channel basis. The channel with the lowest computed network delay value is used to communicate with the remote system.

LAVc network failure analysis is a new feature in VMS version 5.4-3. This feature provides an analysis of failing channels by isolating the common network components responsible for the channel failures. LAVc network failure analysis increases availability by reducing the downtime caused by failing network components. To enable this feature, the system or network manager must provide an accurate physical description of the network used for LAVc communications.

Multiple-adapter Support

This section describes the availability features added with the multiple-adapter LAVc support in VMS version 5.4-3. Some limitations of the single-adapter implementation are presented for comparison.

Single Points of Failure

In single-adapter LAVc satellites, the Ethernet adapter remains as a single point of failure. This failure "point" actually extends through the network components common to all of the network paths in use for cluster communication.


Table 1 LAVc Terminology

Channel: A data structure in PEDRIVER that represents a network path (see Network Path below). Each channel is associated with a single virtual circuit (VC).

Datagram: A message that is requested to be sent by the client of the LAN driver. A datagram does not have guaranteed delivery to the remote system. The datagram may never be sent, or could be lost during transmission and never received.

LAN Adapter: An Ethernet or fiber distributed data interface (FDDI) adapter. Each type of LAN adapter has a unique set of attributes, such as the receive ring size.

LAN Address: The network address used to reference a specific LAN adapter connected to the Ethernet or FDDI. This address is displayed as six hexadecimal bytes separated by dashes, e.g., 08-00-2B-12-34-56.

LAN Segment: An Ethernet segment or FDDI ring. Each type of LAN has a unique set of attributes, e.g., maximum packet size. LAN segments can be connected together with bridges to form a single extended LAN. However, in such a LAN, the LAN segments can have different characteristics (e.g., different packet sizes for an FDDI ring bridged to an Ethernet).

Network Path: The pieces of the physical network traversed when a datagram is sent from one LAN address to another LAN address. The network path is represented by a pair of LAN addresses, one for the local system and one on the remote system. Each network path has a specific set of attributes, which are a combination of the attributes of the local LAN adapter, the remote LAN adapter, and each of the LAN segments and LAN devices on the path between them.

PEDRIVER: The VMS port driver that provides reliable cluster communication utilizing the Ethernet.

Virtual Circuit: A data structure in PEDRIVER that represents the data path between the local system and the remote system. This data path provides guaranteed delivery for the messages sent. PEDRIVER's datagram service, along with an error recovery mechanism, ensures that a message is delivered to the remote system or is returned to the client with an error. A virtual circuit (VC) has one channel for each network path to the remote system.
The combination of VMS version 5.4-3 with multiple LAN adapters removes the LAN adapter as a single point of failure in the local system. Additionally, the use of multiple LAN adapters connected to an extended LAN creates multiple network paths to remote systems. This configuration results in a higher tolerance for network component failures and higher cluster availability.

Adapter Selection

The single-adapter implementation is configuration-dependent and does not allow the system manager a choice of adapters. The multiple-adapter support in VMS version 5.4-3 configures the system for maximum availability by starting the LAVc protocol on all LAN adapters in the system. Support is also provided to start and stop the LAVc protocol on the LAN adapters. This support allows the system manager to select which LAN adapters will run the LAVc protocol.

The means of locating the LAN devices in the system has also changed. The system now maintains a list of LAN devices. As each LAN device driver is loaded into the system, an entry is appended to this list. A new support routine steps through this list and returns a pointer to the next LAN device in the system. The single-adapter implementation requires code changes in PEDRIVER to add a new LAN device; the new implementation no longer requires these changes.

Channel Control Handshake

The channel control handshake is a three-way message exchange. The exchange starts when a HELLO message is received from a remote system and the channel is in the closed state, or any time a CCSTART message is received. Upon receiving a HELLO message on a closed channel, the system responds with a CCSTART message.

Upon receiving a CCSTART message, the system closes the channel if the PATH bit was set. In all cases, if the cluster password is correct, the system responds with a VERF message. Upon receiving the VERF message, the remote system verifies the cluster password. If the password is correct, the remote system sends an acknowledgment (VACK) message and marks the channel as usable by setting the PATH bit. The local system, upon receiving the VACK message, also marks the channel as usable by setting the PATH bit.
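The three-way exchange can be expressed as a small state machine. The sketch below is schematic only: the message names follow the text (HELLO, CCSTART, VERF, VACK), but the data structures, password check, and transport are invented stand-ins.

    /* Toy model of the channel control handshake, one side's view. */
    #include <stdbool.h>
    #include <stdio.h>

    enum msg { HELLO, CCSTART, VERF, VACK };
    struct channel { bool path_bit; };            /* PATH set = channel usable */

    static bool password_ok(enum msg m) { (void)m; return true; }
    static void send_msg(enum msg m)    { printf("send %d\n", (int)m); }

    static void receive(struct channel *ch, enum msg m)
    {
        switch (m) {
        case HELLO:                   /* closed channel: start the handshake */
            if (!ch->path_bit) send_msg(CCSTART);
            break;
        case CCSTART:                 /* restart: close any open path first  */
            ch->path_bit = false;
            if (password_ok(m)) send_msg(VERF);
            break;
        case VERF:                    /* verify the password, acknowledge    */
            if (password_ok(m)) { send_msg(VACK); ch->path_bit = true; }
            break;
        case VACK:                    /* both ends now consider path usable  */
            ch->path_bit = true;
            break;
        }
    }

    int main(void)
    {
        struct channel ch = { false };
        receive(&ch, HELLO);          /* respond with CCSTART                */
        receive(&ch, VERF);           /* respond with VACK, set PATH         */
        printf("PATH bit: %d\n", ch.path_bit);
        return 0;
    }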


The channel control handshake now verifies the network path used by this channel, instead of verifying the virtual circuit (VC) as in the single-adapter implementation. Additionally, the handshake is used to negotiate some parameters between the local and remote systems on a channel basis (instead of assuming that the parameters are common for all channels connected to the VC).

Packet size and pipe quota are two characteristics that are now arbitrated between the two systems. These parameters are negotiated on a channel-by-channel basis to allow different channels to fully utilize the capabilities of the specific network path.

With the introduction of FDDI, larger packet sizes are now supported. The channel handshake between two nodes negotiates a packet size that is supported by the entire network path. Any path that utilizes an Ethernet must use a packet size of 1498 bytes or smaller. An FDDI-to-FDDI path on the same extended ring must use a packet size of 4468 bytes or smaller. An increased packet size reduces the number of messages required when large blocks of data are sent. This increase in packet size results in fewer messages, less handshaking, and thus better network efficiency.

The PIPE_QUOTA value is used to limit the number of messages sent to the remote system before waiting for an acknowledgment. PIPE_QUOTA was implemented to help prevent receiver overrun on the remote system. Instead of using a fixed value, the new implementation uses a value specified by the LAN driver. This value factors in the LAN device's receive ring size and is typically larger than the fixed value of eight messages used previously. Increasing the PIPE_QUOTA value allows more data to be sent between the nodes before an acknowledgment message is required, thus increasing the protocol's efficiency and reducing the network traffic.

These new features in VMS version 5.4-3 have reduced the amount of handshaking required to move data and the number of messages required to move large amounts of data. The result is greater applications availability through fewer network-based delays.

Use of Hello Messages

The single-adapter implementation uses a HELLO message to maintain the VC and not the channels. Also, the handshake to verify connectivity is performed by the VC, which forces all channels to use the same characteristics. In comparison, the multiple-adapter implementation uses HELLO messages to trigger the channel handshake, to test the network path and maintain the channel in the open state, and to continuously verify the network topology.

To maintain the channel and test the network path, each system multicasts a HELLO message through each of its LAN adapters every 3 seconds. Upon receipt of a HELLO message (if the channel is not open), a channel handshake begins. If the channel is open, the network delay is computed and the channel packet size is verified. When an open channel does not receive a HELLO message within 8 seconds, it declares a listen time-out and the channel is closed.

Additional topology change detection is required because FDDI-to-FDDI communications use large packets. If two systems using FDDI adapters exchange channel control messages, then both can agree on a large packet size. However, if the network is configured in the dumbbell configuration, then only the small packet size can be used. (The dumbbell configuration consists of two FDDI rings separated by an Ethernet segment.)

Detection of the dumbbell configuration is performed using the priority field in the frame control byte of the FDDI message header. This field does not exist in Ethernet messages and must be created when forwarding an Ethernet message to an FDDI ring. Ethernet-to-FDDI LAN bridges set this field's value to zero. All LAVc messages transmitted by the FDDI adapters use a non-zero value for the priority field. When a channel control message is received, the value of this field is checked. If the value is non-zero, then large messages can be used because the message did not traverse an Ethernet segment.

The priority field is also verified every time a HELLO message is received and the channel is open. A topology change is detected when a change in the priority value is received. If the priority value goes from zero to non-zero, the packet size is renegotiated and a larger packet size may be used. If the priority value goes from non-zero to zero, the channel packet size must be reduced. If this is the only channel with a larger packet size, then the VC closes and forces the two systems to reassign the message sequence numbers.

Listen Time-out

VMS version 5.4-3 now consistently times out channels in 8 to 9 seconds, whereas the single-adapter implementation detects the failure in 8 to 15 seconds. Reducing this time reduces the delays experienced by applications when a LAVc node is removed from the cluster. The result is an increase in applications availability.
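The packet-size rule is essentially a minimum over the network path, with the FDDI priority field deciding whether an Ethernet segment was crossed. A toy version follows, using the sizes quoted above and invented function names.

    /* Toy packet-size negotiation.  1498 and 4468 are the limits quoted
       in the text; everything else is invented for illustration. */
    #include <stdio.h>

    #define ETHERNET_MAX 1498
    #define FDDI_MAX     4468

    /* Ethernet-to-FDDI bridges set the FDDI priority field to zero, so a
       non-zero priority means the message never crossed an Ethernet. */
    static int negotiated_size(int local_is_fddi, int remote_is_fddi,
                               int rcvd_priority_field)
    {
        if (local_is_fddi && remote_is_fddi && rcvd_priority_field != 0)
            return FDDI_MAX;        /* pure FDDI path                      */
        return ETHERNET_MAX;        /* an Ethernet segment is on the path  */
    }

    int main(void)
    {
        printf("FDDI to FDDI, same extended ring: %d\n",
               negotiated_size(1, 1, 7));
        printf("dumbbell (FDDI-Ethernet-FDDI):    %d\n",
               negotiated_size(1, 1, 0));
        printf("any path with an Ethernet node:   %d\n",
               negotiated_size(1, 0, 0));
        return 0;
    }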


The single-adapter implementation traverses the VC list and scans each of the receive channels (RCH structures embedded in the VC) to check for time-out. Because this scan is CPU-intensive, the algorithm was designed to scan the VC list only once every 8 seconds. Reducing this scan time required the design of a new algorithm that reduces the CPU utilization required to locate the channels that have timed out.

The VMS version 5.4-3 implementation places each open channel into a ring of time-out queues. The time-out routine maintains a pointer into the ring of queues corresponding to the 8-second time-out. Each second, the time-out routine executes, removes any channels pointed to by the time-out pointer, and calls the listen time-out routine for the channel. Next, the time-out pointer and the 8-second time-out pointer are updated to point to a new set of queue headers in the ring. Active channels and channels receiving HELLO messages are inserted into the ring of queues pointed to by the current time pointer, which prevents them from timing out. This implementation reduces CPU utilization during the time-out scan by looking at only the channels that have timed out.

Changes to Virtual Circuit Maintenance

The single-adapter implementation closes the VC and performs a channel control handshake every time a new channel is established. This implementation also forces each channel to use the same characteristics, specifically packet size, thereby reducing the characteristics to the lowest common denominator.

VMS version 5.4-3 does not close the VC each time a new channel is established. The channel handshake affects only the channel and is used to negotiate the channel characteristics, including packet size. The VC remains open as long as a channel with the corresponding packet size is open. This maintenance increases applications availability by allowing channels to fail and reestablish transparently without disrupting service at the VC and systems communication services (SCS) layers.

One Channel with Matching Characteristics Required

The VC can be opened as soon as the first channel to the remote system is opened. When the VC opens, its packet size is set to the packet size of the channel being used. The VC can remain open as long as at least one channel with a compatible packet size is open. The packet size is compatible if the channel packet size is greater than or equal to the packet size currently in use by the VC.

Transfers restricted to an FDDI ring can use a larger packet size than those that traverse an Ethernet LAN segment. PEDRIVER now supports variable packet sizes up to the size supported for the FDDI ring. Each time the VC switches channels, the new channel characteristics are copied into the VC. The result is that as soon as the VC switches to using the FDDI-to-FDDI channel, it also switches to using the larger packet size.

Receive Message Caching

VMS version 5.4-3 introduces a receive message cache to prevent any performance degradation when messages are received out of order. Because of transmission and network delays, messages are typically received out of order at approximately the time a channel switch occurs. Also, most of the channel selections are invoked after locating a channel with a lower network delay value, thus increasing the probability that messages will be received out of order.

Channel Failure Not Displayed

The multiple-adapter implementation does not display any messages when a channel fails. This choice was made to maintain compatibility with the previous implementation. We also wished to reduce the number of console messages and still provide enough data to isolate the problem. However, without some channel failure notification, all but one channel could fail without notice, thus negating all the availability that was introduced by using multiple adapters. The LAVc network failure analysis allows the system or network manager to select one of the following levels of channel failure notification: no notification, if not enabled; channel failure notification, when barely enabled; or isolation of the failing network component, when fully enabled. When this feature is fully enabled, a failing network component typically generates only a single console message instead of displaying tens or hundreds of channel failure messages.

Channel Selection

VMS version 5.4-3 bases its selection of a single transmit channel for a remote system first on the packet size and second on the network delay value. The channel selection algorithm searches for an open channel with a compatible packet size so that the VC does not have to be broken. If more than one channel has a compatible packet size, the network delays are compared and the channel with the lowest network delay value is chosen.
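The ring of time-out queues described at the start of this section is what is now commonly called a timer wheel. The toy program below, with invented names and a plain array of singly linked lists, shows the idea: any activity on a channel re-inserts it at the current-time slot, and each second only the one slot that is 8 seconds old is examined.

    /* Toy timer wheel for channel listen time-outs; all names invented. */
    #include <stddef.h>
    #include <stdio.h>

    #define SLOTS 9                        /* 8-second time-out plus one  */

    struct chan { const char *name; struct chan *next; };

    static struct chan *wheel[SLOTS];      /* ring of time-out queues     */
    static int now;                        /* current-time slot           */

    /* A HELLO (or any activity) re-inserts the channel at the current
       slot, pushing its expiry 8 seconds into the future. */
    static void touch(struct chan *ch)
    {
        ch->next = wheel[now];
        wheel[now] = ch;
    }

    /* Runs once per second: only the oldest slot is inspected, so the
       cost is proportional to the channels that actually timed out. */
    static void tick(void)
    {
        now = (now + 1) % SLOTS;
        int expired = (now + 1) % SLOTS;   /* slot 8 seconds in the past  */
        for (struct chan *ch = wheel[expired]; ch; ch = ch->next)
            printf("listen time-out: closing channel %s\n", ch->name);
        wheel[expired] = NULL;
    }

    int main(void)
    {
        struct chan a = { "A1-B1", NULL };
        touch(&a);
        for (int s = 0; s < 10; s++)       /* no HELLOs arrive: expires   */
            tick();
        return 0;
    }

The design choice mirrors the text: the per-second work no longer scales with the total number of channels, only with those that have actually expired.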


network delays are compared and the channel with the lowest network delay value is chosen. The selected channel is used until it fails, encounters an error, or a channel with a lower network delay value is found.

Channel selection is performed independently for each remote system. This implementation means that a two-node cluster increases its availability through the use of more LAN adapters, but does not achieve a performance benefit by increasing the number of LAN adapters above two. Larger clusters, however, can take advantage of the additional LAN adapters and thus achieve better cluster performance. Multiple LAN adapters can also increase the bandwidth available for use by the LAVc protocol. However, the actual performance is very configuration- and application-dependent.

Channel selection is limited to the transmit channel, but all channels are used to receive data. The receive cache helps prevent retransmission by the remote system by placing messages received out of order into the receive cache until the previous messages are received. This receive algorithm is compatible with any transmit channel selection algorithm, e.g., in PEDRIVER or in any component implementing NISCA.

Multiple-adapter Availability Summary
The multiple-adapter LAVc support added to VMS version 5.4-3 increases the availability of applications and of the overall cluster. Availability is increased by removing the LAN adapter as a single point of failure. Cluster availability is enhanced through continuous testing of the network paths and correction for network topology changes.

This implementation also increases network utilization and cluster performance by taking full advantage of a channel's characteristics. Larger receive ring sizes reduce the protocol handshaking overhead. Moreover, larger packet sizes reduce the number of messages that must be sent for large transfers.

The next section discusses how PEDRIVER detects network delays and selects alternate paths.

Network Delay Detection
VMS version 5.4-3 increases application availability by detecting significant network delays and selecting alternate paths. As the network gets busy, it becomes more difficult for a LAVc node to send cluster messages. These delays in network communications cause delays in cluster traffic and translate into delays in the applications. Thus, through delay detection and the use of alternate paths, VMS version 5.4-3 reduces the delays for applications and increases overall cluster performance.

Assumptions and Delay Calculations
PEDRIVER computes network delays through a series of assumptions. The primary assumptions are that the transmit and receive delays for a path are equal, and that there are small internal delays associated with the LAN device. Although these assumptions are occasionally invalid, PEDRIVER uses them because there are no round-trip messages available in the NISCA protocol to compute the delay.

As the first step in the delay calculation for each channel between nodes, each node time-stamps the HELLO message just prior to transmission. When the HELLO message is received, the time stamp is subtracted from the local system time. This resulting value equals the sum of the transmit queue delay, the network delay, the receive queue delay, and the difference in the two system times. Applying the assumptions reduces this value to the sum of the network delay and the difference in the two system times.

The second step of the delay calculation is to compare the delay times between different channels to the same remote system. This comparison is a subtraction of the values computed above for each channel. The computation removes the common factor (the difference in the two system times) and results in the comparison of the two network delays. When multiple channels exist, PEDRIVER attempts to use the channel with the lowest network delay value.
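The two-step calculation lends itself to a compact sketch. The following C fragment is a minimal illustration of the scheme just described, not PEDRIVER code; the type and routine names (channel_t, record_hello, select_transmit_channel) are invented for this example. The per-channel value mixes the network delay with the clock offset between the two systems, and the offset cancels only because channels to the same remote system are compared against one another.

    /* Hypothetical sketch only; names and types are invented. */
    #include <stdint.h>
    #include <stddef.h>

    typedef struct channel {
        int64_t delay_plus_offset;  /* (local receive time - HELLO time stamp): */
                                    /* network delay plus clock offset          */
        int     open;               /* nonzero while the channel is usable      */
    } channel_t;

    /* Record the one-way measurement when a HELLO message arrives.
     * Assumes, as the text does, that transmit and receive delays are
     * equal and internal LAN-device delays are small. */
    void record_hello(channel_t *ch, int64_t local_time, int64_t hello_stamp)
    {
        ch->delay_plus_offset = local_time - hello_stamp;
    }

    /* Choose the transmit channel to one remote system.  Every channel
     * to the same remote node carries the same clock offset, so the
     * offset cancels in the comparison. */
    channel_t *select_transmit_channel(channel_t *chans, size_t n)
    {
        channel_t *best = NULL;
        for (size_t i = 0; i < n; i++) {
            if (!chans[i].open)
                continue;
            if (best == NULL ||
                chans[i].delay_plus_offset < best->delay_plus_offset)
                best = &chans[i];
        }
        return best;  /* NULL when no channel to the remote system is open */
    }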


Problems and Benefits Associated with the Assumptions
The assumptions in the network delay calculation do not always hold true. The arbitration delay to transmit a message on the Ethernet, between a pair of systems, is not always equal in both directions. Over the long term, this assumption would be valid if the systems are sending the same number of messages in each direction; however, this is not typically the case. When this assumption does not hold true, i.e., if the transmit delay is longer than the receive delay, then additional delay is introduced when transmitting messages using this channel.

The assumption that internal delays are small depends upon the network traffic and the transmit traffic generated for an adapter by the other LAN clients. If another LAN client is a heavy user of a particular LAN adapter, then transmissions from this adapter experience additional queue delays while waiting for the adapter. If the network is busy, messages in the transmit queue have an additional wait.

Finally, the network delay computed is the delay from the remote system to the local system. Since the delay is not always symmetric, it does not always represent the delay in the other direction, i.e., transmitting messages to the remote system. Yet, because the NISCA protocol does not have any round-trip messages, this is the best possible delay value.

Even with these problems in the assumptions, the network delay calculations increase the availability of the cluster by detecting large network delays. With this data, PEDRIVER is usually able to select alternate paths around the network delays when multiple channels exist, providing better cluster performance and availability.

Figure 1 represents an example of network delay detection. If LAN segment A is very busy, then PEDRIVER can detect an additional network delay for channels A1-B1, A1-B2, and A2-B1. PEDRIVER can then select an alternate path, that is, transmit packets only on channel A2-B2. Use of channels A1-B1, A1-B2, and A2-B1 can resume when the network traffic level on LAN segment A is reduced to about the level of LAN segment B, or if channel A2-B2 fails.

LAVc Network Failure Analysis
VMS version 5.4-3 uses multiple LAN adapters to increase availability by working around network delays and failures. Channels fail as network failures occur, reducing the availability provided by these extra channels. However, the VC remains open, allowing cluster communication as long as a single channel remains open.

To maintain compatibility with previous VMS versions, only VC failures are displayed on the local console. Displaying messages about channel failures would only indicate a problem without helping to locate the cause of the failure. Also, as the cluster configuration gets larger, or the number of LAN adapters increases, channel failure messages increase (depending on what component failed) beyond the point where they are helpful. Yet to maintain cluster availability, the system or network manager needs to be told of the channel failures that are reducing the availability.

The LAVc network failure analysis, introduced with VMS version 5.4-3, is used to analyze the network failures and display the OPCOM messages that call out the failing network component. This support requires a description of the physical network used for LAVc communications. Depending upon the description supplied, the system or network manager can select the level of failure reporting. This level may range from channel failure reporting to calling out the actual component that failed.

Display of Channel Failures
There is a significant difference between displaying the channel failures and performing LAVc failure analysis. This difference is shown in Figure 2, which represents a multiple-adapter LAVc configuration.

Figure 1 Network Delay Detection
(The figure shows VAX A and VAX B, each with two Ethernet adapters; adapter 1 of each system attaches to Ethernet LAN segment A, adapter 2 attaches to Ethernet LAN segment B, and the two segments are joined by an Ethernet bridge.)


Figure 2 Multiple-adapter Channel Failure
(The figure shows VAX A, VAX B, VAX C, and VAX D, each with two Ethernet adapters. Adapter 1 of VAX A and VAX B connects through DELNI A, and adapter 1 of VAX C and VAX D connects through DELNI B, to Ethernet LAN segment A. Adapter 2 of VAX A and VAX B connects through DELNI C, and adapter 2 of VAX C and VAX D connects through DELNI D, to Ethernet LAN segment B. The two segments are joined by an Ethernet bridge.)

Looking from system VAX A, the following channels exist: A1-A2, A2-A1, A1-B1, A1-B2, A2-B1, A2-B2, A1-C1, A1-C2, A2-C1, A2-C2, A1-D1, A1-D2, A2-D1, and A2-D2. Let us assume that DELNI B fails, causing the following channel failures: A1-C1, A2-C1, A1-D1, and A2-D1. A display of channel failures would show that some interesting event had just occurred but would leave it up to the system or network manager to isolate the actual failure. Also, since other channels are still open to VAX C and VAX D (A1-C2, A2-C2, A1-D2, and A2-D2), these nodes still remain in the cluster. However, the number of channels to these nodes has been halved, reducing cluster availability.

LAVc network failure analysis uses the physical network description to analyze channel failures. The working channel A1-C2 indicates that VAX A, A1, DELNI A, LAN segment A, the Ethernet-to-Ethernet LAN bridge, LAN segment B, DELNI D, C2, and VAX C function. The working channel A2-D2 indicates that A2, DELNI C, D2, and VAX D also function. The remaining components are DELNI B, C1, and D1. By reviewing the failing channels for common failures, we see that two channels use component C1, two channels use component D1, and all four channels use component DELNI B. Therefore, DELNI B has the highest probability of causing the failure and is the only network component displayed on the console.

In this small cluster configuration, LAVc network failure analysis has reduced the messages displayed, i.e., from four channel failure messages to one component failure message. This simpler display provides timely notification and better isolation of network component failures, allowing the system or network manager to repair the network earlier and restore the full availability of the cluster.

Physical Network Description
LAVc network failure analysis requires a description of the physical network. This description lists the components used by the LAVc and the network paths that correspond to the LAVc channels.

The network component description consists of several pieces of data, including a component type and text description provided by the system or network manager. Some component types require additional data. There are several types of network components: NODE, ADAPTER, COMPONENT, and CLOUD. Each NODE component requires a unique node name associated with it that matches the SCSNODE SYSGEN parameter. The ADAPTER component has at least one and sometimes two LAN addresses associated with it. One LAN address is the hardware address and the other, when specified, is the DECnet LAN address. COMPONENTs are used to describe all pieces of the network, both working and nonworking. CLOUDs describe portions of the network that are working only if all paths are working. Any path failure implies that the CLOUD component may not be working.
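To make the description concrete, the following C declarations give one plausible shape for these records. Only the component types and the listed data items (text description, SCSNODE name, LAN addresses) come from the text; every field name and width is an assumption for illustration.

    /* Illustrative layout; field names and sizes are assumptions. */
    #include <stdint.h>

    typedef enum {
        COMP_NODE, COMP_ADAPTER, COMP_COMPONENT, COMP_CLOUD
    } comp_type_t;

    typedef struct component {
        comp_type_t type;
        char        description[64];  /* text supplied by the manager       */
        char        node_name[16];    /* NODE only: must match SCSNODE      */
        uint8_t     hw_addr[6];       /* ADAPTER only: LAN hardware address */
        uint8_t     decnet_addr[6];   /* ADAPTER only: optional DECnet LAN  */
                                      /* address                            */
    } component_t;

    /* A network path is a directed list of component ID values: the
     * local node first, then each component a message crosses in turn,
     * and the remote node last. */
    typedef struct net_path {
        int ids[16];
        int count;
    } net_path_t;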


Component descriptions can range from actual devices and cables to internal CPU bus adapters. When a component is defined, an ID value is returned for use in the network path description. The choice of the components described is left to the system or network manager and allows the manager to select the desired level of network analysis.

Each network component has a reference count and a working count. The reference count is incremented when a network path is defined that utilizes the network component. The working count is incremented each time a LAVc channel is opened, and decremented each time an open LAVc channel is closed.

The network path description consists of a directed list of component identifier (ID) values. For proper analysis, this list must start with the ID value for the local node. Each successive ID value in the list must be associated with the next network component through which a message would travel when using this path. The final component ID value is that of the remote node. Each network path description must contain two node ID values and two adapter ID values. To be useful for analysis, the path description must contain the node ID value for the node running the analysis. Without this node ID value, the path cannot be matched with any of the LAVc channels on that node.

Channel Mapping and Processing
The network path descriptions are matched with the LAVc channels by using the LAN addresses. If possible, only the LAN hardware address is used for the mapping function. This mapping provides the best analysis because it remains constant with respect to any LAN adapter. In clusters running mixed VMS versions, the LAN hardware address is not available for systems running a version prior to VMS version 5.4-3. In prior versions, the DECnet LAN address is used for the mapping function.

Each time a LAVc channel is opened, the network path database is searched to locate a matching network path description. If found, this description is connected to the channel and a scan of all the components in the path is performed. For each component in the path, the working count is incremented. If the component switches from not working to working, then a WORKING message is displayed. When a LAVc channel fails, the corresponding network path is placed on a failure list. The network path is then scanned and the working count for each component is decremented.

Failure Analysis
Related channel failures are collected by delaying 10 seconds following the channel failure. Each channel failure extends the time delay to the full 10 seconds. Once the 10-second delay has elapsed following the last channel failure, the full list of failing network paths is processed.

Computing the failure probabilities begins by reviewing each of the components in the network path. If a component cannot be proven to work, then it is placed on the suspect list and the component's suspect count is incremented. A component is working if the working count is non-zero; a CLOUD component is working if the working count equals the reference count. This step ends with a list of suspect components, each with a suspect count that represents the number of times this component could have caused the failure.

Suspects are selected by comparing the suspect counts for each of the components in a network path. Each network path is reviewed independently and a primary suspect is selected. The primary suspect is the first component with the highest suspect count in the network path. Secondary suspects are the other components in the network path with the same suspect count value. The primary and secondary suspects are displayed after all the network paths have been reviewed. The other components in the suspect list are removed from the list, and are not displayed because the failure analysis judged them to be unrelated to any of the channel failures.
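The counting rules above reduce to a short pair of passes. The following standalone C sketch restates them; the structures and list handling are invented, and only the working test and the suspect-count rules are taken from the text.

    #include <stddef.h>

    /* Minimal standalone model; names are invented. */
    typedef struct comp {
        int is_cloud;
        int reference_count;  /* paths defined through this component */
        int working_count;    /* open LAVc channels through it        */
        int suspect_count;    /* incremented by the failure analysis  */
    } comp_t;

    static int is_working(const comp_t *c)
    {
        /* CLOUDs work only when every path through them works. */
        return c->is_cloud ? c->working_count == c->reference_count
                           : c->working_count != 0;
    }

    /* Step 1: over each failed path, count how often a component
     * could have caused the failure. */
    void count_suspects(comp_t **path, size_t len)
    {
        for (size_t i = 0; i < len; i++)
            if (!is_working(path[i]))
                path[i]->suspect_count++;
    }

    /* Step 2: per failed path, the first component with the highest
     * suspect count is the primary suspect; the strict '>' keeps the
     * first one on ties, so later equal counts become secondary
     * suspects. */
    comp_t *primary_suspect(comp_t **path, size_t len)
    {
        comp_t *best = NULL;
        for (size_t i = 0; i < len; i++)
            if (best == NULL || path[i]->suspect_count > best->suspect_count)
                best = path[i];
        return best;
    }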


There are several limitations to the failure analysis. The analysis requires an accurate description of the physical network. The failure analysis is also looking for a common network component failure. Therefore, an incorrect analysis results from either an inaccurate network description, multiple related failures, or too much detail.

The key to a valid network failure analysis is the correct description of the physical network. In Figure 2, if the network path A1-B1 incorrectly listed DELNI B, then the failure analysis would find that DELNI B is working and remove it from the suspect list. The final analysis would list both C1 and D1 as the failing components. Validation of the network description can be performed by network fault insertion and by reviewing the network failure analysis. If the description is accurate, then the failure analysis should display the expected messages. If an inaccurate network description exists, unexpected messages may be displayed. In such cases, the network description should be reviewed.

Multiple related failures may also cause an incorrect failure analysis. Referring again to Figure 2, assume a correct network description. Instead of a DELNI B failure, assume that both C1 and D1 have failed. The failure analysis reviews the network description and locates the single component DELNI B because it is common to all of the failures. In this case, the failure analysis does correctly locate the area of the network (something connected to DELNI B). However, further review is required to identify that DELNI B itself has not failed, but rather both C1 and D1.

The choice of the network description, the number of components defined, and the path descriptions is left to the system or network manager. This choice allows the manager to select the level of failure reporting needed to troubleshoot the network. However, when the physical network description includes too much detail (e.g., transceiver cables), it becomes difficult for the failure analysis to reduce the components to a single failure. Instead, a primary suspect and several secondary suspects are usually displayed. Too much detail also requires more CPU cycles and memory for analysis, and in general is a bad trade-off.

In Figure 2, if the Ethernet adapter C1 fails, and the transceiver cables are listed in the network description, then the failure analysis displays two messages. The primary suspect is listed as the transceiver cable because it is the first component that matches the failure in the path from A to C. The Ethernet adapter C1 is listed as a secondary suspect, because its suspect count matches the suspect count of the primary suspect. In this example, there are no network paths described that use Ethernet adapter C1 without using the transceiver cable connected between C1 and DELNI B. With the network description provided, there is no way to distinguish between these two components. Therefore, both are displayed when either is a primary or secondary suspect.

Benefits
The LAVc network failure analysis, combined with an accurate description of the physical network, enables the system or network manager to maintain the increased availability gained with the use of multiple LAN adapters. Timely analysis and reporting of network component failures significantly reduces troubleshooting times and increases the overall cluster availability.

Summary
VMS version 5.4-3 increases the availability of Local Area VAXcluster configurations by providing the following features:

• Faster detection of channel failures
• Support for the use of multiple adapters
• Support for the use of additional network paths
• Detection of network congestion
• Analysis of network failures

The goals of these features are to

• Provide higher cluster availability
• Work around network congestion and network component failures while keeping the cluster running
• Detect problems earlier and report them more accurately, with network data that helps isolate the failing network components

In addition to meeting these goals, the features in VMS version 5.4-3 increase the cluster communication bandwidth.

Acknowledgments
I want to thank Kathy Perko and Steve Mayhew for their help with the design of the multiple-adapter version of PEDRIVER. Kathy reviewed the code during the implementation and provided valuable input for both the code and this paper. Thanks to Scott H. Davis, Sandy Snaman, and Dave Thiel for their contributions to the new PEDRIVER design. Thanks also to the LAN Group (Linda Duffell, Dave Gagne, Rod Gamache, Bill Salkewicz, and Dick Stockdale) for the VAX communication interface to the LAN drivers, which simplified the design of the new PEDRIVER. I also wish to acknowledge the LAN Group for their help during the debug phase of this implementation.

Richard E. Stockdale
Judy B. Weiss

Design of the DEC LANcontroller 400 Adapter

The DEC LANcontroller 400, Digital's XMI-to-Ethernet adapter (DEMNA), connects systems based on the Digital XMI bus to an Ethernet/IEEE 802.3 local area network (LAN). These systems use the XMI bus either as the system bus (VAX 6000 systems) or as an I/O bus (VAX 9000 systems). The new systems, which can utilize the full bandwidth of the Ethernet, are characterized by increased host processor speeds. The DEMNA adapter was designed to support these I/O requirements. In addition, console and monitor facilities were built into the adapter firmware for debugging, verification, and user visibility. The adapter's performance for small packets exceeds system capabilities, and Ethernet bandwidth is the limiting factor for large packets.

The high-performance DEC LANcontroller 400, Digital's XMI-to-Ethernet adapter (DEMNA), connects a system based on the Digital XMI bus to an Ethernet/IEEE 802.3 local area network (LAN). This adapter is intended for Digital systems that use the XMI bus either as the system bus (VAX 6000 systems) or as an I/O bus (VAX 9000 systems). It is an intelligent adapter that implements the physical layer and part of the data link layer of network protocol. The term intelligent refers to the packet processing performed by the adapter as part of the data link layer.

The DEMNA adapter was needed to support the I/O requirements of the VAX 6000 and VAX 9000 systems, which can utilize the full bandwidth of the Ethernet. The adapter also provides the ability to configure these systems without a BI bus. For these systems, the DEMNA adapter is the only Ethernet connection available.

The DEMNA adapter is controlled by a port driver that resides in host memory. The interface between the port driver and the DEMNA firmware (the port) is a ring-based design which is optimized for low system overhead and high performance.

The DEMNA adapter has the following major features:

• Supports Ethernet/IEEE 802.3 protocols
• Supports up to 64 users (each one a separate protocol such as local area transport [LAT] software, DECnet network software, or clusters)
• Supports two modes of addressing: VAX virtual addressing and 40-bit physical addressing
• Allows buffer chaining on transmit
• Performs packet filtering and validation on receive
• Supports Digital's maintenance operations protocol (MOP) functions
• Provides support for diagnostic routines and field service functions implemented through the system console or diagnostic software
• Has console and monitor facilities that allow a console user to monitor DEMNA operation and network utilization

This paper begins with a logic overview of the DEMNA device. The sections that follow discuss the factors that influenced design and implementation, describe the major performance metrics and user visibility operations, and review the design results and future needs.

Logic Overview
The DEMNA adapter is a single-board XMI adapter based on complementary metal-oxide semiconductor/transistor-transistor logic (CMOS/TTL) technology. As shown in Figure 1, the hardware consists of four separate subsystems:

• Microprocessor
• Direct memory access (DMA) and shared memory
• XMI interface
• Ethernet


Figure 1 DEMNA Block Diagram
(The figure shows four subsystems: the microprocessor subsystem (CVAX processor, SSC, EPROM, EEPROM, Ethernet address ROM, diagnostic register, and static RAM on the CDAL bus); the memory subsystem (shared static RAM on the DEMNA memory bus); the XMI interface subsystem (gate array and XMI corner); and the Ethernet subsystem (LANCE and SIA chips). Key: CVAX = CMOS VAX; MAC = media access control; EEPROM = electrically erasable programmable read-only memory; EPROM = erasable programmable read-only memory; SRAM = static RAM; SSC = system support chip; SIA = serial interface adapter; LANCE = Local Area Network Controller for Ethernet; XCI = XMI chip interconnect.)

The microprocessor subsystem contains the CMOS VAX (CVAX) processor, system support chip (SSC), read-only memory (ROM), Ethernet address programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), and random-access memory (RAM). The microprocessor subsystem provides an internal, high-speed CDAL bus so that the CVAX processor can fetch its instructions and execute them without being delayed by the other controllers on the module. The firmware is stored in EEPROM, but is copied to RAM for execution. The boot ROM contains the initialization code and diagnostics. This subsystem also provides a console interface through the SSC for diagnostics, module debugging, and network monitoring.

The DMA and shared memory subsystem provides the means of communication between the CVAX processor and the other subsystems. The devices arbitrating for this shared memory are the CVAX processor, the gate array, and the Local Area Network Controller for Ethernet (LANCE) chip.

The XMI interface subsystem contains the XMI network adapter (XNA) gate array and the XMI corner. The XNA gate array is the data-move engine for the DEMNA adapter and contains all the XMI-required registers.


The Ethernet subsystem contains the LANCE chip, the serial interface adapter (SIA) chip, and various bus interface logic modules. The Ethernet subsystem receives packets from the Ethernet and stores them in the shared memory. When transmitting a packet on the Ethernet, the LANCE chip gets the packets from shared memory and transmits them on the Ethernet.

Design
The design of the DEMNA adapter was influenced by many factors, including previous adapter design experiences, available hardware such as Ethernet chips, and system requirements. The DEMNA team was assigned the following tasks:

• Produce a working Ethernet adapter that could be used by operating systems such as VMS, ULTRIX, ELN, and custom operating systems on hardware configurations that use the XMI bus as a system bus or an I/O bus
• Deliver high performance, measured by the amount of Ethernet bandwidth supported at various packet sizes, with minimized host overhead
• Supply debugging features for design verification and field maintenance of the adapter

First, we reviewed previous adapters to determine what improvements could be made. We learned that a complex host interface complicated host software and adapter firmware and greatly affected performance. One of these adapters, the Digital BI Ethernet Network Adapter (DEBNA), implemented a generic port interface that used interlocked queues containing a queue entry with a buffer name that indexed into a buffer descriptor table (i.e., an additional level of indirection). In addition to the firmware complexity, the hardware was not well suited to a complex port interface.

Another area in which improvements could be made over previous Ethernet adapters was the amount of processing performed by the host processor during receive packet filtering, address translation, and buffer copies. Overall system performance improves if this processing can be reduced by performing part or all of these functions in the adapter. This difference transforms the adapter from a dumb adapter (much of the data link processing performed by the host) to an intelligent adapter (much of the processing performed by the adapter).

The results of our analysis of older Ethernet adapters led us to choose a design that employs a simple host interface, off-loads the host whenever possible, uses rings instead of queues, and supplies the address of the buffer directly with the ring entry rather than indirectly through another data structure. The design of the adapter was now consistent with the needs of the new VAX 6000 and VAX 9000 systems. These systems, characterized by increased host processor speeds, needed increased I/O performance. The task of the DEMNA team was to fill that need for Ethernet I/O.

Type of Adapter
The DEMNA product is a store-and-forward adapter, i.e., it copies data to and from host memory by way of temporary storage on the adapter. This data transmission differs from that of a cut-through adapter, in which data flows directly between host memory and the transmission medium. However, the DEMNA adapter is actually able to gain some of the benefits of cut-through on the receive side.

Host Interface
We designed a simple host interface, using rings instead of queues. Interrupts to the host were kept to a minimum, from one interrupt per packet at light loads to a fraction of that number under heavy loads. As seen in Figure 2, the port and the port driver (host) share the following data structures, which reside in host memory:

• Port data block. This structure gives the port the location of the rings and page tables in host memory and is a repository for error information.
• Command and receive rings. These rings contain information describing outstanding command and transmit requests and buffer information for receive buffers.
• Transmit, receive, and command buffers. These buffers contain packet data and command data.

These data structures constitute the primary means of communication and data transfer between the port and the port driver. Control status registers (CSRs) are provided for port poll demand registers, XMI context, and port initialization.

Two rings are used in the host interface: the command ring and the receive ring. Each ring consists of 1024 bytes of physically contiguous memory, and each ring contains entries that describe a buffer or a set of buffer segments (when chaining transmit buffers).
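A rough rendering of these shared structures in C may help; the field names and widths below are guesses for illustration and do not reproduce the actual DEMNA layout. The point to notice is that the buffer address travels directly in the ring entry, avoiding the DEBNA-style indirection through a buffer descriptor table.

    /* Illustrative only; not the actual DEMNA layout. */
    #include <stdint.h>

    #define RING_BYTES 1024  /* each ring: 1024 physically contiguous bytes */

    typedef struct ring_entry {
        uint32_t owner;      /* ownership flag: who may process this entry */
        uint32_t buf_addr;   /* buffer address carried in the entry itself */
        uint16_t buf_len;
        uint16_t status;     /* completion status written back             */
    } ring_entry_t;

    typedef struct port_data_block {
        uint32_t cmd_ring_pa;    /* location of the command ring        */
        uint32_t rcv_ring_pa;    /* location of the receive ring        */
        uint32_t page_table_pa;  /* page tables for virtual addressing  */
        uint32_t error_info[8];  /* repository for error information    */
    } port_data_block_t;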


Figure 2 DEMNA Port Interface
(The figure shows users such as LAT, DECnet, and MOP above the port driver on the host; the driver and the DEMNA port exchange commands, transmits, and receives through the rings and data buffers, with the XMI on one side and the Ethernet on the other.)

The number of entries in the receive ring is fixed, since each entry points to a single contiguous buffer. The size of each transmit ring entry is variable and is fixed at initialization time.

The port and port driver process the entries in each ring in sequential order, starting with the first entry. A ring entry can be processed only by its owner. When the last entry in the ring is reached, processing starts again with the first entry.

Host interrupts are minimized by using a ring release function, which counts the number of ring entries processed for completion by the port and the port driver. The port driver counts the number of completed entries and writes this count to a completion CSR when it has finished processing all the completed transmit and receive ring entries. The port maintains the same count and issues another interrupt whenever it sees that its count and the count last written by the port driver are different. This function ensures that the port driver is interrupted only when it stops processing the rings because there is nothing else to process. The port driver can process multiple completed transmits and receives after each interrupt as well. Thus, no spurious interrupts are issued and the number of interrupts is reduced by processing multiple completions at once.
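The ring release counting can be modeled in a few lines. In the C sketch below, every name is invented and the CSR and interrupt mechanics are reduced to stubs; only the counting discipline is taken from the text: the port interrupts when the driver is idle and a new completion appears, or when the driver's single release write shows that it fell behind.

    #include <stdint.h>

    static uint32_t port_count;  /* entries completed by the port          */
    static uint32_t host_count;  /* last count the driver wrote to the CSR */

    static void post_host_interrupt(void) { /* stands in for the XMI */ }

    /* Port side: after completing a ring entry.  If the driver had
     * caught up, it is idle and needs an interrupt; otherwise it is
     * still processing and will find this entry on its own. */
    void port_complete_entry(void)
    {
        int driver_idle = (port_count == host_count);
        port_count++;
        if (driver_idle)
            post_host_interrupt();
    }

    /* Driver side, as seen by the port: everything serviced is
     * released with one completion CSR write.  If more entries
     * completed meanwhile, the counts still differ and the port
     * interrupts again, so no completion is stranded. */
    void csr_completion_write(uint32_t count)
    {
        host_count = count;
        if (host_count != port_count)
            post_host_interrupt();
    }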


Adapter Design
The firmware is written in VAX MACRO code. An alternative was to use MACRO for the transmit and receive paths and a higher-level language for initialization, shutdown, and error handling. However, this approach was not chosen because it complicates the interface and would have resulted in firmware size difficulties.

CVAX RAM (used by the CVAX processor exclusively) consists of 256 kilobytes and contains the firmware and data structures (the firmware is copied to RAM during self-test). Smaller RAMs would have been slightly less expensive but would have complicated the firmware update procedure and limited the ability of the firmware to use the large data structures needed for receive packet filtering. Shared RAM (shared by the CVAX processor and the LANCE chip) consists of another 256 kilobytes. This RAM contains the transmit and receive buffers as well as the LANCE transmit and receive rings. There is a vast amount of buffering space here, so the DEMNA device can tolerate a considerable amount of inattention from the host before being forced to discard incoming receive packets.

Erasable programmable read-only memory (EPROM) consists of 128K bytes for diagnostics and firmware boot code, including a backup copy of sufficient operational firmware to allow an update of EEPROM for initial load or subsequent update. EEPROM consists of 64K bytes for operational firmware, diagnostic patches, and error history data.

The gate array (data mover) handles the data-move and quadword read/write operations. The data-move operations transfer buffers between the host and shared RAM. The quadword read/write operations are used for control functions, such as reading ring entries, reading address translation information, and writing ring status on completion. Once the firmware initiates a data-move operation, other work is performed by the firmware while the data move progresses.

Interrupts are very costly; therefore, we chose to limit the number of interrupts fielded by the CVAX processor. A LANCE interrupt costs CVAX interrupt overhead, plus a LANCE CSR access, plus some normal interrupt overhead to save and restore registers. A data-move interrupt is less costly, but the firmware can be coded so that the data-move operation is usually complete, thus eliminating the need for the interrupt. Polling is performed for all LANCE- and data-move-related functions, but interrupts are used for local console I/O and error events.

Driver Design
The DEMNA team needed to design a driver that would be compatible with existing drivers but that would use all the features provided by the adapter. For VMS systems, this meant using the set of common routines that provide much of the data link functionality of the driver, but avoiding packet filtering. Another goal was to limit the copying of data by passing requests directly to the adapter.

For ULTRIX systems, the driver runs at a lower level with respect to packet filtering, so it cannot take advantage of this feature. However, buffer chaining is used on the transmit side. As a transmit request traverses the various software layers, it accumulates buffer segments which the driver has to concatenate into a transmit frame. To avoid buffer copies in all but the extreme and infrequent cases, the driver then passes up to 11 buffer segments to the adapter.

To allow customer-written drivers for special applications, we documented the interface to make it readily available to customers.

Debug Tools
The adapter has a very simple mission in life: to transmit and receive packets. To verify operation, some debug tools are needed. The goal for the DEMNA team was to provide extensive debug tools both in the operational firmware and in standalone user tools. This design would allow debugging and verification in the development lab and in other, less-controlled environments. These debug tools are discussed further in the Visibility section.

Implementation
This section describes the implementation of the DEMNA adapter through its major functional blocks:

• Scheduler
• Port processing
• Command processing
• Transmit task
• Receive task
• Console task
• Monitor task

Scheduler
The scheduler is a round-robin routine that simply checks for work, does it, checks for work, does it, etc. There are no context switches, but some context is maintained in registers and shared by all routines. The scheduler, when idle, consists of about 18 VAX MACRO instructions. Transmit and receive tasks are given higher priority by duplicating their scheduler entry. When not idle, one pass of the scheduler processes four packets.
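The scheduler's structure is simple enough to show directly. The C sketch below is a toy rendering under stated assumptions: the task bodies are empty stubs, the per-pass packet bounding is omitted, and the duplicated transmit and receive entries reproduce the priority trick described above.

    /* Toy rendering of the round-robin scheduler; all names invented. */
    typedef void (*task_fn)(void);

    static void transmit_task(void) { /* check for transmit work, do it */ }
    static void receive_task(void)  { /* check for receive work, do it  */ }
    static void port_task(void)     { /* init/shutdown/error handling   */ }
    static void console_task(void)  { /* parse and answer console input */ }
    static void monitor_task(void)  { /* low-priority debug/monitoring  */ }

    void scheduler(void)
    {
        static task_fn const tasks[] = {
            transmit_task, receive_task,   /* listed twice: higher     */
            transmit_task, receive_task,   /* priority than the others */
            port_task, console_task, monitor_task,
        };
        for (;;)   /* no context switches; shared context stays in registers */
            for (unsigned i = 0; i < sizeof tasks / sizeof tasks[0]; i++)
                tasks[i]();
    }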


Port Processing
Port processing controls adapter initialization and shutdown, LANCE initialization and restart, fatal adapter error handling, gate array error handling, and miscellaneous host interface functions. This task also handles firmware updates of EEPROM.

Command Processing
The command ring usually contains transmit buffers, which can contain commands for special functions. These commands are included in the command ring to allow the port driver to synchronize control requests with transmit requests, e.g., user startup and stopping. Command processing routines are called by the transmit task after the command buffer has been read from host memory. The commands consist of user startup (consisting of user context such as protocol type, packet format, physical address to use, and multicast addresses to enable), user stopping, read counters, and a set of maintenance commands.

Transmit Task
The transmit task copies a packet from the host memory to adapter buffer memory and tells the LANCE to transmit it onto the Ethernet (store and forward). After the LANCE has completed the request, the firmware writes transmit status to the command ring entry, signifying completion of the transmit.

To minimize service time, the code in the transmit path was carefully scrutinized. The number of checks and branches was minimized for the optimized path. The optimized path through the transmit code is the 30-bit virtual addressing path, which is the most used. However, the 40-bit physical addressing path still results in better throughput, because this path does not require any address translations, which are time-consuming. The instruction sizes were shortened when possible, using word instructions instead of longword instructions, to reduce the amount of instruction prefetch by the CVAX processor. Routines were placed on quadword boundaries to maximize cache efficiency. When waiting for data moves to complete (getting the transmit buffer from host memory) or obtaining address translation information from the host, the firmware was designed to perform other functions to increase the probability that the operation would be complete when the firmware needed it.

Receive Task
The receive task has the simple job of handing received packets to the port driver. This task is complicated by the need to off-load from the host part of receive processing (including packet filtering, packet validation, maintenance of counters, and processing MOP messages) and to make duplicates of packets when more than one user has requested a copy. It is further complicated by the need to provide buffering, which the port driver uses so that it need not supply large numbers of buffers. For enhanced performance, the firmware deals with receive packets in small groups (192 bytes) to allow the benefit of cut-through on larger packets.

Packet filtering is done for the destination address and for user type: protocol type for Ethernet frames, destination service access point (DSAP) field for 802 frames, and protocol identifier (PID) value for 802 subnetwork access protocol (SNAP) packets. Additional filtering is done for users who request all traffic or all multicast traffic. Filtering is done by maintaining a 64-bit user mask, which accumulates the list of users who want a copy of the packet according to the characteristics of the packet and what each user has requested.

Packet validation consists of length checks for Ethernet frames (if the user is using a length field after the protocol type) and for 802 frames. This saves the driver a little work. Additionally, users can request only packets smaller than a selected size; the adapter discards packets that exceed this size.

The cut-through feature adds complexity and reduces throughput on small packets, but provides many benefits for larger packets. When a packet larger than 192 bytes is received, the packet filtering and validation of all but the length are done for the first segment. This segment is then copied into the host buffer, and subsequent segments are copied appropriately. The last segment completes the packet validation and cyclic redundancy check (CRC). The difficulty occurs when the packet validation fails or an error is detected, because the packet is discarded and the context for the now-free receive buffer has to be restored. The firmware elects to save as little context as possible for each packet and to regenerate buffer context after the error, i.e., fetching the ring descriptor anew and redoing the address translation.

Console Task
The console task accepts and parses console commands and displays the requested data. There are two means of accessing the console: local and remote. The local console is accessed by a terminal connected directly to the DEMNA adapter. The remote console is accessed through MOP console carrier commands directed at the adapter from another system. A remote console may also be used to access a DEMNA device on the local system (coming in through transmit instead of receive). The firmware does not distinguish between transmit or receive operations from remote consoles. The console block accepts the commands and decodes them, and the monitor block determines the status. The monitor block passes this status back to the console block, where it is formatted and displayed on the screen.

Due to code size limitations in the EEPROM, compressed versions of the console screens are stored in the EEPROM. At initialization time the screens are uncompressed and stored in the RAM. (The screen compression saved 5 kilobytes in the EEPROM.) To easily set up and maintain the screens, especially since they often changed during the project, the screens were set up in separate text files. The fields in the screen were coded with different data types, such as date or longword. The screen was then put through a PASCAL program to convert it to a VAX MACRO data structure and compress it.

The local console and the remote console can be run simultaneously. They have separate input and output buffers, the same decode and formatting code, and different input and output methods.

The remote console uses the MOP console carrier, coming in on transmit or receive. The command/poll and response/acknowledge commands are sent by the MOP program, i.e., either the network control program (NCP) or a user program that implements the MOP console carrier. The console code extracts the input characters from the command/poll packet and returns a response/acknowledge packet with any available data from the remote console output buffer. When a command has been entirely received, it is decoded and executed and the response placed in the remote console output buffer, which is sent back to the user in response/acknowledge packets.

The local console is a terminal directly connected to the DEMNA device and interfaced through the SSC universal asynchronous receiver transmitter (UART). This terminal connection receives and transmits one character at a time. Characters are collected into the local console input buffer, and complete commands are parsed and executed. Response data is placed in the local console output buffer. The local console uses interrupts to signal when a character has been typed or when the UART is ready to transmit another character. These are the only interrupts used on the module, except for error interrupts. Since console interrupts are relatively infrequent, they are less costly than polling.

Monitor Task
The monitor facility operates mainly during receive or transmit. It also runs as a low priority entry in the scheduler to deal with debugging and verification activities (when debugging firmware is enabled).

Performance
As stated previously, the primary goal of the DEMNA adapter was high-speed performance, i.e., this adapter would not create a bottleneck when placed in a system. The major performance metrics we identified were throughput, service time, latency, and reliability.

• Throughput is the number of packets or bytes of packet data that can be transmitted or received per unit of time.
• Service time is the time a packet spends in each stage along its path from source through host software and driver, through adapter, over wire, through adapter, and through driver and host software to the destination.
• Latency is another measure of service time. It is a measure of delays encountered by queue depths of more than one at various points.
• Reliability is measured as the probability of packet loss under a receive load. It is also measured as adapter buffering and host buffer allocation effectiveness. For some protocols, recovery from packet loss takes a significant amount of time, and the loss of a packet may be quite noticeable to a user. Hence, recovery is related to a user's perception of reliable operation.

The performance goal of the DEMNA team was to minimize the service time through the adapter to maximize throughput. This is most critical for small packet sizes. If the service time is greater than the time it takes to transmit or receive a packet, then queue depths increase, increasing latency for subsequent packets. Small packets are critical because, obviously, they take less time to transmit or receive.

The speed of the Ethernet wire and the XMI bus must also be considered. The Ethernet operates at 10 megabits per second. The available bandwidth into memory and the capacity of the XMI are much greater; thus, the Ethernet is the limiting factor. To maintain maximum throughput, the DEMNA device must write and read packets to and from host memory at a speed equal to or greater than that of the Ethernet wire. If this speed is obtained, then the service time of the DEMNA adapter must be less than the time it takes to transmit or receive one 64-byte (small) packet to or from the Ethernet wire to maintain maximum throughput at all packet sizes.

Hardware Performance
The primary hardware factors influencing adapter performance are CVAX performance, DMA engine throughput, and bus contention.

The gate array DMA engine can sustain between 11.5 and 13.5 megabytes per second on a VAX 6000 system. When transferring packet data (and attendant host ring processing), the firmware can sustain about 5.8 megabytes per second. This is the approximate rate at which the firmware would deliver a burst of large packets that had been stalled due to a lack of receive buffers.
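For scale, an observation from the stated figures rather than an additional measurement: 10 megabits per second is 1.25 megabytes per second, so the firmware's sustained 5.8 megabytes per second into host memory is more than four times the Ethernet wire rate. This margin is what lets a backlog of stalled packets drain quickly once the host supplies buffers.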


The CVAX chip used is the 60-nanosecond variant (the same one used in the VAX 6000 Model 310 processor). As seen in Figure 1, the processor runs on its own internal CDAL bus, which has RAM containing firmware and private data structures. Thus the processor does not contend for the same bus as the gate array and the LANCE chip. However, the CVAX processor does touch shared memory and gate array registers; therefore the possibility of contention is significant. Logic analyzer measurements indicate that about 14 percent of CVAX cycles are consumed while waiting for access to the shared memory bus for minimum size packets. For large packets the consumption is 33 percent, but the cycles needed are considerably less than the remainder. The effect on the gate array accounts for part of the difference between the speeds of 11.5 to 13.5 megabytes per second and of the 5.8 megabytes per second mentioned above.

Firmware
Throughput is limited by the Ethernet bandwidth for packet sizes greater than 88 bytes. The average packet size on Ethernet is approximately 150 to 450 bytes per packet for a mix of DECnet, LAT, and cluster traffic. Table 1 represents the throughput that the host software can see, given sufficient host computes. These numbers show what might be expected. Virtual addressing costs some performance, and receive filtering accounts for most of the difference between transmit and receive.

It is interesting to look at the number of instructions executed by the CVAX processor for each receive and transmit packet as the measure of how much work must be done for each packet. These instruction counts are for minimum size packets in virtual address mode and increase slightly with increasing packet sizes.

For a transmit, the number of instructions required was about 134, consisting of 5 instructions for work done in the scheduler to determine initial transmit context, 77 instructions for the data transfer from host memory, 18 instructions to get the LANCE chip to begin transmitting, and 34 instructions to process packet completion and to update status in the transmit ring entry in host memory.

For a receive, the number of instructions required was about 160, consisting of 5 instructions for work done in the scheduler to determine initial receive context, 40 instructions to deal with the LANCE operations, 20 instructions for packet filtering, 65 instructions for the data transfer to host memory (including some time spent finding a user and validating the packet length), and 30 instructions for the prefetch of the next receive ring entry.

Some throughput was traded off in the interest of reducing adapter-added latency. By processing receive packets in groups of 192 bytes, the latency contribution for any packet size is much smaller than it would be if all the packet processing occurred after the packet had been fully received. Thus the time between the end of a packet on the wire and the host interrupt is fairly constant from 64- to 1518-byte packets, 50 to 70 microseconds.

Table 1 DEMNA Throughput (packets per second)

Packet Length   Ethernet   LANCE     Transmit   Transmit   Receive   Receive
(bytes)         Maximum    Maximum   Virtual    Physical   Virtual   Physical
64              14880      14662     13181      14633      12468     12918
72              13586      13404     12592      13361      12254     12830
80              12500      12345     12247      12340      11813     12227
88              11574      11441     11432      11438      11441     11441
96              10775      10660     10656      10658      10660     10660
112              9469       9380      9380       9380       9380      9380
128              8445       8374      8374       8374       8374      8374
256              4528       4508      4508       4508       4508      4508
512              2349       2344      2342       2344       2344      2344
1024             1197       1195      1195       1195       1195      1195
1518              812        812       812        812        812       812
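The Ethernet Maximum column can be checked against the timing of the medium. Assuming the standard 8-byte preamble and 9.6-microsecond interframe gap (taken from the Ethernet specification, not from the table), a 64-byte frame occupies (64 + 8) x 0.8 = 57.6 microseconds at 10 megabits per second, plus the 9.6-microsecond gap, or 67.2 microseconds in all; 1,000,000 / 67.2 gives about 14,880 packets per second. The same arithmetic for a 1518-byte frame gives (1518 + 8) x 0.8 + 9.6 = 1230.4 microseconds, or about 812 packets per second, matching the last row.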


Reliability
Reliability, or probability of loss, is measured by how large a burst of traffic the adapter can withstand at the maximum receive rate and deliver these packets to the host without losing any. Adapter reliability was measured at various packet sizes. A burst of 5 seconds without packet loss was considered to be of "infinite" duration.

Table 2 shows that the DEMNA adapter can survive a significant burst of activity without packet loss. Such activity is unlikely, but possible, depending on the application being run and on the network configuration.

This testing does not measure how host software performs buffer allocation for a user application or for the adapter as a whole. For the latter, the DEMNA adapter accounts for any lack of buffering by the host by not discarding a packet if a buffer is not immediately available. Instead, it waits up to three seconds for the host to supply a buffer.

Visibility
A system user looking at the operation of the network sees three areas of complexity: the system software, the network controller, and the network. When everything is working well, there is little need to look at any of these areas, except perhaps to predict future operation (by extrapolating network utilization or system usage) or to confirm that the system is indeed running well. When the system is not running well, visibility into these areas is crucial to understanding what is wrong and how to correct it. The console and monitor facilities were built into the adapter firmware from the outset; we knew that the visibility was crucial to adapter debugging and verification and would later be helpful to users.

System Operation
The console displays XMI utilization as apportioned among the XMI devices. This data comes from sampling done by the firmware of the "last XMI node active on the bus." From this, the user can estimate total XMI utilization.

The console also displays buffer occupancy on the adapter for transmit and receive, user configuration as to protocol type and characteristics, buffer availability counters, and host interrupt counters. This data indicates how the system is running, i.e., whether sufficient buffers are allocated to the device and to each user of the device. These counters also indicate how much attention the driver is paying to the adapter. For example, if the system is not tuned properly, the adapter may be generating fewer than normal interrupts (because queuing delays are affecting the system operation). These queuing delays can be seen in the firmware counters, which monitor the depth of adapter queues and the ability of the adapter to give receives to the host; i.e., buffering on the adapter has been used to compensate for queuing delays in the host.

Adapter Operation
When the adapter is not malfunctioning, visibility into adapter utilization is important. The console displays program counter (PC) sampling results for the firmware, showing how busy the adapter is and where time is being spent. When looking at the I/O subsystem as a whole, it is important to know how much the adapter is contributing to queuing delays, buffer occupancy, and added latency. This adapter operation can be seen by looking at how busy the adapter is and how many buffers it has outstanding.

For adapter failure or problems on the XMI, the console displays error information which has been saved in EEPROM. This error data consists of fatal error context, data transfer or XMI error context, and results of self-test.

Network Operation
The DEMNA device normally sees all packets on the wire (excluding packets less than 64 bytes in length [runt packets] and collision fragments). When looking at the adapter operation through the console facility, the user sees current network utilization

Table 2 DEMNA Receive Burst Tolerance

Packet Length   Burst Virtual   Burst Virtual    Burst Physical   Burst Physical
(bytes)         (packets)       (microseconds)   (packets)        (microseconds)
64              3250            221661           3843             262106
72              5116            381677           11591            864741
80              9917            803321           Infinite         Infinite
88              Infinite        Infinite         Infinite         Infinite
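As a rough cross-check, not a figure from the measurements: at the 64-byte maximum of 14,880 packets per second, each packet occupies about 67.2 microseconds of wire time, so a burst of 3,250 packets lasts roughly 3,250 x 67.2 = 218,400 microseconds, consistent with the 221,661 microseconds reported for the virtual-addressing case.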


and network error information. For transmit errors, the console displays the number of errors and the date and time of the last occurrence. For receive errors, the console displays the number of errors, date and time, source address, and protocol type. Additionally, receive errors that are not counted (because they do not pass receive filtering) are displayed. For example, error information is displayed for a node generating packets with CRC errors regardless of the destination of these packets.

The console also provides the command SHOW NETWORK to display network utilization in node addresses and protocol types. For this command, the receive firmware calls a monitor facility routine for each packet seen on the wire. This routine maintains statistics for each source and destination node address, consisting of the number of packets and the number of bytes. At three-second intervals, the console calls a monitor routine which adds statistics over the prior interval to cumulative data for each node, collects top nodes and protocol data, and clears the interval data to prepare for the next three seconds of monitoring. Figure 3 represents a sample network monitoring display.
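The bookkeeping behind SHOW NETWORK can be sketched briefly. In the C fragment below, the structures and routine names are invented; only the per-packet update and the three-second roll-up come from the text.

    #include <stdint.h>

    typedef struct node_stats {
        uint8_t  addr[6];
        uint32_t interval_packets, interval_bytes; /* current interval */
        uint64_t total_packets,    total_bytes;    /* cumulative data  */
    } node_stats_t;

    /* Called by the receive firmware for every packet seen on the
     * wire, once for the source node and once for the destination. */
    void monitor_packet(node_stats_t *node, uint32_t length)
    {
        node->interval_packets++;
        node->interval_bytes += length;
    }

    /* Called by the console every three seconds: fold the interval
     * into the cumulative data and clear it for the next pass.  Top
     * nodes and protocol data would be ranked here as well. */
    void monitor_rollup(node_stats_t nodes[], int n)
    {
        for (int i = 0; i < n; i++) {
            nodes[i].total_packets += nodes[i].interval_packets;
            nodes[i].total_bytes   += nodes[i].interval_bytes;
            nodes[i].interval_packets = 0;
            nodes[i].interval_bytes   = 0;
        }
    }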

19.117 - Network - 21-APR-1991 11:29:38

3000002 usecs 21.6% NI                         00:00:33 19.7% NI
#  User          Pks/Sec  Byt/Pk  %NI-Cur  Packets  Bytes(k)  %NI-Tot
1  60-07 NISca       571     214    10.7%    15019      3041     8.1%
2  60-03 DECnet      177     645     9.4%     6358      4021    10.0%
3  60-04 Lat         167      64     1.1%     5765       379     1.2%
4  60-02 MopRC        18      87     0.1%      659        56     0.1%
5  80-41 LAST          7      82     0.0%      206        17     0.0%
6  FE-00               2     271     0.0%       54        20     0.0%

#  Nodes         Pks/Sec  Byt/Pk  %NI-Cur  Packets  Bytes(k)  %NI-Tot
1  19.187            177     631     9.2%     5177      3530     8.8%
2  63.631            251     244     5.3%     6402      1551     4.0%
3  19.52             170     356     5.1%     3321      1410     3.5%
4  19.209             55     615     2.8%     2272      1455     3.6%
5  19.167             60     558     2.8%     2012       982     2.4%
6  19.117            112     300     2.8%     3227       873     2.2%
7  63.619            119     252     2.6%     2076       514     1.3%

KEY: usecs = microseconds; NI = network interconnect (transmission medium); %NI = percentage of available bandwidth utilized; NISca = network interconnect system communications architecture; MopRC = maintenance operations protocol remote console; LAST = local area storage transport; Pks/Sec = packets per second; Byt/Pk = bytes per packet.

Figure 3 Network Monitoring Display

Debug Tools
The monitor task provided other debugging functions during adapter debugging and internal field test. These functions are not visible features in the finished product. However, they are extensions to the functionality and illustrate the benefits of visibility into the adapter. A user program, XNAMON, was written to access the following functions.

• Traffic generation. It is difficult to generate heavy loads on an adapter, particularly because of logistics. Other systems are needed with enough processing power to generate the load. Using the XNAMON program, only one system was needed. XNAMON was run on it to direct other adapters to generate traffic to another node with a particular packet size at a specified rate.

Since traffic generation could be done regardless of system state (except for power on), there was always a good supply of traffic generators.

• Packet tracing. This function allowed a node to scan the receive stream for packets with selected source and destination addresses and protocol types. Either the packet header or the entire packet was saved for matching packets. This function was used extensively during initial debugging for validating transmit functionality. Later it was used for validation of MOP and related functionality by creating trace files on a known good node. We then ran functional scripts through a test generator, which used the traffic generator on one node to send a test packet to the node under test. The command and the response were traced by the trace node, and the test program collected the trace data and compared it against known good data. Packet tracing was also used to verify packet filtering by devising a test program that could start up particular user configurations and loop back any packets received.

• Adapter test. The ability to exercise a module under stress was critical to adapter hardware verification. The functionality in question was the Ethernet subsystem and the XMI interface through the gate array. The monitor facility provided this test functionality by doing MOP loopback operations to another node while doing various data transfer operations to host memory. Data compares were done on completed transactions to validate data integrity. The XNAMON program provided the interface for this function and the remote display of its results.

• Remote debugger. The access to DEMNA internals allowed remote adapter memory dumps and remote inspection of data structures while the adapter was running.

Conclusion
The DEMNA adapter meets the requirements of the VAX 6000 and VAX 9000 systems. In fact, the performance for small packets exceeds the capability of these systems. For larger packets, Ethernet bandwidth is the limiting factor. Our experience illustrates some advantages and disadvantages of choosing a firmware-based design over an interface implemented entirely in hardware.

Advantages of a Firmware-based Design
The advantages of designing an adapter in firmware are as follows:

• The firmware can usually off-load host computes by doing more pre-processing.
• The firmware can be changed easily (bug fixes or changes in functionality), thus reducing long-term maintenance and support costs. Also, changes can be made in the field by a firmware upgrade rather than requiring module rework at a manufacturing site.
• By designing in the firmware, designers can avoid software driver complexity and the necessity of hardware redesign.
• The firmware can provide powerful debugging mechanisms and tools.
• The firmware is very flexible. Changes to support hardware problems or additional off-load of host computes can be considered late in the design cycle. This may also allow new port architecture and addressing changes for creating new products.
• Firmware designs allow extensive functionality for lower product and development cost than a total hardware design.
• Firmware designs allow the hardware to be released earlier in the development cycle.

Disadvantages of a Firmware-based Design
The disadvantages of designing an adapter in firmware are:

• The adapter is generally more expensive, considering the cost of a microprocessor subsystem with enough computes for the job.
• The adapter is slower in terms of latency. Some applications may be more sensitive than others, given the same throughput, but may have slightly larger service times per packet. The effect can be viewed in terms of buffer occupancy: an adapter with lower latency may utilize, on average, fewer buffers.


The approach is not feasible for transmission media much faster than Ethernet, because the performance requirements of the microprocessor become extreme or the hardware assists for the microprocessor become too complex and costly.

Future Directions
Several characteristics distinguish future anticipated system design from current systems (such as the VAX 6000 and VAX 9000 systems):

• Increased host processor power

• Simplified bus design

• Increased I/O bandwidth requirements

Increased host processor speed moves the I/O bottleneck from the host to the I/O subsystem. To supply the I/O needs, the I/O subsystem must provide faster media, e.g., fiber distributed data interface (FDDI) in the near term, or multiple connections to slower media (such as Ethernet). The I/O adapters will be expected to provide significantly greater throughput with a smaller adapter contribution to latency. The effective performance of the system will be more sensitive to latency. For example, an application using a single-threaded command/response protocol is extremely dependent on the amount of service time through the I/O subsystem at each end. As the processing speed increases, application overhead is reduced and throughput becomes dominated by the service time of the adapter and the transmission time.

Faster processors place a greater burden on the system bus and I/O interface, which necessitates a simpler bus protocol. This might consist of eliminating costly functionality such as byte masking and interlocks. However, a simpler interface to the I/O adapter will require considerable change to the port protocol to ensure its efficiency.

The characteristics needed in future adapters are as follows:

• Greater throughput. This means more connections to a slower medium, such as a single adapter supporting multiple Ethernet connections. Or it means a faster medium. Additionally, configurations using Ethernet as point-to-point links will be more common, thus implying a heavier load on each Ethernet.

• Simpler host interface. This is necessitated by the simpler bus protocol. Bus overhead should be minimized, which includes the elimination of such functionality as page table access for virtual address translation. Also, the bus transfer size used by the adapter should be compatible with the basic data size of the system to avoid cache thrashing and unnecessary read-modify-write transactions.

• Reduced latency. The adapter should minimize its contribution to transmit and receive latency. This may mean reducing some of the functions done by an intelligent adapter on receive, in order to speed delivery to the host after packet reception is complete. These functions include packet filtering, handling of maintenance operations packets, length validation, and maintaining counter data. Improving packet filtering by host software would eliminate the reason for placing this function on the adapter in the first place. Filtering in host software is considerably more difficult than in the adapter. The difficulty comes from the need to deal with extreme user configurations. The DEMNA is bounded by limiting the users and node addresses. The extreme cases must still be done by host software.

Acknowledgments
The authors would like to acknowledge the following members of the DEMNA design team: Barbara Aichinger, Keith Bilafer, Mark Cacciapouti, Don Dossa, Linda Duffell, Bernie Hall, Jeff Huber, Helen McGreal, Jonathan Mooty, Dave O'Keefe, David Oliver, Brian Parr, Art Singer, Andy Stewart, Fred Templin, Vicky Triolo, Ed Tulloch, and Don Villani.

General References
DEC LANcontroller 400 Programmer's Guide (Maynard: Digital Equipment Corporation, Order No. EK-DEMNA-PG-001, 1990).

DEC LANcontroller 400 Console User's Guide (Maynard: Digital Equipment Corporation, Order No. EK-DEMNA-UG-001, 1990).

D. Mirchandani and P. Biswas, "Ethernet Performance of Remote DECwindows Applications," Digital Technical Journal, vol. 2, no. 3 (Summer 1990): 84-94.

D. Boggs et al., "Measured Capacity of an Ethernet: Myths and Reality," Proceedings of SIGCOMM '88 (ACM SIGCOMM, 1988): 222-234.

The Ethernet: A Local Area Network, Data Link Layer and Physical Layer Specifications, Version 2.0 (Digital Equipment Corporation, Intel Corporation, and Xerox Corporation, Order No. AA-K759B-TK, 1982).

Satish L. Rege

The Architecture and Implementation of a High-performance FDDI Adapter

With the advent of fiber distributed data interface (FDDI) technology, Digital saw the need to define an architecture for a high-performance adapter that could transmit data 30 times faster than previously built Ethernet adapters. We specified a first generation FDDI data link layer adapter architecture that is capable of meeting the maximum FDDI packet-carrying capacity. The DEC FDDIcontroller 400 is an implementation of this architecture. This adapter acts as an interface between XMI-based CPUs, such as the VAX 6000 and VAX 9000 series of computers, and an FDDI local area network.

Fiber distributed data interface (FDDI) is the second generation local area network (LAN) technology. FDDI is defined by the American National Standards Institute (ANSI) FDDI standard and will coexist with Ethernet, the first generation LAN technology.

The architecture and implementation presented in this paper are for the DEC FDDIcontroller 400, Digital's high-performance, XMI-to-FDDI adapter known as DEMFA. This adapter provides an interface between an FDDI LAN and Digital's XMI-based CPUs, presently the VAX 6000 and VAX 9000 series of computers. DEMFA implements all functions at the physical layer and most functions at the data link layer.

We begin the paper by differentiating between an architecture and an implementation. Then we present our project goal and analyze the problems encountered in meeting this goal. Next we give a historical perspective of Digital's LAN adapters. We follow this discussion by describing in detail the architecture and implementation of DEMFA. Finally, we close the paper by presenting some results of performance measurement at the adapter hardware level.

Adapter Architecture and Implementation
Before we discuss the DEMFA architecture and its implementation, it is necessary to understand what is meant by an adapter architecture and an implementation of that architecture. An adapter architecture specifies a set of functions and the method of executing these functions. An implementation that incorporates all of these functions and conforms to the method of executing these functions becomes a member of the adapter architecture family. Thus, for a given architecture, many implementations are possible.

To grasp the concept presented in the previous paragraph, consider the VAX CPU architecture. This architecture defines the instruction set, which is composed of a set of arithmetic, logical, and other functions, and a format for the instruction set that a processor should implement to be classified as a VAX computer. Examples of VAX implementations are the VAX 11/780 and the VAX 9000 computers, which both conform to the VAX CPU architecture.

Our Goal and the Problem Definition
Our goal was to define an architecture for an FDDI adapter that meets the ultimate performance goal of transmitting approximately 450,000 packets per second (packets/s). This goal is considered ultimate because 450,000 packets/s is the maximum packet-carrying capacity of FDDI. Note that this transmission rate is approximately 30 times greater than that of Ethernet, which can transmit approximately 15,000 packets/s.

Before defining the problem, the basic properties of XMI and FDDI must be understood.


XMI is a 64-bit-wide parallel bus that can sustain a 100-megabyte-per-second (MB/s) bandwidth for multiple interfaces. Each interface attached to the XMI bus is referred to as a commander when it requests data or a responder when it delivers data. XMI is an interconnect that can have transactions from several commanders and responders in progress simultaneously.

FDDI is a packet-oriented serial bus that operates using the token ring protocol and has a bandwidth of 100 megabits per second (Mb/s). FDDI is capable of transmitting packets as small as 28 bytes, which take 2.24 microseconds to transmit. Therefore, FDDI can carry approximately 450,000 minimum-size packets/s. The largest packet that FDDI can carry is 4508 bytes. The ANSI/IEEE 802.5 standard defines the FDDI operation; Digital has developed its own implementation of the FDDI base technology as a superset of the ANSI standard.

Our problem was to architect an adapter that could interface XMI, i.e., a parallel high-bandwidth CPU bus for VAX computers, to a serial fiber-optic networking bus. To avoid being the bottleneck in a system, such an adapter must be able to transmit or receive 450,000 packets/s.

ANSI defines the protocol for interfacing an FDDI adapter to an LAN. But we had to define the protocol between the adapter and the VMS and ULTRIX operating systems used by most VAX computers. Thus, solving the problem required us to architect a data link layer adapter that would satisfy both protocols and meet the maximum FDDI packet transfer capability.

Historical Perspective
The computer industry has built many LAN adapters since the inception of Ethernet ten years ago. The first LAN adapter built by Digital was the UNIBUS-to-NI adapter (UNA). (NI is Digital's alias for Ethernet.) The Digital Ethernet-to-XMI network adapter, known as DEMNA, is Digital's most recent Ethernet adapter.7

Let us choose the maximum throughput rate expressed in packets per second as a performance metric for LAN adapters. The historical perspective shows that the first adapter to meet the Ethernet packet-carrying capacity is the DEMNA. Therefore, it took approximately eight years and six generations for an Ethernet adapter to achieve this throughput rate. Consequently, many designers thought that our goal of meeting the ultimate FDDI packet-carrying capacity was impossible.

But the DEMFA architecture, a first generation FDDI data link layer adapter architecture, can meet the maximum FDDI packet-carrying capacity. In this sense, the DEMFA architecture is ultimate.

Traditional Adapter Architectures
In this section, we analyze the traditional adapter architecture and show that by using this architecture we could not meet our performance goal. Figure 1 is a block diagram of a traditional adapter, e.g., DEMNA. In such a design, a CPU in the adapter operates on every transmitted and received packet. Thus, using this traditional architecture to build an ultimate FDDI adapter would require a CPU capable of handling 450,000 packets/s. To predict the performance of such a CPU, we extrapolated from the performance data of the CPU used in DEMNA. This traditional adapter can handle approximately 15,000 packets/s using a CPU rated at 3 VAX units of performance (VUPs).
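The packet-rate arithmetic above is easy to verify from the two figures the text gives, the 100 Mb/s bandwidth and the 28-byte minimum packet; the small C program below (an illustration, not adapter code) reproduces the 2.24-microsecond transmission time and the approximately 450,000 packets/s ceiling.

    #include <stdio.h>

    int main(void)
    {
        const double bandwidth = 100e6;   /* FDDI bandwidth, bits/s      */
        const double min_bytes = 28.0;    /* smallest FDDI packet, bytes */

        double tx_time = min_bytes * 8.0 / bandwidth;   /* seconds */

        /* Prints "2.24 us per packet, 446429 packets/s", i.e., the
           approximately 450,000 packets/s ultimate rate. */
        printf("%.2f us per packet, %.0f packets/s\n",
               tx_time * 1e6, 1.0 / tx_time);
        return 0;
    }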

[Figure 1 shows a host bus interface, a host DMA engine, and a LAN DMA engine between the host bus and the LAN interface, with an adapter microprocessor, its microprocessor memory, and a buffer memory attached.]

Figure 1 Block Diagram of a Traditional Adapter


If we assume a linear model to extrapolate the performance of a CPU from DEMNA to DEMFA, an ultimate FDDI adapter would require at least a 90-VUP CPU. Such a CPU was neither available nor cost-effective for timely shipment of our adapter. Besides, it would be extravagant to use a 90-VUP CPU in an adapter whose host CPU may have a performance as low as 3 to 4 VUPs. Therefore, we looked for a different solution.

DEMFA Architecture
The DEMFA architecture is characterized by the following specifications for functionality and the means to achieve this functionality:

• As mentioned earlier, the DEMFA architecture implements all functions at the physical layer and a major subset of the functions at the data link layer.

• The architecture requires that this functionality be implemented in pipelined stages, which are used to receive and transmit packets over the FDDI ring without CPU interference.

• The architecture specifies a ring interface for communicating between the pipelined stages. Rings operate as queues that allow buffering between pipelined stages, enabling these stages to proceed in an asynchronous fashion.

• The architecture requires a packet-filtering capability in the pipelined stage nearest to the FDDI ring; this capability helps to minimize adapter and host resource utilization.

• The architecture specifies the DEMFA port, which minimizes the information transfer required to interact with the host operating system. This interaction takes place during both initialization and the normal operation of receiving and transmitting packets.

In the following sections, we elaborate on different features of the DEMFA architecture.

Pipelined Architecture with No CPU Interference
Once we determined that the traditional architecture of a CPU processing the packets could not meet our performance goal, we began to investigate alternative architectures. The requirement was to either process one receive packet or queue one transmit packet in a time period less than or equal to the time it takes to transmit on an FDDI ring. Thus, the device we architected must process 28-byte packets in less than 2.24 microseconds. A little thought will show that if we are able to meet the requirements for processing small packets at the FDDI bandwidth, then the requirements for larger packets can be easily met.

Our final choice was a three-stage pipeline approach which broke down the complexity of implementation while meeting our performance goal. As shown in Figure 2, the three stages of the pipeline in the adapter are the FDDI corner and parser (FCP) stage, the ring entry mover (REM) stage, and the host protocol decoder (HPD) stage. Figure 2 also shows two other functions required of the adapter: the buffering of packets, which requires a memory called the packet buffer memory (PBM) and a memory interface called the packet memory interface (PMI); and the local intelligence, also called the adapter manager (AM).

DEMFA Functions
This section presents brief descriptions of the DEMFA functions and the pipelined stages in which these functions are performed. This, according to our definition, is the DEMFA architecture. A later section, One Implementation of the DEMFA Architecture, describes an implementation in detail.

The FCP stage converts serial photons on the FDDI ring into packets and then writes the packets into PBM longwords, 32 bits at a time. The parser implements the logical link control (LLC) filtering functionality. This stage is also responsible for capturing the token on the FDDI ring, transmitting packets, and implementing the physical layer, e.g., media access control (MAC), functionality required by the FDDI standard.

The REM stage is responsible for distributing packets received over the FDDI ring to the host computer and to the AM. This stage also collects the packets from the host and the AM to queue for FDDI transmission.

The HPD stage interfaces with the XMI bus to move received packets from PBM to the host memory and to move transmit packets from the host memory to the PBM.

The PBM stores the packets received over the FDDI ring and the packets to be transmitted over the FDDI ring. It also stores the control structures required for accessing these packets. The PMI arbitrates the requests made by the three pipelined stages and the AM to access the PBM.


[Figure 2 shows the packet pipeline (the FDDI corner and parser stage, the ring entry mover stage, and the host protocol decoder stage) between the FDDI ring and the XMI bus; the memory subsystem (the packet buffer memory behind the packet memory interface), reached over the PMI and RMC buses; and the microprocessor subsystem (the adapter manager), attached through the HPD bus.]

Figure 2 DEMFA Block Diagram

The AM implements the functionalities of self-test and initialization in the adapter and also a subset of the SMT function required by the ANSI FDDI specification. The adapter manager performs no function in either the receipt or transmission of individual packets to the host.

We use ring interfaces to communicate within the adapter and between the adapter and the host. These interfaces are described in detail immediately following the next section.

Performance Constraints on the Pipelined Stages
Consider the three pipelined stages and their ring interfaces. At any time, the three independent stages are processing different packets. Thus, if the HPD stage is processing received packet 0, the REM stage may be working on received packet 4 and the FCP on received packet 7. Note that packets 1, 2, and 3 wait on a ring between the REM stage and the HPD stage. Similarly packets 5 and 6 wait on a ring between the FCP stage and the REM stage. The PBM must have enough bandwidth to service the three stages. It also must service them with low latency so that the first-in, first-out (FIFO) buffers in the FCP stage do not overflow.

By dividing the processing of a packet over the three stages and the ring interfaces used to queue packets between these stages, we reduced the complexity of the total adapter functionality. Any implementation of this architecture specification would consist of three loosely coupled designs that use ring interfaces to communicate with one another.

Each stage must process a packet in less time than it takes to transmit the packet on the FDDI ring. As we mentioned previously, this transmission time is 2.24 microseconds for the smallest packet. A larger packet may take longer to process than a small packet, but such a packet also takes longer to transmit on the FDDI ring.
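This constraint can be stated as a one-line budget: a stage may spend on a packet no more than the packet's own transmission time at 100 Mb/s. The helper below is a hypothetical illustration of that budget, not project code.

    /* Time available to each pipelined stage for a packet of the given
       size: the packet's transmission time at 100 Mb/s, i.e., 100 bits
       per microsecond. */
    double stage_budget_us(unsigned packet_bytes)
    {
        return packet_bytes * 8.0 / 100.0;
    }

    /* stage_budget_us(28)   == 2.24 us   (smallest packet)
       stage_budget_us(4508) == 360.64 us (largest packet)  */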


Thus, to meet our performance goal, we architected a three-stage pipeline implementation, with each stage meeting a packet-processing time dependent upon the packet size. In addition, our architecture specified a PBM with sufficient memory bandwidth to service the asynchronous requests from the three stages with minimal latency.

Ring Interface - The Core of the DEMFA Architecture
The ring interface forms the core of the DEMFA architecture. An interface is necessary to exchange data between the adapter and the host computer and also between the different stages and functional units of the adapter. Such an interface usually consists of a data structure and a protocol for communication. We evaluated various data structures, including a linked list or queue data structure, and found that a ring data structure is efficient to manipulate and would be easy to implement in state machines, if desirable.

Implementation of Ring Structures
Ring structure implementation requires a set of consecutive memory addresses, as shown in Figure 3. The ring begin pointer and the ring end pointer define the beginning and end of a ring. Two entities, the transmitter and the receiver, interface with a ring to exchange data. The transmitter interface delivers data to the receiver interface using the ring structure. This data resides in memory that is managed by one of the two interfaces. If the transmitter interface manages the memory, the ring is called a transmit ring. If the receiver interface manages the memory, the ring is called a receive ring.

[Figure 3 shows a ring bounded by the ring begin pointer and the ring end pointer, with a transmitter interface and a receiver interface attached; each entry in the ring carries an ownership bit, a buffer address, and other fields.]

Figure 3 Ring and Ring Interfaces

Rings are divided into entries that consist of several bytes each; the number of bytes in an entry is an integral multiple of longwords. A ring, in turn, must contain an integral number of entries. The entry size and the number of entries in a ring determine the ring size. We chose an entry size that is a power of two in bytes and the number of ring entries to be divisible by two, as well. These choices helped to simplify the hardware implementation used to peruse these rings.

Each entry consists of

• An ownership bit, which indicates whether the transmitter interface or the receiver interface owns the entry

• Buffer pointers, which point to transmitted or received data

• A buffer descriptor, which contains the length of the buffers, and status and error fields

The definitions of these fields in an entry and the rules for using the information in these fields constitute the ring protocol.

Only the interface that owns an entry has the right to use all the information in that entry. This right includes using the buffer pointers to operate on data in the buffers. Both interfaces have the right to read the ownership bit, but only the interface with ownership may write this bit.

The two interfaces can exchange entries by toggling the ownership bit. After toggling this bit, the transmitter and receiver interfaces need to prod each other to indicate that the ownership bit has been toggled. This is accomplished using two hardwired Boolean values, by means of an interrupt, or by writing a single-bit register. Hardwired Boolean values are used when both the transmitter and the receiver are on the adapter. Either the interrupt scheme or the method of writing a single-bit register is used when the transmitter and receiver converse over an external bus, e.g., an XMI bus.

The word "signal" is used henceforth to represent the prodding of one interface by the other. A transmitter interface uses "transmit done" to signal the receiver interface that data has been transmitted. A receiver interface uses "receive done" to signal the transmitter interface that the data has been received. Note that we have defined the DEMFA port protocol in such a way that the number of interrupts used to signal the host across XMI is minimized to reduce the host performance degradation caused by interrupts.
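A ring entry of this kind might be declared as shown below. The field names and widths are assumptions made for illustration; the paper does not give the exact layout, and the SOP and EOP bits of the buffer descriptor are described in the text that follows.

    #include <stdint.h>

    #define OWNER_TRANSMITTER 0u   /* ownership-bit values; see the */
    #define OWNER_RECEIVER    1u   /* key of Figure 4               */

    /* One ring entry: the ownership bit, a buffer pointer, and a
       buffer descriptor carrying length, status, and error fields. */
    struct ring_entry {
        uint32_t own    : 1;    /* ownership bit                    */
        uint32_t sop    : 1;    /* start of packet                  */
        uint32_t eop    : 1;    /* end of packet                    */
        uint32_t status : 13;   /* status and error fields          */
        uint32_t length : 16;   /* buffer length in bytes           */
        uint32_t buffer;        /* pointer to the data buffer       */
    };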


The unit of data exchanged between the transmitter interface and the receiver interface is a packet. A packet may be written in a single buffer if the packet is small or over multiple buffers if the packet is large. In this paper, we use the term buffer to refer generically to buffers in the adapter or in the host. The buffers in the adapter are always 512 bytes in size and, when referred to specifically, are called pages. The buffers in the host may be of different sizes.

An exchange of data requires single or multiple buffers, depending upon the packet and buffer sizes. One field of two bits in the buffer descriptor is used to designate the beginning and end of a packet. These bits are called the start of a packet (SOP) and the end of a packet (EOP). Thus, for a one-buffer packet both the SOP and the EOP are asserted. For a multiple-buffer packet, the first buffer has the SOP asserted, the middle buffers have both the SOP and the EOP deasserted, and the last buffer has the EOP asserted. The buffer descriptor also contains fields that we do not describe in this paper.

Data Exchange on a Transmit Ring
Data exchange between a transmitter interface and a receiver interface is accomplished in a similar manner on both transmit and receive rings. Therefore, we discuss the exchange in detail for a transmit ring; for a receive ring, we note only the dissimilarities.

The events that occur during the data exchange on a transmit ring are shown in Figure 4. The process is as follows. The transmitter interface manages the memory used to exchange data and has two pointers to the ring entries, i.e., the fill pointer and the transmitter free pointer. The transmitter interface uses the fill pointer to deliver data to the receiver interface. The transmitter interface uses the transmitter free pointer to recover and manage the buffers freed by the receiver interface. The receiver interface uses only one pointer, i.e., the receive pointer, which points to the next entry that the receiver interface interrogates to receive data.

To understand how data is transmitted, assume that the pointers move from top to bottom, as shown in Figure 4. Initially, all the pointers designate the location indicated by the begin pointer.

A transmitter that has data to transmit to a receiver uses the entry indicated by the fill pointer. First, the transmitter verifies that it owns the entry by checking the ownership bit. Second, the transmitter writes the buffer address and the remaining fields in the entry. In the case of a single buffer packet, the transmitter interface writes a single entry and then toggles the ownership bit and signals the receiver interface.

For multiple buffers, the transmitter interface increments the fill pointer and repeats the two steps described in the previous paragraph to write all the buffer addresses and the length and status information. Then the transmitter interface toggles the ownership bits of all later entries of the multiple buffers before toggling the ownership bit of the first entry. This protocol preserves the atomicity of the packet transfer between the transmitter and receiver interfaces.


Then the transmitter interface signals the receiver interface that a packet is available on the transmit ring. This signal alerts the receiver interface, which then examines the entry pointed to by the receive pointer. The receiver interface operates on the entry data if it owns the entry. The receiver interface returns the entries to the transmitter interface by toggling the ownership bits and then signals receipt of data to indicate the return of the entries (and hence the free buffers). Note that there is no need to return these free buffers in a packet, atomic fashion. The transmitter interface uses the transmitter free pointer to examine the ownership bits in the entry and to reclaim the buffers.

The interfaces operate asynchronously, since each one can transmit or receive data at its own speed. If the transmitter interface can transmit faster than the receiver interface is able to receive, the transmit ring fills up. Under such circumstances, the receiver interface owns all the entries in a transmit ring, the fill pointer equals the transmitter free pointer, and data transmission stops. Conversely, if the receiver interface is faster than the transmitter interface, the transmit ring will be nearly empty. In this case, the transmitter free pointer and the receive pointer are almost always equal.

Note the following invariants that apply to the pointers when data is exchanged on a transmit ring: the fill pointer cannot pass the transmitter free pointer; the transmitter free pointer cannot pass the receive pointer; and the receive pointer cannot pass the fill pointer.

Data Exchange on a Receive Ring
As also shown in Figure 4, the operation of data exchange on a receive ring is similar to that operation on the transmit ring, with the following differences. The receiver interface manages the memory used for exchanging data. Consequently, the receiver interface has two pointers, the receiver free pointer and the receive pointer, and the transmitter interface has only one pointer, the fill pointer.

[Figure 4 shows a transmit ring and a receive ring. On the transmit ring, the transmitter interface holds the transmitter free pointer and the fill pointer, and the receiver interface holds the receive pointer; on the receive ring, the receiver interface holds the receiver free pointer and the receive pointer, and the transmitter interface holds the fill pointer. KEY: 0 = the transmitter owns the entry; 1 = the receiver owns the entry. Note that the pointers move in a downward direction.]

Figure 4 Data Exchange on Transmit and Receive Rings
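The hand-off just described can be modeled in C as follows, reusing the ring_entry sketch given earlier. The ring structure, the routine, and signal_transmit_done are hypothetical names; in DEMFA the interfaces are hardware state machines, and this fragment only mirrors their protocol.

    struct ring {
        unsigned size;              /* number of entries              */
        unsigned fill, free;        /* fill and free pointer indexes  */
        struct ring_entry *entry;
    };

    extern void signal_transmit_done(struct ring *r);   /* the "prod" */

    /* Queue a packet spanning nbuf buffers on a transmit ring.  The
       ownership bits of all later entries are toggled before the
       first, so the receiver never sees a partial packet. */
    int transmit_packet(struct ring *r, const uint32_t buf[],
                        const uint16_t len[], unsigned nbuf)
    {
        unsigned first = r->fill;

        if (nbuf == 0)
            return -1;
        for (unsigned i = 0; i < nbuf; i++) {
            struct ring_entry *e = &r->entry[(first + i) % r->size];
            if (e->own != OWNER_TRANSMITTER)
                return -1;                 /* ring full; retry later */
            e->buffer = buf[i];
            e->length = len[i];
            e->sop = (i == 0);
            e->eop = (i == nbuf - 1);
        }
        for (unsigned i = nbuf - 1; i > 0; i--)     /* later entries  */
            r->entry[(first + i) % r->size].own = OWNER_RECEIVER;
        r->entry[first].own = OWNER_RECEIVER;       /* then the first */
        r->fill = (first + nbuf) % r->size;

        signal_transmit_done(r);
        return 0;
    }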

Table 1 shows the various DEMFA rings and the transmitters and receivers that interface with each ring.

Table 1 DEMFA Rings and Their Transmitter and Receiver Interfaces

Rings in Packet Buffer Memory

RMC Receive Ring (transmitter: FDDI Corner and Parser Stage; receiver: Ring Entry Mover Stage)
    Contains data that originated on the FDDI ring.

RMC Transmit Ring (transmitter: Ring Entry Mover Stage; receiver: FDDI Corner and Parser Stage)
    Contains data that originated at the host or the AM, destined for the FDDI ring.

HPD Receive Ring (transmitter: Host Protocol Decoder Stage; receiver: Ring Entry Mover Stage)
    Contains data that originated at the host, destined for the FDDI ring.

HPD Transmit Ring (transmitter: Ring Entry Mover Stage; receiver: Host Protocol Decoder Stage)
    Contains data that originated at the FDDI ring, destined for the host.

AM Receive Ring (transmitter: Adapter Manager; receiver: Ring Entry Mover Stage)
    Contains data that originated at the AM, destined for the FDDI ring or the host.

AM Transmit Ring (transmitter: Ring Entry Mover Stage; receiver: Adapter Manager)
    Contains data that originated at the FDDI ring, destined for the AM.

Rings in Host Memory

Host Receive Ring (transmitter: Host Protocol Decoder Stage; receiver: Host)
    Contains data that originated at the FDDI ring or the AM, destined for the host.

Host Transmit Ring (transmitter: Host; receiver: Host Protocol Decoder Stage)
    Contains data that originated at the host, destined for the FDDI ring.

Command Ring (Transmit Ring) (transmitter: Host; receiver: Adapter Manager)
    Contains commands that originated at the host for the AM; note that the AM replies in the same ring.

Unsolicited Ring (Receive Ring) (transmitter: Adapter Manager; receiver: Host)
    Contains unsolicited messages from the AM to the host.


Subsystem Level Functionality
The basic functions that an FDDI LAN adapter is required to perform are receiving and transmitting packets over the FDDI ring. The adapter must be able to establish and maintain connection to the FDDI network. The connection management (CMT) protocol, a subset of the station management (SMT) protocol, specifies the rules for this connection. The implementation of the complex CMT algorithm in an adapter requires an intelligent component, such as a microprocessor, that can receive, interpret, and transmit packets. Note that the number of CMT packets that flow over the FDDI ring constitutes only a small fraction of the normal traffic. Therefore, a low-performance CPU is adequate to implement connection management. The CPU in the DEMFA device is called the adapter manager.

The packets in the receive stream that originated on the FDDI ring and are addressed to this host or adapter (together called the node) can take one of the following paths:

• Packets not addressed to this node are forwarded over the FDDI ring.

• Packets addressed to this node are delivered to the host computer.

• Packets addressed to this node are delivered to the AM.

The delivery of packets to the host computer implies that the adapter has a pointer to a free memory buffer in which to deposit the received packet. The DEMFA port, described in the next section, specifies the rules for extracting free buffer pointers from the host memory.

For each packet that the host needs to transmit, the adapter must know the buffer address or addresses and the extent of each buffer. The DEMFA port defines the method to exchange this buffer information. In addition, the host and the adapter microprocessor must be able to exchange information. The DEMFA port defines the protocol for this communication also.

DEMFA Port
The DEMFA port specifies the data structure and protocol used for communication between the adapter and the host computer. Rather than invent a new protocol, we modified the DEMNA port specification. The data structure used to pass information between the host and the adapter is a ring structure. Such structures are more efficient to traverse than queue structures.

The DEMFA port defines the four separate host rings listed in Table 1:

• The host receive ring, which contains pointers to free buffers into which a packet received over the network can be deposited

• The host transmit ring, which contains pointers to filled buffers from which packets are removed and transmitted over the FDDI ring by the adapter

• The host command ring, which sends commands to the AM

• The unsolicited ring, which the AM uses to initiate communication with the host CPU

By using four host rings, we were able to differentiate between the fast and frequent data movement to and from the FDDI ring and the comparatively slow and infrequent data movement required for communication with the AM.
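Seen from the host driver, the port might be modeled as the structure below, together with the receiver-side operation of stocking the host receive ring with a free buffer. The layout, names, and ownership handling are schematic assumptions built on the earlier ring sketches; the authoritative definition is the DEMFA port specification itself.

    /* The four host rings of the DEMFA port (schematic). */
    struct demfa_port {
        struct ring host_receive;    /* pointers to free host buffers   */
        struct ring host_transmit;   /* pointers to filled host buffers */
        struct ring command;         /* host commands to the AM; the AM
                                        replies in the same ring        */
        struct ring unsolicited;     /* AM-initiated messages to host   */
    };

    /* The host, as the receiver interface that manages host memory,
       stocks one entry of the host receive ring with a free buffer and
       hands the entry to the adapter for filling. */
    int post_receive_buffer(struct demfa_port *p, uint32_t buf,
                            uint16_t len)
    {
        struct ring *r = &p->host_receive;
        struct ring_entry *e = &r->entry[r->free];

        if (e->own != OWNER_RECEIVER)
            return -1;                     /* no free entry right now */
        e->buffer = buf;
        e->length = len;
        e->own = OWNER_TRANSMITTER;        /* adapter may now fill it */
        r->free = (r->free + 1) % r->size;
        return 0;
    }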

One Implementation of the DEMFA Architecture
Previous sections specified the DEMFA architecture. The remainder of this paper describes an implementation of the DEMFA architecture. In the following sections, we present details of the implementation for the packet buffer memory and the packet memory interface; the three pipelined stages, FCP, REM, and HPD; and the adapter manager.

Packet Buffer Memory and Packet Memory Interface
The packet buffer memory stores the data received over the FDDI ring before delivering this data to the host. The PBM also stores data from the host before transmitting over the FDDI ring.

PBM consists of two memories: the packet buffer data memory and the packet buffer ring memory. Virtually, the packet buffer data memory divides into seven areas: one used by the AM and three each for data reception and data transmission to and from the three external interfaces. These three interfaces are the FCP stage, the HPD stage, and the AM. The areas are accessed and managed by the six rings residing in the packet buffer ring memory and listed in Table 1. Note that the division is considered virtual because the physical memory locations of the areas change over time.


The three pipelined stages and the memory refresh circuitry use the packet memory interface (PMI) to access PBM. The PMI arbitrates and prioritizes the requests for memory access from these four requesters. Physically, the PMI has three interfaces: the FCP stage, the REM stage, and the HPD stage. Virtually, the PMI has four interfaces; the HPD interface multiplexes traffic from both the host and the adapter manager. The PMI also has the functionality to refresh the dynamic memory and to implement a synchronizer between the 80-nanosecond FDDI clock and the 64-nanosecond XMI clock.

All interfaces request access to the memory by invoking a request/grant protocol. Some accesses are longword (4-byte) transactions that require one to two memory cycles; others are hexaword (32-byte) transactions and require a burst of memory cycles.

The interfaces have the following priorities: (1) refresh memory circuitry, (2) the REM stage, (3) the FCP stage, and (4) the HPD stage. The refresh memory circuitry has the highest priority because data loss in the dynamic memory is disastrous. Also the refresh circuitry makes a request once every 5 to 10 microseconds, thus ensuring that the lower priority requesters always have access to the memory. The REM has the second highest priority because it always requests one longword, which requires one memory cycle. Once the REM receives data, by design it waits at least two cycles before making the next request. Thus, the REM does not monopolize the memory, and the FCP can always get its requests serviced. The FCP stage requires guaranteed memory bandwidth with small latency to avoid an overflow or underflow condition in its FIFOs. Finally, the HPD interface has the lowest priority because no data loss occurs if memory access is denied for a theoretically infinite amount of time. Our adapter design has mechanisms that guarantee memory access to the HPD.
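The grant decision implied by these priorities fits in a few lines; the fragment below is a hypothetical software model of the PMI arbitration, not the hardware logic.

    enum pmi_requester { PMI_NONE, PMI_REFRESH, PMI_REM, PMI_FCP, PMI_HPD };

    /* Grant the memory to the highest-priority pending requester:
       refresh > REM > FCP > HPD.  Refresh asks only once every 5 to 10
       microseconds and the REM idles at least two cycles after each
       one-longword access, so the FCP and HPD are not starved. */
    enum pmi_requester pmi_arbitrate(int refresh_req, int rem_req,
                                     int fcp_req, int hpd_req)
    {
        if (refresh_req) return PMI_REFRESH;
        if (rem_req)     return PMI_REM;
        if (fcp_req)     return PMI_FCP;
        if (hpd_req)     return PMI_HPD;
        return PMI_NONE;
    }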

FDDI Corner and Parser Stage
The FCP stage, illustrated in Figure 5, provides the interface between the FDDI ring and the packet buffer memory. This stage can receive or transmit the smallest packet in 2.24 microseconds, as required by our performance constraints.

The receive stream in this stage converts the incoming stream of photons from the FDDI ring into a serial bit stream using the fiber-optic transceiver (FOX) chip. The clock and data conversion chip then recovers the clock and converts the incoming code from 5 to 4 bits. The MAC chip converts this electronic serial bit stream to a byte stream. The MAC chip implements a superset of the ANSI MAC standard. Digital has a specific implementation of the MAC chip. The ring memory controller (RMC) interfaces with the byte-wide stream from the MAC, converts the bytes into 32-bit words, and writes these words to the PBM, using the RMC receive ring and the ring protocol.

The transmit stream accesses a packet from the PBM, waits for the token on the FDDI ring, and transmits the packet as a stream of photons. This stage can generate and append 16 bytes of cyclic redundancy code (CRC) to every packet before transmitting.

[Figure 5 shows the FDDI corner and parser stage between the fiber (fiber in, fiber out) and the PBM, built from the components listed in the key.]

    KEY:
    FOX    FIBER-OPTIC TRANSCEIVER
    CDC_R  CLOCK AND DATA CONVERSION RECEIVER
    CDC_T  CLOCK AND DATA CONVERSION TRANSMITTER
    ELM    ELASTICITY BUFFER AND LINK MANAGEMENT
    MAC    MEDIA ACCESS CONTROL
    RMC    RING MEMORY CONTROLLER

Figure 5 FDDI Corner and Parser Stage


The parser component of this stage interfaces with the RMC bus to generate a forwarding vector that has a variety of information including the data link user identity and the destination of the packet, i.e., the host or the AM. The parser extracts packet headers from the RMC bus and operates on the FDDI and the LLC parts of the packet headers. The parser then processes this information in real time, using a content-addressable memory (CAM) that stores the profiles of data link and other users. As a result, the parser generates a forwarding vector that contains the destination address of either the host user or the AM user. The forwarding vector destination field is given a "discard" value, if the packet header does not match any user profile. Note that the forwarding vector is a part of the buffer descriptor field in the RMC receive ring.
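The parser's decision can be pictured as the lookup below. The CAM is modeled as a linear search over stored profiles, and the header width and the encoding of the "discard" value are assumptions for illustration; the real parser does this match in hardware, in real time.

    #include <stdint.h>
    #include <string.h>

    #define FWD_DISCARD 0xFFu     /* assumed "discard" destination */

    struct user_profile {
        uint8_t header[8];        /* FDDI and LLC header bytes to match */
        uint8_t dest;             /* host user or AM user identity      */
    };

    /* Produce the forwarding-vector destination for a received packet
       header: the first matching profile wins; no match means the
       packet is discarded. */
    uint8_t parse_destination(const uint8_t hdr[8],
                              const struct user_profile *cam, int n)
    {
        for (int i = 0; i < n; i++)
            if (memcmp(cam[i].header, hdr, 8) == 0)
                return cam[i].dest;
        return FWD_DISCARD;
    }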

Ring Entry Mover Stage
The ring entry mover stage performs four major functions: (1) moving filled packets from receive rings to transmit rings, (2) returning free packets from transmit rings to receive rings, (3) managing buffers, and (4) collecting statistics. Figure 6 shows the various rings, the ring entry mover, and the movement of filled and free packets.

The REM moves filled packets from receive rings to transmit rings by copying pointers rather than copying data. (Copying pointers is a much faster operation than data copy.) Note in Figure 6 that for a given interface, no filled packet moves from its receive ring to its transmit ring. For example, no filled packet moves from the RMC receive ring to the RMC transmit ring. Also, in this design there is no need for a path from the HPD receive ring to the AM transmit ring.

A second function performed by the REM stage is to return free packets from the transmit rings to the proper receive rings.

    FROM \ TO           RMC Transmit Ring  HPD Transmit Ring  AM Transmit Ring
    RMC Receive Ring    NO                 YES                YES
    HPD Receive Ring    YES                NO                 NO (BY DESIGN)
    AM Receive Ring     YES                YES                NO

    KEY:
    HPD  HOST PROTOCOL DECODER
    RMC  RING MEMORY CONTROLLER
    AM   ADAPTER MANAGER

Figure 6 Movement of Filled Packets by the Ring Entry Mover


Transmit rings point to free packets after the receiver interface has consumed the information in the packet. The REM, which is a transmitter interface on all transmit rings in the PBM, owns these buffers after the appropriate receiver interface toggles the ownership bit. The REM returns the buffers to the original receive ring by using information in the color field, a subset of the buffer descriptor field. The color field contains color information that designates the receive ring to which the buffers belong. This color information is written into the buffer descriptors of the free buffers during initialization. Note that during initialization, the adapter free buffers in the PBM are allocated to the three receive rings with which the REM interfaces.

The REM also performs buffer resource management. Note that a reserved pool of buffers exists for traffic arriving over the FDDI ring. This FDDI traffic has two destinations, namely the host CPU and the adapter manager. To ensure that one destination does not monopolize the pool of buffers, the pool is divided into two parts: host allocation and AM allocation. The REM delivers no more than the allocated number of buffers to one destination.

The fourth major function that the REM performs is to collect statistics. The REM collects statistics in discard counters for packets that cannot be delivered due to lack of resources. The REM interrupts the AM when these counters are half full. The AM reads, processes, and stores these counters for statistical purposes. The AM read operation resets these counters. There are a number of other counters in REM.
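In outline, the color-based return and the discard accounting might look as follows; the color encoding, the counter threshold, and the helper names are assumptions layered on the earlier ring sketches.

    enum ring_color {       /* written into each free buffer's
                               descriptor at initialization      */
        COLOR_RMC_RECEIVE,
        COLOR_HPD_RECEIVE,
        COLOR_AM_RECEIVE
    };

    extern void ring_put_free(struct ring *r, uint32_t buffer);
    extern void interrupt_adapter_manager(void);

    /* Return a consumed buffer to the receive ring designated by the
       color field of its buffer descriptor. */
    void rem_return_buffer(struct ring *receive_ring[], uint32_t buffer,
                           enum ring_color color)
    {
        ring_put_free(receive_ring[color], buffer);
    }

    /* Count a packet discarded for lack of resources; wake the AM when
       the counter is half full so it can read, store, and reset it. */
    #define DISCARD_LIMIT 65536u
    static uint32_t discards;

    void rem_note_discard(void)
    {
        if (++discards == DISCARD_LIMIT / 2)
            interrupt_adapter_manager();
    }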

Host Protocol Decoder Stage
The host protocol decoder interfaces with the XMI bus, fetches and interprets entries from the host receive and transmit rings, and moves data between the host and the PBM. This stage also acts as a gateway for the AM to get to the host memory or to the PBM.

Figure 7 is a block diagram of the HPD stage. The receive and transmit pipelines store and retrieve receive and transmit data from the host memory. The two pipelines work in parallel. We now explain the operation of the receive pipeline in detail. The transmit pipeline operates in a similar manner; thus, we highlight only the differences.

HPD Receive Pipeline
The receive pipeline has three stages: (1) the fetch and decode host receive entry stage, (2) the data mover stage, and (3) the receive buffer descriptor write stage. Most pipelines work in a lockstep fashion; that is, each stage takes the same amount of time to process input. In our design, the processing time varies for each stage in the pipeline. For example, the data mover stage will take a much longer time to transfer 4500-byte packets than to transfer 100-byte packets.

[Figure 7 shows, between the XMI bus interface and the PMI bus interface, the receive pipeline (fetch and decode host receive entry stage, data mover stage, and receive buffer descriptors write stage) above the transmit pipeline (fetch and decode host transmit entry stage, data mover stage, and transmit buffer descriptors write stage), with an interface to the AM.]

    KEY:
    PMI  PACKET MEMORY INTERFACE
    AM   ADAPTER MANAGER

Figure 7 Host Protocol Decoder Stage


The fetch and decode host receive entry stage, on the other hand, may take the same amount of time to decode entries for packets of either size. Consequently, stages use interlocks to signal the completion of work.

The fetch and decode host receive entry stage has knowledge of the format and size of the ring and sequentially fetches host receive ring entries. If the adapter does not own an entry, this stage waits for a signal from the host before fetching the entry again. If the adapter does own the entry, this stage decodes the entry to determine the address of the free buffer in the host memory and the number of bytes in the buffer. The stage then passes this buffer information to the data mover stage and the address of the host entry to the receive buffer descriptor write stage. In addition, this stage prefetches the next entry to keep the pipeline full, in case data is actively received over the FDDI ring.

In parallel, the PMI interface stage part of the HPD chip fetches the next entry from the HPD transmit ring. Decoding this entry determines the address of the buffer in the PBM and the amount of data in the buffer. The packet buffer bus interface passes the buffer address and length information to the data mover stage and the address of the HPD transmit ring entry to the receive buffer descriptor write stage.

Now, the data mover stage has pointers to the host free buffer and its extent and to the PBM filled buffer and its extent. The stage proceeds to move the data from the PBM to the host memory over the XMI bus. Depending on the XMI memory design, this transfer involves octaword or hexaword bursts. The process of moving data continues until the depletion of packet data in the PBM.

The data mover stage signals the receive buffer descriptor stage when the packet moving is complete. The receive buffer descriptor stage writes in the status fields of the host receive ring entry and the HPD transmit ring entry. This stage also gives ownership of the filled buffer to the host and of the free buffer to the REM. The REM can then return the free buffer to the ring of origin.
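The completion step might be expressed as below, again reusing the earlier ring_entry sketch; the helper name and the status encoding are assumptions, since the descriptor write stage is hardware.

    /* Close out a received packet: record the delivered length and
       status, give the filled host buffer to the host, and return the
       freed adapter buffer to the REM. */
    void hpd_complete_receive(struct ring_entry *host_rx,
                              struct ring_entry *hpd_tx,
                              uint16_t bytes_moved)
    {
        host_rx->length = bytes_moved;
        host_rx->status = 0;                 /* assumed success code   */
        hpd_tx->status  = 0;

        host_rx->own = OWNER_RECEIVER;       /* filled buffer: to host */
        hpd_tx->own  = OWNER_TRANSMITTER;    /* freed buffer: to REM   */
    }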


HPD Transmit Pipeline
The HPD transmit and receive pipelines are symmetrical. The HPD receive pipeline delivers data from the HPD transmit ring to the host receive ring. The HPD transmit pipeline delivers data from the host transmit ring to the HPD receive ring.

There is one exception to the symmetry. The transmit pipeline does not fetch an entry from the HPD receive ring in PBM to determine if there are enough free buffers available. A hardware interface between the PMI and the HPD, i.e., a Boolean signal, indicates whether there are enough buffers to accommodate the largest possible size transmit packet. This exception is an artifact of our implementation; we wanted to reduce the accesses to the PBM, since its bandwidth is a scarce resource.

Adapter Manager
The local intelligence, also known as the adapter manager, implements various necessary adapter functions including self-test and the initialization. The AM also implements part of the CMT code that manages the FDDI connection. In addition, the AM interfaces with the host to start and stop data link users by dynamically manipulating the parser data base.

Tracing a Packet through the Adapter
The major steps for data transfer incorporate the subfunctions previously discussed. This section traces the path of a packet P through the adapter, first on the receive stream and then on the transmit stream. We assume that adapter initialization is complete and that all data structures in the packet memory and parser data base are properly set. In this example, we further assume that packet P is small enough to fit into a single buffer. Large packets require multiple buffers.

Receive Stream
A packet destined for the host passes through the three major pipelined stages in the adapter. A brief description of the intrastage operation and details of the interstage functioning follow. The four parts of Figure 8 illustrate the receive process.

FDDI Corner and Parser Stage
Figure 8(a) shows packet P on the FDDI ring; the packet is actually a stream of photons. This stage converts the stream of photons into a packet. At this point, a free buffer is available for packet P in both the RMC receive ring and the host receive ring. The FCP stage owns the free buffer in the RMC receive ring.

The stage determines if packet P is addressed to this node, forwards the packet on the FDDI ring, and copies the packet for this adapter if it is addressed to this node. This stage also generates a CRC for the packet. The FCP stage then deposits the copied packet into the free buffer in the RMC receive ring entry shown in Figure 8(b).

[Figure 8 traces packet P from the FDDI ring through the FDDI corner and parser stage, the ring entry mover stage, and the host protocol decoder stage to the XMI bus, showing the free and filled buffers on the RMC receive ring, the HPD transmit ring, and the host receive ring at each step:
(a) Receive Stream - Packet on the FDDI Ring
(b) Receive Stream - Packet at the RMC Receive Ring
(c) Receive Stream - Packet at the HPD Transmit Ring
(d) Receive Stream - Packet in the Host Memory on the Host Receive Ring]

Figure 8 Receive Stream - The Receipt of a Packet from the FDDI Ring to the Host Memory

After depositing the complete packet, this stage writes the buffer descriptor and toggles the ownership bit. The ring entry mover now owns packet P. The FCP stage is free to receive the next packet, which is stored in the next buffer in the RMC receive ring.

Ring Entry Mover Stage
The REM extracts the packet buffer descriptor and determines the number of pages in packet P. This stage also has an account of the number of pages outstanding on the HPD transmit ring. The REM delivers packet P to the HPD transmit ring provided the host resource allocation is not exceeded.

The REM delivers the packet by copying page pointers from the RMC receive ring to the HPD transmit ring, as shown in Figure 8(c).


Note that the HPD transmit ring is large enough to write all pointers from the RMC receive ring and the AM receive ring. The REM then transfers ownership of the HPD transmit ring entry to the HPD stage and the RMC receive ring entries to the FCP stage.

HPD Stage
The HPD receive pipeline operates on a packet it owns in the HPD transmit ring. As shown in Figure 8(d), after fetching the address of the free host buffer, this pipeline moves packet P from the PBM to the host memory and toggles the ownership bit of the host entry. Simultaneously, the HPD returns ownership of the free buffers in the HPD transmit ring to the ring entry mover stage. The REM returns these buffers to the RMC receive ring as free buffers.

Transmit Stream
To transmit data from the host transmit ring to the FDDI ring, the packet must pass through the same three stages as for the receive stream, but in the reverse direction.

HPD Stage
For the receive stream, the HPD receive pipeline prefetches the free buffer from the host receive ring. In contrast, the HPD transmit pipeline must wait for the host to fill the transmit buffer and transfer ownership to the host transmit ring. The HPD stage then moves the data from the host memory to the PBM if the hardwired signal between the REM and the HPD indicates that a sufficient number of pages is available. Finally, the HPD transfers ownership of the host transmit ring entry to the host and the HPD receive ring entry to the REM.

Ring Entry Mover Stage
The REM moves the packet from the HPD receive ring to the RMC transmit ring. Again, the REM copies pointers from ring to ring and toggles the ownership bit on the RMC transmit ring.

FDDI Corner and Parser Stage
Although the packet is available in PBM for transmission, the FCP stage must receive a token before transmitting over the FDDI ring. Once the transmission is complete, the buffer on the RMC transmit ring is now free. The FCP stage returns ownership of the buffer to the REM, which then returns the free buffer back to the HPD receive ring or the AM receive ring, depending upon the origin. Again, the free buffers are returned by copying buffer pointers.

The receive and transmit streams for the adapter manager are similar to those for the host; therefore, we do not describe these processes.

Hardware and Firmware Implementation
The hardware implementation of DEMFA consisted of four large gate arrays, custom very large-scale integration (VLSI) chips, dynamic and static random access memories (RAMs), and jelly bean logic. Figure 9 is a photograph of the DEMFA board.

The four gate arrays specified and designed by the group are the parser, the adapter manager interface (AMI), the host protocol decoder, and the packet memory controller (PMC), which incorporated the function of the packet memory interface and the ring memory controller. We now describe aspects of the gate array development. Note that we used the Compacted Array technology developed by LSI Logic for our implementation. The gate arrays have 224-pin surface mount packaging.

Table 2 shows various gate arrays, the total gate count for each gate array, and the percentage of control gates and data path gates. Control gates are defined as gates required for implementing state machines used for control. Data path gates are gates required for registers and multiplexors, for example. Note that the complexity of gate arrays is proportional to the percentage of control gates. The gate arrays in Table 2 were fairly complex because they consisted of approximately 50 percent control gates.

Module Implementation
We used the 11-by-9-inch XMI module for implementing the adapter. Early in the project we defined the pin functions for various gate arrays. Once these were defined we could design our module. SPICE modeling helped in arriving at a correct module design with the first fabrication. The design was thorough and completed early in the project.

Firmware Implementation
The DEMFA firmware has three major functions: self-test, FDDI management (using Common Node Software), and adapter functional firmware. The DEMFA team implemented the adapter functional firmware while other groups designed the two remaining components. The DEMFA functional firmware can initialize the adapter and then interact with the host to start and stop data link layer users, as well as perform other functions. The firmware is implemented in the C language for the Motorola 68020 system. The total image size is approximately 160 kilobytes.


Figure 9 The DEMFA Board

Table 2 Gate Counts for DEMFA Gate Arrays

    Gate Array   Total Gates   Data Gates           Control Gates
                               (Percent of Total)   (Percent of Total)
    Parser       20,296        39                   61
    PMC          61,537        40                   60
    HPD          81,265        34                   66
    AMI          15,002        49                   51

Performance

The graph presented in Figure 10 shows the adapter performance for the receive and transmit streams at the adapter hardware level for this implementation. The data represents throughput measured in megabits per second as a function of packet size measured in bytes. Figure 10 illustrates that the receive and transmit streams meet the 100-Mb/s throughput when the packet size is approximately 69 bytes. The bottlenecks in this implementation of the DEMFA architecture are (1) the PMI and (2) the combination of the XMI interface, bus, and memory. We implemented these interfaces in a conservative manner to reduce our risks and to produce the product in a timely fashion.

Figure 10 Adapter Performance (throughput in megabits per second versus packet size in bytes; key: receive throughput, transmit throughput)


For more detailed performance data, see the paper entitled "Performance Analysis of a High-speed FDDI Adapter" in this issue of the Digital Technical Journal.[11]

Conclusion

The goal of the DEMFA project was to specify an architecture for an adapter that would be at least 30 times faster than any previously built adapter. The architecture also had to be easy to implement. This paper describes the architecture and an implementation of DEMFA. Performance measurements of the adapter show that this first implementation successfully meets close to the maximum FDDI throughput capacity; thus, the DEMFA performance can be considered ultimate. Already, a number of adapters have been designed based on ideas borrowed from the DEMFA architecture and implementation. In a few years, architectures similar to this one may become the norm for data link and even transport layer adapters, rather than the exception.

Acknowledgments

I wish to acknowledge and thank my manager, Howard Hayakawa, who out of nowhere presented me with the challenge of defining an architecture and an implementation for an FDDI adapter that would have a performance 30 times that of any existing adapter. I must have taken leave of my senses to take on such a challenge. But the end results were worth the effort.

I would also like to thank Gerard Koeckhoven, who agreed to be the engineering manager for this adapter project. He took on the consequent challenge and the risk and supported me all along the way. In addition, I want to recognize Mark Kempf for the many hours he spent helping us during the conceptualization period and for chiseling our TAN design. The architects were of great assistance in making sure that our adapter met the FDDI standard.

I also wish to acknowledge the following members of the DEMFA project team for their contributions: Santosh Hasani and Ken Wong, who designed the parser gate array; Dave Valley and Dominic Gasbarro, who designed the AMI gate array; Andy Russo and John Bridge, who designed the HPD gate array; Ron Edgar, along with the other PMC designers, Walter Kelt, Joan Klebes, Lea Walton, and Ken Wong; Ed Wu and Bob Hommel, who designed and implemented the module; the team that implemented the functional firmware, Ed Sullivan, David Dagg, Da-Hai Ding, and Martin Griesmer; and Ram Kalkunte, for designing and building a simulation model to accurately predict the performance well before the adapter was ready. Also, I would like to thank VMS and ULTRIX group members Dave Gagne, Bill Salkewicz, Dick Stockdale, and Fred Templin.

References

1. B. Allison, "An Overview of the VAX 6200 Family of Systems," Digital Technical Journal, no. 7 (August 1988): 10-18.

2. D. Fite, Jr., T. Fossum, and D. Manley, "Design Strategy for the VAX 9000 System," Digital Technical Journal, vol. 2, no. 4 (Fall 1990): 13-24.

3. H. Yang, B. Spinney, and S. Towning, "FDDI Data Link Development," Digital Technical Journal, vol. 3, no. 2 (Spring 1991): 31-41.

4. A. Tanenbaum, Computer Networks (Englewood Cliffs, NJ: Prentice Hall, Inc., 1981).

5. B. Allison, "The Architectural Definition Process of the VAX 6200 Family," Digital Technical Journal, no. 7 (August 1988): 19-27.

6. Token Ring Access Method and Physical Layer Specifications, ANSI/IEEE Standard 802.5-1989 (New York: The Institute of Electrical and Electronics Engineers, Inc., 1989).

7. R. Stockdale and J. Weiss, "Design of the DEC LANcontroller 400 Adapter," Digital Technical Journal, vol. 3, no. 3 (Summer 1991, this issue): 36-47.

8. FDDI Station Management (SMT), Preliminary Draft, Proposed American National Standard, ANSI X3T9/90-X3T9.5/84-49, REV 6.2 (May 1990).

9. Token Ring Media Access Control (MAC) (International Standards Organization, reference no. ISO 9314-2, 1989).

10. P. Ciarfella, D. Benson, and D. Sawyer, "An Overview of the Common Node Software," Digital Technical Journal, vol. 3, no. 2 (Spring 1991): 42-52.

11. R. Kalkunte, "Performance Analysis of a High-speed FDDI Adapter," Digital Technical Journal, vol. 3, no. 3 (Summer 1991, this issue): 64-77.

Ramsesh S. Kalkunte

Performance Analysis of a High-speed FDDI Adapter

The DEC FDDIcontroller 400 host-to-FDDI network adapter implements real-time processing functionality in hardware, unlike conventional microprocessor-based designs. To develop this high-performance product with the available technological resources and at minimal cost, we optimized the adapter design by creating a simulation model. This model, apart from predicting performance, enabled engineers to analyze the functional correctness and the performance impact of potential designs. As a result, our implementation delivers close to ultimate performance for an FDDI adapter and surpasses the initial project expectations.

As high-performance systems become available and the use of distributed computing proliferates, the need for high-performance networks increases. Faster interconnects are required to achieve such performance goals. Consequently, network adapters must be able to function at higher speeds. In adopting fiber distributed data interface (FDDI) local area network (LAN) technology as a follow-on to Ethernet, Digital recognized the need to build an industry-leading network adapter to service its high-performance platforms. As a result, we designed and developed the DEC FDDIcontroller 400 product. To track the adapter performance through the design and development stages, we created a simulation model; our objective was to ensure that the device met our performance goals. This paper begins with a description of the DEC FDDIcontroller 400, followed by a brief historical perspective and statement of the performance objectives of the adapter project. We then discuss in detail the modeling methodology and the results achieved. In addition, we present validation of these results in the form of measurements taken on prototype hardware.

The DEC FDDIcontroller 400

The DEC FDDIcontroller 400, also known as the DEMFA, attaches to host systems that have an XMI backplane.[1] Laboratory measured performance data presented later in the paper shows that the adapter hardware can sustain a practically infinite stream of frames at the full FDDI data bandwidth of 100 megabits per second (Mb/s) for frame sizes 69 bytes or larger on the receive stream and 51 bytes or larger on the transmit stream. Even the smallest, i.e., 20-byte dataless, FDDI frames can be received at 36 Mb/s and transmitted at 47 Mb/s.

The DEMFA is an FDDI Class-B single attachment station (SAS) that interfaces to the FDDI token ring network through the DECconcentrator 500. A port driver resident in the host controls the DEMFA port. The port, the port driver, and the adapter hardware implement the American National Standards Institute (ANSI) data link and physical layer functionality for FDDI LANs. This foundation supports user protocols such as the Open Systems Interconnection (OSI), DECnet, the transmission control protocol with the internet protocol (TCP/IP), and local area transport (LAT).[2] Figure 1 shows a typical network configuration using the DEC FDDIcontroller 400 adapter with other Digital FDDI products.

The XMI bus is capable of transferring data at rates up to 800 Mb/s and can serve as either a CPU bus or an I/O bus.


Figure 1 Typical Network Configuration (VAX 6000, VAX 9000, and DECstation 5000 systems, each attached to the FDDI ring through a DEC FDDIcontroller 400 or DEC FDDIcontroller 700 network adapter)

FDDI characteristics such as low attenuation, low noise susceptibility, high security, and low cost (as the technology matures) will make FDDI a popular interconnect of the 1990s.[6]

Historical Perspective and Performance Objectives of the DEMFA

With the advent of high-performance systems and distributed computing strategies, the need for high-performance networking options has increased. Traditionally, I/O adapters have been built to serve the current performance needs. As a consequence, such adapters offer little or no network performance scalability to accommodate future increases in demand. Scalability is important to ensure that the adapter does not become a bottleneck when such demands exist. Nonscalable adapters become obsolete, and the resulting frequent hardware upgrades increase system cost.

The first Ethernet adapters, which complied with the IEEE 802.3 standard, were built in the early 1980s. Only recently do adapters exist that can process frames at the maximum Ethernet throughput rate of 10 Mb/s.[7] As mentioned earlier, FDDI has the capability of supporting speeds an order of magnitude higher than Ethernet. Since the header in an FDDI frame is three times smaller than that for Ethernet, FDDI frame arrival rates can be as much as 30 times the Ethernet arrival rate. Considering the various constraints, Digital set out with the goal to build an FDDI adapter that could process frames 150 bytes and larger at 100 Mb/s, i.e., the adapter would be able to process approximately 80,000 frames per second (frames/s). Also, twenty microseconds was deemed an acceptable adapter latency for the smallest FDDI frames. Considering the relatively small number of frames a host system can process today, these adapter criteria represented an ambitious goal, one that would yield a product with high performance scalability as faster CPUs became available.

Performance Modeling Considerations

During the development of a high-performance product, changes in architectural functionality, technology constraints, and cost considerations can result in design modifications. It is desirable to track the performance of the product through its development to understand the impact of such modifications.

The DEMFA consists of many hardware entities that perform the desired adapter functions.[8] Although such hardware adapters have the obvious advantage of superior performance over conventional, i.e., microprocessor-based, adapter cards, this advantage does not come without the risks associated with hardwired logic. Such risks have a negative impact on project budget and schedule and necessitate a risk management strategy to ensure that product goals are successfully met. Performance modeling of the adapter, and extending the use of such modeling to evaluate various designs, formed part of this strategy. The following subsections describe the goals and tasks of the DEMFA performance modeling.
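As a quick check of the frames-per-second goal, back-to-back 150-byte frames at the full FDDI bandwidth arrive at

    100 x 10^6 b/s / (150 bytes x 8 b/byte) = approximately 83,000 frames/s

which is consistent with the stated figure of approximately 80,000 frames/s once per-frame overhead is allowed for.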


Goals

The set of performance modeling goals for the DEMFA evolved throughout the development process. Three major goals were performance projection, buffer sufficiency analysis, and design testing through simulation.

Performance Projection In the early phases of the design, the primary goal of the model was to project the adapter performance. This prediction gave us confidence that the design could meet our performance expectations.

Buffer Sufficiency Analysis Buffer capacity plays an important part in the performance of a design. Whereas too much of this resource is wasteful, too little has a negative effect on performance. It was critical to determine the extent of buffering necessary to attain the desired target performance at the least cost. The performance model considered the dependencies on this resource. The amount of buffering was varied, and the effects of such variation, manifested in the simulation results, were analyzed. Using these results as input to a cost/benefits equation helped the designers make intelligent decisions concerning buffer capacity.

Design Testing through Simulation As development progressed, important design issues arose that could not be solved by simple analysis. The performance model served as a platform that could be enhanced to solve these more complex problems by simulation. Designs were analyzed to determine their impact on adapter performance. Because the simulation methodology afforded greater testability, we were able to make the designs more robust and to answer design questions in a significantly shorter time than other methods. Consequently, modifications to the hardware were made at an early design stage and at negligible cost.

Tasks

To accomplish performance modeling, we faced the following basic tasks: choosing the metrics, defining the workload, and deciding on a modeling methodology. Relevant metrics to measure the performance of a product are crucial. We chose metrics that are simple to understand and provide insight into the behavior of the product. Also, areas in which workload development is required must be identified and investigated in detail. An incorrect workload invalidates all performance data. And the methodology used to model the system must be well thought-out beforehand, so that the model is accurate and also flexible enough to be easily changed.

Definition of Metrics The main performance metrics used were throughput and frame latency. Throughput is the rate at which frames are processed and is measured in megabits per second or frames per second. The units can be converted easily from one to the other if the average frame size is specified (a short conversion sketch follows below). In this paper, throughput is expressed in megabits per second.

Frame latency is the elapsed time measured in microseconds between the time at which a frame is queued for service at a facility and the time at which the service is completed. The following descriptions illustrate the approach used to measure receive and transmit latency. The host receives frames from and transmits frames to the FDDI ring. Receive frame latency is the time elapsed between (1) the arrival of the last bit of the frame into the adapter from the FDDI ring and (2) the time the frame becomes available to the host for processing. Transmit frame latency is the elapsed time between (1) the time the adapter starts processing a frame from the host and (2) the exit time of the first bit of the frame from the adapter destined for the FDDI ring.

The adapter can process transmit and receive frames simultaneously. We defined performance metrics to analyze a variety of traffic scenarios to gain insight into the adapter behavior. For the context of this paper, we consider the DEMFA processing pure frame streams only, i.e., the expressions "receive throughput" and "receive latency" refer to a pure receive stream of frames containing no transmit frames. Similarly, "transmit throughput" and "transmit latency" refer to a pure transmit stream of frames.
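Because the two throughput units are interchangeable given a frame size, the conversion is a one-line computation; the following C helpers are an illustrative sketch (the function names are ours, not part of any DEMFA software).

    /* Convert between the two throughput units used in this paper.
     * Illustrative only; assumes the average frame size is known and
     * ignores FDDI framing overhead.                                 */
    double mbps_to_frames_per_sec(double mbps, double frame_bytes)
    {
        return (mbps * 1.0e6) / (frame_bytes * 8.0);
    }

    double frames_per_sec_to_mbps(double fps, double frame_bytes)
    {
        return (fps * frame_bytes * 8.0) / 1.0e6;
    }

For example, 100 Mb/s of the largest 4496-byte FDDI frames corresponds to mbps_to_frames_per_sec(100.0, 4496.0), or roughly 2,800 frames/s.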

Workload Definition Using a relevant traffic workload is very important in any simulation model. Since most systems are workload-sensitive, defining an incorrect workload may result in irrelevant data. We identified two areas in which we needed to define workloads. We then characterized the traffic patterns and built a workload model for performance simulation based on these patterns.


mimic traffic due to frame arrival on the FDDI • Expectation of perfo rmance model accuracy. ring (i.e., the receive workload) or frame trans­ Typically, a performance model predicts results mission from the host (i.e., the transmit work­ accurate to within ± 10.0 percent of the perfor­ load). The receive workload model generates mance that would be achieved with the actual frames which the DEMFA model receives, hardware. whereasthe transmit workload acts as a source of frames to be transmitted by the DEMFA model on During the design phase, behavioral and struc­ the FDDI ring. These workloads must be charac­ tural models of hardware were in development. teristic of actual FDDI traffic. Since FDDI LANsdid This hardware was partitioned across important not exist when the DEMFA was in the develop­ functional boundaries. Hardware within these bound­ ment stage, we used our experiences with aries would be modeled and tested thoroughly by Ethernet to derive these workloads, as we explain the respective development engineers. Hence, to in greater detail in the FDDI To ken Ring section. include details of these pieces of hardware in our model would have resulted in redundant effort. • XMI DEMFA traffic workload. Apart from the Since the interfaces and the gross fu nctionality of XMI traffic, there may be other traffic on the the hardware within these boundaries are relevant bus due to CPU-to-memory transactions or from to performance, we did include these components other 110 adapters attached to the system. The in our model. Existing hardware components, such XMI load on the bus impacts the performance as the FDDI chip set, were grouped together before DEMFA. of the Consequently, we designed a being modeled fo r functionality. Each submodel workload model to mimic the traffic pattern on was designed and tested separately to ensure con­ the bus. We based our model on the traffic pat­ formity to the fu nctionality and performance of XMI terns observed fo r real bus traffic. The per­ other behavioral and structural models. This strat­ DEMFA fo rmance of may degrade as this traffic egy resulted in the base-level performance model DEMFA increases because the traffic and the non­ that we used to generate preliminary performance DEMFA traffic consume common resources. The data for the DEMFA. other traffic is referred to as the XMI interference As development progressed, we encountered XMI workload. The Wo rkload Generator section design changes of various complexities. Simple describes the model fo r this workload. design changes resulted in very small changes in the perfo rmance model. But larger and more com­ Modeling Methodology The simulationmodel has plex design changes required that we investigate a hierarchical design to allow the construction of behavior both specific to the piece of hardware of smaller, more manageable blocks, i.e., submodels. which the design is a part and generalized to the The structure also allows changes to be made easily. adapter system environment in which the piece The SIMULA language implements the simulation operates. Models that represent the changes were modelY The simulation-class and queuing constructs included and interfaced as submodels. These sub­ in this language are tailored to help simulation models served the dual purposes of testing the new and modeling.'0 '' The object-oriented structures design and of improving the accuracy of the perfor­ present other advantages to model development. A mance model. 
debug procedure coded into the model prints status information about all the queues in the model. This information helped us trace the path of frames Design of the SimulationModel through the system. The performance simulation model consisted of One important first step in designing a simulation the following major components: model is to determine the detail at which to model. • FDDI ring Two factors that influence the level of detail are the

• FDDI chip set and parser • Existing knowledge of the design. Usually, infor­ mation gathered from the behavioral and ana­ • Packet memory controller lytical models of a design helps to make a performance model abstraction. Designs with • Host interface behavior that cannot be analyzed by these lower­ • XMI system level models have to be modeled in greater

detail. • Host system
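The hierarchical, queue-oriented structure just described is straightforward to picture in code. Purely as an illustration (the model itself is written in SIMULA, which the paper does not show; the C structures and names below are ours), each submodel owns a work queue, and a routine analogous to the debug procedure described above walks every queue to trace the path of frames.

    #include <stdio.h>

    struct frame_queue {
        const char *name;
        int         depth;                 /* frames currently waiting */
    };

    struct submodel {
        const char        *name;           /* e.g., "FDDI ring", "PMC" */
        struct frame_queue queue;
        struct submodel   *next;           /* chain of submodels       */
    };

    /* Analogous to the model's debug procedure: print the status of
     * every queue so a frame's path through the system can be traced. */
    void debug_dump(const struct submodel *m)
    {
        for (; m != NULL; m = m->next)
            printf("%-12s queue %-12s depth=%d\n",
                   m->name, m->queue.name, m->queue.depth);
    }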


The base-level model evolved over time, as we gained insight into the behavior of the individual components and defined workloads. The model evolved further to support the need to analyze new designs through simulation. This section briefly describes the components of the final model, as listed above.

FDDI Token Ring

The FDDI token ring was modeled to act as a source of received frames and as a sink of transmit frames. Gross functionality for the remainder of the FDDI nodes and network components was desirable. Consequently, we designed a black-box model for the FDDI ring that provides two-way interaction with the FDDI chip set and parser model. This FDDI model allocates time on the FDDI ring for transmit and receive transactions. The model also controls a receive workload generator when frames are received by the adapter.

The receive workload generator is an analytical model used to create different patterns of receive traffic to the DEMFA. The parameters input to this workload model are the average frame size, the frame-size distribution, the frame type, the load, and the number of back-to-back frame arrivals (i.e., the burst rate or "burstiness" of the frame arrivals). We varied these parameters to generate desired workloads.

The average frame size and frame-size distribution parameters generate different size frames. Actual frame sizes can be specified as normally or exponentially distributed about the mean or as constant. The workload model can generate station management (SMT), LLC SNAP/SAP, or LLC non-SNAP/SAP frame types and can create a load between 0 and 100 Mb/s. If workloads are less than the peak FDDI bandwidth, i.e., 100 Mb/s, the frame arrival pattern can be specified as an exponential, constant, or normal distribution. The model can generate a wide range of synthetic traffic patterns, but to obtain credible performance results, we characterized the traffic as seen in realistic networks.

Several studies had been conducted on large Ethernet LANs within Digital; a case study by D. Chiu and R. Sudama is one example.[12] We analyzed the results from these studies to understand the frame-size distribution in such networks. From the analysis we concluded that

• Frame sizes on the networks are related to user protocols. Frames in a test sample were distributed about a few discrete frame sizes (i.e., modes of the distribution) rather than over a wide range of frame sizes.

• The probability function of the frame sizes near each mode can be approximated as a normal distribution centered about the mode.

A composition analysis of the measurements provided the different modal mean sizes, standard deviations, and the probabilities of frames belonging to the different modes. We used these values to statistically create Ethernet network traffic. For our performance measurements, it was necessary for us to change this traffic pattern appropriately to reflect the differences that exist between FDDI LANs and Ethernet LANs. The FDDI frame header is smaller than the Ethernet header, and the largest FDDI frame is approximately three times the size of the largest Ethernet frame. We factored these changes into the Ethernet model to produce an FDDI workload model. The FDDI workload has either four or five modes.

The four-mode distribution contained a majority of frames grouped around 60, 576, 1518, and 4496 bytes. The standard deviations of the frames around these mean values were 22, 5, 2, and 2 bytes, respectively. The frame volumes at these modal values represented contributions of 29 percent, 67 percent, 3 percent, and 1 percent, respectively, to the total load.

The five-mode frame sizes were grouped around 33, 80, 576, 1518, and 4496 bytes. The standard deviations of the frames around these means were 1, 20, 5, 2, and 2 bytes, respectively. The frame volumes at these modes contributed 26 percent, 15 percent, 55 percent, 3 percent, and 1 percent, respectively, to the total load.

In the above FDDI workload model, the mode of 1518 bytes is determined by the Ethernet network's maximum frame-size capacity and, similarly, the mode of 4496 bytes is determined by the FDDI network's maximum frame-size capacity. These two modal frame sizes represent traffic generated by large data transfer operations, e.g., file transfers. Contributions due to these two modes vary from network to network. We considered different contributions and found their effect on adapter throughput to be negligible. Therefore, only one case for each workload is presented in this paper.
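To make the four-mode workload concrete, the following C fragment sketches how such a multimodal frame-size distribution can be sampled. The means, standard deviations, and probabilities are the four-mode values quoted above; the code itself is our reconstruction, not the SIMULA generator.

    #include <stdlib.h>
    #include <math.h>

    static const double mode_mean[4] = { 60.0, 576.0, 1518.0, 4496.0 };
    static const double mode_sdev[4] = { 22.0,   5.0,    2.0,    2.0 };
    static const double mode_prob[4] = { 0.29,  0.67,   0.03,   0.01 };

    static double uniform01(void)
    {
        return (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    }

    /* Box-Muller transform: one standard normal sample. */
    static double std_normal(void)
    {
        return sqrt(-2.0 * log(uniform01())) *
               cos(2.0 * 3.14159265 * uniform01());
    }

    /* Pick a mode with the given probabilities, then draw a frame size
     * normally distributed about that mode's mean.                     */
    double sample_frame_size(void)
    {
        double u = uniform01(), cum = 0.0;
        int m;
        for (m = 0; m < 3; m++) {          /* fall through to last mode */
            cum += mode_prob[m];
            if (u <= cum)
                break;
        }
        double size = mode_mean[m] + mode_sdev[m] * std_normal();
        return size < 20.0 ? 20.0 : size;  /* clamp at the 20-byte minimum frame */
    }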

FDDI Chip Set and Parser


The FDDI chip set, also referred to as the FDDI corner, is the base-level technology that was part of Digital's strategy to build high-performance, low-cost data links for FDDI LANs. This chip set performs serial-to-parallel data conversion, acts as an interface to the packet memory in the data link layer, and can support a data rate of 100 Mb/s.[13] The entire chip set, except for the ring memory controller (RMC), was modeled as a black box with a specified per-frame latency. The RMC and the associated first in, first out (FIFO) buffers for the receive and transmit stream staging were modeled in greater detail. The detail was necessary to capture any overflow or underflow conditions that might occur in the FIFO buffers. We also modeled the interaction between the transmit and receive streams. The RMC model, which served as the front end of the chip set model, was also capable of generating control and data transactions to perform read/write memory operations.

The parser hardware off-loads some host frame processing to the adapter. The parser reads information about a receive frame from the RMC bus and creates a forwarding vector, which is appended to the frame. This forwarding vector is used by different entities in the adapter and the host to efficiently process a frame. The parser latency to generate this vector varies with the frame type and size. The parser model helped to analyze the impact of this latency on performance. This model mimics the hardware to produce a forwarding vector for a given frame with a pertinent latency.

Packet Memory Controller

The packet memory controller (PMC) is the heart of the adapter system. The ring entry mover stage, the packet buffer memory, and the packet memory interface constitute the functionality in the PMC.[8] The PMC controls the arbitration and servicing of requests to and from memory to effect the efficient transfer of information. The PMC also controls the movement of pointers corresponding to every frame. These pointers and the associated protocol generate work for the RMC, the host interface, or the adapter manager.

The high throughput capability of FDDI rings can result in traffic patterns that cause a strain on the packet memory. The PMC model allowed us to study such scenarios. It is also important to analyze the working and performance of the ring entry mover, which moves frames between different interfaces by manipulating the control information of a stored frame. The control information and frame data reside in the packet memory.

Host Interface

The host interface, also called the host protocol decoder, moves data between the adapter and the host system through an XMI bus and also interfaces with the PMC. We modeled the interface to include details of the dual direct memory access (DMA) design (one channel for the receive stream and one for the transmit stream), the staging buffers associated with each DMA channel, the XMI interface, and the PMC interface. The host interface also has the capability of scheduling write operations while waiting for the delivery of read information. Priority schemes to complete such transactions, i.e., handshake mechanisms, are important from a performance perspective and, hence, were included in the model.

XMI System

The XMI system interacts with the host system and was modeled to include details of the XMI bus and memory. This model consists of an XMI bus submodel that interfaces to the XMI end of the host interface model of the adapter. The submodel also interacts with a memory model and an XMI workload generator model. The bus submodel implements the XMI protocol.

Memory Model The memory model was designed to generate responses to transactions that request memory. Latency for these requests is the memory access time, which includes a queue wait time. There are basically two types of systems that support the DEMFA, as shown in Figure 2. The type is determined by whether the XMI is used as the CPU bus, denoted in this paper as the XMI (CPU) bus configuration, or as the I/O bus, denoted as the XMI (I/O) bus configuration. The only difference between the two systems is memory access time. This time is greater if the XMI is used as the I/O bus; there is an added latency on the read transactions performed to fetch memory from locations that are not local to the XMI bus. The memory space that is local to the CPU bus is accessed through another I/O adapter mechanism. Such I/O adapters, CPU buses, and main memory bandwidth all play a role in determining the access times in such systems. The model presented in this paper depicts the VAX 9000 I/O architecture and current implementation. Performance may vary with other implementations.
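The memory model's behavior reduces to a small amount of arithmetic: every request pays a base access time plus its queue wait, and read requests in the XMI (I/O) bus configuration pay an additional penalty because the memory is not local to the XMI bus. The sketch below illustrates the shape of that computation; the constants are placeholders of our own, not measured VAX 9000 values.

    enum bus_config { XMI_CPU_BUS, XMI_IO_BUS };

    /* Memory access time in nanoseconds: base latency plus queue wait,
     * with an added read penalty when the XMI serves as the I/O bus.
     * The 200/400 ns figures are illustrative placeholders only.      */
    double memory_access_ns(enum bus_config cfg, int is_read, double queue_wait_ns)
    {
        double t = 200.0 + queue_wait_ns;
        if (cfg == XMI_IO_BUS && is_read)
            t += 400.0;                    /* memory not local to the XMI */
        return t;
    }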


Figure 2 System Types That Support the DEMFA (top: CPU, DEMFA, and memory on a common XMI bus, the XMI (CPU) bus configuration; bottom: the DEMFA on an XMI I/O bus that reaches the CPU and memory through a system bus, the XMI (I/O) bus configuration)

XMI Workload Generator We designed the XMI workload generator to represent the load on the XMI bus, excluding traffic from the DEMFA. This load tends to have a deteriorating effect on DEMFA performance and thus is referred to as the XMI interference workload. It was important not only to model the amount of load but also to capture the arrival pattern of this traffic. The workload model generated traffic based on three inputs: the total XMI bandwidth used by other XMI nodes, the average length of each XMI transaction, and the burst rate of the frame arrivals. Transaction lengths on the XMI vary from one to five XMI cycles (i.e., 64-nanosecond cycles). The maximum number of nodes that can exist on an XMI bus is 14. Thus, the burst rate can vary from 1 to 13.

Typically, traffic on an XMI bus consists of many back-to-back transactions of various sizes. We decided to use the worst case values for both the burst rate and the transaction length in the XMI interference workload presented in this paper. The worst case burst rate is 13, and the worst case transaction length is 5 XMI cycles.
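The three inputs that drive the XMI interference generator map naturally onto a small parameter structure. The sketch below (our reconstruction, not the SIMULA source) also records the worst-case values used throughout the paper.

    #define XMI_CYCLE_NS   64              /* one XMI cycle                   */
    #define MAX_BURST      13              /* 14 nodes: up to 13 other senders */
    #define MAX_TXN_CYCLES 5

    struct xmi_interference {
        double   bandwidth_frac;           /* fraction of XMI bandwidth used   */
        unsigned txn_cycles;               /* 1..5 cycles per transaction      */
        unsigned burst;                    /* 1..13 back-to-back transactions  */
    };

    /* Worst case assumed in this paper; the bandwidth fraction is the
     * experiment variable, e.g., 0.60 for a 60 percent load.           */
    static const struct xmi_interference worst_case = {
        0.60, MAX_TXN_CYCLES, MAX_BURST
    };

    /* Bus time occupied by one interference burst, in nanoseconds. */
    unsigned burst_duration_ns(const struct xmi_interference *w)
    {
        return w->burst * w->txn_cycles * XMI_CYCLE_NS;
    }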


Host System

The host system consists of the CPU, disks, layered software, the operating system, the device driver, and a host workload generator. The host system was modeled in accordance with assumptions presented in the section Results from Performance Simulation. The CPU, disks, host software, and the operating system were modeled in such a way that they do not become bottlenecks during frame reception or transmission. A model of the device driver handles frame transmission and reception. The driver interacts with a host workload generator, which creates different traffic patterns for transmission. This workload generator has the same capabilities as the receive workload generator discussed in an earlier section.

Results from Performance Simulation

The data presented in this section was generated using the simulation model of the adapter. This data represents the hardware performance of the DEMFA; system performance with the DEMFA as a component is not within the scope of this paper. We input parameters to the simulation model that defined traffic patterns and ran simulations for a sufficient length of time to ensure that we captured steady-state behavior. The models maintain statistics of relevant events and quantities and print out this information at the end of a simulation. As discussed previously, the hardware performance of the DEMFA varies depending upon whether the system is implemented to use the XMI bus as a CPU bus or as an I/O bus. This section presents simulation results for both uses, where appropriate.

Assumptions

For our simulation purposes, we made several assumptions. These assumptions make the results more general and bring out the hardware performance characteristics of the DEMFA, indicating the upper bounds of performance that the adapter can achieve.

CPU and Software Capabilities The device driver and the host software do not become bottlenecks during frame reception and transmission. We assumed that the host CPU had enough computing ability to process frames without posing as a performance bottleneck.

Memory Bandwidth Frames sent from or received by the host result in XMI bus transactions that are written to or read from the host memory. Throughput varies with the memory implementation and interleaving. We assumed that the memory implementation and interleaving were selected such that no overloading of the memory occurs, thus eliminating wasted bus cycles.

Buffer Alignment and Segmentation We assumed that data for transmission and buffers for reception were hexaword (i.e., 32-byte) aligned and that frames were unsegmented.

Simulation Traffic No error frames or error transactions were simulated, since we assumed these to be negligible. No adapter manager traffic was simulated during the performance measurements, since such frames represent a very negligible fraction of the frames received during steady-state ring operation.

Throughput Measurements

Measurements were made to determine the throughput that the adapter can sustain for received and transmitted frames. It is important to understand how throughput is related to the load, the burstiness of frame arrivals, the percent XMI interference, and the frame size. This section presents the results of the throughput measurements as functions of these parameters.

Receive Throughput as a Function of the Load The graph shown in Figure 3 is the result of several experiments conducted by varying the load for 33-byte received frames. The frame arrival rates depend on the load and the arrival rate distribution. As mentioned earlier, the model is capable of simulating traffic with different arrival patterns. Figure 3 shows that, with an exponential arrival pattern, the throughput increases at a rate proportional to the load up to a certain point, and then gradually decreases until the load is 100 Mb/s. The decrease in throughput is caused by the loss of resources due to excessive loading.

Figure 3 Receive Throughput as a Function of the Load for a 33-byte Frame (throughput in Mb/s versus load in Mb/s; key: exponential, constant)

We simulated traffic with a constant arrival pattern and conducted the same experiments. These results are also shown in Figure 3. Observe that the point of maximum throughput and the rate at which the throughput decreases after reaching the maximum vary with the arrival pattern of traffic. After performing experiments on other frame sizes, we concluded that there is no fixed relationship between the maximum achievable throughput and the throughput at FDDI saturation (i.e., 100-Mb/s load). Also, there is graceful degradation in throughput after the peak.

Receive Throughput for Four- and Five-mode Workloads We measured adapter receive throughput for four- and five-mode workloads with a load of 100 Mb/s. The XMI interference workload was varied, and the results are presented in Figure 4. The adapter can receive the workload at 100 Mb/s if the XMI interference workload remains moderate. Figure 4 also shows that there is very little difference in performance between the four- and five-mode workloads. Large frames constitute a major part of both workloads, and larger frames can be easily supported by the DEMFA at full FDDI data bandwidth.

Receive Throughput as a Function of Frame Size Figure 5 shows the throughput as a function of the frame size and the XMI interference workload, with the DEMFA attached to an XMI (CPU) bus.


Smaller frames have a lower throughput rate than larger ones because of high control/data overhead. Since control transactions consume bandwidth, the bandwidth available for data movement is reduced. Consequently, the overall throughput rate is lower. Another reason for lower adapter throughput is the XMI utilization by traffic from other nodes on the XMI bus. This XMI interference results in less available XMI bandwidth for the adapter and hence, less throughput.

The adapter throughput for an XMI (I/O) bus configuration differs only slightly from that for an XMI (CPU) bus configuration. Any differences that exist are for frames smaller than 64 bytes, since the adapter experiences a per-frame latency cost because the memory is not local to the XMI bus.

Figure 4 Receive Throughput as a Function of XMI Interference for an XMI (CPU) Bus Configuration (throughput in Mb/s versus XMI interference in percent; key: four-mode workload, five-mode workload)

Figure 5 Receive Throughput as a Function of the Frame Size for an XMI (CPU) Bus Configuration (throughput in Mb/s versus frame size in bytes; key: XMI interference workload of 60, 40, 20, and 0 percent)

Transmit Throughput for Four- and Five-mode Workloads Figure 6 illustrates the transmit throughput for a four-mode workload as a function of the XMI interference. We performed simulations to obtain throughput data for the DEMFA when attached to an XMI (CPU) bus or to an XMI (I/O) bus. Throughput for the XMI (CPU) bus configuration is 100 Mb/s and is insensitive to low XMI interference loads. In contrast, XMI (I/O) bus configuration measurements are negatively affected by all levels of XMI interference traffic. The higher read latency that is inherent to an XMI (I/O) bus configuration degrades further with increasing interference traffic. In addition, the degradation appears to be linear. The throughputs observed for the five-mode workloads are very similar to the data shown in Figure 6.

Figure 6 Transmit Throughput as a Function of XMI Interference for a Four-mode Workload (throughput in Mb/s versus XMI interference in percent; key: XMI (CPU) bus, XMI (I/O) bus)

Transmit Throughput as a Function of the Frame Size Figure 7 shows the throughput as a function of the frame size when the DEMFA is attached to an XMI (CPU) bus. Throughput is also presented for various XMI interference workloads. As in the case of receive throughput, transmit throughput degrades as the frame size decreases and the XMI interference load increases. This degradation is again attributed to high control/data overhead and lower XMI bandwidth availability.

140 read access time resulting from the XMI (110) bus configuration. The transmit operation consists 120 mainly of read transactions and hence, this latency is crucial to transmit performance.

Latency Measurements Latency, as it relates to the DEMFA, is explained in the Definition of Metrics section. We measured the 20 latency fo r receive and transmit frames. Frame latency consists of two components: the active 0 ��--�--��----�-r--�--�-+ 20 50 200 500 2000 5000 component, which contributes to the time when 1 0 1 00 1000 10000 FRAME SIZE (BYTES) the frame or a portion thereof is being processed at KEY: a service center (also called the service time); and XMI INTERFERENCE WORKLOAD the passive component, which is the time when the 60 PERCENT frame or a portion thereof waits for access to the 40 PERCENT 20 PERCENT service center. All latency data presented in this PERCENT 0 section represents averages across a large number of samples. When measuring the latency of a frame, FiguTe 7 Tra nsmit Throughput as a Function we applied the maximum load that can be sus­ of the Frame Size fo r an tained continuously fo r that frame size and type. Xll1I (CPU) Bus Configuration Receive Latency as a Function of the Frame Size again attributed to high control/data overhead and Figure 9 represents the receive latency data as a function of the frame size fo r an XMI (CPU) bus con­ lower Xt'\ 11 bandwidth availability. XMI Figure 8 shows adapter transmit throughput as a figuration. Latency is also presented fo r various interference levels. We present performance data fu nction of the frame size fo r an Xt\1 11 (110) bus con­ fo r only one XMI configuration because there is lit­ figuration. The transmit throughput is less when the OEM FA is used with an XMI (1/0) bus rather than tle variation between the results fo r the XMI (CPU) XMI (I/O) with an XMI (CPU) bus, due to the larger amount of bus and bus configurations. Both frame

140 1000 (jj 500 0 z 0 200 (..) w 100 50 a:8 - - - ..-- - - (..) � 20 i:i 10 dJ 5 20 � ...J 2 o r-�--��-+--�--�--r-�--��-+ 50 200 500 2000 5000 20 50 200 500 2000 5000 10 1 00 1000 1 0000 10 100 1000 10000 FRAME SIZE (BYTES) FRAME SIZE (BYTES) KEY: KEY:

XMI INTERFERENCE WORKLOAD XMI INTERFERENCE WORKLOAD 60 PERCENT 60 PERCENT 40 PERCENT 40 PERCENT 20 PERCENT 20 PERCENT PERCENT 0 PERCENT 0

Figure 8 Tra nsmit Th roughput as a Function Figure 9 Receive Latency as a Function of the Fra me Size fo r the of the Frame Size fo r an XMJ (fl O) Bus Configuration XMI(CP U) Bus Co nfig uration

Digital Te chnical jo urnal Vo l. 3 No. 3 Summer 1991 73 Network Performance and Adapters

size and latency are plotted using Logarithmic scales. 1000 (j) 500 The data illustrates that XMI latency increases lin­ 0 z early with increased X.i\11 interference. 0 200 u w 100 (f) 0 Transmit Latency as a Function of the Fra me Size a: 50 u Figure 10 presents transmit latency results fo r an � 20 Xt\1 1 (CPU) bus configuration and Figure presents >- 10 11 u XMI (1!0) z the results fo r an bus configuration. The w 5 f- latency was measured as a fu nction of the frame j 2 size for various XMI interference workloads. Transmit latency is more sensitive to the system 20 50 200 500 2000 5000 10 100 1000 10000 type and to the Xt\11 interfe rence workload because FRAME SIZE (BYTES) Xt\11 most transactions that constitute transmit traf­ KEY:

fic are read operations. There is a distinctly higher XMI INTERFERENCE WORKLOAD latency cost associated with these transactions in 60 PERCENT 40 PERCENT X:\11(1/0) the bus configuration as compared to the 20 PERCENT Xt\11 (CPU) bus configuration. As in the case of 0 PERCENT receive latency, the transmit latency degrades with X.MI interference. Figure 11 Tra nsmit Latency as a Function of the Frame Size fo r an Pe rformanceMe asurements with the XMI Bus Configuration Prototyp e DEMFA (1/ 0) The intent of performing measurements with the prototype DEMFA was twofold. First, we wanted to Again, we present only hardware performance confirm the performance predictions arrived at measurements; system performance with the through simulation. And second, we wanted to DEMFA is beyond the scope of this paper. measure some fe atures that we did not implement in the model, either because they were not quantifi­ Measurement Setups able or because they were too complex to model. The experimental configuration required to per­ fo rm the measurements on the prototype DEMFA is shown in Figure 12. This configuration con­ 1000 sists of a VAX 6000 processor connected to a (j) 500 0 DECconcentrator 500. The VA.'\: 6000 system has an z 0 200 XMI backplane. The DEMFA occupies one of the u w 100 (f) � 50 u STANDALONE 20 LOG IC ANALYZER � OPERATING SYSTEM iJ 10 AND DEVICE DRIVER 5 f-i5 VAX 6000 SYSTEM FDDI TESTER 2 j DEC FDDICONTROLLER 400 20 50 200 500 2000 5000 (DEMFA) 10 100 1000 10000

FRAME SIZE (BYTES) KEY:

XMI INTERFERENCE WORKLOAD 60 PERCENT 40 PERCENT 20 PERCENT FDDI RING NETWORK 0 PERCENT

Figure 10 Tra nsmit Latencyas a Fu nction of the Fra me Size fo r an Figure 12 Laboratory Setup fo r DEMFA XMI (CP U) Bus Co nfiguration Performance Measurements


The DEMFA occupies one of the slots in the XMI backplane and is part of the XMI (CPU) bus configuration in this system. An FDDI tester is also attached to the DECconcentrator 500 and acts as a source of frames. The FDDI tester is a useful tool for testing the DEMFA product; the tester is capable of transmitting traffic at 100 Mb/s and can generate frames of various sizes and types with different destination addresses. A standalone software driver and operating system runs on the VAX 6000 system and is used for DEMFA hardware performance tests. A logic analyzer is used to measure elapsed time and count events.

Throughput Measurements

The device driver measures receive and transmit throughput and is designed to perform minimal processing for each frame.

Receive Throughput Measurements We measured the receive throughput by sending a continuous stream of frames at 100 Mb/s from the FDDI tester to the DEMFA. We varied the frame size for the tests and ran each test for a length of time sufficient to verify data convergence.

We compared the prototype measurements with the modeled results for receive throughput as a function of the frame size for an XMI (CPU) bus configuration. This validation of the receive throughput results is shown in Figure 13. The hardware measurements demonstrate that the adapter can receive frame sizes above 69 bytes at 100 Mb/s. Throughput degrades for smaller frame sizes. These measurements closely validate the modeled results. The throughput for the performance model demonstrates that the DEMFA can continuously receive frames greater than 65 bytes at 100 Mb/s. There is a slight difference between the measured and modeled results at the lower frame sizes because residual XMI interference traffic exists in the measured system. This experimental error is unavoidable, but the difference is a small percentage of the total throughput and is therefore acceptable.

Transmit Throughput Measurements To measure the transmit throughput, we forwarded frames from the driver to the FDDI ring at the maximum possible rate. The throughput was calculated from the number of frames that could be sent in a unit of time. The adapter can transmit frames larger than 51 bytes at 100 Mb/s. Transmit throughputs measured in the laboratory validate the modeled results as closely as the receive throughput validation results shown in Figure 13. The modeled throughput results were lower than the measured results because we used a conservative approach to modeling the memory latency.

Multisegmented and Misaligned Frames Segmentation and alignment of transmit frame buffers in host memory is variable. Typically, frames consist of two segments, the first containing the frame header information and the second containing the data. Since the DEMFA must access control and data separately, segmentation makes this process less efficient, from a hardware perspective, than if the data and control information exist in the same buffer. Also, buffers may be aligned to start on different byte boundaries. Since DEMFA transactions begin on hexaword (i.e., 32-byte) boundaries, hexaword alignment of frame data in the host buffers is the most efficient arrangement from the adapter's perspective. We measured throughput with unsegmented and two-segmented frames, and with frames aligned on longword, quadword, and hexaword byte boundaries. Segmentation and alignment variations cause negligible throughput degradation for frames 64 bytes or larger.

Latency Measurements

We used the logic analyzer to measure the frame latency. The logic analyzer responds to signals that indicate the starting and ending times for processing a frame. The difference between these two times is the frame latency.

Figure 13 Validation of Receive Throughput Results (throughput in Mb/s versus frame size in bytes; key: measured, modeled)


The events were chosen such that the measurements conformed to the definition of latency as described in the Definition of Metrics section.

Note that the traffic pattern used to measure latency in this section differs from the workload illustrated in the section Results from Performance Simulation. Here, a single frame was received or transmitted, and we measured latency due to that frame only. Whereas previously, we used the simulation model to measure latency as an average across a large number of frames representing a load equal to the maximum sustainable adapter throughput. The workloads differ because of the practical difficulty of performing latency measurements on a large number of frames.

Receive Latency The receive frame latency predictions from the performance model and adapter service time measurements taken from the prototype hardware are shown in Figure 14. These latency measurements validate the model predictions in a way similar to that for the throughput measurements.

Transmit Latency We also compared transmit latency measurements to predictions from the performance model and found these measurements to approximate the modeled results. But actual latency measurements were slightly lower than the modeled results, again due to a conservative modeled latency.

Figure 14 Validation of Receive Frame Latency Results (latency in microseconds versus frame size in bytes; key: measured, modeled)

Conclusions

The performance model was intended to track the performance of the prototype hardware to an accuracy of plus or minus 10.0 percent. The comparisons between the modeled and measured results demonstrate that the model actually surpasses our goal. The measured performance for the XMI (I/O) bus configuration using a VAX 9000 system validated the modeled results as closely as did the corresponding results for the XMI (CPU) bus configuration. Disparity, if any, between the modeled and the measured results basically stems from unavoidable measurement errors for receive frames and pessimistic memory latency assumptions for transmit frames.

Throughput due to the four- and five-mode workloads is nearly the same. The average frame size for these distributions is 496 bytes and 487 bytes, respectively. Thus, throughput is a function of the frame size and independent of the number of modes that exist in the workload. Also, this data leads to the conclusion that the DEMFA may never pose as a performance bottleneck in a real network environment.

For the simulation, we chose an XMI workload with an extremely high burst rate. Actual XMI systems may result in better throughput than that presented in this paper. The resources required to create XMI workload variations are not easily accessible, so we did not perform measurements on the prototype adapter under different workload conditions. But since other measurements validated the model predictions so closely, measuring performance with varied XMI workloads proved unnecessary.

Validation of the results that we predicted through simulation increased our confidence in various design mechanisms that were verified using the performance model as a test platform. When designing new I/O architecture or memory implementations, our performance model allows changes to be made easily in order to determine the impact of such changes on performance. The modeling strategy proved very effective and helped to deliver a high-quality product with better performance than what was intended initially.

Acknowledgments

I wish to acknowledge all members of the DEMFA development group for their help in modeling the adapter. Their openness to examine new designs to enhance performance resulted in this high-speed adapter. I am also grateful to the group for assisting with the performance measurements. Finally, I wish to extend special thanks to Howard Hayakawa, Gerard Koeckhoven, Satish Rege, Andy Russo, and Dick Stockdale.


References

1. R. Gillett, "Interfacing a VAX Microprocessor to a High-speed Multiprocessing Bus," Digital Technical Journal, no. 7 (August 1988): 28-46.

2. W. Hawe, R. Graham, and P. Hayden, "Fiber Distributed Data Interface Overview," Digital Technical Journal, vol. 3, no. 2 (Spring 1991): 10-18.

3. B. Allison, "An Overview of the VAX 6200 Family of Systems," Digital Technical Journal, no. 7 (August 1988): 19-27.

4. D. Fite, Jr., T. Fossum, and D. Manley, "Design Strategy for the VAX 9000 System," Digital Technical Journal, vol. 2, no. 4 (Fall 1990): 13-24.

5. F. Ross, "FDDI-A Tutorial," IEEE Communications Magazine, vol. 24, no. 5 (May 1986): 10-17.

6. S. Joshi, "High-Performance Networks: A Focus on the Fiber Distributed Data Interface (FDDI) Standard," IEEE MICRO (June 1986): 8-14.

7. R. Stockdale and J. Weiss, "Design of the DEC LANcontroller 400 Adapter," Digital Technical Journal, vol. 3, no. 3 (Summer 1991, this issue): 36-47.

8. S. Rege, "The Architecture and Implementation of a High-performance FDDI Adapter," Digital Technical Journal, vol. 3, no. 3 (Summer 1991, this issue): 48-63.

9. Programmer's Reference Manual for SIMULA for VAX under VMS Operating System (North Berwick, Scotland: EASE Ltd., 1991).

10. G. Birtwistle, O. Dahl, B. Myhrhaug, and K. Nygaard, SIMULA BEGIN (Kent, England: Chartwell-Bratt Ltd., 1980).

11. L. Kleinrock, Queueing Systems, vols. 1 and 2 (New York: John Wiley and Sons, 1976).

12. D. Chiu and R. Sudama, "A Case Study of DECnet Applications and Protocol Performance," Proceedings of the ACM SIGMETRICS Conference (May 1988).

13. H. Yang, B. Spinney, and S. Towning, "FDDI Data Link Development," Digital Technical Journal, vol. 3, no. 2 (Spring 1991): 31-41.

Raj Jain

Performance Analysis of FDDI

The performance of an FDDI LAN depends upon configuration and workload parameters such as the extent of the ring, the number of stations on the ring, the number of stations that are waiting to transmit, and the frame size. In addition, one key parameter that network managers can control to improve performance is the target token rotation time (TTRT). Analytical modeling and simulation methods were used to investigate the effect of the TTRT on various performance metrics for different ring configurations. This analysis demonstrated that setting the TTRT at 8 milliseconds provides good performance over a wide range of configurations and workloads.

This paper is a modified version of "Performance Analysis of FDDI Token Ring Networks: Effect of Parameters and Guidelines for Setting TTRT," by Raj Jain, published in the Proceedings of the ACM SIGCOMM '90, September 1990. Copyright 1990, Association for Computing Machinery, Inc.

Fiber distributed data interface (FDDI) is a 100-megabit-per-second (Mb/s) local area network (LAN) defined by the American National Standards Institute (ANSI).1-3 This standard allows as many as 500 stations to communicate by means of fiber-optic cables using a timed-token access protocol. Normal data traffic and time-constrained traffic, such as voice, video, and real-time applications, are supported. All major computer and communications vendors and manufacturers offer products that comply with this standard.

Unlike the token access protocol defined by the IEEE 802.5 standard, the timed-token access protocol used by FDDI allows synchronous and asynchronous traffic simultaneously.3 The maximum access delay, i.e., the time between successive transmission opportunities for a station, is bounded for both types of traffic. Although this delay is short for synchronous traffic, that for asynchronous traffic varies with the network configuration and load and can be as long as 165 seconds. Long maximum access delays are undesirable and can be avoided by properly setting the network parameters and configurations.

This paper begins with a description of the timed-token access method used by FDDI stations and then proceeds to discuss how various parameters affect the performance of these systems. The target token rotation time (TTRT) is one of the key parameters. We investigated the effect of the TTRT on FDDI LAN performance and developed guidelines for setting the value of this parameter. The results of our investigation constitute a significant portion of this paper.

Timed-token Access Method

The token access method for network communication, as defined by the IEEE 802.5 standard, operates in the following manner. A token circulates around the ring network. A station that wants to transmit information waits for the arrival of the token. Upon receiving the token, the station can transmit for a fixed time interval called the token holding time (THT). The station releases the token either immediately after completing transmission or after the arrival of all the transmitted frames. The time interval between two successive receptions of the token by a station is called the token rotation time (TRT). Using this scheme, a station on an n-station ring may have to wait as long as n times the THT interval to receive a token. This maximum access delay may be unacceptable for some applications if the value of either n or THT is large. For example, voice traffic and real-time applications may require that this delay be limited to 10 to 20 milliseconds (ms). Consequently, using the token access method severely restricts the number of stations on a ring.

The timed-token access method, invented by Grow, solves this problem by ensuring that all stations on a ring agree to a target token rotation time (TTRT) and limit their transmissions to meet this target.4 There are two modes of transmission: synchronous and asynchronous. Time-constrained applications such as voice and real-time traffic use the synchronous mode. Traffic that does not have time constraints uses the asynchronous mode. A station can transmit synchronous traffic whenever it receives a token; however, the total transmission time for each opportunity is short. This time is allocated at the ring initialization. A station can transmit asynchronous traffic only if the TRT is less than the TTRT.

The basic algorithm for asynchronous traffic is as follows. All stations on a ring agree on a target token rotation time. Each station measures the time elapsed since last receiving the token, i.e., the TRT. On token arrival, a station that wants to transmit computes a token holding time using the following formula:

THT = TTRT - TRT

If the value of THT is positive, the station can transmit for this time interval. After completing transmission, the station releases the token. If a station does not use its entire THT, other stations on the ring can use the remaining time through successive applications of the algorithm.
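To make the asynchronous rule concrete, here is a minimal Python sketch of the token holding time computation just described; the function and argument names are ours for illustration and are not part of the ANSI standard.

    # Sketch of the asynchronous timed-token rule (names are ours).
    # A station remembers when it last received the token; on the next
    # arrival it may hold the token only for THT = TTRT - TRT, if positive.
    def asynchronous_holding_time(now, last_token_arrival, ttrt):
        trt = now - last_token_arrival   # token rotation time (TRT)
        tht = ttrt - trt                 # token holding time (THT)
        return max(tht, 0.0)             # a late token is unusable

    # Example: TTRT = 8 ms and the token last arrived 5 ms ago,
    # so the station may transmit asynchronous frames for 3 ms.
    print(asynchronous_holding_time(now=5.0, last_token_arrival=0.0, ttrt=8.0))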


Note that even though the stations attempt to keep the TRT below the target, they do not always achieve this goal. The TRT can exceed the target by as much as the sum of all synchronous-transmission time allocations; however, these allocations are limited so that their sum is less than the TTRT. As a result, the TRT is always less than twice the TTRT.

Performance Parameters

The performance of any system depends upon both system parameters and workload parameters, as shown in Figure 1. There are two kinds of system parameters: fixed and user-settable. Fixed parameters cannot be controlled by the network manager and vary from one ring to another. Cable length and the number of stations on a ring are examples of fixed parameters. It is important to study network performance with respect to these parameters; if performance is sensitive to them, each set of fixed parameters may require a different guideline. System parameters that can be set by the network manager or the individual station manager include various timer values. Most of these timers influence the reliability of the ring and the time it takes to detect a malfunction.

[Figure 1 is a tree diagram. Performance parameters divide into system parameters and workload parameters. System parameters are either settable (the target token rotation time (TTRT) and the synchronous time allocation) or fixed (cable length and the number of stations). Workload parameters are the number of active stations and the load per station, the latter characterized by the interburst time and the burst-size, frame-size, and interframe-time distributions. The parameters shown in dashed boxes in the original figure were not considered in this study.]

Figure 1 Performance Parameters


The key settable parameters that affect system performance are the TTRT and the synchronous time allocations. In this paper, the performance was studied under asynchronous traffic conditions only. The presence of synchronous traffic will further restrict the choice of TTRT, as discussed later in the section Guidelines for Setting the Target Token Rotation Time.

The workload also has a significant impact on performance. A guideline or recommendation may be suitable for one workload but not for another. The key workload parameters are the number of active stations and the load per station. By active we mean stations on a ring that are either transmitting or waiting to transmit. A ring may contain a large number of stations, but generally only a few are active at any given time. Active stations include the currently transmitting station, if any, and stations that have frames to transmit and are waiting for the access right, i.e., for a usable token to arrive. The load per station varies with the number of stations, the interval between bursts, the number of frames per burst, and the frame size.

Performance Metrics

The quality of service a system provides is measured by its productivity and responsiveness.5 For an FDDI LAN, productivity is measured by its throughput, and responsiveness is measured by the response time and maximum access delay.

The productivity metric of concern to the network manager is the total throughput measured in megabits per second. Over any reasonable time interval and for most loads, the throughput is equal to the load. For example, if the load on a ring is 40 Mb/s, then the throughput is also 40 Mb/s. But this does not hold if the load is high. For example, if there are three stations on a ring, each with a 100-Mb/s load, the total arrival rate is 300 Mb/s and the throughput is considerably less, close to 100 Mb/s. Thus, the key productivity metric is not the throughput under low load but the maximum obtainable throughput under high load. This latter quantity is also known as the usable bandwidth of the network. And the ratio of the usable bandwidth to the nominal bandwidth (e.g., 100 Mb/s for an FDDI LAN) is called the efficiency of the network. For instance, if we consider a set of configuration and workload parameters with a usable FDDI bandwidth of at most 90 Mb/s, the efficiency is 90 percent.

The response time is the time interval between the arrival of a frame and the completion of its transmission, including queuing delay, i.e., from the first bit in to the last bit out. This metric is meaningful only if a ring is not saturated, because at loads near or above capacity the response time approaches infinity. With such loads, the maximum access delay for a station, i.e., the time interval between wanting to transmit and receiving a token, has more significance.

Another metric that is of interest for a shared resource such as FDDI is the fairness with which the resource is allocated. Fairness is particularly important under a heavy load. Given such a load, the asynchronous bandwidth is allocated equally to all active stations. However, the FDDI protocols are fair only if the priority levels are not implemented. In the case of multiple priority implementation, it is possible for two stations with the same priority and the same load to have different throughput depending upon their location.6 Low-priority stations that are close to high-priority stations may get better service than those farther downstream. We assumed a single priority implementation to keep the analysis simple. Since such implementations have no fairness problem, this metric will be discussed no further in this paper.

We used two methods to analyze performance: analytical modeling and simulation. We first present the analytical model used to compute the efficiency and maximum access delay of a network under a heavy load. Then we discuss the simulation model workload used to analyze the response time at loads below the usable bandwidth.

A Simple Analytical Model

The maximum access delay and efficiency are meaningful only under heavy load. Therefore, we assume that there are n active stations, each generating enough frames to saturate the FDDI network.

Basic Equations

For an FDDI network with a ring latency D (i.e., the time it takes a bit to travel around the ring) and a TTRT value of T, the efficiency and maximum access delay are computed using the following formulas:

Efficiency = n(T - D) / (nT + D)                  (1)

Maximum access delay = (n - 1)T + 2D              (2)

Equations (1) and (2) constitute the analytical model. Their derivation is simple and is presented in the next section. Those readers who are not interested in the details can proceed to the section Application of the Model.
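Equations (1) and (2) translate directly into code. The following minimal Python helpers, with names of our own choosing, are reused in later sketches; T and D must be expressed in the same time unit.

    # Equations (1) and (2): n active stations, TTRT = T, ring latency = D.
    def efficiency(n, T, D):
        """Fraction of the nominal 100-Mb/s bandwidth that is usable."""
        return n * (T - D) / (n * T + D)

    def max_access_delay(n, T, D):
        """Worst-case wait for a usable token, in the units of T and D."""
        return (n - 1) * T + 2 * D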


Derivation

First consider a ring with three active stations, as shown in Figure 2. (Later, we will consider the general case of n active stations.) The figure shows the space-time diagram of various events on the ring. The space is shown horizontally, and the time t is shown vertically. The token reception is denoted by a thick horizontal line segment. The interval of time during which transmission of frames takes place is indicated by a thick vertical line segment.

Assume that all stations are idle until t = D, when the three active stations suddenly get a large (infinite) burst of frames to transmit. The sequence of events shown in Figure 2 is as follows:

1. t = 0. Station S1 receives the token and resets its own token rotation timer to zero. Since the station has no frames to transmit, the token proceeds to the next station S2.

2. t = t12. Station S2 receives the token and resets its token rotation timer to zero. t12 is equal to the signal propagation delay from S1 to S2.

3. t = t13. Station S3 receives the token and resets its token rotation timer to zero. t13 is equal to the signal propagation delay from S1 to S3.

4. t = D. Station S1 receives the token. Since S1 now has an infinite supply of frames to transmit, it captures the token and determines that the TRT is D. Thus, the time interval during which S1 can hold the token, the difference between TTRT and TRT, is T - D.

5. t = T. The THT at station S1 expires. S1 releases the token.

6. t = T + t12. Station S2 receives the token. S2 last received the token at t = t12; thus, the value of TRT is T. S2 cannot use the token at this time and releases it.

7. t = T + t13. Station S3 receives the token. S3 last received the token at t = t13; thus, its TRT is also T. S3 cannot use the token at this time and releases it.

8. t = T + D. Station S1 receives the token. S1 last received the token at t = D; its TRT is also T. (Note that the TRT is measured from the instant the token arrives at a station's receiver, i.e., event 4 for station S1, and not from the time it leaves a station's transmitter, i.e., event 5.) S1 cannot use the token and releases it.

9. t = T + D + t12. Station S2 receives the token. Since TRT is only D, it sets the THT to the remaining time, i.e., T - D. S2 transmits for that interval and releases the token at t = T + D + t12 + (T - D).

10. t = 2T + t12. The THT at station S2 expires. S2 releases the token.

[Figure 2 is a space-time diagram of the events listed in this section for three active stations, S1, S2, and S3, on an FDDI network. Key: thick horizontal segments mark token reception; thick vertical segments mark transmission of frames; T is the target token rotation time; D is the ring latency; the numbers in the figure refer to the event numbers discussed in the text.]

Figure 2 Space-time Diagram of Events with Three Active Stations on an FDDI Network


11. t = 2T + t13. Station S3 receives the token. Since TRT is T, S3 releases the token.

12. t = 2T + D. Station S1 receives the token. Since TRT is T, S1 releases the token.

13. t = 2T + D + t12. Station S2 receives the token. Since TRT is T, S2 releases the token.

14. t = 2T + D + t13. Station S3 receives the token. Since TRT is only D, it transmits for the time interval T - D and releases the token at t = 2T + D + t13 + (T - D).

15. t = 3T + t13. The THT at station S3 expires. S3 releases the token.

16. t = 3T + D. Station S1 receives the token, and the sequence of events begins to repeat. The token passes through stations S1, S2, and S3, all of which find it unusable.

19. t = 3T + 2D. The cycle continues with S1 capturing the token as in event 4.

The above discussion illustrates that the system goes through a cycle of events and that the cycle time is 3T + D. During every cycle, each of the three stations transmits for a time interval equal to T - D; the total transmission time is 3(T - D). The number of bits transmitted during this time is 3(T - D) × 10^8, and the throughput equals 3(T - D) × 10^8 / (3T + D) bits per second. The efficiency, i.e., the ratio of the throughput to the FDDI bandwidth of 100 Mb/s, is 3(T - D) / (3T + D).

During the cycle, each station waits for a time interval of 2T + 2D after releasing the token for another opportunity to transmit. This interval is the maximum access delay. For lower loads, the access delay is shorter. Thus, for a ring with three active stations,

Efficiency = 3(T - D) / (3T + D)

Maximum access delay = (3 - 1)T + 2D = 2T + 2D

To generalize the above analysis for n active stations, substitute n for 3. Equations (1) and (2) are the results; the derivation is complete.

Application of the Model

Equations (1) and (2) can be used to compute the maximum access delay and the efficiency for any FDDI ring configuration. For example, consider a ring with 16 stations and a total fiber length of 20 kilometers (km). (Using a two-fiber cable, this corresponds to a cable length of 10 km.) Light waves travel along the fiber at a speed of 5.085 microseconds per kilometer (µs/km). The station delay between receiving and transmitting a bit is approximately 1 µs per station. The ring latency can be computed as follows:

Ring latency D = (20 km) × (5.085 µs/km) + (16 stations) × (1 µs/station)
              = 0.12 milliseconds (ms)

Assuming a TTRT of 5 ms and all 16 stations active,

Efficiency = 16(5 - 0.12) / (16 × 5 + 0.12) = 97.5 percent

Maximum access delay = (16 - 1) × 5 + 2 × 0.12 = 75.24 ms

Thus, on this ring, the maximum possible throughput is 97.5 Mb/s. If the load is greater than this for any substantial length of time, the queues will build up, the response time will increase, and the stations may start to lose frames due to insufficient buffers. The maximum access delay is 75.24 ms; thus, asynchronous stations may have to wait as long as 75.24 ms to receive a usable token.

The key advantage of this model is its simplicity, which allows us to see immediately the effect of various parameters on network performance. With only one active station, which is usually the case, equation (1) becomes

Efficiency (n = 1) = (T - D) / (T + D)

As the number of active stations increases, the efficiency increases. With a very large number of stations,

Maximum efficiency (n = ∞) = 1 - D/T

This efficiency formula is easy to remember and permits "back-of-the-envelope" calculations of FDDI LAN performance. This special case of n = ∞ has already been studied.7

Similarly, we can use equation (2) to calculate the maximum access delay with one active station as follows:

Maximum access delay (n = 1) = 2D

That is, a single active station may have to wait as long as twice the ring latency between successive transmissions, because every alternate token that it receives would be unusable. For n = ∞, the maximum access delay approaches infinity.
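As a numerical check, a short sketch using the helper functions defined earlier reproduces the 16-station, 20-km example above.

    # Worked example from the text (times in ms).
    D = 20 * 5.085e-3 + 16 * 1e-3        # ring latency: about 0.12 ms
    T = 5.0                              # TTRT of 5 ms
    print(efficiency(16, T, D))          # about 0.975, i.e., 97.5 percent
    print(max_access_delay(16, T, D))    # about 75.24 ms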


Simulation Workload

One way to measure the responsiveness of a system is to use simulation to analyze the response time. This metric depends upon the frame arrival pattern of the workload and is discussed further in the Response Time section. The workload we used in our simulations was based on an actual measurement of traffic at a customer site. The chief application at this site was warehouse and inventory control (WIC). Hence, we named the workload WIC.

Previous network measurements show that when a station wants to transmit, it generally transmits not one frame, but a burst of frames. The WIC workload has this trait as well. Therefore, we used a bursty Poisson arrival pattern in our simulation model with an interburst time of 1 ms and five frames per burst.

We limited the frames to two sizes: 65 percent of the frames were small (100 bytes), and 35 percent were large (512 bytes). This workload constitutes a total load per station of 1.22 Mb/s. Forty stations, each executing this load, would utilize 50 percent of the FDDI bandwidth. Higher load levels can be obtained either by reducing the interburst time or increasing the number of stations on the ring.
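The frame mix is easy to state in code. This hypothetical sketch (names ours) draws one five-frame burst with the stated 65/35 split; it deliberately simplifies the bursty Poisson arrival process to the frame-size mix alone.

    import random

    # Hypothetical sketch of the WIC frame mix: five frames per burst,
    # 65 percent small (100 bytes) and 35 percent large (512 bytes).
    def wic_burst(rng):
        return [100 if rng.random() < 0.65 else 512 for _ in range(5)]

    rng = random.Random(1991)
    print(wic_burst(rng))            # one burst, e.g., [100, 512, 100, 100, 100]
    print(0.65 * 100 + 0.35 * 512)   # mean frame size: 244.2 bytes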

Guidelines for Setting the Target Token Rotation Time

This section presents the rules specified by the ANSI FDDI media access control standard for setting the value of the TTRT. We also discuss efficiency, maximum access delay, and response time considerations, as well as reasons to limit the value of TTRT.

ANSI FDDI Standard

According to the ANSI FDDI standard, the following rules must be observed when setting the TTRT:

1. Since the TRT can be as long as twice the TTRT, a synchronous station may have to wait a time interval of up to 2T before receiving the token. Therefore, synchronous stations should request a TTRT value of one-half the required service interval. For example, a voice station that wants to receive a token every 20 ms or less should request a TTRT of 10 ms.

2. The TTRT must allow transmission of at least one maximum-size frame in addition to the synchronous time allocation, if any. That is,

   TTRT > ring latency + token time + maximum frame time + synchronous time allocation

   The maximum-size frame on FDDI is 4500 bytes plus preamble and takes approximately 0.361 ms to transmit. The maximum ring latency is 1.773 ms. The token time (11 bytes including 8 bytes of preamble) is 0.00088 ms. This rule, therefore, requires that the TTRT be set at a value greater than or equal to 2.13 ms plus the synchronous time allocation. Violating this rule, for example, by overallocating the synchronous bandwidth, results in unfairness and starvation, i.e., some stations are unable to transmit. (A numeric sketch of this bound follows the list of rules.)

3. A station must request a TTRT greater than or equal to the station parameter T_min. The default maximum value of T_min is 4 ms. Generally, most stations do not request a TTRT less than 4 ms.

4. A station must request a TTRT less than or equal to the station parameter T_max. The default minimum value of T_max is 165 ms. Assuming that there is at least one station with T_max equal to 165 ms, the TTRT on a ring cannot exceed this value. (In practice, many stations will use a value of 2^22 × 40 ns = 167.77216 ms, which can be conveniently derived from the symbol clock using a 22-bit counter.)
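Rule 2's lower bound, clamped by the T_min and T_max defaults of rules 3 and 4, can be sketched as follows; the function name and defaults are our own, with times in milliseconds and the worst-case ring latency assumed.

    # Smallest permissible TTRT under rules 2-4 (all times in ms).
    def min_ttrt(ring_latency=1.773, token_time=0.00088,
                 max_frame_time=0.361, sync_allocation=0.0):
        bound = ring_latency + token_time + max_frame_time + sync_allocation
        return min(max(bound, 4.0), 165.0)   # clamp to [T_min, T_max]

    # With no synchronous allocation the rule-2 bound is about 2.13 ms,
    # which is below the 4-ms T_min default, so 4 ms governs.
    print(min_ttrt())   # -> 4.0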


Efficiency and Maximum Access Delay Considerations

In addition to the rules specified by the standard, the TTRT values should be chosen to allow high-performance operation of a ring. This section discusses these performance considerations.

Figure 3 is a plot of efficiency as a function of the TTRT. Three configurations called "Typical," "Big," and "Largest" are shown.

[Figure 3 plots efficiency (percent) against the TTRT (milliseconds) for the Typical, Big, and Largest configurations.]

Figure 3 Efficiency as a Function of the TTRT

The Typical configuration consists of 20 single attachment stations (SASs) on a 4-km fiber ring. The numbers used are based on an intuitive feeling of what a typical ring would look like and not based on any survey of actual installations. Twenty offices located on a 50 m by 50 m floor require a 2-km cable or a 4-km fiber.

The Big configuration consists of 100 SASs on a 200-km fiber. This configuration represents a reasonably large ring with acceptable reliability. Configuring a single ring with considerably more than this number of stations increases the probability of bit errors.8

The Largest configuration consists of 500 dual attachment stations (DASs) and a ring that has wrapped. A DAS can have one or two media access controllers (MACs). In this configuration, each DAS has two MACs. Thus, the LAN consists of 1000 MACs in a single logical ring. This is the largest number of MACs allowed on an FDDI LAN. Exceeding this number would require recomputation of all default parameters specified in the standard.

Figure 3 shows that for all three configurations, the efficiency increases as the TTRT increases. The efficiency is very low at TTRT values close to the ring latency but increases as the TTRT increases. Thus, to ensure a minimal efficiency, the minimum allowed TTRT on FDDI is 4 ms. This direct relationship between the efficiency and the TTRT may lead some to conclude that the largest possible TTRT be chosen. However, notice also that the efficiency gained by increasing the TTRT, i.e., the slope of the efficiency curve, decreases as the TTRT increases. The "knee" of the curve depends upon the ring configuration. For larger configurations, the knee occurs at larger TTRT values. Even for the Largest configuration, the knee occurs in the range of 6 to 10 ms. For the Typical configuration, the TTRT has little effect on efficiency as long as the TTRT is in the allowed range of 4 to 165 ms.

Figure 4 shows the maximum access delay as a function of the TTRT for the three configurations. To show the complete range of possibilities, we used a semilogarithmic scale on the graph. The vertical scale is logarithmic, while the horizontal scale is linear. The figure shows that increasing the TTRT brings about a corresponding increase in the maximum access delay for all three configurations.

[Figure 4 plots maximum access delay (milliseconds, logarithmic scale) against the TTRT (milliseconds) for the Typical, Big, and Largest configurations.]

Figure 4 Maximum Access Delay as a Function of the TTRT

Table 1 presents the performance metrics for the maximum access delay and the efficiency as functions of the TTRT. As evidenced in the table, on the Largest ring, a TTRT of 165 ms causes a maximum access delay as long as 165 seconds. This means that in a worst-case situation, a station may have to wait several minutes to receive a usable token. For many applications, this delay is unacceptable; therefore, a reduced number of stations or a smaller TTRT may be preferable.

Response Time

Figure 5 shows the average response time as a function of the TTRT for a relatively large configuration, i.e., 100 stations and 10 km of fiber. The WIC workload was simulated at three load levels: 28, 58, and 90 percent.


Table 1 Maximum Access Delay and Efficiency as Functions of the TTRT

TTRT    Maximum Access Delay (seconds)         Efficiency (percent)
(ms)    Typical    Big        Largest          Typical    Big        Largest
        20 SAS     100 SAS    500 DAS          20 SAS     100 SAS    500 DAS
        4 km       200 km     200 km           4 km       200 km     200 km

4       0.08       0.40       4.00             98.94      71.87      49.55
8       0.15       0.79       8.00             99.47      85.92      74.77
12      0.23       1.19       11.99            99.65      90.61      83.18
16      0.30       1.59       15.99            99.74      92.95      87.38
20      0.38       1.98       19.98            99.79      94.36      89.91
165     3.14       16.34      164.84           99.97      99.32      98.78
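Table 1 can be regenerated from equations (1) and (2) with the helper functions sketched earlier; the ring latencies below are our approximations, taking 5.085 µs per km of fiber plus 1 µs per MAC.

    # Reproduce one Table 1 row (T in ms; delays converted to seconds).
    configs = {"Typical": (20, 0.040),    # 20 MACs, 4 km of fiber
               "Big": (100, 1.117),       # 100 MACs, 200 km
               "Largest": (1000, 2.017)}  # 1000 MACs, 200 km

    T = 8.0
    for name, (n, D) in configs.items():
        delay_s = max_access_delay(n, T, D) / 1000.0
        print(name, round(delay_s, 2), round(100 * efficiency(n, T, D), 2))
    # Typical 0.15 99.47 / Big 0.79 85.92 / Largest 8.0 74.77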

Two of the three curves are horizontal straight lines, indicating that TTRT has no effect on the response times at these loads. The TTRT only affects the response time at a heavy load. In fact, it is only near the usable bandwidth that the TTRT has any effect on the response time.

[Figure 5 plots response time (milliseconds) against the TTRT (milliseconds) for the WIC workload at 28, 58, and 90 percent load.]

Figure 5 Response Time as a Function of the TTRT

To summarize the results presented so far, if the FDDI load is below saturation, the TTRT has little effect. At saturation, a larger value of TTRT gives a larger usable bandwidth and therefore increased efficiency. But a longer TTRT also results in longer maximum access delays. The selection of the TTRT requires a trade-off between these two requirements. To facilitate making this trade-off, the two performance metrics for the three configurations are listed in Table 1. TTRT values in the allowable range of 4 to 165 ms are shown. The data shows that a very small value of TTRT, such as 4 ms, is undesirable, because the resulting efficiency on the Largest ring is poor (50 percent). A very large value of TTRT, such as 165 ms, is also undesirable, because it results in long maximum access delays. The 8-ms value is the most desirable, since it yields 75 percent or more efficiency on all configurations and results in a maximum access delay of less than one second on Big rings. Eight milliseconds is, therefore, the recommended default TTRT.

Problems with a Large TTRT

There are three additional reasons for preferring an 8-ms TTRT over a large TTRT such as 165 ms. First, a large TTRT allows a station to receive a large number of frames back-to-back. To operate in such an environment, all adapters must be designed with large receive buffers. Although memory is not considered an expensive part of a computer, its cost is significant for low-cost components such as adapters. The board space for the additional memory required by choosing a larger TTRT is considerable, as are the bus holding times required for such large back-to-back transfers. (A rough numeric estimate of such transfers appears at the end of this section.)

Second, a very large TTRT results in an exhaustive service discipline (i.e., all frames are transmitted in one token capture), which has several known drawbacks. For example, exhaustive service is unfair. Frames coming to higher load stations have a greater chance of finding the token during the same transmission opportunity, whereas frames arriving at low load stations may have to wait. Thus, the response time is inversely dependent upon the load, i.e., higher-load stations yield lower response times and vice versa.9

Third, with exhaustive service, the response time of a station is dependent upon station location with respect to that of high-load stations. The station immediately downstream from a high-load station may obtain better throughput than the one immediately upstream.
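To put the first drawback in numbers, a rough estimate (ours, not from the paper) of the worst-case back-to-back data an adapter must absorb is the neighbor's holding time multiplied by the 100-Mb/s line rate.

    # Rough estimate of worst-case back-to-back receive data: an upstream
    # neighbor may hold the token for nearly the full TTRT at 100 Mb/s.
    def back_to_back_bytes(ttrt_ms):
        return ttrt_ms * 1e-3 * 100e6 / 8

    print(back_to_back_bytes(8.0))     # 100,000 bytes at an 8-ms TTRT
    print(back_to_back_bytes(165.0))   # about 2 megabytes at 165 ms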


Parameters Other Than the TTRT That Affect Performance

Many parameters other than the TTRT affect the performance of a network. This section discusses four configuration and workload parameters: the extent of the ring, the total number of stations, the number of active stations, and the frame size.

Extent of the Ring

The total length of the fiber is called the extent of the ring. The maximum allowed extent on an FDDI LAN is 200 km. Figures 6 and 7 are graphs illustrating the efficiency and maximum access delay as functions of the extent. A star-shaped ring with all stations at a fixed radius from the wiring closet is assumed. The total cable length, shown along the horizontal axis, is twice the radius times the number of stations. As is evident from the figures, rings with a larger extent have a slightly lower efficiency and a longer maximum access delay than those with smaller extents. Note that in Figure 7, the increase in maximum access delay for each configuration is not apparent due to the semilogarithmic scale.

[Figure 6 plots efficiency (percent) against the extent of the ring (total fiber length in kilometers) for 20, 100, and 1000 MACs at a TTRT of 8 ms.]

Figure 6 Efficiency as a Function of the Extent of the Ring

[Figure 7 plots maximum access delay (milliseconds, logarithmic scale) against the extent of the ring (total fiber length in kilometers) for 20, 100, and 1000 MACs at a TTRT of 8 ms.]

Figure 7 Maximum Access Delay as a Function of the Extent of the Ring

Total Number of Stations

The total number of stations on a ring includes active and inactive stations. In general, increasing the number of stations adds to the ring latency because of the additional fiber length and additional station delays. Thus, the number of stations affects the efficiency and maximum access delay in a way similar to that of the extent; a ring that contains a larger number of stations than another has a lower efficiency and a longer maximum access delay. In addition, a large number of stations on a ring increases the bit-error rate. Consequently, large rings are not desirable.

Number of Active Stations

As the number of active stations, i.e., MACs, increases, the total load on the ring increases. Figures 8 and 9 show the ring performance as a function of the number of active MACs on the ring. We considered a maximum-size ring with a TTRT value of 8 ms for the analysis. The figures show that increasing the number of active MACs has a slight positive effect on the efficiency, but considerably increases the maximum access delay. Therefore, it is preferable to keep a minimal number of active stations on each ring by segregating small groups on separate rings.

[Figure 8 plots efficiency (percent) against the number of active MACs (0 to 1000) at a TTRT of 8 ms.]

Figure 8 Efficiency as a Function of the Number of Active MACs

[Figure 9 plots maximum access delay (milliseconds, logarithmic scale) against the number of active MACs at a TTRT of 8 ms and a ring latency of 1.773 ms.]

Figure 9 Maximum Access Delay as a Function of the Number of Active MACs

Frame Size


Frame size does not appear in the simple models of efficiency and maximum access delays, because frame size has little impact on FDDI performance. In our analysis, we assumed that transmission stops at the instant the THT expires; however, the standard allows stations to complete the transmission of the last frame.

The extra time used by a station after THT expiry is called asynchronous overflow. Assuming all frames are of fixed size, let F denote the frame transmission time. During every transmission opportunity, an active station can transmit as many as k frames:

k = ⌈(T - D)/F⌉

Here, ⌈ ⌉ is used to denote rounding up to the next integer value. The transmission time is equal to k times F, which is slightly more than T minus D. With asynchronous overflow, the modified efficiency and maximum access delay formulas become

Efficiency = nkF / (n(kF + D) + D)

Maximum access delay = (n - 1)(kF + D) + 2D

Notice that substituting kF = T - D in the above equations results in equations (1) and (2).

Figures 10 and 11 show the efficiency and the maximum access delay as functions of the frame size. Frame size has only a slight effect on these metrics. Larger frame sizes do have the following effects:

• The probability of error is greater in a larger frame.

• Since the size of protocol headers and trailers is fixed, larger frames require less protocol overhead.

• The time to process a frame increases only slightly with the size of the frame. A larger frame size results in fewer frames and, hence, in less processing at the host.

Overall, we recommend using as large a frame size as the reliability considerations allow.
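The asynchronous-overflow variants of equations (1) and (2) given above can be sketched the same way as the earlier helpers; again, the function names are ours.

    import math

    # Modified formulas with asynchronous overflow; F is the frame time.
    def overflow_efficiency(n, T, D, F):
        k = math.ceil((T - D) / F)          # frames per opportunity
        return n * k * F / (n * (k * F + D) + D)

    def overflow_max_access_delay(n, T, D, F):
        k = math.ceil((T - D) / F)
        return (n - 1) * (k * F + D) + 2 * D
    # Substituting k*F = T - D recovers equations (1) and (2).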

[Figure 10 plots efficiency (percent) against the frame size (octets, 0 to 4000) for the Typical, Big, and Largest configurations at a TTRT of 8 ms.]

Figure 10 Efficiency as a Function of the Frame Size

[Figure 11 plots maximum access delay (milliseconds, logarithmic scale) against the frame size (octets) for the Typical, Big, and Largest configurations at a TTRT of 8 ms.]

Figure 11 Maximum Access Delay as a Function of the Frame Size


Summary

Although many parameters affect the performance of an FDDI ring network, the target token rotation time (TTRT) is the key parameter that network managers can control to optimize this performance. We analyzed the effect of other parameters such as the extent of the ring (the length of the cable), the total number of stations, the number of active stations, and frame size. From our data we concluded the following:

• Rings with a large extent and those with a large number of stations are undesirable because they yield a longer maximum access delay and have only a slight positive effect on the efficiency of the ring.

• It is preferable to minimize the number of active stations on a ring to avoid increasing the maximum access delay.

• A large frame size is desirable, taking into consideration the acceptable probability of error.

The value of TTRT does not significantly affect the response time unless the load is near saturation. Under very heavy load, response time is not a suitable metric. Instead, maximum access delay, i.e., the time between wanting to transmit and being able to do so, is more meaningful. A larger value of TTRT improves the efficiency, but it also increases the maximum access delay. A good trade-off is provided by setting TTRT at 8 ms. Since this value provides good performance for all ranges of configurations, we recommend that the default value of TTRT be set at 8 ms.

References

1. F. Ross, "An Overview of FDDI: The Fiber Distributed Data Interface," IEEE Journal on Selected Areas in Communications, vol. 7, no. 7 (September 1989): 1043-1051.

2. Fiber Distributed Data Interface (FDDI) - Token Ring Media Access Control (MAC), ANSI X3.139-1987 (New York: American National Standards Institute, 1987).

3. Token Ring Access Method and Physical Layer Specifications, ANSI/IEEE Standard 802.5-1985, ISO/DIS 8802/5 (New York: The Institute of Electrical and Electronics Engineers, Inc., 1985).

4. R. Grow, "A Timed-token Protocol for Local Area Networks," Proceedings of the IEEE Electro '82 Conference on Token Access Protocols, Paper 17/3, Boston, MA (May 1982).

5. R. Jain, The Art of Computer Systems Performance Analysis, ISBN 0-471-50336-3 (New York: John Wiley & Sons, 1991).

6. D. Dykeman and W. Bux, "Analysis and Tuning of the FDDI Media Access Control Protocol," IEEE Journal on Selected Areas in Communications, vol. 6, no. 6 (July 1988): 997-1010.

7. J. Ulm, "A Timed-token Ring Local Area Network and Its Performance Characteristics," Proceedings of the Seventh IEEE Conference on Local Computer Networks (February 1982): 50-56.

8. R. Jain, "Error Characteristics of Fiber Distributed Data Interface (FDDI)," IEEE Transactions on Communications, vol. 38, no. 8 (August 1990): 1244-1252.

9. W. Bux and H. Truong, "Mean-delay Approximation for Cyclic-service Queueing Systems," Performance Evaluation, vol. 3 (Amsterdam: North-Holland, 1983): 187-196.

Further Readings

The Digital Technical Journal publishes papers that explore the technological foundations of Digital's major products. Each Journal focuses on at least one product area and presents a compilation of papers written by the engineers who developed the product. The content for the Journal is selected by the Journal Advisory Board. Digital engineers who would like to contribute a paper to the Journal should contact the editor at RDVAX::BLAKE.

Subscriptions to the Digital Technical Journal are available on a yearly, prepaid basis. The subscription rate is $40.00 per year (four issues). Requests should be sent to Cathy Phillips, Digital Equipment Corporation, ML01-3/B68, 146 Main Street, Maynard, MA 01754, U.S.A. Subscriptions must be paid in U.S. dollars, and checks should be made payable to Digital Equipment Corporation.

Single copies and past issues of the Digital Technical Journal can be ordered from Digital Press at a cost of $16.00 per copy.

Topics covered in previous issues of the Digital Technical Journal are as follows:

Fiber Distributed Data Interface, Vol. 3, No. 2, Spring 1991
Transaction Processing, Databases, and Fault-tolerant Systems, Vol. 3, No. 1, Winter 1991
VAX 9000 Series, Vol. 2, No. 4, Fall 1990
DECwindows Program, Vol. 2, No. 3, Summer 1990
VAX 6000 Model 400 System, Vol. 2, No. 2, Spring 1990
Compound Document Architecture, Vol. 2, No. 1, Winter 1990
Distributed Systems, Vol. 1, No. 9, June 1989
Storage Technology, Vol. 1, No. 8, February 1989
CVAX-based Systems, Vol. 1, No. 7, August 1988
Software Productivity Tools, Vol. 1, No. 6, February 1988
VAXcluster Systems, Vol. 1, No. 5, September 1987
VAX 8800 Family, Vol. 1, No. 4, February 1987
Networking Products, Vol. 1, No. 3, September 1986
MicroVAX II System, Vol. 1, No. 2, March 1986
VAX 8600 Processor, Vol. 1, No. 1, August 1985

Technical Papers by Digital Authors

R. Abbott, "Scheduling I/O Requests with Deadlines: A Performance Evaluation," IEEE Real-time Systems Symposium (December 1990).

S. Batra, "Magnetic Superexchange in YIG & CA2:YIG," Thirty-fifth Conference on Magnetism and Magnetic Materials (October 1990).

A. Conn, J. Parodi, and M. Taylor, "The Place for Biometrics in a User Authentication Taxonomy," Thirteenth National Computer Security Conference (October 1990).

S. Das, "Suppression of Barkhausen Noise in an MR Head," Thirty-fifth Conference on Magnetism and Magnetic Materials (October 1990).

P. Fang, "Yield Modeling in a Custom IC Manufacturing Line," Advanced Semiconductor Manufacturing Conference (September 1990).

E. Freedman and Z. Cvetanovic, "Perfect Benchmarks Decomposition and Performance on VAX Multiprocessors," IEEE Supercomputing '90 (November 1990).

L. Jaynes, "The Effect of Symbols on Warning Compliance," Thirty-fourth Human Factors Society (October 1990).

M. Joshi, "Making Wafers in the JIT Style," Advanced Semiconductor Manufacturing Conference (September 1990).

K. Mistry and B. Doyle, "Electron Traps, Interface States and Enhanced AC Hot-carrier Degradation," IEEE Device Research (June 1990).


Digital Press

Digital Press is the book publishing group of Digital Equipment Corporation. The Press is an international publisher of computer books and journals on new technologies and products for users, system and network managers, programmers, and other professionals. Proposals and ideas for books in these and related areas are welcomed.

The following book descriptions represent a sample of the books available from Digital Press.

VAX/VMS INTERNALS AND DATA STRUCTURES: Version 5.2
Ruth E. Goldenberg and Lawrence J. Kenah, with the assistance of Denise E. Dumas, 1991, hardbound, 1427 pages, Order No. EY-C171E-DP-EEB ($124.95)

This is a totally revised edition of the most authoritative and complete description of the VAX/VMS operating system in the industry. The book features new chapters on symmetric multiprocessing, the reorganized executive, VAX interrupts and exceptions, and the I/O subsystem, including device drivers and interrupt service routines. The addition of symmetric multiprocessing to the VAX ...

VAX ARCHITECTURE REFERENCE MANUAL, Second Edition
Richard A. Brunner, Editor, 1991, softbound, 560 pages, Order No. EY-F576E-DP-EEB ($44.95)

This book describes the data types, instructions, calling standards, addressing modes, registers, exception and interrupt handling, memory management, and process structure common to all VAX computers, from the MicroVAX II to the VAX 9000. New sections describe the VAX shared-memory model supported in VAX multiprocessor computers and the recently added vector processing extensions implemented by the VAX 9000 and VAX 6000 Model 400 systems. The book introduces the design goals and terminology of the VAX instruction set, including those for memory management, exception and interrupt handling, process control, and vector processing. The description of each instruction gives format, operations, condition codes, instruction-specific exceptions, opcodes, and mnemonics.

A COMPREHENSIVE GUIDE TO Rdb/VMS
Lilian Hobbs and Kenneth England, 1991, softbound, 352 pages, Order No. EY-H873E-DP-EEB

VMS FILE SYSTEM INTERNALS
VMS FILE SYSTEM INTERNALS, based on VMS Version 5.2, is a book for system programmers, software specialists, system managers, applications designers, and other VAX/VMS users who need to understand the interfaces to and the data structures, algorithms, and basic synchronization mechanisms of the VMS file system. This system is the part of the VAX/VMS operating system responsible for storing and managing files and information in memory and on secondary storage. The book is also intended as a case study of the VMS implementation of a file system for graduate and advanced undergraduate courses in operating systems.

MIT PROJECT ATHENA: A Model for Distributed Campus Computing
George A. Champine, 1991, hardbound, 282 pages, Order No. EY-H875E-DP-EEB ($28.95)

MIT Project Athena has emerged as one of the most important models for next-generation distributed computing in an academic environment. MIT pioneered this new approach, based on the client-server model, to support a network of workstations. The project began in 1983 as a five-year project, with Digital Equipment Corporation and IBM as its two major industrial sponsors.


Now a production system of networked workstations, Project Athena is replacing time-sharing (which MIT also pioneered) as the preferred model of computing at MIT. The size and uniqueness of Project Athena has led to widespread interest in its design, implementation, and performance.

UNDERSTANDING CLOS: The Common Lisp Object System
Jo A. Lawless and Molly M. Miller, 1991, softbound, 192 pages, Order No. EY-F591E-DP-EEB ($26.95)

The Common Lisp Object System (CLOS) is an extension to Common Lisp that brings object-oriented programming (OOP) to this popular version of the Lisp language. Written for computer professionals and students, UNDERSTANDING CLOS quickly introduces necessary object-oriented programming concepts and provides complete syntactic descriptions of all CLOS functions adopted by the ANSI X3J13 standards committee. Also included is an 800-line sample application, as well as a bibliography, a glossary, and an index.

COMMON LISP: The Language, Second Edition
Guy L. Steele, Jr., 1990, softbound, 1029 pages, Order No. EY-C187E-DP-EEB ($38.95)

The first edition of COMMON LISP: The Language, which sold over 60,000 copies, became the de facto standard for the Common Lisp programming language. This second edition is approximately twice the size of the first edition. The book reflects, as closely as possible, the decisions and recommendations made by ANSI committee X3J13, bridging the gap between the first edition and the forthcoming ANSI standard. It describes many of the changes made to the Common Lisp programming language, relative to the structure of the first edition, and discusses those areas that are likely to be revised further.

To receive a copy of our latest catalog or further information on these or other publications from Digital Press, please write or call:

Digital Press
Department EEB
12 Crosby Drive
Bedford, MA 01730
(617) 276-1536

Or, you can order by calling DECdirect at 800-DIGITAL (800-344-4825). When ordering be sure to refer to Catalog Code EEB.

Book Review

The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling, R. Jain, John Wiley & Sons, Inc., New York, 1991, 720 pages (ISBN 0-471-50336-3).

This is an edited version of a forthcoming review by Robert Y. Al-Jaar in the Performance Evaluation Review of the ACM SIGMETRICS.

The author achieves the major objectives presented in his preface. Raj Jain provides computer professionals simple and straightforward performance analysis techniques in a comprehensive textbook. He gives basic modeling, simulation, measurement, experimental design, and statistical analysis background, and emphasizes and integrates the modeling and measurement aspects. The author discusses common mistakes and games in performance analysis studies, and illustrates the presented techniques using examples and case studies from the field of computer systems.

The book consists of 36 chapters organized in the following six parts: "An Overview of Performance Evaluation," "Measurement Techniques and Tools," "Probability Theory and Statistics," "Experimental Design and Analysis," "Simulation," and "Queueing Models"; nearly the same level of attention is given to each part. Each chapter has a set of carefully designed exercises; solutions to selected exercises are presented at the end of the book. Each part concludes with a comprehensive list of references, appropriately selected from the extensive list that follows the exercise solutions. The book also includes an appendix that contains statistical tables and formulas.

Part I emphasizes the importance of performance analysis for designers, administrators, and users of computer systems. The author introduces the field of computer systems performance analysis and presents examples of problems that one should be able to solve after reading the book. He discusses 22 common mistakes that occur in performance evaluation studies and presents in a "box" format a summary checklist to help avoid these mistakes. This format is an effective presentation technique used judiciously throughout the book to highlight important techniques and summarize major results. The author advocates a 10-step approach to performance analysis and discusses the selection of performance evaluation techniques and metrics.


I enjoyed reading this coverage of issues critical to the success of any performance engineering project but often ignored. The discussions remind experts of the importance of these matters and encourage newcomers to develop the correct attitude toward performance.

Part II begins with explanations of workload types. The author emphasizes several major considerations for workload selection. He then discusses monitors, including program execution monitors and accounting logs. Of particular interest is the discussion of the design of software monitors. Capacity planning and benchmarking sections include enlightening material on common mistakes of inexperienced analysts and the games and tricks played by experienced analysts. By discussing such practical topics as load drivers and remote-terminal emulators (RTEs), the book provides comprehensive information on performance analysis, a welcome departure from the format of many other books which consider such a discussion "unintellectual." The art of data presentation techniques follows. The quality and format of the presentations in the book clearly indicate that the author does practice what he preaches.

Part II concludes with a discussion of ratio games. The author uses case studies and examples to explain how to choose an appropriate base system and ratio metric. He also outlines strategies for defending oneself from ratio games played by others.

Part III introduces the basic concepts of probability and statistics, using examples and case studies from the computer field to convince the reader that these concepts have practical importance. The author explains how to summarize measured data and use sample data to compare systems; provides an easy-to-read introduction to simple linear regression models; and discusses other regression models.

The overall treatment of experimental design and analysis is so comprehensive and thorough that Part IV is practically a short book on experimental design techniques. The author explains the basic concepts, terminology, and design techniques, and discusses in detail a variety of experimental designs.

Part V contains a good introduction to simulation as a tool for computer performance analysis. The author provides a checklist of common simulation mistakes and describes the Monte Carlo, trace-driven, and discrete-event simulation methods. Adding a discussion of process-oriented, as opposed to event-oriented, simulation methods would provide the reader with a more complete perspective of current simulation methods.

This part next describes the analysis of simulation results. Included are model verification and validation techniques, accompanied by algorithms to aid the reader in the implementation. The book also contains in-depth coverage of random number generators. Part V concludes with a brief discussion of current areas of research in simulation. Pointers to references for process- and object-oriented simulation methods would be a welcomed addition.

Part VI introduces the basic concepts and notation of queueing models, key tools for evaluating the performance of computer systems. Included is a clear, step-by-step analysis of single queues; a discussion of stochastic processes; an explanation of queueing networks and related operational analysis techniques; and a demonstration of the convolution algorithm. The author also introduces the reader to the practical technique of hierarchical decomposition of large queueing networks. Part VI concludes with a discussion on the limitations of queueing theory. To choose the appropriate modeling approach, analysts must be aware of these limitations.

This is a truly landmark book which achieves the author's stated objectives. A strong point of the book is its equal treatment of modeling, simulation, measurement, and experimental design in the context of computer systems. I believe that most of the chapters can be used as 45-minute lectures, as the author claims. Senior students in engineering and computer science will generally have the mathematical sophistication required to understand the material covered in this book. The Art of Computer Systems Performance Analysis is indeed an encyclopedia on the performance analysis of computer systems, and should be on the bookshelf of every computer professional.

Robert Y. Al-Jaar, Ph.D., Principal Systems Engineer
Porting and Performance Engineering Group
Digital Equipment Corporation
Marlborough, Massachusetts 01752-9122
July 24, 1991

Note: The book reviewed was written by an author who contributed a paper to this issue of the Journal. The editor included this review as one that might be of interest to our readers. The review expresses the opinions of the reviewer.

