• High-performance Networking

• OpenVMS AXP System Software

• Alpha AXP PC Hardware

Digital Technical Journal Digital Equipment Corporation

cleon

cleon

Volume 6 Number l

Wimer 1994 Editorial jane . Blake, Managing Editor Helen L. Patterson, Editor Kathleen M. Stetson, Editor Circulation Catherine M. Phillips, Administrator Dorothea B. Cassady, Secretary Production Terri Autieri, Production Editor Anne S. Katzeff, Typographer Peter R. Woodbury, Illustrator Advisory Boa.-d Samuel H. Fuller, Chairman !Uchard W Beane Donald Z. Harbert !Uchard]. Hollingsworth Alan G. Nemeth jean A. Proulx jeffrey H. Rudy Stan Smits Hobert M. Supnik Gayn B. Winters The Digital Technicaljournal is a refereed journal published quarterly by Digital Equipment Corporation, 30 Porter Road LJ02/Dl0, Littleton, Massachusetts 01460. Subscriptions to the journal are $40.00 (non-US $60) for four issues and $75.00 (non­ U.S. $115) for eight issues and must be prepaid in U.S. funds. Universit)' and college professors and Ph.D. students in the electrical engineering and computer science fields receive complimentary subscriptions upun request. Orders, inquiries, and address changes should be sent to the Digital Technicaljo urnal at the published-by address. Inquiries can also be sent electronically to [email protected]. COM. Single copies and back issues are available for S 16.00 each by calling DECdirect at 1-800-DJGlTAL (J-800-344-4825). Recent back issues of the journal are also available on the Internet at gatekeeper.dec.com in the directory /pub/DEC/OECinfo/DTJ.

Digital employees may order subscriptiuns through Headers Choice by enrering VTX PROFILE at the system prompt.

Comments on the content of any paper are welcomed and may be sent to the managing editor at the published-by or network address.

Copyright© 1994 Digital Equipment Corporation . Copying without fee is permitted provided that such copies are made for use in educational instiwtions by faculty members and arc not distributed for commercial advantage. Abstracting with credit uf Digital Equipment Corporation's authorship is permitted . All rights reserved. The information in thejounwl is subject to change without notice and should not be construed as a commitmem by Digital Equipment Corporation. Digital Equipment Corporation assumes no responsibility for any errors that may appear in the journal. ISSN 0898-901X Documentation Number EY-QOlJE-TJ The following are trademarks of Digital Equipment Corporation: ACMS, Alpha AXP, AXP, CJ, DEC, DEC Rdb for OpenV,viS AXP, DECchip, DECNIS 600, DECpc, DECstation, Digital, the DIGITAl. logo, GIGAswitch, HSC, KDM. MicroVAX, OpenVMS, OpenVMS AXP, OpenVMS VAX, RA, TA, ULTRlX, VAX, VMS, and VMSclustcr AT&T is a registered trademark of An1erican Telephone and Telegraph Company. !350 and SPICE are trademarks of the University of California at Berkeley. CR AY Y-MP is a registered trademark of Cray Research , Inc. CoverDesign Gateway 2000 is a registered trademark of Gateway 2000, Joe. The GlGAswitch �J'Stem is a high-performance Hewlett-Packard is a trademark of Hewlett-Packard Corporation. packet switch for interconnecting a variety , lntel386, lntel486, and Pentium are trademarks of Intel Corporatiun. of network types and components, including Microsoft, MS-DOS, and Windows NT are registered trademarks of Microsoft Corporation. LANs, WANs, clients, servers, and members MIPS is a trademark of MIPS Computer Systems, Inc. of farms. The grid on our cover Motorola is a registered trademark of Motorola, Inc. represents the crossbar switching fabric at OSF/1 is a registered trademark of Open Software Fuundation, Inc. the core of the GlGAswitch system. Data repre­ PAL is a registered trademark of , Inc. sented bere as flowing through the crossbar SPEC, SPECfp, SPECint, and SPECmark are registered trademarks of the Standard -numeric, audio, video, text-is typical of Performance Evaluation Council. applications that use the G!GAswitch system. TPC-A is a trademark of the Transaction Processing Performance Council. is a registered trademark of Unix System Laboratories, Inc. , a wholly owned The cover was designed by joe Pozerycki,]r:, subsidiary of Novell, Inc. of Digital's Corporate Design Group. J3ookproduction was clone by Quantic Communicatiuns, Inc. I Contents

High-performance Networking

9 GIGAswitch System: A High-performance Packet-switching Platform Robert). Souza, P G. Krishnakumar, Ci.ineyt M. 6zveren, Robert). Simcoe, Barry A. Spinney, Robert E. Thomas, and Robert ,I. Wa lsh

OpenVMS AXP System Software

23 Peiformance of DEC Rdb Version 6.0on AXP Systems Lucien A. Dimino, Rabah Mediouni, T. K. Rengarajan, Michael S. Rubino, and Peter M. Spiro

36 Improving Process to Increase Productivity While Assuring Quality: A Case Study of the Volume Shadowing Port to OpenVMS AXP Wil liam L. Goleman, Robert G. Thomson, and Paul). Houl ihan

AlphaAXP PC Hardware

54 The Evolution of the Alpha AXP PC David G. Conroy, Thomas E. Kopec, and joseph R. Falcone

66 Digital's DECchip 21066: The First Cost-focused Alpha AXP Chip Dina L. McKinney, Masooma Bhaiwala, Kwong-Tak A. Chui, Christopher L. Houghton, James R. Mullens, Daniel L. Leibholz, Sanjay). Patel, De Ivan A. Ramey, and Mark B. Rosenbluth I Editor�s Introduction

Notable among performance results was a world record for transactions per second in a test of the Rdb version 6.0 database running on a DEC 10000 AXP system. Data availability is another important charac­ teristic of high-end systems. Volume shadowing is a feature of the OpenVMS that increases availability by rep I icating data on disk storage devices. The port of Volume Shadowing

Phase II to OpenVMS AXP is a case study of how a small engineering team can use software develop­ ment processes to tackle complex system design Jane C. Blake tasks and deliver improved quality products on Ma naging Editor accelerated first-release schedules. Bill Goleman, Robert Thomson, and Paul Houlihan describe key In answer to a recent jo urnal survey question, half process innovations, including early code inspec­ of those who responded said they would like to see tions and profile testing, which simulates complex a variety of topics in each issue; half prefer the single­ scenarios in order to reveal errors. topic format. This issue of the Digital Te chnical In the first of two papers on Alpha AXP PC prod­ jo urnal will please those who like variety. The ucts, Dave Conroy, To m Kopec, and Joe Falcone papers herein address networking, database perfor­ trace the design choices, alternatives, and method­ mance, software processes, PCs, and chips. In the ology used in the evolutionary development of the fu ture, we will try to please readers on both sides first AXP PCs. The authors explore the lessons of the question by presenting single-topic issues, learned from designing two experimental systems; such as Wo rkflow Software upcoming, and issues the first demonstrated the feasibility of building that encompass a range of topics. a PC utilizing the DECchip 21064 microprocessor Our opening paper is about the GJGAswitch ami industry-standard components; the subsequent packet-switching system, used to increase data system incorporated the EISA . They then review transfer among interconnected LANs and in work­ the design of the DECpc AXP 150 product, which station farms. At 6.25 million connections per sec­ built on the successes of the experimental systems ond, GIGAswitch is among the industry's fastest and was completed in just 12 weeks. multipart packet-switching systems. The paper by Future Alpha AXP PCs and desktop systems will Bob Souza, P G. Krishnakumar, Ci.ineyt Ozveren, use new Alpha AXP microprocessors that have Bob Simcoe, Barry Spinney, Bob Thomas, and Bob higher levels of system integration fo r performance Wa lsh is a substantive overview of GIGAswitch, the yet employ packaging and clocking techniques first implementation of which includes an FDDI that reduce system cost. Dina McKinney, Masooma bridge. The authors discuss the major design issues, Bhaiwala, Kwong Chui, Chris Houghton, Jim including the data link independent crossbar, the Mullens, Dan Leibholz, Sanjay Patel, Del Ramey, and arbitration algorithm (called take-a-ticket), and Mark Rosenbluth explain the trade-offs and results the techniques used to ensure the robustness, relia­ of the 21066 design, which is the first microproces­ bility, and cost-effectiveness of the switch. sor to integrate a PCJ bus controller along with just as critical as network perfo rmance in high­ many other system-level functions. end system environments, such as stock exchanges At the close of this issue, we have listeel the refer­ and banking, is the performance of server-based ees who offered their expert advice on the content database management systems. In their paper on and readability of papers submitted to thejournal the DEC Rdb version 6.0 database, Lou Dimino, between January 1993 and February 1994. We are Rabah Mediouni, T K. Rengarajan, MikeRubino, and grateful fo r the specialized knowledge they have Peter Spiro examine earlier steps taken to establish shared with authors and editors alike. 
good database perfo rmance on AXP systems and compare these and traditional approaches with the new enhancements that shorten the code paths, minimize 1/0 operations, and reduce stall times.

2 Biographies I

Masooma Bhaiwala A hardware engineer in the Models, To ols, and Verifica­ tion Group of Semiconductor Engineering, Masooma Bhaiwala has contributed to several projects since joining Digital in 1991 . She worked on the 1/0 controller verification of the DECchip 21066 microprocessor and wrote the PC! transactor that mimics Intel's PC! bus protocol. Recently, she led the DECchip 21066 second­ pass verification efforts. Masooma received a B.S.C.E. from N.E.D. University of Engineering and Technology, Pakistan, and an M.S.C.E. from the University of Massachusetts at Lowell.

Kwong-Tak A. Chui Kwong Chui, a principal engineer in the Semiconductor Engineering Group, is currently leading the definition of a core chip set for a ruse processor. He was the co-architect of the PC! interface section of the DECchip 21066. Since joining Digital in 1985, Kwong has worked on memory and 1/0 bus controller chips for the MicroVAX 3000 and VA X 4000 series systems, and a cache address fanout buffe r and multiplexer chip for the DEC 4000 AXP sys­ tem. Kwong holds a B.S. in computer engineering from the University oflllinois at Urbana-Champaign and an M.Eng.E.E. from Cornell University.

David G. Conroy Dave Conroy is a senior consulting engineer at Digital's System Research Center in Palo Alto, California. He received a B.A.Sc. degree in electrical engineering from the University of Wa terloo, Canada, and moved to the United States in 1980 as a cofounder of the Mark Williams Company. He joined Digital in 1983 to work on the DECtalk speech synthesis system and became a member of the Semiconductor Engineering Group in 1987 to work on system-level aspects of rusemicroprocessors, including the DECchip 21064 CPU. Dave is now investigating the design of high-performance memory systems.

Lucien A. Dimino A principal software engineer, Lucien Dimino is a member of the Database Systems Group. He is responsible fo r the R.MUrelational database management utility. Formerly with AT&T Bell Telephone Laboratories, he joined Digital in 1974. At Digital he has worked in the areas of networking and commu­ nications, transaction processing, manufacturing productivity and automation, and database management. Lucien received a B.S. in mathematics (1966) from City College of New Yo rk and an M.S. in mathematics from Stevens Institute of Technology.

3 Biographies

Joseph R. Falcone Joseph Falcone is a hardware engineering manager in the Te chnical Original Equipment Manufacturer Group, which develops Alpha AXP board products fo r the OEM market. Prior to this, he developed the system archi­ tecture and business strategy for the DECpc AXP 150 personal computer. Joe came to Digital from Hewlett-Packard Laboratories in 1983 to work in Corporate Research & Architecture on distributed systems research and development. He received an A.B.C.S. (1980) from the University of California, Berkeley, and an M.S.E.E. (1983) from Stanford University

William L. Goleman Principal software engineer Bill Goleman led the proj­ ect to port the Volume Shadowing Phase II product to Open VMS AXP. He is cur­ rently involved in planning OpenVMS r;o subsystem strategies. Bill joined Digital's Software Services organization in 1981. He worked in the Ohio Valley District and then moved to VMS Engineering in 1985. Since then Bill has worked on the MSCP server, local area VA Xcluster, mixed-architecture clusters, volume shadowing, and the port of cluster I/O components to OpenVMS AXP. Bi ll received a B.S. in computer science from Ohio State University in 1979.

Christopher L. Houghton A principal hardware engineer in the Semi­ conductor Engineering Group, Chris Houghton is responsible fo r the signal integrity, packaging, and r;o circuit design of the DECchip 21066 Alpha AXP microprocessor. His previous work includes signal integrity, packaging, and cir­ cuit design fo r the GIGAswitch crossbar chips, digital and analog circuit design fo r a semicustom cell library, and advanced development of processor-memory interconnect chips. Chris came to Digital in 1985 after receiving a BS.EE. from the University of Vermont. He holds one patent.

Paul J. Houlihan As a principal engineer with OpenVMS AXP Engineering, Paul Houlihan participated in the port of OpenVMS software components from the VAX to the Alpha AXP architecture, including volume shadowing, the tape class driver, and the system communication services layer. Paul's recent fo cus bas been on the OpenVMS l/0 subsystem and software quality He is the creator of Faulty Towers, a test tool that injects software faults into VMScluster systems. Paul received a B.A. in computer science and a B.A. in political science from the University of Wisconsin at Madison.

Thomas E. Kopec Principal engineer To m Kopec is a member of the Assistive Technology Group working on compact speech-synthesis and text-to-speech systems. Previously, he was a member of the Entry Systems Business Group that designed the VA X 4000 Models 200 through 600 and the Micro VAX 3500 and 3800 systems. To m received a B.S.E.C.E. (1980, with honors) in microwave engineering and signal processing and an M.S.E.C.E. (1985) in image processing and computer graphics, both from the University of Massachusetts at Amherst.

4 I

P. G. Krishnakumar A principal engineer in Digital's High Performance Networks Group, P G. Krishnakumar is currently the project leader fo r the -ATM bridge firmware. Prior to this, he worked on the GIGAswitch firmware and performance analysis of the CI and storage systems. He has also worked in the Tape and Optical Systems and the High Availability Systems Groups. He received a B.Tech. in electrical engineering from the Institute of Technology-BHU, India, an M.S. in operation research and statistics, and an M.S. in computer engineering, both from Rensselaer Polytechnic Institute.

DanielL. Leibholz Daniel Leibholz, a senior engineer in the Semiconductor Engineering Group, has been designing integrated Alpha AXP CPUs at Digital fo r three years. He is an architect of the DECchip 21066 Alpha AXP microproces­ sor and is currently involved in the design of a high-performance micro­ processor. While at Digital he has also worked on the development of massively parallel processors and was awardee! a patent fo r a processor allocation algo­ rithm. Daniel joined Digital in 1988 after earning B.S. and M.S. degrees in electri­ cal engineering from Brown University in Providence, RhodeIsland.

DinaL. McKinney Principal engineer Dina McKinney supervises the DECchip 21066 low-cost Alpha AXP verification team. Since joining Digital in 1981, Dina has held design and verification positions on various projects, including PC mod­ ules, semi custom chips, and fu ll-custom chips fo r communication, graphics, and processor products. Previously, Dina was a member of the U.S. Air Force. She holds two technical degrees from Community College of the Air Force and Gulf Coast College, a B.S.E.E. (1982, summa cum laude) from Central New England

College, and an M.S E . E . (1990) from Wo rcester Polytechnic Institute.

Rabah Mediouni Rabah Mediouni joined Digital in 1980 and is currently a principal engineer in the Rdb Consulting Group. His responsibilities include consulting on database design and tuning and performance characterization of new fe atures in major Rdb releases. He was responsible for the performance characterization effort during the DEC Rdb port to the AXP platforms. Prior to this, Rabah worked in the Mid-range Systems Performance Group as a primary contributor to the VAX 8000 series performance characterization project. Rabah received an M.S. in computer science from Rivier College in 1985.

James R. Mullens James Mul lens is a principal hardware engineer in the Semiconductor Engineering Group. He is working on CPU advanced develop­ ment and is responsible fo r the architecture and design of the DECchip 21066 dynamic memory controller. In earlier work, he perfo rmed verification and test of the system-on-a-chip (SOC), designee! four ASIC devices used in and DECwindows terminals, led the Corporate Mouse and Tablet projects, and was a design and project engineer fo r the Mini-Exchange. Jim joined Digital in 1982 after receiving a B.S.E.E. from Northeastern University.

5 BiograjJbies

Ciineyt M. Ozveren Clineyt 6zveren joined Digital in 1990 and is a principal engineer in the Networks Engineering Advanced Development Group. He is cur­ rently working on all-optical networks and ATJ\'1 architecture. His previous work incluclecl the architecture, design, and implementation of the SCP hardware in GlGAswitch. He received a Ph.D. (1989) in electrical engineering and computer science from MIT and an M.S. (1989) from MIT's Sloan School of Management. His research included the analysis and control of large-scale dynamic systems, with appl ications to communication systems, manufacturing, and economics.

Sanjay J. Patel As a verification engineer fo r the OECchip 21066 low-cost Alpha AXP project, Sanjay Patel was a key implementor of the DECchip 21066 ver­ ification strategy He is currently working on specifying power management capabilities fo r a future chip. Sanjay joined Digital in 1992. In his earl ier experi­ ence as a co-op student at IBM in Kingston, New York, he helped develop a vec­ torized subroutine library. Sanjay received B.SE. (1990) and MSE. (1992) degrees in computer engineering from the University of Michigan.

Delvan A. Ramey A principal hardware engineer, Del Ramey specializes in ana­ log CMOS development fo r CPU, communications, and graphics chips. His recent work involves phase-locked loop (PLL) design. Del architected the hyhrid analog­ to-digital PLL clock and data recovery system in the Ethernet protocol OECchip 21040 product, and designed the frequency synthesis PLL for the DEC:chip 21066 Alpha AXP microprocessor. He has also developed semicustom chips and libraries. Del joined Digital in 1980, after completing his Ph.D. at the University of Cincinnati . He is a member of IEEE, Eta Kappa Knu, and Phi Kappa Phi.

T. K. Rengarajan T. K. Rengarajan, a member of the Database Systems Group since 1987, works on the KODA software kernel of the DEC Rdh system. He has contributed in the areas of butfer management, high availability, OITP per­ fo rmance on Alpha AXP systems, and multimedia databases. He designed high-performance logging, recoverable latches, asynchronous batch writes, and asynchronous prefetch features fo r DEC Rdb version 6.0. Ranga holds M.S. degrees in computer-aided design and computer science from the University of Kentucky and the University of Wisconsin, respectively.

Mark B. Rosenbluth Consulting engineer Mark Rosenbluth contributes to the architecture, modeling, and logic design of the DECchip 21066 microproces­ sor. Jn earlier work, he was involved in the modeling, logic design, and circuit design of the DC222 system-on-a-chip (SOC) VAX microprocessor, the DC:)58 MicroVAX DNlA controller chip, and the J-11 PDP-11 microprocessor. He also helped create semicustom design system and cell libraries. Mark came to Digital in 1977 from RCA Corporation. He holds a BS.EE. from Rutgers University Col lege of Engineering.

G I

Michael S. Rubino Michael Rubino joined Digital in 1983 and is a principal software engineer in the Database Systems Group. As the Alpha program manager fo r database systems, his primary role is the oversight of the porting of Rdb as wel l as other database products from VAX to Alpha AXP. Prior to this work, Michael was with VMS Engineering and contributed to the RMS and RMS/journaling projects. Prior to that, he was the KODA (database kernel) project leader. Michael received a B.S. in computer science from the State University of New York at Stony Brook in 1983.

Robert J. Simcoe Robert Simcoe is a consulting engineer in the High Per­ fo rmance Networks Group and is responsible fo r ATM switch products. Prior to this, he worked in Network Engineering Advanced Development on the GIGAswitch and SONET/ATM. He has also worked on chip designs, from MicroVAX and Scorpio FPUs to high-speed switching fo r network switches. Prior to joining Digital in 1983, he designed !Cs and small systems fo r General Electric; before that, he designed secure communications devices for the National Security Agency. He has published several papers and holds 14 patents.

Robert J. Souza Bob Souza is a consulting engineer in the Networks Engi­ neering Advanced Development Group. He was the project leader fo r the GJGAswitch SCP module. Since joining Digital in 1982, Bob has worked on the DEMPR Ethernet repeater fo r MicroSystems Advanced Development and has coimpJemented an RPC system for Corporate Research at Project Athena. He is the author of several tools for sequential network design, a number of papers, and several patents. Bob received a Ph.D. in computer science from the University of Connecticut.

Barry A. Spinney Barry Spinney is a consulting engineer in the Network Engineering Advanced Development Group. Since joining Digital in 1985, Barry has been involved in the specification and/or design of FDDI chip sets (MAC,RMC, and FCAM chips) and the GIGAswitch chips (FGC, GCAM, GPI, and CBS chips). He also contributed to the GIGAswitch system architecture and led the GIGAswitch FDDI line card design (chips, hardware, and firmware). He holds a B.S. in mathe­ matics and computer science from the University of Wa terloo and an M.S.C.S. from the University of To ronto. Barry is a member of IEEE.

Peter M. Spiro Peter Spiro, a consulting software engineer, is currently the technical director fo r the Rdb and DBMS software product set. Peter's current fo cus is very large database issues as they relate to the information highway. Peter joined Digital in 1985, after receiving M.S. degrees in forest science and computer science from the University of Wisconsin-Madison. He has five patents related to database journaling and recovery, and he has authored three papers for earlier issues of the Digital Te chnicaljo urnal. In his spare time, Peter is building a birch bark canoe.

7 Biographies

Robert E. Thomas Bob Thomas received a Ph.D. in computer science from the University of California, Irvine. He has worked on fine-grained dataflow multi­ processor architectures at the MIT Lab for Computer Science. After joining Digital in 1983, Bob pioneered the development of a cell-based network sw itch­ ing architecture and deadlock-free routing. He is a coinventor of the high-speed crossbar arbitration method for GIGAswitch and was co-project leader of the AfoM-3 gate array and the ATM/SONET/T3/E3 interface card. Bob is currently hardware project leader for a 622-Mbs ATM PC! adapter.

Robert G. Thomson A senior software engineer in the Open VMS AXP Group, Robert Thomson led the validation of volume shadowing's port to the Alpha AXP platform. During the OpenVMS port, he measured quality and contributed to func­ tional verification, for which he was co-recipient of an Alpha AXP Achievement Award. Since joining Digital in 1986, Robert has also contributed to improve­ ments in system availability measurement and in symmetric multiprocessing and backup performance. He has a patent and three published papers based on this work. Robert holds an M.S. in computer engineering from Boston University.

Robert). Walsh Bob Walsh has published papers on TCP/IP networking, on radiosity and ray-tracing algorithms fo r 3-D shaded graphics, and on graphics rendering with parallel processors. He has a patent pending on language constructs and translation fo r programming parallel machines. Bob has exten­ sive operating system experience for uniprocessors (UNIX v.6, 4.1 BSD, and follow-ons) and switch-based parallel processors (BBN's Butterfly and Digital's GIGAswitch). He received A.B., S.M., and Ph.D. degrees from Harvard University.

8 Robert]. Souza

P. G. Krishnakumar CuneytM. Ozveren Ro bert]. Simcoe Barry A. Sp inney Robert E. Thomas Robert]. Walsh GIGAswitch System: A High-performance Packet-switching Platform

The GIGAswitch system is a high-perfonnance packet-switching platfonn built on a 36-port 100 lvlb/s crossbar switching fabric. The crossbar is data link independent and is capable of making 6.25 million connections per second. Digital's first GIGAswitch system product uses 2-port FDD!line cards to construct a 22-port IEEE 802.1 d FDDI bridge. The FDDI bridge implements distributed forwarding in hard­ ware to yield forwarding rates in excess of 200,000 packets per second per port. The G!GAswitch system is highly available and provides robust operation in the presence of overload.

The GlGAswitch system is a multiport packet­ platform and conclude with the resu lts of perfor­ switching platform that combines distributed fo r­ mance measurements made during system test. warding hardware and crossbar sw itching to attain very high network performance. When a packet GIGAswitch System Architecture is received, the receiving line card decides where The GIGAswitch system implements Digital's to fo rward the packet autonomously. The ports on architecture fo r switched packet networks. The a GIGAswitch system are fully interconnected with architecture al lows fast, simple fo rwarding by map­ a custom-designed, very large-scale integration ping 48-bit addresses to a short address when (VI_<;I) crossbar that permits up to 36 simultaneous a packet enters the switch, and then fo rwarding conversations. Data flows through 100 megabits packets based on the short address. header con­ per second (Mb/s) point-to-point connections, A taining the short address, the time the packet was rather than through any shared media. Movement received, where it entered the switch, and other of unicast packets through the GIGAswitch system information is prepended to a packet when it is accomrlished completely by hardware. enters the switch. When a packet leaves the switch, The GlGAswitch system can be used eliminate to the header is removed, leaving the original packet. network hierarchy and concomitant delay. It can The architecture also defines fo rwarding across aggregate traffic from local area networks (LANs) multiple GIGAswitch systems and specifies an algo­ and be used to construct wo rkstation farms. The rithm fo r rapidly and efficiently arbitrating fo r use of LAN and wide area network (WAN) line cards crossbar output ports. This arbitration algorithm makes the GIGAswitch system su itable fo r build­ is implemented in the VLSI, custom-designed ing, campus, ancl metropolitan interconnects. The GIGAswitch port interface (GPI) chip. GIGAswitch system provides robustness and avail­ abil ity features useful in high-availabi lity appl ica­ tions like fi nancial networks and enteqxise Hardware Overview backbones. Digital's first product to use the GIGAswitch plat­ In this paper, we present an overview of the form is a modu lar IEEE 802.ld fiber distributed data switch architecture and discuss the principles influ­ interface (FDDI) bridge with up to 22 ports.' The encing its design. We then describe the imrlemen­ product consists of four module types: the FDDI tation of an FOm bridge on the GIGAswitch system line card (FGL), the S\Vitch control processor (SCP),

Digital Tecbnicaljournal Vnl. 6 No. I Winter 1994 9 High-performance Networking

the clock card, and the crossbar interconnection. • Maintaining loosely consistent line card address The modules plug into a backplane in a 19-inch, databases rack-mountable cabinet, which is shown in Figure l. Forwarding multicast packets and packets to The power and cooling systems provide N+ 1 • redundancy, with provision for battery operation. unknown clestinations The first line card implemented for the Switch configuration GIGAswitch system is a two-port FOOl line card • (FC;L-2). A four-port version (FGL-4) is currently Network management through both the SNMP under design , as is a multifunction asynchronous • and the GIGAswitch system out-of-band manage­ transfer mode (ATM) line card. FCL-2 provides con­ ment port nection to a number of different FDOI physical media using media-specific daughter cards. Each The clock card provides system clocking and port has a lookup table for network addresses and storage for management parameters, and the cross­ associated hardware lookup engine and queue man­ bar switch module contains the crossbar proper. ager. The SC:P provides a number of centralized The power system controller in the power sub­ functions, including system monitors the power supply front-end units, fans, and cabinet temperature. • Implementation of protocols (Internet protocol [IP], simple network management protocol [SNIVJP]. and lEU� 802.Jd spanning tree) above Design Issues the media access control (MAC) layer Building a large high-performance system requires a seemingly endless series of design decisions and • Learning addresses in cooperation with the ine I trade-offs. In this section, we discuss some of the cards major issues in the design and implementation of the GIGAswitch system.

Multicasting Although very high packet-forwarding rates for unicast packets are required to prevent network bottlenecks, considerably lower rates achieve the same result for multicast packets in extended LANs. Processing multicast packets on a host is often done in software. Since a high rate of multicast traf

fie on a LAN can render the connected hosts use­ Jess, network managers usually restrict the extent of multicast packets in a LAN with filters. Measuring extended LAN backbones yields little multicast traffic. The GIGAswitch system forwards unicast traffic in a distributed fashion. Its multicast forwarding implementation, however, is centralized, and soft­ ware forwards most of the multicast traffic. The GIGAswitch system can also limit the rate of multi­ cast traffic emitted by the switch. The reduced rate of traffic prevents lower-speed LANs attached to the switch through bridges from being rendered inop­ erable by high multicast rates. Badly behaved algorithms using multicast proto­

cols can render an extended LAN useless. Therefore, the GIGAswitch system allocates internal resources so that forward progress can be made in a lJ\N with figure 1 The G/GAswitcb Sj;stem badly behaved traffic.

6 Digital Technical jourual 10 Vol. No. I V?iuter 1994 GIGAswitch Sys tem: A High-performance Pa cket-switching Pla tform

Switch Fa bric By observing the service of those before it, the line The core of the GIGAswitch system is a 100 Mb/s card can determine when its turn has arrived and full-duplex crossbar with 36 input ports and 36 out­ instruct the crossbar to make a connection to the put ports, each with a 6-bit data path (36 X 36 X 6). output port. The crossbar is fo rmed from three 36 X 36 X 2 cus­ The distributed arbitration algorithm is imple­ tom VLSI crossbar ch ips. Each crossbar input is mented by GP! chips on the line cards and SCP. The paired with a corresponding output to fo rm a dual­ GPI is a custom-designed CMOS VLSI chip with simplex data path. The GIGAswitcb system line approximately 85,000 transistors. Ticket and con­ cards and SCP are fully interconnected through the nection information are communicated among crossbar. Data between modules and the crossbar GPis overa bus in the switch backplane. Although it can flow in both directions simultaneously. is necessary to use backplane bus cycles fo r cross­ Using a crossbar as the �;witch connection(rather bar connection setup, an explicit connection tear than, say. a high-speed bus) allows cut-through fo r­ clown is not performed. This reduces the connec­ warding: a packet can be sent through the crossbar tion setup overhead and doubles the connection as soon as enough of it has been received to make rate. As a result, the c;Jc;Aswitch system is capable a fo rwarding decision. The crossbar allows an input of making 6.25 million connections per second. port to be connected to multiple output ports simultaneously; this property is used to implement Hunt Groups The (;p[ al.lows the same logical multicast. The 6-bit data path through the crossbar address to be assigned to many physical ports, provides a raw data-path speecl of 150 Mb/s using which together form a hunt group. To a sender, a a 2"i megahertz (MHz) clock. (Five bits are used to hunt group appears to be a single high-bandwidth encode each 4-bit symbol; an additional bit pro­ port. There are no restrictions on the sizeand mem­ vides parity.) bership of a hunt group; the members of a hunt Each crossbar chip has about 87,000 gates and group can be distributed across different line cards is implemented using complementary metal-oxide in the switch. When sending to a hunt group, the semiconductor (CMOS) technology. The crossbar take-a-ticket arbitration mechanism dynamically dis­ was designed to complement the FDDI data rate; tributes tratiic across the physical ports comprising higher data rates can be accommodated through the group, and connection is made to the first free the use of hunt groups, which are explained later port. No extra time is required to perform this arbi­ in this section. The maximum connection rate for tration and traffic distribution. A chain of packets the crossbar depends on the switching overhead, traversing a hunt group may arrive out of order. i.e., the efficiency of the crossbar output port arbi­ Since some protocols are intolerant of out-of-order tration and the connection setup and tear-down delivery, the arbitration mechanism has provisions mechanisms. to fo rce alI packets of a particular protocol type to Crossbar ports in the GIGAswitch system have take a single path through the hunt group. both physical and logical addresses. Physical port Hunt groups are similar to the channel groups addresses derive from the backplane wiring and are described by Pattavina, but without restrictions on a function of the backplane slot in which a card group membership. 
2 Hunt groups in the GIGA�witch resides. Logical port addresses are assigned by the system also diffe r from channel groups in that their SCP, which constructs a logical-to-physical address use introduces no additional switching overhead . mapping when a line card is initial ized. Some of Hardware support fo r hunt groups is included in the logical port number space is reserved; logical the first version of the GI<;Aswitch system; software port 0, fo r example, is always associated with the fo r hunt groups is in development at this writing. current SCP. Address Lookup Arbitration Algorithm With the exception of A properly operating bridge must be able to receive some maintenance functions, crossbar output port every packet on every port, look up several fields in arbitration uses logical add resses. The arbitration the packet, ami decide whether to fo rward or filter mechanism, called take-a-ticket, is similar to the (drop) that packet. The worst-case packet arrival system used in delicatessens. A line card that has rate on FOOl is over 440,000 packets per second per a packet to send to a particular output port obtains port. Since three fields are looked up per packet, a ticket from that port indicating its position in line. the FOOl line card needs to perfo rm approximately

Digital Te ch nical jo urnal v( i/ 6 No. I Wi nter /'J'J4 II High-performance Networking

1.3 million lookups per second per port; 880,000 of the properties of this hash function is that it is a these are fo r 48-bit quantities. The 48-bit lookups one-to-one and onto mapping from the set of 48 -bit must be done in a table containing 16K entries values to the same set. As long as the lookup table in order to accommodate large LANs. The lookup records are not shared by different hash buckets, fu nction is replicated per port, so the requisite per­ comparing the upper 32 bits is sufficient and leaves fo rmance must be obtained in a manner that mini­ an additional 16 bits of info rmation to be associated mizes cost and board area. The approach used to with this known address.

look up the fields in the received packet depends In the case where 1 < size :": 7, the pointer stored upon the number of values in the field. in the hash bucket points to the first entry in a bal­ Content addressable memory (CAM) technology anced binary tree of depth l, 2, or 3. This binary currently provides approximately 1K entries per tree is an array sorted by the upper 32 hash remain­ CAM chip. This makes them impractical fo r imple­ der bits. No more than three memory reads are menting the I6K address lookup table but suitable required to find the lookup record associated with fo r the smaller protocol field lookup. Earlier Digital this address, or to determine that the address is not bridge products use a hardware binary search in the database. engine to look up 48-bit addresses. Binary search When more than seven addresses collide in the requires on average 13 reads for a I6K address set; same hash bucket-a very rare occurrence-the fast, expensive random access memory (RAM) overflow addresses are stored in the GIGAswitch would be needed for the lookup tables to minimize content-addressable memory (GCAM). If several the forwarding latency. dozen overflow addresses are added to the GCAM, To meet our lookup performance goals at reason­ the system determines that it has a poor choice of able cost, the FDDI-to-GIGAswitch network con­ hash multipliers. It then initiates a re-hashing oper­ troller (FGC) chip on the line cards implements ation, whereby the SCP module selects a better a highly optimized hash algorithm to look up the 48-bit hash multiplier and distributes it to the FGLs. destination and source address fields. This lookup The FGLs then rebuild their hash table and lookup makes at most fo ur reads from the off-chip static tables using this new hash multiplier value. The new

RAI\1 chips that are also used fo r packet buffering. hash multiplier is stored in nonvolatile memory. The hash function treats each 48-bit address as a 47-degree polynomial in the Galois field of order Pa cket Buffering 2, GF(2).l The hashed address is obtained by the The FDDI Iine card provides both input and output equation: packet buffering fo r each FOOl port. Output buffe r­ ing stores packets when the outgoing FDDI link M(X) X A(X) mod G(X) is busy. Input buffe ring stores packets during

where G(X) is the irreducible polynomial, X48 + switch arbitration fo r the desired destination port. G X3 + X25 + X 10 + 1; M(X) is a nonzero, 47-degree Both input and output buffe rs are divided into sep­ programmable hash multiplier with coefficients arate first-in, first-out (FIFO) queues fo r different in GF(2); and A(X) is the address expressed as a traffic types. 47-degree polynomial with coefficients in GF(2). Switches that have a single FIFO queue per input The bottom 16 bits of the hashed address is then port are subject to the phenomenon known as used as an index into a 64K-entry hash table. Each head-of-line blocking. Head-of-line blocking occurs hash table entry can be empty or can hold a pointer when the packet at the front of the queue is des­ to another table plus a size between 1 to 7, indicat­ tined for a port that is busy, and packets deeper in ing the number of addresses that collide in this hash the queue are destined fo r ports that are not busy. table entry (i.e., addresses whose bottom 16 bits of The effect of head-of-line blocking fo r fixed-size their hash are equal). In the case of a size of 1, either packets that have unifo rmly distributed output port the pointer points to the lookup record associated destinations can be closely estimated by a simple with this address, or the address is not in the tables probability model based on independent trials. but happens to col lide with a known address. To This model gives a maximum achievable mean uti­

= 1 - = determine which is true, the remaining upper lization, U 1/e 63.2 percent, fo r switches 32 bits of the hashed address is compared to the pre­ with more than 20 duplex ports. Utilization viously computed upper 32 bits of the hash of the increases fo r smaller switches (or fo r smaller active known address stored in the lookup record. One of parts of larger switches) and is approximately

Vol. 6 No. Digital Tech nical jounwl 12 1 Wi nter 1994 GIGAswitch Sy stem: A High-performance Pa cket-switching Platfo rm

75 percent fo r 2 active ports. The independent trial a larger set of workstations) is uniformly dis­ assumption has been removed, and the actual mean tributed to a smaller set of outputs (for example, utilization has been computed.'' It is approximately a smaller set of file servers). The result is 60 percent fo r large numbers of active ports. lim 1 Hunt groups also affect utilization. The benefits =1- -- ull, (n - !),." U= l C 11 n -- oo of hunt groups on head-of-line blocking can be seen ) by extending the simple independent-trial analysis. where cis the mean concentration fa ctor of input The estimated mean utilization is ports to output ports, and n is the number of out­ puts. This yields a utilization of 86 percent when g k g k n g k !!--=-__. l "g � an average of two inputs send to each output, and a = 1 _ � U11. g ..:::... g- ( k ) (�) n ( n ) utilization of 95 percent when three inputs send to k=O each output. Note that utilization increases fu rther where n is the number of groups, and g is the hunt fo r smaller numbers of active ports or if hunt group size. In other words, all groups are the same groups are used. size in this model, and the total number of switch Other important factors in head-of-line blocking ports is (n X g) . This result is plotted in Figure 2 are the nature of the links and the traffic distribu­ along with simulation results that remove the inde­ tion on the I inks. Standard FDDI is a simplex link. pendent trial assumption. The simulation results Simulation studies of a GIGAswitch system model agree with the analysis above fo r the case of only were conducted to determine the mean utilization one link in each hunt group. Note that adding a link of a set of standard FDDI links. They have shown to a hunt group increases the efficiency of each that utilization reaches 100 percent, despite head­ member of the group in addition to adding band­ of-line blocking, when approximately 50 Mb/s of width. These analytical and simulation results, doc­ fixed-size packets traffic, uniformly distributed umented in January 1988, also agree with the to all FDDI links in the set, is sent into the switch simulation results reported by Pattavina. 2 from each FDDI. The reason is that almost 50 per­ The most important factor in head-of-line block­ cent of the FDDI bandwidth is needed to sink data ing is the distribution of traffic within the switch. from the switch; hence the switch data path is only When all traffic is concentrated to a single output, at 50 percent of capacity when the FDDI links are there is zero head-of-line blocking because traffic 100 percent utilized . This result also applies to behind the head of the line cannot move any more duplex T3 (45 Mb/s) and all slower links. In these easily than the head of the line can move. To study situations, the switch operates at well below capac­ this effect, we extended the simple independent­ ity, with little internal queuing. trial model. We estimated the utilization when the A number of techniques can be used to reduce traffic from a larger set of inputs (for example, the effect of head-of-line blocking on link efficiency. These include increasing the speed of the switching fabric and using more complicated queuing mecha­ 0.90 nisms such as per-port output queues or adding 0.85 lookahead to the queue service. All these tech­ :2 � 0.80 niques raise the cost and complexity of the switch; -z some of them can actual ly reduce performance fo r � Q 0.75 :2 !;( normal traffic. 
Since our studies Jed us to believe til�0.7 0 that head-of-line blocl

the worst-case maximum. The techniques used to Arriving packets allocated to an exhausted buffer guarantee fo rward progress on activities like the quota are dropped by the XAC. For instance, pack­ spanning tree include preallocation of memory ets to be flooded arrive due to external events and to databases and packets, queuing methods, operat­ are not rate limited before they reach the SCP. ing system design, and scheduling techniques. These packets may be dropped if the SCP is over­ Solutions that provide only robustness are insuffi­ loaded. Some buffe r quotas, such as those fo r ICC cient; they must also preserve high throughput in packets, can be sized so that packets are never the region of overload. dropped. Since software is not involved in the deci­ sion to preserve important packets or to drop Switch Co ntrol Processor Queuing und Quota excessive loads, high throughput is maintained dur­ Strategies The SCP is the focal point for many ing periods of overload. In practice, when the net­ packets, including (l) packets to be flooded, work topology is stable, the SCP is not overloaded (2) 802.1d control packets, (3) intrabox intercard and packets passing through the SCP fo r bridging command (ICC) packets, and (4) SNMP packets.'i are not dropped, even on networks with thousands Some of these packets must be processed in a of stations. This feature is most important during timely manner. The 802.1d control packets are part power-up or topology-change transients, to ensure of the 802.1d algorithms and maintain a stable net­ the network progresses to the stable state. work topology. The ICCs ensure correct fo rwarding If the SCP simply processed packets in FIFO order, and filtering of packets by col lecting and distribut­ reception of each packet type would be ensured, but ing information to the various line cards. The SNMP timely processing of important packets might not. packets provide monitoring and control of the Therefore, the first step in any packet processing is GIGAswitch system. to enqueue the packet for later processing. (Packets Important packets must be distinguished and may be fu lly processed and the buffe rs reclaimed if processed even when the GIGAswitch system is the amount of work to do is no greater than the heavily loaded. The aggregate fo rwarding rate fo r a enqueue/dequeue overhead.) Since the operating GIGAswitch system fu lly populated with FGL-2 line system scheduler services each queue in turn,split­ cards is about 4 million packets per second. This is ting into multiple queues allows the important too great a load for the SCP CPU to handle on its packets to bypass the less important packets. own. The FDDI line cards place important packets Multiple queues are also used on the output port in a separate queue for expedient processing. of the SCP. These software output queues are ser­ Special hardware on the SCP is used to avoid loss of viced to produce a hardware output queue that is important packets. long enough to amortize device driver entry over­ The crossbar access control (XAC) hardware on heads, yet short enough to bound the service time the SCP is designee! to avoid the loss of any impor­ for the last packet inserted. Bounding the hardware tant packet under overload. To distinguish the pack­ queue service time ensures that the important ets, the XAC parses each incoming packet. By 802.1d control packets convey timely information preal locating buffer memory to each packet type, fo r the distributed spanning tree algorithms. 
These and by having the hardware and software cooper­ considerations yield the queuing diagram shown in ate to maintain a strict accounting of the buffers Figure 3. used by each packet type, the SCP can guarantee At time tl, packets arriving in nonempty quotas reception of each packet type. are transferred by direct memory access (DMA) into

]]]- TASK A --+-iTn--l

PACKETS ARRIVE PACKETS LEAVE - TASK B --+-___ll_lj � FROM CROSSBAR ]]]- -i-f�Jill-- ON CROSSBAR JI]]· PROCESS C 11 t2 t3 t4 t5

Figure 3 Pa cket Queuing on the SCP

14 Vo l. 6 No. I Winter 1991 Digital Te �·httica/ journal GIGAswitch Sy stem: A High-pelformance Pa cket-switching Pla tform

CPU, dynamic RAM . They enter the hardware-received so that all parts of the system can make for­ packet queue. At time t2, software processes the ward progress in a timely manner. received packet queue, limiting the per-packet On the SCP, limiting the interrupt rate is accom­ processing to simple actions like the enqueu ing plished in two ways. One is to mask the propa­ of the packet to a task or process. At time t3, the gation of an interrupt by combining it with a packet contents are examined and the proper pro­ software-specified pulse. After an interrupt is tocol actions executed. This may involve the fo r­ serviced, it is inhibited fo r the specified time by warding of the arriving packet or the generation of triggering the start of the pulse. At the cost of hard­ new packets. At time t4, packets are moved from ware complexity, software is given a method fo r the software output queues to the short hardware quick, single-instruction, fine-grained rate limiting. output queue. At time t5, the packet is transferred Another method, suitable fo r less frequently exe­ by DiVlA into the crossbar. cuted code paths like error handl ing, is to use soft­ ware timers and interrupt mask registers to limit the frequency of an interrupt. Limiting the intcr­ Limiting Malicious Influences Using packet mpt rate also bas the beneficial effect of amortizing types and buffer quotas, the SCP can distinguish interrupt overhead across the events aggregated important traffic, like bridge control messages, behind each interrupt. when it is subjected to an overload of bridge con­ Noninterrupt software inhibits interrupts as part trol, unknown destination addresses, and multicast of critical section processing. If software inhibits messages. Such simple distinctions would not, how­ interrupts for too long, interrupt service code can­ ever, prevent a malicious station from consuming not make fo rward progress. By convention, inter­ all the buffers fo r multicast packets and allowing rupts are inhibited fo r a limited time. starvation of multicast-based protocols. Some of Interrupt servicing can be divided into two these protocols, like the JP address resolution pro­ types. In the first type, a fixed sequence of actions tocol (ARP), become important when they are not is taken, and limiting the interrupt rate is sufficient allowed to function 6 to limit interrupt execution time. Most error pro­ To address this problem, the SCP also uses the cessing fa lls into this category. In the second type, incoming port to classify packets. A malicious sta­ fo r all practical purposes, an unbounded response tion can wreak havoc on its own LAN whether or is required. For example, if packets arrive fa ster not the GIGAswitch system is present. By classifying than driver software can process them, then inter­ packets by incoming port, we guarantee some rupt execution time can easily become unaccept­ buffers fo r each of the other interfaces and thus able. Therefore, we need a mechanism to bound the ensure communication among them. The mali­ service time. In the packet l/0 interrupt example, cious station is reduced to increasing the load of the device driver polls the microsecond clock to nuisance background traffic. Region t4 of Figure 3 measure service time and thereby terminate device contains the layer of flooding output queues that driver processing when a bound is reached. If ser­ sort flooded packets by source port. 
When fo r­ vice is prematurely terminated, then the hardware warding is done by the SCP bridge code, packets continues to post the interrupt, and service is from well-behaved networks can bypass those from renewed when the rate-limiting mechanism allows poorly behaved networks. the next service period to begi n. Fragmentation of resources introduced by the Interrupt rate limiting can lead to lower system fine-grained packet classification could lead to small throughput if the is sometimes idle. This can buffe r quotas and unnecessary packet loss. To com­ CPU be avoided hy augmenting interrupt processing pensate fo r these possibilities, we provided shared with polled processing when id le cycles remain resource pools of buffers and high-throughput, after all activities have had some minimal fa ir share low-latency packet forwarding in the SCP. of the CPU.

Guaranteeing Fo rward Progress If an interrupt­ driven activity is offered unlimited load and is Reliability and Availability allowed to attempt to process the unlimited load, Network downtime due to switch fa ilures, repairs, a "livelock" condition, where only that activity exe­ or upgrades of the GIGAswitch system is low. cutes, can result. Limiting the rate of interrupts Components in the GIGAswitch system that could allows the operating system scheduler access to the be single points of failure are simple and thus more

Digital Te chnical Jo urnal Vo l. 6 No. J Wi nter 1994 15 High-performance Networking

reliable. Complex functions (logic and firmware) rather than on SCP modules. This simplifies the task are placed on modules that were made redundant. of coordinating updates to the management param­ A second SCP, fo r example, takes control if the first eters among the SCP modules. The SCP module con­ SCP in the GIGAswitch system fails. If a LAN is con­ trolling the box is selected by the clock card, rather nected to ports on two different FDDI line cards, than by a distributed (and more complex) election the 802.1d spanning tree places one of the ports in algorithm run by the SCPs. backup state; fa ilure of the operational port causes The clock card maintains ami distributes the the backup port to come on-line. system-wide time, communicated to the SCP and The GIGAswitch system allows one to "hoH,wap" line cards through shared-memory mailboxes on (i.e., insert or remove without turning off power) their GPI chips. Module insertion and removal are the line cards, SCP, power supply front-ends, and discovered by the clock card, which polls the slots fans. The GIGAswitch system may be powered from in the backplane over the module identification bus station batteries, as is telephone equipment, to interconnecting the slots. The clock card controls remove dependency on the AC mains. whether power is applied to or removed from a given slot, usually under command of the SCP. The Module Details clock card also provides a location fo r the out-of­ band management RS-232 port, although the out­ In the next section, we describe the fu nctions of of-band management code executes on the SCP. the clock card, the switch control processor, and Placing this set of fu nctionality on the clock card the FDD! line card. does not dramatically increase complexity of that module or reduce its reliability. It does, however, Clock Card significantly reduce the complexity and increase The clock card generates the system clocks fo r the the reliability of the system as a whole. modules and contains a number of centralized sys­ tem functions. These functions were placed on the Switch Control Processor clock card, rather than the backplane, to ensure The SCP module contains a MIPS R3000A CP with that the backplane, which is difficult to replace, is 64-kilobyte (kB) instruction and data caches, write passive and thus more reliable. These functions buffer, 16-megabyte (MB) DRAM, 2-MB flash mem­ include storing the set of 48-bit IEEE 802 addresses ory, and crossbar access control hardware. Figure 4 used by the switch and arbitration of the backplane shows a diagram of the SCP. The large DRAM pro­ bus that is used fo r connection setup. vides buffering fo r packets forwarded by the SCP Management parameters are placed in a stable and contains the switch address databases. The XAC store in flash electrically erasable programmable provides robust operation in the face of overload read-only memory (EEPROM) on the clock card, and an efficient packet flooding mechanism fo r

Figure 4 Switch Control Processor (block diagram: crossbar backplane access control with transmit and receive data paths, 16-MB DRAM packet memory, 2-MB flash auxiliary memory, 64-kB instruction and data caches, and the microprocessor with write buffer)


The XAC is implemented using two field-programmable gate arrays with auxiliary RAM and support logic. A major impetus for choosing the MIPS processor was the software development and simulation tools available at the time.

We input address traces from simulations of the SCP software to a cache simulation that also understood the access cost for the various parts of the memory hierarchy beyond the cache. Using the execution time predicted by the cache simulation, we were able to evaluate design trade-offs. For compiler-generated code, loads and stores accounted for about half the executed instructions; therefore cache size and write buffer depth are important. For example, some integrated versions of the MIPS R3000 processor would not perform well in our application. Simulation also revealed that avoiding stale data in the cache by triggering cache refills accounted for more than one-third of the data cache misses. Consequently, an efficient mechanism to update the cache memory is important. It is not important, however, that all DMA input activity update the cache. A bridge can forward a packet by looking at only small portions of it in low-level network headers.

Cache simulation results were also used to optimize software components. Layers of software were removed from the typical packet-processing code paths, and some commonly performed operations were recoded in assembly language, which yielded tighter code and fewer memory references.

The conventional MIPS method for forcing the cache to be updated from memory incurs three steps of overhead.7 One step is linear in the amount of data to be accessed, and the other two are inefficient for small amounts of data. During the linear time step, the information to be invalidated is specified using a memory operation per tag while the cache is isolated from memory. This overhead is avoided on the SCP by tag-bit manipulations that cause read memory operations to update the cache from DRAM. No additional instructions are required to access up-to-date information, and the method is optimal for any amount of data.

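The observation that a forwarding decision needs only the leading header bytes can be made concrete. The sketch below is illustrative C, not the SCP or FGC implementation: the 13-byte FDDI header layout (frame control byte followed by 48-bit destination and source addresses) is real, but the linear-search address table and the example addresses are invented stand-ins for the hardware lookup.

    #include <stdint.h>
    #include <string.h>

    /* Toy address table; the real lookup is done by the FGC/GCAM
       hardware, not by a linear search. */
    struct entry { uint8_t mac[6]; int port; };
    static const struct entry table[] = {
        { { 0x08, 0x00, 0x2b, 0x01, 0x02, 0x03 }, 3 },
        { { 0x08, 0x00, 0x2b, 0x04, 0x05, 0x06 }, 7 },
    };

    static int lookup_port(const uint8_t mac[6])
    {
        for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
            if (memcmp(table[i].mac, mac, 6) == 0)
                return table[i].port;
        return -1;   /* unknown: hand the packet to the SCP */
    }

    /* Only the first 13 bytes (FC + DA + SA) are needed, so the
       decision can be made while the payload is still arriving. */
    int forward_decision(const uint8_t *frame, size_t len)
    {
        if (len < 13)
            return -1;                 /* runt frame: filter it */
        return lookup_port(frame + 1); /* DA follows the FC byte */
    }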
Figure 5 FDDI Line Card for Two Ports (block diagram: per-port data paths to and from the crossbar, plus the connection to the backplane bus)


FDDI Line Card

The FGL contains one FDDI port subsystem per port (two for FGL-2 and four for FGL-4) and a processor subsystem. The FDDI port subsystems, shown in Figure 5, are completely independent. The process for sending a packet from one FDDI LAN to another is the same whether or not the two FDDI ports are on the same FGL module. The processor subsystem consists of a Motorola 68302 microprocessor with 1 MB of DRAM, 512 kB of read-only memory (ROM), and 256 kB of flash memory. The processor subsystem is used for initial configuration and setup, diagnostics, error logging, and firmware functions. The FGL reused much firmware from other network products, including the DECNIS 600 FDDI line card firmware; station management functions are provided by the Common Node Software.8

The FGL uses daughter cards, called ModPMDs, to implement a variety of physical media attachments, including single- and multiple-mode fiber, usable for distances of up to 2 and 20 kilometers, respectively, and unshielded twisted-pair (UTP) copper wire, usable up to 100 meters. Both the FGL-2 and the FGL-4 can support four ModPMDs. The FGL-2 can support one or two attachments per FDDI LAN and appear as a single attachment station (SAS), dual attachment station (DAS), or M-port, depending upon the number of ModPMD cards present and management settings. The FGL-4 supports only SAS configurations. The Digital VLSI FDDI chip set was used to implement the protocols for the FDDI physical layer and MAC layer. Each FDDI LAN can be either an ANSI standard 100 Mb/s ring or a full-duplex (200 Mb/s) point-to-point link, using Digital's full-duplex FDDI extensions.9

The heart of the FGL is the bridge forwarding component, which consists of two chips designed by Digital: the FDDI-to-GIGAswitch network controller (FGC) chip and the GIGAswitch content addressable memory (GCAM) chip. It also contains a set of medium-speed static RAM chips used for packet storage and lookup tables.

The GCAM provides a 256-entry associative memory for looking up various packet fields. It is used to match 1-byte FDDI packet control fields, 6-byte destination and source addresses, 1-byte destination service access point (DSAP) fields, and 5-byte subnetwork access protocol service access point (SNAP SAP) protocol identifiers. The GCAM chip has two data interfaces: a 16-bit interface used by the processor and an 8-bit, read-only interface that is used for on-the-fly matching of packet fields. The FGC can initiate a new lookup in GCAM every 80 nanoseconds.

The FGC chip is a large (approximately 250,000 transistors), 240-pin, 1.0-micrometer gate array that provides all the high-performance packet queuing and forwarding functions on an FGL. It also controls the packet flow to and from the crossbar and to and from the FDDI data link chip set. It queues inbound and outbound packets, splitting them by type into queues of a size determined by firmware at start-up time. The FGC looks up various FDDI packet fields (1) to determine whether or not to forward a packet and which port to use, (2) to determine whether the packet contains a new source address and to note when each source address was last heard, and (3) to map packet types to classes used for filtering. It also provides a number of packet and byte counters. The FGC chip is also used in the FDDI line card for the DECNIS multiprotocol router.

FDDI Bridge Implementation

In the next section, we describe the implementation of an FDDI bridge on the GIGAswitch system platform.

Packet Flow

Figure 6 shows the interconnection of modules in a GIGAswitch system. Only one port of each FDDI line card is shown; hardware is replicated for other ports. Note that the line cards and SCP each have dual-simplex 100 Mb/s connections to the crossbar; traffic can flow in both directions simultaneously.

The FGC hardware on the GIGAswitch system line card looks up the source address, destination address, and protocol type (which may include the frame control [FC], DSAP, SNAP SAP, etc., fields) of each packet received. The result of the lookup may cause the packet to be filtered or dropped. If the packet is to be forwarded, a small header is prepended and the packet is placed on a queue of packets destined for the crossbar. Buffers for bridge control traffic are allocated from a separate pool so that bridge control traffic cannot be starved by data traffic. Bridge control traffic is placed in a separate queue for expedient processing.

Most packets travel through the crossbar to another line card, which transmits the packet on the appropriate FDDI ring. If the output port is free, the GIGAswitch system will forward a packet as soon as enough of the packet has been received to make a forwarding decision. This technique, referred to as cut-through forwarding, significantly reduces the latency of the GIGAswitch system.

Some packets are destined for the SCP. These are network management packets, multicast (and broadcast) packets, and packets with unknown destination addresses. The SCP is responsible for coordinating the switch-wide resources necessary to forward multicast and unknown destination address packets.

Learning and Aging

A transparent bridge receives all packets on every LAN connected to it and notes the bridge port on which each source address was seen. In this way, the bridge learns the 48-bit MAC addresses of nodes in a network.


Figure 6 GIGAswitch System Modules (two FDDI line cards, each with a data link interface, lookup engine, address table, and queue manager [FGC], joined by the nonblocking crossbar; the clock card and SCP attach to the backplane bus)

The SCP and line cards cooperate to build a master table of 48-bit address-to-port mappings on the SCP. They maintain a loose consistency between the master address table on the SCP and the per-port translation tables on the line cards. When a line card receives a packet from a source address that is not in its translation table, it forwards a copy of the packet to the SCP. The SCP stores the mapping between the received port and the 48-bit address in the master address table and informs all the line cards of the new address. The SCP also polls the line cards at regular intervals to learn new addresses, since it is possible for the packet copy to be dropped on arrival at the SCP. The FGC hardware on the line card notes when an address has been seen by setting a source-seen bit in the address's translation table entry.

Since stations in an extended LAN may move, bridges remove addresses from their forwarding tables if the address has not been heard from for a management-specified time through a process called aging. In the GIGAswitch system, the line card connected to a LAN containing an address is responsible for aging the address. Firmware on FGL scans the translation table and time-stamps entries that have the source-seen bit set. A second firmware task scans the table, placing addresses that have not been seen for a specified time on a list that is retrieved at regular intervals by the SCP. Aged addresses are marked but not removed from the table unless it is full; this reduces the overhead if the address appears again.

The GIGAswitch system should respond quickly when an address moves from one port of the switch to another. Many addresses may move nearly simultaneously if the LAN topology changes. The fact that an address is now noticed on a new port can be used to optimize the address aging process. A firmware task on FGL scans the translation table for addresses that have been seen but are not owned by this port and places them on a list. The SCP then retrieves the list and quickly causes the address to be owned by the new port.
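A compact sketch of the line-card bookkeeping just described (the hardware-set source-seen bit, the time-stamping pass, and the aging pass) follows. The C is illustrative: the table layout, sizes, and the 300-second age limit are invented; the real tables are maintained by the FGC hardware and FGL firmware.

    #include <stdbool.h>
    #include <stdint.h>
    #include <time.h>

    #define TABLE_SIZE 1024
    #define AGE_LIMIT  300            /* management-specified, seconds */

    struct xlate_entry {
        uint8_t mac[6];
        int     port;                 /* port that owns the address */
        bool    source_seen;          /* set by hardware on receive */
        time_t  last_seen;            /* stamped by the first task  */
    };

    static struct xlate_entry table[TABLE_SIZE];

    /* First firmware task: time-stamp entries the hardware marked. */
    void stamp_scan(void)
    {
        time_t now = time(NULL);
        for (int i = 0; i < TABLE_SIZE; i++)
            if (table[i].source_seen) {
                table[i].last_seen   = now;
                table[i].source_seen = false;
            }
    }

    /* Second task: list addresses not heard from within the limit;
       the SCP retrieves the list and marks those addresses aged. */
    int age_scan(int aged[], int max)
    {
        int n = 0;
        time_t now = time(NULL);
        for (int i = 0; i < TABLE_SIZE && n < max; i++)
            if (now - table[i].last_seen > AGE_LIMIT)
                aged[n++] = i;
        return n;
    }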


Multicast to Multiple Interfaces

The SCP, rather than the line cards, sends a packet out multiple ports. This simplifies the line cards and provides centralized information about packet flooding in order to avoid overloading remote lower-speed LANs with flooded packets. The SCP allows network management to specify rate limits for unknown destination and for multicast destination traffic.

Flooding a packet requires replicating the packet and transmitting the replicas on a set of ports. During the design of the flooding mechanism, we needed to decide whether the replication would take place on the SCP, in hardware or software, or on the line cards. The design criteria included (1) the amount of crossbar bandwidth that is consumed by the flooding and (2) the effect of the flooding implementation on the forwarding performance of unicast packets.

The size of the switch and the filters that are specified also affected this decision. If the typical switch is fully populated with line cards and has no filters set, then one incoming multicast packet is flooded out a large number of ports. If a switch has few cards and filters are used to isolate the LANs, then an incoming multicast packet may not have to be sent out any port.

Since the crossbar design allows many outputs to be connected to a single input, a single copy of a packet sent into the crossbar can be received by multiple FDDI ports simultaneously. This improves the effective bandwidth of the SCP connection to the crossbar since fewer iterations are required. The queuing mechanisms used to send multicast connection information over the backplane allow each line card port using the multicast crossbar facility to receive a copy no more than one packet time after it is ready to do so.

We decided to implement flooding using special DMA hardware on the SCP. The DMA hardware transfers the packet once into a private memory, and all iterations proceed from that memory, thus removing contention at the SCP's DRAM. The multicast hardware repeatedly transmits the packet into the crossbar until the backplane queuing mechanisms signal that all relevant ports have received a copy.

Integration and Test

GIGAswitch system software development and test were performed in parallel with hardware development. Most of the SCP software was developed before reliable SCP hardware was available and before integration of the various modules could proceed in a GIGAswitch system. The simulation of the hardware included the SCP's complete memory map, a simplified backplane, a simplified clock card, and FDDI line cards. At run time, the programmer could choose between line card models connected to real networks or real GIGAswitch system FGLs. Instruction and data address references were extracted from the instruction interpreter that understood the SCP memory map.

FGL testing initially proceeded in standalone mode without other GIGAswitch system modules. The FGL firmware provided a test mechanism that allowed ICC packets to be received and transmitted over a serial port on the module. This facility was used to test ICC processing by sending ICC packets from a host computer over a serial line. The next level of testing used pairs of FGLs in a GIGAswitch system backplane. To emulate the SCP hardware, one FGL ran debug code that sent ICC packets through the crossbar.

Initial testing of SCP/FGL interaction used the FGL's serial ICC interface to connect an FGL to SCP software emulated on workstations. This interface allowed SCP and FGL firmware to communicate with ICC packets and permitted testing of SCP and FGL interactions before the SCP hardware was ready.

Network Management

The GIGAswitch system is manageable via the SNMP protocol. SNMP uses get and get-next messages to examine and traverse the manageable objects in the GIGAswitch system. SNMP uses set messages to control the GIGAswitch system.

A single SNMP set message can manipulate multiple objects in the GIGAswitch system. The objects should change atomically as a group, with all of them modified or none of them modified from the viewpoint of the management station. If the management station indicates that the GIGAswitch system reported an error, there are no dangling side effects. This is most advantageous if the management station maps a single form or command line to a single SNMP set message.

The SCP software achieves atomicity by checking individual object values, cross-checking object values, modifying object values with logging, and recording commit/abort transactional boundaries in a phased process. As each object value is modified, the new value is logged. If all modifications succeed, a commit boundary is recorded and the SNMP reply is sent. If any modification fails, all preceding modifications for this set operation are rolled back, an abort boundary is recorded, and the SNMP reply is sent.
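The phased set processing can be modeled in a few lines. The following C is a simplified model under stated assumptions: managed objects are reduced to an array of integers, the validity check is a placeholder, and the log is a print statement; only the all-or-nothing shape matches the description above.

    #include <stdbool.h>
    #include <stdio.h>

    static long objects[16];          /* stand-in managed-object store */

    struct set_op { int id; long new_val; long saved; };

    static bool value_ok(int id, long v)   /* placeholder check */
    {
        return id >= 0 && id < 16 && v >= 0;
    }

    /* Apply every op or none, recording commit/abort boundaries. */
    bool snmp_set(struct set_op *ops, int n)
    {
        for (int i = 0; i < n; i++)        /* check before touching anything */
            if (!value_ok(ops[i].id, ops[i].new_val))
                return false;              /* error reply, no side effects */

        for (int i = 0; i < n; i++) {      /* modify with logging */
            ops[i].saved = objects[ops[i].id];
            printf("log: obj %d <- %ld\n", ops[i].id, ops[i].new_val);
            objects[ops[i].id] = ops[i].new_val;
        }
        printf("log: commit boundary\n");  /* then send the SNMP reply */
        return true;
        /* on a modification failure, the saved values would be restored
           in reverse order and an abort boundary logged instead */
    }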


Measured Performance

Measuring the performance of a GIGAswitch system requires a significant amount of specialized equipment. We made our performance measurements using proprietary FDDI testers constructed at Digital. Under the control of a workstation, the testers can send, receive, and compare packets at the full FDDI line rate. We used 21 testers in conjunction with commercial LAN analyzers to measure performance of the GIGAswitch system.

The forwarding rate was measured by injecting a stream of minimum-size packets from a single input port to a single output port. This test yields a forwarding rate of 227,000 packets per second. The forwarding rate in this test is limited because connection requests are serialized when the (single) output port is busy. The arbitration mechanism allows connections to ports that are not busy to be established in parallel with ongoing packet transmissions. Modifying the test so that one input port sends packets to three output ports increases the aggregate forwarding rate to 270,000 packets per second. We also measured the aggregate forwarding rate of a 22-port GIGAswitch system to be approximately 3.8 million minimum-sized FDDI packets per second. At very high packet rates, small differences in internal timing due to synchronization or traffic distribution can exaggerate differences in the forwarding rate. The difference between 270,000 packets per second and 227,000 packets per second is less than 9 byte times per packet at 100 Mb/s.

For the GIGAswitch system, the forwarding latency is measured from first bit in to first bit out of the box. Forwarding latency is often measured from last bit in to first bit out, but that method hides any delays associated with packet reception. The forwarding latency was measured by injecting a stream of small packets into the switch at a low rate, evenly spaced in time. The forwarding latency was determined to be approximately 14 microseconds, or approximately 175 byte times at 100 Mb/s. This measurement illustrates the result of applying distributed, dedicated hardware to the forwarding path. It includes two 48-bit address lookups and a protocol type lookup on the incoming line card, the output filtering decision on the outbound line card, and the delays due to connection latency, data movement, and synchronization.

The GIGAswitch system filters minimum-sized packets from a multiple-access FDDI ring at the line rate, which is approximately 440,000 packets per second per port. Filtering is necessary to reject traffic that should not be forwarded through the switch.
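These byte-time figures are easy to verify. The short program below is just a worked check of the published numbers, not measurement code:

    #include <stdio.h>

    int main(void)
    {
        double bytes_per_us = 100e6 / 8.0 / 1e6;  /* FDDI: 12.5 bytes/us */

        /* 14-us first-bit-in to first-bit-out latency */
        printf("latency = %.0f byte times\n", 14.0 * bytes_per_us);

        /* per-packet gap between 227,000 and 270,000 packets/s */
        double diff_us = 1e6 / 227000.0 - 1e6 / 270000.0;
        printf("rate difference = %.1f byte times per packet\n",
               diff_us * bytes_per_us);            /* about 8.8 */
        return 0;
    }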
By design, the GIGAswitch system flooding rate is lower than the rate for unicast traffic. We measured the flooding rate to be 2,700 packets per second. Note that a multicast rate of r implies transmission of (s x r) packets by the switch, where s is the number of ports currently on the bridge spanning tree. Measurement of GIGAswitch system flooding rates on real networks within Digital and at field test sites indicates that the switch is not a bottleneck for multicast traffic.

Conclusions

The GIGAswitch system is a general-purpose, packet-switching platform that is data link independent. Many issues were considered and techniques were used in its design to achieve robustness and high performance. The performance of the switch is among the highest in the industry. A peak switching rate of 6.25 million variable-size packets per second includes support for large hunt groups with no loss in performance. Digital's first product to include a GIGAswitch system was a 22-port IEEE 802.1d FDDI bridge. Shipment to general customers started in June 1993. A four-port FDDI line card (FGL-4) is in development. A two-port GIGAswitch system line card (AGL-2) using ATM is planned for shipment in May 1994. This card uses modular daughter cards to provide interfaces to synchronous optical network (SONET) STS-3c, synchronous digital hierarchy (SDH) STM-1, DS-3, or E3 transmission media. The GIGAswitch system can be extended to include other data links such as high-speed serial interface (HSSI), fast Ethernet, and Fibre Channel. High-performance routing is also possible.

The GIGAswitch system has been used successfully in a wide variety of application areas, including workstation farms, high-energy physics, and both LAN and metropolitan area network (MAN) backbones. One or more GIGAswitch systems can be used to construct a large-scale WAN backbone that is surrounded by routers to isolate individual LANs from the WAN backbone. A GIGAswitch system network can be configured to provide the bandwidth needed to support thousands of conventional LAN users as well as emerging applications such as large-scale videoconferencing, multimedia, and distributed computing.

Acknowledgment

Our thanks to the following engineers who contributed to the GIGAswitch system architecture and product: Vasmi Abidi, Dennis Ament, Desiree Awiszio,


Juan Becerra, Dave Benson, Chuck Benz, Denise Carr, Paul Clark, Jeff Cooper, Bill Cross, John Dowell, Bob Eichmann, John Forecast, Ernie Grella, Martin Griesmer, Tim Haynes, Chris Houghton, Steve Kasischke, Dong Kim, Jeanne Klauser, Tim Landreth, Rene Leblanc, Mark Levesque, Michael Lyons, Doug Macintyre, Tim Martin, Paul Mathieu, Lynne McCormick, Joe Mellone, Mike Pettengill, Nigel Poole, Dennis Racca, Mike Segool, Mike Shen, Ray Shoop, Stu Soloway, Dick Somes, Blaine Stackhouse, Bob Thibault, George Varghese, Greg Waters, Carl Windnagle, and Mike Wolinski.

References

1. Local and Metropolitan Area Networks: Media Access Control (MAC) Bridges (New York: Institute of Electrical and Electronics Engineers, Inc., IEEE Standard 802.1D-1990).

2. A. Pattavina, "Multichannel Bandwidth Allocation in a Broadband Packet Switch," IEEE Journal on Selected Areas in Communications, vol. 6, no. 9 (December 1988): 1489-1499.

3. W. Gilbert, Modern Algebra with Applications (New York: John Wiley, 1976).

4. M. Karol, M. Hluchyj, and S. Morgan, "Input Versus Output Queueing on a Space-Division Packet Switch," IEEE Transactions on Communications, vol. COM-35, no. 12 (December 1987): 1347-1356.

5. J. Case, M. Fedor, M. Schoffstall, and J. Davin, "A Simple Network Management Protocol (SNMP)," Internet Engineering Task Force, RFC 1157 (August 1990).

6. D. Plummer, "An Ethernet Address Resolution Protocol," Internet Engineering Task Force, RFC 826 (November 1982).

7. G. Kane, MIPS RISC Architecture (Englewood Cliffs, NJ: Prentice-Hall, 1989).

8. P. Ciarafella, D. Benson, and D. Sawyer, "An Overview of the Common Node Software," Digital Technical Journal, vol. 3, no. 2 (Spring 1991): 42-52.

9. H. Yang, B. Spinney, and S. Towning, "FDDI Data Link Development," Digital Technical Journal, vol. 3, no. 2 (Spring 1991): 31-41.

Lucien A. Dimino
Rabah Mediouni
T. K. Rengarajan
Michael S. Rubino
Peter M. Spiro

Performance of DEC Rdb Version 6.0 on AXP Systems

The Alpha AXP family of processors provided a dramatic increase in CPU speed. Even with slower processors, many database applications were dominated by relatively slow I/O rates. To maintain a balanced system, database software must incorporate techniques that specifically address the disparity between CPU speed and I/O performance. The DEC Rdb version 6.0 database management system contains shorter code paths, fewer I/O operations, and reduced stall times. These enhancements minimize the effect of the I/O bottleneck and allow the AXP processor to run at its intended higher speeds. Empirical performance results show a marked improvement in I/O rates.

The DEC Rdb for OpenVMS AXP product (hereafter in this paper designated as DEC Rdb) is Digital's flagship database management system.1 The DEC Rdb relational database software competes effectively in multiple data processing domains such as stock exchanges, image-processing applications, telemedicine, and large databases used for decision support or scientific applications. Virtually all these application frameworks are demanding increased processing power and increased I/O capabilities.

The Alpha AXP processor family represents a quantum jump in the processing power of CPUs. It is designed to scale up to 1,000 times the current processing power in a decade.2 On the other hand, disk I/O latency is improving at a much slower rate than CPU power. As a result, especially on an AXP platform, the total time to execute a query is dominated by the time to perform the disk I/O operations. This disparity between processor speed and I/O latency is commonly called the I/O bottleneck or the I/O gap.3

In this paper, we describe our efforts to improve the performance of DEC Rdb on AXP systems. First we explain general porting steps that ensure a foundation of good performance on Alpha AXP systems. Then we describe our efforts to reduce the I/O bottleneck. We present the performance enhancements to various components of earlier versions of DEC Rdb and compare the enhancements and new features of version 6.0. Finally, we discuss the TPC-A transaction processing benchmark and present empirical results that quantify the benefits of the optimizations.

Performance Gains During the Port to OpenVMS AXP

In this section, we recount some of the initial, general performance modifications to the DEC Rdb software in the port from the VAX VMS system to the OpenVMS AXP system. We briefly discuss data alignment and reduction of the software subroutines or PALcode calls.

The DEC Rdb engineering developers saw the opportunity for increased performance through careful alignment and sizing of data structures. We needed to develop a solution that would exploit the performance gain of data alignment, yet still maintain an easy migration path for our large base of customers on VAX systems.

For the first release of DEC Rdb, all in-memory data structures were naturally aligned. In addition, many in-memory byte and word fields in these data structures were expanded to 32 bits of data (longwords). Once the in-memory data structures were aligned, we turned our attention to the on-disk data structures. The database root file, which is also frequently accessed, was completely aligned. New databases can be created with these aligned data structures, and existing databases can be aligned during a database convert operation on both VAX and AXP systems. This operation takes only seconds.

We did not align the data on the database pages in the storage area. Database pages contain the actual user data records. By leaving this data unaligned, we did not force a database unload/reload requirement on our customers. This factor and our support of clusters for both the VAX and the Alpha AXP architectures simplify migration from the VAX system to the AXP system.
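The alignment work can be illustrated with a small example. The structures below are hypothetical, not actual DEC Rdb structures; they only show the kind of byte/word widening and natural alignment the port performed. (The packed attribute is a GCC/Clang extension used here to mimic the old dense layout.)

    #include <stdio.h>
    #include <stdint.h>

    /* Old-style dense layout: byte and word fields, possibly
       misaligned, which Alpha AXP handles only with costly fixups. */
    struct dense_rec {
        uint8_t  flags;
        uint16_t count;
        uint32_t page_no;
    } __attribute__((packed));

    /* Widened to 32-bit longwords and naturally aligned. */
    struct aligned_rec {
        uint32_t flags;               /* was a byte field */
        uint32_t count;               /* was a word field */
        uint32_t page_no;
    };

    int main(void)
    {
        printf("dense:   %zu bytes (fields may be misaligned)\n",
               sizeof(struct dense_rec));
        printf("aligned: %zu bytes (every field on its boundary)\n",
               sizeof(struct aligned_rec));
        return 0;
    }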


After we completed the port of DEC Rdb, we ran performance benchmarks to determine which areas of the system could be enhanced. By using an internal tool called IPROBE, we learned that we could improve the performance of DEC Rdb by rewriting or eliminating code paths in PALcode subroutines. From January 1993 through July 1993, we reduced PALcode cycles from 143k to 62.6k per TPC-A transaction. Details of earlier performance modifications have been discussed in this Journal.4

Performance Enhancements

In this section, we describe how we improved the performance of DEC Rdb by addressing the I/O bottleneck and problems in the code path.

The general strategy to combat the I/O gap in DEC Rdb was twofold. The first step was to minimize I/O operations. We used the "global buffers" of DEC Rdb to avoid read I/O requests. We took advantage of the large physical memory available in modern computers to cache interesting parts of the database in memory and hence reduce the disk read I/O requests. The 64-bit Alpha AXP architecture allows computers to easily use more than 4 gigabytes (GB) of directly addressable memory. Write I/O requests are avoided by using the fast commit feature of DEC Rdb. The minimization of I/O activity is detailed in a technical report by Lomet et al.5

The second step was to reduce the stall time for the I/O operations that must be done. A reduced stall time allows DEC Rdb software to continue processing queries even when disk I/O operations are in progress on its behalf. Two features of DEC Rdb version 6.0, the asynchronous prefetch and the asynchronous batch writes, reduce the stall time for the read and write I/O requests, respectively.

Although this strategy handles the I/O gap, the write to the after image journal (AIJ) becomes a limiting factor in high-performance transaction processing (TP) systems. The stall time for the AIJ write is reduced through the AIJ log server and AIJ cache on electronic disk features of DEC Rdb version 6.0.

In the following sections, we discuss these features in detail.

Write I/O Requests

To read a new set of pages from the disk, DEC Rdb chooses a buffer in the buffer pool to be replaced. We refer to this as the victim buffer.

Prior to DEC Rdb version 6.0, writes of updated database pages to disk happened in synchronous batches. If the victim buffer is marked, a synchronous batch write is launched. In addition to the victim buffer, a number of least recently used (LRU) marked buffers are collected. The number of buffers in the batch write is specified by users as the BATCH_MAX parameter.

The list is then sorted by page numbers in order to perform global disk head optimization. After this, asynchronous disk write I/O requests are issued for all the marked buffers. In these earlier versions, the DEC Rdb product then waits until all the write I/O requests are completed. Although the individual writes are issued asynchronously and may complete in parallel, DEC Rdb waits synchronously for the entire batch write to complete. No query processing happens during the batch write. Figure 1 shows alternating synchronous batch writes and other work. Writes to the same disk are executed one after another, and writes to different disks are executed in parallel.

Figure 1 Alternating Synchronous Batch Writes and Other Work

The synchronous batch leverages two important optimizations: (1) parallelism between various disks and (2) disk head optimizations implemented at various levels of the storage hierarchy. The synchronous batch write feature reduced the average stall time per disk write I/O. The extent of reduction depends upon the degree to which the above two optimizations happen, which in turn depends upon the physical database design and application query behavior. In the TPC-A benchmark, the synchronous batch writes reduced the average stall time for the account write by 50 percent compared to synchronous individual writes.

To further reduce stall time, we implemented asynchronous batch writes (ABWs) in DEC Rdb version 6.0. With ABW, DEC Rdb now maintains the last few buffers unmarked. The size of this clean region of the buffer pool is specified by the user.
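The batch-write mechanics can be sketched as follows. This C is a simplified model: the buffer and I/O types are stand-ins, BATCH_MAX is the user parameter named above, and the final stall is the step that version 6.0's ABW removes.

    #include <stdbool.h>
    #include <stdlib.h>

    struct buffer { long page_no; bool marked; struct buffer *lru_next; };

    static void start_write(struct buffer *b) { (void)b; /* issue disk I/O */ }
    static void wait_all(void) { /* pre-6.0: stall until all writes finish */ }

    static int by_page(const void *a, const void *b)
    {
        long x = (*(struct buffer *const *)a)->page_no;
        long y = (*(struct buffer *const *)b)->page_no;
        return (x > y) - (x < y);
    }

    /* Collect the victim plus marked LRU buffers, sort by page number
       for disk-head optimization, then issue the writes together. */
    void batch_write(struct buffer *victim, struct buffer *lru, int batch_max)
    {
        struct buffer *batch[64];
        int n = 0;

        batch[n++] = victim;
        for (struct buffer *b = lru; b && n < batch_max && n < 64; b = b->lru_next)
            if (b->marked)
                batch[n++] = b;

        qsort(batch, n, sizeof batch[0], by_page);
        for (int i = 0; i < n; i++)
            start_write(batch[i]);

        wait_all();   /* synchronous batch write; ABW eliminates this stall */
    }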

As new pages are read from the database, database buffers migrate toward the end of the LRU chain of buffers. If a marked buffer were found in the clean region, an ABW would be invoked.

Figure 2 shows asynchronous batch writes invoked periodically, while other work continues. Processing does not explicitly wait for any of the disk write I/O requests to complete. (There may be implicit waits due to disk queuing effects.) Instead, processing continues. If a buffer with a pending write is chosen as the victim, or if one of the buffers with pending writes is required for further processing, DEC Rdb then waits for completion of the pending writes. For applications with good temporal locality, it is likely that the buffers with pending writes will not be required for further processing. Figure 2 also shows a rare instance in which other processing stops, waiting for one of the asynchronous write I/O requests to complete. Again, other processing includes stalls for disk read I/O requests. It is also possible to start a new ABW when the previous one has not yet completed.

Figure 2 Simultaneous Asynchronous Batch Writes and Other Work (key: other processing; stall for ABW; individual disk write I/O activity)

Read I/O Requests

Requests for database pages are often satisfied by a large global buffer pool. This is true when the whole database fits in a large memory, common on Alpha AXP systems. Under certain circumstances, however, the buffer pool is not large enough to satisfy all requests. Moreover, seldom-used data may be replaced by more frequently used data in the buffer pool.

The essential strategy is to submit asynchronous disk read I/O requests well before the data is really needed during query processing. If the asynchronous prefetch (APF) request is made far enough in advance of the actual request, the process will not need to stall for the I/O. Critical to any prefetch strategy is to reliably determine the desirable database pages for the immediate future.

Fortunately, applications request data in sets of rows. Therefore, based on user requests and the query optimizer decisions, the database access patterns are known at the time query execution starts. This allows the mechanism to prefetch the data from the database into the buffer pool.

In DEC Rdb version 6.0, we implemented the mechanism to prefetch data from the database based on requests from higher layers of DEC Rdb. This is performed in an integrated manner in the buffer pool with the usual page locking protection in a cluster. We also implemented the policy to use asynchronous prefetch in the case of sequential scans. Sequential scans are quite common in databases for batch applications producing large reports. They are also chosen by the optimizer when a large number of records are selected from a table for a query.

The user can specify the parameter APF_DEPTH, which controls the number of buffers to be used for prefetching database pages and hence the lead of the prefetch ahead of the real fetches. With a sufficient level of prefetch, it is quite possible to exploit the parallelism in the disk subsystem as well as achieve close to the spiral transfer rate of the disks, as we describe in the following test.

We placed one table in a mixed format area on an RA73 disk. The database server process used 400 buffers of 6-kilobyte (kB) size, and the APF_DEPTH parameter was set to 50 buffers. With these settings, the sequential scan of a 1-GB area took 593 seconds. This is equivalent to a transfer rate of 1.69 megabytes per second (MB/s), compared to the rated spiral transfer rate of 1.8 MB/s for the disk.

We performed another test to determine the improvement with the APF feature in DEC Rdb version 6.0. Again, we built a database with one table in a mixed format area, but this time on a stripe set of two RA92 disks. The database server process used 400 buffers of 3-kB size. The elapsed time to scan a 1-GB area in version 5.1 was 6,512 seconds, and the transfer rate was 0.15 MB/s. In version 6.0 the elapsed time was 569 seconds, and the transfer rate was 1.76 MB/s. An APF_DEPTH of 100 buffers was used for version 6.0. We find that APF has made sequential scan 10 times faster in DEC Rdb version 6.0. Note that this performance improvement can be made much better by using more disks in the stripe set and by using more powerful processors.
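The prefetch lead can be rendered as a simple scan loop. The sketch assumes asynchronous read primitives (stand-ins below) and uses the APF_DEPTH name from the text; the real mechanism also coordinates with cluster page locks.

    #define APF_DEPTH 50   /* user-specified prefetch lead, in buffers */

    static void start_read(long page) { (void)page; /* issue async read  */ }
    static void await_page(long page) { (void)page; /* usually a no-op   */ }
    static void process(long page)    { (void)page; /* query processing  */ }

    /* Keep APF_DEPTH reads in flight ahead of the scan point so the
       scan rarely stalls and the disk subsystem stays busy. */
    void sequential_scan(long first, long last)
    {
        long next = first;             /* next page to prefetch */
        for (long p = first; p <= last; p++) {
            while (next <= last && next < p + APF_DEPTH)
                start_read(next++);
            await_page(p);             /* stall only if the read is not done */
            process(p);
        }
    }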


Commit Time Processing

Although the disk I/O stalls are significantly reduced by the APF and ABW in DEC Rdb version 6.0, the commit time processing remains a significant wait.

Prior to version 6.0, DEC Rdb software used a cooperative flushing protocol. Each database server produced the AIJ log records and then used the lock manager to determine the group committer. The group committer then formatted the AIJ log records for all the servers and flushed the data to the AIJ file in one disk write I/O. Each server thus competed for the AIJ lock in order to become the group committer. Each server also had to acquire the AIJ lock even to determine if it had committed. Figure 3 shows this algorithm. DEC Rdb software also supports timer-based tuning methods to increase the group size for group commit.6

AIJ Log Server

With DEC Rdb version 6.0, a dedicated process called the AIJ log server (ALS) runs on every node of the cluster to perform the task of processing group writes to the AIJ file for all servers. The new algorithm is in two parts, one for the database servers and another for the ALS. Figure 4 shows these algorithms.

At commit time, database servers generate AIJ log records, store them in shared memory, and go to "sleep." They are "woken up" by the ALS when their commit records are successfully written to the AIJ file. The ALS gathers the AIJ log records of all users, formats them, writes them to the AIJ file, and then wakes up servers waiting for commit. Thus, the ALS performs AIJ writes in a continuous loop.

The ALS allows DEC Rdb version 6.0 to scale up to thousands of transactions per second with one magnetic disk using the TPC-A benchmark.

With the ALS, the average stall time for a server to commit is 1.5 times more than the time taken to perform one log write I/O. This stall is of the order of 17 milliseconds with 5,400-rpm disks. Note that this stall time is a function of disk performance only and is independent of the workload. In a high-throughput TP environment, where transaction times are very short, this stall at commit time is still a significant wait. Considering the speed of Alpha AXP processors, the commit stall is many times more than the processing time for servers.

AIJ Cache on Electronic Disk

The AIJ cache on electronic disk (ACE) is a feature of the ALS that also helps high-throughput TP systems. ACE utilizes a small amount (less than 1 MB) of very low latency, solid-state disk to reduce the commit stall time. Figure 5 shows the new ACE algorithm with ALS. The ACE file is on a solid-state disk shared by all nodes of a cluster. The file is partitioned for use by the various ALS servers in the cluster.

    Put AIJ data in shared memory
    Get AIJ lock
    if our data is flushed then begin
        Release AIJ lock
        return
    end
    ! We are the group committer
    Format AIJ data in shared memory
    Reserve space in AIJ file
    Release AIJ lock
    Write AIJ data to AIJ file
    Indicate AIJ data is flushed for group members

Figure 3 Cooperative Flushing Algorithm

Vo l. Digital Teclmical journal 26 6 No. I Winter 1994 Performance of DEC Rdb Ve rsion 6. 0on AXPSy stems

    Put AIJ data in shared memory
    Until committed
        sleep    ! Woken up after commit by ALS

(a) Database Server Algorithm

    Get AIJ lock
    loop begin
        Format AIJ data in shared memory
        Write AIJ data to AIJ file
        Indicate AIJ data is flushed for group members
        Wake up all committed processes
    end
    Release AIJ lock

(b) ALS Algorithm

Figure 4 AIJ Log Server Process
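For readers who think in threads rather than VMS processes, the division of labor in Figure 4 can be approximated with a condition variable. This is an analogy only (the real servers and the ALS are separate processes sharing memory, and the wakeup bookkeeping is richer); the counter scheme below simply ensures a server sleeps until a flush covering its record completes.

    #include <pthread.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  flushed = PTHREAD_COND_INITIALIZER;
    static unsigned long enqueued, durable;   /* log-record counters */

    /* Database server: publish the AIJ record, then "sleep". */
    void server_commit(void)
    {
        pthread_mutex_lock(&m);
        /* ... put AIJ data for this transaction in shared memory ... */
        unsigned long my_seq = ++enqueued;
        while (durable < my_seq)
            pthread_cond_wait(&flushed, &m);
        pthread_mutex_unlock(&m);
    }

    /* ALS: gather all pending records, write once, wake the group. */
    void *als_loop(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&m);
            unsigned long batch = enqueued;
            /* ... format AIJ data for records up to 'batch' ... */
            pthread_mutex_unlock(&m);
            /* ... one group write of AIJ data to the AIJ file ... */
            pthread_mutex_lock(&m);
            durable = batch;
            pthread_cond_broadcast(&flushed);  /* "wake up" the servers */
            pthread_mutex_unlock(&m);
        }
        return NULL;
    }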

    Get AIJ lock
    loop begin
        Format AIJ data in shared memory
        Save AIJ block number and length of I/O in ACE header
        Write AIJ data and ACE header to ACE file    ! 1 ms
        Wake up all committed processes
        Write AIJ data to AIJ file                   ! 11 ms
        Set flags to indicate AIJ data is flushed for group members
    end
    Release AIJ lock

Figure 5 ACE Algorithm


Basically, the data is first written to the ACE file, and the servers are allowed to proceed to the next transaction. The data is then flushed to the AIJ file in a second I/O. The write to the ACE disk includes the virtual block number of the AIJ file where the log data is supposed to be written. The ACE disk serves as a write-ahead log for the AIJ write. By doubling the number of disk writes per group, we reduced the response time for the servers to 0.5 times the stall for the AIJ write and 1.5 times the stall for the ACE write. This reduction in the stall time conversely increases the CPU utilization of servers, thereby reducing the number of servers required to saturate an AXP CPU.

Backup and Restore Operations

Simply stated, the way we traditionally perform backup does not scale well with system capacity, rarely uses its resources effectively, and is disruptive to other database activity. We designed the backup and restore operations to resolve these issues. Before we present that discussion, we first examine the problems with the traditional database backup operation.

Traditional Backup Process

Using OpenVMS BACKUP as an example of the traditional backup process, we find that a single backup process's performance does not scale with CPU performance, nor with system aggregate throughput. Instead, it is limited by device throughput. The only way to increase the performance is to perform multiple backups concurrently. However, concurrent backups executing on the same CPU interfere with one another to some extent. Each additional backup process provides less than 90 percent of the performance of the previous one. Five OpenVMS BACKUP operations executing on one CPU provide no greater performance than four operations, each executing on its own CPU. Consequently, five tape drives may provide only four times the performance of a single drive.

The OpenVMS BACKUP operation is limited by the lesser of the disk throughput and the tape throughput. Read performance for an RA73 disk may be as high as 1.8 MB/s, but OpenVMS BACKUP more typically achieves between 0.8 and 1.0 MB/s. Performance for a TA92 tape is 2.3 MB/s. Five tape drives, with an aggregate throughput of 11.5 MB/s, and 25 disks, with an aggregate throughput of 45.0 MB/s, can only be backed up at a rate of 4.0 MB/s (14.4 GB per hour). Increasing the CPU capacity and aggregate throughput improves the backup performance, but not proportionally. Low device utilization and nonlinear scaling mean that as the system capacity and database size increase, the cost in system time and hardware for a given level of backup performance becomes increasingly burdensome.

The traditional backup process, such as provided by OpenVMS BACKUP, is not coordinated with database activity. The database activity must be prohibited during the backup. If it is not, the restore operation will produce an inconsistent view of the data. In the latter case, database activity journals are required to return the database to consistency. Application of these journals significantly reduces the performance of the restore process.

To maximize the performance of the traditional backup process, the backup must consume a large portion of the entire throughput of the disks being backed up. As a result, database activity and backup activity severely impede each other when they compete for these disks.
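As a quick check of the throughput arithmetic above (published figures only):

    #include <stdio.h>

    int main(void)
    {
        double tape = 5 * 2.3;    /* five TA92 drives, MB/s       */
        double disk = 25 * 1.8;   /* twenty-five RA73 disks, MB/s */
        printf("tape limit %.1f MB/s, disk limit %.1f MB/s\n", tape, disk);

        /* traditional backup achieves about 4.0 MB/s against the
           11.5 MB/s tape-side bound */
        printf("4.0 MB/s sustained = %.1f GB per hour\n",
               4.0 * 3600.0 / 1000.0);
        return 0;
    }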


RMU BACKUP Operation

The RMU BACKUP operation resolves the problems with the traditional backup operation. RMU BACKUP is coordinated with database activity. It produces a consistent image of the database at a point in time without restricting database activity or requiring application of journals after the restore operation.

The RMU BACKUP operation is a multithreaded process; therefore, it backs up multiple disks to multiple tapes and eliminates the limiting factor associated with the throughput of a single disk or a single tape drive. The aggregate of disk and tape throughputs determines the performance of RMU. Because the aggregate disk throughput is usually significantly higher than the aggregate tape throughput, all the disks have spare throughput at all times during RMU BACKUP. Consequently, the RMU BACKUP process and database activity interfere to a much lesser extent than is the case with traditional backup. Its multithreaded design also scales linearly with CPU capacity, aggregate disk throughput, and aggregate tape throughput.

The first steps of an RMU BACKUP are to evaluate the physical mapping of the database to disk devices and to determine the system I/O configuration. RMU then devises a plan to execute the backup. The goals of this plan are to divide the data among the tape drives equally, to prohibit interference between devices sharing a common I/O path, and to minimize disk head movement. The generated plan is a compromise because complete and accurate configuration data is difficult to collect and costly to assimilate. To implement the plan, RMU creates a network of interacting threads. Each thread operates asynchronously, performing asynchronous multibuffered I/O to its controlled device. Interthread communication occurs through buffer exchange and shared memory structures.

RMU uses the database page checksum to provide end-to-end error detection between the database updater and the backup operation. It uses a backup file block cyclic redundancy check (CRC) to provide end-to-end error detection between the backup and the restore operations. In addition, RMU uses XOR recovery blocks to provide single error correction on restore. As a consequence, the data being backed up must be processed four times, which has a major effect on CPU usage. The first time is to evaluate the database page checksum. The second time is to copy the data and to exclude unused storage and redundant structures; here we are willing to expend extra CPU cycles to reduce the I/O load. The third time is to generate an error detection code (CRC), and the fourth time is to generate an XOR error recovery block.
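The shape of the four passes can be sketched in C. The checksum and CRC below are generic placeholders (a simple additive checksum and the common CRC-32 polynomial), not RMU's actual algorithms, and the copy pass is omitted; only the XOR recovery pass works exactly as described.

    #include <stdint.h>
    #include <stddef.h>

    #define BLOCK 8192                /* illustrative block size */

    /* Pass 1: verify the database page checksum (end-to-end check
       against the database updater). Simplified additive form. */
    static int checksum_ok(const uint8_t *page, size_t n, uint8_t want)
    {
        uint8_t sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += page[i];
        return sum == want;
    }

    /* Pass 3: an error-detection code over each backup file block. */
    static uint32_t block_crc(const uint8_t *p, size_t n)
    {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < n; i++) {
            crc ^= p[i];
            for (int k = 0; k < 8; k++)
                crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
        return ~crc;
    }

    /* Pass 4: XOR each block of a group into a recovery block so a
       single lost block can be rebuilt on restore. */
    static void xor_into(uint8_t recovery[BLOCK], const uint8_t blk[BLOCK])
    {
        for (size_t i = 0; i < BLOCK; i++)
            recovery[i] ^= blk[i];
    }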
The RMU BACKUP operation improves performance in other ways. It does not back up redundant database structures, nor does it back up allocated storage that does not contain accessible data. As a result, the size of the backup file is significantly reduced, and relative backup performance is improved. The incremental backup feature selectively backs up only those database pages that have been modified. This provides an additional, significant reduction in backup file size relative to a file-oriented backup. These reductions in backup file size further improve RMU BACKUP performance relative to traditional backup.

The performance of the RMU RESTORE operation mirrors that of the RMU BACKUP operation. In spite of this, a natural asymmetry between the operations allows RMU BACKUP to outperform RMU RESTORE by 20 percent to 25 percent. There are several reasons for the asymmetry: the cost of allocating files, the asymmetry of read versus write disk performance, the asymmetric response of the I/O subsystem to read versus write I/O operations under heavy load, and the need to re-create or initialize redundant data that was not backed up by RMU BACKUP.

The RMU RESTORE operation can restore the entire database, selected database files, and even selected pages within the files. Restoring files or pages requires exclusive access to only the objects being restored. This is the only restriction on database activity during the restore operation.

We performed a test to illustrate the effectiveness and scalability of RMU BACKUP with database size and system capacity. Table 1 gives the results. This test also demonstrates the high level of backup performance that can be provided on AXP systems without the use of exotic or expensive technology. Although the results are limited by the aggregate tape I/O performance, the CPU does not appear to have sufficient excess capacity to justify a test with a sixth tape drive.

Unfortunately, comparable numbers are not available for other database products on comparable platforms. Nevertheless, when any database system relies on the operating system's backup engine, the expected performance limits are the same as in the scenario presented.

Table 1 Backup Performance on AXP Systems

Configuration
  System:             DEC 7000 Model 610 (182 MHz)
  Disks:              RA73 (23), RZ23 (2)
  Controllers:        HSC95 (6), CIXCD (6), KMC44 (5)
  Tape drives:        TA92 (5)
  Operating system:   OpenVMS V1.5
  Database software:  DEC Rdb V5.1
  Database size:      48.8 GB (71% relational data, or 34.7 GB)

Performance
  Backup time:        1:11:29
  Backup rate:        41.0 GB/hour
  Sustained disk data rate:  455 kB/s per disk (25% utilization)
  Sustained tape data rate:  2.3 MB/s per tape (100% utilization)
  Restore time:       1:32:49
  Restore rate:       31.6 GB/hour

The Sorting Mechanism

A sorting mechanism is a key component of a database system. For example, the relational operations of join, select, and project, which are executed to satisfy user queries, often sort records to enable more efficient algorithms.

Furthermore, selected records must often be presented to the user in sorted order. Quite literally, a faster sorting mechanism directly translates to faster executing queries. Hence, during the port to the Alpha AXP platform, we also focused on ensuring that the DEC Rdb sorting mechanism worked well with the Alpha AXP architecture. In the case of the sort mechanism, we reduced I/O stalls and optimized for processor cache hits rather than accesses to main memory.

Prior to the port to Alpha AXP, DEC Rdb utilized only the replacement-selection sort algorithm. For AXP systems, we have modified the sort mechanism so that it can also utilize the "quicksort" algorithm under the right conditions.7 In fact, we developed a modified quicksort that sorts (key prefix, pointer) pairs. Our patented modification allows a better address locality that exploits processor caching.8 This is especially important on the Alpha AXP platform, since the penalty for addressing memory is significant due to the relatively fast processor speed. The quicksort algorithm also performs better than the replacement-selection algorithm in a memory-rich environment, which is the trend for Alpha AXP systems.

To summarize, if the required sort fits in main memory, DEC Rdb version 6.0 utilizes the optimized quicksort algorithm; if the sort requires temporary results on disk, DEC Rdb uses the traditional replacement-selection sort algorithm. When writing temporary results to disk, both sort algorithms also utilize asynchronous I/O mechanisms to reduce I/O stalls.

The DEC Rdb version 6.0 implementation of the quicksort algorithm is based on the AlphaSort algorithm developed at Digital's San Francisco Systems Center. The AlphaSort algorithm achieved a world-record sort of 7 seconds on an industry-standard sort benchmark. This result is more than 3 times faster than the previous sort record of 26 seconds on a CRAY Y-MP system.
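The (key prefix, pointer) idea can be shown in miniature. The code below is a simplified rendering, not the patented AlphaSort layout: comparisons touch a compact array that caches well, falling back to the full key through the pointer only on prefix ties.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    struct rec { char key[16]; /* ... record payload ... */ };

    struct sort_item { uint32_t prefix; const struct rec *r; };

    /* First four key bytes, big-endian so integer order matches
       byte order. */
    static uint32_t key_prefix(const char *k)
    {
        return ((uint32_t)(uint8_t)k[0] << 24) |
               ((uint32_t)(uint8_t)k[1] << 16) |
               ((uint32_t)(uint8_t)k[2] <<  8) |
                (uint32_t)(uint8_t)k[3];
    }

    static int cmp(const void *a, const void *b)
    {
        const struct sort_item *x = a, *y = b;
        if (x->prefix != y->prefix)
            return x->prefix < y->prefix ? -1 : 1;
        /* tie: consult the full key through the pointer */
        return memcmp(x->r->key, y->r->key, sizeof x->r->key);
    }

    void sort_records(const struct rec *recs, struct sort_item *items,
                      size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            items[i].prefix = key_prefix(recs[i].key);
            items[i].r      = &recs[i];
        }
        qsort(items, n, sizeof items[0], cmp);  /* quicksort the pairs */
    }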
Multi-statement Procedures

Prior to the implementation of multi-statement procedures in DEC Rdb, individual SQL statements were serially submitted to the database engine for execution. This method of execution incurs an excessive code path because individual SQL statements must traverse different layers before they reach the database engine.

When clients and servers communicate over a network, each SQL statement incurs the additional overhead of two network I/O operations. The result is a long transaction code path as well as delays due to excessive network traffic. Figure 6 shows the execution path that individual SQL statements traverse in both the local and the remote cases.

Network overhead is a significant problem for client-server applications. Without the services of the VAX ACMS TP monitor, 14 network I/O operations are required to complete a single TPC-A transaction. Table 2 lists the TPC-A pseudocode needed to complete one transaction.

Figure 6 SQL Statement Execution Flow (the client application calls through its application interface; in the remote case the request crosses the DECnet network interface to the Rdb server before reaching the database engine)

Table 2 Individual SQL Statements for Single TPC-A Transaction

TPC-A Pseudocode                                      Network I/O Operations
Start an update transaction                           2
Update a row in BRANCH table                          2
Update a row in TELLER table                          2
Update a row in ACCOUNT table                         2
Select Account_balance from ACCOUNT
  and display it on the terminal                      2
Insert a row in the HISTORY table                     2
Commit transaction                                    2
Total network I/O operations per TPC-A transaction    14


Prior to DEC Rdb version 6.0, client applications accessing remote servers relied on the services of the VAX ACMS TP monitor to reduce the network overhead. With ACMS present on both the client and the server, a message carrying a transaction request is transferred to the server in one network I/O. As shown in Figure 7, an ACMS server is then selected to execute the request, and a message is returned to the client upon transaction completion. This reduces the number of network I/O operations to two per transaction. In simple applications such as TPC-A, however, a TP monitor can be very intrusive. Measurements taken on a VAX 6300 system running the TPC-A benchmark and using ACMS revealed that the TP monitor consumes 20 percent to 25 percent of the cycles on the back-end server.

SQL multi-statement procedures in DEC Rdb version 6.0 address both these performance issues. They reduce the transaction code path by compounding a number of SQL statements in one BEGIN-END procedure that fully adheres to the VAX and AXP procedure calling standards. The BEGIN-END procedure is atomically submitted to the database manager to compile once per database session and to execute as often as the application requires for the duration of that particular session.

When multiple SQL statements are bundled into a single BEGIN-END block, only two network I/O operations are required between the client and the remote server for each database request. Through the DEC Rdb remote facility, which is an integral part of DEC Rdb, client-server applications no longer need the services of a TP monitor. Therefore many of the processing cycles that would have been dedicated to the TP monitor are regained and applied toward processing requests.

The DEC Rdb remote server is an ordinary VMS process that is created upon the first remote request to the database. It has less overhead than the TP monitor. The DEC Rdb remote server remains attached to the database; it communicates with and acts on behalf of its client for the duration of a database session. Figure 8 shows our implementation of the TPC-A client-server application with DEC Rdb version 6.0. With the implementation of multi-statement procedures for the TPC-A transaction, we reduced the code path approximately 20 percent to 25 percent.

Performance Measurement

In this section, we briefly describe a TPC-A transaction and the TPC-A benchmark. We discuss our goals for TPC-A and recount our progress. Finally, we present profiling and benchmark results.

Figure 7 Client-server Application with ACMS (remote terminal emulators drive client processors; SP is an ACMS server process, EXC the execution controller)

Figure 8 Client-server Application with DEC Rdb Version 6.0 (RS is the DEC Rdb server; EXC the execution controller)


TPC-A Transaction

The TPC-A transaction is a very simple database transaction: a user debits or credits some amount of money from an account. In database terms, that requires four updates within the database: modify the user account balance, modify the branch account balance, modify the teller account balance, and store a history record. The resulting metric indicates how many transactions are performed per second (TPS).9

In terms of complexity, the TPC-A transaction falls somewhere in the middle range of benchmarks. In other words, the SPECmark class of benchmarks is very simple and tends to stress processors and caching behavior; the sorting benchmarks (e.g., AlphaSort) expand the scope somewhat to test processors, caches, memory, and I/O capabilities; the TPC-A benchmark tests all the above in addition to stressing the database software (based on the relatively simple transaction). Other benchmarks such as TPC-C and TPC-D place even more emphasis on the database software, thereby overshadowing the hardware ramifications. Hence the TPC-A benchmark is a combined test of a processor and the database software.

During the porting cycle, Digital was aware that the highest result in the industry for a single-processor TPC-A benchmark was approximately 185 TPS. Toward the end of the porting effort, we worked for six to nine months to ensure that DEC Rdb had attained optimal performance for the TPC-A transaction.

In April 1993, the DEC Rdb database system was officially audited on a DEC 10000 AXP system at the world-record rate of 327.99 TPS. As a result, DEC Rdb became the first database system to exceed the 300 TPS mark on a single processor. A few weeks later, DEC Rdb was again audited and achieved 527.73 TPS on a dual-processor DEC 10000 AXP system. Thus, DEC Rdb became the first database system to exceed 500 TPS on a dual-processor machine. Table 3 gives our results; the audits were performed by KPMG Peat Marwick. The mean qualified throughput (MQTh) is the transaction rate at steady state, and $K/tpsA is a measure of the price per transaction for the hardware and software configuration.

Anatomy of the TPC-A Transaction

To understand how DEC Rdb achieves such fast transaction rates, we need to examine the effects of the optimizations to the software with respect to the execution of the TPC-A transaction. To complete the four updates required by the TPC-A transaction, DEC Rdb actually incurs two physical I/O operations and two lock operations. More specifically, the branch and teller records are located on a page that is cached in the database buffer pool; therefore, these two records are updated extremely fast. The update to the account record causes the account page/record to be fetched from disk and then modified. A history record is then stored on a page that also remains in the buffer pool.

Note that there are too many account records for all of them to be cached in the buffer pool. Hence, the account fetch is the operation that causes pages to cycle through the buffer pool and eventually be flushed to disk. The ABW protocols described previously are utilized to write groups of account pages back to disk asynchronously. (The branch, teller, and history pages never approach the end of the LRU queue.)

In regard to locking, the branch, teller, and history pages/records are all governed by locks that are carried over from one transaction to the next. There is no need to incur any locks to update these records. The account page/record, which must be fetched from disk, requires a new lock request.

At commit time, the after images of the record modifications are submitted to the AIJ log. The new protocols allow the user process to "sleep" while the ALS process flushes the AIJ records to the AIJ file. After the ACE I/O has completed, which occurs before the I/O to the AIJ file on disk, the user processes are "woken up" to begin processing their next transactions.

As shown in Figures 9 and 10, the transaction is dominated by stall times.
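The transaction's in-path costs can be tallied in a small model. The sketch below is ours (Python), not DEC Rdb internals; counting the commit-time AIJ flush as the second physical I/O, and splitting the new account lock request into page and record locks, are assumptions made to match the stated totals.

```python
# Toy cost model of the TPC-A transaction path described above (our
# sketch, not DEC Rdb code). Totals match the paper's "two physical
# I/O operations and two lock operations"; the exact breakdown of the
# two locks and the second I/O is our assumption.

IN_PATH_COSTS = [
    # (operation, physical_ios, lock_ops)
    ("update branch record: cached page, carried-over lock",  0, 0),
    ("update teller record: cached page, carried-over lock",  0, 0),
    ("store history record: cached page, carried-over lock",  0, 0),
    ("fetch and update account record: disk read, new locks", 1, 2),
    ("commit: flush after-images to the AIJ log",             1, 0),
    # the account write-back is batched and asynchronous (ABW),
    # so it adds no stall time to the transaction path
]

ios = sum(io for _, io, _ in IN_PATH_COSTS)
locks = sum(lk for _, _, lk in IN_PATH_COSTS)
print(ios, locks)  # -> 2 2
```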

Table 3  DEC Rdb TPC-A Benchmarks

Processor                 Cycle Time    MQTh      $K/tpsA
DEC 7000 AXP Model 610    5.5 ns        302.68    $6,643.00
DEC 7000 AXP Model 610    5.0 ns        327.99    $6,749.00
DEC 7000 AXP Model 620    5.0 ns        527.73    $6,431.00


[Figure 9: Transaction Duration before Modifications. Timeline of the branch, account, teller, and history updates and commit processing within one TPC-A transaction.]

[Figure 10: Improved Transaction Duration due to Performance Modifications. The same timeline after optimization, with the account write I/O and commit stall removed from the transaction path.]

Since the AXP processors are so fast, the branch, teller, and history updates and the two locks are a very small fraction of the transaction duration. The synchronous account read is a big expense. The batched asynchronous account writes are interesting. Indeed, each transaction requires one account read, which then causes one account write since the buffer pool overflows. Because the account writes are batched into groups and written asynchronously, however, there is no stall time required in the path of the transaction.

Another critical performance metric in the TPC-A benchmark is response time. Figure 11 shows the response times, with an average of about 1 second. Figure 12 shows the response time during the steady-state period of the TPC-A experiment.

[Figure 11: Response Time versus TPS. Histogram of the number of transactions by response time (seconds); mean response time 1.01 seconds, 90th percentile response time 1.64 seconds, maximum response time 13.82 seconds.]

Performance Profiling of DEC Rdb

In this section, we describe in more detail the instruction profile (i.e., instruction counts, machine cycles, and processor modes) generated during the TPC-A tests.

Performance profiling of DEC Rdb was obtained using Digital's IPROBE tool, the RMU, and the VMS performance monitor. IPROBE is an internal tool built to capture information from the two processor counters that were established to count on-chip events and interrupts after a threshold value was reached.

The TP1 benchmark, a back-end-only version of the TPC-A benchmark, was used to measure DEC Rdb performance on a DEC 7000 AXP Model 610 configured with 5 KDM70 disk controllers and 20 RA70 disk devices. We relied on the IPROBE tool primarily to generate PC (program counter) sampling and to track transaction path length variations as new features were prototyped and performance optimizations were added. We also used IPROBE to verify cache efficiency and the instruction mix in TP applications. We used the RMU to measure the maximum throughput of the TP1 benchmark and the database and application behaviors.

Transaction Cycles Profile

We conducted experiments to determine the performance of TP1 transactions. We used the DEC 7000 AXP Model 610 system (with a 5.5-nanosecond processor) and the OpenVMS AXP version 1.5 operating system. With this configuration, DEC Rdb version 6.0 software achieved a maximum throughput of 334 TPS for the TP1 benchmark. More than 95 percent of the transactions completed in less than 1 second.

Table 4 gives the distribution of the cycles per transaction and the cycles in PALcode in the various CPU modes. The cycle count per TP1 transaction was measured at 544,000 cycles. PALcode calls represent approximately 13 percent of the overall cycle count.


[Figure 12: TPC-A Measurement at Steady State. Average response time (seconds) versus time (minutes, 0 to 70) across the data collection interval.]

Table 4  Transaction Cycles in CPU Modes

                    Cycles per        % Cycles in System Modes
                    Transaction    Interrupt   Kernel   Executive   User
Cycles              544.0k         7.1         14.0     73.0        5.8
PALcode cycles      71.5k          13.0        18.0     61.4        7.3
Dual issues         19.0k          5.1         11.6     79.0        4.0

In measurements taken with early OpenVMS AXP base levels, PALcode calls represented 28 percent of the overall cycle count. The most frequently called PALcode functions were misses in the data and instruction translation buffers. Four major DEC Rdb images may now be installed /RESIDENT to take advantage of granularity hints and reduce misses in the instruction translation buffer. Dual issuing remained low throughout the experiments we conducted with various DEC Rdb and OpenVMS AXP base levels.

TP1 Transaction Path Length Profile

At the initial stage of the DEC Rdb port to the Alpha AXP platform in January 1993, the TP1 transaction path length was measured at 300,000 reduced instruction set computer (RISC) instructions with the available OpenVMS AXP base levels. In April 1993, the TP1 transaction path length dropped to 189,000 RISC instructions. As shown in Table 5, the TP1 transaction path length is currently at 133,500 RISC instructions. That measurement is 30 percent less than it was in April 1993. The cycles per instruction were measured at 3.8.

Summary

Based on current performance and future trends, the Alpha AXP family of processors and platforms will provide superb servers for high-end production systems.

Table 5  TP1 Transaction Path Length

               Transaction    Cycles per          % Time in CPU Modes
MQTh           Path Length    Instruction    Interrupt   Kernel   Executive   User
334 TP1 TPS    133.5k         3.8            4.8         11.3     79.9        3.4
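As a quick cross-check (our arithmetic, not the paper's), the measured cycle count and cycle time reproduce the reported TP1 throughput, and the path-length figures reproduce the quoted reduction:

```python
# Back-of-the-envelope cross-checks of Tables 4 and 5 (our arithmetic).

cycles_per_txn = 544_000            # Table 4: cycles per TP1 transaction
cycle_time = 5.5e-9                 # 5.5-ns DEC 7000 AXP Model 610

cpu_time = cycles_per_txn * cycle_time       # ~2.99 ms of CPU per txn
print(round(1 / cpu_time))                   # ~334, the reported TP1 TPS

path_apr, path_now = 189_000, 133_500        # RISC instructions per txn
print(round((1 - path_now / path_apr) * 100))
# ~29, consistent with the reported "30 percent less" than April 1993
```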


To keep pace with the phenomenal increases in CPU speed, database systems must incorporate features that reduce the I/O bottleneck. DEC Rdb version 6.0 software has incorporated a number of these features: asynchronous page fetch, asynchronous batch write, multithreaded backup and restore, multi-statement procedures, and an AIJ log server using electronic caches. These enhancements not only allow optimal transaction processing performance but also permit systems to deal with very large data sets.

Acknowledgments

The development of DEC Rdb V6.0 was a team effort involving more people than can be acknowledged here. We would, however, like to recognize the significant contributions of Jay Feenan, Richard Pledereder, and Scott Matsumoto for their design of the multi-statement procedures feature of DEC Rdb and of Ed Fisher for his work on the Rdb code generator.

References

1. L. Hobbs and K. England, Rdb/VMS: A Comprehensive Guide (Burlington, MA: Digital Press, 1991).

2. R. Sites, "Alpha AXP Architecture," Digital Technical Journal, vol. 4, no. 4 (1992): 19-34.

3. J. Ousterhout and F. Douglas, "Beating the I/O Bottleneck: A Case for Log-Structured File Systems," Technical Report, University of California at Berkeley (1988).

4. J. Coffler, Z. Mohamed, and P. Spiro, "Porting Digital's Database Management Products to the Alpha AXP Platform," Digital Technical Journal, vol. 4, no. 4 (1992): 153-164.

5. D. Lomet, R. Anderson, T. Rengarajan, and P. Spiro, "How the Rdb/VMS Data Sharing System Became Fast," Technical Report CRL92/4, Digital Equipment Corporation, Cambridge Research Laboratory (1992).

6. P. Spiro, A. Joshi, and T. Rengarajan, "Designing an Optimized Transaction Commit Protocol," Digital Technical Journal, vol. 3, no. 1 (Winter 1991): 70-78.

7. D. E. Knuth, Sorting and Searching, The Art of Computer Programming (Reading, MA: Addison-Wesley Publishing Company, 1973).

8. C. Nyberg, T. Barclay, Z. Cvetanovic, J. Gray, and D. Lomet, "AlphaSort: A RISC Machine Sort," Technical Report 93.2, Digital Equipment Corporation, San Francisco Systems Center (1993).

9. J. Gray, The Benchmark Handbook (San Mateo, CA: Morgan Kaufmann Publishers, Inc., 1993).

William L. Goleman  Robert G. Thomson  Paul J. Houlihan

Improving Process to Increase Productivity While Assuring Quality: A Case Study of the Volume Shadowing Port to OpenVMS AXP

The volume shadowing team achieved a high-quality, accelerated delivery of volume shadowing on OpenVMS AXP by applying techniques from academic and industry literature to Digital's commercial setting. These techniques were an assessment of the team process to identify deficiencies, formal inspections to detect most porting defects before testing, and principles of experimental design in the testing to efficiently isolate defects and assure quality. This paper describes how a small team can adopt new practices and improve product quality independent of the larger organization and demonstrates how this led to a more enjoyable, productive, and predictable work environment.

To achieve VMScluster support in the OpenVMS AXP version 1.5 operating system one year ahead of the original plan, OpenVMS Engineering had to forego early support of Volume Shadowing Phase II (or "shadowing"). Shadowing is an OpenVMS system-integrated product that transparently replicates data on one or more disk storage devices. A shadow set is composed of all the disks that are shadowing (or mirroring) a given set of data. Each disk in a shadow set is referred to as a shadow set member. Should a failure occur in the software, hardware, firmware, or storage media associated with one member of a shadow set, shadowing can access the data from another member.

The ability to survive storage failures is quite important to customers of OpenVMS systems where data loss or inaccessibility is extremely costly. Such customers typically combine shadowing and VMScluster technologies to eliminate single points of failure and thereby increase data availability. For these customers, delayed support for shadowing on the OpenVMS AXP system meant either foregoing the advanced capabilities of an Alpha AXP processor within their VMScluster systems or foregoing the additional data availability that shadowing provides. To resolve this dilemma, OpenVMS Engineering began a separate project to rapidly port shadowing to the OpenVMS AXP system. This project had three overall goals:

• Provide performance and functionality equivalent to the OpenVMS VAX system

• Allow trouble-free interoperability across a mixed-architecture VMScluster system

• Deliver to customers at the earliest possible date

All three goals were met with the separate release of shadowing based on OpenVMS AXP version 1.5 in November 1993, more than six months ahead of the original planned release for this support.

In the following sections, we describe how we achieved these goals by reshaping our overall process, reworking our development framework, and redirecting our testing. In the final section on project results, we demonstrate how our improved process assures quality and increases productivity. This paper assumes familiarity with the shadowing product and terminology, which are described fully in other publications.1,2

Reshaping the Overall Process

Because the need was urgent and the project well-defined, we could have leapt directly into porting the shadowing code. Instead, we took a step back to evaluate how best to deliver the required functionality in the shortest time and how best to verify success. Doing so meant taking control of our software development process.


Effective software process is generally acknowledged as essential to delivering quality software products. The Capability Maturity Model (CMM) developed by the Software Engineering Institute embodies this viewpoint and suggests that evolving an entire organization's process takes time.3,4 Grady and Caswell's experience implementing a metrics program at Hewlett-Packard bears out this viewpoint.5 Our experience with the continuous improvement of software development practices within Digital's OpenVMS Engineering does so as well.

However, our engineering experience also suggests that the current emphasis on evolving an entire organization's process tends to overshadow the ability of a small group to accelerate the adoption of better engineering practices. Within the context of an individual software project, we believed that process could be readily reshaped and enhanced in response to specific project challenges. We further believed that such enhancements could significantly improve project productivity and predictability.

Identifying Process Challenges

At the project's outset, we identified four major challenges that we believed the project faced: configuration complexity, defect isolation costs, beta test ineffectiveness, and resource constraints.

Configuration Complexity

Our most significant challenge was to devise a process to efficiently validate the product's complex operating environment: a mixed-architecture VMScluster system comprising both Alpha AXP and VAX processors (or nodes).6 Digital's VMScluster technology currently supports a configuration of loosely coupled, distributed systems comprising as many as 96 AXP and VAX processors. These nodes may communicate over any combination of four different system interconnects: Computer Interconnect (CI), Digital Storage Systems Interconnect (DSSI), fiber distributed data interface (FDDI), and Ethernet. VMScluster systems support two disk storage architectures, the Digital Storage Architecture (DSA) and the small computer systems interface (SCSI), and dozens of disk models. Once ported, shadowing would be required to provide a consistent view across all nodes of as many as 130 shadow sets. Each shadow set may involve a different model of disk and may span different controllers, interconnects, nodes, or processor architectures. The potential number of configuration variations is exponential.

Defect Isolation Costs

A second major process challenge was to contain the cost of isolating defects. A defect is defined to be the underlying flaw in the OpenVMS software that prevents a VMScluster system from meeting customer needs. System software defects can be triggered by VMScluster hardware, firmware, and software. Since few individuals possess the combined skills necessary to troubleshoot all three areas, defect isolation normally involves a team of professionals, which adds to the cost of troubleshooting VMScluster operating system software.

Debugging of shadowing code is difficult since it executes in the restricted OpenVMS driver environment: in kernel mode at elevated interrupt priority level. Shadowing is also written mostly in assembly language. To maintain shadow set consistency across all 96 nodes of a VMScluster system, much of the shadowing code involves distributed algorithms. Troubleshooting distributed algorithms can greatly increase isolation costs, since a given node failure is often only incidental to a hardware, firmware, or software defect occurring earlier on another VMScluster node.

Many shadowing problem reports ultimately prove to contain insufficient data for isolating the problem. Other problem reports describe user errors or hardware problems; some are duplicates. For example, Figure 1 shows the trend for Volume Shadowing Phase II problems reported, problems resolved, and defects removed between December 1992 and April 1993.

[Figure 1: Problem Handling and Defect Removal on VAX, December 1992 to April 1993. Weekly rates of problems reported, problems resolved, and defects removed.]


During this period, only one defect was fixed for every ten problem reports closed. Because this low ratio is not typical of most OpenVMS subsystems, it is not readily accommodated by our traditional development process.

Beta Test Ineffectiveness

A third process challenge was that customer beta testing had not contributed significantly to shadowing defect detection. Justifiably, most customers simply cannot risk incorporating beta test software into the kind of complex production systems that are most likely to uncover shadowing problems. Figure 2 shows the distribution of shadowing problem reports received from its inception in January 1990 to January 1993. During these three years, only 8 percent of the problem reports came from customer beta test sites. In contrast, 46 percent of the problem reports came from stress test and alpha test sites within Digital, where testing was based on large, complex VMScluster configurations.

[Figure 2: Sources of Shadowing Problem Reports, January 1990 through January 1993. Module testing 41%, alpha testing 28%, stress testing 18%, beta testing 8%, production use 5%.]

Resource Constraints

A fourth process challenge for the shadowing port was competition for engineering resources. Only the development and validation project leaders could be assigned full-time. The ongoing demands of supporting shadowing on OpenVMS VAX precluded members of the existing shadowing team from participating in the port. Most other engineering resources were already committed to the accelerated delivery of VMScluster support in OpenVMS AXP version 1.5. As a consequence, the majority of the shadowing team comprised experienced OpenVMS engineers whose familiarity with shadowing was limited, whose individual skill sets were often incomplete for this particular project, and whose availability was staggered over the course of the project. Moreover, the team was split between the United States and Scotland and, hence, separated by a six-hour time difference.

Making Process Enhancements

To meet these challenges, we believed our overall process required enhancements that would provide:

• Independent porting tasks within a collaborative and unifying development framework

• Aggressive defect removal with an emphasis on containing porting defects

• Directed system testing that preceded large-scale stress testing

• Clear validation of shadowing's basic error-handling capabilities

Figure 3 shows our reshaped process for the shadowing port. Each step in the process is depicted in a square box, starting with planning and ending with the project completion review. New steps in the process are shaded gray. The most significant enhancements were the insertion of inspection and profile testing steps. To evaluate our progress in removing defects, we incorporated defect projections for each development step into our release criteria. To track this progress, we supplemented the organization's problem-reporting database with a project-defect database. Emphasizing error insertion during profile and acceptance test allowed for validation of shadowing's error-handling capabilities.

In making these process enhancements, we were careful to maintain both consistency with prevailing industry practices and compatibility with current practices within OpenVMS Engineering. We felt that adopting ideas proven in industry and having a common framework for communication within our organization would increase the probability of success for our enhancements. How we implemented these enhancements is described in the following sections.

Measuring Process Effectiveness

Establishing Release Criteria

In formulating the release criteria for shadowing given in Table 1, we used Perry's approach of:

• Establishing the quality factors that are important to the product's success

• Mapping the factors onto a set of corresponding attributes that the software must exhibit

• Identifying metrics and threshold values for determining when these software attributes are present.7


[Figure 3: Enhanced Development and Validation Process. The flowchart runs from planning (coding, testing, process) and establishing criteria, through inspection of changes and fixes, debugging, and profile testing (stable, handles errors), to alpha, beta, stress (general, large, complex), and acceptance testing, with readiness evaluations, process feedback, and a final project completion review.]

Table 1  Shadowing Release Criteria

Quality Factor     Software Attribute         Software Metric                          Threshold Value
Reliability        Error tolerance            Profile test completion                  100%
                   Operational accuracy       Defects detected                         240
                   Operational consistency    Incoming problem reports per week        Near 0
                                              Acceptance test with error insertion     200 node-hours
Integrity          Data security              Unresolved high-severity defects         None
Correctness        Completeness               Code ported                              100%
                                              Module test completion                   100%
                                              Code change rate per week                0
Efficiency         Processing time,           Queue I/O reads and writes;              Comparable to VAX
                   Throughput                 copy and merge operations
Usability          Ease of training           Documentation completion                 100%
Interoperability   Backward compatibility,    Stress test completion                   12,000 node-hours
                   Transparent recovery       Unresolved high-severity defects         None
Maintainability    Self-descriptiveness,      Source modules restructured              100%
                   Consistency
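Table 1's criteria lend themselves to a mechanical readiness check. The following sketch is ours (Python); the metric names and the numeric readings of "near 0" and "none" are illustrative assumptions, not part of the original process.

```python
# Minimal sketch of threshold-based release criteria: each metric
# carries a predicate over its measured value, and release readiness
# is the conjunction of all of them.

CRITERIA = {
    "profile test completion (%)":         lambda v: v >= 100,
    "defects detected":                    lambda v: v >= 240,
    "incoming problem reports per week":   lambda v: v <= 1,   # "near 0"
    "unresolved high-severity defects":    lambda v: v == 0,   # "none"
    "code ported (%)":                     lambda v: v >= 100,
    "code change rate per week":           lambda v: v == 0,
    "stress test completion (node-hours)": lambda v: v >= 12_000,
}

def ready(measured):
    return all(check(measured[name]) for name, check in CRITERIA.items())

print(ready({
    "profile test completion (%)": 100,
    "defects detected": 240,
    "incoming problem reports per week": 0,
    "unresolved high-severity defects": 0,
    "code ported (%)": 100,
    "code change rate per week": 0,
    "stress test completion (node-hours)": 12_000,
}))  # -> True
```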


Defining release criteria based on threshold values provided a clear standard for judging release readiness independent of the project schedule. These criteria spanned the development cycle in order to provide a basis for verifying progress at each stage of the project. The emphasis of most metrics for these criteria was on containing and removing defects. Other metrics were selected to corroborate that high defect detection equated to high product quality.

[Figure 4: Projections of Code Changed by Month. Incremental and cumulative NCSS changed per month, October 1992 through June 1993.]

Tracking Defects

Projecting defect detection levels for all stages in the development cycle was a departure from the traditional practice of our development engineers. Previously, only the test engineers within OpenVMS Engineering established a defect goal prior to beginning their project work. Extending the scope of this goal for shadowing resulted in a paradigm shift that permeated the entire team's thinking and encouraged each member to aggressively look for defects. Since all team members were more committed to meeting or exceeding the defect goal, they were eager to provide detailed information on the circumstances surrounding defect detection and removal. This detail is often lost in a traditional development project.

Because data on code modification and defect removal during an OpenVMS port was not readily available, we derived our projections as follows.

1. We determined how many lines of code would be modified during the shadowing port. This estimate was based on a comparison between the ported and original sources for another OpenVMS component of similar complexity. The resulting estimate for shadowing changes was 2,500 noncomment source statements (NCSS) out of a total of roughly 19,400.

2. We projected the rate at which these modifications would occur for shadowing. We based this projection on the actual time spent porting, inspecting, and debugging the code of a second OpenVMS component of similar complexity. We revised it upon reaching the first major porting milestone to reflect our actual performance. The revised projection is shown by month in Figure 4.

3. Using the results from our first milestone, we estimated that 250 defects would be introduced as a result of the complete port. This estimate included not only defects introduced through code modifications but also defects induced in existing code by these modifications.

4. We projected the schedule of defect detection for each stage of the development cycle. This projection assumed that defects were distributed uniformly throughout the code. Based again on the results of our first porting milestone, we estimated that our efficiency at removing the defects in this release (or defect yield) would be roughly 60 percent through inspections and 25 percent through module testing. We assumed that an additional 10 percent of the defects would be removed during profile and stress testing. A 95 percent overall yield for the release is consistent with the historic data shown in Figure 2. It is also consistent with the highest levels of defect removal efficiency observed in the industry where formal code inspections, quality assurance, and formal testing are practiced.8 Figure 5 shows our projections for removing 240 defects (95 percent yield of our estimate of 250 defects) by both month and method.

Tracking defects was difficult within our larger organization because the problem reporting system used by OpenVMS Engineering did not distinguish between defects, problem reports, and general communication with test sites. To work around this shortcoming, we created a project database to track defects using off-the-shelf personal computer (PC) software to link defects to problem reports.
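Under the stated assumptions, the projection reduces to simple arithmetic, sketched below (our Python restatement of steps 1 through 4):

```python
# The defect projection above, reduced to arithmetic (our sketch).

ncss_changed = 2_500                   # step 1: estimated NCSS modified
defects_injected = 250                 # step 3: estimate from milestone

yields = {"inspection": 0.60,          # step 4: assumed removal yields
          "module test": 0.25,
          "profile/stress test": 0.10}

by_method = {m: defects_injected * y for m, y in yields.items()}
print(by_method)               # {'inspection': 150.0, 'module test': 62.5,
                               #  'profile/stress test': 25.0}
print(sum(by_method.values())) # 237.5, consistent with the ~240-defect
                               # goal (95 percent overall yield)
```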

2. We projected the rate at which these modifica­ Tracking defects was difficult within our larger tions wou ld occur fo r shadowing. We based this organization because the problem reporting system projection on the actual time spent porting, used by OpenVMS Engineering did not distinguish inspecting, anc.l debugging the code of a second between defects, problem reports, and general OpenVMS component of similar complexity. We communication with test sites. To work around this revised it upon reaching the first major porting shortcoming, we created a project database to milestone to reflect our actual performance. The track defects using off-the-shelf personal computer revised projection is shown by month in figure 4. (PC:)software to link defe cts to problem reports.

40 Vo l. o No. I Winter 1994 Digital 1ecbnicaf jo urnal Improving Process to Increase Productivity While Assuring Quality

70 350 on content coupling.10 Modules frequently branched (/) 60 300 between one another with one module using data 0 �u w 250 or control information maintained in another mod­ lL 50 � LU w_ ule. These modules also exhibited low cohesion � (i) 40 200 � � <{CI: a: with subroutines grouped somewhat by logical 30 150 � !z ca => . 10 w- <{� ordering but primarily by convenience :2 20 100 5 LU Structured in this fas hion, the shadowing mod­ 5 10 50 3 u ules were not only more difficult to maintain but � 0 also less separable into independent porting tasks. DEC JAN FEB MAR APR MAY JUN 1992 1993 Moreover, diffe rences between the VA X and the KEY: Alpha AXP architectures, together with the trans­ iZ) STRESS TESTING D MODULE TESTING fo rmation of the VA X assembly language from • PROFILE TESTING D INSPECTION a machine assembly language to a compiled lan­ guage, precluded the continued use of content cou­ pling in the ported code. Figure 5 Projected Defect Detection by Month and Me thod To remedy these structural problems, we parti­ tioned the shadowing code into functional pieces that could be ported, inspected, and module tested Reworking the Develop ment separately before being rein tegrated fo r profile and Framework stress testing. Figure 6 shows both the original and Only the development project leader satisfied all the reworked relationships between shadowing's the requirements fo r executing a rapid port of the source modules and its fu nctions. During restruc­ shadowing code: turing, we emphasized not only greater fu nctional cohesion within the modules but also improved • Expertise in shadowing, VMScluster systems, OpenVMS drivers, the VA X assembly language, coupling based primarily on global clara areas and the AMACRO compiler (which compiles VA X 110 interfaces. As a consequence, most shadowing assembly language fo r execution on Alpha AXP functions were directly dependent on only one systems), and the Alpha AXP architecture9 other fu nction: mounting a single member. Once the port of this function was complete, all the Experience porting OpenVMS code from the • others could be largely ported in parallel. Where VA X to the Alpha AXP platform9 a particular module was used by more than one function, we coordinated our porting work using • Fa miliarity with the defect history of shadowing and the fixes that were being concurrently a scheme fo r marking the code to ind icate portions applied to the VA X shadowing code that had not been ported or tested.

• Av ailability throughout the duration of the proj­ ect to work on porting tasks at the same time Inspecting Changes and the same place We believed inspections would serve we ll as a tool for containing our porting defects. Industry litera­ To compensate fo r the lack of these capabilities ture is replete with data on the effectiveness of across all team members and to improve our ability inspection as wel l as guidel ines fo r its use 8. II.I2 to efficiently port the code while minimizing the Since our code was now structured fo r parallel number of defects introduced, we reworked our porting activities, however, the project also needed development framework in two ways. First, we a framework for restructured the modules to improve their portabil­

ity ancl maintainability. Second, we inspected all • Integrating engineers into the project porting changes and most fixes to assure unifo rm Coordinating overlapping tasks quality across both the project and the code . •

• Col laborating on technical problems Restructuring Modules • Sharing technical expertise and insights Examining the interconnections between the shad­

owing modules in the OpenVMS VAX software • Assuring that the engineers who would maintain revealed a high degree of interdependence based shadowing understood all porting changes


[Figure 6: Code Restructuring by Shadowing Function. Original module structure, functional partitions and dependencies, and reworked module structure (modules such as SHD_LOCK and SHDMAC).]
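The payoff of the restructuring shown in Figure 6 can be illustrated with a small dependency computation. This is our sketch, with hypothetical function names: once the mount-single-member function is ported, everything else is unblocked and can proceed in parallel.

```python
# Sketch of the reworked dependency structure described above: most
# shadowing functions depend only on mounting a single member.
# (Function names are illustrative, not the actual module names.)

DEPENDS_ON = {
    "mount_single_member": [],
    "add_member_copy":     ["mount_single_member"],
    "merge":               ["mount_single_member"],
    "dismount":            ["mount_single_member"],
    "error_recovery":      ["mount_single_member"],
}

def port_order(deps):
    """Group functions into waves; each wave can be ported in parallel."""
    done, order = set(), []
    while len(done) < len(deps):
        ready = [f for f, d in deps.items()
                 if f not in done and all(x in done for x in d)]
        order.append(ready)
        done.update(ready)
    return order

print(port_order(DEPENDS_ON))
# [['mount_single_member'],
#  ['add_member_copy', 'merge', 'dismount', 'error_recovery']]
```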

We believed that group inspections could provide this framework.

Tailoring the Inspection Process

Our inspections differed from the typical processes used for new code.12 Only the changes required to enable the VAX code to execute correctly on the Alpha AXP platform were made during the port. These changes were scattered throughout the sources, primarily at subroutine entry points. With our time and engineering resources quite constrained, we chose to inspect only these changes and not the entire code base. Because the project involved the port of an existing, stable product, no new functionality was being introduced and therefore no functional or design specifications were available. Instead, inspections were made using the VAX code sources as a reference document.

Integrating Engineering Resources

Prior experience in OpenVMS Engineering indicated that engineers working on unfamiliar software could be very productive if they worked from a detailed specification and used inspections. Using the VAX sources as a "specification" for the shadowing port provided such a focus. Inspecting only code changes alleviated the need for team members to understand how a particular shadowing function worked in its entirety. Simply verifying that the algorithm in the ported sources was the same as that in the original VAX sources was sufficient.

Inspections of ported code preceded both module testing and the integration of the porting changes into the overall shadowing code base. This assured that any differences in coding standards or conventions were harmonized and any misunderstanding of code operation was corrected before the code underwent module testing. Inspections occurred before the engineers who performed the port returned to their original duties within OpenVMS Engineering. By participating in these inspections, the engineers who would maintain the ported code understood exactly what was changed, in case additional debug work was needed.

Sharing Experience and Expertise

The inspection process provided a forum for team members to share technical tips, folklore, background, and experience. Having such a forum enabled the entire team to leverage the diverse technical expertise of its individual members. The resulting technical synergy increased the capacity of the team to execute the porting work. It also led to rapid cross-training between team members so that everyone's technical skills increased during the course of the project. This teamwork and increased productivity led to more enjoyable work for the engineers involved. In retrospect, the use of inspections proved the greatest single factor in enabling the project to meet its aggressive delivery schedule.

Redirecting the Testing

To detect both new and existing defects, shadowing has historically undergone limited functional testing followed by extensive stress testing.


The effectiveness of this testing has been constrained, however, because it:

• Provided no measurement of actual code coverage

• Lacked an automatic means for forcing the execution of error paths within shadowing

• Failed to target the scenarios in which most shadowing failures occurred

To compensate for these shortcomings and improve our ability to efficiently detect defects, we formulated profile testing: a method of risk-directed testing that would follow module testing and precede large-scale stress testing.

Defining Profile Testing

Profile testing focuses on operating scenarios that pose the greatest risk to a software product. Engineering experience clearly indicated that the highest-risk operating scenarios for shadowing involved error handling during error recovery. Examples of such scenarios include media failure after a node failure or the unavailability of system memory while handling media failure. Problems with such error handling have typically occurred only in large and complex VMScluster systems. Test profiles are simple, clearly defined loads and configurations designed to simulate the complex error scenarios and large configurations traditionally needed to detect shadowing defects.

Fundamentally, profile testing is the application of the principles of experimental design to the challenge of "searching" for defects in a large test domain. Deriving profile tests begins with a careful identification of operating conditions in which the product has the greatest risk of failing. These conditions are then reduced to a set of hardware and software variables (or "factors") and a range of values for these factors. A test profile is the unique combination of factor values used in a test run.

Combining a large set of test factors and factor values can result in unmanageable complexity. For this reason, profile testing uses orthogonal arrays to select factor combinations for testing. These arrays guarantee uniform coverage of the target test domain described by the test factors. Instead of selecting tests based on an engineer's ingenuity, we used these arrays to systematically select a subset of all possible factor combinations. As a result, we uncovered nonobvious situations that customers often encounter. By relying on combinatorics rather than randomness to detect defects, the event sequences leading to a defect can be more readily reproduced.

The following sections show how we used this approach to design and implement profile testing for shadowing. They also reveal that the cost-effectiveness of our testing improved significantly as a result.

Describing the Test Domain

Both AT&T and Hewlett-Packard have used operational profiles to describe the test domain of complex software systems.13 Such operational profiles were typically derived by monitoring the customer usage of the system and then methodically reducing this usage to a set of frequencies for the occurrence of various system functions. These frequencies are then used to prioritize system testing. For the purposes of validating shadowing, we extended this notion of an operational profile by decomposing the test domain into four distinct dimensions that characterize complex software systems:

• System configuration

• Software resources

• Operational sequences

• Error events

The emphasis of our test profiles was not on how the system was likely to operate, but rather on how it was likely to fail. For our project, rapid characterization of the test domain was of greater importance than precise reproduction of it.

Identifying the Test Factors

Assessment of shadowing's test domain identified the factors that characterized its high-risk operating scenarios. This assessment was based on a review by both test and development engineers of the product's functional complexity, defect history, and code structure as characterized by its cyclomatic complexity.14 The resulting factors provided the basis for formulating our test profiles.

System Configuration

The following key factors in system configuration describe how a shadow set is formed and accessed across the range of components and interconnects that VMScluster systems support.6
Digital Te chuical jounutf Vo l. No. 6 I Wi nter JI)

(M BCNT) • Number of shadow set members EM • Controller fa ilure (CTRLERR)

MSCP (DTSKERR) • Device serving (MSCPSERV) • Disk fa ilure

(SEPCTRL) Media failure (MEDIA ERR) • Control ler sharing •

(SERVSYS) • Emulated versus local disk controller Ma naging Te st Complexity \Vhen the ist of key factors fo r targeting shadow­ (LOADSYS) I • Alpha AXP or VA X 1/0 load initiator ing's high-risk operating scenarios was enumer­

• Location of the storage control block (SCBI3EG) ated, the resu lting test domain was unmanageably complex. If just two test values for each of 25 fac­ Size imits fo r transfers (DIFMFlYT) • I 1/0 tors are assumed, the set of all possible combina­ (DIFCTivt O) tions was or more than 33 million test cases. • Controller time-out values 21, To reduce this combinatorial complexity to a (SYSDISK) • System disk shadow set manageable level, we structured profiles in two test

(DISKTYPE) dimensions, error event and system configuration, • Disk device type using orthogonal arrays. 1'- 1° Columns in these orthogonal arrays describe the test factors, and Software Resources Although running an applica­ rows describe a balanced and orthogonal fraction tion in Open VMS can involve competing fo r a wide of the set of fa ctor combinations. Because of range of finite system and process resources, only fuII their balance and orthogonality across the test fac­ two software resources initially appeared signifi­ tors, such arrays provide a uniform coverage of the cant for targeting error hand ling during error test domain. recovery within the shadowing product: Figure 7 is a composite table of error-event pro­

• System memory used fo r 1/0 operations files created using a D-optimal array and shadow set configuration profi les created using a standard • VMScluster communication resource (send . F ts credits) array These two arrays fo rmed the base of our profile test design. Operational sequences were Op era tional Sequences Shadow set membership applied (as described below) to 18 test profiles is controlled by the manager of an OpenVMS sys­ fo rmed by relating the error event and shadow set tem.1 The manager initially fo rms the shadow set arrays as shown in Figure 7. These profiles were and then adds or removes members as needed. arranged into groups of three; each group included Applications use these shadow sets fo r file creation two types of disks, a shadowed system disk, and the and access. During its use, a shadow set can require presence of each type of disk error. The test values copy and merge operations to maintain data consis­ assigned to the factors SYSDISK and DTSKTYPE as a tency and correctness across its members. Profiles result of this grouping are shown in the two that target these activities involve sequences of the columns positioned between the arrays in Figure 7. fo llowing key operations. Note that physically configuring the test environ­ ment prevented us, in some instances, from using • Merge, assisted merge, copy, ami assisted copy the prescribed assignment of factor values. As a

• Member mounts and dismounts consequence, only those fa ctors whose columns

are shaded in gray in Figure 7 retained their balance • File creation and deletion and orthogonality during implementation of the

• Random reads and writes; repeated reads of a test design. "hot" block on the disk At this point, profile test design is complete. The use of orthogonal arrays al lowed us to reduce the Error Events AJIcomplex software systems must tests to a manageable number and at the same time deal with error events that destabilize its operation. have unifo rm coverage of all test factors. For the shadowing product, however, rei iably hand ling the fo llowing set of errors represents the essence of its value to Open VMS customers. Co nfig uring the Te st Environment In addition to the shadow set profiles, the fo llowing Removal of a VMScluster node (NODEERR) • practical constraints guided the configuration of a (PROCERR) • Process cancellation VMScluster system fo r conducting our profile testing.

44 Vo l. 6 No. I If/inter 1994 Digital Te chnical ]ourllal Improving Process to Increase Productivity WhileAssu ring Quality

> a: a: w t- a: (f) cc a: _J (f) (f) w a: a: a: a: z w t- 0 a: w � ()._ a: >- >- (C) >- _J w w a: (f) >- u (f) t- (f) (f) w ::2' w w ..: t- Cil ()._ Cil t- u:: w u _J 0 u > 0 m 0 0 0 a: � 0 (f) � ::2' u ()._ a: Cil ::2' u (f) (f) ..: a: 0 a: t- w >- w (f) w w 0 u u.. u.. ()._ z ()._ u 0 ::2' (f) 0 ::2' ::2' (f) (f) _J (f) 0 0

1 NO NO NO FATAL NO NO RZ 2 SOME NONE AXP AXP ALL NONE NONE 2 NO NO YES MNTVER YES NO RF 2 SOME ALL BOTH AXP ALL ALL ALL NO YES NO NO NO YES RF NONE ALL BOTH AXP NONE NONE NONE ' 2 �4 ' YES NO NO NO YES NO RZ 2 ALL ALL AXP AXP ALL NONE NONE 5 NO NO YES MNTVER NO YES RF 2 ALL SOME BOTH AXP NONE NONE NONE NO YES YES FATAL YES NO RZ 2 ALL NONE VAX AXP NONE NONE NONE +7 NO YES NO MNTVER YES NO RA 2 NONE NONE BOTH VAX NONE NONE NONE 8 YES NO NO FATAL NO NO RZ 2 NONE ALL BOTH BOTH ALL NONE NONE 9 NO YES YES NO NO YES RZ 2 ALL NONE AXP AXP NONE NONE NONE 10 YES YES NO MNTVER YES NO RZ 3 NONE SOME VAX VAX NONE NONE NONE 11 NO NO NO FATAL NO YES RA 3 NONE NONE AXP AXP ALL NONE NONE YES YES YES NO NO NO RA I 3 NONE SOME AXP BOTH NONE NONE SOME J4-13 YES YES NO MNTVER NO NO RZ 3 ALL ALL BOTH VAX NON E SOME SOME 14 YES YES YES FATAL YES YES RA 3 ALL NONE VAX AXP NONE NONE NONE YES NO YES NO YES NO RZ 3 ALL NONE BOTH AXP ALL NONE NONE �16 YES NO YES MNTVER NO NO RZ 3 SOME ALL BOTH VAX ALL SOME SOME 17 NO NO NO NO YES YES RF 3 SOME ALL AXP AXP NONE SOME SOME 18 YES YES YES FATAL YES NO RZ 3 SOME SOME BOTH BOTH ALL SOME SOME

1 ERROR EVENT ARRAY L 18 (24 x 3 ) SHADOW SET CONFIGURATION ARRAY L 18 (21 x 37)

Figure 7 Set of Composite Te st Profi les

• Minimize hardware requirements and maximize Ta ble 2 VMScluster Configuration Size ease of test execution by Te st Method

• Configure profiles from the shadow set array in Profile Large-scale groups of three to expedite test execution Testing Stress Testing

• Reflect both anticipated customer usage of shad­ Systems 5AXP 10 AXP owing and historic usage as characterized in 4VAX 12 VAX existing surveys of VA Xcluster sites in the United Interconnects 1 Cl 2CI States 1 Ethernet 9 Ethernet 1 FDDI • Enable the fo rmation of either one integrated or Cl Storage 1 HSC 6 HSC two separate VMScluster systems based on either Controllers 1 HSJ 1 HSJ the DSSI or the CI system interconnect Test Disks 31 165 Shadow Sets 9 two-member 23 two-member • Require no physical reconfiguration during 9 three-member 5 three-member testing

• Maintain a consistent batch/print and user authorization environment The resulting test configuration fo r our profile testing was fo rmally described in a configuration • Follow the configuration guidelines set fo rth in diagram, which appears in a simpl ified fo rm in Digital 's software product descriptions fo r Figure 8. During our testing, each profile shown OpenVMS, VMScluster systems, and volume shad­ in Figure 7 was uniquely marked on this diagram owing for Alpha AXP and VA X systems to show both the disks comprising each shadow Constructing a test configuration that reflected set and the nodes to load them. Ta ken together, Fig­ al l these constraints and supported the shadow set ures 7 and 8 proved quire useful in transferring the profiles in Figure 7 was quite a challenge. As a task of executing a particular test profile from one result of having clear configuration guidelines, test engineer to another. In addition, development however, we could re-create shadowing's high-risk engineers fo und them to be invaluable tools fo r operating scenarios using substantially less bard­ clarifying the fault loads and the configuration of ware than required fo r large-scale stress testing. a particular test profile. Ta ble 2 contrasts the hardware requirements of the To illustrate how we used these two tools, con­ two test approaches. sider the first three test profiles shown in Figure 7.

Digital Te chllicaljour11al vbl. 6 No. I Winter 1994 45 OpenVMS AXP System Software

/ V1 .5 I DSA1 LOAD DSSI r-- DSSI .l .J. VAX DEC RF73 V5.5-2 RF73 4000-300 4000-610 DIA200 DSSI DSSI WEBEYU E E E DIA261 M B H DSA3 RF73 RF72 � DSSI DIA13 DIA202 , � DSA2 DSSI RF73 RF73 DIA251 DIA201 I DEC � 4000-610 DSSI DSA3 r-- HEBE ME DSA3 LOAD SCSI DSA2 LOAD DSSI I RF72 .l. .l. .l. DIA301 ,:::: � :::u::: ::::! DSA2 RZ26 RZ26 RZ26 DKA400 l DKA500 l DKA600j ....__ DSA1 DSA1 J DSSI RF73 DIA300 l l VAXSERVER DEC '------:::::J_ '-- ,:::: � ,:::: � SCSI 3100-60 I- 3000-500 SCSI RZ26 RZ26 MABEYU MABEME RZ26 DSSI TEST DKBO DKB100 l DKAO SEGMENT ------J ------J CI TEST r::::::- VAXSERVER DEC � SEGMENT - ,:::: � SCSI 31 00-60 f- 2000-300S SCSI ,:;:: � RZ25 YUTUSU SUMETU RZ25 DKB500j DKAO l J F: V5.5-2 � � � ,::: � ,:::: ::I ,:::: ::I RA92 RA92 RA92 RA92 RA92 l DUA102 DUA104j DUA106j l DUA101 l DUA105 T T J

,:::: � ,::: � KDM70 KDM70 � � � RZ25 RZ25 VAX DEC � l DUA166 DUA163 - - RA92 RA92 7000-61 0 f- 7000-610 DUTUME DUA1 09 DUA1 10 J T J DUTUYU L SCSI T HSJ40 � � HSC90 TUTU ET E NET I RA92 H R TOFU l DUA1 15 LULU t=J � I � I j_ J l �� ��V1 5-1 H0 H V6.0 ,:::: � ,:::: ::I � ,:::: ::I ,::: � RZ25 RZ26 RZ26 RA92 sRA92 RZ25 RZ25 DKB300 l DKB200 DKBO j l j DUA112 DUA1 11 DUA170 l DUA174 L L J J Cl

Fig ure 8 To tal Sy stem Co nfiguration fo r Profile Te sting

These profiles, respectively, define the configura­ trol blocks located at their first logical block. As tion and loading of shadow sets DSA 1, DSA2, and a result of their physical configuration, both must DSA3. Running these three profiles in parallel have the same size limit on I/O transfers and the would be difficult to describe precisely without same controller time-out value. Finally, this profile these two tools. As an example, profile 1 in Figure 7 indicates that another AXP node in the VMScluster indicates that DSAl must comprise two RZ devices system must load DSA 1 and that this load must that are not used as system disks. At least one of involve the simulation of fatal disk errors. Devices these devices must be MSCP served. The two disks DKA400 and DKA'iOO in Figure 8 satisfied these must share a controller and must be accessed via requirements; the load was to be applied from node an AXP node. Both must have their storage con- MEBEHE.

46 Vo l. (j No. I Winter 19'I- DigitalTe chnical jo urnal Improving Process to In crease Productivity W'hileAssuri ng Quality

The complete configuration fo r these three pro­ loads involved could be readily reproduced to accel­ files is denoted by the gray box in Figure 8, which erate the process of diagnosing the underlying defect requires only a small subset of the total test config­ and verifyinga proposed fix. Faulty To wers also pro­ uration to execute vided a means of easily tailoring, controlling, and monitoring the automatic insertion of fa ults during Two AXP systems (MEBEHE and HEBEME) • our profile testing. The result was better coverage of (WEflEYU) error paths using a much simpler test environment. • One VA X system

• One DSSI interconnect Staging Te st bnp lementation To stage the intro­ 4 duction of profile complexity, we gradually • One SCSI and DSSl controllers increased the number of error events applied to Two RZ26 disks (DKA400 and DKA500) • successive groupings of test profiles. We began our testing with a simple base profile to which further • Two RF72 disks (DIA301 and DlA202) load complexity could be progressively added . This Two Rf73 disks (DlA200 and DlA201) • base profile involved only three two-member shadow sets with just one of the targeted error Ex ecuting Te st Profi les events occurring during each run. System load was Te sting To ols Executing the profi le tests involved limited to reads and writes across the shadow sets. the use of four Digital internal test tools (XQPXR, No system disks were shadowed in this base profile. lOX, C'TM, Faulty To wers) and two OpenVMS Dming the initial execution of the base pro­ utilities (BACKUP and MONITOR). XQPXR and lOX file, we tested resource exhaustion. With each both provided read and/or write loads to shadow subsequent round of testing, we systematical ly sets with XQPXR utilizing the file system for its 1/0. incorporated additional complexity: more test con­ CTM proviclecl a means of load ing multiple subsys­ figurations, three-member shadow sets, shadowed tems across the VMScluster system. Faulty To wers system disks, complex error profiles, and system­ was used to inject faults into the VMScluster System wide loading. Communication Architecture (SCA) protocol tower during loading to create the error profiles shown in Op era tional Sequence fo r Profile Te st Execution Figure 7. MONITOR measured the loads applied dur­ Another important aspect of the profile testing was ing profile testing. BACKUP was used to verify that the use of a prescribed operational sequence dur­ the data on shadow set members was consistent fo l­ ing profile test execution. This sequence is shown lowing a test run. in Figure 9. Of all the test tools we used, Faulty To wers was Profile test runs began with the mounting of a both the most critical to our success and the most single shadow set member. The addition of a sec­ innovative in simulating large-scale VMScluster envi­ ond or third member caused the initiation of a copy ronments. Historically, large-scale stress testing of operation from the existing member to the added shadowing has depended largely on the occurrence device(s). The removal of a VMScluster node that of random events or manual intervention to exer­ had a shadow set mounted would cause shadowing cise shadowing error paths. Because SCA underlies to initiate a merge operation on the shadow set. To all communication within a VMScluster system, maintain consistency across our test runs, we Faulty To wers could instead automatically force the would manually add back into a shadow set any exercise of these paths by simulating errors within member(s) that were expelled due to the node the system. The set of faults that Faulty To wers pro­ removal. At this point, shadowing is expected to vided came from our examination of how VMScluster progress sequentially through copy and merge systems, especially large- scale systems, fail. This operations to create a fu lly consistent shadow set.

set included fo rcing esoteric states throughout The 1/0 required by these copy and merge opera­ the ViVIScluster system, simulating device errors, tions fo rmed the base load on the system. User 1/0 exhausting essential resources, breaking VMScluster to the shadow sets incremented the effective load

communication channels, and creating excessive on the system as did the disruption of 110 due to 1/0 or locking loads. error events. During the period when user 1/0 and The fau lt loads that Faulty To wers provided were error events were sustained at their heaviest levels,

predictable and quite repeatable. When problems 110 for copy operations could stall entirely. Wind­ occurred during test execution, the precise fault ing down error insertion and user 110 enabled copy

Digital Teclmicaljottr11al Vo l. 6 No. I Winter 1994 47 OpenVMS AXP System Software

DURATION Improving Process OF SUSTAINED -­ COMPOSITE Inspections aJ1(1 profile testing were our two key LOAD process enhancements. Data that tracked defect detection and product stabilization during these steps underscores their contribution.

Tr acking Defect Detection The solid line in Figure 10 shows defect detection by week throughout the life of the project. Note that the solid line in the fig­ ure tracks very closely and, at times, overlaps the dashed line used to indicate the inspection time.

USER 1/0 Only high severity defects that resulted in a code change are represented by these defect counts. The time period before January 4 involved neither inspections nor testing; the time period after June 14 involved only testing. During March, porting COPY/ COPY/ MERGE work stopped due to team members being tem­ ' MERGE' porarily reassigned to critical development tasks in support of Open VMS AXP version 1.5. Allowing fo r

z 0 (f) rr:::J TEST EXECUTION that gap in March, the trend in defect detection o w 1-t­ 0<.9 ::::0 0 W TIME Cfl z from both inspections and testing exhibited a rr:w o z w rr:m z w W

48 Vo l 6 No. 1 Winter 1994 Digital Te ch1licaljour11al Improving Process to Ino·ease Productivity WhileAssuring Quality

140

120 t I�I I I 100 I I I I I I I - >- z z ...J f- u u z z a: a: a: ...J (!) (!) (') wa.. wa.. uf- uf- > > w w , N N cO c:., , N Na, c:.,

WEEKS KEY:

INSPECTION TIME (HOURS) DEFECTS DETECTED

Figure 10 Inspection Time and Defect Detection

12

10

8 I" \ I I (J) f-z \ 6 I \ 0::J I \ u I \ I \ I \ I I 4 I' I .... I TEST .. TEST I I ERROR INEFFECTIVE I 2 , , ...... , , _ _ _ ,

0 >- z z z z ...J ...J ...J ...J (!) (!) (!) (!) (!) a. a. a. a. f- f- , C'l6 ill c>, 6N N cO N ---..-- � PRODUCT PRODUCT STABILIZATION STABILIZATION

WEEKS KEY:

PROBLEM REPORTS/D EFECT DETECTED NODE-DAYS/PROBLEM REPORT

Figure 11 /vletricsfor Evaluating Te st EJ!xtiven( ess and Product Stabilization

Vo l. 6 No. J Winter Digital Te chllicul journal 1994 49 OpenVMS AXP System Software

VMScluster system. In late September, high node­ defect remova l during the project exceeded projec­ days/problem report with no defects indicates the tions by 8 percent. By selecting our defe ct goal execution of an ineffective test profile. Near-zero based on industry-standard defect densities and curves in early July and early September indicate exceeding this goal, we have assured quality in our when vacations were interrupting our testing. product. Of the total defects actually removed, 232 During the two product stabilization periods were in the shadowing code. This suggests a defect indicated in Figure 11, the increase in node-days/ removal rate of 12 defects per 1,000 NCSS fo r the problem report accompanied by a near one-to-one shadowing code base, which is consistent with ratio between problem reports and defects incH­ industry data reported by Schulmeyer. 19 cates that the product was stabilizing in spite of sus­ Of the 176 defects removed through inspections, tained test effectiveness. The first period preceded 43 were fo und in the nonportecl shadowing code. beta test; the second preceded product release. This suggests a defe ct removal rate fo r the ported code of 53 per 1,000 NCSS, which again fa lls within Assuring Quality the range reported by Schulmeyer. The combina­ Containing and removing defects were at the core tion of testing and inspections resulted in the of our release criteria. As Figure 12 shows, actual removal of 59 defe cts from the unmodified shadow­ ing code base. This represents a net reduction of 3.5

291 defe cts per 1,000 NCSS within the unmodified shad­ 5 6 owing code as a result of the port' 19 259 The significance of this level of defect removal as 2 an indicator of high quality was corroborated in 16 2 240 19 two ways. First, all release criteria were satisfied 5 25 9 prior to releasing the ported shadowing product. 20 7 Second, the character of defects detected during the testing phase of the project changed. Whereas 44 44 60 most defects detected in early test runs were intro­ duced during the port, virtually all defects detected in later test runs were residual in the underlying code base. Again, removing this latter type of defe ct meant better overall quality for customers running mixed-architecture VMScluster systems. Another aspect of quality was identifying those

176 176 detects that should not be fL-xed but only contained. 155 Because this project had very limited scope, dura­ tion, and resources, we had to carefu lly evaluate changes that could destabilize the entire prod­ uct and jeopardize its overall quality. The compari­ son of detected to removed defects in Figure 12 shows that several problems fel l into this category. Many of these defects were triggered by system PROJECTED DETECTED REMOVED DEFECTS DEFECTS DEFECTS operations that exceeded the fu ndamental design limitations of the product. Some occurred in KEY: obscure error paths that cou ld only be exercised - CODE INSPECTION D MODULE TESTING with our new test methods. Others were due to 111 ,1CODE SCRUTINY D PROFILE TESTING new combinations of hardware possible in mixed­ D STRESS TESTING D ALPHA TESTING architecture VMScluster systems. In each of these instances, we assured that the scope of the defect ITJJ BETA TESTING was limited, its frequency low, and its impact pre­ dictable. When necessary, we constrained sup­ Figure 12 Comparison of Projected and ported hardware configurations or system behavior Actual Defect Detection so that the defect could not cause unrecoverable and Removal byMe thod failures. Finally, we feel back our analyses of these

50 Vo l. 6 No. I Winter 1994 Digital Te chnical jo urnal bnprouing Process to Increase Productivity WhileAssuri ng Quality

defects into the ongoing shallowing support effort matically reduced both the cost and time required so that they could be removed in future releases fo r defect removal. through appropriate redesign. Defe ct Yi eld Assuming a 95 percent overa ll defect Increasing Productivity yield fo r shadowing prior to release, the relative The fo llowing data shows how our enhanced pro­ yield of inspections during this project was 65 per­ cess contributed to the quality. This data shows the cent. When calculated against defects fo und only relative cost-effe ctiveness, and hence improved in shadowing, the yield fo r inspections jumps to productivity, achieved through inspections and 75 percent-a very high removal efficiency when profile testing. compared with industry data g Relative defectyield was consistent with industry data fo r module test­ Engineering Costs The proportion of total engi­ ing at 16 percent, but low for profile and stress test­ neering hours expended during each step of the ing at a combined value of 6 percent. Given the project is depicted in Figure 13. This figure indicates high engineering cost shown above fo r removing that only 15 percent of the hours (inspections and defects during integration test of shadowing, this is module testing) resulted in the removal of 85 per­ in fact quite a favorable result. cent of all defects. Inspections alone accounted Te st Cost-eff ectiveness As Figure 13 indicates, test­ fo r only 5 percent of the engineering hours but ing remained the most 68 percent of the defect removal' costly portion of the project. roughly The actual cost fo r removing defects by inspec­ Executing and debugging problems from 86,000 node-hours of stress, performance, and pro­ tion averaged 1.7 engineer hours per defect. During 69 module testing, when engineers worked individu­ file testing accounted fo r percent of the project's ally to debug the fu nctions they had ported, the total engineering hours. In fact, the ratio of engineer cost of defect removal jumped to 15 engineer hours hours expended fo r test versus implementation 1.8 per defect. During integration testing, where the was roughly times higher than Grady reports fo r 48 entire shadowing driver was tested in a complex projects involving systems software 20 Given the complexity of the Vi\'IScluster systems historically environment, an average of 85 engineer hours was spent per defect exclusive of the time spent to exe­ required to test shadowing, this ratio is no surprise. cute the tests. Clearly, removing the bulk of the Nevertheless, our project's results indicate that defects from the ported code prior to beginning profile testing was significantly more cost-effective testing of the integrated shadowing product dra- than large-scale stress testing fo r detecting defects. Profile testing's overall ratio of test engineer clays per defe ct was 25 percent better at 6.2 days than ACCEPTANCE TESTING stress testing's 8.3 days. Moreover, profile testing's 6% overall ratio of machine test time per defect was STRESS more than an order of magnitude better at 74 node­ TESTING PROFILE 24% days than stress testing's 95.2 node-days! TESTING 19% This improvement in cost-effectiveness was achieved with no loss in defect removal capability. When compared with large-scale stress testing, pro­ 1------J�=:::;::::::====-iLPL ANNING AND file testing of shadowing proved equally effective TRACKING overall at detecting defects. During the beta test 4°/o period fo r shadowing, each of these test methods

PROBLEM PORTING accounted fo r roughly 20 percent of the defects REPORT 12% DEBUGGING detected when allowing fo r duplication between 20% methods. The number of problem reports per INSPECTION defect was also comparable with ratios of 2.4 for 5o/o MODULE profile testing and 2.0 fo r stress testing. TESTING 10% Figure 14 contrasts the cost-effe ctiveness of each method of detect detection employed during the val­ idation of shadowing. Note that because this chart Figure 13 Distribution of To tal Engineering uses a log scale, any noticeable difference between Ho urs byProcess Step bar heights is quite significant. This chart bears out

Digital Te chnical jo urnal Vo l. 6 No. I Winter /'}')4 51 Open VMS AXP System Software

10000 managed to achieve the defect detection effective­

0 ness of large-scale integration testing at the same <.910 00 0 relative cost as module testing. 2 The results of our process innovations would not !z 100 ::J have been realized if we had waited for our organi­ 0 0 10 zation's process to evolve up the CMM levels. By changing the engineering process of a small team, we delivered high-quality software on schedule MODULE PROFILE STRESS INSPECTION TESTING TESTING TESTING and at a lower cost. KEY: 0 DEFECTS DETECTED Future Developments 0 ENGINEER TIME (DAYS) As a result of our project's accomplishments, • MACHINE TIME (NODE-DAYS) OpenVMS Engineering is giving serious considera­ tion to our practices. Many groups are acknowledg­ Figure 14 Relative Cost-eff ectiveness of ing the gains possible when fo rmal inspections are Deject Detection Methods used to contain defects. Moreover, the organiza­ tion's problem reporting system is being upgraded conventional wisdom on defect removal: detection to include mechanisms for tracking defects that prior to integration through the use of inspections incorporate many of the fields used in our defect and module testing is by far the most cost-effective. tracking database. It also suggests that profile testing has a cost­ OpenVMS Engineering is also evaluating further effectiveness on the same order of magnitude as use of the profile testing methodology in its testing module testing, while providing the same defect efforts. As Figure 13 indicated, improving the effec­ detection effectiveness as large-scale stress testing. tiveness of our integration testing may offer the most significant opportunity fo r reducing engineer­ Conclusions ing costs and accelerating the schedules of our soft­ ware releases. Since experimental design principles At the outset of the shadowing port, given its were used in the creation of the tests, statistical unique chal lenges, we believed that the existing evaluation of the results is a potentiatly exciting development process for OpenVMS would not opportunity. An analysis of variance that explores enable us to meet the project's goals. By taking the relationship between test factors and defects charge of our engineering proce5s, however, we not could indicate which test factors contribute sig­ only met those goals but also demonstrated that nificantly to defect detection and which do not. clung<:s to the established process could result in This could give us a clear statistical indication of higher productivity from our engineering resources where to direct our testing efforts anc1development and better quality in the delivered product. resources. The volume shadowing port from OpenVMS VA X to OpenVMS AXP was successful in meeting an aggressive schedule and in delivering a stable, high­ Acknowledgments quality product. There were two key process inno­ The success and the final fo rm of the our software vations that led to our success. The first was the use development process were a direct consequence of of inspections to detect and remove a large percent­ the diverse skills and unflagging efforts of the shad­ ag<: of the porting defects before any investment owing team members involved in its implementa­ in testing. By finding a majority of the defects (68 tion: Susan Azibert, Kathryn Begley, Nick Carr, percent) at the lowest possible cost (1.'5 hours per Susan Carr, Mary Ellen Connell, To m Frederick, defect), fewer overall resources were required. 
Paula Fitts, Kimilee Gile, Greg .Jordan, Bruce Kelsey, The second key innovation was the use of profile Maria Leng, Richard Marshall, Kathy Morse, Brian testing to cover a very large and complex test Porter, Mike Starns, Barbara Upham, and Paul We iss. domain. Having a test strategy that used well­ Support from Howard Hayakawa, manager of the defined and repeatable tests to target problem areas VMScluster group, provided the opportunity for us in shadowing code allowed us to efficiently find to initiate such extensive process enhancements in defects and verify fixes. With profile testing, we the context of a critical project fo r OpenV,Vl S.

52 Vo l. 6 No . I Winter 1994 Digital Te chnicaljour11al Improving Process to Increase Productivity While Assuring Quality

References 13. ). Musa, "Operation Profiles in Software Relia­ bility Engineering," IEEE Software (New Yo rk: 1. S. Davis, "Design of VMS Vol ume Shadowing IEEE, March 1993): 14-32. Phase ll-Host-based Shadowing," Digital

Te chnical jo urnal, vo l. 3, no. 3 (S ummer 14. T. McCabe and C. Butler, "Design Complexity 1991): 7-15. Measurement and Te sting," Co mmunications of the ACM, vol . 32, no. 12 (December 1989): 2. Vo lume Shadowing fo r Op enVMS (Maynard, 14I5-1425 MA: Digital Equipment Corporation, Novem­ ber 1993). 15. G. Ta guchi and M. Phadke, "Quality Engineer­ ing Through Design Optimization," Co nfe r­ 3. M. Paulk, B. Curtis, M. Chrissis, and C. We ber, ence Record, vol. 3 (IEEE Communications Capability Maturity Modelfo r Software VJ. l Society GLOBECOM Meeting, November (Pittsburgh, PA : Carnegie-Mellon University, 1984): 1106 -1113. Software Engineering Institute, Te chnical Report, CMU/SEI-93-TR-24 ESC-TR-93-177, 16. T. Pao, M. Phadke, and C. Sherrerd, "Com­ February 1993). puter Response Time Optimization Using Orthogonal Array Experiments," Conference 4. Humphrey, Managing the Software Pro­ W Record, vol. 2 (IEEE International Communica­ cess (Reading, Addison-Wesley Publishing MA: tions Conference, June 1985): 890-895. Company, 1989/1990). 17. S. Schmidt and R. Launsby, Understanding 5. R. Grady ancl D. Caswell, Software Me trics: Industrial Designed Experiments (Colorado Establishing a Company- Wide Program Springs: Air Academy Press, 1989): 3-1 -3-32. (Englewood Cliffs, N): Prentice-Hall, 1987): 112. 18. SA S/Q C Software, Ve rsion 6 (Cary, NC: SAS 6. VMScluster Sy stems fo r· Op enVMS (Maynard, Institute, Inc, 1992). MA: Digital Equipment Corporation, March 1994). 19. G. Schulmeyer, Zero Defect Softwar·e (New Yo rk: McGraw-Hill, Inc., 1990): 73. 7. W Perry, Quality Assurance fo r Information 20. R. Grady, Practical Software Metrics fo r Proj­ Sy stems: Methods, To ols, and Te chniques ect Ma nagement and Process Improvement (Boston: QED Te chnical Publishing Group, (Englewood Cliffs , NJ: Prentice-Hall, 1992): 42. 1991): 541-563.

8. C. Jones, Applied Software Measurement (New Yo rk: McGraw-Hill, Inc., 1991): 174 -176, 275-280.

9. N. Kronenberg, T. Benson, W Cardoza, R. Jagannathan, and B. Thomas, "Porting OpenVMS from VA X to Alpha AXP," Digital Te chnical jo urnal, vo l. 4, no. 4 (Special Issue 1992): 111-120.

10. R. Pressman, Software Engineering: A Pra cti­ tioner's Approach (New Yo rk: McGraw Hill, 1992): 334-338.

11. M. Fagan, "Design and Code Inspections to Reduce Errors in Program Development," IBM Systems jo urnal, vo l. 15, no. 3 (1976): 182-211.

12. D. Freedman and G. We inberg, Ha ndbook of Wa lkthroughs, Inspections, and Te chnical Reviews: Evaluating Programs, Projects, and Products (New Yo rk: Dorset House Pub­ lishing, 1990).

Digital Te clmicaljournal Vo l. 6 o. 1 Wi11ter 1994 53 David G. Conroy Thomas E. Kopec JosephR. Falcone

The Evolution of the AlphaAXP PC

Tbe DECpc AXP 150 personal computer is not only the fi rst in Digital's line of Alpha �YP PC products but also tbe latest in a line of experimental /ow-cost systems. Th is paper traces the evolution of these systems, which began several years ago in Digital's research and advanced development laboratories. The authors reveal some of the reasoning behind the engineering design decisions, point out ideas that worked well, and acknowledge ideas that did not work well and were discarded. Ch ief among the many lessons learned is that combining Alpha AXP microproces­ sors and industry-standard system components is within the alJilities of any com­ petent digital design engineer

The DECpc AXP 150 system is Digital's first Alpha Beta Demonstration Sy stem AXP personal computer (PC) product that supports In the early clays of the Alpha AXP program (1990), the NT operating system. This conventional wisdom said that the DECchip 21064 product is the latest member of an evolutionary microprocessor could not be used to build low-cost series of low-cost systems that take advantage of PC computer systems-it consumed too much power. components and standards. Wo rk on these systems was difficult to cool, and was optimized fo r large . began several years ago in Digital's research and high-performance systems 21 Therefore, Digital was advanced development laboratories. 1 By tracing the not designing any low-cost systems; all product evolution of the Alpha AXP PC from the Beta demon­ groups were waiting fo r a low-cost Alpha AXP CPU, stration system (which pioneered the concept of the DECchip 21066 microprocessor (which was the Alpha AXP PC) through the Theta system (which announced in September 1993). incorporated an Extended Industry Standard Archi­ In December 1990, several members of the tecture [EISA] bus) to the DECpc AXP 150 product, Semiconductor Engineering Advanced Develop­ this paper shows how experimental systems solved ment Group (the same group that began tbe Alpha many problems in anticipation of products. AXP architecture development and designed the The Alpha AXP PC design philosophy is summed DECcbip 21064 microprocessor) decided to investi­ up by one of the DECpc AXP 150 advertising slogans: gate the fe asibility of producing low-cost systems. It's just a PC, only faster. By being culturally compat­ In February 1991, with the help of Digital's ible with industry-standard PC systems, Alpha AXP Cambridge Research Laboratory, they began design­ PC systems can exploit the huge infrastructure of ing and building a demonstration system called low-cost component suppliers supported by the Beta. high vo lumes of the PC marketplace and be cost Although the Beta system used the same enclo­ competitive in that marketplace. sures, power supplies, expansion bus option cards, Alpha AXP PC systems typically include little and peripherals as industry-standard PCs, true PC fu nctionality in the base system. Many additional compatibility was never a goal. The intent was sim­ capabilities are provided by option cards via the ply to demonstrate the feasibility of building low­ Industry Standard Architecture (IS,\) or EISA bus. cost systems; standard PC components were used Such capabilities can be upgraded easily to keep because doing so el iminated some design work. pace with technological developments. In addition, the increasing fragmentation of the desktop com­ Ha rdtuare Design puter market has made it virtually impossible to Figure 1 is a block diagram of the Beta demon­ design a single product that addresses the needs of stration system. The hardware design was com­ all market segments. By providing option slots, sys­ pletely synchronous. The DECchip 21064 CPU tems can be configured to meet a wide variety of operated at 100 megahertz (;vi Hz) and generated the customer requirements. 25 -M Hz clock that ran most of the logic. The !SA bus

54 Vo l. (> No. I Winter 1994 Digital Te c!Jnicaljo urual The Evolution of the Alpha AXP PC

2 X '623

CPU ADDRESS PSEUD0-386 ADDRESS/COMMAND BUS

CPU TAG

CACHE 'SA BUS MAIN MEMORY DECCHIP 8 X 8K X 18 BOOT AND 8M/64M COMBO 1/0 21 064 CPU 25 NS DIAGNOSTIC 9-BIT VTI 82C 106 100 MHZ 4 X 16K X 4 EPROM DRAM SIMMs 20 NS I

Figure 1 Block Diagram of the Beta Demonstra tion Sy stem interface operated at 8.33 i'v! Hz, generated by divid­ The Beta system designers took a different ing the 25-MHz clock by three. approach. They proposed various backup cache The Beta system was packaged in an attractive, and memory systems, estimated the performance low-profile, tabletop enclosure. The compact size and cost of each one, and selected the one that had fo rced the physical arrangement to be cramped, the best performance per unit cost (beyond a mini­ which made cooling the CPU more difficult. Because mum absolute-performance requirement). the CPU operated at 100 MHz and thus dissipated Ultimately, the Beta demonstration system used only about 16 watts, it could be cooled by a large, a 128K-byte (128KB) write-back cache that was 128 aluminum heat sink (6.9 centimeters [em] wide by bits wide and built with the 25-nanosecond (ns), 8.1 em long by 2.5 em deep) and the 20 linear cen­ SK-by-18-bit static random-access memories (SRAi\1.s) timeters per second of available airflow. designed to be usecl with the Intel 82385 cache A three-terminal linear regulator produced the controller. At 25 ns, the SRA.�\1s were slow, but by requ ired 3.3-volt power. Cooling this regulator using 48-milliampere (rnA) address drivers, inci­ was almost as difficult as cooling the CPU chip but dent wave-switching, and AC parallel termination, it was accompl ished with an off-the-shelf heat sink was possible to both read and write the cache in and the available airflow. 40 ns (four CPU cycles):; Writing the cache SRAl'vts The backup cache and memory systems in previ­ was simplified by the fact that, in addition to the ous Alpha AXP designs were absolute-performance normal system clock (syscl k L), the CPU chip gener­ driven, that is, designed to meet specific perfor­ ated a delayed system clock (sysclk2). This delayed mance requ irements. Usually, these requirements clock had exactly the right timing to be used as a were set at the level of performance needed to out­ cache SRAM write pulse. Enabling logic, built using perform competitors. Wo rkstation performance a 7.5-ns programmable array logic (PAL) device, requirements tended to emphasize the SPEC bench­ ensured that the delayed clock was sent to the mark suite. cache SRAM only when a write pulse was actually

Digital Te ch nical jour,a/ Vo l. 6 No. I Winter 1<)<)4 55 Alpha AXP PC Hardware

needed. Va riations on this design, shown in more DMA peripherals. Partial longword DMA writes detail in Figure 2, have been used in several other never worked properly, however, because this chip Alpha AXP systems, including the EU64 (an evalua­ did not function exactly as described in its data tion board fo r the DECchip 21064 microprocessor) sheet. The chip was not used in future designs. and an experimental DECstation 5000 Model 100 The 110 system was built around an I/O bus that daughter card, as well as the Theta and DECpc AXP imitated the signaling of an Intel386 DX CPU chip 150 systems. (since the Intel 82380 device was compatible with Main memory used 8 or 16 standard, 9-bit-wide, the Intel386 DX chip). Logic between the memory dynamic random-access memory (DRAM), single bus and the 1/0 bus translated DECchip 21064 cycles in-line memory modules (SIMMs). Although the into cycles that mimicked the style of Intel386 DX cache was 128 bits wide, the memory was only cycles. Figure 3 illustrates the translation. 64 bits wide and cycled fo ur times, in page mode, to Complete imitation of the signaling of an Intel deliver a 32-byte cache line. Main memory was CPU requires the generation of memory-space reads protected by 32-bit longword parity that was gener­ and writes with byte granularity, I/O-space reads and ated and checked by the CPU ami copied without writes with byte granu larity, and several special­ interpretation to and from the backup cache. The purpose cycles. However, only a small subset of memory controller was built from PA Ls clocked at these cycles (exactly eigbt) actually needed to be 25 MHz. Registered devices generated all the con­ generated in the Beta system. Thus, DECchip 21064 trol signals, so none of the PA Ls needed to be very address bits [32 ..30] were used to encode the fas t. The long minimum delay of these slow PA Ls details of the cycle; a single PAL expanded these made clock skew management easy. 3 bits into the Intel cycle-type signals. An alterna­

The fact that the r;o system needed to be able to tive scheme that stored unencoded Intel cycle-type perform partial longword direct memory access signals in a control register was rejected because (OMA) writes complicated the backup cache and the control register would have to be saved and memory system because both the cache and mem­ restored in interrupt routines. ory were protected by longword parity. After reject­ The external interface of the DECchip 21064 ing the options of (1) storing byte parity in memory microprocessor is strongly biased toward the read­ with no parity in the cache and (2) performing ing and writing of 32-byte cache lines; external read-modify-write cycles on partial longword logic has difficulty gaining access to DECchip 21064 writes, the designers selected the Intel 82380 multi­ address bits [04.. 03] and cannot access address bit fu nction peripheral chip. This chip can perform [02]. Therefore, the Beta system used DECch ip longword assembly and disassembly for narrow 21064 address bits [29.. 27) to supply Intel address

INDEX CACHE ADDRESS BUS

AS805B 48MA DRIVER 33 OHM

16L8-7 CPU OE A � 220 PF OE SYS OE P---Q

CACHE SRAM CPU WE

SYSCLK2 )0--� WE

SYS PRE WE D

SYSCLK1 SYSCLK1

SYSCLK2 DATA OR TAG

ONE SYSTEM CYCLE

Figure 2 Details of the Beta Sys tem Ca che

Vo l. 6 No. I Wi 11tcr I'J91 Digital Te chnical journal The Evolution of the Alpha AXP PC

33 32 30 29 27 26 05 CPU CACHE LINE ADDRESS: 1 1 1 -v------v----- I

000 1/0 BYTE 0 1110 1 0 00000 0 0 00000 001 1/0 BYTE 1 11 1 1 010 1/0 BYTE 2 101 1 1 0 00000 01 1 1/0 BYTE 3 01 11 1 0 00000

100 MEMORY BYTE 0 1110 1 1 10000 1 101 MEMORY BYTE 1 1101 1 10000 110 MEMORY WORD 0 1100 1 1 10000 111 lACK 1110 0 0 00000 *'/ /"/ ,-' §:? I T

� ______A ______� 31 30 27 26 25 24 23 21 20 05 04 02

PSEUD0-386 BUS ADDRESS �====::::' _j�I I ------'1I I I I 10/-DMA �II ISAI-LOCAL ====_j iBE ISAIO/-ISAmemory :::::::::::::::! ATcycle/-XTcycle I =::::==-"�1 � I ' WAIT STATES ------� 19 01 00

ISA BUS ADDRESS

Figure 3 Beta System 11 0 Addressing

bits [04 ..02] . This odd positioning, which was cho­ system responded with the appropriate read and sen because it saved a few parts in the address path, write cycles. The refresh controller in the Intel seemed harmless; the CPU was so much faster than 82380 handkcl memory refresh and generated the the 110 that repositioning the low-order address bits appropriate refresh requests. The memory system did not reduce Beta system performance. It was a responded with column address strobe-before-row bad trade-off, however, because the odd addressing address strobe (CAS-before-RAS) refresh cycles. In scheme made writing low-level software fo r both cases, the Beta system used the CPU's devices containing buffers (e .g., a PC-style video holdReq/holdAck protocol (in which the CPU stops adapter) painful and error prone. and tristates most of its pins) to avoid conflicts The !SA bus interface connected to the I/0 bus. between the CPU and DMA or refresh on the cache Since all DMA control was inside the Intel 82380 and/or memory system. chip, the !SA interface fu nctioned as a simple slave During DMA read cycles, the backup cache read that translated Intel386 DX cycles into !SA cycles.' data (in anticipation) and performed the tag com­ Once again, the Beta system used address bits to pare operation in parallel with the RAS-to-CASdelay encode the details of the cycles. Many program­ of the DRAMs. If the reference was a miss, then CAS mable options were included because the sensitiv­ was asserted and the data came from the DRAMs. If ity of !SA cards to bus timing was unknown. In fact, the reference was a hit, then the buffe rs between the !SA cards that were used cou ld tolerate wide the cache data bus and the memory data bus were variations in bus timing and never required the enabled and the data came from the cache; the unusual bus cycle options. cycle on the DRAMs became a RAS-only refresh cycle. The DMA controllers in the Intel 82380 handled During DMA write cycles, the data was written to !SA-bus DMA transfers and generated the appropri­ the DRAMs. The cache performed the tag compare ate read and write requests. The cache and memory operation in parallel with the RAS-to-CAS delay of

Digital Te cbuical jou,.ual Vo l. 6 No. I Winter 19')4 57 Alpha AXP PC Hardware

the DRAMs. If the reference was a hit, the data was Software Development (BSD)-based UlffUX system written into the backup cache as well (without built fo r the original Alpha AXP Development Unit.C.

changing the state of the dirty bit), and the appro­ This NIX system included a port of Xll R5, priate internal CPU cache line was invalidated. believed to be the first 64-bit X 11 server ever built.- The local peripheral interface also connected to the l/0 bus. This interface included (I) an 82C106 Performance PC Combo 1/0 chip from VLSI Technology, Inc., with The CPU performance of the Beta system was esti­ a keyboard interface, a mouse interface, two serial mated to be 70 fo r integer (SPEC dhrystone, eqntott, lines, a parallel printer port, a time-of-year clock, espresso) and 65 fo r floating point (SPEC fpppp, and a small SRAM with battery backup; (2) a control matrix300, tomcatv) . A system with a 128-bit-wide register that contained a fe w random control bits; cache and a 128-bit-wide memory would have had and (3) a 64KB erasable programmable read-only a performance of 70 integer and 75 floating point. A memory (EPROM) that co ntained the boot and system with a 64-bit-wide cache and a 64-bit-wide console code. Code in the DECchip 21064 serial memory would have had a performance of 65 inte­ read-only memory (ROM) copied the EPROM to main ger and 55 floating point. All these estimates were memory for execution, thus eliminating the need obtained from a trace-driven performance model of fo r hardware that cou ld read 32 from the byte­ the DEC:chip 21064 CPU, the backup cache, and the wide EPROM and assemble the data into a cache line. memory system. The hardware was completed in a very short The !SA bus limited the 1/0 perfo rmance of the time-about eight weeks from the start of the Beta system to approximately 4 megabytes per sec­ design to rel easing the databases to the printed cir­ ond (Mfl/s). Thus, Beta is the most unbalanced cuit board and assembly houses. One person did Alpha AX I' system ever designed. the logic design, the logic simulation, and the tim­ ing verification. A second person designed the Outcome physical layout and solved aJJ the mechanical and Digital ultimately built 35 Beta demonstration sys­ cooling problems. A set of quick-turnaround, com­ tems, which were used in many presentations, puter-aided design (CAD) tools cleveloped by including a stockholclers' meeting ami a private Digital's Western Research Laboratory in Palo Alto, demonstration fo r Bill Gates at Microsoft Corpora­ California, allowed concu rrent design verification tion. The Beta machines were proof that the (with modifications) and physical design. DEC:chip 21064 microprocessor could be used to build low-end systems and prompted a number of Software Design low-end system projects to be started. Beta systems The Beta system was debugged using a simple con­ were usecl in the early development of the Alpha sole program, which drove an ordinary terminal AXP version of the Windows NT operating system . plugged into one of the serial ports and ail owed the Al though many aspects of the Beta design designers to peek and poke at memory and 1/0 worked we ll, dealing with the PC option cards did devices. Both the hardware and the simple console not go smoothly. The designers knew that there was program were debugged in one day, and most of not much low-level programming documentation that day was spent looking fo r a single software bug available and took care to select option cards based in the CPU chip initialization code. on very large-scale integration (VLSI) chips fo r which The designers augmented the simple console good documentation cou ld be obtained. 
They were program to perform more extensive diagnostics, often astounded, however, at how difficult these drive a standard Intel PC keyboard and !SA bus dis­ devices were to program and how often they dicl play, load programs by name from a file system on a not work exactly as described. By the end of the small computer systems interface (SCSJ) disk using project, it was clear to the designers that they could a standard !SA bus disk controller, and load pro­ use PC comronents to build interesting systems. It grams by name via BOOTP over the Ethernet using a was equal.ly evident that PC option cards cou ld be standard ISA bus network controller. All this code difficult to use, since little or none of the low-level fit into the 64KB EPROM, albeit compressed. software provided by the option cards' suppl ier Eventually, the designers built a fairly complete (basic l/0 system rmos] ROM code, MS-DOS device version of the UNlX operating system fo r the Beta drivers, or user code that manipulated the option system, starting from the port of the Berkeley card directly) was usable. Video graphics array

58 Vo l. 6 No. I Winter 1994 Digital Technical journal The Evolution of the Alpha AXP PC

(VGA) cards proved to be particularly troublesome, quently, they did not rea lize that the address map since they tended to conform to the VGA program­ they had chosen fo r their system would cause the ming standard only after code in their BIOS RO M per­ range check to fail. This failure would block the fo rmed a module-specific initialization sequence. generation of the special memory read and write signals and thus make the buffe r memory in a net­ Th eta System work card inaccessible. Further, it is not clear that The next step in the evolution of the Alpha AXP PC the Theta designers could have determined that took place in the winter and spring of 1992 with the this would occur from the chip set data sheet alone. design of the Theta system. This machine was based Digital bu ilt few Theta systems, and little system on the Beta design but used the EISA bus, driven by software ever ran on these systems. The Theta proj­ the lnte1 82350DT chip set.!!Figure 4 is the block dia­ ect, however, provided an environment for learning gram of the Theta system. In addition to replacing how to use the Intel 82350DT EISA chip set in an the JSA 110 system wit!1 an EISA 1/0 system, the Theta Alpha AXP system. The project team developed an design stored its firmware in flash EPROM and used extensive set of exercisers for popular EISA cards, an J/0 bus that imitated the signaling of an Intel486 which were very useful in the development of DX CPU chip. future systems. The Theta designers made several mistakes because they lacked a complete understanding of DECpc AXP 15 0 Product the nuances of industry-standard PC architecture. The charter fo r the design and manufacture of the For example, the EISA bus uses special memory read DECpc AXP 150 product, code-named Jensen, and write signals when accessing the first 1M byte belonged to the Entry Level Solutions Business in (1MB) of memory. The designers were not aware Ayr, Scotland, which had previously designed the that one chip in the EISA chip set was checking successfu ! line of MicroVA X 3100 systems. However, address ranges fo r another chip in the set. Conse- Digital felt that it was important to incorporate the

'623 2 X

CPU ADDRESS PSEUD0-486 ADDRESS/COMMAND BUS

CPU TAG INTEL 3 '244 EBC/ISP INTEL EBB X 82357 82352 82358

CACHE MAIN MEMORY DECCHIP 8 8K 18 BOOT AND X X 64M COMBO I/O 21064 CPU 25 NS DIAGNOSTIC 9-BIT VTI 82C1 06 100 MHZ 4 16K X 4 FLASH ROM X DRAM SIMMs 20 NS <:======�==::::::;> EISA BUS

INTEL EBB 82352 16 '2952 16 '899 X X

CPU DATA (128 BITS) PSEUD0-486 DATA BUS (32 BITS)

Figure 4 Block Diagram of the Th eta Sy stem

1994 Digital Tec hnical journal Vol. 6 No. I Wi nter 59 Alpha AXP PC Hardware

knowledge gained through the Beta and Theta sys­ characteristics of the system, designed a special ver­ tem development efforts in Massachusetts into this sion of the Theta machine (which implemented this new product. The "tiger team" that was assembled definition) and built 10 machines. The design team in March 1992 to define and design the product delivered these systems, called Theta-11, to the gave rise to a unique partnership among develop­ Windows NT group in early June 1992. The group ment, qualification, and manufacturing groups in used Theta-!! systems until actual DECpc AXP 150 Hudson and Maynard, Massachusetts, and in Ayr, machines arrived in August 1992. Scotland. The fact that the system was designed by The DECpc AXP 150 product uses the same enclo­ a tiger team faced with a short, fixed schedule had sure as the DECpc 433ST Intel-based PC. Since this a powerful effect on the architecture. The amount enclosure was intended to hold a multiboard sys­ of time available fo r design work was determined tem, the designers' original intention was to build by calculating backward from the time when the a multi board system as wei l. To save time, however, system was needed. The system implemented the designers put the entire CPU, cache, memory, the best architecture that could be designed in the and core 1/0 system onto the mother board and 12-week period available fo r design. placed all the complicated 1/0 devices (disk, net­ The schedule was extremely tight, as the fol low­ work, display) on EISA option cards. Placing these ing list of milestones indicates: devices on option cards had an additional advan­ tage. Many system-level decisions (such as which Date video controller to use) were removed from the (1992) Milestone critical path and were, in fact, changed several times as the project evolved. Mar 26 Assemble tiger team Because the delivery schedule was tight, the May 29 Complete preliminary schematics designers needed to ensure that the first-pass Jun 1 Begin layout boards were nearly production quality (since many Jun 29 Complete schematics, begin parts sourcing systems would be built using first-pass boards). The Jul 7 Begin prototype build designers had electromagnetic compatibility (EMC), Jul 28 Complete prototype build, beg in debug thermal, assembly, and test experts critique the Aug 5 Ship first prototype (actual date) design and incorporated as many of their sugges­ Aug 10 Ship first prototype (scheduled date) tions as possible into the first-pass boards. Dealing with these design issues early in the schedule NT The Microsoft Windows operating system delayed the release of the first system boards but was selected as the design center fo r the product. saved time overall because the usual flurry of This decision made designing to an aggressive changes required by regulatory testing and manu­ schedule even more difficult. The port of the facturing was avoided. Windows NT system to the Alpha AXP architecture was beginning at about the same time, and there were still many unknowns. This problem was Ha rdwa re Design solved by quickly establishing a close working rela­ The DECpc AXP 150 hardware design is completely tionship with key technical contributors in the synchronous. The DECchip 21064 CPU chip oper­ Windows NT group, resulting in a design team that ates at 150 MHz and generates the 25-MHz clock that spanned eight time zones. runs most of the logic. 
Sections of the 1/0 system The tiger team's original intention was to base run at 8.33 MHz, generated by the EISA chip set. The the product on the Theta design. A careful design original plan included a 256KB backup cache built review, however, showed that the Theta design was with the 16K-deep version of the SR.AMsused in the not well suited to high-volume manufacture. In fac t, Beta and Theta systems. The designers discovered , the Theta design did not implement some of the however, that the 32K-by-9-bit SRAi\ils used to build more esoteric aspects of the EISA standard, such as caches for Intel486 DX systems were so inexpensive the asynchronous timing of translated !SA direct mas­ that building a 512KR backup cache was less expen­ ter cycles, and could not be easily modified to do sive than building a 256KB cache. The designers so. Therefore, a new I/0 system design was needed. reused the same basic cache design used in the Beta This requirement caused a schedule problem for and Theta systems, although two copies of the the Windows NT group. To solve the problem, the cache address are needed because there are twice tiger team quickly defined the software-visible as many SR.AN!s. This design, with 17-ns SRAN!s,

Vo l. 6 No. 1 60 Wi nter /')'}4 Di�ital Te e/mica/ journal Th e Evolution of the Alpha AXP PC

allows the CPU to read the cache in 32 ns and write The Windows NT operating system requires that the cache in 40 ns. Figure 5 is a block diagram of the a block of physically contiguous addresses on an DECpc AXP 150 product. 1!0 bus need only be mapped into a single block The memory system was redesigned to be 128 of virtually contiguous addresses in the virtual bits wide (to increase perfo rmance), to use 36-bit­ address space. Thus, schemes that place low-order wide SIMMs (to save space), to handle both 1M- and Intel address bits and/or byte enables in high-order 4M-deep SIMMs, ami to hand le both the sing.le- and AJpha AXP address bits (like the ones used on the double-banked versions of the SIMMs. Because sys­ Beta and Theta systems) are unacceptable. The tem software strongly prefers to have contiguous DECpc AXP 150 product, therefore, uses a new memory, the design includes some hardware so that scheme, developed jointly by the hardware and configuration software can arrange the available software designers. This scheme places the low­ memory into a dense block starting at location 0. order Intel address bits and the width of the cycle in Main memory is protected by Jongword parity, as low-order Alpha AXP address hits. Figure 6 illus­ in the Beta and Theta systems. The memory con­ trates the translation. troUer transforms partial longword DM.A writes The DECchip 21066/21068 microprocessors and into read-modify-write cycles. The latching and the DECchip 21070 chip set use a similar mapping merging fu nctions are performed in the 74 FCT652 but make two small improvements. First, they shift transceivers situated between the memory bus and the Alpha AXP address 2 bits to the right, which the 110 bus. The registers in the transceivers make makes the Intel addressing window 128MB in size. it possible to check the parity of the old data at Second, they add a special check to make accessing the same time as the new data is written, making the the lowest 1MB of the Intel memory space more effi­ read-modify-write cycle faster. cient, since a number of important peripherals

2 '623 X 4 X '244

CPU ADDRESS

CPU TAG INTEL EBC/ISP INTEL EBB 82357 82352 82358

CACHE MAIN MEMORY DECCHIP 16 9 BOOT AND X 32K X 16M/256MB COMBO I/O 21064 CPU 17 NS DIAGNOSTIC 36-BIT VTI 82C106 150 MHZ 4 16K 4 FLASH ROM X X DRAM SIMMs 15 NS <����> EISA BUS

EISA INTEL EBB PARITY 82352 LOGIC 16 '2952 16 '652 X X

CPU DATA (128 BITS) ! t PSEUD0-486 DATA BUS (32 BITS)

Figure 5 Block Diagram of the DECpc AXP 150 Product

Digital Tecbnical jonrnal Vo l. 6 No 1 Winter 1994 61 Alpha AXP PC Hardware

33 32 31 09 08 07 06 05

31 25 00 BYTE ADDRESS 01 WORD EXTENSION 10 TRIBYTE REGISTER: 11 LONGWORD

'--y------� � ______A______-A- 31 25 24 02 01 00 EISA BYTE ENABLES EISA ADDRESS

Figure 6 DECpc AXP150 110 Addressing

have hardwired address assignments between system clock sampling the outputs of the first 640KBand 896KB. one. Intel uses a similar scheme in their 82350DT The local 110 systems in the DECpc AXP 150 and example designs. Theta systems are similar. The only exception is 2. Flow control on partial longwor'l writes. In the that the Alpha AXP 150 system has the additional DECpc AXP 150 system, the backup cache and flash EPROM needed to support the console inter­ main memory are protected by longworcl parity. faces fo r the Wi ndows NT operating system (thus This complicates DMA writes of less than a long­ conforming to the Advanced Computing [ARC] ruse word . Such writes need to be implemented as specification) and the DEC OSF/1 AXP and OpenVMS read-modify-write sequences, and the timing of AXP operating systems (thus conforming to the the EISA byte-enable signals (which tel l the DMA Alpha AXP console architecture). The designers logic if a read-modify-write cycle is needed) make considered rearranging the addressing and placing it impossible to extend the cycle using the nor­ the industry-standard peripherals (those in the VlSI mal EISA flow control mechanism (EXRDY). The Te chnology Combo I!O chip) at their usual places in designers solved this problem by stretching the the EISA address space. This plan was rejected to EISA bus clock when needed; the Intel EISA chip ensure that the console firmware cou ld communi­ set has logic (HSTRETCH) fo r doing this. The short cate with the serial lines and be used to debug the bus clock stretch does not affect any option that EISA T/0 system. conforms to the EISA bus specification. The EISA 1/0 system was tricky to design because the Intel 82350DT EISA chip set was intended to be 3. !SA direct masters. The Intel H2350DT chip set used in a system with a very different architecture. contains logic that translates !SA direct master After careful analysis of the problem, only three cycles into ordinary EISA cycles. These EISA truly difficult issues were evident. cycles, however, have somewhat unusual timing; they contain events that are not synchronized to I. Clock skew. The EISA bus clock is generated by the EISA bus clock. The most difficult part of synchronous logic (inside the Intel 82350DT chip dealing with this timing was determining that set) running on the system clock; however, the these events could actually happen. Making it delays between the EISA bus clock and bus con­ work required simply that some key transceiver trol signals, combined with the delays in the EISA control signals be generated combinatorially. bus clock generator itself, make it impossible to use EISA bus control signals as inputs to syn­ The arbiter in the Intel H2350DT chip set chronous logic running on the system clock. responds to bus requests (e.g., EISA bus masters or This problem was solved by implementing the the DMAcon trollers inside the 82350DT chip set), EISA control logic as two interlocked state stops the CPU using the DECchip 2!064 CPU's machines. The first machine runs on the ElSA bus holdReq/holc!Ack protocol, and takes control of the clock, and the second machine runs on the cache and memory system. ·when the CPU is in

62 Vo l. 6 No. I If/inter 1994 Digital Te cbtl ical jo unml Th e Euolution of the Alpha AXPPC

hold, the memory controller watches fo r EISA mem­ a slow process because this was the first time that ory cycles aimed at the low 256MB of memory and many of the developers had dealt with PC hard­ performs the appropriate read, write, read-modify­ ware. Once operational, however, the systems sta­ write, or refresh cycles. The cache cycles at the bilized rapidly. same time as memory, reading and writing as The designers discovered few hardware prob­ required. The riming is tighter than in previous lems as the operating system work progressed. machines, but making it work required only a care­ Some hardware problems were discovered in the ful placement of the critical parts. No attempt was EISA option cards, when software attempted to use made to allow the processor to run during DMA, the cards in a manner unlike that of the MS-DOS partially to keep the design simple (EISA bursts are operating system. These problems were tracked not required to be sequential) and partially to avoid down with the help of option card vendors. risk. Intel systems built with the 82 350DT chip set stop the CPU during DMA. Introducing parallelism Pe rformance could have revealed a lingering bug in the chip set The CPU performance of the system met project that no other system bad encountered. expectations. The original design concept targeted 100 SPECmark89, and simulations indicated that the Software Design 150-MHz system with 512KB cache would achieve The designers used a custom version of the Theta-11 this rating. As with any new system, performance firmware to debug the first DECpc AXP 150 systems. tuning is an important part of the development The systems were operational in less than half a day activity. Ta ble 1 shows the performance results as after they arrived from the assembly house. Almost of.January 1994 fo r the DECpc AXP 150 system and immediately, the debugging firmware was replaced fo r some of its competitors, all running industry­ by the first version of the production firmware, standard benchmarks under the Microsoft Windows developed jointly by the DECpc AXP 150 and the NT operating system. Descriptions of these bench­ Windows NT firmware teams. marks, as wel l as of the hardware and software con­ The first group to receive shipment of the sys­ figurations used to make the measurements, can be tems was the Windows NT group, who had been fo und in the Alpha AXP Personal Computer using the Theta-11 machines. Their software worked Perfo rmance - Windows NT.9 instantly; they only had to fix a single bug that had Comparing the perfo rmance of the DECpc AXP been concealed by a bug in the Theta-11 systems. 150 system to DEC 3000 AXP workstation perfor­ The group used both systems until enough DECpc mance is difficult. Most perfo rmance measure­ AXP 150 systems were available, at which point the ments have been made under the Windows NT Theta-11 systems were decommissioned. operating system, using the compilers and libraries Next, the OpenVMS and DEC OSF/1 groups took appropriate fo r that system, and the Windows NT delivery of systems. Making each of these two sys­ operating system does not run on DEC 3000 AXP tems operational on a DECpc AXP 150 machine was systems. As shown in Ta ble 2, the SPEC benchmark

Ta ble 1 DECpc AXP 150 Benchmark Performance under the Windows NT Operating System

Digital Gateway 2000 MIPS Computer Systems Benchmark Metric DECpc AXP 150 P5 60 MHz Magnum 75/150 MHz

Byte, numeric sort Sorts/s 36.0 11.4 33.6 Byte, string sort Bytes/s 136M 40.4M 123M Byte, bit fields Ops/s 7. 6M 2.2M 7. 8M Byte, emulated float FLOPS 2977 974 4597 Byte, simple FPU Bytes/s 5.4M 2.7M 3.9M Byte, transcendentals Coeffs/s 867 334 538 Dhrystone V1 .1 DMlPS 175.1 68.9 68.7 Dhrystone V2.1 DMlPS 161.5 61 .6 59.8

Clinpack 100 x 100 MFLOPS 22.2 8.0 7.4 (double precision) CWhetstone KWIPS 95.5 34.0 36.7 (double precision)

Digital Te clmical jo urnal Vo l. 6 Nl !. I Wi nter 1994 Alpha AXP PC Hardware

Ta ble 2 DECpc AXP 150 and DEC 3000 AXP Lessons Learned Benchmark Performance under the Engineers involved in the Beta, Theta, and DECpc DEC OSF/1 AXP Operating System AXP 150 projects learned many lessons duri ng the System SPECint92 SPECfp92 evolution of the these systems. Chief among these lessons are the following: DEC 3000 AXP Model 600 114 162 DEC 3000 AXP Model 500 84 128 1. Combining Digital's Alpha AXP microprocessors DEC 3000 AXP Model 400 75 112 and industry-standard system components was fa irly straightforward and certainly within the DECpc AXP 150 77 110 (V1 .3A-4) abilities of any competent digital design engineer. DEC 3000 AXP Model 300 66 92 2. Even though many engineers shy away from evo­ lutionary design because it lacks the glamour of suite, running under the DEC OSF/l AXP operating doing things from scratch, evolution is a fine system, demonstrates that the DECpc AXP 150 sys­ methodology. This is especially true if the cost of tem performs like a midrange DEC 3000 AXP system. each step along the way is small. Each new Alpha These results are not surprising because the only AXP PC eliminated the obvious shortcomings of real diffe rence between the two systems is the the previous designs, yet the risk of the system OECpc AXP 150 machine's slightly slower memory not working was small because most of the system. 10 design was copied from the predecessor. While adequate fo r many applications, the 1!0 3. The interchip and system buses designed with performance of the DECpc AXP 150 system is lim­ Intel CPUs in mind (e.g., !SA, EISA, and Peripheral ited not only by the EISA bus (which has a peak Component Interconnect [PCI]) can be used in bandwidth of 3:3MB/s) but also by the lntei 82350DT non-Intel systems. Designing the correct inter­ EISA chip set. The chip set is designed to be used faces, however, may require multiple iterations. with Intel microprocessors, and considerable I/O Digital engineers made three iterations before performance is lost reconciling Intel control sig­ arriving at what seems to be a good mapping nals with DECchip 21064 control signals. For this from Alpha AXP program 110 cycles to Intel pro­ reason, the peak bandwidth of the EISA bus in gram 1/0 cycles. the DECpc AXP 150 system is only 25 MB/s. The chip set also wastes EISA bus bandwidth while per­ 4. Sometimes, the best way to show that something fo rming internal operations. One network adapter is possible is to build a demonstration unit. coulli transfer only 16 MB/s because of excessive When the Beta system was designed, few EISA bus request latency introduced by the 823500T believed that low-end Alpha AXP systems could chip set. be built. No nbelievers found it difficult to defend their position in the presence of a work­ ing computer. Outcome 5. Clear goals can make development proceed The DECpc AXP 150 product was first shown in pub­ faster, since alternatives that run counter to the lic on October 28, 1992, by Bill Gates at Windows goals need little analysis-they can simply be on Wa ll Street, a presentation of the Windows NT rejected. The time-to-market goal of the DECpc operating system for more than 1,000 Wa ll Street AXP 150 project was an extreme example. analysts. The machine was subsequently shown at Designers needed only to consider alternatives the Alpha AXP introduction and in both Digital's that allowed meeting the tight schedule. and Microsoft's booths at the 1992 Fall COMDEX conference in Las Vegas. There, the product was a 6. 
Qualification, test, and assembly issues must be finalist fo r "best system of show," missing the title addressed as part of the design rather than by only one vote despite limited advertising. Digital as annoying derails to be addressed later. fo rmally announced the system at the 1993 Spring Aclclressing these issues during the design phase COMDEX conference, coincident with Microsoft's may delay the delivery of the prototype, but rollout of the Windows NT operating system. The doing so helps to ensure that the prototype is of DECpc AXP 150 product continues to receive good high quality and to avoid the risk of longer delays reviews and awards from the PC media. later in the project.

64 Vo l. 6 No. I \Vinter l!JV4 Digital Te chnicaljottr11al The Euolution of the Alpha AXP PC

7 A good CAD system is a valuable asset. Because 5. E. Solari, !SA and EISA, Theory and Op eration our CAD system was designed for rapid turn­ (San Diego, CA: Annabooks, 1992). around, it was possible to make significant 6. C. Thacker, D. Conroy, and L. Stewart, changes to designs as new data appeared or "The Alpha Demonstration Unit: A High­ when urgent needs arose. The Theta-ll board Performance Multiprocessor:· Communica­ design was released five days after the designers tions of the ACLI-'1 , vol. 36, no. 2 (February concluded that it was needed. 1993): 55-67.

Acknowledgments 7. j. Gettys, personal communication, 1991. The contributions of many people made this effort 8. 82350DT ElSA Chip Set (Santa Clara, CA: Intel successful. Although not complete, the fo llowing Corporation, Order No. 290377-003, 1992). list gives credit to those who were the most deeply 9. Alpha AXP Personal Computer Performance involved. Rich Witek, Dave Conroy, To m Levergood, Brief- Windows NT, 2d ed. (Maynard, MA: Larry Binder, Larry Stewart, Win Treese, and joe Digital Equipment Corporation, Order No. Notarangelo designed and implemented the Beta EC-N2685 -10,januar y 1994). hardware and software. Ra lph Wa re, Lee Ridlon, and Carey McMaster designed and implemented 10. Alpha AXP Wo rkstation Family Performance the Theta hardware and software. Joe Falcone first Brief- DFC OSF-11 AXP, 4th ed. (Maynard, MA: realized that there was a product lurking in the Digital Equipment Corporation, Order No. advanced development, transformed it into a real EC-N2922-IO,February 1994). project, named it after a legendary sports car, and did the early system definition work. The jensen hardware tiger team was led by To m Kopec and included jay Osmany, Fraser Murphy, To m Mann, Dave Conroy, Lee Ridlon, and Carey McMaster. Colin Brench, Sharad Shah, Roy Batchelder, Jim Pazik, and Mike Allen provided consulting services early in the design.

No te and References

I. The DEC 2000 AXP Model 300 is the same product using the DEC OSF/1 AX!' and the OpenVMS AX!' operating systems.

2. R. Sites, ed., Alpha Architecture Reference Ma nual (Burlington, MA: Digital Press, Order No. EY-L520E-DP, 1992)

:3. DECchip 21064 l'vl icroprocessor: Ha rdware Reference l'vlanuat (Maynard, MA: Digital Equipment Corporation, Order No. EC-N0079- 72, 1992). This document and many others associated with Digital's Alpha AXP micro­ processors are available on the Internet and can be obtained via anonymous FTP from gatekeeper.pa.dec.com, in directory pub/DEC/Al pha.

4. H. johnson and M. Graham, Chapter 6 in High Sp eed Digital Design: A Ha ndbook of Black Magic (Englewood Cliffs, NJ : Prentice-Hall, 1993).

Digital Te chnical jo urnal Vo l. 6 No. 1 Winter 1994 65 Dina L. McKinney Ma sooma Bhaiwala Kwong-Tak A. Chui Christopher L. Houghton james R. Mullens DanielL. Leibholz Sanjay]. Patel Delvan A. Ramey Digital'sDE Cchip 21066: Ma rk B. Rosenbluth The First Cost-focused AlphaAXP Chip

Th e DECchip 21066 microprocessor is the first Alpha AXP microprocessor to target cost fo cused system applications and the second in a fa mi(V of chips to implement the Alpha AXParchitecture. Th e chip is a 0. 675-micrometer (J.Lm), CiiWS-based, superscalar; superpipelined processor that uses dual instruction issue. It incorpo­ rates a high level of system integration to provide best- in-class system performance fo r low-cost system applications. The DECchip 21066 microprocessor integrates on-chip, fully pipelined, integer and floating-point processors, a high-bandwidth memory controller, an industry -standard PC! l/ 0 controller,graphics-assisting hard­ ware, internalinstruction and data caches, and an externalcache controller. Cost­ saving packaging techniques and an on-chip, analog phase-locked loop enable the chip to meet the cost demands of personal computers and desktop sys tems. Th is paper discusses the trade-o!fs and results of the design, verification, and implemen­ tation of the DECchip 21066 microprocessor.

The OECchip 21066 microprocessor is the first cost­ oscillator. The combination of cost-saving fu nc­ fo cused implementation of the Alpha AXP architec­ tional integration and fe atures that ease system ture.1 The chip integrates system fu nctions that are design reduces the overall CPU subsystem cost and normall y fo und in a microprocessor chip set with shortens the time-to-market of high-volume Al pha a high-performance, superscalar microprocessor AXP PCs. core to deliver high-end personal computer (PC) performance and low overall system cost. The CMOS Te chnology DECchip 21066 device is also the first micropro­ The first chip to fu lly implement the AJ pha AXP cessor to integrate a Peripheral Component Inter­ architecture, the DECchip 21064 microprocessor connect (PC!) local bus control ler z This open was designed in a 0.75-micrometer (t-Lm) (drawn) industry-standard 1/0 bus allows direct connection complementaq' metal-oxide semiconductor (GvlOS) to high-performance peripheral components sup­ process -'· ; The physical implementation of the pi ied by many vendors. The bus also allows direct DECchip 21066 microprocessor was achieved connection to commodity Industry Standard Archi­ through the use of a 10 percent CivlOS process tecture (!SA) or Extended ISA (EISA)-based PC shrink, which reduced tbe minimum fe ature size peripherals through a simple bridge chip. The from 0.75 1-Lm (drawn) to 0.675 t-J.m. This reduction OECchip 21066 microprocessor integrates a high­ in the fe ature size enabled the OECchip 21064 inte­ bandwidth memory controller that directly ger unit, floating-point unit, and caches to be com­ sequences an external secondary cache and main bined with a new memory controller, PCI 1/0 system memoq'. In addition, an on-chip, phase­ controller, and PLL on a die with approximately the locked loop (PLL) multiplies a low-frequency refer­ same area as the original DECchip 21064 device. The ence clock to a high-frequency CPU core clock, thus maximum core clock speed of the DECchip 21066 eliminating the need fo r a high-frequency board microprocessor is specified as 166 megahertz (MHZ).


This speed allows the internal PLL to multiply a 33-MHz reference clock (the same reference clock as on the PCI bus) by five to generate a clock for the CPU core. This cycle time was set to be slightly less aggressive than the 200-MHz maximum core clock speed for the DECchip 21064 microprocessor to allow a greater number of yielding parts and thus lower part cost.

Target Market

One target market for the DECchip 21066 microprocessor is Alpha AXP PCs running the Microsoft Windows NT operating system. The bulk of the application base for this operating system is provided by vendors who target the Microsoft Windows operating system running on PCs based on Intelx86 microprocessors. Software vendors have substantial expertise with this architecture and, until recently, the advanced software tool base for Windows development has been targeted exclusively at the Intelx86 architecture. To provide compelling motivation for application developers to supply portable Windows NT applications and for customers to adopt a new architecture, Alpha AXP PCs must be price competitive with Intelx86 PCs and must deliver substantially higher performance. The availability of the Windows NT operating system and high-performance PCs based on Intel's Pentium microprocessor dictated that a set of competitive products be available in early 1994.5 Furthermore, the performance of Pentium-based PCs set the lower bound of acceptable performance for PCs based on the Alpha AXP technology. These schedule and performance goals were met by implementing the DECchip 21066 microprocessor as a high-integration design variant of the DECchip 21064 microprocessor. This strategy permitted a design cycle of only 10 months.

Target Performance

The DECchip 21066 microprocessor provides a CPU subsystem that is comparable in cost to a 66-MHz Intel486 DX2-based PC with Pentium-class performance.6 Because of its excellent price/performance ratio, the DECchip 21066 microprocessor is attractive to low-end workstation products running the DEC OSF/1 AXP and OpenVMS AXP operating systems, in addition to Windows NT platforms.

A performance-limiting factor for many applications is the bandwidth to the second-level cache. The DECchip 21064 microprocessor maximizes bandwidth with a 128-bit data bus, which allows two secondary cache read sequences to fill a line in the primary cache. DECchip 21066 engineers chose to implement a 64-bit data bus. This implementation resulted in a less-expensive package, a smaller die, and lower system cost, while reducing by only 20 percent the performance that would have been achievable with a 128-bit bus. The fast core clock and close proximity of the secondary cache to the core still enable the DECchip 21066 microprocessor to deliver high performance.

Internal Components

The block diagram in Figure 1 shows the major DECchip 21066 microprocessor components. Fetched instructions from an on-chip 8-kilobyte, direct-mapped instruction cache are dual-issued to the integer, floating-point, and load-and-store (addressing box) units. The on-chip, 8-kilobyte, direct-mapped data cache initially services memory loads. Stores are written through the data cache and absorbed into a write buffer. The memory controller or PCI I/O controller handles all references that pass through the first level of caches.

The memory controller handles memory-bound traffic. The controller first probes a direct-mapped, write-back, write-allocate cache and then sequences main memory to fill the caches, if necessary. The PCI I/O controller handles I/O-bound traffic and performs programmed I/O read and write operations on behalf of the CPU. Direct memory access (DMA) traffic from the PCI is handled by the PCI controller in concert with the memory controller. DMA read and write operations are not allocated in the secondary cache. The memory and PCI interfaces were designed specifically for uniprocessor systems and do not support multiprocessor implementations.

System Application

Figure 2 shows a sample system block diagram using the DECchip 21066 microprocessor. In this configuration, the memory controller sequences both the static random-access memory (SRAM) secondary cache and the dynamic random-access memory (DRAM) main memory. The secondary cache comprises tag and data SRAMs with identical read and write access times. System software selects specific random-access memory (RAM) timing values in increments of the number of core clock cycles. The design supports four separate banks of DRAM, each of which can be timed independently. This feature adds flexibility in memory organization and upgrading.
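The per-bank timing programmability described above maps naturally onto a register-per-bank software model. The sketch below, in C (the language of the project's RTL model), is illustrative only: the structure, field names, and cycle counts are hypothetical and are not the DECchip 21066 register layout.

```c
#include <stdio.h>

/* Hypothetical per-bank DRAM timing, programmed in units of core
 * clock cycles as the paper describes; all names and values invented. */
struct bank_timing {
    unsigned row_access_cycles;   /* row (RAS) access, in core clocks    */
    unsigned col_access_cycles;   /* column (CAS) access, in core clocks */
    unsigned precharge_cycles;    /* precharge, in core clocks           */
};

int main(void)
{
    const double ns_per_cycle = 1000.0 / 166.0;   /* ~6.02 ns at 166 MHz */

    /* Four independently timed banks, e.g., mixing fast and slow SIMMs. */
    struct bank_timing bank[4] = {
        { 10, 4, 6 }, { 10, 4, 6 }, { 12, 5, 7 }, { 14, 6, 8 }
    };

    for (int i = 0; i < 4; i++)
        printf("bank %d: row %.1f ns, col %.1f ns, precharge %.1f ns\n",
               i, bank[i].row_access_cycles * ns_per_cycle,
               bank[i].col_access_cycles * ns_per_cycle,
               bank[i].precharge_cycles * ns_per_cycle);
    return 0;
}
```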


[Block diagram omitted. It shows the instruction cache dual-issuing to the integer unit (instruction unit, integer execution unit, integer register file) and the floating-point unit (instruction unit, floating-point execution unit, floating-point register file); the addressing box, containing the write buffer and data cache; and, below them, the memory controller and the PCI I/O controller.]

Figure 1 Major Components of the DECchip 21066 Microprocessor

[Block diagram omitted. It shows the DECchip 21066 microprocessor driving cache control lines to the secondary cache, memory control lines to DRAM/VRAM main memory, and a 64-bit data bus shared by the secondary cache and main memory.]

Figure 2 Sample System Block Diagram


One of the four banks can be configured with video RAMs (VRAMs) for low-cost graphics applications. The memory controller directly sequences the VRAMs and supports simple graphics operations, such as stippling and bit-focused writes.

Even though the DECchip 21066 microprocessor operates at 3.3 volts (V), all pads directly drive transistor-transistor logic (TTL) levels and can be driven by either 3.3-V or 5-V components. This eliminates the need for either special voltage conversion buffers or 3.3-V memory parts, which can add to system cost or affect performance.

The PCI bus is a high-bandwidth bus with a number of attractive features. In addition to its ability to handle DMA and programmed I/O, the PCI bus allows for special configuration cycles, extendibility to 64 bits, 3.3-V or 5-V components, and faster timing. The base implementation of the PCI bus supports 32 bits of multiplexed address and data with a bus clock of 33 MHz, yielding a maximum burst bandwidth of 132 megabytes per second (MB/s).

The PCI bus is directly driven by the microprocessor. In Figure 2, some high-speed peripheral devices, such as graphics and small computer systems interface (SCSI) adapters, connect directly to the PCI bus. An ISA bridge chip allows access to lower-cost, slower peripheral devices, such as modems and floppy disk drives.

Memory Controller Features

A primary goal of the DECchip 21066 project was to simplify the system design to enable customers to build products with minimal engineering investment. Consequently, the memory controller directly connects to industry-standard DRAMs and single in-line memory modules (SIMMs), with buffers required only for electrical drive. To span the wide range of system applications from embedded controllers to midrange workstations, the timing of address, data, and control signals is highly programmable. This feature supports varying speeds and physical organization of the DRAMs and clock speeds of the DECchip 21066 microprocessor itself.

The memory controller supports from one to four banks of memory, which allows main memory to range from 2M bytes to 512M bytes. The system may include a secondary cache of 64K bytes to 2M bytes. This optional cache also connects directly to the microprocessor, and the timing of control signals is programmable in increments of chip cycles.

Several simple, graphics-assisting features allow systems that need to save the cost of dedicated graphics control hardware to control the system frame buffer in software. The decision to include graphics features was based on the ability to provide significant acceleration of key functions for minimal hardware cost in the memory controller.

Dynamic Memory Interface

The DRAM data interface consists of a 64-bit-wide bidirectional bus that is shared with the secondary cache. The interface provides two control signals for an optional memory data bus transceiver that may be required because of bus loading. All main memory read operations return a full 64 bits of data. Main memory write operations normally consist of a full quadword (64-bit) write. Write requests for less than a quadword of data automatically result in a read-modify-write sequence, unless the software has enabled masked writes on the memory bank. During masked writes, the error correction code (ECC) pins provide a byte mask that is externally combined with a column address strobe (CAS) signal to form a byte mask strobe for the array. ECC checking is not supported on memory banks that have masked writes enabled.

An optional 8-bit bidirectional bus provides quadword ECC checking on the main memory and the secondary cache. Each bank of main memory may be protected by ECC at the cost of including 8 extra DRAM bits per bank. The ECC used by the DECchip 21066 microprocessor allows correction of single-bit errors and detects double-bit and 4-bit nibble errors. The memory controller automatically corrects the single-bit memory read errors before data is passed to the read requester. When the ECC identifies a double-bit or a 4-bit nibble error, the memory controller does not attempt to correct it. When any ECC error is detected, the memory controller stores the error condition along with the address of the error. System software must scrub the error from physical memory.
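The two write paths just described can be summarized in a few lines of C. When masked writes are disabled, a store of less than a quadword triggers the read-modify-write merge sketched below; when they are enabled, the byte mask is driven on the ECC pins instead and only the selected bytes are strobed, skipping the read at the cost of ECC coverage. The one-quadword memory model and the function name are invented for illustration.

```c
#include <stdint.h>
#include <stdio.h>

static uint64_t dram_qw = 0x1111111111111111ull;  /* one quadword of DRAM */

/* Masked writes disabled: read the full quadword, merge the bytes
 * selected by the mask, and write the full quadword back. */
static void write_sub_quadword_rmw(uint64_t data, uint8_t byte_mask)
{
    uint64_t qw = dram_qw;                       /* read   */
    for (int i = 0; i < 8; i++)                  /* modify */
        if (byte_mask & (1u << i)) {
            uint64_t m = (uint64_t)0xff << (8 * i);
            qw = (qw & ~m) | (data & m);
        }
    dram_qw = qw;                                /* write  */
}

int main(void)
{
    write_sub_quadword_rmw(0xabcd, 0x03);  /* store only the low 2 bytes */
    printf("%016llx\n", (unsigned long long)dram_qw);  /* 111111111111abcd */
    return 0;
}
```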
Secondary Cache

The DECchip 21066 microprocessor supports a secondary cache designed with industry-standard asynchronous SRAMs. The cache is direct mapped with 8-byte blocks and uses a write-back policy. Designers chose a secondary cache block size that is smaller than that of the 32-byte primary internal caches to simplify the allocation of an external cache block during write operations. With the smaller block size, it is not necessary to fetch the missing quadwords of the block from DRAM when less than a full internal cache block is written, as would be the case with a larger block size.


In addition, the smaller block size provides one dirty bit per quadword, so that the resolution of dirty (not yet written back) data is one quadword rather than four, thus reducing the number of write-back operations to the DRAM.

The DECchip 21066 microprocessor writes the cache tag into the SRAMs on cache allocation operations and receives and compares the tag internally on cache lookup operations. Read-fill operations of the internal caches are pipelined as the microprocessor drives the next cache index to the SRAMs prior to determining the hit or miss result of the current lookup. A cache hit, which is the more common occurrence, causes SRAM access to achieve maximum speed. In the case of a miss, the microprocessor re-drives the miss address and reads the DRAMs.

The chip divides the memory space into cacheable and noncacheable regions based on address. To save time, accesses to the noncacheable region skip the cache-probe step and immediately access the DRAM. This feature may be used, for example, to optimize accesses to a frame buffer, which typically is not cached.

The cache tag field, including the dirty bit, may optionally be protected by a single parity bit. In systems with write-through caches, tag parity errors are not necessarily fatal since the correct data can be fetched from DRAM. With a write-back external cache, such as the one used by the DECchip 21066 microprocessor, the bad parity may be on a dirty location. In this case, bad parity is a fatal error, since the copy in DRAM is no longer current.

Graphics-assisting Functions

The graphics-assisting logic of the DECchip 21066 microprocessor provides some basic hardware enhancements to improve frame buffer performance over a standard, simple frame buffer system. The graphics-data-path-assist logic is targeted at reducing the number of instructions executed during inner loop graphics operations. By reducing the number of inner loop instructions, the microprocessor offers improved graphics performance while keeping the overall graphics subsystem cost in line by using a simple frame buffer design rather than a more expensive graphics accelerator. The graphics-assist logic provides the opportunity to design a low-cost, entry-level graphics option and may be used in conjunction with a high-performance graphics accelerator.

The hardware graphics-assist logic consists of a direct interface to VRAM parts and the ability to perform some simple graphics-oriented data manipulations. To facilitate the use of VRAMs, the memory controller supports both full- and split-shift register load operations. External video monitor control logic generates monitor timing and signals the DECchip 21066 memory controller when VRAM shift register loads are required. An internal linear address generator keeps track of the video display's refresh address.

The features of the graphics data path provide the ability to perform simple frame buffer-style data operations, e.g., transparent stipple writes, plane masked writes, and byte writes (with minimal external logic). In transparent stipple mode, the memory controller can conditionally substitute the foreground data for the frame buffer data on memory writes.
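In software terms, transparent stipple mode behaves like the sketch below: stipple bits select which pixel positions receive foreground data, and unselected positions keep the existing frame buffer contents. An 8-bit pixel depth is assumed purely for illustration.

```c
#include <stdint.h>

/* Transparent stipple over an 8-pixel span of an 8-bit-deep frame
 * buffer: stipple bit i set -> substitute foreground pixel i; bit
 * clear -> leave the frame buffer pixel unchanged.  Names invented. */
static void stipple_write(uint8_t fb[8], const uint8_t fg[8], uint8_t stipple)
{
    for (int i = 0; i < 8; i++)
        if (stipple & (1u << i))
            fb[i] = fg[i];
}
```

Plane-masked and byte writes are the same idea with the selection applied per bit plane or per byte rather than per pixel.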
Pixel depth may range from 1 bit to 32 bits. A special feature of the graphics-assist logic is the ability to perform graphics data manipulation operations on both VRAM and standard DRAM memory banks. The DECchip microprocessor emulates graphics operations targeted at memory banks without VRAMs by means of a read-modify-write sequence. This method allows the same graphics firmware to operate on either VRAM or DRAM memory banks and allows DRAM to be used as additional off-screen memory.

A low-cost video option board may be built around the microprocessor by adding a bank of VRAMs and a video controller. Typical video controller logic consists of a video timing generator, a RAMDAC (video digital-to-analog [D/A] converter with color mapping), a hardware cursor, and a few other simple video data path components.

PCI Bus Interface

The DECchip 21066 microprocessor is the industry's first microprocessor to implement an interface that connects directly to the PCI bus. This PCI bus interface, called the I/O control (IOC), runs asynchronously with the rest of the microprocessor's core logic. Asynchronous design was chosen to enable optimal system performance by setting the chip's core clock to its maximum frequency without being limited to an integer multiple of the PCI clock frequency.

The IOC can be viewed as two separate controllers: one for DMA and the other for core requests. The DMA controller handles all peripheral-initiated DMA operations to the system memory; the core controller handles the loads and stores to either the PCI devices or the IOC internal registers.


Since the PCI bus uses a 32-bit address and the DECchip 21066 microprocessor uses a 34-bit address, a PCI address to which the IOC responds must be translated to an equivalent address in the microprocessor's address space. The IOC provides two types of address-translation mechanisms, direct and indirect (also called scatter-gather).

During a direct-mapped address translation, the lower bits of the PCI address are concatenated with a page address that is stored in an IOC register to form a 34-bit translated address. In an indirect-mapped address translation, certain bits of the PCI address are concatenated with a translated base address that is stored in an IOC register to form a 34-bit address. This address indexes into a scatter-gather map table in system memory where the page address resides. The page address is then concatenated with the lower bits of the PCI address to form the translated address. The IOC contains two programmable windows, each of which can be programmed to respond with direct or indirect address translation. To facilitate fast translation for the indirect address translation, the IOC contains an eight-entry, fully associative, translation look-aside buffer. To simplify the design, DMA burst length is limited to occur within an 8K-byte page boundary. Bursts that extend beyond a page boundary are broken into separate transfers.

CPU addresses are translated to an equivalent address in the PCI address space through one of two types of address translation, sparse or dense, depending on the target region of the address. For sparse-space access, the lower 32 bits of the CPU address are shifted right by 5 bits to generate a 27-bit address. This address, concatenated with a 5-bit field from a register in the IOC, maps to the 32-bit PCI address. Transfers of up to 8 bytes can be completed in this address space. For dense-space addressing, the lower 32 bits of the CPU address are directly mapped to the PCI address. Only unmasked operations are allowed, and up to 32 bytes of write data and 8 bytes of read data can be transferred in this address space. The IOC improves write bandwidth by buffering two 32-byte writes from the DECchip 21066 core.
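The translations described above reduce to shifts, masks, and concatenations. The C sketch below models all four flows; the 8K-byte page size, field widths, and register representations are simplified assumptions rather than the chip's documented bit assignments.

```c
#include <stdint.h>

#define PAGE_SHIFT 13u                       /* 8K-byte pages */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1u)

/* Direct-mapped DMA translation: low PCI address bits are concatenated
 * with a page address held in an IOC register. */
uint64_t dma_direct(uint32_t pci_addr, uint64_t ioc_page_reg)
{
    return (ioc_page_reg << PAGE_SHIFT) | (pci_addr & PAGE_MASK);
}

/* Scatter-gather DMA translation: PCI page-number bits index a map
 * table in system memory; the fetched page address is concatenated
 * with the low PCI address bits (the eight-entry TLB is omitted). */
uint64_t dma_scatter_gather(uint32_t pci_addr, const uint64_t *sg_table)
{
    uint32_t page = pci_addr >> PAGE_SHIFT;
    return (sg_table[page] << PAGE_SHIFT) | (pci_addr & PAGE_MASK);
}

/* Sparse-space CPU-to-PCI translation: the low 32 CPU address bits are
 * shifted right by 5 to give 27 bits, then concatenated with a 5-bit
 * field from an IOC register. */
uint32_t cpu_sparse(uint64_t cpu_addr, uint32_t ioc_field5)
{
    uint32_t low27 = (uint32_t)cpu_addr >> 5;
    return (ioc_field5 << 27) | low27;
}

/* Dense-space CPU-to-PCI translation: the low 32 bits map directly. */
uint32_t cpu_dense(uint64_t cpu_addr)
{
    return (uint32_t)cpu_addr;
}
```

The 8K-byte burst clipping falls out of the same page structure: a burst that would cross a page boundary must be split, since adjacent PCI pages need not map to adjacent physical pages.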
The priorities of the IOC design were to support peak PCI bandwidth and at the same time to meet the tight project schedule and limited die area constraint. A 32-byte data queue buffers DMA data. A larger queue would require more die area; a smaller queue might stall the DMA bursts due to the synchronization delay and system memory access latency. The IOC and memory controller are connected by a 64-bit data bus. Since the IOC PCI bus supports only 32 bits, two 32-bit data words are transferred at a time to minimize read-modify-write operations that would otherwise be issued by the memory controller. The IOC improves the DMA read bandwidth by prefetching up to 32 bytes of data.

A hardware semaphore provides the asynchronous handshaking between the core clock and PCI clock domains. Approximately two months were dedicated to the logic and physical implementation of the hardware semaphore and the manual verification of the asynchronous communication and signal timing. This effort was required because the existing computer-aided design (CAD) tools did not have a formal verification methodology that would sufficiently validate the asynchronous aspects of the IOC architectural design and the physical implementation.

Developing the test strategy was a challenge for two reasons: (1) the presence of asynchronous sections and (2) the need to guarantee that the semaphore would yield predictable results on a production tester operating at the maximum design frequency (166 MHz for the core and 33 MHz for the PCI bus). Even when the two clocks are running synchronously, they must be controlled and aligned such that the hardware semaphore guarantees reliable and predictable results. Each core clock phase is only 3 nanoseconds at 166 MHz. Taking into account the logic delay and setup time required by the hardware semaphore, this clock skew requirement is essentially impossible to achieve. The inaccuracy of the tester and the test hardware further complicates the problem. The solution adopted was to incorporate extra logic to reduce the sample rate of the hardware semaphore that runs on the core clock without actually slowing down the clock itself. With the extra logic, we were able to demonstrate the ability to test the IOC and the DECchip 21066 microprocessor at the targeted clock frequencies.

Logic Design Verification Process

A pseudorandom design exerciser is the main verification tool for the DECchip 21066 microprocessor. The exerciser takes input from Alpha AXP code streams and PCI bus commands. A template-based text manipulator generates test patterns from the templates written by verification engineers to target specific sections of the microprocessor design.


The text manipulator provides the primary source of randomness. It selects templates based on probabilities to create the test pattern. Each test pattern runs on a design model. We use two different design models, a register transfer level (RTL) model written in the C programming language and a structural model generated from the circuit schematics. Finally, the same test pattern runs on the instruction set processor (ISP) level reference model to determine the correctness of the design model.7 Figure 3 illustrates the logic design exerciser process flow.

[Flow chart omitted. The template-based text manipulator takes an Alpha AXP code stream and a control file containing PCI bus commands; its test patterns run on the register transfer level model, which emits an I/O stream containing I/O bus and event information and a state dump; the reference model consumes the same stimuli and emits its own state dump; a state dump comparison produces a differences file that contains mismatches, if any.]

Figure 3 Logic Design Exerciser Process

To determine if the design model operates correctly, the exerciser performs checks at three points: (1) during the execution of the design model, (2) during the execution of the reference model, and (3) after both models complete. Verification software in the design model performs additional checks for illegal conditions, such as multiple bus drivers, protocol violations, and invalid states. This code is kept to a minimum and covers design aspects not easily checked by comparing state information.

Previous verification projects have used similar random test methods; however, they did not have to ensure correct operation of asynchronous I/O transactions.7,8 To deal with this problem, the DECchip 21066 verification team enhanced the reference model to evaluate CPU and PCI transactions. The reference model receives the same stimuli as the design model plus additional I/O bus and event information. While executing the test pattern, the model verifies that the I/O information is consistent.

By periodically comparing functional states in a test pattern, we circumvented adding detailed timing information to the reference model. The model needs only valid state for comparison at synchronized capture points rather than at a clock phase boundary. In essence, we traded off being able to determine the timing correctness of our design for a simpler, faster reference model. We augmented the design model with verification code to check critical timings and performance features. Detailed static and circuit timing analysis was accomplished by other CAD tools.

Shared Memory

Several features of the DECchip 21066 microprocessor require the exercisers to employ a shared memory structure to properly test the chip.

• The microprocessor invalidates an internal data cache line on a DMA transfer to that cache line address.

• An exclusive PCI access unconditionally clears the processor's lock flag. A nonexclusive DMA write also clears the lock flag if the DMA address is within the locked 32-byte range.

• The DECchip 21066 secondary cache does not allocate on a DMA operation. Hence, the only way to test DMA sequences that hit in the secondary cache is to first access those addresses from the CPU.

The prime challenge to implementing a shared memory is that simultaneous accesses to the same address location by the CPU and DMA could result in unpredictable behavior. To circumvent this problem, we developed a model for software access to shared memory and added synchronization hooks into the reference model and the I/O trace file. The shared memory access model and synchronization hooks extended the exerciser to fully verify the DECchip 21066 shared memory capability.
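The probability-driven template selection attributed to the text manipulator amounts to a weighted random choice, as in the C sketch below; the weighting scheme shown is a plausible model, not the tool's actual implementation.

```c
#include <stdlib.h>

/* Pick a template index with probability proportional to its weight;
 * in this model, the weights are how verification engineers bias the
 * exerciser toward specific sections of the design. */
int pick_template(const double *weight, int n)
{
    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += weight[i];

    double r = total * rand() / ((double)RAND_MAX + 1.0);
    for (int i = 0; i < n; i++) {
        if (r < weight[i])
            return i;
        r -= weight[i];
    }
    return n - 1;   /* guard against floating-point rounding */
}
```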
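One way to picture the shared memory access model is as a reservation discipline that keeps the exerciser's CPU and DMA streams from racing on the same 32-byte block. The sketch below is a guess at the flavor of such a discipline and is not the team's actual mechanism.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSLOTS 256

/* One reservation slot per hashed 32-byte block; a test generator
 * must own the slot before issuing a CPU or DMA access to the block,
 * so the two agents never touch the same block simultaneously. */
static bool reserved[NSLOTS];

static unsigned slot(uint64_t addr) { return (addr >> 5) % NSLOTS; }

bool try_reserve(uint64_t addr)
{
    if (reserved[slot(addr)])
        return false;          /* the other agent owns it; retry later */
    reserved[slot(addr)] = true;
    return true;
}

void release(uint64_t addr)
{
    reserved[slot(addr)] = false;
}
```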


The shared memory implementation proved useful in discovering two bugs in the shared memory functionality. Both bugs were tricky to generate and would have been extremely difficult to find without the shared memory capabilities of a random exerciser.

The random verification methodology enabled us to meet the aggressive 10-month schedule. By separating functional correctness and timing correctness, we were able to quickly implement and stabilize the exerciser environment. The exerciser can easily be ported to future generations of the DECchip 21066 microprocessor; we have already used the tool to verify the second-pass design.

PLL Design Issues

To reduce the module-level cost of a system based on the DECchip 21066 microprocessor, the chip includes a frequency synthesis system based on a phase-locked loop design.9-18 The module frequency reference is a standard crystal oscillator, rather than an expensive, surface acoustic wave (SAW) oscillator. This design reduces radio frequency emissions, making qualification of the design easier and reducing system enclosure cost. Different speed binnings of the chip can run at maximum performance levels in the same basic module design. This result is possible because of the asynchronous, fixed clock rate of the PCI bus, the on-chip caches, and the programmable timing of the memory interface.

Incorporating a PLL subsystem in a digital CPU chip design requires that several problems be addressed. For example, increased noise sensitivity, which is a concern for digital designs, is even more critical for the analog circuits in a PLL. Often, simple guidelines for digital designs can ensure reasonable noise immunity; however, analog circuits frequently require detailed SPICE simulations that incorporate package- and chip-switching noise models.19 Designers must consider making changes to otherwise fully functional circuits to ensure noise immunity. These changes can range from appropriate decoupling of critical nodes with high-quality, on-chip capacitors to consideration of major circuit changes or additions. The overall solution must be some combination of the following three alternatives: (1) careful circuit design based on noise environment simulations, (2) decoupling that is sufficient to reduce noise to tolerable levels, and (3) a better package that reduces loop inductances to reduce switching noise on the internal power supplies.

Each of the three choices comes with an associated cost or risk. The cost benefits of having a PLL at the system level must be sufficient to justify the extra design time, chip area, and risk of excess jitter and low yields, which increase chip cost. On the first Alpha AXP microprocessor project, these factors were not critical because high-end systems can afford the extra module and system cost to reduce the risk associated with delivering the product to market. With the lower-cost targets of the DECchip 21066 project, eliminating expensive SAW oscillators in favor of less-expensive, standard crystal devices is critical and justifies the chip-level costs and risks associated with introducing a new analog subsystem into the chip design.

Figure 4 illustrates how the PLL is a closed-loop feedback system. An output is compared with an input to determine the degree of error. The error signal is then used to make adjustments within the elements of the system to reduce the error. This process occurs continually; under normal operating conditions, the error approaches zero. The major components of the PLL are

• A phase frequency detector and a charge pump. The phase frequency detector inputs the reference and feedback signals and measures the phase error. The charge pump outputs a current proportional to this error, using digital pulse-width modulation to achieve an analog signal.

• A filter. A simple, low-pass, resistor-capacitor (RC) filter averages the error signal, removing the digital noise of the pulse-width modulation used by the charge pump and setting the response rate of the PLL while ensuring loop stability. The filter is composed of an on-chip, 100-ohm resistor and the external capacitor.

• A voltage-controlled oscillator (VCO). The VCO consists of a voltage-to-current buffer and a current-controlled oscillator. The output frequency of the VCO is proportional to the input voltage.

• A feedback frequency divider (K_NDIV). In the linear control theory model of the PLL, the feedback signal is the oscillator phase attenuated by the gain 1/N, where N equals the number of clock pulses input for every output pulse. This produces an oscillator phase output that is N times the input, which is the key to synthesizing a higher-frequency clock from a lower-frequency reference. A digital divider reduces the frequency of the VCO, thus reducing its phase as well. In this manner, a digital divide by N performs an analog attenuate-by-N function.
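Combining the lock condition f_VCO = N x f_ref with the divider values shown in Figure 4 (K_NDIV of 10 through 18 and K_ODIV of 2, 4, or 6, as read from the figure) gives the synthesized core frequencies directly. The C sketch below simply enumerates the combinations.

```c
#include <stdio.h>

int main(void)
{
    const double f_ref = 33.0;                    /* MHz, PCI reference */
    const int k_ndiv[] = { 10, 12, 14, 16, 18 };  /* feedback divider N */
    const int k_odiv[] = { 2, 4, 6 };             /* output divider     */

    /* In lock, f_vco = N * f_ref and f_core = f_vco / K_ODIV.
     * N = 10 with K_ODIV = 2 is the multiply-by-five case in the text:
     * a nominal 33-MHz reference yields the 166-MHz core clock. */
    for (int i = 0; i < 5; i++)
        for (int j = 0; j < 3; j++)
            printf("N=%2d ODIV=%d: vco %5.0f MHz, core %6.1f MHz\n",
                   k_ndiv[i], k_odiv[j], k_ndiv[i] * f_ref,
                   k_ndiv[i] * f_ref / k_odiv[j]);
    return 0;
}
```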


[Block diagram omitted. The phase frequency detector, charge pump, filter, and voltage-controlled oscillator form the forward path; the 333-MHz VCO output passes through the frequency divider K_ODIV (divide by 2, 4, or 6) to produce the 166-MHz core clock and through the feedback frequency divider K_NDIV (divide by 10, 12, 14, 16, or 18) to produce the 33-MHz feedback signal; a power regulator derives the PLL supply voltage from the module power supply.]

Figure 4 Overview of the DECchip 21066 Phase-locked Loop Frequency Synthesis System

Although not strictly part of the PLL itself, the following two design elements are major components of the frequency synthesis system:

• Power regulators. To reduce clock jitter, the PLL on-chip power is regulated using the noisy module 5-V power input. The use of power regulators yields a power source with less AC noise than the global 3.3-V power used by the digital logic and I/O pads.

• A simple, programmable frequency divider (K_ODIV). This divider reduces the VCO frequency while ensuring a 50 percent duty cycle. The K_ODIV output is then buffered adequately to drive the global clock node.

The RC filter and VCO create a textbook second-order feedback loop. This model uses linear control theory methods to analyze loop stability. The on-chip resistor is a key element in loop stability because it creates the transfer function zero that ultimately ensures stability. The off-chip capacitor, however, subjects the PLL to chip-switching noise coupled between the signal etches of the package. To minimize the effects of this AC noise on clock jitter, the design includes additional decoupling capacitors, such as the one shown in Figure 5. The capacitors add additional high-frequency poles. These poles are placed well above the PLL bandwidth, and are thus largely ignored when analyzing stability, but well below the package resonance frequencies that would cause significant jitter.
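For a feel of the numbers, the stabilizing zero of the loop sits at the corner frequency of the series RC in the filter. The paper gives only the 100-ohm on-chip resistor, so the capacitor value below is an assumed figure for illustration:

$$f_z = \frac{1}{2\pi R C} = \frac{1}{2\pi \times 100\ \Omega \times 0.1\ \mu\text{F}} \approx 16\ \text{kHz}$$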

The implementation of the oscillator can take many different forms. The PLL design described in this paper uses a ring oscillator based on simple differential buffers. The reasons for this design decision are as follows:

• Power supply current is nominally constant because the summation of current into each buffer in the ring smoothes the total.

• Differential buffers used in the ring offer potential power supply rejection ratio (PSRR) benefits.

• Multiple phase outputs are possible. Although the DECchip 21066 microprocessor does not use multiple phase outputs, other applications of this basic oscillator design have exploited the feature.

• Designers' experience with this oscillator implementation minimizes the risk, as compared to possible multivibrator designs.

The current design provides only for setting the clock frequency synthesis ratio at power up. Future designs may include the ability to control the synthesis ratio through software. Such control can provide a means of lowering chip power levels as well as more flexibility over using different clock rates in the same basic module design.


The off-chip capacitor in the low-pass loop filter does have an associated cost. Package coupling introduces noise on the control voltage, so placement of this capacitor is important. Future designs will likely use on-chip filters to reduce this constraint.

[Circuit sketch omitted. A decoupling capacitor is placed from the PLL 3-V supply (PLL_3V) to the mirrored DC signal node of an analog current mirror.]

Figure 5 Use of a Decoupling Capacitor to Reduce Noise in an Analog Current Mirror

The power regulators are effective, but using 5 V will not be possible as CMOS technologies move to shorter channel lengths. The trade-offs between on-chip regulation and different oscillator buffer designs that have better jitter characteristics for noisy power levels are yet to be investigated.

Package Development

The DECchip 21066 microprocessor is the first CMOS microprocessor with a cost-focused package. Designers followed tight cost constraints to allow the package to be used on follow-on chip variants for the embedded microprocessor market. With a goal of 200-MHz operation, significant electrical analysis was necessary to ensure adequate package performance.
We learned from previous designs that the main concern regarding package design is ensuring the integrity of the on-chip supply voltage. For a given chip design and performance goal (which determines power dissipation, specifically, the chip's supply current characteristics), primarily one package characteristic and one chip characteristic influence on-chip supply integrity. These characteristics are package supply loop inductance (L), which is the intrinsic parasitic inductance of the supply path through the chip package, and on-chip supply decoupling capacitance (Cd). The on-chip supply integrity is generally proportional to the square root of Cd/L.

The supply loop inductance affects on-chip supply integrity because variations in the chip's supply current generate inductive noise voltage on the on-chip supply voltage. The magnitude of supply current variations is dependent on chip clock frequency. At 200 MHz, variations in supply current can be as much as 3 amperes (A). If not reduced by some method, the inductive noise generated across the supply loop inductance would be excessive.

Three methods are typically used in microprocessor chip and package designs to decrease supply loop inductance: (1) increase the number of package supply pins, (2) increase the number of internal package layers (sometimes called planes), and (3) include on-package decoupling capacitors to shunt the supply loop inductance component of the package pins.

Given DECchip 21066 package cost constraints, designers ruled out several package features used in previous designs, notably the DECchip 21064 design. For example, on-package decoupling capacitors were considered too expensive, since they would have added approximately 50 percent to the package cost. To limit the package size, the number of available package pins was constrained to 287. Specifically, the number of package supply pins was reduced to 66 percent of the number implemented on the DECchip 21064 microprocessor. The number of internal package supply planes was reduced from four to two. Looser tolerances on internal package geometry allow the DECchip 21066 packages to be manufactured in the package vendors' less-costly process lines; however, less-than-optimal supply routing results.

Given the constraints and the lack of complete package simulation models, we developed a method that combined simulation and empirical analysis to ensure adequate on-chip supply integrity. The expected package supply loop inductance was determined through a two-step process:

1. We modified the packages of a set of DECchip 21064 parts (by removing the on-package decoupling capacitor and reducing the number of supply pins) to closely match the proposed package configuration for the target chip. We determined the $L_1$ of the degraded DECchip 21064 packages from the following relationship derived from the classical formula for the resonance frequency of an inductive and capacitive tank circuit:20


$$L_1 = \frac{1}{4\pi^2 F_r^2 C_d}$$

where $F_r$ is the measured supply resonance frequency of the DECchip 21064 supply network, and $C_d$ is the known on-chip supply decoupling capacitance for the DECchip 21064 microprocessor.

2. We used simulation to determine the effect on the supply loop inductance of eliminating expensive internal package features.

The resulting expected supply loop inductance for the DECchip 21066 packages was factored into the following relationship, which must be satisfied for the on-chip supply integrity (on-chip supply noise voltage) of the DECchip 21066 microprocessor to be equivalent to that of the DECchip 21064 device:

$$\frac{\sqrt{C_{d\_21066}\,/\,L_{21066}}}{S_{21066}} = Q_{21064}$$

where $Q_{21064} = \sqrt{C_{d\_21064}\,/\,L_{21064}}\,/\,S_{21064}$ is the quality factor of the DECchip 21064 microprocessor on-chip supply, and $S$ is the maximum difference in chip supply current ($I_S$). For example, the DECchip 21064 microprocessor operates with a minimum $I_S$ of 6 A and a maximum $I_S$ of 9 A (both at 200 MHz). Therefore, $S_{21064}$ is equal to 3 A.

The amount of on-chip decoupling capacitance specified by this relationship was impractical to implement in the DECchip 21066 design. Because achieving the on-chip supply integrity of the DECchip 21064 design was not possible, we investigated the extent of supply integrity degradation that could accompany a reliably functioning chip. We learned from system tests that DECchip 21064 chips with a quality factor as low as one-third the quality factor of original DECchip 21064 chips function correctly up to a chip frequency of 135 MHz. Because the system testing was not exhaustive and because this testing could not be performed above 135 MHz, we wanted to further increase our quality factor. We added on-chip decoupling capacitance in areas not otherwise utilized and implemented package features that would minimize the supply loop inductance. Consequently, the DECchip 21066 microprocessor ultimately achieved a quality factor equal to 65 percent that of the DECchip 21064 microprocessor.

As a result of system testing the DECchip 21064 at reduced quality factors, $Q_{21066}$ equal to 65 percent of $Q_{21064}$ was deemed adequate. The recent opportunity to test and measure DECchip 21066 parts validated the decision to reduce the quality factor to this level.
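A worked application of the two relationships above, in C, using assumed measurement values since the project's actual figures are not given in the paper:

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double pi = 3.14159265358979;

    /* Step 1: infer the package supply loop inductance from the
     * measured resonance of the degraded 21064 packages.  Both input
     * values are assumed for illustration. */
    double f_r = 250e6;   /* Hz, assumed supply resonance frequency     */
    double c_d = 10e-9;   /* F,  assumed on-chip decoupling capacitance */
    double l_1 = 1.0 / (4.0 * pi * pi * f_r * f_r * c_d);

    /* Quality factor Q = sqrt(Cd / L) / S, where S is the maximum
     * swing in supply current (3 A for the 21064: 9 A max - 6 A min). */
    double s = 3.0;
    double q = sqrt(c_d / l_1) / s;

    printf("L1 = %.3g H, Q = %.3g\n", l_1, q);
    return 0;
}
```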

Summary

This paper describes the DECchip 21066 microprocessor, the first cost-focused chip to implement the Alpha AXP architecture. The project goal to achieve best-in-class system performance for cost-focused applications was realized by integrating a high-performance CPU, a high-bandwidth memory controller, a phase-locked loop, and a low-cost package, and by being the first microprocessor in the industry that connects directly to the PCI bus. A pseudorandom verification strategy that leveraged an architectural reference with an I/O stream was a key decision that helped achieve first-pass silicon success on a compressed 10-month schedule.

Acknowledgments

The authors would like to thank the entire DECchip 21066 team. The efforts of all those involved in the architecture, the design, the implementation, the logic verification, and the layout made it possible to bring this project to fruition.

References and Note

1. R. Sites, ed., Alpha Architecture Reference Manual (Burlington, MA: Digital Press, 1992).

2. Peripheral Component Interconnect (PCI) (Hillsboro, OR: PCI Special Interest Group, rev. 2.0, April 30, 1993).

3. D. Dobberpuhl et al., "A 200-MHz 64-bit Dual-issue CMOS Microprocessor," Digital Technical Journal, vol. 4, no. 4 (Special Issue 1992): 35-50.

4. B. Zetterlund, J. Farrell, and T. Fox, "Microprocessor Performance and Process Complexity in CMOS Technologies," Digital Technical Journal, vol. 4, no. 2 (Spring 1992): 12-24.

5. Intel Pentium Processor Data Book, vol. (Santa Clara, CA: Intel Corporation, 1993).

6. Intel486 DX Microprocessor Data Book (Santa Clara, CA: Intel Corporation, 1991).

7. G. Darcy III et al., "Using Simulation to Develop and Port Software," Digital Technical Journal, vol. 4, no. 4 (Special Issue 1992): 181-192.


8. W. Anderson, "Logical Verification of the NVAX Chip Design," Digital Technical Journal, vol. 4, no. 3 (Summer 1992): 38-46.

9. D. Su et al., "Experimental Results and Modeling Techniques for Substrate Noise in Mixed-Signal Integrated Circuits," IEEE Journal of Solid State Circuits, vol. 28, no. 4 (April 1993): 420-430.

10. F. Gardner, Phaselock Techniques, 2d ed. (New York: John Wiley & Sons, 1979).

11. R. Best, Phase-Locked Loops (New York: McGraw-Hill, 1984).

12. P. Gray and R. Meyer, Analysis and Design of Analog Integrated Circuits (New York: John Wiley & Sons, 1984).

13. M. Wakayama and A. Abidi, "A 30-MHz Low-Jitter High Linearity CMOS Voltage-Controlled Oscillator," IEEE Journal of Solid State Circuits, vol. SC-22, no. 6 (December 1987): 1074-1081.

14. K. Kurita et al., "PLL-Based BiCMOS On-Chip Clock Generation for Very High-Speed Microprocessors," IEEE Journal of Solid State Circuits, vol. 26, no. 4 (April 1991): 585-589.

15. S. Miyazawa et al., "A BiCMOS PLL-Based Data Separator Circuit with High Stability and Accuracy," IEEE Journal of Solid State Circuits, vol. 26, no. 2 (February 1991): 116-121.

16. I. Young, J. Greason, and K. Wong, "A PLL Clock Generator with 5 to 110 MHz of Lock Range for Microprocessors," IEEE Journal of Solid State Circuits, vol. 27, no. 11 (November 1992): 1599-1607.

17. N. Nguyen and R. Meyer, "Start-up and Frequency Stability in High-Frequency Oscillators," IEEE Journal of Solid State Circuits, vol. 27, no. 5 (May 1992): 810-820.

18. J. Melsa and D. Schultz, Linear Control Systems (New York: McGraw-Hill, 1969).

19. SPICE is a general-purpose circuit simulator program developed by Lawrence Nagel and Ellis Cohen of the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley.

20. W. Hayt et al., Engineering Circuit Analysis (New York: McGraw-Hill, 1978).

Referees, January 1993 to February 1994

The editors acknowledge and thank the referees who have participated in a peer review of the papers submitted for publication in the Digital Technical Journal. The referees' detailed reports have helped ensure that papers published in the Journal offer relevant and informative discussions of computer technologies and products. The referees are computer science and engineering professionals from academia and industry, including Digital's consulting engineers.

Alan Abrahams, Digital
Glenn Adams, Metis Technology, Inc.
Takao (Tim) Aihara, Digital
Dimitri A. Antoniadis, MIT
Guillermo Arango, Schlumberger Laboratory for Computer Science
Robert M. Ayers, Adobe Systems, Inc.
Wael Bahaa-el-Din, Digital
David Birdsall, Tandem Computers Inc.
Jorge L. Boria, Schlumberger Laboratory for Computer Science
Joseph Bosurgi, Univel
V. Michael Bove, Jr., MIT Media Lab
Mark Bramhall, Digital
Ronald Brender, Digital
Michael Brey, Digital
David Carver, MIT Laboratory for Computer Science
Chungmin (Melvin) Chen, University of Maryland
Lou Cohen
Andrew M. Daniels, Xerox Corporation
Scott H. Davis, Digital
Alexis Delis, Queensland University of Technology
Kevin P. Donnelly, Gaelic Medium School
Bob Fleischer, Digital
John Forecast, Digital
Frank Fox, Digital
Asmus Freytag, Microsoft Corporation
Ophir Frieder, George Mason University
Denis Garneau
Peter Hayden, Digital
James Horning, Digital
Hai Huang, Digital
Vania Joloboff, Open Software Foundation, Inc.
Capers Jones, Software Productivity Research, Inc.
Larry Jones, Digital
Ashok Joshi, Digital
Nancy Kronenberg, Digital
Alberto Leon-Garcia, University of Toronto
Ian Leslie, University of Cambridge
Thomas Levergood, Digital
Peter Lockemann, Karlsruhe University
David Marca, Digital
William McKeeman, Digital
Gary W. Miller, International Business Machines Corporation
James H. Morris, Carnegie-Mellon University
Andre Nasr, Digital
Stephen Neupauer, Digital
P. J. Plauger
George C. Polyzos, University of California
Edward Prentice, Digital
Roger S. Pressman, R. S. Pressman & Associates, Inc.
Ron Radice, Carnegie-Mellon University
Sridhar Raghavan, Digital
Yoav Raz, Digital
Rahul Razdan, Digital
Dave Redell, Digital
Frank Rojas, International Business Machines Corporation
Robert C. Rose, Digital
Chris Schmandt, MIT Media Lab
Peter Spiro, Digital
David Staelin, MIT
Larry Stewart, Digital
Mani Subramanyam, Ingres Corporation
Kouichi Suzuki, Nippon Telegraph and Telephone Corporation
Dave Taylor, SunWorld
David L. Tennenhouse, MIT Laboratory for Computer Science
Charles P. Thacker, Digital
David W. Thiel, Digital
Win Treese, Digital
David Wecker, Digital
Andy Wilson, Digital
John Wroclawski, MIT Laboratory for Computer Science
Wai Yim, Digital
